I'm trying out the MongoDB Hadoop integration with Spark, but I can't figure out how to make the jars accessible to an IPython notebook.
Here's what I'm trying to do:
# set up parameters for reading from MongoDB via Hadoop input format
config = {"mongo.input.uri": "mongodb://localhost:27017/db.collection"}
inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"
# these values worked but others might as well
keyClassName = "org.apache.hadoop.io.Text"
valueClassName = "org.apache.hadoop.io.MapWritable"
# Do some reading from mongo
items = sc.newAPIHadoopRDD(inputFormatClassName, keyClassName, valueClassName, None, None, config)
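Afterwards I just do a quick sanity check on the RDD, something like this (the exact call isn't important, it's only to confirm the read works):
# pull one (key, value) pair back from MongoDB to confirm the read works
print(items.first())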
This code works fine when I launch it in pyspark using the following command:
spark-1.4.1/bin/pyspark --jars 'mongo-hadoop-core-1.4.0.jar,mongo-java-driver-3.0.2.jar'
where mongo-hadoop-core-1.4.0.jar and mongo-java-driver-2.10.1.jar allow using MongoDB from Java. However, when I do this:
IPYTHON_OPTS="notebook" spark-1.4.1/bin/pyspark --jars 'mongo-hadoop-core-1.4.0.jar,mongo-java-driver-3.0.2.jar'
the jars are no longer available and I get the following error:
java.lang.ClassNotFoundException: com.mongodb.hadoop.MongoInputFormat
Does anyone know how to make jars available to Spark in the IPython notebook? I'm pretty sure this is not specific to Mongo, so maybe someone has already succeeded in adding jars to the classpath while using the notebook?
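In case it's relevant, I also wondered whether I should build the SparkContext myself inside the notebook and point it at the jars through the config, roughly like the sketch below. The spark.jars / spark.driver.extraClassPath settings are just my guess at what's relevant, and pyspark already creates sc for me, so I'm not sure this is the right direction:
from pyspark import SparkConf, SparkContext

# guessing at the relevant settings; these are the same jars passed to --jars above
conf = (SparkConf()
        .setAppName("mongo-notebook-test")
        .set("spark.jars", "mongo-hadoop-core-1.4.0.jar,mongo-java-driver-3.0.2.jar")
        .set("spark.driver.extraClassPath", "mongo-hadoop-core-1.4.0.jar:mongo-java-driver-3.0.2.jar"))
sc = SparkContext(conf=conf)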