Pyspark module not found

2024/11/13 12:57:04

I'm trying to execute a simple PySpark job on YARN. This is the code:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("HDFS Filter")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)
inputFile = sc.textFile("hdfs://myserver:9000/1436304078054.json.gz").cache()
matchTerm = "spark"
numMatches = inputFile.filter(lambda line: matchTerm in line).count()
print(numMatches, "lines contain", matchTerm)

I don't know whether the code will work, but that is not the point. The problem is that when I run it with the command ./bin/pyspark ../job.py from inside the Spark directory, I get the following error (just a small part of the whole output):

15/09/01 17:57:02 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on hadoop-05:44841 (size: 3.8 KB, free: 534.5 MB)
15/09/01 17:57:02 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hadoop-05): org.apache.spark.SparkException: 
Error from python worker:/usr/bin/python2.7: No module named pyspark
PYTHONPATH was:/usr/local/hadoop_store/tmp/nm-local-dir/usercache/hduser/filecache/16/spark-assembly-1.4.1-hadoop2.2.0.jar
java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:392)
	at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
	at org.apache.spark.scheduler.Task.run(Task.scala:70)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
15/09/01 17:57:02 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, hadoop-03, RACK_LOCAL, 1475 bytes)
15/09/01 17:57:04 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on hadoop-03:33268 (size: 3.8 KB, free: 534.5 MB)
15/09/01 17:57:05 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1, hadoop-03): org.apache.spark.SparkException: 
Error from python worker:/usr/bin/python2.7: No module named pyspark
PYTHONPATH was:/usr/local/hadoop_store/tmp/nm-local-dir/usercache/hduser/filecache/21/spark-assembly-1.4.1-hadoop2.2.0.jar
java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:392)
	at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
	at org.apache.spark.scheduler.Task.run(Task.scala:70)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
15/09/01 17:57:05 INFO scheduler.TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2, hadoop-05, RACK_LOCAL, 1475 bytes)
15/09/01 17:57:05 INFO scheduler.TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on executor hadoop-05: org.apache.spark.SparkException (
Error from python worker:/usr/bin/python2.7: No module named pyspark
PYTHONPATH was:/usr/local/hadoop_store/tmp/nm-local-dir/usercache/hduser/filecache/16/spark-assembly-1.4.1-hadoop2.2.0.jar
java.io.EOFException) [duplicate 1]
15/09/01 17:57:05 INFO scheduler.TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3, hadoop-05, RACK_LOCAL, 1475 bytes)
15/09/01 17:57:05 INFO scheduler.TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on executor hadoop-05: org.apache.spark.SparkException (
Error from python worker:/usr/bin/python2.7: No module named pyspark
PYTHONPATH was:/usr/local/hadoop_store/tmp/nm-local-dir/usercache/hduser/filecache/16/spark-assembly-1.4.1-hadoop2.2.0.jar
java.io.EOFException) [duplicate 2]
15/09/01 17:57:05 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
15/09/01 17:57:05 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 
15/09/01 17:57:05 INFO cluster.YarnScheduler: Cancelling stage 0
15/09/01 17:57:05 INFO scheduler.DAGScheduler: ResultStage 0 (count at /home/hduser/spark-1.4.1-bin-without-hadoop/../test.py:11) failed in 5.093 s
15/09/01 17:57:05 INFO scheduler.DAGScheduler: Job 0 failed: count at /home/hduser/spark-1.4.1-bin-without-hadoop/../test.py:11, took 5.238381 s
Traceback (most recent call last):
  File "/home/hduser/spark-1.4.1-bin-without-hadoop/../test.py", line 11, in <module>
    numMatches = inputFile.filter(lambda line: matchTerm in line).count()
  File "/home/hduser/spark-1.4.1-bin-without-hadoop/python/lib/pyspark.zip/pyspark/rdd.py", line 984, in count
  File "/home/hduser/spark-1.4.1-bin-without-hadoop/python/lib/pyspark.zip/pyspark/rdd.py", line 975, in sum
  File "/home/hduser/spark-1.4.1-bin-without-hadoop/python/lib/pyspark.zip/pyspark/rdd.py", line 852, in fold
  File "/home/hduser/spark-1.4.1-bin-without-hadoop/python/lib/pyspark.zip/pyspark/rdd.py", line 757, in collect
  File "/home/hduser/spark-1.4.1-bin-without-hadoop/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/hduser/spark-1.4.1-bin-without-hadoop/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, hadoop-05): org.apache.spark.SparkException: 
Error from python worker:/usr/bin/python2.7: No module named pyspark
PYTHONPATH was:/usr/local/hadoop_store/tmp/nm-local-dir/usercache/hduser/filecache/16/spark-assembly-1.4.1-hadoop2.2.0.jar
java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:392)
	at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
	at org.apache.spark.scheduler.Task.run(Task.scala:70)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/09/01 17:57:06 INFO spark.SparkContext: Invoking stop() from shutdown hook

Finally, this is my spark-env.sh configuration file:

export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

Any idea what I'm doing wrong?

Answer

What fixed this for me was including a couple of extra settings in the SparkConf, which seem to make sure the workers get access to the PySpark and Py4J modules:

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("HDFS Filter")
        .set("spark.executor.memory", "1g")
        .set("spark.yarn.dist.files",
             "file:/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip,"
             "file:/usr/hdp/2.3.2.0-2950/spark/python/lib/py4j-0.8.2.1-src.zip")
        .setExecutorEnv("PYTHONPATH", "pyspark.zip:py4j-0.8.2.1-src.zip"))

You'll need to edit the paths as appropriate for your system.
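
For completeness, here is a sketch of the whole job from the question with those two settings folded in. The /usr/hdp/2.3.2.0-2950/... archive locations are simply where the zips live on my HDP install, and the HDFS URL is the one from the question, so treat both as placeholders to adjust for your cluster:

from pyspark import SparkConf, SparkContext

# Locations of the PySpark and Py4J archives on the submitting machine
# (these paths are from my HDP 2.3 install -- change them for your setup).
PYSPARK_ZIP = "/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip"
PY4J_ZIP = "/usr/hdp/2.3.2.0-2950/spark/python/lib/py4j-0.8.2.1-src.zip"

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("HDFS Filter")
        .set("spark.executor.memory", "1g")
        # Ship the two archives to every YARN container...
        .set("spark.yarn.dist.files",
             "file:" + PYSPARK_ZIP + ",file:" + PY4J_ZIP)
        # ...and put them on the executors' PYTHONPATH so the Python
        # workers can import the pyspark module.
        .setExecutorEnv("PYTHONPATH", "pyspark.zip:py4j-0.8.2.1-src.zip"))

sc = SparkContext(conf=conf)

# Same job as in the question: count lines containing the search term.
inputFile = sc.textFile("hdfs://myserver:9000/1436304078054.json.gz").cache()
matchTerm = "spark"
numMatches = inputFile.filter(lambda line: matchTerm in line).count()
print(numMatches, "lines contain", matchTerm)

sc.stop()

With the archives distributed and on the workers' PYTHONPATH, launching the script the same way as before (./bin/pyspark job.py) should let the executors start their Python workers without the "No module named pyspark" error.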
