Unable to submit Spark job from Windows IDE to Linux cluster

I just read about findspark and found it quite interesting, as so far I have only used spark-submit, which isn't well suited to interactive development in an IDE. I tried executing this file on Windows 10, Anaconda 4.4.0, Python 3.6.1, IPython 5.3.0, Spyder 3.1.4, Spark 2.1.1:

def inc(i):
    return i + 1

import findspark
findspark.init()

import pyspark

sc = pyspark.SparkContext(master='local', appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))

Spyder generates the command runfile('C:/tests/temp1.py', wdir='C:/tests') and it prints out [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] as expected. However, if I try to use a Spark cluster running on Ubuntu instead, I get an error:

def inc(i):
    return i + 1

import findspark
findspark.init()

import pyspark

sc = pyspark.SparkContext(master='spark://192.168.1.57:7077', appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))

IPython errors:

Traceback (most recent call last):
  File "<ipython-input-1-820bd4275b8c>", line 1, in <module>
    runfile('C:/tests/temp.py', wdir='C:/tests')
  File "C:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)
  File "C:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/tests/temp.py", line 11, in <module>
    print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))
  File "C:\projects\spark-2.1.1-bin-hadoop2.7\python\pyspark\rdd.py", line 808, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\projects\spark-2.1.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\projects\spark-2.1.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling

Worker stderr:

ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "C:\Anaconda3\pythonw.exe": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

For some reason this is trying to use a Windows binary path on the Linux worker. Any ideas how to overcome this? I get the same outcome with the Python console in Spyder, except there the error is Cannot run program "C:\Anaconda3\python.exe": error=2, No such file or directory. It actually happens from the command line as well, when running python temp.py.

This version works fine even when submitted from Windows to Linux:

def inc(i):
    return i + 1

import pyspark

sc = pyspark.SparkContext(appName='test2')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))

spark-submit --master spark://192.168.1.57:7077 temp2.py

Answer

I found the solution, and it turned out to be very simple. pyspark/context.py uses the PYSPARK_PYTHON environment variable to determine the Python executable's path, and when the variable is unset it falls back to plain python, which is the "correct" default here. However, by default findspark.init() sets this variable to sys.executable, i.e. the driver's own interpreter path, which clearly won't work cross-platform.
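Since pyspark only reads PYSPARK_PYTHON when the context launches its worker processes, the same fix can also be applied by hand, without findspark's python_path argument. This is a minimal sketch of that variant (my own, not part of the original answer), assuming you overwrite the variable after findspark.init() has exported its default and before the SparkContext is created:

import os

import findspark
findspark.init()

# findspark has just exported PYSPARK_PYTHON=sys.executable (a Windows
# path); replace it with a bare 'python' so each Linux worker resolves
# the interpreter against its own PATH instead.
os.environ['PYSPARK_PYTHON'] = 'python'

import pyspark

sc = pyspark.SparkContext(master='spark://192.168.1.57:7077', appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(lambda i: i + 1).collect()))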

Anyway, here is the working code for future reference:

def inc(i):
    return i + 1

import findspark
findspark.init(python_path='python')  # <-- so simple!

import pyspark

sc = pyspark.SparkContext(master='spark://192.168.1.57:7077', appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))
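The reason a bare 'python' works is that it is not resolved on the driver at all: each worker looks it up on its own PATH when it spawns the Python process, so the driver's Windows-specific interpreter path never reaches the Linux executors.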