Unable to submit Spark job from Windows IDE to Linux cluster

I just read about findspark and found it quite interesting, as so far I have only used spark-submit, which isn't well suited to interactive development in an IDE. I tried executing this file on Windows 10, Anaconda 4.4.0, Python 3.6.1, IPython 5.3.0, Spyder 3.1.4, Spark 2.1.1:

def inc(i):
    return i + 1

import findspark
findspark.init()

import pyspark

sc = pyspark.SparkContext(master='local', appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))

Spyder generates the command runfile('C:/tests/temp1.py', wdir='C:/tests') and it prints out [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] as expected. However, if I try to use a Spark cluster running on Ubuntu instead, I get an error:

def inc(i):
    return i + 1

import findspark
findspark.init()

import pyspark

sc = pyspark.SparkContext(master='spark://192.168.1.57:7077', appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))

IPython errors:

Traceback (most recent call last):
  File "<ipython-input-1-820bd4275b8c>", line 1, in <module>
    runfile('C:/tests/temp.py', wdir='C:/tests')
  File "C:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)
  File "C:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/tests/temp.py", line 11, in <module>
    print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))
  File "C:\projects\spark-2.1.1-bin-hadoop2.7\python\pyspark\rdd.py", line 808, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\projects\spark-2.1.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\projects\spark-2.1.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling

Worker stderr:

ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "C:\Anaconda3\pythonw.exe": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

For some reason this is trying to use a Windows binary path on the Linux worker. Any ideas how to overcome this? I get the same outcome with the Python console in Spyder, except there the error is Cannot run program "C:\Anaconda3\python.exe": error=2, No such file or directory. It actually happens from the command line as well, when running python temp.py.

This version works fine even when submitted from Windows to Linux:

def inc(i):
    return i + 1

import pyspark

sc = pyspark.SparkContext(appName='test2')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))

spark-submit --master spark://192.168.1.57:7077 temp2.py

Answer

I found the solution, and it turned out to be very simple. pyspark/context.py uses the PYSPARK_PYTHON environment variable to determine the Python executable's path, and when the variable is unset it falls back to plain python, which is the "correct" default here. However, by default findspark.init() sets this variable to sys.executable, i.e. the driver's own interpreter path, which clearly won't work cross-platform.
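Since pyspark only reads PYSPARK_PYTHON when the context launches its worker processes, the same fix can also be applied by hand, without findspark's python_path argument. This is a minimal sketch of that variant (my own, not part of the original answer), assuming you overwrite the variable after findspark.init() has exported its default and before the SparkContext is created:

import os

import findspark
findspark.init()

# findspark has just exported PYSPARK_PYTHON=sys.executable (a Windows
# path); replace it with a bare 'python' so each Linux worker resolves
# the interpreter against its own PATH instead.
os.environ['PYSPARK_PYTHON'] = 'python'

import pyspark

sc = pyspark.SparkContext(master='spark://192.168.1.57:7077', appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(lambda i: i + 1).collect()))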

Anyway, here is the working code for future reference:

def inc(i):
    return i + 1

import findspark
findspark.init(python_path='python')  # <-- so simple!

import pyspark

sc = pyspark.SparkContext(master='spark://192.168.1.57:7077', appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))
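The reason a bare 'python' works is that it is not resolved on the driver at all: each worker looks it up on its own PATH when it spawns the Python process, so the driver's Windows-specific interpreter path never reaches the Linux executors.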