Azure Databricks execution error while parallelizing over a pandas DataFrame. The code creates the RDD successfully but breaks when .collect() is called.
Setup:
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]

# Create the pandas DataFrame
my_df = pd.DataFrame(data, columns = ['Name', 'Age'])

def testfn(i):
    return my_df.iloc[i]
test_var=sc.parallelize([0,1,2],50).map(testfn).collect()
print (test_var)
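For reference, the same row lookup runs fine when done purely on the driver, without Spark (a minimal local sanity check of the setup above):

# driver-side only, no Spark involved: returns the three rows as pandas Series
local_rows = [my_df.iloc[i] for i in [0, 1, 2]]
print(local_rows)

It is only the distributed version via sc.parallelize(...).map(testfn).collect() that fails.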
Error:
Py4JJavaError Traceback (most recent call last)
<command-2941072546245585> in <module>
      1 def testfn(i):
      2     return my_df.iloc[i]
----> 3 test_var=sc.parallelize([0,1,2],50).map(testfn).collect()
      4 print (test_var)

/databricks/spark/python/pyspark/rdd.py in collect(self)
    901         # Default path used in OSS Spark / for non-credential passthrough clusters:
    902         with SCCallSiteSync(self.context) as css:
--> 903             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    904         return list(_load_from_socket(sock_info, self._jrdd_deserializer))
    905 

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    125     def deco(*a, **kw):
    126         try:
--> 127             return f(*a, **kw)
    128         except py4j.protocol.Py4JJavaError as e:
    129             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 3845.0 failed 4 times, most recent failure: Lost task 16.3 in stage 3845.0 : org.apache.spark.api.python.PythonException: 'AttributeError: 'DataFrame' object has no attribute '_data'', from <command-2941072546245585>, line 2. Full traceback below:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 654, in main
    process()
  File "/databricks/spark/python/pyspark/worker.py", line 646, in process
    serializer.dump_stream(out_iter, outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 279, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/databricks/spark/python/pyspark/util.py", line 109, in wrapper
    return f(*args, **kwargs)
  File "<command-2941072546245585>", line 2, in testfn
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 1767, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 2137, in _getitem_axis
    self._validate_integer(key, axis)
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 2060, in _validate_integer
    len_axis = len(self.obj._get_axis(axis))
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 424, in _get_axis
    return getattr(self, name)
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
    return object.__getattribute__(self, name)
  File "pandas/_libs/properties.pyx", line 63, in pandas._libs.properties.AxisProperty.__get__
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute '_data'
Version details:
spark: 3.0.0
python: 3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]
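The pandas version is not listed above, and the failure happens inside pandas code running on the executor. A small sketch (assuming the same sc SparkContext used in the snippet, and using only standard parallelize/map/collect calls) to compare the driver's pandas version with whatever the workers import:

import pandas as pd

# pandas version on the driver
print("driver pandas:", pd.__version__)

# pandas version(s) seen inside the executors; a single-element RDD is enough
def worker_pandas_version(_):
    import pandas as pd
    return pd.__version__

print("worker pandas:", set(sc.parallelize([0], 1).map(worker_pandas_version).collect()))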