AttributeError: 'DataFrame' object has no attribute '_data'

2024/10/6 22:31:06

Azure Databricks raises an execution error when I parallelize work over a pandas DataFrame. The code creates the RDD successfully but fails when .collect() is called.

Setup:

import pandas as pd

# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]

# Create the pandas DataFrame
my_df = pd.DataFrame(data, columns = ['Name', 'Age'])

def testfn(i):
  return my_df.iloc[i]

test_var=sc.parallelize([0,1,2],50).map(testfn).collect()
print (test_var)

Error:

Py4JJavaError                             Traceback (most recent call last)
<command-2941072546245585> in <module>
      1 def testfn(i):
      2   return my_df.iloc[i]
----> 3 test_var=sc.parallelize([0,1,2],50).map(testfn).collect()
      4 print (test_var)

/databricks/spark/python/pyspark/rdd.py in collect(self)
    901         # Default path used in OSS Spark / for non-credential passthrough clusters:
    902         with SCCallSiteSync(self.context) as css:
--> 903             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    904         return list(_load_from_socket(sock_info, self._jrdd_deserializer))
    905 

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    125     def deco(*a, **kw):
    126         try:
--> 127             return f(*a, **kw)
    128         except py4j.protocol.Py4JJavaError as e:
    129             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 3845.0 failed 4 times, most recent failure: Lost task 16.3 in stage 3845.0 : org.apache.spark.api.python.PythonException: 'AttributeError: 'DataFrame' object has no attribute '_data'', from <command-2941072546245585>, line 2. Full traceback below:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 654, in main
    process()
  File "/databricks/spark/python/pyspark/worker.py", line 646, in process
    serializer.dump_stream(out_iter, outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 279, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/databricks/spark/python/pyspark/util.py", line 109, in wrapper
    return f(*args, **kwargs)
  File "<command-2941072546245585>", line 2, in testfn
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 1767, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 2137, in _getitem_axis
    self._validate_integer(key, axis)
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 2060, in _validate_integer
    len_axis = len(self.obj._get_axis(axis))
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 424, in _get_axis
    return getattr(self, name)
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
    return object.__getattribute__(self, name)
  File "pandas/_libs/properties.pyx", line 63, in pandas._libs.properties.AxisProperty.__get__
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute '_data'

Version details:

spark: '3.0.0'
python: 3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]

Answer

I've seen this error when the driver and the executors had different versions of pandas installed. In my case the driver had pandas 1.1.0 (via databricks-connect), and the executors were on Databricks Runtime 7.3 with pandas 1.0.1. pandas 1.1.0 made a big change to its internals, so a DataFrame serialized by the driver and sent to the executors cannot be used by the older version there. Check that your executors and driver have the same version of pandas (the release notes for each Databricks Runtime list the pandas version it ships). Comparing the versions of the Python libraries on the executors and on the driver will confirm whether this is the cause.
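
One way to do that comparison is sketched below. It is not the script the answer originally pointed to, only a minimal illustration that assumes it runs in a notebook where `sc` is the active SparkContext. The helper names `library_versions` and `compare_driver_and_executors` and the `num_tasks` parameter are hypothetical, chosen for this example.

```python
import importlib


def library_versions(names):
    """Return {library name: version string} for the current Python process."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            # Most libraries (pandas included) expose __version__.
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = "not installed"
    return versions


def compare_driver_and_executors(sc, names, num_tasks=8):
    """Collect library versions on the driver and inside executor tasks.

    Assumption: `sc` is a live SparkContext. Each task reports the versions
    seen by the worker process that ran it; duplicates are dropped.
    """
    driver = library_versions(names)
    reports = (
        sc.parallelize(range(num_tasks), num_tasks)
        .map(lambda _: tuple(sorted(library_versions(names).items())))
        .distinct()
        .collect()
    )
    executors = [dict(report) for report in reports]
    mismatched = [report for report in executors if report != driver]
    return driver, executors, mismatched
```

Calling `compare_driver_and_executors(sc, ["pandas"])` returns the driver's versions, the distinct version sets reported by the tasks, and the subset that differs from the driver. A non-empty third value points to the kind of mismatch described above. Tasks are not guaranteed to land on every executor, so a larger `num_tasks` makes the check more thorough without making it exhaustive.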

