AttributeError: Can't get attribute 'new_block' on module 'pandas.core.internals.blocks'


I was using PySpark on AWS EMR (4 r5.xlarge instances as 4 workers, each with one executor and 4 cores), and I got AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>. Below is a snippet of the code that threw this error:

from uszipcode import SearchEngine
import sqlite3

import numpy as np
import pandas as pd
from pyspark.sql.functions import col, udf

# Prepare the zipcode database on the driver and broadcast the lookup table
search = SearchEngine(db_file_dir="/tmp/db")
conn = sqlite3.connect("/tmp/db/simple_db.sqlite")
pdf_ = pd.read_sql_query(
    '''select zipcode, lat, lng, bounds_west, bounds_east, bounds_north, bounds_south
       from simple_zipcode''',
    conn)
brd_pdf = spark.sparkContext.broadcast(pdf_)
conn.close()

@udf('string')
def get_zip_b(lat, lng):
    pdf = brd_pdf.value  # line 102 in the job script; this is where the AttributeError is raised
    out = pdf[(np.array(pdf["bounds_north"]) >= lat) &
              (np.array(pdf["bounds_south"]) <= lat) &
              (np.array(pdf['bounds_west']) <= lng) &
              (np.array(pdf['bounds_east']) >= lng)]
    if len(out):
        min_index = np.argmin((np.array(out["lat"]) - lat)**2 +
                              (np.array(out["lng"]) - lng)**2)
        zip_ = str(out["zipcode"].iloc[min_index])
    else:
        zip_ = 'bad'
    return zip_

df = df.withColumn('zipcode', get_zip_b(col("latitude"), col("longitude")))

Below is the traceback, where line 102, in get_zip_b refers to pdf = brd_pdf.value:

21/08/02 06:18:19 WARN TaskSetManager: Lost task 12.0 in stage 7.0 (TID 1814, ip-10-22-17-94.pclc0.merkle.local, executor 6): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/util.py", line 121, in wrapper
    return f(*args, **kwargs)
  File "/mnt/var/lib/hadoop/steps/s-1IBFS0SYWA19Z/Mobile_ID_process_center.py", line 102, in get_zip_b
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 146, in value
    self._value = self.load_from_path(self._path)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 123, in load_from_path
    return self.load(f)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 129, in load
    return pickle.load(file)
AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks' from '/mnt/miniconda/lib/python3.9/site-packages/pandas/core/internals/blocks.py'>

Some observations and thought process:

1. After some searching online, the AttributeError in PySpark seems to be caused by mismatched pandas versions between the driver and the workers (see the version-check sketch after this list).

2. But I ran the same code on two different datasets: one worked without any errors, while the other didn't. That seemed strange and nondeterministic, and suggested the error might not be caused by mismatched pandas versions; otherwise, neither dataset would have succeeded.

3. I then ran the same code on the successful dataset again, but this time with a different Spark configuration: raising spark.driver.memory from 2048M to 4192m. This run threw the same AttributeError.

4. In conclusion, I think the AttributeError has something to do with the driver, but I can't tell from the error message how they are related, or how to fix it: AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>.
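To check observation 1 directly, you can print the pandas version on the driver and collect the versions actually seen by the executor Python processes. A minimal diagnostic sketch, assuming a live SparkSession named spark as in the snippet above:

import pandas as pd

print("driver pandas:", pd.__version__)

# Import pandas inside the lambda so each executor reports its own version,
# rather than a value captured on the driver
executor_versions = (spark.sparkContext
                     .parallelize(range(100))
                     .map(lambda _: __import__("pandas").__version__)
                     .distinct()
                     .collect())
print("executor pandas versions:", executor_versions)

If the two differ (for example 1.3.x on the driver and 1.2.x on the workers), that mismatch alone explains the error.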

Answer

Solutions

  • Keeping the pickle file unchanged, upgrade pandas to 1.3.x in the loading environment, and then load the pickle file.

Or

  • Keeping your current pandas version unchanged, downgrade pandas to 1.2.x on the dumping side, dump a new pickle file with v1.2.x, and then load it on your side with your pandas v1.2.x.
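For a PySpark job like the one in the question, either option means installing the same pandas on every node of the cluster, e.g. through an EMR bootstrap action or by running the install on each node. A hedged example (the pinned version is illustrative, and the right interpreter may differ; the traceback above shows the workers using a /mnt/miniconda Python):

sudo python3 -m pip install pandas==1.3.4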

In short

The pandas version used to dump the pickle (dump_version, probably 1.3.x) isn't compatible with the pandas version used to load the pickle (load_version, probably 1.2.x). To solve it, either upgrade pandas to 1.3.x in the loading environment and then load the pickle, or downgrade pandas to 1.2.x in the dumping environment and re-dump a new pickle, which you can then load with your pandas v1.2.x.

And this has nothing to do with PySpark itself.

In long

This issue comes from a backward incompatibility between pandas 1.2.x and 1.3.x. In version 1.2.5 and before, pandas used the name new_blocks in the module pandas.core.internals.blocks (cf. source code v1.2.5). On 2 July 2021, pandas released version 1.3.0, which changed this API: the name new_blocks in pandas.core.internals.blocks became new_block (cf. source code v1.3.0).
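You can see the renamed attribute baked into a pickle's opcode stream by disassembling a DataFrame dumped under pandas 1.3.x: the stream stores a (module, name) reference that pickle resolves by attribute lookup at load time. A small sketch (run it under pandas 1.3.x; the exact opcodes you see depend on the pickle protocol):

import pickle
import pickletools

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
payload = pickle.dumps(df)

# Under pandas 1.3.x the output should contain a global reference to
# 'pandas.core.internals.blocks' 'new_block' -- the name the loading
# environment must be able to resolve
pickletools.dis(payload)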

This API change results in incompatibility errors between the two versions:

  • If you have dumped a pickle with pandas v1.3.x and you try to load it with pandas v1.2.x, you will get the following error:

AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks' from '.../site-packages/pandas/core/internals/blocks.py'>

Python throws this error because it cannot find the attribute new_block on your current pandas.core.internals.blocks: to unpickle an object, pickle must resolve exactly the same classes and functions that were referenced when the object was dumped.

This is exactly your case: the pickle was dumped with pandas v1.3.x and loaded with pandas v1.2.x.
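The mechanism is easy to reproduce without pandas at all: pickle stores only a (module, attribute) reference for functions and classes, so unpickling fails with the same kind of AttributeError as soon as that attribute disappears. A self-contained sketch (the module name fake_blocks is made up for illustration):

import pickle
import sys
import types

# A throwaway module exposing an attribute, like pandas exposing new_block
mod = types.ModuleType("fake_blocks")

def new_block():
    pass

new_block.__module__ = "fake_blocks"
new_block.__qualname__ = "new_block"
mod.new_block = new_block
sys.modules["fake_blocks"] = mod

# Dumping stores only the reference ('fake_blocks', 'new_block'), not the code
payload = pickle.dumps(mod.new_block)

# Simulate loading under a version of the module that lost the attribute
del mod.new_block
try:
    pickle.loads(payload)
except AttributeError as e:
    print(e)  # Can't get attribute 'new_block' on <module 'fake_blocks'>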

To reproduce the error

pip install --upgrade pandas==1.3.4

import pickle

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 6))
with open("dump_from_v1.3.4.pickle", "wb") as f:
    pickle.dump(df, f)
quit()

pip install --upgrade pandas==1.2.5

import pickle

with open("dump_from_v1.3.4.pickle", "rb") as f:
    df = pickle.load(f)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-ff5c218eca92> in <module>
      1 with open("dump_from_v1.3.4.pickle", "rb") as f:
----> 2     df = pickle.load(f)
      3

AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks' from '/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py'>
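If you cannot change either pandas version and cannot re-dump the pickle, a last-resort, unsupported shim is to alias the missing name to pandas' older make_block factory before loading. This pokes at pandas internals and may not reconstruct every DataFrame correctly, and the exact keyword names below are an assumption; prefer aligning versions whenever possible:

# Run under pandas 1.2.x; unsupported workaround, use at your own risk
import pickle

import pandas.core.internals.blocks as blocks

def new_block(values, placement, *, ndim=None, klass=None):
    # Map the 1.3.x factory name back to the 1.2.x make_block factory
    return blocks.make_block(values, placement, klass=klass, ndim=ndim)

blocks.new_block = new_block

with open("dump_from_v1.3.4.pickle", "rb") as f:
    df = pickle.load(f)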
