AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>

I was using PySpark on AWS EMR (4 r5.xlarge instances as 4 workers, each with one executor and 4 cores), and I got AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>. Below is a snippet of the code that threw this error:

import sqlite3

import numpy as np
import pandas as pd
from pyspark.sql.functions import col, udf
from uszipcode import SearchEngine  # assumed source of SearchEngine and the simple_zipcode db

search = SearchEngine(db_file_dir="/tmp/db")
conn = sqlite3.connect("/tmp/db/simple_db.sqlite")
pdf_ = pd.read_sql_query('''select zipcode, lat, lng,
                        bounds_west, bounds_east, bounds_north, bounds_south from
                        simple_zipcode''', conn)
brd_pdf = spark.sparkContext.broadcast(pdf_)  # pickles pdf_ with the driver's pandas
conn.close()


@udf('string')
def get_zip_b(lat, lng):
    # Accessing .value on an executor unpickles the broadcast DataFrame there;
    # this is the line the traceback below points to (line 102, in get_zip_b)
    pdf = brd_pdf.value
    out = pdf[(np.array(pdf["bounds_north"]) >= lat) &
              (np.array(pdf["bounds_south"]) <= lat) &
              (np.array(pdf['bounds_west']) <= lng) &
              (np.array(pdf['bounds_east']) >= lng)]
    if len(out):
        min_index = np.argmin((np.array(out["lat"]) - lat)**2 + (np.array(out["lng"]) - lng)**2)
        zip_ = str(out["zipcode"].iloc[min_index])
    else:
        zip_ = 'bad'
    return zip_

df = df.withColumn('zipcode', get_zip_b(col("latitude"), col("longitude")))

Below is the traceback, where line 102, in get_zip_b refers to pdf = brd_pdf.value:

21/08/02 06:18:19 WARN TaskSetManager: Lost task 12.0 in stage 7.0 (TID 1814, ip-10-22-17-94.pclc0.merkle.local, executor 6): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/util.py", line 121, in wrapper
    return f(*args, **kwargs)
  File "/mnt/var/lib/hadoop/steps/s-1IBFS0SYWA19Z/Mobile_ID_process_center.py", line 102, in get_zip_b
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 146, in value
    self._value = self.load_from_path(self._path)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 123, in load_from_path
    return self.load(f)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 129, in load
    return pickle.load(file)
AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks' from '/mnt/miniconda/lib/python3.9/site-packages/pandas/core/internals/blocks.py'>

Some observations and thought process:

1. After some searching online, the AttributeError in PySpark seems to be caused by mismatched pandas versions between the driver and the workers?

2. But I ran the same code on two different datasets: one worked without any errors and the other didn't, which seems strange and nondeterministic, and suggests the error may not be caused by mismatched pandas versions. Otherwise, neither dataset would have succeeded.

3. I then ran the same code on the successful dataset again, but this time with a different Spark configuration (spark.driver.memory raised from 2048M to 4192m), and it threw the AttributeError.

4. In conclusion, I think the AttributeError has something to do with the driver. But I can't tell from the error message how they are related, or how to fix it: AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>.
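
A small diagnostic sketch to test the version-mismatch theory from point 1 directly (assuming the active SparkSession named spark from the code above): it compares the driver's pandas version with whatever pandas the executors import.

import pandas as pd

def executor_pandas_version(_):
    import pandas
    return pandas.__version__

print("driver pandas:", pd.__version__)
print("executor pandas:", set(
    spark.sparkContext.parallelize(range(8), 8)
         .map(executor_pandas_version)
         .collect()))

If the two prints disagree, the broadcast DataFrame is being pickled with one pandas version and unpickled with another.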

Savor answered 2/8, 2021 at 17:29 Comment(1)
The same error appears when you save a pickle file with pandas 1.3.2 (using protocol=4) and try to open that pickle file with pandas 1.2. As of today, I have found nowhere but here that mentions this issue. – Ungraceful

Solutions

  • Keeping the pickle file unchanged, upgrade pandas to 1.3.x in the loading environment and then load the pickle.

Or

  • Keeping your loading pandas version unchanged, downgrade pandas to 1.2.x on the dumping side, dump a new pickle file with v1.2.x, and then load it with your pandas 1.2.x.

In short

The pandas version used to dump the pickle (dump_version, probably 1.3.x) isn't compatible with the pandas version used to load the pickle (load_version, probably 1.2.x). To solve it, either upgrade pandas (load_version) to 1.3.x in the loading environment and then load the pickle, or downgrade pandas (dump_version) to 1.2.x and re-dump a new pickle, which you can then load with your pandas 1.2.x.
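
A minimal guard along these lines can fail fast with a clearer message before loading; the 1.3 threshold reflects the scenario above, and the packaging library is assumed to be available:

import pandas as pd
from packaging.version import Version

# Assumption: the pickle was dumped by pandas >= 1.3 (see discussion above)
if Version(pd.__version__) < Version("1.3"):
    raise RuntimeError(
        f"pandas {pd.__version__} cannot load pickles dumped by pandas >= 1.3; "
        "upgrade pandas here or re-dump the pickle with a matching version")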

And this has nothing to do with PySpark.

In long

This issue is related to the backward incompatibility between Pandas versions 1.2.x and 1.3.x. In version 1.2.5 and before, Pandas used the name new_blocks in the module pandas.core.internals.blocks (cf. source code v1.2.5). On 2 July 2021, Pandas released version 1.3.0. In this update Pandas changed the API: the name new_blocks in the module pandas.core.internals.blocks became new_block (cf. source code v1.3.0).
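
A quick way to probe which of these names a given pandas install actually exposes (a small diagnostic sketch, not needed for the fix):

import pandas as pd
import pandas.core.internals.blocks as blocks

print(pd.__version__,
      [name for name in ("new_block", "new_blocks", "make_block")
       if hasattr(blocks, name)])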

This change of API results in incompatibility errors when a pickle crosses the version boundary:

  • If you dump a pickle with Pandas v1.3.x and try to load it with Pandas v1.2.x, you will get the following error:

AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks' from '.../site-packages/pandas/core/internals/blocks.py'>

Python throws this error because it cannot find the attribute new_block in your current pandas.core.internals.blocks: in order to load a pickled object, pickle must import the exact same module attribute that was recorded when the pickle was dumped.

This is exactly your case: the pickle was dumped with Pandas v1.3.x and loaded with Pandas v1.2.x.
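
To see the mechanism in isolation, here is a toy sketch (the module and names are made up, independent of pandas): pickle stores only a module-plus-attribute reference, so removing or renaming the attribute between dump and load raises the same kind of AttributeError.

import pickle
import sys
import types

# Build a throwaway module exposing an attribute, as it exists at dump time
mod = types.ModuleType("toy_mod")
def new_blocks():
    return "data"
new_blocks.__module__ = "toy_mod"
new_blocks.__qualname__ = "new_blocks"
mod.new_blocks = new_blocks
sys.modules["toy_mod"] = mod

payload = pickle.dumps(new_blocks)  # records the reference toy_mod.new_blocks

del mod.new_blocks                  # simulate the rename in the loading version
try:
    pickle.loads(payload)
except AttributeError as e:
    print(e)  # Can't get attribute 'new_blocks' on <module 'toy_mod'>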

To reproduce the error

pip install --upgrade pandas==1.3.4

import pickle

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 6))

with open("dump_from_v1.3.4.pickle", "wb") as f:
    pickle.dump(df, f)

quit()

pip install --upgrade pandas==1.2.5

import pickle

with open("dump_from_v1.3.4.pickle", "rb") as f: 
    df = pickle.load(f) 


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-ff5c218eca92> in <module>
      1 with open("dump_from_v1.3.4.pickle", "rb") as f:
----> 2     df = pickle.load(f)
      3 

AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks' from '/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py'>
Caernarvonshire answered 24/10, 2021 at 15:32 Comment(0)

I had the same error with pandas 1.3.2 on the server while my client had 1.2. Downgrading pandas to 1.2 solved the problem.

Tympanist answered 26/8, 2021 at 13:59 Comment(0)

I had the same AttributeError under the following circumstances:

  • The pickle file was created with pandas 1.4.0 on a Windows machine, Python 3.8
  • I tried to load the file with pandas 1.3.5 on a Debian machine, Python 3.7
Fissi answered 25/2, 2022 at 13:18 Comment(1)
I ran into this same issue; I believe it was caused by the pickle being created on Windows and loaded on Debian. I had to regenerate the pickle on the server, and then it worked as expected. – Giorgio

pip install --upgrade --user pandas==1.3 (+ restart)

Netherlands answered 8/1, 2022 at 10:27 Comment(0)

Alternative bypass suggestion if you have a large model object and can't change the environment: simply export your DataFrame object to another file type, such as .csv.
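
A minimal sketch of that bypass (the column names here are just examples): to_csv/read_csv don't depend on pandas' internal pickle layout, so the two sides can run different pandas versions.

import pandas as pd

# On the dumping side (any pandas version)
df = pd.DataFrame({"zipcode": ["10001"], "lat": [40.75], "lng": [-73.99]})
df.to_csv("exchange.csv", index=False)

# On the loading side (any other pandas version)
df2 = pd.read_csv("exchange.csv", dtype={"zipcode": str})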

Selfinductance answered 7/4, 2022 at 8:19 Comment(0)

I had the same error using pandas 1.3.2. Running conda install matplotlib=1.4.3 solved the issue. Thanks to the other posts for pointing the way.

Frivolity answered 19/10, 2022 at 13:53 Comment(0)

I had the same problem. If you use Spyder, try Jupyter notebooks; I had the same error in Spyder, but in Jupyter notebooks it worked.

Just restarting Spyder solved my problem!

Knopp answered 9/12, 2021 at 9:49 Comment(0)
