Different behavior when reading a DataFrame from parquet via the CLI versus an executable in the same environment

Please consider the following program as a Minimal Reproducible Example (MRE):

import pandas as pd
import pyarrow
from pyarrow import parquet

def foo():
    print(pyarrow.__file__)
    print('version:',pyarrow.cpp_version)
    print('-----------------------------------------------------')
    df = pd.DataFrame({'A': [1,2,3], 'B':['dummy']*3})
    print('Original DataFrame:\n', df)
    print('-----------------------------------------------------')
    _table = pyarrow.Table.from_pandas(df)
    parquet.write_table(_table, 'foo')
    _table = parquet.read_table('foo', columns=[])    # passing an empty list to the columns arg
    df = _table.to_pandas()
    print('After reading from file with columns=[]:\n', df)
    print('-----------------------------------------------------')
    print('Not passing [] to columns parameter')
    _table = parquet.read_table('foo')                # not passing any list
    df = _table.to_pandas()
    print(df)
    print('-----------------------------------------------------')
    x = input('press any key to exit: ')

if __name__ == '__main__':
    foo()

When I run it from the console/IDE, it reads all the data for columns=[]:

(env) D:\foo>python foo.py
D:\foo\env\lib\site-packages\pyarrow\__init__.py
version: 3.0.0
-----------------------------------------------------
Original DataFrame:
    A      B
0  1  dummy
1  2  dummy
2  3  dummy
-----------------------------------------------------
After reading from file with columns=[]:
    A      B
0  1  dummy
1  2  dummy
2  3  dummy
-----------------------------------------------------
Not passing [] to columns parameter
   A      B
0  1  dummy
1  2  dummy
2  3  dummy
-----------------------------------------------------
press any key to exit:

But when I run it from the executable created with PyInstaller, it reads no data for columns=[]:

E:\foo\dist\foo\pyarrow\__init__.pyc
version: 3.0.0
-----------------------------------------------------
Original DataFrame:
    A      B
0  1  dummy
1  2  dummy
2  3  dummy
-----------------------------------------------------
After reading from file with columns=[]:
 Empty DataFrame
Columns: []
Index: [0, 1, 2]
-----------------------------------------------------
Not passing [] to columns parameter
   A      B
0  1  dummy
1  2  dummy
2  3  dummy
-----------------------------------------------------
press any key to exit:

As you can see, passing columns=[] gives an empty DataFrame in the executable, but this does not happen when running the Python file directly, and I'm not sure why the same code behaves differently in the same environment.

Looking at docstring of parquet.read_table in source code at GitHub:

columns: list
If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.

read_table in turn calls dataset.read, which calls _dataset.to_table, which calls self.scanner, which in turn returns a call to the static method Scanner.from_dataset.

Everywhere, None is used as the default value for the columns parameter. If None and [] are converted directly to Boolean, both are indeed falsy; but an identity check against None distinguishes them, since [] is None is False. It is nowhere documented whether columns=[] should fetch all the columns (because it evaluates as falsy) or read no columns at all (because the list is empty).
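This ambiguity can be shown in plain Python. The sketch below uses hypothetical function names (truthiness_check, identity_check) to illustrate the two readings; they are not pyarrow functions:

```python
# Both None and [] are falsy, but only an identity check tells them apart.
print(bool(None), bool([]))   # False False
print([] is None)             # False

# Two plausible ways a reader could interpret its columns parameter:
def truthiness_check(columns=None):
    # treats [] like None, so columns=[] would read ALL columns
    return "all columns" if not columns else f"only {columns}"

def identity_check(columns=None):
    # distinguishes [] from None, so columns=[] would read NO columns
    return "all columns" if columns is None else f"only {columns}"

print(truthiness_check([]))   # all columns
print(identity_check([]))     # only []
```

Whichever reading the library intends, the two give different results only for the empty list, which is exactly the input in question.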

But why is the behavior different when running from the command line/IDE than from the executable created with PyInstaller, for the same version of pyarrow?

The environment I'm on:

  • Python version: 3.7.6
  • PyInstaller version: 4.2
  • pyarrow version: 3.0.0
  • Windows 10, 64-bit

Here is the spec file for your reference if you want to give it a try (you need to change the pathex parameter):

foo.spec

# -*- mode: python ; coding: utf-8 -*-
import sys ; sys.setrecursionlimit(sys.getrecursionlimit() * 5)
block_cipher = None


a = Analysis(['foo.py'],
             pathex=['D:\\foo'],
             binaries=[],
             datas=[],
             hiddenimports=[],
             hookspath=[],
             runtime_hooks=[],
             excludes=[],
             win_no_prefer_redirects=False,
             win_private_assemblies=False,
             cipher=block_cipher,
             noarchive=False)
pyz = PYZ(a.pure, a.zipped_data,
             cipher=block_cipher)
exe = EXE(pyz,
          a.scripts,
          [],
          exclude_binaries=True,
          name='foo',
          debug=False,
          bootloader_ignore_signals=False,
          strip=False,
          upx=True,
          console=True )
coll = COLLECT(exe,
               a.binaries,
               a.zipfiles,
               a.datas,
               strip=False,
               upx=True,
               upx_exclude=[],
               name='foo')
Miler answered 22/7, 2021 at 13:52 Comment(0)

Credit to @U12-Forward for assisting me in debugging the issue.

After a bit of research and debugging, and exploring the library's files, I found that pyarrow uses two different classes to read data from a parquet file: ParquetDataset (the legacy implementation) and _ParquetDatasetV2 (used when not in legacy mode). Even though these are defined in the pyarrow.parquet module, the newer path depends on the pyarrow.dataset module, which was missing from the executable created by PyInstaller.

When I added pyarrow.dataset as a hidden import and rebuilt, the exe raised ModuleNotFoundError on execution because of several missing dependencies used by the dataset module. To resolve it, I added all the pyarrow submodules from the environment to the hidden imports and built again. This time it worked; by "worked" I mean I observed the same behavior in both environments.

The spec file looks like this after modification:

# -*- mode: python ; coding: utf-8 -*-
import sys ; sys.setrecursionlimit(sys.getrecursionlimit() * 5)
block_cipher = None


a = Analysis(['foo.py'],
             pathex=['D:\\foo'],
             binaries=[],
             datas=[],
             hiddenimports=['pyarrow.benchmark', 'pyarrow.cffi', 'pyarrow.compat', 'pyarrow.compute', 'pyarrow.csv', 'pyarrow.cuda', 'pyarrow.dataset', 'pyarrow.feather', 'pyarrow.filesystem', 'pyarrow.flight', 'pyarrow.fs', 'pyarrow.hdfs', 'pyarrow.ipc', 'pyarrow.json', 'pyarrow.jvm', 'pyarrow.orc', 'pyarrow.pandas_compat', 'pyarrow.parquet', 'pyarrow.plasma', 'pyarrow.serialization', 'pyarrow.types', 'pyarrow.util', 'pyarrow._generated_version', 'pyarrow.__init__'],
             hookspath=[],
             runtime_hooks=[],
             excludes=[],
             win_no_prefer_redirects=False,
             win_private_assemblies=False,
             cipher=block_cipher,
             noarchive=False)
pyz = PYZ(a.pure, a.zipped_data,
             cipher=block_cipher)
exe = EXE(pyz,
          a.scripts,
          [],
          exclude_binaries=True,
          name='foo',
          debug=False,
          bootloader_ignore_signals=False,
          strip=False,
          upx=True,
          console=True )
coll = COLLECT(exe,
               a.binaries,
               a.zipfiles,
               a.datas,
               strip=False,
               upx=True,
               upx_exclude=[],
               name='foo')
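As a less brittle alternative to listing every submodule by hand, PyInstaller's hook utilities can collect them automatically. A sketch of the relevant spec-file fragment (untested against this particular project):

```python
# foo.spec fragment: gather every pyarrow submodule automatically
# instead of maintaining the hiddenimports list by hand.
from PyInstaller.utils.hooks import collect_submodules

hidden = collect_submodules('pyarrow')

a = Analysis(['foo.py'],
             pathex=['D:\\foo'],
             hiddenimports=hidden,
             # ... remaining Analysis arguments unchanged ...
             )
```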

Also, to create the build, I included the path of the virtual environment's site-packages using the --paths argument:

pyinstaller --paths D:\foo\env\Lib\site-packages foo.spec

Here is the output after following the above steps:

E:\foo\dist\foo\pyarrow\__init__.pyc
version: 3.0.0
-----------------------------------------------------
Original DataFrame:
    A      B
0  1  dummy
1  2  dummy
2  3  dummy
-----------------------------------------------------
After reading from file with columns=[]:
    A      B
0  1  dummy
1  2  dummy
2  3  dummy
-----------------------------------------------------
Not passing [] to columns parameter
   A      B
0  1  dummy
1  2  dummy
2  3  dummy
-----------------------------------------------------
press any key to exit:

It is true that the desired behavior for columns=[] is nowhere documented, but looking at ARROW-13436, opened in pyarrow by @Pace, it seems the intended behavior for columns=[] is to read no data columns at all. That is not an official confirmation, however, so it is possibly a bug in pyarrow 3.0.0 itself.

Miler answered 30/8, 2021 at 6:18 Comment(0)

The pyarrow documentation for pyarrow.parquet.read_table is probably unclear. I've raised ARROW-13436 to clarify this.

From some testing it seems that the behavior changed at some point from no columns to all columns and then changed back (in 4.0) to no columns. I believe no columns is the correct behavior.

So my guess is that your executable is using a different version of pyarrow than your IDE. You can usually confirm this by running...

import pyarrow
print(pyarrow.__file__)
print(pyarrow.cpp_version)

...on both environments and then comparing the results.

Rudderhead answered 22/7, 2021 at 17:35 Comment(2)
Thanks for the response, I'm using pyarrow 3.0.0 in both environments; anyway, I'll print it and add that to the question.Miler
I have updated the question with those details.Miler

After commenting with @ThePyGuy (the OP), the reason was found.

The reason is that when he ran it in the IDE, he ran it inside an environment, whereas when he ran it on the command line, it was outside the environment, and the DLL files were different. So the fix was to copy the pyarrow package to outside the environment, and it gave the same result.

Onrush answered 30/8, 2021 at 6:11 Comment(1)
No, the DLL files are the same, they are not different. And just copying the library files is not a solution; it needs to be done from PyInstaller while creating the executable.Miler
