Please consider the following program as a Minimal Reproducible Example (MRE):
import pandas as pd
import pyarrow
from pyarrow import parquet
def foo():
    print(pyarrow.__file__)
    print('version:',pyarrow.cpp_version)
    print('-----------------------------------------------------')
    df = pd.DataFrame({'A': [1,2,3], 'B':['dummy']*3})
    print('Orignal DataFrame:\n', df)
    print('-----------------------------------------------------')
    _table = pyarrow.Table.from_pandas(df)
    parquet.write_table(_table, 'foo')
    _table = parquet.read_table('foo', columns=[])    #passing empty list to columns arg
    df = _table.to_pandas()
    print('After reading from file with columns=[]:\n', df)
    print('-----------------------------------------------------')
    print('Not passing [] to columns parameter')
    _table = parquet.read_table('foo')                #Not passing any list
    df = _table.to_pandas()
    print(df)
    print('-----------------------------------------------------')
    x = input('press any key to exit: ')

if __name__=='__main__':
    foo()
When I run it from the console or an IDE, it reads all of the data for columns=[]:
(env) D:\foo>python foo.py
D:\foo\env\lib\site-packages\pyarrow\__init__.py
version: 3.0.0
-----------------------------------------------------
Orignal DataFrame:
A B
0 1 dummy
1 2 dummy
2 3 dummy
-----------------------------------------------------
After reading from file with columns=[]:
A B
0 1 dummy
1 2 dummy
2 3 dummy
-----------------------------------------------------
Not passing [] to columns parameter
A B
0 1 dummy
1 2 dummy
2 3 dummy
-----------------------------------------------------
press any key to exit:
But when I run it from an executable created with PyInstaller, it reads no data for columns=[]:
E:\foo\dist\foo\pyarrow\__init__.pyc
version: 3.0.0
-----------------------------------------------------
Orignal DataFrame:
A B
0 1 dummy
1 2 dummy
2 3 dummy
-----------------------------------------------------
After reading from file with columns=[]:
Empty DataFrame
Columns: []
Index: [0, 1, 2]
-----------------------------------------------------
Not passing [] to columns parameter
A B
0 1 dummy
1 2 dummy
2 3 dummy
-----------------------------------------------------
press any key to exit:
As you can see, passing columns=[] gives an empty DataFrame in the executable, but not when running the Python file directly, and I'm not sure why the same code behaves in two different ways in the same environment.
Looking at the docstring of parquet.read_table in the source code on GitHub:
columns: list
If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.
read_table then calls dataset.read, which calls _dataset.to_table, which returns a call to self.scanner, which in turn returns a call to the static method from_dataset of the Scanner class.
None is used everywhere as the default value of the columns parameter. If None and [] are converted directly to Boolean in Python, both are indeed False. But if [] is compared against None, the result is False, because an empty list is not None. Nowhere is it documented whether columns=[] should fetch all the columns, because [] evaluates to False as a Boolean, or read no columns at all, because the list is empty.
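To make the ambiguity concrete, here is a minimal sketch in plain Python. The two selector functions are hypothetical and are not taken from the PyArrow source; they only illustrate how a truthiness check and an explicit None check could produce the two results seen above for the same columns=[] argument.

```python
def select_by_truthiness(all_columns, columns=None):
    """Hypothetical reader: any falsy value (None or []) means 'read everything'."""
    if not columns:
        return list(all_columns)
    return [c for c in all_columns if c in columns]


def select_by_none_check(all_columns, columns=None):
    """Hypothetical reader: only None means 'read everything'; [] means 'read nothing'."""
    if columns is None:
        return list(all_columns)
    return [c for c in all_columns if c in columns]


available = ['A', 'B']

print(bool([]), bool(None))                          # False False
print([] is None)                                    # False
print(select_by_truthiness(available, columns=[]))   # ['A', 'B']
print(select_by_none_check(available, columns=[]))   # []
```

If two different code paths inside the library used these two styles of check, the same call could return all columns in one environment and none in the other. Whether that is what actually happens here is an assumption on my part.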
But why is the behavior different when running it from the command line or IDE than when running it from the executable created with PyInstaller, for the same version of PyArrow?
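One way to narrow this down might be to check whether the same submodules can be imported in both environments. This is only a diagnostic sketch: the helper probe_modules is hypothetical, the list of module names is my guess at what read_table may rely on, and the idea that a module missing from the bundle explains the difference is an unconfirmed assumption.

```python
import importlib


def probe_modules(names):
    """Try to import each module; map its name to None on success or to an error string."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # report any import-time failure, not only ImportError
            results[name] = f'{type(exc).__name__}: {exc}'
        else:
            results[name] = None
    return results


if __name__ == '__main__':
    # Assumed candidates; adjust the list to whatever the library imports lazily.
    candidates = ['pyarrow', 'pyarrow.parquet', 'pyarrow.dataset']
    for name, error in probe_modules(candidates).items():
        print(name, '->', 'ok' if error is None else error)
```

Running this both as a script and inside the frozen executable, then comparing the two outputs, would show whether the bundle is missing something that the normal environment has.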
The environment I'm on:
- Python Version: 3.7.6
- PyInstaller Version: 4.2
- Pyarrow Version: 3.0.0
- Windows 10 64 bit OS
Here is the spec file for reference if you want to give it a try (you need to change the pathex parameter):
foo.spec
# -*- mode: python ; coding: utf-8 -*-
import sys ; sys.setrecursionlimit(sys.getrecursionlimit() * 5)
block_cipher = None
a = Analysis(['foo.py'],
             pathex=['D:\\foo'],
             binaries=[],
             datas=[],
             hiddenimports=[],
             hookspath=[],
             runtime_hooks=[],
             excludes=[],
             win_no_prefer_redirects=False,
             win_private_assemblies=False,
             cipher=block_cipher,
             noarchive=False)
pyz = PYZ(a.pure, a.zipped_data,
          cipher=block_cipher)
exe = EXE(pyz,
          a.scripts,
          [],
          exclude_binaries=True,
          name='foo',
          debug=False,
          bootloader_ignore_signals=False,
          strip=False,
          upx=True,
          console=True )
coll = COLLECT(exe,
               a.binaries,
               a.zipfiles,
               a.datas,
               strip=False,
               upx=True,
               upx_exclude=[],
               name='foo')