How to update data in pyarrow table?

Asked 22/1, 2021 at 13:1 Answered 19/9, 2022 at 21:0

I have a python script that reads in a parquet file using pyarrow. I'm trying to loop through the table to update values in it. If I try this:

for col_name in table2.column_names:
    if col_name in my_columns:
        print('updating values in column '  + col_name)
        
        col_data = pa.Table.column(table2, col_name)
        
        row_ct = 1
        for i in col_data:
            pa.Table.column(table2, col_name)[row_ct] = change_str(pa.StringScalar.as_py(i))
            row_ct += 1

I get this error:

 TypeError: 'pyarrow.lib.ChunkedArray' object does not support item assignment

How can I update these values?

I tried using pandas, but it couldn't handle null values in the original table, and it also incorrectly translated the datatypes of the columns in the original table. Does pyarrow have a native way to edit the data?

Undrape answered 22/1, 2021 at 13:1 Comment(0)

Arrow tables (and arrays) are immutable. So you won't be able to update your table in place.

The way to achieve this is to create a copy of the data when modifying it. Arrow supports some basic operations for modifying strings, but they are very limited.

Another option is to use pandas, but as you've noticed, going from Arrow to pandas and back isn't seamless.

Let's take an example:

>>> table = pa.Table.from_arrays(
    [ 
        pa.array(['abc', 'def'], pa.string()),
        pa.array([1, None], pa.int32()),
    ],
    schema=pa.schema(
    [
        pa.field('str_col', pa.string()), 
        pa.field('int_col', pa.int32()), 
    ]
    )
)
>>> from_pandas = pa.Table.from_pandas(table.to_pandas())
>>> from_pandas.schema
str_col: string
int_col: double
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 487

You can see that converting to pandas and back has changed the type of the int column to double. This is because pandas doesn't support null int values very well, so it converted the int column to double.

To avoid this issue I'd suggest working on a column by column basis, only converting the string columns to pandas:

def my_func(value):
    return 'hello ' + value + '!'


columns = []
my_columns = ['str_col']
for column_name in table.column_names:
    column_data = table[column_name]
    if column_name in my_columns:
        column_data = pa.array(table[column_name].to_pandas().apply(my_func))
    columns.append(column_data)

updated_table = pa.Table.from_arrays(
    columns, 
    schema=table.schema
)

>>> updated_table['str_col']
<pyarrow.lib.ChunkedArray object at 0x7f05f42b3f40>
[
  [
    "hello abc!",
    "hello def!"
  ]
]

Heathenish answered 24/1, 2021 at 16:32 Comment(1)

I eventually figured out the same thing you suggested with using the pa.array, and it worked perfectly. I posted my answer, but I will mark yours as the "official" answer since it's very similar. Thanks! – Undrape 25/1, 2021 at 13:26

The native way to update array data in pyarrow is to use pyarrow compute functions. Converting to pandas, which you described, is also a valid way to achieve this, so you might want to figure out what went wrong there. However, the compute API is not going to match the approach you have.

You currently decide, in a Python function change_str, what the new value of each item should be. Hopefully it is possible to express the manipulation you need to perform as a composite of pyarrow compute functions. This will avoid the (expensive) cost of marshalling the entire native array into python objects. If you can describe what you are trying to achieve in change_str (probably in a new question) I can help figure it out.

If, for some reason, you must keep change_str in Python, then you will need to convert the entire column to Python objects (which carries a pretty hefty performance penalty) using ChunkedArray.to_pylist().

Sussman answered 22/1, 2021 at 20:43 Comment(0)

For a no pandas solution (pyarrow native), try replacing your column with updated values using table.set_column().

In the following example I update the float column 'c' using compute to add 2 to all of the values. I'm just using the to_pandas in the print for a nicer output display. https://arrow.apache.org/docs/python/generated/pyarrow.Table.html?highlight=set_column#pyarrow.Table.set_column

import pyarrow as pa
import numpy as np
import pyarrow.compute as pc

#Create some test data with columns a,b,c
tb = pa.table({'a': range(15), 'b': range(15, 0, -1), 'c': np.random.randn(15)})

print("BeforeUpdate:\n", tb.to_pandas())
col_index = tb.column_names.index('c')
field = tb.field(col_index)
new_data = pc.add(tb.column('c'),2) #Make a new copy of col c and add 2 to each value
tb = tb.set_column(col_index, field, new_data) #Rebind tb to a copy of itself with the new c column in place of the old one
print("AfterUpdate:\n", tb.to_pandas())

Zambrano answered 19/9, 2022 at 21:0 Comment(1)

This answer deserves more attention since the accepted one is no longer up to date. – Housekeeping 14/11, 2023 at 19:29

I was able to get it working using these references:

http://arrow.apache.org/docs/python/generated/pyarrow.Table.html

http://arrow.apache.org/docs/python/generated/pyarrow.Field.html

https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_table.py

Basically it loops through the original table and creates new columns (pa.array) with the adjusted text that it appends to a new table. It's probably not the best way to do it, but it worked. Most importantly, it let me preserve the nulls and specify the data type of each column.

import sys, getopt
import random
import re
import math

import pyarrow.parquet as pq
import pyarrow.csv as pcsv
import numpy as np
#import pandas as pd
import pyarrow as pa
import os.path

<a lot of other code here>

parquet_file = pq.ParquetFile(in_file)
table2 = pq.read_table(in_file)

<a lot of other code here>

changed_ct = 0
all_cols_ct = 0
table3 = pa.Table.from_arrays([pa.array(range(0,table2.num_rows))], names=['0']) # CREATE TEMP COLUMN!!
#print(table3)
#exit()
changed_column_list = []
for col_name in table2.column_names:
    print('processing column: ' + col_name)
    new_list = []
    col_data = pa.Table.column(table2, col_name)
    col_data_type = table2.schema.field(col_name).type
    printed_changed_flag = False
    for i in col_data:
        # GET STRING REPRESENTATION OF THE COLUMN DATA
        if(col_data_type == 'string'):
            col_str = pa.StringScalar.as_py(i)
        elif(col_data_type == 'int32'):
            col_str = pa.Int32Scalar.as_py(i)
        elif(col_data_type == 'int64'):
            col_str = pa.Int64Scalar.as_py(i)
            
            
        if col_name in change_columns:
            if printed_changed_flag == False:
                print('changing values in column '  + col_name)
                changed_column_list.append(col_name)
                changed_ct += 1
                printed_changed_flag = True

            new_list.append(change_str(col_str))
        
        else:
            new_list.append(col_str)
        
    #set data type for the column
    if(col_data_type == 'string'):
        col_data_type = pa.string()
    elif(col_data_type == 'int32'):
        col_data_type = pa.int32()
    elif(col_data_type == 'int64'):
        col_data_type = pa.int64()
        
    arr = pa.array(new_list, type=col_data_type)
        
    new_field = pa.field(col_name, col_data_type)
    
    table3 = pa.Table.append_column(table3, new_field, arr)
        
    all_cols_ct += 1
    
#for i in table3:
#   print(i)

table3 = pa.Table.remove_column(table3, 0) # REMOVE TEMP COLUMN!!
#print(table2)
#print('-------------------')
#print(table3)
#exit()

print('changed ' + str(changed_ct) + ' columns:')
print(*changed_column_list, sep='\n')

# WRITE NEW PARQUET FILE
pq.write_table(table3, out_file)

Undrape answered 25/1, 2021 at 13:22 Comment(0)

To update data in a DatasetDict or any Arrow table, I recommend the following steps:

Create a new variable with the same type as the data that you want to update
Collect your new data in a list or numpy array (for a list, with the append() method)
Insert this list into the variable that you created in the first step

Below is how I solved my problem:

dataset_ = datasets.DatasetDict({"train": Dataset.from_dict({
                                
                                  'image': img_list,     #0,20580 --> we solve all the problems and take all the elements and all the classes
                                  'label': dataset['train']['label']    #'labels_list or dataset['train']['label']
                                  }),
                                 
                                 "test": Dataset.from_dict({  #20580-4116 (validation) ,20580-2058 (test)
                                  'image':  img_list[len(dataset['train']) - percentage_divison_validation : len(dataset['train']) - percentage_divison_test], 
                                  'label': labels_list[len(dataset['train']) - percentage_divison_validation : len(dataset['train']) - percentage_divison_test] }), 
                                 
                                  "validation": Dataset.from_dict({ # 20580-2058 (test)
                                  'image':  img_list[len(dataset['train']) - percentage_divison_test : len(dataset['train'])], 
                                  'label': labels_list[len(dataset['train']) - percentage_divison_test : len(dataset['train'])]}), 
                                })

What I was trying to do here was update the images inside this DatasetDict. Because of its structure it cannot be updated in place, so I created img_list with the modified images and then built a new DatasetDict from it.

Nighttime answered 8/7, 2022 at 10:9 Comment(0)
