Polars looping through the rows in a dataset

Asked 2/2, 2023 at 13:15 Answered 5/5, 2024 at 11:45

I am trying to loop through the rows of a Polars DataFrame using the following code:

import polars as pl

df = pl.DataFrame({
    "start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
    "Name": ["John", "Joe", "James"]
})

for row in df.rows():
    print(row)

('2020-01-02', 'John')
('2020-01-03', 'Joe')
('2020-01-04', 'James')

Is there a way to specifically reference 'Name' using the named column as opposed to the index? In Pandas this would look something like:

import pandas as pd

df = pd.DataFrame({
    "start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
    "Name": ["John", "Joe", "James"]
})

for index, row in df.iterrows():
    df['Name'][index]

'John'
'Joe'
'James'

Staggs answered 2/2, 2023 at 13:15 Comment(0)

You can specify that you want the rows to be named:

for row in mydf.rows(named=True):
    print(row)

It will give you a dict:

{'start_date': '2020-01-02', 'Name': 'John'}
{'start_date': '2020-01-03', 'Name': 'Joe'}
{'start_date': '2020-01-04', 'Name': 'James'}

You can then access row['Name'].

Note that:

- Previous versions returned a namedtuple instead of a dict.
- It's less memory-intensive to use iter_rows.
- Overall, it's not recommended to iterate through the data this way.

Row iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods.

Quibbling answered 2/2, 2023 at 13:26 Comment(7)

Hi @0x26res. Thank you for taking the time to explain this to me. If I use mydf.iterrows(), as in for row in mydf.iterrows(named=True): row['Name'], I get the error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
TypeError: tuple indices must be integers or slices, not str

– Staggs 2/2, 2023 at 13:36

I am also happy to achieve the same result via a columnar method but have not seen much documentation around this particular type of iteration in Polars. Thank you again for your help – Staggs 2/2, 2023 at 13:37

As mentioned, previous versions returned a namedtuple instead of a dict. Try row.Name – Quibbling 2/2, 2023 at 14:10

@JohnSmith iterating through rows isn't documented well in polars because it's highly discouraged as it basically circumvents all the optimization. It's like if you buy a Ferrari and then ask how to drive it really slowly and quietly. What result are you ultimately after? – Garfield 2/2, 2023 at 22:26

Hi @DeanMacGregor, I have a table which has names and dates. Each row has variables that I "inject" into the SQL, which in the case above would generate 3 SQL statements and 3 reports. I don't have access to create stored procedures, and I can't run everything in one go due to spool space issues. Thank you for your question – Staggs 3/2, 2023 at 8:6

"overall it's not recommended to iterate through the data this way" - which way is recommended then? – Slocum 18/3, 2024 at 8:19

It depends on what your use case is, but it's best to use Polars expressions and built-in functions, for performance and maintainability. – Quibbling 18/3, 2024 at 8:24

You would use select for that

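# select returns a one-column DataFrame; to_series() extracts that column,
# so the loop yields the individual values rather than the column itself
names = mydf.select(['Name']).to_series()
for name in names:
    print(name)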

Lindbom answered 2/2, 2023 at 13:23 Comment(1)

Hi @Kien Truong. Thank you very much for your quick reply. I used name as an example. I would actually want to get both the date and the name, as these items are then sent to an SQL statement in the real code, as opposed to this example. Each row in this table would generate a separate SQL statement – Staggs 2/2, 2023 at 13:27

In polars, pl.DataFrame.iter_rows with named=True should be preferred over pl.DataFrame.rows as the latter materialises all frame data as a list of rows, which is potentially expensive.

import polars as pl


df = pl.DataFrame({
    "start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
    "name": ["John", "Joe", "James"]
})

for row in df.iter_rows(named=True):
    print(row)

{'start_date': '2020-01-02', 'name': 'John'}
{'start_date': '2020-01-03', 'name': 'Joe'}
{'start_date': '2020-01-04', 'name': 'James'}

Mcniel answered 5/5, 2024 at 11:45 Comment(0)
