Polars looping through the rows in a dataset
S

3

15

I am trying to loop through a Polars recordset using the following code:

import polars as pl

df = pl.DataFrame({
    "start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
    "Name": ["John", "Joe", "James"]
})

for row in df.rows():
    print(row)

This prints:

('2020-01-02', 'John')
('2020-01-03', 'Joe')
('2020-01-04', 'James')

Is there a way to specifically reference 'Name' using the named column as opposed to the index? In Pandas this would look something like:

import pandas as pd

df = pd.DataFrame({
    "start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
    "Name": ["John", "Joe", "James"]
})

for index, row in df.iterrows():
    print(row['Name'])

This prints:

John
Joe
James
Staggs answered 2/2, 2023 at 13:15 Comment(0)
Q
23

You can specify that you want the rows to be named:

for row in df.rows(named=True):
    print(row)

It will give you a dict:

{'start_date': '2020-01-02', 'Name': 'John'}
{'start_date': '2020-01-03', 'Name': 'Joe'}
{'start_date': '2020-01-04', 'Name': 'James'}

You can then access the value with row['Name'].

Note that:

  • previous versions returned a namedtuple instead of a dict.
  • iter_rows is less memory intensive than rows, because it yields rows one at a time instead of building the whole list up front.
  • overall, iterating through the data row by row is not recommended.

Row iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods.

Quibbling answered 2/2, 2023 at 13:26 Comment(7)
Hi @0x26res. Thank you for taking the time to explain this to me. If I use mydf.iterrows(), as in for row in mydf.iterrows(named=True): row['Name'], I get the error Traceback (most recent call last): File "<stdin>", line 2, in <module> TypeError: tuple indices must be integers or slices, not str - Staggs
I am also happy to achieve the same result via a columnar method, but have not seen much documentation around this particular type of iteration in Polars. Thank you again for your help. - Staggs
As mentioned, previous versions returned a namedtuple instead of a dict. Try row.Name - Quibbling
@JohnSmith iterating through rows isn't documented well in polars because it's highly discouraged, as it basically circumvents all the optimization. It's as if you bought a Ferrari and then asked how to drive it really slowly and quietly. What result are you ultimately after? - Garfield
Hi @DeanMacGregor, I have a table which has names and dates. Each row has variables that I "inject" into the SQL, which in the case above would generate 3 SQL statements and 3 reports. I don't have access to create stored procedures or to run everything in one go, due to spool space issues. Thank you for your question. - Staggs
"overall it's not recommended to iterate through the data this way" - which way is recommended then? - Slocum
It depends on what your use case is, but it's best to use polars expressions and built-in functions, for performance and maintainability. - Quibbling
L
1

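You can use select for that, then iterate over the resulting column. Iterating over the DataFrame itself yields whole columns, not rows, so convert the selection to a Series first:

names = df.select("Name").to_series()
for name in names:
    print(name)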
Lindbom answered 2/2, 2023 at 13:23 Comment(1)
Hi @Kien Truong. Thank you very much for your quick reply. I used name as an example. I would actually want to get both the date and the name, as these items are then sent to an SQL statement in the real code. Each row in this table would generate a separate SQL statement. - Staggs
M
1

In polars, pl.DataFrame.iter_rows with named=True should be preferred over pl.DataFrame.rows as the latter materialises all frame data as a list of rows, which is potentially expensive.

import polars as pl


df = pl.DataFrame({
    "start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
    "name": ["John", "Joe", "James"]
})

for row in df.iter_rows(named=True):
    print(row)

This prints:

{'start_date': '2020-01-02', 'name': 'John'}
{'start_date': '2020-01-03', 'name': 'Joe'}
{'start_date': '2020-01-04', 'name': 'James'}
Mcniel answered 5/5 at 11:45 Comment(0)
