Is there an efficient way to include every remaining unselected column in a python-polars select?

T

2

5

I'm trying to reorder the columns in a Polars dataframe and put 5 columns out of 100 first (the document must unfortunately be somewhat readable in excel). I can't seem to find an easy way to do this.

Ideally, I'd like something simple like

df.select(
    'col2',
    'col1',
    r'^.*$',  # the rest of the columns, but this throws a duplicate column name error
)

Negative lookahead is not supported so it's not possible to make a regex that excludes my selected columns. I could make two overlapping selections, drop the columns from one selection, and then join them, but this does not seem like it would be the intended way to do this. Every other solution I've found involves explicitly naming every single column, which I'm trying to avoid as the columns get added or change names somewhat frequently.

Thenar answered 22/8, 2024 at 16:33 Comment(0)

U

2

You can combine pl.exclude with the walrus operator.

Suppose you have something like

df=pl.DataFrame(
    [
        pl.Series('c', [1, 2, 3], dtype=pl.Int64),
        pl.Series('b', [2, 3, 4], dtype=pl.Int64),
        pl.Series('fcvem', [4, 5, 6], dtype=pl.Int64),
        pl.Series('msoy', [4, 5, 6], dtype=pl.Int64),
        pl.Series('smrn', [4, 5, 6], dtype=pl.Int64),
        pl.Series('z', [4, 5, 6], dtype=pl.Int64),
        pl.Series('wxous', [4, 5, 6], dtype=pl.Int64),
        pl.Series('uusn', [4, 5, 6], dtype=pl.Int64),
        pl.Series('ydj', [4, 5, 6], dtype=pl.Int64),
        pl.Series('squr', [4, 5, 6], dtype=pl.Int64),
        pl.Series('yyx', [4, 5, 6], dtype=pl.Int64),
        pl.Series('nl', [4, 5, 6], dtype=pl.Int64),
        pl.Series('a', [0, 1, 2], dtype=pl.Int64),
    ]
)

and you want the first 3 columns to be 'a', 'b', 'c'. You can do:

df.select(*(start_cols:=['a','b','c']), pl.exclude(start_cols))

This creates a list called start_cols which contains 'a','b','c'. The asterisk unwraps the list and then pl.exclude uses the contents of start_cols to tell polars to return everything except start_cols.

If you prefer, you could do this syntax instead:

df.select((start_cols:=['a','b','c'])+ [pl.exclude(start_cols)])

Ummersen answered 22/8, 2024 at 17:2 Comment(1)

Thanks! This is very similar to what I ended up doing with help from @jqurious, which was just to make it it's own list COLUMNS = ['col2', 'col1'] df = df.select(pl.col(COLUMNS), pl.exclude(COLUMNS)) Although your solution is more elegant, it's useful for me to have this list elsewhere. – Thenar 22/8, 2024 at 17:8

H

4

It involves naming the columns twice, but you can exclude them:

pl.exclude()

df.select('col2', 'col1', pl.exclude('col1', 'col2'))

You could store the names in a variable to avoid the duplication.

Example:

df = pl.DataFrame(schema=list('abcdefg'))

cols = 'g', 'e'

df.select(*cols, pl.exclude(cols))

shape: (0, 7)
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ g    ┆ e    ┆ a    ┆ b    ┆ c    ┆ d    ┆ f    │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ null ┆ null ┆ null ┆ null ┆ null ┆ null ┆ null │
╞══════╪══════╪══════╪══════╪══════╪══════╪══════╡
└──────┴──────┴──────┴──────┴──────┴──────┴──────┘

There is an open feature request to allow df.select('col2', 'col1', ...)

https://github.com/pola-rs/polars/issues/12067

Hellbent answered 22/8, 2024 at 16:35 Comment(0)

U

2