From tuples to multiple columns in pandas

Asked 28/8, 2014 at 22:49 Answered 15/9, 2023 at 23:18

How do I convert this dataframe

                                          location  value                       
0                   (Richmond, Virginia, nan, USA)    100                       
1              (New York City, New York, nan, USA)    200

to this:

    city            state       region    country   value
0   Richmond        Virginia    nan       USA       100
1   New York City   New York    nan       USA       200

Note that the location column in the first dataframe contains tuples. I want to create four columns out of the location column.

Brade answered 28/8, 2014 at 22:49 Comment(0)

new_col_list = ['city','state','regions','country']
for n,col in enumerate(new_col_list):
    df[col] = df['location'].apply(lambda location: location[n])

df = df.drop('location',axis=1)

Benfield answered 28/8, 2014 at 23:27 Comment(0)

If you return a Series of the (split) location, you can merge (join to merge on index) the resulting DF directly with your value column.

addr = ['city', 'state', 'region', 'country']
df[['value']].join(df.location.apply(lambda loc: Series(loc, index=addr)))

   value           city     state  region country
0    100       Richmond  Virginia     NaN     USA
1    200  New York City  New York     NaN     USA

Rosebud answered 28/8, 2014 at 23:39 Comment(4)

Thanks. @exp1orer's solution worked, so I didn't have to try yours. – Brade 29/8, 2014 at 0:23

That is nicely done! If I'm not mistaken, it only works if df has index that is range(len(df)), right? – Benfield 29/8, 2014 at 5:10

@Benfield Not quite. join is shorthand for merging on index with both frames, so the indices need only be consistent (which it will be here as the apply and col selection don't affect it). I'll edit the answer. – Rosebud 29/8, 2014 at 6:37

Should have thought about this earlier, but worth noting that this is much slower than @exp1orer's solution on a largish frame due to all of the Series being created. – Rosebud 29/8, 2014 at 7:11

I haven't timed this, but I would suggest this option:

df.loc[:,'city']=df.location.map(lambda x:x[0])
df.loc[:,'state']=df.location.map(lambda x:x[1])
df.loc[:,'regions']=df.location.map(lambda x:x[2])
df.loc[:,'country']=df.location.map(lambda x:x[3])

I'm guessing avoiding explicit for loop might lend itself to a SIMD instruction (certainly numpy looks for that, but perhaps not other libraries)

Concerted answered 4/12, 2019 at 19:31 Comment(0)

I prefer to use pd.DataFrame.from_records to convert the tuples to Series. Then this can be joined to the previous dataset as described by meloncholy.

df = pd.DataFrame({"location":[("Richmond", "Virginia", pd.NA, "USA"),
                               ("New York City", "New York", pd.NA, "USA")],
                   "value": [100,200]})

loc = pd.DataFrame.from_records(df.location, columns=['city','state','regions','country'])
df.drop("location", axis=1).join(loc)

from_records does assume a sequential index. If this is not the case you should pass the index to the new DataFrame:

loc = pd.DataFrame.from_records(df.location.reset_index(drop=True),
                                columns=['city','state','regions','country'],
                                index=df.index)

Sensor answered 9/6, 2022 at 16:12 Comment(0)

Another option when returning a Series (listed previously) is to use the argument result_tpye='expand' of the apply method (see pandas.DataFrame.apply).

Code below also use join to display the value column as the last column.

def split_location(df):
    cols = ["city", "state", "region", "country"]
    return pd.Series(df.location, index=cols)

(df
 .apply(split_location, axis = 1)
 .join(df.value)
)

    city            state       region  country value
0   Richmond        Virginia    nan     USA     100
1   New York City   New York    nan     USA     200

Hypabyssal answered 15/9, 2023 at 23:18 Comment(0)

Recommended topics

Hot tags