Combine two columns of text in pandas dataframe
Asked Answered
K

22

971

I have a dataframe that looks like

Year  quarter
2000       q2
2001       q3

How do I add a new column by combining these columns to get the following dataframe?

Year  quarter  period
2000       q2  2000q2
2001       q3  2001q3
Kasiekask answered 15/10, 2013 at 9:42 Comment(1)
Searchers: here's a similar question with more answersMideast
S
1226

If both columns are strings, you can concatenate them directly:

df["period"] = df["Year"] + df["quarter"]

If one (or both) of the columns are not string typed, you should convert it (them) first,

df["period"] = df["Year"].astype(str) + df["quarter"]

Beware of NaNs when doing this!


If you need to join multiple string columns, you can use agg:

df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)

Where "-" is the separator.

Stoltzfus answered 15/10, 2013 at 10:9 Comment(18)
Is it possible to add multiple columns together without typing out all the columns? Let's say add(dataframe.iloc[:, 0:10]) for example?Hurtful
@Hurtful That should be possible with the Python builtin sum.Stoltzfus
@Stoltzfus could you please make an example for adding multiple columns? Thank youPerverse
Be careful, you need to apply map(str) to all columns that are not string in the first place. if quarter was a number you would do dataframe["period"] = dataframe["Year"].map(str) + dataframe["quarter"].map(str) map is just applying string conversion to all entries.Goodhumored
@OzgurOzturk Based on the OP's example, it seems that the data in quarter are already strings. If not you need to convert of course.Stoltzfus
Also facilitates the easy addition of a separator. For example the separator '__': dataframe["period"] = dataframe["Year"].map(str) + '__' + dataframe["quarter"]Examine
This solution can create problems iy you have nan values, e carefulHandfasting
this can create problems with text which needs utf-8 encodingHannon
I'm getting the SettingWithCopyWarning when I use this solution - how can I do this without triggering that warning?Drawl
This solution is much faster than using the apply function.Mullens
Not sure which version of python this would work for, but this didn't work. It created a list of the joined list in every row cell of the new column.Seignior
In general, astype should be used for casting pandas.Series to another type. So instead of .map(str), use .astype(str).Afteryears
what to of NaNs? Is there a solution to ignore them?Outhe
If both the column are string then df['year_qtr'] = df['year'].str.cat(df['qtr'],sep='_')Boozer
Instead of .agg... use .apply(lambda x: '_'.join(x), axis=1). This was 5 times faster than the .agg version.Christcrossrow
@Perverse see here https://mcmap.net/q/14030/-combine-two-columns-of-text-in-pandas-dataframeSelfliquidating
^^ and you can combine it with str convertig: .apply(lambda x: '__'.join(x.astype(str)), axis=1)Philhellene
If there are NAs: .fillna('').agg('-'.join, axis=1).Oneidaoneil
R
408

Small data-sets (< 150rows)

[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]

or slightly slower but more compact:

df.Year.str.cat(df.quarter)

Larger data sets (> 150rows)

df['Year'].astype(str) + df['quarter']

UPDATE: Timing graph Pandas 0.23.4

enter image description here

Let's test it on 200K rows DF:

In [250]: df
Out[250]:
   Year quarter
0  2014      q1
1  2015      q2

In [251]: df = pd.concat([df] * 10**5)

In [252]: df.shape
Out[252]: (200000, 2)

UPDATE: new timings using Pandas 0.19.0

Timing without CPU/GPU optimization (sorted from fastest to slowest):

In [107]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 131 ms per loop

In [106]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 161 ms per loop

In [108]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop

In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop

In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop

In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 24.7 s per loop

Timing using CPU/GPU optimization:

In [113]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 53.3 ms per loop

In [114]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 65.5 ms per loop

In [115]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop

In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 9.38 s per loop

Answer contribution by @anton-vbr

Retract answered 28/4, 2016 at 10:2 Comment(16)
What difference between 261 and 264 in your timing?Souther
@AntonProtopopov apparently 100ms out of nowhere :)Epochal
@AntonProtopopov, i guess it's a mixture of two timings - one used CPU/GPU optimization, another one didn't. I've updated my answer and put both timing sets there...Retract
This use of .sum() fails If all columns look like they could be integers (ie are string forms of integers). Instead, it seems pandas converts them back to numeric before summing!Serpens
@CPBL, try this approach: df.T.apply(lambda x: x.str.cat(sep=''))Retract
@MaxU How did you go about the CPU/GPU optimization? Is that just a more powerful computer or is it something you did with code?Neuberger
@user3374113, it has nothing to do with the code. My notebook has a switch which switches on GPU and enables Intel Turbo Boost TechnologyRetract
@MaxU, have you tried comparing the .str.cat() method?Moonfish
Late to the party. But I'm confused as to how df['Year'].map(str) works. Does str serve as an arg and it basically just sets everything in the df['Year'] as string? Thanks.Claraclarabella
@BowenLiu, exactly, it does the same thing as df['Year'].astype(str)Retract
And df['period'] = [''.join(x) for x in zip(df['Year'], df['quarter'])] ?Siderosis
@MaxU - I've found that apply(str) is typically faster than astype(str), would be interesting to see how that stacks up here.Assembler
@MaxU could you help the following question stackoverflow.com/questions/54865312/…Denisdenise
I'm curious if there is an easy way to make any of these answers amenable to providing an arbitrary list of column names.Tecla
@jflournoy, pardon, I’m not sure I understood your question. Could you please explain what do you want to achieve?Retract
I'm curious how you would implement the best performers, [111] and [107] above, in a function that takes as input a list of arbitrary length. I guess it's obvious one could loop through the list to iteratively build up the full concatenated strings, but thought there might be a better way.Tecla
K
326
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)

Yields this dataframe

   Year quarter  period
0  2014      q1  2014q1
1  2015      q2  2015q2

This method generalizes to an arbitrary number of string columns by replacing df[['Year', 'quarter']] with any column slice of your dataframe, e.g. df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1).

You can check more information about apply() method here

Kablesh answered 11/9, 2015 at 17:36 Comment(11)
lambda x: ''.join(x) is just ''.join, no?Bethought
@Bethought no. here we are taking each row of the df[['Year', 'quarter']] and passing it as series to join, and join concatenates the elements in the series.Goodhumored
@OzgurOzturk: the point is that the lambda part of the lambda x: ''.join(x) construction doesn't do anything; it's like using lambda x: sum(x) instead of just sum.Bethought
@Russ, if you wanted to join these strings based on some criteria i.e. join if string != "some_string" how could you add that?Often
Confirmed same result when using ''.join, i.e.: df['period'] = df[['Year', 'quarter']].apply(''.join, axis=1).Steady
This solution gives me a TypeError: ('sequence item 0: expected str instance, int found', as I want to combine an integer categorical variable with a string category.Cacogenics
@Cacogenics join takes only str instances in an iterable. Use a map to convert them all into str and then use join.Sewn
This solution ended up being reasonably slower for me than using the df['Foo'].map(function) + '-' + df['Bar'].map(function) approach. For ~48k rows applying a join along axis 1 took 1.15s whereas the other approach takes 100ms. Not unreasonable on a small dataset, but if you are working with a lot of data that can make a big difference.Sapphirine
'-'.join(x.map(str))Door
Similarly df['period'] = [''.join(x) for x in zip(df['Year'], df['quarter'])] works too which may be faster.Siderosis
This is the best and most versatile answer, given that the columns to be joined may need to be dynamic and not static and this solution makes that possible.Peekaboo
M
201

The method cat() of the .str accessor works really well for this:

>>> import pandas as pd
>>> df = pd.DataFrame([["2014", "q1"], 
...                    ["2015", "q3"]],
...                   columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014      q1
1  2015      q3
>>> df['Period'] = df.Year.str.cat(df.Quarter)
>>> print(df)
   Year Quarter  Period
0  2014      q1  2014q1
1  2015      q3  2015q3

cat() even allows you to add a separator so, for example, suppose you only have integers for year and period, you can do this:

>>> import pandas as pd
>>> df = pd.DataFrame([[2014, 1],
...                    [2015, 3]],
...                   columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014       1
1  2015       3
>>> df['Period'] = df.Year.astype(str).str.cat(df.Quarter.astype(str), sep='q')
>>> print(df)
   Year Quarter  Period
0  2014       1  2014q1
1  2015       3  2015q3

Joining multiple columns is just a matter of passing either a list of series or a dataframe containing all but the first column as a parameter to str.cat() invoked on the first column (Series):

>>> df = pd.DataFrame(
...     [['USA', 'Nevada', 'Las Vegas'],
...      ['Brazil', 'Pernambuco', 'Recife']],
...     columns=['Country', 'State', 'City'],
... )
>>> df['AllTogether'] = df['Country'].str.cat(df[['State', 'City']], sep=' - ')
>>> print(df)
  Country       State       City                   AllTogether
0     USA      Nevada  Las Vegas      USA - Nevada - Las Vegas
1  Brazil  Pernambuco     Recife  Brazil - Pernambuco - Recife

Do note that if your pandas dataframe/series has null values, you need to include the parameter na_rep to replace the NaN values with a string, otherwise the combined column will default to NaN.

Moonfish answered 7/3, 2016 at 18:4 Comment(5)
This seems way better (maybe more efficient, too) than lambda or map; also it just reads most cleanly.Venable
Which version of pandas are you using? I get ValueError: Did you mean to supply a sep keyword? in pandas-0.23.4. Thanks!Ajaajaccio
@QinqingLiu, I retested these with pandas-0.23.4 and they seem work. The sep parameter is only necessary if you intend to separate the parts of the concatenated string. If you get an error, please show us your failing example.Moonfish
@Moonfish can i do a newline instead of '-' with sep keyword?Swindell
@arun-menon: I don't see why not. In the last example above you could do .str.cat(df[['State', 'City']], sep ='\n'), for example. I haven't tested it yet, though.Moonfish
L
44

Use of a lamba function this time with string.format().

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': ['q1', 'q2']})
print df
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
print df

  Quarter  Year
0      q1  2014
1      q2  2015
  Quarter  Year YearQuarter
0      q1  2014      2014q1
1      q2  2015      2015q2

This allows you to work with non-strings and reformat values as needed.

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': [1, 2]})
print df.dtypes
print df

df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}q{}'.format(x[0],x[1]), axis=1)
print df

Quarter     int64
Year       object
dtype: object
   Quarter  Year
0        1  2014
1        2  2015
   Quarter  Year YearQuarter
0        1  2014      2014q1
1        2  2015      2015q2
Liven answered 16/3, 2016 at 16:43 Comment(2)
Much quicker: .apply(''.join(x), axis=1)Ganef
This solution worked great for my needs since I had to do some formatting. df_game['formatted_game_time'] = df_game[['wday', 'month', 'day', 'year', 'time']].apply(lambda x: '{}, {}/{}/{} @ {}'.format(x[0], x[1], x[2], x[3], x[4]), axis=1)Volney
S
24

generalising to multiple columns, why not:

columns = ['whatever', 'columns', 'you', 'choose']
df['period'] = df[columns].astype(str).sum(axis=1)
Selfliquidating answered 30/7, 2019 at 10:38 Comment(2)
Looks cool but what if I want to add a delimiter between the strings, like '-'?Roz
@Roz maybe create a delimiter column?Mitman
Y
19

You can use lambda:

combine_lambda = lambda x: '{}{}'.format(x.Year, x.quarter)

And then use it with creating the new column:

df['period'] = df.apply(combine_lambda, axis = 1)
Yoruba answered 28/2, 2021 at 16:25 Comment(0)
E
15

Let us suppose your dataframe is df with columns Year and Quarter.

import pandas as pd
df = pd.DataFrame({'Quarter':'q1 q2 q3 q4'.split(), 'Year':'2000'})

Suppose we want to see the dataframe;

df
>>>  Quarter    Year
   0    q1      2000
   1    q2      2000
   2    q3      2000
   3    q4      2000

Finally, concatenate the Year and the Quarter as follows.

df['Period'] = df['Year'] + ' ' + df['Quarter']

You can now print df to see the resulting dataframe.

df
>>>  Quarter    Year    Period
    0   q1      2000    2000 q1
    1   q2      2000    2000 q2
    2   q3      2000    2000 q3
    3   q4      2000    2000 q4

If you do not want the space between the year and quarter, simply remove it by doing;

df['Period'] = df['Year'] + df['Quarter']
Environs answered 22/7, 2018 at 5:20 Comment(5)
Specified as strings df['Period'] = df['Year'].map(str) + df['Quarter'].map(str)Exorcism
I'm getting TypeError: Series cannot perform the operation + when I run either df2['filename'] = df2['job_number'] + '.' + df2['task_number'] or df2['filename'] = df2['job_number'].map(str) + '.' + df2['task_number'].map(str).Xanthate
However, df2['filename'] = df2['job_number'].astype(str) + '.' + df2['task_number'].astype(str) did work.Xanthate
@KarlBaker, I think you did not have strings in your input. But I am glad you figured that out. If you look at the example dataframe that I created above, you will see that all the columns are strings.Environs
What exactly is the point of this solution, since it's identical to the top answer?Sexcentenary
S
14

Although the @silvado answer is good if you change df.map(str) to df.astype(str) it will be faster:

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

In [131]: %timeit df["Year"].map(str)
10000 loops, best of 3: 132 us per loop

In [132]: %timeit df["Year"].astype(str)
10000 loops, best of 3: 82.2 us per loop
Souther answered 25/11, 2015 at 10:25 Comment(0)
P
13

Here is an implementation that I find very versatile:

In [1]: import pandas as pd 

In [2]: df = pd.DataFrame([[0, 'the', 'quick', 'brown'],
   ...:                    [1, 'fox', 'jumps', 'over'], 
   ...:                    [2, 'the', 'lazy', 'dog']],
   ...:                   columns=['c0', 'c1', 'c2', 'c3'])

In [3]: def str_join(df, sep, *cols):
   ...:     from functools import reduce
   ...:     return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep), 
   ...:                   [df[col] for col in cols])
   ...: 

In [4]: df['cat'] = str_join(df, '-', 'c0', 'c1', 'c2', 'c3')

In [5]: df
Out[5]: 
   c0   c1     c2     c3                cat
0   0  the  quick  brown  0-the-quick-brown
1   1  fox  jumps   over   1-fox-jumps-over
2   2  the   lazy    dog     2-the-lazy-dog
Pahl answered 3/4, 2017 at 17:5 Comment(1)
FYI: This method works great with Python 3, but gives me trouble in Python 2.Bruton
D
12

more efficient is

def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)

and here is a time test:

import numpy as np
import pandas as pd

from time import time


def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)


def concat_df_str2(df):
    """ run time: 5.2758s """
    return df.astype(str).sum(axis=1)


def concat_df_str3(df):
    """ run time: 5.0076s """
    df = df.astype(str)
    return df[0] + df[1] + df[2] + df[3] + df[4] + \
           df[5] + df[6] + df[7] + df[8] + df[9]


def concat_df_str4(df):
    """ run time: 7.8624s """
    return df.astype(str).apply(lambda x: ''.join(x), axis=1)


def main():
    df = pd.DataFrame(np.zeros(1000000).reshape(100000, 10))
    df = df.astype(int)

    time1 = time()
    df_en = concat_df_str4(df)
    print('run time: %.4fs' % (time() - time1))
    print(df_en.head(10))


if __name__ == '__main__':
    main()

final, when sum(concat_df_str2) is used, the result is not simply concat, it will trans to integer.

Drus answered 9/1, 2018 at 2:13 Comment(1)
+1 Neat solution, this also allows us to specify the columns: e.g. df.values[:, 0:3] or df.values[:, [0,2]].Healy
B
7

Using zip could be even quicker:

df["period"] = [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]

Graph:

enter image description here

import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
from collections import defaultdict

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

myfuncs = {
"df['Year'].astype(str) + df['quarter']":
    lambda: df['Year'].astype(str) + df['quarter'],
"df['Year'].map(str) + df['quarter']":
    lambda: df['Year'].map(str) + df['quarter'],
"df.Year.str.cat(df.quarter)":
    lambda: df.Year.str.cat(df.quarter),
"df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)":
    lambda: df.loc[:, ['Year','quarter']].astype(str).sum(axis=1),
"df[['Year','quarter']].astype(str).sum(axis=1)":
    lambda: df[['Year','quarter']].astype(str).sum(axis=1),
    "df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)":
    lambda: df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1),
    "[''.join(i) for i in zip(dataframe['Year'].map(str),dataframe['quarter'])]":
    lambda: [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
}

d = defaultdict(dict)
step = 10
cont = True
while cont:
    lendf = len(df); print(lendf)
    for k,v in myfuncs.items():
        iters = 1
        t = 0
        while t < 0.2:
            ts = timeit.repeat(v, number=iters, repeat=3)
            t = min(ts)
            iters *= 10
        d[k][lendf] = t/iters
        if t > 2: cont = False
    df = pd.concat([df]*step)

pd.DataFrame(d).plot().legend(loc='upper center', bbox_to_anchor=(0.5, -0.15))
plt.yscale('log'); plt.xscale('log'); plt.ylabel('seconds'); plt.xlabel('df rows')
plt.show()
Boyce answered 13/5, 2018 at 14:33 Comment(0)
K
7

This solution uses an intermediate step compressing two columns of the DataFrame to a single column containing a list of the values. This works not only for strings but for all kind of column-dtypes

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['list']=df[['Year','quarter']].values.tolist()
df['period']=df['list'].apply(''.join)
print(df)

Result:

   Year quarter        list  period
0  2014      q1  [2014, q1]  2014q1
1  2015      q2  [2015, q2]  2015q2
Katekatee answered 15/3, 2019 at 16:37 Comment(4)
looks like other dtypes won't work. I got a TypeError: sequence item 1: expected str instance, float foundTwitt
apply first a cast to string. The join operation works only for stringsKatekatee
This solution won't work to combine two columns with different dtype, see my answer for the correct solution for such case.Parget
Instead of .apply(''.join) why not use .str.join('')?Furlani
P
7

Here is my summary of the above solutions to concatenate / combine two columns with int and str value into a new column, using a separator between the values of columns. Three solutions work for this purpose.

# be cautious about the separator, some symbols may cause "SyntaxError: EOL while scanning string literal".
# e.g. ";;" as separator would raise the SyntaxError

separator = "&&" 

# pd.Series.str.cat() method does not work to concatenate / combine two columns with int value and str value. This would raise "AttributeError: Can only use .cat accessor with a 'category' dtype"

df["period"] = df["Year"].map(str) + separator + df["quarter"]
df["period"] = df[['Year','quarter']].apply(lambda x : '{} && {}'.format(x[0],x[1]), axis=1)
df["period"] = df.apply(lambda x: f'{x["Year"]} && {x["quarter"]}', axis=1)
Parget answered 16/5, 2019 at 13:19 Comment(1)
At least your first solution does not work (any more?). I use: df["period"] = (df["Year"].astype(str) + separator + df["quarter"].astype(str)).astype('category')Flocculent
B
5

my take....

listofcols = ['col1','col2','col3']
df['combined_cols'] = ''

for column in listofcols:
    df['combined_cols'] = df['combined_cols'] + ' ' + df[column]
'''
Bugleweed answered 18/8, 2020 at 4:13 Comment(1)
You should add an explanation to this code snippet. Adding only code answers encourages people to use code they don't understand and doesn't help them learn.Rarebit
F
3

When combining columns with strings by concatenating them using the addition operator + if any is NaN then entire output will be NaN so use fillna()

df["join"] = "some" + df["col"].fillna(df["val_if_nan"])
Foodstuff answered 7/1, 2023 at 14:12 Comment(0)
S
2

As many have mentioned previously, you must convert each column to string and then use the plus operator to combine two string columns. You can get a large performance improvement by using NumPy.

%timeit df['Year'].values.astype(str) + df.quarter
71.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['Year'].astype(str) + df['quarter']
565 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Siddra answered 25/10, 2017 at 3:21 Comment(2)
I'd like to use the numpyified version but I'm getting an error: Input: df2['filename'] = df2['job_number'].values.astype(str) + '.' + df2['task_number'].values.astype(str) --> Output: TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U21') dtype('<U21') dtype('<U21'). Both job_number and task_number are ints.Xanthate
That's because you are combining two numpy arrays. It works if you combine an numpy array with pandas Series. as df['Year'].values.astype(str) + df.quarterMalleable
N
2

One can use assign method of DataFrame:

df= (pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']}).
  assign(period=lambda x: x.Year+x.quarter ))
Nestle answered 1/12, 2018 at 10:55 Comment(0)
C
1

Similar to @geher answer but with any separator you like:

SEP = " "
INPUT_COLUMNS_WITH_SEP = ",sep,".join(INPUT_COLUMNS).split(",")

df.assign(sep=SEP)[INPUT_COLUMNS_WITH_SEP].sum(axis=1)
Complain answered 4/12, 2021 at 12:43 Comment(0)
L
0
def madd(x):
    """Performs element-wise string concatenation with multiple input arrays.

    Args:
        x: iterable of np.array.

    Returns: np.array.
    """
    for i, arr in enumerate(x):
        if type(arr.item(0)) is not str:
            x[i] = x[i].astype(str)
    return reduce(np.core.defchararray.add, x)

For example:

data = list(zip([2000]*4, ['q1', 'q2', 'q3', 'q4']))
df = pd.DataFrame(data=data, columns=['Year', 'quarter'])
df['period'] = madd([df[col].values for col in ['Year', 'quarter']])

df

    Year    quarter period
0   2000    q1  2000q1
1   2000    q2  2000q2
2   2000    q3  2000q3
3   2000    q4  2000q4
Lightfooted answered 21/7, 2017 at 19:38 Comment(2)
NameError: name 'reduce' is not definedButcher
from functools import reducePoitiers
C
0

Use .combine_first.

df['Period'] = df['Year'].combine_first(df['Quarter'])
Candiecandied answered 10/2, 2018 at 4:1 Comment(1)
This is not correct. .combine_first will result in either the value from 'Year' being stored in 'Period', or, if it is Null, the value from 'Quarter'. It will not concatenate the two strings and store them in 'Period'.Festoon
H
0

DataFrame.eval()

For a little terse code, we can use .eval(). We can concatenate two (or more) string dtype columns horizontally using the + operator as follows.

df = pd.DataFrame({'A': ['x', 'y', 'z'], 'B': ['1', '2', '3']}, dtype='string')
df['C'] = df.eval("A + B")

You can even include the new column assignment inside the evaluated expression (which also opens up the possibility to do it in-place).

df = df.eval('C = A + B')
df.eval('C = A + B', inplace=True)

eval doesn't allow a similarly terse way to add delimiters; however, we can call str.cat() (which has the sep= kwarg) inside the numerical expression.

df = df.eval("D = A.str.cat([B, C], '_')")

which produces the following output (where columns, A, B and C are concatenated horizontally):

result

When to use vectorized concat vs explicit loop

There are major string concatenation methods given on this page:

  • vectorized +: df['A'] + df['B']
  • string formatting in a loop (N.B. converting the columns into lists makes the loop faster):
    [f"{x}{y}" for x,y in zip(df['A'].tolist(), df['B'].tolist())])]
    
  • vectorized str.cat(): df['A'].str.cat(df['B'])

As the following figure shows, vectorized concatenation (via +) is fastest if the strings being concatenated are short such as in the OP. However, if the strings are long (e.g. each cell contains a tweet or a book excerpt), then an explicit Python loop (use f-string in a list comprehension) is the fastest.

perfplot

Code to reproduce the above figure:

import matplotlib.pyplot as plt
import pandas as pd
import perfplot

fig, axs = plt.subplots(1, 2, figsize=(15,5))
for ax, s, title in zip(axs, ('a'*1000, 'a'), ("long", "short")):
    plt.sca(ax)
    perfplot.plot(
        kernels=[lambda df: df.assign(C=df['A'] + '_' + df['B']),
                 lambda df: df.assign(C=df['A'].str.cat(df['B'], '_')),
                 lambda df: df.assign(C=[f"{x}_{y}" for x,y in zip(df['A'].tolist(), df['B'].tolist())])],
        n_range=[2**k for k in range(18)],
        setup=lambda n: pd.DataFrame({'A': [s]*n, 'B': [s]*n}),
        labels=["df['A'] + df['B']", "str.cat", "list comp"],
        xlabel="DataFrame length",
        title=f"When the strings are {title}", 
        equality_check=pd.DataFrame.equals)
fig.tight_layout()
fig.savefig("string_concat_perf.png")
Horseleech answered 6/12, 2023 at 3:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.