Combine two columns of text in pandas dataframe

Asked 15/10, 2013 at 9:42 Answered 6/12, 2023 at 3:43

python pandas string dataframe string-concatenation

971

I have a dataframe that looks like

Year  quarter
2000       q2
2001       q3

How do I add a new column by combining these columns to get the following dataframe?

Year  quarter  period
2000       q2  2000q2
2001       q3  2001q3

Kasiekask answered 15/10, 2013 at 9:42 Comment(1)

Searchers: here's a similar question with more answers – Mideast 18/10, 2022 at 19:40

1226

If both columns are strings, you can concatenate them directly:

df["period"] = df["Year"] + df["quarter"]

If one (or both) of the columns are not string typed, you should convert it (them) first,

df["period"] = df["Year"].astype(str) + df["quarter"]

Beware of NaNs when doing this!

If you need to join multiple string columns, you can use agg:

df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)

Where "-" is the separator.

Stoltzfus answered 15/10, 2013 at 10:9 Comment(18)

Is it possible to add multiple columns together without typing out all the columns? Let's say add(dataframe.iloc[:, 0:10]) for example? – Hurtful 9/5, 2015 at 19:15

@Hurtful That should be possible with the Python builtin sum. – Stoltzfus 11/5, 2015 at 11:6

@Stoltzfus could you please make an example for adding multiple columns? Thank you – Perverse 25/10, 2016 at 16:45

Be careful, you need to apply map(str) to all columns that are not string in the first place. if quarter was a number you would do dataframe["period"] = dataframe["Year"].map(str) + dataframe["quarter"].map(str) map is just applying string conversion to all entries. – Goodhumored 1/2, 2017 at 21:17

@OzgurOzturk Based on the OP's example, it seems that the data in quarter are already strings. If not you need to convert of course. – Stoltzfus 2/2, 2017 at 12:27

Also facilitates the easy addition of a separator. For example the separator '__': dataframe["period"] = dataframe["Year"].map(str) + '__' + dataframe["quarter"] – Examine 28/3, 2017 at 13:43

This solution can create problems iy you have nan values, e careful – Handfasting 27/12, 2017 at 17:14

this can create problems with text which needs utf-8 encoding – Hannon 28/6, 2018 at 18:49

I'm getting the SettingWithCopyWarning when I use this solution - how can I do this without triggering that warning? – Drawl 13/7, 2018 at 18:55

This solution is much faster than using the apply function. – Mullens 22/7, 2018 at 5:28

Not sure which version of python this would work for, but this didn't work. It created a list of the joined list in every row cell of the new column. – Seignior 25/10, 2018 at 16:40

In general, astype should be used for casting pandas.Series to another type. So instead of .map(str), use .astype(str). – Afteryears 7/5, 2019 at 18:59

what to of NaNs? Is there a solution to ignore them? – Outhe 16/4, 2020 at 18:54

If both the column are string then df['year_qtr'] = df['year'].str.cat(df['qtr'],sep='_') – Boozer 23/9, 2020 at 13:56

Instead of .agg... use .apply(lambda x: '_'.join(x), axis=1). This was 5 times faster than the .agg version. – Christcrossrow 6/2, 2021 at 11:2

@Perverse see here https://mcmap.net/q/14030/-combine-two-columns-of-text-in-pandas-dataframe – Selfliquidating 1/6, 2021 at 12:51

^^ and you can combine it with str convertig: .apply(lambda x: '__'.join(x.astype(str)), axis=1) – Philhellene 13/5, 2022 at 10:18

If there are NAs: .fillna('').agg('-'.join, axis=1). – Oneidaoneil 28/2 at 15:6

408

Small data-sets (< 150rows)

[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]

or slightly slower but more compact:

df.Year.str.cat(df.quarter)

Larger data sets (> 150rows)

df['Year'].astype(str) + df['quarter']

UPDATE: Timing graph Pandas 0.23.4

Let's test it on 200K rows DF:

In [250]: df
Out[250]:
   Year quarter
0  2014      q1
1  2015      q2

In [251]: df = pd.concat([df] * 10**5)

In [252]: df.shape
Out[252]: (200000, 2)

UPDATE: new timings using Pandas 0.19.0

Timing without CPU/GPU optimization (sorted from fastest to slowest):

In [107]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 131 ms per loop

In [106]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 161 ms per loop

In [108]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop

In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop

In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop

In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 24.7 s per loop

Timing using CPU/GPU optimization:

In [113]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 53.3 ms per loop

In [114]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 65.5 ms per loop

In [115]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop

In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 9.38 s per loop

Answer contribution by @anton-vbr

Retract answered 28/4, 2016 at 10:2 Comment(16)

What difference between 261 and 264 in your timing? – Souther 21/5, 2016 at 19:57

@AntonProtopopov apparently 100ms out of nowhere :) – Epochal 10/10, 2016 at 17:30

@AntonProtopopov, i guess it's a mixture of two timings - one used CPU/GPU optimization, another one didn't. I've updated my answer and put both timing sets there... – Retract 10/10, 2016 at 17:45

This use of .sum() fails If all columns look like they could be integers (ie are string forms of integers). Instead, it seems pandas converts them back to numeric before summing! – Serpens 25/5, 2017 at 13:6

@CPBL, try this approach: df.T.apply(lambda x: x.str.cat(sep='')) – Retract 26/5, 2017 at 7:38

@MaxU How did you go about the CPU/GPU optimization? Is that just a more powerful computer or is it something you did with code? – Neuberger 7/7, 2017 at 11:30

@user3374113, it has nothing to do with the code. My notebook has a switch which switches on GPU and enables Intel Turbo Boost Technology – Retract 10/7, 2017 at 15:11

@MaxU, have you tried comparing the .str.cat() method? – Moonfish 23/7, 2018 at 21:22

Late to the party. But I'm confused as to how df['Year'].map(str) works. Does str serve as an arg and it basically just sets everything in the df['Year'] as string? Thanks. – Claraclarabella 18/10, 2018 at 18:12

@BowenLiu, exactly, it does the same thing as df['Year'].astype(str) – Retract 18/10, 2018 at 20:27

And df['period'] = [''.join(x) for x in zip(df['Year'], df['quarter'])] ? – Siderosis 6/11, 2018 at 4:43

@MaxU - I've found that apply(str) is typically faster than astype(str), would be interesting to see how that stacks up here. – Assembler 15/2, 2019 at 16:33

@MaxU could you help the following question stackoverflow.com/questions/54865312/… – Denisdenise 26/2, 2019 at 6:43

I'm curious if there is an easy way to make any of these answers amenable to providing an arbitrary list of column names. – Tecla 9/4, 2019 at 20:17

@jflournoy, pardon, I’m not sure I understood your question. Could you please explain what do you want to achieve? – Retract 10/4, 2019 at 5:43

I'm curious how you would implement the best performers, [111] and [107] above, in a function that takes as input a list of arbitrary length. I guess it's obvious one could loop through the list to iteratively build up the full concatenated strings, but thought there might be a better way. – Tecla 11/4, 2019 at 12:27

326

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)

Yields this dataframe

   Year quarter  period
0  2014      q1  2014q1
1  2015      q2  2015q2

This method generalizes to an arbitrary number of string columns by replacing df[['Year', 'quarter']] with any column slice of your dataframe, e.g. df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1).

You can check more information about apply() method here

Kablesh answered 11/9, 2015 at 17:36 Comment(11)

lambda x: ''.join(x) is just ''.join, no? – Bethought 19/9, 2016 at 11:54

@Bethought no. here we are taking each row of the df[['Year', 'quarter']] and passing it as series to join, and join concatenates the elements in the series. – Goodhumored 1/2, 2017 at 21:2

@OzgurOzturk: the point is that the lambda part of the lambda x: ''.join(x) construction doesn't do anything; it's like using lambda x: sum(x) instead of just sum. – Bethought 1/2, 2017 at 21:7

@Russ, if you wanted to join these strings based on some criteria i.e. join if string != "some_string" how could you add that? – Often 20/7, 2017 at 9:1

Confirmed same result when using ''.join, i.e.: df['period'] = df[['Year', 'quarter']].apply(''.join, axis=1). – Steady 10/10, 2017 at 5:30

This solution gives me a TypeError: ('sequence item 0: expected str instance, int found', as I want to combine an integer categorical variable with a string category. – Cacogenics 3/1, 2018 at 14:56

@Cacogenics join takes only str instances in an iterable. Use a map to convert them all into str and then use join. – Sewn 27/3, 2018 at 12:51

This solution ended up being reasonably slower for me than using the df['Foo'].map(function) + '-' + df['Bar'].map(function) approach. For ~48k rows applying a join along axis 1 took 1.15s whereas the other approach takes 100ms. Not unreasonable on a small dataset, but if you are working with a lot of data that can make a big difference. – Sapphirine 20/4, 2018 at 12:16

'-'.join(x.map(str)) – Door 3/9, 2018 at 8:23

Similarly df['period'] = [''.join(x) for x in zip(df['Year'], df['quarter'])] works too which may be faster. – Siderosis 6/11, 2018 at 4:43

This is the best and most versatile answer, given that the columns to be joined may need to be dynamic and not static and this solution makes that possible. – Peekaboo 24/2, 2021 at 16:51

201

The method cat() of the .str accessor works really well for this:

>>> import pandas as pd
>>> df = pd.DataFrame([["2014", "q1"], 
...                    ["2015", "q3"]],
...                   columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014      q1
1  2015      q3
>>> df['Period'] = df.Year.str.cat(df.Quarter)
>>> print(df)
   Year Quarter  Period
0  2014      q1  2014q1
1  2015      q3  2015q3

cat() even allows you to add a separator so, for example, suppose you only have integers for year and period, you can do this:

>>> import pandas as pd
>>> df = pd.DataFrame([[2014, 1],
...                    [2015, 3]],
...                   columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014       1
1  2015       3
>>> df['Period'] = df.Year.astype(str).str.cat(df.Quarter.astype(str), sep='q')
>>> print(df)
   Year Quarter  Period
0  2014       1  2014q1
1  2015       3  2015q3

Joining multiple columns is just a matter of passing either a list of series or a dataframe containing all but the first column as a parameter to str.cat() invoked on the first column (Series):

>>> df = pd.DataFrame(
...     [['USA', 'Nevada', 'Las Vegas'],
...      ['Brazil', 'Pernambuco', 'Recife']],
...     columns=['Country', 'State', 'City'],
... )
>>> df['AllTogether'] = df['Country'].str.cat(df[['State', 'City']], sep=' - ')
>>> print(df)
  Country       State       City                   AllTogether
0     USA      Nevada  Las Vegas      USA - Nevada - Las Vegas
1  Brazil  Pernambuco     Recife  Brazil - Pernambuco - Recife

Do note that if your pandas dataframe/series has null values, you need to include the parameter na_rep to replace the NaN values with a string, otherwise the combined column will default to NaN.

Moonfish answered 7/3, 2016 at 18:4 Comment(5)

This seems way better (maybe more efficient, too) than lambda or map; also it just reads most cleanly. – Venable 22/5, 2016 at 20:31

Which version of pandas are you using? I get ValueError: Did you mean to supply a sep keyword? in pandas-0.23.4. Thanks! – Ajaajaccio 5/12, 2018 at 20:56

@QinqingLiu, I retested these with pandas-0.23.4 and they seem work. The sep parameter is only necessary if you intend to separate the parts of the concatenated string. If you get an error, please show us your failing example. – Moonfish 10/12, 2018 at 19:34

@Moonfish can i do a newline instead of '-' with sep keyword? – Swindell 21/6, 2021 at 2:52

@arun-menon: I don't see why not. In the last example above you could do .str.cat(df[['State', 'City']], sep ='\n'), for example. I haven't tested it yet, though. – Moonfish 21/6, 2021 at 12:8

Use of a lamba function this time with string.format().

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': ['q1', 'q2']})
print df
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
print df

  Quarter  Year
0      q1  2014
1      q2  2015
  Quarter  Year YearQuarter
0      q1  2014      2014q1
1      q2  2015      2015q2

This allows you to work with non-strings and reformat values as needed.

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': [1, 2]})
print df.dtypes
print df

df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}q{}'.format(x[0],x[1]), axis=1)
print df

Quarter     int64
Year       object
dtype: object
   Quarter  Year
0        1  2014
1        2  2015
   Quarter  Year YearQuarter
0        1  2014      2014q1
1        2  2015      2015q2

Liven answered 16/3, 2016 at 16:43 Comment(2)

Much quicker: .apply(''.join(x), axis=1) – Ganef 8/7, 2019 at 10:31

This solution worked great for my needs since I had to do some formatting.

df_game['formatted_game_time'] = df_game[['wday', 'month', 'day', 'year', 'time']].apply(lambda x: '{}, {}/{}/{} @ {}'.format(x[0], x[1], x[2], x[3], x[4]), axis=1)

– Volney 26/11, 2022 at 17:7

generalising to multiple columns, why not:

columns = ['whatever', 'columns', 'you', 'choose']
df['period'] = df[columns].astype(str).sum(axis=1)

Selfliquidating answered 30/7, 2019 at 10:38 Comment(2)

Looks cool but what if I want to add a delimiter between the strings, like '-'? – Roz 2/10, 2019 at 17:55

@Roz maybe create a delimiter column? – Mitman 20/9, 2021 at 8:15

You can use lambda:

combine_lambda = lambda x: '{}{}'.format(x.Year, x.quarter)

And then use it with creating the new column:

df['period'] = df.apply(combine_lambda, axis = 1)

Yoruba answered 28/2, 2021 at 16:25 Comment(0)

Let us suppose your dataframe is df with columns Year and Quarter.

import pandas as pd
df = pd.DataFrame({'Quarter':'q1 q2 q3 q4'.split(), 'Year':'2000'})

Suppose we want to see the dataframe;

df
>>>  Quarter    Year
   0    q1      2000
   1    q2      2000
   2    q3      2000
   3    q4      2000

Finally, concatenate the Year and the Quarter as follows.

df['Period'] = df['Year'] + ' ' + df['Quarter']

You can now print df to see the resulting dataframe.

df
>>>  Quarter    Year    Period
    0   q1      2000    2000 q1
    1   q2      2000    2000 q2
    2   q3      2000    2000 q3
    3   q4      2000    2000 q4

If you do not want the space between the year and quarter, simply remove it by doing;

df['Period'] = df['Year'] + df['Quarter']

Environs answered 22/7, 2018 at 5:20 Comment(5)

Specified as strings df['Period'] = df['Year'].map(str) + df['Quarter'].map(str) – Exorcism 7/8, 2018 at 18:58

I'm getting TypeError: Series cannot perform the operation + when I run either df2['filename'] = df2['job_number'] + '.' + df2['task_number'] or df2['filename'] = df2['job_number'].map(str) + '.' + df2['task_number'].map(str). – Xanthate 3/3, 2019 at 6:43

However, df2['filename'] = df2['job_number'].astype(str) + '.' + df2['task_number'].astype(str) did work. – Xanthate 3/3, 2019 at 6:51

@KarlBaker, I think you did not have strings in your input. But I am glad you figured that out. If you look at the example dataframe that I created above, you will see that all the columns are strings. – Environs 3/3, 2019 at 17:31

What exactly is the point of this solution, since it's identical to the top answer? – Sexcentenary 18/3, 2020 at 1:22

Although the @silvado answer is good if you change df.map(str) to df.astype(str) it will be faster:

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

In [131]: %timeit df["Year"].map(str)
10000 loops, best of 3: 132 us per loop

In [132]: %timeit df["Year"].astype(str)
10000 loops, best of 3: 82.2 us per loop

Souther answered 25/11, 2015 at 10:25 Comment(0)

Here is an implementation that I find very versatile:

In [1]: import pandas as pd 

In [2]: df = pd.DataFrame([[0, 'the', 'quick', 'brown'],
   ...:                    [1, 'fox', 'jumps', 'over'], 
   ...:                    [2, 'the', 'lazy', 'dog']],
   ...:                   columns=['c0', 'c1', 'c2', 'c3'])

In [3]: def str_join(df, sep, *cols):
   ...:     from functools import reduce
   ...:     return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep), 
   ...:                   [df[col] for col in cols])
   ...: 

In [4]: df['cat'] = str_join(df, '-', 'c0', 'c1', 'c2', 'c3')

In [5]: df
Out[5]: 
   c0   c1     c2     c3                cat
0   0  the  quick  brown  0-the-quick-brown
1   1  fox  jumps   over   1-fox-jumps-over
2   2  the   lazy    dog     2-the-lazy-dog

Pahl answered 3/4, 2017 at 17:5 Comment(1)

FYI: This method works great with Python 3, but gives me trouble in Python 2. – Bruton 31/7, 2017 at 19:40

more efficient is

def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)

and here is a time test:

import numpy as np
import pandas as pd

from time import time


def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)


def concat_df_str2(df):
    """ run time: 5.2758s """
    return df.astype(str).sum(axis=1)


def concat_df_str3(df):
    """ run time: 5.0076s """
    df = df.astype(str)
    return df[0] + df[1] + df[2] + df[3] + df[4] + \
           df[5] + df[6] + df[7] + df[8] + df[9]


def concat_df_str4(df):
    """ run time: 7.8624s """
    return df.astype(str).apply(lambda x: ''.join(x), axis=1)


def main():
    df = pd.DataFrame(np.zeros(1000000).reshape(100000, 10))
    df = df.astype(int)

    time1 = time()
    df_en = concat_df_str4(df)
    print('run time: %.4fs' % (time() - time1))
    print(df_en.head(10))


if __name__ == '__main__':
    main()

final, when sum(concat_df_str2) is used, the result is not simply concat, it will trans to integer.

Drus answered 9/1, 2018 at 2:13 Comment(1)

+1 Neat solution, this also allows us to specify the columns: e.g. df.values[:, 0:3] or df.values[:, [0,2]]. – Healy 9/2, 2018 at 9:51

Using zip could be even quicker:

df["period"] = [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]

Graph:

import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
from collections import defaultdict

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

myfuncs = {
"df['Year'].astype(str) + df['quarter']":
    lambda: df['Year'].astype(str) + df['quarter'],
"df['Year'].map(str) + df['quarter']":
    lambda: df['Year'].map(str) + df['quarter'],
"df.Year.str.cat(df.quarter)":
    lambda: df.Year.str.cat(df.quarter),
"df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)":
    lambda: df.loc[:, ['Year','quarter']].astype(str).sum(axis=1),
"df[['Year','quarter']].astype(str).sum(axis=1)":
    lambda: df[['Year','quarter']].astype(str).sum(axis=1),
    "df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)":
    lambda: df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1),
    "[''.join(i) for i in zip(dataframe['Year'].map(str),dataframe['quarter'])]":
    lambda: [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
}

d = defaultdict(dict)
step = 10
cont = True
while cont:
    lendf = len(df); print(lendf)
    for k,v in myfuncs.items():
        iters = 1
        t = 0
        while t < 0.2:
            ts = timeit.repeat(v, number=iters, repeat=3)
            t = min(ts)
            iters *= 10
        d[k][lendf] = t/iters
        if t > 2: cont = False
    df = pd.concat([df]*step)

pd.DataFrame(d).plot().legend(loc='upper center', bbox_to_anchor=(0.5, -0.15))
plt.yscale('log'); plt.xscale('log'); plt.ylabel('seconds'); plt.xlabel('df rows')
plt.show()

Boyce answered 13/5, 2018 at 14:33 Comment(0)

This solution uses an intermediate step compressing two columns of the DataFrame to a single column containing a list of the values. This works not only for strings but for all kind of column-dtypes

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['list']=df[['Year','quarter']].values.tolist()
df['period']=df['list'].apply(''.join)
print(df)

Result:

   Year quarter        list  period
0  2014      q1  [2014, q1]  2014q1
1  2015      q2  [2015, q2]  2015q2

Katekatee answered 15/3, 2019 at 16:37 Comment(4)

looks like other dtypes won't work. I got a TypeError: sequence item 1: expected str instance, float found – Twitt 10/4, 2019 at 9:8

apply first a cast to string. The join operation works only for strings – Katekatee 10/4, 2019 at 10:58

This solution won't work to combine two columns with different dtype, see my answer for the correct solution for such case. – Parget 16/5, 2019 at 13:21

Instead of .apply(''.join) why not use .str.join('')? – Furlani 28/5, 2021 at 0:45

Here is my summary of the above solutions to concatenate / combine two columns with int and str value into a new column, using a separator between the values of columns. Three solutions work for this purpose.

# be cautious about the separator, some symbols may cause "SyntaxError: EOL while scanning string literal".
# e.g. ";;" as separator would raise the SyntaxError

separator = "&&" 

# pd.Series.str.cat() method does not work to concatenate / combine two columns with int value and str value. This would raise "AttributeError: Can only use .cat accessor with a 'category' dtype"

df["period"] = df["Year"].map(str) + separator + df["quarter"]
df["period"] = df[['Year','quarter']].apply(lambda x : '{} && {}'.format(x[0],x[1]), axis=1)
df["period"] = df.apply(lambda x: f'{x["Year"]} && {x["quarter"]}', axis=1)

Parget answered 16/5, 2019 at 13:19 Comment(1)

At least your first solution does not work (any more?). I use: df["period"] = (df["Year"].astype(str) + separator + df["quarter"].astype(str)).astype('category') – Flocculent 16/3, 2023 at 14:1

my take....

listofcols = ['col1','col2','col3']
df['combined_cols'] = ''

for column in listofcols:
    df['combined_cols'] = df['combined_cols'] + ' ' + df[column]
'''

Bugleweed answered 18/8, 2020 at 4:13 Comment(1)

You should add an explanation to this code snippet. Adding only code answers encourages people to use code they don't understand and doesn't help them learn. – Rarebit 18/8, 2020 at 10:10

When combining columns with strings by concatenating them using the addition operator + if any is NaN then entire output will be NaN so use fillna()

df["join"] = "some" + df["col"].fillna(df["val_if_nan"])

Foodstuff answered 7/1, 2023 at 14:12 Comment(0)

As many have mentioned previously, you must convert each column to string and then use the plus operator to combine two string columns. You can get a large performance improvement by using NumPy.

%timeit df['Year'].values.astype(str) + df.quarter
71.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['Year'].astype(str) + df['quarter']
565 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Siddra answered 25/10, 2017 at 3:21 Comment(2)

I'd like to use the numpyified version but I'm getting an error: Input: df2['filename'] = df2['job_number'].values.astype(str) + '.' + df2['task_number'].values.astype(str) --> Output: TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U21') dtype('<U21') dtype('<U21'). Both job_number and task_number are ints. – Xanthate 3/3, 2019 at 6:56

That's because you are combining two numpy arrays. It works if you combine an numpy array with pandas Series. as df['Year'].values.astype(str) + df.quarter – Malleable 10/2, 2020 at 11:23

One can use assign method of DataFrame:

df= (pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']}).
  assign(period=lambda x: x.Year+x.quarter ))

Nestle answered 1/12, 2018 at 10:55 Comment(0)

Similar to @geher answer but with any separator you like:

SEP = " "
INPUT_COLUMNS_WITH_SEP = ",sep,".join(INPUT_COLUMNS).split(",")

df.assign(sep=SEP)[INPUT_COLUMNS_WITH_SEP].sum(axis=1)

Complain answered 4/12, 2021 at 12:43 Comment(0)

def madd(x):
    """Performs element-wise string concatenation with multiple input arrays.

    Args:
        x: iterable of np.array.

    Returns: np.array.
    """
    for i, arr in enumerate(x):
        if type(arr.item(0)) is not str:
            x[i] = x[i].astype(str)
    return reduce(np.core.defchararray.add, x)

For example:

data = list(zip([2000]*4, ['q1', 'q2', 'q3', 'q4']))
df = pd.DataFrame(data=data, columns=['Year', 'quarter'])
df['period'] = madd([df[col].values for col in ['Year', 'quarter']])

df

    Year    quarter period
0   2000    q1  2000q1
1   2000    q2  2000q2
2   2000    q3  2000q3
3   2000    q4  2000q4

Lightfooted answered 21/7, 2017 at 19:38 Comment(2)

NameError: name 'reduce' is not defined – Butcher 8/3, 2022 at 20:14

from functools import reduce – Poitiers 25/5, 2022 at 4:22

Use .combine_first.

df['Period'] = df['Year'].combine_first(df['Quarter'])

Candiecandied answered 10/2, 2018 at 4:1 Comment(1)

This is not correct. .combine_first will result in either the value from 'Year' being stored in 'Period', or, if it is Null, the value from 'Quarter'. It will not concatenate the two strings and store them in 'Period'. – Festoon 29/1, 2019 at 20:48

DataFrame.eval()

For a little terse code, we can use .eval(). We can concatenate two (or more) string dtype columns horizontally using the + operator as follows.

df = pd.DataFrame({'A': ['x', 'y', 'z'], 'B': ['1', '2', '3']}, dtype='string')
df['C'] = df.eval("A + B")

You can even include the new column assignment inside the evaluated expression (which also opens up the possibility to do it in-place).

df = df.eval('C = A + B')
df.eval('C = A + B', inplace=True)

eval doesn't allow a similarly terse way to add delimiters; however, we can call str.cat() (which has the sep= kwarg) inside the numerical expression.

df = df.eval("D = A.str.cat([B, C], '_')")

which produces the following output (where columns, A, B and C are concatenated horizontally):

When to use vectorized concat vs explicit loop

There are major string concatenation methods given on this page:

vectorized +: df['A'] + df['B']
string formatting in a loop (N.B. converting the columns into lists makes the loop faster):
```
[f"{x}{y}" for x,y in zip(df['A'].tolist(), df['B'].tolist())])]
```
vectorized str.cat(): df['A'].str.cat(df['B'])

As the following figure shows, vectorized concatenation (via +) is fastest if the strings being concatenated are short such as in the OP. However, if the strings are long (e.g. each cell contains a tweet or a book excerpt), then an explicit Python loop (use f-string in a list comprehension) is the fastest.

Code to reproduce the above figure:

import matplotlib.pyplot as plt
import pandas as pd
import perfplot

fig, axs = plt.subplots(1, 2, figsize=(15,5))
for ax, s, title in zip(axs, ('a'*1000, 'a'), ("long", "short")):
    plt.sca(ax)
    perfplot.plot(
        kernels=[lambda df: df.assign(C=df['A'] + '_' + df['B']),
                 lambda df: df.assign(C=df['A'].str.cat(df['B'], '_')),
                 lambda df: df.assign(C=[f"{x}_{y}" for x,y in zip(df['A'].tolist(), df['B'].tolist())])],
        n_range=[2**k for k in range(18)],
        setup=lambda n: pd.DataFrame({'A': [s]*n, 'B': [s]*n}),
        labels=["df['A'] + df['B']", "str.cat", "list comp"],
        xlabel="DataFrame length",
        title=f"When the strings are {title}", 
        equality_check=pd.DataFrame.equals)
fig.tight_layout()
fig.savefig("string_concat_perf.png")

Horseleech answered 6/12, 2023 at 3:43 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++