How to add a new column to an existing DataFrame

33

1318

I have the following indexed DataFrame with named columns and a non-continuous row index:

          a         b         c         d
2  0.671399  0.101208 -0.181532  0.241273
3  0.446172 -0.243316  0.051767  1.577318
5  0.614758  0.075793 -0.451460 -0.012493

I would like to add a new column, 'e', to the existing data frame and do not want to change anything in the data frame (i.e., the new column always has the same length as the DataFrame).

0   -0.335485
1   -1.166658
2   -0.385571
dtype: float64

How can I add column e to the above example?

Extent answered 23/9, 2012 at 19:0 Comment(2)
If your new column depends on an existing column, you can add it as shown in my answer below.Monopoly
Wow, this Q&A is a mess. The straightforward answer is df['e'] = e, but that doesn't work if the indexes don't match, but the indexes only don't match because OP created it like that (e = Series(<np_array>)), but that was removed from the question in revision 5.Outburst
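The index mismatch described in the comment above can be reproduced in a few lines; df1 and e below are stand-ins mirroring the question's setup (a non-continuous index versus a 0-based Series):

```python
import numpy as np
import pandas as pd

# a frame with a non-continuous index, like the OP's
df1 = pd.DataFrame(np.ones((3, 1)), columns=['a'], index=[2, 3, 5])
e = pd.Series([-0.3, -1.1, -0.4])  # default RangeIndex 0, 1, 2

df1['e'] = e                             # aligns on index labels
n_missing = int(df1['e'].isna().sum())   # only label 2 overlaps, so 2 NaNs

df1['e'] = e.values                      # positional, ignores labels
n_missing_after = int(df1['e'].isna().sum())
```

So df['e'] = e is only safe when the indexes line up; otherwise use e.values, or reindex the series first.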
1308

Edit 2017

As indicated in the comments and by @Alexander, currently the best method to add the values of a Series as a new column of a DataFrame could be using assign:

df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)

Edit 2015
Some reported getting the SettingWithCopyWarning with this code.
However, the code still runs perfectly with the current pandas version 0.16.1.

>>> sLength = len(df1['a'])
>>> df1
          a         b         c         d
6 -0.269221 -0.026476  0.997517  1.294385
8  0.917438  0.847941  0.034235 -0.448948

>>> df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
          a         b         c         d         e
6 -0.269221 -0.026476  0.997517  1.294385  1.757167
8  0.917438  0.847941  0.034235 -0.448948  2.228131

>>> pd.version.short_version
'0.16.1'

The SettingWithCopyWarning aims to inform of a possibly invalid assignment on a copy of the DataFrame. It doesn't necessarily mean you did it wrong (it can trigger false positives), but from 0.13.0 onward it lets you know there are more adequate methods for the same purpose. Then, if you get the warning, just follow its advice: Try using .loc[row_index,col_indexer] = value instead

>>> df1.loc[:,'f'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
          a         b         c         d         e         f
6 -0.269221 -0.026476  0.997517  1.294385  1.757167 -0.050927
8  0.917438  0.847941  0.034235 -0.448948  2.228131  0.006109
>>> 

In fact, this is currently the most efficient method, as described in the pandas docs


Original answer:

Use the original df1 indexes to create the series:

df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
Polyphemus answered 23/9, 2012 at 19:24 Comment(25)
The series comes from a sensor and is fed to the computer. The only thing known is its length, which is the same as the DataFrame's. The presented code is only an illustrative example.Extent
Thanks a lot @Polyphemus your answer is perfectly what I couldn't figure out.Extent
if you need to prepend column use DataFrame.insert: df1.insert(0, 'A', Series(np.random.randn(sLength), index=df1.index))Production
From Pandas version 0.12 onwards, I believe this syntax is not optimal, and gives warning: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_index,col_indexer] = value insteadHeadroom
@Headroom No warning with 0.15.1. Do you believe, or did you actually try the above exact code?Polyphemus
I got this warning with pandas 0.16 too. What is the optimal syntax three years later?Phone
@Polyphemus I have tried with this exact code and things work as expected. I have also done operations like this with pretty much identical code and had it throw up SettingWithCopyWarning. I can't nail down when the warning will appear and when it won't.Crampon
@GregoryArenius Yeah, they explain in the docs: The SettingWithCopy warning is a ‘heuristic’ to detect this (meaning it tends to catch most cases but is simply a lightweight check). Figuring this out for real is way complicated.Polyphemus
It should be noted that this approach -- as any other based on the assignment df['column_name'] = array_like -- will overwrite an existing column with the same name as 'column_name'. .join will throw a ValueError if no prefix/suffix is given.Pashto
Following .loc as SettingWithCopy warning somehow results in more warning: ... self.obj[item_labels[indexer[info_axis]]] = valueHarem
Looks like you can update this further to df1.loc[:, 'f'] = value. Via Python console: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: pandas.pydata.org/pandas-docs/stable/… self.obj[item] = sCrosspatch
I don't understand... Why can't we just have df1['e']=np.random.randn(sLength)? Specifying the random numbers as an np array does not have the problem of messing up the indices.Smutchy
@Smutchy The OP wanted to add a pre-existing Series to a Dataframe. The random method was only to create them for the example. Once said that, currently the answer from @Alexander using assign would be the best answerPolyphemus
Is there a reason to use df1 = df1.assign(e=e.values) instead of df1['e'] = e.values?Dogtrot
@CarlMorris Yes, for the specific question of the OP. Look at the comments to Kathirmani Sukumar answer.Polyphemus
@joaquin, I see his answer, but his solution is different from what I am asking about. He uses df['e'] = e, whereas I am asking about df['e'] = e.values. Doesn't e.values fix the issue in Kathirmani Sukumar's answer?Dogtrot
Every time I see discussion about the cursed SettingWithCopyWarning, my eyes glaze over.Myriagram
how do you use string names, i.e. 'e' instead of e, with the assign method?Exist
The insert method (by @hum3) has been the only one that has almost worked for me. However, I cannot use it when I reassign a value to a column that already exists, i.e. df['e'] = 0 when 'e' already existsExist
@4myle The fact that this answer keeps evolving looks really bad for the pandas developers. Two fundamental api changes in as many years - doesn't look great for them.Chromogenic
@Exist You can unpack a kwargs dictionary, like so: df1 = df1.assign(**{'e': p.Series(np.random.randn(sLength)).values})Chromogenic
Won't assign copy the whole DataFrame? If yes, isn't that super inefficient?Outpoint
No need to create a Series from a numpy array to reconvert it back to a numpy array: df1 = df1.assign(e=np.random.randn(sLength)) is simpler.Klecka
Instead of saying "currently" or referencing years, please reference the pandas version numbers, e.g. "between 0.14-0.16 do X, in 0.17+ do Y..."Dime
Still got the warning after using df.loc[:,'mycolname'], not sure why, but luckily df1.assign() works like a charm, thanksJespersen
329

This is the simple way of adding a new column: df['e'] = e

Prakrit answered 12/12, 2012 at 16:4 Comment(7)
Despite the high number of votes: this answer is wrong. Note that the OP has a dataframe with non-continuous indexes, and e (Series(np.random.randn(sLength))) generates a Series indexed 0 to n. If you assign this to df1 then you get some NaN cells.Polyphemus
What @Polyphemus says is true, but as long as you keep that in mind, this is a very useful shortcut.Wolver
It doesn't help, because if you have multiple rows and you use the assignment, it assigns all rows of the new column with that value (in your case e), which is usually undesirable.Necessarily
The issue raised @Polyphemus above can simply be resolved (like in joaquin's Answer above) by doing: df['e'] = e.values or equivalently, df['e'] = e.to_numpy(). Right?Australasia
I didn't find this syntax in the DataFrame API reference, but it is used in the official Pandas User Guide: pandas.pydata.org/docs/user_guide/…Australasia
CAUTION: HIGH DOWNVOTES RATE (now at 1/6) (use df['e'] = e.values instead)Agamete
It helped me!Subjacent
224

I would like to add a new column, 'e', to the existing data frame and not change anything in the data frame. (The series always has the same length as the dataframe.)

I assume that the index values in e match those in df1.

The easiest way to create a new column named e and assign it the values from your series e:

df['e'] = e.values

assign (Pandas 0.16.0+)

As of Pandas 0.16.0, you can also use assign, which assigns new columns to a DataFrame and returns a new object (a copy) with all the original columns in addition to the new ones.

df1 = df1.assign(e=e.values)

As per this example (which also includes the source code of the assign function), you can also include more than one column:

>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.assign(mean_a=df.a.mean(), mean_b=df.b.mean())
   a  b  mean_a  mean_b
0  1  3     1.5     3.5
1  2  4     1.5     3.5

In context with your example:

np.random.seed(0)
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
mask = df1.applymap(lambda x: x < -0.7)
df1 = df1[~mask.any(axis=1)]
sLength = len(df1['a'])
e = pd.Series(np.random.randn(sLength))

>>> df1
          a         b         c         d
0  1.764052  0.400157  0.978738  2.240893
2 -0.103219  0.410599  0.144044  1.454274
3  0.761038  0.121675  0.443863  0.333674
7  1.532779  1.469359  0.154947  0.378163
9  1.230291  1.202380 -0.387327 -0.302303

>>> e
0   -1.048553
1   -1.420018
2   -1.706270
3    1.950775
4   -0.509652
dtype: float64

df1 = df1.assign(e=e.values)

>>> df1
          a         b         c         d         e
0  1.764052  0.400157  0.978738  2.240893 -1.048553
2 -0.103219  0.410599  0.144044  1.454274 -1.420018
3  0.761038  0.121675  0.443863  0.333674 -1.706270
7  1.532779  1.469359  0.154947  0.378163  1.950775
9  1.230291  1.202380 -0.387327 -0.302303 -0.509652

The description of this new feature when it was first introduced can be found here.

Overgrowth answered 14/2, 2016 at 0:49 Comment(8)
Any comment on the relative performance of the two methods, considering that the first method (df['e'] = e.values) does not create a copy of the dataframe, while the second option (using df.assign) does? In cases of lots of new columns being added sequentially and large dataframes I'd expect much better performance of the first method.Ejective
@jhin Yes, direct assignment is obviously much faster if you are working on a fixed dataframe. The benefit of using assign comes when chaining your operations together.Overgrowth
This certainly seems like a nice balance between explicit and implicit. +1 :DPraemunire
For fun df.assign(**df.mean().add_prefix('mean_'))Heine
Just to update this answer with the version v0.23.2 : assign "always returns a copy of the data, leaving the original DataFrame untouched."Rola
You "assume that the index values in e match those in df1." What if index values do not match?Jessalin
@Jessalin From the question, it appears that the OP is simply concatenating the dataframes and ignoring the index. If this is the case, then the methods above will work. If one wishes to retain the index, then use something like df_new = pd.concat([df1, df2], axis=1), noting that ignore_index=False by default.Overgrowth
assign() is great. I believe directly using the index assignment gives a warning now.Gereld
77

Super simple column assignment

A pandas dataframe is implemented as an ordered dict of columns.

This means that the __getitem__ [] can not only be used to get a certain column, but __setitem__ [] = can be used to assign a new column.

For example, this dataframe can have a column added to it by simply using the [] accessor

    size      name color
0    big      rose   red
1  small    violet  blue
2  small     tulip   red
3  small  harebell  blue

df['protected'] = ['no', 'no', 'no', 'yes']

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes

Note that this works even if the index of the dataframe is off.

df.index = [3,2,1,0]
df['protected'] = ['no', 'no', 'no', 'yes']
    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes

[]= is the way to go, but watch out!

However, if you have a pd.Series and try to assign it to a dataframe where the indexes are off, you will run in to trouble. See example:

df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no

This is because a pd.Series by default has an index enumerated from 0 to n. And the pandas [] = method tries to be "smart"

What actually is going on.

When you use the [] = method, pandas is quietly aligning on the index: it matches the index of the left-hand dataframe against the index of the right-hand series, effectively a left join on the dataframe's index. df['column'] = series

Side note

This quickly causes cognitive dissonance, since the []= method is trying to do a lot of different things depending on the input, and the outcome cannot be predicted unless you just know how pandas works. I would therefore advise against []= in code bases, but when exploring data in a notebook, it is fine.

Going around the problem

If you have a pd.Series and want it assigned from top to bottom, or if you are writing production code and you are not sure of the index order, it is worth safeguarding against this kind of issue.

You could convert the pd.Series to a np.ndarray or a list, which will do the trick.

df['protected'] = pd.Series(['no', 'no', 'no', 'yes']).values

or

df['protected'] = list(pd.Series(['no', 'no', 'no', 'yes']))

But this is not very explicit.

Some coder may come along and say "Hey, this looks redundant, I'll just optimize this away".

Explicit way

Setting the index of the pd.Series to be the index of the df is explicit.

df['protected'] = pd.Series(['no', 'no', 'no', 'yes'], index=df.index)

Or more realistically, you probably have a pd.Series already available.

protected_series = pd.Series(['no', 'no', 'no', 'yes'])
protected_series.index = df.index

3     no
2     no
1     no
0    yes

Can now be assigned

df['protected'] = protected_series

    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes

Alternative way with df.reset_index()

Since the index dissonance is the problem, if you feel that the index of the dataframe should not dictate things, you can simply drop the index. This should be faster, but it is not very clean, since your function now probably does two things.

df = df.reset_index(drop=True)
protected_series = protected_series.reset_index(drop=True)
df['protected'] = protected_series

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes

Note on df.assign

While df.assign makes it more explicit what you are doing, it actually has all the same problems as the above []=

df.assign(protected=pd.Series(['no', 'no', 'no', 'yes']))
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no

Just watch out with df.assign that your column is not called self, as that will cause an error. This makes df.assign smelly, since there are these kinds of artifacts in the function.

df.assign(self=pd.Series(['no', 'no', 'no', 'yes']))
TypeError: assign() got multiple values for keyword argument 'self'

You may say, "Well, I'll just not use self then". But who knows how this function changes in the future to support new arguments. Maybe your column name will be an argument in a new update of pandas, causing problems with upgrading.
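If you really do need a column literally named self, plain bracket assignment sidesteps the keyword collision entirely (a minimal sketch; the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# no keyword argument involved, so no clash with assign()'s own 'self'
df['self'] = ['no', 'no', 'yes']
```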

Peroxide answered 3/4, 2017 at 8:59 Comment(1)
"When you use the [] = method pandas is quietly performing an outer join or outer merge". This is the most important piece of information in the whole topic. But could you provide link to the official documentation on how []= operator works?Precision
60

It seems that in recent Pandas versions the way to go is to use df.assign:

df1 = df1.assign(e=np.random.randn(sLength))

It doesn't produce SettingWithCopyWarning.

Outcaste answered 21/7, 2016 at 17:35 Comment(1)
Copying @Dime 's comment from above... Instead of saying "currently" or referencing years, please reference the Pandas version numbersSatterfield
59

Doing this directly via NumPy will be the most efficient:

df1['e'] = np.random.randn(sLength)

Note my original (very old) suggestion was to use map (which is much slower):

df1['e'] = df1['a'].map(lambda x: np.random.random())
Kylakylah answered 23/9, 2012 at 19:22 Comment(2)
thanks for your reply; as I already have e given, how can I modify your code to use the existing series with .map instead of the lambda? I tried df1['e'] = df1['a'].map(lambda x: e) and df1['e'] = df1['a'].map(e), but neither is what I need. (I am new to Python and your previous answer already helped me)Extent
@Extent if you already have e as a Series then you don't need to use map, use df['e']=e (@joaquins answer).Kylakylah
43

Easiest ways:

data['new_col'] = list_of_values

data.loc[ : , 'new_col'] = list_of_values

This way you avoid what is called chained indexing when setting new values in a pandas object.
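A minimal sketch of the non-chained form (the column name and values are illustrative):

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3]})

# a single .loc call: row indexer and column label together, no chaining
data.loc[:, 'new_col'] = [10, 20, 30]

# by contrast, a chained form such as data[data['a'] > 1]['new_col'] = ...
# would operate on a temporary copy and could be silently lost
```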

Cataldo answered 8/9, 2018 at 5:17 Comment(0)
27

If you want to set the whole new column to an initial base value (e.g. None), you can do this: df1['e'] = None

This actually would assign "object" type to the cell. So later you're free to put complex data types, like list, into individual cells.
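A short sketch of that object-dtype behaviour (the values are illustrative, and this assumes a recent pandas):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})

df1['e'] = None                  # the new column comes out with object dtype
empty_dtype = df1['e'].dtype

# object dtype means each cell can hold an arbitrary Python object
df1['e'] = pd.Series([[10, 20], 'text', {'k': 1}], index=df1.index)
```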

Bit answered 13/10, 2017 at 16:53 Comment(2)
this raises a SettingWithCopyWarningHutchens
df['E'] = '' also works if someone wants to add an empty columnPylorus
25

I got the dreaded SettingWithCopyWarning, and it wasn't fixed by using the iloc syntax. My DataFrame was created by read_sql from an ODBC source. Using a suggestion by lowtech above, the following worked for me:

df.insert(len(df.columns), 'e', pd.Series(np.random.randn(sLength),  index=df.index))

This worked fine to insert the column at the end. I don't know if it is the most efficient, but I don't like warning messages. I think there is a better solution, but I can't find it, and I think it depends on some aspect of the index.
Note that this only works once and will give an error message if you try to overwrite an existing column.
Note: as above, from 0.16.0 onward assign is the best solution. See the documentation http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html#pandas.DataFrame.assign It works well for data-flow style code where you don't overwrite your intermediate values.
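The data-flow point can be sketched with method chaining, where each assign's lambda receives the intermediate frame and the original is left untouched (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0]})

# each assign returns a new frame, so steps chain cleanly
out = (df
       .assign(b=lambda d: d['a'] * 2)        # d is the frame at this step
       .assign(c=lambda d: d['a'] + d['b']))  # 'b' already exists here
```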

Mchail answered 11/6, 2015 at 9:45 Comment(0)
18
  1. First create a Python list, list_of_e, that has the relevant data.
  2. Use this: df['e'] = list_of_e
Janson answered 5/6, 2017 at 0:53 Comment(2)
I really do not understand why this is not the preferred answer. In case you have a pd.Series, the tolist() method might be helpful.Corregidor
The OP has a Series e, and the way to add a column to a df is different from adding a list. The above answers explain well what to do in that case, especially the @Peroxide answer. The answers below are mostly not aware of that.Solis
16

To create an empty column

df['i'] = None
Androgyne answered 28/11, 2019 at 6:12 Comment(0)
15

If the column you are trying to add is a series variable then just:

df["new_columns_name"] = series_variable_name  # this will do it for you

This works well even if you are replacing an existing column. Just use the same new_columns_name as the column you want to replace. It will overwrite the existing column data with the new series data.
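A minimal sketch of the overwrite behaviour described above (names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
s = pd.Series([9, 8, 7], index=df.index)

df['x'] = s  # same name: the existing column is silently replaced
```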

Biogenesis answered 3/11, 2017 at 10:5 Comment(0)
13

If the data frame and Series object have the same index, pandas.concat also works here:

import pandas as pd
df
#          a            b           c           d
#0  0.671399     0.101208   -0.181532    0.241273
#1  0.446172    -0.243316    0.051767    1.577318
#2  0.614758     0.075793   -0.451460   -0.012493

e = pd.Series([-0.335485, -1.166658, -0.385571])    
e
#0   -0.335485
#1   -1.166658
#2   -0.385571
#dtype: float64

# here we need to give the series object a name, which becomes the new
# column name in the result
df = pd.concat([df, e.rename("e")], axis=1)
df

#          a            b           c           d           e
#0  0.671399     0.101208   -0.181532    0.241273   -0.335485
#1  0.446172    -0.243316    0.051767    1.577318   -1.166658
#2  0.614758     0.075793   -0.451460   -0.012493   -0.385571

In case they don't have the same index:

e.index = df.index
df = pd.concat([df, e.rename("e")], axis=1)
Fastness answered 7/4, 2017 at 1:38 Comment(0)
13

Foolproof:

df.loc[:, 'NewCol'] = 'New_Val'

Example:

df = pd.DataFrame(data=np.random.randn(20, 4), columns=['A', 'B', 'C', 'D'])

df

           A         B         C         D
0  -0.761269  0.477348  1.170614  0.752714
1   1.217250 -0.930860 -0.769324 -0.408642
2  -0.619679 -1.227659 -0.259135  1.700294
3  -0.147354  0.778707  0.479145  2.284143
4  -0.529529  0.000571  0.913779  1.395894
5   2.592400  0.637253  1.441096 -0.631468
6   0.757178  0.240012 -0.553820  1.177202
7  -0.986128 -1.313843  0.788589 -0.707836
8   0.606985 -2.232903 -1.358107 -2.855494
9  -0.692013  0.671866  1.179466 -1.180351
10 -1.093707 -0.530600  0.182926 -1.296494
11 -0.143273 -0.503199 -1.328728  0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832  0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15  0.955298 -1.430019  1.434071 -0.088215
16 -0.227946  0.047462  0.373573 -0.111675
17  1.627912  0.043611  1.743403 -0.012714
18  0.693458  0.144327  0.329500 -0.655045
19  0.104425  0.037412  0.450598 -0.923387


df.drop([3, 5, 8, 10, 18], inplace=True)

df

           A         B         C         D
0  -0.761269  0.477348  1.170614  0.752714
1   1.217250 -0.930860 -0.769324 -0.408642
2  -0.619679 -1.227659 -0.259135  1.700294
4  -0.529529  0.000571  0.913779  1.395894
6   0.757178  0.240012 -0.553820  1.177202
7  -0.986128 -1.313843  0.788589 -0.707836
9  -0.692013  0.671866  1.179466 -1.180351
11 -0.143273 -0.503199 -1.328728  0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832  0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15  0.955298 -1.430019  1.434071 -0.088215
16 -0.227946  0.047462  0.373573 -0.111675
17  1.627912  0.043611  1.743403 -0.012714
19  0.104425  0.037412  0.450598 -0.923387

df.loc[:, 'NewCol'] = 0

df
           A         B         C         D  NewCol
0  -0.761269  0.477348  1.170614  0.752714       0
1   1.217250 -0.930860 -0.769324 -0.408642       0
2  -0.619679 -1.227659 -0.259135  1.700294       0
4  -0.529529  0.000571  0.913779  1.395894       0
6   0.757178  0.240012 -0.553820  1.177202       0
7  -0.986128 -1.313843  0.788589 -0.707836       0
9  -0.692013  0.671866  1.179466 -1.180351       0
11 -0.143273 -0.503199 -1.328728  0.610552       0
12 -0.923110 -1.365890 -1.366202 -1.185999       0
13 -2.026832  0.273593 -0.440426 -0.627423       0
14 -0.054503 -0.788866 -0.228088 -0.404783       0
15  0.955298 -1.430019  1.434071 -0.088215       0
16 -0.227946  0.047462  0.373573 -0.111675       0
17  1.627912  0.043611  1.743403 -0.012714       0
19  0.104425  0.037412  0.450598 -0.923387       0
Traveler answered 12/4, 2017 at 11:22 Comment(1)
Not foolproof. This does not address the OP's question, which is a case where the indices of the existing dataframe and the new series are not aligned.Overgrowth
11

One thing to note, though, is that if you do

df1['e'] = Series(np.random.randn(sLength), index=df1.index)

this will effectively be a left join on the df1.index. So if you want to have an outer join effect, my probably imperfect solution is to create a dataframe with index values covering the universe of your data, and then use the code above. For example,

data = pd.DataFrame(index=all_possible_values)
df1['e'] = Series(np.random.randn(sLength), index=df1.index)
Thick answered 20/2, 2015 at 17:32 Comment(0)
11

To insert a new column at a given location (0 <= loc <= number of columns) in a data frame, just use DataFrame.insert:

DataFrame.insert(loc, column, value)

Therefore, if you want to add the column e at the end of a data frame called df, you can use:

e = [-0.335485, -1.166658, -0.385571]    
df.insert(loc=len(df.columns), column='e', value=e)

value can be a Series, an integer (in which case all cells get filled with this one value), or an array-like structure

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html
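For instance, value can be a scalar (broadcast to every row) or an array-like; the column names below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

df.insert(0, 'id', 100)                      # scalar is broadcast to all rows
df.insert(len(df.columns), 'e', [0.5, 1.5])  # array-like is used as given
```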

Vogt answered 7/4, 2019 at 15:12 Comment(0)
8

Let me just add that, just like for hum3, .loc didn't solve the SettingWithCopyWarning and I had to resort to df.insert(). In my case false positive was generated by "fake" chain indexing dict['a']['e'], where 'e' is the new column, and dict['a'] is a DataFrame coming from dictionary.

Also note that if you know what you are doing, you can switch off the warning using pd.options.mode.chained_assignment = None and then use one of the other solutions given here.

Hayott answered 22/10, 2015 at 14:21 Comment(0)
7

Before assigning a new column, if you have indexed data, you need to sort the index. At least in my case I had to:

data.set_index(['index_column'], inplace=True)
# if the index is unsorted, assignment of a new column will fail
data.sort_index(inplace=True)
data.loc['index_value1', 'column_y'] = np.random.randn(data.loc['index_value1', 'column_x'].shape[0])
Gudrun answered 14/6, 2015 at 23:57 Comment(0)
6

To add a new column, 'e', to the existing data frame

 df1.loc[:,'e'] = Series(np.random.randn(sLength))
Consequence answered 8/11, 2016 at 6:55 Comment(2)
It also gives the caveat messageArmlet
you should use df1.loc[::,'e'] = Series(np.random.randn(sLength))Acariasis
6

I was looking for a general way of adding a column of numpy.nans to a dataframe without getting the dumb SettingWithCopyWarning.

From the following:

  • the answers here
  • this question about passing a variable as a keyword argument
  • this method for generating a numpy array of NaNs in-line

I came up with this:

col = 'column_name'
df = df.assign(**{col:numpy.full(len(df), numpy.nan)})
Sherbet answered 13/1, 2017 at 18:34 Comment(0)
5

For the sake of completeness - yet another solution using DataFrame.eval() method:

Data:

In [44]: e
Out[44]:
0    1.225506
1   -1.033944
2   -0.498953
3   -0.373332
4    0.615030
5   -0.622436
dtype: float64

In [45]: df1
Out[45]:
          a         b         c         d
0 -0.634222 -0.103264  0.745069  0.801288
4  0.782387 -0.090279  0.757662 -0.602408
5 -0.117456  2.124496  1.057301  0.765466
7  0.767532  0.104304 -0.586850  1.051297
8 -0.103272  0.958334  1.163092  1.182315
9 -0.616254  0.296678 -0.112027  0.679112

Solution:

In [46]: df1.eval("e = @e.values", inplace=True)

In [47]: df1
Out[47]:
          a         b         c         d         e
0 -0.634222 -0.103264  0.745069  0.801288  1.225506
4  0.782387 -0.090279  0.757662 -0.602408 -1.033944
5 -0.117456  2.124496  1.057301  0.765466 -0.498953
7  0.767532  0.104304 -0.586850  1.051297 -0.373332
8 -0.103272  0.958334  1.163092  1.182315  0.615030
9 -0.616254  0.296678 -0.112027  0.679112 -0.622436
Toucan answered 14/3, 2017 at 21:49 Comment(0)
5

If you just need to create a new empty column then the shortest solution is:

df.loc[:, 'e'] = pd.Series()
Antihistamine answered 27/11, 2020 at 8:26 Comment(0)
5

There are 4 ways you can insert a new column into a pandas DataFrame:

  1. Simple assignment
  2. insert()
  3. assign()
  4. concat()

Let's consider the following example:

import pandas as pd

df = pd.DataFrame({
    'col_a':[True, False, False], 
    'col_b': [1, 2, 3],
})
print(df)
    col_a  col_b
0   True     1
1  False     2
2  False     3

Using simple assignment

ser = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
print(ser)
0    a
1    b
2    c
dtype: object

df['col_c'] = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
print(df)
     col_a  col_b col_c
0   True     1  NaN
1  False     2    a
2  False     3    b

Using assign()

e = pd.Series([1.0, 3.0, 2.0], index=[0, 2, 1])
ser = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
df.assign(col_c=ser.values, col_b=e.values)
     col_a  col_b col_c
0   True   1.0    a
1  False   3.0    b
2  False   2.0    c

Using insert()

df.insert(len(df.columns), 'col_c', ser.values)
print(df)
    col_a  col_b col_c
0   True     1    a
1  False     2    b
2  False     3    c

Using concat()

ser = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])
df = pd.concat([df, ser.rename('col_c')], axis=1)
print(df)
     col_a  col_b col_c
0    True   1.0  NaN
1   False   2.0  NaN
2   False   3.0  NaN
10    NaN   NaN    a
20    NaN   NaN    b
30    NaN   NaN    c
Suisse answered 6/3, 2022 at 14:21 Comment(0)
4

The following is what I did... But I'm pretty new to pandas and really Python in general, so no promises.

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=list('AB'))

newCol = [3, 5, 7]
newName = 'C'

values = np.insert(df.values, df.shape[1], newCol, axis=1)
header = df.columns.values.tolist()
header.append(newName)

df = pd.DataFrame(values, columns=header)
Telmatelo answered 6/10, 2015 at 1:18 Comment(0)
4

If we want to assign a scalar value, e.g. 10, to all rows of a new column in a df:

df = df.assign(new_col=lambda x: 10)  # x here is the whole DataFrame passed to the lambda, not each row

df will now have the new column 'new_col' with value 10 in all rows.

Allerie answered 24/1, 2021 at 4:27 Comment(0)
3

If you get the SettingWithCopyWarning, an easy fix is to copy the DataFrame you are trying to add a column to.

df = df.copy()
df['col_name'] = values
Deflate answered 7/3, 2016 at 3:28 Comment(1)
that's not a good idea. If the dataframe is large enough, it's gonna be memory intensive... Besides it would turn into a nightmare if you keep adding columns every once in a while.Cahan
3
x = pd.DataFrame([1, 2, 3, 4, 5])

y = pd.DataFrame([5, 4, 3, 2, 1])

z = pd.concat([x, y], axis=1)


Fillbert answered 4/10, 2020 at 2:30 Comment(1)
I doubt that this helps - or even works at all. Care to explain?Gertrude
2

this is a special case of adding a new column to a pandas dataframe. Here, I am adding a new feature/column based on an existing column's data in the dataframe.

So, let our dataframe have columns 'feature_1', 'feature_2', 'probability_score', and we have to add a new column 'predicted_class' based on the data in column 'probability_score'.

I will use the map() method of the pandas Series and also define a function of my own which implements the logic for giving a particular class label to every row in my dataframe.

data = pd.read_csv('data.csv')

def myFunction(x):
    # implement your logic here
    if so_and_so:  # your condition
        return a
    return b

variable_1 = data['probability_score']
predicted_class = variable_1.map(myFunction)

data['predicted_class'] = predicted_class

# check the dataframe; the new column is included based on an existing column's data for each row
data.head()
Jillianjillie answered 19/6, 2020 at 12:24 Comment(0)
2
import pandas as pd

# Define a dictionary containing data
data = {'a': [0,0,0.671399,0.446172,0,0.614758],
    'b': [0,0,0.101208,-0.243316,0,0.075793],
    'c': [0,0,-0.181532,0.051767,0,-0.451460],
    'd': [0,0,0.241273,1.577318,0,-0.012493]}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data)

# Declare a list that is to be converted into a column
col_e = [-0.335485,-1.166658,-0.385571,0,0,0]


# add column 'e'
df['e'] = col_e

# Observe the result
df

Sternutation answered 8/11, 2021 at 8:58 Comment(1)
we can try the insert() or assign() method: df.insert(4, 'e', [-0.335485, -1.166658, -0.385571, 0, 0, 0], True) (or) df = df.assign(e=[-0.335485, -1.166658, -0.385571, 0, 0, 0])Sternutation
1

Whenever you add a Series object as a new column to an existing DF, you need to make sure that they both have the same index. Then add it to the DF:

e_series = pd.Series([-0.335485, -1.166658,-0.385571])
print(e_series)
e_series.index = d_f.index
d_f['e'] = e_series
d_f


Urfa answered 2/3, 2021 at 21:9 Comment(0)
0

you can insert a new column with a for loop like this:

for label, row in your_dframe.iterrows():
    your_dframe.loc[label, "new_column_length"] = len(row["any_of_column_in_your_dframe"])

sample code here :

import pandas as pd

data = {
  "any_of_column_in_your_dframe" : ["ersingulbahar","yagiz","TS"],
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

#load data into a DataFrame object:
your_dframe = pd.DataFrame(data)


for label,row in your_dframe.iterrows():
      your_dframe.loc[label,"new_column_length"]=len(row["any_of_column_in_your_dframe"])
      
      
print(your_dframe) 

and output is here:

  any_of_column_in_your_dframe  calories  duration  new_column_length
0                ersingulbahar       420        50               13.0
1                        yagiz       380        40                5.0
2                           TS       390        45                2.0

Note: you can also use it like this:

your_dframe["new_column_length"]=your_dframe["any_of_column_in_your_dframe"].apply(len)
Monopoly answered 12/8, 2021 at 5:33 Comment(0)
0

A simple way to add new columns to an existing dataframe is:

new_cols = ['a', 'b', 'c', 'd']

for col in new_cols:
    df[col] = 0  # assigning 0 as a placeholder

print(df.columns)
Divinize answered 15/9, 2021 at 7:54 Comment(0)
0

If the indices match, simple assignment does the job.

  • for a single column:
    df['e'] = new_series
    
    df = df.assign(e=new_series)
    
  • for multiple columns:
    df[['e', 'f']] = new_dataframe
    
    df = df.assign(**new_dataframe) # can assign multiple columns by unpacking
    

If the above throws a SettingWithCopyWarning, enable Copy-on-Write before the assignment:

pd.set_option('mode.copy_on_write', True)
df['e'] = new_series

If the indices don't match (as in the OP), then relabeling the index of the new column(s) to be the same as the original dataframe using set_axis() and assigning does the job. You can also use .assign(), concat() (as well as join() for multiple columns).

  • for a single column:

    df['e'] = new_column.set_axis(df.index)
    #                    ^^^^^^^^ <--- relabel index
    df = df.assign(e=new_column.set_axis(df.index))
    
  • for multiple columns (2 columns in the below example):

    df[['e', 'f']] = new_columns.set_axis(df.index)
    
    df = df.assign(**new_columns.set_axis(df.index).set_axis(['e', 'f'], axis=1))
    #                            ^^^^ relabel index ^^^^^^ relabel columns
    
    df = pd.concat((df, new_columns.set_axis(df.index)), axis=1)
    
    df = df.join(new_columns.set_axis(df.index))
    

This method is particularly useful when assigning multiple columns where the dtypes are mixed, in which case converting to numpy ndarray using values or to_numpy() mangles the dtypes which you probably want to avoid.

df1 = pd.DataFrame({'a': range(3), 'b': [*'abc']}, index=[2,3,5])
df2 = pd.DataFrame({'c': [10, 20, 30], 'd':[0.5, 1.5, 2.5]}) # column 'c' dtype is int

df1[['c', 'd']] = df2.values   # now column 'c' dtype is float (because dtype of 'd' is float)

df1 = df1.join(df2.set_axis(df1.index))   # dtypes are preserved
df1[['c', 'd']] = df2.set_axis(df1.index) # dtypes are preserved
Avruch answered 16/9, 2023 at 20:10 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.