How to remove numbers from string terms in a pandas dataframe

5

58

I have a data frame similar to the one below:

Name    Volume  Value
May21   23      21321
James   12      12311
Adi22   11      4435
Hello   34      32454
Girl90  56      654654

I want the output to be in the format:

Name    Volume  Value
May     23      21321
James   12      12311
Adi     11      4435
Hello   34      32454
Girl    56      654654

I want to remove all the digits from the Name column.

The closest I have come is doing it at the cell level with the following code:

result = ''.join([i for i in df['Name'][1] if not i.isdigit()])

Is there a better way to do this at the Series/DataFrame level?

Rillings answered 18/1, 2017 at 12:20 Comment(0)
S
124

You can apply str.replace to the Name column in combination with regular expressions:

import pandas as pd

# Example DataFrame
df = pd.DataFrame.from_dict({'Name'  : ['May21', 'James', 'Adi22', 'Hello', 'Girl90'],
                             'Volume': [23, 12, 11, 34, 56],
                             'Value' : [21321, 12311, 4435, 32454, 654654]})

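# regex=True is required since pandas 2.0, where str.replace defaults to literal matching
df['Name'] = df['Name'].str.replace(r'\d+', '', regex=True)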

print(df)

Output:

    Name  Volume   Value
0    May      23   21321
1  James      12   12311
2    Adi      11    4435
3  Hello      34   32454
4   Girl      56  654654

In the regular expression \d stands for "any digit" and + stands for "one or more".

Thus, str.replace with the pattern \d+ and the replacement '' means: "remove every run of digits from the strings".
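As a small illustration of how the pattern behaves, here is a sketch on a made-up Series (the sample values are hypothetical, not from the question) with digits at the start, in the middle, and at the end of the strings. It passes regex=True explicitly, because recent pandas versions (2.0 and later) treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical sample values, chosen to put digits in different positions.
names = pd.Series(['May21', '7Up', 'R2D2', 'James'])

# \d+ matches every run of one or more digits, wherever it occurs.
# regex=True is explicit because the default is regex=False since pandas 2.0.
cleaned = names.str.replace(r'\d+', '', regex=True)

print(cleaned.tolist())  # ['May', 'Up', 'RD', 'James']
```

Note that the digits are removed everywhere in the string, not only at the end.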

Sombrero answered 18/1, 2017 at 12:29 Comment(0)
C
22

You can do it like so:

df.Name = df.Name.str.replace(r'\d+', '', regex=True)

To play and explore, check the online Regular expression demo here: https://regex101.com/r/Y6gJny/2

Whatever is matched by the pattern \d+, i.e. one or more digits, will be replaced by an empty string.

Clydesdale answered 18/1, 2017 at 12:27 Comment(0)
J
18

The .str accessor is not necessary. You can use DataFrame.replace or Series.replace with the regex=True argument:

df.replace(r'\d+', '', regex=True)

This returns a new DataFrame. If you want to change the source DataFrame, use inplace=True:

df.replace(r'\d+', '', regex=True, inplace=True)
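One point worth checking when calling replace on the whole DataFrame is what happens to the numeric columns. The sketch below, using a cut-down version of the question's data, suggests that the regex substitution only touches string cells, so integer columns such as Volume are left alone. It also shows how to limit the replacement to a single column.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['May21', 'James', 'Adi22'],
                   'Volume': [23, 12, 11]})

# Regex replacement on the whole frame: only string values are substituted.
result = df.replace(r'\d+', '', regex=True)

print(result['Name'].tolist())    # ['May', 'James', 'Adi']
print(result['Volume'].tolist())  # [23, 12, 11]

# Restricting the replacement to one column with Series.replace.
df['Name'] = df['Name'].replace(r'\d+', '', regex=True)
```

If the frame has other string columns that legitimately contain digits (IDs, postcodes), the single-column form is the safer choice.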
Jordaens answered 3/1, 2020 at 18:7 Comment(1)
Can we use this to clean bad data where there is a space inside a numeric value, e.g. "12 445"? (Hitchhike)
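For the case raised in the comment, the same regex approach should work with a whitespace pattern instead of a digit pattern. This is only a sketch: the Series below is made up, and it assumes the values are strings that become valid integers once the spaces are gone.

```python
import pandas as pd

# Hypothetical column of numeric strings with stray spaces.
raw = pd.Series(['12 445', '7 000', '310'])

# \s+ matches runs of whitespace; removing them leaves only the digits.
no_spaces = raw.str.replace(r'\s+', '', regex=True)

# Convert the cleaned strings to numbers.
as_numbers = pd.to_numeric(no_spaces)

print(as_numbers.tolist())  # [12445, 7000, 310]
```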
C
11

Although the question sounds more general, the example input only contains trailing numbers. In this case you don't have to use regular expressions, since .rstrip (also available via the .str accessor of Series objects) can do exactly this:

import string
df['Name'] = df['Name'].str.rstrip(string.digits)

Similarly, you can use .lstrip to strip any digits from the start, or .strip to remove any digits from the start and the end of each string.
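To make the difference between the three variants concrete, here is a short sketch on made-up values (not from the question). None of them removes digits in the middle of a string, which is where they differ from the regex approach.

```python
import string
import pandas as pd

# Hypothetical sample values with digits in various positions.
names = pd.Series(['May21', '7Up', 'R2D2', '42abc7'])

# rstrip: trailing digits only.
print(names.str.rstrip(string.digits).tolist())  # ['May', '7Up', 'R2D', '42abc']

# lstrip: leading digits only.
print(names.str.lstrip(string.digits).tolist())  # ['May21', 'Up', 'R2D2', 'abc7']

# strip: leading and trailing digits, but not those in the middle.
print(names.str.strip(string.digits).tolist())   # ['May', 'Up', 'R2D', 'abc']
```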

Cornwallis answered 29/12, 2019 at 13:48 Comment(0)
D
0

For some reason I don't understand, .replace did not work for me. So I used:

df["Name"] = df["Name"].apply(lambda x: ''.join(y for y in x if not y.isdigit()))

to keep everything except the digits.
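One possible explanation, offered as a guess since the exact call is not shown: without regex=True, Series.replace only swaps values that equal the pattern exactly, and since pandas 2.0 Series.str.replace treats the pattern as a literal string by default. In both cases the data comes back unchanged. The sketch below compares those calls with the regex and apply versions on a small sample.

```python
import pandas as pd

names = pd.Series(['May21', 'James', 'Girl90'])

# Series.replace without regex=True: looks for values equal to '\d+', finds none.
unchanged = names.replace(r'\d+', '')

# str.replace with regex=False: searches for the literal text '\d+', finds none.
literal = names.str.replace(r'\d+', '', regex=False)

# str.replace with regex=True: removes every run of digits.
fixed = names.str.replace(r'\d+', '', regex=True)

# The apply-based version from this answer gives the same result here.
via_apply = names.apply(lambda x: ''.join(ch for ch in x if not ch.isdigit()))

print(unchanged.tolist())  # ['May21', 'James', 'Girl90']
print(fixed.tolist())      # ['May', 'James', 'Girl']
```

The apply version raises a TypeError if the column contains missing values (NaN is a float, which cannot be iterated), whereas the .str methods pass them through.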

Dialogize answered 16/4 at 7:25 Comment(0)
