How to remove every word with non-alphabetic characters

8

11

I need to write a Python script that removes every word containing non-alphabetic characters from a text file, in order to test Zipf's law. For example:

asdf@gmail.com said: I've taken 2 reports to the boss

to

taken reports to the boss

How should I proceed?

Illuminative answered 29/9, 2017 at 9:44 Comment(1)
Looks like a job for regex. - Sensitivity
P
9

Using regular expressions to match only letters (and underscores), you can do this:

import re

s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()

tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
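Note that `[^\W\d]` also lets underscores through and, in Python 3, matches non-ASCII letters as well. If only the ASCII letters a-z and A-Z should count, a minimal sketch could use `re.fullmatch` with an explicit character class. The name `ascii_word` and the sample sentence are only illustrative:

```python
import re

# Illustrative pattern: one or more ASCII letters and nothing else.
ascii_word = re.compile(r'[A-Za-z]+')

s = "foo_bar said: I've taken 2 reports to the boss"
tokens = s.split()

# fullmatch requires the whole token to match, so no trailing '$' is needed.
clean_tokens = [t for t in tokens if ascii_word.fullmatch(t)]
clean_s = ' '.join(clean_tokens)
print(clean_s)
```

Here `foo_bar` is dropped because of the underscore, which the original pattern would have kept.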
Preparator answered 29/9, 2017 at 9:55 Comment(0)
A
6

Try this:

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
words = [word for word in sentence.split() if word.isalpha()]
# ['taken', 'reports', 'to', 'the', 'boss']

result = ' '.join(words)
# taken reports to the boss
Afrikander answered 29/9, 2017 at 9:59 Comment(0)
B
3

You can use split() and isalpha() to get a list of the words that contain only alphabetic characters and have at least one character.

>>> sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']

You can then use join() to make the list into one string:

>>> alpha_only_string = " ".join(alpha_words)
>>> print(alpha_only_string)
taken reports to the boss
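Since the goal in the question is to test Zipf's law, the same filter can feed a frequency count. This is only a rough sketch: the `io.StringIO` object stands in for a real file opened with `open(...)`, its contents are made up, and lowercasing the words is my own assumption about how you would want to count them.

```python
import io
from collections import Counter

# Stand-in for a real file object; the contents are invented sample text.
text_file = io.StringIO(
    "the boss said the reports are late\n"
    "I've taken 2 reports to the boss\n"
)

counts = Counter()
for line in text_file:
    # Keep only purely alphabetic tokens, lowercased so 'The' and 'the' merge.
    counts.update(word.lower() for word in line.split() if word.isalpha())

# Rank words by descending frequency; Zipf's law is about how
# frequency falls off as the rank increases.
for rank, (word, freq) in enumerate(counts.most_common(3), start=1):
    print(rank, word, freq)
```

Reading line by line keeps memory use low for large files, and `Counter.most_common()` gives the rank-frequency pairs directly.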
Bowknot answered 29/9, 2017 at 10:11 Comment(0)
M
2

The nltk package is specialised in handling text and has various functions you can use to 'tokenize' text into words.

You can either use the RegexpTokenizer, or the word_tokenize with a slight adaptation.

The simplest option is the RegexpTokenizer:

import nltk

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = nltk.RegexpTokenizer(r'\w+').tokenize(text)

Which returns:

['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']

Or you can use the slightly smarter word_tokenize which is able to split most contractions like didn't into did and n't.

import re
import nltk
nltk.download('punkt')  # You only have to do this once

def contains_letters(phrase):
    return bool(re.search('[a-zA-Z]', phrase))

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]

which returns:

['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']
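If whole tokens should be dropped rather than split, one possible approach that needs no NLTK is to split on whitespace, strip punctuation from the ends of each token, and then test with isalpha(). This is a sketch under my own assumptions (the sample text uses a placeholder address), not part of NLTK's API. It keeps words like "boss." that a plain isalpha() check would discard because of the trailing period:

```python
import string

# Sample text with a placeholder address; swap in your own input.
text = "user@example.com said: I've taken 2 reports to the boss. I didn't do the other things."

# Strip leading/trailing punctuation from each whitespace-separated token,
# then keep the token only if what remains is purely alphabetic.
stripped = (token.strip(string.punctuation) for token in text.split())
alpha_words = [token for token in stripped if token.isalpha()]
print(alpha_words)
```

Tokens with punctuation or digits inside them, such as the address, "I've", "didn't" and "2", are still removed entirely.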
Magnetostriction answered 29/9, 2017 at 10:58 Comment(1)
This isn't what the question is asking. They don't want to split "asdf@gmail.com" into "asdf" and "gmail.com". It should be removed entirely because it contains non-letter characters. - Hypogynous
R
0

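Maybe this will help:

string = "asdf@gmail.com said: I've taken 2 reports to the boss"  # sample input
array = string.split(' ')
result = []
for word in array:  # keep only the purely alphabetic words
    if word.isalpha():
        result.append(word)
string = ' '.join(result)
# 'taken reports to the boss'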
Rutabaga answered 29/9, 2017 at 9:57 Comment(0)
J
0

You can use either a regex or a built-in Python string method such as isalpha().

Example using isalpha()

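with open('file path') as f:
    # Go through the file line by line, not just the first line
    for line in f:
        for word in line.split():
            # Print only the purely alphabetic words, separated by spaces
            if word.isalpha():
                print(word + ' ', end='')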
Jeans answered 29/9, 2017 at 9:59 Comment(0)
B
0

str.join() plus a list comprehension gives you a one-line solution:

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
' '.join([i for i in sentence.split() if i.isalpha()])
#'taken reports to the boss'
Bartlett answered 29/9, 2017 at 10:4 Comment(0)
H
0

I ended up writing my own function for this because the regexes and isalpha() weren't working for the test cases I had.

letters = set('abcdefghijklmnopqrstuvwxyz')

def only_letters(word):
    for char in word.lower():
        if char not in letters:
            return False
    return True

# only 'asdf' is valid here
hard_words = ['ís', 'る', '<|endoftext|>', 'asdf']

print([x for x in hard_words if only_letters(x)])
# prints ['asdf']
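The reason isalpha() fails on cases like these is that it returns True for any Unicode letter, not just a-z. On Python 3.7 or later, a shorter variant might combine it with str.isascii(). The function name `only_ascii_letters` and the sample list are just for illustration:

```python
def only_ascii_letters(word):
    # str.isascii() needs Python 3.7+; isalpha() on its own
    # also accepts letters such as 'í' or 'る'.
    return word.isascii() and word.isalpha()

sample_words = ['ís', 'る', 'a_b', 'asdf', '']
ascii_only = [w for w in sample_words if only_ascii_letters(w)]
print(ascii_only)
```

One difference from the loop version: this rejects the empty string, whereas only_letters('') returns True.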
Hypogynous answered 18/6, 2021 at 2:26 Comment(0)
