I am doing textual analysis on texts that, due to PDF-to-txt conversion errors, sometimes lump words together. So instead of matching whole words, I want to match substrings.
For example, I have the string:
mystring='The lossof our income made us go into debt but this is not too bad as we like some debts.'
And I search for
key_words=['loss', 'debt', 'debts', 'elephant']
The output should be of the form:
Filename  Debt  Debts  Loss  Elephant
mystring  2     1      1     0
The code I have works well, except for a few glitches: 1) it does not report zero-frequency words (so 'Elephant' would not be in the output); 2) the order of the words in key_words seems to matter (i.e., I sometimes get 1 count each for 'debt' and 'debts', and sometimes it reports only 2 counts for 'debt', and 'debts' is not reported). I could live with the second point if I managed to "print" the variable names to the dataset, but I am not sure how.
Below is the relevant code. Thanks! PS. Needless to say, it is not the most elegant piece of code, but I am slowly learning.
import collections
import csv
import glob
import re
from string import punctuation

bad=set(['debts', 'debt'])
csvfile=open("freq_10k_test.csv", "w", newline='', encoding='cp850', errors='replace')
writer=csv.writer(csvfile)
for filename in glob.glob('*.txt'):
    with open(filename, encoding='utf-8', errors='ignore') as f:
        file_name=[]
        file_name.append(filename)
        new_review=[f.read()]
    freq_all=[]
    rev=[]
    for review in new_review:
        # lower-case the text and strip punctuation
        review_processed=review.lower()
        for p in list(punctuation):
            review_processed=review_processed.replace(p,'')
        # one alternation pattern built from all the keywords
        pattern = re.compile("|".join(bad), flags = re.IGNORECASE)
        freq_iter=collections.Counter(pattern.findall(review_processed))
        # counts only, ordered by matched keyword
        frequency=[value for (key,value) in sorted(freq_iter.items())]
        freq_all.append(frequency)
    freq=[v for v in freq_all]
    # one row per file: the filename followed by the counts
    fulldata = [ [file_name[i]] + freq for i, freq in enumerate(freq)]
    writer=csv.writer(open("freq_10k_test.csv",'a',newline='', encoding='cp850', errors='replace'))
    writer.writerows(fulldata)
    csvfile.flush()
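
To make the target concrete, here is a minimal sketch of the counting I am after, for a single string. It is not my actual code: `count_substrings` is a hypothetical helper, it uses `str.count` per keyword instead of the regex above, and it writes to an in-memory buffer rather than the real CSV file. It assumes that matches may overlap across keywords, so 'debts' also counts towards 'debt', as in the desired output.

```python
import csv
import io

def count_substrings(text, key_words):
    """Count each keyword as a case-insensitive substring of text."""
    lowered = text.lower()
    # Each keyword is counted independently, so a keyword that never
    # occurs still gets an explicit 0, and list order does not matter.
    return [lowered.count(word.lower()) for word in key_words]

key_words = ['debt', 'debts', 'loss', 'elephant']
mystring = ('The lossof our income made us go into debt but this is '
            'not too bad as we like some debts.')

buffer = io.StringIO()  # stands in for the real CSV file
writer = csv.writer(buffer)
# Header row with the keyword names, then one row of counts.
writer.writerow(['Filename'] + [w.capitalize() for w in key_words])
writer.writerow(['mystring'] + count_substrings(mystring, key_words))
print(buffer.getvalue())
```

For the example string this gives 2, 1, 1 and 0 for 'debt', 'debts', 'loss' and 'elephant', which is the table I would like to get for every file.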