Reading a text file and splitting it into single words in python
I have a text file made up of numbers and words, for example like this: 09807754 18 n 03 aristocrat 0 blue_blood 0 patrician. I want to split it so that each word or number comes up on a new line.

A whitespace separator would be ideal, as I would like the words with the underscores (like blue_blood) to stay connected.

This is what I have so far:

f = open('words.txt', 'r')
for word in f:
    print(word)

I'm not really sure how to go on from here. I would like this to be the output:

09807754
18
n
03
aristocrat
...
Concatenate answered 4/6, 2013 at 15:50 Comment(2)
Does that data literally have quotes around it? Is it "09807754 18 n 03 aristocrat 0 blue_blood 0 patrician" or 09807754 18 n 03 aristocrat 0 blue_blood 0 patrician in the file?Summon
Following up on the comment above: does that data literally have quotes around it?Europium

Given this file:

$ cat words.txt
line1 word1 word2
line2 word3 word4
line3 word5 word6

If you just want one word at a time (ignoring the meaning of spaces vs line breaks in the file):

with open('words.txt', 'r') as f:
    for line in f:
        for word in line.split():
            print(word)

Prints:

line1
word1
word2
line2
...
word6 

Similarly, if you want to flatten the file into a single flat list of words, you might do something like this:

with open('words.txt') as f:
    flat_list = [word for line in f for word in line.split()]

>>> flat_list
['line1', 'word1', 'word2', 'line2', 'word3', 'word4', 'line3', 'word5', 'word6']

This can produce the same output as the first example:
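
>>> print('\n'.join(flat_list))
line1
word1
word2
line2
...
word6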

Or, if you want a nested list of the words in each line of the file (for example, to create a matrix of rows and columns from a file):

with open('words.txt') as f:
    matrix=[line.split() for line in f]

>>> matrix
[['line1', 'word1', 'word2'], ['line2', 'word3', 'word4'], ['line3', 'word5', 'word6']]
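
For example, you can then index rows and columns directly (row first, then column):

>>> matrix[0]
['line1', 'word1', 'word2']
>>> matrix[1][2]
'word4'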

If you want a regex solution, which would allow you to filter wordN vs lineN type words in the example file:

import re
with open("words.txt") as f:
    for line in f:
        for word in re.findall(r'\bword\d+', line):
            print(word)  # prints each wordN, skipping the lineN words

Or, if you want that to be a line-by-line generator with a regex:

with open("words.txt") as f:
    words = (word for line in f for word in re.findall(r'\w+', line))
    # the generator reads from f lazily, so consume it while the file is open
Summon answered 4/6, 2013 at 15:56 Comment(9)
How is a file object iterable (for line in f:)?Forever
@haccks: It is the suggested idiom for looping line-by-line over a file. See also this SO postSummon
I just wanted to know the mechanism behind this; how it works?Forever
The open creates a file object. Python file objects support line-by-line iteration for text files. So each pass through the for loop yields one line of the file. At the end of the file, the file object raises StopIteration and we are done with the file. A deeper explanation of the mechanism is more than I can do in a comment.Summon
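
A minimal sketch of the mechanism described above, using the words.txt file from this answer (a file object is its own iterator, and next() drives it):

>>> f = open('words.txt')
>>> iter(f) is f
True
>>> next(f)
'line1 word1 word2\n'
>>> next(f)
'line2 word3 word4\n'
>>> next(f)
'line3 word5 word6\n'
>>> next(f)
Traceback (most recent call last):
  ...
StopIteration
>>> f.close()
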
You can also load the file into main memory and use the re library, as shown here: #7633774Abrade
I love the different ways and the discussion of when each might be used. Very clear, concise, and thorough.Norford
Maybe we should care about closing the file? @SummonParakeet
@FlorentJousse: When you use with to open the file, the file is closed at the end of the with block. No need to manually close it. If you use a bare open, it is indeed good practice to close that file when finished. All the examples here use with, and therefore there is no close to worry about.Summon
Okay, thank you dawg. I was using your code in a loop and was wondering if the close() was missing. That's perfect!Parakeet
with open('words.txt') as f:
    for word in f.read().split():
        print(word)
Solorio answered 4/6, 2013 at 16:5 Comment(0)

As a supplement, if you are reading a very large file and you don't want to read all of the content into memory at once, you might consider using a buffer, then returning each word by yield:

def read_words(inputfile):
    with open(inputfile, 'r') as f:
        while True:
            buf = f.read(10240)
            if not buf:
                break

            # make sure we end on whitespace (a word boundary) by reading
            # one more character at a time until we hit a space or EOF
            while not buf[-1].isspace():
                ch = f.read(1)
                if not ch:
                    break
                buf += ch

            words = buf.split()
            for word in words:
                yield word

if __name__ == "__main__":
    # process() is a placeholder for whatever you want to do with each word
    for word in read_words('./very_large_file.txt'):
        process(word)
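
Since process is not defined in the answer, here is one hypothetical stand-in: feeding the generator into collections.Counter to build a word-frequency table without loading the whole file at once:

from collections import Counter

# count how often each word occurs, one word at a time
counts = Counter(read_words('./very_large_file.txt'))
print(counts.most_common(10))  # the ten most frequent words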
Phonics answered 11/3, 2017 at 7:3 Comment(5)
For those interested in performance, this is an order of magnitude faster than the itertools answer.Painful
Why 10240? I'm assuming that's bytes, so around 10 KB? How big can the buffer be, and if I am interested in performance, is a smaller or larger buf better?Holey
I'm confused, what does process do? It isn't defined...Mansion
@Mansion read_words is a generator. It returns the words to the for word in ... loop one by one. Presumably, process does something useful with each of the words, e.g., compiling a frequency distribution (see collections.defaultdict) or perhaps a word-length histogram.Hamiltonian
@Holey 10 KB is reasonable. If performance matters, experiment! In the past, I recall finding 16 KB to be a sweet spot and a megabyte slightly slower. It should probably be a multiple of your system's disk allocation unit size. BTW, the generator should handle the buffer-straddling problem with an auxiliary word-assembling buffer, not by extending the main buffer in search of a space (see the sketch below)!Hamiltonian
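
A minimal sketch of the carry-over approach suggested in that comment: keep any partial trailing word from one buffer and prepend it to the next read, instead of extending the main buffer character by character (read_words_carry and the 16 KB buffer size are illustrative choices, not from the original answer):

def read_words_carry(inputfile, buf_size=16384):
    carry = ''  # partial word left over from the previous buffer
    with open(inputfile, 'r') as f:
        while True:
            buf = f.read(buf_size)
            if not buf:
                break
            chunk = carry + buf
            words = chunk.split()
            # if the chunk doesn't end on whitespace, the last token may
            # be an incomplete word; hold it back for the next iteration
            if not chunk[-1].isspace() and words:
                carry = words.pop()
            else:
                carry = ''
            for word in words:
                yield word
    if carry:
        yield carry  # final word when the file doesn't end on whitespace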

What you can do is use nltk to tokenize the words and then store all of them in a list. If you don't know nltk, it stands for Natural Language Toolkit and is used to process natural language. Here's a resource if you want to get started: http://www.nltk.org/book/

import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the 'punkt' tokenizer models;
# if they are missing, download them once with nltk.download('punkt')
with open("abc.txt") as file:
    result = file.read()
words = word_tokenize(result)
for i in words:
    print(i)

The output will be this:

09807754
18
n
03
aristocrat
0
blue_blood
0
patrician
Stratopause answered 24/3, 2018 at 11:37 Comment(0)
with open(filename) as file:
    words = file.read().split()

It's a list of all the words in your file. If you want only the words made of letters and hyphens, dropping the numbers, you can use a regex instead:

import re
with open(filename) as file:
    words = re.findall(r"([a-zA-Z\-]+)", file.read())
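
Note that this pattern keeps only letters and hyphens, so on the question's sample line it drops the numbers and also splits blue_blood at the underscore (add _ to the character class if you want it kept whole):

>>> import re
>>> re.findall(r"([a-zA-Z\-]+)", "09807754 18 n 03 aristocrat 0 blue_blood 0 patrician")
['n', 'aristocrat', 'blue', 'blood', 'patrician']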
Condiment answered 20/1, 2019 at 8:38 Comment(0)

Here is my totally functional approach, which avoids having to read and split lines. It makes use of the itertools module:

Note: for Python 3, replace itertools.imap with the built-in map (a ported version is sketched below, after the function).

import itertools

def readwords(mfile):
    byte_stream = itertools.groupby(
        itertools.takewhile(lambda c: bool(c),
            itertools.imap(mfile.read,
                itertools.repeat(1))), str.isspace)

    return ("".join(group) for pred, group in byte_stream if not pred)

Sample usage:

>>> import sys
>>> for w in readwords(sys.stdin):
...     print (w)
... 
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
           
It's soo very Functional!
It's
soo
very
Functional!
>>>

I guess in your case, this would be the way to use the function:

with open('words.txt', 'r') as f:
    for word in readwords(f):
        print(word)
Haemoid answered 29/11, 2016 at 5:22 Comment(0)
