python - increase efficiency of large-file search by readlines(size)
I am new to Python and I am currently using Python 2. I have some source files, each of which consists of a huge amount of data (approx. 19 million lines). They look like the following:

apple   \t N   \t apple
n&apos
garden  \t N   \t garden
b\ta\md 
great   \t Adj \t great
nice    \t Adj \t (unknown)
etc

My task is to search the 3rd column of each file for some target words, and every time a target word is found in the corpus, the 10 words before and after it have to be added to a multidimensional dictionary.
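The structure of the dictionary I am building looks roughly like this; the values below are invented just to illustrate the shape:

# targets[lemma][pos][context_lemma][context_pos] = count  (illustrative values only)
targets = {
    'apple': {
        'N': {
            'garden': {'N': 3},
            'great': {'Adj': 1},
        }
    }
}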

EDIT: Lines containing '&', '\' or the string '(unknown)' should be excluded.

I tried to solve this using readlines() and enumerate() as you see in the code below. The code does what it should but it is obviously not efficient enough for the amount of data provided in the source file.

I know that readlines() or read() should not be used for huge data sets, as they load the whole file into memory. However, when reading the file line by line, I did not manage to use enumerate to get the 10 words before and after the target word. I also cannot use mmap, as I do not have permission to use it on that file.

So, I guess readlines() with some size limit would be the most efficient solution. But wouldn't that introduce errors? Every time the end of a chunk is reached, the 10 words after a target word near the boundary would not be captured, because the code simply breaks off there.
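For the 10 words before the target, I imagine something like a fixed-size window, e.g. collections.deque(maxlen=10). This is only a rough sketch of the idea (not my working code, and the function name is made up), and it still does not give me the 10 words after the match:

from collections import deque

# Rough sketch (hypothetical): a sliding window of the previous lines only.
def find_targets_with_before_context(path, targets, window=10):
    before = deque(maxlen=window)          # automatically drops the oldest line
    with open(path) as f:
        for line in f:
            line = line.strip()
            parts = line.split('\t')
            if len(parts) > 2 and parts[2] in targets:
                yield parts[2], list(before)   # target plus up to 10 previous lines
            before.append(line)
# ...but I still do not see how to capture the 10 lines *after* the match this way.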

import os
import re
import csv
import gzip

def get_target_to_dict(file):
    targets_dict = {}
    with open(file) as f:
        for line in f:
            targets_dict[line.strip()] = {}
    return targets_dict

targets_dict = get_target_to_dict('targets_uniq.txt')
# browse directory and process each file 
# find the target words to include the 10 words before and after to the dictionary
# exclude lines starting with <,-,; to just have raw text

def get_co_occurence(path_file_dir, targets, results):
    lines = []
    for file in os.listdir(path_file_dir):
        if file.startswith('corpus'):
            path_file = os.path.join(path_file_dir, file)
            with gzip.open(path_file) as corpusfile:
                # PROBLEMATIC CODE HERE
                # lines = corpusfile.readlines()
                for line in corpusfile:
                    if re.match('[A-Z]|[a-z]', line):
                        if '(unknown)' in line:
                            continue
                        elif '\\' in line:
                            continue
                        elif '&' in line:
                            continue
                        lines.append(line)
                for i, line in enumerate(lines):
                    line = line.strip()
                    if re.match('[A-Z]|[a-z]', line):
                        parts = line.split('\t')
                        lemma = parts[2]
                        if lemma in targets:
                            pos = parts[1]
                            if pos not in targets[lemma]:
                                targets[lemma][pos] = {}
                            counts = targets[lemma][pos]
                            context = []
                            # look at the 10 previous lines
                            for j in range(max(0, i - 10), i):
                                context.append(lines[j])
                            # look at the next 10 lines
                            for j in range(i + 1, min(i + 11, len(lines))):
                                context.append(lines[j])
                            # END OF PROBLEMATIC CODE
                            for context_line in context:
                                context_line = context_line.strip()
                                parts_context = context_line.split('\t')
                                context_lemma = parts_context[2]
                                if context_lemma not in counts:
                                    counts[context_lemma] = {}
                                context_pos = parts_context[1]
                                if context_pos not in counts[context_lemma]:
                                    counts[context_lemma][context_pos] = 0
                                counts[context_lemma][context_pos] += 1
                csvwriter = csv.writer(results, delimiter='\t')
                for k, v in targets.iteritems():
                    for k2, v2 in v.iteritems():
                        for k3, v3 in v2.iteritems():
                            for k4, v4 in v3.iteritems():
                                csvwriter.writerow([str(k), str(k2), str(k3), str(k4), str(v4)])
                                #print(str(k) + "\t" + str(k2) + "\t" + str(k3) + "\t" + str(k4) + "\t" + str(v4))

results = open('results_corpus.csv', 'wb')
word_occurrence = get_co_occurence(path_file_dir, targets_dict, results)

I copied the whole code for completeness, as it is all part of one function that creates a multidimensional dictionary from all the extracted information and then writes it to a CSV file.

I would really appreciate any hint or suggestion to make this code more efficient.

EDIT: I corrected the code so that it takes into account exactly the 10 words before and after the target word.

Illusive answered 11/11, 2016 at 9:38 Comment(5)
You can do it efficiently using map, filter, groupby and islice. – Rhomb
Thank you, I read about them and they seem to be very efficient. Would you mind elaborating a little more with respect to the above code? To use map, I definitely need the corpusfile to be a list, right? – Illusive
Are you looking for the 10 previous words in the column, or simply the 10 previous words? – Rhomb
I am looking for exactly the 10 previous words in column 3. – Illusive
This is probably better suited for the Code Review Stack Exchange. – Folacin

My idea was to create one buffer to store the 10 lines before the current line and another buffer to store the 10 lines after it. As the file is read, each line is pushed into the before-buffer, and the oldest entry is popped off once the buffer exceeds 10 lines.

For the after-buffer, I first clone another iterator from the file iterator. Both iterators then run in parallel within the loop, with the clone running 10 iterations ahead to collect the next 10 lines.

This avoids readlines() and loading the whole file into memory. Hope it works for you in the actual case.
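As a side note, here is a tiny standalone sketch of the cloned-iterator idea (just an illustration, separate from the full code below):

import itertools

lines = iter(["l1", "l2", "l3", "l4", "l5"])
current, ahead = itertools.tee(lines)   # two independent views of the same stream
next(ahead, None)                       # the clone now runs one line ahead
for cur, nxt in zip(current, ahead):
    print(cur + ' -> ' + nxt)           # each line paired with the line after it
# l1 -> l2, l2 -> l3, l3 -> l4, l4 -> l5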

Edited: the before/after buffers are only filled if column 3 does not contain any of '&', '\', '(unknown)'. I also changed split('\t') into just split(), so it handles any whitespace as well as tabs.

import itertools
import os
import re
def get_co_occurence(path_file_dir, targets, results):
    excluded_words = ['&', '\\', '(unknown)'] # modify excluded words here 
    for file in os.listdir(path_file_dir): 
        if file.startswith('testset'): 
            path_file = os.path.join(path_file_dir, file) 
            with open(path_file) as corpusfile: 
                # CHANGED CODE HERE
                before_buf = [] # buffer to store before 10 lines 
                after_buf = []  # buffer to store after 10 lines 
                corpusfile, corpusfile_clone = itertools.tee(corpusfile) # clone file iterator to access next 10 lines 
                for line in corpusfile: 
                    line = line.strip() 
                    if re.match('[A-Z]|[a-z]', line): 
                        parts = line.split() 
                        lemma = parts[2]

                        # before buffer handling, fill buffer excluded line contains any of excluded words 
                        if not any(w in line for w in excluded_words): 
                            before_buf.append(line) # append to before buffer 
                        if len(before_buf)>11: 
                            before_buf.pop(0) # keep the current line plus at most the 10 previous lines
                        # next buffer handling
                        while len(after_buf)<=10: 
                            try: 
                                after = next(corpusfile_clone) # advance 1 iterator 
                                after_lemma = '' 
                                after_tmp = after.split()
                                if re.match('[A-Z]|[a-z]', after) and len(after_tmp)>2: 
                                    after_lemma = after_tmp[2]
                            except StopIteration: 
                                break # copy iterator will exhaust 1st coz its 10 iteration ahead 
                            if after_lemma and not any(w in after for w in excluded_words): 
                                after_buf.append(after) # append to buffer
                                # print 'after',z,after, ' - ',after_lemma
                        if (after_buf and line in after_buf[0]):
                            after_buf.pop(0) # pop off one ready for next

                        if lemma in targets: 
                            pos = parts[1] 
                            if pos not in targets[lemma]: 
                                targets[lemma][pos] = {} 
                            counts = targets[lemma][pos] 
                            # context = [] 
                            # look at 10 previous lines 
                            context= before_buf[:-1] # minus out current line 
                            # look at the next 10 lines 
                            context.extend(after_buf) 

                            # END OF CHANGED CODE
                            # CONTINUE YOUR STUFF HERE WITH CONTEXT
Pitchman answered 11/11, 2016 at 16:39 Comment(8)
Wow, good idea! Thank you so much for your help and your code. I will try it later today and give you feedback right away. – Illusive
Thank you, this is very helpful. I did not take into account that in the source file (corpusfile) there are also lines that should be excluded before reading them into the buffer (lines containing '&', '\' or '(unknown)', see edit). I have already been trying to add this to your code the whole day but did not get anywhere. Do you have a suggestion? It should definitely come after for line in corpusfile: line = line.strip(). However, the whole buffers get messed up then. – Illusive
It looks like your original code doesn't do what you describe either: it just grabs the previous and next 10 lines regardless of what they contain, and only checks them while processing the context; if 2 of the 10 preceding lines contain an invalid word like (unknown), you are left with only 8 lines. So what you want is to filter first and make sure all 10 lines in the before and after buffers are valid, without any of the filter words, am I right? I will try to edit my code for this later. – Pitchman
Edited the answer to address your comments, hope it works for your needs :) – Pitchman
Oh yes, you are right, I will fix that, sorry. Great, thank you for putting so much effort into fixing my code, I really appreciate it! I will try it in a minute, but it looks to me as if it does what it should. – Illusive
Just one more thing, if you do not mind: for the cloned file I also need to exclude lines that do not start with [A-Z]|[a-z], like in the normal corpusfile, as otherwise an error occurs. I tried to do that in the while loop in the try clause with an additional if statement (if re.match('[A-Z]|[a-z]', after): after_lemma = after.split()[2] else: continue). However, this does one additional iteration that I would like to avoid. Do you have a suggestion to fix that? – Illusive
Let us continue this discussion in chat. – Pitchman
There you go. What you added looks logically correct to me: if the line doesn't match, it is excluded; that is what the extra iteration does, and it only advances the corpusfile_clone iterator. I do it a bit differently, but the end result should be the same. – Pitchman

A functional alternative written in Python 3.5. I simplified your example to take only 5 words on each side. There are other simplifications with respect to junk-value filtering, but they only require minor modifications. I will use the package fn from PyPI to make this functional code more natural to read.

from typing import List, Tuple
from itertools import groupby, filterfalse
from fn import F

First we need to extract the column:

def getcol3(line: str) -> str:
    return line.split("\t")[2]

Then we need to split the lines into blocks separated by a predicate:

TARGET_WORDS = {"target1", "target2"}

# this is our predicate
def istarget(word: str) -> bool:
    return word in TARGET_WORDS        

Let's filter out junk and write a function that takes the first and the last 5 words:

def isjunk(word: str) -> bool:
    return word == "(unknown)"

def first_and_last(words: List[str]) -> Tuple[List[str], List[str]]:
    first = words[:5]
    last = words[-5:]
    return first, last

Now, let's get the groups:

words = (F() >> (map, str.strip) >> (filter, bool) >> (map, getcol3) >> (filterfalse, isjunk))(lines)
groups = groupby(words, istarget)
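For readers who do not want the fn dependency, the pipeline above is roughly equivalent to the following plain-Python steps, using the helpers defined earlier:

# Equivalent pipeline without fn:
stripped = map(str.strip, lines)       # remove trailing newlines
nonempty = filter(bool, stripped)      # drop blank lines
col3 = map(getcol3, nonempty)          # keep only the third column
words = filterfalse(isjunk, col3)      # drop '(unknown)' entries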

Now, process the groups

def is_target_group(group: Tuple[str, List[str]]) -> bool:
    return istarget(group[0])

def unpack_word_group(group: Tuple[str, List[str]]) -> List[str]:
    return [*group[1]]

def unpack_target_group(group: Tuple[str, List[str]]) -> List[str]:
    return [group[0]]

def process_group(group: Tuple[str, List[str]]):
    return (unpack_target_group(group) if is_target_group(group) 
            else first_and_last(unpack_word_group(group)))

And the final steps are:

words = list(map(process_group, groups))

P.S.

This is my test-case:

from io import StringIO

buffer = """
_\t_\tword
_\t_\tword
_\t_\tword
_\t_\t(unknown)
_\t_\tword
_\t_\tword
_\t_\ttarget1
_\t_\tword
_\t_\t(unknown)
_\t_\tword
_\t_\tword
_\t_\tword
_\t_\ttarget2
_\t_\tword
_\t_\t(unknown)
_\t_\tword
_\t_\tword
_\t_\tword
_\t_\t(unknown)
_\t_\tword
_\t_\tword
_\t_\ttarget1
_\t_\tword
_\t_\t(unknown)
_\t_\tword
_\t_\tword
_\t_\tword
"""

# this simulates an opened file
lines = StringIO(buffer)

Given this file you will get this output:

[(['word', 'word', 'word', 'word', 'word'],
  ['word', 'word', 'word', 'word', 'word']),
 (['target1'], ['target1']),
 (['word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word']),
 (['target2'], ['target2']),
 (['word', 'word', 'word', 'word', 'word'],
  ['word', 'word', 'word', 'word', 'word']),
 (['target1'], ['target1']),
 (['word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word'])]

From here you can drop the first 5 words and the last 5 words.
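If you then want to pair each target with its surrounding words explicitly, one possible post-processing step (my own sketch, relying on the output shape shown above, where every group comes out as a (first, last) pair) is:

def attach_context(processed):
    # Pair every target with the words just before and after it.
    pairs = []
    for i, (first, last) in enumerate(processed):
        if first and istarget(first[0]):                                    # a target group
            before = processed[i - 1][1] if i > 0 else []                   # last 5 words before it
            after = processed[i + 1][0] if i + 1 < len(processed) else []   # first 5 words after it
            pairs.append((first[0], before, after))
    return pairs

print(attach_context(words))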

Rhomb answered 14/11, 2016 at 12:0 Comment(0)
