How to jump to a particular line in a huge text file?
A

17

126

Are there any alternatives to the code below:

startFromLine = 141978 # or whatever line I need to jump to

urlsfile = open(filename, "rb", 0)

linesCounter = 1
for line in urlsfile:
    if linesCounter > startFromLine:
        DoSomethingWithThisLine(line)
    linesCounter += 1

I'm processing a huge text file (~15 MB) with lines of unknown but varying length, and I need to jump to a particular line whose number I know in advance. I feel bad about processing lines one by one when I know I could skip at least the first half of the file. I'm looking for a more elegant solution, if there is one.

Aleksandropol answered 6/3, 2009 at 20:49 Comment(2)
How do you know the first 1/2 of the file isn't a bunch of "\n"s while the second half is a single line? Why do you feel bad about this?Lakin
I think that the title is misleading - tbh 15MB is not really "huge text file", to say the least...Hammurabi
M
34

linecache:

The linecache module allows one to get any line from a Python source file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file. This is used by the traceback module to retrieve source lines for inclusion in the formatted traceback...
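
A minimal usage sketch (linecache.getline() is 1-based and returns an empty string if the line doesn't exist; the names come from the question):

import linecache

line = linecache.getline(filename, 141978)  # 1-based line number
DoSomethingWithThisLine(line)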

Mebane answered 6/3, 2009 at 20:58 Comment(4)
I just checked the source code of this module: the whole file is read in memory! So I would definitely rule this answer out for the purpose of quickly accessing a given line in a file.Zigzag
MiniQuark, I tried it, it actually works, and really quickly. I'll need to see what happens if I work on a dozen of files at the same time this way, find out at what point my system dies.Aleksandropol
Your OS's virtual memory manager helps out quite a bit, so reading big files into memory may not be slow if you're not generating a lot of page faults :) On the contrary, doing it the "stupid way" and allocating lots and lots of memory can be blazingly fast. I enjoyed the Danish FreeBSD developer Poul-Henning Kamp's article on it: queue.acm.org/detail.cfm?id=1814327Hygroscope
Try a 100 GB file; it sucks. I have to use f.tell(), f.seek(), f.readline()Toast
F
139

You can't jump ahead without reading in the file at least once, since you don't know where the line breaks are. You could do something like:

# Read in the file once and build a list of line offsets
line_offset = []
offset = 0
for line in file:
    line_offset.append(offset)
    offset += len(line)
file.seek(0)

# Now, to skip to line n (with the first line being line 0), just do
file.seek(line_offset[n])
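
To read onward from that point, just call readline() or keep iterating; a small sketch continuing from the code above (n is zero-based, and the file should be opened in binary mode so the byte offsets are exact):

file.seek(line_offset[n])
wanted = file.readline()          # line n itself
for line in file:                 # lines n+1, n+2, ... if you need them too
    DoSomethingWithThisLine(line)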
Faun answered 6/3, 2009 at 21:28 Comment(9)
+1, but beware that this is only useful if he's gonna jump to several random lines! but if he's only jumping to one line, then this is wastefulLiger
+1: Also, if the file doesn't change, the line number index can be pickled and reused, further amortizing the initial cost of scanning the file.Trifolium
OK, after I jumped there how would I process then line-by-line starting from this position?Aleksandropol
One thing to note (particularly on Windows): be careful to open the file in binary mode, or alternatively use offset=file.tell(). In text mode on Windows, the line will be a byte shorter than its raw length on disk (\r\n replaced by \n)Tycoon
@photographer: Use read() or readline(), they start from the current position as set by seek.Trifolium
OK, this thread is not very recent, but I thought, suppose you know the number of characters in the file. Then you can do something like the solution above even cheaper. Just pick a random number 'r' between 0 and the number of characters -> f.seek(r) -> first f.readline(): potentially part of a line -> second f.readline(): your random line. Though this is biased if the line lengths differ greatly, and you'll never pick the first line this way.Literacy
I'm getting the problem that on long Unicode text len() seems to be under-reporting the string length. Fixed it by opening the file in binary (open(..., 'rb')) when counting, but in text mode (open(..., 'rt')) when reading. I guess seek() counts bytes.Ling
This is very fast compared to reading the entire file. I originally structured my data as JSON and reading that entire file took over 2 minutes, changing the file to CSV, reading the whole thing and getting the data I want from it took 17 seconds. But seeking this same file using offset metadata collected earlier allows me to read the data in 0.8 seconds. So very good performance gains providing you can collect the metadata in advance.Ruel
You can also do the "guess and check" method with the seek method on the file handle (outlined in the Python docs here). After seeking, call readline to jump to the next line (this first read will likely be a partial line). But subsequent invocations of readline will return complete lines.Gulf
D
22

You don't really have that many options if the lines are of different length... you sadly need to process the line ending characters to know when you've progressed to the next line.

You can, however, dramatically speed this up AND reduce memory usage by changing the last parameter to "open" to something not 0.

0 means the file-reading operation is unbuffered, which is very slow and disk intensive. 1 means the file is line buffered, which would be an improvement. Anything above 1 (say 8 kB, i.e. 8192, or higher) reads chunks of the file into memory. You still access it through for line in open(etc):, but Python only goes a bit at a time, discarding each buffered chunk after it's processed.
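
A small sketch of what that looks like (the 8 kB buffer size is just an illustrative choice; passing -1, the default, lets Python pick a sensible size):

# The third argument to open() is the buffer size: read the file in 8 kB chunks.
with open(filename, "r", 8192) as urlsfile:
    for line in urlsfile:
        DoSomethingWithThisLine(line)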

Daffy answered 6/3, 2009 at 21:28 Comment(1)
I've done some testing here, and setting it to -1 (os default, often 8k, but often hard to tell), seems to be about as fast as it gets. That said, part of that may be that I'm testing on a virtual server.Welldisposed
A
13

I'm probably spoiled by abundant RAM, but 15 MB is not huge. Reading into memory with readlines() is what I usually do with files of this size. Accessing a line after that is trivial.
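
For example, a minimal sketch using the names from the question:

with open(filename) as urlsfile:
    lines = urlsfile.readlines()            # the whole file, one string per line

DoSomethingWithThisLine(lines[141977])      # the list is zero-based, so this is line 141978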

Anni answered 6/3, 2009 at 21:25 Comment(6)
Why I was slightly hesitant to read entire file -- I might have several of those processes running, and if a dozen of those read 12 files 15MB each it could be not good. But I need to test it to find out if it'll work. Thank you.Aleksandropol
@photographer: even "several" processes reading in 15MB files shouldn't matter on a typical modern machine (depending, of course, on exactly what you're doing with them).Bokbokhara
Jacob, yes, I should just try. The process(es) is/are running on a virtual machine for weeks if vm is not crashed. Unfortunately last time it crashed after 6 days. I need to continue from where it suddenly stopped. Still need to figure out how to find where it was left.Aleksandropol
@Noah: but it is not! Why not go further? What if the file is 128 TB? Then many OSes wouldn't be able to support it. Why not solve the problems as they come?Anni
@SilentGhost: I was hoping to get an answer that might be useful to me, as well. I've cobbled together an indexing scheme for my files, which range from 100MB to nearly 1GB, but an easier and less error-prone solution would be nice.Gaskill
No matter how much RAM you have, if you compute long enough, you'll run outUnreadable
C
13

I am surprised no one mentioned itertools.islice. It advances the file iterator, discarding the first index_of_interest lines, and then yields lines from there, so it skips ahead without loading the whole file into memory:

line = next(itertools.islice(Fhandle, index_of_interest, index_of_interest + 1), None)  # just the one line

or if you want the whole rest of the file

rest_of_file = itertools.islice(Fhandle, index_of_interest, None)
for line in rest_of_file:
    print(line)

or if you want every other line from the file

rest_of_file = itertools.islice(Fhandle, index_of_interest, None, 2)
for odd_line in rest_of_file:
    print(odd_line)
Cherianne answered 26/4, 2016 at 2:36 Comment(1)
this is the best solution, IMO, but explaining how it works would improve clarity.Criminology
B
6

Since there is no way to determine the length of all lines without reading them, you have no choice but to iterate over all lines before your starting line. All you can do is make it look nice. If the file is really huge then you might want to use a generator-based approach:

from itertools import dropwhile

def iterate_from_line(f, start_from_line):
    return (l for i, l in dropwhile(lambda x: x[0] < start_from_line, enumerate(f)))

for line in iterate_from_line(open(filename, "r", 0), 141978):
    DoSomethingWithThisLine(line)

Note: the index is zero-based in this approach.

Brenner answered 6/3, 2009 at 21:33 Comment(0)
H
6

I have had the same problem (needing to retrieve a specific line from a huge file).

Of course I can run through all the records in the file every time and stop when the counter equals the target line, but that is not efficient when you want to fetch several specific rows. The main issue to solve is how to jump directly to the required place in the file.

Here is what I came up with: first, I build a dictionary with the start position of each line (the key is the line number, and the value is the cumulative length of the previous lines).


t = open(file, 'r')
dict_pos = {}

kolvo = 0
length = 0
for each in t:
    dict_pos[kolvo] = length
    length = length + len(each)
    kolvo = kolvo + 1

Finally, the lookup function:

def give_line(line_number):
    t.seek(dict_pos.get(line_number))
    line = t.readline()
    return line

t.seek(dict_pos.get(line_number)) moves the file pointer to the byte offset where the target line begins, so the following readline() returns exactly that line.

Using this approach I have saved a significant amount of time.

Harar answered 6/7, 2014 at 18:3 Comment(0)
F
5

You can use mmap to find the offsets of the lines; mmap tends to be one of the fastest ways to scan through a file.

example:

import mmap

with open('input_file', "r+b") as f:
    # prot=mmap.PROT_READ is Unix-only; on Windows use access=mmap.ACCESS_READ instead
    mapped = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    i = 1
    for line in iter(mapped.readline, b""):      # mmap.readline() returns bytes
        if i == Line_I_want_to_jump:
            offsets = mapped.tell() - len(line)  # start of this line, not of the next one
            break
        i += 1

then use f.seek(offsets) to move to the line you need

Firsthand answered 7/8, 2015 at 18:15 Comment(1)
Such a good answer.Unreadable
R
5

None of the answers are particularly satisfactory, so here's a small snippet to help.

class LineSeekableFile:
    def __init__(self, seekable):
        self.fin = seekable
        self.line_map = list() # Map from line index -> file position.
        self.line_map.append(0)
        while seekable.readline():
            self.line_map.append(seekable.tell())

    def __getitem__(self, index):
        # NOTE: This assumes that you're not reading the file sequentially.  
        # For that, just use 'for line in file'.
        self.fin.seek(self.line_map[index])
        return self.fin.readline()

Example usage:

In: !cat /tmp/test.txt

Out:
Line zero.
Line one!

Line three.
End of file, line four.

In:
with open("/tmp/test.txt", 'rt') as fin:
    seeker = LineSeekableFile(fin)    
    print(seeker[1])
Out:
Line one!

This involves doing a lot of file seeks, but is useful for the cases where you can't fit the whole file in memory. It does one initial read to get the line locations (so it does read the whole file, but doesn't keep it all in memory), and then each access does a file seek after the fact.

I offer the snippet above under the MIT or Apache license at the discretion of the user.

Regrate answered 4/12, 2019 at 23:11 Comment(1)
This is the best solution, not only for the question, but for many other memory related issues when reading large files. Thank you for that!Fossick
L
4

If you don't want to read the entire file into memory, you may need to come up with some format other than plain text.

Of course, it all depends on what you're trying to do and how often you will jump around the file.

For instance, if you're going to be jumping to lines many times in the same file, and you know that the file does not change while you're working with it, you can do this: first, pass through the whole file and record the "seek locations" of some key line numbers (say, every 1,000 lines); then, if you want line 12005, jump to the recorded position of line 12000, read 5 lines, and you know you're at line 12005, and so on.
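
A rough sketch of that idea (the 1,000-line checkpoint interval and all names below are made up for illustration):

STEP = 1000   # record a checkpoint every 1,000 lines

def build_checkpoints(path, step=STEP):
    # {0: offset of line 0, 1000: offset of line 1000, ...}, measured in bytes
    checkpoints = {}
    offset = 0
    lineno = 0
    with open(path, 'rb') as f:
        for line in f:
            if lineno % step == 0:
                checkpoints[lineno] = offset
            offset += len(line)
            lineno += 1
    return checkpoints

def get_line(path, checkpoints, n, step=STEP):
    # Jump to the nearest checkpoint at or below line n, then walk forward.
    base = (n // step) * step
    with open(path, 'rb') as f:
        f.seek(checkpoints[base])
        for _ in range(n - base):
            f.readline()
        return f.readline()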

Liger answered 6/3, 2009 at 21:31 Comment(0)
G
3

If you know in advance the position in the file (rather the line number), you can use file.seek() to go to that position.

Edit: you can use the linecache.getline(filename, lineno) function, which will return the contents of the line lineno, but only after reading the entire file into memory. Good if you're randomly accessing lines from within the file (as python itself might want to do to print a traceback) but not good for a 15MB file.

Gaskill answered 6/3, 2009 at 21:6 Comment(2)
I would definitely not use linecache for this purpose, because it reads the whole file in memory before returning the requested line.Zigzag
Yeah, it sounded too good to be true. I still wish there were a module to do this efficiently, but tend to use the file.seek() method instead.Gaskill
P
3

What generates the file you want to process? If it is something under your control, you could generate an index (which line is at which position) at the time the file is appended to. The index file can have a fixed record size (space-padded or zero-padded numbers) and will definitely be smaller, and can thus be read and processed quickly. To fetch a line (a sketch follows after this list):

  • Decide which line you want.
  • Calculate the byte offset of the corresponding line number in the index file (possible because the index file's record size is constant).
  • Use seek or whatever to jump directly to that entry in the index file.
  • Parse it to get the byte offset of the corresponding line in the actual file.
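
A minimal sketch of that scheme, assuming one zero-padded 10-digit offset per line in the index file (the record width and all names are made up):

RECORD = 11   # ten zero-padded digits plus a newline per index entry

def build_index(data_path, index_path):
    # One fixed-width record per data line: the byte offset where that line starts.
    # (The same records could just as well be appended as the data file grows.)
    with open(data_path, 'rb') as data, open(index_path, 'wb') as idx:
        offset = 0
        for line in data:
            idx.write(b'%010d\n' % offset)
            offset += len(line)

def get_line(data_path, index_path, n):
    # Fixed-width records mean entry n lives at byte n * RECORD of the index file.
    with open(index_path, 'rb') as idx:
        idx.seek(n * RECORD)
        offset = int(idx.readline())
    with open(data_path, 'rb') as data:
        data.seek(offset)
        return data.readline()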
Professed answered 28/4, 2010 at 7:39 Comment(0)
D
2

Do the lines themselves contain any index information? If the content of each line was something like "<line index>:Data", then the seek() approach could be used to do a binary search through the file, even if the amount of Data is variable. You'd seek to the midpoint of the file, read a line, check whether its index is higher or lower than the one you want, etc.
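
A rough sketch of such a binary search, assuming every line starts with "<line index>:" and the indices increase down the file (all names here are made up):

import os

def find_indexed_line(path, target):
    def line_start(f, pos):
        # Walk backwards to the start of the line containing byte offset `pos`.
        while pos > 0:
            f.seek(pos - 1)
            if f.read(1) == b"\n":
                break
            pos -= 1
        return pos

    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)   # the target line starts somewhere in [lo, hi)
        while lo < hi:
            start = line_start(f, (lo + hi) // 2)
            f.seek(start)
            line = f.readline()
            idx = int(line.split(b":", 1)[0])
            if idx == target:
                return line
            if idx < target:
                lo = f.tell()               # target must start at or after the next line
            else:
                hi = start                  # target must start before this line
    return None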

Otherwise, the best you can do is just readlines(). If you don't want to read all 15MB, you can use the sizehint argument to at least replace a lot of readline()s with a smaller number of calls to readlines().

Decarlo answered 6/3, 2009 at 22:33 Comment(0)
S
2

If you're dealing with a text file on a Linux system, you can lean on the standard Linux commands.
For me, this worked well!

import commands  # Python 2 only; the commands module was removed in Python 3

def read_line(path, line=1):
    return commands.getoutput('head -%s %s | tail -1' % (line, path))

line_to_jump = 141978
read_line("path_to_large_text_file", line_to_jump)
Siberson answered 23/3, 2016 at 9:47 Comment(3)
of course it's not compatible with windows or some kind of linux shells which don't support head / tail.Erine
Is this faster than doing it in Python?Intentional
Can this get multiple lines?Intentional
L
1

Here's an example using readlines(sizehint) to read a chunk of lines at a time. DNS pointed out that solution. I wrote this example because the other examples here are single-line oriented.

def getlineno(filename, lineno):
    if lineno < 1:
        raise TypeError("First line is line 1")
    f = open(filename)
    lines_read = 0
    while 1:
        lines = f.readlines(100000)
        if not lines:
            return None
        if lines_read + len(lines) >= lineno:
            return lines[lineno-lines_read-1]
        lines_read += len(lines)

print(getlineno("nci_09425001_09450000.smi", 12000))
Lakin answered 7/3, 2009 at 4:24 Comment(0)
J
0

@george brilliantly suggested mmap, which presumably uses the syscall mmap. Here's another rendition.

import mmap

LINE = 2  # your desired line

with open('data.txt', 'rb') as i_file, mmap.mmap(i_file.fileno(), length=0, prot=mmap.PROT_READ) as data:
    for i, line in enumerate(iter(data.readline, b'')):   # mmap.readline() returns bytes
        if i != LINE:
            continue
        pos = data.tell() - len(line)
        break

    # optionally copy the data to `chunk` via the ordinary file object
    i_file.seek(pos)
    chunk = i_file.read(len(line))

print(f'line {i}')
print(f'byte {pos}')
print(f'data {line}')
print(f'data {chunk}')
Janus answered 25/11, 2021 at 9:17 Comment(0)
S
-1

You can use this function to return line n:

def skipton(infile, n):
    with open(infile,'r') as fi:
        for i in range(n-1):
            fi.next()
        return fi.next()
Shiver answered 19/9, 2015 at 22:5 Comment(2)
This logic doesn't work if there are continuous empty lines, fi.next() skips all empty lines at once, otherwise it good :)Wistrup
The OP doesn't mention that the lines have lines with non-standard line-breaks. In that case, you'd have to parse each line with at least one if-statement for the partial line-breaks.Shiver
