Seek on a large text file in Python
I have a few text files whose sizes range between 5 GB and 50 GB, and I am using Python to read them. I have specific anchors, as byte offsets, to which I can seek and read the corresponding data in each of these files (using Python's file API).

The issue I am seeing is that for relatively small files (< 5 GB) this reading approach works well. However, for much larger files (> 20 GB), and especially when file.seek has to make long jumps (several million bytes at a time), it sometimes takes a few hundred milliseconds to do so.

My impression was that seeking within a file is a constant-time operation. But apparently, it is not. Is there a way around this?

Here is what I am doing:

import time

f = open(filename, 'r+b')  # binary mode; 'filename' is defined elsewhere
f.seek(209)
current = f.tell()

t1 = time.time()
new_pos = f.seek(current + 1200000000)  # jump ~1.2 GB forward; seek() returns the new offset
t2 = time.time()

line = f.readline()
delta = t2 - t1  # time spent in the seek alone

The delta variable varies intermittently between a few microseconds and a few hundred milliseconds. I also profiled the CPU usage and didn't see anything busy there either.

Cogen answered 4/7, 2019 at 20:36 Comment(7)
Are you sure it's the seek itself, and not the subsequent read, that's taking longer? I agree with you that the seek itself should take basically no time at all. If it's the read rather than the seek, then I'd look at the buffering behavior behind the read. In either case, where you are in the file shouldn't matter, I'd think.Krummhorn
Could you provide a minimal reproducible example?Immigrate
For example: I am reading the file as f = open(filename, 'r+b'); f.seek(100000000); and then reading a line with f.readline().Cogen
Ah, so binary mode, despite your question mentioning "text files". But seeking through a 2 GB file in steps of 100 MB takes a few milliseconds on my system, so it's not a Python thing, I guess. It could be your specific implementation of the Python interpreter, your operating system / file system, or a virus scanner. I can imagine (but this is guessing) that a seek operation might start a read-ahead action at some level on your computer. Just for the sake of argument, you could try the same test with the virus scanner disabled, with the file on another file system, or using a file-like object in Python?Immigrate
And as @Steve suggested, it could be the read itself, not the seek. This is still not clear from the code in your comment, that's another reason why I asked for a minimal reproducible example. Could you show how you measured the time?Immigrate
Umm... I don't think reading a text file in binary mode does any damage. I don't have an antivirus, and the operating system is Ubuntu 16. I am adding a code sample above. Hope it helps.Cogen
Thanks for adding the code. This makes your question more clear and (hopefully) answerable. I edited the code a bit because there were additional spaces and semicolons and there was a code error. It can now be copy and pasted by anyone to be used (only the filename should be added/modified of course). I have posted an answer now. I hope it's useful, or anyone else has a better answer :-) Good luck.Immigrate

Your code runs consistently in under 10 microseconds on my system (Windows 10, Python 3.7), so there is no obvious error in your code.

NB: You should use time.perf_counter() instead of time.time() for measuring performance. The granularity of time.time() can be very bad ("not all systems provide time with a better precision than 1 second"). When comparing timings with other systems you may get strange results.
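As a minimal, self-contained sketch of that advice (it writes a small throwaway temporary file so there is something to seek in), the measurement could look like this:

```python
import os
import tempfile
import time

# Create a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1_000_000)
    filename = tmp.name

f = open(filename, "rb")
t1 = time.perf_counter()  # monotonic, high-resolution clock
f.seek(500_000)
t2 = time.perf_counter()
f.close()
os.remove(filename)

delta = t2 - t1
print(f"seek took {delta * 1e6:.1f} microseconds")
```

Unlike time.time(), time.perf_counter() is guaranteed to be monotonic and to use the highest-resolution clock available, so it won't jump if the system clock is adjusted mid-measurement.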

My best guess is that the seek triggers some buffering (read-ahead) action, which might be slow, depending on your system.

According to the documentation:

Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.

You could try to disable buffering by adding the argument buffering=0 to open() and check if that makes a difference:

open(filename, 'r+b', buffering=0)
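As a small illustration (again using a throwaway temporary file), an unbuffered binary handle still supports readline(), so the original seek-then-readline access pattern keeps working without the buffered read-ahead:

```python
import os
import tempfile

# Throwaway file with two known lines.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\nworld\n")
    path = tmp.name

# buffering=0 gives a raw, unbuffered handle: no read-ahead chunk
# is fetched behind the scenes on seek or read.
f = open(path, "rb", buffering=0)
f.seek(6)             # jump past "hello\n"
line = f.readline()   # raw streams still implement readline()
f.close()
os.remove(path)
print(line)  # b'world\n'
```

Note that unbuffered reads can be slower overall for sequential line-by-line access, so this is worth trying mainly to diagnose whether buffering is the culprit.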
Immigrate answered 6/7, 2019 at 7:56 Comment(1)
Thank you @wovano. Let me see if it helps. The issue is very intermittent though, and as I mentioned, I double-checked and it's not correlated with the CPU being busy or anything. Also, I made sure the page cache / fs cache was empty when I ran the tests. I will post my observations here if I see any.Cogen

A good way around that could be to use the low-level I/O functions from the os module: os.open (with the os.O_RDONLY flag in your case), os.lseek, and os.read.
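For example (sketched against a throwaway temporary file), the same seek-and-read pattern with those low-level calls looks like this:

```python
import os
import tempfile

# Throwaway file with known contents.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    path = tmp.name

fd = os.open(path, os.O_RDONLY)  # raw file descriptor, no Python-level buffering
os.lseek(fd, 4, os.SEEK_SET)     # jump straight to byte offset 4
data = os.read(fd, 3)            # read 3 bytes from that offset
os.close(fd)
os.remove(path)
print(data)  # b'456'
```

Since os.read returns raw bytes with no readline(), you would have to scan for the newline yourself if you need line-oriented access.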

Hepta answered 4/7, 2019 at 21:15 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.