Complexity of f.seek() in Python

Does f.seek(500000,0) go through all the first 499999 characters of the file before getting to the 500000th? In other words, is f.seek(n,0) of order O(n) or O(1)?

Hadsall answered 11/8, 2018 at 15:34 Comment(0)

You need to be a bit more specific about what type of object f is.

If f is a normal io module object for a file stored on disk, you have to determine which of the following you are dealing with (the sketch after the list shows one way to tell them apart):

  • The raw binary file object
  • A buffer object, wrapping the raw binary file
  • A TextIO object, wrapping the buffer
  • An in-memory BytesIO or TextIO object
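
For illustration, here is a minimal sketch (using a throwaway file created on the spot) of how the default open() call stacks these layers and how to reach each one:

    import io

    # Create a small file so there is something to open.
    with open("example.txt", "w") as f:
        f.write("hello world\n")

    text = open("example.txt", "r")    # text layer
    print(type(text))                  # <class '_io.TextIOWrapper'>
    print(type(text.buffer))           # <class '_io.BufferedReader'> (the buffer)
    print(type(text.buffer.raw))       # <class '_io.FileIO'> (the raw binary file)
    text.close()

    raw = open("example.txt", "rb", buffering=0)   # ask for the raw layer directly
    print(isinstance(raw, io.RawIOBase))           # True
    raw.close()

    mem = io.BytesIO(b"hello world\n")             # purely in-memory
    print(mem.seekable())                          # True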

The first option just uses the lseek system call to reposition the file descriptor position. Whether this call is O(1) depends on the OS and what kind of file system you have. For a Linux system with an ext4 filesystem, lseek is O(1).
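A rough sketch (not a benchmark) of what that looks like; opening with buffering=0 gives you the raw io.FileIO object, and seeking on it is a single system call regardless of the offset:

    import os

    # Create a small file; its size doesn't matter, because a raw seek
    # only repositions the file descriptor, it never reads data.
    with open("example.bin", "wb") as f:
        f.write(b"x" * 1024)

    with open("example.bin", "rb", buffering=0) as raw:    # io.FileIO
        raw.seek(500000)             # one lseek(2) call, no data read
        print(raw.tell())            # 500000 (past EOF is fine on a regular file)
        os.lseek(raw.fileno(), 0, os.SEEK_SET)   # the same operation, made directly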

Buffered objects just discard the buffer if your seek target is outside of the currently buffered region and read in new buffer data. That's O(1) too, but the fixed cost is higher.
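A sketch of the buffered layer (the default open(..., 'rb') result); whether the target falls inside or outside the buffer, the work done doesn't grow with the distance seeked:

    # The default binary open() returns a BufferedReader; the buffer size is
    # io.DEFAULT_BUFFER_SIZE, usually 8 KiB.
    with open("example.bin", "wb") as f:
        f.write(bytes(1024 * 1024))       # 1 MiB of zero bytes

    with open("example.bin", "rb") as buffered:
        buffered.read(10)        # fills the internal buffer
        buffered.seek(100)       # target is inside the buffer: pointer adjustment
        buffered.seek(900000)    # target is outside: buffer dropped, descriptor moved
        print(buffered.tell())   # 900000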

For text files, things are more complicated, as variable-byte-length codecs and line-ending translation mean you can't always map the binary stream position to a text position without scanning from the start. The implementation doesn't allow non-zero current-position- or end-relative seeks, and does its best to minimise how much data is read for absolute seeks. Internal state shared with the text decoder tracks a recent 'safe point' to seek back to and read forward to the desired position. Worst-case this is O(n).
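In practice that means text-mode seeks are restricted: tell() hands out an opaque cookie (byte position plus decoder state), seek() is meant to be given one of those cookies back, and relative seeks are only allowed with offset 0. A sketch:

    # Text layer: tell() returns an opaque cookie, and seek() accepts such
    # cookies; non-zero relative seeks raise an error.
    with open("example.txt", "w", encoding="utf-8") as f:
        f.write("héllo wörld\n" * 100)

    with open("example.txt", "r", encoding="utf-8") as text:
        text.readline()
        cookie = text.tell()     # opaque cookie, not a character count
        text.seek(0, 2)          # offset-0 end-relative seek is allowed
        text.seek(cookie)        # jump back to the recorded position
        print(text.readline())   # second line
        try:
            text.seek(5, 1)      # non-zero cur-relative seek is rejected
        except (OSError, ValueError) as exc:
            print("rejected:", exc)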

The in-memory file objects are just long, addressable arrays really. Seeking is O(1) because you can just alter the current position pointer value.
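A sketch with io.BytesIO (io.StringIO behaves the same way for text):

    import io

    # In-memory stream: the data lives in one contiguous buffer, so seek()
    # simply sets the current-position value.
    mem = io.BytesIO(b"0123456789" * 100000)   # ~1 MB held in memory
    mem.seek(500000)                           # O(1): just updates the position
    print(mem.tell())                          # 500000
    print(mem.read(10))                        # b'0123456789'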

Other file-like objects are legion, and they may or may not support seeking at all. How they handle seeking is implementation dependent.
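If in doubt, IOBase.seekable() at least tells you whether seek() will work at all; what a seek costs is still up to the implementation. A sketch with a few common cases:

    import gzip
    import io
    import socket

    print(io.BytesIO(b"abc").seekable())        # True: in-memory buffer

    with open("example.bin", "wb") as f:
        f.write(b"data")
    with open("example.bin", "rb") as f:
        print(f.seekable())                     # True: regular file on disk

    with gzip.open("example.bin.gz", "wb") as gz:
        gz.write(b"data")
    with gzip.open("example.bin.gz", "rb") as gz:
        print(gz.seekable())                    # True, but seeks are emulated (see comments)

    r, w = socket.socketpair()
    with r.makefile("rb") as sock_file:
        print(sock_file.seekable())             # False: sockets cannot seek
    r.close(); w.close()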

Etc. So, it depends.

Meissen answered 11/8, 2018 at 15:58 Comment(6)
Do you know if this applies to gzip.py also?Landloper
@Landloper I already include the gzip module in my answer; it shares a common base implementation with the other compression formats, and these have to do a full re-read of the compressed stream if you are seeking to a point before the current position.Meissen
Is there a way to get tell and seek on the gzip file itself, and not on the underlying decompressed data?Infielder
@JohnStrood: the wrapped file object is available as gzipfileobj.fileobj. Seeking on that object will break the internal state of the read buffer, however.Meissen
@JohnStrood you can use io.FileIO to read the uncompressed data.Ungava
@Ungava that’s just the interface for file objects, the .fileobj attribute is such an object.Meissen

It would depend on the implementation of f. However, in normal file-system files, it is O(1).

If Python implements f on text files, it could be implemented as O(n), as each character may need to be inspected to manage CR/LF pairs correctly.

  • This would depend on whether f.seek(n,0) is expected to give the same result as reading n characters in a loop, with (depending on the OS) CR/LF pairs shrunk to LF or LF expanded to CR/LF along the way; see the sketch below.
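
A small sketch (writing Windows-style line endings explicitly, then reading the file back with newline translation) of how that translation makes character counts and byte offsets diverge:

    # Write CR/LF line endings explicitly, then read in text mode with
    # universal newlines: each "\r\n" becomes a single "\n", so character
    # offsets no longer match byte offsets on disk.
    with open("crlf.txt", "wb") as f:
        f.write(b"one\r\ntwo\r\nthree\r\n")

    with open("crlf.txt", "r", newline=None) as text:   # newline=None: translate
        print(len(text.read()))                         # 14 characters

    with open("crlf.txt", "rb") as raw:
        print(len(raw.read()))                          # 17 bytes on disk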

If Python implements f on a compressed stream, then the order would be O(n), as reaching the target position may require reading and decompressing whole blocks of data.
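For example, here is a sketch using the gzip module (offsets are in the decompressed stream); a backward seek has to rewind to the start and decompress again, so the work grows with the target offset:

    import gzip

    # GzipFile emulates seeking: forward seeks read and throw away
    # decompressed data, backward seeks restart from the beginning of the
    # compressed stream and decompress up to the requested offset.
    with gzip.open("big.gz", "wb") as gz:
        gz.write(b"0123456789" * 200000)   # ~2 MB before compression

    with gzip.open("big.gz", "rb") as gz:
        gz.seek(1500000)    # forward: decompress and discard ~1.5 MB
        gz.seek(100)        # backward: rewind and decompress 100 bytes again
        print(gz.read(10))  # b'0123456789'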

Feedback answered 11/8, 2018 at 15:37 Comment(4)
"However, in normal file-system files, it is O(1)." This seems very wrong.Beading
Time should not depend on n in terms of big-O; however, I'd expect that if n is close enough to the current offset, seek would not cause any disk access.Squires
@mksteve: I was wrong; zipfile.ZipFile does support seeking when reading, and it'll try to satisfy that from the current buffer. If it can't, it'll seek back to the start of the compressed section and re-read data until it reaches the desired position.Meissen
seek itself is almost certainly O(1), because any further work can (and should) be delayed until a read or write actually needs the new file position. No sense doing an O(N) operation if the next operation on the file is a close.Evolve