How to detect a file has been truncated while reading

I'm reading lines from a group of log files, following them as they are written, using pyinotify.

I'm opening and reading the files with python native methods:

file = open(self.file_path, 'r')
# ... later
line = file.readline()

This is generally stable and can handle the file being deleted and re-created: pyinotify reports the unlink and the subsequent link.

However some log files are not being deleted. Instead they are being truncated and new content written to the beginning of the same file.

I'm having trouble reliably detecting when this has occurred, since pyinotify reports only a write. The only evidence I currently get is that pyinotify reports a write and readline() returns an empty string. But it is possible that two subsequent writes could trigger the same behavior.

I have thought of comparing the file's size to file.tell(), but according to the documentation tell() returns an opaque number for text files, and it appears this can't be trusted to be a number of bytes.

Is there a simple way to detect a file has been truncated while reading from it?


Edit:

Truncating a file can be simulated with simple shell commands:

echo hello > test.log
echo hello >> test.log
# Truncate test.log
echo goodbye > test.log

To complement this, a simple Python script can be used to confirm that file.tell() does not decrease when the file is truncated:

foo = open('./test.log', 'r')
line = foo.readline()
while line != '':
    print(foo.tell())
    print(line)
    line = foo.readline()

# Put a breakpoint on the following line and 
# truncate the file before it executes
print(foo.tell())
Romaromagna answered 14/4, 2019 at 21:4 Comment(6)
I think you can rely on the fact that if tell() returns a smaller number than the last time you called it, and you haven't seeked on your own, then something strange has happened. If you can confidently deduce that that "strange thing" is a file truncation, then I think you'll be good. - This whole idea kinda freaks me out. I'd go well out of my way to not have to read from a file that some other process might do anything but append to. – Meatus
@Steve No. At least on Linux, tell() will NOT move when the file is truncated. In context this is log monitoring. The whole idea is to read from a file another process is writing to. – Romaromagna
OK. I was only commenting on what I thought I read in your statement... how you might not be able to interpret what tell() is telling you (ha!) if it were to give you a smaller number. If it won't, it won't. And as I said, I'm not the guy who's going to have had any experience in writing code against files that can have their contents wiped out while I'm reading them. Best of luck in figuring this out! – Meatus
If each modification of the file represented a versioning, your notifier would represent that latest version at the point it ran. You seem to be attempting to account for versions between runs and versioning after. Would that be accurate? "A simple way" is a bit subjective, but no, I don't think there is a "simple way" to account for the number of operations occurring against a file by reviewing the content or attributable information of that file. Also, I don't think it is necessarily an integrity matter; it is a Point Of Time and Differential notion. – Insalivate
What do you want to do when you recognize a truncation? Do you just want to mimic what tail -f does? – Meatus
Your question does make me wonder if it's possible to represent file changes on uncommitted work in git. I suppose a mechanism to commit on each change is in order too; that is, you could submodule this file in a git repo and set up git hooks that commit to a local repo on every change, I suppose... If it is reliable, it could be part of a file change history solution. I wonder about locks in that approach though. – Insalivate

Use os.lseek(file.fileno(),0,os.SEEK_CUR) to obtain a byte offset without moving the file pointer. You can’t really use the regular file interface to find out, not least because it may have buffered text (that no longer exists) that it hasn’t made visible to Python yet. If the file is not a byte stream (e.g., the default open in Python 3), it could even be in the middle of a multibyte character and be unable to proceed even if the file immediately grew back past your file offset.
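
A minimal sketch of that check, assuming a Linux-like system and a file opened with Python's built-in open(). The helper name was_truncated and the throwaway file path are hypothetical, not part of the original answer. It compares the descriptor's byte offset with the current size reported by os.fstat and treats a size smaller than the offset as a truncation:

```python
import os
import tempfile


def was_truncated(f):
    """Return True if the file is now shorter than the descriptor's byte offset.

    Hypothetical helper: `f` is any open file object backed by a real descriptor.
    """
    fd = f.fileno()
    # A zero-byte relative seek reports the OS-level offset without moving it.
    offset = os.lseek(fd, 0, os.SEEK_CUR)
    # Current size of the file in bytes.
    size = os.fstat(fd).st_size
    return size < offset


if __name__ == "__main__":
    # Demo on a throwaway file (path is illustrative only).
    path = os.path.join(tempfile.mkdtemp(), "test.log")
    with open(path, "w") as out:
        out.write("hello\nhello\n")

    with open(path, "r") as f:
        f.readlines()                  # read to the end of the file
        print(was_truncated(f))        # file unchanged so far
        with open(path, "w") as out:   # mode "w" truncates, like `echo goodbye > test.log`
            out.write("goodbye\n")
        print(was_truncated(f))        # file is now shorter than the offset
```

Note that this can only see a truncation while the file is still shorter than the old offset; if the writer has already grown the file past that point, the sizes alone will not reveal it. Once a truncation is detected, one plausible reaction is to call f.seek(0) so the file object drops its stale buffer and starts again from the beginning.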

Sharpfreeze answered 14/4, 2019 at 22:21 Comment(2)
Just to be clear, you're suggesting comparing the result of os.lseek(file.fileno(),0,os.SEEK_CUR) to a file size (stat)? – Romaromagna
@PhilipCouling: Yes (with fstat on the file descriptor, which modern Python versions spell os.stat(file.fileno()) without the “f”), or you can issue three lseeks to learn the offset and the file size with no other effects. – Sharpfreeze
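
The three-lseek variant mentioned in the comment above could look roughly like this. It is only a sketch: offset_and_size is a hypothetical helper name, and it assumes nothing else moves the descriptor between the calls.

```python
import os
import tempfile


def offset_and_size(fd):
    """Return (current byte offset, file size) for descriptor `fd`.

    Hypothetical helper using three lseek calls; the offset is restored
    before returning, so the descriptor ends up where it started.
    """
    offset = os.lseek(fd, 0, os.SEEK_CUR)  # 1: where are we now?
    size = os.lseek(fd, 0, os.SEEK_END)    # 2: jump to the end to learn the size
    os.lseek(fd, offset, os.SEEK_SET)      # 3: go back to the original offset
    return offset, size


if __name__ == "__main__":
    # Demo on a throwaway file (path is illustrative only).
    path = os.path.join(tempfile.mkdtemp(), "test.log")
    with open(path, "wb") as out:
        out.write(b"hello\nhello\n")

    fd = os.open(path, os.O_RDONLY)
    try:
        os.read(fd, 6)                     # consume the first line
        offset, size = offset_and_size(fd)
        print(offset, size, size < offset)
    finally:
        os.close(fd)
```

Compared with fstat, this avoids a second kind of system call but briefly moves the offset, so it is only safe when no other thread is reading from the same descriptor at that moment.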
