Delete first X lines from a file in PHP
I was wondering if anyone out there knew how this could be done in PHP. I am running a script that opens a file, takes the first 1000 lines, does some work with those lines, then launches another instance of itself to take the next thousand lines, and so on until it reaches the end of the file. I'm using SplFileObject so that I can seek to a certain line, which lets me break the work into 1000-line chunks quite well. The biggest problem I'm having is performance. I'm dealing with files that have upwards of 10,000,000 lines, and while the first 10,000 lines or so go quite fast, there is a huge slowdown after that point, which I think comes from having to seek to that line.

What I would like to do is read the first thousand lines, then delete them from the file, so that my script is always reading the first thousand lines. Is there a way to do this without reading the rest of the file into memory? Other solutions I have seen involve reading every line into an array and then discarding the first X entries, but with ten million lines that would eat up too much memory and time.

If anyone has a solution or other suggestions that would speed up the performance, it would be greatly appreciated.

Toiletry answered 26/3, 2012 at 18:14 Comment(12)
You think the time is taken by the seeking? – Formless
I commented out the line that increments the line counter so that it always ran the first 1000, and it ran much faster. Plus, this gets slower and slower as it goes along, and the only thing that's changing is the line it's seeking to. – Toiletry
Seeking shouldn't be taking exponentially more time. On what sort of scale is the slowdown? – Formless
It might be worthwhile splitting your file into several n-thousand-line files, or is there some reason it must be one big file? – Formless
It might also be of interest to know that when using SplFileObject's seek() method, the file is still read all the way up to the point you're seeking to (each line is read and then thrown away). It is not the same as fseek()-ing to a byte offset. – Formless
The data I'm getting from the file is used to create entries in a MySQL database, so I'm monitoring performance by number of records. The first thousand records get inserted in less than a second. The second thousand takes about five seconds, the next thousand about a minute. Once I get up to around 15,000 records, it takes about ten minutes per thousand. Again, when I commented out the iteration, the SQL records were inserted at the speed of the first thousand continuously, so it's not a problem with the size of the database. – Toiletry
In that case, I doubt SplFileObject::seek() is the culprit. It should take on the order of a second or so at most to read 10,000,000+ lines. – Formless
My only advice here is to break the script down to find the real point that is causing the slowdown. It might be SplFileObject's fault (especially on Windows), but without you being able to show that it is the cause, I would remain skeptical. – Formless
The reason I'm using SplFileObject is that you can seek by line instead of by byte. I imagine, though, that that is what's causing the slowdown, because it has to seek to line 1,000,000 or wherever, reading everything up to that line. – Toiletry
Why not make a script that only seeks over the file and see if that is too slow for you? – Formless
I just did, and you're exactly right. Seeking to where I tested was almost instantaneous when there was nothing else going on, so it must be somewhere else in the script. Thank you for all your help. – Toiletry
@Eric don't seek by lines. You'll have to count lines EVERY TIME you open the file. Store the byte offset returned by ftell() (or its SplFileObject equivalent). That's a simple count of bytes to skip over, and it will be very fast since PHP doesn't have to scan for and count line endings. Once you've seeked to the proper location, THEN you can start counting lines. – Trinitarianism
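The byte-offset advice from the last comment can be sketched roughly like this. The function name, file names, and chunk size here are illustrative; in a real script the returned offset would be persisted (in a file or a database row) between runs:

```php
<?php
// Sketch: resume reading by byte offset instead of by line number.
// fseek() to a saved offset is O(1), whereas SplFileObject::seek($lineNo)
// has to re-read every line before $lineNo on every run.
function processChunk(string $path, int $offset, int $maxLines = 1000): int
{
    $fh = fopen($path, 'rb');
    fseek($fh, $offset); // jump straight to where the last run stopped

    for ($i = 0; $i < $maxLines && ($line = fgets($fh)) !== false; $i++) {
        // ... process $line (e.g. insert a record into MySQL) here ...
    }

    $next = ftell($fh); // byte offset the next run should start from
    fclose($fh);
    return $next;
}

// Demo on a throwaway file; a real script would save $offset between runs.
$path = tempnam(sys_get_temp_dir(), 'lines');
file_put_contents($path, "one\ntwo\nthree\nfour\n");
$offset = processChunk($path, 0, 2);       // consumes "one" and "two"
$offset = processChunk($path, $offset, 2); // consumes "three" and "four"
unlink($path);
```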

Unfortunately there is no clean way to do this, because deleting lines from the front of a file means rewriting everything that comes after them; a file can't simply have its beginning chopped off in place.

Still, I have posted this answer because it is a possible solution, though I suspect it hardly improves the performance. Correct me if I am wrong.

You can use XML to divide the file into units of 1000 lines, and use PHP's DOMDocument class to retrieve and append data. Append a child when you want to add data, retrieve the first child to get the first thousand lines, and delete that node once you are done with it. Just like this:

<document>
    <part>
        . . . 
        Thousand lines here
        . . . 
    </part>
    <part>
        . . . 
        Thousand lines here
        . . . 
    </part>
    <part>
        . . . 
        Thousand lines here
        . . . 
    </part>
    .
    .
    .
</document>
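A rough sketch of the retrieve-and-delete step with DOMDocument, using a tiny in-memory document in place of a real file (the element names match the layout above; note that DOMDocument still parses the whole document into memory, which is why this may not help much at ten million lines):

```php
<?php
// Sketch: pop the first <part> element off a document shaped like the
// layout above. loadXML() stands in for load('chunks.xml') on a real file.
$doc = new DOMDocument();
$doc->loadXML('<document><part>chunk one</part><part>chunk two</part></document>');

$first = $doc->getElementsByTagName('part')->item(0);
$payload = trim($first->textContent);    // the "thousand lines here" payload
// ... process $payload ...
$first->parentNode->removeChild($first); // delete the consumed node
// $doc->save('chunks.xml') would then persist the shortened document.
```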

ANOTHER WAY:

If you are really sure about breaking the sections into exactly 1000 lines, why not save them in a database, with each 1000-line chunk in a different row? Doing this would surely reduce the file read/write overhead and improve performance.

Lindesnes answered 26/3, 2012 at 18:55 Comment(0)

It seems to me that the objective is to parse a huge amount of data and insert it into a database. If so, I fail to understand why it's important to work with exactly 1000 lines.

I think I would just approach it by reading a big chunk of data, say 1 MB, into memory at once, and then scanning backwards from the end of the in-memory chunk for the last line ending. Once I have that, I can save the file position and the leftover data (everything after the last line ending, up to the end of the chunk) and prepend it to the next read. Alternatively, I can just reset the file pointer with fseek() to the position in the file just past that last line ending, which is easy to compute from the offset of the line ending within the chunk.

That way, all I have to do is explode the chunk by running explode("\r\n", $chunk) and I have all the lines I need, in a suitably big block for further processing.
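A sketch of that chunk-and-rewind idea (the 1 MB default and the choice of "\n" as the line ending are assumptions; "\r\n" data would just leave a stray "\r" for rtrim to drop on the last line):

```php
<?php
// Sketch: read a big block, keep only the complete lines in it, and
// rewind the file pointer to just past the last line ending, as above.
function readLines($fh, int $chunkSize = 1048576): array
{
    $chunk = fread($fh, $chunkSize);
    if ($chunk === false || $chunk === '') {
        return []; // end of file
    }

    $cut = strrpos($chunk, "\n"); // position of the last complete line ending
    if ($cut !== false && $cut + 1 < strlen($chunk)) {
        // Trailing partial line: push it back for the next read.
        fseek($fh, ($cut + 1) - strlen($chunk), SEEK_CUR);
        $chunk = substr($chunk, 0, $cut + 1);
    }
    // If no "\n" was found at all, the whole chunk is one oversized line;
    // a fuller version would carry it over to the next read instead.

    return explode("\n", rtrim($chunk, "\r\n"));
}
```

Each returned array is then a suitably big block of whole lines, ready for batch processing or a multi-row INSERT.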

Deleting lines from the beginning of the file is not recommended. That's going to shuffle a huge amount of data back and forth to disk.

Karb answered 28/3, 2012 at 20:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.