How to read a file backwards to find a substring efficiently

I have a huge log file with this kind of structure:

"timestamp": {"identifier":value}

"1463403600":{"AA":74.42},
"1463403601":{"AA":29.55},
"1463403603":{"AA":24.78},
"1463403604":{"AA":8.46},
"1463403605":{"AA":44.84},
"1463403607":{"AA":87.05},
"1463403608":{"AA":54.81},
"1463403609":{"AA":93.1},
"1463403611":{"AA":77.64},
"1463403612":{"AA":33.39},
"1463403613":{"AA":69.2},

I want to extract the content after(!) a given timestamp like:

std::ifstream * myfunc( uint32_t timestamp) 

example:

myfunc(1463403611);
/* returns
"1463403611":{"AA":77.64},
"1463403612":{"AA":33.39},
"1463403613":{"AA":69.2},
*/

The logfile is long, too long to keep in memory. The code will run on a resource-limited embedded device (80 MHz, ~10 kB free memory), so I'm looking for ideas for an efficient solution.

The logfile might have 500k+ entries, and 99% of the time the timestamp will be in the last 100 lines, so starting at the beginning of the file and checking every line for the right timestamp would be very inefficient.

So I guess I'm looking for a way to read the file backwards, line by line. I don't really have a solution for doing that efficiently without loading big chunks into memory.

I tried reading in chunks of 200 bytes starting from the EOF, but ran into the issue that in many cases a chunk cut a timestamp in half. I tried to detect that and re-read some bytes when needed, but got the feeling there must be a smarter solution.

Middleclass answered 16/5, 2016 at 13:17 Comment(16)
1) With the file creation timestamp, the file last-modified timestamp, and the number of timestamps per day, you may be able to approximate the offset, then adjust the estimate by reading. 2) With "99% of the time...", it should be easy to approximate a starting offset.Coronet
You could find the record using a binary search type algorithm targeted towards the end of the file.Mcripley
You don't need to read big chunks; the buffer just needs to be larger than twice the longest possible line. In your example those lines seem to always be pretty small; a 64-byte buffer would probably do.Kunkel
Roll the logs every 1,000 lines, or every hour so you never have too much in each one maybe...Stich
@Mcripley: Sounds interesting. Can you extend that thought? I would guess you mean it like this: I'm looking for timestamp "123456". I start at EOF and go backwards looking for a "6"; if I find a "6" I check whether the next character is "5", and so on. Matching the whole string is a hit. Is that what you mean by "binary search"? It's already a much better solution than reading chunks, in my mind.Middleclass
This isn't a solution generator. What do you have so far? And what's wrong with it?C
@LightnessRacesinOrbit: See my edit. I tried reading chunks; bad performance and not very elegant. If these kinds of questions are not wanted, I may delete it.Middleclass
Any read you do — beginning, end, or middle — may cut a timestamp in half. Just accept it, correct for it, and be done. "But getline()..." — nope, getline()'s a formatted input function implemented on top of read, and any read can cut a timestamp in half.Crane
@n.m.: Well, no, getline() will keep blocking and reading and blocking and reading until it gets a newline or EOF. You won't get half a timestamp with getline().C
@LightnessRacesinOrbit Let's read it again together: getline()'s a formatted input function implemented on top of read. Do you agree with this basic premise?Crane
@n.m. I don't have time for this. I am at work. Ping me if you can demonstrate that getline() will ever give you a "partial line".C
A "binary search" is where you start in the middle, if its value is too large then you take the lower half of the data and start in the middle of that, if its lower than the one you want you take the higher half and start in the middle of that. You do that recursively until you get within slicing distance of the record (timestamp) you are looking for. There are a few ways you can optimize that algorithm because of your data. you know its near the end. Also its time-stamps so you can make more targeted estimates where to search next, you don't have to cut the data exactly in half.Mcripley
@LightnessRacesinOrbit I wonder where you've got this. I don't remember ever saying or implying anything to this effect. I said (1) every read can cut a timestamp in half, and (2) getline is implemented on top of read. If you can demonstrate how these two imply what you are saying, please do so.Crane
@n.m.: I inferred it from: «Any read you do — beginning, end, or middle — may cut a timestamp in half [..] "But getline()..." — nope, getline()'s a formatted input function implemented on top of read, and any read can cut a timestamp in half»" which very much looked like a "getline() can fail to read a whole timestamp just as can any read", to me. Glad to hear we agree though.C
@LightnessRacesinOrbit um, that's exactly the point. Read can cut, getline uses read but doesn't cut. The idea is to look at it in awe and ask what kind of magic inside getline makes it possible. The answer is "buffering". One can do the same buffering when reading backwards.Crane
Can you have the data producer make the lines fixed width (by padding if necessary)? This would make accessing the file a lot faster and you can use memory-based algorithms, such as binary search.Exemplification

Well, I found this kind of interesting, so I knocked up a proof-of-concept for the binary-search idea.

This is poorly tested and probably a little buggy, but it seems to work so far and demonstrates the idea of divide-and-conquer: you check in the middle of the file and, depending on whether the value there is too high or too low, you divide the remaining data in two and search the relevant half. You do that recursively until you get close enough.

#include <ctime>
#include <cmath>
#include <cstdlib>
#include <string>
#include <fstream>
#include <iostream>

// Don't use this, it's just to show how many reads
// are being done to find the record.
int global_counter;

std::streampos find_stamp(std::istream& is, long stamp, std::streampos pos, std::streampos end)
{
    ++global_counter;

    if(pos == 0) // can't divide zero
        return 0;

    std::string s;
    long found_stamp;

    // extract the nearest timestamp after pos: the first getline skips
    // to the end of the current (possibly partial) record at the next ',',
    // the second skips past the opening '"', and operator>> then parses
    // the timestamp digits
    is.seekg(pos);
    if(!(std::getline(std::getline(is, s, ','), s, '"') >> found_stamp))
        return end;

    // if it's too big, check the first half of this region
    if(found_stamp > stamp)
        return find_stamp(is, stamp, pos / 2, pos);

    // if it's not within 10 timestamp seconds, check the second half of this region
    if(stamp - found_stamp > 10)
        return find_stamp(is, stamp, (pos + end) / 2, end);

    // read record by record (probably more efficient than more seeking)
    pos = is.tellg();
    while(std::getline(std::getline(is, s, ','), s, '"') >> found_stamp)
    {
        if(found_stamp > stamp)
            return pos;
        pos = is.tellg();
    }
    return end;
}

void print_after(const std::string& filename, long stamp)
{
    // open at end of file (to get length)
    std::ifstream ifs(filename, std::ios::ate);

    std::streampos end = ifs.tellg();
    auto pos = end / 2; // start checking in middle

    // find position before required record
    // (may be in the middle of a record)
    if((pos = find_stamp(ifs, stamp, pos, end)) != end)
    {
        ifs.seekg(pos);

        std::string line;
        std::getline(ifs, line, ','); // skip to next whole record

        // print out all following records
        while(std::getline(ifs, line, ','))
            std::cout << line;
    }
}

inline
std::string leading_zeros(int n, int zeros = 2)
{
    std::string s;
    for(int z = std::pow(10, zeros - 1); z; z /= 10)
        s += (n < z ? "0":"");
    return s + std::to_string(n);
}

int main()
{
    std::srand(std::time(0));

    // generate some test data
    std::ofstream ofs("test.txt");

    for(int i = 0; i < 1000; ++i)
    {
        ofs << '"' << leading_zeros(i, 10) << '"';
        ofs << ":{\"AA\":" << (std::rand() % 100);
        ofs << '.' << (std::rand() % 100) << "},\n";
    }

    ofs.close();

    global_counter = 0;
    print_after("test.txt", 993);

    std::cout << "find checked " << global_counter << " places in the file\n";
}

Output:

"0000000994":{"AA":80.6}
"0000000995":{"AA":11.90}
"0000000996":{"AA":16.43}
"0000000997":{"AA":53.11}
"0000000998":{"AA":68.43}
"0000000999":{"AA":79.77}
find checked 6 places in the file
Mcripley answered 16/5, 2016 at 16:38 Comment(1)
"500k+ entries and in 99% of the time the timestamp will be in the last 100 lines" This could be further improved by starting the search not in the mid, but roughly the 100th line counted from the end. Reduces the checks needed for the 99%. (And also, depending on the cache design of the system, reduces cache misses)Fetus

Since you are on an embedded device where mmap() is likely not available, I guess the only viable approach is to use a buffer that you fill with part of the file, so you can examine its contents one chunk at a time. Note that you will need to overlap your buffer windows to avoid missing a line that is cut in half at the beginning of the buffer. You will need to seek to the first newline at the beginning of a chunk and discard it, along with anything before it, before you can start examining the chunk for the timestamps. Discarding the partial line at the beginning of the buffer also helps align the end of that same line with the end of the buffer when you load the previous chunk into your buffer.

The handling of incomplete lines at the beginning of the buffer makes this a rather ugly and error-prone approach. That is why I would suggest using mmap() if it is at all available; it would allow you to simply ignore these issues.
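A rough sketch of that buffered backward scan (the function name, buffer size, and parsing details are mine; it assumes the file is opened in binary mode, records are one per line in ascending timestamp order, and no line is longer than the buffer):

#include <cstdio>
#include <cstring>

// Scan the file in small windows from the end, consuming only lines that
// are known to be complete; the fragment at the head of a window is
// re-read as part of the next (earlier) window. Returns the offset of the
// first record whose timestamp is >= stamp, or -1.
long find_offset_backwards(std::FILE* f, unsigned long stamp)
{
    enum { BUF_SIZE = 512 };            // modest; fits a ~10 kB budget
    char buf[BUF_SIZE + 1];

    std::fseek(f, 0, SEEK_END);
    long window_end = std::ftell(f);    // exclusive end of current window
    long best = -1;                     // offset of first record >= stamp

    while (window_end > 0)
    {
        long start = window_end - BUF_SIZE;
        if (start < 0) start = 0;
        long n = window_end - start;

        std::fseek(f, start, SEEK_SET);
        if (std::fread(buf, 1, (size_t)n, f) != (size_t)n)
            return -1;
        buf[n] = '\0';

        // everything after the window's first '\n' consists of complete
        // lines; the head fragment (if any) waits for the next window
        char* nl = (char*)std::memchr(buf, '\n', (size_t)n);
        long first = (start == 0) ? 0 : (nl ? (nl - buf) + 1 : n);

        // walk the complete lines from last (newest) to first (oldest)
        for (long i = n - 1; i >= first; --i)
        {
            if (buf[i] != '\n' && i != first)
                continue;
            const char* line = buf + i + (buf[i] == '\n' ? 1 : 0);
            unsigned long ts;
            if (std::sscanf(line, " \"%lu\"", &ts) != 1)
                continue;               // blank tail or malformed line
            if (ts < stamp)             // older than the target: done, all
                return best;            // earlier records are older still
            best = start + (long)(line - buf);
        }

        // shrink the window so the head fragment is re-read whole
        window_end = (start == 0) ? 0 : (nl ? start + first : start);
    }
    return best;
}

Once this returns a non-negative offset, a plain fseek() there followed by a forward read loop dumps the matching records.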

Canty answered 16/5, 2016 at 15:50 Comment(2)
thanks! I guess the combination of a binary search and chunking might give good performance. I will try that out (thanks @Mcripley for the binary-search hint). I just heard about mmap and will investigate.Middleclass
@Resulma: in particular, if 99% of the time the result is in the last 100 lines then try really hard to get your chunk size up to at least 100 times the typical line length (so, 2700 bytes or more), so that 99% of the time you only have to do one seek and read.Subinfeudate

If performance is not an issue, you can read the file line by line until you reach the requested time, and then start dumping. There is no reason to read the entire file into memory. If performance is an issue, seek to half the file, check the timestamp, then divide by two again, in binary-search fashion.
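A minimal sketch of that forward scan (the function name and parsing details are mine; it assumes one record per line in ascending timestamp order):

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

// Read record by record and start echoing at the first record whose
// timestamp is >= the one requested (matching the example in the question).
void dump_after(const char* filename, unsigned long stamp)
{
    std::ifstream ifs(filename);
    std::string line;
    bool dumping = false;

    while (std::getline(ifs, line))
    {
        if (!dumping)
        {
            if (line.size() < 2 || line[0] != '"')
                continue;              // skip blank or malformed lines
            // the timestamp digits start right after the opening quote
            unsigned long ts = std::strtoul(line.c_str() + 1, nullptr, 10);
            if (ts < stamp)
                continue;
            dumping = true;
        }
        std::cout << line << '\n';
    }
}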

Rascal answered 16/5, 2016 at 15:28 Comment(0)
