Reading and graphing data from huge files

We have pretty large files, on the order of 1-1.5 GB combined (mostly log files), with raw data that is easily parseable into a CSV, which is then supposed to be graphed to generate a set of graph images.

Currently, we are using bash scripts to turn the raw data into a CSV file, with just the numbers that need to be graphed, and then feeding it into a gnuplot script. But this process is extremely slow. I tried to speed up the bash scripts by replacing some piped cut, tr, etc. commands with a single awk command; although this improved the speed, the whole thing is still very slow.

So, I am starting to believe there are better tools for this process. I am currently looking at rewriting it in Python+NumPy or R. A friend of mine suggested the JVM, and if I go that route I will use Clojure, but I am not sure how the JVM will perform.

I don't have much experience dealing with these kinds of problems, so any advice on how to proceed would be great. Thanks.

Edit: Also, I want to store (to disk) the generated intermediate data, i.e. the CSV, so I don't have to regenerate it should I decide I want a different-looking graph.

Edit 2: The raw data files have one record per line, with fields separated by a delimiter (|). Not all fields are numbers. Each field I need in the output CSV is obtained by applying a certain formula to the input records, which may use multiple fields from the input data. The output CSV will have 3-4 fields per line, and I need graphs that plot fields 1-2, 1-3 and 1-4 in (maybe) a bar chart. I hope that gives a better picture.
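For illustration, this is roughly the kind of transformation involved (a minimal sketch only; the field indices, the formula and the output file name are made up, not the real ones):

import sys

def transform(fields):
    # hypothetical formula; the real one combines whichever input
    # fields are needed for each output column
    return (fields[0],
            float(fields[3]) / float(fields[5]),
            int(fields[7]) - int(fields[2]))

with open(sys.argv[1]) as raw, open("out.csv", "w") as out:
    for line in raw:
        fields = line.rstrip("\n").split("|")
        out.write("%s,%s,%s\n" % transform(fields))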

Edit 3: I have modified @adirau's script a little and it seems to be working pretty well. I have come far enough that I am reading data, sending it to a pool of processor threads (pseudo-processing: appending the thread name to the data), and aggregating it into an output file through another collector thread.
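The collector side is essentially just a thread draining a results queue into the output file, something along these lines (simplified sketch; the names and the sentinel value are placeholders, not the actual script):

import threading
try:
    import Queue as queue          # Python 2
except ImportError:
    import queue                   # Python 3

results = queue.Queue()
DONE = object()                    # sentinel telling the collector to stop

def collector(path):
    # single writer thread: drains processed rows and appends them to the CSV
    with open(path, "w") as out:
        while True:
            row = results.get()
            if row is DONE:
                break
            out.write(row + "\n")

c = threading.Thread(target=collector, args=("out.csv",))
c.start()
# ... processor threads call results.put(csv_row) ...
results.put(DONE)
c.join()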

PS: I am not sure about the tagging of this question; feel free to correct it.

Forest answered 29/3, 2011 at 6:53 Comment(2)
With files of that size, R can get tricky, as it is rather memory-intensive. The graphical possibilities of R exceed those of Python, though (see e.g. addictedtor.free.fr/graphiques). Make sure you check the multithreading in R (package snowfall). But most of all, code in what you're familiar with. If you're not very familiar with R, it will be difficult to optimize this.Cherub
Yes, that is also another point: I have little to no experience in R, and the same goes for numpy and matplotlib, but I am very comfortable with Python. This, too, will influence my choice.Forest

Python sounds like a good choice because it has a good threading API (the implementation is questionable, though), matplotlib and pylab. I'm missing some more specs from your end, but maybe this could be a good starting point for you: matplotlib: async plotting with threads. I would go for a single thread handling the bulk disk I/O reads and synchronously queueing to a pool of threads for the data processing (if you have fixed record lengths, things may get faster by precomputing the read offsets and passing just the offsets to the thread pool). In the disk I/O thread I would mmap the data source files and read a predefined number of bytes, plus one more read to grab the remaining bytes up to the end of the current input line; the number of bytes should be chosen somewhere near your average input line length. Next comes feeding the pool via the queue, and the data processing/plotting that takes place in the thread pool. I don't have a good picture here (of what exactly you are plotting), but I hope this helps.

EDIT: there's file.readlines([sizehint]) to grab multiple lines at once; it may not be that fast, though, because the docs say it uses readline() internally.

EDIT: a quick skeleton code

import threading
from threading import Thread
from collections import deque
import sys
import mmap


class processor(Thread):
    """
        processor gets a batch of data at time from the diskio thread
    """
    def __init__(self,q):
        Thread.__init__(self,name="plotter")
        self._queue = q
    def run(self):
        #consume batched data forever
        while True:
            #block until a batch arrives, then plot each parsed record in it
            for data in self.feed(self._queue.get()):
                self.plot(data)
            #sanitizer exceptions following, maybe

    def parseline(self,line):
        """ return a data struct ready for plotting """
        raise NotImplementedError

    def feed(self,databuf):
        #we yield one-at-time datastruct ready-to-go for plotting
        for line in databuf:
            yield self.parseline(line)

    def plot(self,data):
        """integrate
        https://www.esclab.tw/wiki/index.php/Matplotlib#Asynchronous_plotting_with_threads
        maybe
        """
        raise NotImplementedError


class sharedq(object):
    """I don't recall where I got this implementation from;
    you may write a better one"""
    def __init__(self,maxsize=8192):
        self.queue = deque()
        self.barrier = threading.RLock()
        self.read_c = threading.Condition(self.barrier)
        self.write_c = threading.Condition(self.barrier)
        self.msz = maxsize
    def put(self,item):
        self.barrier.acquire()
        while len(self.queue) >= self.msz:
            self.write_c.wait()
        self.queue.append(item)
        self.read_c.notify()
        self.barrier.release()
    def get(self):
        self.barrier.acquire()
        while not self.queue:
            self.read_c.wait()
        item = self.queue.popleft()
        self.write_c.notify()
        self.barrier.release()
        return item



q = sharedq()
#size hint (in bytes) for batching lines
numbytes=1024
for i in xrange(8):
    p = processor(q)
    #daemon threads so the skeleton can exit; real code needs a shutdown protocol
    p.daemon = True
    p.start()
for fn in sys.argv[1:]:
    with open(fn, "r+b") as f:
        #you may want a better size hint here
        mm = mmap.mmap(f.fileno(), 0)
        #mmap objects expose readline() but not readlines(), so batch
        #complete lines until roughly numbytes have been read
        while True:
            batch, batched = [], 0
            while batched < numbytes:
                line = mm.readline()
                if not line:
                    break
                batch.append(line)
                batched += len(line)
            if not batch:
                break
            q.put(batch)

#some cleanup code may be desirable
Rolanderolando answered 29/3, 2011 at 7:22 Comment(8)
Thanks for your ideas adirau; my intention in using Python was that I could use pooled threads reading data from a queue. As for a better picture, I edited the question with more info; hope that gives a better idea of what I am up to.Forest
Thank you very much for the code adirau; it took a while for me to go through as I've never used deque and mmap. Could you point me to more info on those? Also, what is the difference between queue.Queue and deque? And why not simply open the file and read lines from it sequentially?Forest
collections.deque is supposed to offer fast, atomic append and popleft operations that don't require locking (snipped from the Queue documentation); you can find the deque and mmap docs in the Python documentation. I opted for bulk line reads as a quick optimization; reading and queueing lines one by one seemed a bad idea (more readline calls, more queueing operations), so I figured it's better and faster to bulk read and bulk process.Rolanderolando
Yes, I am reading them; however, I can't find the .readlines method on mmap (looking at the 2.7 docs). Which version is it present in?Forest
I snipped the sharedq from some old code of mine where I used almost the same technique to process incoming video streams... I don't remember where I got it from initially; it must be some Python howto/code sample (possibly the Python source tree).Rolanderolando
it's not in mmap but in the file-like object returned by mmap.mmap(). see docs.python.org/library/stdtypes.html, section 5.9. File ObjectsRolanderolando
So, as I see from the docs, .readlines(size) is similar to .read(size).splitlines(), right? Sorry, but I am having a hard time grasping this.Forest
No, it's not. With .readlines you will get complete lines; the size in this case is a hint that gets rounded up so that the last line is read in full. With .read(size).splitlines() you will almost surely get the last line incomplete, and you need additional logic to manage this situation.Rolanderolando
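For illustration, the usual way to manage that situation is to carry the partial last line over into the next read; a minimal sketch (chunk size arbitrary):

def chunked_lines(f, chunksize=1024 * 1024):
    # yield complete lines from fixed-size reads, carrying any partial
    # last line over into the next chunk
    leftover = ""
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            if leftover:
                yield leftover
            return
        chunk = leftover + chunk
        lines = chunk.split("\n")
        leftover = lines.pop()      # possibly incomplete last line
        for line in lines:
            yield line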

I think Python+NumPy would be the most efficient way, regarding both speed and ease of implementation. NumPy is highly optimized, so performance is decent, and Python would ease the algorithm-implementation part.

This combo should work well for your case, provided you optimize the loading of the file into memory: try to find the middle ground, processing data blocks that aren't too large but are large enough to minimize the read and write cycles, because that is what will slow the program down.
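For example, something along these lines (a sketch only; the block size, file name and column indices are arbitrary placeholders):

import itertools
import numpy as np

def blocks(path, blocksize=50000):
    # read the '|'-delimited file in blocks of lines and convert the
    # needed numeric columns of each block to a numpy array in one go
    with open(path) as f:
        while True:
            lines = list(itertools.islice(f, blocksize))
            if not lines:
                break
            # columns 3 and 5 stand in for whichever fields are needed
            yield np.array([(float(l.split("|")[3]), float(l.split("|")[5]))
                            for l in lines])

for block in blocks("raw.log"):
    pass  # accumulate / plot each block here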

If you feel this needs further speeding up (which I sincerely doubt), you could use Cython to speed up the sluggish parts.

Alimentation answered 29/3, 2011 at 7:33 Comment(2)
I didn't quite get your second paragraph; is it that if I do a .read(2000).splitlines(), it will perform better than doing a .readline() for each line?Forest
I would advise so, as it would minimize the read and write cycles; again, you have to find the optimal size depending on your configuration. Another thing about readline() is that it may cause errors, since lines can appear to come back in a different order than they are in the file, especially when mixing file iteration (which uses a read-ahead buffer) with readline().Alimentation
