Charting massive amounts of data [closed]

We are currently using ZedGraph to draw a line chart of some data. The input data comes from a file of arbitrary size, so we do not know the maximum number of data points in advance. However, by opening the file and reading the header, we can find out how many data points it contains.

The file format is essentially [time (double), value (double)]. However, the entries are not uniform along the time axis. There may not be any points between, say, t = 0 sec and t = 10 sec, but there might be 100K entries between t = 10 sec and t = 11 sec, and so on.

As an example, our test dataset file is ~2.6 GB and it has 324M points. We'd like to show the entire graph to the user and let her navigate through the chart. However, loading 324M points into ZedGraph is not only impossible (we're on a 32-bit machine) but also not useful, since there is no point in having that many points on the screen.

Using the FilteredPointList feature of ZedGraph also appears to be out of the question, since it requires loading the entire dataset first and then filtering it.

So, unless we're missing something, it appears that our only option is to somehow decimate the data. However, as we keep working on it, we're running into several problems:

1- How do we decimate data that is not arriving uniformly in time?

2- Since the entire dataset can't be loaded into memory, any algorithm needs to work against the disk and therefore needs to be designed carefully.

3- How do we handle zooming in and out, especially when the data is not uniform along the x-axis?

If the data were uniform, upon initial load of the graph we could Seek() by a predefined number of entries in the file, pick every Nth sample, and feed it to ZedGraph. However, since the data is not uniform, we have to be more intelligent about choosing the samples to display, and we can't come up with an intelligent algorithm that wouldn't have to read the entire file.
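
Something like this is what we had in mind for the uniform case (a rough C# sketch; the 16-byte [time, value] record layout, the header-size parameter, and the names are just for illustration):

// Rough sketch of the uniform-stride idea above: seek through fixed-size
// records and keep every Nth sample, without reading the whole file.
using System.Collections.Generic;
using System.IO;

static class UniformDecimator
{
    const int RecordSize = 2 * sizeof(double); // [time, value], 16 bytes

    public static List<KeyValuePair<double, double>> TakeEveryNth(string path, long headerBytes, int n)
    {
        var points = new List<KeyValuePair<double, double>>();
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var reader = new BinaryReader(fs))
        {
            long totalRecords = (fs.Length - headerBytes) / RecordSize;
            for (long i = 0; i < totalRecords; i += n)
            {
                fs.Seek(headerBytes + i * RecordSize, SeekOrigin.Begin);
                double t = reader.ReadDouble();
                double v = reader.ReadDouble();
                points.Add(new KeyValuePair<double, double>(t, v));
            }
        }
        return points;
    }
}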

I apologize that the question isn't razor-sharp, but I hope I've explained the nature and scope of our problem.

We're on Windows 32-bit, .NET 4.0.

Arroyo answered 27/1, 2011 at 15:0 Comment(0)

I've needed this before, and it's not easy to do. I ended up writing my own graph component because of this requirement. It turned out better in the end because I put in all the features we needed.

Basically, you need to get the range of data (min and max possible/needed index values), subdivide it into segments (let's say 100 segments), and then determine a value for each segment by some algorithm (average value, median value, etc.). Then you plot based on those summarized 100 elements. This is much faster than trying to plot millions of points :-).

So what I am saying is similar to what you are saying. You mention you do not want to plot every Xth element because there might be long stretches of time (index values on the x-axis) between elements. What I am saying is: for each subdivision of the data, determine the best value, and take that as the data point. My method is index-value based, so in your example of no data between the 0 sec and 10 sec index values I would still put data points there; they would just share the same value.

The point is to summarize the data before you plot it. Think through your algorithm for doing that carefully; there are lots of ways to do it, so choose the one that works for your application. You may get away with not writing your own graph component and just writing the data summarization algorithm.
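
Here's a rough sketch of the kind of summarization I mean (C#; the segment count and the use of a plain average are just examples, pick whatever statistic fits your data):

// Summarize the points of the visible x-range into a fixed number of
// segments; empty segments repeat the previous value, so a point is still
// plotted there, as described above.
using System.Collections.Generic;

static class Summarizer
{
    public static IEnumerable<KeyValuePair<double, double>> Summarize(
        IEnumerable<KeyValuePair<double, double>> points,
        double xMin, double xMax, int segments)
    {
        double width = (xMax - xMin) / segments;
        var sums = new double[segments];
        var counts = new int[segments];

        foreach (var p in points)
        {
            int bucket = (int)((p.Key - xMin) / width);
            if (bucket < 0 || bucket >= segments) continue;
            sums[bucket] += p.Value;
            counts[bucket]++;
        }

        double lastValue = 0;
        for (int i = 0; i < segments; i++)
        {
            double value = counts[i] > 0 ? sums[i] / counts[i] : lastValue;
            lastValue = value;
            yield return new KeyValuePair<double, double>(xMin + (i + 0.5) * width, value);
        }
    }
}

Swap the average for min/max, median, or whatever preserves the features you care about.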

Machiavellian answered 27/1, 2011 at 20:54 Comment(3)
Agreed on your last point. Summarizing the data appropriately will relieve the headache of displaying any number of scenarios. - Singularize
gmagana: Yes, I think that sounds like one way to go. I'm only worried that every time the user wants to zoom in or out, we're going to have to re-read and re-summarize the file. We thought about some caching algorithms, but since the data on the graph is decimated at any point in time, caching those points may not make sense. - Arroyo
@SomethingBetter: Yeah, zooming is one of the reasons I wrote my own. We needed something like the Google Finance graph (i.e., google.com/finance?q=msft), so I wrote it from scratch. - Machiavellian

I would approach this in two steps:

  1. Pre-processing the data
  2. Displaying the data

Step 1
The file should be preprocessed into a fixed-format binary file. Adding an index to the format, each record would be int, double, double. See this article for speed comparisons:

http://www.codeproject.com/KB/files/fastbinaryfileinput.aspx

You can then either break the file up into time intervals, say one per hour or day, which gives you an easy way to access different time ranges, or you can keep one big file and have an index file that tells you where to find specific times, for example:

1,1/27/2011 8:30:00
13456,1/27/2011 9:30:00

Using one of these methods, you will be able to quickly find any block of data, either by time (via the index or file name) or by number of entries (thanks to the fixed byte format).
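
A rough sketch of that fixed-record idea in C# (the 20-byte int/double/double layout is the one suggested above; the names and the callback are just for illustration):

// Write records in a fixed 20-byte layout and read any block back by
// record number with a single Seek, which is what makes the format fast.
using System;
using System.IO;

static class FixedRecordFile
{
    const int RecordSize = sizeof(int) + 2 * sizeof(double); // 20 bytes

    public static void AppendRecord(BinaryWriter writer, int index, double time, double value)
    {
        writer.Write(index);
        writer.Write(time);
        writer.Write(value);
    }

    public static void ReadBlock(string path, long firstRecord, int count,
                                 Action<int, double, double> onRecord)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var reader = new BinaryReader(fs))
        {
            fs.Seek(firstRecord * RecordSize, SeekOrigin.Begin);
            for (int i = 0; i < count && fs.Position + RecordSize <= fs.Length; i++)
            {
                int index = reader.ReadInt32();
                double time = reader.ReadDouble();
                double value = reader.ReadDouble();
                onRecord(index, time, value);
            }
        }
    }
}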

Step 2
Ways to show the data:

  1. Just display each record by index.
  2. Normalize the data and create aggregate data bars with open, high, low, close values:
     a. By time
     b. By record count
     c. By difference between values

For more possible ways to aggregate non-uniform datasets, you may want to look at the methods used to aggregate trade data in the financial markets. Of course, for speed in real-time rendering, you would want to create files with this data already aggregated.
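
For example, here is a rough C# sketch of building open/high/low/close bars per time interval (it assumes the points arrive in time order; the Bar type and the bar width are illustrative):

// Aggregate raw (time, value) points into one OHLC bar per time interval.
using System;
using System.Collections.Generic;

class Bar
{
    public double Start, Open, High, Low, Close;
}

static class OhlcAggregator
{
    public static IEnumerable<Bar> Aggregate(
        IEnumerable<KeyValuePair<double, double>> points, double start, double barWidth)
    {
        var bars = new SortedDictionary<long, Bar>();
        foreach (var p in points)
        {
            long slot = (long)((p.Key - start) / barWidth);
            Bar bar;
            if (!bars.TryGetValue(slot, out bar))
            {
                bar = new Bar { Start = start + slot * barWidth,
                                Open = p.Value, High = p.Value, Low = p.Value };
                bars.Add(slot, bar);
            }
            bar.High = Math.Max(bar.High, p.Value);
            bar.Low = Math.Min(bar.Low, p.Value);
            bar.Close = p.Value; // last value seen in this interval
        }
        return bars.Values;
    }
}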

Arrearage answered 27/1, 2011 at 21:22 Comment(0)

1- How do we decimate data that is not arriving uniformly in time?

(Note - I'm assuming your loader datafile is in text format.)

On a similar project, I had to read data files that were more than 5 GB in size. The only way I could parse them was by reading them into an RDBMS table. We chose MySQL because it makes importing text files into data tables drop-dead simple. (An interesting aside: I was on a 32-bit Windows machine and couldn't even open the text file for viewing, but MySQL read it with no problem.) The other perk is that MySQL is screaming, screaming fast.

Once the data was in the database, we could easily sort it and condense large amounts of data into single summary queries (using built-in SQL aggregate functions such as SUM). MySQL could even write its query results back out to a text file for use as loader data.
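
As a hypothetical sketch, assuming the data has been loaded into a table samples(t DOUBLE, value DOUBLE) (e.g. via LOAD DATA INFILE) and the MySQL Connector/NET package is referenced (the table, column, and bucket names are made up):

// Let MySQL do the summarizing: one row of AVG/MIN/MAX per time bucket.
using System;
using MySql.Data.MySqlClient;

static class MySqlSummary
{
    public static void PrintBucketSummaries(string connectionString, double bucketSeconds)
    {
        const string sql =
            "SELECT FLOOR(t / @w) * @w AS bucket_start, " +
            "       AVG(value), MIN(value), MAX(value) " +
            "FROM samples GROUP BY bucket_start ORDER BY bucket_start";

        using (var conn = new MySqlConnection(connectionString))
        using (var cmd = new MySqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@w", bucketSeconds);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0}: avg={1} min={2} max={3}",
                        reader.GetDouble(0), reader.GetDouble(1),
                        reader.GetDouble(2), reader.GetDouble(3));
                }
            }
        }
    }
}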

Long story short, consuming that much data mandates the use of a tool that can summarize the data. MySQL fits the bill (pun intended...it's free).

Singularize answered 27/1, 2011 at 21:1 Comment(1)
That's interesting. Getting all this data into the database might take a lot of time, but I guess if we're going to analyze the same data over and over again, it might help. Although this does not help at all with our fundamental problem: how to read this data out, summarize it to a reasonable number of points, and show it in the graph. Thanks. - Arroyo

A relatively easy alternative I've found is to do the following:

  1. Iterate through the data in small point groupings (say 3 to 5 points at a time - the larger the group, the faster the algorithm will work but the less accurate the aggregation will be).
  2. Compute the min & max of the small group.
  3. Remove all points that are not the min or max from that group (i.e. you only keep 2 points from each group and omit the rest).
  4. Keep looping through the data (repeating this process) from start to end, removing points, until the aggregated dataset has few enough points that it can be charted without choking the PC.

I've used this algorithm in the past to take datasets of ~10 million points down to the order of ~5K points without any obvious visible distortion to the graph.
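
A rough C# sketch of that min/max pass (the group size and the loop-until-small-enough structure follow the steps above; the names are just for illustration):

// Keep only each small group's min and max point, and repeat until the
// dataset is small enough to chart.
using System;
using System.Collections.Generic;

static class MinMaxDecimator
{
    public static List<KeyValuePair<double, double>> DecimateOnce(
        IList<KeyValuePair<double, double>> points, int groupSize)
    {
        var result = new List<KeyValuePair<double, double>>();
        for (int start = 0; start < points.Count; start += groupSize)
        {
            int end = Math.Min(start + groupSize, points.Count);
            int minIdx = start, maxIdx = start;
            for (int i = start + 1; i < end; i++)
            {
                if (points[i].Value < points[minIdx].Value) minIdx = i;
                if (points[i].Value > points[maxIdx].Value) maxIdx = i;
            }
            // Emit the earlier of the two first to preserve x-order.
            result.Add(points[Math.Min(minIdx, maxIdx)]);
            if (minIdx != maxIdx) result.Add(points[Math.Max(minIdx, maxIdx)]);
        }
        return result;
    }

    public static IList<KeyValuePair<double, double>> Decimate(
        IList<KeyValuePair<double, double>> points, int groupSize, int targetCount)
    {
        while (points.Count > targetCount)
        {
            var reduced = DecimateOnce(points, groupSize);
            if (reduced.Count == points.Count) break; // can't shrink any further
            points = reduced;
        }
        return points;
    }
}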

The idea here is that, while throwing out points, you're preserving the peaks and valleys so the "signal" viewed in the final graph isn't "averaged down" (normally, if averaging, you'll see the peaks and the valleys become less prominent).

The other advantage is that you're always seeing "real" datapoints on the final graph (it's missing a bunch of points, but the points that are there were actually in the original dataset so, if you mouse over something, you can show the actual x & y values because they're real, not averaged).

Lastly, this also helps with the problem of not having consistent x-axis spacing (again, you'll have real points instead of averaged x-axis positions).

I'm not sure how well this approach would work w/ 100s of millions of datapoints like you have, but it might be worth a try.

Dmz answered 4/1, 2018 at 17:33 Comment(0)
