What language could I use for fast execution of this database summarization task?

So I wrote a Python program to handle a little data processing task.

Here's a very brief specification in a made-up language of the computation I want:

parse "%s %lf %s" aa bb cc | group_by aa | quickselect --key=bb 0:5 | \
    flatten | format "%s %lf %s" aa bb cc

That is, for each line, parse out a word, a floating-point number, and another word. Think of them as a player ID, a score, and a date. I want the top five scores and dates for each player. The data size is not trivial, but not huge; about 630 megabytes.

I want to know what real, executable language I should have written it in to get it to be similarly short (as the Python below) but much faster.

#!/usr/bin/python
# -*- coding: utf-8; -*-
import sys

top_5 = {}

for line in sys.stdin:
    aa, bb, cc = line.split()

    # We want the top 5 for each distinct value of aa.  There are
    # hundreds of thousands of values of aa.
    bb = float(bb)
    if aa not in top_5: top_5[aa] = []
    current = top_5[aa]
    current.append((bb, cc))

    # Every once in a while, we drop the values that are not in
    # the top 5, to keep our memory footprint down, because some
    # values of aa have thousands of (bb, cc) pairs.
    if len(current) > 10:
        current.sort()
        current[:-5] = []

for aa in top_5:
    current = top_5[aa]
    current.sort()
    for bb, cc in current[-5:]:
        print aa, bb, cc

Here’s some sample input data:

3 1.5 a
3 1.6 b
3 0.8 c
3 0.9 d
4 1.2 q
3 1.5 e
3 1.8 f
3 1.9 g

Here’s the output I get from it:

3 1.5 a
3 1.5 e
3 1.6 b
3 1.8 f
3 1.9 g
4 1.2 q

There are seven values for 3, and so we drop the c and d values because their bb value puts them out of the top 5. Because 4 has only one value, its “top 5” consists of just that one value.

This runs faster than doing the same queries in MySQL (at least, the way we’ve found to do the queries) but I’m pretty sure it's spending most of its time in the Python bytecode interpreter. I think that in another language, I could probably get it to process hundreds of thousands of rows per second instead of per minute. So I’d like to write it in a language that has a faster implementation.
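One quick way to sanity-check where the time is going is the standard-library profiler; here's a minimal sketch (hypothetical file names, and it assumes the loop above has been wrapped in a main() function):

import cProfile
import pstats

def main():
    pass   # the loop from the script above, unchanged, goes here

cProfile.run('main()', 'top5.prof')
# Time not attributed to built-ins like str.split and float is mostly
# interpreter overhead.
pstats.Stats('top5.prof').sort_stats('cumulative').print_stats(10)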

But I’m not sure what language to choose.

I haven’t been able to figure out how to express this as a single query in SQL, and actually I’m unimpressed even with MySQL’s ability to merely dump the input data with select * from foo into outfile 'bar';.

C is an obvious choice, but things like line.split(), sorting a list of 2-tuples, and making a hash table require writing some code that’s not in the standard library, so I would end up with 100 lines of code or more instead of 14.

C++ seems like it might be a better choice (it has strings, maps, pairs, and vectors in the standard library) but it seems like the code would be a lot messier with STL.

OCaml would be fine, but does it have an equivalent of line.split(), and will I be sad about the performance of its map?

Common Lisp might work?

Is there some equivalent of Matlab for database computation like this that lets me push the loops down into fast code? Has anybody tried Pig?

(Edit: responded to davethegr8's comment by providing some sample input and output data, and fixed a bug in the Python program!)

(Additional edit: Wow, this comment thread is really excellent so far. Thanks, everybody!)

Edit:

There was an eerily similar question asked on sbcl-devel in 2007 (thanks, Rainer!), and here's an awk script from Will Hartung for producing some test data (although it doesn't have the Zipfian distribution of the real data):

BEGIN {
 for (i = 0; i < 27000000; i++) {
  v = rand();
  k = int(rand() * 100);
  print k " " v " " i;
 }
 exit;
}
Fasciate answered 23/9, 2009 at 18:45 Comment(4)
Some sample data and desired outcome might help your question. This is pretty ambiguous. Also, CREATE TABLE statements... examples of the data in SQL...Monomorphic
I don't have a suggestion on language, but one thing that may help is to track the minimum value for each aa necessary to place it in the top 5 in a separate table/dictionary. That will potentially allow you to skip a whole lot of sorting at a very small cost.Absolve
Assuming we have a large file to start with (i.e. excluding SQL, for the sake of argument), I noticed most solutions are single-threaded. An interesting angle is to use multiple threads. There are two issues though: (a) the program might not be 'short' as requested and (b) it may not be faster. I spent a lot of time on a threaded Java solution last night and, alas, it isn't faster... yet ;-)Grouty
Michael: I think the Pig and Matlab solutions might be able to multithread usefully. As it happens, all of the actual solutions are a couple of orders of magnitude worse than the lower bound set by my C filter, so I've been focusing on single-threaded performance first, but the production target is currently a 16-core box...Fasciate

I have a hard time believing that any script without any prior knowledge of the data (unlike MySQL, which has such info pre-loaded) would be faster than a SQL approach.

Aside from the time spent parsing the input, the script needs to keep sorting its "order by" arrays, etc.

The following is a first guess at what should work decently fast in SQL, assuming an index (*) on the table's aa, bb, cc columns, in that order. (A possible alternative would be an "aa, bb DESC, cc" index.)

(*) This index could be clustered or not; it doesn't affect the following query. The choice of clustering, and of whether a separate "aa, bb, cc" index is needed, depends on the use case, the size of the rows in the table, etc.

SELECT T1.aa, T1.bb, T1.cc , COUNT(*)
FROM tblAbc T1
LEFT OUTER JOIN tblAbc T2 ON T1.aa = T2.aa AND 
         (T1.bb < T2.bb OR(T1.bb = T2.bb AND T1.cc < T2.cc))
GROUP BY T1.aa, T1.bb, T1.cc
HAVING COUNT(*) < 5  -- trick, remember COUNT(*) goes 1,1,2,3,...
ORDER BY T1.aa, T1.bb, T1.cc, COUNT(*) DESC

The idea is to get a count of how many records, within a given aa value, are smaller than the record itself. There is a small trick however: we need to use a LEFT OUTER join, lest we discard the record with the biggest bb value or the last one (which may happen to be one of the top 5). As a result of the left join, the COUNT(*) value counts 1, 1, 2, 3, 4, etc., and the HAVING test is therefore "< 5" to effectively pick the top 5.
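To see the counting in miniature, here is a toy Python check of that logic (an illustration only, not part of the SQL): for each row, count the rows in the same aa group that beat it; the unbeaten best row still gets COUNT(*) = 1 thanks to the outer join, so the counts run 1, 1, 2, 3, 4, ... and the HAVING clause keeps exactly the top five.

rows = [("x", 1.9, "g"), ("x", 1.8, "f"), ("x", 1.6, "b"),
        ("x", 1.5, "e"), ("x", 1.5, "a"), ("x", 0.9, "d"), ("x", 0.8, "c")]
for aa, bb, cc in rows:
    # rows that outrank (aa, bb, cc): higher bb, or equal bb and higher cc
    beats_me = [r for r in rows
                if r[0] == aa and (bb < r[1] or (bb == r[1] and cc < r[2]))]
    count = max(len(beats_me), 1)   # the LEFT OUTER JOIN gives the best row one NULL match
    if count < 5:                   # the HAVING COUNT(*) < 5 test
        print aa, bb, cc, count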

To emulate the sample output of the OP, the ORDER BY uses DESC on the COUNT(), which could be removed to get a more traditional top-5 style of listing. Also, the COUNT() in the select list can be removed if so desired; this doesn't impact the logic of the query or the ability to sort properly.

Also note that this query is deterministic in its handling of ties, i.e. when a given set of records has the same value for bb (within an aa group). I think the Python program may produce slightly different outputs when the order of the input data is changed, because of its occasional truncation of the per-key lists it sorts.

Real solution: A SQL-based procedural approach

The self-join approach described above demonstrates how declarative statements can express the OP's requirement. However, this approach is naive in the sense that its performance is roughly bound by the sum of the squares of the record counts within each aa 'category' (not O(n^2) overall, but roughly O(n^2/a), where a is the number of distinct values of the aa column). In other words, it performs well with data where, on average, the number of records associated with a given aa value doesn't exceed a few dozen. If the aa column is not selective, the following approach is much, much better suited. It leverages SQL's efficient sorting framework while implementing a simple algorithm that would be hard to express declaratively. For datasets where most aa 'categories' have a particularly huge number of records, this approach could be further improved by binary-searching for the next aa value, looking ahead (and sometimes back) in the cursor. For cases where the number of aa 'categories' is low relative to the overall row count in tblAbc, see yet another approach, after this next one.

DECLARE @aa AS VARCHAR(10), @bb AS INT, @cc AS VARCHAR(10)
DECLARE @curAa AS VARCHAR(10)
DECLARE @Ctr AS INT

DROP TABLE  tblResults;
CREATE TABLE tblResults
(  aa VARCHAR(10),
   bb INT,
   cc VARCHAR(10)
);

DECLARE abcCursor CURSOR 
  FOR SELECT aa, bb, cc
  FROM tblABC
  ORDER BY aa, bb DESC, cc
  FOR READ ONLY;

OPEN abcCursor;

SET @curAa = ''

FETCH NEXT FROM abcCursor INTO @aa, @bb, @cc;
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @curAa <> @aa
    BEGIN
       SET @Ctr = 0
       SET @curAa = @aa
    END
    IF @Ctr < 5
    BEGIN
       SET @Ctr = @Ctr + 1;
       INSERT tblResults VALUES(@aa, @bb, @cc);
    END
    FETCH NEXT FROM AbcCursor INTO @aa, @bb, @cc;
END;

CLOSE abcCursor;
DEALLOCATE abcCursor;

SELECT * from tblResults
ORDER BY aa, bb, cc    -- OR .. bb DESC ... for a more traditional order.

An alternative to the above for cases when aa is very unselective, i.e. when we have relatively few aa 'categories'. The idea is to go through the list of distinct categories and run a "LIMIT" (MySQL) / "TOP" (MSSQL) query for each of these values. For reference, the following ran in 63 seconds for a tblAbc of 61 million records divided among 45 aa values, on MSSQL 8.0, on a relatively old/weak host.

DECLARE @aa AS VARCHAR(10)
DECLARE @aaCount INT

DROP TABLE  tblResults;
CREATE TABLE tblResults
(  aa VARCHAR(10),
   bb INT,
   cc VARCHAR(10)
);

DECLARE aaCountCursor CURSOR 
  FOR SELECT aa, COUNT(*)
  FROM tblABC
  GROUP BY aa
  ORDER BY aa
  FOR READ ONLY;
OPEN aaCountCursor;


FETCH NEXT FROM aaCountCursor INTO @aa, @aaCount
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT tblResults 
       SELECT TOP 5 aa, bb, cc
       FROM tblABC
       WHERE aa = @aa
       ORDER BY aa, bb DESC, cc

    FETCH NEXT FROM aaCountCursor INTO @aa, @aaCount;
END;

CLOSE aaCountCursor
DEALLOCATE aaCountCursor

SELECT * from tblResults
ORDER BY aa, bb, cc    -- OR .. bb DESC ... for a more traditional order.

On the question of needing an index or not (cf. the OP's remark): when merely running a "SELECT * FROM myTable", a table scan is effectively the fastest approach; no need to bother with indexes. However, the main reason why SQL is typically better suited for this kind of thing (aside from being the repository where the data has been accumulating in the first place, whereas any external solution needs to account for the time to export the relevant data) is that it can rely on indexes to avoid scanning. Many general-purpose languages are far better suited to raw processing, but they are fighting an unfair battle with SQL because they need to rebuild any prior knowledge of the data that SQL has gathered during its data collection / import phase. Since sorting is typically a time-consuming and sometimes space-consuming task, SQL, despite its relatively slower processing power, often ends up ahead of alternative solutions.

Also, even without pre-built indexes, modern query optimizers may decide on a plan that involves the creation of a temporary index. And because sorting is an intrinsic part of a DBMS, SQL servers are generally efficient in that area.

So... Is SQL better?

That said, if we are comparing SQL and other languages for pure ETL jobs, i.e. for taking heaps (unindexed tables) as input and performing various transformations and filtering, it is likely that multi-threadable utilities written in, say, C and leveraging efficient sorting libraries would be faster. The determining question for a SQL vs. non-SQL approach is where the data is located and where it should eventually reside. If we merely need to convert a file to be supplied down "the chain", external programs are better suited. If we have or need the data in a SQL server, there are only rare cases where it is worthwhile to export and process it externally.

Bashibazouk answered 23/9, 2009 at 18:50 Comment(6)
How would you formulate this query as SQL? And, as it turns out, this script in Python runs about twice as fast as select * from foo into outfile 'bar' in MySQL, which isn't doing any processing on the data at all.Fasciate
@Kragen, i'll post in a few hours. promised! (gotta run...) Tentatively, your first attempts with SQL were slow because of lack of indexes or a poorly crafted query ?Bashibazouk
No, select * from foo with no where clause is not slow because of lack of indexes; there's no index that could speed it up. You could argue that it's a "poorly crafted query" in the sense that it necessarily has to sequentially scan the whole table — but my point is that just sequentially scanning the whole table off spinning rust is a lot slower in MySQL than actually running the whole query in Python from a text file, and I think the Python script is still orders of magnitude slower than ideal.Fasciate
@Kragen Got the SQL snippet [my day job gets in the way of my SO-ing :-) ]... Note that this was only tested in MSSQL, should be just about the same in MySql; I'll have access to a "My" server later today if that was an issue.Bashibazouk
I understand your SQL now. How well does MSSQL handle the aa keys that have several thousand rows? The query looks O(N²) on the number of values with a particular aa key, and it appears nontrivial to optimize (I think the optimizer would have to figure out that the join condition induces a total ordering on rows with the same aa, which is only true if there are no duplicate rows, which you have also assumed and which happens to be true), so I'm wondering if you're using software that can actually handle that reasonably?Fasciate
@Kragen See my two additions. These are SQL-based but using a procedural style. I ran these on MSSQL server, as this is where I had easy access to big databases where I could emulate your load. The same would run just fine on MySQL if not for the occasional syntactical difference. The last script runs in slightly more than a minute, for 61M rows.Bashibazouk

You could use smarter data structures and still use Python. I ran your reference implementation and my Python implementation on my machine and compared the outputs to make sure the results agree.

This is yours:

$ time python ./ref.py  < data-large.txt  > ref-large.txt

real 1m57.689s
user 1m56.104s
sys 0m0.573s

This is mine:

$ time python ./my.py  < data-large.txt  > my-large.txt

real 1m35.132s
user 1m34.649s
sys 0m0.261s
$ diff my-large.txt ref-large.txt 
$ echo $?
0

And this is the source:

#!/usr/bin/python
# -*- coding: utf-8; -*-
import sys
import heapq

top_5 = {}

for line in sys.stdin:
    aa, bb, cc = line.split()

    # We want the top 5 for each distinct value of aa.  There are
    # hundreds of thousands of values of aa.
    bb = float(bb)
    if aa not in top_5: top_5[aa] = []
    current = top_5[aa]
    if len(current) < 5:
        heapq.heappush(current, (bb, cc))
    else:
        if current[0] < (bb, cc):
            heapq.heapreplace(current, (bb, cc))

for aa in top_5:
    current = top_5[aa]
    while len(current) > 0:
        bb, cc = heapq.heappop(current)
        print aa, bb, cc

Update: Know your limits. I've also timed a no-op version, to find the fastest possible Python solution with code similar to the original:

$ time python noop.py < data-large.txt  > noop-large.txt

real    1m20.143s
user    1m19.846s
sys 0m0.267s

And the noop.py itself:

#!/usr/bin/python
# -*- coding: utf-8; -*-
import sys
import heapq

top_5 = {}

for line in sys.stdin:
    aa, bb, cc = line.split()

    bb = float(bb)
    if aa not in top_5: top_5[aa] = []
    current = top_5[aa]
    if len(current) < 5:
        current.append((bb, cc))

for aa in top_5:
    current = top_5[aa]
    current.sort()
    for bb, cc in current[-5:]:
        print aa, bb, cc
Vermeil answered 24/9, 2009 at 6:8 Comment(4)
Nice! I hadn't tried heapq because I thought it was implemented in pure Python and therefore would probably be slower, but it's not and it isn't.Fasciate
I haven't benchmarked it, but I believe that even a pure Python heapq implementation would be faster, especially with psyco.Vermeil
It returns the 'bottom 5', not the 'top 5', doesn't it?Whiteley
ygrek: no, heapq maintains a min-heap, not a max-heap; current[0] is the smallest of the five items in the heap, not the largest. So whenever we find an item that's lower than the current minimum, we replace the current minimum with it using heapreplace.Fasciate

This is a sketch in Common Lisp.

Note that for long files there is a penalty for using READ-LINE, because it conses a fresh string for each line. In that case, use one of the READ-LINE derivatives floating around that use a line buffer. You might also check whether you want the hash table to be case sensitive or not.

second version

Splitting the string with a library function is no longer needed, because we do it here. This is low-level code, in the hope that some speed gains are possible. It accepts one or more spaces as the field delimiter, and also tabs.

(defun read-a-line (stream)
  (let ((line (read-line stream nil nil)))
    (flet ((delimiter-p (c)
             (or (char= c #\space) (char= c #\tab))))
      (when line
        (let* ((s0 (position-if     #'delimiter-p line))
               (s1 (position-if-not #'delimiter-p line :start s0))
               (s2 (position-if     #'delimiter-p line :start (1+ s1)))
               (s3 (position-if     #'delimiter-p line :from-end t)))
          (values (subseq line 0 s0)
                  (list (read-from-string line nil nil :start s1 :end s2)
                        (subseq line (1+ s3)))))))))

The above function returns two values: the key and a list of the rest.

(defun dbscan (top-5-table stream)
   "get triples from each line and put them in the hash table"
  (loop with aa = nil and bbcc = nil do
    (multiple-value-setq (aa bbcc) (read-a-line stream))
    while aa do
    (setf (gethash aa top-5-table)
          (let ((l (merge 'list (gethash aa top-5-table) (list bbcc)
                          #'> :key #'first)))
             (or (and (nth 5 l) (subseq l 0 5)) l)))))


(defun dbprint (table output)
  "print the hashtable contents"
  (maphash (lambda (aa value)
              (loop for (bb cc) in value
                    do (format output "~a ~a ~a~%" aa bb cc)))
           table))

(defun dbsum (input &optional (output *standard-output*))
  "scan and sum from a stream"
  (let ((top-5-table (make-hash-table :test #'equal)))
    (dbscan top-5-table input)
    (dbprint top-5-table output)))


(defun fsum (infile outfile)
   "scan and sum a file"
   (with-open-file (input infile :direction :input)
     (with-open-file (output outfile
                      :direction :output :if-exists :supersede)
       (dbsum input output))))

some test data

(defun create-test-data (&key (file "/tmp/test.data") (n-lines 100000))
  (with-open-file (stream file :direction :output :if-exists :supersede)
    (loop repeat n-lines
          do (format stream "~a ~a ~a~%"
                     (random 1000) (random 100.0) (random 10000)))))

; (create-test-data)

(defun test ()
  (time (fsum "/tmp/test.data" "/tmp/result.data")))

third version, LispWorks

Uses some SPLIT-STRING and PARSE-FLOAT functions, otherwise generic CL.

(defun fsum (infile outfile)
  (let ((top-5-table (make-hash-table :size 50000000 :test #'equal)))
    (with-open-file (input infile :direction :input)
      (loop for line = (read-line input nil nil)
            while line do
            (destructuring-bind (aa bb cc) (split-string '(#\space #\tab) line)
              (setf bb (parse-float bb))
              (let ((v (gethash aa top-5-table)))
                (unless v
                  (setf (gethash aa top-5-table)
                        (setf v (make-array 6 :fill-pointer 0))))
                (vector-push (cons bb cc) v)
                (when (> (length v) 5)
                  (setf (fill-pointer (sort v #'> :key #'car)) 5))))))
    (with-open-file (output outfile :direction :output :if-exists :supersede)
      (maphash (lambda (aa value)
                 (loop for (bb . cc) across value do
                       (format output "~a ~f ~a~%" aa bb cc)))
               top-5-table))))    
Carnay answered 23/9, 2009 at 18:45 Comment(18)
For future readers: ww.telent.net/cliki/SPLIT-SEQUENCE is a broken link; use cliki.net/SPLIT-SEQUENCE instead.Fasciate
...and on Ubuntu I can apt-get install cl-split-sequence, and then I am able to (require 'split-sequence) in SBCL.Fasciate
Another note: to get this to run, one must replace all occurrences of ˜ with ~.Fasciate
Also, the order of the arguments to sort and gethash are wrong.Fasciate
I don't have with-input-from-file in SBCL; instead I am using (with-open-file (stream file :direction :input).Fasciate
The version I got to run is at gist.github.com/192354 — it does not seem to be as fast as I had hoped. It took 940 seconds in SBCL to digest 27 million records, which it seems to have done correctly. Its memory usage was reasonable (maybe 83 megabytes). SBCL reports that it consed 32 gigabytes of memory, which seems like a lot to me; the input file is only 0.6 gigabytes. On the same machine, the interpreted Python program took 361 seconds. The outputs aren't exactly the same but the differences look OK. Should I do something differently?Fasciate
Ask on sbcl-help to get hints on optimization.Clayberg
Hey, as I sort of indicated in my gist, the separator might be a #\Tab too, or multiple spaces. Do I end up with (min (position #\Space line) (position #\Tab line)), or is there something nicer?Fasciate
I have updated the function to accept multiple spaces and also tabs. the function gets larger, but such is life...Carnay
(defun read-a-line (stream) "read a line, parsing the first two tokens and getting the rest as string." (let ((aa (read stream nil)) (bb (read stream nil)) (cc (read-line stream nil))) (values aa (list bb cc))))Declare
Svante, well, maybe. If you know the contents of the fields, yes. But otherwise, maybe not. "a(b)" in the first field needs two READs.Carnay
Awesome, thanks. Is there anything special you think I ought to do to get SBCL to run it more efficiently (compiler options, declarations)? I'm testing the new version now...Fasciate
...okay, without doing anything special in SBCL, it takes 478 seconds, compared to 169 seconds for the Python version (now that my CPU isn't overheating and throttling, heh). 26 seconds in the GC (it consed 29 gigabytes, so the changes to the input processing didn't buy as much as I would have hoped). I feel like I must be doing something wrong, because SBCL ought to be quite a bit faster...Fasciate
I have a version for LispWorks that runs quite a bit faster. An expensive operation is to parse the float. Common Lisp does not have a low-level parse-float, so one needs to use an implementation version or from a library. If I use LispWorks' parse-float, the time is 1/3 of the of SBCL.Carnay
Wow, does that mean SBCL is spending ⅔ of its time parsing floats? It seems to be a substantial fraction of the C++ version's time too. Still, a 3× speedup would still get it only into the same ballpark as the interpreted Python — it would still be substantially slower than the OCaml and Psyco-compiled Python versions. I expected CL implementations like Python-the-compiler to be able to beat the pants off of Python-the-language at this kind of thing.Fasciate
No, SBCL loses time in several places. I'm not really sure what the general problem is, it would make sense to check that with the SBCL developers. I could think that it also has something to do with Unicode operations - just a guess. I think your expectations were wrong. Operations like parsing floats (an example) are costly. But this is not written in Python, but in C and called by Python. Where Lisp implementations like SBCL (and many others) tend to have most operations in Lisp and not in C. Every task where Python mostly calls its C routines should not be expected to be 'slow'.Carnay
so here is a good thread with useful hints for SBCL: groups.google.com/group/sbcl-devel/browse_thread/thread/…Carnay

This took 45.7s on my machine with 27M rows of data that looked like this:

42 0.49357 0
96 0.48075 1
27 0.640761 2
8 0.389128 3
75 0.395476 4
24 0.212069 5
80 0.121367 6
81 0.271959 7
91 0.18581 8
69 0.258922 9

Your script took 1m42 on this data, and the C++ example took 1m46 (g++ t.cpp -o t to compile it; I don't know anything about C++).

Java 6, not that it matters really. Output isn't perfect, but it's easy to fix.

package top5;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class Main {

    public static void main(String[] args) throws Exception {
        long start  = System.currentTimeMillis();
        Map<String, Pair[]> top5map = new TreeMap<String, Pair[]>();
        BufferedReader br = new BufferedReader(new FileReader("/tmp/file.dat"));

        String line = br.readLine();
        while(line != null) {
            String parts[] = line.split(" ");

            String key = parts[0];
            double score = Double.valueOf(parts[1]);
            String value = parts[2];
            Pair[] pairs = top5map.get(key);

            boolean insert = false;
            Pair p = null;
            if (pairs != null) {
                insert = (score > pairs[pairs.length - 1].score) || pairs.length < 5;
            } else {
                insert = true;
            }
            if (insert) {
                p = new Pair(score, value);
                if (pairs == null) {
                    pairs = new Pair[1];
                    pairs[0] = new Pair(score, value);
                } else {
                    if (pairs.length < 5) {
                        Pair[] newpairs = new Pair[pairs.length + 1];
                        System.arraycopy(pairs, 0, newpairs, 0, pairs.length);
                        pairs = newpairs;
                    }
                    int k = 0;
                    for(int i = pairs.length - 2; i >= 0; i--) {
                        if (pairs[i].score <= p.score) {
                            pairs[i + 1] = pairs[i];
                        } else {
                            k = i + 1;
                            break;
                        }
                    }
                    pairs[k] = p;
                }
                top5map.put(key, pairs);
            }
            line = br.readLine();
        }
        for(Map.Entry<String, Pair[]> e : top5map.entrySet()) {
            System.out.print(e.getKey());
            System.out.print(" ");
            System.out.println(Arrays.toString(e.getValue()));
        }
        System.out.println(System.currentTimeMillis() - start);
    }

    static class Pair {
        double score;
        String value;

        public Pair(double score, String value) {
            this.score = score;
            this.value = value;
        }

        public int compareTo(Object o) {
            Pair p = (Pair) o;
            return (int)Math.signum(score - p.score);
        }

        public String toString() {
            return String.valueOf(score) + ", " + value;
        }
    }
}

AWK script to fake the data:

BEGIN {
 for (i = 0; i < 27000000; i++) {
  v = rand();
  k = int(rand() * 100);
  print k " " v " " i;
 }
 exit;
}
Ghetto answered 23/9, 2009 at 22:19 Comment(3)
Hmm, seems like it’s splitting on spaces? My real data file happens to be tab-separated, so I changed it to use "\t" as the separator. I compiled it by putting this file into top5/Main.java, compiling with javac top5/Main.java, and executing with java top5.Main. Will report my results soon. I'd like to add my thanks for the awk script — it’s been crucial!Fasciate
Also, I note that the output format is not compatible, but that's fixable. Assuming the program is correct, my first run is 106 seconds on the data file that my Python program takes 169±4 seconds on.Fasciate
And that was the slowest run of the five — I get 96 seconds ±10, almost as fast as the OCaml version.Fasciate

Here is one more OCaml version, targeted for speed, with a custom parser on Streams. Too long, but parts of the parser are reusable. Thanks peufeu for triggering the competition :)

Speed:

  • simple ocaml - 27 sec
  • ocaml with Stream parser - 15 sec
  • c with manual parser - 5 sec

Compile with:

ocamlopt -pp camlp4o code.ml -o caml

Code:

open Printf

let cmp x y = compare (fst x : float) (fst y)
let digit c = Char.code c - Char.code '0'

let rec parse f = parser
  | [< a=int; _=spaces; b=float; _=spaces; 
       c=rest (Buffer.create 100); t >] -> f a b c; parse f t
  | [< >] -> ()
and int = parser
  | [< ''0'..'9' as c; t >] -> int_ (digit c) t
  | [< ''-'; ''0'..'9' as c; t >] -> - (int_ (digit c) t)
and int_ n = parser
  | [< ''0'..'9' as c; t >] -> int_ (n * 10 + digit c) t
  | [< >] -> n
and float = parser
  | [< n=int; t=frem; e=fexp >] -> (float_of_int n +. t) *. (10. ** e)
and frem = parser
  | [< ''.'; r=frem_ 0.0 10. >] -> r
  | [< >] -> 0.0
and frem_ f base = parser
  | [< ''0'..'9' as c; t >] -> 
      frem_ (float_of_int (digit c) /. base +. f) (base *. 10.) t
  | [< >] -> f
and fexp = parser
  | [< ''e'; e=int >] -> float_of_int e
  | [< >] -> 0.0
and spaces = parser
  | [< '' '; t >] -> spaces t
  | [< ''\t'; t >] -> spaces t
  | [< >] -> ()
and crlf = parser
  | [< ''\r'; t >] -> crlf t
  | [< ''\n'; t >] -> crlf t
  | [< >] -> ()
and rest b = parser
  | [< ''\r'; _=crlf >] -> Buffer.contents b
  | [< ''\n'; _=crlf >] -> Buffer.contents b
  | [< 'c; t >] -> Buffer.add_char b c; rest b t
  | [< >] -> Buffer.contents b

let () =
  let all = Array.make 200 [] in
  let each a b c =
    assert (a >= 0 && a < 200);
    match all.(a) with
    | [] -> all.(a) <- [b,c]
    | (bmin,_) as prev::tl -> if b > bmin then
      begin
        let m = List.sort cmp ((b,c)::tl) in
        all.(a) <- if List.length tl < 4 then prev::m else m
      end
  in
  parse each (Stream.of_channel stdin);
  Array.iteri 
    (fun a -> List.iter (fun (b,c) -> printf "%i %f %s\n" a b c))
    all
Whiteley answered 1/10, 2009 at 9:40 Comment(0)

Of all the programs in this thread that I've tested so far, the OCaml version is the fastest and also among the shortest. (Line-of-code-based measurements are a little fuzzy, but it's not clearly longer than the Python version or the C or C++ versions, and it is clearly faster.)

Note: I figured out why my earlier runtimes were so nondeterministic! My CPU heatsink was clogged with dust and my CPU was overheating as a result. Now I am getting nice deterministic benchmark times. I think I've redone all the timing measurements in this thread now that I have a reliable way to time things.

Here are the timings for the different versions so far, running on a 27-million-row 630-megabyte input data file. I'm on Ubuntu Intrepid Ibex on a dual-core 1.6GHz Celeron, running a 32-bit version of the OS (the Ethernet driver was broken in the 64-bit version). I ran each program five times and report the range of times those five tries took. I'm using Python 2.5.2, OpenJDK 1.6.0.0, OCaml 3.10.2, GCC 4.3.2, SBCL 1.0.8.debian, and Octave 3.0.1.

  • SquareCog's Pig version: not yet tested (because I can't just apt-get install pig), 7 lines of code.
  • mjv's pure SQL version: not yet tested, but I predict a runtime of several days; 7 lines of code.
  • ygrek's OCaml version: 68.7 seconds ±0.9 in 15 lines of code.
  • My Python version: 169 seconds ±4 or 86 seconds ±2 with Psyco, in 16 lines of code.
  • abbot's heap-based Python version: 177 seconds ±5 in 18 lines of code, or 83 seconds ±5 with Psyco.
  • My C version below, composed with GNU sort -n: 90 + 5.5 seconds (±3, ±0.1), but gives the wrong answer because of a deficiency in GNU sort, in 22 lines of code (including one line of shell.)
  • hrnt's C++ version: 217 seconds ±3 in 25 lines of code.
  • mjv's alternative SQL-based procedural approach: not yet tested, 26 lines of code.
  • mjv's first SQL-based procedural approach: not yet tested, 29 lines of code.
  • peufeu's Python version with Psyco: 181 seconds ±4, somewhere around 30 lines of code.
  • Rainer Joswig's Common Lisp version: 478 seconds (only run once) in 42 lines of code.
  • abbot's noop.py, which intentionally gives incorrect results to establish a lower bound: not yet tested, 15 lines of code.
  • Will Hartung's Java version: 96 seconds ±10 in, according to David A. Wheeler’s SLOCCount, 74 lines of code.
  • Greg's Matlab version: doesn't work.
  • Schuyler Erle's suggestion of using Pyrex on one of the Python versions: not yet tried.

I suspect abbot's version comes out relatively worse for me than for them because the real dataset has a highly nonuniform distribution: as I said, some aa values (“players”) have thousands of lines, while others only have one.

About Psyco: I applied Psyco to my original code (and abbot's version) by putting it in a main function, which by itself cut the time down to about 140 seconds, and calling psyco.full() before calling main(). This added about four lines of code.
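For concreteness, here's roughly what that change looks like (a sketch of what's described above; Psyco is a third-party JIT that only works on 32-bit x86 Python):

#!/usr/bin/python
# -*- coding: utf-8; -*-
import sys
import psyco

def main():
    # Same loop as the original script; moving it into a function also turns
    # its variables into fast locals, which is a speedup by itself.
    top_5 = {}
    for line in sys.stdin:
        aa, bb, cc = line.split()
        bb = float(bb)
        if aa not in top_5: top_5[aa] = []
        current = top_5[aa]
        current.append((bb, cc))
        if len(current) > 10:
            current.sort()
            current[:-5] = []
    for aa in top_5:
        current = top_5[aa]
        current.sort()
        for bb, cc in current[-5:]:
            print aa, bb, cc

psyco.full()   # compile everything Psyco can handle before running
main()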

I can almost solve the problem using GNU sort, as follows:

kragen@inexorable:~/devel$ time LANG=C sort -nr infile -o sorted

real    1m27.476s
user    0m59.472s
sys 0m8.549s
kragen@inexorable:~/devel$ time ./top5_sorted_c < sorted > outfile

real    0m5.515s
user    0m4.868s
sys 0m0.452s

Here top5_sorted_c is this short C program:

#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

enum { linesize = 1024 };

char buf[linesize];
char key[linesize];             /* last key seen */

int main() {
  int n = 0;
  char *p;

  while (fgets(buf, linesize, stdin)) {
    for (p = buf; *p && !isspace(*p); p++) /* find end of key on this line */
      ;
    if (p - buf != strlen(key) || 0 != memcmp(buf, key, p - buf)) 
      n = 0;                    /* this is a new key */
    n++;

    if (n <= 5)               /* copy up to five lines for each key */
      if (fputs(buf, stdout) == EOF) abort();

    if (n == 1) {               /* save new key in `key` */
      memcpy(key, buf, p - buf);
      key[p-buf] = '\0';
    }
  }
  return 0;
}

I first tried writing that program in C++ as follows, and I got runtimes which were substantially slower, at 33.6±2.3 seconds instead of 5.5±0.1 seconds:

#include <map>
#include <iostream>
#include <string>

int main() {
  using namespace std;
  int n = 0;
  string prev, aa, bb, cc;

  while (cin >> aa >> bb >> cc) {
    if (aa != prev) n = 0;
    ++n;
    if (n <= 5) cout << aa << " " << bb << " " << cc << endl;
    prev = aa;
  }
  return 0;
}

I did say almost. The problem is that sort -n does okay for most of the data, but it fails when it's trying to compare 0.33 with 3.78168e-05. So to get this kind of performance and actually solve the problem, I need a better sort.
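Just to illustrate the comparison problem (the real ordering is done by GNU sort, not Python; this is only a toy demonstration of text order versus numeric order):

print sorted(["0.33", "3.78168e-05"])              # text order:    ['0.33', '3.78168e-05']
print sorted(["0.33", "3.78168e-05"], key=float)   # numeric order: ['3.78168e-05', '0.33']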

Anyway, I kind of feel like I'm whining, but the sort-and-filter approach is about 5× faster than the Python program, while the elegant STL program from hrnt is actually a little slower — there seems to be some kind of gross inefficiency in <iostream>. I don't know where the other 83% of the runtime is going in that little C++ version of the filter, but it isn't going anywhere useful, which makes me suspect I don't know where it's going in hrnt's std::map version either. Could that version be sped up 5× too? Because that would be pretty cool. Its working set might be bigger than my L2 cache, but as it happens it probably isn't.

Some investigation with callgrind says my filter program in C++ is executing 97% of its instructions inside of operator >>. I can identify at least 10 function calls per input byte, and cin.sync_with_stdio(false); doesn’t help. This probably means I could get hrnt’s C++ program to run substantially faster by parsing input lines more efficiently.

Edit: kcachegrind claims that hrnt’s program executes 62% of its instructions (on a small 157000 line input file) extracting doubles from an istream. A substantial part of this is because the istreams library apparently executes about 13 function calls per input byte when trying to parse a double. Insane. Could I be misunderstanding kcachegrind's output?

Anyway, any other suggestions?

Fasciate answered 23/9, 2009 at 18:45 Comment(4)
Indeed, C++ iostreams implementations have historically been pretty slow. However, they shouldn't be that slow anymore :) What g++ and libstdc++ versions are you using, by the way?Cordite
Thank you! I'm using libstdc++6 4.3.2-1ubuntu12 and g++ 4:4.3.1-1ubuntu2.Fasciate
you can actually apt-get Pig, just need to get it from archive.cloudera.com -- but don't bother, I ran it and it takes 20 minutes. Pig is built for problems that don't fit in memory, it's grossly inefficient in this case.Southeastwards
SquareCog: That's a shame! I wonder what it was doing for those 20 minutes.Fasciate

Pretty straightforward Caml (27 * 10^6 rows -- 27 sec, C++ by hrnt -- 29 sec)

open Printf
open ExtLib

let (>>) x f = f x
let cmp x y = compare (fst x : float) (fst y)
let wsp = Str.regexp "[ \t]+"

let () =
  let all = Hashtbl.create 1024 in
  Std.input_lines stdin >> Enum.iter (fun line ->
    let [a;b;c] = Str.split wsp line in
    let b = float_of_string b in
    try
      match Hashtbl.find all a with
      | [] -> assert false
      | (bmin,_) as prev::tl -> if b > bmin then
        begin
          let m = List.sort ~cmp ((b,c)::tl) in
          Hashtbl.replace all a (if List.length tl < 4 then prev::m else m)
        end
    with Not_found -> Hashtbl.add all a [b,c]
  );
  all >> Hashtbl.iter (fun a -> List.iter (fun (b,c) -> printf "%s %f %s\n" a b c))
Whiteley answered 25/9, 2009 at 13:12 Comment(9)
That's pretty short! How does it handle the case where the input fields are separated by tabs or multiple spaces? And if it's slower than the C++ version, does that mean it's slower than the Python version too?Fasciate
Original python version takes 1m20s here. Edited the code to handle multiple spaces, thanks for the tip. BTW I really think this task is IO bounded, best performance can be achieved by parallelizing.Whiteley
It's not close to I/O-bound. On my machine, my Python version takes 5–7 minutes, while my C filter chews through the same data in 5–7 seconds. But that doesn't take into account the time to sort. (The data is 630MB in textual form and I have 1GiB of RAM, so the sorting doesn't make it I/O-bound either. Main-memory-bandwidth bound maybe.)Fasciate
Thanks for the ExtLib reference, by the way! There are a lot of things in there that look interesting. I got your program to byte-compile with ocamlc -I +extlib str.cma extLib.cma top5.ml (I'm a little rusty with OCaml so this took me a while to figure out, especially the capital L). I got it to compile to native code with ocamlopt -I +extlib str.cmxa extLib.cmxa top5.ml, and it runs in 69 seconds (my Python version is 169±4, the C++ version is 217±3). That makes it the fastest version I've tried so far — and one of the shortest ones here. What's your OS? Did you just byte-compile it?Fasciate
I am really surprised by your results -- here are my timings (on the test data generated by awk script) : c++ 0m30.559s py(heapq) 1m10.400s ocaml 0m40.470s Maybe some difference can be explained by difference in data, but your C++ result looks really strange. OS is Debian x86_64 OCaml 3.11.1 (compiled to native code) g++ 4.3.2 (-O2) python 2.5.2Whiteley
Funny : out of those 40 seconds : 4 seconds spent reading lines, 16 s - splitting, 4 s - converting to float, 16 s - actually computing top5. Looks like there are plenty of opportunities to optimize :)Whiteley
I'm using the Ubuntu version of the same g++, but in 32-bit mode.Fasciate
Hypothesis: maybe the C++ version is actually spending a substantial fraction of its time inserting into the maps in my tests, because of the Zipf distribution, but not in yours? And maybe a sorted linked list is cheaper than a red-black tree (or whatever the C++ version uses) when N=5?Fasciate
Edited the code to sort list only when it is actually needed (41 -> 27 sec). Other optimizations, namely -- switch from hashtbl to array, split with pcre, arrays instead of lists for topN -- do not give any significant speedup.Whiteley

Here is a C++ solution. I didn't have a lot of data to test it with, however, so I don't know how fast it actually is.

[edit] Thanks to the test data provided by the awk script in this thread, I managed to clean up and speed up the code a bit. I am not trying to find out the fastest possible version - the intent is to provide a reasonably fast version that isn't as ugly as people seem to think STL solutions can be.

This version should be about twice as fast as the first version (goes through 27 million lines in about 35 seconds). Gcc users, remember to compile this with -O2.

#include <map>
#include <iostream>
#include <functional>
#include <utility>
#include <string>
int main() {
  using namespace std;
  typedef std::map<string, std::multimap<double, string> > Map;
  Map m;
  string aa, cc;
  double bb;
  std::cin.sync_with_stdio(false); // Dunno if this has any effect, but anyways.

  while (std::cin >> aa >> bb >> cc)
    {
      if (m[aa].size() == 5)
        {
          Map::mapped_type::iterator iter = m[aa].begin();
          if (bb < iter->first)
            continue;
          m[aa].erase(iter);
        }
      m[aa].insert(make_pair(bb, cc));
    }
  for (Map::const_iterator iter = m.begin(); iter != m.end(); ++iter)
    for (Map::mapped_type::const_iterator iter2 = iter->second.begin();
         iter2 != iter->second.end();
         ++iter2)
      std::cout << iter->first << " " << iter2->first << " " << iter2->second <<
 std::endl;

}
Cordite answered 23/9, 2009 at 21:32 Comment(6)
If you are compiling this with g++, then you should use -O2 to turn on compiler optimizations.Cordite
Hmm, is it necessarily going to erase the pair with the lowest bb when you m[aa].erase(--iter)?Fasciate
(Note: i updated the version that is perhaps a bit cleaner) The old version had a map that had bb as its key. C++ sorts maps according to their key - in the old version the order was descending. In that case, the last element is the element with smallest key (=bb).Cordite
This new version keeps the map sorted in ascending order. Because bb is used as the key, the first element is always the element with smallest bb. If new bb is larger than that, then the older one can be removed.Cordite
Very elegant! I think I just failed to notice that Map::mapped_type was a multimap itself.Fasciate
Oops, I thought it was faster than the Python version on my machine, but it looks like it was the CPU overheating tricking me. This C++ version takes 217 seconds (±3) compiled with g++ -O2, while the Python version takes 169 seconds (±4). I'm still impressed that the C++ version is as clean as it is, even though it still seems a little messier than the Python (and twice as long).Fasciate

Interestingly, the original Python solution is by far the cleanest looking (although the C++ example comes close).

How about using Pyrex or Psyco on your original code?

Halitosis answered 24/9, 2009 at 0:37 Comment(1)
Psyco actually makes the Python version faster than the Java and traditional-Unix versions. Thanks!Fasciate

Has anybody tried doing this problem with just awk, specifically 'mawk'? It should be faster than even Java and C++, according to this blog post: http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/

EDIT: Just wanted to clarify that the only claim being made in that blog post is that for a certain class of problems that are specifically suited to awk-style processing, the mawk virtual machine can beat 'vanilla' implementations in Java and C++.

Acetous answered 24/9, 2009 at 4:34 Comment(3)
That's a really great blog post! Thanks! I haven't tried using mawk on this yet. I don't think that we can justify a blanket expectation that mawk is always faster than Java and C++ — after all, anything you can do in the mawk interpreter can be done in C++ — but it definitely looks like it's worth a try. Another lovely thing about mawk is its small size: only about 10kloc of C.Fasciate
Agreed that a blanket statement isn't justified, and I don't think that blog post is making such a statement. The claim is that awk is a domain specific language (for line-by-line data-processing tasks), and for such tasks, the mawk virtual machine can beat "vanilla" Java and C++ programs. I've edited my original comment to include this clarification.Acetous
I wonder how you'd handle the sorting in mawk. It doesn't seem to have a built-in sort routine. It has arrays, so you could write one, but an interpreted sort routine (or heap-management routine) would seem to put awk at a bit of a disadvantage here both in speed and in brevity.Fasciate

Since you asked about Matlab, here's how I did something like what you're asking for. I tried to do it without any for loops, but I do have one because I didn't care to take a long time with it. If you were worried about memory then you could pull data from the stream in chunks with fscanf rather than reading the entire buffer.

fid = fopen('fakedata.txt','r');
tic
A=fscanf(fid,'%d %d %d\n');
A=reshape(A,3,length(A)/3)';  %Matlab reads the data into one long column'
Names = unique(A(:,1));
for i=1:length(Names)
    indices = find(A(:,1)==Names(i));   %Grab all instances of key i
    [Y,I] = sort(A(indices,2),1,'descend'); %sort in descending order of 2nd record
    A(indices(I(1:min([5,length(indices(I))]))),:) %Print the top five
end
toc
fclose(fid)
Lyrate answered 24/9, 2009 at 23:14 Comment(6)
Yeah, I was thinking of Matlab in the "no stinking loops" sense ☺ — Pig might be it, I don't know.Fasciate
It does a million records in 3.91 seconds. I corrected an error in the code also.Lyrate
How fast is the Python version on the machine where Matlab does a million records in 3.91 seconds?Fasciate
I got it to run on my example data set from the original post by changing the format string to '%s %f %s\n', but that transforms the first and third columns into variable-length strings of ASCII byte numbers! It works on my sample data set because those columns all happen to be one character long, but not on the real data. Also, the output format is a little off — it prints the ASCII number of the character, and it intersperses ans = lines.Fasciate
So, an alternative formulation to handle variable length strings would be fid = fopen('fakedata.txt','r'); tic A=textscan(fid,'%s %f %s\n'); Names = unique(A{1}); for i=1:length(Names) indices = strmatch(Names{i},A{1}); %Grab all instances of key i [Y,I] = sort(A{2}(indices),1,'descend'); %sort in descending order of 2nd record indices = indices(I(1:min([5,length(indices(I))]))); % disp(sprintf('%s %f %s',A{1}{indices},A{2}(indices),A{3}{indices})); end toc fclose(fid)Lyrate
Well, that didn't work. Anyway, instead of using fscanf I used textscan. Textscan will cram the data into a cell array, so it can handle the variable length string problem. However, printing the data out now becomes a problem and I didn't find an elegant solution for that. But, the whole thing is kinda moot because ignoring the printing, as my example code of my previous comment does, takes over 27seconds. A good chunk of that time is taken up by textscan itself. So, for your specific task Matlab is not a good solution, but you figured that in the first place.Lyrate

Speaking of lower bounds on compute time:

Let's analyze my algo above:

for each row (key,score,id) :
    create or fetch a list of top scores for the row's key
    if len( this list ) < N
        append current
    else if current score > minimum score in list
        replace minimum of list with current row
        update minimum of all lists if needed

Let N be the N in top-N, R the number of rows in your data set, and K the number of distinct keys.

What assumptions can we make?

R * sizeof(row) > RAM, or at least the data is big enough that we don't want to load it all into a hash grouped by key and then sort each bin. For the same reason we don't sort the whole thing.

Kragen likes hashtables, so K * sizeof(per-key state) << RAM; most probably it fits in the L2/L3 cache.

Kragen is not sorting, so K*N << R, i.e. each key has much more than N entries.

(note: A << B means A is small relative to B)

If the data has a random distribution, then

after a small number of rows, the majority of rows will be rejected by the per-key minimum condition; the cost is 1 comparison per row.

So the cost per row is 1 hash lookup + 1 comparison + epsilon * (list insertion + (N+1) comparisons for the minimum)

If the scores have a random distribution (say between 0 and 1) and the conditions above hold, both epsilons will be very small.

Experimental proof:

The 27-million-row dataset above produces 5933 insertions into the top-N lists. All other rows are rejected by a simple key lookup and comparison. epsilon = 0.0001
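As a rough sanity check (a back-of-the-envelope sketch, assuming uniformly random scores and the awk test-data parameters of R = 27 million rows, K = 100 keys, N = 5): the i-th row of a key enters the running top-N with probability about N/i, so the expected number of insertions is roughly K * N * (1 + ln(m/N)), with m = R/K rows per key.

import math

R, K, N = 27000000, 100, 5           # rows, distinct keys, top-N (test-data parameters)
m = R / K                            # rows per key
print K * N * (1 + math.log(float(m) / N))   # ~5950, close to the 5933 observed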

So roughly, the cost is 1 lookup + comparison per row, which takes a few nanoseconds.

On current hardware, that is going to be negligible compared to the IO cost and especially the parsing cost.

Yam answered 30/9, 2009 at 12:58 Comment(1)
Yes. Well, you left out the parsing time, and the real dataset ends up with several hundred thousand final members in the top-N lists (Zipf distribution, again), so it must have had at least several hundred thousand insertions into them. But this answer does do a good job of explaining my frustration about how far away the working answers are from this kind of performance — and the direction of "oh, maybe you should parallelize this".Fasciate

Isn't this just as simple as

 SELECT DISTINCT aa, bb, cc FROM tablename ORDER BY bb DESC LIMIT 5

?

Of course, it's hard to tell what would be fastest without testing it against the data. And if this is something you need to run very fast, it might make sense to optimize your database to make the query faster, rather than optimizing the query.

And, of course, if you need the flat file anyway, you might as well use that.

Plumb answered 23/9, 2009 at 19:9 Comment(1)
That is a different query. For example, if I have 500000 different values of aa, it will give me 5 rows of output, while I want between 500000 and 2500000: up to 5 rows of output for each value of aa.Fasciate

Pick "top 5" would look something like this. Note that there's no sorting. Nor does any list in the top_5 dictionary ever grow beyond 5 elements.

from collections import defaultdict
import sys

def keep_5( aList, aPair ):
    minbb= min( bb for bb,cc in aList )
    bb, cc = aPair
    if bb < minbb: return aList
    aList.append( aPair )
    min_i= 0
    for i in xrange(1,6):
        if aList[i][0] < aList[min_i][0]:
            min_i= i
    aList.pop(min_i)
    return aList


top_5= defaultdict(list)
for row in sys.stdin:
    aa, bb, cc = row.split()
    bb = float(bb)
    if len(top_5[aa]) < 5:
        top_5[aa].append( (bb,cc) )
    else:
        top_5[aa]= keep_5( top_5[aa], (bb,cc) )
Therm answered 23/9, 2009 at 19:23 Comment(7)
I considered that approach but I figured it would be too slow, but I admit I haven't actually tried it. I'm really sure it won't give me the order-of-magnitude speedup I'm looking for though.Fasciate
Run it and see. It's extremely fast.Therm
My script runs over a certain dataset in 1m38s. Yours takes 2m2s, about 20% more time, after I fixed it to import sys and changed line.split() to row.split(). Also, it doesn't produce any output. Shortly I will be able to report whether it is correct (unlike my initial version!)Fasciate
...it is also not correct; it is happy replacing any row that is less than the one it's considering with the one it's considering. Consequently if you have rows with scores (bb) of 5, 4, 3, 2, 1 and you insert a new row with score 3, it will replace the 2 with the 3 instead of the 1, leaving you with 5, 4, 3, 3, 1 instead of 5, 4, 3, 3, 2. In short, it gives the wrong answer more slowly.Fasciate
Incidentally, wc takes 4.7 seconds to report that that dataset contains 27 million lines in 630 megabytes.Fasciate
@Kragen Javier Sitaker: "Also, it doesn't produce any output". Correct. It produces the same top_5 structure; so feel free to copy the print loop from your original script.Therm
Yeah, that's what I did to find out that it was giving the wrong answer.Fasciate

The Pig version would go something like this (untested):

 Data = LOAD '/my/data' using PigStorage() as (aa:int, bb:float, cc:chararray);
 grp = GROUP Data by aa;
 topK = FOREACH grp {
     sorted = ORDER Data by bb DESC;
     lim = LIMIT sorted 5;
     GENERATE group as aa, lim;
 };
STORE topK INTO '/my/output' using PigStorage();

Pig isn't optimized for performance; its goal is to enable processing of multi-terabyte datasets using parallel execution frameworks. It does have a local mode, so you can try it, but I doubt it will beat your script.

Southeastwards answered 24/9, 2009 at 5:50 Comment(1)
Wow, that's awesomely short and clear, much better than any of the other versions so far. I was going to ask what the per-group overhead is (the flip side of some groups having tens of thousands of members is that most groups have a single member) but I'm not sure how to formulate that. I'll have to try it and see.Fasciate

That was a nice lunch break challenge, he, he.

Top-N is a well-known database killer. As shown by the post above, there is no way to efficiently express it in common SQL.

As for the various implementations, you've got to keep in mind that the slow part here is not the sorting or the top-N, it's the parsing of the text. Have you looked at the source code for glibc's strtod() lately?

For instance, I get, using Python:

Read data : 80.5  s
My TopN   : 34.41 s
HeapTopN  : 30.34 s

It is quite likely that you'll never get very fast timings, no matter what language you use, unless your data is in some format that is a lot faster to parse than text. For instance, loading the test data into postgres takes 70 s, and the majority of that is text parsing, too.
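To see the parsing cost in isolation, you can time float() by itself (a sketch; absolute numbers will obviously vary by machine):

import time

values = ["3.78168e-05"] * 1000000
t = time.clock()
for v in values:
    float(v)
print "1M float() calls:", time.clock() - t, "seconds"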

If the N in your topN is small, like 5, a C implementation of my algorithm below would probably be the fastest. If N can be larger, heaps are a much better option.

So, since your data is probably in a database, and your problem is getting at the data, not the actual processing, if you're really in need of a super fast TopN engine, what you should do is write a C module for your database of choice. Since postgres is faster for about anything, I suggest using postgres, plus it isn't difficult to write a C module for it.

Here's my Python code :

import random, sys, time, heapq

ROWS = 27000000

def make_data( fname ):
    f = open( fname, "w" )
    r = random.Random()
    for i in xrange( 0, ROWS, 10000 ):
        for j in xrange( i,i+10000 ):
            f.write( "%d    %f    %d\n" % (r.randint(0,100), r.uniform(0,1000), j))
        print ("write: %d\r" % i),
        sys.stdout.flush()
    print

def read_data( fname ):
    for n, line in enumerate( open( fname ) ):
        r = line.strip().split()
        yield int(r[0]),float(r[1]),r[2]
        if not (n % 10000 ):
            print ("read: %d\r" % n),
            sys.stdout.flush()
    print

def topn( ntop, data ):
    ntop -= 1
    assert ntop > 0
    min_by_key = {}
    top_by_key = {}
    for key,value,label in data:
        tup = (value,label)
        if key not in top_by_key:
            # initialize
            top_by_key[key] = [ tup ]
        else:
            top = top_by_key[ key ]
            l    = len( top )
            if l > ntop:
                # replace minimum value in top if it is lower than current value
                idx = min_by_key[ key ]
                if top[idx] < tup:
                    top[idx] = tup
                    min_by_key[ key ] = top.index( min( top ) )
            elif l < ntop:
                # fill until we have ntop entries
                top.append( tup )
            else:
                # we have ntop entries in list, we'll have ntop+1
                top.append( tup )
                # initialize minimum to keep
                min_by_key[ key ] = top.index( min( top ) )

    # finalize:
    return dict( (key, sorted( values, reverse=True )) for key,values in top_by_key.iteritems() )

def grouptopn( ntop, data ):
    top_by_key = {}
    for key,value,label in data:
        if key in top_by_key:
            top_by_key[ key ].append( (value,label) )
        else:
            top_by_key[ key ] = [ (value,label) ]

    return dict( (key, sorted( values, reverse=True )[:ntop]) for key,values in top_by_key.iteritems() )

def heaptopn( ntop, data ):
    top_by_key = {}
    for key,value,label in data:
        tup = (value,label)
        if key not in top_by_key:
            top_by_key[ key ] = [ tup ]
        else:
            top = top_by_key[ key ]
            if len(top) < ntop:
                heapq.heappush(top, tup)
            else:
                if top[0] < tup:
                    heapq.heapreplace(top, tup)

    return dict( (key, sorted( values, reverse=True )) for key,values in top_by_key.iteritems() )

def dummy( data ):
    for row in data:
        pass

make_data( "data.txt" )

t = time.clock()
dummy( read_data( "data.txt" ) )
t_read = time.clock() - t

t = time.clock()
top_result = topn( 5, read_data( "data.txt" ) )
t_topn = time.clock() - t

t = time.clock()
htop_result = heaptopn( 5, read_data( "data.txt" ) )
t_htopn = time.clock() - t

# correctness checking :
for key in top_result:
    print key, " : ", "        ".join (("%f:%s"%(value,label)) for (value,label) in    top_result[key])
    print key, " : ", "        ".join (("%f:%s"%(value,label)) for (value,label) in htop_result[key])

print
print "Read data :", t_read
print "TopN :     ", t_topn - t_read
print "HeapTopN : ", t_htopn - t_read

for key in top_result:
    assert top_result[key] == htop_result[key]
Yam answered 26/9, 2009 at 13:19 Comment(7)
You are surely correct about the slow part, in several implementations, being the parsing of text. I haven't looked at the source code to strtod lately but I did look at some profiling results... And it is true that the slowest part is already reading the data from the database, but that is tempting me to store it somewhere else instead.Fasciate
...wow, a little program timing glibc strtod on my machine clocks it at only 20 megabytes per second (on ten-character strings). Given that my 630-megabyte data file contains 240 megabytes of floating-point text, that accounts for, uh, 12 seconds of run-time. Which is a lot, in an absolute sense if not a relative one. It ought to be able to do 10× better than that.Fasciate
That's a variable l, not a digit 1, right? Just checking.Fasciate
I wonder if psyco.full() would have different speedups on the different approaches...Fasciate
On your sample dataset, psyco.full() speeds up both algorithms by a factor of between 2 and 3, and indeed it makes your algorithm run faster than the heapq one. I haven't broken the code out yet to measure it on my real dataset. It wouldn't be surprising if it were faster than the other Python approaches I've tried.Fasciate
...oops, that was from your timings: t_topn - t_read. For some reason, Psyco actually made read_data slower, so it wasn't actually such a big improvement...Fasciate
I wrote this function to be able to test topn with Psyco on the original data set: read_silently = lambda infile: ((int(aa), float(bb), cc) for aa, bb, cc in (line.split() for line in infile)). psyco.full() sped up the program from about 218 seconds to about 186, which is kind of a surprisingly small difference, since it sped up my original program from 169 seconds to 86 seconds.Fasciate

Well, please grab a coffee and read the source code for strtod -- it's mind-boggling, but necessary if you want float -> text -> float to give back the same float you started with... really...

Parsing integers is a lot faster (not so much in Python, though, but in C, yes).
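
Two tiny checks of those claims, sketched with nothing but the standard library (CPython assumed): the float -> text -> float direction has to round-trip exactly, which is why strtod is so hairy, and in Python int() is indeed not dramatically faster than float() on short strings.

import timeit

x = 1234.56789
assert float(repr(x)) == x   # the exact round-trip strtod has to guarantee

# compare parsing throughput for a short integer vs. a short float
print(timeit.timeit('int("1234567890")', number=10**6))
print(timeit.timeit('float("1234.56789")', number=10**6))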

Anyway, putting the data in a Postgres table:

SELECT count( key ) FROM topn;  -- topn holds the dataset generated by the program above

=> 7 s (so it takes 7 s to read the 27M records)

CREATE INDEX topn_key_value ON topn( key, value );

191 s

CREATE TEMPORARY TABLE topkeys AS SELECT key FROM topn GROUP BY key;

12 s

(You can use the index to get distinct values of 'key' faster too but it requires some light plpgsql hacking)

CREATE TEMPORARY TABLE top AS SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1) AS r FROM topkeys a) foo;

Time: 15.310 ms

INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 1) AS r FROM topkeys a) foo;

Time: 17.853 ms

INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 2) AS r FROM topkeys a) foo;

Time: 13.983 ms

INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 3) AS r FROM topkeys a) foo;

Time: 16.860 ms

INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 4) AS r FROM topkeys a) foo;

Time: 17.651 ms

INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 5) AS r FROM topkeys a) foo;

Time: 19.216 ms

SELECT * FROM top ORDER BY key,value;

As you can see, computing the top-n is extremely fast (provided n is small), but creating the (mandatory) index is extremely slow because it involves a full sort.
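
For completeness, here is a sketch of driving those statements from Python instead of typing them into psql. It assumes psycopg2 is installed and uses the topn/topkeys tables created above; the connection string is a placeholder. Each OFFSET k statement pulls the (k+1)-th best row for every key.

import psycopg2

PER_KEY = ("SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b "
           "WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET %d) AS r "
           "FROM topkeys a) foo")

conn = psycopg2.connect("dbname=test")     # connection string is an assumption
cur = conn.cursor()
cur.execute("CREATE TEMPORARY TABLE top AS " + PER_KEY % 0)
for offset in range(1, 6):                 # OFFSET 1..5, as in the session above
    cur.execute("INSERT INTO top " + PER_KEY % offset)
conn.commit()
cur.execute("SELECT * FROM top ORDER BY key, value")
rows = cur.fetchall()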

Your best bet is to use a format that is fast to parse (either binary, or write a custom C aggregate for your database, which would be the best choice IMHO). With a format like that, the runtime of a C program shouldn't be more than about 1 s.
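
As a rough sketch of the binary route (numpy assumed; the field widths, the 8-byte label, and the file names are all choices made for this illustration, and a real converter would work in chunks rather than build one big list): convert the text file once, then reload it with no per-field parsing at all.

import numpy as np

REC = np.dtype([("key", "<i4"), ("value", "<f8"), ("label", "S8")])

def text_to_binary(txt="data.txt", out="data.bin"):
    # pay the text-parsing cost exactly once
    rows = [(int(a), float(b), c) for a, b, c in
            (line.split() for line in open(txt))]
    np.array(rows, dtype=REC).tofile(out)

def read_binary(fname="data.bin"):
    # one bulk read, no strtod in sight
    return np.fromfile(fname, dtype=REC)

Reloading the packed file is essentially a memcpy, which is the point being made here.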

Yam answered 27/9, 2009 at 23:21 Comment(3)
Yeah, I was reading Clinger's 1990 paper ("How to Read Floating Point Numbers Accurately") last night. I think I'm safe as long as I stick to sending numbers of 14 digits and under through the fast path, and push the hard numbers off onto strtod. :) And that still gives me a 4–6× speedup in the usual case. Your correlated subquery approach is very clever! I'm still convinced that it's possible to do this problem (even correctly) substantially faster than even the OCaml program has, but I don't know if I'll finish testing that hypothesis before I have to go do other stuff.Fasciate
The correlated subquery is clever if you close your eyes very hard while not looking at the index creation time. But then, sorting a GB of data is never going to be instantaneous. If you already have the index, though, good for you.Yam
Anyway, if you really want to use textual data (do you? because you'll also need to generate those floating-point numbers as text, which is also very slow), then why are you even parsing the floating-point numbers at all? With a = X.XXXXXe+E and b = Y.YYYYYe+F, (a < b) is equivalent to E < F, or E == F and strcmp("X.XXXXX", "Y.YYYYY") < 0. Just print your numbers in scientific notation and use string comparison. Or, dammit, don't use FP numbers: if the max/min range is known, use integers (fixed point) with leading zeros and compare them as strings.Yam

I love lunch break challenges. Here's a 1-hour implementation.

OK, when you don't want to do some extremely exotic crap like additions, nothing stops you from using a custom base-10 floating-point format whose only implemented operator is comparison, right? lol.
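
A little Python sketch of that idea, for positive values only (zero and negatives would need extra care); encode() is a name invented for this illustration. The trick is to print each number in a fixed-width form with a biased exponent so that plain string order matches numeric order, after which the top-N code never needs to call float() or strtod at all.

def encode(x):
    # fixed-width form "EEE M.MMMMMMMM": exponent biased by +500 so it is
    # always three digits, mantissa normalized into [1, 10)
    assert x > 0
    e = 0
    while x >= 10.0:
        x /= 10.0
        e += 1
    while x < 1.0:
        x *= 10.0
        e -= 1
    return "%03d %.8f" % (e + 500, x)

# string order now matches numeric order
assert encode(999.9) < encode(1000.1)
assert encode(0.5) < encode(2.0)
assert sorted([3.14, 0.001, 250.0, 2.5], key=encode) == [0.001, 2.5, 3.14, 250.0]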

I had some fast-atoi code lying around from a previous project, so I just imported that.

http://www.copypastecode.com/11541/

This C source code takes about 6.6 seconds to parse the 580MB of input text (27 million lines); half of that time is fgets, lol. Then it takes approximately 0.05 seconds to compute the top-n, but I don't know for sure, since the time it takes for the top-n is less than the timer noise.

You'll be the one to test it for correctness though XDDDDDDDDDDD

Interesting, huh?

Yam answered 28/9, 2009 at 12:58 Comment(13)
Awesome, now that's the kind of speed I was hoping for! Now it's just a matter of getting that kind of speed out of a reasonably short program...Fasciate
"Just stuff everything in a library and your code will be reasonably short" (Java Book of Wisdom)Yam
Nah, just mmap() the file, get rid of fgets, and hack it a bit; I'm sure you can get it below 3 secs. If the float is lower than the lowest score in your current top-scores scoreboard for all categories, you don't need to parse the other 2 fields. And if it's lower than the minimum of the current category, you don't need to parse the other int. On the subject of printf(), some time ago I wrote some integer->text conversion functions which were about 20x faster than printf("%d"); it's not hard.Yam
In this particular case I suspect that a library isn't sufficient, since a lot of the optimizations applied are pretty dependent on the nature of the data. WRT mmap, I've often been disappointed with it failing to speed things up, but maybe in this case I wouldn't be.Fasciate
-1. It doesn't answer the OP question, it hardcodes so many things that it is not even funny, it doesn't solve the problem but only some specific subproblem. Yes, it is fast, but the price is too high.Whiteley
The OP's problem is to parse huge amounts of double-precision numbers fast, and then apply a trivial algorithm on them. Since parsing doubles fast is not possible if you want the answer in IEEE format, the simplest solution is to use a trick like this. The only hardcoded thing is the format (obviously, since that was specified by the OP) and the number of possible keys (which should really be handled by a hashtable, not a C array).Yam
Let me read that for you: "to get it to be similarly short (as the Python below) but much faster." Other hardcoded things: no negative floats, one kind of delimiter only, first and third columns are ints only, limited line length. No doubt - throwing away abstractions and using bounded arrays indexed with integers instead of dynamic hashtables may lead to faster code, but it looks like the wrong direction (for me personally).Whiteley
"to get it to be similarly short (as the Python below) but much faster" is simply not possible unless you follow the first advice I gave, which is to drop the text format and either use a binary format or write a C module for your database of choice, which is really the fastest way if your data already is in a database. OP's problem is text format parsing which is is either generic or fast, not both. The top-N algo itself is rather trivial, and using a hashtable (which is of course the right choice) would probably matter very little to the speed.Yam
So the reason I think this answer is important is not because it is good as a solution itself but because it shows that the lower bound on execution time of a solution is far, far less than many people suspected. Adding negative-number and tab and hash-table handling back in would not make much of a difference. I disagree with “either generic or fast, not both” in this context — it’s possible to lose only a little bit of speed from this version by handling more generic input. As to “throwing away abstractions”, yes, this code should be emitted by a compiler, not a programmer.Fasciate
Actually, my crap code above was more intended as a demonstration of how fast it can get if you're willing to drop all sanity, exactly as Kragen says. Note that it still validates all the input. The best solution to this problem is to change the problem, by getting the data in another format.Yam
Kragen, your last remark is absolutely exact: generic text parsing (like a customizable CSV parser) is in fact writing your own tiny interpreted language. You think it'll go fast because you write it in C, but in reality your program is not C, it is in an ad-hoc language (just like printf format strings are an ad-hoc, slow, interpreted language). Where does your data come from, btw? I'm curious.Yam
At the moment it's being extracted from MySQL, but before going into MySQL it was emitted (as text, in a less decently parseable format) by some proprietary program written in a mix of different compiled languages. MySQL's apparent inability to even traverse the records in sequence in less than several minutes (on spinning rust) is making me think I should use a different storage medium, something better suited to sequential access.Fasciate
peufeu, agreed. (BTW, as you said, hashtbl->array didn't really give much speedup; I was surprised.)Whiteley
