I have a project whose performance I need to improve. I have a big MySQL database built from a huge CSV file (100 million rows). Insert time is not a problem, but query response time matters a lot, and sometimes a query with 2 joins takes around 20 hours...
To reduce this response time, I tried to migrate my database to Cassandra, but without success: my data model does not fit Cassandra's concepts. So I would like to try another way to improve performance: a Parallel Virtual File System. Instead of inserting the data into a MySQL database and then sending a query, I tried to read the whole CSV file in parallel and do my computation directly. But the result was not good: 2m20s for only 1,000,000 rows.
For the moment, my computation is very simple: in C++ with the MPI-IO API, I just count the number of distinct value pairs from 2 columns. To perform that computation, I use a hashmap whose keys are the value pairs from the CSV file. At the end, I return the hashmap size. Here is a small piece of the code:
// Snippet: get_filename, Tokenize and nidt_count are helper functions, and
// buf, bufsize, status, temp, tokens, line, count, path_name and nidt_hash
// are declared earlier.
MPI::Init(argc, argv);
cout << " INFO init done" << endl;
int myrank = MPI::COMM_WORLD.Get_rank();
int numprocs = MPI::COMM_WORLD.Get_size();

// Each process opens the file assigned to its rank.
get_filename(path_name, myrank);
cout << " INFO open file : " << path_name << endl;
MPI::File thefile = MPI::File::Open(MPI::COMM_WORLD, path_name.c_str(),
                                    MPI::MODE_RDONLY, MPI::INFO_NULL);

MPI::Offset offset = 101;               // starting read position in the file
MPI::Offset limit  = thefile.Get_size();
cout << " INFO go computing" << endl;

do {
    // Read a block, keep only the first complete line, split it on tabs,
    // update the pair counter, then advance to the start of the next line.
    thefile.Read_at(offset, buf, bufsize, MPI_CHAR, status);
    temp.assign(buf);
    Tokenize(temp, tokens, "\n");
    line.assign(tokens.at(0));
    tokens.clear();

    Tokenize(line, tokens, "\t");
    nidt_count(tokens);
    tokens.clear();

    offset += (line.size() + 1);
} while (offset < limit);

count = status.Get_count(MPI_CHAR);     // elements actually read (the read used MPI_CHAR)
cout << "process " << myrank << " reads " << nidt_hash.size() << " nidt" << endl;
I work on a server with 4 cores and 8 GB of RAM. My data is on a NAS mounted over NFS or Samba on the server. I would be able to add 2 or 3 servers for processing, but for the moment I just tried a small file (1 million rows) on one server to measure performance.
Finally, my questions are:
- Is switching to a PVFS-style approach a good idea for my problem? I should mention that I will also run more complicated queries, like: select all rows with a specific date (range of hours) and a specific pair of values from specific columns (a sketch of such a query is shown just after this list).
- Do you know of other things that could help me improve processing of a CSV file? I am thinking of using Hadoop, PyTables or FasterCSV.
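For illustration, such a query would look roughly like this (the date range and code values are just example filters):

SELECT *
FROM Tables_1
WHERE DATE BETWEEN '2007-05-13 15:00:00' AND '2007-05-13 16:00:00'
  AND CODE_1 = 0
  AND CODE_2 = 17;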
Here is a sample of my data, which consists of 2 CSV files.
The biggest one (100 million rows) is laid out as follows:
ID  DATE                 NUM_1        NUM_2        NB_UNITE  TYPUNIT  CODE_1  CODE_2
0   2007-05-13 15:37:48  33671624244  33698802900  547       s        0       17
0   2007-05-13 15:52:22  33671624244  33672211799  5         s        0       17
...
The second one is simpler and smaller (90,000 rows); it is like a dictionary where, from a CODE_1 and a CODE_2, I get a value named CODEVAL:
CODE_1  CODE_2  CODEVAL
0       17      VS
0       34      SS
As you would expect, I usually create 2 tables, one for each file, and a typical query is:
SELECT CODEVAL, HOUR(DATE) AS HEURE, COUNT(*) AS NBSMSSOR
FROM Tables_1
JOIN CODEVAL USING (CODE_1, CODE_2)
WHERE CODEVAL = 'SS'
GROUP BY HEURE;
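For reference, the two tables are created roughly like this (the column types here are approximate, not my exact DDL):

-- Approximate schema, for illustration only.
CREATE TABLE Tables_1 (
    ID        INT,
    DATE      DATETIME,
    NUM_1     BIGINT,
    NUM_2     BIGINT,
    NB_UNITE  INT,
    TYPUNIT   CHAR(1),
    CODE_1    INT,
    CODE_2    INT
);

CREATE TABLE CODEVAL (
    CODE_1   INT,
    CODE_2   INT,
    CODEVAL  VARCHAR(8)
);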
Sorry for the presentation; I do not know how to make a proper table.