I have a project whose performance I need to improve. I have a big MySQL database built from a huge CSV file (100 million rows). Insert time is not a problem, but query response time matters a lot, and sometimes a query with 2 joins takes around 20 hours...
To reduce this response time, I tried to migrate my database to Cassandra, but without success: my data model does not fit Cassandra's concepts. So I would like to try another way to improve performance: a Parallel Virtual File System. Instead of inserting the data into a MySQL database and then sending a query, I tried to read the whole CSV file in parallel and do my computation directly. But the result was not good: 2m20s for only 1,000,000 rows.
For the moment, my computation is very simple: in C++ with the MPI-IO API, I just count the number of distinct value pairs from 2 columns. To perform that computation, I use a hashmap whose keys are the value pairs from the CSV file. At the end, I return the hashmap size. Here is a small piece of the code:
// Snippet: get_filename, Tokenize and nidt_count are helper functions, and
// buf, bufsize, status, temp, tokens, line, count, path_name and nidt_hash
// are declared earlier.
MPI::Init(argc, argv);
cout << " INFO init done" << endl;
int myrank = MPI::COMM_WORLD.Get_rank();
int numprocs = MPI::COMM_WORLD.Get_size();

// Each process opens the file assigned to its rank.
get_filename(path_name, myrank);
cout << " INFO open file : " << path_name << endl;
MPI::File thefile = MPI::File::Open(MPI::COMM_WORLD, path_name.c_str(),
                                    MPI::MODE_RDONLY, MPI::INFO_NULL);

MPI::Offset offset = 101;               // starting read position in the file
MPI::Offset limit  = thefile.Get_size();
cout << " INFO go computing" << endl;

do {
    // Read a block, keep only the first complete line, split it on tabs,
    // update the pair counter, then advance to the start of the next line.
    thefile.Read_at(offset, buf, bufsize, MPI_CHAR, status);
    temp.assign(buf);
    Tokenize(temp, tokens, "\n");
    line.assign(tokens.at(0));
    tokens.clear();

    Tokenize(line, tokens, "\t");
    nidt_count(tokens);
    tokens.clear();

    offset += (line.size() + 1);
} while (offset < limit);

count = status.Get_count(MPI_CHAR);     // elements actually read (the read used MPI_CHAR)
cout << "process " << myrank << " reads " << nidt_hash.size() << " nidt" << endl;
I work on a server with 4 cores and 8 GB of RAM. My data is on a NAS mounted over NFS or Samba on the server. I would be able to add 2 or 3 servers for processing, but for the moment I just tried a small file (1 million rows) on one server to measure performance.
Finally, my questions are:
- Is switching to a PVFS-style approach a good idea for my problem? I should mention that I will also run more complicated queries, like: select all rows with a specific date (range of hours) and a specific pair of values from specific columns (a sketch of such a query is shown just after this list).
- Do you know of other things that could help me improve processing of a CSV file? I am thinking of using Hadoop, PyTables or FasterCSV.
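For illustration, such a query would look roughly like this (the date range and code values are just example filters):

SELECT *
FROM Tables_1
WHERE DATE BETWEEN '2007-05-13 15:00:00' AND '2007-05-13 16:00:00'
  AND CODE_1 = 0
  AND CODE_2 = 17;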
Here is a sample of my data, which consists of 2 CSV files.
The biggest one (100 million rows) is laid out as follows:
ID  DATE                 NUM_1        NUM_2        NB_UNITE  TYPUNIT  CODE_1  CODE_2
0   2007-05-13 15:37:48  33671624244  33698802900  547       s        0       17
0   2007-05-13 15:52:22  33671624244  33672211799  5         s        0       17
...
The second one is simpler and smaller (90,000 rows); it is like a dictionary where, from a CODE_1 and a CODE_2, I get a value named CODEVAL:
CODE_1  CODE_2  CODEVAL
0       17      VS
0       34      SS
As you would expect, I usually create 2 tables, one for each file, and a typical query is:
SELECT CODEVAL, HOUR(DATE) AS HEURE, COUNT(*) AS NBSMSSOR
FROM Tables_1
JOIN CODEVAL USING (CODE_1, CODE_2)
WHERE CODEVAL = 'SS'
GROUP BY HEURE;
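For reference, the two tables are created roughly like this (the column types here are approximate, not my exact DDL):

-- Approximate schema, for illustration only.
CREATE TABLE Tables_1 (
    ID        INT,
    DATE      DATETIME,
    NUM_1     BIGINT,
    NUM_2     BIGINT,
    NB_UNITE  INT,
    TYPUNIT   CHAR(1),
    CODE_1    INT,
    CODE_2    INT
);

CREATE TABLE CODEVAL (
    CODE_1   INT,
    CODE_2   INT,
    CODEVAL  VARCHAR(8)
);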
Sorry for the presentation; I do not know how to make a proper table.