Import only non-existing data to database from CSV

I have created a script that reads data from a CSV file, checks whether the data already exists in the database, and imports it if it does not. If the data does exist (matched by the code of a specific product), the rest of the information needs to be updated from the CSV file.

For example: my CSV file has a member with code WTW-2LT, name Alex and surname Johnson. The script checks whether a member with code WTW-2LT, name Alex and surname Johnson already exists. If it does, the contact details and extra details need to be updated from the file (other details such as subject and lecturer also need to be checked; all details are on one line in the CSV). If it doesn't exist, the new member just has to be created.

This is my script so far, with most other checks left out to avoid distraction:

while ($row = fgetcsv($fp, null, ";")) {
    if ($header === null) {
        $header = $row;
        continue;
    }

    $record = array_combine($header, $row);

    $member = $this->em->getRepository(Member::class)->findOneBy([
        'code' => $record['member_code'],
        'name' => $record['name'],
        'surname' => $record['surname'],
    ]);

    if (!$member) {
        $member = new Member();
        $member->setCode($record['member_code']);
        $member->setName($record['name']);
        $member->setSurname($record['surname']);
    }
    $member->setContactNumber($record['phone']);
    $member->setAddress($record['address']);
    $member->setEmail($record['email']);

    $subject = $this->em->getRepository(Subject::class)->findOneBy([
        'subject_code' => $record['subj_code']
    ]);

    if (!$subject) {
        $subject = new Subject();
        $subject->setCode($record['subj_code']);
    }
    $subject->setTitle($record['subj_title']);
    $subject->setDescription($record['subj_desc']);
    $subject->setLocation($record['subj_loc']);

    $lecturer = $this->em->getRepository(Lecturer::class)->findOneBy([
        'subject' => $subject,
        'name' => $record['lec_name'],
        'code' => $record['lec_code'],
    ]);

    if (!$lecturer) {
        $lecturer = new Lecturer();
        $lecturer->setSubject($subject);
        $lecturer->setName($record['lec_name']);
        $lecturer->setCode($record['lec_code']);
    }
    $lecturer->setEmail($record['lec_email']);
    $lecturer->setContactNumber($record['lec_phone']);

    $member->setLecturer($lecturer);

    $validationErrors = $this->validator->validate($member);
    if (!count($validationErrors)) {
        $this->em->persist($member);
        $this->em->flush();
    } else {
        // ...
    }
}

As you can see, this script has to query the database three times to check whether one CSV line exists. My files have up to 2000+ lines, so performing three queries per line just to check whether it exists is quite time-consuming.

Unfortunately, I also can't import the rows in batches: if one subject doesn't exist yet, it will be created several times before the batch is flushed to the database, and I end up with duplicate records that serve no purpose.

How can I improve performance and speed as much as possible? For example, should I first fetch all records from the database and store them in arrays (memory-consuming?), then do the checks against the arrays and add each new line there?

Can somebody please help me find a way to improve this (with sample code, please)?

Chromoplast answered 18/8, 2017 at 14:31

To be honest, I don't find 2000+ rows with three times that number of queries all that much. But since you are asking about performance, here are my two cents:

Using a framework will always add overhead, meaning that if you write this code in native PHP, it will already run quicker. I'm not familiar with Symfony, but I assume you store your data in a database. In MySQL you can use the statement INSERT ... ON DUPLICATE KEY UPDATE. If you have set the 3 fields (code, name, surname) as a primary key (which I assume), you can use it to insert data, and if the key already exists, update the values in the database instead. MySQL will do the check for you to see whether the data has changed: if not, no disk write will happen.
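A sketch of such an upsert, assuming a members table whose primary key spans (code, name, surname) and whose remaining columns match the CSV fields (all table and column names here are assumptions):

```sql
INSERT INTO members (code, name, surname, phone, address, email)
VALUES ('WTW-2LT', 'Alex', 'Johnson', '555-0100', '1 Some Street', 'alex@example.com')
ON DUPLICATE KEY UPDATE
  phone   = VALUES(phone),
  address = VALUES(address),
  email   = VALUES(email);
```

Rows whose key is new get inserted; rows whose key already exists only have their contact columns updated.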

I'm quite certain you can pass native SQL to Symfony, allowing you to keep the security the framework provides while speeding up your inserts.

Iritis answered 18/8, 2017 at 14:51
Hmm, OK, I guess I just have to get used to using set_time_limit, because the timeouts are mostly the problem. – Chromoplast
If you don't want to change the time limit, you can make a script that converts the lines to individual SQL queries, then save the queries as a txt file and upload it to the server. With PHP you can read X lines from the file, run them on the database and save which line you ran last; next time, continue from that line. It is, however, a lot more work to do it this way, but it balances the server load better. – Iritis

Generally, if you want performance, my best experience has been to dump all the data into the database and then transform it in there using SQL statements. The DBMS will be able to optimize all of your steps this way.

You can import CSV files directly into your MySQL database with the SQL command

LOAD DATA INFILE 'data.csv'
INTO TABLE tmp_import

The command has a lot of options that let you specify your CSV file's format.
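A sketch for the semicolon-separated file from the question, assuming it has a header row and the column order listed (the file path and column names are assumptions):

```sql
LOAD DATA INFILE 'data.csv'
INTO TABLE tmp_import
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(member_code, name, surname, phone, address, email,
 subj_code, subj_title, subj_desc, subj_loc,
 lec_code, lec_name, lec_email, lec_phone);
```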

If your data.csv is a full dump containing all old and new rows then you can just replace your current table with the imported one, after you fixed it up a bit.

For example, it looks like your CSV file (and import table) might look a bit like this:

WTW-2LT, Alex, Johnson, subj_code1, ..., lec_name1, ...
WTW-2LT, Alex, Johnson, subj_code1, ..., lec_name2, ...
WTW-2LT, Alex, Johnson, subj_code2, ..., lec_name3, ...
WTW-2LU, John, Doe,     subj_code3, ..., lec_name4, ...

You could then get the distinct rows via grouping:

SELECT member_code, name, surname
FROM tmp_import
GROUP BY member_code, name, surname

If member_code is a key, you can just GROUP BY member_code in MySQL. The DBMS won't complain, even though I believe it's technically against the standard.

To get the rest of your data you do the same:

SELECT subj_code, subj_title, member_code
FROM tmp_import
GROUP BY subj_code

and

SELECT lec_code, lec_name, subj_code
FROM tmp_import
GROUP BY lec_code

assuming subj_code and lec_code are keys for subjects and lecturers, respectively.

To actually save this result as a table, you can use MySQL's CREATE TABLE ... SELECT syntax, for example

CREATE TABLE tmp_import_members
SELECT member_code, name, surname
FROM tmp_import
GROUP BY member_code, name, surname

You can then do the inserts and updates in two queries:

INSERT INTO members (member_code, name, surname)
SELECT member_code, name, surname
FROM tmp_import_members
WHERE tmp_import_members.member_code NOT IN (
  SELECT member_code FROM members WHERE member_code IS NOT NULL
);

UPDATE members 
JOIN tmp_import_members ON 
  members.member_code = tmp_import_members.member_code
SET 
  members.name = tmp_import_members.name,
  members.surname = tmp_import_members.surname;

and do the same for subjects and lecturers to your liking.
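For instance, the same pattern for subjects might look like this, assuming a tmp_import_subjects table built the same way as tmp_import_members (all table and column names are assumptions based on the CSV fields above):

```sql
INSERT INTO subjects (subj_code, subj_title, subj_desc, subj_loc)
SELECT subj_code, subj_title, subj_desc, subj_loc
FROM tmp_import_subjects
WHERE tmp_import_subjects.subj_code NOT IN (
  SELECT subj_code FROM subjects WHERE subj_code IS NOT NULL
);

UPDATE subjects
JOIN tmp_import_subjects ON
  subjects.subj_code = tmp_import_subjects.subj_code
SET
  subjects.subj_title = tmp_import_subjects.subj_title,
  subjects.subj_desc  = tmp_import_subjects.subj_desc,
  subjects.subj_loc   = tmp_import_subjects.subj_loc;
```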

This all amounts to

  • one bulk import of your CSV file, which should be very fast,
  • 3 temporary tables for your members, subjects and lecturers,
  • 3 insert and 3 update statements (one of each per table),
  • one DROP TABLE per temporary table after you're done.

Again: if your CSV file contains all rows, you could just replace your existing tables and save the 3 inserts and 3 updates.

Make sure that you create indexes on the relevant columns of your temporary tables so that MySQL can speed up the NOT IN and JOIN in the above queries.
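For example (the index name is an assumption, and the table names follow the sketch above):

```sql
CREATE INDEX idx_tmp_members_code ON tmp_import_members (member_code);
```

and likewise for the code columns of the other temporary tables, if MySQL doesn't already have usable indexes on the target tables.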

Christoforo answered 22/8, 2017 at 15:06

You can first run a custom SQL query to get the counts from all three tables:

SELECT
  (SELECT COUNT(*) FROM member WHERE someCondition) as memberCount, 
  (SELECT COUNT(*) FROM subject WHERE someCondition) as subjectCount,
  (SELECT COUNT(*) FROM lecturer WHERE someCondition) as lecturerCount

Then, based on the counts, you can tell whether the data is already present in your tables. You don't have to run queries multiple times to check uniqueness if you go with native SQL.

Check out this link to see how to run custom SQL in Doctrine:

Symfony2 & Doctrine: Create custom SQL-Query

Infrared answered 22/8, 2017 at 20:08

© 2022 - 2024 — McMap. All rights reserved.