Inserting several "new" items into the database with DBIC

I'm working on a bioinformatics project that requires me to read genomic data (nothing too fancy, just think of it as strings) from various organisms and insert it into a database. Each read belongs to one organism and can contain from 5,000 to 50,000 genes, which I need to process and analyze prior to storage.

The script currently doing this is written in Perl and, after all calculations, stores the results in a hash like this:

$new{$id}{gene_name}            = $id;
$new{$id}{gene_database_source} = $gene_database_source;
$new{$id}{product}              = $product;
$new{$id}{sequence}             = $sequence;
$new{$id}{seqlength}            = $seqlength;
$new{$id}{digest}               = $digest;
$new{$id}{mw}                   = $mw;
$new{$id}{iep}                  = $iep;
$new{$id}{tms}                  = $tms;

After all genes are read, the insertions are made by looping through the hash inside an eval {} block.

eval {
    foreach my $id ( keys %new ) {

        my $rs = $schema->resultset('Genes')->create(
            {
                gene_name              => $new{$id}{gene_name},
                gene_product           => $new{$id}{product},
                sequence               => $new{$id}{sequence},
                gene_protein_length    => $new{$id}{seqlength},
                digest                 => $new{$id}{digest},
                gene_isoelectric_point => $new{$id}{iep},
                gene_molecular_weight  => $new{$id}{mw},
                gene_tmd_count         => $new{$id}{tms},
                gene_species           => $species,
                species_code           => $spc,
                user_id                => $tdruserid,
                gene_database_source   => $new{$id}{gene_database_source},
            }
        );
    }
};

While this "works", it has at least two problems I'd like to solve:

  • The eval statement is intended to "failsafe" the insertions: if one of the insertions fails, the eval dies and no insertion is done. This is clearly not how eval works. I'm pretty sure all insertions made up to the failure point will remain in the database, with no rollback whatsoever.

  • The script needs to loop twice through very large datasets (once while reading and creating the hashes, and once again when reading the hashes and performing the insertions). This makes the process's performance rather poor.

Instead of creating the hashes, I'd been thinking of using DBIx::Class's new method, $schema->resultset('Genes')->new({..stuff..}), and then doing one massive insert transaction. That would solve the double iteration, and the eval would either work or not on a single transaction, which would give the expected behaviour of either all insertions or none ... Is there a way to do this?

Jasperjaspers answered 7/7, 2017 at 15:44 Comment(4)
You're the first Bioperl person I've seen on here that's using something as modern as DBIC. Good job! :)Zachery
You can do transactions with DBIC.Zachery
I would not use DBIC for bulk data import. Have your Perl script create a CSV or TSV and use your database's own bulk import functionality.Dewain
If you're still having speed issues after following the advice in Simbabque's excellent answer, feel free to post again, I for one am glad to see "Bioperl" coders around :)Groupie

You can create your massive transaction by using a TxnScopeGuard in DBIC. In the most basic form, that would be as follows.

eval { # or try from Try::Tiny
    my $guard = $schema->txn_scope_guard;

    foreach my $id ( keys %new ) {
        my $rs = $schema->resultset('Genes')->create(
            {
                gene_name              => $new{$id}{gene_name},
                gene_product           => $new{$id}{product},
                sequence               => $new{$id}{sequence},
                gene_protein_length    => $new{$id}{seqlength},
                digest                 => $new{$id}{digest},
                gene_isoelectric_point => $new{$id}{iep},
                gene_molecular_weight  => $new{$id}{mw},
                gene_tmd_count         => $new{$id}{tms},
                gene_species           => $species,
                species_code           => $spc,
                user_id                => $tdruserid,
                gene_database_source   => $new{$id}{gene_database_source}

            }
        );
    }
    $guard->commit;
};

You create a scope guard object, and when you're done setting up your transaction, you commit it. If the guard goes out of scope before that, e.g. because something died, it will roll back the transaction automatically.

The eval can catch the die, so your program will not crash. You had that part correct, but you're also right that your code will not undo previous inserts. Note that Try::Tiny's try provides nicer syntax, but it's not needed here.
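
Just for illustration, here is a rough, untested sketch of the same transaction wrapped in Try::Tiny's try/catch instead of a bare eval. The catch block and its warning message are my own additions, not part of your code; everything else reuses the variables and column names from above.

use Try::Tiny;

try {
    my $guard = $schema->txn_scope_guard;

    foreach my $id ( keys %new ) {
        $schema->resultset('Genes')->create(
            {
                gene_name              => $new{$id}{gene_name},
                gene_product           => $new{$id}{product},
                sequence               => $new{$id}{sequence},
                gene_protein_length    => $new{$id}{seqlength},
                digest                 => $new{$id}{digest},
                gene_isoelectric_point => $new{$id}{iep},
                gene_molecular_weight  => $new{$id}{mw},
                gene_tmd_count         => $new{$id}{tms},
                gene_species           => $species,
                species_code           => $spc,
                user_id                => $tdruserid,
                gene_database_source   => $new{$id}{gene_database_source},
            }
        );
    }

    $guard->commit;
}
catch {
    # by the time we get here the guard has already rolled the transaction back
    warn "Import failed, nothing was committed: $_";
};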

A transaction in this case means that the INSERT statements are all sent to the database, but none of them become permanent until the transaction is committed. If anything fails in between, everything is rolled back.

Note that this will still issue one INSERT statement per row!
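
If you want to see for yourself what DBIC actually sends to the database, you can turn on its query trace; the script name below is only a placeholder.

# From the shell (load_genes.pl is just a placeholder for your script name):
#   DBIC_TRACE=1 perl load_genes.pl
#
# or from inside the script, before the inserts run:
$schema->storage->debug(1);   # prints each statement and its bind values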

If you want to instead create larger INSERT statements, like the following, you need populate, not new.

INSERT INTO foo (bar, baz) VALUES
(1, 1),
(2, 2),
(3, 3),
...

The populate method lets you pass in an array reference with multiple rows at one time. This is supposed to be way faster than inserting one at a time.

$schema->resultset("Artist")->populate([
  [ qw( artistid name ) ],
  [ 100, 'A Formally Unknown Singer' ],
  [ 101, 'A singer that jumped the shark two albums ago' ],
  [ 102, 'An actually cool singer' ],
]);

Translated to your loop, that would be as follows. Note that the documentation claims that it's faster if you run it in void context.

eval {
    $schema->resultset('Genes')->populate(
        [
            [
                qw(
                    gene_name             gene_product   sequence
                    gene_protein_length   digest         gene_isoelectric_point
                    gene_molecular_weight gene_tmd_count gene_species
                    species_code          user_id        gene_database_source
                )
            ],
            map {
                [
                    $new{$_}{gene_name}, $new{$_}{product},
                    $new{$_}{sequence},  $new{$_}{seqlength},
                    $new{$_}{digest},    $new{$_}{iep},
                    $new{$_}{mw},        $new{$_}{tms},
                    $species,            $spc,
                    $tdruserid,          $new{$_}{gene_database_source},
                ]
            } keys %new
        ],
    );
};

Done like this, the scope guard is not needed, since there is only a single statement. However, I would advise you not to put more than about 1000 rows into one statement. Processing the data in chunks might be a good idea for performance reasons. In that case, you'd loop over the keys 1000 at a time. List::MoreUtils has a nice natatime function for that.

use List::MoreUtils 'natatime';

eval {
    my $guard = $schema->txn_scope_guard;

    my $it = natatime 1_000, keys %new;

    while ( my @keys = $it->() ) {
        $schema->resultset('Genes')->populate(
            [
                [
                    qw(
                        gene_name             gene_product   sequence
                        gene_protein_length   digest         gene_isoelectric_point
                        gene_molecular_weight gene_tmd_count gene_species
                        species_code          user_id        gene_database_source
                        )
                ],
                map {
                    [
                        $new{$_}{gene_name}, $new{$_}{product},
                        $new{$_}{sequence},  $new{$_}{seqlength},
                        $new{$_}{digest},    $new{$_}{iep},
                        $new{$_}{mw},        $new{$_}{tms},
                        $species,            $spc,
                        $tdruserid,          $new{$_}{gene_database_source},
                    ]
                } @keys
            ],
        );
    }

    $guard->commit;
};

Now it will do 1000 rows per insertion, and run all those queries in one big transaction. If one of them fails, none will be done.

The script needs to loop twice through very large datasets (once while reading and creating the hashes, and once again when reading the hashes and performing the insertions). This makes the process's performance rather poor.

You're not showing how you create the data, besides this assignment.

$new{$id}{gene_name}            = $id;
$new{$id}{gene_database_source} = $gene_database_source;
$new{$id}{product}              = $product;

If that's all there is to it, nothing is stopping you from using the approach I've shown above directly at the point where you process the data for the first time, instead of building the hash. The following code is incomplete, because you're not telling us where the data is coming from, but you should get the gist.

eval {
    my $guard = $schema->txn_scope_guard;

    # column order for populate; the rows we collect below must match it
    my @columns = qw(
        gene_name             gene_product   sequence
        gene_protein_length   digest         gene_isoelectric_point
        gene_molecular_weight gene_tmd_count gene_species
        species_code          user_id        gene_database_source
    );

    # we use this to collect rows to process
    my @rows;

    # this is where your data comes in
    while ( my $foo = <DATA> ) {

        # here you process the data and come up with your variables
        my ( $id, $gene_database_source, $product, $sequence, $seqlength,
             $digest, $mw, $iep, $tms );

        # collect the row so we can insert it later (same order as @columns)
        push @rows,
            [
                $id,  $product, $sequence, $seqlength, $digest,
                $iep, $mw,      $tms,      $species,   $spc,
                $tdruserid, $gene_database_source,
            ];

        # only insert if we reached the limit
        if ( scalar @rows == 1000 ) {
            $schema->resultset('Genes')->populate( [ \@columns, @rows ] );

            # empty the list of values
            @rows = ();
        }
    }

    # don't forget the rows left over after the last full chunk
    $schema->resultset('Genes')->populate( [ \@columns, @rows ] ) if @rows;

    $guard->commit;
};

Essentially we collect up to 1000 rows directly as array references while we process them, and when we've reached the limit, we pass them to the database. We then reset our row array and start over. After the loop, any rows left over from the last partial chunk are inserted as well. Again, all of this is wrapped in a transaction, so it will only be committed if all the inserts are fine.


There is more information on transactions in DBIC in the cookbook.
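
Those docs also cover txn_do, which wraps a code reference in a transaction for you, so you don't have to manage a guard object yourself. Here is a rough, untested sketch of the chunked populate using txn_do; it reuses the variables from above, and the warning in the catch block is only illustrative.

use List::MoreUtils 'natatime';
use Try::Tiny;

try {
    # txn_do() starts a transaction, commits on success, and rolls back
    # (re-throwing the error) if the code reference dies
    $schema->txn_do(sub {
        my $it = natatime 1_000, keys %new;

        while ( my @keys = $it->() ) {
            $schema->resultset('Genes')->populate(
                [
                    [
                        qw(
                            gene_name             gene_product   sequence
                            gene_protein_length   digest         gene_isoelectric_point
                            gene_molecular_weight gene_tmd_count gene_species
                            species_code          user_id        gene_database_source
                        )
                    ],
                    map {
                        [
                            $new{$_}{gene_name}, $new{$_}{product},
                            $new{$_}{sequence},  $new{$_}{seqlength},
                            $new{$_}{digest},    $new{$_}{iep},
                            $new{$_}{mw},        $new{$_}{tms},
                            $species,            $spc,
                            $tdruserid,          $new{$_}{gene_database_source},
                        ]
                    } @keys
                ],
            );
        }
    });
}
catch {
    warn "Import failed and was rolled back: $_";
};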

Please note that I have not tested any of this code.

Zachery answered 7/7, 2017 at 17:15 Comment(1)
@LionelUranLandaburu no problem. But keep Sinan's advice from above in mind. A bulk import might make more sense.Zachery
