You can create your massive transaction by using a TxnScopeGuard
in DBIC. In the most basic form, that would be as follows.
eval {    # or try from Try::Tiny
    my $guard = $schema->txn_scope_guard;

    foreach my $id ( keys %new ) {
        $schema->resultset('Genes')->create(
            {
                gene_name              => $new{$id}{gene_name},
                gene_product           => $new{$id}{product},
                sequence               => $new{$id}{sequence},
                gene_protein_length    => $new{$id}{seqlength},
                digest                 => $new{$id}{digest},
                gene_isoelectric_point => $new{$id}{iep},
                gene_molecular_weight  => $new{$id}{mw},
                gene_tmd_count         => $new{$id}{tms},
                gene_species           => $species,
                species_code           => $spc,
                user_id                => $tdruserid,
                gene_database_source   => $new{$id}{gene_database_source},
            }
        );
    }

    $guard->commit;
};
You create a scope guard object, and when you're done setting up your transaction, you commit it. If the object goes out of scope, e.g. because something died, it will roll back the transaction automatically.
The eval can catch the die, and your program will not crash. You had that part correct, but you're also right that your code will not undo previous inserts. Note that Try::Tiny's try provides nicer syntax, but it's not needed here.
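For completeness, here is a minimal sketch of the same pattern with Try::Tiny; the warn in the catch block is just an example of what you might do with the error.

use Try::Tiny;

try {
    my $guard = $schema->txn_scope_guard;

    # ... create your rows here, as above ...

    $guard->commit;
}
catch {
    # the guard has already gone out of scope and rolled back by now;
    # the error is in $_
    warn "Could not insert genes: $_";
};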
A transaction in this case means that all queries are collected and run at the same time. Note that this will still only insert one row per INSERT statement!
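In other words, the loop above ends up sending one statement per row, along these lines (using a made-up table foo for illustration):

INSERT INTO foo (bar, baz) VALUES (1, 1);
INSERT INTO foo (bar, baz) VALUES (2, 2);
INSERT INTO foo (bar, baz) VALUES (3, 3);
...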
If you instead want to create larger INSERT statements, like the following, you need populate, not create.
INSERT INTO foo (bar, baz) VALUES
(1, 1),
(2, 2),
(3, 3),
...
The populate method lets you pass in an array reference containing multiple rows at once. This is supposed to be considerably faster than inserting them one at a time.
$schema->resultset("Artist")->populate([
[ qw( artistid name ) ],
[ 100, 'A Formally Unknown Singer' ],
[ 101, 'A singer that jumped the shark two albums ago' ],
[ 102, 'An actually cool singer' ],
]);
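It also accepts an array reference of hash references, which looks closer to the create call above. A minimal equivalent of the same example:

$schema->resultset("Artist")->populate([
    { artistid => 100, name => 'A Formally Unknown Singer' },
    { artistid => 101, name => 'A singer that jumped the shark two albums ago' },
    { artistid => 102, name => 'An actually cool singer' },
]);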
Translated to your loop, that would be as follows. Note that the documentation claims that it's faster if you run it in void context.
eval {
    $schema->resultset('Genes')->populate(
        [
            [
                qw(
                    gene_name gene_product sequence
                    gene_protein_length digest gene_isoelectric_point
                    gene_molecular_weight gene_tmd_count gene_species
                    species_code user_id gene_database_source
                )
            ],
            map {
                [
                    $new{$_}{gene_name}, $new{$_}{product},
                    $new{$_}{sequence},  $new{$_}{seqlength},
                    $new{$_}{digest},    $new{$_}{iep},
                    $new{$_}{mw},        $new{$_}{tms},
                    $species,            $spc,
                    $tdruserid,          $new{$_}{gene_database_source},
                ]
            } keys %new
        ],
    );
};
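To expand on the void context remark: the documentation says that in list context populate is simply a wrapper around create, building one row object per entry, which defeats the purpose here. Call it as a plain statement, as above. Roughly:

# void context: bulk insert, no row objects are built (fast)
$schema->resultset('Artist')->populate(
    [ [ qw( artistid name ) ], [ 103, 'Another singer' ] ]
);

# list context: one create() per row, returns the new row objects (slower)
my @artists = $schema->resultset('Artist')->populate(
    [ [ qw( artistid name ) ], [ 104, 'Yet another singer' ] ]
);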
With a single populate call like the one above, the scope guard is not needed. However, I would advise you not to do more than about 1000 rows per statement. Processing the data in chunks is a good idea for performance reasons. In that case, you'd loop over the keys 1000 at a time. List::MoreUtils has a nice natatime function for that.
use List::MoreUtils 'natatime';

eval {
    my $guard = $schema->txn_scope_guard;

    my $it = natatime 1_000, keys %new;
    while ( my @keys = $it->() ) {
        $schema->resultset('Genes')->populate(
            [
                [
                    qw(
                        gene_name gene_product sequence
                        gene_protein_length digest gene_isoelectric_point
                        gene_molecular_weight gene_tmd_count gene_species
                        species_code user_id gene_database_source
                    )
                ],
                map {
                    [
                        $new{$_}{gene_name}, $new{$_}{product},
                        $new{$_}{sequence},  $new{$_}{seqlength},
                        $new{$_}{digest},    $new{$_}{iep},
                        $new{$_}{mw},        $new{$_}{tms},
                        $species,            $spc,
                        $tdruserid,          $new{$_}{gene_database_source},
                    ]
                } @keys
            ],
        );
    }

    $guard->commit;
};
Now it will insert 1000 rows per statement, and run all of those queries in one big transaction. If one of them fails, none of them will be committed.
You mention that the script needs to loop twice through very large datasets (once while reading and creating the hashes, and again when reading the hashes and performing the insertions), and that this makes the process's performance rather poor.
You're not showing how you create the data, besides this assignment.
$new{$id}{gene_name} = $id;
$new{$id}{gene_database_source} = $gene_database_source;
$new{$id}{product} = $product;
If that's all there is to it, nothing is stopping you from using the approach I've shown above directly in the first pass, where you're currently processing the data and building the hash. That way you can skip the intermediate hash entirely. The following code is incomplete, because you're not telling us where the data comes from, but you should get the gist.
eval {
    my $guard = $schema->txn_scope_guard;

    # the column order has to match the order of the values in each row below
    my @columns = qw(
        gene_name gene_product sequence
        gene_protein_length digest gene_isoelectric_point
        gene_molecular_weight gene_tmd_count gene_species
        species_code user_id gene_database_source
    );

    # we use this to collect rows to process
    my @rows;

    # this is where your data comes in
    while ( my $foo = <DATA> ) {

        # here you process the data and come up with your variables
        my ( $id, $gene_database_source, $product, $sequence, $seqlength,
            $digest, $mw, $iep, $tms );

        # collect the row so we can insert it later,
        # in the same order as @columns above
        push @rows,
            [
                $id,      $product, $sequence,  $seqlength,
                $digest,  $iep,     $mw,        $tms,
                $species, $spc,     $tdruserid, $gene_database_source,
            ];

        # only insert once we have reached the limit
        if ( @rows == 1000 ) {
            $schema->resultset('Genes')->populate( [ \@columns, @rows ] );

            # empty the list of collected rows
            @rows = ();
        }
    }

    # insert whatever is left over after the loop (fewer than 1000 rows)
    $schema->resultset('Genes')->populate( [ \@columns, @rows ] ) if @rows;

    $guard->commit;
};
Essentially we collect up to 1000 rows directly as array references while we process them, and when we've reached the limit, we pass them to the database. We then reset our row array and start over. After the loop, any leftover rows (fewer than 1000) are inserted as well. Again, all of this is wrapped in a transaction, so it will only be committed if all the inserts are fine.
There is more information on transactions in DBIC in the cookbook.
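The cookbook mostly uses txn_do, which wraps a code reference in a transaction and commits or rolls back for you. As a minimal sketch, the populate calls above could live inside such a block instead of using the scope guard (assuming the same @columns and @rows as in the last example):

$schema->txn_do(
    sub {
        # everything in here runs in one transaction; if the code dies,
        # txn_do rolls back and then rethrows the error
        $schema->resultset('Genes')->populate( [ \@columns, @rows ] );
    }
);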
Please note that I have not tested any of this code.