SQLite Optimization for Millions of Entries? [closed]

I'm trying to tackle a problem using a SQLite database and Perl modules. In the end, there will be tens of millions of entries I need to log. The only unique identifier for each item is a text string: the URL. I'm thinking of doing this in two ways:

Way #1: Have a good table, a bad table, and an unsorted table. (I need to check the HTML and decide whether I want it.) Say we have 1 billion pages total, 333 million URLs in each table. When I have a new URL to add, I need to check whether it's in any of the tables, and add it to the unsorted table if it's unique. Also, I would be moving a lot of rows around with this option.

Way #2: I have 2 tables, Master and Good. Master has all 1 billion page URLs, and Good has the 333 million that I want. For a new URL, I need to do the same thing, except this time I am only querying one table, and I would never delete a row from Master, only add the data to Good.

So basically, I need to know the best setup to quickly query a huge SQLite database to see whether a text string of ~20 characters is already there, and add it if it isn't.

Edit: I'm now trying to get Berkeley DB to work using the Perl module, but no dice. Here's what I have:

use BerkeleyDB;

$dbFolder = 'C:\somedirectory';
my $env = BerkeleyDB::Env->new ( -Home => $dbFolder );

my $db  = BerkeleyDB::Hash->new (
-Filename => "fred.db", 
-Env => $env );
my $status = $db->db_put("apple", "red");

And when I run this, I get the following:

Can't call method "db_put" on an undefined value at C:\Directory\perlfile.pl line 42, <STDIN> line 1.
Rhamnaceous answered 27/9, 2010 at 19:27 Comment(5)
I'm not a DB expert (so I may be underestimating SQLite), but are you sure that SQLite is the best solution for that much data?Melinite
I've heard from plenty of people that SQLite is great with tons of data. #3161487Rhamnaceous
The best SQLite optimisation for huge datasets is not to use SQLite. I know from experience that SQLite will slow down significantly on inserts after a while, especially if you have indexes that are constantly recalculated. The theoretical limits of SQLite are well beyond its practical limits.Aldarcie
I think that a DBM like BerkeleyDB is going to be faster for this application. Use your string as the key, and serialize all your other data with YAML or Storable.Unselfish
Actually, I've been trying to get BerkeleyDB to work for some time now. I can't get it to create a single file. I'll post what I have, maybe you can help me out.Rhamnaceous

If $db is undefined, opening the database is failing, and you should inspect $! and $BerkeleyDB::Error to see why.

Have you created the database already? If not, you need -Flags => DB_CREATE.

Working example:

use strict;
use warnings;
use BerkeleyDB;

my $dbFolder = '/home/ysth/bdbtmp/';

# open (and create, if necessary) an on-disk hash; no environment needed
my $db  = BerkeleyDB::Hash->new (
    -Filename => "$dbFolder/fred.db",
    -Flags    => DB_CREATE,               # create the file if it doesn't exist yet
) or die "couldn't create: $!, $BerkeleyDB::Error.\n";

my $status = $db->db_put("apple", "red"); # store key "apple" => value "red"; returns 0 on success

I couldn't get BerkeleyDB::Env to do anything useful, though; whatever I tried, the constructor returned undef.
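
For the original check-then-add question, a db_get before the db_put is enough to test whether a URL has already been seen. Here's a minimal sketch along those lines (the file path and the "unsorted" status value are only placeholders):

use strict;
use warnings;
use BerkeleyDB;

my $db = BerkeleyDB::Hash->new (
    -Filename => "/home/ysth/bdbtmp/urls.db",
    -Flags    => DB_CREATE,
) or die "couldn't open: $!, $BerkeleyDB::Error.\n";

my $url = "http://example.com/page";

my $value;
if ($db->db_get($url, $value) == 0) {        # db_get returns 0 if the key exists
    print "already seen: $url\n";
} else {
    $db->db_put($url, "unsorted") == 0       # store it with a placeholder status
        or warn "put failed: $BerkeleyDB::Error\n";
}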

Berte answered 28/9, 2010 at 5:42 Comment(6)
I added DB_CREATE and checked the $! after creating the environment, but all it says is "No such file or directory." Would you mind just dropping me a working sample for me to dissect? All I need to do is start a hash on disk, add items to it, and check for existing items.Rhamnaceous
@Sho Minamimoto: added an exampleBerte
Got it working. I think the problem was that I put "fred.db" instead of a full path, but in the docs it says the DB should be made wherever the Env is anyway. Oh well. Thanks for the help!Rhamnaceous
One last thing, with millions of entries, how would I go about sorting by values? Like if I had some key URLs with the value of '0' and some with the value of '1', how would I only get URLs with the '0' value? Would this be fast?Rhamnaceous
@Sho Minamimoto: no, having to go through the bad ones would definitely slow down just getting the good ones. If you need that to be as fast as possible, you're better off with good, bad, and unsorted files (or maybe good, all, and unsorted? not sure how you are wanting to use this.)Berte
Well, I've got a list of 10 million URLs, and code to parse it, but what happens when the code breaks at any point in time? I can't be positive that all pages are in the same state at any given time (some might be unsorted, some good, some bad, and there will always be new URLs), and I don't want to start from #1 again. I think 3 Berkeley DBs will still keep me at O(c) because of 3 queries to 3 hashes per URL.Rhamnaceous

I'd be inclined to use a hash instead of SQLite to do what you want to do. A hash is optimized to test for existence without the need to keep the values in any sorted order and with no need to keep a redundant copy of the datum in an index. The hash algorithm applied to the datum yields the location where it would be stored, if it did exist; you can seek to that location and see if it's there. I don't think you'd need to keep the hash table in RAM.

Here's how you might take a hybrid hash/SQLite approach.

Create a SQLite table

CREATE TABLE STORE (
    id     INTEGER PRIMARY KEY,
    BUCKET INTEGER,     -- indexed
    URL    TEXT,        -- not indexed
    status TEXT
);
CREATE INDEX store_bucket ON STORE (BUCKET);

You could have three of these tables, STORE1, STORE2, and STORE3 if you want to keep them separate by status.

Let's assume that there will be 250,000,001 distinct buckets in each store. (You can experiment with this number; make it a prime number).

Find a hashing algorithm that takes two inputs, the URL string and 250,000,001, and returns a number between 1 and 250,000,001.

When you get a URL, feed it to the hashing algorithm and it will tell you which BUCKET to look in:

SELECT * FROM STORE WHERE BUCKET = {the value returned by your hash function};

Your index on the BUCKET field will quickly return the rows, and you can examine the URLs. If the current URL is not one of them, add it:

INSERT INTO STORE (BUCKET, URL) VALUES ({your hash return value}, {the URL});

SQLite will be indexing integer values, which I think will be more efficient than indexing the URL. And the URL will be stored only once.
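
A rough sketch of the lookup-then-insert flow in Perl, assuming DBI with DBD::SQLite and Digest::MD5 as the hashing algorithm (the database file name, example URL, and "unsorted" status value are only illustrative):

use strict;
use warnings;
use DBI;
use Digest::MD5 qw(md5_hex);

my $buckets = 250_000_001;    # bucket count from above; experiment with it
my $dbh = DBI->connect("dbi:SQLite:dbname=store.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

# map a URL to a bucket number between 1 and $buckets
sub bucket_for {
    my ($url) = @_;
    return (hex(substr(md5_hex($url), 0, 8)) % $buckets) + 1;
}

my $url    = "http://example.com/some/page";
my $bucket = bucket_for($url);

# pull only the URLs already stored in this bucket and compare in Perl
my $rows = $dbh->selectcol_arrayref(
    "SELECT URL FROM STORE WHERE BUCKET = ?", undef, $bucket);

unless (grep { $_ eq $url } @$rows) {
    $dbh->do("INSERT INTO STORE (BUCKET, URL, status) VALUES (?, ?, ?)",
             undef, $bucket, $url, "unsorted");
}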

Plains answered 27/9, 2010 at 19:44 Comment(8)
Err, indexing integer values may be more efficient, but only because indexing by a string presumably does the equivalent of your bucket calculation, only more efficiently. This is a general observation and may not apply to SQLite, but I bet it does.Berte
Please elaborate: why would SQLite's sorting of the URL for insertion into a btree be more efficient than a hashing algorithm that calculates a bucket? Secondly, your criticism doesn't address the duplication of the URL value, which can be avoided by the approach I suggested. I stand by my original recommendation: a hash is better than a btree because a hash won't degrade with insertion -- no btree to keep balanced. This hybrid approach won't be as efficient as a pure hash approach; I just suggested it as possibly better than indexing the URL column, assuming the OP had to use a relational database like SQLite.Plains
On the first point, I'm not being argumentative -- I'm asking why the calculation of the URL's hash-bucket would be less efficient than the binary algorithm to decide where to place the URL in a btree when there are a billion URLs (or 333 million of them if the set is broken into separate tables). My assumption would be that at some point the hash-calc would surpass the binary calc in efficiency because hash-calc wouldn't involve disk reads.Plains
I don't know that it is more efficient, but am guessing that you are trading one btree for the indexed url for additional work on the client side, plus two btrees (one for the primary key and one for the bucket index) in sqlite and don't see how that could be an improvement. It seems like you are expecting sqlite to do something significantly different with integer indexes? I am not so assuming.Berte
Re avoiding duplication of the URL value, I'm not sure what you mean. I'm assuming the url would be the primary key, so there's clearly no duplication.Berte
@ysth: With SQLite everything is "client-side", but I take your point -- client & server could be on different boxes with another DB. Wouldn't the URL be stored once in the primary table and once again in the index if you make it the PK? My assumption is that SQLite would handle integers more efficiently than strings of up to 255 characters; I assume URLs would require more RAM. I'd expect integers to take 8 bytes and strings of up to 255 characters to take significantly more. But your point that I'd need two btrees is well taken. Still, I'd prefer a real hash solution to a simulated one or a btree.Plains
I meant client of the sql library. The question says these are ~20 characters (!). Somebody is scraping a url-shortener, maybe?Berte
Well, thanks to you guys for figuring this out; maybe some other soul will be forced to use SQLite to solve this later, but I found Berkeley DB to be better suited because I'm going to be making a lot of queries, and I can match exact keys.Rhamnaceous

I don't know if this is optimal, but you could set up your SQLite DB such that the "good" table has a unique constraint on the URL column. You probably don't have enough RAM to do the comparisons in Perl (the naive solution would be to create a hash where the URLs are the keys, but if you have a billion pages you'll need an awful lot of memory).

When it comes time to do an insert, the database will enforce uniqueness and throw some kind of error when it tries to insert a duplicate URL. You can catch this error and ignore it, as long as DBI returns different error values for different error messages.
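
A minimal sketch of that approach with DBI and DBD::SQLite (the table and column names are made up, and the error-message match may vary between SQLite versions); SQLite's INSERT OR IGNORE sidesteps the error handling entirely:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=urls.db", "", "",
                       { RaiseError => 1, PrintError => 0 });

# the unique constraint on the URL is what rejects duplicates
$dbh->do("CREATE TABLE IF NOT EXISTS good (url TEXT PRIMARY KEY)");

my $url = "http://example.com/page";

eval {
    $dbh->do("INSERT INTO good (url) VALUES (?)", undef, $url);
};
if (my $err = $@) {
    # ignore only the uniqueness violation, rethrow anything else
    die $err unless $err =~ /UNIQUE constraint failed|not unique/;
}

# or, without the eval:
# $dbh->do("INSERT OR IGNORE INTO good (url) VALUES (?)", undef, $url);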

Variorum answered 27/9, 2010 at 19:43 Comment(0)