Efficient way to ensure unique rows in SQLite3

I am using SQLite3 in one of my projects and I need to ensure that the rows that are inserted into a table are unique with regard to a combination of some of their columns. In most cases the rows inserted will differ in that respect, but in case of a match the new row must update/replace the existing one.

The obvious solution was to use a composite primary key, with a conflict clause to handle collisions. Therefore this:

CREATE TABLE Event (Id INTEGER, Fld0 TEXT, Fld1 INTEGER, Fld2 TEXT, Fld3 TEXT, Fld4 TEXT, Fld5 TEXT, Fld6 TEXT);

became this:

CREATE TABLE Event (Id INTEGER, Fld0 TEXT, Fld1 INTEGER, Fld2 TEXT, Fld3 TEXT, Fld4 TEXT, Fld5 TEXT, Fld6 TEXT, PRIMARY KEY (Fld0, Fld2, Fld3) ON CONFLICT REPLACE);

This does indeed enforce the uniqueness constraint as I need it to. Unfortunately, this change also incurs a performance penalty that is way beyond what I expected. I did a few tests using the sqlite3 command line utility to ensure that there is not a fault in the rest of my code. The tests involve entering 100,000 rows, either in a single transaction, in 10 transactions of 10,000 rows, or in 100 transactions of 1,000 rows each. I got the following results:

                                | 1 * 100,000   | 10 * 10,000   | 100 * 1,000   |
                                |---------------|---------------|---------------|
                                | Time  | CPU   | Time  | CPU   | Time  | CPU   |
                                | (sec) | (%)   | (sec) | (%)   | (sec) | (%)   |
--------------------------------|-------|-------|-------|-------|-------|-------|
No primary key                  | 2.33  | 80    | 3.73  | 50    | 15.1  | 15    |
--------------------------------|-------|-------|-------|-------|-------|-------|
Primary key: Fld3               | 5.19  | 84    | 23.6  | 21    | 226.2 | 3     |
--------------------------------|-------|-------|-------|-------|-------|-------|
Primary key: Fld2, Fld3         | 5.11  | 88    | 24.6  | 22    | 258.8 | 3     |
--------------------------------|-------|-------|-------|-------|-------|-------|
Primary key: Fld0, Fld2, Fld3   | 5.38  | 87    | 23.8  | 23    | 232.3 | 3     |

My application currently performs transactions of at most 1,000 rows and I was surprised by the 15-fold drop in performance. I expected at most a 3-fold drop in throughput and a rise in CPU usage, as seen in the single-transaction (1 * 100,000) case. I guess the indexing involved in maintaining the primary key constraint requires a significantly larger number of synchronous DB operations, thus making my hard drives the bottleneck in this case.

Using WAL mode does have some effect - a performance increase of about 15%. Unfortunately that is not enough on its own. PRAGMA synchronous = NORMAL did not seem to have any effect.

I might be able to recover some performance by increasing the transaction size, but I'd rather not do that, due to the increased memory usage and concerns about responsiveness and reliability.

The text fields in each row have variable lengths of about 250 bytes on average. The query performance does not matter too much, but the insert performance is very important. My application code is in C and is (supposed to be) portable to at least Linux and Windows.

Is there a way to improve the insert performance without increasing the transaction size? Either some setting in SQLite (anything but permanently forcing the DB into asynchronous operation, that is) or programmatically in my application code? For example, is there a way to ensure row uniqueness without using an index?

BOUNTY:

By using the hashing/indexing method described in my own answer, I managed to somewhat moderate the performance drop to a point where it's probably acceptable for my application. It seems, however, that as the number of rows in the table increases, the presence of the index makes inserts slower and slower.

I am interested in any technique or fine-tuning setting that will increase performance in this particular use case, as long as it does not involve hacking the SQLite3 code or otherwise cause the project to become unmaintainable.

Volturno answered 3/3, 2011 at 14:20 Comment(0)

I have used SQLite to insert millions of rows at runtime, and this is what I did to increase performance:

  • Use as few transactions as possible.
  • Use parameterized commands for inserting the data (prepare the command once and just change parameter values in the loop) - see the sketch after this list.
  • Set PRAGMA synchronous = OFF (not sure how it interacts with WAL).
  • Increase the page size of the database.
  • Increase the cache size. This is an important setting, as it causes SQLite to write the data to disk fewer times and to run more operations in memory, making the whole process faster.
  • If you need an index, add it after inserting the rows by running the necessary SQLite command. In this case you will need to ensure uniqueness yourself, as you are doing now.
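
A minimal C sketch of points 2-5, assuming the Event schema from the question; the bulk_insert wrapper and bound values are placeholders, the pragma values are examples rather than recommendations, and error handling is trimmed:

#include <sqlite3.h>

/* Sketch only: prepared-statement reuse plus the tuning pragmas listed above.
   Note that page_size only takes effect on a new or freshly VACUUMed database,
   and synchronous=OFF trades durability for speed. */
int bulk_insert(sqlite3 *db, int nrows)
{
    sqlite3_stmt *stmt = NULL;

    sqlite3_exec(db, "PRAGMA synchronous = OFF", 0, 0, 0);
    sqlite3_exec(db, "PRAGMA page_size = 8192", 0, 0, 0);
    sqlite3_exec(db, "PRAGMA cache_size = 10000", 0, 0, 0);

    /* Prepare once, bind many times. */
    if (sqlite3_prepare_v2(db,
            "INSERT INTO Event (Id, Fld0, Fld1, Fld2, Fld3, Fld4, Fld5, Fld6) "
            "VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8)", -1, &stmt, 0) != SQLITE_OK)
        return SQLITE_ERROR;

    sqlite3_exec(db, "BEGIN", 0, 0, 0);          /* as few transactions as possible */
    for (int i = 0; i < nrows; i++) {
        sqlite3_bind_int (stmt, 1, i);
        sqlite3_bind_text(stmt, 2, "fld0-value", -1, SQLITE_TRANSIENT);
        sqlite3_bind_int (stmt, 3, 42);
        sqlite3_bind_text(stmt, 4, "fld2-value", -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 5, "fld3-value", -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 6, "fld4-value", -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 7, "fld5-value", -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 8, "fld6-value", -1, SQLITE_TRANSIENT);

        if (sqlite3_step(stmt) != SQLITE_DONE)
            break;
        sqlite3_reset(stmt);                     /* reuse the compiled statement */
    }
    sqlite3_exec(db, "COMMIT", 0, 0, 0);
    sqlite3_finalize(stmt);
    return SQLITE_OK;
}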

If you try these please post your test results. I believe it will be interesting for everyone.

Carolacarolan answered 22/3, 2011 at 6:34 Comment(10)
In order: Fewer transactions: Already done, within reason. Parameterized statements: is there another way? :-). Asynchronous mode: No can do - too much risk of DB corruption. Larger pages: Already done, it does indeed increase performance somewhat. Larger cache: Already tested, it actually slows things down a bit. It only helps with reads, anyway, not with writes. Index after the fact: No can do - the whole thing will be constantly updated. Any more ideas? Every little bit helps...Volturno
@thkala: It's strange that increasing the cache is not helping. In my case I could see an increase in memory usage and no file activity while the memory used was less than the cache size. In your case there might be something (maybe the index) causing sqlite to ignore the cache while inserting the records. Can you use a process monitor to check if there is any IO activity going on while it should be using the cache?Carolacarolan
@Giorgi: As long as the index pages fit in the cache, increasing its size should not matter: I only do inserts and the atomicity constraints already force a few fsync() (i.e. disk writes) operations per transaction. The OS cache already helps with reads, anyway...Volturno
@thkala: One more idea: You can use an SQLite in-memory database so all operations are performed in RAM, and then write all changes to the disk. Here are some links: Saving to disk an in-memory database and Synchronizing sqlite database from memory to file Also have a look at this question about sqlite performance: #1712131Carolacarolan
@Giorgi: I cannot use an in-memory DB, for the same reasons that I cannot increase the transaction size too much: memory usage and extensive data loss in case of abnormal program termination. If not for these, I'd simply bunch 100k inserts instead of 2k for a rather significant performance gain.Volturno
+1 for a nice set of SQLite performance tuning suggestions, anyway :-)Volturno
@thkala: I see, but 2k inserts into an in-memory DB and then flushing it to disk could result in better performance. At least you could give it a try and measure how it performs.Carolacarolan
How would I enforce the uniqueness constraint with an intermediate in-memory DB, though? I cannot load the whole DB in memory, and using a single "flush" statement would imply going back to a composite primary key for the on-disk DB...Volturno
Your comments for tuning are very useful. I tried similar approaches suggested in this question (#6239876) and they did make my work a lot easier.Matriarch
How on Earth does this answer the OP question?Rudolph

The ON CONFLICT REPLACE clause will make SQLite delete existing rows, then insert new rows. That means that SQLite is probably going to spend some of its time

  • deleting existing rows
  • updating the indexes
  • inserting new rows
  • updating the indexes

That's my take on it, based on SQLite documentation and reading about other database management systems. I didn't look at the source code.

SQLite has two ways of expressing uniqueness constraints: PRIMARY KEY and UNIQUE. Both of them create an index, though.
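
For what it's worth, a sketch of the same rule expressed as a UNIQUE table constraint instead of a composite PRIMARY KEY (schema taken from the question; the create_event_table wrapper is just for illustration). SQLite creates an implicit index in both cases, so the insert-time cost should be comparable:

#include <sqlite3.h>

/* The composite key moved into a UNIQUE constraint; an implicit index is
   still created to enforce it. */
static int create_event_table(sqlite3 *db)
{
    return sqlite3_exec(db,
        "CREATE TABLE Event (Id INTEGER, Fld0 TEXT, Fld1 INTEGER, Fld2 TEXT,"
        " Fld3 TEXT, Fld4 TEXT, Fld5 TEXT, Fld6 TEXT,"
        " UNIQUE (Fld0, Fld2, Fld3) ON CONFLICT REPLACE)",
        0, 0, 0);
}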

Now the really important stuff . . .

It's great that you did tests. Most developers don't do that. But I think your test results are badly misleading.

In your case, it doesn't matter how fast you can insert rows into a table that doesn't have a primary key. A table that doesn't have a primary key doesn't satisfy your basic requirements for data integrity. That means you can't rely on your database to give you the right answers.

If it doesn't have to give the right answers, I can make it really, really fast.

To get a meaningful timing for inserting into a table that has no key, you need to either

  • run code before inserting new data to make sure you don't violate the undeclared primary key constraint, and to make sure you update existing rows with the right values (instead of inserting), or
  • run code after inserting into that table to clean up duplicates on (Fld0, Fld2, Fld3), and to reconcile conflicts

And, of course, the time those processes take has to be taken into account, too.

FWIW, I did a test by running 100K SQL insert statements into your schema in transactions of 1000 statements, and it only took 30 seconds. A single transaction of 1000 insert statements, which seems to be what you expect in production, took 149 msec.

Maybe you can speed things up by inserting into an unkeyed temporary table, then updating the keyed table from that.
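
A sketch of how that staging idea might look in C (the Event_stage name and flush_batch wrapper are made up, and error handling is omitted):

#include <sqlite3.h>

/* Collect a batch in an unindexed temp table, then fold it into the keyed
   Event table in a single statement. */
static void flush_batch(sqlite3 *db)
{
    sqlite3_exec(db,
        "CREATE TEMP TABLE IF NOT EXISTS Event_stage ("
        " Id INTEGER, Fld0 TEXT, Fld1 INTEGER, Fld2 TEXT, Fld3 TEXT,"
        " Fld4 TEXT, Fld5 TEXT, Fld6 TEXT)", 0, 0, 0);

    /* ... the batch of new rows is INSERTed into Event_stage here,
       with no index to maintain ... */

    sqlite3_exec(db,
        "BEGIN;"
        "INSERT OR REPLACE INTO Event (Id, Fld0, Fld1, Fld2, Fld3, Fld4, Fld5, Fld6)"
        "  SELECT Id, Fld0, Fld1, Fld2, Fld3, Fld4, Fld5, Fld6 FROM Event_stage;"
        "DELETE FROM Event_stage;"
        "COMMIT;", 0, 0, 0);
}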

Synchroflash answered 7/3, 2011 at 2:21 Comment(2)
Actually my testing did make sense: The original requirements for the project (and all the performance testing) were that the DB would store all rows. This changed halfway through (not my fault, I swear :-) ), hence my attempt to find a way to handle this change without essentially starting from scratch, which is what my alternative solutions would mean.Volturno
ON CONFLICT IGNORE could be an enhancement, if the data must not be strictly replaced.Disobedience

(I don't normally answer my own questions, but I would like to document a few ideas/partial solutions for this.)

The main problem with a composite primary key is the way the indexes are handled. Composite keys imply an index on the composite value, which in my case means indexing strings. While comparing string values isn't that slow, indexing a value with a length of, say, 500 bytes means that the B-tree nodes in the index can fit far fewer row/node pointers than a B-tree that indexes a 64-bit integer value. This means loading far more DB pages for each index search, as the height of the B-tree increases.

In order to deal with this issue I modified my code so that:

  • It uses WAL mode. The performance increase was certainly worth such a small change, since I do not have any issues with the DB file not being self-contained.

  • I used the MurmurHash3 hash function - after re-writing it in C and adapting it - to produce a single 32-bit hash value from the values of the fields that would form the key. I stored this hash in a new indexed column. Since this is an integer value, the index is quite fast. This is the only index for this table. Since there are going to be at most 10,000,000 rows in the table, hash collisions will not be a performance concern - although I cannot really consider the hash value to be UNIQUE, the index will only return a single row in the general case.

At this point there are two alternatives that I have coded and are currently undergoing testing:

  • DELETE FROM Event WHERE Hash=? AND Fld0=? AND Fld2=? AND Fld3=?, followed by an INSERT.

  • UPDATE Event SET Fld1=?,... WHERE Hash=? AND Fld0=? AND Fld2=? AND Fld3=?, followed by an INSERT if no rows were updated.

I expect the second alternative to be faster, but I will have to complete the testing first. In any case, it seems that with these changes the performance drop (as compared to the original index-less table) has been lessened to a factor of 5 or so, which is far more manageable.
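
For reference, a sketch of how the second alternative fits together in C, assuming the new indexed Hash column, two statements prepared once elsewhere (their texts are given in the code comments), and a murmur3_32() helper standing in for the adapted MurmurHash3 routine; the names are illustrative, not the exact code:

#include <sqlite3.h>
#include <stdint.h>

/* upd: "UPDATE Event SET Fld1=?5, ... WHERE Hash=?1 AND Fld0=?2 AND Fld2=?3 AND Fld3=?4"
   ins: "INSERT INTO Event (Hash, Fld0, Fld2, Fld3, Fld1, ...) VALUES (?1, ?2, ?3, ?4, ?5, ...)" */
extern uint32_t murmur3_32(const char *fld0, const char *fld2, const char *fld3);

int upsert_event(sqlite3 *db, sqlite3_stmt *upd, sqlite3_stmt *ins,
                 const char *fld0, const char *fld2, const char *fld3)
{
    uint32_t hash = murmur3_32(fld0, fld2, fld3);

    sqlite3_bind_int (upd, 1, (int)hash);    /* stored as a signed 32-bit INTEGER */
    sqlite3_bind_text(upd, 2, fld0, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(upd, 3, fld2, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(upd, 4, fld3, -1, SQLITE_TRANSIENT);
    /* ... bind Fld1 and the other non-key fields (?5 onward) ... */
    if (sqlite3_step(upd) != SQLITE_DONE) { sqlite3_reset(upd); return SQLITE_ERROR; }
    sqlite3_reset(upd);

    if (sqlite3_changes(db) == 0) {          /* no existing row matched: INSERT */
        sqlite3_bind_int (ins, 1, (int)hash);
        sqlite3_bind_text(ins, 2, fld0, -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(ins, 3, fld2, -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(ins, 4, fld3, -1, SQLITE_TRANSIENT);
        /* ... bind Fld1 and the other non-key fields (?5 onward) ... */
        if (sqlite3_step(ins) != SQLITE_DONE) { sqlite3_reset(ins); return SQLITE_ERROR; }
        sqlite3_reset(ins);
    }
    return SQLITE_OK;
}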

EDIT:

At this point I have settled on the second variation, which is indeed slightly faster. It seems, however, that any kind of index slows down SQLite3 dramatically as the indexed table gets larger. Increasing the DB page size to 8192 bytes seems to help somewhat, but not nearly as drastically as I would like.

Volturno answered 7/3, 2011 at 11:5 Comment(0)

Case When Exists((Select ID From Table Where Fld0 = value0 and Fld2 = value1 and Fld3 = value2)) Then
    --Insert Statement
End

I'm not 100% sure that the insert works like that in SQLite, but I think it should. This, with proper indexing on the WHERE fields, should be rather quick. However, this is two transactions, which is something to consider.
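
As the comments below note, CASE in SQLite is an expression rather than a statement, so the snippet above will not run as-is. A hedged sketch of the insert-only-if-absent half of the idea uses INSERT ... SELECT ... WHERE NOT EXISTS (the insert_if_absent_sql name is illustrative; replacing an existing row on a match would still need an UPDATE branch or the question's conflict clause):

#include <sqlite3.h>

/* Insert the row only if no row with the same (Fld0, Fld2, Fld3) exists.
   Bind the eight values and step the statement as usual; ?2, ?4 and ?5 are
   reused inside the NOT EXISTS check. */
static const char *insert_if_absent_sql =
    "INSERT INTO Event (Id, Fld0, Fld1, Fld2, Fld3, Fld4, Fld5, Fld6) "
    "SELECT ?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8 "
    "WHERE NOT EXISTS (SELECT 1 FROM Event WHERE Fld0 = ?2 AND Fld2 = ?4 AND Fld3 = ?5)";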

Floatstone answered 3/3, 2011 at 15:16 Comment(2)
From what I can find CASE in SQLite is an expression, not a statement. I have not been able to use it like this. Do you have an actual SQL snippet that I could try?Volturno
@Volturno I'll have to look into it.Floatstone

In addition to all the other great answers, one thing that you can do is partition data into several tables.

SQLite INSERTs get slower and slower as the number of rows increases, but if you can split a table into several smaller ones that effect is diminished (e.g. "names" -> "names_a", "names_b", ... for names starting with a given letter). Later on, you can do CREATE VIEW "names" AS SELECT * FROM "names_a" UNION SELECT * FROM "names_b" UNION ....
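
A sketch of how that might look for the "names" example (the names_a/names_b split, the first-letter routing, and the helper names are illustrative only):

#include <sqlite3.h>
#include <stdio.h>

/* Two partitions plus a view that presents them as one table. */
static void create_partitions(sqlite3 *db)
{
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS names_a (name TEXT PRIMARY KEY);"
        "CREATE TABLE IF NOT EXISTS names_b (name TEXT PRIMARY KEY);"
        "CREATE VIEW IF NOT EXISTS names AS"
        "  SELECT * FROM names_a UNION SELECT * FROM names_b",
        0, 0, 0);
}

/* Each INSERT only touches one partition, so each index stays smaller. */
static void insert_name(sqlite3 *db, const char *name)
{
    char sql[128];
    const char *table = (name[0] <= 'a') ? "names_a" : "names_b";   /* crude routing */
    snprintf(sql, sizeof sql, "INSERT OR REPLACE INTO %s (name) VALUES (?1)", table);

    sqlite3_stmt *stmt = NULL;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, 0) == SQLITE_OK) {
        sqlite3_bind_text(stmt, 1, name, -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_finalize(stmt);
    }
}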

Parnassus answered 13/10, 2013 at 22:37 Comment(2)
How do you calculate the (optimal) partition size (i.e. the number of rows in a partition)?Labyrinthodont
Most partitioning strategies I've seen (across multiple databases) partition at convenient points in key space, not so as to keep the sizes of partitions strictly balanced.Toaster
