Does sqlite3 compress data?
Asked Answered
Y

3

40

I've got an 7.4Gb csv file. After converting it to a sqlite database with a python script the output DB is 4.7Gb, around 60% of the original size.

The csv has around 150,000,000 rows. It has header:

tkey,ipaddr,healthtime,numconnections,policystatus,activityflag

And each row looks something like

261846,172.10.28.15,2012-02-03 16:15:00,22,1,1

The script uses healthtime to split the data into tables 192 tables

When I first saw these numbers, I assumed I had made an error someplace. How big of a reduction in file size should I expect from the added efficiency of only writing the healthtime 192 times instead of 150,000,000 times?

EDIT: As soon as I posted this I realized the answer. I'm removing about 40% of the string, hence the 40% reduction in size.

Edit 2 Let's calculate the difference in size between the plain text:

"261846,172.10.28.15,2012-02-03 16:15:00,22,1,1" 

And the data base entry:

db(261846,'172.10.28.15',22,1,1)

First of all, we drop from 46 to 26 characters in plain text representation.

The remaining characters are:

"261846,172.10.28.15,22,1,1"

or 26 bytes. If each integer needs to be stored in 32 bit (4 bytes), then we have:

12 bytes (ipaddr) + 4 bytes * 4 (integer fields) = 28 bytes.

So it looks like converting to integers makes the storage slightly less efficient, and all of my gains come form reducing the number of characters stored in each row.

Yand answered 30/5, 2012 at 21:12 Comment(3)
Obviously sqlite is a binary format, while csv is text. 60% size difference may be caused by that.Etter
Sorry, what is healthtime? A quick google search didn't turn anything up.Downward
healthtime is just the name of the field.Yand
D
45

SQLite is not running a compression algorithm, but it will store data in a binary file instead of a text file. Which means that the data can be stored more efficiently, for example using a 32-bit (4 byte) number to represent 10,000,000 instead of storing it as 8 bytes of text (or more if the file is unicode).

Here are more details on the SQL Database File Format if you are interested.

Does that make sense?

Dew answered 30/5, 2012 at 21:18 Comment(2)
It is worth pointing out that SQLite only uses as many bytes as is necessary. For example the value 7 only needs one byte.Retiary
The value 7 needs two bytes: a serial type varint which would be one byte long in this case, and the value which would be one byte in this case. The serial type is present for every value in the database, it's one byte for nulls and numbers, and is a variable length up to 9 bytes for blobs and text. So, for example, a 32-bit long integer takes 5 bytes, a 64-bit float is 9 bytes as in both those cases it has a one-byte serial type.Ephrayim
A
23

SQLite, by default, does not compress data it writes to the disk; however, SQLite does have a set of "Proprietary Extensions" for that and other purposes. Look for ZIPVFS in the links as follows.

http://www.sqlite.org/support.html and http://www.hwaci.com/sw/sqlite/prosupport.html

You can achieve a lot of "compression" in your data by encoding fields as integers. For example an IP address was engineered to fit into a word (4 bytes). Each octect of the address may be represented in one byte of a word.

string[] octets = '172.168.0.1'.split('.')
int ip = atoi(octets[0]) << 24
ip |= atoi(octets[1]) << 16
ip |= atoi(octets[2]) << 8
ip |= atoi(octets[3])

Additionally, your timestamp may be represented in Unix time, which is the number of seconds since the epoch.

UPDATE mytable SET healthtime = CAST(strftime('%s',healthtime) AS INTEGER);

See the Date and Time functions

Note the CAST directive in the above SQL: SQLite does not enforce type on a column, so you may have a group of digits stored as a string; increasing your field size more than necessary (this will also make certain queries behave oddly).

One more thing: field size is not the only part of the story. Remember that indexes take up a space too, and indexes on integers are more efficient --in terms of disk size and performance.

Amphistylar answered 4/12, 2013 at 1:40 Comment(0)
S
4

I have a rather big SQLite file in my GIT repository and I was wondering why my total repository size was not growing that much but instead, was even smaller than my SQLite-.db-file. Turns out, GIT compresses the repository by default. A quick check on my .db-file also indicates this, since zipping the .db-file resulted in a zip-archive which was only ~20% the size of the .db-file.

So, at least with the default settings, it doesn't look like SQLite is storing the data in a compressed way. However, regardless of that, it might be fine to add a big SQLite file to a GIT repository, since GIT performs compression automatically.

Scourge answered 28/1, 2020 at 11:26 Comment(3)
Storing binary files like SQLite databases in Git is not a great idea if you expect them to change a lot over time. Git will try its best to compress files, and multiple versions of the same file, but if the binary format changes (Say, due to a VACUUM, or a change in an index), the compression algorithm is likely to miss that and your repo size will grow. It's probably better to store your data in a plaintext format (CSV, TSV, yaml, etc.) and check that in, then automate converting that to a database for efficient use.Unconstitutional
@CodyCasterline YMMV, while this may be true for storing binary such as JPGs with varying color filters applied; it gets a little more complicated for repetitive data repository files like SQLite databases and even text based CSV files. If the process that generates the CSV data does not sort the output then you're likely to have the same JPG problem; similarly if the db file is just an ordered array of structs written to a binary file in pages, then *.db -text diff merge=binary could potentially be a good option... (although a custom object diff'r or LFS could be even better)Unemployable
I would strongly advice against committing large amounts of binary data into a git repository, unless you are certain that you will never want to remove it. Put it in a separate repository and add it as a submodule, if you must store data in a git repositoryEzar

© 2022 - 2024 — McMap. All rights reserved.