Compressed SQLite database and indexing
I read that it is possible to compress SQLite databases with extensions like sqlite-zstd.

In SQLite3, are there methods, ready to use in Python, that allow both:

  • compressing a text column (say 1 billion rows, with at least one text column of < 20 characters)
  • keeping the fast lookup we get when this column has an INDEX (and even full-text search, e.g. LIKE 'foo%')?

I was about to write some code with LZ4-compressed rows, but then a single search/lookup would require a full scan (decompressing every value to check for a match).
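
For reference, a minimal sketch of that naive approach (zlib from the standard library stands in for LZ4, and the schema is only for illustration), showing how the lookup degenerates into a full scan:

```python
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, description BLOB)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [(i, zlib.compress(f"item-{i}".encode())) for i in range(1000)],
)

# No index can serve a prefix match on compressed bytes, so every row has
# to be read and decompressed just to be tested against the prefix.
matches = []
for (blob,) in conn.execute("SELECT description FROM data"):
    text = zlib.decompress(blob).decode()
    if text.startswith("item-42"):
        matches.append(text)
print(matches)  # item-42, item-420 ... item-429
```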

Are there SQLite techniques adapted to this (or other data structures)?

Windsucking answered 15/10, 2022 at 10:13 Comment(11)
There is the ZIPVFS extension which allows for reading and writing to compressed databases.Operculum
@MarkBenningfield Would you have a Python code example? It would be great! (Does it work on Windows?)Windsucking
Could you elaborate, please: 1) why do you need fast lookup for 1 billion rows compressed with LZ4 in SQLite? 2) a B-tree index is useless for LIKE "%foo%" queries. There are other mechanisms for this, such as the trigram extension in Postgres.Harmon
@Harmon It's a rather general question; doing a fast lookup in big databases with many rows has been a common requirement in many applications I've worked on over the past decade. Good to know about this extension! Do you think it is available in SQLite?Windsucking
@Basj, the key thought is: in practice, the size of an SQLite database is only limited by how big a file you are comfortable keeping. I'd call it bad form, but 30-50 gigabyte SQLite files are possible, for example. LIKE "%foo%" is also never about large databases or fast searches. That's why Postgres has full-text search and trigrams, and why Elastic and Manticore exist. SQLite has nothing like this out of the box; it is too dumb for this kind of optimization.Harmon
@Harmon You're right, let's drop the LIKE '%foo%' FTS requirement and only keep the LIKE 'foo%' prefix search, for which a B-tree is probably ok. Would you have an example of compressed SQLite code, in Python? Would be very interesting :)Windsucking
I'd say for LIKE 'foo%' a standard index is enough.Harmon
@Harmon Yes, this would be ok to start with. Would you know how to enable compression in a Sqlite + Windows + Python context? :)Windsucking
The answers at https://mcmap.net/q/396893/-does-sqlite3-compress-data give an approach on how to enable compression!Ambrotype
Could you provide more details so we can give good suggestions? 1) Why are you using SQLite in the first place? What features of it do you need? 2) What columns have indexes? 3) Are there other columns besides the text? 4) Do your text columns have high cardinality? 5) Do you need full-text search ("%foo%") or just prefix search ("foo%")?Harmon
@Harmon 1) I use SQLite for simplicity and a single user/worker, so no need for a bigger client/server DB 2) CREATE TABLE data (id INTEGER PRIMARY KEY, description TEXT); CREATE UNIQUE INDEX idx ON data(description); 3) no other column except an ID for this simple example, but in real code there might be another integer or text column (without index) 4) I want to be ready for 1 billion rows; the average length of description is 20 characters, always < 32. 5) only prefix search is ok to start with ("foo%")Windsucking
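
To make the comments concrete, here is a minimal sketch of the schema above with a prefix query the planner can serve from the index. Note that with SQLite's default case-insensitive LIKE, a plain text index is not used for LIKE 'foo%'; enabling case_sensitive_like (or rewriting the query as a >=/< range) lets the B-tree do the prefix scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA case_sensitive_like = ON")  # lets LIKE 'foo%' use the index
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, description TEXT)")
conn.execute("CREATE UNIQUE INDEX idx ON data(description)")
conn.executemany(
    "INSERT INTO data(description) VALUES (?)",
    [("foobar",), ("foobaz",), ("quux",)],
)

print(conn.execute(
    "SELECT description FROM data WHERE description LIKE 'foo%'"
).fetchall())  # [('foobar',), ('foobaz',)]
print(conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT description FROM data WHERE description LIKE 'foo%'"
).fetchall())  # should mention: SEARCH ... USING COVERING INDEX idx
```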

SQLite doesn't compress its data by default, but there are some proprietary SQLite extensions that can do the required work. ZIPVFS is one such extension/add-on that allows you to read and write compressed data using zlib or any other application-supplied compression and decompression functions.
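
ZIPVFS itself is proprietary and has no ready-made Python binding, so no drop-in example can be given here. As a rough sketch of the same "application-supplied compression functions" idea at the SQL level (not ZIPVFS itself), sqlite3's user-defined functions can wrap zlib:

```python
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
# Register application-supplied compress/uncompress as SQL functions.
conn.create_function("compress", 1, lambda s: zlib.compress(s.encode()))
conn.create_function("uncompress", 1, lambda b: zlib.decompress(b).decode())

conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, description BLOB)")
conn.execute("INSERT INTO data VALUES (1, compress('hello world'))")
print(conn.execute("SELECT uncompress(description) FROM data").fetchone())
```

Note that, like any per-value compression, this keeps the column opaque to the B-tree, so it does not answer the indexed-prefix-search part of the question; ZIPVFS works below the B-tree, at the page level, which is why it can keep indexes usable.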

Panther answered 24/10, 2022 at 20:28 Comment(0)

The simplest solution is a trie of separately compressed SQLite or Parquet files, but this is a common enough problem that there is likely a file format built for it.

The biggest objective is to maintain data locality across prefixes while the data is changing and unsorted. If you just compress a single SQLite db, then every insert forces the compression engine to accommodate two shifts in the data: one at the end of the data table, and the other possibly in the middle of the B-tree itself.
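
A rough sketch of that sharding idea (the two-character shard key, the file layout, and using the text itself as the key are all assumptions; the per-shard compression step, e.g. sqlite-zstd or filesystem compression, is left out):

```python
import sqlite3
from pathlib import Path

SHARD_DIR = Path("shards")
SHARD_DIR.mkdir(exist_ok=True)

def shard_for(text: str) -> sqlite3.Connection:
    # One small SQLite file per 2-character prefix, so each shard can be
    # compressed independently without disturbing its neighbours.
    conn = sqlite3.connect(SHARD_DIR / f"{text[:2]}.db")
    conn.execute("CREATE TABLE IF NOT EXISTS data (description TEXT PRIMARY KEY)")
    return conn

def insert(text: str) -> None:
    with shard_for(text) as conn:
        conn.execute("INSERT OR IGNORE INTO data VALUES (?)", (text,))

def prefix_search(prefix: str) -> list:
    # Only the shard holding this prefix is opened; inside it, the
    # primary-key B-tree serves the range scan. (Naive successor string;
    # fine for ASCII keys and prefixes of at least 2 characters.)
    hi = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    with shard_for(prefix) as conn:
        return [row[0] for row in conn.execute(
            "SELECT description FROM data "
            "WHERE description >= ? AND description < ?",
            (prefix, hi),
        )]

insert("foobar")
insert("foobaz")
print(prefix_search("foo"))  # ['foobar', 'foobaz']
```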

In your proposed table, why do you have a shadow primary key? What is it needed for? It seems clear that your text is the primary key, so why not use it as such?
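
For example (a sketch using SQLite's WITHOUT ROWID tables, which store the rows directly in the primary-key B-tree, so the text key needs neither a shadow rowid nor a second index):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The text itself is the key: no INTEGER PRIMARY KEY, no separate index.
conn.execute("CREATE TABLE data (description TEXT PRIMARY KEY) WITHOUT ROWID")
conn.executemany("INSERT INTO data VALUES (?)", [("foobar",), ("foobaz",)])
print(conn.execute(
    "SELECT description FROM data "
    "WHERE description >= 'foo' AND description < 'fop'"
).fetchall())  # [('foobar',), ('foobaz',)]
```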

Harmon answered 25/10, 2022 at 12:30 Comment(0)
