Full text search on encrypted data

Asked 9/3, 2014 at 18:55 Answered 27/6, 2024 at 13:32

Suppose I have a server storing encrypted text (end-to-end: server never sees plain text).

I want to be able to do full text search on that text.
I know this is tricky, but my idea is to use the traditional full text design ("list" and "match" tables where words are stored and matched with ids from the content table). When users submit the encrypted text, they also send a salted MD5 of the words and respective matches. The salt used is unique for each user and is recovered from their password.
(in short: the only difference is that the "list" table will contain hashed words)

Now, how vulnerable would this system be?
Note that I said "how vulnerable" instead of "how safe", because I acknowledge that it can't be totally safe.
I DO understand the tradeoff between features (full text search) and security (disclosing some information from the word index). For example, I understand that an attacker able to get the list and match tables could get information about the original, encrypted text and possibly be able to decipher some words with statistical analysis (however, being the salt unique for each user, this would need to be repeated for each user).

How serious would this threat be? And would there be any other serious threats?

DISCLAIMER
What I'm trying to build (and with the help of a cryptographer for actual implementation; right now I'm just trying to understand wether this will be possible) is a consumer-grade product which will deal with confidential yet not totally secret data.
My goal is just to provide something safe enough, so that it would be easier for an attacker to try stealing users' passwords (e.g. breaching into clients - they're consumers, eventually) rather than spending a huge amount of time and computing power trying to brute force the index or run complicated statistical analysis.

Comments in response to @Matthew

(may be relevant for anyone else answering)

As you noted, other solutions are not viable. Storing all the data inside the client means that users cannot access their data from other clients. Server-side encryption would work, but then we won't be able to give users the added security of client-side encryption.
The only "true alternative" is just to not implement search: while this is not a required feature, it's very important to me/us.
The salt will be protected in the exactly same way as the users' decryption key (the one used to decrypt stored texts). Thus, if someone was able to capture the salt, he or she would likely be able to capture also the key, creating a much bigger issue.
To be precise, the key and the salt will be stored encrypted on the server. They will be decrypted by the client locally with the user's password and kept in memory; the server never sees the decrypted key and salt. Users can change passwords, then, and they just need to re-encrypt the key and the salt, and not all stored texts. This is a pretty standard approach in the industry, to my knowledge.
Actually, the design of the database will be as follow (reporting relevant entries only). This design is like you proposed in your comment. It disallows proximity searches (not very relevant to us) and makes frequency less accurate.
- Table content, containing all encrypted texts. Columns are content.id and content.text.
- Table words, containing the list of all hashes. Columns are words.id and words.hash.
- Table match, that matches texts with hashes/words (in a one-to-many relationship). Columns are match.content_id and match.word_id.
We would have to implement features like removing stopwords etc. Sure. That is not a big issue (will, of course, be done on the client). Eventually, those lists have always been of limited utility for international (i.e. non English-speaking) users.
We expect the lookup/insert ratio to be pretty high (i.e. many lookups, but rare inserts and mostly in bulk).
Decrypting the whole hash database is certainly possible, but requires a brute force attack.
Suppose the salt is kept safe (as per point 2 above). If the salt is long enough (you cited 32 bits... but why not 320? - just an example) that would take A LOT of time.

To conclude... You confirmed my doubts about the possible risk of frequency analysis. However, I feel like this risk is not so high. Can you confirm that?
Indeed, first of all the salt would be unique per each user. This means that one user must be attacked at time.
Second, by reporting words only once per text (no matter how many times they appear), frequency analysis becomes less reliable.
Third... Frequency analysis on hashed words doesn't sound as something as good as frequency analysis on a Caesar-shift, for example. There are 250,000 words in English alone (and, again, not all our users will be English-speaking), and even if some words are more common than others, I believe it'd be hard to do this attack anyway.

PS: The data we'll be storing is messages, like instant messages. These are short, contain a lot of abbreviations, slang, etc. And every person has a different style in writing texts, further reducing the risk (in my opinion) of frequency attacks.

Gambell answered 9/3, 2014 at 18:55 Comment(10)

Would you make a hash for each word in the text? Would the salt be secret? It will not take long to calculate a MD5 hash for every single english word (less than 250000 words). Thats 0.1 seconds of calculation on my graphic card. – Nonah 9/3, 2014 at 19:8

@EbbeM.Pedersen that's a fair critique. The salt would be generated from the user's password (same way as the text's encryption key) and would be unique for each user. An attacker stealing the salt would require to steal the password (so everything falls down) or hack his/her client. However, if you believe this is not enough, I'm more than open to suggestions/comments (as this is the purpose of my question). – Gambell 9/3, 2014 at 19:22

@EbbeM.Pedersen and yes, it's roughly one (salted) hash for each word in the text. For example, for the word "dog" (with salt "blah") the client would send MD5("dogblah") as hashed word to store in the database. Either MD5 or SHA1 or SHA256 or similar. – Gambell 9/3, 2014 at 19:26

@72DFBF5BA0DF5BE9 I do understand how hashes work. I don't need to revert them. Are you familiar with how to implement a full-text search engine? You need two tables, generally: a "list" and a "match". Inside the "list", instead of plain words, I'd stored hashed ones. – Gambell 11/3, 2014 at 20:45

instead of vulnerability, I would question this system's usability or being useful at all, what's the use of storing words hashed in DB? How to retrieve it? I think there is a design problem, can you explain what you are trying to achieve? – Maintopmast 11/3, 2014 at 20:45

@72DFBF5BA0DF5BE9 Don't worry about users changing passwords. I'm already aware of that, and we have a solution (not using the password as key, but retrieving the key from the password). It's pretty standard – Gambell 11/3, 2014 at 20:46

You say: "Inside list instead of plain words, I'd stored hashed ones", OK, what's use of it and why? What it's protecting actually? I mean, in a lit which as you say: "list" and a "match", when you hash the list, it's not going to retrievable. – Maintopmast 11/3, 2014 at 20:48

@72DFBF5BA0DF5BE9 Suppose my text (id#1) is "hello world". The client will send "hello world" encrypted, and will also send hash(hello) and hash(world). The server will store the words in the table and match them against text#1. To search for the word "hello", then, the client will send "search for hash(hello)". Server will look for hash(hello) in the "list" table and will match it with text#1. The response will be the encrypted "hello world" text, which the client will decrypt. This is exactly like a "normal" FTS. – Gambell 11/3, 2014 at 20:52

Please also note that we'll build a "simplified" FTS engine. For example, we will not implement partial word match (wouldn't be possible with hashed words), not fancy features like "NEAR", etc. But without this FTS engine, it wouldn't be possible to search inside encrypted text. – Gambell 11/3, 2014 at 20:58

Your Problem sounds like the problem Signal had. This may be a good read: esl.cs.brown.edu/blog/signal ("[...] an encrypted index is created on the message column of the Signal message table. The column is parsed to find the set of all possible keywords and an encrypted index EDB is created that maps keywords to the IDs of the messages that contain them.") – Razid 29/11, 2020 at 16:48

TL;DR: If this needs to be secure enough that it requires per-user end-to-end encryption: Don't do it.

Too long for a comment, so here goes - if I understand correctly:

You have encrypted data submitted by the user (client side encrypted, so not using the DB to handle).
You want this to be searchable to the user (without you knowing anything about this - so an encrypted block of text is useless).
Your proposed solution to this is to also store a list (or perhaps paragraph) of hashed words submitted from the client as well.

So the data record would look like:

Column 1: Encrypted data block
Column 2: (space) delimited hashed, ordered, individual words from the above encrypted text

Then to search you just hash the search terms and treat the hashed terms as words to search the paragraph(s) of "text" in column 2. This will definitely work - just consider searching nonsense text with nonsense search terms. You would even still be able to do some proximity ranking of terms with this approach.

Concerns:

The column with the individually hashed words as text will incredibly weak in comparison to the encrypted text. You are greatly weakening your solution as not only are there limited words to work from, the resultant text will be susceptible to word frequency analysis, etc.
If you do this: separately store a salt unrelated to the password. Given that a rainbow table will be easy to create if your salt is captured (dictionary words only) store it encrypted somewhere.
You will lose many benefits of FTS like ignoring words like 'the' - you will need to re-implement this functionality on your own if you want it (i.e. remove these terms on the client side before submitting the data / search terms).

Other approaches that you imply are not acceptable/workable:

Implement searching client side (all data has to exist on the client to search)
Centralized encryption leveraging the databases built in functionality

I understand the argument being that your approach provides the user with the only access to their data (i.e. you cannot see/decrypt it). I would argue that this hashed approach weakens the data sufficiently that you could reasonably work out a users data (that is, you have lowered the effort required to the point that it is very plausible you could decrypt a user's information without any knowledge of their keys/salts). I wouldn't quite lower the bar to describe this as just obfuscation, but you should really think through how significant this is.

If you are sure that weakening your system to implement searching like this makes sense, and the another approach is not sufficient, one thing that could help is to store the hashes of words in the text as a list of uniquely occuring words only (i.e. no frequency or proximity information would be available). This would reduce the attack surface area of your implementation a little, but would also lose the benefits you are implying you want by describing the approach as FTS. You could get very fast results like this though as the hashed words essentially become tags attached to all the records that include them. The search lookup then could become very fast (at the expense of your inserts).

*Just to be clear - I would want to be REALLY sure my business needs demanded something like this before I implemented it...

EDIT:

Quick example of the issues - say I know you are using 32-bit salts and are hashing common words like "the". 2^32 possible salts = 4 billion possible salts (that is, not that many if you only need to hash a handful of words for the initial attack). Assume the salt is appended or prepended, this still is only 8 billion entries to pre-calculate. Even if it is less common words you do not need to create too many lists to ensure you will get hits (if this is not the case your data would not be worth searching).

Then lookup the highest frequency salts for a given block of text in our each of our pre-calculated salt tables and use the match to see if it correctly decrypts other words in the text. Once you have a plausible candidate generate the 250,000 word English language rainbow table for that salt and decrypt the text.

I would guess you could decrypt the hashed data in the system in hours to days with access to the database.

Puma answered 11/3, 2014 at 21:35 Comment(9)

Edited with an example of the problem you would face. – Puma 11/3, 2014 at 21:55

Thank you @Puma for your very reasoned answer. I had so many comments that I had to edit my original post. Please see the updated question! – Gambell 11/3, 2014 at 22:26

@Qualcuno My point is you do not need to attack one user at a time, you can attack 1 (or a couple) high frequency words and use that to figure out the users salt. A long salt will not be enough to make this attack vector unreasonable due to the nature of your data (high frequency dictionary words). Once I know the salt for a user I generate the comparatively small rainbow table for each user. – Puma 11/3, 2014 at 23:6

I get what you mean. Given that the index will exclude incredibly common words (such as "the", "is", etc), and given that words will be logged only one time per each text, are there other measures we can use to limit the chance of frequency attacks? – Gambell 11/3, 2014 at 23:52

At some point your lower 'high frequency' words will still be the problem. You will still get some information on frequency by the number of entries referencing the word even if not the number of times per record it exists. All the things you are suggesting will help certainly, but I still don't think you are going to raise the cost of an attack as high as you think... Client side searching or indexing as suggested by erickson are likely your best solutions. – Puma 12/3, 2014 at 0:5

Those solutions, however, are not feasible. Downloading and re-uploading an index which is a few MBs, not to say keeping that in memory (client may be a JS app, and browsers have limited memory - especially on mobile) is not an option. – Gambell 12/3, 2014 at 1:44

Could you be able to quantify "some point"? How many messages in the database? Also, how would it sound if instead of one salt, each users has 2 or 3, randomly used for every insert? (in this case, searching for one word would just be searching for all the 3 possible hashes) – Gambell 12/3, 2014 at 1:48

let us continue this discussion in chat – Puma 12/3, 2014 at 2:19

Added more to the chat. – Puma 12/3, 2014 at 2:39

First, you have all of the normal vulnerabilities of password-based cryptography, which stem from users picking predictable passwords. It is common to crack more than 50% of passwords from real-world applications in offline attacks with less than two hours of desktop computing time.

I assume the full text encryption key is derived from the password, or is encrypted by a password-derived key. So an attacker can test guesses against a selection of hashed index keys, and once she finds the password, decrypt all of the documents.

But, even if a user picks a high-entropy password, frequency analysis on the index could potentially reveal a lot about the plain text. Although word order is lost in indexing (if you don't support proximity searches), you are essentially creating an electronic code book for each user. This index would be vulnerable to centuries of well-developed cryptanalytical techniques. Modern encryption protocols avoid ECB, and provide "ciphertext indistinguishability"—the same plain text yields different cipher text each time it's encrypted. But that doesn't work with indexes.

A less vulnerable approach would be to index and search on the client. The necessary tables would be bundled as a single message and encrypted on the client, then transported to the server for storage. The obvious tradeoff is the cost of transmission of that bundle on each session. Client-side caching of index fragments could mitigate this cost somewhat.

In the end, only you can weigh the security cost of a breach against the performance costs of client-side indexing. But the statistical analysis enabled by an index is a significant vulnerability.

Aparri answered 11/3, 2014 at 22:20 Comment(4)

Thanks erickson. Please read my addendum to the question, in reply to @Matthew, as I believe it answers some of your points too. About the vulnerabilities of password-based cryptography... You're right, but I can't protect users against their own stupidity (we'll force some criteria for passwords, etc). Eventually, if the password is stolen, data would be compromised with or without cryptography. I would consider this, thus, as a non-related issue to implement search (or even encryption). – Gambell 11/3, 2014 at 22:31

Again, then, you recall the issue of frequency analysis. I understand it, but I wonder how SERIOUS it is. To crack the index, an attacker would need to break into the database, steal the index and hope there's enough data to do some serious statistical analysis. And then? – Gambell 11/3, 2014 at 22:33

Lastly, your idea of encrypting the whole index and doing the search on a client could be interesting. However, that would require transferring both "tables" ("words" and "match"), as transferring only "words" would still leave the system vulnerable to frequency analysis (as every hash has one unique id). Those tables, however, can be sensibly big (they can grow to MBs easily), and it's probably too much data to be downloaded (and uploaded after new inserts). – Gambell 11/3, 2014 at 22:35

@Qualcuno I don't know enough about your application to say how serious analysis of the index is. It depends on the value of the information, the damage that occurs if compromised, etc. But I think what you really are questioning is whether such analysis is feasible. It is, and I don't think it would be terribly expensive. Building and adapting tools would take some time, but once available, there wouldn't be much computation necessary to break each user's index. Pulling numbers out of air, I'd say $50,000 to break first index, $1 for each additional. – Aparri 11/3, 2014 at 23:14

MSSQL Enterprise TDE encrypts Full-Text index as well as other indices when you set whole database encryption (Since 2008). in practice, it works pretty well, without a huge performance penalty. Can't comment on how, b/c it's a proprietary algo, but heres the docs.

https://learn.microsoft.com/en-us/sql/relational-databases/security/encryption/transparent-data-encryption-tde

it doesn't cover any of your application stack besides your db, but your FTS indices will work like normal and won't exist in plain text like they do in MySQL or PostGres. MariaDB and of course Oracle have their own implementation as well, from what i remember. MySQL and PGSQL do not.

As for passwords, TDE on all the implementations use AES keys, which can be rotated (though not always easily) - so the password vulnerability fall on the DBA's.

The problem is you need to pay for full enterprise licensing for MSSQL TDE (ie features not available in "standard" or "basic" cloud and on premise editions), and you do probably for TDE in Oracle as well. But if what you need is a quick solution and have the cash for enterprise licensing (probably cheaper than developing your own implementation), implementations are out there.

Compaction answered 15/7, 2017 at 1:21 Comment(0)

In addition to the frequency analysis vulnerability, there is an addition one related to known plaintext. If the channel owner knows the plaintext of any ciphertext, he is able to know plaintext/ciphertext mapping of ALL words for that user.

In practice, if this were a messaging system someone could get the system to do this merely by sending a message.

So if an attacker Alice gains access to the encrypted store, all Alice needs to do is to send a message to Bob. When it reaches the system, the message is encrypted and the FTE system has encrypted value for each token.

What you have is not encryption at all. Any system that does not use random salts is really an encoding and encoding systems are usually way easier to break.

Pastis answered 27/6, 2024 at 13:32 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Comments in response to @Matthew

Recommended topics

Hot tags