Index on URL or hashing, considering RAM

I am working on a project which needs to add/update around 1 million URLs daily. Some days are mostly updates, some days are mostly adds, and some days are a mix.

So, on every query there is a need to check the uniqueness of the URL in the URL table.

How can the URL lookup be made really fast? At the moment the index is on the url column and it works well, but in the coming weeks RAM will not be enough if the index stays on that column, since new records will be added in the millions.

That's why I am looking for a solution so that lookups stay fast when there are 150+ million URLs in total. I am thinking of indexing an MD5 hash, but then I worry about the chance of collisions. A friend tipped me to also calculate a CRC32 hash and concatenate it with the MD5 to make the collision possibility practically zero, and to store the result in a BINARY(20) column; that way the index would take only 20 bytes per row instead of the 255 of the current VARCHAR(255) url column.
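
Roughly, this is the kind of thing I have in mind, just as a sketch (Python at the application level; the table and column names below are only placeholders, not my real schema):

```python
import hashlib
import zlib

def url_key(url: str) -> bytes:
    """Build a 20-byte lookup key: 16-byte MD5 digest + 4-byte CRC32."""
    data = url.encode("utf-8")
    md5_digest = hashlib.md5(data).digest()      # 16 bytes
    crc = zlib.crc32(data) & 0xFFFFFFFF          # unsigned 32-bit value
    return md5_digest + crc.to_bytes(4, "big")   # 20 bytes total

# Placeholder schema: the key would live in a BINARY(20) column and the
# uniqueness lookup would hit this short index instead of the full URL:
#
#   CREATE TABLE urls (
#       id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
#       url     VARCHAR(255) NOT NULL,
#       url_key BINARY(20)   NOT NULL,
#       UNIQUE KEY idx_url_key (url_key)
#   );

print(len(url_key("http://example.com/page?id=1")))  # 20
```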

Currently there are around 50 million URLs in total, and with 8 GB of RAM it is working fine.

Yesterday, I asked a question related to the same project: url text compression (not shortening) and storing in mysql.

[Edit] I have thought of another solution: storing only the CRC32 hash, in decimal (integer) form, to speed up the lookup, and adding a check at the application level on how many records are returned. If more than one record is returned, the exact URL is also matched. That way collisions would be handled while keeping the load on RAM and disk space low, storing 4 bytes per row instead of 20 bytes (MD5 + CRC32). What do you say?
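
A rough sketch of this check (assuming Python with a MySQL driver such as PyMySQL; the table and column names are placeholders):

```python
import zlib

def find_or_insert(conn, url: str) -> int:
    """Look up a URL by its CRC32 and fall back to an exact comparison
    whenever the 4-byte hash matches more than one row (collision).

    Placeholder schema:
        CREATE TABLE urls (
            id  INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
            url VARCHAR(255) NOT NULL,
            crc INT UNSIGNED NOT NULL,
            KEY idx_crc (crc)
        );
    """
    crc = zlib.crc32(url.encode("utf-8")) & 0xFFFFFFFF
    cur = conn.cursor()

    # The 4-byte integer index narrows the search to a handful of rows.
    cur.execute("SELECT id, url FROM urls WHERE crc = %s", (crc,))
    for row_id, stored_url in cur.fetchall():
        if stored_url == url:     # resolve CRC32 collisions by exact match
            return row_id

    # No exact match found: this URL is new, so insert it.
    cur.execute("INSERT INTO urls (url, crc) VALUES (%s, %s)", (url, crc))
    conn.commit()
    return cur.lastrowid
```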

Embolus answered 13/9, 2011 at 14:3 Comment(8)
Prepossession: If you are hashing the URL, is that going to decrease the write performance?
Embolus: It's not a big concern, because each unique URL's hash is written only once in its lifetime.
Prepossession: Boss, in your question you specify that you need to add/update around 1M URLs daily... so?
Embolus: On updates there is no change to the hash column, because it is just the hash of the URL; only a few other columns of the row are updated. Even during a rush of new adds there will never be more than 10 inserts per second.
Prepossession: But you would need to calculate the hash at the source (for both insert and update), wouldn't you?
Joshuajoshuah: If possible, use something like Redis for this purpose. It will work faster, and 8 GB of RAM should be enough. Also, I think MD5 should not be a concern.
Embolus: @Joshuajoshuah Isn't that a NoSQL solution? I am in the mood to learn new things at the moment because I have some time, but NoSQL is completely new territory and I don't want to trap my boss :p If Redis can search hashes quickly then MySQL can too, because my setup is not complicated enough to justify trying a new tool. But I would be interested in your detailed views on preferring Redis over indexing hashes in MySQL, or on some other solution with MySQL.
Embolus: @Prepossession Yes, it would be calculated at the application level, and would it affect write performance at all? Do you think it will decrease write performance?

After reading all your questions (unique constraint makes hashes useless?, 512 bit hash vs 4 128bit hash, and url text compression (not shortening) and storing in mysql), I understood that your problem is more or less the following:

"I need to store +150M URLs in mySQL, using 8GB of RAM, and still have a good performance on writing them all and retrieving them, because daily I'll update them, so I'll retrive a lot of URLs, check them against the database. Actually it has 50M URLs, and will grow about 1M each day in the following 3 monts."

Is that it?

The following points are important: What is the format of the URLs that you'll save? Will you need to read the URL back, or just update information about it, never searching based on partial URLs, etc.?

Assuming URL = "http://www.somesite.com.tv/images/picture01.jpg" and that you want to store everything, including the filename. If it's different, please provide more details or correct my answer's assumptions.

  1. You can save space by replacing some groups of characters in the URL. Not all ASCII characters are valid in a URL, as you can see here: RFC 1738, so you can use the invalid ones to represent (and compress) parts of the URL. For example: using character 0x81 to represent "http://" saves you 6 characters, 0x82 to represent ".jpg" saves you another 3 bytes, etc.

  2. Some words might be very common (like "image", "picture", "video", "user"). If you choose to use characters 0x90 up to 0x9F + any other character (so, 0x90 0x01, 0x90 0x02, 0x90 0xFA) to encode such words, you can have 16 * 256 = 4,096 "dictionary entries" to encode the most used words. You'll use 2 bytes to represent 4-8 characters.

Edit: as you can read in the RFC mentioned above, a URL can only contain printable ASCII characters. This means that only characters 0x20 to 0x7E should be used, with some caveats noted in the RFC. So, any character from 0x80 upward (hexadecimal notation; that would be character 128 decimal in the ASCII table) shouldn't appear. You can therefore choose one character (let's say 0x90) to be a flag meaning "the following byte is an index into the dictionary". One flag character (0x90) * 256 characters (0x00 up to 0xFF) = 256 entries in the dictionary. But you can also choose to use characters 0x90 to 0x9F (or 144 to 159 in decimal) as flags to the dictionary, thus giving you 16 * 256 possibilities...

These 2 methods can save you a lot of space in your database and are reversible, without the need to worry about collisions, etc. You simply create a dictionary in your application and encode/decode URLs with it, very fast, making your database much lighter.
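
Just to illustrate both methods together, here is a toy sketch (Python; the dictionary entries below are made up, you would build yours from your own URL statistics, and it assumes ASCII-only URLs as per the RFC):

```python
# Method 1: single-byte substitutions for very common fragments.
# Method 2: flag byte 0x90 + index byte = two-byte dictionary entries.
SINGLE = {b"http://": 0x81, b"https://": 0x82, b".jpg": 0x83, b".html": 0x84}
WORDS = [b"image", b"picture", b"video", b"user"]   # index = second byte

def encode(url: str) -> bytes:
    data = url.encode("ascii")
    for plain, code in SINGLE.items():
        data = data.replace(plain, bytes([code]))
    for i, word in enumerate(WORDS):
        data = data.replace(word, bytes([0x90, i]))
    return data

def decode(blob: bytes) -> str:
    out = bytearray()
    i = 0
    while i < len(blob):
        b = blob[i]
        if b == 0x90:                  # dictionary flag: next byte is the index
            out += WORDS[blob[i + 1]]
            i += 2
        elif b in SINGLE.values():
            out += next(k for k, v in SINGLE.items() if v == b)
            i += 1
        else:
            out.append(b)
            i += 1
    return out.decode("ascii")

url = "http://www.somesite.com.tv/images/picture01.jpg"
assert decode(encode(url)) == url
print(len(url), "->", len(encode(url)))  # 47 -> 30 bytes with this toy dictionary
```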

Since you already have 50M+ URLs, you can generate statistics from them to build a better dictionary.

Using hashes: hashes, in this case, are a tradeoff between size and security. How bad would it be if you got a collision? And in this case you can use the birthday paradox to help you.

Read the article to understand the problem: if all inputs (possible characters in the URL) were equivalent, you could estimate the probability of a collision. And you could calculate the opposite: given your acceptable collision probability and your number of files, how broad should your range be? And your range is exactly related to the number of bits generated by the hash function...

Edit: if you have a hash function that gives you 128 bits, you'll have 2^128 possible outcomes. So, your "range" in the birthday paradox is 2^128: it's like your year has 2^128 days instead of 365. So you calculate the probability of a collision ("two files being born on the same day", with a year that has 2^128 days instead of 365). If you choose a hash that gives you 512 bits, your range would go from 0 to 2^512 - 1...
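
To put numbers on it, here is a quick estimate using the usual approximation p ≈ 1 - e^(-n(n-1)/(2N)), where N = 2^bits (the 150M figure comes from your question; the rest is just arithmetic):

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-paradox approximation of the chance of at least one collision."""
    n_space = 2 ** hash_bits          # size of the hash output "range"
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2 * n_space))

n = 150_000_000                        # 150M URLs
for bits in (32, 64, 128):
    print(bits, "bits:", collision_probability(n, bits))

# Roughly:
#  32 bits (CRC32) -> ~1.0    collisions are certain
#  64 bits         -> ~6e-4   unlikely, but not negligible
# 128 bits (MD5)   -> ~3e-23  effectively zero by chance
```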

And, again, keep the RFC in mind: not all bytes (256 characters) are valid in the internet/URL world, so the probability of collisions decreases. Better for you :).

Anisotropic answered 15/9, 2011 at 17:15 Comment(8)
Embolus: Thanks for taking the time to read all of my questions. And yes, you have got my point; now it will be easier to find the solution. I had already analyzed the encoding/decoding solution, but I dropped it because it works only for English words, while our database contains Chinese characters too and plenty of URLs have parameters with numbers as their values. So it would bring us less value than the computation it would take for reporting. That's why I started to look into hashes. I will look into it again, but what do you think about the alternative solutions (hashes or anything else)?
Embolus: Moreover, let's say 50% compression is achieved; even then it would still be above the 100-byte range, while a 512-bit hash would take only 64 bytes. On the other hand, VARCHAR(255) is considered small for URLs, and I have also found many records with truncation, so this limit will likely be increased soon.
Anisotropic: You should answer/know one thing: if you hash some URL, you can't take the hash and find out what the original URL was. If you never need to do that, OK, hashes are good. If you need to read the URL again, you'll have to keep a table associating hash <-> URL. And that will take the same space...
Embolus: At the end of the day we have come to the conclusion that, to make search queries faster, I need to reserve extra disk space for the hashes, which is not a big issue. So the unanimous decision for fast lookup is the use of hashes?
Embolus: Related to your edit -> My edit in the question refers to checking the original URL if a collision occurs, and then adjusting the hash by some number. That will happen only when a collision occurs. I think a check in the application is the cheapest solution; now even a smaller hash than 512 bits can be used. Isn't that a nice trade-off?
Anisotropic: Yes, it is... and it's what a hash table does, in general :). Take a look at en.wikipedia.org/wiki/Hash_table, in the "Collision resolution" section.
Embolus: Could you please elaborate on the meaning of 'your range is exactly related to the number of bits generated by the hash function'? And what are the characters you mentioned, like 0x90? And can this range be extended? (though 4,096 is more than enough :-) )
Embolus: Well, I found a way to extend the range, because the undefined (HTML 4 standard) HTML character codes run from 7F to 9F. ascii.cl/htmlcodes.htm
