I'm trying to load a couple of files into memory. Each file has one of the following three formats:
- string TAB int
- string TAB float
- int TAB float
They are n-gram statistics files, in case that helps with the solution. For instance:
i_love TAB 10
love_you TAB 12
Currently, what I'm doing is essentially the following:

def loadData(file):
    data = {}
    for line in file:
        first, second = line.rstrip('\n').split('\t')
        data[first] = int(second)  # or float(second), depending on the file
    return data
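
For context, I load each file into its own dict roughly like this (the file names here are just placeholders):

with open('bigram_counts.txt') as f:   # hypothetical file name, string TAB int
    bigram_counts = loadData(f)
with open('bigram_scores.txt') as f:   # hypothetical file name, string TAB float
    bigram_scores = loadData(f)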
Much to my surprise, while the total size of the files on disk is about 21 MB, when loaded into memory the process takes 120-180 MB of memory! (The whole Python application doesn't load any other data into memory.)
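
In case it helps to reproduce the numbers, a rough way to check the resident memory of the process from inside Python (using the third-party psutil package, purely for illustration) is:

import psutil

# resident set size of the current process, in MB (illustrative only)
rss_mb = psutil.Process().memory_info().rss / (1024 * 1024)
print('resident memory: %.1f MB' % rss_mb)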
There are fewer than 10 files; most of them stay stable at about 50-80k lines, except for one file that currently has millions of lines.
So I would like to ask for a technique/data structure to reduce the memory consumption:
- Any advice on compression techniques?
- If I still use a dict, is there any way to reduce its memory? Is it possible to set the "load factor" for a Python dict, as in Java?
- If you have some other data structures in mind, I'm also willing to trade some speed for lower memory use. Nevertheless, this is a time-sensitive application, so once users input their queries, I don't think it would be reasonable to take more than a few seconds to return the result. (In that regard, I'm still amazed by how Google manages to make Google Translate so fast: they must be using a lot of techniques plus a lot of server power?) To make the data-structure question concrete, there's a rough sketch of one idea I had below.
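
For the big int TAB float file, for example, I was wondering whether something along these lines (just an untested sketch with made-up names: parallel arrays plus binary search instead of a dict) would be noticeably smaller:

import array
from bisect import bisect_left

# Keys kept sorted in one compact array, float values in a parallel array;
# both must be filled in ascending key order.
keys = array.array('l')    # C longs instead of Python int objects
values = array.array('d')  # C doubles instead of Python float objects

def lookup(key):
    # binary search in place of hashing
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    raise KeyError(key)

Would something like that be the right direction, or is there a better standard approach?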
Thank you very much. I look forward to your advice.