I grabbed the KDD Track 1 dataset from Kaggle and decided to load a ~2.5GB, 3-column CSV file into memory on my 16GB high-memory EC2 instance:
import numpy as np
data = np.loadtxt('rec_log_train.txt')
The Python session ate up all my memory (100%) and was then killed.
I then read the same file in R (via read.table), which used less than 5GB of RAM and dropped to under 2GB after I called the garbage collector.
My question is: why did this fail with numpy, and what is the proper way to read a file like this into memory? Yes, I could use generators and sidestep the problem, but that's not the goal.
np.fromfile / np.loadtxt(dtype=np.float32) will take less memory; then X = X.astype(np.float64) when done. – Hitt
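
For reference, a minimal sketch of that suggestion (the delimiter handling and the upcast step are assumptions; adjust them to your actual file):

import numpy as np

# Load as float32 to roughly halve the per-element footprint.
# np.loadtxt splits on any whitespace by default; pass delimiter=','
# only if the file really is comma-separated.
X = np.loadtxt('rec_log_train.txt', dtype=np.float32)

# Upcast afterwards only if float64 precision is actually needed
# (note this allocates a second, double-width copy of the array).
X = X.astype(np.float64)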