I'm doing some testing with Nutch and Hadoop, and I need a massive amount of data. I want to start with 20 GB, then go to 100 GB, 500 GB, and eventually reach 1-2 TB.
The problem is that I don't have this amount of data, so I'm thinking of ways to produce it.
The data itself can be of any kind. One idea is to take an initial set of data and duplicate it, but that's not good enough because I need files that are different from one another (identical files are ignored).
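The closest workaround I can think of for the duplication route is to stamp each copy with a unique first line so no two copies are byte-identical. Roughly this kind of sketch (the paths, names, and copy count are just placeholders), though I'm not sure such near-duplicates make for realistic test data:

```python
import shutil
from pathlib import Path

# Placeholder paths and counts -- adjust to your setup.
SEED_FILE = Path("seed.txt")    # the initial data set to duplicate
OUT_DIR = Path("generated")     # where the near-duplicates go
NUM_COPIES = 1000               # how many copies to produce

OUT_DIR.mkdir(exist_ok=True)

for i in range(NUM_COPIES):
    dst = OUT_DIR / f"copy_{i:06d}.txt"
    with open(dst, "wb") as out:
        # Unique first line so the copies are not byte-identical,
        # then the original content appended unchanged.
        out.write(f"copy-id: {i}\n".encode())
        with open(SEED_FILE, "rb") as src:
            shutil.copyfileobj(src, out)
```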
Another idea is to write a program that will create files with dummy data.
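Something along these lines is what I had in mind (a rough Python sketch; the file size, total size, and paths are arbitrary placeholders):

```python
import os
from pathlib import Path

# Placeholder sizes -- tune per test round (20 GB, 100 GB, ...).
OUT_DIR = Path("dummy_data")
FILE_SIZE = 64 * 1024 * 1024     # 64 MB per file
TOTAL_SIZE = 20 * 1024**3        # 20 GB for the first round
CHUNK = 1024 * 1024              # write in 1 MB chunks

OUT_DIR.mkdir(exist_ok=True)

num_files = TOTAL_SIZE // FILE_SIZE
for i in range(num_files):
    path = OUT_DIR / f"dummy_{i:06d}.bin"
    with open(path, "wb") as f:
        written = 0
        while written < FILE_SIZE:
            # os.urandom produces random bytes, so no two files
            # will ever come out identical.
            f.write(os.urandom(CHUNK))
            written += CHUNK
```

Random bytes guarantee that every file is unique, but the data is incompressible and not very representative of real crawl content, which is why I'm looking for alternatives.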
Any other ideas?