I'm doing some testing with Nutch and Hadoop, and I need a massive amount of data. I want to start with 20 GB, then go to 100 GB, 500 GB, and eventually reach 1-2 TB.
The problem is that I don't have this amount of data, so I'm thinking of ways to produce it.
The data itself can be of any kind. One idea is to take an initial set of data and duplicate it, but that's not good enough because I need files that are different from one another (identical files are ignored).
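The closest workaround I can think of for the duplication route is to stamp each copy with a unique first line so no two copies are byte-identical. Roughly this kind of sketch (the paths, names, and copy count are just placeholders), though I'm not sure such near-duplicates make for realistic test data:

```python
import shutil
from pathlib import Path

# Placeholder paths and counts -- adjust to your setup.
SEED_FILE = Path("seed.txt")    # the initial data set to duplicate
OUT_DIR = Path("generated")     # where the near-duplicates go
NUM_COPIES = 1000               # how many copies to produce

OUT_DIR.mkdir(exist_ok=True)

for i in range(NUM_COPIES):
    dst = OUT_DIR / f"copy_{i:06d}.txt"
    with open(dst, "wb") as out:
        # Unique first line so the copies are not byte-identical,
        # then the original content appended unchanged.
        out.write(f"copy-id: {i}\n".encode())
        with open(SEED_FILE, "rb") as src:
            shutil.copyfileobj(src, out)
```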
Another idea is to write a program that will create files with dummy data.
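Something along these lines is what I had in mind (a rough Python sketch; the file size, total size, and paths are arbitrary placeholders):

```python
import os
from pathlib import Path

# Placeholder sizes -- tune per test round (20 GB, 100 GB, ...).
OUT_DIR = Path("dummy_data")
FILE_SIZE = 64 * 1024 * 1024     # 64 MB per file
TOTAL_SIZE = 20 * 1024**3        # 20 GB for the first round
CHUNK = 1024 * 1024              # write in 1 MB chunks

OUT_DIR.mkdir(exist_ok=True)

num_files = TOTAL_SIZE // FILE_SIZE
for i in range(num_files):
    path = OUT_DIR / f"dummy_{i:06d}.bin"
    with open(path, "wb") as f:
        written = 0
        while written < FILE_SIZE:
            # os.urandom produces random bytes, so no two files
            # will ever come out identical.
            f.write(os.urandom(CHUNK))
            written += CHUNK
```

Random bytes guarantee that every file is unique, but the data is incompressible and not very representative of real crawl content, which is why I'm looking for alternatives.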
Any other ideas?