Huge amount of plaintext data for parsing experiment


Asked 26/4, 2011 at 3:53 Answered 26/4, 2011 at 4:1

I am developing a parser in Ruby that parses non-uniform text data. Can anybody tell me where I can get a large amount of plaintext data to test it with?

Roundly answered 26/4, 2011 at 3:53 Comment(0)

Here you'll find a list of many sources:

http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public

And my favorite is:

http://ftp.sunet.se/mirror/archive/ftp.sunet.se/pub/tv+movies/imdb/

Migrate answered 26/4, 2011 at 3:54 Comment(1)

As long as amazon us-east-1d is up :) – Bier 26/4, 2011 at 4:4

You could scrape Wikipedia (or just run a bunch of its pages through lynx -dump). That would also give you a vast source of non-English text. Project Gutenberg would be another good source of large amounts of plain text.
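
Since the parser is in Ruby, the lynx -dump idea could be driven from a small Ruby script. What follows is only a rough sketch under some assumptions: lynx is installed and on the PATH, the URLs are passed on the command line, and the helper names (lynx_dump_command, output_name, dump_page) are made up for illustration. The one-second pause between requests is an arbitrary courtesy value, not a documented limit of any site.

```ruby
require "open3"
require "uri"

# Build the argument list for dumping one page as plain text.
# Only the -dump flag mentioned in the answer is used.
def lynx_dump_command(url)
  ["lynx", "-dump", url]
end

# Hypothetical helper: derive a local file name from the last path segment.
def output_name(url)
  base = File.basename(URI.parse(url).path)
  base = "index" if base.empty? || base == "/"
  "#{base}.txt"
end

# Run lynx and return the rendered text, or nil if lynx is missing or fails.
def dump_page(url)
  stdout, status = Open3.capture2(*lynx_dump_command(url))
  status.success? ? stdout : nil
rescue Errno::ENOENT
  nil # lynx is not installed
end

if __FILE__ == $PROGRAM_NAME
  ARGV.each do |url|
    text = dump_page(url)
    if text.nil?
      warn "failed: #{url}"
    else
      File.write(output_name(url), text)
    end
    sleep 1 # arbitrary pause so the site is not hammered
  end
end
```

Saved as, say, dump_pages.rb (a hypothetical file name), it could be run as `ruby dump_pages.rb https://en.wikipedia.org/wiki/Parsing` to write Parsing.txt into the current directory.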

Elenore answered 26/4, 2011 at 4:1 Comment(3)

@Phrogz: I used to be a Gutenberg addict back in my "Palm Pilot and commuting on the bus" days. – Elenore 26/4, 2011 at 4:14

Project Gutenberg has a very strict bot policy: they allow no more than 100 visits from the same IP address in a day. – Raymund 2/7, 2013 at 6:29

@kyle k That's ok. They have a torrent: gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project – Songsongbird 3/10, 2013 at 17:14
