Extracting a subset of Freebase data for faster development iteration

Asked 7/12, 2013 at 21:2 Answered 23/7, 2014 at 12:1

I have downloaded the 250 GB dump of Freebase data. I don't want to iterate on my development against the full data set. I want to extract a small subset of the data (maybe a small domain, or some 10 personalities and their information). This small subset will make my iterations faster and easier.

What's the best approach to partition the Freebase data? Is there any subset download provided by Google/Freebase?

Karimakarin answered 7/12, 2013 at 21:2 Comment(0)

This is feedback that we've gotten from many people using the data dumps. We're looking into how best to create such subsets. One approach would be to get all the data for a single domain like Film.

Here's how you'd get every RDF triple from the /film domain:

zgrep '\s<http://rdf\.freebase\.com/ns/film\.' freebase-rdf-{date}.gz | gzip > freebase-films.gz

The tricky part is that this subset won't contain the names, images or descriptions which you most likely also want. So you'll need to get those like this:

zgrep '\s<http://rdf\.freebase\.com/ns/(type\.object|common\.topic)' freebase-rdf-{date}.gz | gzip > freebase-topics.gz

Then you'll possibly want to filter that subset down to only topic data about films (match only triples that start with the same /m ID) and concatenate that to the film subset.
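A minimal sketch of that filter-and-concatenate step, assuming each line of the dump is a tab-separated triple whose first column is the subject. The function name `filter_topics_by_subject` and the file names are hypothetical, chosen only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: filter_topics_by_subject FILMS_GZ TOPICS_GZ OUT_GZ
# Writes the film triples plus only those topic triples whose subject
# also appears as a subject in the film subset.
# Assumption: triples are tab-separated, with the subject in column 1.
filter_topics_by_subject() {
    films_gz=$1
    topics_gz=$2
    out_gz=$3
    ids=$(mktemp) || return 1

    # Collect the unique subject IDs found in the film subset.
    gzip -dc "$films_gz" | cut -f1 | sort -u > "$ids"

    {
        # First the film triples themselves...
        gzip -dc "$films_gz"
        # ...then the topic triples whose subject is in the ID list.
        gzip -dc "$topics_gz" | awk -F'\t' -v idfile="$ids" '
            BEGIN { while ((getline line < idfile) > 0) wanted[line] = 1 }
            $1 in wanted'
    } | gzip > "$out_gz"

    rm -f "$ids"
}
```

It could be called as `filter_topics_by_subject freebase-films.gz freebase-topics.gz freebase-film-subset.gz`. The ID list is held in memory by awk, which should be manageable for a single domain but is worth checking on larger subsets.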

It's pretty straightforward to script this with regular expressions, but it's a lot more work than it should be. We're working on a better long-term solution.

Handknit answered 8/12, 2013 at 0:35 Comment(8)

Although you almost certainly want /common/topic for aliases, etc. and /type/object for name, there is much more that you probably want as well. If you're interested in the film domain, you probably also want actors' spouses, birth dates, nationalities, etc., so you'll want some of the properties from the included type /people/person. Basically, anything that is an included type of one of the target types is likely to be of interest. – Despair 8/12, 2013 at 4:3

zgrep $'\tns/film.' freebase-rdf-2013-12-01-00-00.gz yielded zero lines. Am I missing something? – Karimakarin 9/12, 2013 at 8:35

Oops, copypasta error on my end. I've updated the examples. Please give it another try. – Handknit 9/12, 2013 at 20:27

If we could download Freebase data in smaller pieces by topic, I'd be so happy. – Density 11/12, 2013 at 22:25

@ShawnSimister what is the easiest way to extract keywords with their categories into an Excel sheet? – Retrospective 13/12, 2013 at 16:22

@Retrospective you can use grep to extract that data, as Tom showed on your previous question (https://mcmap.net/q/1163098/-getting-error-while-importing-rdf-closed). Please try not to threadjack other people's questions. If you have a new question, post it on its own page. – Handknit 17/12, 2013 at 0:39

@ShawnSimister: Your second regex is not properly escaped. zgrep '\s<rdf\.freebase\.com/ns/(type\.object\|common\.topic)' freebase-rdf-{date}.gz | gzip > freebase-topics.gz If you can edit that, I will accept the answer. For now, this solution is good enough for me. Update us when there is a long-term solution. I accept that it's not a straightforward problem to separate a subset out of such a huge linked data set. Freebase is powerful because it's linked :) – Karimakarin 25/12, 2013 at 16:20

Okay, I escaped the periods in the property names but the bar is part of the regex. We want all triples with predicates that start with type.object or common.topic – Handknit 26/12, 2013 at 17:37

I wanted to do a similar thing and I came up with the following command line.

gunzip -c freebase-rdf-{date}.gz | awk 'BEGIN { prev_1 = ""} { if (prev_1 != $1) { print "" } print $0; prev_1 = $1};' | awk 'BEGIN { RS=""} $0 ~ /type\.object\.type.*\/film\.film>/' > freebase-films.txt

This gives you all the triples for every subject that has the type film. (It assumes the subjects appear in sorted order, so that all triples for one subject are adjacent.)
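The same grouping idea can also be sketched as a single awk pass that buffers one subject at a time and tests the type on the predicate and object columns of the same line. This is only a sketch under assumptions: the helper name `extract_by_type` is hypothetical, the dump is assumed to be tab-separated with full `<http://rdf.freebase.com/ns/...>` URIs, and triples for one subject are assumed to be adjacent:

```shell
#!/bin/sh
# Hypothetical helper: extract_by_type DUMP_GZ TYPE   (e.g. TYPE = film.film)
# Prints every triple of each subject that has a type.object.type triple
# pointing at the given type.
# Assumptions: tab-separated triples, full ns/ URIs, subjects grouped together.
extract_by_type() {
    gzip -dc "$1" | awk -F'\t' -v type_uri="<http://rdf.freebase.com/ns/$2>" '
        # Print the buffered subject if it matched, then reset the buffer.
        function emit() { if (keep) printf "%s", buf; buf = ""; keep = 0 }
        # A new subject starts: decide what to do with the previous one.
        $1 != prev { emit(); prev = $1 }
        # Buffer every triple of the current subject.
        { buf = buf $0 "\n" }
        # Mark the subject when predicate and object match on the same line.
        $2 ~ /type\.object\.type>$/ && $3 == type_uri { keep = 1 }
        END { emit() }'
}
```

For example, `extract_by_type freebase-rdf-{date}.gz film.film > freebase-films.txt`. Only one subject's triples are held in memory at a time, and a match cannot span two unrelated lines of the same subject.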

After this you can simply grep for the predicates that you need.

Deedradeeds answered 4/5, 2014 at 21:34 Comment(0)

Just one remark on the accepted answer: the topics variant didn't work for me, because the alternation syntax (a|b) requires extended regular expressions, so you need to pass the -E option:

zgrep -E '\s<http://rdf\.freebase\.com/ns/(type\.object|common\.topic)' freebase-rdf-{date}.gz | gzip > freebase-topics.gz

Fez answered 23/7, 2014 at 12:1 Comment(0)
