Freebase: What data dump file contains the "imdb_id"?

I run IMDbAPI.com and have been using Bing's Search API to find IMDb IDs from title searches. Bing is moving its API to the Azure Marketplace (August 1st), after which it will no longer be available for free. I started testing my API using Freebase to resolve these IDs and hit their 100k limit in the first 8 hours (my site currently gets about 3 million requests a day, but only 200-300k are title searches).

This is exactly why they offer the data dump files. I downloaded most of the files in the Film folder but cannot find where they store the "/authority/imdb/title" namespace data that holds the IMDb ID.

https://www.googleapis.com/freebase/v1/mqlread?query={"type":"/film/film","name":"True%20Grit","imdb_id":null,"initial_release_date>=":"1969-01","limit":1}

This is how I'm currently accessing the ID.

Does anyone know which file contains this information, and how to link back to it from the film title/id?

Disquietude answered 15/7, 2012 at 14:24 Comment(0)

That imdb_id property is backed by a key in the /authority/imdb/title namespace, so you're looking for lines like this one:

/m/015gxt       /type/object/key        /authority/imdb/title   tt0065126

in the file http://download.freebase.com/datadumps/latest/freebase-datadump-quadruples.tsv.bz2

That's a 4 GB file, so be prepared to wait a little while for the download. Note that everything is keyed by MID, so you'll need to figure that out first if you don't have it in your database.
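As a rough sketch of the extraction described above, the following Python streams the compressed dump and keeps only the /type/object/key rows in the /authority/imdb/title namespace. It assumes the dump is tab-separated with exactly four columns per line, as in the sample line; the function name iter_imdb_keys and the local file path are hypothetical.

```python
import bz2

IMDB_NAMESPACE = "/authority/imdb/title"


def iter_imdb_keys(dump_path):
    """Yield (mid, imdb_id) pairs from a bz2-compressed quadruple dump.

    Assumes each line has four tab-separated columns:
    source MID, property, destination (namespace), value (key).
    """
    # Decompress as a stream so the archive is never expanded on disk.
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) != 4:
                continue  # skip lines that do not match the assumed layout
            mid, prop, namespace, key = fields
            if prop == "/type/object/key" and namespace == IMDB_NAMESPACE:
                yield mid, key
```

Iterating over iter_imdb_keys("freebase-datadump-quadruples.tsv.bz2") would then produce pairs such as ("/m/015gxt", "tt0065126"), which can be written to a CSV or loaded into a database keyed by MID.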

The equivalent query using MQL instead of the data dumps is https://www.googleapis.com/freebase/v1/mqlread?query=%7B%22type%22%3a%22/film/film%22,%22name%22%3a%22True%20Grit%22,%22imdb_id%22%3anull,%22initial_release_date%3E=%22%3a%221969-01%22,%22mid%22:null,%22key%22:[{%22namespace%22:%22/authority/imdb/title%22}],%22limit%22:1%7D&indent=1
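For readability, here is a minimal sketch of how that URL could be assembled in Python rather than percent-encoded by hand. The query fields and the endpoint are taken from the URL above; the function name build_mqlread_url and its parameters are hypothetical, and the sketch only builds the URL without sending the request.

```python
import json
from urllib.parse import urlencode

MQLREAD_ENDPOINT = "https://www.googleapis.com/freebase/v1/mqlread"


def build_mqlread_url(title, released_on_or_after):
    """Build an mqlread URL asking for a film's MID and IMDb ID."""
    query = {
        "type": "/film/film",
        "name": title,
        "imdb_id": None,  # null in MQL means "return this value"
        "initial_release_date>=": released_on_or_after,
        "mid": None,
        "key": [{"namespace": "/authority/imdb/title"}],
        "limit": 1,
    }
    # urlencode percent-encodes the JSON so it is safe in a query string.
    params = urlencode({"query": json.dumps(query), "indent": 1})
    return MQLREAD_ENDPOINT + "?" + params
```

Calling build_mqlread_url("True Grit", "1969-01") yields a URL carrying the same query as the one above, with the percent-encoding handled by urlencode.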

EDIT: p.s. I'm pretty sure the files in the Browse directory are going away, so I wouldn't depend on them even if you could find the info there.

Woodpecker answered 15/7, 2012 at 15:2 Comment(4)
I was trying to avoid the 4 GB (33 GB extracted) file, but I downloaded it anyway and spent the past 3 hours trying to find anything that could open or parse it. I wound up using Microsoft's Log Parser 2.2, which worked great:
LogParser.exe -i:TSV "SELECT Col1, Col4 INTO C:\imdbList.csv FROM C:\freebase.tsv WHERE Col3 like '%imdb/title%'" -o:CSV -headers:OFF -iHeaderFile:"C:\header.txt"
So now I have a 3 MB CSV file that has all the Freebase IDs and IMDb IDs. - Disquietude
Next I need to get the "Title", "Release Year" and "Aliases" from the "Film.tsv", and then I can join the data in SQL and finally be able to search :) But I am relying on the extra file from the Browse folder, "Films.tsv". Are these going away soon? - Disquietude
It's probably faster (and certainly uses less disk space) to process the compressed file, so I wouldn't decompress it. Any Linux system (or Cygwin on Windows) can process this trivially without downloading weird proprietary utilities. The equivalent command is:
bzgrep "authority/imdb/title" freebase-datadump-quadruples.tsv.bz2 | cut -f 1,4 > imdbList.csv
Even on a laptop it can decompress and search that 4 GB file and output 142K pairs of IDs in under 20 minutes. - Woodpecker
Here's the closest I can find to an announcement on the retirement of the TSV dumps: markmail.org/message/6yve4c36p6pwhchv - Woodpecker

The previous answer works fine; a snappier version of such a query could be:

query = [{
          'type': '/film/film',
          'name': 'prometheus',
          'imdb_id': null,
          ...
        }];

The rest of the MQL request isn't shown, as it doesn't differ from the one above. Hope that helps.

Mirabelle answered 28/1, 2014 at 21:55 Comment(0)
