Parsing and Converting TED Talks JSON Subtitles

This question is related to another question on Super User.

I want to download TED Talks and their subtitles for offline viewing. For instance, let's take this short talk by Richard St. John. The high-resolution video download URL is the following:

http://www.ted.com/talks/download/video/5118/talk/70

And the respective JSON-encoded English subtitles can be downloaded at:

http://www.ted.com/talks/subtitles/id/70/lang/eng

Here is an excerpt from the beginning of the actual subtitle file:

{
  "captions": [{
        "content": "This is really a two hour presentation I give to high school students,",
        "startTime": 0,
        "duration": 3000,
        "startOfParagraph": false
      }, {
        "content": "cut down to three minutes.",
        "startTime": 3000,
        "duration": 1000,
        "startOfParagraph": false
      }, {
        "content": "And it all started one day on a plane, on my way to TED,",
        "startTime": 4000,
        "duration": 3000,
        "startOfParagraph": false
      }, {
        "content": "seven years ago."

And from the end of the subtitle:

{
  "content": "Or failing that, do the eight things -- and trust me,",
  "startTime": 177000,
  "duration": 3000,
  "startOfParagraph": false
}, {
  "content": "these are the big eight things that lead to success.",
  "startTime": 180000,
  "duration": 4000,
  "startOfParagraph": false
}, {
  "content": "Thank you TED-sters for all your interviews!",
  "startTime": 184000,
  "duration": 2000,
  "startOfParagraph": false
}]
}

I want to write an app that automatically downloads the high-resolution version of the video and all the available subtitles. I'm having a really hard time, though, because I have to convert the subtitles to a format compatible with VLC or any other decent video player (.srt or .sub are my first choices), and I have no idea what the startTime and duration keys of the JSON file represent.

What I know so far is this:

  • The downloaded video lasts 3 minutes and 30 seconds at 29 FPS, which gives 210 × 29 = 6090 frames.
  • The first caption has a startTime of 0 and a duration of 3000, so it ends at 3000.
  • The last caption has a startTime of 184000 and a duration of 2000, so it ends at 186000.

It may also be worth noting the following JavaScript snippet:

introDuration:16500,
adDuration:4000,
postAdDuration:2000,

So my question is: what logic should I apply to convert the startTime and duration values to a .srt compatible format:

1
00:01:30,200 --> 00:01:32,201
MEGA DENG COOPER MINE, INDIA

2
00:01:37,764 --> 00:01:39,039
Watch out, watch out!

Or to a .sub compatible format:

{FRAME_FROM}{FRAME_TO}This is really a two hour presentation I give to high school students,
{FRAME_FROM}{FRAME_TO}cut down to three minutes.

Can anyone help me out with this?


Ninh Bui nailed it. The formula (start ... end) is the following:

introDuration - adDuration + startTime ... introDuration - adDuration + startTime + duration

This approach allows me to convert directly to .srt format (no need to know the video length or FPS) in two ways:

00:00:12,500 --> 00:00:15,500
This is really a two hour presentation I give to high school students,

00:00:15,500 --> 00:00:16,500
cut down to three minutes.

And, leaving the whole millisecond value in the last field:

00:00:00,16500 --> 00:00:00,19500
And it all started one day on a plane, on my way to TED,

00:00:00,19500 --> 00:00:00,20500
seven years ago.
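
For illustration, the formula above could be sketched in Python roughly as follows. The function names ms_to_srt and captions_to_srt are hypothetical, and the default offset assumes the introDuration and adDuration values quoted in the question apply to every talk, which may not hold:

```python
# Offsets taken from the JavaScript snippet quoted in the question (assumed constant).
INTRO_DURATION_MS = 16500
AD_DURATION_MS = 4000


def ms_to_srt(ms):
    """Format a millisecond count as an SRT timestamp (HH:MM:SS,mmm)."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return '%02d:%02d:%02d,%03d' % (hours, minutes, seconds, millis)


def captions_to_srt(captions, offset_ms=INTRO_DURATION_MS - AD_DURATION_MS):
    """Build SRT text from a list of caption dicts (content, startTime, duration)."""
    blocks = []
    for index, cap in enumerate(captions, 1):  # SRT cue numbers start at 1
        start = offset_ms + cap['startTime']
        end = start + cap['duration']
        blocks.append('%d\n%s --> %s\n%s\n' % (
            index, ms_to_srt(start), ms_to_srt(end), cap['content']))
    # Cues are separated by a blank line.
    return '\n'.join(blocks)
```

With the first two captions from the excerpt, this yields cues starting at 00:00:12,500 and 00:00:15,500, matching the first example above.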
Stevens answered 23/12, 2009 at 22:17 Comment(2)
+1 for the detailed explanation :) - Heartwarming
+1 for trying to do something I was wondering if I could do. - Kunkel
C
4

My guess would be that the times in the JSON are expressed in milliseconds, i.e. 1000 = 1 second. There is probably a main timer, where startTime is the point on the timeline at which the subtitle should appear and duration is how long the subtitle should remain on screen. The numbers support this theory: 186000 / 1000 = 186 seconds = 3.1 minutes = 3 minutes and 6 seconds. The remaining seconds are probably applause ;-)

With this information you should also be able to calculate the frame range for the .sub conversion. You already know the frames per second, so all you need to do is multiply the start time in seconds by the FPS to get the begin frame: startTime / 1000 * fps. The end frame can be obtained by: (startTime + duration) / 1000 * fps :-)
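
A rough sketch of that frame calculation in Python, producing lines in the {FRAME_FROM}{FRAME_TO} style from the question. The helper names and the offset_ms parameter are assumptions for illustration, fps would be the 29 mentioned in the question, and frame numbers are rounded down:

```python
def ms_to_frame(ms, fps):
    """Convert a millisecond timestamp to a frame number, rounding down."""
    return int(ms * fps / 1000)


def caption_to_sub_line(caption, fps, offset_ms=0):
    """Render one caption dict as a {start_frame}{end_frame}text line.

    offset_ms is a hypothetical knob for shifting every caption, e.g. to
    account for an intro that precedes the talk itself.
    """
    start = offset_ms + caption['startTime']
    end = start + caption['duration']
    return '{%d}{%d}%s' % (
        ms_to_frame(start, fps), ms_to_frame(end, fps), caption['content'])
```

For the second caption in the excerpt (startTime 3000, duration 1000) at 29 FPS and no offset, the frame range works out to 87 through 116.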

Contra answered 23/12, 2009 at 22:51 Comment(1)
Thank you, my conversion is perfectly synced now. =) - Stevens
D
3

I made a simple console-based program to download the subtitles. I was thinking of making it available on the web through a userscript system like Greasemonkey... Here is the link to my blog post with the code: http://estebanordano.com.ar/ted-talks-download-subtitles/

Dachia answered 5/1, 2010 at 22:47 Comment(1)
Excellent! You should also provide some kind of API to allow the automatic download of subtitles, for example from an AppleScript that runs after a podcast download or something. - Falsify
C
1

I found another site that uses this format, so I quickly hacked together a function to convert it to .srt. It should be self-explanatory:

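import json
import urllib.request

def json2srt(url, fname):
    # Fetch the JSON document and keep only the list of captions.
    data = json.load(urllib.request.urlopen(url))['captions']

    def conv(t):
        # Milliseconds -> SRT timestamp HH:MM:SS,mmm (integer division).
        return '%02d:%02d:%02d,%03d' % (
            t // 1000 // 60 // 60,
            t // 1000 // 60 % 60,
            t // 1000 % 60,
            t % 1000)

    with open(fname, 'w', encoding='utf8') as fhandle:
        # SRT cue numbers start at 1, not 0.
        for i, item in enumerate(data, 1):
            fhandle.write('%d\n%s --> %s\n%s\n\n' %
                (i,
                 conv(item['startTime']),
                 # End 1 ms early so consecutive cues do not overlap.
                 conv(item['startTime'] + item['duration'] - 1),
                 item['content']))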
Collotype answered 7/4, 2012 at 23:53 Comment(0)
M
0

My program, TEDGrabber beta2: http://sourceforge.net/projects/tedgrabber/

Melindamelinde answered 21/9, 2010 at 5:57 Comment(0)
R
0

I've written a Python script that downloads any TED video and creates an MKV file with all the subtitles and metadata embedded in it (https://github.com/oxplot/ted2mkv).

I used the variable pad_seconds in the JavaScript code of the TED talk page as an offset to be added to all the timestamps in the JSON subtitle files. I assume it is what the Flash player uses.

Regress answered 16/12, 2012 at 5:24 Comment(0)
