How do some sites download YouTube captions?
E

6

20

This is somewhat of a duplicate of Does YouTube API forbid to download video captions if you are not it's owner? and Get YouTube captions, which basically say it's not possible to download captions via the YouTube API unless you are the owner or third-party contributions are enabled. However, my question is: how do sites like http://downsub.com/ or http://www.lilsubs.com/ have access to all captions?

In other words, when I access the YouTube API myself (even with the youtubepartner and youtube.force-ssl scopes), I can only download the captions of some videos. The rest fail with 403: The permissions associated with the request are not sufficient to download the caption track. The request might not be properly authorized, or the video order might not have enabled third-party contributions for this caption. Yet when I try those same videos on these other sites, it works fine. I'm assuming they are using the YouTube API to access the captions, but what special sauce are they using? Some special partner key? A different API version? Are they just scraping from the videos themselves or something?

Election answered 21/10, 2017 at 14:41 Comment(4)
Can you link an example video whose captions you cannot get, but which the mentioned sites can? - Francisfrancisca
@JanisS. Here's an example: youtu.be/0db1_qWZjRA, which resolves to caption id zMTLb41gaOS5LWeeAi0ribdiUBImBdqb, and then fails with a 403. - Election
Thank you for the comments about the unofficial timedtext endpoint. That'll probably work for my use case; however, it does not seem to support kind=asr (i.e. auto-generated captions) without a signature. The other sites like downsub.com also include these. How are they doing that? Here's an example: youtube.com/watch?v=vx6NCUyg1NE Only English and Indonesian work without a key. ASR captions also aren't listed here youtube.com/api/…. - Election
Please check my updated answer. - Francisfrancisca
F
18

Send a GET request to:

http://video.google.com/timedtext?lang={LANG}&v={VIDEOID}

Example for your video in comment: http://video.google.com/timedtext?lang=ko&v=0db1_qWZjRA

Let's look at another example of yours, i.e. https://www.youtube.com/watch?v=7068mw-6lmI (and I agree about the differentiation part in your comment).

There are multiple subtitle tracks available for the video:

  • English
  • Korean
  • Spanish
  • Korean (auto-generated) also called asr (automatic speech recognition)

These labels are the values of the subtitle name parameter (e.g., name=English).

lang is the language code. In your example: https://www.youtube.com/api/timedtext?lang=es-MX&v=7068mw-6lmI&name=Spanish

If a subtitle track is available, it is possible to get a translation from it, namely by using the tlang parameter.

https://www.youtube.com/api/timedtext?lang=en&v=7068mw-6lmI&name=English&tlang=lv
https://www.youtube.com/api/timedtext?lang=ko&v=7068mw-6lmI&name=Korean&tlang=lv

This would be my guess for what these sites are using, i.e. translation of an available subtitle track (you can confirm by giving one of their sites a video without any subtitle track).

As for asr, a signature always seems to be needed, but as long as one of the subtitle tracks is available, you could use that for translation. E.g. in the example from your comment on the question:

https://www.youtube.com/api/timedtext?lang=en&v=vx6NCUyg1NE&tlang=lv

It looks like the last example is special, with both subtitle tracks being asr (checked with Chrome -> Inspect -> Network), so you need to omit the subtitle name parameter. Unfortunately, this difference is not visible in the YouTube video's settings wheel.
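
The URLs above all follow the same pattern, so here is a minimal Python sketch that assembles them from the lang, v, name and tlang parameters. The helper name build_timedtext_url is made up for illustration, and it only builds the string; whether the endpoint still answers unsigned requests is a separate question (see the comments below).

```python
from urllib.parse import urlencode

TIMEDTEXT_ENDPOINT = "https://www.youtube.com/api/timedtext"


def build_timedtext_url(video_id, lang, name=None, tlang=None):
    """Assemble a timedtext URL (hypothetical helper, not an official API)."""
    params = {"lang": lang, "v": video_id}
    if name is not None:
        # Track label, e.g. "English"; omit it for asr-only videos
        params["name"] = name
    if tlang is not None:
        # Target language for a translation of the chosen track
        params["tlang"] = tlang
    return f"{TIMEDTEXT_ENDPOINT}?{urlencode(params)}"


print(build_timedtext_url("7068mw-6lmI", "en", name="English", tlang="lv"))
```

Leaving name and tlang as None gives the plain lang/v form used in the first example.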

Francisfrancisca answered 24/10, 2017 at 15:56 Comment(2)
This stopped working (as of 11 Dec 2021). Any suggestions on how to overcome this? - Palate
@Palate So why do sites like downsub still work and download subtitles of any YouTube video? - Calvary
D
20

A 2022 answer:

Option 1: Fetch the watch page, e.g. curl -L "https://youtu.be/YbJOTdZBX1g", and search for timedtext in the result to find a URL. Replace every \u0026 with & and you get the link for the subtitles.

Option 2: Use youtube-dl or yt-dlp, from the command line or as a Python package:

# For installing see: https://github.com/yt-dlp/yt-dlp#with-pip
from yt_dlp import YoutubeDL

ydl_opts = {
    "skip_download": True,
    "writesubtitles": True,
    "subtitleslangs": ["all", "-live_chat"],
    # Looks like formats available are vtt, ttml, srv3, srv2, srv1, json3
    "subtitlesformat": "json3",
    # You can skip the following option
    "sleep_interval_subtitles": 1,
}
with YoutubeDL(ydl_opts) as ydl:
    ydl.download(["YbJOTdZBX1g"])
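
Option 1 can also be sketched in Python. This is only a rough illustration: the function name find_timedtext_urls is made up, and it assumes the URL sits inside a double-quoted JSON string in the page source with & written as \u0026, which is how it looked when this was written. It works on a string you have already downloaded, so the sample below is hand-made rather than a real page.

```python
import re


def find_timedtext_urls(page_source):
    """Return the timedtext URLs embedded in a watch page's source."""
    # Assumption: the URL is inside a JSON string, so it ends at the next quote.
    raw_urls = re.findall(
        r'https://www\.youtube\.com/api/timedtext[^"]*', page_source
    )
    # Inside the page, "&" is written as the JSON escape \u0026.
    return [url.replace("\\u0026", "&") for url in raw_urls]


# Hand-made stand-in for the output of: curl -L "https://youtu.be/YbJOTdZBX1g"
sample = '{"baseUrl":"https://www.youtube.com/api/timedtext?v=YbJOTdZBX1g\\u0026lang=en"}'
print(find_timedtext_urls(sample))
```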
Driskell answered 18/1, 2022 at 14:13 Comment(0)
M
3

YouTube itself uses this unofficial API:

https://www.youtube.com/api/timedtext?lang={LANG}&v={VIDEO_ID}

LANG here is the two-letter ISO 639-1 language code. For your example it would be:

https://www.youtube.com/api/timedtext?lang=ko&v=0db1_qWZjRA

You can check it in the browser's Network tab while toggling the closed-caption button.

Melena answered 25/10, 2017 at 5:37 Comment(6)
Thanks, this is the best answer so far, but please see my comment about ASR captions. Happen to know? #46864928 - Election
Any idea why the name param is required on some videos even though lang is already provided? For example, this URL https://www.youtube.com/api/timedtext?v=7068mw-6lmI&lang=ko&name=Korean will not work without name=Korean. Other ones are fine. I'm thinking it might have something to do w/ the ASR captions on this video since there's also auto-generated Korean captions, so perhaps it's to differentiate, but just a guess. - Election
Looking at the list of available subs indicates when it's required, not why. My guess is it's related to the YT v2 > v3 upgrade. Example: youtube.com/api/timedtext?v=7068mw-6lmI&type=list and youtube.com/api/timedtext?v=dhwpLACAls8&type=list - Disappear
https://www.youtube.com/api/timedtext?lang={LANG}&v={VIDEO_ID} is not working. I have to use: https://www.youtube.com/api/timedtext?v=u5lwQPyqfJY&caps=asr&xoaf=4&hl=vi&ip=0.0.0.0&ipbits=0&expire=1664101459&sparams=ip%2Cipbits%2Cexpire%2Cv%2Ccaps%2Cxoaf&signature=A5224B6829D6A9A602FF5635CE7695A19E7F71F8.3121CD5AAAC620BC32E73A876EC74AAEE036DAFC&key=yt8&lang=ko&fmt=json3 Does anyone know the meaning of the signature field? It also has an expire field, so I don't know if it can be used for a long time. - Calvary
Hi, I am wondering if you ever found out what this signature is? Thank you - Steere
Apparently you have to glean the full URL from the content: https://mcmap.net/q/394600/-how-do-some-sites-download-youtube-captions - Surra
B
0

I have used youtube-transcript-api successfully to retrieve transcripts. Below is a demo that dumps the transcript into HTML with links back to the timestamps in the video:

import sys

from youtube_transcript_api import YouTubeTranscriptApi

video_id = sys.argv[1]

# Retrieve the available transcripts
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
# Just use the first transcript; let it raise an exception if none exist.
transcript = next(iter(transcript_list))
print("<html><body>")
for line_map in transcript.fetch():
    # Split the start time (in seconds) into whole minutes and leftover seconds
    minutes = int(line_map['start'] // 60)
    seconds = int(line_map['start'] - minutes * 60)
    # Link to the start of the minute in which the line is spoken
    link_to_tstmp = f"https://youtu.be/{video_id}?t={minutes * 60}"
    tstmp_str = ("%2d:%02d" % (minutes, seconds)).replace(" ", "&nbsp;")
    print("""<a href="%s">%s</a> %s<br/>""" % (link_to_tstmp, tstmp_str, line_map['text']))
print("</body></html>")

If there are multiple transcripts, the library provides an API to search by language etc.

You can further tweak the logic to merge text so you only get one link every so many minutes. I got good results for a lecture by linking at every 1 min and formatting the lines into an HTML table.
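
A rough sketch of that merging step, in case it helps. It assumes each transcript entry is a dict with 'start' (seconds) and 'text' keys, as in the demo above; the function name merge_by_interval and the sample entries are made up for illustration.

```python
def merge_by_interval(entries, interval_sec=60):
    """Group transcript entries into buckets of interval_sec seconds.

    Returns a list of (bucket_start_seconds, merged_text) tuples in time order.
    """
    buckets = {}
    for entry in entries:
        # Round the start time down to the beginning of its bucket
        bucket_start = int(entry["start"] // interval_sec) * interval_sec
        buckets.setdefault(bucket_start, []).append(entry["text"])
    return [(start, " ".join(texts)) for start, texts in sorted(buckets.items())]


# Made-up entries standing in for the result of transcript.fetch()
sample = [
    {"start": 0.5, "text": "hello"},
    {"start": 42.0, "text": "world"},
    {"start": 61.2, "text": "next minute"},
]
for start, text in merge_by_interval(sample):
    print(f"https://youtu.be/VIDEO_ID?t={start} {text}")
```

Each tuple then becomes one row of the table, with a single link per interval.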

Bogie answered 15/1, 2023 at 10:20 Comment(0)
R
0

If anyone wants to know this today, you can get a ton of information about a video from its player. YouTube's undocumented youtubei API has multiple libraries in languages like JavaScript, Python and even Rust trying to tame it (I'm writing a replacement for the broken Rust one). If you don't want to use any of these, or there isn't one for your language, and this information is still valid, here is how to call it directly:

Request

You can make a POST request to https://www.youtube.com/youtubei/v1/player?key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8&prettyPrint=false (The key is the one that the YouTube web client uses) with the following HTTP headers:

  • Accept-Language: en-US,en;q=0.5 (You can obviously change the language)
  • Content-Type: application/json
  • X-Youtube-Client-Name: 1 (To pretend to be the web client)
  • X-Youtube-Client-Version: 2.20230607.06.00
  • Sec-Fetch-Mode: no-cors

Then set the user agent to something that looks like a browser (just grab it from your browser). I don't know if they check it, but 🤷‍♂️ just in case (you know).

In terms of the request JSON, it should look like this:

{
  "context": {
    "client": {
      "hl": "en",
      "clientName": "WEB",
      "clientVersion": "2.20230607.06.00"
    }
  },
  "videoId": "{video_id}",
  "params": "" // These are a little odd, you won't really have any of these so leave it blank
}

No, don't actually put that comment in there! That's for your education.
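
For what it's worth, here is how the request could be put together with Python's standard library. Treat it as a sketch: the function name build_player_request and the default user agent are placeholders, and the client fields are nested under a "client" object, which is my understanding of what the web client sends rather than anything documented. The request is only built here, not sent.

```python
import json
import urllib.request

PLAYER_URL = (
    "https://www.youtube.com/youtubei/v1/player"
    "?key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8&prettyPrint=false"
)
CLIENT_VERSION = "2.20230607.06.00"


def build_player_request(video_id, user_agent="Mozilla/5.0"):
    """Build (but do not send) the POST request described above."""
    payload = {
        # Assumption: the client fields live under context.client
        "context": {
            "client": {"hl": "en", "clientName": "WEB", "clientVersion": CLIENT_VERSION}
        },
        "videoId": video_id,
        "params": "",
    }
    headers = {
        "Accept-Language": "en-US,en;q=0.5",
        "Content-Type": "application/json",
        "X-Youtube-Client-Name": "1",
        "X-Youtube-Client-Version": CLIENT_VERSION,
        "Sec-Fetch-Mode": "no-cors",
        "User-Agent": user_agent,  # placeholder; copy a real one from your browser
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(PLAYER_URL, data=body, headers=headers, method="POST")


request = build_player_request("c0td7Noukww")
# To actually send it: response = json.load(urllib.request.urlopen(request))
print(request.get_method(), request.full_url)
```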

Response

There's a ton of useful information in this response, but we're looking for captions. Let's call the root of the response response. As of June 2023, we find the captions at response.captions.playerCaptionsTracklistRenderer.captionTracks (if captions doesn't exist, it's because captions don't exist for the video). This captionTracks is an array of objects that look like this:

{
  "baseUrl": "https://www.youtube.com/api/timedtext?v=c0td7Noukww&caps=asr&opi=112496729&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=1687045036&sparams=ip,ipbits,expire,v,caps,opi,xoaf&signature=35A403189649A24C75C8CE6CB6016B46D9385CC4.1F3E3B7FF4670E84747F5C24DE2B119B04BA9F47&key=yt8&kind=asr&lang=en",
  "name": {
    "simpleText": "English (auto-generated)"
  },
  "vssId": "a.en",
  "languageCode": "en",
  "kind": "asr",
  "isTranslatable": true
}

If you make a GET request to this baseUrl, you'll get HTML-encoded text captions in response. By appending &fmt=vtt you'll get WebVTT captions. That means time data, so we can have real subtitles and even convert to SRT for usage in video players if we download the video.
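
Picking a track out of the parsed response could look roughly like this in Python. The function name caption_url is made up, the sample dict is a trimmed hand-written stand-in for a real response, and the sketch assumes baseUrl already contains a query string (as in the example above), so the format can simply be appended.

```python
def caption_url(response, language_code, fmt="vtt"):
    """Return the caption URL for language_code with fmt appended, or None."""
    tracks = (
        response.get("captions", {})
        .get("playerCaptionsTracklistRenderer", {})
        .get("captionTracks", [])
    )
    for track in tracks:
        if track.get("languageCode") == language_code:
            # baseUrl already has query parameters, so "&fmt=..." is enough
            return track["baseUrl"] + "&fmt=" + fmt
    return None


# Trimmed, hand-written stand-in for a real player response
sample = {
    "captions": {
        "playerCaptionsTracklistRenderer": {
            "captionTracks": [
                {
                    "baseUrl": "https://www.youtube.com/api/timedtext?v=c0td7Noukww&lang=en",
                    "languageCode": "en",
                    "kind": "asr",
                }
            ]
        }
    }
}
print(caption_url(sample, "en"))
```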

Resurgent answered 19/6, 2023 at 1:12 Comment(0)
G
0

Looking at the source code of youtube-transcript-api, it seems pretty straightforward.

  1. Send a GET request to the YouTube video URL, e.g. https://www.youtube.com/watch?v=R0hAI0qUvmk
  2. Search for "captionTracks" in the response. You can use any HTML parser for this
  3. The first key of each entry in the captionTracks JSON/dictionary-like structure is baseUrl, which looks like this
    https://www.youtube.com/api/timedtext?v=R0hAI0qUvmk\u0026caps=asr\u0026opi=112496729\u0026xoaf=5\u0026hl=ur\u0026ip=0.0.0.0\u0026ipbits=0\u0026expire=1687732221\u0026sparams=ip,ipbits,expire,v,caps,opi,xoaf\u0026signature=8E3DD9D76DA864ACF6947F759695C6917A6B5A8E.46267D5DCE9449DF7EDAFB4F3492503D8CD55C1C\u0026key=yt8\u0026kind=asr\u0026lang=en
  4. Decode the \u0026 escapes, either with an online tool or in Python like this:
# baseURL is the escaped string copied from the page source
url = baseURL.replace("\\u0026", "&")

Basically, this just replaces \u0026 with the & sign. A decoded URL looks like this: https://www.youtube.com/api/timedtext?v=yCmBMO4hvaA&caps=asr&opi=112496729&xoaf=4&hl=ur&ip=0.0.0.0&ipbits=0&expire=1687730845&sparams=ip,ipbits,expire,v,caps,opi,xoaf&signature=BA99CEA6951B2D155A290B55020E423AB776F639.77C0AD10AF2C02987AA97015288FE7E15244B15D&key=yt8&kind=asr&lang=en
Go to this URL and you should have your captions.
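
The steps above can be sketched in Python without an HTML parser, by letting the json module read the captionTracks array straight out of the page source. This also decodes \u0026 for free. It is only a sketch: the function name extract_caption_tracks is made up, the sample string is hand-written, and it assumes the page still embeds the array as plain JSON after a "captionTracks": key.

```python
import json


def extract_caption_tracks(page_source):
    """Parse the captionTracks array embedded in a watch page's source."""
    marker = '"captionTracks":'
    position = page_source.find(marker)
    if position == -1:
        return []  # no captions, or the page layout changed
    start = position + len(marker)
    # raw_decode does not skip leading whitespace, so do it by hand
    while start < len(page_source) and page_source[start].isspace():
        start += 1
    # raw_decode parses one JSON value and ignores whatever follows it
    tracks, _ = json.JSONDecoder().raw_decode(page_source, start)
    return tracks


# Hand-written stand-in for the HTML of a watch page
sample = (
    '<script>var r = {"captions":{"playerCaptionsTracklistRenderer":'
    '{"captionTracks":[{"baseUrl":"https://www.youtube.com/api/timedtext'
    '?v=R0hAI0qUvmk\\u0026kind=asr\\u0026lang=en","languageCode":"en"}]}}};</script>'
)
for track in extract_caption_tracks(sample):
    print(track["languageCode"], track["baseUrl"])
```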

Gluconeogenesis answered 25/6, 2023 at 15:50 Comment(0)
