How to extract closed caption transcript from YouTube video?
P

10

90

Is it possible to extract the closed caption transcript from YouTube videos?

We have over 200 webcasts on YouTube and each is at least one hour long. YouTube has closed captions for all of the videos, but it seems users have no way to get them.

I tried the URL in this blog but it does not work with our videos.

http://googlesystem.blogspot.com/2010/10/download-youtube-captions.html

Philan answered 8/3, 2012 at 0:43 Comment(1)
Works in 2022, an answer to another question: stackoverflow.com/a/70756998 Sharlasharleen
P
20

The following document says that only the owner of the channel can do this via the standard YouTube interface: https://developers.google.com/youtube/2.0/developers_guide_protocol_captions?hl=en

Cheap fix: You can click the "interactive transcript" button and copy the content from there. Of course, you lose the milliseconds this way.

Extremely cheap fix: Use a shared YouTube account, so that multiple people can edit and upload caption files.

Challenging solution: The YouTube API allows downloading and uploading caption files via HTTP. You could write a YouTube API application that provides a browser user interface for uploading or downloading, either for ANY user or for particular users.

Here is an example project for this in Java: http://apiblog.youtube.com/2011/01/youtube-captions-uploader-web-app.html

Here is a very simple example of a working upload for everybody: http://yt-captions-uploader.appspot.com/

Preferable answered 13/6, 2012 at 10:53 Comment(2)
Every link in this answer is out of date. YouTube API 2.0 has been since replaced by API 3.0 and downloading captions under this API incurs “a quota cost of approximately 200 units”. They fail to mention how this quota is allocated and to whom, so this solution is not going to be useful to most people who just want to download captions rather than admire some API.Hung
somehow is that possible to generate auto transcript via Youtube V3 API?Rubidium
A
79

Here's how to get the transcript of a YouTube video (when available):

  • Go to YouTube and open the video of your choice.
  • Click on the "More actions" button (3 horizontal dots) located next to the Share button.
  • Click "Open transcript"

Although the syntax may be a little goofy, this is a pretty good solution.

Source: http://ccm.net/faq/40644-youtube-how-to-get-the-transcript-of-a-video

Acton answered 1/2, 2016 at 14:37 Comment(3)
That's exactly what I needed. You can click each caption to jump straight to the right part of the video.Sometimes
AWESOME!!! This is much better solution than all hacks suggested in many other similar question on SO, some of 'em led me to pop up spam,Tang
What if there is no "Open transcript" link? Edit: I see that that may be the case if the video is only an hour old. I see the link now, after another 30 minutes or so.Adlai
R
64

Get timedtext file directly from YouTube

curl -s "$video_url"|grep -o '"baseUrl":"https://www.youtube.com/api/timedtext[^"]*lang=en'|cut -d \" -f4|sed 's/\\u0026/\&/g'|xargs curl -Ls|grep -o '<text[^<]*</text>'|sed -E 's/<text start="([^"]*)".*>(.*)<.*/\1 \2/'|sed 's/\xc2\xa0/ /g;s/&amp;/\&/g'|recode xml|awk '{$1=sprintf("%02d:%02d:%02d",$1/3600,$1%3600/60,$1%60)}1'|awk 'NR%n==1{printf"%s ",$1}{sub(/^[^ ]* /,"");printf"%s"(NR%n?FS:RS),$0}' n=2|awk 1
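For readers who would rather not chain sed and awk, here is a minimal Python sketch of the last part of that pipeline only: turning an already downloaded timedtext response into `HH:MM:SS caption` lines. It assumes the response is XML made of `<text start="...">` elements (which is what the sed expression above matches) and that the caption text may be entity-escaped twice (which is why the pipeline decodes entities twice). The `<transcript>` root element and the sample captions are made up for illustration. Unlike the one-liner, it does not join every two captions into one line.

```python
import html
import xml.etree.ElementTree as ET


def timedtext_to_lines(xml_text):
    """Turn <text start="..."> elements into 'HH:MM:SS caption' lines."""
    lines = []
    for node in ET.fromstring(xml_text).iter("text"):
        seconds = int(float(node.get("start", "0")))
        stamp = "%02d:%02d:%02d" % (
            seconds // 3600, seconds % 3600 // 60, seconds % 60)
        # The XML parser decodes entities once; unescape again because the
        # captions may be escaped twice (assumption based on the pipeline).
        caption = html.unescape(node.text or "")
        caption = caption.replace("\xa0", " ").replace("\n", " ")
        lines.append(stamp + " " + caption)
    return lines


# Hypothetical sample in the assumed format, not a real response.
sample = (
    "<transcript>"
    '<text start="1.43" dur="2.82">ladies and gentlemen '
    "I&amp;#39;d like to thank</text>"
    '<text start="3725.5" dur="1.5">you for coming</text>'
    "</transcript>"
)

for line in timedtext_to_lines(sample):
    print(line)
```

Fetching the page and locating the `baseUrl` is left to the curl command above, since that part depends on YouTube's page markup and may change at any time.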

yt-dlp

yt-dlp supports saving the automatically generated closed captions in a JSON format:

cap()(printf %s\\n "${@-$(cat)}"|parallel -j10 -q yt-dlp -i --skip-download --write-auto-sub --sub-format json3 -o '%(upload_date)s.%(title)s.%(uploader)s.%(id)s.%(ext)s' --;for f in *.json3;do jq -r '.events[]|select(.segs and .segs[0].utf8!="\n")|(.tStartMs|tostring)+" "+([.segs[]?.utf8]|join(""))' "$f"|awk '{x=$1/1e3;$1=sprintf("%02d:%02d:%02d",x/3600,x%3600/60,x%60)}1'|awk 'NR%n==1{printf"%s ",$1}{sub(/^[^ ]* /,"");printf"%s"(NR%n?FS:RS),$0}' n=2|awk 1 >"${f%.json3}";rm "$f";done)

You can also use the function above to download the captions for all videos on a channel or playlist if you give the ID or URL of the channel or playlist as an argument. When there is an error downloading a single video, the -i (--ignore-errors) option skips the video instead of exiting with an error.

Or this just gets the text without the timestamps:

yt-dlp --skip-download --write-auto-sub --sub-format json3 $youtube_url_or_id;jq -r '.events[]|select(.segs and.segs[0].utf8!="\n")|[.segs[].utf8]|join("")' *json3|paste -sd\ -|fold -sw60
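The jq filters above can also be written as a short Python sketch, which may be easier to adapt. It relies only on what the jq expressions already assume about the json3 file: a top-level `events` list whose entries carry `tStartMs` and a `segs` list of `{"utf8": ...}` pieces. The sample data below is invented for illustration; real files contain more fields.

```python
import json


def json3_to_lines(json_text):
    """Return 'HH:MM:SS caption' lines from the text of a json3 subtitle file."""
    lines = []
    for event in json.loads(json_text).get("events", []):
        segs = event.get("segs")
        # Same selection as the jq filter: skip events without segments
        # and events whose first segment is only a newline.
        if not segs or segs[0].get("utf8") == "\n":
            continue
        total = int(event.get("tStartMs", 0)) // 1000
        stamp = "%02d:%02d:%02d" % (
            total // 3600, total % 3600 // 60, total % 60)
        text = "".join(seg.get("utf8", "") for seg in segs)
        lines.append(stamp + " " + text)
    return lines


# Invented sample following the structure the jq filter expects.
sample = json.dumps({
    "events": [
        {"tStartMs": 0},
        {"tStartMs": 1429, "segs": [
            {"utf8": "ladies"}, {"utf8": " and"}, {"utf8": " gentlemen"}]},
        {"tStartMs": 4249, "segs": [{"utf8": "\n"}]},
        {"tStartMs": 4259, "segs": [
            {"utf8": "you"}, {"utf8": " for"}, {"utf8": " coming"}]},
    ]
})

for line in json3_to_lines(sample):
    print(line)
```

To use it on a downloaded file, read the file's contents and pass them in, for example `json3_to_lines(open(path, encoding="utf-8").read())`, where `path` is whatever name yt-dlp wrote.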

youtube-dl

As of 2022, the VTT and TTML files downloaded by youtube-dl --write-auto-sub are malformed: all subtitle texts are placed on a few long lines, so the timestamps of the individual subtitles are not visible. If you don't need the timestamps, this doesn't matter. Otherwise you can fix it by substituting yt-dlp for youtube-dl in the following commands. With yt-dlp you can also use the more convenient JSON format described above, in which case you don't need the following approach for dealing with the VTT subtitle format.

This downloads the subtitles as VTT:

youtube-dl --skip-download --write-auto-sub $youtube_url

The other available formats are ttml, srv3, srv2, and srv1 (shown by --list-subs):

--write-sub
       Write subtitle file

--write-auto-sub
       Write automatically generated subtitle file (YouTube only)

--all-subs
       Download all the available subtitles of the video

--list-subs
       List all available subtitles for the video

--sub-format FORMAT
       Subtitle format, accepts formats preference, for example: "srt" or "ass/srt/best"

--sub-lang LANGS
       Languages of the subtitles to download (optional) separated by commas, use --list-subs for available language tags

You can use ffmpeg to convert the subtitle file to another format:

ffmpeg -i input.vtt output.srt

In the VTT subtitles, each subtitle text is repeated three times, and there is typically a new subtitle text every eighth line (but under some mysterious circumstances it's every 12th line instead):

WEBVTT
Kind: captions
Language: en

00:00:01.429 --> 00:00:04.249 align:start position:0%

ladies<00:00:02.429><c> and</c><00:00:02.580><c> gentlemen</c><c.colorE5E5E5><00:00:02.879><c> I'd</c></c><c.colorCCCCCC><00:00:03.870><c> like</c></c><c.colorE5E5E5><00:00:04.020><c> to</c><00:00:04.110><c> thank</c></c>

00:00:04.249 --> 00:00:04.259 align:start position:0%
ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
 </c>

00:00:04.259 --> 00:00:05.930 align:start position:0%
ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
you<00:00:04.440><c> for</c><00:00:04.620><c> coming</c><00:00:05.069><c> tonight</c><00:00:05.190><c> especially</c></c><c.colorCCCCCC><00:00:05.609><c> at</c></c>

00:00:05.930 --> 00:00:05.940 align:start position:0%
you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
 </c>

00:00:05.940 --> 00:00:07.730 align:start position:0%
you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
such<00:00:06.180><c> short</c><00:00:06.690><c> notice</c></c>

00:00:07.730 --> 00:00:07.740 align:start position:0%
such short notice


00:00:07.740 --> 00:00:09.620 align:start position:0%
such short notice
I'm<00:00:08.370><c> sure</c><c.colorE5E5E5><00:00:08.580><c> mr.</c><00:00:08.820><c> Irving</c><00:00:09.000><c> will</c><00:00:09.120><c> fill</c><00:00:09.300><c> you</c><00:00:09.389><c> in</c><00:00:09.420><c> on</c></c>

00:00:09.620 --> 00:00:09.630 align:start position:0%
I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
 </c>

00:00:09.630 --> 00:00:11.030 align:start position:0%
I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
the<00:00:09.750><c> circumstances</c><00:00:10.440><c> that's</c><00:00:10.620><c> brought</c><00:00:10.920><c> us</c></c>

00:00:11.030 --> 00:00:11.040 align:start position:0%
<c.colorE5E5E5>the circumstances that's brought us
 </c>

This converts the VTT subtitles to a simpler format:

sed '1,/^$/d' *.vtt| # remove the lines at the top of the file
sed 's/<[^>]*>//g'| # remove tags
awk -F. 'NR%4==1{printf"%s ",$1}NR%4==3' | # print each new subtitle text and its start time without milliseconds
awk NF\>1 # remove lines with only one field

Output:

00:00:01 ladies and gentlemen I'd like to thank
00:00:04 you for coming tonight especially at
00:00:05 such short notice
00:00:07 I'm sure mr. Irving will fill you in on
00:00:09 the circumstances that's brought us

In roughly 10% of the videos I tested (for example p9M3shEU-QM and aE05_REXnBc), one or more subtitle texts came 12 rather than 8 lines after the previous subtitle text. A workaround is to print every fourth line and then remove the lines that contain only a timestamp.

Function form:

cap()(printf %s\\n "${@-$(cat)}"|parallel -j10 -q youtube-dl -i --skip-download --write-auto-sub -o '%(upload_date)s.%(title)s.%(uploader)s.%(id)s.%(ext)s' --;for f in *.vtt;do sed '1,/^$/d' -- "$f"|sed 's/<[^>]*>//g'|awk -F. 'NR%4==1{printf"%s ",$1}NR%4==3'|awk 'NF>1'|awk 'NR%n==1{printf"%s ",$1}{sub(/^[^ ]* /,"");printf"%s"(NR%n?FS:RS),$0}' n=2|awk 1 >"${f%.vtt}";rm "$f";done)
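As an alternative to counting lines, here is a rough Python sketch of the same VTT clean-up. Note that it uses a different technique from the sed/awk commands: instead of taking every fourth line, it strips the tags and keeps a text line only when it differs from the last line kept, stamping it with the start time of the cue where it first appears. That avoids the 8-versus-12-line problem, but it would also drop a caption that is genuinely repeated twice in a row. The sample is a shortened version of the VTT excerpt shown earlier.

```python
import re

TAG = re.compile(r"<[^>]*>")
CUE = re.compile(r"^(\d{2}:\d{2}:\d{2})\.\d{3} --> ")


def vtt_to_lines(vtt_text):
    """Return 'HH:MM:SS caption' lines with tags and repeats removed."""
    lines = []
    start = None      # start time of the cue currently being read
    last_text = None  # last caption kept, used to drop repeats
    for raw in vtt_text.splitlines():
        match = CUE.match(raw)
        if match:
            start = match.group(1)
            continue
        if start is None:
            continue  # still inside the WEBVTT header
        text = TAG.sub("", raw).strip()
        if text and text != last_text:
            lines.append(start + " " + text)
            last_text = text
    return lines


sample = "\n".join([
    "WEBVTT",
    "Kind: captions",
    "Language: en",
    "",
    "00:00:01.429 --> 00:00:04.249 align:start position:0%",
    "",
    "ladies<00:00:02.429><c> and</c><00:00:02.580><c> gentlemen</c>",
    "",
    "00:00:04.249 --> 00:00:04.259 align:start position:0%",
    "ladies and gentlemen",
    "",
    "00:00:04.259 --> 00:00:05.930 align:start position:0%",
    "ladies and gentlemen",
    "you<00:00:04.440><c> for</c><00:00:04.620><c> coming</c>",
])

for line in vtt_to_lines(sample):
    print(line)
```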

Ruminate answered 22/2, 2019 at 1:6 Comment(5)
As of 2019, this is the only working solution. I think YouTube video downloads, and I assume by proxy, subtitles, is a moving target. The youtube-dl folks are the only ones that consistently hit the mark of being able to automatically download from YouTube, probably because they actively make sure it keeps working.Gilbreath
Thanks for doing this answer, how do I print the simplified format to a text file or markdown? I mean how to modify the cap() command to print to a file rather than print it out in terminalKelleekelleher
In case anybody looking at this answer, I have asked and received an answer on how to have the simplified format be printed into a file See https://mcmap.net/q/246031/-how-to-modify-this-sed-awk-command-so-that-the-output-goes-to-a-file-of-choice for how to do thisKelleekelleher
Here's a detailed bash script for those who want to save the subs file with a relative path. The result is saved as plaintext, removing time, new lines and other markup. https://mcmap.net/q/246031/-how-to-modify-this-sed-awk-command-so-that-the-output-goes-to-a-file-of-choice Tasker
One of the most well written, complete, and helpful answers I've ever seen on SO.Demetricedemetris
T
20

You can view, copy, or download a timecoded XML file of a YouTube video's closed captions by accessing

http://video.google.com/timedtext?lang=[LANGUAGE]&v=[YOUTUBE VIDEO IDENTIFIER]

For example http://video.google.com/timedtext?lang=pt&v=WSVKbw7LC2w

NOTE: this method does not download autogenerated closed captions, even if you get the language right (maybe there's a special code for autogenerated languages).
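If you need to build these URLs for many videos, a small Python helper can do it. This is only a sketch of the URL pattern given above: the function name and the `auto_generated` switch are made up for illustration, the `track=asr` parameter comes from a comment on this answer and is untested here, and commenters report that the endpoint does not respond for every video.

```python
from urllib.parse import urlencode


def timedtext_url(video_id, lang="en", auto_generated=False):
    """Build the timedtext URL described in this answer (hypothetical helper)."""
    params = {"lang": lang, "v": video_id}
    if auto_generated:
        # Suggested in a comment for auto-generated captions; unverified.
        params["track"] = "asr"
    return "http://video.google.com/timedtext?" + urlencode(params)


print(timedtext_url("WSVKbw7LC2w", lang="pt"))
```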

Tomlin answered 27/4, 2017 at 14:28 Comment(8)
As of May 2017 this no longer works (I'm guessing that video.google.com no longer works for the YouTube API). Any other Google tool to extract the captions?Albuminate
Thanks for the headsup, BUT... you must have run into some problem or other. This solution still works, just tested it. It might be some formatting option (language, maybe?). Post the video link and i'll double check directly.Tomlin
It does work for your example @tonygil; however does not work for... video.google.com/timedtext?lang=en&v=odPD-H0LMkc (youtu.be/odPD-H0LMkc)Pandemic
@J.Won. the video does not have closed captions to download. The bad recording quality and the very specific accent (indian subcontinent) probably impeded google scripts from obtaining a transcriptionTomlin
I just found something out: this method does not download autogenerated closed captionsTomlin
It doesn't work, e.g., video.google.com/timedtext?lang=pt&v=J_F5ssmvAqI This does have cc and I can read the transcript on youtubeErdah
@Erdah First, it does NOT have subtitles in PORTUGUESE. you are indicating language as PORTUGUESE (lang=pt). Second, said video only has autogenerated subtitles, which, as I wrote in the answer, this script does not download. Try another video with uploaded subtitles and you will see that it works.Tomlin
you just need to add &track=asr in last, as a query parameter. and this should work for auto-transcription captions.Rubidium
A
12

You can download the streaming subtitles from YouTube with KeepSubs, DownSub, and SaveSubs.

You can choose between the automatic transcript and the author-supplied closed captions. These sites also offer the possibility to automatically translate the English subtitles into other languages using Google Translate.

Amygdalate answered 20/4, 2015 at 14:10 Comment(10)
It appears that KeepSubs no longer exists.Adulterant
DownSub (downsub.com) is an alternative to KeepSubs. I've only used it one time (today) and it seems to have worked fine.Thousand
As of 02-19-17 DownSub pushes malware: it downloads a hacked version of the Flash installerBeiderbecke
@NoGrabbing: People always say that some website installs some malware but they always fail to say how. Browsers don’t allow installation of arbitrary software on users’ computers, so an explanation is due. I have been using DownSub for a year. Where can I find that “hacked version of the Flash installer” on my PC?Hung
@NoGrabbing: I can’t prove that I’m not their “shill” but that’s irrelevant. Your link confirms what I have said: “your security is not compromised unless you manually install the file”. To your credit, you have now shown how they may infect an absent-minded user and that’s a very useful information. I will try to implement something similar on my website. Thanks and +1.Hung
@NoGrabbing: I want to add that getting infected in this manner is a problem with a user more than a problem with a website. It’s everyone’s moral duty to teach people safe behaviour and exploit recklessness of those who refuse to listen.Hung
@NoGrabbing: Thanks for an invitation but I’m not a chaser.Hung
There is savesubs.com to the rescueOhalloran
malware sites! stay away!Helotism
the website 'savesubs' works pretty well and dont have clickbait as'downsub' website.Vida
C
9

(Obligatory 'this is probably an internal youtube.com interface and might break at any time')

Instead of linking to another tool that does this, here's an answer to the question of "how to do this"

Use Fiddler or your browser's devtools (e.g. Chrome) to inspect the youtube.com HTTP traffic; there's a response from /api/timedtext that contains the closed caption info as XML.

It seems that a response like this:

    <p t="0" d="5430" w="1">
        <s p="2" ac="136">we&#39;ve</s>
        <s t="780" ac="252"> got</s>
    </p>
    <p t="2280" d="7170" w="1">
        <s ac="243">we&#39;re</s>
        <s t="810" ac="233"> going</s>
    </p>

means that the word we've is at time 0, the word got is at time 0+780, the word going is at time 2280+810, and so on. These times are in milliseconds, so for time 3090 you'd want to append &t=3 to the URL.
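The timing arithmetic described above can be sketched in a few lines of Python. This assumes the response is well-formed XML in which each `<p>` has a start time `t` and each `<s>` has an optional offset `t` relative to its paragraph, as in the excerpt. The wrapping `<timedtext><body>` elements in the sample and the helper names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def word_times(xml_text):
    """Return (milliseconds, word) pairs, adding each <s> offset to its <p> start."""
    result = []
    for p in ET.fromstring(xml_text).iter("p"):
        base = int(p.get("t", "0"))
        for s in p.iter("s"):
            offset = int(s.get("t", "0"))  # a missing t means offset 0
            result.append((base + offset, s.text or ""))
    return result


def seek_suffix(milliseconds):
    """Return the &t= suffix (whole seconds) to append to the video URL."""
    return "&t=%d" % (milliseconds // 1000)


# The excerpt from above, wrapped in assumed root elements.
sample = """<timedtext><body>
<p t="0" d="5430" w="1"><s p="2" ac="136">we&#39;ve</s><s t="780" ac="252"> got</s></p>
<p t="2280" d="7170" w="1"><s ac="243">we&#39;re</s><s t="810" ac="233"> going</s></p>
</body></timedtext>"""

for ms, word in word_times(sample):
    if word.strip() == "going":
        print(ms, seek_suffix(ms))  # prints: 3090 &t=3
```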

You can use any tool to stitch together the XML into something readable, but here's my Power BI Desktop script to find words like "privilege":

let
    Source = Xml.Tables(File.Contents("C:\Download\body.xml")),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Attribute:format", Int64.Type}}),
    body = #"Changed Type"{0}[body],
    p = body{0}[p],
    #"Changed Type1" = Table.TransformColumnTypes(p,{{"Attribute:t", Int64.Type}, {"Attribute:d", Int64.Type}, {"Attribute:w", Int64.Type}, {"Attribute:a", Int64.Type}, {"Attribute:p", Int64.Type}}),
    #"Expanded s" = Table.ExpandTableColumn(#"Changed Type1", "s", {"Attribute:ac", "Attribute:p", "Attribute:t", "Element:Text"}, {"s.Attribute:ac", "s.Attribute:p", "s.Attribute:t", "s.Element:Text"}),
    #"Changed Type2" = Table.TransformColumnTypes(#"Expanded s",{{"s.Attribute:t", Int64.Type}}),
    #"Removed Other Columns" = Table.SelectColumns(#"Changed Type2",{"s.Attribute:t", "s.Element:Text", "Attribute:t"}),
    #"Replaced Value" = Table.ReplaceValue(#"Removed Other Columns",null,0,Replacer.ReplaceValue,{"s.Attribute:t"}),
    #"Filtered Rows" = Table.SelectRows(#"Replaced Value", each [#"s.Element:Text"] <> null),
    #"Added Custom" = Table.AddColumn(#"Filtered Rows", "Time", each [#"Attribute:t"] + [#"s.Attribute:t"]),
    #"Filtered Rows1" = Table.SelectRows(#"Added Custom", each ([#"s.Element:Text"] = " privilege" or [#"s.Element:Text"] = " privileged" or [#"s.Element:Text"] = " privileges" or [#"s.Element:Text"] = "privilege" or [#"s.Element:Text"] = "privileges"))
in
    #"Filtered Rows1"
Caraviello answered 17/10, 2016 at 18:52 Comment(2)
This is now the best answer if you want FULL data (aside from the "Open Transcript" option in the kabob "..." menu). Just, instead of using Fiddler, you can just use DevTools built into chrome. In this case, pop that open, then go to the "Network" tab and in that little search box, just drop in timedtext. You can then right click and open that URL into a new tab and it'll provide an XML document of the transcript, complete with timing information.Exsert
Thanks @chunk_split I edited the answer to mention that. No need to set up HTTPS MITM for this :)Caraviello
M
9

There is a free Python tool called YouTube Transcript API.

You can use it in scripts or as a command-line tool:

pip install youtube_transcript_api
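A minimal sketch of script usage follows. Treat the fetching part as an assumption to check against the version you installed: as far as I know, recent releases use `YouTubeTranscriptApi().fetch(video_id)` (with `to_raw_data()` to get plain dictionaries), while older releases used the static `YouTubeTranscriptApi.get_transcript(video_id)`. The formatting helper is my own and only expects dictionaries with `text` and `start` keys. The sample entries are invented.

```python
def fetch_entries(video_id):
    """Fetch caption entries as dicts with 'text', 'start' and 'duration'.

    Assumption: which call exists depends on the installed library version.
    """
    from youtube_transcript_api import YouTubeTranscriptApi  # third-party

    if hasattr(YouTubeTranscriptApi, "fetch"):
        return YouTubeTranscriptApi().fetch(video_id).to_raw_data()
    return YouTubeTranscriptApi.get_transcript(video_id)


def format_entries(entries):
    """Format caption entries as 'HH:MM:SS text' lines (hypothetical helper)."""
    lines = []
    for entry in entries:
        total = int(entry["start"])
        stamp = "%02d:%02d:%02d" % (
            total // 3600, total % 3600 // 60, total % 60)
        lines.append(stamp + " " + entry["text"])
    return lines


# Invented entries in the shape the library is expected to return.
sample = [
    {"text": "ladies and gentlemen", "start": 1.43, "duration": 2.82},
    {"text": "you for coming tonight", "start": 4.26, "duration": 1.67},
]

for line in format_entries(sample):
    print(line)
```

With the package installed and network access, `format_entries(fetch_entries(video_id))` would do the same for a real video ID.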
Murine answered 2/8, 2019 at 4:55 Comment(1)
This was the answer that finally worked for me in 2021Alkaline
A
4

With the YouTube video page as updated in June 2020, it's very straightforward:

  1. Click the 3 dots next to the like/dislike buttons to open further menu options
  2. Select "add translations"
  3. Select a language
  4. Click autogenerate if needed
  5. Click Actions > Download

You will get an .sbv file.

Archambault answered 16/6, 2020 at 15:14 Comment(1)
Thanks for your contribution to the topic. I do not see this option. I wonder if YouTube has since removed the option (April 2024) or if the option simply doesn't appear for some videos. Even on some videos with Closed Captions, I still have no option to ADD TRANSLATION or to AUTOGENERATE transcript. Does the option only exist if I am the creator of the video?Yaw
D
3

Choose Open Transcript from the ... dropdown to the right of the vote up/down and share links.

This will open a Transcript scrolling div on the right side.

You can then use Copy. Note that you cannot use Select All but need to click the top line, then scroll to the bottom using the scroll thumb, and then shift-click on the last line.

Note that you can also search within this text using the normal web page search.

Derron answered 13/11, 2017 at 21:45 Comment(0)
S
1

I just did this manually without trouble: I opened the transcript at the beginning of the video and, with the Shift key pressed, left-clicked and dragged from the 00:00 time marker over the first few lines.

I then advanced the video to near the end. When the video stopped, I held down the Shift key once more and clicked the end of the last sentence. With CTRL-C I copied the text to the clipboard and pasted it into an editor.

Done!

Caveat: Make sure that no RDP session sharing the clipboard and no software such as TeamViewer is running at the same time, as this procedure will overflow their buffers when a large amount of text is copied.

Sultry answered 15/6, 2018 at 18:31 Comment(0)
