Convert WebVTT file from Youtube to plain text

I am downloading WebVTT files from YouTube using youtube-dl.

A typical file looks like this:

WEBVTT
Kind: captions
Language: en

00:00:00.730 --> 00:00:05.200 align:start position:0%

[Applause]

00:00:05.200 --> 00:00:05.210 align:start position:0%
[Applause]


00:00:05.210 --> 00:00:11.860 align:start position:0%
[Applause]
hi<00:00:06.440><c> I'm</c><00:00:07.440><c> here</c><00:00:07.740><c> to</c><00:00:08.160><c> talk</c><00:00:08.429><c> to</c><00:00:09.019><c> share</c><00:00:10.019><c> an</c><00:00:10.469><c> idea</c><00:00:10.820><c> to</c>

00:00:11.860 --> 00:00:11.870 align:start position:0%
hi I'm here to talk to share an idea to


00:00:11.870 --> 00:00:15.890 align:start position:0%
hi I'm here to talk to share an idea to
communicate<00:00:12.920><c> but</c><00:00:13.920><c> what</c><00:00:14.790><c> is</c><00:00:14.940><c> communication</c>

00:00:15.890 --> 00:00:15.900 align:start position:0%
communicate but what is communication

I would like to get a text file with this:

hi I'm here to talk to share an idea to
communicate but what is communication

Using code I found online, I got this:

cat output.vtt | sed "s/^[0-9]*[0-9\:\.\ \>\-]*//g" | grep -v "^WEBVTT\|^Kind: cap\|^Language" | awk 'BEGIN{ RS="\n\n+"; RS="\n\n" }NR>=2{ print }' > dialogues.txt

But it is far from perfect. I get a lot of useless spaces, and all the sentences are displayed twice. Would you mind helping me? Somebody asked a similar question before but the answer submitted did not work for me.

Thanks!

Disulfide answered 8/7, 2019 at 2:38 Comment(0)

You might be able to do something similar to this:

sed -e '1,4d' -E -e '/^$|]|>$|%$/d' output.vtt | awk '!seen[$0]++' > dialogues.txt
  • sed removes the first 4 lines
  • sed then deletes blank lines, lines that contain ], and lines that end in > or %.
  • awk removes duplicate lines.

Result:

hi I'm here to talk to share an idea to
communicate but what is communication 

You might have to tweak it a bit, but the result should be closer to what you want.
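
If a cue carries inline timing tags but has no clean duplicate line, deleting every line that ends in > loses that text. A minimal sketch of an alternative, which strips the tags instead of dropping the line, might look like the following. The function name strip_vtt_tags is hypothetical, and the sketch assumes the header fields and bracketed cues look like the ones in the question:

```shell
# Hypothetical helper (not from the original answer): reads WebVTT on stdin,
# writes plain text on stdout.
strip_vtt_tags() {
    sed -E \
        -e '/^WEBVTT/d' \
        -e '/^Kind:/d' \
        -e '/^Language:/d' \
        -e '/-->/d' \
        -e '/^\[.*\]$/d' \
        -e 's/<[^>]*>//g' \
        -e '/^[[:space:]]*$/d' |
    # Drop a line only when it repeats the line printed just before it.
    awk '$0 != prev { print } { prev = $0 }'
}

# Usage (file names are placeholders):
#   strip_vtt_tags < output.vtt > dialogues.txt
```

The sed stage deletes the header, timestamp, bracketed and blank lines and removes anything between < and >. The awk stage then collapses the repeats that YouTube's rolling captions produce.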

Contemplation answered 8/7, 2019 at 3:8 Comment(2)
This removes all new lines after the first four (which may be intended, but wasn't what I wanted), but more importantly, it doesn't seem to answer the question as it leaves all of the timestamps in. - Rasping
Careful with that regex, though: sometimes those damn lyric files embed HTML-like tags without a duplicate line for you to throw away (and I had to deal with Unicode on top of all that). - Gum

Could you please try the following, which does everything in a single awk command:

awk 'FNR<=4 || ($0 ~ /^$|-->|\[|\]|</){next} !a[$0]++'  Input_file

Explanation of the code above:

awk '                                     ##Start of the awk program.
FNR<=4 || ($0 ~ /^$|-->|\[|\]|</){        ##If the line number is 4 or less, OR the line is empty or contains -->, [, ] or <, then:
  next                                    ##skip the line; no further statements run for it.
  }
!a[$0]++                                  ##Print the line only the first time it is seen: a[$0] is 0 (false) on the first occurrence and is incremented afterwards.
'  Input_file                             ##Name of the input file.
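
Note that !a[$0]++ drops every repeated line in the whole file, not only immediate repeats. If lines that legitimately recur later (a chorus, for instance) should be kept, a variant that compares each line only with the previously kept one may be enough. This is a sketch under that assumption; the wrapper name vtt_dedupe_adjacent is hypothetical:

```shell
# Hypothetical wrapper (not from the original answer); takes the .vtt file
# name as its first argument and prints the cleaned text to stdout.
vtt_dedupe_adjacent() {
    awk '
    # Same filter as the one-liner above: skip the 4 header lines, empty
    # lines, and lines containing -->, [, ] or <.
    FNR <= 4 || /^$|-->|\[|\]|</ { next }
    # Print the line only if it differs from the previously kept line.
    $0 != prev { print }
    { prev = $0 }
    ' "$1"
}

# Usage (file names are placeholders):
#   vtt_dedupe_adjacent output.vtt > dialogues.txt
```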
Underscore answered 8/7, 2019 at 5:3 Comment(3)
Good all-in-one solution; some explanation would be nice. +1 - Piegari
Sure, thanks for the encouragement. I have added an explanation now. - Underscore
This almost works, but it removes all new lines (which may be intended, I'm not sure, but wasn't what I wanted) and most importantly, it removes all duplicate lines, which is a serious problem in, for example, subtitles for music, where only the first instance of a chorus would be kept. - Rasping

If you analyze the pattern of your .vtt file, basically you want to keep every 8th line starting at line 10. So the algorithm is to delete the first 2 lines, then keep every 8th line:

$ cat output.vtt | sed '1,2 d' | awk 'NR%8==0'

[Applause]
hi I'm here to talk to share an idea to
communicate but what is communication
  • sed '1,2 d' deletes range from line 1 to line 2
  • awk 'NR%8==0' prints every 8th line

If you want to further filter out the "[...]" lines, then you can add another grep command such as grep -v '^\[.*\]$'
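
Put together, the positional approach with the extra grep filter could be wrapped roughly as below. This is only a sketch: the function name vtt_every_8th is hypothetical, and it works only while the file keeps the exact 8-line rhythm shown in the question.

```shell
# Hypothetical wrapper (not from the original answer); takes the .vtt file
# name as its first argument.
vtt_every_8th() {
    # Drop the first 2 lines, keep every 8th remaining line, then discard
    # lines that consist solely of a bracketed cue such as [Applause].
    sed '1,2d' "$1" | awk 'NR % 8 == 0' | grep -v '^\[.*\]$'
}

# Usage (file names are placeholders):
#   vtt_every_8th output.vtt > dialogues.txt
```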

Cenis answered 6/11, 2021 at 5:58 Comment(0)

In my case I wanted to:

  • Remove the first 4 lines
  • Remove all timestamp lines
  • Keep the empty lines between subtitles

I managed to do this with the following single sed command:

sed -En '1,4d;/^[0-9].:[0-9].:[0-9].+$/!p' input.vtt > output.txt

If, like me, you need to do this often and you're using Bash, you could also convert this to a Bash function:

function vtt_to_txt() {
    sed -En '1,4d;/^[0-9].:[0-9].:[0-9].+$/!p' "$1" > "$2"
}

This would allow you to simply call the function like so at any time:

vtt_to_txt input.vtt output.txt
Rasping answered 13/5, 2022 at 19:43 Comment(0)
