How do I convert the WebVTT format to plain text?

4

15

Here is a sample of WebVTT:

WEBVTT
Kind: captions
Language: en
Style:
::cue(c.colorCCCCCC) { color: rgb(204,204,204);
 }
::cue(c.colorE5E5E5) { color: rgb(229,229,229);
 }
##

00:00:00.060 --> 00:00:03.080 align:start position:0%
 
<c.colorE5E5E5>okay<00:00:00.690><c> so</c><00:00:00.750><c> this</c><00:00:01.319><c> is</c><00:00:01.469><c> a</c></c><c.colorCCCCCC><00:00:01.500><c> newsflash</c><00:00:02.040><c> page</c><00:00:02.460><c> for</c></c>

00:00:03.080 --> 00:00:03.090 align:start position:0%
<c.colorE5E5E5>okay so this is a</c><c.colorCCCCCC> newsflash page for
 </c>

00:00:03.090 --> 00:00:08.360 align:start position:0%
<c.colorE5E5E5>okay so this is a</c><c.colorCCCCCC> newsflash page for</c>
<c.colorE5E5E5>Meraki<00:00:03.659><c> printing</c><00:00:05.120><c> so</c><00:00:06.529><c> all</c><00:00:07.529><c> we</c><00:00:08.040><c> need</c><00:00:08.130><c> to</c><00:00:08.189><c> do</c></c>

00:00:08.360 --> 00:00:08.370 align:start position:0%
<c.colorE5E5E5>Meraki printing so all we need to do
 </c>

00:00:08.370 --> 00:00:11.749 align:start position:0%
<c.colorE5E5E5>Meraki printing so all we need to do
here<00:00:08.700><c> is</c><00:00:08.820><c> to</c><00:00:09.000><c> swap</c><00:00:09.330><c> out</c><00:00:09.480><c> the</c><00:00:09.660><c> logo</c><00:00:09.929><c> here</c><00:00:10.650><c> and</c><00:00:10.830><c> I</c></c>

00:00:11.749 --> 00:00:11.759 align:start position:0%
here is to swap out the logo here<c.colorE5E5E5> and I
 </c>

00:00:11.759 --> 00:00:16.400 align:start position:0%
here is to swap out the logo here<c.colorE5E5E5> and I
should<00:00:11.969><c> also</c><00:00:12.120><c> work</c><00:00:12.420><c> on</c><00:00:12.630><c> move</c><00:00:12.840><c> out</c><00:00:13.049><c> as</c><00:00:13.230><c> well</c><00:00:15.410><c> and</c></c>

00:00:16.400 --> 00:00:16.410 align:start position:0%
<c.colorE5E5E5>should also work on move out as well and
 </c>

I used youtube-dl to grab it from YouTube.

I want to convert this to plain text. I can't just strip out the times and colour tags, because the text repeats itself from one cue to the next.

So I'm wondering whether a tool already exists to convert this to plain text, or whether someone could offer some pseudocode that I could implement myself.

I have also posted an issue about this with youtube-dl.

Sinistrality answered 10/8, 2018 at 10:15 Comment(0)
15

I've used WebVTT-py to extract the plain text transcription.

import webvtt
vtt = webvtt.read('subtitles.vtt')
transcript = ""

lines = []
for line in vtt:
    # Strip the newlines from the end of the text.
    # Split the string if it has a newline in the middle
    # Add the lines to an array
    lines.extend(line.text.strip().splitlines())

# Remove repeated lines
previous = None
for line in lines:
    if line == previous:
        continue
    transcript += " " + line
    previous = line

print(transcript)
Habsburg answered 7/9, 2018 at 13:10 Comment(0)
10

A one-liner in the bash shell works best for me; it is fast, short and simple:

cat myfile.vtt | grep : -v | awk '!seen[$0]++'

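The grep removes every line that contains a : (colon); the -v option inverts the match, so only lines without a colon are kept. This drops the timing lines and header fields, but also any caption line that happens to contain a colon.

The awk prints each line only the first time it is seen, which removes the duplicate lines.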
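
For readers who prefer Python, the two filters described above can be sketched roughly as follows. This is only an illustration of the same idea, not a replacement for the one-liner; the function name filter_lines and the sample list are made up for the example.

```python
def filter_lines(lines):
    """Mimic `grep -v :` followed by `awk '!seen[$0]++'` (hypothetical helper)."""
    seen = set()
    kept = []
    for line in lines:
        # Like `grep -v :`: skip any line that contains a colon.
        if ":" in line:
            continue
        # Like `awk '!seen[$0]++'`: keep only the first occurrence of a line.
        if line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return kept


sample = [
    "WEBVTT",
    "",
    "00:00:00.060 --> 00:00:03.080",
    "okay so this is",
    "",
    "00:00:03.080 --> 00:00:03.090",
    "okay so this is",
    "a newsflash page",
]
print(filter_lines(sample))
# ['WEBVTT', '', 'okay so this is', 'a newsflash page']
```

Note that, like the awk filter, this removes duplicates anywhere in the file, not only consecutive ones, and it keeps non-caption lines such as the WEBVTT header.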

Egidio answered 16/9, 2021 at 19:45 Comment(1)
This works for files where the cue payload text doesn't contain timestamp tags (or that have a duplicate of each cue without timestamp tags). For an explanation of the awk syntax see here. (Midweek)
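
When the cue text does contain inline timestamp or colour tags, as in the sample from the question, one possible approach is to strip the tags with a regular expression before removing repeats. The sketch below uses only the standard library; the names TAG_RE and vtt_text_lines are hypothetical, and it assumes a simple file without cue identifiers or NOTE/STYLE blocks after the first cue.

```python
import re
from typing import List

# Matches inline tags such as <c>, </c>, <c.colorE5E5E5> or <00:00:01.500>.
TAG_RE = re.compile(r"<[^>]*>")


def vtt_text_lines(raw: str) -> List[str]:
    """Return caption lines with tags stripped and consecutive repeats dropped."""
    result: List[str] = []
    in_header = True
    for line in raw.splitlines():
        if "-->" in line:
            # A timing line: the header is over, and the line itself is not text.
            in_header = False
            continue
        if in_header:
            continue
        text = TAG_RE.sub("", line).strip()
        # Skip empty lines and lines identical to the previously kept one.
        if text and (not result or result[-1] != text):
            result.append(text)
    return result


SAMPLE = """WEBVTT
Kind: captions

00:00:00.060 --> 00:00:03.080 align:start position:0%
<c.colorE5E5E5>okay<00:00:00.690><c> so</c></c>

00:00:03.080 --> 00:00:03.090 align:start position:0%
<c.colorE5E5E5>okay so</c>

00:00:03.090 --> 00:00:08.360 align:start position:0%
<c.colorE5E5E5>okay so</c>
<c.colorE5E5E5>Meraki<00:00:03.659><c> printing</c></c>
"""
print(vtt_text_lines(SAMPLE))
# ['okay so', 'Meraki printing']
```

Unlike the colon filter, this keeps caption lines that contain a colon, because only lines with a --> arrow are treated as timing lines.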
3

Same concept as in Terence Eden's answer, but split into separate functions. Generators improve readability for this task and save a lot of memory: there is often no need to hold data from files in lists or big strings for processing. Here, webvtt is the only part that keeps the whole source file in memory.

I found whitespace HTML entities in my files too, so I added a simple replace. I also made it keep the line breaks by default.

This is my version, using pathlib, typing and generators:

from pathlib import Path
from typing import Generator
import webvtt


def vtt_lines(src) -> Generator[str, None, None]:
    """
    Extracts all text lines from a vtt file which may contain duplicates

    :param src: File path or file like object
    :return: Generator for lines as strings
    """
    vtt = webvtt.read(src)

    for caption in vtt:  # type: webvtt.structures.Caption
        # A caption which may contain multiple lines
        for line in caption.text.strip().splitlines():  # type: str
            # Process each one of them
            yield line


def deduplicated_lines(lines) -> Generator[str, None, None]:
    """
    Filters all duplicated lines from list or generator

    :param lines: iterable or generator of strings
    :return: Generator for lines as strings without duplicates
    """
    last_line = ""
    for line in lines:
        if line == last_line:
            continue

        last_line = line
        yield line


def vtt_to_linear_text(src, savefile: Path, line_end="\n"):
    """
    Converts a vtt caption file to linear text.

    :param src: Path or path-like object to an existing vtt file
    :param savefile: Path object to save content in
    :param line_end: Default to line break. May be set to a space for a single line output.
    """
    with savefile.open("w") as writer:
        for line in deduplicated_lines(vtt_lines(src)):
            writer.write(line.replace("&nbsp;", " ").strip() + line_end)

# Demo call
vtt_to_linear_text("file.vtt", Path("file.txt"))
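
The replace above handles only the &nbsp; entity. If other HTML entities turn up in the captions, a more general clean-up could use html.unescape from the standard library. The following is a small sketch under that assumption; the helper name clean_entities is hypothetical and not part of the code above.

```python
import html


def clean_entities(line: str) -> str:
    """Decode HTML entities and normalise non-breaking spaces (hypothetical helper)."""
    # html.unescape turns "&nbsp;" into U+00A0 (a non-breaking space),
    # so replace that character with a regular space afterwards.
    return html.unescape(line).replace("\u00a0", " ").strip()


print(clean_entities("Tom &amp; Jerry&nbsp;"))
# Tom & Jerry
```

Such a helper could be applied to each line in place of the line.replace("&nbsp;", " ").strip() call.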
Phalarope answered 25/4, 2021 at 13:24 Comment(0)
0

Building on Terence's response, here is a short script that reads a vtt file and writes the transcript to an output file.

import argparse
import webvtt
from pathlib import Path

parser = argparse.ArgumentParser(description='Convert a WebVTT file to plain text')
# webvtt.read() is given a single file below, so both arguments are file paths
parser.add_argument('-i', '--input', help='Input .vtt file', required=True)
parser.add_argument('-o', '--output', help='Output text file', required=True)
args = parser.parse_args()

vtt = webvtt.read(args.input)
transcript = ""

lines = []
for line in vtt:
    # Strip the newlines from the end of the text.
    # Split the string if it has a newline in the middle
    # Add the lines to an array
    lines.extend(line.text.strip().splitlines())

# Remove repeated lines
previous = None
for line in lines:
    if line == previous:
        continue
    transcript += "\n" + line
    previous = line

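# print(transcript)
# The loop above prefixes every line with "\n", so drop the leading newline before writing.
Path(args.output).write_text(transcript.lstrip("\n"))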
Messenger answered 5/2 at 18:14 Comment(0)
