How to punctuate YouTube transcripts?

4

18

On YouTube, I can download the closed-caption (CC) transcript for a video, but the transcript contains no punctuation. How can I punctuate the transcript automatically?

Apul answered 16/12, 2020 at 6:39 Comment(2)
Can you specify whether you are trying to do it via the YouTube APIs or via code on the client? - Burlesque
Any method is welcome. Ideally a piece of software or a service where I upload the raw transcript, video, or audio and download the punctuated transcript. - Apul
10

This is a well-studied problem in Natural Language Processing (NLP), usually called punctuation restoration. Deep learning models can do it; they are not perfect, but the results are decent. You can try https://github.com/ottokart/punctuator2, which is based on this paper (you can try it out here).

Neuropsychiatry answered 24/12, 2020 at 15:56 Comment(1)
Fantastic tool. I just used it to punctuate a YouTube transcript and it works great. I tried the whole document at first, but it stopped auto-punctuating at around 35K characters. So I hand-divided it into reasonable chunks. What a timesaver. - Misbegotten
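
The manual chunking described in the comment above can be automated. Here is a minimal sketch, assuming the transcript is a plain string and that splitting on whitespace is acceptable. The function name `split_into_chunks` and the 30,000-character default are my own illustrative choices (picked to stay under the roughly 35K characters reported above), not part of punctuator2.

```python
def split_into_chunks(text: str, max_chars: int = 30_000) -> list[str]:
    """Split text into chunks of at most max_chars characters, breaking only on whitespace.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for word in text.split():
        # The extra 1 is the space that would join this word to the previous one.
        extra = len(word) + (1 if current else 0)
        if current and current_len + extra > max_chars:
            # Adding this word would overflow the chunk, so close it and start a new one.
            chunks.append(" ".join(current))
            current, current_len = [], 0
            extra = len(word)
        current.append(word)
        current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be punctuated separately and the results joined with spaces. Since the raw transcript has no sentence boundaries, a chunk may end mid-sentence, so the text around each seam is worth a quick manual check.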
6

In 2023 there are multiple ways to do it:

  1. Use ChatGPT. It works very well, but because of the limit on input length it is cumbersome for long videos (60+ minutes). Besides splitting the text into batches, you have to check the output quality of each batch, because it is not yet 100% consistent.
  2. Use Deep Multilingual Punctuation Prediction. It restores punctuation for English text with 77% accuracy, but it does not fix capitalization.
  3. Use yt-dlp and Whisper. Download the audio as MP3 from YouTube and run Whisper on it. This OpenAI model does very good speech-to-text and produces punctuated output, but it is quite slow for long video or audio (60 minutes of audio takes roughly 30 minutes to process). Example implementation
  4. Use yt-dlp and whisper.cpp. This is faster: 60 minutes of audio takes less than 10 minutes to process. My example implementation
  5. Use Shoki.app
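
As a rough sketch of options 2 and 3, assuming the `deepmultilingualpunctuation` and `openai-whisper` packages are installed (Whisper also needs ffmpeg) and that the audio has already been downloaded, for example with yt-dlp. The helper names are hypothetical, and `capitalize_sentences` is a simple heuristic of mine to work around the missing capitalization in option 2; it is not part of either library.

```python
import re


def capitalize_sentences(text: str) -> str:
    """Uppercase the first letter of the text and of each sentence after '.', '!' or '?'.

    Heuristic only: it does not fix the pronoun "i" or proper nouns, and it will
    also capitalize after abbreviations that end in a period.
    """
    def upper(match: re.Match) -> str:
        return match.group(1) + match.group(2).upper()

    # Group 1 is the start of the text or sentence-ending punctuation plus whitespace;
    # group 2 is the lowercase letter that follows it.
    return re.sub(r"(^\s*|[.!?]\s+)([a-z])", upper, text)


def restore_punctuation(raw_transcript: str) -> str:
    """Option 2: punctuate an existing transcript, then patch sentence capitalization."""
    # Imported lazily so the helper above works without the model installed.
    from deepmultilingualpunctuation import PunctuationModel

    model = PunctuationModel()  # downloads the model weights on first use
    return capitalize_sentences(model.restore_punctuation(raw_transcript))


def transcribe_audio(audio_path: str, model_size: str = "base") -> str:
    """Option 3: ignore the CC transcript and transcribe the audio with Whisper."""
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model(model_size)
    return model.transcribe(audio_path)["text"]
```

For example, `capitalize_sentences("hello world. how are you?")` returns `"Hello world. How are you?"`. Larger Whisper model sizes are generally more accurate but slower, which matters given the processing times mentioned above.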
Oblivious answered 23/4, 2023 at 3:41 Comment(1)
I tried using ChatGPT. It does work well, but the prompt has to be written carefully, otherwise the model may change the text itself. It is also not free. - Bradway
4

There's no way to get punctuation from YouTube; you'll have to generate it yourself. Google offers a service that generates punctuation for arbitrary text, and in my personal experience it is more accurate than some competitors, so I would run the transcript through that.

Strawflower answered 23/12, 2020 at 15:32 Comment(1)
This service requires you to extract the audio from the video and upload it. And it is a paid service. - Apul
0

You can use a DistilBERT token classifier to restore punctuation and capitalization. I use this approach for https://www.appblit.com/scribe and it works reasonably well. For other languages, you would need to fine-tune a multilingual DistilBERT.
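
To illustrate the idea, here is a sketch of the post-processing step only: turning per-word predictions from a token classifier back into readable text. The label scheme is an assumption for illustration (one punctuation mark and one capitalization flag per word), not necessarily what the service above uses, and the model inference itself is left out.

```python
def apply_labels(words: list[str], punct_labels: list[str], cap_labels: list[bool]) -> str:
    """Rebuild text from per-word predictions.

    punct_labels[i] is the mark to append after words[i] ("" for none);
    cap_labels[i] says whether words[i] should start with an uppercase letter.
    Both label lists are hypothetical classifier outputs, aligned with words.
    """
    if not (len(words) == len(punct_labels) == len(cap_labels)):
        raise ValueError("words and label lists must have the same length")
    out = []
    for word, punct, cap in zip(words, punct_labels, cap_labels):
        # Uppercase only the first character so the rest of the word is untouched.
        token = word[:1].upper() + word[1:] if cap else word
        out.append(token + punct)
    return " ".join(out)


words = ["hello", "world", "how", "are", "you"]
punct = ["", ".", "", "", "?"]
caps = [True, False, True, False, False]
print(apply_labels(words, punct, caps))  # Hello world. How are you?
```

In practice the classifier works on subword tokens, so its predictions would first have to be aligned back to whole words before a step like this.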

Sybaris answered 16/1 at 17:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.