How can I get word-level timestamps in OpenAI's Whisper ASR?

I use OpenAI's Whisper python lib for speech recognition. How can I get word-level timestamps?


To transcribe with OpenAI's Whisper (tested on Ubuntu 20.04 x64 LTS with an Nvidia GeForce RTX 3090):

conda create -y --name whisperpy39 python==3.9
conda activate whisperpy39
pip install git+https://github.com/openai/whisper.git 
sudo apt update && sudo apt install ffmpeg
whisper recording.wav
whisper recording.wav --model large

If using an Nvidia GeForce RTX 3090, add the following after conda activate whisperpy39:

pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/torch_stable.html
conda install pytorch==1.10.1 torchvision torchaudio cudatoolkit=11.0 -c pytorch
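
Note that more recent openai-whisper releases also expose word-level timestamps directly from the command line; assuming such a version is installed, something like the following should work:

whisper recording.wav --model large --word_timestamps True
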
Electrostatics answered 23/9, 2022 at 2:15 Comment(0)

In openai-whisper version 20231117, you can get word level timestamps by setting word_timestamps=True when calling transcribe():

pip install openai-whisper

import whisper
model = whisper.load_model("large")
transcript = model.transcribe(
    word_timestamps=True,
    audio="toto.mp3"
)
for segment in transcript['segments']:
    print(''.join(f"{word['word']}[{word['start']}/{word['end']}]" 
                    for word in segment['words']))

prints:

Toto,[2.98/3.4] I[3.4/3.82] have[3.82/3.96] a[3.96/4.02] feeling[4.02/4.22] we're[4.22/4.44] not[4.44/4.56] in[4.56/4.72] Kansas[4.72/5.14] anymore.[5.14/5.48]
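
If the timings are needed outside Python, a small follow-up sketch (reusing the transcript dictionary above; the output file name "toto_words.tsv" is arbitrary) writes one word per line with its start and end time:

# flatten the per-segment word lists into (word, start, end) rows
rows = [
    (word['word'].strip(), word['start'], word['end'])
    for segment in transcript['segments']
    for word in segment['words']
]

# write tab-separated lines: word <TAB> start <TAB> end (times in seconds)
with open("toto_words.tsv", "w", encoding="utf-8") as f:
    for text, start, end in rows:
        f.write(f"{text}\t{start:.2f}\t{end:.2f}\n")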

Despiteful answered 9/1, 2024 at 7:4 Comment(3)
This truly helps. But for enterprise proprietary data where we shouldn't call the API, which downloadable Whisper model do you think is currently the most accurate for word-level timestamping? Also, any reference on how to run it? I am new to running GenAI models on-prem or on GPU instances. Please advise. – Gentianaceous
I have been using whisper-large-v3 with fairly good results, but sometimes it's a bit off. For subtitles it's fine; for karaoke, maybe not. Some projects specialize in improving the accuracy of these timestamps, for example github.com/linto-ai/whisper-timestamped - Good luck :) – Despiteful
I have followed this blog to set up Whisper Large v3 in Amazon SageMaker; any clue now on how to proceed with timestamping? dev.to/mohalbakerkaw/… – Gentianaceous

I created a repo to recover word-level timestamps (and confidence), and also more accurate segment timestamps: https://github.com/Jeronymous/whisper-timestamped

It is built on the cross-attention weights of Whisper, as in this notebook in the Whisper repo. I tuned the approach a bit to get more accurate locations, and added the possibility of computing the cross-attention on the fly, so there is no need to run the Whisper model twice. There is no memory issue when processing long audio.

Note: I first tried the approach of using a wav2vec model to realign Whisper's transcribed words to the input audio. It works reasonably well, but it has many drawbacks: it requires handling a separate (wav2vec) model, performing another inference on the full signal, keeping one wav2vec model per language, and normalizing the transcribed text so that its character set matches that of the wav2vec model (e.g. converting numbers to words, symbols like "%", currencies...). Also, the alignment can have trouble with disfluencies that Whisper usually removes (so part of what the wav2vec model would recognize is missing, like the starts of sentences that get reformulated).
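
For reference, a minimal usage sketch based on the whisper-timestamped README at the time of writing (package and function names may change between releases; "audio.wav" and the model size are placeholders):

import json
import whisper_timestamped as whisper

# load the audio and a Whisper model, then transcribe with word-level timestamps
audio = whisper.load_audio("audio.wav")
model = whisper.load_model("small", device="cpu")
result = whisper.transcribe(model, audio, language="en")

# each segment carries a "words" list with text, start, end and confidence
print(json.dumps(result, indent=2, ensure_ascii=False))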

Passional answered 23/9, 2022 at 2:15 Comment(3)
Btw, this lib is a thing of beauty. In my case, I already have a transcript of the audio and only need timestamps. Is there a way to feed the transcript in to improve accuracy, for example by using the initial_prompt option? – Sandry
@Sandry I haven't looked at the Jeronymous repository, but if you look at cell 19 in the latest Whisper notebook (Multilingual_ASR.ipynb), the line starting "tokens = torch.tensor(..." uses the transcribed tokens to generate the model input. You can replace tokens with your own transcript to achieve what you want. – Franek
wav2vec models are also quite good for aligning a transcript with an audio signal, as they assign a probability to characters for each frame of audio. You can have a look at pytorch.org/audio/stable/tutorials/… You just have to make sure to find a wav2vec model for the language you want to process, and that this model's character set is consistent with your transcript (you might need to normalize your transcript with things like lowercasing, "num2words" conversion, punctuation removal, ...). – Passional

https://openai.com/blog/whisper/ only mentions "phrase-level timestamps"; I infer from this that word-level timestamps are not obtainable without adding more code.

From one of the Whisper authors:

Getting word-level timestamps are not directly supported, but it could be possible using the predicted distribution over the timestamp tokens or the cross-attention weights.

https://github.com/jianfch/stable-ts (MIT License):

This script modifies methods of Whisper's model to gain access to the predicted timestamp tokens of each word without needing additional inference. It also stabilizes the timestamps down to the word level to ensure chronology.
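
For illustration, a minimal sketch based on the stable-ts README at the time of writing (installed with pip install stable-ts; the API has changed across releases, so treat the names below as indicative):

import stable_whisper

# load a Whisper model through stable-ts and transcribe with stabilized timestamps
model = stable_whisper.load_model("base")
result = model.transcribe("audio.mp3")

# export word-level timestamps to subtitles
result.to_srt_vtt("audio.srt")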


Another option: use some word-level forced alignment program. E.g., Lhotse (Apache-2.0 license) has integrated both Whisper ASR and Wav2vec forced alignment.

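As a rough illustration of the forced-alignment idea (using torchaudio rather than Lhotse), here is a sketch assuming torchaudio >= 2.1, an English wav2vec2 CTC model, and a Whisper transcript already normalized to upper-case characters with "|" as the word separator; "recording.wav" and the transcript are placeholders:

import torch
import torchaudio
import torchaudio.functional as F

# load the audio (downmix to mono) and an English wav2vec2 CTC model
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
waveform, sample_rate = torchaudio.load("recording.wav")
waveform = waveform.mean(dim=0, keepdim=True)
waveform = F.resample(waveform, sample_rate, bundle.sample_rate)

# frame-level log-probabilities over the model's character labels
with torch.inference_mode():
    emission, _ = model(waveform)
log_probs = torch.log_softmax(emission, dim=-1)

# transcript produced by Whisper, normalized to the model's character set
transcript = "HELLO|WORLD"  # placeholder; "|" marks word boundaries for this model
labels = bundle.get_labels()
dictionary = {c: i for i, c in enumerate(labels)}
targets = torch.tensor([[dictionary[c] for c in transcript]], dtype=torch.int32)

# CTC forced alignment: one label index and score per audio frame
alignments, scores = F.forced_align(log_probs, targets, blank=0)
spans = F.merge_tokens(alignments[0], scores[0].exp(), blank=0)

# convert frame indices to seconds; group spans between "|" tokens to get word timings
seconds_per_frame = (waveform.shape[1] / bundle.sample_rate) / log_probs.shape[1]
for span in spans:
    print(labels[span.token], span.start * seconds_per_frame, span.end * seconds_per_frame)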

Electrostatics answered 23/9, 2022 at 16:29 Comment(1)
How do you get "phrase level timestamps"? – Thurgau

One can use the Python package https://github.com/m-bain/whisperX:
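
A minimal usage sketch based on the whisperX README at the time of writing (function names and arguments may change between releases; the model name, device and "audio.mp3" are placeholders):

import whisperx

device = "cuda"  # or "cpu"
audio = whisperx.load_audio("audio.mp3")

# 1. transcribe with a Whisper model
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio)

# 2. align the transcript with a phoneme model to get word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# each aligned segment now carries a "words" list with start/end times
print(result["segments"])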

Electrostatics answered 29/6, 2023 at 22:14 Comment(0)
