Google Speech Recognition API: timestamp for each word?

3

26

It is possible to use Google's Speech Recognition API to transcribe an audio file (WAV, MP3, etc.) by sending a request to http://www.google.com/speech-api/v2/recognize?...

Example: I said "one two three four five" in a WAV file. The Google API returns this:

{
  u'alternative':
  [
    {u'transcript': u'12345'},
    {u'transcript': u'1 2 3 4 5'},
    {u'transcript': u'one two three four five'}
  ],
  u'final': True
}

Question: is it possible to get the time (in seconds) at which each word was spoken?

With my example:

['one', 0.23, 0.80], ['two', 1.03, 1.45], ['three', 1.79, 2.35], etc.

i.e. the word "one" was spoken between 00:00:00.23 and 00:00:00.80,
and the word "two" between 00:00:01.03 and 00:00:01.45.

PS: I am looking for an API that supports languages other than English, especially French.

Breadbasket answered 4/12, 2015 at 10:39 Comment(2)
Hm? AFAICS the Google Speech API does support French, doesn't it? - Secondly
@Secondly Yes, but it doesn't support timestamps for each word. - Breadbasket
15

I believe the other answer is now out of date. This is now possible with the Google Cloud Speech API: https://cloud.google.com/speech/docs/async-time-offsets

Glee answered 29/12, 2017 at 1:40 Comment(0)
13

EDIT 2020: This is now possible; see the other answers.

When this answer was written (2015), it was not possible with the Google API.

If you want word timestamps, you can use other APIs, for example:

Vosk-API - free offline speech recognition API (disclosure: I am the primary author of Vosk).

SpeechMatics SaaS speech recognition API

Speech Recognition API from IBM
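
For Vosk specifically, word timings come back inside the recognizer's JSON result once word output is switched on (rec.SetWords(True) in the Python bindings). Below is a minimal sketch of turning such a result into the [word, start, end] format asked for in the question. It assumes the JSON has a "result" list whose items carry "word", "start" and "end" keys; the function name vosk_words and the sample string are made up for illustration and are not real recognizer output.

```python
import json


def vosk_words(result_json):
    """Extract [word, start, end] triples from a Vosk-style JSON result string."""
    parsed = json.loads(result_json)
    # The "result" key is absent when no words were recognized,
    # so fall back to an empty list.
    return [[w["word"], w["start"], w["end"]] for w in parsed.get("result", [])]


# Hand-written sample in the assumed shape (not real recognizer output).
sample = json.dumps({
    "result": [
        {"conf": 1.0, "start": 0.23, "end": 0.80, "word": "one"},
        {"conf": 1.0, "start": 1.03, "end": 1.45, "word": "two"},
    ],
    "text": "one two",
})

print(vosk_words(sample))
```

In a real script the string would come from the recognizer's result call rather than from a hand-built dictionary.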

Political answered 4/12, 2015 at 12:27 Comment(5)
Thanks! Have you tried these 3 APIs? Are they as good as Google's? I am amazed each day by how powerful Google's speech recognition is. (I speak my text messages aloud to my Android phone, and the phone makes nearly no mistakes at all!) - Breadbasket
They should be comparable in terms of accuracy. - Political
It seems that none of them supports French, sadly. - Breadbasket
We tried the IBM BlueMix Speech API for exactly this purpose and found the accuracy to be abysmal. Even simple, clearly spoken isolated words like "spoon" would come back as "moon", "room", "doom", "bloom", "whom". And this was after I pre-specified the keyword set to ("spoon") with a low acceptance probability. As the OP mentioned, IBM does provide start and stop times for each word (which Google apparently does not); however, the accuracy was too low to be usable. - Vachill
@Hephaestus, which vendor did you find provides the highest accuracy? Google? - Sinistrality
9

Yes, it is very much possible. All you need to do is:

In the config set enable_word_time_offsets=True

config = types.RecognitionConfig(
        ....
        enable_word_time_offsets=True)

Then, for each word in the alternative, you can print its start time and end time as in this code:

# `response` is an assumed name for the object returned by the recognition call
for result in response.results:
    # The first alternative is the most likely transcription
    alternative = result.alternatives[0]
    print('Transcript: {}'.format(alternative.transcript))
    print('Confidence: {}'.format(alternative.confidence))

    for word_info in alternative.words:
        word = word_info.word
        start_time = word_info.start_time
        end_time = word_info.end_time
        # Offsets are split into whole seconds and nanoseconds
        print('Word: {}, start_time: {}, end_time: {}'.format(
            word,
            start_time.seconds + start_time.nanos * 1e-9,
            end_time.seconds + end_time.nanos * 1e-9))

This would give you output in the following format:

Transcript:  Do you want me to give you a call back?
Confidence: 0.949534416199
Word: Do, start_time: 1466.0, end_time: 1466.6
Word: you, start_time: 1466.6, end_time: 1466.7
Word: want, start_time: 1466.7, end_time: 1466.8
Word: me, start_time: 1466.8, end_time: 1466.9
Word: to, start_time: 1466.9, end_time: 1467.1
Word: give, start_time: 1467.1, end_time: 1467.2
Word: you, start_time: 1467.2, end_time: 1467.3
Word: a, start_time: 1467.3, end_time: 1467.4
Word: call, start_time: 1467.4, end_time: 1467.6
Word: back?, start_time: 1467.6, end_time: 1467.7

Source: https://cloud.google.com/speech-to-text/docs/async-time-offsets
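
To get from those word objects to the [word, start, end] list asked for in the question, a small helper along these lines may be enough. This is a sketch, not part of the Google sample: the names offset_to_seconds and words_to_triples are made up, and the word objects are stand-ins built with SimpleNamespace rather than real API responses. It also assumes that, depending on the client library version, an offset may arrive either as an object with seconds and nanos attributes (as in the loop above) or as a datetime.timedelta, so it handles both.

```python
from datetime import timedelta
from types import SimpleNamespace


def offset_to_seconds(offset):
    """Convert a word offset to float seconds.

    Accepts a datetime.timedelta or any object exposing `seconds` and `nanos`.
    """
    if isinstance(offset, timedelta):
        return offset.total_seconds()
    return offset.seconds + offset.nanos * 1e-9


def words_to_triples(words):
    """Return [[word, start, end], ...] with times rounded to 2 decimals."""
    return [
        [w.word,
         round(offset_to_seconds(w.start_time), 2),
         round(offset_to_seconds(w.end_time), 2)]
        for w in words
    ]


# Stand-in word objects (hypothetical, not real API output).
fake_words = [
    SimpleNamespace(word="one",
                    start_time=SimpleNamespace(seconds=0, nanos=230_000_000),
                    end_time=SimpleNamespace(seconds=0, nanos=800_000_000)),
    SimpleNamespace(word="two",
                    start_time=timedelta(seconds=1.03),
                    end_time=timedelta(seconds=1.45)),
]

print(words_to_triples(fake_words))
```

With a real response, alternative.words from the loop above would be passed in place of fake_words.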

Dipstick answered 12/6, 2018 at 10:34 Comment(0)
