Google Speech Recognition API: timestamp for each word?

Asked 4/12, 2015 at 10:39 Answered 12/6, 2018 at 10:34

audio speech-recognition speech-to-text speech google-speech-api

It's possible to use Google's Speech Recognition API to get a transcription of an audio file (WAV, MP3, etc.) by sending a request to http://www.google.com/speech-api/v2/recognize?...

Example: I said "one two three four five" in a WAV file. The Google API returns this:

{
  u'alternative':
  [
    {u'transcript': u'12345'},
    {u'transcript': u'1 2 3 4 5'},
    {u'transcript': u'one two three four five'}
  ],
  u'final': True
}

Question: is it possible to get the time (in seconds) at which each word has been said?

With my example:

['one', 0.23, 0.80], ['two', 1.03, 1.45], ['three', 1.79, 2.35], etc.

i.e. the word "one" has been said between time 00:00:00.23 and 00:00:00.80,
the word "two" has been said between time 00:00:01.03 and 00:00:01.45 (in seconds).

PS: I'm looking for an API that supports languages other than English, especially French.

Breadbasket answered 4/12, 2015 at 10:39 Comment(2)

Hm? Afaics google speech api does support french, doesn't it? – Secondly 6/2, 2016 at 14:55

@Secondly yes but it doesn't support timestamp for each word – Breadbasket 7/2, 2016 at 12:37

I believe the other answer is now out of date. This is now possible with the Google Cloud Speech API: https://cloud.google.com/speech/docs/async-time-offsets

Glee answered 29/12, 2017 at 1:40 Comment(0)

EDIT 2020: Now possible, see the other answers

At the time of writing (2015), this was not possible with the Google API.

If you want word timestamps, you can use other APIs, for example:

Vosk-API - free offline speech recognition API (disclosure: I am the primary author of Vosk).

SpeechMatics SaaS speech recognition API

Speech Recognition API from IBM

Political answered 4/12, 2015 at 12:27 Comment(5)

Thanks! Have you tried these 3 APIs? Are they as good as Google's? I am amazed every day by how powerful Google's speech recognition is. (I dictate my text messages aloud to my Android phone, and it makes almost no mistakes!) – Breadbasket 4/12, 2015 at 12:44

They should be comparable in terms of accuracy. – Political 4/12, 2015 at 13:51

It seems that none of them supports French language, sadly. – Breadbasket 30/1, 2016 at 16:26

We tried IBM BlueMix Speech API for exactly this purpose and found the accuracy to be abysmal. Even simple clearly-spoken isolated words like "spoon" would come back as "moon", "room", "doom", "bloom", "whom". And this was after I pre-specified the keyword set to ("spoon") with a low acceptance probability. As the OP mentioned IBM does provide start and stop times for each word (which Google apparently does not), however the accuracy was too low to be usable. – Vachill 11/2, 2017 at 6:3

@Hephaestus, which vendor did you find provides the highest accuracy? Google? – Sinistrality 14/9, 2022 at 20:52

Yes, it is very much possible. All you need to do is:

In the config set enable_word_time_offsets=True

config = types.RecognitionConfig(
        ....
        enable_word_time_offsets=True)

Then, for each word in the alternative, you can print its start time and end time as in this code:

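# `response` is assumed to hold the finished recognition response
# (the original snippet reused the name `result` for both the response
# and the loop variable, which shadowed the response).
for result in response.results:
    # The first alternative is the most likely transcription.
    alternative = result.alternatives[0]
    print('Transcript: {}'.format(alternative.transcript))
    print('Confidence: {}'.format(alternative.confidence))

    # Each entry carries the word plus its start and end offsets.
    for word_info in alternative.words:
        word = word_info.word
        start_time = word_info.start_time
        end_time = word_info.end_time
        # Offsets are split into whole seconds and nanoseconds.
        print('Word: {}, start_time: {}, end_time: {}'.format(
            word,
            start_time.seconds + start_time.nanos * 1e-9,
            end_time.seconds + end_time.nanos * 1e-9))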

This would give you output in the following format:

Transcript:  Do you want me to give you a call back?
Confidence: 0.949534416199
Word: Do, start_time: 1466.0, end_time: 1466.6
Word: you, start_time: 1466.6, end_time: 1466.7
Word: want, start_time: 1466.7, end_time: 1466.8
Word: me, start_time: 1466.8, end_time: 1466.9
Word: to, start_time: 1466.9, end_time: 1467.1
Word: give, start_time: 1467.1, end_time: 1467.2
Word: you, start_time: 1467.2, end_time: 1467.3
Word: a, start_time: 1467.3, end_time: 1467.4
Word: call, start_time: 1467.4, end_time: 1467.6
Word: back?, start_time: 1467.6, end_time: 1467.7

Source: https://cloud.google.com/speech-to-text/docs/async-time-offsets
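To get the list format asked for in the question (['one', 0.23, 0.80], ...), the per-word offsets can be collected instead of printed. This is only a rough sketch, not part of the Google sample: to_seconds and word_timings are hypothetical helper names, and the SimpleNamespace objects stand in for real word entries so the snippet runs without the client library. It also assumes that some client library versions may return the offsets as datetime.timedelta objects rather than seconds/nanos pairs, so both shapes are handled.

```python
from types import SimpleNamespace


def to_seconds(offset):
    """Convert a word time offset to float seconds (hypothetical helper)."""
    # Assumption: some client versions return datetime.timedelta offsets.
    if hasattr(offset, "total_seconds"):
        return offset.total_seconds()
    # Otherwise expect the seconds/nanos pair used in the answer above.
    return offset.seconds + offset.nanos * 1e-9


def word_timings(words):
    """Return [word, start, end] lists, with times rounded to 10 ms."""
    return [
        [w.word,
         round(to_seconds(w.start_time), 2),
         round(to_seconds(w.end_time), 2)]
        for w in words
    ]


# Stand-in objects mimicking alternative.words; with the real API you
# would pass alternative.words instead.
sample_words = [
    SimpleNamespace(
        word="one",
        start_time=SimpleNamespace(seconds=0, nanos=230000000),
        end_time=SimpleNamespace(seconds=0, nanos=800000000)),
    SimpleNamespace(
        word="two",
        start_time=SimpleNamespace(seconds=1, nanos=30000000),
        end_time=SimpleNamespace(seconds=1, nanos=450000000)),
]

print(word_timings(sample_words))
```

With a real response, calling word_timings(alternative.words) inside the loop over the results should yield one such list per recognized segment.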

Dipstick answered 12/6, 2018 at 10:34 Comment(0)
