Difference among Microsoft Speech products/platforms

It seems Microsoft offers quite a few speech recognition products; I'd like to know the differences among all of them, please.

  • There is the Microsoft Speech API, or SAPI. But somehow the Microsoft Cognitive Services Speech API has practically the same name.

  • OK, Microsoft Cognitive Services on Azure offers the Speech Service API and the Bing Speech API. I assume that for speech-to-text the two APIs are the same.

  • And then there are System.Speech.Recognition (or Desktop SAPI), Microsoft.Speech.Recognition (or Server SAPI) and Windows.Media.SpeechRecognition. Here and here have some explanations of the differences among the three. But my guess is that these are older speech recognition models based on HMMs, i.e. not neural network models, and that all three can be used offline without an internet connection, right? (A rough sketch of how I imagine using the first one is included after this list.)

  • The Azure Speech Service and Bing Speech APIs use more advanced speech models, right? But I assume there is no way to use them offline on my local machine, as they all require subscription verification (even though the Bing API seems to have a C# desktop library).
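
For context, here is roughly how I imagine using the first of those three offline APIs (System.Speech.Recognition) on a recording. This is only a minimal, untested sketch; "meeting.wav" is a placeholder path, and I don't expect it to handle multiple speakers:

    // Untested sketch: offline dictation with the desktop SAPI wrapper.
    // Requires a reference to the System.Speech assembly (.NET Framework, Windows only).
    // "meeting.wav" is just a placeholder; there is no speaker diarization here.
    using System;
    using System.Globalization;
    using System.Speech.Recognition;

    class OfflineTranscriptionSketch
    {
        static void Main()
        {
            // Uses the locally installed recognizer; no internet connection needed.
            using (var recognizer = new SpeechRecognitionEngine(new CultureInfo("en-US")))
            {
                recognizer.LoadGrammar(new DictationGrammar()); // free-form dictation
                recognizer.SetInputToWaveFile("meeting.wav");   // placeholder input file

                RecognitionResult result;
                while ((result = recognizer.Recognize()) != null)
                {
                    // Phrase-level offset within the file gives a rough time-coded output.
                    var offset = result.Audio != null ? result.Audio.AudioPosition : TimeSpan.Zero;
                    Console.WriteLine($"{offset}\t{result.Text}");
                }
            }
        }
    }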

Essentially I want an offline model that does speech-to-text transcription of my conversation data (5-10 minutes per audio recording), recognises multiple speakers, and outputs timestamps (time-coded output). I am a bit confused by all the options, and I would greatly appreciate it if someone could explain them to me. Many thanks!

Cotquean answered 12/6, 2018 at 17:15 Comment(3)
Can you please share your findings? It seems odd that this simple feature of offline transcription, which is available for handheld devices such as Android and iOS, is not available for Windows PCs. There is SpeechRecognition, but the accuracy is lacking without a grammar. learn.microsoft.com/en-us/previous-versions/office/developer/… - Laureenlaurel
Hi, it's been a while. If you want state-of-the-art ASR models then I believe you will have to use the API services of these major providers, which of course means your data will not be processed locally. I am not aware of any companies offering federated learning for ASR, but my findings may well be outdated by now. If your concern is privacy, then some companies like IBM offer dedicated cloud. Or deploy state-of-the-art open-source ASR models; there are a few pretrained models out there. - Cotquean
Thanks for the update. I am looking at DeepSpeech and vosk, which are open source, offline, and can work on the client side. NVIDIA NeMo is powerful for running on the server side with an API on the client side. - Laureenlaurel

A difficult question - and part of the reason why it is so difficult: we (Microsoft) seem to present an incoherent story about 'speech' and 'speech APIs'. Although I work for Microsoft, the following is my view on this. I try to give some insight into what is being planned in my team (Cognitive Services Speech - Client SDK), but I can't predict all facets of the not-so-near future.

Early on, Microsoft recognized that speech is an important medium, so Microsoft has an extensive and long-running history of enabling speech in its products. There are really good speech solutions (with local recognition) available; you listed some of them.

We are working on unifying this and presenting one place for you to find the state-of-the-art speech solution at Microsoft. This is the 'Microsoft Speech Service' (https://learn.microsoft.com/de-de/azure/cognitive-services/speech-service/) - currently in preview.

On the service side it will combine our major speech technologies, like speech-to-text, text-to-speech, intent and translation (and future services), under one umbrella. Speech and language models are constantly improved and updated. We are developing a client SDK for this service. Over time (later this year) this SDK will be available on all major operating systems (Windows, Linux, Android, iOS) and will have support for the major programming languages. We will continue to enhance/improve platform and language support for the SDK.
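
To give a concrete flavour of the direction, speech-to-text from C# with the client SDK looks roughly like the sketch below. Please treat it as a sketch only - exact namespaces and class names may still change while we are in preview, and the subscription key and region values are placeholders:

    // Minimal sketch of speech-to-text with the Cognitive Services Speech SDK.
    // Names may differ in the preview; the key and region are placeholders.
    using System;
    using System.Threading.Tasks;
    using Microsoft.CognitiveServices.Speech;

    class SpeechServiceSketch
    {
        static async Task Main()
        {
            // A subscription key and a connection to the online service are required.
            var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "westus");

            using (var recognizer = new SpeechRecognizer(config))
            {
                // Recognize a single utterance from the default microphone.
                var result = await recognizer.RecognizeOnceAsync();

                if (result.Reason == ResultReason.RecognizedSpeech)
                    Console.WriteLine($"Recognized: {result.Text}");
                else
                    Console.WriteLine($"No speech recognized: {result.Reason}");
            }
        }
    }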

This combination of online service and client SDK will leave the preview state later this year.

We understand the desire to have local recognition capabilities. It will not be available 'out-of-the-box' in our first SDK release (it is also not part of the current preview). One goal for the SDK is parity (functionality and API) between platforms and languages. This needs a lot of work. Offline is not part of this right now, and I can't make any prediction here, neither on features nor on a timeline ...

So from my point of view, the new Speech Service and its SDK are the way forward. The goal is a unified API on all platforms and easy access to all Microsoft speech services. It requires a subscription key, and it requires that you are 'connected'. We are working hard to get both (server and client) out of preview status later this year.

Hope this helps ...

Wolfgang

Saintjust answered 20/6, 2018 at 10:59 Comment(4)
Thanks very much Wolfgang! I really appreciate your answer! Is Microsoft planning to add speaker diarization, i.e. "who speaks what at what time", to your current Speech Service API in the near future? - Cotquean
Please understand that I can't make statements about non-released services, products, etc. I can't predict when/if things will be available through a cognitive service, but there are definitely teams working on these scenarios. Take a look at what we showed at the //build conference in May: youtube.com/watch?v=ddb3ZgAp9TA - Saintjust
Thanks @wolfma! Appreciated! - Cotquean
Thanks for your answer. The new work looks promising, but I still hope that SAPI is not retired/deprecated any time soon. It was lightweight, fast, and really helpful for dictionary-based recognition. - Corkwood
