How can I find the audio format of the selected voice of the SpeechSynthesizer
In a C# text-to-speech application I use the SpeechSynthesizer class; it has an event named SpeakProgress which fires for every spoken word. But for some voices the e.AudioPosition parameter is not synchronized with the output audio stream, and the output wave file plays faster than this position indicates (see this related question).

Anyway, I am trying to find exact information about the bit rate and other properties of the selected voice. In my experience, if I can initialize the wave file with this information, the synchronization problem is resolved. However, if I can't find such information in SupportedAudioFormats, I know no other way to find it. For example, the "Microsoft David Desktop" voice provides no supported format in its VoiceInfo, but it seems to support a PCM 16000 Hz, 16-bit format.

How can I find the audio format of the selected voice of the SpeechSynthesizer?

var formats = CurVoice.VoiceInfo.SupportedAudioFormats;

if (formats.Count > 0)
{
    var format = formats[0];
    reader.SetOutputToWaveFile(CurAudioFile, format);
}
else
{
    var format = // How can I find it, if the voice hasn't provided it?
    reader.SetOutputToWaveFile(CurAudioFile, format);
}
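Until a definitive answer turns up, one workable fallback (a sketch, not a guaranteed fix) is to construct a SpeechAudioFormatInfo by hand when the voice reports nothing, assuming the PCM 16000 Hz, 16-bit, mono format that "Microsoft David Desktop" appears to use; the variable names and output path here are illustrative:

```csharp
using System.Speech.AudioFormat;
using System.Speech.Synthesis;

// Fallback sketch: assume PCM 16 kHz, 16-bit, mono when the voice
// declares no supported formats (the behavior observed for
// "Microsoft David Desktop"). This is a guess, not queried from the engine.
var reader = new SpeechSynthesizer();
var formats = reader.Voice.SupportedAudioFormats;

SpeechAudioFormatInfo format = formats.Count > 0
    ? formats[0]
    : new SpeechAudioFormatInfo(16000, AudioBitsPerSample.Sixteen, AudioChannel.Mono);

reader.SetOutputToWaveFile("output.wav", format);
```

If the assumed format is wrong for a given voice, the wave file will still exhibit the same desynchronization, so this only helps where the default actually matches.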
Commonweal answered 8/12, 2015 at 7:24 Comment(0)
S
2

Update: This answer has been edited following investigation. I initially suggested from memory that SupportedAudioFormats likely comes from (possibly misconfigured) registry data; investigation has shown that, for me on Windows 7, this is definitely the case, and it is backed up anecdotally on Windows 8.

Issues with SupportedAudioFormats

System.Speech wraps the venerable COM Speech API (SAPI), and voices can be registered as 32-bit or 64-bit, or can be misconfigured (on a 64-bit machine, compare HKLM/Software/Microsoft/Speech/Voices with HKLM/Software/Wow6432Node/Microsoft/Speech/Voices).

I've pointed ILSpy at System.Speech and its VoiceInfo class, and I'm pretty convinced that SupportedAudioFormats comes solely from registry data. Hence it's possible to get zero results back when enumerating SupportedAudioFormats if either your TTS engine isn't properly registered for your application's platform target (x86, Any CPU, or x64), or the vendor simply doesn't provide this information in the registry.

Voices may still support different, additional, or fewer formats, as that's up to the speech engine (code) rather than the registry (data), so it can be a shot in the dark. Standard Windows voices are often more consistent in this regard than third-party voices, but they still don't necessarily provide SupportedAudioFormats usefully.
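As a quick way to see what the registry actually claims, here's a sketch that enumerates the voice tokens and prints each one's AudioFormats value (the path and value name are taken from the registry layout discussed here; on a 64-bit machine you'd repeat this for the SOFTWARE\Wow6432Node\... view):

```csharp
using System;
using Microsoft.Win32;

// Sketch: dump the raw registry data that SupportedAudioFormats is
// derived from. A missing AudioFormats value lines up with an empty
// SupportedAudioFormats collection when enumerated from code.
string tokens = @"SOFTWARE\Microsoft\Speech\Voices\Tokens";
using (var key = Registry.LocalMachine.OpenSubKey(tokens))
{
    foreach (string voice in key.GetSubKeyNames())
    {
        using (var attrs = key.OpenSubKey(voice + @"\Attributes"))
        {
            // Typically a REG_SZ such as "18"; null when the vendor omitted it.
            object formats = attrs?.GetValue("AudioFormats");
            Console.WriteLine($"{voice}: {formats ?? "(no AudioFormats value)"}");
        }
    }
}
```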

Finding this Information the Hard Way

I've found it's still possible to get the current format of the current voice, but this relies on reflection to access the internals of the System.Speech SAPI wrappers.

Consequently this is quite fragile code! I wouldn't recommend using it in production.

Note: the code below requires you to have called Speak() at least once for setup; more work would be needed to force setup without calling Speak(). However, calling Speak("") to say nothing works just fine.

Implementation:

using System.Reflection;
using System.Runtime.InteropServices;
using System.Speech.Synthesis;

[StructLayout(LayoutKind.Sequential)]
struct WAVEFORMATEX
{
    public ushort wFormatTag;
    public ushort nChannels;
    public uint nSamplesPerSec;
    public uint nAvgBytesPerSec;
    public ushort nBlockAlign;
    public ushort wBitsPerSample;
    public ushort cbSize;
}

WAVEFORMATEX GetCurrentWaveFormat(SpeechSynthesizer synthesizer)
{
    // Reach into the internal VoiceSynthesizer object behind SpeechSynthesizer.
    var voiceSynthesis = synthesizer.GetType()
                                    .GetProperty("VoiceSynthesizer", BindingFlags.Instance | BindingFlags.NonPublic)
                                    .GetValue(synthesizer, null);

    // Get the currently active TTS voice.
    var ttsVoice = voiceSynthesis.GetType()
                                 .GetMethod("CurrentVoice", BindingFlags.Instance | BindingFlags.NonPublic)
                                 .Invoke(voiceSynthesis, new object[] { false });

    // The engine caches the wave format it set up as a raw byte array.
    var waveFormat = (byte[])ttsVoice.GetType()
                                     .GetField("_waveFormat", BindingFlags.Instance | BindingFlags.NonPublic)
                                     .GetValue(ttsVoice);

    // Marshal the byte array into a WAVEFORMATEX structure.
    var pin = GCHandle.Alloc(waveFormat, GCHandleType.Pinned);
    var format = (WAVEFORMATEX)Marshal.PtrToStructure(pin.AddrOfPinnedObject(), typeof(WAVEFORMATEX));
    pin.Free();

    return format;
}

Usage:

SpeechSynthesizer s = new SpeechSynthesizer();
s.Speak("Hello");
var format = GetCurrentWaveFormat(s);
Debug.WriteLine($"{s.Voice.SupportedAudioFormats.Count} formats are claimed as supported.");
Debug.WriteLine($"Actual format: {format.nChannels} channel {format.nSamplesPerSec} Hz {format.wBitsPerSample}-bit audio");

To test it, I renamed Microsoft Anna's AudioFormats registry key under HKLM/Software/Wow6432Node/Microsoft/Speech/Voices/Tokens/MS-Anna-1033-20-Dsk/Attributes, causing SpeechSynthesizer.Voice.SupportedAudioFormats to have no elements when queried. Below is the output in this situation:

0 formats are claimed as supported.
Actual format: 1 channel 16000 Hz 16-bit audio
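To close the loop on the original question, the reflected WAVEFORMATEX can be fed back into SetOutputToWaveFile. This is a sketch: it assumes the engine produced PCM (wFormatTag == 1), which holds for the standard desktop voices, and the output path is illustrative:

```csharp
using System.Speech.AudioFormat;
using System.Speech.Synthesis;

// Sketch: convert the reflected format into a SpeechAudioFormatInfo.
// EncodingFormat.Pcm is assumed (wFormatTag == 1); other format tags
// would need different handling.
var synth = new SpeechSynthesizer();
synth.Speak("");                         // force engine setup (see note above)
WAVEFORMATEX wf = GetCurrentWaveFormat(synth);

var info = new SpeechAudioFormatInfo(
    EncodingFormat.Pcm,
    (int)wf.nSamplesPerSec,
    wf.wBitsPerSample,
    wf.nChannels,
    (int)wf.nAvgBytesPerSec,
    wf.nBlockAlign,
    new byte[0]);                        // no format-specific extra data

synth.SetOutputToWaveFile("output.wav", info);
```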
Subgroup answered 17/12, 2015 at 1:31 Comment(11)
Thank you; however, as I noted, the platform target is already "x86".Commonweal
@Commonweal Interesting. What value do you have for HKEY_LOCAL_MACHINE/Software/Microsoft/Speech/Voices/Tokens/(speech engine of choice)/Attributes/AudioFormats? On this PC (Win7 for Microsoft Anna), the default is the REG_SZ string "18". If I rename the AudioFormats key I get no formats when enumerating. Looks like a bitmask (though stored as a REG_SZ) as I can tweak various bits but some combinations are illegal. Likewise under HKLM/Software/Wow6432Node/Microsoft/Speech/Voices/etc., do they differ? Wonder if this is an installer/registry/voice issue, doesn't seem like an API issue.Subgroup
There is no "AudioFormats" attribute there. It seems there is no such attribute in Windows 8.1Commonweal
@Commonweal Hmm. Well it should be possible to extract this via reflection. I'll update my answer.Subgroup
Thank you very much, it works! So you extract the format after the speech has begun?!Commonweal
@Commonweal In effect, yes exactly, but it's not so much that as convincing the SpeechSynthesizer that it has to set up the TTS engine and prepare for wave output. Once it's done so, the format it used to do that is cached in _waveFormat. With a bit more legwork with ILSpy, it may be possible to coerce the synthesizer to do this without calling Speak(). But either way I suspect one can just call Speak("") once, query the format, and then use it until you change engines or other parameters.Subgroup
Before this solution, I had a trick: I call Speak("please wait") once after SetOutputToDefaultAudioDevice and then again after SetOutputToNull. In both cases I record the e.AudioPosition parameter of the SpeakProgress event into variables, and by dividing them I calculate a ratio. Multiplying that by 22050, I try to find the actual sample rate. Which of these solutions (your answer and my trick) do you think is more robust in an application?Commonweal
By the way, is there a way to check whether the output of GetCurrentWaveFormat is a valid format?Commonweal
@Commonweal All just my opinion, but I'd say it's hard to call regarding your first question; your heuristic doesn't rely on internals, which is a big plus, but reflection will give you actual values rather than an estimate. Future changes to the .NET Framework could break any reflection-based solution. As to validity of the wave format, if Speak() succeeds then I'd be surprised if the resulting format wasn't valid (in the sense of reflecting the actual runtime state), but it'd need more extensive analysis of the SDK code via ILSpy to come up with a definitive answer either way.Subgroup
Actually, I don't mean the validity; I am trying to combine your solution with my heuristic. I want to know if GetCurrentWaveFormat somehow failed, so that I can switch to my heuristic. Is there a way to check that?Commonweal
Ah I see. In which case I might wrap the reflecting code in a try/catch, as it's likely to throw if internals change, and it will if Speak() isn't called. I might also assert/check that the struct has sensible-looking values: that wFormatTag == WAVE_FORMAT_PCM (1), that nSamplesPerSec is at least 4 kHz and not unreasonably high, that wBitsPerSample is between 8 and something sensible, and the like. But I'd prefer to rely on a fair bit of testing on common user configurations, engines, and OS versions rather than trying to cover all bases.Subgroup
You can't get this information from code. You can only listen to all formats (from a poor format like 8 kHz up to a high-quality format like 48 kHz) and observe where it stops getting better, which is what you did, I think.

Internally, the speech engine "asks" the voice for its original audio format only once, and I believe this value is used only internally by the speech engine, which does not expose it in any way.

For further information:

Let's say you are a voice company. You have recorded your computer voice at 16 kHz, 16-bit, mono.

The user can have your voice speak at 48 kHz, 32-bit, stereo. The speech engine does this conversion. The speech engine does not care whether it really sounds better; it simply does the format conversion.

Let's say the user wants your voice to speak something. He requests that the file be saved as 48 kHz, 16-bit, stereo.

SAPI / System.Speech calls your voice with this method:

STDMETHODIMP SpeechEngine::GetOutputFormat(const GUID* pTargetFormatId,
                                           const WAVEFORMATEX* pTargetWaveFormatEx,
                                           GUID* pDesiredFormatId,
                                           WAVEFORMATEX** ppCoMemDesiredWaveFormatEx)
{
    HRESULT hr = S_OK;

    // Here we return the format of the audio data that we will pass to the
    // speech engine. Our format (16 kHz, 16-bit, mono) will be converted to
    // the format that the user requested. This is done by the SAPI engine.

    // Tell the speech engine which format our data is in, so it knows whether
    // it should upsample or downsample our voice data to match the format
    // that the user requested.
    enum SPSTREAMFORMAT sample_rate_at_which_this_voice_was_recorded = SPSF_16kHz16BitMono;

    hr = SpConvertStreamFormatEnum(sample_rate_at_which_this_voice_was_recorded,
                                   pDesiredFormatId, ppCoMemDesiredWaveFormatEx);

    return hr;
}

This is the only place where you have to "reveal" what the recorded format of your voice is.

All the "available formats" rather tell you which conversions your sound card / Windows can do.

I hope I explained it well. As a voice vendor, you don't support any formats. You just tell the speech engine what format your audio data is in so that it can do the further conversions.

Julieannjulien answered 3/4, 2019 at 1:37 Comment(0)