How to extract variables from speech recognition

I'm using System.Speech to recognize some phrases or words. One of them is "Set timer". I would like to expand this to "Set timer for X seconds" and have the code set a timer for X seconds. Is this possible? I have little to no experience with this so far; all I could find is that I have to do something with the Grammar class.

Right now I have set up my recognition engine like this:

SpeechRecognitionEngine = new SpeechRecognitionEngine();
SpeechRecognitionEngine.SetInputToDefaultAudioDevice();

var choices = new Choices();
choices.Add("Set timer");

var gb = new GrammarBuilder();
gb.Append(choices);
var g = new Grammar(gb);

SpeechRecognitionEngine.LoadGrammarAsync(g);

SpeechRecognitionEngine.RecognizeAsync(RecognizeMode.Multiple);
SpeechRecognitionEngine.SpeechRecognized += OnSpeechRecognized;

Is there a way to do this?

Ruderal answered 25/3, 2018 at 18:40 Comment(1)
Download the Microsoft Speech Platform SDK and Runtime. Inside the SDK there's a Grammar Builder and a grammar validator. There's also a ready-made grammar specifically for recognizing numbers (and other elements that are annoying to set up, like currency and dates). Another interesting thing you'll find there, if you haven't done this before, is a description of how to implement follow-up/cascading sequences of commands tied to actions to perform. I don't remember how thorough they are, but it'll get you started. - Kautz

First, there is no built-in concept of a number. Speech is just a sequence of words, so if you need to recognize numbers, you need to recognize the words that mean numbers, such as "one" and "fifteen". Some numbers are represented by multiple words, such as "one hundred" or "fifty one", and you need to recognize those too.

You can start with just recognizing numbers from 1 to 9:

var engine = new SpeechRecognitionEngine(CultureInfo.GetCultureInfo("en-US"));
engine.SetInputToDefaultAudioDevice();
var num1To9 = new Choices(
    new SemanticResultValue("one", 1),
    new SemanticResultValue("two", 2),
    new SemanticResultValue("three", 3),
    new SemanticResultValue("four", 4),
    new SemanticResultValue("five", 5),
    new SemanticResultValue("six", 6),
    new SemanticResultValue("seven", 7),
    new SemanticResultValue("eight", 8),
    new SemanticResultValue("nine", 9));

var gb = new GrammarBuilder();
gb.Culture = CultureInfo.GetCultureInfo("en-US");
gb.Append("set timer for");
gb.Append(num1To9);
gb.Append("seconds");
var g = new Grammar(gb);

engine.LoadGrammar(g); // better not use LoadGrammarAsync
engine.SpeechRecognized += OnSpeechRecognized;
engine.RecognizeAsync(RecognizeMode.Multiple);
Console.WriteLine("Speak");
Console.ReadKey();

So our grammar can be read as:

  • "Set timer for" phrase
  • followed by "one" OR "two" OR "three"...
  • followed by "seconds"

We use SemanticResultValue to assign a tag to a specific phrase. In this case the tag is the number (1, 2, 3...) corresponding to a specific word ("one", "two", "three"). That lets you extract the value from the recognition result:

private static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e) {
    var numSeconds = (int)e.Result.Semantics.Value;
    Console.WriteLine($"Starting timer for {numSeconds} seconds...");
}

This is already a working example: it will recognize phrases like "set timer for five seconds" and let you extract the semantic value (5) from them.

Now you could combine various number words together, for example:

var num10To19 = new Choices(
    new SemanticResultValue("ten", 10),
    new SemanticResultValue("eleven", 11),
    new SemanticResultValue("twelve", 12),
    new SemanticResultValue("thirteen", 13),
    new SemanticResultValue("fourteen", 14),
    new SemanticResultValue("fifteen", 15),
    new SemanticResultValue("sixteen", 16),
    new SemanticResultValue("seventeen", 17),
    new SemanticResultValue("eighteen", 18),
    new SemanticResultValue("nineteen", 19)
);

var numTensFrom20To90 = new Choices(
    new SemanticResultValue("twenty", 20),
    new SemanticResultValue("thirty", 30),
    new SemanticResultValue("forty", 40),
    new SemanticResultValue("fifty", 50),
    new SemanticResultValue("sixty", 60),
    new SemanticResultValue("seventy", 70),
    new SemanticResultValue("eighty", 80),
    new SemanticResultValue("ninety", 90)
);

var num20to99 = new GrammarBuilder();
// first word is "twenty", "thirty" etc
num20to99.Append(numTensFrom20To90);
// followed by ONE OR ZERO "digit" words ("one", "two", "three" etc)
num20to99.Append(num1To9, 0, 1);

But it gets tricky to assign the correct semantic value to such a phrase: the GrammarBuilder API has no way to compute a combined value (such as 20 + 1 = 21) inside the grammar itself.
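
One workaround that stays within GrammarBuilder is to tag each part with its own key and add the parts together in the handler. The following is only a sketch under assumptions: the class name TwoWordNumbers, the helper names Build20To99 and ReadNumber, and the keys "tens" and "ones" are made up for illustration, and the two Choices objects are the ones defined above.

```csharp
using System.Speech.Recognition;

static class TwoWordNumbers
{
    // Hypothetical helper: "twenty".."ninety" optionally followed by "one".."nine".
    // Each part gets its own key; "tens" and "ones" are names chosen for this sketch.
    public static GrammarBuilder Build20To99(Choices numTensFrom20To90, Choices num1To9)
    {
        var tens = new GrammarBuilder(
            new SemanticResultKey("tens", new GrammarBuilder(numTensFrom20To90)));
        var ones = new GrammarBuilder(
            new SemanticResultKey("ones", new GrammarBuilder(num1To9)));

        var num20To99 = new GrammarBuilder();
        num20To99.Append(tens);
        num20To99.Append(ones, 0, 1); // the "ones" word may be absent
        return num20To99;
    }

    // Hypothetical helper: adds the tagged parts back together.
    // The "ones" key is missing for "twenty", "thirty" and so on.
    public static int ReadNumber(SemanticValue semantics)
    {
        var total = (int)semantics["tens"].Value;
        if (semantics.ContainsKey("ones"))
            total += (int)semantics["ones"].Value;
        return total;
    }
}
```

In the handler you would then call something like TwoWordNumbers.ReadNumber(e.Result.Semantics) instead of casting Semantics.Value directly. This moves the arithmetic out of the grammar, which works for two-word numbers but becomes unwieldy for larger ranges.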

When what you want to do cannot be (easily) done with GrammarBuilder and related classes alone, you have to use the more powerful XML grammar files, whose syntax is defined in the W3C Speech Recognition Grammar Specification (SRGS).
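
To give a rough idea of what the XML form adds, here is a minimal sketch of a grammar that computes the value of a two-word number inside the grammar itself. The rule names (TwentyToNinetyNine, Tens, Ones) are made up for illustration and are not taken from any SDK file, and only a few words are listed to keep it short.

```xml
<grammar version="1.0" xml:lang="en-US" root="TwentyToNinetyNine"
         tag-format="semantics/1.0"
         xmlns="http://www.w3.org/2001/06/grammar">

  <rule id="TwentyToNinetyNine" scope="public">
    <!-- start with the value of the tens word -->
    <ruleref uri="#Tens" />
    <tag>out = rules.latest();</tag>
    <!-- optionally add the value of a following ones word -->
    <item repeat="0-1">
      <ruleref uri="#Ones" />
      <tag>out = out + rules.latest();</tag>
    </item>
  </rule>

  <rule id="Tens" scope="private">
    <one-of>
      <item>twenty <tag>out = 20;</tag></item>
      <item>thirty <tag>out = 30;</tag></item>
    </one-of>
  </rule>

  <rule id="Ones" scope="private">
    <one-of>
      <item>one <tag>out = 1;</tag></item>
      <item>two <tag>out = 2;</tag></item>
    </one-of>
  </rule>
</grammar>
```

The tag elements contain small script expressions, which is what lets the grammar do the addition that GrammarBuilder cannot express.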

A full description of those grammar files is out of scope for this question, but fortunately for your task there is already a grammar file provided in the Microsoft Speech SDK, which you have probably already downloaded and installed. So, copy the file from "C:\Program Files\Microsoft SDKs\Speech\v11.0\Samples\Sample Grammars\en-US.grxml" (or wherever you installed the SDK) and remove some irrelevant things, such as the first <tag> element with the large CDATA section inside. The code below assumes the copy is named "en-US-sample.grxml".

The rule of interest in this file is named "Cardinal"; it recognizes numbers from 0 to 1 million. Our code then becomes:

var sampleDoc = new SrgsDocument(@"en-US-sample.grxml");
sampleDoc.Culture = CultureInfo.GetCultureInfo("en-US");
// define new rule, named Timer
SrgsRule rootRule = new SrgsRule("Timer");            
// match "set timer for" phrase
rootRule.Add(new SrgsItem("set timer for"));
// followed by whatever "Cardinal" rule defines (reference to another rule)
rootRule.Add(new SrgsRuleRef(sampleDoc.Rules["Cardinal"]));
// followed by "seconds"
rootRule.Add(new SrgsItem("seconds"));
// add to rules
sampleDoc.Rules.Add(rootRule);
// make it a root rule, so that it will be used for recognition
sampleDoc.Root = rootRule;
var g = new Grammar(sampleDoc);

engine.LoadGrammar(g); // better not use LoadGrammarAsync
engine.SpeechRecognized += OnSpeechRecognized;
engine.RecognizeAsync(RecognizeMode.Multiple);

And the handler becomes:

private static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e) {
    var numSeconds = Convert.ToInt32(e.Result.Semantics.Value);
    Console.WriteLine($"Starting timer for {numSeconds} seconds...");
}

Now you can recognize numbers up to 1 million.

Of course, it's not necessary to define the rule in code like we did above. You can define all your rules completely in XML, then just load the file as an SrgsDocument and create a Grammar from it.

If you want to recognize multiple commands - here is a sample:

var sampleDoc = new SrgsDocument(@"en-US-sample.grxml");            
sampleDoc.Culture = CultureInfo.GetCultureInfo("en-US");
// this rule is the same as above
var setTimerRule = new SrgsRule("SetTimer");            
setTimerRule.Add(new SrgsItem("set timer for"));            
setTimerRule.Add(new SrgsRuleRef(sampleDoc.Rules["Cardinal"]));            
setTimerRule.Add(new SrgsItem("seconds"));            
sampleDoc.Rules.Add(setTimerRule);

// new rule, clear timer
var clearTimerRule = new SrgsRule("ClearTimer");
// just match this phrase
clearTimerRule.Add(new SrgsItem("clear timer"));
sampleDoc.Rules.Add(clearTimerRule);
// new root rule, matching either set timer OR clear timer
var rootRule = new SrgsRule("Timers");
rootRule.Add(new SrgsOneOf( // << OneOf is basically the same as Choice
    //               reference to SetTimer                                         
    new SrgsItem(new SrgsRuleRef(setTimerRule), 
        // assign command name. Both "command" and "settimer" are arbitrary names I chose
        new SrgsSemanticInterpretationTag("out = rules.latest();out.command = 'settimer';")),
    new SrgsItem(new SrgsRuleRef(clearTimerRule),
        // assign command name. If this rule "wins" - command will be cleartimer
        new SrgsSemanticInterpretationTag("out.command = 'cleartimer';"))
));
sampleDoc.Rules.Add(rootRule);
sampleDoc.Root = rootRule;
var g = new Grammar(sampleDoc);

And the handler becomes:

private static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e) {
    var sem = e.Result.Semantics;
    // here "command" is arbitrary key we assigned in our rule
    var commandName = (string) sem["command"].Value;
    switch (commandName) {
        // also arbitrary values we assigned, not related to rule names or something else
        case "settimer":
            var numSeconds = Convert.ToInt32(sem.Value);
            Console.WriteLine($"Starting timer for {numSeconds} seconds...");
            break;
        case "cleartimer":
            Console.WriteLine("timer cleared");
            break;
    }
}
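
The handlers above only print a message. If it helps, here is one possible way to actually run and cancel the timer, using only Task.Delay and CancellationTokenSource from the base class library. This is a sketch, not part of the original answer: VoiceTimer, Start and Clear are hypothetical names, and the code is not thread-safe, so it assumes commands are handled one at a time.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class VoiceTimer
{
    // Cancellation source of the currently running timer, if any.
    private static CancellationTokenSource _cts;

    // Starts a new timer, replacing any timer that is still running.
    public static void Start(int numSeconds)
    {
        Clear();
        _cts = new CancellationTokenSource();
        Task.Delay(TimeSpan.FromSeconds(numSeconds), _cts.Token)
            // Runs only if the delay finished without being cancelled.
            .ContinueWith(t => Console.WriteLine("Timer elapsed"),
                          TaskContinuationOptions.OnlyOnRanToCompletion);
    }

    // Cancels the running timer; does nothing if there is none.
    public static void Clear()
    {
        _cts?.Cancel();
        _cts?.Dispose();
        _cts = null;
    }
}
```

With this in place, the "settimer" case could call VoiceTimer.Start(numSeconds) and the "cleartimer" case could call VoiceTimer.Clear().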

For completeness, here is how you can do the same with pure XML. Open that "en-US-sample.grxml" file with an XML editor and add the rules we defined above in code. They will look like this:

<rule id="SetTimer" scope="private">
    <item>set timer for</item>
    <item>
        <ruleref uri="#Cardinal" />
    </item>
    <item>seconds</item>
</rule>

<rule id="ClearTimer" scope="private">
    <item>clear timer</item>
</rule>

<rule id="Timers" scope="public">
    <one-of>
        <item>
            <ruleref uri="#SetTimer" />
            <tag>out = rules.latest(); out.command = 'settimer'</tag>
        </item>
        <item>
            <ruleref uri="#ClearTimer" />
            <tag>out.command = 'cleartimer'</tag>
        </item>
    </one-of>
</rule> 

Now set the root rule on the root grammar tag:

<grammar xml:lang="en-US" version="1.0" xmlns="http://www.w3.org/2001/06/grammar" tag-format="semantics/1.0" 
    root="Timers">

And save.

Now we don't need to define anything at all in code, all we need to do is load our grammar file:

var sampleDoc = new SrgsDocument(@"en-US-sample.grxml");                        
var g = new Grammar(sampleDoc);
engine.LoadGrammar(g);

That's all. Because the "Timers" rule is the root rule in the grammar file, it will be used for recognition and will behave exactly the same as the version we defined in code.

Informer answered 28/3, 2018 at 10:12 Comment(6)
Wow, this is already an amazing answer just from the effort alone. I'll have to try this another day though, I don't have time for it yet. I did notice something in the first code snippet: I see you use 3 lines of Append. What happens when I want to add more phrases for different things? (Like Cortana for example, a program that can recognize a whole list of commands.) Is there no problem if I add multiple words/phrases? - Ruderal
@RandomStranger you need to combine them with Choices the same way. Say you have a grammar builder with the logic to recognize the "set timer" command (like in this example). Then you have another one, say "clear timer". Then you combine them both into another rule, so that either one or the other is recognized. Like with regular expressions: "^(set timer for \d+ seconds)|(clear timer)$". Note that if you are serious about this stuff, you need to learn that XML syntax; Choices alone will not get you far. - Informer
I see, so the SrgsDocument class basically becomes a container for all the phrases I would like to recognize? Edit: I'm definitely going to look into the XML syntax, I didn't even know that existed. - Ruderal
@RandomStranger I've added an example of how you can recognize multiple commands ("set timer for X seconds" and "clear timer" in this case). - Informer
Ahhh I see now, thank you so much for your effort! Will try this out tomorrow, I'll let you know if I run into any problems. - Ruderal
Just got to try it out early and this works very well, I can't thank you enough! - Ruderal
