How to extract variables from speech recognition

I'm using System.Speech to recognize some phrases or words. One of them is Set timer. I would like to expand this to Set timer for X seconds, and having the code set a timer for X seconds. Is this possible? I have little to no experience with this so far, all I could find is that I have to do something with the grammar class.

Right now I have set up my recognition engine like this:

SpeechRecognitionEngine = new SpeechRecognitionEngine();
SpeechRecognitionEngine.SetInputToDefaultAudioDevice();

var choices = new Choices();
choices.Add("Set timer");

var gb = new GrammarBuilder();
gb.Append(choices);
var g = new Grammar(gb);

SpeechRecognitionEngine.LoadGrammarAsync(g);

SpeechRecognitionEngine.RecognizeAsync(RecognizeMode.Multiple);
SpeechRecognitionEngine.SpeechRecognized += OnSpeechRecognized;

Is there a way to do this?

First, there is no built-in concept of number. Speech is just sequence of words, and if you need to recognize numbers - you need to recognize words which mean numbers, such as "one" and "fifteen". Some numbers are represented by multiple words, such as "one hundred" or "fifty one" - you need to recognize them too.

You can start with just recognizing numbers from 1 to 9:

var engine = new SpeechRecognitionEngine(CultureInfo.GetCultureInfo("en-US"));
engine.SetInputToDefaultAudioDevice();
var num1To9 = new Choices(
    new SemanticResultValue("one", 1),
    new SemanticResultValue("two", 2),
    new SemanticResultValue("three", 3),
    new SemanticResultValue("four", 4),
    new SemanticResultValue("five", 5),
    new SemanticResultValue("six", 6),
    new SemanticResultValue("seven", 7),
    new SemanticResultValue("eight", 8),
    new SemanticResultValue("nine", 9));

var gb = new GrammarBuilder();
gb.Culture = CultureInfo.GetCultureInfo("en-US");
gb.Append("set timer for");
gb.Append(num1To9);
gb.Append("seconds");
var g = new Grammar(gb);

engine.LoadGrammar(g); // better not use LoadGrammarAsync
engine.SpeechRecognized += OnSpeechRecognized;
engine.RecognizeAsync(RecognizeMode.Multiple);
Console.WriteLine("Speak");
Console.ReadKey();

So our grammar can be read as:

"Set timer for" phrase
followed by "one" OR "two" OR "three"...
followed by "seconds"

We use SemanticResultValue to assign a tag to specific phrase. In this case that tag is number (1,2,3...) corresponding to specific word ("one", "two", "three"). By doing that - you can extract that value from recognition result:

private static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e) {
    var numSeconds = (int)e.Result.Semantics.Value;
    Console.WriteLine($"Starting timer for {numSeconds} seconds...");
}

This is already working example which will recognize your phrases like "set timer for five seconds" and allow you to extract semantic value (5) from them.

Now you could combine various number words together, for example:

var num10To19 = new Choices(
    new SemanticResultValue("ten", 10),
    new SemanticResultValue("eleven", 11),
    new SemanticResultValue("twelve", 12),
    new SemanticResultValue("thirteen", 13),
    new SemanticResultValue("fourteen", 14),
    new SemanticResultValue("fifteen", 15),
    new SemanticResultValue("sexteen", 16),
    new SemanticResultValue("seventeen", 17),
    new SemanticResultValue("eighteen", 18),
    new SemanticResultValue("nineteen", 19)
);

var numTensFrom20To90 = new Choices(
    new SemanticResultValue("twenty", 20),
    new SemanticResultValue("thirty", 30),
    new SemanticResultValue("forty", 40),
    new SemanticResultValue("fifty", 50),
    new SemanticResultValue("sixty", 60),
    new SemanticResultValue("seventy", 70),
    new SemanticResultValue("eighty", 80),
    new SemanticResultValue("ninety", 90)
);

var num20to99 = new GrammarBuilder();
// first word is "twenty", "thirty" etc
num20to99.Append(numTensFrom20To90);
// followed by ONE OR ZERO "digit" words ("one", "two", "three" etc)
num20to99.Append(num1To9, 0, 1);

But it gets tricky to correctly assign semantic values to them, because this api with GrammarBuilder is not powerful enough to do that.

When what you want to do cannot be (easily) done with pure GrammarBuilder and related classes - you have to use more powerful xml files, with syntax defined in this specification.

Description of those grammar files are out of scope for this question, but fortunately for your task there is already grammar file provided in Microsoft Speech SDK which you probably already downloaded and installed. So, copy file from "C:\Program Files\Microsoft SDKs\Speech\v11.0\Samples\Sample Grammars\en-US.grxml" (or wherever you installed SDK) and remove some irrelevant things, such as first <tag> element with large CDATA inside.

Rule of interest in this file is named "Cardinal" and allows to recognize numbers from 0 to 1 million. Then our code becomes:

var sampleDoc = new SrgsDocument(@"en-US-sample.grxml");
sampleDoc.Culture = CultureInfo.GetCultureInfo("en-US");
// define new rule, named Timer
SrgsRule rootRule = new SrgsRule("Timer");            
// match "set timer for" phrase
rootRule.Add(new SrgsItem("set timer for"));
// followed by whatever "Cardinal" rule defines (reference to another rule)
rootRule.Add(new SrgsRuleRef(sampleDoc.Rules["Cardinal"]));
// followed by "seconds"
rootRule.Add(new SrgsItem("seconds"));
// add to rules
sampleDoc.Rules.Add(rootRule);
// make it a root rule, so that it will be used for recognition
sampleDoc.Root = rootRule;
var g = new Grammar(sampleDoc);

engine.LoadGrammar(g); // better not use LoadGrammarAsync
engine.SpeechRecognized += OnSpeechRecognized;
engine.RecognizeAsync(RecognizeMode.Multiple);

And handler becomes:

private static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e) {
    var numSeconds = Convert.ToInt32(e.Result.Semantics.Value);
    Console.WriteLine($"Starting timer for {numSeconds} seconds...");
}

Now you can regognize numbers up to 1 million.

Of course it's not necessary to define rule in code like we did above - you can define all your rules completely in xml, and then just load it as SrgsDocument and create a Grammar from it.

If you want to recognize multiple commands - here is a sample:

var sampleDoc = new SrgsDocument(@"en-US-sample.grxml");            
sampleDoc.Culture = CultureInfo.GetCultureInfo("en-US");
// this rule is the same as above
var setTimerRule = new SrgsRule("SetTimer");            
setTimerRule.Add(new SrgsItem("set timer for"));            
setTimerRule.Add(new SrgsRuleRef(sampleDoc.Rules["Cardinal"]));            
setTimerRule.Add(new SrgsItem("seconds"));            
sampleDoc.Rules.Add(setTimerRule);

// new rule, clear timer
var clearTimerRule = new SrgsRule("ClearTimer");
// just match this phrase
clearTimerRule.Add(new SrgsItem("clear timer"));
sampleDoc.Rules.Add(clearTimerRule);
// new root rule, marching either set timer OR clear timer
var rootRule = new SrgsRule("Times");
rootRule.Add(new SrgsOneOf( // << OneOf is basically the same as Choice
    //               reference to SetTimer                                         
    new SrgsItem(new SrgsRuleRef(setTimerRule), 
        // assign command name. Both "command" and "settimer" are arbitrary names I chose
        new SrgsSemanticInterpretationTag("out = rules.latest();out.command = 'settimer';")),
    new SrgsItem(new SrgsRuleRef(clearTimerRule),
        // assign command name. If this rule "wins" - command will be cleartimer
        new SrgsSemanticInterpretationTag("out.command = 'cleartimer';"))
));
sampleDoc.Rules.Add(rootRule);
sampleDoc.Root = rootRule;
var g = new Grammar(sampleDoc);

And handler becomes:

private static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e) {
    var sem = e.Result.Semantics;
    // here "command" is arbitrary key we assigned in our rule
    var commandName = (string) sem["command"].Value;
    switch (commandName) {
        // also arbitrary values we assigned, not related to rule names or something else
        case "settimer":
            var numSeconds = Convert.ToInt32(sem.Value);
            Console.WriteLine($"Starting timer for {numSeconds} seconds...");
            break;
        case "cleartimer":
            Console.WriteLine("timer cleared");
            break;
    }
}

For completenes - here is how you can do the same with pure xml. Open that "en-US-sample.grxml" file with xml editor and add rules we defined above in code. They will look like this:

<rule id="SetTimer" scope="private">
    <item>set timer for</item>
    <item>
        <ruleref uri="#Cardinal" />
    </item>
    <item>seconds</item>
</rule>

<rule id="ClearTimer" scope="private">
    <item>clear timer</item>
</rule>

<rule id="Timers" scope="public">
    <one-of>
        <item>
            <ruleref uri="#SetTimer" />
            <tag>out = rules.latest(); out.command = 'settimer'</tag>
        </item>
        <item>
            <ruleref uri="#ClearTimer" />
            <tag>out.command = 'cleartimer'</tag>
        </item>
    </one-of>
</rule>

Now set root rule at root grammar tag:

<grammar xml:lang="en-US" version="1.0" xmlns="http://www.w3.org/2001/06/grammar" tag-format="semantics/1.0" 
    root="Timers">

And save.

Now we don't need to define anything at all in code, all we need to do is load our grammar file:

var sampleDoc = new SrgsDocument(@"en-US-sample.grxml");                        
var g = new Grammar(sampleDoc);
engine.LoadGrammar(g);

That's all. Because "Timers" rule is root rule in grammar file - it will be used in recognition, and will behave exactly the same as version we defined in code.

Recommended topics

Hot tags