As an employee of one of the largest IT companies, I recently came across the term “cognitive systems”.
I really wondered what the term meant and what opportunities there were to use it in my own applications.
So I started to read through the entire IBM Bluemix Watson API documentation, starting with Speech2Text, Text2Speech, Natural Language Processing and the new Dialog API.
It is just amazing what kind of new and powerful APIs IBM offers right now.
2. Clarification of outcome
So, after I had gained an overview of the power of those APIs, I decided to develop a new piece of software. To be exact: a smart home control application written in C# on the .NET Framework 4.6.
The software should be able to control my rooms’ thermostats as well as switch TV channels, turn the TV off and so on […] as the minimum feature specification.
But, and here comes the key point, the commands should be given by voice. And I do not mean simple phrases such as “TV on” or “TV off”; I mean real, complete sentences, for instance “Please turn the television off.” or “Would you be so kind as to switch the channel to Eurosport?”.
Furthermore I wanted a system which is able to learn from my commands. It should recognize phrases that have never been spoken before and be smart enough to map them to the right function.
Well, those are really tough system requirements, and it is obvious that this cannot be done with native system tools alone.
3. Basic system concept of voice recognition
Ok then let’s start with the basic system concept.
I needed a reliable speech-to-text API to recognize my spoken text with high confidence.
The standard API of the .NET Framework is not smart enough to interpret free text without a complex context model.
But I don’t want to create a context model for every function, and some of the phrasings are unknown in advance. Therefore I have to use another API.
I decided to use the IBM Watson Speech2Text API for English text and the Microsoft Bing Speech API for German text (just to play around, because IBM doesn’t offer a German S2T model yet).
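To give an idea of what such a call looked like, here is a minimal sketch of posting a WAV recording to the Bluemix-era Watson Speech to Text REST endpoint with `HttpClient`. The credentials and the file name are placeholders, not real values, and error handling is omitted.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

class WatsonS2TClient
{
    // Builds the value for an HTTP Basic authentication header
    // from the service credentials.
    public static string BasicAuth(string user, string password)
    {
        return Convert.ToBase64String(
            Encoding.ASCII.GetBytes(user + ":" + password));
    }

    static void Main()
    {
        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue(
                    "Basic", BasicAuth("username", "password")); // placeholders

            // Send the recorded command as raw WAV audio.
            var audio = new ByteArrayContent(
                System.IO.File.ReadAllBytes("command.wav"));
            audio.Headers.ContentType = new MediaTypeHeaderValue("audio/wav");

            var response = client.PostAsync(
                "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize",
                audio).GetAwaiter().GetResult();

            // The service answers with a JSON document containing the
            // transcript and a confidence value per alternative.
            Console.WriteLine(
                response.Content.ReadAsStringAsync().GetAwaiter().GetResult());
        }
    }
}
```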
3.1 Offline recognition (wait for start sequence)
This leads me to the first problem: both APIs are available online only. I don’t want to stream all of my conversations to external services, and both APIs have quota limits, so after a few hours the limit would be reached. The solution is the native speech-to-text API of the .NET Framework: I created a custom context model which simply waits for the word “Computer”, then activates the listening mode and streams the upcoming voice command to the online services.
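The wake-word listener described above can be sketched with the .NET Framework’s `System.Speech.Recognition` namespace (Windows only). The confidence threshold and the handler body are my own assumptions; the point is that the offline grammar contains nothing but the activation word, so no other speech is ever recognized locally.

```csharp
using System;
using System.Speech.Recognition;

class WakeWordListener
{
    static void Main()
    {
        using (var recognizer = new SpeechRecognitionEngine())
        {
            // Restrict the offline grammar to the single activation word.
            var builder = new GrammarBuilder(new Choices("Computer"));
            recognizer.LoadGrammar(new Grammar(builder));

            recognizer.SpeechRecognized += (sender, e) =>
            {
                if (e.Result.Confidence > 0.8)
                {
                    // Here the application would switch to recording mode
                    // and forward the next utterance to the online service.
                    Console.WriteLine("Wake word detected – start streaming.");
                }
            };

            recognizer.SetInputToDefaultAudioDevice();
            recognizer.RecognizeAsync(RecognizeMode.Multiple); // keep listening
            Console.ReadLine(); // run until Enter is pressed
        }
    }
}
```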
4. Handle S2T result text
At this point I decided to try two different approaches. First I started with the IBM Watson Dialog API combined with the IBM Watson Natural Language Processing API, and after a while I found another really nice artificial intelligence API called API.AI. With it you have the opportunity to analyze the received S2T result text. Both are absolutely awesome programming interfaces with great potential to help software developers build smarter software.
After one of the above-mentioned natural language processing APIs has processed the S2T result text, the smart home software executes the matching action (e.g. increasing the bathroom temperature by 5 degrees Celsius).
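Executing “the correct action” boils down to an intent-to-handler mapping. The following sketch is purely illustrative: the intent names (`tv.off`, `thermostat.change`, …) and parameter keys mimic the shape of what an API.AI-style service returns, but are not taken from any real response.

```csharp
using System;
using System.Collections.Generic;

class IntentDispatcher
{
    // Maps an intent name (as returned by the NLP service) to the
    // smart home action that should be executed for it.
    readonly Dictionary<string, Action<IDictionary<string, string>>> handlers;

    public IntentDispatcher()
    {
        handlers = new Dictionary<string, Action<IDictionary<string, string>>>
        {
            ["tv.off"] = p => Console.WriteLine("Turning the TV off."),
            ["tv.channel"] = p => Console.WriteLine(
                "Switching to channel " + p["channel"] + "."),
            ["thermostat.change"] = p => Console.WriteLine(
                "Changing " + p["room"] + " temperature by " + p["delta"] + " °C.")
        };
    }

    // Returns false for unknown intents, so the caller can
    // ask the user to rephrase the command.
    public bool Dispatch(string intent, IDictionary<string, string> parameters)
    {
        Action<IDictionary<string, string>> handler;
        if (!handlers.TryGetValue(intent, out handler))
            return false;
        handler(parameters);
        return true;
    }

    static void Main()
    {
        var dispatcher = new IntentDispatcher();
        dispatcher.Dispatch("thermostat.change",
            new Dictionary<string, string> { ["room"] = "bathroom", ["delta"] = "5" });
    }
}
```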