Speech recognition: the word is not a sparrow! A review of voice recognition programs and a comparative test of services

Ever since the computer was invented, humanity has dreamed of communicating with it in its native language: by voice. The average inhabitant of planet Earth wants nothing to do with keyboards and mice. He wants the computer to understand him perfectly, and in the most literal sense. Simple, fast, clear! While science fiction writers spin stories about how, in a century or two, computers will run errands to the store, massage our heels and scratch our backs, software developers are slowly but surely bringing the idea to life. And if you will have to do without the shoulder-blade scratching for now, controlling applications by voice and even dictating whole text files to the computer is already quite possible. There are not many voice programs for the PC yet, but those that exist are developing rapidly. Just a year ago, the earlier versions of the utilities described in this article were a sorry sight. Today they have grown up and matured: no longer bedraggled, wet and hungry puppies, but cheerful wolf cubs that in a year or two will become full-grown wolves of computer voice control.

Dragon Naturally Speaking 8
A utility unique of its kind: the Titanic and the Zeppelin of “speech” programs rolled into one. A heady mixture of speech recognizer, voice control for the computer and a tutor in correct English pronunciation. But first things first.
The utility is English-only, and therefore works exclusively with English word forms. In theory you can teach Dragon Naturally Speaking the great and mighty Russian language, but alas, that only works for voice control of the PC: no matter what tricks you try, the utility will not serve as a Russian stenographer. Spoken English, however, it picks up in no time. According to the developers, the program recognizes up to 95% of words. The figure is overstated, of course, but not nearly as much as with the competitors. By training DNS on the timbre of your voice (expect to spend about an hour dictating various words), you will teach it to understand even very complex tongue-twisting phrases, English swearing included. There is just one “but”: every phrase must be pronounced very clearly. What, you never took an articulation course? Then you will have to practice on your own. Rest assured: after a couple of days of linguistic battles with DNS, you will amaze any Englishman with the purity of your pronunciation. Think we are joking? Not at all! DNS is an ideal trainer for correct pronunciation: the moment your pronunciation slips, it immediately issues a warning.
Now for voice control. Here DNS did not disappoint either. We managed to set the program loose on almost every utility on our editorial computers. First it took a death grip on all the components of MS Office: on voice command it opened Excel and Word, as well as every other application. Then came the turn of network programs. The Bat!, ICQ and various web browsers obeyed DNS on the first try. Finally, we tested the utility on other programs of its own class, and it handled them without batting an eye. It is a funny sight when one voice control program launches another of the same kind. By the way, note that it costs nothing to configure DNS to launch your favorite games: say “Warcraft” into the microphone and it loads immediately. Just remember, before giving commands, to teach the program to associate a specific word with a particular utility (configured in the Accuracy Center menu).
Besides all that, the program comes with many small extras which may seem optional but significantly expand its capabilities. How about, for example, recognizing text from a wav or mp3 file? Download an English-language song in which you cannot make out some of the words, and DNS will hand them to you in text form.
One could sing the praises of DNS almost endlessly. It is the only program in this review that coped with almost all of our texts and demonstrated even more capabilities than we expected of it. An unambiguous must-have.
Pros: Simple, convenient, with many bells and whistles.
Minuses: After the 30-day trial, registration costs almost $200, which, to put it mildly, is not modest. The utility does not understand Russian, but that is a problem with almost all such programs.
Summary: Perhaps the best program for speech recognition and voice control of a computer. If it weren't for the high price, it would be simply ideal.
Realize Voice 4.1
Although its creators position Realize Voice as a kind of multi-tool that copes equally well with speech recognition, application control and speech synthesis, detailed testing showed that they, to put it mildly, exaggerate the product's capabilities. As a speech recognizer the utility is very weak: the percentage of words accurately identified and converted to text is low, and even lengthy sessions with the training module led nowhere. The program simply refuses to understand many words and expressions. RV would have been lynched and crucified on the spot if not for its unique capabilities in voice control of applications. Here RV surged ahead and left the other utilities so far behind that we very nearly gave it a standing ovation. The program is easily configured to launch any third-party utility (Word, ICQ, even some driver) and even supports macros. With these you can do things that are scary even to think about. To a single voice command, which, by the way, can be in Russian, you can attach, for example, the following multi-stage function: open the email client, load the spam filter, connect to the server, download all letters with Russian headers, and delete all letters with English headers and titles longer than 20 characters. That is just one example; the complexity of macros is not limited in any way. The main thing is to keep up with your own imagination. The only thing Realize Voice could not be trained to do was voice control inside computer games. In ordinary applications there are no problems.
As a bonus, RV offers a genuinely integrated approach to voice organization of the workspace. That is the scientific way of putting it; in plain language, your voice can not only launch applications and control their operation, but also load other utilities at any moment, switch between windows, close programs... In other words, at the command “Fetch!” the dog Bobik not only runs for the bone, but on the way stops at the store for milk, takes out the trash, pays the phone bill and buys your girlfriend flowers.
Pros: Unique voice control features, support for complex macros, ease of use.
Minuses: Weak speech recognition module. Price $50.
Summary: The program is simply created for voice control of a computer. It's a pity that the developers sacrificed other important functions of the utility.
Dictation 2004 v. 4.4
An average utility. This is the classic case where there seems to be nothing to complain about, yet next to the competition it does not look very good. Dictation 2004 copes reasonably well with recognizing spoken speech, although it cannot compete with, say, Dragon Naturally Speaking, which beats it precisely at its weakest point: the percentage of correctly recognized words. The program struggles here, and additional training alleviates the disease but does not cure it. For application management you could give the utility an “A”, but it would be a grade for diligence, not for mastery of the subject the way Realize Voice demonstrates it. The developers insist that the program is closely integrated with Word, but we noticed nothing of the sort: working with Word is no different from working with any other application. Finally, Dictation 2004 deserves a pat on the head for recognizing speech from wav files quite decently, although Dragon Naturally Speaking does it much better. The only unique feature of “Dictation” is the ability to recognize speech directly from external sources (a voice recorder, a player, a music center), which hardly anyone will need. So it turns out that Dictation 2004 is nice all around, but parting with fifty hard-earned dollars ($50) for it feels like a pity.
Pros: Can recognize speech directly from various external devices.
Minuses: Average performance for all functions.
Summary: Cheap, but not very cheerful. A middling utility, a gray mouse in the world of speech recognition programs.
Gorynych PROF 3.0
“Gorynych” is a domestic development. Support for the great and mighty Russian language alone could put the program on a pedestal, but let us be objective. The utility is built on two modules, one responsible for recognizing speech dictated into a microphone and the other for issuing commands to applications. Rigorous testing showed that “Gorynych”, alas, has problems with Russian: if we draw an analogy with the foreign programs and their command of English, the domestic product performs at roughly the level of Dictation 2004. That is, everything is fine, but with hiccups. Importantly, the utility has a built-in self-learning module: the more attention you pay to “Gorynych”, the better it understands you and the less it grumbles at your imperfect Russian pronunciation. We tested the utility for only a few hours, and even in that time the program seemed to understand us noticeably better. With longer acquaintance the results may well improve further.
Testing “Gorynych’s” command skills went off without a hitch. The utility does not pretend to be a mega-integrated system; only the basic program-management functions are implemented. You will not be writing any complex macros, but what is there earns a solid A. Launching and closing programs, calling up additional windows: the fairy-tale serpent coped with everything without a hint of nervousness.
Two versions of the cunning Gorynych exist in the wild: a lightweight version (Light), sold in a jewel case for about $5 and ideal for home use, and a full boxed version for $49, whose feature set is clearly overkill for the home.
Pros: Russian language, ergonomic interface, self-learning function, availability of an inexpensive lightweight version.
Minuses: Average performance across all functions, though only against the background of foreign competitors; among domestic utilities it has no analogues.
Summary: Excellent Russian-language program. In the absence of worthy domestic analogues, this is almost the only option for those who are not at all comfortable with English.
What to expect? What to be afraid of?
Despite the relative similarity of “voice” programs, they use different algorithms for recognizing speech, decoding it and displaying it on screen as text. Typically several algorithmic cores are built into a single utility, each responsible for a different function. Depending on which component is engineered more carefully, a given utility copes better with certain tasks. Most often, “voice” applications work in two main directions.
1) Recognition of Russian or English speech and conversion of voice into a text file. This is, of course, the hardest function for developers to implement, and unfortunately no program yet masters it perfectly.
2) Voice control of the computer. Some action, simple or elaborate and multi-stage, is “associated” with a voice command. After that it is enough to utter the cherished word or phrase, and the computer immediately performs the corresponding operation.
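The association described in point 2 can be sketched as a tiny lookup table from recognized phrases to actions. The phrases and function names below are invented for illustration; a real utility would launch processes rather than return strings.

```python
# Toy sketch of the "voice command -> action" idea described above.
# The recognizer output is assumed to already be plain text; the
# actions here are hypothetical stand-ins for launching applications.

def launch_word():
    return "Word started"

def launch_icq():
    return "ICQ started"

# Each spoken phrase is bound to an action, just as the utilities
# let you associate a word with a particular application.
COMMANDS = {
    "open word": launch_word,
    "open icq": launch_icq,
}

def handle_utterance(text):
    """Run the action bound to a recognized phrase, if any."""
    action = COMMANDS.get(text.strip().lower())
    return action() if action else None

print(handle_utterance("Open Word"))  # -> Word started
```

The interesting part is entirely in the table: adding a new voice command is one more dictionary entry, which is essentially what the programs' configuration menus do.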
Note that even the demo versions of the programs described in this article take up at least 50 MB. The reason is the large “vocabulary”: to understand a spoken word, the utility must already “know” it. Do not expect speech programs to run quickly on weak machines; to work comfortably with most of these utilities you need a thoroughly modern computer and a good-quality microphone.

* * *
Now that you are savvy in theory, it is a matter of practice. Stock up on utilities, install them, master them. The speech recognition market is young, which is why the utilities behave like small children: you have to look after them, change their diapers on time, make sure they learn new words (every program has a module for teaching new expressions), care for them and cherish them. What grows out of a distribution downloaded from the Internet or bought in a store depends only on you. Skimp on setup and training, and you will raise an obstinate little hooligan. Spend a few hours on the documentation, the menus and the microphone, and you will bring up a diligent young man who follows you everywhere asking: “What would you like, daddy? Porridge? Lightly salted cucumbers?”

If you type on the keyboard too slowly and are too lazy to learn the ten-finger typing method, you can try using modern programs and services for voice text input.

The keyboard is, without doubt, a fairly convenient way to control a computer. But when it comes to typing a long text, all of its (and, let's be honest, our :)) imperfections become obvious... You also need to be able to type quickly!

A couple of years ago, wanting to simplify my work of writing articles, I decided to find a program that could convert voice into text. I thought how nice it would be to simply say everything I needed into a microphone while the computer typed it for me :)

Imagine my disappointment when I realized that at the time there were no really working (let alone free) solutions for the task. There were, true, domestic developments like “Gorynych” and “Dictograph”. They understood Russian, but alas, their recognition quality was quite low, they required lengthy setup and the creation of a dictionary for your voice, and they were rather expensive besides...

Then Android was born and things finally got moving: in that system, voice input appeared as a built-in (and quite convenient) alternative to the virtual on-screen keyboard. Recently, in one of the comments, I was asked whether a voice input option exists for Windows. I answered that it did not yet, but decided to look around, and it turned out that, while perhaps not fully fledged, such an option does exist! Today's article is about the results of my research.

Speech recognition problem

Before we analyze the current solutions for voice input in Windows, I would like to shed some light on the essence of the computer speech recognition problem. For a more accurate understanding, let us break the process into its stages.

Converting speech into text occurs in several stages:

  1. Voice digitization. At this stage, quality depends on the clarity of your diction and on the quality of the microphone and sound card.
  2. Comparison of the recording with dictionary entries. The “more is better” principle applies: the more recorded words the dictionary contains, the higher the chance your words will be recognized correctly.
  3. Text output. Based on pauses, the system tries to isolate from the speech stream the individual lexemes that correspond to template lexemes in the dictionary, and then displays the matches it finds as text.
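Stages 2 and 3 can be sketched as a toy pipeline, assuming the signal is already digitized: split the stream on pauses, then match each segment to the nearest dictionary template. Real systems compare spectral features, not raw samples; the templates and numbers here are invented for illustration.

```python
# Toy illustration of stages 2-3: pause-based segmentation followed by
# nearest-template matching against a (hypothetical) dictionary.
import math

# Hypothetical "dictionary": word -> a short template feature vector.
DICTIONARY = {
    "yes": [0.9, 0.1, 0.8],
    "no":  [0.1, 0.9, 0.2],
}

def split_on_pauses(samples, silence=0.05):
    """Stage 3's segmentation: cut the stream wherever amplitude drops."""
    segments, current = [], []
    for s in samples:
        if abs(s) > silence:
            current.append(s)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

def match(segment):
    """Stage 2: pick the dictionary word whose template is closest."""
    def dist(a, b):
        n = min(len(a), len(b))
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(n)))
    return min(DICTIONARY, key=lambda w: dist(DICTIONARY[w], segment))

# A made-up "stream": two bursts of sound separated by silence.
stream = [0.9, 0.1, 0.8, 0.0, 0.0, 0.1, 0.9, 0.2]
words = [match(seg) for seg in split_on_pauses(stream)]
print(words)  # -> ['yes', 'no']
```

The sketch also makes the dictionary-size problem discussed below tangible: with only two templates, everything is forced onto "yes" or "no" no matter what was actually said.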

The main problem, as you might guess, lies in two things: the quality of the digitized speech segment and the size of the template dictionary. The first can genuinely be minimized even with a cheap microphone and a standard sound card: it is enough to speak slowly and clearly.

With the second problem, alas, things are not so simple... A computer, unlike a person, cannot correctly recognize the same phrase spoken, say, by a woman and by a man, unless both renditions exist in its database!

And here lies the main catch. Creating a dictionary for one person is, in principle, not so difficult; but given that each word must be recorded in several variants, it becomes a very long and labor-intensive job. That is why most speech recognition programs today are either too expensive or ship without dictionaries of their own, leaving users to build them themselves.

It was no accident that I mentioned Android above. Google, which develops it, has also created what is today the only publicly available global online dictionary for speech recognition (a multilingual one, at that), the Google Voice API. Yandex is building a similar dictionary for Russian, but so far, alas, it is not yet fit for real-world use. Therefore almost all the free solutions considered below work with Google's dictionaries. Accordingly, they all offer the same recognition quality, and the differences lie only in additional capabilities...

Voice input programs

There are not many full-fledged voice input programs for Windows, and those that exist and understand Russian are mostly paid... For example, the popular custom voice-to-text system RealSpeaker starts at 2,587 rubles, and the professional Caesar-R complex at 35,900 rubles!

But among all this expensive software there is one program that costs not a penny yet provides functionality more than sufficient for most users. It is called MSpeech:

The main window has the simplest possible interface: a sound level indicator and just three buttons for starting recording, stopping it, and opening the settings window. MSpeech works just as simply: press the record button, place the cursor in the window where the text should appear, and start dictating. For convenience, it is better to start and stop recording with hotkeys, which can be assigned in Settings:

Besides hotkeys, you may need to change how text is delivered to the target program's window. By default output goes to the active window, but you can direct it to inactive fields or to the fields of a specific program. Among the extra features, the “Commands” settings group deserves a mention: it lets you implement voice control of the computer using phrases you define.

Overall, MSpeech is a fairly convenient program that lets you type text by voice in any Windows window. The only caveat is that the computer must be connected to the Internet to reach Google's dictionaries.

Voice input online

If you don't want to install anything on your computer but would like to try entering text by voice, you can use one of the many online services built on the same Google dictionaries.

Well, of course, the first thing worth mentioning is Google’s “native” service called Web Speech API:

This service can transcribe speech segments of unlimited length in more than 50 languages! Just select the language you speak, click the microphone icon in the upper right corner of the form, confirm the site's access to the microphone if asked, and start speaking.

If you avoid highly specialized terminology and speak clearly, the result can be very good. Beyond words, the service also “understands” punctuation: say “period” or “comma” and the corresponding symbol appears in the output form.
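The spoken-punctuation behaviour can be imitated with a small post-processing step over the recognized words. The word list below is only a guess at what such services support; real vocabularies are theirs, not documented here.

```python
# Sketch of spoken-punctuation post-processing: words like "period"
# and "comma" are replaced with symbols and glued to the preceding
# word. The PUNCT table is an assumption for illustration.

PUNCT = {"period": ".", "comma": ",", "question mark": "?"}

def _attach(out, symbol):
    # glue punctuation onto the previous word, if any
    if out:
        out[-1] += symbol
    else:
        out.append(symbol)

def apply_spoken_punctuation(text):
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        two = " ".join(words[i:i + 2]).lower()  # two-word names first
        one = words[i].lower()
        if two in PUNCT:
            _attach(out, PUNCT[two])
            i += 2
        elif one in PUNCT:
            _attach(out, PUNCT[one])
            i += 1
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

print(apply_spoken_punctuation("hello comma world period"))
# -> hello, world.
```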

When recording is complete, the recognized text will be automatically highlighted and you can copy it to the clipboard or send it by mail.

Among the shortcomings, note that the service works only in Google Chrome version 25 or later, and that it cannot recognize several languages at once.

By the way, on our website at the top you will find a completely Russified version of the same form of speech recognition. Enjoy it for your health ;)

There are quite a few similar online speech recognition resources built on the Google service. One that deserves our attention is Dictation.io:

Unlike the Web Speech API, Dictation.io is styled as a notepad. Its main advantage over Google's service is that you can stop recording and later resume it, and the previously entered text is preserved until you press the “Clear” button.

Like the Google service, Dictation.io knows how to insert periods, commas, exclamation marks and question marks, but it does not always begin a new sentence with a capital letter.

If you are looking for a service with maximum functionality, then probably one of the best in this regard will be:

Main advantages of the service:

  • availability of Russian-language interface;
  • the ability to view and select recognition options;
  • presence of voice prompts;
  • automatic recording shutdown after a long pause;
  • built-in text editor with functions for copying text to the clipboard, printing it on a printer, sending it by mail or Twitter, and translating it into other languages.

The only drawback of the service (besides the general Web Speech API shortcomings already described) is a workflow that is unusual for services of this kind: after pressing the record button and dictating, you must check the text, select the variant that best matches what you meant to say, and then transfer it to the text editor below. Then the procedure can be repeated.

Plugins for Chrome

Besides full-fledged programs and online services, there is one more way to turn speech into text: plugins for the Google Chrome browser.

The main advantage of plugins is that they let you enter text by voice not just into a special form on a service's website, but into any input field on any web resource! In effect, plugins occupy a niche between online services and full-fledged voice input programs.

One of the best extensions for translating speech to text is SpeechPad:

It is no exaggeration to say that SpeechPad is one of the best Russian-language speech-to-text services. On its official website you will find a fairly powerful (if slightly dated in design) online notepad with many advanced functions, including:

  • support for voice commands for computer control;
  • improved punctuation support;
  • function to mute sounds on PC;
  • integration with Windows (albeit on a paid basis);
  • the ability to recognize text from video or audio recordings ("Transcription" function);
  • translation of recognized text into any language;
  • saving text to a text file available for downloading.

The plugin itself provides a stripped-down version of the service's functionality. Place the cursor in the input field you need, open the context menu and click the “SpeechPad” item. Then confirm access to the microphone and, once the input field turns pink, dictate the desired text.

Once you stop speaking (a pause longer than 2 seconds), the plugin stops recording on its own and inserts everything you said into the field. If you wish, you can open the plugin's settings (right-click its icon at the top) and change the default parameters:

Oddly enough, in the entire Chrome Web Store I did not come across another worthwhile plugin for voice input into arbitrary text fields. The only comparable extension I found was an English-language one: it adds a microphone icon to every input field on a page, but does not always position it correctly, so the icon can end up off screen...

“I should say right away that this is my first time dealing with recognition services, so I will describe them from a layman's point of view,” our expert noted. “To test recognition I used three services: Google, Yandex and Azure.”

Google

The well-known IT corporation offers its Google Cloud Platform product for online testing, and anyone can try the service for free. The product itself is convenient and easy to use.

Pros:

  • support for more than 80 languages;
  • fast processing of requests;
  • high-quality recognition in conditions of poor communication and in the presence of extraneous sounds.

Minuses:

  • it has difficulty recognizing speech with an accent or poor pronunciation, which makes the system hard to use for anyone but native speakers;
  • the service lacks proper technical support.

Yandex

Speech recognition from Yandex is available in several options:

  • Cloud
  • Library for access from mobile applications
  • "Boxed" version
  • JavaScript API

But let us be objective: what interests us above all is not the variety of usage options but the quality of recognition. We therefore used the trial version of SpeechKit.

Pros:

  • ease of use and configuration;
  • good text recognition in Russian;
  • the system offers several candidate answers and, using neural networks, tries to pick the one closest to the truth.

Minuses:

  • During stream processing, some words may be determined incorrectly.

Azure

Azure is developed by Microsoft and stands out among its analogues for its price. But be prepared for some difficulties: the instructions on the official website are incomplete or outdated, and we were unable to launch the service properly, so we had to use a third-party launch window. Even then you will need an Azure service key for testing.

Pros:

  • Compared to other services, Azure processes messages very quickly in real time.

Minuses:

  • the system is very sensitive to accents and struggles with speech from non-native speakers;
  • the system operates only in English.

Review results:

After weighing all the pros and cons, we settled on Yandex. SpeechKit is more expensive than Azure but cheaper than Google Cloud Platform. Google's service, admittedly, keeps improving in recognition quality and accuracy, refining itself through machine learning. Yet Yandex's recognition of Russian words and phrases is a cut above.

How to use voice recognition in business?

There are many possible uses for recognition, but we will focus on the one that most directly affects your company's sales. For clarity, let us walk through the recognition process using a real example.

Not long ago, a well-known SaaS service became our client (at the company's request, its name is not disclosed). With the help of F1Golos it recorded two audio messages: one aimed at extending the lifetime of warm customers, the other at processing customer requests.

How to extend customer life using voice recognition?

SaaS services typically run on a monthly subscription. Sooner or later the trial period or the paid traffic runs out, and the service needs to be renewed. The company decided to warn users about the end of their traffic 2 days before the expiration date, notifying them with a voice message that said: “Good afternoon! We remind you that your paid period for using the XXX service is ending. To renew the service, say yes; to cancel, say no.”

Calls from users who said one of the code words (YES, RENEW, I WANT, MORE DETAILS) were automatically transferred to the company's operators. As a result, about 18% of users renewed their subscription thanks to a single call.
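The routing rule described above amounts to a simple keyword check on the recognized reply. The trigger words are taken from the article; the matching logic below is an assumption about how such a check might look.

```python
# Hedged sketch of the call-routing rule: if the recognized reply
# contains any trigger phrase, the call goes to an operator.

RENEW_TRIGGERS = {"yes", "renew", "i want", "more details"}

def route_call(transcript):
    """Return 'operator' when a trigger phrase occurs, else 'hangup'."""
    t = transcript.lower()
    if any(trigger in t for trigger in RENEW_TRIGGERS):
        return "operator"
    return "hangup"

print(route_call("Yes, more details please"))  # -> operator
print(route_call("No thanks"))                 # -> hangup
```

Substring matching is the crudest possible approach; a production system would work on the recognizer's word lattice or confidence scores rather than raw text.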

How to simplify a data processing system using speech recognition?

The second audio message launched by the same company was of a different kind: voice messaging was used to cut the cost of verifying phone numbers. Previously the company verified user numbers with a robocall in which the robot asked users to press certain phone keys. With the advent of recognition technologies, it changed tactics. The new message ran: “You have registered on the XXX portal. If you confirm your registration, say yes. If you did not submit a registration request, say no.” If the client uttered YES, I CONFIRM, UH-HUH or OF COURSE, this was instantly passed to the company's CRM system, and the registration request was confirmed automatically within a couple of minutes. Introducing recognition cut the length of a call from 30 to 17 seconds, reducing the company's costs almost by half.

If you are interested in other ways to use voice recognition, or want to learn more about voice messaging, follow the link: on F1Golos you can sign up for your first mailing for free and see for yourself how the new recognition technologies work.


Work on speech recognition dates back to the middle of the last century. The first system was created in the early 1950s: its developers set themselves the task of recognizing digits. The system could identify digits, but only spoken in a single voice; such was the Bell Laboratories “Audrey” system, which worked by locating the formant in the power spectrum of each speech passage. In general terms it consisted of three main parts: an analyzer and quantizer, a network pattern matcher and, finally, sensors. It was built, accordingly, from components such as frequency filters and switches; the sensors also included gas-filled tubes.

By the end of the decade, systems had appeared that recognized vowels independently of the speaker. In the 1970s, new methods came into use that allowed more advanced results: dynamic programming and linear prediction (Linear Predictive Coding, LPC). The aforementioned Bell Laboratories built systems on precisely these methods. In the 1980s the next step was the use of Hidden Markov Models (HMM), and the first large voice recognition programs began to appear, such as the Kurzweil products. In the late 1980s, artificial neural network (ANN) methods also came into use. In 1987, Worlds of Wonder's Julie dolls, capable of understanding a voice, appeared on the market, and 10 years later Dragon Systems released NaturallySpeaking 1.0.
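To give a feel for what the Hidden Markov Models mentioned above do inside a recognizer, here is a minimal Viterbi decoder that recovers the most likely hidden sequence (say, sound classes) behind a string of observed acoustic symbols. All states and probabilities are invented for the demo.

```python
# Minimal Viterbi decoding for a two-state HMM: given observed symbols,
# find the most probable sequence of hidden states. Real recognizers
# use phoneme states and continuous acoustic features instead.

states = ["vowel", "consonant"]
start = {"vowel": 0.5, "consonant": 0.5}
trans = {
    "vowel": {"vowel": 0.3, "consonant": 0.7},
    "consonant": {"vowel": 0.7, "consonant": 0.3},
}
emit = {
    "vowel": {"a": 0.8, "t": 0.2},
    "consonant": {"a": 0.2, "t": 0.8},
}

def viterbi(observations):
    # best[s]: probability of the best path ending in state s
    best = {s: start[s] * emit[s][observations[0]] for s in states}
    paths = {s: [s] for s in states}
    for obs in observations[1:]:
        new_best, new_paths = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] * trans[p][s])
            new_best[s] = best[prev] * trans[prev][s] * emit[s][obs]
            new_paths[s] = paths[prev] + [s]
        best, paths = new_best, new_paths
    winner = max(states, key=lambda s: best[s])
    return paths[winner]

print(viterbi(["t", "a", "t"]))
# -> ['consonant', 'vowel', 'consonant']
```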

Reliability

The main sources of voice recognition errors are:

Gender recognition can be singled out as a separate problem that is solved quite successfully: with large amounts of input data, gender is determined almost without error, while on short passages such as a stressed vowel the error probability is 5.3% for men and 3.1% for women.

The problem of voice imitation has also been studied. Research by France Telecom has shown that professional imitation barely increases the probability of an identification error: imitators fake a voice only superficially, exaggerating features of speech, and cannot fake its basic contour. Even the voices of close relatives or twins differ, at least in the dynamics of articulation control. With the development of computer technology, however, a new problem has arisen that demands new methods of analysis: voice transformation, which raises the error probability to 50%.

Two criteria are used to describe a system's reliability: FRR (False Rejection Rate), the probability of a false denial of access (a type I error), and FAR (False Acceptance Rate), the probability of false acceptance, when the system mistakenly identifies a stranger as a known user (a type II error). Recognition systems are also sometimes characterized by the EER (Equal Error Rate), the point at which the FRR and FAR probabilities coincide; the more reliable the system, the lower its EER.
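The relationship between FRR, FAR and EER can be illustrated by sweeping a decision threshold over two sets of match scores and finding the point where the two error rates meet. The score lists below are made-up demo data.

```python
# Sketch of EER estimation: compute FRR and FAR at each candidate
# threshold and take the threshold where they are closest.

def frr_far(genuine, impostor, threshold):
    # FRR: genuine scores rejected; FAR: impostor scores accepted
    frr = sum(1 for s in genuine if s < threshold) / len(genuine)
    far = sum(1 for s in impostor if s >= threshold) / len(impostor)
    return frr, far

def equal_error_rate(genuine, impostor):
    best = None
    for t in sorted(set(genuine + impostor)):
        frr, far = frr_far(genuine, impostor, t)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2)
    return best[1]

# Invented match scores: genuine users score high, impostors low,
# with one overlapping impostor score to make the errors non-zero.
genuine = [0.9, 0.8, 0.75, 0.6, 0.55]
impostor = [0.5, 0.45, 0.4, 0.65, 0.3]

eer = equal_error_rate(genuine, impostor)
print(round(eer, 2))  # -> 0.2
```

Lowering the threshold trades false rejections for false acceptances; the EER is simply the point where that trade-off balances, which is why a lower EER indicates a more reliable system.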

    [Figure: identification error values for various biometric modalities]

    Application

    Recognition splits into two main tasks: identification and verification. In identification, the system must determine on its own who is speaking; in verification, it must confirm or reject the identity claimed by the user. Identifying the speaker under investigation comes down to pairwise comparison of voice models that capture the individual speech characteristics of each speaker, so a fairly large database must first be collected. Based on the results of this comparison, a list of phonograms can be generated that, with some probability, contain the speech of the user of interest.
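A toy illustration of the two modes (the recording names and log-likelihood scores are invented for the example):

```python
# Hypothetical log-likelihood scores of five phonograms against a target model.
scores = {"rec_01": -310.2, "rec_02": -295.8, "rec_03": -402.5,
          "rec_04": -298.1, "rec_05": -350.0}

# Identification: rank the phonograms by how well they match the target speaker.
ranked = sorted(scores, key=scores.get, reverse=True)

# Verification: accept the claimed identity only above a decision threshold.
def verify(recording, threshold=-300.0):
    return scores[recording] >= threshold
```

Identification returns the ranked list; verification reduces to a single yes/no decision against the threshold.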

    Although voice recognition cannot guarantee a 100% correct result, it can be used quite effectively in areas such as forensic science and forensic examination, intelligence work, anti-terrorist monitoring, security, banking and so on.

    Analysis

    The entire process of processing a speech signal can be divided into several main stages:

    • signal preprocessing;
    • feature extraction;
    • speaker recognition.

    Each stage represents an algorithm or some set of algorithms, which ultimately produces the required result.

    The main features of the voice are shaped by three factors: the mechanics of vocal-fold vibration, the anatomy of the vocal tract, and the articulation control system. Sometimes the speaker's vocabulary and characteristic turns of phrase can be used as well. The features on which a decision about the speaker's identity is based reflect all stages of speech production: the voice source, the resonant frequencies of the vocal tract and their attenuation, and the dynamics of articulation control. Properties of the voice source include the average fundamental frequency, the contour and fluctuations of the fundamental frequency, and the shape of the excitation pulse. The spectral characteristics of the vocal tract are described by the spectrum envelope and its average slope, the formant frequencies, and the long-term spectrum or cepstrum. Word durations, rhythm (stress distribution), signal level, and the frequency and duration of pauses are also considered. Determining these characteristics directly requires rather complex algorithms, and since, for example, the error in estimating formant frequencies is quite large, in practice cepstral coefficients are used instead, computed from the spectrum envelope or from the transfer function of the vocal tract found by linear prediction. Besides the cepstral coefficients themselves, their first and second time differences are used. This approach was first proposed in the work of Davis and Mermelstein.

    Cepstral analysis

    In work on voice recognition, the most popular method is the cepstral transformation of the spectrum of the speech signal. The scheme is as follows: over a time window of 10-20 ms the current power spectrum is computed, and then the inverse Fourier transform of the logarithm of this spectrum (the cepstrum) is taken, yielding the coefficients

    c_n = \frac{1}{\Theta} \int_0^{\Theta} |S(j\omega, t)|^2 \, e^{-jn\omega\Omega} \, d\omega, \qquad \Omega = \frac{2\pi}{\Theta},

    where \Theta is the highest frequency in the spectrum of the speech signal and |S(j\omega, t)|^2 is the power spectrum. The number of cepstral coefficients n depends on the required degree of spectrum smoothing and ranges from 20 to 40. If a bank of bandpass filters is used, the discrete cepstral coefficients are computed as

    c_n = \sum_{m=1}^{M} \log\bigl(Y(m)^2\bigr) \cos\Bigl(\frac{\pi n}{M}\Bigl(m - \frac{1}{2}\Bigr)\Bigr),

    where Y(m) is the output signal of the m-th filter and c_n is the n-th cepstral coefficient.
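The scheme above can be sketched in pure Python. This is a deliberately naive illustration: a real system would use an FFT over windowed, pre-emphasized frames, not the O(N²) DFT shown here.

```python
import math, cmath

def dft(x):
    # Naive discrete Fourier transform, O(N^2): for illustration only.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def cepstrum(frame, n_coeffs=13):
    # 1) power spectrum of a short (10-20 ms) frame
    power = [abs(X) ** 2 for X in dft(frame)]
    # 2) logarithm of the power spectrum (small floor avoids log(0))
    log_power = [math.log(p + 1e-12) for p in power]
    # 3) cosine transform of the log spectrum -> cepstral coefficients,
    #    mirroring the filter-bank formula c_n = sum log(Y(m)^2) cos(...)
    M = len(log_power)
    return [sum(log_power[m] * math.cos(math.pi * q * (m + 0.5) / M)
                for m in range(M)) / M
            for q in range(n_coeffs)]
```

The low-order coefficients describe the smooth spectrum envelope; the higher ones capture fine spectral detail.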

    The properties of hearing are taken into account through a nonlinear transformation of the frequency scale, usually to the mel scale. This scale is based on the existence of so-called critical bands in hearing: signals of any frequency within one critical band are indistinguishable. The mel scale is computed as M(f) = 1125 \ln(1 + f/700), where f is the frequency in hertz and M the frequency in mels. Alternatively, the bark scale is used, on which a difference between two frequencies equal to one critical band is 1 bark; the frequency B is computed as B = 13 \arctan(0.00076 f) + 3.5 \arctan((f/7500)^2). The resulting coefficients are referred to in the literature as MFCC (Mel-Frequency Cepstral Coefficients); their number ranges from 10 to 30. Using the first and second time differences of the cepstral coefficients triples the dimension of the decision space but improves speaker recognition accuracy.
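Both scales are easy to compute directly from the formulas above:

```python
import math

def hz_to_mel(f):
    # M(f) = 1125 * ln(1 + f / 700)
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of the mel formula.
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

def hz_to_bark(f):
    # B = 13 * arctan(0.00076 f) + 3.5 * arctan((f / 7500)^2)
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)
```

A useful sanity check is that 1000 Hz lands near 1000 mel, which is how the mel scale was calibrated.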

    The cepstrum describes the shape of the signal spectrum envelope, which is influenced by both the properties of the excitation source and the features of the vocal tract. Experiments have shown that the spectrum envelope has a strong influence on voice recognition. Therefore, the use of various methods of analyzing the spectrum envelope for voice recognition purposes is quite justified.

    Methods

    The GMM method follows from the theorem that any probability density function can be represented as a weighted sum of normal distributions:

    p(x \mid \lambda) = \sum_{j=1}^{k} \omega_j \, \phi(x, \Theta_j),

    where \lambda is the speaker model, k is the number of mixture components, and \omega_j are the component weights, satisfying \sum_{j=1}^{k} \omega_j = 1. Each \phi(x, \Theta_j) is a multivariate normal density:

    \phi(x, \Theta_j) = p(x \mid \mu_j, R_j) = \frac{1}{(2\pi)^{n/2} \, |R_j|^{1/2}} \exp\Bigl(-\frac{1}{2} (x - \mu_j)^T R_j^{-1} (x - \mu_j)\Bigr),

    where n is the dimension of the feature space, \mu_j \in \mathbb{R}^n is the mean vector of the j-th mixture component, and R_j \in \mathbb{R}^{n \times n} is its covariance matrix.

    Very often, systems based on this model use a diagonal covariance matrix, which may be shared across all components of a model or even across all models. The EM algorithm is commonly used to estimate the covariance matrices, weights and mean vectors. The input is a training sequence of vectors X = (x_1, ..., x_T); the model parameters are initialized and then re-estimated at each iteration of the algorithm. The initial parameters are usually obtained with a clustering algorithm such as K-means: after the training vectors have been divided into M clusters, the initial means \mu_j are set to the cluster centers, the covariance matrices are computed from the vectors belonging to each cluster, and the component weights are set to the fraction of training vectors falling into each cluster.
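A toy 1-D diagonal-covariance GMM can be evaluated directly from the mixture formula. The weights, means and variances below are invented for illustration, not trained on real speech:

```python
import math

# A toy 1-D speaker model with k = 2 components (made-up parameters).
weights = [0.4, 0.6]          # must sum to 1
means   = [0.0, 3.0]
vars_   = [1.0, 0.5]          # diagonal covariance -> per-dimension variances

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gmm_density(x):
    # p(x | lambda) = sum_j w_j * phi(x; mu_j, var_j)
    return sum(w * gaussian(x, m, v) for w, m, v in zip(weights, means, vars_))

def log_likelihood(frames):
    # Score a sequence of feature frames against the speaker model.
    return sum(math.log(gmm_density(x)) for x in frames)
```

In a real system one such model is trained per speaker and test frames are scored against each model's log-likelihood.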

    The parameters are then re-estimated at each iteration using the standard EM update formulas.

    GMM can also be seen as an extension of the vector quantization (centroid) method, which builds a codebook of disjoint regions in the feature space (often via K-means clustering). Vector quantization is the simplest model used in text-independent recognition systems.
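A crude 1-D sketch of codebook construction with K-means (toy values and a naive initialization, purely for illustration):

```python
def kmeans_1d(values, k=2, iters=20):
    # Naive initialization: pick k spread-out values from the sorted data.
    step = max(1, len(values) // k)
    centroids = sorted(values)[::step][:k]
    for _ in range(iters):
        # Assign each value to its nearest centroid...
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # ...then move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids  # the codebook: one centroid per region
```

In vector quantization a frame is encoded by the index of its nearest centroid; in GMM training the same centroids serve as initial component means.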

    The support vector machine (SVM) builds a hyperplane in a multidimensional space that separates two classes: the parameters of the target speaker and the parameters of speakers from a reference base. The hyperplane is defined by specially chosen support vectors. Because the separating surface in the original space may not be a hyperplane, a nonlinear transformation maps the measured parameters into a feature space of higher dimension, where the separating hyperplane is constructed by the support vector method, provided that linear separability holds in the new space. The success of SVM therefore depends on the nonlinear transformation chosen in each specific case. The support vector machine is often combined with the GMM or HMM method. Typically, for short phrases lasting a few seconds, phoneme-dependent HMMs are better suited to the text-dependent approach.
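A minimal sketch of the separating-hyperplane idea, with made-up 2-D "feature vectors" and a simple stochastic sub-gradient trainer (Pegasos-style) instead of a full SVM solver:

```python
import random

# Toy 2-D features: target speaker near (1, 1), impostors near (-1, -1).
random.seed(0)
target   = [(1 + random.gauss(0, 0.3), 1 + random.gauss(0, 0.3)) for _ in range(50)]
impostor = [(-1 + random.gauss(0, 0.3), -1 + random.gauss(0, 0.3)) for _ in range(50)]
data = [(p, +1) for p in target] + [(p, -1) for p in impostor]

# Linear SVM trained by stochastic sub-gradient descent on the hinge loss.
w, b, lam = [0.0, 0.0], 0.0, 0.01
for t in range(1, 2001):
    (x1, x2), y = random.choice(data)
    eta = 1.0 / (lam * t)
    w = [wi * (1.0 - eta * lam) for wi in w]      # regularization shrink
    if y * (w[0] * x1 + w[1] * x2 + b) < 1:       # inside the margin: push out
        w[0] += eta * y * x1
        w[1] += eta * y * x2
        b += eta * y

def classify(x1, x2):
    # Sign of the decision function w.x + b: +1 target, -1 impostor.
    return 1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1
```

For data that is not linearly separable, a kernel (the nonlinear transformation mentioned above) replaces the raw dot product.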

    Popularity

    According to New York-based consulting company International Biometric Group, the most common technology is fingerprint scanning. It is noted that of the $127 million in revenue from the sale of biometric devices, 44% comes from fingerprint scanners. Facial recognition systems rank second in terms of demand at 14%, followed by palm shape recognition devices (13%), voice recognition (10%) and iris recognition (8%). Signature verification devices make up 2% of this list. Some of the most famous manufacturers in the voice biometrics market are Nuance Communications, SpeechWorks, VeriVoice.

    In February 2016, The Telegraph reported that customers of the British bank HSBC would be able to access their accounts and carry out transactions using voice identification; the switchover was expected in early summer.

    Man has always been attracted by the idea of controlling a machine in natural language. Perhaps this is partly due to the human desire to be ABOVE the machine, to feel superior, so to speak. But the main point is to simplify human interaction with the computer. Voice control under Linux has been implemented, with varying degrees of success, for almost a quarter of a century. Let's dig into the subject and try to get on speaking terms with our OS.

    The crux of the matter

    Systems for working with human voice for Linux have been around for a long time, and there are a great many of them. But not all of them process Russian speech correctly. Some were completely abandoned by the developers. In the first part of our review, we will talk directly about speech recognition systems and voice assistants, and in the second, we will look at specific examples of their use on a Linux desktop.

    It is necessary to distinguish between speech recognition systems themselves (translation of speech into text or into commands), such as, for example, CMU Sphinx, Julius, as well as applications based on these two engines, and voice assistants, which have become popular with the development of smartphones and tablets. This is, rather, a by-product of speech recognition systems, their further development and the implementation of all successful ideas of voice recognition, their application in practice. There are few of these for Linux desktops yet.

    You need to understand that the speech recognition engine and the interface to it are two different things. This is the basic principle of Linux architecture - dividing a complex mechanism into simpler components. The most difficult work falls on the shoulders of the engines. This is usually a boring console program that runs unnoticed by the user. The user interacts mainly with the interface program. Creating an interface is not difficult, so developers focus their main efforts on developing open-source speech recognition engines.

    What happened before

    Historically, speech processing systems on Linux have developed slowly and in fits and starts. The reason is not the ineptitude of the developers but the high barrier to entry: writing system code for working with voice requires a highly qualified programmer. So before digging into speech systems on Linux, a short historical excursion is in order. IBM once had a wonderful operating system, OS/2 Warp (Merlin), released back in September 1996. Besides its obvious advantages over the other operating systems of the day, OS/2 shipped with a very advanced speech recognition system, IBM ViaVoice. For its time this was very cool, considering that the OS ran on systems with a 486 processor and 8 MB of RAM (!).

    As you know, OS/2 lost the battle to Windows, but many of its components continued to exist independently. One of these components was the same IBM ViaVoice, which turned into an independent product. Since IBM always loved Linux, ViaVoice was ported to this OS, which gave the brainchild of Linus Torvalds the most advanced speech recognition system of its time.

    Unfortunately, the fate of ViaVoice did not turn out the way Linux users would have liked. The engine itself was distributed free of charge, but its sources remained closed. In 2003, IBM sold the rights to the technology to the Canadian-American company Nuance, which developed perhaps the most successful commercial speech recognition product, Dragon NaturallySpeaking, and is still alive today. That is nearly the whole inglorious history of ViaVoice on Linux. During the short time that ViaVoice was free and available to Linux users, several interfaces were developed for it, such as Xvoice; however, that project has long been abandoned and is now practically inoperable.

    INFO

    The most difficult part of machine speech recognition is natural human language.

    What today?

    Today everything is much better. In recent years, since the Google Voice API was opened up, the development of speech recognition systems for Linux has improved significantly and recognition quality has increased. For example, the Linux Speech Recognition project based on the Google Voice API shows very good results for Russian. All engines work in roughly the same way: sound from the microphone of the user's device enters the recognition system, after which the voice is either processed on the local device or the recording is sent to a remote server for further processing. The second option is better suited to smartphones and tablets; in fact, this is exactly how the commercial engines Siri, Google Now and Cortana work.

    Of the variety of engines for working with the human voice, there are several that are currently active.

    WARNING

    Installing many of the described speech recognition systems is a non-trivial task!

    CMU Sphinx

    Much of the development of CMU Sphinx takes place at Carnegie Mellon University; at different times, both the Massachusetts Institute of Technology and the now defunct Sun Microsystems have worked on the project. The engine sources are distributed under a BSD license and may be used both commercially and non-commercially. Sphinx is not an end-user application but a set of tools from which end-user applications can be built. It is now the largest speech recognition project and consists of several parts:

    • Pocketsphinx - a small, fast program that processes sound using acoustic models, grammars and dictionaries;
    • Sphinxbase - a support library required by Pocketsphinx;
    • Sphinx4 - the recognition library itself;
    • Sphinxtrain - a program for training acoustic models from recordings of the human voice.

    The project is developing slowly but surely, and most importantly, it can be used in practice, not only on PCs but also on mobile devices. The engine also works very well with Russian speech. With straight hands and a clear head, you can set up Russian speech recognition with Sphinx to control home appliances or a smart home; in fact, you can turn an ordinary apartment into a smart home, which is what we will do in the second part of this review. Sphinx implementations are available for Android, iOS and even Windows Phone. Unlike the cloud approach, where the recognition work falls on the shoulders of Google ASR or Yandex SpeechKit servers, Sphinx is more accurate, faster and cheaper, and runs entirely locally. If you wish, you can teach Sphinx a Russian language model and a grammar of user queries. Installation does take some work, and setting up Sphinx voice models and libraries is not a job for beginners. Because the core of CMU Sphinx, the Sphinx4 library, is written in Java, its code can be included in your own speech recognition applications; specific examples of use will be described in the second part of our review.
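Command recognition in Pocketsphinx is typically driven by a grammar in JSGF format. A hypothetical smart-home grammar might look like this (the grammar name, rule names and command phrases are invented for illustration):

```
#JSGF V1.0;

grammar smarthome;

public <command> = <action> <device> [<location>];

<action>   = turn on | turn off | dim;
<device>   = the light | the kettle | the heater;
<location> = in the kitchen | in the bedroom;
```

A grammar like this constrains the recognizer to a small set of phrases, which raises accuracy dramatically compared with free-form dictation.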

    VoxForge

    Let us dwell separately on the concept of a speech corpus. A speech corpus is a structured set of speech fragments supplied with software for accessing its individual elements; in other words, a collection of human voices in different languages. No speech recognition system can operate without one. Building a high-quality open speech corpus alone, or even with a small team, is difficult, so a dedicated project, VoxForge, collects recordings of human voices.

    Anyone with access to the Internet can contribute to the creation of a speech corpus by simply recording and submitting a speech fragment. This can be done even by phone, but it is more convenient to use the website. Of course, in addition to the audio recording itself, the speech corpus must include additional information, such as phonetic transcription. Without this, speech recording is meaningless for the recognition system.


    HTK, Julius and Simon

    HTK (Hidden Markov Model Toolkit) is a toolkit for researching and developing speech recognition tools based on hidden Markov models, created at the University of Cambridge under the patronage of Microsoft (Microsoft once bought the code from the commercial firm Entropic Cambridge Research Laboratory Ltd and then returned it to Cambridge together with a restrictive license). The project's sources are available to everyone, but the license prohibits using HTK code in products intended for end users.

    However, this does not mean that HTK is useless to Linux developers: it can serve as an auxiliary tool when developing open-source (and commercial) speech recognition software, which is exactly what the developers of the open-source Julius engine, developed in Japan, do. Julius works best with Japanese; Russian is not left out either, since the same VoxForge is used as the voice database.
