A Simple Investigation into Modern Speech Recognition

A beginner’s introduction to the ubiquitous algorithm

Mar 26, 2021

Background

Speech recognition is built into many modern technologies, from phones to smart speakers to cars. However, most people have little understanding of how this software sorts through sounds and picks real words out of them. My aim in this investigation is to walk through a short history of speech recognition, Markov models, how the recognition method works, and finally some simple Python code to help you write your first speech recognition program.

The Shoebox speech recognition machine made by IBM.

History

Although speech recognition has only recently become a large part of modern technology, people have been developing these systems since the 1950s. The early speech recognition machines looked very different from modern ones, not only in size but also in how they processed data. These projects were based entirely on recognizing patterns in a small set of sounds rather than the vocabulary of a full language. For example, Audrey, one of the earliest speech recognition projects, was a machine that could only recognize spoken numerical digits. Another of the best-known early machines was Shoebox, developed by IBM, which recognized digits and arithmetic commands such as plus or total and then passed the problem to an adding machine. Progress remained incremental until Hidden Markov Models came into widespread use for speech recognition in the 1980s.

Markov Models

To understand how modern speech recognition works, it’s important to understand what a Markov model is and how it functions. A simple Markov model is a set of states together with the probabilities of moving from one state to another, where the next state depends only on the current one. It can be used to describe many different probabilistic problems. A good way to understand this is the Markov process of a single fair coin flip given below.

There are two states here, heads (1) and tails (2). There is a 50% chance after you get heads to get another heads and a 50% chance after you get heads to get tails. The probability of staying in state 1 (heads) is given by the arrow looping back to state 1, and the probability of moving to state 2 (tails) is given by the arrow pointing from state 1 to state 2.

This is a very simplistic example, but you could imagine a slightly more complex process such as weather. If state 1 is warm weather and state 2 is cold weather, there may be a higher chance of staying in the warm state during each time period than of moving from the warm state to the cold state. Similarly, there would likely be a high probability of staying in the cold state if the current state is cold.
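The coin and weather chains can be written down as tables of transition probabilities. The sketch below is only an illustration: the weather numbers (0.8, 0.2, 0.3, 0.7) and the helper names simulate and sequence_probability are assumptions chosen for this example, not values or functions from any library.

```python
import random

# Transition probabilities, read as TABLE[current_state][next_state].
# The coin is fair; the weather numbers are made-up assumptions.
COIN = {
    "heads": {"heads": 0.5, "tails": 0.5},
    "tails": {"heads": 0.5, "tails": 0.5},
}
WEATHER = {
    "warm": {"warm": 0.8, "cold": 0.2},
    "cold": {"warm": 0.3, "cold": 0.7},
}


def simulate(transitions, start, steps, rng):
    """Walk the chain for `steps` transitions and return the visited states."""
    state = start
    path = [state]
    for _ in range(steps):
        next_states = list(transitions[state])
        weights = [transitions[state][s] for s in next_states]
        # The next state depends only on the current state.
        state = rng.choices(next_states, weights=weights)[0]
        path.append(state)
    return path


def sequence_probability(transitions, path):
    """Probability of following `path`, given that it starts in path[0]."""
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        prob *= transitions[current][nxt]
    return prob


rng = random.Random(0)  # seeded so repeated runs give the same walk
print(simulate(WEATHER, "warm", 10, rng))
print(sequence_probability(WEATHER, ["warm", "warm", "cold"]))  # 0.8 * 0.2
print(sequence_probability(COIN, ["heads", "tails", "heads"]))  # 0.5 * 0.5
```

Each row of a table sums to 1, because from any state the chain has to go somewhere, even if that means staying put.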

Often, Markov model problems are framed so that the states cannot be observed directly and none of the transition probabilities are given. The main example of this is the Hidden Markov Model.

Say we have a list of observations of how much ice cream an individual ate each day. We want to use that sequence of observations to determine, without directly observing the weather, the Markov model that governs the transitions between warm weather and cold weather. To do that, the Hidden Markov Model compares the conditional probabilities of eating a given amount of ice cream when it’s warm with the conditional probabilities of eating that amount when it’s cold. Solving this model is a relatively complex problem beyond the scope of this article, but you can assume that finding these transition probabilities is a solvable problem.

Hidden Markov model using ice cream to predict weather transition probabilities.
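To make the ice cream example concrete, here is a minimal sketch of one piece of the puzzle. It does not perform the training step described above (learning the probabilities from data, which is typically done with the Baum-Welch algorithm). Instead it assumes the probabilities are already known and uses the Viterbi algorithm to recover the most likely hidden weather sequence from the scoops eaten each day. Every number in START, TRANSITION, and EMISSION is an assumption made up for illustration.

```python
STATES = ("warm", "cold")

# All probabilities below are illustrative assumptions, not learned values.
START = {"warm": 0.5, "cold": 0.5}
TRANSITION = {
    "warm": {"warm": 0.8, "cold": 0.2},
    "cold": {"warm": 0.3, "cold": 0.7},
}
# EMISSION[state][scoops]: chance of eating that many scoops in that weather.
EMISSION = {
    "warm": {1: 0.1, 2: 0.3, 3: 0.6},
    "cold": {1: 0.6, 2: 0.3, 3: 0.1},
}


def viterbi(observations):
    """Return the most likely hidden state sequence and its joint probability."""
    # best[state] = (probability of the best path ending in state, that path)
    best = {s: (START[s] * EMISSION[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        step = {}
        for s in STATES:
            # Choose the predecessor that maximises the probability of
            # arriving in state s.
            prob, path = max(
                ((best[p][0] * TRANSITION[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            step[s] = (prob * EMISSION[s][obs], path + [s])
        best = step
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob


scoops = [3, 3, 1, 1]  # scoops eaten on four consecutive days
path, prob = viterbi(scoops)
print(path, prob)
```

With these assumed numbers, three scoops on the first two days and one scoop on the last two decode to warm, warm, cold, cold. The weather was never observed, only inferred from the ice cream.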

Speech recognition models use this paradigm to find the state transition probabilities between units of speech such as phonemes, the smallest units of sound that distinguish one word from another. Rather than the number of scoops of ice cream, the observations are acoustic features extracted from slices of the recording several milliseconds long. By training this model on many different observations, we are able to find the probability that a particular phoneme was spoken given an audio observation.

Once we have a list of phonemes that were likely spoken, a separate classification algorithm is used to determine the most likely word for each set of phonemes. This classifier could be anything, but modern software often uses pre-trained neural networks to determine the sequence of words.

The Hidden Markov Model was the main breakthrough in speech recognition, but increases in computing power and better classifiers were also key to making speech recognition what it is today.

Simple Python Program

Now that we have a general understanding of how speech recognition works, I am going to go through some very basic Python code to enable you to write your first speech recognition program. Luckily, there are already pre-trained HMM/Classifier libraries that will do essentially all of the work for us on the back end.

I will be using the SpeechRecognition library, which supports a variety of online and offline engines and APIs. After initializing your recognizer, open your .wav file as an AudioFile and read it with the record method, which returns a speech_recognition.AudioData object. Finally, pass that object to the recognize_google method, which sends the audio to Google’s online speech API and therefore needs an internet connection. See below for the code described.

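import speech_recognition as sr
# Create the recognizer that handles transcription
r = sr.Recognizer()
# Open the recording and read the whole file into an AudioData object
hello = sr.AudioFile('hello.wav')
with hello as source:
    audio = r.record(source)
# Send the audio to Google's online API and print the transcript
print(r.recognize_google(audio))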

This print statement prints the transcript of the recording, in my case the word hello that I recorded, and you are done. This is of course a very simple introduction to the code, but you can explore the speech_recognition library further to add complexity wherever needed.
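One natural next step is handling the cases where recognition fails. The sketch below wraps the same calls in a helper; the name transcribe and its language parameter are hypothetical choices for this example, while UnknownValueError and RequestError are the exceptions the SpeechRecognition library raises when the audio is unintelligible or the online service cannot be reached.

```python
def transcribe(path, language="en-US"):
    """Return the transcript of a WAV file, or None if recognition fails."""
    # Imported inside the function so this sketch can be loaded even on a
    # machine where SpeechRecognition is not installed yet.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the entire file

    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        # The service could not make out any words in the audio.
        return None
    except sr.RequestError as error:
        # The API was unreachable or rejected the request.
        print(f"Could not reach the recognition service: {error}")
        return None
```

Calling transcribe('hello.wav') would then return the recognized text, or None instead of crashing when the recording is silent or the network is down.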

Speech recognition software opens up a wide range of opportunities, particularly in data science, and I hope this introduction is helpful for those getting started with audio data.

This investigation was completed during my time at Metis Data Science Bootcamp. If you have any questions, please feel free to reach out to me on LinkedIn.