How To Make A Talking Robot: From Speaking Machines To Voice Assistants

Are Alexa and Siri really all that impressive?

We have technology today that just a few decades ago existed solely in science fiction. We have robots that can build furniture, vacuum floors and win Jeopardy. Yet for all of their strengths, robots have yet to master something that even a toddler can do: speak. And it’s not for lack of trying. From the earliest days of robots to today, there have been countless attempts at making technology that can talk. With almost every big tech company investing heavily in voice assistants — Apple’s Siri, Google Assistant, Amazon’s Alexa and more — the pressure to finally crack this problem is increasing.

How do you make a robot talk, though? It’s a testament to evolution that humans, unlike most animals, developed a body capable of producing such a wide range of sounds, which we’ve shaped into language. Reproducing that with machines isn’t easy. Here, we’ll look at some of the attempts to do so, from the earliest speech synthesis to the robots that surround us today.

The Early Talking Machines

The first successful attempts at making a machine that sounds (at least a little bit) like a human go all the way back to the 18th century. Inventors looked at the human speech system and attempted to recreate it using the materials around them. The resulting products were closer to musical instruments than machines as we think of them today.

One such instrument was a system of resonators created by German-born scientist Christian Gottlieb Kratzenstein. In the 1700s, the first breakthroughs in acoustic science were unlocking the secrets of sound waves. The Imperial Academy of St. Petersburg had put out a challenge for someone to figure out exactly how the human vocal tract produced distinct vowels. Kratzenstein took on this problem and around 1780 produced resonators driven by reeds (like the reeds you’d find in a harmonica) that approximated A, E, I, O and U sounds.

Around the same time, Austro-Hungarian inventor Wolfgang von Kempelen was at work developing his own speaking machine. He tried out multiple designs over a period of decades and, by the time he published a description of it in 1791, had arrived at a machine comprising a bellows, a reed and several small parts that could mimic various consonants. While the machine was impressive, especially for its time, von Kempelen’s reputation was tarnished because he was also the inventor of one of the most famous mechanical hoaxes of all time: the Mechanical Turk. Von Kempelen claimed to have invented an automaton that could play chess, but it was really just a dummy manipulated by a skilled human chess player hidden inside the cabinet.

These designs continued to be improved upon over the following decades, with machines that sounded more and more like humans. One of the most famous is the Bell Labs Voder, which was exhibited at the 1939 World’s Fair. It was a complex electronic machine operated by a single trained person, who used a console of keys and a foot pedal to “play” human speech.

All of these advances were important to the study of speech synthesis, but they didn’t do much more for the average person than provide a novelty act. While there have been more recent attempts — like the Japanese company that created an impressive yet creepy artificial human mouth back in 2011 — the focus for speech synthesis has shifted almost entirely to the digital realm.

The Recording Revolution

Recording technology showed that the easiest way to make machines talk isn’t to build a human mouth from scratch, but to use existing human voices. Thomas Edison proved this early on when he released the first-ever talking dolls in 1890, which used little phonographs in the body of the doll featuring recorded phrases. It was an achievement at the time, and it paved the way for annoying talking toys for generations to come.

You might think that using pre-recorded phrases, or “canned speech,” is kind of cheating when it comes to making machines talk. And yet, canned speech is one of the most common methods today to create robot speech. When you tell Siri to tell you a joke, that joke is not being formed on the spot; it’s something that was recorded in the studio. The technology is more advanced than a talking doll, since Siri has to understand the spoken request and determine how to respond, but the speech itself is essentially the same.
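
As a rough illustration of canned speech, here is a minimal Python sketch. Everything in it is hypothetical: the intent names, the keyword matching and the clip file paths are invented for the example and say nothing about how Siri or any real assistant is built. The point is only that the "talking" part boils down to picking a recording that already exists.

```python
from typing import Optional

# Hypothetical intents and file names, invented for illustration only.
CANNED_CLIPS = {
    "tell_joke": "clips/joke_01.wav",
    "greeting": "clips/hello.wav",
}
# Assumed fallback recording for requests the matcher does not recognize.
FALLBACK_CLIP = "clips/i_dont_understand.wav"


def detect_intent(utterance: str) -> Optional[str]:
    """Toy keyword matcher standing in for real speech understanding."""
    cleaned = utterance.lower()
    for mark in "?.,!":
        cleaned = cleaned.replace(mark, "")
    words = set(cleaned.split())
    if "joke" in words:
        return "tell_joke"
    if words & {"hello", "hi"}:
        return "greeting"
    return None


def pick_clip(utterance: str) -> str:
    """Return the path of the pre-recorded clip to play for this request."""
    intent = detect_intent(utterance)
    # No speech is generated here: we only choose among existing recordings.
    return CANNED_CLIPS.get(intent, FALLBACK_CLIP)


print(pick_clip("Tell me a joke"))
print(pick_clip("What's the weather on Mars?"))
```

The hard work in a real assistant lives in the understanding step, which this sketch replaces with a crude keyword check.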

Canned speech can’t do everything, though. While voice assistants are designed to keep you from going off-script (and when you do, you’ll often hear “I don’t understand”), a voice-driven future requires more versatility. The solution is to use concatenative speech. This still requires recording real humans, but rather than playing back full sentences, concatenative speech strings together individually recorded words, syllables and even individual speech sounds. Text-to-speech technology is one of the best examples of this, because it takes whatever you write and outputs sound.

While concatenative speech sounds simple, there are a lot of factors that make it more complicated. The most important one is that humans tend to pronounce a single word many different ways. The way you say “the,” for example, changes based on what word comes before it, what word comes after it, where the emphasis is in the sentence, and on and on. Human brains are very good at hearing all these variations and getting the meaning, but computers rely on exactness. To approximate natural speech, concatenative speech needs multiple recordings of every word and sound, and it needs to run an algorithm to figure out which version to use to make a sentence comprehensible to human ears.
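
To make the "which version do I use?" problem concrete, here is a small Python sketch of one common way to frame it, often called unit selection. It is a toy under stated assumptions: the `Unit` fields, the tiny `INVENTORY` of recordings and both cost functions are made up for illustration, and real systems weigh far more features than pitch and stress. Each candidate recording gets a "target cost" (how well it fits what the sentence needs) and a "join cost" (how smoothly it follows the previous recording), and a Viterbi-style search picks the cheapest chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    word: str
    pitch_hz: float  # average pitch of this particular recording
    stressed: bool   # whether the speaker emphasized the word


# Hypothetical inventory: two recordings of each word.
INVENTORY = {
    "the": [Unit("the", 110.0, False), Unit("the", 150.0, True)],
    "cat": [Unit("cat", 120.0, True), Unit("cat", 180.0, False)],
    "sat": [Unit("sat", 115.0, False), Unit("sat", 125.0, True)],
}


def target_cost(unit: Unit, want_stress: bool) -> float:
    """Penalty when a recording's emphasis does not match the sentence."""
    return 0.0 if unit.stressed == want_stress else 1.0


def join_cost(prev: Unit, nxt: Unit) -> float:
    """Penalty for an audible pitch jump between neighboring recordings."""
    return abs(prev.pitch_hz - nxt.pitch_hz) / 100.0


def select_units(words, stresses):
    """Pick one recording per word with the lowest total cost (Viterbi)."""
    # Each entry is (total cost so far, chain of units ending in a candidate).
    layer = [(target_cost(u, stresses[0]), [u]) for u in INVENTORY[words[0]]]
    for word, want in zip(words[1:], stresses[1:]):
        next_layer = []
        for u in INVENTORY[word]:
            # Cheapest way to arrive at this candidate from the previous word.
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in layer),
                key=lambda pair: pair[0],
            )
            next_layer.append((cost + target_cost(u, want), path + [u]))
        layer = next_layer
    return min(layer, key=lambda pair: pair[0])[1]


chosen = select_units(["the", "cat", "sat"], [False, True, False])
print([(u.word, u.pitch_hz) for u in chosen])
```

With this inventory, the search settles on the three recordings whose pitches sit close together (110, 120 and 115 Hz) and whose emphasis matches the request, rather than greedily taking the first recording of each word.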

No matter what, a bunch of separately recorded words and sounds will not sound as fluid as actual human speech. The main strength of concatenative speech is that it’s easy for humans to understand, even if the listener can tell that it’s not natural speech. Technology is improving all the time, but concatenative speech will probably never be able to match the tones and emotions of a human. It would require a near-infinite number of recordings to do so.

The Present And Future Of Voice Assistants

The widespread use of concatenative speech means that, in addition to its imperfections, there is another factor to consider: the actual human who did the recordings. While we might think of Apple’s voice assistant Siri as a faceless, all-knowing entity, the original American voice, introduced in 2011, belonged to voice actor Susan Bennett. And as always, human involvement makes things messier.

First, there’s the bias problem. Plenty has been written about how mainstream voice assistants, from Siri to Alexa, have defaulted to female-sounding voices, which reflects certain cultural stereotypes of women. There has been at least one attempt to make a nonbinary voice assistant, but it was still based on a single voice. And even if you go beyond gender, there is no such thing as a “neutral” accent (even the General American accent is tied to class and race), so any voice that’s chosen will represent only a small portion of the population. It’s not an easy problem to solve.

Another issue is that a person’s voice does technically belong to that person. You might think that wouldn’t be an issue, because surely companies pay for the rights to the voices they use in their technology, but that isn’t always the case. In 2021, Canadian voice actor Bev Standing sued TikTok for using her voice in its very popular text-to-speech feature. She’d previously done recordings that she said were only to be used for translation. While TikTok hasn’t commented on the case, it did roll out a new voice in the app. It may sound like a one-off case, but voice actors are worried it’s a sign of a future where they have less control of how and where their voices are used.

There’s also fear of using robot voices to imitate people. For Roadrunner, a 2021 documentary about famous chef Anthony Bourdain, filmmakers asked an artificial intelligence company to create a fake Bourdain voice to read a few lines. The technology is not particularly widespread yet, but there are unsettling implications for the future if people’s voices can be so easily mimicked.

One possible way to address these issues is to move past canned and concatenated speech to the third option for talking machines: synthetic speech. That means the speech is created entirely from scratch. In some ways, it’s a return to the original speaking machines, except this time it’s done entirely electronically. With modern technology, researchers can create an entire vocal tract digitally. Synthetic speech is more customizable, more fluid and more emotion-rich than concatenated speech, but it comes at the expense of comprehensibility. Put simply, synthetic speech sounds bad.
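
For a feel of what "from scratch" can mean, here is a short Python sketch of formant synthesis, a much simpler relative of a full digital vocal tract and not the method of any particular product. It assumes the classic source-filter idea: a buzzing pulse train stands in for the vocal cords, and a chain of resonant filters stands in for the throat and mouth. The formant frequencies for an "ah" sound are approximate textbook values, and the bandwidths, pitch and sample rate are assumptions chosen for illustration.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed)


def resonator(signal, freq_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Two-pole digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def glottal_pulses(pitch_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Crude vocal-cord stand-in: one impulse per pitch period."""
    period = int(sample_rate / pitch_hz)
    total = int(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(total)]


def synthesize_vowel(formants, pitch_hz=120.0, duration_s=0.5):
    """Pass the pulse train through one resonator per formant."""
    signal = glottal_pulses(pitch_hz, duration_s)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # normalize to the range -1..1


# Approximate (frequency, bandwidth) pairs in Hz for an "ah" vowel.
# The bandwidths in particular are illustrative guesses.
AH_FORMANTS = [(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)]

samples = synthesize_vowel(AH_FORMANTS)
print(len(samples), "samples generated")
```

The resulting list of numbers could be written to a sound file with Python’s standard `wave` module. Changing the formant values changes the vowel, which hints at why this approach is so customizable, and the buzzy result hints at why it can be hard on the ears.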

The future of robot voices, like the future of artificial intelligence in general, might rely on “deep learning,” a technique in which a neural network learns patterns from large amounts of data and improves as it sees more. A robot could start with basic text-to-speech and slowly gather more skills by interacting with humans. The most famous talking robot right now is Sophia, who is allegedly doing exactly that, using deep learning to sound more and more human (as well as becoming a citizen of Saudi Arabia in 2017). It’s worth pointing out, however, that some prominent artificial intelligence researchers have dismissed Sophia as closer to a puppet than a thinking machine, kind of like the Mechanical Turk that preceded her.

So long as Amazon, Google and Apple keep banking on voice assistants, there’s a promising outlook for robotic speech. Yet it can really make you appreciate humankind a little more when you realize something we take for granted is so fraught with complexity. And maybe, someday, we’ll finally be able to have full conversations with our phones.
