How Big Tech is making machines think and talk like humans


Most of us still prefer to interact with real people rather than automated machines, but conversational artificial intelligence could change that

By Shalini Verma


Published: Tue 12 Feb 2019, 7:00 PM

Last updated: Tue 12 Feb 2019, 9:41 PM

The first time I was introduced to a computer, my teacher asked me to type a bunch of obscure commands. At the end of the two-minute exercise, the screen cheerfully flashed Hello Shalini. Humans started out interacting with computers through a command-line interface that performed tasks at the behest of engineers typing rows of commands. This required users to learn a plethora of commands that computers would obey. In the mid-80s, the graphical user interface ushered in the era of personal computers, widening their ambit of users. The graphics got more intuitive, and the screens became more tactile. Yet we could not entirely eliminate the learning curve for the less technically inclined.
In 2011, Apple launched Siri, which, despite being less than perfect, demonstrated the liberating possibilities of having a conversation with a machine. Conversation, whether typed or spoken, is easily the most intuitive and natural form of communication for humans. Increasingly, the industry momentum is in favour of conversation, or natural language, as the human-machine interface of choice. Conversation will be the primary way we search for information, consume digital services, buy things, and get any kind of assistance online. Fundamentally, conversations are dialogues in which both parties more or less understand each other. Dig a little deeper, and you will find a trove of technologies powering human-machine dialogues: natural language processing, speech recognition, and speech synthesis (text-to-speech), to name a few.
Researchers took the conventional route by trying to teach computers the rules of language. But how does one teach machines complex languages that are riddled with as many exceptions as rules? It was like boiling the ocean, so funding dried up for such seemingly impossible projects. After decades of trial and error, mathematics came to the rescue of our messy languages. Researchers used sophisticated statistical methods to train software on large amounts of text already broken down into grammatical components, also called parsed language. The computer recognises patterns in the text and uses them to understand new content. Siri, Alexa, and Google Home started to get smarter with each release, slowly becoming a part of our lives. IBM, Microsoft and Google now offer conversational AI services for other businesses to build their own apps and services.
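To make the idea concrete, here is a minimal sketch of that statistical pattern learning: a toy bigram model in Python, with an invented three-sentence corpus standing in for the vast parsed text a real system would train on.

    from collections import defaultdict

    # Toy parsed corpus: sentences already broken into tokens.
    corpus = [
        ["time", "flies", "when", "we", "are", "on", "leave"],
        ["time", "passes", "quickly", "on", "holiday"],
        ["fruit", "flies", "like", "a", "banana"],
    ]

    # Count how often each word follows another (bigram counts).
    bigram_counts = defaultdict(lambda: defaultdict(int))
    prev_counts = defaultdict(int)
    for sentence in corpus:
        for prev, word in zip(sentence, sentence[1:]):
            bigram_counts[prev][word] += 1
            prev_counts[prev] += 1

    def bigram_probability(prev, word):
        """Estimate P(word | prev) from the counts; zero if unseen."""
        if prev_counts[prev] == 0:
            return 0.0
        return bigram_counts[prev][word] / prev_counts[prev]

    # The software 'recognises patterns': 'flies' is a likely successor of 'time'.
    print(bigram_probability("time", "flies"))   # 0.5
    print(bigram_probability("time", "banana"))  # 0.0

Real systems use far richer models than bigrams, but the principle is the same: probabilities estimated from text the software has seen are used to judge text it has not.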
When you interact with a digital assistant or any conversational AI, it first tries to recognise what you are saying or typing, and then to understand what you are trying to say, based on a probability that it understood the words correctly. How should the computer understand "time flies when we are on leave"? Should it take the words literally, or should it make sense of the phrase? Should it treat time as a bird, or should it understand time in the context of leave, by calculating the odds in favour of 'time passing quickly'? The computer uses probability rather than real-world colloquialisms and common sense to understand what we are saying. It gets more complex when you throw tone of voice into the mix.
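A hedged illustration of that two-stage guesswork, with made-up probabilities standing in for what a real system would learn from data:

    # All numbers below are invented for illustration; a real assistant
    # derives them from acoustic models and large volumes of parsed text.

    # Stage 1: how confident the recogniser is in each transcription.
    transcriptions = {
        "time flies when we are on leave": 0.85,
        "thyme flies when we are on leave": 0.05,
    }

    # Stage 2: how plausible each reading of the best transcription is.
    readings = {
        "idiomatic: time passes quickly on holiday": 0.92,
        "literal: 'time' is a creature that can fly": 0.03,
    }

    # The assistant simply picks the most probable option at each stage.
    heard = max(transcriptions, key=transcriptions.get)
    meaning = max(readings, key=readings.get)
    print(f"Heard {heard!r}, understood as: {meaning}")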
Teaching software to understand and respond to our typed chitchat is tough enough. Teaching it to speak to us is an entirely different ball game. It has been a journey of sorts, from 'concatenative' models that strung together recorded voice nuggets to 'parametric' models that let the software produce its own raw audio. The statistical models were manually fine-tuned to calculate the probability of combinations of words occurring in a phrase. The error rates were high, and the responses sounded machine-like, in need of a human makeover. Researchers did not give up.
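A rough sketch of the concatenative idea, assuming a folder of hypothetical pre-recorded clips (one WAV file per word, all in the same format); real systems stitched together much smaller units, such as phonemes, and smoothed the joins.

    import wave

    # Hypothetical recordings; these file names are placeholders.
    CLIPS = {"hello": "hello.wav", "shalini": "shalini.wav"}

    def synthesise(words, out_path="out.wav"):
        """Concatenate recorded voice nuggets into one utterance."""
        frames, params = [], None
        for word in words:
            with wave.open(CLIPS[word], "rb") as clip:
                if params is None:
                    params = clip.getparams()  # assume all clips share a format
                frames.append(clip.readframes(clip.getnframes()))
        with wave.open(out_path, "wb") as out:
            out.setparams(params)
            for chunk in frames:
                out.writeframes(chunk)

    synthesise(["hello", "shalini"])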
Luckily, Google found a clever way to synthesise voice using deep neural networks that learn from voice recordings of people talking, and from text that matches what they are saying. Deep neural networks, loosely modelled on the human brain, allow for fast training on databases of human speech, mapping characters to waveforms or spectrograms. The fidelity of the sound generated from scratch is striking. China's biggest search engine, Baidu, has trained text-to-speech systems to clone a voice after listening to a short audio sample.
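A toy sketch of that neural mapping, assuming PyTorch is available; production systems such as WaveNet are vastly larger, but the shape of the task (characters in, audio features out) is the same.

    import torch
    import torch.nn as nn

    class TinyTTS(nn.Module):
        """Map a character sequence to mel-spectrogram-like frames."""
        def __init__(self, vocab_size=128, hidden=64, mel_bins=80):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)  # characters -> vectors
            self.rnn = nn.GRU(hidden, hidden, batch_first=True)
            self.to_mel = nn.Linear(hidden, mel_bins)      # hidden state -> mel frame

        def forward(self, char_ids):
            x = self.embed(char_ids)
            out, _ = self.rnn(x)
            return self.to_mel(out)  # one frame per character (a toy simplification)

    # Encode "hello shalini" as ASCII codes and run a forward pass.
    text = torch.tensor([[ord(c) for c in "hello shalini"]])
    mel = TinyTTS()(text)
    print(mel.shape)  # torch.Size([1, 13, 80])

Training would nudge the predicted frames towards spectrograms of real recordings; a separate vocoder then turns the frames into an audible waveform.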
Researchers are now taking volumes of parsed and annotated text and letting the software learn from it. IBM's AI tool Project Debater can parse text to construct arguments for and against a topic in a debate. Speech technologies are also getting better at the pitch, stress and pacing that make for a more natural conversation.
Tech giants are using creative ways to train digital assistants to have more human-like conversations. They are hiring writers, poets and playwrights to improve the experience and lend a certain personality to digital assistants. We know that 10 per cent of all conversations are casual chitchat, which requires natural, often witty answers to keep the conversation engaging.
The holy grail of conversational AI is passing the Turing Test: the point at which humans are no longer conscious that they are conversing with a machine. Conversational AI will become universally present on websites and in apps, redefining our digital experiences. Our days will be filled with conversations with our apps, cars, and appliances. Some days, the conversations will feel so human that they will pass the Turing Test.
Shalini Verma is the CEO of PIVOT technologies

