James Vlahos: Talk to Me: Amazon, Google, Apple and the Race for Voice-Controlled AI

The advent of voice computing is a watershed moment in human history because using words is the defining trait of our species, the ability that sets us apart from everyone and everything else. As Google CEO Sundar Pichai said in a letter to shareholders, “The next big step will be for the very concept of the ‘device’ to fade away.” Our inventions have always demanded that we adapt to them. With voice, however, computers are finally doing it our way. Philip Lieberman once wrote: “Speech is so essential to our concept of intelligence that its possession is virtually equated with being human.”
The companies at the forefront of voice development and usage include Google and Facebook, which have made the majority of their fortunes from advertising; Amazon, the biggest digital store; Apple, which sells its own products; and Microsoft, which provides services and software for business applications. All of those business models are under threat from voice. The companies are now in a war to create the dominant new operating system for life itself.
What will replace the conventional internet is conversations with AIs, the new oracles of civilization. The payoff is increased efficiency. The trade-off is diminished independence.
In 2003 DARPA launched the largest artificial intelligence research program in U.S. history, called CALO, which stood for Cognitive Assistant that Learns and Organizes. The program was quite successful. One of the people involved was Adam Cheyer, one of the stars of conversational AI, who worked on predecessors of Siri. He later joined forces with Dag Kittlaus, and they also brought on board Tom Gruber, a computer scientist at Stanford. They wanted to redesign the face of the consumer internet. They set up a company, Active Technologies, and created the virtual assistant Siri. The core concepts behind Siri were:
• Agent-based architectures
• Natural-language understanding
• Ontologies
These concepts had been kicking around research labs for years, if not decades.
The company was bought by Apple, and that is how the era of virtual assistants began.
As mentioned, the main players in voice AI are the main players in the digital industry. Amazon, with its financial power, is one of them, but even for Bezos, a man not known for modest ambitions, this was a hard task. Google, which had a better grasp of natural-language understanding thanks to its search expertise, and Apple, which produces some of the best consumer electronic devices in the world and had a head start in voice AI with its purchase of Siri, were better positioned than Amazon. Amazon started to develop this area under Project Doppler, which grew into a multinational operation through hiring and acquisitions. The team worked on two main categories of challenges: speech recognition and language understanding on the one hand, and problems that required totally new approaches on the other. They created a device with six directional microphones. Software controlled the voice input by favoring the microphone that was picking up the most voice and minimizing the input from the others. This process is known as »beam forming«. The device would be triggered by a »wake word«. Alexa was not only the wake word for this product; the name would also represent all of Amazon’s cloud-based AI, which could be used in many other devices too. One of these devices was the Echo, of which Amazon sold one million units in two weeks.
Apple was paying the price of being first to market with Siri. Its development stalled somewhat between 2014 and 2016, since the people who knew the product intimately had left the company. Google was taking an incremental approach, releasing AI features one by one. In 2012 it released a virtual assistant called Google Now. Microsoft released Cortana in 2014; the team working on it was led by Larry Heck. But it was only in 2016 that all the main companies fully embraced conversational AI as the future of computing and started to address this area publicly. On 3 January 2016 Zuckerberg announced that he was developing his own private assistant, and later that year he announced that Facebook would launch products in this area. Microsoft was even more aggressive. With what the company was calling the Microsoft Bot Framework, developers could create natural-language interfaces for any type of business. They would be backed by the company’s cloud-based AI services to interpret language, organize dialogues, and even gauge the emotions lurking behind people’s words. Microsoft CEO Satya Nadella called this »conversation as a platform«. Google launched its full-fledged virtual helper, Assistant, in 2016. Replacing Google Now, it was available as a smartphone app. Others joined the voice AI hype as well: Samsung acquired Viv, a start-up set up by the creators of Siri, for 114 million USD.
Voice was only one option for conversing with machines. The other was texting. Companies were interested in the possibilities of text-based interaction, especially since they believed that the app age was waning. Nadella even said: »Bots are the new applications.« Apps like Facebook Messenger and WeChat could capitalize on this development. Microsoft, on the other hand, played on two fronts. One was to offer a development environment with its Bot Framework; the second was to attract chatbots to Skype, an app it owns. Google tried with Allo. But the lessons learned from this area are that bots are not easy to build, especially good ones, and that they are not the new apps, since visualization is sometimes still stronger than text.
To move conversational AI to a higher level, some sort of intimacy with computers has to be developed. For that to happen, users would have to be more engaged and personally involved, and thus more relaxed.
The world’s first chatbot, and still one of the best known to this day, was Eliza, created by MIT computer scientist Joseph Weizenbaum in the mid-1960s. It was a good start, but those first chatbots didn’t impress a lot of people, since they didn’t understand anything. People like Terry Winograd thought that for computers to really converse with people, they need to have knowledge. They need to apply reasoning and make logical inferences. During this period a new potential use for chatbots emerged: video games.
In the search for how to build human-like conversational AI, one challenge rose to the top: variability. Researchers and developers were trying to teach computers to talk based on their own knowledge of the world and language. But this approach didn’t scale. To really move to a higher level of computer communication, computers would need to learn how to communicate by themselves.
So machine learning came into play. Its success in image recognition, and the developments built on that success, could potentially lead to breakthroughs in conversational AI. Engineers increasingly believe in evolution. The technology behind the developments in machine learning is the neural network. Neural networks identify patterns. They learn to associate a given input with a desired output. Once some hurdles were overcome, the capabilities of neural networks would prove to be immensely valuable for voice AI. Some of the main players in this field were Geoffrey Everest Hinton, now connected to Google, Yoshua Bengio, and Yann LeCun, now connected to Facebook. With deep learning and a learning algorithm called backpropagation, voice AI moved forward. Conventional voice AI systems were based on rules. But in 1998 Bengio and LeCun developed automated handwriting recognition based on a backpropagation-enabled neural network, and by doing so they tackled the variability issue.
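The idea that a network learns to associate a given input with a desired output can be illustrated with a deliberately tiny sketch. It is not the handwriting system mentioned above and not full backpropagation through many layers: it trains a single artificial neuron by gradient descent, which is the one-layer special case of the same error-driven idea. The task (the logical OR function), the function names and the hyperparameters are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    """Output of a single neuron: weighted sum of inputs, then sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train(samples, epochs=2000, learning_rate=0.5):
    """Adjust weights so that each input is associated with its target."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # Error signal: how far the current output is from the target.
            error = predict(weights, bias, inputs) - target
            # Nudge every weight against the error (gradient descent step).
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical training data: the logical OR of two binary inputs.
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights, bias = train(OR_SAMPLES)
for inputs, target in OR_SAMPLES:
    print(inputs, target, round(predict(weights, bias, inputs)))
```

Nobody writes a rule for OR here; the neuron arrives at suitable weights purely from examples, which is the property that makes the approach attractive for highly variable inputs such as handwriting or speech.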
A system for understanding spoken words consists of several subsystems:
• Automated speech recognition – recognizing sound
• Natural-language understanding – what is communicated
• Natural-language generation – formulating a reply
• Speech synthesis – how voice-computing devices can audibly reply
Given this variability, speech recognition systems work by guessing what has been said rather than from a position of certainty. In 2016 IBM and Microsoft announced that they had gotten the word error rate down below 6%.
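Word error rate is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the recognizer’s guess into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that standard calculation; the function name and the sample sentences are illustrative assumptions, not taken from the IBM or Microsoft systems.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One wrong word out of five gives an error rate of 20%.
print(word_error_rate("turn on the kitchen lights",
                      "turn on the chicken lights"))
```

On this measure, a rate below 6% means that fewer than six words in a hundred are misrecognized, dropped or invented.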
The task of determining meaning is called disambiguation, and for decades computer scientists have been losing sleep over teaching computers that task. To bring word recognition closer to the computer world, researchers looked for a way to represent words in a form that computers can work with. It was Hinton and Bengio who laid many of the important foundations for representing words with numbers. They figured out a way of doing so using ordered strings of numbers called vectors. This technique is known as word embedding. Neural networks need much more compact word embeddings, and in a 2013 paper a team of Google researchers led by Tomas Mikolov revealed a brilliant way to create them. Instead of giving each word its own unique vector, the numbers inside a vector could express how much the word embodied a particular aspect of meaning. The vector values are not set manually; they are calculated automatically by a neural network that analyzes a vast corpus of natural human writing. The beauty of deep learning is that humans don’t have to pick out the key identifying features. The primary technique for feature identification in voice AI is distributional semantics.
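To make the notion of words as vectors concrete, here is a toy sketch. The three-dimensional vectors are hand-picked assumptions purely for illustration (real embeddings have hundreds of dimensions and are learned from text, not written by hand), and the axis meanings in the comments are hypothetical. The point is only that, once words are vectors, closeness in meaning can be measured with ordinary arithmetic such as cosine similarity.

```python
import math

# Hypothetical embeddings. The axes loosely stand for
# (royalty, maleness, food); real systems learn such values from a corpus.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.0],
    "queen": [0.9, 0.1, 0.0],
    "apple": [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word):
    """Return the other vocabulary word whose vector is closest to `word`."""
    return max((other for other in EMBEDDINGS if other != word),
               key=lambda other: cosine_similarity(EMBEDDINGS[word],
                                                   EMBEDDINGS[other]))

print(most_similar("king"))
```

With these made-up values, "king" lands much closer to "queen" than to "apple", because the two royal words share a high value on the same axis.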
To make a computer’s replies more efficient, we can use a scalable technique called information retrieval (IR): the AI grabs a suitable response from a database or web page. Since there is so much content online, this widens the AI’s options for replying. Combined with a scripted approach whose blanks can be filled in this way, it becomes even more powerful. Sometimes “phrase-based statistical machine translation” is needed to get to a proper reply. At Google, researchers pioneered sequence-to-sequence techniques for better translation; the work was done by Oriol Vinyals and Quoc Le. They had the idea of using the same technique, encoding one phrase as a vector and decoding a second phrase as the reply, to create AI-led dialogues. By adding an LSTM (long short-term memory) network, they distill long messages down to the most relevant short phrase and create a proper response based on it.
When the reply is ready, whether scripted, retrieved or generated, the computer needs to speak. Creating synthetic voices that can be used on command was a long process. One successful example is DeepMind’s WaveNet technology, which helps Google Assistant speak and is parametric synthesis on steroids. Lyrebird is one of the companies that have developed technology to clone the voices of specific people.
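The combination of retrieval with a scripted template can be sketched in a few lines. Everything here is an assumption made for illustration: the three stored question-and-answer pairs stand in for a database or the web, word overlap (Jaccard similarity) stands in for a real ranking function, and the template wording is invented. Production systems use far larger indexes and much stronger matching.

```python
import re

# Hypothetical mini "database" of stored questions and their replies.
RESPONSES = [
    ("what is the capital of france", "Paris is the capital of France."),
    ("how tall is mount everest", "Mount Everest is about 8,849 metres tall."),
    ("who wrote hamlet", "William Shakespeare wrote Hamlet."),
]

# Scripted reply with blanks that are filled in at run time.
TEMPLATE = "Here is what I found, {name}: {answer}"

def tokenize(text):
    """Lower-case the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question):
    """Return the stored reply whose question shares the most words."""
    asked = tokenize(question)

    def overlap(entry):
        stored = tokenize(entry[0])
        return len(asked & stored) / len(asked | stored)  # Jaccard similarity

    best = max(RESPONSES, key=overlap)
    return best[1] if overlap(best) > 0 else None

def reply(question, name):
    """Fill the scripted template with the retrieved answer."""
    answer = retrieve(question)
    if answer is None:
        return f"Sorry, {name}, I could not find an answer."
    return TEMPLATE.format(name=name, answer=answer)

print(reply("Who wrote Hamlet?", "Ana"))
```

The script supplies the conversational framing, while retrieval supplies the content, so adding more stored answers widens what the assistant can say without anyone writing new rules.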
Once companies had developed virtual assistants able to communicate with people, they started to look for ways to improve them. Google realized that the Assistant apps with the highest user retention rates are the ones with the strongest personalities. Microsoft also got feedback from users showing that they eagerly personify technology. The companies wanted to bolster trust, and they believed that if a voice AI had an approachable personality, people would explore its skill set more carefully. But developing a personality has its challenges, especially uniformity. Robert Hoffer, one of ActiveBuddy’s cofounders, said: “The problem with creating character for the mass market is if you drive in the center of the road, you get hit by a car going in one direction or another.” As with other IT-based approaches in today’s business environment, voice AI developers are working hard on intensely tailoring and customizing the personalities of virtual assistants.
Social conversation is the ultimate challenge for voice AI. Creating the best conversational AI means striking a balance between manually engineered and machine-learning approaches. It is difficult to train a neural network for conversation because there isn’t a clear goal, like winning at the game of Go, that would let the system find the optimal approach through trial and error on a massive scale. Part of the difficulty also lies in incorporating many elements: types of bots, dialogue policies, algorithms and neural networks. It is also hard to find proper responses in the vast amount of content available. Sometimes dedicated socialbots are employed to assess and retrieve proper responses. To tackle these challenges, scientists are looking to a hybrid approach, combining knowledge-based AI with machine-learning AI to create hybrids that are better than standalone approaches. This has moved the technology’s capabilities up another level. AIs that can engage with people socially and emotionally, even if on a limited basis, are beginning to tackle roles that were never before possible.
One of those roles is that of conversational computers acting as human friends. The popular toy Barbie was one of them. Despite her status as a plaything, she counts as one of the more ambitious efforts to create a synthetic companion through conversational technology. Another company working in this area is PullString. It created some of Alexa’s most popular conversational skills and a chatbot that, on its first day of existence, exchanged six million messages with fans of the Call of Duty video game. It uses a rule-based approach in a large part of its products.
Among the big players, Microsoft undertook the XiaoIce project, which the company bills as a general conversation device. Its philosophy of how to respond to people is fundamentally different from that of the utility-based Cortana. Google and Amazon are also looking at how their virtual assistants can forge emotional connections with people, but with XiaoIce Microsoft has taken the lead in using EQ to promote friendship. To do that it uses machine learning, but first people manually tag the training data with the predominant emotions, using Paul Ekman’s model of six basic emotions. But emotions are complex; even people routinely misread them. So all systems that are based on them are hard to manage.
Eugenia Kuyda tried to create a customized AI companion in order to make it more seductive. She saw that people didn’t actually want to talk to just any bot; they wanted to talk with their own bots about themselves. So her company’s product, Replika, essentially creates Narcissus bots.
As technology moves forward, people are beginning to recognize a third ontological category: beings that are less than humans but more than machines. The key question is whether this new class of beings is capable of detracting from our relationships with actual humans. One advantage they have over human friends is that they are always available, which is not always the case with human friends.
Another area where voice AI will have great influence is how we know what we know. Question answering is one of the most used features in conversational AI. William Tunstall-Pedoe’s vision was that computers would respond to questions in a single pass, and that the one-shot answer would go mainstream with voice computing. In 2007 his team launched the web site True Knowledge. The True Knowledge digital brain consisted of natural-language understanding, a store of amassed facts (the majority of them automatically retrieved from sources of structured data) and a knowledge graph (an encoded system of how data relate to one another). The knowledge graph encoded relationships in a taxonomic sense. The system characterized the nature of each connection in standardized ways. At first the product was not very successful because of a bad user interface. But in 2012 the company debuted an app called Evi, which reached number one in the Apple app store. Eventually the company was bought by Amazon, and its technology went into the Echo.
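A knowledge graph of the kind described here can be sketched as a set of subject, relation, object triples. The facts, relation names and helper functions below are hypothetical and are not True Knowledge’s actual schema; the sketch only shows how taxonomic links can be followed and how a question can be answered with one result rather than a list of links.

```python
# Hypothetical facts stored as (subject, relation, object) triples.
FACTS = {
    ("london", "is_a", "city"),
    ("city", "is_a", "place"),
    ("london", "capital_of", "united kingdom"),
    ("united kingdom", "is_a", "country"),
}

def is_a(entity, category):
    """Follow taxonomic is_a links transitively from entity to category."""
    seen = set()
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        if current == category:
            return True
        if current in seen:
            continue
        seen.add(current)
        frontier.extend(obj for subj, rel, obj in FACTS
                        if subj == current and rel == "is_a")
    return False

def answer(relation, obj):
    """Return a single subject linked to obj by relation, or None."""
    matches = sorted(subj for subj, rel, o in FACTS
                     if rel == relation and o == obj)
    return matches[0] if matches else None

print(is_a("london", "place"))                  # london -> city -> place
print(answer("capital_of", "united kingdom"))   # one-shot answer
```

Because every connection is typed in a standardized way, the system can infer that London is a place without that fact ever being stored directly.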
The search for the best method of using AIs as oracles is something all the big companies have been engaged in for a very long time. With voice it was even harder, since written searches are usually one to three words long and spoken ones are at least three to four words long. Google acquired the company Metaweb with its product Freebase and used it as the cornerstone of its Knowledge Graph. Microsoft also built one (Concept Graph). Amazon, Facebook and Apple all bought companies that were developing knowledge graph solutions. Some researchers are moving beyond knowledge graphs. They are trying to deploy systems that hunt for answers in sources of unstructured data: web pages, scanned documents and digitized books. IBM has done that with Watson, which could access 200 million pages of content.
With voice search everybody is trying to grab »position zero«. Position zero is critical because the instant answer in voice is most often what gets read aloud, and it is often the only thing that gets read. Whether by paid or organic discovery on Amazon, Google, or elsewhere, companies who want to be found in the voice era face heightened pressure to finish on top. As Gary Morgenthaler said: »A million blue links from Google is worth far less than one correct answer from Siri.« So the fight for voice search is fierce. Facebook’s first-ever smart home device, the Facebook Portal, launched in 2018 and uses Alexa as its voice assistant. Amazon has potentially the most to gain from the shift from conventional search engines to AI oracles. It reached an arrangement with Microsoft to allow Cortana (and Bing) to be available on Alexa, which Microsoft uses to reach customers.
The nature of information is changing in the voice age. The company Automated Insights is creating news using AI journalism. But with the rise of conversational AI, threats are also growing. Propaganda bots will come to life even more richly. As conversational AI rises on platforms owned by the biggest companies, which answer with information drawn from other sources, the traditional defense, that platforms only share the information of others and are therefore not responsible for it, will ring a little hollow in the voice era. The control of knowledge is a potent power and it is being consolidated in the hands of an elite club. Before the rise of the internet and search engines, knowledge gathering was an active process. Some people actually appreciate the thrill of this hunt: gathering information, evaluating its veracity, and synthesizing it. But Google research has shown that the average person simply wants a good answer as quickly as possible.
In the area of visibility, voice-controlled devices can act not only as a force for progress but also as instruments of control. Now everybody is saying: don’t worry, we are not spying on you. But what if they are? This is another area where strict regulation and oversight will be needed. Sometimes, however, those devices can be used to prevent critical situations. The question is when to use them and when to act on the information received through them.
Stepping even further, some companies are eyeing virtual immortality. Soul Machines, a company based in New Zealand, is one of the most successful in this field. But since this is still a relatively new field, some monetization practices are needed to really incentivize companies to start looking for solutions in this area.
An overview of the companies:
• Apple positions Siri as a great feature of its devices, but not as the product being sold.
• Microsoft struggles to get its conversational technology in front of customers. Chatbots are available on Bing and Skype, but those platforms are not as popular as some competitors’ platforms. Microsoft pitches Cortana as a workplace assistant, however, and it can be strong in the enterprise segment.
• Facebook is still at an incomplete stage, but we should not forget about it.
• Google and Amazon are the favorites in this race. But the new voice-led business model will bring some changes. Amazon is selling the Echo Dot below cost, believing that it will earn money on services. But will it earn that money from advertising or from an improved shopping process? The changing advertising model represents a big threat to Google, since with voice search it will lose customer exposure to ads. With the rise of mobile internet use, this ad time has already decreased; with voice it will almost disappear. Shopping, on the other hand, will benefit Amazon. Whenever someone seeks information about a product or orders one by voice without specifying the brand, Amazon picks which one gets mentioned as the first option. Suddenly you are buying what Amazon tells you to buy. Google has responded to this threat by teaming up with some of the biggest brick-and-mortar retailers.