The Future of Real-World Chatbots

NLU complexities: challenges and open problems in Conversational AI


This blog was guest-written by Leo Cordoba.

In the last few years we have witnessed a race between companies to create ever bigger, more powerful, and more interestingly named NLP (Natural Language Processing) models. These models are trained on huge corpora of textual data and learn meaningful representations of words (and other tokens, such as emojis 😀). Sometimes these models can be used directly, and in other cases they need to be fine-tuned on specific “downstream” tasks like sentiment analysis or question answering. Some models are trained on multiple languages, while others are language-specific.

Some examples include ULMFiT, ELMo, BERT (and its various flavors like ALBERT, RoBERTa, etc.), XLNet, GPT-2 and GPT-3, NVIDIA’s Megatron, Microsoft’s Turing Natural Language Generation model (T-NLG), Google’s Meena and Pegasus, Facebook’s Blenderbot, T5, mT5 and many more (many more!)…

These models have pushed the frontiers of what’s possible and now we are starting to see amazing new applications of these models every day:

  • text generation (great GPT-2 examples here)

And what do these have to do with chatbots?

Well, generally speaking, commercial-use chatbots (as opposed to general-domain chatbots) need to solve three things:

  • Understand the user’s request, e.g. create a “query”

The process of understanding the user’s request usually involves detecting intents and entities. With that information the system can create a query (for example, using SQL) that will be used to retrieve information from a database.
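To make this concrete, here is a minimal sketch of that last step, turning a detected intent and entity into a parameterized SQL query. The intent names, table, and column names are illustrative assumptions, not a real schema or API:

```python
# Hypothetical mapping from intents to parameterized SQL templates.
# Intent labels, table, and columns are made up for illustration.
INTENT_TO_SQL = {
    "asking-price": "SELECT price FROM products WHERE name = ?",
    "asking-color": "SELECT available_colors FROM products WHERE name = ?",
}

def build_query(intent, entity):
    """Map an (intent, entity) pair to a SQL statement plus bound parameters."""
    template = INTENT_TO_SQL.get(intent)
    if template is None:
        raise ValueError(f"No query template for intent {intent!r}")
    return template, (entity,)

sql, params = build_query("asking-price", "millennium falcon")
# sql    → "SELECT price FROM products WHERE name = ?"
# params → ("millennium falcon",)
```

Using parameter binding (rather than string formatting) also keeps the user’s free text from being interpreted as SQL.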

In this article we’ll aim to show concrete examples of how complex these tasks can get and some high-level approaches for the future of conversational AI.

Natural Language Understanding? 😰

What we’ve got here is failure to communicate… and indeed there are many communication pain points between a bot and a user.

Let’s say our user asks us “How much is the millennium falcon?” Here the intent would be something like “asking-price” and the relevant entity is “millennium falcon”. With this information we could easily create a query to find the desired information.

Our conversation continues as follows:

  • User: How much is the millennium falcon?

In this case, the second question doesn’t have an explicit intent. In order to understand the intention, you need to backtrack to the previous intent. Of course, a conversation can take several turns and you may need to retrieve some data from the very beginning, so the bot needs long-term memory. Let’s see how this conversation goes…

  • User: Ok, and what colors do you have the millennium falcon? Do you accept Bitcoin?

In this case you can see that the user is asking for two different things: a color of the product and a payment method. This is a multi-intent situation, where the user communicates two goals in one statement. With some regex you could handle that case, but what about the next one:

  • User: Ok, and would you send a red millennium falcon to my address? And do you accept Bitcoins?

Now there is one extra intent in the first part of the sentence. And things can get much more difficult in a real situation:

  • User: Ok, and would you send a millennium falcon to my address?

Here the system will need to backtrack to the product question, then understand that the second part of the last sentence answers that question while the first part is a new question… Again, a multi-intent situation, but with a backtrack. By now, I guess we can discard regex as a realistic/robust solution…
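To see why the regex route is brittle, here is roughly what it looks like: split the utterance on sentence boundaries and conjunctions, then classify each clause with keyword rules. The keyword lists and intent labels are illustrative assumptions; notice how easily real phrasings would slip past them:

```python
import re

def split_clauses(utterance):
    # Naively break on punctuation and the conjunction "and".
    parts = re.split(r"[?.!]|\band\b", utterance)
    return [p.strip() for p in parts if p.strip()]

# Hand-written keyword rules per intent (toy example).
KEYWORDS = {
    "asking-color": ["color", "colors"],
    "asking-payment": ["bitcoin", "pay", "accept"],
    "asking-shipping": ["send", "address"],
}

def classify(clause):
    clause = clause.lower()
    for intent, words in KEYWORDS.items():
        if any(w in clause for w in words):
            return intent
    return None  # no rule matched

utterance = "Ok, and what colors do you have the millennium falcon? Do you accept Bitcoin?"
intents = [classify(c) for c in split_clauses(utterance)]
# → [None, 'asking-color', 'asking-payment']
```

This handles the simple two-intent example, but splitting on “and” also shreds phrases like “coffee with cream and sugar”, and keyword matching knows nothing about backtracking, which is exactly the failure mode described above.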

Most assistants don’t deal with multi-intent utterances in the same way they do with single-intent ones. Try this in Google:

Ok, you may think that there’s no relation between asking for the president of the US and the time, and you’re right, so try something like:

Or the other way round…

In both cases you should see the weather in Buenos Aires but not the time… This is just one example of how difficult it can be to add these kinds of functionalities to an assistant or bot.
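Before moving on, the long-term-memory idea from the conversation above can be sketched as well: keep a per-turn history, and when a turn carries no explicit intent, fall back to the most recent turn that did. The intent labels and the “And the X-Wing?”-style follow-up are illustrative assumptions:

```python
class DialogueState:
    """Toy dialogue memory that backtracks to the last explicit intent."""

    def __init__(self):
        self.history = []  # list of (intent, entities) per turn

    def update(self, intent, entities):
        if intent is None:
            # Backtrack: reuse the most recent turn with an explicit intent.
            for past_intent, _ in reversed(self.history):
                if past_intent is not None:
                    intent = past_intent
                    break
        self.history.append((intent, entities))
        return intent, entities

state = DialogueState()
state.update("asking-price", ["millennium falcon"])
# A follow-up like "And the X-Wing?" has an entity but no explicit intent:
resolved = state.update(None, ["x-wing"])
# resolved → ("asking-price", ["x-wing"])
```

A real system would also need to decide *which* past turn to backtrack to, which is much harder than taking the most recent one.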

Now, back to our conversation: imagine that our system doesn’t accept Bitcoin but does accept Ethereum. What do these two have in common? They are both cryptocurrencies! Using this similarity we could make a new recommendation to the user. This kind of knowledge is sometimes encoded in networks called ontologies, like WordNet and ConceptNet.

  • Bot: I’m afraid we don’t accept Bitcoin, but we do use Ethereum, would that work?
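The lookup behind a reply like that could be sketched with a toy ontology in the spirit of WordNet/ConceptNet: if the requested payment method is unsupported, suggest a supported one that shares a parent concept. The tiny hand-written ontology here is an assumption:

```python
# Toy "is-a" ontology: concept → parent concept.
IS_A = {
    "bitcoin": "cryptocurrency",
    "ethereum": "cryptocurrency",
    "visa": "credit card",
}

SUPPORTED = {"ethereum", "visa"}  # what our shop actually accepts

def suggest_alternative(requested):
    """Return the request itself if supported, else a supported sibling concept."""
    if requested in SUPPORTED:
        return requested
    parent = IS_A.get(requested)
    for option in SUPPORTED:
        if IS_A.get(option) == parent:
            return option
    return None

suggest_alternative("bitcoin")  # → "ethereum" (both are cryptocurrencies)
```

Real ontologies are graphs with many relation types, but the principle, recommending by shared parent concept, is the same.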

Finally, at the checkout…

  • Bot: So we’ll be sending you a millennium falcon to Skywalker Ave. 4321, and the total is 0.62 ETH, is that ok?

We, of course, know that it refers to the millennium falcon, because colors don’t have any significant meaning in relation to the address or the total amount, but how can we tell this to the bot? Well, this is known as a relation extraction problem.

Imagine you are at a restaurant and you tell a voicebot that you want the “Extra chicken Caesar Salad and a coke”. Then you see another dish that you just can’t wait to try and tell the bot: “Please, remove the salad and add a big portion of Bibimbap”. Here the system needs to understand that when the user says “remove the salad”, they mean the “Extra chicken Caesar Salad” (and not the coke, of course).
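A very rough sketch of that resolution step: fuzzily match the user’s short mention against the items currently on the order and pick the closest one. The word-overlap scoring here is an assumption; real systems use proper coreference or entity-linking models:

```python
def resolve_reference(mention, order):
    """Pick the order item sharing the most words with the user's mention."""
    mention_words = set(mention.lower().split())
    best, best_score = None, 0.0
    for item in order:
        item_words = set(item.lower().split())
        overlap = len(mention_words & item_words) / max(len(mention_words), 1)
        if overlap > best_score:
            best, best_score = item, overlap
    return best

order = ["Extra chicken Caesar Salad", "Coke"]
resolve_reference("the salad", order)
# → "Extra chicken Caesar Salad"
```

This already fails for paraphrases (“the chicken thing”), which is why the problem needs learned models rather than string matching.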

Additional problems arise when you start using voice assistants and not just text-based chatbots; in particular, some informal expressions are quite common in spoken language. See this example: “can I have a coffee with cream and a muffin… emmm, yeah a coffee and a muffin, please”. The user mentions both the coffee and the muffin twice but only wants one of each; the user is simply repeating themselves.
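One naive way to survive that kind of repetition is to extract items against a known menu and keep each item once, however many times it is mentioned. The menu and the one-of-each assumption are both illustrative, and the sketch deliberately ignores quantities (“two coffees” would also collapse to one):

```python
MENU = ["coffee", "muffin", "salad"]  # assumed item vocabulary

def extract_items(utterance):
    """Return each menu item mentioned in the utterance, deduplicated."""
    utterance = utterance.lower()
    found = []
    for item in MENU:
        if item in utterance:
            found.append(item)  # keep one, regardless of repetitions
    return found

extract_items("can I have a coffee with cream and a muffin... "
              "emmm, yeah a coffee and a muffin, please")
# → ["coffee", "muffin"]
```

Real disfluency handling has to tell repetition apart from genuine quantities, which this sketch cannot do.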

More problems arise when more than one person is present. One such problem is having multiple speakers. Here is an example:

  • Hi, I want a coffee, please. Daddy, can I have a muffin? (a child speaks in the background).

A similar situation is when one person is speaking to the bot but mid-sentence addresses someone else. A side-bar-conversation version of the previous example would be:

  • Hi, I want a coffee, please. Do you want a muffin? (here the person is asking someone else…) and can I get a muffin?

Some final words on tese probelms. Ok, you saw that, right? Well, when you talk to an assistant, an ASR (automatic speech recognition) system will turn your voice into text, and there can be mistakes. So we have found a new problem here, and it’s called misspelling correction. In addition, the problem is different if the correct word can be recovered by rearranging the characters (“probelms” can be turned into “problems”) or not (the characters in “tese” can’t be rearranged to make “these”).

Of course, you first need to tell the machine which words are misspelled; this is a misspelling detection problem.
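A classic baseline handles both steps at once: a word missing from the vocabulary is flagged (detection), and the vocabulary entry at the smallest edit distance is proposed (correction). The tiny vocabulary below is an assumption; real spell checkers use large dictionaries plus language-model context:

```python
VOCAB = {"these", "problems", "the", "on", "some", "final", "words"}

def edit_distance(a, b):
    """Levenshtein distance via classic dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def correct(word):
    if word in VOCAB:
        return word  # detection: in-vocabulary words pass through
    # correction: closest vocabulary entry by edit distance
    return min(VOCAB, key=lambda v: edit_distance(word, v))

correct("tese")      # → "these"
correct("probelms")  # → "problems"
```

Note the limitation the article points at: a misspelling that happens to be a real word (“form” for “from”) sails straight through the vocabulary check.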

One other class of problem is that even if you build a nice working model for any of these tasks, you still need to get it into production with reasonable execution speed and response time. Some of these models are huge, and while they can work well, they are slow to execute. One technique that has been used is to develop lite models built using distillation; these models are lighter and also faster. If that’s not enough, you can always use quantization, which means reducing the numerical precision of the parameters.
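A minimal sketch of what quantization means numerically, assuming the simplest scheme (one scale factor for the whole tensor, mapping float weights to signed 8-bit integers). Real toolkits quantize per layer or per channel and handle activations too:

```python
weights = [0.31, -1.24, 0.07, 0.98]  # pretend float32 weights

# Map the largest-magnitude weight to ±127 so everything fits in int8.
scale = max(abs(w) for w in weights) / 127

quantized = [round(w / scale) for w in weights]   # 8-bit integers: 1 byte each
dequantized = [q * scale for q in quantized]      # approximate recovery

max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
# Each weight now costs 1 byte instead of 4, and the reconstruction error
# is bounded by half a quantization step (scale / 2).
```

The speed win in practice comes from integer arithmetic and smaller memory traffic, not just the storage saving.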

The future?

Given all these interesting challenges of building a bot, a number of different solutions and machine learning models can come into play.

One interesting open question is how to consolidate those numerous individual models, or get rid of several of them altogether. Well, Google (Meena) and Facebook (Blenderbot, here and here) have been working on solutions to that. They trained their models by feeding a transformer real-world conversations with the objective of predicting the next token. Blenderbot, for example, is trained using the last 14 turns of a conversation.

Their results are just amazing: these models generate realistic conversations in an open-domain setting just by training an end-to-end model. Blenderbot, moreover, is open-sourced. However, such models can’t be used to retrieve information beyond what the model already knows. Remember the query vs. information-retrieval distinction we made above? Well, in a pure deep learning framework the text is generated only from what the model parameters encode, and although the result can be amazing, we can’t send an insert or an update to our database… Maybe adding a text-to-SQL model on top of any of these would make the magic happen…

Last but not least, text generation models sound more and more human-like these days, but they all have a big drawback: their output is really hard to control. The AI community is working to prevent models from generating offensive language, and there are lots of discussions going on around this topic.

The future of Conversational AI, NLP, and Machine Learning in general is wide open and a very active area of research. Here are some other interesting articles and readings on this topic:

This article brings the Conversational AI blog series to an end. We hope you enjoyed the articles and if you have questions, feedback, or suggestions please feel free to leave a comment. Thanks!

Loves Technology, Startups, and Tacos.