Part 3 - designing NLU training data
Introduction
In the previous tutorial, we looked at how the Rasa framework works in the background by taking a closer look at its libraries and how they operate together to initialize the structure of any Rasa bot. In this tutorial, we are going to learn how to create NLU training data that lets our bot categorize user utterances and store structured information, using features such as entities, synonyms, regular expressions, and lookup tables. Then, we will cover some best practices for designing effective NLU training data that produces good results.
Training examples
NLU training data consists of example user utterances, categorized by what the user wants to achieve during an interaction with the bot. To define NLU training data, you need to include a top-level key named nlu at the beginning of the nlu.yml file. Then, you classify the user utterances into categories that represent intents in Rasa. Intents are added under the nlu key to describe what users might say in each category. You can give them any name you want, just avoid spaces and special characters, and each intent's name should reflect the goal users want to accomplish with it.
# (In case you don't specify the version key, Rasa will assume you are using the latest version)
version: "3.0"
nlu:
- intent: greet
  examples: |
    - hey
    - hello
    - hi
    - hello there
    - good morning
    - good evening
    - hey there
    - hey dude
    - goodmorning
    - goodevening
    - good afternoon
- intent: goodbye
  examples: |
    - good night
    - bye
    - goodbye
    - have a nice day
    - see you around
    - bye bye
    - see you later
Next, don't forget to register your intents in domain.yml as follows:
version: "3.0"
intents:
  - greet
  - goodbye
Training examples can also include structured data such as entities, synonyms, regular expressions, and lookup tables, which can be used to extract information and improve intent identification; each of these is covered in its own section below. In addition, it is a good idea to split your training examples across multiple files if your bot covers many topics. According to Rasa, this is a good practice for managing the different types of conversations your bot has to handle.
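For example (the file names here are hypothetical; organize the files however suits your project), you could keep greeting and booking examples in separate files inside your data directory, and Rasa will read and merge all of them at training time:
# data/nlu/greetings.yml (hypothetical file name)
version: "3.0"
nlu:
- intent: greet
  examples: |
    - hey
    - hello
# data/nlu/booking.yml (hypothetical file name)
version: "3.0"
nlu:
- intent: book_destination
  examples: |
    - I want to book a ticket to Casablanca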
Entities
The term entity belongs to a large area of NLP research called Named Entity Recognition (NER). Researchers use the term entity to refer to a particular object in the world (a person, animal, place, thing, or even a concept). For example, the term tiger refers to a particular animal, one of Earth's biggest wild cats. Put simply, you can think of NER as an operation that picks out words in a body of text, tags the named entities, and classifies them into categories like animals, concepts, people, projects, devices, etc.
For more details, you can visit this free resource to learn about the latest papers, code, and datasets used for NER research.
In Rasa, entities are structured pieces of information that can be extracted from user messages. They can hold important details about the user, such as numbers, dates, and names, so that the bot can use them later in the conversation. Take flight booking as an example: it is very useful for the bot to know which part of the user's message refers to a destination. That's why, when the user says they would like to book a ticket to Casablanca, Casablanca is extracted as an entity of type destination.
Example: I would like to book a flight to Casablanca
The training data for entity extraction should live inside your intent examples in the nlu.yml file. The word to be extracted as an entity must be surrounded by square brackets, followed by the entity label in parentheses, or you can use the more descriptive dictionary syntax shown in the second example below:
version: "3.0"
nlu:
- intent: book_destination
  examples: |
    - I would like to book a flight to [Casablanca](destination)
    - I want to book a ticket to [Casablanca]{"entity": "destination", "value": "Casablanca"}
Next, register your entities in domain.yml as shown below:
version: "3.0"
intents:
  - greet
  - book_destination
  - goodbye
entities:
  - destination
You can also control which entities each intent uses in the same file (domain.yml), as shown below. The use_entities key lists the entities that matter for an intent, while ignore_entities lets you exclude entities that intent does not need:
version: "3.0"
intents:
  - greet
  - book_destination:
      use_entities:
        - destination
      # ignore_entities:
      #   - email
  - goodbye
entities:
  - destination
There are three other ways entities can be extracted in Rasa, using pre-built models, regular expressions, and machine learning, which we will discuss in future tutorials. Rasa returns the result of entity extraction as JSON, which includes the entity category, the entity value, confidence levels, and the component that extracted the entity. Note that before deciding whether to use entities, it is worth thinking about what information the bot actually needs to fulfill the user's goals.
Here is an example of an extracted entity, as it appears in a Rasa debug log:
Received user message 'I'm from australia' with intent '{'name': 'your_nation', 'confidence': 0.9201310276985168}' and entities '[{'entity': 'nations', 'start': 9, 'end': 18, 'value': 'australia', 'extractor': 'RegexEntityExtractor'}]'
Synonyms
As the name suggests, you can use synonyms when users may refer to the information your bot wants to extract in multiple ways. For example, suppose you have an entity type positivemood that your bot uses to capture positive emotions, and this entity should always map to a single value such as good. Users may express this with equivalent phrases like pretty good, very good, and really good; this is where synonyms come in. However, synonyms do not help the model generalize to unseen variations: in the previous example, equivalent phrases the model has never seen annotated, such as fine, awesome, or well, will not be extracted. In other words, use synonyms when you want to map extracted entity values to a single normalized value.
According to Rasa documentation, synonyms should be defined in your nlu.yml using the following format:
version: "3.0"
nlu:
- intent: positive
  examples: |
    - I'm feeling [good]{"entity": "positivemood"}
    - I'm [really good]{"entity": "positivemood"}
    - [pretty good]{"entity": "positivemood"} as always
- synonym: good
  examples: |
    - pretty good
    - really good
Or, you can use the in-line method to define synonyms:
version: "3.0"
nlu:
- intent: positive
  examples: |
    - I'm feeling [good]{"entity": "positivemood", "value": "good"}
    - I'm [really good]{"entity": "positivemood", "value": "good"}
    - [pretty good]{"entity": "positivemood", "value": "good"} as always
Remember that you should always include synonym values in your training examples so that they can be extracted as entities and mapped to the corresponding value.
Regular expressions
Regular expressions can improve entity extraction when you use the RegexEntityExtractor component in your pipeline in config.yml. Suppose your bot needs to extract phone numbers from users; the regular expression pattern must match the given number format. In the following example, we include annotated examples and a regex pattern so the bot can extract 7-digit phone numbers:
version: "3.0"
nlu:
- regex: phone
  examples: |
    - \d{7}
- intent: phone_number
  examples: |
    - my phone number is [7653421](phone)
    - This is my phone number [7654321](phone)
Note that the name you give your regex pattern should be the same as the name of your entity, so that extraction works properly.
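For this to work, RegexEntityExtractor has to be added to the NLU pipeline in config.yml. Here is a minimal sketch, assuming an otherwise standard DIET-based pipeline (your actual pipeline may look different):
# config.yml (excerpt; only the relevant pipeline part is shown)
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexEntityExtractor   # extracts entities that match your regex patterns
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100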
Lookup tables
You can use lookup tables to list the known words users may say to refer to an entity in your training examples. In this example, the bot asks the user what is your nationality?. The user may reply with countries other than Morocco, so a lookup table lets you define the known possible values that might appear in the user message. Based on the example below, you would create a lookup table that contains all countries.
version: "3.0"
nlu:
- intent: your_nation
  examples: |
    - I'm [Moroccan]{"entity": "nations"}
    - I'm from [Morocco]{"entity": "nations"}
    - I'm coming from [Morocco]{"entity": "nations"}
    - [Morocco]{"entity": "nations"} is my country
- lookup: nations
  examples: |
    - America
    - Australia
    - Brazil
    - Japan
    - Denmark
    - Egypt
    - Nigeria
    - India
    - Finland
    - Canada
    - Colombia
    - New Zealand
    - Jamaica
Lookup tables are used by the RegexEntityExtractor component to extract entities, in combination with the RegexFeaturizer component, which turns lookup matches into features for machine-learning-based components. Therefore, you have to include these components in your config.yml, and don't forget to provide enough annotated examples to get good results.
Best practices
Sometimes we focus on quantity instead of quality when creating training data, which leads to bad practices that make the data worse. Based on this blog from Rasa, here are some good habits to keep in mind when you design the NLU of your bot:
It is preferable to use real data from real-world conversations instead of producing implausible examples with autogeneration techniques.
Create distinct training examples for each category to avoid intent confusion.
Don't treat synonyms as a way to improve entity extraction; they are only a feature for mapping related entity values to a single value, so use them wisely.
To help your model extract entities correctly, include some values from your lookup tables in your training examples, which gives the model a better representation of each entity.
Include an out-of-scope intent to confine the conversation to the bot's domain (see the sketch after this list).
Keep track of your training examples the way you keep track of your code, so you can roll back changes if things don't go as expected.
Don't skip testing; it helps make sure your model gives the desired predictions.
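For instance, an out-of-scope intent is just another intent in nlu.yml that collects messages your bot is not meant to handle (the examples below are hypothetical; ideally, collect real out-of-scope messages from your own users):
version: "3.0"
nlu:
- intent: out_of_scope
  examples: |
    - What's the weather like today?
    - Can you order me a pizza?
    - Who is the president of France?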
What’s next?
Hopefully, this tutorial helped you learn what NLU training data is and how to create it for your bot. In the next blog, we will learn in detail how to create stories to design a conversation flow, so your bot can respond to user messages and generalize to unseen conversation paths.
In case you missed it, here is the link to my previous blogs on Rasa:
Part 1 - getting started with your first Rasa bot
Part 2 - closer look at Rasa components
If you have any questions, leave them in the comments below or feel free to contact me on LinkedIn. Don’t forget this post is public so feel free to share it.