Beating the Unbeatable*

Big corporations like Microsoft, IBM, Google, Amazon or Facebook have an unfair advantage when it comes to Machine Learning. They have massive amounts of data, their own datacenters and the budget to hire the most talented software engineers in the world. So how can a small start-up realistically compete with these mighty corporations in the AI and Machine Learning game?

As a 25-strong AI-focused company, we at Faktion always aim to deliver the most accurate, cutting-edge Natural Language Understanding (NLU) models. NLU models are used by some of the largest and most innovative companies in the world for conversational interfaces (chat, voice, e-mail), document classification, mood analysis, script optimization and more. To underpin the claim that our models are the best, we decided to benchmark our existing NLU model for intent classification against giants of the industry: LUIS from Microsoft and Watson Conversation from IBM.

We decided to benchmark these previously unbeatable NLU engines against our own NLU models on languages that receive little attention. As far as we are aware, no benchmarks exist for Dutch and French NLU models. The engines delivered by Microsoft and IBM enjoy the unfair advantage of unlimited budgets, powerful datacenters and immense amounts of user data. Still, something subtle is missing: attention to the peculiarities of each language. We build chatbots in non-English-speaking countries, and for our clients privacy and language-specific nuances are of great importance. Our expertise is mainly in Deep Learning and in NLU models where language is treated as a sequence of words (such as utterances and sentences) that are connected to each other and processed with a Recurrent Neural Network architecture.

We focused this benchmarking effort on several key aspects. First, we wanted to understand how well each NLU engine classifies intents for clean Dutch expressions. Second, we benchmarked the NLU engines on real-world chatbot expressions used in production systems. Finally, we compared the accuracy of the NLU models for two languages, French and Dutch. This gave us a clear insight into the language-specific accuracy of the models in our largest client zones (The Netherlands, Flanders, France, Luxembourg and Wallonia).

Experimental Setting

Here we define the steps needed to reproduce our results for the LUIS and Watson NLU engines. We start with the data pre-processing and cleaning routines, which can be reduced to the following set of actions.

For expressions:
  1. Strip the utterance (remove leading and trailing whitespace).
  2. Replace all EOF, tab and newline characters in the utterance with spaces.
  3. If the utterance is longer than 500 characters, split it on whitespace and re-join words for as long as the result stays within the 500-character limit (LUIS API requirement).
  4. Convert all characters in the expression to lowercase.

For intent labels:
  1. Strip the intent (remove leading and trailing whitespace).
  2. Replace all non-alphanumeric characters in the intent with underscores (Watson API requirement).
  3. Keep only the first 128 characters of every intent name (Watson API requirement).
  4. Convert all characters in the intent to lowercase.

After these necessary steps were performed on the raw expression data, we applied two further steps to ensure the integrity of the input data:

  1. Remove all duplicates.
  2. Take only non-empty expressions into account.
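
The cleaning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the benchmark: the function names are hypothetical, "alphanumeric" is read as ASCII letters and digits, carriage returns are treated like newlines, and duplicates are taken to mean identical (expression, intent) pairs after cleaning.

```python
import re

MAX_EXPRESSION_CHARS = 500  # LUIS limit quoted in the text
MAX_INTENT_CHARS = 128      # Watson limit quoted in the text


def clean_expression(text: str) -> str:
    """Strip, flatten tabs/newlines to spaces, truncate on a word boundary, lowercase."""
    text = text.strip()
    text = re.sub(r"[\t\r\n]", " ", text)
    if len(text) > MAX_EXPRESSION_CHARS:
        kept, length = [], 0
        for word in text.split():
            extra = len(word) + (1 if kept else 0)  # +1 for the joining space
            if length + extra > MAX_EXPRESSION_CHARS:
                break
            kept.append(word)
            length += extra
        # A single word longer than the limit leaves nothing to keep;
        # the resulting empty expression is dropped later by prepare().
        text = " ".join(kept)
    return text.lower()


def clean_intent(name: str) -> str:
    """Strip, replace non-alphanumeric (ASCII, by assumption) with '_', truncate, lowercase."""
    name = name.strip()
    name = re.sub(r"[^0-9A-Za-z]", "_", name)
    return name[:MAX_INTENT_CHARS].lower()


def prepare(pairs):
    """Clean (expression, intent) pairs, drop empty expressions and duplicates, keep order."""
    seen = set()
    result = []
    for expression, intent in pairs:
        row = (clean_expression(expression), clean_intent(intent))
        if row[0] and row not in seen:
            seen.add(row)
            result.append(row)
    return result
```

Calling `prepare` on the raw list of (expression, intent) pairs returns the cleaned, de-duplicated list in its original order, which keeps later train/test splits reproducible.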

Training and test setup

To ensure the reproducibility of the training/test routines we define here some of the tools, techniques and methodologies used to split and evaluate the models.

  1. We used only Python code and the scikit-learn library to split the data and evaluate the models.
  2. We performed a stratified, shuffled 5-fold train/test split of the chatbot expressions.
  3. Wherever possible (for example, in every model run) we set the random seed to 123.
  4. We ran all the models with their default parameters and confidence thresholds.
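
A split of this kind could look roughly like the sketch below. It assumes scikit-learn's `StratifiedKFold` is the splitter (the post names the library but not the class), and the helper name and toy data are made up for illustration.

```python
from sklearn.model_selection import StratifiedKFold

SEED = 123  # the random seed quoted in the text


def stratified_folds(expressions, intents, n_splits=5, seed=SEED):
    """Yield (train_indices, test_indices) for a shuffled, stratified k-fold split."""
    splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in splitter.split(expressions, intents):
        yield train_idx.tolist(), test_idx.tolist()


# Toy data (hypothetical): two intents with ten expressions each.
expressions = [f"uitdrukking {i}" for i in range(20)]
intents = ["groet"] * 10 + ["bestelling"] * 10

folds = list(stratified_folds(expressions, intents))
for train_idx, test_idx in folds:
    # Each test fold keeps the intent proportions of the full data set.
    print(len(train_idx), len(test_idx))
```

Stratification matters here because intents with few expressions would otherwise be missing from some folds entirely, which would distort the per-intent scores.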

All test predictions from the different train/test splits (separately for the clean Dutch and the production chatbot expressions) were consolidated into one CSV file at the end. For the LUIS and Watson NLU engines, we polled the server until the training procedure had finished before requesting predictions. All other LUIS- and Watson-specific API requirements were met as well.
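
Waiting for a hosted engine to finish training typically comes down to a polling loop. The sketch below is generic and does not call the LUIS or Watson APIs: `get_status` is a hypothetical callable that the caller would wrap around the real status endpoint, and the status strings `"trained"` and `"failed"` are assumptions.

```python
import time


def wait_until_trained(get_status, poll_seconds=2.0, timeout_seconds=600.0, sleep=time.sleep):
    """Call get_status() until it reports 'trained'; raise on failure or timeout.

    get_status is a hypothetical zero-argument callable returning a status string.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == "trained":
            return
        if status == "failed":
            raise RuntimeError("training failed")
        if time.monotonic() >= deadline:
            raise TimeoutError("training did not finish in time")
        sleep(poll_seconds)
```

The `sleep` parameter is injectable so the loop can be exercised without real waiting; in normal use the default `time.sleep` applies.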


We start with the analysis of the clean Dutch expressions, which are ideal for quickly verifying the predictive power of all NLU engines. Then we proceed to the results for the chatbot expressions and their corresponding intent classification.

We present two main performance metrics: classification accuracy and F1-score. All results are aggregated separately by either intent name or chatbot name. We show boxplots of these metrics, displaying the mean, median, quantiles and outliers.
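
One plausible way to compute these two metrics per intent is sketched below. The post does not spell out the exact definitions, so this assumes that per-intent accuracy is the share of an intent's expressions that were predicted correctly and that F1 is computed one-vs-rest for each intent; the function name is hypothetical.

```python
from collections import Counter


def per_intent_scores(true_intents, predicted_intents):
    """Return {intent: {"accuracy", "f1", "support"}} for every intent in the true labels."""
    support = Counter(true_intents)         # TP + FN per intent
    predicted = Counter(predicted_intents)  # TP + FP per intent
    correct = Counter(t for t, p in zip(true_intents, predicted_intents) if t == p)
    scores = {}
    for intent, n in support.items():
        tp = correct[intent]
        scores[intent] = {
            "accuracy": tp / n,
            # F1 = 2TP / (2TP + FP + FN) = 2TP / (support + predicted count)
            "f1": 2 * tp / (n + predicted[intent]),
            "support": n,
        }
    return scores
```

Under this reading, an intent that is never predicted correctly gets zero for both metrics.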

The displayed information can be summarized as follows:

  1. The X-axis represents the NLU-engine dimension. Each engine label carries two scores in brackets:
    • the first score is the weighted average across all scores (weighted by the number of expressions per intent or chatbot);
    • the second score is the median of all scores, also shown as the orange line in the boxplot.
  2. The Y-axis represents the score dimension in the range [0..1].
  3. In addition to the median, quantiles and fences, the mean score is indicated with a green triangle.
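
The summary numbers described in this list can be reproduced with the standard library alone. The sketch below assumes one score per intent (or chatbot) together with its number of expressions; the function name and the example values are illustrative only.

```python
from statistics import mean, median


def summarize(scores, supports):
    """Return (weighted average, median, mean) of per-group scores.

    scores:   one score in [0, 1] per intent or chatbot
    supports: number of expressions behind each score, used as weights
    """
    total = sum(supports)
    weighted = sum(s * w for s, w in zip(scores, supports)) / total
    return weighted, median(scores), mean(scores)


# Illustrative values: three intents with 6, 3 and 1 expressions.
weighted, med, avg = summarize([1.0, 0.5, 0.0], [6, 3, 1])
```

The weighted average rewards doing well on frequent intents, while the median and mean treat every intent equally, so a gap between them hints at poorly classified rare intents.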

Figures for clean Dutch expressions

The boxplots below show the classification accuracies and F1-scores per intent for the clean Dutch expressions. It is easy to see that our NLU engine outperforms LUIS by a large margin and beats IBM Watson both on the overall statistics and on the variance of the results.

Figures for real-world chatbot expressions

Next we switch to the results for the chatbot expressions, which are less clean and give a more realistic picture of NLU engines in action. We still observe a clear win for the Faktion NLU engine over LUIS, with IBM Watson a close runner-up. The first figure shows the results obtained for the Dutch expressions. The second figure shows the French chatbot expressions and the corresponding classification metrics.

As we can see, all NLU engines have problems with some particular intents, which end up being completely misclassified (zero accuracy and F1-score). These intents do not have enough expressions in the training data. Next we aggregate the performance metrics per chatbot and show these statistics in the boxplots below. As before, the first figure shows our findings for the Dutch expressions, while the second one is representative of the French chatbots.

We still observe a clear lead for our home-brewed NLU engine over LUIS and IBM Watson. This is well aligned with the classification performance we see day to day in production.

Discussion and conclusion

In this short blog post we discussed NLU systems for languages that are rarely (if ever) considered for benchmarking: Dutch and French. We have demonstrated that a mixture of good Machine/Deep Learning, Natural Language Understanding and domain-specific expertise can lead to a significant boost in performance. This is all the more notable if we consider how many resources large corporations like Microsoft and IBM can pour into Deep Learning and NLU research and development.

An added benefit is that we are a European-focused, GDPR-certified and privacy-aware service: no need to send your confidential customer data to some American megacorp. We can also deploy on premises or in your private cloud.

Our empirical findings and performance evaluations confirm that even a small AI-focused company can outperform the giants of the industry by keeping customer needs in mind and putting the right blend of technical and domain expertise in the hands of a highly devoted team of engineers and data scientists.

* an allusion to Defending the Undefendable