Natural Language Processing Research at Babbel
These days we can choose from multiple language learning apps. They might be alike in terms of their goal – to help someone learn a language – but they differ when it comes to teaching methods and, especially, content. It is not surprising that digital learning platforms often apply automated techniques to create their lessons. However, this is not the case at Babbel, where a whole department of linguists, teachers and language enthusiasts is dedicated to crafting the learning content and carrying out research on how people learn languages.
Diversity comes with many advantages and is also one of the core values at Babbel. After all, the company’s purpose is creating mutual understanding through language. This understanding needs to be reached in our product teams too, where colleagues might be native speakers of different professional languages, such as the language of a linguist or that of an engineer. Collaboration between teammates from varying professional backgrounds can sometimes seem challenging. In this article, however, I would like to share a success story of such interdisciplinary collaboration, which starts with the NLP Chapter.
Natural language processing (NLP) is a field where linguistics and programming meet, which is why it is often also called computational linguistics. At its core, it is an engineering discipline that works on language-related problems.
Some of the better-known NLP tasks are machine translation, speech recognition, virtual assistants and autocorrect. In our domain – language learning – corresponding examples could be: translation of complex language into simpler expressions, recognition of accented speech, a chatbot to practice your conversational skills with, or automated grammar error correction.
At Babbel, we have two teams working on NLP projects. C-3PO is a product team that currently focuses on automated feedback on pronunciation, proficiency assessment and speech recognition. The other group is the NLP Chapter.
The NLP Chapter is a self-organised research team of NLP enthusiasts – engineers and linguists alike – whose purpose is promoting NLP within the company. It is an interdisciplinary group where the editors learn about engineering and the engineers learn about didactics. Our work consists of discussing scientific papers and working on NLP research projects that could be beneficial for language learners.
The first Chapter project aimed to complement the in-house proficiency assessment tool, the Text Complexity Analyser (TCA). The tool analysed texts as one unit, and we wanted to extend its functionality to word-level analysis.
The project was a few months in when we learned that the new SemEval Shared Task, a well-known research competition in the NLP field, was also targeting lexical complexity. Therefore, the NLP Chapter decided to participate.
You can take a look at the following deep dive into linguistic complexity if you are curious to learn more about the subject before moving forward to the competition details.
Linguistic Complexity: Deep Dive
You might wonder what linguistic complexity is and why it is important in the first place. One should start by saying that the concept of complexity itself is not well defined. However, in the context of this study, let’s say that complexity relates to the amount of cognitive effort needed to understand, describe and explain things – such as words (Rescher, 1998). Therefore, the perception of complexity depends on the observer.
We can illustrate this with some linguistic examples. Aphasia makes it difficult to understand sentence structures, e.g. word order. In other cases, it might be the words themselves that are the most challenging. In linguistics, the former is called syntactic complexity and the latter – lexical complexity. However, the source of difficulty does not necessarily have to be linguistic. For example, people with dyslexia find it easier to read texts written in some fonts than in others.
The benefit of linguistic complexity measuring tools might be more evident now. In the case of aphasia, replacing complex sentence structures with simpler ones could help people read. Another great example is Wikipedia, which sometimes offers the option to select between English and Simple English. You can check it out with this article on computational linguistics.
In the language learning domain, the target audience is people who do not speak the language fluently. A complexity analysis tool like Babbel’s TCA allows us to create content that matches a learner’s proficiency level and, at the same time, can be used to test language skills. I myself have used it to find easy-to-read books (which might be quite counter-intuitive – apparently, the vocabulary of Harry Potter is more difficult than that of Kafka’s Trial!).
What about native speakers – what matters in their case? It seems to be a combination of various linguistic features (lexical and syntactic), plus the background of a person: this blog article might be easy to understand for computational linguists but considerably harder for people who are unfamiliar with the topic.
When it comes to lexical complexity specifically, the best single indicator that we have so far is frequency, which represents how often words are used in daily life (it can be computed by counting word occurrences in a large text corpus). For example, the or hi are used often and therefore perceived as easy, aerospace is somewhat less general, and locus might even be unheard of. In other words, frequency relates to the usage of the language (fun fact: the full vocabulary size of a native English speaker ranges from 15,000 to 30,000 words, but knowing just the few thousand most common words covers the majority of daily speech).
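To make the frequency idea concrete, here is a minimal sketch of how word frequencies could be computed by counting occurrences in a corpus. The tiny corpus below is made up purely for illustration; real frequency lists are built from corpora containing millions of words.

```python
from collections import Counter

# A toy corpus; a real one would contain millions of running words.
corpus = "the cat sat on the mat and the dog sat by the door"

counts = Counter(corpus.split())
total = sum(counts.values())

def relative_frequency(word):
    """How often a word appears per word of running text."""
    return counts[word] / total

# "the" dominates this toy corpus, matching its real-world status
# as one of the most frequent (and easiest) English words.
print(relative_frequency("the"))
```

In practice, frequencies are usually normalised (e.g. occurrences per million words) and computed over lemmas rather than raw word forms, but the counting principle is the same.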
One might then wonder whether we use some words more often because we find them easier, or whether these words are perceived as easy because they are used often. This is a discussion for another time, though. For now, let’s go back to the study of lexical complexity as perceived by native speakers.
SemEval Shared Task
SemEval is a series of international natural language processing research workshops whose mission is to advance the current state of the art in semantic analysis. The Lexical Complexity Prediction (LCP) shared task hosted at SemEval 2021 provided participants with a new annotated English dataset. Each data point in the dataset consisted of a target word in a context and its annotation – the complexity score on a 5-point easy-to-difficult scale. Here is an example of one such data point:
Context: This is not usually the way colleagues are treated.
Target word: colleagues
The goal for the competing teams was to use the given examples to train a model that could predict the complexity level of any word.
In more detail, the whole Machine Learning lifecycle for a task like this consists of three main components:
- Data collection and annotation. This step was carried out by the competition organisers (the complexity annotations were made by English speakers – each score is the average of seven estimations).
- Feature engineering: additionally enriching each data point with information known to correlate with lexical complexity.
- Model training and tuning: building a predictive model (training) and improving its performance by tweaking various parameters (tuning).
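As a rough illustration of the feature engineering step in the list above, the sketch below enriches data points with two features known to correlate with lexical complexity. All scores and feature values here are invented; the real dataset and feature set were much larger.

```python
# Each data point pairs a target word with its annotated complexity
# score (values invented for illustration).
data = [
    {"word": "colleagues", "complexity": 0.25},
    {"word": "locus", "complexity": 0.70},
]

# Hypothetical frequency lookup table (occurrences per million words);
# in practice this would come from a large reference corpus.
frequency_per_million = {"colleagues": 35.0, "locus": 1.2}

def add_features(point):
    """Enrich one data point with word length and corpus frequency."""
    word = point["word"]
    point["length"] = len(word)
    point["frequency"] = frequency_per_million.get(word, 0.0)
    return point

enriched = [add_features(p) for p in data]
print(enriched)
```

The enriched data points, rather than the raw words, are what the model sees during training and tuning.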
The latest NLP research usually focuses on the last part, that is, on developing complex Machine Learning models, while the linguistic side of the puzzle gets much less attention. However, since half of the NLP Chapter members are linguists, we decided to use it to our advantage and focus on the feature engineering step rather than the algorithmic improvements.
The following deep dive gives a simpler explanation of the task at hand. Feel free to skip it if everything makes sense so far.
Machine Learning: Deep Dive
Essentially, Machine Learning is a technique that tries to mimic human reasoning so let’s solve the problem from the human perspective. Our task is to learn how to estimate the complexity score of any English word on a 5 point easy-to-difficult scale. We have to figure out how to do it from these examples (training data):
We start with the feature engineering step – adding helpful information to the data. Let’s say that we have read some papers on lexical complexity and we decide to add word length and frequency (these are our selected features). The dataset now looks like this:
You probably just looked at the new information and found some patterns (or, as you might have guessed, completed the training and tuning step!). Even though we might be unaware of the complex computations happening in the brain, it is quite evident that:
- high frequency correlates with low complexity
- lower character count often correlates with low complexity
These are our learnings (recognised patterns). Now we are ready to apply them to new words (make predictions). We are given these examples (test data):
First, we add the necessary information:
Perhaps we are not too confident but we could guess that the easy-to-difficult ranking of these words is this-engagement-pineal. Visualising the data is even more helpful:
We see that there are clear clusters: this seems to be somewhere in between 1 and 2, engagement – 3 and pineal – 5.
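The informal reasoning above can be sketched as a couple of hand-written rules. The frequencies and thresholds below are invented for illustration only; a real model would learn such patterns from data rather than having them hard-coded.

```python
# Toy features for three words: (length in characters,
# frequency per million words). All numbers are invented.
features = {
    "this": (4, 5000),
    "engagement": (10, 60),
    "pineal": (6, 1),
}

def guess_complexity(length, frequency):
    """Apply the two hand-spotted patterns: frequent words are easy,
    and shorter words lean easier. Returns a score on a 1-5 scale."""
    score = 3  # start in the middle of the scale
    if frequency > 1000:
        score -= 2  # very frequent -> much easier
    elif frequency < 10:
        score += 2  # very rare -> much harder
    if length <= 5:
        score -= 1  # short words lean easier
    return max(1, min(5, score))

# Rank the words from easiest to most difficult.
ranking = sorted(features, key=lambda w: guess_complexity(*features[w]))
print(ranking)  # ['this', 'engagement', 'pineal']
```

With these made-up numbers, the rules reproduce the this–engagement–pineal ranking described above.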
With examples like these, the problem might seem simple. However, in real life there are plenty of cases where the conclusions are less evident, especially if precision is important. Consider the articles a and an, or the word and in two different contexts:

day and night

chromosome and chromatid

Is a easier than an because it is shorter and more frequent? Can the complexity level of one word vary depending on its context?
This is why training datasets often contain thousands or millions of examples and many more features. The computations, too, are much more complex and can differ significantly depending on the task. In fact, the main difference between Machine Learning and Deep Learning lies in the kind of computation that is applied.
The terms Artificial Intelligence and Machine Learning are often misleading for people who are not familiar with the subject – there is no actual thinking going on in the background. Learning in this context simply means the ability to recognise patterns in data (such as a high correlation between word frequency and complexity) and to make predictions on new data (such as inferring word complexity when only the frequency is known).
The first step of our work was to figure out which linguistic properties are the main indicators of linguistic complexity. Some of the well-known lexical complexity predictors are word length and frequency. After our research, however, we also included information such as age of acquisition, which refers to the age at which people tend to learn a word, and the percentage of the population that knows a word. On the technical side, we aimed to design a simple system that would not require much classifier tuning.
Our best model was trained on a set of only 36 linguistic features using a simple Random Forest classifier with the default training hyperparameters (which means that we did not change any of the configurable settings at all!). This is a common statistical Machine Learning method that works by constructing multiple decision trees (hence, the forest). A decision tree is a type of reasoning that we use in our daily lives – for example, deciding whether to go outside or stay inside based on a series of yes/no questions (only in our case, the decision was the complexity score of a word and the “questions” were based on a random selection of the features). The final decision is then made by averaging the output of all such trees.
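As a minimal sketch of this kind of setup, the snippet below trains a Random Forest with scikit-learn’s default hyperparameters on a handful of invented feature rows. The real system used 36 features per word and a much larger dataset; since the complexity scores are continuous, the regressor variant of the model is shown here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Invented feature rows: [word length, frequency, age of acquisition].
X_train = np.array([
    [4, 0.0500, 3.0],   # short, frequent, learned early -> easy
    [6, 0.0100, 5.0],
    [10, 0.0010, 9.5],
    [12, 0.0002, 12.0], # long, rare, learned late -> hard
])
y_train = np.array([0.1, 0.3, 0.6, 0.8])  # complexity scores (0 = easiest)

# Default hyperparameters only - no tuning, as in the article.
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)

# Predict the complexity of a new, unseen word.
prediction = model.predict([[5, 0.02, 4.0]])[0]
print(round(prediction, 2))
```

Because a Random Forest averages the outputs of its trees, predictions always fall within the range of the training scores – one reason such models behave predictably even without tuning.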
In comparison, many other submitted systems applied Deep Learning methods, and their feature sets contained hundreds of features, such as word embeddings (word representations as vectors). One of the main advantages of a simpler system like ours is that it is computationally cheaper to build and maintain, and can therefore be more easily applied in a product.
Results & Conclusion
The NLP Chapter paper A Simple System Goes a Long Way successfully went through the peer review process, and our model placed in the upper half of the shared task with a Pearson correlation score of 0.7588. The score of the winning system was 0.7886. You can find our paper here and the task proceedings here.
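For reference, Pearson’s score measures the linear correlation between the model’s predictions and the human annotations: 1.0 would mean perfect linear agreement. A quick sketch with made-up scores:

```python
import numpy as np

# Hypothetical gold (human-annotated) and predicted complexity scores.
gold = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
predicted = np.array([0.15, 0.25, 0.55, 0.65, 0.95])

# Pearson's correlation coefficient between the two score series.
r = np.corrcoef(gold, predicted)[0, 1]
print(round(r, 4))
```

A score of 0.7588 therefore means the model’s predictions track the human judgements closely, though not perfectly.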
A scientific publication was an unexpected and rewarding outcome of our first project. We have demonstrated how important linguistic research is in natural language processing. This is often forgotten, and people spend unnecessary effort developing complex systems that are computationally expensive and difficult to replicate. This could not have been achieved without the special setup of Babbelonians – a crowd of linguists and engineers.
Additionally, it has been a true learning experience for the Chapter members. The engineers have learned to look at problems from the didactic perspective, and the editors became more aware of the possibilities and limitations of NLP and how this technology can assist them in their daily work.
While C-3PO is currently incorporating the developed lexical complexity technology into the TCA, the NLP Chapter continues its work with the next research challenge – a prototype for automated assessment of learner essays. Just like with the first project, we are trying to combine both didactics and NLP, and are picking up some new skills on the way too.
Photo by Patrick Tomasso on Unsplash