Named Entity Recognition with weakly-labeled data


Named entity recognition (NER) is an indispensable technique for analyzing text: it locates and classifies predefined entities, such as locations, companies, or person names.

We can use NER in many real-world problems, e.g.

  • extracting company names from news articles
  • identifying persons mentioned in social media posts
  • extracting product attributes from the search queries of an e-commerce website.

Over the last few years, several methods and models have been published that allow us to use NER out of the box or to fine-tune a model on a domain-specific dataset. However, even the best machine learning architecture needs high-quality training data to make accurate predictions.

Unfortunately, creating a high-quality training dataset is a manual process in which every word of the dataset must be assigned an entity label. Labeling all data manually is time-consuming and expensive.

In some cases, we only have weakly-labeled data. For example, we know that certain entities appear in a given text, but we do not know their exact positions.
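As an illustration (a hypothetical sketch, not part of any existing dataset or library), one naive way to turn such weak labels into token-level tags is to match the known entity strings against the tokens; this only recovers single-token entities and will mislabel ambiguous surface forms:

```python
# Hypothetical sketch: converting a weak label (we only know WHICH entities
# occur in the sentence, not WHERE) into token-level BIO tags via naive
# exact string matching. Handles single-token entities only.
def weak_to_bio(tokens, entities):
    """entities: dict mapping a known entity string -> its class label."""
    tags = ["O"] * len(tokens)
    for i, tok in enumerate(tokens):
        label = entities.get(tok)
        if label:
            tags[i] = "B-" + label
    return tags

tokens = ["Berlin", "lies", "in", "northeastern", "Germany", "."]
weak = {"Berlin": "LOC", "Germany": "LOC"}
print(weak_to_bio(tokens, weak))
# ['B-LOC', 'O', 'O', 'O', 'B-LOC', 'O']
```

Heuristics like this are one reason weakly-derived token labels tend to be noisy, which motivates the comparison described below.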

If we want to use the weakly-labeled data for the model training, we have three options:

  • We could train the NER model on the complete weakly-labeled dataset
  • We could manually label a small number of samples from the original dataset, which takes little time and guarantees good data quality
  • We could use the weakly-labeled dataset to train a generative model

The first and second approaches require a model that predicts, for each word, whether it is a named entity and, if so, its class (e.g. organization or country). For this, we could use the weakly-labeled data or manually label a few samples. We expect that a model trained on a few well-labeled samples will outperform a model trained on the whole weakly-labeled dataset.

To avoid additional labeling, the idea behind the third approach is to frame the challenge as a sequence-to-sequence problem. We therefore want to train a sequence-to-sequence model that learns to generate information based on a given context. In particular, we want to build a question-answering model that takes the text containing the named entities as input. The model is trained to answer questions about specific entities: if the entity is part of the context, the model should return the entity's value.

Assume we pass the sentence “Berlin lies in northeastern Germany, [...]” into our model together with a question about countries; the model should answer “Germany”.
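This reformulation can be made concrete by generating (question, context, answer) training samples from the weak labels. The question templates and function names below are illustrative assumptions, not from any dataset standard:

```python
# Hypothetical sketch: building question-answering training samples from
# weakly-labeled data. For each entity class we ask a fixed template
# question; the known entity string serves as the answer.
QUESTIONS = {
    "country": "Which country is mentioned in the text?",
    "city": "Which city is mentioned in the text?",
}

def make_qa_samples(context, weak_labels):
    """weak_labels: dict mapping an entity class -> the entity text
    known to occur somewhere in the context."""
    samples = []
    for ent_class, answer in weak_labels.items():
        samples.append({
            "question": QUESTIONS[ent_class],
            "context": context,
            "answer": answer,
        })
    return samples

context = "Berlin lies in northeastern Germany."
for s in make_qa_samples(context, {"country": "Germany", "city": "Berlin"}):
    print(s["question"], "->", s["answer"])
```

Samples in this shape could then be fed to a standard extractive question-answering fine-tuning setup, e.g. with Hugging Face transformers.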

The advantage of this approach is that we do not need specific information about the exact position of the entity in the given text.
However, it is not obvious which approach to prefer when we only have weakly-labeled data. We want to find a good trade-off between the effort of manual labeling and model accuracy.


The goal of this project is to find the best-working approach for training an NER model with a weakly-labeled dataset. You will propose advisory guidelines on when to use which method.
To this end, you will look into possible training datasets and train at least three different machine learning models for the task of NER:

  • an NER model on the small strongly-labeled dataset
  • an NER model on the large weakly-labeled dataset
  • a generative model on the large weakly-labeled dataset, framed as a question-answering problem
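For the later comparison, entity-level F1 is a common metric; a minimal set-based sketch (an illustrative assumption, not a full CoNLL-style evaluation) could look like this:

```python
# Illustrative sketch: entity-level F1 for comparing the trained models.
# Entities are (text, label) tuples; naive exact-match set comparison.
def entity_f1(predicted, gold):
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)          # true positives
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("Berlin", "LOC"), ("Germany", "LOC")]
pred = [("Berlin", "LOC"), ("Europe", "LOC")]
print(round(entity_f1(pred, gold), 2))  # 0.5
```

In practice, a library such as seqeval is typically used for span-level NER evaluation instead of a hand-rolled metric.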

Afterwards, your mission is to compare the models and answer the question of which approach is the most efficient. If you are the right agent for this mission, feel free to contact us.

Profile / Required skills

  • Strong interest in NLP 
  • Familiarity with Python; first experience with spaCy and Hugging Face transformers is a big plus
  • Excellent verbal and written communication in English.
  • You are currently pursuing a degree in computer science or a related field.

Internship Duration

The duration of the internship is flexible and depends on the candidate's preference and the project requirements. The typical duration is 6 to 8 weeks; for this specific project, 8 weeks is preferred.

  • Week 1-2: Literature review
  • Week 3-4: Dataset creation
  • Week 5-6: Hands-on Hugging Face and model training
  • Week 7-8: Evaluate models, internship debrief



Thomas Dehaene Chapter Lead

Matthias Richter Machine Learning Engineer (daily supervisor)