Named entity recognition (NER)—also called entity chunking or entity extraction—is a component of natural language processing (NLP) that identifies predefined categories of objects in a body of text.
These categories can include, but are not limited to, names of individuals, organizations, locations, expressions of times, quantities, medical codes, monetary values and percentages, among others. Essentially, NER is the process of taking a string of text (i.e., a sentence, paragraph or entire document), and identifying and classifying the entities that refer to each category.
When the term “NER” was coined at the Sixth Message Understanding Conference (MUC-6), the goal was to streamline information extraction tasks, which involved processing large amounts of unstructured text and identifying key information. Since then, NER has expanded and evolved, owing much of its evolution to advancements in machine learning and deep learning techniques.
According to a 2019 survey, about 64 percent of companies rely on structured data from internal resources, but fewer than 18 percent are leveraging unstructured data and social media comments to inform business decisions1.
The organizations that do utilize NER for unstructured data extraction rely on a range of approaches, but most fall into three broad categories: rule-based approaches, machine learning approaches and hybrid approaches.
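Of the three categories, the rule-based approach is the easiest to illustrate. The sketch below combines a hand-built gazetteer (dictionary lookup) with one pattern rule; the entity lists and labels are illustrative assumptions, not a real system.

```python
import re

# Toy gazetteer: hand-curated entity lists (illustrative only)
GAZETTEER = {
    "ORG": {"IBM", "Stanford University"},
    "LOC": {"New York", "Paris"},
}

def rule_based_ner(text):
    """Tag entities by dictionary lookup plus one regex rule for money."""
    entities = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                entities.append((match.group(), label))
    # Pattern rule: monetary values like "$5" or "$3.2 billion"
    for match in re.finditer(r"\$\d+(?:\.\d+)?(?:\s(?:million|billion))?", text):
        entities.append((match.group(), "MONEY"))
    return entities

print(rule_based_ner("IBM invested $3.2 billion in a lab in Paris."))
```

Rule-based systems like this are precise on the names they know but brittle: every new entity or spelling variant requires a new rule, which is the main reason machine learning approaches took over.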
Since the inception of NER, there have been significant methodological advancements, especially in approaches that rely on deep learning-based techniques.
The first step of NER is to aggregate a dataset of annotated text. The dataset should contain examples of text where named entities are labeled or marked, indicating their types. The annotations can be done manually or using automated methods.
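Annotated NER datasets are commonly stored in the BIO scheme, where each token carries a label: "B-" opens an entity, "I-" continues it, and "O" marks tokens outside any entity. A minimal sketch, with a made-up sentence:

```python
# One annotated sentence in BIO ("Begin/Inside/Outside") format
annotated = [
    ("Barack", "B-PER"), ("Obama", "I-PER"),
    ("visited", "O"),
    ("New", "B-LOC"), ("York", "I-LOC"),
    (".", "O"),
]

def bio_to_spans(tokens):
    """Collapse BIO labels back into (entity_text, type) spans."""
    spans, current, label = [], [], None
    for tok, tag in tokens:
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

print(bio_to_spans(annotated))  # [('Barack Obama', 'PER'), ('New York', 'LOC')]
```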
Once the dataset is collected, the text should be cleaned and formatted. You may need to remove unnecessary characters, normalize the text and/or split text into sentences or tokens.
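The cleaning and tokenization step can be as simple as a few regular expressions. This stdlib-only sketch normalizes whitespace, splits naively on sentence-final punctuation and tokenizes each sentence; production systems would use a proper tokenizer that handles abbreviations and edge cases.

```python
import re

def preprocess(text):
    """Minimal cleaning, sentence splitting and tokenization (stdlib only)."""
    text = re.sub(r"\s+", " ", text).strip()       # normalize whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text)   # naive sentence split
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

print(preprocess("She joined   IBM in 2011. She lives in Boston."))
```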
During this stage, relevant features are extracted from the preprocessed text. These features can include part-of-speech tagging (POS tagging), word embeddings and contextual information, among others. The choice of features will depend on the specific NER model the organization uses.
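For classical (CRF-style) models, "features" are hand-crafted properties of each token and its neighbors. The feature names below are illustrative; a neural model would replace most of this with word embeddings.

```python
def token_features(tokens, i):
    """Hand-crafted features for token i, in the style of classical CRF NER."""
    tok = tokens[i]
    return {
        "word.lower": tok.lower(),
        "word.istitle": tok.istitle(),   # capitalization hints at names
        "word.isupper": tok.isupper(),
        "word.isdigit": tok.isdigit(),
        "suffix3": tok[-3:],
        "prev.word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

feats = token_features(["Angela", "Merkel", "visited", "Paris"], 0)
print(feats["word.istitle"], feats["next.word"])
```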
The next step is to train a machine learning or deep learning model using the annotated dataset and the extracted features. The model learns to identify patterns and relationships between words in the text, as well as their corresponding named entity labels.
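As a deliberately simplified stand-in for real training, the baseline below just memorizes each word's most frequent label in the annotated data. Real NER models (CRFs, BiLSTMs, transformers) learn from context and generalize to unseen words, but the input/output shape is the same.

```python
from collections import Counter, defaultdict

def train_baseline(annotated_sentences):
    """Toy 'training': memorize each word's most frequent BIO label."""
    counts = defaultdict(Counter)
    for sentence in annotated_sentences:
        for token, label in sentence:
            counts[token][label] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

data = [
    [("Paris", "B-LOC"), ("is", "O"), ("lovely", "O")],
    [("Paris", "B-LOC"), ("Hilton", "I-PER")],  # conflicting evidence
    [("Paris", "B-PER"), ("Hilton", "I-PER"), ("spoke", "O")],
]
model = train_baseline(data)
print(model["Paris"])  # 'B-LOC' wins, 2 votes to 1
```

The "Paris" example also shows why context matters: a frequency lookup can only ever pick one label per word, whereas a real model disambiguates using the surrounding tokens.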
After you have trained the NER model, it should be evaluated to assess its performance. You can measure metrics like precision, recall and F1 score, which indicate how well the model correctly identifies and classifies named entities.
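Entity-level precision, recall and F1 can be computed directly from the sets of gold and predicted (text, type) spans:

```python
def prf(gold, predicted):
    """Entity-level precision, recall and F1 over (text, type) spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                       # exact-match true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("Barack Obama", "PER"), ("Hawaii", "LOC")]
pred = [("Barack Obama", "PER"), ("Obama", "PER")]
print(prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Note that this is strict exact-match scoring: the partial prediction "Obama" counts as a full error, which is the convention most NER benchmarks follow.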
Based on the evaluation results, you will refine the model to improve its performance. This can include adjusting hyperparameters, modifying the training data and/or using more advanced techniques (e.g., ensembling or domain adaptation).
At this stage, you can start using the model for inference on new, unseen text. The model will take the input text, apply the preprocessing steps, extract relevant features and ultimately predict the named entity labels for each token or span of text.
The output of the NER model may need to undergo post-processing steps to refine results and/or add contextual information. You may need to complete tasks like entity linking, wherein the named entities are linked to knowledge bases or databases for further enrichment.
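Entity linking, at its simplest, is a lookup from recognized surface forms to canonical knowledge-base identifiers. The mini knowledge base below is a hypothetical stand-in (the IDs are modeled on Wikidata-style identifiers); real linkers also handle ambiguity and unseen aliases.

```python
# Hypothetical mini knowledge base: surface form -> canonical ID
KNOWLEDGE_BASE = {
    "IBM": "Q37156",
    "International Business Machines": "Q37156",  # alias, same ID
    "Paris": "Q90",
}

def link_entities(entities):
    """Attach a knowledge-base ID to each recognized entity when one exists."""
    return [(text, label, KNOWLEDGE_BASE.get(text)) for text, label in entities]

print(link_entities([("IBM", "ORG"), ("Acme Corp", "ORG")]))
# [('IBM', 'ORG', 'Q37156'), ('Acme Corp', 'ORG', None)]
```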
The easiest way to implement a named entity recognition system is to rely on an application programming interface (API). NER APIs are web-based or local interfaces that provide access to NER functionalities. Some popular examples of NER APIs are:
NLTK is a leading open-source platform for building Python programs to work with human language data. It provides easy-to-use interfaces for more than 100 trained extraction models2. It also includes text processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning. NLTK has its own classifier to recognize named entities, called ne_chunk, but also provides a wrapper to use the Stanford NER tagger in Python.
Developed by Stanford University, the Stanford NER is a Java implementation widely considered the standard entity extraction library. It relies on conditional random fields (CRFs) and provides pre-trained models for extracting named entities.
Written in Python and known for its speed and user-friendliness, spaCy is an open-source software library for advanced NLP. It is built on recent research and designed for use in production systems. It also has an advanced statistical system that allows users to build customized NER extractors.
As technologies continue to evolve, NER systems will only become more ubiquitous, helping organizations make sense of the data they encounter every day. So far, it’s proven instrumental to multiple sectors, from healthcare and finance to customer service and cybersecurity.
Some of the most impactful use cases are:
NER is a crucial first step in extracting useful, structured information from large, unstructured databases. Search engines use NER to improve the relevance and precision of their search results.
News aggregators use NER to categorize articles and stories based on the named entities they contain, enabling a more organized, efficient way of presenting news to audiences. For instance, NER for news apps automates the classification process, grouping similar news stories together and providing a more comprehensive view of particular news events.
With the proliferation of social media platforms, the amount of textual data available for analysis is overwhelming. NER plays a significant role in social media analysis, identifying key entities in posts and comments to understand trends and public opinions about different topics (especially opinions around brands and products). This information can help companies conduct sentiment analyses, develop marketing strategies, craft customer service responses and accelerate product development efforts.
Virtual assistants and generative artificial intelligence chatbots use NER to understand user requests and customer support queries accurately. By identifying critical entities in user queries, these AI-powered tools can provide precise, context-specific responses. For example, in the query "Find Soul Food restaurants near Piedmont Park," NER helps the assistant understand "Soul Food" as the cuisine, "restaurants" as the type of establishment and "Piedmont Park" as the location.
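The restaurant query above can be sketched as a toy slot-filling step; the pattern and slot names (cuisine, location) are illustrative assumptions, whereas a real assistant would use a trained NER model rather than a single regex.

```python
import re

def parse_restaurant_query(query):
    """Toy slot extraction for queries shaped like 'Find X restaurants near Y'."""
    m = re.match(r"Find (?P<cuisine>.+?) restaurants near (?P<location>.+)", query)
    return m.groupdict() if m else {}

print(parse_restaurant_query("Find Soul Food restaurants near Piedmont Park"))
# {'cuisine': 'Soul Food', 'location': 'Piedmont Park'}
```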
In cybersecurity, NER helps companies identify potential threats and anomalies in network logs and other security-related data. For example, it can identify suspicious IP addresses, URLs, usernames and filenames in network security logs. As such, NER can facilitate more thorough security incident investigations and improve overall network security.
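Extracting security-relevant entities from log lines can be approximated with patterns; the log format and domain below are made up for illustration, and the IPv4 regex is deliberately loose (it would also match out-of-range octets).

```python
import re

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")  # loose IPv4 pattern
URL_RE = re.compile(r"https?://\S+")

def scan_log_line(line):
    """Pull IP addresses and URLs out of one log line."""
    return {"ips": IP_RE.findall(line), "urls": URL_RE.findall(line)}

line = "DENY 203.0.113.42 -> 10.0.0.5 fetch http://malware.example/payload.bin"
print(scan_log_line(line))
```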
NER has come a long way since its inception, integrating innovative technologies and expanding prolifically in its usefulness along the way. However, there are a few noteworthy challenges to consider when assessing NER technologies.
While NER has made a lot of progress for languages like English, it doesn’t have the same level of accuracy for many others. This is often due to a lack of labeled data in these languages. Cross-lingual NER, which involves transferring knowledge from one language to another, is an active area of research that may help bridge the NER language gap.
Sometimes entities can also be nested within other entities, and recognizing these nested entities can be challenging. For example, in the sentence "The Pennsylvania State University, University Park was established in 1855," both "Pennsylvania State University" and "The Pennsylvania State University, University Park" are valid entities.
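One way to see why nesting is hard: flat BIO tagging assigns one label per token, but the Penn State example needs two overlapping character spans, which only a span-based representation can hold.

```python
sentence = "The Pennsylvania State University, University Park was established in 1855."

# Nested entities as (start, end, type) character spans; the outer span
# contains the inner one, which flat BIO tagging cannot represent.
entities = [
    (4, 33, "ORG"),   # inner: "Pennsylvania State University"
    (0, 50, "ORG"),   # outer: "The Pennsylvania State University, University Park"
]
for start, end, label in entities:
    print(sentence[start:end], "->", label)
```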
Furthermore, while general NER models can identify common entities like names and locations, they may struggle with entities that are specific to a certain domain. For example, in the medical field, identifying complex terms like disease names or drug names can be challenging. Domain-specific NER models can be trained on specialized, domain-specific data, but procuring that information can itself prove challenging.
NER models can also encounter broader issues with ambiguity (for instance, "Apple" could refer to a fruit or the tech company); entity name variation (e.g., "USA," "U.S.A.," "United States" and "United States of America" all refer to the same country); and limited contextual information (wherein texts and/or sentences don’t contain enough context to accurately identify and categorize entities).
Though NER has its challenges, ongoing advancements are constantly improving its accuracy and applicability, and therefore helping minimize the impact of existing technology gaps.
While NER is a well-established field, there is still much work to be done.
Taking a look at the future, one promising area is unsupervised learning techniques for NER. While supervised learning techniques have performed well, they require lots of labeled data, which can be challenging to obtain. Unsupervised learning techniques don’t require labeled data and can help organizations overcome data availability challenges.
Another interesting direction is the integration of NER with other NLP tasks. For example, joint models for NER and entity linking (which involves linking entities to their corresponding entries in a knowledge base) or NER and coreference resolution (which involves determining when two or more expressions in a text refer to the same entity) could allow for systems that better understand and process text.
Few-shot learning and multimodal NER also expand the capabilities of NER technologies. With few-shot learning, models are trained to perform tasks with only a few examples, which can be particularly helpful when labeled data is scarce. Multimodal NER, on the other hand, involves integrating text with other modalities. An image or piece of audio, for example, could provide additional context that helps in recognizing entities.
1 Analytics and AI-driven enterprises thrive in the Age of With, Deloitte Insights, 25 July 2019
2 3 open source NLP tools for data extraction, InfoWorld, 10 July 2023