Modern neural transformer architectures dominate many NLP tasks. This tutorial focuses on classical approaches that remain valuable in data science workflows where interpretability, minimal dependencies and predictable summary length matter. These methods are often used to automatically generate concise summaries from a large corpus without requiring a labeled dataset.
By the end of this tutorial you’ll understand:
- how automatic text summarization began and how the field has evolved
- how the Luhn, LexRank and latent semantic analysis (LSA) extractive algorithms work
- when to choose each algorithm for your own summarization tasks
Text summarization can be broadly categorized into two approaches:
- Extractive summarization selects the most important sentences verbatim from the source text. The classic algorithms covered in this tutorial are extractive.
- Abstractive summarization generates new sentences that paraphrase the source, the approach taken by modern neural models.
Automatic text summarization began in 1958 with Hans Peter Luhn, an IBM researcher who published “The automatic creation of literature abstracts”. Luhn’s algorithm was groundbreaking in its simplicity: determine sentence importance by counting the frequency of meaningful words. Though basic by today’s standards, this frequency-based approach established the foundation for subsequent work in the field.
Luhn’s statistical method had clear limitations: it couldn’t capture semantic relationships, context or nuance in language. Over the following decades, researchers expanded on his work by incorporating:
- positional and structural cues, such as where a sentence appears in a document
- graph-based ranking that scores sentences by their similarity to one another, as in LexRank
- semantic techniques such as latent semantic analysis (LSA), which model the underlying concepts in a text
Understanding these algorithms illuminates fundamental concepts in information retrieval (IR) and natural language processing (NLP), while showing the field’s evolution from rule-based systems to the sophisticated deep-learning models in use today. These models are commonly accessed through platforms like Hugging Face, exposed through an API and powered by frameworks such as PyTorch.
The following section provides a step-by-step walkthrough for implementing classic extractive text summarization algorithms in Python.
To run this project, clone the GitHub repository by using its HTTPS URL. For detailed steps on cloning a repository, refer to the GitHub documentation. You can find this specific tutorial inside the ibmdotcom-tutorials repo under the generative AI directory.
This tutorial uses a Jupyter Notebook to demonstrate text summarization in Python with Sumy, a lightweight Python library rather than a large-scale artificial intelligence system. Jupyter Notebooks are versatile tools that allow you to combine code, text and visualizations in a single environment. You can run this notebook in your local IDE or explore cloud-based options like watsonx.ai Runtime, which provides a managed environment for running Jupyter Notebooks.
Whether you choose to run the notebook locally or in the cloud, the steps and code remain the same. Simply ensure that the required Python libraries are installed in your environment.
The following Python code installs the required packages and prepares the environment for running extractive summarization techniques.
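A minimal setup sketch is shown below. It assumes you install Sumy and NLTK with pip and then fetch the NLTK sentence tokenizer data that Sumy’s English tokenizer relies on; pin versions as your project requires:

```shell
# Install Sumy (extractive summarizers) and NLTK (tokenization backend)
pip install sumy nltk

# Fetch the sentence tokenizer data used by Sumy's English tokenizer
# (punkt_tab is used by newer NLTK releases; older ones use punkt)
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
```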
As previously mentioned, the Luhn algorithm is a statistical, frequency-based approach to extractive summarization. It works on the premise that the most important sentences in a document are those that contain the most significant words, where a word counts as significant if it occurs frequently, but not too frequently, in the text. This design makes Luhn effective for quickly extracting salient sentences without any semantic modeling.
Luhn algorithm workflow
Try it out yourself by running the following codeblock:
Here is an example of the expected output (you can get different summarization results depending on factors like library versions, input formatting and tokenization):
LexRank is an extractive summarization algorithm that applies graph-based ranking to text summarization. It treats sentences as nodes in a graph, connects sentences that are similar to one another and ranks each sentence by its centrality, that is, how similar it is to the rest of the document.
LexRank algorithm workflow
Try the LexRank summarization as follows:
Here is the example summarization result with LexRank:
You might notice that the summary generated from the URL is less complete than the previous algorithm’s. LexRank sometimes produces short or unusual summaries when summarizing content from a URL. This happens because LexRank relies on comparing sentence similarity, and webpages often contain short headings or fragmented text that don’t provide enough context for the algorithm to rank sentences meaningfully.
In contrast, Luhn looks at word frequency, so it can still pick out the most important sentences even in sparse or messy text. This example illustrates that while LexRank is powerful for well-structured documents, it’s not always the best choice for web scraping or heading-heavy content.
Latent semantic analysis (LSA) is a technique that extracts hidden conceptual meaning from a text. LSA identifies the core concepts in a document and selects sentences that best represent those concepts.
LSA algorithm workflow
Try out LSA yourself here:
Here are example summarization results using LSA:
LSA differs from Luhn and LexRank because it focuses on the underlying concepts or topics in a text rather than just word frequency or sentence similarity. Luhn is great when you want a broad summary based on important keywords, and LexRank works well for well-structured text where sentence relationships matter.
However, LSA is ideal when you want a coherent, concept-focused summary, especially for longer documents with multiple paragraphs. It can highlight the main ideas without getting distracted by repeated keywords or short headings. In short, choose LSA when understanding the key themes is more important than capturing every high-frequency term.
In this tutorial, you explored three classic extractive summarization algorithms—Luhn, LexRank and LSA—and learned how they approach the task in different ways. Luhn focuses on word frequency, LexRank uses sentence similarity and LSA identifies underlying concepts to select the most meaningful sentences.
Each method has its strengths: Luhn works well for general keyword-based summaries. LexRank is effective for structured text with clear sentence relationships. LSA shines when you want a coherent, concept-focused overview of longer documents.
Understanding these approaches gives you the foundation to choose the right extractive summarization technique for your projects and shows how the field has evolved from simple rule-based methods to sophisticated semantic analysis. These classic approaches also serve as strong baselines when evaluating modern abstractive systems or designing hybrid pipelines for real-world use cases.
1. Luhn, Hans Peter. “The automatic creation of literature abstracts”. IBM Journal of Research and Development 2, no. 2 (1958): 159–165.
2. Erkan, Günes, and Dragomir R. Radev. “LexRank: Graph-based lexical centrality as salience in text summarization”. Journal of Artificial Intelligence Research 22 (2004): 457–479.
3. Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer and Richard Harshman. “Indexing by latent semantic analysis”. Journal of the American Society for Information Science 41, no. 6 (1990): 391–407.