How to summarize text with Python NLP and extractive text summarization

In this tutorial, learn how Python text summarization works by exploring and comparing three classic extractive algorithms: Luhn’s algorithm,¹ LexRank² and Latent Semantic Analysis (LSA).³
Vanna Winland

AI Advocate & Technology Writer

Modern transformer model architectures based on neural networks dominate many NLP tasks. This tutorial focuses on classical approaches that remain valuable in data science workflows where interpretability, limited dependencies and predictable summary length matter. These methods are often used to automate the generation of concise summaries from a large corpus without requiring a labeled dataset.

By the end of this tutorial you’ll understand:

  • How frequency-based, graph-based and semantic summarization algorithms work
  • How the strengths and limitations of each approach manifest
  • How to implement these algorithms in Python with the Sumy library
  • When to choose extractive versus abstractive text summarization for your projects

Extractive versus abstractive summarization

Text summarization can be broadly categorized into two approaches:

  1. Extractive summarization selects and combines existing sentences directly from the source text to create a summary. Think of it like highlighting the most important sentences in a document. This tutorial focuses on extractive methods, which dominated the field for decades and remain valuable for their interpretability and reliability.
  2. Abstractive summarization generates new sentences to convey the original meaning, similar to how we might paraphrase or rewrite key points. Modern large language models (LLMs) like generative pre-trained transformer (GPT), Granite and Claude excel at this approach, although extractive methods still offer advantages in transparency and computational efficiency. Abstractive systems typically rely on transformers pre-trained on large-scale data, sometimes fine-tuned for specific summarization tasks or machine translation objectives.

The evolution of extractive summarization

Automatic text summarization began in 1958 with Hans Peter Luhn, an IBM researcher who published “The automatic creation of literature abstracts”. Luhn’s algorithm was groundbreaking in its simplicity: determine sentence importance by counting the frequency of meaningful words. Though basic by today’s standards, this frequency-based approach established the foundation for subsequent work in the field.

Luhn’s statistical method had clear limitations—it couldn’t capture semantic relationships, context or nuance in language. Over the following decades, researchers expanded on his work by incorporating:

  • Graph-based methods like LexRank, which identify important sentences by analyzing similarity patterns across the entire document.
  • Semantic approaches like LSA, which uncover hidden thematic structures with linear algebra to understand meaning beyond surface-level word matching.

Understanding these algorithms illuminates fundamental concepts in information retrieval (IR) and natural language processing (NLP), while showing the field’s evolution from rule-based systems to the sophisticated deep-learning models used today. These modern models are commonly accessed through platforms like Hugging Face, exposed through an API and powered by frameworks such as PyTorch.

The following section provides a step-by-step walkthrough for implementing classic extractive text summarization algorithms in Python.

Steps

Step 1: Clone the repository

To run this project, clone the ibmdotcom-tutorials GitHub repository by using its HTTPS URL. For detailed steps on how to clone a repository, refer to the GitHub documentation. You can find this specific tutorial inside the ibmdotcom-tutorials repo under the generative AI directory.

Step 2: Set up your environment

This tutorial uses a Jupyter Notebook to demonstrate text summarization in Python with Sumy, a lightweight Python library rather than a large-scale artificial intelligence system. Jupyter Notebooks are versatile tools that allow you to combine code, text and visualization in a single environment. You can run this notebook in your local IDE or explore cloud-based options like watsonx.ai Runtime, which provides a managed environment for running Jupyter Notebooks.

Whether you choose to run the notebook locally or in the cloud, the steps and code remain the same. Simply ensure that the required Python libraries are installed in your environment.

Step 3: Install and import

The following Python code installs the required packages and prepares the environment for running extractive summarization techniques.

%pip install sumy
%pip install lxml_html_clean
%pip install requests beautifulsoup4
%pip install numpy

import requests # Import requests library
from bs4 import BeautifulSoup # Add BeautifulSoup for HTML parsing
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer # Import LuhnSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer # Import LexRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer # Import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import nltk
nltk.download("punkt_tab")

Summarize texts with different algorithms

Luhn

As previously mentioned, the Luhn algorithm is a statistical, frequency-based approach to extractive summarization. Luhn’s algorithm works on the premise that the most important sentences in a document are those sentences that contain the most significant words. This approach makes Luhn effective for quickly extracting salient sentences without semantic modeling. The algorithm determines significant words by how frequently (but not too frequently) they occur.

Luhn algorithm workflow

  1. Preprocessing: Stop words are common words that appear frequently in language (like “a”, “this”, “is”, “the”) but carry little meaningful information. These words are filtered out of the text data. A technique called stemming is applied to reduce words to their root forms (“running”, “runs”, “ran” -> “run”).
  2. Word scoring: The frequency of each word is calculated. Words that appear with moderate to high frequency are considered significant, while common words (including stop words) and rare words are given less weight.
  3. Sentence scoring: Each sentence is scored based on clusters of significant words. A sentence’s score is based on the density of significant words within these clusters, calculated as the ratio of significant words to total words in the cluster.
  4. Summary generation: The top-scoring sentences are selected and presented in their original order to create a summary.
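The four steps above can be sketched in plain Python. The following is a minimal, illustrative implementation: the stop-word list, the `min_freq` threshold and the cluster definition are simplifications for demonstration, not the Sumy implementation of Luhn's method:

```python
from collections import Counter

STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}

def luhn_sentence_score(words, significant):
    """Score one sentence: take the span bounded by the first and last
    significant word, then return (significant count)^2 / span length."""
    idx = [i for i, w in enumerate(words) if w in significant]
    if not idx:
        return 0.0
    span = idx[-1] - idx[0] + 1
    return len(idx) ** 2 / span

def luhn_summary(sentences, top_n=1, min_freq=2):
    # Steps 1-2: normalize tokens, drop stop words, and treat any word
    # seen at least min_freq times as "significant".
    tokenize = lambda s: [w.lower().strip(".,!?") for w in s.split()]
    freqs = Counter(w for s in sentences for w in tokenize(s)
                    if w not in STOP_WORDS)
    significant = {w for w, c in freqs.items() if c >= min_freq}
    # Step 3: score each sentence by the density of significant words.
    scored = [(luhn_sentence_score(tokenize(s), significant), i)
              for i, s in enumerate(sentences)]
    # Step 4: keep the top-scoring sentences, restored to original order.
    top = sorted(sorted(scored, reverse=True)[:top_n], key=lambda t: t[1])
    return [sentences[i] for _, i in top]
```

Here, a sentence that packs its significant words closely together outscores one where the same words are spread out, which is the core of Luhn's density idea.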

Try it out yourself by running the following code block:

# Luhn Extractive Summarization Example

def luhn_summarize(input_data, sentence_count=2, input_type="text"):
    """
    Summarize text using the Luhn algorithm.

    Args:
        input_data (str): The input text, URL, or file path.
        sentence_count (int): Number of sentences for the summary.
        input_type (str): Type of input - "text", "url", or "file".

    Returns:
        list: Summary sentences.
    """
    if input_type == "url":
        response = requests.get(input_data)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")  # Parse HTML content
        text = soup.get_text()  # Extract plain text
    elif input_type == "file":
        with open(input_data, "r", encoding="utf-8") as file:
            text = file.read()
    else:
        text = input_data

    # Parse the input text
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize summarizer with stemmer
    summarizer = LuhnSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Generate summary
    summary = summarizer(parser.document, sentence_count)
    return summary

# Test with sample text
sample_text = """
Text summarization is an important area of natural language processing (NLP) that focuses on condensing large amounts of text into shorter, coherent summaries. Modern approaches can identify the main ideas in a document and present them with minimal human involvement. Extractive methods select representative sentences directly from the source text, while abstractive methods generate new phrasing based on the original meaning. These techniques are increasingly used in information retrieval, research analysis, and other applications where quick understanding of text is essential.
"""

# Summarize plain text
summary = luhn_summarize(sample_text, 2, input_type="text")
print("Summary from text:")
for sentence in summary:
    print(sentence)

# Summarize from a URL
url = "https://www.ibm.com/think/topics/natural-language-processing"
summary = luhn_summarize(url, 2, input_type="url")
print("\nSummary from URL:")
for sentence in summary:
    print(sentence)

Example Luhn algorithm summarization

Here is an example of the expected output (you can get different summarization results depending on factors like library versions, input formatting and tokenization):

Summary from text: Text summarization is an important area of natural language processing (NLP) that focuses on condensing large amounts of text into shorter, coherent summaries. Extractive methods select representative sentences directly from the source text, while abstractive methods generate new phrasing based on the original meaning.

Summary from URL: NLP enables computers and digital devices to recognize, understand and generate text and speech by combining computational linguistics, the rule-based modeling of human language together with statistical modeling, machine learning and deep learning. In document processing, NLP tools can automatically classify, extract key information and summarize content, reducing the time and errors associated with manual data handling.

LexRank

LexRank is an extractive summarization algorithm that applies graph-based ranking to text summarization, with a focus on sentence centrality. It ranks sentences based on their similarity to the other sentences in the document.

LexRank algorithm workflow

  1. Generate a similarity graph: Each sentence is represented as a node in a graph. The similarity is calculated between every pair of sentences (typically through cosine similarity on TF-IDF vectors). Sentences are connected with edges weighted by similarity scores.
  2. Compute sentence centrality: Importance scores are calculated for each sentence. The process uses iterative voting inspired by Google’s PageRank algorithm. Each sentence begins with an equal score. In each iteration, a sentence’s score is updated based on the scores of the sentences it is connected to. Sentences that are similar to many highly scored sentences will themselves receive higher scores. This process repeats until the scores stabilize, with a damping factor ensuring convergence. The result is a reinforcement effect where sentences discussing central themes naturally accumulate higher scores.
  3. Select the top sentences: The highest-scoring sentences are extracted to form a summary, presented in their original document order.
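The centrality computation in step 2 can be illustrated with a few lines of NumPy. The similarity matrix below is a made-up toy example (in practice it would come from cosine similarity on TF-IDF vectors), and the function is a conceptual sketch of PageRank-style power iteration, not the Sumy internals:

```python
import numpy as np

def lexrank_scores(similarity, damping=0.85, tol=1e-6, max_iter=100):
    """PageRank-style power iteration over a sentence-similarity matrix.

    Each row of `similarity` is normalized into a transition distribution;
    scores are updated until they stabilize.
    """
    n = similarity.shape[0]
    transition = similarity / similarity.sum(axis=1, keepdims=True)
    scores = np.full(n, 1.0 / n)  # every sentence starts with an equal score
    for _ in range(max_iter):
        new = (1 - damping) / n + damping * (transition.T @ scores)
        if np.abs(new - scores).sum() < tol:
            return new
        scores = new
    return scores

# Toy example: sentence 0 is similar to both others, so it should
# accumulate the highest centrality score.
sim = np.array([[1.0, 0.8, 0.7],
                [0.8, 1.0, 0.1],
                [0.7, 0.1, 1.0]])
scores = lexrank_scores(sim)
```

Because each update redistributes the total score mass, the scores stay normalized, and the sentence most similar to the rest of the document ends up ranked first.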

Try the LexRank summarization as follows:


# LexRank Extractive Summarization Example

def lexrank_summarize(input_data, sentence_count=2, input_type="text"):
    """
    Summarize text using the LexRank algorithm.

    Args:
        input_data (str): The input text or URL to summarize.
        sentence_count (int): Number of sentences for the summary.
        input_type (str): Type of input - "text" or "url".

    Returns:
        list: Summary sentences.
    """
    if input_type == "url":
        response = requests.get(input_data)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")  # Parse HTML content
        text = soup.get_text(separator=" ")  # Extract plain text
    else:
        text = input_data

    # Parse the input text
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize LexRank summarizer with stemmer
    summarizer = LexRankSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Generate summary
    summary = summarizer(parser.document, sentence_count)
    return summary

# Test with sample text
sample_text = """
Text summarization is an important area of natural language processing (NLP) that focuses on condensing large amounts of text into shorter, coherent summaries. Modern approaches can identify the main ideas in a document and present them with minimal human involvement. Extractive methods select representative sentences directly from the source text, while abstractive methods generate new phrasing based on the original meaning. These techniques are increasingly used in information retrieval, research analysis, and other applications where quick understanding of text is essential.
"""

# Summarize plain text
summary = lexrank_summarize(sample_text, 2, input_type="text")
print("Summary from text:")
for sentence in summary:
    print(sentence)

# Summarize from a URL
url = "https://www.ibm.com/think/topics/natural-language-processing"
summary = lexrank_summarize(url, 2, input_type="url")
print("\nSummary from URL:")
for sentence in summary:
    print(sentence)

Example LexRank algorithm summarization

Here is the example summarization result with LexRank:

Summary from text: Text summarization is an important area of natural language processing (NLP) that focuses on condensing large amounts of text into shorter, coherent summaries. Modern approaches can identify the main ideas in a document and present them with minimal human involvement.

Summary from URL: What is NLP? AI models

You might notice that the URL summary is not as complete as the previous algorithm. Sometimes, LexRank produces short or unusual summaries when summarizing content from a URL. This issue happens because LexRank relies on comparing sentence similarity and webpages often contain short headings or fragmented text that don’t provide enough context for the algorithm to rank sentences meaningfully.

In contrast, Luhn looks at word frequency, so it can still pick out the most important sentences even in sparse or messy text. This example illustrates that while LexRank is powerful for well-structured documents, it’s not always the best choice for web scraping or heading-heavy content.
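One practical mitigation is to filter out headings and fragments before passing scraped text to LexRank. The helper below is a simple, hypothetical sketch (the `min_words` threshold is an arbitrary heuristic, not part of LexRank or Sumy):

```python
import re

def keep_full_sentences(text, min_words=8):
    """Drop headings and fragments from scraped text: keep only chunks
    that end with sentence punctuation and contain at least min_words
    words. The min_words threshold is a heuristic, tune it per source."""
    # Split on sentence-ending punctuation followed by whitespace.
    chunks = re.split(r"(?<=[.!?])\s+", text)
    kept = [c.strip() for c in chunks
            if c.strip().endswith((".", "!", "?")) and len(c.split()) >= min_words]
    return " ".join(kept)
```

Running scraped page text through a filter like this before summarization gives LexRank full sentences to compare, which tends to produce more coherent summaries than raw page text full of navigation labels and headings.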

LSA

Latent semantic analysis (LSA) is a technique that extracts hidden conceptual meaning from a text. LSA identifies the core concepts in a document and selects sentences that best represent those concepts.

LSA algorithm workflow

  1. Create a term-sentence matrix: The algorithm begins by building a matrix where rows represent words, columns represent sentences, and each cell contains a word’s frequency (or TF-IDF weight) in that sentence. Before the matrix is built, the text is typically preprocessed (for example, stop words removed) to control matrix size and reduce noise.
  2. Apply singular value decomposition (SVD): Decompose this matrix into three matrices to capture the underlying semantic structure: U (words to topics), Σ (relative topic strength), Vᵀ (topics to sentences). SVD identifies the most important “topics” or “concepts” in the document by finding patterns in how words co-occur across sentences.
  3. Score sentences: Each sentence is scored by how strongly it expresses the most important concepts identified by SVD. Sentences with strong representation across top concepts receive higher scores.
  4. Generate summary: Top-scoring sentences are selected and presented in their original order.
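Steps 2 and 3 can be made concrete with a toy example in NumPy. The term-sentence matrix below is fabricated for illustration, and the scoring formula follows Steinberger and Ježek's length strategy, a common LSA scoring scheme that is related to, but not necessarily identical to, what Sumy's LsaSummarizer does internally:

```python
import numpy as np

# Toy term-sentence matrix: rows = terms, columns = sentences.
# (In practice this would come from counting or TF-IDF weighting
# words across a tokenized document.)
A = np.array([
    [2, 1, 0],   # "summarization" appears in sentences 0 and 1
    [1, 0, 0],   # "extractive" appears only in sentence 0
    [0, 1, 0],   # "abstractive" appears only in sentence 1
    [0, 0, 1],   # "graph" appears only in sentence 2
    [1, 1, 0],   # "topic" appears in sentences 0 and 1
], dtype=float)

# Step 2: SVD factors A into U (terms-to-topics), s (topic strengths)
# and Vt (topics-to-sentences).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 3: score each sentence by its weighted loading across all topics:
# sqrt(sum over topics of (singular value * sentence loading)^2).
scores = np.sqrt(((s[:, None] * Vt) ** 2).sum(axis=0))
best = int(scores.argmax())  # index of the top-ranked sentence
```

In this toy matrix, sentence 0 carries the most weight across the dominant concepts, so it ranks first; a real summarizer would then emit the top sentences in document order.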

Try out LSA yourself here:

# LSA Extractive Summarization Example

def lsa_summarize(input_data, sentence_count=2, input_type="text"):
    """
    Summarize text using the LSA algorithm.

    Args:
        input_data (str): The input text or URL to summarize.
        sentence_count (int): Number of sentences for the summary.
        input_type (str): Type of input - "text" or "url".

    Returns:
        list: Summary sentences.
    """
    if input_type == "url":
        response = requests.get(input_data)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")  # Parse HTML content
        text = soup.get_text(separator=" ")  # Extract plain text
    else:
        text = input_data

    # Parse the input text
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize LSA summarizer with stemmer
    summarizer = LsaSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Generate summary
    summary = summarizer(parser.document, sentence_count)
    return summary

# Test with sample text
sample_text = """
Text summarization is an important area of natural language processing (NLP) that focuses on condensing large amounts of text into shorter, coherent summaries. Modern approaches can identify the main ideas in a document and present them with minimal human involvement. Extractive methods select representative sentences directly from the source text, while abstractive methods generate new phrasing based on the original meaning. These techniques are increasingly used in information retrieval, research analysis, and other applications where quick understanding of text is essential.
"""

# Summarize plain text
summary = lsa_summarize(sample_text, 2, input_type="text")
print("Summary from text:")
for sentence in summary:
    print(sentence)

# Summarize from a URL
url = "https://www.ibm.com/think/topics/natural-language-processing"
summary = lsa_summarize(url, 2, input_type="url")
print("\nSummary from URL:")
for sentence in summary:
    print(sentence)

Example LSA algorithm summarization

Here are example summarization results using LSA:

Summary from text: Modern approaches can identify the main ideas in a document and present them with minimal human involvement. These techniques are increasingly used in information retrieval, research analysis, and other applications where quick understanding of text is essential.

Summary from URL: NLP is already part of everyday life for many, powering search engines, prompting chatbots for customer service with spoken commands, voice-operated GPS systems and question-answering digital assistants on smartphones such as Amazon’s Alexa, Apple’s Siri and Microsoft’s Cortana. But NLP solutions can become confused if spoken input is in an obscure dialect, mumbled, too full of slang, homonyms, incorrect grammar, idioms, fragments, mispronunciations, contractions or recorded with too much background noise.

LSA differs from Luhn and LexRank because it focuses on the underlying concepts or topics in a text rather than just word frequency or sentence similarity. Luhn is great when you want a broad summary based on important keywords, and LexRank works well for well-structured text where sentence relationships matter.

However, LSA is ideal when you want a coherent, concept-focused summary, especially for longer documents with multiple paragraphs. It can highlight the main ideas without getting distracted by repeated keywords or short headings. In short, choose LSA when understanding the key themes is more important than capturing every high-frequency term.

Conclusion

In this tutorial, you explored three classic extractive summarization algorithms—Luhn, LexRank and LSA—and learned how they approach the task in different ways. Luhn focuses on word frequency, LexRank uses sentence similarity and LSA identifies underlying concepts to select the most meaningful sentences.

Each method has its strengths: Luhn works well for general keyword-based summaries. LexRank is effective for structured text with clear sentence relationships. LSA shines when you want a coherent, concept-focused overview of longer documents.

Understanding these approaches gives you the foundation to choose the right extractive summarization technique for your projects and shows how the field has evolved from simple rule-based methods to sophisticated semantic analysis. These classic approaches also serve as strong baselines when evaluating modern abstractive systems or designing hybrid pipelines for real-world use cases.

Footnotes

1 Luhn, Hans Peter. “The automatic creation of literature abstracts”. IBM Journal of Research and Development 2, no. 2 (1958): 159–165.

2 Erkan, Günes, and Dragomir R. Radev. “LexRank: Graph-based lexical centrality as salience in text summarization”. Journal of Artificial Intelligence Research 22 (2004): 457–479.

3 Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. “Indexing by latent semantic analysis”. Journal of the American Society for Information Science 41, no. 6 (1990): 391–407.
