
Code summarization with Granite

 

 

Authors

Joshua Noble

Data Scientist

Introduction

Code summarization is the process of generating a natural language description of a snippet of code. Common code summarization tasks include exploring a new code base, learning a new programming language and generating code comments and explanations of functions. Generating a summarization of a snippet of code is similar to generating a text summarization of a natural language document. It differs because the large language model (LLM) generating the summary needs to understand the programming language that it's reading while also identifying the underlying logic of what the code is trying to accomplish.

Code summaries are a valuable part of software development, helping with software maintenance by performing automatic source code summarization, creating natural language summaries for documentation or parsing large-scale code bases. Newer LLMs that use a transformer-based architecture can act as code summarization models and can perform code generation as well. These capabilities are possible because the models have been trained on large datasets built from sources such as GitHub repositories that include code and comments along with documentation for that code.

Before LLMs were popularized, approaches to code summarization required parsing code semantics and generating an abstract syntax tree (AST) from the code's identifiers and structure that could then be used to generate documentation.1,2 With the advent of deep learning and neural networks, these explicit parsing approaches were largely abandoned in favor of approaches that borrowed more from the methodology of neural machine translation.3,4

For transformer models, larger context windows lead to better results. Many of the newest state-of-the-art Granite™ Code models, such as Granite-8B-Code-Instruct-128K, have a 128K context window. A larger context window allows the model to hold more text in working memory, which helps it keep track of key moments and details in a drawn-out chat, a lengthy document or a codebase. This working memory enables an LLM-based chatbot to generate responses that make sense both in the immediate moment and over a longer context, helping such models outperform models with smaller context windows in both human evaluations and automated evaluation metrics.5


When ChatGPT was first introduced, its context window was 4,000 tokens. If your conversation went over the roughly 3,000-word chat-interface limit, the chatbot was likely to hallucinate and veer off-topic. Today, the standard is 32,000 tokens, with the industry shifting to 128,000 tokens, which is about the length of a 250-page book. IBM now has two Granite models with a 128,000-token window and more are on their way.


Step 1: Set up your environment

In this step, we'll guide you through setting up your IBM watsonx.ai environment and creating a Jupyter Notebook.

1. Log in to watsonx.ai™ using your IBM Cloud® account.

2. Click + to create a new project.

a. Select Create an empty project.

b. Enter a project name in the Name field.

c. Create a Cloud Object Storage instance for storing your project assets, if one hasn't been created already.

d. Select Create.

3. Create a Jupyter Notebook.

a. Select the Assets tab in your project environment.

b. Click New asset.

c. Select the Working with models option in the left panel.

d. Click Working with data and models using Python and R notebooks.

e. Enter a name for your notebook in the Name field. Choose Runtime 23.1 on Python (4 vCPU 16 GB RAM) to define the configuration.

f. Select Create.

4. Set up a watsonx.ai Runtime instance and API key.

a. Create a watsonx.ai Runtime service instance (select your appropriate region and choose the Lite plan, which is a free instance).

b. Generate an API Key.

c. Associate the watsonx.ai Runtime service instance to the project that you created in watsonx.ai.

Step 2: Load Granite Code Instruct

First, we'll install the open source Hugging Face and PyTorch libraries that we'll use to download and run the model:

!pip install transformers torch huggingface_hub

Now we can download Granite-8B-Code-Instruct-128K:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-8b-code-instruct-128k")
model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-8b-code-instruct-128k")
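
Note that an 8B-parameter model loaded in full precision needs a significant amount of memory. If you have a GPU available, an optional variant of the load step (a sketch assuming the accelerate package is installed, for example with pip install accelerate) is to load the weights in half precision and let transformers place them on the available device:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "ibm-granite/granite-8b-code-instruct-128k"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# load the weights in bfloat16 and let transformers spread them across the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # assumes a GPU with bfloat16 support
    device_map="auto",           # requires the accelerate package
)
model.eval()

If you load the model this way, move the tokenized inputs onto the same device (for example, inputs = inputs.to(model.device)) before calling model.generate().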

Now we can begin using Granite Code Instruct.

Step 3: Get a simple explanation

Let's give our model a fairly complex function from the GluonTS library, taken from its GitHub repository. This is a long block of code to paste into a prompt, so we'll store it in a variable:

ll_func_2 = """
    def __call__(
        self, data: torch.Tensor, weights: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:

        assert (
            data.shape == weights.shape
        ), "data and observed_indicator must have same shape"

        with torch.no_grad():

            observed_data = torch.where(weights == 1, data, torch.nan)
            med = torch.nanmedian(observed_data, dim=self.dim, keepdim=True).values
            q1 = torch.nanquantile(observed_data, 0.25, dim=self.dim, keepdim=True)
            q3 = torch.nanquantile(observed_data, 0.75, dim=self.dim, keepdim=True)
            iqr = q3 - q1

            # if observed data is all zeros, nanmedian returns nan
            loc = torch.where(torch.isnan(med), torch.zeros_like(med), med)
            scale = torch.where(torch.isnan(iqr), torch.ones_like(iqr), iqr)
            scale = torch.maximum(scale, torch.full_like(iqr, self.minimum_scale))
            scaled_data = (data - loc) / scale

            if not self.keepdim:
                loc = torch.squeeze(loc, dim=self.dim)
                scale = torch.squeeze(scale, dim=self.dim)

            # assert no nans in scaled data, loc or scale
            assert not torch.any(torch.isnan(scaled_data))
            assert not torch.any(torch.isnan(loc))
            assert not torch.any(torch.isnan(scale))
            assert not torch.any(scale == 0)

            return scaled_data, loc, scale
"""

Now we can ask Granite Code for a summary by applying the chat template, tokenizing our chat and then calling model.generate():

chat = [
    { "role": "user", "content": "Explain what this function does" + ll_func_2 },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# tokenize the prompt into input IDs
inputs = tokenizer(chat, return_tensors="pt").input_ids
start_index = inputs.shape[-1]

# generate output tokens
output = model.generate(inputs, max_new_tokens=1500)

# decode only the newly generated tokens back into text
output = tokenizer.batch_decode(output[:, start_index:], skip_special_tokens=True)

# print the first (and only) item in the batch
print(output[0])

Let’s check the generated summary:

Answer:
This function performs data normalization using the interquartile range (IQR) method. It takes two arguments: data and weights. The data argument is a tensor of shape (batch_size, sequence_length, hidden_size), where batch_size is the number of samples in a batch, sequence_length is the length of the input sequence, and hidden_size is the dimensionality of the hidden state. The weights argument is a tensor of shape (batch_size, sequence_length), where each element is either 0 or 1, indicating whether the corresponding element in the data tensor is observed or missing.

The function first checks that the data and weights tensors have the same shape. Then, it computes the median and IQR of the observed data points (those with a weight of 1) using the torch.nanmedian and torch.nanquantile functions. The torch.nanmedian function computes the median along the specified dimension (`self.dim`), while the torch.nanquantile function computes the 25th and 75th percentiles.

Next, the function computes the location (`loc`) and scale (`scale`) parameters of the normalized distribution. The location is set to the median of the observed data points, and the scale is set to the IQR of the observed data points. If the observed data points are all zeros, the function sets the location to zero. The function also ensures that the scale is at least a minimum value (`self.minimum_scale`) to avoid division by zero.

Finally, the function computes the normalized data (`scaled_data`) by subtracting the location and dividing by the scale. If the keepdim argument is set to False, the function removes the specified dimension from the location and scale tensors. The function also performs assertions to ensure that the normalized data, location, and scale tensors do not contain any NaN values or zeros in the scale.

The function returns a tuple containing the normalized data, location, and scale tensors.

The description is highly detailed and accurately describes both the function that we passed in and the underlying algorithm.

Step 4: Generate documentation

Generating documentation is a powerful application of code summarization: docstrings not only explain code to any future reader but can also be picked up by a documentation tool like pydoc.

We can now prompt Granite Code to generate docstrings for a function. Because of the larger context window, we can pass an entire function along with a request to generate a docstring.

Please write a properly formatted Python docstring for the following function:

def get_json_from_html(html: str, key: str, num_chars: int = 2, stop: str = '"') -> str:
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find(stop, pos_begin)
    return html[pos_begin:pos_end]
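
To send this to the model, we reuse the same pattern as before. In the sketch below, docstring_prompt and docstring_answer are variable names of our own choosing; we keep the decoded reply around so we can refer back to it in the next request:

docstring_prompt = """Please write a properly formatted Python docstring for the following function:

def get_json_from_html(html: str, key: str, num_chars: int = 2, stop: str = '"') -> str:
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find(stop, pos_begin)
    return html[pos_begin:pos_end]
"""

chat = [
    { "role": "user", "content": docstring_prompt },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(chat, return_tensors="pt").input_ids
start_index = inputs.shape[-1]

output = model.generate(inputs, max_new_tokens=500)

# keep the decoded reply so we can feed it back into the next turn
docstring_answer = tokenizer.batch_decode(output[:, start_index:], skip_special_tokens=True)[0]
print(docstring_answer)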

Granite Code returns well-formatted documentation in Python docstring format:

Extracts a JSON object from the given HTML string using the specified key and stop character.
Args:

    html (str): The input HTML string to extract the JSON object from.

    key (str): The key used to locate the start of the JSON object in the HTML string.

    num_chars (int, optional): The number of characters after the key to include in the extracted JSON object. Defaults to 2.

    stop (str, optional): The character that marks the end of the JSON object in the HTML string. Defaults to '"'.

Returns:

    str: The extracted JSON object as a string.

If we want, we can now build on the previous exchange. Because the 128K context window can hold the entire conversation, we pass the earlier prompt and the model's reply (the docstring_prompt and docstring_answer variables from the previous step) back in along with a follow-up request:

chat = [
    { "role": "user", "content": docstring_prompt },
    { "role": "assistant", "content": docstring_answer },
    { "role": "user", "content": "Rewrite the previous function to be a class method and include the docstrings." },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# tokenize the full conversation into input IDs
inputs = tokenizer(chat, return_tensors="pt").input_ids
start_index = inputs.shape[-1]

# generate output tokens
output = model.generate(inputs, max_new_tokens=1500)

# decode only the newly generated tokens back into text
output = tokenizer.batch_decode(output[:, start_index:], skip_special_tokens=True)

# print the first (and only) item in the batch
print(output[0])

This returns:

class HtmlJsonExtractor:

    def __init__(self, html):
        """
        Initializes the HtmlJsonExtractor object with the given HTML string.

        Args:
            html (str): The input HTML string to extract JSON objects from.
        """
        self.html = html

    @classmethod
    def extract(cls, html, key, num_chars=2, stop='"'):
        """
        Extracts a JSON object from the given HTML string using the specified key and stop character.

        Args:
            html (str): The input HTML string to extract the JSON object from.
            key (str): The key used to locate the start of the JSON object in the HTML string.
            num_chars (int, optional): The number of characters after the key to include in the extracted JSON object. Defaults to 2.
            stop (str, optional): The character that marks the end of the JSON object in the HTML string. Defaults to '"'.

        Returns:
            str: The extracted JSON object as a string.
        """
        pos_begin = html.find(key) + len(key) + num_chars
        pos_end = html.find(stop, pos_begin)
        return html[pos_begin:pos_end]

The model has now created a class with an appropriate name, made the previous code a method of that class, and inserted our docstrings.

Step 5: Get a code review

Let's now prompt the model to help us optimize a code snippet:

chat = """Please optimize the following code:

def fib(n):
    if (n < 2):
        return 1
    else:
        return fib(n-1) + fib(n-2)"""
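
The tokenize, generate and decode steps are the same as in the earlier examples, so here's a minimal convenience wrapper, a helper of our own rather than part of the transformers or Granite APIs, that bundles them up and sends the prompt:

# a small helper of our own that wraps the chat template, generation and decoding steps
def ask_granite(prompt: str, max_new_tokens: int = 1500) -> str:
    messages = [{ "role": "user", "content": prompt }]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").input_ids
    start_index = inputs.shape[-1]
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # decode only the newly generated tokens
    return tokenizer.batch_decode(output[:, start_index:], skip_special_tokens=True)[0]

print(ask_granite(chat))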

Granite Code returns:

The given Python function fib(n) calculates the nth Fibonacci number using recursion. However, this implementation has a time complexity of O(2^n), which is inefficient for large inputs. We can optimize it using dynamic programming or memoization to reduce the time complexity to O(n). Here's an optimized version of the function:

```python

def fib_optimized(n, memo={}):
    if n < 2:
        return 1
    elif n not in memo:
        memo[n] = fib_optimized(n-1, memo) + fib_optimized(n-2, memo)
    return memo[n]
```

This optimized version uses a dictionary memo to store previously calculated Fibonacci numbers. When calculating the nth Fibonacci number, it first checks if it has already been calculated and stored in memo. If so, it returns the stored value instead of recalculating it, which saves time and improves performance.

Here the model has not only explained what's wrong with the function in terms of efficiency, giving the implementation a Big-O complexity, but it has also provided a better implementation. Reviews and explanations like this combine code summarization with more general code generation to help developers improve their code. We can also sanity-check the suggestion ourselves, as shown below.
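
Here is a quick test harness of our own (not part of the model's output) confirming that the memoized version returns the same values as the original and runs much faster:

```python
import timeit

def fib(n):
    if n < 2:
        return 1
    return fib(n - 1) + fib(n - 2)

def fib_optimized(n, memo={}):
    if n < 2:
        return 1
    elif n not in memo:
        memo[n] = fib_optimized(n - 1, memo) + fib_optimized(n - 2, memo)
    return memo[n]

# both implementations produce the same sequence
assert all(fib(n) == fib_optimized(n) for n in range(20))

# the naive recursion grows exponentially; the memoized version stays fast
print(timeit.timeit("fib(25)", globals=globals(), number=10))
print(timeit.timeit("fib_optimized(25)", globals=globals(), number=10))
```

One caveat worth noting: the mutable default argument memo={} is shared across calls, which is exactly what makes repeated calls fast, but it can surprise readers unfamiliar with the idiom.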

Summary

In this tutorial you learned about code summarization and used the Granite Code model with an expanded context window of 128K tokens to generate explanations of Python code. We also used prompts to generate new documentation, wrap a piece of code in a class and use the expanded context window to carry that documentation into a follow-up request. Finally, we had Granite Code analyze and summarize a code snippet and explain how it could be improved.

References

1 Sonia Haiduc, Jairo Aponte, Andrian Marcus, "Supporting program comprehension with source code summarization," ICSE '10: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, https://doi.org/10.1145/1810295.1810335.

2 Paul W. McBurney, Collin McMillan, "Automatic Source Code Summarization of Context for Java Methods," https://ieeexplore.ieee.org/document/7181703.

3 Chen Lin, Zhichao Ouyang, Junqing Zhuang, Jianqiang Chen, Hui Li, Rongxin Wu, "Improving Code Summarization with Block-wise Abstract Syntax Tree Splitting," IEEE/ACM International Conference on Program Comprehension (ICPC 2021), https://arxiv.org/abs/2103.07845.

4 Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Xudong Liu, "Retrieval-based neural source code summarization," ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, https://doi.org/10.1145/3377811.3380383.

5 Xinyi Hou, Yanjie Zhao, Yue Huang, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Jin, John Grundy, Haoyu Wang, "Large Language Models for Software Engineering: A Systematic Literature Review," https://arxiv.org/abs/2308.10620.