Use automatic speech recognition (ASR) to generate a podcast transcript using Granite 3.3 and watsonx.ai

16 April 2025

Author

Erika Russi

Data Scientist

IBM

In this tutorial, you will use the open source IBM® Granite® 3.3 speech model to generate an IBM "Mixture of Experts" podcast transcript from YouTube. Then, using the open source IBM Granite-3.3-8B-Instruct large language model (LLM), you will output a summary of the generated transcript. You will run this code on a watsonx.ai® notebook.

Automatic speech recognition

Automatic speech recognition (ASR), also known as speech recognition or speech-to-text, is the technology that converts spoken language into written text. Various machine learning algorithms and artificial intelligence techniques are used to convert speech into text. Speech recognition technology has evolved significantly from its beginnings in the mid-twentieth century to today.

In the 1960s, spectrograms were initially used to analyze speech. In the subsequent decades, a shift to statistical models occurred. Hidden Markov Models (HMMs) became dominant for modeling sequences of the small sound units known in linguistics as phonemes. A typical ASR system architecture was made up of three separate components: an acoustic model, a language model and a decoder.

By the 2010s, advancements in deep learning began reshaping the traditional speech recognition architecture. Encoder-decoder models, often built on recurrent neural networks (RNNs) or convolutional neural networks (CNNs), use an encoder to process the input audio and a decoder to generate output text based on the encoder's representation. These models can be trained on large datasets of audio-text pairs to learn how to map audio signals to transcriptions. Popular ASR models include DeepSpeech and Wav2Vec.

Today, virtual assistants such as Apple’s Siri, Amazon’s Alexa or Microsoft’s Cortana use ASR technology to process real-time human speech. They also integrate speech-to-text with large language models (LLMs) and natural language processing (NLP). LLMs can be used to add context, which helps when word choices are ambiguous or when human speech patterns vary.

Prerequisites

You need an IBM Cloud® account to create a watsonx.ai project.

Steps

Step 1. Set up your environment

While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook. 

1. Log in to watsonx.ai by using your IBM Cloud account.

2. Create a watsonx.ai project.

3. Create a Jupyter Notebook.

Make sure that you choose GPU 2xV100 Runtime 24.1 to define the configuration. This step opens a Jupyter Notebook environment where you can copy the code from this tutorial.

Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This tutorial is available on GitHub. To view more Granite tutorials, check out the IBM Granite Community.

Step 2. Install and import relevant libraries

We have a few dependencies for this tutorial. Make sure to import the following packages; if they're not installed, you can resolve this issue with a quick pip installation.

If you receive a "pip dependency resolver" error related to the caikit-nlp package, you can ignore it for now as the rest of the notebook should still be able to run normally.

# Install required packages
! pip install -q peft torchaudio soundfile pytubefix pytube moviepy tqdm https://github.com/huggingface/transformers/archive/main.zip

# Required imports
import json
import os

from pytubefix import YouTube
from tqdm import tqdm
from moviepy.audio.io.AudioFileClip import AudioFileClip

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download

Step 3. Download the podcast audio from YouTube

In this tutorial, we will use the latest episode of the IBM "Mixture of Experts" podcast, "AI on IBM z17, Meta's Llama 4 and Google Cloud Next 2025". The podcast is hosted on YouTube. We'll first create a YouTube object and use the streams.filter(only_audio=True) method to capture only the raw audio. From there, we'll extract the audio from the video and save it as an M4A audio file, out_file. base is the full file name, including the directory in which the file will be saved, without the m4a extension. We'll use the base variable later when we convert the audio format.

url = "https://www.youtube.com/watch?v=90fUR1PQgt4" #latest episode 37 minutes

# Create a YouTube object
yt = YouTube(url)

# Download only the audio stream from the video
video = yt.streams.filter(only_audio=True).first()

# Save the audio to a file
out_file = video.download()

# Get the base name and extension of the downloaded audio
base = os.path.splitext(out_file)[0]
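
Optionally, we can print the episode title and the path of the downloaded file to confirm that the download worked. This is a quick, optional check; the exact file name depends on the video's title.

# Optional: confirm the download (the file name depends on the video title)
print(f"Downloaded '{yt.title}' to {out_file}")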

Step 4: Prepare the podcast audio file for model inference

We'll need to make a couple of modifications to the podcast audio file before we can use it for model inference.

First, we need to convert the M4A file to a WAV file to use it with the Granite Speech model. We will use the moviepy library to do this conversion. We can use the base variable that we defined earlier to create the new file name with the .wav extension.

# Load the M4A file
audio_clip = AudioFileClip(out_file)

# Write the audio to a WAV file
audio_clip.write_audiofile(base+".wav")

# Close the audio clip
audio_clip.close()

audio_path = base+".wav"
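
As an optional sanity check, we can inspect the new WAV file's metadata with torchaudio.info() to confirm the conversion succeeded and see the file's sample rate and channel count.

# Optional: inspect the WAV file's metadata
info = torchaudio.info(audio_path)
print(f"Sample rate: {info.sample_rate} Hz, channels: {info.num_channels}, frames: {info.num_frames}")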

Next, we'll use torchaudio.load() to load the audio file as a tensor and extract the sample rate.

We'll also need to convert the returned waveform from stereo sound to mono sound. We can do this by taking the average of the stereo sound channels by using torch.mean().

# Resulting waveform and sample rate
waveform, sample_rate = torchaudio.load(audio_path, normalize=True)

# convert from stereo to mono
mono_waveform = torch.mean(waveform, dim=0, keepdim=True)

# confirm the waveform is mono
assert mono_waveform.shape[0] == 1 # mono

Next, we need to resample the mono waveform to the model's sample rate: 16 kHz. We can use torchaudio’s resampling API to accomplish this.

# Resample the mono waveform to the model's sample rate
resample_transform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
resampled_waveform = resample_transform(mono_waveform)

Finally, we can split the resampled waveform into chunks of equal size to feed into the model for easier inference.

We'll use torch.split() to split the full resampled waveform into 30-second chunks, with a chunk size in samples equal to 30 seconds * 16 kHz = 480,000 samples. This step gives us a sequence of waveforms, chunks, each containing up to 30 seconds of audio data (the final chunk may be shorter). We will feed each chunk into the model for inference.

# Define the desired chunk size
chunk_size_seconds = 30 
chunk_size_samples = chunk_size_seconds * 16000

# Split the waveform into chunks of equal size
chunks = torch.split(resampled_waveform, chunk_size_samples, dim=1)
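
As a quick, optional sanity check, we can confirm how many chunks were produced and roughly how much audio they cover. The exact numbers depend on the length of the episode you downloaded.

# Optional: verify the number of chunks and the total audio duration
total_seconds = resampled_waveform.shape[1] / 16000
print(f"{len(chunks)} chunks covering about {total_seconds / 60:.1f} minutes of audio")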

Step 5: Load and instantiate the Granite speech model

Now we can start instantiating our speech model.

We will first set our torch device to CPU. If the device is set to a GPU, you might encounter out-of-memory errors when running this notebook; CPU should work fine in your watsonx.ai notebook. We can then set up our processor and tokenizer for the model.

device = 'cpu'
model_name = "ibm-granite/granite-speech-3.3-8b"
speech_granite_processor = AutoProcessor.from_pretrained(
    model_name, trust_remote_code=True)
tokenizer = speech_granite_processor.tokenizer

If you're running your notebook on the watsonx.ai platform, you may also need to run the following code to manually edit the adapter_config.json file. This will avoid an error when loading the model.

adapter_config_file = hf_hub_download(model_name, 'adapter_config.json')

# load the existing adapter config file
with open(adapter_config_file, 'r') as file:
    data = json.load(file)

# remove the key-value pairs that cause an error when the adapter is loaded
keys_to_delete = ['layer_replication', 'loftq_config', 'megatron_config', 'megatron_core', 'use_dora', 'use_rslora']

for key in keys_to_delete:
    if key in data:
        del data[key]

# write the updated config file back to disk
with open(adapter_config_file, 'w') as file:
    json.dump(data, file, indent=4)

# reload the config file to confirm the changes were saved
with open(adapter_config_file, 'r') as file:
    data = json.load(file)

Great, now we can finally load the model! We'll use AutoModelForSpeechSeq2Seq from the transformers library and the from_pretrained method to load the model.

speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(model_name, trust_remote_code=True).to(device)

Step 6: Create an ASR system with the Granite speech model

Now that we have the model loaded and the audio data prepared, we can use it to generate text from speech.

We'll start by creating a prompt for the model to transcribe the audio data. We'll use tokenizer.apply_chat_template() to convert the prompt into a format that can be fed into the model.

chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2025.\nToday's Date: April 16, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you transcribe the speech into a written format?",
    }
]

text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)
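
If you want to see exactly what the model will receive, you can optionally print the templated prompt, which includes the <|audio|> placeholder where the audio features are injected.

# Optional: inspect the formatted prompt
print(text)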

Then, we can set up an empty list, generated_texts, to gather the generated text from each chunk of audio input.

We set up a for loop to iterate through each audio chunk and pass it to the model for generation. Here, we will also track the progress of the loop by using a tqdm progress bar.

The model inputs are created through the speech_granite_processor that we established earlier. The processor takes the text and chunk as input and returns a processed version of the audio data for the model to use.

The model outputs are produced by using the speech model's generate method. From there, we use the tokenizer to convert the model outputs into human-readable text and store each chunk's transcription into our generated_texts list.

generated_texts = []

for chunk in tqdm(chunks, desc="Generating transcript..."):

    model_inputs = speech_granite_processor(
        text,
        chunk,
        device=device, # Computation device; returned tensors are put on CPU
        return_tensors="pt",
    ).to(device)
    
    # Generate
    model_outputs = speech_granite.generate(
        **model_inputs,
        max_new_tokens=1000,
        num_beams=1,
        do_sample=False,
        min_length=1,
        top_p=1.0,
        repetition_penalty=1.0,
        length_penalty=1.0,
        temperature=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,)

    num_input_tokens = model_inputs["input_ids"].shape[-1]
    new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)

    output_text = tokenizer.batch_decode(
        new_tokens, add_special_tokens=False, skip_special_tokens=True)[0]

    generated_texts.append(output_text)

Since the chunk transcripts are currently individual strings in a list, we'll join the strings together with a space in between to make one cohesive full transcript.

full_transcript = " ".join(generated_texts)
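
Optionally, we can save the full transcript to a text file so that it can be reused later without rerunning inference. The file name below is just an example, built from the base variable we defined earlier.

# Optional: save the transcript next to the audio file (example file name)
with open(base + "_transcript.txt", "w") as f:
    f.write(full_transcript)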

Step 7: Use the Granite instruct model for summarization

Now that we have a full transcript, we'll use the same model to summarize it. We can access the underlying Granite-3.3-8B-Instruct model directly from granite-speech-3.3-8b by calling it with a text prompt that doesn't contain the <|audio|> token.

We'll set up a new prompt to instruct this model to generate a summary of the full transcript. We can use tokenizer.apply_chat_template() again to convert the prompt for model inference.

conv = [{"role": "user", 
         "content": f"Compose a single, unified summary of the following transcript. Your response should only include the unified summary. Do not provide any further explanation. Transcript:{full_transcript}"}]

text = tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)

We'll use speech_granite_processor again to create our model inputs, but we won't pass in any audio this time.

model_inputs = speech_granite_processor(
    text,
    device=device, # Computation device; returned tensors are put on CPU
    return_tensors="pt",
).to(device)

We will receive the output from speech_granite.generate() as a tensor of token IDs. We can convert this output to text by using tokenizer.decode() and print our final summary.

output = speech_granite.generate(
    **model_inputs,
    max_new_tokens=2000, # upper bound on the summary length
)

summary = tokenizer.decode(output[0, model_inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(summary)

Output:

In the 50th episode of Mixture of Experts, the panel discusses various AI-related topics. 

Kate Soule, Director of Technical Product Management at Granite, estimates that 90% of enterprise data is unstructured. 

Hilary Hunter, IBM Fellow and CTO of IBM Infrastructure, introduces IBM's new mainframe launch, IBM z, emphasizing its zero downtime and eight nines of reliability, crucial for global financial transactions. 

The conversation also touches on Meta's Llama 4 release, featuring three models: Scout (100 billion parameters), Maverick (200 billion parameters), and Behemoth (two trillion parameters). The panelists discuss the implications of these models, particularly the mixture of experts architecture, and its potential to become more community-driven. 

Shobhit Varshney Head of Data and AI for the Americas, shares insights on handling unstructured data in enterprises, advocating for bringing AI close to transaction data for low-latency, mission-critical applications. 

The episode concludes with a brief overview of Google Cloud Next, highlighting advancements in AI models, on-premises AI capabilities, and Google's focus on AI for media creation and agent-to-agent communication. 

The panel also references a Pew Research report on American perceptions of AI, noting discrepancies between experts' optimism and the general public's concerns about job impacts from AI.

Conclusion

In this tutorial, you downloaded an English audio file from YouTube. You transformed the audio file for consumption by the Granite speech model, generated a full transcript of the audio and used a Granite instruct model to generate a summary of the transcript.

 
