16 April 2025
In this tutorial, you will use the open source IBM® Granite® 3.3 speech model to generate an IBM "Mixture of Experts" podcast transcript from YouTube. Then, using the open source IBM Granite-3.3-8B-Instruct large language model (LLM), you will output a summary of the generated transcript. You will run this code on a watsonx.ai® notebook.
Automatic speech recognition (ASR), also known as speech recognition or speech-to-text, is the technology that converts spoken language into written text. Various machine learning algorithms and artificial intelligence computation techniques are used to convert speech into text. Speech recognition technology has evolved significantly from its beginnings in the mid-twentieth century to today.
In the 1960s, spectrograms were initially used to analyze speech. In the subsequent decades, a shift to statistical models occurred. Hidden Markov Models (HMMs) appeared and became dominant for modeling sequences of small sound units, known in linguistics as phonemes. A typical ASR system architecture was made up of three separate components: an acoustic model, a language model and a decoder.
By the 2010s, advancements in deep learning began reshaping the traditional speech recognition architecture. Encoder-decoder models might use a recurrent neural network (RNN) or a convolutional neural network (CNN) architecture, where an encoder processes input data and a decoder generates output based on the encoder's representation. Models can be trained on large datasets of audio-text pairs to learn how to map audio signals to transcriptions. Popular ASR models include DeepSpeech and Wav2Vec.
Today, virtual assistants such as Apple’s Siri, Amazon’s Alexa or Microsoft’s Cortana use ASR technology to process real-time human speech. They are also able to integrate speech-to-text with large language models (LLMs) and natural language processing (NLP). LLMs can be used to add context, which can help when word choices are more ambiguous or if there is variability in human speech patterns.
You need an IBM Cloud® account to create a watsonx.ai project.
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
1. Log in to watsonx.ai by using your IBM Cloud account.
2. Create a watsonx.ai project.
3. Create a Jupyter Notebook.
Make sure that you choose
Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This tutorial is available on GitHub. To view more Granite tutorials, check out the IBM Granite Community.
We have a few dependencies for this tutorial. Make sure to import the following packages; if they're not installed, you can resolve this issue with a quick pip installation.
If you receive a "pip dependency resolver" error related to the caikit-nlp package, you can ignore it for now as the rest of the notebook should still be able to run normally.
# Install required packages
! pip install -q peft torchaudio soundfile pytubefix pytube moviepy tqdm https://github.com/huggingface/transformers/archive/main.zip
# Required imports
import json
import os
from pytubefix import YouTube
from tqdm import tqdm
from moviepy.audio.io.AudioFileClip import AudioFileClip
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq, AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import hf_hub_download
In this tutorial, we will use the latest episode of the IBM "Mixture of Experts" podcast, "AI on IBM z17, Meta's Llama 4 and Google Cloud Next 2025". The podcast is hosted on YouTube. We'll first create a
url = "https://www.youtube.com/watch?v=90fUR1PQgt4" #latest episode 37 minutes
# Create a YouTube object
yt = YouTube(url)
# Download only the audio stream from the video
video = yt.streams.filter(only_audio=True).first()
# Save the audio to a file
out_file = video.download()
# Get the base name and extension of the downloaded audio
base = os.path.splitext(out_file)[0]
We'll need to make a couple of modifications to the podcast audio file before we can use it for model inference.
First, we need to convert the M4A file to a WAV file to use it with the Granite Speech model. We will use the moviepy library to do this conversion. We can use the base variable that we defined earlier to create the new file name with the .wav extension.
# Load the M4A file
audio_clip = AudioFileClip(out_file)
# Write the audio to a WAV file
audio_clip.write_audiofile(base+".wav")
# Close the audio clip
audio_clip.close()
audio_path = base+".wav"
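To make the file-name handling concrete, here is a minimal sketch of how os.path.splitext separates a path into its base and extension so that a .wav name can be built from it. The helper name wav_path_for and the file name are hypothetical, chosen only for illustration; the real name depends on what the download step returns.

```python
import os


def wav_path_for(downloaded_path: str) -> str:
    """Return the path of a WAV file that sits next to the downloaded audio."""
    # splitext splits off only the final extension, e.g. ".m4a"
    base, _extension = os.path.splitext(downloaded_path)
    return base + ".wav"


# Hypothetical file name, for illustration only
example_path = wav_path_for("podcast_episode.m4a")
print(example_path)
```

Because only the last extension is split off, dots elsewhere in the file name are left untouched.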
Next, we'll use torchaudio.load() to load the audio file as a tensor and extract the sample rate.
We'll also need to convert the returned waveform from stereo sound to mono sound. We can do this by averaging the stereo channels with torch.mean().
#Resulting waveform and sample rate
waveform, sample_rate = torchaudio.load(audio_path, normalize=True)
# convert from stereo to mono
mono_waveform = torch.mean(waveform, dim=0, keepdim=True)
# confirm the waveform is mono
assert mono_waveform.shape[0] == 1 # mono
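As a dependency-free illustration of what the channel averaging does, the sketch below applies the same idea to plain Python lists instead of tensors. The helper name to_mono and the sample values are assumptions made for this example; the tutorial itself performs the step with torch.mean() on a tensor shaped (channels, samples).

```python
def to_mono(channels: list[list[float]]) -> list[float]:
    """Average the channels sample by sample.

    channels[c][t] is the value of channel c at time step t.
    """
    num_channels = len(channels)
    # zip(*channels) groups the values of all channels at each time step
    return [sum(samples) / num_channels for samples in zip(*channels)]


# Made-up stereo signal with three samples per channel
left = [0.5, 0.25, -1.0]
right = [0.0, 0.75, -0.5]
mono = to_mono([left, right])
print(mono)
```

A signal that is already mono passes through unchanged, because averaging a single channel returns the same values.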
Next, we need to resample the mono waveform to the model's sample rate: 16 kHz. We can use torchaudio’s resampling API to accomplish this.
# Resample the mono waveform to the model's sample rate
resample_transform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
resampled_waveform = resample_transform(mono_waveform)
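Resampling changes the number of samples but not the duration of the audio. The sketch below shows only that arithmetic, not the resampling itself. The helper name expected_num_samples and the 44.1 kHz input rate are assumptions for illustration; the actual input rate is whatever torchaudio.load() reports, and the exact output length may differ by a sample depending on how the resampler rounds.

```python
def expected_num_samples(num_samples: int, orig_freq: int, new_freq: int) -> int:
    """Approximate sample count after resampling; the duration stays the same."""
    return round(num_samples * new_freq / orig_freq)


# Assumed example: one second of audio recorded at 44.1 kHz
one_second = expected_num_samples(44_100, 44_100, 16_000)
print(one_second)
```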
Finally, we can split the resampled waveform into fixed-size chunks to feed into the model for easier inference.
We'll use torch.split() to split the full resampled waveform into 30-second chunks, which at 16 kHz corresponds to a chunk size of 30 * 16,000 = 480,000 samples. This step will give us a sequence of waveforms, chunks, each with 30 seconds of audio data, except possibly the last one, which holds whatever audio remains. We will feed each chunk into the model for inference.
# Define the desired chunk size
chunk_size_seconds = 30
chunk_size_samples = chunk_size_seconds * 16000
# Split the waveform into chunks of equal size
chunks = torch.split(resampled_waveform, chunk_size_samples, dim=1)
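The chunk arithmetic can be checked without loading any audio. The following sketch computes the chunk lengths that a fixed-size split produces, including the shorter final chunk that remains when the total length is not a multiple of the chunk size. The helper name chunk_lengths and the 95-second recording are hypothetical; the sketch works on sample counts only, not on tensors.

```python
def chunk_lengths(total_samples: int, chunk_size: int) -> list[int]:
    """Lengths of the pieces obtained by cutting total_samples into chunk_size blocks."""
    full_chunks, remainder = divmod(total_samples, chunk_size)
    # Any leftover samples form one final, shorter chunk
    return [chunk_size] * full_chunks + ([remainder] if remainder else [])


SAMPLE_RATE = 16_000
chunk_size_samples = 30 * SAMPLE_RATE  # 480,000 samples per 30-second chunk

# Hypothetical 95-second recording at 16 kHz
lengths = chunk_lengths(95 * SAMPLE_RATE, chunk_size_samples)
print([n / SAMPLE_RATE for n in lengths])
```

Here the recording yields three full 30-second chunks followed by one 5-second chunk, which is why the last chunk in the tutorial may be shorter than the others.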
Now we can start instantiating our speech model.
We will first set our torch device to CPU. If device is set to GPU, you might encounter out of memory errors when running this notebook, but CPU should work just fine on your watsonx.ai notebook. We can then set up our processor and tokenizer for the model.
device = 'cpu'
model_name = "ibm-granite/granite-speech-3.3-8b"
speech_granite_processor = AutoProcessor.from_pretrained(
    model_name, trust_remote_code=True)
tokenizer = speech_granite_processor.tokenizer
If you're running your notebook on the watsonx.ai platform, you may also need to run the following code to manually edit the adapter_config.json file. This will avoid an error when loading the model.
adapter_config_file = hf_hub_download(model_name, 'adapter_config.json')
# load the existing config file
with open(adapter_config_file, 'r') as file:
    data = json.load(file)
# remove key, value pairs in config file throwing error
keys_to_delete = ['layer_replication', 'loftq_config', 'megatron_config', 'megatron_core', 'use_dora', 'use_rslora']
for key in keys_to_delete:
    if key in data:
        del data[key]
# write the updated config file back to disk
with open(adapter_config_file, 'w') as file:
    json.dump(data, file, indent=4)
# reload the file to confirm that the changes were written
with open(adapter_config_file, 'r') as file:
    data = json.load(file)
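The same key-removal pattern can be tried safely on a throwaway file before touching the downloaded config. This is a minimal sketch: the helper name remove_keys and the config values are made up for illustration, and it uses dict.pop with a default as an alternative to the explicit membership check.

```python
import json
import os
import tempfile


def remove_keys(config_path: str, keys: list[str]) -> dict:
    """Delete the given keys from a JSON file in place and return the result."""
    with open(config_path, "r") as file:
        data = json.load(file)
    for key in keys:
        data.pop(key, None)  # keys that are not present are ignored
    with open(config_path, "w") as file:
        json.dump(data, file, indent=4)
    return data


# Throwaway config with made-up values, for illustration only
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "adapter_config.json")
    with open(path, "w") as file:
        json.dump({"r": 64, "use_dora": False, "use_rslora": False}, file)
    cleaned = remove_keys(path, ["use_dora", "use_rslora", "loftq_config"])
    print(cleaned)
```

Keys that are listed but absent from the file, such as loftq_config here, are skipped without raising an error.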
Great, now we can finally load the model! We'll use AutoModelForSpeechSeq2Seq from the transformers library and the from_pretrained method to load the model.
speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(model_name, trust_remote_code=True).to(device)
Now that we have the model loaded and the audio data prepared, we can use it to generate text from speech.
We'll start by creating a prompt for the model to transcribe the audio data. We'll use tokenizer.apply_chat_template() to convert the prompt into a format that can be fed into the model.
chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2025.\nToday's Date: April 16, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you transcribe the speech into a written format?",
    }
]
text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)
Then, we can set up an empty list, generated_texts, to gather the generated text from each chunk of audio input.
We set up a for loop to iterate through each audio chunk and pass it to the model for generation. Here, we will also track the progress of the loop by using a tqdm progress bar.
The model inputs are created through the speech_granite_processor that we established earlier. The processor takes the text and chunk as input and returns a processed version of the audio data for the model to use.
The model outputs are produced by using the speech model's generate method. From there, we use the tokenizer to convert the model outputs into human-readable text and store each chunk's transcription into our generated_texts list.
generated_texts = []
for chunk in tqdm(chunks, desc="Generating transcript..."):
    model_inputs = speech_granite_processor(
        text,
        chunk,
        device=device, # Computation device; returned tensors are put on CPU
        return_tensors="pt",
    ).to(device)
    # Generate
    model_outputs = speech_granite.generate(
        **model_inputs,
        max_new_tokens=1000,
        num_beams=1,
        do_sample=False,
        min_length=1,
        top_p=1.0,
        repetition_penalty=1.0,
        length_penalty=1.0,
        temperature=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,)
    num_input_tokens = model_inputs["input_ids"].shape[-1]
    new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)
    output_text = tokenizer.batch_decode(
        new_tokens, add_special_tokens=False, skip_special_tokens=True)[0]
    generated_texts.append(output_text)
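The slicing by num_input_tokens in the loop is what separates the newly generated tokens from the prompt. The sketch below shows the idea with plain lists, assuming, as the slicing implies, that the generated sequence begins with the prompt tokens. The helper name strip_prompt and the token IDs are made up for illustration.

```python
def strip_prompt(output_ids: list[int], num_input_tokens: int) -> list[int]:
    """Keep only the tokens that were generated after the prompt."""
    return output_ids[num_input_tokens:]


# Made-up token IDs: three prompt tokens followed by two generated tokens
prompt_ids = [101, 7, 42]
output_ids = prompt_ids + [900, 901]
new_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_ids)
```

Without this step, decoding the full output would repeat the system and user prompt at the start of every chunk's transcription.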
Since the chunk transcripts are currently individual strings in a list, we'll join the strings together with a space in between to make one cohesive full transcript.
full_transcript = " ".join(generated_texts)
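If some chunks come back empty or with stray leading or trailing whitespace, a slightly more defensive join keeps the transcript free of doubled spaces. This is an optional variation on the join, not part of the original notebook; the helper name join_chunks and the sample strings are made up for illustration.

```python
def join_chunks(texts: list[str]) -> str:
    """Join chunk transcripts with single spaces, skipping empty chunks."""
    cleaned = (text.strip() for text in texts)
    return " ".join(text for text in cleaned if text)


# Made-up chunk transcripts, for illustration only
joined = join_chunks(["hello and welcome ", "", " to the show"])
print(joined)
```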
Now that we have a full transcript, we'll need a different model to generate a summary of the transcript.
Let's instantiate the Granite instruct model, again by using the transformers library.
model_path = "ibm-granite/granite-3.3-8b-instruct"
instruct_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=device,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_path
)
We'll set up a new prompt to instruct this model to generate a summary of the full transcript. We can use tokenizer.apply_chat_template() again to convert the prompt for model inference.
conv = [{"role": "user",
"content": f"Compose a single, unified summary of the following transcript. Your response should only include the unified summary. Do not provide any further explanation. Transcript:{full_transcript}"}]
input_ids = tokenizer.apply_chat_template(conv, return_tensors="pt", return_dict=True, add_generation_prompt=True).to(device)
We will receive output from instruct_model.generate() as a tensor. We can convert this output to text by using tokenizer.decode() and then print our final summary!
output = instruct_model.generate(
    **input_ids,
    max_new_tokens=2000, # upper bound on the summary length
)
summary = tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:], skip_special_tokens=True)
print(summary)
Output:
In the 50th episode of Mixture of Experts, the panel discusses various AI-related topics.
Kate Soul, Director of Technical Product Management at Granite, estimates that 90% of enterprise data is unstructured.
Hilary Hunter, IBM Fellow and CTO of IBM Infrastructure, introduces IBM's new mainframe launch, IBM z, emphasizing its zero downtime and eight nines of reliability, crucial for global financial transactions.
The conversation also touches on Meta's Llama 4 release, featuring three models: Scout (100 billion parameters), Maverick (200 billion parameters), and Behemoth (two trillion parameters). The panelists discuss the implications of these models, particularly the mixture of experts architecture, and its potential to become more community-driven.
Shobhit Varshney Head of Data and AI for the Americas, shares insights on handling unstructured data in enterprises, advocating for bringing AI close to transaction data for low-latency, mission-critical applications.
The episode concludes with a brief overview of Google Cloud Next, highlighting advancements in AI models, on-premises AI capabilities, and Google's focus on AI for media creation and agent-to-agent communication.
The panel also references a Pew Research report on American perceptions of AI, noting discrepancies between experts' optimism and the general public's concerns about job impacts from AI.
In this tutorial, you downloaded an English audio file from YouTube. You transformed the audio file for consumption by the Granite speech model, generated a full transcript of the audio and used a Granite instruct model to generate a summary of the transcript.
