Next, we'll use torchaudio.load() to load the audio file as a tensor and extract the sample rate.

We'll also need to convert the returned waveform from stereo to mono. We can do this by averaging the stereo channels with torch.mean().

# Resulting waveform and sample rate
waveform, sample_rate = torchaudio.load(audio_path, normalize=True)

# convert from stereo to mono
mono_waveform = torch.mean(waveform, dim=0, keepdim=True)

# confirm the waveform is mono
assert mono_waveform.shape[0] == 1  # mono
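
To make the averaging step concrete, here is a minimal pure-Python sketch of what torch.mean(waveform, dim=0, keepdim=True) computes for a two-channel signal. The helper name downmix_to_mono and the sample values are hypothetical and only for illustration; the tutorial itself relies on the tensor operation.

```python
def downmix_to_mono(channels):
    """Average equal-length channels sample by sample.

    Returns a list holding a single channel, mirroring keepdim=True.
    """
    num_channels = len(channels)
    mono = [sum(samples) / num_channels for samples in zip(*channels)]
    return [mono]


# Hypothetical stereo signal: two channels with three samples each
stereo = [[0.5, -0.5, 1.0], [0.25, 0.5, 0.0]]
mono = downmix_to_mono(stereo)
print(mono)
```

The result keeps a leading channel dimension of size 1, which is the same property the assert on mono_waveform.shape[0] checks.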

Next, we need to resample the mono waveform to the model's sample rate: 16 kHz. We can use torchaudio's resampling API to accomplish this.

# Resample the mono waveform to the model's sample rate
resample_transform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
resampled_waveform = resample_transform(mono_waveform)

Finally, we can split the resampled waveform into chunks of equal size to feed into the model for easier inference.

We'll use torch.split() to split the full resampled waveform into 30-second chunks, which means a chunk size of 30 seconds * 16 kHz = 480,000 samples. This step gives us a tuple of waveforms, chunks, each holding 30 seconds of audio data (the last chunk is shorter if the audio length is not an exact multiple of 30 seconds). We will feed each chunk into the model for inference.

# Define the desired chunk size
chunk_size_seconds = 30
chunk_size_samples = chunk_size_seconds * 16000

# Split the waveform into chunks of equal size
chunks = torch.split(resampled_waveform, chunk_size_samples, dim=1)
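
The chunk arithmetic can be checked without loading any audio. The sketch below is a pure-Python illustration of how many chunks to expect from a non-empty waveform and how long each one is, given that torch.split() returns full-size chunks followed by a smaller remainder. The helper chunk_lengths and the 95-second duration are hypothetical.

```python
SAMPLE_RATE = 16000        # model sample rate used in this tutorial
CHUNK_SIZE_SECONDS = 30


def chunk_lengths(total_samples, chunk_size_samples):
    """Return the length of each chunk: full chunks first, then any remainder."""
    full, remainder = divmod(total_samples, chunk_size_samples)
    lengths = [chunk_size_samples] * full
    if remainder:
        lengths.append(remainder)
    return lengths


chunk_size_samples = CHUNK_SIZE_SECONDS * SAMPLE_RATE
total_samples = 95 * SAMPLE_RATE  # hypothetical 95-second recording
lengths = chunk_lengths(total_samples, chunk_size_samples)
seconds = [n / SAMPLE_RATE for n in lengths]
print(lengths)
print(seconds)
```

For this made-up recording we get three full 30-second chunks and one 5-second chunk, so the final transcription pass works on less audio than the others.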

Step 5: Load and instantiate the Granite speech model

Now we can start instantiating our speech model.

We will first set our torch device to CPU. If device is set to GPU, you might encounter out of memory errors when running this notebook, but CPU should work just fine on your watsonx.ai notebook. We can then set up our processor and tokenizer for the model.

device = 'cpu'
model_name = "ibm-granite/granite-speech-3.3-8b"

speech_granite_processor = AutoProcessor.from_pretrained(
    model_name, trust_remote_code=True)
tokenizer = speech_granite_processor.tokenizer

If you're running your notebook on the watsonx.ai platform, you may also need to run the following code to manually edit the adapter_config.json file. This will avoid an error when loading the model.

adapter_config_file = hf_hub_download(model_name, 'adapter_config.json')

# load the existing config file
with open(adapter_config_file, 'r') as file:
    data = json.load(file)

# remove key, value pairs in config file throwing error
keys_to_delete = ['layer_replication', 'loftq_config', 'megatron_config', 'megatron_core', 'use_dora', 'use_rslora']
for key in keys_to_delete:
    if key in data:
        del data[key]

# write the updated config file back to disk
with open(adapter_config_file, 'w') as file:
    json.dump(data, file, indent=4)

# reload the config file to confirm the update
with open(adapter_config_file, 'r') as file:
    data = json.load(file)
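
If you want to try this edit before touching the cached file, the same logic can be wrapped in a small function and run against a throwaway copy. This is only a sketch: the helper name strip_config_keys and the demo file contents are made up for illustration and are not part of the model repository.

```python
import json
import os
import tempfile


def strip_config_keys(config_path, keys_to_delete):
    """Remove the given keys from a JSON config file in place.

    Returns the list of keys that were actually present and removed.
    """
    with open(config_path, "r") as file:
        data = json.load(file)
    removed = [key for key in keys_to_delete if key in data]
    for key in removed:
        del data[key]
    with open(config_path, "w") as file:
        json.dump(data, file, indent=4)
    return removed


# Demonstration on a temporary file with made-up contents
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "adapter_config.json")
    with open(demo_path, "w") as file:
        json.dump({"r": 64, "use_dora": False, "use_rslora": False}, file)
    removed = strip_config_keys(demo_path, ["use_dora", "use_rslora", "loftq_config"])
    with open(demo_path, "r") as file:
        remaining = json.load(file)

print(removed)
print(remaining)
```

Returning the removed keys makes it easy to see whether the file needed patching at all; keys that are absent are simply skipped.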

Great, now we can finally load the model! We'll use AutoModelForSpeechSeq2Seq from the transformers library and its from_pretrained method to load the model.

speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(model_name, trust_remote_code=True).to(device)

Step 6: Create an ASR system with the Granite speech model

Now that we have the model loaded and the audio data prepared, we can use it to generate text from speech.

We'll start by creating a prompt for the model to transcribe the audio data. We'll use tokenizer.apply_chat_template() to convert the prompt into a format that can be fed into the model.

chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2025.\nToday's Date: April 16, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you transcribe the speech into a written format?",
    }
]

text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)
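
The system prompt above hard-codes today's date. If you rerun the notebook later and would like the prompt to reflect the current date, one option is to build the messages in a small function. This is a sketch under that assumption: build_transcription_chat and example_chat are hypothetical names, and whether the date affects transcription quality is not something this tutorial establishes.

```python
from datetime import date


def build_transcription_chat(today):
    """Build the chat messages, filling the system prompt's date from `today`."""
    today_text = f"{today:%B} {today.day}, {today.year}"
    system_prompt = (
        "Knowledge Cutoff Date: April 2025.\n"
        f"Today's Date: {today_text}.\n"
        "You are Granite, developed by IBM. You are a helpful AI assistant"
    )
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": "<|audio|>can you transcribe the speech into a written format?",
        },
    ]


# A fixed date is used here so the output is reproducible;
# pass date.today() to use the current date instead.
example_chat = build_transcription_chat(date(2025, 4, 16))
print(example_chat[0]["content"])
```

The returned list has the same structure as chat, so it can be passed to tokenizer.apply_chat_template() in the same way.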

Then, we can set up an empty list, generated_texts, to gather the generated text from each chunk of audio input.

We set up a for loop to iterate through each audio chunk and pass it to the model for generation. Here, we will also track the progress of the loop by using a tqdm progress bar.

The model inputs are created through the speech_granite_processor that we established earlier. The processor takes the text and chunk as input and returns a processed version of the audio data for the model to use.

The model outputs are produced by using the speech model's generate method. From there, we use the tokenizer to convert the model outputs into human-readable text and store each chunk's transcription into our generated_texts list.

generated_texts = []

for chunk in tqdm(chunks, desc="Generating transcript..."):
    model_inputs = speech_granite_processor(
        text,
        chunk,
        device=device,  # Computation device; returned tensors are put on CPU
        return_tensors="pt",
    ).to(device)

    # Generate
    model_outputs = speech_granite.generate(
        **model_inputs,
        max_new_tokens=1000,
        num_beams=1,
        do_sample=False,
        min_length=1,
        top_p=1.0,
        repetition_penalty=1.0,
        length_penalty=1.0,
        temperature=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

    num_input_tokens = model_inputs["input_ids"].shape[-1]
    new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)
    output_text = tokenizer.batch_decode(
        new_tokens, add_special_tokens=False, skip_special_tokens=True
    )[0]
    generated_texts.append(output_text)
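
The slicing with num_input_tokens is there because the generated sequence begins with the prompt tokens, and we only want to decode what the model added. Here is a minimal pure-Python sketch of that idea, with plain lists standing in for tensors. The token IDs and the helper name strip_prompt_tokens are made up for illustration.

```python
def strip_prompt_tokens(output_ids, num_input_tokens):
    """Drop the leading prompt tokens and keep only the newly generated ones."""
    return output_ids[num_input_tokens:]


# Hypothetical token IDs: four prompt tokens followed by three generated tokens
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 2]

new_token_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_token_ids)
```

Without this step, the decoded text for every chunk would start with a copy of the prompt.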

Since the chunk transcripts are currently individual strings in a list, we'll join the strings together with a space in between to make one cohesive full transcript.

full_transcript = " ".join(generated_texts)
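
Chunk transcripts can come back with stray leading or trailing whitespace, or empty for a silent chunk, which would leave double spaces in the joined text. A slightly more defensive join might look like the sketch below. The helper name join_transcripts and the sample strings are hypothetical.

```python
def join_transcripts(texts):
    """Join chunk transcripts with single spaces, skipping empty pieces."""
    cleaned = (text.strip() for text in texts)
    return " ".join(text for text in cleaned if text)


# Made-up chunk transcripts, including an empty one and extra whitespace
sample_texts = ["  welcome to the show ", "", "today we talk about audio\n"]
sample_transcript = join_transcripts(sample_texts)
print(sample_transcript)
```

For input that is already clean, this gives the same result as the plain " ".join() call.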

Step 7: Instantiate the Granite instruct model and use it for summarization

Now that we have a full transcript, we'll need a different model to generate a summary of the transcript.

Let's instantiate the Granite instruct model, again by using the transformers library.

model_path = "ibm-granite/granite-3.3-8b-instruct"

instruct_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=device,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_path
)

We'll set up a new prompt to instruct this model to generate a summary of the full transcript. We can use tokenizer.apply_chat_template() again to convert the prompt for model inference.

conv = [{"role": "user", "content": f"Compose a single, unified summary of the following transcript. Your response should only include the unified summary. Do not provide any further explanation. Transcript:{full_transcript}"}]

input_ids = tokenizer.apply_chat_template(conv, return_tensors="pt", return_dict=True, add_generation_prompt=True).to(device)

We will receive output from instruct_model.generate() as a tensor. We can convert this output to text by using tokenizer.decode() and then print our final summary!

output = instruct_model.generate(
    **input_ids,
    max_new_tokens=2000,  # concise summary
)

summary = tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:], skip_special_tokens=True)
print(summary)

Output: