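Generate a podcast transcript using automatic speech recognition

In this tutorial, you will use the open source IBM® Granite® 3.3 speech model to generate a transcript of an IBM Mixture of Experts podcast from YouTube. Then, using the open source IBM Granite-3.3-8B-Instruct large language model, you will output a summary of the generated transcript. You will run this code in a notebook on watsonx.ai®.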

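Automatic speech recognition

Automatic speech recognition (ASR), also known as speech recognition or speech-to-text, is a technology that converts spoken language into written text. Various machine learning algorithms and artificial intelligence computation techniques are used to convert speech to text. Speech recognition technology has come a long way since its beginnings in the mid-twentieth century.

In the 1960s, spectrograms were widely used to analyze speech. In the decades that followed, the focus shifted to statistical models. Hidden Markov models (HMMs) emerged and became dominant; they were used to model sequences of small sound units, known in linguistics as phonemes. ASR systems were architecturally composed of three separate components: an acoustic model, a language model and a decoder.

By the 2010s, advances in deep learning began to disrupt the traditional speech recognition architecture. Encoder-decoder models might use a recurrent neural network (RNN) or a convolutional neural network (CNN) architecture, in which the encoder processes the input data and the decoder generates output based on the encoder's representation. Models can be trained on large datasets of audio-text pairs to learn how to map audio signals to transcriptions. Popular ASR models include DeepSpeech and Wav2Vec.

Today, virtual assistants such as Apple's Siri, Amazon's Alexa or Microsoft's Cortana use ASR technology to process human speech in real time. They can also integrate speech-to-text with large language models (LLMs) and natural language processing (NLP). LLMs can be used to add context, which helps when word choice is ambiguous or when human speech patterns vary.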

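Prerequisites

You need an IBM® Cloud account to create a watsonx.ai project.

Steps

Step 1: Set up your environment

While you can choose from several tools, this tutorial walks you through setting up an IBM account to use a Jupyter Notebook.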

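1. Log in to watsonx.ai using your IBM Cloud account.

2. Create a watsonx.ai project.

3. Create a Jupyter Notebook.

Make sure you select GPU 2xV100 Runtime 24.1 to define the configuration. This step opens a Jupyter Notebook environment where you can copy the code from this tutorial.

Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This tutorial is also available on GitHub. To view more Granite tutorials, visit the IBM Granite Community.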

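Step 2: Install and import relevant libraries

This tutorial has a few dependencies. Make sure to import the following packages; if any of them are not installed, a quick pip install resolves the problem. If you receive a "pip dependency resolver" error related to the caikit-nlp package, you can ignore it for now, because the rest of the notebook should still run normally.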

# Install required packages
! pip install -q peft torchaudio soundfile pytubefix pytube moviepy tqdm https://github.com/huggingface/transformers/archive/main.zip

# Required imports
import json
import os

from pytubefix import YouTube
from tqdm import tqdm
from moviepy.audio.io.AudioFileClip import AudioFileClip

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download

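Step 3: Download the podcast audio from YouTube

In this tutorial, we use the latest episode of the IBM Mixture of Experts podcast, "AI on IBM z17, Meta's Llama 4 and Google Cloud Next 2025". The podcast is hosted on YouTube. We first create a YouTube object and use the streams.filter(only_audio=True) method to capture only the raw audio. We then extract the audio from the video and save it as an M4A audio file, out_file. The base variable holds the full file name, including the directory where the file is saved, but without the m4a extension. We use base later when converting the audio format.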

url = "https://www.youtube.com/watch?v=90fUR1PQgt4" #latest episode 37 minutes

# Create a YouTube object
yt = YouTube(url)

# Download only the audio stream from the video
video = yt.streams.filter(only_audio=True).first()

# Save the audio to a file
out_file = video.download()

# Get the base name and extension of the downloaded audio
base = os.path.splitext(out_file)[0]
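
To make the role of base concrete, the following is a minimal sketch of what os.path.splitext returns. The path is hypothetical and only stands in for whatever video.download() returns on your system; the example_ names are illustrative and are not part of the notebook.

```python
import os

# Hypothetical path standing in for the value returned by video.download()
example_out_file = os.path.join("downloads", "mixture_of_experts.m4a")

# splitext removes only the final extension and keeps the directory part
example_base, example_ext = os.path.splitext(example_out_file)

# The same base is reused later to name the converted WAV file
example_wav_path = example_base + ".wav"

print(example_base)
print(example_ext)
print(example_wav_path)
```

Because only the extension is dropped, the WAV file created in the next step lands in the same directory as the downloaded audio.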

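Step 4: Prepare the podcast audio file for model inference

We need to make a few modifications to the podcast audio file before we can use it for model inference. First, we need to convert the M4A file to a WAV file so that it can be used with the Granite Speech model. We use the moviepy library for this conversion. We can use the base variable that we defined earlier to create the new file name with the .wav extension.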

# Load the M4A file
audio_clip = AudioFileClip(out_file)

# Write the audio to a WAV file
audio_clip.write_audiofile(base+".wav")

# Close the audio clip
audio_clip.close()

audio_path = base+".wav"
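
Before moving on, it can help to confirm what the conversion produced. The sketch below uses Python's standard wave module to report the channel count, sample rate and duration of a WAV file. The helper name describe_wav is hypothetical, and the sketch assumes the file is uncompressed PCM, which is the only WAV encoding the wave module reads.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        # One frame holds one sample per channel
        duration = wav_file.getnframes() / sample_rate
    return channels, sample_rate, duration
```

In the notebook, calling describe_wav(audio_path) would show whether the file is stereo and which sample rate it uses, both of which matter for the conversion steps that follow.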

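Next, we use torchaudio.load() to load the audio file as a tensor and extract the sample rate. We also need to convert the returned waveform from stereo to mono. We can do this by averaging the stereo channels with torch.mean().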

#Resulting waveform and sample rate
waveform, sample_rate = torchaudio.load(audio_path, normalize=True)

# convert from stereo to mono
mono_waveform = torch.mean(waveform, dim=0, keepdim=True)

# confirm the waveform is mono
assert mono_waveform.shape[0] == 1 # mono
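
The averaging step can be illustrated without tensors. In this sketch, plain Python lists stand in for the waveform so that the arithmetic is visible; torch.mean(waveform, dim=0, keepdim=True) computes the same per-sample average across the channel dimension. The function name and sample values are made up for illustration.

```python
def stereo_to_mono(channels):
    """Average equal-length channels sample by sample and keep a channel axis."""
    num_channels = len(channels)
    mono_samples = [sum(samples) / num_channels for samples in zip(*channels)]
    # Wrap in a list so the result still has shape (1, num_samples)
    return [mono_samples]


left = [0.5, -0.25, 1.0]
right = [0.0, 0.25, 0.5]
mono = stereo_to_mono([left, right])
print(mono)
```

If the input already has a single channel, the average leaves the samples unchanged, so the step is safe to run on mono recordings as well.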

Next, we need to resample the mono waveform to the model's sample rate: 16 kHz. We can use torchaudio's resampling API to do this.

# Resample the mono waveform to the model's sample rate
resample_transform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
resampled_waveform = resample_transform(mono_waveform)
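
Resampling changes the number of samples but not the duration of the audio. The sketch below estimates the sample count after resampling from the ratio of the two rates. The helper resampled_length is hypothetical, and the result is an approximation: the exact length that torchaudio returns may differ by a sample because of rounding.

```python
import math


def resampled_length(num_samples, orig_freq, new_freq=16000):
    """Estimate the sample count after resampling; the duration stays the same."""
    return math.ceil(num_samples * new_freq / orig_freq)


# Ten seconds of audio recorded at 44.1 kHz
ten_seconds_at_44k = 10 * 44100
print(resampled_length(ten_seconds_at_44k, 44100))
```

Comparing this estimate with resampled_waveform.shape[1] is a quick way to check that the resampling step behaved as expected.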

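Finally, we can split the resampled waveform into chunks of equal size so that it is easier to feed into the model for inference.

We use torch.split() to split the full resampled waveform into 30-second chunks, where the number of samples in each chunk equals 30 seconds multiplied by 16 kHz. The last chunk is shorter if the total length is not an exact multiple of the chunk size. This step gives us a list of waveforms, chunks, in which each chunk contains up to 30 seconds of audio data. We feed each chunk into the model for inference.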

# Define the desired chunk size
chunk_size_seconds = 30 
chunk_size_samples = chunk_size_seconds * 16000

# Split the waveform into chunks of equal size
chunks = torch.split(resampled_waveform, chunk_size_samples, dim=1)
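
The chunking logic can be sketched with a plain list in place of the tensor. The helper split_into_chunks and the dummy audio are illustrative only; the slicing mirrors how torch.split() with an integer size produces consecutive chunks and leaves a shorter final chunk when the length is not evenly divisible.

```python
def split_into_chunks(samples, chunk_size):
    """Split a sequence into consecutive chunks; the last one may be shorter."""
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


demo_rate = 16000
demo_chunk_size = 30 * demo_rate  # 480,000 samples per 30-second chunk

# 70 seconds of silent dummy audio: two full chunks plus a 10-second remainder
dummy_audio = [0.0] * (70 * demo_rate)
dummy_chunks = split_into_chunks(dummy_audio, demo_chunk_size)

# Print the duration of each chunk in seconds
print([len(chunk) / demo_rate for chunk in dummy_chunks])
```

One consequence of fixed-size chunks is that a boundary can fall in the middle of a word, so the text around chunk boundaries in the transcript may be less accurate.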

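Step 5: Load and instantiate the Granite speech model

Now we can instantiate our speech model. We first set the torch device to CPU. If the device is set to GPU, you might run into out-of-memory errors when running this notebook, but CPU should work fine in your watsonx.ai notebook. We can then set up the processor and tokenizer for the model.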

device = 'cpu'
model_name = "ibm-granite/granite-speech-3.3-8b"
speech_granite_processor = AutoProcessor.from_pretrained(
    model_name, trust_remote_code=True)
tokenizer = speech_granite_processor.tokenizer

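If you are running the notebook on the watsonx.ai platform, you might also need to run the following code to manually edit the adapter_config.json file. This avoids an error when loading the model.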

adapter_config_file = hf_hub_download(model_name, 'adapter_config.json')

# Load the existing config file
with open(adapter_config_file, 'r') as file:
    data = json.load(file)

#remove key, value pairs in config file throwing error
keys_to_delete = ['layer_replication', 'loftq_config', 'megatron_config', 'megatron_core', 'use_dora', 'use_rslora']

for key in keys_to_delete:
    if key in data:
        del data[key]

# write the updated config file back to disk
with open(adapter_config_file, 'w') as file:
    json.dump(data, file, indent=4)

with open(adapter_config_file, 'r') as file:
    data = json.load(file)
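
The same edit can be tried safely on a throwaway file first. The sketch below wraps the key removal in a hypothetical helper, drop_keys, and runs it on a temporary JSON file with made-up values, so the real adapter_config.json is not touched.

```python
import json
import os
import tempfile


def drop_keys(config_path, keys):
    """Remove the given top-level keys from a JSON file; return the keys removed."""
    with open(config_path, "r") as f:
        config = json.load(f)
    removed = []
    for key in keys:
        if key in config:
            del config[key]
            removed.append(key)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return removed


# Temporary file with made-up values, standing in for adapter_config.json
demo_path = os.path.join(tempfile.mkdtemp(), "adapter_config.json")
with open(demo_path, "w") as f:
    json.dump({"r": 64, "use_dora": False, "use_rslora": False}, f)

# Keys that are absent from the file, such as megatron_config here, are skipped
print(drop_keys(demo_path, ["use_dora", "use_rslora", "megatron_config"]))
with open(demo_path) as f:
    print(json.load(f))
```

Returning the list of removed keys makes it easy to see whether the edit changed anything, which is useful because the set of unsupported keys depends on the installed peft version.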

Great, now we can finally load the model! We use AutoModelForSpeechSeq2Seq from the transformers library and its from_pretrained method to load the model.

speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(model_name, trust_remote_code=True).to(device)

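Step 6: Create an ASR system with the Granite speech model

Now that we have loaded the model and prepared the audio data, we can use the model to generate text from speech. First, we create a prompt that asks the model to transcribe the audio data. We use tokenizer.apply_chat_template() to convert the prompt into a format that can be fed into the model.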

chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2025.\nToday's Date: April 16, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you transcribe the speech into a written format?",
    }
]

text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)

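Then, we set up an empty list, generated_texts, to collect the text generated for each chunk of audio input. We set up a for loop that iterates over the audio chunks and passes each one to the model for generation. Here, we also track the progress of the loop with a tqdm progress bar. The model inputs are created with the speech_granite_processor that we set up earlier. The processor takes text and chunk as input and returns a processed version of the audio data for the model to use.

The model outputs are produced with the speech model's generate method. We then use the tokenizer to convert the model outputs into human-readable text and store the transcript of each chunk in our generated_texts list.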

generated_texts = []

for chunk in tqdm(chunks, desc="Generating transcript..."):

    model_inputs = speech_granite_processor(
        text,
        chunk,
        device=device, # Computation device; returned tensors are put on CPU
        return_tensors="pt",
    ).to(device)
    
    # Generate
    model_outputs = speech_granite.generate(
        **model_inputs,
        max_new_tokens=1000,
        num_beams=1,
        do_sample=False,
        min_length=1,
        top_p=1.0,
        repetition_penalty=1.0,
        length_penalty=1.0,
        temperature=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,)

    num_input_tokens = model_inputs["input_ids"].shape[-1]
    new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)

    output_text = tokenizer.batch_decode(
        new_tokens, add_special_tokens=False, skip_special_tokens=True)[0]

    generated_texts.append(output_text)
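
The slice model_outputs[0, num_input_tokens:] in the loop keeps only the newly generated tokens, because the loop treats the output sequence as the prompt tokens followed by the generated ones. The sketch below shows the same idea with plain lists; the helper name and the token IDs are made up for illustration.

```python
def strip_prompt_tokens(output_ids, num_prompt_tokens):
    """Return only the tokens that follow the prompt in the output sequence."""
    return output_ids[num_prompt_tokens:]


# Made-up token IDs: the first four stand in for the prompt, the rest are new
demo_prompt_ids = [101, 7, 42, 9]
demo_output_ids = demo_prompt_ids + [350, 12, 88]

print(strip_prompt_tokens(demo_output_ids, len(demo_prompt_ids)))
```

Without this slice, decoding the full output would repeat the system and user prompt text in every chunk of the transcript.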

Because the chunk transcripts are currently separate strings in a list, we join them with spaces to form one coherent, complete transcript.

full_transcript = " ".join(generated_texts)

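Step 7: Summarize with the Granite instruct model

Now that we have the full transcript, we use the same model to summarize it. We can access the Granite-3.3-8B-Instruct model directly from Granite-speech-3.3-8b by calling it with a text prompt that does not contain the <|audio|> token. We set up a new prompt that instructs the model to generate a summary of the full transcript. We can again use tokenizer.apply_chat_template() to convert the prompt for model inference.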

conv = [{"role": "user", 
         "content": f"Compose a single, unified summary of the following transcript. Your response should only include the unified summary. Do not provide any further explanation. Transcript:{full_transcript}"}]

text = tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)

We use speech_granite_processor again to create the model inputs, but this time we do not pass in any audio.

model_inputs = speech_granite_processor(
    text,
    device=device, # Computation device; returned tensors are put on CPU
    return_tensors="pt",
).to(device)

speech_granite.generate() returns its output as a tensor. We can convert this output to text with tokenizer.decode(), and then print our final summary!

output = speech_granite.generate(
    **model_inputs,
    max_new_tokens= 2000, # concise summary
)

summary = tokenizer.decode(output[0, model_inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(summary)

Output:

In the 50th episode of Mixture of Experts, the panel discusses various AI-related topics. 

Kate Soule, Director of Technical Product Management at Granite, estimates that 90% of enterprise data is unstructured. 

Hilary Hunter, IBM Fellow and CTO of IBM Infrastructure, introduces IBM's new mainframe launch, IBM z, emphasizing its zero downtime and eight nines of reliability, crucial for global financial transactions. 

The conversation also touches on Meta's Llama 4 release, featuring three models: Scout (100 billion parameters), Maverick (200 billion parameters), and Behemoth (two trillion parameters). The panelists discuss the implications of these models, particularly the mixture of experts architecture, and its potential to become more community-driven. 

Shobhit Varshney Head of Data and AI for the Americas, shares insights on handling unstructured data in enterprises, advocating for bringing AI close to transaction data for low-latency, mission-critical applications. 

The episode concludes with a brief overview of Google Cloud Next, highlighting advancements in AI models, on-premises AI capabilities, and Google's focus on AI for media creation and agent-to-agent communication. 

The panel also references a Pew Research report on American perceptions of AI, noting discrepancies between experts' optimism and the general public's concerns about job impacts from AI.

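Summary

In this tutorial, you downloaded an English audio file from YouTube. You converted the audio file for use with the Granite speech model, generated a full transcript of the audio, and used the Granite instruct model to generate a summary of the transcript.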
