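Generate a podcast transcript using automatic speech recognition

In this tutorial, we use the open-source IBM® Granite 3.3 speech model to create a transcript of an episode of IBM's "Mixture of Experts" podcast from YouTube. Then, we use the open-source IBM Granite-3.3-8B-Instruct large language model (LLM) to output a summary of the generated transcript. We run this code in a watsonx.ai notebook.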

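Automatic speech recognition

Automatic speech recognition (ASR), also called speech recognition or speech to text, is a technology that converts spoken language into written text. It relies on a range of machine learning algorithms and artificial intelligence techniques to turn speech into text. Speech recognition technology has evolved considerably from its beginnings in the mid-twentieth century to today.

In the 1960s, spectrograms were first used to analyze speech. Over the following decades, the field shifted to statistical models. Hidden Markov models (HMMs) emerged and became dominant for modeling sequences of small units of sound, known in linguistics as phonemes. ASR system architecture consisted of three separate components: an acoustic model, a language model and a decoder.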

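By the 2010s, advances in deep learning began to affect traditional speech recognition system architecture. Encoder-decoder models, which may use recurrent neural network (RNN) or convolutional neural network (CNN) architectures, have an encoder that processes the input data and a decoder that generates output based on the encoder's representation. These models are trained on large datasets of audio-text pairs to learn how to map audio signals to transcriptions. Popular ASR models include DeepSpeech and Wav2Vec.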

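Today, virtual assistants such as Apple's Siri, Amazon's Alexa and Microsoft's Cortana use ASR technology to process human speech in real time. Speech to text can also be integrated with large language models (LLMs) and natural language processing (NLP). Using an LLM to add context helps when word choices are ambiguous or when human speech patterns vary.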

Prerequisites

You need an IBM® Cloud account to create a watsonx.ai project.

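Steps

Step 1. Set up your environment

While you can choose from several tools, this tutorial walks you through setting up an IBM account to use a Jupyter Notebook.

1. Log in to watsonx.ai using your IBM Cloud account.

2. Create a watsonx.ai project.

3. Create a Jupyter Notebook.

Make sure that you select GPU 2xV100 Runtime 24.1 to define the configuration. This step opens a Jupyter Notebook environment where you can copy the code from this tutorial.

Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This tutorial is also available on GitHub. To view more Granite tutorials, check out the IBM® Granite Community.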

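Step 2. Install and import relevant libraries

This tutorial has a few dependencies. Make sure to import the following packages. If they are not installed, a quick pip installation resolves the problem.

If you receive a "pip dependency resolver" error related to the "caikit-nlp" package, you can ignore it for now, because the rest of the notebook should still run normally.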

# Install required packages
! pip install -q peft torchaudio soundfile pytubefix pytube moviepy tqdm https://github.com/huggingface/transformers/archive/main.zip

# Required imports
import json
import os

from pytubefix import YouTube
from tqdm import tqdm
from moviepy.audio.io.AudioFileClip import AudioFileClip

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download

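Step 3. Download the podcast audio from YouTube

In this tutorial, we use the latest episode of the IBM "Mixture of Experts" podcast, "AI on IBM z17, Meta's Llama 4 and Google Cloud Next 2025". The podcast is hosted on YouTube. First, we create a YouTube object and use the streams.filter(only_audio=True) method to capture only the raw audio. From there, we extract the audio from the video and save it as an M4A audio file. out_file is the full file name, including the directory where the file is saved. base is the same path without the m4a extension. We use the base variable later when converting the audio format.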

url = "https://www.youtube.com/watch?v=90fUR1PQgt4" #latest episode 37 minutes

# Create a YouTube object
yt = YouTube(url)

# Download only the audio stream from the video
video = yt.streams.filter(only_audio=True).first()

# Save the audio to a file
out_file = video.download()

# Get the downloaded file's path without its extension
base = os.path.splitext(out_file)[0]
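
As a minimal sketch of how out_file and base relate, the snippet below applies os.path.splitext to a made-up path. The directory and file name are assumptions for illustration only; the real value of out_file depends on the video title and the notebook's working directory.

```python
import os

# Hypothetical download location, used only for illustration.
out_file = os.path.join("downloads", "episode.m4a")

# splitext returns (path without the extension, extension including the dot)
base, extension = os.path.splitext(out_file)

# Reusing base with a new extension gives the file name for the converted audio
wav_path = base + ".wav"
print(base, extension, wav_path)
```

Keeping base separate from the extension is what lets the later conversion step write the WAV file next to the original download.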

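Step 4. Prepare the podcast audio file for model inference

We need to make a few modifications to the podcast audio file before we can use it for model inference.

First, we need to convert the M4A file to a WAV file for use with the Granite speech model. We use the moviepy library for this conversion. Using the base variable that we defined earlier, we can create the new file name with the .wav extension.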

# Load the M4A file
audio_clip = AudioFileClip(out_file)

# Write the audio to a WAV file
audio_clip.write_audiofile(base+".wav")

# Close the audio clip
audio_clip.close()

audio_path = base+".wav"

Next, we use "torchaudio.load()" to load the audio file as a tensor and extract its sample rate. We also need to convert the returned waveform from stereo to mono. We can do this by averaging the stereo channels with "torch.mean()".

#Resulting waveform and sample rate
waveform, sample_rate = torchaudio.load(audio_path, normalize=True)

# convert from stereo to mono
mono_waveform = torch.mean(waveform, dim=0, keepdim=True)

# confirm the waveform is mono
assert mono_waveform.shape[0] == 1 # mono
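
To make the averaging step concrete, here is a small pure-Python stand-in that performs the same arithmetic on nested lists instead of tensors. The sample values and the to_mono helper are hypothetical and exist only for this illustration; the tutorial itself relies on torch.mean with dim=0 and keepdim=True.

```python
# Two short channels standing in for a stereo waveform of shape (2, 4).
stereo = [
    [0.5, -0.5, 1.0, 0.0],  # left channel
    [0.0, -0.5, 0.5, 1.0],  # right channel
]

def to_mono(channels):
    """Average all channels sample by sample and keep a leading channel axis."""
    num_channels = len(channels)
    averaged = [sum(samples) / num_channels for samples in zip(*channels)]
    return [averaged]  # one channel of num_samples values, like keepdim=True

mono = to_mono(stereo)
print(mono)
```

The result keeps a leading axis of length 1, which is why the assertion in the tutorial checks that the first dimension of the mono waveform equals 1.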

Next, we need to resample the mono waveform to the model's sample rate of 16 kHz. We use torchaudio's resampling API to do this.

# Resample the mono waveform to the model's sample rate
resample_transform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
resampled_waveform = resample_transform(mono_waveform)

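Finally, we can split the resampled waveform into chunks of equal size to feed into the model, which makes inference easier.

We use "torch.split()" to split the full resampled waveform into 30-second chunks, with a chunk size in samples equal to 30 seconds * 16 kHz. This step produces a list of waveforms, "chunks", each holding 30 seconds of audio data (the final chunk is shorter when the audio length is not an exact multiple of 30 seconds). We feed each chunk into the model for inference.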

# Define the desired chunk size
chunk_size_seconds = 30 
chunk_size_samples = chunk_size_seconds * 16000

# Split the waveform into chunks of equal size
chunks = torch.split(resampled_waveform, chunk_size_samples, dim=1)
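
The chunk arithmetic can be checked without loading any audio. The sketch below mirrors, for non-empty audio, how a split into fixed-size pieces leaves a shorter final piece. The chunk_lengths helper and the episode duration of 37 minutes 12 seconds are assumptions chosen for illustration, not values taken from the actual download.

```python
def chunk_lengths(num_samples, chunk_size):
    """Lengths of the pieces when num_samples is cut into chunk_size pieces,
    with a shorter final piece if the total is not an exact multiple."""
    full_chunks, remainder = divmod(num_samples, chunk_size)
    lengths = [chunk_size] * full_chunks
    if remainder:
        lengths.append(remainder)
    return lengths

sample_rate = 16000                    # model sample rate used in the tutorial
chunk_size_samples = 30 * sample_rate  # 480,000 samples per 30-second chunk

# Hypothetical duration of 37 minutes 12 seconds, for illustration only.
num_samples = (37 * 60 + 12) * sample_rate

lengths = chunk_lengths(num_samples, chunk_size_samples)
print(len(lengths), lengths[-1] / sample_rate)
```

Under these assumed numbers the loop in the transcription step would run once per entry in lengths, and the last pass would see less than 30 seconds of audio.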

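Step 5. Load and instantiate the Granite speech model

Now we can start instantiating the speech model.

First, we set the torch device to CPU. If the device is set to GPU, you might encounter out-of-memory errors when running this notebook, whereas CPU should work fine in a watsonx.ai notebook. Then, we set up the processor and tokenizer for the model.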

device = 'cpu'
model_name = "ibm-granite/granite-speech-3.3-8b"
speech_granite_processor = AutoProcessor.from_pretrained(
    model_name, trust_remote_code=True)
tokenizer = speech_granite_processor.tokenizer

If you are running the notebook on the watsonx.ai platform, you might also need to run the following code to manually edit the "adapter_config.json" file. Doing so avoids an error when loading the model.

adapter_config_file = hf_hub_download(model_name, 'adapter_config.json')

# Load the existing config file
with open(adapter_config_file, 'r') as file:
    data = json.load(file)

#remove key, value pairs in config file throwing error
keys_to_delete = ['layer_replication', 'loftq_config', 'megatron_config', 'megatron_core', 'use_dora', 'use_rslora']

for key in keys_to_delete:
    if key in data:
        del data[key]

# write the updated config file back to disk
with open(adapter_config_file, 'w') as file:
    json.dump(data, file, indent=4)

with open(adapter_config_file, 'r') as file:
    data = json.load(file)
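
To try the same key-removal idea without touching the cached model files, the following sketch runs it against a temporary JSON file. The strip_keys helper and the sample config contents are hypothetical and are not the real adapter configuration.

```python
import json
import os
import tempfile

def strip_keys(config_path, keys):
    """Remove the given keys from a JSON file in place; return the keys removed."""
    with open(config_path, "r") as f:
        data = json.load(f)
    removed = [key for key in keys if key in data]
    for key in removed:
        del data[key]
    with open(config_path, "w") as f:
        json.dump(data, f, indent=4)
    return removed

# Made-up config contents for illustration only.
sample = {"r": 64, "use_dora": False, "use_rslora": False}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "adapter_config.json")
    with open(path, "w") as f:
        json.dump(sample, f)

    # Keys that are absent from the file (here loftq_config) are skipped
    removed = strip_keys(path, ["use_dora", "use_rslora", "loftq_config"])

    with open(path, "r") as f:
        remaining = json.load(f)

print(removed, remaining)
```

Returning the list of removed keys makes it easy to confirm which entries were actually present before the model is loaded.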

Great, now we can load the model. We use "AutoModelForSpeechSeq2Seq" from the "transformers" library and its "from_pretrained" method to load the model.

speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(model_name, trust_remote_code=True).to(device)

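Step 6. Create an ASR system with the Granite speech model

Now that we have loaded the model and prepared the audio data, we can use them to generate text from speech.

First, we create a prompt that asks the model to transcribe the audio data. We use "tokenizer.apply_chat_template()" to convert the prompt into a format that the model can ingest.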

chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2025.\nToday's Date: April 16, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you transcribe the speech into a written format?",
    }
]

text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)

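Next, we set up an empty list, "generated_texts", to collect the text generated from each chunk of audio input. We set up a "for" loop that iterates over each audio chunk and passes it to the model for generation. Here, we also track the loop's progress with a "tqdm" progress bar. The model inputs are created through the "speech_granite_processor" that we set up earlier. The processor takes "text" and "chunk" as input and returns a processed version of the audio data that the model can use. The outputs are produced with the speech model's "generate" method. From there, we use the "tokenizer" to convert the model output into human-readable text and store each chunk's transcript in the "generated_texts" list.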

generated_texts = []

for chunk in tqdm(chunks, desc="Generating transcript..."):

    model_inputs = speech_granite_processor(
        text,
        chunk,
        device=device, # Computation device; returned tensors are put on CPU
        return_tensors="pt",
    ).to(device)
    
    # Generate
    model_outputs = speech_granite.generate(
        **model_inputs,
        max_new_tokens=1000,
        num_beams=1,
        do_sample=False,
        min_length=1,
        top_p=1.0,
        repetition_penalty=1.0,
        length_penalty=1.0,
        temperature=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,)

    num_input_tokens = model_inputs["input_ids"].shape[-1]
    new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)

    output_text = tokenizer.batch_decode(
        new_tokens, add_special_tokens=False, skip_special_tokens=True)[0]

    generated_texts.append(output_text)
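
The slicing by num_input_tokens in the loop above implies that the generated sequence begins with the prompt tokens, so only the tail is new text. The list-based sketch below illustrates that idea with made-up token ids; it is not output from the model.

```python
# Made-up token ids for illustration only.
prompt_ids = [101, 7, 8, 9]    # tokens of the prompt sent to the model
generated_ids = [42, 43, 44]   # tokens the model appends

# Assumed layout of one generated sequence: prompt first, new tokens after it
model_output = prompt_ids + generated_ids

num_input_tokens = len(prompt_ids)
new_tokens = model_output[num_input_tokens:]
print(new_tokens)
```

Decoding only new_tokens keeps the system and user prompt out of each chunk's transcript.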

Because the chunk transcripts are individual strings in a list, we join them with a space between each to make one cohesive, full transcript.

full_transcript = " ".join(generated_texts)

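Step 7. Use the Granite instruct model for summarization

Now that we have the full transcript, we use the same model to summarize it. We can access the Granite-3.3-8B-Instruct model directly from Granite-speech-3.3-8b simply by calling it with a text prompt that does not contain the "<|audio|>" token. We set up a new prompt that instructs the model to generate a summary of the entire transcript. We use "tokenizer.apply_chat_template()" again to convert the prompt for model inference.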

conv = [{"role": "user", 
         "content": f"Compose a single, unified summary of the following transcript. Your response should only include the unified summary. Do not provide any further explanation. Transcript:{full_transcript}"}]

text = tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)

We use "speech_granite_processor" again to create the model inputs, but this time we do not pass in any audio.

model_inputs = speech_granite_processor(
    text,
    device=device, # Computation device; returned tensors are put on CPU
    return_tensors="pt",
).to(device)

We receive the output from "speech_granite.generate()" as a tensor. We can convert this output to text with "tokenizer.decode()". Now, let's print the final summary.

output = speech_granite.generate(
    **model_inputs,
    max_new_tokens= 2000, # concise summary
)

summary = tokenizer.decode(output[0, model_inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(summary)

Output:

In the 50th episode of Mixture of Experts, the panel discusses various AI-related topics. 

Kate Soule, Director of Technical Product Management at Granite, estimates that 90% of enterprise data is unstructured. 

Hilary Hunter, IBM Fellow and CTO of IBM Infrastructure, introduces IBM's new mainframe launch, IBM z, emphasizing its zero downtime and eight nines of reliability, crucial for global financial transactions. 

The conversation also touches on Meta's Llama 4 release, featuring three models: Scout (100 billion parameters), Maverick (200 billion parameters), and Behemoth (two trillion parameters). The panelists discuss the implications of these models, particularly the mixture of experts architecture, and its potential to become more community-driven. 

Shobhit Varshney Head of Data and AI for the Americas, shares insights on handling unstructured data in enterprises, advocating for bringing AI close to transaction data for low-latency, mission-critical applications. 

The episode concludes with a brief overview of Google Cloud Next, highlighting advancements in AI models, on-premises AI capabilities, and Google's focus on AI for media creation and agent-to-agent communication. 

The panel also references a Pew Research report on American perceptions of AI, noting discrepancies between experts' optimism and the general public's concerns about job impacts from AI.

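Summary

In this tutorial, we downloaded an English audio file from YouTube. We converted the audio file for consumption by the Granite speech model, generated a full transcript of the audio and used the Granite instruct model to generate a summary of the transcript.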
