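Generating a podcast transcript with automatic speech recognition

In this tutorial, we use the open source IBM Granite 3.3 speech model to generate a transcript of an IBM Mixture of Experts podcast video on YouTube. We then use the open source IBM Granite-3.3-8B-Instruct large language model (LLM) to produce a summary of the generated transcript. The code runs in a watsonx.ai notebook.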

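Automatic speech recognition

Automatic speech recognition (ASR), also known as speech recognition or speech to text, is technology that converts spoken language into written text. Various machine learning algorithms and artificial intelligence techniques are used to convert speech to text. Speech recognition technology has advanced considerably since its beginnings in the mid-twentieth century.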

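In the 1960s, spectrograms were first used to analyze speech. Over the following decades, the field shifted toward statistical models. Hidden Markov models (HMMs) emerged and became the dominant way to model sequences of small sound units, known in linguistics as phonemes. ASR system architecture consisted of three separate components: an acoustic model, a language model and a decoder.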

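In the 2010s, advances in deep learning began to reshape the traditional speech recognition architecture. Encoder-decoder models can use a recurrent neural network (RNN) or convolutional neural network (CNN) architecture, in which an encoder processes the input data and a decoder generates output based on the encoder's representation. These models are trained on large datasets of audio-text pairs to learn the correspondence between audio signals and transcriptions. Popular ASR models include DeepSpeech and Wav2Vec.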

Today, virtual assistants such as Apple's Siri, Amazon's Alexa or Microsoft's Cortana use ASR technology to process human speech in real time. These assistants can also integrate speech to text with large language models (LLMs) and natural language processing (NLP). LLMs can be used to add context, which helps when word choice is ambiguous or when human speech patterns vary.

Prerequisites

You need an IBM Cloud account to create a watsonx.ai project.

Steps

Step 1. Set up your environment

While you can choose from several tools, this tutorial walks you through setting up an IBM account to use a Jupyter Notebook.

1. Log in to watsonx.ai using your IBM Cloud account.

2. Create a watsonx.ai project.

3. Create a Jupyter Notebook.

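Make sure to select GPU 2xV100 Runtime 24.1 to define the configuration. This step opens a Jupyter Notebook environment where you can copy the code from this tutorial.

Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This tutorial is also available on GitHub. For more Granite tutorials, check out the IBM Granite Community.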

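Step 2. Install and import relevant libraries

This tutorial has a few dependencies. Make sure to import the following packages; if any are not installed, a quick pip install resolves the problem.

If you get a "pip dependency resolver" error related to the caikit-nlp package, you can ignore it for now, since the rest of the notebook should still run normally.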

# Install required packages
! pip install -q peft torchaudio soundfile pytubefix pytube moviepy tqdm https://github.com/huggingface/transformers/archive/main.zip

# Required imports
import json
import os

from pytubefix import YouTube
from tqdm import tqdm
from moviepy.audio.io.AudioFileClip import AudioFileClip

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download

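Step 3. Download the podcast audio from YouTube

This tutorial uses the latest episode of the IBM Mixture of Experts podcast at the time of writing, "AI on IBM z17, Meta's Llama 4 and Google Cloud Next 2025". The podcast is hosted on YouTube. First, we create a YouTube object and use the streams.filter(only_audio=True) method to capture only the raw audio. We then extract the audio from the video and save it as an M4A audio file, out_file. base is the full file name, including the directory where the file is saved, without the m4a extension. We use the base variable later when converting the audio format.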

url = "https://www.youtube.com/watch?v=90fUR1PQgt4" #latest episode 37 minutes

# Create a YouTube object
yt = YouTube(url)

# Download only the audio stream from the video
video = yt.streams.filter(only_audio=True).first()

# Save the audio to a file
out_file = video.download()

# Get the base name and extension of the downloaded audio
base = os.path.splitext(out_file)[0]
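
To make the relationship between out_file and base concrete, here is a minimal sketch that applies os.path.splitext to a made-up path. The path below is hypothetical; the real one depends on the video title and your working directory.

```python
import os

# Hypothetical download path, for illustration only
example_out_file = "/home/user/podcast_episode.m4a"

# splitext splits at the last dot and returns (root, extension)
example_base, example_ext = os.path.splitext(example_out_file)

print(example_base)  # /home/user/podcast_episode
print(example_ext)   # .m4a
```

Appending ".wav" to the root then gives the target name used in the next step.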

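Step 4. Prepare the podcast audio file for model inference

We need to make a few modifications to the podcast audio file before we can use it for model inference.

First, we need to convert the M4A file to a WAV file for use with the Granite Speech model. We use the moviepy library for this conversion. Using the base variable defined earlier, we can build the new file name with the .wav extension.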

# Load the M4A file
audio_clip = AudioFileClip(out_file)

# Write the audio to a WAV file
audio_clip.write_audiofile(base+".wav")

# Close the audio clip
audio_clip.close()

audio_path = base+".wav"
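
Before loading the file with torchaudio, it can be useful to confirm what the conversion produced. The sketch below is not part of the original tutorial: it uses Python's standard wave module and a hypothetical helper, describe_wav, and it assumes the WAV file is uncompressed PCM, which the wave module can read.

```python
import wave

def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        frames = wav_file.getnframes()
    # Duration is the number of frames divided by frames per second
    return channels, sample_rate, frames / sample_rate
```

Calling describe_wav(audio_path) in the notebook would report whether the file is stereo and which sample rate it uses, which are the two properties the following steps change.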

Next, we use torchaudio.load() to load the audio file as a tensor and extract the sample rate.

We also need to convert the returned waveform from stereo to mono. We can do this by averaging the stereo channels with torch.mean().

#Resulting waveform and sample rate
waveform, sample_rate = torchaudio.load(audio_path, normalize=True)

# convert from stereo to mono
mono_waveform = torch.mean(waveform, dim=0, keepdim=True)

# confirm the waveform is mono
assert mono_waveform.shape[0] == 1 # mono
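
The averaging itself is simple arithmetic. As an illustration only, the sketch below reproduces it with plain Python lists instead of tensors; downmix_to_mono and the sample values are hypothetical and stand in for torch.mean(waveform, dim=0, keepdim=True).

```python
def downmix_to_mono(channels):
    """Average equal-length channels sample by sample."""
    num_channels = len(channels)
    mono = [sum(samples) / num_channels for samples in zip(*channels)]
    # Wrap in a list so the result keeps a [1, time] layout, like keepdim=True
    return [mono]

stereo = [
    [0.5, -0.5, 1.0],  # left channel
    [0.25, 0.5, 0.0],  # right channel
]
mono = downmix_to_mono(stereo)
print(mono)  # [[0.375, 0.0, 0.5]]
```

If the input already has a single channel, the average leaves it unchanged, so the same step is safe for mono recordings.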

Next, we need to resample the mono waveform to the model's sample rate of 16 kHz. We can use torchaudio's resampling API for this.

# Resample the mono waveform to the model's sample rate
resample_transform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
resampled_waveform = resample_transform(mono_waveform)

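Finally, we split the resampled waveform into fixed-size chunks to feed to the model, which makes inference more manageable.

We use torch.split() to split the full resampled waveform into 30-second chunks, setting the chunk size in samples to 30 seconds * 16 kHz. This step yields a sequence of waveform chunks, chunks, each holding 30 seconds of audio data, except possibly the last, which is shorter when the total length is not an exact multiple of the chunk size. We feed each chunk to the model for inference.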

# Define the desired chunk size
chunk_size_seconds = 30 
chunk_size_samples = chunk_size_seconds * 16000

# Split the waveform into chunks of equal size
chunks = torch.split(resampled_waveform, chunk_size_samples, dim=1)
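
The chunk arithmetic can be checked without loading any audio. The sketch below is illustrative: chunk_lengths is a hypothetical helper that mirrors how fixed-size splitting behaves, and the 100-second clip length is an assumed example rather than the podcast's actual duration.

```python
SAMPLE_RATE = 16_000   # Hz, the rate used in this tutorial
CHUNK_SECONDS = 30

def chunk_lengths(total_samples, chunk_samples):
    """Lengths of the pieces from fixed-size splitting; the last may be shorter."""
    full, remainder = divmod(total_samples, chunk_samples)
    return [chunk_samples] * full + ([remainder] if remainder else [])

chunk_samples = CHUNK_SECONDS * SAMPLE_RATE   # 480000 samples per chunk
total_samples = 100 * SAMPLE_RATE             # assumed 100-second clip

lengths = chunk_lengths(total_samples, chunk_samples)
print(lengths)                     # [480000, 480000, 480000, 160000]
print(lengths[-1] / SAMPLE_RATE)   # 10.0
```

Here the final chunk holds only 10 seconds of audio, which is the case to keep in mind when reading per-chunk transcripts.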

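Step 5. Load and instantiate the Granite speech model

Now we can start instantiating the speech model.

First, we set the torch device to CPU. With the device set to GPU, you might hit out-of-memory errors when running this notebook, whereas CPU should work fine in a watsonx.ai notebook. We can then set up the processor and tokenizer for the model.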

device = 'cpu'
model_name = "ibm-granite/granite-speech-3.3-8b"
speech_granite_processor = AutoProcessor.from_pretrained(
    model_name, trust_remote_code=True)
tokenizer = speech_granite_processor.tokenizer

If you run the notebook on the watsonx.ai platform, you might also need to run the following code to manually edit the adapter_config.json file. This avoids an error when loading the model.

adapter_config_file = hf_hub_download(model_name, 'adapter_config.json')

# Load the existing config file
with open(adapter_config_file, 'r') as file:
    data = json.load(file)

#remove key, value pairs in config file throwing error
keys_to_delete = ['layer_replication', 'loftq_config', 'megatron_config', 'megatron_core', 'use_dora', 'use_rslora']

for key in keys_to_delete:
    if key in data:
        del data[key]

# write the updated config file back to disk
with open(adapter_config_file, 'w') as file:
    json.dump(data, file, indent=4)

# Reload the updated config to confirm it is still valid JSON
with open(adapter_config_file, 'r') as file:
    data = json.load(file)
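
The key removal can also be written as a small pure function, which makes it easy to try on a sample dictionary before touching the cached file. This is a sketch only: prune_keys is a hypothetical helper, and the config values below are made up for illustration rather than copied from the real adapter_config.json.

```python
import json

def prune_keys(config, keys_to_delete):
    """Return a copy of config without the listed keys; missing keys are ignored."""
    unwanted = set(keys_to_delete)
    return {key: value for key, value in config.items() if key not in unwanted}

# Made-up config for illustration; the real file has more fields
sample_config = {"r": 64, "lora_alpha": 32, "use_dora": False, "use_rslora": False}
unwanted_keys = ['layer_replication', 'loftq_config', 'megatron_config',
                 'megatron_core', 'use_dora', 'use_rslora']

pruned = prune_keys(sample_config, unwanted_keys)
print(json.dumps(pruned))  # {"r": 64, "lora_alpha": 32}
```

Because the function returns a new dictionary, the original data stays available for comparison.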

Great, now we can load the model! We use AutoModelForSpeechSeq2Seq from the transformers library and its from_pretrained method to load the model.

speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(model_name, trust_remote_code=True).to(device)

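Step 6. Create an ASR system with the Granite speech model

Now that we have loaded the model and prepared the audio data, we can use them to generate text from speech.

We start by creating a prompt that asks the model to transcribe the audio data. We use tokenizer.apply_chat_template() to convert the prompt into a format that can be fed to the model.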

chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2025.\nToday's Date: April 16, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you transcribe the speech into a written format?",
    }
]

text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)

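Then, we set up an empty list, generated_texts, to collect the text generated from each chunk of audio input.

We set up a for loop that iterates over the audio chunks and passes each one to the model for generation. We also use a tqdm progress bar to track the loop's progress.

The model inputs are created with the speech_granite_processor that we set up earlier. The processor takes text and chunk as input and returns a processed version of the audio data for the model to use.

The model outputs are produced with the speech model's generate method. We then use the tokenizer to convert the model outputs into human-readable text and store each chunk's transcript in the generated_texts list.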

generated_texts = []

for chunk in tqdm(chunks, desc="Generating transcript..."):

    model_inputs = speech_granite_processor(
        text,
        chunk,
        device=device, # Computation device; returned tensors are put on CPU
        return_tensors="pt",
    ).to(device)
    
    # Generate
    model_outputs = speech_granite.generate(
        **model_inputs,
        max_new_tokens=1000,
        num_beams=1,
        do_sample=False,
        min_length=1,
        top_p=1.0,
        repetition_penalty=1.0,
        length_penalty=1.0,
        temperature=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,)

    num_input_tokens = model_inputs["input_ids"].shape[-1]
    new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)

    output_text = tokenizer.batch_decode(
        new_tokens, add_special_tokens=False, skip_special_tokens=True)[0]

    generated_texts.append(output_text)
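
The slicing by num_input_tokens in the loop relies on the generated sequence starting with the prompt tokens, so that everything after the prompt length is new text. The sketch below illustrates that idea with plain lists; strip_prompt and the token IDs are hypothetical.

```python
def strip_prompt(output_ids, num_input_tokens):
    """Keep only the token IDs that come after the prompt."""
    return output_ids[num_input_tokens:]

prompt_ids = [101, 7, 8, 9]              # hypothetical prompt token IDs
output_ids = prompt_ids + [42, 43, 44]   # prompt followed by newly generated IDs

new_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_ids)  # [42, 43, 44]
```

Decoding only these trailing IDs keeps the system and user prompt out of the transcript.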

The chunk transcripts are currently individual strings in a list, so we join them with spaces in between to form one cohesive full transcript.

full_transcript = " ".join(generated_texts)
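
If some chunks come back empty or with stray whitespace, a plain join can leave double spaces in the transcript. As an optional variation, and not something the original tutorial does, the hypothetical join_chunks helper below trims each piece and skips empty ones.

```python
def join_chunks(texts):
    """Join chunk transcripts with single spaces, skipping empty pieces."""
    cleaned = [text.strip() for text in texts]
    return " ".join(text for text in cleaned if text)

example_texts = ["hello and welcome ", "", " to the show"]
print(join_chunks(example_texts))  # hello and welcome to the show
```

This does not repair words that were cut at a chunk boundary; it only normalizes the spacing between chunks.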

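Step 7. Summarize with the Granite instruct model

Now that we have the full transcript, we use the same model to summarize it. We can access the Granite-3.3-8B-Instruct model directly through Granite-speech-3.3-8b by calling it with a text prompt that does not contain the <|audio|> token.

We set up a new prompt instructing the model to generate a summary of the full transcript. We use tokenizer.apply_chat_template() again to convert the prompt for model inference.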

conv = [{"role": "user", 
         "content": f"Compose a single, unified summary of the following transcript. Your response should only include the unified summary. Do not provide any further explanation. Transcript:{full_transcript}"}]

text = tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)

We use speech_granite_processor again to create the model inputs, this time without passing any audio.

model_inputs = speech_granite_processor(
    text,
    device=device, # Computation device; returned tensors are put on CPU
    return_tensors="pt",
).to(device)

The output of speech_granite.generate() is a tensor of token IDs. We can convert it to text using tokenizer.decode() and print the final summary!

output = speech_granite.generate(
    **model_inputs,
    max_new_tokens=2000,  # upper bound on the summary length
)

summary = tokenizer.decode(output[0, model_inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(summary)

Output:

In the 50th episode of Mixture of Experts, the panel discusses various AI-related topics. 

Kate Soule, Director of Technical Product Management at Granite, estimates that 90% of enterprise data is unstructured. 

Hilary Hunter, IBM Fellow and CTO of IBM Infrastructure, introduces IBM's new mainframe launch, IBM z, emphasizing its zero downtime and eight nines of reliability, crucial for global financial transactions. 

The conversation also touches on Meta's Llama 4 release, featuring three models: Scout (100 billion parameters), Maverick (200 billion parameters), and Behemoth (two trillion parameters). The panelists discuss the implications of these models, particularly the mixture of experts architecture, and its potential to become more community-driven. 

Shobhit Varshney Head of Data and AI for the Americas, shares insights on handling unstructured data in enterprises, advocating for bringing AI close to transaction data for low-latency, mission-critical applications. 

The episode concludes with a brief overview of Google Cloud Next, highlighting advancements in AI models, on-premises AI capabilities, and Google's focus on AI for media creation and agent-to-agent communication. 

The panel also references a Pew Research report on American perceptions of AI, noting discrepancies between experts' optimism and the general public's concerns about job impacts from AI.

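Conclusion

In this tutorial, we downloaded an English audio file from YouTube. We converted the audio file for use with the Granite speech model, generated a full transcript of the audio, and then used the Granite Instruct model to generate a summary of the transcript.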
