Accelerating Whisper Speech-to-Text with the GPU on macOS

1. Introduction

A few days ago I wrote about open-source Whisper speech-to-text 3. On macOS, however, running whisper with the default --device mps option for GPU acceleration fails with an error, while running whisper on the CPU alone takes quite a long time.

After searching the topic, I found that the whisper project on GitHub 1, when installed on macOS, cannot currently use GPU acceleration.

However, whisper.cpp 2, a C/C++ implementation, makes GPU-accelerated Whisper speech-to-text possible on macOS.

The result: on macOS, GPU-accelerated transcription through whisper.cpp is markedly faster than CPU-only whisper.

2. Procedure

2.1 Download the model

It is easiest to download it with the official command; enter in the terminal:

whisper 20231218.wav --language Chinese --model large

The large model is stored at ~/.cache/whisper/large-v3.pt; it is a .pt file of about 2.9 GB.
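Before moving on, it is worth confirming that the download actually completed; an interrupted download leaves a much smaller file. A minimal check, sketched in Python (the helper name and the size threshold are my own choices, not part of whisper):

```python
from pathlib import Path

def model_ok(path: str, min_bytes: int = 2_500_000_000) -> bool:
    """Return True if the model file exists and looks complete.

    min_bytes is a rough lower bound: large-v3.pt is about 2.9 GB,
    so a file far smaller than that usually means a broken download.
    """
    p = Path(path).expanduser()
    return p.is_file() and p.stat().st_size >= min_bytes

# Example: model_ok("~/.cache/whisper/large-v3.pt")
```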

2.2 Convert the .pt file to a .bin file

Download and unpack the full source trees of the two GitHub projects, whisper 1 and whisper.cpp 2.

Use the conversion script convert-pt-to-ggml.py shipped with whisper.cpp to convert the original .pt file into a .bin file; enter in the terminal:

python /Users/name/Downloads/whisper.cpp-master/models/convert-pt-to-ggml.py ~/.cache/whisper/large-v3.pt /Users/name/Downloads/whisper-main/ /Users/name/Downloads/whisper.cpp-master/models/

Explanation:

  • /Users/name/Downloads/whisper.cpp-master/models/convert-pt-to-ggml.py is the path of the convert-pt-to-ggml.py script
  • ~/.cache/whisper/large-v3.pt is the model downloaded in step 2.1
  • /Users/name/Downloads/whisper-main/ is the path of the unpacked whisper repository, which the script reads auxiliary files from
  • /Users/name/Downloads/whisper.cpp-master/models/ is the directory where the .bin file will be written

After the command finishes, a ggml-model.bin file of about 3.1 GB is created in the /Users/name/Downloads/whisper.cpp-master/models/ folder.
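A quick sanity check of the converted file is possible because ggml model files begin with a fixed magic number. The constant below is taken from whisper.cpp's convert-pt-to-ggml.py at the time of writing and may change between versions; the helper itself is just an illustrative sketch:

```python
import struct

GGML_MAGIC = 0x67676D6C  # the ASCII bytes "ggml" packed as a native int

def is_ggml_file(path: str) -> bool:
    """Check that a converted model starts with the ggml magic number."""
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        return False
    (magic,) = struct.unpack("i", header)
    return magic == GGML_MAGIC

# Example: is_ggml_file("models/ggml-model.bin")
```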

2.3 Prepare a 16 kHz WAV audio file

Enter in the terminal:

ffmpeg -i 20231218.mp4 -f wav -acodec pcm_s16le -ac 1 -ar 16000 20231218.wav

Explanation:

  • ffmpeg is a tool for converting video and audio 4
  • 20231218.mp4 is the input file
  • 20231218.wav is the output file

Note ⚠️: the WAV file must be 16 kHz, otherwise the following error occurs:

read_wav: WAV file '20231218.wav' must be 16 kHz
error: failed to read WAV file 'samples/20231218.wav'
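A pre-flight check of the sample rate avoids this error. The sketch below uses Python's standard wave module; the helper name is my own:

```python
import wave

def is_16khz_wav(path: str) -> bool:
    """Return True if the WAV file has the 16 kHz sample rate whisper.cpp expects."""
    with wave.open(path, "rb") as w:
        return w.getframerate() == 16000

# Example: is_16khz_wav("samples/20231218.wav")
```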

2.4 Transcribe speech to text with GPU-accelerated Whisper

Switch into the whisper.cpp-master folder (the main binary must be compiled first, e.g. by running make, if you have not already), then enter in the terminal:

./main -m models/ggml-model.bin -f samples/20231218.wav -l zh

  • ./main is whisper.cpp's main executable; run ./main -h for detailed help
  • models/ggml-model.bin is the model file converted in step 2.2
  • samples/20231218.wav is the audio to transcribe, produced in step 2.3
  • -l zh sets the language to Chinese

The output is:

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-model.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/name/Downloads/whisper.cpp-master/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
ggml_metal_init: maxTransferRate               = built-in GPU
whisper_model_load:    Metal buffer size =  3094.88 MB
whisper_model_load: model size    = 3094.36 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/name/Downloads/whisper.cpp-master/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
ggml_metal_init: maxTransferRate               = built-in GPU
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.49 MB
whisper_init_state: compute buffer (encode) =  212.49 MB
whisper_init_state: compute buffer (cross)  =    9.45 MB
whisper_init_state: compute buffer (decode) =   99.30 MB

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/20231218.wav' (8748884 samples, 546.8 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = zh, task = transcribe, timestamps = 1 ...

...
...
...

whisper_print_timings:     load time =   934.38 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   288.84 ms
whisper_print_timings:   sample time =  3426.48 ms / 12652 runs (    0.27 ms per run)
whisper_print_timings:   encode time =  9619.86 ms /    20 runs (  480.99 ms per run)
whisper_print_timings:   decode time =  4992.92 ms /   316 runs (   15.80 ms per run)
whisper_print_timings:   batchd time = 93701.06 ms / 12241 runs (    7.65 ms per run)
whisper_print_timings:   prompt time =  1840.59 ms /  4184 runs (    0.44 ms per run)
whisper_print_timings:    total time = 114824.53 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating

With the Apple M2 GPU (GPU family: MTLGPUFamilyApple8 (1008)) 5, the speed-up is substantial: per the log above, the 546.8 s recording was transcribed in about 115 s of total processing time, i.e. roughly 4.8× real time.
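The real-time factor can be computed directly from the two relevant log lines, the processing line with the audio duration and the total-time line. A small parser sketch (the function name is my own; the log excerpt is abridged from the run above):

```python
import re

def realtime_factor(log: str) -> float:
    """Compute audio_duration / total_processing_time from whisper.cpp output.

    Parses '(... samples, NNN.N sec)' from the 'main: processing' line and
    'total time = NNN ms' from the whisper_print_timings summary.
    """
    audio_sec = float(re.search(r"samples, ([\d.]+) sec\)", log).group(1))
    total_ms = float(re.search(r"total time\s*=\s*([\d.]+) ms", log).group(1))
    return audio_sec / (total_ms / 1000.0)

log = """
main: processing 'samples/20231218.wav' (8748884 samples, 546.8 sec), 4 threads, ...
whisper_print_timings:    total time = 114824.53 ms
"""
print(f"{realtime_factor(log):.1f}x real time")  # about 4.8x on this run
```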

3. Further Reading

  1. openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision
  2. ggerganov/whisper.cpp: Port of OpenAI’s Whisper model in C/C++
  3. Open-source Whisper speech-to-text
  4. FFmpeg
  5. MTLGPUFamilyApple8 | Apple Developer Documentation