Using GPU Acceleration for Whisper Speech-to-Text on macOS
1. Introduction
A few days ago, I wrote about open-source Whisper speech-to-text. On macOS, however, running the stock whisper command with --device mps
for GPU acceleration throws an error, while running whisper on the CPU alone takes quite a long time.
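For reference, the failing invocation is essentially the download command from section 2.1 below with the device switched to MPS (the audio filename is just an example):
whisper 20231218.wav --language Chinese --model large --device mps   # errored out on macOS at the time of writing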
After searching 🔍 the topic, I found that the whisper project on GitHub [1], as currently installed on macOS, cannot yet use GPU acceleration.
However, whisper.cpp [2], a C/C++ implementation, makes GPU-accelerated Whisper speech-to-text possible on macOS.
The result: on macOS, whisper.cpp with GPU acceleration speeds up Whisper speech-to-text dramatically.
2. Step-by-Step Process
2.1 Download the model
Downloading via the official whisper command is recommended. In the terminal, run:
whisper 20231218.wav --language Chinese --model large
The large model is saved to ~/.cache/whisper/large-v3.pt; it is a .pt file of about 2.9 GB.
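If you want to confirm the download, listing the cache directory mentioned above is enough:
ls -lh ~/.cache/whisper/   # large-v3.pt should show up at roughly 2.9 GB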
2.2 Convert the pt file to a bin file
Download and unzip both GitHub repositories: whisper [1] and whisper.cpp [2].
Then use the conversion script convert-pt-to-ggml.py provided by whisper.cpp to convert the original pt file into a bin file. In the terminal, run:
python /Users/name/Downloads/whisper.cpp-master/models/convert-pt-to-ggml.py ~/.cache/whisper/large-v3.pt /Users/name/Downloads/whisper-main/ /Users/name/Downloads/whisper.cpp-master/models/
Explanation:
- /Users/name/Downloads/whisper.cpp-master/models/convert-pt-to-ggml.py is the full path to the convert-pt-to-ggml.py script
- ~/.cache/whisper/large-v3.pt is the model downloaded in step 1
- /Users/name/Downloads/whisper-main/ is the path to the unzipped whisper repository, which the conversion script also needs
- /Users/name/Downloads/whisper.cpp-master/models/ is the directory where the bin file will be generated
After the command finishes, a ggml-model.bin file of about 3.1 GB is generated in the
/Users/name/Downloads/whisper.cpp-master/models/ folder.
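One more prerequisite before step 4: whisper.cpp must be compiled so that the ./main binary exists. A minimal sketch, assuming the Makefile that ships with whisper.cpp-master (on Apple Silicon, the Metal backend is enabled by default):
cd /Users/name/Downloads/whisper.cpp-master
make   # produces ./main with Metal support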
2.3 Prepare a 16 kHz WAV audio file
In the terminal, run:
ffmpeg -i 20231218.mp4 -f wav -acodec pcm_s16le -ac 1 -ar 16000 20231218.wav
Explanation:
- ffmpeg is a tool for converting video and audio [4]
- 20231218.mp4 is the input file
- 20231218.wav is the output file
- -acodec pcm_s16le -ac 1 -ar 16000 produces 16-bit mono PCM audio at 16 kHz
Note ⚠️: the WAV file must be 16 kHz, otherwise the following error occurs:
read_wav: WAV file '20231218.wav' must be 16 kHz
error: failed to read WAV file 'samples/20231218.wav'
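To check whether an existing WAV file already satisfies the 16 kHz requirement, ffprobe (installed alongside ffmpeg) can be used; the filename is just the example from above:
ffprobe -hide_banner 20231218.wav   # the audio stream line should report 16000 Hz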
2.4 Convert speech to text with GPU-accelerated Whisper
Change into the whisper.cpp-master folder, then run in the terminal:
./main -m models/ggml-model.bin -f samples/20231218.wav -l zh
- ./main is whisper.cpp's main tool; run ./main -h for detailed help
- models/ggml-model.bin is the model file
- samples/20231218.wav is the audio to transcribe, generated in step 3
- -l zh specifies the language
The output is:
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-model.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/name/Downloads/whisper.cpp-master/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
ggml_metal_init: maxTransferRate = built-in GPU
whisper_model_load: Metal buffer size = 3094.88 MB
whisper_model_load: model size = 3094.36 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/name/Downloads/whisper.cpp-master/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
ggml_metal_init: maxTransferRate = built-in GPU
whisper_init_state: kv self size = 220.20 MB
whisper_init_state: kv cross size = 245.76 MB
whisper_init_state: compute buffer (conv) = 32.49 MB
whisper_init_state: compute buffer (encode) = 212.49 MB
whisper_init_state: compute buffer (cross) = 9.45 MB
whisper_init_state: compute buffer (decode) = 99.30 MB
system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 |
main: processing 'samples/20231218.wav' (8748884 samples, 546.8 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = zh, task = transcribe, timestamps = 1 ...
...
...
...
whisper_print_timings: load time = 934.38 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 288.84 ms
whisper_print_timings: sample time = 3426.48 ms / 12652 runs ( 0.27 ms per run)
whisper_print_timings: encode time = 9619.86 ms / 20 runs ( 480.99 ms per run)
whisper_print_timings: decode time = 4992.92 ms / 316 runs ( 15.80 ms per run)
whisper_print_timings: batchd time = 93701.06 ms / 12241 runs ( 7.65 ms per run)
whisper_print_timings: prompt time = 1840.59 ms / 4184 runs ( 0.44 ms per run)
whisper_print_timings: total time = 114824.53 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating
With the Apple M2 GPU (GPU family: MTLGPUFamilyApple8 (1008)) [5] doing the work, the speedup is dramatic: the 546.8-second recording above was transcribed in about 115 seconds in total.
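As a follow-up, ./main can also save the transcript to a file instead of only printing it. Based on the options listed by ./main -h, something along these lines should work (the output name is illustrative):
./main -m models/ggml-model.bin -f samples/20231218.wav -l zh -otxt -of 20231218
Here -otxt writes the result to 20231218.txt; -osrt would produce an SRT subtitle file instead.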