MiniCPM-o 4.5: Brief Notes on Migrating to Ascend NPU

1. Environment Setup

Base image and PyTorch ecosystem

Use the vllm-ascend 0.13.0.rc3 image as the base environment:

pip install torchaudio==2.8.0
pip install "transformers==4.51.0" accelerate "torch>=2.3.0,<=2.8.0" "torchaudio<=2.8.0" "minicpmo-utils[all]>=1.0.5"

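Building ffmpeg from source

MiniCPM-o's video processing depends on ffmpeg, which has to be built from source. --enable-shared is the key flag, because decord needs to link against libavcodec: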

wget https://ffmpeg.org/releases/ffmpeg-4.4.2.tar.bz2
tar -xvf ffmpeg-4.4.2.tar.bz2 && cd ffmpeg-4.4.2
./configure --enable-shared --prefix=/usr/local/ffmpeg
make -j 64 && make install
cd ..

If running ffmpeg produces no output, add the install paths to the environment:

echo 'export PATH="/usr/local/ffmpeg/bin:$PATH"' >> /etc/profile.d/ffmpeg.sh
echo 'export LD_LIBRARY_PATH="/usr/local/ffmpeg/lib:$LD_LIBRARY_PATH"' >> /etc/profile.d/ffmpeg.sh
source /etc/profile

Building decord from source

git clone --recursive https://github.com/dmlc/decord --depth 1
cd decord && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DFFMPEG_DIR:PATH="/usr/local/ffmpeg/"
make

cd ../python
python setup.py sdist bdist_wheel
pip install dist/decord-0.6.0-cp310-cp310-linux_aarch64.whl
cd ../..

Other dependencies

pip install moviepy==2.1.2 librosa==0.9.0 pillow==10.4.0 \
    accelerate onnx \
    -i https://mirrors.aliyun.com/pypi/simple/

# Project dependencies
cd /data/MiniCPM-o-Demo
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/

# Config file
cp config.example.json config.json
# Then set model.model_path in config.json

Frontend build (bun)

curl -fsSL https://bun.sh/install | bash
source /root/.bashrc

cd /data/MiniCPM-o-Demo/frontend/mobile
bun install

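With the environment ready, the migration work can begin.

The goal is to move the official MiniCPM-o 4.5 PyTorch+CUDA Web Demo entirely onto Huawei Ascend NPU. The port does not depend on flagos. It uses torch_npu's native transfer_to_npu and adds a single --device npu argument to switch between CUDA and NPU. The project contains 30+ hard-coded CUDA references spread across 8 files, and each one is reworked.

2. Project Structure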

MiniCPM-o-Demo/
├── worker.py              # Inference worker, one process per device
├── gateway.py             # Request-routing gateway
├── start_all.sh           # One-shot launch script
├── core/processors/
│   ├── unified.py         # Unified processor: model loading + mode switching
│   ├── base.py / factory.py
├── MiniCPMO45/
│   ├── modeling_minicpmo_unified.py   # Model definition (main)
│   └── modeling_minicpmo.py           # Model definition (legacy)
├── benchmark.py / precompile.py

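3. Migration Strategy

Create a device abstraction layer, device_utils.py, and route every torch.cuda.* call through it. Three core steps run at the very top of the entry point: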

import torch_npu
import torch_npu.contrib.transfer_to_npu  # importing it globally patches .cuda() -> .npu()
torch_npu.npu.set_compile_mode(jit_compile=False)     # eager mode
torch_npu.npu.config.allow_internal_format = False    # preserve precision

The new device_utils.py wraps these steps:

from device_utils import init_npu, device_module, empty_cache, synchronize

init_npu("npu")         # imports torch_npu and runs the three init steps
empty_cache()           # replaces torch.cuda.empty_cache()
synchronize()           # replaces torch.cuda.synchronize()
dm = device_module()    # returns the actual device module (torch_npu.npu or torch.cuda)
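
The article does not list device_utils.py itself, so the following is only a minimal sketch of what such a module could look like. The function names match the ones imported above; everything else (the _DEVICE variable, the _installed helper, the silent CPU fallback with no-op stand-ins) is an assumption made for illustration, not the project's actual code.

```python
"""device_utils.py: hypothetical sketch of the device abstraction layer."""
import importlib
import importlib.util
from types import SimpleNamespace

_DEVICE = "cpu"  # updated by init_npu(): "npu", "cuda" or "cpu"


def _installed(name):
    """True if a top-level package is importable (without importing it)."""
    return importlib.util.find_spec(name) is not None


def init_npu(device="auto"):
    """Select the backend; for NPU, run the three init steps shown above."""
    global _DEVICE
    if device == "npu" and not _installed("torch_npu"):
        raise RuntimeError("device 'npu' requested but torch_npu is not installed")
    if device in ("npu", "auto") and _installed("torch_npu"):
        torch_npu = importlib.import_module("torch_npu")
        # Importing this module patches .cuda() calls to go to the NPU.
        importlib.import_module("torch_npu.contrib.transfer_to_npu")
        torch_npu.npu.set_compile_mode(jit_compile=False)
        torch_npu.npu.config.allow_internal_format = False
        _DEVICE = "npu"
    elif device in ("cuda", "auto") and _installed("torch"):
        torch = importlib.import_module("torch")
        # Assumption: fall back to CPU quietly when no CUDA device is present.
        _DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    else:
        _DEVICE = "cpu"
    return _DEVICE


def device_module():
    """Return torch_npu.npu, torch.cuda, or a no-op stand-in on CPU."""
    if _DEVICE == "npu":
        return importlib.import_module("torch_npu").npu
    if _DEVICE == "cuda":
        return importlib.import_module("torch").cuda
    return SimpleNamespace(
        empty_cache=lambda: None,
        synchronize=lambda: None,
        is_available=lambda: False,
    )


def empty_cache():
    """Backend-neutral replacement for torch.cuda.empty_cache()."""
    device_module().empty_cache()


def synchronize():
    """Backend-neutral replacement for torch.cuda.synchronize()."""
    device_module().synchronize()
```

Resolving the module lazily inside device_module() keeps call sites free of if/else branches, which is the point of the abstraction layer.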

4. Change List

File	Changes	Count
device_utils.py	New: device abstraction layer	1 new file
worker.py	--device argument, init_npu(), empty_cache ×2, pass device through	8
core/processors/unified.py	bfloat16→float32 (NPU), .npu() instead of .cuda(), empty_cache()	4
MiniCPMO45/modeling_minicpmo_unified.py	empty_cache ×2, synchronize ×2, RNG state adapted for NPU	7
MiniCPMO45/modeling_minicpmo.py	empty_cache ×1, RNG state adapted for NPU	3
benchmark.py / precompile.py	--device argument, model loading adapted	5 each
start_all.sh	NPU/CUDA dual mode, 127.0.0.1, venv auto-detection	7

4.1 Model loading (unified.py)

# Original
model.bfloat16().eval().cuda()

# NPU path
if self.device == "npu":
    model.float().eval()              # float32 on Ascend; this port treats bf16 as unsupported
    import torch_npu.contrib.transfer_to_npu
    model.npu()                       # call .npu() directly
elif self.device == "cuda":
    model.bfloat16().eval().cuda()    # original logic unchanged

4.2 Worker entry point (worker.py)

parser.add_argument("--device", type=str, default="cuda",
                    choices=["cuda", "npu", "auto"])
args = parser.parse_args()
init_npu(args.device)  # ⚠️ must run before the model is loaded
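
The ordering constraint (initialize the device first, load the model second) is easy to break during refactoring, so it can help to keep both steps in one small function. The sketch below is illustrative only: build_parser and main are hypothetical names, the --port default is taken from the manual launch example in this article, and the init and load_model parameters are stand-ins for device_utils.init_npu and the real model loader.

```python
import argparse


def build_parser():
    """Argument parser with the --device switch described above."""
    parser = argparse.ArgumentParser(description="worker entry point (sketch)")
    parser.add_argument("--device", type=str, default="cuda",
                        choices=["cuda", "npu", "auto"])
    parser.add_argument("--port", type=int, default=22400)
    parser.add_argument("--worker-index", type=int, default=0)
    return parser


def main(argv=None, init=lambda device: device, load_model=lambda: "model"):
    """Parse arguments, initialize the device, and only then load the model."""
    args = build_parser().parse_args(argv)
    resolved = init(args.device)  # stand-in for init_npu(args.device)
    model = load_model()          # stand-in for the real model loading
    return resolved, model


print(main(["--device", "npu"]))  # ('npu', 'model')
```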

4.3 Replacing CUDA APIs in the model files

try:
    from device_utils import device_module
    _dm = device_module()  # the actual module (torch_npu.npu or torch.cuda)
except ImportError:
    _dm = torch.cuda       # fallback

# Every torch.cuda.* call becomes _dm.*
_dm.empty_cache()
_dm.synchronize()
_dm.is_available()
_dm.get_rng_state()        # falls back to the CPU RNG when NPU does not support it
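
The RNG fallback mentioned in the last comment could be structured as below. This is a sketch, not the project's code: the function names and the (source, state) tuple are assumptions, and the CPU getter and setter are passed in so the example runs without torch. In the real model files they would be torch.get_rng_state and torch.set_rng_state.

```python
from types import SimpleNamespace


def get_rng_state(dm, cpu_get_rng_state):
    """Return (source, state), where source is "device" or "cpu"."""
    getter = getattr(dm, "get_rng_state", None)
    if callable(getter):
        try:
            return "device", getter()
        except (RuntimeError, NotImplementedError):
            pass  # the backend exposes the call but cannot serve it
    return "cpu", cpu_get_rng_state()


def set_rng_state(dm, cpu_set_rng_state, tagged_state):
    """Restore a state saved by get_rng_state() to the generator it came from."""
    source, state = tagged_state
    if source == "device":
        dm.set_rng_state(state)
    else:
        cpu_set_rng_state(state)


# Demo with a fake backend that has no RNG support at all.
no_rng_backend = SimpleNamespace()
print(get_rng_state(no_rng_backend, lambda: "cpu-state"))  # ('cpu', 'cpu-state')
```

Tagging the saved state with its source avoids restoring a CPU state into the device generator, or the reverse, later on.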

4.4 start_all.sh: full changes

Device detection, selecting NPU or CUDA and enumerating the devices:

# ============ Detect devices ============
if [ "$DEVICE" = "npu" ]; then
    if [ -z "$ASCEND_RT_VISIBLE_DEVICES" ]; then
        # grep -c already prints 0 when nothing matches, so "|| echo 1" would yield two lines
        NUM_GPUS=$(npu-smi info -l 2>/dev/null | grep -c "NPU" || true)
        [ "${NUM_GPUS:-0}" -gt 0 ] || NUM_GPUS=1
        GPU_LIST=$(seq 0 $((NUM_GPUS - 1)) | tr '\n' ',' | sed 's/,$//')
    else
        GPU_LIST="$ASCEND_RT_VISIBLE_DEVICES"
        NUM_GPUS=$(echo "$GPU_LIST" | tr ',' '\n' | wc -l)
    fi
    DEVICE_FLAG="--device npu"
    DEVICE_ENV="ASCEND_RT_VISIBLE_DEVICES"
else
    # NVIDIA CUDA (original logic unchanged)
    ...
    DEVICE_FLAG=""
    DEVICE_ENV="CUDA_VISIBLE_DEVICES"
fi
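
For readers who prefer to see the list-or-count rule outside of shell, here is the same decision as a small Python sketch. visible_devices is a hypothetical helper, and detected_count stands in for whatever count the npu-smi query yields; neither is part of the project.

```python
import os


def visible_devices(env=None, detected_count=1):
    """Return the device IDs to launch workers on.

    If ASCEND_RT_VISIBLE_DEVICES is set, use its comma-separated list;
    otherwise use 0..detected_count-1, with at least one device.
    """
    env = os.environ if env is None else env
    raw = env.get("ASCEND_RT_VISIBLE_DEVICES", "").strip()
    if raw:
        return [int(item) for item in raw.split(",") if item.strip()]
    return list(range(max(detected_count, 1)))


print(visible_devices({"ASCEND_RT_VISIBLE_DEVICES": "4"}))  # [4]
print(visible_devices({}, detected_count=2))                # [0, 1]
```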

The Worker launch line now uses these variables:

nohup env $DEVICE_ENV=$GPU_ID PYTHONPATH=. $VENV_PYTHON worker.py \
    --port $WORKER_PORT --gpu-id $GPU_ID --worker-index $GPU_IDX \
    $DEVICE_FLAG \
    > "tmp/worker_${GPU_IDX}.log" 2>&1 &

Every localhost becomes 127.0.0.1, and the venv path is auto-detected (use .venv if it exists, otherwise the system Python).

5. How to Launch

One-shot script

DEVICE=npu ASCEND_RT_VISIBLE_DEVICES=4 \
SKIP_MOBILE_BUILD=1 SKIP_DOCS_BUILD=1 \
bash start_all.sh

Manual step-by-step launch (for debugging)

# Worker
ASCEND_RT_VISIBLE_DEVICES=4 PYTHONPATH=. python worker.py \
    --device npu --worker-index 0 --port 22400 &

# Gateway
PYTHONPATH=. python gateway.py --port 8006 --workers 127.0.0.1:22400 &

6. Pitfalls

6.1 transfer_to_npu is a module, not a function

# ❌ TypeError: 'module' object is not callable
from torch_npu.contrib import transfer_to_npu
model = transfer_to_npu(model)

# ✅ importing it applies a global monkey-patch; then call .npu() directly
import torch_npu.contrib.transfer_to_npu
model.npu()

6.2 device_module is a function, not a module

# ❌ AttributeError: 'function' object has no attribute 'empty_cache'
from device_utils import device_module as dm
dm.empty_cache()

# ✅ import the named helpers instead
from device_utils import empty_cache, synchronize
empty_cache()

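6.3 Device Map log falsely reports CPU

worker.py decides with "cuda" in str(device). On NPU the device string is npu:0, which does not contain cuda, so the log prints ⚠ CPU!. In fact all parameters are on the NPU, which npu-smi info confirms.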
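
One way to remove the false warning is to compare the device type against a set of known accelerator prefixes rather than searching for the substring cuda. The helper below is a sketch; device_label and ACCELERATOR_PREFIXES are hypothetical names, and the warning text simply mirrors the log message quoted above.

```python
ACCELERATOR_PREFIXES = ("cuda", "npu")


def device_label(device):
    """Return the device string, or a CPU warning for non-accelerator devices."""
    name = str(device)
    kind = name.split(":", 1)[0]  # "npu:0" -> "npu", "cpu" -> "cpu"
    if kind in ACCELERATOR_PREFIXES:
        return name
    return f"⚠ CPU! ({name})"


print(device_label("npu:0"))  # npu:0
print(device_label("cpu"))    # ⚠ CPU! (cpu)
```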

6.4 Use 127.0.0.1 for the Gateway, not localhost

On some servers /etc/hosts has no localhost → 127.0.0.1 mapping, so the Gateway cannot reach the Worker.
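
Whether a given machine is affected can be checked with the standard library before starting the services. This is a small diagnostic sketch; resolves_to_loopback is a hypothetical helper name.

```python
import ipaddress
import socket


def resolves_to_loopback(host):
    """True if `host` resolves to at least one loopback address."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False  # the name does not resolve at all
    for info in infos:
        # info[4] is the socket address; its first item is the IP string.
        address = info[4][0].split("%", 1)[0]  # drop any IPv6 scope id
        try:
            if ipaddress.ip_address(address).is_loopback:
                return True
        except ValueError:
            continue
    return False


print(resolves_to_loopback("127.0.0.1"))  # True
```

If resolves_to_loopback("localhost") returns False on a server, that server has the problem described above, and addressing workers by 127.0.0.1 sidesteps it.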

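6.5 Open the port in the security group

On a cloud server, port 8006 has to be opened both in the console security group and in the system firewall (iptables/firewalld).

7. torch.compile Is Not Usable on NPU

The official torch.compile speedup (full duplex from 0.9s to 0.5s on an A100) relies on Triton-generated CUDA kernels, which in this setup only target NVIDIA GPUs. The Ascend NPU has a different architecture and cannot run those Triton kernels, so precompile.py serves no purpose on NPU. Possible directions for speeding things up: model quantization, multi-device parallelism, and waiting for Huawei's operator optimizations.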

8. Verification

curl http://127.0.0.1:22400/health
# {"status":"healthy","model_loaded":true,"gpu_id":4}

curl -k https://127.0.0.1:8006/health
# {"status":"healthy"}
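
The same two checks can be scripted, for example to wait for the services after launch. The sketch below assumes only the response shapes shown in the curl output above; is_healthy and fetch_health are hypothetical helpers, and insecure=True plays the role of curl -k for a self-signed certificate.

```python
import json
import ssl
import urllib.request


def is_healthy(body):
    """True if a /health response body reports a healthy service."""
    try:
        data = json.loads(body)
    except (ValueError, TypeError):
        return False
    if not isinstance(data, dict):
        return False
    # The Gateway response has no model_loaded field, so default it to True.
    return data.get("status") == "healthy" and data.get("model_loaded", True) is True


def fetch_health(url, insecure=False, timeout=5.0):
    """GET `url` and evaluate the body; insecure=True skips TLS verification."""
    context = None
    if insecure and url.startswith("https://"):
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        return is_healthy(resp.read())


print(is_healthy('{"status":"healthy","model_loaded":true,"gpu_id":4}'))  # True
```

A typical call would be fetch_health("https://127.0.0.1:8006/health", insecure=True) for the Gateway and fetch_health("http://127.0.0.1:22400/health") for the Worker.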

Startup time (Ascend 910B, float32): model loading 18.8s + Unified initialization 8.5s = 27.3s in total.

Open https://<public-IP>:8006 in a browser to use the demo.

GitHub: OpenBMB/MiniCPM-o-Demo

Brief notes.

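9. Test Verification

After updating the model path, the three-step NPU initialization also has to be added at the top of each test file:

# at the top of tests/test_chat.py / test_streaming.py / test_duplex.py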
import torch_npu
import torch_npu.contrib.transfer_to_npu
torch_npu.npu.set_compile_mode(jit_compile=False)
torch_npu.npu.config.allow_internal_format = False
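
As an alternative to repeating these four lines in every test file, the initialization could live in the shared conftest.py. This is a sketch rather than the author's setup: it assumes conftest.py sits in the tests directory passed to pytest, and init_npu_if_available is a hypothetical helper. pytest_configure is a standard pytest hook that runs before test modules are collected.

```python
# tests/conftest.py (sketch)
import importlib
import importlib.util


def init_npu_if_available():
    """Run the three NPU init steps when torch_npu is installed."""
    if importlib.util.find_spec("torch_npu") is None:
        return False  # CUDA or CPU machine: nothing to patch
    torch_npu = importlib.import_module("torch_npu")
    importlib.import_module("torch_npu.contrib.transfer_to_npu")
    torch_npu.npu.set_compile_mode(jit_compile=False)
    torch_npu.npu.config.allow_internal_format = False
    return True


def pytest_configure(config):
    """pytest hook: initialize the device once, before tests are collected."""
    init_npu_if_available()
```

Keeping the guard on find_spec lets the same test suite run unchanged on a CUDA machine.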

Set the model path in conftest.py to the actual path, then run:

python -m pytest tests/test_chat.py tests/test_streaming.py tests/test_duplex.py -v

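Test results (14 passed / 11 failed)

Module	Passed	Failed	Failure cause
Chat	6/8	2	Missing ref_audio asset, invalid image path (test fixture issues)
Streaming	7/10	3	Missing ref_audio asset, multi-turn KV cache memory edge case
Duplex	0/5	5	All missing ref_audio asset files
Total	14	11	10 missing test assets, 1 NPU precision edge case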

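Typical passing cases:

✅ simple_chat: 1+1 equals 2
✅ multi_turn: 42 × 2 = 84
✅ audio_understanding: audio repeated back correctly
✅ image_understanding: Plants vs. Zombies game screenshot
✅ greedy_decoding: the sky is blue
✅ long_response: self-introduction, >50 characters
✅ streaming text/audio: streaming output works
✅ complete_turn multi-turn: KV cache reused across turns correctly
✅ session switch/reset: state isolation works

All failures but one come from wav and image asset files missing in the test environment and are unrelated to the NPU migration. The one substantive edge case is the multi-turn KV cache memory test ("My name is Xiao Ming" → "What is my name?"): the model failed to recall the name, possibly because of float32 precision differences. This is left for later tuning.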
