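- 1. Background
- 2. Offline installation of k3s
- 2.1 Downloading the offline files
- 2.2 Deploying to the offline node
- 2.3 Mainland China mirror (online-install alternative)
- 2.4 Custom data directory
- 3. Master nodes: pre-installation preparation
- 3.1 Node labels and namespaces
- 3.2 Creating the user and log directories (run on every Master node)
- 3.3 Importing images (k3s containerd)
- 4. Master nodes: installing components
- 4.1 Installing Volcano
- 4.2 Installing ClusterD
- 4.3 Installing Ascend Operator
- 4.4 Installing Infer Operator
- 4.5 Verification
- 5. Worker nodes: pre-installation preparation
- 5.1 Labels
- 5.2 User and log directories (run on every Worker node)
- 5.3 Importing images
- 5.4 Configuring the Ascend containerd runtime
- 6. Worker nodes: installing components
- 6.1 Installing NodeD
- 6.2 Installing NPU-Exporter
- 6.3 Installing Ascend Device Plugin
- 6.4 Verification
- 7. Pitfalls
- 7.1 ErrImageNeverPull: image naming problem
- 7.2 k3s ctr is isolated from the native ctr
- 7.3 DaemonSet node selector mismatch
- 7.4 imagePullPolicy differences in the ClusterD YAML
- 8. Summary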
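1. Background
This post walks through deploying MindCluster v26.0.0 (formerly MindX DL) on k3s v1.35.5 on Huawei Ascend aarch64 servers. The servers cannot reach GitHub directly, so the whole process runs offline. k3s uses its bundled containerd as the container runtime, which is configured with the Huawei Ascend Docker Runtime so that NPU devices can be scheduled.
Environment:
- OS: Ubuntu aarch64
- K3s: v1.35.5+k3s1
- MindCluster: v26.0.0
- Ascend runtime: Ascend-Docker-Runtime
References: [MindCluster download](https://www.hiascend.com/developer/software/mindcluster/download?versionId=467&ids=55%2C103%2C26958bcc909e4cd48fa56d4c4a43ebec%2C58%2C60%2C64) | [MindCluster documentation](https://www.hiascend.com/developer/software/mindcluster/document)
2. Offline installation of k3s
Because the servers cannot reach GitHub, the required files are downloaded through the gh-proxy proxy and then deployed offline.
2.1 Downloading the offline files
# Download through the GitHub acceleration proxy.
# Only one of the two airgap image archives is needed; section 2.2 uses the .tar.zst one.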
wget https://gh-proxy.org/https://github.com/k3s-io/k3s/releases/download/v1.35.5%2Bk3s1/k3s-arm64
wget https://gh-proxy.org/https://github.com/k3s-io/k3s/releases/download/v1.35.5%2Bk3s1/k3s-airgap-images-arm64.tar
wget https://gh-proxy.org/https://github.com/k3s-io/k3s/releases/download/v1.35.5%2Bk3s1/k3s-airgap-images-arm64.tar.zst
wget https://gh-proxy.org/https://github.com/k3s-io/k3s/releases/download/v1.35.5%2Bk3s1/sha256sum-arm64.txt
wget https://gh-proxy.org/https://github.com/k3s-io/k3s/releases/download/v1.35.5%2Bk3s1/k3s-images.txt
curl -Lo install.sh https://get.k3s.io
If docker is installed, the images can also be pulled individually:
while read -r image; do
    docker pull "$image"
done < k3s-images.txt
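The download list includes `sha256sum-arm64.txt`, so the files can be checked before they are carried to the offline node. The following is a minimal sketch, assuming the checksum file uses the usual `<hash>  <filename>` layout written by `sha256sum`; the function name `verify_download` is hypothetical and not part of k3s.

```shell
# Compare a file's SHA-256 digest with the entry recorded in a checksum list.
# Usage: verify_download <checksum-file> <file>
verify_download() {
    local sums="$1" file="$2" name expected actual
    name=$(basename "$file")
    # Look up the expected digest; tolerate the "*" binary-mode marker.
    expected=$(awk -v n="$name" '$2 == n || $2 == ("*" n) { print $1; exit }' "$sums")
    if [ -z "$expected" ]; then
        echo "no checksum entry for $name" >&2
        return 2
    fi
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    if [ "$expected" = "$actual" ]; then
        echo "$name: OK"
    else
        echo "$name: MISMATCH" >&2
        return 1
    fi
}
```

For example, `verify_download sha256sum-arm64.txt k3s-arm64` prints `k3s-arm64: OK` when the digest matches and returns a non-zero status otherwise.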
2.2 Deploying to the offline node
After copying the downloaded files to the target server:
# Install the binary
sudo cp k3s-arm64 /usr/local/bin/k3s
sudo chmod +x /usr/local/bin/k3s
# Stage the airgap image archive
sudo mkdir -p /var/lib/rancher/k3s/agent/images/
sudo cp k3s-airgap-images-arm64.tar.zst /var/lib/rancher/k3s/agent/images/
# Offline install (skip the online download)
chmod +x install.sh
sudo INSTALL_K3S_SKIP_DOWNLOAD=true ./install.sh
2.3 Mainland China mirror (online-install alternative)
curl -sfL https://rancher-mirror.rancher.cn/k3s/k3s-install.sh | INSTALL_K3S_MIRROR=cn sh -
2.4 Custom data directory
The default data directory is `/var/lib/rancher/k3s`. To change it, edit the systemd unit file after installation:
sudo vim /etc/systemd/system/k3s.service
# Append to the ExecStart line: --data-dir /your/new/path
sudo systemctl daemon-reload
sudo systemctl restart k3s
Note: the filesystem that holds the new path must support `d_type` (for XFS, format with `-n ftype=1`). Stop k3s and back up the data before migrating.
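Changing the data directory also moves the place where the airgap image archive has to be staged, since the `agent/images` directory used in section 2.2 sits under the data directory. A small sketch of that path derivation follows; the helper name `k3s_images_dir` and the example path `/data/k3s` are assumptions made for illustration.

```shell
# Derive the airgap image directory for a given k3s data directory.
# With no argument, the default data directory is assumed.
k3s_images_dir() {
    local data_dir="${1:-/var/lib/rancher/k3s}"
    # Strip a trailing slash so the result contains no "//".
    data_dir="${data_dir%/}"
    printf '%s/agent/images\n' "$data_dir"
}

# Example (not executed here): stage the archive under a custom data directory.
#   sudo mkdir -p "$(k3s_images_dir /data/k3s)"
#   sudo cp k3s-airgap-images-arm64.tar.zst "$(k3s_images_dir /data/k3s)/"
```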
3. Master nodes: pre-installation preparation
3.1 Node labels and namespaces
# Label the Master nodes (covers both the new and the legacy Kubernetes role label)
kubectl label nodes -l node-role.kubernetes.io/control-plane masterselector=dls-master-node --overwrite
kubectl label nodes -l node-role.kubernetes.io/master masterselector=dls-master-node --overwrite
# Create the namespaces
kubectl create ns mindx-dl
kubectl create ns cluster-system
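`kubectl create ns` reports an error when the namespace already exists, which is inconvenient when the preparation steps are re-run. One common workaround is to render the manifest client-side and pipe it to `kubectl apply`. The wrapper below is a sketch; the function name `ensure_namespace` and the `KUBECTL` override variable are assumptions, not part of MindCluster or k3s.

```shell
# Create a namespace if it is missing, so the step can be re-run safely.
# KUBECTL may be overridden, for example KUBECTL="k3s kubectl".
ensure_namespace() {
    local ns="$1"
    # Render the manifest without contacting the cluster, then apply it.
    ${KUBECTL:-kubectl} create namespace "$ns" --dry-run=client -o yaml \
        | ${KUBECTL:-kubectl} apply -f -
}

# Example (not executed here):
#   ensure_namespace mindx-dl
#   ensure_namespace cluster-system
```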
3.2 Creating the user and log directories (run on every Master node)
# Create the hwMindX user (Ubuntu)
useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX
usermod -a -G HwHiAiUser hwMindX
# Log directories
mkdir -m 755 /var/log/mindx-dl
chown root:root /var/log/mindx-dl
for dir in ascend-operator infer-operator clusterd volcano-controller volcano-scheduler; do
    mkdir -m 750 /var/log/mindx-dl/$dir
    chown hwMindX:hwMindX /var/log/mindx-dl/$dir
done
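The same directory setup is needed on every node, and plain `mkdir` stops with an error once a directory exists. A re-runnable variant could look like the sketch below; `make_log_dirs` is a hypothetical helper, and passing an empty owner skips the `chown` step (useful when trying it without root).

```shell
# Create <base>/<name> for each name with mode 750, tolerating existing directories.
# Usage: make_log_dirs <base> <owner or ""> <name>...
make_log_dirs() {
    local base="$1" owner="$2" dir
    shift 2
    mkdir -p "$base" && chmod 755 "$base" || return 1
    for dir in "$@"; do
        mkdir -p "$base/$dir" && chmod 750 "$base/$dir" || return 1
        # Changing ownership needs root; skipped when no owner is given.
        if [ -n "$owner" ]; then
            chown "$owner" "$base/$dir" || return 1
        fi
    done
}

# Example (not executed here), matching the Master-node list above:
#   make_log_dirs /var/log/mindx-dl hwMindX:hwMindX \
#       ascend-operator infer-operator clusterd volcano-controller volcano-scheduler
```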
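3.3 Importing images (k3s containerd)
k3s uses its built-in containerd rather than docker, so images must be imported into the `k8s.io` namespace. This step caused several problems, which are described separately in section 7.
# Pull the images from Huawei Cloud SWR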
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-scheduler:v1.9.0-v26.0.0
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-controller-manager:v1.9.0-v26.0.0
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/clusterd:v26.0.0
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-operator:v26.0.0
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/infer-operator:v26.0.0
# Key step: tag the images under the docker.io/library/ prefix, otherwise imagePullPolicy: Never fails
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-scheduler:v1.9.0-v26.0.0 docker.io/library/volcanosh/vc-scheduler:v1.9.0
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-controller-manager:v1.9.0-v26.0.0 docker.io/library/volcanosh/vc-controller-manager:v1.9.0
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/clusterd:v26.0.0 docker.io/library/clusterd:v26.0.0
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-operator:v26.0.0 docker.io/library/ascend-operator:v26.0.0
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/infer-operator:v26.0.0 docker.io/library/infer-operator:v26.0.0
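Since a missing or misspelled tag only shows up later as a failing Pod, it can help to check the image list right after tagging. The sketch below reads the output of `k3s ctr images ls -q` (one reference per line) and prints every expected reference that is absent; the function name `missing_images` is hypothetical.

```shell
# Read image references (one per line) from stdin and print every expected
# reference that is not in the list. Returns non-zero if anything is missing.
# Usage: k3s ctr images ls -q | missing_images <ref>...
missing_images() {
    local present ref status=0
    present=$(cat)
    for ref in "$@"; do
        # -x matches whole lines; -F disables regex so dots stay literal.
        if ! printf '%s\n' "$present" | grep -qxF -- "$ref"; then
            echo "$ref"
            status=1
        fi
    done
    return $status
}

# Example (not executed here):
#   k3s ctr images ls -q | missing_images \
#       docker.io/library/clusterd:v26.0.0 \
#       docker.io/library/ascend-operator:v26.0.0 \
#       docker.io/library/infer-operator:v26.0.0
```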
4. Master nodes: installing components
4.1 Installing Volcano
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v26.0.0/Ascend-mindxdl-volcano_26.0.0_linux-aarch64.zip
unzip Ascend-mindxdl-volcano_26.0.0_linux-aarch64.zip
cd volcano-v1.9.0
kubectl apply -f volcano-v1.9.0.yaml
4.2 Installing ClusterD
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v26.0.0/Ascend-mindxdl-clusterd_26.0.0_linux-aarch64.zip
unzip Ascend-mindxdl-clusterd_26.0.0_linux-aarch64.zip
kubectl apply -f clusterd-v26.0.0.yaml
4.3 Installing Ascend Operator
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v26.0.0/Ascend-mindxdl-ascend-operator_26.0.0_linux-aarch64.zip
unzip Ascend-mindxdl-ascend-operator_26.0.0_linux-aarch64.zip
kubectl apply -f ascend-operator-v26.0.0.yaml
4.4 Installing Infer Operator
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v26.0.0/Ascend-mindxdl-infer-operator_26.0.0_linux-aarch64.zip
unzip Ascend-mindxdl-infer-operator_26.0.0_linux-aarch64.zip
kubectl apply -f infer-operator-v26.0.0.yaml
4.5 Verification
kubectl get pod -A
# Confirm that volcano (2 pods), clusterd (1), ascend-operator (1) and infer-operator (1) are all Running
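On a cluster with many Pods, scanning the full `kubectl get pod -A` table by eye is error-prone. A small filter that lists only the Pods that are not healthy might look like this; it is a sketch that assumes the default column layout of `kubectl get pod -A` (NAMESPACE, NAME, READY, STATUS, ...), and `pods_not_ready` is a hypothetical name.

```shell
# Read `kubectl get pod -A` output from stdin and print "namespace/name status"
# for every Pod whose STATUS is neither Running nor Completed.
pods_not_ready() {
    awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Example (not executed here):
#   kubectl get pod -A | pods_not_ready
```

An empty result means every Pod in the table is Running or Completed.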
5. Worker nodes: pre-installation preparation
5.1 Labels
# Master nodes that also act as Workers (one command per role label; repeating -l in a single command keeps only the last selector)
kubectl label nodes -l node-role.kubernetes.io/control-plane workerselector=dls-worker-node --overwrite
kubectl label nodes -l node-role.kubernetes.io/master workerselector=dls-worker-node --overwrite
# Dedicated Worker nodes
kubectl label nodes -l '!node-role.kubernetes.io/control-plane,!node-role.kubernetes.io/master' workerselector=dls-worker-node --overwrite
# Chip-type label (choose according to the chip name reported by npu-smi info)
kubectl label nodes -l workerselector=dls-worker-node accelerator=huawei-Ascend910 --overwrite
5.2 User and log directories (run on every Worker node)
useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX
usermod -a -G HwHiAiUser hwMindX
mkdir -m 755 /var/log/mindx-dl
chown root:root /var/log/mindx-dl
for dir in devicePlugin npu-exporter noded; do
    mkdir -m 750 /var/log/mindx-dl/$dir
    chown root:root /var/log/mindx-dl/$dir
done
5.3 Importing images
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/noded:v26.0.0
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/npu-exporter:v26.0.0
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v26.0.0
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/noded:v26.0.0 docker.io/library/noded:v26.0.0
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/npu-exporter:v26.0.0 docker.io/library/npu-exporter:v26.0.0
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v26.0.0 docker.io/library/ascend-k8sdeviceplugin:v26.0.0
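5.4 Configuring the Ascend containerd runtime
Edit the k3s containerd configuration file; the default path is `/var/lib/rancher/k3s/agent/etc/containerd/config.toml` (or the corresponding path under the directory given by `--data-dir`). k3s regenerates this file on every start, so persistent changes belong in a template file in the same directory (`config.toml.tmpl`; the exact template name depends on the containerd version bundled with the k3s release, so check the k3s documentation).
Key settings: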
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "ascend"
# The runtime table name must match default_runtime_name
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.ascend]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.ascend.options]
BinaryName = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"
After the change, restart k3s:
sudo systemctl restart k3s
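k3s renders its containerd configuration itself at startup, so after the restart it is worth confirming that the Ascend runtime entry is still present in the active file. The check below is a sketch; `ascend_runtime_configured` is a hypothetical helper that simply searches the file for the runtime binary name used above.

```shell
# Succeed when the given containerd config references the Ascend runtime binary.
# The default is the k3s path named above; pass another path if --data-dir was changed.
ascend_runtime_configured() {
    local cfg="${1:-/var/lib/rancher/k3s/agent/etc/containerd/config.toml}"
    grep -q 'ascend-docker-runtime' "$cfg" 2>/dev/null
}

# Example (not executed here):
#   ascend_runtime_configured || echo "Ascend runtime entry missing after restart" >&2
```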
6. Worker nodes: installing components
6.1 Installing NodeD
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v26.0.0/Ascend-mindxdl-noded_26.0.0_linux-aarch64.zip
unzip Ascend-mindxdl-noded_26.0.0_linux-aarch64.zip
kubectl apply -f noded-v26.0.0.yaml
6.2 Installing NPU-Exporter
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v26.0.0/Ascend-mindxdl-npu-exporter_26.0.0_linux-aarch64.zip
unzip Ascend-mindxdl-npu-exporter_26.0.0_linux-aarch64.zip
kubectl apply -f npu-exporter-26.0.0.yaml
6.3 Installing Ascend Device Plugin
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v26.0.0/Ascend-mindxdl-device-plugin_26.0.0_linux-aarch64.zip
unzip Ascend-mindxdl-device-plugin_26.0.0_linux-aarch64.zip
kubectl apply -f device-plugin-npu-volcano-v26.0.0.yaml
kubectl apply -f device-plugin-volcano-v26.0.0.yaml
kubectl apply -f device-plugin-310P-volcano-v26.0.0.yaml
6.4 Verification
kubectl get pod -A -o wide | grep noded
kubectl get pod -A -o wide | grep npu-exporter
kubectl get pod -A -o wide | grep ascend-device-plugin
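The `grep` checks above print nothing when a DaemonSet has scheduled no Pods at all, which is easy to misread. Looking at the DaemonSets themselves makes that case visible. The filter below is a sketch that assumes the default column layout of `kubectl get ds -A` (NAMESPACE, NAME, DESIRED, CURRENT, READY, ...); `daemonsets_not_ready` is a hypothetical name.

```shell
# Read `kubectl get ds -A` output from stdin and report DaemonSets that
# schedule no Pods (DESIRED = 0) or have fewer READY Pods than DESIRED.
daemonsets_not_ready() {
    awk 'NR > 1 {
        if ($3 == 0) {
            print $1 "/" $2 ": no matching nodes"
        } else if ($5 < $3) {
            print $1 "/" $2 ": " $5 "/" $3 " ready"
        }
    }'
}

# Example (not executed here):
#   kubectl get ds -A | daemonsets_not_ready
```

A "no matching nodes" line usually points at a `nodeSelector` that no node label satisfies.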
7. Pitfalls
7.1 ErrImageNeverPull: image naming problem
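Pods for ClusterD and other components failed with:
Warning ErrImageNeverPull Container image "clusterd:v26.0.0" is not present with pull policy of Never
The MindCluster YAML files set `imagePullPolicy: Never`, and the image lookup in the `k8s.io` namespace of k3s containerd matches on the fully qualified name: a bare name such as `clusterd:v26.0.0` is expanded to `docker.io/library/clusterd:v26.0.0`. Tagging the image as `docker.io/clusterd:v26.0.0` is therefore not enough; it must be tagged as `docker.io/library/clusterd:v26.0.0` to be found.
By contrast, the Volcano YAML uses `imagePullPolicy: IfNotPresent`, so it can still work when the local tag does not match exactly, because the kubelet falls back to pulling the image if a registry is reachable. Most MindCluster components use `Never`, so their tags must match exactly.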
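The expansion rule can be written down as a small function, which makes it easier to work out the tag that `k3s ctr images tag` has to create for a given image name in a YAML file. This is a simplified sketch of Docker-style reference normalization (digests and a missing tag are not handled), and `normalize_image_ref` is a hypothetical name.

```shell
# Expand a short image name to its fully qualified form (simplified).
normalize_image_ref() {
    local ref="$1" first rest
    case "$ref" in
        */*) ;;
        *)
            # No slash at all: an official-image short name.
            echo "docker.io/library/$ref"
            return
            ;;
    esac
    first="${ref%%/*}"
    rest="${ref#*/}"
    case "$first" in
        docker.io)
            # docker.io/<name> still needs the implicit "library" namespace.
            case "$rest" in
                */*) echo "$ref" ;;
                *)   echo "docker.io/library/$rest" ;;
            esac
            ;;
        *.*|*:*|localhost)
            # First component looks like a registry host: already qualified.
            echo "$ref"
            ;;
        *)
            # "user/name" form: only the default registry is prepended.
            echo "docker.io/$ref"
            ;;
    esac
}
```

Running it on the image names found in each YAML file gives the exact reference to use as the tag target; note that a two-part name such as `volcanosh/vc-scheduler:v1.9.0` expands without `library/`.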
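7.2 k3s ctr is isolated from the native ctr
The `k3s ctr` bundled with k3s and the system `ctr` are two independent tools that talk to different containerd instances, so their image stores are completely isolated:
- `k3s ctr`: operates on the containerd dedicated to k3s; images live in the `k8s.io` namespace
- `ctr`: operates on the system's native containerd; uses the `default` namespace by default
Images must be pulled or imported with `k3s ctr`; images handled with the native `ctr` are invisible to k3s. To export an image from docker and import it into k3s:
# Export from docker
docker save ubuntu:22.04 -o ubuntu-22.04.tar
# Import into k3s containerd (the target namespace must be k8s.io)
k3s ctr -n k8s.io images import ubuntu-22.04.tar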
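7.3 DaemonSet node selector mismatch
After NodeD and the Device Plugin were deployed, their Pod count stayed at 0 because the DaemonSet `nodeSelector` did not match the node labels:
- NodeD requires `workerselector=dls-worker-node`
- Device Plugin requires `accelerator=huawei-npu` or `accelerator=huawei-Ascend910`
On a single node that is both Master and Worker, both kinds of label must be applied: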
kubectl label node <node-name> workerselector=dls-worker-node --overwrite
kubectl label node <node-name> accelerator=huawei-Ascend910 --overwrite
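To see at a glance which nodes still lack one of these labels, the node list can be printed with one column per label and filtered. The sketch below relies on `kubectl` printing `<none>` in a custom column whose field is absent; the column names and the function name `nodes_missing_labels` are illustrative assumptions.

```shell
# Read a three-column table (node name, workerselector, accelerator) from stdin
# and print the nodes where either label is missing ("<none>").
nodes_missing_labels() {
    awk 'NR > 1 && ($2 == "<none>" || $3 == "<none>") { print $1 }'
}

# Example (not executed here):
#   kubectl get nodes \
#     -o custom-columns='NAME:.metadata.name,WORKER:.metadata.labels.workerselector,ACCEL:.metadata.labels.accelerator' \
#     | nodes_missing_labels
```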
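7.4 imagePullPolicy differences in the ClusterD YAML
The ClusterD YAML defaults to `imagePullPolicy: Never`, whereas other components (Volcano, for example) use `IfNotPresent`. If the image tag does not match exactly, either change the policy in the YAML or make sure the tag path is identical.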
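If changing the policy is the preferred route, the manifest can be patched on the fly rather than edited by hand. The following is a sketch using `sed`; `relax_pull_policy` is a hypothetical helper, and on a node with no reachable registry `IfNotPresent` will still fail when the tag is absent, so fixing the tag remains the more reliable option.

```shell
# Rewrite "imagePullPolicy: Never" to "IfNotPresent" in a manifest read from stdin.
# Indentation and all other lines are left untouched.
relax_pull_policy() {
    sed 's/^\([[:space:]]*imagePullPolicy:[[:space:]]*\)Never[[:space:]]*$/\1IfNotPresent/'
}

# Example (not executed here):
#   relax_pull_policy < clusterd-v26.0.0.yaml | kubectl apply -f -
```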
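8. Summary
Across the whole deployment, the main difficulty lies in how k3s containerd manages images: namespace isolation, the `docker.io/library/` prefix requirement, and the differences from native docker and ctr. The official MindCluster documentation targets upstream Kubernetes with the Docker runtime, so a k3s environment needs some extra adaptation.
Recommended order: make sure k3s runs normally → configure the Ascend runtime → apply labels → import images and verify the tags → deploy the Master components and then the Worker components, verifying each step with `kubectl get pod -A` before moving on.
A brief record for future reference.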







