Notes on Offline Deployment of Huawei Ascend MindCluster v26.0.0 on k3s 1.35.5


1. Background

This post covers deploying MindCluster v26.0.0 (formerly MindX DL) on k3s 1.35.5 on a Huawei Ascend aarch64 server. The server cannot reach GitHub directly, so the installation is done offline throughout. k3s uses its bundled containerd as the container runtime, with the Huawei Ascend Docker Runtime configured on top so that NPU devices can be scheduled.

Environment:

  • OS: Ubuntu aarch64
  • K3s: v1.35.5+k3s1
  • MindCluster: v26.0.0
  • Ascend runtime: Ascend-Docker-Runtime

References: [MindCluster download](https://www.hiascend.com/developer/software/mindcluster/download?versionId=467&ids=55%2C103%2C26958bcc909e4cd48fa56d4c4a43ebec%2C58%2C60%2C64) | [MindCluster documentation](https://www.hiascend.com/developer/software/mindcluster/document)

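2. Offline k3s installation

Because the server cannot reach GitHub, the required files are downloaded through the gh-proxy mirror and then deployed offline.

2.1 Downloading the offline files

# Download through the GitHub acceleration proxy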
wget https://gh-proxy.org/https://github.com/k3s-io/k3s/releases/download/v1.35.5%2Bk3s1/k3s-arm64
wget https://gh-proxy.org/https://github.com/k3s-io/k3s/releases/download/v1.35.5%2Bk3s1/k3s-airgap-images-arm64.tar
wget https://gh-proxy.org/https://github.com/k3s-io/k3s/releases/download/v1.35.5%2Bk3s1/k3s-airgap-images-arm64.tar.zst
wget https://gh-proxy.org/https://github.com/k3s-io/k3s/releases/download/v1.35.5%2Bk3s1/sha256sum-arm64.txt
wget https://gh-proxy.org/https://github.com/k3s-io/k3s/releases/download/v1.35.5%2Bk3s1/k3s-images.txt
curl -Lo install.sh https://get.k3s.io
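
Before copying the files to the offline node, it is worth checking them against the downloaded checksum list, since a proxy sits in the download path. The following is a minimal sketch, assuming GNU coreutils and assuming the file names listed in `sha256sum-arm64.txt` match the downloaded names; `verify_downloads` is a helper name made up for this example.

```shell
# verify_downloads DIR SUMFILE
# Checks every file in DIR that is listed in SUMFILE. Entries whose file was
# not downloaded are skipped (--ignore-missing is a GNU coreutils option).
# Returns non-zero if any present file fails its checksum.
verify_downloads() {
  dir=$1
  sumfile=$2
  ( cd "$dir" && sha256sum --check --ignore-missing "$sumfile" )
}
```

Typical use from the download directory: `verify_downloads . sha256sum-arm64.txt`.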

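If docker is installed, the images can also be pulled individually:

# Read one image reference per line; skip blank lines
while IFS= read -r image; do
  [ -z "$image" ] || docker pull "$image"
done < k3s-images.txt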

2.2 Deploying to the offline node

After copying the downloaded files to the target server:

# Install the binary
sudo cp k3s-arm64 /usr/local/bin/k3s
sudo chmod +x /usr/local/bin/k3s

# Place the air-gap image bundle
sudo mkdir -p /var/lib/rancher/k3s/agent/images/
sudo cp k3s-airgap-images-arm64.tar.zst /var/lib/rancher/k3s/agent/images/

# Offline install (skip the online download)
chmod +x install.sh
sudo INSTALL_K3S_SKIP_DOWNLOAD=true ./install.sh

2.3 China mirror (alternative for online installation)

curl -sfL https://rancher-mirror.rancher.cn/k3s/k3s-install.sh | INSTALL_K3S_MIRROR=cn sh -

2.4 Custom data directory

The default data directory is `/var/lib/rancher/k3s`. To change it, edit the systemd unit file after installation:

sudo vim /etc/systemd/system/k3s.service
# Append to the ExecStart command: --data-dir /your/new/path
sudo systemctl daemon-reload
sudo systemctl restart k3s
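
As an alternative to editing the unit file, k3s also reads options from `/etc/rancher/k3s/config.yaml`, where the keys mirror the command-line flags. The sketch below writes a `data-dir` entry into such a file. `write_k3s_config` is a helper name invented here, `/data/k3s` in the usage line is a placeholder path, and the function overwrites an existing file, so treat it as an illustration rather than a drop-in tool.

```shell
# write_k3s_config FILE DATA_DIR
# Writes a k3s config file containing only the data-dir option.
# Caution: replaces FILE if it already exists.
write_k3s_config() {
  file=$1
  data_dir=$2
  mkdir -p "$(dirname "$file")"
  printf 'data-dir: "%s"\n' "$data_dir" > "$file"
}

# Example (run as root, then restart k3s):
#   write_k3s_config /etc/rancher/k3s/config.yaml /data/k3s
#   systemctl restart k3s
```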

Note: the file system that holds the new path must support `d_type` (for XFS, format with `-n ftype=1`). Stop k3s and back up the data before migrating.
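
On an existing XFS volume the `ftype` setting can be read from `xfs_info` before moving any data. A small sketch, assuming `xfs_info` (from xfsprogs) and GNU `df` are available; `xfs_has_ftype` is a name chosen for this example and simply scans the `xfs_info` output passed on stdin.

```shell
# xfs_has_ftype
# Reads `xfs_info` output on stdin; succeeds only if it reports ftype=1.
xfs_has_ftype() {
  grep -q 'ftype=1'
}

# Example (resolve the mount point of the new path first):
#   mnt=$(df --output=target /your/new/path | tail -n 1)
#   xfs_info "$mnt" | xfs_has_ftype && echo "d_type supported"
```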

3. Master node: preparation

3.1 Node labels and namespaces

# Label the master nodes (covers both the new and the old k8s role label names)
kubectl label nodes -l node-role.kubernetes.io/control-plane masterselector=dls-master-node --overwrite
kubectl label nodes -l node-role.kubernetes.io/master masterselector=dls-master-node --overwrite

# Create the namespaces
kubectl create ns mindx-dl
kubectl create ns cluster-system

3.2 Creating the user and log directories (run on every master node)

# Create the hwMindX user (Ubuntu); run as root
useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX
usermod -a -G HwHiAiUser hwMindX

# Log directories
mkdir -m 755 /var/log/mindx-dl
chown root:root /var/log/mindx-dl

for dir in ascend-operator infer-operator clusterd volcano-controller volcano-scheduler; do
  mkdir -m 750 /var/log/mindx-dl/$dir
  chown hwMindX:hwMindX /var/log/mindx-dl/$dir
done
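
Wrong ownership or mode on these directories tends to show up only later, as a component that cannot write its log. The check below is a sketch that compares a directory against an expected mode and owner; it assumes GNU `stat`, and `check_dir` is a helper name introduced for this example.

```shell
# check_dir DIR MODE OWNER:GROUP
# Succeeds only if DIR has exactly the given octal mode and owner:group.
check_dir() {
  actual=$(stat -c '%a %U:%G' "$1") || return 1
  if [ "$actual" != "$2 $3" ]; then
    echo "$1: expected $2 $3, got $actual" >&2
    return 1
  fi
}

# Example for the master-side directories created above:
#   for dir in ascend-operator infer-operator clusterd volcano-controller volcano-scheduler; do
#     check_dir "/var/log/mindx-dl/$dir" 750 hwMindX:hwMindX
#   done
```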

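3.3 Importing images (k3s containerd)

k3s uses its built-in containerd rather than docker, so images have to be in containerd's k8s.io namespace. This step has several pitfalls, which are described separately further below.

# Pull the images from Huawei Cloud SWR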
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-scheduler:v1.9.0-v26.0.0
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-controller-manager:v1.9.0-v26.0.0
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/clusterd:v26.0.0
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-operator:v26.0.0
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/infer-operator:v26.0.0

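# Key step: tag with the docker.io/library/ prefix, otherwise imagePullPolicy: Never fails.
# (A name that already carries an org prefix, such as volcanosh/vc-scheduler, normalizes to docker.io/volcanosh/...; check the image field in the YAML.)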
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-scheduler:v1.9.0-v26.0.0 docker.io/library/volcanosh/vc-scheduler:v1.9.0
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-controller-manager:v1.9.0-v26.0.0 docker.io/library/volcanosh/vc-controller-manager:v1.9.0
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/clusterd:v26.0.0 docker.io/library/clusterd:v26.0.0
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-operator:v26.0.0 docker.io/library/ascend-operator:v26.0.0
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/infer-operator:v26.0.0 docker.io/library/infer-operator:v26.0.0
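
After tagging, it helps to confirm that every reference the manifests will ask for is actually present before applying anything. The sketch below compares a list of expected references against the output of `k3s ctr images ls -q` (one reference per line) and prints the ones that are missing; `missing_images` is a helper name made up for this example.

```shell
# missing_images REF...
# Reads an image listing (one reference per line) on stdin, prints every REF
# that is not in the listing, and returns non-zero if any REF is missing.
missing_images() {
  present=$(cat)
  status=0
  for ref in "$@"; do
    if ! printf '%s\n' "$present" | grep -Fxq -- "$ref"; then
      echo "$ref"
      status=1
    fi
  done
  return "$status"
}

# Example:
#   k3s ctr images ls -q | missing_images \
#     docker.io/library/clusterd:v26.0.0 \
#     docker.io/library/ascend-operator:v26.0.0 \
#     docker.io/library/infer-operator:v26.0.0
```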

4. Master node: installing components

4.1 Installing Volcano

wget https://gitcode.com/Ascend/mind-cluster/releases/download/v26.0.0/Ascend-mindxdl-volcano_26.0.0_linux-aarch64.zip
unzip Ascend-mindxdl-volcano_26.0.0_linux-aarch64.zip
cd volcano-v1.9.0
kubectl apply -f volcano-v1.9.0.yaml

4.2 Installing ClusterD

wget https://gitcode.com/Ascend/mind-cluster/releases/download/v26.0.0/Ascend-mindxdl-clusterd_26.0.0_linux-aarch64.zip
unzip Ascend-mindxdl-clusterd_26.0.0_linux-aarch64.zip
kubectl apply -f clusterd-v26.0.0.yaml

4.3 Installing Ascend Operator

wget https://gitcode.com/Ascend/mind-cluster/releases/download/v26.0.0/Ascend-mindxdl-ascend-operator_26.0.0_linux-aarch64.zip
unzip Ascend-mindxdl-ascend-operator_26.0.0_linux-aarch64.zip
kubectl apply -f ascend-operator-v26.0.0.yaml

4.4 Installing Infer Operator

wget https://gitcode.com/Ascend/mind-cluster/releases/download/v26.0.0/Ascend-mindxdl-infer-operator_26.0.0_linux-aarch64.zip
unzip Ascend-mindxdl-infer-operator_26.0.0_linux-aarch64.zip
kubectl apply -f infer-operator-v26.0.0.yaml

4.5 Verification

kubectl get pod -A
# Confirm that volcano (2 pods), clusterd (1), ascend-operator (1), and infer-operator (1) are all Running
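
Scanning the full pod list by eye gets tedious once the cluster has more than a handful of pods. The filter below is a sketch that reads `kubectl get pod -A --no-headers` output and prints only the pods whose STATUS column is neither `Running` nor `Completed`; `not_running` is a name chosen for this example, and it relies on STATUS being the fourth column of that output.

```shell
# not_running
# Reads `kubectl get pod -A --no-headers` output on stdin.
# Prints "namespace/name is STATUS" for each pod that is not Running or
# Completed, and returns non-zero if at least one such pod exists.
not_running() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " is " $4; bad = 1 }
       END { exit bad }'
}

# Example:
#   kubectl get pod -A --no-headers | not_running && echo "all pods are up"
```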

5. Worker node: preparation

5.1 Labels

# Master that also acts as a worker (one command per role label; repeated -l flags do not combine)
kubectl label nodes -l node-role.kubernetes.io/control-plane workerselector=dls-worker-node --overwrite
kubectl label nodes -l node-role.kubernetes.io/master workerselector=dls-worker-node --overwrite

# Worker-only nodes
kubectl label nodes -l '!node-role.kubernetes.io/control-plane,!node-role.kubernetes.io/master' workerselector=dls-worker-node --overwrite

# Chip type label (choose according to the chip name reported by npu-smi info)
kubectl label nodes -l workerselector=dls-worker-node accelerator=huawei-Ascend910 --overwrite

5.2 User and log directories (run on every worker node)

useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX
usermod -a -G HwHiAiUser hwMindX

mkdir -m 755 /var/log/mindx-dl
chown root:root /var/log/mindx-dl
for dir in devicePlugin npu-exporter noded; do
  mkdir -m 750 /var/log/mindx-dl/$dir
  chown root:root /var/log/mindx-dl/$dir
done

5.3 Importing images

k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/noded:v26.0.0
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/npu-exporter:v26.0.0
k3s ctr images pull swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v26.0.0

k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/noded:v26.0.0 docker.io/library/noded:v26.0.0
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/npu-exporter:v26.0.0 docker.io/library/npu-exporter:v26.0.0
k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v26.0.0 docker.io/library/ascend-k8sdeviceplugin:v26.0.0

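5.4 Configuring the Ascend containerd runtime

Edit the k3s containerd configuration file. The default path is `/var/lib/rancher/k3s/agent/etc/containerd/config.toml` (or the corresponding path under the directory given by `--data-dir`). k3s regenerates this file on every start, so persistent changes belong in a template next to it (`config.toml.tmpl`; releases bundling containerd 2.x use `config-v3.toml.tmpl` and a different CRI plugin key, so check the generated file first).

Key settings (the runtime table name must match `default_runtime_name`):

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "ascend"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.ascend]
  runtime_type = "io.containerd.runc.v2"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.ascend.options]
    BinaryName = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"

Restart k3s after the change: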

sudo systemctl restart k3s
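
A default runtime name that points at a runtime table which does not exist only surfaces when the first pod sandbox is created, so a quick text check on the effective config is cheap insurance. The sketch below extracts `default_runtime_name` and looks for a matching `containerd.runtimes.<name>]` table header in the same file. `check_default_runtime` is a helper invented for this example, and it is a plain text match (it does not parse TOML and does not handle quoted table names).

```shell
# check_default_runtime CONFIG
# Succeeds if CONFIG defines a runtimes.<name> table for the runtime named by
# default_runtime_name; prints a diagnostic and fails otherwise.
check_default_runtime() {
  cfg=$1
  name=$(sed -n "s/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*[\"']\\([^\"']*\\)[\"'].*/\\1/p" "$cfg" | head -n 1)
  if [ -z "$name" ]; then
    echo "no default_runtime_name found in $cfg" >&2
    return 1
  fi
  if grep -Eq "containerd\.runtimes\.$name\]" "$cfg"; then
    echo "default runtime '$name' is defined"
  else
    echo "default runtime '$name' has no runtimes.$name table" >&2
    return 1
  fi
}

# Example (after k3s has restarted and regenerated the file):
#   check_default_runtime /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```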

6. Worker node: installing components

6.1 Installing NodeD

wget https://gitcode.com/Ascend/mind-cluster/releases/download/v26.0.0/Ascend-mindxdl-noded_26.0.0_linux-aarch64.zip
unzip Ascend-mindxdl-noded_26.0.0_linux-aarch64.zip
kubectl apply -f noded-v26.0.0.yaml

6.2 Installing NPU-Exporter

wget https://gitcode.com/Ascend/mind-cluster/releases/download/v26.0.0/Ascend-mindxdl-npu-exporter_26.0.0_linux-aarch64.zip
unzip Ascend-mindxdl-npu-exporter_26.0.0_linux-aarch64.zip
# Check the unpacked file name; the other manifests carry a "v" before the version
kubectl apply -f npu-exporter-26.0.0.yaml

6.3 Installing the Ascend Device Plugin

wget https://gitcode.com/Ascend/mind-cluster/releases/download/v26.0.0/Ascend-mindxdl-device-plugin_26.0.0_linux-aarch64.zip
unzip Ascend-mindxdl-device-plugin_26.0.0_linux-aarch64.zip
# The package contains manifests for different chip types; typically only the one matching the node's NPU is applied
kubectl apply -f device-plugin-npu-volcano-v26.0.0.yaml
kubectl apply -f device-plugin-volcano-v26.0.0.yaml
kubectl apply -f device-plugin-310P-volcano-v26.0.0.yaml

6.4 Verification

kubectl get pod -A -o wide | grep noded
kubectl get pod -A -o wide | grep npu-exporter
kubectl get pod -A -o wide | grep ascend-device-plugin

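7. Pitfalls

7.1 ErrImageNeverPull: image naming issue

While deploying ClusterD and other components, the Pod reported:

Warning  ErrImageNeverPull  Container image "clusterd:v26.0.0" is not present with pull policy of Never

The MindCluster YAML sets `imagePullPolicy: Never`, and images in the `k8s.io` namespace of k3s containerd are matched by their exact reference. The short name `clusterd:v26.0.0` is normalized to `docker.io/library/clusterd:v26.0.0`, so tagging the image as `docker.io/clusterd:v26.0.0` is not enough; it must be tagged as `docker.io/library/clusterd:v26.0.0` to be found.

By contrast, the Volcano YAML uses `imagePullPolicy: IfNotPresent`, so it still works when the local tag is not exact, because a missing image is then pulled from the registry (which requires network access). Most MindCluster components use `Never`, so their tags must match exactly.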
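
The name the runtime looks up follows the usual Docker reference normalization: a name without a registry host gets `docker.io/`, and a single-component name on `docker.io` additionally gets `library/`. The function below is a simplified sketch of that rule for working out the tag target from the `image:` field of a manifest. `normalize_ref` is a name invented for this example, and it does not add a default tag or handle digests.

```shell
# normalize_ref IMAGE
# Prints the fully qualified form of a short image reference.
normalize_ref() {
  ref=$1
  case "$ref" in
    */*) first=${ref%%/*}; rest=${ref#*/} ;;
    *)   first=; rest=$ref ;;
  esac
  case "$first" in
    *.*|*:*|localhost) domain=$first ;;   # first component is a registry host
    *) domain=docker.io; rest=$ref ;;     # no host given: Docker Hub
  esac
  if [ "$domain" = docker.io ]; then
    case "$rest" in
      */*) ;;                             # org/name stays as it is
      *) rest="library/$rest" ;;          # bare name goes under library/
    esac
  fi
  printf '%s/%s\n' "$domain" "$rest"
}

# Example:
#   k3s ctr images tag swr.cn-south-1.myhuaweicloud.com/ascendhub/clusterd:v26.0.0 \
#     "$(normalize_ref clusterd:v26.0.0)"
```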

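7.2 k3s ctr and the native ctr are isolated

The `k3s ctr` built into k3s and the system `ctr` talk to different containerd instances, so their image stores are completely separate:

  • `k3s ctr`: operates on the k3s-private containerd; images live in the `k8s.io` namespace
  • `ctr`: the system containerd, using the `default` namespace by default

Images must be pulled or imported with `k3s ctr`; images handled with the native `ctr` are not visible to k3s. To export an image from docker and import it into k3s:

# Export from docker
docker save ubuntu:22.04 -o ubuntu-22.04.tar

# Import into k3s containerd (-n k8s.io makes the target namespace explicit)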
k3s ctr -n k8s.io images import ubuntu-22.04.tar
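
When several images have to be moved this way, deriving the archive name from the image reference keeps the export and import steps in sync. A minimal sketch; `image_to_tar` is a helper name made up for this example and only rewrites `/` and `:` to `-`.

```shell
# image_to_tar IMAGE
# Prints a tar file name derived from an image reference.
image_to_tar() {
  printf '%s.tar\n' "$(printf '%s' "$1" | sed 's|[/:]|-|g')"
}

# Example (export on the docker host, import on the k3s node):
#   for img in ubuntu:22.04; do
#     docker save "$img" -o "$(image_to_tar "$img")"
#   done
#   for tarball in ./*.tar; do
#     k3s ctr -n k8s.io images import "$tarball"
#   done
```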

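7.3 DaemonSet node selector mismatch

After deploying NodeD and the Device Plugin, the Pod count was 0 because the DaemonSet `nodeSelector` did not match the node labels:

  • NodeD requires `workerselector=dls-worker-node`
  • The Device Plugin requires `accelerator=huawei-npu` or `accelerator=huawei-Ascend910`

On a single node that is both master and worker, both kinds of label must be applied: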

kubectl label node <node-name> workerselector=dls-worker-node --overwrite
kubectl label node <node-name> accelerator=huawei-Ascend910 --overwrite
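
To confirm that a node really carries the labels a DaemonSet selects on, the comma-separated label list printed by `kubectl get node --show-labels` can be checked directly. The sketch below does an exact `key=value` membership test on such a list; `has_label` is a helper name chosen for this example.

```shell
# has_label LABELS KEY=VALUE
# LABELS is a comma-separated list such as "a=b,c=d".
# Succeeds only if KEY=VALUE appears as a complete entry.
has_label() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   labels=$(kubectl get node <node-name> --no-headers --show-labels | awk '{print $NF}')
#   has_label "$labels" workerselector=dls-worker-node || echo "NodeD will not be scheduled here"
#   has_label "$labels" accelerator=huawei-Ascend910 || echo "Device Plugin will not be scheduled here"
```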

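7.4 imagePullPolicy differences in the ClusterD YAML

The ClusterD YAML defaults to `imagePullPolicy: Never`, whereas some other components, such as Volcano, use `IfNotPresent`. If the image tag does not match exactly, either change the policy in the YAML or make sure the tag path is identical.

8. Summary

Across the whole deployment, the main difficulties lie in how k3s containerd manages images: namespace isolation, the `docker.io/library/` prefix requirement, and the differences from native docker/ctr. The official MindCluster documentation targets native K8s with the Docker runtime, so a k3s environment needs extra adaptation.

Suggested order: make sure k3s runs normally → configure the Ascend runtime → apply labels → import images and verify the tags → deploy the master components, then the worker components, checking with `kubectl get pod -A` after each step before moving on.

End of notes.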

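Articles on this site are shared under the CC BY-NC-SA 4.0 International license;
unless otherwise stated, all articles on this site are original. Please follow the license when reposting.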

scanz's personal blog