Installation
Configure Helm
Install Helm
wget https://get.helm.sh/helm-v3.15.3-linux-amd64.tar.gz
tar xf helm-v3.15.3-linux-amd64.tar.gz
cp linux-amd64/helm /usr/local/bin/
Add a proxy
export https_proxy=http://192.168.2.1:7890
export http_proxy=http://192.168.2.1:7890
export all_proxy=socks5://192.168.2.1:7890
Add the Milvus Helm repository
helm repo add milvus https://zilliztech.github.io/milvus-helm/
helm repo update
For offline installation, the Milvus Helm chart repository is: https://github.com/zilliztech/milvus-helm.git
List the available versions
helm search repo milvus --versions
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
NAME            CHART VERSION   APP VERSION   DESCRIPTION
milvus/milvus   4.2.4           2.4.7         Milvus is an open-source vector database built ...
milvus/minio    8.0.17          master        High Performance, Kubernetes Native Object Storage
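When scripting the next step, it can help to pull just the chart versions out of that listing. The following is a minimal sketch, not part of the original notes: `chart_versions` is a hypothetical helper name, and it assumes the whitespace-separated column layout shown above.

```shell
# Print the CHART VERSION column for rows whose NAME equals the given chart.
# WARNING lines and the header row are skipped because their first field
# never equals the chart name.
chart_versions() {
    chart="$1"
    awk -v chart="$chart" '$1 == chart { print $2 }'
}

# Usage (requires helm and the repository added above):
#   helm search repo milvus --versions | chart_versions milvus/milvus
```

The function only reads standard input, so it can be tried against saved output without access to the repository.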
Generate milvus_manifest.yaml
helm template my-release milvus/milvus --version 4.2.4 > milvus_manifest.yaml
Specify a namespace
helm template my-release milvus/milvus --version 4.1.28 --namespace milvus -f custom-values.yaml > milvus_manifest.yaml
helm template my-release milvus/milvus \
  --version 4.2.56 \
  --namespace milvus \
  --set nodeSelector.node=milvus \
  --set serviceAccount.create=true \
  --set metrics.serviceMonitor.enabled=true \
  --set attu.enabled=true \
  --set minio.enabled=false \
  --set externalS3.enabled=true \
  --set externalS3.host=x.x.x.x \
  --set externalS3.port=8081 \
  --set externalS3.accessKey=xxxxxxxxxxxxxxx \
  --set externalS3.secretKey=xxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  --set externalS3.bucketName=milvus-bucket-zx \
  > milvus_manifest_external_etcd.yaml
External etcd
helm template my-release milvus/milvus \
  --version 4.2.56 \
  --namespace milvus \
  --set nodeSelector.node=milvus \
  --set serviceAccount.create=true \
  --set metrics.serviceMonitor.enabled=true \
  --set attu.enabled=true \
  --set minio.enabled=false \
  --set etcd.enabled=false \
  --set externalEtcd.enabled=true \
  --set externalEtcd.endpoints[0]=192.168.1.11:2379 \
  --set externalEtcd.endpoints[1]=192.168.1.12:2379 \
  --set externalEtcd.endpoints[2]=192.168.1.13:2379 \
  --set externalS3.enabled=true \
  --set externalS3.host=x.x.x.x \
  --set externalS3.port=8081 \
  --set externalS3.accessKey=xxxxxxxxxxxxxxx \
  --set externalS3.secretKey=xxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  --set externalS3.bucketName=milvus-bucket-zx \
  > milvus_manifest_external_etcd.yaml
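The indexed externalEtcd.endpoints flags are easy to mistype when the member list changes. As an illustrative sketch (the helper name `etcd_endpoint_flags` is an assumption, not something from the chart), the flags can be generated from a plain list of HOST:PORT values:

```shell
# Emit one "--set externalEtcd.endpoints[N]=HOST:PORT" line per argument,
# numbering the entries from 0 in the order given.
etcd_endpoint_flags() {
    i=0
    for ep in "$@"; do
        printf '%s externalEtcd.endpoints[%d]=%s\n' '--set' "$i" "$ep"
        i=$((i + 1))
    done
}

etcd_endpoint_flags 192.168.1.11:2379 192.168.1.12:2379 192.168.1.13:2379
```

The printed lines can be pasted into the helm template command in place of the hand-written endpoint flags.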
Install Milvus
kubectl apply -f milvus_manifest.yaml
Install the etcd cluster
Download etcd
https://github.com/etcd-io/etcd/releases?q=3.5.18&expanded=true
Extract the archive and install the binaries
tar xf etcd-v3.5.18-linux-amd64.tar.gz
cp etcd-v3.5.18-linux-amd64/etcd* /usr/local/bin/
Create the directories
mkdir -p /data/etcd/{conf,data,log}
Create the configuration file
cat > /data/etcd/conf/etcd.conf.yml << EOF
# /data/etcd/conf/etcd.conf.yml
# etcd production configuration (for Milvus)

# === Basic settings ===
name: etcd0  # set to etcd0, etcd1, etcd2 on the respective nodes
data-dir: /data/etcd/data
listen-peer-urls: http://0.0.0.0:2380
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://192.168.1.10:2379  # replace with this node's IP
initial-advertise-peer-urls: http://192.168.1.10:2380  # replace with this node's IP

# Initial cluster members (must be identical on every node)
initial-cluster: etcd0=http://192.168.1.10:2380,etcd1=http://192.168.1.11:2380,etcd2=http://192.168.1.12:2380
initial-cluster-state: new
initial-cluster-token: milvus-etcd-cluster

# === Key tuning for Milvus ===
quota-backend-bytes: 17179869184  # 16 GiB, so the backend does not run out of space
max-txn-ops: 131072  # Milvus requires ≥128K
max-request-bytes: 10485760  # 10 MiB, allows large requests

# === Stability and performance ===
heartbeat-interval: 100  # 100 ms
election-timeout: 5000  # 5 s, avoids spurious elections on network jitter
auto-compaction-mode: revision  # compact by revision
auto-compaction-retention: "1000"  # keep the most recent 1000 revisions
snapshot-count: 50000  # take a snapshot every 50,000 writes

# === Logging ===
log-outputs:
  - stdout
  - /data/etcd/log/etcd.log
enable-log-rotation: true
log-rotation-config-json: |
  {
    "maxsize": 100,
    "maxage": 7,
    "compress": true
  }
EOF
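Because initial-cluster must be identical on every node, it may be safer to generate the string once and paste it into each node's file. A minimal sketch, assuming the etcd0, etcd1, ... naming and peer port 2380 used in the configuration above (`initial_cluster` is a hypothetical helper):

```shell
# Build the initial-cluster value from node IPs, naming the nodes
# etcd0, etcd1, ... in argument order and using peer port 2380.
initial_cluster() {
    i=0
    out=""
    for ip in "$@"; do
        # ${out:+,} adds a comma only when out is already non-empty.
        out="${out}${out:+,}etcd${i}=http://${ip}:2380"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

initial_cluster 192.168.1.10 192.168.1.11 192.168.1.12
```

For the three example addresses this prints the same value as the initial-cluster line in the configuration file.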
Start the service
cat > /etc/systemd/system/etcd.service << EOF
[Unit]
Description=etcd: distributed reliable key-value store
Documentation=https://etcd.io/docs
After=network.target

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd --config-file=/data/etcd/conf/etcd.conf.yml
Restart=always
RestartSec=10
LimitNOFILE=65536
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable etcd
systemctl start etcd
Check the cluster status
etcdctl member list -w table
+------------------+---------+-------+----------------------------+----------------------------+------------+
|        ID        | STATUS  | NAME  |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-------+----------------------------+----------------------------+------------+
| 6ae5c40407554081 | started | etcd2 | http://192.168.97.235:2380 | http://192.168.97.235:2379 |      false |
| 7b1c5ca31530f6ee | started | etcd1 | http://192.168.97.57:2380  | http://192.168.97.57:2379  |      false |
| f4aa7f67ae39563f | started | etcd0 | http://192.168.97.151:2380 | http://192.168.97.151:2379 |      false |
+------------------+---------+-------+----------------------------+----------------------------+------------+
Additional configuration
Change the default MinIO password
Edit milvus_manifest.yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-release-minio
  labels:
    app: minio
    chart: minio-8.0.17
    release: my-release
    heritage: Helm
type: Opaque
data:
  # Values are base64-encoded (an encoding, not encryption)
  accesskey: "bWluaW9hZG1pbg=="
  secretkey: "MTIzNDU2Nzg="

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-release-milvus
data:
  default.yaml: |+
    minio:
      address: my-release-minio
      port: 9000
      accessKeyID: minioadmin
      # secretAccessKey must match the decoded secretkey in the Secret
      secretAccessKey: 12345678
      useSSL: false
      bucketName: milvus-bucket
      rootPath: file
      useIAM: false
      cloudProvider: aws
      iamEndpoint:
      region:
      useVirtualHost: false
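The Secret stores the credentials base64-encoded, while the Milvus ConfigMap takes them in plain text, so the two are easy to get out of sync. A small sketch for producing and checking the encoded values (`b64` and `unb64` are hypothetical helper names; `base64 -d` is assumed to be available, as it is in GNU coreutils):

```shell
# Encode a value for the Secret's data field. printf '%s' avoids the trailing
# newline that echo would add and that would end up in the stored credential.
b64() {
    printf '%s' "$1" | base64
}

# Decode a value copied from the manifest to check what it contains.
unb64() {
    printf '%s' "$1" | base64 -d
}

b64 minioadmin   # value for accesskey
b64 12345678     # value for secretkey
```

Decoding the secretkey from the manifest with `unb64` and comparing it with secretAccessKey in the ConfigMap is a quick way to confirm the two agree before applying.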
Use a StorageClass for MinIO
Edit milvus_manifest.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-release-minio
  labels:
    app: minio
    chart: minio-8.0.17
    release: my-release
    heritage: Helm
spec:
  volumeClaimTemplates:
    - metadata:
        name: export
      spec:
        storageClassName: nfs-client  # name of the StorageClass to use (NFS-backed here)
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 2560Gi
Use hostPath for etcd
etcd performs poorly when its data is persisted on NFS, so mount the etcd data directory from the host with hostPath instead.
volumeMounts:
  - name: data
    mountPath: /bitnami/etcd
volumes:
  - name: data
    hostPath:
      path: /opt/milvus/etcd
      type: DirectoryOrCreate
#volumeClaimTemplates:
#  - metadata:
#      name: data
#    spec:
#      accessModes:
#        - "ReadWriteOnce"
#      resources:
#        requests:
#          storage: "10Gi"
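Enable Milvus authentication
Edit milvus_manifest.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-release-milvus
  namespace: default
data:
  user.yaml: |+
    common:
      security:
        authorizationEnabled: true
        defaultRootPassword: 123456
With authentication enabled, the default credentials are root/Milvus; defaultRootPassword, as set above, overrides the default password.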
Configure an Nginx proxy for MinIO
Edit /etc/nginx/nginx.conf
# Limit the request body size of all proxied requests to 10240M
client_max_body_size 10240M;
Edit /etc/nginx/conf.d/minio.conf
server {
    listen 9000;
    location / {
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Host $http_host;
        proxy_connect_timeout 300;
        # Default is HTTP/1, keepalive is only enabled in HTTP/1.1
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        chunked_transfer_encoding off;
        proxy_pass http://10.43.82.8:9000;
    }
}
Reference: https://minio.org.cn/docs/minio/linux/integrations/setup-nginx-proxy-with-minio.html
Nginx proxy for Milvus
server {
    listen 50498 http2;
    location / {
        grpc_pass grpc://milvus_test;
    }
}
upstream milvus_test {
    server 192.168.31.42:30717;
}
Known issues
my-release-milvus-rootcoord fails to start
my-release-milvus-rootcoord fails to start with the following error log:
[2024/10/18 11:13:51.532 +00:00] [ERROR] [msgstream/mq_msgstream.go:138] ["retry func failed"] ["retry time"=4] [error="no partitioned metadata for topic{public/milvus-test/by-dev-rootcoord-dml_0} in lookup response"]
Troubleshooting: the my-release-pulsar-broker-0 pod reports "Policies not found for public/milvus-test namespace". The cause is that the public/milvus-test namespace was never created in ZooKeeper. List the existing namespaces from the ZooKeeper pod:
root@my-release-pulsar-zookeeper-0:/pulsar# bin/pulsar-admin --admin-url http://my-release-pulsar-broker:8080 namespaces list public
"public/default"
Create the namespace to restore the service:
bin/pulsar-admin --admin-url http://my-release-pulsar-broker:8080 namespaces create public/milvus-test
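The list and create steps above can be combined so that the namespace is created only when it is missing. This is a sketch under assumptions: `ensure_namespace` is a hypothetical helper, the two variables default to the values used in the commands above, and the listing is assumed to print one namespace per line, optionally quoted as in the sample output.

```shell
# Both defaults are taken from the commands above; override them if the
# release uses a different broker service name or admin path.
PULSAR_ADMIN="${PULSAR_ADMIN:-bin/pulsar-admin}"
ADMIN_URL="${ADMIN_URL:-http://my-release-pulsar-broker:8080}"

# Create TENANT/NAMESPACE only when "namespaces list" does not already show it.
ensure_namespace() {
    tenant="$1"
    ns="$2"
    # Strip the surrounding quotes, then match the whole line literally.
    if $PULSAR_ADMIN --admin-url "$ADMIN_URL" namespaces list "$tenant" \
        | tr -d '"' | grep -qxF "${tenant}/${ns}"; then
        echo "namespace ${tenant}/${ns} already exists"
    else
        $PULSAR_ADMIN --admin-url "$ADMIN_URL" namespaces create "${tenant}/${ns}"
    fi
}

# Usage, from inside a Pulsar pod in /pulsar:
#   ensure_namespace public milvus-test
```

Checking first keeps the step safe to re-run after a redeploy.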
my-release-pulsar-zookeeper fails to start
The ZooKeeper log reports:
java.net.UnknownHostException: my-release-pulsar-zookeeper-2.my-release-pulsar-zookeeper.milvus-gpu.svc.cluster.local
Only two pods had started; the my-release-pulsar-zookeeper-2 pod had not been created. Clear the ZooKeeper data directory and redeploy.
Modify the my-release-pulsar-bookie ConfigMap
PULSAR_MEM: |
  -Xms8192m -Xmx8192m -XX:MaxDirectMemorySize=16384m
dbStorage_readAheadCacheMaxSizeMb: "64"
dbStorage_rocksDB_blockCacheSize: "8388608"
dbStorage_rocksDB_writeBufferSizeMB: "16"
dbStorage_writeCacheMaxSizeMb: "64"
nettyMaxFrameSizeBytes: "104867840"
Deploy Prometheus monitoring
Reference: https://milvus.io/docs/zh/monitor.md
Generate the ServiceMonitor manifest
helm template my-release milvus/milvus \
  --version 4.2.4 \
  --namespace milvus \
  --set metrics.serviceMonitor.enabled=true \
  --show-only templates/servicemonitor.yaml > servicemonitor.yaml
The generated file looks like this
---
# Source: milvus/templates/servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-release-milvus
  namespace: milvus
  labels:
    helm.sh/chart: milvus-4.2.4
    app.kubernetes.io/name: milvus
    app.kubernetes.io/instance: my-release
    app.kubernetes.io/version: "2.4.7"
    app.kubernetes.io/managed-by: Helm
spec:
  endpoints:
    - honorLabels: true
      interval: 30s
      scrapeTimeout: 10s
      path: /metrics
      port: metrics
  namespaceSelector:
    matchNames:
      - milvus
  selector:
    matchLabels:
      app.kubernetes.io/name: milvus
      app.kubernetes.io/instance: my-release
  targetLabels:
    - app.kubernetes.io/name
    - app.kubernetes.io/instance
    - component
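Visit https://github.com/prometheus-operator/kube-prometheus and choose the branch that matches your Kubernetes version
Create the ServiceMonitor CRD (use the raw file URL; the GitHub blob page returns HTML, which kubectl cannot apply)
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/kube-prometheus/release-0.9/manifests/setup/prometheus-operator-0servicemonitorCustomResourceDefinition.yaml
Create the ServiceMonitor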
kubectl apply -f servicemonitor.yaml
Add a milvus job to the Prometheus server configuration
- job_name: milvus
  kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names: [milvus]  # monitor only the milvus namespace
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
      regex: milvus
      action: keep