ETCD Deployment and Recovery Guide
1. ETCD Deployment
1.1 Cluster Topology
| Hostname | IP address | CPU | Memory | Role | Data directory | Config file |
| --- | --- | --- | --- | --- | --- | --- |
| etcd0 | 192.168.1.10 | 16C | 32G | etcd node | /data/etcd | /etc/etcd/etcd.conf |
| etcd1 | 192.168.1.11 | 16C | 32G | etcd node | /data/etcd | /etc/etcd/etcd.conf |
| etcd2 | 192.168.1.12 | 16C | 32G | etcd node | /data/etcd | /etc/etcd/etcd.conf |
1.2 File System Requirements
Use an ext4 or xfs file system for the data directory.
1.3 ulimit Settings
Add the following to /etc/security/limits.conf:

    * soft nofile 65536
    * hard nofile 65536
1.4 Kernel Parameter Tuning
Add the following to /etc/sysctl.conf:

    net.ipv4.tcp_keepalive_time = 300
    net.ipv4.tcp_keepalive_intvl = 60
    net.ipv4.tcp_keepalive_probes = 5
    vm.overcommit_memory = 1
    kernel.shmall = 2097152
    kernel.shmmax = 2147483648

Apply the settings:

    sysctl -p
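Once the settings are loaded (for example with sysctl -p), it can help to confirm that the running kernel actually picked up each value. The following is a minimal sketch, assuming a POSIX shell with sysctl on the PATH; trim and check_sysctl are hypothetical helper names, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helpers: compare "key = value" lines from a sysctl config
# file against the values reported by the running kernel.

# Strip leading and trailing whitespace from a string.
trim() {
    printf '%s' "$1" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
}

# check_sysctl FILE
# Prints one MISMATCH line per differing key; returns 1 if any key differs.
check_sysctl() {
    file="$1"
    rc=0
    while IFS='=' read -r key value || [ -n "$key" ]; do
        key="$(trim "$key")"
        value="$(trim "$value")"
        # Skip blank lines and comment lines.
        case "$key" in
            ''|'#'*) continue ;;
        esac
        actual="$(sysctl -n "$key" 2>/dev/null || true)"
        if [ "$actual" != "$value" ]; then
            echo "MISMATCH $key: expected $value, got ${actual:-unset}"
            rc=1
        fi
    done < "$file"
    return "$rc"
}
```

For example, check_sysctl /etc/sysctl.conf prints nothing and returns 0 when every listed value is in effect. Multi-value parameters, which sysctl prints tab-separated, are not handled by this sketch.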
1.5 Download and Install

    # Download and unpack etcd
    wget https://github.com/etcd-io/etcd/releases/download/v3.5.21/etcd-v3.5.21-linux-amd64.tar.gz
    tar -zxf etcd-v3.5.21-linux-amd64.tar.gz
    cd etcd-v3.5.21-linux-amd64
    cp etcd etcdctl etcdutl /usr/local/bin/

    # Create the data directory and the configuration directory
    mkdir -p /data/etcd
    mkdir -p /etc/etcd
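1.6 Configuration File etcd.conf
Adjust the tuning parameters below to match your hardware and workload. The example is for etcd0; on etcd1 and etcd2, change name and the two advertise URLs.

    cat > /etc/etcd/etcd.conf <<EOF
    # Node name (unique per node): etcd0, etcd1, etcd2 ...
    name: etcd0

    # Data directory (use a fast local SSD; do not use NFS or other network storage)
    data-dir: /data/etcd

    # Listen addresses (binding to a specific IP is more secure, but 0.0.0.0 is acceptable)
    listen-peer-urls: http://0.0.0.0:2380
    listen-client-urls: http://0.0.0.0:2379

    # Advertised addresses (must be this node's real IP)
    advertise-client-urls: http://192.168.1.10:2379
    initial-advertise-peer-urls: http://192.168.1.10:2380

    # Initial cluster configuration (three nodes)
    initial-cluster: etcd0=http://192.168.1.10:2380,etcd1=http://192.168.1.11:2380,etcd2=http://192.168.1.12:2380
    initial-cluster-token: milvus-etcd-cluster
    initial-cluster-state: new

    # ========================
    # Key performance tuning for Milvus
    # ========================

    # 1. Backend quota: caps the backend database size so heavy writes do not
    #    exhaust resources (Milvus metadata can grow quickly)
    quota-backend-bytes: 17179869184   # 16 GB (default 2 GB; 8-16 GB recommended)

    # 2. Snapshot frequency: keeps the WAL smaller and speeds up recovery
    snapshot-count: 10000              # default 100000; Milvus is write-heavy, so 10000-50000 is suggested

    # 3. Heartbeat and election timeout
    #    (rule of thumb: election-timeout >= 10 x heartbeat-interval)
    heartbeat-interval: 100            # ms (default 100)
    election-timeout: 2000             # ms (default 1000; 2000-5000 avoids false elections on network jitter)

    # 4. Request size limits (Milvus may issue batch operations)
    max-request-bytes: 52428800        # 50 MB (default 1.5 MB; Milvus DDL requests can be large)
    max-txn-ops: 131072                # default 128; raised to 128K to allow large transactions

    # 5. Backend batching (improves write throughput)
    backend-batch-limit: 65536         # default 10000
    backend-batch-interval: 50         # intended: 50 ms (default 100 ms); verify the unit your etcd version expects in the config file

    # 6. Automatic compaction of old revisions (Milvus does not rely on old
    #    revisions, so compaction can be aggressive)
    auto-compaction-mode: periodic
    auto-compaction-retention: "1h"    # compact every hour, keeping 1 hour of history

    # ========================
    # Logging
    # ========================
    log-outputs:
      - stdout
      - /var/log/etcd.log
    enable-log-rotation: true
    log-rotation-config-json: |
      { "maxsize": 100, "maxage": 7, "compress": true }

    # ========================
    # Optional: enable metrics (for monitoring)
    # ========================
    # listen-metrics-urls: http://0.0.0.0:2381
    EOF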
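Only a few fields differ between the three nodes: the name and the two advertised URLs, while initial-cluster is the same everywhere. The following is a minimal sketch of generating those values from a name and an IP, assuming the ports 2379 and 2380 and the http scheme used in this guide; node_fields and initial_cluster are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helpers for the node-specific parts of etcd.conf.

# node_fields NAME IP
# Prints the three config lines that must differ on every node.
node_fields() {
    name="$1"
    ip="$2"
    cat <<EOF
name: ${name}
advertise-client-urls: http://${ip}:2379
initial-advertise-peer-urls: http://${ip}:2380
EOF
}

# initial_cluster NAME=IP [NAME=IP ...]
# Prints the comma-separated initial-cluster value shared by all nodes.
initial_cluster() {
    out=""
    for pair in "$@"; do
        # ${pair%%=*} is the node name, ${pair#*=} is its IP address.
        out="${out:+$out,}${pair%%=*}=http://${pair#*=}:2380"
    done
    echo "$out"
}
```

For example, node_fields etcd1 192.168.1.11 prints the three lines to substitute on the second node, and initial_cluster etcd0=192.168.1.10 etcd1=192.168.1.11 etcd2=192.168.1.12 prints the value for the initial-cluster field.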
1.7 Enable Start on Boot

    cat > /etc/systemd/system/etcd.service <<EOF
    [Unit]
    Description=Etcd Server
    After=network.target

    [Service]
    Type=notify
    ExecStart=/usr/local/bin/etcd --config-file=/etc/etcd/etcd.conf
    Restart=on-failure
    LimitNOFILE=65536

    [Install]
    WantedBy=multi-user.target
    EOF

    # Reload systemd, then enable and start the etcd service
    systemctl daemon-reload
    systemctl enable --now etcd
1.8 Check the Cluster Status
List the cluster members:

    etcdctl member list -w table
    +------------------+---------+-------+--------------------------+--------------------------+------------+
    |        ID        | STATUS  | NAME  |        PEER ADDRS        |       CLIENT ADDRS       | IS LEARNER |
    +------------------+---------+-------+--------------------------+--------------------------+------------+
    |  37499ff739d6c21 | started | rke03 | http://192.168.1.15:2380 | http://192.168.1.15:2379 |      false |
    |  79e7c26cb0fc149 | started | rke02 | http://192.168.1.16:2380 | http://192.168.1.16:2379 |      false |
    | b4773de1c1f38771 | started | rke01 | http://192.168.1.11:2380 | http://192.168.1.11:2379 |      false |
    +------------------+---------+-------+--------------------------+--------------------------+------------+

Check the cluster health:

    etcdctl endpoint health -w table
    +----------------+--------+------------+-------+
    |    ENDPOINT    | HEALTH |    TOOK    | ERROR |
    +----------------+--------+------------+-------+
    | 127.0.0.1:2379 |   true | 1.740588ms |       |
    +----------------+--------+------------+-------+
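Right after a start or restart, the health check can fail for a few seconds while the members elect a leader. The following is a minimal sketch of a retry loop around the health check, assuming that etcdctl endpoint health exits non-zero while any endpoint is unhealthy; wait_healthy and its default values are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: poll the cluster health until it succeeds.

# wait_healthy [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as the health check passes, 1 after ATTEMPTS failures.
wait_healthy() {
    attempts="${1:-10}"
    delay="${2:-3}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        if etcdctl endpoint health --cluster >/dev/null 2>&1; then
            echo "healthy after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "still unhealthy after $attempts attempt(s)"
    return 1
}
```

For example, wait_healthy 20 3 waits up to about a minute before giving up.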
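2. Removing and Restoring a Failed ETCD Cluster Node
2.1 Stop the etcd Service on the Failed Node
2.2 Remove the Failed Node
Run this on a healthy member, using the failed member's ID as shown by etcdctl member list:

    etcdctl member remove 18f6f0bc5b947c0d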
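2.3 Check the Cluster Member Status
2.4 Clear the Data on the Failed Node
2.5 Add the New Member to the Cluster
Run this on a healthy member, with the name and peer URL of the node being re-added:

    etcdctl member add etcd2 --peer-urls=http://172.22.16.85:2380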
2.6 Update etcd.conf
On the node being re-added, change:

    initial-cluster-state: new

to:

    initial-cluster-state: existing
2.7 Start the etcd Service
2.8 Check the Cluster Status

    etcdctl member list
    etcdctl endpoint status --cluster -w table
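The steps in this section run on two different hosts, so the order and the host matter. The following sketch prints the sequence as a per-host checklist and runs nothing itself. replace_member_plan is a hypothetical helper. The systemctl commands, the .bak move used to clear the old data, and the sed edit are assumptions about how the steps without a listed command might be carried out; the paths default to the ones used in this guide.

```shell
#!/bin/sh
# Hypothetical helper: print the member-replacement steps as a checklist.
# It only prints commands and never executes them.

# replace_member_plan MEMBER_ID NAME PEER_URL [DATA_DIR] [CONF_FILE]
replace_member_plan() {
    member_id="$1"                     # failed member ID from `etcdctl member list`
    name="$2"                          # name of the member being re-added
    peer_url="$3"                      # peer URL of the member being re-added
    data_dir="${4:-/data/etcd}"        # data directory on the failed node
    conf="${5:-/etc/etcd/etcd.conf}"   # config file on the failed node

    cat <<EOF
[failed node]  systemctl stop etcd
[healthy node] etcdctl member remove ${member_id}
[healthy node] etcdctl member list -w table
[failed node]  mv ${data_dir} ${data_dir}.bak && mkdir -p ${data_dir}
[healthy node] etcdctl member add ${name} --peer-urls=${peer_url}
[failed node]  sed -i 's/^initial-cluster-state:.*/initial-cluster-state: existing/' ${conf}
[failed node]  systemctl start etcd
[healthy node] etcdctl endpoint status --cluster -w table
EOF
}
```

For example, replace_member_plan 18f6f0bc5b947c0d etcd2 http://172.22.16.85:2380 prints the eight steps with the values used in this section. Moving the old data directory aside instead of deleting it keeps a fallback copy until the node has rejoined.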
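3. ETCD Cluster Data Recovery
Reference: https://www.cnblogs.com/xishuai/p/docker-etcd.html
3.1 Data Backup
Take the backup while etcd is still running, because etcdctl snapshot save reads from a live member. Stop the etcd service on every node in the cluster only after the backup is complete, before starting the restore in 3.2.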
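List the endpoints:

    etcdctl member list -w table

Find the leader node:

    etcdctl endpoint status --cluster -w table

Count the stored records. This counts output lines (each key prints a key line and at least one value line), so treat it as a baseline to compare against the same command after the restore, not as an exact key count:

    etcdctl get --prefix "" | grep -c '^'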
Back up the etcd data to a local file. The snapshot is a single database file, not a tar archive, despite the .tar.gz name used here:

    etcdctl snapshot save etcd_backup_20240809.tar.gz

Inspect the backup file:

    etcdctl snapshot status etcd_backup_20240809.tar.gz -w table
    Deprecated: Use `etcdutl snapshot status` instead.

    +----------+----------+------------+------------+
    |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
    +----------+----------+------------+------------+
    | 8c54a0ac | 15687011 |      10141 |     4.8 MB |
    +----------+----------+------------+------------+
Copy the backup file to every node that will be restored:

    scp etcd_backup_20240809.tar.gz rke01:/root/
    scp etcd_backup_20240809.tar.gz rke02:/root/
    scp etcd_backup_20240809.tar.gz rke03:/root/
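Every node must restore from the same snapshot, so it is worth checking that each copy arrived intact. The following is a minimal sketch using sha256sum from GNU coreutils, which is assumed to be installed; snapshot_sum and same_snapshot are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helpers: compare snapshot files by SHA-256 checksum.

# snapshot_sum FILE
# Prints only the hex digest of FILE.
snapshot_sum() {
    sha256sum "$1" | awk '{print $1}'
}

# same_snapshot FILE_A FILE_B
# Returns 0 when both files have the same checksum, 1 otherwise.
same_snapshot() {
    [ "$(snapshot_sum "$1")" = "$(snapshot_sum "$2")" ]
}
```

To check a remote copy, compare the output of snapshot_sum etcd_backup_20240809.tar.gz with the digest printed by running sha256sum /root/etcd_backup_20240809.tar.gz over ssh on each node.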
3.2 Data Restore
On each node, confirm that the copied backup matches the original (same hash, revision, and key count):

    etcdctl snapshot status etcd_backup_20240809.tar.gz -w table
    Deprecated: Use `etcdutl snapshot status` instead.

    +----------+----------+------------+------------+
    |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
    +----------+----------+------------+------------+
    | 8c54a0ac | 15687011 |      10141 |     4.8 MB |
    +----------+----------+------------+------------+
Restore each node from the same backup file while etcd is stopped on all nodes. The target data directory must not already contain data, so move any old data directory aside first. Note that etcdctl snapshot restore is deprecated in etcd 3.5 in favor of etcdutl snapshot restore.

Restore the rke01 node:

    etcdctl snapshot restore etcd_backup_20240809.tar.gz \
      --data-dir=/data/etcd --name rke01 \
      --initial-advertise-peer-urls http://192.168.1.11:2380 \
      --initial-cluster-token docker-etcd \
      --initial-cluster rke01=http://192.168.1.11:2380,rke02=http://192.168.1.16:2380,rke03=http://192.168.1.15:2380

Restore the rke02 node:

    etcdctl snapshot restore etcd_backup_20240809.tar.gz \
      --data-dir=/data/etcd --name rke02 \
      --initial-advertise-peer-urls http://192.168.1.16:2380 \
      --initial-cluster-token docker-etcd \
      --initial-cluster rke01=http://192.168.1.11:2380,rke02=http://192.168.1.16:2380,rke03=http://192.168.1.15:2380

Restore the rke03 node:

    etcdctl snapshot restore etcd_backup_20240809.tar.gz \
      --data-dir=/data/etcd --name rke03 \
      --initial-advertise-peer-urls http://192.168.1.15:2380 \
      --initial-cluster-token docker-etcd \
      --initial-cluster rke01=http://192.168.1.11:2380,rke02=http://192.168.1.16:2380,rke03=http://192.168.1.15:2380
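A snapshot restore refuses to write into a data directory that already holds data, so the old directory has to be out of the way before the restore commands run. The following is a minimal sketch that moves a non-empty directory aside under a timestamped name; prepare_data_dir and the .bak suffix are hypothetical, and etcd is assumed to be stopped on the node.

```shell
#!/bin/sh
# Hypothetical helper: move an existing, non-empty data directory aside
# so that a snapshot restore can create it fresh.

# prepare_data_dir DIR
# Prints the backup path if DIR was moved; prints nothing otherwise.
prepare_data_dir() {
    dir="$1"
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]; then
        backup="${dir}.bak.$(date +%Y%m%d%H%M%S)"
        mv "$dir" "$backup"
        echo "$backup"
    fi
}
```

For example, prepare_data_dir /data/etcd could be run on each node just before its restore command. Keeping the old directory, instead of deleting it, leaves a way back if the restore turns out to be wrong.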
Start the etcd service.
Note: start etcd on all nodes of the cluster at about the same time.

    systemctl start etcd
    systemctl status etcd
3.3 Check the Cluster Status

    etcdctl member list
    etcdctl endpoint status --cluster -w table
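3.4 Verify the Restored Data
List all restored data:

    etcdctl get --prefix ""

Count the records and compare the result with the count taken before the backup in 3.1 to judge whether the restore succeeded:

    etcdctl get --prefix "" | grep -c '^'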
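The line count above mixes key lines and value lines, so a multi-line value shifts the number. The following is a minimal sketch that counts keys only, assuming etcdctl get --keys-only prints each key followed by a blank line; count_keys is a hypothetical helper name, and keys that contain newlines are not handled.

```shell
#!/bin/sh
# Hypothetical helper: count keys in `etcdctl get --keys-only` output.

# count_keys
# Reads the key listing on stdin and prints the number of non-empty lines.
count_keys() {
    # grep -c exits 1 when the count is 0; "|| true" keeps the exit status 0.
    grep -cv '^$' || true
}
```

For example, run etcdctl get --prefix "" --keys-only | count_keys before the backup and again after the restore, then compare the two numbers.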