ETCD Cluster Deployment

ETCD Deployment and Recovery Guide

1. ETCD Deployment

1.1 Cluster Topology

Hostname  IP address    CPU  Memory  Role       Data directory  Config file
etcd0     192.168.1.10  16C  32G     etcd node  /data/etcd      /etc/etcd/etcd.conf
etcd1     192.168.1.11  16C  32G     etcd node  /data/etcd      /etc/etcd/etcd.conf
etcd2     192.168.1.12  16C  32G     etcd node  /data/etcd      /etc/etcd/etcd.conf

1.2 File System Requirements

An ext4 or xfs file system is recommended.

1.3 ulimit Adjustment

Add the following to /etc/security/limits.conf:

* soft nofile 65536
* hard nofile 65536

1.4 Kernel Parameter Tuning

Add the following to /etc/sysctl.conf:

net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 5
vm.overcommit_memory = 1
kernel.shmall = 2097152
kernel.shmmax = 2147483648

Apply the settings:

sysctl -p

1.5 Extract and Deploy

# Download etcd
wget https://github.com/etcd-io/etcd/releases/download/v3.5.21/etcd-v3.5.21-linux-amd64.tar.gz
tar -zxf etcd-v3.5.21-linux-amd64.tar.gz
cd etcd-v3.5.21-linux-amd64
cp etcd etcdctl etcdutl /usr/local/bin/

# Create the data and configuration directories (-p also creates missing parents such as /data)
mkdir -p /data/etcd /etc/etcd

1.6 Configuration File etcd.conf

Adjust the tuning parameters below to match your actual hardware and workload.

cat > /etc/etcd/etcd.conf <<EOF
# Node name (unique per node)
name: etcd0 # etcd1, etcd2 ...

# Data directory (use a fast local SSD; do not use NFS or network disks)
data-dir: /data/etcd

# Listen addresses (binding to a specific IP is safer, but 0.0.0.0 is acceptable)
listen-peer-urls: http://0.0.0.0:2380
listen-client-urls: http://0.0.0.0:2379

# Advertised addresses (must be this node's real IP)
advertise-client-urls: http://192.168.1.10:2379
initial-advertise-peer-urls: http://192.168.1.10:2380

# Initial cluster configuration (three nodes)
initial-cluster: etcd0=http://192.168.1.10:2380,etcd1=http://192.168.1.11:2380,etcd2=http://192.168.1.12:2380
initial-cluster-token: milvus-etcd-cluster
initial-cluster-state: new

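# ========================
# 🔧 Key performance tuning for Milvus
# ========================

# 1. Backend quota: caps the backend database size (Milvus metadata can grow quickly).
#    Once the quota is exceeded, etcd raises a NOSPACE alarm and rejects further writes.
quota-backend-bytes: 17179869184 # 16 GiB (default 2 GB; etcd suggests at most 8 GB and warns at startup above that)

# 2. Snapshot frequency: keeps fewer WAL entries and speeds up recovery
snapshot-count: 10000 # default 100000 in v3.5; Milvus is write-heavy, so 10000 to 50000 is suggested

# 3. Heartbeat and election timeout (guideline: election-timeout >= 10 * heartbeat-interval)
heartbeat-interval: 100 # ms (default 100)
election-timeout: 2000 # ms (default 1000; 2000 to 5000 suggested so network jitter does not trigger elections)

# 4. Request size limits (Milvus may issue batch operations)
max-request-bytes: 52428800 # 50 MiB (default 1.5 MiB; Milvus DDL requests can be large)
max-txn-ops: 131072 # default 128; raised to 128K to allow large transactions

# 5. Backend batching (higher write throughput)
backend-batch-limit: 65536 # default 10000; can be raised
# 50 ms (default 100 ms). In the YAML config file this field is read as a Go
# time.Duration, i.e. a nanosecond count, so a bare 50 would mean 50 ns.
# Verify the effective value on your etcd version.
backend-batch-interval: 50000000

# 6. Automatic compaction of old revisions (Milvus does not rely on old revisions, so compaction can be aggressive)
auto-compaction-mode: periodic
auto-compaction-retention: "1h" # compact every hour, keeping 1 hour of history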

# ========================
# 📝 Logging
# ========================
log-outputs:
  - stdout
  - /var/log/etcd.log
enable-log-rotation: true
log-rotation-config-json: |
  {
    "maxsize": 100,
    "maxage": 7,
    "compress": true
  }

# ========================
# ⚠️ Optional: enable metrics (for monitoring)
# ========================
# listen-metrics-urls: http://0.0.0.0:2381
EOF
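
Only the node name and the two advertised URLs differ between members, while initial-cluster is identical everywhere. The following is a minimal sketch of deriving those per-node values from one list. The NODES variable and the helper names initial_cluster and node_settings are hypothetical, introduced here for illustration; the addresses are the ones from the topology in 1.1.

```shell
#!/bin/sh
# Hypothetical "name=ip" list; values taken from the topology table in 1.1.
NODES="etcd0=192.168.1.10 etcd1=192.168.1.11 etcd2=192.168.1.12"

# Build the comma-separated initial-cluster value from NODES.
initial_cluster() {
    out=""
    for pair in $NODES; do
        name=${pair%%=*}
        ip=${pair#*=}
        out="${out:+$out,}${name}=http://${ip}:2380"
    done
    printf '%s\n' "$out"
}

# Print the settings that are specific to one node ($1 = node name).
node_settings() {
    for pair in $NODES; do
        name=${pair%%=*}
        ip=${pair#*=}
        if [ "$name" = "$1" ]; then
            printf 'name: %s\n' "$name"
            printf 'advertise-client-urls: http://%s:2379\n' "$ip"
            printf 'initial-advertise-peer-urls: http://%s:2380\n' "$ip"
            printf 'initial-cluster: %s\n' "$(initial_cluster)"
            return 0
        fi
    done
    echo "unknown node: $1" >&2
    return 1
}

# Show the values to place into /etc/etcd/etcd.conf on etcd1.
node_settings etcd1
```

The output is meant to be compared against, or pasted into, the hand-written etcd.conf on each node; it does not write any file itself.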

1.7 Enable Start on Boot

cat > /etc/systemd/system/etcd.service <<EOF
[Unit]
Description=Etcd Server
After=network.target

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd --config-file=/etc/etcd/etcd.conf
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd, then enable and start the etcd service
systemctl daemon-reload
systemctl enable --now etcd

1.8 Check Cluster Status

The sample output below was captured on a cluster whose members are named rke01 to rke03; your member names and addresses will differ.

etcdctl member list -w table
+------------------+---------+-------+--------------------------+--------------------------+------------+
|        ID        | STATUS  | NAME  |        PEER ADDRS        |       CLIENT ADDRS       | IS LEARNER |
+------------------+---------+-------+--------------------------+--------------------------+------------+
|  37499ff739d6c21 | started | rke03 | http://192.168.1.15:2380 | http://192.168.1.15:2379 |      false |
|  79e7c26cb0fc149 | started | rke02 | http://192.168.1.16:2380 | http://192.168.1.16:2379 |      false |
| b4773de1c1f38771 | started | rke01 | http://192.168.1.11:2380 | http://192.168.1.11:2379 |      false |
+------------------+---------+-------+--------------------------+--------------------------+------------+

Check cluster health (without --endpoints, etcdctl queries only the local endpoint 127.0.0.1:2379; add --cluster to check every member)

etcdctl endpoint health -w table
+----------------+--------+------------+-------+
|    ENDPOINT    | HEALTH |    TOOK    | ERROR |
+----------------+--------+------------+-------+
| 127.0.0.1:2379 |   true | 1.740588ms |       |
+----------------+--------+------------+-------+
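
To check every member explicitly rather than only the local endpoint, the client URLs can be passed through --endpoints. This is a small sketch under the assumption that the members are the three IPs from the topology in 1.1; MEMBER_IPS and the endpoints helper are hypothetical names.

```shell
#!/bin/sh
# Assumed member IPs from the topology in 1.1; adjust to your cluster.
MEMBER_IPS="192.168.1.10 192.168.1.11 192.168.1.12"

# Join the member IPs into the comma-separated form that --endpoints expects.
endpoints() {
    out=""
    for ip in $MEMBER_IPS; do
        out="${out:+$out,}http://${ip}:2379"
    done
    printf '%s\n' "$out"
}

# Call etcdctl only when it is installed, so the sketch is safe to run anywhere.
if command -v etcdctl >/dev/null 2>&1; then
    etcdctl --endpoints="$(endpoints)" endpoint health -w table || true
else
    echo "etcdctl not found; would use --endpoints=$(endpoints)"
fi
```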

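2. Removing and Restoring a Failed ETCD Cluster Node

2.1 Stop the etcd service on the failed node

systemctl stop etcd

2.2 Remove the failed node

Run this on a healthy member, replacing the example ID with the failed member's ID as shown by etcdctl member list:

etcdctl member remove 18f6f0bc5b947c0d

2.3 Check the etcd cluster member status

etcdctl member list

2.4 Clean up the failed node's data

Run this on the failed node:

rm -rf /data/etcd/member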

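2.5 Add the new member to the cluster

Run this on a healthy member. Replace the name and peer URL with those of the node being re-added (for etcd2 in the topology in 1.1 that would be http://192.168.1.12:2380):

etcdctl member add etcd2 --peer-urls=http://172.22.16.85:2380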

2.6 Update the etcd.conf configuration

On the node being re-added, change:

initial-cluster-state: new

to:

initial-cluster-state: existing
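
The same change can be scripted. Below is a minimal sketch assuming the YAML layout of etcd.conf from 1.6; set_cluster_state is a hypothetical helper name, and the demo runs against a temporary file rather than the real /etc/etcd/etcd.conf.

```shell
#!/bin/sh
# set_cluster_state FILE STATE: rewrite the initial-cluster-state line in place.
set_cluster_state() {
    # "sed -i" without a suffix argument is the GNU form; BSD sed needs -i ''.
    sed -i "s/^initial-cluster-state:.*/initial-cluster-state: $2/" "$1"
}

# Demo on a throwaway file instead of the real configuration.
demo=$(mktemp)
printf 'name: etcd2\ninitial-cluster-state: new\n' > "$demo"
set_cluster_state "$demo" existing
cat "$demo"
```

On a real node the call would be set_cluster_state /etc/etcd/etcd.conf existing, ideally after making a backup copy of the file.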

2.7 Start the etcd service

systemctl start etcd

2.8 Check cluster status

etcdctl member list
etcdctl endpoint status --cluster -w table
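
The procedure above removes the failed member before adding its replacement. A short sketch of the quorum arithmetic shows why that order is the safer one: adding first would raise the cluster size to 4 and the quorum to 3 while only 2 members are actually healthy. The helper names quorum and tolerance are illustrative only.

```shell
#!/bin/sh
# quorum N: members that must be reachable in a cluster of N voting members.
quorum() { echo $(( $1 / 2 + 1 )); }

# tolerance N: members that may fail while the cluster can still commit writes.
tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerance "$n")"
done
```

With three members one failure is tolerated. After the remove step the cluster has two members and tolerates none, so the replacement should be added promptly.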

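3. ETCD Cluster Data Recovery

Reference: https://www.cnblogs.com/xishuai/p/docker-etcd.html

3.1 Data Backup

Take the backup while etcd is still running: etcdctl snapshot save and the queries below all need a live endpoint. Stop the etcd service (systemctl stop etcd, on every node of the cluster) only after the snapshot has been saved and copied, right before the restore in 3.2.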

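Query the endpoints

etcdctl member list -w table

Identify the leader node

etcdctl endpoint status --cluster -w table

Count the stored keys (--keys-only prints each key followed by a blank line, so count the non-empty lines)

etcdctl get --prefix "" --keys-only | grep -cv '^$'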

Back up the etcd data to a local file (despite the .tar.gz name, the snapshot is a single database file, not an archive)

etcdctl snapshot save etcd_backup_20240809.tar.gz

Inspect the backup file (etcdctl snapshot status is deprecated in v3.5 in favor of etcdutl)

etcdutl snapshot status etcd_backup_20240809.tar.gz -w table
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 8c54a0ac | 15687011 |      10141 |     4.8 MB |
+----------+----------+------------+------------+

Copy the file to the nodes that will be restored

scp etcd_backup_20240809.tar.gz rke01:/root/
scp etcd_backup_20240809.tar.gz rke02:/root/
scp etcd_backup_20240809.tar.gz rke03:/root/

3.2 Data Restore

On each node, confirm that the copied backup matches the original (same hash and revision)

etcdutl snapshot status etcd_backup_20240809.tar.gz -w table
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 8c54a0ac | 15687011 |      10141 |     4.8 MB |
+----------+----------+------------+------------+
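
When the check is repeated on several nodes, comparing the HASH cell by script is less error-prone than reading the tables by eye. The sketch below parses the table format shown above; snapshot_hash is a hypothetical helper, and the sample text is the output from this guide rather than a live command.

```shell
#!/bin/sh
# snapshot_hash: read "snapshot status -w table" output on stdin, print the HASH cell.
snapshot_hash() {
    awk -F'|' 'NF >= 5 && $2 !~ /HASH/ { gsub(/ /, "", $2); print $2; exit }'
}

# Sample table copied from the output in this guide.
sample='+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 8c54a0ac | 15687011 |      10141 |     4.8 MB |
+----------+----------+------------+------------+'

printf '%s\n' "$sample" | snapshot_hash
```

In practice the status command would be piped into snapshot_hash on each node and the printed values compared with the one taken on the source node.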

Before restoring, make sure etcd is stopped on every node and that the target --data-dir is empty or does not exist (move any old data directory aside first). The --data-dir must be the directory that etcd.conf points to, /data/etcd in this guide.

Restore the rke01 node

etcdutl snapshot restore etcd_backup_20240809.tar.gz \
  --data-dir=/data/etcd --name rke01 \
  --initial-advertise-peer-urls http://192.168.1.11:2380 \
  --initial-cluster-token docker-etcd \
  --initial-cluster rke01=http://192.168.1.11:2380,rke02=http://192.168.1.16:2380,rke03=http://192.168.1.15:2380

Restore the rke02 node

etcdutl snapshot restore etcd_backup_20240809.tar.gz \
  --data-dir=/data/etcd --name rke02 \
  --initial-advertise-peer-urls http://192.168.1.16:2380 \
  --initial-cluster-token docker-etcd \
  --initial-cluster rke01=http://192.168.1.11:2380,rke02=http://192.168.1.16:2380,rke03=http://192.168.1.15:2380

Restore the rke03 node

etcdutl snapshot restore etcd_backup_20240809.tar.gz \
  --data-dir=/data/etcd --name rke03 \
  --initial-advertise-peer-urls http://192.168.1.15:2380 \
  --initial-cluster-token docker-etcd \
  --initial-cluster rke01=http://192.168.1.11:2380,rke02=http://192.168.1.16:2380,rke03=http://192.168.1.15:2380
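
The three restore commands differ only in --name and --initial-advertise-peer-urls, so they can be generated from one node list to avoid copy-and-paste mistakes. This is a dry-run sketch: it prints the commands and does not execute them. SNAPSHOT, DATA_DIR, TOKEN, NODES and restore_cmd are hypothetical names; the values are assumptions based on the backup file name from 3.1, the data directory from 1.1, and the rke01 to rke03 addresses used in this section. It uses etcdutl, which v3.5 recommends over the deprecated etcdctl snapshot restore.

```shell
#!/bin/sh
# Assumed values; adjust to your environment.
SNAPSHOT="etcd_backup_20240809.tar.gz"
DATA_DIR="/data/etcd"
TOKEN="docker-etcd"
NODES="rke01=192.168.1.11 rke02=192.168.1.16 rke03=192.168.1.15"

# Build the --initial-cluster value shared by all members.
cluster=""
for p in $NODES; do
    cluster="${cluster:+$cluster,}${p%%=*}=http://${p#*=}:2380"
done

# restore_cmd NAME: print (not run) the restore command for one member.
restore_cmd() {
    for q in $NODES; do
        if [ "${q%%=*}" = "$1" ]; then
            echo "etcdutl snapshot restore $SNAPSHOT --data-dir=$DATA_DIR --name $1" \
                 "--initial-advertise-peer-urls http://${q#*=}:2380" \
                 "--initial-cluster-token $TOKEN --initial-cluster $cluster"
            return 0
        fi
    done
    echo "unknown node: $1" >&2
    return 1
}

# Print one command per member for review.
for n in $NODES; do
    restore_cmd "${n%%=*}"
done
```

Each printed line is intended to be reviewed and then run on the matching node.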

Start the etcd service

Note: start etcd on all cluster nodes at the same time.

systemctl start etcd
systemctl status etcd

3.3 Check Cluster Status

etcdctl member list
etcdctl endpoint status --cluster -w table

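3.4 Verify the Restored Cluster Data

List all restored data

etcdctl get --prefix ""

Count the keys and compare the result with the key count of the source cluster to judge whether the migration succeeded

etcdctl get --prefix "" --keys-only | grep -cv '^$'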
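
The comparison itself can be scripted so that a mismatch produces a non-zero exit status. A minimal sketch follows; count_keys and compare_counts are hypothetical helpers, and the sample input only imitates the --keys-only format (each key followed by a blank line) with made-up key names.

```shell
#!/bin/sh
# count_keys: count non-empty lines on stdin (one per key in --keys-only output,
# assuming no key contains a newline).
count_keys() { grep -cv '^$'; }

# compare_counts EXPECTED ACTUAL: report whether two key counts match.
compare_counts() {
    if [ "$1" -eq "$2" ]; then
        echo "OK: $2 keys"
    else
        echo "MISMATCH: expected $1 keys, found $2"
        return 1
    fi
}

# Made-up sample data standing in for the output before and after the restore.
before=$(printf '/demo/a\n\n/demo/b\n\n' | count_keys)
after=$(printf '/demo/a\n\n/demo/b\n\n' | count_keys)
compare_counts "$before" "$after"
```

Against a real cluster, before and after would each be filled from etcdctl get --prefix "" --keys-only piped into count_keys.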

Author: 慕容峻才
Article link: https://www.acaiblog.top/ETCD集群部署/
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit 阿才的博客 when reposting.