Since the project requires a self-hosted Kubernetes cluster, everywhere a PersistentVolume was needed I had previously been using Local volumes or HostPath directly. After some searching I found that GlusterFS can be deployed inside Kubernetes and act as a StorageProvisioner, automatically handling the creation and deletion of PersistentVolumes; as a user you only need to create and delete the PersistentVolumeClaims you need.
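
As a preview of the end result: once heketi is running, a StorageClass points at its REST endpoint and workloads simply request storage through PVCs. A minimal sketch, assuming hypothetical object names and a placeholder heketi URL (gk-deploy does not create these objects for you):

cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: glusterfs-storage              # hypothetical name
provisioner: kubernetes.io/glusterfs
parameters:
  resturl: "http://10.96.0.100:8080"   # placeholder for the heketi Service URL
  restauthenabled: "false"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc                       # hypothetical claim
spec:
  storageClassName: glusterfs-storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
EOF

With the default Delete reclaim policy, removing the PVC later also removes the dynamically provisioned PersistentVolume.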

Reference project: gluster-kubernetes

The solution is mainly made up of three parts: GlusterFS itself for the storage, heketi for volume management, and the gk-deploy script that ties the deployment together.

Prerequisites

Check the environment against the setup guide.

At least three nodes:

  • node-181: 192.168.136.181/24
  • node-182: 192.168.136.182/24
  • node-183: 192.168.136.183/24

Each node has at least one raw block device:

$ for HOST in node-18{1..3}; do ssh root@$HOST lsblk /dev/sdb; done
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb    8:16   0  16G  0 disk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb    8:16   0  16G  0 disk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb    8:16   0  16G  0 disk

The required firewall ports must be open on each node:

$ firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: ens160
  sources:
  services: ssh dhcpv6-client
  ports: 179/tcp 6443/tcp 2379-2380/tcp 10242-10250/tcp 30000-32767/tcp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

If the GlusterFS ports are not open yet, open them:

$ sudo firewall-cmd --add-port=2222/tcp --add-port=24007/tcp --add-port=24008/tcp --add-port=49152-49251/tcp --permanent
success
$ sudo firewall-cmd --reload
success

The kernel must have the following modules loaded:

  • dm_snapshot
  • dm_mirror
  • dm_thin_pool

Check whether a module is loaded:

$ lsmod | cut -d ' ' -f 1 | grep dm_snapshot
dm_snapshot

If a module is not loaded, load it and make it persistent across reboots:

sudo modprobe dm_mirror
cat <<EOF | sudo tee /etc/modules-load.d/dm_mirror.conf
dm_mirror
EOF
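
To handle all three required modules on a node in one go, the same steps can be wrapped in a small loop (a sketch of the approach above):

for MOD in dm_snapshot dm_mirror dm_thin_pool; do
  sudo modprobe $MOD                                    # load the module now
  echo $MOD | sudo tee /etc/modules-load.d/$MOD.conf    # and on every boot
done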

Each node needs the mount.glusterfs command; if it is missing, install it.

RHEL:

sudo yum -y install glusterfs-fuse

After installing, confirm the version:

$ glusterfs --version
glusterfs 3.12.2

Deployment overview

The administrator needs to provide heketi with the topology of the GlusterFS cluster. Most of the deployment work is handled by the gk-deploy script (fetching it is shown after this list). An overview of the steps the script performs:

  1. Create a ServiceAccount for heketi so that it can communicate securely with the GlusterFS nodes
  2. Deploy GlusterFS as a DaemonSet onto the specified Kubernetes nodes
  3. Deploy a bootstrap heketi instance, deploy-heketi, used to initialize the heketi database
  4. Create the GlusterFS Service and Endpoints, initialize the heketi database by creating a GlusterFS volume, then copy the database onto that volume for the final heketi instance to use
  5. Delete all deploy-heketi related resources
  6. Deploy the final heketi instance
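
The gk-deploy script and the templates it uses live in the gluster-kubernetes repository; assuming its usual layout, fetching them looks like this:

git clone https://github.com/gluster/gluster-kubernetes.git
cd gluster-kubernetes/deploy    # contains gk-deploy, kube-templates/ and topology.json.sample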

Deployment steps

1. Create the topology file

As mentioned above, the administrator must provide the topology of the GlusterFS cluster. It takes the form of a topology file that describes the nodes making up the cluster and the block devices attached to them for heketi's use. The project ships a sample topology file; two things to watch for when writing your own:

  • Make sure the topology file only lists block devices intended for heketi. heketi requires whole block devices, which it will partition and format.
  • The hostnames field is a bit misleading: manage should be the node's hostname, while storage is the IP address the node uses for backend storage traffic.

My topology file:

{
  "clusters": [
    {
      "nodes": [
        {
          "node": {
            "hostnames": {
              "manage": [
                "node-181"
              ],
              "storage": [
                "192.168.136.181"
              ]
            },
            "zone": 1
          },
          "devices": [
            "/dev/sdb"
          ]
        },
        {
          "node": {
            "hostnames": {
              "manage": [
                "node-182"
              ],
              "storage": [
                "192.168.136.182"
              ]
            },
            "zone": 1
          },
          "devices": [
            "/dev/sdb"
          ]
        },
        {
          "node": {
            "hostnames": {
              "manage": [
                "node-183"
              ],
              "storage": [
                "192.168.136.183"
              ]
            },
            "zone": 1
          },
          "devices": [
            "/dev/sdb"
          ]
        }
      ]
    }
  ]
}
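
Before handing this file to the deployment script, it does not hurt to confirm that the JSON actually parses (assuming jq is installed):

jq . topology.json > /dev/null && echo "topology.json parses"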

2. Run the deployment script

  • By default the topology file is expected to sit in the same directory as the deployment script; it can also be given as the first non-option argument.
  • By default the Kubernetes template files are read from the kube-templates directory; a different directory can be given with the -t option.
  • By default GlusterFS itself is not deployed, which allows heketi to be used with an existing GlusterFS cluster. Passing -g deploys GlusterFS as a DaemonSet onto the nodes listed in the topology file, which is what this setup uses (see the invocation below).
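
For the setup described here, with GlusterFS deployed onto the three nodes from the topology file, the invocation is roughly the following (a sketch; run it from the directory containing gk-deploy):

./gk-deploy -g topology.json

Once the script finishes, kubectl get pods should eventually show one glusterfs pod per node plus a heketi pod.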

Problems encountered

1. Pods started by daemonset/glusterfs stay stuck at READY 0/1

Check the daemonset/glusterfs configuration:

$ kubectl get -o yaml daemonset/glusterfs
...
        livenessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - if command -v /usr/local/bin/status-probe.sh; then /usr/local/bin/status-probe.sh
              liveness; else systemctl status glusterd.service; fi
          failureThreshold: 50
          initialDelaySeconds: 40
          periodSeconds: 25
          successThreshold: 1
          timeoutSeconds: 3
        name: glusterfs
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - if command -v /usr/local/bin/status-probe.sh; then /usr/local/bin/status-probe.sh
              readiness; else systemctl status glusterd.service; fi
          failureThreshold: 50
          initialDelaySeconds: 40
          periodSeconds: 25
          successThreshold: 1
          timeoutSeconds: 3
...

Running the probe script inside the pod shows that the failure is caused by gluster-blockd.service not running:

$ kubectl exec glusterfs-74rsl -- /usr/local/bin/status-probe.sh readiness
failed check: systemctl -q is-active gluster-blockd.service

Checking the logs of gluster-blockd.service shows that its dependency rpcbind.service failed to start:

$ kubectl exec glusterfs-74rsl -- journalctl -u gluster-blockd.service
...

Checking the logs of rpcbind.service shows the failure reason is "Dependency failed":

$ kubectl exec glusterfs-74rsl -- journalctl -u rpcbind.service
...

Checking the dependencies of rpcbind.service reveals that port 111 is already in use, although the owning process cannot be found from inside the pod:

$ kubectl exec glusterfs-74rsl -- systemctl cat rpcbind.service
# /usr/lib/systemd/system/rpcbind.service
[Unit]
Description=RPC bind service
DefaultDependencies=no

# Make sure we use the IP addresses listed for
# rpcbind.socket, no matter how this unit is started.
Requires=rpcbind.socket
Wants=rpcbind.target
After=systemd-tmpfiles-setup.service

[Service]
Type=forking
EnvironmentFile=/etc/sysconfig/rpcbind
ExecStart=/sbin/rpcbind -w $RPCBIND_ARGS

[Install]
WantedBy=multi-user.target

$ kubectl exec glusterfs-74rsl -- systemctl cat rpcbind.socket
# /usr/lib/systemd/system/rpcbind.socket
[Unit]
Description=RPCbind Server Activation Socket

[Socket]
ListenStream=/var/run/rpcbind.sock

# RPC netconfig can't handle ipv6/ipv4 dual sockets
BindIPv6Only=ipv6-only
ListenStream=0.0.0.0:111
ListenDatagram=0.0.0.0:111
ListenStream=[::]:111
ListenDatagram=[::]:111

[Install]
WantedBy=sockets.target

$ kubectl exec glusterfs-74rsl -- ss -nplt | grep 111
LISTEN     0      128          *:111                      *:*
LISTEN     0      128         :::111                     :::*

Checking port 111 on the host shows it is held by systemd (pid 1):

$ ss -nplt | grep 111
LISTEN     0      128          *:111                      *:*                   users:(("systemd",pid=1,fd=29))
LISTEN     0      128         :::111                     :::*                   users:(("systemd",pid=1,fd=31))

systemd itself cannot simply be stopped. Some searching shows that stopping rpcbind.socket is enough to release port 111:

sudo systemctl stop rpcbind.socket

To fix this permanently, find out which package owns /usr/lib/systemd/system/rpcbind.socket and remove it:

yum -y erase $(rpm -qf /usr/lib/systemd/system/rpcbind.socket)
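
Since all three nodes were built the same way, the same port conflict most likely exists on each of them; the fix can be rolled out with a loop like the one used earlier (a sketch):

for HOST in node-18{1..3}; do
  ssh root@$HOST 'systemctl stop rpcbind.socket && yum -y erase $(rpm -qf /usr/lib/systemd/system/rpcbind.socket)'
done

After that, deleting the stuck glusterfs pods lets the DaemonSet recreate them, and the probes should start passing.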

2. daemonset/glusterfs cannot create a PV

The glusterfs-74rsl pod terminated abnormally; its logs show an error while running the command pvcreate .... /dev/sdb:

Device /dev/sdb excluded by a filter

Check the LVM configuration file for the filter settings:

$ grep filter /etc/lvm/lvm.conf
  # is used to drive LVM filtering like MD component detection, multipath
  # Configuration option devices/filter.
  # Run vgscan after changing the filter to regenerate the cache.
  # See the use_lvmetad comment for a special case regarding filters.
  # filter = [ "a|.*/|" ]
  # filter = [ "r|/dev/cdrom|" ]
  # filter = [ "a|loop|", "r|.*|" ]
  # filter = [ "a|loop|", "r|/dev/hdc|", "a|/dev/ide|", "r|.*|" ]
  # filter = [ "a|^/dev/hda8$|", "r|.*/|" ]
  # filter = [ "a|.*/|" ]
  # Configuration option devices/global_filter.
  # Because devices/filter may be overridden from the command line, it is
  # not suitable for system-wide device filtering, e.g. udev and lvmetad.
  # Use global_filter to hide devices from these LVM system components.
  # The syntax is the same as devices/filter. Devices rejected by
  # global_filter are not opened by LVM.
  # global_filter = [ "a|.*/|" ]
  # The results of filtering are cached on disk to avoid rescanning dud
  # This is a quick way of filtering out block devices that are not
  # by pvscan --cache), devices/filter is ignored and all devices are
  # scanned by default. lvmetad always keeps unfiltered information
  # which is provided to LVM commands. Each LVM command then filters
  # based on devices/filter. This does not apply to other, non-regexp,
  # filtering settings: component filters such as multipath and MD
  # are checked during pvscan --cache. To filter a device and prevent
  # devices/global_filter.
  # Configuration option activation/mlock_filter.
  # mlock_filter = [ "locale/locale-archive", "gconv/gconv-modules.cache" ]

No active filter is configured. Some searching reveals that a GPT-formatted disk carries a backup partition table at its end, and LVM filters out such devices even when the partition table is empty.

$ parted /dev/sdb print
Model: VMware Virtual disk (scsi)
Disk /dev/sdb: 17.2GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start  End  Size  File system  Name  Flags

Use wipefs -a /dev/sdb to wipe the GPT metadata:

$ sudo wipefs -a /dev/sdb
/dev/sdb: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
/dev/sdb: 8 bytes were erased at offset 0x3fffffe00 (gpt): 45 46 49 20 50 41 52 54
/dev/sdb: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
/dev/sdb: calling ioctl() to re-read partition table: Success

After that, pvcreate can create the physical volume successfully.
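
The same wipe is most likely needed on every node whose /dev/sdb appears in the topology file; the ssh loop from the prerequisites can be reused (a sketch):

for HOST in node-18{1..3}; do ssh root@$HOST wipefs -a /dev/sdb; done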