etcd maintenance

Your etcd pod might fail to start due to exceeded database space.

To maintain the storage resources that the etcd keyspace uses, see Managing etcd clusters. Set the space quota, and complete history compaction and defragmentation.
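
For example, a routine maintenance pass with the etcdctl v3 client resembles the following sketch. The etcdctl3 alias is the one that is configured later in this topic, and the revision handling is illustrative; adjust it for your environment.

  # Get the current revision from the endpoint status (parsing shown for illustration only).
  rev=$(etcdctl3 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
  # Compact away superseded revisions, then defragment to return space to the file system.
  etcdctl3 compact $rev
  etcdctl3 defrag
  # If the space quota was exceeded, clear the raised NOSPACE alarm.
  etcdctl3 alarm disarm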

etcd write-ahead log (WAL) directory maintenance

By default, the etcd WAL directory is set to etcd_wal_dir: /var/lib/etcd-wal in config.yaml. You can change /var/lib/etcd-wal to a centralized remote log directory for persistent logging.
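
For example, the relevant entry in config.yaml resembles the following sketch; the remote path in the comment is only an illustration:

  # Default WAL location; point this at a centralized remote log directory if needed,
  # for example /mnt/remote-logs/etcd-wal.
  etcd_wal_dir: /var/lib/etcd-wal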

The number of WAL files that etcd retains is set in /etc/cfc/pods/etcd.json by the --max-wals flag. For example, if --max-wals=0, the number of retained WAL files is unlimited; if --max-wals=5, at most 5 WAL files are retained. If --max-wals is not set in etcd.json, etcd uses the default value of 5.
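
The etcd.json file defines the etcd pod, so the flag is added to the etcd container arguments. The following excerpt is a hypothetical sketch only; the exact layout and the other flags in your file differ:

  "containers": [
    {
      "name": "etcd",
      "command": [
        "etcd",
        "--name=etcd0",
        "--max-wals=5",
        "..."
      ]
    }
  ]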

To manually set the maximum number of WAL files, for example to 5, complete the following steps:

  1. Log in to one master node of your high availability (HA) environment, or log in to an etcd node if you separated etcd from the master nodes.

  2. Stop etcd by running the following command:

     mv /etc/cfc/pods/etcd.json /etc/cfc/etcd.json
    

    Important: Do not create any backup file under /etc/cfc/pods, because every file in that directory is treated as a pod definition. For example, do not run cp /etc/cfc/pods/etcd.json /etc/cfc/pods/etcd.json.orig.

  3. Run docker ps | grep etcd. No output means that etcd has stopped.

  4. Edit the file /etc/cfc/etcd.json to set --max-wals=5.

  5. Start etcd by running the following command:

     mv /etc/cfc/etcd.json /etc/cfc/pods/etcd.json
    
  6. Run docker ps | grep etcd to check whether etcd is running. The output resembles the following example:

    # docker ps | grep etcd
    fbd4e804a818        e21fb69683f3                          "etcd --name=etcd0 -…"   10 minutes ago      Up 10 minutes       k8s_etcd_k8s-etcd-172.29.214.11_kube-system_b93a2f44fc31e2719f2ec07ae0f1bf43_3
    6de280044570        mycluster.icp:8500/ibmcom/pause:3.1   "/pause"                 12 minutes ago      Up 12 minutes       k8s_POD_k8s-etcd-172.29.214.11_kube-system_b93a2f44fc31e2719f2ec07ae0f1bf43_3
    
  7. Run ps -ef | grep "name\=etcd" | grep max-wals to check whether max-wals is set to 5. The example after these steps shows sample output. Note: You might have to wait a few minutes for the number of WAL files under /var/lib/etcd-wal to be reduced to 5.

  8. Repeat steps 1-7 on each master (etcd) node.
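
For reference, when the flag is applied, the check in step 7 produces output similar to the following sketch; the process IDs and the full argument list are illustrative and truncated:

  # ps -ef | grep "name\=etcd" | grep max-wals
  root      5121  5098  4 09:12 ?        00:01:32 etcd --name=etcd0 ... --max-wals=5 ...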

Refer to etcd settings for more information.

etcd pod failed to start due to exceeded database space

Symptoms:

etcd pods are in the CrashLoopBackOff state. The error log shows the following error message:

  Error from server: etcdserver: mvcc: database space exceeded

Cause:

The etcd storage resources or the etcd WAL directory needs maintenance.

Resolving the problem:

  1. Clean the etcd WAL directory. By default, the directory is set to /var/lib/etcd-wal. You can use df -h | grep etcd-wal to check the storage usage, as shown in the example after these steps.

    If the disk is full, see etcd write-ahead log (WAL) directory maintenance to purge the etcd WAL files.

  2. To release storage space, follow the instructions for defragmentation in Managing etcd clusters.
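
For reference, the check in step 1 and the cleanup might resemble the following sketch. The device name and usage numbers are illustrative, and the etcdctl3 alias is the one that is configured in the next section:

  # Check how full the WAL file system is (values shown are examples only).
  df -h | grep etcd-wal
  /dev/vdb1        10G  9.8G  200M  99% /var/lib/etcd-wal
  # After you purge the WAL files and defragment the keyspace, clear the NOSPACE alarm
  # so that etcd accepts writes again.
  etcdctl3 alarm disarm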

etcd pod failed to start due to inconsistent data

Symptoms:

The etcd pods failed to start. The error log shows the following output:

  2018-12-27 17:54:22.267699 C | raft: 8362bb192cc722e8 state.commit 5801 is out of range [2320232, 2320232]
  panic: 8362bb192cc722e8 state.commit 5801 is out of range [2320232, 2320232]
  goroutine 1 [running]:
  github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc420161420, 0xf975a1, 0x2b, 0xc420058340, 0x4, 0x4)
      /tmp/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15c

Cause:

This error is due to inconsistent etcd data.

Resolving the problem:

  1. Configure etcdctl:

    1. To access your etcd cluster by using the command line interface (CLI), you must install and configure etcdctl, the command line client for etcd. You can obtain the etcdctl binary file from the ibmcom/etcd:v3.2.18 image by running the following command:
      docker run --rm -v /usr/local/bin:/data <cluster_name>.icp:8500/ibmcom/etcd:v3.2.18 cp /usr/local/bin/etcdctl /data
      
    2. Set up the endpoint as one of your available etcd members by running the following command:
      export endpoint=<Endpoint IP address>
      
    3. To use the etcdctl v3 API, set up an alias by running the following command:
      alias etcdctl3="ETCDCTL_API=3 etcdctl --endpoints=https://${endpoint}:4001 --cacert=/etc/cfc/conf/etcd/ca.pem --cert=/etc/cfc/conf/etcd/client.pem --key=/etc/cfc/conf/etcd/client-key.pem"
      
  2. Update the etcd members:

    1. Check your existing etcd cluster members by running the following command. The output might resemble the following example:

      # etcdctl3 member list
      2bc7764897fe35ec, started, etcd1, https://<Member IP address>, https://<Member IP address>:4001
      77a992292013374b, started, etcd0, https://<Member IP address>, https://<Member IP address>:4001
      f0f3d76c8bf22bca, started, etcd2, https://<Member IP address>, https://<Member IP address>:4001
      

      The etcd2 node is the node that failed to start.

    2. On the failed node, etcd2 in this example, stop etcd by running the following command:

        mv /etc/cfc/pods/etcd.json /etc/cfc/etcd.json
      
    3. Run the following command to verify whether etcd is running. If there is no output, etcd is not running:

       docker ps | grep etcd
      
    4. Remove the old etcd2 member by running the following command:

      # etcdctl3 member remove f0f3d76c8bf22bca
      Member f0f3d76c8bf22bca removed from cluster 71e83e6eb99a602f
      
    5. Add the etcd2 member back by running the following command:

      # etcdctl3 member add etcd2 --peer-urls="https://9.111.255.212:2380"
      Member 969909b46db234fe added to cluster 71e83e6eb99a602f
      
      ETCD_NAME="etcd2"
      ETCD_INITIAL_CLUSTER="etcd1=https://9.111.255.206:2380,etcd0=https://9.111.255.130:2380,etcd2=https://9.111.255.212:2380"
      ETCD_INITIAL_CLUSTER_STATE="existing"
      
    6. On the etcd2 node, clean the etcd data and WAL directories by running the following commands:

      # rm -r /var/lib/etcd/*
      # rm -r /var/lib/etcd-wal/*
      
    7. On the etcd2 node, edit the /etc/cfc/etcd.json file (the file that you moved in step 2.2) and add --initial-cluster-state=existing to the etcd arguments.

    8. On the etcd2 node, restart etcd by running the following command:
      mv /etc/cfc/etcd.json /etc/cfc/pods/etcd.json
      
  3. To verify that etcd is running, run the following command:
    docker ps | grep etcd
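
Optionally, from any healthy member you can also confirm that etcd2 rejoined the cluster and that the endpoint is serving requests. The following sketch uses the etcdctl3 alias from step 1:

  # All three members are expected to be listed as started.
  etcdctl3 member list
  # The endpoint that is set in the alias reports its health.
  etcdctl3 endpoint health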