etcd maintenance

Your etcd pod might fail to start if the etcd database exceeds its space quota.

To maintain the storage resources that the etcd keyspace uses, refer to Managing etcd clusters: set the space quota, then complete history compaction and defragmentation.
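
For reference, the compaction and defragmentation flow might resemble the following sketch. It is not quoted from Managing etcd clusters; it assumes the etcdctl3 alias that is defined later in this topic:

  # Find the current revision, compact the history up to it, then defragment
  rev=$(etcdctl3 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]+')
  etcdctl3 compaction $rev   # discard superseded revision history
  etcdctl3 defrag            # release the freed space back to the file system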

etcd write-ahead log (WAL) directory maintenance

By default, the etcd WAL directory is set to etcd_wal_dir: /var/lib/etcd-wal in the config.yaml file. You can point /var/lib/etcd-wal at a centralized remote log directory for persistent logging.

The etcd WAL retention value is set in /etc/cfc/pods/etcd.json by the --max-wals flag. For example, if --max-wals=0, the number of WAL files that are retained is unlimited. If --max-wals=5, at most 5 WAL files are retained. If no file number is assigned to --max-wals in etcd.json, the default value is 5.
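
For illustration only, --max-wals appears as one entry in the container command array of the static pod manifest. The surrounding entries in this fragment are assumptions about a typical manifest, not a copy of your file:

  "command": [
    "etcd",
    "--name=etcd0",
    "--data-dir=/var/lib/etcd",
    "--max-wals=5"
  ]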

To manually set the file number, for example to 5, follow these steps:

  1. Log in to one master node of your high availability (HA) environment, or log in to your etcd node if you separated etcd from the master.

  2. Stop etcd by running the following command. The kubelet stops the static etcd pod when its manifest is moved out of /etc/cfc/pods:

    mv /etc/cfc/pods/etcd.json /etc/cfc/etcd.json
    

    Important: Do not create any backup file under /etc/cfc/pods, and do not run a command such as cp /etc/cfc/pods/etcd.json /etc/cfc/pods/etcd.json.orig. The kubelet treats every file in that directory as a pod manifest, so a backup copy can cause a duplicate etcd pod to start.

  3. Run the following command to verify that etcd stopped. If there is no output, etcd stopped:

    docker ps | grep etcd
    
  4. Edit the /etc/cfc/etcd.json file to set --max-wals=5.

  5. Start etcd by running the following command:

     mv /etc/cfc/etcd.json /etc/cfc/pods/etcd.json
    
  6. To verify that etcd is running, run the following command:

     docker ps | grep etcd
    

    The output might resemble the following example:

     # docker ps | grep etcd
     fbd4e804a818        e21fb69683f3                          "etcd --name=etcd0 -…"   10 minutes ago      Up 10 minutes       k8s_etcd_k8s-etcd-172.29.214.11_kube-system_b93a2f44fc31e2719f2ec07ae0f1bf43_3
     6de280044570        mycluster.icp:8500/ibmcom/pause:3.1   "/pause"                 12 minutes ago      Up 12 minutes       k8s_POD_k8s-etcd-172.29.214.11_kube-system_b93a2f44fc31e2719f2ec07ae0f1bf43_3
    
  7. Run the following command to check the file number that is assigned to --max-wals:

     ps -ef | grep "name\=etcd" | grep max-wals
    

    Note: You might have to wait a few minutes for the number of WAL files under /var/lib/etcd-wal to be reduced to 5; a quick way to count them is shown in the sketch after these steps.

  8. Repeat steps 1-7 on each master (etcd) node.
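
To confirm that the purge completed, you can count the WAL files directly, as in the following sketch. It assumes that the retained files keep the standard .wal extension; adjust the path if your layout differs:

  find /var/lib/etcd-wal -name "*.wal" | wc -l    # expect 5 or fewer after purging completes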

Refer to etcd settings for more information.

etcd pod failed to start due to exceeded database space

Symptoms:

The etcd pods are in the CrashLoopBackOff state. The error log shows the following error message:

  Error from server: etcdserver: mvcc: database space exceeded
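
If you need to read the log directly on the node, one option is docker logs against the etcd container. This is a sketch; the container ID on your node differs:

  docker ps -a | grep etcd                            # find the etcd container, including exited ones
  docker logs <etcd_container_ID> 2>&1 | tail -n 20   # show the most recent log lines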

Cause:

The etcd storage resources or the etcd WAL directory need maintenance.

Resolving the problem:

  1. Clean the etcd WAL directory. By default, the directory is set to /var/lib/etcd-wal. You can run df -h | grep etcd-wal to check the storage usage.

    If the disk is full, refer to etcd write-ahead log (WAL) directory maintenance to purge the etcd WAL files.

  2. To release storage space, follow the instructions for defragmentation in Managing etcd clusters.
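
When the database space is exceeded, etcd also raises a NOSPACE alarm that blocks writes until it is cleared. The following sketch assumes the etcdctl3 alias that is defined in the next section:

  etcdctl3 alarm list      # reports alarm:NOSPACE while the alarm is active
  etcdctl3 alarm disarm    # clear the alarm so that writes are accepted again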

etcd pod failed to start due to inconsistent data

Symptoms:

The etcd pods fail to start. The error log shows the following output:

  2018-12-27 17:54:22.267699 C | raft: 8362bb192cc722e8 state.commit 5801 is out of range [2320232, 2320232]
  panic: 8362bb192cc722e8 state.commit 5801 is out of range [2320232, 2320232]
  goroutine 1 [running]:
  github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc420161420, 0xf975a1, 0x2b, 0xc420058340, 0x4, 0x4)
      /tmp/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15c

Cause:

This error is due to inconsistent etcd data.

Resolving the problem:

  1. Configure etcdctl:

    1. To access your etcd cluster by using the command line interface (CLI), you must install and configure etcdctl, the command line client for etcd. You can obtain the etcdctl binary file from the ibmcom/etcd:v3.2.18 image by running the following command:

      docker run --rm -v /usr/local/bin:/data <cluster_CA_domain>:8500/ibmcom/etcd:v3.2.18 cp /usr/local/bin/etcdctl /data
      

      Where <cluster_CA_domain> is the certificate authority (CA) domain that was set in the config.yaml file during installation.

    2. Set the endpoint to the IP address of one of your available etcd members by running the following command:

      export endpoint=<Endpoint IP address>
      
    3. To use the etcdctl v3 API, set up an alias by running the following command:

      alias etcdctl3="ETCDCTL_API=3 etcdctl --endpoints=https://${endpoint}:4001 --cacert=/etc/cfc/conf/etcd/ca.pem --cert=/etc/cfc/conf/etcd/client.pem --key=/etc/cfc/conf/etcd/client-key.pem"
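
      To confirm that the alias works, you can run a quick health check; a healthy member reports that it successfully committed a proposal:

      etcdctl3 endpoint health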
      
  2. Update the etcd members:

    1. Check your existing etcd cluster members by running the following command. The output might resemble the following example:

      # etcdctl3 member list
      2bc7764897fe35ec, started, etcd1, https://<Member IP address>:2380, https://<Member IP address>:4001
      77a992292013374b, started, etcd0, https://<Member IP address>:2380, https://<Member IP address>:4001
      f0f3d76c8bf22bca, started, etcd2, https://<Member IP address>:2380, https://<Member IP address>:4001
      

      The etcd2 node is the node that failed to start.

    2. On the failed node, etcd2 in this example, stop etcd by running the following command:

        mv /etc/cfc/pods/etcd.json /etc/cfc/etcd.json
      
    3. Run the following command to verify that etcd stopped. If there is no output, etcd is not running:

       docker ps | grep etcd
      
    4. Remove the old etcd2 member by running the following command:

      # etcdctl3 member remove f0f3d76c8bf22bca
      Member f0f3d76c8bf22bca removed from cluster 71e83e6eb99a602f
      
  3. Add the etcd2 member back by running the following command. The peer-urls value is the IP address of the node that you are recovering.

    # etcdctl3 member add etcd2 --peer-urls="https://9.111.255.212:2380"
    Member 969909b46db234fe added to cluster 71e83e6eb99a602f
    
    ETCD_NAME="etcd2"
    ETCD_INITIAL_CLUSTER="etcd1=https://9.111.255.206:2380,etcd0=https://9.111.255.130:2380,etcd2=https://9.111.255.212:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"
    
  4. On the etcd2 node, clean the etcd data and WAL directories by running the following commands:

    # rm -r /var/lib/etcd/*
    # rm -r /var/lib/etcd-wal/*
    
  5. On the etcd2 node, edit the /etc/cfc/etcd.json file to update the value of --initial-cluster-state to existing.

  6. On the etcd2 node, restart etcd by running the following command:

    mv /etc/cfc/etcd.json /etc/cfc/pods/etcd.json
    
  7. To verify that etcd is running, run the following command:

    docker ps | grep etcd
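
    After the pod starts, confirm that the recovered member rejoined the cluster in the started state by running etcdctl3 member list. The member IDs and addresses in this example reuse values from the earlier steps:

     # etcdctl3 member list
     2bc7764897fe35ec, started, etcd1, https://9.111.255.206:2380, https://9.111.255.206:4001
     77a992292013374b, started, etcd0, https://9.111.255.130:2380, https://9.111.255.130:4001
     969909b46db234fe, started, etcd2, https://9.111.255.212:2380, https://9.111.255.212:4001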