Restore failure with bootstrap pod error
The postgres bootstrap pod reports an error and the restore does not progress.
Symptoms
The postgres bootstrap pod goes into an error state. Subsequent bootstrap pods come up showing
a status of running, but the restore does not progress. The logs of the postgres bootstrap pod
show:
2022-08-01 15:03:58,337 INFO: Lock owner: None; I am management-09384302-postgres-bootstrap-v6xwz
2022-08-01 15:03:58,337 INFO: not healthy enough for leader race
2022-08-01 15:03:58,337 INFO: bootstrap in progress
ERROR: [125]: remote-0 process on 'management-09384302-postgres-backrest-shared-repo.e2e-automation.svc.cluster.local.' terminated unexpectedly [255]: ssh: connect to host management-09384302-postgres-backrest-shared-repo.e2e-automation.svc.cluster.local. port 2022: Connection timed out
Mon Aug 1 15:03:58 UTC 2022 ERROR: pgBackRest primary Creation: pgBackRest restore failed when creating primary
2022-08-01 15:03:59,010 INFO: removing initialize key after failed attempt to bootstrap the cluster
2022-08-01 15:03:59,026 INFO: renaming data directory to /pgdata/management-09384302-postgres_2022-08-01-15-03-59
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 171, in main
    return patroni_main()
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 139, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.6/site-packages/patroni/daemon.py", line 100, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 109, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.6/site-packages/patroni/daemon.py", line 59, in run
    self._run_cycle()
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 112, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1469, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1343, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1236, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1229, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
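To check whether a pod's logs match this symptom, the two failure messages in the excerpt above can be searched for directly. The following is a minimal sketch, not a supported tool: the function name has_bootstrap_restore_failure is hypothetical, and it assumes the logs are piped in on standard input.

```shell
# Hypothetical helper (not part of the product): reads pod logs on stdin and
# succeeds only if both failure messages from the log excerpt are present.
has_bootstrap_restore_failure() {
    logs=$(cat)
    printf '%s\n' "$logs" | grep -q 'pgBackRest restore failed when creating primary' &&
    printf '%s\n' "$logs" | grep -q 'Failed to bootstrap cluster'
}

# Example use, with POD set to the name of the failed bootstrap pod:
#   kubectl logs "$POD" | has_bootstrap_restore_failure && echo "matches this issue"
```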
Resolution
Delete the oplock ConfigMap for the API Connect ManagementCluster and the management restore
CR, then start the restore again:
kubectl delete cm <cr name>-oplock
kubectl delete mgmtr <restoreCR name>
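The two delete commands above can be wrapped in a small shell function so that both names are supplied once. This is a sketch under assumptions: the function name cleanup_restore, the DRY_RUN variable, and the explicit namespace argument are illustrative additions and do not come from the product. Setting DRY_RUN prints the commands instead of running them.

```shell
# Hypothetical helper: deletes the oplock ConfigMap and the restore CR.
#   $1 = ManagementCluster CR name, $2 = restore CR name, $3 = namespace
# DRY_RUN (an assumption of this sketch) prints the commands without running them.
cleanup_restore() {
    mgmt_cr="$1"
    restore_cr="$2"
    namespace="$3"
    run=${DRY_RUN:+echo}   # expands to "echo" only when DRY_RUN is set and non-empty
    $run kubectl delete cm "${mgmt_cr}-oplock" -n "$namespace"
    $run kubectl delete mgmtr "$restore_cr" -n "$namespace"
}
```

Running the function with DRY_RUN=1 first shows the exact commands that would be issued, which is a cheap way to confirm the names before anything is deleted.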