Troubleshooting
Problem
How to fix the issue about null value in column "end_time" in postgresql database log?
Symptom
The detail log is as below:
2018-01-26 00:00:00 CET xcatdb xcatadm ERROR: null value in column "end_time" violates not-null constraint
2018-01-26 00:00:00 CET xcatdb xcatadm STATEMENT: INSERT INTO LSF_FINISH_JOB (JOB_ID, JOB_ARRAY_INDEX, JOB_NAME, USER_NAME, JOB_NAME_FULL, QUEUE, SUBMIT_TIME, START_TIME, END_TIME, EXEC_HOSTSTR, PRIORITY, FROM_HOST, COMMAND, IN_FILE, OUT_FILE, PROJECT_NAME, CWD, EXECCWD, ASKHOSTSTR, APPLICATION, DESCRIPTION, CLUSTER_NAME, SUBMIT_TIME_GMT, START_TIME_GMT, USER_GROUP, JOB_GROUP, RES_REQ, EXIT_STATUS, FROM_HOST_CLUSTER, PEND_REASONS, JOBPID, LICENSEPROJECT, RESERVETIME, ESTIMATEDRUNTIME, EXECHOME, EXECUSER, MEM, SWAP, NTHREADS, PGIDSTR, PIDSTR, JOB_STATUS, NUM_PROCESSORS, RU_UTIME, RU_STIME, CPU_TIME, NUM_SLOTS, NUM_EXEC_PROCS, CPU_UTIL, WORKLOAD_TYPE, DURATION, PATH, CREATE_TIME_GMT, GRAPHIC_SESSION_ID, GRAPHIC_HOST, APPLICATION_NAME, PATH_ALIAS, NOTIFICATION, BPOST_MSG, RUNTIME, FLOW_ID, END_TIME_GMT, PATH_REMOVABLE, INSERT_SEQ) SELECT JOB_ID, JOB_ARRAY_INDEX, JOB_NAME, USER_NAME, JOB_NAME_FULL, QUEUE, SUBMIT_TIME, START_TIME, END_TIME, EXEC_HOSTSTR, PRIORITY, FROM_HOST, COMMAND, IN_FILE, OUT_FILE, PROJECT_NAME, CWD, EXECCWD, ASKHOSTSTR, APPLICATION, DESCRIPTION, CLUSTER_NAME, SUBMIT_TIME_GMT, START_TIME_GMT, USER_GROUP, JOB_GROUP, RES_REQ, EXIT_STATUS, FROM_HOST_CLUSTER, PEND_REASONS, JOBPID, LICENSEPROJECT, RESERVETIME, ESTIMATEDRUNTIME, EXECHOME, EXECUSER, MEM, SWAP, NTHREADS, PGIDSTR, PIDSTR, JOB_STATUS, NUM_PROCESSORS, RU_UTIME, RU_STIME, CPU_TIME, NUM_SLOTS, NUM_EXEC_PROCS, CPU_UTIL, WORKLOAD_TYPE, DURATION, PATH, CREATE_TIME_GMT, GRAPHIC_SESSION_ID, GRAPHIC_HOST, APPLICATION_NAME, PATH_ALIAS, NOTIFICATION, BPOST_MSG, RUNTIME, FLOW_ID, END_TIME_GMT, PATH_REMOVABLE, INSERT_SEQ FROM LSF_LIVE_JOB a WHERE a.JOB_ID = $1 AND a.JOB_ARRAY_INDEX = $2 AND a.CLUSTER_NAME = $3
Cause
When this issue happens, there are corresponding errors in commondataloader log as below:
2018-01-26 00:00:00.878 CET WARN com.platform.perf.dataloader.lsf.commonjobloader.RecoverJobs.recoverJobs(RecoverJobs.java:112) RecoverJobs: Exception when moving jobs. org.postgresql.util.PSQLException: ERROR: null value in column "end_time" violates not-null constraint
This issue belongs to the FinishedJobThread of commonjobloader.
Diagnosing The Problem
Because there are some dangling records with null end_time in LSF_LIVE_JOB table, to resolve this issue, the dangling records need be found out and be deleted.
Resolving The Problem
According to the below steps, the issue can be resolved:
1. Open the debug log, you can see the detail job ID information for the dangling records (Dangling job ID are 159017 and 159018).
2018-01-26 00:00:00.876 CET DEBUG com.platform.perf.dataloader.lsf.commonjobloader.RecoverJobs.recoverJobs(RecoverJobs.java:89) Dangling job found: 159017:0
2018-01-26 00:00:00.876 CET DEBUG com.platform.perf.dataloader.lsf.commonjobloader.RecoverJobs.recoverJobs(RecoverJobs.java:89) Dangling job found: 159018:0
2. Enter the database admin command, execute this command: #psql -U xcatadm xcatdb
3. Run the below command to find out the dangling records,
xcatdb=> select job_id,job_name,command,submit_time,start_time,end_time from LSF_LIVE_JOB where JOB_ID=159017;
job_id | job_name | command | submit_time | start_time | end_time
--------+----------+----------+---------------------+---------------------+---------------------
159017 | Wal_opti |#!/bin/bash;... | 2017-06-16 09:25:31 | 2017-06-16 09:25:32 |
xcatdb=> select job_id,job_name,command,submit_time,start_time,end_time from LSF_LIVE_JOB where JOB_ID=159018;
job_id | job_name | command | submit_time | start_time | end_time
--------+----------+----------+---------------------+---------------------+---------------------
159018 | Wal_opti |#!/bin/bash;.. | 2017-06-16 09:25:31 | 2017-06-16 09:25:32 |
4. As you see, <1> The commonjobloader debug log show this two records can't be moved to finished table; <2> According to the database checking, you can find that the submit_time and start_time of two records are very old from now. So you can make a conclusion that the two records are dangling records which need be removed from database. After the below database action, the issue is fixed:
xcatdb=> DELETE FROM LSF_LIVE_JOB WHERE JOB_ID = 159017;
xcatdb=> DELETE FROM LSF_LIVE_JOB WHERE JOB_ID = 159018;
Was this topic helpful?
Document Information
Modified date:
17 June 2018
UID
isg3T1026936