IBM Support

Managing database size with HPC 3.2

Troubleshooting


Problem

With HPC 3.x version customers experience sudden growth of database which leads to partition fill up. This technote will provide some background knowledge on how to mitigate the risk and maintain database.

Symptom

Sudden fill up of /var partition on the head node leading to cluster going down as job data are not able to be appended into database

Cause

Size of database can increase to various reasons. The scope of the document is provide some insight on what needs to be done to avoid such incidence. For root cause analysis customer must reach our support people.

Environment

HPC 3.1, HPC 3.2

Diagnosing The Problem

Step 1: Identify

#du -h --max-depth=1  /var/lib/pgsql/9.1/| sort -nr

this command will show which folder is taking maximum space. Typically it will be data partition, if yes repeat the same command inside data partition

#du -h --max-depth=1  /var/lib/pgsql/9.1/data | sort -nr


Step 2: Check purger service:

#perfadmin list

the purger service should be running, if stopped start the service

#perfadmin start purger

This will start the purger service that will start clearing the old job information at scheduled interval (default midnight) that is stored in the database. Please note if the purger service was stopped for a long period of time it will take some time for the purger to clear out the database depending on the size (even days).

Check the log file for any errors in purger service:


#tail -f /usr/share/pmc/perf/logs/purger.log.<master_hostname>

Step 3. Check size of the tables to determine which table is occupying the largest size

3.a. Logon to hpcdb as hpcadmin

#psql -U hpcadmin -d hpcdb

where hpcadmin is the username for database admin.

Please note the following commands although undestructive please use with caution, if unsure please contact IBM Technical Support.

3.b. On the sql prompt, type the following command:


SELECT pg_size_pretty(pg_database_size('hpcdb'));
The output will show how big the database is.

3.c. Now to determine the table which is taking up most space copy and paste the entire command below:
SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'
    AND nspname !~ '^pg_toast'
  ORDER BY pg_total_relation_size(C.oid) DESC
  LIMIT 20;

This should provide the table taking up the more space.


Once the biggest table is determined confirm with following commands :
hpcdb=>select pg_size_pretty(pg_relation_size('tablename'));
hpcdb=> select pg_size_pretty(pg_relation_size('tablename'));

for example:
hpcdb=>select pg_size_pretty(pg_relation_size('lsb_events'));
hpcdb=> select pg_size_pretty(pg_relation_size('lsf_bjobs'));

Step 4: Check the number of roles in the database:
Use the following SQL command to check roles:
select count(*) from pmc_role;
By default the count will be 4 if there are numerous records this can cause issues with database.
For example, a count showing 100 roles

Step 5: check errors in postgres database:
Check startup log for postgres under
# /var/lib/pgsql/9.1/data/pg_log
see if there are any errors or there are too many jobs running as this will definitely cause problems in size

Resolving The Problem

For Step 1 above:

If the size on pg_wal folder is higher then follow the instructions from this Technote:


http://www.ibm.com/support/docview.wss?uid=isg3T1015637

For Step 3:
This will require some analysis depending on which table size is taking up more space. Please contact IBM technical support for resolution.

General Maintenance:
The following settings can be tweaked to reduce database table sizes:

A. Configure finished jobs to be purged within a shorter time in pmc.conf, add the following lines in /usr/share/pmc/gui/conf/pmc.conf
FINISH_JOB_TIME_TO_LIVE=14 (days)
this value can be changed to 7 days

B. Turn off dynamic job information in the job list in pmc.conf, add the following line in /usr/share/pmc/gui/conf/pmc.conf
ENABLE_DYNAMIC_JOB_ATTRIBUTES=N

C. Reduce the Duration parameter to 7 (from 14) days, in all XML configuration files for purger.
#ls /usr/share/pmc/perf/conf/purger/*.xml
/usr/share/pmc/perf/conf/purger/purger_bjobs.xml
/usr/share/pmc/perf/conf/purger/purger_ego_rawdata.xml
/usr/share/pmc/perf/conf/purger/purger_hpc_rawdata.xml
/usr/share/pmc/perf/conf/purger/purger_lsf_basic_rawdata.xml

Backup and restore database:
Please refer to following developerworks article for instruction on how to backup and restore your database
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/IBM%20Platform%20HPC%20Wiki/page/Back%20up%20and%20restore%20DB

Analyse and Vaccum database:
Following commands can be used to analyse whether your database needs to be vaccumed.
*Please note vaccuming a database is a CPU intensive operation depending on database size and sometimes will require exclusive lock on tables. It is highly recommended that the following commands are used with caution*.

From the master node shell prompt:
#psql -U postgres -d hpcdb -c analyze;

#psql -U postgres -d hpcdb -c vacuum;

Read the description of the SQL command VACUUM for details by typing the following on the shell prompt
#vacuumdb --help
vacuumdb cleans and analyzes a PostgreSQL database.

Scheduling Cron for database maintenance:

To check the size of the database (hpcdb) the following command can be setup as a script for a weekly cronjob

#sqlrunner -q "SELECT pg_size_pretty(pg_database_size('hpcdb'));"

for example:
#sqlrunner -q "SELECT pg_size_pretty(pg_database_size('hpcdb'));" >/tmp/dbsize.txt && mail -s "size of hpcdb" username@yourdomain.com </tmp/dbsize.txt

To check the size of the table "lsf_bjobs" : (please note you will need to provide the password for user hpcadmin)
#psql -U hpcadmin -d hpcdb -c "select pg_size_pretty(pg_relation_size('lsf_bjobs'));"

Schedule a cron job to analyse and vaccum database:

With PostgreSQL command-line, we can write a script and schedule it to run at midnight with cron command.
example:
#!/bin/bash
#this is a simple script to run vacuumdb
echo "----------------------------" >>/var/log/vacuumd.log
date >>/var/log/vacuumd.log
echo "----------------------------" >>/var/log/vacuumd.log
/usr/bin/vacuumdb -f -v -d <databasename> &&>>/var/log/vacuumd.log

Requirement: password file for postgres user
Please refer to the following documentation for information:
http://www.postgresql.org/docs/current/static/libpq-pgpass.html

Steps to create password file
a. create $HOME/.pgpass file
for example:
#echo "localhost:*:*:postgres:postgrespwd" > $HOME/.pgpass

#echo "` chmod 0600 $HOME/.pgpass `"

#cat .pgpass
hostname:port:database:username:password
localhost:*:*:postgres:<password>

# ls -la .pgpass
-rw------- 1 root root 32 Jul 10 13:02 .pgpass

Examples:
#cat sqlvaccumexample1.sh
#!/bin/bash
/usr/bin/vacuumdb -f -v -U postgres -d kusudb -h localhost -p 5432

#cat sqlvaccumexample2.sh
#!/bin/bash
/usr/bin/vacuumdb -f -v -U postgres -d kusudb


[{"Product":{"code":"SSENRW","label":"Platform HPC for System x"},"Business Unit":{"code":"BU010","label":"Systems - Storage"},"Component":"Database-Postgres","Platform":[{"code":"PF016","label":"Linux"}],"Version":"3.0.1;3.1;3.2","Edition":""},{"Product":{"code":"SSDV85","label":"Platform Cluster Manager"},"Business Unit":{"code":"BU010","label":"Systems - Storage"},"Component":"Database-Postgres","Platform":[{"code":"PF016","label":"Linux"}],"Version":"3.1;3.2","Edition":""}]

Document Information

Modified date:
17 June 2018

UID

isg3T1021098