Performance tuning for large clusters
Store rrd files on a separate disk
About this task
To improve performance, store the rrd files on a separate disk from the database, and create a symbolic link to the new location.
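The move-and-link step described above might look like the following minimal sketch. The function name and every path in it are hypothetical and not taken from the product. It also assumes that data collection is stopped while the files are moved, so that no rrd file is written mid-copy.

```shell
# Sketch only: move an rrd directory to another disk and leave a symbolic
# link at the old location. Function name and all paths are hypothetical.
# Use absolute paths so that the link resolves from any working directory.
relocate_rrd_dir() {
    src=$1    # current rrd directory, on the same disk as the database
    dest=$2   # new location on the separate disk; must not exist yet

    # Refuse to run if the source is missing or is already a link.
    if [ ! -d "$src" ] || [ -L "$src" ]; then
        echo "source is not a regular directory: $src" >&2
        return 1
    fi
    # Refuse to overwrite anything at the destination.
    if [ -e "$dest" ]; then
        echo "destination already exists: $dest" >&2
        return 1
    fi

    mkdir -p "$(dirname "$dest")" || return 1
    mv "$src" "$dest" || return 1     # copies across file systems if needed
    ln -s "$dest" "$src"              # old path now points at the new disk
}

# Example call with assumed paths (adjust to your installation):
# relocate_rrd_dir /path/to/rra /mnt/rrd_disk/rra
```

Because the old path remains valid through the link, nothing that reads or writes the rrd files should need to be reconfigured.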
Procedure
Increase the database memory
About this task
If you see errors indicating that the database is running out of memory, increase the maximum memory that is allocated to the database.
Database error 1114 (reported by MySQL as "The table is full") usually indicates that the database needs more memory.
Procedure
Enable on-demand rrd file updating for systems with heavy disk I/O
About this task
If you have problems with high disk I/O wait times, enable on-demand rrd file updating. If the high disk I/O wait times persist after you enable it, consider using Spine, which is an add-on Cacti feature.
Procedure
- Go to .
- Select the Performance tab.
- In the On Demand RRD Update Settings section, select Enable On Demand RRD Updating.
- Optionally, change how often your RRD files are updated by modifying the values of the How Often Should Boost Update All RRDs and Maximum Records fields.
- Click Save.
Configure concurrent poller processes
About this task
Procedure
- Go to .
- Click the Poller tab.
- In the General section, change the value for Maximum Concurrent Poller Processes.
- Click Save.
Enable database record partitioning
About this task
Database record partitioning splits the larger LSF job data tables into multiple tables and speeds up processing during database maintenance operations.
Partitioning is required if you have many jobs per day or if you want to keep job summary data for a longer period.
Size the partitions to have a maximum of about 2 million records. When you are sizing the partitions, consider the amount of memory the host has for the database. Each job record occupies 4 KB in the database.
You can also specify the elapsed time period that each partition covers. Increasing the time period means that the database contains more data for overall analysis, but also increases the system impact of removing job records.
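The sizing guidance above can be turned into quick arithmetic. The sketch below uses the 2 million record ceiling and the 4 KB per record figure from the text. The function names and the daily job rate of 250,000 are hypothetical and serve only as an illustration.

```shell
# Approximate size of one partition in MiB (integer arithmetic, rounds down).
partition_size_mib() {
    records=$1        # records per partition
    kb_per_record=$2  # KB per job record (4 in the text above)
    echo $(( records * kb_per_record / 1024 ))
}

# Whole days of job data that fit in one partition at a given daily job rate.
days_per_partition() {
    max_records=$1    # target ceiling per partition
    jobs_per_day=$2   # hypothetical daily job volume
    echo $(( max_records / jobs_per_day ))
}

partition_size_mib 2000000 4       # prints 7812 (roughly 7.6 GiB)
days_per_partition 2000000 250000  # prints 8
```

Under these assumptions, a full partition is a little under 8 GB, which is the figure to compare against the memory that the host has available for the database, and a partition period of about eight days would keep each partition near the 2 million record ceiling.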
Procedure
Configure data collection frequency
About this task
For large clusters, change the data collection frequency. Data collection frequency is configured for each LSF cluster.
If you see continuous errors similar to the following in the cacti.log file, decrease the data collection frequency:
ERROR: Run-On/Abended Process Detected for ClusterName:'Large Cluster', ClusterID:'1',
Process:'GRIDJOBS', PID:'19749', Attempting to Kill PID
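To judge whether these errors are continuous rather than occasional, you could count them in the log. The following is a minimal sketch, not a product tool: the function names are hypothetical, and the location of the cacti.log file depends on your installation.

```shell
# Count run-on/abended process errors in a log file (hypothetical helper).
count_runon_errors() {
    log_file=$1
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so the exit status is ignored here.
    grep -c 'Run-On/Abended Process Detected' "$log_file" || true
}

# Break the same errors down by cluster name.
runon_errors_by_cluster() {
    log_file=$1
    grep 'Run-On/Abended Process Detected' "$log_file" |
        sed -n "s/.*ClusterName:'\([^']*\)'.*/\1/p" |
        sort | uniq -c
}

# Example calls (the log path is an assumption):
# count_runon_errors /path/to/cacti.log
# runon_errors_by_cluster /path/to/cacti.log
```

A count that keeps growing between collection cycles for the same cluster suggests that collection for that cluster is not finishing in time.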
Procedure
Increase LSF API timeout values
About this task
If you see errors in the cacti.log file that indicate the LSF APIs are timing out, increase the timeout value.
Procedure
Enhance the database performance
About this task
Procedure
- Edit the /etc/my.cnf file and change the value of innodb_flush_log_at_trx_commit to 2.
- Restart the mysqld service:
service mysqld restart
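In a MySQL option file, this setting belongs in the [mysqld] section as a line such as innodb_flush_log_at_trx_commit = 2. With a value of 2, the log is written at each commit but flushed to disk only about once per second, so an operating system crash or power failure can lose up to about a second of transactions. The sketch below is a hypothetical helper, not part of the product, for confirming the value in the file before you restart the service.

```shell
# Print the last innodb_flush_log_at_trx_commit value set in an option file.
# Hypothetical helper; it matches only the underscore spelling of the name
# and ignores commented-out lines.
flush_setting() {
    cnf_file=$1
    sed -n 's/^[[:space:]]*innodb_flush_log_at_trx_commit[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$cnf_file" | tail -n 1
}

# Example: flush_setting /etc/my.cnf    # expected to print 2 after the edit
```

If the function prints nothing, the option is not set in that file, and the server uses its default value.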