Performance tuning for large clusters

Store rrd files on a separate disk

About this task

To improve performance, store the rrd files on a separate disk from the database, and create a symbolic link to the new location.

Procedure

  1. Create a directory on a separate disk.

    For example:

    mkdir /newDirectory
    
  2. Copy files to your new directory.

    For example:

    cp -p /opt/cacti/rra/* /newDirectory
    
  3. Make a backup of your existing files and create a symbolic link to your new directory.

    For example:

    mv /opt/cacti/rra /opt/cacti/rra.bak
    ln -s /newDirectory /opt/cacti/rra
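
The three steps above can be combined into one script. The following is a minimal sketch, not a supported tool: the function name relocate_rra and its two arguments are hypothetical, and it assumes that nothing is writing to the rrd files while it runs.

```shell
#!/bin/sh
# relocate_rra SRC DEST: hypothetical helper that copies the files in SRC
# to DEST, keeps the original directory as SRC.bak, and leaves a symbolic
# link at SRC that points to DEST.
relocate_rra() {
    src="$1"
    dest="$2"
    # A symbolic link at SRC means the move was already done; stop here.
    if [ -L "$src" ]; then
        echo "$src is already a symbolic link" >&2
        return 1
    fi
    if [ ! -d "$src" ]; then
        echo "$src is not a directory" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # Copy the files, preserving mode and timestamps (and ownership
    # where the calling user is permitted to set it).
    cp -p "$src"/* "$dest"/ || return 1
    mv "$src" "$src.bak" || return 1
    ln -s "$dest" "$src"
}
```

With the paths from the example above, the call would be relocate_rra /opt/cacti/rra /newDirectory. Use an absolute path for the new directory so that the symbolic link resolves from any working directory.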
    

Increase the database memory

About this task

If you see errors that the database is running out of memory, increase the maximum memory that is allocated to the database.

Database error 1114 (reported by MySQL as "The table ... is full") usually indicates that the database needs more memory.

Procedure

  1. Edit the /etc/my.cnf file and increase the max_heap_table_size value.
  2. Restart the mysqld service:
    service mysqld restart
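
Before you edit the file, it can help to see the value that is currently configured. The following is a minimal sketch: the function name mycnf_value is hypothetical, it assumes simple "key = value" lines, and it matches only the exact spelling of the key that you pass in.

```shell
# mycnf_value FILE KEY: hypothetical helper that prints the last value
# assigned to KEY in a my.cnf-style file (lines of the form "key = value").
# Prints nothing if KEY is not set in FILE.
mycnf_value() {
    awk -F= -v key="$2" '
        /^[ \t]*[#;]/ { next }              # skip comment lines
        {
            k = $1
            gsub(/[ \t]/, "", k)            # strip whitespace around the key
            if (k == key) {
                v = $2
                gsub(/^[ \t]+|[ \t]+$/, "", v)
                val = v                     # a later assignment overrides an earlier one
            }
        }
        END { if (val != "") print val }
    ' "$1"
}
```

For example, mycnf_value /etc/my.cnf max_heap_table_size prints the configured value, or nothing if the setting is absent and the server default applies.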
    

Enable on-demand rrd file updating for systems with heavy disk I/O

About this task

If you have problems with high disk I/O wait times, enable on-demand rrd file updating. If the problems persist after you enable it, consider using Spine, an add-on Cacti feature.

Procedure

  1. Go to Console > Configuration > RTM Settings.
  2. Select the Performance tab.
  3. In the On Demand RRD Update Settings section, select Enable On Demand RRD Updating.
  4. Optionally, change how often your RRD files are updated by modifying the values of the How Often Should Boost Update All RRDs and Maximum Records fields.
  5. Click Save.

Configure concurrent poller processes

About this task

Procedure

  1. Go to Console > Configuration > RTM Settings.
  2. Click the Poller tab.
  3. In the General section, change the value for Maximum Concurrent Poller Processes.
  4. Click Save.

Enable database record partitioning

About this task

Database record partitioning splits the larger LSF job data tables into multiple tables and speeds up processing during database maintenance operations.

Partitioning is required if you have many jobs per day or if you want to extend the time to keep job summary data.

Size the partitions to have a maximum of about 2 million records. When you are sizing the partitions, consider the amount of memory the host has for the database. Each job record occupies 4 KB in the database.
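
As a worked example of the sizing guidance above, 2,000,000 records at 4 KB each come to 8,000,000 KB, or roughly 7.6 GB per partition. The following sketch does the same arithmetic in the shell. The function names are hypothetical, the results use integer division, and the jobs-per-day estimate assumes one record per job.

```shell
# Sizing figures from the guidance above: about 2 million records per
# partition and 4 KB per job record.
RECORDS_PER_PARTITION=2000000
KB_PER_RECORD=4

# partition_mb RECORDS: approximate size in MB of a partition that holds
# RECORDS job records (integer division, so the result is rounded down).
partition_mb() {
    echo $(( $1 * KB_PER_RECORD / 1024 ))
}

# partition_days JOBS_PER_DAY: whole days of job data that fit in one
# partition of RECORDS_PER_PARTITION records, assuming one record per job.
partition_days() {
    echo $(( RECORDS_PER_PARTITION / $1 ))
}
```

For example, partition_mb 2000000 prints 7812, and for a hypothetical cluster that runs 250,000 jobs per day, partition_days 250000 prints 8.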

You can also specify what elapsed time period to use for each partition. Increasing the time period means that the database contains more data for overall analysis, but also increases the system impact of removing job records.

Procedure

  1. Go to Console > Configuration > RTM Settings.
  2. Select the Maint tab.
  3. In the Large System Settings section, select the Enable Record Partitioning check box.
  4. Specify the elapsed time period for each partition in the Partition Size field.

  5. Specify the maximum number of partitions to keep in the database in the Maximum Partitions field.

    When the number of partitioned tables reaches this number, RTM deletes the oldest partitioned table before it creates a new partitioned table. Increasing the number of partitions means that the database contains more data for overall analysis but also increases the size of the database.

  6. Click Save.

Configure data collection frequency

About this task

For large clusters, change the data collection frequency. Data collection frequency is configured for each LSF cluster.

If you see continuous errors similar to the following in the cacti.log file, decrease the data collection frequency:

ERROR: Run-On/Abended Process Detected for ClusterName:'Large Cluster', ClusterID:'1',
Process:'GRIDJOBS', PID:'19749', Attempting to Kill PID
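
To judge whether these errors are continuous rather than occasional, you can count them in the log. The following is a minimal sketch: the function name count_runon_errors is hypothetical, and it counts only the lines that contain the "Run-On/Abended Process Detected" text shown above.

```shell
# count_runon_errors LOGFILE: hypothetical helper that prints the number
# of lines in LOGFILE that report a run-on or abended collection process.
count_runon_errors() {
    # grep -c prints the count of matching lines (0 if there are none).
    grep -c 'Run-On/Abended Process Detected' "$1"
}
```

Pass the path to your cacti.log file as the argument. Running the function at intervals and comparing the counts shows whether new errors are still being logged.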

Procedure

  1. Go to Console > Clusters > Clusters.
  2. Select the name of the LSF cluster you want to modify.
  3. Click the Poller tab.
  4. Modify values for the following fields:
    • In the Queue/Host/Load Collection Settings section: Collection Frequency and Max Allowed Runtime.
    • In the Job Collection Settings section: Minor Collection Frequency, Major Collection Frequency, and Max Allowed Runtime.
  5. Click Save.

Increase LSF API timeout values

About this task

If you see errors in the cacti.log file that indicate the LSF APIs are timing out, increase the timeout values.

Procedure

  1. Go to Console > Clusters > Clusters.
  2. Select the name of the LSF cluster you want to modify.
  3. Click the Advanced tab.
  4. In the Cluster Connection Timeout Settings section, modify values for the following fields:
    • Base Timeout
    • Batch Timeout
    • Batch Job Info Timeout
    • Batch Job Info Retries
  5. Click Save.

Enhance the database performance

About this task

Set innodb_flush_log_at_trx_commit to 2 in my.cnf so that the database log is flushed to disk about once per second instead of at every transaction commit. This change reduces the amount of random disk I/O and thus increases RTM's scalability.

Procedure

  1. Edit the /etc/my.cnf file and change the value of innodb_flush_log_at_trx_commit to 2.
  2. Restart the mysqld service:
    service mysqld restart
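
The edit in step 1 can also be scripted. The following is a minimal sketch: the function name set_flush_log_mode is hypothetical, and it assumes that the setting already appears in the file on a line of its own. If the setting is absent, the function changes nothing and reports that the line must be added under the [mysqld] section by hand.

```shell
# set_flush_log_mode FILE: hypothetical helper that rewrites an existing
# innodb_flush_log_at_trx_commit line in FILE to the value 2.
# The original file is kept as FILE.bak.
set_flush_log_mode() {
    key='innodb_flush_log_at_trx_commit'
    # Do nothing if the setting is not present in the file.
    if ! grep -q "^[[:space:]]*$key[[:space:]]*=" "$1"; then
        echo "$key not found in $1; add it under [mysqld] manually" >&2
        return 1
    fi
    # Edit in place and keep a backup copy with the .bak suffix.
    sed -i.bak "s/^[[:space:]]*$key[[:space:]]*=.*/$key = 2/" "$1"
}
```

For example, set_flush_log_mode /etc/my.cnf updates the file and leaves the previous version in /etc/my.cnf.bak. The change takes effect after the mysqld service is restarted.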