Performing file system scan to collect metadata from IBM Spectrum Scale

You can use the file system scanning tool, IBM Spectrum Scale Scanner, to collect system metadata from IBM Spectrum Scale to be ingested into IBM Spectrum® Discover.

About this task

The IBM Spectrum Scale Scanner tool uses the IBM Spectrum Scale information lifecycle management (ILM) policy engine to obtain the system metadata about the files stored on the file system. The system metadata is written to a file, which is then transferred to the IBM Spectrum Discover master node. The node ingests the file and runs analytics on it to provide search, duplicate file detection, archive data detection, and capacity show-back report generation. The following system metadata is collected from the file system scan:
Key name     Description
Site         The site where the file or object resides.
Platform     The source storage platform that contains the file or object.
Size         The size of the file.
Owner        The owner of the file.
Path         The subdirectory where the data resides.
Name         The name of the data.
Permissions  The permissions for the file (mode).
ctime        The change time of the file metadata (inode).
mtime        The time when the data was last modified.
atime        The time when the data was last accessed.
Filesystem   The name of the IBM Spectrum Scale file system that is storing the data.
Cluster      The name of the IBM Spectrum Scale cluster.
inode        The IBM Spectrum Scale inode that is storing the data.
Group        The Linux® group associated with the file.
Fileset      The file set that stores the file.
Pool         The storage pool where the file resides.
Migstatus    If applicable, indicates whether the data is migrated to tape or object.
migloc       If applicable, indicates the location of the data if migrated to tape or object.
ScanGen      The scan generation, which is useful for tracking rescans.

The IBM Spectrum Scale Scanner tool also collects quota information by calling mmrepquota.

The tool comprises the following files:
  • scale_scanner.py: The tool that starts the IBM Spectrum Scale ILM policy.
  • scale_scanner.conf: The configuration file used to customize the behavior of the scale_scanner.py tool.
  • createScanPolicy: The script that is called internally by the tool.

Procedure

Install the IBM Spectrum Scale Scanner tool by unpacking the utility from the IBM Spectrum Discover node to the required location on the IBM Spectrum Scale cluster node.

  1. Log in to the IBM Spectrum Discover node through Secure Shell (SSH) with the moadmin username and password:
    ssh moadmin@spectrum.discover.ibm.com
  2. Change to the directory that contains the IBM Spectrum Scale scanning utility:
    /opt/ibm/metaocean/spectrum-scale
  3. Copy the createScanPolicy, __init__.py, scale_scanner.conf, and scale_scanner.py files to a node in the IBM Spectrum Scale cluster:
    scp * root@spectrumscale.ibm.com:/my_scanner_directory
    
    createScanPolicy 100% 3320 3.2KB/s 00:00
    __init__.py 100% 427 0.4KB/s 00:00
    scale_scanner.conf 100% 1595 1.6KB/s 00:00
    scale_scanner.py 100% 13KB 13.2KB/s 00:00
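
    Before you edit the configuration, it can help to confirm that all four files arrived in the target directory. The following Python sketch is one way to do that. The helper name missing_scanner_files is hypothetical and is not part of the scanner tool, and the file name __init__.py assumes the standard Python package marker.

```python
from pathlib import Path

# File names taken from the copy step; "__init__.py" assumes the
# standard Python package marker file.
EXPECTED_FILES = (
    "createScanPolicy",
    "__init__.py",
    "scale_scanner.conf",
    "scale_scanner.py",
)


def missing_scanner_files(directory):
    """Return the expected scanner files that are absent from directory."""
    base = Path(directory)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    # /my_scanner_directory is the example target used in the scp command.
    missing = missing_scanner_files("/my_scanner_directory")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All scanner files are present.")
```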
  4. On the IBM Spectrum Scale node where you install the scanning utility, edit the configuration file (scale_scanner.conf) as follows:
    1. Use the IBM Spectrum Discover UI to create a connection to the IBM Spectrum Scale system that you want to scan manually. Then set the filesystem and scandir fields, and optionally the outputdir and site fields, in the [spectrumscale] stanza of the file.
      
      [spectrumscale]
      # Spectrum Scale Filesystem which hosts the scan directory
      # example:  /dev/gpfs0
      filesystem=/dev/gpfs0
      # The directory path on Spectrum Scale Filesystem to perform scan on
      # example: /gpfs0
      scandir=/gpfs0
      # Specifies a global directory to be used for temporary storage during
      # mmapplypolicy command processing. The specified directory must be
      # mounted with read/write access within a shared file system.
      # Setting this value to the mount point of the Scale file system on the
      # IBM Spectrum Scale node works.
      mountpoint=<mount point of the GPFS file system>
      # The directory to store output data from the scan in (default is 
      # scandir)
      outputdir=
      # The site tag to specify a physical location or organization identifier. 
      # If you use this field, remove the comment (#)
      #site=
      
    2. Set the scale_connection, master_node_ip, and username fields in the [spectrumdiscover] stanza of the file.
      Note: scale_connection refers to the name of the IBM Spectrum Scale file system that is scanned and ingested into IBM Spectrum Discover. The scale_connection value must match the value that is defined in the Data Source column of the Data Connections page in the IBM Spectrum Discover GUI.

      The username must be a valid name of the IBM Spectrum Discover user who has the dataadmin role. The username field takes the format of <domain_name>/<username>. To determine a domain and username with the dataadmin role, go to the Access Users page in the IBM Spectrum Discover GUI and click the view for the defined users.

      For the local domain, you do not need to specify the domain in the username field because it is the default domain. For example, for user1 in the local domain with the dataadmin role, enter the following value in the configuration file: username=user1
      
      [spectrumdiscover]
      # Name of the Spectrum Scale connection to scan files from
      # Check using the Spectrum Discover connection manager APIs
      scale_connection=fs3
      # Spectrum Discover Master Node IP
      master_node_ip=203.0.113.23
      # Spectrum Discover user name, having 'dataadmin' role
      # Use format <domain_name>/<username>
      # e.g. username=Scale/scaleuser1
      username=user1
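
      A quick check of the edited file can catch an empty or missing field before the scan starts. The following Python sketch reads configuration text with the standard configparser module and reports the required fields named in the steps above that are not set. The function names check_scanner_conf and split_username are hypothetical, the scanner itself might parse the file differently, and the domain name "local" used as the default is an assumption based on the description of the local domain.

```python
import configparser

# Required fields per stanza, as named in the configuration steps.
REQUIRED = {
    "spectrumscale": ["filesystem", "scandir"],
    "spectrumdiscover": ["scale_connection", "master_node_ip", "username"],
}


def check_scanner_conf(text):
    """Return a list of problems found in scale_scanner.conf-style text."""
    # Interpolation is disabled so that a literal '%' in a value is accepted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    problems = []
    for section, keys in REQUIRED.items():
        if not parser.has_section(section):
            problems.append(f"missing stanza [{section}]")
            continue
        for key in keys:
            if not parser.get(section, key, fallback="").strip():
                problems.append(f"[{section}] {key} is not set")
    return problems


def split_username(value, default_domain="local"):
    """Split '<domain_name>/<username>'; assume the local domain if omitted."""
    domain, sep, user = value.partition("/")
    if not sep:
        return default_domain, value
    return domain, user


if __name__ == "__main__":
    with open("scale_scanner.conf") as handle:
        for problem in check_scanner_conf(handle.read()):
            print(problem)
```

      The file lines must start in the first column when they are parsed this way, because configparser treats indented lines as continuations of the previous value.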
      
      Note: The scanner generates approximately 1 KB of metadata in its output file for every file in the system. For example, a file system with 12 M files produces an output file of approximately 12 GB. By default, the output file is written to the same directory that is being scanned. To write the output to a different location, set the outputdir field.
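
      The sizing rule in the note can be turned into a rough pre-flight check. The following Python sketch estimates the output size from the file count and compares it with the free space in the output directory. The function names and the figure of 1000 bytes per file are assumptions that approximate the 1 KB rule of thumb. Actual output size can differ.

```python
import shutil

# Rule of thumb from the note: about 1 KB of metadata per scanned file.
BYTES_PER_FILE = 1000


def estimated_output_bytes(file_count, bytes_per_file=BYTES_PER_FILE):
    """Estimate the scanner output size in bytes for file_count files."""
    return file_count * bytes_per_file


def has_room_for_scan(output_dir, file_count):
    """Return True if output_dir has enough free space for the estimate."""
    free_bytes = shutil.disk_usage(output_dir).free
    return free_bytes >= estimated_output_bytes(file_count)


if __name__ == "__main__":
    files = 12_000_000
    print(f"Estimated output: {estimated_output_bytes(files) / 1e9:.0f} GB")
```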
  5. Run the scan by using the following command:
    ./scale_scanner.py
    Note: While the ./scale_scanner.py command is running, you can start another scan. If you do, make sure that the second scan uses a different connection that is online and is not currently being scanned. While a scan is running, the Scan now option is hidden automatically.
    Note: When you run the scale_scanner.py script, you are prompted for the password of the IBM Spectrum Discover user that is set in the username field of the [spectrumdiscover] stanza in the scale_scanner.conf file. Enter the correct password for that user. As described in the configuration file, the user must be a valid user in the IBM Spectrum Discover authentication service (Access management) and must be assigned the dataadmin role.
    For example:
    $ ./scale_scanner.py
    Enter password for SD user 'user1':
    Scale Scan Policy is created at: ./scanScale.policy
    
    Note:
    • After you see a line similar to “0 ‘skipped’ files and/or errors”, press Enter to return to the command prompt.
    • The scan takes approximately 2 minutes 30 seconds for every 10 M files on the following configuration:
      x86-based IBM Spectrum Scale cluster
      • 4 M4 NSD client nodes
      • 2 M4 NSD server nodes
      • DCS3700 with 350 2 TB NL-SAS drives and 20 200 GB SSDs
      • QDR InfiniBand cluster network
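
    The timing figure above can be scaled to other file counts for a rough planning estimate. The following Python sketch assumes that scan time grows linearly with the number of files and that the hardware is comparable to the reference configuration. Both are assumptions, and the function name estimated_scan_seconds is hypothetical.

```python
# Reference rate from the note: about 150 seconds per 10 M files
# on the listed configuration.
SECONDS_PER_BATCH = 150
FILES_PER_BATCH = 10_000_000


def estimated_scan_seconds(file_count):
    """Estimate scan duration in seconds, assuming linear scaling."""
    return file_count * SECONDS_PER_BATCH / FILES_PER_BATCH


if __name__ == "__main__":
    seconds = estimated_scan_seconds(40_000_000)
    print(f"Estimated scan time: {seconds / 60:.0f} minutes")
```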