Hints and tips for using content search

The following are best practices for using the content search feature.

Testing on a subset of documents

Running a content search policy on a set of documents involves several steps: retrieving each document, converting it to text if necessary, and searching its content. Depending on the number, format, and size of the documents, the search can be a time-consuming process.

Therefore, it is best to test the policy and its corresponding expressions on a small subset of documents to determine whether the policy and the regular expressions that you selected are correct. One way to run the test is to use a policy filter that selects only a small set of documents. After you confirm that the policy and search criteria are operating as expected, you can run the policy against the required set of documents.

The test on a subset of documents can also help you estimate how long the policy might take to run on the complete set of documents.
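
The estimate can be sketched as a simple linear extrapolation. This is only a rough illustration: the function name is hypothetical, and it assumes that run time scales linearly with the number of documents and that the subset is representative in format and size, which might not hold for every collection.

```shell
# Hypothetical helper: extrapolate the full run time from a test subset.
# Assumes run time grows linearly with the number of documents.
estimate_full_run_seconds() {
  subset_docs="$1"     # number of documents in the test subset
  subset_seconds="$2"  # measured run time of the test, in seconds
  total_docs="$3"      # number of documents in the complete set
  if [ "$subset_docs" -le 0 ]; then
    echo "subset_docs must be greater than zero" >&2
    return 1
  fi
  # Integer arithmetic; the result is rounded down to whole seconds.
  echo $(( subset_seconds * total_docs / subset_docs ))
}
```

For example, if a test on 500 documents took 120 seconds, `estimate_full_run_seconds 500 120 200000` prints 48000, or roughly 13 hours and 20 minutes for 200,000 documents.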

Avoiding retagging

When you rerun a policy against a set of documents that were previously tagged, the documents are retagged. If the returned values differ from those of the previous search, the tags are updated. This difference might occur if the policy, the set of expressions, or the set of documents is modified.

To avoid retagging the documents, add a criterion to the filter that excludes documents that are already tagged.

Modifying regular expressions

If you modify a regular expression, it affects all policies that use that expression. Rerunning these policies might cause the documents to be tagged differently. To avoid changing the behavior of existing policies, create a new regular expression and use it in the specific policies where it is required.

Converting files with Apache Tika

Data cataloging uses Apache Tika to convert files to text before it searches the content. This conversion has an impact on the overall content search performance.

Therefore, file types that are already in text format must be listed in the contentsearch agent configuration so that Apache Tika does not process them unnecessarily. The default configuration treats JavaScript Object Notation (JSON) and Variant Call Format (VCF) file types as text. To add more text file types to the configuration, edit the following file:
/opt/ibm/metaocean/data/agents/contentsearch/conf/contentsearch.conf
Then add the types to the following line:
text_filetypes=vcf,json
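
The edit can also be scripted. The following is a minimal sketch, not a supported tool: the function name is hypothetical, and it assumes that the file contains a single non-empty `text_filetypes=` line with comma-separated values, as shown above. The source does not state whether the agent must be restarted after the change.

```shell
# Hypothetical helper: append one file type to the text_filetypes line.
# Assumes a single non-empty "text_filetypes=a,b,..." line in the file.
add_text_filetype() {
  conf_file="$1"
  new_type="$2"
  # Do nothing if the type is already listed.
  if grep -Eq "^text_filetypes=(.*,)?${new_type}(,.*)?\$" "$conf_file"; then
    return 0
  fi
  tmp_file="$(mktemp)"
  # Write the modified copy first, then overwrite the original in place
  # so that the file keeps its ownership and permissions.
  sed "s/^text_filetypes=\(.*\)\$/text_filetypes=\1,${new_type}/" "$conf_file" > "$tmp_file" &&
    cat "$tmp_file" > "$conf_file"
  rm -f "$tmp_file"
}

# Example call (csv is only an illustrative type):
# add_text_filetype /opt/ibm/metaocean/data/agents/contentsearch/conf/contentsearch.conf csv
```

Consider backing up the configuration file before you modify it.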

Apache Tika runs in a Kubernetes pod within Data cataloging. You can increase the number of Apache Tika pod instances to improve performance. For example, run the following command to scale the number of Tika instances to three:
oc -n spectrum-discover scale --replicas=3 deploy/spectrum-discover-tikaserver

Apache Tika is resource-intensive, so make sure that the number of Apache Tika instances does not exceed the host resources.
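
The scaling step can be wrapped in a small function that validates the replica count and waits for the rollout to finish. This is a sketch under stated assumptions: the function name is hypothetical, and the namespace and deployment name are taken from the command above. The `oc rollout status` subcommand is part of the standard `oc` client.

```shell
# Hypothetical wrapper around the scale command shown above.
scale_tika() {
  replicas="$1"
  # Reject empty, non-numeric, or zero values.
  case "$replicas" in
    ''|*[!0-9]*|0)
      echo "usage: scale_tika <positive integer>" >&2
      return 1
      ;;
  esac
  # Scale the deployment, then wait until the new pods are rolled out.
  oc -n spectrum-discover scale --replicas="$replicas" deploy/spectrum-discover-tikaserver &&
    oc -n spectrum-discover rollout status deploy/spectrum-discover-tikaserver
}
```

Before you increase the count, you might check the current node load, for example with `oc adm top nodes` if cluster metrics are available, and then call `scale_tika 3`.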

Supported connection types

Content search on Data cataloging supports the following connection types:
  • Spectrum Scale
  • COS
  • NFS
  • S3
  • SMB

For more information, see the topic IBM Spectrum® Scale data source connection in Data Cataloging: Concepts, Planning, and Deployment Guide.