Hints and tips for using content search
The following are best practices for using the content search feature.
Testing on a subset of documents
Running a content search policy on a set of documents involves several steps: retrieving each document, formatting it as text if necessary, and searching it. Depending on the number, format, and size of the documents, the search can be a time-consuming process.
Therefore, it is best to first test the policy and its expressions on a small subset of documents to determine whether the policy and the regular expressions that you select are correct. One way to run the test is to use a policy filter that selects only a small set of documents. After you confirm that the policy and search criteria operate as expected, you can run the policy against the required set of documents.
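Before running even the subset test, it can help to check a regular expression locally against a few sample strings. The following is a minimal sketch in Python, not part of the product: the sample texts, the file names, and the SSN-like pattern are all illustrative assumptions, and Python's regular expression dialect might differ in details from the one that the content search uses.

```python
import re

# Hypothetical sample texts that stand in for a small subset of documents.
samples = {
    "report.txt": "Contact: jane.doe@example.com, ref 123-45-6789",
    "notes.txt": "No identifiers in this file.",
}

# Illustrative expression (an assumption, not a product default).
pattern = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_matches(texts, regex):
    """Return a dict that maps each sample name to its list of matches."""
    return {name: regex.findall(text) for name, text in texts.items()}


results = find_matches(samples, pattern)
for name, hits in results.items():
    print(name, hits)
```

If the expression matches too much or too little on the samples, it is cheaper to correct it here than after a policy run.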
The test on a subset of documents can also help you estimate how long the policy might take to run on the complete set of documents.
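One simple way to turn the subset timing into an estimate is linear extrapolation. The sketch below assumes that the run time scales roughly with the number of documents, which holds only if the subset is representative in size and format. The function name and the numbers are hypothetical.

```python
def estimate_full_run_seconds(sample_seconds, sample_docs, total_docs):
    """Linearly extrapolate the subset run time to the complete set."""
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    return sample_seconds / sample_docs * total_docs


# Hypothetical numbers: 500 documents took 120 seconds; 200,000 in total.
estimate = estimate_full_run_seconds(120.0, 500, 200_000)
print(f"Estimated run time: {estimate / 3600:.1f} hours")
```

Treat the result as a rough guide only, because documents that need conversion or that are unusually large can take much longer than the average.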
Avoiding retagging
When you rerun a policy against a set of documents that were previously tagged, the documents are retagged. If the returned values differ from those of the previous search, the tags are updated. This difference might occur if the policy, the set of expressions, or the set of documents was modified.
To avoid retagging the documents, add a criterion to the filter that excludes documents that are already tagged.
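The selection logic behind such a filter can be modeled as follows. This is only a conceptual sketch in Python: the record layout and the tag name ssn_found are hypothetical, and the actual criterion must be written in the policy filter syntax, which is not shown here.

```python
# Hypothetical metadata records; "ssn_found" is an illustrative tag name.
records = [
    {"path": "/data/a.txt", "ssn_found": None},
    {"path": "/data/b.txt", "ssn_found": "true"},
    {"path": "/data/c.txt", "ssn_found": None},
]


def untagged(items, tag):
    """Select only the records in which the tag has no value yet."""
    return [r for r in items if r.get(tag) is None]


to_process = untagged(records, "ssn_found")
print([r["path"] for r in to_process])
```

Only the records without a value for the tag are selected, so documents that were tagged in an earlier run are left untouched.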
Modifying regular expressions
If you modify a regular expression, it affects all policies that use that expression. Rerunning these policies might cause the documents to be tagged differently. To avoid changing the behavior of existing policies, create a new regular expression and use it in the specific policies where it is required.
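The effect can be illustrated with a small Python sketch. The dictionary below stands in for a set of named expressions that several policies share. The names, patterns, and sample text are hypothetical.

```python
import re

# Hypothetical named expressions that are shared by several policies.
expressions = {"phone": r"\b\d{3}-\d{4}\b"}

text = "Call 555-1234 or 212-555-9876."

# Result that existing policies get from the shared expression.
before = re.findall(expressions["phone"], text)

# Instead of editing "phone" in place, add a new expression under a new name.
expressions["phone_with_area"] = r"\b\d{3}-\d{3}-\d{4}\b"

# Existing policies still see the same matches as before.
after_existing = re.findall(expressions["phone"], text)

# Only policies that opt in to the new expression see the new behavior.
new_only = re.findall(expressions["phone_with_area"], text)

print(before == after_existing, new_only)
```

Because the original expression is unchanged, rerunning the existing policies produces the same tags as before.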
Converting files with Apache Tika
Data cataloging uses Apache Tika to convert files to text before it searches the content. This conversion affects the overall content search performance. File types that are already text can be identified in the configuration of the contentsearch agent to prevent them from being processed unnecessarily by Apache Tika. The default configuration treats JavaScript Object Notation (JSON) and Variant Call Format (VCF) file types as text. To add more text file types to the configuration, edit the following file:
/opt/ibm/metaocean/data/agents/contentsearch/conf/contentsearch.conf
Add the types to the following line:
text_filetypes=vcf,json
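If the edit is scripted rather than made by hand, it might look like the following Python sketch. It operates on the file content as a string and assumes that the file uses plain key=value lines. Only the text_filetypes key comes from the configuration line shown here. The helper name and the csv and xml types are illustrative assumptions.

```python
def add_text_filetypes(conf_text, new_types):
    """Return conf_text with new_types appended to the text_filetypes line."""
    out = []
    for line in conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "text_filetypes":
            # Keep the existing types and append only the ones that are missing.
            types = [t.strip() for t in value.split(",") if t.strip()]
            for t in new_types:
                if t not in types:
                    types.append(t)
            line = key.strip() + "=" + ",".join(types)
        out.append(line)
    return "\n".join(out) + "\n"


# "csv" and "xml" are example additions; "json" is already present.
updated = add_text_filetypes("text_filetypes=vcf,json\n", ["csv", "json", "xml"])
print(updated, end="")
```

The helper leaves all other lines unchanged and does not write to disk, so the result can be reviewed before the configuration file is replaced.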
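To change the number of Apache Tika instances, scale the deployment. For example, to run three instances:
oc -n spectrum-discover scale --replicas=3 deploy/spectrum-discover-tikaserver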
Apache Tika is resource-intensive, so make sure that the number of Apache Tika instances does not exceed the host resources.
Supported connection types
- Spectrum Scale
- COS
- NFS
- S3
- SMB
For more information, see the topic IBM Spectrum® Scale data source connection in Data Cataloging: Concepts, Planning, and Deployment Guide. See also IBM Storage Scale data source connections.