Managing IBM Data Cataloging rules when integrated with CAS

IBM Data Cataloging can integrate with Content-Aware Storage (CAS) to configure file ingestion filtering. With this feature, you can filter files from a data source based on specific criteria, such as file type or content, and ingest them into the corresponding domain.

  1. Ensure that you have IBM Fusion version 2.11.0.
  2. Install or upgrade IBM Data Cataloging to version 2.3.0. For more information, see Installing IBM Data Cataloging.
    Important: If you encounter issues while installing IBM Data Cataloging 2.3.0, see Data Cataloging upgrade issues.
  3. Create a connection to the Scale Remote Filesystem from IBM Data Cataloging with live events enabled. For more information, see IBM Storage Scale data source connections.
  4. Configure file ingestion filtering on Content-Aware Storage by using IBM Data Cataloging. For more information, see Integrating IBM Data Cataloging for file filtering.
  5. When IBM Data Cataloging is enabled to perform file filtering on Content-Aware Storage, a default rule is created automatically. This rule filters files by file type and admits the following types: pdf, doc, docx, txt, html, markdown, md, pptx, jpeg, png, bmp, xslx, asciidoc, adoc, xhtml, csv, webp, wav, ogg, opus, mp3.
    For example:
    apiVersion: datacataloging.ibm.com/v1alpha1
    kind: Rule
    metadata:
      name: rule-allowed-filetypes
      namespace: ibm-data-cataloging
    spec:
      metadata:
        - field: filetype
          operator: in
          value:
            - pdf
            - doc
            - docx
            - txt
            - html
            - markdown
            - md
            - pptx
            - jpeg
            - png
            - bmp
            - xslx
            - asciidoc
            - adoc
            - xhtml
            - csv
            - webp
            - wav
            - ogg
            - opus
            - mp3
  6. To extend the list of supported file types, update the default rule by patching the current Rule object. Because a merge patch replaces the entire value list, include every file type that you want to keep.
    For example, to set the allowed file types to doc, docx, pdf, txt, gif, html, json, md, pptx, bmp, jpeg, png, tiff, and sh:
    DCS_NS=ibm-data-cataloging
    oc patch rule allowed-filetypes -n $DCS_NS --type merge -p '{"spec": {"metadata": [{"field": "filetype","operator": "in","value": ["doc", "docx", "pdf", "txt", "gif", "html", "json", "md", "pptx", "bmp", "jpeg", "png", "tiff", "sh" ]}]}}'
  7. In addition to metadata rule filtering, you can apply a complementary content-based filter by using the regex-based content search capability. For more information, see Identifying the required regex expressions.
    For example, to filter files based on the presence of an email pattern, you can add a content requirement to the default rule. Backslashes in the regex are doubled because the pattern is embedded in a JSON string:
    DCS_NS=ibm-data-cataloging
    oc -n $DCS_NS patch rule allowed-filetypes --type=merge -p '{"spec":{"content":[{"field":"email","operator":"regex","value":"\\b[\\w\\.=-]+@[\\w\\.-]+\\.[\\w]{2,3}\\b"}]}}'
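
The patch commands in steps 6 and 7 embed JSON by hand, which makes it easy to mistype a quote or a backslash. As a minimal sketch, the following shell helpers build equivalent payloads from arguments. The function names (`build_filetype_patch`, `json_escape`, `build_content_patch`) are hypothetical and are not part of IBM Data Cataloging or `oc`. The payload shape is copied from the examples above, and the sketch assumes that a regex must be JSON-escaped before it is placed in the patch.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of any IBM tooling.

# Build a merge-patch payload for the file type rule from a list of types.
# Runs in a subshell so the working variables do not leak to the caller.
build_filetype_patch() (
  items=""
  for t in "$@"; do
    # Append each type as a quoted JSON string, comma-separated.
    items="${items:+$items, }\"$t\""
  done
  printf '{"spec": {"metadata": [{"field": "filetype", "operator": "in", "value": [%s]}]}}\n' "$items"
)

# Escape backslashes and double quotes so a regex can sit inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a merge-patch payload that sets one regex content requirement.
# $1 is the field name, $2 is the raw (unescaped) regex.
build_content_patch() (
  field=$1
  regex=$(json_escape "$2")
  printf '{"spec": {"content": [{"field": "%s", "operator": "regex", "value": "%s"}]}}\n' "$field" "$regex"
)

# Example usage against a cluster (requires oc and a logged-in session):
#   DCS_NS=ibm-data-cataloging
#   oc patch rule allowed-filetypes -n "$DCS_NS" --type merge \
#     -p "$(build_filetype_patch doc docx pdf txt json)"
#   oc patch rule allowed-filetypes -n "$DCS_NS" --type merge \
#     -p "$(build_content_patch email '\b[\w\.=-]+@[\w\.-]+\.[\w]{2,3}\b')"
```

Printing the payload first, for example with `build_filetype_patch pdf txt`, lets you review the JSON before you apply it to the Rule object.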