Managing IBM Data Cataloging rules when integrated with CAS
IBM Data Cataloging can integrate with Content-Aware Storage (CAS) to configure file ingestion filtering. With this feature, you can filter files from a data source based on specific criteria, such as file type or content, and ingest them into the corresponding domain.
- Ensure that you have IBM Fusion version 2.11.0.
- Install IBM Data Cataloging version 2.3.0. For more information, see Installing IBM Data Cataloging.
  Important: If you encounter issues during the installation of IBM Data Cataloging 2.3.0, see Data Cataloging upgrade issues to resolve them.
- Ensure that a connection to the Scale Remote Filesystem exists from IBM Data Cataloging with live events enabled. For more information, see IBM Storage Scale data source connections.
- Configure file ingestion filtering by using IBM Data Cataloging on Content-Aware Storage. For more information, see Integrating IBM Data Cataloging for file filtering.
- When IBM Fusion Data Cataloging is enabled to perform file filtering on Content-Aware Storage, a
default rule is automatically created. This rule filters files based on file types such as
pdf, doc, docx, txt, html, markdown, md, pptx, jpeg, png, bmp, xslx, asciidoc, adoc, xhtml, csv, webp, wav, ogg, opus, and mp3. For example:

      apiVersion: datacataloging.ibm.com/v1alpha1
      metadata:
        name: rule-allowed-filetypes
        namespace: ibm-data-cataloging
      spec:
        metadata:
          - field: filetype
            operator: in
            value:
              - pdf
              - doc
              - docx
              - txt
              - html
              - markdown
              - md
              - pptx
              - jpeg
              - png
              - bmp
              - xslx
              - asciidoc
              - adoc
              - xhtml
              - csv
              - webp
              - wav
              - ogg
              - opus
              - mp3

- To extend the list of supported file types, you can update the default rule
by patching the current Rule object. Because a merge patch replaces the whole value list, specify every file type that the rule must allow, not only the new ones. For example, to allow the file types
doc, docx, pdf, txt, gif, html, json, md, pptx, bmp, jpeg, png, tiff, and sh:

      DCS_NS=ibm-data-cataloging
      oc patch rule allowed-filetypes -n "$DCS_NS" --type merge -p '{"spec": {"metadata": [{"field": "filetype","operator": "in","value": ["doc", "docx", "pdf", "txt", "gif", "html", "json", "md", "pptx", "bmp", "jpeg", "png", "tiff", "sh"]}]}}'

- In addition to metadata rule filtering, you can perform a complementary filter based on content
using the regex-based content search capability. For more information, see Identifying the required regex expressions. For example, to filter files based on the presence of an email pattern, you can add a content requirement to the default rule. Because the pattern is embedded in a JSON string, each backslash in the regex is doubled:

      DCS_NS=ibm-data-cataloging
      oc -n "$DCS_NS" patch rule allowed-filetypes --type=merge -p '{"spec":{"content":[{"field":"email","operator":"regex","value":"\\b[\\w\\.=-]+@[\\w\\.-]+\\.[\\w]{2,3}\\b"}]}}'
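
Before you patch the rule, it can help to check the content pattern against sample text locally. The following is a minimal sketch, not part of the product: it assumes GNU grep built with -P (PCRE) support, and the helper name matches_email and the sample strings are hypothetical. The regex engine that IBM Data Cataloging uses might differ from PCRE in edge cases, so treat the result as a sanity check only.

```shell
#!/usr/bin/env bash
# Email pattern from the content rule, in plain regex form
# (single backslashes; this string is not embedded in JSON).
EMAIL_REGEX='\b[\w\.=-]+@[\w\.-]+\.[\w]{2,3}\b'

# Hypothetical helper: exits 0 when the given text contains a match.
# Assumes GNU grep with PCRE support (-P).
matches_email() {
  printf '%s\n' "$1" | grep -qP "$EMAIL_REGEX"
}

# Hypothetical sample inputs.
if matches_email "Contact: jane.doe@example.com"; then
  echo "match"
fi

if ! matches_email "no address here"; then
  echo "no match"
fi
```

After the patch is applied, running oc get rule allowed-filetypes -n "$DCS_NS" -o yaml is one way to review the resulting spec, assuming the rule name used in the patch commands.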