Large Archive Handling Configuration Considerations

This section explains some common configuration considerations when using the large archive handling capability of the Documentum connector.

  • Periodic Clean-up - Large Archive Handling cannot delete the directory trees created by expanding archive files. While individual files are normally deleted, it is possible for some files within the expanded directories to be left behind (most likely due to URL filtering at the Search Collection level - see the "Crawl Filters" item in this list). For this reason, we recommend creating a scheduled task to recursively delete the expanded directories between crawls. These directories can be found within the crawl directory of the Search Collection's data directory and start with "arc-". Here is an example directory path: <DE_ROOT>/data/search-collections/ab1/my-collection/crawl1/arc-*.
  • Errors During Conversion - If any errors occur during the conversion of a file's content, the file's URL cannot be re-crawled by itself, because the file is deleted during the conversion process. You must re-crawl the top-most document (the Documentum object) from which the file was extracted.
  • Special Files - Some archive files, such as .tar files, may contain files that require special processing, such as the following:
    • Named Pipes - Named Pipes cannot be crawled. Attempting to read from a Named Pipe can cause the JVM or Watson™ Explorer Engine to freeze, waiting for input to the pipe. Therefore, to be safe, we delete these files. Currently, stopping a crawl while deleting a named pipe may cause a process node exception for that URL, which may safely be ignored.
    • Symbolic Links - Symbolic links should not be crawled or followed. Crawling them could pose a significant security risk, because a user could create an archive file that contains a symbolic link to any file that Watson Explorer Engine has access to. More commonly, a symbolic link in an archive points to a non-existent file. Therefore, we delete symbolic links when they are detected.
    Important: When expanding archives that may contain symbolic links on Microsoft Windows platforms, the command specified to unpack the archive should also recursively detect and delete symbolic links in the output directory.
  • Delete Files After Using Test It - If you use the "Test It" tool to troubleshoot file conversion within large archives, you should delete the temporary directories created during the process. These are located in the Watson Explorer Engine temporary directory and begin with "arc-". Here is an example: <DE_ROOT>/tmp/arc-*. To make troubleshooting conversion steps easier, these files are not deleted as they normally would be during conversion, so you should delete them manually or on a scheduled basis.
  • Crawl Filters - Because an archive's expanded files are deleted during the conversion process, if a URL for a file is never processed, the temporary file will not be deleted. Conditional Settings, such as the Binary file extensions (filter), and any custom collection filtering, can cause this to occur. Therefore, special care should be taken when doing these types of URL filtering. To help support these scenarios, the Documentum connector provides a hidden set of advanced seed options matching those of the Binary file extensions (filter) conditional setting. One of the fields allows you to define additional wildcard patterns for files (or folders) that will be excluded from the crawl (and thus deleted by the connector rather than leaving files orphaned). Another field allows you to define patterns for files or folders to keep out of the default list shared by the Binary file extensions (filter). These settings should only be used by advanced users (thus they are hidden by default).
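Several items in this list leave "arc-" directories behind (between crawls, after using "Test It", or when URL filtering orphans files). The following Python function is a minimal sketch of the kind of scheduled clean-up task recommended above. It is not part of the connector: the function name remove_expanded_archive_dirs and its min_age_seconds parameter are hypothetical, and the directory to clean must be supplied by you.

```python
import shutil
import time
from pathlib import Path


def remove_expanded_archive_dirs(crawl_dir, min_age_seconds=0.0, prefix="arc-"):
    """Recursively delete directories directly under crawl_dir whose names
    start with prefix (hypothetical helper, not a connector API).

    Directories modified less than min_age_seconds ago are skipped, which
    gives some protection if the task is accidentally run during a crawl.
    Returns the list of removed paths.
    """
    removed = []
    now = time.time()
    for entry in sorted(Path(crawl_dir).glob(prefix + "*")):
        # Only real directories are removed; files and symbolic links
        # that happen to match the prefix are left alone.
        if entry.is_symlink() or not entry.is_dir():
            continue
        if now - entry.stat().st_mtime < min_age_seconds:
            continue
        shutil.rmtree(entry)
        removed.append(entry)
    return removed
```

To use it, call the function with the crawl directory of the Search Collection (or the Watson Explorer Engine temporary directory after using "Test It") from a job scheduled between crawls, for example with cron or the Windows Task Scheduler. Running it while a crawl is in progress may remove directories that are still needed.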
Note: For additional information on large archive handling seed settings, see subsection 4.2.1 of Basic Documentum Connector Configuration.
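The "Special Files" item states that an unpack command on Microsoft Windows should also recursively detect and delete symbolic links in the output directory. The following Python function is a rough sketch of one way such a post-unpack step could look. It is an illustration only, not the connector's own implementation: the name remove_special_files is hypothetical, and the function handles only symbolic links and POSIX named pipes (FIFOs). Other link types, such as Windows junctions, are not covered.

```python
import os
import stat


def remove_special_files(output_dir):
    """Delete symbolic links and named pipes (FIFOs) found under output_dir.

    Hypothetical helper. Links are removed without being followed or read.
    Returns the list of removed paths.
    """
    removed = []
    # followlinks=False (the default) keeps os.walk from descending into
    # directories that are reached through a symbolic link.
    for dirpath, dirnames, filenames in os.walk(output_dir):
        for name in list(dirnames) + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects the entry itself, never the link target,
            # and does not open the file, so a pipe cannot block here.
            mode = os.lstat(path).st_mode
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                os.remove(path)
                removed.append(path)
                if name in dirnames:
                    # Do not try to walk into an entry that was just deleted.
                    dirnames.remove(name)
    return removed
```

A wrapper script could run the archive tool first and then call remove_special_files on the same output directory, so that no symbolic link remains by the time the expanded files are crawled.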