Support for crawling archive files

The crawlers can extract files from an archive file (such as a ZIP or TAR file) so that individual files in the archive can be indexed and searched.

Supported archive file formats

The following archive file formats are supported:
Table 1. Archive file formats supported by crawlers
File extension | MIME type | Data type | Notes
.zip, .ZIP | application/zip | zip | Depends on the capabilities of the java.util.zip package. Supports deflated (method 8) compression, with these limits:
  • No support for encrypted files
  • No support for zip64
.tar | application/tar | tar | Supported tar formats:
  • GNU tar 1.13
  • POSIX 1003.1-1998 (ustar)
  • POSIX 1003.1-2001 (pax)
.tar.gz, .tgz | not applicable | tgz | Depends on the capabilities of the java.util.zip package
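
Because ZIP support depends on the standard java.util.zip package, the following minimal sketch shows how that package enumerates the entries of a deflated ZIP file. It is an illustration only, not the crawler's implementation: the class name ZipEntryLister and the in-memory sample archive are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not part of the crawler product.
class ZipEntryLister {

    // Builds a small in-memory ZIP file so the sketch runs without external files.
    // ZipOutputStream uses the deflated method (method 8) by default.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry("docs/readme.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("docs/notes.txt"));
            zip.write("world".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    // Returns the names of the file entries in archive order, skipping directories.
    static List<String> listFileEntries(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(listFileEntries(sampleZip()));
    }
}
```

Each name returned here corresponds to one file that could be extracted and indexed individually.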

Restrictions and guidelines

Automatic code page detection is not available for files that are extracted from an archive file. When extracting the files, the crawler uses the code page setting that it is configured to use with plain text and unknown document types. When you use the administration console to configure language and code page settings for a crawler, you specify the code page that the crawler should use for plain text documents and for documents whose code page cannot be detected automatically.
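
The effect of a fixed code page can be sketched in a few lines of Java. This is a hedged illustration, not the crawler's code: the class ConfiguredCodePageDecoder and its decode method are hypothetical names, and the sample bytes are invented for the example. It shows that extracted bytes are decoded correctly only when the configured code page matches the encoding the file was written in.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of the crawler product.
class ConfiguredCodePageDecoder {

    // Decodes extracted bytes with the configured code page; no detection is attempted.
    static String decode(byte[] extractedBytes, String configuredCodePage) {
        return new String(extractedBytes, Charset.forName(configuredCodePage));
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the final byte 0xE9 is "é".
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};

        // The configured code page matches the file's encoding.
        System.out.println(decode(latin1, "ISO-8859-1"));

        // Mismatch: 0xE9 is not valid UTF-8 here, so a replacement character appears.
        System.out.println(decode(latin1, "UTF-8"));
    }
}
```

If an archive mixes files in several encodings, a single configured code page cannot decode all of them correctly, which is worth considering when you choose the setting.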

To determine when files in an archive file need to be recrawled, the crawler uses the modified date in the archive entry header data for each file. When you monitor a crawler, the statistics that are shown for crawled documents, including statistics for inserted, updated, and deleted documents, include information about files that were extracted from archive files.
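
A recrawl decision based on the entry's modified date might look like the following sketch. The comparison rule, the class name RecrawlCheck, and the lastCrawlMillis value are assumptions for illustration; the crawler's actual logic is not documented here. The sketch uses ZipEntry.getTime() from java.util.zip, which returns -1 when no modification time is recorded.

```java
import java.util.zip.ZipEntry;

// Hypothetical helper for illustration; not part of the crawler product.
class RecrawlCheck {

    // Returns true when the entry was modified after the last crawl.
    // lastCrawlMillis is an assumed value that a crawler would persist between runs.
    // Assumption: an entry without a recorded modification time is always recrawled.
    static boolean needsRecrawl(ZipEntry entry, long lastCrawlMillis) {
        long modified = entry.getTime(); // -1 when the header carries no timestamp
        return modified == -1 || modified > lastCrawlMillis;
    }

    public static void main(String[] args) {
        long lastCrawl = 1_700_000_000_000L;

        ZipEntry unchanged = new ZipEntry("a.txt");
        unchanged.setTime(lastCrawl - 60_000L); // modified one minute before the crawl

        ZipEntry updated = new ZipEntry("b.txt");
        updated.setTime(lastCrawl + 60_000L);   // modified one minute after the crawl

        System.out.println(needsRecrawl(unchanged, lastCrawl));
        System.out.println(needsRecrawl(updated, lastCrawl));
    }
}
```

Because the check relies on the timestamp stored in the archive entry header, an entry whose content changes while its header date stays the same would not be detected by a rule like this one.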

To enable crawlers to crawl archive files in other archive file formats, such as LZH files, you must write a crawler plug-in and then configure the crawler to use the plug-in.

The name of each file inside the archive file is stored in a field named EntryName, not in a Title field or Name field. Each EntryName value represents one entry in the archive file.