Support for crawling archive files
The crawlers can extract files from an archive file (such as a ZIP or TAR file) so that individual files in the archive can be indexed and searched.
Supported archive file formats
File extension | MIME type | Data type | Notes |
---|---|---|---|
.zip, .ZIP | application/zip | zip | |
.tar | application/tar | tar | Supported tar formats: |
.tar.gz, .tgz | not applicable | tgz | Depends on capabilities of the java.util.zip package |
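The mapping in the table can be sketched as a simple lookup from file extension to data type. This is only an illustration: the class name `ArchiveDataTypeSketch` and the method `dataTypeFor` are hypothetical, and the sketch matches only the extensions exactly as they are listed in the table because behavior for other spellings is not stated.

```java
import java.util.Map;
import java.util.Optional;

public class ArchiveDataTypeSketch {
    // Extension -> data type, copied from the table of supported formats.
    private static final Map<String, String> DATA_TYPE_BY_EXTENSION = Map.of(
            ".zip", "zip",
            ".ZIP", "zip",
            ".tar", "tar",
            ".tar.gz", "tgz",
            ".tgz", "tgz");

    /** Returns the data type for a file name, or empty if the extension is not in the table. */
    public static Optional<String> dataTypeFor(String fileName) {
        for (Map.Entry<String, String> e : DATA_TYPE_BY_EXTENSION.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return Optional.of(e.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(dataTypeFor("backup.tar.gz")); // Optional[tgz]
        System.out.println(dataTypeFor("report.lzh"));    // Optional.empty
    }
}
```

A file such as `report.lzh` yields no data type in this sketch, which corresponds to a format that is not in the table.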
Restrictions and guidelines
Automatic code page detection is not available for files that are extracted from an archive file. When extracting the files, the crawler uses the code page setting that it is configured to use with plain text and unknown document types. When you use the administration console to configure language and code page settings for a crawler, you specify the code page that the crawler should use for plain text documents and for documents whose code page cannot be detected automatically.
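The effect of using one configured code page instead of automatic detection can be sketched with the java.util.zip package. This is a minimal illustration and not the crawler's implementation: the class `ConfiguredCodePageSketch` and its methods are hypothetical, the archive is built in memory, and ISO-8859-1 stands in for whatever code page is configured in the administration console.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ConfiguredCodePageSketch {
    /** Builds a one-entry ZIP in memory; it stands in for a crawled archive file. */
    public static byte[] zipOf(String entryName, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decodes every file entry with the single configured charset; no detection is attempted. */
    public static Map<String, String> decodeEntries(byte[] zipBytes, Charset configured) {
        Map<String, String> textByEntry = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                byte[] raw = zip.readAllBytes(); // reads to the end of the current entry
                textByEntry.put(entry.getName(), new String(raw, configured));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return textByEntry;
    }

    public static void main(String[] args) {
        // The text "caf\u00e9" is stored in ISO-8859-1 inside the archive.
        byte[] archive = zipOf("note.txt", "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1));
        // Matching configuration: the text is recovered as written.
        System.out.println(decodeEntries(archive, StandardCharsets.ISO_8859_1));
        // Mismatched configuration: the accented character is not recovered.
        System.out.println(decodeEntries(archive, StandardCharsets.UTF_8));
    }
}
```

If the configured code page does not match the encoding of the files in the archive, the extracted text is decoded incorrectly, so choose the setting for plain text and unknown document types with the archive contents in mind.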
To determine when files in an archive file need to be recrawled, the crawler uses the modified date in the archive entry header data for each file. When you monitor a crawler, the statistics that are shown for crawled documents, including statistics for inserted, updated, and deleted documents, include information about files that were extracted from archive files.
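A comparison based on the modified date in each entry header might look like the following sketch. It is an assumption-laden illustration, not the crawler's actual logic: the class `RecrawlSketch`, the `Change` states, and the rule that a later modified time means "updated" are all hypothetical, and only ZIP archives are handled because the Java standard library has no TAR reader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class RecrawlSketch {
    /** Hypothetical change states that mirror the inserted, updated, and deleted statistics. */
    public enum Change { INSERTED, UPDATED, UNCHANGED, DELETED }

    /** Builds an in-memory ZIP whose empty entries carry the given modified times. */
    public static byte[] zipOf(Map<String, Long> modifiedByName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Map.Entry<String, Long> e : modifiedByName.entrySet()) {
                ZipEntry entry = new ZipEntry(e.getKey());
                entry.setTime(e.getValue()); // stored in the entry header
                zip.putNextEntry(entry);
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    /** Reads each file entry's modified time (milliseconds since the epoch) from its header. */
    public static Map<String, Long> modifiedTimes(byte[] zipBytes) {
        Map<String, Long> times = new TreeMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    times.put(entry.getName(), entry.getTime()); // -1 if not specified
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return times;
    }

    /** Compares the current archive with the times recorded at the previous crawl. */
    public static Map<String, Change> classify(Map<String, Long> previous, Map<String, Long> current) {
        Map<String, Change> result = new TreeMap<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            Long before = previous.get(e.getKey());
            if (before == null) {
                result.put(e.getKey(), Change.INSERTED);
            } else if (e.getValue() > before) { // assumption: later time means changed
                result.put(e.getKey(), Change.UPDATED);
            } else {
                result.put(e.getKey(), Change.UNCHANGED);
            }
        }
        for (String name : previous.keySet()) {
            result.putIfAbsent(name, Change.DELETED); // known before, absent now
        }
        return result;
    }

    public static void main(String[] args) {
        long t = 1_600_000_000_000L;
        Map<String, Long> previous = Map.of("a.txt", t, "gone.txt", t);
        byte[] archive = zipOf(Map.of("a.txt", t + 60_000L, "new.txt", t));
        System.out.println(classify(previous, modifiedTimes(archive)));
    }
}
```

ZIP headers store times with a two-second granularity, so a sketch like this cannot distinguish changes that fall within the same two-second window.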
To enable crawlers to crawl archive files in other archive file formats, such as LZH files, you must write a crawler plug-in and then configure the crawler to use the plug-in.
The name of each file inside the archive file is stored in a field named EntryName, not in a Title field or Name field. Each EntryName value represents one entry in the archive file.
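The following sketch illustrates the idea of one EntryName value per archive entry. The class `EntryNameSketch` and its methods are hypothetical, and the field map is only a stand-in for the indexed metadata. The sketch uses the entry path that `ZipEntry.getName()` returns; whether the product stores the full path or only the file name in EntryName is not stated here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class EntryNameSketch {
    /** Builds a small in-memory ZIP with empty entries; it stands in for a crawled archive. */
    public static byte[] zipWithEntries(String... names) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Returns one field map per file entry; the name goes into EntryName, not Title or Name. */
    public static List<Map<String, String>> fieldsPerEntry(byte[] zipBytes) {
        List<Map<String, String>> documents = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    documents.add(Map.of("EntryName", entry.getName()));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return documents;
    }

    public static void main(String[] args) {
        byte[] archive = zipWithEntries("docs/readme.txt", "docs/guide.txt");
        for (Map<String, String> fields : fieldsPerEntry(archive)) {
            // A lookup by Title finds nothing; the name is available only as EntryName.
            System.out.println(fields.get("EntryName") + " (Title: " + fields.get("Title") + ")");
        }
    }
}
```

When you search for or display the names of extracted files, use the EntryName field rather than a Title or Name field.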