This program reduces the size of HTML before it is stored in the index and, in the
process, converts the HTML to IBM XML.
usage: [--title-weight t] [--snippet-weight w] [--anchors] [--bad-encoding]
[--pdf] [--strip-br] [--no-title] [--strip-all]
The options are:
- --title-weight set the weight of the title content. Default=3.
- --snippet-weight set the weight of the snippet content. Default=1.
- --anchors also output anchor contents, which hold the anchor text.
- --bad-encoding the input was generated by a program whose encoding cannot be trusted.
Every byte that has its high bit set is replaced with a space.
- --pdf apply heuristics to remove the hyphens generated in typeset text.
- --strip-br remove all <BR> tags.
- --no-title discard the title.
- --strip-all remove all HTML markup.
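
The byte replacement described for --bad-encoding can be illustrated with a short sketch. This is not the program's own code; the function name replace_high_bit_bytes is hypothetical and Python is used only for illustration. It assumes "high bit set" means a byte value of 0x80 or above and that the replacement is an ASCII space.

```python
def replace_high_bit_bytes(data: bytes) -> bytes:
    """Return a copy of data in which every byte >= 0x80 becomes an ASCII space.

    Hypothetical helper: it only mirrors the behaviour that the
    --bad-encoding option is documented to have.
    """
    # b & 0x80 is non-zero exactly when the high bit of the byte is set.
    return bytes(0x20 if b & 0x80 else b for b in data)


if __name__ == "__main__":
    # The two UTF-8 bytes of an accented "e" are each replaced by a space.
    print(replace_high_bit_bytes(b"caf\xc3\xa9 ok"))
```

Note that a multi-byte character yields one space per byte, so the output has the same length as the input.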
In addition to the operations that can be selected with its options, this program always
removes all attributes from tags, extra spaces, comments, and JavaScript and CSS
sections.
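
The always-on cleanup can be approximated as follows. This is a minimal sketch under stated assumptions, not the program's implementation: the names MarkupReducer and reduce_html are hypothetical, the sketch uses Python's standard html.parser module, it treats "JavaScript and CSS sections" as script and style elements, and it does not produce IBM XML.

```python
import html
import re
from html.parser import HTMLParser


class MarkupReducer(HTMLParser):
    """Hypothetical sketch: drop attributes, comments, and script/style sections."""

    SKIPPED = {"script", "style"}  # assumed to be the JavaScript and CSS sections

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # pieces of the reduced output
        self.skip_depth = 0  # greater than 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(f"<{tag}>")  # attrs are deliberately ignored

    def handle_startendtag(self, tag, attrs):
        if tag not in self.SKIPPED and self.skip_depth == 0:
            self.parts.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Re-escape text because character references were decoded.
            self.parts.append(html.escape(data, quote=False))

    # handle_comment is not overridden: the default does nothing,
    # so comments never reach the output.


def reduce_html(source: str) -> str:
    reducer = MarkupReducer()
    reducer.feed(source)
    reducer.close()
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", "".join(reducer.parts)).strip()


if __name__ == "__main__":
    print(reduce_html('<p class="x">Hello   <b id="y">world</b></p><!-- note -->'))
```

The sketch keeps the tags themselves; removing them as well would correspond to what --strip-all is documented to do.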