For effective customization, it is helpful to understand the steps taken by the body text tagger:
- Nodes are analyzed for various textual and structural features, including densities of <A> (link) tags, title case text, and punctuation. Based on the weights set in the converter, each node is assigned a fitness number. Weights for the following categories are available for customization:
- Link density: The percentage of all text under this node that is contained within <A> tags. Lower is desirable.
- Tag density: The percentage of all text under this node that is contained within any other tags, instead of being directly under this node. Lower is desirable.
- Title case density: The percentage of all text under this node that is in title case, where the first letter of each word (and only the first letter) is capitalized. Example: "This Is Title Case". Lower is desirable.
- Punctuation density: The percentage of text characters under this node that are punctuation, not whitespace, letters, or numbers. Higher is desirable.
- Word count: The number of words under this node. Higher is desirable.
- Node depth: The depth of this node within the entire document's HTML structure. Deeper is desirable.
To increase the significance of a particular category, set its weight to a higher number. To decrease the significance, set its weight to a lower number. A weight of 0 causes that category to be ignored. Only the size of each weight relative to the others matters, so the sum of the weights is irrelevant and need not equal 1.
- The fitness number assigned to each node in the previous step is adjusted based on the percentage of the entire document's text that the node contains. For example, the <BODY> tag contains 100% of the document's text and an individual <P> tag might contain 20%. The influence of this percentage over the final fitness is the Document-word-percentage postweight. If it is 0.25, then the final fitness will be calculated as 25% of the document word percentage and 75% of the original fitness number. To ignore the document word percentage and select the body text solely from the original fitness numbers, set this to 0.
- The node with the highest fitness after the previous calculation is selected to be the main body text container.
- Immediate children of the container node are pruned from the beginning and end of the child list if their fitness is less than the highest child's fitness multiplied by the Minimum child fitness ratio. This eliminates poor-quality headers, footers, etc. For example, if the container's highest-scoring child has a fitness of 1.0 and the Minimum child fitness ratio is 0.66, children with a fitness below 0.66 are removed from the beginning of the child list until a child is found that satisfies the minimum, and likewise from the end. Children between those two points are kept.
- The analysis is now complete. If the container node holds less than the Minimum percentage of text to tag as "high quality", the converter ignores its results and tags the entire document as body text.
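
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the converter's actual implementation: the `Node` type and all function names are hypothetical, the weighted-average rule in `weighted_fitness` is an assumed way of combining category scores (the text does not specify the formula), percentages are represented as fractions between 0 and 1, and pruning is assumed to compare each child's stored fitness number.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an HTML node; not the converter's real type."""
    name: str
    fitness: float        # original fitness from the weighted feature analysis
    doc_word_pct: float   # share of the document's text under this node (0.0-1.0)
    children: list["Node"] = field(default_factory=list)


def weighted_fitness(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Assumed rule: weighted average of per-category scores in [0, 1].

    Dividing by the weight total is why only relative weights matter.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(weights[k] * scores[k] for k in weights) / total


def final_fitness(node: Node, postweight: float) -> float:
    """Blend document word percentage with the original fitness.

    A postweight of 0.25 gives 25% word percentage and 75% original fitness.
    """
    return postweight * node.doc_word_pct + (1.0 - postweight) * node.fitness


def prune_children(children: list[Node], min_ratio: float) -> list[Node]:
    """Trim low-fitness children from both ends; interior children are kept."""
    if not children:
        return []
    threshold = min_ratio * max(c.fitness for c in children)
    start, end = 0, len(children)
    while start < end and children[start].fitness < threshold:
        start += 1
    while end > start and children[end - 1].fitness < threshold:
        end -= 1
    return children[start:end]


def select_body(nodes: list[Node], postweight: float,
                min_ratio: float, min_text_pct: float):
    """Pick the container and its pruned children.

    Returns None when the container holds too little text, in which case
    the caller would tag the entire document as body text.
    """
    container = max(nodes, key=lambda n: final_fitness(n, postweight))
    if container.doc_word_pct < min_text_pct:
        return None
    return container, prune_children(container.children, min_ratio)
```

For instance, with a postweight of 0.25, a node with an original fitness of 0.8 that holds 20% of the document's text gets a final fitness of 0.25 × 0.2 + 0.75 × 0.8 = 0.65. Note that pruning only trims from the ends of the child list: a low-scoring child that sits between two acceptable children is left in place.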