Extracting plain text from the body of emails containing HTML

The probe extracts plain text from the body of emails formatted in HTML.

During extraction, the probe performs the following steps:

  1. Converts all HTML character entities (for example, &nbsp; or &lt;) back to the characters they represent.
  2. Replaces each run of consecutive whitespace characters with a single space character, except in text enclosed in <pre></pre> tags, where the original spacing is preserved. The resulting text is tidier.
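
The two steps above can be sketched in Python as follows. This is an illustration only, not the probe's actual implementation: the function names restore_entities, collapse_whitespace, and tidy are hypothetical, and treating a decoded non-breaking space as ordinary whitespace is an assumption.

```python
import html
import re

# Hypothetical helpers for illustration; the probe's internals are not documented here.

# Matches a whole <pre>...</pre> block so that its spacing can be left untouched.
_PRE_BLOCK = re.compile(r"(<pre\b[^>]*>.*?</pre>)", re.IGNORECASE | re.DOTALL)


def restore_entities(text: str) -> str:
    """Step 1: turn entities such as &lt; or &nbsp; back into characters."""
    return html.unescape(text)


def collapse_whitespace(text: str) -> str:
    """Step 2: replace each whitespace run with one space, except inside <pre>."""
    parts = _PRE_BLOCK.split(text)
    # re.split with one capturing group puts the <pre> blocks at odd indexes,
    # so only the even-indexed parts are collapsed.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"\s+", " ", parts[i])
    return "".join(parts)


def tidy(text: str) -> str:
    """Apply step 1 and then step 2, in the order listed above."""
    return collapse_whitespace(restore_entities(text))
```

For example, tidy("a &lt;  b&nbsp;&nbsp;c") returns "a < b c" under these assumptions, because Python's \s pattern also matches the non-breaking space produced by &nbsp;.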

Example

If the body of an email contains the following example HTML code:

<p>An <a href='http://example.com/'><b>example</b></a> link.</p><br>Below an empty line<div>Example text 1 distant text 2 and te<text color='red'></text>xt 3</div>

The probe extracts the following plain text:

An example link. 
Below an empty line 
Example text 1 distant text 2 and text 3
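
A comparable result can be produced with Python's standard html.parser module. The sketch below is a rough approximation and not the probe's code: the class name TextExtractor, the function html_to_text, and the choice of p, br, and div as the tags that start a new line are all assumptions made to match the listing above. It drops blank lines and trailing spaces, and it does not handle <pre></pre> blocks.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and records line breaks at selected tags."""

    # Assumption: these tags start a new output line. The real rule is not documented.
    BREAK_TAGS = {"p", "br", "div"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &lt;.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BREAK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BREAK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(markup: str) -> str:
    """Strip tags, collapse whitespace within each line, and drop empty lines."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    lines = "".join(parser.chunks).split("\n")
    # Collapse runs of whitespace inside each line and trim the ends.
    cleaned = [re.sub(r"\s+", " ", line).strip() for line in lines]
    return "\n".join(line for line in cleaned if line)


if __name__ == "__main__":
    sample = (
        "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>"
        "<br>Below an empty line"
        "<div>Example text 1 distant text 2 and te<text color='red'></text>xt 3</div>"
    )
    print(html_to_text(sample))
```

Because the empty <text> element contributes no characters, the fragments "te" and "xt 3" join into "text 3", as in the probe output shown above.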