Parsing emails

Many natural language processing (NLP) steps require relatively clean text to produce good results. However, email text can be messy and may include irrelevant replies, forwards, spam, signatures, disclaimers, and other artifacts. Because of the wide range of email formats and mail protocols in use, email text can also contain content that is problematic for analysis, including HTML tags, unusual characters, and corrupted or encoded character sequences. This service is used as a first step of most NLP pipelines to remove redundant characters and text.

The e-communication email parsing process is used to clean up email texts before further processing. There are two components:

  • a Python mailparse library that is called directly by other Python machine learning code
  • a standalone mailparse REST API that can be used as an on-demand email cleaning service

Approach to solving the business problem

This service uses Python email libraries and regular expressions to split emails into their component parts.
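
The approach can be illustrated with a minimal sketch. It assumes the Python standard library email package for the header and body split, and a small list of regular expressions for finding where the history trail starts. The names split_email and HISTORY_PATTERNS, and the patterns themselves, are hypothetical and are not taken from the mailparse library.

```python
import re
from email import message_from_string

# Hypothetical markers for the start of a reply or forward trail.
HISTORY_PATTERNS = [
    re.compile(r"^-+\s*Original Message\s*-+\s*$", re.IGNORECASE | re.MULTILINE),
    re.compile(r"^-+\s*Forwarded by .*$", re.IGNORECASE | re.MULTILINE),
    re.compile(r"^On .+ wrote:\s*$", re.IGNORECASE | re.MULTILINE),
]


def split_email(raw_text):
    """Split a raw email into header, body, and history (assumed layout)."""
    msg = message_from_string(raw_text)
    # Lower-case header names; if a header repeats, the last value wins.
    header = {name.lower(): value for name, value in msg.items()}

    if msg.is_multipart():
        # Use the first text/plain part, if there is one.
        parts = [p for p in msg.walk() if p.get_content_type() == "text/plain"]
        text = parts[0].get_payload() if parts else ""
    else:
        text = msg.get_payload()

    # Cut the text at the earliest history marker found by any pattern.
    starts = [m.start() for m in (p.search(text) for p in HISTORY_PATTERNS) if m]
    cut = min(starts) if starts else len(text)
    body = text[:cut].strip()
    history = text[cut:].strip() or None
    return {"header": header, "body": body, "history": history}
```

For an email with no reply trail, this sketch returns the message text as the body and None as the history. The real service may split sections differently.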

Assumptions

Email content is derived from the current email only; replies and forwarded emails (the email history trail) are not used to determine email content or context.

Regular expressions are good enough to identify most email sections. These are configurable and can be extended as necessary.

Input is a file of email texts in MIME or similar structure.

Capitalization and punctuation are not used later as part of the NLC model. In other words, there is no downstream NLP model that relies on accurate character level input to determine features.

Using the REST service

Starting the REST Service

python3 mailparseRESTAPI.py &

Sample input

Message-ID: <29665600.1075855687895.JavaMail.evans@thyme>\nDate: Tue, 26 Sep 2000 05:11:00 -0700 (PDT)\nFrom: phillip.allen@enron.com\nTo: cindy.cicchetti@enron.com\nSubject: Re: Gas Trading Vision meeting\nMime-Version: 1.0\nContent-Type: text/plain; charset=us-ascii\nContent-Transfer-Encoding: 7bit\nX-From: Phillip K Allen\nX-To: Cindy Cicchetti\nX-cc: \nX-bcc: \nX-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\'sent mail\nX-Origin: Allen-P\nX-FileName: pallen.nsf\n\nNymex expiration is during this time frame.  Please reschedule.

Sample response

Nymex expiration is during this time frame.  Please reschedule.

parse_email Service details

The service allows users to parse a single email into its component parts.

Table 1. parse_email service details

Method   URL                                Input          Output
POST     /analytics/models/v1/parse_email   JSON payload   JSON response

The following is an example CURL command to POST:

curl -k -H 'Content-Type: application/json' -X POST --data @msg_29.json https://<ip address>:<port>/analytics/models/v1/parse_email/

The following code is an example JSON payload:

{"message": "Message-ID: <29665600.1075855687895.JavaMail.evans@thyme>\nDate: Tue, 26 Sep 2000 05:11:00 -0700 (PDT)\nFrom: phillip.allen@enron.com\nTo: cindy.cicchetti@enron.com\nSubject: Re: Gas Trading Vision meeting\nMime-Version: 1.0\nContent-Type: text/plain; charset=us-ascii\nContent-Transfer-Encoding: 7bit\nX-From: Phillip K Allen\nX-To: Cindy Cicchetti\nX-cc: \nX-bcc: \nX-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\'sent mail\nX-Origin: Allen-P\nX-FileName: pallen.nsf\n\nNymex expiration is during this time frame.  Please reschedule."}

The following code is an example response:

{"history": null,"body": "Nymex expiration is during this time frame. Please reschedule.", "header": {"from": "phillip.allen@enron.com", "date": "Tue, 26 Sep 2000 05:11:00 -0700 (PDT)", "to": "cindy.cicchetti@enron.com", "subject": "Re: Gas Trading Vision meeting", "X-from": "Phillip K Allen", "X-to": "Cindy Cicchetti
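
The same POST can be issued from Python. The following is a minimal sketch that uses only the standard library. The helper names build_parse_email_request and parse_email are hypothetical, and the base URL (scheme, IP address, and port) must be supplied by the caller. Unlike the curl -k example, this sketch does not disable certificate verification, so a server with a self-signed certificate would need an explicit SSL context.

```python
import json
import urllib.request


def build_parse_email_request(base_url, raw_email):
    """Build the POST request for the parse_email endpoint (not yet sent)."""
    payload = json.dumps({"message": raw_email}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/analytics/models/v1/parse_email/",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_email(base_url, raw_email, timeout=30):
    """Send the request and decode the JSON response into a dict."""
    request = build_parse_email_request(base_url, raw_email)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes the payload easy to inspect before it is sent. json.dumps also takes care of escaping newlines and backslashes in the raw email text.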
clean_email Service details

This function cleans the email body text by removing special characters so that only alpha-numeric characters remain.

Table 2. clean_email service details

Method   URL                                Input          Output
POST     /analytics/models/v1/clean_email   JSON payload   JSON response

The following is an example CURL command to POST:

curl -k -H 'Content-Type: application/json' -X POST --data @msg_cln_1.json https://<ip address>:<port>/analytics/models/v1/clean_email/

The following code is an example JSON payload:

{"message":"Allen.xls Enclosed is the preliminary proforma for the Westgate property is Austin that we told you about. ... number of things she does everyday. Fortunately, it looks as if she will be ok in the long run. George W. Richards Creekside Builders"} 

The following code is an example response:

{"message": "Allen xls Enclosed is the preliminary proforma for the Westgate property is Austin that we told you about .. number of things she does everyday Fortunately it looks as if she will be ok in the long run George W Richards Creekside Builders"} 
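
The cleaning step can be approximated with two regular expression substitutions. This is a sketch only: the function name clean_text and the exact character classes are assumptions, and the service itself may behave differently. For example, the sample response above still contains "..", which this sketch would remove.

```python
import re


def clean_text(text):
    """Keep ASCII letters, digits, and single spaces; drop everything else."""
    # Replace each character that is not a letter, digit, or whitespace with a space.
    alnum_only = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", alnum_only).strip()
```

Replacing special characters with a space, rather than deleting them, keeps tokens such as "Allen.xls" from being fused into a single word.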

Accuracy and limitations

The algorithm uses lists of Python regular expression patterns to identify each of the key parts of an email: signatures, forwarded emails, disclaimers, etc. In some cases, these sections may be missed or misidentified. The regular expressions can be amended if new disclaimers or sign-offs are found.
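
One way such pattern lists could be organized is sketched below. The labels, the patterns, and the names DEFAULT_PATTERNS, compile_patterns, and label_lines are all hypothetical. They illustrate the idea of extendable pattern lists and are not the patterns shipped with the service.

```python
import re

# Hypothetical pattern lists, one list per email section.
DEFAULT_PATTERNS = {
    "forwarded": [r"^-+\s*forwarded by\b", r"^-+\s*original message\s*-+"],
    "disclaimer": [r"this e-?mail .*confidential", r"intended solely for"],
    "signature": [r"^(best |kind )?regards,?$", r"^thanks,?$"],
}


def compile_patterns(pattern_lists):
    """Compile every pattern once, ignoring case."""
    return {
        label: [re.compile(p, re.IGNORECASE) for p in patterns]
        for label, patterns in pattern_lists.items()
    }


def label_lines(text, compiled):
    """Tag each line with the first section whose patterns match, else 'body'."""
    labelled = []
    for line in text.splitlines():
        stripped = line.strip()
        label = "body"
        for name, regexes in compiled.items():
            if any(rx.search(stripped) for rx in regexes):
                label = name
                break
        labelled.append((label, line))
    return labelled
```

Under this layout, handling a new sign-off means appending one pattern to the relevant list, for example r"^cheers,?$" under "signature", and recompiling. Lines that match no pattern fall through to "body", which is one way a section can be missed.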

Currently only emails in English are parsed.