Sampling emails
The e-communication sampling process is used to create small sample datasets of emails that represent the main types of themes found in the whole corpus of emails. These samples can be used by Subject Matter Experts (SMEs) as templates for identifying whether the emails really are complaints and what entities are contained in the emails. Subsequently, you can use these annotated emails to train the natural language classifier and natural language understanding models.
Business problem
The sampling process is required because of restrictions on the number of emails that SMEs can accurately label by hand. Since hand labeling is time-consuming and resource-intensive, not all complaints can be chosen. Instead, a sample is taken by using a theme detection algorithm combined with a stratified sampling method. Themes are discovered in the email corpus, and the sampling method replicates the proportion of each discovered theme in the sampled datasets. This should ensure that a wide and representative set of training examples is created.
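As a minimal sketch of the stratified step, the following Python function allocates sample slots to each theme in proportion to its share of the corpus. The function name, the data structures, and the seed are illustrative assumptions, not part of the product.

```python
from collections import Counter
import random

def stratified_sample(emails, themes, sample_size, seed=42):
    """Draw a sample whose theme proportions mirror the full corpus.

    `emails` is a list of email records and `themes` is a parallel list of
    theme labels; both are hypothetical structures used for illustration.
    """
    rng = random.Random(seed)
    counts = Counter(themes)
    total = len(emails)
    sample = []
    for theme, count in counts.items():
        # Allocate slots to this theme in proportion to its corpus share.
        # Rounding means the final size can differ slightly from sample_size.
        k = max(1, round(sample_size * count / total))
        members = [e for e, t in zip(emails, themes) if t == theme]
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample
```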
The following steps are carried out (a sketch of the flow is shown after this list):
- All of the emails are cleaned and processed so that only the current email text is used.
- The email class counts are adjusted to reach the desired balance of classes, for example, complaints=75%, non-complaints=25%.
- The emails are grouped into consistent themes.
- Small datasets are produced containing representative amounts of all of the themes that are discovered in the previous step.
- The dataset IDs for each sampled email are sent by a REST API to be included in the email metadata.
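The cleaning, class balancing, and metadata update steps might look like the following sketch. The email record fields ('id', 'label', 'text'), the payload format, and the REST endpoint are hypothetical placeholders rather than the product's documented API; theme grouping and proportional sampling are as in the stratified sketch above.

```python
import random
import requests  # the endpoint passed to publish_dataset_ids is a hypothetical placeholder

def current_text(email_body):
    # Step 1: keep only the current message text by dropping quoted reply lines.
    return "\n".join(line for line in email_body.splitlines() if not line.startswith(">"))

def balance_classes(emails, complaint_ratio=0.75, seed=42):
    # Step 2: keep all complaints and downsample non-complaints so that,
    # for example, complaints make up 75% of the working set.
    rng = random.Random(seed)
    complaints = [e for e in emails if e["label"] == "complaint"]
    others = [e for e in emails if e["label"] == "non-complaint"]
    n_others = min(len(others), int(len(complaints) * (1 - complaint_ratio) / complaint_ratio))
    return complaints + rng.sample(others, n_others)

def publish_dataset_ids(sampled_emails, dataset_id, api_url):
    # Step 5: send the dataset ID for each sampled email over REST so that it
    # can be added to the email metadata (payload shape is an assumption).
    payload = {"datasetId": dataset_id, "emailIds": [e["id"] for e in sampled_emails]}
    requests.post(api_url, json=payload, timeout=30)
```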
You can also choose random sampling by changing the sampling method parameter. If you choose this option, the algorithm carries out the following steps:
- All emails are cleaned and processed so that only the current email text is used.
- The email class counts are adjusted to reach the desired balance of classes, for example, complaints=75%, non-complaints=25%.
- Emails are chosen at random from the proportions of classes that are used in step 2.
If a random sample is taken, then it is possible that some categories of complaints or entities might not be included due to chance.
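For comparison, a minimal sketch of the random sampling option, assuming each email carries a hypothetical 'complaint' or 'non-complaint' label:

```python
import random

def random_sample(emails, labels, sample_size, complaint_ratio=0.75, seed=42):
    """Pick emails at random while honouring the target class balance."""
    rng = random.Random(seed)
    complaints = [e for e, lbl in zip(emails, labels) if lbl == "complaint"]
    others = [e for e, lbl in zip(emails, labels) if lbl == "non-complaint"]

    n_complaints = min(len(complaints), round(sample_size * complaint_ratio))
    n_others = min(len(others), sample_size - n_complaints)

    # Pure chance decides which themes appear, so rare complaint
    # categories may be missed entirely.
    return rng.sample(complaints, n_complaints) + rng.sample(others, n_others)
```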
Approach to solving the business problem
For this problem, we assume that most complaints can be grouped into a number of categories, and that this number is in the range of 5 to 50. Latent Dirichlet Allocation (LDA) is used to determine which breakdown of topics best divides the complaints into such groups. This is done by determining the set of keywords that best describes each complaint theme, and the number of themes that best describes all of the complaints.
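The product's implementation is not exposed here, but the following sketch shows how LDA, in this case via scikit-learn's LatentDirichletAllocation, can derive the keywords that describe each theme. The function name and the vectorizer settings are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def describe_themes(email_texts, n_topics=10, n_keywords=10):
    """Fit LDA and return the top keywords that describe each theme."""
    vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=2)
    doc_term = vectorizer.fit_transform(email_texts)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(doc_term)

    vocab = vectorizer.get_feature_names_out()
    keywords = []
    for topic_weights in lda.components_:
        top_indices = topic_weights.argsort()[::-1][:n_keywords]
        keywords.append([vocab[i] for i in top_indices])
    return lda, keywords
```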
Although the themes and the keywords of each theme are derived, they are not used directly by Surveillance Insight. They are, however, visible in the algorithm logs. Instead, the approach taken is to group the complaints by similar keywords, and then to ensure that all of the groups are included in the samples in the relative proportions in which they occur.
The following parameters are used for this algorithm (a sketch of how they might be applied is shown after this list):
- Upper limit on email length: Occasionally very long emails can be included in the dataset. To prevent this, emails with lengths that are more than a number of standard deviations above the mean are excluded.
- Minimum number of words in an email: This parameter is used to exclude emails that are too short to be of use.
- Part of speech tags for lemmatization: This parameter controls which parts of speech are considered for the analysis of themes. The default setting is 'NOUN', 'ADJ', 'VERB', 'ADV'. All other parts of speech are not considered.
- Latent Dirichlet allocation grid search parameter for number of topics: The algorithm must determine the best number of topics to represent all of the emails. A range of candidate topic counts can be entered. However, a large set of candidates can substantially increase the processing time.
- Latent Dirichlet allocation grid search parameter for learning decay: This parameter is also used to determine the best number of topics to represent all of the emails. You can enter a range of learning decay coefficients. However, a large set of coefficients can substantially increase the processing time.
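The sketch below illustrates how these parameters might be applied: the word-count and length filters, part-of-speech restricted lemmatization (here with spaCy and the assumed en_core_web_sm model), and a grid search over the number of topics and the learning decay (here with scikit-learn). The default values shown are illustrative and not necessarily the product's defaults.

```python
import numpy as np
import spacy
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

KEEP_POS = {"NOUN", "ADJ", "VERB", "ADV"}  # default part-of-speech filter

def preprocess(texts, min_words=5, max_len_std=3.0):
    """Apply the length, word-count, and part-of-speech parameters."""
    # Exclude emails that are too short to be of use.
    texts = [t for t in texts if len(t.split()) >= min_words]

    # Exclude unusually long emails: more than `max_len_std` standard
    # deviations above the mean length (word count used as the length proxy).
    lengths = np.array([len(t.split()) for t in texts])
    limit = lengths.mean() + max_len_std * lengths.std()
    texts = [t for t, n in zip(texts, lengths) if n <= limit]

    # Lemmatize, keeping only the configured parts of speech.
    nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])
    return [
        " ".join(tok.lemma_ for tok in doc if tok.pos_ in KEEP_POS)
        for doc in nlp.pipe(texts)
    ]

def grid_search_lda(texts, topic_range=(5, 10, 20), decays=(0.5, 0.7, 0.9)):
    """Search for the topic count and learning decay with the best score."""
    doc_term = CountVectorizer(stop_words="english").fit_transform(texts)
    search = GridSearchCV(
        LatentDirichletAllocation(learning_method="online", random_state=0),
        {"n_components": list(topic_range), "learning_decay": list(decays)},
        cv=3,
    )
    search.fit(doc_term)
    return search.best_estimator_, search.best_params_
```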
Assumptions
- There is a large corpus of emails, in the range of 100,000 to 10 million.
- Of all of the emails in this corpus, only approximately 2 - 3% are complaints; that is, between 2,000 and 20,000 actual complaints in total.
- Due to time and staffing constraints, SMEs are able to manually label only a few hundred emails at most, for example, 300.
Accuracy and limitations
The algorithm must determine the number of themes that are present in all of the emails by using a grid search. A range of candidate numbers is supplied in the configuration, and the algorithm determines which works best. However, if the range is too wide, the processing time can increase substantially.
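As a rough illustration of the cost, under the grid search sketch above the number of model fits is the product of the two range sizes and the number of cross-validation folds, so widening either range multiplies the work. The values below are hypothetical.

```python
topic_values = [5, 10, 15, 20, 25, 30]   # hypothetical candidate topic counts
decay_values = [0.5, 0.7, 0.9]           # hypothetical learning decay values
cv_folds = 3

# Each (topic count, learning decay) pair requires a full LDA fit per fold.
total_fits = len(topic_values) * len(decay_values) * cv_folds
print(total_fits)  # 6 * 3 * 3 = 54 separate LDA fits
```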