Language detection

The language detection service identifies the language of business texts, such as emails and chats. It identifies the main language of a text and the points in that text where the language changes, down to the word level. Using the language detection service, Surveillance Insight can highlight and annotate the languages that are used in a text and help to identify potentially suspicious activity.

Business problem

Business texts, such as emails or chats, can be in different languages. A key part of a natural language processing pipeline is identifying the main language of each text so that the text can be processed by the relevant language-specific steps.

In some cases, people may change the language that is used in a chat to avoid monitoring or to conceal illicit activity. Identifying the points at which a chat switches languages can be useful in determining if suspicious activity is occurring.

Approach to solving the business problem

This service uses a machine learning library called FastText (https://fasttext.cc/) that was developed by Facebook. The library is used to train a language classifier model that is based on n-gram embeddings. N-grams are groups of words that occur together. The Surveillance Insight language detection model uses 3-grams, also called trigrams. FastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation.
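As an illustration of the trigram idea only, the following minimal sketch slides a window of three over a tokenized text. It is not FastText's internal code, and the function name ngrams is made up for this example.

```python
def ngrams(tokens, n=3):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = "bienvenidos a todos a nuestra casa".split()
for gram in ngrams(words):
    print(" ".join(gram))
```

A text that is shorter than n tokens yields no n-grams at all, which is one reason why very short runs of a language are harder to classify.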

The language detection model must be trained on a corpus of texts from different languages. Only one language label is required per text.

Using the REST service

The main purpose of the REST services is to determine which languages are present in a text and at which points the language changes within the text. Language detection has three services:

List service: returns a list of available (pre-trained) language detection models
Predict service: uses a model to predict the languages in a text
Train service: trains a new model

Starting the REST service

su sifsuser
cd /home/sifsuser/ml.langdetect
python3.5 langdetectREST.py &

List service details

This service identifies which models are available for making predictions. A pre-trained model is provided, but other models can be added as needed.

Table 1. List service details
Method	URL	Input	Output
POST	/analytics/models/v1/list	JSON payload	JSON response

The following is an example CURL command:

curl -k -H 'Content-Type: application/json' -X POST --data '{"models_dir":"/home/sifsuser/ml.langdetect/models"}' https://<ip>:<port>/analytics/models/v1/list/

The following code is an example response:

{"status": {"message": "Success", "code": 200},
  "response": {"models": ["model_a", "model_b", "langs_12"]}}

In this example, three trained models are available: model_a, model_b, and langs_12.
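The same call can be prepared from Python. The following sketch builds the POST request with the standard library only and does not send it. The helper name build_request and the host and port are placeholders, and the commented-out lines assume that the response has the layout of the example above.

```python
import json
import urllib.request


def build_request(base_url, service, payload):
    """Build (but do not send) a JSON POST request for one of the services."""
    url = "{}/analytics/models/v1/{}/".format(base_url.rstrip("/"), service)
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request(
    "https://localhost:5035",  # placeholder host and port
    "list",
    {"models_dir": "/home/sifsuser/ml.langdetect/models"},
)
print(req.get_method(), req.full_url)

# Sending the request: curl -k skips certificate checks, which in Python
# needs an explicit SSL context (assumption: a self-signed certificate).
# import ssl
# ctx = ssl.create_default_context()
# ctx.check_hostname = False
# ctx.verify_mode = ssl.CERT_NONE
# with urllib.request.urlopen(req, context=ctx) as resp:
#     models = json.load(resp)["response"]["models"]
```

Passing "detect" or "train" as the service name, with the matching payload, would prepare the requests for the other two services in the same way.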

Predict service details

This service detects which languages are present in a text and reports the percentage of each. The text can be an email or a chat in SIFS format. The service also returns the word index of each detected language change, so that the changes can be located at the word level.

Table 2. Predict service details
Method	URL	Input	Output
POST	/analytics/models/v1/detect	JSON payload	JSON response

The parameters are:

text: the text to be classified
model_name: the language detection model to be used
comm_type: the type of e-communication that the text is. The possible values include e-mail and chat.

For an email, the following is an example CURL command:

curl -k -H 'Content-Type: application/json' -X POST --data-binary '{"text":"this is the text for labelling bienvenidos a todos a nuestra casa", "model_name":"langs_12", "comm_type":"e-mail"}' https://<ip>:5035/analytics/models/v1/detect/

The following code is an example response:

{"status": {"code": 200, "message": "Success"}, "response": {"language_segments": [{"startindex": 0, "endindex": 8, "language": "English"}, {"startindex": 9, "endindex": 12, "language": "Spanish"}], "detected_language": "English", "language_probabilities": {"English": "69.23", "Spanish": "30.77"}, "languages": ["English", "Spanish"]}}

The response component of the JSON has the following components:

detected_language: the main language of the text
language_segments: the start and end word index of each language change found
language_probabilities: the percent (proportion) of each language that is found
languages: the list of languages found in the text
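In the email example, the reported percentages equal the share of word positions that each segment covers. The following sketch recomputes them from language_segments. Treating endindex as inclusive, and assuming that the service derives the percentages this way, are inferences from this one example rather than documented behavior. The function name segment_shares is made up for this sketch.

```python
import json

# Example response copied from the email example.
raw = '''{"status": {"code": 200, "message": "Success"}, "response":
{"language_segments": [{"startindex": 0, "endindex": 8, "language": "English"},
{"startindex": 9, "endindex": 12, "language": "Spanish"}],
"detected_language": "English",
"language_probabilities": {"English": "69.23", "Spanish": "30.77"},
"languages": ["English", "Spanish"]}}'''


def segment_shares(segments):
    """Percentage of word positions per language (endindex assumed inclusive)."""
    counts = {}
    for seg in segments:
        n_words = seg["endindex"] - seg["startindex"] + 1
        counts[seg["language"]] = counts.get(seg["language"], 0) + n_words
    total = sum(counts.values())
    return {lang: "{:.2f}".format(100.0 * n / total) for lang, n in counts.items()}


response = json.loads(raw)["response"]
shares = segment_shares(response["language_segments"])
print(shares)
print(shares == response["language_probabilities"])
```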

For a chat, the service detects which languages are used in the chat and identifies the segment and row where the language changes. It also detects the overall percentage of each language that is used in the chat. To keep the response size small, only changes from the most common language are given. All other words are assumed to be in the main 'detected_language'. As input to the service, the chat is given as an array in which the following values are specified for each line:

Speech: the text for the line
timestamp: the timestamp of the line of chat
Speaker: the ID of the speaker
segment: index (that is, the line count) of the line of chat. This value starts at 1.
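The per-line values above can be assembled programmatically. The following is a minimal sketch that numbers the lines from 1 and wraps them in a request body. The helper name build_chat_payload is made up here, the key capitalization follows the example command, and passing the lines as a JSON array under text is an assumption about what the service accepts.

```python
import json


def build_chat_payload(lines, model_name="langs_12"):
    """lines: iterable of (speaker, timestamp, speech) tuples in chat order."""
    chat = [
        {"Speech": speech, "timestamp": timestamp, "Speaker": speaker, "segment": i}
        for i, (speaker, timestamp, speech) in enumerate(lines, start=1)
    ]
    return {"comm_type": "chat", "model_name": model_name, "text": chat}


payload = build_chat_payload([
    ("Gabriella.Myers@digitbrokerage.com", "2017-04-13T20:10:30.726Z",
     "Hi, how can you be so irresponsible?"),
    ("Chris_Brown@digitbrokerage.com", "2017-04-13T20:12:30.326Z",
     "Hello! What happened?"),
])
print(json.dumps(payload, indent=2))
```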

For a chat, the following is an example CURL command:

curl -k -H 'Content-Type: application/json' -X POST --data-binary '{"comm_type": "chat",
 "model_name": "langs_12", "text": [
{"Speech": "Hi, how can you be so irresponsible?", "timestamp": "2017-04-13T20:10:30.726Z", "Speaker": "Gabriella.Myers@digitbrokerage.com", "segment": 1},
{"Speech": "Hello! What happened?", "timestamp": "2017-04-13T20:12:30.326Z", "Speaker": "Chris_Brown@digitbrokerage.com", "segment": 2},
 {"Speech": "I am smelling the growth potential.Tomorrow the stock will increase by 550%.Its an opportunity of a lifetime", "timestamp": "2017-04-13T20:13:00.272Z", "Speaker": "Gabriella.Myers@digitbrokerage.com", "segment": 3},
 {"Speech": "me dice ella que trabaja en una tienda de furniture . ", "timestamp": "2018-04-12 09:45:03.000Z", "Speaker": "Chris_Brown@digitbrokerage.com", "segment": 4}]}' https://<ip>:<port>/analytics/models/v1/detect/

The following code is an example response:

{'response': {'detected_language': 'English',
      'language_probabilities': {'English': '82.50', 'Spanish': '17.50'},
      'language_segments': 
      [{'endindex': 10, 'language': 'English', 'segment': 4, 'startindex': 0}],
      'tags': ['English', 'Spanish']},
     'status': {'code': 200, 'message': 'Success'}}

For each line of chat that contains a language change, the response gives the segment (that is, the index of that line), the word indexes where the change starts and ends, and the language that was detected. Each entry in language_segments has the following components:

segment: the index of the line in the chat
startindex: the index of the word in this line where the different language started
endindex: the index of the word in this line where the different language ended
language: the language of those words

For example: {'segment': 3, 'startindex': 20, 'endindex': 27, 'language': 'Spanish' } means that in the third row of the chat, the language changes from English to Spanish for words 20 to 27.
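To turn a language_segments entry back into the words that it points at, the line can be split on white space and sliced. The following sketch uses the index values from the chat example response. White-space tokenization and an inclusive endindex are assumptions that happen to fit that example, and the function name words_in_segment is made up here.

```python
def words_in_segment(chat_lines, seg):
    """Return the words that a language_segments entry points at.

    chat_lines holds the chat line texts in order, so segment 1 is chat_lines[0].
    """
    words = chat_lines[seg["segment"] - 1].split()
    return words[seg["startindex"]:seg["endindex"] + 1]


chat_lines = [
    "Hi, how can you be so irresponsible?",
    "Hello! What happened?",
    "I am smelling the growth potential.Tomorrow the stock will increase "
    "by 550%.Its an opportunity of a lifetime",
    "me dice ella que trabaja en una tienda de furniture . ",
]
seg = {"segment": 4, "startindex": 0, "endindex": 10}
print(words_in_segment(chat_lines, seg))
```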

Train service details

This service is used to train a new language detection model. The training data set needs about 500,000 words of text for each language. Save the data as a tab-separated values (TSV) file that has one row per language and the following tab-separated columns:

the language name
the ISO 639-2 three-letter code for that language
a large body of sentences in that language, joined together into one long line with no carriage returns apart from one at the end

For example,

'Danish' 	 'dan'	 'Genoptagelse af sessionen Jeg ...
'French'	 'fre'	 'Reprise de la session Je déclare reprise  ...
...
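How the train service converts this file into FastText input is not documented here. One plausible approach, sketched below, cuts each long line into short examples and prefixes each with a label, using the __label__ prefix that is FastText's default label marker. The function name to_fasttext_lines and the chunk size are hypothetical, and the sample rows are written without quotation marks.

```python
import io


def to_fasttext_lines(tsv_lines, words_per_example=20):
    """Yield '__label__<code> <text>' lines from the three-column training file."""
    for line in tsv_lines:
        if not line.strip():
            continue  # skip blank rows
        name, code, text = line.rstrip("\n").split("\t", 2)
        words = text.split()
        for i in range(0, len(words), words_per_example):
            chunk = " ".join(words[i:i + words_per_example])
            yield "__label__{} {}".format(code.strip().lower(), chunk)


sample = io.StringIO(
    "Danish\tdan\tGenoptagelse af sessionen Jeg\n"
    "French\tfre\tReprise de la session Je déclare reprise\n"
)
examples = list(to_fasttext_lines(sample, words_per_example=3))
for example in examples:
    print(example)
```

The file is read row by row and split with str.split rather than the csv module, because a single row of about 500,000 words would exceed the csv module's default field size limit.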

Table 3. Train service details
Method	URL	Input	Output
POST	/analytics/models/v1/train	JSON payload	Trained model JSON response

The parameters are:

model_name: the name of the language detection model to be created
data_fpath: the path on the local server to the training data file

The following is an example CURL command:

curl -k -H 'Content-Type: application/json' -X POST --data-binary  '{"model_name":"langs_12", "data_fpath":"/home/sifsuser/ml.langdetect/data/model_training_data.tsv"}'  https://<ip>:<port>/analytics/models/v1/train/

The service returns metadata about the model that was created. The model itself is stored in the models directory, so only its name is required to use the model. To see which models are available, use the list service.

The following code is an example response:

{"status": 
{"message": "Success", "code": 200},
 "response": {"model created": 
 {"evaluation": {"support": 1195265, "Precision@1": 0.9893881273190464, "Recall @ 1": 0.9893881273190464}, 
 "languages": {"Portuguese": "por", "Dutch": "dut", "Hindi": "hin", "English": "eng", "Danish": "dan", "Hindi_Eng": "hie", "Swedish": "swe", "French": "fre", "Italian": "ita", "lang": "lang_short", "Spanish": "spa", "German": "ger", "Finnish": "fin"},
 "size": 102596073, 
 "creation_date": "Thu Oct  4 07:58:03 2018", 
 "path": "/home/sifsuser/ml.langdetect/models", "name": "langs_12"}}}

Pre-trained model

The language detection algorithm includes a pre-trained machine learning model that is named langs_12, which supports the following languages:

Hindi_Eng
Danish
French
Swedish
English
German
Finnish
Italian
Spanish
Dutch
Portuguese
Hindi

The model was trained using a mixture of publicly available content. The training set had approximately 500,000 words in each language.

Accuracy and limitations

The model is not 100% accurate. The known issues include:

Accuracy depends on the model having been trained on texts that are similar to the texts being classified
The model may not perform well on texts that have lists of proper names or part numbers, that is, specific words that did not appear in the training set
There can be confusion between similar languages, such as Portuguese and Spanish
Accuracy is poorer for short runs of a language, especially runs of fewer than five consecutive words
There can be misalignment of start and end indexes of language changes. The models can be wrong by one or two word indexes when a language changes. For example, the model might detect a language change as starting on the second or third word of the new language, rather than the first.

You might be able to increase the accuracy by creating a new trained model and doing the following:

increasing training set size
adding more varied training set data
modifying the following FastText hyperparameters:
- iterations
- learning rate
- sub-word length
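For reference, the three hyperparameters correspond to the -epoch, -lr, and -minn/-maxn options of the fasttext supervised command. The following sketch only builds such a command line. The default values are illustrative rather than the settings used for langs_12, the function name fasttext_train_command is made up here, and whether the train service exposes these options is not stated in this document.

```python
def fasttext_train_command(train_file, model_prefix, epoch=25, lr=0.5, minn=2, maxn=4):
    """Build a 'fasttext supervised' command line with the three tunables."""
    return [
        "fasttext", "supervised",
        "-input", train_file,
        "-output", model_prefix,
        "-epoch", str(epoch),  # iterations: passes over the training data
        "-lr", str(lr),        # learning rate
        "-minn", str(minn),    # sub-word length: shortest character n-gram
        "-maxn", str(maxn),    # sub-word length: longest character n-gram
    ]


cmd = fasttext_train_command("train.txt", "models/my_langs", epoch=50)
print(" ".join(cmd))
```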