Language detection
The language detection service is used to identify the language of business texts, such as emails and chats. The service identifies the language of a text and the parts of that text where the language changes, down to the word level. Using the language detection service, Surveillance Insights can highlight and annotate the languages that are used in a text and help to identify potentially suspicious activity.
Business problem
Business texts, such as emails or chats, can be in different languages. A key part of natural language processing pipelines is identifying the main language of each text so that the text can be processed by the relevant language-specific steps.
In some cases, people may change the language that is used in a chat to avoid monitoring or to conceal illicit activity. Identifying the points at which a chat switches languages can be useful in determining if suspicious activity is occurring.
Approach to solving the business problem
This service uses FastText (https://fasttext.cc/), a machine learning library that was developed by Facebook. The library is used to train a language classifier model that is based on n-gram embeddings. N-grams are groups of words that occur together. The Surveillance Insight language detection model uses 3-grams, also called trigrams. FastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation.
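To make the idea of a trigram concrete, the following minimal sketch lists the word 3-grams in a short sentence. It only illustrates what an n-gram is; it is not the feature extraction that FastText performs internally, and `word_ngrams` is a hypothetical helper name.

```python
def word_ngrams(text, n=3):
    """Return the n-grams in text as a list of word tuples."""
    words = text.split()
    # Slide a window of n words across the text, one word at a time.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


trigrams = word_ngrams("bienvenidos a todos a nuestra casa")
for gram in trigrams:
    print(" ".join(gram))
```

A text with fewer than three words produces no trigrams at all, which is one intuitive reason why very short runs of a language are harder to classify.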
The language detection model must be trained on a corpus of texts from different languages. Only one language label is required per text.
Using the REST service
The main purpose of the REST services is to determine which languages are present in a text and at which point the language changes within a text. Language detection has 3 services:
- List service: returns a list of available (pre-trained) language detection models
- Predict service: uses a model to predict the language in a text
- Train service: trains a new model
Starting the REST service
su sifsuser
cd /home/sifsuser/ml.langdetect
python3.5 langdetectREST.py &
List service details
This service identifies which models are available for use to make predictions. Pre-trained models are provided, but other models can be added as needed.
| Method | URL | Input | Output |
|---|---|---|---|
| POST | /analytics/models/v1/list | JSON payload | JSON response |
The following is an example CURL command:
curl -k -H 'Content-Type: application/json' -X POST --data '{"models_dir":"/home/sifsuser/ml.langdetect/models"}' https://<ip>:<port>/analytics/models/v1/list/
The following code is an example response:
{"status": {"message": "Success", "code": 200},
"response": {"models": ["model_a", "model_b", "langs_12"]}}
In this example, three trained models are available: model_a, model_b, and langs_12.
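The same call can be made from Python. The following is a minimal sketch that uses only the standard library; the host `localhost` is a placeholder, and the helper names `build_list_request` and `send` are hypothetical. Certificate verification is switched off to mirror the `-k` option of the CURL command, which is only appropriate for test systems.

```python
import json
import ssl
import urllib.request


def build_list_request(base_url, models_dir):
    """Build (but do not send) the POST request for the List service."""
    payload = json.dumps({"models_dir": models_dir}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/analytics/models/v1/list/",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request without certificate checks, like `curl -k`."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as reply:
        return json.loads(reply.read().decode("utf-8"))


# Placeholder address: replace with the host and port of your service.
request = build_list_request(
    "https://localhost:5035", "/home/sifsuser/ml.langdetect/models"
)
print(request.get_method(), request.full_url)
# On a live system, the model names could then be read with:
# models = send(request)["response"]["models"]
```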
Predict service details
This service is used to detect which languages are present in a text and to give the percentage of each. The text can be an email or a chat in SIFS format. The service also gives the word index of each language change that is detected, so that the changes can be located at the word level.
| Method | URL | Input | Output |
|---|---|---|---|
| POST | /analytics/models/v1/detect | JSON payload | JSON response |
The parameters are:
- text: the text to be classified
- model_name: the language detection model to be used
- comm_type: the type of e-communication that the text is. The possible values include e-mail and chat.
For an email, the following is an example CURL command:
curl -k -H 'Content-Type: application/json' -X POST --data-binary '{"text":"this is the text for labelling bienvenidos a todos a nuestra casa", "model_name":"langs_12", "comm_type":"e-mail"}' https://<ip>:5035/analytics/models/v1/detect/
The following code is an example response:
{"status": {"code": 200, "message": "Success"}, "response": {"language_segments": [{"startindex": 0, "endindex": 8, "language": "English"}, {"startindex": 9, "endindex": 12, "language": "Spanish"}], "detected_language": "English", "language_probabilities": {"English": "69.23", "Spanish": "30.77"}, "languages": ["English", "Spanish"]}}
The response component of the JSON has the following components:
- detected_language: the main language of the text
- language_segments: the start and end word index of each language change found
- language_probabilities: the percent (proportion) of each language that is found
- languages: the list of languages found in the text
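The relationship between language_segments and language_probabilities can be checked with a short script. The sketch below assumes that startindex and endindex are both inclusive word indexes and that each percentage is a share of the word count. Neither assumption is stated by the service, although together they reproduce the numbers in the example response. `language_shares` is a hypothetical helper.

```python
import json
from collections import Counter

# Example response copied from the Predict service example for an email.
reply = json.loads("""
{"status": {"code": 200, "message": "Success"},
 "response": {"language_segments": [
     {"startindex": 0, "endindex": 8, "language": "English"},
     {"startindex": 9, "endindex": 12, "language": "Spanish"}],
   "detected_language": "English",
   "language_probabilities": {"English": "69.23", "Spanish": "30.77"},
   "languages": ["English", "Spanish"]}}
""")


def language_shares(segments):
    """Return the percentage of words per language as 2-decimal strings."""
    counts = Counter()
    for seg in segments:
        # Assumption: startindex and endindex are both inclusive.
        counts[seg["language"]] += seg["endindex"] - seg["startindex"] + 1
    total = sum(counts.values())
    return {lang: "{:.2f}".format(100.0 * n / total) for lang, n in counts.items()}


body = reply["response"]
shares = language_shares(body["language_segments"])
print(body["detected_language"], shares)
```

Here the English segment covers 9 words and the Spanish segment 4, so the shares are 9/13 and 4/13 of the text.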
For a chat, the service detects which languages are used in the chat and identifies the segment and row where the language changes. It also detects the overall percentage of each language that is used in the chat. To keep the response size small, only changes from the most common language are given; all other words are assumed to be in the main 'detected_language'. As input to the service, the chat is given in an array where the following values are specified for each line:
- Speech: the text for the line
- timestamp: the timestamp of the line of chat
- Speaker: the ID of the speaker
- segment: index (that is, the line count) of the line of chat. This value starts at 1.
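A chat request body with these values can be assembled in Python and serialized with `json.dumps`, which avoids hand-written quoting mistakes. This is a minimal sketch: `build_chat_payload` is a hypothetical helper, the key spelling follows the example CURL command for chats, and it assumes that the service accepts the chat as a JSON array under `text`.

```python
import json


def build_chat_payload(lines, model_name="langs_12"):
    """Return the JSON request body for a chat, numbering segments from 1."""
    chat = []
    for index, (speaker, timestamp, speech) in enumerate(lines, start=1):
        chat.append({
            "Speech": speech,
            "timestamp": timestamp,
            "Speaker": speaker,
            "segment": index,  # the line count, starting at 1
        })
    # Assumption: the chat is sent as a JSON array under the "text" key.
    return json.dumps({"comm_type": "chat", "model_name": model_name, "text": chat})


payload = build_chat_payload([
    ("Gabriella.Myers@digitbrokerage.com", "2017-04-13T20:10:30.726Z",
     "Hi, how can you be so irresponsible?"),
    ("Chris_Brown@digitbrokerage.com", "2017-04-13T20:12:30.326Z",
     "Hello! What happened?"),
])
print(payload)
```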
For a chat, the following is an example CURL command:
curl -k -H 'Content-Type: application/json' -X POST --data-binary '{"comm_type": "chat",
"model_name": "langs_12", "text": [
{"Speech": "Hi, how can you be so irresponsible?", "timestamp": "2017-04-13T20:10:30.726Z", "Speaker": "Gabriella.Myers@digitbrokerage.com", "segment": 1},
{"Speech": "Hello! What happened?", "timestamp": "2017-04-13T20:12:30.326Z", "Speaker": "Chris_Brown@digitbrokerage.com", "segment": 2},
{"Speech": "I am smelling the growth potential.Tomorrow the stock will increase by 550%.Its an opportunity of a lifetime", "timestamp": "2017-04-13T20:13:00.272Z", "Speaker": "Gabriella.Myers@digitbrokerage.com", "segment": 3},
{"Speech": "me dice ella que trabaja en una tienda de furniture . ", "timestamp": "2018-04-12 09:45:03.000Z", "Speaker": "Chris_Brown@digitbrokerage.com", "segment": 4}]}' https://<ip>:<port>/analytics/models/v1/detect/
The following code is an example response:
{"response": {"detected_language": "English",
"language_probabilities": {"English": "82.50", "Spanish": "17.50"},
"language_segments":
[{"endindex": 10, "language": "English", "segment": 4, "startindex": 0}],
"tags": ["English", "Spanish"]},
"status": {"code": 200, "message": "Success"}}
For each line of chat that contains a language change, the response gives the segment (that is, the index of that line), the indexes of the affected words, and the language that they changed to. Each entry in language_segments has the following components:
- segment: the index of the line in the chat
- startindex: the index of the word in this line where the different language started
- endindex: the index of the word in this line where the different language ended
- language: the language of those words
For example: {"segment": 3, "startindex": 20, "endindex": 27, "language": "Spanish"} means that in the third row of the chat, the language changes from English to Spanish for words 20 to 27.
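A language_segments entry can be mapped back to the words that it covers. The sketch below uses the fourth line of the example chat and the entry from the example response. It assumes that words are separated by white space, that word indexes start at 0, and that endindex is inclusive; under these assumptions the entry covers all 11 tokens of that line. `words_in_segment` is a hypothetical helper.

```python
def words_in_segment(chat, entry):
    """Return the words of the chat line that a language_segments entry covers."""
    # Look the line up by its segment value rather than by list position.
    by_segment = {line["segment"]: line for line in chat}
    words = by_segment[entry["segment"]]["Speech"].split()
    # Assumption: indexes start at 0 and endindex is inclusive.
    return words[entry["startindex"]:entry["endindex"] + 1]


chat = [
    {"Speech": "Hello! What happened?", "segment": 2},
    {"Speech": "me dice ella que trabaja en una tienda de furniture . ", "segment": 4},
]
# Entry copied from the example response for a chat.
entry = {"endindex": 10, "language": "English", "segment": 4, "startindex": 0}

words = words_in_segment(chat, entry)
print(len(words), words)
```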
Train service details
This service is used to train a new language detection model. The training data set requires a text of about 500,000 words for each language. Save the data as a tab-separated values (TSV) file that has one row per language, with the following tab-separated columns:
- the language name
- the three-letter ISO 639-2 code for that language
- a large text of sentences in that language that are joined together into one long line, with no carriage returns apart from one at the end.
For example,
'Danish' 'dan' 'Genoptagelse af sessionen Jeg ...
'French' 'fre' 'Reprise de la session Je déclare reprise ...
...
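A training file in this layout can be produced with a few lines of Python. This is a minimal sketch: `write_training_file` is a hypothetical helper, the sample texts are only the opening words from the example rows, and the quotation marks shown around the example fields are not reproduced because it is not stated whether the service expects them.

```python
import os
import tempfile


def write_training_file(path, rows):
    """Write one tab-separated row per language: name, ISO 639-2 code, text."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for name, code, text in rows:
            # Collapse newlines, tabs and repeated spaces so that the text
            # stays on one line and inside one column.
            one_line = " ".join(text.split())
            handle.write("\t".join([name, code, one_line]) + "\n")


rows = [
    ("Danish", "dan", "Genoptagelse af sessionen"),
    ("French", "fre", "Reprise de la\nsession"),
]
path = os.path.join(tempfile.mkdtemp(), "model_training_data.tsv")
write_training_file(path, rows)

with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
```

A real training file would contain about 500,000 words in the third column of each row.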
| Method | URL | Input | Output |
|---|---|---|---|
| POST | /analytics/models/v1/train | JSON payload | Trained model JSON response |
The parameters are:
- model_name: the name of the language detection model to be created
- data_fpath: the path on the local server to the training data file
The following is an example CURL command:
curl -k -H 'Content-Type: application/json' -X POST --data-binary '{"model_name":"langs_12", "data_fpath":"/home/sifsuser/ml.langdetect/data/model_training_data.tsv"}' https://<ip>:<port>/analytics/models/v1/train/
The service returns metadata about the model that was created. The model itself is stored in the models directory, so only its name is required to use it. To see which models are available, use the List service.
The following code is an example response:
{"status":
{"message": "Success", "code": 200},
"response": {"model created":
{"evaluation": {"support": 1195265, "Precision@1": 0.9893881273190464, "Recall @ 1": 0.9893881273190464},
"languages": {"Portuguese": "por", "Dutch": "dut", "Hindi": "hin", "English": "eng", "Danish": "dan", "Hindi_Eng": "hie", "Swedish": "swe", "French": "fre", "Italian": "ita", "lang": "lang_short", "Spanish": "spa", "German": "ger", "Finnish": "fin"},
"size": 102596073,
"creation_date": "Thu Oct 4 07:58:03 2018",
"path": "/home/sifsuser/ml.langdetect/models", "name": "langs_12"}}}
Pre-trained model
The language detection algorithm includes a pre-trained machine learning model that is named langs_12, which supports the following languages:
- Hindi_Eng
- Danish
- French
- Swedish
- English
- German
- Finnish
- Italian
- Spanish
- Dutch
- Portuguese
- Hindi
The model was trained using a mixture of publicly available content. The training set had approximately 500,000 words in each language.
Accuracy and limitations
The model is not 100% accurate. The known issues include:
- Accuracy depends on the model having been trained on texts that are similar to the texts being classified
- The model may not perform well on texts that have lists of proper names or part numbers, that is, specific words that did not appear in the training set
- There can be confusion between similar languages, such as Portuguese and Spanish
- Accuracy is poorer for short runs of a language, especially runs of fewer than 5 consecutive words
- There can be misalignment of start and end indexes of language changes. The models can be wrong by one or two word indexes when a language changes. For example, the model might detect a language change as starting on the second or third word of the new language, rather than the first.
You might be able to increase the accuracy by creating a new trained model and doing the following:
- increasing training set size
- adding more varied training set data
- modifying the following FastText hyperparameters:
  - iterations
  - learning rate
  - sub-word length
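For orientation, the three hyperparameters correspond to the `epoch`, `lr`, and `minn`/`maxn` arguments of the official fastText Python module. How the language detection service exposes these settings is not described here, so the following is only a sketch: `fasttext_options` is a hypothetical helper and the default values are illustrative, not the values used for langs_12.

```python
def fasttext_options(iterations=25, learning_rate=0.1, subword_min=1, subword_max=3):
    """Map the tuning knobs named above onto fastText argument names."""
    if subword_min > subword_max:
        raise ValueError("subword_min must not exceed subword_max")
    return {
        "epoch": iterations,   # number of passes over the training data
        "lr": learning_rate,   # learning rate
        "minn": subword_min,   # shortest character n-gram (sub-word) length
        "maxn": subword_max,   # longest character n-gram (sub-word) length
    }


options = fasttext_options(iterations=50, learning_rate=0.5)
print(options)

# With the fasttext package installed and a training file in fastText's
# "__label__<language> <text>" format, training might look like this:
#   import fasttext
#   model = fasttext.train_supervised(input="train.txt", **options)
```

Changing one setting at a time and comparing the precision and recall that the Train service reports makes it easier to see which change helped.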