NLP in the real world: How Global-Regulation organized its comparative law search engine with Watson
As I’ve previously discussed, Natural Language Processing (NLP) is making its way into every industry. Every day, organizations find new ways to apply NLP to speed up processes, analyze data more efficiently, or provide smarter recommendations. As the amount of unstructured data grows, Natural Language Processing APIs and platforms become essential for breaking down enterprise data and harvesting new insights from it.
In the legal services industry, Global-Regulation is using NLP and machine translation to build the most comprehensive world law search engine. I recently spoke with CTO Sebastian Dusterwald to discuss how Global-Regulation uses Watson NLP technology to translate laws into English.
Sean: Sebastian, tell us about Global-Regulation and what your team does.
Sebastian: At Global-Regulation, it is our mission to democratize access to laws from across the globe. We handle large amounts of text data. We index, process, and translate nearly 2 million laws from nearly 100 countries, from Brazil to China to France to Italy and more, using machine translation. We help make laws searchable and accessible in English. We do all of this with a very small team, and none of it would be possible without the amazing AI-powered cloud services provided by the Watson platform.
Sean: Very cool. Do you have any recent examples to share about how the team is using Watson?
Sebastian: Recently, one of our clients approached us about adding categories to our law metadata. Their business use case is monitoring specific types of laws, such as those covering healthcare and cybersecurity, to maintain regulatory compliance, and categories would make it easier for them to find the relevant laws. With so many laws in our database, discoverability is always an issue, so we thought this could be a great feature to add to our site. The problem is that very few of our sources provide any sort of categorization metadata, and those that do all use slightly different categories, so simply grabbing this data during indexing was not an option.
We needed a system that could analyze and process our text data, and then sort it into preset categories. IBM suggested that we try out the IBM Watson Natural Language Understanding (NLU) API. It does exactly what we want out of the box: it allows us to upload training data and then classify natural language text based on that data.
Sean: Interesting, so what did you do next?
Sebastian: Well, we went through our database to find several laws that we thought were representative of each category, from finance to cybersecurity to environment-based laws. We then went through each of those laws and picked out chunks of text that we thought were relevant to the category. This was the most complicated and labor-intensive part of the implementation process. We had to choose chunks of text that were specific enough to teach the NLP algorithm about the domain the law refers to, while being generic enough not to over-train it. This meant avoiding specific names of countries or people, as well as dates. Including them would have risked training the algorithm on keywords that look very specific but have nothing to do with the category at hand.
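As an illustration of the screening Sebastian describes, here is a minimal sketch that flags candidate training snippets containing dates or place names. The regular expression, the `PLACE_NAMES` set, and the function name `overly_specific_terms` are assumptions made for this example; the interview does not say whether the team automated this step or reviewed snippets entirely by hand.

```python
import re
from typing import List

# Hypothetical patterns: four-digit years (1800-2099) and numeric dates.
YEAR_OR_DATE = re.compile(
    r"\b(?:1[89]\d{2}|20\d{2})\b|\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b"
)

# Illustrative only: a real list would cover far more names.
PLACE_NAMES = {"brazil", "china", "france", "italy"}


def overly_specific_terms(snippet: str) -> List[str]:
    """Return dates and place names that could over-fit a category."""
    found = [m.group(0) for m in YEAR_OR_DATE.finditer(snippet)]
    words = set(re.findall(r"[A-Za-z]+", snippet.lower()))
    found += sorted(words & PLACE_NAMES)
    return found
```

A snippet that returns an empty list is a candidate for the training set; anything else would go back for manual trimming.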
The training set was simply entered into a spreadsheet and then uploaded to the IBM Watson NLU API. After a short wait for it to process the data, the API was ready to accept queries. Our approach was to use the first 1024 characters of a law to classify it. This generated quite good results, in part because the first 1024 characters of a law typically include its title, which tends to contain a number of keywords that the algorithm can use. At this stage we were pretty sure that the IBM Watson NLP technology would be suitable for our use case, albeit with a little bit of fine-tuning.
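The two steps in this paragraph, preparing a spreadsheet of labeled snippets and cutting each law down to its first 1024 characters, can be sketched as follows. The two-column text-and-label layout and the helper names are assumptions for illustration; the interview only says the data was entered into a spreadsheet, and the upload call itself is deliberately left out.

```python
import csv
import io
from typing import Iterable, Tuple

HEAD_SIZE = 1024  # number of leading characters used for classification


def head_snippet(law_text: str, size: int = HEAD_SIZE) -> str:
    """Return the first `size` characters of a law (usually includes the title)."""
    return law_text[:size]


def training_rows_to_csv(rows: Iterable[Tuple[str, str]]) -> str:
    """Serialize (snippet, category) pairs as CSV text, one pair per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for snippet, category in rows:
        writer.writerow([snippet, category])
    return buffer.getvalue()
```

The `csv` module quotes any snippet that contains a comma, so legal text can be stored without manual escaping.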
Sean: That’s great! Can you tell me a bit more about how you then fine-tuned Watson to meet your client’s use case?
Sebastian: The first thing we did was to sample text from across each law in our database, covering categories such as healthcare, welfare, and privacy. Instead of taking just the first 1024 characters of each document, we took five chunks of 1024 characters each, evenly spread across the document. We then averaged each category's confidence scores returned by the IBM Watson NLU API across those chunks and chose the category with the highest average as the category for that law. This significantly increased the accuracy of the classifier for our dataset.
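The sampling and averaging described here can be sketched in a few lines. This is a minimal illustration, not Global-Regulation's code: `classify_chunk` is a hypothetical stand-in for whatever function sends one chunk to the classifier and returns a confidence per category, and the way chunk start positions are spaced is one reasonable reading of "evenly spread".

```python
from typing import Callable, Dict, List, Tuple

CHUNK_SIZE = 1024  # characters per chunk, as in the interview
NUM_CHUNKS = 5     # chunks sampled per law

Scores = Dict[str, float]  # category name -> confidence


def evenly_spaced_chunks(text: str, num_chunks: int = NUM_CHUNKS,
                         chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Cut `num_chunks` windows whose start offsets are spread evenly."""
    if len(text) <= chunk_size or num_chunks <= 1:
        return [text[:chunk_size]]
    last_start = len(text) - chunk_size
    starts = [round(i * last_start / (num_chunks - 1))
              for i in range(num_chunks)]
    # Short documents produce overlapping windows, which is acceptable here.
    return [text[s:s + chunk_size] for s in starts]


def classify_document(text: str,
                      classify_chunk: Callable[[str], Scores]) -> Tuple[str, float]:
    """Average per-category confidence over all chunks; return the top category."""
    chunks = evenly_spaced_chunks(text)
    totals: Scores = {}
    for chunk in chunks:
        for category, confidence in classify_chunk(chunk).items():
            totals[category] = totals.get(category, 0.0) + confidence
    if not totals:
        raise ValueError("classifier returned no categories")
    # A category missing from a chunk's result counts as 0 for that chunk.
    averages = {c: total / len(chunks) for c, total in totals.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]
```

Averaging before picking a winner means a single misleading chunk, such as a preamble full of unrelated keywords, is outvoted by the rest of the document.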
Next we looked at laws that we found to be classified incorrectly. We compiled a list of such laws and went through them, using more text fragments from each of these to add to the training set. Once this was completed, we uploaded it to the IBM Watson NLU API and waited for it to train a new and improved classifier. This further improved the accuracy and at this point we were happy with it. So as a final step we started to run the classifier across our entire database of laws.
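The correction loop in this paragraph amounts to collecting the laws whose predicted category was wrong and folding new labeled fragments from them into the training set before retraining. A minimal sketch, assuming a hypothetical `Review` record and (snippet, category) training rows; neither structure comes from the interview:

```python
from typing import List, NamedTuple, Tuple


class Review(NamedTuple):
    """One manually checked classification result (hypothetical structure)."""
    law_id: str
    predicted: str
    expected: str


def find_misclassified(reviews: List[Review]) -> List[Review]:
    """Keep only the laws whose predicted category differs from the expected one."""
    return [r for r in reviews if r.predicted != r.expected]


def extend_training_set(training_rows: List[Tuple[str, str]],
                        new_fragments: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Append new (snippet, category) rows, skipping exact duplicates."""
    seen = set(training_rows)
    merged = list(training_rows)
    for row in new_fragments:
        if row not in seen:
            merged.append(row)
            seen.add(row)
    return merged
```

The merged rows would then be uploaded again so that a new classifier can be trained on them.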
Sean: Glad to hear it all worked out. Do you have any final thoughts or takeaways you would like to share about working with IBM Watson NLP technology?
Sebastian: Yes, absolutely! As you can tell, automatically translating and classifying nearly 2 million documents into a number of categories was a daunting task for a small team working with limited resources. With the volume of new laws coming in globally, our company needs to keep up with the demand and constant changes to existing laws at a global scale. We can say confidently that without the help of the Watson platform we would not be able to translate and categorize the millions of documents coming into our database in such a short time. We managed to have the basic implementation running in about a week, which is phenomenal! Thanks to the Watson platform, our small company can punch well above our weight.
Sean: Thanks so much, Sebastian. I can’t wait to hear what Global-Regulation accomplishes next with the help of IBM Watson NLP technology!