After studying Watson materials and actually following a course (September seems a long time ago), I have found some time to get started with IBM Watson Explorer Content Analytics (Version 11).
Installing the server on my laptop was easy enough (use the single-server installation).
One reason to start working with this product is its support for analyzing unstructured text in the Dutch language. Support for Dutch in the cognitive Watson solution has been addressed here: https://www.ibm.com/developerworks/community/blogs/ibmandgoogle/entry/watson_q_and_a_in_dutch?lang=en
A warning about the screenshots ahead: they are in Dutch!
For this first step, I used the following question to put Watson to work:
What is being posted online on the topics and products that we (e-office) are specialized in?
1. First you need to crawl a number of websites. Since this is a first step, I did not include other sources such as SharePoint, Connections, etc. I crawled the following sites:
2. While the crawler was running, I created a few "dictionaries" to be used for analysing the content. I used Content Analytics Studio for this purpose.
The dictionaries:
- Companies (I imported 6k+ accounts from our CRM system)
- Products (I manually entered those). I used the lemma (synonym) function to group product spellings like Office365 and Office 365
- Digital Workplace vocabulary (buzzwords that match our corporate strategy, also entered manually)
Explorer itself has a couple of standard dictionaries that understand the language (in this case Dutch and English).
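To make the lemma (synonym) idea concrete, here is a minimal sketch in plain Python. It is not the Studio dictionary format or any Watson Explorer API; the dictionary contents (apart from the Office365 / Office 365 pair) and the function names are made up for illustration. It just shows how several spellings can be counted under one canonical entry.

```python
import re
from collections import Counter

# Hypothetical product dictionary: lemma (canonical form) -> surface variants.
# Only the "Office365" / "Office 365" pair comes from the post; the rest is illustrative.
PRODUCT_LEMMAS = {
    "Office 365": ["Office 365", "Office365", "O365"],
    "SharePoint": ["SharePoint", "Share Point"],
}


def build_lookup(lemmas):
    """Map every lower-cased variant to its lemma."""
    return {
        variant.lower(): lemma
        for lemma, variants in lemmas.items()
        for variant in variants
    }


def count_products(text, lemmas=PRODUCT_LEMMAS):
    """Count product mentions in text, grouped by lemma."""
    lookup = build_lookup(lemmas)
    # Try longer variants first so multi-word spellings are preferred.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return Counter(lookup[m.group(1).lower()] for m in pattern.finditer(text))


counts = count_products(
    "Wij migreren naar Office365. Office 365 werkt samen met SharePoint."
)
```

Both spellings of Office 365 end up under the same key, which is roughly what the lemma function gives you in the facets.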
3. Parsing rules: In Studio you can create rules that are used to match documents in the index against the specifications you formalized in such a rule (also called an annotation).
Did I already mention that these are first steps?
I created a rule that should find content where abbreviated (company) names (IBM, NASA, HP, ...) appear in a sentence that also has a verb and an entry from the products dictionary; see the screenshot below.
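The logic of that rule can be approximated outside Studio with a few lines of Python. This is only a sketch under some loud assumptions: Studio uses real part-of-speech information to detect verbs, while here a tiny hand-made verb list stands in for it, and the product set and sentence splitter are simplistic placeholders of my own.

```python
import re

# An "abbreviation" here is simply a token of 2 to 5 capital letters (assumption).
ABBREVIATION = re.compile(r"\b[A-Z]{2,5}\b")
# Stand-in for part-of-speech tagging: a hypothetical, tiny verb list.
VERBS = {"gebruikt", "levert", "kiest", "uses", "launches"}
# Hypothetical excerpt of the products dictionary, lower-cased.
PRODUCTS = {"office 365", "sharepoint"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def match_rule(text):
    """Return sentences containing an abbreviation, a verb and a product."""
    hits = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        abbreviations = ABBREVIATION.findall(sentence)
        words = set(re.findall(r"\w+", lowered))
        has_verb = bool(words & VERBS)
        products = sorted(p for p in PRODUCTS if p in lowered)
        if abbreviations and has_verb and products:
            hits.append((sentence, abbreviations, products))
    return hits


hits = match_rule(
    "IBM gebruikt SharePoint voor samenwerking. "
    "Het weer is mooi vandaag. "
    "ICT kiest Office 365."
)
```

Note that any all-caps token qualifies as an abbreviation in this sketch, whether or not it is a company name.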
The result of applying this rule to the content of the crawled websites is shown in a screenshot of the "Miner" application of Watson Explorer.
What stands out is the incorrect match for "ICT": this is an abbreviation, but NOT for a company!
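One way to think about fixing this kind of false positive is an exclusion list, optionally combined with a check against known company names. The sketch below is plain Python and not something I have configured in Studio; the exclusion list and the function name are hypothetical.

```python
import re

ABBREVIATION = re.compile(r"\b[A-Z]{2,5}\b")
# Hypothetical exclusion list: abbreviations that are not company names.
NOT_A_COMPANY = {"ICT", "HR", "CRM"}


def company_abbreviations(sentence, known_companies=None):
    """Return abbreviation candidates, minus the excluded ones.

    If known_companies is given (for example, names from the CRM import),
    only candidates found in that set are kept.
    """
    candidates = ABBREVIATION.findall(sentence)
    kept = [c for c in candidates if c not in NOT_A_COMPANY]
    if known_companies is not None:
        kept = [c for c in kept if c in known_companies]
    return kept


filtered = company_abbreviations("IBM en HP investeren in ICT")
strict = company_abbreviations("IBM en HP investeren in ICT", {"IBM"})
```

The exclusion list is the "Human Learning" part: it only gets better by adding the mismatches you run into.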
The next screenshot shows company types and how often each type occurs in the documents in the index. The type of a company is imported from our CRM system, along with its name.
To display documents that contain companies of the type "Influencer" ("Beinvloeder" in Dutch), use this view. The company names are highlighted.
Again there are mismatches: "Ilse" in the document is a reference to "Ilse de Lange" and not to the company Ilse.
The same analysis can be performed for the Digital Workplace vocabulary. Here I have selected the term "EMM", and this is the result. Notice the synonyms that also show up.
I don't have the rule working yet that combines the CRM data dictionary with the Products dictionary (it should be similar to the abbreviation rule). I think one of the reasons is that the formal company name is not often used on the websites that we have crawled.
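If the formal name really is the problem, one idea is to generate shorter variants of each CRM name before importing them as synonyms. I have not tried this yet; the sketch below is plain Python, the suffix list is an assumption on my part, and the company name in the example is invented.

```python
import re

# Assumed list of trailing parts that websites tend to drop from a formal name.
# B.V. and N.V. are Dutch legal forms; "holding" and "nederland" are guesses.
LEGAL_SUFFIXES = re.compile(
    r"[\s,]+(b\.?v\.?|n\.?v\.?|inc\.?|ltd\.?|gmbh|holding|nederland)$",
    re.IGNORECASE,
)


def short_names(formal_name):
    """Return the formal name followed by progressively shorter variants."""
    variants = [formal_name.strip()]
    current = variants[0]
    while True:
        stripped = LEGAL_SUFFIXES.sub("", current).strip()
        if stripped == current or not stripped:
            break
        variants.append(stripped)
        current = stripped
    return variants


# "Voorbeeld" is a made-up company name ("example" in Dutch).
variants = short_names("Voorbeeld Nederland B.V.")
```

Each variant could then be added as a synonym of the formal name in the Companies dictionary, at the cost of more mismatches like the "Ilse" one.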
Machine learning could help us out here, but since we like to communicate in Dutch, we cannot go there today. For now it will be Human Learning, meaning that I have to incrementally improve the rules and dictionaries :-)
To get started I have asked several questions in this forum.
@sgnk2700004K01 has been very helpful!
I just wanted to share how I got started while I can still recall these baby steps.
If everything goes well I will return with the next steps on this topic.