At e-office we keep our CRM data in Microsoft CRM Dynamics Online.
We want to add the information from this system to our Watson Explorer Content Analytics collection.
I first asked the question on dWAnswers: https://developer.ibm.com/answers/questions/275949/creating-and-registering-a-cusomt-crawler-for-wats.html
I created the crawler based on the sample you can find in the
As Deepika Devarajan points out, there is no publicly available documentation, so if you need more info please contact your IBM rep to ask for this documentation.
Custom Crawler configuration
Before you start coding, you first need to get the option "Custom Crawler" in the list of Crawler types:
To get this option in the list, modify the config.properties file in .../webapps/ESAdmin/WEB-INF/
Restart the admin session to see the effect.
Another effect of this setting is the Custom Crawler tab in the System Settings:
Implementing the crawler
In the customcrawler sample code, you will find several classes.
The CustomManager's role is to instantiate a TopSpace. It is your CustomManager class that you specify in the Custom Crawler Type settings.
The TopSpace:
- is used to collect information about the system you are connecting to (such as hostname and credentials)
- is used to generate the SubSpace(s)
The information collected by the TopSpace is accessible via the CustomInfo class
SubSpaces:
- can have their own configuration, which can also be accessed through the CustomInfo class
- are responsible for getting the list of content
- provide the fields that you want to add to the index, in addition to the body and standard fields
CustomContent is used to get the actual content.
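Since the real API is not publicly documented, here is a minimal sketch of how these roles relate to each other. Every type and method name below (SketchManager, SketchTopSpace, SketchSubSpace, SketchInfo, SketchContent, and the "host" and "entities" settings) is a hypothetical stand-in that I made up for illustration. None of them are the signatures from the IBM sample, and the records are in-memory dummies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// All Sketch* types are hypothetical stand-ins, NOT the IBM sample's API.

// Plays the CustomInfo role: settings collected in the admin console.
final class SketchInfo {
    private final Map<String, String> settings = new LinkedHashMap<>();

    SketchInfo set(String key, String value) {
        settings.put(key, value);
        return this;
    }

    String get(String key) {
        return settings.get(key);
    }
}

// Plays the CustomContent role: the body plus the extra index fields.
final class SketchContent {
    final String body;
    final Map<String, String> fields;

    SketchContent(String body, Map<String, String> fields) {
        this.body = body;
        this.fields = fields;
    }
}

// Plays the SubSpace role: lists the content and supplies the extra fields.
final class SketchSubSpace {
    final String entity;
    private final Map<String, String> records; // id -> body, in-memory dummy data

    SketchSubSpace(String entity, Map<String, String> records) {
        this.entity = entity;
        this.records = records;
    }

    List<String> listContentIds() {
        return new ArrayList<>(records.keySet());
    }

    SketchContent fetch(String id) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("entity", entity); // extra field next to the body
        return new SketchContent(records.get(id), fields);
    }
}

// Plays the TopSpace role: reads the connection settings, generates SubSpaces.
final class SketchTopSpace {
    private final SketchInfo info;

    SketchTopSpace(SketchInfo info) {
        this.info = info;
    }

    List<SketchSubSpace> subSpaces() {
        List<SketchSubSpace> result = new ArrayList<>();
        for (String raw : info.get("entities").split(",")) {
            String entity = raw.trim();
            Map<String, String> records = new LinkedHashMap<>();
            records.put(entity + "-1", "Example body for " + entity);
            result.add(new SketchSubSpace(entity, records));
        }
        return result;
    }
}

// Plays the CustomManager role: its only job is to instantiate the TopSpace.
final class SketchManager {
    SketchTopSpace createTopSpace(SketchInfo info) {
        return new SketchTopSpace(info);
    }
}

final class CrawlerSketch {
    public static void main(String[] args) {
        SketchInfo info = new SketchInfo()
                .set("host", "crm.example.invalid")
                .set("entities", "account, email");
        for (SketchSubSpace space : new SketchManager().createTopSpace(info).subSpaces()) {
            for (String id : space.listContentIds()) {
                SketchContent content = space.fetch(id);
                System.out.println(id + " " + content.fields + " " + content.body);
            }
        }
    }
}
```

The point of the sketch is only the direction of the calls: manager creates the top space, the top space hands out sub spaces, and each sub space lists and fetches its own content.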
Implementation for CRM Dynamics
In our case the TopSpace asks for this information:
To access CRM Dynamics Online you need to register your application with Azure AD. The configuration of this part is outside the scope of this article.
The SubSpaces in our case are the CRM entities. The screenshot below shows a list of all the known entities in our CRM system (notice the scrollbar).
In this case I want to crawl Accounts and Activities. Unfortunately not all entities use the same field names for their title- and/or memo-fields.
In our case we need to set the title field to name for the account entity. The first three options are pre-populated by the TopSpace. Since description is the most common field name for the memo field, that default value can stay.
This pre-population is possible because CRM Dynamics exposes a metadata API.
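To give an idea of what such a metadata lookup can look like, here is a small sketch. It assumes the Dynamics Web API metadata endpoint (EntityDefinitions with the LogicalName and PrimaryNameAttribute properties). The API version segment, the organization URL, and the EntityDefaults class with its titleField and memoField keys are assumptions of mine, not what our crawler literally does. The sketch only builds the request URL and derives the defaults, it does not make the HTTP call.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: derives default SubSpace settings from entity metadata.
final class EntityDefaults {
    // "description" is the most common memo field name, so it is the default.
    static final String DEFAULT_MEMO_FIELD = "description";

    // Builds the metadata query URL; the version segment (e.g. "v8.2") is an assumption.
    static String metadataUrl(String orgUrl, String apiVersion) {
        String base = orgUrl.endsWith("/")
                ? orgUrl.substring(0, orgUrl.length() - 1)
                : orgUrl;
        return base + "/api/data/" + apiVersion
                + "/EntityDefinitions?$select=LogicalName,PrimaryNameAttribute";
    }

    // Maps one metadata record to the values shown in the configuration form.
    static Map<String, String> defaultsFor(String logicalName, String primaryNameAttribute) {
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("entity", logicalName);
        // Leave the title field empty when the metadata does not provide one,
        // so the administrator has to fill it in by hand.
        defaults.put("titleField", primaryNameAttribute == null ? "" : primaryNameAttribute);
        defaults.put("memoField", DEFAULT_MEMO_FIELD);
        return defaults;
    }

    public static void main(String[] args) {
        System.out.println(metadataUrl("https://example.crm.dynamics.com", "v8.2"));
        System.out.println(defaultsFor("account", "name"));
    }
}
```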
Now we have two configured search spaces that we can start to crawl:
When the crawling has finished, we can go to the miner to inspect the results.
In the screenshot you can see that we added the entity as an extra field. This way we can see that we have 7979 accounts and that, of all the activities, email has the most records, but that we also have 1 fax record :-)
The timestamps on the documents also transfer nicely into (in this case) the deviations view.
- Security: we did not implement document level security
- Social features: it would be great if we could see who created these records (other than in text)
- Leverage scheduling: although a second crawl takes a lot less time, we still need to fetch all documents to get their modifiedon timestamps. If the scheduler could somehow tell us when the last crawl ran, we could use that information to limit our data retrieval.
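For the scheduling point, this is roughly what I have in mind: a sketch that only asks CRM for records changed since the previous crawl, using an OData $filter on modifiedon. It assumes the crawler can get the last crawl time from somewhere (for example by persisting it itself), which is exactly the part we do not have yet. The IncrementalQuery class, the "accounts" entity set name, and the API version are assumptions for illustration.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical helper: builds a query URL limited to recently modified records.
final class IncrementalQuery {
    static String crawlUrl(String orgUrl, String apiVersion, String entitySet, Instant lastCrawl) {
        String base = orgUrl.endsWith("/")
                ? orgUrl.substring(0, orgUrl.length() - 1)
                : orgUrl;
        String url = base + "/api/data/" + apiVersion + "/" + entitySet + "?$select=modifiedon";
        if (lastCrawl == null) {
            return url; // first crawl: no filter, fetch everything
        }
        // Truncate to whole seconds; rounding down can only refetch a few
        // records, it never skips one.
        String since = lastCrawl.truncatedTo(ChronoUnit.SECONDS).toString();
        return url + "&$filter=modifiedon%20gt%20" + since;
    }

    public static void main(String[] args) {
        Instant last = Instant.parse("2016-05-01T12:00:00Z");
        System.out.println(crawlUrl("https://example.crm.dynamics.com", "v8.2", "accounts", null));
        System.out.println(crawlUrl("https://example.crm.dynamics.com", "v8.2", "accounts", last));
    }
}
```

Deleted records would still need separate handling, since a modifiedon filter never returns them.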