(i.e. using batch, script, or Collaboration & Deployment Services job...)
There IS a way to automate generation of a new Concept nugget from a text node, as there is an option to generate a concept nugget when the node is executed. However, a category text nugget can only be generated in interactive mode... even if all one wants to do is feed the text node new data, open the workbench, save as a model, and exit (without editing any categories).
The problem is that text nuggets only store the text entities seen in the data that created them. So while a nonlinguistic pattern may recognize Phone Numbers or Days of Month... the generated text nugget will only extract Phone Numbers or Days of Month that were present in the original data set used to create the nugget. So for example, unless a given phone number (say 999-999-9999) appears in your training set, the text nugget will never extract it, so it can never identify ALL phone numbers that occur. The nonlinguistic entity's regular expression tagging does not exist in the nugget, only the actual text.
So in a production environment, I still haven't found a way to refresh a text nugget with new data, without having a physical person edit the stream and manually create the new text nuggets using newer data.
Simple Example of issue:
- A nonlinguistic entity exists which extracts <Car> = ford, chevy, gm, porsche, mercedes, vw, etc...
- data (people talking about cars) is fed into Text Node category model, which creates a text nugget
- month by month we feed new data, in which people mention Cars, into the system through the text nugget.
- For the category text node, the original set of data only had people talking about ford, chevy and gm
- the resulting text nugget will only recognize the concepts ford, chevy and gm (even though the nonlinguistic entity can also recognize porsche, mercedes and vw)
- If one feeds newer data through the nugget and people talk about porsche or mercedes or vw, the system can never extract those mentions (because they aren't stored in the text nugget)
- so one needs to "refresh" the nugget with new data, so that the new nugget can recognize the concepts in that data.
But there is no way to automate a refresh.
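
To make the behaviour concrete, here is a toy Python sketch of what I think is going on. This is NOT Modeler scripting and does not use any Text Analytics API; `CAR_PATTERN`, `build_nugget` and `score` are hypothetical names I made up purely for illustration. The assumption is that the nugget behaves like a frozen list of the concepts seen at build time, while the nonlinguistic entity behaves like a pattern.

```python
import re

# Hypothetical stand-in for the nonlinguistic <Car> entity: a pattern that
# knows every make, whether or not it appeared in the training data.
CAR_PATTERN = re.compile(r"\b(ford|chevy|gm|porsche|mercedes|vw)\b", re.IGNORECASE)

def build_nugget(records):
    """Toy 'nugget': keep only the concepts actually seen in the records."""
    seen = set()
    for text in records:
        seen.update(m.lower() for m in CAR_PATTERN.findall(text))
    return frozenset(seen)

def score(nugget, text):
    """Score text using only the literal concepts stored in the nugget."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if w in nugget})

training = ["My ford is great", "chevy beats gm"]
new_month = "I traded my ford for a porsche and a vw"

# Nugget built from the original data only knows ford, chevy and gm.
nugget = build_nugget(training)
from_nugget = score(nugget, new_month)

# The pattern itself would have found all three makes in the new data.
from_pattern = sorted({m.lower() for m in CAR_PATTERN.findall(new_month)})

# "Refreshing" means rebuilding the nugget with the new data included.
refreshed = score(build_nugget(training + [new_month]), new_month)
```

In this sketch the old nugget only finds ford in the new month's text, while the rebuilt one finds ford, porsche and vw. The rebuild step is the part I cannot automate for a Category Nugget.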
If this was a Concept Nugget, I could feed new data into the text node, in Generate Concept Nugget mode... and hence automatically create an updated text nugget that would recognize all Cars in the current data set.
However, there is no way to automate creation of a Category Nugget.
So there appears to be no way to automate a stream, which uses Category Nuggets, to refresh itself with new data (without a person manually creating the refreshed nugget).
This is not useful in a production system that uses category nuggets to score text data.
Has anyone figured out a solution to this issue?