Topic
  • 2 replies
  • Latest Post - ‏2012-11-14T16:02:29Z by SystemAdmin
SystemAdmin
SystemAdmin
435 Posts

Pinned topic Automating refresh of Category text nuggets with new data

‏2012-11-14T08:17:15Z |
For a production system: is there a way yet to automatically refresh a 'category' text nugget with new data, without a person having to manually edit the stream?
(i.e. using batch, script, or Collaboration & Deployment Services job...)

There IS a way to automate generation of a new Concept nugget from a text node, as there is an option to generate a concept nugget when the node is executed. However, a category text nugget can only be generated in interactive mode... even if all one wants to do is feed the text node new data, open the workbench, save as model, and exit (without editing any categories).

The problem is that text nuggets only store the text entities seen in the data that created them. So while a nonlinguistic pattern may recognize Phone Numbers or Days of Month, the generated text nugget will only extract Phone Numbers or Days of Month that "were present in the original data set" used to create the nugget. For example, unless your training set includes every possible phone number (999-999-9999 and so on), the text nugget will never be able to identify ALL phone numbers that occur. The nonlinguistic entity's regular expression tagging does not exist in the nugget, only the actual text.

So in a production environment, I still haven't found a way to refresh a text nugget with new data, without having a physical person edit the stream and manually create the new text nuggets using newer data.

Simple Example of issue:
  • A nonlinguistic entity exists which extracts <Car> = ford, chevy, gm, porsche, mercedes, vw, etc...
  • data (people talking about cars) is fed into Text Node category model, which creates a text nugget
  • month by month we feed new data into system, through the text nugget, in which people mention Cars.
  • For the category text node, the original set of data only had people talking about ford, chevy and gm
  • the resulting text nugget will only recognize the concepts ford, chevy and gm (even though the nonlinguistic entity can also recognize porsche, mercedes and vw)
  • If one feeds newer data through the nugget and people talk about porsche or mercedes or vw, the system can never extract those concepts (because they aren't stored in the text nugget)
  • so one needs to "refresh" the nugget with new data, so that the new nugget can recognize the concepts in the new data.
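
The steps in this example can be mimicked with a small toy model. This is only a sketch of the behaviour described above, not Modeler's API or nugget format: `CAR_TERMS`, `extract_cars`, `build_nugget` and `score` are hypothetical names, and the word matching is deliberately simplistic.

```python
import re

# Hypothetical stand-in for the <Car> nonlinguistic entity (not Modeler's real format).
CAR_TERMS = {"ford", "chevy", "gm", "porsche", "mercedes", "vw"}

def extract_cars(text):
    """Return the car terms the full entity definition finds in one text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w in CAR_TERMS}

def build_nugget(training_texts):
    """Freeze only the concepts actually seen in the training data."""
    seen = set()
    for text in training_texts:
        seen |= extract_cars(text)
    return frozenset(seen)

def score(nugget, text):
    """Score with the frozen nugget: concepts outside it are ignored."""
    return extract_cars(text) & nugget

# Original data only mentions ford, chevy and gm.
training = ["I love my Ford", "Chevy and GM trucks are everywhere"]
nugget = build_nugget(training)

# A later month's record mentions a car the nugget never saw.
new_record = "Traded my Ford for a Porsche"
found = score(nugget, new_record)
missed = extract_cars(new_record) - nugget

print(sorted(found))   # concepts the frozen nugget still recognizes
print(sorted(missed))  # concepts the entity knows but the nugget lacks
```

Calling `build_nugget` again on the newer data is the "refresh" step; the question in this post is how to trigger that step for a category nugget without a person in the loop.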

But there is no way to automate a refresh.

If this were a Concept Nugget, I could feed new data into the text node in Generate Concept Nugget mode, and hence automatically create an updated text nugget that would recognize all Cars in the current data set.
However, there is no way to automate creation of a Category Nugget.
So there appears to be no way to automate a stream, which uses Category Nuggets, to refresh itself with new data (without a person manually creating the refreshed nugget).
This is not useful in a production system that uses category nuggets to score text data.

Has anyone figured out a solution to this issue?
Updated on 2012-11-14T16:02:29Z by SystemAdmin
  • SystemAdmin
    SystemAdmin
    435 Posts

    Re: Automating refresh of Category text nuggets with new data

    ‏2012-11-14T08:43:29Z  
    hello
    To my knowledge, there is no way to automatically refresh a category text nugget.
    However, it is not completely true to say that a text nugget only contains the concepts that were extracted. It also contains all the terms from the template that was used to create the model, provided their match is at least "Entire".

    So let's suppose you create a type Cars and add terms such as mercedes, bmw, porsche, renault... with Match "Entire and Any".
    Your training corpus only contains "Renault" and "Mercedes x400".
    With the new type, both terms will be typed as Cars and instead of adding the concepts as descriptors, you can use the type to make a descriptor.
    You can then generate the nugget, and if your new corpus contains "mercedes x400", "renault", "bmw" and "porsche", all 4 records will be scored, because bmw and porsche, although not extracted, are part of the model as <Cars>.
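
The difference between a concept-based descriptor and a type-based descriptor in the reply above can be illustrated with a toy comparison. This is a sketch under stated assumptions, not how Modeler stores or matches terms: `CARS_TYPE` and `matches_type` are hypothetical, and the word-containment test is only a rough stand-in for the "Entire and Any" match option.

```python
import re

# Hypothetical template type: every term listed here travels with the model.
CARS_TYPE = {"mercedes", "bmw", "porsche", "renault"}

def matches_type(text, type_terms):
    """True if any word of the text is a term of the type
    (a rough stand-in for the 'Entire and Any' match option)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return any(w in type_terms for w in words)

# Training corpus from the reply: only two concepts are ever extracted.
training = ["Renault", "Mercedes x400"]
extracted_concepts = {t.lower() for t in training}

new_corpus = ["mercedes x400", "renault", "bmw", "porsche"]

# Descriptor built from the extracted concepts: limited to what was seen.
by_concept = [r for r in new_corpus if r in extracted_concepts]

# Descriptor built from the type <Cars>: covers every term in the type.
by_type = [r for r in new_corpus if matches_type(r, CARS_TYPE)]

print(by_concept)
print(by_type)
```

In this toy version the concept-based descriptor scores two records and the type-based descriptor scores all four, which mirrors the outcome described in the reply.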
  • SystemAdmin
    SystemAdmin
    435 Posts

    Re: Automating refresh of Category text nuggets with new data

    ‏2012-11-14T16:02:29Z  
    Marie-Claude,

    Thanks for the clarification on "entire" and what it causes a nugget to store.

    My related issue is whether nonlinguistic entities (using regular expressions) can ever be entirely captured in a single nugget without a refresh (or an automatic way to refresh).

    For example... one wants to capture when people are talking about Social Security numbers. We have the nonlinguistic entity to recognize social security numbers via regular expression. The Concept or Category nugget only recognizes the social security numbers that were present when it was created. So if no one ever mentioned "111-22-3333" when the nugget was generated, it will never be captured in future texts by that nugget.

    So we can't create a Nugget that will flag all social security numbers, without manually refreshing (regenerating) the nugget. And there's no apparent method to automate the process.
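
    The gap described here, between a regular expression applied at scoring time and a frozen set of literal matches, can be sketched as follows. The pattern and the function names (`SSN_PATTERN`, `freeze_literals`, `score_frozen`, `score_live`) are assumptions for illustration only; they are not Modeler's nonlinguistic entity definition or its scoring logic.

```python
import re

# Assumed, simplified SSN-like pattern (not Modeler's actual definition).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def freeze_literals(training_texts):
    """Keep only the literal matches seen at build time, as the nugget does."""
    stored = set()
    for text in training_texts:
        stored.update(SSN_PATTERN.findall(text))
    return stored

def score_frozen(stored, text):
    """Score against the frozen literals: unseen numbers are dropped."""
    return [m for m in SSN_PATTERN.findall(text) if m in stored]

def score_live(text):
    """Score with the pattern itself: any well-formed number matches."""
    return SSN_PATTERN.findall(text)

training = ["My number is 123-45-6789."]
stored = freeze_literals(training)

new_text = "Please update 111-22-3333 on my account."
print(score_frozen(stored, new_text))  # what the frozen nugget would flag
print(score_live(new_text))            # what the pattern would flag
```

    Because the space of valid numbers is effectively unbounded, no training set can make the frozen version complete, which is why a refresh (or pattern matching at scoring time) is needed.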

    For production streams, where we have new text data being pushed through the stream, an enhancement request would be an option in the Text Mining node model options for "Generate directly (CATEGORY model nugget)"... much like there is currently a "Generate directly (concept model nugget)" option. For category models, only a Build Interactively option is available. Then we could fully refresh text nuggets with category extraction. This option would just run extraction, generate TLA, score, generate a new nugget, and exit.