5 replies Latest Post - ‏2012-11-15T04:17:41Z by SystemAdmin
SystemAdmin

Pinned topic Help with IBM SPSS Text Analytics

‏2012-10-15T10:07:59Z |
For my thesis I have to work with IBM SPSS Text Analytics.

I have to analyse annual reports in PDF format.

For that purpose I created a word list, and the program should find these words within the PDFs.

I do not know how best to approach this.

I hope someone can help me with several problems.

I am a German student, so it might be easier to discuss the problems in German.

If anyone is familiar with this program, please write to me: klein.melissa@web.de

Thanks for your help
Updated on 2012-11-15T04:17:41Z at 2012-11-15T04:17:41Z by SystemAdmin
  • TedFischer

    Re: Help with IBM SPSS Text Analytics

    ‏2012-10-18T13:18:35Z  in response to SystemAdmin
    Text analytics is not a search program. It is designed to find certain concepts or types of words, not to find a specific word or term. If you want to search for specific words in a file, use a combination of the file list node from text analytics and a derive node using the has_substring function.

    Ted
    • SystemAdmin

      Re: Help with IBM SPSS Text Analytics

      ‏2012-10-18T14:55:08Z  in response to TedFischer
      I used the file list node! But what is a "derive node using the has_substring function"?
      I also want to link two words, that is, I want to find two words that occur together, e.g. "probability" and "long-term production". I think I can do this with TLA, but I do not know exactly how.
      • TedFischer

        Re: Help with IBM SPSS Text Analytics

        ‏2012-10-18T15:29:35Z  in response to SystemAdmin
        Create a derive node. If you are looking for the word IBM, select the option to derive as a flag and enter has_substring(string_field,"IBM") > 0. If the value returned is greater than zero, IBM is in the string. If you are looking for both "probability" and "long-term production", then enter has_substring(string_field,"probability") > 0 and has_substring(string_field,"long-term production") > 0
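As a rough illustration outside Modeler, the flag logic described above can be sketched in Python. This is only a sketch under assumptions: the `has_substring` helper below is a hypothetical stand-in that returns a 1-based position (0 when the term is absent), chosen to mirror the "> 0" test in the expressions above. It is not the Modeler function itself, and `flag_single`, `flag_all`, and the sample text are made up for the example.

```python
def has_substring(text: str, term: str) -> int:
    """Hypothetical stand-in: 1-based position of term in text, 0 if absent."""
    return text.find(term) + 1


def flag_single(text: str, term: str) -> bool:
    """Flag is true when the single term occurs in the text."""
    return has_substring(text, term) > 0


def flag_all(text: str, terms: list[str]) -> bool:
    """Flag is true only when every term occurs (the 'and' condition)."""
    return all(has_substring(text, term) > 0 for term in terms)


# Made-up sample text standing in for the string field of one report.
report = "IBM discusses the probability of long-term production growth."

print(flag_single(report, "IBM"))
print(flag_all(report, ["probability", "long-term production"]))
print(flag_all(report, ["probability", "short-term production"]))
```

Note that this sketch matches case-sensitively and matches anywhere in the text, so "IBM" would also be found inside a longer word that contains it.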

        Text link analysis is the way to find co-occurrences of words or concepts. For information on how to use it, look at the relevant chapter of the Text Analytics User's Guide, found at

        http://pic.dhe.ibm.com/infocenter/spssmodl/v15r0m0/index.jsp?topic=%2Fcom.ibm.spss.modeler.help%2Fabout_clementine_documentation.htm

        Go to IBM SPSS Modeler Text Analytics Help, Text Mining Nodes, Mining for Text Links (you can also do text link analysis in the interactive workbench).

        Ted
      • SystemAdmin

        Re: Help with IBM SPSS Text Analytics

        ‏2012-11-12T16:01:23Z  in response to SystemAdmin
        Hello Melone,
        When you write "I also want to link two words, that means I want to find two words going together, e.g. 'probability' and 'long-term production'", do you mean in the same sentence or in the same document?

        TLA only works at the sentence level, not when the words are in different sentences.
        If you want only sentences that contain both words, you can use TLA, but you'll need to do several things:
        1. First define a type for probability, for instance Probability, and add the term under it.
        Do the same for long-term production, with the type Production.
        2. Then write a TLA rule, such as
        $Probability @{0,8} $Production
        and define an output.
        It will match sentences such as "probability of long-term production" or "probability for xx, yy and long-term production".
        But you need to go through step 1 first, because once words are extracted and become concepts, such as "probability", you cannot manipulate them as strings. You can access them only through their type.

        If you want documents that contain both words, another way is to create a category model with a business rule, such as
        probability & long-term production. It will score any document containing both words, even if probability is in the last sentence of the document and long-term production is in the first.
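To make the difference between the two approaches concrete, here is a small Python approximation. It is a sketch under assumptions, not the TLA or category engine: the `TYPES` dictionary, the naive sentence splitter and tokenizer, and the reading of `@{0,8}` as "0 to 8 words between the two types, in that order" are all simplifications introduced for illustration.

```python
import re

# Hypothetical type dictionary: type name -> terms assigned to it (step 1).
TYPES = {
    "Probability": ["probability"],
    "Production": ["long-term production"],
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Lowercase word tokens; hyphenated words stay as one token."""
    return re.findall(r"[\w-]+", text.lower())


def find_term(tokens, term):
    """Return (start, end) token spans where the (multi-word) term occurs."""
    words = term.lower().split()
    n = len(words)
    return [(i, i + n) for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == words]


def sentence_matches(sentence, first="Probability", second="Production",
                     max_gap=8):
    """Approximates '$Probability @{0,8} $Production' within one sentence."""
    tokens = tokenize(sentence)
    firsts = [sp for t in TYPES[first] for sp in find_term(tokens, t)]
    seconds = [sp for t in TYPES[second] for sp in find_term(tokens, t)]
    # Gap = number of tokens between the end of the first and start of the second.
    return any(0 <= s2 - e1 <= max_gap
               for (_, e1) in firsts for (s2, _) in seconds)


def document_matches(text):
    """Approximates the rule 'probability & long-term production' per document."""
    tokens = tokenize(text)
    return all(any(find_term(tokens, t) for t in terms)
               for terms in TYPES.values())


doc = "Long-term production rose. Later we discuss probability."
print(any(sentence_matches(s) for s in split_sentences(doc)))
print(document_matches(doc))
```

In this sketch the sample document satisfies the document-level rule but no single sentence satisfies the sentence-level one, which is the distinction described above.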
  • SystemAdmin

    Re: Help with IBM SPSS Text Analytics

    ‏2012-11-15T04:17:41Z  in response to SystemAdmin
    One other helpful hint: Text Analytics uses some resources in Adobe Reader (or Adobe PDF iFilter) to process PDFs, so you need Adobe Reader/Acrobat installed to properly translate PDF files. Also, you currently need to run version 9.x, as version 10 and later are not fully compatible.
    From the User's Guide for Text Analytics:
    Adobe PDF Processing. In order to extract text from Adobe PDFs, Adobe Reader version 9 must
    be installed on the machine where SPSS Modeler Text Analytics and IBM® SPSS® Modeler
    Text Analytics Server reside.
    Note: Do not upgrade to Adobe Reader version 10 or later because it does not contain the
    required filter.
    Upgrading to Adobe Reader version 9 helps you avoid a rather substantial memory leak in the
    filter that caused processing errors when working with volumes of Adobe PDF documents
    (near or over 1,000). If you plan to process Adobe PDF documents on either 32-bit or 64-bit
    Microsoft Windows OS, upgrade to either Adobe Reader version 9.x for 32-bit systems or
    Adobe PDF iFilter 9 for 64-bit systems, both of which are available on the Adobe website.
    Adobe has changed the filtering software they use starting in Adobe Reader 8.x. Older
    Adobe PDF files may not be readable or may contain foreign characters. This is an Adobe
    issue and is outside of SPSS Modeler Text Analytics’s control.
    If an Adobe PDF’s security restriction for “Content Copying or Extraction” is set to “Not
    Allowed” in the Security tab of the Adobe PDF’s Document Properties dialog, then the
    document cannot be filtered and read into the product.
    Adobe PDF files cannot be processed under non-Microsoft Windows platforms.
    Due to limitations in Adobe, it is not possible to extract text from image-based Adobe PDF
    files.