IBM Support

How to successfully index your documents

Technical Blog Post



Paula Muir is a Software Developer with IBM Content Manager OnDemand for Multiplatforms in Boulder, Colorado. She has 20 years of experience with Content Manager OnDemand and 15 years of experience in the data indexing field. Her areas of expertise include indexing and loading data, and AFP and PDF architecture.

 

 

I just finished writing the indexing chapter for the new IBM Redbooks publication, IBM Content Manager OnDemand Guide.  It is 14 pages long, but I could write a whole book about indexing.

I've worked on several of the Content Manager OnDemand (CMOD) indexers for over 10 years, and have seen many indexing problems.  Often they are user errors that come down to one of two things: not understanding the format of the data, or not understanding the correct format for loading it into CMOD. So here, I'll briefly discuss these two topics.

1. Understand the format of your data

It is critical to have a good understanding of the format of your data before indexing it. Let’s use line data as an example.  It sounds simple, but line data is really complex. You must understand the following characteristics of line data:

  • Are carriage controls present?
  • Type of carriage control (ANSI or machine)
  • Is there a line delimiter?
  • Type of line delimiter
  • Are TRCs (Table Reference Characters) present?
  • Code page
  • Record length
  • Variable or fixed record length

And I don't want to begin to discuss PRMODE!

Many customers have no idea whether their line data has any of these characteristics.  The checklist above gives you a good start on what to look for.
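As a rough illustration of working through the checklist, here is a minimal Python sketch that reports a few of these characteristics for a sample of line data. It is only a heuristic and makes several assumptions that are not from CMOD: the data is ASCII (not EBCDIC), records are separated by LF or CRLF, and the function name `describe_line_data` is made up for this example. A report whose lines happen to start with digits or blanks will look like it has ANSI carriage controls, so treat the result as a hint, not a verdict.

```python
# ANSI carriage control characters: blank, 0, -, + and the channel skips 1-9, A-C.
ANSI_CC = set(" 0-+123456789ABC")


def describe_line_data(data: bytes) -> dict:
    """Report rough, heuristic facts about a sample of ASCII line data."""
    # Work out which line delimiter, if any, separates the records.
    if b"\r\n" in data:
        delimiter = "CRLF"
        records = data.split(b"\r\n")
    elif b"\n" in data:
        delimiter = "LF"
        records = data.split(b"\n")
    else:
        delimiter = None
        records = [data]

    # A trailing delimiter leaves one empty record at the end; drop it.
    if records and records[-1] == b"":
        records.pop()

    lengths = {len(r) for r in records}
    non_empty = [r for r in records if r]

    # Heuristic: every non-empty record starts with a valid ANSI control character.
    ansi_cc = bool(non_empty) and all(chr(r[0]) in ANSI_CC for r in non_empty)

    return {
        "delimiter": delimiter,
        "record_count": len(records),
        "fixed_length": len(lengths) == 1,
        "max_record_length": max(lengths, default=0),
        "ansi_carriage_controls": ansi_cc,
    }


if __name__ == "__main__":
    sample = b"1PAGE ONE\n LINE TWO\n0LINE THREE\n"
    print(describe_line_data(sample))
```

Code page, TRCs, and machine carriage controls are deliberately left out; they need knowledge of how the data was produced, not just a look at the bytes.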

Think of the different kinds of line data as being like the different types of wine. But although I think we're living in a golden age of wine, we're well past the golden age of line data, and I still see many problems with it.

By the golden age of wine, I mean that more countries are making more good wine than ever before.  If you only drink, say, California Chardonnay, it's time to expand your horizons!  Try a Spanish Albariño, an Italian Verdicchio, or a New Zealand Sauvignon Blanc.  You will be surprised!

OK, back to line data and indexing.

2. Understand the correct format for loading into CMOD

Another critical factor in successfully indexing data is to have a good understanding of the correct format for loading the data into CMOD.

What is the correct format for loading line data into CMOD?  Aside from the other characteristics, the ideal line data would contain either ANSI or machine carriage controls. This is so important that CMOD provides an indexing exit just to insert carriage controls into the data. When there are no carriage controls, nothing marks where one page ends and the next begins, and bad things happen. I've seen CMOD attempt to load 100 MB documents as one page.  Surprise - the system runs out of memory.
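To make the idea of inserting carriage controls concrete, here is a minimal Python sketch. It is not the CMOD indexing exit and does not use any CMOD API; it only illustrates the concept. It assumes the data has a fixed number of lines per page (the `lines_per_page` parameter and the function name are made up for this example) and prefixes each line with an ANSI carriage control: `1` (skip to channel 1, conventionally the top of a new page) on the first line of each page, and a blank (single space) on every other line.

```python
def add_ansi_carriage_controls(lines, lines_per_page=60):
    """Prefix each line with an ANSI carriage control character.

    Assumption for this sketch: pages break every `lines_per_page` lines.
    Real data may need a different rule for finding page breaks.
    """
    out = []
    for i, line in enumerate(lines):
        # '1' starts a new page; ' ' is a normal single-spaced line.
        cc = "1" if i % lines_per_page == 0 else " "
        out.append(cc + line)
    return out


if __name__ == "__main__":
    for record in add_ansi_carriage_controls(["first", "second", "third"], lines_per_page=2):
        print(repr(record))
```

With page breaks marked this way, each page can be handled on its own instead of the whole file being treated as a single page.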

The line data characteristics are specified on the View Information tab of the Application.


Next, what is the correct format for loading AFP into CMOD?  The following is a text dump of the correct format.

BDT Begin Document
  BNG Begin Named Page Group 00000001
    TLE Tag Logical Element
    TLE Tag Logical Element
    TLE Tag Logical Element
    BPG Begin Page 00000001
      BAG Begin Active Environment Group
        MCF2 Map Coded Font2
        NOP No Operation
        PGD Page Descriptor
        PTD2 Presentation Text Desc2
      EAG End Active Environment Group
      BCT Begin Composed-Text Block
        PTX Presentation Text Data
      ECT End Composed-Text Block
    EPG End Page
    < next page and so on... >
  ENG End Named Group
  < next BNG - ENG section >
EDT End Document

 

The essential characteristics are the following:

  • The file must contain BNG - ENG pairs.
  • The TLEs must be between the BNG and BPG for each document.
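
The two rules above can be checked mechanically. Here is a minimal Python sketch that works on a list of structured field acronyms, such as the first word of each line in a text dump like the one shown earlier. It does not read raw AFP, and the function name `check_afp_structure` is made up for this example. It assumes that named page groups are not nested and that every TLE should sit between a BNG and the first BPG of that group; a file that breaks those assumptions is reported as a problem.

```python
def check_afp_structure(fields):
    """Return a list of problems found in a sequence of structured field acronyms."""
    problems = []
    in_group = False            # currently between a BNG and its ENG
    seen_page_in_group = False  # a BPG has already appeared in this group
    groups = 0

    for pos, sf in enumerate(fields, start=1):
        if sf == "BNG":
            if in_group:
                problems.append(f"{pos}: BNG before the previous group ended")
            in_group, seen_page_in_group = True, False
            groups += 1
        elif sf == "ENG":
            if not in_group:
                problems.append(f"{pos}: ENG without a matching BNG")
            in_group = False
        elif sf == "BPG":
            if not in_group:
                problems.append(f"{pos}: BPG outside a BNG - ENG pair")
            seen_page_in_group = True
        elif sf == "TLE":
            if not in_group or seen_page_in_group:
                problems.append(f"{pos}: TLE not between BNG and the first BPG")

    if in_group:
        problems.append("end: BNG without a matching ENG")
    if groups == 0:
        problems.append("no BNG - ENG pairs found")
    return problems


if __name__ == "__main__":
    good = ["BDT", "BNG", "TLE", "TLE", "BPG", "EPG", "ENG", "EDT"]
    bad = ["BDT", "BPG", "TLE", "EPG", "EDT"]
    print(check_afp_structure(good))
    print(check_afp_structure(bad))
```

An empty list means the sequence satisfies both rules; anything else is a starting point for a conversation with your AFP producer.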

 

If the file does not have these characteristics, then you must work with your AFP producer to get it into the correct format. I have encountered many cases where our Services department has had to fix badly formed AFP files.

Next time: red wine, or why I love Italy.
