Creating document extractors
Use a document extractor to define and extract specific fields from rich document formats such as PDFs, Microsoft Word documents, and images.
You can configure a document extractor to identify and extract a set of fields across sample documents, for example, the fields “Agreement Number” and “Agreement Parties” from contract documents. To refine these fields, you can provide a natural language description of what each field represents and include examples of how the fields appear in actual documents.
Powered by a large language model (LLM), the configured document extractor identifies and extracts instances of the fields and highlights these fields directly within the document for easy review. The fields are not specific to one document but are found across all of your example documents.
Document extractors support documents in English, French, and English with handwriting.
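The field configuration described above (a name, a natural language description, and examples) can be drafted before you open Skill studio. The following is a minimal planning sketch only: `FieldPlan`, its attributes, the `"text"` data type label, and the sample value `AGR-0001` are hypothetical names and placeholders chosen for illustration, not a schema or API of the product.

```python
from dataclasses import dataclass, field


@dataclass
class FieldPlan:
    """Planning note for one extractor field (hypothetical structure, not a product schema)."""

    name: str
    description: str = ""
    data_type: str = "text"  # assumed label; choose the real data type in the UI
    examples: list[str] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """Return the refinements that are still missing for this field."""
        missing = []
        if not self.description.strip():
            missing.append("description")
        if not self.examples:
            missing.append("examples")
        return missing


# Field names come from the contract example in the text; the description
# and the example value are made-up placeholders.
plan = [
    FieldPlan(
        name="Agreement Number",
        description="Identifier assigned to the contract.",
        examples=["AGR-0001"],
    ),
    FieldPlan(name="Agreement Parties"),
]

for item in plan:
    print(item.name, "->", item.gaps() or "ready to enter")
```

A list like this makes it easy to see which fields still need a description or examples before you enter them in the Fields panel.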
Creating a document extractor skill
- From the menu, select Skill studio.
- Click Create and select Project.
- From the New project screen, name your project, describe your project, and click Create.
- Select the Document processor skill type.
- Select Document extractor.
- Name and describe your skill, and click Create.
Uploading documents and configuring fields
- Add up to 5 initial documents in the upload area. Supported formats are PDF, DOCX, PPTX, JPG, PNG, and TIFF. Each document must not exceed a size of 10 MB.
  It might take some time to upload the documents. The more pages a document has, the longer it takes to process.
- In the skill toolbar, use the dropdown list to select the language of your documents if necessary. You can extract fields from documents in English, French, and English with handwriting.
- To upload more documents, delete existing documents, or monitor the upload status, click the dropdown menu at the top of the displayed document and select Manage documents.
  Figure 1: Manage documents dropdown
  Figure 2: Manage documents window with list of uploaded documents
  You can see uploaded documents in the left panel, where you can navigate between pages, zoom in and out, and switch between documents with the dropdown list.
- In the Fields panel, click Add field and enter a field name, for example "Agreement number".
  The document extractor searches for values that are related to this specific field.
- If no values are found or if you need to refine the field, click View field details to edit the field.
- Provide a description of the field, and update the data type if needed. Add examples to help the LLM provide the expected output. Click Show on document to get the results.
  Figure 3: Edit fields details for sample documents
- When you're happy with the results, go back to the Fields page. The changes are automatically updated and you can see the extracted values.
- Repeat the previous steps to define additional fields. You can switch between documents to check how well the tool identifies the desired values across different files.
Figure 4: Extracted fields in a document extractor
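The upload limits in the first step (up to 5 initial documents, six supported formats, 10 MB per document) can be checked locally before you start an upload. The following is a minimal sketch, not part of the product: `check_documents` is a hypothetical helper, the format check looks only at the file extension, and "10 MB" is assumed to mean 10 × 1024 × 1024 bytes. Extension variants that the text does not list, such as `.jpeg` or `.tif`, are reported as unsupported here.

```python
import sys
from pathlib import Path

# Limits taken from the upload step in this section.
MAX_INITIAL_DOCUMENTS = 5
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".jpg", ".png", ".tiff"}
# Assumption: "10 MB" is interpreted as 10 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 10 * 1024 * 1024


def check_documents(paths):
    """Return a list of problems; an empty list means no limit is violated."""
    problems = []
    if len(paths) > MAX_INITIAL_DOCUMENTS:
        problems.append(
            f"{len(paths)} documents selected; "
            f"the initial upload accepts up to {MAX_INITIAL_DOCUMENTS}"
        )
    for path in map(Path, paths):
        if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            problems.append(f"{path.name}: unsupported format")
        elif not path.is_file():
            problems.append(f"{path.name}: file not found")
        elif path.stat().st_size > MAX_SIZE_BYTES:
            problems.append(f"{path.name}: larger than 10 MB")
    return problems


if __name__ == "__main__":
    # Usage: python check_documents.py contract1.pdf scan.png ...
    for problem in check_documents(sys.argv[1:]):
        print(problem)
```

The check only covers the limits stated in this section. It does not predict processing time, which depends on the number of pages in each document.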
Editing document extractor details
From the menu, click Edit details.
You can update the following details:
- Name: The display name of your skill.
- Operation ID: A unique identifier for a skill in the system.
- Description: The description of your skill.
Deleting a document extractor skill
From the menu, click Delete this skill.
- When you delete a document extractor, the activities, skills, and assistants that use the document extractor might no longer work.
- After deletion, make sure to complete the following actions:
  - Manually replace or remove the document extractor from activities, skills, and assistants that use it.
  - If the document extractor was previously shared, share your changes after the deletion so that your collaborators also receive the deletion.
  - If the document extractor was published to the skill catalog, delete it from the skill catalog. To do so, go to Skill studio and delete the skill.
What to do next
When you're finished building your document extractor, you can:
- Click the icon to make your skill public. Only public skills can be published to the skill catalog.
- Share your changes to make them available to your collaborators.
- Use the skill in workflows. For more information, see Creating workflows and Configuring document processors.