The General Document Processor (GDP) is a tool that allows users to process documents and extract information from them. This feature adds the ability to link a database to specific fields in GDP, so that the extracted data can be matched with values in the database.
- When creating a new field in GDP, select the data type "Calculated".
- Enter the dataset URI. This is the location of the database you want to link to GDP. It can be a public URL or a data:// URI pointing to your super.AI storage.
- Define which fields to match against the data in the dataset. Select the fields in GDP that you want to compare with the values in the database.
- Define the number of candidates to present for human review. This is the number of matches that will be returned for interactive selection. If set to 0, matching will be disabled.
- Define the matching threshold. This is the fraction of the selected fields that must match for automatic assignment, entered as a decimal. For example, a threshold of 0.8 means that 80% of the fields must match for the value to be automatically assigned.
- Define the Output Column Name. This is the name of the column in the output.
- Click "Setup" to link the database to GDP and begin matching values.
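The settings collected in the steps above can be pictured as a small configuration record. The following is only an illustrative sketch: `CalculatedFieldConfig`, its attribute names, and the example URI are hypothetical and are not part of any super.AI API. It also assumes that a "public URL" means an http(s) address and that the threshold lies between 0 and 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CalculatedFieldConfig:
    """Hypothetical container mirroring the setup form; not a super.AI API."""

    dataset_uri: str          # public URL or data:// URI
    match_fields: List[str]   # GDP fields compared against the dataset
    num_candidates: int       # candidates shown for review; 0 disables matching
    threshold: float          # fraction of fields that must match (assumed 0..1)
    output_column: str        # name of the column in the output

    def validate(self) -> None:
        # Assumption: a public URL is an http(s) address.
        if not self.dataset_uri.startswith(("http://", "https://", "data://")):
            raise ValueError("dataset_uri must be a public URL or a data:// URI")
        if not self.match_fields:
            raise ValueError("select at least one field to match")
        if self.num_candidates < 0:
            raise ValueError("num_candidates must be 0 or greater")
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")
        if not self.output_column:
            raise ValueError("output_column must not be empty")

    @property
    def matching_enabled(self) -> bool:
        # Per the steps above, setting the candidate count to 0 disables matching.
        return self.num_candidates > 0


# Example values are invented for illustration only.
config = CalculatedFieldConfig(
    dataset_uri="data://example/vendors.csv",
    match_fields=["vendor", "city", "vat"],
    num_candidates=3,
    threshold=0.8,
    output_column="vendor_id",
)
config.validate()
```

Checking the values before clicking "Setup" in this way catches the most likely mistakes early, such as an empty field selection or a threshold entered as 80 instead of 0.8.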
Once the fields are set up, GDP will look up values in the database and match them to the extracted data from your documents.
If the matching threshold is met, the corresponding value will be automatically assigned to the output field.
If multiple matches are found, up to the configured number of candidates is shown for human review, and the user can select the correct one.
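The behavior described above can be sketched roughly as follows. This is not GDP's actual implementation. The function names are hypothetical, values are compared by exact equality after trimming and lowercasing, and the sketch assumes that a value is assigned automatically only when exactly one row meets the threshold. In every other case, the best-scoring rows are returned as candidates for review.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, str]


def _norm(value: Optional[str]) -> str:
    # Normalize a value for comparison: trim whitespace, ignore case.
    return (value or "").strip().lower()


def match_score(extracted: Row, row: Row, fields: List[str]) -> float:
    """Fraction of the selected fields whose extracted value equals the row's value."""
    if not fields:
        return 0.0
    hits = sum(
        1
        for f in fields
        if _norm(extracted.get(f)) and _norm(extracted.get(f)) == _norm(row.get(f))
    )
    return hits / len(fields)


def resolve(
    extracted: Row,
    dataset: List[Row],
    fields: List[str],
    threshold: float,
    num_candidates: int,
) -> Tuple[Optional[Row], List[Row]]:
    """Return (auto-assigned row or None, candidates for human review)."""
    if num_candidates == 0:
        return None, []  # a candidate count of 0 disables matching
    # Sort by score, highest first; ties keep the dataset order.
    scored = sorted(
        ((match_score(extracted, r, fields), i, r) for i, r in enumerate(dataset)),
        key=lambda t: (-t[0], t[1]),
    )
    qualifying = [r for s, _, r in scored if s >= threshold]
    if len(qualifying) == 1:
        return qualifying[0], []  # unambiguous match: assign automatically
    # Assumption: otherwise show the best partial matches for review.
    candidates = [r for s, _, r in scored if s > 0][:num_candidates]
    return None, candidates


# Invented example data.
dataset = [
    {"vendor": "Acme Corp", "city": "Berlin", "vat": "DE123"},
    {"vendor": "Acme Corp", "city": "Munich", "vat": "DE999"},
    {"vendor": "Globex", "city": "Paris", "vat": "FR456"},
]
fields = ["vendor", "city", "vat"]

# All three fields match the first row, so it is assigned automatically.
assigned, candidates = resolve(
    {"vendor": "acme corp", "city": "Berlin", "vat": "DE123"},
    dataset, fields, threshold=0.8, num_candidates=2,
)

# Only the vendor was extracted: two rows tie below the threshold,
# so both are offered for human review instead.
assigned2, candidates2 = resolve(
    {"vendor": "Acme Corp"}, dataset, fields, threshold=0.8, num_candidates=2,
)
```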
The matching threshold lets you control the level of confidence required for automatic matching. Lowering the threshold increases the chance that a value is matched automatically but also increases the chance of mismatches. Raising the threshold decreases the chance of mismatches but also reduces the chance that a value is matched automatically.
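The trade-off can be illustrated with a toy calculation. The scores below are invented purely for illustration and do not come from GDP. Each pair holds the best match score for one document and whether that best match was actually the right record.

```python
from typing import List, Tuple

# (best match score, best match is actually correct): made-up values
samples: List[Tuple[float, bool]] = [
    (1.0, True),
    (0.8, True),
    (0.8, False),
    (0.6, True),
    (0.6, False),
    (0.4, False),
]


def auto_assign_stats(
    samples: List[Tuple[float, bool]], threshold: float
) -> Tuple[int, int]:
    """Return (number auto-assigned, number of those that are mismatches)."""
    assigned = [(score, ok) for score, ok in samples if score >= threshold]
    mismatches = sum(1 for _, ok in assigned if not ok)
    return len(assigned), mismatches


for t in (0.5, 0.8, 1.0):
    n_assigned, n_wrong = auto_assign_stats(samples, t)
    print(f"threshold={t}: auto-assigned={n_assigned}, mismatches={n_wrong}")
```

With these toy numbers, a threshold of 0.5 auto-assigns five values with two mismatches, 0.8 auto-assigns three with one mismatch, and 1.0 auto-assigns one with none. Everything that is not auto-assigned is left for human review.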
This feature is useful if you have a specific set of values that should be extracted from your documents and you want to ensure that the extracted data matches one of those values. It helps improve extraction quality and automates part of the review process.