General Document Processor (GDP)

The General Document Processor (GDP) combines state-of-the-art Optical Character Recognition (OCR) techniques with the latest super.AI deep learning models to understand and extract data from a variety of document types, such as invoices, bills of lading, purchase orders, passports, business cards, or custom documents. Documents can come in various formats and levels of quality, including captured images, scans, and machine-readable PDFs. Human input improves the accuracy per document type over time. The human in the loop (HITL) can be on the customer's side, or customers can leverage certified super.AI personnel.

This article applies to the latest version of the General Document Processor.

General Document Processor overview

  • The GDP offers pre-trained models for a variety of document types; it doesn't require labeling or training to get started
  • Users can extract nested key-value pairs, text, complex tables, and selection marks via the API or the UI
  • Human input improves the accuracy over time per document type
  • You can leverage your own humans or certified super.AI humans

General Document Processor user interface showing a sample document

Key Value Pairs

Key-value pairs are groups of entities within a document that identify a key and its associated value (e.g. date of birth as the key and 2015-04-30 as its value). The super.AI model is trained to extract keys and values from a wide variety of document types, formats, and structures.

Keys can also exist without a corresponding value, e.g. a middle name field may be left blank on a form. Where documents describe the same field in different ways, e.g. Phone Number and Telephone Number, the variants are harmonised onto a single key.
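To make the harmonisation idea concrete, here is a minimal sketch of mapping label variants onto one canonical key. The synonym table and the function names are assumptions made for illustration; they do not describe how the GDP does this internally.

```python
# Hypothetical synonym table: every variant label maps to one canonical key.
# These entries are illustrative assumptions, not the GDP's actual key set.
KEY_SYNONYMS = {
    "phone number": "Phone Number",
    "telephone number": "Phone Number",
    "tel.": "Phone Number",
}


def harmonise_key(raw_key: str) -> str:
    """Return the canonical key for a label, or the cleaned label if unknown."""
    cleaned = " ".join(raw_key.split())  # collapse runs of whitespace
    return KEY_SYNONYMS.get(cleaned.lower(), cleaned)


def harmonise_pairs(pairs):
    """Map (key, value) tuples onto canonical keys; blank values become None."""
    result = {}
    for key, value in pairs:
        canonical = harmonise_key(key)
        value = value.strip() if value else None
        # The first non-empty value seen for a canonical key wins.
        if result.get(canonical) is None:
            result[canonical] = value or None
    return result


pairs = [("Telephone  Number", "555-0100"), ("Middle Name", "")]
print(harmonise_pairs(pairs))
```

A key whose value is blank is kept with `None` rather than dropped, which mirrors the statement that keys can exist without a value.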

Nested key-value pairs can be extracted as well, i.e. a parent key with key-value pairs nested beneath it, e.g. Gender with checkable options for male, female, diverse, or not applicable.

Wherever feasible, values are standardized to ISO formats in the JSON output, e.g. a date is formatted as YYYY-MM-DD.
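As a rough illustration of what standardizing a date to YYYY-MM-DD involves, the sketch below tries a few input layouts and emits the ISO form. The list of layouts and the function name `to_iso_date` are assumptions for this example, not part of the GDP.

```python
from datetime import datetime

# Candidate input layouts, tried in order. This list is an assumption for the sketch.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%B %d, %Y", "%d %b %Y")


def to_iso_date(text: str):
    """Return the date as YYYY-MM-DD, or None if no known layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(to_iso_date("30.04.2015"))
print(to_iso_date("April 30, 2015"))
```

Ambiguous layouts such as 04/05/2015 are deliberately left out of the list, since the day and month order cannot be decided from the text alone.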

Values in (Complex) Tables

The super.AI model has been trained extensively to identify and extract values from the various (complex) tables found in purchase orders, invoices, and similar documents.

Let's consider a few examples to illustrate the concept. In a purchase order, you might find the following key-value pairs:

Key: "Order Number"
Value: "PO123456789"

Key: "Supplier"
Value: "ABC Company"

Key: "Item Description"
Value: "Widget A"

Key: "Quantity"
Value: "100"

Key: "Unit Price"
Value: "$10.99"

Key: "Total Amount"
Value: "$1,099.00"

In this scenario, the model would accurately extract the keys such as "Order Number," "Supplier," "Item Description," "Quantity," "Unit Price," and "Total Amount," along with their respective values.

The model also handles cases where keys exist without corresponding values. For example, if there is a field for "Discount" in the purchase order that is left blank, the model would still recognize and extract the key "Discount" even though it doesn't have an associated value.
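The purchase order above, including the blank "Discount" field, could be held in a structure like the following. This is a sketch only: the flat layout and the helper `missing_values` are assumptions for illustration, not the GDP's documented output schema.

```python
import json

# Illustrative record: the field names mirror the example above, but the
# structure is an assumption and not the GDP's documented output schema.
purchase_order = {
    "Order Number": "PO123456789",
    "Supplier": "ABC Company",
    "Item Description": "Widget A",
    "Quantity": "100",
    "Unit Price": "$10.99",
    "Total Amount": "$1,099.00",
    "Discount": None,  # key found on the document, value left blank
}


def missing_values(record: dict) -> list:
    """List the keys that were detected but carry no value."""
    return [key for key, value in record.items() if value is None]


print(json.dumps(purchase_order, indent=2))  # None is serialized as null
print(missing_values(purchase_order))
```

Keeping the empty key in the record lets downstream code tell "field present but blank" apart from "field not on the document".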

Furthermore, the model can handle nested key-value pairs in tables. For instance:

Key: "Shipping Address"
Nested Key: "Street"
Nested Value: "123 Main Street"
Nested Key: "City"
Nested Value: "New York"
Nested Key: "Postal Code"
Nested Value: "10001"

In this example, the model would identify the parent key "Shipping Address" and extract its nested key-value pairs, including "Street," "City," and "Postal Code," along with their respective values.
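One way to picture the "Shipping Address" example is as a nested mapping, which can be flattened into dotted paths when a single-level record is needed. The nested layout and the `flatten` helper are assumptions for this sketch rather than the GDP's actual output format.

```python
def flatten(record: dict, parent: str = "", sep: str = ".") -> dict:
    """Flatten nested dicts into a single level with dotted key paths."""
    flat = {}
    for key, value in record.items():
        path = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, path, sep))  # recurse into the nested pairs
        else:
            flat[path] = value
    return flat


# Hypothetical nested representation of the example above.
extracted = {
    "Shipping Address": {
        "Street": "123 Main Street",
        "City": "New York",
        "Postal Code": "10001",
    }
}

for path, value in flatten(extracted).items():
    print(f"{path}: {value}")
```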

To ensure standardized data, the model can format values according to relevant standards. For instance, it can format dates using the ISO standard (YYYY-MM-DD) or standardize currency values to a specific currency format (e.g., "$1,099.00").
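For the currency case, a small sketch shows how US-style amounts such as "$1,099.00" can be parsed and re-emitted in one consistent format. The helpers `parse_usd` and `format_usd` are hypothetical, and restricting the example to US dollars is an assumption; it also cross-checks the purchase order figures used earlier.

```python
from decimal import Decimal


def parse_usd(text: str) -> Decimal:
    """Parse a US-style amount such as "$1,099.00" into a Decimal."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


def format_usd(amount: Decimal) -> str:
    """Format a Decimal as dollars with thousands separators and two decimals."""
    return f"${amount:,.2f}"


# Quantity times unit price should reproduce the stated total.
quantity = 100
unit_price = parse_usd("$10.99")
total = unit_price * quantity
print(format_usd(total))
```

`Decimal` is used instead of `float` so that amounts like 10.99 are represented exactly.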

By leveraging the capabilities of the super.AI model, businesses can efficiently extract valuable information from complex tables, automate data processing, and streamline processes.

Input requirements

  • Image quality: garbage in, garbage out. The best results are achieved with a sharp, undistorted scanned image or a machine-readable document.
  • Max. number of pages: PDFs of up to 100 pages can be processed (if you have a larger document, please contact [email protected])
  • Max. file size: the file must be smaller than 50 MB (if you have a larger document, please contact [email protected])
  • Image dimensions: between 50 x 50 pixels and 10,000 x 10,000 pixels.
  • PDF dimensions: up to 17 x 17 inches, corresponding to Legal or A3 paper size, or smaller.
  • Font size: the minimum height of the text to be extracted is 12 pixels for a 1024 x 768 pixel image. This dimension corresponds to about 8-point text at 150 dots per inch (DPI).
  • Password-locked PDFs can't be processed; remove the lock before uploading
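The limits above lend themselves to a pre-upload check. The following is a minimal sketch under stated assumptions: `UploadInfo` and `check_upload` are hypothetical names, the caller is expected to have measured the file already, and "50 MB" is read as 50 x 1024 x 1024 bytes.

```python
from dataclasses import dataclass

MAX_FILE_BYTES = 50 * 1024 * 1024  # "less than 50 MB", read here as MiB (assumption)
MAX_PDF_PAGES = 100
MIN_IMAGE_SIDE_PX = 50
MAX_IMAGE_SIDE_PX = 10_000
MAX_PDF_SIDE_INCHES = 17.0


@dataclass
class UploadInfo:
    """Hypothetical description of a file, gathered by the caller beforehand."""
    kind: str                # "pdf" or "image"
    size_bytes: int
    pages: int = 1
    width: float = 0.0       # pixels for images, inches for PDFs
    height: float = 0.0
    password_protected: bool = False


def check_upload(info: UploadInfo) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if info.size_bytes >= MAX_FILE_BYTES:
        problems.append("file must be smaller than 50 MB")
    if info.kind == "pdf":
        if info.pages > MAX_PDF_PAGES:
            problems.append("PDF has more than 100 pages")
        if max(info.width, info.height) > MAX_PDF_SIDE_INCHES:
            problems.append("PDF page exceeds 17 x 17 inches")
        if info.password_protected:
            problems.append("remove the PDF password lock before uploading")
    elif info.kind == "image":
        for side in (info.width, info.height):
            if not MIN_IMAGE_SIDE_PX <= side <= MAX_IMAGE_SIDE_PX:
                problems.append("image sides must be between 50 and 10,000 pixels")
                break
    return problems


print(check_upload(UploadInfo("pdf", 1_000_000, pages=12, width=8.5, height=11.0)))
```

The sketch does not cover image sharpness or minimum font size, which cannot be judged from file metadata alone.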

Supported file formats

Application | PDF (scanned) | PDF (machine readable) | Image (JPEG, PNG, BMP, and TIFF)
General Document Processor | ✔ | ✔ | ✔

Data Extraction

Application | Text | Tables | Selection Marks | Nested Key-Value Pairs | Stamps | Signatures
General Document Processor | ✔ | ✔ | ✔ | ✔ | ✔ | ✔

Released Q4'2022