Extracting data from complex tables and/or from tables from different sources is challenging because they don't follow a consistent structure, format and/or (header) naming conventions. This means that document tables might have diverse formatting styles, such as merged cells, nested tables, and irregular borders, varying numbers of rows and columns as well as heterogenous header names.

Extracting data from tables requires an intelligent document processing platform that uses advanced algorithms, like super.AI's General Document Processor, to recognize and extract data accurately. In the following guide, we show you how you can streamline the naming convention of varying tables in documents and how to extract only specific header columns or key-value pairs from them.

Example

Imagine you have the following table and want to extract the most popular female names in 1922-2021:

1082 — Table of popular names for births from 1922 to 2021

Create custom header names

First, you need to assign custom header names. Custom header names are user-defined names given to the columns (horizontal or vertical) in a table. These names can be different from the original column names and have the following benefits:

Improved clarity: Custom header names can provide more descriptive and meaningful names for the columns in a table, making it easier to understand the data.
Consistency: Custom header names can ensure that the same naming convention is used across multiple tables or datasets, making it easier to compare and analyze data across different sources.
Standardization: Custom header names can be used to standardize the naming conventions used in the organization, reducing confusion and errors.
Simplification: Custom header names can simplify complex tables by making them easier to navigate and read.
Increased usability: Custom header names can make tables more user-friendly by providing clear and concise names for the columns, improving the overall usability of the data.

You define custom header names in the project settings in the "Table Headers" section:

2848 — Define custom header names in the project settings.

Apply custom header names

The custom header names can be assigned automatically - but for complex tables, e.g. those that have two headers like the one in the example, you need to manually apply the custom headers.

1592 — Delete the first header row and apply custom header for female & male names.

Now you can download the table in .json format (via UI and API) and extract specific data from it:

Extract specific columns

In case you want to extract whole columns, e.g. all but only popular female names, you chose the corresponding custom header name. Below is a sample script on how to you could do that using Python:

import json

field_name = 'custom-1Table0'
header_id = 'female_names'

with open('./output.json') as output:
    data = json.load(output)

    # take the first annotation
    table_annotation = data['annotations'][field_name][0]['content']

    # find index of column, that was matched with custom header "female_names"
    column_index = [
        i for i, column in enumerate(table_annotation['columns']) \
            if 'tableHeaderId' in column and column['tableHeaderId'] == header_id
    ][0]

    # get content of all the cells in this column
    column_content = [cell['content'] for cell in table_annotation['cells'] \
        if cell['columnIndex'] == column_index + 1 and cell['cellType'] == 'Data'
    ]

    print('Female names', column_content)

The output then looks like this:

Female names ['Mary', 'Patricia', 'Jennifer', 'Linda', ...]

Extract specific key-value pairs

In case you want to extract whole columns, e.g. the 10th most popular male name, you chose the corresponding custom header name and rank. Below is a sample script on how to you could do that using Python:

import json

field_name = 'custom-1Table0'
header_name_id = 'female_name'
header_rank_id = 'rank'

with open('./output.json') as output:
    data = json.load(output)

    # take the first annotation
    table_annotation = data['annotations'][field_name][0]['content']

    # array to store data of custom headers
    rows = [dict() for x in range(len(table_annotation['rows']))]

    # iterate over all cells and put data into rows
    for cell in table_annotation['cells']:
        column_key = table_annotation['columns'][cell['columnIndex'] - 1].get('tableHeaderId')
        row_index = cell['rowIndex'] - 1

        # filter out non-data cells and cells without matched custom header
        if column_key != None and cell['cellType'] == 'Data':
            rows[row_index][column_key] = cell['content']

    # find a row with value "10" in column "rank"
    row = next((row for row in rows if row[header_rank_id] == '10'), None)

    print('10 rank:', row[header_name_id])

The output then looks like this:

10 rank: Karen

Current Limitations

Tables with headers in a column (”vertical tables”) are not automatically detected