How to extract data from tables using custom headers
Extracting data from complex tables and/or from tables from different sources is challenging because they don't follow a consistent structure, format and/or (header) naming conventions. This means that document tables might have diverse formatting styles, such as merged cells, nested tables, and irregular borders, varying numbers of rows and columns as well as heterogenous header names.
Extracting data from tables requires an intelligent document processing platform that uses advanced algorithms, like super.AI's General Document Processor, to recognize and extract data accurately. In the following guide, we show you how you can streamline the naming convention of varying tables in documents and how to extract only specific header columns or key-value pairs from them.
Imagine you have the following table and want to extract the most popular female names in 1922-2021:
Create custom header names
First, you need to assign custom header names. Custom header names are user-defined names given to the columns (horizontal or vertical) in a table. These names can be different from the original column names and have the following benefits:
- Improved clarity: Custom header names can provide more descriptive and meaningful names for the columns in a table, making it easier to understand the data.
- Consistency: Custom header names can ensure that the same naming convention is used across multiple tables or datasets, making it easier to compare and analyze data across different sources.
- Standardization: Custom header names can be used to standardize the naming conventions used in the organization, reducing confusion and errors.
- Simplification: Custom header names can simplify complex tables by making them easier to navigate and read.
- Increased usability: Custom header names can make tables more user-friendly by providing clear and concise names for the columns, improving the overall usability of the data.
You define custom header names in the project settings in the "Table Headers" section:
Apply custom header names
The custom header names can be assigned automatically - but for complex tables, e.g. those that have two headers like the one in the example, you need to manually apply the custom headers.
Now you can download the table in .json format (via UI and API) and extract specific data from it:
Extract specific columns
In case you want to extract whole columns, e.g. all but only popular female names, you chose the corresponding custom header name. Below is a sample script on how to you could do that using Python:
import json field_name = 'custom-1Table0' header_id = 'female_names' with open('./output.json') as output: data = json.load(output) # take the first annotation table_annotation = data['annotations'][field_name]['content'] # find index of column, that was matched with custom header "female_names" column_index = [ i for i, column in enumerate(table_annotation['columns']) \ if 'tableHeaderId' in column and column['tableHeaderId'] == header_id ] # get content of all the cells in this column column_content = [cell['content'] for cell in table_annotation['cells'] \ if cell['columnIndex'] == column_index + 1 and cell['cellType'] == 'Data' ] print('Female names', column_content)
The output then looks like this:
Female names ['Mary', 'Patricia', 'Jennifer', 'Linda', ...]
Extract specific key-value pairs
In case you want to extract whole columns, e.g. the 10th most popular male name, you chose the corresponding custom header name and rank. Below is a sample script on how to you could do that using Python:
import json field_name = 'custom-1Table0' header_name_id = 'female_name' header_rank_id = 'rank' with open('./output.json') as output: data = json.load(output) # take the first annotation table_annotation = data['annotations'][field_name]['content'] # array to store data of custom headers rows = [dict() for x in range(len(table_annotation['rows']))] # iterate over all cells and put data into rows for cell in table_annotation['cells']: column_key = table_annotation['columns'][cell['columnIndex'] - 1].get('tableHeaderId') row_index = cell['rowIndex'] - 1 # filter out non-data cells and cells without matched custom header if column_key != None and cell['cellType'] == 'Data': rows[row_index][column_key] = cell['content'] # find a row with value "10" in column "rank" row = next((row for row in rows if row[header_rank_id] == '10'), None) print('10 rank:', row[header_name_id])
The output then looks like this:
10 rank: Karen
Updated 17 days ago