How to extract field names with Python via the API
- This notebook is a quick introduction to using the super.AI API to extract annotations from a job and transform them into a custom format.
Getting Started with Super.AI API
- We will be interacting with the super.AI API
- We also need two open-source libraries, requests and pandas, for the API requests and data transformations respectively.
Setting up the API client
Before you can parse job responses from super.AI, you need a client that sends requests to the API and returns the responses. The steps below create and configure one.
- Create an api.py file and paste the following code:
    import requests


    class APIClient:
        def __init__(self, base_url: str, api_key: str):
            self.base_url = base_url
            self.session = requests.Session()
            self.session.headers.update({
                "API-KEY": f"{api_key}",
                "Content-Type": "application/json"
            })

        def _make_api_call(self, endpoint: str, method="GET", params=None, data=None):
            """
            Generalizes API calls to handle various methods and endpoints.
            :param endpoint: The API endpoint (relative to base_url)
            :param method: HTTP method (GET, POST, etc.)
            :param params: URL parameters (optional)
            :param data: Request payload (optional)
            :return: JSON response if successful, None if unsuccessful
            """
            url = f"{self.base_url}{endpoint}"
            try:
                if method.upper() == "GET":
                    response = self.session.get(url, params=params)
                elif method.upper() == "POST":
                    response = self.session.post(url, json=data)
                elif method.upper() == "PUT":
                    response = self.session.put(url, json=data)
                elif method.upper() == "DELETE":
                    response = self.session.delete(url, json=data)
                else:
                    raise ValueError(f"Unsupported HTTP method: {method}")
                response.raise_for_status()  # Raise an HTTPError for bad responses
                return response.json()  # Return the JSON response
            except requests.exceptions.RequestException as e:
                print(f"Error making API call: {e}")
                return None

        def get_job_response(self, job_id: str):
            """
            Retrieves job response from the API.
            :param job_id: Job ID
            :return: JSON response or None if failed
            """
            endpoint = f"/v1/jobs/{job_id}/response"
            return self._make_api_call(endpoint)

        def get_app_schema(self, app_id: str):
            """
            Retrieves app schema from the API.
            :param app_id: Application ID
            :return: JSON response or None if failed
            """
            endpoint = f"/v1/apps/{app_id}"
            return self._make_api_call(endpoint)
This is a general-purpose API client that we will use to communicate with the super.AI platform.
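The client builds each URL by concatenating base_url and the endpoint, so a base URL with a trailing slash would produce a double slash. A minimal sketch of a normalizing helper follows; `join_url` is a hypothetical name and is not part of the client above.

```python
def join_url(base_url: str, endpoint: str) -> str:
    """Join a base URL and an endpoint with exactly one slash between them."""
    return f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}"


# Both spellings of the base URL resolve to the same job-response URL.
print(join_url("https://api.super.ai", "/v1/jobs/JOB_ID/response"))
print(join_url("https://api.super.ai/", "v1/jobs/JOB_ID/response"))
```

If you adopt something like this, it would replace the f-string that builds `url` inside `_make_api_call`.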
- Import the APIClient from the api.py file in your main Python program (e.g. main.py):
    from api import APIClient
- Configure the client (main.py), making sure to enter your API key:
    api_client = APIClient(base_url="https://api.super.ai", api_key="YOUR_API_KEY")

Fetching the relevant data from Super.AI
We will need to call the super.AI API to get the result of a processed job and the output schema, which are necessary for the final transformation.
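One caveat for the fetch steps in this section: the client returns None when a request fails, so chained `.get()` calls on the result would raise an AttributeError in that case. A minimal sketch of a tolerant nested lookup follows; `dig` is a hypothetical helper, and the sample payload (including the `field_1` id) is invented for illustration.

```python
from typing import Any, Optional


def dig(payload: Optional[dict], *keys: str, default: Any = None) -> Any:
    """Walk nested dictionaries, returning `default` if any level is missing."""
    current: Any = payload
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# A failed request (None) and a successful one go through the same call.
failed_response = None
ok_response = {"response": {"annotations": {"field_1": [{"content": "14163-5"}]}}}

print(dig(failed_response, "response", "annotations", default={}))
print(dig(ok_response, "response", "annotations", default={}))
```

With a helper like this, a failed fetch yields an empty dictionary instead of an exception, which the transformation step can treat as "no annotations".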
- Use the API client you just created to fetch an individual job response (main.py), replacing JOB_ID with a relevant job ID:
    job_response = api_client.get_job_response(job_id="JOB_ID")
- Use the client to fetch the job output schema needed for the transformation (main.py), replacing APP_ID with a relevant app ID:
    app_schema = api_client.get_app_schema(app_id="APP_ID")
- Now we need the annotations and the annotation schema from the data we have. Extract the annotations from the job response (main.py):
    annotations = job_response.get("response", {}).get("annotations", {})
- Extract the annotation schema from the app schema (main.py):
    annotations_schema = app_schema.get("outputSchema", {}).get("definitions", {}).get("AnnotationModel", {}).get(
        "properties", {})

Transforming Data
To transform the data, we will create some handy utility functions in a new file called util.py.
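Conceptually, the transformation pairs each annotation id with the title the schema gives that id. The sketch below shows that idea in plain Python for simple (non-table) fields only; the field ids and the function name `map_titles_to_values` are invented for illustration, and the input shapes are assumed from the lookups used in this guide.

```python
# Hypothetical inputs: ids, titles, and values are invented for illustration.
annotations = {
    "field_a": [{"content": "14163-5"}],
    "field_b": [{"content": "30.9.2022"}],
    "field_c": [],  # nothing was extracted for this field
}
annotations_schema = {
    "field_a": {"title": "Invoice Number"},
    "field_b": {"title": "Invoice Date"},
    "field_c": {"title": "Due Date"},
}


def map_titles_to_values(schema: dict, annotations: dict) -> dict:
    """Return {title: [values]} for every field that has at least one value."""
    result = {}
    for field_id, items in annotations.items():
        title = schema.get(field_id, {}).get("title")
        values = [item["content"] for item in items if item.get("content")]
        if title and values:
            result[title] = values
    return result


print(map_titles_to_values(annotations_schema, annotations))
```

The utility class in this section follows the same pairing, and additionally handles table-shaped content.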
- Create the util.py file and paste the following:
    import pandas as pd
    from itertools import chain
    from collections.abc import Iterable


    class DataTransformationUtils:
        @staticmethod
        def _merge_arrays_by_id(array1, array2):
            """
            Merges two arrays of objects by their `id` property.
            :param array1: First array of objects.
            :param array2: Second array of objects.
            :return: Merged list of objects based on `id`.
            """
            # An empty input has no `id` column to merge on, so there is nothing to merge
            if not array1 or not array2:
                return []
            df1 = pd.DataFrame(array1)
            df2 = pd.DataFrame(array2)
            # Left-merge on `id`: every row of array1 is kept, with array2 columns where the id matches
            merged = pd.merge(df1, df2, on="id", how="left")
            return merged.to_dict(orient="records")

        @staticmethod
        def _transform_to_table(cells_array):
            """
            Transforms a cell-based structure into a table-like array of objects.
            :param cells_array: The array containing cell-based data with row and column info.
            :return: Transformed data as a list of dictionaries representing the table.
            """
            is_horizontal = not cells_array.get("orientation") or cells_array["orientation"] == "horizontal"
            cells = cells_array["cells"]
            # Create a DataFrame from the cells
            df = pd.DataFrame(cells)
            # Create headers from the first row (horizontal) or first column (vertical)
            if is_horizontal:
                headers = df[df["rowIndex"] == 1].set_index("columnIndex")["content"].to_dict()
            else:
                headers = df[df["columnIndex"] == 1].set_index("rowIndex")["content"].to_dict()
            # Filter out the header cells; copy so the assignment below does not write into a view of df
            data_cells = df[
                (df["rowIndex"] != 1 if is_horizontal else df["columnIndex"] != 1)
            ].copy()
            # Attach the matching header to every data cell
            data_cells["header"] = data_cells.apply(
                lambda x: headers.get(x["columnIndex"] if is_horizontal else x["rowIndex"], None),
                axis=1,
            )
            # Pivot data into a table format
            table = data_cells.pivot_table(
                index=(data_cells["rowIndex"] - 2) if is_horizontal else (data_cells["columnIndex"] - 2),
                columns="header",
                values="content",
                aggfunc="first",
            ).reset_index(drop=True)
            return table.to_dict(orient="records")

        @staticmethod
        def _transform_job_annotations(obj):
            """
            Transforms annotations into a standardized format.
            :param obj: Dictionary of annotations with `id` as the key.
            :return: List of standardized annotation dictionaries.
            """
            annotations = [
                {
                    "id": key,
                    "value": list(
                        chain.from_iterable(
                            DataTransformationUtils._transform_to_table(item["content"])
                            # Only dict content can hold a table; a plain string containing "cells" must not match
                            if isinstance(item["content"], dict) and "cells" in item["content"]
                            else [item["content"]]
                            for item in annotation
                            if item.get("content") and isinstance(item["content"], Iterable) and not isinstance(
                                item["content"], int)
                        )
                    ),
                }
                for key, annotation in obj.items()
                if isinstance(annotation, list) and annotation
            ]
            # Drop fields for which nothing was extracted
            return [ann for ann in annotations if ann["value"]]

        @staticmethod
        def _transform_job_schema(obj):
            """
            Transforms schema fields into a simplified format.
            :param obj: Dictionary containing schema fields.
            :return: List of dictionaries with `id` and `title` properties.
            """
            return [
                {"id": key, "title": field["title"]}
                for key, field in obj.items()
                if field.get("title")
            ]

        @classmethod
        def process_annotations(cls, annotations_schema, annotations):
            """
            Processes annotations using internal transformation methods and merges them with schema.
            :param annotations_schema: Dictionary containing schema fields.
            :param annotations: Dictionary of annotations with `id` as the key.
            :return: Dictionary mapping schema titles to corresponding annotation values.
            """
            schema = cls._transform_job_schema(annotations_schema) if annotations_schema else []
            data = cls._transform_job_annotations(annotations) if annotations else []
            result = {
                item["title"]: item["value"]
                for item in cls._merge_arrays_by_id(data, schema)
            }
            return result
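To make the table step easier to follow, here is a small worked example of the same idea for a horizontal table, written in plain Python rather than pandas. It is a sketch, not a drop-in replacement for `_transform_to_table`: the function name `cells_to_rows` and the sample cells are invented, and it assumes 1-based `rowIndex` and `columnIndex` values with the headers in row 1, as the utility code above does.

```python
def cells_to_rows(cells: list) -> list:
    """Turn header-row cells plus data cells into one dict per data row."""
    # Row 1 holds the headers: map each column index to its header text.
    headers = {c["columnIndex"]: c["content"] for c in cells if c["rowIndex"] == 1}
    rows = {}
    for cell in cells:
        if cell["rowIndex"] == 1:
            continue  # skip the header row itself
        header = headers.get(cell["columnIndex"])
        if header is not None:
            rows.setdefault(cell["rowIndex"], {})[header] = cell["content"]
    # Emit the rows in order of their original row index.
    return [rows[index] for index in sorted(rows)]


# Invented cells for a two-column table with a header row and two data rows.
cells = [
    {"rowIndex": 1, "columnIndex": 1, "content": "Id"},
    {"rowIndex": 1, "columnIndex": 2, "content": "Quantity"},
    {"rowIndex": 2, "columnIndex": 1, "content": "ZU781298"},
    {"rowIndex": 2, "columnIndex": 2, "content": "3"},
    {"rowIndex": 3, "columnIndex": 1, "content": "ZU781432"},
    {"rowIndex": 3, "columnIndex": 2, "content": "5"},
]

print(cells_to_rows(cells))
```

Each data row becomes one dictionary keyed by the header text, which is the shape the table field takes in the final result.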
- Import the DataTransformationUtils class in your main.py file and apply the transformations:
    from util import DataTransformationUtils
    result = DataTransformationUtils.process_annotations(annotations_schema, annotations)
- Print the result:
    print(result)
The result is a dictionary of key/value pairs: each key is a field name, and each value is the list of values extracted for that field.
Example output:
    {
        "Invoice Date": ["30.9.2022"],
        "Invoice Number": ["14163-5"],
        "PO number": ["PO-57392018"],
        "Consignor Name": ["Tyrell Inc."],
        "Consignor Address": ["Mill Rd, Worthing BN11 4GU, UK"],
        "Supplier Tax ID": ["54-1234567"],
        "Consignee Name": ["Monster GmbH"],
        "Consignee Address": ["Teststr. 13, 20100 Hamburg, Germany"],
        "Receiver Tax ID": ["98-7654321"],
        "Delivery Date": ["3.10.2023"],
        "Due Date": ["3.11.2023"],
        "Subtotal": ["USD 9050.00"],
        "Total Tax": ["USD 100.00"],
        "Total Amount": ["USD 10050.00"],
        "Shipped items list": [
            {
                "Description": "Lorem ipsum dolor sit\namet, consectetur\nipsum",
                "Id": "ZU781298",
                "Quantity": "3",
                "Total (USD)": "300",
                "Unit Price (USD)": "100"
            },
            {
                "Description": "Sed ut perspiciatis unde\nomnis iste natus",
                "Id": "ZU781432",
                "Quantity": "5",
                "Total (USD)": "1000",
                "Unit Price (USD)": "200"
            },
            {
                "Description": "Ut enim ad minima\nveniam, quis nostrum\nexercitationem",
                "Id": "ZU781753",
                "Quantity": "10",
                "Total (USD)": "2500",
                "Unit Price (USD)": "250"
            }
        ]
    }
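Since every value in the result is a list, a custom format will often start by unwrapping the single-value fields. A minimal sketch of that post-processing follows, assuming a result shaped like the example output; `unwrap_single_values` is a hypothetical helper, not part of the utilities in this guide.

```python
import json


def unwrap_single_values(result: dict) -> dict:
    """Replace one-element lists of scalars with the scalar itself."""
    flat = {}
    for title, values in result.items():
        is_single_scalar = len(values) == 1 and not isinstance(values[0], dict)
        flat[title] = values[0] if is_single_scalar else values
    return flat


# A trimmed-down result in the same shape as the example output.
result = {
    "Invoice Number": ["14163-5"],
    "Shipped items list": [{"Id": "ZU781298", "Quantity": "3"}],
}

print(json.dumps(unwrap_single_values(result), indent=2))
```

Table fields stay as lists of row dictionaries even when they hold a single row, so downstream code can always iterate over them.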
