How to Build a JSON Extraction Schema

Document processors are at the heart of the Anesya system. They analyze, extract, and structure the information contained in your documents (invoices, contracts, forms…). This guide explains how to write a JSON schema, compatible with OpenAI JSON, that defines the exact structure of the output data you expect after extraction.


Overall Schema Structure

A valid schema always starts with a root object like this:

{
  "type": "object",
  "required": [...], // Array of your field names
  "properties": { ... } // Place your properties inside
}
  • type: "object": this is mandatory at the root.
  • required: a list of fields that must appear.
  • properties: a dictionary of fields to extract, with their name, type, and description.

Supported Field Types

string

Free text string.

"customer_name": {
  "type": "string",
  "description": "Customer name"
}

number

Decimal number (can include cents, etc.).

"amount": {
  "type": "number",
  "description": "Invoice amount"
}

integer

Whole number without decimals.

"quantity": {
  "type": "integer",
  "description": "Quantity ordered"
}

boolean

Boolean value (true or false).

"is_signed": {
  "type": "boolean",
  "description": "Is the document signed?"
}

enum

Predefined list of values (closed set of choices).

"status": {
  "type": "enum",
  "enum": ["pending", "approved", "rejected"],
  "description": "Invoice status",
  "context:descriptions": [
    "Pending approval",
    "Invoice approved",
    "Invoice rejected"
  ]
}

Tip: context:descriptions is optional but useful to give more context to each value.
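
The example above lists one description per enum value, in the same order. Assuming the two lists are matched by position (which the example suggests but this guide does not state), a short sketch like the following can confirm they line up before you submit the schema. The function name enum_descriptions is hypothetical.

```python
def enum_descriptions(field: dict) -> dict:
    """Pair each enum value with its description, by position (assumed)."""
    values = field["enum"]
    descriptions = field.get("context:descriptions")
    if descriptions is None:
        # context:descriptions is optional, so no pairing is needed.
        return {value: None for value in values}
    if len(descriptions) != len(values):
        raise ValueError("context:descriptions needs one entry per enum value")
    return dict(zip(values, descriptions))


status = {
    "type": "enum",
    "enum": ["pending", "approved", "rejected"],
    "description": "Invoice status",
    "context:descriptions": [
        "Pending approval",
        "Invoice approved",
        "Invoice rejected",
    ],
}
pairs = enum_descriptions(status)
print(pairs)
```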


object

Sub-object containing its own fields.

"invoice_amount": {
  "type": "object",
  "required": ["amount", "iso_4217_currency_code"],
  "properties": {
    "amount": {
      "type": "number",
      "description": "Amount"
    },
    "iso_4217_currency_code": {
      "type": "string",
      "description": "ISO currency code (e.g. EUR, USD)"
    }
  },
  "description": "Invoice amount"
}

array

List of items, typically objects or strings.

Example with an array of objects:

"orders": {
  "type": "array",
  "description": "List of orders",
  "items": {
    "type": "object",
    "required": ["order_id", "customer_name"],
    "properties": {
      "order_id": {
        "type": "string",
        "description": "Order identifier"
      },
      "customer_name": {
        "type": "string",
        "description": "Customer name"
      }
    }
  }
}

Common Mistakes to Avoid

❌ Mistake → ✅ Solution

  • Missing or invalid type → Make sure each field has a valid type (among those listed above)
  • Missing description → Each field must have a clear description
  • Mismatch in required → All fields listed in required must exist in properties
  • Malformed object or array → Objects must have properties and required, arrays must have items
  • Duplicates or typos → Example: invoice_amoun2t instead of invoice_amount
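
The mistakes listed above can be caught locally with a small recursive check. This is only a sketch of those rules, not Anesya's actual validator: the name lint_field and the message wording are hypothetical, the type is assumed to be a single string, and the root object and array items are allowed to omit description because the examples in this guide do so.

```python
VALID_TYPES = {"string", "number", "integer", "boolean", "enum", "object", "array"}


def lint_field(name: str, field: dict, path: str = "root") -> list[str]:
    """Return the problems found in one field definition and its children."""
    here = f"{path}.{name}" if name else path
    problems = []
    field_type = field.get("type")
    # Assumption: type is a single string; other shapes are reported as invalid.
    if not isinstance(field_type, str) or field_type not in VALID_TYPES:
        problems.append(f"{here}: missing or invalid type")
    # Unnamed nodes (the root, array items) are not required to be described.
    if name and not field.get("description"):
        problems.append(f"{here}: missing description")
    if field_type == "object":
        properties = field.get("properties")
        required = field.get("required")
        if not isinstance(properties, dict) or not isinstance(required, list):
            problems.append(f"{here}: object needs properties and required")
        else:
            for key in required:
                if key not in properties:
                    problems.append(f"{here}: required field '{key}' not in properties")
            for key, child in properties.items():
                problems.extend(lint_field(key, child, here))
    elif field_type == "array":
        items = field.get("items")
        if not isinstance(items, dict):
            problems.append(f"{here}: array needs items")
        else:
            problems.extend(lint_field("", items, f"{here}[]"))
    return problems


# A deliberately broken schema: a typo in a field name and an unknown type.
broken = {
    "type": "object",
    "required": ["invoice_amount"],
    "properties": {
        "invoice_amoun2t": {"type": "number", "description": "Invoice amount"},
        "note": {"type": "text"},
    },
}
problems = lint_field("", broken)
for problem in problems:
    print(problem)
```

The typo is reported indirectly: invoice_amount is listed in required but only invoice_amoun2t exists in properties, so the mismatch check flags it.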

Testing Your Schema

Before submitting a schema, you can test it in our interface or with our API.

If an error is detected, you’ll see a message like:

❌ Invalid schema: root.status.type must be string or list
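
Before using the interface or the API, a purely local syntax check can catch malformed JSON such as leftover // comments or trailing commas. The sketch below relies only on Python's standard json module and does not call any Anesya endpoint; the function name check_json_syntax is hypothetical, and its messages are unrelated to the platform's own error format.

```python
import json


def check_json_syntax(text: str) -> str:
    """Report whether the text parses as strict JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
    return "OK"


print(check_json_syntax('{"type": "object", "required": [], "properties": {}}'))
print(check_json_syntax('{"type": "object", // comments are not valid JSON\n}'))
```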

🧪 Schema Examples

E-commerce Invoice

Context: You want to extract the following information from an e-commerce invoice:

  • The order number
  • The buyer’s first and last name
  • The items included in the invoice, with their price and quantity

Here’s how you should design your schema:

  1. Order Number
    → Add a field of type string named order_number (if your order numbers never contain letters, you may use number instead)

  2. Buyer Information
    → Add a field of type object named customer_information
    → Inside this object, define two string fields: first_name and last_name

  3. Invoice Items
    → Add a field of type array named items
    → The items array will contain multiple object entries
    → Each object will include two number fields: price and quantity


{
    "type": "object",
    "required": [
        "order_number",
        "customer_information",
        "items"
    ],
    "properties": {
        "order_number": {
            "type": "string",
            "description": "The unique number assigned to the order"
        },
        "customer_information": {
            "type": "object",
            "required": [
                "first_name",
                "last_name"
            ],
            "properties": {
                "first_name": {
                    "type": "string",
                    "description": "The buyer's first name"
                },
                "last_name": {
                    "type": "string",
                    "description": "The buyer's last name"
                }
            },
            "description": "Information about the buyer"
        },
        "items": {
            "type": "array",
            "description": "List of items included in the invoice",
            "items": {
                "type": "object",
                "required": [
                    "price",
                    "quantity"
                ],
                "properties": {
                    "price": {
                        "type": "number",
                        "description": "Price of the item"
                    },
                    "quantity": {
                        "type": "number",
                        "description": "Quantity of the item ordered"
                    }
                }
            }
        }
    }
}
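
To make the target shape concrete, here is a hypothetical result that matches this schema, together with a small check that the required keys are present. All values are invented for illustration, and the helper missing_required is not part of any Anesya library; the real output is whatever the extraction returns.

```python
def missing_required(data: dict, required: list[str]) -> list[str]:
    """List the required keys that are absent from the data."""
    return [key for key in required if key not in data]


# Invented sample data shaped like the schema above.
result = {
    "order_number": "ORD-0001",
    "customer_information": {"first_name": "Ada", "last_name": "Lovelace"},
    "items": [
        {"price": 19.99, "quantity": 2},
        {"price": 5.0, "quantity": 1},
    ],
}

missing = missing_required(result, ["order_number", "customer_information", "items"])
print("Missing top-level fields:", missing)

# Once the structure is known, downstream code can rely on it.
total = sum(item["price"] * item["quantity"] for item in result["items"])
print("Order total:", round(total, 2))
```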

Payslip

Context: You want to extract the following information from a standard French payslip:

  • The employee’s first and last name
  • The employer’s name
  • The net and gross salary amounts
  • The pay period covered by the payslip

Here’s how you should design your schema:

  1. Employee Information
    → Add an object field named employee
    → Inside this object, define two string fields: first_name and last_name

  2. Employer Name
    → Add a string field named employer

  3. Salary Amounts
    → Add two number fields: net_salary and gross_salary
    → These correspond to the salary after and before deductions, respectively

  4. Pay Period
    → Add an object field named pay_period
    → Inside, define two string fields: start and end
    → Add "context:type": "date" to each of these fields to mark it as a date


{
  "type": "object",
  "required": ["employee", "employer", "net_salary", "pay_period"],
  "properties": {
    "employee": {
      "type": "object",
      "required": ["first_name", "last_name"],
      "properties": {
        "first_name": {
          "type": "string",
          "description": "Employee's first name"
        },
        "last_name": {
          "type": "string",
          "description": "Employee's last name"
        }
      },
      "description": "Information about the employee"
    },
    "employer": {
      "type": "string",
      "description": "Name of the employer"
    },
    "net_salary": {
      "type": "number",
      "description": "Net amount paid to the employee"
    },
    "gross_salary": {
      "type": "number",
      "description": "Gross salary before deductions"
    },
    "pay_period": {
      "type": "object",
      "required": ["start", "end"],
      "properties": {
        "start": {
          "type": "string",
          "description": "Start date of the pay period",
          "context:type": "date"
        },
        "end": {
          "type": "string",
          "description": "End date of the pay period",
          "context:type": "date"
        }
      },
      "description": "Period covered by the payslip"
    }
  }
}

Rental Agreement

Context: You want to extract the following information from a residential rental agreement:

  • The landlord’s name
  • The tenant’s first and last name
  • The address of the rented property
  • The monthly rent and security deposit
  • The start and end dates of the lease

Here’s how you should design your schema:

  1. Landlord Information
    → Add a string field named landlord

  2. Tenant Information
    → Add an object field named tenant
    → Inside this object, define two string fields: first_name and last_name

  3. Property Address
    → Add a string field named property_address

  4. Rent & Deposit
    → Add two number fields: rental_amount and deposit_amount

  5. Lease Period
    → Add an object field named lease_period
    → Inside, define two string fields: start_date and end_date
    → Add "context:type": "date" to each of these fields to mark it as a date


{
  "type": "object",
  "required": ["landlord", "tenant", "property_address", "rental_amount", "lease_period"],
  "properties": {
    "landlord": {
      "type": "string",
      "description": "Name of the landlord"
    },
    "tenant": {
      "type": "object",
      "required": ["first_name", "last_name"],
      "properties": {
        "first_name": {
          "type": "string",
          "description": "Tenant's first name"
        },
        "last_name": {
          "type": "string",
          "description": "Tenant's last name"
        }
      },
      "description": "Information about the tenant"
    },
    "property_address": {
      "type": "string",
      "description": "Full address of the rented property"
    },
    "rental_amount": {
      "type": "number",
      "description": "Monthly rent amount"
    },
    "deposit_amount": {
      "type": "number",
      "description": "Amount of the security deposit"
    },
    "lease_period": {
      "type": "object",
      "required": ["start_date", "end_date"],
      "properties": {
        "start_date": {
          "type": "string",
          "description": "Start date of the lease",
          "context:type": "date"
        },
        "end_date": {
          "type": "string",
          "description": "End date of the lease",
          "context:type": "date"
        }
      },
      "description": "Duration of the lease"
    }
  }
}

What’s Next?

Once your schema is created, you can use it to: