How to Build a JSON Extraction Schema

Document processors are at the heart of the Anesya system. They analyze, extract, and structure the information contained in your documents (invoices, contracts, forms…). This guide explains how to create a JSON schema compatible with OpenAI JSON, in order to define the exact structure of the output data you expect after extraction.

Overall Schema Structure

A valid schema always starts with a root object like this:

{
  "type": "object",
  "required": [...], // Array of your field names
  "properties": { ... } // Place your properties inside
}

type: "object": this is mandatory at the root.
required: a list of fields that must appear.
properties: a dictionary of fields to extract, with their name, type, and description.
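As a concrete illustration of the root structure, here is a minimal Python sketch that assembles a filled-in root object and prints it as JSON. The two fields are borrowed from the examples in this guide, and deriving `required` from the keys of `properties` is only one possible convention (it makes every field mandatory), not something the platform requires.

```python
import json

# Fields to extract; names and descriptions are illustrative.
properties = {
    "customer_name": {"type": "string", "description": "Customer name"},
    "amount": {"type": "number", "description": "Invoice amount"},
}

# Root object: "type" is always "object"; here every field is marked required.
schema = {
    "type": "object",
    "required": list(properties),
    "properties": properties,
}

print(json.dumps(schema, indent=2))
```

Building the schema in code and serializing it also guarantees that the result is syntactically valid JSON, with no leftover comments or trailing commas.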

Supported Field Types

`string`

Free text string.

"customer_name": {
  "type": "string",
  "description": "Customer name"
}

`number`

Decimal number (can include cents, etc.).

"amount": {
  "type": "number",
  "description": "Invoice amount"
}

`integer`

Whole number without decimals.

"quantity": {
  "type": "integer",
  "description": "Quantity ordered"
}

`boolean`

Boolean value (true or false).

"is_signed": {
  "type": "boolean",
  "description": "Is the document signed?"
}

`enum`

Predefined list of values (closed set of choices).

"status": {
  "type": "enum",
  "enum": ["pending", "approved", "rejected"],
  "description": "Invoice status",
  "context:descriptions": [
    "Pending approval",
    "Invoice approved",
    "Invoice rejected"
  ]
}

Tip: context:descriptions is optional but useful to give more context to each value.
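The sketch below pairs each enum value with its description. It assumes that `context:descriptions` is matched to `enum` by position, which is inferred from the example above rather than stated by the platform; the helper name `describe_enum` is hypothetical.

```python
def describe_enum(field: dict) -> dict[str, str]:
    """Map each enum value to its description (empty string if none given).

    Assumption: descriptions are aligned with enum values by position.
    """
    values = field["enum"]
    descriptions = field.get("context:descriptions")
    if descriptions is None:
        return {value: "" for value in values}
    if len(descriptions) != len(values):
        raise ValueError("enum and context:descriptions differ in length")
    return dict(zip(values, descriptions))


status = {
    "type": "enum",
    "enum": ["pending", "approved", "rejected"],
    "description": "Invoice status",
    "context:descriptions": [
        "Pending approval",
        "Invoice approved",
        "Invoice rejected",
    ],
}

print(describe_enum(status)["approved"])  # Invoice approved
```

If that positional assumption holds, a length mismatch between the two lists is worth catching before you submit the schema.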

`object`

Sub-object containing its own fields.

"invoice_amount": {
  "type": "object",
  "required": ["amount", "iso_4217_currency_code"],
  "properties": {
    "amount": {
      "type": "number",
      "description": "Amount"
    },
    "iso_4217_currency_code": {
      "type": "string",
      "description": "ISO currency code (e.g. EUR, USD)"
    }
  },
  "description": "Invoice amount"
}

`array`

List of items, typically objects or strings.

Example with an array of objects:

"orders": {
  "type": "array",
  "description": "List of orders",
  "items": {
    "type": "object",
    "required": ["order_id", "customer_name"],
    "properties": {
      "order_id": {
        "type": "string",
        "description": "Order identifier"
      },
      "customer_name": {
        "type": "string",
        "description": "Customer name"
      }
    }
  }
}

Common Mistakes to Avoid

| ❌ Mistake | ✅ Solution |
| --- | --- |
| Missing or invalid `type` | Make sure each field has a valid `type` (among those listed above) |
| Missing `description` | Give each field a clear `description` |
| Mismatch in `required` | Make sure every field listed in `required` exists in `properties` |
| Malformed `object` or `array` | Give objects both `properties` and `required`, and give arrays `items` |
| Duplicates or typos | Check field names for duplicates and misspellings (e.g. `invoice_amoun2t` instead of `invoice_amount`) |
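Most of these mistakes can be caught locally before you submit anything. The following is a rough sketch of such a check, not the validator the platform actually runs: the function name `lint_schema` and its messages are made up for illustration, and it assumes that only named fields (not the root or array `items`) need a `description`, as in the examples in this guide.

```python
VALID_TYPES = {"string", "number", "integer", "boolean", "enum", "object", "array"}


def lint_schema(node: dict, path: str = "root", named: bool = False) -> list[str]:
    """Return a list of problems found in a schema node and its children."""
    problems = []
    node_type = node.get("type")
    if not isinstance(node_type, str) or node_type not in VALID_TYPES:
        problems.append(f"{path}: missing or invalid type {node_type!r}")
    # Assumption: only fields declared inside "properties" need a description.
    if named and not node.get("description"):
        problems.append(f"{path}: missing description")
    if node_type == "object":
        properties = node.get("properties")
        required = node.get("required")
        if not isinstance(properties, dict):
            problems.append(f"{path}: object needs properties")
            properties = {}
        if not isinstance(required, list):
            problems.append(f"{path}: object needs required")
            required = []
        for name in required:
            if name not in properties:
                problems.append(f"{path}: required field {name!r} is not in properties")
        for name, child in properties.items():
            problems.extend(lint_schema(child, f"{path}.{name}", named=True))
    elif node_type == "array":
        items = node.get("items")
        if not isinstance(items, dict):
            problems.append(f"{path}: array needs items")
        else:
            problems.extend(lint_schema(items, f"{path}[]"))
    return problems


# A schema with a typo in a field name and a missing description.
broken = {
    "type": "object",
    "required": ["invoice_amount"],
    "properties": {"invoice_amoun2t": {"type": "number"}},
}
for problem in lint_schema(broken):
    print(problem)
```

Note that a typo in a field name usually surfaces as a mismatch between `required` and `properties`, which is why the two are checked against each other.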

Testing Your Schema

Before submitting a schema, you can test it in our interface or with our API.

If an error is detected, you’ll see a message like:

❌ Invalid schema: root.status.type must be string or list
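Before reaching that stage, it can help to confirm locally that the file is valid JSON at all. The sketch below uses Python's standard `json` module to reject syntax errors (such as leftover `//` comments) and duplicate field names; the helper names `load_schema` and `reject_duplicates` are hypothetical, and this check is separate from whatever validation the platform performs.

```python
import json


def reject_duplicates(pairs):
    """Build a dict from key/value pairs, refusing repeated keys."""
    keys = [key for key, _ in pairs]
    duplicates = {key for key in keys if keys.count(key) > 1}
    if duplicates:
        raise ValueError(f"duplicate field names: {sorted(duplicates)}")
    return dict(pairs)


def load_schema(text: str) -> dict:
    """Parse schema text; raises ValueError on bad JSON or duplicate keys."""
    return json.loads(text, object_pairs_hook=reject_duplicates)


schema_text = '{"type": "object", "required": [], "properties": {}}'
schema = load_schema(schema_text)
print(schema["type"])  # object
```

`json.JSONDecodeError` is a subclass of `ValueError`, so a single `except ValueError` around `load_schema` covers both kinds of failure.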

🧪 Schema Examples

E-commerce Invoice

Context: You want to extract the following information from an e-commerce invoice:

The order number
The buyer’s first and last name
The items included in the invoice, with their price and quantity

Here’s how you should design your schema:

Order Number
→ Add a field of type string named order_number (if your order numbers never contain letters, you may use number instead)
Buyer Information
→ Add a field of type object named customer_information
→ Inside this object, define two string fields: first_name and last_name
Invoice Items
→ Add a field of type array named items
→ The items array will contain multiple object entries
→ Each object will include two number fields: price and quantity

{
    "type": "object",
    "required": [
        "order_number",
        "customer_information",
        "items"
    ],
    "properties": {
        "order_number": {
            "type": "string",
            "description": "The unique number assigned to the order"
        },
        "customer_information": {
            "type": "object",
            "required": [
                "first_name",
                "last_name"
            ],
            "properties": {
                "first_name": {
                    "type": "string",
                    "description": "The buyer's first name"
                },
                "last_name": {
                    "type": "string",
                    "description": "The buyer's last name"
                }
            },
            "description": "Information about the buyer"
        },
        "items": {
            "type": "array",
            "description": "List of items included in the invoice",
            "items": {
                "type": "object",
                "required": [
                    "price",
                    "quantity"
                ],
                "properties": {
                    "price": {
                        "type": "number",
                        "description": "Price of the item"
                    },
                    "quantity": {
                        "type": "number",
                        "description": "Quantity of the item ordered"
                    }
                }
            }
        }
    }
}
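To see what `required` means for the extracted data, the sketch below walks a result and reports every required field that is absent. The schema is a trimmed copy of the one above (descriptions omitted for brevity), the sample result is made up, and `missing_required` is a hypothetical helper, not part of the Anesya API.

```python
def missing_required(schema: dict, data, path: str = "root") -> list[str]:
    """List the paths of required fields that are absent from the data."""
    missing = []
    if schema.get("type") == "object" and isinstance(data, dict):
        for name in schema.get("required", []):
            if name not in data:
                missing.append(f"{path}.{name}")
        for name, child in schema.get("properties", {}).items():
            if name in data:
                missing.extend(missing_required(child, data[name], f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(data, list):
        for index, item in enumerate(data):
            missing.extend(
                missing_required(schema.get("items", {}), item, f"{path}[{index}]")
            )
    return missing


schema = {
    "type": "object",
    "required": ["order_number", "customer_information", "items"],
    "properties": {
        "order_number": {"type": "string"},
        "customer_information": {
            "type": "object",
            "required": ["first_name", "last_name"],
            "properties": {
                "first_name": {"type": "string"},
                "last_name": {"type": "string"},
            },
        },
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["price", "quantity"],
                "properties": {
                    "price": {"type": "number"},
                    "quantity": {"type": "number"},
                },
            },
        },
    },
}

# Made-up extraction result with two gaps.
result = {
    "order_number": "A-1001",
    "customer_information": {"first_name": "Ada"},
    "items": [{"price": 19.99, "quantity": 2}, {"price": 5.0}],
}

print(missing_required(schema, result))
# ['root.customer_information.last_name', 'root.items[1].quantity']
```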

Payslip

Context: You want to extract the following information from a standard French payslip:

The employee’s first and last name
The employer’s name
The net and gross salary amounts
The pay period covered by the payslip

Here’s how you should design your schema:

Employee Information
→ Add an object field named employee
→ Inside this object, define two string fields: first_name and last_name
Employer Name
→ Add a string field named employer
Salary Amounts
→ Add two number fields: net_salary and gross_salary
→ These correspond to the salary after and before deductions, respectively
Pay Period
→ Add an object field named pay_period
→ Inside, define two string fields: start and end
→ Set "context:type": "date" on each of these fields to indicate that the value is a date

{
  "type": "object",
  "required": ["employee", "employer", "net_salary", "pay_period"],
  "properties": {
    "employee": {
      "type": "object",
      "required": ["first_name", "last_name"],
      "properties": {
        "first_name": {
          "type": "string",
          "description": "Employee's first name"
        },
        "last_name": {
          "type": "string",
          "description": "Employee's last name"
        }
      },
      "description": "Information about the employee"
    },
    "employer": {
      "type": "string",
      "description": "Name of the employer"
    },
    "net_salary": {
      "type": "number",
      "description": "Net amount paid to the employee"
    },
    "gross_salary": {
      "type": "number",
      "description": "Gross salary before deductions"
    },
    "pay_period": {
      "type": "object",
      "required": ["start", "end"],
      "properties": {
        "start": {
          "type": "string",
          "description": "Start date of the pay period",
          "context:type": "date"
        },
        "end": {
          "type": "string",
          "description": "End date of the pay period",
          "context:type": "date"
        }
      },
      "description": "Period covered by the payslip"
    }
  }
}

Rental Agreement

Context: You want to extract the following information from a residential rental agreement:

The landlord’s name
The tenant’s first and last name
The address of the rented property
The monthly rent and security deposit
The start and end dates of the lease

Here’s how you should design your schema:

Landlord Information
→ Add a string field named landlord
Tenant Information
→ Add an object field named tenant
→ Inside this object, define two string fields: first_name and last_name
Property Address
→ Add a string field named property_address
Rent & Deposit
→ Add two number fields: rental_amount and deposit_amount
Lease Period
→ Add an object field named lease_period
→ Inside, define two string fields: start_date and end_date
→ Set "context:type": "date" on each of these fields to indicate that the value is a date

{
  "type": "object",
  "required": ["landlord", "tenant", "property_address", "rental_amount", "lease_period"],
  "properties": {
    "landlord": {
      "type": "string",
      "description": "Name of the landlord"
    },
    "tenant": {
      "type": "object",
      "required": ["first_name", "last_name"],
      "properties": {
        "first_name": {
          "type": "string",
          "description": "Tenant's first name"
        },
        "last_name": {
          "type": "string",
          "description": "Tenant's last name"
        }
      },
      "description": "Information about the tenant"
    },
    "property_address": {
      "type": "string",
      "description": "Full address of the rented property"
    },
    "rental_amount": {
      "type": "number",
      "description": "Monthly rent amount"
    },
    "deposit_amount": {
      "type": "number",
      "description": "Amount of the security deposit"
    },
    "lease_period": {
      "type": "object",
      "required": ["start_date", "end_date"],
      "properties": {
        "start_date": {
          "type": "string",
          "description": "Start date of the lease",
          "context:type": "date"
        },
        "end_date": {
          "type": "string",
          "description": "End date of the lease",
          "context:type": "date"
        }
      },
      "description": "Duration of the lease"
    }
  }
}

What’s Next?

Once your schema is created, you can use it to:

Trigger an extraction via API
Run an extraction directly on the Anesya platform
Create a workflow using this schema
