# How to Build a JSON Extraction Schema

Document processors are at the heart of the Anesya system. They analyze, extract, and structure the information contained in your documents (invoices, contracts, forms…).

This guide explains how to create a JSON schema compatible with OpenAI JSON, so that you can define the exact structure of the output data you expect after extraction.

## Overall Schema Structure

A valid schema always starts with a root object like this:

```json
{
  "type": "object",
  "required": [...], // Array of your field names
  "properties": { ... } // Place your properties inside
}
```

* `type: "object"`: this is mandatory at the root.
* `required`: a **list of fields** that must appear.
* `properties`: a **dictionary of fields** to extract, with their name, type, and description.

## Supported Field Types

### `string`

Free text string.

```json
"customer_name": {
  "type": "string",
  "description": "Customer name"
}
```

### `number`

Decimal number (can include cents, etc.).

```json
"amount": {
  "type": "number",
  "description": "Invoice amount"
}
```

### `integer`

Whole number without decimals.

```json
"quantity": {
  "type": "integer",
  "description": "Quantity ordered"
}
```

### `boolean`

Boolean value (`true` or `false`).

```json
"is_signed": {
  "type": "boolean",
  "description": "Is the document signed?"
}
```

### `enum`

Predefined list of values (closed set of choices).

```json
"status": {
  "type": "enum",
  "enum": ["pending", "approved", "rejected"],
  "description": "Invoice status",
  "context:descriptions": [
    "Pending approval",
    "Invoice approved",
    "Invoice rejected"
  ]
}
```

> **Tip**: `context:descriptions` is optional but useful to give more context to each value. Its entries are listed in the same order as the values in `enum`.

### `object`

Sub-object containing its own fields.

```json
"invoice_amount": {
  "type": "object",
  "required": ["amount", "iso_4217_currency_code"],
  "properties": {
    "amount": {
      "type": "number",
      "description": "Amount"
    },
    "iso_4217_currency_code": {
      "type": "string",
      "description": "ISO currency code (e.g. EUR, USD)"
    }
  },
  "description": "Invoice amount"
}
```

### `array`

List of items, typically objects or strings.

#### Example with an array of objects:

```json
"orders": {
  "type": "array",
  "description": "List of orders",
  "items": {
    "type": "object",
    "required": ["order_id", "customer_name"],
    "properties": {
      "order_id": {
        "type": "string",
        "description": "Order identifier"
      },
      "customer_name": {
        "type": "string",
        "description": "Customer name"
      }
    }
  }
}
```

## Common Mistakes to Avoid

| ❌ Mistake | ✅ Solution |
| --- | --- |
| Missing or invalid `type` | Make sure each field has a valid `type` (among those listed above) |
| Missing `description` | Each field must have a clear `description` |
| Mismatch in `required` | All fields listed in `required` must exist in `properties` |
| Malformed `object` or `array` | Objects must have `properties` and `required`; arrays must have `items` |
| Duplicates or typos | Check field names for duplicates and typos (e.g. `invoice_amoun2t` instead of `invoice_amount`) |

## Testing Your Schema

Before submitting a schema, you can test it in our interface or with our API. If an error is detected, you’ll see a message like:

```
❌ Invalid schema: root.status.type must be string or list
```

## 🧪 Schema Examples

### E-commerce Invoice

**Context**: You want to extract the following information from an e-commerce invoice:

* The **order number**
* The **buyer’s first and last name**
* The **items** included in the invoice, with their **price** and **quantity**

Here’s how you should design your schema:

1. **Order Number**
   → Add a field of type `string` named `order_number`
   (If your order numbers never contain letters, you may use `number` instead.)
2. **Buyer Information**
   → Add a field of type `object` named `customer_information`
   → Inside this object, define two `string` fields: `first_name` and `last_name`
3. **Invoice Items**
   → Add a field of type `array` named `items`
   → The `items` array will contain multiple `object` entries
   → Each object will include two `number` fields: `price` and `quantity`

```json
{
  "type": "object",
  "required": [
    "order_number",
    "customer_information",
    "items"
  ],
  "properties": {
    "order_number": {
      "type": "string",
      "description": "The unique number assigned to the order"
    },
    "customer_information": {
      "type": "object",
      "required": [
        "first_name",
        "last_name"
      ],
      "properties": {
        "first_name": {
          "type": "string",
          "description": "The buyer's first name"
        },
        "last_name": {
          "type": "string",
          "description": "The buyer's last name"
        }
      },
      "description": "Information about the buyer"
    },
    "items": {
      "type": "array",
      "description": "List of items included in the invoice",
      "items": {
        "type": "object",
        "required": [
          "price",
          "quantity"
        ],
        "properties": {
          "price": {
            "type": "number",
            "description": "Price of the item"
          },
          "quantity": {
            "type": "number",
            "description": "Quantity of the item ordered"
          }
        }
      }
    }
  }
}
```

### Payslip

**Context**: You want to extract the following information from a standard French payslip:

* The **employee’s first and last name**
* The **employer’s name**
* The **net and gross salary amounts**
* The **pay period** covered by the payslip

Here’s how you should design your schema:

1. **Employee Information**
   → Add an `object` field named `employee`
   → Inside this object, define two `string` fields: `first_name` and `last_name`
2. **Employer Name**
   → Add a `string` field named `employer`
3. **Salary Amounts**
   → Add two `number` fields: `net_salary` and `gross_salary`
   → These correspond to the salary after and before deductions, respectively
4. **Pay Period**
   → Add an `object` field named `pay_period`
   → Inside, define two `string` fields: `start` and `end`
   → Use `"context:type": "date"` on each field to indicate that it holds a date

```json
{
  "type": "object",
  "required": ["employee", "employer", "net_salary", "pay_period"],
  "properties": {
    "employee": {
      "type": "object",
      "required": ["first_name", "last_name"],
      "properties": {
        "first_name": {
          "type": "string",
          "description": "Employee's first name"
        },
        "last_name": {
          "type": "string",
          "description": "Employee's last name"
        }
      },
      "description": "Information about the employee"
    },
    "employer": {
      "type": "string",
      "description": "Name of the employer"
    },
    "net_salary": {
      "type": "number",
      "description": "Net amount paid to the employee"
    },
    "gross_salary": {
      "type": "number",
      "description": "Gross salary before deductions"
    },
    "pay_period": {
      "type": "object",
      "required": ["start", "end"],
      "properties": {
        "start": {
          "type": "string",
          "description": "Start date of the pay period",
          "context:type": "date"
        },
        "end": {
          "type": "string",
          "description": "End date of the pay period",
          "context:type": "date"
        }
      },
      "description": "Period covered by the payslip"
    }
  }
}
```

### Rental Agreement

**Context**: You want to extract the following information from a residential rental agreement:

* The **landlord’s name**
* The **tenant’s first and last name**
* The **address** of the rented property
* The **monthly rent** and **security deposit**
* The **start and end dates** of the lease

Here’s how you should design your schema:

1. **Landlord Information**
   → Add a `string` field named `landlord`
2. **Tenant Information**
   → Add an `object` field named `tenant`
   → Inside this object, define two `string` fields: `first_name` and `last_name`
3. **Property Address**
   → Add a `string` field named `property_address`
4. **Rent & Deposit**
   → Add two `number` fields: `rental_amount` and `deposit_amount`
5. **Lease Period**
   → Add an `object` field named `lease_period`
   → Inside, define two `string` fields: `start_date` and `end_date`
   → Use `"context:type": "date"` on each field to indicate that it holds a date

```json
{
  "type": "object",
  "required": ["landlord", "tenant", "property_address", "rental_amount", "lease_period"],
  "properties": {
    "landlord": {
      "type": "string",
      "description": "Name of the landlord"
    },
    "tenant": {
      "type": "object",
      "required": ["first_name", "last_name"],
      "properties": {
        "first_name": {
          "type": "string",
          "description": "Tenant's first name"
        },
        "last_name": {
          "type": "string",
          "description": "Tenant's last name"
        }
      },
      "description": "Information about the tenant"
    },
    "property_address": {
      "type": "string",
      "description": "Full address of the rented property"
    },
    "rental_amount": {
      "type": "number",
      "description": "Monthly rent amount"
    },
    "deposit_amount": {
      "type": "number",
      "description": "Amount of the security deposit"
    },
    "lease_period": {
      "type": "object",
      "required": ["start_date", "end_date"],
      "properties": {
        "start_date": {
          "type": "string",
          "description": "Start date of the lease",
          "context:type": "date"
        },
        "end_date": {
          "type": "string",
          "description": "End date of the lease",
          "context:type": "date"
        }
      },
      "description": "Duration of the lease"
    }
  }
}
```

## What’s Next?

Once your schema is created, you can use it to:

* Trigger an extraction via API
* Run an extraction directly on the Anesya platform
* Create a workflow using this schema
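
Before using a schema in any of these ways, it can help to check it locally against the rules from the Common Mistakes to Avoid table. The sketch below is a minimal, unofficial Python checker written for this guide: the function name `validate_schema`, the `needs_description` parameter, and the wording of its messages are assumptions for illustration and do not reproduce the output of the Anesya validator. It also assumes that the root object and the `items` entry of an array do not need a `description`, since the examples in this guide omit it there.

```python
import json

# Field types described in this guide.
ALLOWED_TYPES = {"string", "number", "integer", "boolean", "enum", "object", "array"}


def validate_schema(schema, path="root", needs_description=False):
    """Return a list of problems found in an extraction schema (hypothetical helper)."""
    if not isinstance(schema, dict):
        return [f"{path}: field definition must be a JSON object"]

    errors = []
    field_type = schema.get("type")
    # Simplification: this sketch only accepts a single type name.
    if not isinstance(field_type, str) or field_type not in ALLOWED_TYPES:
        errors.append(f"{path}: missing or invalid type {field_type!r}")

    if needs_description and not schema.get("description"):
        errors.append(f"{path}: missing description")

    if field_type == "object":
        properties = schema.get("properties")
        required = schema.get("required")
        if not isinstance(properties, dict):
            errors.append(f"{path}: object must define properties")
            properties = {}
        if not isinstance(required, list):
            errors.append(f"{path}: object must define required")
            required = []
        # Every name in `required` must also exist in `properties`.
        for name in required:
            if name not in properties:
                errors.append(f"{path}: required field {name!r} is not in properties")
        for name, child in properties.items():
            errors.extend(validate_schema(child, f"{path}.{name}", needs_description=True))
    elif field_type == "array":
        if "items" not in schema:
            errors.append(f"{path}: array must define items")
        else:
            errors.extend(validate_schema(schema["items"], f"{path}[]"))
    elif field_type == "enum":
        values = schema.get("enum")
        if not isinstance(values, list) or not values:
            errors.append(f"{path}: enum must list its allowed values")
    return errors


if __name__ == "__main__":
    # A schema with a typo in a field name and a missing description.
    raw = """
    {
      "type": "object",
      "required": ["invoice_amount"],
      "properties": {
        "invoice_amoun2t": { "type": "number" }
      }
    }
    """
    for problem in validate_schema(json.loads(raw)):
        print(problem)
```

An empty list means none of the checks above failed; it does not guarantee that the platform will accept the schema, so the interface or API test described earlier remains the reference.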