# Extract

> Pull structured JSON from documents using one existing schema.


Extract is the Anesya endpoint for turning a document or a completed parsing into structured data.

You provide:

* one schema ID
* and either one document input or one parsing ID


Anesya then returns a structured JSON result in the `result` field.

Under the hood, extract relies on parsing. If you pass a raw document, a parsing is created first. If you pass a parsing ID, that parsing must already be in a usable state (`FINISHED` or `PARTIAL_FINISHED`) before it can be reused.

## Parsing vs extract

Both endpoints process documents, but they solve different problems.

### Parsing answers

**"What is in this document?"**

Use parsing when you need:

* OCR output
* markdown content
* pictures
* page-level success and failure details
* direct inspection of the source content


### Extract answers

**"What structured result should I return from this document?"**

Use extract when you need:

* structured JSON
* one schema-driven result
* a clean downstream payload for your app, workflow, or automation


### Important rule

Extract can only work from what parsing can process.

If the parsing is poor, incomplete, or fails, the extract result will be affected too.

When debugging extraction quality, inspect the parsing first.

## What extract is for

Extract is the right entry point when you want to:

* map a document to one business schema
* return a compact JSON payload instead of full document content
* process one already parsed document
* process one stored document, one URL, or one uploaded file in a single call


The common flow is:


```mermaid
flowchart LR
    A[Document or parsing] --> B[Create extract]
    B --> C[Poll extract]
    C --> D[Use structured JSON]
```

## Quick start

The fastest extract workflow has two variants.

### Option A: Extract from an existing parsing

This is the safest option when you have already created a parsing or want maximum visibility into the parsing output.


```bash
curl -X POST "https://api.anesya.app/v0/extract" \
  -H "X-API-Key: $ANESYA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "parsing": "YOUR_PARSING_ID",
    "schema": "YOUR_SCHEMA_ID"
  }'
```

Use this path when:

* the parsing already exists
* you want to inspect parsing output before extraction
* you want to reuse one parsing in downstream logic


### Option B: Extract directly from one document

This is the simplest one-call path when you only need one final structured result.


```bash
curl -X POST "https://api.anesya.app/v0/extract" \
  -H "X-API-Key: $ANESYA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document": "YOUR_DOCUMENT_ID",
    "schema": "YOUR_SCHEMA_ID",
    "model": "PIGALLE",
    "picture_description_enabled": false,
    "table_verification_enabled": false
  }'
```

When `document` is used, Anesya creates a new parsing first and then runs the extract.
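For readers calling the API from Python rather than curl, the same create request could be sketched with the standard library. The helper name `build_extract_request` is hypothetical; only the URL, the `X-API-Key` header, and the field names come from the curl examples above. The request is built but not sent, so it can be inspected offline.

```python
import json
import urllib.request

API_BASE = "https://api.anesya.app/v0"  # base URL taken from the curl examples


def build_extract_request(api_key, schema_id, *, document=None, parsing=None, **parsing_options):
    """Build (but do not send) a POST /v0/extract request.

    Hypothetical helper: only the URL, header, and field names come from the guide.
    """
    payload = {"schema": schema_id}
    if document is not None:
        payload["document"] = document
        # Options such as `model` are forwarded only together with `document`.
        payload.update(parsing_options)
    if parsing is not None:
        payload["parsing"] = parsing
    return urllib.request.Request(
        f"{API_BASE}/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_extract_request(
    "test-key", "YOUR_SCHEMA_ID", document="YOUR_DOCUMENT_ID", model="PIGALLE"
)
# To send it: urllib.request.urlopen(request), then json.load() the response body.
```

This sketch silently drops parsing options when `parsing` is used; a stricter client might reject them instead.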

### Poll the extract


```bash
curl -X GET "https://api.anesya.app/v0/extract/YOUR_EXTRACT_ID" \
  -H "X-API-Key: $ANESYA_API_KEY"
```

Stop polling when `status` is one of:

* `FINISHED`
* `ERROR`


For complete polling guidance, see [API quickstart](/tutorials/quickstart).
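A polling loop around that GET call might look like the sketch below. `poll_extract` and its parameters are hypothetical names, and the interval and attempt limit are illustrative values, not documented ones. The HTTP call is injected as `fetch_extract` so the loop can be exercised without network access.

```python
import time

TERMINAL_STATUSES = {"FINISHED", "ERROR"}  # stop conditions listed in this guide


def poll_extract(fetch_extract, extract_id, interval_seconds=2.0, max_attempts=60, sleep=time.sleep):
    """Poll until the extract reaches FINISHED or ERROR, then return the resource.

    fetch_extract: callable performing GET /v0/extract/{id} and returning parsed JSON.
    interval_seconds and max_attempts are assumptions, not documented limits.
    """
    for _ in range(max_attempts):
        extract = fetch_extract(extract_id)
        if extract["status"] in TERMINAL_STATUSES:
            return extract
        sleep(interval_seconds)
    raise TimeoutError(f"extract {extract_id} did not finish after {max_attempts} polls")


# Offline demonstration with a fake fetcher that finishes on the third call.
_responses = iter([
    {"status": "IN_QUEUE", "result": None},
    {"status": "IN_PROGRESS", "result": None},
    {"status": "FINISHED", "result": {"invoice_number": "INV-2025-0042"}},
])
final = poll_extract(lambda _id: next(_responses), "YOUR_EXTRACT_ID", sleep=lambda _s: None)
```

Bounding the number of attempts keeps a stuck extract from blocking the caller indefinitely.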

## What you get back

The extract retrieve endpoint returns a resource shaped like this:


```json
{
  "id": "e90f337f-4d7c-46d1-a2a7-13cf5d9d7cfe",
  "schema": {
    "id": "YOUR_SCHEMA_ID",
    "name": "Invoice schema",
    "description": "Schema used for extracting data from invoices.",
    "completion_mode": "CLASSIC"
  },
  "parsing": {
    "id": "d1b96998-f20e-4b6f-8fa5-78a70b1db9b2"
  },
  "result": {
    "invoice_number": "INV-2025-0042",
    "invoice_date": "2025-06-01",
    "total_amount": 1280.5
  },
  "status": "FINISHED",
  "error": null,
  "created_at": "2025-06-12T14:58:00.000000Z",
  "updated_at": "2025-06-12T14:58:20.000000Z"
}
```

### Key fields

| Field | What it contains |
|  --- | --- |
| `schema` | Metadata about the schema used for extraction |
| `parsing` | Related parsing resource |
| `result` | Final structured JSON payload |
| `status` | Current extract state |
| `error` | Error message when extraction fails |


## Input options

Extract accepts exactly one of:

* `document`
* `parsing`


It always requires:

* `schema`


### 1. Existing parsing ID

Best when parsing already exists and should be reused.


```json
{
  "parsing": "YOUR_PARSING_ID",
  "schema": "YOUR_SCHEMA_ID"
}
```

### 2. Existing document ID

Best when the file is stored in Anesya and you want a one-call extract flow.


```json
{
  "document": "YOUR_DOCUMENT_ID",
  "schema": "YOUR_SCHEMA_ID"
}
```

### 3. Public or pre-signed URL

Best when the source file is hosted elsewhere.


```json
{
  "document": "https://example.com/invoice.pdf",
  "schema": "YOUR_SCHEMA_ID"
}
```

### 4. Multipart uploaded file

Best when your app already has the local file content.


```bash
curl -X POST "https://api.anesya.app/v0/extract" \
  -H "X-API-Key: $ANESYA_API_KEY" \
  -F "document=@invoice.pdf" \
  -F "schema=YOUR_SCHEMA_ID"
```

Important behavior:

* extract processes one document at a time
* the public schema does **not** document a URL-list input for extract
* if you need fan-out over multiple files, create multiple parsings first


## Request rules

These rules are critical.

### Rule 1: `schema` is always required

Every extract request must include one schema ID.

### Rule 2: exactly one of `document` or `parsing`

Do not send both in the same payload.

### Rule 3: parsing must already be usable

If you send a parsing ID, that parsing must already be in:

* `FINISHED`
* or `PARTIAL_FINISHED`


If the parsing is still running, keep polling it first.

### Rule 4: document-specific parsing options only apply with `document`

These fields affect the internal parsing step only when you use `document`:

* `model`
* `picture_description_enabled`
* `table_verification_enabled`
* `metadata`


If you send `parsing`, these fields no longer influence parsing, because the parsing step has already run.
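The four rules can be checked on the client before sending a request. The following is a minimal sketch; `check_extract_payload` is a hypothetical helper, not part of any Anesya SDK, and it assumes the caller already knows the parsing status from a previous parsing poll.

```python
USABLE_PARSING_STATUSES = {"FINISHED", "PARTIAL_FINISHED"}
DOCUMENT_ONLY_OPTIONS = (
    "model",
    "picture_description_enabled",
    "table_verification_enabled",
    "metadata",
)


def check_extract_payload(payload, parsing_status=None):
    """Return a list of rule violations for an extract request body (empty if none)."""
    problems = []
    if not payload.get("schema"):
        problems.append("Rule 1: `schema` is always required")
    has_document = payload.get("document") is not None
    has_parsing = payload.get("parsing") is not None
    if has_document == has_parsing:
        problems.append("Rule 2: send exactly one of `document` or `parsing`")
    if has_parsing and parsing_status is not None and parsing_status not in USABLE_PARSING_STATUSES:
        problems.append(f"Rule 3: parsing status {parsing_status} is not usable yet")
    if has_parsing and not has_document:
        unused = [name for name in DOCUMENT_ONLY_OPTIONS if name in payload]
        if unused:
            problems.append("Rule 4: only applies with `document`: " + ", ".join(unused))
    return problems


problems = check_extract_payload({"parsing": "YOUR_PARSING_ID", "model": "PIGALLE"}, "IN_PROGRESS")
```

Returning every violation at once, rather than raising on the first, makes it easier to log all problems with a rejected payload.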

## Response format details

Extract has two main response moments.

### `POST /v0/extract`

Create extract returns one extract object in a non-final state such as:

* `IN_QUEUE`
* `IN_PROGRESS`


At that moment:

* `result` may be `null`
* the resource exists
* processing is still ongoing


### `GET /v0/extract/{id}`

Retrieve extract returns the full extract resource with:

* `schema`
* `parsing`
* `result`
* `status`
* `error`


### Important result-shape rule

The public schema models `result` as a generic JSON payload.

That means `result` can be:

* `null`
* an object
* an array of objects


Do not hardcode extract handling as “always one flat object”.

The exact shape depends on the schema and the extraction outcome.
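One way to avoid hardcoding a single shape is to normalize `result` into a list of records before further processing. The helper below is a hypothetical sketch that covers only the three shapes listed above.

```python
def result_records(result):
    """Normalize an extract `result` into a list of records.

    None -> [], a single object -> [object], an array -> a copy of the array.
    """
    if result is None:
        return []
    if isinstance(result, dict):
        return [result]
    if isinstance(result, list):
        return list(result)
    raise TypeError(f"unexpected result type: {type(result).__name__}")


records = result_records({"invoice_number": "INV-2025-0042", "total_amount": 1280.5})
```

Downstream code can then iterate over `records` whether the schema produced one object or several.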

## Extract statuses

These are the possible extract states:

| Status | Meaning | What to do |
|  --- | --- | --- |
| `IN_QUEUE` | Accepted and waiting to start | Keep polling |
| `IN_PROGRESS` | Extraction is running | Keep polling |
| `FINISHED` | Extraction completed successfully | Use `result` |
| `ERROR` | Extraction failed | Inspect `error` |


Unlike parsing, extract does not expose `PARTIAL_FINISHED` in the public schema.
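The status table translates directly into client-side handling. In the sketch below, the exception classes and `unwrap_extract` are hypothetical names; the four status values and the `result` and `error` fields are the ones documented in this guide.

```python
class ExtractNotReady(Exception):
    """The extract is still IN_QUEUE or IN_PROGRESS (hypothetical helper type)."""


class ExtractFailed(Exception):
    """The extract ended in ERROR; the message carries the `error` field."""


def unwrap_extract(extract):
    """Return `result` for a FINISHED extract, otherwise raise a descriptive error."""
    status = extract["status"]
    if status == "FINISHED":
        return extract["result"]
    if status == "ERROR":
        raise ExtractFailed(extract.get("error"))
    if status in ("IN_QUEUE", "IN_PROGRESS"):
        raise ExtractNotReady(status)
    # Unknown values are surfaced instead of being treated as success.
    raise ValueError(f"unexpected extract status: {status!r}")


payload = unwrap_extract({"status": "FINISHED", "result": {"total_amount": 1280.5}, "error": None})
```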

## Best practices

### 1. Prefer extract-from-parsing when debugging matters

If quality matters and you want visibility, do not jump directly from raw document to extract.

Use:

1. parsing
2. parsing review
3. extract from parsing


This makes debugging much easier.

### 2. Prefer direct extract when you only need the final result

If you do not need OCR or markdown separately, direct extract from `document` is cleaner and shorter.

### 3. Keep one schema per business outcome

Extract behaves best when one schema corresponds to one clear business result.

Examples:

* invoice extraction schema
* payslip extraction schema
* contract summary schema


Avoid overloading one schema with too many unrelated goals.

### 4. Make sure the parsing is ready before extract-from-parsing

If you send a parsing ID too early, extract will fail or behave unexpectedly.

Always wait until parsing is in a usable final state.

### 5. Use parsing-first for multiple URLs

If your input source is a list of URLs:

1. send the list to parsing
2. receive multiple parsing IDs
3. create one extract per parsing
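Step 3 of this fan-out can be sketched as follows. The function name is hypothetical, and the sketch assumes each parsing resource exposes `id` and `status` fields, which is consistent with this guide but should be checked against the parsing reference.

```python
USABLE_PARSING_STATUSES = {"FINISHED", "PARTIAL_FINISHED"}


def extract_payloads_for_parsings(parsings, schema_id):
    """Build one extract payload per usable parsing.

    Returns (payloads, not_ready_ids); parsings that are not usable are reported, not sent.
    """
    payloads, not_ready_ids = [], []
    for parsing in parsings:
        if parsing["status"] in USABLE_PARSING_STATUSES:
            payloads.append({"parsing": parsing["id"], "schema": schema_id})
        else:
            not_ready_ids.append(parsing["id"])
    return payloads, not_ready_ids


payloads, not_ready_ids = extract_payloads_for_parsings(
    [
        {"id": "parsing-1", "status": "FINISHED"},
        {"id": "parsing-2", "status": "IN_PROGRESS"},
        {"id": "parsing-3", "status": "PARTIAL_FINISHED"},
    ],
    "YOUR_SCHEMA_ID",
)
```

Each payload in `payloads` is then sent as its own `POST /v0/extract` request.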


### 6. Treat `result` as arbitrary JSON

Do not assume:

* one flat object
* fixed field ordering
* a single fixed value type


Consume `result` based on the schema you asked for.

## Common pitfalls

### Sending both `document` and `parsing`

This is invalid. Extract expects exactly one source: either `document` or `parsing`.

### Forgetting `schema`

Without a schema ID, the request is incomplete.

### Sending a parsing that is still running

If parsing has not reached `FINISHED` or `PARTIAL_FINISHED`, do not create the extract yet.

### Assuming extract supports a list of URLs

URL-list input is documented for parsing, not for extract.

### Assuming `result` is ready immediately after `POST`

Extract is asynchronous. Poll the resource before using `result`.

### Treating every extract result as the same JSON shape

The schema drives the result shape, so different schemas can produce very different payloads.

## Troubleshooting

### 401 Unauthorized

Your API key is missing or invalid. Check the `X-API-Key` header.

### Extract stays in `IN_QUEUE` or `IN_PROGRESS`

Keep polling. Extract is asynchronous by design.

### Extract finishes with `ERROR`

Inspect the `error` field, then verify:

* the schema ID is valid
* the parsing is already usable if you passed `parsing`
* the document input is valid if you passed `document`


### Direct extract from URL fails

If the input uses a short-lived pre-signed URL, it may expire before the internal parsing step starts or finishes.

Use a longer validity window or upload the file first.

### Output quality is poor

If the extract result is weak or incomplete:

1. inspect the parsing first
2. verify the document content appears correctly in `markdown_content` or `ocr_content`
3. then review the schema you are using
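As a rough aid for step 2, extracted scalar values can be compared against the parsed text. The helper below is a hypothetical sketch that assumes you have already retrieved the parsing's `markdown_content` (or `ocr_content`) as a string. A miss is only a hint, since extraction may legitimately reformat dates or numbers.

```python
def fields_not_found_in_text(result, parsed_text):
    """List top-level scalar fields whose string form does not appear in the parsed text."""
    missing = []
    for name, value in result.items():
        if value is None or isinstance(value, (dict, list)):
            continue  # nested or empty values are skipped in this simple check
        if str(value) not in parsed_text:
            missing.append(name)
    return missing


parsed_text = "Invoice INV-2025-0042\nTotal: 1,280.50 EUR"
suspects = fields_not_found_in_text(
    {"invoice_number": "INV-2025-0042", "invoice_date": "2025-06-01", "total_amount": 1280.5},
    parsed_text,
)
```

Fields listed in `suspects` are good starting points when reading the parsing output by hand.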


## Related guides

* [API quickstart](/tutorials/quickstart)
* [Parsing](/tutorials/parsing)
* [Extract guide](/tutorials/extract)
* [Quickstart polling flow](/tutorials/quickstart)
* [How to Build a JSON Extraction Schema](/tutorials/schemas/create_schema)
* [Anesya API Reference for Coding Agents](/tutorials/agent-guide)
* [API reference](/api/schema/extracts/extract_create)