
# Parsing

> Convert documents into OCR output, markdown, and page-level parsing data.


Parsing is the Anesya endpoint for turning a document into machine-usable content before structured extraction.

It can:

* run OCR on the document
* generate `markdown_content`
* return page-level parsing data in `ocr_content`
* describe images when picture descriptions are enabled
* report processing status and partial page failures


Use parsing when you need visibility into the document content itself, not only the final structured JSON.

If you only need the final structured result from a schema, see [Extract](/tutorials/extract).

## What parsing is for

Parsing is the right entry point when you want to:

* inspect OCR output before extraction
* get readable markdown for LLM workflows
* process one document or fan out over a list of public URLs
* detect partial page failures
* reuse a parsing in downstream steps


The typical flow is:


```mermaid
flowchart LR
    A[Document ID, URL, file, or URL list] --> B[Create parsing]
    B --> C[Poll parsing]
    C --> D[Use markdown or OCR]
    C --> E[Create extract]
```

## Quick start

The safest parsing workflow has three steps:

1. prepare one input document
2. create the parsing
3. poll it until the status is final


### Option A: Start from a stored document

Upload the file first if you want a reusable document ID.


```bash
curl -X POST "https://api.anesya.app/v0/documents" \
  -H "X-API-Key: $ANESYA_API_KEY" \
  -F "file=@invoice.pdf;type=application/pdf" \
  -F "filename=invoice.pdf"
```

Then create the parsing:


```bash
curl -X POST "https://api.anesya.app/v0/parsing" \
  -H "X-API-Key: $ANESYA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document": "YOUR_DOCUMENT_ID",
    "model": "PIGALLE",
    "picture_description_enabled": false,
    "table_verification_enabled": false
  }'
```

### Option B: Start directly from a public URL


```bash
curl -X POST "https://api.anesya.app/v0/parsing" \
  -H "X-API-Key: $ANESYA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document": "https://example.com/invoice.pdf",
    "model": "PIGALLE"
  }'
```

### Option C: Start directly from a local file


```bash
curl -X POST "https://api.anesya.app/v0/parsing" \
  -H "X-API-Key: $ANESYA_API_KEY" \
  -F "document=@invoice.pdf;type=application/pdf" \
  -F "model=PIGALLE" \
  -F "picture_description_enabled=false" \
  -F "table_verification_enabled=false"
```

### Poll the parsing


```bash
curl -X GET "https://api.anesya.app/v0/parsing/YOUR_PARSING_ID" \
  -H "X-API-Key: $ANESYA_API_KEY"
```

Stop polling when `status` is one of:

* `FINISHED`
* `PARTIAL_FINISHED`
* `ERROR`


For a complete polling strategy, see [API quickstart](/tutorials/quickstart).

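The polling step can be sketched in Python as follows. This is a minimal sketch, not an official client: the helper names `fetch_parsing` and `poll_parsing`, the 2-second interval, and the attempt cap are assumptions. Only the endpoint, the `X-API-Key` header, and the three final statuses come from this guide.

```python
import json
import time
import urllib.request
from typing import Callable

API_BASE = "https://api.anesya.app/v0"
# Final statuses listed in this guide; anything else means "keep polling".
FINAL_STATUSES = {"FINISHED", "PARTIAL_FINISHED", "ERROR"}


def fetch_parsing(parsing_id: str, api_key: str) -> dict:
    """GET /v0/parsing/{id} and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{API_BASE}/parsing/{parsing_id}",
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def poll_parsing(
    fetch: Callable[[], dict],
    interval_s: float = 2.0,   # assumed interval, tune for your workload
    max_attempts: int = 60,    # assumed cap so the loop cannot run forever
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `fetch` until the parsing reaches a final status, then return it."""
    for _ in range(max_attempts):
        parsing = fetch()
        if parsing.get("status") in FINAL_STATUSES:
            return parsing
        sleep(interval_s)
    raise TimeoutError(f"parsing not final after {max_attempts} attempts")
```

With a real key, `poll_parsing(lambda: fetch_parsing("YOUR_PARSING_ID", api_key))` blocks until the parsing is final. Passing `fetch` and `sleep` as callables keeps the loop easy to test without network access.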
## What you get back

The parsing retrieve endpoint returns a resource shaped like this:


```json
{
  "id": "d1b96998-f20e-4b6f-8fa5-78a70b1db9b2",
  "document": {
    "id": "300f339f-da71-4f9f-80f6-c25a63baae75",
    "filename": "invoice.pdf"
  },
  "picture_description_enabled": false,
  "table_verification_enabled": false,
  "model": "PIGALLE",
  "pictures": [],
  "ocr_content": {
    "pages": []
  },
  "markdown_content": "# Invoice\n\nInvoice number: INV-2025-0042\n\nTotal: 1280.50 EUR",
  "pages_total": 3,
  "pages_success": 3,
  "pages_failed": 0,
  "status": "FINISHED",
  "metadata": {
    "source": "api"
  },
  "error": null,
  "created_at": "2025-06-12T14:57:00.000000Z",
  "updated_at": "2025-06-12T14:58:12.000000Z"
}
```

### Key fields

| Field | What it contains |
|  --- | --- |
| `document` | The source document metadata |
| `markdown_content` | Readable markdown version of the parsed document |
| `ocr_content` | Structured OCR payload |
| `pictures` | Parsed image entries with optional descriptions |
| `pages_total` | Total page count |
| `pages_success` | Number of pages successfully processed |
| `pages_failed` | Number of failed pages |
| `status` | Current parsing state |
| `error` | Error message when parsing fails |


## Input options

The `document` field in `POST /v0/parsing` accepts four shapes.

### 1. Existing document ID

Best when the file is already stored in Anesya and may be reused later.


```json
{
  "document": "300f339f-da71-4f9f-80f6-c25a63baae75"
}
```

### 2. Public or pre-signed URL

Best when the file is already hosted elsewhere.


```json
{
  "document": "https://example.com/invoice.pdf"
}
```

### 3. Multipart uploaded file

Best when your app already has the local file content and does not need a separate document-upload step.


```text
multipart/form-data
document=@invoice.pdf
```

### 4. List of public or pre-signed URLs

Best when you want to fan out over several remote documents at once.


```json
{
  "document": [
    "https://example.com/doc-1.pdf",
    "https://example.com/doc-2.pdf"
  ]
}
```

Important behavior:

* this URL-list shape is supported on parsing
* it returns an **array of parsing objects**
* the extract endpoint does not document the same URL-list support


If you need one extract per URL, create the parsings first, then iterate over the returned parsing IDs.

## Response format details

The response shape depends on the endpoint you call and, for create, on the input you send.

### `POST /v0/parsing`

Create parsing returns:

* one parsing object for one document ID, one URL, or one uploaded file
* an array of parsing objects for a list of URLs


Example single-object response:


```json
{
  "id": "PARSING_ID",
  "status": "IN_QUEUE"
}
```

Example array response:


```json
[
  {
    "id": "PARSING_ID_1",
    "status": "IN_QUEUE"
  },
  {
    "id": "PARSING_ID_2",
    "status": "IN_QUEUE"
  }
]
```

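Because create can return either an object or an array, client code may want to normalize the response before using it. A minimal sketch, assuming the response has already been decoded from JSON; `as_parsing_list` and `parsing_ids` are hypothetical helper names, not part of the API.

```python
from typing import Any


def as_parsing_list(created: Any) -> list[dict]:
    """Normalize a decoded POST /v0/parsing response to a list of parsing objects."""
    if isinstance(created, dict):
        # Single document ID, single URL, or single uploaded file.
        return [created]
    if isinstance(created, list):
        # URL-list input: one parsing object per URL.
        return created
    raise TypeError(f"unexpected create-parsing response type: {type(created).__name__}")


def parsing_ids(created: Any) -> list[str]:
    """Return the parsing IDs regardless of which response shape came back."""
    return [parsing["id"] for parsing in as_parsing_list(created)]
```

Downstream code can then always iterate over a list, whatever input shape was sent.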
### `GET /v0/parsing`

List parsing returns a paginated resource:


```json
{
  "count": 1,
  "next": null,
  "previous": null,
  "results": [
    {
      "id": "PARSING_ID",
      "status": "FINISHED"
    }
  ]
}
```

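Walking every page of the list can be sketched as below. This assumes that `next` holds the URL of the following page and is `null` on the last page, which the example response suggests but this guide does not state outright. `iter_parsings` and `fetch_page` are hypothetical names.

```python
from typing import Callable, Iterator, Optional


def iter_parsings(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every parsing across all pages of GET /v0/parsing.

    `fetch_page(None)` must return the first page; `fetch_page(url)` must
    return the page found at `url` (assumed to be the value of `next`).
    """
    next_url: Optional[str] = None
    while True:
        page = fetch_page(next_url)
        yield from page.get("results", [])
        next_url = page.get("next")
        if not next_url:
            # `next` is null (or missing): this was the last page.
            break
```

Injecting `fetch_page` keeps the HTTP details, including the `X-API-Key` header, in one place and out of the pagination logic.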
### `GET /v0/parsing/{id}`

Retrieve parsing returns the full parsing resource, including:

* `markdown_content`
* `ocr_content`
* `pictures`
* page counters
* status
* errors


## Parsing statuses

These are the possible parsing states:

| Status | Meaning | What to do |
|  --- | --- | --- |
| `IN_QUEUE` | Accepted and waiting to start | keep polling |
| `IN_PROGRESS` | Processing is running | keep polling |
| `FINISHED` | Completed successfully | use the parsing output |
| `PARTIAL_FINISHED` | Completed with one or more failed pages | use it if partial content is acceptable |
| `ERROR` | Processing failed | inspect `error` |

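The status table maps naturally onto a small dispatch function. A minimal sketch: the status values come from the table, while the action labels (`poll`, `use_output`, `review_partial`, `inspect_error`) and the name `next_step` are hypothetical and only illustrate the branching.

```python
# Status values from the table above; action labels are illustrative only.
ACTIONS = {
    "IN_QUEUE": "poll",
    "IN_PROGRESS": "poll",
    "FINISHED": "use_output",
    "PARTIAL_FINISHED": "review_partial",
    "ERROR": "inspect_error",
}


def next_step(status: str) -> str:
    """Map a parsing status to the action a workflow should take next."""
    try:
        return ACTIONS[status]
    except KeyError:
        # Fail loudly on a status this guide does not list.
        raise ValueError(f"unknown parsing status: {status!r}") from None
```

Raising on unknown values, instead of defaulting to "poll", avoids a silent infinite loop if a status is misspelled or new.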

`PARTIAL_FINISHED` deserves its own handling. It is not a full success, but the output is often still usable, especially if:

* most pages succeeded
* the failed pages are not critical
* your next step is tolerant of partial source coverage

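One way to turn "most pages succeeded" into a rule is a success-ratio threshold over the page counters. A minimal sketch: `partial_is_acceptable` is a hypothetical helper, and the 0.9 default is an assumption to adjust per workflow, not a recommended value from the API.

```python
def partial_is_acceptable(parsing: dict, min_success_ratio: float = 0.9) -> bool:
    """Return True when enough pages succeeded for the workflow to continue.

    Uses the `pages_total` and `pages_success` counters from the parsing
    resource. The threshold is an assumed policy, not an API rule.
    """
    total = parsing.get("pages_total") or 0
    success = parsing.get("pages_success") or 0
    if total <= 0:
        # No page information: treat as not acceptable.
        return False
    return success / total >= min_success_ratio
```

A ratio alone cannot tell whether the failed pages were the critical ones, so workflows that depend on specific pages still need their own check.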

## Configuration options

The parsing endpoint exposes a deliberately small set of knobs.

### Model


```json
{
  "model": "PIGALLE"
}
```

Supported values:

* `PIGALLE`
* `PIGALLE_LITE`


Use `PIGALLE` by default unless you already know `PIGALLE_LITE` is sufficient for your workload.

### Picture descriptions


```json
{
  "picture_description_enabled": true
}
```

Enable this when image descriptions are useful for your workflow.

### Table verification


```json
{
  "table_verification_enabled": true
}
```

Enable this when table accuracy matters more than raw speed.

### Metadata


```json
{
  "metadata": {
    "source": "n8n",
    "customer_id": "cust_123"
  }
}
```

Use metadata to keep track of source systems, customer context, or workflow identifiers.

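The options above can be assembled into a single JSON request body. A minimal sketch: `build_parsing_payload` and `SUPPORTED_MODELS` are hypothetical names; the field names and model values come from this guide. The `False` defaults mirror the quick-start request and are not a documented server default.

```python
from typing import Optional, Sequence, Union

SUPPORTED_MODELS = ("PIGALLE", "PIGALLE_LITE")


def build_parsing_payload(
    document: Union[str, Sequence[str]],
    model: str = "PIGALLE",
    picture_description_enabled: bool = False,
    table_verification_enabled: bool = False,
    metadata: Optional[dict] = None,
) -> dict:
    """Build the JSON body for POST /v0/parsing.

    `document` is a document ID, a URL, or a list of URLs.
    """
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"model must be one of {SUPPORTED_MODELS}, got {model!r}")
    payload = {
        # Keep a single ID or URL as a string; copy a URL list into a plain list.
        "document": document if isinstance(document, str) else list(document),
        "model": model,
        "picture_description_enabled": picture_description_enabled,
        "table_verification_enabled": table_verification_enabled,
    }
    if metadata is not None:
        payload["metadata"] = dict(metadata)
    return payload
```

Serializing the result with `json.dumps` gives a body in the same shape as the curl examples in the quick start.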
## Best practices

### 1. Prefer a stored document when reuse matters

If the same file may be parsed again, extracted later, or audited, upload it first with `/v0/documents`.

This gives you a reusable document ID and a cleaner workflow than repeatedly uploading the same file.

### 2. Use parsing before extract when visibility matters

If you need OCR, markdown, pictures, or page-level status, do not jump directly to extract.

Create a parsing first, inspect it, then create the extract from the parsing ID.

### 3. Use URL-list fan-out only on parsing

If your source input is a list of URLs:

1. send the list to `/v0/parsing`
2. receive an array of parsing IDs
3. iterate over those IDs to create extracts if needed

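The fan-out steps can be sketched as a filter over the returned parsings. This is a sketch under stated assumptions: `usable_parsing_ids` is a hypothetical helper, and `wait_until_final` stands in for whatever polling function you use (it takes a parsing ID and returns the final parsing resource). How extracts are then created is left to the Extract guide.

```python
from typing import Callable, Iterable


def usable_parsing_ids(
    created: Iterable[dict],
    wait_until_final: Callable[[str], dict],
    allow_partial: bool = False,
) -> list[str]:
    """Wait for each created parsing and return the IDs worth extracting from."""
    accepted = {"FINISHED"}
    if allow_partial:
        # Opt in explicitly to parsings with failed pages.
        accepted.add("PARTIAL_FINISHED")
    usable = []
    for parsing in created:
        final = wait_until_final(parsing["id"])
        if final["status"] in accepted:
            usable.append(final["id"])
    return usable
```

Each returned ID can then be used to create one extract, so a failed URL does not block the rest of the batch.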

### 4. Treat `PARTIAL_FINISHED` as a distinct state

Do not collapse it into generic failure logic.

Instead, inspect:

* `pages_success`
* `pages_failed`
* `error`


Then decide whether the workflow can continue.

### 5. Enable picture and table options only when they help

Both `picture_description_enabled` and `table_verification_enabled` should be chosen intentionally.

If your use case does not depend on those features, keep them disabled for simpler requests.

### 6. Trim PDFs early when only a subset matters

If you only need part of a PDF, trim it during document upload with the `pdf_page_start` and `pdf_page_end` query parameters on [Create document](/api/schema/documents/document_create).

That reduces unnecessary downstream processing.

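Building the upload URL with the trim parameters might look like the sketch below. `document_upload_url` is a hypothetical helper; the parameter names come from this guide. Whether page numbers are zero- or one-based, and whether the end is inclusive, is not stated here, so check the Create document reference before relying on specific values.

```python
from typing import Optional
from urllib.parse import urlencode

DOCUMENTS_URL = "https://api.anesya.app/v0/documents"


def document_upload_url(
    pdf_page_start: Optional[int] = None,
    pdf_page_end: Optional[int] = None,
) -> str:
    """Return the POST /v0/documents URL with optional PDF trim parameters."""
    if (
        pdf_page_start is not None
        and pdf_page_end is not None
        and pdf_page_start > pdf_page_end
    ):
        raise ValueError("pdf_page_start must not be greater than pdf_page_end")
    params = {}
    if pdf_page_start is not None:
        params["pdf_page_start"] = pdf_page_start
    if pdf_page_end is not None:
        params["pdf_page_end"] = pdf_page_end
    # Only append a query string when at least one parameter is set.
    return f"{DOCUMENTS_URL}?{urlencode(params)}" if params else DOCUMENTS_URL
```

The resulting URL replaces the plain `/v0/documents` URL in the upload request; the multipart body stays the same.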
## Common pitfalls

### Assuming parsing is synchronous

`POST /v0/parsing` creates a resource and starts processing. It does not mean the final content is ready immediately.

Always poll the parsing resource until the status is final.

### Assuming `POST /v0/parsing` always returns one object

If `document` is a list of URLs, the response is an array.

This is one of the most common integration mistakes in no-code tools and coding agents.

### Passing a URL list to extract

That list behavior is documented on parsing, not on extract.

If you need multiple downstream extracts, create multiple parsings first.

### Ignoring `error`

When a parsing fails, do not branch only on `status`.

Always inspect the `error` field for debugging and retry decisions.

## Troubleshooting

### 401 Unauthorized

Your API key is missing or invalid. Check the `X-API-Key` header.

### Parsing remains in `IN_QUEUE` or `IN_PROGRESS`

Keep polling. Parsing is asynchronous by design.

### Parsing finishes with `PARTIAL_FINISHED`

Review `pages_failed` and decide whether partial output is acceptable for the workflow.

### Parsing fails with `ERROR`

Inspect the `error` field, then verify:

* the document input is valid
* the public URL is accessible
* the pre-signed URL is still valid
* the uploaded file is readable


### Public URL parsing fails unexpectedly

If the input uses a short-lived pre-signed URL, it may expire before or during processing.

Use a longer validity window or upload the file first.

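For pre-signed URLs that follow the AWS Signature Version 4 query format, the remaining validity can be estimated before submitting the URL. This sketch applies only to URLs carrying `X-Amz-Date` and `X-Amz-Expires` query parameters; other storage providers encode expiry differently. `presigned_seconds_left` is a hypothetical helper, and nothing here is part of the Anesya API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def presigned_seconds_left(url: str, now: Optional[datetime] = None) -> Optional[float]:
    """Return the seconds until a SigV4-style pre-signed URL expires.

    Returns None when the URL has no parseable X-Amz-Date / X-Amz-Expires
    parameters. A negative result means the URL has already expired.
    """
    query = parse_qs(urlsplit(url).query)
    try:
        signed_at = datetime.strptime(
            query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
        ).replace(tzinfo=timezone.utc)
        lifetime = timedelta(seconds=int(query["X-Amz-Expires"][0]))
    except (KeyError, ValueError):
        return None
    now = now or datetime.now(timezone.utc)
    return (signed_at + lifetime - now).total_seconds()
```

If the remaining time is shorter than your expected queue and processing time, re-sign the URL with a longer window or upload the file first.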
## Related guides

* [API quickstart](/tutorials/quickstart)
* [Parsing guide](/tutorials/parsing)
* [Extract guide](/tutorials/extract)
* [n8n parsing workflow](/tutorials/n8n/n8n-parsing)
* [Anesya API Reference for Coding Agents](/tutorials/agent-guide)
* [API reference](/api/schema/parsings/parsing_create)