# Towards a Reliance Layer in Document Agents

Six field signals and one missing artefact as of Q2 2026

Published 2026-05-12 · Placet Experiri · canonical: https://placetexperiri.com/posts/towards-a-reliance-layer-in-document-agents/

---
Agentic document work is the use of agent systems that handle documents as inputs, working materials and outputs inside administrative or knowledge workflows.

Innovation in the domain has proceeded in parallel with broader technological change around model and agent tooling. A community of builders, researchers, product teams and open-source maintainers is experimenting with newly opened technical possibilities at a restless and at times frantic pace. The results of their tinkering are routinely shared in engineering writeups or social media posts. Such signals are fragmentary but analytically relevant as empirical records, before their operational findings harden into vendor categories or institutional procedures.

In this article we provide a snapshot of the state of the field over Q1 and Q2 2026 with three evidence classes. We read social media posts for frontier builder behaviour (emerging practice, tacit vocabulary, recurring operational problems) and consider arXiv papers for ongoing innovation in the literature (named methods, benchmarks, formal mechanisms). Finally, we use official docs, product posts and engineering writeups to track the pace of industry adoption.

Field signals in 2026 tend to show that model capability is not yet enough to make document pipelines institutionally reliable, and that a layer of supporting work is accumulating around the model as a remedy. Several frontier models shipped during the same period,[^1] but reliability questions persisted across the release cycle.

Part I starts from six recurring problems in agentic document work and follows the field response to each one.

1. [Procedure skills](#1-procedure-skills). Procedures are difficult to repeat if they exist only in prompts or tacit routines.
2. [Attribution substrate](#2-attribution-substrate). Text extraction does not by itself make document output checkable. Claims, fields and source locations have to remain attached to the extracted content.
3. [Source interfaces](#3-source-interfaces). The document format has to trade off between stripping evidence and spending context on page structure.
4. [File-backed state](#4-file-backed-state). Long runs need durable state because decisions, open questions and accepted sources otherwise disappear with the context.
5. [Harness engineering](#5-harness-engineering). A model can propose dangerous actions if adequate checks, limits, or halting conditions are not enforced.
6. [Reliance layer](#6-reliance-layer). A document output is not institutionally usable without a properly informed review.

While the discipline is still rapidly evolving, the first five developments already produce maintained and actionable workflow objects. Part II tackles the unresolved sixth object, and attempts to describe an artefact for making agentic document work institutionally reliable.

## Part I: Field signals

Today's agents are capable of autonomously taking a topic, gathering its sources, reading them, and writing a long referenced report while checking in with the operator only occasionally.[^2] However, long reports tend to lose their thread, and later sections can forget what earlier sections established, which can collapse the entire pipeline. A long document run is only as reliable as the conditions that hold around it.

In agentic document work, usable output depends on maintained run conditions (procedure, source access, durable state, external checks and institutional criteria). Across those conditions, the field is still working through problems and partial repairs.

### 1. Procedure skills

Builders increasingly move recurring document procedures out of throwaway prompts and into reusable procedures. A **skill** is a procedure set down once, versioned and shared like code, then loaded by an agent at the start of a task. Domain experts can package multi-year practice into skills that agents reuse across tasks, and vendors can release capabilities as installable skills as they do with built-in features.[^3] An ecosystem has since formed around skills, with an open skill format shared across multiple agent products and public registries, where a single widely adopted skill's installs run to the hundreds of thousands.[^4]
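
As a minimal sketch of what "a procedure set down once and loaded at the start of a task" can look like, the snippet below splits a skill file into a small metadata header and its Markdown instructions. The header layout (`name` and `description` between `---` lines) follows the general shape of the open skill format, but the skill text, the `Skill` class, the `load_skill` helper and the simplified key-value parsing are illustrative assumptions, not a reference implementation of the specification.

```python
from dataclasses import dataclass

# Hypothetical skill text; real skills live in versioned files.
SKILL_TEXT = """---
name: invoice-triage
description: Classify incoming invoices and flag missing fields.
---
1. Read the invoice.
2. Check the supplier against the approved list.
"""


@dataclass
class Skill:
    name: str
    description: str
    instructions: str


def load_skill(text: str) -> Skill:
    """Split a skill file into header metadata and Markdown instructions."""
    # Expected layout: '---', header lines, '---', then the body.
    _, header, body = text.split("---\n", 2)
    meta = {}
    for line in header.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return Skill(meta["name"], meta["description"], body.strip())


skill = load_skill(SKILL_TEXT)
print(skill.name, "->", skill.description)
```

Because the procedure is plain text with a declared name, it can be diffed, reviewed and versioned like any other file in the repository.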

How much skill-writing transfers to a document-heavy organisation depends on how much of the organisation's procedure is already explicit. A written procedure translates into a skill almost directly, and can be handed to an agent in already executable form, while a procedure that relies on tacit knowledge has to be extracted first. Organisations that already have well-documented procedures start their automation processes ahead.

Once skills are installed, they become part of the dependency surface. Public skill ecosystems already contain malicious payloads and insecure skills that expose secrets, embed hidden instructions, or fetch executable content at runtime.[^5] OWASP collects those and adjacent risks in its Agentic Skills Top 10.[^6] An organisation that installs a skill is installing executable behaviour, and owes the skill the scrutiny it gives any other software.

Alongside skills, **connector protocols** define how an agent reaches external systems, including file stores, databases, and services. Skills tell the agent how to perform a task, and connector protocols expose the systems the task can use. The practical example is Model Context Protocol (MCP), an open standard for connecting AI applications to external data, tools, and workflows.[^7] Anthropic donated MCP to the Linux Foundation's Agentic AI Foundation, and later roadmap work describes protocol development through maintainers, working groups, and Specification Enhancement Proposals.[^8] This marks a clear division between shared infrastructure and institutional procedure. Connectors and parsers can often be bought off the shelf. Procedure is inherently local because it contains the organisation's rules, exceptions, vocabulary, and acceptance thresholds.

### 2. Attribution substrate

Extracted text is usable only when the output still shows where each claim or field came from. For years document parsing was treated as preprocessing, the boring step before the interesting retrieval, but that order has now reversed. Infrastructure vendors are repositioning around document understanding, on the stated reasoning that the further agents push into knowledge work, the more decisions need audit trails back to their source documents.[^9]

Parsing technology changed over the same period. Our field inquiry includes practitioner measurements where vision-language models beat conventional OCR without document-specific training on varied pages.[^10] A separate production guide estimates self-hosted VLM OCR at $700-1,000 for 10 million pages, enough to change the cost profile of archive-scale preprocessing.[^11] The leading edge has shifted to small, specialised document models, with open-weight options that outperform far larger generalists on parsing benchmarks,[^12] and production architectures standardise on two-tier routing, in which a fast local parser handles the bulk and only the genuinely hard pages escalate to a heavier model.[^11]
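
The two-tier routing described above can be sketched as a confidence check between two parsers. This is a hedged illustration only: the `ParseResult` type, the `route_page` function, the 0.85 threshold and both stand-in parsers are assumptions made for the example, and a production router would likely use richer signals than a single score.

```python
from typing import Callable, NamedTuple


class ParseResult(NamedTuple):
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the parser


Parser = Callable[[bytes], ParseResult]


def route_page(page: bytes, fast: Parser, heavy: Parser,
               threshold: float = 0.85) -> tuple[str, ParseResult]:
    """Try the fast parser first and escalate only when its confidence is low."""
    result = fast(page)
    if result.confidence >= threshold:
        return "fast", result
    return "heavy", heavy(page)


# Stand-in parsers for illustration; neither wraps a real OCR engine.
def fast_parser(page: bytes) -> ParseResult:
    clean = b"scan" not in page
    return ParseResult(page.decode(), 0.95 if clean else 0.40)


def heavy_parser(page: bytes) -> ParseResult:
    return ParseResult(page.decode(), 0.90)


pages = (b"typed memo", b"skewed scan")
tiers = [route_page(p, fast_parser, heavy_parser)[0] for p in pages]
print(tiers)
```

The threshold is the main tuning knob in this sketch: raising it sends more pages to the heavier model, trading cost for accuracy on hard pages.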

Extraction products changed accordingly. The frontier offering competes on attribution, with page references, bounding boxes and confidence scores aimed at keeping humans in the loop.[^13] The selling point is moving from parsing accuracy to review ergonomics. Production reports also tend to agree that the text layer produced by OCR is the input for everything downstream, and when transcription fails, the whole pipeline fails with it.[^14]
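
To make the attribution idea concrete, the sketch below keeps the page reference, bounding box and confidence score attached to each extracted field, so a reviewer can be pointed at the exact region that needs checking. The `AttributedField` type, the coordinate convention, the sample values and the 0.9 review threshold are all assumptions for illustration and do not reflect any vendor's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributedField:
    name: str
    value: str
    page: int                                # 1-based page number
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 in page units
    confidence: float                        # 0.0 to 1.0


def needs_review(field: AttributedField, threshold: float = 0.9) -> bool:
    """Flag fields whose extraction confidence falls below the threshold."""
    return field.confidence < threshold


# Hypothetical extraction output for a two-page invoice.
fields = [
    AttributedField("invoice_total", "1,240.00", 2, (410.0, 655.5, 520.0, 672.0), 0.98),
    AttributedField("due_date", "2026-07-01", 1, (88.0, 140.0, 190.0, 156.0), 0.71),
]

review_queue = [f.name for f in fields if needs_review(f)]
print(review_queue)
```

Keeping the location on the field, not in a separate log, is what lets a review tool open the right page and highlight the right region.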

### 3. Source interfaces

The first source decision is representational. The workflow chooses which version of the document enters context.

The field record shows three maintained source forms, web representations served from a URL, local source systems and derived views built over source material.

The first source is the web-served representation. Practitioners are increasingly looking for smaller, cleaner representations,[^15] and infrastructure can set that representation at two points in source access. Network operators such as Cloudflare act upstream when the same URL can answer different clients with different formats.[^16] Open-source crawlers and paid APIs act downstream, after the URL has become the starting point for a crawl or capture job.[^17][^18]

Both moves place the format decision inside source access. A server can negotiate Markdown before sending the page, and crawl endpoints can return Markdown, HTML, raw HTML, links, metadata, status fields, headers or schema-shaped JSON without naming one canonical target.[^16][^17][^18] Each representation preserves different evidence. Markdown reduces boilerplate, HTML preserves page structure, JSON serves extraction, and link or status fields tell the agent what it can reopen, follow, or treat as a failed fetch.

The practitioner debate is about information loss. Compact text or diagrams save context, source-produced Markdown can be more faithful than an automatic HTML-to-Markdown pass, and stripped outputs can remove links the agent needs for its next crawl step.[^15][^19] If representation is chosen at capture time, a richer output can be used when the agent needs page state to follow references, while a simpler one can be used when source text is enough.
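
Choosing the representation at capture time can be sketched as a request-building step. The example assumes a server that honours an `Accept: text/markdown` header, which is the negotiation mechanism the Cloudflare documentation describes; other servers may ignore it and return HTML. The `build_request` helper, the `need_page_state` flag and the example URL are hypothetical, and no network call is made here.

```python
import urllib.request


def build_request(url: str, need_page_state: bool) -> urllib.request.Request:
    """Prefer Markdown when text is enough; ask for HTML when links and
    page structure matter for the next crawl step."""
    if need_page_state:
        accept = "text/html"
    else:
        # Markdown first, HTML as a lower-priority fallback.
        accept = "text/markdown, text/html;q=0.8"
    return urllib.request.Request(url, headers={"Accept": accept})


text_only = build_request("https://example.com/policy", need_page_state=False)
with_links = build_request("https://example.com/policy", need_page_state=True)
print(text_only.get_header("Accept"))
print(with_links.get_header("Accept"))
```

A caller would still need to inspect the response's `Content-Type`, since the header expresses a preference and not a guarantee.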

The second source is the local source system. Inside knowledge systems, an application mediates source access even when no web endpoint negotiates the representation. Raw files give the agent content, while the application can expose relations the file tree does not carry, including links, tags, tasks, metadata, permissions and search structures. The evidence is narrow. In the corpus, an Obsidian CLI test reports faster searches through the application's maintained index, and Obsidian's own CLI exposes search, tasks and vault reads as commands.[^20][^21] The same product direction appears in document stores such as Google Drive and Box, whose MCP surfaces expose search, metadata, file content, permissions, extraction and governed access to agents.[^22]

A portable version of the same move appears in Google's Open Knowledge Format. OKF represents curated knowledge as a directory of Markdown files with YAML frontmatter, so agents and tools can consume the same knowledge bundle without a new runtime, service, or SDK. A field signal around the release described the same structure as a living wiki that agents can query or edit.[^23]

The third source is the derived view. In this form, another system reads the source material first and exposes a prepared result the agent can query or call. DeepWiki indexes repositories into wikis with diagrams, source links and grounded Q&A, and its MCP server exposes those wiki operations as tools.[^24][^25] In the literature, Doc2Agent turns API documentation into validated tool definitions that agents can call.[^25]

The shared pattern is that source access is maintained in the URL response, in the local application, or in a prepared layer over the material. For document agents, the source is the file plus the maintained access layer around it, including representations, relations, commands and derived views that help the agent find the right material and check what it found.[^22][^25]

### 4. File-backed state

The state problem appears when a long run has to preserve its decisions, open questions, and accepted sources after the conversation has moved on. Agent tooling and practitioner guidance have by now settled on solving this problem by moving state into files. The durable state lives in named Markdown files and project instructions that can be inspected and versioned with the rest of the workspace.[^26] Practitioners support this operational framework by recording decisions and open questions as they work, so that the next session or agent may start with the appropriate context.
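
A minimal sketch of file-backed state is an append-only Markdown file that the next session can read back. The file name `STATE.md`, the three entry kinds and the line format are assumptions chosen for the example; actual tools define their own memory file conventions.

```python
import tempfile
from datetime import date
from pathlib import Path

KINDS = {"decision", "open-question", "accepted-source"}


def record(state_file: Path, kind: str, entry: str, day: date) -> None:
    """Append one dated entry to the state file as a Markdown list item."""
    if kind not in KINDS:
        raise ValueError(f"unknown entry kind: {kind}")
    with state_file.open("a", encoding="utf-8") as fh:
        fh.write(f"- [{kind}] {day.isoformat()}: {entry}\n")


def load(state_file: Path, kind: str) -> list[str]:
    """Return all entries of one kind, in the order they were written."""
    prefix = f"- [{kind}] "
    lines = state_file.read_text(encoding="utf-8").splitlines()
    return [line[len(prefix):] for line in lines if line.startswith(prefix)]


# A temporary directory stands in for the project workspace.
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "STATE.md"
    record(state, "decision", "Use the 2025 policy as baseline", date(2026, 5, 12))
    record(state, "open-question", "Is annex B still in force?", date(2026, 5, 12))
    open_questions = load(state, "open-question")

print(open_questions)
```

Because the state is an ordinary text file, it can be committed alongside the workspace and inspected without running the agent.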

Subagents address the same drift problem while also making parallel research easier. Claude Code documents subagents as delegated workers with separate context windows, so high-volume work can stay outside the main conversation and return only a summary.[^27] Published experiments go a step further and remove context management from the model altogether. A deterministic engine sits outside the model, compresses the history, and keeps a pointer to every original passage. That engine outperformed agents managing their own context at every length tested.[^28]
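
The idea of an external engine that compresses history while keeping a pointer to every original passage can be illustrated with a toy structure. This is not the method from the cited experiments: the `History` class is hypothetical, and plain truncation stands in for whatever compression a real engine would apply. What the sketch preserves is the property that matters, namely that nothing is dropped and every compacted item can be expanded back to its source.

```python
from dataclasses import dataclass


@dataclass
class Compacted:
    summary: str
    pointer: int  # index of the original passage in the archive


class History:
    def __init__(self, keep_recent: int = 2, summary_chars: int = 20):
        self.archive: list[str] = []  # every original passage, never dropped
        self.keep_recent = keep_recent
        self.summary_chars = summary_chars

    def add(self, passage: str) -> None:
        self.archive.append(passage)

    def context(self) -> list:
        """Return compacted older passages followed by recent ones in full."""
        cut = max(len(self.archive) - self.keep_recent, 0)
        older = [Compacted(p[: self.summary_chars], i)
                 for i, p in enumerate(self.archive[:cut])]
        return older + self.archive[cut:]

    def expand(self, item: Compacted) -> str:
        """Follow the pointer back to the full original passage."""
        return self.archive[item.pointer]


history = History()
passages = [
    "Clause 4.2 caps reimbursement at 150 EUR per night.",
    "The reviewer accepted the 2025 policy as the baseline.",
    "Annex B status is still unresolved.",
]
for p in passages:
    history.add(p)

ctx = history.context()
print(ctx[0].summary, "->", history.expand(ctx[0]))
```

The procedure is deterministic: the same archive always yields the same context, which makes a run easier to replay and debug.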

Tool documentation and empirical surveys now describe a recurring decomposition into memory files, working files and subagents, but the pattern is still provisional. Context files are widely used, and repository-level evaluations report mixed effects. Unnecessary instructions can lower task success while raising cost, so the useful state file is a maintained interface with selected decisions, open questions, accepted sources and update rules, not a larger context dump.[^29][^30]

The stable pattern is that context strategies survive when they are inspectable and rerunnable from disk, because inspected state can also be debugged, audited, and handed to the next session, or to the next model.[^31]

### 5. Harness engineering

The control problem deals with model actions that another layer has to execute, check, limit, or halt. The scaffolding around the model has become known as **harness engineering**. In this arrangement, the model drafts each action, while the harness executes it, checks the result and applies the run's limits on tool use, write access and approval.
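
The division of labour above, where the model drafts and the harness decides, can be sketched as a check that runs before any action executes. The tool names, the `drafts/` write root, the approval set and the step limit are all assumptions for illustration, and the prefix test on paths is deliberately naive.

```python
from dataclasses import dataclass


@dataclass
class Action:
    tool: str
    target: str


ALLOWED_TOOLS = {"read_file", "write_file"}  # hypothetical tool allowlist
WRITE_ROOT = "drafts/"                        # hypothetical writable area


def check(action: Action, approved: bool) -> str:
    """Return 'run', 'needs-approval' or 'blocked' for a drafted action."""
    if action.tool not in ALLOWED_TOOLS:
        return "blocked"
    if action.tool == "write_file":
        # Naive prefix test; a real harness would resolve and normalise paths.
        if not action.target.startswith(WRITE_ROOT):
            return "blocked"
        if not approved:
            return "needs-approval"
    return "run"


def run_harness(drafted, approved=frozenset(), max_steps=10):
    """Apply the checks to each drafted action and halt at the step limit."""
    log = []
    for step, action in enumerate(drafted):
        if step >= max_steps:
            log.append(("halt", "step limit reached"))
            break
        log.append((check(action, action.target in approved), action.tool))
    return log


drafted = [
    Action("read_file", "inbox/a.pdf"),
    Action("write_file", "drafts/report.md"),
    Action("delete_file", "inbox/a.pdf"),
]
log = run_harness(drafted)
print(log)
```

The model never sees the allowlist as a suggestion in this arrangement; the limits are enforced outside it, whatever the draft proposes.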

Harness improvements can move capability at the system level without changing the model: a system's benchmark results may improve when its harness changes while holding the model constant.[^32] OpenAI describes the dependence between agent performance and the working environment at repository scale. Early Codex progress slowed because the environment lacked the tools, abstractions and internal structure the agent needed, so the team moved engineering effort into repository structure, feedback loops, validation and guardrails around Codex.[^33]

How long an agent can run unattended depends in part on whether the agent can check its own work.[^34] Some practitioners now package that review loop as a reusable skill, which runs a reviewer over the work, applies the findings, and repeats until the reviewer returns clean.[^35]
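
The review loop described above can be sketched as a bounded iteration that stops when the reviewer reports no findings. The `review_until_clean` function, the round limit and the toy reviewer and fixer are hypothetical; in practice both callables would be model or tool invocations.

```python
from typing import Callable


def review_until_clean(work: str,
                       review: Callable[[str], list],
                       apply: Callable[[str, list], str],
                       max_rounds: int = 5):
    """Review, apply findings, and repeat until clean or out of rounds.

    Returns (work, rounds_used, clean), where clean is True only if the
    final review returned no findings.
    """
    for round_no in range(1, max_rounds + 1):
        findings = review(work)
        if not findings:
            return work, round_no, True
        work = apply(work, findings)
    return work, max_rounds, False


# Toy reviewer and fixer, for illustration only.
def toy_review(text: str) -> list:
    findings = []
    if "TODO" in text:
        findings.append("unresolved TODO")
    if "  " in text:
        findings.append("double space")
    return findings


def toy_apply(text: str, findings: list) -> str:
    if "unresolved TODO" in findings:
        text = text.replace("TODO", "see section 2")
    if "double space" in findings:
        text = text.replace("  ", " ")
    return text


final, rounds, clean = review_until_clean("Totals  match. TODO", toy_review, toy_apply)
print(final, rounds, clean)
```

The round limit matters as much as the loop: without it, a reviewer that never returns clean would keep an unattended run going indefinitely.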

The harness argument has also moved upward into model-adjacent systems. Claude Code's dynamic workflows let Claude write a task-specific harness around the run,[^36] and engineering accounts treat tool loops, compaction and iterative verification as part of the operating environment rather than wrapper code alone.[^37] The scaffolding for short contexts became less central, and in its place came memory policies for runs that span days, coordination between agents, and enforcement hooks that block destructive operations and require an approval before anything ships.[^38] The centre of gravity now sits in control, in the questions of who decides, what gets checked and what may happen when the agent is unattended.

Harness planning is also contested, as some practitioners read heavy upfront plans as waterfall development reborn, where the full design is fixed before any work begins.[^39] A less disputed claim is that harness design should follow failures observed in use, while verification should precede any agent action that changes a file, sends an answer, or enters production.[^33][^40]

### 6. Reliance layer

Reliance begins after the model has produced an answer, when the institution still has to decide whether the output can enter its work.

No settled artefact in the field yet combines evidence, validation, review responsibility, approval scope and permitted use. Adjacent solutions exist, but they still scatter the work across a multitude of objects (eval suites, production traces, review queues, approval gates, provenance records, policy cards, audit trails, evidence packages, etc.[^41]). None plays the same role that skills play for procedure, parsers for attribution, context files for state and harnesses for control.

The gap is already visible in production, where generated answers can reach users before validation, review, and regression tests catch the relevant failures.[^14][^40]

A generated explanation can make an answer look finalised even when no external check has occurred. Self-explanation research treats faithfulness as a property measured outside the explanation, because the model's account is another generated output with no access to the process it claims to describe.[^42] Code shows the distinction because the generated artefact can be tested outside the model. Scientific-code benchmarks compile, execute, time, and review generated code against expected behaviour and domain conventions.[^43] Document output has no equivalent compiler, so the outside check has to be a record attached to the result.[^41][^44]

Evaluation suites add a quieter risk. They describe the world at the moment their cases were written. A prompt can pass that suite, ship, and keep passing after policies, sources, or user behaviour change, even while production answers degrade.[^45] The field response is to keep evals alive and prompts versioned, with production traces feeding new cases and products proposing fixes.[^40]

Production reports put substantial work into exception routing, validation, review queues, and audit documentation.[^14] The expense does not stop at inference. The apparatus that makes generation usable needs its own budget.

The candidate artefacts point to the same requirement. A usable output needs a point-of-use record of its sources, checks, review, and approved scope.

## Part II: The case for a reliance artefact

Part I shows five conversions from run behaviour into maintained workflow objects. A procedure becomes a skill file, source access becomes an interface, attribution becomes a parsing substrate, state becomes files, and control becomes a harness. Each conversion gives later cases something explicit to inherit.

### The missing maintained object

The remaining condition deals with institutional permission around the output. The field already contains records that cover part of the job, including evidence packages, policy cards, gates, production traces, and audit trails.[^41] These scattered artefacts, however, have not yet hardened into one maintained artefact that travels with the output and states its permitted use.


<figure class="article-exhibit article-exhibit-table">
  <table>
    <thead>
      <tr>
        <th>Field signal</th>
        <th>Workflow condition</th>
        <th>Maintained artefact</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Procedure skills</td>
        <td>procedure</td>
        <td>authored task files and local operating rules</td>
      </tr>
      <tr>
        <td>Attribution substrate</td>
        <td>source attachment</td>
        <td>extraction pipelines with source locations</td>
      </tr>
      <tr>
        <td>Source interfaces</td>
        <td>source access</td>
        <td>approved source sets and served representations</td>
      </tr>
      <tr>
        <td>File-backed state</td>
        <td>continuity</td>
        <td>context files, decisions, open questions, and history</td>
      </tr>
      <tr>
        <td>Harness engineering</td>
        <td>control</td>
        <td>execution, validation, permissions, and stopping rules</td>
      </tr>
      <tr>
        <td>Reliance layer</td>
        <td>institutional use</td>
        <td>TBD</td>
      </tr>
    </tbody>
  </table>
  <figcaption>Five conditions already have inherited artefacts, while institutional use remains the open slot.</figcaption>
</figure>


The table sets a boundary for the proposed artefact. A reliance-specific object does not replace the artefacts named in the first five rows. Rather, it points to them and records the permitted use that follows from their evidence.

### The distributed validation target

Code is the easier reliance case because generated work has an executable target. A generated patch can compile, run, fail tests, or violate conventions, and scientific-code benchmarks use those external checks instead of the model's own explanation.[^43] Document work has to build the check out of workflow records. Production document systems route low-confidence extractions or missing keys to human review, and confidence scores can decide whether a field is accepted automatically or flagged for review.[^44] A report, extraction, or decision has to show source evidence, satisfy policy criteria, and enter a system of record through a review decision.

Business process research already treats workflows as designed and analysable objects rather than loose sequences of tasks.[^46] Agentic document work inherits that premise, but the workflow now includes model-facing machinery that determines how sources are read, how state persists, and how validation records are written.[^47] That makes the workflow itself a review target. Reusable rules create their own drift risk. A case can pass today under conditions that no longer hold for the next source set, policy update, or approval path.[^46][^48]

### Trace density and missing judgment

Code changes usually carry diffs, review comments, test results and issue history. Those records give agents a dense trace environment because they preserve both the outcome and the reasons a change was accepted. Document-heavy institutions often lack such a substrate. The final form, answer, or decision may be archived, while the source passage, validation result, or reviewer judgment may sit in another system or disappear altogether.

*Trace density* is the ratio of recorded reasoning to recorded outcomes in a domain.[^49] According to this metric, code has high trace density because the accepted change is usually stored with the evidence that made it acceptable. Administrative archives can show the opposite pattern. They keep forms, approvals, and filed answers, but often separate those outcomes from the reasons and checks that made them acceptable.
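
As a rough illustration of the ratio, the sketch below counts reasoning records against outcome records in two small collections. The source defines only the ratio itself; the record kinds, their classification into reasoning or outcome, and the sample data are assumptions made for the example.

```python
# Hypothetical classification of record kinds.
REASONING = {"review_comment", "test_result", "source_passage", "validation_result"}
OUTCOME = {"merged_change", "approval", "filed_answer"}


def trace_density(kinds: list) -> float:
    """Ratio of recorded reasoning to recorded outcomes."""
    reasoning = sum(k in REASONING for k in kinds)
    outcomes = sum(k in OUTCOME for k in kinds)
    if outcomes == 0:
        raise ValueError("no recorded outcomes")
    return reasoning / outcomes


# One merged change stored with its review and test evidence.
code_repo = ["merged_change", "review_comment", "review_comment", "test_result"]
# Two archived outcomes with a single surviving validation record.
archive = ["approval", "filed_answer", "validation_result"]

print(trace_density(code_repo), trace_density(archive))
```

Under these assumptions the repository scores 3.0 and the archive 0.5, which matches the qualitative contrast drawn in the text without claiming any measured values.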

We argue that the reliance layer should be hardened by increasing trace density around the output. A derivative can record its source, version, transformation, and actor.[^50] An attributed claim can point to the source passage that supports it.[^51] A reviewed field can record who checked it, which validations ran, and what was approved.[^44] These records do not recover all judgment, but they give later review a concrete stack trace to inspect.

### The permitted-use record

Trace can support review, but reliance starts only when the institution approves a bounded use. The institution still has to decide whether a reviewed output remains a draft, enters a file, or supports an external action such as payment or publication.[^52] An evidence account is the broad record behind that decision. It ties the output to source evidence, validation state, review state, and unresolved exceptions. The output-attached part of that account is the **reliance artefact**, or reliance record. It travels with one output version. Its central field is the permitted use, which states what the institution may do with the output and points back to the evidence that justifies that use.

<figure class="article-exhibit">
  <img src="/reliance-artifact-json-map.png" alt="Reliance record schema sketch with output, evidence, validation, reviewer, approval authority, versioned policy reference, permitted use, and reopening fields mapped to review questions" />
  <figcaption>Illustrative reliance-record fields map the output to evidence, authority, versioned policy and reopening checks.</figcaption>
</figure>

Existing work already covers parts of that record. EviBound treats evidence as a gate on generation, PROV-AGENT records agent provenance, and Policy Cards specify policy and evidence requirements for use.[^41] These components make a reliance artefact plausible, but they do not yet do its job. None sits with one output version and states the approved use together with the source, policy, or workflow changes that reopen review.
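
A reliance record of this kind might be sketched as a small immutable structure attached to one output version. Everything here is illustrative: the field names, the three permitted-use levels and their ordering, and the `permits` and `still_valid` helpers are assumptions derived from the description above, not an existing schema.

```python
from dataclasses import dataclass

# Assumed ordering, from least to most consequential use.
USE_LEVELS = ["draft", "file", "external-action"]


@dataclass(frozen=True)
class RelianceRecord:
    output_id: str
    output_version: str
    evidence: tuple            # pointers to supporting source passages
    validations: tuple         # names of checks that passed
    reviewer: str
    approval_authority: str
    policy_ref: str            # versioned policy reference
    permitted_use: str         # one of USE_LEVELS
    reopen_on: tuple           # changes that reopen review


def permits(record: RelianceRecord, use: str) -> bool:
    """A use is permitted if it is no more consequential than the approved one."""
    return USE_LEVELS.index(use) <= USE_LEVELS.index(record.permitted_use)


def still_valid(record: RelianceRecord, current_policy: str, changed: set) -> bool:
    """The record stops justifying reliance once its policy version or a
    watched input has changed."""
    if record.policy_ref != current_policy:
        return False
    return not (set(record.reopen_on) & changed)


record = RelianceRecord(
    output_id="claim-0042", output_version="v1",
    evidence=("receipt.pdf#page=1",), validations=("amount-matches-receipt",),
    reviewer="reviewer-a", approval_authority="finance-lead",
    policy_ref="travel-policy@v3", permitted_use="file",
    reopen_on=("source-updated", "approval-path-changed"),
)

print(permits(record, "draft"), permits(record, "external-action"))
print(still_valid(record, "travel-policy@v4", set()))
```

The design choice worth noting is that the reopening conditions live on the record itself, so a downstream system can tell when an earlier approval no longer applies without consulting the original reviewer.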

### Revision loop for drift

Sources, policies and approval criteria keep changing, so the reliance artefact needs continuous revision. To capture the temporal logic of such adjustments, we may split the agentic document workflow into three repeatable operations.

The **run** produces one output and its trace. **Verification** checks that output against source evidence, permissions and review criteria. **Revision** changes the workflow version that new runs will inherit. Adjacent software practice and self-adaptive-systems research already use linked feedback loops for runtime control, and the Q2 2026 field record makes the same shape visible in agentic document work.[^48]

The field examples supply different parts of that loop. Production document pipelines make run and verification visible through exception routing, validation, review queues and audit documentation.[^14] Agent-evaluation guidance supplies revision when production traces become new cases for later runs.[^40] Harness accounts show the same revision pattern at repository scale, where failures trigger changes in tools, abstractions, feedback loops and future controls.[^33]

Such a cyclical approach may be used to repair reliance drift. A review finding can become a new eval case, a validator, a source rule, a procedure change, or an update to the reliance artefact. The later case is then checked against a failure the earlier case exposed.[^40] Human judgment remains inside the cycle in two places: a reviewer can approve one output during verification, and the institution can revise the rules that govern future cases.
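
The step from review finding to inherited check can be sketched with a versioned workflow whose eval cases grow over time. The `Workflow` type, the reimbursement amounts, the 150 limit and both toy run functions are hypothetical and serve only to show the shape of the loop.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    version: int = 1
    eval_cases: list = field(default_factory=list)  # (input, expected) pairs


def revise(workflow: Workflow, case_input, expected) -> Workflow:
    """Turn a review finding into a regression case that later runs inherit."""
    return Workflow(workflow.version + 1,
                    workflow.eval_cases + [(case_input, expected)])


def regressions(workflow: Workflow, run) -> list:
    """Return the inputs on which a run function disagrees with the cases."""
    return [inp for inp, expected in workflow.eval_cases if run(inp) != expected]


# Toy run functions: v1 approves everything, v2 escalates large amounts.
def run_v1(amount: int) -> str:
    return "approve"


def run_v2(amount: int) -> str:
    return "escalate" if amount > 150 else "approve"


wf1 = Workflow()
# A reviewer finds that a 180 request should have been escalated.
wf2 = revise(wf1, 180, "escalate")

print(wf2.version, regressions(wf2, run_v1), regressions(wf2, run_v2))
```

The earlier failure is now a standing check: any later run function has to pass the case that the review exposed before it inherits the workflow.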

<figure class="article-exhibit">
  <img src="/reimbursement-reliance-sequence.png" alt="Sequence diagram of a reimbursement reliance process where a policy or interface change opens workflow revision before a later request is checked against workflow v2" />
  <figcaption>Review findings become workflow revisions, and later cases run against the revised conditions.</figcaption>
</figure>

A reliance process is governed over time. The institution accepts one output, records why, and updates the workflow when the record exposes a gap. Governance enters the loop when a review finding changes the conditions future runs must satisfy.

## Conclusion

Part I followed a migration from run behaviour to workflow objects. Workflows around procedures, source access, attribution, state and control become usable when teams can inspect and revise them. Part II deals with the remaining gap at the point where a reviewed output becomes something an institution may use. The next field signal is systems moving reliance decisions from review practice into fully formed workflow infrastructure.

---

[^1]: [Frontier model releases, spring 2026](https://aiflashreport.com/model-releases.html).
[^2]: [Naskręcki, Feb 2026](https://x.com/nasqret/status/2023168173722222757); [Zhao, Feb 2026](https://x.com/GenAI_is_real/status/2023313199765070095).
[^3]: [Agent Skills, 2026](https://agentskills.io/specification).
[^4]: [Vercel, Jan 2026](https://vercel.com/changelog/introducing-skills-the-open-agent-skills-ecosystem); [skills.sh, 2026](https://www.skills.sh/).
[^5]: [Snyk ToxicSkills, Feb 2026](https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/).
[^6]: [OWASP Agentic Skills Top 10, 2026](https://owasp.org/www-project-agentic-skills-top-10/).
[^7]: [MCP docs, 2026](https://modelcontextprotocol.io/docs/getting-started/intro).
[^8]: [Soria Parra, Dec 2025](https://blog.modelcontextprotocol.io/posts/2025-12-09-mcp-joins-agentic-ai-foundation/); [MCP roadmap, 2026](https://blog.modelcontextprotocol.io/posts/2026-mcp-roadmap/).
[^9]: [Liu, Jun 2026](https://x.com/jerryjliu0/status/2064479193988206933).
[^10]: [Dasanaike, Mar 2026](https://x.com/dasanaike/status/2030039366068772952).
[^11]: [Dubrov, Mar 2026](https://slavadubrov.github.io/blog/2026/03/04/the-definitive-guide-to-ocr-in-2026-from-pipelines-to-vlms/).
[^12]: [OmniDocBench, 2026](https://ofox.ai/blog/best-ai-model-for-ocr-2026/).
[^13]: [Liu, Feb 2026](https://x.com/jerryjliu0/status/2023813440712917488).
[^14]: [Alan engineering, Mar 2026](https://medium.com/alan/lessons-from-running-an-llm-document-processing-pipeline-in-production-33d87f99cdb1).
[^15]: [Man, May 2026](https://x.com/int32max/status/2054890146948882909).
[^16]: [Cloudflare docs, 2026](https://developers.cloudflare.com/fundamentals/reference/markdown-for-agents/).
[^17]: [Crawl4AI CrawlResult docs, 2026](https://docs.crawl4ai.com/api/crawl-result/).
[^18]: [Firecrawl crawl docs, 2026](https://docs.firecrawl.dev/api-reference/endpoint/crawl-post).
[^19]: [Ubl, Feb 2026](https://x.com/cramforce/status/2022781406355878121).
[^20]: [Cincotta, Feb 2026](https://x.com/drrobcincotta/status/2022210753575760293).
[^21]: [Obsidian CLI docs, 2026](https://obsidian.md/cli); [Newton, Feb 2026](https://benenewton.com/blog/your-ai-agent-already-had-file-access-heres-why-obsidian-cli-changes-everything-anyway).
[^22]: [Google Drive MCP docs, 2026](https://developers.google.com/workspace/drive/api/guides/configure-mcp-server); [Box MCP tools docs, 2026](https://developer.box.com/guides/box-mcp/tools).
[^23]: [Google Cloud OKF, 2026](https://cloud.google.com/blog/products/data-analytics/how-the-open-knowledge-format-can-improve-data-sharing); [Haynes, Jun 2026](https://x.com/Marie_Haynes/status/2065531158356717721).
[^24]: [Cognition, May 2025](https://cognition.ai/blog/deepwiki); [DeepWiki docs, 2026](https://docs.devin.ai/work-with-devin/deepwiki).
[^25]: [DeepWiki MCP docs, 2026](https://docs.devin.ai/work-with-devin/deepwiki-mcp); [Doc2Agent, 2025](https://arxiv.org/abs/2506.19998).
[^26]: [Claude Code memory docs, 2026](https://code.claude.com/docs/en/memory).
[^27]: [Claude Code subagents docs, 2026](https://code.claude.com/docs/en/sub-agents).
[^28]: [Ehrlich and Blackman, 2026](https://arxiv.org/abs/2605.04050).
[^29]: [Galster et al., 2026](https://arxiv.org/abs/2602.14690); [Gloaguen et al., 2026](https://arxiv.org/abs/2602.11988).
[^30]: [Yang et al., 2026](https://arxiv.org/abs/2602.05665).
[^31]: [Daugherty, May 2026](https://x.com/masondrxy/status/2053717333433340034).
[^32]: [LangChain, Feb 2026](https://x.com/LangChain/status/2025368775780925654); [Agentic Harness Engineering, 2026](https://arxiv.org/abs/2604.25850).
[^33]: [OpenAI, Feb 2026](https://openai.com/index/harness-engineering/); [Böckeler, Apr 2026](https://martinfowler.com/articles/harness-engineering.html).
[^34]: [kepano, Feb 2026](https://x.com/kepano/status/2021999824472879510).
[^35]: [Steinberger, May 2026](https://x.com/steipete/status/2054850632067019173).
[^36]: [Anthropic, Jun 2026](https://claude.com/blog/a-harness-for-every-task-dynamic-workflows-in-claude-code).
[^37]: [Willison, 2026](https://simonw.substack.com/p/agentic-engineering-patterns).
[^38]: [Osmani, May 2026](https://www.oreilly.com/radar/agent-harness-engineering/).
[^39]: [Zechner, May 2026](https://x.com/badlogicgames/status/2052462922350071943).
[^40]: [Anthropic engineering, 2026](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents); [LangSmith Engine, May 2026](https://x.com/hwchase17/status/2054657397902455060).
[^41]: [EviBound, 2025](https://arxiv.org/abs/2511.05524); [PROV-AGENT, 2025](https://arxiv.org/abs/2508.02866); [Policy Cards, 2025](https://arxiv.org/abs/2510.24383).
[^42]: [Madsen et al., 2024](https://arxiv.org/abs/2401.07927).
[^43]: [Tian et al., 2024](https://arxiv.org/abs/2407.13168); [Zhang et al., 2026](https://arxiv.org/abs/2603.15976).
[^44]: [Amazon A2I Textract docs, 2026](https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-textract-task-type.html); [Microsoft Document Intelligence confidence docs, 2026](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept/accuracy-confidence).
[^45]: [elvis, Mar 2026](https://x.com/omarsar0/status/2029225624825659668).
[^46]: van der Aalst and van Hee, *Workflow Management: Models, Methods, and Systems*, MIT Press, 2002.
[^47]: [Popova et al., 2013](https://arxiv.org/abs/1303.2554).
[^48]: [Arcaini et al., 2015](https://cs.unibg.it/scandurra/papers/seams2015_cameraReady.pdf); [Böckeler, 2026](https://martinfowler.com/articles/exploring-gen-ai/humans-and-agents.html).
[^49]: [Koratana, Mar 2026](https://x.com/akoratana/status/2032119242276188424).
[^50]: [W3C PROV Overview, 2013](https://www.w3.org/TR/prov-overview/).
[^51]: [Schreieder et al., 2025](https://arxiv.org/html/2508.15396v1).
[^52]: [NIST AI RMF Core, 2023](https://airc.nist.gov/airmf-resources/airmf/5-sec-core/); [Cobbe et al., 2021](https://arxiv.org/abs/2102.04201).