A cross-domain reference architecture for AI in operationally critical contexts: four layers (with L2 split into classical ML and LLM validators), 17 trust metrics, including the Computational Parsimony Ratio, formalised here for the first time as a first-class metric, and three reference instances (clinical, industrial multi-domain, judicial).
TRACE organises agentic AI systems into a four-layer reference architecture with an explicit split of the learned tier into classical ML (L2a) and LLM validators (L2b) — a stateful orchestration policy (L3) sits over the L2 inventory, and human supervision (L4) carries measurable load.
The framework is grounded in established measurement science (GUM, VIM, ISO/IEC 17025) and treats trust as engineered and measured, not declared. Five acronymic principles (Trustworthy · Reasoned · Accountable · Context-bound · Escalated) are disciplined by an internal design constraint, Model Parsimony, quantified through the Computational Parsimony Ratio (CPR): the first complexity-performance trade-off to be formalised as a first-class metric in trustworthy AI.
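The source defines CPR only as a quantified complexity-performance trade-off; a minimal sketch of what such a ratio could look like is given below, assuming a simple "performance per unit of normalised compute cost" formula. The class and field names are illustrative, not part of the TRACE specification.

```python
from dataclasses import dataclass

@dataclass
class CandidateModel:
    name: str            # e.g. an L2a classical model or an L2b LLM validator
    performance: float   # validated task metric in [0, 1], e.g. F1
    compute_cost: float  # normalised cost per inference, > 0

def cpr(model: CandidateModel) -> float:
    """Hypothetical CPR: performance delivered per unit of compute cost."""
    return model.performance / model.compute_cost

# Under this sketch, a cheap classical model can out-score an LLM on CPR
# even when its raw performance is slightly lower:
classical = CandidateModel("gbm-classifier", performance=0.91, compute_cost=1.0)
llm = CandidateModel("llm-validator", performance=0.94, compute_cost=40.0)
assert cpr(classical) > cpr(llm)
```

Such a ratio makes the L2a / L2b selection an explicit, auditable design decision rather than a default to the largest available model.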
Three instantiations — clinical decision support (Instance A), an industrial multi-domain platform (Instance B), and a judicial decision-support extension (Instance C) — demonstrate domain neutrality. The architecture provides the structural base for layer-wise GUM-style uncertainty propagation toward formal certification.
Rows are architectural layers (with L2 split into classical ML and LLM validators); columns are reference instances. Each cell names the concrete artefact that fills the layer in that instance.
The TRACE acronym reflects five user-visible properties. Model Parsimony is a quantified internal design constraint that disciplines L2a / L2b selection.
Every prescriptive action carries a machine-readable evidence chain — data → inference → decision.
Human oversight is an architectural layer with measurable load and override rights — not a cosmetic safety net.
Authority is earned through accumulated stability data and explicit qualification — not granted by default at release.
Input context is explicitly specified, dated, and refreshed as part of the safety envelope.
Each quality property is specified, measured, calibrated, and monitored over time.
The type of learned component (classical ML, specialised neural network, LLM, hybrid) is chosen by task fit — not by LLM presumption.
Internal design constraint — quantified via CPR, not visible in the TRACE acronym.
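The evidence-chain property above (data → inference → decision, machine-readable) can be sketched as hash-linked decision records. The field names and the hash-linking scheme are assumptions for illustration, not the framework's actual record format.

```python
import hashlib
import json

def make_record(data_ref: str, inference: str, decision: str,
                prev_hash: str = "") -> dict:
    """Build one evidence record and link it to its predecessor by hash."""
    body = {"data": data_ref, "inference": inference,
            "decision": decision, "prev": prev_hash}
    # Hash is computed over the canonical JSON of the body, then attached.
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    return body

# Hypothetical two-step chain: an L2 inference escalated to L4, then
# a human override recorded against it.
r1 = make_record("lab-panel/2024-05-01", "risk=high (L2a model)", "escalate to L4")
r2 = make_record("clinician-review", "override confirmed", "treat",
                 prev_hash=r1["hash"])
# Tampering with r1 invalidates the link stored in r2["prev"].
```

Linking records this way makes the chain tamper-evident, which is what lets an audit reconstruct who (or what) decided, on which data, at which layer.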
Seventeen measurable indicators: twelve per-layer, four cross-cutting, and one economy metric (CPR).
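The 17-metric inventory's structure (twelve per-layer, four cross-cutting, one economy metric) can be sketched as a small typed catalogue. Metric names here are placeholders, not the framework's actual indicators.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Metric:
    name: str
    kind: str             # "per-layer" | "cross-cutting" | "economy"
    layer: Optional[str]  # "L1".."L4" for per-layer metrics, else None

# Twelve per-layer metrics (three per layer, assumed split),
# four cross-cutting, and CPR as the single economy metric.
inventory = (
    [Metric(f"layer-metric-{i}", "per-layer", f"L{1 + i % 4}") for i in range(12)]
    + [Metric(f"cross-metric-{i}", "cross-cutting", None) for i in range(4)]
    + [Metric("CPR", "economy", None)]
)

def by_layer(layer: str) -> list:
    """Select the per-layer metrics attached to one architectural layer."""
    return [m for m in inventory if m.layer == layer]

assert len(inventory) == 17
```

The even three-per-layer split is an assumption; the point is that each metric carries its layer and type as data, so selections by layer or type are trivial queries.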
Two foundational implementations (A clinical, B industrial multi-domain) motivated the formalisation. A third (C judicial) demonstrates portability into a domain with a fundamentally different governance context.
The industrial platform spans three operational sub-domains. The same four-layer architecture instantiates differently in each: the dominant layer shifts with the type of evidence. Model Parsimony is applied as a per-sub-domain design discipline, not a global LLM presumption.
Paper 1 (this site's companion) is the cross-domain framework synthesis. Paper 0 grounds it in the clinical foundational instance; Papers 2 and 3 are domain and metrological deep-dives.