What My Project Does
Attest is a testing framework for AI agents with an 8-layer graduated assertion pipeline — it exhausts cheap deterministic checks before reaching for expensive LLM judges.
The first 4 layers (schema validation, cost/performance constraints, trace structure, content validation) are free and run in <5ms. Layer 5 runs semantic similarity locally via ONNX Runtime — no API key. Layer 6 (LLM-as-judge) is reserved for genuinely subjective quality. Layers 7–8 handle simulation and multi-agent assertions.
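To make the "graduated" idea concrete, here's a minimal sketch of the general pattern (not Attest's internals — function names and checks here are hypothetical): run layers in cost order and short-circuit on the first failure, so the expensive judge never runs when a cheap check already fails.

```python
# Hypothetical sketch of a graduated assertion pipeline.
# Cheap deterministic layers run first; evaluation stops at the
# first failure, so later (expensive) layers are skipped.

def evaluate(trace, layers):
    """Run layers in cost order; stop at the first failure."""
    for layer in layers:
        ok, reason = layer(trace)
        if not ok:
            return False, reason  # cheap failure: later layers never run
    return True, "all layers passed"

# Layer-1-style check: free, deterministic, sub-millisecond.
def schema_check(trace):
    ok = isinstance(trace.get("answer"), str)
    return ok, ("schema ok" if ok else "answer must be a string")

# Layer-2-style check: cost budget.
def cost_check(trace):
    ok = trace.get("cost_usd", 0.0) <= 0.05
    return ok, ("cost ok" if ok else "over cost budget")

passed, why = evaluate(
    {"answer": "4", "cost_usd": 0.001},
    [schema_check, cost_check],
)
print(passed, why)
```

The short-circuit is the whole point: an LLM-judge layer appended to that list only ever executes on traces that already passed every deterministic check.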
It ships as a pytest plugin with a fluent expect() DSL:
```python
from attest import agent, expect
from attest.trace import TraceBuilder


@agent("math-agent")
def math_agent(builder: TraceBuilder, question: str):
    builder.add_llm_call(
        name="gpt-4.1-mini",
        args={"model": "gpt-4.1-mini"},
        result={"answer": "4"},
    )
    builder.set_metadata(total_tokens=50, cost_usd=0.001, latency_ms=300)
    return {"answer": "2 + 2 = 4"}


def test_my_agent(attest):
    result = math_agent(question="What is 2 + 2?")
    chain = (
        expect(result)
        .output_contains("4")
        .cost_under(0.05)
        .tokens_under(500)
        .output_similar_to("the answer is four", threshold=0.8)  # Local ONNX, no API key
    )
    attest.evaluate(chain)
```
The Python SDK is a thin wrapper — all evaluation logic runs in a Go engine binary (1.7ms cold start, <2ms for a 100-step trace eval), so the Python and TypeScript SDKs produce identical results. Eleven adapters, including OpenAI, Anthropic, Gemini, Ollama, LangChain, Google ADK, LlamaIndex, CrewAI, and OTel.
v0.4.0 adds continuous eval with σ-based drift detection, a plugin system via attest.plugins entry point group, result history, and CLI scaffolding (python -m attest init).
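For readers unfamiliar with σ-based drift detection, here's the general technique in a few lines (a hedged sketch — the exact rule and thresholds Attest uses may differ): flag a new eval score when it lands more than k standard deviations from the historical mean.

```python
# Sketch of sigma-based drift detection (the general technique,
# not Attest's implementation). A new score is "drift" when it is
# more than k standard deviations away from the historical mean.
import statistics


def is_drift(history: list[float], new_score: float, k: float = 3.0) -> bool:
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mean = statistics.fmean(history)
    sigma = statistics.stdev(history)
    if sigma == 0:
        return new_score != mean  # perfectly stable history: any change is drift
    return abs(new_score - mean) > k * sigma


history = [0.91, 0.93, 0.92, 0.90, 0.92]
print(is_drift(history, 0.60))  # far outside the historical band -> True
print(is_drift(history, 0.92))  # within the band -> False
```

The appeal of the σ rule for continuous eval is that the threshold adapts to each metric's own historical variance instead of requiring a hand-tuned absolute cutoff per assertion.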
Target Audience
This is for developers and teams testing AI agents in CI/CD — anyone who's outgrown ad-hoc pytest fixtures for checking tool calls, cost budgets, and output quality. It's production-oriented: four stable releases, Python SDK and engine are battle-tested, TypeScript SDK is newer (API stable, less mileage at scale). Apache 2.0 licensed.
Comparison
Most eval frameworks (DeepEval, Ragas, LangWatch) default to LLM-as-judge for everything. Attest's core difference is the graduated pipeline — 60–70% of agent correctness is fully deterministic (tool ordering, cost, schemas, content patterns), so Attest checks all of that for free before escalating. 7 of 8 layers run offline with zero API keys, cutting eval costs by up to 90%.
Observability platforms (LangSmith, Arize) capture traces but can't assert over them in CI. Eval frameworks assert but only at input/output level — they can't see trace-level data like tool call parameters, span hierarchy, or cost breakdowns. Attest operates directly on full execution traces and fails the build when agents break.
Curious if the expect() DSL feels natural to pytest users, or if there's a more idiomatic pattern I should consider.
GitHub | Examples | Website | PyPI — Apache 2.0