Eval & Trace — Lutum Docs

lutum-trace captures in-memory spans and events from LLM turns. lutum-eval builds evaluation helpers on top of captured traces: synchronous PureEval, async Eval, live Probe hooks, JudgeEval for structured LLM-as-judge evaluation, and composable combinators.

Trace Setup

Add the crates and set up the tracing subscriber:

[dependencies]
lutum-trace = "0.1.0"
tracing-subscriber = { version = "0.3", features = ["registry"] }

use tracing_subscriber::layer::SubscriberExt as _;

// Build a subscriber with the capture layer
let subscriber = tracing_subscriber::registry()
    .with(lutum_trace::layer());

// You can also compose with other layers (stdout, json, etc.)
let subscriber = tracing_subscriber::registry()
    .with(lutum_trace::layer())
    .with(tracing_subscriber::fmt::layer());

Capturing a Turn

Wrap any async block with lutum_trace::capture(...) to get back the result plus a TraceSnapshot:

use lutum::Session;
use tracing::instrument::WithSubscriber as _;

let collected = lutum_trace::capture(
    async {
        let mut session = Session::new();
        session.push_user("Say hello in one sentence.");
        let result = session.text_turn(&llm).collect().await?;
        Ok::<_, Box<dyn std::error::Error>>(result.assistant_text().to_string())
    }
    .with_subscriber(subscriber),
)
.await;

// Access the result
let text = collected.output.unwrap();

// Inspect the trace
if let Some(turn) = collected.trace.span("llm_turn") {
    println!("request_id = {:?}", turn.field("request_id"));
    println!("finish_reason = {:?}", turn.field("finish_reason"));
}

Collected<T> has two fields:

output: T — the return value of the wrapped async block
trace: TraceSnapshot — structured snapshot of all spans and events

TraceSnapshot::span(name) returns the first span with the given name. Each span exposes field(key) for accessing recorded field values and events() for child events.
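
The lookup described above can be modeled with a small standalone sketch. Everything in it (`SpanSketch`, `find`, `sample_trace`) is a hypothetical stand-in rather than the lutum-trace API, and the depth-first, pre-order search is an assumption about what "first span" means.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for illustration only; these are not the lutum-trace types.
#[derive(Debug, Default)]
struct SpanSketch {
    name: String,
    fields: BTreeMap<String, String>,
    children: Vec<SpanSketch>,
}

impl SpanSketch {
    fn new(name: &str) -> Self {
        SpanSketch { name: name.to_string(), ..Default::default() }
    }

    fn with_field(mut self, key: &str, value: &str) -> Self {
        self.fields.insert(key.to_string(), value.to_string());
        self
    }

    fn with_child(mut self, child: SpanSketch) -> Self {
        self.children.push(child);
        self
    }

    /// Returns the recorded value for `key`, if any.
    fn field(&self, key: &str) -> Option<&str> {
        self.fields.get(key).map(String::as_str)
    }

    /// Depth-first, pre-order search (an assumed ordering) for the first span called `name`.
    fn find(&self, name: &str) -> Option<&SpanSketch> {
        if self.name == name {
            return Some(self);
        }
        self.children.iter().find_map(|child| child.find(name))
    }
}

/// Builds a tiny two-level trace: a `workflow` span containing one `llm_turn`.
fn sample_trace() -> SpanSketch {
    SpanSketch::new("workflow").with_child(
        SpanSketch::new("llm_turn")
            .with_field("request_id", "req-1")
            .with_field("finish_reason", "stop"),
    )
}
```

Calling `sample_trace().find("llm_turn")` and then `field("finish_reason")` on the result mirrors the `collected.trace.span("llm_turn")` pattern from the capture example; a name or key that was never recorded comes back as `None`.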

For live streaming of trace events alongside capture, attach a listener to the capture future:

let collected = lutum_trace::capture(my_async_block.with_subscriber(subscriber))
    .listen_events(|event: TraceEvent| {
        // called for each span/event as it is recorded
        println!("trace event: {event:?}");
    })
    .await;

capture_raw(...).listen_events(...) combines the live callback with raw entry capture. Use .listen_spans(...) when you want completed SpanNodes instead of low-level TraceEvents.

Hierarchical Workflows

For multi-agent or sub-agent workflows, capture the whole workflow once and use captured tracing spans for workflow and agent boundaries. The span names are not special; any span with lutum.capture = true is captured and can become the parent of llm_turn spans. Fields such as agent = "planner" are ordinary metadata for later lookup, not reserved keys. Session remains just transcript state; the trace hierarchy comes from the spans you place around your event loop.

use tracing::Instrument as _;

let collected = lutum_trace::capture(async {
    let workflow = tracing::info_span!(
        "workflow",
        lutum.capture = true,
        workflow = "research"
    );

    async {
        let planner = async {
            let mut session = Session::new();
            session.push_user("Draft the plan.");
            session.text_turn(&llm).collect().await
        }
        .instrument(tracing::info_span!(
            "planner_agent",
            lutum.capture = true,
            agent = "planner"
        ));

        let reviewer = async {
            let mut session = Session::new();
            session.push_user("Review the plan.");
            session.text_turn(&llm).collect().await
        }
        .instrument(tracing::info_span!(
            "reviewer_agent",
            lutum.capture = true,
            agent = "reviewer"
        ));

        tokio::join!(planner, reviewer)
    }
    .instrument(workflow)
    .await
})
.await;

Nested sub-agents use the same rule: create another captured span inside the parent span. Do not use nested capture() calls to represent hierarchy; nested agents should be part of the one workflow-level capture.
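
Once agent spans carry metadata such as agent = "planner", a later lookup amounts to a tree walk. The sketch below is one way to picture that, using a hypothetical `Node` type in place of the real span type; the `agent` member stands in for a recorded span field, and none of these names come from lutum-trace.

```rust
// Hypothetical span tree for illustration; the real lutum-trace types may look different.
#[derive(Debug)]
struct Node {
    name: &'static str,
    agent: Option<&'static str>, // stands in for an `agent = "..."` span field
    children: Vec<Node>,
}

/// Collects every span called `name` at or below `node`.
fn collect_named<'a>(node: &'a Node, name: &str, out: &mut Vec<&'a Node>) {
    if node.name == name {
        out.push(node);
    }
    for child in &node.children {
        collect_named(child, name, out);
    }
}

/// Finds the `llm_turn` spans nested under any span tagged with `agent`.
fn llm_turns_for_agent<'a>(root: &'a Node, agent: &str) -> Vec<&'a Node> {
    let mut out = Vec::new();
    let mut stack = vec![root];
    while let Some(node) = stack.pop() {
        if node.agent == Some(agent) {
            collect_named(node, "llm_turn", &mut out);
        } else {
            stack.extend(node.children.iter());
        }
    }
    out
}

fn turn() -> Node {
    Node { name: "llm_turn", agent: None, children: vec![] }
}

/// One workflow span with a planner (two turns) and a reviewer (one turn).
fn sample_workflow() -> Node {
    Node {
        name: "workflow",
        agent: None,
        children: vec![
            Node { name: "planner_agent", agent: Some("planner"), children: vec![turn(), turn()] },
            Node { name: "reviewer_agent", agent: Some("reviewer"), children: vec![turn()] },
        ],
    }
}
```

Because both agents live under a single workflow root, one traversal can attribute each turn to its agent, which is the reason to prefer one workflow-level capture over nested captures.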

Raw Capture

lutum_trace::capture_raw(...) captures both a structured snapshot and raw trace entries as a RawTraceSnapshot. Useful for debugging or custom analysis.

capture_raw enables raw collection in the trace layer. To make Lutum adapters emit provider request bodies, SSE payloads, parse errors, and collect errors into that raw stream, attach RawTelemetryConfig to the Lutum context:

use std::sync::Arc;
use lutum::{Lutum, RawTelemetryConfig, Session};
use tracing::instrument::WithSubscriber as _;

let llm = Lutum::new(Arc::new(adapter), budget)
    .with_extension(RawTelemetryConfig::all());

let collected = lutum_trace::capture_raw(
    async {
        let mut session = Session::new();
        session.push_user("Say hello in one sentence.");
        session.text_turn(&llm).collect().await
    }
    .with_subscriber(subscriber),
)
.await;

let trace: &TraceSnapshot = &collected.trace;
let raw: &RawTraceSnapshot = &collected.raw;

PureEval — Synchronous Evaluation

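PureEval is a synchronous evaluation trait over a captured trace plus an application-specific artifact. It returns Result<Self::Report, Self::Error>. If the eval cannot fail, use std::convert::Infallible as the error type, as the example does; the never type (!) is not yet usable as an associated type on stable Rust: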

use std::convert::Infallible;
use lutum_eval::{PureEval, Score};
use lutum_trace::TraceSnapshot;

struct ResponseLengthEval;

impl PureEval for ResponseLengthEval {
    type Artifact = String;   // e.g. assistant text
    type Report = usize;      // e.g. character count
    type Error = Infallible;

    fn evaluate(
        &self,
        _trace: &TraceSnapshot,
        artifact: &Self::Artifact,
    ) -> Result<Self::Report, Self::Error> {
        // chars().count() counts characters; len() would return the UTF-8 byte length
        Ok(artifact.chars().count())
    }
}

Use PureEvalExt::scored_by(objective) to pair an eval with an Objective and get a Scored<Report>. Score values are f32 in [0.0, 1.0] — use Score::try_new(f32) or Score::new_clamped(f32).
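
The difference between a fallible and a clamping constructor can be sketched without the crate. `ScoreSketch` and `OutOfRange` below are hypothetical stand-ins, and the behavior shown (rejecting out-of-range input in `try_new`, clamping in `new_clamped`, mapping NaN to 0.0) is an assumption inferred from the method names, not a description of lutum-eval's Score.

```rust
/// Hypothetical stand-in for a score constrained to [0.0, 1.0].
#[derive(Debug, Clone, Copy, PartialEq)]
struct ScoreSketch(f32);

/// Error carrying the rejected input value.
#[derive(Debug, PartialEq)]
struct OutOfRange(f32);

impl ScoreSketch {
    /// Fallible constructor: rejects NaN and anything outside [0.0, 1.0].
    fn try_new(value: f32) -> Result<Self, OutOfRange> {
        if (0.0..=1.0).contains(&value) {
            Ok(ScoreSketch(value))
        } else {
            Err(OutOfRange(value))
        }
    }

    /// Infallible constructor: clamps into range.
    /// NaN maps to 0.0 here, which is an arbitrary choice for this sketch.
    fn new_clamped(value: f32) -> Self {
        if value.is_nan() {
            ScoreSketch(0.0)
        } else {
            ScoreSketch(value.clamp(0.0, 1.0))
        }
    }

    fn value(self) -> f32 {
        self.0
    }
}
```

A fallible constructor suits values computed from untrusted reports, where an out-of-range number signals a bug; a clamping one suits normalized metrics that may overshoot slightly.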

Eval — Async Evaluation

Eval is the async variant. It receives a Lutum context in addition to the trace and artifact, which lets the evaluator run its own LLM calls. Like PureEval, it returns Result<Self::Report, Self::Error>:

use async_trait::async_trait;
use lutum::Lutum;
use lutum_eval::Eval;
use lutum_trace::TraceSnapshot;

struct FreeformJudge;

#[async_trait]
impl Eval for FreeformJudge {
    type Artifact = String;
    type Report = String;
    type Error = Box<dyn std::error::Error + Send + Sync>;

    async fn evaluate(
        &self,
        llm: &Lutum,
        _trace: &TraceSnapshot,
        artifact: &Self::Artifact,
    ) -> Result<Self::Report, Self::Error> {
        let prompt = format!("Rate this response 1-10 and explain: {artifact}");
        let result = llm.completion(prompt).collect().await?;
        Ok(result.text.clone())
    }
}

For structured LLM-as-judge evaluation, see JudgeEval below — it uses structured output so the judge's score is decoded as a typed Rust struct rather than parsed from free text.

Probe — Live Evaluation

Probe hooks into a live turn via collect_with(...). It observes streaming events in real time, making it useful for latency monitoring, streaming quality checks, or live dashboards:

use std::time::Instant;
use lutum_eval::Probe;

struct FirstTokenTimer {
    start: Instant,
    seen_first: bool,
}

impl Probe for FirstTokenTimer {
    fn on_text_delta(&mut self, delta: &str) {
        // Report latency only for the first delta; later deltas are ignored.
        if !self.seen_first {
            self.seen_first = true;
            let latency = self.start.elapsed();
            println!("First token at {latency:?}: {delta:?}");
        }
    }
}

let result = session
    .text_turn(&llm)
    .collect_with(FirstTokenTimer { start: Instant::now(), seen_first: false })
    .await?;
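
The callback pattern behind a probe can be tried in isolation. In this sketch, `DeltaProbe`, `FirstDelta`, and `drive` are hypothetical names: `drive` stands in for a streaming turn that feeds text chunks to the probe as they arrive, and the trait is a one-method simplification, not the lutum-eval Probe trait.

```rust
// Hypothetical callback trait for illustration; not the lutum-eval `Probe` trait.
trait DeltaProbe {
    fn on_text_delta(&mut self, delta: &str);
}

#[derive(Default)]
struct FirstDelta {
    first: Option<String>,
    deltas_seen: usize,
}

impl DeltaProbe for FirstDelta {
    fn on_text_delta(&mut self, delta: &str) {
        self.deltas_seen += 1;
        // Only the first delta is recorded; later ones just bump the counter.
        if self.first.is_none() {
            self.first = Some(delta.to_string());
        }
    }
}

/// Stand-in for a streaming turn: feeds each chunk to the probe, then returns the full text.
fn drive<P: DeltaProbe>(chunks: &[&str], probe: &mut P) -> String {
    let mut text = String::new();
    for chunk in chunks {
        probe.on_text_delta(chunk);
        text.push_str(chunk);
    }
    text
}
```

Driving the probe with the chunks "Hel" and "lo" yields the text "Hello" while the probe records "Hel" as the first delta and a count of two. The probe observes the stream without changing what the turn returns.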

JudgeEval — Structured LLM-as-Judge

JudgeEval is a ready-made Eval implementation that runs an LLM call with structured output as the judge. Supply a render_input closure that builds a ModelInput from the trace and artifact; the LLM's structured response becomes the Report:

[dependencies]
lutum-eval = "0.1.0"
schemars = "0.8"
serde = { version = "1", features = ["derive"] }

use lutum_eval::{JudgeEval, EvalExt as _};
use lutum::ModelInput;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

#[derive(Clone, Debug, Serialize, Deserialize, JsonSchema)]
struct JudgeReport {
    score: f32,          // 0.0–1.0
    explanation: String,
}

let judge = JudgeEval::<String, JudgeReport, _>::new(|_trace, artifact| {
    ModelInput::from_user(format!(
        "Rate this response 0.0–1.0 and explain:\n\n{artifact}"
    ))
})
.max_output_tokens(256);

// Run the judge against an artifact
let report: JudgeReport = judge
    .evaluate(&llm, &trace, &artifact_text)
    .await?;

println!("score={}, reason={}", report.score, report.explanation);

JudgeEval accepts the same builder methods as turn builders (.temperature(...), .max_output_tokens(...), .budget(...), .ext(...)) so you can configure the judge call independently of the main session.

EvalRecord

EvalRecord<R, E> bundles a scored result with the trace that produced it. It is the standard return type for eval runners and eval pipelines:

use lutum_eval::EvalRecord;

// EvalRecord wraps Result<Scored<R>, E> + TraceSnapshot
let record: EvalRecord<JudgeReport, _> = /* ... from eval runner */;

if record.passed() {
    println!("score: {:?}", record.score());
} else {
    println!("error: {:?}", record.error());
}

// Access the trace regardless of outcome
let span = record.trace.span("llm_turn");

RawEvalRecord<R, E> adds a raw: RawTraceSnapshot for full protocol-level debugging. Convert between them with .into_raw(raw) and .without_raw().

Eval Combinators

PureEvalExt and EvalExt (auto-implemented for all PureEval / Eval implementors) expose combinator methods for transforming and composing evaluators:

Method	Description
`.map_report(f)`	Transform the report type
`.map_error(f)`	Transform the error type
`.contramap_artifact(f)`	Accept a different artifact type
`.combine(other, f)`	Run two evals and merge their reports
`.scored_by(objective)`	Pair with an `Objective` to produce `Scored<R>`
`.lift()` (PureEval only)	Lift a `PureEval` into the `Eval` trait
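
To make the shape of these combinators concrete, here is a minimal sketch built on a deliberately simplified trait. `MiniEval` is hypothetical: unlike PureEval it takes no trace and has no error type, and the wrapper structs (`MapReport`, `Contramap`, `Combine`) only illustrate the idea, not how lutum-eval implements it.

```rust
use std::marker::PhantomData;

// Simplified, hypothetical trait: unlike PureEval it takes no trace and cannot fail.
trait MiniEval {
    type Artifact;
    type Report;
    fn evaluate(&self, artifact: &Self::Artifact) -> Self::Report;
}

/// map_report: post-process the report.
struct MapReport<E, F> { inner: E, f: F }

impl<E, F, R> MiniEval for MapReport<E, F>
where
    E: MiniEval,
    F: Fn(E::Report) -> R,
{
    type Artifact = E::Artifact;
    type Report = R;
    fn evaluate(&self, artifact: &Self::Artifact) -> R {
        (self.f)(self.inner.evaluate(artifact))
    }
}

/// contramap_artifact: adapt a different artifact type before evaluating.
struct Contramap<E, F, A> { inner: E, f: F, _artifact: PhantomData<fn(&A)> }

fn contramap<E, F, A>(inner: E, f: F) -> Contramap<E, F, A>
where
    E: MiniEval,
    F: Fn(&A) -> E::Artifact,
{
    Contramap { inner, f, _artifact: PhantomData }
}

impl<E, F, A> MiniEval for Contramap<E, F, A>
where
    E: MiniEval,
    F: Fn(&A) -> E::Artifact,
{
    type Artifact = A;
    type Report = E::Report;
    fn evaluate(&self, artifact: &A) -> E::Report {
        self.inner.evaluate(&(self.f)(artifact))
    }
}

/// combine: run two evals on the same artifact and merge their reports.
struct Combine<L, R, F> { left: L, right: R, merge: F }

impl<L, R, F, Out> MiniEval for Combine<L, R, F>
where
    L: MiniEval,
    R: MiniEval<Artifact = L::Artifact>,
    F: Fn(L::Report, R::Report) -> Out,
{
    type Artifact = L::Artifact;
    type Report = Out;
    fn evaluate(&self, artifact: &Self::Artifact) -> Out {
        (self.merge)(self.left.evaluate(artifact), self.right.evaluate(artifact))
    }
}

// Two leaf evals over a String artifact.
struct ByteLen;
impl MiniEval for ByteLen {
    type Artifact = String;
    type Report = usize;
    fn evaluate(&self, artifact: &String) -> usize { artifact.len() }
}

struct WordCount;
impl MiniEval for WordCount {
    type Artifact = String;
    type Report = usize;
    fn evaluate(&self, artifact: &String) -> usize { artifact.split_whitespace().count() }
}

// A richer artifact that `contramap` can project down to a String.
struct Reply { text: String }
```

Each wrapper is itself an eval, so wrappers nest: a `Combine` of `ByteLen` and `WordCount` can be passed through `MapReport` or adapted to a `Reply` artifact with `contramap`. Wrapping in structs rather than boxed closures keeps dispatch static.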

Objective controls how a report becomes a Score. Built-in objectives:

maximize(|r| r.value) — higher is better
minimize(|r| r.latency_ms) — lower is better
pass_fail(|r| r.passed) — boolean pass/fail

use lutum_eval::{PureEvalExt as _, maximize};

// Run the eval and score by response length (higher = better, capped at 500 chars)
let scored = ResponseLengthEval
    .scored_by(&maximize(|len: &usize| (*len).min(500) as f32 / 500.0))
    .run_collected(&collected)?;

println!("score={:.2}, length={}", scored.score.value(), scored.report);

lutum-eval-runner — k-Metrics

lutum-eval-runner provides pass_at_k and pass_hat_k for measuring model reliability across multiple trials:

[dependencies]
lutum-eval-runner = "0.1.0"

use lutum_eval_runner::{pass_at_k, KMetric};

// Run 10 trials of the same task
let metric: KMetric = pass_at_k(
    10,    // k — number of trials
    1,     // n — number of successes required
    || async {
        let result = session.text_turn(&llm).collect().await?;
        let passes = result.assistant_text().contains("correct answer");
        Ok(passes)
    },
)
.await?;

println!("pass@10 = {:.2}", metric.rate);
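
For intuition about the two metrics, the sketch below assumes the common reading: pass@k asks whether at least one of k trials succeeds, and pass^k asks whether all k succeed. The function names are hypothetical, and how lutum-eval-runner aggregates trials into KMetric is not covered here. The closed forms assume independent trials with a fixed per-trial success probability p.

```rust
/// Empirical "at least one trial passed" over a set of recorded outcomes.
fn any_pass(outcomes: &[bool]) -> bool {
    outcomes.iter().any(|&ok| ok)
}

/// Empirical "every trial passed"; an empty set of trials counts as a failure here.
fn all_pass(outcomes: &[bool]) -> bool {
    !outcomes.is_empty() && outcomes.iter().all(|&ok| ok)
}

/// Expected pass@k for independent trials: 1 - (1 - p)^k.
fn expected_pass_at_k(p: f64, k: u32) -> f64 {
    1.0 - (1.0 - p).powi(k as i32)
}

/// Expected pass^k for independent trials: p^k.
fn expected_pass_hat_k(p: f64, k: u32) -> f64 {
    p.powi(k as i32)
}
```

The two metrics pull in opposite directions as k grows: with p = 0.5 and k = 2, the expected pass@k is 0.75 while the expected pass^k is 0.25. pass@k rewards a model that can succeed at all, and pass^k rewards one that succeeds consistently.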