lutum-trace captures in-memory spans and events from LLM turns. lutum-eval builds evaluation
helpers on top of captured traces: synchronous PureEval, async Eval, live Probe hooks,
JudgeEval for structured LLM-as-judge evaluation, and composable combinators.
On This Page
Trace Setup
Add the crates and set up the tracing subscriber:
[dependencies]
lutum-trace = "0.1.0"
tracing-subscriber = { version = "0.3", features = ["registry"] }use tracing_subscriber::layer::SubscriberExt as _;
// Build a subscriber with the capture layer
let subscriber = tracing_subscriber::registry()
.with(lutum_trace::layer());
// You can also compose with other layers (stdout, json, etc.)
let subscriber = tracing_subscriber::registry()
.with(lutum_trace::layer())
.with(tracing_subscriber::fmt::layer());Capturing a Turn
Wrap any async block with lutum_trace::capture(...) to get back the result plus a
TraceSnapshot:
use lutum::Session;
use tracing::instrument::WithSubscriber as _;
let collected = lutum_trace::capture(
async {
let mut session = Session::new();
session.push_user("Say hello in one sentence.");
let result = session.text_turn(&llm).collect().await?;
Ok::<_, Box<dyn std::error::Error>>(result.assistant_text().to_string())
}
.with_subscriber(subscriber),
)
.await;
// Access the result
let text = collected.output.unwrap();
// Inspect the trace
if let Some(turn) = collected.trace.span("llm_turn") {
println!("request_id = {:?}", turn.field("request_id"));
println!("finish_reason = {:?}", turn.field("finish_reason"));
}Collected<T> has two fields:
output: T— the return value of the wrapped async blocktrace: TraceSnapshot— structured snapshot of all spans and events
TraceSnapshot::span(name) returns the first span with the given name. Each span exposes
field(key) for accessing recorded field values and events() for child events.
For live streaming of trace events alongside capture, attach a listener to the capture future:
let collected = lutum_trace::capture(my_async_block.with_subscriber(subscriber))
.listen_events(|event: TraceEvent| {
// called for each span/event as it is recorded
println!("trace event: {event:?}");
})
.await;capture_raw(...).listen_events(...) combines the live callback with raw entry capture.
Use .listen_spans(...) when you want completed SpanNodes instead of low-level
TraceEvents.
Hierarchical Workflows
For multi-agent or sub-agent workflows, capture the whole workflow once and use captured
tracing spans for workflow and agent boundaries. The span names are not special; any span
with lutum.capture = true is captured and can become the parent of llm_turn spans.
Fields such as agent = "planner" are ordinary metadata for later lookup, not reserved
keys. Session remains just transcript state; the trace hierarchy comes from the spans you
place around your event loop.
use tracing::Instrument as _;
let collected = lutum_trace::capture(async {
let workflow = tracing::info_span!(
"workflow",
lutum.= true,
workflow = "research"
);
async {
let planner = async {
let mut session = Session::new();
session.push_user("Draft the plan.");
session.text_turn(&llm).collect().await
}
.instrument(tracing::info_span!(
"planner_agent",
lutum.= true,
agent = "planner"
));
let reviewer = async {
let mut session = Session::new();
session.push_user("Review the plan.");
session.text_turn(&llm).collect().await
}
.instrument(tracing::info_span!(
"reviewer_agent",
lutum.= true,
agent = "reviewer"
));
tokio::join!(planner, reviewer)
}
.instrument(workflow)
.await
})
.await;Nested sub-agents use the same rule: create another captured span inside the parent span.
Do not use nested capture() calls to represent hierarchy; nested agents should be part of
the one workflow-level capture.
Raw Capture
lutum_trace::capture_raw(...) captures both a structured snapshot and raw trace entries as a
RawTraceSnapshot. Useful for debugging or custom analysis.
capture_raw enables raw collection in the trace layer. To make Lutum adapters emit provider
request bodies, SSE payloads, parse errors, and collect errors into that raw stream, attach
RawTelemetryConfig to the Lutum context:
use std::sync::Arc;
use lutum::{Lutum, RawTelemetryConfig, Session};
use tracing::instrument::WithSubscriber as _;
let llm = Lutum::new(Arc::new(adapter), budget)
.with_extension(RawTelemetryConfig::all());
let collected = lutum_trace::capture_raw(
async {
let mut session = Session::new();
session.push_user("Say hello in one sentence.");
session.text_turn(&llm).collect().await
}
.with_subscriber(subscriber),
)
.await;
let trace: &TraceSnapshot = &collected.trace;
let raw: &RawTraceSnapshot = &collected.raw;PureEval — Synchronous Evaluation
PureEval is a synchronous evaluation trait over a captured trace plus an application-specific
artifact. It returns Result<Self::Report, Self::Error> — use a ! (never) error type if the
eval is infallible:
use std::convert::Infallible;
use lutum_eval::{PureEval, Score};
use lutum_trace::TraceSnapshot;
struct ResponseLengthEval;
impl PureEval for ResponseLengthEval {
type Artifact = String; // e.g. assistant text
type Report = usize; // e.g. character count
type Error = Infallible;
fn evaluate(
&self,
_trace: &TraceSnapshot,
artifact: &Self::Artifact,
) -> Result<Self::Report, Self::Error> {
Ok(artifact.len())
}
}Use PureEvalExt::scored_by(objective) to pair an eval with an Objective and get a
Scored<Report>. Score values are f32 in [0.0, 1.0] — use Score::try_new(f32) or
Score::new_clamped(f32).
Eval — Async Evaluation
Eval is the async variant. It receives a Lutum context in addition to the trace and artifact,
which lets the evaluator run its own LLM calls. Like PureEval, it returns
Result<Self::Report, Self::Error>:
use lutum_eval::Eval;
use lutum::Lutum;
use async_trait::async_trait;
struct FreeformJudge;
#[]
impl Eval for FreeformJudge {
type Artifact = String;
type Report = String;
type Error = Box<dyn std::error::Error + Send + Sync>;
async fn evaluate(
&self,
llm: &Lutum,
_trace: &TraceSnapshot,
artifact: &Self::Artifact,
) -> Result<Self::Report, Self::Error> {
let prompt = format!("Rate this response 1-10 and explain: {artifact}");
let result = llm.completion(prompt).collect().await?;
Ok(result.text.clone())
}
}For structured LLM-as-judge evaluation, see JudgeEval below — it uses
structured output so the judge's score is decoded as a typed Rust struct rather than parsed
from free text.
Probe — Live Evaluation
Probe hooks into a live turn via collect_with(...). It observes streaming events in real time,
making it useful for latency monitoring, streaming quality checks, or live dashboards:
use lutum_eval::Probe;
struct FirstTokenTimer {
start: std::time::Instant,
}
impl Probe for FirstTokenTimer {
fn on_text_delta(&mut self, delta: &str) {
let latency = self.start.elapsed();
println!("First token at {latency:?}: {delta:?}");
}
}
let result = session
.text_turn(&llm)
.collect_with(FirstTokenTimer { start: std::time::Instant::now() })
.await?;JudgeEval — Structured LLM-as-Judge
JudgeEval is a ready-made Eval implementation that runs an LLM call with structured output
as the judge. Supply a render_input closure that builds a ModelInput from the trace and
artifact; the LLM's structured response becomes the Report:
[dependencies]
lutum-eval = "0.1.0"
schemars = "0.8"
serde = { version = "1", features = ["derive"] }use lutum_eval::{JudgeEval, EvalExt as _};
use lutum::ModelInput;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
#[(Clone, Debug, Serialize, Deserialize, JsonSchema)]
struct JudgeReport {
score: f32, // 0.0–1.0
explanation: String,
}
let judge = JudgeEval::<String, JudgeReport, _>::new(|_trace, artifact| {
ModelInput::from_user(format!(
"Rate this response 0.0–1.0 and explain:\n\n{artifact}"
))
})
.max_output_tokens(256);
// Run the judge against an artifact
let report: JudgeReport = judge
.evaluate(&llm, &trace, &artifact_text)
.await?;
println!("score={}, reason={}", report.score, report.explanation);JudgeEval accepts the same builder methods as turn builders (.temperature(...),
.max_output_tokens(...), .budget(...), .ext(...)) so you can configure the judge call
independently of the main session.
EvalRecord
EvalRecord<R, E> bundles a scored result with the trace that produced it. It is the standard
return type for eval runners and eval pipelines:
use lutum_eval::EvalRecord;
// EvalRecord wraps Result<Scored<R>, E> + TraceSnapshot
let record: EvalRecord<JudgeReport, _> = /* ... from eval runner */;
if record.passed() {
println!("score: {:?}", record.score());
} else {
println!("error: {:?}", record.error());
}
// Access the trace regardless of outcome
let span = record.trace.span("llm_turn");RawEvalRecord<R, E> adds a raw: RawTraceSnapshot for full protocol-level debugging. Convert
between them with .into_raw(raw) and .without_raw().
Eval Combinators
PureEvalExt and EvalExt (auto-implemented for all PureEval / Eval implementors) expose
combinator methods for transforming and composing evaluators:
| Method | Description |
|---|---|
.map_report(f) |
Transform the report type |
.map_error(f) |
Transform the error type |
.contramap_artifact(f) |
Accept a different artifact type |
.combine(other, f) |
Run two evals and merge their reports |
.scored_by(objective) |
Pair with an Objective to produce Scored<R> |
.lift() (PureEval only) |
Lift a PureEval into the Eval trait |
Objective controls how a report becomes a Score. Built-in objectives:
maximize(|r| r.value)— higher is betterminimize(|r| r.latency_ms)— lower is betterpass_fail(|r| r.passed)— boolean pass/fail
use lutum_eval::{PureEvalExt as _, maximize};
// Run the eval and score by response length (higher = better, capped at 500 chars)
let scored = ResponseLengthEval
.scored_by(&maximize(|len: &usize| (*len).min(500) as f32 / 500.0))
.run_collected(&collected)?;
println!("score={:.2}, length={}", scored.score.value(), scored.report);lutum-eval-runner — k-Metrics
lutum-eval-runner provides pass_at_k and pass_hat_k for measuring model reliability across
multiple trials:
[dependencies]
lutum-eval-runner = "0.1.0"use lutum_eval_runner::{pass_at_k, KMetric};
// Run 10 trials of the same task
let metric: KMetric = pass_at_k(
10, // k — number of trials
1, // n — number of successes required
|| async {
let result = session.text_turn(&llm).collect().await?;
let passes = result.assistant_text().contains("correct answer");
Ok(passes)
},
)
.await?;
println!("pass@10 = {:.2}", metric.rate);