Nightshift is currently in its infancy and is subject to change and may be incomplete. If you find a bug or have an idea to improve Nightshift, please raise an issue on GitHub.
Evals are standardized benchmarks for measuring how well different LLMs perform autonomous tasks. Unlike interactive sessions, where you evaluate outputs subjectively, evals provide reproducible, quantitative scores that enable model comparison.

Available Evals

| Eval | Trigger | Purpose |
| --- | --- | --- |
| Boot Eval | `nightshift eval --eval Boot` | Measures bootstrap output quality |

How Evals Work

Each eval follows a common pattern:
  1. Configuration: A JSON file specifies the model, inputs, and expected outputs
  2. Isolated Execution: Nightshift creates a fresh environment and runs the agent autonomously
  3. Output Collection: Generated files are gathered from the workspace
  4. LLM-as-Judge Grading: A separate model (GPT-4o) scores outputs across multiple dimensions
  5. Results: Scores are saved to JSON for analysis and comparison
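
In code terms, this flow maps onto a handful of data shapes. The following TypeScript sketch is illustrative only; the field names are assumptions, not Nightshift's actual schema:

```ts
// Illustrative shapes for one eval run; field names are assumptions,
// not Nightshift's actual schema.

/** Step 1: eval configuration loaded from a JSON file. */
interface EvalConfig {
  model: string;                  // model under test
  inputs: Record<string, string>; // prompts or fixtures given to the agent
  expectedOutputs: string[];      // files the agent is expected to produce
}

/** Step 4: one quality dimension scored by the judge. */
interface DimensionGrade {
  dimension: string; // e.g. "completeness"
  score: number;     // numeric score for this dimension
  reasoning: string; // the judge's justification
}

/** Step 5: aggregated result saved to JSON. */
interface EvalResult {
  eval: string;             // which eval ran, e.g. "boot"
  model: string;            // which model was evaluated
  grades: DimensionGrade[]; // per-dimension scores
  overall: number;          // aggregate (e.g. mean) of the dimension scores
}
```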

Why LLM-as-Judge?

Traditional software tests check for exact matches or simple assertions. Agent outputs are too nuanced for this: a good AGENTS.md can be written in many different ways. LLM-as-judge provides:
  • Semantic evaluation: Judges meaning, not just syntax
  • Multi-dimensional scoring: Captures different quality aspects
  • Calibrated assessment: Consistent scoring across runs
  • Detailed reasoning: Each score includes justification
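
As a rough sketch of what dimension-specific judging can look like, the snippet below calls the OpenAI SDK directly; the prompt wording, dimension names, and score scale are assumptions, not Nightshift's actual rubrics:

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

interface DimensionGrade {
  dimension: string;
  score: number;     // 1-10
  reasoning: string; // justification returned by the judge
}

// Ask the judge model to score one quality dimension of an agent's output.
async function gradeDimension(
  output: string,
  dimension: string,
  rubric: string,
): Promise<DimensionGrade> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          `You are grading an AI agent's output on the "${dimension}" dimension.\n` +
          `Rubric: ${rubric}\n` +
          `Reply with JSON: {"score": <1-10>, "reasoning": "<why>"}`,
      },
      { role: "user", content: output },
    ],
  });

  const parsed = JSON.parse(response.choices[0].message.content ?? "{}");
  return { dimension, score: parsed.score, reasoning: parsed.reasoning };
}
```

Calling this once per dimension and aggregating the scores yields the multi-dimensional, justified grades described above.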

Building Custom Evals

Evals are built on top of the OpenCode SDK. The key components are:
  • Isolated environment - Temp directories for tools and workspace
  • OpenCode client - Sends prompts and streams events
  • Grading prompts - Dimension-specific rubrics for the judge
  • Result aggregation - Structured JSON output
See the Boot Eval for a complete implementation example.
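
To make those components concrete, here is a minimal harness sketch. The two function types are placeholders you would implement yourself (the agent runner with the OpenCode client, the grader with an LLM-as-judge prompt); they are not SDK APIs.

```ts
import { mkdtemp, readFile, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Placeholder signatures, not OpenCode SDK APIs: supply your own implementations.
type AgentRunner = (workspace: string, prompt: string) => Promise<void>;
type Grader = (outputs: Record<string, string>) => Promise<unknown>;

async function runCustomEval(
  prompt: string,
  expectedFiles: string[],
  runAgent: AgentRunner,
  grade: Grader,
) {
  // Isolated environment: a throwaway workspace directory.
  const workspace = await mkdtemp(join(tmpdir(), "nightshift-eval-"));

  // OpenCode client: run the agent autonomously against the workspace.
  await runAgent(workspace, prompt);

  // Output collection: read back the files the agent was expected to produce.
  const outputs: Record<string, string> = {};
  for (const file of expectedFiles) {
    outputs[file] = await readFile(join(workspace, file), "utf8");
  }

  // Grading prompts + result aggregation: structured JSON written to disk.
  const result = await grade(outputs);
  await writeFile(join(workspace, "results.json"), JSON.stringify(result, null, 2));
  return result;
}
```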