Nightshift is currently in its infancy and is subject to change and may be incomplete. If you find a bug or have an idea to improve Nightshift, please raise an issue on GitHub.
Evals are standardized benchmarks for measuring how well different LLMs perform autonomous tasks. Unlike interactive sessions, where you evaluate outputs subjectively, evals provide reproducible, quantitative scores that enable model comparison.

Available Evals

| Eval | Trigger | Purpose |
| --- | --- | --- |
| Boot Eval | `nightshift eval --eval Boot` | Measures bootstrap output quality |

How Evals Work

Each eval follows a common pattern:
  1. Configuration: A JSON file specifies the model, inputs, and expected outputs
  2. Isolated Execution: Nightshift creates a fresh environment and runs the agent autonomously
  3. Output Collection: Generated files are gathered from the workspace
  4. LLM-as-Judge Grading: A separate model (GPT-4o) scores outputs across multiple dimensions
  5. Results: Scores are saved to JSON for analysis and comparison
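
In code terms, this flow maps onto a handful of data shapes. The following TypeScript sketch is illustrative only; the field names are assumptions, not Nightshift's actual schema:

```ts
// Illustrative shapes for one eval run; field names are assumptions,
// not Nightshift's actual schema.

/** Step 1: eval configuration loaded from a JSON file. */
interface EvalConfig {
  model: string;                  // model under test
  inputs: Record<string, string>; // prompts or fixtures given to the agent
  expectedOutputs: string[];      // files the agent is expected to produce
}

/** Step 4: one quality dimension scored by the judge. */
interface DimensionGrade {
  dimension: string; // e.g. "completeness"
  score: number;     // numeric score for this dimension
  reasoning: string; // the judge's justification
}

/** Step 5: aggregated result saved to JSON. */
interface EvalResult {
  eval: string;             // which eval ran, e.g. "boot"
  model: string;            // which model was evaluated
  grades: DimensionGrade[]; // per-dimension scores
  overall: number;          // aggregate (e.g. mean) of the dimension scores
}
```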

Why LLM-as-Judge?

Traditional software tests check for exact matches or simple assertions. Agent outputs are too nuanced for this: a good AGENTS.md can be written in many different ways. LLM-as-judge provides:
  • Semantic evaluation: Judges meaning, not just syntax
  • Multi-dimensional scoring: Captures different quality aspects
  • Calibrated assessment: Consistent scoring across runs
  • Detailed reasoning: Each score includes justification
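
As a rough sketch of what dimension-specific judging can look like, the snippet below calls the OpenAI SDK directly; the prompt wording, dimension names, and score scale are assumptions, not Nightshift's actual rubrics:

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

interface DimensionGrade {
  dimension: string;
  score: number;     // 1-10
  reasoning: string; // justification returned by the judge
}

// Ask the judge model to score one quality dimension of an agent's output.
async function gradeDimension(
  output: string,
  dimension: string,
  rubric: string,
): Promise<DimensionGrade> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          `You are grading an AI agent's output on the "${dimension}" dimension.\n` +
          `Rubric: ${rubric}\n` +
          `Reply with JSON: {"score": <1-10>, "reasoning": "<why>"}`,
      },
      { role: "user", content: output },
    ],
  });

  const parsed = JSON.parse(response.choices[0].message.content ?? "{}");
  return { dimension, score: parsed.score, reasoning: parsed.reasoning };
}
```

Calling this once per dimension and aggregating the scores yields the multi-dimensional, justified grades described above.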

Building Custom Evals

Evals are built on top of the OpenCode SDK. The key components are:
  • Isolated environment - Temp directories for tools and workspace
  • OpenCode client - Sends prompts and streams events
  • Grading prompts - Dimension-specific rubrics for the judge
  • Result aggregation - Structured JSON output
See the Boot Eval for a complete implementation example.
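
To make those components concrete, here is a minimal harness sketch. The two function types are placeholders you would implement yourself (the agent runner with the OpenCode client, the grader with an LLM-as-judge prompt); they are not SDK APIs.

```ts
import { mkdtemp, readFile, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Placeholder signatures, not OpenCode SDK APIs: supply your own implementations.
type AgentRunner = (workspace: string, prompt: string) => Promise<void>;
type Grader = (outputs: Record<string, string>) => Promise<unknown>;

async function runCustomEval(
  prompt: string,
  expectedFiles: string[],
  runAgent: AgentRunner,
  grade: Grader,
) {
  // Isolated environment: a throwaway workspace directory.
  const workspace = await mkdtemp(join(tmpdir(), "nightshift-eval-"));

  // OpenCode client: run the agent autonomously against the workspace.
  await runAgent(workspace, prompt);

  // Output collection: read back the files the agent was expected to produce.
  const outputs: Record<string, string> = {};
  for (const file of expectedFiles) {
    outputs[file] = await readFile(join(workspace, file), "utf8");
  }

  // Grading prompts + result aggregation: structured JSON written to disk.
  const result = await grade(outputs);
  await writeFile(join(workspace, "results.json"), JSON.stringify(result, null, 2));
  return result;
}
```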