Available Evals
| Eval | Trigger | Purpose |
|---|---|---|
| Boot Eval | nightshift eval --evalBoot | Measures bootstrap output quality |
How Evals Work
Each eval follows a common pattern (a configuration sketch follows this list):
- Configuration: A JSON file specifies the model, inputs, and expected outputs
- Isolated Execution: Nightshift creates a fresh environment and runs the agent autonomously
- Output Collection: Generated files are gathered from the workspace
- LLM-as-Judge Grading: A separate model (GPT-4o) scores outputs across multiple dimensions
- Results: Scores are saved to JSON for analysis and comparison
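As a rough illustration, a minimal eval configuration might look like the sketch below. The field names (judgeModel, expectedOutputs, dimensions, and so on) are assumptions for the sketch, not Nightshift's actual schema; in practice this object would be serialized to the JSON file the eval loads.

```typescript
// Hypothetical eval config shape - field names are illustrative,
// not the real Nightshift schema.
interface EvalConfig {
  name: string;              // eval identifier, e.g. "boot-eval"
  model: string;             // the agent model under test
  judgeModel: string;        // model used for LLM-as-judge grading
  prompt: string;            // the task handed to the agent
  expectedOutputs: string[]; // files the agent is expected to produce
  dimensions: string[];      // quality dimensions the judge scores
}

// Example instance; on disk this would live in the eval's JSON config file.
const bootEval: EvalConfig = {
  name: "boot-eval",
  model: "agent-model-id",
  judgeModel: "gpt-4o",
  prompt: "Bootstrap this repository and generate AGENTS.md.",
  expectedOutputs: ["AGENTS.md"],
  dimensions: ["completeness", "accuracy", "clarity"],
};
```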
Why LLM-as-Judge?
Traditional software tests check for exact matches or simple assertions. Agent outputs are too nuanced for that - a good AGENTS.md can be written in many different ways. LLM-as-judge provides (a grading sketch follows this list):
- Semantic evaluation: Judges meaning, not just syntax
- Multi-dimensional scoring: Captures different quality aspects
- Calibrated assessment: Consistent scoring across runs
- Detailed reasoning: Each score includes justification
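A minimal sketch of what a single judge call could look like, assuming the judge model is reached through the OpenAI Node SDK; the rubric text, dimension names, and 1-10 scale are placeholders rather than Nightshift's actual grading prompts:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Score one piece of agent output on a single dimension.
// The rubric string is a placeholder for the dimension-specific grading prompt.
async function gradeDimension(
  output: string,
  dimension: string,
  rubric: string,
): Promise<{ score: number; reasoning: string }> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    // Request a JSON object so the score and justification stay machine-readable.
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          `You are grading agent output on the dimension "${dimension}".\n` +
          `Rubric:\n${rubric}\n` +
          `Respond with JSON: {"score": <1-10>, "reasoning": "<why>"}`,
      },
      { role: "user", content: output },
    ],
  });
  return JSON.parse(response.choices[0].message.content ?? "{}");
}
```

Requesting a JSON object keeps each score paired with its written justification, which is what makes the per-dimension reasoning easy to store alongside the results.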
Building Custom Evals
Evals are built on top of the OpenCode SDK. The key components are (see the sketch after this list):
- Isolated environment - Temp directories for tools and workspace
- OpenCode client - Sends prompts and streams events
- Grading prompts - Dimension-specific rubrics for the judge
- Result aggregation - Structured JSON output
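Putting those components together, a custom eval might be wired up roughly like this. Note that runAgent stands in for whatever OpenCode client call sends the prompt and streams events, and gradeDimension for the judge call sketched earlier - both are declared as placeholders here, not real SDK functions.

```typescript
import { mkdtempSync, readdirSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface DimensionResult {
  file: string;
  dimension: string;
  score: number;
  reasoning: string;
}

// Placeholders: runAgent represents the OpenCode client interaction,
// gradeDimension the LLM-as-judge call sketched above.
declare function runAgent(opts: { workspace: string; prompt: string }): Promise<void>;
declare function gradeDimension(
  output: string,
  dimension: string,
  rubric: string,
): Promise<{ score: number; reasoning: string }>;

async function runEval(prompt: string, dimensions: string[]): Promise<DimensionResult[]> {
  // Isolated environment: a fresh temp directory the agent can write into.
  const workspace = mkdtempSync(join(tmpdir(), "nightshift-eval-"));

  await runAgent({ workspace, prompt });

  // Output collection: gather the files the agent produced.
  const results: DimensionResult[] = [];
  for (const file of readdirSync(workspace)) {
    const content = readFileSync(join(workspace, file), "utf8");
    // Grading: score each output on each dimension with its own rubric.
    for (const dimension of dimensions) {
      const { score, reasoning } = await gradeDimension(content, dimension, "rubric text");
      results.push({ file, dimension, score, reasoning });
    }
  }

  // Result aggregation: structured JSON for later analysis and comparison.
  writeFileSync(join(workspace, "results.json"), JSON.stringify(results, null, 2));
  return results;
}
```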
