AGENTS.md and SKILL.md files across five dimensions.
## Running a Boot Eval
| Flag | Description |
|---|---|
| `--evalBoot` | Evaluate the bootstrap process |
| `--filePath` | Path to the eval configuration JSON |
| `--runs` | Number of evaluation runs (default: 1) |
## Configuration File
The eval configuration specifies what to test and what outputs to expect:

| Field | Description |
|---|---|
| `inputPath` | Path to the BOOT.md file containing the user intent |
| `model` | Model identifier in provider/model format |
| `output.skillsPaths` | Expected skill files the agent should generate |
| `output.agentsMdPath` | Expected AGENTS.md location |
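The table maps onto a small JSON object. As a rough sketch of that shape, expressed here as a TypeScript interface for readability (field names come from the table above; the actual schema read by boot-agent.ts may include more fields):

```ts
// Illustrative shape only — the real config type may differ in detail.
interface BootEvalConfig {
  inputPath: string;        // path to the BOOT.md file with the user intent
  model: string;            // "provider/model" identifier
  output: {
    skillsPaths: string[];  // SKILL.md files the agent is expected to generate
    agentsMdPath: string;   // expected AGENTS.md location
  };
}
```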
## Supported Providers
The eval system detects the provider from the model string and looks for the corresponding API key:

| Provider | Model Prefix | Environment Variable |
|---|---|---|
| Anthropic | anthropic/ | ANTHROPIC_API_KEY |
| OpenAI | openai/ | OPENAI_API_KEY |
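In code terms, the detection is essentially a prefix match on the model string. A minimal sketch (illustrative only; the actual lookup in boot-agent.ts may be structured differently):

```ts
// Map the provider prefix from the table above to the required API key variable.
function requiredApiKeyEnv(model: string): string {
  if (model.startsWith("anthropic/")) return "ANTHROPIC_API_KEY";
  if (model.startsWith("openai/")) return "OPENAI_API_KEY";
  throw new Error(`Unsupported provider in model string: ${model}`);
}

// requiredApiKeyEnv("openai/gpt-4o") === "OPENAI_API_KEY"
```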
## Execution Flow
The `bootEval()` function in `src/cli/cmd/eval/boot-agent.ts:436` orchestrates the evaluation:
- Environment Setup: Creates isolated temp directories at `/tmp/booteval-{timestamp}-{random}/` (see the sketch after this list)
- Tool Installation: Downloads and installs uv, ripgrep, and OpenCode into the temp prefix
- Server Launch: Starts an OpenCode server on a random port (4096-5095)
- Agent Execution: Sends the bootstrap prompt and streams events
- Output Collection: Gathers generated files from the workspace
- LLM Grading: Scores each file using GPT-4o as judge
- Cleanup: Removes temp directories and stops the server
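The environment-setup and server-launch steps amount to picking unique paths and ports. A minimal sketch of that isolation logic, using the documented directory pattern and port range (illustrative; the real `bootEval()` differs in detail):

```ts
// Unique temp prefix plus a random server port, per the documented
// /tmp/booteval-{timestamp}-{random}/ pattern and 4096-5095 range.
const suffix = Math.random().toString(36).slice(2, 8);
const workDir = `/tmp/booteval-${Date.now()}-${suffix}/`;
const port = 4096 + Math.floor(Math.random() * 1000); // 4096-5095
```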
## Grading System
Each generated file is scored across five dimensions (0-20 points each) for a total of 0-100 points.

### SKILL.md Dimensions
The `skillGraderPrompt()` function in `src/cli/cmd/eval/boot-agent.ts:116` defines the grading criteria:
| Dimension | What It Measures |
|---|---|
| D1 - Knowledge Delta | Expert knowledge the model can’t derive from first principles - decision trees, non-obvious trade-offs, domain heuristics |
| D2 - Specificity & Actionability | Concrete, executable instructions - copy-pasteable commands, real file paths, no interpretation needed |
| D3 - Anti-Patterns & Safety | Explicit NEVER/ALWAYS/ASK-FIRST rules with concrete reasons and specific failure modes |
| D4 - Structure & Discoverability | Clear description for routing, progressive disclosure, concise enough to fit in context |
| D5 - Tailoring to User Intent | Customization to the stated purpose, tech stack, and actual project structure |
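To make the judging step concrete, here is a hedged sketch of how a generated file might be sent to the GPT-4o judge. It assumes the official OpenAI Node SDK and a JSON reply shape invented for illustration; the real prompt text lives in `skillGraderPrompt()`, and the actual call and score parsing in boot-agent.ts may look different:

```ts
import OpenAI from "openai";

// Hypothetical judge call — not the project's actual grading code.
async function gradeFile(graderPrompt: string, fileContent: string) {
  const client = new OpenAI(); // reads OPENAI_API_KEY from the environment
  const res = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: graderPrompt },
      { role: "user", content: fileContent },
    ],
  });
  // Assumes the judge answers with JSON such as {"D1": 12, ..., "total": 59}.
  return JSON.parse(res.choices[0].message.content ?? "{}");
}
```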
### AGENTS.md Dimensions
The `agentsMdGraderPrompt()` function in `src/cli/cmd/eval/boot-agent.ts:178` defines complementary criteria:
| Dimension | What It Measures |
|---|---|
| D1 - Project Specificity | Real package names, file paths, architecture decisions - could only belong to this project |
| D2 - Command Accuracy | Exact, copy-pasteable build/test/lint/run commands verified to work |
| D3 - Safety Boundaries | Three clear tiers (ALWAYS/ASK FIRST/NEVER) naming specific commands and files |
| D4 - Code Style Concreteness | Formatting rules shown through code examples, not prose descriptions |
| D5 - Skill Catalog & Routing | Every skill listed with exact path and one-line description of when to use it |
### Score Interpretation
| Score Range | Meaning |
|---|---|
| 16-20 | Excellent - genuinely useful, specific, actionable |
| 11-15 | Good - mix of useful content with some generic filler |
| 6-10 | Adequate - mostly restates common knowledge |
| 0-5 | Poor - generic, template-like, no project-specific value |
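Read as code, these bands are simple threshold checks on a single 0-20 dimension score; a throwaway sketch:

```ts
// Map one dimension score (0-20) to the qualitative band above.
function band(score: number): "Excellent" | "Good" | "Adequate" | "Poor" {
  if (score >= 16) return "Excellent";
  if (score >= 11) return "Good";
  if (score >= 6) return "Adequate";
  return "Poor";
}
```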
## Output Format
Results are saved to `./eval/{provider}/{model}/boot_eval_result_{run}_{id}.json`:
| Field | Description |
|---|---|
| `file` | Relative path to the generated file |
| `type` | Either `skill` or `agentsMd` |
| `found` | Whether the file was generated |
| `content` | Full file contents (for analysis) |
| `llmScores` | Dimension scores and total |
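For orientation, one entry in that JSON can be modeled roughly as the following TypeScript type (field names come from the table above; the `llmScores` sub-shape is an assumption, and the file written by boot-agent.ts may carry extra metadata):

```ts
// Hypothetical per-file result entry mirroring the documented fields.
interface BootEvalFileResult {
  file: string;                 // relative path to the generated file
  type: "skill" | "agentsMd";
  found: boolean;               // whether the agent produced the file
  content: string;              // full file contents, kept for analysis
  llmScores: {                  // assumed shape: one key per dimension plus a total
    D1: number; D2: number; D3: number; D4: number; D5: number;
    total: number;              // 0-100
  };
}
```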
## Example Results
Here’s an example comparing model performance on a supply chain analytics workspace:

| File | D1 | D2 | D3 | D4 | D5 | Total |
|---|---|---|---|---|---|---|
| odoo-schema/SKILL.md | 12 | 13 | 8 | 14 | 12 | 59 |
| purchasing/SKILL.md | 18 | 15 | 14 | 17 | 13 | 77 |
| analytics/SKILL.md | 12 | 19 | 10 | 13 | 12 | 66 |
| formulas/SKILL.md | 13 | 14 | 10 | 13 | 13 | 63 |
| AGENTS.md | 16 | 15 | 15 | 17 | 16 | 79 |
Patterns to look for when reading these results:
- Model strengths: High D2 scores indicate good actionability
- Model weaknesses: Low D3 scores suggest poor safety boundary definition
- Skill-specific patterns: Some models excel at certain skill types
## Architecture
## Multi-Run Evaluations
LLM outputs are non-deterministic. Run multiple evaluations to get statistically meaningful results (a sketch of the aggregation follows the list):

- Average performance per dimension
- Score variance (consistency)
- Best/worst case outputs
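A small sketch of that aggregation, computing the mean, variance, and best/worst score for one dimension across repeated runs (names are illustrative; it assumes result entries shaped like the output format above):

```ts
// Aggregate a single dimension (e.g. "D3") across several runs of the same eval.
function dimensionStats(runs: { llmScores: Record<string, number> }[], dim: string) {
  const scores = runs.map((r) => r.llmScores[dim]);
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance = scores.reduce((a, b) => a + (b - mean) ** 2, 0) / scores.length;
  return { mean, variance, best: Math.max(...scores), worst: Math.min(...scores) };
}
```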
## Creating Custom Boot Evals
To evaluate a new use case:

- Write a BOOT.md describing the workspace purpose and any pre-defined skills
- Create the eval config with the model to test and expected outputs
- Run the eval and analyze results
- Iterate by adjusting the BOOT.md or trying different models
