Evaluation Methodology

Evaluation was one of the hardest parts of Quintus. Several early scores were misleading until prompt format, metric extraction, parser behavior, and runtime artifacts were audited carefully.

Evaluation Principle

A model comparison is only meaningful when the prompt format and metric path match the question being asked.

Two distinct questions matter:

  • Base capability: Does the distilled model improve raw reasoning and likelihood behavior?
  • Assistant behavior: Does the distilled model handle chat formatting and produce usable responses?

Those questions need separate controls.

Run Identity And Determinism

A benchmark record should identify the checkpoint role (best or last), exact model directory or revision, prompt mode, seeds, decoding mode, and runtime versions. Greedy decoding with fixed seeds makes repeated runs easier to compare, but hardware and kernel drift can still change edge cases.
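
One way to make that record concrete is a small immutable structure that is serialized next to the results. The sketch below is only an illustration: the class name `RunRecord`, its field names, and the allowed values are assumptions, not part of the Quintus code.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical identity record for one benchmark run."""
    checkpoint_role: str     # "best" or "last"
    model_revision: str      # exact model directory or Hub revision
    prompt_mode: str         # "raw" or "chat"
    seeds: list              # every seed used by the run
    decoding: str            # e.g. "greedy"
    runtime_versions: dict   # package name -> version string

    def __post_init__(self):
        # Reject ambiguous identities early instead of at comparison time.
        if self.checkpoint_role not in {"best", "last"}:
            raise ValueError(f"unknown checkpoint role: {self.checkpoint_role!r}")
        if self.prompt_mode not in {"raw", "chat"}:
            raise ValueError(f"unknown prompt mode: {self.prompt_mode!r}")


def to_json(record: RunRecord) -> str:
    # Sorted keys keep the serialized record stable across runs.
    return json.dumps(asdict(record), sort_keys=True)
```

Writing the output of `to_json` into the result directory lets two runs be compared field by field before their scores are compared.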

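Treat determinism as an artifact contract: the recorded seeds, decoding mode, and runtime versions should be enough to rerun the benchmark. An unverified claim of reproducibility is not a substitute.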

Raw-To-Raw And Chat-To-Chat

Raw completion and chat-template prompting activate different model behavior. A base model can be strong in raw mode and weak under chat markup. An instruct model can be strong in chat but weak on raw continuation-style likelihood tasks.

Recommended controls:

  • Raw-to-raw: compare base-style prompts against base-style prompts.
  • Chat-to-chat: compare chat-wrapped prompts against chat-wrapped prompts.
  • Raw-vs-chat within the same model: measure format tax.

Avoid comparing a chat-wrapped distilled model directly against a raw base baseline and treating the delta as pure capability transfer.

Log-Likelihood Tasks Should Usually Stay Raw

Multiple-choice tasks such as ARC-Challenge, HellaSwag, and PIQA often score options by likelihood:

$$P(\text{option} \mid \text{prompt})$$

Wrapping the prompt in chat markup changes the next-token distribution. An aligned model may assign low probability to starting a response with a bare option string after <|im_start|>assistant, so option likelihoods can fall because of formatting rather than weaker reasoning.

For log-likelihood tasks:

  • Use raw completion format unless the benchmark was designed for chat.
  • Prefer acc_norm where length bias matters.
  • Record whether chat templates were applied.
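
The length bias behind acc_norm can be shown with a small sketch. It assumes the normalizer is the option's length in characters, and the helper names and log-likelihood values are hypothetical; the harness's own implementation should be checked before relying on this.

```python
def pick_acc(loglikelihoods):
    """Index of the option with the highest total log-likelihood."""
    return max(range(len(loglikelihoods)), key=loglikelihoods.__getitem__)


def pick_acc_norm(loglikelihoods, options):
    """Index of the option with the highest log-likelihood per character."""
    # Assumption: normalize by character length of each option string.
    scores = [ll / len(opt) for ll, opt in zip(loglikelihoods, options)]
    return max(range(len(scores)), key=scores.__getitem__)


# Illustrative values only: the longer option has a lower total
# log-likelihood simply because it contains more tokens.
options = ["yes", "a much longer but correct answer"]
loglikelihoods = [-4.0, -12.0]

raw_choice = pick_acc(loglikelihoods)                 # picks the short option
norm_choice = pick_acc_norm(loglikelihoods, options)  # picks the long option
```

Here the unnormalized score favors the short option, while the per-character score favors the long one. This is the situation in which acc and acc_norm diverge.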

GSM8K Parser Traps

GSM8K evaluation can be distorted by parser behavior.

Two common filters behave differently:

  • strict-match: looks for an answer after the #### delimiter.
  • flexible-extract: searches for numbers and may choose the last matched number.

A chat model can solve the problem, emit the correct #### answer, miss EOS, and continue into a hallucinated next dialogue turn containing another number. In that case:

  • strict-match may score the response correct.
  • flexible-extract may grab the later hallucinated number and score it wrong.

This is not just a parser detail. It reveals an EOS and prompt-format interaction.
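
The disagreement described above can be reproduced with two simplified extractors. The regular expressions below are stand-ins written for this sketch, not the harness's actual patterns, and the response text is invented for illustration.

```python
import re

# Simplified stand-ins for the two filters (assumptions, not harness code).
STRICT = re.compile(r"#### (-?[0-9.,]+)")
FLEXIBLE = re.compile(r"-?[0-9][0-9.,]*")


def strict_match(text):
    """Return the number that follows the first '#### ' delimiter."""
    match = STRICT.search(text)
    return match.group(1) if match else None


def flexible_extract(text):
    """Return the last number found anywhere in the text."""
    found = FLEXIBLE.findall(text)
    return found[-1] if found else None


# The model answers correctly, misses EOS, then hallucinates a new turn.
response = (
    "She has 3 boxes of 14 apples, so 3 * 14 = 42.\n"
    "#### 42\n"
    "<|im_start|>user\nWhat is 5 + 2?\n"
    "<|im_start|>assistant\nThe answer is 7"
)

strict_answer = strict_match(response)        # "42"
flexible_answer = flexible_extract(response)  # "7"
```

The same response is correct under one filter and wrong under the other, so the gap between the two scores measures runaway generation rather than arithmetic ability.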

Mitigations:

  • Register all relevant EOS tokens, including <|im_end|> and <|endoftext|>.
  • Use deterministic generation for benchmark runs.
  • Avoid excessive fewshot_as_multiturn wrapping unless the model was trained for that shape.
  • Inspect mismatches, not just aggregate scores.

Reasoning Models Need Enough Generation Budget

Instruction-tuned reasoning models may spend hundreds of tokens inside a reasoning trace before reaching the final answer. If max_new_tokens is too small, the model can be cut off before emitting the final answer marker.

That can make a capable model appear weak under exact-match metrics.

For fair GSM8K-style generation:

  • Set a sufficient generation limit.
  • Track truncation rate.
  • Compare extracted answers against raw responses during audits.
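
Truncation can be tracked with a few lines of bookkeeping. This sketch assumes each sample is a dict with hypothetical `text` and `num_new_tokens` fields and that the final answer is marked with `####`; adapt the field names to the actual sample logs.

```python
def truncation_stats(samples, max_new_tokens, answer_marker="####"):
    """Summarize how often generations hit the budget or lack the marker."""
    if not samples:
        raise ValueError("no samples to summarize")
    # A generation that used the whole budget was probably cut off.
    hit_limit = sum(1 for s in samples if s["num_new_tokens"] >= max_new_tokens)
    # A generation without the marker cannot be scored by strict matching.
    missing = sum(1 for s in samples if answer_marker not in s["text"])
    total = len(samples)
    return {
        "truncation_rate": hit_limit / total,
        "missing_marker_rate": missing / total,
    }
```

A high truncation rate alongside a low exact-match score suggests raising the generation limit before drawing conclusions about capability.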

Batched Generation Details

Decoder-only batched generation should use left-padding. With right-padding, the final position of each shorter prompt is a padding token, so generation continues from padding rather than from the prompt's last real token, and batched outputs can differ from single-sample outputs.

Generation parsers should:

  • Set tokenizer.padding_side = "left" for batched generation.
  • Slice decoded continuations by each prompt's true input length.
  • Stop at the first registered EOS token.
  • Record truncation and empty-generation counts.
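
The slicing and EOS rules can be illustrated on plain token-id lists, without a model. The sketch assumes the generator returns the padded prompt followed by the new tokens, as Hugging Face `generate` does for decoder-only models; the pad and EOS ids are hypothetical.

```python
PAD_ID = 0
EOS_IDS = {1, 2}  # hypothetical ids standing in for <|im_end|> and <|endoftext|>


def left_pad(prompts, pad_id=PAD_ID):
    """Pad every prompt on the left so all prompts end in the same column."""
    width = max(len(p) for p in prompts)
    return [[pad_id] * (width - len(p)) + list(p) for p in prompts]


def extract_continuation(output_ids, input_width, eos_ids=EOS_IDS):
    """Drop the (padded) prompt, then cut at the first registered EOS id."""
    continuation = output_ids[input_width:]
    for i, token in enumerate(continuation):
        if token in eos_ids:
            return continuation[:i]
    return continuation


prompts = [[5, 6, 7], [8]]
padded = left_pad(prompts)  # [[5, 6, 7], [0, 0, 8]]
input_width = len(padded[0])

# Simulated output for the second row: padded prompt + new tokens.
row_output = padded[1] + [9, 10, 1, 11]
continuation = extract_continuation(row_output, input_width)  # [9, 10]
```

Because every left-padded prompt ends at the same column, a single slice at the input width removes exactly the prompt for each row, and anything after the first EOS id is discarded.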

English-Only Evaluation Controls

For English-only release checks, filtering the dataset is necessary but not sufficient. Evaluation should also use an English-only system instruction when chat prompts are enabled, register all relevant EOS IDs, and clean generated artifacts that continue into another language after the intended answer.

This cleanup is an evaluation-artifact guard. It is not a substitute for training data quality, SFT, preference tuning, or behavioral calibration.

Metric Extraction Must Be Strict

Post-processing scripts should never fall back loosely to any metric key that starts with the right prefix. A loose fallback can accidentally read:

  • *_stderr
  • alias
  • a different filter result

Robust extraction should:

  • Match the exact metric and filter name.
  • Ignore stderr and alias fields when extracting scores.
  • Fail loudly if the expected key is absent.
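
A strict extractor along these lines might look like the sketch below. It assumes result keys of the form `metric,filter` (for example `exact_match,strict-match`), which matches recent lm-evaluation-harness output but should be verified against the installed version; the function name is hypothetical.

```python
def extract_metric(task_results, metric, filter_name):
    """Return exactly one score, or raise if the expected key is absent."""
    key = f"{metric},{filter_name}"
    if key not in task_results:
        # No prefix matching and no fallback: a missing key is an error.
        raise KeyError(
            f"expected metric key {key!r}; available keys: {sorted(task_results)}"
        )
    value = task_results[key]
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"metric {key!r} is not numeric: {value!r}")
    return float(value)
```

Because the lookup is an exact dictionary access, a key such as `exact_match_stderr,strict-match` or `alias` can never be returned in place of the score.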

Boolean CLI Flags

Some harness flags use action="store_true". Passing "False" after such a flag does not disable it; the presence of the flag enables it.

Correct pattern:

  • Include the flag only when true.
  • Omit the flag when false.

This matters for options such as multiturn few-shot formatting.
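
The behavior is easy to confirm with `argparse` directly. The parser below is a hypothetical stand-in for a harness CLI, and `build_flags` is an illustrative helper; how a real tool treats the stray `False` value depends on its own parser.

```python
import argparse

# Hypothetical parser with one store_true flag.
parser = argparse.ArgumentParser()
parser.add_argument("--fewshot_as_multiturn", action="store_true")

# The flag is enabled by its presence; "False" is left over as an
# unrecognized argument instead of disabling the flag.
args, leftover = parser.parse_known_args(["--fewshot_as_multiturn", "False"])


def build_flags(options):
    """Emit a flag only when its option is true; omit it otherwise."""
    return [f"--{name}" for name, enabled in options.items() if enabled]


enabled_cmd = build_flags({"fewshot_as_multiturn": True})
disabled_cmd = build_flags({"fewshot_as_multiturn": False})
```

Building the command line from a dictionary of booleans keeps the "include only when true" rule in one place instead of spreading it across shell scripts.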

Sample Log Format

lm-evaluation-harness may log different filters for the same document as separate JSONL objects with the same doc_id. A parser that assumes one object contains all filters can crash or silently compare the wrong fields.

Correct approach:

  • Group sample records by doc_id.
  • Index filter-specific records inside each group.
  • Compare strict and flexible outputs from the same document.
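
The grouping steps above can be sketched as follows. The record fields `doc_id`, `filter`, and `exact_match` are assumptions about the sample log layout and should be checked against real files before use.

```python
from collections import defaultdict


def group_by_doc(records):
    """Index sample records as {doc_id: {filter_name: record}}."""
    grouped = defaultdict(dict)
    for record in records:
        doc_id, filter_name = record["doc_id"], record["filter"]
        if filter_name in grouped[doc_id]:
            raise ValueError(f"duplicate record: doc_id={doc_id}, filter={filter_name}")
        grouped[doc_id][filter_name] = record
    return dict(grouped)


def filter_disagreements(grouped, first="strict-match",
                         second="flexible-extract", metric="exact_match"):
    """Return doc_ids whose score differs between the two filters."""
    disagreements = []
    for doc_id, by_filter in sorted(grouped.items()):
        if first not in by_filter or second not in by_filter:
            raise KeyError(f"doc_id={doc_id} is missing a filter record")
        if by_filter[first][metric] != by_filter[second][metric]:
            disagreements.append(doc_id)
    return disagreements
```

The returned doc_ids are the samples worth reading by hand, since they are where parser behavior, rather than model ability, decided the score.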

JSONL Parsing With Unicode Line Separators

Model outputs can contain Unicode line separator characters such as \u2028 or \u2029. Calling str.splitlines() on a whole JSONL file can split a valid JSON string into invalid fragments.

Robust JSONL parsing:

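import json

# path is the JSONL file to read
records = []
with open(path, "r", encoding="utf-8") as f:
    for line in f:  # iterate the handle; do not splitlines() the whole file
        if line.strip():  # skip blank lines
            records.append(json.loads(line))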

Iterating the file handle respects actual line endings and does not split on Unicode separators inside JSON strings.
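
The failure mode can be reproduced in memory. In this small demonstration, splitting on `"\n"` stands in for iterating a text-mode file handle, and the payload is invented for illustration.

```python
import json

# Two JSONL records; the first contains a raw U+2028 inside a string.
line_a = json.dumps({"resp": "first\u2028second"}, ensure_ascii=False)
line_b = json.dumps({"resp": "third"}, ensure_ascii=False)
payload = line_a + "\n" + line_b + "\n"

# str.splitlines() also splits on U+2028, producing invalid fragments.
fragments = payload.splitlines()

# Splitting only on "\n" keeps each JSON object intact.
records = [json.loads(line) for line in payload.split("\n") if line.strip()]
```

`fragments` holds three pieces for two records, and the first two pieces are not valid JSON on their own, while `records` parses both objects.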

Hub Loading And Snapshot Hygiene

If weights or datasets are stored on the Hub, the client should be told the correct repository type. Download or snapshot the artifact first, verify that expected files exist, then pass the local directory to Transformers, vLLM, or the evaluation harness.

This separates transfer failures from engine construction and avoids repeated downloads during long benchmark runs.

Optional high-throughput Hub transfer backends such as hf_transfer can reduce setup time, but the correctness contract is still local snapshot validation.
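
A download-then-validate flow might be sketched as below. It uses `snapshot_download` from `huggingface_hub`; the helper names and the list of required files are assumptions made for illustration, and the right file list depends on the artifact.

```python
from pathlib import Path


def fetch_snapshot(repo_id, repo_type, revision, local_dir):
    """Download a pinned snapshot into a local directory."""
    # Imported lazily so validate_snapshot works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type=repo_type,  # e.g. "model" or "dataset"
        revision=revision,
        local_dir=local_dir,
    )


def validate_snapshot(local_dir, required_files):
    """Fail before engine construction if expected files are missing."""
    root = Path(local_dir)
    missing = [name for name in required_files if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"snapshot at {root} is missing: {missing}")
    return root
```

Only the validated local directory is then handed to Transformers, vLLM, or the evaluation harness, so a transfer problem surfaces as a clear file error rather than an engine failure.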

Path-Length And Output Artifacts

Evaluation tools can derive output paths from model paths. Deep Hugging Face cache paths can become extremely long after sanitization, especially on Windows.

Public guidance:

  • Copy or symlink model weights to a short local directory before evaluation.
  • Pass short relative paths to the evaluator.
  • Keep result directories shallow.
  • Fail if expected sample files are missing.

This prevents silent write failures and missing-output confusion.
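
The last guideline can be enforced with a short check after each run. The file-name pattern `samples_{task}_*.jsonl` is an assumption about how the evaluator names its sample logs, and the function name is hypothetical.

```python
from pathlib import Path


def require_sample_files(result_dir, task, pattern="samples_{task}_*.jsonl"):
    """Return non-empty sample files for a task, or raise if none exist."""
    matches = sorted(Path(result_dir).rglob(pattern.format(task=task)))
    # An empty file usually means the write failed partway through.
    nonempty = [p for p in matches if p.stat().st_size > 0]
    if not nonempty:
        raise FileNotFoundError(
            f"no non-empty sample files for task {task!r} under {result_dir}"
        )
    return nonempty
```

Running this immediately after evaluation turns a silent write failure into an explicit error at the point where it can still be diagnosed.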

vLLM Evaluation Settings

For large benchmark runs, vLLM can greatly reduce runtime through continuous batching and KV-cache management.

Useful settings in development:

  • batch_size = auto
  • prefix caching enabled
  • PagedAttention-backed KV-cache management when available
  • bounded GPU memory utilization
  • explicit max_model_len where context bounds matter
  • explicit attention backend where the runtime supports it
  • local pre-caching of model snapshots before engine construction
  • explicit engine teardown between model runs
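
Several of these settings can be collected into one explicit command. The sketch below assumes the `lm_eval` CLI with its vLLM backend and the vLLM engine arguments `gpu_memory_utilization`, `max_model_len`, and `enable_prefix_caching`; the model path and numeric values are hypothetical, and flag names should be checked against the installed versions.

```python
def build_model_args(settings):
    """Join settings into the comma-separated key=value form."""
    return ",".join(f"{key}={value}" for key, value in settings.items())


settings = {
    "pretrained": "models/student",   # hypothetical short local path
    "gpu_memory_utilization": 0.85,   # illustrative bound, not a recommendation
    "max_model_len": 4096,            # illustrative context bound
    "enable_prefix_caching": True,
}

# Argument list suitable for subprocess.run; nothing is executed here.
command = [
    "lm_eval",
    "--model", "vllm",
    "--model_args", build_model_args(settings),
    "--tasks", "gsm8k",
    "--batch_size", "auto",
]
```

Storing `command` in the benchmark artifact records the exact engine settings alongside the scores.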

The benchmark artifact should record runtime versions for:

  • lm-eval
  • vllm
  • transformers
  • torch
  • datasets
  • accelerate

Version drift can change metric keys, generation behavior, attention backends, and output formats.
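
Recording those versions can be automated with the standard library. This is a minimal sketch: the distribution names (in particular `lm_eval`) are assumptions about how the packages are installed, and a missing package is recorded as `None` rather than raising.

```python
from importlib import metadata

PACKAGES = ("lm_eval", "vllm", "transformers", "torch", "datasets", "accelerate")


def collect_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly instead of skipping the key.
            versions[name] = None
    return versions
```

The resulting dictionary can be written into the benchmark artifact so that two result files can be checked for version drift before their scores are compared.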

Qualitative Evaluation

Open-ended prompt suites are useful, but they are not replacements for standardized benchmarks.

A good qualitative suite should:

  • Compare raw and chat modes separately.
  • Use fixed prompts and deterministic ordering.
  • Include benchmark-template leakage probes.
  • Include factual, math, code, system design, and LLM-internals prompts.
  • Record complete outputs.
  • Inspect inherited base-model errors separately from new chat-mode errors.

Qualitative failures should be classified:

  • Distillation failure: the student did not absorb useful teacher probability structure.
  • Alignment gap: capability exists, but the generation path lacks SFT, preference tuning, or calibration.
  • Data contamination: the model repeats benchmark or pretraining artifacts.
  • Code reliability gap: prose is correct, but generated code violates stated constraints.

This distinction prevents the wrong fix. Distillation failures need KD changes. Alignment gaps need SFT, DPO, RLHF, or curated behavior data.

Release Gate

The final checkpoint should pass all of these before public claims are made:

  • Benchmark tasks use the intended prompt format.
  • Metric keys are exact.
  • Sample counts match the full benchmark set.
  • Raw and chat comparisons are not mixed.
  • Generation limits are sufficient for the model style.
  • Checkpoint identity is explicit.
  • Missing requested checkpoints fail instead of falling back to older local weights.
  • Runtime versions are recorded.
  • Mismatch samples are inspected for parser artifacts.
  • No stale result directory or old JSON file is reused.
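
Two of these gates, failing on a missing checkpoint and refusing stale result directories, lend themselves to small guards. The directory layout (`<root>/best`, `<root>/last`, each holding a `config.json`) and the function names are hypothetical.

```python
from pathlib import Path


def resolve_checkpoint(root, role):
    """Return the requested checkpoint directory, never a fallback."""
    if role not in {"best", "last"}:
        raise ValueError(f"unknown checkpoint role: {role!r}")
    path = Path(root) / role
    if not (path / "config.json").is_file():
        # Do not fall back to another role or to older local weights.
        raise FileNotFoundError(f"requested checkpoint not found: {path}")
    return path


def require_fresh_dir(result_dir):
    """Create the result directory, refusing to reuse a non-empty one."""
    path = Path(result_dir)
    if path.exists() and any(path.iterdir()):
        raise FileExistsError(f"result directory is not empty: {path}")
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Both guards raise instead of guessing, so a release run either evaluates the intended weights into a clean directory or stops.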