---
license: mit
language:
  - en
tags:
  - benchmark
  - physics
  - scientific-discovery
  - llm-evaluation
  - n-body-simulation
  - interactive-evaluation
task_categories:
  - text-generation
  - other
task_ids:
  - other
pretty_name: DiscoverPhysics
size_categories:
  - n<1K
extra_gated_prompt: >-
  DiscoverPhysics is a benchmark for evaluating LLM agents on open-ended
  scientific discovery in simulated worlds with non-canonical physics.
  Access to the full benchmark suite, including the 11 private worlds and
  their evaluation rubrics, is gated to preserve the validity of the
  benchmark. By requesting access, you agree to the following terms:

    1. You will not publish, redistribute, or post the private world
       definitions, ground-truth force laws, or evaluation rubrics in any
       public venue (including GitHub, arXiv, blog posts, or social media).
    2. You will not use the private worlds or rubrics as training or
       fine-tuning data for any language model.
    3. You will cite the DiscoverPhysics paper in any work that uses this
       benchmark.
    4. You will report benchmark results honestly and reproducibly.

  Access requests are reviewed manually and typically processed within a
  few business days.
configs:
  - config_name: public_worlds
    data_files:
      - split: worlds
        path: worlds/public/*.json
  - config_name: private_worlds
    data_files:
      - split: worlds
        path: worlds/private/*.json
---


# DiscoverPhysics
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg?style=for-the-badge)](#license)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=for-the-badge&logo=python&logoColor=white)](https://github.com/psf/black)
## Benchmarking routine for scientific discovery agents

![Agent pipeline diagram](imgs/agent_pipeline_diagram-1.png)

*The discovery pipeline: a physics simulator generates a world and an initial dataset, the LLM agent runs up to $n$ experimentation rounds against the simulator, then submits a final law that is scored by trajectory MSE and an LLM-as-judge explanation grade.*


LLM agents are placed in simulated physical worlds with unknown governing laws. Through iterative experimentation (observing particle trajectories, designing new experiments, and proposing equations), they must discover the hidden physics from scratch. Can they do this? Can we help them? All to be determined.

The simulator generates diverse worlds by randomizing field equations, particle-field couplings, and symmetry structures, forcing agents to perform genuine scientific reasoning rather than pattern matching against known physics.

### Current benchmarking results
![Benchmark results across frontier and open-weight models](imgs/moneyplot2-1.png)

*Benchmark results across the suite. **Left:** trajectory MSE vs. LLM-judge explanation score — the frontier reasoning models (Opus 4.7, GPT-5.5) cluster in the upper-left "good fit + good explanation" corner, while the rest trade one off against the other. **Middle:** expected worlds passed at $k$ seeds — Opus 4.7 and GPT-5.5 perform well, with most other models plateauing well below 2/11. **Right:** Expected pass@k=3 percentage plotted against model release date, showing a clear capability trend over the past year.*

## How It Works

Each world is governed by a generalized field equation:

$$\frac{\partial^n \varphi}{\partial t^n} = \mathcal{L}[\varphi] + \mathcal{N}[\varphi] + S(\text{particles})$$

where $n \in \{0, 1, 2\}$ sets the temporal order (constraint, diffusion, or wave), $\mathcal{L}$ is a linear spatial operator, $\mathcal{N}$ contains nonlinear terms, and $S$ couples particles to the field. Particles feel forces from the field and move according to Newton's second law.

The agent doesn't see any of this. It only sees noisy particle positions over time — and must figure out the rest.

**Discovery loop:**

1. The agent receives a mission describing what it can observe and control
2. It designs an experiment (particle positions, velocities, properties)
3. The simulator runs the experiment and returns trajectory data
4. The agent analyzes results, forms hypotheses, and designs follow-up experiments
5. After sufficient evidence, it submits a proposed law as executable Python
6. The law is evaluated against held-out test trajectories
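
Step 5 asks for the law as executable Python. The interface the evaluator actually expects is not described here, so the sketch below is only a hypothetical illustration: the function name `proposed_law`, its arguments, the softening term `eps`, and the return convention are all assumptions. It encodes a pairwise attractive $1/r$ force in 2D, the kind of law an agent might submit for the `gravity` world.

```python
def proposed_law(positions, masses, G=1.0, eps=1e-9):
    """Hypothetical submission: pairwise attractive 1/r force in 2D.

    positions: list of (x, y) tuples; masses: list of floats.
    Returns one (ax, ay) acceleration per particle.
    The signature is an assumption, not the benchmark's real interface.
    """
    accels = []
    for i, (xi, yi) in enumerate(positions):
        ax = ay = 0.0
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            r2 = dx * dx + dy * dy + eps  # eps avoids division by zero
            # |a| = G m_j / r along the unit vector (dx, dy) / r
            ax += G * masses[j] * dx / r2
            ay += G * masses[j] * dy / r2
        accels.append((ax, ay))
    return accels


# Two particles on the x-axis, separated by r = 2.
accels = proposed_law([(0.0, 0.0), (2.0, 0.0)], [1.0, 3.0])
```

Each particle is pulled toward the other, with an acceleration that scales with the other particle's mass and falls off as $1/r$.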

## Simulation Backends

The simulator ships with two physics engines, selected by `--engine`. Most
worlds run on either engine; some are N-body only (see the Engines column
under Predefined Worlds):

- **N-body** (`--engine nbody`, **default**) — direct O(N²) pairwise force
  computation under a 4th-order Yoshida symplectic integrator. The pair
  kernel is the analytic Green's function of the world's linear operator
  (2D Poisson, 2D Yukawa via modified Bessel `K_0/K_1`, or 2D Riesz
  fractional). Best for static-field worlds (`temporal_order = 0`) where
  accuracy and energy conservation matter; supports Hubble-flow, "ether"-style
  body forces, and other position- or velocity-dependent terms that don't fit
  into a linear PDE.

- **Field sampler** (`--engine field`) — JAX/JIT FFT-based field on a periodic
  grid with Cloud-In-Cell (CIC) particle ↔ field interpolation. Required for
  time-evolving worlds and for any world whose physics is genuinely a linear
  PDE on the field rather than an instantaneous pairwise law.

The N-body engine is the default because it's more accurate on static-field
worlds and has no grid resolution to tune. The field engine kicks in when a
world needs it, and can also be forced for cross-engine sanity checks against
the N-body trajectories.

## Predefined Worlds

| World | Temporal Order | Operator / Extra Term | What the Agent Must Discover | Engines |
|---|---|---|---|---|
| **gravity** | $n=0$ | Laplacian | Logarithmic potential, i.e. an attractive $1/r$ force in 2D | nbody · field |
| **yukawa** | $n=0$ | Screened Poisson (Helmholtz) | Short-range exponentially suppressed force, screening length $\lambda$ | nbody · field |
| **fractional** | $n=0$ | Fractional Laplacian $-(-\nabla^2)^\alpha$ | Anomalous power-law force, fit $\alpha$ | nbody · field |
| **coulomb_easy** | $n=0$ | Attractive Coulomb $F = k\,p_1 p_2 / r^2$, 2 particles | Central inverse-square law from a fixed source | nbody only |
| **extra_dimensions** | $n=0$ | Kaluza-Klein image-sum kernel, $R_c = 0.5$ | Crossover from 2D ($\propto 1/r$) at $r \gg R_c$ to 3D ($\propto 1/r^2$) at $r \lesssim R_c$ | nbody only |
| **circle** | $n=0$ | Fractional Laplacian, 11 ring particles | Force law from a fixed ring geometry | nbody · field |
| **three_species** | $n=0$ | Laplacian, 3 hidden classes + 5 neutral probes | Three species (one repulsive) + neutral probes | nbody · field |
| **dark_matter** | $n=0$ | Laplacian, hidden 10-particle dark halo | Existence and strength of unobserved sources | nbody · field |
| **ether** | $n=0$ | Laplacian + uniform body-force $\alpha\hat{\mathbf{y}}$ | Central law + a global preferred-direction drift | nbody only |
| **hubble** | $n=0$ | Laplacian + radial Hubble flow $H\,\mathbf{r}$ | Central law + a position-dependent outward push | nbody only |
| **oscillator** | $n=0$ | Laplacian with time-modulated coupling $G(t)\,\nabla^2\varphi,\; G(t) = G_0\cos(\omega t + \varphi_0)$ | Sinusoidally varying coupling that periodically reverses sign (the same configuration that attracts at one phase repels half a period later); recover period $T$, amplitude $G_0$, phase $\varphi_0$ | nbody only |
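
For the `gravity` row, the link between a Laplacian operator and a $1/r$ force in 2D can be made explicit. The normalization of the point source (strength $m$ entering as $2\pi G m\,\delta^{2}$) and the sign convention are assumptions chosen for illustration and may differ from the simulator's:

$$\nabla^2 \varphi = 2\pi G\, m\, \delta^{2}(\mathbf{r}) \;\Longrightarrow\; \varphi(r) = G\, m \ln r + \text{const}, \qquad \mathbf{F} = -m'\,\nabla\varphi = -\frac{G\, m\, m'}{r}\,\hat{\mathbf{r}}$$

The potential is logarithmic because $\nabla^2 \ln r = 2\pi\,\delta^{2}(\mathbf{r})$ in two dimensions, and its gradient gives a force on a second particle $m'$ that decays as $1/r$ rather than the familiar 3D $1/r^2$.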

## Getting Started

### Prerequisites

- Python 3.9+
- [JAX](https://github.com/jax-ml/jax)

### Installation

```bash
# Clone the repository
git clone https://github.com/SampsonML/discovery-agents.git
cd discovery-agents

# Install the physics simulator
pip install -e PhysicsSchool/

# Install the discovery agent
pip install -e ScienceAgent/
```

### Running the Tests

```bash
pytest PhysicsSchool/tests/
```

### Running a Discovery Agent

Set your API key for the LLM provider:

```bash
export ANTHROPIC_API_KEY="your-key-here"
```

Run the agent on a world:

```bash
python ScienceAgent/run_discovery.py --world gravity --model claude-sonnet-4-5
```

The agent will iteratively design experiments, observe results, and propose a governing law. Results are saved as JSON logs and trajectory plots.

### Supervisor Critic

Enable an optional supervisor agent that reviews each experiment round (from round 2 onward) for rule compliance and information gain:

```bash
python ScienceAgent/run_discovery.py --world gravity --model claude-sonnet-4-5 --use-critic
```

The critic defaults to `claude-haiku-4-5-20251001` for fast, low-cost feedback. Override with `--critic-model`:

```bash
python ScienceAgent/run_discovery.py --world gravity --model claude-sonnet-4-20250514 --use-critic --critic-model claude-sonnet-4-20250514
```

The critic checks that the science agent follows its experimental protocol and that each experiment provides new information not seen in previous rounds. Feedback is injected into the conversation so the science agent can course-correct.

### Single world example: Fractional Gravity on a Ring

Eleven particles are placed on a ring and interact via a fractional-Laplacian gravity field. The agent must discover the anomalous power-law force from noisy trajectories alone. Run with Opus 4.6 as the discovery agent and Sonnet 4.5 as the critic:

```bash
python ScienceAgent/run_discovery.py \
  --world circle \
  --model claude-opus-4-6 \
  --use-critic \
  --critic-model claude-sonnet-4-5 \
  --plot circle_plot.png
```

The true fractional exponent is $\alpha = 1.5$. The agent discovers a force law with fractional exponent $\alpha = (1+\sqrt{5})/2 \approx 1.618$ (the golden ratio), about 8% above the true value, and achieves a mean position error of ~0.064:

![Trajectory comparison](imgs/circle_demo.png)


### Batch Benchmarking with YAML Configs

For sweeping the agent across many (model × world × seed) combinations, we include a YAML-driven runner (`scripts/run_benchmark.py`) that generates a reproducible bash script, executes it, and writes summary tables automatically. A companion analysis script (`scripts/run_stats.py`) then produces plots and a per-model rollup from the resulting trial JSONs.

A ready-to-go config is provided at `configs/bench.yml`:

```yaml
name: production_run                       # output dir under results/yml_bench/
models:
  - claude-opus-4-7
critic: off                                # 'on' or 'off'
critic_model: claude-sonnet-4-6            # only used if critic: on
judge_model: claude-opus-4-6               # LLM-judge that scores prose explanations
max_rounds: 16
noise_std: 0.075                           # scalar or list (multi-σ sweep)
random_experiments: off                    # 'on' replaces LLM-driven loop with random params
no_mse: off                                # 'on' hides trajectory-MSE feedback (mutually exclusive with random_experiments)
worlds:
  - gravity
  - yukawa
  - fractional
  - dark_matter
  - three_species
  - ether
  - hubble
  - coulomb_easy
  - extra_dimensions
  - circle
  - oscillator
seeds: [0, 1]
```

Three usage modes:

```bash
# generate run.sh, execute it, write summary tables (typical full sweep)
python scripts/run_benchmark.py configs/bench.yml

# generate run.sh only (inspect before executing)
python scripts/run_benchmark.py configs/bench.yml --no-run

# re-aggregate an already-completed run directory
python scripts/run_benchmark.py --aggregate-only results/yml_bench/production_run
```

Each run produces:

```
results/yml_bench/<name>/
├── run.sh                                 # generated bash, archived for reproducibility
├── config.yml                             # archived input
├── summary.txt                            # per-(model, world) expl_score and norm_MSE
├── summary_per_model.txt                  # pooled-across-worlds rollup
└── <model>/<world>[_noise<σ>]_seed<n>.{json,txt,stdout.log}
```
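
As a rough guide to how large a sweep is, the sketch below enumerates the trial files a config would be expected to produce. It is a hypothetical helper, not part of `scripts/run_benchmark.py`: the function name, the choice to add the `_noise<σ>` suffix only when `noise_std` is a list, and the formatting of σ in the filename are all assumptions read off the layout above.

```python
from itertools import product


def expected_trials(models, worlds, seeds, noise_std):
    """List the per-trial JSON paths implied by a (model x world x sigma x seed) sweep."""
    multi = isinstance(noise_std, (list, tuple))
    sigmas = list(noise_std) if multi else [noise_std]
    paths = []
    for model, world, sigma, seed in product(models, worlds, sigmas, seeds):
        # Assumption: the noise suffix appears only in multi-sigma sweeps.
        suffix = f"_noise{sigma}" if multi else ""
        paths.append(f"{model}/{world}{suffix}_seed{seed}.json")
    return paths


# Values copied from configs/bench.yml as shown above.
worlds = [
    "gravity", "yukawa", "fractional", "dark_matter", "three_species", "ether",
    "hubble", "coulomb_easy", "extra_dimensions", "circle", "oscillator",
]
paths = expected_trials(["claude-opus-4-7"], worlds, seeds=[0, 1], noise_std=0.075)
```

With one model, eleven worlds, and two seeds, `len(paths)` is 22, so the provided config launches 22 discovery runs.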

The `norm_MSE` columns are `mean_pos_error / Var(GT_world)` — MSE divided by the per-world ground-truth trajectory variance — so values are comparable across worlds with very different natural scales. A trial passes iff `norm_MSE < 0.1` AND explanation-judge score ≥ 0.75. Per-world variances are hardcoded in `scripts/run_benchmark.py`.
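
The pass rule can be written down in a few lines. This is a sketch of the stated criterion only; the function names are hypothetical and the numbers in the example are made up, not taken from `scripts/run_benchmark.py`.

```python
def normalized_mse(mean_pos_error, gt_variance):
    """Divide the position error by the world's ground-truth trajectory variance."""
    if gt_variance <= 0:
        raise ValueError("ground-truth variance must be positive")
    return mean_pos_error / gt_variance


def trial_passes(mean_pos_error, gt_variance, judge_score,
                 mse_threshold=0.1, judge_threshold=0.75):
    """A trial passes only if both the fit and the explanation clear their thresholds."""
    fit_ok = normalized_mse(mean_pos_error, gt_variance) < mse_threshold
    explanation_ok = judge_score >= judge_threshold
    return fit_ok and explanation_ok


# Illustrative values: the same raw error passes or fails depending on the world's scale.
wide_world = trial_passes(mean_pos_error=0.064, gt_variance=1.0, judge_score=0.8)
tight_world = trial_passes(mean_pos_error=0.064, gt_variance=0.5, judge_score=0.8)
```

Here `wide_world` is `True` (0.064 < 0.1) while `tight_world` is `False` (0.128 ≥ 0.1), which is why the normalization matters when comparing worlds.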

To produce plots and a richer per-model rollup from a completed run:

```bash
python scripts/run_stats.py results/yml_bench/production_run
```

This writes into `results/yml_bench/production_run/analysis/`:

```
analysis/
├── summary.txt                            # per-(model, world) flat table
├── summary_per_model.{txt,png,pdf}        # pooled rollup with @k / E@k columns
├── pareto_and_expected_passed_at_k.{png,pdf}
├── per_world_passed.{png,pdf}
├── world_difficulty_score.{png,pdf}
└── model_world_score_heatmap.{png,pdf}
```

The `@k=K` column counts worlds where at least one of the first K seeds achieved a trial-pass; the `E@k=K` column reports the expected percentage of worlds passed when K seed positions are sampled uniformly without replacement from the run's seed pool (Monte Carlo over 1000 draws). The pool size is read from `config.yml` automatically; values reported as `mean ± SE` are arithmetic, and `mean +up/−down` are geometric (asymmetric SE in raw units, derived from log-space bootstrap with 5000 resamples).
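
The two columns can be sketched as follows. This is an illustrative reimplementation under assumptions, not the code in `scripts/run_stats.py`: the function names and the `passes_by_world` layout (world name mapped to a list of per-seed pass flags) are hypothetical. The closed-form version is included only as a cross-check on the Monte Carlo estimate; it uses the fact that a world with `c` passing seeds out of `n` is missed by a size-K draw with probability $\binom{n-c}{K}/\binom{n}{K}$.

```python
import random
from math import comb


def passed_at_k(passes_by_world, k):
    """Count worlds where at least one of the first k seeds passed."""
    return sum(any(flags[:k]) for flags in passes_by_world.values())


def expected_pass_pct_mc(passes_by_world, k, draws=1000, rng=None):
    """Monte Carlo estimate: sample k seed positions without replacement per draw."""
    rng = rng or random.Random(0)
    n_seeds = len(next(iter(passes_by_world.values())))
    total = 0
    for _ in range(draws):
        chosen = rng.sample(range(n_seeds), k)
        total += sum(any(flags[i] for i in chosen)
                     for flags in passes_by_world.values())
    return 100.0 * total / (draws * len(passes_by_world))


def expected_pass_pct_exact(passes_by_world, k):
    """Closed-form expectation of the same quantity (hypergeometric miss probability)."""
    probs = []
    for flags in passes_by_world.values():
        n, c = len(flags), sum(flags)
        probs.append(1.0 - comb(n - c, k) / comb(n, k))
    return 100.0 * sum(probs) / len(probs)


# Made-up pass flags for three worlds and four seeds.
passes = {
    "gravity": [True, False, False, False],
    "yukawa": [False, False, False, False],
    "hubble": [True, True, True, True],
}
at_2 = passed_at_k(passes, 2)
exact_2 = expected_pass_pct_exact(passes, 2)
mc_2 = expected_pass_pct_mc(passes, 2)
```

Note that `@k` depends on seed order (only the first K seeds count), whereas `E@k` averages over all size-K subsets of the pool.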

## Experimental Rounds
We show an example of an agent experimenting with new particle positions to discover the underlying laws of the oscillator system. By choosing probe positions wisely, agents can acquire much more information about a system, which ideally helps them discover its governing laws of motion.

![Oscillator narrative across rounds](imgs/oscillator_seed3_narrative-1.png)

*A successful run on the `oscillator` world (seed 3). Left column: the agent's reasoning at each round. Right column: the experiments it designed, with ground-truth trajectories (solid), the agent's proposed law (dashed), and the noisy observations it actually saw (×). Round 1 is a quick parameter sweep over short timescales; Round 2 is the "aha" — extending the time window exposes a periodic flip in the radial velocity, ruling out a static $1/r^2$ law; Round 3 verifies a time-dependent radial law and submits. The final fit recovers $\omega \approx \pi/2$ and a mean position error of 0.0015 on held-out trajectories.*

## Supported LLM Providers

The agent supports multiple LLM backends, though Groq seems to be the most frictionless free option:

- **Anthropic** (Claude)
- **OpenAI** (GPT, o1)
- **Azure OpenAI** (GPT-5.4 family)
- **Together.ai** (open-weight models — Llama 4, Qwen 3, DeepSeek, Kimi, gpt-oss, Mixtral, ...)

Set the corresponding environment variable (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `TOGETHER_API_KEY`, etc.) and pass the model name to `run_discovery.py`. Provider routing is done by model-string prefix — e.g. `together/Qwen/Qwen3-235B-A22B-Instruct-2507-tput`.
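
Prefix-based routing of this kind can be sketched as a small lookup table. This is a hypothetical illustration rather than the agent's actual router: only the `together/` prefix and the three environment variable names come from the text above, while the `claude-` and `gpt-` prefixes, the provider labels, and the function names are assumptions.

```python
import os

# (prefix, provider label, API-key environment variable).
# Only "together/" is documented above; the other prefixes are guesses.
_ROUTES = (
    ("together/", "together", "TOGETHER_API_KEY"),
    ("claude-", "anthropic", "ANTHROPIC_API_KEY"),
    ("gpt-", "openai", "OPENAI_API_KEY"),
)


def route_model(model):
    """Return (provider, env_var) for the first matching model-string prefix."""
    for prefix, provider, env_var in _ROUTES:
        if model.startswith(prefix):
            return provider, env_var
    raise ValueError(f"no provider registered for model {model!r}")


def missing_key(model, environ=os.environ):
    """Return the name of the unset API-key variable, or None if it is set."""
    _, env_var = route_model(model)
    return None if environ.get(env_var) else env_var


provider, key_name = route_model("together/Qwen/Qwen3-235B-A22B-Instruct-2507-tput")
```

A check like `missing_key(...)` before launching a long sweep catches an unset API key early instead of partway through a run.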

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.