Entropy-TRPO Model Weights
PyTorch checkpoints for the comparative study in A Review of Entropy-Based Extensions to Trust Region Policy Optimization.
Repository layout
Each checkpoint directory contains:
| File | Description |
|---|---|
| policy.pt | Policy network state dict |
| value.pt | Value network state dict |
| config.json | Training hyperparameters |
| metadata.json | Paper source, variant flags, final metrics |
Hub path: {env_id}/{variant}/latest/ (e.g. CartPole-v1/entrpo/latest/).
The repo README is updated automatically during training with a Training progress table
(epoch n/N, eval return, best return, KL) from results/summary.md, plus a JSON index of
available checkpoints.
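As a minimal sketch of how this layout might be consumed, the helper below builds the Hub-relative paths for one checkpoint and loads the two state dicts from a local copy. The names `checkpoint_paths` and `load_checkpoint` are hypothetical and not part of the repository's code. The loader assumes the files have already been downloaded under a local root directory and that PyTorch is installed.

```python
import json
from pathlib import Path, PurePosixPath

# File names taken from the repository layout table.
CHECKPOINT_FILES = ("policy.pt", "value.pt", "config.json", "metadata.json")


def checkpoint_paths(env_id: str, variant: str, tag: str = "latest") -> dict:
    """Return Hub-relative paths following {env_id}/{variant}/{tag}/."""
    base = PurePosixPath(env_id) / variant / tag
    return {name: str(base / name) for name in CHECKPOINT_FILES}


def load_checkpoint(root: str, env_id: str, variant: str, tag: str = "latest") -> dict:
    """Load state dicts and JSON files from a local download rooted at `root`."""
    import torch  # imported lazily so the path helper works without PyTorch

    rel = checkpoint_paths(env_id, variant, tag)
    paths = {name: Path(root) / rel_path for name, rel_path in rel.items()}
    return {
        "policy": torch.load(paths["policy.pt"], map_location="cpu"),
        "value": torch.load(paths["value.pt"], map_location="cpu"),
        "config": json.loads(paths["config.json"].read_text()),
        "metadata": json.loads(paths["metadata.json"].read_text()),
    }


print(checkpoint_paths("CartPole-v1", "entrpo")["policy.pt"])
```

The returned state dicts still need to be passed to `load_state_dict` on policy and value networks built with the architecture recorded in config.json; that step depends on the training code and is not shown here.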
Notation
- $\rho_t(\theta)=\pi_\theta(a_t|s_t)/\pi_{\theta_{\text{old}}}(a_t|s_t)$, GAE advantages $\hat{A}_t$, trust-region radius $\delta$
- $\alpha$: entropy coefficient. In Roostaie it weights the entropy term added to the advantage; in Xu ERO it weights the entropy term added to the objective (distinct roles, same symbol in each paper's row)
- $\beta$: Xu ERC constraint coefficient (Xu Eq. 49)
- $c_{\mathrm{ent}}$: PPO entropy bonus coefficient (Schulman et al., 2017; config field entropy_coef)
Variant definitions
| Key | Paper name | Surrogate / constraint |
|---|---|---|
| trpo | TRPO | $\mathbb{E}[\rho_t \hat{A}_t]$; $\bar{D}_{\mathrm{KL}} \le \delta$ |
| entrpo_entropy | EnTRPO-Entropy | $\mathbb{E}[\rho_t \tilde{A}_t]$, $\tilde{A}_t=\hat{A}_t+\alpha\,\mathcal{H}(\pi_{\theta_{\text{old}}}(\cdot \mid s_t))$ (fixed during step); $\bar{D}_{\mathrm{KL}} \le \delta$ |
| ero_trpo | ERO-TRPO | $\mathbb{E}[\rho_t \hat{A}_t]+\alpha\,\mathbb{E}[\mathcal{H}(\pi_\theta)]$; $\bar{D}_{\mathrm{KL}} \le \delta$ |
| erc_trpo | ERC-TRPO | $\mathbb{E}[\rho_t \hat{A}_t]$; $\bar{D}_{\mathrm{KL}} \le \delta+\beta\,\mathbb{E}[\mathcal{H}(\pi_\theta)]$ (Xu Eq. 49) |
| entrpo_buffer | EnTRPO-Buffer | $\mathbb{E}[\rho_t \hat{A}_t]$ with Roostaie on-policy replay |
| entrpo | EnTRPO | $\mathbb{E}[\rho_t \tilde{A}_t]$ + Roostaie buffer |
| ppo | PPO | $\mathbb{E}[\min(\rho_t \hat{A}_t,\mathrm{clip}(\rho_t)\hat{A}_t)]+c_{\mathrm{ent}}\,\mathbb{E}[\mathcal{H}(\pi_\theta)]$ |
$\mathcal{H}$ in EnTRPO rows is evaluated at the behavior policy $\pi_{\theta_{\text{old}}}$; in ERO/ERC/PPO rows at the candidate policy $\pi_\theta$.
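To make the distinction between behavior-policy and candidate-policy entropy concrete, the sketch below evaluates the TRPO, EnTRPO-Entropy, and ERO-TRPO surrogates on a two-sample batch for a categorical policy, using plain Python lists. It illustrates the formulas only and is not the repository's training code; the function names and the toy numbers are made up for the example.

```python
import math


def entropy(probs):
    """Shannon entropy H(pi(.|s)) of one categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean(xs):
    return sum(xs) / len(xs)


def surrogates(pi_new, pi_old, actions, adv, alpha):
    """Return (TRPO, EnTRPO-Entropy, ERO-TRPO) surrogate values for a batch.

    pi_new / pi_old hold per-state action probabilities under theta and theta_old.
    """
    # rho_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    rho = [pn[a] / po[a] for pn, po, a in zip(pi_new, pi_old, actions)]
    trpo = mean([r * a_hat for r, a_hat in zip(rho, adv)])
    # EnTRPO-Entropy: behavior-policy entropy is added to the advantage
    # and stays fixed while theta changes.
    adv_tilde = [a_hat + alpha * entropy(po) for a_hat, po in zip(adv, pi_old)]
    entrpo_entropy = mean([r * a_til for r, a_til in zip(rho, adv_tilde)])
    # ERO-TRPO: candidate-policy entropy is added to the objective itself.
    ero = trpo + alpha * mean([entropy(pn) for pn in pi_new])
    return trpo, entrpo_entropy, ero


pi_old = [[0.5, 0.5], [0.5, 0.5]]
pi_new = [[0.6, 0.4], [0.4, 0.6]]
trpo, entrpo_entropy, ero = surrogates(
    pi_new, pi_old, actions=[0, 1], adv=[1.0, -1.0], alpha=0.1
)
print(trpo, entrpo_entropy, ero)
```

In the EnTRPO-Entropy term the entropy acts as a per-state constant scaled by $\rho_t$, whereas in the ERO-TRPO term it is a function of $\theta$ and therefore contributes its own gradient.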
ERC-TRPO implementation: follows Xu Table 1 (two CG solves, $\eta\mathbf{u}+\beta\mathbf{v}$ step scaling) and the Eq. (49) line-search acceptance test $\bar{D}_{\mathrm{KL}}\le\delta+\beta\,\mathbb{E}[\mathcal{H}]$.
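The acceptance rule can be read as a backtracking test with an entropy-dependent KL bound. The sketch below shows only that test, assuming the mean KL, mean entropy, and surrogate improvement of each candidate step are supplied by the caller. The two CG solves and the step scaling from Xu Table 1 are not reproduced, and the function names, the shrink factor of 0.5, and the positive-improvement check are illustrative assumptions rather than details taken from the paper.

```python
def erc_accepts(mean_kl: float, mean_entropy: float, delta: float, beta: float) -> bool:
    """Eq. (49)-style acceptance: mean KL <= delta + beta * E[H(pi_theta)]."""
    return mean_kl <= delta + beta * mean_entropy


def backtrack(evaluate, delta, beta, max_steps=10, shrink=0.5):
    """Return the first step fraction whose candidate passes the test, else None.

    `evaluate(frac)` is a caller-supplied (hypothetical) function returning
    (surrogate_improvement, mean_kl, mean_entropy) for the scaled step.
    """
    frac = 1.0
    for _ in range(max_steps):
        improvement, kl, ent = evaluate(frac)
        if improvement > 0.0 and erc_accepts(kl, ent, delta, beta):
            return frac
        frac *= shrink
    return None


def toy(frac):
    """Toy candidate: KL grows quadratically with the step fraction."""
    return frac, 0.05 * frac ** 2, 0.6


print(backtrack(toy, delta=0.01, beta=0.0))   # plain TRPO bound
print(backtrack(toy, delta=0.01, beta=0.01))  # entropy-relaxed bound
```

With $\beta=0$ the test reduces to the TRPO constraint; a positive $\beta$ widens the bound in proportion to the candidate policy's entropy, so a larger step fraction can be accepted.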
Older Hub folders (trpo_entropy, trpo_buffer, …) remain valid; training resumes from them automatically.
Environments
| Environment | Obs / action dims | Training budget | Hyperparameter source |
|---|---|---|---|
| CartPole-v1 | 4 / 2 discrete (Gymnasium classic control) | 50k steps | Roostaie + Xu Table 4.4 directly |
| Humanoid-v5 | 348 / 17 | $10^6$ steps | PPO/baselines backbone; Xu ERO/ERC proxied from Walker2d |
| HumanoidStandup-v5 | 348 / 17 | $10^6$ steps | Same backbone; Xu ERO/ERC proxied from BipedalWalker |
See HYPERPARAMETERS.md for per-field provenance and paper/results/annex_hyperparameters.tex for tables.
Variants and paper sources
| Variant | Paper |
|---|---|
| trpo | Schulman et al. (2015), Trust Region Policy Optimization, ICML |
| entrpo_entropy | Roostaie & Ebadzadeh (2021), EnTRPO (entropy-in-advantage ablation) |
| entrpo_buffer | Roostaie & Ebadzadeh (2021), EnTRPO (replay-buffer ablation) |
| entrpo | Roostaie & Ebadzadeh (2021), EnTRPO (full method) |
| ero_trpo | Xu et al. (2024), ERO-TRPO |
| erc_trpo | Xu et al. (2024), ERC-TRPO |
| ppo | Schulman et al. (2017), Proximal Policy Optimization |
See metadata.json in each folder for full author names and URLs.
Usage
Training and evaluation code: entropy-trpo on GitHub (update URL when published).
git clone https://github.com/pre63/entropy-trpo.git
cd entropy-trpo
make setup # install deps + create .env
# edit .env with HF_TOKEN and HF_REPO_ID
make download-weights
make eval-checkpoints
Citation
@article{entropytrporeview2026,
title = {A Review of Entropy-Based Extensions to Trust Region Policy Optimization},
author = {Green, Simon},
journal = {IEEE Transactions},
year = {2026}
}
@article{roostaie2021entrpo,
title = {EnTRPO: Trust Region Policy Optimization Method with Entropy Regularization},
author = {Roostaie, Sahar and Ebadzadeh, Mohammad Mehdi},
journal = {arXiv:2110.13373},
year = {2021}
}
@article{xu2024trpo,
title = {Trust region policy optimization via entropy regularization for {Kullback--Leibler} divergence constraint},
author = {Xu, Haotian and Xuan, Junyu and Zhang, Guangquan and Lu, Jie},
journal = {Neurocomputing},
volume = {589},
pages = {127716},
year = {2024}
}
Training progress
Last updated: 2026-06-25 14:04:38 UTC
- Device: cpu
- Config: configs/cpu.yaml
- Jobs complete: 63/63
- Running: 0
CartPole-v1 (1M benchmark; 50k-step budget)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| TRPO (s0) | done | 10/10 | 50,000 | 252.6 ± 72.4 | 293.4 | 0.0049 |
| TRPO (s1) | done | 10/10 | 50,000 | 297.7 ± 69.6 | 297.8 | 0.0076 |
| TRPO (s2) | done | 10/10 | 50,000 | 390.4 ± 78.6 | 390.4 | 0.0057 |
| EnTRPO-Entropy (s0) | done | 10/10 | 50,000 | 267.6 ± 56.1 | 324.0 | 0.0056 |
| EnTRPO-Entropy (s1) | done | 10/10 | 50,000 | 277.1 ± 87.7 | 297.8 | 0.0056 |
| EnTRPO-Entropy (s2) | done | 10/10 | 50,000 | 373.8 ± 92.4 | 373.8 | 0.0027 |
| ERO-TRPO (s0) | done | 10/10 | 50,000 | 20.5 ± 11.4 | 28.3 | 0.0000 |
| ERO-TRPO (s1) | done | 10/10 | 50,000 | 25.8 ± 15.0 | 27.6 | 0.0000 |
| ERO-TRPO (s2) | done | 10/10 | 50,000 | 21.2 ± 10.5 | 27.9 | 0.0000 |
| ERC-TRPO (s0) | done | 10/10 | 50,000 | 18.5 ± 8.1 | 23.5 | 0.0000 |
| ERC-TRPO (s1) | done | 10/10 | 50,000 | 31.3 ± 22.2 | 31.3 | 0.0000 |
| ERC-TRPO (s2) | done | 10/10 | 50,000 | 28.7 ± 14.8 | 32.4 | 0.0000 |
| EnTRPO-Buffer (s0) | done | 10/10 | 50,000 | 216.5 ± 82.2 | 262.0 | 0.0050 |
| EnTRPO-Buffer (s1) | done | 10/10 | 50,000 | 321.5 ± 106.3 | 340.6 | 0.0049 |
| EnTRPO-Buffer (s2) | done | 10/10 | 50,000 | 80.5 ± 33.3 | 165.7 | 0.0086 |
| EnTRPO (s0) | done | 9/10 | 50,000 | 147.0 ± 66.1 | 174.6 | 0.0082 |
| EnTRPO (s1) | done | 10/10 | 50,000 | 224.6 ± 65.8 | 224.6 | 0.0083 |
| EnTRPO (s2) | done | 10/10 | 50,000 | 186.2 ± 56.1 | 243.5 | 0.0036 |
| PPO (s0) | done | 10/10 | 50,000 | 138.1 ± 82.2 | 138.1 | 0.0009 |
| PPO (s1) | done | 10/10 | 50,000 | 127.8 ± 57.1 | 127.8 | 0.0052 |
| PPO (s2) | done | 10/10 | 50,000 | 138.2 ± 57.5 | 138.2 | 0.0044 |
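Because results are reported per seed, comparing variants usually starts with an average over the three seeds. The sketch below parses rows in the format of the table above and reports the mean final eval return per variant. The helper name is hypothetical, and only the per-seed means are averaged; the ± values are ignored.

```python
from collections import defaultdict


def mean_return_by_variant(rows):
    """Average the 'Eval return' mean over seeds for each variant."""
    per_variant = defaultdict(list)
    for row in rows:
        cells = [c.strip() for c in row.strip().strip("|").split("|")]
        variant = cells[0].rsplit(" (", 1)[0]   # "TRPO (s0)" -> "TRPO"
        mean_str = cells[4].split("±")[0]       # "252.6 ± 72.4" -> "252.6 "
        per_variant[variant].append(float(mean_str.replace(",", "")))
    return {v: sum(xs) / len(xs) for v, xs in per_variant.items()}


rows = [
    "| TRPO (s0) | done | 10/10 | 50,000 | 252.6 ± 72.4 | 293.4 | 0.0049 |",
    "| TRPO (s1) | done | 10/10 | 50,000 | 297.7 ± 69.6 | 297.8 | 0.0076 |",
    "| TRPO (s2) | done | 10/10 | 50,000 | 390.4 ± 78.6 | 390.4 | 0.0057 |",
    "| PPO (s0) | done | 10/10 | 50,000 | 138.1 ± 82.2 | 138.1 | 0.0009 |",
    "| PPO (s1) | done | 10/10 | 50,000 | 127.8 ± 57.1 | 127.8 | 0.0052 |",
    "| PPO (s2) | done | 10/10 | 50,000 | 138.2 ± 57.5 | 138.2 | 0.0044 |",
]
result = mean_return_by_variant(rows)
for variant, value in result.items():
    print(f"{variant}: {value:.1f}")
```

With only three seeds per variant, such averages are a coarse summary, and the per-seed spread in the table should be read alongside them.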
Humanoid-v5 (1M benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| TRPO (s0) | done | 488/488 | 999,424 | 264.6 ± 47.9 | 325.0 | 0.0000 |
| TRPO (s1) | done | 488/488 | 999,424 | 264.6 ± 47.9 | 325.0 | 0.0000 |
| TRPO (s2) | done | 488/488 | 999,424 | 264.6 ± 47.9 | 325.0 | 0.0000 |
| EnTRPO-Entropy (s0) | done | 488/488 | 999,424 | 256.2 ± 58.3 | 312.4 | -0.0000 |
| EnTRPO-Entropy (s1) | done | 488/488 | 999,424 | 273.6 ± 55.0 | 325.6 | 0.0072 |
| EnTRPO-Entropy (s2) | done | 488/488 | 999,424 | 256.1 ± 72.0 | 333.6 | -0.0000 |
| ERO-TRPO (s0) | done | 488/488 | 999,424 | 250.4 ± 54.5 | 342.4 | 0.0000 |
| ERO-TRPO (s1) | done | 488/488 | 999,424 | 250.6 ± 23.7 | 315.4 | 0.0071 |
| ERO-TRPO (s2) | done | 488/488 | 999,424 | 261.1 ± 65.2 | 329.1 | 0.0053 |
| ERC-TRPO (s0) | done | 488/488 | 1,001,472 | 108.9 ± 27.7 | 130.1 | -0.0000 |
| ERC-TRPO (s1) | done | 488/488 | 999,424 | 254.5 ± 43.4 | 258.7 | -0.0000 |
| ERC-TRPO (s2) | done | 488/488 | 999,424 | 217.7 ± 79.6 | 240.5 | 0.0000 |
| EnTRPO-Buffer (s0) | done | 488/488 | 999,424 | 267.4 ± 72.3 | 326.5 | 0.0000 |
| EnTRPO-Buffer (s1) | done | 488/488 | 999,424 | 252.4 ± 22.5 | 327.7 | 0.0000 |
| EnTRPO-Buffer (s2) | done | 488/488 | 999,424 | 249.7 ± 90.1 | 321.0 | 0.0055 |
| EnTRPO (s0) | done | 488/488 | 999,424 | 245.7 ± 32.2 | 332.4 | 0.0043 |
| EnTRPO (s1) | done | 488/488 | 999,424 | 289.4 ± 72.0 | 325.4 | 0.0074 |
| EnTRPO (s2) | done | 488/488 | 999,424 | 280.4 ± 83.2 | 316.8 | 0.0023 |
| PPO (s0) | done | 488/488 | 999,424 | 350.6 ± 97.2 | 374.9 | 0.1023 |
| PPO (s1) | done | 488/488 | 999,424 | 329.9 ± 86.0 | 406.4 | 0.1046 |
| PPO (s2) | done | 488/488 | 999,424 | 305.1 ± 97.9 | 383.9 | 0.1098 |
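The Timesteps column is consistent with a fixed rollout size per epoch. As a quick check, the arithmetic below reproduces the 488 epochs and 999,424 timesteps from a $10^6$-step budget. The 2,048-step rollout is inferred from the table's numbers (999,424 / 488), not read from config.json.

```python
BUDGET = 1_000_000   # training budget from the Environments table
ROLLOUT = 2_048      # inferred from 999,424 / 488; not read from config.json

epochs = BUDGET // ROLLOUT       # whole rollouts that fit in the budget
timesteps = epochs * ROLLOUT
print(epochs, timesteps)

# The ERC-TRPO (s0) row reports 1,001,472 timesteps: count the extra rollouts.
extra = (1_001_472 - timesteps) // ROLLOUT
print(extra)
```

Under this reading, the ERC-TRPO (s0) run collected one rollout more than the other runs in the table.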
HumanoidStandup-v5 (1M benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| TRPO (s0) | done | 488/488 | 999,424 | 61102.3 ± 10813.4 | 68649.5 | 0.0060 |
| TRPO (s1) | done | 488/488 | 999,424 | 67908.7 ± 12258.9 | 77610.9 | 0.0000 |
| TRPO (s2) | done | 488/488 | 999,424 | 71182.5 ± 7886.6 | 74013.4 | -0.0000 |
| EnTRPO-Entropy (s0) | done | 488/488 | 999,424 | 63970.3 ± 12674.7 | 69688.4 | 0.0054 |
| EnTRPO-Entropy (s1) | done | 488/488 | 999,424 | 74092.3 ± 6436.0 | 84192.4 | -0.0000 |
| EnTRPO-Entropy (s2) | done | 488/488 | 999,424 | 66211.8 ± 11884.0 | 73910.5 | -0.0000 |
| ERO-TRPO (s0) | done | 488/488 | 999,424 | 46220.5 ± 2917.4 | 47845.4 | -0.0000 |
| ERO-TRPO (s1) | done | 488/488 | 999,424 | 47466.5 ± 3243.4 | 52235.4 | 0.0010 |
| ERO-TRPO (s2) | done | 488/488 | 999,424 | 47797.9 ± 3607.9 | 50942.6 | 0.0000 |
| ERC-TRPO (s0) | done | 488/488 | 999,424 | 38592.2 ± 4701.3 | 40352.0 | 0.0000 |
| ERC-TRPO (s1) | done | 488/488 | 999,424 | 47285.0 ± 3076.0 | 51595.2 | 0.0003 |
| ERC-TRPO (s2) | done | 488/488 | 999,424 | 48414.2 ± 3424.0 | 51013.4 | 0.0009 |
| EnTRPO-Buffer (s0) | done | 488/488 | 999,424 | 37442.2 ± 2643.0 | 39323.6 | 0.0086 |
| EnTRPO-Buffer (s1) | done | 488/488 | 999,424 | 41544.7 ± 3903.9 | 45498.8 | -0.0001 |
| EnTRPO-Buffer (s2) | done | 488/488 | 999,424 | 37246.1 ± 1735.1 | 40451.8 | 0.0000 |
| EnTRPO (s0) | done | 488/488 | 999,424 | 35912.7 ± 2063.5 | 39687.9 | 0.0000 |
| EnTRPO (s1) | done | 488/488 | 999,424 | 43605.8 ± 3502.3 | 46649.5 | 0.0042 |
| EnTRPO (s2) | done | 488/488 | 999,424 | 37992.1 ± 4942.4 | 40168.8 | 0.0064 |
| PPO (s0) | done | 488/488 | 999,424 | 83486.9 ± 6256.6 | 100329.1 | 0.3202 |
| PPO (s1) | done | 488/488 | 999,424 | 88231.3 ± 14647.0 | 110834.0 | 0.5174 |
| PPO (s2) | done | 488/488 | 999,424 | 77927.8 ± 14820.4 | 102214.5 | 0.4544 |
Available checkpoints
{
  "entrpo": ["latest", "seed_0", "seed_1", "seed_2"],
  "entrpo_buffer": ["latest", "seed_0", "seed_1", "seed_2"],
  "entrpo_entropy": ["latest", "seed_0", "seed_1", "seed_2"],
  "erc_trpo": ["seed_0", "seed_1", "seed_2"],
  "ero_trpo": ["latest", "seed_0", "seed_1", "seed_2"],
  "ppo": ["latest", "seed_0", "seed_1", "seed_2"],
  "trpo": ["latest", "seed_0", "seed_1", "seed_2"],
  "trpo_entropy": ["latest"]
}
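Not every variant in this index has a latest tag: erc_trpo lists only per-seed folders. A loader might therefore resolve the tag defensively, as in the sketch below. The helper name `resolve_tag` and the fallback order are illustrative choices, not behavior of the repository's tooling, and the embedded JSON is an excerpt of the index above.

```python
import json

INDEX_JSON = """
{
  "entrpo": ["latest", "seed_0", "seed_1", "seed_2"],
  "erc_trpo": ["seed_0", "seed_1", "seed_2"],
  "trpo_entropy": ["latest"]
}
"""


def resolve_tag(index: dict, variant: str, preferred: str = "latest") -> str:
    """Return `preferred` if it is listed, else the first tag in sorted order."""
    tags = index.get(variant)
    if not tags:
        raise KeyError(f"no checkpoints listed for variant {variant!r}")
    return preferred if preferred in tags else sorted(tags)[0]


index = json.loads(INDEX_JSON)
print(resolve_tag(index, "entrpo"))
print(resolve_tag(index, "erc_trpo"))
```

Falling back to a specific seed changes which run is evaluated, so a caller may prefer to raise an error instead when reproducibility matters.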