Entropy-TRPO Model Weights

PyTorch checkpoints for the comparative study in A Review of Entropy-Based Extensions to Trust Region Policy Optimization.

Repository layout

Each checkpoint directory contains:

| File | Description |
| --- | --- |
| policy.pt | Policy network state dict |
| value.pt | Value network state dict |
| config.json | Training hyperparameters |
| metadata.json | Paper source, variant flags, final metrics |

Hub path: {env_id}/{variant}/latest/ (e.g. CartPole-v1/entrpo/latest/).
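
The path convention above can be sketched in a few lines of Python. The helper name `checkpoint_files` and its `tag` parameter are illustrative assumptions, not part of the training code; only the four file names and the `{env_id}/{variant}/latest/` pattern come from this README. Whether the `seed_N` folders follow the same pattern is also an assumption.

```python
# Illustrative helper (hypothetical name): builds Hub-relative file paths
# following the {env_id}/{variant}/{tag}/ convention described above.
CHECKPOINT_FILES = ("policy.pt", "value.pt", "config.json", "metadata.json")


def checkpoint_files(env_id: str, variant: str, tag: str = "latest") -> list[str]:
    """Return the four expected file paths for one checkpoint directory."""
    prefix = f"{env_id}/{variant}/{tag}"
    return [f"{prefix}/{name}" for name in CHECKPOINT_FILES]


if __name__ == "__main__":
    for path in checkpoint_files("CartPole-v1", "entrpo"):
        print(path)
```

Each returned string could plausibly be passed as the `filename` argument of `huggingface_hub.hf_hub_download`, together with the repo id; that usage has not been checked against this repository.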

The repo README is updated automatically during training with a Training progress table (epoch n/N, eval return, best return, KL) from results/summary.md, plus a JSON index of available checkpoints.

Notation

  • $\rho_t(\theta)=\pi_\theta(a_t|s_t)/\pi_{\theta_{\text{old}}}(a_t|s_t)$, GAE advantages $\hat{A}_t$, trust-region radius $\delta$
  • $\alpha$: entropy coefficient; it weights the entropy term added to the advantage in Roostaie's EnTRPO and the entropy term added to the objective in Xu's ERO (distinct roles, same symbol in each paper's row)
  • $\beta$: entropy coefficient in the Xu ERC constraint (Xu Eq. 49)
  • $c_{\mathrm{ent}}$: PPO entropy bonus coefficient (Schulman et al., 2017; config field entropy_coef)
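
As a rough illustration of the first bullet, the sketch below computes the probability ratio $\rho_t$ from log-probabilities and GAE advantages $\hat{A}_t$ for one trajectory segment. The function names, the discount `gamma`, and the GAE parameter `lam` are assumptions introduced here (their actual values live in config.json), and the segment is assumed to contain no terminal states.

```python
import math


def prob_ratio(logp_new: float, logp_old: float) -> float:
    """rho_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), from log-probabilities."""
    return math.exp(logp_new - logp_old)


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one segment without terminals.

    rewards[t] is r_t, values[t] is V(s_t), and last_value is the
    bootstrap estimate V(s_T) for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Recursive form: A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0))
    print(prob_ratio(math.log(0.6), math.log(0.3)))
```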

Variant definitions

| Key | Paper name | Surrogate / constraint |
| --- | --- | --- |
| trpo | TRPO | $\mathbb{E}[\rho_t \hat{A}_t]$; $\bar{D}_{\mathrm{KL}} \le \delta$ |
| entrpo_entropy | EnTRPO-Entropy | $\mathbb{E}[\rho_t \tilde{A}_t]$, $\tilde{A}_t=\hat{A}_t+\alpha\,\mathcal{H}(\pi_{\theta_{\text{old}}}(\cdot \mid s_t))$ (fixed during step); $\bar{D}_{\mathrm{KL}} \le \delta$ |
| ero_trpo | ERO-TRPO | $\mathbb{E}[\rho_t \hat{A}_t]+\alpha\,\mathbb{E}[\mathcal{H}(\pi_\theta)]$; $\bar{D}_{\mathrm{KL}} \le \delta$ |
| erc_trpo | ERC-TRPO | $\mathbb{E}[\rho_t \hat{A}_t]$; $\bar{D}_{\mathrm{KL}} \le \delta+\beta\,\mathbb{E}[\mathcal{H}(\pi_\theta)]$ (Xu Eq. 49) |
| entrpo_buffer | EnTRPO-Buffer | $\mathbb{E}[\rho_t \hat{A}_t]$ with Roostaie on-policy replay |
| entrpo | EnTRPO | $\mathbb{E}[\rho_t \tilde{A}_t]$ + Roostaie buffer |
| ppo | PPO | $\mathbb{E}[\min(\rho_t \hat{A}_t,\mathrm{clip}(\rho_t)\hat{A}_t)]+c_{\mathrm{ent}}\,\mathbb{E}[\mathcal{H}(\pi_\theta)]$ |

$\mathcal{H}$ in EnTRPO rows is evaluated at the behavior policy $\pi_{\theta_{\text{old}}}$; in ERO/ERC/PPO rows at the candidate policy $\pi_\theta$.
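
The difference between the two entropy placements can be made concrete with a small sketch for a discrete (categorical) policy. This is an illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, expectations are replaced by sample means, and action probabilities are passed in as plain lists.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy H(p) = -sum_a p(a) log p(a) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entrpo_advantage(adv, old_probs, alpha):
    """EnTRPO-style advantage: A~_t = A^_t + alpha * H(pi_old(.|s_t)).

    The entropy uses the behavior policy, so it is a constant with respect
    to the parameters being optimized during the step.
    """
    return adv + alpha * categorical_entropy(old_probs)


def ero_surrogate(ratios, advs, new_probs_per_state, alpha):
    """ERO-style objective: mean(rho_t * A^_t) + alpha * mean(H(pi_theta(.|s_t))).

    Here the entropy uses the candidate policy, so it changes with theta.
    """
    n = len(ratios)
    policy_term = sum(r * a for r, a in zip(ratios, advs)) / n
    entropy_term = sum(categorical_entropy(p) for p in new_probs_per_state) / n
    return policy_term + alpha * entropy_term


if __name__ == "__main__":
    print(entrpo_advantage(1.0, [0.5, 0.5], alpha=0.1))
    print(ero_surrogate([1.0, 1.0], [2.0, 0.0], [[1.0, 0.0], [0.5, 0.5]], alpha=0.2))
```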

ERC-TRPO implementation: follows Xu Table 1 (two CG solves, $\eta\mathbf{u}+\beta\mathbf{v}$ step scaling) and the Eq. (49) line-search acceptance test $\bar{D}_{\mathrm{KL}}\le\delta+\beta\,\mathbb{E}[\mathcal{H}]$.
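
A minimal sketch of that acceptance test inside a backtracking line search follows. It checks only the relaxed KL condition named above; any additional criteria the training code applies (for example, surrogate improvement) are not shown. The names `erc_accepts`, `backtrack`, and `evaluate`, and the `shrink` and `max_steps` defaults, are assumptions made for illustration.

```python
def erc_accepts(mean_kl, mean_entropy, delta, beta):
    """Relaxed trust-region test: mean KL <= delta + beta * mean entropy."""
    return mean_kl <= delta + beta * mean_entropy


def backtrack(evaluate, delta, beta, shrink=0.5, max_steps=10):
    """Return the first step fraction whose candidate passes erc_accepts.

    evaluate(frac) is a caller-supplied (hypothetical) function returning
    (mean_kl, mean_entropy) for the candidate policy at step fraction frac.
    Returns None if no fraction is accepted within max_steps tries.
    """
    frac = 1.0
    for _ in range(max_steps):
        mean_kl, mean_entropy = evaluate(frac)
        if erc_accepts(mean_kl, mean_entropy, delta, beta):
            return frac
        frac *= shrink
    return None


if __name__ == "__main__":
    # Toy model: KL grows quadratically with the step, entropy stays at 0.5.
    def toy(frac):
        return 0.05 * frac * frac, 0.5

    print(backtrack(toy, delta=0.01, beta=0.01))  # relaxed radius
    print(backtrack(toy, delta=0.01, beta=0.0))   # plain TRPO radius
```

With `beta = 0` the test reduces to the plain TRPO constraint $\bar{D}_{\mathrm{KL}}\le\delta$, so in this toy setting the relaxed radius accepts a larger step than the unrelaxed one.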

Older Hub folders (trpo_entropy, trpo_buffer, …) remain valid; training resumes from them automatically.

Environments

| Environment | Obs / action | Training budget | Hyperparameter source |
| --- | --- | --- | --- |
| CartPole-v1 | Gymnasium classic control | 50k steps | Roostaie + Xu Table 4.4 directly |
| Humanoid-v5 | 348 / 17 | $10^6$ steps | PPO/baselines backbone; Xu ERO/ERC proxied from Walker2d |
| HumanoidStandup-v5 | 348 / 17 | $10^6$ steps | Same backbone; Xu ERO/ERC proxied from BipedalWalker |

See HYPERPARAMETERS.md for per-field provenance and paper/results/annex_hyperparameters.tex for tables.

Variants and paper sources

| Variant | Paper |
| --- | --- |
| trpo | Schulman et al. (2015), Trust Region Policy Optimization, ICML |
| entrpo_entropy | Roostaie & Ebadzadeh (2021), EnTRPO (entropy-in-advantage ablation) |
| entrpo_buffer | Roostaie & Ebadzadeh (2021), EnTRPO (replay-buffer ablation) |
| entrpo | Roostaie & Ebadzadeh (2021), EnTRPO (full method) |
| ero_trpo | Xu et al. (2024), ERO-TRPO |
| erc_trpo | Xu et al. (2024), ERC-TRPO |
| ppo | Schulman et al. (2017), Proximal Policy Optimization |

See metadata.json in each folder for full author names and URLs.

Usage

Training and evaluation code: GitHub — entropy-trpo (update URL when published).

git clone https://github.com/pre63/entropy-trpo.git
cd entropy-trpo
make setup          # install deps + create .env
# edit .env with HF_TOKEN and HF_REPO_ID
make download-weights
make eval-checkpoints
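
Once a checkpoint directory is available locally, its JSON files can be inspected with the standard library alone. The sketch below is an assumption-laden illustration: where `make download-weights` places the files is not stated here, so the demo writes stand-in files to a temporary directory, and the `variant` key in the stand-in metadata is hypothetical. Only the file names and the `entropy_coef` config field come from this README.

```python
import json
import tempfile
from pathlib import Path


def load_checkpoint_info(checkpoint_dir):
    """Read config.json and metadata.json from one checkpoint directory."""
    root = Path(checkpoint_dir)
    config = json.loads((root / "config.json").read_text(encoding="utf-8"))
    metadata = json.loads((root / "metadata.json").read_text(encoding="utf-8"))
    return config, metadata


def write_placeholder_checkpoint(directory):
    """Create stand-in JSON files so the sketch runs without any download.

    The values are placeholders, not hyperparameters from this study.
    """
    root = Path(directory)
    (root / "config.json").write_text(json.dumps({"entropy_coef": 0.01}), encoding="utf-8")
    (root / "metadata.json").write_text(json.dumps({"variant": "ppo"}), encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_placeholder_checkpoint(tmp)
        config, metadata = load_checkpoint_info(tmp)
        print(config, metadata)
```

The state dicts in policy.pt and value.pt would normally be read with `torch.load(path, map_location="cpu")` and passed to `load_state_dict` on networks built by the training code; the network classes themselves are defined in the entropy-trpo repository, not here.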

Citation

@article{entropytrporeview2026,
  title   = {A Review of Entropy-Based Extensions to Trust Region Policy Optimization},
  author  = {Green, Simon},
  journal = {IEEE Transactions},
  year    = {2026}
}
@article{roostaie2021entrpo,
  title   = {EnTRPO: Trust Region Policy Optimization Method with Entropy Regularization},
  author  = {Roostaie, Sahar and Ebadzadeh, Mohammad Mehdi},
  journal = {arXiv:2110.13373},
  year    = {2021}
}
@article{xu2024trpo,
  title   = {Trust region policy optimization via entropy regularization for {Kullback--Leibler} divergence constraint},
  author  = {Xu, Haotian and Xuan, Junyu and Zhang, Guangquan and Lu, Jie},
  journal = {Neurocomputing},
  volume  = {589},
  pages   = {127716},
  year    = {2024}
}

Training progress

Last updated: 2026-06-25 14:04:38 UTC

  • Device: cpu
  • Config: configs/cpu.yaml
  • Jobs complete: 63/63
  • Running: 0

CartPole-v1 (1M benchmark; 50k-step budget)

| Variant | Status | Epoch | Timesteps | Eval return | Best return | KL |
| --- | --- | --- | --- | --- | --- | --- |
| TRPO (s0) | done | 10/10 | 50,000 | 252.6 ± 72.4 | 293.4 | 0.0049 |
| TRPO (s1) | done | 10/10 | 50,000 | 297.7 ± 69.6 | 297.8 | 0.0076 |
| TRPO (s2) | done | 10/10 | 50,000 | 390.4 ± 78.6 | 390.4 | 0.0057 |
| EnTRPO-Entropy (s0) | done | 10/10 | 50,000 | 267.6 ± 56.1 | 324.0 | 0.0056 |
| EnTRPO-Entropy (s1) | done | 10/10 | 50,000 | 277.1 ± 87.7 | 297.8 | 0.0056 |
| EnTRPO-Entropy (s2) | done | 10/10 | 50,000 | 373.8 ± 92.4 | 373.8 | 0.0027 |
| ERO-TRPO (s0) | done | 10/10 | 50,000 | 20.5 ± 11.4 | 28.3 | 0.0000 |
| ERO-TRPO (s1) | done | 10/10 | 50,000 | 25.8 ± 15.0 | 27.6 | 0.0000 |
| ERO-TRPO (s2) | done | 10/10 | 50,000 | 21.2 ± 10.5 | 27.9 | 0.0000 |
| ERC-TRPO (s0) | done | 10/10 | 50,000 | 18.5 ± 8.1 | 23.5 | 0.0000 |
| ERC-TRPO (s1) | done | 10/10 | 50,000 | 31.3 ± 22.2 | 31.3 | 0.0000 |
| ERC-TRPO (s2) | done | 10/10 | 50,000 | 28.7 ± 14.8 | 32.4 | 0.0000 |
| EnTRPO-Buffer (s0) | done | 10/10 | 50,000 | 216.5 ± 82.2 | 262.0 | 0.0050 |
| EnTRPO-Buffer (s1) | done | 10/10 | 50,000 | 321.5 ± 106.3 | 340.6 | 0.0049 |
| EnTRPO-Buffer (s2) | done | 10/10 | 50,000 | 80.5 ± 33.3 | 165.7 | 0.0086 |
| EnTRPO (s0) | done | 9/10 | 50,000 | 147.0 ± 66.1 | 174.6 | 0.0082 |
| EnTRPO (s1) | done | 10/10 | 50,000 | 224.6 ± 65.8 | 224.6 | 0.0083 |
| EnTRPO (s2) | done | 10/10 | 50,000 | 186.2 ± 56.1 | 243.5 | 0.0036 |
| PPO (s0) | done | 10/10 | 50,000 | 138.1 ± 82.2 | 138.1 | 0.0009 |
| PPO (s1) | done | 10/10 | 50,000 | 127.8 ± 57.1 | 127.8 | 0.0052 |
| PPO (s2) | done | 10/10 | 50,000 | 138.2 ± 57.5 | 138.2 | 0.0044 |
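
For a quick per-variant comparison, the three seeds can be averaged. The sketch below copies the central values of the Eval return column for CartPole-v1 and takes the unweighted mean across seeds; the helper name `seed_means` is an assumption, and with only three seeds the result is a rough summary, not a significance test.

```python
from statistics import mean

# Central values of "Eval return" for CartPole-v1, seeds s0, s1, s2,
# copied from the table above.
CARTPOLE_EVAL = {
    "TRPO": [252.6, 297.7, 390.4],
    "EnTRPO-Entropy": [267.6, 277.1, 373.8],
    "ERO-TRPO": [20.5, 25.8, 21.2],
    "ERC-TRPO": [18.5, 31.3, 28.7],
    "EnTRPO-Buffer": [216.5, 321.5, 80.5],
    "EnTRPO": [147.0, 224.6, 186.2],
    "PPO": [138.1, 127.8, 138.2],
}


def seed_means(table):
    """Unweighted mean of the per-seed eval returns for each variant."""
    return {variant: mean(values) for variant, values in table.items()}


if __name__ == "__main__":
    ranked = sorted(seed_means(CARTPOLE_EVAL).items(), key=lambda kv: kv[1], reverse=True)
    for variant, value in ranked:
        print(f"{variant:16s} {value:7.1f}")
```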

Humanoid-v5 (1M benchmark)

| Variant | Status | Epoch | Timesteps | Eval return | Best return | KL |
| --- | --- | --- | --- | --- | --- | --- |
| TRPO (s0) | done | 488/488 | 999,424 | 264.6 ± 47.9 | 325.0 | 0.0000 |
| TRPO (s1) | done | 488/488 | 999,424 | 264.6 ± 47.9 | 325.0 | 0.0000 |
| TRPO (s2) | done | 488/488 | 999,424 | 264.6 ± 47.9 | 325.0 | 0.0000 |
| EnTRPO-Entropy (s0) | done | 488/488 | 999,424 | 256.2 ± 58.3 | 312.4 | -0.0000 |
| EnTRPO-Entropy (s1) | done | 488/488 | 999,424 | 273.6 ± 55.0 | 325.6 | 0.0072 |
| EnTRPO-Entropy (s2) | done | 488/488 | 999,424 | 256.1 ± 72.0 | 333.6 | -0.0000 |
| ERO-TRPO (s0) | done | 488/488 | 999,424 | 250.4 ± 54.5 | 342.4 | 0.0000 |
| ERO-TRPO (s1) | done | 488/488 | 999,424 | 250.6 ± 23.7 | 315.4 | 0.0071 |
| ERO-TRPO (s2) | done | 488/488 | 999,424 | 261.1 ± 65.2 | 329.1 | 0.0053 |
| ERC-TRPO (s0) | done | 488/488 | 1,001,472 | 108.9 ± 27.7 | 130.1 | -0.0000 |
| ERC-TRPO (s1) | done | 488/488 | 999,424 | 254.5 ± 43.4 | 258.7 | -0.0000 |
| ERC-TRPO (s2) | done | 488/488 | 999,424 | 217.7 ± 79.6 | 240.5 | 0.0000 |
| EnTRPO-Buffer (s0) | done | 488/488 | 999,424 | 267.4 ± 72.3 | 326.5 | 0.0000 |
| EnTRPO-Buffer (s1) | done | 488/488 | 999,424 | 252.4 ± 22.5 | 327.7 | 0.0000 |
| EnTRPO-Buffer (s2) | done | 488/488 | 999,424 | 249.7 ± 90.1 | 321.0 | 0.0055 |
| EnTRPO (s0) | done | 488/488 | 999,424 | 245.7 ± 32.2 | 332.4 | 0.0043 |
| EnTRPO (s1) | done | 488/488 | 999,424 | 289.4 ± 72.0 | 325.4 | 0.0074 |
| EnTRPO (s2) | done | 488/488 | 999,424 | 280.4 ± 83.2 | 316.8 | 0.0023 |
| PPO (s0) | done | 488/488 | 999,424 | 350.6 ± 97.2 | 374.9 | 0.1023 |
| PPO (s1) | done | 488/488 | 999,424 | 329.9 ± 86.0 | 406.4 | 0.1046 |
| PPO (s2) | done | 488/488 | 999,424 | 305.1 ± 97.9 | 383.9 | 0.1098 |

HumanoidStandup-v5 (1M benchmark)

| Variant | Status | Epoch | Timesteps | Eval return | Best return | KL |
| --- | --- | --- | --- | --- | --- | --- |
| TRPO (s0) | done | 488/488 | 999,424 | 61102.3 ± 10813.4 | 68649.5 | 0.0060 |
| TRPO (s1) | done | 488/488 | 999,424 | 67908.7 ± 12258.9 | 77610.9 | 0.0000 |
| TRPO (s2) | done | 488/488 | 999,424 | 71182.5 ± 7886.6 | 74013.4 | -0.0000 |
| EnTRPO-Entropy (s0) | done | 488/488 | 999,424 | 63970.3 ± 12674.7 | 69688.4 | 0.0054 |
| EnTRPO-Entropy (s1) | done | 488/488 | 999,424 | 74092.3 ± 6436.0 | 84192.4 | -0.0000 |
| EnTRPO-Entropy (s2) | done | 488/488 | 999,424 | 66211.8 ± 11884.0 | 73910.5 | -0.0000 |
| ERO-TRPO (s0) | done | 488/488 | 999,424 | 46220.5 ± 2917.4 | 47845.4 | -0.0000 |
| ERO-TRPO (s1) | done | 488/488 | 999,424 | 47466.5 ± 3243.4 | 52235.4 | 0.0010 |
| ERO-TRPO (s2) | done | 488/488 | 999,424 | 47797.9 ± 3607.9 | 50942.6 | 0.0000 |
| ERC-TRPO (s0) | done | 488/488 | 999,424 | 38592.2 ± 4701.3 | 40352.0 | 0.0000 |
| ERC-TRPO (s1) | done | 488/488 | 999,424 | 47285.0 ± 3076.0 | 51595.2 | 0.0003 |
| ERC-TRPO (s2) | done | 488/488 | 999,424 | 48414.2 ± 3424.0 | 51013.4 | 0.0009 |
| EnTRPO-Buffer (s0) | done | 488/488 | 999,424 | 37442.2 ± 2643.0 | 39323.6 | 0.0086 |
| EnTRPO-Buffer (s1) | done | 488/488 | 999,424 | 41544.7 ± 3903.9 | 45498.8 | -0.0001 |
| EnTRPO-Buffer (s2) | done | 488/488 | 999,424 | 37246.1 ± 1735.1 | 40451.8 | 0.0000 |
| EnTRPO (s0) | done | 488/488 | 999,424 | 35912.7 ± 2063.5 | 39687.9 | 0.0000 |
| EnTRPO (s1) | done | 488/488 | 999,424 | 43605.8 ± 3502.3 | 46649.5 | 0.0042 |
| EnTRPO (s2) | done | 488/488 | 999,424 | 37992.1 ± 4942.4 | 40168.8 | 0.0064 |
| PPO (s0) | done | 488/488 | 999,424 | 83486.9 ± 6256.6 | 100329.1 | 0.3202 |
| PPO (s1) | done | 488/488 | 999,424 | 88231.3 ± 14647.0 | 110834.0 | 0.5174 |
| PPO (s2) | done | 488/488 | 999,424 | 77927.8 ± 14820.4 | 102214.5 | 0.4544 |

Available checkpoints

{
  "entrpo": [
    "latest",
    "seed_0",
    "seed_1",
    "seed_2"
  ],
  "entrpo_buffer": [
    "latest",
    "seed_0",
    "seed_1",
    "seed_2"
  ],
  "entrpo_entropy": [
    "latest",
    "seed_0",
    "seed_1",
    "seed_2"
  ],
  "erc_trpo": [
    "seed_0",
    "seed_1",
    "seed_2"
  ],
  "ero_trpo": [
    "latest",
    "seed_0",
    "seed_1",
    "seed_2"
  ],
  "ppo": [
    "latest",
    "seed_0",
    "seed_1",
    "seed_2"
  ],
  "trpo": [
    "latest",
    "seed_0",
    "seed_1",
    "seed_2"
  ],
  "trpo_entropy": [
    "latest"
  ]
}
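
Because not every variant in the index carries every tag, it can help to check the index before requesting a folder. The sketch below embeds a copy of the JSON index shown above and lists the variants that lack a given tag; the function name `variants_missing_tag` is a hypothetical helper, not part of the repository.

```python
import json

# Copy of the "Available checkpoints" index shown above.
INDEX_JSON = """
{
  "entrpo": ["latest", "seed_0", "seed_1", "seed_2"],
  "entrpo_buffer": ["latest", "seed_0", "seed_1", "seed_2"],
  "entrpo_entropy": ["latest", "seed_0", "seed_1", "seed_2"],
  "erc_trpo": ["seed_0", "seed_1", "seed_2"],
  "ero_trpo": ["latest", "seed_0", "seed_1", "seed_2"],
  "ppo": ["latest", "seed_0", "seed_1", "seed_2"],
  "trpo": ["latest", "seed_0", "seed_1", "seed_2"],
  "trpo_entropy": ["latest"]
}
"""


def variants_missing_tag(index, tag="latest"):
    """Return the sorted variant keys whose tag list does not contain `tag`."""
    return sorted(variant for variant, tags in index.items() if tag not in tags)


if __name__ == "__main__":
    index = json.loads(INDEX_JSON)
    print("missing latest:", variants_missing_tag(index, "latest"))
    print("missing seed_0:", variants_missing_tag(index, "seed_0"))
```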