# GMPO

In the paper [Geometric-Mean Policy Optimization](https://huggingface.co/papers/2507.20673), the authors propose a GRPO variant that maximizes the *geometric* mean of the token-level importance ratios instead of the arithmetic mean. Because the geometric mean is far less sensitive to outlier ratios, the policy update is more stable and tolerates a much wider clipping range. Clipping is applied per token, in log space, and is one-sided according to the sign of the advantage (the standard PPO trust region); crucially, it happens *before* the geometric mean is taken.
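To make the outlier argument concrete, here is a minimal sketch in plain Python comparing the two means. The ratio values are made up for illustration and are not taken from the paper, and the helper names are hypothetical.

```python
import math


def arithmetic_mean(ratios):
    """Plain average of the per-token importance ratios."""
    return sum(ratios) / len(ratios)


def geometric_mean(ratios):
    """exp of the mean log-ratio; assumes every ratio is strictly positive."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Illustrative (made-up) ratios: three on-policy tokens and one outlier.
ratios = [1.0, 1.0, 1.0, 10.0]

am = arithmetic_mean(ratios)  # 3.25: the outlier dominates
gm = geometric_mean(ratios)   # 10 ** 0.25, roughly 1.78: the outlier is damped

print(f"arithmetic mean = {am:.3f}, geometric mean = {gm:.3f}")
```

A single token with a ratio of 10 more than triples the arithmetic mean, while the geometric mean stays below 2.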

To use GMPO, you can use the `GMPOTrainer` class in `trl.experimental.gmpo`.

## Usage

```python
from trl.experimental.gmpo import GMPOConfig, GMPOTrainer

training_args = GMPOConfig(
    epsilon=0.4,  # log-space clip range -> ratios clipped to (exp(-0.4), exp(0.4)); paper, Sec. 4
    beta=0.0,
)
trainer = GMPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=...,
    train_dataset=...,
    args=training_args,
)
trainer.train()
```

In GMPO, clipping is applied to the per-token *log*-importance ratios (i.e. in log space) before the geometric mean is taken, so `epsilon` and `epsilon_high` are expressed in log space: the effective ratio clipping range is `(exp(-epsilon), exp(epsilon_high))`. The paper recommends a markedly wider range than GRPO/DAPO, `(exp(-0.4), exp(0.4))`, to encourage exploration.
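The clipping described above can be sketched for a single completion in plain Python. This is an illustration under stated assumptions, not the library's `_compute_loss`: the function name `gmpo_sequence_objective` is hypothetical, the one-sided rule is written as the usual PPO pessimistic minimum applied to log-ratios, and the KL term, padding masks, and batching are all omitted.

```python
import math


def gmpo_sequence_objective(logp_new, logp_old, advantage, epsilon=0.4, epsilon_high=None):
    """Hypothetical per-sequence GMPO-style objective (to be maximized).

    logp_new / logp_old: per-token log-probabilities under the current and old policy.
    advantage: scalar advantage shared by every token of the completion.
    """
    if epsilon_high is None:
        epsilon_high = epsilon  # mirrors the documented default for epsilon_high
    sign = 1.0 if advantage >= 0 else -1.0

    clipped_log_ratios = []
    for new, old in zip(logp_new, logp_old):
        log_ratio = new - old
        # Two-sided bound in log space: [-epsilon, epsilon_high].
        bounded = min(max(log_ratio, -epsilon), epsilon_high)
        # One-sided (pessimistic) choice: only the side that would
        # increase the objective is actually clipped.
        clipped_log_ratios.append(sign * min(sign * log_ratio, sign * bounded))

    # Geometric mean of the clipped ratios = exp(mean of clipped log-ratios).
    geo_mean_ratio = math.exp(sum(clipped_log_ratios) / len(clipped_log_ratios))
    return advantage * geo_mean_ratio


# Made-up log-probabilities giving log-ratios of +1.0 and -1.0.
logp_new = [-1.0, -1.0]
logp_old = [-2.0, 0.0]
print(gmpo_sequence_objective(logp_new, logp_old, advantage=1.0))
print(gmpo_sequence_objective(logp_new, logp_old, advantage=-1.0))
```

With a positive advantage only the log-ratio above `epsilon_high` is clipped; with a negative advantage only the one below `-epsilon` is. In both cases the clipping is applied token by token before the mean is taken.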

## GMPOTrainer[[trl.experimental.gmpo.GMPOTrainer]]

#### trl.experimental.gmpo.GMPOTrainer[[trl.experimental.gmpo.GMPOTrainer]]

[Source](https://github.com/huggingface/trl/blob/main/trl/experimental/gmpo/gmpo_trainer.py#L22)

Trainer for Geometric-Mean Policy Optimization (GMPO).

GMPO (https://huggingface.co/papers/2507.20673) is a GRPO variant that maximizes the *geometric* mean of the
token-level importance ratios instead of the arithmetic mean. Because the geometric mean is far less sensitive to
outlier ratios, the policy update is more stable and a much wider clipping range can be used.

The only change w.r.t. [GRPOTrainer](/docs/trl/main/en/grpo_trainer#trl.GRPOTrainer) is `_compute_loss`. Everything else (generation, reward computation, weight
syncing, metric logging) is inherited unchanged.

#### train[[trl.experimental.gmpo.GMPOTrainer.train]]

[Source](https://github.com/huggingface/trl/blob/main/transformers/trainer.py#L1331)

Main training entry point.

**Parameters:**

resume_from_checkpoint (`str` or `bool`, *optional*) : If a `str`, local path to a saved checkpoint as saved by a previous instance of `Trainer`. If a `bool` and equals `True`, load the last checkpoint in *args.output_dir* as saved by a previous instance of `Trainer`. If present, training will resume from the model/optimizer/scheduler states loaded here.

trial (`optuna.Trial` or `dict[str, Any]`, *optional*) : The trial run or the hyperparameter dictionary for hyperparameter search.

ignore_keys_for_eval (`list[str]`, *optional*) : A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions for evaluation during the training.

**Returns:**

`~trainer_utils.TrainOutput`

Object containing the global step count, training loss, and metrics.

#### save_model[[trl.experimental.gmpo.GMPOTrainer.save_model]]

[Source](https://github.com/huggingface/trl/blob/main/transformers/trainer.py#L3775)

Will save the model, so you can reload it using `from_pretrained()`.

Will only save from the main process.

#### push_to_hub[[trl.experimental.gmpo.GMPOTrainer.push_to_hub]]

[Source](https://github.com/huggingface/trl/blob/main/transformers/trainer.py#L4022)

Upload `self.model` and `self.processing_class` to the 🤗 model hub on the repo `self.args.hub_model_id`.

**Parameters:**

commit_message (`str`, *optional*, defaults to `"End of training"`) : Message to commit while pushing.

blocking (`bool`, *optional*, defaults to `True`) : Whether the function should return only when the `git push` has finished.

token (`str`, *optional*, defaults to `None`) : Token with write permission to overwrite Trainer's original args.

revision (`str`, *optional*) : The git revision to commit from. Defaults to the head of the "main" branch.

kwargs (`dict[str, Any]`, *optional*) : Additional keyword arguments passed along to `~Trainer.create_model_card`.

**Returns:**

The URL of the repository where the model was pushed if `blocking=True`, or a `Future` object tracking the
progress of the commit if `blocking=False`.

## GMPOConfig[[trl.experimental.gmpo.GMPOConfig]]

#### trl.experimental.gmpo.GMPOConfig[[trl.experimental.gmpo.GMPOConfig]]

[Source](https://github.com/huggingface/trl/blob/main/trl/experimental/gmpo/gmpo_config.py#L21)

Configuration class for the `GMPOTrainer`.

`GMPOConfig` inherits every parameter from [GRPOConfig](/docs/trl/main/en/grpo_trainer#trl.GRPOConfig); it only changes the meaning and default of the
clipping range. In GMPO, clipping is applied to the per-token *log*-importance ratios (i.e. in log space) before
the geometric mean is taken, so `epsilon` and `epsilon_high` are expressed in log space: the effective ratio
clipping range is `(exp(-epsilon), exp(epsilon_high))`. The [GMPO paper](https://huggingface.co/papers/2507.20673)
recommends a markedly wider range than GRPO/DAPO, `(exp(-0.4), exp(0.4))`, to encourage exploration.

**Parameters:**

epsilon (`float`, *optional*, defaults to `0.4`) : Lower-bound clipping value, expressed in log space. The lower bound of the per-token importance ratio is `exp(-epsilon)`.

epsilon_high (`float`, *optional*) : Upper-bound clipping value, expressed in log space. If `None`, it defaults to the value of `epsilon`. The upper bound of the per-token importance ratio is `exp(epsilon_high)`.
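As a quick check of how these two parameters map to ratio space, the following sketch computes the effective bounds. The helper `effective_ratio_bounds` is hypothetical and not part of TRL; it only reproduces the `exp(-epsilon)` and `exp(epsilon_high)` formulas and the `None` fallback described above.

```python
import math


def effective_ratio_bounds(epsilon=0.4, epsilon_high=None):
    """Return (lower, upper) bounds on the per-token importance ratio."""
    if epsilon_high is None:
        epsilon_high = epsilon  # documented fallback when epsilon_high is unset
    return math.exp(-epsilon), math.exp(epsilon_high)


# Default setting: symmetric in log space, roughly (0.670, 1.492) in ratio space.
low, high = effective_ratio_bounds()
print(f"default: ({low:.3f}, {high:.3f})")

# Asymmetric setting (illustrative values): tighter lower bound, wider upper bound.
low_asym, high_asym = effective_ratio_bounds(epsilon=0.2, epsilon_high=0.4)
print(f"asymmetric: ({low_asym:.3f}, {high_asym:.3f})")
```

Note that a range that is symmetric in log space is not symmetric around 1 in ratio space: the upper bound sits further from 1 than the lower bound does.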

