Categorical DQN (C51)
Overview
C51 introduces a distributional perspective for DQN: instead of learning a single value for an action, C51 learns to predict a distribution of values for the action. Empirically, C51 demonstrates impressive performance in ALE.
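As a rough illustration of "a distribution of values", the sketch below represents each action's return distribution as probabilities over a fixed set of return values (atoms) and picks the action with the highest mean. It is a dependency-free sketch under stated assumptions, not the code from c51_atari.py; the helper names (make_atoms, expected_value, greedy_action) and the toy numbers are illustrative.

```python
# Illustrative sketch only: helper names and toy numbers are assumptions,
# not taken from the CleanRL implementation.

def make_atoms(v_min, v_max, n_atoms):
    """Evenly spaced return values that form the fixed support."""
    delta_z = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * delta_z for i in range(n_atoms)]


def expected_value(pmf, atoms):
    """Mean of a categorical distribution: sum_i p_i * z_i."""
    return sum(p * z for p, z in zip(pmf, atoms))


def greedy_action(pmfs, atoms):
    """Index of the action whose return distribution has the highest mean."""
    q_values = [expected_value(pmf, atoms) for pmf in pmfs]
    return max(range(len(q_values)), key=q_values.__getitem__)


atoms = make_atoms(v_min=-10.0, v_max=10.0, n_atoms=5)
pmfs = [
    [0.0, 0.0, 1.0, 0.0, 0.0],  # action 0: return is always 0
    [0.4, 0.0, 0.0, 0.0, 0.6],  # action 1: risky, but higher mean return
]
print(greedy_action(pmfs, atoms))
```

A scalar DQN would only store the two means; the categorical form also keeps the shape of each distribution, for example the fact that action 1 is bimodal.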
Original papers:
Implemented Variants
| Variants Implemented | Description |
|---|---|
| c51_atari.py, docs | For playing Atari games. It uses convolutional layers and common atari-based pre-processing techniques. |
| c51.py, docs | For classic control tasks like CartPole-v1. |
Below are our single-file implementations of C51:
c51_atari.py
The c51_atari.py has the following features:
- For playing Atari games. It uses convolutional layers and common atari-based pre-processing techniques.
- Works with the Atari's pixel Box observation space of shape (210, 160, 3)
- Works with the Discrete action space
Usage
poetry install -E atari
python cleanrl/c51_atari.py --env-id BreakoutNoFrameskip-v4
python cleanrl/c51_atari.py --env-id PongNoFrameskip-v4
Explanation of the logged metrics
Running python cleanrl/c51_atari.py will automatically record various metrics such as the loss and the estimated Q-values in TensorBoard. Below is the documentation for these metrics:
- charts/episodic_return: episodic return of the game
- charts/SPS: number of steps per second
- losses/loss: the cross entropy loss between the \(t\) step state value distribution and the projected \(t+1\) step state value distribution
- losses/q_values: implemented as (old_pmfs * q_network.atoms).sum(1), which is the sum of the probability of getting returns \(x\) (old_pmfs) multiplied by \(x\) (q_network.atoms), averaged over the sample obtained from the replay buffer; useful when gauging if under or over estimation happens
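The losses/loss metric can be sketched for a single transition as follows. This is a minimal, dependency-free sketch of the categorical projection and cross entropy described by Bellemare et al. (2017), not the batched code in c51_atari.py. The function names, the toy support, and the assumption that next_pmf is already the next-state distribution of the greedy action are all illustrative.

```python
import math


def project_target(next_pmf, atoms, reward, gamma, done):
    """Project the distribution of r + gamma * z back onto the fixed atoms."""
    v_min, v_max = atoms[0], atoms[-1]
    delta_z = (v_max - v_min) / (len(atoms) - 1)
    target = [0.0] * len(atoms)
    for p, z in zip(next_pmf, atoms):
        # Bellman update of one atom, clipped to the support.
        tz = reward + (0.0 if done else gamma * z)
        tz = min(max(tz, v_min), v_max)
        # Fractional index of tz on the support (clamped against rounding).
        b = min(max((tz - v_min) / delta_z, 0.0), len(atoms) - 1)
        lower, upper = math.floor(b), math.ceil(b)
        if lower == upper:
            target[lower] += p
        else:
            # Split the mass between the two neighbouring atoms.
            target[lower] += p * (upper - b)
            target[upper] += p * (b - lower)
    return target


def cross_entropy(target_pmf, predicted_pmf, eps=1e-5):
    """Cross entropy with predicted probabilities clamped away from 0 and 1."""
    return -sum(
        t * math.log(min(max(q, eps), 1.0 - eps))
        for t, q in zip(target_pmf, predicted_pmf)
    )


atoms = [-10.0, -5.0, 0.0, 5.0, 10.0]
next_pmf = [0.0, 0.0, 1.0, 0.0, 0.0]   # next state: return is always 0
target = project_target(next_pmf, atoms, reward=1.0, gamma=0.5, done=False)
predicted = [0.2] * 5                   # an untrained, uniform prediction
print(target, cross_entropy(target, predicted))
```

In this toy case the shifted return of 1 falls between the atoms 0 and 5, so the projection splits the probability mass between those two atoms in a way that keeps the mean at 1.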
Implementation details
c51_atari.py is based on (Bellemare et al., 2017)1 but presents a few implementation differences:
- (Bellemare et al., 2017)1 injects stochasticity: "on each frame the environment rejects the agent’s selected action with probability \(p = 0.25\)", but c51_atari.py does not do this.
- c51_atari.py uses a self-contained evaluation scheme: c51_atari.py reports the episodic returns obtained throughout training, whereas (Bellemare et al., 2017)1 is trained with --end-e=0.01 but reported episodic returns using a separate evaluation process with --end-e=0.001 (see "5.2. State-of-the-Art Results" on page 7).
- c51_atari.py rescales the gradient so that its global norm does not exceed 0.5 (gradient clipping), as done in PPO (ppo2/model.py#L102-L108).
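The gradient rescaling in the last item can be sketched without any framework. The function below is a hypothetical helper written for illustration; it mirrors the usual global-norm clipping rule (the behaviour PyTorch exposes as torch.nn.utils.clip_grad_norm_), and the gradient values are made up.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Rescale per-parameter gradients so their global L2 norm is at most max_norm.

    `grads` is a list of flat gradient lists, one per parameter tensor.
    Returns the rescaled gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Scale factor is capped at 1 so small gradients are left untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm


large = [[3.0], [4.0]]             # global norm 5.0, above the 0.5 limit
clipped, norm_before = clip_grad_norm(large, max_norm=0.5)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, norm_after)

small = [[0.1], [0.2]]             # global norm below 0.5, left unchanged
unchanged, _ = clip_grad_norm(small, max_norm=0.5)
```

All gradients share one scale factor, so the update direction is preserved and only its length is limited.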
Experiment results
PR vwxyzjn/cleanrl#159 tracks our effort to conduct experiments, and the reproduction instructions can be found at vwxyzjn/cleanrl/benchmark/c51.
Below are the average episodic returns for c51_atari.py.
| Environment | c51_atari.py 10M steps | (Bellemare et al., 2017, Figure 14)1 50M steps | (Hessel et al., 2017, Figure 5)3 |
|---|---|---|---|
| BreakoutNoFrameskip-v4 | 467.00 ± 96.11 | 748 | ~500 at 10M steps, ~600 at 50M steps |
| PongNoFrameskip-v4 | 19.32 ± 0.92 | 20.9 | ~20 at 10M steps, ~20 at 50M steps |
| BeamRiderNoFrameskip-v4 | 9986.96 ± 1953.30 | 14,074 | ~12000 at 10M steps, ~14000 at 50M steps |
Note that we save computational time by reducing timesteps from 50M to 10M, so the 50M-step scores above are not a like-for-like comparison. Even so, our c51_atari.py scores the same or higher in 10M steps than (Mnih et al., 2015).
Learning curves:
Tracked experiments and game play videos:
c51.py
The c51.py has the following features:
- Works with the Box observation space of low-level features
- Works with the Discrete action space
- Works with envs like CartPole-v1
Usage
python cleanrl/c51.py --env-id CartPole-v1
Explanation of the logged metrics
See related docs for c51_atari.py.
Implementation details
The c51.py shares the same implementation details as c51_atari.py, except that c51.py runs with different hyperparameters and a different neural network architecture. Specifically,

- c51.py uses a simpler neural network as follows:

  ```python
  self.network = nn.Sequential(
      nn.Linear(np.array(env.single_observation_space.shape).prod(), 120),
      nn.ReLU(),
      nn.Linear(120, 84),
      nn.ReLU(),
      # C51 predicts a distribution per action, so the head needs one logit
      # per (action, atom) pair; n_atoms is the number of atoms in the support.
      nn.Linear(84, env.single_action_space.n * n_atoms),
  )
  ```

- c51.py runs with different hyperparameters:

  ```bash
  python c51.py --total-timesteps 500000 \
      --learning-rate 2.5e-4 \
      --buffer-size 10000 \
      --gamma 0.99 \
      --target-network-frequency 500 \
      --max-grad-norm 0.5 \
      --batch-size 128 \
      --start-e 1 \
      --end-e 0.05 \
      --exploration-fraction 0.5 \
      --learning-starts 10000 \
      --train-frequency 10
  ```
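To make the exploration flags concrete, the sketch below shows how --start-e, --end-e, --exploration-fraction and --total-timesteps could combine into a linearly decaying epsilon. It assumes a linear decay over the first exploration-fraction of training; the function name and signature are illustrative rather than quoted from c51.py.

```python
def linear_schedule(start_e, end_e, duration, t):
    """Linearly anneal epsilon from start_e to end_e over `duration` steps."""
    slope = (end_e - start_e) / duration
    return max(slope * t + start_e, end_e)


total_timesteps = 500_000
exploration_fraction = 0.5
# Number of steps over which epsilon decays (an assumption of this sketch).
duration = int(exploration_fraction * total_timesteps)

for step in (0, 125_000, 250_000, 500_000):
    epsilon = linear_schedule(1.0, 0.05, duration, step)
    print(step, round(epsilon, 3))
```

With the values above, epsilon falls from 1 to 0.05 over the first 250,000 steps and stays at 0.05 for the rest of training.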
Experiment results
PR vwxyzjn/cleanrl#159 tracks our effort to conduct experiments, and the reproduction instructions can be found at vwxyzjn/cleanrl/benchmark/c51.
Below are the average episodic returns for c51.py.
| Environment | c51.py |
|---|---|
| CartPole-v1 | 498.51 ± 1.77 |
| Acrobot-v1 | -88.81 ± 8.86 |
| MountainCar-v0 | -167.71 ± 26.85 |
Note that C51 has no official benchmark on classic control environments, so we do not include a comparison. That said, our c51.py achieves near-perfect scores in CartPole-v1 and Acrobot-v1; further, it obtains successful runs in the sparse-reward environment MountainCar-v0.
Learning curves:
Tracked experiments and game play videos:
1. Bellemare, M.G., Dabney, W., & Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. ICML.
2. [Proposal] Formal API handling of truncation vs termination. https://github.com/openai/gym/issues/2510
3. Hessel, M., Modayil, J., Hasselt, H.V., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M.G., & Silver, D. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. AAAI.