
A DQN Implementation

In grad school, one of my first projects was implementing a DQN agent from scratch to solve the HighwayEnv autonomous driving gym environment.

Local Simulator Setup

I'll start with an overview of the setup and implementation process for our driving simulation. I compared several simulators, including Vista, Duckietown, TORCS, and CARLA, but ultimately settled on HighwayEnv for its compatibility with OpenAI Gym environments, active support, and widespread use in RL and AV academic research. The first step was to create a Conda environment from the provided environment.yaml file. It's important to use virtual Python environments so you don't interfere with your machine's preinstalled Python packages and can resolve dependencies without worrying about breaking other projects. Once I had installed the requisite packages, namely pygame, numpy, opencv-python, ffio, shapely, descartes, matplotlib, pyrender, torch, torchvision, pickle5, and highway-env, I was ready to render the environment.

Next, we created the environment with an rgb_array render mode, reset it, and stepped it through a few placeholder actions. Here's what the starter code looks like:

import gymnasium as gym
import matplotlib.pyplot as plt

# initialize the environment with an rgb_array renderer
env = gym.make('highway-v0', render_mode='rgb_array')
env.reset()

# step the environment a few times with the IDLE action
for _ in range(3):
    action = env.unwrapped.action_type.actions_indexes["IDLE"]
    obs, reward, done, truncated, info = env.step(action)
    env.render()
    
image = env.render()
plt.imshow(image)
plt.show()

And this is what the rendered environment looks like: [rendered environment screenshot]

Next, we enumerated the discrete action space and observation space to become more familiar with the environment. Our discrete action space can be represented by the following dictionary:
{0: 'LANE_LEFT', 1: 'IDLE', 2: 'LANE_RIGHT', 3: 'FASTER', 4: 'SLOWER'}

Our observation space is represented by a 2D array as shown below. Descriptions of the kinematic observation space can be found in the documentation.

[[-0.04063911  0.77728456  0.50642455 -0.442782    0.85936016]
 [-0.64491266  0.14646605  0.40261534  1.3799425   0.5664462 ]
 [ 0.11634676 -0.51152134  2.8387847   0.5203923   0.59936154]
 [-0.16805217  0.251449   -0.1196415   1.8311491   1.2404503 ]
 [ 0.15308522 -0.06321809 -0.4128768  -0.63539755  0.1411862 ]]
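
Both spaces can also be inspected programmatically. The short sketch below assumes highway-env's default Kinematics observation (five vehicles by five features); the printed shapes would change if the environment is reconfigured.

import gymnasium as gym
import highway_env  # may be needed to register highway-v0, depending on the version

env = gym.make('highway-v0', render_mode='rgb_array')
obs, info = env.reset()

print(env.action_space)                    # Discrete(5)
print(env.unwrapped.action_type.actions)   # {0: 'LANE_LEFT', 1: 'IDLE', ...}
print(env.observation_space.shape)         # (5, 5): 5 vehicles x 5 features
print(obs)                                 # the 2D kinematics array shown above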

Finally, we saved the frame rendered at each step of the simulation and compiled the frames into a GIF using a combination of the opencv-python wrapper and the imageio-ffmpeg library; a sketch of that frame-collection loop follows the clips below. Here are a few episodes from our simulation:

[highway episode 1]
[highway episode 2]
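
The recording itself is straightforward. Below is a minimal sketch of the frame-collection loop, assuming imageio handles the GIF encoding and using a random placeholder policy; the file name is illustrative.

import gymnasium as gym
import imageio

env = gym.make('highway-v0', render_mode='rgb_array')
obs, info = env.reset()

frames = []
done = truncated = False
while not (done or truncated):
    action = env.action_space.sample()   # placeholder policy for illustration
    obs, reward, done, truncated, info = env.step(action)
    frames.append(env.render())          # each frame is an (H, W, 3) uint8 array

# compile the frames into a gif; frame timing can be tuned via the writer's
# duration/fps options, which differ slightly between imageio versions
imageio.mimsave('episode.gif', frames)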

Kubernetes

I'll summarize the Kubernetes setup here but won't go into too much depth, as everyone's clusters and namespaces are different. Essentially, I logged on to our hub with my credentials, downloaded my cluster config, and set up the kubectl command-line tool locally. After the setup, I experimented with a few different pods and deployments, adding files, running a few batch jobs, and setting up networking so other collaborators could access my deployments.

To set up HighwayEnv on the cluster, I followed a similar process of installing dependencies and Python packages. I installed Miniconda, created a Conda environment with Python 3.8, and installed highway-env, stable_baselines3, torch, tensorboard, and moviepy with pip3. And with that, it was time to start benchmarking!

Benchmarking

I started benchmarking with Stable Baselines 3 (SB3), a library offering a collection of reliable PyTorch implementations of reinforcement learning (RL) algorithms. The library is designed to facilitate the development and comparison of RL algorithms, making it easier for researchers like me to implement more complex models. It includes a wide range of algorithm implementations that we draw on for this project, starting with DQN. It's also worth noting that the library was built for use with OpenAI Gym environments, which made integration with HighwayEnv a smooth process.

To automate benchmarking, I wrapped SB3 in executable Python scripts. These scripts output a live performance, training, and reward summary at regular episode intervals, and also allow reward graphs to be visualized via a TensorBoard GUI hosted on localhost.
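
To give a sense of the structure, here is a condensed sketch of such a script for the DQN case; the hyperparameters, log directory, and save path are illustrative assumptions rather than the exact values I used.

import gymnasium as gym
import highway_env  # may be needed to register highway-v0, depending on the version
from stable_baselines3 import DQN

env = gym.make("highway-v0")

model = DQN(
    "MlpPolicy",
    env,
    learning_rate=5e-4,      # illustrative hyperparameters, not the exact ones used
    buffer_size=15_000,
    learning_starts=200,
    batch_size=32,
    gamma=0.8,
    train_freq=1,
    target_update_interval=50,
    exploration_fraction=0.7,
    verbose=1,
    tensorboard_log="runs/dqn_highway",
)
model.learn(total_timesteps=20_000)
model.save("dqn_highway")

Running tensorboard --logdir runs then serves the reward curves on localhost, which is the GUI mentioned above.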

This is what a sample output might look like when training:

-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 200         |
|    mean_reward          | -157        |
| rollout/                |             |
|    ep_len_mean          | 200         |
|    ep_rew_mean          | -227        |
| time/                   |             |
|    fps                  | 972         |
|    iterations           | 19          |
|    time_elapsed         | 80          |
|    total_timesteps      | 77824       |
| train/                  |             |
|    approx_kl            | 0.037781604 |
|    clip_fraction        | 0.243       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.06       |
|    explained_variance   | 0.999       |
|    learning_rate        | 0.001       |
|    loss                 | 0.245       |
|    n_updates            | 180         |
|    policy_gradient_loss | -0.00398    |
|    std                  | 0.205       |
|    value_loss           | 0.226       |
-----------------------------------------

Some of the key logged items are as follows:

ep_len_mean: Average length of an episode, measured by the number of time steps it contains.
ep_rew_mean: Average reward received per episode.
exploration_rate: The proportion of actions chosen randomly, as opposed to being selected based on the model's policy.
episodes: The total number of episodes completed.
fps: Frames per second, i.e., the number of environment steps processed per second.
time_elapsed: Total time in seconds that has passed during training.
total_timesteps: The sum of all time steps taken across all episodes.

After DQN, I implemented A2C and PPO in SB3 and created similar executable scripts that anyone can use to benchmark algorithm performance in this environment.
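
One way to keep those scripts uniform, sketched below, is to select the algorithm by name and reuse the same training and logging code; the --algo flag and run-directory naming are illustrative choices, not the exact layout of my scripts.

import argparse

import gymnasium as gym
import highway_env  # may be needed to register highway-v0, depending on the version
from stable_baselines3 import A2C, DQN, PPO

ALGOS = {"dqn": DQN, "a2c": A2C, "ppo": PPO}

parser = argparse.ArgumentParser()
parser.add_argument("--algo", choices=ALGOS, default="dqn")     # hypothetical flag
parser.add_argument("--timesteps", type=int, default=20_000)
args = parser.parse_args()

env = gym.make("highway-v0")
model = ALGOS[args.algo]("MlpPolicy", env, verbose=1,
                         tensorboard_log=f"runs/{args.algo}_highway")
model.learn(total_timesteps=args.timesteps)
model.save(f"{args.algo}_highway")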

Finally, I implemented my own version of the DQN algorithm from scratch. While its visualization and logging are not as robust, I was still able to create a similar executable Python script that can be used to train and output an RL agent; a sketch of its core pieces follows below. Here's a snippet of what the live reward output from that script looks like:

[live reward output from the from-scratch DQN script]
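
The core of that from-scratch implementation boils down to a few pieces: a Q-network, an epsilon-greedy action selector, and a TD-target update applied to minibatches drawn from a replay buffer. The sketch below illustrates those pieces; the network width, buffer size, and hyperparameters are illustrative assumptions rather than the exact values from my script.

import random
from collections import deque

import torch
import torch.nn as nn


class QNetwork(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        # the 5x5 kinematics observation is assumed to be flattened to a vector
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        return self.net(x)


# replay buffer of (state, action, reward, next_state, done) tuples
replay_buffer = deque(maxlen=15_000)


def select_action(q_net, state, epsilon, n_actions):
    # with probability epsilon take a random action, otherwise act greedily
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())


def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    # Q(s, a) for the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # TD target: r + gamma * max_a' Q_target(s', a'), zeroed for terminal states
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    loss = nn.functional.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()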