Project:
The purpose of this project was to compare how well different reinforcement learning algorithms improve a model's performance in the Lunar Lander video game. Reinforcement learning is a type of machine learning that improves a model's performance on a task by rewarding desired behavior and penalizing undesired behavior. In this case, behavior was evaluated based on the score of the game. In Lunar Lander, the player controls a spacecraft's left, right, and main thrusters to land it on the moon. Points are awarded based on how softly and upright the ship lands, and additional points are given for landing on difficult terrain or between the flags. Our reinforcement learning models started by performing random actions in the game and received feedback on how well each sequence of actions performed based on the score achieved. The algorithms used this feedback to train and improve the model.
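As a quick aside, the environment's interface can be inspected directly. The snippet below is an illustrative sketch (not part of the project script) showing that LunarLander-v2 exposes a small discrete action set and an eight-dimensional observation of the lander's state:

import gym

env = gym.make('LunarLander-v2')
print(env.action_space)       # Discrete(4): do nothing, fire left, fire main, fire right engine
print(env.observation_space)  # Box of shape (8,): position, velocity, angle, angular velocity, leg contacts
env.close()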
The Lunar Lander game was taken from the OpenAI Gym Python module, which contains several games and tasks that can be used for reinforcement learning experiments. For reinforcement learning, the TensorFlow and Stable Baselines Python machine learning frameworks were used. The two algorithms trialed were Sample Efficient Actor-Critic with Experience Replay (ACER) and Proximal Policy Optimization (PPO). The models trained by each algorithm were compared against each other at training times of 50000, 25000, 10000, 5000, and 2500 timesteps (where each timestep represented one frame of the Lunar Lander game). Three models were trained for each algorithm at each training time, and ten trials were run for each model.
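The full training and evaluation script is listed in the Code section below. For orientation, the following is a condensed sketch of how one cell of the experimental grid (one algorithm at one training time) was handled: train a model, then evaluate it for ten single-episode trials. This is an illustration assuming the same Stable Baselines 2.x and Gym APIs used in the Code section, not the script that produced the results.

import gym
import numpy as np
from stable_baselines import ACER, PPO1
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.common.evaluation import evaluate_policy

TIMESTEPS = [2500, 5000, 10000, 25000, 50000]   # training times compared
TRIALS = 10                                     # evaluation episodes per model

def run_cell(algo_cls, total_timesteps, trials=TRIALS):
    """Train one model and return its per-trial evaluation scores."""
    env = DummyVecEnv([lambda: gym.make('LunarLander-v2')])
    model = algo_cls('MlpPolicy', env, verbose=0)
    model.learn(total_timesteps=total_timesteps)
    scores = []
    for _ in range(trials):
        mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=1)
        scores.append(mean_reward)
    env.close()
    return np.array(scores)

# Example: one ACER model trained for 25000 timesteps
# acer_scores = run_cell(ACER, 25000)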
The ACER algorithm utilizes reinforcement learning techniques such as multiple workers, a replay buffer, Retrace for Q-value estimation, a trust region, and importance sampling. The PPO algorithm, on the other hand, only uses multiple workers and a trust region. As a result, ACER appears to provide a more accurate fit given larger amounts of training time, while PPO is more efficient to train and will perform better with less training time. It appears that, given a large enough training time, PPO begins to overfit, as it runs through too many training iterations. Overfitting occurs when a model fits a set of data too closely and cannot accurately fit newly introduced data. The data from this project, provided below, seems to support this.
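To make the importance-sampling component concrete, the snippet below is a minimal numerical sketch (an illustrative assumption, not taken from the project code or the Stable Baselines internals). ACER reweights replayed, off-policy transitions by the ratio of current-policy to behavior-policy probabilities, truncated at a constant so that old experience cannot dominate an update.

import numpy as np

def truncated_importance_weights(pi_probs, mu_probs, clip=10.0):
    """Ratio of current-policy to behavior-policy action probabilities, truncated at `clip`."""
    rho = np.asarray(pi_probs, dtype=float) / np.asarray(mu_probs, dtype=float)
    return np.minimum(rho, clip)

# Replayed actions the current policy now favors get weights above 1,
# but never more than the truncation constant.
print(truncated_importance_weights([0.6, 0.05, 0.3], [0.2, 0.5, 0.3]))   # [3.  0.1 1. ]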
Results:
A control score was taken by performing random actions in the game with no model applied. The average control score was -142.7155. The only trials that outperformed this average were some PPO trials with a training time of 5000 timesteps, some ACER trials with a training time of 25000 timesteps, and all ACER trials with a training time of 50000 timesteps. When ACER was used with a training time of 50000 timesteps, performance was significantly higher than any other combination. The upward trend in Figure 1 suggests that as training time increases, ACER performs better while PPO eventually performs worse. This is likely because of the greater time efficiency of PPO versus the greater accuracy of ACER. With the same training time, PPO performed many more training iterations than ACER, likely because it incorporates fewer reinforcement learning techniques. At a training time of 50000 timesteps, PPO performed 196 training iterations, while ACER performed only 25. As a result, PPO-trained models likely became overfit as training time increased, while ACER models only became more accurate.
Table 1: Average Lunar Lander scores for each trained model (ten trials per model) and for each algorithm/timestep combination.
Trial | Training Timesteps | Average Lunar Lander Score | Average Score for Algorithm and Timestep |
CONTROL SCORE | N/A | -142.7155 | N/A |
ACER 2500 SCORE1 | 2500 | -1012.414 | -652.39 |
ACER 2500 SCORE2 | 2500 | -613.9133 | |
ACER 2500 SCORE3 | 2500 | -330.83 | |
PPO 2500 SCORE1 | 2500 | -577.3163 | -495.99 |
PPO 2500 SCORE2 | 2500 | -479.4224 | |
PPO 2500 SCORE3 | 2500 | -431.2391 | |
ACER 5000 SCORE1 | 5000 | -597.1815 | -686.3 |
ACER 5000 SCORE2 | 5000 | -540.7575 | |
ACER 5000 SCORE3 | 5000 | -920.9756 | |
PPO 5000 SCORE1 | 5000 | -45.00707 | -406.83 |
PPO 5000 SCORE2 | 5000 | -988.2635 | |
PPO 5000 SCORE3 | 5000 | -187.2158 | |
ACER 10000 SCORE1 | 10000 | -609.51 | -1178 |
ACER 10000 SCORE2 | 10000 | -2341.538 | |
ACER 10000 SCORE3 | 10000 | -582.9739 | |
PPO 10000 SCORE1 | 10000 | -601.8603 | -626.77 |
PPO 10000 SCORE2 | 10000 | -902.265 | |
PPO 10000 SCORE3 | 10000 | -376.1861 | |
ACER 25000 SCORE1 | 25000 | -236.1057 | -250.68 |
ACER 25000 SCORE2 | 25000 | -469.1856 | |
ACER 25000 SCORE3 | 25000 | -46.73854 | |
PPO 25000 SCORE1 | 25000 | -1144.504 | -1084.9 |
PPO 25000 SCORE2 | 25000 | -1640.413 | |
PPO 25000 SCORE3 | 25000 | -469.7128 | |
ACER 50000 SCORE1 | 50000 | -19.22001 | 19.2498 |
ACER 50000 SCORE2 | 50000 | 2.376094 | |
ACER 50000 SCORE3 | 50000 | 74.59333 | |
PPO 50000 SCORE1 | 50000 | -329.8513 | -641.43 |
PPO 50000 SCORE2 | 50000 | -869.9752 | |
PPO 50000 SCORE3 | 50000 | -724.4752 | |
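The per-model averages in Table 1 can be reproduced from the CSV written at the end of the Code section. The sketch below is one possible post-processing step (it is not part of the original script) and assumes the file 'LunarLandingRL.csv' contains one column per trial series, as written by the code.

import pandas as pd

df = pd.read_csv('LunarLandingRL.csv', index_col=0)

# Average Lunar Lander score for each model (one column per model, one row per trial)
per_model_avg = df.mean()
print(per_model_avg)

# Average score for each algorithm/timestep pair, e.g. all three 'ACER 50000' models together
group_key = per_model_avg.index.str.rsplit(' ', n=1).str[0]
print(per_model_avg.groupby(group_key).mean())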
Code:
# Install Necessary Dependencies
!pip install tensorflow==1.15.0 tensorflow-gpu==1.15.0 stable_baselines gym box2d-py --user
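# Note: this project uses the original Stable Baselines (2.x), which requires TensorFlow 1.x;
# the newer stable-baselines3 package has a different API and does not include ACER.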
# Import Necessary Packages
import gym
from stable_baselines import ACER, PPO1
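# PPO1 is the MPI-based PPO implementation in Stable Baselines (it requires mpi4py);
# ACER supports discrete action spaces such as LunarLander-v2's.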
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.common.evaluation import evaluate_policy
import pandas as pd
import numpy as np
# Define Environment
environment_name = 'LunarLander-v2'
env = gym.make(environment_name)
# Test Environment and get a baseline for scores
trials = 10
control_scores = np.array([])
for trial in range(1, trials+1):
    state = env.reset()
    done = False
    score = 0
    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(trial, score))
    control_scores = np.append(control_scores, score)
env.close()
print(control_scores)
# Create the 25000 models
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env])
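# DummyVecEnv wraps the single environment in the vectorized interface expected by Stable Baselines models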
# We will train 3 models for each algorithm
# Train Models at 25000 for ACER
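# 'MlpPolicy' uses a multilayer perceptron policy network; verbose=1 prints training progress
# (including the training-iteration counts referenced in the Results section)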
acer_25000_model1 = ACER('MlpPolicy', env, verbose = 1)
acer_25000_model1.learn(total_timesteps=25000)
acer_25000_model2 = ACER('MlpPolicy', env, verbose = 1)
acer_25000_model2.learn(total_timesteps=25000)
acer_25000_model3 = ACER('MlpPolicy', env, verbose = 1)
acer_25000_model3.learn(total_timesteps=25000)
# Train Models at 25000 for PPO
ppo_25000_model1 = PPO1('MlpPolicy', env, verbose = 1)
ppo_25000_model1.learn(total_timesteps=25000)
ppo_25000_model2 = PPO1('MlpPolicy', env, verbose = 1)
ppo_25000_model2.learn(total_timesteps=25000)
ppo_25000_model3 = PPO1('MlpPolicy', env, verbose = 1)
ppo_25000_model3.learn(total_timesteps=25000)
# Run the 25000 models
trials = 10
acer_25000_scores1 = np.array([])
acer_25000_scores2 = np.array([])
acer_25000_scores3 = np.array([])
ppo_25000_scores1 = np.array([])
ppo_25000_scores2 = np.array([])
ppo_25000_scores3 = np.array([])
# ACER
for i in range(1, trials+1):
    # evaluate_policy returns (mean_reward, std_reward); keep only the mean episode reward
    acer_25000_reward, _ = evaluate_policy(acer_25000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    acer_25000_scores1 = np.append(acer_25000_scores1, acer_25000_reward)
    print(f'25000 ACER1 {i}')
for i in range(1, trials+1):
    acer_25000_reward, _ = evaluate_policy(acer_25000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    acer_25000_scores2 = np.append(acer_25000_scores2, acer_25000_reward)
    print(f'25000 ACER2 {i}')
for i in range(1, trials+1):
    acer_25000_reward, _ = evaluate_policy(acer_25000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    acer_25000_scores3 = np.append(acer_25000_scores3, acer_25000_reward)
    print(f'25000 ACER3 {i}')
# PPO
for i in range(1, trials+1):
    ppo_25000_reward, _ = evaluate_policy(ppo_25000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_25000_scores1 = np.append(ppo_25000_scores1, ppo_25000_reward)
    print(f'25000 PPO1 {i}')
for i in range(1, trials+1):
    ppo_25000_reward, _ = evaluate_policy(ppo_25000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_25000_scores2 = np.append(ppo_25000_scores2, ppo_25000_reward)
    print(f'25000 PPO2 {i}')
for i in range(1, trials+1):
    ppo_25000_reward, _ = evaluate_policy(ppo_25000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_25000_scores3 = np.append(ppo_25000_scores3, ppo_25000_reward)
    print(f'25000 PPO3 {i}')
# Save the 25000 Models
acer_25000_model1.save("acer_25000_model1")
acer_25000_model2.save("acer_25000_model2")
acer_25000_model3.save("acer_25000_model3")
ppo_25000_model1.save("ppo_25000_model1")
ppo_25000_model2.save("ppo_25000_model2")
ppo_25000_model3.save("ppo_25000_model3")
# Train the 50000 Models
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env])
# We will train 3 models for each algorithm
# Train Models at 50000 for ACER
acer_50000_model1 = ACER('MlpPolicy', env, verbose = 1)
acer_50000_model1.learn(total_timesteps=50000)
acer_50000_model2 = ACER('MlpPolicy', env, verbose = 1)
acer_50000_model2.learn(total_timesteps=50000)
acer_50000_model3 = ACER('MlpPolicy', env, verbose = 1)
acer_50000_model3.learn(total_timesteps=50000)
# Train Models at 50000 for PPO
ppo_50000_model1 = PPO1('MlpPolicy', env, verbose = 1)
ppo_50000_model1.learn(total_timesteps=50000)
ppo_50000_model2 = PPO1('MlpPolicy', env, verbose = 1)
ppo_50000_model2.learn(total_timesteps=50000)
ppo_50000_model3 = PPO1('MlpPolicy', env, verbose = 1)
ppo_50000_model3.learn(total_timesteps=50000)
# Run the 50000 models
trials = 10
acer_50000_scores1 = np.array([])
acer_50000_scores2 = np.array([])
acer_50000_scores3 = np.array([])
ppo_50000_scores1 = np.array([])
ppo_50000_scores2 = np.array([])
ppo_50000_scores3 = np.array([])
# ACER
for i in range(1, trials+1):
    acer_50000_reward, _ = evaluate_policy(acer_50000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    acer_50000_scores1 = np.append(acer_50000_scores1, acer_50000_reward)
    print(f'50000 ACER1 {i}')
for i in range(1, trials+1):
    acer_50000_reward, _ = evaluate_policy(acer_50000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    acer_50000_scores2 = np.append(acer_50000_scores2, acer_50000_reward)
    print(f'50000 ACER2 {i}')
for i in range(1, trials+1):
    acer_50000_reward, _ = evaluate_policy(acer_50000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    acer_50000_scores3 = np.append(acer_50000_scores3, acer_50000_reward)
    print(f'50000 ACER3 {i}')
# PPO
for i in range(1, trials+1):
    ppo_50000_reward, _ = evaluate_policy(ppo_50000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_50000_scores1 = np.append(ppo_50000_scores1, ppo_50000_reward)
    print(f'50000 PPO1 {i}')
for i in range(1, trials+1):
    ppo_50000_reward, _ = evaluate_policy(ppo_50000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_50000_scores2 = np.append(ppo_50000_scores2, ppo_50000_reward)
    print(f'50000 PPO2 {i}')
for i in range(1, trials+1):
    ppo_50000_reward, _ = evaluate_policy(ppo_50000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_50000_scores3 = np.append(ppo_50000_scores3, ppo_50000_reward)
    print(f'50000 PPO3 {i}')
# Save the 50000 models
acer_50000_model1.save("acer_50000_model1")
acer_50000_model2.save("acer_50000_model2")
acer_50000_model3.save("acer_50000_model3")
ppo_50000_model1.save("ppo_50000_model1")
ppo_50000_model2.save("ppo_50000_model2")
ppo_50000_model3.save("ppo_50000_model3")
# Train the 10000 Models
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env])
# We will train 3 models for each algorithm
# Train Models at 10000 for ACER
acer_10000_model1 = ACER('MlpPolicy', env, verbose = 1)
acer_10000_model1.learn(total_timesteps=10000)
acer_10000_model2 = ACER('MlpPolicy', env, verbose = 1)
acer_10000_model2.learn(total_timesteps=10000)
acer_10000_model3 = ACER('MlpPolicy', env, verbose = 1)
acer_10000_model3.learn(total_timesteps=10000)
# Train Models at 10000 for PPO
ppo_10000_model1 = PPO1('MlpPolicy', env, verbose = 1)
ppo_10000_model1.learn(total_timesteps=10000)
ppo_10000_model2 = PPO1('MlpPolicy', env, verbose = 1)
ppo_10000_model2.learn(total_timesteps=10000)
ppo_10000_model3 = PPO1('MlpPolicy', env, verbose = 1)
ppo_10000_model3.learn(total_timesteps=10000)
# Run the 10000 models
trials = 10
acer_10000_scores1 = np.array([])
acer_10000_scores2 = np.array([])
acer_10000_scores3 = np.array([])
ppo_10000_scores1 = np.array([])
ppo_10000_scores2 = np.array([])
ppo_10000_scores3 = np.array([])
# ACER
for i in range(1, trials+1):
    acer_10000_reward, _ = evaluate_policy(acer_10000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    acer_10000_scores1 = np.append(acer_10000_scores1, acer_10000_reward)
    print(f'10000 ACER1 {i}')
for i in range(1, trials+1):
    acer_10000_reward, _ = evaluate_policy(acer_10000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    acer_10000_scores2 = np.append(acer_10000_scores2, acer_10000_reward)
    print(f'10000 ACER2 {i}')
for i in range(1, trials+1):
    acer_10000_reward, _ = evaluate_policy(acer_10000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    acer_10000_scores3 = np.append(acer_10000_scores3, acer_10000_reward)
    print(f'10000 ACER3 {i}')
# PPO
for i in range(1, trials+1):
    ppo_10000_reward, _ = evaluate_policy(ppo_10000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_10000_scores1 = np.append(ppo_10000_scores1, ppo_10000_reward)
    print(f'10000 PPO1 {i}')
for i in range(1, trials+1):
    ppo_10000_reward, _ = evaluate_policy(ppo_10000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_10000_scores2 = np.append(ppo_10000_scores2, ppo_10000_reward)
    print(f'10000 PPO2 {i}')
for i in range(1, trials+1):
    ppo_10000_reward, _ = evaluate_policy(ppo_10000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_10000_scores3 = np.append(ppo_10000_scores3, ppo_10000_reward)
    print(f'10000 PPO3 {i}')
# Save the 10000 models
acer_10000_model1.save("acer_10000_model1")
acer_10000_model2.save("acer_10000_model2")
acer_10000_model3.save("acer_10000_model3")
ppo_10000_model1.save("ppo_10000_model1")
ppo_10000_model2.save("ppo_10000_model2")
ppo_10000_model3.save("ppo_10000_model3")
# Train the 5000 Models
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env])
# We will train 3 models for each algorithm
# Train Models at 5000 for ACER
acer_5000_model1 = ACER('MlpPolicy', env, verbose = 1)
acer_5000_model1.learn(total_timesteps=5000)
acer_5000_model2 = ACER('MlpPolicy', env, verbose = 1)
acer_5000_model2.learn(total_timesteps=5000)
acer_5000_model3 = ACER('MlpPolicy', env, verbose = 1)
acer_5000_model3.learn(total_timesteps=5000)
# Train Models at 5000 for PPO
ppo_5000_model1 = PPO1('MlpPolicy', env, verbose = 1)
ppo_5000_model1.learn(total_timesteps=5000)
ppo_5000_model2 = PPO1('MlpPolicy', env, verbose = 1)
ppo_5000_model2.learn(total_timesteps=5000)
ppo_5000_model3 = PPO1('MlpPolicy', env, verbose = 1)
ppo_5000_model3.learn(total_timesteps=5000)
# Run the 5000 models
#5000
trials = 10
acer_5000_scores1 = np.array([])
acer_5000_scores2 = np.array([])
acer_5000_scores3 = np.array([])
ppo_5000_scores1 = np.array([])
ppo_5000_scores2 = np.array([])
ppo_5000_scores3 = np.array([])
# ACER
for i in range(1, trials+1):
    acer_5000_reward, _ = evaluate_policy(acer_5000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    acer_5000_scores1 = np.append(acer_5000_scores1, acer_5000_reward)
    print(f'5000 ACER1 {i}')
for i in range(1, trials+1):
    acer_5000_reward, _ = evaluate_policy(acer_5000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    acer_5000_scores2 = np.append(acer_5000_scores2, acer_5000_reward)
    print(f'5000 ACER2 {i}')
for i in range(1, trials+1):
    acer_5000_reward, _ = evaluate_policy(acer_5000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    acer_5000_scores3 = np.append(acer_5000_scores3, acer_5000_reward)
    print(f'5000 ACER3 {i}')
# PPO
for i in range(1, trials+1):
    ppo_5000_reward, _ = evaluate_policy(ppo_5000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_5000_scores1 = np.append(ppo_5000_scores1, ppo_5000_reward)
    print(f'5000 PPO1 {i}')
for i in range(1, trials+1):
    ppo_5000_reward, _ = evaluate_policy(ppo_5000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_5000_scores2 = np.append(ppo_5000_scores2, ppo_5000_reward)
    print(f'5000 PPO2 {i}')
for i in range(1, trials+1):
    ppo_5000_reward, _ = evaluate_policy(ppo_5000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_5000_scores3 = np.append(ppo_5000_scores3, ppo_5000_reward)
    print(f'5000 PPO3 {i}')
# Save the 5000 models
acer_5000_model1.save("acer_5000_model1")
acer_5000_model2.save("acer_5000_model2")
acer_5000_model3.save("acer_5000_model3")
ppo_5000_model1.save("ppo_5000_model1")
ppo_5000_model2.save("ppo_5000_model2")
ppo_5000_model3.save("ppo_5000_model3")
# Train the 2500 Models
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env])
# We will train 3 models for each algorithm
# Train Models at 2500 for ACER
acer_2500_model1 = ACER('MlpPolicy', env, verbose = 1)
acer_2500_model1.learn(total_timesteps=2500)
acer_2500_model2 = ACER('MlpPolicy', env, verbose = 1)
acer_2500_model2.learn(total_timesteps=2500)
acer_2500_model3 = ACER('MlpPolicy', env, verbose = 1)
acer_2500_model3.learn(total_timesteps=2500)
# Train Models at 2500 for PPO
ppo_2500_model1 = PPO1('MlpPolicy', env, verbose = 1)
ppo_2500_model1.learn(total_timesteps=2500)
ppo_2500_model2 = PPO1('MlpPolicy', env, verbose = 1)
ppo_2500_model2.learn(total_timesteps=2500)
ppo_2500_model3 = PPO1('MlpPolicy', env, verbose = 1)
ppo_2500_model3.learn(total_timesteps=2500)
# Run the 2500 Models
#2500
trials = 10
acer_2500_scores1 = np.array([])
acer_2500_scores2 = np.array([])
acer_2500_scores3 = np.array([])
ppo_2500_scores1 = np.array([])
ppo_2500_scores2 = np.array([])
ppo_2500_scores3 = np.array([])
# ACER
for i in range(1, trials+1):
    acer_2500_reward, _ = evaluate_policy(acer_2500_model1, env, n_eval_episodes=1, render=True)
    env.close()
    acer_2500_scores1 = np.append(acer_2500_scores1, acer_2500_reward)
    print(f'2500 ACER1 {i}')
for i in range(1, trials+1):
    acer_2500_reward, _ = evaluate_policy(acer_2500_model2, env, n_eval_episodes=1, render=True)
    env.close()
    acer_2500_scores2 = np.append(acer_2500_scores2, acer_2500_reward)
    print(f'2500 ACER2 {i}')
for i in range(1, trials+1):
    acer_2500_reward, _ = evaluate_policy(acer_2500_model3, env, n_eval_episodes=1, render=True)
    env.close()
    acer_2500_scores3 = np.append(acer_2500_scores3, acer_2500_reward)
    print(f'2500 ACER3 {i}')
# PPO
for i in range(1, trials+1):
    ppo_2500_reward, _ = evaluate_policy(ppo_2500_model1, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_2500_scores1 = np.append(ppo_2500_scores1, ppo_2500_reward)
    print(f'2500 PPO1 {i}')
for i in range(1, trials+1):
    ppo_2500_reward, _ = evaluate_policy(ppo_2500_model2, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_2500_scores2 = np.append(ppo_2500_scores2, ppo_2500_reward)
    print(f'2500 PPO2 {i}')
for i in range(1, trials+1):
    ppo_2500_reward, _ = evaluate_policy(ppo_2500_model3, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_2500_scores3 = np.append(ppo_2500_scores3, ppo_2500_reward)
    print(f'2500 PPO3 {i}')
# Create the dataframe
# Each score array (including control_scores) now holds one mean reward per trial,
# so no zero-padding of the control scores is needed before building the dataframe
print(len(control_scores))
d = {
'Control Score': control_scores,
'ACER 2500 SCORE1': acer_2500_scores1,
'ACER 2500 SCORE2': acer_2500_scores2,
'ACER 2500 SCORE3': acer_2500_scores3,
'PPO 2500 SCORE1': ppo_2500_scores1,
'PPO 2500 SCORE2': ppo_2500_scores2,
'PPO 2500 SCORE3': ppo_2500_scores3,
'ACER 5000 SCORE1': acer_5000_scores1,
'ACER 5000 SCORE2': acer_5000_scores2,
'ACER 5000 SCORE3': acer_5000_scores3,
'PPO 5000 SCORE1': ppo_5000_scores1,
'PPO 5000 SCORE2': ppo_5000_scores2,
'PPO 5000 SCORE3': ppo_5000_scores3,
'ACER 10000 SCORE1': acer_10000_scores1,
'ACER 10000 SCORE2': acer_10000_scores2,
'ACER 10000 SCORE3': acer_10000_scores3,
'PPO 10000 SCORE1': ppo_10000_scores1,
'PPO 10000 SCORE2': ppo_10000_scores2,
'PPO 10000 SCORE3': ppo_10000_scores3,
'ACER 25000 SCORE1': acer_25000_scores1,
'ACER 25000 SCORE2': acer_25000_scores2,
'ACER 25000 SCORE3': acer_25000_scores3,
'PPO 25000 SCORE1': ppo_25000_scores1,
'PPO 25000 SCORE2': ppo_25000_scores2,
'PPO 25000 SCORE3': ppo_25000_scores3,
'ACER 50000 SCORE1': acer_50000_scores1,
'ACER 50000 SCORE2': acer_50000_scores2,
'ACER 50000 SCORE3': acer_50000_scores3,
'PPO 50000 SCORE1': ppo_50000_scores1,
'PPO 50000 SCORE2': ppo_50000_scores2,
'PPO 50000 SCORE3': ppo_50000_scores3,
}
df = pd.DataFrame(data = d)
df.to_csv('LunarLandingRL.csv')