Project:
The purpose of this project was to compare how well different reinforcement learning algorithms improve a model's performance in the Lunar Lander video game. Reinforcement learning is a type of machine learning that improves a model's performance on a task by rewarding desired behavior and penalizing undesired behavior. In this case, behavior was evaluated based on the score of the game. In Lunar Lander, the player controls a spacecraft's left, right, and main thrusters to land it on the moon. Points are awarded based on how softly and upright the ship lands, and additional points are given for landing on difficult terrain or between the flags. Our reinforcement learning models started by performing random actions in the game and received feedback on how well each sequence of actions performed based on the score achieved. The algorithms used this feedback to train and improve the model.
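As a quick aside, the environment's interface can be inspected directly. The snippet below is an illustrative sketch (not part of the project script) showing that LunarLander-v2 exposes a small discrete action set and an eight-dimensional observation of the lander's state:

import gym

env = gym.make('LunarLander-v2')
print(env.action_space)       # Discrete(4): do nothing, fire left, fire main, fire right engine
print(env.observation_space)  # Box of shape (8,): position, velocity, angle, angular velocity, leg contacts
env.close()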
The Lunar Lander game was taken from the OpenAI Gym Python module, which contains several games and tasks that can be used for reinforcement learning experiments. For reinforcement learning, the TensorFlow and Stable Baselines Python machine learning frameworks were used. The two algorithms trialed were Sample Efficient Actor-Critic with Experience Replay (ACER) and Proximal Policy Optimization (PPO). The models trained by each algorithm were compared against each other at training times of 50000, 25000, 10000, 5000, and 2500 timesteps (where each timestep represented one frame of the Lunar Lander game). Three models were trained for each algorithm at each training time, and ten trials were run for each model.
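The full training and evaluation script is listed in the Code section below. For orientation, the following is a condensed sketch of how one cell of the experimental grid (one algorithm at one training time) was handled: train a model, then evaluate it for ten single-episode trials. This is an illustration assuming the same Stable Baselines 2.x and Gym APIs used in the Code section, not the script that produced the results.

import gym
import numpy as np
from stable_baselines import ACER, PPO1
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.common.evaluation import evaluate_policy

TIMESTEPS = [2500, 5000, 10000, 25000, 50000]   # training times compared
TRIALS = 10                                     # evaluation episodes per model

def run_cell(algo_cls, total_timesteps, trials=TRIALS):
    """Train one model and return its per-trial evaluation scores."""
    env = DummyVecEnv([lambda: gym.make('LunarLander-v2')])
    model = algo_cls('MlpPolicy', env, verbose=0)
    model.learn(total_timesteps=total_timesteps)
    scores = []
    for _ in range(trials):
        mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=1)
        scores.append(mean_reward)
    env.close()
    return np.array(scores)

# Example: one ACER model trained for 25000 timesteps
# acer_scores = run_cell(ACER, 25000)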
The ACER algorithm utilizes reinforcement learning techniques such as multiple workers, a replay buffer, Retrace for Q-value estimation, a trust region, and importance sampling. The PPO algorithm, on the other hand, only uses multiple workers and a trust region. As a result, ACER appears to provide a more accurate fit given larger amounts of training time, while PPO is more efficient to train and will perform better with less training time. It appears that, given a large enough training time, PPO begins to overfit, as it runs through too many training iterations. Overfitting occurs when a model fits a set of data too closely and cannot accurately fit newly introduced data. The data from this project, provided below, seems to support this.
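To make the importance-sampling component concrete, the snippet below is a minimal numerical sketch (an illustrative assumption, not taken from the project code or the Stable Baselines internals). ACER reweights replayed, off-policy transitions by the ratio of current-policy to behavior-policy probabilities, truncated at a constant so that old experience cannot dominate an update.

import numpy as np

def truncated_importance_weights(pi_probs, mu_probs, clip=10.0):
    """Ratio of current-policy to behavior-policy action probabilities, truncated at `clip`."""
    rho = np.asarray(pi_probs, dtype=float) / np.asarray(mu_probs, dtype=float)
    return np.minimum(rho, clip)

# Replayed actions the current policy now favors get weights above 1,
# but never more than the truncation constant.
print(truncated_importance_weights([0.6, 0.05, 0.3], [0.2, 0.5, 0.3]))   # [3.  0.1 1. ]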
Results:
A control score was taken by performing random actions in the game with no model applied. The average control score was -142.7155. The only trials that outperformed this average were some PPO trials with a training time of 5000 timesteps, some ACER trials with a training time of 25000 timesteps, and all ACER trials with a training time of 50000 timesteps. When ACER was used with a training time of 50000 timesteps, performance was significantly higher than any other combination. The upward trend in Figure 1 suggests that as training time increases, ACER performs better while PPO eventually performs worse. This is likely because of the greater time efficiency of PPO versus the greater accuracy of ACER. With the same training time, PPO performed many more training iterations than ACER, likely because it incorporates fewer reinforcement learning techniques. At a training time of 50000 timesteps, PPO performed 196 training iterations, while ACER performed only 25. As a result, PPO-trained models likely became overfit as training time increased, while ACER models only became more accurate.
Table 1: Average Lunar Lander scores for each trained model (ten trials per model) and for each algorithm/timestep combination.
Trial | Training Timesteps | Average Lunar Lander Score | Average Score for Algorithm and Timestep |
CONTROL SCORE | N/A | -142.7155 | N/A |
ACER 2500 SCORE1 | 2500 | -1012.414 | -652.39 |
ACER 2500 SCORE2 | 2500 | -613.9133 | |
ACER 2500 SCORE3 | 2500 | -330.83 | |
PPO 2500 SCORE1 | 2500 | -577.3163 | -495.99 |
PPO 2500 SCORE2 | 2500 | -479.4224 | |
PPO 2500 SCORE3 | 2500 | -431.2391 | |
ACER 5000 SCORE1 | 5000 | -597.1815 | -686.3 |
ACER 5000 SCORE2 | 5000 | -540.7575 | |
ACER 5000 SCORE3 | 5000 | -920.9756 | |
PPO 5000 SCORE1 | 5000 | -45.00707 | -406.83 |
PPO 5000 SCORE2 | 5000 | -988.2635 | |
PPO 5000 SCORE3 | 5000 | -187.2158 | |
ACER 10000 SCORE1 | 10000 | -609.51 | -1178 |
ACER 10000 SCORE2 | 10000 | -2341.538 | |
ACER 10000 SCORE3 | 10000 | -582.9739 | |
PPO 10000 SCORE1 | 10000 | -601.8603 | -626.77 |
PPO 10000 SCORE2 | 10000 | -902.265 | |
PPO 10000 SCORE3 | 10000 | -376.1861 | |
ACER 25000 SCORE1 | 25000 | -236.1057 | -250.68 |
ACER 25000 SCORE2 | 25000 | -469.1856 | |
ACER 25000 SCORE3 | 25000 | -46.73854 | |
PPO 25000 SCORE1 | 25000 | -1144.504 | -1084.9 |
PPO 25000 SCORE2 | 25000 | -1640.413 | |
PPO 25000 SCORE3 | 25000 | -469.7128 | |
ACER 50000 SCORE1 | 50000 | -19.22001 | 19.2498 |
ACER 50000 SCORE2 | 50000 | 2.376094 | |
ACER 50000 SCORE3 | 50000 | 74.59333 | |
PPO 50000 SCORE1 | 50000 | -329.8513 | -641.43 |
PPO 50000 SCORE2 | 50000 | -869.9752 | |
PPO 50000 SCORE3 | 50000 | -724.4752 | |
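The per-model averages in Table 1 can be reproduced from the CSV written at the end of the Code section. The sketch below is one possible post-processing step (it is not part of the original script) and assumes the file 'LunarLandingRL.csv' contains one column per trial series, as written by the code.

import pandas as pd

df = pd.read_csv('LunarLandingRL.csv', index_col=0)

# Average Lunar Lander score for each model (one column per model, one row per trial)
per_model_avg = df.mean()
print(per_model_avg)

# Average score for each algorithm/timestep pair, e.g. all three 'ACER 50000' models together
group_key = per_model_avg.index.str.rsplit(' ', n=1).str[0]
print(per_model_avg.groupby(group_key).mean())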
Code:
# Install Necessary Dependencies
!pip install tensorflow==1.15.0 tensorflow-gpu==1.15.0 stable_baselines gym box2d-py --user
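# Note: this project uses the original Stable Baselines (2.x), which requires TensorFlow 1.x;
# the newer stable-baselines3 package has a different API and does not include ACER.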
# Import Necessary Packages
import gym
from stable_baselines import ACER, PPO1
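# PPO1 is the MPI-based PPO implementation in Stable Baselines (it requires mpi4py);
# ACER supports discrete action spaces such as LunarLander-v2's.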
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.common.evaluation import evaluate_policy
import pandas as pd
import numpy as np
# Define Environment
environment_name = 'LunarLander-v2'
env = gym.make(environment_name)
# Test Environment and get a baseline for scores
trials = 10
control_scores = np.array([])
for trial in range(1, trials+1):
    state = env.reset()
    done = False
    score = 0
    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(trial, score))
    control_scores = np.append(control_scores, score)
env.close()
print(control_scores)
# Create the 25000 models
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env])
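# DummyVecEnv wraps the single environment in the vectorized interface expected by Stable Baselines models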
# We will train 3 models for each algorithm
# Train Models at 25000 for ACER
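# 'MlpPolicy' uses a multilayer perceptron policy network; verbose=1 prints training progress
# (including the training-iteration counts referenced in the Results section)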
acer_25000_model1 = ACER('MlpPolicy', env, verbose = 1)
acer_25000_model1.learn(total_timesteps=25000)
acer_25000_model2 = ACER('MlpPolicy', env, verbose = 1)
acer_25000_model2.learn(total_timesteps=25000)
acer_25000_model3 = ACER('MlpPolicy', env, verbose = 1)
acer_25000_model3.learn(total_timesteps=25000)
# Train Models at 25000 for PPO
ppo_25000_model1 = PPO1('MlpPolicy', env, verbose = 1)
ppo_25000_model1.learn(total_timesteps=25000)
ppo_25000_model2 = PPO1('MlpPolicy', env, verbose = 1)
ppo_25000_model2.learn(total_timesteps=25000)
ppo_25000_model3 = PPO1('MlpPolicy', env, verbose = 1)
ppo_25000_model3.learn(total_timesteps=25000)
# Run the 25000 models
trials = 10
acer_25000_scores1 = np.array([])
acer_25000_scores2 = np.array([])
acer_25000_scores3 = np.array([])
ppo_25000_scores1 = np.array([])
ppo_25000_scores2 = np.array([])
ppo_25000_scores3 = np.array([])
# ACER
for i in range(1, trials+1):
    # evaluate_policy returns (mean_reward, std_reward); keep only the mean episode reward
    acer_25000_reward, _ = evaluate_policy(acer_25000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    acer_25000_scores1 = np.append(acer_25000_scores1, acer_25000_reward)
    print(f'25000 ACER1 {i}')
for i in range(1, trials+1):
    acer_25000_reward, _ = evaluate_policy(acer_25000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    acer_25000_scores2 = np.append(acer_25000_scores2, acer_25000_reward)
    print(f'25000 ACER2 {i}')
for i in range(1, trials+1):
    acer_25000_reward, _ = evaluate_policy(acer_25000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    acer_25000_scores3 = np.append(acer_25000_scores3, acer_25000_reward)
    print(f'25000 ACER3 {i}')
# PPO
for i in range(1, trials+1):
    ppo_25000_reward, _ = evaluate_policy(ppo_25000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_25000_scores1 = np.append(ppo_25000_scores1, ppo_25000_reward)
    print(f'25000 PPO1 {i}')
for i in range(1, trials+1):
    ppo_25000_reward, _ = evaluate_policy(ppo_25000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_25000_scores2 = np.append(ppo_25000_scores2, ppo_25000_reward)
    print(f'25000 PPO2 {i}')
for i in range(1, trials+1):
    ppo_25000_reward, _ = evaluate_policy(ppo_25000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_25000_scores3 = np.append(ppo_25000_scores3, ppo_25000_reward)
    print(f'25000 PPO3 {i}')
# Save the 25000 Models
acer_25000_model1.save("acer_25000_model1")
acer_25000_model2.save("acer_25000_model2")
acer_25000_model3.save("acer_25000_model3")
ppo_25000_model1.save("ppo_25000_model1")
ppo_25000_model2.save("ppo_25000_model2")
ppo_25000_model3.save("ppo_25000_model3")
# Train the 50000 Models
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env])
# We will train 3 models for each algorithm
# Train Models at 50000 for ACER
acer_50000_model1 = ACER('MlpPolicy', env, verbose = 1)
acer_50000_model1.learn(total_timesteps=50000)
acer_50000_model2 = ACER('MlpPolicy', env, verbose = 1)
acer_50000_model2.learn(total_timesteps=50000)
acer_50000_model3 = ACER('MlpPolicy', env, verbose = 1)
acer_50000_model3.learn(total_timesteps=50000)
# Train Models at 50000 for PPO
ppo_50000_model1 = PPO1('MlpPolicy', env, verbose = 1)
ppo_50000_model1.learn(total_timesteps=50000)
ppo_50000_model2 = PPO1('MlpPolicy', env, verbose = 1)
ppo_50000_model2.learn(total_timesteps=50000)
ppo_50000_model3 = PPO1('MlpPolicy', env, verbose = 1)
ppo_50000_model3.learn(total_timesteps=50000)
# Run the 50000 models
trials = 10
acer_50000_scores1 = np.array([])
acer_50000_scores2 = np.array([])
acer_50000_scores3 = np.array([])
ppo_50000_scores1 = np.array([])
ppo_50000_scores2 = np.array([])
ppo_50000_scores3 = np.array([])
# ACER
for i in range(1, trials+1):
    acer_50000_reward, _ = evaluate_policy(acer_50000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    acer_50000_scores1 = np.append(acer_50000_scores1, acer_50000_reward)
    print(f'50000 ACER1 {i}')
for i in range(1, trials+1):
    acer_50000_reward, _ = evaluate_policy(acer_50000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    acer_50000_scores2 = np.append(acer_50000_scores2, acer_50000_reward)
    print(f'50000 ACER2 {i}')
for i in range(1, trials+1):
    acer_50000_reward, _ = evaluate_policy(acer_50000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    acer_50000_scores3 = np.append(acer_50000_scores3, acer_50000_reward)
    print(f'50000 ACER3 {i}')
# PPO
for i in range(1, trials+1):
    ppo_50000_reward, _ = evaluate_policy(ppo_50000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_50000_scores1 = np.append(ppo_50000_scores1, ppo_50000_reward)
    print(f'50000 PPO1 {i}')
for i in range(1, trials+1):
    ppo_50000_reward, _ = evaluate_policy(ppo_50000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_50000_scores2 = np.append(ppo_50000_scores2, ppo_50000_reward)
    print(f'50000 PPO2 {i}')
for i in range(1, trials+1):
    ppo_50000_reward, _ = evaluate_policy(ppo_50000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_50000_scores3 = np.append(ppo_50000_scores3, ppo_50000_reward)
    print(f'50000 PPO3 {i}')
# Save the 50000 models
acer_50000_model1.save("acer_50000_model1")
acer_50000_model2.save("acer_50000_model2")
acer_50000_model3.save("acer_50000_model3")
ppo_50000_model1.save("ppo_50000_model1")
ppo_50000_model2.save("ppo_50000_model2")
ppo_50000_model3.save("ppo_50000_model3")
# Train the 10000 Models
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env])
# We will train 3 models for each algorithm
# Train Models at 10000 for ACER
acer_10000_model1 = ACER('MlpPolicy', env, verbose = 1)
acer_10000_model1.learn(total_timesteps=10000)
acer_10000_model2 = ACER('MlpPolicy', env, verbose = 1)
acer_10000_model2.learn(total_timesteps=10000)
acer_10000_model3 = ACER('MlpPolicy', env, verbose = 1)
acer_10000_model3.learn(total_timesteps=10000)
# Train Models at 10000 for PPO
ppo_10000_model1 = PPO1('MlpPolicy', env, verbose = 1)
ppo_10000_model1.learn(total_timesteps=10000)
ppo_10000_model2 = PPO1('MlpPolicy', env, verbose = 1)
ppo_10000_model2.learn(total_timesteps=10000)
ppo_10000_model3 = PPO1('MlpPolicy', env, verbose = 1)
ppo_10000_model3.learn(total_timesteps=10000)
# Run the 10000 models
trials = 10
acer_10000_scores1 = np.array([])
acer_10000_scores2 = np.array([])
acer_10000_scores3 = np.array([])
ppo_10000_scores1 = np.array([])
ppo_10000_scores2 = np.array([])
ppo_10000_scores3 = np.array([])
# ACER
for i in range(1, trials+1):
    acer_10000_reward, _ = evaluate_policy(acer_10000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    acer_10000_scores1 = np.append(acer_10000_scores1, acer_10000_reward)
    print(f'10000 ACER1 {i}')
for i in range(1, trials+1):
    acer_10000_reward, _ = evaluate_policy(acer_10000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    acer_10000_scores2 = np.append(acer_10000_scores2, acer_10000_reward)
    print(f'10000 ACER2 {i}')
for i in range(1, trials+1):
    acer_10000_reward, _ = evaluate_policy(acer_10000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    acer_10000_scores3 = np.append(acer_10000_scores3, acer_10000_reward)
    print(f'10000 ACER3 {i}')
# PPO
for i in range(1, trials+1):
    ppo_10000_reward, _ = evaluate_policy(ppo_10000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_10000_scores1 = np.append(ppo_10000_scores1, ppo_10000_reward)
    print(f'10000 PPO1 {i}')
for i in range(1, trials+1):
    ppo_10000_reward, _ = evaluate_policy(ppo_10000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_10000_scores2 = np.append(ppo_10000_scores2, ppo_10000_reward)
    print(f'10000 PPO2 {i}')
for i in range(1, trials+1):
    ppo_10000_reward, _ = evaluate_policy(ppo_10000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_10000_scores3 = np.append(ppo_10000_scores3, ppo_10000_reward)
    print(f'10000 PPO3 {i}')
# Save the 10000 models
acer_10000_model1.save("acer_10000_model1")
acer_10000_model2.save("acer_10000_model2")
acer_10000_model3.save("acer_10000_model3")
ppo_10000_model1.save("ppo_10000_model1")
ppo_10000_model2.save("ppo_10000_model2")
ppo_10000_model3.save("ppo_10000_model3")
# Train the 5000 Models
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env])
# We will train 3 models for each algorithm
# Train Models at 5000 for ACER
acer_5000_model1 = ACER('MlpPolicy', env, verbose = 1)
acer_5000_model1.learn(total_timesteps=5000)
acer_5000_model2 = ACER('MlpPolicy', env, verbose = 1)
acer_5000_model2.learn(total_timesteps=5000)
acer_5000_model3 = ACER('MlpPolicy', env, verbose = 1)
acer_5000_model3.learn(total_timesteps=5000)
# Train Models at 5000 for PPO
ppo_5000_model1 = PPO1('MlpPolicy', env, verbose = 1)
ppo_5000_model1.learn(total_timesteps=5000)
ppo_5000_model2 = PPO1('MlpPolicy', env, verbose = 1)
ppo_5000_model2.learn(total_timesteps=5000)
ppo_5000_model3 = PPO1('MlpPolicy', env, verbose = 1)
ppo_5000_model3.learn(total_timesteps=5000)
# Run the 5000 models
#5000
trials = 10
acer_5000_scores1 = np.array([])
acer_5000_scores2 = np.array([])
acer_5000_scores3 = np.array([])
ppo_5000_scores1 = np.array([])
ppo_5000_scores2 = np.array([])
ppo_5000_scores3 = np.array([])
# ACER
for i in range(1, trials+1):
    acer_5000_reward, _ = evaluate_policy(acer_5000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    acer_5000_scores1 = np.append(acer_5000_scores1, acer_5000_reward)
    print(f'5000 ACER1 {i}')
for i in range(1, trials+1):
    acer_5000_reward, _ = evaluate_policy(acer_5000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    acer_5000_scores2 = np.append(acer_5000_scores2, acer_5000_reward)
    print(f'5000 ACER2 {i}')
for i in range(1, trials+1):
    acer_5000_reward, _ = evaluate_policy(acer_5000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    acer_5000_scores3 = np.append(acer_5000_scores3, acer_5000_reward)
    print(f'5000 ACER3 {i}')
# PPO
for i in range(1, trials+1):
    ppo_5000_reward, _ = evaluate_policy(ppo_5000_model1, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_5000_scores1 = np.append(ppo_5000_scores1, ppo_5000_reward)
    print(f'5000 PPO1 {i}')
for i in range(1, trials+1):
    ppo_5000_reward, _ = evaluate_policy(ppo_5000_model2, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_5000_scores2 = np.append(ppo_5000_scores2, ppo_5000_reward)
    print(f'5000 PPO2 {i}')
for i in range(1, trials+1):
    ppo_5000_reward, _ = evaluate_policy(ppo_5000_model3, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_5000_scores3 = np.append(ppo_5000_scores3, ppo_5000_reward)
    print(f'5000 PPO3 {i}')
# Save the 5000 models
acer_5000_model1.save("acer_5000_model1")
acer_5000_model2.save("acer_5000_model2")
acer_5000_model3.save("acer_5000_model3")
ppo_5000_model1.save("ppo_5000_model1")
ppo_5000_model2.save("ppo_5000_model2")
ppo_5000_model3.save("ppo_5000_model3")
# Train the 2500 Models
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env])
# We will train 3 models for each algorithm
# Train Models at 2500 for ACER
acer_2500_model1 = ACER('MlpPolicy', env, verbose = 1)
acer_2500_model1.learn(total_timesteps=2500)
acer_2500_model2 = ACER('MlpPolicy', env, verbose = 1)
acer_2500_model2.learn(total_timesteps=2500)
acer_2500_model3 = ACER('MlpPolicy', env, verbose = 1)
acer_2500_model3.learn(total_timesteps=2500)
# Train Models at 2500 for PPO
ppo_2500_model1 = PPO1('MlpPolicy', env, verbose = 1)
ppo_2500_model1.learn(total_timesteps=2500)
ppo_2500_model2 = PPO1('MlpPolicy', env, verbose = 1)
ppo_2500_model2.learn(total_timesteps=2500)
ppo_2500_model3 = PPO1('MlpPolicy', env, verbose = 1)
ppo_2500_model3.learn(total_timesteps=2500)
# Run the 2500 Models
#2500
trials = 10
acer_2500_scores1 = np.array([])
acer_2500_scores2 = np.array([])
acer_2500_scores3 = np.array([])
ppo_2500_scores1 = np.array([])
ppo_2500_scores2 = np.array([])
ppo_2500_scores3 = np.array([])
# ACER
for i in range(1, trials+1):
    acer_2500_reward, _ = evaluate_policy(acer_2500_model1, env, n_eval_episodes=1, render=True)
    env.close()
    acer_2500_scores1 = np.append(acer_2500_scores1, acer_2500_reward)
    print(f'2500 ACER1 {i}')
for i in range(1, trials+1):
    acer_2500_reward, _ = evaluate_policy(acer_2500_model2, env, n_eval_episodes=1, render=True)
    env.close()
    acer_2500_scores2 = np.append(acer_2500_scores2, acer_2500_reward)
    print(f'2500 ACER2 {i}')
for i in range(1, trials+1):
    acer_2500_reward, _ = evaluate_policy(acer_2500_model3, env, n_eval_episodes=1, render=True)
    env.close()
    acer_2500_scores3 = np.append(acer_2500_scores3, acer_2500_reward)
    print(f'2500 ACER3 {i}')
# PPO
for i in range(1, trials+1):
    ppo_2500_reward, _ = evaluate_policy(ppo_2500_model1, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_2500_scores1 = np.append(ppo_2500_scores1, ppo_2500_reward)
    print(f'2500 PPO1 {i}')
for i in range(1, trials+1):
    ppo_2500_reward, _ = evaluate_policy(ppo_2500_model2, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_2500_scores2 = np.append(ppo_2500_scores2, ppo_2500_reward)
    print(f'2500 PPO2 {i}')
for i in range(1, trials+1):
    ppo_2500_reward, _ = evaluate_policy(ppo_2500_model3, env, n_eval_episodes=1, render=True)
    env.close()
    ppo_2500_scores3 = np.append(ppo_2500_scores3, ppo_2500_reward)
    print(f'2500 PPO3 {i}')
# Create the dataframe
# Each score array (including control_scores) now holds one mean reward per trial,
# so no zero-padding of the control scores is needed before building the dataframe
print(len(control_scores))
d = {
'Control Score': control_scores,
'ACER 2500 SCORE1': acer_2500_scores1,
'ACER 2500 SCORE2': acer_2500_scores2,
'ACER 2500 SCORE3': acer_2500_scores3,
'PPO 2500 SCORE1': ppo_2500_scores1,
'PPO 2500 SCORE2': ppo_2500_scores2,
'PPO 2500 SCORE3': ppo_2500_scores3,
'ACER 5000 SCORE1': acer_5000_scores1,
'ACER 5000 SCORE2': acer_5000_scores2,
'ACER 5000 SCORE3': acer_5000_scores3,
'PPO 5000 SCORE1': ppo_5000_scores1,
'PPO 5000 SCORE2': ppo_5000_scores2,
'PPO 5000 SCORE3': ppo_5000_scores3,
'ACER 10000 SCORE1': acer_10000_scores1,
'ACER 10000 SCORE2': acer_10000_scores2,
'ACER 10000 SCORE3': acer_10000_scores3,
'PPO 10000 SCORE1': ppo_10000_scores1,
'PPO 10000 SCORE2': ppo_10000_scores2,
'PPO 10000 SCORE3': ppo_10000_scores3,
'ACER 25000 SCORE1': acer_25000_scores1,
'ACER 25000 SCORE2': acer_25000_scores2,
'ACER 25000 SCORE3': acer_25000_scores3,
'PPO 25000 SCORE1': ppo_25000_scores1,
'PPO 25000 SCORE2': ppo_25000_scores2,
'PPO 25000 SCORE3': ppo_25000_scores3,
'ACER 50000 SCORE1': acer_50000_scores1,
'ACER 50000 SCORE2': acer_50000_scores2,
'ACER 50000 SCORE3': acer_50000_scores3,
'PPO 50000 SCORE1': ppo_50000_scores1,
'PPO 50000 SCORE2': ppo_50000_scores2,
'PPO 50000 SCORE3': ppo_50000_scores3,
}
df = pd.DataFrame(data = d)
df.to_csv('LunarLandingRL.csv')