A DOOM flavored primer to reinforcement learning


In our previous post we went through the steps to set up our working environment, which included installing ViZDoom and Theano, as well as a simplified explanation of what deep learning is. In this blog post I will cover two topics: first, I will expand on reinforcement learning, explaining some of its more essential concepts; second, I will introduce Keras into your toolbox. I'm really excited to introduce Keras in this post, as the tool is rather useful for any deep learning project.

A Deeper look at Reinforcement Learning

In the previous post, I gave a rather simplified explanation of reinforcement learning, and while I don't want to make this blog series overly complicated, I feel the subject is worth exploring further. As I have mentioned before, I will try to keep these explanations as non-technical as possible (no promises though).
I consider there to be three separate parts to any reinforcement learning based network:
1: Image/sound pre-processing.
2: Perform action.
3: Score, learn, & repeat.
There is a lot more going on than these steps imply, but they are a good basis by which we can separate our network. For this explanation, let's assume our AI is being trained on images, and that we want to pre-process these images.
There are a few reasons why we may want to do this. We might want to decrease the size of our images, because bigger images mean slower training and higher memory consumption. We might want to make the image grayscale, because we might not need to know the difference between a green Kappa and a red Kappa. Another 'pre-process' we might take advantage of is utilizing a Convolutional Neural Network to extract and accentuate features from an image. Strictly speaking, passing your image through a feed-forward convnet is not necessary, but doing so can improve results while reducing the size of the representation and thus increasing training speed.
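The first two pre-processing steps above can be sketched in a few lines of plain NumPy. This is just an illustrative sketch, assuming frames arrive as H×W×3 uint8 arrays (as they do from most game environments); the function name and the averaging-based grayscale conversion are my own choices, not part of any library:

```python
import numpy as np

def preprocess(frame, scale=2):
    """Convert an RGB frame (H x W x 3) to grayscale and downsample it.

    `scale` is the downsampling factor; 2 halves each dimension.
    """
    gray = frame.mean(axis=2)               # average the color channels
    small = gray[::scale, ::scale]          # keep every `scale`-th pixel
    return small.astype(np.float32) / 255.0  # normalize to [0, 1]

# A fake 160x120 RGB frame, matching the resolution we use later
frame = np.random.randint(0, 256, size=(120, 160, 3), dtype=np.uint8)
state = preprocess(frame)
print(state.shape)  # (60, 80)
```

Halving each dimension already cuts the pixel count by a factor of four, and dropping the color channels cuts it by another factor of three.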
After step 1 (image pre-processing), the next step involves the creation of our Q-Learning algorithm. Q-Learning algorithms come in all shapes and sizes, yet all variations are built on the same basic principle.

(1)       Q(s,a) = reward + Ɣ·max_a' Q(s',a')

Equation (1) simply states that the maximum expected reward for performing action (a) in the current state (s) is equal to the immediate reward plus the discounted maximum expected reward of the following state (s'). The discount factor Ɣ, between 0 and 1, controls how heavily future rewards count toward the present. This is the backbone of our AI, and as our model becomes more capable of predicting the output of that function, we can start expecting more appropriate behaviors as well.
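To make equation (1) concrete, here is a minimal tabular sketch of the update it implies. The table shape, the learning rate α, and the function name are my own illustrative choices; in the deep variant the table is replaced by a neural network:

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, gamma=0.99, alpha=0.1):
    """One Q-learning update: nudge Q[s, a] toward
    reward + gamma * max_a' Q[s_next, a']."""
    target = reward + gamma * np.max(Q[s_next])  # right-hand side of eq. (1)
    Q[s, a] += alpha * (target - Q[s, a])        # move a fraction alpha toward it
    return Q

Q = np.zeros((5, 3))  # 5 states, 3 actions, all estimates start at zero
Q = q_update(Q, s=0, a=1, reward=1.0, s_next=2)
print(Q[0, 1])  # 0.1
```

Repeating this update over many experienced transitions is what gradually turns the table (or network) into a good predictor of equation (1).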

One thing we always try to avoid while training any kind of network is having our model converge into a local minimum. A local minimum will often provide our network with rather consistent results, but what the network has learned will not always be optimal, and it will often associate unrelated features and tasks. On the other hand, a properly trained network arrives at a global minimum. When our model reaches this stage, we can assume it has grasped the notion of the problem and how to solve it. Think of our robot learning to shoot monsters, but doing so with very little accuracy, simply emptying its clip. Sure, it gets some kills, but it does not account for its limited supply of bullets. I recommend Andrew Ng's machine learning course on Coursera for a rather intuitive explanation of local and global minima.
There are a few things we can do to help our network converge toward a proper global minimum. For one, we have 'Replay Memory'. Replay Memory saves our network's past experiences as (state, action, reward, next state) transitions, and trains on random samples from this store rather than only on the most recent experience. This mixes older experiences back into the learning, which makes training slower but helps prevent the network from converging into a bad configuration.
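The idea behind Replay Memory fits in a few lines. This is a minimal sketch, not keras-rl's actual implementation (keras-rl provides `SequentialMemory`, which we use later); the class and method names here are my own:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of past (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop off automatically

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Train on a random mix of old and new experiences,
        # instead of only the latest one
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=1000)
for step in range(10):
    memory.store((step, 0, 1.0, step + 1))  # dummy transitions
batch = memory.sample(4)
print(len(batch))  # 4
```

Sampling at random also breaks up the strong correlation between consecutive game frames, which tends to make training more stable.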
Another common method is to occasionally forgo the saved weights of our network and perform a random action instead at random intervals during training. At first glance this may seem a bit counterproductive. We do this because we don't want our network to assume that the 'path' it has chosen is the ultimate approach to the problem. Adding these random actions is called an 'exploration phase'. No, in this phase our AI does not go to college to try and find itself; instead it attempts to see if there are any alternate approaches to the problem which might perform better than its current one. When building an exploration phase we essentially create a control constant 'ε' (epsilon) ranging between 0 and 1, and this acts as a gate for our random 'exploration' actions.
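This gating scheme is commonly called ε-greedy action selection, and it can be sketched like so (the function name and the list-based Q-values are my own illustrative choices):

```python
import random

def choose_action(q_values, epsilon):
    """Epsilon-greedy selection: with probability `epsilon` take a random
    action (explore), otherwise take the best known one (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

q_values = [0.1, 0.9, 0.3]
print(choose_action(q_values, epsilon=0.0))  # 1 (always greedy)
```

In practice ε usually starts near 1 (mostly random actions) and is decayed toward a small value as the network's estimates become trustworthy.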
Let's see what this looks like in a Keras-based Deep Q-Network (DQN). The point of Keras is to simplify and abstract the deep learning process, so first let's download three new libraries. To start, let's install OpenAI Gym, a nifty tool with a huge repertoire of game environments we can use.

$ pip install gym[all]

We should also install the ViZDoom OpenAI Gym environments.

$ pip install ppaquette-gym-doom

Next, let's go ahead and install Keras-rl. Keras-rl is a reinforcement learning module for the already fantastic Keras library, and it integrates easily with the OpenAI Gym environments.

$ git clone https://github.com/matthiasplappert/keras-rl.git
$ cd keras-rl
$ python setup.py install
$ pip install h5py

Building our DQN model using Keras and Keras-rl is actually fairly straightforward now.

import numpy as np  
import gym  
import ppaquette_gym_doom  
from keras.models import Sequential  
from keras.layers import Dense, Activation, Flatten  
from keras.optimizers import Adam  
from rl.agents.dqn import DQNAgent  
from rl.policy import BoltzmannQPolicy  
from rl.memory import SequentialMemory  
from ppaquette_gym_doom import wrappers, DoomEnv  

Let's now set up our 'environment' for training.

env = gym.make('ppaquette/DoomBasic-v0')  
wrapper = wrappers.SetResolution('160x120')  
env = wrapper(env)  
wrapper = wrappers.ToDiscrete('minimal')  
env = wrapper(env)

nb_actions = env.action_space.n

model = Sequential()  
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))  
model.add(Dense(16, activation='relu'))  
model.add(Dense(nb_actions, activation='linear'))  

Above is our network definition; now we can set up the agent and compile it.

memory = SequentialMemory(limit=50000, window_length=1)  
policy = BoltzmannQPolicy()  
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,  
               nb_steps_warmup=10, target_model_update=1e-2, policy=policy)  
dqn.compile(Adam(lr=1e-3), metrics=['mae'])  
# train it
dqn.fit(env, nb_steps=50000, visualize=True, verbose=2)  
# save it
dqn.save_weights('doom_agent_dqn_{}_weights.h5f'.format('ppaquette-DoomBasic-v0'), overwrite=True)  

And now we test.

dqn.test(env, nb_episodes=10, visualize=True)


As you can see, this is an awesomely simple implementation of rather complicated tech. Sadly, it gives us very little insight into, or control over, our training process. But of course simplicity is a blessing, particularly when we are testing different network architectures. What Keras-rl did for us is abstract away things such as the Q-Learning algorithm, Replay Memory, etc., while leaving us responsible for the network architecture. We can of course work on a more complete implementation, but we do indeed lose a lot of the simplicity. For reference check out this repo, which has a fairly complete DQN implemented using Keras.

Armed with this information you should now be able to play around with Keras, try out other scenarios in OpenAI gym as well as attempt other deep learning projects. In my next post, we'll dive right into the fun stuff of training our DOOM bot! Make sure to subscribe to our blog if you want me to alert you when the next post comes out.


Written by
Abel Castilla 01 Mar 2017

Classically trained physicist and programmer with a passion for everything AI (also my favorite movie).

