
Reinforcement Learning

Reinforcement Learning (RL) is an area of machine learning in which an agent learns how to behave in an environment by performing actions and observing the rewards it receives. Today, we're going to implement a simple reinforcement learning algorithm in PyTorch and train an agent to play a game.

Getting Started with Gym

Before we dive into the code, we need a way to simulate environments. That's where Gym comes in. Gym is a toolkit for developing and comparing reinforcement learning algorithms. It provides a wide variety of environments which we can plug into our code and start testing.

pip install gym

The Reinforcement Learning Process

The process of RL can be simplified to these steps (the minimal loop after the list shows them in code):

  1. Observation: The agent observes the environment.
  2. Deciding: Based on its observation, the agent decides on an action.
  3. Action: The action has an effect on the environment.
  4. Reward/Penalty: Based on the new state of the environment after the action, the agent receives a reward or penalty.
  5. Learning: The agent learns from its actions and optimizes its future actions to get the most reward.
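
To make these steps concrete, here is a minimal sketch of one episode of that loop, using Gym's CartPole environment and a purely random agent (no learning yet). It assumes the classic Gym API, where env.reset() returns only the observation and env.step() returns four values.

import gym

env = gym.make('CartPole-v1')

state = env.reset()                             # 1. Observation
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()          # 2. Deciding (randomly here)
    state, reward, done, _ = env.step(action)   # 3. Action and 4. Reward/Penalty
    total_reward += reward                      # 5. Learning would use this feedback

print(f"Episode finished with total reward {total_reward}")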

The Q-Learning Algorithm

We'll be using a simple RL algorithm called Q-Learning, which learns a policy that tells an agent what action to take under what circumstances.
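
At the heart of Q-Learning is a simple update rule: the estimated value Q(s, a) of taking action a in state s is nudged toward the observed reward plus the discounted value of the best action in the next state. Here is a minimal tabular sketch of that rule; the state/action counts, alpha, and gamma below are illustrative assumptions, not part of our CartPole setup.

import numpy as np

# Hypothetical tabular example: 10 states, 2 actions
Q = np.zeros((10, 2))
alpha, gamma = 0.1, 0.99   # learning rate and discount factor (assumed values)

def q_update(state, action, reward, next_state):
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    best_next = np.max(Q[next_state])
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])

CartPole's state is continuous, so instead of a lookup table we'll approximate Q with a neural network below.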

Setting Up Our Environment

First, let's import the necessary libraries and create our environment. (This tutorial uses the classic Gym API, where env.reset() returns only the observation and env.step() returns four values; gym 0.26+ and Gymnasium changed these signatures.)

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Create the CartPole game environment
env = gym.make('CartPole-v1')
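
As a quick, optional sanity check, we can inspect the environment's spaces: CartPole's observation is a 4-dimensional vector and there are 2 discrete actions, which is why the network below maps 4 inputs to 2 outputs.

print(env.observation_space.shape)  # (4,) - cart position/velocity, pole angle/velocity
print(env.action_space.n)           # 2   - push the cart left or right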

Building Our Neural Network

We'll use a simple feed-forward neural network with two layers.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(4, 64)   # 4 state features -> 64 hidden units
        self.fc2 = nn.Linear(64, 2)   # 64 hidden units -> one value per action

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = self.fc2(x)
        return x
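
As a quick usage example (just a shape check, not part of training), passing a single state through the network yields one value per action:

net = Net()
dummy_state = torch.zeros(4)    # a stand-in CartPole observation
q_values = net(dummy_state)     # tensor of shape (2,): one value per action
print(q_values)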

Training the Network

Now we'll define our training function. It uses an epsilon-greedy strategy: with probability epsilon the agent takes a random action (exploration), and otherwise it picks the action with the highest predicted Q-value (exploitation).

def train(net, num_episodes=500, learning_rate=1e-3, gamma=0.99, epsilon=0.1):
    optimizer = optim.Adam(net.parameters(), lr=learning_rate)
    criterion = nn.MSELoss()

    for episode in range(num_episodes):
        state = env.reset()
        for t in range(10000):  # Don't infinite loop while learning
            state_t = torch.tensor(state, dtype=torch.float32)
            prediction = net(state_t)

            # Epsilon-greedy: explore with probability epsilon, otherwise exploit
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = torch.argmax(prediction).item()

            # Step through environment using chosen action
            next_state, reward, done, _ = env.step(action)

            if done:
                # Penalize the agent when the episode ends (pole fell or cart left the track)
                reward = -200
            else:
                # Bootstrap: reward plus discounted value of the best next action
                next_state_t = torch.tensor(next_state, dtype=torch.float32)
                reward = reward + gamma * torch.max(net(next_state_t)).item()

            # Q-learning target: only the chosen action's value changes
            target = prediction.clone().detach()
            target[action] = reward

            loss = criterion(prediction, target)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            state = next_state

            if done:
                break

Running the Training

Finally, let's create an instance of our network and train it.

net = Net()
train(net)

And that's it! Our network is now trained to play the CartPole game. You can play around with the hyperparameters and the architecture of the network to see how they affect performance.
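
To see how well the agent actually does, here is a minimal evaluation sketch (assuming the net and env defined above, and the same classic Gym API) that runs a few greedy episodes and reports the total reward per episode. CartPole-v1 caps an episode at 500 steps, so higher totals mean longer balancing.

for episode in range(5):
    state = env.reset()
    total_reward = 0
    done = False
    while not done:
        with torch.no_grad():
            # Always pick the action with the highest predicted value (no exploration)
            action = torch.argmax(net(torch.tensor(state, dtype=torch.float32))).item()
        state, reward, done, _ = env.step(action)
        total_reward += reward
    print(f"Episode {episode}: total reward = {total_reward}")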

Remember, reinforcement learning can be a complex topic, but once you have the basics down, you can start building agents that tackle much harder tasks. Happy coding!