Something wrong with Keras code Q-learning OpenAI gym FrozenLake

2024/9/21 7:14:54

Maybe my question will seem stupid.

I'm studying the Q-learning algorithm. To understand it better, I'm trying to port the TensorFlow code of this FrozenLake example to Keras.

My code:

import gym
import numpy as np
import random
from keras.layers import Dense
from keras.models import Sequential
from keras import backend as K

import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('FrozenLake-v0')

model = Sequential()
model.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(16,)))
model.add(Dense(4, activation='softmax', kernel_initializer='uniform'))

def custom_loss(yTrue, yPred):
    return K.sum(K.square(yTrue - yPred))

model.compile(loss=custom_loss, optimizer='sgd')

# Set learning parameters
y = .99
e = 0.1
#create lists to contain total rewards and steps per episode
jList = []
rList = []

num_episodes = 2000
for i in range(num_episodes):
    current_state = env.reset()
    rAll = 0
    d = False
    j = 0
    while j < 99:
        j+=1
        current_state_Q_values = model.predict(np.identity(16)[current_state:current_state+1], batch_size=1)
        action = np.reshape(np.argmax(current_state_Q_values), (1,))

        if np.random.rand(1) < e:
            action[0] = env.action_space.sample() #random action

        new_state, reward, d, _ = env.step(action[0])

        rAll += reward
        jList.append(j)
        rList.append(rAll)
        new_Qs = model.predict(np.identity(16)[new_state:new_state+1], batch_size=1)
        max_newQ = np.max(new_Qs)

        targetQ = current_state_Q_values
        targetQ[0,action[0]] = reward + y*max_newQ
        model.fit(np.identity(16)[current_state:current_state+1], targetQ, verbose=0, batch_size=1)
        current_state = new_state
        if d == True:
            #Reduce chance of random action as we train the model.
            e = 1./((i/50) + 10)
            break
print("Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%")

When I run it, it doesn't work well: Percent of succesful episodes: 0.052%



The original TensorFlow code does much better: Percent of succesful episodes: 0.352%



What have I done wrong?


Besides setting use_bias=False, as @Maldus mentioned in the comments, another thing you can try is starting with a higher epsilon value (e.g. 0.5 or 0.75). A useful trick is to decrease epsilon only if you reach the goal, i.e. don't decrease it at the end of every episode. That way your player keeps exploring the map randomly until it starts to converge on a good route, and only then does it make sense to reduce epsilon.
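To make the epsilon idea concrete, here is a rough sketch in plain Python, with no Keras or gym needed. The helper names (choose_action, decay_on_success), the decay factor 0.99, the floor 0.01 and the list of episode outcomes are all assumptions made up for illustration; they are not taken from the question's code or from my gist. In FrozenLake you would treat an episode as "reached the goal" when its final reward is 1.

```python
import random

def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay_on_success(epsilon, reached_goal, decay=0.99, min_epsilon=0.01):
    """Shrink epsilon only after an episode that reached the goal.

    decay and min_epsilon are illustrative values, not tuned ones.
    """
    if reached_goal:
        return max(min_epsilon, epsilon * decay)
    return epsilon

epsilon = 0.5  # higher starting value, as suggested above

# Hypothetical episode outcomes (True = goal reached), for illustration only.
outcomes = [False, False, True, False, True]
history = []
for reached_goal in outcomes:
    epsilon = decay_on_success(epsilon, reached_goal)
    history.append(epsilon)
```

In the question's loop, the call to decay_on_success would replace the unconditional `e = 1./((i/50) + 10)` update, so failed episodes leave the exploration rate untouched.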

I've actually implemented a similar model in Keras in this gist, using convolutional layers instead of Dense layers. I managed to get it to work in under 2000 episodes. Might be of some help to others :)

