Maybe my question will seem stupid.
I'm studying the Q-learning algorithm. To understand it better, I'm trying to rewrite the TensorFlow code from this FrozenLake example in Keras.
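For reference, the model in the original example is (roughly, from memory, so details may differ) just a single linear layer that maps a one-hot encoded state to four Q-values, trained with a sum-of-squares loss and plain gradient descent:

import tensorflow as tf

tf.reset_default_graph()

# One-hot encoded state in, one Q-value per action out (no hidden layer, no activation)
inputs1 = tf.placeholder(shape=[1, 16], dtype=tf.float32)
W = tf.Variable(tf.random_uniform([16, 4], 0, 0.01))
Qout = tf.matmul(inputs1, W)
predict = tf.argmax(Qout, 1)

# Sum-of-squares loss against the target Q-values, minimized with plain SGD
nextQ = tf.placeholder(shape=[1, 4], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)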
My code:
import gym
import numpy as np
import random

from keras.layers import Dense
from keras.models import Sequential
from keras import backend as K
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('FrozenLake-v0')

model = Sequential()
model.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(16,)))
model.add(Dense(4, activation='softmax', kernel_initializer='uniform'))

def custom_loss(yTrue, yPred):
    return K.sum(K.square(yTrue - yPred))

model.compile(loss=custom_loss, optimizer='sgd')

# Set learning parameters
y = .99
e = 0.1
#create lists to contain total rewards and steps per episode
jList = []
rList = []

num_episodes = 2000
for i in range(num_episodes):
    current_state = env.reset()
    rAll = 0
    d = False
    j = 0
    while j < 99:
        j += 1
        current_state_Q_values = model.predict(np.identity(16)[current_state:current_state+1], batch_size=1)
        action = np.reshape(np.argmax(current_state_Q_values), (1,))
        if np.random.rand(1) < e:
            action[0] = env.action_space.sample()  # random action
        new_state, reward, d, _ = env.step(action[0])
        rAll += reward
        jList.append(j)
        rList.append(rAll)
        new_Qs = model.predict(np.identity(16)[new_state:new_state+1], batch_size=1)
        max_newQ = np.max(new_Qs)
        targetQ = current_state_Q_values
        targetQ[0, action[0]] = reward + y*max_newQ
        model.fit(np.identity(16)[current_state:current_state+1], targetQ, verbose=0, batch_size=1)
        current_state = new_state
        if d == True:
            # Reduce chance of random action as we train the model.
            e = 1./((i/50) + 10)
            break
print("Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%")
When I run it, it performs poorly:

Percent of succesful episodes: 0.052%
plt.plot(rList)
The original TensorFlow code does much better:

Percent of succesful episodes: 0.352%
plt.plot(rList)
What have I done wrong?