Maybe my question will seem stupid.
I'm studying the Q-learning algorithm. To understand it better, I'm trying to rewrite the TensorFlow code from this FrozenLake example in Keras.
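For reference, the model in the original example is (roughly, from memory, so details may differ) just a single linear layer that maps a one-hot encoded state to four Q-values, trained with a sum-of-squares loss and plain gradient descent:

import tensorflow as tf

tf.reset_default_graph()

# One-hot encoded state in, one Q-value per action out (no hidden layer, no activation)
inputs1 = tf.placeholder(shape=[1, 16], dtype=tf.float32)
W = tf.Variable(tf.random_uniform([16, 4], 0, 0.01))
Qout = tf.matmul(inputs1, W)
predict = tf.argmax(Qout, 1)

# Sum-of-squares loss against the target Q-values, minimized with plain SGD
nextQ = tf.placeholder(shape=[1, 4], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)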
My code:
import gym
import numpy as np
import random

from keras.layers import Dense
from keras.models import Sequential
from keras import backend as K
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('FrozenLake-v0')

model = Sequential()
model.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(16,)))
model.add(Dense(4, activation='softmax', kernel_initializer='uniform'))

def custom_loss(yTrue, yPred):
    return K.sum(K.square(yTrue - yPred))

model.compile(loss=custom_loss, optimizer='sgd')

# Set learning parameters
y = .99
e = 0.1
#create lists to contain total rewards and steps per episode
jList = []
rList = []

num_episodes = 2000
for i in range(num_episodes):
    current_state = env.reset()
    rAll = 0
    d = False
    j = 0
    while j < 99:
        j += 1
        current_state_Q_values = model.predict(np.identity(16)[current_state:current_state+1], batch_size=1)
        action = np.reshape(np.argmax(current_state_Q_values), (1,))
        if np.random.rand(1) < e:
            action[0] = env.action_space.sample()  # random action
        new_state, reward, d, _ = env.step(action[0])
        rAll += reward
        jList.append(j)
        rList.append(rAll)
        new_Qs = model.predict(np.identity(16)[new_state:new_state+1], batch_size=1)
        max_newQ = np.max(new_Qs)
        targetQ = current_state_Q_values
        targetQ[0, action[0]] = reward + y*max_newQ
        model.fit(np.identity(16)[current_state:current_state+1], targetQ, verbose=0, batch_size=1)
        current_state = new_state
        if d == True:
            # Reduce chance of random action as we train the model.
            e = 1./((i/50) + 10)
            break
print("Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%")
When I run it, it performs poorly:

Percent of succesful episodes: 0.052%
plt.plot(rList)
The original TensorFlow code does much better:

Percent of succesful episodes: 0.352%
plt.plot(rList)
What have I done wrong?