I started with a simple implementation of single-variable linear regression with gradient descent, but I don't know how to extend it to a multivariate stochastic gradient descent algorithm.
Single-variable linear regression:
import tensorflow as tf
import numpy as np

# Create random data
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data * 0.5

# Find values for W that compute y_data = W * x_data
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
y = W * x_data

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# Before starting, initialize the variables and launch the graph.
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

# Fit the line.
for step in xrange(2001):
    sess.run(train)
    if step % 200 == 0:
        print(step, sess.run(W))
There are two parts to your question:
- How to extend this problem to a higher-dimensional space.
- How to go from batch gradient descent to stochastic gradient descent.
To get a higher-dimensional setting, you can define your linear problem as y = <x, w>. Then you just need to change the dimension of your Variable W to match that of w, and replace the multiplication W * x_data by the matrix product tf.matmul(x_data, W); your code should then run just fine.
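To make that first change concrete before introducing mini-batches, here is a minimal sketch of the full-batch version in higher dimension. The values of d, w and the sample size are illustrative assumptions, not something fixed by your problem:

import tensorflow as tf
import numpy as np

# Higher-dimensional, full-batch version: W becomes a [d, 1] matrix and the
# elementwise product is replaced by tf.matmul. d, w and the sample size are
# illustrative choices.
d = 10
x_data = np.random.random((100, d)).astype(np.float32)
w = 0.5 * np.ones((d, 1), dtype=np.float32)
y_data = x_data.dot(w)                    # shape (100, 1)

W = tf.Variable(tf.random_uniform([d, 1], -1.0, 1.0))
y = tf.matmul(x_data, W)                  # <x, W> for every row of x_data
loss = tf.reduce_mean(tf.square(y - y_data))
train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)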
To change the learning method to stochastic gradient descent, you need to abstract the input of your cost function by using tf.placeholder.
Once you have defined X and y_ to hold your input at each step, you can construct the same cost function. Then you run your training step by feeding it the proper mini-batch of your data.
Here is an example of how you could implement such behavior; it should show that W quickly converges to w.
import tensorflow as tf
import numpy as np

# Define dimensions
d = 10      # Size of the parameter space
N = 1000    # Number of data samples

# Create random data
w = .5 * np.ones(d)
x_data = np.random.random((N, d)).astype(np.float32)
y_data = x_data.dot(w).reshape((-1, 1))

# Define placeholders to feed mini-batches
X = tf.placeholder(tf.float32, shape=[None, d], name='X')
y_ = tf.placeholder(tf.float32, shape=[None, 1], name='y')

# Find values for W that compute y_data = <x, W>
W = tf.Variable(tf.random_uniform([d, 1], -1.0, 1.0))
y = tf.matmul(X, W, name='y_pred')

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y_ - y))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# Before starting, initialize the variables and launch the graph.
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

# Fit the line, one mini-batch per step.
mini_batch_size = 100
n_batch = N // mini_batch_size + (N % mini_batch_size != 0)
for step in range(2001):
    i_batch = (step % n_batch) * mini_batch_size
    batch = x_data[i_batch:i_batch + mini_batch_size], y_data[i_batch:i_batch + mini_batch_size]
    sess.run(train, feed_dict={X: batch[0], y_: batch[1]})
    if step % 200 == 0:
        print(step, sess.run(W))
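If you want a quick numerical check of that convergence, something like the following could be appended after the training loop. W_learned is just an illustrative name, not part of the code above:

# Compare the learned weights to the true w used to generate the data.
W_learned = sess.run(W).ravel()
print(np.max(np.abs(W_learned - w)))    # should shrink toward 0 during training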
Two side notes:
The implementation above is called mini-batch gradient descent because, at each step, the gradient is computed using a subset of the data of size mini_batch_size. This is a variant of stochastic gradient descent that is usually used to stabilize the estimation of the gradient at each step. Plain stochastic gradient descent is obtained by setting mini_batch_size = 1.
The dataset can be shuffled at every epoch to get an implementation closer to the theoretical setting. Some recent work also considers making only a single pass through the dataset, as it prevents over-fitting. For a more mathematical and detailed explanation, you can see Bottou12. This can easily be changed according to your problem setup and the statistical properties you are looking for.
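As a rough sketch of that shuffling idea, reusing x_data, y_data, N, mini_batch_size, sess, train, X and y_ from the example above (n_epochs is an arbitrary illustrative value):

# Shuffle the dataset at the start of every epoch before feeding mini-batches.
n_epochs = 20
for epoch in range(n_epochs):
    perm = np.random.permutation(N)              # new random order each epoch
    x_shuffled, y_shuffled = x_data[perm], y_data[perm]
    for i in range(0, N, mini_batch_size):
        sess.run(train, feed_dict={X: x_shuffled[i:i + mini_batch_size],
                                   y_: y_shuffled[i:i + mini_batch_size]})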