Preamble
I do not have a lot of experience with Python or its development best practices, so I hope to learn more and have been reading up on the subject.
Aside
Coming from a JavaScript background, I find Python, the language itself, amazing, but its package management system, compared to npm, very poor. Pip and virtual environments feel quite convoluted.
Using Matrices for More Efficient Stochastic Gradient Descent
Michael Nielsen intentionally wrote his code to be slow in order to show the power of matrices and numpy. As written, it does not take advantage of numpy's matrix operations, which are implemented to be very fast (to my knowledge, numpy's core is a C binding). The performance gains were very noticeable for me, and could be improved much more by trying cupy, a numpy-like interface that uses NVIDIA's CUDA tools to speed things up by offloading matrix operations to the GPU.
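To get a feel for the difference, here is a minimal sketch (the layer sizes and variable names are made up, not taken from Nielsen's code) comparing a per-image loop against a single matrix product over a whole batch. With cupy installed, swapping the numpy import for cupy should, in principle, run the same operations on the GPU.

import time
import numpy as np

W = np.random.randn(30, 784)    # weights for one layer: 30 nodes, 784 inputs
X = np.random.randn(784, 1000)  # 1000 input images, one per column

# slow: multiply each 784 x 1 column vector separately
start = time.perf_counter()
loop_result = np.hstack([np.dot(W, X[:, i:i + 1]) for i in range(X.shape[1])])
loop_time = time.perf_counter() - start

# fast: one matrix-matrix product handles every image at once
start = time.perf_counter()
matrix_result = np.dot(W, X)
matrix_time = time.perf_counter() - start

print(np.allclose(loop_result, matrix_result))  # True, same numbers
print(loop_time, matrix_time)                   # the loop is noticeably slower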
Here is an example image of a subset of nodes in the network:
Given a mini batch with m images, stack the input vectors side by side, one column per image:

    X = [ x(1)  x(2)  ...  x(m) ]

Then a single matrix product computes a layer's weighted inputs for the whole mini batch:

    Z = WX + B

where B simply repeats the bias column b once for every image. By organizing your matrices in this pattern you can simultaneously compute all the z's for every image in the stochastic mini batch, instead of iterating over each image and computing them individually.
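As a small sketch of that layout (the 784 and 30 match the MNIST network further down; the variable names are mine), each image's input vector becomes one column of a matrix, and a single np.dot plus a broadcast of the bias column gives the weighted inputs for every image at once:

import numpy as np

m = 10                                             # mini batch size
W = np.random.randn(30, 784)                       # weights: 30 nodes, 784 inputs
b = np.random.randn(30, 1)                         # one bias column for the layer
xs = [np.random.randn(784, 1) for _ in range(m)]   # one input vector per image

X = np.hstack(xs)         # shape (784, m): image i is column i
Z = np.dot(W, X) + b      # broadcasting adds b to every column
print(Z.shape)            # (30, m): column i holds the z's for image i

# the same numbers as the slow per-image version
Z_loop = np.hstack([np.dot(W, x) + b for x in xs])
print(np.allclose(Z, Z_loop))  # True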
The Code
The code is based on Michael Nielsen's, from his tutorial. I tried to write it on my own, but sometimes had to look at his for reference. I had trouble implementing evaluate and realized that I was not using argmax.
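For anyone who hits the same snag: feedforward returns a 10 by 1 column of output activations for a test image, and the predicted digit is the index of the largest one, which is exactly what np.argmax gives. A tiny sketch with made-up activation values:

import numpy as np

# pretend output of feedforward for one test image: one activation per digit 0-9
output = np.array([[0.02], [0.01], [0.05], [0.90], [0.03],
                   [0.01], [0.02], [0.04], [0.01], [0.06]])

prediction = np.argmax(output)  # index of the largest activation
print(prediction)               # 3, i.e. the network's guess is the digit 3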
Here's my code on GitHub.
Please note that the comments are specific to training the network to recognize digits from the MNIST dataset, but the Network class could be used elsewhere.
# Standard library
import random

# Third-party libraries
import numpy as np


class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[0:-1], sizes[1:])]
        # create biases (x by 1) for layer 1 to last layer
        self.biases = [np.random.randn(x, 1) for x in sizes[1:]]

    # a = input vector
    def feedforward(self, a):
        # for every layer
        for w, b in zip(self.weights, self.biases):
            a = sigmoid(np.dot(w, a) + b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
        if test_data:
            n_test = len(test_data)
        n = len(training_data)
        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k + mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print("Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test))
            else:
                print("Epoch {0} complete".format(j))

    def update_mini_batch(self, mini_batch, eta):
        mini_batch_size = len(mini_batch)

        # shape (input layer size, number of images): image i is column i
        xs = np.array([x for x, y in mini_batch]).transpose().reshape(
            self.sizes[0], mini_batch_size)
        # shape (output layer size, number of images): expected outputs
        ys = np.array([y for x, y in mini_batch]).transpose().reshape(
            self.sizes[-1], mini_batch_size)

        nabla_weight, nabla_bias = self.backprop(xs, ys, mini_batch_size)

        # each nabla_bias[layer] is a matrix with one column of bias
        # gradients per image in the mini_batch. We must flatten them
        for layer in range(0, len(nabla_bias)):
            # sum along the rows
            biases = nabla_bias[layer].sum(axis=1)
            bias_count = biases.shape[0]
            # restructure back to node count x 1
            nabla_bias[layer] = biases.reshape((bias_count, 1))

        # move the weights and biases in the opposite direction (down the
        # hill) of the gradient of the cost
        # there might be a better way to handle this with numpy
        self.weights = [w - (eta / len(mini_batch)) * dnw
                        for dnw, w in zip(nabla_weight, self.weights)]
        self.biases = [b - (eta / len(mini_batch)) * dnb
                       for dnb, b in zip(nabla_bias, self.biases)]

    def backprop(self, xs, ys, mini_batch_size):
        # feed forward
        activation = xs
        activations = [xs]
        zs = []

        for w, b in zip(self.weights, self.biases):
            # bs = [b, b, b, ...] one copy of the bias column for every
            # image in the mini_batch
            bs = np.tile(b, (1, mini_batch_size))
            z = np.dot(w, activation) + bs
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)

        # calculate the error for the last layer
        nabla_bias = [np.zeros(b.shape) for b in self.biases]
        nabla_weight = [np.zeros(w.shape) for w in self.weights]

        delta = self.cost_derivative(
            activations[-1], ys) * sigmoid_prime(zs[-1])
        nabla_bias[-1] = delta
        nabla_weight[-1] = np.dot(delta, activations[-2].transpose())

        # back propagate the error
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l + 1].transpose(), delta) * sp
            nabla_bias[-l] = delta
            nabla_weight[-l] = np.dot(delta, activations[-l - 1].transpose())

        return (nabla_weight, nabla_bias)

    def evaluate(self, test_data):
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        return output_activations - y


# Miscellaneous functions
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))
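For completeness, here is roughly how the class can be driven. The data loading is an assumption on my part: it uses mnist_loader.load_data_wrapper() from Nielsen's tutorial code (not shown above), which yields (x, y) pairs in the shapes the Network class expects, and the hyperparameters are the ones Nielsen uses.

# assumes the Network class above is in scope, and that Nielsen's
# mnist_loader module is available for loading the MNIST data
import mnist_loader

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
training_data = list(training_data)  # the Python 3 loader returns iterators
test_data = list(test_data)

# 784 input pixels, one hidden layer of 30 nodes, 10 output digits
net = Network([784, 30, 10])

# 30 epochs, mini batches of 10 images, learning rate eta = 3.0
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)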