python-3.x, machine-learning, tensorflow, neural-network, adversarial-machines

Calculate gradient of neural network


I am reading about adversarial images and breaking neural networks. I am trying to work through the article step-by-step, but due to my inexperience I am having a hard time understanding the following instructions.

At the moment, I have a logistic regression model for the MNIST data set. If you give it an image, it will predict the number that it most likely is...

saver.restore(sess, "/tmp/model.ckpt")  # load the trained model
# Take the first test image (a 7) and add a batch dimension: shape (1, 784)
x_in = np.expand_dims(mnist.test.images[0], axis=0)
# Run the graph and pick the class with the highest predicted probability
classification = sess.run(tf.argmax(pred, 1), feed_dict={x: x_in})
print(classification)

Now, the article states that in order to break this image, the first thing we need to do is get the gradient of the neural network. In other words, this will tell me the direction needed to make the image look more like a number 2 or 3, even though it is a 7.

The article states that this is relatively simple to do using back propagation. So you may define a function...

compute_gradient(image, intended_label)

...and this basically tells us what kind of shape the neural network is looking for at that point.

This may seem easy to implement for those with more experience, but the logic evades me.

From the parameters of the function compute_gradient, I can see that you feed it an image and an array of labels where the value of the intended label is set to 1.

But I do not see how this is supposed to return the shape of the neural network.

Anyway, I want to understand how I should implement this back propagation algorithm to return the gradient of the neural network. If the answer is not very straightforward, I would like some step-by-step instructions on how to get my back propagation to work as the article suggests it should.

In other words, I do not just want code I can copy; I want to understand how to implement it as well.


Solution

  • Back propagation involves calculating the error in the network's output (the cost function) as a function of the inputs and the parameters of the network, then computing the partial derivative of the cost function with respect to each parameter. It's too complicated to explain in detail here, but this chapter from a free online book explains back propagation in its usual application as the process for training deep neural networks.

    Generating images that fool a neural network simply involves extending this process one step further, beyond the input layer, to the image itself. Instead of adjusting the weights in the network slightly to reduce the error, we adjust the pixel values slightly to increase the error, or to reduce the error for the wrong class.
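    As a hedged sketch rather than the article's exact code: assuming a TF1-style graph like the one in the question, where x is the input placeholder, pred is the softmax output, and y is a one-hot label placeholder (add one with tf.placeholder if your graph does not already have it), tf.gradients gives you the derivative of the cost with respect to the image directly via back propagation:

    import numpy as np
    import tensorflow as tf

    # Assumed placeholder for the one-hot target label; adjust to your graph.
    y = tf.placeholder(tf.float32, [None, 10])

    # Cross-entropy cost of the prediction against the chosen label.
    cost = -tf.reduce_sum(y * tf.log(pred))

    # d(cost)/d(pixels): back propagation extended past the weights to the image.
    grad = tf.gradients(cost, x)[0]

    def compute_gradient(image, intended_label):
        # image: shape (1, 784); intended_label: one-hot, shape (1, 10)
        return sess.run(grad, feed_dict={x: image, y: intended_label})

    # Example: gradient that nudges the 7 toward the "2" class.
    target = np.zeros((1, 10))
    target[0, 2] = 1.0
    g = compute_gradient(x_in, target)

    Stepping the pixels a little in the -g direction reduces the cost for the wrong label ("2"), which is exactly the "adjust the pixel values slightly" step described above.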

    There's an easy (though computationally intensive) way to approximate the gradient with a technique from Calc 101: for a small enough e, df/dx is approximately (f(x + e) - f(x)) / e.

    Similarly, to calculate the gradient with respect to an image with this technique, calculate how much the loss/cost changes after adding a small change to a single pixel, save that value as the approximate partial derivative with respect to that pixel, and repeat for each pixel.

    Then the gradient with respect to the image is approximately:

    (
        (cost(x1+e, x2, ... xn) - cost(x1, x2, ... xn)) / e,
        (cost(x1, x2+e, ... xn) - cost(x1, x2, ... xn)) / e,
        .
        .
        .
        (cost(x1, x2, ... xn+e) - cost(x1, x2, ... xn)) / e
    )
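
    For illustration only (it needs one forward pass per pixel, so it is far too slow for real use), here is a minimal sketch of that finite-difference approximation, assuming a hypothetical helper cost(image, label) that runs the network and returns the scalar loss:

    import numpy as np

    def numerical_gradient(image, intended_label, cost, e=1e-4):
        # image: flat array of pixel values, shape (n,)
        # cost: assumed helper returning the scalar loss for (image, label)
        grad = np.zeros_like(image)
        base = cost(image, intended_label)
        for i in range(image.size):
            perturbed = image.copy()
            perturbed[i] += e  # nudge one pixel by a small step
            grad[i] = (cost(perturbed, intended_label) - base) / e
        return grad

    Comparing this numerical gradient against the one tf.gradients returns is a handy sanity check for the back propagation version.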