Image credit: Gerd Altmann

Adversarial Attacks on Neural Networks

Playing Jedi mind tricks with deep neural networks

Example of an adversarial attack

This notebook shows how, with relatively little code, we can change the classification of a neural network without the change being visible to humans.

import torch
import torch.nn
from torch.autograd.gradcheck import zero_gradients
import torch.nn.functional as F
import torchvision.models as models
from PIL import Image
from torchvision import transforms
import numpy as np
import requests, io
import matplotlib.pyplot as plt
from torch.autograd import Variable
import json
from tqdm import tqdm
%matplotlib inline

torch.manual_seed(210) # for reproducibility
<torch._C.Generator at 0x7f8d8a1d3bd0>

Download and set up the models in evaluation mode

We also load the labels we can classify. The pretrained models can distinguish 1000 classes. We load two different models: one ResNet-based and one with a different architecture, VGG16.

# use the labels for the imagenet data
labelsfile = "./labels.json"
with open(labelsfile, "r") as myfile:
    data = myfile.read()                 # read the raw JSON text
labels_json = json.loads(data)
labels = {int(idx): label for idx, label in labels_json.items()}

# mean and std remain the same irrespective of the model you use
mean=[0.485, 0.456, 0.406]
std=[0.229, 0.224, 0.225]

preprocess = transforms.Compose([
                transforms.ToTensor(),           # PIL image -> float tensor in [0, 1], required before Normalize
                transforms.Normalize(mean, std)
            ])

# helper function to visualize an image tensor
def visualize(x):
    x = x.squeeze(0)     #remove batch dimension # B * C * H * W ==> C * H * W
    # reverse of the normalization op ("unnormalize"): multiply by std, then add mean
    x = x.mul(torch.FloatTensor(std).view(3,1,1)).add(torch.FloatTensor(mean).view(3,1,1)).numpy()
    # transpose to put channel last (like it is in regular rgb images)
    x = np.transpose( x , (1,2,0))   # C * H * W  ==>   H * W * C
    x = np.clip(x, 0, 1)

    figure = plt.imshow(x)
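
To make the "unnormalize" step in `visualize` more concrete, here is a minimal round-trip sketch. It assumes the same `mean` and `std` values as above. The random 4x4 "image" is a made-up stand-in, not the stop sign photo.

```python
import torch

# Same per-channel statistics as in the notebook, reshaped to (3, 1, 1)
# so they broadcast over the height and width of a C x H x W tensor.
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

# Hypothetical tiny "image" with pixel values in [0, 1)
pixels = torch.rand(3, 4, 4)

normalized = (pixels - mean) / std   # what transforms.Normalize computes
restored = normalized * std + mean   # the inverse used for plotting
```

Up to floating-point error, `restored` equals `pixels`, which is why the plotted image looks like the original photo again.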

Let’s load an image we want to change the classification for

With everything prepared, let’s now get a real image to work on

img = Image.open("stop.jpg")

image_tensor = preprocess(img) # preprocess the image
image_tensor = image_tensor.unsqueeze(0) # add batch dimension.  C X H X W ==> B X C X H X W
img_variable = Variable(image_tensor.clone(), requires_grad=True) #convert tensor into a variable



Classify the example

First of all, let’s try the regular classification with the pretrained networks.

def classifyImage(network, image):
    output = network.forward(image)
    label_idx = torch.max(output.data, 1)[1][0]   # index (class number) of the largest output
    x_pred = labels[int(label_idx)]
    output_probs = F.softmax(output, dim=1)
    x_pred_prob = (torch.max(output_probs.data, 1)[0][0]) * 100   # top probability in percent
    return (x_pred, x_pred_prob)

resnet = models.resnet34(pretrained=True) #download and load pretrained model
resnet.eval() # eval mode: use the stored batch-norm statistics and disable dropout

# let's also have another model for verification...
net2 = models.vgg16(pretrained=True)
net2.eval() # VGG16 contains dropout layers, so it needs eval mode as well

print(classifyImage(resnet, img_variable))
print(classifyImage(net2, img_variable))
('street sign', tensor(55.5840))
('street sign', tensor(40.9388))

The image is correctly classified as a street sign by both networks. Good, now let’s see what happens when we start changing the image.
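
The percentages printed by `classifyImage` come from a softmax over the raw network outputs. As a minimal, framework-free sketch of that step (the three scores below are invented for illustration, not taken from either network):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                         # hypothetical raw scores for three classes
probs = softmax(logits)

# Index of the most likely class and its probability in percent,
# mirroring what classifyImage returns.
best = max(range(len(probs)), key=probs.__getitem__)
confidence_percent = 100 * probs[best]
```

The predicted class is the one with the largest score, and the reported confidence is its share of the exponentiated scores.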

Adding noise to the image

First let’s generate noise

noise_image = 0.3 * torch.rand(img_variable.size())


Now let’s simply classify the noise alone

Again, let’s use the loaded neural network models.

print(classifyImage(resnet, noise_image))
print(classifyImage(net2, noise_image))
('wall clock', tensor(4.5388))
('wall clock', tensor(2.4095))

Both networks settle on the same class, but with a very low confidence. The agreement is probably just luck or some bias in the data, but let’s see what happens when we combine the original image with the noise.

Now let’s add this noise to the image

noisy_image = img_variable + noise_image


The new image does not look as clear as the original. It is a bit grainy, like camera noise.

Test how the classification works on this noisy image

print(classifyImage(resnet, noisy_image))
print(classifyImage(net2, noisy_image))
('street sign', tensor(56.2734))
('street sign', tensor(31.2596))

Image still classified as “street sign”

Even though visible distortion has been applied, the image is still classified as a street sign. The model seems to be robust against distortion. Is that really the case?

Idea: directed distortion

The problem with the approach above for fooling the classifier is that the noise is random and therefore cancels itself out to some degree.

But we can do better: we can push the noise in a specific direction.
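
A toy calculation illustrates the difference. The sketch below assumes a hypothetical linear "class score" (a weighted sum of the inputs) instead of a real network, and zero-centered noise. For such a score, the gradient with respect to the input is just the weight vector, so the directed perturbation uses the sign of each weight.

```python
import random

random.seed(0)
dim = 1000        # number of input values ("pixels"), chosen for illustration
epsilon = 0.1     # maximum change allowed per input value

# Hypothetical weights of a linear score: score(x) = sum(w_i * x_i)
weights = [random.gauss(0.0, 1.0) for _ in range(dim)]

def score_shift(perturbation):
    """How much the score changes when the perturbation is added to the input."""
    return sum(w * p for w, p in zip(weights, perturbation))

# Random noise: positive and negative contributions largely cancel.
random_noise = [random.uniform(-epsilon, epsilon) for _ in range(dim)]

# Directed noise: every component pushes the score in the same direction.
directed_noise = [epsilon if w > 0 else -epsilon for w in weights]

random_shift = score_shift(random_noise)
directed_shift = score_shift(directed_noise)
```

Both perturbations change each input by at most `epsilon`, yet the directed one moves the score far more, because none of its contributions cancel.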

Adversarial attack

If we iteratively move the input slightly in a direction where the classification is closer to a different class, we can try to force the model to make a wrong decision. For doing this we have all we need, since we know the model’s internals. We can simply use gradients for this purpose.

So let’s do the following: define a target class, compute the gradients with regard to the difference of the input image and the target (wrong) class, and then move slightly into the direction where the difference decreases.
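
Written as an equation, one way to express this update is the following sketch. The notation is introduced here for illustration: $f$ is the network, $L$ the cross-entropy loss, $y_{\text{target}}$ the target class, $\alpha$ the step size and $\epsilon$ the maximum total perturbation.

```latex
x^{(0)} = x_{\text{orig}}, \qquad
x^{(t+1)} = x_{\text{orig}} + \operatorname{clip}\!\Big(
    x^{(t)} - \alpha \,\operatorname{sign}\!\big(\nabla_x L(f(x^{(t)}),\, y_{\text{target}})\big) - x_{\text{orig}},
    \; -\epsilon,\; \epsilon \Big)
```

The minus sign moves the image in the direction that decreases the loss for the target class, and the clipping keeps every pixel within $\epsilon$ of the original image.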

This is a so-called white-box attack, since we know the parameters of the neural network.

Let’s do this for 10 iterations in small steps.

epsilon = 0.5
num_steps = 10
alpha = 0.02
y_true = Variable( torch.LongTensor([808]), requires_grad=False)   # index of the target (wrong) class
loss = torch.nn.CrossEntropyLoss()                 # loss function
orig_img = image_tensor.clone()                    # unmodified input, reference point for the clamping

for i in tqdm(range(num_steps)):
  zero_gradients(img_variable)                       # flush gradients for the img_variable
  output = resnet.forward(img_variable)              # perform forward pass on the known neural network
  loss_cal = loss(output, y_true)                    # loss with respect to the target class
  loss_cal.backward()                                # gradient of the loss with respect to the input image
  x_grad = alpha * torch.sign(img_variable.grad.data)   # signed gradient step of size alpha
  adv_temp = img_variable.data - x_grad                 # step towards the target class; img_variable also contains perturbation from previous iterations
  total_grad = (adv_temp - orig_img)                  # total perturbation
  total_grad = torch.clamp(total_grad, -epsilon, epsilon)
  x_adv = orig_img + total_grad                      #add total perturbation to the original image = x_adv
  img_variable.data = x_adv                          # carry the adversarial image into the next iteration
100%|██████████| 10/10 [00:02<00:00,  3.51it/s]
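
The same loop can be tried on a problem small enough to follow by hand. The sketch below is an illustration, not the notebook's setup: the "network" is a hypothetical 3-class linear model with fixed weights, and the input is a vector of four numbers instead of an image. It also zeroes the gradient directly on the tensor instead of calling `zero_gradients`.

```python
import torch
import torch.nn.functional as F

# Hypothetical 3-class linear "network": logits = x @ W^T
W = torch.tensor([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])

def predict(x):
    """Return the index of the class with the largest logit."""
    return int((x @ W.t()).argmax(dim=1).item())

x_orig = torch.tensor([[2.0, 0.5, 0.0, 0.0]])   # clearly class 0
target = torch.tensor([2])                       # class we want to force
epsilon, alpha, num_steps = 1.5, 0.1, 30

x = x_orig.clone().requires_grad_(True)
for _ in range(num_steps):
    if x.grad is not None:
        x.grad.zero_()                           # flush gradients from the previous step
    loss = F.cross_entropy(x @ W.t(), target)    # loss with respect to the target class
    loss.backward()                              # gradient with respect to the input
    with torch.no_grad():
        stepped = x - alpha * x.grad.sign()      # signed step that lowers the loss
        delta = torch.clamp(stepped - x_orig, -epsilon, epsilon)
        x.copy_(x_orig + delta)                  # keep the total perturbation within epsilon

pred_before = predict(x_orig)
pred_after = predict(x.detach())
max_change = float((x.detach() - x_orig).abs().max())
```

Here the prediction moves from class 0 to the target class 2, while no input value changes by more than `epsilon`.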

The newly constructed image was changed only slightly in these iterations, so let’s look at the generated image in comparison to the original:

visualize (img_variable.detach())



The images still look pretty similar, and only small perturbations are visible to the human eye. So let’s also look at the difference between the original and the modified image:

visualize(img_variable.detach() - image_tensor)
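
Besides looking at the difference, its size can also be expressed as a number. The sketch below uses a made-up 2x2 "image" and assumes the normalization constants from above. Because the attack runs on normalized tensors, a change of `epsilon` in normalized units corresponds to a change of `epsilon * std` on the original [0, 1] pixel scale.

```python
import torch

# Per-channel std from the preprocessing step, shaped for B x C x H x W tensors
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

# Hypothetical example: every pixel shifted by the full epsilon = 0.5
original = torch.zeros(1, 3, 2, 2)
shift = torch.tensor([0.5, -0.5, 0.5]).view(1, 3, 1, 1)
adversarial = original + shift                 # broadcasts over height and width

delta = adversarial - original
linf_normalized = float(delta.abs().max())           # largest change in normalized units
linf_pixels = float((delta * std).abs().max())       # same change on the [0, 1] pixel scale
```

With `epsilon = 0.5`, the largest possible change per pixel is therefore about 0.11 on the [0, 1] scale. Note that `visualize` adds the channel means back, so the plotted difference appears on a uniform gray-ish background rather than on black.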


So how does our classifier do now on this changed image?

print(classifyImage(resnet, img_variable))
('sombrero', tensor(99.7362))


We were able to fool the classifier into seeing something else, while for a human the image looks almost the same.


We knew the parameters of the network. Is that a fair attack?

While this is certainly a valid objection to this kind of attack, it is often possible to obtain these parameters. But even if we don’t know the parameters, let’s see what another classifier would say. Let’s try a ResNet-50 classifier:

resnet50 = models.resnet50(pretrained=True) #download and load pretrained model
resnet50.eval() # eval mode: use the stored batch-norm statistics and disable dropout
print(classifyImage(resnet50, img_variable))
('plastic bag', tensor(47.2186))

So even the deeper model is fooled, without us having seen its weights. But maybe this was because of the similar architectures?

So we can try the same again on an architecture that is different from ResNet:

print(classifyImage(net2, img_variable)) # remember this is a vgg16 model!
('umbrella', tensor(22.4518))


All of the differently trained networks, which previously predicted the class street sign correctly, were fooled into different predictions.


In this notebook we showed how comparatively easy it is to fool a neural network in a classification task. The same approach can be used for other tasks, too.

We also showed that a white-box attack can serve as a surrogate for fooling other networks, even networks that are not very close from an architectural perspective.

Kay Rottmann

My professional interests include applied machine learning and artificial intelligence solutions.