Training Neural Networks to Generate Terrible Amazon Products

Teaching Machines Nonsense

Last Wednesday, while we were joking around at the end of the workday, the idea came up to make a neural network that generates Amazon reviews.

Above you can see the results for images, reviews, prices, and product names generated by neural networks trained on Amazon’s data.

All of this was possible thanks to the dataset provided by Julian McAuley, used in the SIGIR and KDD papers, along with the torch-gan and char-rnn sources. In this writeup, we’ll walk through adapting this dataset to train these neural networks.

We’ll step through the thought process behind adapting and extending a dataset, and we’ll document the roadblocks as we run into them along the way. I hope leaving these in will help beginners see that programming very often means running into wall after wall until you finally reach the other side.

We’ll cover how to load the dataset, how to generate fake product images, reviews, prices, and product names, and then how to export them for presentation. I hope you’ll enjoy the ride, even if you’re not necessarily into programming. I just want to give you an idea of how writing and extending an AI bot works when you start from a given data set.

The finished neural networks and code, with instructions, are on GitHub.

So, let’s begin.

Meeting Your Data

The very first thing I did once I received the data set was to take a look at it, to get an idea for its formatting. It was originally compressed with gzip, and so I needed to uncompress it on my external drive.

Uncompressed, it was 68 gigabytes. When you’re working with large files, it can get tricky to figure out what you’ve got, and to make the most basic of assumptions. So the first thing to do is take a look at what you’re working with, and how messy your data might be.

In my case, I just used the ‘less’ command, to take a look at the first few lines in my terminal. This is easy enough to do in the command line:

$ less user_dedup.json
{"reviewerID": "A00000262KYZUE4J55XGL", "asin": "B003UYU16G", "reviewerName": "Steven N Elich", "helpful": [0, 0], "reviewText": "It is and does exactly what the description said it would be and would do. Couldn't be happier with it.", "overall": 5.0, "summary": "Does what it's supposed to do", "unixReviewTime": 1353456000, "reviewTime": "11 21, 2012"}
{"reviewerID": "A000008615DZQRRI946FO", "asin": "B005FYPK9C", "reviewerName": "mj waldon", "helpful": [0, 0], "reviewText": "I was sketchy at first about these but once you wear them for a couple hours they break in they fit good on my board an have little wear from skating in them. They are a little heavy but won't get eaten up as bad by your grip tape like poser dc shoes.", "overall": 5.0, "summary": "great buy", "unixReviewTime": 1357603200, "reviewTime": "01 8, 2013"}
{"reviewerID": "A00000922W28P2OCH6JSE", "asin": "B000VEBG9Y", "reviewerName": "Gabriel Merrill", "helpful": [0, 0], "reviewText": "Very mobile product. Efficient. Easy to use; however product needs a varmint guard. Critters are able to gorge themselves without a guard.", "overall": 3.0, "summary": "Great product but needs a varmint guard.", "unixReviewTime": 1395619200, "reviewTime": "03 24, 2014"}
{"reviewerID": "A00000922W28P2OCH6JSE", "asin": "B001EJMS6K", "reviewerName": "Gabriel Merrill", "helpful": [0, 0], "reviewText": "Easy to use a mobile. If you're taller than 4ft, be ready to tuck your legs behind you as you hang and pull.", "overall": 4.0, "summary": "Great inexpensive product. Mounts easily and transfers to the ground for multiple push up positions.", "unixReviewTime": 1395619200, "reviewTime": "03 24, 2014"}

We can immediately see that our JSON file isn’t really one big JSON document. Instead, each line is its own object, a kind of Python dictionary. And indeed, the web page I got the data from says specifically that each line can just be ‘eval’d in Python in order to generate one object at a time.

This makes it easier to deal with a file this large, because it means we don’t need to read the entire list into memory just to create our object. (Which could take minutes to hours to do.) Instead, we can go line by line through our file, and load each review individually into memory.
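
Here is a minimal sketch of that line-by-line approach, assuming Python 3. The names `parse_line` and `iter_reviews` are my own for illustration, not something the dataset ships with. It tries `json.loads` first, since the sample lines above happen to parse as JSON, and falls back to `ast.literal_eval`, which only accepts literals and so is a safer stand-in for a bare `eval`. It also reads a `.gz` file directly, on the assumption that you’d rather not uncompress everything first.

```python
import ast
import gzip
import json


def parse_line(line):
    """Parse one record: strict JSON first, Python-literal syntax as a fallback."""
    try:
        return json.loads(line)
    except ValueError:
        # literal_eval only evaluates literals (dicts, lists, strings, numbers),
        # so it cannot run arbitrary code the way eval() can
        return ast.literal_eval(line)


def iter_reviews(path):
    """Yield one review dict at a time instead of loading the whole file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield parse_line(line)


# Example use (the path is whatever your copy of the file is called):
# import itertools
# for review in itertools.islice(iter_reviews("user_dedup.json"), 3):
#     print(review["overall"], review["summary"])
```

Because `iter_reviews` is a generator, only one review lives in memory at a time, no matter how big the file is.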


Defeating Facebook’s DeepFace with Deep Dreams

Glitched Faces

AI Is Already Here, And We’ve Barely Noticed

Facebook DeepFace Autotagging Me

AI is infiltrating our lives, in much the same way mobile did before it. It’s being fueled by the massive amounts of data we humans are generating from our phones, and it’s begun to radically change the way we interact with our machines.

For instance, when you upload a photo to Facebook, it runs through DeepFace, Facebook’s face recognition technology. It scans your photo for any faces it may recognize, using its knowledge of previously tagged uploads to tell people apart.

In my case, DeepFace knew the black and white photo I uploaded was me. I only have a little over a hundred tagged photos of myself on Facebook, but that’s enough for DeepFace to recognize me.

Microsoft Hello, and Apple’s TouchID

Hello Terry

Windows Hello is Microsoft’s latest “security” feature, which allows you to use an infrared, Kinect-like web camera to log into your computer securely. It uses a 3D model and images of what you look like, to ensure that it’s not just a photo of you in front of your computer, but the real you.

Every iPhone now has the “security” of Apple’s TouchID, a fingerprint reader and recognizer. Apple has said multiple times that the thumbprint is stored securely on your phone, with no remote access, but all iPhones were just remotely rebootable via a text message.

Why do the largest companies in the world think we want to put all of our biometric data onto their platforms?

What happens when we have our first big biometric data breach, and everyone’s thumbprint, retina scan, and face patterns get leaked? How do we replace all newly insecure biometrics overnight when that happens?

Seeing the Machine’s Mind

A month ago, Google released a piece of software called Deep Dream. It allowed people to see what the machine learning algorithms were looking for when they recognized things like dogs or faces in images.

If you haven’t heard of it or seen any of the images, I wrote a guide walking through how it works.

The machine learns from lots of input data. It needed thousands of images, each labeled as things like dogs, squids, bicycles, etc., to learn what these things look like.

So platforms like Google, Facebook, and Microsoft are all in unique positions to exploit and collect as much data as possible, to look for possibly novel uses for their massive amounts of data later.

But more interestingly, the recent release of Deep Dream gives us an opportunity to subvert the machine’s process of discovery, by feeding it images that are exactly what it’s looking for. That creates noise, and the noise gives us an opportunity to untrain the machinery from knowing who we are.

Generating Machine Noise

Originally, I tried generating raw noise, and having the noise be Deep Dreamed by a face-trained neural network. (Specifically, pool5 of the Age Net from the Caffe model zoo.) This didn’t work at all. I used OpenCV and a Haar Cascade trained on faces to see when I’d generated a face from the background noise, and got a few images where there were multiple faces, but Facebook simply didn’t see the same faces as the Haar Cascade.


So I changed tacks, and just did a simple copy and paste job. I used the Haar Cascade on a few photos of myself, and copied and pasted multiple versions of my face into the image.

Unfortunately, this didn’t give me the results I was looking for. Instead, most of my faces were being missed by Facebook’s face detection. So I’d dream an entire image, filled with maybe 30 or 40 copies of my face, and I’d only get one or two faces recognized by Facebook.

Automatically Generating Noisy Faces

Perplexed, I started tiling faces, and doing multiple levels of dreams. Eventually, I found that the best recipe for tricking Facebook’s DeepFace was 2 Deep Dreams of pool5 of the Age Net, with 1 other non-face background image square as a filler. I stumbled onto this when a Haar Cascade mistook one of the trees in the background of my photo for a face.

from PIL import Image
import random
import cv2
import numpy as np

cascPath = './haarcascade_frontalface_default.xml' # our face classifier from OpenCV
classy = cv2.CascadeClassifier(cascPath)
image = cv2.imread('face.jpg', 1) # supply image with a face for opencv
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) # convert to grey for haar cascade
img = Image.open('face.jpg') # the same photo as a PIL image, for cropping and pasting

faces = classy.detectMultiScale( # play with these numbers if your face isn't recognized
    gray,
    minSize=(30, 30)
) # scaleFactor and minNeighbors are left at OpenCV's defaults here
print(len(faces)) # number of faces in the image

facesImages = []
for (x, y, w, h) in faces:
    facesImages.append(img.crop((x-10, y-10, x+w+10, y+h+10))) # make an array of all faces with a bit of room around them

# one background square from the photo to use as filler; the position is random,
# so it may take a few tries to land on a patch without a face in it
tile = facesImages[0]
bx = random.randint(0, max(0, img.width - tile.width))
by = random.randint(0, max(0, img.height - tile.height))
filler = img.crop((bx, by, bx + tile.width, by + tile.height))

x = 0
y = 0
i = 0
angles = [0, 45, 90, 180, 270] # not used now, but you could rotate, cycle through each angle
blankIMG = Image.new("RGB", (1280, 720), "white") # optimal resolution for image for me

while y < blankIMG.height:
    if x > blankIMG.width: # past the right edge, so start the next row
        y = y + facesImages[0].height
        x = 0
    face = facesImages[random.randint(0, len(facesImages)-1)] # for making face selection random
    if (i % 3) == 2:
        blankIMG.paste(filler, (x, y)) # how often to drop in the filler is a guess, tune it
    elif (i % 2) == 0:
        blankIMG.paste(face, (x, y))
    else:
        blankIMG.paste(face.transpose(Image.FLIP_LEFT_RIGHT), (x, y)) # flip image horizontally
    x = x + facesImages[0].width
    i = i + 1

blankIMG.save('presuccess.jpg') # image filled with tessellated faces

# net and deepdream() come from Google's Deep Dream notebook, with the Age Net loaded as net
imgnum = np.float32(blankIMG)
frame = deepdream(net, imgnum, end='pool5')
frame = deepdream(net, frame, end='pool5')
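
If the tiling loop above is hard to follow, here is the same idea stripped down to just the geometry, as a rough sketch with no image libraries involved. The function name `tile_layout` and the rule of mirroring every other tile are my own simplifications for illustration (there is no filler square here): it walks the canvas left to right, top to bottom, and reports where each tile would be pasted.

```python
def tile_layout(canvas_w, canvas_h, tile_w, tile_h):
    """Return (x, y, flipped) for each tile, left to right, then top to bottom."""
    layout = []
    i = 0
    for y in range(0, canvas_h, tile_h):
        for x in range(0, canvas_w, tile_w):
            # assumption: every other tile is mirrored horizontally
            layout.append((x, y, i % 2 == 1))
            i += 1
    return layout


# A 1280x720 canvas with 200x200 face tiles gives 7 columns and 4 rows
positions = tile_layout(1280, 720, 200, 200)
print(len(positions))
```

Tiles in the last column and last row hang over the edge of the canvas, which is fine, since pasting past the edge just clips the tile.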

Finally, I stumbled on the perfect amount of glitch for Facebook to still think a Deep Dreamed version of myself was me: the photo you see at the top of this post. When I uploaded it to Facebook, this is what I got:

Deep Dreamed with Age Net

The idea here is that we can start to steer the AI in a direction of our choosing. Maybe we want the right to be forgotten by Facebook’s machines, or maybe we want to loosen what gets seen as us. Either way, this is the beginning of a tool to steer the conversation of what the machines know about us.

I could see this sort of noise generation being used to throw AI and Big Data off of our personal trails. We may in the future have AIs covering our tracks for us online, generating our own signal to noise to be able to regain a piece of our anonymity.

Make Your Own Deep Graffiti

I’ve posted the code for this article over at GitHub, and I encourage any and all pull requests / ideas. I think using neural networks to trick one another is just beginning, and the AI arms race is about to get very interesting.

Can’t wait to see what you come up with!