K.P. Kaiser | kpkaiser.com

Teaching Machines Nonsense

Last Wednesday, while joking at the end of the workday, the idea came up to make an neural network that generates Amazon reviews.

Above you can see the results for images, reviews, prices, and product names generated by neural networks trained on Amazon’s data.

All of this was possible thanks to the dataset provided by Julian McAuley, used in the SIGIR and KDD papers, along with the torch-gan and char-rnn sources. We’ll walk through adapting this dataset and adapting it to train these neural networks in this writeup.

We’ll step through the thought process behind adapting and extending a dataset, and we’ll document the road blocks as we run into them along the way. I hope leaving these in will help beginners see that very often, programming requires running into wall after wall until you finally reach the other side.

We’ll cover how to load the dataset, how to generate fake product images, reviews, prices, and product names, and then export them for presentation. I hope you’ll enjoy the ride, even if you’re not necessarily into programming. I just want to give you an idea for how writing and extending an AI bot works when working from a given data set.

The finished neural networks and code with instructions are at Github.

So, let’s begin.

Meeting Your Data

The very first thing I did once I received the data set was to take a look at it, to get an idea for its formatting. It was originally compressed with gzip, and so I needed to uncompress it on my external drive.

Uncompressed, it was 68 gigabytes. When you’re working with large files, it can get tricky to figure out what you’ve got, and to make the most basic of assumptions. So the first thing to do is take a look at what you’re working with, and how messy your data might be.

In my case, I just used the ‘less’ command, to take a look at the first few lines in my terminal. This is easy enough to do in the command line:

$ less user_dedup.json
{"reviewerID": "A00000262KYZUE4J55XGL", "asin": "B003UYU16G", "reviewerName": "Steven N Elich", "helpful": [0, 0], "reviewText": "It is and does exactly what the description said it would be and would do. Couldn't be happier with it.", "overall": 5.0, "summary": "Does what it's supposed to do", "unixReviewTime": 1353456000, "reviewTime": "11 21, 2012"}
{"reviewerID": "A000008615DZQRRI946FO", "asin": "B005FYPK9C", "reviewerName": "mj waldon", "helpful": [0, 0], "reviewText": "I was sketchy at first about these but once you wear them for a couple hours they break in they fit good on my board an have little wear from skating in them. They are a little heavy but won't get eaten up as bad by your grip tape like poser dc shoes.", "overall": 5.0, "summary": "great buy", "unixReviewTime": 1357603200, "reviewTime": "01 8, 2013"}
{"reviewerID": "A00000922W28P2OCH6JSE", "asin": "B000VEBG9Y", "reviewerName": "Gabriel Merrill", "helpful": [0, 0], "reviewText": "Very mobile product. Efficient. Easy to use; however product needs a varmint guard. Critters are able to gorge themselves without a guard.", "overall": 3.0, "summary": "Great product but needs a varmint guard.", "unixReviewTime": 1395619200, "reviewTime": "03 24, 2014"}
{"reviewerID": "A00000922W28P2OCH6JSE", "asin": "B001EJMS6K", "reviewerName": "Gabriel Merrill", "helpful": [0, 0], "reviewText": "Easy to use a mobile. If you're taller than 4ft, be ready to tuck your legs behind you as you hang and pull.", "overall": 4.0, "summary": "Great inexpensive product. Mounts easily and transfers to the ground for multiple push up positions.", "unixReviewTime": 1395619200, "reviewTime": "03 24, 2014"}

We can immediately see that our JSON file isn’t really in JSON, but it’s in a kind of Python dictionary object. And indeed, when I look at the web url that I got this info from, it says specifically that each line can just be ‘eval’d in Python in order to generate one object at a time.

This makes it easier to deal with this large of a file, because it means we don’t need to read the entire list into memory just to create our object. (Which could take minutes to hours to do.) Instead, we can go line by line in our file, and each review individually into memory.

Continue reading →

Output Image
Input Images

Last week an incredible paper got released showing how neural networks could be used to separate a “style” from an image, and apply that style to another image. It’s a great read, even if some of the math goes over your head, and I encourage you to take a look.

Their basic breakthrough is that these neural networks, which have gained heavy use from companies like Google, Amazon, and Facebook, are learning to see what actual content looks like in images. By separating the neural layers which are used to distinguish textures in the images, the algorithm can see a “style” or texture, independent of the greater shape of the content image.

Today, Kai Sheng Tai released a Torch implementation of that paper. It’s different in a few aspects, the largest of which is it uses Google’s Inception network, rather than the paper’s (newer) VGG-19. Kai’s code also seems to get the best results when starting from your input image, rather than noise as in the paper. Kai’s code is also beautifully written, well commented, and easy to read. Please check it out.

(Thanks for the work Kai!)

The rest of these images were generated on my GTX 980 Ti, under Ubuntu 15.04, with cunn. Each image takes about 60 seconds to generate. You can change the image resolution in images.lua

Enjoy!

Guernica
Output Image