Training Neural Networks to Generate Terrible Amazon Products

Teaching Machines Nonsense

Last Wednesday, while joking around at the end of the workday, we came up with the idea of building a neural network that generates Amazon reviews.

Above you can see the results for images, reviews, prices, and product names generated by neural networks trained on Amazon’s data.

All of this was possible thanks to the dataset provided by Julian McAuley, used in the SIGIR and KDD papers cited at the end of this post, along with the torch-gan and char-rnn sources. In this writeup, we’ll walk through adapting that dataset to train these neural networks.

We’ll step through the thought process behind adapting and extending a dataset, and we’ll document the roadblocks as we run into them along the way. I hope leaving these in helps beginners see that, very often, programming means running into wall after wall until you finally reach the other side.

We’ll cover how to load the dataset, how to generate fake product images, reviews, prices, and product names, and how to export them for presentation. I hope you’ll enjoy the ride, even if you’re not necessarily into programming; I just want to give you an idea of what writing and extending an AI bot looks like when you start from a given dataset.

The finished neural networks and code, with instructions, are on Github.

So, let’s begin.

Meeting Your Data

The very first thing I did once I received the dataset was take a look at it, to get an idea of its formatting. It was originally compressed with gzip, so I needed to uncompress it onto my external drive.

Uncompressed, it was 68 gigabytes. When you’re working with files that large, it can be tricky to figure out what you’ve got, or even to make the most basic assumptions about it. So the first thing to do is look at what you’re working with, and see how messy your data might be.

In my case, I just used the ‘less’ command to take a look at the first few lines. This is easy enough to do from the terminal:

$ less user_dedup.json
{"reviewerID": "A00000262KYZUE4J55XGL", "asin": "B003UYU16G", "reviewerName": "Steven N Elich", "helpful": [0, 0], "reviewText": "It is and does exactly what the description said it would be and would do. Couldn't be happier with it.", "overall": 5.0, "summary": "Does what it's supposed to do", "unixReviewTime": 1353456000, "reviewTime": "11 21, 2012"}
{"reviewerID": "A000008615DZQRRI946FO", "asin": "B005FYPK9C", "reviewerName": "mj waldon", "helpful": [0, 0], "reviewText": "I was sketchy at first about these but once you wear them for a couple hours they break in they fit good on my board an have little wear from skating in them. They are a little heavy but won't get eaten up as bad by your grip tape like poser dc shoes.", "overall": 5.0, "summary": "great buy", "unixReviewTime": 1357603200, "reviewTime": "01 8, 2013"}
{"reviewerID": "A00000922W28P2OCH6JSE", "asin": "B000VEBG9Y", "reviewerName": "Gabriel Merrill", "helpful": [0, 0], "reviewText": "Very mobile product. Efficient. Easy to use; however product needs a varmint guard. Critters are able to gorge themselves without a guard.", "overall": 3.0, "summary": "Great product but needs a varmint guard.", "unixReviewTime": 1395619200, "reviewTime": "03 24, 2014"}
{"reviewerID": "A00000922W28P2OCH6JSE", "asin": "B001EJMS6K", "reviewerName": "Gabriel Merrill", "helpful": [0, 0], "reviewText": "Easy to use a mobile. If you're taller than 4ft, be ready to tuck your legs behind you as you hang and pull.", "overall": 4.0, "summary": "Great inexpensive product. Mounts easily and transfers to the ground for multiple push up positions.", "unixReviewTime": 1395619200, "reviewTime": "03 24, 2014"}

We can immediately see that our JSON file isn’t really JSON; each line is more like a Python dictionary literal. And indeed, the page I downloaded the data from says specifically that each line can just be ‘eval’d in Python to produce one object at a time.

This makes a file this large much easier to deal with, because we don’t need to read the whole thing into memory just to create our objects (which could take minutes to hours). Instead, we can go through the file line by line, and load each review into memory individually.
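As an aside (not something the original workflow relies on), the standard library’s ast.literal_eval will parse these Python-literal lines without executing arbitrary code, so it’s a safer drop-in for eval if you’re nervous about what’s in the file. A minimal sketch:

import ast

def generate_reviews(path='user_dedup.json'):
    # Stream the reviews one at a time instead of loading all 68GB at once.
    # literal_eval only accepts Python literals, so unlike eval it can't
    # execute arbitrary code hiding in a line.
    with open(path, 'r') as handle:
        for line in handle:
            yield ast.literal_eval(line)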

Extracting a 1 Star Review Dataset


Now, what are we interested in here? For me, the funniest thing would be generating 1 star products and product reviews. So I wrote a quick script to print out all the 1 star reviews, in roughly the same format Amazon uses on its site:

def generateReviews():
    # Stream the reviews one at a time, rather than reading all 68GB at once
    daFile = open('user_dedup.json', 'r')
    for line in daFile:
        yield eval(line)
 
with open('onestarReviews.txt', 'w') as outty:
    for line in generateReviews():
        if line['overall'] == 1.0:
            try:
                # Mimic Amazon's layout: summary, reviewer, then the review body
                theOutput = line['summary'] + '\n'
                theOutput += 'By ' + line['reviewerName'] + '\n'
                theOutput += line['reviewText'] + '\n\n'
                outty.write(theOutput)
            except:
                # Some reviews are missing fields; skip them
                continue

Okay, so when we run this program, we’ll generate a new file containing a list of Amazon reviews, each labelled with who reviewed the product and what they thought about it. It makes for a pretty interesting file if we take a look at the first few lines. This time, we can use the ‘head’ command to see the very first lines of a file:

$ head -n 50 onestarReviews.txt
Go wireless
By Pen Name
Good sound. Too many wires. Gets a lil confusing to set up and they get in the way. Spend the extra $ and get a wireless set
 
Ghost is junk,waste of my money n time...
By josh
Had some nice cool stuff u can do in the game like never b 4 but the game itself is junk!!!! So very disappointed with this game... I play my black ops 2 more than this game still! They could have done a way way better job on this game.. Black ops 2 is way way better than ghost!
...

Now that we’ve got a feel for our review structure, we’ll use Andrej Karpathy’s char-rnn to automatically generate Amazon 1 star reviews.

So, to start with, I’ll assume you’ve already installed Torch7, along with the rest of char-rnn’s dependencies. If you haven’t, you can look at the Torch site, and follow the instructions there.

Now, let’s take a look at the dataset Andrej includes with his original version by, again, doing a ‘head’ on his data. If we’ve cloned his repo, we should be able to go into his data directory, and then into ‘tinyshakespeare’.

$ head -n 100 input.txt
First Citizen:
Before we proceed any further, hear me speak.
 
All:
Speak, speak.
 
First Citizen:
You are all resolved rather to die than to famish?
 
All:
Resolved. resolved.
 
First Citizen:
First, you know Caius Marcius is chief enemy to the people.
 
All:
We know't, we know't.
 
First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?
 
All:
No more talking on't; let it be done: away, away!
 
Second Citizen:
One word, good citizens.

Exporting Your Data for an LSTM Neural Network

Alright! So the data char-rnn expects is actually very similar to what we’ve already made. All that’s left is to create a new directory, and then take only the first 700MB of our file, or however much will fit into memory for us.

In my case, I used head again, this time to specify how big of a file to create:

$ head -c 700M onestarReviews.txt > input.txt

Great! After a while, we should end up with a new input.txt file that’s 700 megs. We can now set this up to be trained using char-rnn:

$ th train.lua -data_dir data/amazononestar -rnn_size 700 -num_layers 3 -dropout 0.3
using CUDA on GPU 0...	
loading data files...	
cutting off end of data so that the batches/sequences divide evenly	
reshaping tensor...	
data load done. Number of data batches in train: 278920, val: 14681, test: 0	
vocab size: 99	
creating an lstm with 3 layers	
setting forget gate biases to 1 in LSTM layer 1	
setting forget gate biases to 1 in LSTM layer 2	
setting forget gate biases to 1 in LSTM layer 3	
number of parameters in the model: 10163399	
cloning rnn	
cloning criterion	
1/13946000 (epoch 0.000), train_loss = 4.59903159, grad/param norm = 1.1494e+00, time/batch = 0.3800s

Now, I’ve got an NVIDIA GeForce GTX 980 Ti with a lot of video memory, so I can push a largeish network without running out of memory. You may need to tweak the ‘rnn_size’ and ‘num_layers’ parameters down to something smaller to get it to work on your system. If you get lost, refer to the char-rnn Github repo; it’s got excellent documentation.

Once we’ve got that running, it will start training a neural network for us. This process can take a long time; in my case, leaving it running for a day or two was necessary. Luckily, the code above saves snapshots, so we can see how the training is going as it works…

Sampling from Your Network as You Train

$ th sample.lua cv/lm_lstm_epoch0.30_1.1600.t7 -seed 332 -primetext 'not good' -temperature .9
using CUDA on GPU 0...	
Make sure that your saved checkpoint was also trained with GPU. If it was trained with CPU use -gpuid -1 for sampling as well	
creating an lstm...	
seeding with not good	
--------------------------	
not good. The the supplier was "Done" I will try to think I would be paying money and so if you say this is not misleading much like serious previews and I will crash it, it didn't staff correctly. What a disappointment!
 
Three times
By Keith answer
Nice facility and seems to be protected so I love problems. I ordered the other blue the two area - that's why I was in the first page of Women 'more of the investment. Awful, all open the trainplay. Thought there wasn't as a prime string and pretty much fine with a few hours later, on one of the products that two didn't help saw the bolt... Gonna make it well burned out.....for exchange.
 
If you read a total piece of junk.
By Anny Effanan
When I ordered the top of the disc return then the gun
 
Wont be able to find the beat country.
By Liferai Barri
In the first and 3 months we could not carry in the driver so there are nice energy, the top program seemed to be not used at all. Now, the ends in use, especially in this sim cord in the first little cheap price is way too much. I immediately downloaded the screen protector, and I believe it is in size samples and frewn between the unit and some of these cables warn us but is very lame. The chair also wasn't in the way this situation. I have done something else for their kids. It doesn't stand with it filled with DVD.I don't know what they are at all......'Ofogismer?,,  it's dark dead.

Great! Let’s leave this running for a bit, and come back to it later. For now, let’s build our web scraper, and get a feel for what all the 1 star products might look like averaged out.

Creating Artificial Images Using a GAN

(A side note here: the original dataset also includes image data. For the purposes of this tutorial, though, we’ll pretend we weren’t handed that great gift, so we can see how we’d do it if we were out on our own.)

Now, if we go back to our original dataset, it’s obvious that we don’t have the Amazon images for our products. Instead, what we’ve got is an ASIN, or Amazon Standard Identification Number.

Let’s write a new Python script to go through each review and save out the ASINs of all the 1 star reviews:

def generateReviews():
    daFile = open('user_dedup.json', 'r')
    for line in daFile:
        yield eval(line)
 
with open('onestarASINs.txt', 'w') as outty:
    for line in generateReviews():
        if line['overall'] == 1.0:
            try:
                theOutput = line['asin'] + '\n'
                outty.write(theOutput)
            except:
                continue

Again, this code is very straightforward. We’re just writing out an ASIN for each 1 star review.

But, thinking about this, surely almost every Amazon product gets at least a single 1 star review. How do we separate out the ones that really suck?

Gathering Products with More Than Ten 1 Star Reviews

In my case, I decided to keep it simple, and just separate out the products that have more than 10 1 star reviews. We can do this in a single line of bash if we close our eyes and think for a bit:

$ sort onestarASINs.txt | uniq -cd | sort -nr | awk '$1>10' > greaterThanTen.txt

Alright, now do a head on this file, and what do we see?
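In my case, the top of the file looked like this (your counts will differ, but the most 1-starred product comes first, since we sorted in reverse):

$ head -n 1 greaterThanTen.txt
  40854 B00EOE0WKQ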

We’ve got a count of 1 star reviews on the left, and our matching ASIN on the right. We can use these ASINs to scrape the Amazon website and build a collection of product images for everything with more than ten 1 star reviews.

Inspecting Torch-Gan’s Example Dataset

Before we get started with scraping, though, we should have an idea of what we want our end data to look like. In my case, I want to use the excellent torch-gan to generate my fake products, based on all of these images, and I want to do the bare minimum of futzing to make my dataset work with its existing input.

So let’s take a look by downloading their dataset, and then seeing what sort of data they expect:

$ git clone git@github.com:skaae/torch-gan.git
$ cd torch-gan
$ cd datasets
$ python create_datasets.py
$ cd lfw_imgs/lfw_deepfunneled/lfw_deepfunneled
$ ls
...
$ cd Joseph_Safra
$ ls
Joseph_Safra_0001.jpg
$ identify Joseph_Safra_0001.jpg 
Joseph_Safra_0001.jpg JPEG 250x250 250x250+0+0 8-bit sRGB 9.24KB 0.000u 0:00.000

Okay. So it looks like the torch-gan repository is using the Labeled Faces in the Wild dataset, which gets downloaded, unzipped, and converted in that directory. Each image is named after whoever is in it, and each image is 250×250 pixels.

So, when we download our product images, we’ll need to save each one into a directory named after its product, resize every image to 250×250, and follow the same naming convention, so we can simply substitute our directory for the one that already exists.
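Concretely, the layout we’re aiming for is one directory per product (the slug here is just a made-up example):

some-terrible-product-name/
    some-terrible-product-name_0001.jpg    (resized to 250×250)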

Scraping Amazon for Our Images

Now, before we start scraping, is a good time to ask ourselves whether there’s anything else we’d like to get out of these pages. In general, if we’ve got the space, it makes the most sense to just copy as much as possible. So, if we’re downloading a page for every ASIN, we might as well save the entire page, so that we can, for example, extract a product description from it later, if we decide we want a neural network to generate those too.

First, before we start our scrape, let’s try downloading just one of our ASIN products. Let’s take the product with the most 1 star reviews, and go from there:

$ wget http://www.amazon.com/gp/product/B00EOE0WKQ
$ less B00EOE0WKQ

Wow! I did not expect that! Amazon’s HTML is a mess! But hopefully they’re still using the same layout for every product, so let’s see if we can find an image in all of that. We’ll open the page in a web browser, right click, and inspect the element where the main product image lives:

<img alt="Amazon Fire Phone, 32GB (AT&amp;T)" src="http://ecx.images-amazon.com/images/I/61PO4UcPdIL._SY679_.jpg"
data-old-hires="http://ecx.images-amazon.com/images/I/61PO4UcPdIL._SL1000_.jpg" class="a-dynamic-image  a-stretch-vertical" id="landingImage" 
data-a-dynamic-image="{&quot;http://ecx.images-amazon.com/images/I/61PO4UcPdIL._SY679_.jpg&quot;:[325,679],&quot;http://ecx.images-amazon.com/images/I/61PO4UcPdIL._SY450_.jpg&quot;:[216,450],&quot;http://ecx.images-amazon.com/images/I/61PO4UcPdIL._SY550_.jpg&quot;:[263,550],&quot;http://ecx.images-amazon.com/images/I/61PO4UcPdIL._SY355_.jpg&quot;:[170,355],&quot;http://ecx.images-amazon.com/images/I/61PO4UcPdIL._SY606_.jpg&quot;:[290,606]}"
 style="max-height: 627px; max-width: 325px;">

Okay. We’ve got at least one image, the main one with id="landingImage", that seems like it might be consistent across product pages. So let’s just go with that.
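If we did want to pull that URL back out of the saved page programmatically (we won’t end up needing to, as you’ll see in a minute), a rough sketch with BeautifulSoup might look like this. The id="landingImage" element is an assumption carried over from the page we just inspected, and it won’t necessarily hold for every product page:

from bs4 import BeautifulSoup

# Illustrative only: parse the page we saved with wget and print the
# main product image URLs, if the landingImage element is present.
with open('B00EOE0WKQ', 'rb') as page:
    soup = BeautifulSoup(page.read(), 'html.parser')

img = soup.find('img', id='landingImage')
if img is not None:
    print(img.get('src'))
    print(img.get('data-old-hires'))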

Now, another thing that would be great to grab would be the description of each of these products. Let’s see if we can grab that too…

Looking at the page on my screen (again, by right-clicking and inspecting the element), I can see that Amazon has a center-*-*_feature_div for each of the featured info graphics. This looks like it might be a mess to break apart, so let’s just skip it for now, but we’ll grab the full pages regardless.

In my case, I’m going to just download the full HTML page for everything for now. Let’s do some quick back-of-the-envelope math to figure out how long that might take.

Do We Have a Large Enough Dataset?

If we do a line count on our existing 1 star products file, we can see exactly how many pages we need to scrape:

$ wc -l greaterThanTen.txt
93660 greaterThanTen.txt

Okay, so we’re looking at around 94,000 pages. If we assume that we wait 1 second in between grabbing pages, how long will this end up taking us?

A quick web search for ‘94000 seconds to hours’ lets us know we’re looking at around 26 hours. So we can add that 1 second pause to a Python script, and we should have all of our pages within a little over a day. Let’s write that script now, so we can get started.

Checking Our Code in iPython Before We Scrape

We’re about to set this program running for a day or so, so let’s be really sure about what we’re doing. In cases like this, it helps to open up an iPython shell, make sure you’re not missing anything, and work through a few edge cases:

$ ipython
In [1]: import subprocess
In [2]: a = "  40854 B00EOE0WKQ"
In [3]: a.split(' ')
Out[3]: ['', '', '40854', 'B00EOE0WKQ']
In [4]: theURL = 'http://www.amazon.com/gp/product/' + a.split(' ')[3]
In [5]: theURL
Out[5]: 'http://www.amazon.com/gp/product/B00EOE0WKQ'
In [6]: subprocess.call(['wget', theURL])
--2015-12-31 22:09:00--  http://www.amazon.com/gp/product/B00EOE0WKQ
Resolving www.amazon.com (www.amazon.com)... 54.239.25.192
Connecting to www.amazon.com (www.amazon.com)|54.239.25.192|:80... connected.
HTTP request sent, awaiting response... 301 MovedPermanently
Location: http://www.amazon.com/Fire_Phone_13MP-Camera_32GB/dp/B00EOE0WKQ [following]
--2015-12-31 22:09:00--  http://www.amazon.com/Fire_Phone_13MP-Camera_32GB/dp/B00EOE0WKQ
Reusing existing connection to www.amazon.com:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘B00EOE0WKQ’
 
B00EOE0WKQ                         [   <=>                                                      ] 454.04K   651KB/s   in 0.7s   
 
2015-12-31 22:09:01 (651 KB/s) - ‘B00EOE0WKQ’ saved [464933]
In [7]: exit()

Okay, so there’s just one thing we’ll need to watch out for, and that’s wrapping our URL construction in a ‘try’ statement, just in case there’s something wonky in one of our lines. So let’s write that script now, and get it started.

import time
import subprocess
 
with open('greaterThanTen.txt', 'r') as daFile:
    for number, line in enumerate(daFile):
        try:
            # Note: we never strip the line, so the trailing newline rides
            # along into the URL (you'll see it as %0A in the wget output),
            # and the whitespace-based split is fragile for smaller counts.
            theURL = 'http://www.amazon.com/gp/product/' + line.split(' ')[3]
            print("Grabbing number " + str(number) + ", ASIN: " + line.split(' ')[3])
        except:
            continue
        subprocess.call(['wget', theURL])
        time.sleep(1)

Now, let’s run it and see if all looks okay:

$ python downloadURLs.py 
Grabbing number 0, ASIN: B00EOE0WKQ
 
--2015-12-31 22:34:10--  http://www.amazon.com/gp/product/B00EOE0WKQ%0A
Resolving www.amazon.com (www.amazon.com)... 54.239.25.208
Connecting to www.amazon.com (www.amazon.com)|54.239.25.208|:80... connected.
HTTP request sent, awaiting response... 301 MovedPermanently
Location: http://www.amazon.com/Fire_Phone_13MP-Camera_32GB/dp/B00EOE0WKQ [following]
--2015-12-31 22:34:10--  http://www.amazon.com/Fire_Phone_13MP-Camera_32GB/dp/B00EOE0WKQ
Reusing existing connection to www.amazon.com:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘B00EOE0WKQ%0A’
 
B00EOE0WKQ%0A                        [     <=>                                                    ] 454.00K   116KB/s   in 3.9s   
 
2015-12-31 22:34:14 (116 KB/s) - ‘B00EOE0WKQ%0A’ saved [464894]
 
Grabbing number 1, ASIN: 5775
--2015-12-31 22:34:15--  http://www.amazon.com/gp/product/5775
Resolving www.amazon.com (www.amazon.com)... 54.239.17.6
Connecting to www.amazon.com (www.amazon.com)|54.239.17.6|:80... connected.
HTTP request sent, awaiting response... 404 NotFound
2015-12-31 22:34:15 ERROR 404: NotFound.
 
Grabbing number 2, ASIN: 3166
--2015-12-31 22:34:16--  http://www.amazon.com/gp/product/3166
Resolving www.amazon.com (www.amazon.com)... 54.239.25.200
Connecting to www.amazon.com (www.amazon.com)|54.239.25.200|:80... connected.
HTTP request sent, awaiting response... 503 Service Unavailable
2015-12-31 22:34:16 ERROR 503: Service Unavailable.
...

Oh no! It looks like Amazon really doesn’t like being scraped! I only got as far as three pages before all of my requests started returning 503 Service Unavailable.

Getting Amazon Data the Right Way

Indeed, looking at Amazon’s site, it looks like you’ve got to be signed up as an Associate, and have an AWS account with an access key and secret too. Oh no!

But fret not: I’ve got an Associates account, and I’ve got an AWS account too. I’m going to use the great Amazon Simple Product API by yoavaviram, and hopefully we’ll be able to get just what we need out of it.

So, after installing the library above, I wrote a script to start downloading all of my product images, creating files in the same pattern as the Labeled Faces dataset from before:

from amazon.api import AmazonAPI
import shutil
import requests
 
import time
import subprocess
import slugify
 
amazon = AmazonAPI('REPLACE-WITH-AWSID', 'REPLACE-WITH-AWSSECRET', 'REPLACE-WITH-ASSOCIATE-ID')
 
 
with open('greaterThanTen.txt', 'r') as daFile:
    for number, line in enumerate(daFile):
        time.sleep(.5)  # stay friendly with the API's rate limits
 
        try:
            theNum = line.strip().split(' ')[1]
            print("In loop " + theNum)
 
            product = amazon.lookup(ItemId=theNum)
            print("Grabbing number " + str(number) + ", Product: " + product.title)
            # Slugify the product title so it's safe to use as a directory name
            with_underscores = slugify.slugify(product.title)
 
            subprocess.call(['mkdir', with_underscores])
            # Stream the large product image straight to disk,
            # following the Name/Name_0001.jpg convention from LFW
            response = requests.get(product.large_image_url, stream=True)
            with open(with_underscores + '/' + with_underscores + '_0001.jpg', 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
 
        except:
            continue

Great! Now, before we run this, let’s do a quick sanity check and see how many images the Labeled Faces dataset had:

$ cd lfw-deepfunneled
$ find . -type f | wc -l
13233

Okay. So we’re looking at around 13,000 images to get a half decent result. That should be doable, considering we’ve got over 93,000 URLs to choose from. Even if a lot of those URLs no longer exist, we should be able to get at least 14k images for our dataset. And while our script is running, we can rerun the same command as above to get an image count, and stop the scraper once we’ve collected the 14,000 or so worst-rated products.

Now, run our script, and let it start saving out products. In my case, I stopped it after around 18,000.

Resizing and Cleaning Image Data

Next, we’ve got to resize our images to the right dimensions. We could have done this during the scrape, but then we’d need to rescrape our data if we later decided we needed a slightly bigger image. So let’s write a quick script to recursively go through all of these directories and write out new copies at the right size, 250×250:

import glob
import subprocess
 
from PIL import Image, ImageOps
 
size = (250, 250)
 
for filename in glob.glob('*/*.jpg'):
    try:
        image = Image.open(filename).convert('RGB')
    except:
        # Skip anything PIL can't read (corrupt or incomplete downloads)
        continue
    # Crop-and-resize to exactly 250x250, like the LFW images
    thumb = ImageOps.fit(image, size, Image.ANTIALIAS)
    # Mirror the directory structure under 'smaller/' (create 'smaller' first)
    subprocess.call(['mkdir', 'smaller/' + filename.split('/')[0]])
    thumb.save('smaller/' + filename)

Now, first make sure we create a directory called ‘smaller’, and then run the above script. After it runs for a while, we should end up with our 18,000 images, all JPEGs, and all ready to be loaded into our neural network for training. Yes!

Adapting the GAN Code to Our Dataset

Let’s get back to the generative adversarial network, and think through how we’re going to get this data loaded up and ready for it, as the HDF5 file it expects.

We can get this process started by opening the data directory and reading through the code, to see exactly how everything is supposed to be loaded. In my case, I noticed that the lfw.py file tries to download the image set, and then run through it, resizing everything to 64×64 pixels. So, whoops, we really should have resized to that originally, but let’s fix the install function, and copy all of our generated thumbnails over to that directory so it can run. Inside lfw.py:

    def _install(self):
        log.info('Converting images to NumPy arrays')
        name_dict = {}
        imgs = []
        img_idx = 0
        for root, dirs, files in os.walk(self.data_dir):
            for filename in files:
                _, ext = os.path.splitext(filename)
                if ext.lower() != '.jpg':
                    continue
                filepath = os.path.join(root, filename)
                imgs.append(np.array(Image.open(filepath)))
                # The directory name is the (slugified) product name
                _, name = os.path.split(root)
                if name not in name_dict:
                    name_dict[name] = []
                name_dict[name].append(img_idx)
                img_idx += 1
                if img_idx % 100 == 0:
                    print(img_idx)
        imgs = np.array(imgs)
        names = sorted(name_dict.keys())
        names_idx = np.empty(len(imgs))
        for name_idx, name in enumerate(names):
            for img_idx in name_dict[name]:
                names_idx[img_idx] = name_idx
        with open(self._npz_path, 'wb') as f:
            np.savez(f, imgs=imgs, names_idx=names_idx, names=names)

Great! Now run this through create_datasets.py again, and at the end of it we’ll have our brand new lfw.hdf5. We can then start training on all the images we loaded before.
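If you’d like to sanity-check the file before kicking off training, a quick peek with h5py (not part of the original workflow) will show what made it in. I’m not assuming any particular dataset names here, just listing whatever exists:

import h5py

# List every dataset in the generated file along with its shape and dtype.
with h5py.File('lfw.hdf5', 'r') as f:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)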

Finally, I’ve got my file, and now we can look into the Torch GAN code, and get our network up and training:

$ th train_lfw.lua -g 0

Training this might take a few days, just as it took a few days to train our previous neural network to generate one star Amazon reviews. So, in the meantime, let’s take a look at what our code is doing by reading the great writeup by Anders Boesen Lindbo Larsen and Søren Kaae Sønderby on the Torch blog.

If we read through their writeup on generative adversarial networks, we can see that we’re actually training two neural networks here: the first is a generator, and the second is a discriminator.

As this runs, it periodically outputs a 10×10 grid of generated 64×64 images. We can see them all getting generated, and watch as the two networks learn from one another.

While we train, let’s use our existing one star image directories to create a 1 star product name generator.

Crosslinking Our Names to ASINs

Earlier, we saved all of our images into unique directories, with the product names slugified, their spaces replaced by dashes. Let’s dump all the directory names we created, and write a quick one-liner to turn the dashes back into spaces and strip the trailing slashes:

$ ls -d */ | sed -e 's/-/ /g' -e 's/\///g' > names.txt

Now, this one-liner does the following:

‘ls -d */’ lists all the directories in our current directory. ‘sed -e ‘s/-/ /g’’ replaces each dash with a space, and ‘-e ‘s/\///g’’ deletes the trailing slashes. Finally, we redirect the output into a text file called ‘names.txt’.

Sometimes it’s easier to write a bash one-liner than real code. But there’s a tradeoff: we don’t keep a record of it, and we don’t have an easy way to modify or improve it later. For now, though, let’s just use it.
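For comparison, here’s the same thing as a small Python script, a few lines longer but something we could keep around and tweak later (the directory layout and output filename match the one-liner above):

import os

# Rebuild names.txt: one product name per line, with the dashes from our
# slugified directory names turned back into spaces.
with open('names.txt', 'w') as out:
    for entry in sorted(os.listdir('.')):
        if os.path.isdir(entry):
            out.write(entry.replace('-', ' ') + '\n')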

Generating Poorly Rated Product Names

Armed with our list of the lowest rated product names, we can now train another LSTM network to start making our fake product names. Let’s create a new directory in char-rnn/data, call it amazonNames, and then drop our names.txt in as input.txt.

$ th train.lua -data_dir data/amazonNames/ -rnn_size 700 -num_layers 3 -dropout 0.5

Run this through sample.lua again, and we’ve got some very interesting 1 star Amazon names for our products.

The only thing left to do now is add a fake price for our fake products. Let’s join the metadata from the dataset against our list of 1 star review products to build the input for our last generator:

allASINs = []
with open('greaterThanTen.txt', 'r') as theASINs:
    for line in theASINs:
        allASINs.append(line.strip().split(' ')[1]) # create a list of all our clean ASINs
 
 
with open('metadata.json', 'r') as theMeta:  # open up the metadata dataset
    with open('prices.txt', 'w') as thePrices: # and open a file to save our a list of our prices
        for line in theMeta:  # nested for loops: this is slow, but at this point, whatevs.
            for ASIN in allASINs:
                if ASIN in line:  # if we have a matching ASIN in our line
                    try:
                        metadata = eval(line)
                        thePrices.write('$' + str(metadata['price']) + '\n')
                    except:
                        continue

But don’t run this code! It’s a natural first thought, but you could be waiting days for it to finish. Can you see why?

We’ve got a double for loop: for every line of metadata, we scan our entire list of ASINs all over again, so the total work is roughly the number of metadata lines times the number of ASINs. The question is, how can we get the above code working better?

Now, we can get rid of that second loop entirely, because Python’s ‘in’ operator will search our list of ASINs for us. So let’s take that loop out, and then keep thinking about how to make the remaining loop faster.

Once we’ve written out an ASIN’s price, do we need that ASIN anymore? The answer is no. So let’s get rid of each ASIN as we go:

allASINs = []
with open('greaterThanTen.txt', 'r') as theASINs:
    for line in theASINs:
        allASINs.append(line.strip().split(' ')[1]) # create a list of all our clean ASINs
 
with open('metadata.json', 'r') as theMeta:
    with open('prices.txt', 'w') as thePrices:
        for line in theMeta:
            metadata = eval(line)
            if metadata['asin'] in allASINs:
                print("Adding another asin price")
                try:
                    place = allASINs.index(metadata['asin'])
                    del allASINs[place] # make the list smaller as we go on
                    thePrices.write('$' + str(metadata['price']) + '\n')
                except:
                    continue

Great! Now this should be much faster, because as we work through the metadata, there are fewer and fewer ASINs left to match against. And indeed, it’s much faster than our first attempt, but it still might take an hour or so to finish.
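If we wanted to squeeze more out of it, the cheapest win would be a set: membership tests on a Python set are effectively constant time, while ‘in’ on a list scans it from the front every time. A sketch of that version (same files, same output) might look like this:

allASINs = set()
with open('greaterThanTen.txt', 'r') as theASINs:
    for line in theASINs:
        allASINs.add(line.strip().split(' ')[1])

with open('metadata.json', 'r') as theMeta:
    with open('prices.txt', 'w') as thePrices:
        for line in theMeta:
            try:
                metadata = eval(line)
                if metadata['asin'] in allASINs:  # constant-time set lookup
                    allASINs.discard(metadata['asin'])  # each ASIN only needs one price
                    thePrices.write('$' + str(metadata['price']) + '\n')
            except:
                continue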

Upscaling Our Generated Images

When that’s done, we’ve got the final bit of text to train a neural network on. Once we’ve trained a char-rnn on our prices, we’ve got everything we need to generate our fake Amazon products, reviews and all. So let’s write the code that assembles the finished products, starting from the bottom: our product images.

While we were training, the GAN was saving out 10×10 grids of 64×64 fake images. Let’s use Python and OpenCV to slice these grids up, rescale the pieces, and export them as JPEGs, so we can place them in our own container:

import cv2
import glob
 
# Each checkpoint PNG is a 10x10 grid of 64x64 generated images.
# Slice out each cell, upscale it, and save it as fileIndex_row_col.jpg,
# matching the /images/spirits/%i_%i_%i.jpg paths used by the export script.
# Make sure the 'spirits' directory exists first; cv2.imwrite won't create it.
for count, filename in enumerate(sorted(glob.glob('lfw_example_v1_1*.png'))):
    a = cv2.imread(filename)
    for i in range(0, 10):
        for j in range(0, 10):
            b = a[i*64:(i+1)*64, j*64:(j+1)*64]
            res = cv2.resize(b, (256, 256))
            cv2.imwrite('spirits/' + str(count) + '_' + str(i) + '_' + str(j) + '.jpg', res)

Exporting Products as a JSON File

Alright, so now we’ve got an entire directory filled with images, all at a more reasonable size. We’re basically done; all that’s left is to pair those images with generated names, prices, and reviews, and we’ll have something we can look at proudly.

import random
import subprocess
import sys
 
import json
 
import time
 
allDivinations = []
 
i = 0
while i < int(sys.argv[1]):
    p = subprocess.Popen(['th sample.lua productNames/lm_lstm_epoch50.00_1.3150.t7 -seed ' + str(random.randint(100,1000)) + ' > names_tmp.txt'], shell=True)
    status = p.wait()
    print(status)
 
    p = subprocess.Popen(['th sample.lua prices/lm_lstm_epoch50.00_1.2193.t7 -seed ' + str(random.randint(100,1000)) + ' > prices_tmp.txt'], shell=True)
    status = p.wait()
    print(status)
 
    p = subprocess.Popen(['th sample.lua reviews/lm_lstm_epoch2.28_1.1002.t7 -seed ' + str(random.randint(1,1000)) + ' > reviews_tmp.txt'], shell=True)
    status = p.wait()
    print(status)
    time.sleep(10)
    chosenImage = (random.randint(0,41), random.randint(0,9), random.randint(0,9))  # (checkpoint file, row, column) in our spirits/ images
 
    chosenNames = []
    chosenPrices = []
    chosenReviews = []
 
 
 
 
    with open('names_tmp.txt', 'r') as names:
        for count, line in enumerate(names):
            chosenNames.append(line)
 
    with open('prices_tmp.txt', 'r') as prices:
        for count, line in enumerate(prices):
            if count > 5:
                chosenPrices.append(line)
 
    with open('reviews_tmp.txt', 'r') as reviews:
        lastLine = 'bb'
        startedReview = False
        currentReview = ''
        for line in reviews:
            if line == '\n' and startedReview:
                currentReview += line
                chosenReviews.append(currentReview)
                currentReview = ''
                startedReview = False
            if lastLine == '\n':
                startedReview = True
                currentReview += '<h2>' + line + '</h2><br />'
                lastLine = line
            elif startedReview:
                if 'By' in line:
                    currentReview += '<br /><b>' + line + '</b><br />'
                    lastLine = line
                else:
                    currentReview += line
                    lastLine = line
            else:
                lastLine = line
    if len(chosenReviews) == 0:
        print("no review! skipping!")
        continue
    finalName = chosenNames[random.randint(0,len(chosenNames)-1)].strip()
    finalPrice = chosenPrices[random.randint(0,len(chosenPrices)-1)]
    if len(chosenReviews) < 2:
    	finalReview = chosenReviews[0]
    else:
    	finalReview = chosenReviews[random.randint(0,len(chosenReviews)-1)]
    finalImage = ("/images/spirits/%i_%i_%i.jpg" % chosenImage)
 
    divination = {'productName': finalName, 'price': finalPrice,
                  'finalReview': finalReview, 'finalImage': finalImage}
    allDivinations.append(divination)
    i += 1
 
with open('divinations.json', 'w') as final:
    json.dump(allDivinations, final)

Alright, if you look at the above code, you’ll notice a few things, the most obvious of which is that time.sleep. Yes, I’m being lazy here: on top of waiting for each sampler to exit, I’ve padded in an extra ten seconds before reading the temp files, just to be safe. That number worked for me, but you may want to tweak it, or drop it entirely and rely on the sequential waits.

Afterwards, we do a bit of formatting on the exports, adding a bit of hacky HTML styling to our JSON object. This is gross, and there’s a lot wrong with it, but it’s fine for a weekend hack. If we really wanted to do things properly, we’d separate each review out into review_title, review_author, and review_text fields in our JSON object.

Other than that, we want to make sure we get a full review, and that’s why we have the chosenReviews logic: we only keep reviews that had every piece generated for them, and otherwise we throw the whole thing out.

After letting this run, we should have a finished JSON file with 50 or so divinations in it:

$ python3 sorcery.py 55

Building the Final Presentation

Now, I just wrote a bit of Javascript and embedded it into this blog post. A jQuery .getJSON() call loads the JSON file we exported before. It then picks a random number, and uses that number to choose one of our created objects.

var productData = [];
jQuery(document).ready(function($) {
url = '/uploads/divinations.json';
$.getJSON( url, function( data ) {
  var items = [];
  productData = data;
  // pick a random index into our generated products
  chosenOne = Math.floor(Math.random() * productData.length);
  productCamel = productData[chosenOne].productName.replace(
  /(^|\s+)(\S)(\S*)/g,
  function(match, whitespace, firstLetter, rest) {
    return whitespace + firstLetter.toUpperCase() + rest.toLowerCase();
  }
);
  $('#productName').html('<strong style="font-size: 150%;">' + productCamel  + '</strong>');
  $('#price').text(productData[chosenOne].price);
  $('#review').html(productData[chosenOne].finalReview);
  $('#review h2').prepend('<img src="/images/onestar.png" />');
  $('#productImage').html('<img src="' + productData[chosenOne].finalImage + '" width="128" height="128" />');
});
$('#nextProd').click(function() {
  chosenOne = Math.floor(Math.random() * productData.length);
  productCamel = productData[chosenOne].productName.replace(
  /(^|\s+)(\S)(\S*)/g,
  function(match, whitespace, firstLetter, rest) {
    return whitespace + firstLetter.toUpperCase() + rest.toLowerCase();
  }
);
  $('#productName').html('<strong style="font-size: 150%;">' + productCamel + '</strong>');
  $('#price').text(productData[chosenOne].price);
  $('#review').html(productData[chosenOne].finalReview);
  $('#review h2').prepend('<img src="/images/onestar.png" />');
  $('#productImage').html('<img src="' + productData[chosenOne].finalImage + '" width="128" height="128" />');
});
});

Pretty hacky, and pretty embarrassing, but it gets the job done for now by dumping our JSON objects directly into our divs.

Next Steps

That was a lot!

But what could we do to improve the outputs we’ve got? For now, tuning these neural nets is mostly trial and error. I ended up getting much better results from a smaller network trained for three days on my Amazon reviews than from my first pass, which used as large a network as I could hold in my video card’s memory.

I think the most disappointing part of the results was the images. We could make them a bit better if there were some sort of cohesiveness to their generation. Maybe we could add a neural network at data-import time that tries to identify the product, and then recreates a mishmash of things that it knows; i.e., it could see an image of soap and give us back a differently colorized or shaped soap.

We could also make our product reviews and product names depend on what’s being generated in the photos. For now, the two have no awareness of one another, which makes each individual product that much more nonsensical. With some contextual awareness between networks, we might be able to generate something that makes a bit more sense.

And speaking of sense, there are a lot of generated reviews that don’t make any. Then again, a lot of the input reviews in our dataset don’t make sense either. We could run our output through a sentence parser to make sure it forms grammatical sentences, and that would improve things further.

Thanks for all the Fish

One of the greatest things about the machine learning community is how much everyone gives back. I can’t thank Julian and everyone else involved in the papers and datasets enough. Without them, I would never have had the courage (or idiocy) to take on this project. Thanks also to the devs behind torch-gan and char-rnn for sharing their code.

I hope this guide fosters more open dataset creation for people outside of academia and industry. I’d love to swap data sets with anyone else, or create a place to share more info on building them. I encourage you to send me any zany machine learning ideas you might have at my Twitter handle, @burningion.

Finally, shout out to Sammy for coming up with the idea!

Code, as always, is available on Github.

Citations for Amazon Dataset:

Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015

Inferring networks of substitutable and complementary products
J. McAuley, R. Pandey, J. Leskovec
Knowledge Discovery and Data Mining, 2015
