I was puzzled by the way LSTMs are used to do sentiment analysis. The book Deep Learning with Keras by Antonio Gulli finally helped me understand the mechanics of the LSTM.

Data Preparation

The following walks through the steps needed to perform sentiment analysis on a labeled dataset.

from keras.layers.core import Activation, Dense, Dropout, SpatialDropout1D
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.models import Sequential
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split
import collections
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
from nltk import FreqDist
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # needed once so that word_tokenize works

train_file = r"C:\garage\reviews\Deep-Learning-with-Keras-Antonio-Gulli\sentiment\training.txt"
test_file  = r"C:\garage\reviews\Deep-Learning-with-Keras-Antonio-Gulli\sentiment\testdata.txt"
train_data = []
train_labels = []

with open(train_file, encoding='utf-8') as f:
    for line in f:
        label, sentence = line.strip().split("\t")
        train_labels.append(label)
        train_data.append(sentence.strip())
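The loop assumes each line of training.txt holds a numeric label and a sentence separated by a tab. A made-up illustration of the format (not actual lines from the dataset):

1	this movie was a lot of fun
0	the plot made no sense at all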

Tokenization

Once the data is loaded, one needs to limit the vocabulary, convert the words into integers, and replace infrequent words with a special UNK token.

# Tokenize each review and build a frequency distribution over the words
train_data      = [word_tokenize(t.lower()) for t in train_data]
words           = [t for sent in train_data for t in sent if t.isalpha()]
freq_dist       = FreqDist(words)
MAX_FEATURES    = 2000        # keep only the 2000 most common words
MAX_SENTENCE_LENGTH = 40      # pad/truncate every review to 40 tokens
common_words    = freq_dist.most_common(MAX_FEATURES)
word2idx        = {}
word2idx['PAD'] = 0           # reserved for padding
word2idx['UNK'] = 1           # reserved for out-of-vocabulary words
for i, (w, freq) in enumerate(common_words):
    word2idx[w] = i + 2

idx2word = {v: k for k, v in word2idx.items()}
X = np.zeros((len(train_data), MAX_SENTENCE_LENGTH))
Y = np.zeros((len(train_data), 1))

for i, sentence in enumerate(train_data):
    if len(sentence) > MAX_SENTENCE_LENGTH:
        sentence = sentence[:MAX_SENTENCE_LENGTH]
    else:
        sentence = ['PAD'] * (MAX_SENTENCE_LENGTH - len(sentence)) + sentence
    X[i, :] = [word2idx.get(word, word2idx['UNK']) for word in sentence]
    Y[i, 0] = int(train_labels[i])
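A quick way to see what this produces is to run one made-up sentence through the same lookup (the exact indices depend on the corpus frequencies, so the output will vary):

sample = word_tokenize("what a great movie".lower())
sample = ['PAD'] * (MAX_SENTENCE_LENGTH - len(sample)) + sample
# Words inside the top 2000 get their own index; everything else maps to UNK (1)
print([word2idx.get(w, word2idx['UNK']) for w in sample])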

Preparing the training and test data

Once the data is vectorized, one can split it into training and validation samples.

X_train, X_valid, Y_train, Y_valid = train_test_split(X, Y,
                                                      test_size=0.2,
                                                      random_state=1234)
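A quick sanity check on the split (the row counts depend on how many lines the training file has):

print(X_train.shape, Y_train.shape)   # 80% of the rows, each 40 tokens wide
print(X_valid.shape, Y_valid.shape)   # remaining 20%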

Keras Layers

Various layers need to be created in Keras:

  • Input layer with shape (None, Max Sequence Length)
  • Embedding layer that gives out a tensor with the shape (None, Max Sequence Length, Embedding Dimension)
  • LSTM layer that gives out a tensor with the shape (None, Hidden Layer Size)
  • Dense layer that applies a sigmoid transformation to give the probability of a positive label
EMBEDDING_SIZE = 128
HIDDEN_LAYER_SIZE = 64
BATCH_SIZE = 32
NUM_EPOCHS = 10

model = Sequential()
model.add(Embedding(len(word2idx), EMBEDDING_SIZE,
                    input_length=MAX_SENTENCE_LENGTH))

model.add(SpatialDropout1D(0.2))
model.add(LSTM(HIDDEN_LAYER_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=['accuracy'])
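Calling model.summary() is an easy way to confirm the tensor shapes from the layer list above. Note that the embedding input dimension is len(word2idx) = 2002: the 2000 common words plus PAD and UNK.

model.summary()
# Output shapes should read, layer by layer:
#   Embedding        -> (None, 40, 128)
#   SpatialDropout1D -> (None, 40, 128)
#   LSTM             -> (None, 64)
#   Dense            -> (None, 1)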

Train and Test the model

history = model.fit(X_train, Y_train,
                    batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,
                    validation_data=(X_valid, Y_valid))

model.evaluate(X_valid, Y_valid)
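Two optional follow-ups, sketched here as my own additions rather than part of the book's code: plotting the training curves (this is what the matplotlib import is for) and scoring a new sentence with the same preprocessing used for training. The helper name and the sample sentence are made up.

plt.plot(history.history['acc'], label='train')        # key is 'accuracy' on newer Keras
plt.plot(history.history['val_acc'], label='validation')
plt.legend()
plt.show()

def predict_sentiment(text):
    # Hypothetical helper: apply the training-time preprocessing to one sentence
    tokens = word_tokenize(text.lower())[:MAX_SENTENCE_LENGTH]
    tokens = ['PAD'] * (MAX_SENTENCE_LENGTH - len(tokens)) + tokens
    x = np.array([[word2idx.get(w, word2idx['UNK']) for w in tokens]])
    return model.predict(x)[0][0]   # probability that the review is positive

print(predict_sentiment("what a wonderful movie"))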

Finally, as I work through these examples, I am beginning to understand the various aspects of Recurrent Neural Networks.