This notebook was created by Jean de Dieu Nyandwi for the love of the machine learning community. For any feedback, errors, or suggestions, he can be reached via email (johnjw7084 at gmail dot com), Twitter, or LinkedIn.
Using Word Embeddings to Represent Texts¶
1. Intro to Word Embeddings¶
Word embeddings represent words as feature vectors. Words with similar meanings get similar representations.
Embeddings are trainable, which means they improve during model training, and similar words end up with similar vectors.
In vector terms, similar words point in similar directions.
The Embedding Projector lets you visualize and play with embeddings. Take a tour there and search for some words.
For example, if you search for programming, you will see words similar in meaning to programming nearby in the same vector space.
Keras provides an Embedding layer. We will use it in the practice below, but feel free to learn more in the documentation.
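To make "similar direction" concrete, here is a minimal sketch (with made-up toy vectors, not real embeddings) that compares word vectors with cosine similarity: a value close to 1 means the vectors point the same way.
import numpy as np

# Toy 3-dimensional "embeddings" -- invented values, for illustration only
king = np.array([0.8, 0.3, 0.1])
queen = np.array([0.7, 0.4, 0.1])
apple = np.array([0.1, 0.1, 0.9])

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means the same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(king, queen))  # close to 1: similar direction
print(cosine_similarity(king, apple))  # much lower: different direction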
2. Embeddings In Practice¶
We are going to practice embeddings on IMDB reviews. The IMDB reviews dataset is the hello world of natural language processing, just like MNIST is for computer vision.
We will build a sentiment analysis model that classifies movie reviews as positive or negative.
2.1. Getting the IMDB Reviews Data¶
The IMDB dataset contains positive and negative movie reviews. You can learn more about the dataset here.
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
title = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
year = {2011},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {142--150},
url = {http://www.aclweb.org/anthology/P11-1015}
}
We will get the data from the datasets built into Keras. It is also available in TensorFlow Datasets.
Imports¶
import tensorflow as tf
from tensorflow import keras
import numpy as np
Loading the training and validation data from Keras datasets. There are 25,000 samples for training and 25,000 samples for validation.
Also, setting max_features to 20000 means we only keep the 20,000 most frequent words, and a sequence_length of 200 means each review will be padded or truncated to a length of 200 tokens.
max_features = 20000
sequence_length = 200
(train_data, train_labels), (val_data, val_labels) = keras.datasets.imdb.load_data(num_words=max_features)
train_data[:2]
array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]), list([1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 10156, 4, 1153, 9, 194, 775, 7, 8255, 11596, 349, 2637, 148, 605, 15358, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95])], dtype=object)
train_labels[:10]
array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])
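If you are curious about what those integers stand for, here is a quick sketch that decodes the first review back into words. It assumes the default load_data settings, where word indices are shifted by 3 because 0, 1, and 2 are reserved for padding, the start-of-review marker, and out-of-vocabulary words.
# Decode the first review back to words (sketch; assumes the default index_from=3)
word_index = keras.datasets.imdb.get_word_index()
reverse_index = {index + 3: word for word, index in word_index.items()}
reverse_index.update({0: '<pad>', 1: '<start>', 2: '<oov>'})
decoded_review = ' '.join(reverse_index.get(token, '<oov>') for token in train_data[0])
print(decoded_review[:200])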
2.2. Preparing the Dataset¶
The IMDB dataset that we loaded from Keras is already tokenized (each review is a list of integer word indices), so we won't have to do much work here.
We only need to pad the token sequences to a fixed length.
from tensorflow.keras.preprocessing.sequence import pad_sequences
train_data = pad_sequences(train_data, maxlen=sequence_length)
val_data = pad_sequences(val_data, maxlen=sequence_length)
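As a quick illustration of what pad_sequences does (a toy example, not the IMDB data): by default it pads shorter sequences with zeros at the front and truncates longer ones from the front, so every sequence ends up with length maxlen.
# Toy example: shorter sequences are zero-padded at the front,
# longer ones are truncated from the front (the Keras defaults)
pad_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], maxlen=4)
# array([[ 0,  1,  2,  3],
#        [ 0,  0,  4,  5],
#        [ 7,  8,  9, 10]], dtype=int32)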
2.3. Build, Compile and Train the Model¶
The purpose of this lab is to learn embeddings. The Keras Embedding layer takes the integer tokens and maps each one to a dense vector, and the vectors of words with similar semantic meaning end up pointing in similar directions.
In essence, the Embedding layer takes the vocabulary size (the number of words) and the number of embedding dimensions as its main arguments.
# Embedding layer: represent a vocabulary of 100 words with 5-dimensional vectors
embedding_layer = tf.keras.layers.Embedding(100, 5)
embed_integers = embedding_layer(tf.constant([1, 2, 3, 4]))
embed_integers.numpy()
array([[ 1.2614403e-02, -1.1478484e-02,  8.1606880e-03, -3.9430130e-02,
         2.0547520e-02],
       [-8.9646950e-03, -1.1008494e-03, -3.3753276e-02, -3.0158116e-02,
         4.5642104e-02],
       [ 5.8295615e-03,  2.1967243e-02,  2.9412154e-02,  1.9587997e-02,
        -4.6676543e-02],
       [ 2.0423044e-02, -2.1431793e-02,  2.7912881e-02,  1.5731920e-02,
         3.9350241e-05]], dtype=float32)
The above is exactly what the embedding layer does: it represents each word as a feature vector of a given dimension.
Embeddings are trainable, which means that as we train the model, the vectors of similar words move closer and closer together. You can get a taste of this by visiting the Embedding Projector.
# Create a model
embedding_dim = 16
model = tf.keras.Sequential([
tf.keras.layers.Embedding(max_features, embedding_dim, input_length=sequence_length),
tf.keras.layers.GlobalAveragePooling1D(),
tf.keras.layers.Dense(6, activation='relu'),
tf.keras.layers.Dropout(0.5),
tf.keras.layers.Dense(1, activation='sigmoid')
])
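The GlobalAveragePooling1D layer averages the 200 token embeddings of each review into a single 16-dimensional vector that the dense layers can classify. A minimal shape check on a random toy batch (not the IMDB data) shows the reduction:
# GlobalAveragePooling1D averages over the sequence axis:
# (batch, sequence_length, embedding_dim) -> (batch, embedding_dim)
toy_batch = tf.random.uniform((1, sequence_length, embedding_dim))
tf.keras.layers.GlobalAveragePooling1D()(toy_batch).shape  # TensorShape([1, 16])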
We will use TensorBoard to visualize training metrics and embeddings.
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")
We will compile the model with the binary_crossentropy loss and the adam optimizer.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
Before training the model, let's take a look at its summary.
model.summary()
Model: "sequential_7" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_11 (Embedding) (None, 200, 16) 320000 _________________________________________________________________ global_average_pooling1d_6 ( (None, 16) 0 _________________________________________________________________ dense_14 (Dense) (None, 6) 102 _________________________________________________________________ dropout_5 (Dropout) (None, 6) 0 _________________________________________________________________ dense_15 (Dense) (None, 1) 7 ================================================================= Total params: 320,109 Trainable params: 320,109 Non-trainable params: 0 _________________________________________________________________
history = model.fit(train_data,
train_labels,
validation_data = (val_data, val_labels),
epochs = 5,
callbacks=[tensorboard_callback])
Epoch 1/5
782/782 [==============================] - 7s 8ms/step - loss: 0.6281 - accuracy: 0.6886 - val_loss: 0.5163 - val_accuracy: 0.8107
Epoch 2/5
782/782 [==============================] - 6s 8ms/step - loss: 0.4767 - accuracy: 0.8266 - val_loss: 0.4096 - val_accuracy: 0.8722
Epoch 3/5
782/782 [==============================] - 6s 8ms/step - loss: 0.4086 - accuracy: 0.8503 - val_loss: 0.3566 - val_accuracy: 0.8772
Epoch 4/5
782/782 [==============================] - 6s 8ms/step - loss: 0.3586 - accuracy: 0.8715 - val_loss: 0.3412 - val_accuracy: 0.8788
Epoch 5/5
782/782 [==============================] - 6s 8ms/step - loss: 0.3293 - accuracy: 0.8836 - val_loss: 0.3413 - val_accuracy: 0.8772
2.4. Visualizing the Model Results¶
Since we provided a TensorBoard callback during training, we can visualize the training metrics and the learned embeddings.
%load_ext tensorboard
%tensorboard --logdir logs
The tensorboard extension is already loaded. To reload it, use: %reload_ext tensorboard
Reusing TensorBoard on port 6006 (pid 1526), started 0:51:24 ago. (Use '!kill 1526' to kill it.)
TensorBoard is a fantastic model visualization tool. If you are running this notebook, spend some time looking over the graphs in TensorBoard.
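If you would like to explore the learned embeddings in the Embedding Projector (projector.tensorflow.org), one common approach is to export the Embedding layer's weight matrix together with the matching vocabulary as TSV files and upload them there. A rough sketch, assuming the default IMDB index offset of 3 and the hypothetical file names vectors.tsv and metadata.tsv:
import io

# Weight matrix of the Embedding layer: shape (max_features, embedding_dim)
embedding_weights = model.layers[0].get_weights()[0]

# Rebuild the index -> word mapping with the default IMDB offset of 3
word_index = keras.datasets.imdb.get_word_index()
reverse_index = {index + 3: word for word, index in word_index.items()}

# Write one embedding vector per line, plus a matching metadata file of words
with io.open('vectors.tsv', 'w', encoding='utf-8') as vec_file, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as meta_file:
    for token in range(3, max_features):  # skip the reserved indices 0-2
        word = reverse_index.get(token)
        if word is None:
            continue
        vec_file.write('\t'.join(str(x) for x in embedding_weights[token]) + '\n')
        meta_file.write(word + '\n')
The two files can then be uploaded to the Embedding Projector to browse the learned vectors.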
2.5. Testing the Model on New Texts¶
Let's test the sentiment classifier on some sample reviews.
negative_review = ["This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. This must simply be their worst role in history"]
# Convert the test sentence to tokens and sequences
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(negative_review)
neg_sequences = tokenizer.texts_to_sequences(negative_review)
# Pad the sequence to have the size of the sequences used for model training
neg_sequences = pad_sequences(neg_sequences , maxlen = sequence_length)
# Predict the review
model.predict(neg_sequences)
array([[0.54507416]], dtype=float32)
positive_review = ["Very beautiful and awesome movie. This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. "]
# Convert the test sentence to tokens
tokenizer_2 = Tokenizer(num_words=20000)
tokenizer_2.fit_on_texts(positive_review)
pos_sequences = tokenizer_2.texts_to_sequences(positive_review)
pos_sequences = pad_sequences(pos_sequences, maxlen = sequence_length)
model.predict(pos_sequences)
array([[0.4356643]], dtype=float32)
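Note that both predictions hover around 0.5, and the negative review even scores slightly higher than the positive one. That is because fitting a fresh Tokenizer on a test sentence assigns word indices that have nothing to do with the vocabulary the model was trained on. To get meaningful predictions, new text should be encoded with the same IMDB word index that load_data uses. A rough sketch, assuming the default settings (indices shifted by 3, index 2 for out-of-vocabulary words, index 1 marking the start of a review):
# Encode raw text with the IMDB training vocabulary (sketch; default load_data settings)
word_index = keras.datasets.imdb.get_word_index()

def encode_review(text):
    tokens = [1]  # reviews start with the start-of-sequence marker
    for word in text.lower().split():
        index = word_index.get(word.strip(".,!?\"'"))
        # Unknown words and words outside the top max_features become the OOV token (2)
        tokens.append(index + 3 if index is not None and index + 3 < max_features else 2)
    return pad_sequences([tokens], maxlen=sequence_length)

model.predict(encode_review(negative_review[0]))  # expected to lean towards 0 (negative)
model.predict(encode_review(positive_review[0]))  # expected to lean towards 1 (positive)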
That's it for word embeddings. Embeddings alone are not enough, though. In the next notebook, we will introduce Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs), some of the fundamental sequence models for language that are widely used today.