In this section, we will discuss two examples illustrating the different distance metrics on a text corpus.
Example 1
Let us consider the previously discussed example again.
doc1: “The quick brown fox jumps over the lazy dog.”
doc2: “The fox was red but dog was lazy.”
doc3: “The dog was brown, and the fox was red.”
We have learned the two basic and most popular distance metrics (Euclidean and cosine), which operate directly on vector representations, and one advanced metric, WMD. Now, we will compute the scores and the document retrieval order using the Euclidean and cosine metrics. Further, we will present a detailed example of WMD.
The search query is: “fox jumps over brown dog”
Tables 4.5 and 4.6 show how the documents are scored and retrieved for this query string using the two approaches.
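As a minimal sketch of how such scores can be computed (this is illustrative code, not the exact pipeline used to produce the tables; scikit-learn's CountVectorizer is an assumption here, and the exact values depend on tokenization and stop-word handling):

# Hypothetical sketch: scoring the three documents against the query
# using a simple document-term (BOW) matrix.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

docs = ["The quick brown fox jumps over the lazy dog.",
        "The fox was red but dog was lazy.",
        "The dog was brown, and the fox was red."]
query = "fox jumps over brown dog"

vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(docs)      # document-term (BOW) matrix
query_vector = vectorizer.transform([query])

print(euclidean_distances(query_vector, doc_vectors))   # lower value = closer
print(cosine_similarity(query_vector, doc_vectors))     # higher value = closer

Lower Euclidean distance and higher cosine similarity both indicate that a document is closer to the query.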
However, there are certain limitations with this approach. The document vector representation we explained is popularly called the bag-of-words (BOW) representation.
In BOW, the basic assumption about the data is that the order of words is not important; only the term frequency matters. In other words, BOW treats a document as an unordered collection of terms. One of the major limitations of this approach is that it does not capture context. It may fail to detect the difference between a valid positive sentence and a negative sentence composed of the same set of words, and even sarcasm may go undetected. For example:
“The quick brown fox jumps over the lazy dog” is, in this setting, identical to the document “The lazy brown dog jumps over the quick fox.” This is because both sentences have the same BOW vector representation.
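A tiny sketch (again assuming scikit-learn's CountVectorizer) makes this order insensitivity concrete:

# Both sentences contain the same multiset of words, so their BOW vectors match.
from sklearn.feature_extraction.text import CountVectorizer

pair = ["The quick brown fox jumps over the lazy dog",
        "The lazy brown dog jumps over the quick fox"]
bow = CountVectorizer().fit_transform(pair).toarray()
print((bow[0] == bow[1]).all())   # True: BOW cannot tell the two sentences apart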
BOW may still be a good fit if we intend to find similarity in topics, for example, grouping political news, sports news, scientific news, and so on. Now, we will again compute both metrics, this time based on the TF–IDF matrix (refer to Table 4.4), to see if there are any differences.
We see that the document retrieval order does not change in Tables 4.7 and 4.8 in this particular case. But if we have a large corpus, then TF–IDF works better than the document-term matrix, because it penalizes frequently occurring words and gives more weight to less frequent words.
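The TF–IDF-based scoring can be sketched in the same way by swapping the vectorizer; the exact numbers in Tables 4.7 and 4.8 depend on the particular TF–IDF weighting and normalization used, so the following is only an approximation:

# Hypothetical sketch: the same scoring as above, but with TF-IDF weights
# instead of raw term counts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

docs = ["The quick brown fox jumps over the lazy dog.",
        "The fox was red but dog was lazy.",
        "The dog was brown, and the fox was red."]
query = "fox jumps over brown dog"

tfidf = TfidfVectorizer()
doc_vectors = tfidf.fit_transform(docs)
query_vector = tfidf.transform([query])
print(euclidean_distances(query_vector, doc_vectors))   # lower value = closer
print(cosine_similarity(query_vector, doc_vectors))     # higher value = closer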
Table 4.7 Euclidean Distance Computation and Retrieval Order Based on TF–IDF
Document Euclidean Distance Order
doc1 0.1324138712 1
doc2 1.9478555502 3
doc3 1.5957805672 2
Table 4.5 Euclidean Distance Computation and Retrieval Order
Document Euclidean Distance Order
doc1 2.4494897428 1
doc2 3.7416573868 3
doc3 3.4641016151 2
Table 4.6 Cosine Computation and Retrieval Order
Document Cosine Similarity Order
doc1 0.6741998625 1
doc2 0.2581988897 3
doc3 0.3872983346 2
Example 2
Now, we will take another example based on a real dataset of Amazon's book reviews. It is a rich dataset that comprises real reviews written by many customers based on their experience. A few researchers have used this dataset to address some real challenges, such as answering product-related queries and recommending products [20,21]. The dataset can be downloaded from [19]. It consists of the following fields.
◾ reviewerID: unique identification of the reviewer
◾ asin: unique product ID
◾ reviewerName: name of the reviewing customer
◾ helpful: helpfulness rating of the review, e.g., 2/3
◾ reviewText: text of the review
◾ overall: rating of the product
◾ summary: summary of the review
◾ unixReviewTime: time of the review (Unix time)
◾ reviewTime: time of the review (raw)
In our case, we extracted the “asin” and “reviewText” columns for training our model. Word embedding is an approach in which words (in any language, e.g., English, Chinese, French, Hindi) are represented using dense vectors that still maintain their relative meaning. We can also say that a word embedding is a continuous vector space where semantically related words are mapped to nearby points. It is an enhanced way of representing words in a dense vector format, as compared with BOW (the document-term matrix), which is a sparse representation.
Dense vector representation of textual data is usually achieved using neural networks with a certain number of hidden nodes, trained over the underlying corpus. Once the model is trained, it can be reused for multiple purposes: finding synonyms, finding dissimilar words, machine translation, and sentence generation, to name a few.
A word embedding (W: words → R^n) is a representation technique that maps the words of a language to high-dimensional vectors (perhaps 100–300 dimensions) by training a neural network.
For simplicity, we have trained our model with ten dimensions (though this may not be desirable in practice). The dense vector representation of a few initial words from the review corpus, trained using the word2vec-CBOW approach, is shown in Table 4.9.
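A minimal training sketch is given below; it assumes gensim 4.x and a hypothetical list of review strings named reviews, and it does not reproduce the exact preprocessing used for Table 4.9:

# Hypothetical sketch: training a 10-dimensional word2vec-CBOW model on the
# review texts (assumes `reviews` is a list of review strings).
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

sentences = [simple_preprocess(r) for r in reviews]   # tokenize and lowercase
model = Word2Vec(sentences, vector_size=10, window=5, min_count=5, sg=0)  # sg=0 -> CBOW

print(model.wv["book"])   # a 10-dimensional dense vector, as in Table 4.9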
Table 4.8 Cosine Computation and Retrieval Order Based on TF–IDF
Document Cosine Similarity Order
doc1 0.7392715831 1
doc2 0.1437187361 3
doc3 0.2692925425 2
Table 4.9 Dense Vector Representation of a Few Initial Words from the Review Corpus (Ten Dimensions)
book −0.17 −0.96 0.48 4.78 0.18 0.58 −0.14 −1.44 −0.56 2.17
read −0.96 −1.13 0.53 4.97 −0.72 1.57 1.52 −0.10 −1.68 4.78
story 0.21 0.52 −2.28 3.11 0.93 2.44 −2.88 −2.36 0.69 0.56
one 1.13 −1.10 −0.24 2.58 −0.47 0.46 −0.09 0.69 −0.80 2.28
like 0.73 −0.21 −0.46 1.70 1.51 −0.57 0.20 0.25 −3.70 1.57
characters 1.14 0.58 −2.40 3.17 2.59 1.62 −6.31 −2.91 0.08 0.17
books −0.11 −2.24 1.65 3.82 1.01 0.90 −0.24 −1.66 −0.14 4.49
would −0.45 1.31 1.85 3.90 0.03 −1.85 −1.48 −0.30 −1.78 2.53
good −0.71 0.23 −1.32 3.41 0.63 1.03 0.40 −1.42 −2.34 3.17
reading −1.65 −0.52 2.02 6.05 −1.03 1.76 0.52 0.09 −0.86 2.38
could −0.72 0.84 1.69 6.37 1.05 1.08 −3.81 0.62 0.01 1.61
really 0.74 0.17 −0.95 4.69 1.59 −0.62 −1.61 −0.93 −1.98 1.72
rest 1.74 −3.95 −0.76 5.12 0.62 1.48 0.93 −0.23 −2.20 0.44
great −0.37 0.11 −2.73 3.17 −0.09 2.67 0.30 −1.50 −0.64 4.01
love 2.08 0.96 −4.09 2.02 −1.96 −1.46 −1.34 −1.51 −1.13 3.30
We will illustrate an example that has been trained using word2vec on Amazon's book reviews dataset. It learns the word embeddings based on the words used in the sentences, i.e., the reviews. word2vec has two different implementations: the CBOW model and the skip-gram model. Algorithmically, both models work similarly, except that CBOW predicts target words (e.g., “mat”) from source context words (“the cat sits on the”), whereas skip-gram does the inverse and predicts source context words from the target words.
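Assuming gensim (the library implied by the snippets that follow), switching between the two implementations is a single flag; this sketch reuses the tokenized sentences from the training sketch above:

# Hypothetical sketch: CBOW vs. skip-gram is controlled by the `sg` flag
# (assumes the tokenized `sentences` from the earlier training sketch).
from gensim.models import Word2Vec

cbow_model = Word2Vec(sentences, vector_size=10, sg=0)       # sg=0 -> CBOW
skipgram_model = Word2Vec(sentences, vector_size=10, sg=1)   # sg=1 -> skip-gram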
Query string: “good fantastic romantic, jealous, motivational”
Results for the query string are shown in Table 4.10.
It can be clearly observed that the WMD of the query string to itself is 0.0, and sentences that have closely related meanings have low scores. However, if the sentences are not related, then the distance is larger, as we can see in the case of the last line, “A favorite. can’t wait for the movie”.
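These distances can be reproduced approximately from the trained embeddings; the sketch below assumes gensim's wmdistance on the model trained earlier (our assumption about the implementation, and it requires an optimal-transport backend such as the POT package):

# Hypothetical sketch: WMD between the query and one candidate review,
# computed over the trained word2vec vectors (assumes `model` from the
# earlier training sketch).
from gensim.utils import simple_preprocess

query_tokens = simple_preprocess("good and easy reading having fantastic romantic, motivational story")
review_tokens = simple_preprocess("It's a book definitely for relaxation. The courtroom description is fantastic. Nice easy reading. Enjoyed it very much.")

print(model.wv.wmdistance(query_tokens, query_tokens))    # 0.0 for identical documents
print(model.wv.wmdistance(query_tokens, review_tokens))   # small for closely related documents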
Apart from information retrieval tasks, these word embeddings can also help us find similar words or spot the odd word out in a given set of words.
For example, if “book” is related to “kindle,” then “children” is related to ????
result = model.most_similar(positive=['book', 'children'], negative=['kindle'], topn=5)
print(result)
[('teenagers', 0.8260207772254944), ('age', 0.8220992684364319),
('normal', 0.8047794699668884), ('girls', 0.8042588233947754),
('older', 0.8039965629577637)]
Similarly, we can find the odd word out, i.e., the word that is least related to the others:
result = model.doesnt_match(“motivational fabulous bad good”.split())
print(result)
motivational
Table 4.10 WMD Computation and Retrieval for the Query String
WMD Document
0.0 (QUERY:) good and easy reading having fantastic romantic, motivational story
2.8175 This is a good story and story and easy read. I am a Grisham fan and this did not disappoint.
2.9357 Awesome book I was very impressed with the story the plot was good it made me cry hf oe nf pdd awww ggtt vgvbh cvcvg dfrw shyev
3.0640 This was a wonderful book, it was an easy read with a great story. I am happy to own this book.
3.1404 It’s a book definitely for relaxation. The courtroom description is fantastic. Nice easy reading. Enjoyed it very much.
8.8928 A favorite. can’t wait for the movie
result = model.doesnt_match("notes book author players reviews".split())
print(result)
players
result = model.doesnt_match("author men women children".split())
print(result)
author
The results presented here depend on the training corpus, and we may expect some changes in the results as the corpus changes. The quality of the results is likely to improve as the size of the dataset increases.