The results from the experiment described in Section 4.4 suggest three counterintuitive conclusions. The first is that additional semantic information about bigrams does not improve performance on comparative meaning tasks. This is counterintuitive because the sentences “John eats candy” and “Candy eats John” have identical unigram cosines (1.0) but differing bigram cosines (0) in the linear Context_global space. Clearly, if an evaluation dataset consisted entirely of sentences like these, or of sentences whose words have been arbitrarily scrambled, e.g. “John candy eats,” then the bigram vectors would carry
more information. However, it appears that for the MRPC and the 20K corpus, the information carried by the bigram vectors is largely redundant with the information carried by the unigram vectors. It is tempting, but unsupported by the data, to extrapolate beyond these two corpora to language in general. No claim is made here that semantic bigram information is useless for language in general; it is clear, however, that for current comparative meaning tasks, as represented by these two corpora, bigram vectors are not useful.
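To make the word-order point concrete, the following sketch computes cosines over raw unigram and bigram count vectors for the two example sentences. This is a deliberate simplification: the methods evaluated in Section 4.4 apply weighting and SVD, which this sketch omits.

    # Why "John eats candy" and "Candy eats John" are indistinguishable
    # to unigram vectors but maximally different to bigram vectors.
    from collections import Counter
    from math import sqrt

    def cosine(a, b):
        # Cosine similarity between two sparse count vectors.
        dot = sum(a[k] * b[k] for k in a)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    s1 = "john eats candy".split()
    s2 = "candy eats john".split()

    print(cosine(ngrams(s1, 1), ngrams(s2, 1)))  # 1.0 -- identical unigrams
    print(cosine(ngrams(s1, 2), ngrams(s2, 2)))  # 0.0 -- no shared bigrams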
Another counterintuitive result is that Context_local and Context_global appear to be largely equivalent with respect to these two tasks. Recall that Context_global defines context to be anywhere in a sentence, whereas Context_local defines context as the word on the left and the word on the right (Section 3.3.3). Previous research has tended to associate semantic features with Context_global (Landauer, Foltz, and Laham 1998) and syntactic features more with Context_local (Redington, Chater, and Finch 1998). Therefore it is curious that these two ways of defining context would lead to equivalent results on a comparative meaning task. The major difference between Context_local as applied here and local context in previous approaches is the subsequent use of SVD. It is possible that through
dimensionality reduction, SVD has distilled the latent semantic information in Context_local. This interpretation is supported by the findings of Burgess, Livesay, and Lund (1998). Using a method very similar to Context_local, but without SVD, they found that sentence vectors fared poorly for making semantic judgments.
This previous result is in stark contrast to the result obtained here, which indicates that Context_local and Context_global are approximately equal at making semantic judgments. Again, the most salient difference between Context_local and the method of Burgess, Livesay, and Lund (1998) is the application of SVD.
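The contrast between the two definitions of context can be illustrated with a small sketch. The toy corpus and raw counts are simplifications for illustration; the SVD step and weighting applied to such matrices in Section 4.4 are omitted here.

    # Context_global: a word co-occurs with every other word in its sentence.
    # Context_local: a word co-occurs only with its immediate neighbors.
    from collections import defaultdict

    def global_context(sentences):
        counts = defaultdict(lambda: defaultdict(int))
        for sent in sentences:
            for i, w in enumerate(sent):
                for j, c in enumerate(sent):
                    if i != j:
                        counts[w][c] += 1
        return counts

    def local_context(sentences):
        counts = defaultdict(lambda: defaultdict(int))
        for sent in sentences:
            for i, w in enumerate(sent):
                if i > 0:
                    counts[w][sent[i - 1]] += 1  # left neighbor
                if i < len(sent) - 1:
                    counts[w][sent[i + 1]] += 1  # right neighbor
        return counts

    corpus = [["john", "eats", "candy"], ["mary", "eats", "cake"]]
    print(dict(global_context(corpus)["john"]))  # {'eats': 1, 'candy': 1}
    print(dict(local_context(corpus)["john"]))   # {'eats': 1}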
Finally, it is noteworthy that the hierarchical methods, which are more sensitive to structure, fared worse on the 20K corpus than the linear methods.
Recall that the hierarchical methods first parse each sentence and then compare the roots of the two dependency trees. The results from the 20K corpus indicate that the hierarchical methods suffer from poor discrimination relative to the linear methods. This suggests that by comparing the roots of the trees, information that could be used to discriminate between sentences is eliminated. It is concluded that hierarchical methods for comparative meaning should be sensitive to
structure at multiple levels in the hierarchy, rather than focusing on the root of the hierarchy. A possible method for doing so would be to parse a corpus with a dependency parser and then count the frequency of dependencies in each
document. In other words, this method would create a dependency by document matrix. By abstracting away from the tree of dependencies, all dependencies would be given equal weight, and no information would be lost. Because such a method would generate as large a matrix as the bigram methods investigated in this dissertation, it would likely not be possible without the non-orthogonal SVD approach utilized here.
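A sketch of how such a dependency by document matrix might be assembled is given below. The choice of spaCy and its en_core_web_sm model is purely illustrative; no particular dependency parser is prescribed above.

    # Rows are dependency types, encoded as (head, relation, child) triples;
    # columns are documents; cells are frequencies.
    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")

    def dependency_counts(doc_text):
        # Count the dependency triples occurring in one document.
        doc = nlp(doc_text)
        return Counter(
            (tok.head.lemma_, tok.dep_, tok.lemma_)
            for tok in doc
            if tok.dep_ != "ROOT"  # the root has no meaningful head
        )

    docs = ["John eats candy.", "Candy is eaten by John."]
    columns = [dependency_counts(d) for d in docs]

    # Assemble the dependency by document matrix.
    vocab = sorted({dep for col in columns for dep in col})
    matrix = [[col[dep] for col in columns] for dep in vocab]
    for dep, row in zip(vocab, matrix):
        print(dep, row)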
5.2.3.1 Broader Implications

There are several theoretical and practical implications of the experiment described in Section 4.4. One theoretical
implication is that the linear methods evaluated in this experiment are sensitive to word order. This is the first time that a latent semantic method has been sensitive to word order, which has previously been cited as a weakness of LSA
(Wiemer-Hastings and Zipitria 2001). By being sensitive to word order, it is possible that the linear methods described in this experiment will extend latent
semantic approaches to new problem areas and widen the research effort behind latent semantic approaches.
However, a second theoretical implication is that sensitivity to word order may not be as useful a property as was previously believed. Indeed, the results from this experiment suggest that bigram vectors contribute very little new information beyond what is already conveyed by unigram vectors. Thus the range of new applications for latent semantic grammars on meaning tasks may be narrower, focusing more on domains where the same set of words is often reordered to express different ideas. An example of such an
application might be causality, where the order of words is more critical to the meaning of the sentence.
An important practical result of the experiment described in Section 4.4 is that linear latent semantic grammars had equivalent performance to LSA. The result obtained in this experiment has great practical relevance, since LSA, upon which Context_global is based, is patented, while Context_local is not. The LSA patent has
hindered the commercialization of some research products, such as AutoTutor (Graesser et al. 2005). The problem appears to be that commercialization requires protected intellectual property, i.e. patents, in order to be appealing to investors.
The patent on LSA is specific to the singular value decomposition of term by document matrices. Since Context_local involves the singular value decomposition of n-gram by adjacent word matrices, it would appear to represent a distinctly new invention. The fact that both LSA and Context_local use SVD should not be an issue: the LSA patent apparently covers not the application of SVD to any matrix whatsoever, but its application to term by document matrices specifically. Therefore applications such as AutoTutor may now incorporate Context_local instead of LSA and be commercialized without the patent concerns described above.
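For concreteness, the following sketch builds a small bigram by adjacent word matrix and reduces it with an off-the-shelf sparse truncated SVD (scipy's svds). This is a stand-in for illustration only: the weighting scheme and the particular SVD variant discussed above are not reproduced here.

    # Rows are bigrams, columns are words adjacent to those bigrams.
    from scipy.sparse import lil_matrix
    from scipy.sparse.linalg import svds

    sentences = [["john", "eats", "candy"], ["mary", "eats", "cake"]]

    # Index bigrams and adjacent words, collecting co-occurrence pairs.
    bigrams, words, pairs = {}, {}, []
    for s in sentences:
        for i in range(len(s) - 1):
            bg = (s[i], s[i + 1])
            bigrams.setdefault(bg, len(bigrams))
            for j in (i - 1, i + 2):  # word left of / right of the bigram
                if 0 <= j < len(s):
                    words.setdefault(s[j], len(words))
                    pairs.append((bg, s[j]))

    M = lil_matrix((len(bigrams), len(words)))
    for bg, w in pairs:
        M[bigrams[bg], words[w]] += 1

    # Truncated SVD; k must be strictly less than min(M.shape).
    k = min(M.shape) - 1
    U, S, Vt = svds(M.tocsr(), k=k)
    bigram_vectors = U * S  # reduced bigram representations
    print(bigram_vectors.shape)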