MINISTRY OF EDUCATION AND TRAINING
VIETNAM ACADEMY OF SCIENCE AND TECHNOLOGY
GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY
Tran Lam Quan
SOME SEARCHING TECHNIQUES FOR ENTITIES BASED ON IMPLICIT SEMANTIC
RELATIONS AND CONTEXT-AWARE QUERY SUGGESTIONS
Major: Mathematical Theory of Informatics
Code: 9.46.01.10
SUMMARY OF MATHEMATICS DOCTORAL THESIS
Hanoi - 2020
The work was completed at: Graduate University of Science and Technology - Vietnam Academy of Science and Technology
Scientific supervisor: Dr. Vũ Tất Thắng
The thesis can be consulted at:
- Library of the Graduate University of Science and Technology
- National Library of Vietnam
INTRODUCTION
1 The necessity of the thesis
In the big data era, new data is generated incessantly, and the search engine has become an essential tool for finding information. According to statistics, approximately 71% of web search queries include entity names [7], [8]. Consider a query that consists only of entity names: "Vietnam", "Hanoi", "France". Intuitively, we see the semantics underlying this query: the relationship between the entity pair "Vietnam": "Hanoi" is similar to the relationship between the pair "France": "?". This is one of the "natural" abilities of humans: the ability to infer unknown information or knowledge by analogy.
Given the above query, a human can answer immediately, but a Search Engine (SE) can only find documents containing the aforementioned keywords; it cannot directly give the answer "Paris". The same happens with real-world questions such as "If Fansipan is the highest mountain in Vietnam, which one is the highest in Tibet?" or "If you know Elizabeth as Queen of England, who is the Japanese monarch?". For queries with such analogous relationships, a keyword search engine has difficulty giving answers, while humans can easily make analogical inferences.
Figure 1.1: Result list returned by a keyword SE for the query "Việt Nam", "Hà Nội", "Pháp" (Vietnam, Hanoi, France)
Researching and simulating the human ability to infer from a familiar semantic domain ("Vietnam", "Hanoi") to an unfamiliar semantic domain ("France", "?") is the purpose of the first problem.
The second problem concerns query suggestion. According to statistics, the queries users enter are often short, ambiguous, and polysemous [1-6]. A search session returns many results, but most of them do not match the user's search intent¹. Many research directions have therefore been pursued to improve results and assist searchers, including query suggestion, query rewriting, query expansion, personalized recommendation, and ranking/re-ranking of search results.
Query suggestion research often applies traditional techniques such as query clustering and similarity measurement [9], [10]. However, traditional techniques have three disadvantages. First, they can only suggest queries that are similar or related to the most recently entered query (the current query), and the suggestions are not guaranteed to be better than the current query. Second, they cannot reveal what most users tend to ask after the current query. Third, they do not treat the user's queries as a continuous sequence in order to capture the user's search intent. For example, on a keyword SE, type two consecutive queries q1: "Who is Joe Biden" and q2: "How old is he". q1 and q2 are semantically related, yet the results returned for q1 and q2 are two very different result sets. This shows the disadvantage of keyword search.
1 https://static.googleusercontent.com/media/guidelines.raterhub.com/en//searchqualityevaluatorguidelines.pdf
Figure 1.2: The answer lists from the SE corresponding to q1 and q2
By capturing a continuous query sequence, in other words the search context, the SE can "understand" the user's search intent. Moreover, having captured the query sequence, the SE can suggest a sequence of queries that reflects collective knowledge: what the community often asks after q1, q2. This is the purpose of the second problem.
2 Thesis: Objectives, Layout and Contributions
The thesis researches, identifies, and experiments with methods to solve the two problems above. Following these objectives, the main contributions of the thesis are:
- Researching and building an entity search technique based on implicit semantic relations, using clustering methods to improve search efficiency.
- Applying context-aware techniques to build a vertical search engine that uses context awareness in its own knowledge domain (aviation data).
- Proposing a combined similarity measure for the context-aware query suggestion problem to improve suggestion quality.
CHAPTER 1: OVERVIEW
1.1 The problem of searching entities based on implicit semantic relations
Consider a query consisting of entities: "Qur’an": "Islam", "Gospels": "?". Humans can immediately deduce the "?", but the SE only returns documents that contain the above keywords and does not directly give the answer "Christianity". Because they only look for entities, query expansion and query rewriting techniques do not apply to relations whose meaning is hidden in an entity pair. Therefore, a new search form is studied, in which the query has the form {(A, B), (C, ?)}, where (A, B) is the source entity pair and (C, ?) is the target entity pair, and the two pairs (A, B), (C, ?) have a similar semantic relation. Specifically, when the user enters a query consisting of the 3 entities {(A, B), (C, ?)}, the SE has the task of searching for and listing candidate entities D (the entity marked "?"), such that each entity D has a semantic relation with C and the pair (C, D) has a relation similar to that of the pair (A, B). A semantic relation, in the narrow sense and from the lexical perspective, is expressed by the terms/patterns/context surrounding (before, between, after) the known entity pair². Because neither the semantic relation nor the similarity relation is stated explicitly in the query (the query consists of only 3 entities: A, B, C), this search form is called Implicit Relational Entity Search or Implicit Relational Search, in short: IRS.
Consider the input query q = "Mekong": "Vietnam", "?": "China", which contains only 3 entities. The query q does not describe a semantic relation ("longest river", "largest", "widest basin", etc.). The search model based on implicit semantic relations is responsible for finding the entity "?" that has a semantic relation with the entity "China", such that the pair "?": "China" is similar to the pair "Mekong": "Vietnam".
2 Birger Hjorland Link: http://vnlp.net
Finding/calculating the relational similarity between two pairs of entities is a difficult problem for several reasons. First, relational similarity changes over time: for the two entity pairs (Joe Biden, US President) and (Elizabeth, Queen of England), the similarity of the relations changes with terms of office. Second, entities have names (names of individuals, organizations, places, etc.) that are not common words and do not appear in dictionaries. Third, a pair of entities can have many different semantic relations, such as: "The Corona outbreak originated from Wuhan"; "Corona isolates Wuhan city"; "The number of Corona infections decreased gradually in Wuhan"; etc. Fourth, due to the time factor, two entity pairs may share little or none of the context surrounding them, for example Apple: iPod (in the 2010s) and Sony: Walkman (in the 1980s), so the two pairs are not recognized as similar. Fifth, a pair of entities may have a single semantic relation that is expressed in more than one way: "X was acquired by Y" and "Y buys X". Finally, the entity D is unknown: it is precisely what the search is trying to find.
Figure 1.3: Input query: "Cuba", "José Marti", "India" (implicit semantics: "national hero")
The query takes the form q = {(A, B), (C, ?)} and consists of only 3 entities: A, B, C. Identifying the relational similarity between the entity pairs (A, B) and (C, ?) is a necessary condition for determining the entity being sought. As an NLP (Natural Language Processing) problem, relational similarity is one of the most important tasks in searching for entities based on implicit semantic relations. The thesis therefore surveys the main research directions on relational similarity.
1.2 IRS - Related work
1.2.1 SMT - Structure Mapping Theory
SMT [12] considers similarity as a mapping of "knowledge" from a source domain to a target domain, according to the mapping rule: eliminate the attributes of the objects but maintain the relational mapping between objects from the source domain to the target domain.
Mapping rules: M: si → ti; (in which s: source, t: target)
Eliminate attribute: HOT(si) ↛ HOT(ti); MASSIVE(si) ↛ MASSIVE(ti);
Maintain relational mapping: Revolves(Planet, Sun) → Revolves(Electron, Nucleus)
Figure 1.5 shows that, because they have the same s (subject), o (object) structure, SMT considers the pairs (Planet, Sun) and (Electron, Nucleus) relationally similar, regardless of the fact that the source and target objects (Sun and Nucleus, Planet and Electron) are very different in their attributes (HOT, MASSIVE, etc.). With respect to the purpose of the thesis, if the query is ((Planet, Sun), (Electron, ?)), SMT will output the correct answer: "Nucleus".
Figure 1.5: Structure Mapping Theory (SMT)
However, SMT is not feasible for low-level structures (those lacking relations). Therefore, SMT is not feasible for the problem of searching entities based on implicit semantic relations.
1.2.2 Relational similarity based on the WordNet classification system
Cao [20] and Agirre [21] proposed relational similarity measures based on the similarity classification system in WordNet. However, as mentioned above, WordNet does not contain named entities. Thus, WordNet is not suitable for the entity search model.
1.2.3 VSM - Vector Space Model
Using the vector space model, Turney [13] represents each entity pair (A, B) as a vector whose elements are the occurrence frequencies of patterns containing the pair. The VSM measures relational similarity as follows: patterns are generated manually and submitted as queries to a Search Engine (SE), and the number of results returned by the SE is taken as the frequency of occurrence of each pattern. The relational similarity of two entity pairs is then computed as the cosine between their two frequency vectors.
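The cosine step of the VSM can be illustrated with a minimal sketch. The pattern list, the hit counts, and the names `pattern_hits` and `relational_similarity` are hypothetical values chosen for illustration; they are not taken from Turney's experiments or from the thesis.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a pair with no pattern hits is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical hit counts for three hand-written patterns, in this order:
# "B is the capital of A", "B, the largest city of A", "the B river in A"
pattern_hits = {
    ("Vietnam", "Hanoi"): [120, 15, 2],
    ("France", "Paris"): [200, 40, 1],
    ("Vietnam", "Mekong"): [0, 0, 90],
}

def relational_similarity(pair1, pair2):
    """Relational similarity of two entity pairs as the cosine of their pattern vectors."""
    return cosine(pattern_hits[pair1], pattern_hits[pair2])
```

With these toy counts, ("Vietnam", "Hanoi") scores far closer to ("France", "Paris") than to ("Vietnam", "Mekong"), because the first two pairs occur in the same patterns.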
1.2.4 LRA - Latent Relational Analysis
Extending VSM, Turney introduces LRA to determine the degree of relational similarity [14-16]. Like VSM, LRA uses a vector made up of the patterns/contexts containing the entity pair (A, B) and the frequencies of those patterns (patterns in n-gram format). In addition, LRA applies a thesaurus to generate variants such as: A bought B, A acquired B; X headquarters in Y, X offices in Y; etc. LRA selects the most frequent n-grams as the patterns associated with the entity pair (A, B), then builds a pattern - entity pair matrix, where each element of the matrix represents the frequency of the pair (A, B) in the pattern. To reduce the matrix dimension, LRA uses Singular Value Decomposition (SVD) to reduce the number of columns in the matrix. Finally, LRA applies the cosine measure to compute the relational similarity between two pairs of entities.
Although it is an effective approach to identifying relational similarity, LRA requires a long time to compute and process: it needs 8 days to answer 374 SAT analogy questions [17]. This is unacceptable for a real-time response system.
1.2.5 LRME - Latent Relation Mapping Engine
To avoid the manual construction of mapping rules and s (subject), o (object) structures in SMT, Turney proposes the Latent Relation Mapping Engine (LRME) [11], which combines SMT and LRA. Its purpose is to find the relationship between two terms A, B (considering terms as entities). The input (Table 1.1) is two lists of terms from two domains (source and target), and the output (Table 1.2) is the mapping between the two lists.
1.2.6 LSR - Latent Semantic Relation
Bollegala, Duc et al. [17, 18] and Kato [19] use the Distributional Hypothesis at the context level: in the corpus, if two contexts pi and pj are different but usually co-occur with the same entity pairs wm, wn, then they are semantically similar. When pi and pj are semantically similar, the entity pairs wm, wn are relationally similar.
The Distributional Hypothesis requires entity pairs to always co-occur with contexts, and Bollegala's clustering algorithm operates at the context level rather than at the level of terms in the sentence. A similarity measure that relies only on the distributional hypothesis, and not on term similarity, significantly affects the quality of the clustering technique and thus the quality of the search system.
1.2.7 Word2Vec
The Word2Vec model, proposed by Mikolov et al. [22], is a learning model that represents each word as a vector (starting from the word's one-hot encoding). Word2Vec models the relationship (probability) between a word and its context. The Word2Vec model has 2 simple neural network architectures: Continuous Bag-Of-Words (CBOW) and Skip-gram.
With Skip-gram, at each training step the Word2Vec model predicts the context words within a certain window around the input word. Assuming the input training word is "banking" and the sliding window size is skip = m = 2, the left context output is "turning into" and the right context output is "crises as".
Figure 1.6: Relationship between the target word and the context in the Word2Vec model
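The window mechanics in the "banking" example can be sketched as follows. The six-word token sequence is an assumed fragment built only from the words quoted above, and the function names `context_window` and `skipgram_pairs` are hypothetical.

```python
def context_window(tokens, center, m):
    """Return the (left, right) context of tokens[center] within a window of m words."""
    left = tokens[max(0, center - m):center]
    right = tokens[center + 1:center + 1 + m]
    return left, right

def skipgram_pairs(tokens, m):
    """Enumerate the (center word, context word) training pairs for window size m."""
    pairs = []
    for t, center_word in enumerate(tokens):
        left, right = context_window(tokens, t, m)
        pairs.extend((center_word, ctx) for ctx in left + right)
    return pairs

# Assumed sentence fragment around the training word "banking".
sentence = "problems turning into banking crises as".split()
left, right = context_window(sentence, sentence.index("banking"), 2)
```

Each pair produced by `skipgram_pairs` corresponds to one term log p(w_{t+j} | w_t) in the Skip-gram objective.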
For prediction, the objective function of Skip-gram is defined to maximize probability. Given a sequence of training words w1, w2, ..., wT, Skip-gram applies Maximum Likelihood:
$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$
in which: T: number of words in the data set; t: position of the trained word;
m: window size (skip); θ: vector representation;
The training process applies the back-propagation algorithm; the output probability p(wt+j | wt) is determined by the softmax activation function:
$$p(o \mid c) = \frac{\exp(u_o^{T} v_c)}{\sum_{w=1}^{W} \exp(u_w^{T} v_c)}$$
in which: W: Vocabulary; c: the trained word (input/center); o: output of c;
u: representing vector of o; v: representing vector of c;
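A minimal sketch of the softmax probability p(o | c) follows. The 3-word vocabulary and the 2-dimensional vectors in `U` and `V` are hypothetical toy values, not trained embeddings; subtracting the maximum score is a standard numerical-stability step that does not change the result.

```python
import math

def softmax_probability(o, c, U, V):
    """p(o | c) = exp(u_o . v_c) / sum over the vocabulary of exp(u_w . v_c)."""
    v_c = V[c]
    scores = {w: sum(a * b for a, b in zip(u_w, v_c)) for w, u_w in U.items()}
    max_score = max(scores.values())  # shift scores for numerical stability
    exp_scores = {w: math.exp(s - max_score) for w, s in scores.items()}
    return exp_scores[o] / sum(exp_scores.values())

# Hypothetical toy vectors: V holds center-word vectors v, U holds output vectors u.
V = {"banking": [1.0, 0.5], "crises": [0.2, 1.0], "river": [-1.0, 0.3]}
U = {"banking": [0.1, 0.2], "crises": [1.0, 0.8], "river": [-0.9, 0.1]}
```

The denominator sums over the whole vocabulary, which is why the full softmax becomes expensive for large vocabularies.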
In their experiments, Mikolov et al. [22-25] treat phrases as single words, subsample frequently occurring words, and use the Negative Sampling loss function, which randomly selects n words for the computation instead of all the words in the data set, making training faster than with the softmax function above.
Figure 1.7: Word2Vec "learns" the "hidden" relationship between the target word and its context³
Vector operations such as vec("king") - vec("man") ≈ vec("queen") - vec("woman") show that the Word2Vec model is suitable for a query of the form "A : B :: C : ?"; in other words, the Word2Vec model is quite close to the research direction of the thesis. The difference: the input of Word2Vec (following the Skip-gram model) is one word and the output is a context, whereas the input of the IRS model is 3 entities (A : B :: C : ?) and the output is the entity being sought (D).
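The analogy query "A : B :: C : ?" can be answered with vector arithmetic by looking for the word closest to vec(B) - vec(A) + vec(C). The sketch below uses hand-made 3-dimensional vectors (`toy_vectors`) that are assumptions for illustration only; real Word2Vec vectors are learned from a corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def solve_analogy(a, b, c, vectors):
    """Return the word D, other than a, b, c, closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy vectors; the dimensions loosely encode (person, female, royal).
toy_vectors = {
    "man": [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, -1.0],
}
```

Calling `solve_analogy("man", "king", "woman", toy_vectors)` selects "queen" for these toy vectors.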
Regarding entity search based on semantics, starting from the existing problems and moving towards "artificial intelligence" in the search engine, the thesis studies and applies techniques that simulate a natural human ability: the ability to infer undetermined information/knowledge by analogy.
1.3 The problem of context-aware query suggestion
For an SE, the ability to "understand" the search intent behind a user's query is a challenge. The data set used for mining is the Query Log (QLogs), the set of past queries. QLogs record the queries and "interactions" of users with search engines, so they contain valuable information about query content, purpose, behavior, habits, and preferences, as well as the user's implicit feedback on the result set returned by the SE.
Mining the log data set is useful in many applications: query analysis, advertising, trend detection, personalization, query suggestion, etc. For query suggestion, traditional techniques such as Explicit Feedback [30-32], Implicit Feedback [33-36], User profile [37-39], and Thesaurus [40-42] only give suggestions similar to the user's input query.
Figure 1.12: Traditional suggestions for the input query "điện thoại di động" (mobile phone)
1.4 Query suggestions – Related work
With QLogs at the core, traditional query suggestion techniques fall into two main groups:
- Cluster-based techniques, which apply similarity measures to aggregate similar queries into clusters (groups).
- Session-based techniques, in which a search session is a continuous sequence of queries.
3 https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
1.4.1 Session-based query suggestion technique
a) Based on co-occurring or adjacent queries belonging to the same sessions in QLogs: in the session-based approach, adjacent or co-occurring query pairs belonging to the same session serve as the candidate list for query suggestion.
b) Based on the graph (Query Flow Graph - QFG): on the QFG, two queries qi, qj that belong to the same search intent (search mission) are represented by a directed edge from qi to qj. Each node of the graph corresponds to a query, and each edge of the graph is considered a search behavior.
The general session structure in the QFG is represented as: QLog = <q, u, t, V, C>;
Boldi et al. [50, 51] use the simplified session structure QLog = <q, u, t> to perform query suggestion, following a series of steps:
- Construct the QFG with the set of sessions in the Query Logs as input.
- Connect the queries qi and qj if there exists at least one session in which qi and qj occur consecutively.
- Calculate the weight w(qi, qj) on each edge:
$$w(q_i, q_j) = \begin{cases} \dfrac{f(q_i, q_j)}{f(q_i)}, & \text{if } (w(q_i, q_j) > \theta) \lor (q_i = s) \lor (q_i = t) \\ 0, & \text{otherwise} \end{cases}$$
in which: f(qi, qj): the number of times qj occurs immediately after qi in a session;
f(qi): the number of occurrences of qi in QLogs; θ: threshold;
s, t: the 2 state nodes marking the start and end of the query chain in a session;
- Identify the chains that satisfy condition (1.6) in order to analyze the user's intent: when a new query is entered, the graph yields the query suggestions in descending order of edge weight.
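The steps above can be sketched as follows. This is a simplified illustration, not the implementation of Boldi et al.: it omits the start/end state nodes s and t, and the session data, the threshold value, and the function names `build_query_flow_graph` and `suggest` are assumptions.

```python
from collections import Counter, defaultdict

def build_query_flow_graph(sessions, theta=0.0):
    """Build edges qi -> qj weighted by f(qi, qj) / f(qi), keeping weights above theta."""
    pair_counts = Counter()   # f(qi, qj): qj immediately follows qi in a session
    query_counts = Counter()  # f(qi): occurrences of qi in the whole log
    for session in sessions:
        query_counts.update(session)
        pair_counts.update(zip(session, session[1:]))
    graph = defaultdict(dict)
    for (qi, qj), f_ij in pair_counts.items():
        weight = f_ij / query_counts[qi]
        if weight > theta:
            graph[qi][qj] = weight
    return graph

def suggest(graph, query, k=3):
    """Return up to k follow-up queries in descending order of edge weight."""
    followers = graph.get(query, {})
    return sorted(followers, key=followers.get, reverse=True)[:k]

# Hypothetical sessions, each an ordered list of queries.
sessions = [
    ["cheap flights", "hanoi to paris", "paris hotels"],
    ["cheap flights", "hanoi to paris"],
    ["cheap flights", "flight delay compensation"],
]
graph = build_query_flow_graph(sessions, theta=0.1)
```

Here `suggest(graph, "cheap flights")` ranks "hanoi to paris" first because it follows "cheap flights" in two of the three sessions.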
1.4.2 Cluster-based query suggestion technique
K-means;
Hierarchical;
DB-SCAN;
…
Figure 1.9: Clustering methods [54]
Context-aware Query Suggestion is a new approach: it considers the queries immediately preceding the current query as the search context in order to "capture" the user's search intent. It then mines the queries that immediately follow the current query as the list of suggestions. This is the unique advantage of this approach compared to approaches that only suggest similar queries. The layer of queries that immediately follows the current query reflects the problems that users often ask about after the current query. At the same time, this layer often includes better queries (query strings) that reflect the search intent more accurately.
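The idea of using the preceding queries as context and the following queries as suggestions can be sketched in a few lines. This is a deliberately simplified illustration that matches contexts by exact query strings; the session data, the `context_size` parameter, and the function names are assumptions, not the method evaluated in the thesis.

```python
from collections import Counter, defaultdict

def build_context_index(sessions, context_size=1):
    """Map (preceding queries + current query) to a Counter of the queries that follow."""
    index = defaultdict(Counter)
    for session in sessions:
        for i in range(len(session) - 1):
            context = tuple(session[max(0, i - context_size):i + 1])
            index[context][session[i + 1]] += 1
    return index

def suggest_next(index, recent_queries, k=3):
    """Suggest the queries most often asked after the given context."""
    followers = index.get(tuple(recent_queries), Counter())
    return [query for query, _ in followers.most_common(k)]

# Hypothetical query log sessions.
sessions = [
    ["who is joe biden", "how old is joe biden", "joe biden vice president"],
    ["who is joe biden", "how old is joe biden", "joe biden family"],
    ["who is joe biden", "how old is joe biden", "joe biden vice president"],
    ["how old is joe biden", "joe biden birthday"],
]
index = build_context_index(sessions)
```

The same current query yields different suggestions depending on what preceded it: with the context ["who is joe biden", "how old is joe biden"] the top suggestion differs from the one given for ["how old is joe biden"] alone.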
CHAPTER 2: SEARCH FOR ENTITIES BASED ON IMPLICIT SEMANTIC RELATIONS
2.1 Problem
In nature, there exist relationships between two entities, such as: Khue Van Cac - Temple of Literature; Stephen Hawking - Physicist; Shakyamuni - Mahayana group; Apple - iPhone; etc. In the real world, there are questions like: "Knowing that Fansipan is the highest mountain in Vietnam, which is the highest mountain in India?", "If Biden is the president-elect of the United States, who is the most powerful person in Sweden?", etc.
With keyword search engines, according to statistics, queries are often short, ambiguous, and polysemous [1-6]. Approximately 71% of web search queries contain names of entities [7, 8]. If the user enters entities such as "Vietnam", "Hanoi", "France", the search engine only returns documents that contain these keywords and does not directly answer "Paris". Because they only look for entities, query expansion and query rewriting techniques are not applicable to implicit semantic relations within an entity pair. Therefore, a new search form is researched. The search query has the form (A, B), (C, ?), in which (A, B) is the source entity pair and (C, ?) is the target entity pair, and the two pairs (A, B), (C, ?) are semantically similar. In other words, when the user enters the query (A, B), (C, ?), the search engine has the duty of listing entities D such that each entity D has a semantic relation with C and the pair (C, D) has a relation similar to that of the pair (A, B). With an input consisting of only 3 entities, "Vietnam", "Hanoi", "France", the semantic relation "is the capital" is not indicated in the query.
2.2 Method for searching entities based on implicit semantic relations
2.2.1 Architecture – Modeling
The concept of searching entities through implicit semantic relations is what most clearly distinguishes this approach from keyword-based search engines. Figure 2.1 illustrates a query consisting of only three entities, query = (Vietnam, Mekong), (China, ?).
By convention, write q = {(A, B), (C, ?)}, where (Vietnam, Mekong) is the source entity pair and (China, ?) is the target entity pair. The search engine is responsible for identifying the entity ("?") that has a semantic relation with the entity "China", such that the entity pair (China, ?) is relationally similar to the entity pair (Vietnam, Mekong). Note that the above query does not explicitly contain the semantic relation between the two entities. This is because semantic relations are expressed in various ways around the entity pair (Vietnam, Mekong), such as "the longest river", "big river system", "the largest basin", etc.
Figure 2.1: Implicit Semantic Relation Search with input consisting of 3 entities
Because the query consists of only three entities and does not include the semantic relation, the model is called the implicit semantic relation search model.
If the IRS does not find A, B or C, the keyword search engine is applied instead.
The search form for entities based on implicit semantic relations must determine the semantic relation between the two entities and calculate the similarity of the two entity pairs, and from that give the answer for the unknown entity (entity "?"). On a specific corpus, Implicit Relational Search (IRS) in general consists of three main modules: the module for extracting semantic relations from the corpus; the module for clustering semantic relations; and the module for calculating the relational similarity between two entity pairs. In practice, the IRS model consists of two phases: an online phase, which serves real-time search, and an offline phase, which processes, calculates, and stores semantic relations and similarity relations. The extracting and clustering modules belong to the offline phase of the IRS model.
Extracting module of the semantic relations: From the corpus, this module extracts patterns (the original sentence containing a pair of entities and the context), as illustrated above: "A the longest river B", where A, B are 2 named entities. The resulting pattern set consists of different patterns, similar patterns, and patterns of different lengths and terms that express the same semantics. For example: A is the largest river of B; A is the river of B that has the largest basin; B has the longest river, A; etc.
Clustering module of semantic relations: From the obtained pattern set, clustering is performed to identify clusters of contexts, where the contexts in the same cluster are semantically similar. An index table of the patterns and the corresponding entity pairs is then built.
Calculating module of similar relations between two entity pairs: This module is in the online phase of the IRS model. Given the query q = (A, B), (C, ?), the IRS looks up the entity pair (A, B) and the corresponding set of semantic relations (contexts) in the index table. From the obtained set of semantic relations, it finds the entity pairs (C, Di) associated with these relations. It then applies the cosine measure to calculate and rank the similarity between (A, B) and (C, Di), and returns a list of ranked entities Di to answer the query.
Considering q = {(Vietnam, Mekong), (China, ?)}, the IRS finds a cluster containing the entity pair (Vietnam, Mekong) and the corresponding semantic relation "longest river" (from the original sentence: "The Mekong is the longest river in Vietnam"). This cluster also contains a similar semantic relation, "largest river", which is associated with the entity pair (China, Changjiang) (from the original sentence: "Changjiang is the biggest river in China"). The IRS puts "Changjiang" in the list of candidates, ranks the candidates according to the similarity measure, and returns the results to the searcher.
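The online phase in this example (look up the source pair, filter candidate pairs that share a context cluster, rank by cosine) can be sketched as below. The index contents, cluster identifiers, frequencies, and function names are hypothetical; a real system would read them from the inverted index built in the offline phase.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {key: value} dictionaries."""
    dot = sum(value * v[key] for key, value in u.items() if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical index: entity pair -> {context cluster id: frequency}.
pair_clusters = {
    ("Vietnam", "Mekong"): {"longest_river": 5, "largest_basin": 2},
    ("China", "Changjiang"): {"longest_river": 7, "largest_basin": 1},
    ("China", "Beijing"): {"capital": 9},
    ("China", "Huanghe"): {"largest_basin": 1, "yellow_river": 4},
}

def filter_entities(a, b, c):
    """Candidate pairs (C, Di) that share at least one context cluster with (A, B)."""
    source = pair_clusters.get((a, b), {})
    return [pair for pair, clusters in pair_clusters.items()
            if pair[0] == c and set(clusters) & set(source)]

def rank_entities(a, b, c):
    """Rank the candidate entities Di by the cosine similarity of (C, Di) to (A, B)."""
    source = pair_clusters.get((a, b), {})
    scored = [(pair[1], cosine(source, pair_clusters[pair]))
              for pair in filter_entities(a, b, c)]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For the query {(Vietnam, Mekong), (China, ?)}, `rank_entities("Vietnam", "Mekong", "China")` places "Changjiang" first and never considers "Beijing", which shares no context cluster with the source pair.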
Figure 2.2: General structure of IRS (components: Corpus; Extracting semantic relations; Clustering semantic relations; Entity pairs & Context; Inverted Index for IRS; Calculating the relation similarity (RelSim); Candidate answers; Ranked answers)
2.2.2 Extracting module of the semantic relations
Receiving the input query q = (A, B), (C, ?), the general structure of the IRS is modeled as follows:
Filter-Entities (Fe) filters/seeks the candidate set S containing entity pairs (C, Di) that are related to the input entity pair (A, B):
2.2.3 Clustering module of semantic relations
The clustering process groups "similar" elements into a cluster. In the semantic entity search model, the elements in a cluster are semantically similar context sentences. Similarity is a quantity used to compare two or more elements, reflecting the correlation between them. The thesis therefore surveys measures of term similarity, similarity based on vector space, and semantic similarity between two contexts.
a) Measurements of the similarity between 2 contexts
Terms-similarity
Jaro function: the Jaro-Winkler⁴ distance. The Jaro function calculates the similarity between 2 strings a, b:
$$Sim_{Jaro}(a, b) = \begin{cases} 0, & \text{if } m = 0 \\ \dfrac{1}{3}\left(\dfrac{m}{|a|} + \dfrac{m}{|b|} + \dfrac{m - skip}{m}\right), & \text{otherwise} \end{cases} \quad (2.6)$$
Contrast model: As proposed by Tversky ("Features of similarity", Psychological Review, 1977)⁵, a contrast model is applied to calculate the similarity between two sentences a, b:
$$Sim(a, b) = \alpha \cdot f(a \cap b) - \beta \cdot f(a - b) - \gamma \cdot f(b - a) \quad (2.8)$$
Jaccard distance:
$$Sim(a, b) = \frac{|a \cap b|}{|a \cup b|}$$
PMI (Pointwise Mutual Information) method: Proposed by Church and Hanks [1990]. Based on the probability that 2 terms t1, t2 co-occur in the corpus, PMI(t1, t2) is calculated by the formula:
$$PMI(t_1, t_2) = \log_2 \frac{P(t_1, t_2)}{P(t_1)\, P(t_2)}$$
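The four measures above can be sketched in a few lines each. The sketch rests on stated assumptions: for Jaro, m is taken as the number of matching characters and skip as half the number of matched characters that appear in a different order (the usual Jaro definition from the referenced Jaro-Winkler article); for the contrast model, f is taken to be set size and the default α, β, γ values are arbitrary; Jaccard and PMI use their standard textbook definitions.

```python
import math

def jaro(a, b):
    """Jaro similarity: (m/|a| + m/|b| + (m - skip)/m) / 3, or 0 when m = 0."""
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)  # how far apart matches may be
    matched_a = [False] * len(a)
    matched_b = [False] * len(b)
    m = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not matched_b[j] and b[j] == ch:
                matched_a[i] = matched_b[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    seq_a = [ch for ch, hit in zip(a, matched_a) if hit]
    seq_b = [ch for ch, hit in zip(b, matched_b) if hit]
    skip = sum(x != y for x, y in zip(seq_a, seq_b)) / 2  # transpositions
    return (m / len(a) + m / len(b) + (m - skip) / m) / 3

def tversky_contrast(a, b, alpha=1.0, beta=0.5, gamma=0.5):
    """Contrast model with f = set size over the terms of two sentences."""
    a, b = set(a), set(b)
    return alpha * len(a & b) - beta * len(a - b) - gamma * len(b - a)

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def pmi(count_12, count_1, count_2, total):
    """log2(P(t1, t2) / (P(t1) P(t2))) estimated from occurrence counts."""
    return math.log2((count_12 / total) / ((count_1 / total) * (count_2 / total)))
```

For instance, the contexts "the longest river in" and "the largest river in" share three of five distinct terms, so their Jaccard similarity is 0.6.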
4 https://en.wikipedia.org/wiki/Jaro-Winkler_distance
5 http://www.cogsci.ucsd.edu/~coulson/203/tversky-features.pdf
b) Clustering module of semantic relations
Applying PMI to improve the similarity measure according to the Distributional Hypothesis:
$$Sim_{DH}(p, q) = Cosine(PMI(p, q)) = \frac{\sum_i PMI(w_i, p) \cdot PMI(w_i, q)}{\|PMI(w_i, p)\| \, \|PMI(w_i, q)\|} \quad (2.25)$$
The term-based similarity of 2 contexts p, q:
$$Sim_{term}(p, q) = \frac{\sum_{i=1}^{n} weight_i(p) \cdot weight_i(q)}{\|weight(p)\| \, \|weight(q)\|} \quad (2.26)$$
The combined similarity measure:
$$Sim(p, q) = \max(Sim_{DH}(p, q), Sim_{term}(p, q)) \quad (2.27)$$
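A minimal sketch of the combined measure (2.27) follows, assuming each context is described by two sparse vectors: PMI values keyed by the entity pairs it co-occurs with, and term weights keyed by its terms. All numbers and names below are hypothetical illustrations.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {key: value} dictionaries."""
    dot = sum(value * v[key] for key, value in u.items() if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def combined_similarity(pmi_p, pmi_q, weight_p, weight_q):
    """Sim(p, q) = max(Sim_DH(p, q), Sim_term(p, q))."""
    sim_dh = cosine(pmi_p, pmi_q)          # distributional similarity over entity pairs
    sim_term = cosine(weight_p, weight_q)  # lexical similarity over terms
    return max(sim_dh, sim_term)

# Two hypothetical contexts with no terms in common that co-occur with the same pairs.
pmi_p = {("Vietnam", "Mekong"): 2.1, ("China", "Changjiang"): 1.8}
pmi_q = {("Vietnam", "Mekong"): 1.9, ("China", "Changjiang"): 2.0}
weight_p = {"longest": 1.0, "river": 1.0, "in": 0.2}
weight_q = {"biggest": 1.0, "waterway": 1.0, "of": 0.2}
```

Taking the maximum lets either signal place two contexts in the same cluster: here the term similarity is 0, but the distributional similarity is high, so the combined score stays high.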
2.2.4 Module calculating the relational similarity between two pairs of entities
The module calculating the relational similarity between two pairs of entities performs two tasks: filtering (searching) and ranking. As illustrated in 3.1, given the input query q = (A, B), (C, ?), the IRS uses the inverted index to execute the function Filter-Entities (Fe), which filters (searches) out the candidate set of entity pairs (C, Di) and their corresponding contexts such that (C, Di) is similar to (A, B). It then executes the function Rank-Entities (Re) to rank the entities Di, Dj within the candidate set according to the RelSim (Relational Similarity) measure, finally producing the ranked list {Di}.
Filter-Entities algorithm: filter to find the candidate set containing the answer:
Input: Query q = (A, B), (C, ?)
Output: Candidate set S (includes Di entities and corresponding contexts);
Program Filter_Entities
01 S = {};
02 P(w) = EntPair_from_Cset.Context();