Introduction
As the Internet grows and data volumes increase, people depend more and more on
information search. Keyword-based full-text search increasingly struggles to satisfy
users' search needs. To address this, an information retrieval method based on
knowledge graphs is proposed, in which retrieval is performed by computing semantic
similarity over the knowledge graph. This approach can improve both the efficiency and
the accuracy of retrieval results, and it has practical value for information retrieval
and smart recommendation.
The Semantic Knowledge Graph serves as a data scientist's toolkit, allowing you to
discover and compare any entities modeled within a corpus of data from any domain.
Problem Statement
What problem are we trying to solve? Suppose a user enters the following search:
“Machine learning research and development Portland, OR software engineer AND
Hadoop, jee”
This query can be parsed in several ways:
Traditional query parsing: * Rating
(machine AND learning AND research AND development AND portland)
OR (software AND engineer AND hadoop AND jee)
Semantic query parsing: *** Rating
“machine learning” AND “research and development” AND “Portland, OR” AND “software
engineer” AND “hadoop” AND “jee”
Semantically expanded query: **** Rating
(“machine learning”^10 OR “data scientist” OR “data mining” OR “artificial intelligence”)
AND (“research and development”^10 OR “r&d”) AND
(“software engineer”^10 OR “software developer”) AND
(“hadoop”^10 OR “big data”) AND (“jee”^10 OR “j2ee”)
From the above, we can see that the best match comes from the semantically expanded
query, which is what the semantic knowledge graph provides.
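The shape of a semantically expanded query can be sketched in a few lines of Python. This is only an illustration: `expand_query` and the `related_terms` lookup are hypothetical names, and the related phrases are hard-coded here, whereas a semantic knowledge graph would supply them.

```python
def expand_query(phrases, related, boost=10):
    """Build a boosted, OR-expanded boolean query from parsed phrases."""
    clauses = []
    for phrase in phrases:
        # The user's own phrase is boosted so it outranks its expansions.
        terms = [f'"{phrase}"^{boost}']
        # Related phrases come from a hypothetical lookup table.
        terms += [f'"{alt}"' for alt in related.get(phrase, [])]
        clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


# Hard-coded stand-in for what a semantic knowledge graph would return.
related_terms = {
    "machine learning": ["data scientist", "data mining", "artificial intelligence"],
    "research and development": ["r&d"],
    "software engineer": ["software developer"],
    "jee": ["j2ee"],
}

query = expand_query(
    ["machine learning", "research and development", "software engineer", "jee"],
    related_terms,
)
print(query)
```

Keeping the boost on the original phrase means documents that match the user's exact wording still rank above documents that match only an expansion.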
The sections below describe how the semantic knowledge graph provides greater
context for a given user query.
Solution Approach
The solution approach introduces two concepts:
1. Use of two indices:
a. doc-term uninverted index
b. term-doc inverted index
Consider the sample data below:
[
{"id":"01","age":15,"state":"AZ","hobbies":["soccer","painting","cycling"]},
{"id":"02","age":22,"state":"AZ","hobbies":["swimming","darts","cycling"]},
{"id":"03","age":27,"state":"AZ","hobbies":["swimming","frisbee","painting"]},
{"id":"04","age":33,"state":"AZ","hobbies":["darts"]},
{"id":"05","age":42,"state":"AZ","hobbies":["swimming","golf","painting"]},
{"id":"06","age":54,"state":"AZ","hobbies":["swimming","golf"]},
{"id":"07","age":67,"state":"AZ","hobbies":["golf","painting"]},
{"id":"08","age":71,"state":"AZ","hobbies":["painting"]},
{"id":"09","age":14,"state":"CO","hobbies":["soccer","frisbee","skiing","swimming","skating"]},
{"id":"10","age":23,"state":"CO","hobbies":["skiing","darts","cycling","swimming"]},
{"id":"11","age":26,"state":"CO","hobbies":["skiing","golf"]},
{"id":"12","age":35,"state":"CO","hobbies":["golf","frisbee","painting","skiing"]},
{"id":"13","age":47,"state":"CO","hobbies":["skiing","darts","painting","skating"]},
{"id":"14","age":51,"state":"CO","hobbies":["skiing","golf"]},
{"id":"15","age":64,"state":"CO","hobbies":["skating","cycling"]},
{"id":"16","age":73,"state":"CO","hobbies":["painting"]}
]
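The two indices named in the solution approach can be illustrated on a handful of the sample documents. The source does not spell out how the indices are traversed, so this is a sketch under an assumption: the inverted index maps a term to its documents, the uninverted index maps a document to its terms, and related terms are found by hopping from one to the other. The helper name `co_occurring_terms` is hypothetical.

```python
from collections import defaultdict

# A few documents from the sample data, reduced to the hobbies field.
docs = {
    "01": ["soccer", "painting", "cycling"],
    "04": ["darts"],
    "06": ["swimming", "golf"],
    "07": ["golf", "painting"],
}

# Doc-term (uninverted) index: document id -> terms in that document.
uninverted = {doc_id: sorted(terms) for doc_id, terms in docs.items()}

# Term-doc (inverted) index: term -> ids of documents containing it.
inverted = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        inverted[term].add(doc_id)


def co_occurring_terms(term):
    """Hop term -> documents -> terms to collect candidate related terms."""
    related = set()
    for doc_id in inverted[term]:
        related.update(uninverted[doc_id])
    related.discard(term)
    return related


print(sorted(inverted["golf"]))
print(sorted(co_occurring_terms("golf")))
```

The inverted index answers "which documents contain this term?", while the uninverted index answers "which terms does this document contain?"; alternating between the two is what lets a term be connected to other terms through shared documents.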
Consider the sample query below:
--------------------------------------
curl -sS -X POST http://localhost:8983/solr/gettingstarted/query -d 'rows=0&q=*:*
&back=*:*                                  # <1>
&fore=age:[35 TO *]                        # <2>
&json.facet={
  hobby : {
    type : terms,
    field : hobbies,
    limit : 5,
    sort : { r1: desc },                   # <3>
    facet : {
      r1 : "relatedness($fore,$back)",     # <4>
      location : {
        type : terms,
        field : state,
        limit : 2,
        sort : { r2: desc },               # <3>
        facet : {
          r2 : "relatedness($fore,$back)"  # <4>
        }
      }
    }
  }
}'
Query explanation (the `# <n>` markers in the query are annotations keyed to the notes
below and are not part of the request):
<1> Use the entire collection as our "Background Set".
<2> Use a query for "age >= 35" to define our (initial) "Foreground Set".
<3> For both the top-level `hobbies` facet and the sub-facet on `state`, we sort
on the `relatedness(...)` values.
<4> In both calls to the `relatedness(...)` function, we use parameter variables
(parameter dereferencing) to refer to the previously defined `fore` and `back` queries.
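The same request can be assembled without the annotation markers. The sketch below builds the form-encoded body in Python using only the standard library; the variable names are illustrative, and the URL is the one from the curl command, so it assumes a local Solr instance with a `gettingstarted` collection. The request is constructed but not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SOLR_QUERY_URL = "http://localhost:8983/solr/gettingstarted/query"

# The nested facet definition, written as strict JSON via a Python dict.
facet = {
    "hobby": {
        "type": "terms",
        "field": "hobbies",
        "limit": 5,
        "sort": {"r1": "desc"},
        "facet": {
            "r1": "relatedness($fore,$back)",
            "location": {
                "type": "terms",
                "field": "state",
                "limit": 2,
                "sort": {"r2": "desc"},
                "facet": {"r2": "relatedness($fore,$back)"},
            },
        },
    }
}

params = {
    "rows": 0,
    "q": "*:*",
    "back": "*:*",             # Background Set: the entire collection
    "fore": "age:[35 TO *]",   # Foreground Set: people aged 35 and over
    "json.facet": json.dumps(facet),
}

# URL-encode the parameters so characters such as '&', '[' and ' ' are safe.
body = urlencode(params).encode("utf-8")
request = Request(SOLR_QUERY_URL, data=body, method="POST")

# To send it against a running Solr instance:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       facets = json.load(response)["facets"]
```

Encoding the parameters explicitly avoids the quoting pitfalls of a multi-line shell string, and `$fore` and `$back` still resolve on the server because they are sent as ordinary request parameters.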
Sample output:
"facets":{
  "count":16,
  "hobby":{
    "buckets":[{
        "val":"golf",
        "count":6,                                // <1>
        "r1":{
          "relatedness":0.01225,
          "foreground_popularity":0.3125,         // <2>
          "background_popularity":0.375},         // <3>
        "location":{
          "buckets":[{
              "val":"az",
              "count":3,
              "r2":{
                "relatedness":0.00496,            // <4>
                "foreground_popularity":0.1875,   // <6>
                "background_popularity":0.5}},    // <7>
            {
              "val":"co",
              "count":3,
              "r2":{
                "relatedness":-0.00496,           // <5>
                "foreground_popularity":0.125,
                "background_popularity":0.5}}]}},
      {
        "val":"painting",
        "count":8,                                // <1>
        "r1":{
          "relatedness":0.01097,
          "foreground_popularity":0.375,
          "background_popularity":0.5},
        "location":{
          "buckets":[{
            ...
<1> Even though `hobbies:golf` has a lower total facet `count` than
`hobbies:painting`, it has a higher `relatedness` score, indicating that, relative to the
Background Set (the entire collection), golf has a stronger correlation with our
Foreground Set (people age 35+) than painting.
<2> The number of documents matching `age:[35 TO *]` _and_ `hobbies:golf` is
31.25% of the total number of documents in the Background Set.
<3> 37.5% of the documents in the Background Set match `hobbies:golf`.
<4> The state of Arizona (AZ) has a _positive_ relatedness correlation with the
_nested_ Foreground Set (people ages 35+ who play golf) compared to the Background
Set, i.e., "People in Arizona are statistically more likely to be '35+ year old Golfers' than
the country as a whole."
<5> The state of Colorado (CO) has a _negative_ correlation with the nested
Foreground Set, i.e., "People in Colorado are statistically less likely to be '35+ year old
Golfers' than the country as a whole."
<6> The number of documents matching `age:[35 TO *]` _and_ `hobbies:golf` _and_
`state:AZ` is 18.75% of the total number of documents in the Background Set.
<7> 50% of the documents in the Background Set match `state:AZ`.
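The popularity percentages in notes <2>, <3>, <6> and <7> can be checked by counting directly over the sample data. The sketch below is illustrative only: the `matching` helper and the tuple layout are hypothetical conveniences, not part of Solr.

```python
# Sample documents as (age, state, hobbies) tuples, keyed by id.
docs = {
    "01": (15, "AZ", ["soccer", "painting", "cycling"]),
    "02": (22, "AZ", ["swimming", "darts", "cycling"]),
    "03": (27, "AZ", ["swimming", "frisbee", "painting"]),
    "04": (33, "AZ", ["darts"]),
    "05": (42, "AZ", ["swimming", "golf", "painting"]),
    "06": (54, "AZ", ["swimming", "golf"]),
    "07": (67, "AZ", ["golf", "painting"]),
    "08": (71, "AZ", ["painting"]),
    "09": (14, "CO", ["soccer", "frisbee", "skiing", "swimming", "skating"]),
    "10": (23, "CO", ["skiing", "darts", "cycling", "swimming"]),
    "11": (26, "CO", ["skiing", "golf"]),
    "12": (35, "CO", ["golf", "frisbee", "painting", "skiing"]),
    "13": (47, "CO", ["skiing", "darts", "painting", "skating"]),
    "14": (51, "CO", ["skiing", "golf"]),
    "15": (64, "CO", ["skating", "cycling"]),
    "16": (73, "CO", ["painting"]),
}


def matching(predicate):
    """Return the ids of documents whose fields satisfy the predicate."""
    return {doc_id for doc_id, doc in docs.items() if predicate(*doc)}


background = set(docs)                                        # back=*:*
foreground = matching(lambda age, state, hobbies: age >= 35)  # fore=age:[35 TO *]
golf = matching(lambda age, state, hobbies: "golf" in hobbies)
arizona = matching(lambda age, state, hobbies: state == "AZ")

total = len(background)
golf_fg_pop = len(foreground & golf) / total          # note <2>
golf_bg_pop = len(background & golf) / total          # note <3>
az_fg_pop = len(foreground & golf & arizona) / total  # note <6>
az_bg_pop = len(background & arizona) / total         # note <7>

print(golf_fg_pop, golf_bg_pop, az_fg_pop, az_bg_pop)
```

Both popularity values are divided by the size of the Background Set, which is why the nested `az` bucket reports 18.75% (3 of 16 documents) rather than 3 of the 5 golfers aged 35 and over.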
Scoring semantic relationships:
The figure below captures the definition of relationship scoring used in the solution
approach.
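As a rough illustration of how such a score could be computed, the sketch below treats relatedness as a z-score of the foreground count against the background rate, then squashes it into the range (-1, 1) by averaging five sigmoid-like curves. This is a sketch under stated assumptions, not a definitive implementation: the function names and the curve offsets and scales are assumptions rather than values given in this document, although with them the sketch reproduces the relatedness figures in the sample output. The counts are read from the sample data, where 9 of the 16 documents match `age:[35 TO *]`.

```python
import math


def sigmoid_helper(x, offset, scale):
    """Squash x into (-1, 1) using a shifted, scaled soft-sign curve."""
    shifted = x + offset
    return shifted / (scale + abs(shifted))


def relatedness(fg_count, fg_size, bg_count, bg_size):
    """Score how over- or under-represented a term is in the foreground.

    fg_count: foreground documents containing the term
    fg_size:  total documents in the foreground
    bg_count: background documents containing the term
    bg_size:  total documents in the background
    """
    bg_prob = bg_count / bg_size
    expected = fg_size * bg_prob
    # Guard against a zero standard deviation (term in all or no documents).
    std_dev = math.sqrt(fg_size * bg_prob * (1 - bg_prob)) or 1e-10
    z = (fg_count - expected) / std_dev
    # Assumed curve parameters: (offset, scale) pairs averaged with equal weight.
    curves = [(-80, 50), (-30, 30), (0, 30), (30, 30), (80, 50)]
    score = sum(0.2 * sigmoid_helper(z, offset, scale) for offset, scale in curves)
    return round(score, 5)


golf = relatedness(fg_count=5, fg_size=9, bg_count=6, bg_size=16)
painting = relatedness(fg_count=6, fg_size=9, bg_count=8, bg_size=16)
# Nested buckets: the foreground narrows to the 5 golfers aged 35 and over.
golf_az = relatedness(fg_count=3, fg_size=5, bg_count=8, bg_size=16)
golf_co = relatedness(fg_count=2, fg_size=5, bg_count=8, bg_size=16)

print(golf, painting, golf_az, golf_co)
```

A positive score means the term occurs in the foreground more often than the background rate predicts, and a negative score means it occurs less often, which matches the opposite signs reported for the `az` and `co` buckets.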
Architecture
The diagram below shows the Apache Solr architecture.
The components of the semantic knowledge graph implemented in Apache Solr show
how semantic relationships are processed and derived from the query.
Conclusions
The Semantic Knowledge Graph (SKG) has numerous applications, including automatic
ontology building, identifying trending topics over time, predictive analytics on
time-series data, root-cause analysis that surfaces concepts related to failure scenarios
from free text, data cleansing, document summarization, semantic search interpretation
and query expansion, recommendation systems, and numerous forms of anomaly detection.