Query Languages
Berlin Chen 2003
Reference:
1. Modern Information Retrieval, chapter 4
The Kinds of Queries
• Data retrieval
– Pattern-based querying
– Retrieve docs that contains (or exactly match) the
objects that satisfy the conditions clearly specified in
the query
– A single erroneous object implies failure!
• Information retrieval
– Keyword-based querying
– Retrieve relevant docs in response to the query (the
formulation of a user information need)
– Allow the answer to be ranked
2
The Kinds of Queries
• On-line databases or CD-ROM archives
– High level software packages should be viewed as
query languages
– Named “protocols”
Different query languages are formulated and then
used at different situations, by considering
- The underlying retrieval models
- The content (semantics) and structure (syntax) of the text
Models: Boolean, vector-space, HMM ….
Formulations/word-treating machineries: Stop-word list,
stemming, query-expansion, ….
3
The Retrieval Units
• The retrieval unit: the basic element which can
be retrieved as an answer to a query
– A set of such basic elements with ranking information
• The retrieval unit can be a file, a doc, a Web
page, a paragraph, a passage, or some other
structural units
• Simply referred as “docs”
4
Keyword-based Querying
• Keywords
– Those words can be used for retrieval by a query
– A small set of words extracted from the docs
• Preprocessing is needed
• Characteristics of keyword-based queries
– A query composed of keywords and the docs
containing such keywords are searching for
– Intuitive, easy to express, and allowing for fast
ranking
– A query can be a single keyword or more complex
combination of operation involving several keywords
5
Keyword-based Querying
• Single-word queries
– Query: The elementary query is a word
– Docs: The docs are long sequences of words
– What is a word in English ?
• A word is a sequence of letters surrounded by
separators
• Some characters are not letters but do not split a
word, e.g. the hyphen in ‘on-line’
• Words possess semantic/conceptual information
6
Keyword-based Querying
• Single-word queries (cont.)
– The use of word statistics for IR ranking
• Word occurrences inside texts
– Term frequency: number of times a word in a
doc
– Inverse document frequency: number of docs in
which a word appears
– Word positions in the docs
• May be required, e.g., a interface highlighting each
occurrence of a specific word
7
Keyword-based Querying
• Context queries
– Complement single-word queries with ability to
search words in a given context, i.e., near other
words
– Words appearing near each other may signal a
higher likelihood of relevance than if they appear
apart
• Phrases of words or words are proximal in the
text
8
Keyword-based Querying
• Context queries (cont.)
– Two types of queries
• Phrase
Separators in the
– A sequence of single-word queries
text or query may
not be the same Q: “enhance” and “retrieval”
D: “…enhance the retrieval….”
• Proximity
– A relaxed version of the phrase query
May not consider – A sequence of single words (or phrases) is
word ordering
given together with a maximum allowed
distance between them
» E.g., two keywords occur within four words
D: “…enhance the power of retrieval…” 9
Keyword-based Querying
• Context queries (cont.)
– Ranking
• Phrases: analogous to single words
• Proximity queries: the same way if physical
proximity is not used as a parameter in ranking
– Just as a hard-limiter
– But physical proximity has semantic value !
10
Keyword-based Querying
• Boolean Queries
– Have a syntax composed of atoms (basic queries)
that retrieve docs, and of Boolean operators which
work on their operands
AND
translation OR
Leaves: basic queries
Internal nodes: operators
syntax syntactic
A query syntax tree. 11
Keyword-based Querying
• Boolean Queries (cont.)
– Commonly used operators e and e are basic queries
1 2
• OR, e.g. (e1 OR e2)
– Select all docs which satisfy e1 or e2. Duplicates
e e e OR e
are eliminated d
1
3d
2
4d
1
3
2
d d d
• AND, e.g. (e1 AND e2) d
7
d
10
7
8d
4
7
d
– Select all docs which satisfy both e1 and e2 d
8
10
• BUT, e.g. (e1 BUT e2)
– Select all docs which satisfy e1 but not e2
No partial matching between a doc and a query
No ranking of retrieved docs are provided!
12
Keyword-based Querying
• Boolean Queries (cont.)
– A relaxed version: a “fuzzy Boolean” set of operators
• The meaning of AND and OR can be relaxed
– all : the AND operator
– one: the OR operator (at least one)
– some: retrieval elements appearing in more
operands than the OR
• Docs are ranked higher when having a larger
number of elements in common with the query
13
Keyword-based Querying
• Natural language
– Push the fuzzy Boolean model even further
• The distinction between AND and OR are
complete blurred
– A query is an enumeration of words and context
queries
– All the documents matching a portion of the user
query are retrieved
• Docs matching more parts of the query assigned a
higher ranking
– Negation also can be handled by penalizing the
ranking score
• E.g. some words are not desired
14
Pattern Matching
• Pattern matching: allow the retrieval of doc
based on some patterns
– A pattern is a set of syntactic features must occur in a
text segments
• Segments satisfying the pattern specifications are
said to “match the pattern”
• E.g. the prefix of a word
– A kind of data retrieval
• Pattern matching (data retrieval) can be viewed
as an enhanced tool for information retrieval
– Require more sophisticated data structures and
algorithms to retrieve efficiently
15
Pattern Matching
• Types of patterns
– Words
– Prefixes: a string from the beginning of a text word
• E.g. ‘comput’: ‘computer’, ‘computation’,…
– Suffixes: a string from the termination of a text word
• E.g. ‘ters’: ‘computers’, ‘testers’, ‘painters’,…
– Substrings: A string within a text word
• E.g. ‘tal’: ‘coastal’, ‘talk’, ‘metallic’, …
– Ranges: a pair of strings matching any words lying
between them in lexicographic order
• E.g. between ‘held’ and ‘hold’: ‘hoax’ and ‘hissing’,…
16
Pattern Matching
– Allowing errors: a word together with an error
threshold
• Useful for when query or doc contains typos or
misspelling
• Retrieve all text words which are ‘similar’ to the
given word
• edit (or Levenshtein) distance: the minimum
number of character insertions, deletions, and
replacements needed to make two strings equal
– E.g. ‘flower’ and ‘flo wer’
• maximum allowed edit distance: query specifies
the maximum number of allowed errors for a word
to match the pattern
17
Pattern Matching
– Regular Expressions
• General patterns are built up by simple strings and
several operations
• union: if e1 and e2 are regular expressions, then (e1 |
e2) matches what e1 or e2 matches
• concatenation: if e1 and e2 are regular expressions,
the occurrences of (e1 e2) are formed by the
occurrences of e1 immediately followed by those of e2
• repetition (Kleene closure): if e is a regular
expression, then (e*) matches a sequence of zero or
more contiguous occurrence of e
• Example:
– ‘pro (blem | tein) (s | ε) (0 | 1 | 2)*’ matches words
‘problem2’, ‘proteins’, etc.
18
Pattern Matching
– Extended Patterns
• Subsets of the regular expressions expressed with a
simpler syntax
• System can convert extended patterns into regular
expressions, or search them with specific algorithms
• E.g.: classes of characters:
19
Structural Queries
• Docs are allowed to be queried with respect to
both their text content and structural constraints
– Text content: words, phrases, or patterns
– Structural constraints: containment, proximity, or
other restrictions on the structural elements (e.g.,
chapters, sections, etc.)
Mixing contents and structures in queries
Query on Text Retrieval A Set of The Final Set of
Boolean model
Text Content model Retrieved Documents Retrieved Documents
Structural
Query
20
Structural Queries
• Three main structures discussed here
– Form-like fixed structure simple
– Hierarchical structure
– Hypertext structure complex
What structure a text may have?
What can be queried about that
structure? (the query model)
How to rank docs?
21
Form-like Fixed Structure
• Docs have a fixed set of fields, much like a filled
form
– Each field has some text inside
text
– Some fields are not presented in all docs text
– Text has to be classified into a field
fields
– Fields are not allow to nest or overlap text
– A given pattern only can be associated text
with a specified filed
– E.g., a mail achieve (sender, receiver, date, subject, body ..)
• Search for the mail sent to a given person with
“football” in the subject field
• Compared with the relational database systems
– Different fields with different data types more rigid !
22
Hypertext Structure
• A hypertext is a directed graph where
– Nodes hold some text (content)
– The links represents connection (structural
connectivity) between nodes or between positions
inside the nodes
• Retrieval from a hypertext began as a merely
navigational activity
– Manually traverse the hypertext nodes following links A
to search what one wanted C
– It’s still not possible to query the hypertext based on
its structure B
• An interesting proposal to combine browsing and
searching on the web WebGlimpse
23
Hierarchical Structure
• Represent a recursive decomposition of the text
and is a natural model for many text collections
– E.g., books, articles, legal documents,…
24
Issues of Hierarchical Structure
• Static or dynamic structure
– Statistic: one or more explicit hierarchies can be
queried, e.g., by ancestry
– Dynamic: not really a hierarchy, the required
elements are built on the fly
• Implemented over a normal text index
• Restrictions on the structure
– The text or the answers may have restrictions about
nesting and/or overlapping for efficiency reasons
– In other cases, the query language is restricted to
avoid restricting the structure
25
Issues of Hierarchical Structure
• Integration with text
– Effective Integration of queries on text content with
queries on text structure
– From perspectives of classical IR models
and structural models, respectively Classical model: primary -> text
secondary->structure
Structural model: primary -> structure
• Query language secondary->text
– Some features for queries on structure including
selection of areas that
• Contain (or not) other areas
• Are contained (or not) in other areas
• Follow (or are followed by) other areas
• Are close to other areas
– Also including set manipulation
26
Query Protocols
• The query languages used automatically by
software applications to query text databases
– Standards for querying CD-ROMs
– Or, intermediate languages to query library systems
• Important query protocols
– Z39.50
• For bibliographical information systems
• Protocols for not only the query language but also
the client-server connection
– WAIS (Wide Area Information Service)
• A networking publishing protocol
• For querying database through the Internet
27
Query Protocols
• CD-ROM publishing protocols
– Provide “disk interchangeability”: flexibility in data
communication between primary information
providers and end users
– Some example protocols
• CCL (Common Command Language)
• CD-RDx (Compact Disk Read only Data exchange)
• SFQL (Structured Full-text Query Languages)
28
Trends and Research Issues
• Types of queries and how they are structured
29