L5 Compression
L5 Compression
COMP6714:
Informa2on
Retrieval
&
Web
Search
https://powcoder.com
Assignment Project Exam Help
Add WeChat powcoder
Introduc*on
to
Assignment Project Exam Help
Informa(on
R etrieval
https://powcoder.com
Add WeChat powcoder
Lecture
5:
Index
Compression
1
COMP6714:
Informa2on
Retrieval
&
Web
Search
https://powcoder.com
Assignment Project Exam Help
Last
lecture
–
index
construc*on
Add WeChat powcoder
§ Sort-‐based
indexing
§ Naïve
in-‐memory
inversion
§ Blocked
Assignment Project Exam Help
Sort-‐Based
Indexing
§ Merge
sort
is
effec*ve
for
disk-‐based
sor*ng
(avoid
seeks!)
https://powcoder.com
§ Single-‐Pass
In-‐Memory
Indexing
Add WeChat powcoder
§ No
global
dic*onary
§ Generate
separate
dic*onary
for
each
block
§ Don’t
sort
pos*ngs
§ Accumulate
pos*ngs
in
pos*ngs
lists
as
they
occur
§ Distributed
indexing
using
MapReduce
§ Dynamic
indexing:
Mul*ple
indices,
logarithmic
merge
2
COMP6714:
Informa2on
Retrieval
&
Web
Search
Ch.
5
https://powcoder.com
Assignment Project Exam Help
Today
Add WeChat powcoder
3
COMP6714:
Informa2on
Retrieval
&
Web
Search
Ch.
5
https://powcoder.com
Assignment Project Exam Help
Why
compression
(in
general)?
Add WeChat powcoder
§ Use
less
disk
space
§ Saves
a
liZle
money
Assignment
§ Keep
more
Project Exam Help
stuff
in
memory
§ Increases
speed
https://powcoder.com
§ Increase
speed
o f
d ata
Add WeChat powcoder t ransfer
f rom
d isk
t o
memory
§ [read
compressed
data
|
decompress]
is
faster
than
[read
uncompressed
data]
§ Premise:
Decompression
algorithms
are
fast
§ True
of
the
decompression
algorithms
we
use
4
COMP6714:
Informa2on
Retrieval
&
Web
Search
Ch.
5
https://powcoder.com
Assignment Project Exam Help
Why
compression
for
inverted
indexes?
Add WeChat powcoder
§ Dic*onary
§ Make
it
small
enough
to
keep
in
main
memory
§ Make
it
Assignment
so
small
that
yProject Exam
ou
can
keep
Help
some
pos*ngs
lists
in
main
memory
too
https://powcoder.com
§ Pos*ngs
file(s)
§ Reduce
disk
sAdd
pace
nWeChat
eeded
powcoder
§ Decrease
*me
needed
to
read
pos*ngs
lists
from
disk
§ Large
search
engines
keep
a
significant
part
of
the
pos*ngs
in
memory.
§ Compression
lets
you
keep
more
in
memory
§ We
will
devise
various
IR-‐specific
compression
schemes
5
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.1
https://powcoder.com
Assignment Project Exam Help
Recall
Reuters
RCV1
Add WeChat powcoder
§ symbol
sta(s(c
value
§ N
documents
800,000
Assignment Project Exam Help
§ L
avg.
#
tokens
per
doc
200
§ M
terms
https://powcoder.com
(=
word
types)
~400,000
§
avg.
Add
#
bytes
WeChatper
token
powcoder
6
(incl.
spaces/punct.)
§
avg.
#
bytes
per
token
4.5
(without
spaces/punct.)
§
avg.
#
bytes
per
term
7.5
§
non-‐posi*onal
pos*ngs
100,000,000
6
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.1
https://powcoder.com
Index
p arameters
Assignment Project Exam Help v s.
what
we
i ndex
(details
IIR
Table
5.1,
p.80)
Add WeChat powcoder
size of word types (terms) non-positional positional postings
postings
dictionary non-positional index
Assignment Project Exam Helppositional index
Size ∆% cumul Size (K) ∆ cumul Size (K) ∆ cumul
(K) https://powcoder.com
% % % % %
Unfiltered 484 109,971 197,879
No numbers 474
Add WeChat
-2
powcoder-8
-2 100,680 -8 179,158 -9 -9
Case folding 392 -17 -19 96,969 -3 -12 179,158 0 -9
30 stopwords 391 -0 -19 83,390 -14 -24 121,858 -31 -38
150 stopwords 391 -0 -19 67,002 -30 -39 94,517 -47 -52
stemming 322 -17 -33 63,812 -4 -42 94,517 0 -52
Exercise: give intuitions for all the ‘0’ entries. Why do some
zero entries correspond to big deltas in other columns? 7
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.1
https://powcoder.com
Assignment Project Exam Help
Lossless
vs.
lossy
compression
Add WeChat powcoder
§ Lossless
compression:
All
informa*on
is
preserved.
§ What
we
mostly
do
in
IR.
Assignment
§ Lossy
compression:
Projectsome
Discard
Exam Help
informa*on
§ Several
of
the
https://powcoder.com
preprocessing
steps
can
be
viewed
as
lossy
compression:
case
folding,
stop
words,
Add WeChat powcoder
stemming,
number
elimina*on.
§ Chap/Lecture
7:
Prune
pos*ngs
entries
that
are
unlikely
to
turn
up
in
the
top
k
list
for
any
query.
§ Almost
no
loss
quality
for
top
k
list.
8
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.1
https://powcoder.com
Assignment Project Exam Help
Vocabulary
vs.
collec*on
size
Add WeChat powcoder
§ How
big
is
the
term
vocabulary?
§ That
is,
how
many
dis*nct
words
are
there?
Assignment
§ Can
we
assume
Project
an
upper
Exam Help
bound?
§ Not
really:
At
https://powcoder.com
least
7020
=
1037
different
words
of
length
20
§ In
prac*ce,
the
v ocabulary
Add WeChat powcoder w ill
keep
growing
with
the
collec*on
size
§ Especially
with
Unicode
J
9
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.1
https://powcoder.com
Assignment Project Exam Help
Vocabulary
vs.
collec*on
size
Add WeChat powcoder
§ Heaps’
law:
M
=
kTb
§ M
is
the
size
of
the
vocabulary,
T
is
the
number
of
tokens
in
Assignment
the
collec*on
Project Exam Help
§ Typical
values:
https://powcoder.com
30
≤
k
≤
100
and
b
≈
0.5
§ In
a
log-‐log
plot
Add of
vWeChat
ocabulary
size
M
vs.
T,
Heaps’
powcoder
law
predicts
a
line
with
slope
about
½
§ It
is
the
simplest
possible
rela*onship
between
the
two
in
log-‐log
space
§ An
empirical
finding
(“empirical
law”)
10
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.1
https://powcoder.com
Assignment Project Exam Help
Heaps’
Law
Fig 5.1 p81
Add WeChat powcoder
For
RCV1,
the
dashed
line
log10M
=
0.49
log10T
+
1.64
is
the
best
least
sAssignment
quares
fit.
Project Exam Help
Thus,
M
=
101.64T0.49
so
k
=
https://powcoder.com
10 ≈
44
and
b
=
0.49.
1.64
Add WeChat powcoder
Good
empirical
fit
for
Reuters
RCV1
!
For
first
1,000,020
tokens,
law
predicts
38,323
terms;
actually,
38,365
terms
11
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.1
https://powcoder.com
Assignment Project Exam Help
Exercises
Add WeChat powcoder
§ What
is
the
effect
of
including
spelling
errors,
vs.
automa*cally
correc*ng
spelling
errors
on
Heaps’
law?
Assignment Project Exam Help
§ Compute
the
vhttps://powcoder.com
ocabulary
size
M
for
this
scenario:
§ Looking
at
a
collec*on
of
web
pages,
you
find
that
there
Add tWeChat
are
3000
different
powcoder
erms
in
the
first
10,000
tokens
and
30,000
different
terms
in
the
first
1,000,000
tokens.
§ Assume
a
search
engine
indexes
a
total
of
20,000,000,000
(2
×
1010)
pages,
containing
200
tokens
on
average
§ What
is
the
size
of
the
vocabulary
of
the
indexed
collec*on
as
predicted
by
Heaps’
law?
12
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.1
https://powcoder.com
Assignment Project Exam Help
Zipf’s
law
Add WeChat powcoder
§ Heaps’
law
gives
the
vocabulary
size
in
collec*ons.
§ We
also
Assignment
study
the
rela*ve
Projectfrequencies
Exam Helpof
terms.
§ In
natural
language,
there
are
a
few
very
frequent
terms
and
very
https://powcoder.com
many
very
rare
terms.
§ Zipf’s
law:
The
Add ith
WeChat
most
frequent
powcoder term
has
frequency
propor*onal
to
1/i
.
§ cfi
∝
1/i
=
K/i
where
K
is
a
normalizing
constant
§ cfi
is
collec*on
frequency:
the
number
of
occurrences
of
the
term
ti
in
the
collec*on.
13
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.1
https://powcoder.com
Assignment Project Exam Help
Zipf
consequences
Add WeChat powcoder
§ If
the
most
frequent
term
(the)
occurs
cf1
*mes
§ then
the
second
most
frequent
term
(of)
occurs
cf1/2
*mes
Assignment Project Exam Help
§ the
third
most
frequent
term
(and)
occurs
cf1/3
*mes
…
https://powcoder.com
§ Equivalent:
cfi
=
K/i
where
K
is
a
normalizing
factor,
so
Add WeChat powcoder
§ log
cfi
=
log
K
-‐
log
i
§ Linear
rela*onship
between
log
cfi
and
log
i
15
COMP6714:
Informa2on
Retrieval
&
Web
Search
Ch.
5
https://powcoder.com
Assignment Project Exam Help
Compression
Add WeChat powcoder
§ Now,
we
will
consider
compressing
the
space
for
the
dic*onary
and
pos*ngs
Assignment Project Exam Help
§ Basic
Boolean
index
only
https://powcoder.com
§ No
study
of
posi*onal
indexes,
etc.
Add WeChat powcoder
§ We
will
consider
compression
schemes
16
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.2
https://powcoder.com
Assignment Project Exam Help
Add WeChat powcoder
DICTIONARY COMPRESSION
17
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.2
https://powcoder.com
Assignment Project Exam Help
Why
compress
the
dic*onary?
Add WeChat powcoder
§ Search
begins
with
the
dic*onary
§ We
want
to
keep
it
in
memory
Assignment Project Exam Help
§ Memory
footprint
compe**on
with
other
applica*ons
https://powcoder.com
§ Embedded/mobile
devices
m
Add WeChat ay
have
very
liZle
powcoder
memory
§ Even
if
the
dic*onary
isn’t
in
memory,
we
want
it
to
be
small
for
a
fast
search
startup
*me
§ So,
compressing
the
dic*onary
is
important
18
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.2
https://powcoder.com
Assignment Project Exam Help
Dic*onary
storage
-‐
first
cut
Add WeChat powcoder
§ Array
of
fixed-‐width
entries
§ ~400,000
terms;
28
bytes/term
=
11.2
MB.
Assignment Project Exam Help
https://powcoder.com
Terms Freq. Postings ptr.
a 656,265
Add WeChat powcoder
aachen 65
…. ….
zulu 221
21
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.2
https://powcoder.com
Assignment Project Exam Help
Space
for
dic*onary
as
a
string
Add WeChat powcoder
§ 4
bytes
per
term
for
Freq.
Now avg. 11
§ 4
bytes
per
term
for
pointer
to
Pos*ngs.
bytes/term,
Assignment Project Exam Help not 20.
§ 3
bytes
per
term
pointer
§ https://powcoder.com
Avg.
8
bytes
per
term
in
term
string
§ 400K
terms
x
1Add
9
èWeChat
7.6
MB
(powcoder
against
11.2MB
for
fixed
width)
22
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.2
https://powcoder.com
Assignment Project Exam Help
Blocking
Add WeChat powcoder
§ Store
pointers
to
every
kth
term
string.
§ Example
below:
k=4.
Assignment
§ Need
to
store
Project
term
lengths
(1
Exam
extra
bHelp
yte)
https://powcoder.com
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
Add WeChat powcoder
Freq. Postings ptr. Term ptr.
33
29
Save 9 bytes Lose 4 bytes on
44 on 3 term lengths.
126 pointers.
7
23
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.2
https://powcoder.com
Assignment Project Exam Help
Net
Add WeChat powcoder
§ Example
for
block
size
k
=
4
§ Where
we
used
3
bytes/pointer
without
blocking
Assignment Project Exam Help
§ 3
x
4
=
12
bytes,
now
we
use
3
+
4https://powcoder.com
=
7
bytes.
Add WeChat powcoder
Shaved
another
~0.5MB.
This
reduces
the
size
of
the
dic*onary
from
7.6
MB
to
7.1
MB.
We
can
save
more
with
larger
k.
24
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.2
https://powcoder.com
Assignment Project Exam Help
Exercise
Add WeChat powcoder
§ Es*mate
the
space
usage
(and
savings
compared
to
7.6
MB)
with
blocking,
for
block
sizes
of
k
=
4,
8
and
16.
Assignment Project Exam Help
https://powcoder.com
Add WeChat powcoder
25
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.2
https://powcoder.com
Assignment Project Exam Help
Dic*onary
search
without
blocking
Add WeChat powcoder
§ Assuming
each
dic*onary
term
equally
Assignment
likely
in
query
(not
really
Project Exam Help
so
in
prac*ce!),
ahttps://powcoder.com
verage
number
of
comparisons
Add WeChat powcoder
=
(1+2·∙2+4·∙3+4)/8
~2.6
28
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.2
https://powcoder.com
Assignment Project Exam Help
Front
coding
Add WeChat powcoder
§ Front-‐coding:
§ Sorted
words
commonly
have
long
common
prefix
–
store
differences
only
Assignment Project Exam Help
§ (for
last
k-‐1
in
a
block
of
k)
https://powcoder.com
8automata8automate9automa'c10automa'on
Add WeChat powcoder
♢8automat*a1♢e2♢ic3♢ion
31
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Add WeChat powcoder
POSTINGS COMPRESSION
32
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Pos*ngs
compression
Add WeChat powcoder
§ The
pos*ngs
file
is
much
larger
than
the
dic*onary,
factor
of
at
least
10.
§ Assignment
Key
desideratum:
Project
store
each
pExam
os*ng
Help
compactly.
§ A
pos*ng
for
ohttps://powcoder.com
ur
purposes
is
a
docID.
§ For
Reuters
(800,000
Add WeChat documents),
powcoder we
would
use
32
bits
per
docID
when
using
4-‐byte
integers.
§ Alterna*vely,
we
can
use
log2
800,000
≈
20
bits
per
docID.
§ Our
goal:
use
a
lot
less
than
20
bits
per
docID.
33
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Pos*ngs:
two
conflic*ng
forces
Add WeChat powcoder
§ A
term
like
arachnocentric
occurs
in
maybe
one
doc
out
of
a
million
–
we
would
like
to
store
this
pos*ng
using
log2Assignment
1M
~
20
bits.
Project Exam Help
§ A
term
like
the
occurs
in
virtually
every
doc,
so
20
https://powcoder.com
bits/pos*ng
is
too
expensive.
Add WeChat powcoder
§ Prefer
0/1
bitmap
vector
in
this
case
34
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Pos*ngs
file
entry
Add WeChat powcoder
§ We
store
the
list
of
docs
containing
a
term
in
increasing
order
of
docID.
Assignment
§ computer:
Project…Exam
33,47,154,159,202
Help
§ Consequence:
https://powcoder.com
it
suffices
to
store
gaps.
§ 33,14,107,5,43
…
Add WeChat powcoder
§ Hope:
most
gaps
can
be
encoded/stored
with
far
fewer
than
20
bits.
35
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Three
pos*ngs
entries
Add WeChat powcoder
36
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Variable
length
encoding
Add WeChat powcoder
§ Aim:
§ For
arachnocentric,
we
will
use
~20
bits/gap
entry.
§ For
the,
Assignment
we
will
use
~1
Project
bit/gap
eExam
ntry.
Help
§ If
the
average
https://powcoder.com
gap
for
a
term
is
G,
we
want
to
use
~log2G
bits/gap
entry.
Add WeChat powcoder
§ Key
challenge:
encode
every
integer
(gap)
with
about
as
few
bits
as
needed
for
that
integer.
§ This
requires
a
variable
length
encoding
§ Variable
length
codes
achieve
this
by
using
short
codes
for
small
numbers
37
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Variable
Byte
(VB)
codes
Add WeChat powcoder
§ For
a
gap
value
G,
we
want
to
use
close
to
the
fewest
bytes
needed
to
hold
log2
G
bits
§ Begin
with
Assignment
one
byte
to
Project
store
GExam
and
dHelp
edicate
1
bit
in
it
to
be
a
con*nua*on
bit
c
https://powcoder.com
§ If
G
≤127,
binary-‐encode
it
in
the
7
available
bits
and
Add WeChat powcoder
set
c
=1
§ Else
encode
G’s
lower-‐order
7
bits
and
then
use
addi*onal
bytes
to
encode
the
higher
order
bits
using
the
same
algorithm
§ At
the
end
set
the
con*nua*on
bit
of
the
last
byte
to
1
(c
=1)
–
and
for
the
other
bytes
c
=
0.
38
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Hex(824)=0x0338
Assignment Project ExamHex(214577)=0x00034631
Help
Example
Add WeChat powcoder
docIDs 824 829 215406
gaps 5 214577
VB code Assignment
00000110 Project Exam Help00001101
10000101
10111000 00001100
https://powcoder.com 10110001
Postings storedAdd
as the byte concatenation
WeChat powcoder
000001101011100010000101000011010000110010110001
43
COMP6714:
Informa2on
Retrieval
&
Web
Search
https://powcoder.com
Assignment Project Exam Help
Unary
and
Binary
Codes
Add WeChat powcoder
§ Unary
is
very
efficient
for
small
numbers
such
as
0
and
1,
but
quickly
becomes
very
expensive
Assignment Project Exam Help
§ 1023
can
be
represented
in
10
binary
bits,
but
requires
1024
bits
in
uhttps://powcoder.com
nary
§ Binary
is
more
Add
efficient
WeChat for
lpowcoder
arge
numbers,
but
it
may
be
ambiguous
44
COMP6714:
Informa2on
Retrieval
&
Web
Search
https://powcoder.com
Assignment Project Exam Help
Elias-‐γ
Code
Add WeChat powcoder
§ To
encode
a
number
k,
compute
unary
Assignment Project Exam Help binary
§ kd
is
number
of
binary
digits,
encoded
in
unary
https://powcoder.com
Add WeChat powcoder
45
COMP6714:
Informa2on
Retrieval
&
Web
Search
https://powcoder.com
Assignment Project Exam Help
Elias-‐δ
Code
Add WeChat powcoder
§ Elias-‐γ
code
uses
no
more
bits
than
unary,
many
fewer
for
k
>
2
Assignment
§ 1023
takes
Project
19
bits
instead
Exam
of
1024
bits
Help
using
unary
§ In
general,
takes
2⎣log2k⎦+1
bits
https://powcoder.com
§ To
improve
coding
o f
l
Add WeChat powcoderarge
n umbers,
u se
E lias-‐δ
code
§ Instead
of
encoding
kd
in
unary,
we
encode
kd
+
1
using
Elias-‐γ
§ Takes
approximately
2
log2
log2
k
+
log2
k
bits
46
COMP6714:
Informa2on
Retrieval
&
Web
Search
https://powcoder.com
Assignment Project Exam Help
Elias-‐δ
Code
Add WeChat powcoder
§ Split
(kd+
1)
into:
kdd = blog2 (kd + 1)c
Assignment
kdr = (kd +Project 2blogHelp
1) Exam 2 (kd +1)c
47
COMP6714:
Informa2on
Retrieval
&
Web
Search
https://powcoder.com
Assignment Project Exam Help
Add WeChat powcoder
48
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Gamma
code
proper*es
Add WeChat powcoder
§ G
is
encoded
using
2
⎣log
G⎦
+
1
bits
§ Length
of
offset
is
⎣log
G⎦
bits
§ Length
oAssignment
f
length
is
⎣log
Project
G⎦
+
1
bExam
its
Help
§ All
gamma
codes
have
an
odd
number
of
bits
https://powcoder.com
§ Almost
within
Add
a
factor
o f
WeChat powcoder 2
o f
b est
p ossible,
l og2
G
49
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Gamma
seldom
used
in
prac*ce
Add WeChat powcoder
§ Machines
have
word
boundaries
–
8,
16,
32,
64
bits
§ Opera*ons
that
cross
word
boundaries
are
slower
§ Compressing
Assignment Project Exam
and
manipula*ng
Help
at
the
granularity
of
bits
can
be
slow
https://powcoder.com
§ Variable
byte
eAdd ncoding
i s
a
WeChat powcoder ligned
a nd
thus
poten*ally
more
efficient
§ Regardless
of
efficiency,
variable
byte
is
conceptually
simpler
at
liZle
addi*onal
space
cost
50
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Shannon
Limit
Add WeChat powcoder
§ Is
it
possible
to
derive
codes
that
are
op*mal
(under
certain
assump*ons)?
Assignment
§ What
is
the
ProjectcExam
op*mal
average
Help for
a
code
ode
length
that
encodes
ehttps://powcoder.com
ach
integer
(gap
lengths)
independently?
Add WeChat powcoder
§ Lower
bounds
on
average
code
length:
Shannon
entropy
§ H(X)
=
-‐
Σx=1n
Pr[X=x]
log
Pr[X=x]
§ Asympto*cally
op*mal
codes
(finite
alphabets):
arithme*c
coding,
Huffman
codes
51
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
How
to
design
an
op*mal
code
for
geometric
distribu*on?
Assignment Project Exam Help
Global
Bernoulli
Model
Add WeChat powcoder
§ Assump*on:
term
occurrence
are
Bernoulli
events
§ Nota*on:
Assignment Project Exam Help
§ n:
#
of
documents,
m:
#
of
terms
in
vocabulary
§ N:
total
#
of
(https://powcoder.com
unique)
occurrences
§ Probability
of
aAdd
term
tj
o
WeChat powcoder ccurring
i n
d ocument
d i :
p
=
N/nm
§ Each
term-‐document
occurrence
is
an
independent
event
§ Probability
of
a
gap
of
length
x
is
given
by
the
geometric
distribu*on
Pr[X = x] = (1− p) x−1 ⋅ p
52
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
It
can
also
be
deemed
as
a
Assignment Project Exam Help
generaliza*on
of
the
unary
code.
Golomb
Code
Add WeChat powcoder
§ Golomb
Code
(Golomb
1966):
highly
efficient
way
to
design
op*mal
Huffman-‐style
code
for
geometric
distribu*on
Assignment Project Exam Help
§ Parameter
b
https://powcoder.com
§ For
given
x
≥
1,
computer
integer
quo*ent
q = ⎣(x −1) b⎦
§ and
remainder
Add
WeChat powcoder r = (x −1) − q ⋅ b
§ Assume
b
=
2k
§ Encode
q
in
unary,
followed
by
r
coded
in
binary
§ A
bit
complicated
if
b
!=
2k.
See
wikipedia.
§ First
step:
(q+1)
bits
§ Second
step:
log(b)
bits
53
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Golomb
Code
&
Rice
Code
Add WeChat powcoder
§ How
to
determine
op*mal
b*?
§ Select
minimal
b
such
that
Assignmentb Project Exam b +1 Help
(1− p) + (1− p) ≤1
https://powcoder.com
§ Result
due
to
Gallager
and
Van
Voorhis
1975:
generates
an
oAdd p*mal
WeChatprefix
powcoder
code
for
geometric
distribu*on
§ Small
p
approxima*on:
b* ≈ ln2 p = 0.69 ⋅ avg _ val
§ Rice
code:
only
allow
b
=
2k
54
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Local
Bernoulli
Model
Add WeChat powcoder
§ If
length
of
pos*ng
lists
is
known,
then
a
Bernoulli
model
on
each
individual
inverted
list
can
be
used
§ Frequent
Assignment Project
words
are
coded
Exam
with
Helpb,
infrequent
smaller
words
with
larger
b
https://powcoder.com
§ Term
frequency
need
to
be
encoded
(use
gamma-‐
Add WeChat powcoder
code)
§ Local
Bernoulli
outperforms
global
Bernoulli
model
in
prac*ce
(method
of
prac*ce!)
55
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
RCV1
compression
Add WeChat powcoder
Data structure Size in MB
dictionary, fixed-width 11.2
dictionary, termAssignment Project
pointers into string Exam Help 7.6
with blocking, k = 4 7.1
https://powcoder.com
with blocking & front coding 5.9
collection (text, xml markup etc) 3,600.0
Add WeChat powcoder
collection (text) 960.0
Term-doc incidence matrix 40,000.0
postings, uncompressed (32-bit words) 400.0
postings, uncompressed (20 bits) 250.0
postings, variable byte encoded 116.0
postings, g-encoded 101.0
56
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Google’s
Indexing
Choice
Add WeChat powcoder
§ Index
shards
par**on
by
doc,
mul*ple
replicates
§ Disk-‐resident
index
Assignment Project Exam Help
§ Use
outer
parts
of
the
disk
§ Use
different
https://powcoder.com
compression
methods
for
different
fields:
Ricek
(a
special
kind
of
Golomb
code)
for
gaps,
and
Gamma
for
posi*ons.
Add
WeChat powcoder
§ In-‐memory
index
§ All
posi*ons;
No
docid
§ Keep
track
of
document
boundaries
§ Group-‐variant
encoding
§ Fast
to
decode
Source:
Jeff
Dean’s
WSDM
2009
Keynote
57
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Other
details
Add WeChat powcoder
§ Gap
=
docidn-‐
docidn-‐1
-‐
1
§ Freq
=
freq
–
1
Assignment Project Exam Help
§ Pos_Gap
=
posn-‐
posn-‐1
-‐
1
https://powcoder.com
§ C.f.,
Jiangong
ZAdd
hang,
WeChatXiaohui
Long
and
Torsten
Suel:
powcoder
Performance
of
Compressed
Inverted
List
Caching
in
Search
Engines.
WWW
2008.
58
COMP6714:
Informa2on
Retrieval
&
Web
Search
Sec.
5.3
https://powcoder.com
Assignment Project Exam Help
Index
compression
summary
Add WeChat powcoder
§ We
can
now
create
an
index
for
highly
efficient
Boolean
retrieval
that
is
very
space
efficient
§ Only
4%
oAssignment Project
f
the
total
size
Exam
of
the
Help
collec*on
§ Only
10-‐15%
ohttps://powcoder.com
f
the
total
size
of
the
text
in
the
collec*on
Add WeChat powcoder
§ However,
we’ve
ignored
posi*onal
informa*on
§ Hence,
space
savings
are
less
for
indexes
used
in
prac*ce
§ But
techniques
substan*ally
the
same.
59
COMP6714:
Informa2on
Retrieval
&
Web
Search
Ch.
5
https://powcoder.com
Assignment Project Exam Help
Resources
for
today’s
lecture
Add WeChat powcoder
§ IIR
5
§ MG
3.3,
3.4.
Assignment Project Exam Help
§ F.
Scholer,
H.E.
Williams
and
J.
Zobel.
2002.
Compression
ohttps://powcoder.com
f
Inverted
Indexes
For
Fast
Query
Evalua*on.
Proc.
ACM-‐SIGIR
2002.
Add WeChat powcoder
§ Variable
byte
codes
§ V.
N.
Anh
and
A.
Moffat.
2005.
Inverted
Index
Compression
Using
Word-‐Aligned
Binary
Codes.
Informa2on
Retrieval
8:
151–166.
§ Word
aligned
codes
60