KEMBAR78
L5 Compression | PDF | Data Compression | Pointer (Computer Programming)
0% found this document useful (0 votes)
10 views60 pages

L5 Compression

The document discusses various aspects of information retrieval and web search, focusing on index compression techniques and their importance for efficient data storage and retrieval. It covers methods such as sort-based indexing, single-pass in-memory indexing, and the significance of compression for dictionaries and postings. Additionally, it highlights the differences between lossless and lossy compression and provides statistical insights into the RCV1 dataset.

Uploaded by

tianzong Li
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views60 pages

L5 Compression

The document discusses various aspects of information retrieval and web search, focusing on index compression techniques and their importance for efficient data storage and retrieval. It covers methods such as sort-based indexing, single-pass in-memory indexing, and the significance of compression for dictionaries and postings. Additionally, it highlights the differences between lossless and lossy compression and provides statistical insights into the RCV1 dataset.

Uploaded by

tianzong Li
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 60

 

 
COMP6714:  Informa2on  Retrieval  &  Web  Search      
https://powcoder.com
Assignment Project Exam Help
Add WeChat powcoder

Introduc*on  to  
Assignment Project Exam Help
Informa(on   R etrieval  
https://powcoder.com
Add WeChat powcoder
Lecture  5:  Index  Compression  

1  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search      
https://powcoder.com
Assignment Project Exam Help
Last  lecture  –  index  construc*on  
Add WeChat powcoder
§ Sort-­‐based  indexing  
§ Naïve  in-­‐memory  inversion  
§ Blocked  Assignment Project Exam Help
Sort-­‐Based  Indexing  
§ Merge  sort  is  effec*ve  for  disk-­‐based  sor*ng  (avoid  seeks!)  
https://powcoder.com
§ Single-­‐Pass  In-­‐Memory  Indexing  
Add WeChat powcoder
§ No  global  dic*onary  
§ Generate  separate  dic*onary  for  each  block  
§ Don’t  sort  pos*ngs  
§ Accumulate  pos*ngs  in  pos*ngs  lists  as  they  occur  
§ Distributed  indexing  using  MapReduce  
§ Dynamic  indexing:  Mul*ple  indices,  logarithmic  merge  
2  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Ch.
    5
https://powcoder.com
Assignment Project Exam Help
Today  
Add WeChat powcoder

Assignment Project Exam Help


https://powcoder.com
Add WeChat
§ Collec*on  sta*s*cs   powcoder
in  more   detail  (with  RCV1)  
§ How  big  will  the  dic*onary  and  pos*ngs  be?  
§ Dic*onary  compression  
§ Pos*ngs  compression  

3  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Ch.
    5
https://powcoder.com
Assignment Project Exam Help
Why  compression  (in  general)?  
Add WeChat powcoder
§ Use  less  disk  space  
§ Saves  a  liZle  money  
Assignment
§ Keep  more   Project Exam Help
stuff  in  memory  
§ Increases  speed  
https://powcoder.com
§ Increase  speed   o f   d ata  
Add WeChat powcoder t ransfer   f rom   d isk   t o   memory  
§ [read  compressed  data  |  decompress]  is  faster  than          
[read  uncompressed  data]  
§ Premise:  Decompression  algorithms  are  fast  
§ True  of  the  decompression  algorithms  we  use  

4  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Ch.
    5
https://powcoder.com
Assignment Project Exam Help
Why  compression  for  inverted  indexes?  
Add WeChat powcoder
§ Dic*onary  
§ Make  it  small  enough  to  keep  in  main  memory  
§ Make  it  Assignment
so  small  that  yProject Exam
ou  can  keep   Help
some   pos*ngs  lists  in  
main  memory  too  
https://powcoder.com
§ Pos*ngs  file(s)  
§ Reduce  disk  sAdd
pace  nWeChat
eeded   powcoder
§ Decrease  *me  needed  to  read  pos*ngs  lists  from  disk  
§ Large  search  engines  keep  a  significant  part  of  the  pos*ngs  
in  memory.  
§ Compression  lets  you  keep  more  in  memory  
§ We  will  devise  various  IR-­‐specific  compression  schemes  
5  
COMP6714:  Informa2on  Retrieval  &  Web  Search       Sec.
    5.1
https://powcoder.com
Assignment Project Exam Help
Recall  Reuters  RCV1  
Add WeChat powcoder
§ symbol  sta(s(c              value  
§ N        documents            800,000  
Assignment Project Exam Help
§ L        avg.  #  tokens  per  doc      200  
§ M      terms   https://powcoder.com
(=  word  types)      ~400,000  
§                                  avg.  Add
#  bytes  
WeChatper  token  
powcoder  6  
                                       (incl.  spaces/punct.)  
§                                  avg.  #  bytes  per  token    4.5  
                         (without  spaces/punct.)  
§                                  avg.  #  bytes  per  term    7.5  
§                                  non-­‐posi*onal  pos*ngs  100,000,000  
6  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.1
https://powcoder.com
Index   p arameters  
Assignment Project Exam Help v s.   what   we   i ndex  
(details  IIR  Table  5.1,  p.80)  
Add WeChat powcoder
size of word types (terms) non-positional positional postings
postings
dictionary non-positional index
Assignment Project Exam Helppositional index
Size ∆% cumul Size (K) ∆ cumul Size (K) ∆ cumul
(K) https://powcoder.com
% % % % %
Unfiltered 484 109,971 197,879
No numbers 474
Add WeChat
-2
powcoder-8
-2 100,680 -8 179,158 -9 -9
Case folding 392 -17 -19 96,969 -3 -12 179,158 0 -9
30 stopwords 391 -0 -19 83,390 -14 -24 121,858 -31 -38
150 stopwords 391 -0 -19 67,002 -30 -39 94,517 -47 -52
stemming 322 -17 -33 63,812 -4 -42 94,517 0 -52

Exercise: give intuitions for all the ‘0’ entries. Why do some
zero entries correspond to big deltas in other columns? 7  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.1
https://powcoder.com
Assignment Project Exam Help
Lossless  vs.  lossy  compression  
Add WeChat powcoder
§ Lossless  compression:  All  informa*on  is  preserved.  
§ What  we  mostly  do  in  IR.  
Assignment
§ Lossy  compression:   Projectsome  
Discard   Exam Help
informa*on  
§ Several  of  the  https://powcoder.com
preprocessing  steps  can  be  viewed  as  
lossy  compression:  case  folding,  stop  words,  
Add WeChat powcoder
stemming,  number  elimina*on.  
§ Chap/Lecture  7:  Prune  pos*ngs  entries  that  are  
unlikely  to  turn  up  in  the  top  k  list  for  any  query.  
§ Almost  no  loss  quality  for  top  k  list.  

8  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.1
https://powcoder.com
Assignment Project Exam Help
Vocabulary  vs.  collec*on  size  
Add WeChat powcoder
§ How  big  is  the  term  vocabulary?  
§ That  is,  how  many  dis*nct  words  are  there?  
Assignment
§ Can  we  assume   Project
an  upper   Exam Help
bound?  
§ Not  really:  At  https://powcoder.com
least  7020  =  1037  different  words  of  length  20  
§ In  prac*ce,  the   v ocabulary  
Add WeChat powcoder w ill   keep   growing   with  
the  collec*on  size  
§ Especially  with  Unicode  J  

9  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.1
https://powcoder.com
Assignment Project Exam Help
Vocabulary  vs.  collec*on  size  
Add WeChat powcoder
§ Heaps’  law:  M  =  kTb  
§ M  is  the  size  of  the  vocabulary,  T  is  the  number  of  
tokens  in  Assignment
the  collec*on   Project Exam Help
§ Typical  values:  https://powcoder.com
30  ≤  k  ≤  100  and  b  ≈  0.5  
§ In  a  log-­‐log  plot  
Add of  vWeChat
ocabulary   size  M  vs.  T,  Heaps’  
powcoder
law  predicts  a  line  with  slope  about  ½  
§ It  is  the  simplest  possible  rela*onship  between  the  two  in  
log-­‐log  space  
§ An  empirical  finding  (“empirical  law”)  

10  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.1
https://powcoder.com
Assignment Project Exam Help
Heaps’  Law   Fig 5.1 p81
Add WeChat powcoder
For  RCV1,  the  dashed  line  
log10M  =  0.49  log10T  +  1.64  
is  the  best  least  sAssignment
quares  fit.   Project Exam Help
Thus,  M  =  101.64T0.49  so  k  =  
https://powcoder.com
10 ≈  44  and  b  =  0.49.  
1.64  
 
Add WeChat powcoder
Good  empirical  fit  for  
Reuters  RCV1  !  
 
For  first  1,000,020  tokens,  
law  predicts  38,323  terms;  
actually,  38,365  terms  
11  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.1
https://powcoder.com
Assignment Project Exam Help
Exercises  
Add WeChat powcoder
§ What  is  the  effect  of  including  spelling  errors,  vs.  
automa*cally  correc*ng  spelling  errors  on  Heaps’  
law?   Assignment Project Exam Help
§ Compute  the  vhttps://powcoder.com
ocabulary  size  M  for  this  scenario:  
§ Looking  at  a  collec*on  of  web  pages,  you  find  that  there  
Add tWeChat
are  3000  different   powcoder
erms  in  the   first  10,000  tokens  and  
30,000  different  terms  in  the  first  1,000,000  tokens.  
§ Assume  a  search  engine  indexes  a  total  of  20,000,000,000  
(2  ×  1010)  pages,  containing  200  tokens  on  average  
§ What  is  the  size  of  the  vocabulary  of  the  indexed  collec*on  
as  predicted  by  Heaps’  law?  
12  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.1
https://powcoder.com
Assignment Project Exam Help
Zipf’s  law  
Add WeChat powcoder
§ Heaps’  law  gives  the  vocabulary  size  in  collec*ons.  
§ We  also  Assignment
study  the  rela*ve  
Projectfrequencies  
Exam Helpof  terms.  
§ In  natural  language,  there  are  a  few  very  frequent  
terms  and  very   https://powcoder.com
many  very  rare  terms.  
§ Zipf’s  law:  The  Add ith  WeChat
most  frequent  
powcoder term  has  frequency  
propor*onal  to  1/i  .  
§ cfi  ∝  1/i  =  K/i  where  K  is  a  normalizing  constant  
§ cfi  is  collec*on  frequency:  the  number  of  
occurrences  of  the  term  ti  in  the  collec*on.  
13  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.1
https://powcoder.com
Assignment Project Exam Help
Zipf  consequences  
Add WeChat powcoder
§ If  the  most  frequent  term  (the)  occurs  cf1  *mes    
§ then  the  second  most  frequent  term  (of)  occurs  cf1/2  
*mes   Assignment Project Exam Help
§ the  third  most  frequent  term  (and)  occurs  cf1/3  *mes  …    
https://powcoder.com
§ Equivalent:  cfi  =  K/i  where  K  is  a  normalizing  factor,  
so   Add WeChat powcoder
§ log  cfi  =  log  K  -­‐  log  i  
§ Linear  rela*onship  between  log  cfi  and  log  i  

§ Another  power  law  rela*onship  


14  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.1
https://powcoder.com
Assignment Project Exam Help
Zipf’s  law  for  Reuters  RCV1  
Add WeChat powcoder

Assignment Project Exam Help


https://powcoder.com
Add WeChat powcoder

15  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Ch.
    5
https://powcoder.com
Assignment Project Exam Help
Compression  
Add WeChat powcoder
§ Now,  we  will  consider  compressing  the  space  
for  the  dic*onary  and  pos*ngs  
Assignment Project Exam Help
§ Basic  Boolean  index  only  
https://powcoder.com
§ No  study  of  posi*onal  indexes,  etc.  
Add WeChat powcoder
§ We  will  consider  compression  schemes  

16  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.2
https://powcoder.com
Assignment Project Exam Help
Add WeChat powcoder

Assignment Project Exam Help


https://powcoder.com
Add WeChat powcoder

DICTIONARY  COMPRESSION  

17  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.2
https://powcoder.com
Assignment Project Exam Help
Why  compress  the  dic*onary?  
Add WeChat powcoder
§ Search  begins  with  the  dic*onary  
§ We  want  to  keep  it  in  memory  
Assignment Project Exam Help
§ Memory  footprint  compe**on  with  other  
applica*ons   https://powcoder.com
§ Embedded/mobile   devices  m
Add WeChat ay  have  very  liZle  
powcoder
memory  
§ Even  if  the  dic*onary  isn’t  in  memory,  we  want  it  to  
be  small  for  a  fast  search  startup  *me  
§ So,  compressing  the  dic*onary  is  important  

18  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.2
https://powcoder.com
Assignment Project Exam Help
Dic*onary  storage  -­‐  first  cut  
Add WeChat powcoder
§ Array  of  fixed-­‐width  entries  
§ ~400,000  terms;  28  bytes/term  =  11.2  MB.  
Assignment Project Exam Help
https://powcoder.com
Terms Freq. Postings ptr.
a 656,265
Add WeChat powcoder
aachen 65
…. ….
zulu 221

Dictionary search 20 bytes 4 bytes each


structure 19  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.2
https://powcoder.com
Assignment Project Exam Help
Fixed-­‐width  terms  are  wasteful  
Add WeChat powcoder
§ Most  of  the  bytes  in  the  Term  column  are  wasted  –  
we  allot  20  bytes  for  1  leZer  terms.  
Assignment
§ And  we  s*ll   Project Exam Help or  
can’t  handle  supercalifragilis2cexpialidocious  
hydrochlorofluorocarbons.  
https://powcoder.com
§ WriZen  English  averages  ~4.5  characters/word.  
§ Exercise:  Why  
Add is/isn’t  
WeChat this  the  powcoder
number  to  use  for  
es*ma*ng  the  dic*onary  size?  
§ Ave.  dic*onary  word  in  English:  ~8  characters  
§ How  do  we  use  ~8  characters  per  dic*onary  term?  
§ Short  words  dominate  token  counts  but  not  type  
average.  
20  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.2
https://powcoder.com
Compressing  the  term  list:    
Assignment Project Exam Help
Dic*onary-­‐as-­‐a-­‐String  
Add WeChat powcoder
Store dictionary as a (long) string of characters:
n

nPointer to next word shows end of current word


nHope to save up to 60% of dictionary space.
Assignment Project Exam Help
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
https://powcoder.com
Freq. Postings ptr. Term ptr.
Add WeChat powcoder Total string length =
33
400K x 8B = 3.2MB
29
44
Pointers resolve 3.2M
126
positions: log23.2M =
22bits = 3bytes

21  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.2
https://powcoder.com
Assignment Project Exam Help
Space  for  dic*onary  as  a  string  
Add WeChat powcoder
§ 4  bytes  per  term  for  Freq.  
Now avg. 11
§ 4  bytes  per  term  for  pointer  to  Pos*ngs.   bytes/term,
Assignment Project Exam Help not 20.
§ 3  bytes  per  term  pointer  
§ https://powcoder.com
Avg.  8  bytes  per   term  in  term  string  
§ 400K  terms  x  1Add
9  èWeChat
 7.6  MB  (powcoder
against  11.2MB  for  fixed  
width)  

22  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.2
https://powcoder.com
Assignment Project Exam Help
Blocking  
Add WeChat powcoder
§ Store  pointers  to  every  kth  term  string.  
§ Example  below:  k=4.  
Assignment
§ Need  to  store   Project
term  lengths   (1  Exam
extra  bHelp
yte)  
https://powcoder.com
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
Add WeChat powcoder
Freq. Postings ptr. Term ptr.
33
29
Save 9 bytes Lose 4 bytes on
44 on 3 term lengths.
126 pointers.
7
23  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.2
https://powcoder.com
Assignment Project Exam Help
Net  
Add WeChat powcoder
§ Example  for  block  size  k  =  4  
§ Where  we  used  3  bytes/pointer  without  blocking  
Assignment Project Exam Help
§ 3  x  4  =  12  bytes,  
now  we  use  3  +  4https://powcoder.com
 =  7  bytes.  
Add WeChat powcoder
Shaved  another  ~0.5MB.  This  reduces  the  size  of  the  
dic*onary  from  7.6  MB  to  7.1  MB.  
We  can  save  more  with  larger  k.  

Why not go with larger k?

24  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.2
https://powcoder.com
Assignment Project Exam Help
Exercise  
Add WeChat powcoder
§ Es*mate  the  space  usage  (and  savings  compared  to  
7.6  MB)  with  blocking,  for  block  sizes  of  k  =  4,  8  and  
16.   Assignment Project Exam Help
https://powcoder.com
Add WeChat powcoder

25  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.2
https://powcoder.com
Assignment Project Exam Help
Dic*onary  search  without  blocking  
Add WeChat powcoder
§ Assuming  each  
dic*onary  term  equally  
Assignment
likely  in  query   (not  really  Project Exam Help
so  in  prac*ce!),  ahttps://powcoder.com
verage  
number  of  comparisons  
Add WeChat powcoder
=  (1+2·∙2+4·∙3+4)/8  ~2.6  

Exercise:  what  if  the  frequencies  


of  query  terms  were  non-­‐uniform  
but  known,  how  would  you  
structure  the  dic*onary  search  
tree?  
26  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.2
https://powcoder.com
Assignment Project Exam Help
Dic*onary  search  with  blocking  
Add WeChat powcoder

Assignment Project Exam Help


https://powcoder.com
Add WeChat powcoder
§ Binary  search  down  to  4-­‐term  block;  
§ Then  linear  search  through  terms  in  block.  
§ Blocks  of  4  (binary  tree),  avg.  =  
(1+2·∙2+2·∙3+2·∙4+5)/8  =  3  compares  
27  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.2
https://powcoder.com
Assignment Project Exam Help
Exercise  
Add WeChat powcoder
§ Es*mate  the  impact  on  search  performance  (and  
slowdown  compared  to  k=1)  with  blocking,  for  block  
sizes  of  k  =Assignment
 4,  8  and  16.  
Project Exam Help
https://powcoder.com
Add WeChat powcoder

28  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.2
https://powcoder.com
Assignment Project Exam Help
Front  coding  
Add WeChat powcoder
§ Front-­‐coding:  
§ Sorted  words  commonly  have  long  common  prefix  –  store  
differences   only  
Assignment Project Exam Help
§ (for  last  k-­‐1  in  a  block  of  k)  
https://powcoder.com
8automata8automate9automa'c10automa'on  
Add WeChat powcoder
♢8automat*a1♢e2♢ic3♢ion

Encodes automat Extra length


beyond automat.
Begins to resemble general string compression. 29  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search      
https://powcoder.com
Assignment Project ExamMHelp
Front  Encoding   [WiZen,   offat,  Bell]  
Add WeChat powcoder
§ Complete  front  encoding  
§ (prefix-­‐len,  suffix-­‐len,  suffix)  
Assignment
§ Par*al  3-­‐in-­‐4   Project Exam Help
front  encoding  
§ No  encoding/compression   for  the  first  string  in  a  block  
https://powcoder.com
§ Enables  binary  search    
Add WeChat powcoder
String   Complete  Front   Par(al  3-­‐in-­‐4  Front  
Encoding   Encoding  
Assume
previous 8,  automata   4,  4,  mata      ,  8,  automata  
string is 8,  automate   7,  1,  e   7,  1,  e  
“auto”
9,  automa*c   7,  2,  ic   7,  2,  ic  
10,  automa*on   8,  2,  on   8,      ,  on  
30  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.2
https://powcoder.com
Assignment Project Exam Help
RCV1  dic*onary  compression  summary  
Add WeChat powcoder
Technique   Size  in  MB  

Fixed  width   Assignment Project Exam Help 11.2  


https://powcoder.com
Dic*onary-­‐as-­‐String  with  pointers  to  every  term   7.6  
Add WeChat powcoder
Also,  blocking  k  =  4   7.1  

Also,  Blocking  +  front  coding   5.9  

31  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Add WeChat powcoder

Assignment Project Exam Help


https://powcoder.com
Add WeChat powcoder

POSTINGS  COMPRESSION  

32  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Pos*ngs  compression  
Add WeChat powcoder
§ The  pos*ngs  file  is  much  larger  than  the  dic*onary,  
factor  of  at  least  10.  
§ Assignment
Key  desideratum:   Project
store   each  pExam
os*ng  Help
compactly.  
§ A  pos*ng  for  ohttps://powcoder.com
ur  purposes  is  a  docID.  
§ For  Reuters  (800,000  
Add WeChat documents),  
powcoder we  would  use  32  
bits  per  docID  when  using  4-­‐byte  integers.  
§ Alterna*vely,  we  can  use  log2  800,000  ≈  20  bits  per  
docID.  
§ Our  goal:  use  a  lot  less  than  20  bits  per  docID.  

33  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Pos*ngs:  two  conflic*ng  forces  
Add WeChat powcoder
§ A  term  like  arachnocentric  occurs  in  maybe  one  doc  
out  of  a  million  –  we  would  like  to  store  this  pos*ng  
using  log2Assignment
 1M  ~  20  bits.  
Project Exam Help
§ A  term  like  the   occurs  in  virtually  every  doc,  so  20  
https://powcoder.com
bits/pos*ng  is  too  expensive.  
Add WeChat powcoder
§ Prefer  0/1  bitmap  vector  in  this  case    

34  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Pos*ngs  file  entry  
Add WeChat powcoder
§ We  store  the  list  of  docs  containing  a  term  in  
increasing  order  of  docID.  
Assignment
§ computer:   Project…Exam
33,47,154,159,202     Help
§ Consequence:  https://powcoder.com
it  suffices  to  store  gaps.  
§ 33,14,107,5,43  …  
Add WeChat powcoder
§ Hope:  most  gaps  can  be  encoded/stored  with  far  
fewer  than  20  bits.  

35  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Three  pos*ngs  entries  
Add WeChat powcoder

Assignment Project Exam Help


https://powcoder.com
Add WeChat powcoder

36  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Variable  length  encoding  
Add WeChat powcoder
§ Aim:  
§ For  arachnocentric,  we  will  use  ~20  bits/gap  entry.  
§ For  the,  Assignment
we  will  use  ~1  Project
bit/gap  eExam
ntry.   Help
§ If  the  average  https://powcoder.com
gap  for  a  term  is  G,  we  want  to  use  
~log2G  bits/gap  entry.  
Add WeChat powcoder
§ Key  challenge:  encode  every  integer  (gap)  with  about  
as  few  bits  as  needed  for  that  integer.  
§ This  requires  a  variable  length  encoding  
§ Variable  length  codes  achieve  this  by  using  short  
codes  for  small  numbers  
37  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Variable  Byte  (VB)  codes  
Add WeChat powcoder
§ For  a  gap  value  G,  we  want  to  use  close  to  the  fewest  
bytes  needed  to  hold  log2  G  bits  
§ Begin  with   Assignment
one  byte  to  Project
store  GExam
 and  dHelp
edicate  1  bit  in  
it  to  be  a  con*nua*on   bit  c  
https://powcoder.com
§ If  G  ≤127,  binary-­‐encode  it  in  the  7  available  bits  and  
Add WeChat powcoder
set  c  =1  
§ Else  encode  G’s  lower-­‐order  7  bits  and  then  use  
addi*onal  bytes  to  encode  the  higher  order  bits  
using  the  same  algorithm  
§ At  the  end  set  the  con*nua*on  bit  of  the  last  byte  to  
1  (c  =1)  –  and  for  the  other  bytes  c  =  0.   38  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Hex(824)=0x0338  
Assignment Project ExamHex(214577)=0x00034631  
Help
Example  
Add WeChat powcoder
docIDs 824 829 215406
gaps 5 214577
VB code Assignment
00000110 Project Exam Help00001101
10000101
10111000 00001100
https://powcoder.com 10110001

Postings storedAdd
as the byte concatenation
WeChat powcoder
000001101011100010000101000011010000110010110001

Key property: VB-encoded postings are


uniquely prefix-decodable.

For a small gap (5), VB


uses a whole byte. 39  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Other  variable  unit  codes  
Add WeChat powcoder
§ Instead  of  bytes,  we  can  also  use  a  different  “unit  of  
alignment”:  32  bits  (words),  16  bits,  4  bits  (nibbles).  
§ Variable  bAssignment
yte  alignment   wastes  
Project space  
Exam if  you  have  
Help
many  small  gaps  –  nibbles  do  beZer  in  such  cases.  
https://powcoder.com
§ Variable  byte  codes:  
§ Used  by  many   commercial/research  
Add WeChat powcoder systems  
§ Good  low-­‐tech  blend  of  variable-­‐length  coding  and  
sensi*vity  to  computer  memory  alignment  matches  (vs.  
bit-­‐level  codes,  which  we  look  at  next).  
§ There  is  also  recent  work  on  word-­‐aligned  codes  that  
pack  a  variable  number  of  gaps  into  one  word  (e.g.,  
simple9)  
40  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Simple9  
§ Encodes  Add as  mWeChat
any  gaps  powcoder
as  possible  in  one  DWORD    
§ 4  bit  selector  +  28  bit  data  bits  
§ Encodes  Assignment
9  possible  ways   to  “use”  
Project Exam the  dHelp
ata  bits  
Selector   #  of  gaps  encoded   Len  of  each  gap   Wasted  bits  
https://powcoder.com encoded  
0000   28   1   0  
0001   14  Add WeChat
2   powcoder 0  
0010   9   3   1  
0011   7   4   0  
0100   5   5   3  
0101   4   7   0  
0110   3   9   1  
0111   2   14   0  
1000   1   28   0   41  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search      
https://powcoder.com
Assignment Project Exam Help
Unary  code  
Add WeChat powcoder
§ Represent  n  as  n  1s  with  a  final  0.  
§ Unary  code  for  3  is  1110.  
Assignment Project Exam Help
§ Unary  code  for  40  is  
https://powcoder.com
11111111111111111111111111111111111111110   .  
§ Unary  code  for  
Add 80  iWeChat
s:   powcoder
111111111111111111111111111111111111111111
111111111111111111111111111111111111110  
 
§ This  doesn’t  look  promising,  but….  
  42  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search      
https://powcoder.com
Assignment Project Exam Help
Bit-­‐Aligned  Codes  
Add WeChat powcoder
§ Breaks  between  encoded  numbers  can  occur  a}er  
any  bit  posi*on  
§ Unary  code  
Assignment Project Exam Help
§ Encode  k  by  khttps://powcoder.com
 1s  followed  by  0  
§ 0  at  end  makes  code  unambiguous  
Add WeChat powcoder

43  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search      
https://powcoder.com
Assignment Project Exam Help
Unary  and  Binary  Codes  
Add WeChat powcoder
§ Unary  is  very  efficient  for  small  numbers  such  as  0  
and  1,  but  quickly  becomes  very  expensive  
Assignment Project Exam Help
§ 1023  can  be  represented  in  10  binary  bits,  but  requires  
1024  bits  in  uhttps://powcoder.com
nary  
§ Binary  is  more  Add
efficient  
WeChat for  lpowcoder
arge  numbers,  but  it  may  
be  ambiguous  

44  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search      
https://powcoder.com
Assignment Project Exam Help
Elias-­‐γ  Code  
Add WeChat powcoder
§ To  encode  a  number  k,  compute  
unary
Assignment Project Exam Help binary
§ kd  is  number   of  binary  digits,  encoded  in  unary  
https://powcoder.com
Add WeChat powcoder

45  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search      
https://powcoder.com
Assignment Project Exam Help
Elias-­‐δ  Code  
Add WeChat powcoder
§ Elias-­‐γ  code  uses  no  more  bits  than  unary,  many  
fewer  for  k  >  2  
Assignment
§ 1023  takes   Project
19  bits  instead   Exam
of  1024   bits  Help
using  unary  
§ In  general,  takes   2⎣log2k⎦+1  bits  
https://powcoder.com
§ To  improve  coding   o f   l
Add WeChat powcoderarge   n umbers,   u se   E lias-­‐δ   code  
§ Instead  of  encoding  kd  in  unary,  we  encode  kd  +  1  using  
Elias-­‐γ  
§ Takes  approximately  2  log2  log2  k  +  log2  k  bits  

46  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search      
https://powcoder.com
Assignment Project Exam Help
Elias-­‐δ  Code  
Add WeChat powcoder
§ Split  (kd+  1)  into:  
kdd = blog2 (kd + 1)c
Assignment
kdr = (kd +Project 2blogHelp
1) Exam 2 (kd +1)c

§ encode  kdd  in  https://powcoder.com


unary,  kdr  in  binary,  and  kr  in  binary  

Add WeChat powcoder

47  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search      
https://powcoder.com
Assignment Project Exam Help
Add WeChat powcoder

Assignment Project Exam Help


https://powcoder.com
Add WeChat powcoder

48  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Gamma  code  proper*es  
Add WeChat powcoder
§ G  is  encoded  using  2  ⎣log  G⎦  +  1  bits  
§ Length  of  offset  is  ⎣log  G⎦  bits  
§ Length  oAssignment
f  length  is  ⎣log  Project
G⎦  +  1  bExam
its   Help
§ All  gamma  codes   have  an  odd  number  of  bits  
https://powcoder.com
§ Almost  within  Add
a  factor   o f  
WeChat powcoder 2   o f   b est   p ossible,   l og2   G  

§ Gamma  code  is  uniquely  prefix-­‐decodable,  like  VB  


§ Gamma  code  can  be  used  for  any  distribu*on  
§ Gamma  code  is  parameter-­‐free  

49  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Gamma  seldom  used  in  prac*ce  
Add WeChat powcoder
§ Machines  have  word  boundaries  –  8,  16,  32,  64  bits  
§ Opera*ons  that  cross  word  boundaries  are  slower  
§ Compressing   Assignment Project Exam
and  manipula*ng   Help
at  the   granularity  of  
bits  can  be  slow  
https://powcoder.com
§ Variable  byte  eAdd ncoding   i s   a
WeChat powcoder ligned   a nd   thus  
poten*ally  more  efficient  
§ Regardless  of  efficiency,  variable  byte  is  conceptually  
simpler  at  liZle  addi*onal  space  cost  

50  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Shannon  Limit  
Add WeChat powcoder
§ Is  it  possible  to  derive  codes  that  are  op*mal  (under  
certain  assump*ons)?  
Assignment
§ What  is  the   ProjectcExam
op*mal  average   Help for  a  code  
ode  length  
that  encodes  ehttps://powcoder.com
ach  integer  (gap  lengths)  
independently?  
Add WeChat powcoder
§ Lower  bounds  on  average  code  length:  Shannon  
entropy  
§ H(X)  =  -­‐  Σx=1n  Pr[X=x]  log  Pr[X=x]  
§ Asympto*cally  op*mal  codes  (finite  alphabets):  
arithme*c  coding,  Huffman  codes  
51  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
How  to  design  an  op*mal  code  
for  geometric  distribu*on?  
Assignment Project Exam Help
Global  Bernoulli  Model  
Add WeChat powcoder
§ Assump*on:  term  occurrence  are  Bernoulli  events  
§ Nota*on:  
Assignment Project Exam Help
§ n:  #  of  documents,  m:  #  of  terms  in  vocabulary  
§ N:  total  #  of  (https://powcoder.com
unique)  occurrences  
§ Probability  of  aAdd
 term   tj   o
WeChat powcoder ccurring   i n   d ocument   d i :   p   =  
N/nm  
§ Each  term-­‐document  occurrence  is  an  independent  
event  
§ Probability  of  a  gap  of  length  x  is  given  by  the  
geometric  distribu*on   Pr[X = x] = (1− p) x−1 ⋅ p
52  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
It  can  also  be  deemed  as  a  
Assignment Project Exam Help
generaliza*on  of  the  unary  code.  
Golomb  Code  
Add WeChat powcoder
§ Golomb  Code  (Golomb  1966):  highly  efficient  way  to  
design  op*mal  Huffman-­‐style  code  for  geometric  
distribu*on  
Assignment Project Exam Help
§ Parameter  b   https://powcoder.com
§ For  given  x  ≥  1,  computer  integer  quo*ent   q = ⎣(x −1) b⎦
§ and  remainder   Add   WeChat powcoder r = (x −1) − q ⋅ b
§ Assume  b  =  2k  
§ Encode  q  in  unary,  followed  by  r  coded  in  binary  
§ A  bit  complicated  if  b  !=  2k.  See  wikipedia.  
§ First  step:  (q+1)  bits  
§ Second  step:  log(b)  bits   53  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Golomb  Code  &  Rice  Code  
Add WeChat powcoder
§ How  to  determine  op*mal  b*?  
§ Select  minimal  b  such  that  
Assignmentb Project Exam b +1 Help
(1− p) + (1− p) ≤1
https://powcoder.com
§ Result  due  to  Gallager  and  Van  Voorhis  1975:  
generates  an  oAdd p*mal  
WeChatprefix  powcoder
code  for  geometric  
distribu*on  
§ Small  p  approxima*on:  
b* ≈ ln2 p = 0.69 ⋅ avg _ val
§ Rice  code:  only  allow  b  =  2k  

54  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Local  Bernoulli  Model  
Add WeChat powcoder
§ If  length  of  pos*ng  lists  is  known,  then  a  Bernoulli  
model  on  each  individual  inverted  list  can  be  used  
§ Frequent  Assignment Project
words  are  coded   Exam
with   Helpb,  infrequent  
smaller  
words  with  larger   b  
https://powcoder.com
§ Term  frequency  need  to  be  encoded  (use  gamma-­‐
Add WeChat powcoder
code)  
§ Local  Bernoulli  outperforms  global  Bernoulli  model  in  
prac*ce  (method  of  prac*ce!)  

55  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
RCV1  compression  
Add WeChat powcoder
Data structure Size in MB
dictionary, fixed-width 11.2
dictionary, termAssignment Project
pointers into string Exam Help 7.6
with blocking, k = 4 7.1
https://powcoder.com
with blocking & front coding 5.9
collection (text, xml markup etc) 3,600.0
Add WeChat powcoder
collection (text) 960.0
Term-doc incidence matrix 40,000.0
postings, uncompressed (32-bit words) 400.0
postings, uncompressed (20 bits) 250.0
postings, variable byte encoded 116.0
postings, g-encoded 101.0

56  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
 
Assignment Project Exam Help
Google’s  Indexing  Choice  
Add WeChat powcoder
§ Index  shards  par**on  by  doc,  mul*ple  replicates  
§ Disk-­‐resident  index  
Assignment Project Exam Help
§ Use  outer  parts  of  the  disk  
§ Use  different  https://powcoder.com
compression  methods  for  different  fields:  
Ricek  (a  special  kind  of  Golomb  code)  for  gaps,  and  Gamma  
for  posi*ons.  Add
  WeChat powcoder
§ In-­‐memory  index  
§ All  posi*ons;  No  docid  
§ Keep  track  of  document  boundaries  
§ Group-­‐variant  encoding  
§ Fast  to  decode  
Source:  Jeff  Dean’s  WSDM  2009  Keynote   57  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
 
Assignment Project Exam Help
Other  details  
Add WeChat powcoder
§ Gap  =  docidn-­‐  docidn-­‐1  -­‐  1  
§ Freq  =  freq  –  1  
Assignment Project Exam Help
§ Pos_Gap  =  posn-­‐  posn-­‐1  -­‐  1  
https://powcoder.com
§ C.f.,  Jiangong  ZAdd
hang,  
WeChatXiaohui   Long  and  Torsten  Suel:  
powcoder
Performance  of  Compressed  Inverted  List  Caching  in  
Search  Engines.  WWW  2008.    

58  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Sec.
    5.3
https://powcoder.com
Assignment Project Exam Help
Index  compression  summary  
Add WeChat powcoder
§ We  can  now  create  an  index  for  highly  efficient  
Boolean  retrieval  that  is  very  space  efficient  
§ Only  4%  oAssignment Project
f  the  total  size   Exam
of  the   Help
collec*on  
§ Only  10-­‐15%  ohttps://powcoder.com
f  the  total  size  of  the  text  in  the  
collec*on  
Add WeChat powcoder
§ However,  we’ve  ignored  posi*onal  informa*on  
§ Hence,  space  savings  are  less  for  indexes  used  in  
prac*ce  
§ But  techniques  substan*ally  the  same.  

59  
   
COMP6714:  Informa2on  Retrieval  &  Web  Search   Ch.
    5
https://powcoder.com
Assignment Project Exam Help
Resources  for  today’s  lecture  
Add WeChat powcoder
§ IIR  5  
§ MG  3.3,  3.4.  
Assignment Project Exam Help
§ F.  Scholer,  H.E.  Williams  and  J.  Zobel.  2002.  
Compression  ohttps://powcoder.com
f  Inverted  Indexes  For  Fast  Query  
Evalua*on.  Proc.  ACM-­‐SIGIR  2002.  
Add WeChat powcoder
§ Variable  byte  codes  
§ V.  N.  Anh  and  A.  Moffat.  2005.  Inverted  Index  
Compression  Using  Word-­‐Aligned  Binary  Codes.  
Informa2on  Retrieval  8:  151–166.      
§ Word  aligned  codes  
60  

You might also like