Introduction to Suffix Array Construction Methods

Takashi HOSHINO
Cybozu Labs, Inc.

2013-04-19: material for an internal machine learning study group
2013-06-18: slides completed

Overview
•  Two Efficient Algorithms for Linear Suffix Array Construction
–  Authors: Ge Nong, Sen Zhang, and Wai Hong Chan
•  1st algorithm: SA-IS
–  Induced Sorting Variable-Length LMS-Substrings
•  2nd algorithm: SA-DS
–  Radix Sorting Fixed-Length d-Critical Substrings

SA-IS/SA-DS algorithm outline

SA-IS(S, SA)
    Scan S to create t
    Find all LMS-substrings to create P1
    Induced-sort all the LMS-substrings using P1 and B
    Name each LMS-substring to create S1
    If each char in S1 is unique:
        SA1[S1[i]] = i for all i
    Else:
        SA-IS(S1, SA1)
    Induce SA from SA1

SA-DS(S, SA)
    Scan S to create t
    Find all the d-critical substrings to create P1
    Radix sort all the d-critical substrings in P1 using B
    Name each d-critical substring to create S1
    If each char in S1 is unique:
        SA1[S1[i]] = i for all i
    Else:
        SA-DS(S1, SA1)
    Induce SA from SA1

Data description (1)
•  S: input string
–  Its length is n
–  Assumed to be terminated by the sentinel $: S[i] > $ for all i in [0, n-1)
•  SA: output suffix array
•  t: bit array of length n
–  Holds the L/S-type of each S[i] (described later)
–  t[i] = 1 if S[i] is S-type, else 0

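Data description (2)
•  P1: integer array of length n1 (n1 <= n/2)
–  Differs between SA-IS and SA-DS (described later)
•  K: alphabet size
–  K = 256 if a character is 1 byte
–  In a recursive call, K is at most n1
•  B: working data for bucket sort
–  Integer array of length K + 1
–  Each integer is in the range [0, n]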

L/S-Type, LMS-char/substring
•  L/S-type:
–  Every character of S is classified as either L-type or S-type
–  $ is S-type
–  S[i] < S[i+1] → S-type
–  S[i] > S[i+1] → L-type
–  S[i] == S[i+1] → same type as S[i+1]
•  LMS-char:
–  LMS: Left-Most-S
–  S[i] such that S[i] is S-type and S[i-1] is L-type
•  LMS-substring: S[i..j]
–  S[i] and S[j] are LMS-chars, and S[i+1..j-1] contains no LMS-char
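The classification rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `classify_types` and `lms_positions` are made up for this example, and the input is assumed to be a Python string ending in `$`, which sorts below lowercase letters.

```python
def classify_types(s):
    """Return t where t[i] is 1 for S-type and 0 for L-type."""
    n = len(s)
    t = [0] * n
    t[n - 1] = 1  # the sentinel $ is S-type by definition
    for i in range(n - 2, -1, -1):
        if s[i] < s[i + 1]:
            t[i] = 1
        elif s[i] > s[i + 1]:
            t[i] = 0
        else:
            t[i] = t[i + 1]  # equal characters take the type of the right neighbour
    return t


def lms_positions(t):
    """Indices i where S[i] is S-type and S[i-1] is L-type."""
    return [i for i in range(1, len(t)) if t[i] == 1 and t[i - 1] == 0]


s = "mmiissiissiippii$"
t = classify_types(s)
print("".join(map(str, t)))  # 00110011001100001
print(lms_positions(t))      # [2, 6, 10, 16]
```

A single right-to-left scan is enough because the type of S[i] depends only on S[i], S[i+1] and the type of S[i+1].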

Data example

Idx:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:     m  m  i  i  s  s  i  i  s  s  i  i  p  p  i  i  $
t:     0  0  1  1  0  0  1  1  0  0  1  1  0  0  0  0  1
LMS:         *           *           *                 *
P1:    2  6 10 16

B:     $:1, i:8, m:2, p:2, s:4, others:0   ($ < i < m < p < s)

Buckets, left to right: $, i, m, p, s
tmp:   {_} {_ _ _ _ _ _ _ _} {_ _} {_ _} {_ _ _ _}
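As a rough illustration of how the per-character counts in B translate into the bucket layout of tmp, the sketch below derives each bucket's range from the counts. The slides store B as an integer array of length K + 1; using a dictionary keyed by character, and the name `bucket_bounds`, are assumptions made here for readability.

```python
from collections import Counter


def bucket_bounds(s):
    """Map each character to the half-open range [start, end) of its bucket in tmp."""
    counts = Counter(s)
    bounds, start = {}, 0
    for c in sorted(counts):          # buckets are laid out in character order
        bounds[c] = (start, start + counts[c])
        start += counts[c]
    return bounds


print(bucket_bounds("mmiissiissiippii$"))
# {'$': (0, 1), 'i': (1, 9), 'm': (9, 11), 'p': (11, 13), 's': (13, 17)}
```

The bucket sizes 1, 8, 2, 2, 4 match the counts listed for B in the data example.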

SA-IS algorithm pieces
•  Find all LMS-substrings to create P1
•  Induced-sort all the LMS-substrings using P1 and B
•  Name each LMS-substring to create S1
•  Recursive call SA-IS(S1, SA1)
•  Induce SA from SA1

Induced-sort all the LMS-substrs
•  (1) Initialize tmp so that every slot is empty
•  (2) Scan P1 and put each index into the bucket of its character, filling the bucket from right to left
•  (3) Scan tmp from left to right; if t[tmp[i] - 1] is 0, put tmp[i] - 1 into the bucket of S[tmp[i] - 1], filling from the left
•  (4) Scan tmp from right to left; if t[tmp[i] - 1] is 1, put tmp[i] - 1 into the bucket of S[tmp[i] - 1], filling from the right

Idx:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:     m  m  i  i  s  s  i  i  s  s  i  i  p  p  i  i  $
t:     0  0  1  1  0  0  1  1  0  0  1  1  0  0  0  0  1
LMS:         *           *           *                 *
P1:    2  6 10 16

Buckets, left to right: $, i, m, p, s
After (2)  tmp: {16} { _  _  _  _  _ 10  6  2} { _  _} { _  _} { _  _  _  _}
After (3)  tmp: {16} {15 14  _  _  _ 10  6  2} { 1  0} {13 12} { 9  5  8  4}
After (4)  tmp: {16} {15 14 10  6  2 11  7  3} { 1  0} {13 12} { 9  5  8  4}
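Steps (1) to (4) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `induced_sort` is invented here, buckets are tracked with dictionaries instead of the array B, and `None` stands for an empty slot.

```python
def induced_sort(s, t, seeds):
    """Induced sort following steps (1)-(4); seeds are placed tail-first in the given order."""
    n = len(s)
    alphabet = sorted(set(s))
    counts = {c: s.count(c) for c in alphabet}

    def heads():
        pos, out = 0, {}
        for c in alphabet:
            out[c] = pos              # first slot of the bucket
            pos += counts[c]
        return out

    def tails():
        pos, out = 0, {}
        for c in alphabet:
            pos += counts[c]
            out[c] = pos - 1          # last slot of the bucket
        return out

    tmp = [None] * n                  # (1) every slot starts empty
    tail = tails()
    for p in seeds:                   # (2) fill each bucket from right to left
        tmp[tail[s[p]]] = p
        tail[s[p]] -= 1

    head = heads()
    for i in range(n):                # (3) left-to-right scan places L-type predecessors
        if tmp[i] is not None and tmp[i] > 0 and t[tmp[i] - 1] == 0:
            j = tmp[i] - 1
            tmp[head[s[j]]] = j
            head[s[j]] += 1

    tail = tails()
    for i in range(n - 1, -1, -1):    # (4) right-to-left scan places S-type predecessors
        if tmp[i] is not None and tmp[i] > 0 and t[tmp[i] - 1] == 1:
            j = tmp[i] - 1
            tmp[tail[s[j]]] = j
            tail[s[j]] -= 1
    return tmp


s = "mmiissiissiippii$"
t = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
print(induced_sort(s, t, [2, 6, 10, 16]))
# [16, 15, 14, 10, 6, 2, 11, 7, 3, 1, 0, 13, 12, 9, 5, 8, 4]
```

In step (4) the bucket tails are reset, so the seeds placed in step (2) are overwritten by the induced S-type entries; this is why 2, 6, 10 move from the right end of the i bucket to its middle.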

Find the lexicographic names of all substrings

Idx:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:     m  m  i  i  s  s  i  i  s  s  i  i  p  p  i  i  $
t:     0  0  1  1  0  0  1  1  0  0  1  1  0  0  0  0  1
LMS:         *           *           *                 *
P1:    2  6 10 16

Buckets, left to right: $, i, m, p, s
tmp:   {16} {15 14 10  6  2 11  7  3} { 1  0} {13 12} { 9  5  8  4}

LMS-substrs in the order of the suffix array (tmp):

Index:       16   10        6       2
Substring:   $    iippii$   iissi   iissi
Name:        0    1         2       2

Rename the items so that identical LMS-substrings get the same name.

Renamed items in the order of S:

S1:    2  2  1  0
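The naming step can be sketched as below, assuming tmp already holds the result of the induced sort. The function name `name_lms_substrings` is hypothetical. Comparing characters alone is enough in this sketch because both substrings end at an LMS-char, which fixes the types inside them.

```python
def name_lms_substrings(s, t, tmp):
    """Name LMS-substrings in tmp order; return S1 (names in the order of S)."""
    n = len(s)
    is_lms = [i > 0 and t[i] == 1 and t[i - 1] == 0 for i in range(n)]
    p1 = [i for i in range(n) if is_lms[i]]

    def lms_substring(i):
        # S[i..j], where j is the next LMS position; the sentinel stands alone.
        j = i + 1
        while j < n and not is_lms[j]:
            j += 1
        return s[i:j + 1]

    names, prev, name = {}, None, -1
    for i in tmp:
        if is_lms[i]:
            cur = lms_substring(i)
            if cur != prev:           # a new substring gets the next name
                name += 1
            names[i] = name
            prev = cur
    return [names[i] for i in p1]


s = "mmiissiissiippii$"
t = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
tmp = [16, 15, 14, 10, 6, 2, 11, 7, 3, 1, 0, 13, 12, 9, 5, 8, 4]
print(name_lms_substrings(s, t, tmp))  # [2, 2, 1, 0]
```

Because S1 contains the name 2 twice, its characters are not unique and a recursive call is needed.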

Recursive call of SA-IS(S1, SA1)

Idx:   0  1  2  3
S1:    2  2  1  0
t:     0  0  0  1
LMS:            *
P1:    3

Buckets, left to right: 0, 1, 2
tmp:   {3} {2} {1 0}

All items are unique in the created suffix array (tmp)

SA1:   3  2  1  0

Induce SA from SA1

Idx:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:     m  m  i  i  s  s  i  i  s  s  i  i  p  p  i  i  $
t:     0  0  1  1  0  0  1  1  0  0  1  1  0  0  0  0  1
LMS:         *           *           *                 *
P1:    2  6 10 16
SA1:   3  2  1  0

for i from 4-1 to 0: put P1[SA1[i]] in the suffix array (tmp)

Buckets, left to right: $, i, m, p, s
tmp:   {16} { _  _  _  _  _ 10  6  2} { _  _} { _  _} { _  _  _  _}
tmp:   {16} {15 14  _  _  _ 10  6  2} { 1  0} {13 12} { 9  5  8  4}
tmp:   {16} {15 14 10  6  2 11  7  3} { 1  0} {13 12} { 9  5  8  4}
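Putting the pieces together, the whole SA-IS procedure can be sketched compactly. This is a readability-first sketch under stated assumptions, not the authors' code: the input is a list of integers in [0, K) whose last element is a unique minimum (the sentinel), and the names `sa_is` and `suffix_array` are invented for this example. It does not attempt the space optimizations of a production implementation.

```python
def sa_is(s, K):
    """Suffix array of the integer list s (values in [0, K)); s[-1] is a unique minimum."""
    n = len(s)
    if n == 1:
        return [0]
    t = [1] * n                                   # 1 = S-type, 0 = L-type
    for i in range(n - 2, -1, -1):
        t[i] = 1 if s[i] < s[i + 1] or (s[i] == s[i + 1] and t[i + 1]) else 0
    is_lms = [i > 0 and t[i] == 1 and t[i - 1] == 0 for i in range(n)]
    p1 = [i for i in range(n) if is_lms[i]]
    counts = [0] * K
    for c in s:
        counts[c] += 1

    def bucket_ends(tail):
        out, pos = [], 0
        for c in counts:
            pos += c
            out.append(pos - 1 if tail else pos - c)
        return out

    def induce(seeds):
        sa = [-1] * n
        tail = bucket_ends(True)
        for p in seeds:                           # seeds fill buckets right to left
            sa[tail[s[p]]] = p
            tail[s[p]] -= 1
        head = bucket_ends(False)
        for i in range(n):                        # induce L-type suffixes
            j = sa[i] - 1
            if j >= 0 and t[j] == 0:
                sa[head[s[j]]] = j
                head[s[j]] += 1
        tail = bucket_ends(True)
        for i in range(n - 1, -1, -1):            # induce S-type suffixes
            j = sa[i] - 1
            if j >= 0 and t[j] == 1:
                sa[tail[s[j]]] = j
                tail[s[j]] -= 1
        return sa

    def same_substring(a, b):
        k = 0
        while True:                               # compare LMS-substrings char by char
            a_end = k > 0 and is_lms[a + k]
            b_end = k > 0 and is_lms[b + k]
            if s[a + k] != s[b + k] or a_end != b_end:
                return False
            if a_end:
                return True
            k += 1

    sa = induce(p1)                               # sorts the LMS-substrings
    names, name, prev = [-1] * n, -1, -1
    for p in sa:
        if is_lms[p]:
            if prev < 0 or not same_substring(prev, p):
                name += 1
            names[p] = name
            prev = p
    s1 = [names[p] for p in p1]

    if name + 1 == len(p1):                       # every char in S1 is unique
        sa1 = [0] * len(p1)
        for i, c in enumerate(s1):
            sa1[c] = i
    else:
        sa1 = sa_is(s1, name + 1)
    # "for i from n1-1 to 0: put P1[SA1[i]]", then induce again
    return induce([p1[x] for x in reversed(sa1)])


def suffix_array(text):
    """Wrapper for strings that end with a unique smallest character such as '$'."""
    alphabet = sorted(set(text))
    rank = {c: r for r, c in enumerate(alphabet)}
    return sa_is([rank[c] for c in text], len(alphabet))


print(suffix_array("mmiissiissiippii$"))
# [16, 15, 14, 10, 6, 2, 11, 7, 3, 1, 0, 13, 12, 9, 5, 8, 4]
```

The same `induce` routine is used twice: first seeded with P1 in text order to sort the LMS-substrings, then seeded with P1 reordered by SA1 to produce the final suffix array.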

SA-DS algorithm pieces
•  Find all the d-critical substrings to create P1
•  Radix sort all the d-critical substrings in P1 using B

d-Critical char/substring
•  What is d?
–  A constant
–  2 <= d
•  d-Critical char:
–  S[i] is an LMS-char → S[i] is a d-critical char
–  S[i-d] is a d-critical char, and neither S[i-1] nor S[i+1] is an LMS-char → S[i] is a d-critical char
•  d-Critical substring: S[i..i+d+1]
–  S[i] is a d-critical char
–  If the string ends before i+d+1, the missing tail is padded with S[n-1], i.e. $
–  The length is fixed at d + 2
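The two rules for d-critical chars and the fixed-length padding can be sketched as follows. The code follows the rules exactly as worded on this slide, which may be a simplification of the paper's definition; the function names are hypothetical and d defaults to 2.

```python
def d_critical_positions(t, d=2):
    """Positions of d-critical chars, using the two rules as stated on the slide."""
    n = len(t)
    is_lms = [i > 0 and t[i] == 1 and t[i - 1] == 0 for i in range(n)]
    critical = [False] * n
    for i in range(n):                # left to right, so critical[i - d] is already known
        if is_lms[i]:
            critical[i] = True
        elif (i >= d and critical[i - d]
              and not is_lms[i - 1]
              and (i + 1 >= n or not is_lms[i + 1])):
            critical[i] = True
    return [i for i in range(n) if critical[i]]


def d_critical_substring(s, i, d=2):
    """S[i..i+d+1], padded with the sentinel S[n-1] when it runs past the end."""
    return "".join(s[k] if k < len(s) else s[-1] for k in range(i, i + d + 2))


s = "mmiissiissiippii$"
t = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
p1 = d_critical_positions(t)
print(p1)                                         # [2, 4, 6, 8, 10, 12, 14, 16]
print([d_critical_substring(s, i) for i in p1])
# ['iiss', 'ssii', 'iiss', 'ssii', 'iipp', 'ppii', 'ii$$', '$$$$']
```

Position 15 is not d-critical even though 13 would be d positions back, because 13 itself is not d-critical and S[16] is an LMS-char.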

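About P1
•  SA-IS
–  The sequence of indices in S of the first character of each LMS-substring
–  However, P1 can be derived from S and t, so SA-IS does not build P1 explicitly
•  SA-DS
–  The sequence of indices in S of the first character of each d-critical substring
–  Always built, because this is what gets radix-sorted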

ω/γ-weighted substrs
•  Sω[i] = 2S[i] + t[i] for all i in [0, n)
•  ω-weighted substring: Sω[i..j]
•  γ-weighted substring:
–  Sγ[i..j] = S[i..j-1]Sω[j]
•  When radix-sorting P1, the key has to be the ω-weighted d-critical substring
•  Sγ[i..j] is sufficient in place of Sω[i..j]
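A small sketch of the two weightings, assuming characters are mapped to integers with `ord` and that positions past the end read the sentinel. The helper names `omega` and `gamma_key` are invented for this illustration.

```python
def omega(s, t, i):
    """S_omega[i] = 2*S[i] + t[i], with S[i] taken as an integer character code."""
    return 2 * ord(s[i]) + t[i]


def gamma_key(s, t, i, d=2):
    """Gamma-weighted key of S[i..i+d+1]: plain codes, then the omega-weighted last char."""
    n = len(s)
    idx = [min(k, n - 1) for k in range(i, i + d + 2)]   # pad with the sentinel position
    return tuple(ord(s[k]) for k in idx[:-1]) + (omega(s, t, idx[-1]),)


s = "mmiissiissiippii$"
t = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
print(gamma_key(s, t, 4) == gamma_key(s, t, 8))          # True: both are "ssii" ending in an S-type i
print(gamma_key(s, t, 12)[-1] < gamma_key(s, t, 4)[-1])  # True: an L-type i sorts before an S-type i
```

The type bit only breaks ties between equal characters, so an L-type occurrence of a character always sorts before an S-type occurrence of the same character.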

d-Critical substrs

Idx:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:     m  m  i  i  s  s  i  i  s  s  i  i  p  p  i  i  $
t:     0  0  1  1  0  0  1  1  0  0  1  1  0  0  0  0  1
LMS:         *           *           *                 *
P1:    2  4  6  8 10 12 14 16

 2: iiss
 4: ssii
 6: iiss
 8: ssii
10: iipp
12: ppii
14: ii$$
16: $$$$

Radix sort of P1

Idx:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:     m  m  i  i  s  s  i  i  s  s  i  i  p  p  i  i  $
t:     0  0  1  1  0  0  1  1  0  0  1  1  0  0  0  0  1
LMS:         *           *           *                 *
P1:    2  4  6  8 10 12 14 16

Each pass sorts stably by one character, from the last character to the first.

Input       Pass 1      Pass 2      Pass 3      Pass 4
 2: iiss    14: ii$$    14: ii$$    16: $$$$    16: $$$$
 4: ssii    16: $$$$    16: $$$$    14: ii$$    14: ii$$
 6: iiss    12: ppii    12: ppii    10: iipp    10: iipp
 8: ssii     4: ssii     4: ssii     2: iiss     2: iiss
10: iipp     8: ssii     8: ssii     6: iiss     6: iiss
12: ppii    10: iipp    10: iipp    12: ppii    12: ppii
14: ii$$     2: iiss     2: iiss     4: ssii     4: ssii
16: $$$$     6: iiss     6: iiss     8: ssii     8: ssii

In pass 1 the last character is ω-weighted, so 12 (last i is L-type) comes before 4 and 8 (last i is S-type).

P1':   16 14 10  2  6 12  4  8
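The passes above can be reproduced with a short LSD radix sort. This is a sketch under assumptions: the name `radix_sort_d_critical` is made up, and each pass uses a dictionary of lists as its buckets rather than the counting array B a real implementation would use.

```python
def radix_sort_d_critical(s, t, p1, d=2):
    """LSD radix sort of P1 keyed by gamma-weighted d-critical substrings; returns P1'."""
    n = len(s)

    def digit(i, k):
        pos = min(i + k, n - 1)              # positions past the end read the sentinel
        code = ord(s[pos])
        # Only the last character carries the type bit (gamma weighting).
        return 2 * code + t[pos] if k == d + 1 else code

    order = list(p1)
    for k in range(d + 1, -1, -1):           # last character first
        buckets = {}
        for i in order:                      # appending keeps each pass stable
            buckets.setdefault(digit(i, k), []).append(i)
        order = [i for key in sorted(buckets) for i in buckets[key]]
    return order


s = "mmiissiissiippii$"
t = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
print(radix_sort_d_critical(s, t, [2, 4, 6, 8, 10, 12, 14, 16]))
# [16, 14, 10, 2, 6, 12, 4, 8]
```

Since every key has the same length d + 2, the number of passes is constant and the sort is linear in the length of P1.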

Find lexicographic names

Idx:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:     m  m  i  i  s  s  i  i  s  s  i  i  p  p  i  i  $
t:     0  0  1  1  0  0  1  1  0  0  1  1  0  0  0  0  1
LMS:         *           *           *                 *
P1:    2  4  6  8 10 12 14 16

P1':   16 14 10  2  6 12  4  8
Name:   0  1  2  3  3  4  5  5   (order of P1')
S1:     3  5  3  5  2  4  1  0   (order of P1)

Recursive call of SA-DS

Idx:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:     m  m  i  i  s  s  i  i  s  s  i  i  p  p  i  i  $
t:     0  0  1  1  0  0  1  1  0  0  1  1  0  0  0  0  1
LMS:         *           *           *                 *
P1:    2  4  6  8 10 12 14 16

P1':   16 14 10  2  6 12  4  8
S1:     3  5  3  5  2  4  1  0
t1:     1  0  1  0  1  0  0  1
P2:     2  4  7
P2':    7  4  2
S2:     2  1  0
SA2:    2  1  0

Evaluation data
… algorithms were implemented in C++ and compiled by g++ with the option of -O3. T…
…d from Sanders's website [19]. For the KA algorithm, we use an improved versio…
…o's website [20]) from Yuta Mori. The source code of our algorithm IS is given…
DS1 and DS2 were embodied in less than 150 and 250 effective lines of code, r…
… on request.

Table 1: Data Used in the Experiments
Data            Characters      Σ    Description
bible.txt        4 047 392     63    King James Bible
chr22.dna       34 553 758      4    Human chromosome 22
E.coli           4 638 690      4    Escherichia coli genome
etext99        105 277 340    146    Texts from Gutenberg project
howto           39 422 105    197    Linux Howto files
pic                513 216    159    Black and white fax picture
sprot34.dat    109 617 186     66    Swissprot V34 protein database
world192.txt     2 473 400     94    CIA world fact book
alphabet           100 000     26    Repetitions of the alphabet [a-z]
random             100 000     64    Randomly selected from 64 characters

…ace The time for each algorithm is the mean of 3 runs, and the space is the hea…
…emusage command to fire the running of each program. The total time (in se…

Evaluation results: construction time
Table 2: Time (Seconds)
Data          IS       DS1      DS2       KS        KA
bible         2.7      3.11     3.9       8.9       3.62
chr22        24.7     31.5     39.6      92.8      34.1
E.coli        2.8      3.53     4.3      10         3.98
etext       101      123.2    150.4     428.1     149.67
howto        30.4     36.3     44.05    130.4      42.85
pic           0.06     0.09     0.13      0.56      0.29
sprot        94.6    111.59   139.6     356       132.91
world         1.3      1.61     2         4.8       1.84
alphabet      0.00     0.01     0.02      0.15      0.02
random        0.02     0.01     0.01      0.06      0.02
Total       257.58   310.95   384.01   1031.77    369.3
Mean          0.90     1.08     1.34      3.60      1.29
Norm.         1        1.21     1.49      4.01      1.43
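The Norm. row appears to be each algorithm's total divided by the smallest total. A quick check under that assumption, using only the Total row of Table 2:

```python
# Totals copied from the "Total" row of Table 2 (seconds).
totals = {"IS": 257.58, "DS1": 310.95, "DS2": 384.01, "KS": 1031.77, "KA": 369.3}

best = min(totals.values())                      # IS has the smallest total
norm = {name: round(total / best, 2) for name, total in totals.items()}
print(norm)
# {'IS': 1.0, 'DS1': 1.21, 'DS2': 1.49, 'KS': 4.01, 'KA': 1.43}
```

The result agrees with the Norm. row, so on this data set IS is roughly 1.2x faster than DS1 and about 4x faster than KS in total.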

Evaluation results: SA size
Table 3: Space (MBytes)
Data           IS        DS1       DS2        KS        KA
bible         20.86     21.50     20.30     90.40     34.45
chr22        178.09    184.44    171.41    819.25    289.97
E.coli        24.29     25.15     23.23    105.93     40.01
etext        542.17    559.55    521.85   2369.92    907.34
howto        203.16    208.08    195.55    932.07    331.54
pic            2.57      2.76      2.79     15.51      3.11
sprot        554.58    560.44    543.26   2591.62    930.06
world         12.70     12.91     12.50     55.24     21.24
alphabet       0.49      0.74      0.75      3.03      0.52
random         0.61      0.74      0.74      2.26      0.88
Total       1539.52   1576.31   1492.37   6985.23   2559.12
Mean           5.37      5.50      5.20     24.36      8.92
Norm.          1.03      1.06      1         4.68      1.72

Evaluation results: recursion depth
Table 4: Recursion Depth and Reduction Ratio
             Depth                      Ratio
Data         IS    DS    KS    KA       IS    DS    KS    KA
bible         6     6     6     7      .34   .37   .67   .46
chr22         6    10    12     9      .31   .36   .67   .44
E.coli        7     8     7     9      .32   .36   .67   .45
etext        11    12    12    15      .33   .37   .67   .45
howto         9    10    11    13      .32   .36   .67   .45
pic           5     9    10     5      .26   .35   .67   .39
sprot         7     8     9    10      .31   .37   .67   .45
world         6     7     6     7      .32   .37   .67   .45
alphabet      2    10    11     2      .02   .34   .67   .02
random        2     1     2     2      .33   .36   .67   .47
Total        61    81    86    80     2.86  3.61  6.7   4.03
Mean          6.1   8.1   8.6   8.0    .29   .36   .67   .40
Norm.         1     1.33  1.41  1.31  1     1.26  2.34  1.38

Conclusion
•  SA-IS is the final answer (FA)
