The document introduces two algorithms for constructing a suffix array: SA-IS and SA-DS. SA-IS uses induced sorting of longest common prefix substrings, while SA-DS uses radix sorting of fixed-length substrings. The document provides pseudocode for the algorithms and explains various terms and data structures used, including longest minimal suffixes, L-type and S-type characters, and buckets for sorting.
概要
• Two
Efficient
Algorithms
for
Linear
Suffix
Array
ConstrucMon
– Authors:
Ge
Nong,
Sen
Zhang,
and
Wai
Hong
Chan
• 1st
algorithm:
SA-‐IS
– Induced
SorMng
Variable-‐Length
LMS-‐Substrings
• 2nd
algorithm:
SA-‐DS
– Radix
SorMng
Fixed-‐Length
d-‐CriMcal
Substrings
3.
SA-‐IS/SA-‐DS アルゴリズム概要
SA-‐IS(S,SA)
Scan
S
to
create
t
Find
all
LMS-‐substrings
to
create
P1
Induced-‐sort
all
the
LMS-‐substrings
using
P1
and
B
Name
each
LMS-‐substring
to
create
S1
If
each
char
in
S1
is
unique:
SA1[S1[i]]
=
i
for
all
i
Else
SA-‐IS(S1,
SA1)
Induce
SA
from
SA1
SA-‐DS(S,SA)
Scan
S
to
create
t
Find
all
the
d-‐critical
substrings
to
create
P1
Radix
sort
all
the
d-‐critical
substrings
in
P1
using
B
Name
each
d-‐critical
substring
to
create
S1
If
each
char
in
S1
is
unique:
SA1[S1[i]]
=
i
for
all
i
Else
SA-‐DS(S1,
SA1)
Induce
SA
from
SA1
4.
データの説明 (1)
• S:
入力文字列
– 長さ n
とする
– The
senMnel
$
で終端されていることを仮定.
S[i]
>
$
for
all
i
in
[0,
n-‐1)
• SA:
出力 Suffix
Array
• t:
長さ n
のビット列
– S[i]
の L/S-‐type
を表す(後述)
– t[i]
=
1
if
S[i]
is
S-‐type,
else
0
データ例
0
1
Idx:
0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
6
S:
m
m
i
i
s
s
i
i
s
s
i
i
p
p
i
i
$
t:
0
0
1
1
0
0
1
1
0
0
1
1
0
0
0
0
1
LMS:
*
*
*
*
P1:
2
6
10
16
B:
$:1,
i:8,
m:2,
p:2,
s:4,
others:0
($
<
i
<
m
<
p
<
s)
$
i
m
p
s
tmp:
{_}
{_
_
_
_
_
_
_
_}
{_
_}
{_
_}
{_
_
_
_}
8.
SA-‐IS
algorithm
piaces
• Find
all
LSM
substring
to
create
P1
• Induced-‐sort
all
the
LMS-‐substrings
using
P1
and
B
• Name
each
LMS-‐substring
to
create
S1
• Recursive
call
SA-‐IS(S1,
SA1)
• Induce
SA
from
SA1
9.
Induced-‐sort
all
the
LMS-‐substrs
• (1)
IniMalize
tmp
where
each
member
is
empty
• (2)
Scan
P1
and
put
to
the
correct
bucket
from
right
to
leg
• (3)
Scan
tmp
from
leg
to
right
and
t[tmp[i]
–
1]
is
0
then
put
it
to
the
bucket
• (4)
Scan
tmp
from
right
to
leg
and
t[tmp[i]
–
1]
is
1
then
put
it
to
the
bucket
0
1
Idx:
0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
6
S:
m
m
i
i
s
s
i
i
s
s
i
i
p
p
i
i
$
t:
0
0
1
1
0
0
1
1
0
0
1
1
0
0
0
0
1
LMS:
*
*
*
*
P1:
2
6
10
16
$
i
m
p
s
tmp:
{16}
{
_
_
_
_
_
10
6
2}
{
_
_}
{
_
_}
{
_
_
_
_}
tmp:
{16}
{15
14
_
_
_
10
6
2}
{
1
0}
{13
12}
{
9
5
8
4}
tmp:
{16}
{15
14
10
6
2
11
7
3}
{
1
0}
{13
12}
{
9
5
8
4}
10.
0
1
Idx:
0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
6
S:
m
m
i
i
s
s
i
i
s
s
i
i
p
p
i
i
$
t:
0
0
1
1
0
0
1
1
0
0
1
1
0
0
0
0
1
LMS:
*
*
*
*
P1:
2
6
10
16
$
i
m
p
s
tmp:
{16}
{15
14
10
6
2
11
7
3}
{
1
0}
{13
12}
{
9
5
8
4}
LSM-‐substrs
in
the
order
of
suffix
array
(tmp)
16
10
6
2
$
iippii$
iissi
iissi
Rename
items
where
the
same
LMS-‐substrings
indicate
the
same
name
0
1
2
2
Renamed
items
in
the
order
of
S
S1:
2
2
1
0
Find
the
lexicographic
names
of
all
substrings
11.
Recursive
call
of
SA-‐IS(S1,
SA1)
Idx:
0
1
2
3
S1:
2
2
1
0
t:
0
0
0
0
LMS:
*
P1:
3
0
1
2
tmp:
{3}
{2}
{1
0}
All
items
are
unique
in
the
created
suffix
array
(tmp)
SA1:
3
2
1
0
12.
Induce
SA
from
SA1
0
1
Idx:
0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
6
S:
m
m
i
i
s
s
i
i
s
s
i
i
p
p
i
i
$
t:
0
0
1
1
0
0
1
1
0
0
1
1
0
0
0
0
1
LMS:
*
*
*
*
P1:
2
6
10
16
SA1
:
3
2
1
0
for
i
from
4-‐1
to
0:
put
P1[SA1[i]]
in
the
suffix
array
(tmp)
$
i
m
p
s
tmp:
{16}
{
_
_
_
_
_
10
6
2}
{
_
_}
{
_
_}
{
_
_
_
_}
tmp:
{16}
{15
14
_
_
_
10
6
2}
{
1
0}
{13
12}
{
9
5
8
4}
tmp:
{16}
{15
14
10
6
2
11
7
3}
{
1
0}
{13
12}
{
9
5
8
4}
13.
SA-‐DS
algorithm
piaces
• Find
all
the
d-‐criMcal
substrings
to
create
P1
• Radix
sort
all
the
d-‐criMcal
substrings
in
P1
using
B
d-‐CriMal
substrs
0
1
Idx:
0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
6
S:
m
m
i
i
s
s
i
i
s
s
i
i
p
p
i
i
$
t:
0
0
1
1
0
0
1
1
0
0
1
1
0
0
0
0
1
LMS:
*
*
*
*
P1:
2
4
6
8
10
12
14
16
2:
iiss
4:
ssii
6:
iiss
8:
ssii
10:
iipp
12:
ppii
14:
ii$$
16:
$$$$
18.
Radix
sort
of
P1
0
1
Idx:
0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
6
S:
m
m
i
i
s
s
i
i
s
s
i
i
p
p
i
i
$
t:
0
0
1
1
0
0
1
1
0
0
1
1
0
0
0
0
1
LMS:
*
*
*
*
P1:
2
4
6
8
10
12
14
16
2:
iiss
14:
ii$$
14:
ii$$
16:
$$$$
16:
$$$$
4:
ssii
16:
$$$$
16:
$$$$
14:
ii$$
14:
ii$$
6:
iiss
12:
ppii
12:
ppii
10:
iipp
10:
iipp
8:
ssii
4:
ssii
4:
ssii
2:
iiss
2:
iiss
10:
iipp
8:
ssii
8:
ssii
6:
iiss
6:
iiss
12:
ppii
10:
iipp
10:
iipp
12:
ppii
12:
ppii
14:
ii$$
2:
iiss
2:
iiss
4:
ssii
4:
ssii
16:
$$$$
6:
iiss
6:
iiss
8:
ssii
8:
ssii
P1’:
16
14
10
2
6
12
4
8
19.
Find
lexicographic
names
0
1
Idx:
0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
6
S:
m
m
i
i
s
s
i
i
s
s
i
i
p
p
i
i
$
t:
0
0
1
1
0
0
1
1
0
0
1
1
0
0
0
0
1
LMS:
*
*
*
*
P1:
2
4
6
8
10
12
14
16
P1’:
16
14
10
2
6
12
4
8
Name:
0
1
2
3
3
4
5
5
(order
of
P1’)
S1:
3
5
3
5
2
4
1
0
(order
of
P1)
20.
Recursive
call
of
SA-‐DS
0
1
Idx:
0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
6
S:
m
m
i
i
s
s
i
i
s
s
i
i
p
p
i
i
$
t:
0
0
1
1
0
0
1
1
0
0
1
1
0
0
0
0
1
LMS:
*
*
*
*
P1:
2
4
6
8
10
12
14
16
P1’:
16
14
10
2
6
12
4
8
S1:
3
5
3
5
2
4
1
0
t1:
1
0
1
0
1
0
0
1
P2:
2
4
7
P2’:
7
4
2
S2:
2
1
0
SA2:
2
1
0
21.
評価データ
rithms were implementedin C++ and compiled by g++ with the option of -O3. T
d from Sanders’s website [19]. For the KA algorithm, we use an improved versio
o’s website [20]) from Yuta Mori. The source code of our algorithm IS is given
DS1 and DS2 were embodied in less than 150 and 250 effective lines of code, r
on request.
Table 1: Data Used in the Experiments
Data Characters Σ Description
bible.txt 4 047 392 63 King James Bible
chr22.dna 34 553 758 4 Human chromosome 22
E.coli 4 638 690 4 Escherichia coli genome
etext99 105 277 340 146 Texts from Gutenberg project
howto 39 422 105 197 Linux Howto files
pic 513 216 159 Black and white fax picture
sprot34.dat 109 617 186 66 Swissprot V34 protein database
world192.txt 2 473 400 94 CIA world fact book
alphabet 100 000 26 Repetitions of the alphabet [a-z]
random 100 000 64 Randomly selected from 64 characters
ace The time for each algorithm is the mean of 3 runs, and the space is the hea
emusage command to fire the running of each program. The total time (in se