Staticscheduling PDF
Staticscheduling PDF
JANUARY 1987
Abstract-hrge grain data flow (LGDF) programming is and software techniques enabling a programmer to write
natural and convenient for describing digital signal processing sequential code irrespective of the parallel nature of the
(DSP) systems, but its runtime overhead is costly in real time or underlying hardware. For example, in machines with multiple
cost-sensitive applications. In some situations, designers are not
willing to squander computing resources for the sake of program- function units, such as the C D C W and Cray family, so
mer convenience. This is particularly true when the target called "scoreboarding" hardware resolves conflicts to ensure
machine is a programmable D S P chip. However, the runtime the integrity of sequential code. In deeply pipelined machines
overhead inherent in most LGDF implementations is not required such as the IBM 360 Model 91, interlocking mechanisms [9]
for most signal processing systems because such systems are
resolve pipeline conflicts. In the M.I.T. Lincoln Labs signal
mostly synchronous (in the D S P sense). Synchronous data flow
(SDF) differs from traditional data flow in that the amount of processor [ 101 specialized associative memories are used to
data produced and consumed by a data flow node is specified a ensure the integrity of data precedences.
priori for each input and output. This is equivalent to specifying The affinity for von Neumann programming is not all
the relative sample rates in signal processing system. This means surprising, stemming from familiarity and a proven track
that the scheduling of SDF nodes need not be done at runtime,
but can be done at compile time (statically), so the runtime
record, but the cost is high in the design of specialized digital
overhead evaporates. The sample rates can all be different, which signal processors. Comparing two pipelined chips that differ
is not true of most current data-driven digital signal processing radically only in programming methodology, the TI
programming methodologies. Synchronous data flow is closely TMS32010 [2] and the Bell Labs DSP20, a faster version of
related to computation graphs, a special case of Petri nets. the DSPl [6], we find that they achieve exactly the same
This self-contained paper develops the theory necessary to
statically schedule SDF programs on single or multiple proces-
performance on the most basic benchmark, the FIR (finite
sors. A class of static (compile time) scheduling algorithms is impulse response) filter. But the Bell Labs chip outperforms
I
proven valid, and specific algorithms are given for scheduling the TI chip on the next most basic benchmark, the IIR (infinite
SDF systems onto single or multiple processors. impulse response) filter. Surprisingly, close examination
reveals that the arithmetic hardware (multiplier and ALU) of
Index Terms-Block diagram, computation graphs, data flow
digital signal processing, hard real-time systems, multiprocessing, the Bell Labs chip is half as fast as in the TI chip. The
Petri nets, static scheduling, synchronous data flow. performance gain appears to follow from the departure from
conventional sequential programming.
However, programming the Bell Labs chip is not easy. The
I. INTRODUCTION code more closely resembles horizontal microcode than
I i: II. THEDATAFLOWPARADIGM
i In data flow, a program is divided into pieces (nodes or
blocks) which can execute (fire) whenever input data are Fig. 1. A three node data flow grsph with OIY input and two outputs. The
available [ 121, [131. An algorithm is described as a dataflow nodes represent functions of arbitrary complexity, and the arcs represent
graph, a directed graph where the nodes represent functions paths on which sequences of data (fokens or surnpks) flow.
and the arcs represent data paths, as shown in Fig. 1. Signal
processing algorithms are usually described in the literature by a signal processing system can be simulated on a single
a combination of mathematical expressions and block dia- processor or multiple processors, implemented in specialized
grams. Block diagrams are large grain dataflow (LGDF) hardware, or even, ultimately, compiled into a VLSI chip
graphs, [14]-1161, in which the nodes or blocks may be t281*
atomic (from the Greek atornos, or indivisble), such as
adders or multipliers, or nonatomic (large grain), such as DATAFLOWGRAPHS
III. SYNCHRONOUS
digital filters, FFT units, modulators, or phase locked loops. In this paper we concentrate on synchronous systems. At
The arcs connecting blocks show the signal paths, where a the risk of being pedantic, we define this precisely. A block is
signal i s simply an infinite stream of data, and each data token a function that is invoked when there is enough input available
is called a sample. The complexity o f the functions (the to perform a computation (blocks lacking inputs can be
granularify)will determine the amount of parallelism availa- invoked at any time). When a block is invoked, it will
ble because, while the blocks can sometimes be executed consume a fixed number of new input samples on each input
concurrently, we make no attempt to exploit the concurrency path. These samples may remain in the system for some time
inside a block. The functions within the blocks can be to be used as old samples [17], but they will never again be
specified using conventional von Neumann programming considered new samples. A block is said to be synchronous if
techniques. If the granularity is at the level of signal we can specify a priori the number of input samples consumed
processing subsystems (second-order sections, butterfly units, on each input and the number of output samples produced on
etc.), then the specification of a system will be extremely each output each time the block is invoked. Thus, a synchro-
natural and enough concurrency will be evident to exploit at nous block is shown in Fig. 2(a) with a number associated with
least small-scale parallel processors. The blocks can themsel- each input or output specifying the number of inputs consumed
ves represent another data flow graph, so the specification can or the number of outputs produced. These numbers are part of
be hierarchical. This is consistent with the general practice in the block definition. For example, a digital filter block would
signal processing where, for example, an adaptive equalizer have one input and one output, and the number of input
may be treated as a block in a large system, and may be itself a samples consumed or output samples produced would be one.
network of simpler blocks. A 2:l decimator block would also have one input and one
LGDF is ideally suited for signal processing, and has been output, but would consume two samples for every sample
adopted in simulators in the past [17]. Other signal processing produced. A synchronous data flow (SDF) graph is a
systems use a data-driven paradigm to partition a task among network of synchronous blocks, as in Fig. 2@).
cooperating processors [181, and many so called “block SDF graphs are closely related to computation graphs.
diagram languages” have been developed to permit program- introduced in 1966 by Karp and Miller [29] and further
mers to describe signal processing systems more naturally. explored by Reiter [30]. Computation graphs are slightly more
Some examples are Blodi [19], Patsi [20], Blodib [21], Lotus elaborate than SDF graphs, in that each input to a block has
1221,Dare [23], Mitsyn [24], Circus 1251, and Topsirn [26]. two numbers associated with it, a threshold and the number of
But these simulators are based on the principle of “next state samples consumed. The threshold specifies the number of
simulation” [20], [27] and thus have difficulty with multiple samples required to invoke the block, and may be different
sample rates, not to mention asynchronous systems. (We use from the number of samples consumed by the block. It cannot,
the term “asynchronous” here in the DSP sense to refer to of course, be smaller than the number of samples consumed.
systems with sample rates that are not related by a rational The use of a distinct threshold in the model, however, does not
t multiplicative factor.) Although true asynchrony is rare in significantly change the results presented in this paper, so for
signal processing, multiple sample rates are common, stem- simplicity, we assume these two numbers are the same. Karp
ming from the frequent use of decimation and interpolation. and Miller [29] show that computations specified by a
The technique we propose here handles multiple sample rates computation graph are determinate, meaning that the same
easily. computations are performed by any proper execution. This
In addition to being natural for DSP, large grain data flow type of theorem, of course. also underlies the validity of data
has another significant advantage for signal processing. As flow descriptions. They also give a test to determine whether a
i long as the integrity of the flow of data is preserved, any computation terminates. which is potentially useful because in
implementation of a data flow description will produce the signal processing we are mainly interested in computations
same results. This means that the same software description of that do not terminate. We assume that signal processing
IEEE TRANSACTIONS ON COMPUTERS. VOL. C-36. NO. 1 . JANUARY 1987
I 3
Fig. 4. An example of M SDF graph with delays on the arcs.
Fig. 3. SDF graph showing the numbering of the nodes and arcs. The input
and output arcs arc ignored for now.
The topology matrix I? characterizes the effect on the buffers
of running a node program.
be characterized by a matrix similar to the incidence matrix The simple computation model is powerful. First we note
associated with directed graphs in graph theory. It is con- that the computation model handles delays. The term delay is
structed by first numbering each node and arc, as in Fig. 3, used in the signal processing sense, corresponding to a sample
and assigning a column to each node and a row to each arc. offset between the input and the output. W e define a unit
The ( i , j)th entry in the matrix is the amount of data produced delay on an arc from node A to node B to mean that the nth
bj node j on arc i each time it is invoked. If node j consumes sample consumed by B will be the (n - 1)th sample produced
data from arc i, the number is negative, and if it is not by A. This implies that the first sample the destination block
connected to arc i, then the number is zero. For the graph in consumes is not produced by the source block at all, but is pan
Fig. 3 we get of the initial state of the arc buffer. Indeed, a delay of d
samples on an arc is implemented in our model simply by
setting an initial condition for (3). Specifically, the initial
buffer state, b(O), should have a d in the position correspond-
ing to the arc with the delay of d units.
This matrix can be called a topology matrix, and need not be T o make this idea firm, consider the example system in Fig.
square, in general. 4. The symbol ?0?on an arc means a single sample delay,
I f a node has a connection to itself (a serf-loop), then only while ?20? means a two-sample delay. The initial condition
one entry in I? describes this link. This entry gives the net for the buffers is thus - -
difference between the amount of data produced on this link
(4)
and the amount consumed each time the block is invoked. This
difference should clearly be zero for a correctly constructed Because of these initial conditions, block 2 can be invoked
graph, so the I?entry describing a self-loop should be zero. once and block 3 twice before block 1 is invoked at all.
We can replace each arc with a FIFO queue (buffer) to pass Delays, therefore, affect the way the system starts up.
data from one block to another. The size of the queue will vary Given this computation model we can
at different times in the execution. Define the vector b(n) to
contain the queue sizes of all the buffers at time n. In Blosim find necessary and sufficient conditions for the existence
[ 171 buffers are also used to store old samples (samples that of a PASS, and hence a PAPS;
have been ?consumed?), making implementations of delay find practical algorithms that provably find a PASS if one
lines panicularly easy. These past samples are not considered exists;
part of the buffer size here. find practical algorithms that construct a reasonable (but
For the sequential schedule, only one block can be invoked not necessarily optimal) PAPS, if a PASS exists.
at a time, and for the purposes of scheduling it does not matter W e begin by showing that a necessary condition for the
how long it runs. Thus, the index n can simply be incremented existence of a PASS is
each time a block finishes and a new block is begun. We
specify the block invoked at time n with a vector u ( n ) , which rank (I?)=s-l (5)
has a one in the position corresponding to the number of the where s is the number of blocks in the graph. We need a series
block that is invoked at time n and zeros for each block that is of lemmas before we can prove this. The word ?node? is used
not invoked. For the system in Fig. 3, in a sequential schedule, below to refer to the blocks because it is traditional in graph
o(n) can take one of three values, theory.
Lemma I: All topology matrices for a given SDF graph
have the same rank.
ProoJ Topology matrices are related by renumbering of
nodes and arcs, which translates into row and column
permutations in the topology matrix. Such operations preserve
depending on which of the three blocks is invoked. Each time the rank. Q.E.D.
a block is invoked, it will consume data from zero or more Lemma 2: A topology matrix for a tree graph has rank
input arcs and produce data on zero or more output arcs. The s - 1 where s is the number of nodes (a tree is a connected
change in the size of the buffer queues caused by invoking a graph without cycles, where we ignore the directions of the
node is given by arcs).
Proof: Proof is by induction. The lemma is clearly true
b(n+ i)=b(n)+ru(n). (3) for a two-node tree. Assume that for an N node tree
28 IEEE TRANSACTIONS ON COMPUTERS. VOL. C-36. NO. I . JANUAR'I' 1987
r"I0
rN+I= [7 3 Fig. 5. Example of a defective SDF Bnph with sample rate inconsistencies.
The topology matrix is
where 0 is a column vector full of zeros, and p f is a row
vector corresponding to the arc we just added. The last entry in
the vector p is nonzero because'the node we just added
corresponds to the last column, and it must be connected to the
r= o[: -f -']
0 -I
rank (I')=s=3.
-'I .=[:I
Proof: Consider any spanning tree 7 of the connected r.
SDF graph (a spanning tree is a tree that includes every node
in the graph). Now define r, to be the topology matrix for this
subgraph. By Lemma 2 rank (T',) = s - 1. Adding arcs to
r= o[ -:0 -1 cv(r)
the subgraph simply adds rows to the topology matrix. Adding
rows to a matrix can increase the rank, if the rows are linearly
independent of existing rows, but cannot decrease it. Q.E.D. Since the PASS is admissible, the buffers must remain
Corollury: rank (r)is s - 1 or s. bounded, by Definition 1. The buffers remain bounded if and
Proof: r has only s columns, so its rank cannot exceed s. only if
Therefore, by Lemma 3, s and s - I are the only
I ' possibilities. Q.E.D. rq=o
Definition I: An admissible sequential schedule is a + where 0 is a vector full of zeros. For q # 0, this implies that
nonempty ordered list of nodes such that if the nodes are rank(r) < s where s is the dimension of q . From the corollary
executed in the sequence given by 4, the amount o f data in the o f Lemma 3, runk(r) is either s or s -
1, and so it must be
buffers (' 'buffer sizes") will remain nonnegative and s - I. Q.E.D.
bounded. Each node must appear in (b at least once. This theorem tells us that if we have a SDF graph with a
A periodic admissible sequential schedule (PASS) is a topology matrix of rank s, that the graph is somehow
periodic and infinite admissible sequential schedule. It is defective, because no PASS can be found for it. Fig. 5
specified by a list (b that is the list of nodes in one period. illustrates such a graph and its topology matrix. Any schedule
For the example in Fig. 6, (b = { 1, 2, 3, 3) is a PASS, but for this graph will result either in deadlock or unbounded
4 = (2, 1, 3, 3) is not because node 2 cannot be run before buffer sizes, as the reader can easily verify. The rank of the
node 1. The list 4 = { 1, 2, 3) is not a PASS because the topology matrix indicates a sample rate inconsistency in the
infinite schedule resulting from repetitions of this list will graph. In Fig. 6, by contrast. a graph without this defect is
result in an infinite accumulation of data samples on the arcs shown. The topology matrix h a s rank s - 1 = 2. so we can
leading into node 3. find a vector q such that rq = 0. Furthermore, the following
Theorem I: For a connected SDFgraph with s nodes and theorem shows that we can find a positive integer vector q in
topology matrix r, rank (F) = s -1 is a necessary condition the nullspace of r. This vector tells us how many times we
for a PASS to exist. should invoke each node in one period of a PASS. Refemng
Proo$ We must prove that the existence of a PASS of again to Fig. 6 , the reader can easily verify that if we invoke
-
pcnodp implies rank (T') = s 1. Observe from (3) that we node 1 once, node 2 once, followed by node 3 twice, that the
can write
buffers will end up once again in their initial state. A5 before.
we prove some lemmas before getting to the theorem.
Lemma 4: Assume a connected SDF graph with topology
where matrix r. Let q be any vector such that rq = 0. Denote a
connected path through the graph by the set B = { b,, * * ., bL }
where each entry designates a node, and node b l is
n=O
connected to node bz. node bz to node b3, up to b L .Then all
Since the PASS is periodic, we can write q,, i E B are zero, or all are strictly positive, or all are strictly
negative. Furthermore, if any q, is rational then all q, are
b(np)= b(0) nrq. + rational.
I LEE AND MESSERSCHMI'TT STATIC SCHEDULING OF SYNCHRONOUS DATA FLOW PROGRAMS 29
n-0
I
-.". -
30 IEEE TRANSACTlONS ON COMPUTERS. VOL C-36. NO I. JANUARY 1987
0
integer vector q s.t. rq = 0. Such a vector exists because of
Theorem 2 . Then it is never necessary to run any predecessor
i P more than qs times in order to run CY x times, for any x <
9a.
Proof: The node CY will not consume any data produced
by the yth run of 8 for any y > 44. From the definition of I’
and 9 we know that aq, = bqa where a and b are the amount
of data consumed and produced on the link from ,d to a .
Therefore, running 6 only 48 times generates enough data on
the link to run cr 4atimes. More runs will not help. Q.E.D.
Theorem 4: Given a SDF graph with topology matrix and r
a positive integer vector q s.t. I’q = 0, a PASS of period
p = 1 ‘q exists if and only if a PASS of period Np exists for
Fig. 7. Two SDF graphs with consistent sample rates but no admissible any integer N.
schedule.
Proof:
Parr I: It is trivial to prove that the existence of a
Theorem 3 tells us that if we are given a positive integer
vector 9 in the nullspace of the topology matrix, that class S
PASS of period p implies the existence of a PASS of period i
algorithms will find a PASS with its period equal to the sum of
Np because the first PASS can be composed N times to
produce the second PASS.
the elements in the vector, if such a PASS exists. 1t is possible,
Part 2: We now prove that the existence of a PASS 4
even if rank (r) = s - 1 for no PASS to exist. Two such
I
I
period p and we are done. It it is not, then there must be some I
vectors in the nullspace of the topology matrix. How do we
node fl that is executed more than qs times before all nodes
select one to use in the class S algorithm? We now set out to
have been executed 9 times. But by Lemma 8, these “more
prove that given any positive integer vector in the nullspace of
than q” executions of 6 cannot be necessary for the later “less
the topology matrix, if a class S algorithm fails to find a PASS
than or equal to q” executions of other nodes. Therefore. the
then no PASS o f any period exists.
“less than or equal to 4” executions can be moved up in the
Lemma 6: Connecting one more node to a graph increases
list b, so that they precede all “more than 4” executions of 13.
the rank of the topology matrix by at least one.
yielding a new PASS 4 ’ of period Np. If this process is
The proof of this lemma follows the same kinds of
repeated until all “less than q” executions precede all “more
arguments as the proof of Lemma 2. Rows are added to the
than q” executions, then the first p elements of the resulting
topology matrix to describe the added connections to the new schedule will constitute a schedule of period p. Q.E.D.
node, and these rows must be linearly independent of rows CoroNary: Given any positive integer vector 4 E q(I”). the
already in the topology matrix. null space of a PASS of period p = 1 ‘9 exists if and only if
r,
Lemma 7: For any connected SDF graph with s nodes and a PASS exists of period r = 1 for any other positive integer
topology matrix a connected subgraph L with m nodes has a
r, vector u f q(r).
topology matrix rr for which
Proof: For any PASS at all to exist, it is necessary that
rank(r)= s - 1 rank(rL)= M - I rank(r) = s - 1, by Theorem 1. So the nullspace of r has
dimension one, and we can find a scalar c such that
;.e.. all subgraphs have the right rank.
Proo$- By contraposition. We prove that 9 = cu.
c
rank(rL)+rn- 1 rank(r)+s- 1. Furthermore. if both of these vectors are integer vectors, then
c is rational and we can write
From the corollary to Lemma 3 , if rank(FL)# rn - 1 then
n
runk(rL) = m . Then runk(r) 2 m +
(s - m) = s, by c=-
repeated applications of Lemma 6 , so rank(r) = s. Q.E.D. d
The next lemma shows that given a nullspace vector q , in
order to run any node the number of times specified by this where n and dare both integers. Therefore,
vector, it is not necessary to run any other node more than the dq =nu.
number of times specified by the vector.
Lemma 8: Consider the subgraph of a SDF graph formed By Theorem 4, a PASS of period p = 1 ‘q exists if and only if
by any node CY and all its immediate predecessors (nodes that a PASS of period dp = 1 T(d9)exists. By Theorem 4 again, a
feed it data, which may include CY itself). Construct a topology PASS of period dp exists if and only if a PASS of period r =
matrix r for this subgraph. If the original graph has a PASS, 1Tu exists. Q.E.D.
then by Theorem 1 and Lemma 7 , rank (I’) = m -
1 where Discussion: The four theorems and their corollaries have
m is the number of nodes in the subgraph. Find any positive great practical importance. We have specified a very broad
class of algorithms, designated class S algorithms, which, communication costs are not the overriding concern. Further-
given a positive integer vector q in the nullspace of the more, we assume homogeneity; all processors are the same, so
,apology matrix, find a PASS with period q u a l to the sum of they process a node in a SDF graph in the same amount of
the elements in q, Theorem 3 guarantees that these algorithms time. It is not necessary that the processors be synchronous,
will find a PASS if one exists. Theorems 1 and 2 guarantee although the implementation will be simpler if they are.
that such a vector q exists if a PASS exists. The corollary to A periodic admissible parallel schedule (PAPS) is a set of
Theorem 4 tells us that it does not matter what positive integer lists { $ i ; i = 1, - . e , M } where M is the number of
vector we use from the nullspace of the topology matrix, so we processors. and $, specifies a periodic schedule for processor
an simplify our system by using the smallest such vector, thus i . If r$ is the corresponding PASS with the smallest possible
obtaining a PASS with minimum period. period P,,then it follows that the total number Pp of block
G1vi.n these theorems, we now give a simple sequential invocations in the PAPS should be some integer multiple J of
xheduling algorithm that is of class S, and therefore will find P,,We could, of course, choose J = 1, but as we will show
a PASS if one exists. below, schedules that run faster might result if a larger J is
used. If the “best” integer J is known, then construction of a
1 ) Solve for the smallest positive integer vector q E q(I‘).
good PAPS is not too hard.
2) Form an arbitrarily ordered list L of all nodes in the
For a sequential schedule, precedences are enforced by the
system. schedule. For a multiprocessor schedule, the situation is not so
3, For each (Y E L,schedule (Y if it is runnable, trying each
simple. W e will assume that some method enforces the
node once. integrity of the parallel schedules. That is, if a schedule on a
4 ) If each node (Y has been scheduled qa times. STOP.
given processor dictates that a node should be invoked, but
51 If no node in L can be scheduled, indicate a deadlock (an
there is no input data for that node, then the processor halts
error in the graph).
until these input data are available. The task of the scheduler is
6) Else. go to 3 and repeat.
thus to construct a PAPS that minimizes the runtime for one
Thcnrern 3 tells us that this algorithms will not deadlock if a period of the PAPS divided by J, and avoids deadlocks. The
PAS4 exists. Two SDF graphs which cause deadlock and have mechanism to enforce the integrity of the communication
no P.4SS are shown in Fig. 7 . between blocks on different processors could use semaphores
Sirice the runtime is the same for any PASS (the one in shared memory or simple “instruction-count” synchroniza-
machine available is always busy), no algorithm will produce a tion, where no-ops are executed as necessary to maintain
betrcr runtime than this one. However. class S algorithms exist synchronicity among processors, depending on the multipro-
whih ~ construct schedules minimizing the memory required to cessor architecture.
buiicr data between nodes. Using dynamic programming or The first step is to construct an acyclic precedence graph for
integer programming, such algorithms are easily constructed. J periods of the PASS 4. A precise (class S) algorithm will be
A large grain data flow programming methodology offers given for this procedure below, but we start by illustrating it
conc rete advantages for single processor implementations. with the example in Fig. 8. The SDF graph in Fig. 8 is neither
Thc ability to interconnect modular blocks of code in a natural an acyclic nor a precedence graph. Examination of the number
wa! could considerably ease the task of programming high- of inputs consumed and outputs produced for each block
performance signal processors, even if the blocks of code reveals that block 1 should be invoked twice as often as the
thcniselves are programmed in Assembly language. The gain other two blocks. Further, given the delays on two of the arcs,
is >ornewhat analogous to that experienced in VLSI design we note that there are several possible minimum period
through the use of standard cells. For synchronous systems, PASS’s,e.g.,r#JI = {1,3, 1 , 2 } , & = (3, 1, 1, 2 } , o r d l =
thc penalty in runtime overhead is minimal. But a single { 1, 1, 3, 21, each with period P, = 4. A schedule that is not a
prl wessor implementation cannot take advantage of the PASS is Cbl = ( 2 , 1, 3, 1) because node 2 is not immediately
concurrency in a LGDF description. The remainder of this runnable. Fig. 9(a) shows the precedences involved in all three
paper is dedicated to explaining how thexoncurrency in the schedules. Fig. 9(b) shows the precedences involved in two
de\cription can be used to improve the throughput of a repetitions of these schedules (J = 2).
multiprocessor implementation. If we have two processors available, a PAPS for J = 1 (Fig.
9(a)) is
B. Constructing a PAPS $1 = (31
Fig. 8. An example.
6@
>1
J- 1
- 5 3 1
(a)
(b)
Fig. 1 1 . The two acyclic precedence graphs of Fig. 9 with the levels
indicated.
run o f a depends on the first d runs of where of a system with data dependent control flow contains a node
-irm- b,, that passes its input sample to its first output i f the sample is
-
? - . . ., * - .. - . a
grammer convenience without squandering computation re- vol. 60. Scpt. 1981.
sources. Programs are described as block diagrams where R.N. Kershaw, et d.,"A programmable digital signal processor with
/
32b floating point arithmetic." ISSCC 85 Dig. Tech. Papers, Feb. 13,
connections between blocks indicate the flow of data samples, 1985.
and the function of each block can be specified using a M. Chase, "A pipelined data flow lrchiteaure for signal processing: ,
conventional programming language. Blocks are executed The NEC uPD7261," in VLS! signa/ PrOcesing. New York: IEEE
Press, 1984.
whenever input data samples are available. Such a description P. M. Kogge. The Architectun of Pipelined Computers. New
is called large grain data flow (LGDF). The advantages of York: McGraw-Hill, 1981.
such a description are numerous. First, it is a natural way to D.9. Paul, J. A. Feldman, and V. J. Sfemno. "A design study for an
w i l y programmable, high-speed processor with a general-purpose
describe signal processing systems where the blocks are architecture," Lincoln Lab, Massachusetts Inst. Tcchnol.. Cambridge,
second order recursive digital filters, FFT butterfly operators, MA, Tech. Note 1980-50, 1980.
adaptive filters, and so on. Second, such a description exhibits W.B. Ackennan, "Data flow languages." Computer, vol. 15. Feh.
1982.
much of the available concurrency in a signal processing J . 9. Dennis. "Data flow supercomputers." computer. vol. 13. Nov.
algorithm, making multiple processor implementations easier 1980.
to achieve. Third, program blocks are modular, and may be 1. Watson and J . Gurd, "A practical data flow computer," Computer.
vol. 15, Feb. 1982.
reused in new system designs. Program blocks are viewed as A. L. Davis, "The architecture and system method of DDMI: A
black boxes with input and output data streams, so reusing a recursively structured data driven machine," in Proc. Fifrh A n n .
program block simply means reconnecting it in a new system. Symp. Comput. Architect.. Apr. 1978. pp. 210-215.
J. Rumbaugh, "A data flow multiprocessor." IEEE Trans. Cornput..
Fourth, multiple sample rates are easily described under the vol. C-26, p. 138. Feb. 1977.
programming paradigm. R. G. Babb, "Parallel processing with large grain data flow tech-
We describe highefficiency techniques for converting a niques,'. Computer. vol. 17, July, 1984.
D. G . Messerschmitt, "A tool for structured functional simulation,"
large grain data flow description of a signal processing system IEEE J. Select. Areas Commun.. vol. SAC-2, Jan. 1984.
into a set of ordinary sequential programs that run on parallel L. Snyder, "Parallel programming and the poker programming
machines (or, as a special case, a single machine). This environment.'' Computer, vol. 17, July 1984.
Kelly, Lochbaum, and Vyssotsky. "A block diagram compiler." Bell !
conversion is accomplished by a large grain compiler so Syst. Tech. J.. vol. 40,May 1%1.
1
called because it does not translate a high-level language into a B. Gold and C. Rader. Digital Processing of Signals. New York:
low-level language, but rather assembles pieces of code McGraw-Hill, 1969.
B. Karafin, "The new block diagram compiler for simulation of
(written in any language) for sequential or parallel execution. sampleddata systems,'' in AFfPS Conf. Proc.. vol. 27, 1%5, pp.
Most DSP systems are synchronous, meaning that the sample 55-61.
rate of any given data path, relative to other data paths, is M. Dertmzous. M. Kaliske. and K. Polzen. "On-line simulation of
block-diagram systems." fEEE Trans. Comput.. vol. C-18, Apr.
known at compile time. Large grain data flow graphs with 1%9.
such sample rate information are called synchronous data flow G . Korn, "High-speed block-diagram languages for microprocessors
and minicomputers in instrumentation, control. and simulation."
graphs. Given sample rate information, techniques are given Cornput. Elec. Eng.. vol. 4. pp. 143-159. 1977.
(and proven valid) for constructing sequential or parallel W. Henke, "MITSYN-An interactive dialogue language for time
schedules that will execute deterministically, without the signal processing." Res. Lab. Electronics Massachusetts Inst. Tech-
runtime overhead generally associated with data flow. For the nol., Cambridge, MA. RLE-TM-I, Feb. 1975.
T. Crystal and L. Kulsmd, "Circus," Inst. Defense Anal.. Princeton.
multiprocessor case, the problem of constructing a schedule NJ, CRD working paper, Dec. 1974.
that executes with maximum throughput is shown to be TOPSIM Iff-Simulation Package for Communication Sysmns.
equivalent to a standard operations research problem with well user's manual; Rep. Elec. Ens., Politencia di Torino. Italy.
G . Kopec, "The representation of discrete-time signals and systems in
studied heuristic solutions that closely approximate the opti- program," B . D . dissertation, Massachusetts Inst. Technol.. Cam-
mum. Given these techniques, the benefits of large grain data bridge, MA. May 1980.
flow programming can be extended to those signal processing C. S. Jhon, G . E . Sobelman. and D. E. Krekelberg. "Silicon
compilation bascd on a data-flow paradigm." IEEE Circ. Devices,
applications where performance demands are so severe that voi. 1. May 1985.
little inefficiency for the sake of programher convenience can R. M. Karp and R. E. Miller, "Properties of a model for parallel
be tolerated. computations: llcterminacy. termination. queueing." HAM J.. vol.
14, pp. 1390-1411, NOV.1966.
ACKNOWLEDGMENT R. Reiter. "A study of a model for parallel computations." Ph.D.
dissenetion. Univ. Michigan. Ann Arbor. 1967.
The authors gratefully acknowledge helpful suggestions J. L. Peterson. "Petri nets," Cornput. Sum.. vol. 9. Sept. 1977.
from tbc anonymous reviewers, R. Rathbone, and R. Righter. -,Perri Net Theory ond the Modcling of Systems. Englewood
Cliffs, NJ: Prcntice-Hall. 1981.
T. Agerwala, "Putting Pcm nets to work.'' Computer. p. 85, Dec..
REFERENCES 1979.
111 Signal Processing Peripherol, data sheet for the S28211, AMI, Inc. R. M. Karp and R. E. Miller. "Parallel program schemata." J.
[21 T M s f Z O f O User's Guide, Texas Instruments, Inc., Dallas, TX, 1983. Comput. S-vst. Sci.. vol. 3. no. 2. pp. 147-195. 1969.
[31 T.Tsuda, et ul., "A high-prfonnnnce LSI digital signal processor for F. Commoner and A. W. Holt, "Marked directed graphs." J.
communication," in Proc. ZEEE Zni. ConJ Commun.. June 19, Comput. Sysr. Sci.. vol. 5 , pp. 511-523. 1971.
! 1983. R. Reiter. "Scheduling parallel computations." J. Ass. Cornput.
[41 Digital Signa! Processor, data sheet for the uPD7720 signal processor Mach., vol. 14. pp. 590-599. 1968.
interface (SPl), NEC Electronics U.S.A. Inc. R. E. Blahut, Fmt Algorithms for Digital Signal Processing.
151 S. Magar, D. Essig, E. Caudel, S. Marshall, and R. Peters, "An Reading, MA: Addison-Wesley. 1985.
NMOS digital signal processor with multiprocessing capability," T.L. Adam. K. M. Chandy, and 1 . R. Dickson. "A comparison of list
ISSCC 85 Dig. Tech. Popers, Feb. 13, 1985. schedules for parallel processing systems." Commun. ASS.Cornput.
I61 R. C . Chapman, Ed.,"Digital signal processor," Bell Syst. Tech. J . , Mach.. vol. 17. pp. 685-690. Dec.. 1974.
LEE AND MESSERSCHMITT: STATIC SCHEDULING OF SYNCHRONOUS DATA FLOW PROGRAMS 35
[39] T . C. Hu, “Parallel sequencing and assembly line problems,” Operaf. DSVM C . McsKrschnltt (S’65-M’68-SM’78-
Rs., VOI.9, pp. 841-848. 1%1. F‘83) received the B.S.degree from the University
I401 W. H. Kohler, “A preliminary evaluation of the critical path method of Colorado. Boulder, In 1967. and the M.S and
for scheduling tasks on multiprocessor systems,” IEEE Trans. Ph.D.degrees from the University of Michigan,
COmpUI.. VOI. C-25, pp. 1235-1238. DcC. 1975. Ann Arbor, in 1968 and 1971. respectively.
He is a Professor of Electrical Engineenng and
Computer Sciences at the University of California.
Edward Asbford L e (S’80-M’M) received the Berkeley. From 1968 to 1977 he was a Member of
B.S. degree from Yole University. New Haven, the Technical Staff pnd later Supervisor at Bell
CT, in 1979, the S.M. degree from the Massachu- Laboratories. Holmdel, NJ, where he did systems
i
setts Institute of Technology. Cambridge, in 1981, engineering, development, and research on digital
and the Ph.D. degrce from the University of transmission and digital signal processing (panicularly relating to speech
California, Berkeley. in 1986. processing). His cumnt research interests include analog and digital signal
From 1980 to 1982 he was with Bell Laborato- processing, adaptive filtering. digital communications (on the subscriber loop
ries. as a member of the Technical Staff of the Data and fiber optics), architecture and software approaches to programmable
Communications Laboratory, where he did cxplor- digital signal processing, communication network design and protocols. and
atory work in voiceband data modem techniques computer-aideddesign of communications and signal processing systems He
and simultaneous voice and data transmission. Since has published over 70 papers and has 10 patcnu issued or pending in these
July 1986 he has been an Assistant Professor in the Department of Electrical fields. Since 1977 he has also served as a consultant to a number of
Engineering and Computer Science. University of California, Berkeley. His companies He has organized and participated in a number of short courses
research interests include architectures and software techniques for program- and seminars devoted to continuing engineering education.
mable digital signal processors, parallel processing. real-time software, Dr. Messcrschmin is a member of Eta Kappa Nu. Tau Beta Pi. and Sigma
computer-aidedengineering for signal processing and communications,digital Xi, and has several best papcr awards. He is currently a Senior Editor of IEEE
communications, and quantization. He has taught a shon course at the Communications Magazine, and is past Editor for Transmission Systems of
University of California, Santa Barbara, on telecommunications applications the IEEE TRANSACTIONSON COMMUNICATIONS and past member of the
of progammable digital signal processors, has consulted in industry, and holds Board of Governors of the IEEE Communications Society.
one patent.
Dr. Lee was the recipient of IBM and GE Fellowships and the Samuel
Silver Memorial Scholarship Award. He is a member of Tau Beta Pi.