Title:
ENERGY-MINIMAL DATAFLOW ARCHITECTURE WITH PROGRAMMABLE ON-CHIP NETWORK
Document Type and Number:
WIPO Patent Application WO/2024/049791
Kind Code:
A1
Abstract:
Disclosed herein is a co-designed compiler and CGRA architecture that achieves both high programmability and extreme energy efficiency. The architecture includes a rich set of control-flow operators that support arbitrary control flow and memory access on the CGRA fabric. The architecture is able to realize both energy and area savings over prior art implementations by offloading most control operations into a programmable on-chip network where they can re-use existing network switches.
Inventors:
LUCIA BRANDON (US)
BECKMANN NATHAN (US)
GOBIESKI GRAHAM (US)
Application Number:
PCT/US2023/031348
Publication Date:
March 07, 2024
Filing Date:
August 29, 2023
Assignee:
UNIV CARNEGIE MELLON (US)
International Classes:
G06F9/30; G06F5/00; G06F8/41; G06F9/50
Foreign References:
US10515049B1 (2019-12-24)
US20220164189A1 (2022-05-26)
US20210373867A1 (2021-12-02)
Attorney, Agent or Firm:
CARLETON, Dennis M. (US)
Claims:
[0071] By contrast, implementing control flow in a PE requires full data‐width multiplexors and, if an entire PE is dedicated to control, an output channel to hold the results. Nevertheless, the disclosed embodiments are sometimes forced to allocate a PE for control flow. Specifically, if a control‐flow operator takes a software‐supplied value or a constant other than ‐1, 0, or 1, it requires μcore support.

[0072] The CFM provides a small amount of buffering for decider values. This is because loop deciders often have high fanout, which means that the next iteration of a loop is likely blocked by one or more downstream consumers. To remove this limitation, a small amount of downstream buffering for 1b decider signals is provided, improving performance with minimal impact on area. The CFM uses run‐length encoding to buffer up to eight decider values with just 3b of additional state, yielding up to 3.8x performance (on dmm) at an area cost of <1%.

Claims

1. A system comprising: a processor; and compiler software that, when executed by the processor, performs the functions of: compiling a program written in a high‐level language to an intermediate representation; enforcing memory ordering on the intermediate representation to create an ordered graph; translating the intermediate representation to a dataflow graph; inserting control‐flow operators into the dataflow graph; and mapping operators in the dataflow graph to a coarse‐grained reconfigurable array (CGRA).

2. The system of claim 1 wherein the dataflow graph is optimized before the mapping step.

3. The system of claim 1 wherein the ordered graph encodes dependencies between memory operations.

4. The system of claim 3 wherein the ordered graph is iteratively pruned to eliminate redundant ordering arcs already enforced by data and control dependencies.

5. The system of claim 3 wherein path‐sensitive transitive reduction is applied to the ordered graph to convert a potentially cyclic ordering graph to an acyclic graph of strongly‐connected components.

6. The system of claim 2 wherein the control‐flow operators comprise: a steer operator; a carry operator; an invariant operator; a merge operator; an order operator; and a stream operator.

7. The system of claim 6 wherein the memory ordering is enforced by inserting order operators in the ordered graph.

8. The system of claim 6 wherein the steer, carry, invariant and merge operators are inserted into the dataflow graph.

9. The system of claim 6 wherein the dataflow graph is optimized using the stream operator.

10. The system of claim 1 wherein the CGRA comprises: a fabric of heterogeneous processing elements connected via a bufferless, 2D‐torus on‐chip network; a RISC‐V scalar core; and SRAM main memory.

11. The system of claim 10 wherein the compiler uses the dataflow graph and a topographical description of the CGRA to generate scalar code for execution by the scalar core and a bitstream to configure the fabric.

12. The system of claim 11 wherein the processing elements perform arithmetic and memory operations.

13. The system of claim 12 wherein the on‐chip network implements the control‐flow operators.

14. The system of claim 10 wherein each processing element comprises: a functional unit; and a μcore; wherein the μcore interfaces with the on‐chip network, buffers output values and interfaces with top‐level fabric control for configuration of the processing element.

15. The system of claim 10 wherein the fabric comprises a 6x6 array of processing elements.

16. The system of claim 10 wherein the processing elements are selected from a group consisting of memory processing elements, arithmetic processing elements, multiplier processing elements, control‐flow processing elements and stream processing elements.

17. The system of claim 10 wherein processing elements buffer values in an output channel to reduce buffering values in the on‐chip network.

18. The system of claim 10 wherein the on‐chip network comprises a plurality of routers coupled to the processing elements.

19. The system of claim 18 wherein the routers include one or more control‐flow modules at one or more output ports to implement the control‐flow operators.
Description:
U.S. PATENT APPLICATION for Energy‐Minimal Dataflow Architecture With Programmable On‐Chip Network
Applicant: Carnegie Mellon University (ID2023‐002)
Inventors: Brandon Lucia, Nathan Beckmann, Graham Gobieski
Prepared By: Dennis M. Carleton, Principal, KDW Firm PLLC (KDWFIRM.COM), 2601 Weston Pkwy., Suite 103, Cary, NC 27513, 919‐396‐5643

Energy‐Minimal Dataflow Architecture With Programmable On‐Chip Network

Related Applications
[0001] This application claims the benefit of U.S. Provisional Patent Application No. 63/403,422, filed on September 2, 2022, entitled "An Energy‐Minimal Dataflow Architecture With Programmable On‐Chip Network", the contents of which are incorporated herein in their entirety.

Government Interest
[0002] This invention was made with United States government support under contract CCF1815882 awarded by the National Science Foundation and contract W911NF‐18‐2‐0218 awarded by the U.S. Army. The U.S. government has certain rights in this invention.

Background
[0003] Recent advances in machine learning, sensor devices and embedded systems open the door to a wide range of sensing applications, such as civil‐infrastructure or wilderness monitoring, public safety and security, medical devices, and chip‐scale satellites. To achieve long (e.g., 5+ year) deployment lifetimes, these applications rely on on‐device processing to limit off‐device communication. Computing at the extreme edge calls for ultra‐low‐power (<1 mW), energy‐minimal and programmable processing.

[0004] Application‐specific integrated circuits (ASICs) are
one possible solution to address the need for extreme energy efficiency. Howev
er, ASICs come with several major disadvantages. Computations in smart sen
sing applications are diverse, spanning deep learning, signal processing, co
mpression, encoding, decryption, planning, control, and symbolic reasoning.
Only a programmable solution can support all of these, as it is infeasi
ble to build an ASIC for every conceivable task. Moreover, the rapid pace of change
in these applications (e.g., due to new machine learning algorithms) puts
specialized hardware at risk of premature obsolescence, especially in a multi
‐year deployment. Finally, by targeting all computations, programmable d
esigns can achieve much greater scale than specialized designs — perha
ps trillions of devices. Scale reduces device cost, makes advanced manufacturin
g nodes economically viable, and mitigates carbon footprint. [0005] Unfortunately, traditional programmable cores are very
inefficient, typically spending only 5% to 10% of their energy on useful
work. The architect’s challenge is thus to reconcile generality and efficie
ncy. Coarse‐grained reconfigurable arrays (CGRAs) are both programmable an
d efficient. CGRAs can achieve energy efficiency competitive with ASICs
while remaining programmable by software. As shown in FIG. 1, a CGR
A is an array of processing elements (PEs) connected by an on‐chip n
etwork (NoC). CGRAs are programmed by mapping a computation’s dataflow
onto the array (i.e., by assigning operations to PEs and configuring the N
oC to route values between dependent operations). The efficiency of CGRAs
derives from avoiding overheads intrinsic to von Neumann architectu
res, specifically instruction fetch/control and data buffering in a cen
tralized register file. [0006] In the context of ultra‐low‐power sensing applicat
ions, the CGRA‐generation framework disclosed in U.S. Patent App. 17/572,925 wa
s designed to minimize energy, in contrast to prior, performance‐f
ocused CGRAs. These CGRAs reduce energy by up to 5 times versus ultra
low‐power von Neumann cores, and they come within 3 times the energy effic
iency of ASICs. [0007] However, according to Amdahl’s Law, to achieve sign
ificant end‐to‐end benefits, CGRAs must benefit the vast majority of pr
ogram execution. CGRAs must support a wide variety of program patterns at
minimal programmer effort, and they must provide a complete compiler an
d hardware stack that makes it easy to convert arbitrary application code
to an efficient CGRA configuration. Unfortunately, prior art CGRAs struggle
to support common programming idioms efficiently, leaving significant ene
rgy savings on the table. [0008] On the hardware side, many prior CGRAs support only
simple, regular control flow (e.g., inner loops with streaming memory accesse
s and no data‐dependent control). To support complex control flow a
nd maximize performance, other CGRAs employ expensive hardware mec
hanisms (e.g., associative tags to distinguish loop iterations, large
buffers to avoid deadlock, and dynamic NoC routing). In either case, energy i
s wasted: from extra instructions needed to implement control flow unsuppor
ted by the CGRA fabric or from inefficiency in the CGRA microarchitec
ture itself. [0009] On the compiler side, mapping large computations onto
a CGRA fabric is a perennial challenge. Heuristic compilation methods ofte
n fail to find a valid mapping and optimization‐based methods lead to prohi
bitively long compilation times. Moreover, computations with irregula
r control flow are significantly more challenging to compile due to thei
r large number of control operations, which significantly increase the size of
the dataflow graph. To avoid these issues, hand‐coded vector assembly could
be used, restricting programs to primitives that map well onto a CGRA. V
ector assembly sidesteps irregular control but makes programming cumb
ersome. Summary [0010] Described herein is a novel design of a co‐designe
d CGRA compiler and architecture that supports arbitrary control flow and
memory access patterns without expensive hardware mechanisms. Unlike prior lo
w‐power CGRAs, the disclosed embodiments can execute arbitrary code, limi
ted only by fabric size and routing. This saves energy by offloading more co
de onto the CGRA, where it executes with an order‐of‐magnitude less
energy than a von Neumann core. Deeply nested loops are supported with
data‐dependent control flow and aliasing memory accesses, as commonl
y found in, for example, sparse linear algebra. [0011] The benefits of the disclosed embodiments are realize
d by several novel aspects. First, the instruction set architecture suppo
rts complex control while minimizing energy. A steering control paradigm is ado
pted in which values are only routed to where they are actually needed.
To support arbitrary nested control without tags, new control‐flow primit
ives are introduced, such as the carry gate, which selects between tokens from inner
and outer loops. The disclosed embodiments also optimize the common case b
y introducing operators for common programming idioms, such as the
stream generator that generates an affine sequence for, for example,
streaming memory accesses. [0012] The disclosed embodiments offload control flow into t
he on‐chip network. New control flow primitives are implemented without w
asting energy or PEs by leveraging existing NoC switches. The insight is
that a NoC switch already contains essentially all of the logic needed for ste
ering control flow, and, with a few trivial additions, it can implement a wide ra
nge of control primitives. Mapping control‐flow into the NoC frees PEs for ar
ithmetic and memory operations, so deeply nested loops with complex contr
ol flow can be supported on a small CGRA fabric. [0013] The disclosed compiler compiles C programs to an eff
icient CGRA configuration. Functions written in a high‐level lan
guage are compiled and novel analyses are employed to safely parallelize ope
rations. With steering control flow and no program counter, conventional tra
nsitive reduction analysis fails to enforce all memory orderings, so p
ath‐sensitive transitive reduction to infer orderings correctly is introduced.
Arbitrary control flow without associative tags is implemented by enforcing
strict ordering among values, leveraging the new control operators. Programs
are mapped onto the CGRA by formulating place‐and‐route as a Boolean
satisfiability (SAT) instance or integer linear program. The SAT formulati
on finds configurations quickly (<3 min), while the integer linear program
ming (ILP) formulation yields configurations that use 4.3% less energy. Brief Description of the Drawings [0014] By way of example, a specific exemplary embodiment o
f the disclosed system and method will now be described, with reference to
the accompanying drawings, in which: [0015] FIG. 1 is a schematic representation of a CGRA comprising an array of processing elements (PEs) connected by an on‐chip network (NoC). [0016] FIG. 2 is a diagram illustrating the semantics of the
control‐flow operators of the disclosed architecture. [0017] FIG. 3 schematically illustrates the front‐end and
middle‐end components of the compiler, which takes high‐level source code a
nd transforms it to an optimized dataflow graph. [0018] FIG. 4 illustrates the enforcement of memory ordering
by the middle‐end. [0019] FIG. 5 illustrates the ULP CGRA fabric of the discl
osed embodiment. [0020] FIG. 6 illustrates the microarchitecture of a process
ing element. [0021] FIG. 7 illustrates the router microarchitecture. [0022] FIG. 8 contains pseudocode listings showing implementation
of the steer and carry operators. Detailed Description [0023] Disclosed herein is a compiler and microarchitecture
for ultralow‐power, energy‐minimal CGRAs. At its core is an instruction
set architecture (ISA) that supports arbitrary control flow without expensive asso
ciative tag matching. The novel compiler disclosed herein transforms program
s written in high‐level programming languages (e.g., C) into dataflow g
raphs using the ISA. It also enforces memory ordering and optimizes programs
by fusing loop induction variables into single stream operators. The
CGRA fabric efficiently executes compiled programs. It minimizes switching act
ivity by assigning a single operation per PE. Additionally, it reuses hard
ware in the NoC to implement control‐flow operations without wasting pro
cessing element (PE) resources. As a result, energy efficiency and performance are improved over prior art energy‐minimal CGRAs and an ultra‐low‐po
wer scalar core. [0024] As shown in FIG. 1, the invention comprises a co‐
designed compiler and CGRA architecture that executes programs written in a
high‐level language with minimal energy and high performance. New control
‐flow primitives to support common programming idioms, like deeply nested
loops and irregular memory accesses are introduced, while minimizing the
energy overhead. Control flow is implemented in the NoC to increase
utilization and ease compilation. [0025] Instruction Set Architecture (ISA) [0026] A rich set of new operators is provided to support
complex programs. The ISA has six categories of operators: arithmetic, mult
iplier, memory, control flow, synchronization, and streams. (Multiplication is
split from other arithmetic because, to save area, only some PEs can
perform multiplication.) The new operators are illustrated in FIG. 2. The co
ntrol flow operators include Steer, Carry and Invariant. The synchronizatio
n operators include Merge and Order. A Stream operator is also provided
to generate a sequence of data values. [0027] For control flow operators, whenever a value is re
ad, it is implied that the operator waits until a valid token arrives for that
value over the NoC. Tokens are buffered at the inputs if they are not consumed
or discarded. [0028] The Steer operator comes in two flavors, True and False, and takes two inputs: a decider, D, and a data input, A. If D matches the flavor, then the gate passes A through; otherwise, A is discarded. Steers are necessary to implement conditional execution, as they gate the inputs to disabled branches.
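By way of illustration only, the following C sketch models the token‐level behavior of a True‐flavored steer; the channel type and helper names are hypothetical and are not part of the disclosed hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical token channel standing in for a NoC input or output port. */
typedef struct { uint32_t val; bool valid; } chan_t;

/* True-flavored steer: fires only when both the decider D and the data token A
 * are available. If D is true, A is passed through; otherwise A is discarded,
 * gating the input to a disabled branch. */
void steer_true(chan_t *d_in, chan_t *a_in, chan_t *out) {
    if (!d_in->valid || !a_in->valid || out->valid)
        return;                       /* wait for both tokens and a free output slot */
    bool d = (bool)d_in->val;
    uint32_t a = a_in->val;
    d_in->valid = false;              /* both input tokens are consumed */
    a_in->valid = false;
    if (d) {                          /* decider matches the True flavor */
        out->val = a;
        out->valid = true;
    }
    /* if d is false, A is silently dropped */
}
```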
[0029] The Carry operator represents a loop‐carried dependency and takes a decider, D, and two data values, A and B. Carry has an internal state machine, as shown in FIG. 2. In the Initial state, it waits for A, and then passes it through and transitions to the Block state. While in Block, if D is True, the operator passes through B. It transitions back to Initial when D is False and begins waiting for the next A value (if not already buffered at the input). Carry operators keep tokens ordered in loops, eliminating the need to tag tokens. All backedges are routed through a carry operator. By not consuming A while in Block, carry operators prevent outer loops from spawning a new inner‐loop instance before the previous one has finished. (Iterations from one inner‐loop may be pipelined if they are independent, but entire instances of the inner loop will be serialized.)
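A minimal C sketch of this state machine follows; modeling token availability with valid flags is an assumption made here purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { CARRY_INITIAL, CARRY_BLOCK } carry_state_t;

typedef struct {
    carry_state_t state;
    uint32_t a;  bool a_valid;    /* A: value entering the loop        */
    uint32_t b;  bool b_valid;    /* B: value arriving on the backedge */
    bool d;      bool d_valid;    /* D: loop decider                   */
    uint32_t out; bool out_valid;
} carry_t;

/* One firing attempt of the carry operator. */
void carry_step(carry_t *c) {
    if (c->out_valid) return;              /* downstream has not consumed the output yet */
    switch (c->state) {
    case CARRY_INITIAL:
        if (c->a_valid) {                  /* wait for A, pass it through, enter Block   */
            c->out = c->a; c->out_valid = true;
            c->a_valid = false;
            c->state = CARRY_BLOCK;
        }
        break;
    case CARRY_BLOCK:
        if (!c->d_valid) break;            /* wait for the decider                       */
        if (c->d) {                        /* loop continues: pass the backedge value B  */
            if (!c->b_valid) break;
            c->out = c->b; c->out_valid = true;
            c->b_valid = false; c->d_valid = false;
        } else {                           /* loop exits: back to Initial; note that A   */
            c->d_valid = false;            /* is not consumed while in Block             */
            c->state = CARRY_INITIAL;
        }
        break;
    }
}
```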
[0030] The Invariant operator is a variation of carry. It represents a loop invariant and can be implemented as a carry with a self‐edge back to B. Invariants are used to generate a new loop‐invariant token for ea
ch loop iteration. [0031] The Merge operator enforces cross‐iteration ordering
by making sure that tokens from different loop iterations appear in the
same order, regardless of the control path taken by each loop iteration. The operator takes three inputs: a decider, D, and two data inputs, A and B. Merge is essentially a multiplexor that passes through either A or B, depending on D. However, it should be noted that only the value passed through is consumed. [0032] The Order operator is used to enforce memory orderin
g by guaranteeing that multiple preceding operations have executed. It takes
two inputs, A and B, and fires as soon as both arrive, passing B through. [0033] The Stream operator generates a sequence of data val
ues, which are produced by evaluating an affine function across a r
ange of inputs. These operators are used in loops governed by affine induc
tion variables. A stream takes three inputs: start, step, and bound. It initially sets its internal idx to start, and then begins iterating a specified arithmetic operator f as idx' = f(idx, step). A stream operator produces two output tokens per iteration: idx and a control signal done, which is False until idx reaches bound, whereupon it is True and the stream stops iterating. done is used by downstream control logic, for example, to control a carry operator for outer loops.
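For concreteness, a behavioral C sketch of the stream operator is given below; the use of addition for f and of a simple comparison as the termination test are illustrative assumptions, since the disclosed operator supports a configurable arithmetic operator.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int32_t idx, step, bound;
    bool running;
} stream_t;

/* Configure the stream from its three inputs: start, step, and bound. */
void stream_init(stream_t *s, int32_t start, int32_t step, int32_t bound) {
    s->idx = start; s->step = step; s->bound = bound; s->running = true;
}

/* Emit one (idx, done) token pair per call, mirroring the two output tokens
 * produced per iteration. Returns false once the stream has stopped. */
bool stream_next(stream_t *s, int32_t *idx_out, bool *done_out) {
    if (!s->running) return false;
    *idx_out  = s->idx;
    *done_out = (s->idx >= s->bound);     /* done stays False until idx reaches bound */
    if (*done_out)
        s->running = false;               /* the stream stops iterating               */
    else
        s->idx += s->step;                /* idx' = f(idx, step), here f = add        */
    return true;
}
```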
[0034] Compiler [0035] The compiler disclosed herein compiles, optimizes, and maps high‐level code to the CGRA fabric. The compiler has a front‐end,
middle‐end, and back‐end. FIG. 3 shows the front‐end and middle‐end compone
nts of the compiler. The front‐end compiles the high‐level language to an
intermediate representation (IR). In one embodiment, the compiler uses clang to
compile C code to LLVM’s IR. The middle‐end manipulates the LLVM IR
to insert control‐flow operators and enforce memory ordering. It then transl
ates the IR to a dataflow graph (DFG) representation and optimizes the
DFG by transforming and fusing subgraphs. In some embodiments, the operat
or count can be reduced by up to 27%. The back‐end takes the DFG
as input and maps operators onto the CGRA, producing a configuration bi
tstream in minutes. [0036] Memory‐Ordering Analysis. Sequential code is mapped
onto a CGRA fabric in which many operations, including memory, may execute
in parallel. For correctness, some memory operations must execute in a
particular order. The compiler’s middle‐end computes required orderin
gs between memory operations present in the IR and adds control‐flow
operations to enforce those orderings. [0037] The first step to enforcing memory ordering is to c
onstruct an ordering graph (OG) that encodes dependences between memory operation
s. The compiler uses alias analysis to identify memory operations tha
t may or must access the same memory location (i.e., alias), adding an arc be
tween the operations in the OG accordingly. No assumptions are made regarding
the alias analysis and self‐dependences need not be considered because
repeated instances of the same memory operation are always ordered on the
CGRA fabric. FIG. 4 shows a basic, unoptimized OG in the top left for
an example function. [0038] The OG as computed can be greatly simplified (i.e.,
pruned). Prior work has simplified the OG with improved alias analysis and b
y leveraging new control flow primitives. These efforts are orthogonal to the
present methodology, which simplifies the OG by eliminating redundant orde
ring arcs that are already enforced by data and control dependences. Dat
a dependences are discovered by walking LLVM's definition‐use (def‐use) chain from source to destination, and ordering arcs for dependent operations are removed. For instance, in the control flow graph (CFG) from the
example in FIG. 3, S2 is data‐dependent on L1, so there need not be an ord
ering arc in the OG. This is reflected in the arc from L1 to S2 that is pruned
in the OG in FIG. 4. Similarly, control dependences order some memory operations if t
he execution of the destination is control‐dependent on the source. The
CFG is analyzed to identify control dependences between memory operations
and those orderings are removed from the OG. In the CFG from
the example in FIG. 3, the arc from L0 to S1 in FIG. 4 is pruned using
this analysis.
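The following hypothetical C fragment (not the example function of FIG. 3 and FIG. 4) illustrates the kinds of ordering arcs that such pruning removes; the comments indicate which orderings are already implied by data and control dependences.

```c
/* Hypothetical example; labels chosen for illustration only. */
void scale_nonzero(int *a, int *b, int n) {
    for (int i = 0; i < n; i++) {
        int x = a[i];          /* L0: load a[i]                                       */
        if (x != 0) {
            b[i] = x * 2;      /* S0: store to b[i]. S0 is control-dependent on the
                                * branch decided by L0's value, so an explicit
                                * L0 -> S0 ordering arc is redundant and is pruned.   */
        }
        a[i] = x + 1;          /* S1: store to a[i]. S1 is data-dependent on L0
                                * (it uses x), so the L0 -> S1 arc is also pruned.
                                * An S0 -> S1 arc remains only if a and b may alias.  */
    }
}
```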
[0039] Two dependent memory operations are transitively ordered if there is a path (of ordering arcs) in the OG from source to destina
tion. The redundant arcs that are transitively ordered by other control‐ and
data‐dependence orderings are discovered and eliminated. This reduces
the number of operations required to enforce ordering by approximate
ly 18% versus unoptimized ordering. [0040] To simplify the OG, transitive reduction (TR) is use
d, which simplifies ordering relation graphs for parallel execution of loops. TR
is applied to the OG, which converts a (potentially cyclic) ordering graph into a
n acyclic graph of strongly connected components (the SCCDAG). Traditional TR elim
inates arcs between SCCs, removes all arcs within each SCC, and
adds arcs to each SCC to form a simple cycle through all vertices. [0041] The algorithm is modified in two ways to make it w
ork for the OG. First, arcs in the inserted cycle must be compatible with progra
m order instead of being arbitrary. Second, the inserted arcs must respect pro
per loop nesting, avoiding arcs directly from the inner to outer loop.
To handle these arcs, synthetic loop entry and exit nodes are added to ea
ch loop (shown as entry and exit nodes at the bottom of FIG. 4). Any arc inserte
d that links an inner loop node to an outer loop node instead uses the i
nner loop’s exit as its destination. Symmetrically, an arc inserted that links
an outer loop node to an inner loop node has the inner loop’s entry as
its destination. With these two changes, the SCCDAG is usable for TR. [0042] However, applying existing TR analysis to the OG fai
ls to preserve the required ordering operations. The problem is that a
source and destination may be ordered along one (transitive) path and order
ing along another (direct) path may be removed as redundant. Execution
along the transitive path enforces ordering, but along the direct path do
es not, which is incorrect. FIG. 4 shows a scenario wherein path‐sensitivity is
critical. The path SCC3{S2} → SCC1{S0} should not be eliminated in TR because the alternative path, SCC3{S2} → SCC2{S1} → SCC1{S0}, does not capture the direct control‐flow path from S2 to S0 via the backedge of the loop. This problem arises due to the steering control and lack of a program counter to order memory operations. [0043] To correctly apply TR to remove redundant ordering a
rcs, path‐sensitive TR is introduced, which confirms that a transitive ordering
path subsumes all possible control‐flow paths before removing any orde
ring arc from the OG. With this constraint in place, transitive reduction c
an be safely used.
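A simplified sketch of this check is shown below, under the assumption (introduced here only for illustration) that a direct ordering arc from src to dst may be removed only when every control‐flow path from src to dst passes through some node of the transitive ordering path; the graph representation and helper names are hypothetical.

```c
#include <stdbool.h>

#define MAX_NODES 64

/* Hypothetical adjacency-matrix control-flow graph over the relevant nodes. */
typedef struct { bool edge[MAX_NODES][MAX_NODES]; int n; } graph_t;

/* DFS: is there a CFG path from src to dst that avoids every node marked in via[]? */
static bool path_avoiding(const graph_t *cfg, int src, int dst,
                          const bool *via, bool *seen) {
    if (src == dst) return true;
    seen[src] = true;
    for (int v = 0; v < cfg->n; v++) {
        if (!cfg->edge[src][v] || seen[v] || via[v]) continue;
        if (path_avoiding(cfg, v, dst, via, seen)) return true;
    }
    return false;
}

/* Path-sensitive test: the direct arc src->dst is redundant only if the
 * transitive ordering path (its interior nodes marked in via[]) subsumes
 * every control-flow path from src to dst, backedges included. */
bool arc_is_redundant(const graph_t *cfg, int src, int dst, const bool *via) {
    bool seen[MAX_NODES] = { false };
    return !path_avoiding(cfg, src, dst, via, seen);
}
```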
[0044] Memory operators produce a control token on completion and can optionally consume a control token to enforce memory ordering.
The middle‐end encodes ordering arcs as defs and uses of data valu
es in the IR (as seen in the IR transform of loads and stores in FIG. 3) before
lowering them as dependences in the DFG. For a memory operator that
must receive multiple control signals, the middle‐end inserts order operat
ions to consolidate those signals. [0045] Control‐Flow Operator Insertion. The compiler lowers
its IR to use the disclosed control paradigm by inserting control‐flow
operators into the DFG. [0046] Steer. The compiler uses the control dependence graph
(CDG) to insert steers. For each consumer of a value, the compiler
walks the CDG from the producer to the consumer and inserts a steer operato
r at each node along the CDG traversal if it has not already been insert
ed by a different traversal. The steer’s control input is the decider of the b
asic block that the steer depends on, and its data input is the value or the
output of an earlier inserted steer. [0047] Carry and Invariant. For loops, the compiler inserts
a carry operator for loop‐carried dependences and an invariant operator for loop‐invariant values into the loop header. A carry's data input comes from
the loop backedge that produces the value. An invariant’s data input comes
from values defined outside the loop. These operators should produce a t
oken only if the next iteration of the loop is certain to execute. To ens
ure this behavior, the compiler sets their control signal to the decider of
the block at the loop exit.
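As an illustration that is not taken from the disclosure, the comments in the following C function mark where these operators would be placed:

```c
/* Hypothetical source; comments mark the operators inserted by the compiler. */
int dot_row(const int *m, const int *v, int cols) {
    int sum = 0;              /* 'sum' is loop-carried: a carry operator is placed in
                               * the loop header, its data input fed from the backedge
                               * that produces the updated sum.                        */
    for (int j = 0; j < cols; j++) {
        /* 'm', 'v', and 'cols' are defined outside the loop: invariant operators
         * regenerate a token for them on every iteration. The decider of the
         * loop-exit block (j < cols) is the control signal of both the carry and
         * the invariant operators, so they emit a token only if the next
         * iteration is certain to execute.                                            */
        sum += m[j] * v[j];
    }
    return sum;
}
```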
[0048] Merge. If two iterations of a loop may take different control flow paths that converge at a single join point in the loop body,
either may produce a token to the join point first. But for correctness, the o
ne from the earlier iteration must produce the first token. The compiler inserts a
merge operator at a join point in the CFG to ensure that tokens flow to the
join point in iteration order. The control signal D for the merge operator is the decider of the nearest
common dominator of the join point’s predecessor ba
sic blocks. Because the earlier iteration sends its control signal first and
tokens are not reordered, the merge operator effectively blocks the later itera
tion until the earlier iteration resolves. [0049] Stream Fusion. Target‐specific operator fusion is pe
rformed on the DFG to reduce required operations and routes by combining va
lue stream generators with loop control logic and address computation logic
. Streams are supported and applied for the common case of a loop with an
affine loop governing induction variable (LGIV). A stream makes loop logic
efficient by fusing the LGIV update and the loop exit condition into a sing
le operator. In the DFG, loop iteration logic is represented by the exit cond
ition, an update operator, the carry for the LGIV's value, and the steer tha
t gates the LGIV in a loop iteration. The middle‐end fuses these operators into
a single stream operator and sets the stream’s initial, step, and bound val
ues. FIG. 3 shows stream compilation, where the operators for loop iteration l
ogic (outlined in blue in the DFG) are fused into a stream operator. Induction
‐variable analysis is applied to find affine LGIVs. Address computations ar
e also identified and mapped to an affine stream if possible, and the str
eam is fused into the memory operator.
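Again purely as an illustration, the loop below is governed by an affine LGIV; the comments indicate the operators that stream fusion collapses into a single stream operator with its initial, step, and bound values.

```c
/* Hypothetical source illustrating stream fusion. */
void copy_even(int *dst, const int *src, int n) {
    /* Loop iteration logic in the DFG: the exit condition (i < n), the update
     * (i += 2), the carry holding i across iterations, and the steer gating i
     * inside the body are fused into one stream operator configured with
     * initial = 0, step = 2, bound = n.                                        */
    for (int i = 0; i < n; i += 2) {
        /* The address computations src + i and dst + i are affine in i, so they
         * can be mapped to affine streams and fused into the memory operators. */
        dst[i] = src[i];
    }
}
```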
[0050] Mapping DFGs to Hardware. The backend takes a DFG and a CGRA topology description and generates scalar code to invoke a bi
tstream to configure the fabric. This involves finding a mapping of DFG nodes
and edges to PEs, control flow modules and links. Mapping can be diffi
cult, and there is much prior work on heuristic methods that trade mapping q
uality for compilation speed. The disclosed embodiments have two advantages
versus the prior art. [0051] First, the disclosed embodiments do not time‐multipl
ex operations, so operations need only be scheduled in space, not time
. Prior art compilers unroll loops to reason about operation timing and id
entify the initiation interval, increasing program size. Second, the disclos
ed embodiments target energy efficiency, not performance. Rather than optimi
zing for initiation interval, it need only focus on finding a valid sol
ution, since leakage is insignificant. [0052] Two complementary mappers are provided. One is base
d on Boolean satisfiability (SAT) and another is based on integer
linear programming (ILP) that minimizes the average routing distance. The SAT
‐based mapper runs quickly, whereas the ILP‐based mapper yields an energy savings versus SAT. [0053] The constraints of the ILP and SAT formulations are
similar. The formulations ensure that every DFG vertex is mapped to a hardwar
e node, that every edge is mapped to a continuous route of hardware links,
and that the inputs and outputs of a vertex match the incoming and outgoing
links of a hardware node. They further disallow the mapping of multiple
DFG vertices to a single hardware node, the sharing of hardware links by mult
iple edges with different source vertices, and the mapping of a DFG
edge through a control flow module when a DFG vertex is mapped to that mo
dule. Together these are the necessary constraints to produce not only a
valid mapping, but also a good mapping (SAT is close to ILP in terms of ener
gy).
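A simplified sketch of these constraints, in notation assumed here for illustration rather than taken from the disclosure, uses binary variables x(v,p) (DFG vertex v placed on hardware node p) and y(e,l) (DFG edge e routed over hardware link l):

```latex
% Illustrative sketch of the placement-and-routing constraints.
\begin{align*}
\sum_{p} x_{v,p} &= 1 \quad \forall v
  && \text{every DFG vertex is mapped to exactly one hardware node}\\
\sum_{v} x_{v,p} &\le 1 \quad \forall p
  && \text{no hardware node hosts more than one vertex}\\
\sum_{l \in \mathrm{in}(p)} y_{e,l} - \sum_{l \in \mathrm{out}(p)} y_{e,l}
  &= x_{w,p} - x_{u,p} \quad \forall\, e=(u,w),\ \forall p
  && \text{each edge forms a continuous route from its source to its sink}\\
y_{e,l} + y_{e',l} &\le 1 \quad \forall\, l,\ \forall\, e, e' \text{ with different sources}
  && \text{edges with different source vertices never share a link}\\
&\min \textstyle\sum_{e,l} y_{e,l}
  && \text{ILP objective: routing distance (the SAT mapper drops the objective)}
\end{align*}
```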
[0054] Microarchitecture [0055] As shown in FIG. 5, the ULP CGRA fabric of the disclosed embodiments is an energy‐minimal coarse‐grained reconfigurable array.
The 6x6 fabric contains heterogeneous PEs connected via a bufferless, 2D‐tor
us NoC. A complete system in accordance with the disclosed embodiments c
ontains a CGRA fabric, a RISC‐V scalar core, and a 256KB (8x32KB
banks) SRAM main memory. [0056] Tagless Dataflow Scheduling. Asynchronous dataflow fi
ring is implemented via ordered dataflow. Adding ordering operators where
control may diverge ensures that tokens always match on arrival at a PE
, obviating the need for tags. Tagless, asynchronous firing has a low hardware
cost (one bit per input plus control logic), and it allows variable operation
latency (e.g., due to bank conflicts) to be tolerated while eliminating the need
for the compiler to reason about operation timing. [0057] Processing Elements. The PEs perform all arithmetic a
nd memory operations in the fabric. FIG. 6 shows the microarchitecture of
a PE. The PE includes a functional unit (FU) and the μcore. The μcore interfaces with the NoC, buffers output values, and interfaces with top‐level fabric
control for PE configuration. [0058] Functional Units. The μcore exposes a generic interface using a latency‐insensitive ready/valid protocol to make it easy to add new operators. Inputs arrive on the FU's data inputs when the corresponding input valid signal is high and are consumed when the FU raises its input ready signal. The FU reserves space in the output channel by raising a reservation signal (e.g., for pipelined, multi‐cycle operations), and output arrives on the FU's data output when the output valid signal is high. A ready signal from the output channel supplies back pressure from downstream PEs. The remaining signals deal with top‐level configuration and control.
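The following C sketch renders one such latency‐insensitive handshake; the struct field names are illustrative assumptions, not the disclosed signal names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the FU-side ready/valid interface. */
typedef struct {
    uint32_t in_data;  bool in_valid;  bool in_ready;   /* operand handshake          */
    bool     reserve;                                    /* reserve an output-channel
                                                          * slot (pipelined FUs)       */
    uint32_t out_data; bool out_valid; bool out_ready;   /* result handshake; out_ready
                                                          * is downstream backpressure */
} fu_if_t;

/* One evaluation of a single-cycle adder FU behind this interface. */
void adder_fu_step(fu_if_t *io, uint32_t other_operand) {
    io->in_ready = io->out_ready;            /* only accept an operand when the output
                                              * channel can accept the result          */
    io->reserve  = false;                    /* single-cycle FU: no early reservation  */
    if (io->in_valid && io->in_ready) {
        io->out_data  = io->in_data + other_operand;
        io->out_valid = true;                /* present the result downstream          */
    } else {
        io->out_valid = false;
    }
}
```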
[0059] Communication. The μcore decouples NoC communication from FU computation. The μcore tracks which inputs are valid, raises backpressure on input ports when its FU is not ready, buffers inter
mediate results in output channels, and sends results over the NoC. Decoupling
simplifies the FU. [0060] Configuration. The μcore handles PE and FU configuration, storing configuration state in a two‐entry configuration cache that enables single‐cycle reconfiguration. Additionally, the μcore enables the fabric to overlap reconfiguration of some PEs while others finish compu
tation on an old configuration. [0061] PE Types. The disclosed embodiments include a heter
ogeneous set of PEs. [0062] Memory PEs issue loads and stores to memory and hav
e a “row buffer” that coalesces non‐aliasing subword loads. Arithmetic PEs
implement basic ALU operations (e.g., compare, bitwise logic, add, subtrac
t, shift, etc.); Multiplier PEs implement multiply, multiply + shift, multiply +
fixed‐point clip, and multiply + accumulate; Control‐flow PEs implement th
e steer, invariant, carry, merge, and order operators, but most of these are a
ctually implemented in the NoC (see below); Stream PEs implement common aff
ine iterators. [0063] Bufferless NoC. PEs are connected via a statically
configured, multi‐hop, bufferless on‐chip network with routers. Instead of
buffering values in the NoC, PEs buffer values in their output channel. NoC
buffers are a primary energy sink in prior art CGRAs and are completely e
liminated here. Similarly, the NoC is statically routed to eliminate routing lo
ok‐up tables and flow‐control mechanisms. [0064] Control Flow in the NoC. Control‐flow operators are
simple to implement (often a single multiplexer), but there are many of
them. Mapping each to a PE wastes energy and area and can make mapping to
the CGRA infeasible. Much of the logic required to implement control flow
is already plentiful in the NoC. Each NoC switch is a crossbar that can be
re‐purposed to mux values for control. Thus, to implement each control‐flow o
perator, a switch’s routing and ready/valid signals are manipulated to provide th
e desired functionality. The router microarchitecture of the disclosed embodime
nts is shown in FIG. 7. The router shares routing configuration and its d
ata and valid crossbars with the baseline NoC. A control‐flow module (CFM)
is added at a configurable number of output ports (in this case, t
wo output ports). The CFM determines when to send data to the output port
and manipulates inputs to the data switch to select which data is
sent. [0065] The CFM takes eight inputs and produces five outputs
that control router configuration and dataflow through the network. The i
nputs are: cfg: the configuration of the CFM (i.e., opcode); A_valid, B_valid, D_valid: whether the inputs are valid; D: the value of the decider; A_sel and B_sel: the input ports for A and B; and out_ready: the backpressure signal from the output port. The outputs are A_ready, B_ready, D_ready: upstream backpressure signals that allow the CFM to block upstream producers until all signals required are valid; out_valid: the valid signal for the CFM's output; and sel: which port (A_sel or B_sel) to route to the output port on the data switch. [0066] The CFM can be configured for routing or for the control operators previously discussed. For example, cfg = A (plain routing of A) is simple: just set sel = A_sel, out_valid = A_valid, and A_ready = out_ready. [0067] Other operators are more involved, but each requires
only a small state machine. FIG. 8 is pseudocode for the steer and car
ry operators previously described. [0068] As shown in FIG. 8(a), a steer forwards A if D is true; otherwise it discards A. To implement steer, the CFM waits for A and D to be valid. If D is true, then out_valid is raised, and the out_ready signal propagates upstream to A and D and the CFM waits for out_ready (i.e., for the value to be consumed). If D is false, then out_valid is kept low, and A_ready and D_ready are raised to discard these tokens. [0069] As shown in FIG. 8(b), carry is more complicated. Carry begins in the Initial state waiting for a valid A token. It forwards the token and transitions to the Block state, where it forwards B until it sees a false D token.
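The FIG. 8 pseudocode is not reproduced here; instead, the sketch below renders the CFM's steer behavior in C using the signal names assumed in paragraph [0065], purely as an illustration.

```c
#include <stdbool.h>

/* One combinational evaluation of a CFM configured as a True-flavored steer.
 * Signal names follow the convention assumed for paragraph [0065]. */
void cfm_steer_true(bool a_valid, bool d_valid, bool d, bool out_ready,
                    bool *a_ready, bool *d_ready, bool *out_valid, int *sel,
                    int a_sel) {
    *out_valid = false;
    *a_ready = *d_ready = false;
    if (!a_valid || !d_valid)
        return;                          /* block upstream until A and D are valid   */
    if (d) {
        *sel = a_sel;                    /* route A's input port to the output port  */
        *out_valid = true;               /* raise out_valid                          */
        *a_ready = *d_ready = out_ready; /* downstream ready propagates upstream,
                                          * consuming A and D once the value is sent */
    } else {
        *a_ready = *d_ready = true;      /* keep out_valid low and discard A and D   */
    }
}
```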
[0070] Control flow in the NoC adds small hardware overheads. Implementing control flow in the NoC is far more energy‐ and area‐efficient than in a PE, saving an estimated 40% energy and 22% area versus a CGRA with all CF operations mapped to PEs. The CFM deals only with n
arrow control signals and the 1b decider value D. It does not need to touch full data signals at
all; these are left to the pre‐existing data switch. Im
portantly, this means that the CFM adds no data buffers. Instead, the CFM simp
ly raises the upstream backpressure signals to park values in the upstream output channels until they are no longer needed.