


Title:
ROUTING NETWORK FOR SUPERCONDUCTING DEVICES USING RACE LOGIC
Document Type and Number:
WIPO Patent Application WO/2023/239341
Kind Code:
A2
Abstract:
An apparatus is provided. The apparatus includes a routing network to route packets between a set of sending devices and a set of receiving devices. One or more of the routing network, the set of sending devices, and the set of receiving devices are part of a superconducting device. The set of sending devices and the set of receiving devices use a race logic architecture. The apparatus also includes a scheduling module to interconnect different subsets of the set of sending devices to different subsets of the set of receiving devices based on a set of connection schedules. The packets are routed between the different subsets of the set of sending devices and the different subsets of the set of receiving devices based on the set of connection schedules.

Inventors:
MICHELOGIANNAKIS GEORGIOS (US)
LYLES DARREN (US)
VASUDEVAN DILIP (US)
GONZALEZ-GUERRERO PATRICIA (US)
BAUTISTA MERIAM (US)
Application Number:
PCT/US2022/028538
Publication Date:
December 14, 2023
Filing Date:
May 10, 2022
Assignee:
UNIV CALIFORNIA (US)
Attorney, Agent or Firm:
WHETZEL, John (US)
Claims:
CLAIMS

What is claimed is:

1. An apparatus, comprising: a routing network to route packets between a set of sending devices and a set of receiving devices, wherein: one or more of the routing network, the set of sending devices, and the set of receiving devices are part of a superconducting device; the set of sending devices and the set of receiving devices use a race logic architecture; and a scheduling module to interconnect different subsets of the set of sending devices to different subsets of the set of receiving devices based on a set of connection schedules, wherein the packets are routed between the different subsets of the set of sending devices and the different subsets of the set of receiving devices based on the set of connection schedules.

2. The apparatus of claim 1, wherein: each connection schedule of the set of connection schedules is divided into a set of windows; and each window is divided into a set of time slots.

3. The apparatus of claim 2, wherein a value of a packet is based on a window when the packet was received.

4. The apparatus of claim 2, wherein each of the different subsets of the set of sending devices and the different subsets of the set of receiving devices are associated with a respective window of the set of windows.

5. The apparatus of claim 1, wherein the set of connection schedules rotate between the set of sending devices and the set of receiving devices in a round robin schedule.

6. The apparatus of claim 1, wherein: the scheduling module comprises a set of scheduling devices; and each scheduling device of the set of scheduling devices is associated with a respective subset of the routing network.

7. The apparatus of claim 6, wherein each scheduling device of the set of scheduling devices comprises a set of merging circuits and a set of toggle flip flop (TFF) circuits.

8. The apparatus of claim 1, wherein each routing device of the routing network comprises a nondestructive read out (NDRO) cell.

9. The apparatus of claim 8, wherein each NDRO cell is to generate a respective output based on one connection schedule of the set of connection schedules.

10. The apparatus of claim 1, further comprising: one or more shift buffers coupled to one or more of the set of sending devices and the set of receiving devices, wherein the one or more shift buffers are configured to store the packets for a period of time.

11. A system, comprising: a set of sending devices to transmit packets; a set of receiving devices to receive the packets; a routing network to route the packets between the set of sending devices and the set of receiving devices, wherein: one or more of the routing network, the set of sending devices, and the set of receiving devices are part of a superconducting device; the set of sending devices and the set of receiving devices use a race logic architecture; and a scheduling module to interconnect different subsets of the set of sending devices to different subsets of the set of receiving devices based on a set of connection schedules, wherein the packets are routed between the different subsets of the set of sending devices and the different subsets of the set of receiving devices based on the set of connection schedules.

12. The system of claim 11, wherein: each connection schedule of the set of connection schedules is divided into a set of windows; and each window is divided into a set of time slots.

13. The system of claim 12, wherein a value of a packet is based on a window when the packet was received.

14. The system of claim 12, wherein each of the different subsets of the set of sending devices and the different subsets of the set of receiving devices are associated with a respective window of the set of windows.

15. The system of claim 11, wherein the set of connection schedules rotate between the set of sending devices and the set of receiving devices in a round robin schedule.

16. The system of claim 11, wherein: the scheduling module comprises a set of scheduling devices; and each scheduling device of the set of scheduling devices is associated with a respective subset of the routing network.

17. The system of claim 16, wherein each scheduling device of the set of scheduling devices comprises a set of merging circuits and a set of toggle flip flop (TFF) circuits.

18. The system of claim 11, wherein each routing device of the routing network comprises a nondestructive read out (NDRO) cell.

19. The system of claim 18, wherein each NDRO cell is to generate a respective output based on one connection schedule of the set of connection schedules.

20. A method, comprising: obtaining a set of schedules for a routing network to route packets between a set of sending devices and a set of receiving devices, wherein: one or more of the routing network, the set of sending devices, and the set of receiving devices are part of a superconducting device; the set of sending devices and the set of receiving devices use a race logic architecture; and coupling a subset of the set of sending devices to a subset of the set of receiving devices based on the set of schedules.

Description:
ROUTING NETWORK FOR SUPERCONDUCTING DEVICES USING RACE LOGIC

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Application No. 63/187,805, filed on May 12, 2021, the entire content of which is hereby incorporated by reference herein.

STATEMENT OF GOVERNMENT RIGHTS

[0002] This invention was made with government support under Contract No. DE-AC02-05CH11231 awarded by the U.S. Department of Energy. The government has certain rights in the invention.

TECHNICAL FIELD

[0003] Aspects of the present disclosure relate to routing data, and in particular, to routing data in superconducting systems, devices, chips, or circuits.

BACKGROUND

[0004] Superconductivity is defined as the property of certain metals to have zero resistance below a critical temperature that is usually a few degrees Kelvin. Superconducting devices, chips, systems, etc., may be computing devices, processing devices, etc., that operate at those temperatures. Josephson junctions (JJ) are often used in superconducting devices. JJs have two edges and allow current to pass through with no resistance until a critical current is reached. Reaching that critical current causes the JJ to switch to a resistive state and emit a magnetic quantum flux transfer that is observable as a voltage pulse. JJs are capable of switching with delays as low as a few picoseconds and with energies several orders of magnitude lower than traditional CMOS circuits.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] The described embodiments and the advantages thereof may best be understood by reference to the following description taken in conjunction with the accompanying drawings. These drawings in no way limit any changes in form and detail that may be made to the described embodiments by one skilled in the art without departing from the spirit and scope of the described embodiments.

[0006] FIG.1 is a block diagram illustrating an example system, in accordance with some embodiments of the disclosure.

[0007] FIG.2 is a block diagram illustrating example windows for a routing network, in accordance with some embodiments of the disclosure.

[0008] FIG.3 is a block diagram that illustrates an example routing network, in accordance with some embodiments of the disclosure.

[0009] FIG.4 is a block diagram illustrating an example routing device, in accordance with some embodiments of the disclosure.

[0010] FIG.5 is a block diagram illustrating an example scheduling device, in accordance with some embodiments of the disclosure.

[0011] FIG.6 is a block diagram illustrating an example scheduling device, in accordance with some embodiments of the disclosure.

[0012] FIG.7 is a block diagram illustrating an example balancer, in accordance with some embodiments of the disclosure.

[0013] FIG.8 is a block diagram illustrating an example converter, in accordance with some embodiments of the disclosure.

[0014] FIG.9 is a block diagram illustrating an example converter, in accordance with some embodiments of the disclosure.

[0015] FIG.10 is a block diagram illustrating an example converter, in accordance with some embodiments of the disclosure.

[0016] FIG.11A is a diagram illustrating an example shift register, in accordance with some embodiments of the disclosure.

[0017] FIG.11B is a diagram illustrating an example shift buffer, in accordance with some embodiments of the disclosure.

[0018] FIG.12 is a flow diagram of a method of routing packets/pulses, in accordance with some embodiments of the disclosure.

DETAILED DESCRIPTION

[0019] As discussed above, Josephson junctions (JJ) are often used in superconducting devices. JJs make superconducting devices particularly attractive for energy-efficient computation even after accounting for cooling energy. Superconducting circuits have been shown to operate at clock frequencies of several tens of GHz. However, the density of JJs in an area is less than the density of traditional CMOS circuits.

[0020] A single flux quantum (SFQ) superconducting device may encode a logical 1 with the presence of a pulse that typically has a duration of a few picoseconds and an amplitude of a few millivolts. A logical 0 is encoded by the absence of a pulse. This is in contrast with CMOS, where 0s and 1s are encoded by voltage levels that typically remain constant until the next rising clock edge. Recently, race logic (RL) was adapted to rapid single flux quantum (RSFQ) superconducting circuits. Because voltage pulses in RSFQ continuously propagate and thus do not maintain a constant voltage level at each input as in CMOS, race logic in RSFQ adapts its logic primitives (i.e., gates) to be stateful in order to remember which pulses arrived and their timing relative to other inputs of the same gate. RL encodes information in the time domain by dividing time into what we refer to as “time slots.” The time of arrival of a pulse encodes its value. Although superconducting devices may use race logic (e.g., a temporal based encoding), traditional routing networks still use a binary based encoding and thus are hard to use with superconducting devices.

[0021] Aspects of the disclosure address the above-noted and other deficiencies by providing a routing network where both the data and control paths operate in a temporal domain. The routing network may use a predefined schedule or timetable to couple different senders to different receivers.
The schedules or timetables allow the routing network to be more easily compatible with senders and receivers that use race logic, due to the temporal nature of the schedules/timetables. This allows the superconducting device to operate more quickly and/or efficiently because packets, pulses, data, etc., can remain in a race logic encoding and do not need to be converted to a binary encoding.

[0022] FIG.1 is a block diagram illustrating an example system 100, in accordance with some embodiments of the disclosure. The system 100 includes routing devices 110A through 110Z, sending devices 120A through 120Z, and receiving devices 130A through 130Z. Each sending device 120A through 120Z may be referred to as a sender, a transmitter, etc. Each receiving device 130A through 130Z may be referred to as a receiver. In one embodiment, the system 100 may be a single flux quantum (SFQ) chip or circuit. For example, the system 100 may be a rapid SFQ (RSFQ) chip, circuit, device, system, etc. The system 100 (e.g., the sending devices 120A through 120Z, the receiving devices 130A through 130Z, etc.) may use a race logic architecture or convention.

[0023] In one embodiment, the routing devices 110A through 110Z may receive data from the sending devices 120A through 120Z (as illustrated by the leftward arrows). The routing devices 110A through 110Z may forward the data to the receiving devices 130A through 130Z (as illustrated by the rightward arrows). In some embodiments, each routing device 110A through 110Z may be coupled to different subsets of the sending devices 120A through 120Z and different subsets of the receiving devices 130A through 130Z. In other embodiments, each routing device 110A through 110Z may be connected to each sending device 120A through 120Z and to each receiving device 130A through 130Z. Any number or variation of routing devices, sending devices, and receiving devices may be used in different embodiments.
[0024] In one embodiment, the routing devices 110A through 110Z may be a network-on-chip (NoC). For example, the routing devices 110A through 110Z may be a superconducting rotary NoC (SRNoC). The routing devices 110A through 110Z may also be referred to as a routing network, a connection network, etc. The sending devices 120A through 120Z and receiving devices 130A through 130Z may also be referred to as endpoints of the routing network.

[0025] In one embodiment, both the data path (e.g., the interconnections between the sending devices 120A through 120Z, and the receiving devices 130A through 130Z) and the control path (e.g., the connection schedules, how the routing devices 110A through 110Z interconnect the sending devices 120A through 120Z and receiving devices 130A through 130Z) may operate in a temporal domain using race logic. This allows the routing devices 110A through 110Z to be compatible with architectures that already implement or use race logic. In addition, because RSFQ devices may have lower device density (e.g., can fit fewer devices, circuits, in an area), the routing devices 110A through 110Z may reduce the number of devices used in the RSFQ device.

[0026] Each routing device 110A through 110Z may use a connection schedule to interconnect a sending device with a receiving device. For example, each routing device 110A through 110Z may have its own connection schedule indicating when a particular sending device should be connected to a particular receiving device. The connection schedule may be a predefined or fixed schedule. Using a predefined or fixed schedule may allow for simpler control logic because the connections between sending devices and receiving devices are not established and terminated ad hoc.

[0027] In one embodiment, the routing devices 110A through 110Z, sending devices 120A through 120Z, and receiving devices 130A through 130Z may be coupled to a common clock (e.g., a clock device, a timing circuit, etc.).
The common clock may serve as a timing reference such that all components are aware of which connections are active at any one time and connections are established or torn down (e.g., disconnected) in synchrony with other routing devices.

[0028] FIG.2 is a block diagram illustrating example windows 210A through 210Z for a routing network, in accordance with some embodiments of the disclosure. Each of windows 210A through 210Z includes four time slots 215. Although each window 210A through 210Z is described as having four time slots 215 in FIG.2, a window may include any number of time slots in other embodiments. For example, each window may include 4, 8, 32, or any appropriate number of time slots. Furthermore, the number of windows for the routing network may vary in different embodiments. For example, there may be 2, 4, 8, 100, or any appropriate number of windows. The windows may also be referred to as connection windows.

[0029] As discussed above, the endpoints (e.g., sending devices and receiving devices) of a routing network (e.g., routing devices 110A through 110Z illustrated in FIG.1, a NoC, a SRNoC, etc.) may use a race logic architecture or race logic convention. A race logic convention may operate in a manner such that the time of arrival of a pulse, signal, etc., determines the value it encodes. For example, each time slot 215 has an associated value of 0, 1, 2, or 3. If a pulse is received at the first time slot 215 (in a window 210), the value of the pulse is 0. If a pulse is received at the second time slot 215 (in a window 210), the value of the pulse is 1. If a pulse is received at the third time slot 215 (in a window 210), the value of the pulse is 2. If a pulse is received at the fourth time slot 215 (in a window 210), the value of the pulse is 3. Each pulse that travels through the routing network may be an independent packet (e.g., data, a data packet, etc.).
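The mapping from arrival time to value can be illustrated with a minimal Python sketch. It assumes integer timestamps in picoseconds, a hypothetical slot duration of 10 ps, and function names (`encode_value`, `decode_pulses`) chosen only for this illustration; none of these appear in the disclosure.

```python
def encode_value(value, window_start, slot_duration):
    """Return the start time of the slot that represents `value` (assumed convention)."""
    return window_start + value * slot_duration


def decode_pulses(arrival_times, window_start, slot_duration, slots_per_window):
    """Map each pulse arrival time within one window to the value of its time slot."""
    values = []
    for t in arrival_times:
        # The slot index (0-based) is the race-logic value of the pulse.
        slot = (t - window_start) // slot_duration
        if not 0 <= slot < slots_per_window:
            raise ValueError(f"pulse at t={t} falls outside the window")
        values.append(slot)
    return values


# Four slots per window, hypothetical 10 ps slots, window starting at t = 0.
# Pulses in the first, third, and fourth slots decode to the values 0, 2, and 3.
print(decode_pulses([2, 25, 38], window_start=0, slot_duration=10, slots_per_window=4))
```

A pulse that slips into a neighboring slot decodes to a different value, so the slot duration has to absorb any delay between sender and receiver.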
[0030] In one embodiment, multiple pulses may be received during a window 210A through 210Z. For example, during window 210A, a first pulse may be received at the first time slot 215, a second pulse may be received at the second time slot 215, a third pulse may be received at the third time slot 215, and a fourth pulse may be received at the fourth time slot 215. Thus, four separate values, 0, 1, 2, and 3, were received during window 210A. In another example, during window 210C, a first pulse may be received at the first time slot 215, a second pulse may be received at the third time slot 215, and a third pulse may be received at the fourth time slot 215. Thus, three separate values, 0, 2, and 3, were received during window 210C.

[0031] A particular sending device (e.g., an input to the routing network) may be coupled to (e.g., may be continuously connected to) a particular receiving device (e.g., an output for the routing network) for the duration of the window. For example, the same sending device may be coupled to the same receiving device via a routing device for the duration of a window 210A. The number of time slots 215 in each window 210A through 210Z may be the same in some embodiments. Thus, each routing device (of the routing network) may couple a sending device to a receiving device for equal amounts of time.

[0032] In one embodiment, the number of time slots in a window may determine the number of equivalent binary bits that may be represented, encoded, etc., by a pulse, packet, etc. For example, if there are 64 time slots per window, each pulse/packet is equivalent to a value that has log2(64) bits (e.g., each pulse represents a 6-bit value). By allowing each time slot 215 to carry a pulse/packet, the total number of bits of information/data that may be represented/encoded by the pulses may be determined as follows: X * log2(X), where X is the number of slots in a window.
Although the number of time slots may be illustrated as being a power of 2, the number of time slots may be any appropriate number in other embodiments.

[0033] In the routing network, the pulses/packets may not need to carry destination, virtual channel, head/tail bits (e.g., headers or footers), or any other overhead information typically found in traditional routing networks (e.g., traditional NoCs). Because the routing devices will connect/couple the appropriate senders and receivers (based on a schedule, a connection schedule, etc.), such overhead information (such as the destination) may not be needed to route a pulse/packet from the appropriate sending device to the appropriate receiving device. In some embodiments, error correction information/data may be used if the system or architecture uses error correction. In order to increase the payload capacity (e.g., the number of bits that can be represented or encoded) of a single pulse, the number of time slots in a window 210A through 210Z can be increased.

[0034] To maintain the same notion of time throughout the network (e.g., to help keep the sending devices, receiving devices, and routing devices on substantially the same timing), time slots 215 may be long enough to account for the propagation delay through the routing network. Otherwise, a packet/pulse entering the routing network may arrive during a different time slot at the receiving endpoint, thus changing its value. Therefore, in the routing network, the duration of a time slot may be based on or may be determined by the propagation delay through the routing network. The propagation delay may also include the amount of time it may take to configure, set up, reset, etc., the routing devices of the routing network. For example, the amount of time for a routing device to disconnect from a first sending device and a first receiving device, and connect to a second sending device and a second receiving device may be factored into the propagation delay.
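As a worked illustration of the capacity figures given for the windows, the following minimal Python sketch evaluates the bits per pulse, log2(X), and the bits per window, X * log2(X), for X slots. The function names are assumptions made for this example only.

```python
import math


def bits_per_pulse(slots_per_window):
    """Equivalent binary bits carried by one pulse: log2(X)."""
    if slots_per_window < 1:
        raise ValueError("a window needs at least one time slot")
    return math.log2(slots_per_window)


def bits_per_window(slots_per_window):
    """Total bits when every slot carries a pulse: X * log2(X)."""
    return slots_per_window * bits_per_pulse(slots_per_window)


print(bits_per_pulse(64))   # 6.0
print(bits_per_window(64))  # 384.0
print(bits_per_window(4))   # 8.0
```

Doubling the number of slots adds one bit to each pulse but also doubles the window length, so payload capacity trades directly against how long each sender and receiver pair stays connected.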
[0035] FIG.3 is a block diagram that illustrates an example routing network 300, in accordance with some embodiments of the disclosure. The routing network 300 may be referred to as a NoC, a SRNoC, etc. The routing network 300 includes routing devices 310A through 310I, scheduling module 350, splitters S, and mergers M. The routing devices 310A through 310I may also be referred to as crosspoints, crossbar crosspoints, etc. The routing network 300 may be a 3x3 routing network. For example, the routing network 300 includes a 3x3 arrangement of routing devices.

[0036] Each splitter S has an input and two (or more) outputs. Each splitter S may produce a pulse/packet at both outputs (or all of its outputs) when an input pulse/packet is received on its input. Each merger M has two (or more) inputs and one output. Each merger M may produce an output pulse when receiving a pulse at either input.

[0037] Scheduling module 350 may select, activate, etc., routing devices 310A through 310I based on one or more schedules. Scheduling module 350 includes scheduling devices 340A through 340C. Scheduling device 340A is coupled to routing devices 310A through 310C, scheduling device 340B is coupled to routing devices 310D through 310F, and scheduling device 340C is coupled to routing devices 310G through 310I. Each of the scheduling devices 340A through 340C has three outputs (e.g., a left, middle, and right output). The scheduling devices 340A through 340C may use one of the three outputs to select a routing device, as discussed in more detail below. The scheduling devices 340A through 340C may also be referred to as counting networks.

[0038] The routing network 300 is coupled to three different inputs, input 1, input 2, and input 3. Each of the inputs may be coupled to one or more sending devices (e.g., a transmitter, a sender, etc.) that may transmit pulses, packets, data, etc., via the routing network 300.
The routing network 300 is also coupled to three different outputs, output 1, output 2, and output 3. Each of the outputs may be coupled to one or more receiving devices (e.g., a receiver) that may receive pulses, packets, data, etc., via the routing network 300.

[0039] The routing network 300 may route packets, pulses, data, etc., received from input 1 to one of output 1, output 2, or output 3, based on one or more connection schedules. For example, scheduling device 340A may select one of routing devices 310A through 310C to forward one or more pulses/packets based on a first connection schedule (e.g., may activate or may cause a routing device to forward data). Scheduling device 340B may select one of routing devices 310D through 310F to forward one or more pulses/packets based on a second connection schedule. Scheduling device 340C may select one of routing devices 310G through 310I to forward one or more pulses/packets based on a third connection schedule.

[0040] In one embodiment, the connection schedules may be round robin schedules. A round robin schedule may be a schedule that selects each routing device for an equal amount of time. For example, scheduling device 340A may select routing device 310A during a first window (using its first output), may select routing device 310B during a second window (using its second output), and may select routing device 310C during a third window (using its third output). In another example, scheduling device 340B may select routing device 310F during a first window (using its first output), may select routing device 310D during a second window (using its second output), and may select routing device 310E during a third window (using its third output). Thus, each scheduling device 340A through 340C may use its first output, second output, third output and then rotate back to using its first output.
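The rotating selection in these examples can be sketched as a small Python table lookup. The grid layout (one row of routing devices per scheduling device) and the starting point for scheduling device 340C are assumptions made for this illustration; only the orders for 340A and 340B are taken from the examples above.

```python
# One row of routing devices per scheduling device (assumed layout).
ROUTING_DEVICES = [
    ["310A", "310B", "310C"],  # coupled to scheduling device 340A
    ["310D", "310E", "310F"],  # coupled to scheduling device 340B
    ["310G", "310H", "310I"],  # coupled to scheduling device 340C
]

# Position within each row that is selected in the first window.
# 340A starts at 310A and 340B starts at 310F, per the examples;
# 340C starting at 310H is a hypothetical choice.
START_OFFSETS = [0, 2, 1]


def selected_devices(window):
    """Return the routing device each scheduling device selects in a window (0-based)."""
    n = len(ROUTING_DEVICES)
    return [
        row[(window + offset) % n]
        for row, offset in zip(ROUTING_DEVICES, START_OFFSETS)
    ]


for w in range(4):
    print(w, selected_devices(w))
```

With these offsets, the selection repeats every three windows, and no two scheduling devices select the same position in their rows during the same window.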
Based on the connection schedule, one routing device at each row and one routing device at each column receives a select signal (e.g., is activated) from the scheduling devices 340A through 340C during a window.

[0041] The scheduling devices 340A through 340C may receive a clock or clock signal from a clock/timing device. The clock signal may have a period that is equal to the amount of time for a window. This may allow the scheduling devices 340A through 340C to select a different routing device each window.

[0042] In one embodiment, the routing devices 310A through 310I may be reset at the end of each window (e.g., before the start of the next window). A reset signal may be generated by a reset module 360 to reset each of the routing devices 310A through 310I at the end of each window. The reset module 360 may be coupled to each of the routing devices 310A through 310I to provide the reset signal to the routing devices 310A through 310I. The reset module 360 may be driven by the clock and may remain synchronized with the operation of the scheduling devices 340A through 340C.

[0043] FIG.4 is a block diagram illustrating an example routing device 310, in accordance with some embodiments of the disclosure. As illustrated in FIG.4, a routing device 310 may be and/or may include a nondestructive read out (NDRO) cell 400. The NDRO cell 400 may include three inputs (S, Read, and R) and two outputs (Q and QR). As discussed above, a routing network may include multiple NDRO cells 400.

[0044] In one embodiment, the NDRO cell 400 may maintain internal state to remember pulses it observed from the reset (R) and select (S) inputs, and their relative timing. If the NDRO cell 400 observes a reset (R) pulse more recently than select (S), any pulses arriving at the data input (Read) are not routed to the data output (Q) because the NDRO is in a cleared state.
However, if the select (S) input observes a pulse more recently than reset (R), the NDRO is in a “connected state” and thus any pulses arriving at the data input (the Read input) are routed to the data output (Q). The output QR produces a pulse when Q does not and vice versa. The output QR may not be used by the routing network. In one embodiment, the NDRO cell 400 operates correctly even with overlapping reset and select inputs. If there are overlapping reset and select inputs, the select input may take precedence.

[0045] FIG.5 is a block diagram illustrating an example scheduling device 340, in accordance with some embodiments of the disclosure. The scheduling device 340 includes toggle flip flops (TFFs) and mergers M. As discussed above, the scheduling device 340 may be coupled to a set of routing devices. The scheduling device 340 may select one of the routing devices by providing an output to a select input of the routing device. The scheduling device 340 may select the different routing devices based on a schedule, such as a round robin schedule. The selected routing device may route a pulse, packet, data, etc., from a sending device to a receiving device. The scheduling device 340 may also be referred to as a counting network. In one embodiment, the scheduling device 340 may be referred to as a 1x4 or 1-to-4 scheduling device (e.g., a device that has one input and four outputs).

[0046] As discussed above, the scheduling device 340 includes TFFs. The TFFs may also be referred to as balancers. A TFF has one input and two outputs (which may be referred to as Q and QR). When a pulse, signal, data, etc., arrives at the input, the TFF may transmit that pulse to the output Q (e.g., the first output). The TFF will transmit the next pulse that is received to output QR (e.g., the second output). The TFF will alternate between transmitting pulses (received via the input) using the output Q and the output QR. Each merger M has two inputs and one output.
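The two stateful cells described so far, the NDRO cell and the TFF, can be modeled behaviorally in a few lines of Python. This is only a sketch of the input/output behavior stated in the text, not a circuit model; the class and method names are assumptions introduced for this example.

```python
class NDRO:
    """Behavioral model of an NDRO cell: S connects, R clears, Read probes the state."""

    def __init__(self):
        self.connected = False  # cleared state until a select pulse is observed

    def update(self, select=False, reset=False):
        # When select and reset overlap, select takes precedence.
        if select:
            self.connected = True
        elif reset:
            self.connected = False

    def read(self):
        """Return the output that fires for a pulse on Read: 'Q' if connected, else 'QR'."""
        return "Q" if self.connected else "QR"


class TFF:
    """Behavioral model of a toggle flip flop: pulses alternate between Q and QR."""

    def __init__(self):
        self._next = "Q"  # the first pulse exits at Q

    def pulse(self):
        out = self._next
        self._next = "QR" if out == "Q" else "Q"
        return out


ndro = NDRO()
ndro.update(select=True)
print(ndro.read())                      # Q
tff = TFF()
print([tff.pulse() for _ in range(4)])  # ['Q', 'QR', 'Q', 'QR']
```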
Each merger M may produce an output pulse when receiving a pulse at either input.

[0047] In one embodiment, the scheduling device 340 may implement the schedule (e.g., connection schedule, time table, connection timetable, etc.) for selecting different routing devices. As discussed above, the scheduling device 340 may receive a clock signal or clock pulse from a clock/timing device. All incoming pulses arrive through the input (IN) and get distributed evenly in a round-robin fashion to the N outputs (e.g., OUT0, OUT1, OUT2, and OUT3). For example, the first input pulse to arrive is routed to OUT0, the second input pulse to OUT1, and so on and so forth. After sending a pulse to OUT3, the scheduling device 340 may start back at OUT0 for the next pulse.

[0048] Although four outputs are shown, the scheduling device 340 may include any appropriate number of outputs in other embodiments. For example, the scheduling device 340 may include 8, 16, 30, or some other appropriate number of outputs. Additional TFFs and mergers M may be added to the scheduling device 340 to accommodate the additional outputs.

[0049] FIG.6 is a block diagram illustrating an example scheduling device 340, in accordance with some embodiments of the disclosure. The scheduling device 340 includes toggle flip flops (TFFs) and mergers M. As discussed above, the scheduling device 340 may be coupled to a set of routing devices. The scheduling device 340 may select one of the routing devices by providing an output to a select input of the routing device. The scheduling device 340 may select the different routing devices based on a schedule, such as a round robin schedule. The selected routing device may route a pulse, packet, data, etc., from a sending device to a receiving device. The scheduling device 340 may also be referred to as a counting network.
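One way to obtain the round-robin distribution of a 1-to-4 counting network from toggling elements is a two-level binary tree, sketched below in Python. The tree wiring and the assignment of leaves to OUT0 through OUT3 are assumptions made for illustration; the wiring in the figures, which also uses mergers, is not reproduced here.

```python
class CountingNetwork1to4:
    """Hypothetical 1-to-4 counting network built from three toggling elements."""

    def __init__(self):
        # Each toggle is a single bit: False routes to the first output, True to the second.
        self.toggles = {"root": False, "left": False, "right": False}

    def _toggle(self, name):
        """Return the current side (0 or 1) of a toggle, then flip it."""
        side = int(self.toggles[name])
        self.toggles[name] = not self.toggles[name]
        return side

    def pulse(self):
        """Route one clock pulse and return the index of the output that fires."""
        branch = self._toggle("root")  # 0 -> left subtree, 1 -> right subtree
        leaf = self._toggle("left" if branch == 0 else "right")
        # Left subtree feeds OUT0 and OUT2; right subtree feeds OUT1 and OUT3.
        return branch + 2 * leaf


net = CountingNetwork1to4()
print([net.pulse() for _ in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Interleaving the leaves this way is what makes consecutive pulses visit OUT0, OUT1, OUT2, and OUT3 in order, because the root sends odd-numbered and even-numbered pulses to different subtrees.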
In one embodiment, the scheduling device 340 may be referred to as a 4x4 or 4-to-4 scheduling device (e.g., a device that has four inputs and four outputs).

[0050] As discussed above, the scheduling device 340 includes TFFs. The TFFs may also be referred to as balancers. The scheduling device 340 also includes mergers M.

[0051] In one embodiment, the scheduling device 340 may implement the schedule (e.g., connection schedule, time table, connection timetable, etc.) for selecting different routing devices. As discussed above, the scheduling device 340 may receive a clock signal or clock pulse from a clock/timing device. An incoming pulse arriving at one of the N inputs (e.g., IN0, IN1, IN2, and IN3) may get distributed evenly in a round-robin fashion to the N outputs (e.g., OUT0, OUT1, OUT2, and OUT3). For example, the first input pulse to arrive (from one of inputs IN0 through IN3) is routed to OUT0, the second input pulse (from one of inputs IN0 through IN3) to OUT1, and so on and so forth. After sending a pulse to OUT3, the scheduling device 340 may start back at OUT0 for the next pulse.

[0052] FIG.7 is a block diagram illustrating an example balancer 700, in accordance with some embodiments of the disclosure. The balancer 700 may be part of a scheduling device (e.g., scheduling device 340 illustrated in FIG.3). For example, rather than using a TFF in the scheduling device 340, the balancer 700 may be used. The balancer 700 includes inputs X0 and X1, and outputs Y0 and Y1.

[0053] The balancer 700 includes splitters S, mergers M, Josephson transmission lines (JTLs), and inhibit circuits. Each splitter S has an input and two (or more) outputs. Each splitter S may produce a pulse/packet at both outputs (or all of its outputs) when an input pulse/packet is received on its input. Each merger M has two (or more) inputs and one output. Each merger M may produce an output pulse when receiving a pulse at either input. A JTL may include one input and one output.
A JTL may propagate a pulse from an input to an output after a fixed delay. An inhibit circuit may include two inputs and one output. An inhibit circuit may propagate a pulse from the first input unless a pulse arrived at the second input more recently than at the first input. [0054] As illustrated in FIG.7, the outputs of the two rightmost splitters S are routed back to the balancer 700 to create feedback loops. The feedback loops allow the balancer 700 to configure the routing path. For example, when a pulse exits at output Y0, it will be fed back to the inhibit block, causing the path to be inhibited or closed, allowing the next pulse to exit at output Y1. The balancer 700 also includes a reset input to reset the balancer 700 back to an initial state. [0055] FIG.8 is a block diagram illustrating an example converter 800, in accordance with some embodiments of the disclosure. As discussed above, one or more of the components in an RSFQ superconducting circuit may use race logic, or some other temporal-based encoding for data (e.g., an encoding where the number of pulses received during a period of time indicates a value). However, some of the components in the RSFQ superconducting circuit may not use race logic, or components coupled to the RSFQ superconducting circuit may not use race logic. For example, the sending devices may use race logic but the receiving devices may not use race logic. In another example, additional devices that are coupled to the receiving devices may not use race logic while the receiving devices use race logic. [0056] In one embodiment, the converter 800 may convert a binary input into a race logic format or encoding. The converter 800 may include an input reference clock Tclk which feeds into serially connected delay blocks (JTL Delay). The delay blocks may be implemented as a delay buffer or as two inverters in series. The output of each delay block is fed as input to the n:1 multiplexer.
The select signal of the multiplexer is the digital input (e.g., a bit value), which is converted to an output that conforms to race logic. [0057] FIG.9 is a block diagram illustrating an example converter 900, in accordance with some embodiments of the disclosure. As discussed above, one or more of the components in an RSFQ superconducting circuit may use race logic, or some other temporal-based encoding for data (e.g., an encoding where the number of pulses received during a period of time indicates a value). However, some of the components in the RSFQ superconducting circuit may not use race logic, or components coupled to the RSFQ superconducting circuit may not use race logic. [0058] In one embodiment, the converter 900 may receive a pulse, packet, data, etc., that uses a race logic format or encoding at the input Tin. The converter 900 may convert the pulse/packet into a binary format. The converter 900 may include an input reference clock Tclk which feeds into serially connected delay blocks. Each delay block uses an RSFQ D flip-flop (DFF). The output of each delay block is observed. The input reference clock Tclk of all the RSFQ DFFs is connected to the stop signal, and the input of the first RSFQ DFF is connected to the start signal, which is the time-domain signal Tin. [0059] FIG.10 is a block diagram illustrating an example converter 1000, in accordance with some embodiments of the disclosure. As discussed above, one or more of the components in an RSFQ superconducting circuit may use race logic, or some other temporal-based encoding for data (e.g., an encoding where the number of pulses received during a period of time indicates a value). However, some of the components in the RSFQ superconducting circuit may not use race logic, or components coupled to the RSFQ superconducting circuit may not use race logic. [0060] In one embodiment, the converter 1000 may receive a pulse, packet, data, etc., that uses a race logic format or encoding at the input Tin.
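The conversions between a binary value and a race logic arrival time, as described for the converters above, may be pictured with a purely behavioral model. The Python sketch below is a hedged illustration only: the function names, the picosecond time unit, the uniform per-tap delay `slot_delay_ps`, and the assumption that tap 0 is the undelayed reference clock are all hypothetical and are not taken from the figures.

```python
def binary_to_race_logic(value, n_slots, slot_delay_ps):
    """Model of a delay chain feeding an n:1 multiplexer.

    The binary value acts as the multiplexer select signal and picks the tap
    whose accumulated delay encodes that value (assumed: tap k is delayed by
    k * slot_delay_ps relative to the reference clock Tclk).
    """
    if not 0 <= value < n_slots:
        raise ValueError("value does not fit in the available time slots")
    taps = [k * slot_delay_ps for k in range(n_slots)]
    return taps[value]


def race_logic_to_binary(arrival_ps, n_slots, slot_delay_ps):
    """Inverse mapping: count whole time slots elapsed between Tclk and Tin."""
    slot = arrival_ps // slot_delay_ps
    if not 0 <= slot < n_slots:
        raise ValueError("pulse arrived outside the epoch")
    return slot


if __name__ == "__main__":
    arrival = binary_to_race_logic(4, n_slots=16, slot_delay_ps=10)
    print(arrival, race_logic_to_binary(arrival, n_slots=16, slot_delay_ps=10))
```

Integer picoseconds are used so that the round trip is exact; a physical implementation would additionally need timing margins, which this sketch does not model.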
The converter 1000 may convert the pulse/packet into a binary format. The converter 1000 may include an input reference clock Tclk which feeds into an RSFQ counter. The input signal for this circuit is a “Tclk” clock signal which starts the counter and the reset signal is the time-domain signal Tin (e.g., a race logic packet or pulse received at a time slot). [0061] FIG.11A is a diagram illustrating an example shift register 1100, in accordance with some embodiments of the disclosure. As discussed above, a routing network may route packets, pulses, data, etc., between sending devices and receiving devices. The routing network may not buffer the pulses, packets, etc., that are routed through the routing network. However, the sending devices and/or the receiving devices may need to buffer packets/pulses if they are not ready to transmit and/or receive packets/pulses. The shift register 1100 includes resistors, inductors, Josephson junctions (JJs), and a DC-to-SFQ circuit (DCSFQ). [0062] The shift register 1100 includes four stages. The first stage may be an input stage where an input DC pulse is provided to the DCSFQ to be converted into SFQ pulses. The second stage is the shifting stage which includes three magnetically coupled interferometers. The resistors Rs1 through Rs2 may be bias resistors connected to the input pulses. There are three input ports for the three clock currents that determine the shifting. [0063] The third stage is a readout stage which includes a Josephson transmission line (JTL). The JTL is used for signal amplification and to deliver the output pulse. The last stage is the terminating stage. The terminating stage includes a coupled inductor, JJs, and a resistor Rt. The three input clock pulses can be tuned to the desired shifting interval. [0064] FIG.11B is a diagram illustrating an example shift buffer 1150, in accordance with some embodiments of the disclosure.
The shift buffer 1150 includes a DC-to-SFQ circuit (DCSFQ), a merger (M), a shift register (SR), a splitter (S), and two non-destructive read-out (NDRO) cells. The bottom NDRO cell may be used for the feedback loop to control the propagation of the feedback pulses. At the shift register output is a splitter and another NDRO cell that may control when pulses in the shift register propagate to the output after they traverse the last stage of the shift register. The shift buffer 1150 may be used to store RL-encoded pulses. The design of the shift buffer 1150 may be useful for temporarily and inexpensively storing information (e.g., buffering or storing the data for a period of time) encoded using race logic (or some other temporal-based encoding). [0065] FIG.12 is a flow diagram of a method 1200 of routing packets/pulses, in accordance with some embodiments of the disclosure. Method 1200 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a central processing unit (CPU), a system-on-chip (SoC), etc.), software (e.g., instructions running/executing on a processing device), firmware (e.g., microcode), or a combination thereof. In some embodiments, at least a portion of method 1200 may be performed by one or more of a routing network, a routing device, and a scheduling device. [0066] With reference to FIG.12, method 1200 illustrates example functions used by various embodiments. Although specific function blocks ("blocks") are disclosed in method 1200, such blocks are examples. That is, embodiments are well suited to performing various other blocks or variations of the blocks recited in method 1200. It is appreciated that the blocks in method 1200 may be performed in an order different than presented, and that not all of the blocks in method 1200 may be performed.
[0067] Method 1200 begins at block 1205, where the processing logic may obtain a set of schedules for routing packets between a set of sending devices and a set of receiving devices. One or more of the routing network, the set of sending devices, and the set of receiving devices are part of a superconducting device. The set of sending devices and the set of receiving devices use a race logic architecture, as discussed above. At block 1210, the processing logic may couple a subset of the set of sending devices to a subset of the set of receiving devices based on the set of schedules. For example, using a round robin schedule, each routing device of a routing network may cycle through a set of sending devices and cycle through a set of receiving devices. [0068] Appendix A includes various diagrams that describe systems, architectures, modules, components, technologies, etc., that may be used to route data between sending devices and receiving devices. Appendix A is hereby incorporated by reference in its entirety. [0069] Unless specifically stated otherwise, terms such as “receiving,” “transmitting,” “sending,” “connecting,” “interconnecting,” “routing,” “forwarding,” “scheduling,” “generating,” “obtaining,” “determining,” or the like, refer to actions and processes performed or implemented by computing devices that manipulate and transform data represented as physical (electronic) quantities within the computing device's registers and memories into other data similarly represented as physical quantities within the computing device memories or registers or other such information storage, transmission or display devices. Also, the terms "first," "second," "third," "fourth," etc., as used herein are meant as labels to distinguish among different elements and may not necessarily have an ordinal meaning according to their numerical designation. [0070] Examples described herein also relate to an apparatus for performing the operations described herein.
This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computing device selectively programmed by a computer program stored in the computing device. Such a computer program may be stored in a computer-readable non-transitory storage medium. [0071] The methods and illustrative examples described herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used in accordance with the teachings described herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear as set forth in the description above. [0072] The above description is intended to be illustrative, and not restrictive. Although the present disclosure has been described with references to specific illustrative examples, it will be recognized that the present disclosure is not limited to the examples described. The scope of the disclosure should be determined with reference to the following claims, along with the full scope of equivalents to which the claims are entitled. [0073] As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “comprising”, “includes”, and/or “including”, when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Therefore, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. [0074] It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. 
For example, two figures shown in succession may in fact be executed substantially concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved. [0075] Although the method operations were described in a specific order, it should be understood that other operations may be performed in between described operations, described operations may be adjusted so that they occur at slightly different times or the described operations may be distributed in a system which allows the occurrence of the processing operations at various intervals associated with the processing. [0076] Various units, circuits, or other components may be described or claimed as “configured to” or “configurable to” perform a task or tasks. In such contexts, the phrase “configured to” or “configurable to” is used to connote structure by indicating that the units/circuits/components include structure (e.g., circuitry) that performs the task or tasks during operation. As such, the unit/circuit/component can be said to be configured to perform the task, or configurable to perform the task, even when the specified unit/circuit/component is not currently operational (e.g., is not on). The units/circuits/components used with the “configured to” or “configurable to” language include hardware--for example, circuits, memory storing program instructions executable to implement the operation, etc. Reciting that a unit/circuit/component is “configured to” perform one or more tasks, or is “configurable to” perform one or more tasks, is expressly intended not to invoke 35 U.S.C. 112, sixth paragraph, for that unit/circuit/component. Additionally, “configured to” or “configurable to” can include generic structure (e.g., generic circuitry) that is manipulated by software and/or firmware (e.g., an FPGA or a general-purpose processor executing software) to operate in manner that is capable of performing the task(s) at issue. 
“Configured to” may also include adapting a manufacturing process (e.g., a semiconductor fabrication facility) to fabricate devices (e.g., integrated circuits) that are adapted to implement or perform one or more tasks. “Configurable to” is expressly intended not to apply to blank media, an unprogrammed processor or unprogrammed generic computer, or an unprogrammed programmable logic device, programmable gate array, or other unprogrammed device, unless accompanied by programmed media that confers the ability to the unprogrammed device to be configured to perform the disclosed function(s). [0077] The foregoing description, for the purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the embodiments and its practical applications, to thereby enable others skilled in the art to best utilize the embodiments and various modifications as may be suited to the particular use contemplated. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.

APPENDIX A

SRNoC: A Statically-Scheduled Circuit-Switched Superconducting Race Logic NoC

George Michelogiannakis, Darren Lyles, Patricia Gonzalez-Guerrero, Meriam Bautista, Dilip Vasudevan, Anastasiia Butko
Lawrence Berkeley National Laboratory, Berkeley, California, USA
Email: {mihelog, dlyles, lg4er, mgbautista, dilipv, abutko}@lbl.gov

Abstract—Temporal encoding has been shown to be a natural fit for single flux quantum (SFQ) superconducting computing since SFQ already encodes information with the presence or absence of voltage pulses. However, past work in SFQ has focused on binary-encoded networks on chip (NoCs). In this paper, we propose superconducting rotary NoC (SRNoC), a NoC where both data and control paths operate in the temporal domain following the race logic (RL) convention. Therefore, SFQ chips with temporal compute or memory can use SRNoC to avoid converting between the temporal and binary domains that would result from using a binary-encoded NoC. Using RL also enables SRNoC to be area-efficient, mitigating SFQ technology’s low device density. SRNoC treats pulses as independent packets and delivers them to outputs without changing their value, i.e. preserving the RL convention. SRNoC operates on a fixed, rotating connection schedule between inputs and outputs. In each connection window, multiple pulses (packets) can be transmitted sequentially. SRNoC provides 13.1× higher throughput per port per Josephson junction (JJ) compared to the best-performing of three demonstrated NoCs.

1. Introduction

Based on the potential of superconducting circuits to operate at several tens of GHz and with superior energy efficiency than CMOS [1], [2], [3], superconducting computing aspires to preserve performance scaling for key HPC applications once traditional technology scaling ceases. Temporal encoding has been shown to be a good fit for superconducting computing because it matches how single flux quantum (SFQ), currently the dominant superconducting logic family [2], encodes 0s and 1s using the absence or presence of picosecond-duration voltage pulses. Recently, race logic (RL) was adapted to rapid single flux quantum (RSFQ) [3]. This enables computational circuits to trade operating frequency for reduced area and energy since they encode information using pulses in the time domain, a language that is more natural to SFQ.

Area efficiency is particularly important because current superconducting device technology is three orders of magnitude less dense than state-of-the-art CMOS [1]. As a result, demonstrated superconducting chips cannot, at best, fit more than few tens of thousands of Josephson junctions (JJs), the fundamental device of superconducting computing [4]. However, the expected superconducting device density improvements combined with the constantly-growing computational needs of key HPC applications make superconducting a prime candidate for a long-lasting impact to HPC, such as in application-specific accelerators in future HPC systems.

Reduced circuit area with RL combined with emerging superconducting device technologies both drive us towards higher-scale future RSFQ chips. Just as in CMOS, this means that networks on chip (NoCs) will likely play an important role in the scalability and efficiency of future larger-scale superconducting circuits. What’s more, NoCs should adapt to the computational model that is most efficient for RSFQ circuits, instead of shoehorning CMOS-inspired NoCs to RSFQ. Using binary-encoded NoCs threatens to drastically reduce performance efficiency improvements from computing temporally in RSFQ [3].

In this paper, we propose superconducting rotary NoC (SRNoC), a NoC for RSFQ superconducting circuits where both the data and control paths are temporally-encoded using RL [3]. In accordance with RL, each data pulse represents a value that SRNoC treats as an independent packet. Packets (pulses) are transported from a network input to an output without changing their value, thus preserving the RL convention. SRNoC routers operate on a fixed time-division schedule where input–output connections are established in pre-determined, rotating schedules. SRNoC is composed of low-latency, area-efficient building blocks for data and control. To make scheduling practical, each packet traverses exactly one router. However, SRNoC can have multiple routers in parallel to increase performance or scaling [5]. During an input–output connection, multiple packets can be sent in sequence as long they do not collide.

SRNoC’s contributions include its low JJ count as well as designing both control and data paths exclusively in the temporal domain. SRNoC is a natural fit for RSFQ chips that compute in the temporal domain [3] as well as for future larger-scale RSFQ chips that require more area-efficient and higher-throughput NoCs. SRNoC’s low JJ count is important given superconducting technology’s low device density [1]. SRNoC provides 13.1× higher throughput per port per JJ under load-balanced traffic compared to the best-performing of three NoCs that were previously demonstrated in RSFQ, each with a different network topology [6] [7] [8].

2. Background

2.1. Superconducting Computing

Superconductivity is defined as the property of certain metals to have zero resistance below a critical temperature that is usually a few degrees Kelvin [1]. A major milestone in superconducting computing was the invention of a Josephson junction (JJ) [4]. JJs have two edges and allow current to pass through with no resistance until a critical current is reached. Reaching that critical current causes the JJ to switch to a resistive state and emit a magnetic quantum flux transfer that is observable as a voltage pulse [2]. JJs are capable of switching with delays as low as a few picoseconds and with energies several orders of magnitude lower than CMOS [1]. This makes superconducting circuits particularly attractive for energy-efficient computation even after accounting for cooling energy. Superconducting circuits have been shown to operate at clock frequencies of several tens of GHz [1], [3], [8]. However, even though progress is ongoing in increasing JJ density, currently there is a 1000× area density gap compared to CMOS [1].

Currently, single flux quantum (SFQ) and its variants are the dominant logic family type [2]. SFQ encodes a logical 1 with the presence of a pulse that typically has a duration of a few picoseconds and an amplitude of few mVs. A logical 0 is encoded by the absence of a pulse. This is in contrast with CMOS where 0s and 1s are encoded by voltage levels that typically remain constant until the next rising clock edge.

2.2. Race Logic

Recently, race logic (RL) was adapted to rapid single flux quantum (RSFQ) superconducting circuits [3]. Since in RSFQ voltage pulses continuously propagate and thus do not maintain a constant voltage level at each input like in CMOS, RL in RSFQ adapts its logic primitives (i.e., gates) to be stateful in order to remember what pulses arrived and their timing relative to other inputs of the same gate [3].

RL encodes information in the time domain by dividing time in what we refer to as “RL time slots” in this paper. That is, the time of arrival of a pulse encodes its value. For example, if we define the minimum value that each pulse can represent to be 0 and the maximum 15, we can divide time into epochs where each epoch has 16 RL time slots. If a pulse arrives in time slot 4, it encodes the value 4. After the last RL time slot, the next epoch begins by going back to the first RL time slot that in this example encodes value 0. The number of time slots in an epoch do not necessarily have to be a power of two.

With RL, fewer wires are required to encode information. This helps make circuits smaller but at a cost of operating frequency. This is a worthwhile trade-off since superconducting circuits can achieve higher operating frequencies than CMOS. Therefore, we consider RL in RSFQ to be a promising future technology and thus we focus on it for our network on chip (NoC) named superconducting rotary NoC (SRNoC), the first using RL in RSFQ.

Table 1 summarizes the logic primitives that are relevant to the rest of our study. The first three are classical SFQ primitives [1], [2], [3] while the last three are RL in RSFQ primitives [3].

TABLE 1. A BRIEF SUMMARY OF SUPERCONDUCTING LOGIC PRIMITIVES RELEVANT TO OUR STUDY. (Columns: Name, Inputs, Outputs, Summary.)

2.3. Related Work

Previous work demonstrated NoCs in RSFQ using binary (non-temporal) data representation. Several works demonstrated individual crossbar switches using different implementations such as with multiplexers and of different radices such as 2×2 [9], [10] and 4×4 [8], [11], as well as an individual switch scheduler for the control path [12]. Two small-scale NoCs were also demonstrated, namely a 3×3 ring topology [6] and an 8×8 Banyan network consisting of 2×2 switches with limited internal FIFO buffering [7]. Later in our evaluation we compare SRNoC to the state of the art of this past work from a variety of topologies.

In CMOS, temporal encoding in NoCs only appears as part of the data path, while the control logic remains binary [13]. Temporal encoding does appear in fields other than NoCs such as spiking neural networks and neuromorphic computing.

To the best of our knowledge, SRNoC is the first to operate both its control and data paths in the temporal domain using RL. This allows SRNoC to have a lower JJ count and be a good fit for applications that produce and consume temporal data.

3. Motivation

One motivating example for SRNoC is transport-triggered architectures (TTAs) [14]. TTAs implement multiple MOVE sub-instructions. Moving data to pre-determined locations such as specific registers implies the operation that will be performed on the data. The number of sub-instructions depends on the available data-transport buses and defines the level of parallelism. Essentially, TTAs decompose traditional ALUs into multiple units. The complexity of the compute units depends on the target application. In general, TTAs tend to have a large number of relatively simple units sometimes with redundant functionalities. TTAs provide a number of advantages such as reduced register file usage, high clock frequencies, simple independent function units, and high design scalability.

In RSFQ, TTAs would be capable of high clock frequencies and thus impressive performance. However, their operation creates more data movement than traditional architectures. Therefore, the efficiency of the NoC may define many of the overall architecture performance characteristics because the NoC directly addresses some current limitations of TTAs such as the latency and bandwidth for transporting results [14]. A more capable NoC may also enable more compute specialization. Thus, SRNoC can be an enabler for TTAs in RSFQ. What’s more, to minimize storing pulses at endpoints as well as reduce network latency, communication schedules of operands can be made to match the fixed connection schedule of SRNoC or we can re-configure the connection schedule to match expected traffic. The latter is easier in architectures such as some temporal networks or TTAs where the compiler can dictate or at least predict the dataflow graph. In addition, co-designing SRNoC with TTAs would reduce required conversions between the binary and temporal domains by performing at least some computation or memory storage in the temporal domain. We leave the co-design of TTAs with SRNoC as future work.

Architectures similar to TrueNorth are also motivating examples because they compute on spikes (pulses) using neurons. Also, their preconfiguration (training) makes the traffic pattern more predictable. This is another example where adopting a binary-encoded NoC would introduce significant overhead to convert data representation between the binary and temporal domains at NoC boundaries.

4. SRNoC

A fundamental design goal of SRNoC is that both the data and control paths operate fully in the temporal domain using RL. This makes SRNoC a good fit for architectures that already compute in RL and helps reduce the JJ count of the NoC to alleviate RSFQ’s low device density that may otherwise limit network scale or bandwidth [1]. However, implementing the control path in RL makes computing on packets at each hop challenging [3]. In addition, superconducting technology’s low device density [1], [2] combined with the temporal encoding of the data path make buffering (storing) packets costly. Therefore, SRNoC uses circuit-switched flow control without in-network buffers where paths are pre-configured using a predictable rotating schedule. Similar approaches have demonstrated high performance at datacenter system scales [5]. A fixed schedule results in simple control logic because circuits are not established and terminated ad hoc. The downside of this approach is that the bandwidth between any two source–destination pairs does not adjust to the traffic pattern. However, we will show that SRNoC provides higher bandwidth per JJ than prior art to make up for this drawback.

4.1. Overview

Figure 1 shows an overview of SRNoC. The number of endpoints and routers does not have to be equal. At a minimum, there is one router and all network endpoints connect to it. Endpoints connect to routers using dedicated wires. Because routers are in parallel, all packets traversing SRNoC take exactly one hop through only one router. For the scales previously demonstrated in RSFQ NoCs that are the scales we use in our evaluation, this approach is efficient. To support more endpoints, we can either increase the inputs and outputs of a SRNoC router or increase the number of parallel routers. These options have different performance versus complexity tradeoffs and can be combined. The latter option also has an additional parameters of whether routers connect to all endpoints or only a subset. Instantiating routers in parallel has been shown to be efficient even at datacenter network scales [5]. Alternatively, we can cascade routers to design multi-hop networks, but router schedules should be coordinated and packet values must still be preserved in RL. However, since scaling up SRNoC to many tens or hundreds of endpoints is impractical in near-term RSFQ technology, we leave this exploration as future work.

Figure 1. Overview of SRNoC. The number of endpoints and routers do not have to be equal. Each router has its own connection schedule. Endpoints connect with dedicated wires to routers.

In SRNoC, each router maintains its own connection schedule that is pre-defined and independent of network activity. Routers and endpoints have a common clock which serves as a timing reference such that all components are aware of which circuit connections are active at any one time, and circuits are torn down and established in synchrony with other routers [5]. If we have multiple routers in parallel, in the interest of latency we stagger their connection schedules such that we avoid having two or more circuits between the same source to the same destination active at the same time.

4.2. Data Representation and Transfer

Both the inputs and outputs of our network follow the RL representation of information in RSFQ where the time of arrival of a pulse determines the value it encodes [3]. Using RL, each pulse represents a value that can equivalently in “traditional” binary networks be represented as an array of bits. Each pulse that travels through SRNoC is an independent packet. Each connection window, which is the time that an input remains continuously connected to the same output (i.e., duration of a circuit), contains a pre-defined number of RL time slots. That number is the same for all connection windows. Therefore, routers spend an equal amount of time connecting each input–output pair. The number of RL time slots determines the number of equivalent binary bits for a packet. For example, with 64 RL time slots per connection window, each packet is equivalent to $\log_2 64 = 6$ bits. However, the number of RL time slots does not have to be a power of 2. For simplicity, the notion of time is common across all network endpoints and routers. These concepts are illustrated in Figure 2.

Figure 2. An illustration of time for one input (input i) of one individual SRNoC router. Input i connects to each output in sequence for a duration specified by the connection window (circuit duration) that is composed of the number of RL time slots chosen at design time.

In SRNoC, packets (pulses) do not need to carry destination, virtual channel, head/tail bits, or any other overhead information traditionally found in modern NoCs [15], except for error control if so desired. In order to increase the payload capacity of a single pulse, we can increase the number of possible values that a single pulse represents by increasing the number of RL time slots. Doing so means that connections last longer but change less frequently. While the duration of an RL time slot is determined by circuit parameters as we show later, the number of RL time slots per connection window is a design choice. Both values have bandwidth and latency implications.

An important feature of SRNoC is allowing multiple pulses (packets) from the same input to the same output during a connection window. However, overlapping pulses can cause erroneous data transfer. Therefore, SRNoC only transmits at most one pulse per RL time slot. That is, two pulses that represent the same value cannot be transmitted in the same connection window. This allows, for example, sending a pulse representing value ’2’ and value ’8’ in the same connection window, but not two pulses each representing the value ’2’. Potentially, this allows a connection window to carry a pulse in every RL time slot. Ideally, with 16 RL time slots SRNoC could transmit 16 pulses each representing a different value from 0 to 15 (4-bit values). This means that, at most, the equivalent of $\log_2(16) \times 16 = 64$ bits can be transferred during every connection window with 16 RL time slots.

For our analysis, we consider an average case by assuming that pulses represent any value in their allowed range, chosen randomly with equal probability. With n-bit packets, the probability that any two packets have all bits equal is $(1/2)^n$ assuming uniform random probabilities for each bit. In practice this would depend on the architecture and application. Therefore, if we try to send m packets per connection window that has $2^n$ RL time slots, we use the bins and balls model to calculate that the expected number of packets that can be sent (i.e., packets with different values) is approximately [equation missing]. With $m = 2^n$ to signify that we try to send one packet in each of the $2^n$ RL time slots, the expected number of packets we can send becomes [equation missing]. For example, for $2^n = 64$ (6-bit packets) that we use as default later in our evaluation, that expected number of packets that can be sent during a connection window is 40.1.

To maintain the same notion of time throughout the network, RL time slots have to be long enough to account for the propagation delay through the NoC. Otherwise, a packet (pulse) entering the NoC may arrive during a different RL time slot at the receiving endpoint, thus changing its value in RL. Therefore, in SRNoC the duration of an RL time slot is largely determined by the propagation delay through a router, plus crosspoint hold and setup times, and a design margin to account for timing variability. In this discussion, setup and hold times refer to the time margin before and after changing connections, respectively, where packets cannot be sent, as shown in Figure 2.

One option we leave as future work is to allow sender and receiver endpoints to have a “shifted” notion of time to allow making RL time slots shorter than the propagation delay. In this case, the receiver endpoint shifts all its RL time slots by the propagation delay. However, this requires endpoints to effectively operate in two RL time domains, one for the receiver circuitry and one for the sender circuitry.

4.3. Crossbar Crosspoint

Similar to most networks, another fundamental block of SRNoC is our crossbar crosspoint. Our crosspoint relies on a non destructive read out (NDRO) cell that we adapt from [16] by adding an extra JJ in the derivation path to protect the SFQ loop and modifying the output stage. The NDRO cell has three inputs (S, Read, and R) and two complementary outputs (Q and QR), illustrated in Figure 3a. We use the NDRO similarly to a CMOS tri-state except that

complementary outputs (Q and QR). When a pulse arrives to the input, it causes a pulse at output Q. The next input pulse causes a pulse at output QR. This proceeds in a round-robin fashion. Upon powering up the circuit, the TFF initiates at a state where the next output pulse will be generated at output Q. Our implementation of a TFF has no reset input

Figure 4. A 1×4 counting network distributes incoming pulses based on their relative time of arrival. The first pulse is routed to output 1, the second to output 2, etc. The process repeats in a round-robin fashion.
because in SRNoC we never reset counting networks that would cause them to re-start their round-robin distribution. the NDRO has to maintain internal state to remember pulses If two pulses arrive too close to each other to the TFF’s it observed from the reset (R) and select (S) inputs and their input (in), there is a risk of the TFF treating them as one relative timing. If the NDRO observes a reset (R) pulse more pulse and therefore missing an input pulse. Preventing this in recently than select (S), any pulses arriving to the data input a counting network with more than one input would impose (Read) are not routed to the data output (Q) because the timing dependencies between inputs. However, SRNoC only NDRO is in a “cleared state”. However, if the select (S) input uses 1×N counting networks. Therefore, we just have to observes a pulse more recently than reset (R), the NDRO is ensure that the duration of a connection window that is also in a “connected” state and thus any pulses arriving to the the period of the clock that is the input to SRNoC’s counting data input (Read) are routed to the data output (Q). The networks is not short enough to cause this problem. complementary NDRO output (QR) produces a pulse when Figure 5 shows our implementation of a 1×4 counting a reset (R) pulse arrives, if the NDRO was previously in a network. Thefirst and second stage TFFs are routed such “connected” state. However, we do not use QR in SRNoC’s that input pulses will always be distributed in a round robin crossbar and thus leave it unconnected. fashion, starting at the top output. The internal structure of NDROs have setup times between a data pulse (Read) our counting network remains bitonic [17] with the excep- and a subsequent reset (R), as well hold times between a tion that all paths and balancers that would be connected to select (S) and a subsequent data pulse. These setup and one of the N − 1 non-present inputs are excluded. hold times match those shown in Figure 2. 
Our NDRO An advantageous property of the TFF is low delay implementation operates correctly even with overlapping variability between an input pulse causing a pulse at output reset and select pulses. In that case, select takes precedence. Q compared to causing a pulse at QR. We use this property to more efficiently hide the input–output delay of the count- 4.4. Counting Network ing network such that it does not prolong RL time slots. That is, the clock driving the counting networks is ahead Bitonic counting networks werefirst proposed for compared to the sender endpoint circuits by D, where D is CMOS [17] to distribute tokens from N inputs to N outputs the minimum delay through a counting network. Therefore, such as to satisfy a fairness property. We use counting by the time a sender network endpoint receives a clock to networks in SRNoC to implement the connection timetables signify the beginning of the connection window, it does not in every router. In particular, we design a counting network have to wait for the counting networks because thefirst in RSFQ such that tokens are pulses. We further customize crosspoints have already received a select. Note that this is the design such that instead of N inputs we only have one, a pessimistic assumption since part of delay D could be but still have N outputs (i.e., we design a 1×N counting hidden by the delay for the sender endpoint to generate and network). All incoming pulses arrive through that input and propagate a packet to a router input. get distributed evenly in a round-robin fashion to the N To account for delay variability, we add the maximum outputs. That is, thefirst input pulse to arrive is routed to minus minimum delay through a counting network (output thefirst output, the second pulse to the second output, etc. delay variability) to the duration of a RL time slot. This Figure 4 illustrates this. 
allows all crosspoints to be selected before receiving a data The basic logic block of a counting network is the pulse and still gives enough time to pulses to exit the router balancer. We implement our balancer using a toggleflip in the same RL time slot they were transmitted in. flop (TFF), shown in Figure 3b. When our balancer has Furthermore, if crosspoint hold times are short, we can two inputs, the TFF is preceded by a merger (M). Our TFF prolong RL time slots by adding hold times to D, to design is based on [18] The TFF has one input (in) and two allow senders to send pulses after hold times have elapsed Finally, depending on their inputs and outputs, packets have to traverse a different number of splitters before ar- riving to their crosspoint and a different number of mergers after their crosspoint in order to depart the router. Therefore, when we consider the propagation delay through the router to add to the duration of a RL time slot (Section 4.2), we consider the worst case input–output path that contains one crosspoint, n−1 splitters, and n−1 mergers where n is the number of inputs or outputs of the router (i.e., the router radix). In Figure 6, that would be two splitters and two mergers (n = 3). 4.6. Endpoint Circuits Because endpoint circuits act as intermediaries between the NoC and traffic producers or consumers such as compute and memory blocks, just like any NoC, endpoint circuit design depends on how information is encoded by traffic producers and consumers. If compute blocks process binary- encoded traffic, the sender endpoint can convert binary However, if hold times are long but still shorter than RL traffic to RL such as by using binary counters that produce time slots, we can disallow sending pulses during thefirst a pulse when counting down to zero. To avoid buffering RL time slot of every connection window. 
Likewise, for RL pulses at endpoints, this countdown can begin once a crosspoint setup times we can either prohibit transmitting connection to the desired destination is established. Alter- pulses during the last RL time slot of each connection win- natively, if compute blocks process RL traffic but not in dow, or prolong RL time slots by the difference between the synchronization with router connection schedules, the sender setup time and the propagation delay of a single merger (if endpoint can buffer (store) pulses using a shift register [19] the difference is positive). That is because pulses (packets) until a connection is established. However, the most efficient traverse at least one merger after a crosspoint to exit the solution would be to co-schedule compute and memory router and this can overlap with the crosspoint’s setup time. traffic with SRNoC such as to produce pulses when SRNoC Based on our circuits, we choose the latter strategy. establishes connections to desired destinations. On the receiver side, no extra processing or additional 4.5. Router Architecture delays are required by SRNoC, but such requirements may be added by the receiver. The connection window that each Figure 6 illustrates the router architecture of SRNoC. pulse arrives in tells the destination which was the sender. As previously discussed, there is no buffering or processing In addition, we can use a 1×N counting network at every of traffic inside the router. All crosspoints for the same receiver to separate all pulses that arrive during the same output (i.e., one column in Figure 6) have their select inputs connection window, if so desired. driven by the same counting network. The input for all counting networks is a clock with a period that equals a connection window. This is the only control input of 5. Evaluation SRNoC. This way, at the next connection window a clock pulse arrives to each counting network. 
This causes each To evaluate SRNoC, wefirst present a circuit-level eval- counting network to generate select pulses to a different uation of individual components, followed by an RTL-level crosspoint. This changes the connectivity of each input. We evaluation of a router, and then a network-level evaluation connect outputs of counting networks to crosspoints such for different traffic patterns and parameters. that at every connection window only one crosspoint at every row and one crosspoint at every column receives a select 5.1. Circuit-Level Evaluation (i.e., only crosspoints that belong to the same diagonal are selected at a time). This ensures that each input connects to First, we individually characterize the crosspoint exactly one output. (NDRO) and the 1×4 counting network from Section 4 to Before generating new select pulses, we have to reset demonstrate their functionality and extract parameters that crosspoints already in a “connected” state. Crosspoint reset we then use for larger-scale evaluations. is generated by the same clock and is routed to each cross- point by a series of splitters. The longest path in this series 5.1.1. Method. Our circuit-level evaluations are done us- of splitters is shorter than the counting network’s minimum ing WRSPICE and the open-source MIT-LL SFQ5ee 10 delay. This ensures that each crosspoint receives a reset kA/cm2 process. Of note, in RSFQ fan-in and fan-out before it can receive a select from the same clock pulse cannot simply be accomplished with wires connected to one

another. Instead, we must use mergers when fanning in and splitters when fanning out. Our WRSPICE decks include test-benches for which we design RSFQ pulse generators to take rectangular-like pulses as inputs and output RSFQ pulses which resemble a voltage spike. With RSFQ pulse generators, we drive inputs into the design under test and observe its input and output behavior using the waveform viewer provided by WRSPICE. We use a supply and bias voltage of 10 mV. For each circuit, we report the number of JJs, static power, active (dynamic) power, and the delay from a pulse entering the data input until the corresponding output pulse. Active power is obtained from simulations using the maximum input frequency achieved by each circuit. For our crosspoint implemented with an NDRO where reset and select pulses can safely overlap and select takes precedence, setup time is the minimum time required between an input data pulse and a subsequent reset pulse for the NDRO to operate properly. Hold time is the minimum time required between a select pulse and a subsequent input data pulse.

5.1.2. Results. Figures 7, 8, and 9 illustrate WRSPICE simulation waveforms that show the behavior of a TFF, a 1×4 counting network, and a crosspoint (NDRO), respectively. The crosspoint simulation shows two cycles (times between a reset). For the crosspoint, after the first reset pulse, a select pulse arrives. Therefore, any input data pulses after that get propagated to the output. The second reset pulse clears the crosspoint’s state. After that, since another select pulse does not arrive, input data pulses do not get propagated.

Table 2 shows maximum observed input–output delay, active and static power, and JJ count for each circuit. In addition, our crosspoint (NDRO) has a setup and hold time of 8ps each. Static power is much higher than dynamic power because our experiments are based on RSFQ technology that uses on-chip bias resistors that constantly consume power [1], [2]. In general, reducing input–output delay increases network throughput and reduces latency.

The minimum delay that we observe in our experiments for our 1×4 counting network is 37.08ps and the maximum is 41.46ps. Therefore, their difference that we refer to as variability is 4.38ps. This variability reflects the difference of delays of the four input–output paths within a 1×4 counting network. As we discussed in Section 4.4, while we can hide the counting network’s minimum delay using clock skew, the variability has to be added to our RL time slot duration, which degrades network performance.

Based on these circuit parameters and our description in Section 4, our RL time slot duration is 92ps (rounded up). This is determined by the formula below, where $D$ is the propagation delay through a crosspoint, $SE$ is the crosspoint setup time, $M$ is the propagation delay through a merger, $SP$ is the propagation delay through a splitter, $R$ is the number of inputs or outputs (router radix), $V$ is the delay variability of the counting network, and $\mathcal{M}$ is a factor we add to account for circuit timing variability (1.2 in our case). If $SE < M$, we just replace $SE - M$ with zero:

$$\mathcal{M} \times \bigl(D + (SE - M) + SP \times (R - 1) + M \times (R - 1) + V\bigr) \qquad (3)$$

This equation highlights that efficient circuit implementations decrease RL time slot duration and thus increase throughput, for instance by decreasing $D$ or $SE$ in the case of crosspoints.

We demonstrate correct operation of our TFFs for input clock pulses with a 13ps period (or higher) which is multiple times shorter than the duration of a connection window even for just two RL time slots. Therefore, TFFs do not impose a timing constraint. In addition, as we explain in Section 4.4, since our crosspoint hold times are large enough but still smaller than the propagation delay through the crosspoint, we disallow sending pulses during the first RL time slot of each connection window. For instance, with 64 RL time slots, only 63 can carry packets and consequently each packet represents a value from 0 to 62.

5.2. RTL-Level Evaluation

We then scale-up our evaluation and simulate an entire SRNoC router using behavioral Verilog.

5.2.1. Method. We construct Verilog behavioral models of the circuits we previously characterized in WRSPICE. We then perform unit testing of those circuit models and compare against WRSPICE simulations in order to verify correctness. After that, we use those Verilog circuit models to build a 4×4 SRNoC router, following Figure 6. In lieu of the analog SFQ spikes as seen in our WRSPICE simulations, we use 5ps long rectangular pulses. For RTL simulation we use the Icarus Verilog simulator and the GTKWave waveform viewer, which are both open-source.

5.2.2. Results. Figure 10 shows the simulation of a 4×4 SRNoC router. In every connection window (clock cycle), different non-conflicting input–output paths are established. Table 3 shows the input–output relationship with respect to connection windows and thus explains how input pulses are routed to outputs in Figure 10.

Packets (pulses) face different propagation delays based on which input they arrive at and what output they are routed to. That is because the input–output combination affects the number of splitters and mergers a packet traverses. At worst, packets traverse n − 1 mergers and n − 1 splitters, where n is the number of inputs or outputs of the router (router radix). At best, packets only traverse one splitter and one merger. Based on the delay values of Table 2, the difference between worst and best case (delay variability) to traverse a router is 16ps for n = 3.

5.3. Network-Level Evaluation

5.3.1. Method. For our network-level analysis, we compare against reported throughput and number of JJs of the best-performing demonstrated NoCs in RSFQ from a variety of network topologies. In particular, we compare against a uni-directional 3×3 ring [6], a 4×4 40GHz crossbar [8], and a 4×4 Banyan network [7]. Performance and number of JJs for the 4×4 Banyan network are estimates based on aggregate reported numbers and the contributions of individual 2×2 routers [7]. To match the scales of the aforementioned networks we compare against, we size SRNoC to 4×4 with only one router by default. To calculate maximum throughput, we assume senders try to send a packet (pulse) in each RL time slot except the first to avoid violating crosspoint hold times. That is, in every connection window senders try to send n−1 packets where n is the number of RL time slots. Consequently, we apply Equation 2 to account for packets that cannot be sent in the same connection window because they map to the same RL time slot.

Because we are comparing against binary-encoded NoCs, we calculate SRNoC throughput per port in Gbps based on how many pulses a sender can transmit per second and the equivalent number of bits each pulse represents in binary format. As we previously explained in Section 4.2, all those bits can be payload. We calculate end-to-end latency in SRNoC as the average waiting time for a pulse before it is delivered. That includes the time until a connection to the pulse’s desired destination is established and then until the RL time slot that matches the pulse’s value. This assumes that pulses (packets) are generated at the beginning of the first connection window, i.e., as the first connection is established. By default, we have 64 RL time slots where the first one cannot carry a packet. The duration of each RL time slot is 92ps as previously explained in Section 5.1.2.

5.3.2. Results. Topology-level analysis: Table 4 presents a throughput and JJ count topology-level analysis of SRNoC compared to three demonstrated NoCs in RSFQ. In UR traffic, senders have an equal probability to send to any destination including themselves. Under UR traffic all NoCs in our comparisons load balance equally. This allows them to utilize their bisection bandwidths in full. To measure performance over area efficiency, we divide the throughput in Gbps per port each network supports for UR traffic by the number of JJs. This is shown in the rightmost column of Table 4 where we show the relative improvement of each network in throughput (Gbps) per port per JJ. Throughput results are shown as factor improvements relative to the uni-directional ring, because that has the lowest throughput.

As shown, SRNoC improves throughput per port per JJ by 13.1× compared to the crossbar, which is the second-best performer. This throughput increase is large enough such that SRNoC is likely to remain favorable even for applications that have many packets with the same value and thus can send fewer pulses per connection window than the analysis of Equation 2. SRNoC’s improvements are largely because of its simplicity and therefore its lower JJ count.

Common traffic synthetic patterns: Here we evaluate throughput using UR, “UR without self” where a source sends with equal probability to a destination other than itself, “neighbor” where a source sends to the two destinations that are its immediate neighbors (this imitates some stencil algorithms), “bitcomp” where a source sends exclusively to a unique destination, and “Randperm (20)”. Randperm (20) is the average performance out of 20 random permutation traffic patterns where each source sends a constant but randomly-chosen fraction of its traffic to each destination including itself. Randperm (20) mimics the dominant traffic of a variety of applications such as trained neural networks. Our traffic patterns provide good coverage of static (i.e., not changing with time) traffic patterns for a 4×4 NoC. All these patterns are admissible because they do not overwhelm any destination. For this analysis, we are comparing only against the 4×4 crossbar [8] because that is the best-performing competitor and achieves full throughput for all admissible traffic patterns.

As shown in Figure 11, SRNoC provides higher bandwidth per port per JJ compared to the 4×4 crossbar [8] for all traffic patterns. Even though SRNoC does not fully utilize its bandwidth for any traffic pattern other than UR, its lower JJ count compared to the 4×4 crossbar [8] results in higher throughput per port per JJ.

Bitcomp is in fact a worst case (WC) traffic pattern for SRNoC. In general, for N endpoints and R routers, the WC throughput of SRNoC is $R/N$ where a value of 1 means full bandwidth utilization. The WC throughput occurs when each source sends at maximum rate exclusively to exactly one unique destination, similar to bitcomp.

Throughput–latency tradeoff: The default SRNoC configuration we have used thus far focuses on maximizing throughput. With it, SRNoC has a latency of 3.9ns. This high latency is because of the fixed schedule SRNoC’s router uses, which prevents packets from departing ad hoc compared to a packet-switched router. This is a known tradeoff with circuit-switched and time division multiplexed (TDM) networks. However, SRNoC provides a tradeoff between throughput and latency by adjusting the number of RL time slots per connection window, for a constant RL time slot duration. We can use this tradeoff to optimize SRNoC for latency or throughput depending on the application.

Figure 12 shows how throughput per port per JJ and latency increase as we increase the number of RL time slots per connection window for UR traffic. As shown, latency increases practically linearly because every time we double the number of RL time slots, packets have to wait twice as long on average until their desired connection and RL time slot. Assuming traffic sources have as many packets to send as RL time slots per connection window (a 100% injection rate), increasing the number of RL time slots increases throughput because it increases the equivalent bits that each pulse represents in RL. However, more RL time slots per connection window reduces the average number of packets that can depart per connection window (Equation 2). Therefore, throughput does not increase linearly with the number of RL slots, in contrast to latency. With only two RL time slots, SRNoC has a latency of 134ps and 83% lower throughput per port per JJ compared to 64 RL time slots.

6. Discussion and Future Work

A dimension of SRNoC that we leave for future work is using prior art on calculating an optimal connection timetable for an expected or observed traffic pattern [20]

RSFQ Digital-to-Time converter

Fig 1 shows the digital-to-time converter (D2T) using RSFQ logic. The idea of digital-to-time conversion is derived from the conventional CMOS-based DTC [wiki], which uses tapped delay lines and a multiplexer to generate a delayed equivalent of a digital binary input. The circuit consists of an input reference clock that feeds serially connected delay blocks (JTL Delay), each of which can be implemented as a delay buffer or as two inverters in series. The output of each delay block is fed as an input to the n:1 multiplexer. The select signal of the multiplexer is the digital input (D[0:n]) which needs to be converted into the time output. The invention here is the implementation of this D2T in RSFQ logic.

RSFQ Time-to-Digital converter (T2D)

For the time-to-digital converter, two methods are implemented in RSFQ. The first T2D design follows the conventional tapped-delay-line-based implementation.
As shown in Fig 2, the tapped delay line is constructed using RSFQ D flip-flops, and the output of each delay block is observed. The clock input (Tclk) of every D flip-flop is connected to the stop signal, and the input of the first flip-flop is connected to the start signal, which is the time-domain signal “Tin”. As shown in Figure 3, the second T2D conversion logic in RSFQ is based on an up-counter. The input signal for this circuit is a clock signal “Tclk”, which starts the counter, and the reset signal is the time-domain signal Tin. The invention here is the implementation of the tapped-delay-line-based [wiki] and counter-based time-to-digital converters [wiki] in RSFQ logic. For the counter-based T2D, the invention is the realization of counting steps as the time steps for the time-domain signals, implemented using RSFQ.
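The two conversions described above can be sketched behaviorally in Python. This is only an illustrative model, not the RSFQ circuits themselves: the function names, the use of integer picoseconds, the choice that tap `code` sits after `code + 1` delay blocks, and the assumption that Tclk pulses arrive at integer multiples of the clock period are all assumptions made for this example.

```python
def digital_to_time(code, tap_delay_ps, n_taps):
    """Tapped-delay-line D2T: the n:1 multiplexer selects the output of tap `code`.

    Assumption: tap k is the output of the (k + 1)-th serial delay block, so the
    reference clock edge appears at the output after (k + 1) * tap_delay_ps.
    """
    if not 0 <= code < n_taps:
        raise ValueError("code must select an existing tap")
    return (code + 1) * tap_delay_ps


def counter_time_to_digital(t_in_ps, clk_period_ps):
    """Counter-based T2D: count the Tclk pulses seen until Tin arrives.

    Assumption: Tclk pulses arrive at clk_period_ps, 2 * clk_period_ps, ...,
    and a pulse that coincides with Tin is still counted.
    """
    if t_in_ps < 0 or clk_period_ps <= 0:
        raise ValueError("times must be non-negative and the period positive")
    return t_in_ps // clk_period_ps


if __name__ == "__main__":
    delay = digital_to_time(code=5, tap_delay_ps=10, n_taps=8)
    print(delay, counter_time_to_digital(delay, clk_period_ps=10))
```

Under these assumptions, feeding the D2T output into the counter with a clock period equal to the tap delay returns `code + 1`, so a matching receiver would subtract one to recover the original code.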

1 Introduction

In our IPDPS paper, we describe one version of a counting network that is a good fit for the version of the network that we present in that paper. In this additional document, we describe variations to counting networks that, even though they were not a good fit for the network in our IPDPS paper, are appropriate for variations of our network (NoC) and therefore should be covered by our invention disclosure. We describe two variations on how to implement our balancer, the building block of our counting network. We also describe counting networks with N outputs but either one or N inputs. That is, we describe 1×N and N×N counting networks.

2 TFF-based Counting Networks

In the variation we describe in our paper, a balancer consists of a toggle flip-flop (TFF) (figure 1). A TFF has one input and two outputs. Incoming pulses at the input are distributed in a round-robin fashion among the outputs. It is this property of the TFF that we exploit to build a balancer. Note that we did not invent the TFF, but the manner in which we connect TFFs together with mergers to construct a counting network is part of the invention. The TFF implementation used in our experiments is based on [18] from our paper.

A TFF implements a balancer with one input and two outputs. To implement a balancer with two inputs as required by counting networks, we place a merger at the input of the TFF (figure 3). The role of the merger is to take two inputs and merge them into one output.

Using this balancer, we can build any counting network, either 1×N (figure 2) or N×N (figure 4). We restrict N to be a power of two. In both cases, the counting network receives pulses at its inputs and distributes them in a round robin fashion among its outputs. The input that each pulse arrived from does not affect the output it will be routed to. Only its sequence relative to other pulses does.
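The round-robin behavior of a TFF-based 1×N counting network can be illustrated with a small behavioral model in Python. This is a sketch under assumptions: the class names are hypothetical, pulses are modeled as method calls rather than timed events, and the wiring from tree leaves to outputs (the first-stage decision becomes the least-significant bit of the output index) is one choice that yields round-robin order starting at the top output; the actual circuit routing may differ.

```python
class TFF:
    """Toggle flip-flop used as a balancer: pulses alternate between Q (0) and QR (1)."""

    def __init__(self):
        self.next_out = 0  # after power-up the first pulse exits at Q

    def pulse(self):
        out = self.next_out
        self.next_out ^= 1
        return out


class CountingNetwork1xN:
    """Binary tree of TFFs with one input and n outputs (n a power of two)."""

    def __init__(self, n):
        if n < 2 or n & (n - 1):
            raise ValueError("n must be a power of two and at least 2")
        self.stages = n.bit_length() - 1
        self.tffs = {}  # one TFF per tree node, keyed by (stage, path so far)

    def route(self):
        """Inject one pulse at the single input and return its output index."""
        path = 0
        for stage in range(self.stages):
            tff = self.tffs.setdefault((stage, path), TFF())
            # The decision of stage s becomes bit s of the output index.
            path |= tff.pulse() << stage
        return path


if __name__ == "__main__":
    net = CountingNetwork1xN(4)
    print([net.route() for _ in range(8)])
```

With this wiring, eight consecutive pulses into a 1×4 network leave through outputs 0, 1, 2, 3, 0, 1, 2, 3, matching the round-robin distribution described above.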

3 Alternative Balancer Implementation

Here we describe an alternative implementation of a balancer that has some favorable properties compared to a balancer based on TFFs. In particular, this new balancer can tolerate input pulses that are closer together in time and still function correctly, compared to the balancer based on TFFs. This property was not important for the current version of the NoC that we describe in our IPDPS paper, but it can be quite important for larger-scale NoC variations or other variations with a higher clock frequency.

This new version of the balancer has the same functionality: it takes input pulses and outputs them in a round robin fashion. However, this balancer has an additional reset port, which adds robustness in case the user wants to reset the counting network to the initial state. The feedback loops connected to the two outputs allow the RSFQ pulse to configure the routing path. For example, when an RSFQ pulse exits Y0, it will be fed back to the inhibit block, causing the path to be “inhibited” or closed, allowing the next RSFQ pulse to exit only out of Y1. The downside of this balancer compared to the TFF-based one is that it uses more Josephson junctions (JJs). The balancer is shown in figure 5.

Figure 5: Balancer circuit using feedback to configure the I/O path. The dummy merger with a “No Connection” input is placed to ensure path balancing. The reset configuration sets up the inhibitor blocks to “inhibit” an incoming RSFQ pulse. The 3xJTL makes sure that a delayed version of the RSFQ pulse arrives at the signal input of the inhibit to “clear” the inhibitor block at the top. This ensures that the first RSFQ pulse will arrive at Y0.

Figure 6: 4×4 re-settable counting network with 6 balancers. Just like the TFF 4×4, RSFQ pulses will be equally distributed in a round robin fashion starting with the top output. Resetting the counting network will restart the “counting” at the top output.
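The effect of the reset port can be illustrated with a behavioral Python sketch. It models only the observable behavior (alternating outputs, with a reset that forces the next pulse to Y0), not the inhibit blocks, JTLs, or timing of the circuit in figure 5. The class names, the single-input 1×4 arrangement of three balancers, and the output numbering are assumptions made for this example.

```python
class ResettableBalancer:
    """Alternates pulses between Y0 (0) and Y1 (1); reset makes Y0 the next exit."""

    def __init__(self):
        self.y0_inhibited = False

    def pulse(self):
        if self.y0_inhibited:
            self.y0_inhibited = False  # pulse exits Y1, the Y0 path reopens
            return 1
        self.y0_inhibited = True       # pulse exits Y0, feedback closes that path
        return 0

    def reset(self):
        self.y0_inhibited = False


class Resettable1x4:
    """Three balancers in a tree; only one input of each balancer is modeled."""

    def __init__(self):
        self.root = ResettableBalancer()
        self.second = [ResettableBalancer(), ResettableBalancer()]

    def route(self):
        first = self.root.pulse()
        return first + 2 * self.second[first].pulse()

    def reset(self):
        for balancer in (self.root, *self.second):
            balancer.reset()


if __name__ == "__main__":
    net = Resettable1x4()
    before = [net.route() for _ in range(3)]
    net.reset()
    after = [net.route() for _ in range(4)]
    print(before, after)
```

In this model, a reset issued partway through a round restarts the distribution at the top output, which is the behavior the reset port is meant to provide.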
As a summary, figure 6 shows a simplified view of a 1×4 counting network, where each pair of dots connected by a vertical line represents the input/output ports of a single balancer. In this picture, it does not matter which of the two versions of the balancer we have discussed is used for the network. However, one of the advantages of using the second variation of the balancer is that it comes with reset capabilities. In a sense, when you reset the balancers in the counting network, you “restart” the counting process.

1×N counting networks are possible to implement with the second variation of the balancer with the reset. However, one minor inconvenience is that, due to the balancer’s design, there will always be a dangling input port which may or may not be used, unintentionally creating a 2×N counting network. Essentially, in this case a 1×N counting network with the second, resettable balancer is almost the same as an N×N counting network, just with the unwanted outputs left unconnected. In contrast, a 1×N counting network with TFF-based balancers is able to remove some paths and balancers, resulting in a smaller counting network.

Counting Network Source: James Aspnes, Maurice Herlihy, and Nir Shavit. 1994. Counting networks. J. ACM 41, 5 (Sept. 1994), 1020–1048. DOI:https://doi.org/10.1145/185675.185815

Design and Implementation of a Shuttle Flux based Shift Register (using MIT-LL SFQ5ee Technology)

Description:

I. Prior Arts

Information storage and data processing using magnetic vortices is based on the fact that magnetic flux penetrates a superconducting circuit in the form of a limited number of flux quanta. An information bit is represented by a flux quantum vortex, whose position can be shifted by applying a current or a magnetic field.

As shown in Fig. 1 and described in [1], the combination of the current Ix and the loop current moves the vortex, advancing its phase Φ4 so that it increases through 2π. The input voltage pulse decreases the flux at L3 and increases that at L4, allowing the vortex to transfer to the next loop. The pulse Ix that induces the shift must last for a time Δt long enough to accomplish the shift.

This circuit has potential advantages for high-speed applications and offers low energy dissipation during switching. In any case, the potential advantage in speed alone over more conventional shift-register techniques would seem to make the flux shuttle worthy of further investigation [1].

Fig. 1. Lumped-circuit model of the discrete-junction flux shuttle containing a trapped vortex

An application of the shuttle flux to shift registers was demonstrated in [2] and is shown in Figure 2 below. The circuit consists of three interferometers, with three loops per bit. The input clock currents determine the shift direction. The flux quantum drives a circulating current Iring in a storage loop. The Josephson junction shifts the phase by ΔΦ = 2π.

Fig. 2. Circuit diagram of a flux shift register cell with write and readout gate [2]

Another implementation of a flux shift register was described in [3] and is shown in Figure 3. The circuit comprises two loops with a master and a slave junction. The clock line is magnetically coupled to a master inductance, and junction Jm is biased by a current Ibm to determine the shift direction.

Fig 3. Equivalent circuit of the shift register [3]

II. Claim: Design description

Our design of a shuttle flux shift register extends prior art [1-3] and is described below:

Design implementation:

1. Our shift register consists of these four stages:
a. Input stage - the input DC pulse goes to a DC-to-SFQ circuit to be converted to SFQ pulses.
b. Shifting stage - consists of three magnetically coupled interferometers. Rs denotes the bias resistors connected to the input pulses. There are three input ports for the three clock currents that determine the shifting.
c. Readout stage - a JTL (Josephson transmission line) is used for signal amplification at the readout to deliver the output pulse.
d. Terminating stage - consists of a coupled inductor, a JJ, and Rt, a feedback coupling resistance at the last junction.

2. Shifting mechanism - the three input clock pulses can be tuned to the desired shifting interval.

3. The cell / shifting stage is adopted from ref. [2], with a terminating stage added in place of the terminating resistor in each of the coupled inductors.

Fig 4. Shift register circuit schematic

4. Scalability - To demonstrate scalability, in reference to [2], we adopted the master-slave implementation (fig. 5) to extend our design to multiple shifting stages, as illustrated in fig. 7 and fig. 8. We implemented the interval per input pulse, so no extra biasing is needed to shift the phase.

Figure 5. Master - Slave implementation

To implement multiple stages, a set of interferometers is connected in series to the last junction of the first shift register cell, and the coupled clock inductors are connected in series to the clock inductors, which are also magnetically coupled with mutual inductance k. The time delay between stages is configurable individually for each stage. The delays of the clock pulses on the three clock inputs set the time interval between shifts (i.e., the delay of each stage). The maximum interval is the same for each pulse. The time interval for clock inputs 2 and 3 is the mid-time delay. Fig. 6 shows the simulation of a single-stage shift register; the clock and shifting delays are illustrated in the graph below. The maximum delay is set to 200 ps per stage, with a 100 ps mid-shift delay per two junctions.

Clock input setup:
Pulse_1 100 pulse(0 100m 1p 10p 10p 10p 200p)
Pulse_2 200 pulse(0 100m 100p 10p 10p 10p 200p)
Pulse_3 300 pulse(0 100m 200p 10p 10p 10p 200p)
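To make the clock timing concrete, the three pulse definitions can be evaluated numerically. The sketch below assumes the arguments follow the common SPICE ordering pulse(V1 V2 TD TR TF PW PER), that is, initial value, pulsed value, delay, rise time, fall time, pulse width, and period; the source does not state the ordering, so this is an assumption. Times are in ps and amplitudes in units of 1e-3 (the "100m" of the setup). The names `spice_pulse`, `CLOCK_DELAYS_PS`, `clock_value`, and `first_peaks` are hypothetical.

```python
def spice_pulse(t, v1, v2, td, tr, tf, pw, per):
    """Value at time t of a periodic trapezoidal pulse (assumed SPICE ordering)."""
    if t < td:
        return v1
    tau = (t - td) % per          # position inside the current period
    if tau < tr:                  # rising edge
        return v1 + (v2 - v1) * tau / tr
    if tau < tr + pw:             # flat top
        return v2
    if tau < tr + pw + tf:        # falling edge
        return v2 + (v1 - v2) * (tau - tr - pw) / tf
    return v1


# Delays (TD) of the three clock inputs, taken from the setup above, in ps.
CLOCK_DELAYS_PS = {"Pulse_1": 1, "Pulse_2": 100, "Pulse_3": 200}


def clock_value(name, t_ps):
    # 0 -> 100 (x 1e-3), 10 ps rise, 10 ps fall, 10 ps width, 200 ps period
    return spice_pulse(t_ps, 0, 100, CLOCK_DELAYS_PS[name], 10, 10, 10, 200)


def first_peaks(name, count=3):
    """Times (ps) at which the clock first reaches its flat top in each period."""
    td = CLOCK_DELAYS_PS[name]
    return [td + 10 + k * 200 for k in range(count)]


for name in CLOCK_DELAYS_PS:
    print(name, first_peaks(name))
```

Under these assumptions, successive clocks peak roughly 100 ps apart and each clock repeats every 200 ps, which matches the 100 ps mid-shift delay and the 200 ps per-stage delay stated in the text.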

Figure 6. Simulation of a single stage shift register

Fig. 7. Two stage Shift register with master and slave

Fig. 8. Three stage Shift register

III. Application: Controlled buffering

The designed shift register is used to implement controlled buffering for our network-on-chip (NoC) design, as illustrated in fig. 6. In our IPDPS paper we did not show any buffering because it is assumed to be at the network edges, which is outside the scope of that paper. That is, packets may have to be buffered at network boundaries until a connection to their desired destination is made. Because packets in our NoC are a collection of pulses interpreted in race logic (i.e., they are in the temporal domain, so their value depends on their time of arrival), many existing RSFQ buffers are not suitable, because they would lose a pulse's delay relative to the beginning of an epoch. Alternative designs to our shift register, such as a series of D flip-flops (DFFs), are possible but would use a large number of Josephson junctions (JJs). This is why our buffer is a useful addition to our NoC: it provides the capability to buffer (store) temporally interpreted pulses with a low number of JJs.
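A short numerical illustration shows why a buffer must preserve each pulse's delay relative to the start of the epoch. This is a sketch only: the slot width `SLOT_PS`, the clock period `CLOCK_PS`, and the helpers `encode` and `decode` are hypothetical, and the "re-timing" buffer is an illustrative stand-in for a buffer that does not preserve arrival times, not a model of any specific RSFQ cell.

```python
SLOT_PS = 10     # ASSUMPTION: width of one race-logic time slot, in ps
CLOCK_PS = 200   # ASSUMPTION: period of a clock that a re-timing buffer snaps to


def encode(value, epoch_start_ps):
    """Race logic: a value is the pulse's delay after the start of the epoch."""
    return epoch_start_ps + value * SLOT_PS


def decode(arrival_ps, epoch_start_ps):
    return (arrival_ps - epoch_start_ps) // SLOT_PS


epoch_start = 1000
packet = [encode(v, epoch_start) for v in (3, 0, 7)]   # arrival times in ps

# A buffer that delays every pulse (and the epoch reference) by the same
# amount keeps the encoded values intact.
delayed = [t + 400 for t in packet]
preserved = [decode(t, epoch_start + 400) for t in delayed]

# A buffer that releases each pulse on the next clock edge destroys the
# relative delays, so distinct values can no longer be told apart.
retimed = [-(-t // CLOCK_PS) * CLOCK_PS for t in packet]
lost = [decode(t, epoch_start) for t in retimed]

print(preserved, lost)
```

In this example the uniformly delayed packet still decodes to 3, 0, 7, while the re-timed packet decodes the first and third pulses to the same value.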

The flexibility in controlling the input clock voltages for shifting direction is an advantage for our application as we implement temporal timing using race logic.

In the diagram below, "shift register" refers to the circuit described previously. We added two non-destructive readout elements (NDROs) to make this a usable buffer in the context of a NoC. In particular, the NDRO at the right receives a select (set) when the shift buffer should drain and let out the packet it contains. In that case, the bottom NDRO receives a reset. Any pulses therefore exit the shift buffer from left to right while maintaining their race logic timing. In contrast, when we want the packet to remain in the shift buffer, we reset the NDRO on the right and select the bottom NDRO. In that case, the shift buffer is essentially part of a loop that the packet keeps going around.

Fig. 9. Controlled buffer block diagram
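The hold and drain behavior of the controlled buffer can be sketched as a behavioral model. Everything here is an assumption for illustration: the class name `ControlledBuffer`, the `pass_delay_ps` parameter (the delay of one trip through the shift register), and the simplification that a whole packet moves through the register in one step. The model only captures which NDRO is set and that all pulses receive the same delay per pass.

```python
from dataclasses import dataclass


@dataclass
class ControlledBuffer:
    pass_delay_ps: int             # ASSUMPTION: delay of one pass through the shift register
    right_ndro_set: bool = False   # right NDRO set -> pulses leave the buffer
    bottom_ndro_set: bool = True   # bottom NDRO set -> pulses recirculate

    def hold(self):
        # Keep the packet: reset the right NDRO, select the bottom NDRO.
        self.right_ndro_set = False
        self.bottom_ndro_set = True

    def drain(self):
        # Release the packet: select the right NDRO, reset the bottom NDRO.
        self.right_ndro_set = True
        self.bottom_ndro_set = False

    def one_pass(self, pulses_ps):
        """Move the packet once through the shift register.

        Returns (pulses still stored, pulses that exited), as times in ps.
        """
        shifted = [t + self.pass_delay_ps for t in pulses_ps]
        if self.right_ndro_set:
            return [], shifted
        return shifted, []


buf = ControlledBuffer(pass_delay_ps=200)
packet = [0, 30, 75]                   # pulse times within the packet, in ps

buf.hold()
stored, out = buf.one_pass(packet)     # first trip around the loop
stored, out = buf.one_pass(stored)     # second trip around the loop
buf.drain()
stored, out = buf.one_pass(stored)     # packet exits to the right

offsets = [t - out[0] for t in out]    # spacing between the pulses
print(out, offsets)
```

Because every pulse accumulates the same delay on each pass, the spacing between pulses on exit equals the spacing on entry, which is the property the race logic interpretation relies on.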

IV. Summary of our modifications from related literature

a. From ref. [1], we adopted the concept of magnetic flux quanta around the vortex.
b. From ref. [2], we adopted the three interferometers to implement the shift register cell, but we did not use the write and readout parts. We implemented the write part using a DC/SFQ block and used a JTL for the readout.
c. The use of a JTL for the readout also helps with signal amplification.
d. Unlike ref. [2], we did not use a terminating resistor for each input pulse at the coupled inductor; instead, we terminate it directly to ground and added a terminating stage, which is a coupled inductor pair and a JJ with a feedback resistor.
e. From ref. [3], we adopted the master-slave arrangement to implement a multiple-stage shift register; this demonstrates the scalability of our circuit.
f. Ref. [3] used several biases to implement the shifting; in our design, the three controlled input voltages are used for shifting, so we simplify our circuit by not using any additional biases.
g. Additionally, we used NDROs (figure 8) to implement a buffer (storage) for packets as pulses in race logic for a NoC. NDRO cells appear in the literature, but the way we connect them to form a NoC buffer is part of the claim.

References:

1. T. A. Fulton, R. C. Dynes and P. W. Anderson, "The flux shuttle - A Josephson junction shift register employing single flux quanta," in Proceedings of the IEEE, vol. 61, no. 1, pp. 28-35, Jan. 1973, doi: 10.1109/PROC.1973.8966.

2. R. Lochschmied, R. Herwig, M. Neuhaus and W. Jutzi, "A low power 12 bit flux shuttle shift register with Nb technology," in IEEE Transactions on Applied Superconductivity, vol. 7, no. 2, pp. 2983-2986, June 1997, doi: 10.1109/77.621945.

3. R. Koch, T. Scherer, M. Winter and W. Jutzi, "A 4 bit YBa2Cu3O7-δ bicrystal Josephson junctions flux shuttle shift register," in IEEE Transactions on Applied Superconductivity, vol. 7, no. 2, pp. 3646-3649, June 1997, doi: 10.1109/77.622208.