The clock is the essence of a synchronous digital system. Physically, the clock is distributed from an external pad to all similarly clocked synchronizing elements through a distribution network that encompasses combinational logic and interconnects. It serves to unify the physical and temporal design abstractions by determining the precise instants in time at which the digital machine changes state. Because the clock reaches every synchronizing element, optimization of the clock signal can have a significant impact on the chip's cycle time, especially in high-performance designs. Non-optimal clock behavior is caused by two phenomena: the routing to the chip's synchronizing elements, and the asymmetric behavior of the clock distribution logic.
Several authors have contributed previous work in clock optimization. H-trees have been recognized for years as a technique to help reduce the skew in synchronous systems [FK82] [KGE82] [DFW84] [BWM86]. For regular structures such as systolic arrays the H-tree works well to reduce skew because the synchronizing elements are distributed in a uniform pattern throughout the chip. However, for general design styles, nonuniform distributions of clock pins are common and the H-tree becomes ineffective as a technique for clock routing. The large size of the clock net has led some researchers [DFW84] [Mij87] to perform buffer optimization within the clock distribution tree. [BWM86] have provided an analysis of the clock tree that considers the transmission line properties of the interconnects. [BBB +89] have presented an approach for ASIC clock distribution that integrates buffer optimization into place and route algorithms. However, in all previous work the routing of the clock net is performed using ordinary global routing tools based on minimum spanning or approximate minimum Steiner tree net models, and with detailed routers that have little understanding of clock routing problems. This causes non-optimal clock behavior, and as region size or the number of pins in the clock net increases, the detrimental behavior is exacerbated. In this paper, we focus on routing techniques for optimizing clock signals in VLSI circuits. We demonstrate the superiority of our algorithm over standard routing techniques for widely varying region size, clock pin distributions, numbers of clock pins and technology feature size.
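The core metric behind these objections can be made concrete. Under a simple linear-delay approximation (an assumption for illustration; the cited work also considers RC and transmission-line models), skew is the difference between the longest and shortest source-to-sink path lengths in the routing tree. The sketch below contrasts a chain-like spanning-tree route with a balanced H-tree-style route on four sinks:

```python
def sink_delays(tree, source):
    """Traverse a routing tree, accumulating wirelength from the clock
    source to every node.  tree: dict node -> list of (child, length)."""
    delays = {source: 0}
    stack = [source]
    while stack:
        node = stack.pop()
        for child, length in tree.get(node, []):
            delays[child] = delays[node] + length
            stack.append(child)
    return delays

def skew(tree, source, sinks):
    """Skew = longest minus shortest source-to-sink path length
    (linear-delay approximation)."""
    d = sink_delays(tree, source)
    arrival = [d[s] for s in sinks]
    return max(arrival) - min(arrival)

# A chain-like (spanning-tree style) route: path lengths 1, 2, 3, 4.
chain = {"src": [("a", 1)], "a": [("b", 1)], "b": [("c", 1)], "c": [("d", 1)]}
print(skew(chain, "src", ["a", "b", "c", "d"]))      # 3

# A balanced (H-tree style) route: every sink is 2 units from the source.
balanced = {"src": [("l", 1), ("r", 1)],
            "l": [("a", 1), ("b", 1)], "r": [("c", 1), ("d", 1)]}
print(skew(balanced, "src", ["a", "b", "c", "d"]))   # 0
```

The chain minimizes total wirelength but has skew 3, while the balanced tree uses more wire to drive the skew to zero; clock-aware routing navigates exactly this tradeoff.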
In section two the preliminaries necessary for understanding the paper are presented. Following this, in section three the problem is defined. Section four illustrates the algorithm for clock routing and section five discusses theoretical results. Next, in section six the experimental results are presented, and in section seven possible avenues for future work and conclusions regarding the approach are discussed.
A communication network constructed from the proposed components is modeled as a set of nodes (components) connected by bidirectional communication links. Because of technological constraints, the total I/O bandwidth of each node is limited to some fixed value, and assumed to be equally divided among the attached links. Increasing the number of links per component leads to a reduction in the average number of hops between nodes, but at the cost of reduced link bandwidth. This "hop count/link bandwidth" tradeoff is examined in great detail through M/M/1 queueing models and simulations using traffic loads generated by parallel application programs. These results indicate that a small number of links should be used. It is also found that a significant improvement in performance is obtained if a component is allowed to immediately begin forwarding a message when the selected output link becomes idle, regardless of whether or not the end of the message has arrived. Finally, mechanisms which efficiently transmit a single message to multiple destinations are seen to have a significant impact on performance in programs relying on global information.
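The shape of the hop count/link bandwidth tradeoff can be sketched with the M/M/1 formula W = 1/(mu - lambda). The numbers below (node bandwidth, hop counts, load) are purely hypothetical stand-ins, not figures from the study, but they reproduce the qualitative conclusion that a small number of fast links beats many slow ones:

```python
def mm1_delay(service_rate, arrival_rate):
    """Mean sojourn time of an M/M/1 queue: W = 1 / (mu - lambda)."""
    assert arrival_rate < service_rate, "queue must be stable"
    return 1.0 / (service_rate - arrival_rate)

def end_to_end_delay(total_bw, num_links, avg_hops, arrival_rate):
    """Fixed node I/O bandwidth split equally over num_links links;
    each hop is modeled as an independent M/M/1 queue."""
    per_link_rate = total_bw / num_links   # fewer links => faster links
    return avg_hops * mm1_delay(per_link_rate, arrival_rate)

# Hypothetical network: unit total bandwidth, per-link load 0.05,
# average hop count shrinking as the number of links grows.
for links, hops in [(2, 8.0), (4, 4.0), (8, 3.0)]:
    print(links, round(end_to_end_delay(1.0, links, hops, 0.05), 2))
# At this load the 2-link configuration gives the lowest delay:
# the loss of per-link bandwidth outweighs the saved hops.
```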
The complexity of the circuitry required to implement a communication component is examined. Schemes for providing hardware support for communication functions--routing, buffer management, and flow control--are presented. Estimates of the number of buffers and the degree of multiplexing on each communication link are determined. The amount of circuitry needed to implement a communication component is computed, and it is seen that the proposed communication component could be implemented with technology available today. Design recommendations for the implementation of such a component are made.
The system's novel features include a fast channel decomposer, an obstacle-avoiding global router, and an obstacle-avoiding switching router. The router's channel decomposition algorithm relies on a corner-stitched data structure to efficiently produce a small number of large channels. The global router considers obstacles during path generation, trading off net length and channel complexity to simplify the subsequent channel routing task. While able to cope with obstacles, Magic's switchbox router is still comparable to the best traditional (non-obstacle-avoiding) channel routers.
The router's obstacle-avoidance features rely on two underlying concepts: (1) a preferred direction for crossing an obstacle, and (2) hazards, or areas the routing should avoid. Crossing obstacles in the preferred direction minimizes the creation of blocked areas, which cannot be crossed by other routing; this minimizes obstacles' impact on the automatic routing. Crossings in preferred directions are controlled by strategically-placed hazards adjacent to obstacles.
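The hazard concept can be sketched as a cost penalty in a shortest-path grid router. This is not Magic's actual algorithm, only a minimal illustration: hazard cells remain passable, but their extra cost steers routes around them, just as hazards placed next to obstacles steer the real router:

```python
import heapq

def route(grid, start, goal, hazard_cost=5):
    """Dijkstra shortest path on a grid.  grid[y][x] is '.' (free),
    '#' (blocked), or 'h' (hazard: passable but penalized, so routes
    avoid it unless no cheap detour exists)."""
    rows, cols = len(grid), len(grid[0])
    dist, prev = {start: 0}, {}
    heap = [(0, start)]
    while heap:
        d, (x, y) = heapq.heappop(heap)
        if (x, y) == goal:
            break
        if d > dist.get((x, y), float("inf")):
            continue          # stale heap entry
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < cols and 0 <= ny < rows and grid[ny][nx] != "#":
                nd = d + 1 + (hazard_cost if grid[ny][nx] == "h" else 0)
                if nd < dist.get((nx, ny), float("inf")):
                    dist[(nx, ny)] = nd
                    prev[(nx, ny)] = (x, y)
                    heapq.heappush(heap, (nd, (nx, ny)))
    path, node = [goal], goal
    while node != start:      # walk predecessors back to the start
        node = prev[node]
        path.append(node)
    return path[::-1]

grid = [".hh.",   # hazard region above an obstacle
        ".##.",   # the obstacle itself
        "...."]
p = route(grid, (0, 0), (3, 0))   # detours below the obstacle,
                                  # leaving the hazard cells untouched
```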
Measurements show that obstacle-avoiding routing is both useful and practical: hand-routing improves the electrical characteristics of the selected nets, while the hand-routing obstacles have only minor effects upon the routing quality of a design as a whole. The improvement in electrical characteristics is due to the decreased net length and increased attention to layer selection possible when nets are prewired by hand.
Fast Packet Switches for Asynchronous Transfer Mode will form the basis for implementing the Broadband Integrated Services Digital Network of the future. This thesis presents the preliminary design of a switch chip that implements a novel flow control algorithm guaranteeing fair allocation and full utilization of bandwidth even under congestion. The switch chip is intended to be connected via its bit-parallel links to other identical switch chips and various interface chips, so that it can be used as a building block for constructing composite switches of arbitrary topology and size. Its architecture is scalable up to a dozen multi-Gbit/sec links, although the first prototype is designed for only 4 bidirectional links of 400 Mbits/sec per link in each direction. The organization of the on-chip buffer memory, along with the back-pressure flow control and the weighted round robin cell scheduling mechanisms that the chip implements in hardware, provides the network manager with a set of powerful tools for tuning traffic flows and guaranteeing service performance. Full bandwidth utilization is achieved by providing dedicated buffer space to some classes of Virtual Circuits, and communication latency is reduced by using ``virtual cut-through''.
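The weighted round robin discipline the chip implements in hardware can be illustrated in a few lines of software. This is a generic WRR sketch, not the chip's actual scheduler: within one frame each backlogged flow may send up to its weight in cells, so bandwidth is divided in proportion to the configured weights:

```python
def wrr_frame(queues, weights):
    """One frame of weighted round-robin cell scheduling.  queues maps
    flow -> list of waiting cells; weights maps flow -> cells per frame.
    Returns the transmission order for this frame."""
    order, credits = [], dict(weights)
    progressed = True
    while progressed:
        progressed = False
        for flow in weights:          # at most one cell per flow per round
            if credits[flow] > 0 and queues[flow]:
                order.append(queues[flow].pop(0))
                credits[flow] -= 1
                progressed = True
    return order

# Flow A is configured for 3x the bandwidth of flow B.
queues = {"A": ["a1", "a2", "a3", "a4"], "B": ["b1", "b2"]}
sched = wrr_frame(queues, {"A": 3, "B": 1})
# sched interleaves A and B, giving A three slots to B's one.
```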
The realization of this architecture in a 1\ \(*mm CMOS technology has been studied. We present the circuits that are crucial for the size and speed of the chip. These crucial circuits are primarily: the buffer memory, the input-output buffers, the content-addressable memory used to select the next VC to be serviced, the circuit that schedules accesses to the shared buffer, and the chip routing tables. We present the design and layout of the above circuits and the results of their simulation as evidence that the proposed architecture is realizable using modern VLSI technology.
Keywords: volume rendering, volume visualization, parallel processing
Volume visualization deals with the representation, manipulation and rendering of 3D volume data. A typical volume data set is 512 x 512 x 512 voxels with 8 bits of information at each voxel. The 3D volume data is projected along a viewing direction to produce 2D images. These images have to be generated in real time (30 frames/sec); hence, a special-purpose architecture is required to process such a large amount of data. In this paper, we propose an optimal architecture for the viewing processor for 3D volume visualization, based on the existing Cube Architecture. To facilitate conflict-free access to voxel values, a new memory organization scheme is proposed. This architecture uses O(n log n) processing units to project an image of n^3 voxels in O(n^2/log n) time, which is optimal as the product of time and number of processors is equal to O(n^3). We have implemented part of a 16^3 system as a VLSI chip, which suggests that the proposed architecture achieves real-time projection of 3D volume data for a 512^3 system. The same architecture can be extended for FTB or BTF compositing computations and weighted-sum computations in real time.
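The weighted-sum computation mentioned above is the simplest member of the projection family. A toy sequential sketch (the paper's architecture parallelizes this over O(n log n) units; FTB/BTF compositing would add per-voxel opacity handling) projects a volume along the z axis:

```python
def weighted_sum_projection(volume, weights):
    """Orthographic projection of an n^3 volume along the z axis: each
    output pixel is the weighted sum of the voxel values on its ray.
    volume is indexed volume[z][y][x]; weights has one entry per slice."""
    n = len(volume)
    return [[sum(weights[z] * volume[z][y][x] for z in range(n))
             for x in range(n)]
            for y in range(n)]

# A toy 2x2x2 volume of 8-bit voxel values; equal weights average the
# two z slices into a single 2x2 image.
vol = [[[10, 20], [30, 40]],
       [[50, 60], [70, 80]]]
img = weighted_sum_projection(vol, [0.5, 0.5])
```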
Keywords: NTP discovery, resource discovery, RD, server location, anycast, expanding ring broadcast, multicast triangulation, application layer, network probing, proactive gathering, server location techniques, reactive gathering, in routing layer, routing table polling, route probing, hop count probing
Submitted for publication
Abstract: "In this paper we consider the problem of choosing among a collection of replicated servers, focusing on the question of how to make choices that segregate client/server traffic according to network topology. We explore the cost and effectiveness of a variety of approaches, ranging from those requiring routing layer support (e.g., anycast) to those that build location databases using application-level probe tools like traceroute. We uncover a number of tradeoffs between effectiveness, network cost, ease of deployment, and portability across different types of networks. We performed our experiments using a simulation parameterized by a topology collected from 7 survey sites across the United States, exploring a global collection of Network Time Protocol servers."
Abstract: A probabilistic model for efficiently solving stochastic network reliability problems on a special class of networks (i.e., tree-based networks) is provided. Networks in this class are built from interconnection patterns consisting of complete binary trees to which spare edges are added according to different criteria. The probabilistic model is based on causal networks. We show that its use allows us to efficiently evaluate the average performance degradation (i.e., the average number of processing elements still functioning in the presence of random faults) of dynamically reconfigurable, fault-tolerant VLSI systems based on such tree-based architectures. Examples of applying the probabilistic model to compare several analyzed VLSI architectures are also provided.
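As a point of reference for the degradation metric, the bare tree without spare edges admits a closed form. In this simplified model (an assumption for illustration, not the paper's causal-network method) every node fails independently and a leaf PE is usable only if it and all of its ancestors survive:

```python
def expected_functioning_pes(depth, p_fail):
    """Expected number of functioning leaf PEs in a complete binary tree
    of the given depth, when each node fails independently with
    probability p_fail and a leaf is usable only if its whole path to
    the root is fault-free.  Spare edges, as in the architectures the
    paper analyzes, would raise this figure."""
    leaves = 2 ** depth
    return leaves * (1 - p_fail) ** (depth + 1)  # leaf + `depth` ancestors

# Depth-3 tree (8 leaves), 10% node failure rate:
avg = expected_functioning_pes(3, 0.1)   # 8 * 0.9^4
```

Deeper trees degrade faster because every leaf depends on a longer path of ancestors, which is exactly why redundancy is concentrated near the root in many fault-tolerant tree schemes.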
We develop a design process support model called the \fILight Weight Transaction\fR (LWT) model, which captures both the \fIstructured\fR and \fIexploratory\fR aspects of VLSI design. The former corresponds to the design procedures that are well-understood and thus can be specified in advance, whereas the latter denotes the creative part of a design process. We developed a script facility to support routine design activities and proposed a history-based rework mechanism to allow interactive exploration of the design space. Unlike conventional database transaction models, the LWT model is based on a data visibility abstraction: users can operate on a piece of data only when it is visible to them. We have shown how this can support both design exploration and cooperative group work.
To demonstrate the feasibility of the LWT model, we built a prototype implementation on top of the \fISprite\fR operating system, the Tcl/Tk facility, and the Berkeley OCT CAD tool suite. This implementation features a transparent load balancing scheme to exploit the computation power of networked workstations and an atomicity-guarantee mechanism to preserve the high-level task abstraction. In addition, the rework mechanism depends on a single-assignment update principle, which in turn could pose serious storage overhead. Our implementation alleviates this overhead by performing history-based object reclamation in the background.
Based on the design operation history, we propose a novel design management paradigm: rather than requiring users to supply design metadata, the system maintains and analyzes the design history to deduce the metadata, in particular object attributes and inter-object relationships, according to a suite of domain-specific knowledge and inference procedures. This paradigm can be viewed as a generalization of the approach used in syntax-directed editors; however, we believe this to be the first attempt to apply the idea in the context of design database management systems. Instead of abstract syntax trees, we use a special representation of the design history called the \fIaugmented derivation graph\fR as the basis for design metadata inference. This paradigm opens a new way of thinking about creating information that is of interest to its consumer, be it a user, an operating system, or a database system.
This report describes a paper design and evaluation, and thus serves two purposes: It describes one particular VLSI sorting circuit, and it also serves as a case study in VLSI design methodology. The algorithm is described, the overall chip organization and data flow are presented, and detailed circuits, layouts, and timing analyses are given.
As multiple processor systems become more widely available, applications involving multiple concurrent processes will increase in number and importance. This increased interdependency among processes poses interesting problems in the area of processor scheduling. How should the processes be scheduled to achieve some optimal level of performance? A scheduler based on an expert system may prove to be a viable alternative to those that have been proposed and (in some cases) implemented so far.
This report describes the implementation of a learning mechanism that attempts to handle the problem of processor scheduling in such a multiprocessor environment. In effect, the Intelligent Agent tries to "learn" its own set of heuristics for optimally scheduling a set of co-operating processes. By simulating a relatively simple multiprocessor system we examine the merits of such an approach.
Copyright 1994 by FORTH, Heraklio, Crete, Greece
The goals of this research are to provide answers to these questions, to design a Prolog system that automatically exploits parallelism in Prolog with low-overhead memory management and task management schemes, and to demonstrate by means of detailed simulations that such a Prolog system can indeed achieve a significant speedup over the fastest sequential Prolog systems.
We achieve these goals by first identifying the large sources of overhead in parallel Prolog execution: side-effects caused by parallel tasks, choicepoints created by parallel tasks, task creation, task scheduling, task suspension and context switching.
We then identify a form of parallelism, called flow parallelism, that can be exploited with low overhead because parallel execution is restricted to goals that do not cause side-effects and do not create choicepoints. We develop a master-slave model of parallel execution that eliminates task suspension and context switching. The model uses program partitioning and task scheduling techniques that do not require task suspension and context switching to prevent deadlock. We identify architectural techniques to support the parallel execution model and develop the Flow Parallel Prolog Machine (FPPM) architecture and implementation.
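The master-slave restriction can be sketched in miniature. The following is only a loose software analogy (a thread pool standing in for FPPM's slaves, and a hypothetical `run_goal` standing in for goal execution), but it shows why suspension machinery becomes unnecessary: every dispatched goal is independent and runs to completion.

```python
from concurrent.futures import ThreadPoolExecutor

def run_goal(goal):
    """Stand-in for one side-effect-free, choicepoint-free goal: it can
    never block on another task, so a slave runs it straight through
    and never needs to suspend or context-switch."""
    return goal * goal          # hypothetical unit of work

def master(goals, n_slaves=4):
    """The master hands independent goals to a fixed pool of slaves.
    Because no goal can wait on another, no suspension or
    deadlock-avoidance machinery is required."""
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        return list(pool.map(run_goal, goals))

results = master([1, 2, 3, 4])
```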
Finally, we evaluate the performance of FPPM and investigate the design tradeoffs using measurements on a detailed, register-transfer level simulator. FPPM achieves an average speedup of about a factor of 2 (as much as a factor of 5 for some programs) over the current highest performance sequential Prolog implementation, the VLSI-BAM. The speedups over other parallel Prolog systems are much larger.
Previous generation machines, such as the Caltech UPEs, mainly addressed the arithmetic computation needs of instrument simulation. The insights gained with these systems demonstrate the feasibility of the approach but also clearly indicate that single-chip systems cannot offer sufficient performance to simulate ensembles of instruments. To overcome this problem the MIMIC system includes a special purpose network managed by an on-chip unit.
Our machine relies heavily on the use of memory, both within our custom processing chips and in commercial memory chips, giving rise to the name Memory Intensive Music Integrated Circuit (MIMIC).
Trace-driven simulation shows that in-cache translation has better performance than many translation lookaside buffer designs. As cache memories grow larger, the advantage of in-cache translation will also increase. Other simulation results indicate that in-cache translation performs well over a wide range of cache configurations.
To further understand in-cache translation, we implemented it as a central feature of SPUR, a multiprocessor workstation developed at the University of California at Berkeley. A five-processor prototype convincingly demonstrates the feasibility of this new mechanism. In addition, a set of event counters embedded in the custom VLSI cache controller chip provides a means to measure cache performance. We use these counters to evaluate the performance of in-cache translation for 23 workloads. These measurements validate some of the key simulation results.
The measurements and simulations also show that the performance of in-cache translation is sensitive to page-table placement. We propose a variation of the algorithm, called inverted in-cache translation, which reduces this sensitivity. We also examine alternative ways to support reference and dirty bits in a virtual address cache. An analytic model and measurements from the prototype show that the SPUR mechanism has the best performance, but that emulating dirty bits with protection is not much worse and does not require additional hardware. Finally, we show that the miss bit approximation to reference bits, where the bit is only checked on cache misses, performs better than true reference bits, which require a partial cache flush.
In this research I was supported for three years by a General Electric doctoral fellowship. The RISC project was supported in part by ARPA Order No. 3803 and monitored by NESC #N00039-78-G-0013-0004.
In this dissertation, the nature of general-purpose computations is studied, showing the simplicity of the operations usually performed and the high frequency of operand accesses, many of which are made to the few local scalar variables of procedures. The architecture of the RISC I and II processors is presented. They feature simple instructions and a large multi-window register file, whose overlapping windows are used for holding the arguments and local scalar variables of the most recently activated procedures. In the framework of the RISC project, which has been a large team effort at UC Berkeley for more than three years, a RISC II nMOS single-chip processor was implemented, in collaboration with R. Sherburne. Its microarchitecture is described and evaluated, followed by a discussion of the debugging and testing methods used. Future VLSI technology will allow the integration of larger systems on a single chip. The effective utilization of the additional transistors is considered, and it is proposed that they should be used in implementing specially organized instruction fetch-and-sequence units and data caches.
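The overlapping-window idea can be sketched as a simple address mapping. The sizes below are illustrative, not RISC II's actual register counts; the essential property is that a caller's "out" registers and its callee's "in" registers name the same physical registers, so procedure arguments cross a call without being copied:

```python
# Illustrative register-window parameters (hypothetical sizes chosen
# for readability, not the dimensions of the RISC II register file).
GLOBALS = 8        # shared by every procedure
OVERLAP = 4        # caller's outs == callee's ins
LOCALS = 4
N_WINDOWS = 6
RING = N_WINDOWS * (LOCALS + OVERLAP)   # overlapped regs stored once

def phys_reg(window, logical):
    """Map a logical register (globals, then ins, locals, outs as seen
    by the procedure running in `window`) to a physical file index."""
    if logical < GLOBALS:
        return logical                  # globals bypass the windows
    offset = logical - GLOBALS          # 0 .. 2*OVERLAP + LOCALS - 1
    return GLOBALS + (window * (LOCALS + OVERLAP) + offset) % RING

# The caller in window 0 writes its outs; the callee in window 1 reads
# its ins -- the two logical names resolve to the same physical cells.
caller_outs = [phys_reg(0, GLOBALS + OVERLAP + LOCALS + i)
               for i in range(OVERLAP)]
callee_ins = [phys_reg(1, GLOBALS + i) for i in range(OVERLAP)]
```

The modulo arithmetic makes the windows a circular buffer: when calls nest deeper than `N_WINDOWS`, the oldest window must be spilled to memory, which is the overflow case the real hardware traps on.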
The architectural study and evaluation of RISC II, as well as its design, layout, and testing after fabrication, have shown the viability and the advantages of the RISC approach. The RISC II single-chip processor looks different from other popular commercial processors: it has fewer total transistors, it spends only 10% of the chip area on control rather than one half to two thirds, and it required about five times less design and layout effort to get chips that work correctly and at speed on first silicon. And, on top of all that, RISC II executes integer, high level language programs significantly faster than these other processors made in similar technologies.