SIKS of Unified Computer Science TR Index

Keywords: routing vlsi

Entries matching at least 1 keyword found: 214; displayed: 50


Matching 2 keywords:

U of California Berkeley, ERL erl-90-27(dir)
Jackson, Michael A.B.; Srinivasan, Arvind; Kuh, Ernest S. ``A Novel Approach to IC Performance Optimization by Clock Routing.'' UCB//ERL-90-27, April, 1990. 34 pages.
ABSTRACT
In today's highly competitive IC marketplace, company survival necessitates product differentiability. Product differentiability may be engendered in many ways, several of which include: increased performance (e.g. lower power, faster timing, etc...), lower cost, more features, or faster time to market. Thus, the motivation for design techniques that enhance chip timing performance are of fundamental importance to the IC community.

The clock is the essence of a synchronous digital system. Physically, the clock is distributed from an external pad to all similarly clocked synchronizing elements through a distribution network that encompasses combinational logic and interconnects. It serves to unify the physical and temporal design abstractions by determining the precise instants in time at which the digital machine changes state. Because the clock is important, optimization of the clock signal can have a significant impact on the chip's cycle time, especially in high-performance designs. Non-optimal clock behavior is caused by two phenomena: the routing to the chip's synchronizing elements, and the asymmetric behavior of the clock distribution logic.

Previous work in clock optimization has been contributed by several authors. H-trees have been recognized for years as a technique to help reduce the skew in synchronous systems [FK82] [KGE82] [DFW84] [BWM86]. For regular structures such as systolic arrays the H-tree works well to reduce skew because the synchronizing elements are distributed in a uniform pattern throughout the chip. However, for general design styles, nonuniform distributions of clock pins are common and the H-tree becomes ineffective as a technique for clock routing. The large size of the clock net has led some researchers [DFW84] [Mij87] to perform buffer optimization within the clock distribution tree. [BWM86] have provided an analysis of the clock tree that considers the transmission line properties of the interconnects. [BBB +89] have presented an approach for ASIC clock distribution that integrates buffer optimization into place and route algorithms. However, in all previous work the routing of the clock net is performed using ordinary global routing tools based on minimum spanning or approximate minimum Steiner tree net models and with detailed routers that have little understanding of clock routing problems. This causes non-optimal clock behavior and as region size or the number of pins in the clock net increases, the detrimental behavior is exacerbated. In this paper, we focus on routing techniques for optimizing clock signals in VLSI circuits. We demonstrate the superiority of our algorithm over standard routing techniques for widely varying region size, clock pin distributions, numbers of clock pins and technology feature size.

In section two the preliminaries necessary for understanding the paper are presented. Following this, in section three the problem is defined. Section four illustrates the algorithm for clock routing and section five discusses theoretical results. Next, in section six the experimental results are presented, and in section seven possible avenues for future work and conclusions regarding the approach are discussed.

ENTRY
Sep 14, 1994
RETRIEVAL
ocr (in all.ocr); tiff (in {001-034}.tif)

U of California Berkeley, CS csd-83-137(dir)
Fujimoto, Richard M. ``VLSI Communication Components for Multicomputer Networks.'' UCB//CSD-83-137, 234 pages.
ABSTRACT
Advances in microporecessor technology will soon make general purpose computing systems composed of thousands of VLSI processors economically feasible. A high-performance communication system to interconnect these processors is of crucial importance to exploit the parallelism inherent in application such as circuit simulation and signal processing. This thesis discusses issues in the design of universal VLSI communication components to be used as the building blocks for constructing robust, high-bandwidth, point-to-point networks. The component provide enough flexibility to serve a wide variety of multicomputer configurations and applications. They feature special purpose hardware to implement communication functions traditionally implemented with network software.

A communication network constructed from the proposed components is modeled as a set of nodes (components) connected by bidirectional communication links. Because of technological constraints, the total I/O bandwidth of each node is limited to some fixed value, and assumed to be equally divided among the attached links. Increasing the number of links per component leads to a reduction in the average number of hops between nodes, but at the cost of reduced link bandwidth. This "hop count/ link bandwidth" tradeoff is examined in great detail through M/M/1 queueing models and simulations using traffic loads generated by parallel application programs. These results indicate that a small number of links should be used. It is also found that a significant improvement in performance is obtained if a component is allowed to immediately begin forwarding a message when the selected output link becomes idle, regardless of whether or not the end of the message has arrived. Finally, mechanisms which efficiently transmit a single message to multiple destinations are seen to have a significant impact on performance in programs relying on global information.

The complexity of the circuitry required to implement a communication component is examined. Schemes for providing hardware support for communication functions--routing, buffer management, and flow control--are presented. Estimates of the number of buffers and the degree of multiplexing on each communication link are determined. The amount of circuitry to implement a communication component is computed, and it is seen that the proposed communication component could be complemented with technology available today. Design recommendations for the implementation of such a component are made.

ENTRY
July 9, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-234}.tif)

U of California Berkeley, CS csd-88-412(dir)
Srini, Vason P.; Tam, Jerric V.; Nguyen, Tam M.; Holmer, Bruce K.; Patt, Yale N.; Despain, Alvin M. ``Design and Implementation of A CMOS Chip for Prolog.'' UCB//CSD-88-412, 132 pages.
ABSTRACT
We have designed and fabricated a high performance VLSI chip for executing Prolog programs using a 1.4 micron CMOS technology with two layers of metal. This chip implements a tagged architecture with hardware support for five stacks.The 32-bit data path of the chip contains a fast ALU, 64 registers in four groups, five counters, and six non-master/slave registers. The control is microprogrammed and uses a 512 X 160 bit ROM with four pages for fast microbranching. The chip operates at a cycle time of 100 ns (worst case) and has a size of 10 mm X 9 mm. A semicustom design methodology employing Mentor and NCR tools has been used in design. The challenges involved in the design, verification, routing, and fabrication of the chip are described.
ENTRY
October 11, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-132}.tif)

U of California Berkeley, CS csd-86-291(dir)
Hamachi, Gordon Taro. ``An Obstacle-Avoiding Router for Custom VLSI.'' UCB//CSD-86-291, 130 pages.
ABSTRACT
Magic's automatic routing system combines the flexibility of hand-routing with the speed and quality of automatic channel routers, by allowing chip designers to prewire selected nets by hand. The router then works around this previously-placed layout, called obstacles, to automatically wire the remaining nets. This ability to partially hand-route an integrated circuit gives designers complete control over critical paths, power and ground routing, and other special netw. At the same time the router provides a fast way to make the remaining connections in the design.

The system's novel features include a fast channel decomposer, an obstacle-avoiding global router, and an obstacle-avoiding switching router. The router's channel decomposition algorithm relies on a corner-stitched data structure to efficiently produce a small number of large channels. The global router considers obstacles during path generation, trading-off net length and channel complexity to simplify the subsequent channel routing task. While able to cope with obstacles, Magic's switchbox router is still comparable to the best traditional (non-obstacle-avoiding) channel routers.

The router's obstacle-avoidance features rely on two underlying concepts: (1) a preferred direction for crossing an obstacle, and (2) hazards, or areas the routing should avoid. Crossing obstacles in the preferred direction minimizes the creation of blocked areas, which can not be crossed by other routing; this minimizes obstacles' impact on the automatic routing. Crossings in preferred directions are controlled by strategically-placed hazards adjacent to obstacles.

Measurements show that obstacle-avoiding routing is both useful and practical: hand-routing improves the electrical characteristics of the selected nets, while the hand-routing obstacles have only minor effects upon the routing quality of a design as a whole. The improvement in electrical characteristics is due to the decreased net length and increased attention to layer selection possible when nets are prewired by hand.

ENTRY
June 25, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-130}.tif)

U of California Berkeley, CS csd-92-689(dir)
Tzeng, Ping-San. ``Integrated Placement and Routing for VLSI Layout Synthesis and Optimization.'' UCB//CSD-92-689, 106 pages.
ABSTRACT
This dissertation investigates ways to integrate various VLSI layout algorithms via carefully designed integrated data structures. Such an integrated approach can achieve better overall results by iterating non-sequentially among the various algorithms in a demand-driven manner. The shared data strucure which is modified incrementally by all the different algorithms serves as an efficient communication medium between them This approach has resulted in several new prototype tools, including a new placement program that combines wire-length optimization with a new 2-D compaction algorithm, a new area-routing approach that employs hierarchical rip-up and reroute techniques in an integrated global and detailed routing environment, and also a system that integrates the area router with a placement adjustment algorithm. This integrated system can iterate automatically between area routing and placement adjustment phases to generate optimized results for macro-cell problems with over-the-cell routing.
ENTRY
August 6, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-106}.tif)

Inst for CS-FORTH 1991.TR25.Fast_packet_switches.ps.Z(254K)
Stefanos Sidiropoylos. ``Fast Packet Switches for Asynchronous Transfer Mode.'' Technical Report No. 25, Masters thesis, CSD-UCH. ICS-FORTH, Heraklion, Crete, Greece, August 1991.
Fast Packet Switches for Asynchronous Transfer Mode will form the basis for implementing the Broadband Intergated Services Digital Network of the future. This thesis presents the preliminary design of a switch chip that implements a novel flow control algorithm guaranteeing fair allocation and full utilization of bandwidth even under congestion. The switch chip is intented to be connected via its bit parallel links to other identical switch chip and various interface chips, so that it can be used as a building block for constructing composite switches of arbitrary topology and size. Its architecture is scalable up to a dozen multi-Gbit/sec links, although the first prototype is designed for only 4 bidirectional links of 400 Mbits/sec per link in each direction. The organization of the on-chip buffer memory, along with the back-pressure flow control, and the weighted round robin cell scheduling mechanisms that the chip implements in hardware, provide the network manager a set of powerful tools for tuning traffic flows and guaranteeing service performance. Full bandwidth utilization is achieved by providing dedicated buffer space to some classes of Virtual Circuits, and communication latency is reduced by using ``virtual cut-through''.

The realization of this architecture under a 1\ \(*mm CMOS technology has been studied. We present the circuits that are crucial for the size and speed of the chip. These crucial circuits are primarily: the buffer memory, the input-output buffers, the content addressable memory used to select the next VC to be serviced, the circuit that schedules the accesses to the shared buffer, and the chip routing tables. We present the design and layout of the above circuits and the results of their simulation as a proof of the fact that the proposed architecture is realizable using the modern VLSI technology

1991.TR25.Fast_packet_switches.ps.Z


Matching 1 keyword:

U of Warwick, CS 185(dir)
Gibbons, A.M. ``A Tutorial Introduction to Distributed Memory Models of Parallel Computation.'' Technical Report, UWARWICK//CS-RR-185, University of Warwick, Department of Computer Science. November 1992.
ABSTRACT
This is an introduction to distributed memory models of parallel computation. Communication networks that have been advocated for general purposes are described and the problem of parallel routing within these networks is addressed through the permutation routing paradigm. A number of basic techniques are briefly introduced for efficient implementation of P-RAM algorithms on distributed memory models of parallel computation. These include techniques which are dependent on network topology or algorithmic structure, such as graph embedding and the method of compress and iterate as well as techniques with less specific application such as the employment of hashing and the use of parallel slackness to hide network latency.
BIB-VERSION
CS-TR-v2.0
ENTRY
January 20, 1995
LANGUAGE
English
NOTES
This report has been published, and is not available online; a citation of the published version is available in BiBTeX format (citation.bib) by anonymous FTP from ftp://ftp.dcs.warwick.ac.uk/pub/reports/rr/185/

U of Alberta CS TR92-01(dir) TR92-01.abstract(symlink)
Zhiyong Liu and Xiaobo Li Routing Linear Complement Permutations on Hypercubes 1992 TR 92-01 hypercube, linear complement permutations, routing algorithms, conflict, complexity

U of Warwick, CS 164(dir)
Chown, Paul; Walton, D.W.; Nudd, Graham R. ``VLSI Design of a Pipelined CORDIC Processor.'' Technical Report, UWARWICK//CS-RR-164, University of Warwick, Department of Computer Science. October 1990.
ABSTRACT
In this report we discuss the VLSI realisation of a pipelined CORDIC arithmetic unit to perform stable matrix row operations for the solution of systems of linear equations. The algorithmic considerations of the CORDIC process are highlighted and a chip level architecture is derived from these to implement the algorithm in a pipelined manner. We then proceed to give details of the 2pm CMOS processor that has been designed to implement that architecture.
BIB-VERSION
CS-TR-v2.0
ENTRY
January 20, 1995
LANGUAGE
English

U of Central Florida CS 1993(dir) tr931006.ps.Z(140K)
Vineet Goel, Amar Mukherjee. ``An Optimal Volume Rendering Architecture.'' Symposium on Parallel Rendering 1993. San Jose, CA, October 1993.

Keywords: volume rendering volume visualization parallel processing

Volume visualization deals with representation, manipulation and rendering of 3D volume data. A typical volume data size is 512 x 512 x 512 voxels with 8-bits information at each voxel. The 3D volume data is projected along a viewing direction which produces 2D images. These images have to be generated in real time (30 frames/sec). Hence, a special purpose architecture is required to process such a large amount of data. In this paper, we propose an optimal architecture for the viewing processor for 3D volume visualization. This is based on the existing Cube Architecture. To facilitate conflict free access of the voxels values, a new memory organization scheme is proposed. This architecture uses O(nlog n) processing units to project an image of n^3 voxels in O(n^2/log n) time,which is optimal as the product of time and number of processors is equal to O(n). We have implemented a part of 16^3 system in VLSI chip which suggests, the proposed architecture achieves the real-time projection of 3D volume data for 512^3 system. The same architecture can be extended for FTB or BTF compositing computations and weighted sum computations in real time
tr931006.ps.Z

U of Colorado, Boulder CS ReplSvrLoc.ps.Z(109K) ReplSvrLoc.txt.Z(19K)
James D. Guyton, Michael F. Schwartz. ``Locating Nearby Copies of Replicated Internet Servers.'' Technical Report CU-CS-762-95, Department of Computer Science, University of Colorado, Boulder, Colorado,

Keywords: NTP discovery, resource discovery, RD, server location, anycast, expanding ring broadcast, multicast triangulation, application layer, network probing, proactive gathering, server location techniques, reactive gathering, in routing layer, routing table polling, route probing, hop count probing

Submitted for publication

Abstract: "In this paper we consider the problem of choosing among a collection of replicated servers, focusing on the question of how to make choices that segregate client/server traffic according to network topology. We explore the cost and effectiveness of a variety of approaches, ranging from those requiring routing layer support (e.g., anycast) to those that build location databases using application-level probe tools like traceroute. We uncover a number of tradeoffs between effectiveness, network cost, ease of deployment, and portability across different types of networks. We performed our experiments using a simulation parameterized by a topology collected from 7 survey sites across the United States, exploring a global collection of Network Time Protocol servers."

U of Bologna CS tree-reliability.ps.gz(symlink)
Report: UBLCS-93-13 Title: Reliability Analysis of Tree-Based Structures and its Application to Fault-Tolerant VLSI Systems Authors: Marco Roccetti Date: June 1993 Length: 46 pages Availability: anonymous FTP from ftp.cs.unibo.it Directory: /pub/TR/UBLCS File: tree-reliability.ps.gz

Abstract: A probabilistic model, to solve efficiently stochastic network reliability problems, for networks of a "special class" (i.e., tree-based networks),is provided. The networks, belonging to this special class,are based on interconnection patterns consisting of complete binary trees in which spare edges are added, according to different criteria. The probabilistic model, to solve stochastic network reliability problems, for the aforementioned networks, is based on causal networks. We show that the use of this probabilistic model allows us to evaluate efficiently the average performance degradation (i.e., the average number of processing elements, still functioning, in the presence of random faults) of dynamically reconfigurable, fault-tolerant VLSI systems, which are based on such tree-based architectures. Examples of application of the probabilistic model are also provided to compare several different analyzed VLSI architectures.

U of California Berkeley, CS csd-93-724(dir)
Chiueh, Tzi-cker F. ``Papyrus: A History-Based VLSI Design Process Management System.'' UCB//CSD-93-724, 144 pages.
ABSTRACT
With the advent of powerful computer-aided-design (CAD) tools and increasingly complicated VLSI systems, the notion of circuit design has evolved into managing complexity rather than manipulating electronic devices. Complexity arises from enormous amounts of design data as well as the complicated process of creating design data. An important observation is that modern circuit designers spend more time managing the created design data than actually running the CAD tools. The thesis of our work is: the key to further enhance VLSI designers' productivity is not better CAD tools but a more responsive infrastructure. The focus of this dissertation is thus on support mechanisms that allow composition of a set of potentially heterogeneous tools into a coherent design system, and facilitate the integration of design data and design process management.

We develop a design process support model called the \fILight Weight Transaction\fR (LWT) model, which captures both the \fIstructured\fR and \fIexploratory\fR aspects of VLSI design. The former corresponds to the design procedures that are well-understood and thus can be specified in advance, whereas the latter denotes the creative part of a design process. We developed a script facility to support routine design activities and proposed a history-based rework mechanism to allow interactive process management.

We develop a design process support model called the \fILight Weight Transaction\fR (LWT) model, which captures both the \fIstructured\fR and \fIexploratory\fR aspects of VLSI design. The former corresponds to the design procedures that are well-understood and thus can be specified in advance, whereas the latter denotes the creative part of a design process. We developed a script facility to support routine design activities and proposed a history-based rework mechanism to allow interactive exploration of the design space. Unlike conventional database transaction models, the LWT model is based on a data visibility abstraction: users can operate on a piece of data only when it is visible to them. We have shown how this can support both design exploration and cooperative group work.

To demonstrate the feasibility of the LWT model, we built a prototype implementation on top of the \fISprite\fR operating system, the Tcl/Tk facility, and the Berkeley OCT CAD tool suite. This implementation features a transparent load balancing scheme to exploit the computation power of networked workstations, an atomicity-guarantee mechanism to preserve the high-level task abstraction. In addition, the rework mechanism depends on a single assignment update principle, which in turn could pose serious storage overheads. Our implementation alleviates this overhead by performing a history-based object reclamation in the background.

Based on the design operation history, we propose a novel design management paradigm: Rather than requiring users to supply design meta-data, the system maintains and analyzes the design history to deduce the metadata, in particular, object attributes and inter-object relationships, according to a suite of domain-specific knowledge and inference procedures. This paradigm can be viewed as a generalization of the approach used in syntax-directed editors. However, we believe this to be the first attempt to apply the idea in the context of design database management systems. Instead of using abstract syntax trees, we use a special representation of the design history called \fIaugmented derivation graph\fR as the basis for design metadata inference. This paradigm opens a new way of thinking about creating information that are interesting to the system, be that a user, an operating system, or a database system.

ENTRY
February 25, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-144}.tif)

Princeton U CS 484.ps.Z(237K)
TR-484-95 General Parallel Computation without CPUs: VLSI Realization of a Particle Machine

U of Massachusetts Amherst, CoInS documents(dir)
A Dual Optimization Approach for Distributed Minimum Delay Routing With Kyoo Jeong Lee, Don Towsley and Myungwhan Choi UM-CS-1987-113* Distributed Optimization Algorithms for Quasi-Static Threshold Load Comments: This report is only available in hardcopy form. Kyoo Jeong Lee and Don Towsley 10/27/87, 10/27/87 Comments: This report is only available in hardcopy form. UM-CS-1988-016* Retrieving Documents by Plausible Inference W.B. Croft, T.J. Lucia and P.R. Cohen 04/30/88, 04/30/88 Comments: This report is only available in hardcopy form.

Neuroprose bernabe.art1chip.ps.Z(403K)
bernabe.art1chip.ps.Z bernabe@cnm.us.es 4 pages. A CMOS VLSI Analog Current-Mode High-Speed ART1 Chip.

Neuroprose bernabe.art1chip.ps.Z(403K)
bernabe.art1chip.ps.Z bernabe@cnm.us.es 4 pages. A CMOS VLSI Analog Current-Mode High-Speed ART1 Chip.

U of California Berkeley, CS csd-82-102(dir)
Thompson, Clark D.; Carey, Michael J.; Hansen, Paul Mark. ``RESST: A VLSI Implementation of a Record-Sorting Stack.'' UCB//CSD-82-102, 28 pages.
ABSTRACT
This report discusses a VLSI implementation of a record-sorting stack. Records are represented on the stack as (key, record-pointer) pairs, and the operations supported are PUSH, POP, and CLEAR. When records are POPed, they are returened in smallest-first order. The implementation allows the sorting of n records in O(n) time, and the design is cascadable so that the capacity of a single VLSI chip does not limit the amount of data which may be sorted.

This report describes a paper design and evaluation, and thus serves two purposes: It describes one particular VLSI sorting circuit, and it also serves as a case study in VLSI design methodology. The algorithm is described, the overall chip organization and data flow are presented, and detailed circuits, layouts, and timing analyses are given.

ENTRY
August 6, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-028}.tif)

U of California Berkeley, CS csd-89-522(dir)
Obermeier, Fred W. ``An Open Architecture for Improving VLSI Circuit Performance.'' UCB//CSD-89-522, 104 pages.
ABSTRACT
Electrical performance and area improvement are important parts of the overal integrated circuit design task. However, few design tools allow easy exploration of the design space (area, delay, and power) or offer designers different performance alternatives. Given designer specified constraints on area, delay, and power, EPOXY will size a circuit's transistors and will attempt small circuit changes to help meet the constraints. The system provides an open flexible framework for developing and evaluating the effects of different area and electrical models, optimization algorithms, and circuit modifications.
ENTRY
August 19, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-104}.tif)

Monash U, Robotics & Digital Tech 94-1.ps.Z(96K)
Title: Layout of CMOS Functional Cells. Design Methodology Author: Andrew P. Paplinski, Member, IEEE, Mikhail Shnaider Technical Report 94-1 Availability: ftp.rdt.monash.edu.au:pub/techreports/RDT/94-1.ps.Z Date: March 1, 1994 Number of pages: 15 Keywords: CMOS technology, VLSI systems Abstract: Design Methodology based on elementary results from the graph theory which is employed in a process of converting a schematic of a CMOS functional cell in a corresponding mask layout is presented. The exact topology of the layout is described by the layout matrix, which is amenable to algorithmic operations. It is demonstrated with typical CMOS cells that when Eulerian trails for the p-MOS and n-MOS circuits exist, the active areas of the cells can be laid without discontinuities.

U of California Berkeley, CS csd-88-487(dir)
Tonogai, Dale. ``AI in Operating Systems: An Expert Scheduler.'' UCB//CSD-88-487, 32 pages.
ABSTRACT
The allocation of resources in operating systems is currently based on ad hoc heuristics and algorithms. Some well-known heuristics have been developed in such areas as process scheduling, memory management and network routing. The possibility that better decisions could be made by an Intelligent Agent employing expert system techniques is investigated in this report. The ability to learn from experience and to deal with uncertainty are two characteristics of expert systems that would be essential aspects of such an Intelligent Agent.

As multiple processor systems become more widely available, applications involving multiple concurrent processes will increase in number and importance. This increased interdependency among processes poses interesting problems in the area of processor scheduling. How should the processes be scheduled to achieve some optimal level of performance? A scheduler based on an expert system may prove to be a viable alternative to those that have been proposed and (in some cases) implemented so far.

This report describes the implementation of a learning mechanism that attempts to handle the problem of processor scheduling in such a multiprocessor environment. In effect, the Intelligent Agent tries to "learn" its own set of heuristics for optimally scheduling a set of co-operating processes. By simulating a relatively simple multiprocessor system we examine the merits of such an approach.

ENTRY
July 26, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-032}.tif)

Simon Fraser U, SCS CMPT94-11.ps.Z(107K)
TechReport{U-SFraser-CMPT-TR:1994-11, number = "TR 94-11", author = "Jave O. Kane and Joseph G. Peters", title = "Line Broadcasting in Cycles", month = dec, year = "1994", org = "SFU-CMPT", school = SFU_CS_School, pages = "20", keywords = "broadcasting, embedding, circuit-switched routing", amos = "94A05, 68M10", abstract = "Broadcasting is the process of transmitting information from an originating node (processor) in a network to all other nodes in the network. A local broadcast scheme only allows a node to send information along single communication links to adjacent nodes, while a line broadcast scheme allows nodes to use paths of several communication links to call distant nodes. The minimum time possible for broadcasting in a network of n nodes when no node is involved in more than one communication at any given time is log n phases. Local broadcasting is not sufficient, in general, for broadcasting to be completed in minimum-time; line broadcasting is always sufficient. An optimal line broadcast is a minimum-time broadcast that uses the smallest possible total number of communication links. In this paper, we give a complete characterization of optimal line broadcasting in cycles, and we develop efficient methods for constructing optimal line broadcast schemes.", URL = "ftp://ftp.fas.sfu.ca/pub/cs/techreports/1994/CMPT94-11.ps.Z", }

U of California Berkeley, CS csd-89-529(dir)
Katz, Randy H.; Hennessy, John L. ``High Performance Microprocessor Architectures.'' UCB//CSD-89-529, 18 pages.
ABSTRACT
Single chip processor performance has improved dramatically since the inception of four-bit microprocessor in 1971. This is due in part to technological advances, (i.e., faster devices and greater device density), but also because of the adoption of architectural approaches well suited to the opportunitites and limitations of VLSI. The most appropriate are those that effectively reduce off-chip memory accesses and admit of a regular pipelined implementation. The overriding goal of pipelining is to achieve "single cycle execution," i.e., instructions appear to execute in a single processor cycle. Today's RISC processors are close to realizing this goal, and the next generation will reduce the cycles per instruction even further. In this paper, we will review the design issues and the proposed architectures for high performance VLSI processors.
ENTRY
September 14, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-018}.tif)

INRIA (Natl Inst Res Comp & Ctl Sci, France) RR-2443.ps.gz(98K)
RR-2443.ps.gz KOOLE (Ger) : n the pathwise optimal Bernoulli routing policy for homogeneous parallel servers

Inst for CS-FORTH 94.AVG_PROFILE.ps.Z(99K)
Copyright 1994 by FORTH, Heraklio, Crete, Greece
File tech-reports/1994/94.AVG_PROFILE.ps.Z
http://www.ics.forth.gr/proj/arch-vlsi/avg_profile.ps

U of California Berkeley, CS csd-91-647(dir)
Rothman, Jeffrey. ``VLSI Design of a Network Interface Processor.'' UCB//CSD-91-647, 45 pages.
ABSTRACT
An important piece of parallel computer systems is an interface between the processing node and the communications network. In many systems, this interface is largely handled by software. This document describes a general purpose interface processor, which performs requests from the network. This component can also be used with the Fluent multiprocessor system. This design was motivated by the need of a fast interface for Fluent, but can be used in many network applications.
ENTRY
March 24, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-045}.tif)

U of California Berkeley, CS csd-91-610(dir)
Holmer, Bruce K. ``A Detailed Description of the VLSI-PLM Instruction Set: A WAM Based Processor for Prolog.'' UCB//CSD-91-610, 38 pages.
ABSTRACT
This document describes the VLSI-PLM instruction set and includes small programs to test details of its implementation. The VLSI-PLM is a single chip implementation of the PLM, a WAM based instruction set for the execution of Prolog. The instruction set is described using C-like code based on the actual microcode of the VLSI-PLM. The test programs are a collection of simple Prolog programs which were used to debug the microcode. This report compliments the report of Fa gin and Dobry, The Berkeley PLM Instruction Set: An Interaction Set for Prolog.
ENTRY
July 28, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-038}.tif)

U of California Berkeley, CS csd-90-588(dir)
Singhal, Ashok. ``Exploiting Fine Grain Parallelism in Prolog.'' UCB//CSD-90-588, 198 pages.
ABSTRACT
The potential usefulness of Prolog has motivated attempts to achieve higher performance than the fastest sequential Prolog systems currently available. Several researchers have attempted to do this by exploiting parallelism. The fundamental tradeoff is exposing more parallelism versus reducing the overhead incurred by parallel execution. The high execution pverhead of the complex memory management and task management schemes used in current parallel Prolog systems leads to the following questions: (1) What forms of parallelism can be effectively exploited with memory management and task management schemes that have low overhead? (2) How much speedup can be obtained by exploiting these forms of parallelism? (3) What architectural support is required to exploit these forms of parallelism?

The goals of this research are to provide answers to these questions, to design a Prolog system that automatically exploits parallelism in Prolog wiht low overhead memory manamement and task management schemes, and to demonstrate by means of detailed simulations that such a Prolog system can indeed achieve a significant speedup over the fastest sequential Prolog systems.

We achieve these goals by first identifying the large sources of overhead in parallel Prolog execution: side-effects caused by parallel tasks, choicepoints created by parallel tasks, task creation, task scheduling, task suspension and context switching.

We then identify a form of parallelism, called flow parallelism, that can be exploited with low overhead becuase paralel execution is restricted to goals that do not cause side-effects and do not create choicepoints. We develop a master-slave model of parallel execution that eliminates task suspension and context switching. The model uses program partitioning and task scheduling techniques that do not require task suspension and context switching to prevent deadlock. We identify architectural techniques to support the parallel execution model and develop the Flow Parallel Prolog Machine (FPPM) architecture and implementation.

Finally, we evaluate the performance of FPPM and investigate the design tradeoffs using measurments on a detailed, register-transfer level simulator. FPPM achieves an average speedup of about a factor of 2 (as much as a factor of 5 for some programs) over the current highest performance sequential Prolog implementation, the VLSI-BAM. The speedups over other parallel Prolog systems are much larger.

ENTRY
July 7, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-198}.tif)

U of California Berkeley, CS csd-90-578(dir)
Wawrzynek, John; von Eicken, Thorsten. ``MIMIC, A Custom VLSI Parallel Processor for Musical Sound Synthesis.'' UCB//CSD-90-578, 10 pages.
ABSTRACT
In this paper, we present the architecture of a new generation musical sound synthesis machine. The machine is the first of its kind, combining state of the art ideas from multiprocessing and VLSI design in a system that produces sound in realtime by simulating the physical behavior of musical instruments. The machine will be compact, inexpensive, and easily scalable offering much higher performance pet unit board area than would be possible using commercially available processors.

Previous generation machines, such as the Caltech UPEs [1] mainly addressed the arithmetic computation needs of instrument simulation. The insights gained with these systems computation needs of instrument simulation. The insights gained with these systems demonstrate the feasibility of the approach but also clearly indicate that single chip systems cannot offer sufficient performance to simulate ensembles of instruments. To overcome this problem the MIMIC system includes a special purpose network managed by an on-chip unit.

Our machine relies heavily on the use of memory, both within our custom processing chips and in commercial memory chips giving rise to the name Memory Intensive Music Integrate Circuit (MIMIC).

ENTRY
September 28, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-010}.tif)

U of California Berkeley, CS csd-90-565(dir)
Wood, David A. ``Design and Evaluation of In-Cache Address Translation.'' UCB//CSD-90-565, February 1990. 117 pages.
ABSTRACT
In this dissertation we study in-cache address translation, a new approach to implementing virtual memory. In-cache translation combines the functions ofthe traditional translation lookaside buffer witha virtual address cache. Rather than dedicating hardware resources specifically to hold pagetable entries, in-cache translation lets the pagetable share the regular cache with instructions and data. By combining the two mechanisms we simplify the memory system design, and potentially reduce the cycle time. In addition, by eliminating the translation lookaside buffer, we simplify the translation consistency problem in multiprocessors: the operating system can use the regular data cache coherency protocol to maintain a consistent view of the pagetable.

Trace-driven simulation shows that in-cache translation has better performance than many translation lookaside buffer designs. As cache memories grow larger, the advantage of in-cache translation will also increase. Other simulation results indicate that in-cache translation will also increase. Other simulation results indicate that in-cache tranlation performs well over a wide range of cache configurations.

To further understand in-cache translation, we implemented it as a central feature of SPUR, a multiprocessor workstation developed at the University of California at Berkeley. A five-processor prototype convincingly demonstrates the feasibility of this new mechanism. In addition, a set of event counters imbedded in the custom VLSI cache controller chip, provides a means to measure cache performance. We use these counters to evaluate the performance of in-cache translation for 23 workloads. These measurements validate some of the key simulation results.

The measurements and simulations also show that the performance of in-cache translation is sensitive to pagetable placement. We propse a variation of the algorithm, called inverted in-cache translation, which reduces this sensitivity. We also examine alternative ways to support reference and dirty bits in a virtual address cache. An analytic model and measurements from the prototype show that the SPUR mechanism has the best performance, but that emulating dirty bits with protection is not much worse and does not require additional hardware. Finally, we show that the miss bit approximation to reference bits, where the bit is only checked on cache misses, performs better than true reference bits, which require a partial cache flush.

ENTRY
July 12, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-117}.tif)

U of California Berkeley, CS csd-89-538(dir)
Chiueh, Tzi-cher F.; Katz, Randy H.; King, Valerie D. ``Managing the VLSI Design Process.'' UCB//CSD-89-538, 18 pages.
ABSTRACT
Ways to represent and structure data within the design environment are reasonably well-understood and have led to a number of proposed and implemented design frameworks. Comparable support for the operational nature of design, i.e., the control led and disciplined sequencing of CAD tool invocations, are still in their infancy. In this paper, we describe a model for managing designers' work within the VLSI design environment. The model is based on a task specification language, for encapsulating CAD tool invocations and arranging the sequencing of such invocations to accomplish specific tasks, and an activity/history model, which maintains the history of task invocations and serves as a focus for sharing work results in a cooperative manner. The task specification lanugage and a prototype tool navigator have been implemented within the OCT CAD framework.
ENTRY
August 2, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-018}.tif)

U of California Berkeley, CS csd-89-533(dir)
King, Valerie D. ``Task Specification and Management in the VLSI Design Process.'' UCB//CSD-89-533, 52 pages.
ABSTRACT
Task specification and management is presented as one aspect of a complete model for managing the VLSI design process. I describe the appropriate role of tasks in the design process, their hierarchical specification, including a detailed descri ption of the grammar used, how they are invoked, and then present a prototype implementation integrated into the U. C. Berkeley CAD environment. The task management model presented includes a Template Manager for handling static operations on task specif ications, called templates, and a Tool Navigator for dynamic operations including running, previewing, suspending, and reasoning tasks.
ENTRY
August 2, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-052}.tif)

U of California Berkeley, CS csd-89-531(dir)
Smine, Georges E.; Srini, Vason P. ``Area Efficient Cells for LagerIV's DPP Library.'' UCB//CSD-89-531, 40 pages.
ABSTRACT
Semi-custom design of high-performance VLSI processors has been demonstrated by the Berkeley VLSI-PLM chip using Mentor Graphics IDEA station, Cell station tools and NCR tools. To support semi-custom design using Berkeley VLSI tools such as LagerIV, we have developed a set of cells. These cells are designed with the goal of designing a high-performance VLSI Parallel Prolog Processor. They can be used in other designs such as DSP chips. Some of these cells complement those in LagerIV's DPP cell library. Others provide an area efficient replacement for the DPP cells.
ENTRY
August 2, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-040}.tif)

U of California Berkeley, CS csd-89-524(dir)
Smine, Georges E.; Srini, Vason P. ``The Aquarius IIU Node: The Caches, the Address Translation Unit, and the VME bus Interface.'' UCB//CSD-89-524, August 1989. 82 pages.
ABSTRACT
This report describes the cache memory system of the Aquarius IIU node along with the address translation unit, and the VME interface. The Aquarius IIU node is designed for the parallel execution of Prolog. It is based onthe VLSI-PLM Chip that runs the Warren Abstract Machine Instruction Set [STN87] (an intermediate language for Prolog). We have connected many of these nodes using a shared bus to form a multi, which has its own shared memory and snooping caches and is used as a backend Prolog engine to the host (SUN3/160). On every node, there are two controllers for data and instruction cache that cooperate to support Berkeley's snooping cache-lock state protocol, which minimizes bus traffic associated with locking blocks. The nodes share memory using the VME bus; the page faults and memory management are handled by the host. The components of the Aquarius IIU node have been simulated at the gate level.
ENTRY
July 9, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-082}.tif)

U of California Berkeley, CS csd-89-513(dir)
Shing, Ip Kong. ``Performance, Resources, and Complexity: A Systematic Approach to Microarchitectual Design.'' UCB//CSD-89-513, 236 pages.
ABSTRACT
VLSI design in general-microprocessor design in particular-has been treated more like an art than a science in the past. The goal of this thesis is to explain the science of VLSI design to someone who wants to build a microprocessor. This can be accomplished by providing a quantitative way to evaluate, and a systematic approach to design, a microprocessor. Resources and complexity are two separate ways a microprocessor designer can pay for performance. The microprocessor designer must evaluate the performance, resources, and complexity tradeoffs quantitatively. In this thesis, the SPUR (which stands for Symbolic Processing Using RISC Machines) CPU micro- architecture is used as example to show how performance, resources, and complexity tradeoffs can be evaluated quantitatively. A systematic approach to microarchitectual design is then developed based on the SPUR CPU design experience. The SPUR CPU is implemented in 1.6um, double layer metal, CMOS technology. It consists of 115,000 transistors, runs at 100ns, and consumes 0.8W of power. Important features of the SPUR CPU are: internal instruction cache; a four-stage pipeline; support for LISP; a cache controller interface for multiprocessing and virtual memory support; and a parallel coprocessor interface for floating point arithmetic support. All these features make the SPUR CPU significantly different and more complex than previous generations of Bekreley RISC machines.
ENTRY
September 22, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-236}.tif)

U of California Berkeley, CS csd-89-505(dir)
Jeong, Deog-Kyoon. ``Clocking and Synchronization Circuits in Multiprocessor Systems.'' UCB//CSD-89-505, 148 pages.
ABSTRACT
Microprocessors based on RISC (Reduced Instruction Set Computer) concepts have demonstrated an ability to provide more computing power at a given level of integration than conventional microprocessors. The next step is multiprocessors is critic al in achieving efficient hardware utilization. This thesis focuses on the communication capability of VLSI circuits and presents new circuit techniques as a guide to build an interconnection network of VLSI microprocessors. Two of the most prominent problems in a synchronous system, which most of the current computer systems are based on, have been clock skew and synchronization failure. A new concept called self-timed systems solves such problems but has not been accepted i n microprocessor implementations yet because of its complex design procedure and increased overhead. With this in mind, this thesis concentrated on a system in which individual synchronous subsystems are connected asynchronously. Synchronous subsystems op erate with a better control over clock skew using a phase locked loop (PLL) technique. Communication among subsystems is done asynchronously with a controlled synchronization failure rate. One advantage is that conventional VLSI design methodologies which are more efficient can still be applied. Circuit techniques for PLL-based clock generation are described along with stability criteria. The main objective of the circuit is to realize a zero delay buffer. Experimental results show the feasability of such circuits in VLSI. Synchronizer circuit co nfigurations in both bipolar and MOS technology that best utilize each device, or overcome the technology limit using a bandwidth doubling technique are shown. Interface techniques including handshake mechanisms in such a system are also described. These techniques are applied in designing a memory management unit and cache controller (MMU/CC) for a multiprocessor workstation, SPUR. A SPUR workstation is an example of synchronous subsystems cluster with independent clock frequency. The interface and communication aspect of the overall system are revealed through the description of the MMU/CC. The VLSI chip is implemented in 1.6 um CMOS technology with 68,000 transistors.
ENTRY
May 5, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-148}.tif)

U of California Berkeley, CS csd-89-504(dir)
Lee, Daebum. ``Aspect of Full-Custom VLSI Microprocessor Design and Implementation.'' UCB//CSD-89-504, 126 pages.
ABSTRACT
There is a broad spectrum of design styles that have proven successful for the construction of VLSI circuits and systems. Semi-custom to full-custom design styles offer wide ranges of resulting performance, expected turn-around time, and requir ed design effort. Implementation alternatives, such as replacing dynamic memory for static memory to implement a denser on-chip memory, also exist at all levels of design hierarchy. To make the best use of scarce resources on a single chip microprocessor and to make the emerging CAD tools truly useful, alternatives in the implementation of a microprocessor must be carefully evaluated. The research reported in this thesis focuses on issues concerning these alternatives, especially in the areas of on-chip m emory design and automated control logic design. The methodologies and techniques used to maximize the performance of a full-custom VLSI microprocessor, called the SPUR CPU, is initially presented to provide an overview of microprocessor design strategies. The rest of the research presented is transpire d from new ideas and better alternatives which have become available since the SPUR CPU. These are based on lessons learned in the SPUR design and advanced computer-aided design tools such as multi-level logic synthesis system. A rigorous evaluation of th ese alternatives is attempted and results from the evaluation establish the effectiveness of the alternatives considered. To increase the area efficiency of the on-chip memory, two memory design techniques are proposed and evaluated. Selective invalidation instead of refreshing, implemented using low overhead dynamic CMOS circuits, can effectively eliminate the refreshing re quirement of dynamic memory. With this scheme, the size of an on-chip local memory can be substantially increased without increasing the scarce silicon area. Trace-driven simulations show the effectiveness of this scheme over a simple invalidation scheme. The demand for high bandwidth local memory expedited by parallel execution of programs through multiple functional units requires a fast, stable, yet compact multi-port memory cell. A single-ended access, static memory cell operated at reduced voltage le vels is proven to be useful for such applications. A part of this research is devoted to investigating various layout styles for microprocessor's control. Recently, various VLSI CAD tools have emerged to facilitate the hard-wired control design. Behavioral synthesis and multi-level logic optimization syst ems provide particularly efficient and high-performance hard-wired logic implementation, even with semi-custom layout styles, such as standard cell-based design. All new design methods aim for simplicity and regularity. The standard cell based design style, when combined with multi-level logic optimization, can provide a resulting design as good as full-custom version but in much shorter design time.
ENTRY
April 13, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-126}.tif)

U of California Berkeley, CS csd-89-500(dir)
Lee, Daebum. ``A VLSI Chip Set for a Mutiprocessor Workstation:.'' UCB//CSD-89-500, 62 pages.
ABSTRACT
This two-part paper describes two key components used in building a 40-70 MIPS multiprocessor workstation. In the first part, VLSI implementation of the central processing unit (CPU) chip, based on reduced instruction set computer (RISC) archit ecture and with support for LISP is described. The 1.3on2 CPU chip uses a direct-mapped 512-byte on-chip instruction cache, and 138 40-bit registers organized in 8 overlapping windows to achieve 10 MIPS per processor peak performance with a 10 MHz, four-p hase clock. The second part of the paper [1] describes the memory management unit and cache controller (MMU/CC) chip. System level design issues such as multiprocessor cache coherency and synchronization among chip sets are also considered to the second part. Both ch ips are implemented in a 1.6 um double-layer-metal CMOS technology, and are being used in a multiprocessor workstation (SPUR) successfully executing its own operating system called Sprite as well as many applications including LISP programs.
ENTRY
August 11, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-062}.tif)

U of California Berkeley, CS csd-89-496(dir)
Nguyen, Lau T.; Bushnell, Linda G.; Srini, Vason P. ``The VLSI-PLM Board: Design, Construction, and Testing.'' UCB//CSD-89-496, 114 pages.
ABSTRACT
We present the details of the design, simulation, construction and testing of the VLSI-PLM Board. The VLSI-PLM Board is a wire-wrapped processor board for the VLSI-PLM Chip [STN88], a high performance CMOS processor for executing computer programs written in the Prolog language. All work was performed at the University of California at Berkeley. The design and simulations were performed using Mentor Graphics Computer Aided Design (CAD) tools on Apollo workstations. By using these tools, we were able to draw the gate-level design schematics on the computer and simulate the functionality and timing of the design. After the gate-level design passed all simulation tests, a wire-wrapped board was tested using a custom-made tester panel. The wire-wrapped part occupies 18 cm by 22 cm in a board 40 cm by 36 cm, with a total of 95 Integrated Circuit (IC) chips including the VLSI-PLM Chip.
ENTRY
July 9, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-114}.tif)

U of California Berkeley, CS csd-89-492(dir)
Bushnell, Linda G.; Srini, Vason P. ``A Computer Aided Design Methodology For Printed Circuit Boards.'' UCB//CSD-89-492, 42 pages.
ABSTRACT
We present a design methodology for realizing VLSI printed circuit board (PCB) designs using Computer Aided Design (CAD) tools on powerful workstations. The workstation and CAD tools allow one to create a gate-level design using TTL parts and interconnecting wires, simulate the functionality and timing of the gate-level design using TTL parts and interconnecting wires, simulate the functionality and timing of the gate-level design, package the designs TTL parts into PCB parts, place and route the PCB, create manufacturing data, and simulate the design at the board-level. A design example, the VLSI-PLM PC Board being developed at the University of California at Berkeley, is given to illustrate this procedure. The VLSI-PLM PC Board is a processor board for the VLSI-PLM Chip [SRINI], which is a high performance CMOS processor for executing computer programs written in the Prolog language.
ENTRY
July 26, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-042}.tif)

U of California Berkeley, CS csd-88-476(dir)
Bush, William R.; Cheng, Gino; McGeer, Patrick C.; Despain, Alvin M. ``A Prototype Silicon Compiler in Prolog.'' UCB//CSD-88-476, December 1988. 58 pages.
ABSTRACT
The Advanced Silicon Compiler in Prolog (ASP) is a full-range hardware synthesis system based on Prolog. It produces VLSI masks from instruction set architecture specifications written in Prolog. The system is composed of several hierarchical components that span behavioral, circuit, and geometric synthesis. This report describes the prototype ASP system and its major components.
ENTRY
August 9, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-058}.tif)

U of California Berkeley, CS csd-88-469(dir)
Bose, Bidyut Kumar. ``VLSI Design Techniques for Floating-Point Computation.'' UCB//CSD-88-469, 186 pages.
ABSTRACT
This thesis presents design techniques for floating-point computation in VLSI. A basis for area-time design decisions for arithmetic and memory operations is formulated from a study of computationally intensive programs. Tradeoffs in the design and implementation of an efficient coprocessor interface are studied together with the implications of hardware support for the IEEE Floating-Point Standard. Algorithm area-time tradeoffs for basic arithmetic functions are analyzed in light of changing technology. Details of a single-chip floating-point unit designed into two micron CMOS for SPUR are described, including special design considerations for very wide datapaths. The pervasive effects of scaling technology on different levels of design are explored, from devices and circuits, through logic and micro-architecture, to algorithms and systems.
ENTRY
September 16, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-186}.tif)

U of California Berkeley, CS csd-88-466(dir)
Hansen, Paul Mark. ``Coprocessor Architectures for VLSI.'' UCB//CSD-88-466, 1988. 226 pages.
ABSTRACT
The hardware resources available on a single chip to implement VLSI CPU remain scarce, despite rapid technological advances. Reduced Instruction Set Computers (RISCs) reduce complexity and use the chip hardware resources to make the most frequently occurring operations fast. In this dissertation, the RISC philosophy is extended to specialized devices called coprocessors. Coprocessors increase system performance by reducing the number of instructions per program and the number of effective cycles per instruction. A method for evaluating coprocessor performance is developed, including a model that accounts for system, software, and hardware effects. Coprocessor implementations are characterized in terms of effectiveness and utilization by considering operation and overhead time for typical computations. Performance and interface characteristics of the SPUR floating-point coprocessor implementaion of the IEEE Standard are presented and compared to two popular commercial versions by Intel and Motorola. The SPUR FPU is a factor of three to 50 times faster than the commercial versions for comparable technology and clock rates. For each architecture, the influence on performance of each of the following is identified: the bus width between the floating-point unit and operand storage, the operand transfer protocols implemented in hardware, the concurrent execution model, the speed of the function units, the floating-point instruction semantics, and the data cache service time. Execution time spent in overhead is shown to increase to more than 90% for some architectures if equipped with faster floating-point units. This suggests that coprocessor interface architectures must change dramatically to keep pace with the rapid advance in CPU execution rates to be effective. The combinatorial optimization problem of finding the shortest path between two vertices in a directed graph is presented. Algorithms for scan-based relaxation techniques and Dijkstra's shortest-path algorithm are considered in detail. A path optimization coprocessor based on the SPUR model is proposed that achieves nearly three orders of magnitude improvement in performance over software implementations, and two to three orders of magnitude improvement in cost with performance comparable to dedicated hardware devices or specialized and multi-computer architectures. Finally, the SPUR coprocessor architecture is evaluated for three other applications: digital signal processing, vector floating-floating arithmetic, and support for Prolog language.
ENTRY
September 22, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-226}.tif)

U of California Berkeley, CS csd-84-202(dir)
Raghavan, Prabhakar; Thompson, Clark D. ``Randomized Routing in Gate-Arrays.'' UCB//CSD-84-202, September 1984. 20 pages.
ABSTRACT
A stochastic model for analyzing the performance of randomized algorithms for routing gate-arrays is presented; our model is based on an empirical observation known as Rent's Rule. Using the model, we analyze the space requirement of a wiring technique that only uses one-bend routes. We show how the tech nique can be extended to a case where several wiring layers are available, wiht near-optimal saving in area. As a by-product, we obtain a random procedure for converting a two-layer gate-array routing to a many-layer routing while reducing area efficiently. Within our model, we also show that the one-bend strategy is sub- optimal in terms of space requirement. However, we also show that the optimal strategy is not significantly superior to the random one-bend strategy.
ENTRY
September 21, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-020}.tif)

U of California Berkeley, CS csd-84-185(dir)
Ramamoorthy, C. V.; Nishiguchi, O.; Tsai, W.-T. ``Simulation of Hierarchical Routing Algorithms.'' UCB//CSD-84-185, 16 pages.
ABSTRACT
This document describes two routing simulators for computer communication for communication networks. One is to simulate the new Arpanet routing algorithm [MCQ 80] and its extension to hierarchical networks [TSA 82, RAM 83], and the other one for the original Arpanet routing algorithm [MCQ 74] and its extension to hierarchical networks [MCQ 74, KAM 76, HER 82]. The novel features of the simulators are that they are highly parametrized so that they could be used for a variety of purposes. Specifically, the input network could be arbitrary clustering structures, the reliability of links of networks could be arbitrarily chosen, the delay along each link could be arbitrarily predetermined according to the real network environments, and the patterns and flow rates of input data streams could be arbitrarily chosen. This report is intended to be used as a manual for using the simulators, thus many details of the algorithms will not be presented for the sake of brevity. The detailed algorithms could be found in [MCQ 7, MCQ 80, RAM 83].
ENTRY
September 29, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-016}.tif)

U of California Berkeley, CS csd-84-173(dir)
Sherburne, Robert Warren JR. ``Processor Design Tradeoffs in VLSI.'' UCB//CSD-84-173, 1984. 86 pages.
ABSTRACT
As the density of circuit integration is increased, management of complexity becomes a critical issue in chip design. Hundreds of man-years of design time are required for complex processors which are presently available on a few chips. This high cost of manpower and other resources is not acceptable. In order to address this problem, the Reduced Instruction Set Computer (RISC) architecture relies on a small set of simple instructions which execute in a regular manner. This allows a powerful processor to be implemented on a single chip at a cost of only a few man years. A critical factor be- hind the success of the RISC II microprocessor is the careful optimization on a single chip at a cost of only a few man years A critical factor behind the success of the RISC II microprocessor is the careful optimization which was performed during its design. Allocation of the limited chip area and power resources must be is the careful optimization which was performed during its design. Allocation of the limited chip area and power resources must be carefully performed to ensure that all processor instructions operate at the fastest possible speed. A fast implementation alone, however, is not sufficient; the designer must also overall performance for typical applications in order to ensure best results. Areas of processor design which are analyzed in this work include System Pipelining, Local Memory Tradeoffs, Datapath Timing, and ALU Design Tradeoffs. Pipelining improves performance by increasing the utilization of the datapath resources. This gain is diminished, however, by data and instruction dependencies which require extra cycles of delay during instruciont execution. Also, the larger registre file bitcells which are needed in order to support concurrency in the datapath incur greater delays and reduce system bandwidth from the expected value. Increased local memory (or register file) capacity significantly reduces data I/O traffic by keeping needed data frequently in registers on the chip. Too much local memory, though, can actually reduce system throughput by increasing the datapath cycle time. Various ALU organizations are available to the designer; here several approaches are investigated as to their suitability for VLSI. Carry delay as well as power, area, and reg- ularity issues are examined for ripple, carry-select, and parallel adder designs. First, a traditional, fixed-gate delay analysis of carry computation is performed over a range of adder sizes. Next, delays are measured for NMOS implementations utilizing dynamic logic and bootstrapping techniques. The results differ widely: the fixed-delay model shows the parallel design to be superior for adders of 16 bits and up, while the NMOS analysis showed it to be outperformed by the carry-select design through 128 bits. Such a result underscores the need to reevaluate design strategies which were traditionally chosen for TTL-based implementations. Single-chip VLSI implementations impose a whole new set of It is hoped that this work will bring out the significance of of evaluating the design tradeoffs over the whole spectrum ranging from the selection of a processor architecture down to the choice of the carry circuitry in the ALU.

In this research I was supported for three years by a General Electric doctoral fellowship. The RISC project was supported in part by ARPA Order No. 3803 and monitored by NESC #N00039-78-G-0013-0004.

ENTRY
June 21, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-086}.tif)

U of California Berkeley, CS csd-84-166(dir)
Mayo, Robert Nelson. ``Combining Graphics and Procedures in a VLSI Layout Tool: The Tpack System.'' UCB//CSD-84-166, 32 pages.
ABSTRACT
Tpack is a system for VLSI module generation that uses both graphical and procedural information. A graphical editor is used to specify tiles of mask information, then procedures are written to arrange the tiles into modules. This technique combines the visual power of graphical systems with the programming power of procedural systems. Since all of the mask information is contained in the tiles, the same procedures may be used for different design rules or technologies, merely by supplying a different set of tiles. This paper describes the procedural and graphical interfaces, and discusses two module generators that have been built with them.
ENTRY
June 22, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-032}.tif)

U of California Berkeley, CS csd-83-154(dir)
Ousterhout, John K.; Hamachi, Gordon T.; Mayo, Robert Nelson; Scott, Walter Stewart; Taylor, George S. ``A Collection of Papers on Magic.'' UCB//CSD-83-154, December 2, 1983. 104 pages.
ABSTRACT
Magic is a "smart" layout system for integrated circuits. It incorporates expertise about design rules and connectivity directly into the layout system in order to implement powerful new operations, including: a continuous design-rule checker that operates in background to maintain an up-to-date picture of violations; an operation called plowing that permits interactive stretching and compaction; and routing tools that can work under and around existing connections in the channels. Magic uses a new data structure called corner stitching to achieve an efficient implementation of these operations.
ENTRY
July 3, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-104}.tif)

U of California Berkeley, CS csd-83-141(dir)
Katevenis, Emmanuel-Manolis George. ``Reduced Instruction Set Computer Architectures for VLSI.'' UCB//CSD-83-141, October 1983. 222 pages.
ABSTRACT
Integrated circuits offer compact and low-cost implementation of digital systems, and provide performance gains through their high-bandwidth on-chip communication. When this technology is used to build a general-purpose von Neumann processor, it is desirable to integrate as much functionality as possible on a single chip, so as to minimize off-chip communication. Even in Very Large Scale Integrated (VLSI) circuits, however, the transistors available on the limited chip area constitute a scarce resource when used for the implementation of a complete processor or even computer, and thus, they have to be used effectively. This dissertation shows that the recent trend in computer architecture towards instruction sets of increasing complexity leads to inefficient use of those scarce resources. We investigate the alternative of Reduced Instruction Set Computer (RISC) architectures which allow effective use of on-chip transistors in functional units that provide fast access to frequently used operands and instructions.

In this dissertation, the nature of general-purpose computations is studied, showing the simplicity of the operations usually performed and the high frequency of operand accesses, many of which are made to the few local scalar variables of procedures. The architecture of the RISC I and II processors is presented. They feature simple instructions and a large multi-window register file, whose overlapping windows are used for holding the arguments and local scalar variables of the most recently activated procedures. In the framework of the RISC project, which has been a large team effort at UC Berkeley for more than three years, a RISC II nMOS single-chip processor was implemented, in collaboration with R. Sherburne. Its microarchitecture is described and evaluted, followed by a discussion of the debugging and testing methods used. Future VLSI technology will allow the integration of larger systems on a single chip. The effective utilization of the additional transistors is considered, and it is proposed that they should be used in implementing specially organized instruction fetch-and-sequence units and data caches.

The architectural study and evaluation of RISC II, as well as its design, layout, and testing after fabrication, have shown the viability and the advantages of the RISC approach. The RISC II single-chip processor looks different from other popular commercial processors: it has been less total transistors, it spends only 10% of the chip area for control rather than one half to two thirds, and it required about five times less design and lay-out effort to get chips that work correctly and at speed on first silicon. And, on top of all that, RISC II executes integer, high level language programs significantly faster than these other processors made in similar technologies.

ENTRY
July 6, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-222}.tif)

U of California Berkeley, CS csd-83-138(dir)
Thomspon, Clark D.; Raghavan, Prabhakar. ``On Estimating the Performance of VLSI Circuits.'' UCB//CSD-83-138, 18 pages.
ABSTRACT
Metrics of energy, area, and time are defined for a graph-theoretic model of VLSI computation. A number of "technological constant factors" are introduced in order to take into account the effects of using different technologies for implementing logic circuits. Different constant factors are seen to be appropriate for different logic families. We examine seven such families: NMOS, CMOS, CMOS-SOS, IsquaredL, GaAs HEMT, Jj-CIL, and Jj-CS.
ENTRY
July 7, 1993
RETRIEVAL
ocr (in all.ocr); tiff (in {001-018}.tif)


CPU use for this search: 2.93 user, 0.38 system, 20 wall