Reading list for Giorgos Sapountzis



Tomasulo

[1] R. M. Tomasulo. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development, 11(1):25-33, 1967.
BibTeX entry
Comments: Seminal paper. However, I think the presentation in the Hennessy-Patterson textbook is thorough enough that there is little point in presenting the paper itself.

Paper importance: 6/10

Instruction Fetch

[2] T. M. Conte, K. N. Menezes, P. M. Mills, and B. A. Patel. Optimization of Instruction Fetch Mechanisms for High Issue Rates. In 22nd ISCA, June 1995.
BibTeX entry, PS
Comments: Presents optimizations to the cache organization and the branch target buffer (BTB) in order to increase the utilization of the bus between the I-Fetch and I-Decode stages. I do not think that the presented ideas are breakthroughs. It is worth noting, however, that the paper considers an issue window larger than the typical basic block and shows how the bottlenecks in processor design shift as a result.

Paper importance: 7/10

Complexity-Effective Superscalars

[3] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith. Complexity-Effective Superscalar Processors. In 24th ISCA, June 1997.
BibTeX entry, PS
Comments: Very good paper, both in its technical content and its presentation. Defines a complexity measure that takes both transistor and wire delay into account, quantifies the complexity of a typical superscalar processor, and proposes a microarchitecture with reduced complexity. A workshop (WCED) that deals with problems defined in this paper is now held in conjunction with ISCA.

Paper importance: 10/10
[4] Dana S. Henry, Bradley C. Kuszmaul, and Vinod Viswanath. The Ultrascalar Processor: An Asymptotically Scalable Superscalar Microarchitecture. In ARVLSI '99, pages 256-273, March 1999.
BibTeX entry, Compressed PS
Comments: The ideas presented in this paper are of little practical value right now, because the paper only computes asymptotic bounds on delay. However, the paper contains a surprising result: all you need to build a superscalar processor is a Cyclic Segmented Parallel Prefix (CSPP) circuit. I think that the view taken in this paper is interesting, and the CSPP circuit is worth studying because it is the building block of many fast circuits.

Paper importance: 8/10
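The CSPP building block is easy to experiment with in software. Below is a minimal Python sketch (my own reference implementation of the computation a CSPP performs, not the paper's log-depth circuit): every position on a ring receives the fold of an associative operator over the values since the most recent segment start, wrapping around the ring.

```python
def cyclic_segmented_prefix(values, starts, op, identity):
    """Reference model of a CSPP: for each ring position i, fold `op`
    over the values from the nearest preceding position j with
    starts[j] == True (walking backwards, cyclically) up to and
    including position i. Deliberately O(n^2) for clarity; the paper's
    contribution is realizing this computation as a log-depth circuit."""
    n = len(values)
    out = []
    for i in range(n):
        # Walk backwards to the segment start (or at most once around).
        chain = []
        j = i
        for _ in range(n):
            chain.append(j)
            if starts[j]:
                break
            j = (j - 1) % n
        # Fold forward, oldest position first.
        acc = identity
        for k in reversed(chain):
            acc = op(acc, values[k])
        out.append(acc)
    return out
```

For example, with addition over `values = [1, 2, 3, 4]` and segment starts at positions 0 and 2, the result is `[1, 3, 3, 7]`; with a single start at position 2, the prefixes for positions 0 and 1 wrap around the ring, which is the "cyclic" part that makes the circuit usable for a ring of instruction-window slots.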
[5] Dana S. Henry, Bradley C. Kuszmaul, Gabriel H. Loh, Rahul Sami, and Vinod Viswanath. Circuits for Wide-Window Superscalar Processors. In ISCA '00, pages 236-247, June 2000.
BibTeX entry, Compressed PS

Binary Translation

[6] Cindy Zheng and Carol Thompson. PA-RISC to IA-64: Transparent Execution, No Recompilation. IEEE Computer, 33(3):47-52, March 2000.
BibTeX entry, PDF
Comments: The objective of the system described in the paper is to execute binaries of OS/processor A on OS/processor B at the highest possible speed. Here, OS/processor A is the legacy architecture and OS/processor B is the new native architecture. The ideas presented in this paper can also be applied to binary translation as a generic run-time optimization technique. It would be useful if the paper contained more quantitative results.

Paper importance: 10/10
[7] Michael Gschwind, Erik R. Altman, Sumedh Sathaye, Paul Ledak, and David Appenzeller. Dynamic and Transparent Binary Translation. IEEE Computer, 33(3):54-59, March 2000.
BibTeX entry, PDF
Comments: Presents binary translation as a runtime optimization technique that can be used to reduce the Power-Delay-Product (PDP) of the processor. The native architecture is optimized for high frequency operation and for being a good target for the binary translator. The architecture seen by the applications is the combination of the native architecture and the binary translator. In conclusion, a very good paper. It would be useful if the paper contained more quantitative results.

Paper importance: 10/10
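The core mechanism these systems share (translate one basic block of legacy code at a time, cache the translation, and chain blocks through a dispatch loop) can be sketched in a few lines. The toy "legacy" instruction set below is invented purely for illustration; a real translator such as the one in [7] emits native machine code, not Python closures.

```python
def translate_block(program, pc):
    """Translate straight-line legacy code starting at `pc` into a single
    host closure. The closure executes the whole block against a register
    dict and returns the next pc, or None on halt."""
    body = []
    while True:
        op = program[pc]
        if op[0] == 'set':                        # set dst, immediate
            body.append(lambda regs, d=op[1], v=op[2]:
                        regs.__setitem__(d, v))
        elif op[0] == 'add':                      # add dst, src
            body.append(lambda regs, d=op[1], s=op[2]:
                        regs.__setitem__(d, regs[d] + regs[s]))
        elif op[0] == 'jnz':                      # jnz reg, target: ends block
            def block(regs, body=tuple(body), r=op[1], t=op[2], f=pc + 1):
                for h in body:
                    h(regs)
                return t if regs[r] != 0 else f
            return block
        elif op[0] == 'halt':                     # ends block
            def block(regs, body=tuple(body)):
                for h in body:
                    h(regs)
                return None
            return block
        pc += 1

def run(program):
    """Dispatch loop with a translation cache: each block is translated
    on first execution and reused from the cache afterwards, so hot
    blocks pay the translation cost only once."""
    cache, regs, pc = {}, {}, 0
    while pc is not None:
        if pc not in cache:
            cache[pc] = translate_block(program, pc)
        pc = cache[pc](regs)
    return regs
```

A loop that sums 3 + 2 + 1 exercises the cache: the loop body at pc 3 is translated once and then re-executed from the cache on every iteration, which is exactly where a dynamic translator recovers its overhead.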
[8] Cristina Cifuentes and Mike Van Emmerik. UQBT: Adaptable Binary Translation at Low Cost. IEEE Computer, 33(3):60-66, March 2000.
BibTeX entry, PDF
Comments: Describes a generic framework that translates (at compile time) binaries from one OS/processor to another. The paper presents an important application of binary translation, but I think the efficiency of the approach is limited: it produces good results only when translating between the same OS on different processors (e.g., Solaris/SPARC to Solaris/IA-64).

Paper importance: 7/10
[9] K. Ebcioglu and E.R. Altman. DAISY: Dynamic Compilation for 100% Architectural Compatibility. In 24th ISCA, pages 26-37, June 1997.
BibTeX entry, Available here

Software Pipelining / Rotating RF

[10] M. Lam. Software Pipelining : An Effective Scheduling Technique for VLIW Machines. In Proc. ACM SIGPLAN PLDI, pages 318-328, June 1988.
BibTeX entry, PDF
Comments: Formulates the problem of software pipelining and solves it with a purely compiler-based approach. Although not stated explicitly, the paper also classifies loops by their amenability to software pipelining (i.e., when software pipelining succeeds). I think this paper is essential for a complete understanding of software pipelining.

Paper importance: 10/10
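The transformation Lam formalizes can be illustrated with a toy schedule. The sketch below simulates, in Python, the shape of a software-pipelined loop with three stages (load, compute, store) and an initiation interval of one: the first cycles form the prologue, the last the epilogue, and each kernel "cycle" in between runs one stage of three different iterations. This only simulates the schedule; it is not a compiler.

```python
def pipelined_loop(xs, f):
    """Simulate `for x in xs: out.append(f(x))` software-pipelined into
    three overlapped stages with initiation interval II = 1.
    At cycle t the schedule runs the store of iteration t-2, the
    compute of iteration t-1, and the load of iteration t. The stages
    are written in reverse order so each one reads the value produced
    in the previous cycle, mimicking pipeline registers."""
    n = len(xs)
    out = []
    loaded = computed = None    # "pipeline registers" between stages
    for t in range(n + 2):      # n kernel cycles plus 2 drain cycles
        if t >= 2:              # stage 2 (store), iteration t-2
            out.append(computed)
        if 1 <= t <= n:         # stage 1 (compute), iteration t-1
            computed = f(loaded)
        if t < n:               # stage 0 (load), iteration t
            loaded = xs[t]
    return out
```

For instance, `pipelined_loop([1, 2, 3], lambda x: x * x)` yields `[1, 4, 9]`, with the iterations overlapping exactly as the kernel of a modulo-scheduled loop would.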
[11] James C. Dehnert, Peter Y. T. Hsu, and Joseph P. Bratt. Overlapped Loop Support in the Cydra 5. In Proc. ASPLOS-III, pages 26-38, April 1989.
BibTeX entry, PDF
Comments: Seminal paper. Explains how simple hardware support can allow the compiler to software pipeline loops without overhead.

Paper importance: 10/10
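The hardware support in question is the rotating register file. A toy model makes it clear why it removes the copies that overlapped iterations would otherwise need; note that the register naming and the choice to decrement the base on rotation follow the later IA-64 convention rather than the Cydra 5's exact encoding.

```python
class RotatingRegisterFile:
    """Toy rotating register file: architectural register r maps to
    physical slot (r + rrb) % size, and the rotating register base rrb
    is decremented by the loop-closing branch. A value written to r in
    one iteration is therefore read back as r+1 in the next iteration,
    so each overlapped iteration sees fresh registers without explicit
    copy instructions or loop unrolling."""
    def __init__(self, size):
        self.size = size
        self.phys = [None] * size
        self.rrb = 0

    def read(self, r):
        return self.phys[(r + self.rrb) % self.size]

    def write(self, r, value):
        self.phys[(r + self.rrb) % self.size] = value

    def rotate(self):
        # Performed as a side effect of the loop-closing branch.
        self.rrb = (self.rrb - 1) % self.size
```

For example, a value produced into r0 by iteration i is consumed from r1 by iteration i+1 and from r2 by iteration i+2, which is exactly the renaming that makes the kernel of a software-pipelined loop compact.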
[12] P. Tirumalai, M. Lee, and M. S. Schlansker. Parallelization Of Loops With Exits On Pipelined Architectures. In Proceedings of Supercomputing'90, pages 200-212, November 1990.
BibTeX entry, PDF
Comments: Presents an extension of the ideas in [11] to loops with data-dependent conditions. Identifies the need for, and proposes some form of, speculative execution: the processor must be able to execute instructions of an iteration before it knows whether that iteration's condition is true.

Paper importance: 7/10
[13] B. Rau, D. Yen, W. Yen, and R. Towle. The Cydra 5 Departmental Supercomputer: Design Philosophies, Decisions, and Trade-offs. IEEE Computer, 22(1):12-35, January 1989.
BibTeX entry

IA-64

[14] Jerry Huck, Dale Morris, Jonathan Ross, Allan Knies, Hans Mulder, and Rumi Zahir. Introducing the IA-64 Architecture. IEEE Micro, 20(5), September 2000.
BibTeX entry, PDF
Comments: This is a reference that shows how the ideas from [11] were implemented by the IA-64 architects. Needless to say, the paper is very interesting in its own right.

Paper importance: 10/10
[15] Harsh Sharangpani and Ken Arora. Itanium Processor Microarchitecture. IEEE Micro, 20(5), September 2000.
BibTeX entry, PDF
[16] Jay Bharadwaj, William Y. Chen, Weihaw Chuang, Gerolf Hoflehner, Kishore Menezes, Kalyan Muthukumar, and Jim Pierce. The Intel IA-64 Compiler Code Generator. IEEE Micro, 20(5), September 2000.
BibTeX entry, PDF
[17] Rakesh Krishnaiyer, Dattatraya Kulkarni, Daniel Lavery, Wei Li, Chu-Cheow Lim, John Ng, and David Sehr. An Advanced Optimizer for IA-64 Architecture. IEEE Micro, 20(6), November 2000.
BibTeX entry, PDF
[18] Stefan Rusu and Gadi Singer. The First IA-64 Microprocessor. IEEE Journal of Solid-state Circuits, 35(11):1539-1544, November 2000.
BibTeX entry

This file has been generated by bibtex2html 1.46