(updated on January 1, 1997)
My research experiments before 1996.
With the emergence of world wide computer networks, the demand of accessing information all over the world has increased tremendously. Under such a demand, many global information systems have been developed, such as FTP, Gopher, Wais, Archie, Webster, and WWW. Among them, WWW (World Wide Web) is the most attractive because it allows different protocols to be used and provides hypermedia capacity. Besides, WWW can integrate a general protocol for many users and offer multimedia functionality. Interested readers can also read my article which introduces WWW. In this area, I have worked on a WWW authoring system.
In this part, I am especially interested in MPEG decoding/encoding techniques. Our Lab has successfully designed a very fast MPEG-1 decoder, which can compete with Xing's product. We also find the work full of interesting research topics. Our future work is to research on the techniques of encoding and to extend the current work to MPEG-2.
In my Ph.D. work, I have successfully proposed a new approach, called multilist scheduling, which can greatly reduce the effort of developing dynamic load balancing routines for parallel/distributed programs. For this approach, I did not only examine it from the aspect of theory, but I also pratically implemented it. The success of this work will have a significant impact on parallel programming and will change the way people write parallel/distributed programs.
The first version of this system (with 20,000 lines of C++ code) is operational at CMU. At this point, although the system has not been formally publicized, this work has attracted attention of many reseachers, which are requesting/using the code of this system and committing the collaborative work. For example, the Nectar group and the EDRC center at CMU are using this software package for solid modeling problems. In the Northeast parallel architecture center at Syracuse, one group (led by Dr. Nikos Chrisochoides) will use this package to develop many important simulation problems and plan to have collaborative work with me, in CSIE, NCTU. Researchers at Harvard Univ., as well as some other institutes, are planning to port my system on their high-speed network systems and parallel systems. With the very encouraging response, I plan to continue to work on the topic and any related topics with any interested colleagues and students. I can also expect fruitful research results and many good Ph.D. thesis topics in this research direction.
I will extend the work in the following directions.
The following sections will summarize the research that I have done at CMU over the past 7 years. Most of my research was involved in the Warp and Nectar projects at CMU. Warp is a 100-FLOPS systolic array machine. Nectar is a high-bandwidth and low-latency computer network for connecting various hosts, e.g., Cray, iWarp, and workstations. The Nectar network is made of fiber-optic links, large crossbar switches, and dedicated network coprocessors, called CABs. The current Nectar prototype uses 100 Mbits/s links and 16x16 crossbar switches.
Parallel programming requires task scheduling to optimize performance; this primarily involves balancing the load over the processors. In many cases, it is critical to perform task scheduling at runtime. For example, (1) in many parallel applications the task load cannot be accurately predicted a priori; (2) in a network-based multicomputer the computational power of each processor may not remain constant. In order to support dynamic task scheduling, the programmer usually needs to design and implement a complex set of scheduling routines, e.g., routines for maintaining task lists and handling interprocessor communication for load balancing. Unfortunately, it is very difficult and time-consuming to write and debug all of these scheduling routines.
This thesis proposes a new approach which can greatly reduce the effort of developing efficient dynamic task scheduling routines. In our new approach, we decompose task scheduling into two parts --- the specification of scheduling policies and the implementation of supportive scheduling operations --- and then hide the latter from the programmer. We call this approach multilist scheduling, because it is based on a uniform scheduling model involving the use of multiple scheduling lists.
This thesis analyzes three main features of the new multilist scheduling model: ease of use, generality, and efficiency.
Traditionally, it has been difficult to efficiently support both parallel D&C and BFS in a uniform framework. We believe that our system is the first system that do so.
Multilist scheduling is the first model which can hide the details of dynamic task scheduling routines while simultaneously supporting general task scheduling. I expect that this model will have a significant impact on parallel programming, especially in the domain of multicomputers. In fact, researchers in the Engineering Design Research Center (EDRC) at CMU have, already, started using my system to develop parallel programs for their solid modeling systems. I expect my system will be widely used in the near future.
I proposed a new scheduling algorithm for parallel D&C. This scheduling algorithm minimizes the communication cost while balancing the load well. I also proved that, among all the scheduling algorithms that can split the load nearly evenly, my scheduling algorithm is optimal with respect to the communication cost. This algorithm is a very good example for showcasing the system proposed in my thesis.
Several computer architectures provide register windows to support fast procedure calls. Large numbers of registers, however, increase the size of the on-chip process state, and thus the process switch time. This overhead is negligible for Unix-style processes, but it can be significant for light-weight processes or threads. In this work, I quantified the effect of register windows on the performance of threads and interrupt handlers. The context is the nectar software on CAB, which is built around the SPARC CPU. I also looked at a number of possible strategies in which register windows are used to support fast thread switching. For some alternative strategies, I had to use modified Gnu C-Compilers to produce the code and had to trace the code of C-threads in Mach OS. All the experiments were performed based on the CAB functional simulator.
Noodles, a solid modeling program developed in EDRC, uses a sophisticated data structure with complicated links (pointers) to represent a solid model. In order to parallelize this complex system, I implemented remote procedure calls in a way similar to the Sun RPC package and the Matchmaker package in Mach OS. I also implemented routines to balance the load at runtime. Basically, the load balancing routine can be viewed as a client/server application: the load balancer server and each processor client. Finally, I successfully demonstrated a speedup of 3.8 on 4 processors in a Nectar Demo meeting. Note that there were only 4 processors available on Nectar at that time. Since I spent about two months on implementing load balancing routines, this gave me a strong motivation to provide a parallel programming language/system to hide the details of load balancing routines in my thesis.
In this project, I studied how the interprocessor communication systems of the 4.3 BSD UNIX OS, the Amoeba distributed OS and the V kernel could be implemented on Nectar. UNIX is a standard and widely used OS, while Amoeba and the V kernel are operating systems with fast communication protocols. Note that the interprocessor communication in Amoeba is based on the client/server model. In this study, I learned how to ``off-load'' protocols such as TCP/IP to the network coprocessor CAB and the implications of doing so.
I was a member of the team building the CAB hardware. First, I implemented the functional simulator of CAB on Nectar. Since a CAB includes a Sparc CPU, I incorporated the Sparc functional simulator (obtained from Sun Microsystems) into my functional simulator. In addition, I wrote about 70 diagnostic programs to detect the correctness of the CAB hardware by checking the consistency between the CAB functional simulator and the CAB logical simulator. This experience greatly enhanced my knowledge of Nectar hardware (CAB) and hardware/software interface, thus enabling me to study threads and interprocessor communication on Nectar (see preceding paragraphs).
Apply is a specialized programming language for low-level vision operations. I implemented the compiler to generate code on Warp by using the lex/yacc packages and the Lisp language. I also successfully developed an efficient algorithm, called cyclic scroll buffering, for Apply. The algorithm is very efficient, but hard to be written without the help of compiler. Interestingly, researchers in the Robotics Institute, at CMU, found that the performance of the compiled code are about 2-3 times faster than those of their hand-written code. As far as I know, this technique has been widely used in low-level vision programmers. I was also motivated to work on the parallel/distributed computing paradigm, from this point on.
Warp was a participant in the DARPA Architecture Workshop Benchmark Study, which compared performance of various architectures for image processing. I implemented the graph matching problem, one of the benchmarks. Since the problem was too large and too complicated for Warp, it was almost impossible to implement it on Warp, especially without debugging tools. However, I noticed that my sequential program on a Sun3 workstation already outperformed other participants because I devised a new pruning technique for the graph matching problem. This work provided us with one good reason to build a network-based multicomputer, i.e., the current Nectar system (which I helped to build later).
I proposed a generic method to obtain the area-time tradeoffs for a class of problems, e.g., matrix multiplication, sorting, and convolution.
I proposed a new transformation technique which the designers can use to simplify their design. I also designed some efficient systolic arrays for integer multiplication, and matrix multiplication.