This thesis will show that, despite their expensive memory systems, traditional vector processors cannot fully use the memory bandwidth they are provided with and do not tolerate very large memory latencies. We will show that both problems, the lack of latency tolerance and the gap between actual and peak performance, originate in the conservative in-order instruction dispatch model currently in use. This in-order model leaves much of the memory system's high bandwidth unused, and it also fails to request data early enough to mask large memory latencies.
Three solutions to the latency and performance problems will be presented. Using dynamic scheduling techniques already exploited in superscalar machines, we will show that the performance of traditional vector architectures can be greatly improved while, at the same time, providing the latency tolerance needed to compensate for slow memory systems. The first two techniques, {\em decoupling} and {\em out-of-order execution}, look for independent operations within a single program stream that can be executed in parallel. The third technique, {\em multithreading}, improves global throughput by interleaving independent instructions from different programs. All three techniques preserve full binary compatibility.
Alone or combined, these three techniques can yield better performance and allow the use of cheaper main memory systems, which could help vector architectures regain some of their lost prominence.
This thesis will also look into the problem of spill code found in architectures with a limited number of registers. We will present three different techniques that dynamically eliminate spill loads and stores, thereby reducing total memory traffic and improving execution time.