The purpose of this paper is to show that decoupling techniques can be applied to a vector processor, resulting in a large increase in the performance of vectorizable programs. Using a trace-driven approach, we simulate a selection of the Perfect Club and SPECfp92 programs and compare their execution times on a conventional single-port vector architecture and on a decoupled vector architecture. Decoupling provides a performance advantage of more than a factor of 1.4 for realistic memory latencies, and even with an ideal memory system with zero latency, there is still a speedup of as much as 1.31. An important part of this paper is devoted to studying the tradeoffs involved in choosing a suitable size for the different queues of the architecture, so that the hardware cost of the queues is reduced while still retaining most of the performance advantages of decoupling.