The best way to understand these requirements is to examine typical DSP algorithms and identify how their computational requirements have influenced the architecture of DSP processors. Let us consider one of the most common processing tasks: the finite impulse response (FIR) filter.
For each tap of the filter, a data sample is multiplied by a filter coefficient, and the result is added to a running sum over all of the taps. Hence the main component of the FIR filter is a dot product: multiply and add. These operations are not unique to the FIR filter algorithm; in fact, multiplication is one of the most common operations performed in signal processing. Convolution, IIR filtering, and Fourier transforms also make heavy use of the multiply-accumulate operation. Originally, microprocessors implemented multiplication as a series of shift and add operations, each of which consumed one or more clock cycles. The first requirement of a DSP processor is therefore hardware that can multiply in a single cycle. Most DSP algorithms require a multiply-accumulate (MAC) unit.
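The two ideas in the paragraph above can be sketched in C. This is only an illustration: the function names fir_output and shift_add_multiply, the use of double-precision samples, and the convention that history[0] holds the newest sample are all assumptions made for the example, not part of any particular processor or library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one FIR output sample computed as a dot product.
 * coeffs[k] is the k-th filter coefficient and history[k] is the input
 * sample delayed by k sample periods (assumed layout). */
double fir_output(const double *coeffs, const double *history, size_t num_taps)
{
    assert(coeffs != NULL && history != NULL);
    double acc = 0.0;                    /* running sum over all taps */
    for (size_t k = 0; k < num_taps; ++k) {
        acc += coeffs[k] * history[k];   /* one multiply-accumulate per tap */
    }
    return acc;
}

/* Hypothetical helper: multiplication as a series of shift and add steps.
 * The result is the product modulo 2^32; *steps receives the number of
 * loop iterations, a rough stand-in for the cycles spent without a
 * hardware multiplier. */
uint32_t shift_add_multiply(uint32_t a, uint32_t b, unsigned *steps)
{
    assert(steps != NULL);
    uint32_t product = 0;
    *steps = 0;
    while (b != 0) {
        if (b & 1u) {
            product += a;                /* add the shifted multiplicand */
        }
        a <<= 1;                         /* shift multiplicand left */
        b >>= 1;                         /* move to the next multiplier bit */
        ++*steps;
    }
    return product;
}
```

On a processor with a single-cycle MAC unit, the body of the loop in fir_output corresponds to one instruction per tap, whereas each multiplication done the shift_add_multiply way costs several steps on its own.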
Compared with other types of computing tasks, DSP applications typically have very high computational requirements, since they often must execute DSP algorithms in real time on lengthy segments of signal. Parallel operation of several independent execution units is therefore a must; for example, in addition to the MAC unit, an ALU and a shifter are also required.
Executing a MAC in every clock cycle requires more than just a single-cycle MAC unit. It also requires the ability to fetch the MAC instruction, a data sample, and a filter coefficient from memory in a single cycle. Hence good DSP performance requires high memory bandwidth, higher than that of general-purpose microprocessors, which had a single bus connection to memory and could make only one access per cycle. The most common approach was to use two or more separate banks of memory, each of which was accessed by its own bus and could be read or written in a single cycle. This means that programs are stored in one memory and data in another. With this arrangement, the processor could fetch an instruction and a data operand in parallel in every cycle. Since many DSP algorithms consume two data operands per instruction, a further optimization commonly used is to include a small bank of RAM near the processor core that serves as an instruction cache. When a small group of instructions is executed repeatedly, the cache is loaded with those instructions, freeing the instruction bus to be used for data fetches instead of instruction fetches and thus enabling the processor to execute a MAC in a single cycle.
High memory bandwidth requirements are often further supported by dedicated hardware for calculating memory addresses. These address generation units operate in parallel with the DSP processor's main execution units, enabling it to access data at new locations in memory without pausing to calculate the new address.
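The bandwidth argument for a single MAC can be checked with a little arithmetic. The sketch below is a deliberately simplified model, not a description of any real processor: the function name cycles_per_mac is hypothetical, every bus is assumed to carry exactly one access per cycle, and accesses are assumed to be spread perfectly across the available buses.

```c
#include <assert.h>

/* Hypothetical model: cycles needed to supply one MAC with its operands.
 * Each MAC needs a data sample and a coefficient (2 data accesses), plus
 * one instruction fetch unless the loop is already in the instruction
 * cache. Each bus is assumed to perform one access per cycle. */
unsigned cycles_per_mac(unsigned num_buses, int loop_in_cache)
{
    assert(num_buses > 0);
    unsigned accesses = 2u;              /* data sample + filter coefficient */
    if (!loop_in_cache) {
        accesses += 1u;                  /* instruction fetch uses a bus too */
    }
    /* Ceiling division: accesses shared evenly among the buses. */
    return (accesses + num_buses - 1u) / num_buses;
}
```

Under these assumptions, cycles_per_mac(1, 0) gives 3 for a single-bus microprocessor, cycles_per_mac(2, 0) gives 2 for separate program and data memories, and cycles_per_mac(2, 1) gives 1 once the loop runs from the instruction cache and both buses carry data.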
Memory accesses in DSP algorithms tend to exhibit very predictable patterns. For example, for each sample in an FIR filter, the filter coefficients are accessed sequentially from start to finish, and then the accesses start over from the beginning of the coefficient vector when the next input sample is processed. This is in contrast to other computing tasks, such as database processing, where accesses to memory are less predictable. DSP processor address generation units take advantage of this predictability by supporting specialized addressing modes that enable the processor to efficiently access data in the patterns commonly found in DSP algorithms. The most common of these modes is register-indirect addressing with post-increment, which automatically increments the address pointer in algorithms where repetitive computations are performed on a series of data stored sequentially in memory. Without this feature, the programmer would need to spend instructions explicitly incrementing the address pointer.
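The access pattern described above maps naturally onto C pointer syntax, where *p++ reads through a pointer and then advances it, much like register-indirect addressing with post-increment. The following is a minimal sketch: the function name fir_block is hypothetical, and it assumes the coefficients are stored in the order in which they meet the samples, so that output[n] is the sum of coeffs[k] * input[n + k], and that input holds num_outputs + num_taps - 1 samples.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: filter a block of samples using post-increment
 * pointer accesses. input must hold num_outputs + num_taps - 1 samples. */
void fir_block(const double *coeffs, size_t num_taps,
               const double *input, size_t num_outputs, double *output)
{
    assert(num_taps > 0);
    for (size_t n = 0; n < num_outputs; ++n) {
        const double *c = coeffs;        /* start over at the first coefficient */
        const double *x = input + n;     /* oldest sample used by this output */
        double acc = 0.0;
        for (size_t k = 0; k < num_taps; ++k) {
            /* Read both operands, then advance both pointers; no separate
             * statement is spent on incrementing the addresses. */
            acc += (*c++) * (*x++);
        }
        output[n] = acc;
    }
}
```

Whether a compiler actually turns *c++ into a single post-increment load depends on the target; on a processor without that addressing mode, each pointer update would cost an extra add instruction inside the inner loop.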