The best way to understand the requirements is to examine typical DSP algorithms and identify how their compositional requirements have influenced the architectures of DSP processor. Let us consider one of the most common processing tasks the finite impulse response filter.
For each tap of the filter a data sample is multiplied by a filter coefficient with result added to a running sum for all of the taps .Hence the main component of the FIR filter is dot product: multiply and add .These options are not unique to the FIR filter algorithm; in fact multiplication is one of the most common operation performed in signal processing -convolution, IIR filtering and Fourier transform also involve heavy use of multiply -accumulate operation. Originally, microprocessors implemented multiplication by a series of shift and add operation, each of which consumes one or more clock cycle .First a DSP processor requires a hardware which can multiply in one single cycle. Most of the DSP algorithm require a multiply and accumulate unit (MAC).
In comparison to other type of computing tasks, DSP application typically have very high computational requirements since they often must execute DSP algorithms in real time on lengthy segments ,therefore parallel operation of several independent execution units is a must -for example in addition to MAC unit an ALU and shifter is also required .
Executing a MAC in every clock cycle requires more than just single cycle MAC unit. It also requires the ability to fetch the MAC instruction, a data sample, and a filter coefficient from a memory in a single cycle. Hence good DSP performance requires high memory band width-higher than that of general microprocessors, which had one single bus connection to memory and could only make one access per cycle. The most common approach was to use two or more separate banks of memory, each of which was accessed by its own bus and could be written or read in a single cycle. This means programs are stored in a memory and data in another .With this arrangement, the processor could fetch and a data operand in parallel in every cycle .since many DSP algorithms consume two data operands per instruction a further optimization commonly used is to include small bank of RAM near the processor core that is used as an instruction cache. When a small group of instruction is executed repeatedly, the cache is loaded with those instructions, freeing the instruction bus to be used for data fetches instead of instruction fetches -thus enabling the processor to execute a MAC in a single cycle.