**Post: #1**

[attachment=380]

Presented by: B. V. Phani Sekhar

Survey of Matrix Multiplication Algorithms

Abstract

Matrix multiplication is one of the most fundamental operations in linear algebra and serves as the main building block in many different algorithms, including the solution of systems of linear equations, matrix inversion, evaluation of the matrix determinant, signal processing, and the transitive closure of a graph. In several cases the asymptotic complexities of these algorithms depend directly on the complexity of matrix multiplication, which motivates the study of possibilities to speed up matrix multiplication. Also, the inclusion of matrix multiplication in many benchmarks points at its role as a determining factor for the performance of high-speed computations.

A matrix multiplication algorithm is usually built as follows. First an algorithm for a small matrix problem is developed; then a tensor product construction produces from it an algorithm for multiplying large matrices. Several times over the years, the ground rules for constructing the basic algorithm have been relaxed, and with more care in the tensor product construction, it has been shown how to use these more lenient basic constructions to still give efficient algorithms for multiplying large matrices.

The performance of matrix multiplication depends not only on the algorithm but also on the machine architecture. The way in which two numbers are multiplied in a machine does affect the algorithms, but one will find that addition is always faster than multiplication. These algorithms exploit this fact to improve the time complexity; still, most of them are not used in practice, as the constant factor involved in the complexity is so large that it makes them unusable. Floating-point multiplications take more time, as the number of cycles used to multiply floating-point numbers is large compared to integers. The rest of the paper deals with different types of matrix multiplication algorithms.

Introduction

Humans have talents that are hard to program. Driving a car, recognizing continuous speech, and playing chess have proved far more challenging for computers than for people. To this class we can probably add algorithm optimization. How is it that a mathematician can look at a simple algorithm for the complex product, like

U = A * C - B * D

V = A * D + B * C

and discover that the number of * operations can be reduced to 3? One method, which seems far from obvious, is

U = C * (A - B) + B * (C - D)

V = D * (A + B) + B * (C - D)

where the expression B * (C - D) is computed once and reused.
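This three-multiplication trick is easy to verify numerically; a minimal Python sketch (the function name `complex_mul_3` is my own):

```python
def complex_mul_3(a, b, c, d):
    """Compute (a + b*i) * (c + d*i) = (a*c - b*d) + (a*d + b*c)*i
    using 3 real multiplications instead of 4."""
    shared = b * (c - d)      # computed once, reused in both parts
    u = c * (a - b) + shared  # real part: a*c - b*d
    v = d * (a + b) + shared  # imaginary part: a*d + b*c
    return u, v
```

For example, (1 + 2i)(3 + 4i) gives `complex_mul_3(1, 2, 3, 4) == (-5, 10)`, matching the direct four-multiplication formula.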

(a) Similarly, A * A - B * B can be computed using one multiplication:

A * A - B * B = (A + B) * (A - B)

(b) The cross product of the vectors a1 i + a2 j + a3 k and b1 i + b2 j + b3 k yields:

a2 * b3 - a3 * b2

a3 * b1 - a1 * b3

a1 * b2 - a2 * b1

as the three goal expressions. While working on a symmetric scene generation program, one might wonder: is there a way to compute a cross product in fewer than six multiplications, at the cost of extra additions? Yes, indeed one finds a way to compute the cross product of two vectors using only five multiplications.

s1 = a1 - a2

s2 = b2 - b3

s3 = a1 - a2 + a3

s4 = b1 - s2

m1 = a3 * b1

m2 = a1 * b3

m3 = s1 * s2

m4 = b2 * s3

m5 = a2 * s4

t1 = m3 + m2

c1 = t1 - m4

c2 = m1 - m2

c3 = t1 - m5
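A consistent five-multiplication scheme of this shape (following the s/m/t/c naming pattern; the function name `cross_5` is my own) can be checked against the direct six-multiplication cross product:

```python
def cross_5(a1, a2, a3, b1, b2, b3):
    """Cross product (a1,a2,a3) x (b1,b2,b3) with 5 multiplications."""
    s1 = a1 - a2
    s2 = b2 - b3
    s3 = a1 - a2 + a3
    s4 = b1 - s2
    m1 = a3 * b1          # multiplication 1
    m2 = a1 * b3          # multiplication 2
    m3 = s1 * s2          # multiplication 3
    m4 = b2 * s3          # multiplication 4
    m5 = a2 * s4          # multiplication 5
    t1 = m3 + m2
    c1 = t1 - m4          # a2*b3 - a3*b2
    c2 = m1 - m2          # a3*b1 - a1*b3
    c3 = t1 - m5          # a1*b2 - a2*b1
    return c1, c2, c3
```

For instance, (2, 3, 4) x (5, 6, 7) evaluates to (-3, 6, -3), agreeing with the three goal expressions.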

(c) A more complicated but interesting example is Strassen's algorithm [3] for multiplying two 2 x 2 matrices, which involves 7 multiplications and 18 additions, a saving of one multiplication at the expense of 14 additions over ordinary matrix multiplication. This result is more than a mathematical curiosity, since it yields an asymptotically faster algorithm for general matrix multiplication. The algorithm was later improved by Winograd, who reduced the number of additions from 18 to 15, thereby reducing the constant in the running time of matrix multiplication. Although these results are easily verifiable, the lack of an apparent structure and theory behind such methods leaves doubts about their optimality.

2 Background

Much of the effort invested in speeding up practical matrix multiplication implementations has concentrated on the well-known standard algorithm, with improvements seen when the required inner products are computed in various ways that are better suited to the machine architecture. Much less effort has gone into the investigation of alternative algorithms whose asymptotic complexity is less than the O(n^3) operations required by the conventional algorithm to multiply n x n matrices. One such algorithm is Strassen's algorithm, introduced in 1969, which has a complexity of O(n^2.807).

Since Strassen, in his now famous paper from 1969, showed that the O(n^3) asymptotic complexity of matrix multiplication can be reduced to O(n^2.807), the exponent has been further decreased, and the currently known lowest order, due to Coppersmith and Winograd (1990), is O(n^2.376). These results have been achieved by first studying how matrix multiplication depends on bilinear, and later also trilinear, combinations of factors. But while the former lead to smaller reductions in the exponent, the latter suffer from a constant factor so large that it renders them unusable for all practical sizes of matrices. Also, the trilinear algorithms apply exclusively to certain subsets of the natural numbers and are not immediately applicable to floating-point calculations.

Many researchers have studied the conventional O(n^3)-time algorithm in an attempt to reduce its unaffordable cubic cost. Theoretically speaking, the lower bound for multiplying two n x n matrices is O(n^2), since there are 2n^2 inputs that have to be examined and n^2 outputs to be computed. A general and optimal algorithm that achieves this lower bound does not yet exist; however, a special case has been reported recently.

3 Bilinear Algorithms

3.1 Introduction to Bilinear algorithms

Given a pair of m x n and n x p matrices X = [xij] and Y = [yjk], compute X * Y in the following order. First evaluate the linear forms in the x-variables and in the y-variables,

Lq = sum over i,j of fijq * xij

L'q = sum over j,k of f'jkq * yjk

then the products Pq = Lq * L'q for q = 0, 1, ..., M-1, and finally the entries of X * Y, as the linear combinations

(X * Y)ik = sum over q = 0, ..., M-1 of f''kiq * Pq

where fijq, f'jkq, f''kiq are constants such that the above equations are identities in the indeterminates xij, yjk, for i = 0, 1, ..., m-1; j = 0, 1, ..., n-1; k = 0, 1, ..., p-1. M, the total number of multiplications of Lq by L'q, is called the rank of the algorithm, and the multiplications of Lq by L'q are called the bilinear steps of the algorithm, or bilinear multiplications.
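As an illustration of this framework, a bilinear algorithm is fully specified by three coefficient tables; the Python sketch below (function and variable names are my own) evaluates any such algorithm, instantiated here with the rank-8 ordinary algorithm for 2 x 2 matrices, where each Pq is a single product xij * yjk:

```python
def bilinear_multiply(x, y, F, G, H):
    """Evaluate a bilinear algorithm of rank M = len(F).

    x, y : flat entry lists of the two input matrices.
    F[q], G[q] : coefficients of the linear forms Lq and L'q.
    H[e][q] : coefficients combining the products Pq into output entry e.
    """
    M = len(F)
    L  = [sum(f * xv for f, xv in zip(F[q], x)) for q in range(M)]
    Lp = [sum(g * yv for g, yv in zip(G[q], y)) for q in range(M)]
    P  = [L[q] * Lp[q] for q in range(M)]  # the M bilinear steps
    return [sum(h * p for h, p in zip(row, P)) for row in H]

# Rank-8 ordinary algorithm for 2 x 2 matrices.
# Entry order: x = [x00, x01, x10, x11], y = [y00, y01, y10, y11].
F = [[1,0,0,0], [0,1,0,0], [1,0,0,0], [0,1,0,0],
     [0,0,1,0], [0,0,0,1], [0,0,1,0], [0,0,0,1]]
G = [[1,0,0,0], [0,0,1,0], [0,1,0,0], [0,0,0,1],
     [1,0,0,0], [0,0,1,0], [0,1,0,0], [0,0,0,1]]
H = [[1,1,0,0,0,0,0,0],   # (X*Y)00 = P0 + P1
     [0,0,1,1,0,0,0,0],   # (X*Y)01 = P2 + P3
     [0,0,0,0,1,1,0,0],   # (X*Y)10 = P4 + P5
     [0,0,0,0,0,0,1,1]]   # (X*Y)11 = P6 + P7
```

A rank-7 algorithm such as Strassen's fits the same framework by replacing F, G, and H with its seven bilinear forms.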

Three main bilinear algorithms are relevant for this study: the first is the block formulation of ordinary matrix multiplication, the second is Strassen's algorithm, and the last is Winograd's variant of it. The usual number of scalar operations (i.e., the total number of additions and multiplications) required to perform n x n matrix multiplication is:

M(n) = 2n^3 - n^2

(i.e., n^3 multiplications and n^3 - n^2 additions). However, Strassen discovered how to multiply two matrices in

S(n) = 7n^(lg 7) - 6n^2

scalar operations.

3.2 Ordinary matrix multiplication

Asymptotic complexity: 2n^3 operations.

Each recursion step (blocked version): 8 multiplications, 4 additions.

( a11 a12 )   ( b11 b12 )   ( c11 c12 )
( a21 a22 ) x ( b21 b22 ) = ( c21 c22 )

c11 = a11 * b11 + a12 * b21

c12 = a11 * b12 + a12 * b22

c21 = a21 * b11 + a22 * b21

c22 = a21 * b12 + a22 * b22
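The blocked recursion (8 half-size multiplications and 4 block additions per step) can be sketched in Python for matrices whose size is a power of two (helper names `quad` and `add` are my own):

```python
def block_multiply(A, B):
    """Recursive blocked multiplication of n x n matrices (lists of rows),
    n a power of two: 8 recursive products, 4 block additions per step."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, r, c):
        # extract the h x h block with top-left corner (r, c)
        return [row[c:c + h] for row in M[r:r + h]]
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    a11, a12, a21, a22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    b11, b12, b21, b22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    c11 = add(block_multiply(a11, b11), block_multiply(a12, b21))
    c12 = add(block_multiply(a11, b12), block_multiply(a12, b22))
    c21 = add(block_multiply(a21, b11), block_multiply(a22, b21))
    c22 = add(block_multiply(a21, b12), block_multiply(a22, b22))
    top = [r1 + r2 for r1, r2 in zip(c11, c12)]  # stitch blocks back together
    bot = [r1 + r2 for r1, r2 in zip(c21, c22)]
    return top + bot
```

Since every step performs 8 products of half size, this reproduces exactly the 2n^3 operation count of the conventional algorithm; the blocking only changes the order of the computation.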

3.3 Strassen's matrix multiplication

Strassen's formulation makes use of the fact that while the costs of additions and multiplications might be of the same order in the scalar case, multiplications are considerably more costly for increasing sizes of matrices. By utilizing a mapping of bilinear combinations that trades one multiplication for fourteen additions/subtractions in each recursion step, the asymptotic complexity is reduced.

Asymptotic complexity: 4.695n^2.807 operations.

Each recursion step: 7 multiplications, 18 additions/subtractions.

( a11 a12 )   ( b11 b12 )   ( c11 c12 )
( a21 a22 ) x ( b21 b22 ) = ( c21 c22 )

p1 = (a11 + a22) * (b11 + b22)

p2 = (a21 + a22) * b11

p3 = a11 * (b12 - b22)

p4 = a22 * (-b11 + b21)

p5 = (a11 + a12) * b22

p6 = (-a11 + a21) * (b11 + b12)

p7 = (a12 - a22) * (b21 + b22)

c11 = p1 + p4 - p5 + p7

c12 = p3 + p5

c21 = p2 + p4

c22 = p1 - p2 + p3 + p6
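One level of this recursion, applied to scalar entries, can be checked numerically with a short Python sketch (the function name `strassen_2x2` is my own):

```python
def strassen_2x2(a, b):
    """One step of Strassen's scheme on 2 x 2 matrices:
    7 multiplications, 18 additions/subtractions."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b
    p1 = (a11 + a22) * (b11 + b22)
    p2 = (a21 + a22) * b11
    p3 = a11 * (b12 - b22)
    p4 = a22 * (b21 - b11)
    p5 = (a11 + a12) * b22
    p6 = (a21 - a11) * (b11 + b12)
    p7 = (a12 - a22) * (b21 + b22)
    c11 = p1 + p4 - p5 + p7
    c12 = p3 + p5
    c21 = p2 + p4
    c22 = p1 - p2 + p3 + p6
    return [[c11, c12], [c21, c22]]
```

In the full algorithm the entries a11, ..., b22 are themselves matrix blocks and each pi is a recursive call, which is what yields the O(n^lg 7) = O(n^2.807) bound.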