Gurvan Huiban

UNIVERSIDADE FEDERAL DE MINAS GERAIS
DEPARTAMENTO DE CIÊNCIA DA COMPUTAÇÃO
PÓS-GRADUAÇÃO EM CIÊNCIA DA COMPUTAÇÃO

Projeto e Análise de Algoritmos
Quarto Trabalho Prático

Belo Horizonte
13 de setembro 2002

This work will be available on the Internet: http://www.dcc.ufmg.br/ghuiban/

Presentation of the problem

Matrices are convenient data representations, used in various types of program. One of the key operation related with their use is multiplication. However, this operation can be computationally heavy, and can require a significant time, especially if matrices involved are large¹.

Due to its nature and the structure of a matrix, this operation can be easily divided in independent parts, in order save time by parallelizing it.

This topic of parallelizing matrices multiplication will be addressed in this work. For reasons of simplicity, we will only deal with square matrices, filled at random², but we believe that our results can be extended to any kind of matrices ( $n \times m$ ) that can not be considered as sparse³. This last point is important, since it exists specific strategies for this context. Our aim is only to deal with ``basic'' matrix multiplication.

Theoretical aspects

Remind that a matrix of size $n \times n$ contains $n^{2}$ elements. The product of two matrices and of such size is a matrix of size $n \times n$ . In other words, to compute the product $A \times B$ , we have to compute the value of $n^{2}$ elements. Using standard formulas, this requires $n^{3}$ multiplications. However, Strassen discovered a base algorithm to compute the product of $2\times 2$ matrices with only seven multiplications[4]. This technique can be recursively extended to large matrices via a tensor product construction. In this case, large matrices are broken down recursively by partitioning into quarters, sixteenths, etc. This leads to a number of multiplications inferior to $O(n^{\log_{2}7})$ . The better upper bound known today is approximately $O(n^{2.376})$ [2]. However the best known lower bound are linear in $n^{2}$ [3]. The complexity of the problem is still an open question.

Our study and implementation does not use such ``advanced strategies'', and we will only consider the following problem: ``Matrices multiplication without the use of advanced strategies''.

Lower bound for implementation with PRAM-CREW model

In a PRAM⁴ model, an infinite of processors are available. All of them have access to a global memory, and the communication time is considered as null. However, before using a processor, it has to be awaken.

We propose an algorithm for the PRAM-CREW⁵ model that runs in linear-time $\Theta(n)$ , using $n^{2}$ processors $(p_{0}, \ldots, p_{n^{2}-1})$ . Consequently, this algorithm has a cost $O(n^{3})$ . This algorithm is the following:

$\begin{algorithm} % latex2html id marker 58\caption{PRAM algorithm for matrix... ...\ENSURE $R = A \times B$, $n \times n$\ matrix \end{algorithmic}\end{algorithm}$

Is algorithm is a CREW algorithm, since all processors read concurrently values of and , but there is no concurrent writing (each processor writes values in a specific memory place).

This algorithms uses $n^{2}$ processors. The time-complexity is the following: spawning $n^{2}$ processors requires $O(\lceil log(n^{2})\rceil)$ pulses; time-complexity of line 4 is $\Theta(1)$ , and time-complexity of lines 3 to 5 is $\Theta(n)$ . A single processor runs only instructions 3 to 5. Consequently, the time complexity of the part executed by a processor is $\Theta(n)$ . As all processors runs concurrently, the time-complexity of the whole algorithm is $O(\lceil log(n^{2})\rceil) + \Theta(n) = \Theta(n)$ .

The cost of a PRAM algorithm is the product of its time-complexity by the number of processors used. This leads us to a cost of $n^{2}\Theta(n) = \Theta(n^{3})$ , which is equal to the time-complexity of the sequential algorithm for this problem. Hence, it is an optimal implementation.

Lower bound for implementation with SIMD model

In the SIMD⁶ architecture, all processors execute simultaneously the same instruction on its own data item (on the contrary of the MIMD⁷ where each processor has its own program acting on its own data). There are various possible architectures for SIMD model. In [1], there are lower bound for this problem in three SIMD architecture: 2D mesh model, Hypercube model and Shuffle-exchange model.

2D mesh SIMD model

On the 2D mesh SIMD model, it is possible to compute the multiplication of two $n \times n$ matrices in time $\Theta(n)$ , using $n^{2}$ processors. The only fact that there are $n^{3}$ multiplications to perform and $n^{2}$ processors gives us a time complexity $\Omega(n)$ . However, since it is possible to distribute all data and perform required computation in , it is possible to achieve a time complexity of $\Theta(n)$ . An algorithm of such complexity is also proposed.

Hypercube model

Given an hypercube of $n^{3}$ processors, it is possible to compute the result of two matrices of $n \times n$ elements is time $\Theta(\log(n))$ . An algorithm is also given, and it has three phases. The first one is a broadcast phase, the second one performs the multiplications, and during the third one, results are gathered.

Shuffle-exchange model

A shuffle exchange network has a topology containing nodes, each of which is labeled by a unique -bit integer. If two nodes have labels $[i_{l}\ldots i_{0}]$ and $[j_{l}\ldots j_{0}]$ , then and are connected if $i_{k} = j_{k}$ for $1 \leqslant k <(l-1)$ and $i_{0} \neq j_{0}$ , or if is a left or right cyclic shift of .

Given a shuffle-exchange SIMD model, of $n^{3}$ processors, it is possible to compute the result of two matrices of $n \times n$ elements is time $\Theta(\log(n))$ . An algorithm reaching such a time-complexity can be obtained by simulating the algorithm for the hypercube model.

Complexity and comparison of different possible implementations for multi-computers architecture

Mainly, there are two approaches for parallelizing this problem in multi-computers environment. The first one uses block-matrix, and the second one is row-column oriented.

Block-oriented algorithm

This approach uses the fact that a matrix can be considered as a set of blocks. Figure 1 shows an example of such a decomposition.

**Figure 1:** Decomposition of a matrix in blocks
$\includegraphics[height=3cm]{blocks.eps}$

Block matrices multiplication is performed analogously to scalar matrix multiplication; each occurrence of a scalar product is replaced by an occurrence of a block multiplication. For instance, let and be two matrices divided as follow:

$\begin{displaymath} A = \left( \begin{array}{ccc} A_{1,1} & A_{1,2}\\ A_{2,1... ... B_{1,1} & B_{1,2}\\ B_{2,1} & B_{2,2} \end{array} \right) \end{displaymath}$

Note that blocks are not necessarily square. If the sizes of the blocks are compatible, the result of $R = A \times B$ can be expressed:

$\begin{displaymath} R = \left( \begin{array}{cc} R_{1,1} = A_{1,1} \times B_{1... ... \times B_{1,2} +A_{2,2} \times B_{2,2}\\ \end{array}\right) \end{displaymath}$

An implementation using this principle can be done: every computer multiplies two sub-matrices during each iterations. For instance, suppose that, with four computers, we want to evaluate $A \times B$ , where:

$\begin{displaymath} A = \left( \begin{array}{rrrr} 1 & 0 & 2 & 3\\ 4 & -1 & ... ... -2\\ 3 & -1 & 0 & -2\\ 1 & 0 & 4 & 5 \end{array} \right) \end{displaymath}$

We define the following blocks:

$\begin{eqnarray*} A_{1,1} = \left( \begin{array}{rr} 1 & 0\\ 4 & -1 \end{arra... ...= \left( \begin{array}{rr} 0 & 2\\ 4 & 5 \end{array}\right) \end{eqnarray*}$

We have to compute $R_{1,1}$ , $R_{1,2}$ , $R_{2,1}$ , $R_{2,2}$ , defined above. As four computers are available, we can divide tasks as follow: each one deals with a $R_{i,j}$ . Each $R_{i,j}$ is the sum of two block-matrices product. First, we will compute $C_{i,j} = A_{i,1}\times B_{1,j}$ ; that is the first member of the sum:

$\begin{eqnarray*} & & \left( \begin{array}{c\vert c} C_{1,1} = A_{1,1} \times ... ...r} -10 & 12\\ 2 & -1 \end{array} \right) \end{array}\right) \end{eqnarray*}$

Then, each computer will compute the second part of the sum; that is $D_{i,j} = A_{i,2} \times B_{2,j}$ :

$\begin{eqnarray*} & & \left( \begin{array}{c\vert c} D_{1,1} = A_{1,2} \times... ...y}{rr} 8 & 2\\ 0 & 0 \end{array} \right) \end{array}\right) \end{eqnarray*}$

Each computer sums $R_{i,j} = C_{i,j} + D_{i,j}$ , and sends the result:

$\begin{displaymath} R = \left( \begin{array}{c\vert c} R_{1,1} = \left( \be... ...} -2 & 14\\ 2 & -1 \end{array} \right) \end{array}\right) \end{displaymath}$

The result overall result is:

$\begin{displaymath} R = \left( \begin{array}{rrrr} 8 & -1 & 14 & 16\\ 9 & 7 ... ... 7 & 14 & -2 & 14\\ -9 & -9 & 2 & -1\\ \end{array}\right) \end{displaymath}$

It is possible to generalize this approach: dividing matrices into blocks and sending those blocks to each computer. Each one calculates the resulting block, and sends the result.

Row-column oriented algorithm

A second possible approach is to use a row-column decomposition. Figure 2 illustrates the way two matrices are multiplied.

**Figure 2:** Multiplication of matrices
$\includegraphics[height=3cm]{rows.eps}$

As it can be seen, to evaluate an element of the result, a computer need a row and a column. Hence, one strategy to parallelize the problem could be send a set of rows of and a set of columns of to each computer. Each one will be able to calculate a set of elements of the solution matrix, and will send its results. Our algorithm (detailed in section 3.2) is based on this approach. An example of execution with the same set of data as the one above can be found section 3.2.

Note that the row-column approach is a special case of block matrix decomposition (each row of and each column of represents a block).

Comparison of those approaches

Actually, those two approaches are very similar, since there is only one way to compute effectively each element of the result. The only thing that change is the computation effectively done by each computer, and eventually the way data are transmitted from a computer to an other one if the network has a special topology. For instance, according to [1], the block matrix approach can be efficiently implemented if the network of computers is a mesh, and the row-column approach is efficient in the case of a ring. In our case, the network does not have particular topology⁸, and we believe that performances would be very similar in a case or another.

Implementation

We will describe our implementations realized for matrices multiplication.

Sequential program for mono-processor computer

To evaluate the performance of our parallelized algorithm and its implementation, we need a reference, which will be a sequential algorithm for mono-processor computer, based on the ``basic'' matrices multiplication algorithm.

Description of the program

A program performing sequential matrices multiplication using the classical algorithm has been written in C.

It generates two matrices and of a given size, fills them with integers, randomly chosen between zero and the value of parameter MAX_MATRIX, and prints them on standard output. Then, it computes the multiplication of by and stores it in . Finally, it prints .

The algorithm used for matrices multiplication is the ``classical'' algorithm. It can be written as follow:

$\begin{algorithm} % latex2html id marker 356\caption{\lq\lq basic'' algorithm for ... ...E $R = A \times B$, matrix of size $n \times n$ \end{algorithmic}\end{algorithm}$

The time complexity of such an algorithm is naturally $O(n^{3})$ (three loop of iterations).

Note that for very large matrices (for and higher for architecture Linux for instance), memory allocation fails⁹.

Organization

The program is divided in three files: file sequential.c contains the main program, file functions-sequential.c contains used functions, and file headers-sequential.h contains their headers.

More precisely, used functions are:

main: main function.
gen_matrix: randomly generates a matrix of size $n \times n$ . Input: (pointer on matrix in memory) and . There is no output.
prn_matrix: prints a matrix of size $n \times n$ on standard output. Input: a pointer on the matrix to be printed and . There is no output.
solve: performs matrices multiplication. Input: pointers on matrices , and , and . There is no output. It uses algorithm 2.
rnd_int: returns an integer between zero and . Input: .
initrandom: initializes the random generator. There is neither input nor output.

The structure of the program is showed on figure 3.

**Figure 3:** Structure of the program
$\includegraphics[width=16cm]{structure_sequential.eps}$

Used data structure are mainly arrays of integers.

Input and output

The only input of the program is the command line used to run it: It takes as a parameter.

During its execution, the program writes on the screen , , and a message saying that no problem was encountered during computation. An example of execution is shown below:

#> ./sequential 4
A generated
    9    40    47    40
    5    49     1     2
   46    20    20    47
   45    36    47    13

B generated
   27    29    42    14
   34     5    24    28
   45     0    30    19
   32    47    40    42

Computation of R = A x B
 4998  2341  4348  3819
 1910   484  1496  1545
 4326  3643  4892  3558
 4970  2096  4684  3077

No error was encountered

Utilization

The program is called sequential. It takes one and only one argument: the value of . If the number of argument is not correct, an error message will be written. For instance, to generate and multiply two matrices of size $77 \times 77$ , the program has to be executed as follow:

#> ./sequential 77

Complexity

As a matrix contains $n \times n$ elements, the generation of a matrix and the output of a matrix is $O(n^{2})$ . As previously noticed, time complexity of matrices multiplication is $O(n^{3})$ . Consequently, the time complexity of whole the program is $\Theta(n^{3})$ .

Tests

As the program is very simple, it is easy to check if the result is correct or not; and all tests realized gave the expected result.

Parallel program for multi-computers

The second program implemented for computing matrices multiplication uses different computers, linked by a TCP-IP network. The architecture chosen is master - slave: a master program generates the matrices, sends informations to slaves and receives results of computation; and a slave program, executed at the same time by one or many computers receives instructions from the master, does effectively computations and sends results. Figure 4 represents the communication scheme between computers: the master centralizes all communications.

**Figure 4:** Communication scheme in master-slave architecture
$\includegraphics[width=5cm]{communication.eps}$

For this work, our main idea was to make a program as flexible as possible. By flexible, we mean:

able to work in heterogeneous environment.
without any requirement on the number of computers used¹⁰ or on the hardware available.

Those constraints raised various problems such that dealing with computers of different architecture or velocity. This leads us to the concept of load balancing: the computation is divided between computers depending on their efficiency.

There are various way to divide tasks between slaves. This repartition can be fixed or dynamic, equal or not, and so on. We call granularity the complexity of a task assigned to a slave between two instructions from the master program. For a given problem, a large granularity reduces communications between slaves and master. Naturally, the larger is the granularity, the more difficult it is to divide computations in a ``fair'' way (a computation time more or less equivalent for each slave). An unfair repartition implies that while some computers are working others are waiting. For the problem of matrices multiplication, we believe that it is a loss of efficiency¹¹. Figure 5 illustrates the influence of the granularity on the computation time: with high granularity, there are very few communications between slave and master, but we may ``waste'' a lot of computation resources. With smaller granularity, the number of communications increases, both the load balancing and the overall execution time are better. With a very small granularity, all time gained by a good load balancing is lost due to the very high number of communications.

**Figure 5:** Influence of granularity on the execution time
$\includegraphics[height=20cm]{granularity.eps}$

There are various way to fairly divide tasks. For instance, one could statically divide tasks depending on the efficiency of each computer (which requires a good knowledge of available resources). Since our objective was to have a flexible program, this repartition had to be dynamic and the granularity has to be relatively low. Between two interactions with the master, each slave computes values (remind that there are $n^{2}$ elements to calculate). In other words, the master will divide a set of computations between computers. As generally, $p \ll n$ , we believe that the granularity is small enough to provide a good load balancing.

Our approach is row-column oriented. Each slave will receives rows of and columns of , to compute elements of . We have chosen to send to each slave the entire matrix . With such data, for each row of transmitted to a slave, it will be able to compute values.

Algorithms applied by the master and slaves are the followings:

$\begin{algorithm} % latex2html id marker 421\caption{Algorithm applied by the... ...E $R = A \times B$, matrix of size $n \times n$ \end{algorithmic}\end{algorithm}$

$\begin{algorithm} % latex2html id marker 431\caption{Algorithm applied by sla... ...DFOR \STATE sends row $i$\ of $R$. \ENDWHILE \end{algorithmic}\end{algorithm}$

An example of execution of our algorithm using 3 slaves, with matrix and used in the example section 2.3.1:

$\begin{displaymath} A = \left( \begin{array}{rrrr} 1 & 0 & 2 & 3\\ 4 & -1 & ... ... -2\\ 3 & -1 & 0 & -2\\ 1 & 0 & 4 & 5 \end{array} \right) \end{displaymath}$

Figures from 6 to 11 show the progress of a possible execution of the algorithm

Figure 6: Master sends

to all slaves
Figure 7: Master sends different rows for each slave

$\includegraphics[height=5cm]{exemple_1.eps}$

$\includegraphics[height=5cm]{exemple_2.eps}$

At this point of the computation, all slaves keep in memory matrix and have received a row of (row 0 for slave 1, row 1 for slave 2 and row 2 for slave 3.

Figure 8: Each slave performs its local computation
Figure 9: Slave 2 has finished its computation; it sends its result and receives another row to compute

$\includegraphics[height=5cm]{exemple_3.eps}$

$\includegraphics[height=5cm]{exemple_4.eps}$

Each slave will do the following computation:

$\begin{eqnarray*} \text{Slave 1} & : & \left( \begin{array}{cccc} 8 & -1 & 1... ...-2\\ 3 & -1 & 0 & -2\\ 1 & 0 & 4 & 5 \end{array} \right)\\ \end{eqnarray*}$

Suppose that slave 2 finishes first its computation. It will send its result and receive another computation to do.

Figure 10: Slaves 1 and 3 have finished their computation: they send their results and receive a message marking the end of computation
Figure 11: Slave 3 has finished its computation: it sends its results and receive a message marking the end of computation

$\includegraphics[height=5cm]{exemple_5.eps}$

$\includegraphics[height=5cm]{exemple_6.eps}$

Slave 2 will do the following computation:

$\begin{displaymath} \text{Slave 2} \left( \begin{array}{cccc} -9 & -9 & 2 & -1... ...& -2\\ 3 & -1 & 0 & -2\\ 1 & 0 & 4 & 5 \end{array}\right) \end{displaymath}$

In the meanwhile, the others slaves have finished their computation. They send their results, but there is no more computation to do. Consequently, they receive a message marking the end of the execution.

Slave 2 sends its results and receives a message marking the end of the execution. Master has received all results from all slaves and build the resulting matrix:

$\begin{displaymath} R = \left( \begin{array}{rrrr} 8 & -1 & 14 & 16\\ 9 & 7 ... ... 7 & 14 & -2 & 14\\ -9 & -9 & 2 & -1\\ \end{array}\right) \end{displaymath}$

As it can be seen, such an algorithm meets our objectives. To save time, it could be useful to desynchronizes the part of the master program that deals with each slave. To do so, the master program creates threads¹² associated with each slave. Consequently, when a slave finishes its computation, it sends its results and receives another work to do without having to wait for another slave to finish its current computation. This concurrent execution may generate problems of concurrent access to the same data (for instance: a thread wants to increment a variable and at the same time, another thread wants to read it). To solve this problem, we used mutex on critical parts of the code. A mutex is a flag preventing two threads from executing a part of code at the same time. A mutex has been defined to protect the incrementation of the row to compute; and to protect the update of matrix .

Slaves and master communicate through sockets. A socket is a kind of network ``pipe'' between two computers, and data can be sent and received through it. To create a socket, a program must know the IP address of the other computer and define a port to communicate with A. If the communication port is not occupied and the request is accepted, a socket will be created, and both computers will be able to send and receive data through it.

Those programs (master and slave) have been implemented in C. Data structures used are mainly arrays of integers.

Organization

Master program

The master program randomly generates and prints matrices and , creates threads, sends data to slaves, receives results and finally prints . The structure of the program is very similar to the sequential algorithm. Function solve is different, and a function has been added to read IP addresses of slaves. More precisely, the structure of the program is the shown on figure 12.

**Figure 12:** Structure of the master program
$\includegraphics[width=16cm]{structure_master.eps}$

Functions already described in section 3.1.2 will not be mentioned again.

solve: calls readip, creates threads and call (for each thread) function threadsolve.
readip: reads the file containing IP address of slaves.
threadsolve: function executed by each threads. Manage computation realized by the associated slave.

The program is divided in three files: master.c contains the main program, headers_master.h contains the definition of the prototypes and functions_master.c contains used functions.

Slave program

The slave program is much simpler than the master program. It only creates a socket, waits for data from the master program (matrix and rows of ), computes a row of and sends it. The program is divided in the three usual files slave.c, headers_slave.h and functions_slave.c.

A unique function compute_result is called by the main program and computes the row resulting of the multiplication of previously received row of and (e.g. lines 4 to 9 of algorithm 4). The main program creates the socket, implements the algorithm 4 (replacing lines 4 to 9 by a call of function compute_result).

Input and output

We will not consider communication between master and slave, since it already has been described.

To work, the main program requires three elements: the size of the problem (e.g. the value of ), the number of computers to be used for the computation, and the IP addresses of the slaves. The two first parameters are given by command line; and IP addresses are read in a file called ip_address.txt. In file headers_master.h there is parameter NUM_HOSTS that defines the maximum number of possible slaves. The file ip_address.txt must contains NUM_HOSTS addresses IP (even if they are not used).

The main program writes on standard output matrices and , and the result of the computation.

The slave programs only prints on the screen the size of the problem and the number of rows it processed.

Utilization

Since libraries are different from an architecture to another, make linux builds the program for Linux environment, and make sun builds it for sun architecture.

To use the program, it is necessary to:

determines slaves, their address IP, and checks that the file ip_address.txt contains the right IP addresses.
run program slave (or slave-sun, depending on the architecture) on each slave.
run program master (or master-sun), followed by the value of and the number of computers to use. The master will try connect to the first addresses IP of the file ip_address.txt.

Note that a physical computer can be run at the same time the master process and a slave process (but it can run only one slave process).

An example of computation can give:

on master computer:

lamborghini#> ./master 5 3


 BEGINNING OF COMPUTATION

A generated
    2    10    17    41    25
    9    44    45    14     3
   42    42    31    31    32
   33    24    27    24    15
   10    33    45     3    11

B generated
   15    23    34     5     2
   12     7    12    29    49
   37    39    44    33     4
   48    25    46    30     6
   28    13    31     6    37

 3447  2129  3597  2241  1733
 3084  2659  3551  3244  2549
 4665  3660  5714  3573  3636
 3354  2775  4167  2562  2049
 2663  2434  3195  2648  2242

lamborghini#>

on slave 1:

lamborghini#> ./slave


BEGINNING OF COMPUTATION

Size of the problem: 5
Number of computation processed: 2
lamborghini#>

on slave 2:

audi#>  ./slave


BEGINNING OF COMPUTATION

Size of the problem: 5
Number of computation processed: 2
audi#>

on slave 3:

porsche#> ./slave-sun


BEGINNING OF COMPUTATION

Size of the problem: 5
Number of computation processed: 1
porsche#>

Tests

The function computing a row is quite simple, and it is not really difficult to check it. Moreover, during development phase, after computing the solution using parallel implementation, the master program computed the solution with the sequential algorithm and checked if results were different. For instance, this allows us to find an error in the use of mutex.

Problems encountered

One of the objective was to make a program working in an heterogeneous environment. But different architectures may use different encodings for representing integers. For instance, the encoding of an integer differs on a PC under Linux and on a Sun Station. Luckily, there is a standard encoding for sending integer on a network, and there exist functions for converting an integer to this standard network encoding, or the contrary. Those functions are htonl¹³ and ntohl¹⁴. The use of those functions allowed us to ``forget'' the heterogeneity of used computers.

Another problem encountered has been some ``surprising'' performance results: on various tests realized with Sun Stations, the execution time (both for sequential and parallel algorithm) was lower for than for (with the same computer on the same conditions). Deleting the -O2 compilation flag (e.g. reducing the level of optimization performed by compiler) allowed us to find more ``coherent'' results. However, We do not know if this was the origin of the problem.

But the main problem encountered has been to find out a granularity allowing us to get low execution times and good load balancing. Actually, the algorithm presented above is the second algorithm we implemented. The first one had a granularity much lower, and time dedicated to communication between slaves and master was far too high to have good performances.

Complexity

To study the complexity of this set of program, let us define as the size of the problem, as the number of used computers

First, we will consider the message complexity, that is the amount of communications realized during a computation.

Communication complexity

In this paragraph, we will only determine the number of integers sent on the network. We will neglect the encapsulation of data into TCP packets.

First, the master process sends to each slave the size of the problem, and matrix . That is, it sends $n^{2}+1$ integers to each slave. Since there are slaves, the message complexity of this parts is $p(n^{2}+1)$ . To compute a result, the master sends to a slave an integer, and row of . The slave performs a computation and returns a row. That is, to get a row of the final matrix, integers are sent through the network. There are different rows; and the number of integers sent on the network during computation is . Consequently, the overall number of integers sent on the network is $p(n^{2}+1)+n(2n+1) = \Theta((p+2)n^{2}) = \Theta(pn^{2})$ .

Time complexity

Let us suppose that the time complexity of an operation on matrices is $c_{1}$ , and the time-complexity of sending an integer is $c_{2}$ .

If we consider that computation takes place sequentially (and not concurrently as it is effectively the case), the overall execution time is constituted by the communication time and the computation time. The time complexity of the computation of a row (line 4 to 9 of algorithm 4 is naturally $c_{1}n^{2}$ . There are rows to compute. So, the computation time is $c_{1}n^{3}$ . The communication cost is $c_{2}p(n^{2}+1)+n(2n+1)$ . Consequently, the computation time is $O(c_{1}n^{3} + c_{2}pn^{2})$ .

If we consider concurrent execution, overall execution time is equal to execution time of $s_{k}$ , where $s_{k}$ is the slave performing the highest number of computations. Let us suppose that $s_{k}$ computes rows. To do so, $s_{k}$ will first receive $n^{2}+1$ elements (size of the problem and matrix ). Then, it will receive elements, performs the computation of a row, and sends the result ( elements). Those operations will be repeated times. Hence, the communication time is $c_{2}\left(n^{2}+1 + k(2n+1)\right) \simeq c_{2}\left(n^{2} + 2kn\right)$ , and the computation time is $c_{1}kn^{2}$ . In other words, the time complexity of the process is $\Theta\left(c_{1}kn^{2} + c_{2}(n^{2} + 2kn)\right) = \Theta\left(c_{1}kn^{2} + c_{2}n^{2}\right)$ . If all computers are identical, $k = \frac{n}{p}$ . It holds that the time complexity of a computation is $O\left(c_{1}\frac{n^{3}}{p} + c_{2}n^{2}\right)$ .

Practical results

To evaluate performances of our implementation, and profits of parallelization, tests have been realized.

Homogeneous computers

First, experiments have been performed with a set of identical computers. Those computers are Sun Station Ultra-5, equipped with a 360MHz Ultra-SPARC-III processor and 128MB of memory. Those computers are connected to the network through 10Mbps interfaces. It can be considered that $c_{1} \ll c_{2}$ (where $c_{1}$ and $c_{2}$ are constants introduced in section 3.2.6): sending an integer from a computer to another is much longer than locally performing an operation.

A test consist in computing the product between matrices of size $n \times n$ and with computers. Tests have been realized for different values of and . Note that matrices used for computation never are the same, as they are randomly generated ``on the fly''. However, it is assumed that the computation time of any matrix of size $n \times n$ filled with integers is the same. Moreover, the output on the screen has been disabled. Figure 13 shows computation times obtained¹⁵ for solving the problem with both sequential and parallel algorithms. Note that one of the computers ran simultaneously a master and a slave process.

**Figure 13:** Execution time in function of , for various values of
$\includegraphics[width=6cm]{ultra_time_size.eps}$

Figures 14 and 15 show the speedup and the efficiency of our implementation. Speedup and efficiency can be expressed as follow:

Speedup = $\frac{\text{execution time for sequential algorithm}}{\text{execution time for parallel algorithm}}$ . It represents profit (in term of time) of the use of parallel algorithm. If Speedup is inferior or equal to one, there is no profit in using the parallel algorithm.
Efficiency = $\frac{\text{speedup}}{\text{number of computers used}}$ . This shows the efficiency of the parallelization: having a speedup of 2 using 32 computers shows a poor profit for parallelization, since much more resources were used. The ideal is to have an efficiency almost equal to 1.

Figure 14: Speedup in function of

, for various values of

Figure 15: Efficiency in function of

, for various values of

$\includegraphics[width=6cm]{ultra_speedup.eps}$

$\includegraphics[width=6cm]{ultra_efficiency.eps}$

On one (and only one) computer, the parallelized version of the algorithm always requires more time than the sequential algorithm; this is normal, since the parallel programs is much complexer than the sequential one.

For low values of (in our experiments, ), the sequential computation is always faster than parallelized ones. For higher values of , it is the contrary: the parallelized version is always faster than the sequential one. This is normal, since the transmission time for an integer is much greater than the computation time. As it can be seen in section 3.2.6, the part of transmission in the overall execution time is ``only'' proportional with $n^{2}$ , and the part of computation is $n^{3}$ . The higher will be, the lower will be the part of communication. If communication part was null, the speedup should be close to .

Figures 14 and 15 confirm this last affirmation: speedup is (almost) always increasing. Efficiency is also increasing with the value of . It is interesting to notice that efficiency decreases with , for $p \geqslant 3$ . This can be explained by mutex used to avoid concurrent writing. If two threads want to modify the same variable at the same time, one thread has to wait for the other thread to finish its operation. Naturally, with a higher number of used computers, this has more chances to happen.

sl.1 sl.2 sl.3 sl.4 sl.5 sl.6 sl.7

num. rows computed 101 107 102 102 94 95 99

	sl.1	sl.2	sl.3	sl.4	sl.5	sl.6	sl.7
num. rows computed	101	107	102	102	94	95	99

Table above gives the number of rows each slave processed for and . It can be seen that values are homogeneous.

Heterogeneous environment

Some tests have also been realized with an heterogeneous network¹⁶. The four first slaves are PC's under Linux, and the four last slaves are Sun station of different generations. The network between all those computers is a 100Mbps network, and at least six of the height slaves are equipped with a 100Mbps network adapter.

**Figure 16:** Execution time in function of
$\includegraphics[width=6cm]{lapo_time.eps}$

Figure 16 shows the execution time for multiplying different size of matrices with four, six and height computers. It can be seen that there are very few differences in the execution time with 4, 6 and 8 computers. Table below gives the number of rows of computed by each slaves (for ).

sl.1 sl.2 sl.3 sl.4 sl.5 sl.6 sl.7 sl.8

num. rows computed 276 376 642 461 127 59 34 25

	sl.1	sl.2	sl.3	sl.4	sl.5	sl.6	sl.7	sl.8
num. rows computed	276	376	642	461	127	59	34	25

The four last slaves compute very few rows of . Consequently, similarity of the execution time can be explained this way: the computation power of the last four slaves is so small (in comparison with the four first slaves) that it does not really have influence on the overall execution time.

Figure 17: Speedup in function of

, in relation with execution time of sequential algorithm on Maserati
Figure 18: Efficiency in function of

$\includegraphics[width=6cm]{lapo_speedup.eps}$

$\includegraphics[width=6cm]{lapo_efficiency.eps}$

Speedup and efficiency are related to the execution time of the sequential algorithm measured on ``Maserati'', with is one of the computers used to make those experiments¹⁷. As it can be seen on figures 17 and 18, speedup and efficiency are quite good: speedup is between two and four (even when only four slaves are considered); and efficiency, when four slaves are considered, is between and . Moreover, both speedup and efficiency are increasing with . Naturally, efficiency decreases with the number of computers used, since, the computation time is almost the same.

It is worth noting that execution time on this system is much lower than execution on homogeneous system. This can be explained by the power of computers, but also by the velocity of the network (communication time between master and slaves is lower).

Conclusion

We have proposed and implemented an algorithm for computing the product of two square matrices of size $n \times n$ . Our implementation meets our objectives: it is flexible and works on heterogeneous systems. Moreover, it seems to be efficient, since a reasonably good speedup is found for ``high'' values of .

On homogeneous system, load is well balanced between all slaves, and the more computer are used, the lower is the execution time.

It has been seen that in heterogeneous environment, the influence of the number of computers on the execution time is not so simple: increasing the number of slaves by adding relatively slow computers does not really have an influence. This leads us to emphasize that, for designing an important application (for instance an industrial application), it is necessary to have accurate informations on the hardware platform used. If the network is quite low, it is better to increase the granularity of the algorithm (e.g. increase the independence of each computers). On the contrary, is the network is very efficient, but computers heterogeneous, it may be useful to decrease the granularity of the repartition to get a better load balancing. If platforms are heterogeneous, does it worths using all computers? Or on the contrary, the profit will be insignificant?

Bibliography

1: Parallel computing: theory and practice.
McGraw-Hill, 1994.
2: D. Coppersmith and S. Winograd.
Matrix multiplication via arithmetic progressions.
Journal of Symbolic Computation, pages 251-280, 1990.
3: R. Raz.
On the complexity of matrix product.
In Proceedings of the 34th ACM Symposium on Theory of Computing, pages 144-151. ACM Press, 2002.
4: V. Strassen.
Gaussian elimination is not optimal.
Numerische Mathematik, pages 354-356, 1969.

Numeric values

Graphs in section 4 have been realized with numeric values that are in tables below.

Execution time on Sun Ultra-5

seq. 1 comp. 2 comp. 3 comp. 4 comp. 5 comp. 6 comp. 7 comp. 8 comp.

100 0.1 10.0 5.1 3.4 2.6 2.2 1.8 1.6 1.5

200 1.3 20.1 10.1 6.9 5.4 4.5 3.9 3.6 3.4

300 4.6 33.1 16.7 11.6 9.0 7.7 6.7 6.5 6.2

400 13.1 52.2 11.7 7.4 6.1 5.8 5.7 5.8 6.2

500 27.0 75.3 21.2 13.3 10.7 9.9 9.7 9.5 10.0

600 52.9 108.5 36.7 23.6 18.6 16.6 15.8 15.8 15.6

700 79.3 140.7 52.5 34.2 27.2 24.0 22.6 21.9 22.0

800 164.7 232.8 98.0 64.4 50.0 43.0 39.2 37.1 36.5

900 182.0 262.3 109.3 71.7 56.4 48.8 45.3 43.2 42.4

1000 255.8 342.1 149.0 99.1 77.2 66.4 59.8 57.3 55.5

Speedup on Sun Ultra-5

1 comp. 2 comp. 3 comp. 4 comp. 5 comp. 6 comp. 7 comp. 8 comp.

100 0.01 0.01 0.02 0.03 0.04 0.05 0.06 0.06

200 0.06 0.12 0.18 0.24 0.28 0.33 0.36 0.38

300 0.13 0.27 0.39 0.51 0.59 0.68 0.70 0.74

400 0.25 1.11 1.77 2.14 2.25 2.29 2.25 2.11

500 0.35 1.27 2.03 2.52 2.72 2.78 2.84 2.7

600 0.48 1.44 2.24 2.84 3.18 3.34 3.34 3.39

700 0.56 1.51 2.31 2.91 3.30 3.50 3.62 3.60

800 0.70 1.68 2.55 3.29 3.83 4.20 4.43 4.51

900 0.69 1.66 2.53 3.22 3.72 4.01 4.21 4.29

1000 0.74 1.71 2.58 3.31 3.85 4.27 4.46 4.60

Efficiency on Sun Ultra-5

1 comp. 2 comp. 3 comp. 4 comp. 5 comp. 6 comp. 7 comp. 8 comp.

100 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01

200 0.06 0.06 0.06 0.06 0.06 0.06 0.05 0.05

300 0.13 0.14 0.13 0.13 0.12 0.12 0.10 0.09

400 0.25 0.56 0.59 0.54 0.45 0.38 0.32 0.26

500 0.35 0.64 0.68 0.63 0.54 0.46 0.41 0.34

600 0.48 0.72 0.75 0.71 0.64 0.56 0.48 0.42

700 0.56 0.76 0.77 0.73 0.66 0.58 0.52 0.45

800 0.70 0.84 0.85 0.83 0.77 0.70 0.63 0.56

900 0.69 0.83 0.84 0.81 0.74 0.67 0.60 0.54

1000 0.74 0.86 0.86 0.83 0.77 0.71 0.64 0.58

Listings of the programs

Program sequential

File `sequential.c`

/* ------------------------------------------------------------ */
/* - PAA - TP 4                                               - */
/* - Sequential matrix multiplication                         - */
/* ------------------------------------------------------------ */
/* - sequential.c                                             - */
/* - Main program                                             - */
/* ------------------------------------------------------------ */
/* - Gurvan Huiban                                            - */
/* - Last modifications: 08/09/2002                           - */
/* ------------------------------------------------------------ */

#include "headers-sequential.h"
#include <stdio.h>
#include <stdlib.h>

// Main program 
int main (int argc, char * argv[])
{
    /* ----------
     * VARIABLES 
     * ---------- */

    int n; // size of the problem.

    int * A; // First matrix
    int * B; // Second matrix
    int * R; // Result of A*B

    /* --------------------
     * CODE OF THE PROGRAM
     * -------------------- */

    // Initializing the random generator
    initrandom();
    
    // Checking validity of the command line
    if (argc != 2)
    {
        printf("USAGE: sequential size_of_matrix\n\n");
        exit(-1);
    }
    
    // Getting the value of n
    sscanf(argv[1], "%i", &n);

    // Memory allocation
    A = (int *) malloc(n*n*sizeof(int));
    B = (int *) malloc(n*n*sizeof(int));    
    R = (int *) malloc(n*n*sizeof(int));

    if (A == NULL || B == NULL || R == NULL)
    {
        printf("ERROR: memory allocation failed\n");
        exit(-1);
    }

    // Generation of matrices
    gen_matrix(A, n);
    printf("A generated\n");
    prn_matrix(A, n);
  
    gen_matrix(B, n);
    printf("B generated\n");
    prn_matrix(B, n);

    // Solving the problem
    printf("Computation of R = A x B\n");
    solve(R, A, B, n);
  
    // Writing the result on the screen
    prn_matrix(R, n);

    // Freeing memory
    free(A);
    free(B);
    free(R);

    printf("No error were encountered\n");
    return(0);
}

File `headers-sequential.h`

/* ------------------------------------------------------------ */
/* - PAA - TP 4                                               - */
/* - Sequential matrix multiplication                         - */
/* ------------------------------------------------------------ */
/* - headers-sequential.h                                     - */
/* - Prototypes                                               - */
/* ------------------------------------------------------------ */
/* - Gurvan Huiban                                            - */
/* - Last modifications: 08/09/2002                           - */
/* ------------------------------------------------------------ */

#ifndef __MAIN_H__
#define __MAIN_H__

/* -------
 * DEFINE
 * ------- */

// Maximum value for matrices values
#define INT_MAX 50

/* -----------
 * PROTOTYPES
 * ----------- */

/* ------------------------------------------------------------ */
/* - function gen_matrix                                      - */
/* ------------------------------------------------------------ */
/* - Randomly generates a matrix of size n*n                  - */
/* - To do so, call function rnd_int                          - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int *) A: pointer on the matrix to be generated         - */
/* - (int) n:   problem dimension                             - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void gen_matrix(int * A, int n);

/* ------------------------------------------------------------ */
/* - function prn_matrix                                      - */
/* ------------------------------------------------------------ */
/* - Prints matrix A on the standard output                   - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int *) A: pointer on the matrix to be printed           - */
/* - (int) n:   problem dimension                             - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void prn_matrix(int * A, int n);

/* ------------------------------------------------------------ */
/* - function solve                                           - */
/* ------------------------------------------------------------ */
/* - Computes R = A*B (A, B and R are n*n matrices)           - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int *) R: pointer on the matrix containing result       - */
/* - (int *) A: pointer on the matrix A                       - */
/* - (int *) A: pointer on the matrix B                       - */
/* - (int) n:   problem dimension                             - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void solve(int * R, int * A, int * B, int n);

/* ------------------------------------------------------------ */
/* - function rnd_int                                         - */
/* ------------------------------------------------------------ */
/* - Returns an integer between 0 and n-1                     - */
/* - (note: the value returned may occasionaly be n, due to   - */
/* - rounding error)                                          - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int) i:   maximum value                                 - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - (int) : value between 0 and n                            - */
/* ------------------------------------------------------------ */
int rnd_int (int i);

/* ------------------------------------------------------------ */
/* - function initrandom                                      - */
/* ------------------------------------------------------------ */
/* - Initialises the random generator                         - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - No input                                                 - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - No output                                                - */
/* ------------------------------------------------------------ */
void initrandom(void);

#endif

File `functions-sequential.c`

/* ------------------------------------------------------------ */
/* - PAA - TP 4                                               - */
/* - Sequential matrix multiplication                         - */
/* ------------------------------------------------------------ */
/* - functions-sequential.c                                   - */
/* - Functions                                                - */
/* ------------------------------------------------------------ */
/* - Gurvan Huiban                                            - */
/* - Last modifications: 08/09/2002                           - */
/* ------------------------------------------------------------ */

#include "headers-sequential.h"
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

/* ------------------------------------------------------------ */
/* - function gen_matrix                                      - */
/* ------------------------------------------------------------ */
/* - Randomly generates a matrix of size n*n                  - */
/* - To do so, call function rnd_int                          - */ 
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int *) A: pointer on the matrix to be generated         - */
/* - (int) n:   problem dimension                             - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void gen_matrix(int * A, int n)
{
    /* ----------
     * VARIABLES
     * ---------- */
    int i,
        j;   // counters
    
    /* ---------------------
     * CODE OF THE FUNCTION
     * --------------------- */
    for (i=0; i<n; i++)
    {
        for (j=0; j<n; j++)
        {
            A[i*n+j] = rnd_int(INT_MAX);
        }
    }
}

/* ------------------------------------------------------------ */
/* - function prn_matrix                                      - */
/* ------------------------------------------------------------ */
/* - Prints matrix A on the standard output                   - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int *) A: pointer on the matrix to be printed           - */
/* - (int) n:   problem dimension                             - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void prn_matrix(int * A, int n)
{
    /* ----------
     * VARIABLES
     * ---------- */
    int i,
        j;   // counters
    
    /* ---------------------
     * CODE OF THE FUNCTION
     * --------------------- */
    for (i=0; i<n; i++)
    {
        for (j=0; j<n; j++)
        {
            printf("%5i ", A[i*n+j]);
        }
        printf("\n");
    }
    printf("\n");
}

/* ------------------------------------------------------------ */
/* - function solve                                           - */
/* ------------------------------------------------------------ */
/* - Computes R = A*B (A, B and R are n*n matrices)           - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int *) R: pointer on the matrix containing result       - */
/* - (int *) A: pointer on the matrix A                       - */
/* - (int *) A: pointer on the matrix B                       - */
/* - (int) n:   problem dimension                             - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void solve(int * R, int * A, int * B, int n)
{
    /* ----------
     * VARIABLES
     * ---------- */
    int sum; // temporary sum
    int i, // counters
        j,
        k;
    
    /* --------------------
     * CODE OF THE PROGRAM
     * -------------------- */

    for (i=0; i<n; i++)
    {
        for (j=0; j<n; j++)
        {
            sum = 0;
            for (k=0; k<n; k++)
            {
                sum = sum + (A[i*n+k] * B[k*n+j]);
            }
            R[i*n+j] = sum;
        }
    }
}

/* ------------------------------------------------------------ */
/* - function rnd_int                                         - */
/* ------------------------------------------------------------ */
/* - Returns an integer between 0 and n-1                     - */
/* - (note: the value returned may be n occasionaly, due to   - */
/* - rounding error)                                          - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int) i:   maximum value                                 - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - (int) : value between 0 and n                            - */
/* ------------------------------------------------------------ */
int rnd_int (int i)
{
    /* ----------
     * VARIABLES
     * ---------- */
    float num;
    float denum = RAND_MAX+1.0;

    int res;

    /* ---------------------
     * CODE OF THE FUNCTION
     * --------------------- */
    num = (float) rand();
    res = (int)(i*(num/denum));
    return(res);
}

/* ------------------------------------------------------------ */
/* - function initrandom                                      - */
/* ------------------------------------------------------------ */
/* - Initialises the random generator                         - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - No input                                                 - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - No output                                                - */
/* ------------------------------------------------------------ */
void initrandom(void)
{
    /* ----------
     * VARIABLES
     * ---------- */
    time_t ltime;

    /* ---------------------
     * CODE OF THE FUNCTION
     * --------------------- */
    time(&ltime);
    ltime%=10000;
    srand(ltime);
}

Program master

File `master.c`

/* ------------------------------------------------------------ */
/* - PAA - TP 4                                               - */
/* - Parallel matrix multiplication                           - */
/* ------------------------------------------------------------ */
/* - master.c                                                 - */
/* - Main program (master)                                    - */
/* ------------------------------------------------------------ */
/* - Gurvan Huiban                                            - */
/* - Last modifications: 27/08/2002                           - */
/* ------------------------------------------------------------ */
/* - inspired from Example of master using TCP protocol.      - */
/* - Reference: W. Richard Stevens. "UNIX Network Programming"- */
/*            Prentice Hall Software Series. 1990.            - */
/*            Chapter 6: Berkeley Sockets.                    - */
/* ------------------------------------------------------------ */


/* ---------
 * INCLUDES
 * --------- */
#include <stdio.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <strings.h>
#include <unistd.h>
#include <pthread.h>

#include "headers-master.h"



/* -------------
 * MAIN PROGRAM
 * ------------- */
int main (int argc, char * argv[])
{
    /* ----------
     * VARIABLES 
     * ---------- */

    int n; // size of the problem.
    int p; // number of computer to be used

    int * A; // First matrix
    int * B; // Second matrix
    int * R; // Result of A*B

    /* --------------------
     * CODE OF THE PROGRAM
     * -------------------- */

    // Initializing the random generator
    initrandom();

    printf("\n\n BEGINNING OF COMPUTATION\n\n");
    

    // Checking validity of the command line
    if (argc != 3)
    {
        printf("USAGE: sequential size_of_matrix number_of_computers\n\n");
        exit(-1);
    }
    
    // Getting the value of n
    sscanf(argv[1], "%i", &n);
    sscanf(argv[2], "%i", &p);
    

    // Memory allocation
    A = (int *) malloc(n*n*sizeof(int));
    B = (int *) malloc(n*n*sizeof(int));    
    R = (int *) malloc(n*n*sizeof(int));

    if (A == NULL || B == NULL || R == NULL)
    {
        printf("ERROR: memory allocation failed\n");
        exit(-1);
    }

    // Generation of matrices
    gen_matrix(A, n);
    printf("A generated\n");
    prn_matrix(A,n);
    
    gen_matrix(B, n);
    printf("B generated\n");
    prn_matrix(B,n);

    // launch the computation
    solve(R, A, B, n, p);

    // prints the result
    prn_matrix(R, n);

    // Frees memory
    free(A);
    free(B);
    free(R);

    return (0);
}

File `headers-master.h`

/* ------------------------------------------------------------ */
/* - PAA - TP 4                                               - */
/* - Parallel matrix multiplication                           - */
/* ------------------------------------------------------------ */
/* - headers-master.h                                         - */
/* - Prototypes of functions for the master                   - */
/* ------------------------------------------------------------ */
/* - Gurvan Huiban                                            - */
/* - Last modifications: 27/08/2002                           - */
/* ------------------------------------------------------------ */

#ifndef __HEADERS_MASTER_H__
#define __HEADERS_MASTER_H__

/* --------
 * DEFINES
 * -------- */
// size of an IP address
#define HOST_ADDR_SIZE 15

// number of computers available
#define NUM_HOSTS      8

// port to be used in TCP communication
#define SERV_TCP_PORT  2048

// Maximum value for matrices elements
#define MAX_MATRIX 50


/* ----------------
 * DATA STRUCTURES
 * ---------------- */

// Argument for a thread.
struct ThreadArgs_t{
    // Id of the thread
    int id;

    // pointer on the adress of slave
    char * hostaddr;

    // pointers on A, B, and R
    int * pA;
    int * pB;
    int * pR;

    // size of the problem (e.g. n)
    int sizePb;

    // pointer on the line number to compute
    int * computeLine;

    // mutex on computeLine
    pthread_mutex_t * computeLineMutex;
    
    // mutex on R
    pthread_mutex_t * RMutex;
} typedef strThreadArgs;


/* -----------
 * PROTOTYPES
 * ----------- */

/* ------------------------------------------------------------ */
/* - function readip                                          - */
/* ------------------------------------------------------------ */
/* - Reads IP addresses of slaves. Those addresses are in file- */
/* - ip_address.txt                                           - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (char *) addr: pointer on the data structure containing  - */
/* -                IP addresses                              - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void readip(char addr[NUM_HOSTS][HOST_ADDR_SIZE]);

/* ------------------------------------------------------------ */
/* - function gen_matrix                                      - */
/* ------------------------------------------------------------ */
/* - Randomly generates a matrix of size n*n                  - */
/* - To do so, call function rnd_int                          - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int *) A: pointer on the matrix to be generated         - */
/* - (int) n:   problem dimension                             - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void gen_matrix(int * A, int n);


/* ------------------------------------------------------------ */
/* - function prn_matrix                                      - */
/* ------------------------------------------------------------ */
/* - Prints matrix A on the standard output                   - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int *) A: pointer on the matrix to be printed           - */
/* - (int) n:   problem dimension                             - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void prn_matrix(int * A, int n);


/* ------------------------------------------------------------ */
/* - function solve                                           - */
/* ------------------------------------------------------------ */
/* - Computes R = A*B (A, B and R are n*n matrices)           - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int *) R: pointer on the matrix containing result       - */
/* - (int *) A: pointer on the matrix A                       - */
/* - (int *) B: pointer on the matrix B                       - */
/* - (int) n:   problem dimension                             - */
/* - (int) p:   number of computers to be used                - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void solve(int * R, int * A, int * B, int n, int p);

/* ------------------------------------------------------------ */
/* - function threadsolve                                     - */
/* ------------------------------------------------------------ */
/* - Function executed by each thread to solve the problem    - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (void *) args: argument for the function                 - */
/* -                (type strThreadArgs)                      - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void * threadsolve(void * args);

/* ------------------------------------------------------------ */
/* - function rnd_int                                         - */
/* ------------------------------------------------------------ */
/* - Returns an integer between 0 and n-1                     - */
/* - (note: the value returned may be n occasionaly, due to   - */
/* - rounding error)                                          - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int) i:   maximum value                                 - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - (int) : value between 0 and n                            - */
/* ------------------------------------------------------------ */
int rnd_int (int i);


/* ------------------------------------------------------------ */
/* - function initrandom                                      - */
/* ------------------------------------------------------------ */
/* - Initialises the random generator                         - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - No input                                                 - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - No output                                                - */
/* ------------------------------------------------------------ */
void initrandom(void);

#endif

File `functions-master.c`

/* ------------------------------------------------------------ */
/* - PAA - TP 4                                               - */
/* - Parallel matrix multiplication                           - */
/* ------------------------------------------------------------ */
/* - functions-master.c                                       - */
/* - Functions for the master                                 - */
/* ------------------------------------------------------------ */
/* - Gurvan Huiban                                            - */
/* - Last modifications: 27/08/2002                           - */
/* ------------------------------------------------------------ */

/* ---------
 * INCLUDES
 * --------- */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#include "headers-master.h"

/* ------------------------------------------------------------ */
/* - function readip                                          - */
/* ------------------------------------------------------------ */
/* - Reads IP addresses of slaves. Those addresses are in file- */
/* - ip_address.txt                                           - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (char *) addr: pointer on the data structure containing  - */
/* -                IP addresses                              - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void readip(char addr[NUM_HOSTS][HOST_ADDR_SIZE])
{
    /* ----------
     * VARIABLES
     * ---------- */
    // pointer on the input file
    FILE * file;
    char buffer[HOST_ADDR_SIZE];
    

    int i;

    /* ---------------------
     * CODE OF THE FUNCTION
     * --------------------- */
    
    file = fopen("ip_address.txt", "r");
    if (file == NULL)
    {
        printf("ERROR: Problem encountered while opening file \
                ip_address.txt\n\n");
        exit(-1);
    }

    for (i=0; i<NUM_HOSTS; i++)
    {
        fgets(buffer, HOST_ADDR_SIZE, file);
        strcpy(&(addr[i][0]), buffer);
    }

    fclose(file);
}

    

    


/* ------------------------------------------------------------ */
/* - function gen_matrix                                      - */
/* ------------------------------------------------------------ */
/* - Randomly generates a matrix of size n*n                  - */
/* - To do so, call function rnd_int                          - */ 
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int *) A: pointer on the matrix to be generated         - */
/* - (int) n:   problem dimension                             - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void gen_matrix(int * A, int n)
{
    /* ----------
     * VARIABLES
     * ---------- */
    int i,
        j;   // counters
    
    /* ---------------------
     * CODE OF THE FUNCTION
     * --------------------- */
    for (i=0; i<n; i++)
    {
        for (j=0; j<n; j++)
        {
            A[i*n+j] = rnd_int(MAX_MATRIX);
        }
    }
}

/* ------------------------------------------------------------ */
/* - function prn_matrix                                      - */
/* ------------------------------------------------------------ */
/* - Prints matrix A on the standard output                   - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int *) A: pointer on the matrix to be printed           - */
/* - (int) n:   problem dimension                             - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void prn_matrix(int * A, int n)
{
    /* ----------
     * VARIABLES
     * ---------- */
    int i,
        j;   // counters
    
    /* ---------------------
     * CODE OF THE FUNCTION
     * --------------------- */
    for (i=0; i<n; i++)
    {
        for (j=0; j<n; j++)
        {
            printf("%5i ", A[i*n+j]);
        }
        printf("\n");
    }
    printf("\n");
}

/* ------------------------------------------------------------ */
/* - function solve                                           - */
/* ------------------------------------------------------------ */
/* - Creates threads in order to compute R = A*B              - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int *) R: pointer on the matrix containing result       - */
/* - (int *) A: pointer on the matrix A                       - */
/* - (int *) B: pointer on the matrix B                       - */
/* - (int) n:   problem dimension                             - */
/* - (int) p:   number of computers to be used                - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void solve(int * R, int * A, int * B, int n, int p)
{
    /* ----------
     * VARIABLES
     * ---------- */

    // Address of slaves
    char addr[NUM_HOSTS][HOST_ADDR_SIZE];

    // Array of threads
    pthread_t * proc;

    // Array of arguments for the threads
    strThreadArgs * threadArgs;
    
    // Number of the current thread
    int threadId;

    // size of a vector (n*sizeof(int))
    int vectorSize;
    
    // line number to compute
    int lineNumber = 0;

    // mutex to avoid concurrent writing operation
    pthread_mutex_t * lineMut;
    pthread_mutex_t * RMut;

    /* ---------------------
     * CODE OF THE FUNCTION
     * --------------------- */

    // set hosts address
    readip(addr);

    // Memory allocation for the list of threads
    proc = (pthread_t *) malloc(p*sizeof(pthread_t));

    // Memory allocation for the arguments of threads
    threadArgs = (strThreadArgs *) malloc(p*sizeof(strThreadArgs));

    // Memory allocation for mutex
    lineMut = (pthread_mutex_t *) malloc(sizeof(pthread_mutex_t));
    RMut = (pthread_mutex_t *) malloc(sizeof(pthread_mutex_t));
    
    vectorSize = n*sizeof(int);

    // Initialization of mutex
    pthread_mutex_init(lineMut, NULL);
    pthread_mutex_init(RMut, NULL);

    // Manages the threads
    for (threadId=0; threadId<p; threadId++)
    {
        // Fills arguments for each thread
        threadArgs[threadId].id = threadId;
        threadArgs[threadId].hostaddr = addr[threadId];
        threadArgs[threadId].pA = A;
        threadArgs[threadId].pB = B;
        threadArgs[threadId].pR = R;
        threadArgs[threadId].sizePb = n;
        threadArgs[threadId].computeLine = &lineNumber;
        threadArgs[threadId].computeLineMutex = lineMut;
        threadArgs[threadId].RMutex = RMut;

        // creation of the threads
        pthread_create(&proc[threadId], NULL, threadsolve,\
                       &(threadArgs[threadId]));
    }

    // When all thread have finished, the computation is over
    for (threadId=0; threadId<NUM_HOSTS; threadId++)
    {
        pthread_join(proc[threadId],NULL);
    }

    // free memory
    free(threadArgs);
    free(proc);
    free(lineMut);
    free(RMut);
}


/* ------------------------------------------------------------ */
/* - function threadsolve                                     - */
/* ------------------------------------------------------------ */
/* - Function executed by each thread to solve the problem    - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (void *) args: argument for the function                 - */
/* -                (type strThreadArgs)                      - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - None                                                     - */
/* ------------------------------------------------------------ */
void * threadsolve(void * voidargs)
{
    /* ----------
     * VARIABLES
     * ---------- */
    // Id of the thread
    int threadId;

    // pointer on arguments
    strThreadArgs * args;
    
    // IP address of the host
    char * addr;

    // Pointers on A, B and R
    int * A;
    int * B;
    int * R;

    // buffer for sent/received vectors
    int * buffer;

    // Size of the problem
    int n;

    // Line to compute
    int * line;

    // result for a computation
    int * res;

    // data structure for the socket
    struct sockaddr_in serv_addr;

    // Id of the socket
    int socketId;

    // vectorsize (n*sizeof(int))
    int vectorSize;

    // boolean indicating if the computation have to go on
    char computation = 1;

    // number of bytes received
    int received;

    // integer to make the conversion with network format
    int conv;

    // Mutex
    pthread_mutex_t * lineMut;
    pthread_mutex_t * RMut;

    int i;
    int j;

    /* ---------------------
     * CODE OF THE FUNCTION
     * --------------------- */

    // extracts arguments from data structure send while\
       creating the thread
    args = (strThreadArgs *) voidargs;
    threadId = args->id;
    addr = args->hostaddr;
    A = args->pA;
    B = args->pB;
    R = args->pR;
    n = args->sizePb;
    line = args->computeLine;
    lineMut = args->computeLineMutex;
    RMut = args->RMutex;

    // Memory allocation
    buffer = (int *) malloc(n*sizeof(n));
    res = (int *) malloc(n*sizeof(n));

    vectorSize = n*sizeof(int);
    
    // Creation of the socket
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_addr.s_addr = inet_addr(addr);
    serv_addr.sin_port = htons(SERV_TCP_PORT);

    // Open a TCP socket (an Internet stream socket).
    if ( (socketId = socket(AF_INET, SOCK_STREAM, 0)) < 0)
    {
        fprintf(stderr,"master: can't open stream socket %i\n", threadId);
        exit(-1);
    }
        
    // Connect to the slave
    if (connect(socketId, (struct sockaddr *) &serv_addr,
                sizeof(serv_addr)) < 0)
    {
        fprintf(stderr,"master: can't connect to slave %i\n", threadId);
        exit(-1);
    }
        
    // Sends the size of the problem
    conv = htonl(n);
    
    if (send(socketId, &conv, sizeof(int), 0) != sizeof(int))
    {
        fprintf(stderr,"master: can't send size of the problem\n");
    }

    // Sends the matrix B row by row
    for (i=0; i<n; i++)
    {
        // Copies row i in buffer and convert integers.
        for(j=0; j<n; j++)
        {
            buffer[j] = htonl(B[i*n+j]);
        }
                        
        if (send(socketId, buffer, vectorSize, 0) != vectorSize)
        {
            fprintf(stderr,"master: can't send the row\n");    
        }
    }

    // A thread does not know how many computation it will deal
    // with.
    while (computation)
    {
        // The computation only ends when the line to process is n
        if ( (*line) < n)
        {
            // Locking row to compute
            pthread_mutex_lock(lineMut);

            //compute (i,j)
            i = (*line);

            // get and converts the line number to compute
            conv = htonl(*line);

            // Increments the element to compute
            (*line)++;

            // unlock the variables (we already have copied element in conv)
            pthread_mutex_unlock(lineMut);

            // sends the line number to compute
            if (send(socketId, &conv, sizeof(int), 0) != sizeof(int))
            {
                fprintf(stderr,"master: can't send the value of element\n");    
            }

            // copies and converts row i of A, before sending it
            for (j=0; j<n; j++)
            {
                buffer[j] = htonl(A[i*n+j]);
            }

            if (send(socketId, buffer, vectorSize, 0) != vectorSize)
            {
                fprintf(stderr,"master: can't send the row\n");    
            }

            // wait for the result of the computation
            received = 0;
            while (received < vectorSize)
            {
                received += recv(socketId, &(((char *)buffer)[received]), \
                                 vectorSize-received, 0);
            }


            // updates the matrix R. To do so, lock it
            pthread_mutex_lock(RMut);

            // copies values and converts them at the same time
            for (j=0; j<n; j++)
            {
                R[i*n+j] = ntohl(buffer[j]);
            }
            // unlock R
            pthread_mutex_unlock(RMut);
        }
        else
        {
            // the line number to process is n. Computation is finished
            computation = 0;

            // sends value for element (= n) (to warn slave)
            conv = htonl(*line);
            if (send(socketId, &conv, sizeof(int), 0) != sizeof(int))
            {
                fprintf(stderr,"master: can't send the value of element\n");    
            }
        }
    }

    // Close the socket
    close(socketId);

    //frees memory
    free(buffer);
    free(res);
    

    return NULL;
    
}

/* ------------------------------------------------------------ */
/* - function rnd_int                                         - */
/* ------------------------------------------------------------ */
/* - Returns an integer between 0 and n-1                     - */
/* - (note: the value returned may be n occasionaly, due to   - */
/* - rounding error)                                          - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int) i:   maximum value                                 - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - (int) : value between 0 and n                            - */
/* ------------------------------------------------------------ */
int rnd_int (int i)
{
    /* ----------
     * VARIABLES
     * ---------- */
    float num;
    float denum = RAND_MAX+1.0;

    int res;

    /* ---------------------
     * CODE OF THE FUNCTION
     * --------------------- */
    num = (float) rand();
    res = (int)(i*(num/denum));
    return(res);
}

/* ------------------------------------------------------------ */
/* - function initrandom                                      - */
/* ------------------------------------------------------------ */
/* - Initialises the random generator                         - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - No input                                                 - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - No output                                                - */
/* ------------------------------------------------------------ */
void initrandom(void)
{
    /* ----------
     * VARIABLES
     * ---------- */
    time_t ltime;

    /* ---------------------
     * CODE OF THE FUNCTION
     * --------------------- */
    time(&ltime);
    ltime%=10000;
    srand(ltime);
}

Program slave

File `slave.c`

/* ------------------------------------------------------------ */
/* - PAA - TP 4                                               - */
/* - Parallel matrix multiplication                           - */
/* ------------------------------------------------------------ */
/* - slave.c                                                  - */
/* - Main program (slave)                                     - */
/* ------------------------------------------------------------ */
/* - Gurvan Huiban                                            - */
/* - Last modifications: 08/09/2002                           - */
/* ------------------------------------------------------------ */
/* - inspired from Example of slave using TCP protocol.       - */
/* - Reference: W. Richard Stevens. "UNIX Network Programming"- */
/*            Prentice Hall Software Series. 1990.            - */
/*            Chapter 6: Berkeley Sockets.                    - */
/* ------------------------------------------------------------ */

/* ---------
 * INCLUDES
 * --------- */
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <unistd.h>

#include "headers-slave.h"


/* -------------
 * MAIN PROGRAM
 * ------------- */
int main(int argc, char *argv[])
{
    /* ----------
     * VARIABLES 
     * ---------- */
    // Id of the master's socket
    int socketServ;

    // Id of the slave's socket
    int socketId;

    // structure for creating the socket
    struct sockaddr_in serv_addr;

    // boolean for go on listening
    char listening=1;

    // size of the problem
    int n;

    // row of matrix A
    int * A;

    // matrix B;
    int * B;

    // result of the computation of A*B
    int * res;

    // size of a vector (n * sizeof(int))
    int vectorSize;

    // size effectively received through the socket
    int received;

    // number of rows computed by the slave
    int numComputation = 0;

    // number of the line currently processed by the slave
    int numline;

    // buffer for receiving vectors
    int * buffer;

    int i,j;
    int k;

    // integer used to make the conversion of types
    int conv;

    
    /* --------------------
     * CODE OF THE PROGRAM
     * -------------------- */

    printf("\n\nBEGINNING OF COMPUTATION\n\n");
    
    // Creation of the socket
    if ( (socketId = socket(AF_INET, SOCK_STREAM, 0)) < 0)
    {
        fprintf(stderr,"slave: can't open stream socket\n");
    }

    // Binding the socket
    bzero((char *) &serv_addr, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_addr.s_addr = htons(INADDR_ANY);
    serv_addr.sin_port = htons(SERV_TCP_PORT);    

    if (bind(socketId, (struct sockaddr *) &serv_addr, sizeof(serv_addr)) < 0)
    {
        fprintf(stderr,"slave: can't bind local address\n");
    }

    // listens to the socket
    listen(socketId, QUEUE);

    // waits for a connection from the master process. 
    socketServ = accept(socketId, NULL, NULL);        
    if (socketServ < 0)
    {
        fprintf(stderr,"slave: accept error\n");
    }

    // receives the size of the problem
    if (recv(socketServ, &conv, sizeof(int), 0) != sizeof(int))
    {
        fprintf(stderr,"slave: can't receive size of problem\n");
        close(socketServ);
        exit(-11);
    }

    // Converts the results
    n = ntohl(conv);

    // Defines variables an memory allocation
    vectorSize = n*(sizeof(int));
    A = (int *) malloc(n*sizeof(int));
    B = (int *) malloc(n*n*sizeof(int));
    buffer = (int *) malloc(n*sizeof(int));
    res = (int *) malloc (n*sizeof(int));

    
    printf("Size of the problem: %i\n", n);

    // Receives B, row by row
    for (i=0; i<n; i++)
    {
        // As the entire row may not already be there, it may be
        // required to call many times recv. "received" counts the
        // number of bytes effectively received. The loop executes
        // command recv as long as the entire row has not been received.
        received = 0;
        while (received < vectorSize)
        {
            received += recv(socketServ, &(((char *)buffer)[received]),\
                             vectorSize-received, 0);
        }

        // Copy the row into B and converts it
        for (j=0; j<n; j++)
        {
            B[i*n+j] = ntohl(buffer[j]);
        }
    }

    // The slave does not know yet how many lines it will process.
    // Consequently, it will go on listening and computing until receiving
    // the indication to compute line n
    while (listening)
    {
        // receives the line number
        if (recv(socketServ, &conv, sizeof(int), 0) != sizeof(int))
        {
            fprintf(stderr,"slave: can't receive the value of numline\n");
        }
        // converts it
        numline = ntohl(conv);

        // Checks if the computation is finished (e.g. numline >= n)
        if (numline < n)
        {

            // Receives the row necessary to compute the resulting row
            received = 0;
            while (received < vectorSize)
            {
                received += recv(socketServ, &(((char *)buffer)[received]),\
                                 vectorSize-received, 0);
            }

            // copies this row and converts it
            for (k=0; k<n; k++)
            {
                A[k] = ntohl(buffer[k]);
            }

            // Launch the computation of the result
            // note that the result is already converted ("ready to be sent")
            computeResult(A, B, res, n);
            numComputation++;
            
            
            // sends the result to the server
            if ((send(socketServ, res, n*sizeof(int), 0) != n*sizeof(int)))
            {
                fprintf(stderr,"slave: can't send response\n");    
            }
        }
        else
        {
            // Computation is finished.
            listening = 0;
            close(socketServ);
            close(socketId);
            printf("Number of computation processed: %i\n", numComputation);
        }
    }

    // free memory
    free(A);
    free(B);
    free(buffer);
    free(res);
    
    return (0);
}

File `headers-slave.h`

/* ------------------------------------------------------------ */
/* - PAA - TP 4                                               - */
/* - Parallel matrix multiplication                           - */
/* ------------------------------------------------------------ */
/* - headers-slave.h                                          - */
/* - Prototypes of functions for the slave                    - */
/* ------------------------------------------------------------ */
/* - Gurvan Huiban                                            - */
/* - Last modifications: 08/09/2002                           - */
/* ------------------------------------------------------------ */

#ifndef __HEADERS_SLAVE_H__
#define __HEADERS_SLAVE_H__

/* --------
 * DEFINES
 * -------- */

// Port used for communication
#define SERV_TCP_PORT  2048

// Size of Queue for listening TCP port
#define QUEUE 100

/* -----------
 * PROTOTYPES
 * ----------- */

/* ------------------------------------------------------------ */
/* - function computeResult                                   - */
/* ------------------------------------------------------------ */
/* - Compute the product between row i of A and B             - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int *) A: pointer on row of A                           - */
/* - (int *) B: pointer on B                                  - */
/* - (int *) res: pointer on row of result                    - */
/* - (int) n:   problem dimension                             - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - none                                                     - */
/* ------------------------------------------------------------ */
void computeResult(int * A, int * B, int * res, int n);

#endif

File `functions-slave.c`

/* ------------------------------------------------------------ */
/* - PAA - TP 4                                               - */
/* - Parallel matrix multiplication                           - */
/* ------------------------------------------------------------ */
/* - functions-slave.c                                        - */
/* - Functions for the slave                                  - */
/* ------------------------------------------------------------ */
/* - Gurvan Huiban                                            - */
/* - Last modifications: 08/09/2002                           - */
/* ------------------------------------------------------------ */

#include <netinet/in.h>
#include <stdlib.h>
#include <stdio.h>

#include "headers-slave.h"

/* ------------------------------------------------------------ */
/* - function computeResult                                   - */
/* ------------------------------------------------------------ */
/* - Compute the product between row i of A and B             - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* - (int *) A: pointer on row of A                           - */
/* - (int *) B: pointer on B                                  - */
/* - (int *) res: pointer on row of result                    - */
/* - (int) n:   problem dimension                             - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* - none                                                     - */
/* ------------------------------------------------------------ */
void computeResult(int * A, int * B, int * res, int n)
{
    /* ----------
     * VARIABLES
     * ---------- */
    int j, i;
    int r;
    
    /* ---------------------
     * CODE OF THE FUNCTION
     * --------------------- */
    // compute the n element of the result
    for (j=0; j<n; j++)
    {
        r = 0;
        for (i=0; i<n; i++)
        {
            r += A[i]*B[i*n+j];
        }
        // Converts the result before sending it on the network
        res[j] = htonl(r);
    }
}

About this document ...

This document was generated using the LaTeX2HTML translator Version 2002 (1.62)

The command line arguments were:
latex2html -split 0 tp4.tex

The translation was initiated by gurvan huiban on 2002-10-03

Footnotes

... large ¹: In specific applications, it is common to have many millions elements in a matrix
... random ²: with integer in our implementation
... sparse ³: containing an important number of 0
... PRAM ⁴: Parallel Random Access Machine
... PRAM-CREW ⁵: Concurrent Read Exclusive Write
... SIMD ⁶: Single Instruction Multiple Data
... MIMD ⁷: Multiple Instruction Multiple Data
... topology ⁸: Actually, the network graph is a complete graph, since each computer can directly communicate with any other computer
... fails ⁹: This is not a problem, since no tests have been realized with such large matrices
... used ¹⁰: However, this implementation has been designed for a number of computer $p \ll n$ . For a large number of computers - order of or more - our implementation may not be as efficient.
... efficiency ¹¹: Some problems may require an unfair repartition.
...threads ¹²: parts of the program that are executed concurrently
...htonl ¹³: Host To Network Long
...ntohl ¹⁴: Network To Host Long
... obtained ¹⁵: tables containing values measured can be found in appendix
... network ¹⁶: containing computers of various architecture and velocity
... experiments ¹⁷: It is a PC Pentium III 500, with 128MB of RAM, running Linux

gurvan huiban 2002-10-03

	seq. on Maserati	4 comp.	6 comp.	8 comp.
400	4.10	1.9	1.91	2.05
800	34.95	13.87	12.56	12.24
1000	72.46	23.41	21.52	19.95
1200	120.96	32.35	33.81	32.2
1600	390.06	134.23	131.02	120.46
2000	626.57	208.66	198.25	186.09

	4 comp.	6 comp.	8 comp.
400	2.16	2.15	2.00
800	2.52	2.78	2.85
1000	3.10	3.37	3.63
1200	3.73	3.57	3.75
1600	2.91	2.98	3.23
2000	3.00	3.16	3.37

	seq.	1 comp.	2 comp.	3 comp.	4 comp.	5 comp.	6 comp.	7 comp.	8 comp.
100	0.1	10.0	5.1	3.4	2.6	2.2	1.8	1.6	1.5
200	1.3	20.1	10.1	6.9	5.4	4.5	3.9	3.6	3.4
300	4.6	33.1	16.7	11.6	9.0	7.7	6.7	6.5	6.2
400	13.1	52.2	11.7	7.4	6.1	5.8	5.7	5.8	6.2
500	27.0	75.3	21.2	13.3	10.7	9.9	9.7	9.5	10.0
600	52.9	108.5	36.7	23.6	18.6	16.6	15.8	15.8	15.6
700	79.3	140.7	52.5	34.2	27.2	24.0	22.6	21.9	22.0
800	164.7	232.8	98.0	64.4	50.0	43.0	39.2	37.1	36.5
900	182.0	262.3	109.3	71.7	56.4	48.8	45.3	43.2	42.4
1000	255.8	342.1	149.0	99.1	77.2	66.4	59.8	57.3	55.5