Gurvan Huiban
This work will be available on the Internet: http://www.dcc.ufmg.br/ghuiban/
Matrices are a convenient data representation, used in many types of programs. One of the key operations related to their use is multiplication. However, this operation can be computationally heavy and can require a significant time, especially if the matrices involved are large.
Due to its nature and to the structure of a matrix, this operation can easily be divided into independent parts, in order to save time by parallelizing it.
This topic of parallelizing matrix multiplication will be addressed in this work. For reasons of simplicity, we will only deal with square matrices, filled at random, but we believe that our results can be extended to any kind of matrices that cannot be considered as sparse. This last point is important, since specific strategies exist for the sparse context. Our aim is only to deal with ``basic'' matrix multiplication.
Recall that a matrix of size $n \times n$ contains $n^2$ elements. The product of two matrices $A$ and $B$ of such a size is a matrix of size $n \times n$. In other words, to compute the product $A \times B$, we have to compute the value of $n^2$ elements. Using the standard formulas, this requires $n^3$ multiplications. However, Strassen discovered a base algorithm to compute the product of two $2 \times 2$ matrices with only seven multiplications [4]. This technique can be recursively extended to large matrices via a tensor product construction: large matrices are broken down recursively by partitioning them into quarters, sixteenths, etc. This leads to a number of multiplications in $O(n^{\log_2 7}) \approx O(n^{2.81})$, inferior to $n^3$. The best upper bound known today is approximately $O(n^{2.376})$ [2]. However, the best known lower bounds are only linear in $n^2$, i.e. $\Omega(n^2)$ [3]. The complexity of the problem is still an open question.
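For reference, the seven products of Strassen's $2 \times 2$ base case are the standard ones (a textbook derivation, given here for completeness):

$$\begin{aligned}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}) & M_5 &= (A_{11}+A_{12})\,B_{22}\\
M_2 &= (A_{21}+A_{22})\,B_{11} & M_6 &= (A_{21}-A_{11})(B_{11}+B_{12})\\
M_3 &= A_{11}\,(B_{12}-B_{22}) & M_7 &= (A_{12}-A_{22})(B_{21}+B_{22})\\
M_4 &= A_{22}\,(B_{21}-B_{11}) & &
\end{aligned}$$

$$A \times B = \begin{pmatrix} M_1+M_4-M_5+M_7 & M_3+M_5 \\ M_2+M_4 & M_1-M_2+M_3+M_6 \end{pmatrix}$$

Applying this scheme recursively to the quarters of a large matrix yields the $O(n^{\log_2 7})$ bound mentioned above.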
Our study and implementation do not use such ``advanced strategies'', and we will only consider the following problem: ``matrix multiplication without the use of advanced strategies''.
In a PRAM model, an infinite number of processors is available. All of them have access to a global memory, and the communication time is considered null. However, before being used, a processor has to be awakened.
We propose an algorithm for the PRAM-CREW model that runs in linear time $O(n)$, using $n^2$ processors. Consequently, this algorithm has a cost of $O(n^3)$. This algorithm is the following:
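The listing is summarized here by a minimal C sketch of the work assigned to one processor (an informal reconstruction, not the original numbered listing; the loop below corresponds to the lines 3 to 5 analysed next):

/* Sketch of the work of PRAM processor P(i,j), 0 <= i,j < n, once
 * the n*n processors have been spawned. A, B and R live in the
 * global shared memory. */
void pram_processor(int i, int j, const int *A, const int *B,
                    int *R, int n)
{
    int sum = 0;
    int k;
    for (k = 0; k < n; k++)              /* lines 3 to 5: n iterations */
        sum += A[i*n + k] * B[k*n + j];  /* line 4: concurrent reads   */
    R[i*n + j] = sum;                    /* exclusive write: only      */
                                         /* P(i,j) writes R[i][j]      */
}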
This algorithm is a CREW algorithm, since all processors concurrently read values of $A$ and $B$, but there is no concurrent writing (each processor writes its values in a specific memory place).
This algorithm uses $n^2$ processors. Its time-complexity is the following: spawning $n^2$ processors requires $O(\log n)$ pulses; the time-complexity of line 4 is $O(1)$, and the time-complexity of lines 3 to 5 is $O(n)$. A single processor runs only instructions 3 to 5. Consequently, the time-complexity of the part executed by one processor is $O(n)$. As all processors run concurrently, the time-complexity of the whole algorithm is $O(n + \log n) = O(n)$.
The cost of a PRAM algorithm is the product of its time-complexity by the number of processors used. This leads us to a cost of $n^2 \times O(n) = O(n^3)$, which is equal to the time-complexity of the sequential algorithm for this problem. Hence, it is an optimal implementation.
In the SIMD architecture, all processors simultaneously execute the same instruction, each on its own data item (contrary to the MIMD architecture, where each processor has its own program acting on its own data). There are various possible architectures for the SIMD model. In [1], bounds and algorithms are given for this problem on three SIMD architectures: the 2D mesh model, the hypercube model and the shuffle-exchange model.
On the 2D mesh SIMD model, it is possible to compute the multiplication of two $n \times n$ matrices in time $\Theta(n)$, using $n^2$ processors. The mere fact that there are $n^3$ multiplications to perform and $n^2$ processors gives us a time-complexity of $\Omega(n)$. However, since it is possible to distribute all data and perform the required computations in $O(n)$, a time complexity of $\Theta(n)$ can be achieved. An algorithm of such complexity is also proposed.
Given a hypercube of $n^3$ processors, it is possible to compute the product of two matrices of $n^2$ elements in time $\Theta(\log n)$. An algorithm is also given; it has three phases: the first one is a broadcast phase, the second one performs the multiplications, and during the third one, results are gathered.
A shuffle-exchange network has a topology containing $2^m$ nodes, each of which is labeled by a unique $m$-bit integer. If two nodes have labels $i = i_{m-1} \dots i_1 i_0$ and $j = j_{m-1} \dots j_1 j_0$, then $i$ and $j$ are connected if $i_k = j_k$ for $1 \le k \le m-1$ and $i_0 \neq j_0$, or if $j$ is a left or right cyclic shift of $i$.
Given a shuffle-exchange SIMD model of $n^3$ processors, it is possible to compute the product of two matrices of $n^2$ elements in time $\Theta(\log n)$. An algorithm reaching such a time-complexity can be obtained by simulating the algorithm for the hypercube model.
Mainly, there are two approaches for parallelizing this problem in a multi-computer environment. The first one is block-matrix oriented, and the second one is row-column oriented.
This approach uses the fact that a matrix can be considered as a set of blocks. Figure 1 shows an example of such a decomposition.
Block-matrix multiplication is performed analogously to scalar matrix multiplication: each occurrence of a scalar product is replaced by an occurrence of a block multiplication. For instance, let $A$ and $B$ be two matrices divided as follows:

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$
Note that the blocks are not necessarily square. If the sizes of the blocks are compatible, the product $R = A \times B$ can be expressed as:

$$R = \begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}$$
An implementation using this principle can be designed: every computer multiplies two sub-matrices during each iteration. For instance, suppose that, with four computers, we want to evaluate $R = A \times B$, where:
We define the following blocks:
We have to compute $R_{11}$, $R_{12}$, $R_{21}$ and $R_{22}$, defined above. As four computers are available, we can divide the tasks as follows: each one deals with one $R_{ij}$. Each $R_{ij}$ is the sum of two block-matrix products. First, we will compute $A_{i1} \times B_{1j}$, that is, the first member of the sum:
Then, each computer will compute the second part of the sum, that is, $A_{i2} \times B_{2j}$:
Each computer sums $A_{i1}B_{1j} + A_{i2}B_{2j}$, and sends the result:
The overall result is:
It is possible to generalize this approach: the matrices are divided into blocks, and those blocks are sent to the computers. Each one calculates a resulting block, and sends back the result.
A second possible approach is to use a row-column decomposition. Figure 2 illustrates the way two matrices are multiplied.
As can be seen, to evaluate an element of the result, a computer needs a row and a column. Hence, one strategy to parallelize the problem is to send a set of rows of $A$ and a set of columns of $B$ to each computer. Each one will be able to calculate a set of elements of the solution matrix, and will send back its results. Our algorithm (detailed in section 3.2) is based on this approach. An example of execution with the same set of data as above can be found in section 3.2.
Note that the row-column approach is a special case of the block-matrix decomposition (each row of $A$ and each column of $B$ represents a block).
Actually, those two approaches are very similar, since there is only one way to effectively compute each element of the result. The only things that change are the computation effectively done by each computer, and possibly the way data are transmitted from one computer to another if the network has a special topology. For instance, according to [1], the block-matrix approach can be efficiently implemented if the network of computers is a mesh, and the row-column approach is efficient in the case of a ring. In our case, the network does not have a particular topology, and we believe that performances would be very similar in either case.
We will now describe the implementations we realized for matrix multiplication.
To evaluate the performance of our parallelized algorithm and its implementation, we need a reference: a sequential algorithm for a mono-processor computer, based on the ``basic'' matrix multiplication algorithm.
A program performing sequential matrix multiplication using the classical algorithm has been written in C.
It generates two matrices $A$ and $B$ of a given size, fills them with integers randomly chosen between zero and the value of the parameter MAX_MATRIX, and prints them on the standard output. Then, it computes the product of $A$ by $B$ and stores it in $R$. Finally, it prints $R$.
The algorithm used for the matrix multiplication is the ``classical'' one. It can be written as follows:
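This is the triple loop implemented by function solve (see the appendix):

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0;
        for (k = 0; k < n; k++) {
            /* dot product of row i of A and column j of B */
            sum = sum + (A[i*n+k] * B[k*n+j]);
        }
        R[i*n+j] = sum;
    }
}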
The time complexity of such an algorithm is naturally $O(n^3)$ (three nested loops of $n$ iterations).
Note that for very large matrices (for values of $n$ beyond a limit that depends on the architecture, Linux for instance), memory allocation fails.
The program is divided into three files: file sequential.c contains the main program, file functions-sequential.c contains the functions used, and file headers-sequential.h contains their prototypes.
More precisely, the functions used are gen_matrix, prn_matrix, solve, rnd_int and initrandom (see the appendix). The structure of the program is shown on figure 3.
The data structures used are mainly arrays of integers.
The only input of the program is the command line used to run it: it takes $n$ as a parameter.
During its execution, the program writes on the screen $A$, $B$, $R$ and a message saying that no problem was encountered during the computation. An example of execution is shown below:
#> ./sequential 4
A generated
    9    40    47    40
    5    49     1     2
   46    20    20    47
   45    36    47    13
B generated
   27    29    42    14
   34     5    24    28
   45     0    30    19
   32    47    40    42
Computation of R = A x B
 4998  2341  4348  3819
 1910   484  1496  1545
 4326  3643  4892  3558
 4970  2096  4684  3077
No error was encountered
The program is called sequential. It takes one and only one argument: the value of $n$. If the number of arguments is not correct, an error message is written. For instance, to generate and multiply two matrices of size $77 \times 77$, the program has to be executed as follows:
#> ./sequential 77
As a matrix contains $n^2$ elements, the generation of a matrix and the output of a matrix are $O(n^2)$. As previously noticed, the time complexity of the matrix multiplication is $O(n^3)$. Consequently, the time complexity of the whole program is $O(n^3)$.
As the program is very simple, it is easy to check whether the result is correct or not, and all tests realized gave the expected result.
The second program implemented for computing matrix multiplications uses different computers, linked by a TCP/IP network. The architecture chosen is master-slave: a master program generates the matrices, sends information to the slaves and receives the results of the computation; a slave program, executed at the same time by one or many computers, receives instructions from the master, effectively does the computations and sends back the results. Figure 4 represents the communication scheme between computers: the master centralizes all communications.
For this work, our main idea was to make a program as flexible as possible. By flexible, we mean: the program should work with a variable number of slaves, and with computers of different architectures and velocities.
Those constraints raised various problems, such as dealing with computers of different architectures or velocities. This led us to the concept of load balancing: the computation is divided between computers depending on their efficiency.
There are various ways to divide tasks between slaves. This repartition can be fixed or dynamic, equal or not, and so on. We call granularity the complexity of a task assigned to a slave between two instructions from the master program. For a given problem, a large granularity reduces the communications between slaves and master. Naturally, the larger the granularity is, the more difficult it is to divide the computations in a ``fair'' way (a computation time more or less equivalent for each slave). An unfair repartition implies that, while some computers are working, others are waiting. For the problem of matrix multiplication, we believe that this is a loss of efficiency. Figure 5 illustrates the influence of the granularity on the computation time: with a high granularity, there are very few communications between slaves and master, but we may ``waste'' a lot of computation resources. With a smaller granularity, the number of communications increases, and both the load balancing and the overall execution time are better. With a very small granularity, all the time gained by a good load balancing is lost due to the very high number of communications.
There are various ways to fairly divide the tasks. For instance, one could statically divide the tasks depending on the efficiency of each computer (which requires a good knowledge of the available resources). Since our objective was to have a flexible program, this repartition had to be dynamic, and the granularity had to be relatively low. Between two interactions with the master, each slave computes $n$ values (remember that there are $n^2$ elements to calculate). In other words, the master will divide a set of $n$ computations between $p$ computers. As generally $n \gg p$, we believe that the granularity is small enough to provide a good load balancing.
Our approach is row-column oriented. Each slave receives rows of $A$ and columns of $B$, to compute elements of $R$. We have chosen to send the entire matrix $B$ to each slave. With such data, for each row of $A$ transmitted to a slave, it is able to compute $n$ values of $R$.
The algorithms applied by the master and the slaves are the following:
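The two listings are condensed below into an informal sketch (the full C sources are given in the appendix; the numbering of the original listings is not reproduced):

Master (one thread per slave):
    send n and the whole matrix B to the slave
    while some row i < n remains to be assigned:
        atomically take the next row number i (under mutex)
        send i and row i of A to the slave
        receive the n resulting values and store them in row i of R (under mutex)
    send the value n to the slave, marking the end of the execution

Slave (algorithm 4):
    accept the master's connection; receive n and matrix B
    repeat:
        receive a row number i; if i = n, stop
        receive row i of A
        compute R[i][j] = sum over k of A[i][k] * B[k][j], for j = 0 ... n-1
        send the n resulting values back to the master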
An example of execution of our algorithm using 3 slaves, with the matrices $A$ and $B$ used in the example of section 2.3.1:
Figures 6 to 11 show the progress of a possible execution of the algorithm.
At this point of the computation, all slaves keep matrix $B$ in memory and have received a row of $A$ (row 0 for slave 1, row 1 for slave 2 and row 2 for slave 3).
Suppose that slave 2 finishes its computation first. It sends its result and receives another computation to do.
Slave 2 will do the following computation:
In the meanwhile, the other slaves have finished their computations. They send their results, but there is no more computation to do. Consequently, they receive a message marking the end of the execution.
Slave 2 sends its results and receives a message marking the end of the execution. The master has received all results from all slaves and builds the resulting matrix:
As can be seen, such an algorithm meets our objectives. To save time, it is useful to desynchronize the parts of the master program that deal with each slave. To do so, the master program creates one thread associated with each slave. Consequently, when a slave finishes its computation, it sends its results and receives another work unit without having to wait for another slave to finish its current computation. This concurrent execution may generate problems of concurrent access to the same data (for instance: a thread wants to increment a variable and, at the same time, another thread wants to read it). To solve this problem, we used mutexes on the critical parts of the code. A mutex is a flag preventing two threads from executing a part of code at the same time. One mutex has been defined to protect the incrementation of the next row to compute, and another to protect the updates of matrix $R$.
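The row-counter pattern can be condensed as follows (a minimal sketch in the spirit of function threadsolve from the appendix; next_row and the static variables are illustrative names, not the report's exact code):

#include <pthread.h>

static int line = 0;   /* next row to compute (shared by all threads) */
static pthread_mutex_t lineMut = PTHREAD_MUTEX_INITIALIZER;

/* Returns the next row index to process, or -1 when all n rows have
 * been assigned. Locking BEFORE testing the counter guarantees that
 * no two threads can take the same row. */
static int next_row(int n)
{
    int i;
    pthread_mutex_lock(&lineMut);
    i = (line < n) ? line++ : -1;
    pthread_mutex_unlock(&lineMut);
    return i;
}

Locking before the test, rather than testing first and locking afterwards, removes the window in which two threads could read the same counter value.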
Slaves and master communicate through sockets. A socket is a kind of network ``pipe'' between two computers; data can be sent and received through it. To create a socket, a program must know the IP address of the other computer and define a port to communicate on. If the communication port is not occupied and the request is accepted, a socket is created, and both computers are able to send and receive data through it.
Those programs (master and slave) have been implemented in C. The data structures used are mainly arrays of integers.
The master program randomly generates and prints the matrices $A$ and $B$, creates the threads, sends data to the slaves, receives the results and finally prints $R$. The structure of the program is very similar to that of the sequential program. Function solve is different, and a function has been added to read the IP addresses of the slaves. More precisely, the structure of the program is shown on figure 12.
Functions already described in section 3.1.2 will not be mentioned again.
The program is divided into three files: master.c contains the main program, headers-master.h contains the definition of the prototypes and functions-master.c contains the functions used.
The slave program is much simpler than the master program. It only creates a socket, waits for data from the master program (matrix $B$ and rows of $A$), computes rows of $R$ and sends them. The program is divided into the three usual files slave.c, headers-slave.h and functions-slave.c.
A unique function computeResult is called by the main program; it computes the row of $R$ resulting from the multiplication of the previously received row of $A$ by $B$ (i.e. lines 4 to 9 of algorithm 4). The main program creates the socket and implements algorithm 4 (replacing lines 4 to 9 by a call to function computeResult).
We will not consider the communication between master and slave, since it has already been described.
To work, the main program requires three elements: the size of the problem (i.e. the value of $n$), the number of computers to be used for the computation, and the IP addresses of the slaves. The first two parameters are given on the command line; the IP addresses are read from a file called ip_address.txt. In file headers-master.h, the parameter NUM_HOSTS defines the maximum number of possible slaves. The file ip_address.txt must contain NUM_HOSTS IP addresses (even if they are not all used).
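A hypothetical ip_address.txt for NUM_HOSTS = 8 (the addresses below are placeholders, not the machines actually used):

192.168.0.101
192.168.0.102
192.168.0.103
192.168.0.104
192.168.0.105
192.168.0.106
192.168.0.107
192.168.0.108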
The main program writes the matrices $A$ and $B$, and the result of the computation, on the standard output.
Each slave program only prints on the screen the size of the problem and the number of rows it processed.
Since the libraries differ from one architecture to another, make linux builds the program for a Linux environment, and make sun builds it for the Sun architecture.
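A minimal sketch of such a Makefile (an assumption about its content, not the project's actual file; on the Sun architecture the socket functions live in libsocket and libnsl, and the slave binary is named slave-sun, the name used in the execution example below):

# Hypothetical Makefile sketch
CC     = gcc
CFLAGS = -Wall -O2

linux:
	$(CC) $(CFLAGS) -o master master.c functions-master.c -lpthread
	$(CC) $(CFLAGS) -o slave  slave.c  functions-slave.c

sun:
	$(CC) $(CFLAGS) -o master-sun master.c functions-master.c -lpthread -lsocket -lnsl
	$(CC) $(CFLAGS) -o slave-sun  slave.c  functions-slave.c  -lsocket -lnsl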
To use the program, it is necessary to: write the IP addresses of the slaves in the file ip_address.txt, run the slave program on each slave computer, and finally run the master program with the size of the problem and the number of computers as arguments.
Note that a physical computer can run at the same time the master process and a slave process (but it can run only one slave process).
An example of computation is shown below:
lamborghini#> ./master 5 3

 BEGINNING OF COMPUTATION

A generated
    2    10    17    41    25
    9    44    45    14     3
   42    42    31    31    32
   33    24    27    24    15
   10    33    45     3    11
B generated
   15    23    34     5     2
   12     7    12    29    49
   37    39    44    33     4
   48    25    46    30     6
   28    13    31     6    37
 3447  2129  3597  2241  1733
 3084  2659  3551  3244  2549
 4665  3660  5714  3573  3636
 3354  2775  4167  2562  2049
 2663  2434  3195  2648  2242
lamborghini#>
lamborghini#> ./slave

BEGINNING OF COMPUTATION

Size of the problem: 5
Number of computation processed: 2
lamborghini#>

audi#> ./slave

BEGINNING OF COMPUTATION

Size of the problem: 5
Number of computation processed: 2
audi#>

porsche#> ./slave-sun

BEGINNING OF COMPUTATION

Size of the problem: 5
Number of computation processed: 1
porsche#>
The function computing a row is quite simple, and it is not really difficult to check it. Moreover, during the development phase, after computing the solution using the parallel implementation, the master program computed the solution with the sequential algorithm and checked whether the results were different. For instance, this allowed us to find an error in the use of the mutexes.
One of the objectives was to make a program working in a heterogeneous environment. But different architectures may use different encodings for representing integers. For instance, the encoding of an integer differs between a PC under Linux and a Sun station. Luckily, there is a standard encoding for sending integers over a network, and there exist functions for converting an integer to this standard network encoding and back. Those functions are htonl and ntohl. The use of those functions allowed us to ``forget'' the heterogeneity of the computers used.
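A minimal illustration of the round trip (not taken from the report's sources):

#include <arpa/inet.h>   /* htonl() and ntohl() */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int value = 4998;              /* host byte order (machine dependent) */
    uint32_t wire = htonl(value);  /* network byte order: safe to send    */
    int back = ntohl(wire);        /* restores the value on any machine   */
    printf("%i\n", back);          /* prints 4998 on a PC and a Sun alike */
    return 0;
}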
Another problem encountered has been some ``surprising'' performance results: in various tests realized with Sun stations, the execution time (both for the sequential and the parallel algorithms) was lower for a larger matrix than for a smaller one (with the same computer under the same conditions). Deleting the -O2 compilation flag (i.e. reducing the level of optimization performed by the compiler) allowed us to find more ``coherent'' results. However, we do not know whether this was the origin of the problem.
But the main problem encountered has been to find a granularity allowing us to get low execution times and a good load balancing. Actually, the algorithm presented above is the second algorithm we implemented. The first one had a much lower granularity, and the time dedicated to communication between slaves and master was far too high to obtain good performances.
To study the complexity of this set of programs, let us define $n$ as the size of the problem and $p$ as the number of computers used.
First, we will consider the message complexity, that is, the amount of communication realized during a computation.
In this paragraph, we will only determine the number of integers sent on the network. We will neglect the encapsulation of data into TCP packets.
First, the master process sends to each slave the size of the problem and the matrix $B$; that is, it sends $n^2 + 1$ integers to each slave. Since there are $p$ slaves, the message complexity of this part is $O(p n^2)$. To compute a result, the master sends to a slave an integer and a row of $A$. The slave performs a computation and returns a row. That is, to get a row of the final matrix, $2n + 1$ integers are sent through the network. There are $n$ different rows, so the number of integers sent on the network during the computation is $n(2n+1) = O(n^2)$. Consequently, the overall number of integers sent on the network is $O(p n^2 + n^2) = O(p n^2)$.
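For instance, with $n = 1000$ and $p = 8$ (the largest case of section 4):

$$p(n^2+1) = 8 \times 1\,000\,001 \approx 8 \times 10^6, \qquad n(2n+1) = 1000 \times 2001 \approx 2 \times 10^6,$$

so the initial broadcast of $B$ dominates the traffic.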
Let us suppose that the time needed for an elementary operation on matrix elements is $t_c$, and that the time needed for sending an integer is $t_s$.
If we consider that the computations take place sequentially (and not concurrently, as is effectively the case), the overall execution time is constituted by the communication time and the computation time. The time complexity of the computation of a row (lines 4 to 9 of algorithm 4) is naturally $O(n^2 t_c)$. There are $n$ rows to compute, so the computation time is $O(n^3 t_c)$. The communication cost is $O(p n^2 t_s)$. Consequently, the overall execution time is $O(n^3 t_c + p n^2 t_s)$.
If we consider the concurrent execution, the overall execution time is equal to the execution time of $s_{max}$, where $s_{max}$ is the slave performing the highest number of computations. Let us suppose that $s_{max}$ computes $r$ rows. To do so, $s_{max}$ first receives $n^2 + 1$ elements (the size of the problem and matrix $B$). Then, it receives $n + 1$ elements, performs the computation of a row, and sends the result ($n$ elements); those operations are repeated $r$ times. Hence, the communication time is $O((n^2 + rn)\, t_s)$, and the computation time is $O(r n^2 t_c)$. In other words, the time complexity of the process is $O((n^2 + rn)\, t_s + r n^2 t_c)$. If all computers are identical, $r \approx n/p$. It holds that the time complexity of a computation is $O(n^2 t_s + \frac{n^3}{p} t_c)$.
To evaluate the performances of our implementation, and the profits of the parallelization, tests have been realized.
First, experiments have been performed with a set of identical computers. Those computers are Sun Ultra 5 stations, equipped with a 360 MHz UltraSPARC-IIi processor and 128 MB of memory. Those computers are connected to the network through 10 Mbps interfaces. It can be considered that $t_s \gg t_c$ (where $t_c$ and $t_s$ are the constants introduced in section 3.2.6): sending an integer from one computer to another takes much longer than locally performing an operation.
A test consists in computing the product of matrices of size $n \times n$ with $p$ computers. Tests have been realized for different values of $n$ and $p$. Note that the matrices used for the computations are never the same, as they are randomly generated ``on the fly''. However, it is assumed that the computation time for any matrix of size $n \times n$ filled with integers is the same. Moreover, the output on the screen has been disabled. Figure 13 shows the computation times obtained for solving the problem with both the sequential and the parallel algorithms. Note that one of the computers ran simultaneously a master and a slave process.
Figures 14 and 15 show the speedup and the efficiency of our implementation. Speedup and efficiency can be expressed as follows:

$$\text{speedup} = \frac{t_{seq}}{t_p}, \qquad \text{efficiency} = \frac{\text{speedup}}{p},$$

where $t_{seq}$ is the execution time of the sequential program and $t_p$ the execution time of the parallel program on $p$ computers.
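As a consistency check against the tables of the appendix, for $n = 1000$ and $p = 8$:

$$\text{speedup} = \frac{255.8}{55.5} \approx 4.60, \qquad \text{efficiency} = \frac{4.60}{8} \approx 0.58,$$

which are indeed the values reported there.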
On one (and only one) computer, the parallelized version of the algorithm always requires more time than the sequential algorithm; this is normal, since the parallel program is much more complex than the sequential one.
For low values of $n$ (in our experiments, $n \leq 300$), the sequential computation is always faster than the parallelized ones. For higher values of $n$, it is the contrary: the parallelized version is always faster than the sequential one. This is normal, since the transmission time for an integer is much greater than the computation time. As seen in section 3.2.6, the part of the transmission in the overall execution time is ``only'' proportional to $n^2$, while the part of the computation is proportional to $n^3$. The higher $n$ is, the lower the relative part of the communication. If the communication part were null, the speedup would be close to $p$.
Figures 14 and 15 confirm this last affirmation: the speedup is (almost) always increasing. The efficiency also increases with the value of $n$. It is interesting to notice that the efficiency decreases with $p$, for a given $n$. This can be explained by the mutexes used to avoid concurrent writing. If two threads want to modify the same variable at the same time, one thread has to wait for the other to finish its operation. Naturally, with a higher number of computers used, this has more chances to happen.
|                    | sl.1 | sl.2 | sl.3 | sl.4 | sl.5 | sl.6 | sl.7 |
|--------------------|------|------|------|------|------|------|------|
| num. rows computed | 101  | 107  | 102  | 102  | 94   | 95   | 99   |
The table above gives the number of rows each slave processed, for $n = 700$ with 7 slaves. It can be seen that the values are homogeneous.
Some tests have also been realized with a heterogeneous network. The first four slaves are PCs under Linux, and the last four slaves are Sun stations of different generations. The network between all those computers is a 100 Mbps network, and at least six of the eight slaves are equipped with a 100 Mbps network adapter.
Figure 16 shows the execution time for multiplying matrices of different sizes with four, six and eight computers. It can be seen that there are very few differences in the execution times with 4, 6 and 8 computers. The table below gives the number of rows of $R$ computed by each slave (for $n = 2000$).
|                    | sl.1 | sl.2 | sl.3 | sl.4 | sl.5 | sl.6 | sl.7 | sl.8 |
|--------------------|------|------|------|------|------|------|------|------|
| num. rows computed | 276  | 376  | 642  | 461  | 127  | 59   | 34   | 25   |
The last four slaves compute very few rows of $R$. Consequently, the similarity of the execution times can be explained this way: the computation power of the last four slaves is so small (in comparison with the first four slaves) that it does not really have an influence on the overall execution time.
Speedup and efficiency are related to the execution time of the sequential algorithm measured on ``Maserati'', which is one of the computers used in those experiments. As can be seen on figures 17 and 18, speedup and efficiency are quite good: the speedup is between two and four (even when only four slaves are considered); and the efficiency, when four slaves are considered, is between $0.54$ and $0.93$. Moreover, both speedup and efficiency increase with $n$. Naturally, the efficiency decreases with the number of computers used, since the computation time stays almost the same.
It is worth noting that the execution times on this system are much lower than on the homogeneous system. This can be explained by the power of the computers, but also by the velocity of the network (the communication time between master and slaves is lower).
We have proposed and implemented an algorithm for computing the product of two square matrices of size $n \times n$. Our implementation meets our objectives: it is flexible and works on heterogeneous systems. Moreover, it seems to be efficient, since a reasonably good speedup is obtained for ``high'' values of $n$.
On a homogeneous system, the load is well balanced between all slaves, and the more computers are used, the lower the execution time is.
It has been seen that, in a heterogeneous environment, the influence of the number of computers on the execution time is not so simple: increasing the number of slaves by adding relatively slow computers does not really have an influence. This leads us to emphasize that, when designing an important application (for instance an industrial application), it is necessary to have accurate information on the hardware platform used. If the network is quite slow, it is better to increase the granularity of the algorithm (i.e. increase the independence of each computer). On the contrary, if the network is very efficient but the computers are heterogeneous, it may be useful to decrease the granularity of the repartition to get a better load balancing. If the platforms are heterogeneous, is it worth using all the computers? Or, on the contrary, will the profit be insignificant?
The graphs of section 4 have been realized with the numeric values given in the tables below.
Execution times, in seconds (figure 13):

| $n$  | seq.  | 1 comp. | 2 comp. | 3 comp. | 4 comp. | 5 comp. | 6 comp. | 7 comp. | 8 comp. |
|------|-------|---------|---------|---------|---------|---------|---------|---------|---------|
| 100  | 0.1   | 10.0    | 5.1     | 3.4     | 2.6     | 2.2     | 1.8     | 1.6     | 1.5     |
| 200  | 1.3   | 20.1    | 10.1    | 6.9     | 5.4     | 4.5     | 3.9     | 3.6     | 3.4     |
| 300  | 4.6   | 33.1    | 16.7    | 11.6    | 9.0     | 7.7     | 6.7     | 6.5     | 6.2     |
| 400  | 13.1  | 52.2    | 11.7    | 7.4     | 6.1     | 5.8     | 5.7     | 5.8     | 6.2     |
| 500  | 27.0  | 75.3    | 21.2    | 13.3    | 10.7    | 9.9     | 9.7     | 9.5     | 10.0    |
| 600  | 52.9  | 108.5   | 36.7    | 23.6    | 18.6    | 16.6    | 15.8    | 15.8    | 15.6    |
| 700  | 79.3  | 140.7   | 52.5    | 34.2    | 27.2    | 24.0    | 22.6    | 21.9    | 22.0    |
| 800  | 164.7 | 232.8   | 98.0    | 64.4    | 50.0    | 43.0    | 39.2    | 37.1    | 36.5    |
| 900  | 182.0 | 262.3   | 109.3   | 71.7    | 56.4    | 48.8    | 45.3    | 43.2    | 42.4    |
| 1000 | 255.8 | 342.1   | 149.0   | 99.1    | 77.2    | 66.4    | 59.8    | 57.3    | 55.5    |
Speedup (figure 14):

| $n$  | 1 comp. | 2 comp. | 3 comp. | 4 comp. | 5 comp. | 6 comp. | 7 comp. | 8 comp. |
|------|---------|---------|---------|---------|---------|---------|---------|---------|
| 100  | 0.01    | 0.01    | 0.02    | 0.03    | 0.04    | 0.05    | 0.06    | 0.06    |
| 200  | 0.06    | 0.12    | 0.18    | 0.24    | 0.28    | 0.33    | 0.36    | 0.38    |
| 300  | 0.13    | 0.27    | 0.39    | 0.51    | 0.59    | 0.68    | 0.70    | 0.74    |
| 400  | 0.25    | 1.11    | 1.77    | 2.14    | 2.25    | 2.29    | 2.25    | 2.11    |
| 500  | 0.35    | 1.27    | 2.03    | 2.52    | 2.72    | 2.78    | 2.84    | 2.7     |
| 600  | 0.48    | 1.44    | 2.24    | 2.84    | 3.18    | 3.34    | 3.34    | 3.39    |
| 700  | 0.56    | 1.51    | 2.31    | 2.91    | 3.30    | 3.50    | 3.62    | 3.60    |
| 800  | 0.70    | 1.68    | 2.55    | 3.29    | 3.83    | 4.20    | 4.43    | 4.51    |
| 900  | 0.69    | 1.66    | 2.53    | 3.22    | 3.72    | 4.01    | 4.21    | 4.29    |
| 1000 | 0.74    | 1.71    | 2.58    | 3.31    | 3.85    | 4.27    | 4.46    | 4.60    |
Efficiency (figure 15):

| $n$  | 1 comp. | 2 comp. | 3 comp. | 4 comp. | 5 comp. | 6 comp. | 7 comp. | 8 comp. |
|------|---------|---------|---------|---------|---------|---------|---------|---------|
| 100  | 0.01    | 0.01    | 0.01    | 0.01    | 0.01    | 0.01    | 0.01    | 0.01    |
| 200  | 0.06    | 0.06    | 0.06    | 0.06    | 0.06    | 0.06    | 0.05    | 0.05    |
| 300  | 0.13    | 0.14    | 0.13    | 0.13    | 0.12    | 0.12    | 0.10    | 0.09    |
| 400  | 0.25    | 0.56    | 0.59    | 0.54    | 0.45    | 0.38    | 0.32    | 0.26    |
| 500  | 0.35    | 0.64    | 0.68    | 0.63    | 0.54    | 0.46    | 0.41    | 0.34    |
| 600  | 0.48    | 0.72    | 0.75    | 0.71    | 0.64    | 0.56    | 0.48    | 0.42    |
| 700  | 0.56    | 0.76    | 0.77    | 0.73    | 0.66    | 0.58    | 0.52    | 0.45    |
| 800  | 0.70    | 0.84    | 0.85    | 0.83    | 0.77    | 0.70    | 0.63    | 0.56    |
| 900  | 0.69    | 0.83    | 0.84    | 0.81    | 0.74    | 0.67    | 0.60    | 0.54    |
| 1000 | 0.74    | 0.86    | 0.86    | 0.83    | 0.77    | 0.71    | 0.64    | 0.58    |
Execution times on the heterogeneous system, in seconds (figure 16):

| $n$  | seq. on Maserati | 4 comp. | 6 comp. | 8 comp. |
|------|------------------|---------|---------|---------|
| 400  | 4.10             | 1.9     | 1.91    | 2.05    |
| 800  | 34.95            | 13.87   | 12.56   | 12.24   |
| 1000 | 72.46            | 23.41   | 21.52   | 19.95   |
| 1200 | 120.96           | 32.35   | 33.81   | 32.2    |
| 1600 | 390.06           | 134.23  | 131.02  | 120.46  |
| 2000 | 626.57           | 208.66  | 198.25  | 186.09  |
Speedup on the heterogeneous system (figure 17):

| $n$  | 4 comp. | 6 comp. | 8 comp. |
|------|---------|---------|---------|
| 400  | 2.16    | 2.15    | 2.00    |
| 800  | 2.52    | 2.78    | 2.85    |
| 1000 | 3.10    | 3.37    | 3.63    |
| 1200 | 3.73    | 3.57    | 3.75    |
| 1600 | 2.91    | 2.98    | 3.23    |
| 2000 | 3.00    | 3.16    | 3.37    |
Efficiency on the heterogeneous system (figure 18):

| $n$  | 4 comp. | 6 comp. | 8 comp. |
|------|---------|---------|---------|
| 400  | 0.54    | 0.36    | 0.25    |
| 800  | 0.63    | 0.46    | 0.36    |
| 1000 | 0.78    | 0.56    | 0.45    |
| 1200 | 0.93    | 0.60    | 0.47    |
| 1600 | 0.73    | 0.50    | 0.40    |
| 2000 | 0.75    | 0.53    | 0.42    |
/* ------------------------------------------------------------ */
/* - PAA - TP 4 - Sequential matrix multiplication             - */
/* - sequential.c - Main program                               - */
/* - Gurvan Huiban - Last modifications: 08/09/2002            - */
/* ------------------------------------------------------------ */

#include "headers-sequential.h"
#include <stdio.h>
#include <stdlib.h>

// Main program
int main(int argc, char * argv[])
{
    int n;       // size of the problem
    int * A;     // first matrix
    int * B;     // second matrix
    int * R;     // result of A*B

    // Initializing the random generator
    initrandom();

    // Checking the validity of the command line
    if (argc != 2) {
        printf("USAGE: sequential size_of_matrix\n\n");
        exit(-1);
    }

    // Getting the value of n
    sscanf(argv[1], "%i", &n);

    // Memory allocation
    A = (int *) malloc(n*n*sizeof(int));
    B = (int *) malloc(n*n*sizeof(int));
    R = (int *) malloc(n*n*sizeof(int));
    if (A == NULL || B == NULL || R == NULL) {
        printf("ERROR: memory allocation failed\n");
        exit(-1);
    }

    // Generation of the matrices
    gen_matrix(A, n);
    printf("A generated\n");
    prn_matrix(A, n);
    gen_matrix(B, n);
    printf("B generated\n");
    prn_matrix(B, n);

    // Solving the problem
    printf("Computation of R = A x B\n");
    solve(R, A, B, n);

    // Writing the result on the screen
    prn_matrix(R, n);

    // Freeing memory
    free(A);
    free(B);
    free(R);

    printf("No error was encountered\n");
    return 0;
}
/* ------------------------------------------------------------ */
/* - PAA - TP 4 - Sequential matrix multiplication             - */
/* - headers-sequential.h - Prototypes                         - */
/* - Gurvan Huiban - Last modifications: 08/09/2002            - */
/* ------------------------------------------------------------ */

#ifndef __MAIN_H__
#define __MAIN_H__

// Maximum value for matrix elements
// (named MAX_MATRIX, not INT_MAX, to avoid clashing with <limits.h>)
#define MAX_MATRIX 50

// Randomly generates a matrix A of size n*n (calls rnd_int)
void gen_matrix(int * A, int n);

// Prints matrix A (of size n*n) on the standard output
void prn_matrix(int * A, int n);

// Computes R = A*B (A, B and R are n*n matrices)
void solve(int * R, int * A, int * B, int n);

// Returns an integer between 0 and i-1
int rnd_int(int i);

// Initialises the random generator
void initrandom(void);

#endif
/* ------------------------------------------------------------ */
/* - PAA - TP 4 - Sequential matrix multiplication             - */
/* - functions-sequential.c - Functions                        - */
/* - Gurvan Huiban - Last modifications: 08/09/2002            - */
/* ------------------------------------------------------------ */

#include "headers-sequential.h"
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

// Randomly generates a matrix A of size n*n
void gen_matrix(int * A, int n)
{
    int i, j;    // counters

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            A[i*n+j] = rnd_int(MAX_MATRIX);
        }
    }
}

// Prints matrix A on the standard output
void prn_matrix(int * A, int n)
{
    int i, j;    // counters

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            printf("%5i ", A[i*n+j]);
        }
        printf("\n");
    }
    printf("\n");
}

// Computes R = A*B (A, B and R are n*n matrices)
void solve(int * R, int * A, int * B, int n)
{
    int sum;     // temporary sum
    int i, j, k; // counters

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0;
            for (k = 0; k < n; k++) {
                sum = sum + (A[i*n+k] * B[k*n+j]);
            }
            R[i*n+j] = sum;
        }
    }
}

// Returns an integer between 0 and i-1
// (double precision avoids the rounding error that could
//  occasionally return i with single-precision floats)
int rnd_int(int i)
{
    double num = (double) rand();
    double denum = RAND_MAX + 1.0;

    return (int)(i * (num / denum));
}

// Initialises the random generator
void initrandom(void)
{
    time_t ltime;

    time(&ltime);
    ltime %= 10000;
    srand(ltime);
}
/* ------------------------------------------------------------ */
/* - PAA - TP 4 - Parallel matrix multiplication               - */
/* - master.c - Main program (master)                          - */
/* - Gurvan Huiban - Last modifications: 27/08/2002            - */
/* - Inspired from an example of master using the TCP          - */
/* - protocol. Reference: W. Richard Stevens, "UNIX Network    - */
/* - Programming", Prentice Hall Software Series, 1990,        - */
/* - chapter 6: Berkeley Sockets.                              - */
/* ------------------------------------------------------------ */

#include <stdio.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <strings.h>
#include <unistd.h>
#include <pthread.h>
#include "headers-master.h"

int main(int argc, char * argv[])
{
    int n;       // size of the problem
    int p;       // number of computers to be used
    int * A;     // first matrix
    int * B;     // second matrix
    int * R;     // result of A*B

    // Initializing the random generator
    initrandom();

    printf("\n\n BEGINNING OF COMPUTATION\n\n");

    // Checking the validity of the command line
    if (argc != 3) {
        printf("USAGE: master size_of_matrix number_of_computers\n\n");
        exit(-1);
    }

    // Getting the values of n and p
    sscanf(argv[1], "%i", &n);
    sscanf(argv[2], "%i", &p);

    // Memory allocation
    A = (int *) malloc(n*n*sizeof(int));
    B = (int *) malloc(n*n*sizeof(int));
    R = (int *) malloc(n*n*sizeof(int));
    if (A == NULL || B == NULL || R == NULL) {
        printf("ERROR: memory allocation failed\n");
        exit(-1);
    }

    // Generation of the matrices
    gen_matrix(A, n);
    printf("A generated\n");
    prn_matrix(A, n);
    gen_matrix(B, n);
    printf("B generated\n");
    prn_matrix(B, n);

    // Launches the computation
    solve(R, A, B, n, p);

    // Prints the result
    prn_matrix(R, n);

    // Frees memory
    free(A);
    free(B);
    free(R);

    return 0;
}
/* ------------------------------------------------------------ */
/* - PAA - TP 4 - Parallel matrix multiplication               - */
/* - headers-master.h - Prototypes of functions for the master - */
/* - Gurvan Huiban - Last modifications: 27/08/2002            - */
/* ------------------------------------------------------------ */

#ifndef __HEADERS_MASTER_H__
#define __HEADERS_MASTER_H__

#include <pthread.h>

// Size of an IP address ("xxx.xxx.xxx.xxx" plus '\0')
#define HOST_ADDR_SIZE 16
// Number of computers available
#define NUM_HOSTS 8
// Port to be used in TCP communications
#define SERV_TCP_PORT 2048
// Maximum value for matrix elements
#define MAX_MATRIX 50

// Argument for a thread
typedef struct ThreadArgs_t {
    int id;                              // id of the thread
    char * hostaddr;                     // address of the slave
    int * pA;                            // pointers on A, B and R
    int * pB;
    int * pR;
    int sizePb;                          // size of the problem (e.g. n)
    int * computeLine;                   // pointer on the line number to compute
    pthread_mutex_t * computeLineMutex;  // mutex on computeLine
    pthread_mutex_t * RMutex;            // mutex on R
} strThreadArgs;

// Reads the IP addresses of the slaves from file ip_address.txt
void readip(char addr[NUM_HOSTS][HOST_ADDR_SIZE]);

// Randomly generates a matrix A of size n*n (calls rnd_int)
void gen_matrix(int * A, int n);

// Prints matrix A (of size n*n) on the standard output
void prn_matrix(int * A, int n);

// Computes R = A*B using p computers (A, B and R are n*n matrices)
void solve(int * R, int * A, int * B, int n, int p);

// Function executed by each thread to solve the problem
// (args has type strThreadArgs)
void * threadsolve(void * args);

// Returns an integer between 0 and i-1
int rnd_int(int i);

// Initialises the random generator
void initrandom(void);

#endif
/* ------------------------------------------------------------ */
/* - PAA - TP 4 - Parallel matrix multiplication               - */
/* - functions-master.c - Functions for the master             - */
/* - Gurvan Huiban - Last modifications: 27/08/2002            - */
/* ------------------------------------------------------------ */

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include "headers-master.h"

// Reads the IP addresses of the slaves from file ip_address.txt
void readip(char addr[NUM_HOSTS][HOST_ADDR_SIZE])
{
    FILE * file;                    // input file
    char buffer[HOST_ADDR_SIZE];
    int i;

    file = fopen("ip_address.txt", "r");
    if (file == NULL) {
        printf("ERROR: Problem encountered while opening file ip_address.txt\n\n");
        exit(-1);
    }
    for (i = 0; i < NUM_HOSTS; i++) {
        fgets(buffer, HOST_ADDR_SIZE, file);
        buffer[strcspn(buffer, "\n")] = '\0';   // strips the trailing newline
        strcpy(&(addr[i][0]), buffer);
    }
    fclose(file);
}

/* gen_matrix, prn_matrix, rnd_int and initrandom are identical to their
   counterparts in functions-sequential.c and are not repeated here. */

// Creates the threads in order to compute R = A*B
void solve(int * R, int * A, int * B, int n, int p)
{
    char addr[NUM_HOSTS][HOST_ADDR_SIZE];   // addresses of the slaves
    pthread_t * proc;                       // array of threads
    strThreadArgs * threadArgs;             // arguments for the threads
    int threadId;                           // number of the current thread
    int lineNumber = 0;                     // line number to compute
    pthread_mutex_t * lineMut;              // mutexes to avoid concurrent
    pthread_mutex_t * RMut;                 // writing operations

    // Gets the hosts' addresses
    readip(addr);

    // Memory allocation for the threads, their arguments and the mutexes
    proc = (pthread_t *) malloc(p*sizeof(pthread_t));
    threadArgs = (strThreadArgs *) malloc(p*sizeof(strThreadArgs));
    lineMut = (pthread_mutex_t *) malloc(sizeof(pthread_mutex_t));
    RMut = (pthread_mutex_t *) malloc(sizeof(pthread_mutex_t));

    // Initialization of the mutexes
    pthread_mutex_init(lineMut, NULL);
    pthread_mutex_init(RMut, NULL);

    // Creates one thread per slave
    for (threadId = 0; threadId < p; threadId++) {
        threadArgs[threadId].id = threadId;
        threadArgs[threadId].hostaddr = addr[threadId];
        threadArgs[threadId].pA = A;
        threadArgs[threadId].pB = B;
        threadArgs[threadId].pR = R;
        threadArgs[threadId].sizePb = n;
        threadArgs[threadId].computeLine = &lineNumber;
        threadArgs[threadId].computeLineMutex = lineMut;
        threadArgs[threadId].RMutex = RMut;
        pthread_create(&proc[threadId], NULL, threadsolve,
                       &(threadArgs[threadId]));
    }

    // When all threads have finished, the computation is over
    // (p threads were created, so only p threads are joined)
    for (threadId = 0; threadId < p; threadId++) {
        pthread_join(proc[threadId], NULL);
    }

    // Frees memory
    free(threadArgs);
    free(proc);
    free(lineMut);
    free(RMut);
}

// Function executed by each thread to solve the problem
void * threadsolve(void * voidargs)
{
    strThreadArgs * args;           // pointer on the arguments
    int threadId;                   // id of the thread
    char * addr;                    // IP address of the slave
    int * A, * B, * R;              // pointers on A, B and R
    int * buffer;                   // buffer for sent/received vectors
    int n;                          // size of the problem
    int * line;                     // pointer on the next line to compute
    struct sockaddr_in serv_addr;   // data structure for the socket
    int socketId;                   // id of the socket
    int vectorSize;                 // n*sizeof(int)
    char computation = 1;           // boolean: the computation goes on
    int received;                   // number of bytes received
    int conv;                       // integer in network byte order
    pthread_mutex_t * lineMut;      // mutexes
    pthread_mutex_t * RMut;
    int i, j;

    // Extracts the arguments from the data structure passed at creation
    args = (strThreadArgs *) voidargs;
    threadId = args->id;
    addr = args->hostaddr;
    A = args->pA;
    B = args->pB;
    R = args->pR;
    n = args->sizePb;
    line = args->computeLine;
    lineMut = args->computeLineMutex;
    RMut = args->RMutex;

    // Memory allocation
    buffer = (int *) malloc(n*sizeof(int));
    vectorSize = n*sizeof(int);

    // Creation of the socket
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_addr.s_addr = inet_addr(addr);
    serv_addr.sin_port = htons(SERV_TCP_PORT);

    // Opens a TCP socket (an Internet stream socket)
    if ((socketId = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        fprintf(stderr, "master: can't open stream socket %i\n", threadId);
        exit(-1);
    }
    // Connects to the slave
    if (connect(socketId, (struct sockaddr *) &serv_addr,
                sizeof(serv_addr)) < 0) {
        fprintf(stderr, "master: can't connect to slave %i\n", threadId);
        exit(-1);
    }

    // Sends the size of the problem
    conv = htonl(n);
    if (send(socketId, &conv, sizeof(int), 0) != sizeof(int)) {
        fprintf(stderr, "master: can't send size of the problem\n");
    }

    // Sends the matrix B, row by row
    for (i = 0; i < n; i++) {
        // Copies row i into the buffer and converts the integers
        for (j = 0; j < n; j++) {
            buffer[j] = htonl(B[i*n+j]);
        }
        if (send(socketId, buffer, vectorSize, 0) != vectorSize) {
            fprintf(stderr, "master: can't send the row\n");
        }
    }

    // A thread does not know how many computations it will deal with
    while (computation) {
        // Locks the row counter BEFORE testing it, so that no two
        // threads can take the same row
        pthread_mutex_lock(lineMut);
        i = (*line);
        if (i < n) {
            (*line)++;      // the next thread will take another row
        }
        pthread_mutex_unlock(lineMut);

        // The computation only ends when the line to process is n
        if (i < n) {
            // Sends the line number to compute
            conv = htonl(i);
            if (send(socketId, &conv, sizeof(int), 0) != sizeof(int)) {
                fprintf(stderr, "master: can't send the line number\n");
            }
            // Copies and converts row i of A, then sends it
            for (j = 0; j < n; j++) {
                buffer[j] = htonl(A[i*n+j]);
            }
            if (send(socketId, buffer, vectorSize, 0) != vectorSize) {
                fprintf(stderr, "master: can't send the row\n");
            }
            // Waits for the result of the computation
            received = 0;
            while (received < vectorSize) {
                received += recv(socketId, &(((char *)buffer)[received]),
                                 vectorSize-received, 0);
            }
            // Updates the matrix R; to do so, locks it
            pthread_mutex_lock(RMut);
            for (j = 0; j < n; j++) {
                R[i*n+j] = ntohl(buffer[j]);
            }
            pthread_mutex_unlock(RMut);
        } else {
            // All rows have been assigned: computation is finished.
            // Sends the value n to warn the slave.
            computation = 0;
            conv = htonl(n);
            if (send(socketId, &conv, sizeof(int), 0) != sizeof(int)) {
                fprintf(stderr, "master: can't send the end marker\n");
            }
        }
    }

    // Closes the socket and frees memory
    close(socketId);
    free(buffer);
    return NULL;
}
/* ------------------------------------------------------------ */
/* - PAA - TP 4                                               - */
/* - Parallel matrix multiplication                           - */
/* ------------------------------------------------------------ */
/* - slave.c                                                  - */
/* - Main program (slave)                                     - */
/* ------------------------------------------------------------ */
/* - Gurvan Huiban                                            - */
/* - Last modifications: 08/09/2002                           - */
/* ------------------------------------------------------------ */
/* - Inspired from an example of a slave using the TCP        - */
/* - protocol. Reference: W. Richard Stevens, "UNIX Network   - */
/* - Programming", Prentice Hall Software Series, 1990,       - */
/* - Chapter 6: Berkeley Sockets.                             - */
/* ------------------------------------------------------------ */

/* ---------
 * INCLUDES
 * --------- */
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <unistd.h>
#include "headers-slave.h"

/* -------------
 * MAIN PROGRAM
 * ------------- */
int main(int argc, char *argv[])
{
  /* ----------
   * VARIABLES
   * ---------- */
  // Id of the socket connected to the master
  int socketServ;
  // Id of the slave's listening socket
  int socketId;
  // Structure for creating the socket
  struct sockaddr_in serv_addr;
  // Boolean: go on listening
  char listening = 1;
  // Size of the problem
  int n;
  // Row of matrix A
  int * A;
  // Matrix B
  int * B;
  // Result of the computation of a row of A*B
  int * res;
  // Size of a vector (n * sizeof(int))
  int vectorSize;
  // Number of bytes effectively received through the socket
  int received;
  // Number of rows computed by the slave
  int numComputation = 0;
  // Number of the line currently processed by the slave
  int numline;
  // Buffer for receiving vectors
  int * buffer;
  int i, j;
  int k;
  // Integer used for byte-order conversions
  int conv;

  /* --------------------
   * CODE OF THE PROGRAM
   * -------------------- */
  printf("\n\nBEGINNING OF COMPUTATION\n\n");

  // Creation of the socket
  if ( (socketId = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
    fprintf(stderr, "slave: can't open stream socket\n");
  }

  // Binding the socket
  bzero((char *) &serv_addr, sizeof(serv_addr));
  serv_addr.sin_family = AF_INET;
  serv_addr.sin_addr.s_addr = htonl(INADDR_ANY);
  serv_addr.sin_port = htons(SERV_TCP_PORT);
  if (bind(socketId, (struct sockaddr *) &serv_addr,
           sizeof(serv_addr)) < 0) {
    fprintf(stderr, "slave: can't bind local address\n");
  }

  // Listen on the socket
  listen(socketId, QUEUE);

  // Wait for a connection from the master process
  socketServ = accept(socketId, NULL, NULL);
  if (socketServ < 0) {
    fprintf(stderr, "slave: accept error\n");
  }

  // Receive the size of the problem
  if (recv(socketServ, &conv, sizeof(int), 0) != sizeof(int)) {
    fprintf(stderr, "slave: can't receive size of problem\n");
    close(socketServ);
    exit(-1);
  }
  // Convert it to host byte order
  n = ntohl(conv);

  // Memory allocation
  vectorSize = n*sizeof(int);
  A = (int *) malloc(n*sizeof(int));
  B = (int *) malloc(n*n*sizeof(int));
  buffer = (int *) malloc(n*sizeof(int));
  res = (int *) malloc(n*sizeof(int));

  printf("Size of the problem: %i\n", n);

  // Receive B, row by row
  for (i=0; i<n; i++) {
    // As the entire row may not arrive at once, recv may have to be
    // called several times. "received" counts the number of bytes
    // effectively received; the loop calls recv until the whole row
    // has been received.
    received = 0;
    while (received < vectorSize) {
      received += recv(socketServ, &(((char *)buffer)[received]),
                       vectorSize-received, 0);
    }
    // Copy the row into B, converting it to host byte order
    for (j=0; j<n; j++) {
      B[i*n+j] = ntohl(buffer[j]);
    }
  }

  // The slave does not know in advance how many lines it will process.
  // Consequently, it goes on listening and computing until it receives
  // the indication to compute line n.
  while (listening) {
    // Receive the line number
    if (recv(socketServ, &conv, sizeof(int), 0) != sizeof(int)) {
      fprintf(stderr, "slave: can't receive the value of numline\n");
    }
    // Convert it
    numline = ntohl(conv);
    // Check whether the computation is finished (i.e. numline >= n)
    if (numline < n) {
      // Receive the row of A necessary to compute the resulting row
      received = 0;
      while (received < vectorSize) {
        received += recv(socketServ, &(((char *)buffer)[received]),
                         vectorSize-received, 0);
      }
      // Copy this row, converting it
      for (k=0; k<n; k++) {
        A[k] = ntohl(buffer[k]);
      }
      // Launch the computation of the result
      // (the result is already converted, "ready to be sent")
      computeResult(A, B, res, n);
      numComputation++;
      // Send the result to the master
      if (send(socketServ, res, n*sizeof(int), 0) != n*sizeof(int)) {
        fprintf(stderr, "slave: can't send response\n");
      }
    } else {
      // Computation is finished
      listening = 0;
      close(socketServ);
      close(socketId);
      printf("Number of computations processed: %i\n", numComputation);
    }
  }

  // Free memory
  free(A);
  free(B);
  free(buffer);
  free(res);
  return (0);
}
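Remark: the byte-counting reception loop appears in several places, both in the master and in the slave. It could be factored into a small helper such as the following sketch, which is ours and not part of the original source; contrary to the inline loops above, it also checks the return value of recv.

// Sketch of a reusable "receive exactly len bytes" helper
// (illustrative, not part of the original source). Unlike the
// inline loops above, it stops if recv reports an error or a
// connection closed by the peer.
#include <sys/types.h>
#include <sys/socket.h>

int recvAll(int sock, void *buf, int len)
{
  int received = 0;
  int r;

  while (received < len) {
    r = recv(sock, ((char *)buf) + received, len - received, 0);
    if (r <= 0) {
      return -1;  // error or connection closed by the peer
    }
    received += r;
  }
  return received;
}

Each "received = 0; while (received < vectorSize) ..." block could then be replaced by a single call such as recvAll(socketServ, buffer, vectorSize), testing the returned value.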
/* ------------------------------------------------------------ */
/* - PAA - TP 4                                               - */
/* - Parallel matrix multiplication                           - */
/* ------------------------------------------------------------ */
/* - headers-slave.h                                          - */
/* - Prototypes of functions for the slave                    - */
/* ------------------------------------------------------------ */
/* - Gurvan Huiban                                            - */
/* - Last modifications: 08/09/2002                           - */
/* ------------------------------------------------------------ */

#ifndef __HEADERS_SLAVE_H__
#define __HEADERS_SLAVE_H__

/* --------
 * DEFINES
 * -------- */
// Port used for communication
#define SERV_TCP_PORT 2048
// Size of the queue for the listening TCP port
#define QUEUE 100

/* -----------
 * PROTOTYPES
 * ----------- */

/* ------------------------------------------------------------ */
/* - function computeResult                                   - */
/* ------------------------------------------------------------ */
/* - Computes the product between a row of A and the matrix B - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* -   (int *) A: pointer to a row of A                       - */
/* -   (int *) B: pointer to B                                - */
/* -   (int *) res: pointer to the resulting row              - */
/* -   (int) n: problem dimension                             - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* -   none                                                   - */
/* ------------------------------------------------------------ */
void computeResult(int * A, int * B, int * res, int n);

#endif
/* ------------------------------------------------------------ */
/* - PAA - TP 4                                               - */
/* - Parallel matrix multiplication                           - */
/* ------------------------------------------------------------ */
/* - functions-slave.c                                        - */
/* - Functions for the slave                                  - */
/* ------------------------------------------------------------ */
/* - Gurvan Huiban                                            - */
/* - Last modifications: 08/09/2002                           - */
/* ------------------------------------------------------------ */

#include <netinet/in.h>
#include <stdlib.h>
#include <stdio.h>
#include "headers-slave.h"

/* ------------------------------------------------------------ */
/* - function computeResult                                   - */
/* ------------------------------------------------------------ */
/* - Computes the product between a row of A and the matrix B - */
/* -                                                          - */
/* - INPUT:                                                   - */
/* -   (int *) A: pointer to a row of A                       - */
/* -   (int *) B: pointer to B                                - */
/* -   (int *) res: pointer to the resulting row              - */
/* -   (int) n: problem dimension                             - */
/* -                                                          - */
/* - OUTPUT:                                                  - */
/* -   none                                                   - */
/* ------------------------------------------------------------ */
void computeResult(int * A, int * B, int * res, int n)
{
  /* ----------
   * VARIABLES
   * ---------- */
  int j, i;
  int r;

  /* ---------------------
   * CODE OF THE FUNCTION
   * --------------------- */
  // Compute the n elements of the resulting row
  for (j=0; j<n; j++) {
    r = 0;
    for (i=0; i<n; i++) {
      r += A[i]*B[i*n+j];
    }
    // Convert the result to network byte order before it is sent
    res[j] = htonl(r);
  }
}
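Remark: to illustrate the semantics of computeResult, the following minimal test (ours, not part of the source) multiplies the first row of a 2x2 matrix A by a 2x2 matrix B. It should print "19 22", i.e. the first row of A*B.

// Minimal check of computeResult on a 2x2 instance (illustrative only).
#include <netinet/in.h>
#include <stdio.h>
#include "headers-slave.h"

int main(void)
{
  int A[2] = {1, 2};       // first row of A = (1 2)
  int B[4] = {5, 6,
              7, 8};       // B, stored row by row
  int res[2];
  int j;

  computeResult(A, B, res, 2);
  // res holds the first row of A*B in network byte order:
  // (1*5 + 2*7, 1*6 + 2*8) = (19, 22)
  for (j = 0; j < 2; j++) {
    printf("%i ", ntohl(res[j]));
  }
  printf("\n");
  return 0;
}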