README.md

   1 #Task solution - matrix multiplication in fortran#
   2 This repository contains the realization of the [first task](http://home.agh.edu.pl/~macwozni/fort/zadanie1.pdf) of the fortran programming language course at AGH in Cracov.
   3
   4 ##Directory structure##
   5
   6     CMakeLists.txt
   7     README.md
   8     make_results.sh
   9     template.gnuplot
  10     src/
  11         blockmath.F90
  12         testing.F90
  13         naivemath.F90
  14         bettermath.F90
  15         main.F90
  16         dotmath.F90
  17         bettermath2.F90
  18     res/
  19         dot_16
  20         dot_8
  21         mat_8
  22         bett2_4
  23         bett_4
  24         bett2_8
  25         block_4
  26         naiv_8
  27         mat_16
  28         bett_8
  29         naiv_4
  30         block_8
  31         naiv_16
  32         mat_4
  33         dot_4
  34         bett_16
  35         block_16
  36         bett2_16
  37         wykres4.pdf
  38         wykres8.pdf
  39         wykres16.pdf
  40         wykres4.svg
  41         wykres8.svg
  42         wykres16.svg
  43
  44 ###CmakeLists.txt###
  45 This file contains CMake definitions for building fortran sources. It does not, however, define the actions of running the program, generating results, nor plotting.
  46
  47 ###make_results.sh###
  48 This script runs built program under different options and writes measured multiplication times in `res/{naiv,bett,dot,mat,bett2,block}_{4,8,16}` while also running gnuplot to generate their corresponding plots into `res/wykres{4,8,16}.pdf`.
  49
  50 ###template.gnuplot###
  51 It's the gnuplot script for generating plots from text files in `res`. In this script `[kind]` has to be replaced with actual number (for example with the help of `sed`, see `make_results.sh` for how it's done).
  52
  53 ###src/###
  54 This directory contains the fortran sources of the project, namely modules sources `*math.F90` and main program source `main.F90`.
  55
  56 ###res/###
  57 This directory contains text results of matrix multiplication time measuring (files without extension) and their corresponding plots (pdf files as required for the task and respective svg versions for embedding here).
  58
  59 ##Building##
  60 As a prerequisite, CMake, make and a fortran compiler, either gfortran or ifort, are required (didn't want to support proprietary compiler :c but had to).
  61
  62 `make_results.sh` assumes an out-of-source build in `build/` has been performed.
  63
  64     $ mkdir build
  65     $ cd build
  66     $ cmake ../
  67     $ make mull
  68
  69 The CMake configuration provided allows one to specify one of twosets of compiler flags.  The default is RELEASE and it can be changed to DEBUG using
  70
  71     $ cmake .. -DCMAKE_BUILD_TYPE=DEBUG
  72
  73 ##Running##
  74 The executable is `mull`.
  75
  76      $ ./mull
  77      Usage: mull KIND IMPLEMENTATION
  78      where KIND is one of: 4, 8, 16
  79       IMPLEMENTATION is one of: naiv, bett, dot, mat, bett2, block
  80
  81 When ran with proper arguments, it prints matrix sizes and multiplication times in columns.
  82
  83 Text files and plots in `res/` can be regenerated by issuing
  84
  85     $ ./make_results.sh
  86
  87 from the main project directory (may take up to several tens of minutes).
  88
  89 ##Implementation##
  90 5 versions of matrix multiplication are implemented in separate files under `src/`. They're all called and their running times measured in `src/main.F90` together with fortran's built-in `matmul()` intrinsic function.
  91
  92 ####naive####
  93 Implemented in `src/naivemath.F90` the order of loops is i, j, k.
  94
  95 ####better####
  96 Implemented in `src/bettermath.F90` the order of loops is j, k, i. It's expected to be faster than naive, because the innermost loop accesses array elements that are closed to each other in memory.
  97
  98 ####dot####
  99 Implemented in `src/dotmath.F90` the order of loops is i, j, k. The innermost loop is replaced with dot_product() instrinsic function call, that is supposed to aid it's optimization.
 100
 101 ####better2####
 102 Implemented in `src/bettermath2.F90` the order of loops is j, k, i. The innermost loop is replaced with an operation on an entire column.
 103
 104 ####block####
 105 Implemented in `src/blockmath.F90`. The multiplication of matrices is achieved by many multiplications of submatrices (blocks).  The order of loops for one block is j, k, i with the innermost loop replaced with an operation on entire blocks' columns.
 106
 107 ##Results##
 108
 109 ####KIND=4####
 110 ![plot kind=4](res/wykres4.svg)
 111
 112 ####KIND=8####
 113 ![plot kind=8](res/wykres8.svg)
 114
 115 ####KIND=16####
 116 ![plot kind=16](res/wykres16.svg)
 117
 118 ##Conclusions##
 119 As expected, modification of algorithm to reference memory more locally improves the execution speed (difference between naive and better). This is achieved through better processor cache exploitation.
 120
 121 Usage of intrinsic `dot_product()` as well as fortran's array operations didn't really help the compiler further optimize code (almost no diference between better and better2, between naive and dot). However, the same change in a bigger fashion, mainly - application of `matmul()`, gave significant performance gains.
 122
 123 Block array multiplication is supposed to increase temporal locality of operations, thus allowing efficient use use L1 cache. Whether this method is really beneficial depends on factors like processor model. The performance gain, if it there is any, is greater for bigger matices. In this experiment this algorithm was almost aqually effective to better and better2 implementations.
 124
 125 The differences between algorithms' speeds are biggest for single precision numbers and almost unnoticable in case of kind 16 precision. One possible explaination of this is that operations on high precision numbers consume more processor cycles, hence
 126 1. memory accesses have smaller impart on the overall running time of the code
 127 2. they can be performed by processor in parallel with arithmetic operations and have a chance of completing before the FPU can start another operation