\documentclass[a4paper]{memoir}
\usepackage[T1]{fontenc}
\usepackage{charter}
\usepackage{microtype}

\usepackage{amsmath}
\usepackage{amssymb}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{example}{Example}
\newtheorem{remark}{Remark}

\title{eTorrent documentation}
\author{Jesper Louis Andersen \\ jesper.louis.andersen@gmail.com}
\date{\today}
\begin{document}
\maketitle{}
\tableofcontents{}
\section{Introduction}
Why write a complete analysis and documentation of the software? This
is normally not the way Open Source software is written. Rather than
sit down and think first, most people seem happy to write code and
let the code serve as the documentation of what to do. The problem
with that approach is that I hop on and off this project, and that
makes such ad-hoc development impossible.
When you grow up, it becomes clear that writing extensive
documentation for a piece of software gives a much better
result. Documentation is key to correct and elegant
software. Documentation is key to extensive analysis before
code. Thus, we must write documentation.
\chapter{Requirements}
\label{chap:requirements}

The plan is to build a BitTorrent client. What will set this client
apart from other clients is its fault tolerance. In general, it
should not be possible to take the client down, even in the case of
errors in the client, errors on the disk, system crashes, etc.
\paragraph{Fault tolerance} The client should be fault-tolerant. In
general, an error in any part of the client must not result in the
whole client being brought down. The client should also be able to
recover quickly from such an error.

The client should avoid disk corruption at all costs. It is
acceptable, however, for the client to assume that the underlying
operating system does not corrupt the disk.
\paragraph{Unattended operation} The client must run unattended. The
interface to the client is a directory hierarchy: torrent files
placed in the hierarchy are downloaded, and when a torrent file is
removed, its download is stopped. There is no requirement for an
interactive interface.
\paragraph{Performance} The client should run at an adequate speed.
It should be able to run up to the usual limits of disk I/O. The
client should not pursue speed aggressively; use rtorrent for that.
The client must be able to serve a large number of simultaneous
torrents. We aim for thousands of torrents served at the same time,
given a large enough server. There are three iterations: 1 torrent,
100 torrents and 1000 torrents, which must be achieved in that order
in releases of the software.
\chapter{Analysis}
\section{Pieces}
Pieces are the central elements we exchange between peers. A torrent
file consists of several pieces. Each is identified by a natural
number indexed from $1$. This identification serves as the primary
key in our implementation. Several pieces of information are linked
to the primary key. First, each piece has a size. This is important,
since the last piece rarely has the same size as the other pieces.
Second, pieces have binary data associated with them of the given
size. Third, this data has a checksum. Finally, the piece is mapped
to a list of triples, $(path, offset, length)$, which designates
where the piece should go on disk.
\begin{example}
A simple piece could have a list like the following:
\begin{verbatim}
[{foo, 10, 20},
 {bar, 30, 50}]
\end{verbatim}
It should be interpreted as follows: write 20 bytes to
\texttt{foo} at offset 10 and then write 50 bytes to \texttt{bar} at
offset 30.
\end{example}
\begin{remark}
Invariant: the sum of the lengths in the list of triples designating
where the piece is stored must equal the size of the piece.
\end{remark}
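
To make the invariant concrete, here is a minimal sketch in Erlang
(module and function names are hypothetical, not taken from the
implementation) which checks a chunk list against a piece size:
\begin{verbatim}
-module(piece_invariant).
-export([check/2]).

%% Chunks are {Path, Offset, Length} triples as in the example
%% above. The invariant holds when the lengths sum to the
%% size of the piece.
check(PieceSize, Chunks) ->
    PieceSize =:= lists:sum([Len || {_Path, _Off, Len} <- Chunks]).
\end{verbatim}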
\paragraph{On piece size}
It seems correct to keep track of piece size all over the
system. First, if we run several torrents, they may have different
piece sizes. Second, it will greatly reduce the need to take special
care of the last piece.
\paragraph{On piece checksum}
We should always checksum a piece which has been read. First, it
guards against disk corruption: a corrupted piece can then never be
transmitted over the network. Second, it is cheap to check a piece in
memory. Third, it serves as a great assertion invariant in the
system: all written pieces should preserve their checksum when read
back.

When writing a piece, it should be checked as well. There is no point
in writing something which has accidentally become corrupted. As we
mostly retrieve the binary data associated with a piece from a peer,
we really have no control over its correctness, so checked it must
be.
\section{Filesystem interaction}
\subsection{Piece serving}
When we wish to serve a piece from disk, we must carry out a number
of operations: we must locate the piece on disk, we must load it into
memory, and we must break it up so it can be sent to the peer who
requested it.

A piece is located by its piece number. Given the piece number, we
deduce the files which comprise the piece in question and the
(offset, length) pairs we have to read inside them. In other words,
let $pid$ be a piece identification. Further, let $path$ be a file
system path (UNIX notation). Finally, let $offset, length \in
\mathbb{N}$. We have the function:
\begin{equation*}
  \mathtt{locate\_piece} \colon pid \to (path, offset, length)\; list
\end{equation*}
Then, when the piece is located, we must load it. Assume the
existence of a function that can read pieces:
\begin{equation*}
  \mathtt{read\_piece} \colon (path, offset, length) \; list \to
  binary
\end{equation*}
where $binary$ is binary data. When the data has been read, we check
its cryptographic checksum. If the check doesn't match at this point,
we have an error situation which must be handled appropriately.

Then the checksummed piece is sent to the process responsible for
peer communication. Since peers can choose their block size as they
see fit, the cut operation must not be handled centrally, but in the
peer communication process.
\subsection{Piece retrieval}

When we get a piece from a peer, we begin by making a checksum
check. If this check fails, we answer the peer communication process
with an error and note that the peer gave us a bad piece. This can be
used by the peer communication process to mark the peer as ``dirty''
and eventually to disconnect and blacklist it.

If the piece is ok, we look up the checksum in the map of
checksums. It must match the identification of the piece. If not,
that is an error as well. If both the checksum and the identification
match, we store the piece.

There are several storage methods at our disposal:
\paragraph{Method 1}
Create all files. Use \texttt{fseek(3)} to fast-forward to the point
in the file we want to write and write down the piece at its correct
slot.

The advantage of this approach is its simplicity: it is easy to
implement. It may, however, introduce sparse files. We may also
pre-fill all files with zeros to avoid producing sparse files, but
this is a problem because it takes time and it introduces files
on-disk essentially devoid of information. On the other hand,
pre-filling ensures that the file can always be written regardless of
free space.

We note that the Azureus client seems to use an approach like this.
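
A sketch of method 1 with pre-fill in Erlang (module and function
names are mine, not the implementation's); note that pieces are
written with the file opened in read-write mode, since opening for
writing alone would truncate it:
\begin{verbatim}
-module(method1).
-export([preallocate/2, write_chunk/3]).

%% Pre-fill a file with Size zero bytes, 64KB at a time,
%% so that no sparse file is produced.
preallocate(Path, Size) ->
    {ok, Fd} = file:open(Path, [write, binary, raw]),
    Zeros = binary:copy(<<0>>, 65536),
    ok = fill(Fd, Size, Zeros),
    ok = file:close(Fd).

fill(_Fd, 0, _Zeros) -> ok;
fill(Fd, Left, Zeros) when Left >= byte_size(Zeros) ->
    ok = file:write(Fd, Zeros),
    fill(Fd, Left - byte_size(Zeros), Zeros);
fill(Fd, Left, _Zeros) ->
    ok = file:write(Fd, binary:copy(<<0>>, Left)).

%% Write a piece chunk at its slot; pwrite seeks and writes
%% in one operation, like fseek(3) followed by a write.
write_chunk(Path, Offset, Bin) ->
    {ok, Fd} = file:open(Path, [read, write, binary, raw]),
    ok = file:pwrite(Fd, Offset, Bin),
    ok = file:close(Fd).
\end{verbatim}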
\paragraph{Method 2}
Write the file contiguously. Call the on-disk piece locations
slots. We first write to slot 1, then slot 2, then slot 3 and so
forth. Pieces are written as they come in, so they may not initially
be written to their correct slots.

This can be alleviated by using a sorting algorithm on the
pieces. There are several applicable sorting algorithms. A simple
solution would be exchange selection:

Assume the pieces $1$ through $n$ are sorted correctly. We write
pieces contiguously to slots $n+1, n+2, \dotsc$. When piece $n+1$ is
retrieved, we exchange the piece in slot $n+1$ with this new
piece. To do this safely, we use a free slot on disk as a temporary
variable and ensure we copy the old piece out of slot $n+1$
first. Thus, a crash will not result in the loss of data. Note that
we then have pieces $1$ through $n+1$ placed correctly. We then
repeat for slot $n+2$, whose piece we may already have retrieved. The
question is how many exchanges this makes, as disk I/O is heavy and a
major limiting factor in BitTorrent clients.
For each slot there are at most 3 writes: one for the contiguous
write, one when the piece that belongs in the slot gets written, and
one to make room for that piece. Thus, the algorithm is
$\mathcal{O}(n)$ writes with a constant factor of around 3.

The original BitTorrent client by Bram Cohen uses a variant of this
approach.
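
The exchange step could be sketched like this in Erlang (the slot
arithmetic and names are assumptions; for brevity, all slots are
taken to live in one file):
\begin{verbatim}
%% Install NewPiece in its slot, saving the piece currently
%% occupying the slot to a free temporary slot first, so that
%% a crash between the two writes loses no data.
exchange(Fd, SlotSize, Slot, TempSlot, NewPiece) ->
    {ok, Old} = file:pread(Fd, Slot * SlotSize, SlotSize),
    ok = file:pwrite(Fd, TempSlot * SlotSize, Old),
    ok = file:pwrite(Fd, Slot * SlotSize, NewPiece).
\end{verbatim}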
\paragraph{Method 3}
Use \texttt{mmap(2)}. A file is mapped into memory at a given
location. Writes are done to memory, and the virtual memory subsystem
of the operating system is responsible for flushing the writes to the
correct location on disk. This is extremely easy and fast as well,
but there are a couple of limitations.

On a 32-bit architecture, we don't have enough address space to keep
several gigabytes of data mapped in. Hence, we will either need to
use a pure 64-bit operating system or we will need to devise an
algorithm for mapping parts of files in and out. We need to do this
anyway, since we can't expect to keep every file mapped at the same
time.

Rtorrent uses this approach.
\paragraph{Method 4}
Use internal storage. We can choose to represent the data internally
in an on-disk persistent format. Then, when we have the whole file,
we can write it out. Each piece gets written exactly 2 times, so it
may seem better than method 2. On the other hand, the method has a
problem: we can't look at the data until everything is downloaded.
\paragraph{Discussion}
My intuition tells me that method 1 with pre-fill is the easiest to
implement. Thus, we choose to implement that solution first. We can
change to another method later, when the basics of the client are in
place and working.
\subsection{What to do at startup?}
When the system starts, we have no idea what we have piece-wise of a
torrent. Hence, we must halt all communication with others until we
know which pieces we have and which pieces we miss. We will check one
torrent at a time, which will require some control.

For each torrent, we will begin loading in pieces. Each piece either
fails its check or passes it. If method 2 is chosen for piece
storage, we need to identify the pieces as we read them. There must
be some error handling in the loading code, so we gracefully handle
mis-loads.

If a file is missing on disk, we will create it and pre-fill it with
zeros. Hence, we have the following invariant: ``file system
processes can assume there is access to the needed files''.
\subsection{Handling checksum read errors}
What happens when a checksum read reports an error? There are 2
causes for this: disk corruption and a system crash/reset. The most
probable is that the system was reset. Thus, we mark the piece as bad
and ignore it as if it did not exist. Done correctly, it seems we can
then continue running.

Disk corruption is much more fatal. We will assume data is not
corrupted on the disk. Modern file systems like ZFS (see \cite{zfs})
carry out checks of all read blocks, and thus disk corruption is
nearly impossible in such a scheme.
\section{Peer processes}

General rule: we try to carry out bookkeeping as close to the peer as
possible. That is, we update mnesia tables whenever a message arrives
or whenever a message gets sent, in an early/late manner: upon
arrival, the first thing we do is update the database tables locally;
upon sending, the update is the last thing we do. Sender/receiver
processes are responsible for updating and tracking the information.
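
As an illustration of the early/late rule (the helper functions here
are hypothetical placeholders for the mnesia updates):
\begin{verbatim}
%% Upon arrival: bookkeeping first, then the real work.
handle_incoming(PeerId, Msg) ->
    ok = note_received(PeerId, Msg),
    process_message(PeerId, Msg).

%% Upon sending: the real work first, bookkeeping last.
send_outgoing(Socket, PeerId, Msg) ->
    ok = gen_tcp:send(Socket, encode(Msg)),
    note_sent(PeerId, Msg).
\end{verbatim}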
\chapter{Programming planning}
\section{Filesystem}
A central problem in the eTorrent project is the file system. The
filesystem processes must be split, because the death of one of them
must not take all torrents down. That would be rather bad
architecture.
\subsection{Processes}
\subsubsection{File process}
For each file which is managed, there is a process termed the ``file
process''. This process is responsible for managing reads and writes
on the file. It has a very simple interface which accepts read and
write operations on the file, given by byte offset and number of
bytes to read/write. It also has a timeout: when no-one has requested
any data from the file for some time, it closes down gracefully.
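
A minimal sketch of such a file process (the message protocol and the
timeout value are assumptions, not the implementation's):
\begin{verbatim}
-module(file_process).
-export([start_link/1]).

-define(IDLE_TIMEOUT, 60000).  %% milliseconds

start_link(Path) ->
    spawn_link(fun() ->
                   {ok, Fd} = file:open(Path,
                                 [read, write, binary, raw]),
                   loop(Fd)
               end).

loop(Fd) ->
    receive
        {read, From, Offset, Length} ->
            From ! file:pread(Fd, Offset, Length),
            loop(Fd);
        {write, From, Offset, Bin} ->
            From ! file:pwrite(Fd, Offset, Bin),
            loop(Fd)
    after ?IDLE_TIMEOUT ->
        %% No requests for a while: close down gracefully.
        ok = file:close(Fd)
    end.
\end{verbatim}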
\subsubsection{General idea}
For each torrent, there is a managing process. This process is
responsible for managing the torrent's access to the disk. The
management process is created when a given torrent has been processed
for checksumming, and it is handed the resulting status at spawn
time.

When spawned, we get a mapping between piece identifications and the
files we need to read from/write to in order to get each piece loaded
or saved. We use this mapping for lookups in the code.
There are 2 main functions that the management process accepts:
\texttt{read\_piece} and \texttt{write\_piece}. Upon getting a read
or write request, the process checks whether a file process is
already serving the file. If not, it will spawn one and ask it to
read/write the data in question. The process is linked to the file
processes, so if any of them dies, we know it and can act accordingly
by cleaning up our map of files and $pid$s. Since a file process
exits when it has done nothing for some time, it is expected that we
will use this feature quite a lot.
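
The lookup-or-spawn step might be sketched as follows (names
assumed; \texttt{file\_process:start\_link/1} is the sketch from the
previous section):
\begin{verbatim}
%% Find the file process serving Path, spawning and linking
%% one if none exists yet. Map is a map from paths to pids.
file_process_for(Path, Map) ->
    case maps:find(Path, Map) of
        {ok, Pid} ->
            {Pid, Map};
        error ->
            Pid = file_process:start_link(Path),
            {Pid, maps:put(Path, Pid, Map)}
    end.
\end{verbatim}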
\subsubsection{File descriptor replacement}
We want a simple algorithm for replacing file descriptors. A very
simple way, which is possible in Erlang, is to let each file be
managed by a process. This process has a timeout on its main receive,
which will make it close down if no operations have been served for
some time. A main process will keep track of all file processes, and
it will also maintain an LRU structure over the files. Thus
file-descriptor processes can be purged if new files have to be
opened, but they auto-purge if no-one uses them.

Ergo, whenever a file process is spawned, the LRU process is informed
about it. It can then ask a given process to close if the system runs
out of file descriptors.
\begin{remark}
This is a long-term optimization. It should not be implemented in
the first release.
\end{remark}
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: