\documentclass[a4paper]{memoir}
\usepackage[T1]{fontenc}
\usepackage{charter}
\usepackage{microtype}

\usepackage{amsmath}
\usepackage{amssymb}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{example}{Example}
\newtheorem{remark}{Remark}

\title{eTorrent documentation}
\author{Jesper Louis Andersen \\ jesper.louis.andersen@gmail.com}
\date{\today}

\begin{document}
\maketitle{}
\tableofcontents{}
\chapter{Introduction}
Why write a complete analysis and documentation of the software? This
is normally not the way Open Source software is written. Rather than
sit down and think, most people seem happy to write code and let the
code document what should be done. The problem with that approach is
that I hop on and off this project, which makes such ad-hoc
development impossible.
When you grow up, it becomes clear that writing extensive
documentation for a piece of software gives a much better
result. Documentation is key to correct and elegant
software. Documentation is key to extensive analysis before
code. Thus, we must write documentation.
\chapter{Requirements}
\label{chap:requirements}

The plan is to build a BitTorrent client. What will set this client
apart from other clients is its fault-tolerance. In general, it should
not be possible to take the client down, even in the case of errors in
the client, disk failures, system crashes, etc.
\paragraph{Fault tolerance} The client should be fault-tolerant. In
general, if any part of the client has an error, it must not result in
the client being brought down. The client should also be able to
recover quickly from such an error.

The client should avoid disk corruption at all costs. It is
acceptable, however, for the client to assume that the underlying
operating system does not corrupt the disk.
\paragraph{Unattended operation} The client must run unattended. The
interface to the client is a directory hierarchy: Torrent files in the
hierarchy are downloaded, and when a file is removed, the download is
stopped. There is no requirement for an interactive interface.
\paragraph{Performance} The client should run at an adequate speed. It
should be able to run up to the usual limits of disk I/O. The client
should not pursue speed aggressively; use rtorrent for that. The
client must be able to serve a large number of simultaneous
torrents. We aim for thousands of torrents served at the same time,
provided the server is large enough. There are three iterations: 1
torrent, 100 torrents and 1000 torrents, which must be achieved in
order in successive releases of the software.
\chapter{Analysis}
\section{Pieces}
Pieces are the central elements we exchange between peers. A Torrent
file consists of several pieces. Each is identified by a natural
number indexed at $1$. This identification serves as the primary key
in our implementation. Several pieces of information are linked to the
primary key. First, each piece has a size. This is important, since
the last piece rarely has the same size as the other pieces. Second,
pieces have binary data associated with them of the given size. Third,
this data has a checksum. Finally, the piece is mapped to a list of
triples, $(path, offset, length)$, which designates where the piece
should go on disk.
\begin{example}
A simple piece could have a list like the following:
\begin{verbatim}
[{foo, 10, 20},
 {bar, 30, 50}]
\end{verbatim}
It should be interpreted as if we should write 20 bytes to
\texttt{foo} at offset 10 and then write 50 bytes to \texttt{bar} at
offset 30.
\end{example}
\begin{remark}
Invariant: The sum of the sizes in the list of where the piece is
stored should be equal to the size of the piece.
\end{remark}
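As a sketch, the piece metadata described above could be held in an
Erlang record along the following lines. The record and field names
are illustrative only, not etorrent's actual data structures:
\begin{verbatim}
%% Hypothetical sketch of the per-piece metadata; the record and
%% field names are illustrative, not etorrent's own.
-record(piece, {id,        % primary key, indexed at 1
                size,      % size of the piece in bytes
                checksum,  % checksum of the piece data
                files}).   % list of {Path, Offset, Length} triples

%% The invariant from the remark: the lengths must sum to the size.
valid_piece(#piece{size = Size, files = Files}) ->
    Size =:= lists:sum([Length || {_Path, _Offset, Length} <- Files]).
\end{verbatim}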
\paragraph{On piece size}
It seems correct to keep track of piece size all over the
system. First, if we run several torrents, they may have different
piece sizes. Second, it will greatly reduce the need to take special
care of the last piece.
\paragraph{On piece checksum}
We should always checksum a piece which has been read. First, it
catches disk corruption: a corrupted piece can then never be
transmitted over the network. Second, it is cheap to check a piece in
memory. Third, it serves as a great assertion invariant in the system:
all written pieces should preserve their checksum when read.

When writing a piece, it should be checked as well. There is no point
in writing something which was accidentally corrupted. As we mostly
retrieve the binary data associated with a piece from a peer, we
really have no control over its correctness, so checked it must be.
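As a minimal sketch, such a check could be written as follows,
assuming SHA-1 checksums as in the BitTorrent protocol; the function
name is hypothetical:
\begin{verbatim}
%% Sketch: verify a piece against its expected SHA-1 checksum.
%% check_piece/2 is a hypothetical helper, not etorrent's actual API.
check_piece(ExpectedChecksum, Data) when is_binary(Data) ->
    case crypto:hash(sha, Data) of
        ExpectedChecksum -> ok;
        _Mismatch        -> {error, bad_checksum}
    end.
\end{verbatim}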
\section{Filesystem interaction}
\subsection{Piece serving}
When we wish to serve a piece from disk, we must carry out a number of
operations: we must locate the piece on disk, load it into memory, and
break it up so it can be sent to the peer who requested it.
Locating a piece is done by its piece number. If we have the piece
number, we can deduce the files which comprise the piece in question
and the $(offset, length)$ pairs we have to read inside them. In other
words, let $pid$ range over piece identifications. Further, let $path$
be a file system path (UNIX notation). Finally, let $offset, length
\in \mathbb{N}$. We have the function:
\begin{equation*}
\mathtt{locate\_piece} \colon (pid) \to (path, offset, length)\; list
\end{equation*}
Then, when the piece is located, we must load it. Assume the existence
of a function that can read pieces
\begin{equation*}
\mathtt{read\_piece} \colon (path, offset, length) \; list \to
binary
\end{equation*}
where $binary$ is binary data. When the data has been read, we check
its cryptographic checksum. If the checksum does not match at this
point, we have an error situation which must be handled appropriately.
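A sketch of \texttt{read\_piece} in Erlang could look as follows. For
simplicity the sketch opens and closes each file per call; a real
implementation would go through the long-lived file processes
described later:
\begin{verbatim}
%% Sketch: read a piece given its list of {Path, Offset, Length}
%% triples and concatenate the chunks into one binary.
read_piece(Locations) ->
    Chunks = [read_chunk(Path, Offset, Length)
              || {Path, Offset, Length} <- Locations],
    list_to_binary(Chunks).

read_chunk(Path, Offset, Length) ->
    {ok, Fd} = file:open(Path, [read, binary, raw]),
    {ok, Data} = file:pread(Fd, Offset, Length),
    ok = file:close(Fd),
    Data.
\end{verbatim}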
Then the checksummed piece is sent to the process responsible for peer
communication. Since peers can choose their block size as they see
fit, the cut operation must not be handled centrally, but by the peer
communication process.
\subsection{Piece retrieval}

When we get a piece from a peer, we begin by checking its checksum. If
this check fails, we answer the peer communication process with an
error and note that it gave us a bad piece. This can be used by the
peer communication process to mark its peer ``dirty'' and eventually
disconnect and blacklist it.

If the piece is ok, we look up the checksum in the map of
checksums. It must match the identification of the piece. If not, that
is an error as well. If both the checksum and the identification
match, we will store the piece.
There are several storage methods at our disposal:
\paragraph{Method 1}
Create all files. Use \texttt{fseek(3)} to fast-forward to the point
in the file we want to write and write down the piece at its correct
slot.

The advantage of this approach is simplicity. It is easy to
implement. It may, however, introduce sparse files. We may also
pre-fill all files with zeros to avoid producing sparse files. This
has its own problems, because it takes time and it introduces files
on-disk essentially without information. On the other hand,
pre-filling ensures that the file can always be written regardless of
free space.
We note that the Azureus client seems to be using an approach like
this.
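A sketch of this method in Erlang, where \texttt{file:pwrite/3} plays
the role of \texttt{fseek(3)} followed by a write; the chunk size and
function names are illustrative:
\begin{verbatim}
%% Sketch of method 1: pre-fill a file with zeros, then write pieces
%% at their correct slots with positioned writes.
prefill(Path, Size) ->
    {ok, Fd} = file:open(Path, [write, binary, raw]),
    Zeros = binary:copy(<<0>>, 65536),
    ok = prefill_loop(Fd, Size, Zeros),
    file:close(Fd).

prefill_loop(_Fd, 0, _Zeros) -> ok;
prefill_loop(Fd, Left, Zeros) when Left >= byte_size(Zeros) ->
    ok = file:write(Fd, Zeros),
    prefill_loop(Fd, Left - byte_size(Zeros), Zeros);
prefill_loop(Fd, Left, Zeros) ->
    <<Chunk:Left/binary, _/binary>> = Zeros,
    file:write(Fd, Chunk).

%% A positioned write replaces the fseek(3) + write sequence.
write_piece_at(Fd, Offset, Data) ->
    file:pwrite(Fd, Offset, Data).
\end{verbatim}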
\paragraph{Method 2}
Write the file contiguously. Call the on-disk piece locations
slots. Then we first write to slot 1, then slot 2, then slot 3 and so
forth. Pieces are written as they come in, so they may not be written
to the correct slots in the first place.

This can be alleviated by using a sorting algorithm on the
pieces. There are several applicable sorting algorithms. A simple
solution would be a selection sort based on exchanges:
Assume the pieces $1$ through $n$ are sorted correctly. We write
pieces contiguously to slots $n+1, n+2, \dotsc$. When piece $n+1$ is
retrieved, we exchange the piece in slot $n+1$ with this new piece. To
do this safely, we use a free slot on-disk as a temporary variable and
ensure we copy the piece out of slot $n+1$ first. Thus, a crash will
not result in the loss of data. Note that we then have pieces $1$
through $n+1$ placed correctly. We then run again for slot $n+2$,
which we may have already retrieved. The question is how many
exchanges this makes, as disk I/O is heavy and a major limiting factor
in BitTorrent clients.

For a slot there are at most three writes: one for the contiguous
write, one when the fitting piece gets written, and one for moving a
piece aside to make room for the fitting piece. Thus, the algorithm is
$\mathcal{O}(n)$ writes with a constant factor of around 3.
The original BitTorrent client by Bram Cohen uses a variant of this
approach.
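The exchange step itself could look roughly as follows;
\texttt{read\_slot/2} and \texttt{write\_slot/3} are assumed helpers
over the on-disk slot layout, not actual etorrent functions:
\begin{verbatim}
%% Sketch of the crash-safe exchange for slot N+1. The displaced
%% piece is copied out before slot N+1 is overwritten, so a crash
%% at any point loses no data.
exchange(Slots, N, NewPiece, FreeSlot) ->
    Displaced = read_slot(Slots, N + 1),          % copy out first
    ok = write_slot(Slots, FreeSlot, Displaced),  % park it in a free slot
    ok = write_slot(Slots, N + 1, NewPiece),      % N+1 is now correct
    {ok, Displaced}.           % the displaced piece is handled next
\end{verbatim}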
\paragraph{Method 3}
Use \texttt{mmap(2)}. A file is mapped into memory at a given
location. Writes are done to memory, and the virtual memory subsystem
of the operating system is responsible for flushing the writes to the
correct location on disk. This is extremely easy. It is fast as well,
but there are a couple of limitations.

On a 32-bit architecture, we don't have enough address space to keep
several gigabytes of data mapped in. Hence, we will either need to use
a pure 64-bit operating system or we will need to devise an algorithm
for mapping parts of files in and out. We need to do this anyway,
since we can't expect to map several file descriptors at the same
time.
The rtorrent client uses this approach.
\paragraph{Method 4}
Use internal storage. We can choose to represent the data internally
in an on-disk persistent format. Then, when we have the whole file, we
can write it out. Each piece gets written exactly twice, so this may
seem better than method 2. On the other hand, the method has a
problem: we can't look at the data until everything is downloaded.
\paragraph{Discussion}
My intuition tells me that method 1 with pre-fill is the easiest to
implement. Thus, we choose to implement that solution first. We can
change to another method later, when the client basics are in place
and working.
\subsection{What to do at startup?}
When the system starts, we have no idea which pieces of a torrent we
have. Hence, we must halt all communication with others until we know
what pieces we have and what pieces we miss. We will check one torrent
at a time, which will require some control.

For each torrent, we will begin loading in pieces. Each piece either
fails its check or checks out. If method 2 is chosen for piece
storage, we also need to identify which piece a given slot holds,
since pieces may be out of place. There must be some error handling in
the loading code, so we gracefully handle mis-loads.

If a file is missing on disk, we will create it and pre-fill it with
zeros. Hence, we have the following invariant: ``File system processes
can assume there is access to the needed files''.
\subsection{Handling checksum read errors}
What happens when a checksum read reports an error? There are two
causes for this: disk corruption and a system crash/reset. The most
probable cause is that the system was reset. Thus, we mark the piece
as bad and ignore it as if it did not exist. Done correctly, it seems
we can then continue running.

Disk corruption is much more fatal. We will assume data is not
corrupted on the disk. Modern file systems like ZFS (see \cite{zfs})
carry out checks of all read blocks, and thus it is near impossible to
have disk corruption in such a scheme.
\section{Peer processes}

General rule: we try to carry out bookkeeping as close to the peer as
possible. That is, we update mnesia tables whenever a message arrives
or a message gets sent, in an early/late manner: upon arrival, the
first thing we do is update the database tables locally; upon message
sending, the last thing we do is the update. Sender/Receiver processes
are responsible for updating and tracking the information.
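As a sketch, the early rule for a received message could look as
follows; the table, record, and helper names are hypothetical:
\begin{verbatim}
-record(peer_piece, {peer, piece}).  % hypothetical mnesia table

%% Early rule: upon arrival, the very first thing we do is update
%% the table locally; only then is the message processed further.
handle_message({have, PieceNum}, PeerId) ->
    ok = mnesia:dirty_write(#peer_piece{peer = PeerId,
                                        piece = PieceNum}),
    process_have(PeerId, PieceNum).  % assumed further processing
\end{verbatim}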
\chapter{Programming planning}
\section{Filesystem}
A central problem in the eTorrent project is the filesystem. The
filesystem processes must be split, because the death of one of them
must not take all torrents down. Anything else would be rather bad
architecture.

\subsection{Processes}
\subsubsection{File process}
For each file which is managed, there is a process which is termed the
``file process''. This process is responsible for managing reads and
writes on the file. It has a very simple interface by which it accepts
read and write operations on the file, given by byte offset and number
of bytes to read/write. It also has a timeout: if no one has requested
any data on the file for some time, it closes down gracefully.
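A minimal sketch of such a file process as a plain Erlang process
follows; a real implementation would likely be a \texttt{gen\_server},
and the message format and timeout value are illustrative:
\begin{verbatim}
%% Sketch of a file process: serves positioned reads and writes and
%% closes down gracefully after 60 seconds without requests.
-define(IDLE_TIMEOUT, 60000).

file_process(Path) ->
    {ok, Fd} = file:open(Path, [read, write, binary, raw]),
    loop(Fd).

loop(Fd) ->
    receive
        {read, From, Offset, Length} ->
            From ! {self(), file:pread(Fd, Offset, Length)},
            loop(Fd);
        {write, From, Offset, Data} ->
            From ! {self(), file:pwrite(Fd, Offset, Data)},
            loop(Fd)
    after ?IDLE_TIMEOUT ->
            file:close(Fd)  % idle for too long; close down
    end.
\end{verbatim}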
\subsubsection{General idea}
For each torrent, there is a managing process. This process is
responsible for managing the torrent's access to the disk. The
management process is created when a given torrent has been processed
for checksumming, and it is handed its status at spawn time.

When spawned, we get a mapping between piece identifications and the
files we need to read from/write to in order to get the piece loaded
or saved. We use this mapping for lookup in the code.
There are two main functions that the management process accepts:
\texttt{read\_piece} and \texttt{write\_piece}. Upon getting a read or
write request, the process will check whether a file process is
already serving the file. If not, it will spawn one and ask it to
read/write the data in question. The management process is linked to
the file processes, so if any of these dies, we know it and can act
accordingly by cleaning up our map of files and $pid$s. Since a file
process exits when it has done nothing for some time, it is expected
that we will use this feature quite a lot.
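The lookup-or-spawn step could be sketched as follows; the map is
assumed to be a \texttt{dict} from path to $pid$, and
\texttt{file\_process/1} is the sketch from above:
\begin{verbatim}
%% Sketch: find the file process for Path, spawning and linking a
%% new one if none is serving yet.
get_file_process(Path, FileProcs) ->
    case dict:find(Path, FileProcs) of
        {ok, Pid} ->
            {Pid, FileProcs};
        error ->
            Pid = spawn_link(fun() -> file_process(Path) end),
            {Pid, dict:store(Path, Pid, FileProcs)}
    end.
\end{verbatim}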
\subsubsection{File descriptor replacement}
We want a simple algorithm for replacing file descriptors. A very
simple way, possible in Erlang, is to let each file be managed by a
process. This process has a timeout on its main receive loop which
will make it close down if no operations have been served for some
time. A main process will keep track of all file processes, and it
will also maintain an LRU structure over the files. Thus
file-descriptor processes can be purged if new files have to be
opened, but they auto-purge if no one uses them.

Ergo, whenever a file process is spawned, the LRU process is informed
about it. It can then ask a given file process to close if it runs out
of file descriptors.
\begin{remark}
This is a long term optimization. It should not be implemented in
the first release.
\end{remark}

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: