<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE chapter PUBLIC "-//Samba-Team//DTD DocBook V4.2-Based Variant V1.0//EN" "http://www.samba.org/samba/DTD/samba-doc">
<chapter id="SambaHA">
<chapterinfo>
        &author.jht;
        &author.jeremy;
</chapterinfo>
<title>High Availability</title>
<sect1>
<title>Features and Benefits</title>
<para>
<indexterm><primary>availability</primary></indexterm>
<indexterm><primary>intolerance</primary></indexterm>
<indexterm><primary>vital task</primary></indexterm>
Network administrators are often concerned about the availability of file and print
services. Network users are inclined toward intolerance of any failure in the services
they depend on to perform vital tasks.
</para>
<para>
A sign in a computer room served to remind staff of their responsibilities. It read:
</para>
<blockquote>
<para>
<indexterm><primary>fail</primary></indexterm>
<indexterm><primary>managed by humans</primary></indexterm>
<indexterm><primary>economically wise</primary></indexterm>
<indexterm><primary>anticipate failure</primary></indexterm>
All humans fail, in both great and small ways we fail continually. Machines fail too.
Computers are machines that are managed by humans, the fallout from failure
can be spectacular. Your responsibility is to deal with failure, to anticipate it
and to eliminate it as far as is humanly and economically wise to achieve.
Are your actions part of the problem or part of the solution?
</para>
</blockquote>
<para>
If we are to deal with failure in a planned and productive manner, then first we must
understand the problem. That is the purpose of this chapter.
</para>
<para>
<indexterm><primary>high availability</primary></indexterm>
<indexterm><primary>CIFS/SMB</primary></indexterm>
<indexterm><primary>state of knowledge</primary></indexterm>
Parenthetically, in the following discussion there are seeds of information on how to
provision a network infrastructure against failure. Our purpose here is not to provide
a lengthy dissertation on the subject of high availability. Additionally, we have made
a conscious decision to not provide detailed working examples of high availability
solutions; instead we present an overview of the issues in the hope that someone will
rise to the challenge of providing a detailed document that is focused purely on
presentation of the current state of knowledge and practice in high availability as it
applies to the deployment of Samba and other CIFS/SMB technologies.
</para>
</sect1>
<sect1>
<title>Technical Discussion</title>
<para>
<indexterm><primary>SambaXP conference</primary></indexterm>
<indexterm><primary>Germany</primary></indexterm>
<indexterm><primary>inspired structure</primary></indexterm>
The following summary was part of a presentation by Jeremy Allison at the SambaXP 2003
conference that was held at Goettingen, Germany, in April 2003. Material has been added
from other sources, but it was Jeremy who inspired the structure that follows.
</para>
        <sect2>
        <title>The Ultimate Goal</title>
        <para>
<indexterm><primary>clustering technologies</primary></indexterm>
<indexterm><primary>affordable power</primary></indexterm>
<indexterm><primary>unstoppable services</primary></indexterm>
        All clustering technologies aim to achieve one or more of the following:
        </para>
        <itemizedlist>
                <listitem><para>Obtain the maximum affordable computational power.</para></listitem>
                <listitem><para>Obtain faster program execution.</para></listitem>
                <listitem><para>Deliver unstoppable services.</para></listitem>
                <listitem><para>Avert points of failure.</para></listitem>
                <listitem><para>Exact the most effective utilization of resources.</para></listitem>
        </itemizedlist>
        <para>
        A clustered file server ideally has the following properties:
<indexterm><primary>clustered file server</primary></indexterm>
<indexterm><primary>connect transparently</primary></indexterm>
<indexterm><primary>transparently reconnected</primary></indexterm>
<indexterm><primary>distributed file system</primary></indexterm>
        </para>
        <itemizedlist>
                <listitem><para>All clients can connect transparently to any server.</para></listitem>
                <listitem><para>A server can fail and clients are transparently reconnected to another server.</para></listitem>
                <listitem><para>All servers serve out the same set of files.</para></listitem>
                <listitem><para>All file changes are immediately seen on all servers.</para>
                        <itemizedlist><listitem><para>Requires a distributed file system.</para></listitem></itemizedlist></listitem>
                <listitem><para>Infinite ability to scale by adding more servers or disks.</para></listitem>
        </itemizedlist>
        </sect2>
        <sect2>
        <title>Why Is This So Hard?</title>
        <para>
        In short, the problem is one of <emphasis>state</emphasis>.
        </para>
        <itemizedlist>
                <listitem>
                        <para>
<indexterm><primary>state information</primary></indexterm>
                        All TCP/IP connections are dependent on state information.
                        </para>
                        <para>
<indexterm><primary>TCP failover</primary></indexterm>
                        The TCP connection involves a packet sequence number. This
                        sequence number would need to be dynamically updated on all
                        machines in the cluster to effect seamless TCP failover.
                        </para>
                </listitem>
                <listitem>
                        <para>
<indexterm><primary>CIFS/SMB</primary></indexterm>
<indexterm><primary>TCP</primary></indexterm>
                        CIFS/SMB (the Windows networking protocols) uses TCP connections.
                        </para>
                        <para>
                        This means that from a basic design perspective, failover is not
                        seriously considered.
                        <itemizedlist>
                                <listitem><para>
                                All current SMB clusters are failover solutions
                                &smbmdash; they rely on the clients to reconnect. They provide server
                                failover, but clients can lose information due to a server failure.
<indexterm><primary>server failure</primary></indexterm>
                                </para></listitem>
                        </itemizedlist>
                        </para>
                </listitem>
                <listitem>
                        <para>
                        Servers keep state information about client connections.
                        <itemizedlist>
<indexterm><primary>state</primary></indexterm>
                                <listitem><para>CIFS/SMB involves a lot of state.</para></listitem>
                                <listitem><para>Every file open must be compared with other open files
                                                to check share modes.</para></listitem>
                        </itemizedlist>
                        </para>
                </listitem>
        </itemizedlist>
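        <para>
        To make the share-mode point concrete, the following is a minimal sketch of the comparison
        that has to happen on every open. It is an illustration only and is not Samba's implementation:
        the names <literal>Access</literal>, <literal>OpenEntry</literal>, <literal>conflicts</literal>,
        and <literal>may_open</literal> are hypothetical, and the access rights are reduced to read,
        write, and delete.
        </para>

<programlisting><![CDATA[
```python
from dataclasses import dataclass
from enum import Flag, auto


class Access(Flag):
    """Access rights an open can request or agree to share (simplified)."""
    NONE = 0
    READ = auto()
    WRITE = auto()
    DELETE = auto()


@dataclass(frozen=True)
class OpenEntry:
    """One open of a file: what the opener does and what it lets others do."""
    desired: Access  # access this opener asked for
    shared: Access   # access this opener allows other openers to have


def conflicts(existing: OpenEntry, new: OpenEntry) -> bool:
    """Return True if the two opens cannot coexist."""
    # The new opener wants something the existing opener did not agree to share.
    new_not_allowed = new.desired & ~existing.shared
    # The existing opener holds something the new opener refuses to share.
    old_not_allowed = existing.desired & ~new.shared
    return bool(new_not_allowed) or bool(old_not_allowed)


def may_open(open_table: list[OpenEntry], new: OpenEntry) -> bool:
    """Compare the new open against every existing open of the same file."""
    return not any(conflicts(entry, new) for entry in open_table)
```
]]></programlisting>

        <para>
        In a cluster, the <literal>open_table</literal> passed to <literal>may_open</literal> would
        have to reflect the opens held on every server, which is exactly the shared-state problem
        described above.
        </para>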
                <sect3>
                <title>The Front-End Challenge</title>
                <para>
<indexterm><primary>cluster servers</primary></indexterm>
<indexterm><primary>single server</primary></indexterm>
<indexterm><primary>TCP data streams</primary></indexterm>
<indexterm><primary>front-end virtual server</primary></indexterm>
<indexterm><primary>virtual server</primary></indexterm>
<indexterm><primary>de-multiplex</primary></indexterm>
<indexterm><primary>SMB</primary></indexterm>
                To make it possible for a cluster of file servers to appear as a single server that has one
                name and one IP address, the incoming TCP data streams from clients must be processed by the
                front-end virtual server. This server must de-multiplex the incoming packets at the SMB protocol
                layer level and then feed the SMB packet to different servers in the cluster.
                </para>
                <para>
<indexterm><primary>IPC$ connections</primary></indexterm>
<indexterm><primary>RPC calls</primary></indexterm>
                One could direct all IPC$ connections and RPC calls to one server to handle printing and user
                lookup requirements. RPC printing handles are shared between different IPC$ sessions &smbmdash; it is
                hard to split this across clustered servers!
                </para>
                <para>
                Conceptually speaking, all other servers would then provide only file services. This is a simpler
                problem to concentrate on.
                </para>
                </sect3>
                <sect3>
                <title>Demultiplexing SMB Requests</title>
                <para>
<indexterm><primary>SMB requests</primary></indexterm>
<indexterm><primary>SMB state information</primary></indexterm>
<indexterm><primary>front-end virtual server</primary></indexterm>
<indexterm><primary>complicated problem</primary></indexterm>
                De-multiplexing of SMB requests requires knowledge of SMB state information,
                all of which must be held by the front-end <emphasis>virtual</emphasis> server.
                This is a perplexing and complicated problem to solve.
                </para>
                <para>
<indexterm><primary>vuid</primary></indexterm>
<indexterm><primary>tid</primary></indexterm>
<indexterm><primary>fid</primary></indexterm>
                Windows XP and later have changed semantics so state information (vuid, tid, fid)
                must match for a successful operation. This makes things simpler than before and is a
                positive step forward.
                </para>
                <para>
<indexterm><primary>SMB requests</primary></indexterm>
<indexterm><primary>Terminal Server</primary></indexterm>
                SMB requests are sent by vuid to their associated server. No code exists today to
                effect this solution. This problem is conceptually similar to the problem of
                correctly handling multiple requests from a Windows 2000 Terminal Server in Samba.
                </para>
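                <para>
                The routing idea can be sketched as a small table that pins each vuid to one backend.
                This is only an illustration of the concept under simple assumptions. The class
                <literal>FrontEndRouter</literal>, its methods, and the round-robin assignment are all
                hypothetical, and a real front end would also have to track tid and fid state.
                </para>

<programlisting><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class FrontEndRouter:
    """Toy front end that pins each SMB session (vuid) to one backend server."""
    backends: list[str]
    sessions: dict[int, str] = field(default_factory=dict)  # vuid -> backend
    next_backend: int = 0

    def session_setup(self, vuid: int) -> str:
        """Assign a backend to a new vuid in round-robin order (an assumption)."""
        backend = self.backends[self.next_backend % len(self.backends)]
        self.next_backend += 1
        self.sessions[vuid] = backend
        return backend

    def route(self, vuid: int) -> str:
        """Return the backend that holds the state for this vuid."""
        try:
            return self.sessions[vuid]
        except KeyError:
            raise LookupError(f"unknown vuid {vuid}: no session state held") from None

    def logoff(self, vuid: int) -> None:
        """Forget the mapping when the session ends."""
        self.sessions.pop(vuid, None)
```
]]></programlisting>

                <para>
                Even in this toy form, the <literal>sessions</literal> table is state that is lost if
                the front end fails, which is why the front end itself becomes a point of failure.
                </para>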
                <para>
<indexterm><primary>de-multiplexing</primary></indexterm>
                One possibility is to start by exposing the server pool to clients directly.
                This could eliminate the de-multiplexing step.
                </para>
                </sect3>
                <sect3>
                <title>The Distributed File System Challenge</title>
                <para>
<indexterm><primary>Distributed File Systems</primary></indexterm>
                There exist many distributed file systems for UNIX and Linux.
                </para>
                <para>
<indexterm><primary>backend</primary></indexterm>
<indexterm><primary>SMB semantics</primary></indexterm>
<indexterm><primary>share modes</primary></indexterm>
<indexterm><primary>locking</primary></indexterm>
<indexterm><primary>oplock</primary></indexterm>
<indexterm><primary>distributed file systems</primary></indexterm>
                Many could be adopted as the backend for our cluster, so long as SMB
                semantics are kept in mind (share modes, locking, and oplock issues in particular).
                Common free distributed file systems include:
<indexterm><primary>NFS</primary></indexterm>
<indexterm><primary>AFS</primary></indexterm>
<indexterm><primary>OpenGFS</primary></indexterm>
<indexterm><primary>Lustre</primary></indexterm>
                </para>
                <itemizedlist>
                        <listitem><para>NFS</para></listitem>
                        <listitem><para>AFS</para></listitem>
                        <listitem><para>OpenGFS</para></listitem>
                        <listitem><para>Lustre</para></listitem>
                </itemizedlist>
                <para>
<indexterm><primary>server pool</primary></indexterm>
                The server pool (cluster) can use any distributed file system backend if all SMB
                semantics are performed within this pool.
                </para>
                </sect3>
                <sect3>
                <title>Restrictive Constraints on Distributed File Systems</title>
                <para>
<indexterm><primary>SMB services</primary></indexterm>
<indexterm><primary>oplock handling</primary></indexterm>
<indexterm><primary>server pool</primary></indexterm>
<indexterm><primary>backend file system pool</primary></indexterm>
                Where a clustered server provides purely SMB services, oplock handling
                may be done within the server pool without imposing a need for this to
                be passed to the backend file system pool.
                </para>
                <para>
<indexterm><primary>NFS</primary></indexterm>
<indexterm><primary>interoperability</primary></indexterm>
                On the other hand, where the server pool also provides NFS or other file services,
                it will be essential that the implementation be oplock-aware so it can
                interoperate with SMB services. This is a significant challenge today. A failure
                to provide this interoperability will result in a significant loss of performance that will be
                sorely noted by users of Microsoft Windows clients.
                </para>
                <para>
                Last, all state information must be shared across the server pool.
                </para>
                </sect3>
                <sect3>
                <title>Server Pool Communications</title>
                <para>
<indexterm><primary>POSIX semantics</primary></indexterm>
<indexterm><primary>SMB</primary></indexterm>
<indexterm><primary>POSIX locks</primary></indexterm>
<indexterm><primary>SMB locks</primary></indexterm>
                Most backend file systems support POSIX file semantics. This makes it difficult
                to push SMB semantics back into the file system. POSIX locks have different properties
                and semantics from SMB locks.
                </para>
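                <para>
                One such difference can be observed directly. POSIX record locks belong to the process,
                so a process that locks a byte range through one descriptor can lock the same range again
                through a second descriptor without conflict. Our understanding is that an SMB byte-range
                lock is instead tied to the file handle on which it was taken, so the equivalent second
                request would be refused. The sketch below demonstrates only the POSIX half. It assumes a
                UNIX-like system and a local file system that supports <literal>fcntl</literal> locks, and
                the function name <literal>same_process_can_relock</literal> is hypothetical.
                </para>

<programlisting><![CDATA[
```python
import fcntl
import os
import tempfile


def same_process_can_relock() -> bool:
    """Lock the same byte range through two descriptors of one process."""
    fd, path = tempfile.mkstemp()
    second = os.open(path, os.O_RDWR)
    try:
        os.write(fd, b"x" * 16)
        # Exclusive, non-blocking lock on bytes 0..9 via the first descriptor.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 10, 0)
        try:
            # Same range, different descriptor, same process.
            fcntl.lockf(second, fcntl.LOCK_EX | fcntl.LOCK_NB, 10, 0)
        except OSError:
            return False  # the second request was treated as a conflict
        return True       # POSIX: the process already owns this lock
    finally:
        os.close(second)
        os.close(fd)
        os.unlink(path)
```
]]></programlisting>

                <para>
                A server that maps SMB locks onto POSIX locks therefore has to keep its own per-handle
                bookkeeping, and in a cluster that bookkeeping must be visible to every server.
                </para>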
                <para>
<indexterm><primary>smbd</primary></indexterm>
<indexterm><primary>tdb</primary></indexterm>
<indexterm><primary>Clustered smbds</primary></indexterm>
                All <command>smbd</command> processes in the server pool must of necessity communicate
                very quickly. For this, the current <parameter>tdb</parameter> file structure that Samba
                uses is not suitable for use across a network. Clustered <command>smbd</command>s must use something else.
                </para>
                </sect3>
                <sect3>
                <title>Server Pool Communications Demands</title>
                <para>
                High-speed interserver communication in the server pool is a design prerequisite
                for a fully functional system. Possibilities for this include:
                </para>
                <itemizedlist>
<indexterm><primary>Myrinet</primary></indexterm>
<indexterm><primary>scalable coherent interface</primary><see>SCI</see></indexterm>
                        <listitem><para>
                        Proprietary shared memory bus (example: Myrinet or SCI [scalable coherent interface]).
                        These are high-cost items.
                        </para></listitem>

                        <listitem><para>
                        Gigabit Ethernet (now quite affordable).
                        </para></listitem>

                        <listitem><para>
                        Raw Ethernet framing (to bypass TCP and UDP overheads).
                        </para></listitem>
                </itemizedlist>
                <para>
                We have yet to identify metrics for the performance demands that must be met for this
                to happen effectively.
                </para>
                </sect3>
                <sect3>
                <title>Required Modifications to Samba</title>
                <para>
                Samba needs to be significantly modified to work with a high-speed server interconnect
                system to permit transparent failover clustering.
                </para>
                <para>
                Particular functions inside Samba that will be affected include:
                </para>
                <itemizedlist>
                        <listitem><para>
                        The locking database, oplock notifications,
                        and the share mode database.
                        </para></listitem>
                        <listitem><para>
<indexterm><primary>failure semantics</primary></indexterm>
<indexterm><primary>oplock messages</primary></indexterm>
                        Failure semantics need to be defined. Samba behaves the same way as Windows.
                        When oplock messages fail, a file open request is allowed, but this is
                        potentially dangerous in a clustered environment. So how should interserver
                        pool failure semantics function, and how should such functionality be implemented?
                        </para></listitem>
                        <listitem><para>
                        Should this be implemented using a point-to-point lock manager, or can this
                        be done using multicast techniques?
                        </para></listitem>
                </itemizedlist>
                </sect3>
        </sect2>
        <sect2>
        <title>A Simple Solution</title>
        <para>
<indexterm><primary>failover servers</primary></indexterm>
<indexterm><primary>exported file system</primary></indexterm>
<indexterm><primary>distributed locking protocol</primary></indexterm>
        Allowing failover servers to handle different functions within the exported file system
        removes the problem of requiring a distributed locking protocol.
        </para>
        <para>
<indexterm><primary>high-speed server interconnect</primary></indexterm>
<indexterm><primary>complex file name space</primary></indexterm>
        If only one server is active in a pair, the need for high-speed server interconnect is avoided.
        This allows the use of existing high-availability solutions, instead of inventing a new one.
        This simpler solution comes at a price: the need to manage a more
        complex file name space. Since there is no longer a single file system, administrators
        must remember where all services are located &smbmdash; a complexity not easily dealt with.
        </para>
        <para>
<indexterm><primary>virtual server</primary></indexterm>
        The <emphasis>virtual server</emphasis> is still needed to redirect requests to backend
        servers. Backend file space integrity is the responsibility of the administrator.
        </para>
        </sect2>
        <sect2>
        <title>High-Availability Server Products</title>
        <para>
<indexterm><primary>resource failover</primary></indexterm>
<indexterm><primary>high-availability services</primary></indexterm>
<indexterm><primary>dedicated heartbeat</primary></indexterm>
<indexterm><primary>LAN</primary></indexterm>
<indexterm><primary>failover process</primary></indexterm>
        Failover servers must communicate in order to handle resource failover. This is essential
        for high-availability services. The use of a dedicated heartbeat is a common technique to
        introduce some intelligence into the failover process. This is often done over a dedicated
        link (LAN or serial).
        </para>
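        <para>
        The heartbeat idea can be sketched in a few lines: the standby server records when it last
        heard from its peer and takes over once the silence exceeds a timeout. This is an illustration
        of the technique only and does not describe any particular product. The names
        <literal>HeartbeatMonitor</literal> and <literal>should_take_over</literal> are hypothetical,
        and the clock value is passed in explicitly so that the logic can be exercised without waiting.
        </para>

<programlisting><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Considers the peer dead after `timeout` seconds without a heartbeat."""
    timeout: float
    last_seen: float = 0.0

    def beat(self, now: float) -> None:
        """Record that a heartbeat arrived at time `now`."""
        self.last_seen = now

    def peer_alive(self, now: float) -> bool:
        """The peer is alive if its last heartbeat is within the timeout."""
        return (now - self.last_seen) <= self.timeout


def should_take_over(monitor: HeartbeatMonitor, now: float) -> bool:
    """The standby claims the shared resources only when the peer is silent."""
    return not monitor.peer_alive(now)
```
]]></programlisting>

        <para>
        In practice the clock would be a monotonic timer, and the timeout has to be chosen with care:
        if the heartbeat link fails while both servers are healthy, each may conclude that the other
        is dead, which is one reason a dedicated link is preferred.
        </para>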
        <para>
<indexterm><primary>SCSI</primary></indexterm>
<indexterm><primary>Red Hat Cluster Manager</primary></indexterm>
<indexterm><primary>Microsoft Wolfpack</primary></indexterm>
<indexterm><primary>Fiber Channel</primary></indexterm>
<indexterm><primary>failover communication</primary></indexterm>
        Many failover solutions (like Red Hat Cluster Manager and Microsoft Wolfpack)
        can use a shared SCSI or Fiber Channel disk storage array for failover communication.
        Information regarding Red Hat high availability solutions for Samba may be obtained from
        <ulink url="http://www.redhat.com/docs/manuals/enterprise/RHEL-AS-2.1-Manual/cluster-manager/s1-service-samba.html">www.redhat.com</ulink>.
        </para>
        <para>
<indexterm><primary>Linux High Availability project</primary></indexterm>
        The Linux High Availability project is a resource worthy of consultation if your desire is
        to build a highly available Samba file server solution. Please consult the home page at
        <ulink url="http://www.linux-ha.org/">www.linux-ha.org/</ulink>.
        </para>
        <para>
<indexterm><primary>backend failures</primary></indexterm>
<indexterm><primary>continuity of service</primary></indexterm>
        Front-end server complexity remains a challenge for high availability because it must deal
        gracefully with backend failures, while at the same time providing continuity of service
        to all network clients.
        </para>

        </sect2>
        <sect2>
        <title>MS-DFS: The Poor Man's Cluster</title>
        <para>
<indexterm><primary>MS-DFS</primary></indexterm>
<indexterm><primary>DFS</primary><see>MS-DFS, Distributed File Systems</see></indexterm>
        MS-DFS links can be used to redirect clients to disparate backend servers. This pushes
        complexity back to the network client, something already included by Microsoft.
        MS-DFS creates the illusion of a simple, continuous file system name space that works even
        at the file level.
        </para>
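        <para>
        As an illustration of how a link names its backend servers, the sketch below parses a link
        target of the form <literal>msdfs:serverA\shareA,serverB\shareB</literal>, which is the
        symbolic-link form we understand Samba to use for DFS links; consult the MS-DFS documentation
        for the authoritative syntax. The function <literal>parse_msdfs_link</literal> and the server
        and share names are hypothetical.
        </para>

<programlisting><![CDATA[
```python
def parse_msdfs_link(target: str) -> list[tuple[str, str]]:
    """Split an msdfs link target into (server, share) referral pairs."""
    prefix = "msdfs:"
    if not target.startswith(prefix):
        raise ValueError("not an msdfs link target")
    referrals = []
    # Alternative servers for the same link are separated by commas.
    for alternative in target[len(prefix):].split(","):
        server, sep, share = alternative.strip().partition("\\")
        if not sep or not server or not share:
            raise ValueError(f"malformed referral: {alternative!r}")
        referrals.append((server, share))
    return referrals
```
]]></programlisting>

        <para>
        A link that lists more than one server gives the client alternatives to try, but the servers
        named must hold the same data for this to be useful, and keeping them in step remains the
        administrator's responsibility.
        </para>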
        <para>
        Above all, at the cost of complexity of management, a distributed system (pseudo-cluster) can
        be created using existing Samba functionality.
        </para>
        </sect2>
        <sect2>
        <title>Conclusions</title>
        <itemizedlist>
                <listitem><para>Transparent SMB clustering is hard to do!</para></listitem>
                <listitem><para>Client failover is the best we can do today.</para></listitem>
                <listitem><para>Much more work is needed before a practical and manageable high-availability transparent cluster solution will be possible.</para></listitem>
                <listitem><para>MS-DFS can be used to create the illusion of a single transparent cluster.</para></listitem>
        </itemizedlist>
        </sect2>
</sect1>
</chapter>