AMF B.02.01 Implementation
--------------------------
The implementation of AMF in openais is directed by the specification
SAI-AIS-AMF-B.02.01, see http://www.saforum.org/specification/.
The AMF has many major duties:
* issuing instantiate, terminate, and cleanup operations for components
* assigning component service instances to components
* executing recovery and repair actions on fault reports delivered
  by components (fault detection is a responsibility of all entities
  in the system)
An AMF user has to provide instantiate and cleanup commands and a
configuration file in addition to the binaries that represent the actual
components.

To start a component, AMF executes the instantiate command, which starts
the processes that are part of the component. AMF can stop the component
abruptly by running the cleanup command.
A service unit (SU) contains multiple components and represents a
"useable service". It is configured to execute on an AMF node. The AMF node
is mapped in the configuration to a CLM node, which is "an operating system
instance". An SU is the smallest part that can be instantiated in a redundant
manner and can therefore be viewed as the unit of redundancy.

A service group (SG) contains multiple SUs. The SG is the unit that implements
high availability by managing its contained service units. An SG can be
configured to execute different redundancy policies.

An application contains multiple SGs and multiple service instances (SIs).

An SI represents the workload for an SU. An SI consists of one or more
component service instances (CSIs).

A CSI represents the workload of a component. The CSI is configured to include
a list of name-value pairs through which the user can express the workload.
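The containment relations described above can be pictured with a small data
model. This is only an illustrative sketch and not how openais represents
these entities internally. The entity names SERVICE_X_1, AMF1, RAID, APP-1 and
the attribute good_health_limit are taken from the example later in this
README; the SI name "WL1" and the CSI name "WL1-1" are made up for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CSI:
    """Workload of one component, expressed as name-value pairs."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class SI:
    """Workload of one service unit: one or more CSIs."""
    name: str
    csis: List[CSI] = field(default_factory=list)


@dataclass
class Component:
    name: str


@dataclass
class SU:
    """Unit of redundancy, configured to execute on one AMF node."""
    name: str
    hosted_by_node: str
    components: List[Component] = field(default_factory=list)


@dataclass
class SG:
    """Manages its contained service units."""
    name: str
    sus: List[SU] = field(default_factory=list)


@dataclass
class Application:
    name: str
    sgs: List[SG] = field(default_factory=list)
    sis: List[SI] = field(default_factory=list)


def build_example() -> Application:
    # "WL1" and "WL1-1" are hypothetical names used only in this sketch.
    su = SU("SERVICE_X_1", "AMF1", [Component("A"), Component("B")])
    si = SI("WL1", [CSI("WL1-1", {"good_health_limit": "10"})])
    return Application("APP-1", [SG("RAID", [su])], [si])
```

Calling build_example() returns an application with one SG, one SU with two
components, and one SI with a single CSI.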
The AMF specification defines several types of components. The AMF
specification is exceedingly clear about which CLC operations occur for which

If a component is not SA-aware, the only level of high availability that
can be applied to the application is through execution of the CLC interfaces.

A special component, called a proxy component, can be used to present an
SA-aware component to AMF to manage a non-SA-aware component. This would be
useful, for example, to implement a healthcheck operation which runs some
operation of the unmodified application service.

Components that are SA-aware have been written specifically to the AMF
interfaces. These components provide the most support for high availability
for application developers.
When an SA-aware component has been instantiated, it has to register within a
certain time. After a successful registration, AMF assigns workload to the
component by making callbacks once the service unit is available to take service.
There is one callback for each CSI assignment. Each CSI assignment has
an associated HA state, which indicates how the component shall act.
The HA state can be ACTIVE, STANDBY, QUIESCED or QUIESCING.

The number of CSIs assigned to a component and the setting of their HA state
are determined by AMF. In the configuration, the operator specifies the
preferred assignment of workload to the defined SUs. The configuration also
specifies limits for how much work each SU can execute. If the preferred
distribution of workload cannot be met because of problems in the cluster, AMF
executes a reduction procedure with 6 levels of reduction. The purpose of the
reduction procedure is to come as close as possible to the preferred
configuration without violating any limits for how much workload an SU can
handle. The reduction procedure continues until there are no SUs in-service in
the SG.
AMF supports fault detection through a healthcheck API. In the configuration
file, the user specifies healthcheck keys and timing parameters. The
application developer then uses this configuration to register a healthcheck
operation with AMF. The healthcheck operation can be started or stopped. Once
started, AMF periodically sends a request to the component to determine its
level of health. Optionally, AMF can be configured to instead expect the
component to report its health periodically.
The AMF reacts to negative healthchecks or failed healthchecks by executing

The AMF specification also includes an API for reporting errors with a
recommended recovery action. AMF will not take a weaker recovery action than
what is recommended but may take a stronger action based on the recovery

There is a recovery escalation policy for the recommendations:
When AMF receives a recommendation to restart a component, the recovery policy
attempts to restart the component first. When the component is restarted and
fails a certain number of times within a timeout period, the entire service unit
is restarted. When the SU has been restarted a certain number of times within
a certain timeout period, the SU is failed over to a standby SU. If AMF fails
over too many service units out of the same node in a given time period as a
consequence of error reports with either component restart or component
failover recommended recovery actions, AMF escalates the recovery to an
entire node fail-over.
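The escalation steps in the paragraph above can be sketched as a chain of
counters. This is a simplified model and not the openais implementation: the
class names, the single shared time window, and the threshold values are all
assumptions made for illustration, and the specification defines separate
attributes and timers per level.

```python
from collections import deque


class WindowCounter:
    """Counts events that happened within the last `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.events = deque()

    def hit(self, now):
        """Record an event; return True if more than `limit` are in the window."""
        self.events.append(now)
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) > self.limit

    def reset(self):
        self.events.clear()


class Escalation:
    """Hypothetical model of the restart escalation chain."""

    def __init__(self, comp_max, su_max, node_max, window):
        self.comp = WindowCounter(comp_max, window)
        self.su = WindowCounter(su_max, window)
        self.node = WindowCounter(node_max, window)

    def on_restart_recommendation(self, now):
        """Return the recovery action chosen for an error report at time `now`."""
        if not self.comp.hit(now):
            return "component restart"
        self.comp.reset()            # too many component restarts: escalate
        if not self.su.hit(now):
            return "SU restart"
        self.su.reset()              # too many SU restarts: escalate
        if not self.node.hit(now):
            return "SU failover"
        self.node.reset()            # too many SU failovers: escalate
        return "node failover"
```

With Escalation(comp_max=2, su_max=1, node_max=1, window=100), a stream of
error reports first produces component restarts, then an SU restart, later an
SU failover, and finally a node failover.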
What is currently implemented ?
-------------------------------

SA-aware components can be instantiated and assigned load according to the
configuration specified in amf.conf. Other types of components are currently
not supported. The processes of instantiation and assignment of workload are
both simplified compared to the requirements in the AMF specification.

Service units represented by their components can be configured to execute
on different nodes. AMF supports initial start of the cluster as well as adding
a node to the cluster after the initial start. AMF also handles a node
leaving the cluster by failing over the workload to standby service units.

Healthchecks are implemented as specified with only a few details missing.

The error report API is implemented, but AMF ignores the recommended
recovery action. Instead, it always tries to recover by 'component restart'.

The error escalation mechanism up to SU failover is also implemented as
specified with a few simplifications.

Only redundancy model N+M is (partly) implemented.

You can find a detailed list of what is NOT implemented later in the README.
The AMF specification doesn't specify a configuration file format. It does,
however, describe many configuration options, which are specified formally in
SAI-Overview-B.02.01 chapter 4.5 - 4.11. The Overview can also be retrieved
from http://www.saforum.org/specification/.

An implementation-specific feature of openais is to implement the configuration
options in a file called amf.conf. There is a man page in the /man directory
which describes the syntax of amf.conf and which configuration options are
currently supported.
First the openais example programs should be installed. When openais is
compiled, a file called openais-instantiate is created in the exec directory.
Copy this file to a test directory of your own:

    mkdir /tmp/aisexample

    exec# cp openais-instantiate /tmp/aisexample

Also copy the script which implements the instantiate, terminate and cleanup
operations to your test directory:

    exec# cp ../test/clc_cli_script /tmp/aisexample/clc_cli_script

Set execute permissions for the clc_cli_script:

    exec# chmod +x /tmp/aisexample/clc_cli_script

Copy the binary to be used for all components:

    exec# cp ../test/testamf1 /tmp/aisexample/testamf1

Copy the amf example configuration files from the openais/conf directory to
your test directory:

    exec# cp ../conf/*amf_example.conf /tmp/aisexample

Set environment variables to the names of the configuration files:

    setenv OPENAIS_AMF_CONFIG_FILE /tmp/aisexample/amf_example.conf
    setenv OPENAIS_MAIN_CONFIG_FILE /tmp/aisexample/openais_amf_example.conf
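Before starting aisexec it can be handy to verify that the steps above were
carried out. The following helper is a hypothetical convenience script and not
part of openais; it only checks for the file names and environment variables
mentioned above.

```python
import os

# File names taken from the copy steps above.
EXPECTED_FILES = [
    "openais-instantiate",
    "clc_cli_script",
    "testamf1",
    "amf_example.conf",
    "openais_amf_example.conf",
]


def check_example_dir(directory, environ):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name in EXPECTED_FILES:
        if not os.path.isfile(os.path.join(directory, name)):
            problems.append("missing file: " + name)
    script = os.path.join(directory, "clc_cli_script")
    if os.path.isfile(script) and not os.access(script, os.X_OK):
        problems.append("clc_cli_script is not executable")
    for var in ("OPENAIS_AMF_CONFIG_FILE", "OPENAIS_MAIN_CONFIG_FILE"):
        path = environ.get(var)
        if not path:
            problems.append("environment variable not set: " + var)
        elif not os.path.isfile(path):
            problems.append(var + " points to a missing file: " + path)
    return problems
```

A typical call would be check_example_dir("/tmp/aisexample", os.environ),
printing each returned problem on its own line.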
You have to specify the host on which you would like to execute the AMF example.
Open the file 'amf_example.conf' and replace the line:

in the following section in the cluster configuration:

    saAmfNodeSuFailOverProb=2000
    saAmfNodeSuFailoverMax=2

p01 shall be replaced with the name of your host.

(You can obtain the name of your host by typing the command 'hostname' in a
shell.)

Modify the following rows of 'openais_amf_example.conf' so that they match your

(One way to obtain your user and group is to type the command 'id' in a shell.)

Start aisexec by command:
aisexec will run in the background.
Once aisexec is running with the example configuration file, 2 service units
will be instantiated. The testamf1 C code will be used for both component A
and component B of both SUs. The testamf1 program determines its
component name at start time from the saAmfComponentNameGet() API call.
As a result, AMF starts 4 processes.
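The component names in this example are distinguished names such as
safComp=B,safSu=SERVICE_X_1,safSg=RAID,safApp=APP-1, which appears in the
walkthrough of this example. As a rough illustration of how such a name encodes
the component, SU, SG and application, here is a small parser. It is a sketch
only: the function names are made up, and it assumes that values contain no
escaped commas or equals signs.

```python
def parse_dn(dn):
    """Split 'safComp=B,safSu=...' into an ordered list of (type, value)."""
    pairs = []
    for rdn in dn.split(","):
        key, sep, value = rdn.strip().partition("=")
        if not sep or not key or not value:
            raise ValueError("malformed RDN: %r" % rdn)
        pairs.append((key, value))
    return pairs


def parent_dn(dn):
    """Return the name of the containing entity, or '' at the top level."""
    return dn.partition(",")[2]
```

For the name above, parse_dn() yields the component first and the application
last, and parent_dn() yields the name of the service unit that contains the
component.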
Each testamf1 process first tries to register a bad component name and
thereafter registers the name returned from saAmfComponentNameGet().
Each testamf1 process is assigned CSIs after it executes a
saAmfComponentRegister() API call. Note that a successful registration causes
the state of the component and service units to be set to INSTANTIATED as
required by the AMF specification. The service instances and their names are
defined within the configuration file.

The component of type saAmfCSTypeName = B that has the active HA state,
in this case safComp=B,safSu=SERVICE_X_1,safSg=RAID,safApp=APP-1,
reports an error via saAmfComponentErrorReport() after exactly 10 healthchecks.
The healthcheck period is configured to 1 second so one error report is sent
This results in openais calling the cleanup handler, which for
an SA-aware component is the CLC_CLI_CLEANUP command. This causes the cleanup
operation of the clc_cli_script to be run. The cleanup command then reads the
PID of the process, which was stored in /var/run (or /tmp) at startup of the
testamf1 program. It then executes a kill -9 on the PID. Custom cleanup
operations can be executed by modifying the clc_cli_script script.
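The cleanup step described above (read the stored PID, then kill -9) can be
sketched as follows. This is not the clc_cli_script itself, which is a shell
script; the function name and the idea of passing the PID file path as an
argument are assumptions for illustration.

```python
import os
import signal


def cleanup_from_pidfile(pidfile):
    """Send SIGKILL to the PID stored in `pidfile` and remove the file.

    Returns True if a signal was sent, False if the file could not be
    read or the process no longer exists.
    """
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False
    try:
        os.kill(pid, signal.SIGKILL)   # same effect as "kill -9 <pid>"
    except ProcessLookupError:
        return False
    finally:
        try:
            os.remove(pidfile)         # stale PID files are removed as well
        except OSError:
            pass
    return True
```

A custom cleanup operation would add its own steps before or after the call
to os.kill().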
After this is done 2 times (configurable), the entire service
unit is terminated and restarted due to the error escalation mechanism. Once
this happens 3 times (also configurable), the code escalates to level 2 and a
failover of the SU takes place. After this, testamf1 makes no more error
reports and nothing will happen until some problem is recognized (for example,
the process of one of the components stops executing).

The states of the cluster and its contained entities can be obtained by issuing
the following command in the shell:
In the example, testamf1 sends an error report at the 10th healthcheck.
This is controlled by the safCSIAttr = good_health_limit in
the file amf_example.conf and can be changed as you like.
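The effect of good_health_limit can be pictured with a small counter. This is
a hypothetical model of the behaviour described above, not the testamf1 source
code; the class name and the default of 10 are assumptions.

```python
class HealthcheckCounter:
    """Counts healthchecks and flags the one on which to report an error."""

    def __init__(self, csi_attributes, default_limit=10):
        # CSI attributes arrive as name-value pairs of strings.
        self.limit = int(csi_attributes.get("good_health_limit", default_limit))
        self.count = 0

    def on_healthcheck(self):
        """Return True when this healthcheck should trigger an error report."""
        self.count += 1
        return self.count == self.limit
```

With {"good_health_limit": "3"}, the third call to on_healthcheck() returns
True and all other calls return False.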
The file openais_amf_example.conf specifies logging to stderr.

If you would like to follow more closely the execution of the AMF in openais,
debug printouts can be enabled.

    logfile: /tmp/openais.log

    tags: enter|leave|trace1|trace2|trace3|trace4|trace6

Setting 'debug: on' generally gives many printouts from all other parts of openais.
Run the example on a cluster with 2 nodes
-----------------------------------------

It is easy to run the example on more than one node.
Modify the file openais_amf_example.conf:

Replace the following line:

    bindnetaddr: 127.0.0.0

bindnetaddr specifies the address which the openais Executive should bind to.
This address should always end in zero. If the local interface that traffic
should be routed over is 192.168.5.92, set bindnetaddr to 192.168.5.0.
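The bindnetaddr value is the network address of the interface. As a sketch of
the arithmetic, assuming a 24-bit netmask (255.255.255.0), which this README
does not state explicitly, the value can be computed with the Python standard
library. The function name is made up for this example.

```python
import ipaddress


def bindnetaddr_for(interface_address, prefix_length=24):
    """Return the network address for an interface address and prefix length."""
    iface = ipaddress.ip_interface("%s/%d" % (interface_address, prefix_length))
    return str(iface.network.network_address)
```

bindnetaddr_for("192.168.5.92") gives "192.168.5.0", matching the example
above. With a different netmask the result changes accordingly, so check the
netmask of your interface before relying on the default.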
Modify amf_example.conf like this:

Remove the comment character '#' from the following lines:

    # safAmfNode = AMF2 {
    # saAmfNodeSuFailOverProb=2000
    # saAmfNodeSuFailoverMax=2
    # saAmfNodeClmNode=p02

and replace p02 with the name of your second machine.

Locate the following two lines:

    saAmfSUHostedByNode=AMF1
    # saAmfSUHostedByNode=AMF2

    # saAmfSUHostedByNode=AMF1
    saAmfSUHostedByNode=AMF2
Any feedback is appreciated.

Keep in mind that only part of the functionality is implemented. Reports of
bugs or behaviour not compliant with the AMF specification within the
implemented part are greatly appreciated :-).

What is currently NOT implemented ?
-----------------------------------
The following list specifies all chapters of the AMF specification which
currently are NOT fully implemented. The deviations from the specification are
described briefly except in those cases when none of the requirements in the
chapter is implemented.
3.3.1.2 Administrative State            Not supported (always UNLOCKED).
3.3.1.4 Readiness State                 State STOPPING is not supported.
3.3.1.5 Service Unit’s HA State ...     State QUIESCING is not supported.
3.3.2.2 Operational State               AMF does not detect errors in the
                                        • A command used by the Availability
                                          Management Framework to control the
                                          component life cycle returned an
                                          error or did not return in time.
                                        • The component fails to respond in
                                          time to an Availability Management
                                          Framework's callback.
                                        • The component responds to an
                                          Availability Management Framework's
                                          state change callback
                                          (SaAmfCSISetCallbackT) with an error.
                                        • If the component is SA-aware, and it
                                          does not register with the
                                          Availability Management Framework
                                          within the preconfigured time-period
                                          after its instantiation.
                                        • If the component is SA-aware, and it
                                          unexpectedly unregisters with the
                                          Availability Management Framework.
                                        • The component terminates unexpectedly.
                                        • When a fail-over recovery operation
                                          performed at the level of the service
                                          unit or the node containing the
                                          service unit triggers an abrupt
                                          termination of the component.
3.3.2.3 Readiness State                 State STOPPING is not supported.
3.3.2.4 Component’s HA State per ...    State QUIESCING is not supported.
3.3.3.1 Administrative State            Not supported (always UNLOCKED).
3.3.5 Service Group States              Administrative state is not supported
                                        (always UNLOCKED).
3.3.6.1 Administrative State            Not supported (always UNLOCKED).
3.3.6.2 Operational State               None of the rules for transition
                                        between states are implemented.
3.3.7 Application States                Administrative state is not supported
                                        (always UNLOCKED).
3.3.8 Cluster States                    Administrative state is not supported
                                        (always UNLOCKED).
3.5.1 Combined States for Pre-Inst....  Only Administrative state = UNLOCKED
                                        is supported.
3.5.2 Combined States for Non-Pre-I...  Not supported.
3.6 Component Capability Model          Configuration of capability model is
                                        ignored. AMF expects all components to
                                        be capable to be x_active_or_y_standby.
3.7.2 2N Redundancy Model               Not supported.
3.7.3.1 Basics                          Spare service units can not be handled
3.7.3.3 Configuration                   • Ordered list of service units for a
                                          service group: Not supported
                                          (the order is unpredictable).
                                        • Ordered list of SIs: Neither ranking
                                          nor dependencies among SIs are
                                          supported. SIs are assigned to SUs in
                                        • Auto-adjust option: Not supported.
                                          Auto-adjust is never done.
3.7.3.5.1 Handling of a Node Failure..  Not supported.
3.7.3.6 An Example of Auto-adjust       Not supported.
3.7.4 N-Way Redundancy Model            Not supported.
3.7.5 N-Way Active Redundancy Model     Not supported.
3.7.6 No Redundancy Model               Not supported.
3.7.7 The Effect of Administrative...   Not supported.
3.9 Dependencies Among SIs, Compone..   Not supported.
3.11 Component Monitoring               • External Active Monitoring:
3.12.1.1 Error Detection                AMF does not support that a component
                                        reports an error for another component.
3.12.1.2 Restart                        • AMF does not support terminating of
                                          components by the terminate call-back
                                          or the TERMINATE command.
                                        • AMF does not consider component
                                          instantiation-level at restart.
                                        • The configuration option
                                          disableRestart is not supported.
3.12.1.3 Recovery                       • Component or Service Unit Fail-Over:
                                          • Component fail-over is not
                                          • Only SU fail-over is implemented and
                                            the only way to trigger that case is by
                                        • Node Switch-Over: Not implemented
                                        • Node Fail-Over: Not implemented
                                        • Node Fail-Fast: Not implemented
                                        • The configuration option
                                          recoveryOnFailure is not handled,
                                          i.e. is never evaluated.
3.12.1.4 Repair                         • The configuration attribute for
                                          automatic repair is not evaluated.
                                        • The administrative operation
                                          SA_AMF_ADMIN_REPAIRED is not
                                        • Repair after component fail-over
                                        • Node leave while performing
                                          automatic repair of that node,
                                        • Service unit failover recovery:
                                          Is implemented except that an attempt
                                          to repair is always done (configuration
                                          attribute is not evaluated).
                                        • Repair after Node Switch-Over,
                                          Fail-Over or Fail-Fast
3.12.1.5 Recovery Escalation            The recommended recovery action is not
                                        evaluated at the reception of an error
3.12.2.1 Recommended Recovery Action    The recommended recovery action is
                                        never evaluated. Recovery action
                                        SA_AMF_COMPONENT_RESTART is always
3.12.2.2 Escalations of Levels 1 and 2  Is implemented with the following
                                        exception:
                                        • The configuration attribute
                                          component_restart_max is compared to
                                          the restart counter of the component
                                          that has reported the error instead of
                                          against the sum of all restart
                                          counters of all components within
3.12.2.3 Escalation of Level 3          Not implemented
4.2 CLC-CLI's Environment Variables     Translation of non-printable Unicode
                                        characters is not supported.
4.4 INSTANTIATE Command                 • AMF does not evaluate the exit code of
                                          the INSTANTIATE command as described
                                          in the specification.
                                        • AMF does not supervise that an
                                          SA-aware component registers itself,
                                          within the time limit configured.
                                          As a consequence, none of the recovery
                                          actions described are implemented.
4.5 TERMINATE Command                   Not supported.
4.6 CLEANUP Command                     AMF does not evaluate the exit code of
                                        the CLEANUP command and thus does not
                                        implement any recovery action.
4.7 AM_START Command                    Not supported.
4.8 AM_STOP Command                     Not supported.
5 Proxied Component Management          Not implemented.
7 Administrative API                    Not implemented
8 Basic Operational Scenarios           Not implemented.
9 Alarms and Notifications              Not implemented.

Appendix A: Implementation of CLC ..    CLC-interfaces are partly implemented
                                        for SA-aware components.
                                        The terminate operation,
                                        saAmfComponentTerminateCallback(),
                                        No CLC-interfaces are implemented for
                                        any other type of component.

Appendix B: API functions in Unre....   AMF does not verify that the rules
                                        described are fulfilled.
Which functions of the AMF API are currently NOT implemented ?
--------------------------------------------------------------

saAmfComponentUnregister()                      Is implemented in the library

saAmfPmStart()                                  Is implemented in the library

saAmfPmStop()                                   Is implemented in the library

saAmfHealthcheckStart()                         This function takes a parameter
                                                of type SaAmfRecommendedRecoveryT.
                                                The value of this parameter is
                                                supposed to specify what kind of
                                                recovery AMF should execute if
                                                the component fails a health
                                                check. AMF does not read the
                                                value of this parameter but
                                                instead always tries to recover
                                                the component by a component

void (*SaAmfCSIRemoveCallbackT)()               AMF will never make a call-back

(*SaAmfComponentTerminateCallbackT)()           AMF will never make a call-back

(*SaAmfProxiedComponentInstantiateCallbackT)()  AMF will never make a call-back

(*SaAmfProxiedComponentCleanupCallbackT)()      AMF will never make a call-back

saAmfProtectionGroupTrack()                     Is implemented in the library

saAmfProtectionGroupTrackStop()                 Is implemented in the library

void (*SaAmfProtectionGroupTrackCallbackT)()    AMF will never make a call-back

saAmfProtectionGroupNotificationFree()          Not implemented.

saAmfComponentErrorReport()                     This function takes a parameter
                                                of type SaAmfRecommendedRecoveryT.
                                                The value of this parameter is
                                                supposed to specify what kind of
                                                recovery AMF should execute if
                                                the component fails a health
                                                check. AMF does not read the
                                                value of this parameter but
                                                instead always tries to recover
                                                the component by a component

saAmfComponentErrorClear()                      Is implemented in the library