doc/README

Monitoring with check_mk

Mathias Kettner, January 22nd, 2009
mk@mathias-kettner.de
www.mathias-kettner.de
----------------------------------------

1. The basic principle of check_mk

The classical approach for integrating service and host checks into
Nagios is to specify small external programs ("plugins") that Nagios
runs at regular intervals. For OS monitoring, the plugin contacts a
remote daemon running on the target machine in order to fetch one item
of information. This could be information about one specific
partition, a network interface or simply the current memory
utilisation. The most common daemon for monitoring Linux and UNIX is
NRPE (Nagios Remote Plugin Executor). Alternatives are SNMP and SSH.

check_mk takes a different approach - with some crucial advantages, as
we will see later. The basic idea of check_mk is to fetch *all*
information about a target host at once. For each host to be monitored,
check_mk is called by Nagios only once per time period (e.g. once per
minute). It contacts a small daemon called "mknagios" on the target
machine, which outputs all relevant information about the host,
regardless of which items are actually being monitored. That daemon
does not take any parameters and does not need to be configured.

check_mk then processes that information, extracts all items that
have been configured for monitoring and checks them against the
configured levels (the configuration is done in a single file, main.mk,
on the Nagios server). It then sends the check results
(OK/WARNING/CRITICAL plus performance data, if any) to Nagios as
passive service checks. Nagios processes these passive service checks
just like active ones.

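The processing step described above can be sketched roughly as
follows. This is a minimal illustration, not the actual check_mk
code: the <<<name>>> section headers, the layout of the "df" lines,
the level values and all function names are assumptions made for the
example. Only the PROCESS_SERVICE_CHECK_RESULT line follows the
documented Nagios external command format for passive checks.

```python
import time

# Standard Nagios return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_sections(agent_output):
    """Split the complete agent output into {section_name: [lines]}.

    Assumption: every section starts with a header line <<<name>>>.
    """
    sections = {}
    current = None
    for line in agent_output.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            current = line[3:-3]
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line)
    return sections


def check_filesystem(lines, mountpoint, warn, crit):
    """Check one configured mount point against percentage levels.

    Assumption: each line reads "device size used avail mountpoint".
    """
    for line in lines:
        device, size, used, avail, mount = line.split()
        if mount == mountpoint:
            percent = 100.0 * int(used) / int(size)
            if percent >= crit:
                state = CRITICAL
            elif percent >= warn:
                state = WARNING
            else:
                state = OK
            return state, "%.1f%% used" % percent
    return UNKNOWN, "mount point not found in agent output"


def passive_result(host, service, state, text, now=None):
    """Format one passive service check result for Nagios."""
    if now is None:
        now = int(time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        now, host, service, state, text)


# Hypothetical agent output and levels, for illustration only.
SAMPLE = """<<<df>>>
/dev/sda1 1000 950 50 /
/dev/sda2 2000 500 1500 /home
<<<mem>>>
MemTotal 4096
"""
LEVELS = {"/": (80, 90), "/home": (80, 90)}

sections = parse_sections(SAMPLE)
for mountpoint, (warn, crit) in sorted(LEVELS.items()):
    state, text = check_filesystem(sections["df"], mountpoint, warn, crit)
    print(passive_result("host1", "fs_" + mountpoint, state, text))
```

Note that the agent output also contains a "mem" section that is
simply ignored here, because nothing is configured for it.
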
One of the main advantages of this approach is a massive reduction in
the CPU resources needed, both on the monitoring host and on the
target host. Consider an average of 100 different services per host.
In the classic approach with NRPE, in each check period 100 separate
processes have to be started on the Nagios host, 100 TCP connections
have to be opened and closed, and 100 plugins have to be executed
locally on the target machine. check_mk, in contrast, needs just one
single plugin to be started, one single TCP connection to be opened
and one agent to be run locally on the target host. The estimated
speedup is at least a factor of 5 to 10.

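The counting argument above can be written down as a small
calculation. The function name and dictionary keys are made up for
this sketch. It only counts process starts and connections per check
period; it is not a timing measurement, which is why the counts differ
more than the estimated speedup.

```python
def work_per_period(hosts, services_per_host):
    """Count process starts and TCP connections per check period."""
    classic = hosts * services_per_host  # one plugin run per service
    fetch_all = hosts                    # one check_mk run per host
    return {
        "classic": {
            "server_processes": classic,
            "tcp_connections": classic,
            "target_runs": classic,
        },
        "check_mk": {
            "server_processes": fetch_all,
            "tcp_connections": fetch_all,
            "target_runs": fetch_all,
        },
    }


# The example from the text: one host with 100 services.
for approach, counts in sorted(work_per_period(1, 100).items()):
    print(approach, counts)
```
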
This allows us to implement a much larger number of checks per time
period on the monitoring server and at the same time save CPU
resources on the target machines.


2. Implementation of the plugin and the agent

The plugin check_mk is installed only on the monitoring server itself
and is implemented in Python. The local agent on the target hosts is
implemented as a shell script and run via inetd or xinetd. That way no
binary code is needed. Currently Linux and Solaris are supported.
Porting the script to other Unices is straightforward. The Linux
version of the script has been successfully tested and used on SLES 9
and SLES 10, on both 32-bit and 64-bit architectures.

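Because the agent is started by inetd or xinetd, it simply writes its
output to the connection, and the connection is closed when the script
exits. A client therefore only has to connect and read until end of
file. The following is a minimal sketch of such a client; the function
name is hypothetical, and the host name and port in the usage comment
are placeholders, not values taken from this document.

```python
import socket


def fetch_agent_output(host, port, timeout=10.0):
    """Read everything the agent sends until it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(4096)
            if not data:  # empty read: the agent has closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Usage (placeholder host and port, adjust to the local setup):
#   output = fetch_agent_output("target.example.com", 6556)
```

No request is sent to the agent, which matches the statement that the
daemon does not take any parameters.
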

3. Automated inventory

check_mk provides further advantages beyond the actual monitoring.
One key feature is the automated inventory function. It scans one,
several or all hosts for items not yet monitored and automatically
configures them for monitoring. This way new partitions, database
instances, network interfaces, multipath devices, RAID arrays and
other items will automatically be added to the monitoring and cannot
be forgotten. Adding one new host or even a list of hosts to the
monitoring environment is also very easy. check_mk can do this because
of the nature of the mknagios agent, which always sends a complete
list of all items of the host that are relevant for monitoring.

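At its core, the inventory is a comparison between what the agent
reports and what is already configured. A minimal sketch under that
assumption, with hypothetical names and data structures:

```python
def find_new_items(agent_items, configured):
    """Return (check_type, item) pairs that are not yet monitored.

    agent_items: {check_type: [item, ...]} as reported by the agent
    configured:  set of (check_type, item) pairs already monitored
    """
    new_items = []
    for check_type, items in sorted(agent_items.items()):
        for item in items:
            if (check_type, item) not in configured:
                new_items.append((check_type, item))
    return new_items


# Hypothetical example: /home and eth0 are not yet monitored.
agent_items = {"df": ["/", "/home"], "netif": ["eth0"]}
configured = {("df", "/")}
for check_type, item in find_new_items(agent_items, configured):
    print("new item:", check_type, item)
```
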

4. Automatic creation of Nagios configuration

For all items checked by check_mk the Nagios configuration files
are created automatically on demand. Nagios parameters can
be customized manually by making use of the Nagios template
mechanism.

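As an illustration, generating service definitions could look like the
sketch below. It uses the standard Nagios object syntax; the template
name "check_mk_default" and the function names are assumptions of this
example. Settings such as check intervals or notification options
would live in that template, which is where manual customization
happens.

```python
def service_definition(host, description, template="check_mk_default"):
    """Render one Nagios service object that inherits from a template."""
    return (
        "define service {\n"
        "  use                     %s\n"
        "  host_name               %s\n"
        "  service_description     %s\n"
        "}\n"
    ) % (template, host, description)


def render_config(host, descriptions):
    """Render the definitions for all checked items of one host."""
    return "\n".join(service_definition(host, d) for d in descriptions)


print(render_config("host1", ["fs_/", "fs_/home"]))
```
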

5. Direct RRD updates

check_mk supports graphical statistics for all checks that produce
values that can be measured. It does this by using PNP4Nagios and
round robin databases. For the sake of performance check_mk does *not*
hand over the data for the RRDs to Nagios but directly enters the values
into the RRDs itself. This saves a significant amount of CPU resources.

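To make the idea concrete, the sketch below turns a performance data
string in the usual Nagios plugin format (label=value[unit];warn;crit)
into the arguments of an "rrdtool update" call. It only builds the
command and does not run it. How check_mk actually writes to the RRDs
is not described here, and the assumption of one RRD file per service
with one data source per value, in this order, as well as the function
names and the file path, belong to this example only.

```python
import re

# Label, "=", then a number; a unit and thresholds may follow.
PERF_RE = re.compile(r"^([^=]+)=(-?\d+(?:\.\d+)?)")


def parse_perfdata(perfdata):
    """Extract (label, value) pairs; labels with spaces are not handled."""
    values = []
    for token in perfdata.split():
        match = PERF_RE.match(token)
        if match:
            values.append((match.group(1), match.group(2)))
    return values


def rrd_update_command(rrd_path, perfdata, timestamp):
    """Build the argument list for one "rrdtool update" call."""
    values = parse_perfdata(perfdata)
    if not values:
        raise ValueError("no performance values found")
    data = ":".join(value for _, value in values)
    return ["rrdtool", "update", rrd_path, "%d:%s" % (timestamp, data)]


# Hypothetical path and values.
print(rrd_update_command("/tmp/host1_fs_root.rrd",
                         "used=950;800;900 free=50", 1232600000))
```
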

6. Service aggregations

check_mk lets you configure "service aggregations". If you use that
feature, then for each host H a host H-summary will be created.
Groups of services in H will be aggregated to one single service in
H-summary.

This way it is possible, for example, to group all services belonging
to a specific database instance into one single aggregated service for
that instance in H-summary. The aggregated service is considered to be
CRITICAL if at least one of the underlying services on H is
CRITICAL. That way it is much easier to get an overview of the
overall state of the hosts. The aggregated services can also be used
for notifications.

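The aggregation rule can be sketched as "take the worst state of the
group". The CRITICAL rule is the one stated above; treating WARNING
the same way, returning OK for an empty group and leaving UNKNOWN out
are assumptions of this sketch, as are the names and the example data.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def aggregate(states):
    """Return the worst state of the underlying services."""
    return max(states) if states else OK


def summarize(service_states, groups):
    """Compute the aggregated services of H-summary.

    service_states: {service_name: state} for the services on host H
    groups:         {aggregated_name: [service_name, ...]}
    """
    return {
        name: aggregate([service_states[s] for s in members])
        for name, members in groups.items()
    }


# Hypothetical database services on one host.
states = {
    "db1_process": OK,
    "db1_tablespace": CRITICAL,
    "db2_process": OK,
    "db2_tablespace": WARNING,
}
groups = {
    "DB db1": ["db1_process", "db1_tablespace"],
    "DB db2": ["db2_process", "db2_tablespace"],
}
print(summarize(states, groups))
```
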

7. Clusters

If you monitor services running on HA clusters, you have the problem
that the monitoring server does not know which of the physical nodes a
service is currently running on. If the cluster has a service IP
address, then you can use that for monitoring. Some types of clusters
do not have a service address, however.

check_mk supports monitoring such clusters by fetching the data from
all physical nodes and checking if the specific service is running on
at least one of them. In Nagios such clusters appear as artificial
hosts with no IP address.

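The "running on at least one node" rule can be sketched as follows.
The function name, the node names and the choice of CRITICAL when no
node runs the service are assumptions of this example.

```python
OK, CRITICAL = 0, 2


def cluster_service_state(node_results):
    """Evaluate one clustered service.

    node_results: {node_name: True if the service was found running
                   in the data fetched from that node}
    """
    running_on = sorted(n for n, running in node_results.items() if running)
    if running_on:
        return OK, "running on " + ", ".join(running_on)
    return CRITICAL, "not running on any node"


# Hypothetical two-node cluster: the service currently runs on node2.
print(cluster_service_state({"node1": False, "node2": True}))
```
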

8. SNMP monitoring

check_mk also supports the "fetch all at once" principle for
monitoring networking devices via SNMP - with all the advantages
explained above. This includes an automatic inventory of Ethernet and
FibreChannel switches for used and unused ports.


9. Integration into Nagios

From Nagios' point of view check_mk is just one plugin among others.
It is straightforward to mix check_mk based and classical checks on
one Nagios host.