Faq'n Tips


  1. Hey!  This doesn't look like a FAQ!  What gives?
  2. Are there mailing lists for Linux-HA?
  3. What is a cluster?
  4. What is a resource script?
  5. How do I monitor various resources? If one of my resources stops working, heartbeat doesn't do anything unless the server crashes. How do I monitor resources with heartbeat?
  6. If one of my ethernet connections goes away (cable severance, NIC failure, locusts), but my current primary node (the one with the services) is otherwise fine, no one can reach my services and I want to fail them over to my other cluster node.  Is there a way to do this?
  7. Every time my machine releases an IP alias, it loses the whole interface (i.e. eth0)!  How do I fix this?
  8. I want a lot of IP addresses as resources (more than 8).  What's the best way?
  9. The documentation indicates that a serial line is a good idea, is there really a drawback to using two ethernet connections?
  10. What is the difference between normal and nice failback?
  11. How do I use heartbeat with an ipchains firewall?
  12. I got the message "ERROR: No local heartbeat. Forcing shutdown" and then heartbeat shut itself down for no reason at all!
  13. How do I tune heartbeat on a heavily loaded system to avoid split-brain?
  14. When heartbeat starts up I get this error message in my logs:
      WARN: process_clustermsg: node [<hostname>] failed authentication

    What does this mean?
  15. When I try to start heartbeat I receive the message: "Starting High-Availability services: Heartbeat failure [rc=1]. Failed."
    and there is nothing in any of the log files and no messages. What is wrong?
  16. How do I run multiple clusters on the same network segment?
  17. How do I get the latest CVS version of heartbeat?
  18. Heartbeat on other OSs.
  19. When I try to install the linux-ha.org heartbeat RPMs, they complain of dependencies from packages I already have installed!  Now what?
  20. I don't want heartbeat to fail over the cluster automatically.  How can I require human confirmation before failing over?
  21. What is STONITH?  And why might I need it?
  22. How do I figure out what STONITH devices are available, and how to configure them?
  23. I want to use a shared disk, but I don't want to use STONITH.  Any recommendations?
  24. Can heartbeat be configured in an active/active configuration? If so, how do I do this, since the haresource script is supposed to be the same on each box so I do not know how this could be done.
  25. If nothing helps, what should I do ?
  26. I want to submit a patch, how do I do that?


 
  1. Quit your bellyachin'!  We needed a "catch-all" document to supply useful information in a way that was easily referenced and would grow without a lot of work.  It's closer to a FAQ than anything else.

  2. Yes!  There are two public mailing lists for Linux-HA.  You can find out about them by visiting http://linux-ha.org/contact/.

  3. HA (High Availability) cluster - a cluster that makes a host (or hosts) highly available. If one node goes down (or a service on that node goes down), another node picks up the service and takes over from the failed machine. http://linux-ha.org
    Computing cluster - this is what a Beowulf cluster is. It allows distributed computing over off-the-shelf components, usually cheap IA32 machines. http://www.beowulf.org/
    Load balancing cluster - this is what the Linux Virtual Server project does. In this scenario one machine load balances requests for a certain service (apache, for example) over a farm of servers. www.linuxvirtualserver.org
    All of these sites have HOWTOs etc. on them. For a general overview of clustering under Linux, look at the Clustering HOWTO.

  4. Resource scripts are basically (extended) System V init scripts. They have to support the stop, start, and status operations.  In the future we will also add support for a "monitor" operation for monitoring services. The IPaddr script already implements this "monitor" operation (but heartbeat doesn't use that function of it yet). For more info see the Resource HOWTO.

  5. Heartbeat itself was not designed for monitoring various resources. If you need to monitor some resources (for example, the availability of a WWW server) you need some third-party software. Mon is a reasonable solution.
    1. Get Mon from http://kernel.org/software/mon/.
    2. Get all the required modules listed in the documentation. You can find them at the nearest mirror or at the CPAN archive (www.cpan.org). I am not very familiar with Perl, so I downloaded them from the CPAN archive as .tar.gz packages and installed them the usual way (perl Makefile.PL && make && make test && make install).
    3. Mon is software for monitoring different network resources. It can ping computers, connect to various ports, monitor WWW, MySQL, etc. When some resource malfunctions, it triggers scripts.
    4. Unpack mon in some directory. The best starting point is the README file. The complete documentation is in <dir>/doc, where <dir> is the place where you unpacked the mon package.
    5. For a fast start do following steps:
      1. copy all subdirs found in <dir> to /usr/lib/mon
      2. create dir /etc/mon
      3. copy auth.cf from <dir>/etc to /etc/mon

      Now mon is ready to work. You need to create your own mon.cf file, which points to the resources mon should watch, the actions mon will take in case of a failure, and what to do when the resources become available again.   All monitoring scripts are in /usr/lib/mon/mon.d/. At the beginning of every script you can find an explanation of how to use it.
      All alert scripts are placed in /usr/lib/mon/alert.d/. These are the scripts triggered in case something goes wrong. If you are using ipvs, on its homepage (www.linuxvirtualserver.org) you can find scripts for adding and removing servers from an ipvs table.
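      As a rough sketch, a minimal mon.cf watching a web server could look like the following. The hostgroup name, hostname, interval, and alert address are all placeholder assumptions; check /usr/lib/mon/mon.d/ and /usr/lib/mon/alert.d/ for the monitors and alerts actually installed on your system.

```
# /etc/mon/mon.cf -- minimal sketch, not a tested configuration.
# "webservers", www1.example.com and admin@example.com are placeholders.
hostgroup webservers www1.example.com

watch webservers
    service http
        interval 1m
        monitor http.monitor
        period wd {Sun-Sat}
            alert mail.alert admin@example.com
            upalert mail.alert admin@example.com
```

      Here http.monitor checks the web server every minute, and mail.alert sends mail both when the service fails and (via upalert) when it comes back.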


  6. Yes!  Use the ipfail plug-in.  For each interface you wish to monitor, specify one or more "ping" nodes in your configuration.  Each node in your cluster will monitor these ping nodes.  Should one node detect a failure in one of these ping nodes, it will contact the other node in order to determine whether it or the ping node has the problem.  If the cluster node has the problem, it will try to fail over its resources (if it has any).

    To use ipfail, you will need to add the following to your /etc/ha.d/ha.cf files:
            respawn hacluster /usr/lib/heartbeat/ipfail
            ping <IPaddr1> <IPaddr2> ... <IPaddrN>

    See Kevin's documentation for more details on the concepts.

    IPaddr1..N are your ping nodes.  NOTE:  ipfail requires the "nice_failback on" option.
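    Putting the pieces together, the relevant portion of ha.cf on both nodes might look like this sketch. The ping addresses below are placeholders; stable devices such as routers or switches make good ping nodes.

```
# /etc/ha.d/ha.cf -- ipfail sketch; 10.0.0.1 and 10.0.0.254 are example
# ping nodes, not addresses of cluster members.
nice_failback on
ping 10.0.0.1 10.0.0.254
respawn hacluster /usr/lib/heartbeat/ipfail
```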


  7. This isn't a problem with heartbeat; it is caused by various versions of net-tools.  Upgrade to the most recent version of net-tools and the problem will go away.  You can test for it manually with ifconfig.


  8. Instead of failing over many IP addresses, just fail over one router address.  On your router, do the equivalent of "route add -net x.x.x.0/24 gw x.x.x.2", where x.x.x.2 is the cluster IP address controlled by heartbeat.  Then, make every address within x.x.x.0/24 that you wish to failover a permanent alias of lo0 on BOTH cluster nodes.  This is done via "ifconfig lo:2 x.x.x.3 netmask 255.255.255.255 -arp" etc...
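    A small loop keeps the two nodes consistent when you have many addresses to alias. This sketch only prints the commands so you can review them first; the 10.0.1.x network and the host range 3..5 are placeholder assumptions.

```shell
#!/bin/sh
# Print the loopback alias commands to run on BOTH cluster nodes.
# 10.0.1.x and the host numbers 3..5 are placeholders for your own block.
for i in 3 4 5; do
    echo "ifconfig lo:$i 10.0.1.$i netmask 255.255.255.255 -arp"
done
```

    Pipe the output through sh once you are happy with it, and remember the aliases must be permanent (set up at boot) on both nodes.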

  9. If anything makes your ethernet / IP stack fail, you may lose both connections. You should definitely run the cables differently, depending on how important your data is...

  10. Normal failback mode:
    In this mode, one of the two machines is designated as the preferred provider of a given resource group. If that machine is up, then it will always be the provider of every resource group for which it is preferred provider. Failovers occur when the preferred provider goes out of service, and when it comes back (failback). This mode is required if you wish to run an active-active configuration.
    Nice failback mode:
    In this mode, there is no natural affinity between a resource group and a particular node in the cluster (haresources file notwithstanding). Instead, there is an affinity between a resource group and whatever machine it is currently running on. Failovers occur *only* when a machine which is providing a service goes out of service. There is no concept of failback in this mode. This mode minimizes service interruptions, but cannot run an active-active configuration.

  11. To make heartbeat work with ipchains, you must accept incoming and outgoing traffic on UDP port 694. Add something like
    /sbin/ipchains -A output -i ethN -p udp -s <source_IP> -d <dest_IP> 694 -j ACCEPT
    /sbin/ipchains -A input -i ethN -p udp -s <source_IP> -d <dest_IP> 694 -j ACCEPT
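    ipchains has since been replaced by iptables; on a 2.4 or later kernel the equivalent rules would be something like the following (the interface name is a placeholder for your heartbeat interface):

```
/sbin/iptables -A OUTPUT -o ethN -p udp --dport 694 -j ACCEPT
/sbin/iptables -A INPUT  -i ethN -p udp --dport 694 -j ACCEPT
```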

  12. This can be caused by one of two things:
    • System under heavy I/O load, or
    • Kernel bug.
    For how to deal with the first occurrence (heavy load), please read the answer to the next FAQ item.

    If your system was not under moderate to heavy load when it got this message, you probably have the kernel bug. The 2.4.18 Linux kernel had a bug in it which would cause it to not schedule heartbeat for very long periods of time when the system was idle, or nearly so. If this is the case, you need to get a kernel that isn't broken.

  13. "No local heartbeat" or "Cluster node returning after partition" under heavy load is typically caused by too small a deadtime interval. Here are some suggestions for tuning it:
    Increase the deadtime value in your ha.cf file. Adding memory to the machine generally helps. Limiting the workload on the machine generally helps. Newer versions of heartbeat are a bit better about this than pre-1.0 versions.
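    The timing-related ha.cf directives look like this sketch. The values shown are illustrative starting points, not recommendations for your particular hardware and load.

```
# /etc/ha.d/ha.cf timing sketch -- values are examples only
keepalive 2        # send a heartbeat every 2 seconds
warntime 10        # warn in the logs about late heartbeats after 10 seconds
deadtime 30        # declare the peer dead after 30 seconds of silence
initdead 120       # allow extra time at boot before declaring anything dead
```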

  14. It's common to get a single mangled packet on your serial interface when heartbeat starts up.  This message is an indicator that that has happened.  It's harmless in this scenario.

  15. It's probably a permissions problem on authkeys.  Heartbeat wants the file to be readable by its owner only (mode 400, 600 or 700).  Depending on where and when it discovers the problem, the message will wind up in different places:
    1. stdout/stderr
    2. wherever you specified in your setup
    3. /var/log/messages
    Newer releases are better about also putting startup messages out to stderr in addition to wherever you have configured them to go.
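    A quick way to create an authkeys file with acceptable permissions is sketched below. The sha1 key shown is a placeholder; generate your own secret. The file is created in the current directory here so you can inspect it before installing it as /etc/ha.d/authkeys.

```shell
#!/bin/sh
umask 077                     # files created below will be mode 600
cat > authkeys <<'EOF'
auth 1
1 sha1 ReplaceThisWithYourOwnSecret
EOF
ls -l authkeys                # should show -rw------- (mode 600)
```

    The same authkeys file (and permissions) must be in place on both nodes.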

  16. Use multicast and give each cluster its own multicast group. If you need or want to use broadcast, then run each cluster on a different port number.  An example of a configuration using multicast would be to have the following line in your ha.cf file:
         mcast eth0 224.1.2.3 694 1 0
    This sets eth0 as the interface over which to send the multicast, 224.1.2.3 as the multicast group (the same on each node in the same cluster), UDP port 694 (the heartbeat default), a time to live of 1 (limiting the multicast to the local network segment so it doesn't propagate through routers), and multicast loopback disabled (typical).

  17. There is a CVS repository for Linux-HA. You can find it at cvs.linux-ha.org.  Read-only access is via login guest, password guest, module name linux-ha. More details are to be found in the announcement email.  It is also available through the web using viewcvs at http://cvs.linux-ha.org/viewcvs/viewcvs.cgi/linux-ha/
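    The checkout would look something like the sketch below; the repository path portion of CVSROOT is left as a placeholder here, so check the announcement email for the exact value.

```
cvs -d :pserver:guest@cvs.linux-ha.org:<repository-path> login
        (password: guest)
cvs -d :pserver:guest@cvs.linux-ha.org:<repository-path> checkout linux-ha
```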

  18. Heartbeat now uses automake and is generally quite portable at this point. Join the Linux-HA-dev mailing list if you want to help port it to your favorite platform.

  19. Due to distribution RPM package name differences, this was unavoidable.  If you're not using STONITH, use the "--nodeps" option with rpm.  Otherwise, use the heartbeat source to build your own RPMs.  You'll have the added dependencies of autoconf >= 2.53 and libnet (get it from http://www.packetfactory.net/libnet).  Use the heartbeat source RPM (preferred) or unpack the heartbeat source and from the top directory, run "./ConfigureMe rpm".  This will build RPMS and place them where it's customary for your particular distro.  It may even tell you if you are missing some other required packages!

  20. You configure a "meatware" STONITH device in the ha.cf file.  The meatware STONITH device asks the operator to go power-reset the machine which has gone down.  When the operator has reset the machine, he or she then issues a command to tell the meatware STONITH plugin that the reset has taken place.  Heartbeat will wait indefinitely until the operator acknowledges the reset has occurred.  During this time, the resources will not be taken over, and nothing will happen.
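    In practice this looks something like the following sketch. The node names are placeholders, and you should run "stonith -h" to confirm the exact meatware parameters for your version of heartbeat.

```
# In /etc/ha.d/ha.cf on both nodes (node1 and node2 are placeholders):
stonith_host * meatware node1 node2

# After you have manually power-cycled the dead node,
# acknowledge the reset so heartbeat can proceed:
meatclient -c node1
```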

  21. STONITH is a form of fencing, and is an acronym standing for Shoot The Other Node In The Head.  It allows one node in the cluster to reset the other.  Fencing is essential if you're using shared disks, in order to protect the integrity of the disk data.  Heartbeat supports STONITH fencing, and resources which are self-fencing.  You need to configure some kind of fencing whenever you have a cluster resource which might be permanently damaged if both machines tried to make it active at the same time.  When in doubt check with the Linux-HA mailing list.

  22. To get the list of supported STONITH devices, issue this command:
    stonith -L
    To get all the gory details on exactly what these STONITH device names mean, and how to configure them, issue this command:
    stonith -h
  23. This is not something which heartbeat supports directly, however, there are a few kinds of resources which are "self-fencing".  This means that activating the resource causes it to fence itself off from the other node naturally.  Since this fencing happens in the resource agent, heartbeat doesn't know (and doesn't have to know) about it.  Two possible hardware candidates are IBM's ServeRAID-4 RAID controllers and ICP Vortex RAID controllers - but do your homework!!!   When in doubt check with the mailing list.

  24. Yes, heartbeat has supported active/active configurations since its first release. The key to configuring active/active clusters is to understand that each resource group in the haresources file is preceded by the name of the server which is normally supposed to run that service. In a "nice_failback off" configuration, when a cluster node comes up, it will take over any resources for which it is listed as the "normal master" in the haresources file. Below is an example of how to do this for an apache/mysql configuration.
    server1 10.10.10.1 mysql
    server2 10.10.10.2 apache
    
    In this case, the IP address 10.10.10.1 should be replaced with the IP address at which you want to contact the mysql server, and 10.10.10.2 should be replaced with the IP address you want users to use to contact the web server. Any time server1 is up, it will run the mysql service. Any time server2 is up, it will run the apache service. If both server1 and server2 are up, both servers will be active. Note that this is incompatible with the nice_failback on option (but this is being fixed), which in turn prohibits the use of ipfail.

  25. Please make sure that you have read all the documentation and searched the mailing list archives. If you still can't find a solution, you can post questions to the mailing list. Please include the following: your heartbeat version, your OS and kernel version, your configuration files (ha.cf and haresources; never post the secret keys from authkeys), and the relevant portions of your logs.

  26. We love to get good patches.  Here's the preferred way: make your patch against the latest CVS version of heartbeat (see the CVS question above) and send it as a unified diff (diff -u) to the Linux-HA-dev mailing list.


Rev 0.0.5
(c) 2000 Rudy Pawul rpawul@iso-ne.com
(c) 2001 Dusan Djordjevic dj.dule@linux.org.yu