belkin.binbang.vm.gnt          kvm        debootstrap z2-2.host.gnt ERROR_nodedown (node down)
cspoker-bot.vm.gnt             kvm        debootstrap z2-2.host.gnt ERROR_nodedown (node down)
drac5.binbang.vm.gnt           kvm        debootstrap z2-2.host.gnt ERROR_nodedown (node down)
esclick.binbang.vm.gnt         kvm        debootstrap z2-2.host.gnt ERROR_nodedown (node down)
ganeti-manager.vm.gnt          kvm        debootstrap z2-2.host.gnt ERROR_nodedown (node down)
intel.binbang.vm.gnt           kvm        debootstrap z2-2.host.gnt ERROR_nodedown (node down)
proxy.vm.gnt                   kvm        debootstrap z2-2.host.gnt ERROR_nodedown (node down)
rheincode.pokersource.vm.gnt   kvm        debootstrap z2-2.host.gnt ERROR_nodedown (node down)
speedtouch-716g.binbang.vm.gnt kvm        debootstrap z2-2.host.gnt ERROR_nodedown (node down)
trac.pokersource.vm.gnt        kvm        debootstrap z2-2.host.gnt ERROR_nodedown (node down)
tsunami.vm.gnt                 kvm        debootstrap z2-2.host.gnt ERROR_nodedown (node down)
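
To get just the names of the instances stuck on the down node, the listing above can be filtered. A minimal sketch, assuming the five-column layout shown (name, hypervisor, OS, primary node, status); the function name is made up for this note:

```shell
# Reads `gnt-instance list` output on stdin and prints the name of every
# instance whose status column (5th field) is ERROR_nodedown.
# Assumption: the column layout matches the listing pasted above.
instances_on_down_node() {
    awk '$5 == "ERROR_nodedown" { print $1 }'
}
```

Possible usage on the master: gnt-instance list | instances_on_down_node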

1) update shorewall config on z2-3 to use z2-21 ip for http proxy

  • params

    diff -r c3e11f9f27f1 params
     VM_SAVANNAH=10.10.1.3
     VM_SPEEDTOUCH_716G=10.10.1.14
     IP_SAVANNAH=87.98.156.150
    +IP_FAILOVER_Z21=87.98.243.123
     
     VM_Z2WORK=10.10.1.15
     VM_CSPOKER=10.10.1.41

1bis) gnt-cluster masterfailover on z2-3 (to become master)

           z2-3:/etc/shorewall# gnt-cluster masterfailover
           z2-3:/etc/shorewall# gnt-cluster getmaster
           z2-3.host.gnt

2) start vpn for fsffrance on z2-3, and stop it on z2-2

           z2-3:/etc/openvpn# mv fsf-vpn.conf.old fsf-vpn.conf
           z2-2:/etc/openvpn# mv fsf-vpn.conf fsf-vpn.conf.desactivated
           z2-2:/etc/openvpn# /etc/init.d/openvpn restart
           z2-3:/etc/openvpn# /etc/init.d/openvpn restart

check:

           z2-2:/etc/openvpn# ip r|grep -v "10.1"
           91.121.220.0/24 dev eth0  proto kernel  scope link  src 91.121.220.73 
           blackhole 10.0.0.0/8 
           default via 91.121.220.254 dev eth0  src 87.98.243.123 
           z2-3:/etc/openvpn# ip r|grep -v "10.1"
           192.168.181.16 via 10.8.0.117 dev tun0 
           10.8.0.117 dev tun0  proto kernel  scope link  src 10.8.0.118 
           192.168.29.17 via 10.8.0.117 dev tun0 
           192.168.5.0/24 via 10.8.0.117 dev tun0 
           91.121.83.0/24 dev eth0  proto kernel  scope link  src 91.121.83.129 
           192.168.35.0/24 via 10.8.0.117 dev tun0 
           192.168.67.0/24 via 10.8.0.117 dev tun0 
           192.168.17.0/24 via 10.8.0.117 dev tun0 
           10.8.0.0/24 via 10.8.0.117 dev tun0 
           192.168.181.0/24 via 10.8.0.117 dev tun0 
           192.168.170.0/24 via 10.8.0.117 dev tun0 
           192.168.14.0/24 via 10.8.0.117 dev tun0 
           192.168.29.0/24 via 10.8.0.117 dev tun0 
           192.168.25.0/24 via 10.8.0.117 dev tun0 
           blackhole 10.0.0.0/8 
           default via 91.121.83.254 dev eth0  src 87.98.243.39 
           maxence@call:~$ ssh root@z2-1.host.gnt 
           [...]

           Last login: Wed Nov 25 09:07:07 2009 from 10.8.0.6
           z2-1:~# exit

           maxence@call:~$ ssh root@proxy2.vm.gnt
           [...]
           Last login: Wed Nov 25 08:06:42 2009 from 10.8.0.6
           proxy2:~# 
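
The route check above comes down to "z2-3 has routes through tun0, z2-2 no longer has any". A small sketch of that test, assuming the tunnel device is tun0 as in the output above; the function name is hypothetical:

```shell
# Reads `ip route` output on stdin. Exits 0 if at least one route goes
# through the given device (default: tun0), 1 otherwise.
# Assumption: the VPN is the only user of that device.
has_tunnel_routes() {
    tunnel_dev="${1:-tun0}"
    awk -v dev="$tunnel_dev" '
        { for (i = 1; i < NF; i++) if ($i == "dev" && $(i + 1) == dev) found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

Possible usage: ip r | has_tunnel_routes && echo "vpn routes present"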

2bis) migrate failover ip, check it works

ip migrated, check some services:

3) shut down all VMs (kill needed for rheincode, because even the kvm monitor console doesn't answer)

can't kill rheincode, the process is marked as "defunct"

z2-2:/etc/openvpn# ps aux|grep kvm
root      7678  0.0  0.0   9384   804 pts/1    S+   09:33   0:00 grep kvm
root     29667  2.2  0.0      0     0 ?        Zl   Nov12 418:45 [kvm] <defunct>
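
A process in state Z has already exited and is only waiting to be reaped, so kill has no effect on it. To spot such processes quickly, the ps output can be filtered on the STAT column. A minimal sketch, assuming the standard `ps aux` column order; the function name is made up:

```shell
# Reads `ps aux` output on stdin and prints "PID command" for every
# process whose STAT column (8th field) starts with Z (zombie/defunct).
# Assumption: standard `ps aux` layout, command starting at field 11.
list_zombies() {
    awk '$8 ~ /^Z/ {
        printf "%s", $2
        for (i = 11; i <= NF; i++) printf " %s", $i
        print ""
    }'
}
```

Possible usage: ps aux | list_zombies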

4) reboot z2-2

reboot hangs, doing hard reboot
[...]
Last login: Wed Nov 25 08:44:57 2009 from arennes-357-1-112-202.w90-12.abo.wanadoo.fr
z2-2:~# uptime
 09:37:49 up 0 min,  1 user,  load average: 0.05, 0.01, 0.00
z2-2:~# 

4bis) when rebooted, restart ganeti

Already restarted (gnt-node list output; columns are Node, DTotal, DFree, MTotal, MNode, MFree, Pinst, Sinst):

z2-1.host.gnt   1.3T 608.9G   3.9G  2.4G  1.3G    10    11
z2-2.host.gnt   1.3T   1.0T   3.8G  134M  3.7G    10    11
z2-3.host.gnt   1.3T   1.1T   3.9G  2.4G  1.2G    10     8
z2-4.host.gnt   1.3T 625.0G   7.6G  1.3G  6.7G     5    17
z2-5.host.gnt   1.3T   1.1T   3.9G  2.7G  369M    13     2
z2-6.host.gnt   1.8T   1.7T  11.8G  2.9G  9.4G     4     4
z2-7.host.gnt 911.0G 335.5G  15.7G  9.3G 14.5G     1     0

5) gnt-cluster verify

Wed Nov 25 09:19:12 2009 * Verifying node z2-2.host.gnt (master candidate)
Wed Nov 25 09:19:12 2009   - ERROR: file '/var/lib/ganeti/ssconf_master_node' has wrong checksum
Wed Nov 25 09:19:12 2009   - ERROR: file '/var/lib/ganeti/config.data' has wrong checksum
Wed Nov 25 09:19:12 2009   - ERROR: drbd minor 0 of instance home.binbang.vm.gnt is not active
Wed Nov 25 09:19:12 2009   - ERROR: drbd minor 1 of instance munin.vm.gnt is not active
Wed Nov 25 09:19:12 2009   - ERROR: drbd minor 4 of instance trac.pokersource.vm.gnt is not active
Wed Nov 25 09:19:12 2009   - ERROR: drbd minor 7 of instance z2work.vm.gnt is not active
Wed Nov 25 09:19:12 2009   - ERROR: drbd minor 10 of instance tsunami.vm.gnt is not active
Wed Nov 25 09:19:12 2009   - ERROR: drbd minor 11 of instance ligamen.vm.gnt is not active
Wed Nov 25 09:19:12 2009   - ERROR: drbd minor 13 of instance pioneer.binbang.vm.gnt is not active
Wed Nov 25 09:19:12 2009   - ERROR: drbd minor 14 of instance drupal-z2.pokersource.vm.gnt is not active
Wed Nov 25 09:19:12 2009   - ERROR: drbd minor 15 of instance proxy2.vm.gnt is not active
Wed Nov 25 09:19:12 2009   - ERROR: drbd minor 19 of instance lamp.pokersource.vm.gnt is not active
Wed Nov 25 09:19:12 2009   - ERROR: drbd minor 20 of instance harvest.vm.gnt is not active
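
To turn the verify output above into a short list of affected instances, the "is not active" lines can be extracted. A sketch only, assuming the error wording stays exactly as pasted above; the function name is hypothetical:

```shell
# Reads `gnt-cluster verify` output on stdin and prints "minor instance"
# for every "drbd minor N of instance X is not active" error line.
# Assumption: the message format matches the log pasted above.
inactive_drbd_instances() {
    sed -n 's/.*ERROR: drbd minor \([0-9][0-9]*\) of instance \([^ ]*\) is not active.*/\1 \2/p'
}
```

Possible usage: gnt-cluster verify | inactive_drbd_instances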

after a "gnt-cluster redist-conf":

z2-3:/etc/shorewall# gnt-cluster redist-conf
z2-3:/etc/shorewall# gnt-cluster verify
Wed Nov 25 09:20:54 2009 * Verifying global settings
Wed Nov 25 09:20:54 2009 * Gathering data (7 nodes)
Wed Nov 25 09:21:01 2009 * Verifying node z2-1.host.gnt (master candidate)
Wed Nov 25 09:21:01 2009 * Verifying node z2-2.host.gnt (master candidate)
Wed Nov 25 09:21:01 2009 * Verifying node z2-3.host.gnt (master)
Wed Nov 25 09:21:01 2009 * Verifying node z2-4.host.gnt (master candidate)
Wed Nov 25 09:21:01 2009 * Verifying node z2-5.host.gnt (master candidate)
Wed Nov 25 09:21:01 2009 * Verifying node z2-6.host.gnt (master candidate)
Wed Nov 25 09:21:01 2009 * Verifying node z2-7.host.gnt (master candidate)
Wed Nov 25 09:21:01 2009   - ERROR: unallocated drbd minor 0 is in use
Wed Nov 25 09:21:01 2009   - ERROR: unallocated drbd minor 3 is in use
z2-3:/etc/shorewall# gnt-cluster repair-disk-sizes 
Wed Nov 25 09:41:45 2009  - INFO: Disk 0 of instance booken.binbang.vm.gnt has mismatched size, correcting: recorded 20480, actual 5120
Wed Nov 25 09:41:46 2009  - WARNING: Failure in blockdev_getsizes call to node z2-2.host.gnt, ignoring
Wed Nov 25 09:41:46 2009  - INFO: Disk 0 of instance aviosys.binbang.vm.gnt has mismatched size, correcting: recorded 20480, actual 10240
Wed Nov 25 09:41:46 2009  - WARNING: Disk 0 of instance neufbox.binbang.vm.gnt did not return size information, ignoring
Wed Nov 25 09:41:47 2009  - WARNING: Disk 0 of instance neufbox-fc.binbang.vm.gnt did not return size information, ignoring
Wed Nov 25 09:41:47 2009  - INFO: Disk 0 of instance dtv09ut.binbang.vm.gnt has mismatched size, correcting: recorded 10240, actual 5120
Wed Nov 25 09:41:48 2009  - WARNING: Failure in blockdev_getsizes call to node z2-3.host.gnt, ignoring
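
The repair-disk-sizes run above rewrote several recorded sizes to smaller values, so it may be worth keeping a list of what was changed. A rough sketch, assuming the INFO message format shown above; the function name is made up:

```shell
# Reads `gnt-cluster repair-disk-sizes` output on stdin and prints
# "instance disk recorded actual" for every corrected disk.
# Assumption: the INFO line wording matches the log pasted above.
corrected_disk_sizes() {
    sed -n 's/.*INFO: Disk \([0-9][0-9]*\) of instance \([^ ]*\) has mismatched size, correcting: recorded \([0-9][0-9]*\), actual \([0-9][0-9]*\).*/\2 \1 \3 \4/p'
}
```

Possible usage: gnt-cluster repair-disk-sizes 2>&1 | corrected_disk_sizes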

6) gnt-instance remove ganeti-manager.vm.gnt

z2-3:/etc/shorewall# gnt-instance remove ganeti-manager.vm.gnt 
This will remove the volumes of the instance ganeti-manager.vm.gnt
(including mirrors), thus removing all the data of the instance.
Continue?
y/[n]/?: y
z2-3:/etc/shorewall# 

7) restart vms (trac.pokersource, rheincode, tsunami, proxy)

z2-3:/etc/shorewall# gnt-instance startup rheincode.pokersource.vm.gnt 
z2-3:/etc/shorewall# gnt-instance startup trac.pokersource.vm.gnt 
z2-3:/etc/shorewall# gnt-instance startup tsunami.vm.gnt 

8) move failover ip back to z2-2

done

9) check everything is ok: all hosts are back to normal, but "replace-disks" needs to be run for trac.pokersource.vm.gnt and proxy.ligamen.vm.gnt. Both had their secondary on z2-3; their primaries were on z2-2 and z2-5 respectively. The error, seen during a gnt-cluster repair-disk-sizes, was: "BlockDeviceError: blockdev failed (exited with exit code 1): /dev/drbd12: Wrong medium type".

The "Wrong medium type" error seems to appear when the underlying volume (here, an LVM LV) is missing, but it still appears after a vgscan or after a stop/start of the VM...

Some VMs were "split-brained" and need to be resynced, on the slave node, with:

           drbdsetup /dev/drbdXY secondary
           drbdsetup /dev/drbdXY invalidate

shown in syslog:

Nov 25 14:41:12 z2-1 kernel: [6070861.392773] block drbd10: helper command: /bin/true split-brain minor-10
Nov 25 14:41:12 z2-1 kernel: [6070861.393235] block drbd10: helper command: /bin/true split-brain minor-10 exit code 0 (0x0)
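
The affected minors can be pulled out of syslog and turned into the two drbdsetup commands noted above. A sketch only: the function names are made up, and the commands are printed rather than run, because invalidate discards the local copy and must only be used on the node whose data is to be thrown away.

```shell
# Reads syslog lines on stdin and prints each DRBD minor number that
# reported a split-brain, once, in numeric order.
# Assumption: the kernel message contains "split-brain minor-N" as above.
split_brain_minors() {
    sed -n 's/.*split-brain minor-\([0-9][0-9]*\).*/\1/p' | sort -n -u
}

# Reads minor numbers on stdin and prints (does not execute) the resync
# commands from the notes above, to be reviewed and run on the slave node.
print_resync_commands() {
    while read -r minor; do
        echo "drbdsetup /dev/drbd${minor} secondary"
        echo "drbdsetup /dev/drbd${minor} invalidate"
    done
}
```

Possible usage: grep split-brain /var/log/syslog | split_brain_minors | print_resync_commands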
  • tsunami.vm.gnt