1) done: move the secondaries of the mekensleep VMs to z2-4 (z2-5 does not have enough free memory for them):

z2-2:~# gnt-instance replace-disks -n z2-4.host.gnt wetball.mekensleep.vm.gnt 
z2-2:~# gnt-instance replace-disks -n z2-4.host.gnt hanabi.mekensleep.vm.gnt 
z2-2:~# gnt-instance replace-disks -n z2-4.host.gnt proxy.mekensleep.vm.gnt 
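Each command above has the form `gnt-instance replace-disks -n <new secondary> <instance>`, run on the master (z2-2). Here is a minimal sketch of the same thing as a loop, assuming a POSIX shell on the master node. `move_secondaries` and the `DRY_RUN` variable are hypothetical names introduced for illustration and are not part of Ganeti. By default the function only prints the commands. Set `DRY_RUN=0` to run them.

```shell
#!/bin/sh
# Move the DRBD secondary of each listed instance to a new node.
# move_secondaries and DRY_RUN are illustrative names, not Ganeti features:
# DRY_RUN unset or 1 only prints the commands, DRY_RUN=0 executes them.
move_secondaries() {
    new_node=$1
    shift
    for inst in "$@"; do
        if [ "${DRY_RUN:-1}" = 0 ]; then
            # Stop at the first failure so a broken disk is looked at
            # before the next instance is touched.
            gnt-instance replace-disks -n "$new_node" "$inst" || return 1
        else
            echo "gnt-instance replace-disks -n $new_node $inst"
        fi
    done
}
```

For example, `DRY_RUN=0 move_secondaries z2-4.host.gnt wetball.mekensleep.vm.gnt hanabi.mekensleep.vm.gnt proxy.mekensleep.vm.gnt` would reproduce step 1. The function stops at the first failure on purpose, because a replace-disks error like the one in step 4 should be handled before the loop moves on.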

2) done: fail over the VMs running on z2-6:

z2-2:~# gnt-instance failover wetball.mekensleep.vm.gnt 
z2-2:~# gnt-instance failover hanabi.mekensleep.vm.gnt 
z2-2:~# gnt-instance failover proxy.mekensleep.vm.gnt 
z2-2:~# gnt-instance failover dtv09ut.binbang.vm.gnt 
z2-2:~# gnt-instance shutdown dtv09ut.binbang.vm.gnt 

VMs that will be down during the failover:

  • dtv09ut.binbang.vm.gnt
  • hanabi.mekensleep.vm.gnt
  • proxy.mekensleep.vm.gnt
  • wetball.mekensleep.vm.gnt
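`gnt-instance failover` shuts the instance down on its primary and starts it again on its secondary, so the VMs listed above are down during this step. Below is a minimal sketch that fails over a list of instances and prints how long each command took. This gives a rough measure of the downtime. It assumes the commands run on the master node. `failover_timed` and `DRY_RUN` are hypothetical names. The commands are printed unless `DRY_RUN=0`.

```shell
#!/bin/sh
# Fail over each instance in turn and report the elapsed time on stderr.
# failover_timed and DRY_RUN are illustrative names, not Ganeti features.
failover_timed() {
    for inst in "$@"; do
        start=$(date +%s)
        if [ "${DRY_RUN:-1}" = 0 ]; then
            gnt-instance failover "$inst" || return 1
        else
            echo "gnt-instance failover $inst"
        fi
        end=$(date +%s)
        # Time spent in the failover command only; services inside the VM
        # may need longer to come back after the instance is started.
        echo "$inst: $((end - start))s" >&2
    done
}
```

To run step 2 this way, use `DRY_RUN=0 failover_timed wetball.mekensleep.vm.gnt hanabi.mekensleep.vm.gnt proxy.mekensleep.vm.gnt dtv09ut.binbang.vm.gnt`, then run `gnt-instance shutdown dtv09ut.binbang.vm.gnt` as above.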

3) done: switch the mekensleep IP (91.121.57.196) from z2-6 to z2-4 in the OVH interface.
3bis) check that the mekensleep services are up.

  • Restart the VMs to trigger the Ganeti hook that adds the local route on the new host (z2-4):
    z2-2:~# gnt-instance shutdown proxy.mekensleep.vm.gnt 
    z2-2:~# gnt-instance start proxy.mekensleep.vm.gnt 
    z2-2:~# gnt-instance shutdown wetball.mekensleep.vm.gnt 
    z2-2:~# gnt-instance start wetball.mekensleep.vm.gnt 
    z2-2:~# gnt-instance shutdown hanabi.mekensleep.vm.gnt 
    z2-2:~# gnt-instance start hanabi.mekensleep.vm.gnt 
    
  • Delete the old local routes on the previous host (z2-6):
    z2-6:~# host wetball.mekensleep.vm.gnt
    wetball.mekensleep.vm.gnt has address 10.10.1.37
    z2-6:~# ip route | grep 10.10.1.37
    10.10.1.37 dev br0  scope link 
    z2-6:~# ip route del 10.10.1.37
    z2-6:~# ip route | grep 10.10.1.37
    10.10.1.37 via 10.1.4.6 dev tun4  proto zebra  metric 20 
    z2-6:~# host hanabi.mekensleep.vm.gnt
    hanabi.mekensleep.vm.gnt has address 10.10.1.45
    z2-6:~# ip route | grep 10.10.1.45
    10.10.1.45 dev br0  scope link 
    z2-6:~# ip route del 10.10.1.45
    z2-6:~# ip route | grep 10.10.1.45
    10.10.1.45 via 10.1.4.6 dev tun4  proto zebra  metric 20 
    z2-6:~# host proxy.mekensleep.vm.gnt
    proxy.mekensleep.vm.gnt has address 10.10.1.44
    z2-6:~# ip route | grep 10.10.1.44
    10.10.1.44 dev br0  scope link 
    z2-6:~# ip route del 10.10.1.44
    z2-6:~# ip route | grep 10.10.1.44
    10.10.1.44 via 10.1.4.6 dev tun4  proto zebra  metric 20 
    z2-6:~# host dtv09ut.binbang.vm.gnt
    dtv09ut.binbang.vm.gnt has address 10.10.1.38
    z2-6:~# ip route | grep 10.10.1.38
    10.10.1.38 dev br0  scope link 
    z2-6:~# ip route del 10.10.1.38
    z2-6:~# ip route | grep 10.10.1.38
    z2-6:~# 
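The route cleanup above follows one rule. On the old host, delete the VM's `<ip> dev br0 scope link` route and keep any route learned through zebra (the `via 10.1.4.6 dev tun4` entries). Here is a minimal sketch of that rule. It assumes `host` and `ip` are available on z2-6 and that br0 is the bridge, as in the log. `stale_route_cmd`, `cleanup_vm_route` and `DRY_RUN` are hypothetical names. The route table is passed in as text, so the matching can be checked without touching the real routes.

```shell
#!/bin/sh
# Print "ip route del <ip> dev br0" if the given "ip route" output still
# has a local (dev br0 scope link) route for <ip>; print nothing otherwise.
# Routes learned via zebra/OSPF do not match and are left alone.
stale_route_cmd() {
    vm_ip=$1
    routes=$2
    printf '%s\n' "$routes" | while read -r dst rest; do
        if [ "$dst" = "$vm_ip" ]; then
            case "$rest" in
                "dev br0"*"scope link"*) echo "ip route del $vm_ip dev br0" ;;
            esac
        fi
    done
}

# Resolve the VM name, then delete (DRY_RUN=0) or print its stale route.
cleanup_vm_route() {
    vm_ip=$(host "$1" | awk '/has address/ { print $4; exit }')
    [ -n "$vm_ip" ] || { echo "cannot resolve $1" >&2; return 1; }
    cmd=$(stale_route_cmd "$vm_ip" "$(ip route)")
    [ -n "$cmd" ] || return 0
    if [ "${DRY_RUN:-1}" = 0 ]; then
        ip route del "$vm_ip" dev br0
    else
        echo "$cmd"
    fi
}
```

This sketch adds `dev br0` to the delete command so that only the local route can match. The log above used plain `ip route del <ip>`, and in each case that removed the br0 route.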
    

4) done: move the secondaries off z2-6, mostly to z2-5 (this works despite the low free memory on z2-5 because the instances are not started there):

z2-2:~# gnt-instance replace-disks -n z2-5.host.gnt wetball.mekensleep.vm.gnt 
z2-2:~# gnt-instance replace-disks -n z2-5.host.gnt hanabi.mekensleep.vm.gnt 
z2-2:~# gnt-instance replace-disks -n z2-4.host.gnt dtv09ut.binbang.vm.gnt
z2-2:~# gnt-instance replace-disks -n z2-5.host.gnt pokme.vm.gnt 
z2-2:~# gnt-instance replace-disks -n z2-5.host.gnt mediagateusa.binbang.vm.gnt 
z2-2:~# gnt-instance replace-disks -n z2-5.host.gnt proxy.mekensleep.vm.gnt 
Sat Nov 14 16:03:55 2009 STEP 1/6 check device existence
Sat Nov 14 16:03:55 2009  - INFO: checking volume groups
Sat Nov 14 16:03:56 2009  - INFO: checking disk/0 on z2-4.host.gnt
Sat Nov 14 16:03:56 2009 STEP 2/6 check peer consistency
Sat Nov 14 16:03:56 2009  - INFO: checking disk/0 consistency on z2-4.host.gnt
Sat Nov 14 16:03:56 2009 STEP 3/6 allocate new storage
Sat Nov 14 16:03:56 2009  - INFO: adding new local storage on z2-5.host.gnt for disk/0
Sat Nov 14 16:03:56 2009 STEP 4/6 changing drbd configuration
Sat Nov 14 16:03:56 2009  - INFO: activating a new drbd on z2-5.host.gnt for disk/0
Failure: command execution error:
Can't create block device <DRBD8(hosts=z2-4.host.gnt/23-z2-5.host.gnt/14, port=None, configured as 10.10.0.5:None 10.10.0.4:None, backend=<LogicalVolume(/dev/all/0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_data, not visible, size=10240m)>, metadev=<LogicalVolume(/dev/all/0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_meta, not visible, size=128m)>, not visible, size=10240m)> on node z2-5.host.gnt for instance proxy.mekensleep.vm.gnt: Can't assemble device after creation, very unusual event: drbd14: can't attach local disk: /dev/drbd14: Failure: (114) Lower device is already claimed. This usually means it is mounted.
z2-5:~# lvremove /dev/all/0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_data 
Do you really want to remove active logical volume 0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_data? [y/n]: y
  Logical volume "0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_data" successfully removed
z2-5:~# lvremove /dev/all/0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_meta 
Do you really want to remove active logical volume 0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_meta? [y/n]: y
  Logical volume "0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_meta" successfully removed
z2-5:~# cat /proc/drbd | grep Unconfigured
12: cs:Unconfigured
13: cs:Unconfigured
14: cs:Unconfigured
15: cs:Unconfigured
z2-5:~# drbdsetup  /dev/drbd12 down
z2-5:~# drbdsetup  /dev/drbd13 down
z2-5:~# drbdsetup  /dev/drbd14 down
z2-5:~# drbdsetup  /dev/drbd15 down
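The cleanup on z2-5 released the DRBD minors that the failed `replace-disks` left in the `Unconfigured` state, with one `drbdsetup ... down` per minor. Here is a minimal sketch of the same steps. It assumes the DRBD 8 `drbdsetup /dev/drbdN down` syntax used above. It also assumes that none of the listed minors belongs to a running instance, so check with `gnt-instance info` first. `unconfigured_minors`, `release_unconfigured` and `DRY_RUN` are hypothetical names. The sketch does not handle the leftover logical volumes, which were removed by hand with `lvremove` above.

```shell
#!/bin/sh
# Read /proc/drbd-style text on stdin and print the minor numbers whose
# connection state is Unconfigured (lines like "14: cs:Unconfigured").
unconfigured_minors() {
    sed -n 's/^ *\([0-9][0-9]*\): cs:Unconfigured.*/\1/p'
}

# Print (default) or run (DRY_RUN=0) "drbdsetup /dev/drbdN down" for each
# Unconfigured minor, as was done by hand for minors 12 to 15.
release_unconfigured() {
    for minor in $(unconfigured_minors < /proc/drbd); do
        if [ "${DRY_RUN:-1}" = 0 ]; then
            drbdsetup "/dev/drbd$minor" down
        else
            echo "drbdsetup /dev/drbd$minor down"
        fi
    done
}
```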

Filed: http://code.google.com/p/ganeti/issues/detail?id=78

z2-2:~# gnt-node info z2-6.host.gnt 
Node name: z2-6.host.gnt
  primary ip: 10.10.0.6
  secondary ip: 10.10.0.6
  master candidate: True
  drained: False
  offline: False
  primary for no instances
  secondary for instances:
    - proxy.mekensleep.vm.gnt
z2-2:~# gnt-instance replace-disks  -n z2-7.host.gnt proxy.mekensleep.vm.gnt 
Mon Nov 16 15:47:35 2009 STEP 1/6 check device existence
Mon Nov 16 15:47:35 2009  - INFO: checking volume groups
Mon Nov 16 15:47:36 2009  - INFO: checking disk/0 on z2-4.host.gnt
Mon Nov 16 15:47:36 2009 STEP 2/6 check peer consistency
Mon Nov 16 15:47:36 2009  - INFO: checking disk/0 consistency on z2-4.host.gnt
Mon Nov 16 15:47:36 2009 STEP 3/6 allocate new storage
Mon Nov 16 15:47:36 2009  - INFO: adding new local storage on z2-7.host.gnt for disk/0
Mon Nov 16 15:47:36 2009 STEP 4/6 changing drbd configuration
Mon Nov 16 15:47:36 2009  - INFO: activating a new drbd on z2-7.host.gnt for disk/0
Mon Nov 16 15:47:36 2009  - INFO: shutting down drbd for disk/0 on old node
Mon Nov 16 15:47:37 2009  - INFO: detaching primary drbds from the network (=> standalone)
Mon Nov 16 15:47:37 2009  - INFO: updating instance configuration
Mon Nov 16 15:47:37 2009  - INFO: attaching primary drbds to new secondary (standalone => connected)
Mon Nov 16 15:47:37 2009 STEP 5/6 sync devices
Mon Nov 16 15:47:37 2009  - INFO: Waiting for instance proxy.mekensleep.vm.gnt to sync disks.
Mon Nov 16 15:47:37 2009  - INFO: - device disk/0:  0.00% done, no time estimate
Mon Nov 16 15:47:37 2009  - INFO: - device disk/0:  0.00% done, no time estimate
Mon Nov 16 15:47:38 2009  - INFO: - device disk/0:  0.00% done, no time estimate
Mon Nov 16 15:47:38 2009  - INFO: - device disk/0:  0.10% done, 26213 estimated seconds remaining
Mon Nov 16 15:48:38 2009  - INFO: - device disk/0:  0.80% done, 13024 estimated seconds remaining
Mon Nov 16 15:49:38 2009  - INFO: - device disk/0:  1.40% done, 116475 estimated seconds remaining
Mon Nov 16 15:50:38 2009  - INFO: - device disk/0:  0.70% done, 1659 estimated seconds remaining
Mon Nov 16 15:51:38 2009  - INFO: - device disk/0:  1.40% done, 4122 estimated seconds remaining
Mon Nov 16 15:52:39 2009  - INFO: - device disk/0:  2.00% done, 1058 estimated seconds remaining
Timeout while talking to the master daemon. Error:
Nov 16 10:06:36 z2-7 kernel: [1033136.244033] block drbd3: peer( Primary -> Unknown ) conn( SyncTarget -> Timeout ) pdsk( UpToDate -> DUnknown ) 
Nov 16 10:06:36 z2-7 kernel: [1033136.244050] block drbd3: short sent RSWriteAck size=32 sent=11
Nov 16 10:06:36 z2-7 kernel: [1033136.244064] block drbd3: drbd_pp_alloc interrupted!
Nov 16 10:06:36 z2-7 kernel: [1033136.244069] block drbd3: alloc_ee: Allocation of a page failed
Nov 16 10:06:36 z2-7 kernel: [1033136.244074] block drbd3: error receiving RSDataReply, l: 4120!
Nov 16 10:06:36 z2-7 kernel: [1033136.245974] block drbd3: process_done_ee() = NOT_OK
Nov 16 10:06:36 z2-7 kernel: [1033136.246001] block drbd3: asender terminated
Nov 16 10:06:36 z2-7 kernel: [1033136.246006] block drbd3: Terminating asender thread
Nov 16 10:06:36 z2-7 kernel: [1033136.246970] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
Nov 16 10:06:36 z2-7 kernel: [1033136.247018] IP: [<ffffffff8040d3ab>] sk_stream_wait_memory+0x88/0x1e5
Nov 16 10:06:36 z2-7 kernel: [1033136.247051] PGD 41c536067 PUD 41c5b9067 PMD 0 
Nov 16 10:06:36 z2-7 kernel: [1033136.247078] Oops: 0002 [#1] SMP 
Nov 16 10:06:36 z2-7 kernel: [1033136.247103] last sysfs file: /sys/devices/virtual/block/drbd3/removable
Nov 16 10:06:36 z2-7 kernel: [1033136.247132] CPU 0 
Nov 16 10:06:36 z2-7 kernel: [1033136.247152] Modules linked in: hmac nfs lockd fscache nfs_acl auth_rpcgss sunrpc kvm_amd kvm iptable_filter ip_tables x_tables tun bridge stp drbd cn loop snd_pcsp snd_pcm snd_timer i2c_nforce2 snd soundcore snd_page_alloc i2c_core k8temp shpchp pci_hotplug serio_raw evdev psmouse button processor ext3 jbd mbcache dm_mod usbhid hid sd_mod crc_t10dif ata_generic ide_pci_generic ohci_hcd ehci_hcd amd74xx sata_nv ide_core forcedeth libata scsi_mod floppy thermal fan thermal_sys [last unloaded: scsi_wait_scan]
Nov 16 10:06:36 z2-7 kernel: [1033136.247392] Pid: 29255, comm: drbd3_worker Not tainted 2.6.30-2-amd64 #1 H8DMR-82
Nov 16 10:06:36 z2-7 kernel: [1033136.247435] RIP: 0010:[<ffffffff8040d3ab>]  [<ffffffff8040d3ab>] sk_stream_wait_memory+0x88/0x1e5
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] RSP: 0018:ffff88021dda5a40  EFLAGS: 00010246
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] RAX: 0000000000000000 RBX: 00000000000005dc RCX: 000000000000afce
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] RDX: 0000000000000008 RSI: 0000000000000000 RDI: ffffffff804065c6
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] RBP: ffff88041c472380 R08: 0000000000000000 R09: ffff88041c472380
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] R10: ffff88021d1a7114 R11: ffff88021dda5b08 R12: 00000000000005dc
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] R13: 0000000000000000 R14: ffff88021dda5b08 R15: 7fffffffffffffff
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] FS:  00007f0a8edef790(0000) GS:ffffc20000000000(0000) knlGS:0000000000000000
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] CR2: 0000000000000008 CR3: 000000041c5b6000 CR4: 00000000000006e0
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] Process drbd3_worker (pid: 29255, threadinfo ffff88021dda4000, task ffff88021cdb2ab0)
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] Stack:
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  0000000000000000 ffff88021cdb2ab0 ffffffff80254742 ffff88021dda5a58
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  ffff88021dda5a58 0000000000000000 ffff88041d48a8e8 ffff88041c57f6c0
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  ffff88041c472380 ffff88021dda5b14 ffff88021cdb1000 0000000000000000
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] Call Trace:
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffff80254742>] ? autoremove_wake_function+0x0/0x2e
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffff8043efda>] ? tcp_sendmsg+0x6fa/0x85b
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffff80403f24>] ? sock_sendmsg+0xa3/0xbb
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffff8023bc9a>] ? default_wake_function+0x0/0x9
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffff80254742>] ? autoremove_wake_function+0x0/0x2e
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffff8020e5a9>] ? __switch_to+0xae/0x263
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffff80235f65>] ? dequeue_entity+0xf/0x11f
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffff804041f2>] ? kernel_sendmsg+0x2c/0x3e
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffffa024dbb5>] ? drbd_send+0xb9/0x1cf [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffff804b45e8>] ? schedule+0x9/0x1e
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffffa024e4f1>] ? _drbd_send_cmd+0x16f/0x183 [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffffa024e81c>] ? drbd_send_cmd+0x64/0x8d [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffffa024e998>] ? drbd_send_b_ack+0x37/0x40 [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffffa023bdfd>] ? drbd_may_finish_epoch+0x122/0x2f8 [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffffa023c305>] ? w_flush+0x54/0x5d [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffffa02368be>] ? drbd_worker+0x4c6/0x4d3 [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffff804b47df>] ? schedule_timeout+0x9b/0xb6
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffff804b47cf>] ? schedule_timeout+0x8b/0xb6
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffffa024ce5b>] ? drbd_thread_setup+0x16f/0x230 [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffff80210aca>] ? child_rip+0xa/0x20
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffffa024ccec>] ? drbd_thread_setup+0x0/0x230 [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  [<ffffffff80210ac0>] ? child_rip+0x0/0x20
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] Code: f4 ff ba 32 00 00 00 89 d1 31 d2 f7 f1 83 c2 02 41 89 d4 4d 89 e5 49 bf ff ff ff ff ff ff ff 7f 48 8b 85 e8 01 00 00 48 8d 50 08 <f0> 80 48 08 01 48 8b 7d 78 ba 01 00 00 00 48 89 e6 e8 09 75 e4 
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] RIP  [<ffffffff8040d3ab>] sk_stream_wait_memory+0x88/0x1e5
Nov 16 10:06:36 z2-7 kernel: [1033136.250009]  RSP <ffff88021dda5a40>
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] CR2: 0000000000000008
Nov 16 10:06:36 z2-7 kernel: [1033136.254274] ---[ end trace 2ddd1cdd4c0c8cf4 ]---
z2-2:~# gnt-instance info proxy.mekensleep.vm.gnt | grep drbd
    - disk/0: drbd8, size 10.0G
      on primary:   /dev/drbd23 (147:23) in sync, status *DEGRADED*
      on secondary: /dev/drbd3 (147:3) in sync, status *DEGRADED* *MISSING DISK*
z2-2:~# gnt-instance info proxy.mekensleep.vm.gnt  | head
Instance name: proxy.mekensleep.vm.gnt
State: configured to be up, actual state is up
  Nodes:
    - primary: z2-4.host.gnt
    - secondaries: z2-7.host.gnt
z2-2:~# gnt-node info  z2-6.host.gnt 
Node name: z2-6.host.gnt
  primary ip: 10.10.0.6
  secondary ip: 10.10.0.6
  master candidate: True
  drained: False
  offline: False
  primary for no instances
  secondary for no instances

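So the mekensleep VMs now have their primaries on z2-4 and their secondaries on z2-5, except proxy.mekensleep.vm.gnt, whose secondary is on z2-7 after the z2-5 failure. dtv09ut.binbang.vm.gnt has its primary on z2-5 and its secondary on z2-4.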
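Before z2-6 is removed in step 5, `gnt-node info` has to report both `primary for no instances` and `secondary for no instances`, as shown above. Here is a minimal sketch of that check. It assumes the `gnt-node info` output format shown in this log. `node_is_empty` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed only if "gnt-node info" output (on stdin) shows the node is
# neither primary nor secondary for any instance.
node_is_empty() {
    info=$(cat)
    printf '%s\n' "$info" | grep -q 'primary for no instances' &&
        printf '%s\n' "$info" | grep -q 'secondary for no instances'
}
```

Usage: `gnt-node info z2-6.host.gnt | node_is_empty && echo "z2-6 can be removed"`.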

4.1) done: back up z2-6 to rosiers:

[1]+  Done                    nohup rsync --delete -avHz --numeric-ids --exclude='/sys' --exclude='/proc' z2-6.pokersource.info:/ /mnt/z2-6-2009-11-16/ > /home/loic/z2-6.out 2>&1  (wd: ~)

5) done: remove the node:

z2-2:~# gnt-node remove z2-6.host.gnt
Failure: command execution error:
list.remove(x): x not in list
Despite the error, z2-6 no longer shows up in the node list:

z2-2:~# gnt-node list
Node          DTotal  DFree MTotal MNode MFree Pinst Sinst
z2-1.host.gnt   1.3T 649.3G   3.9G  2.2G  1.2G     8    10
z2-2.host.gnt   1.3T   1.1T   3.8G  2.1G  1.1G    10     9
z2-3.host.gnt   1.3T   1.1T   3.9G  2.5G  1.6G     8     8
z2-4.host.gnt   1.3T 569.8G   7.6G  3.4G  4.7G    11    13
z2-5.host.gnt   1.3T   1.2T   3.9G  2.5G  680M    10     6
z2-7.host.gnt 911.0G 375.5G  15.7G  410M 15.5G     1     2

6) done: install the new node following http://trac.dunnewind.net/dunnewind/wiki/GanetiOspfHowto, with hostname z2-6.host.gnt

z2-2:~# gnt-node list
Node          DTotal  DFree MTotal MNode MFree Pinst Sinst
z2-1.host.gnt   1.3T 639.2G   3.9G  2.6G  1.2G     9    10
z2-2.host.gnt   1.3T   1.1T   3.8G  2.3G  944M    10     9
z2-3.host.gnt   1.3T   1.1T   3.9G  2.4G  1.5G     8     9
z2-4.host.gnt   1.3T 569.8G   7.6G  4.2G  4.7G    11    13
z2-5.host.gnt   1.3T   1.2T   3.9G  2.5G  676M    10     6
z2-6.host.gnt   1.8T   1.8T  11.8G  427M 11.5G     0     0
z2-7.host.gnt 911.0G 375.5G  15.7G  663M 15.2G     1     2

7) done: add the node to the cluster:

z2-2:~# gnt-node add z2-6.host.gnt

8) recreate the disk secondaries on z2-6:

z2-2:~# gnt-instance replace-disks -n z2-6.host.gnt wetball.mekensleep.vm.gnt 
z2-2:~# gnt-instance replace-disks -n z2-6.host.gnt hanabi.mekensleep.vm.gnt 
z2-2:~# gnt-instance replace-disks -n z2-6.host.gnt proxy.mekensleep.vm.gnt 
z2-2:~# gnt-instance replace-disks -n z2-6.host.gnt dtv09ut.binbang.vm.gnt
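In step 4, the replace-disks to z2-7 left proxy.mekensleep.vm.gnt with both DRBD devices reported as `*DEGRADED*` after the z2-7 kernel oops. Here is a minimal sketch for checking the recreated secondaries after this step. It assumes the `gnt-instance info` disk lines look like the ones in this log. `degraded_disks` and `check_sync` are hypothetical helper names.

```shell
#!/bin/sh
# Print the lines of "gnt-instance info" output (on stdin) that flag a
# DRBD device as *DEGRADED*; print nothing if there are none.
degraded_disks() {
    grep -F '*DEGRADED*' || true
}

# Report every listed instance that still has a degraded disk and
# return non-zero if at least one was found.
check_sync() {
    rc=0
    for inst in "$@"; do
        bad=$(gnt-instance info "$inst" | degraded_disks)
        if [ -n "$bad" ]; then
            printf '%s:\n%s\n' "$inst" "$bad"
            rc=1
        fi
    done
    return "$rc"
}
```

Once every new secondary on z2-6 is in sync, `check_sync wetball.mekensleep.vm.gnt hanabi.mekensleep.vm.gnt proxy.mekensleep.vm.gnt dtv09ut.binbang.vm.gnt` should print nothing.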

9) move the instances back to z2-6:

z2-2:~# gnt-instance failover wetball.mekensleep.vm.gnt 
z2-2:~# gnt-instance failover hanabi.mekensleep.vm.gnt
z2-2:~# gnt-instance failover proxy.mekensleep.vm.gnt 
z2-2:~# gnt-instance failover dtv09ut.binbang.vm.gnt 
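After the failovers in step 9, each instance's primary should be z2-6 again. Here is a minimal sketch that reads the `- primary:` line of `gnt-instance info`. It assumes the output format shown earlier in this log. `primary_node` and `check_primary` are hypothetical helper names.

```shell
#!/bin/sh
# Print the primary node from "gnt-instance info" output on stdin.
primary_node() {
    sed -n 's/^ *- primary: *//p' | head -n 1
}

# Warn about every listed instance whose primary is not the expected node.
check_primary() {
    expected=$1
    shift
    for inst in "$@"; do
        node=$(gnt-instance info "$inst" | primary_node)
        [ "$node" = "$expected" ] || echo "$inst: primary is $node, expected $expected"
    done
}
```

Usage: `check_primary z2-6.host.gnt wetball.mekensleep.vm.gnt hanabi.mekensleep.vm.gnt proxy.mekensleep.vm.gnt dtv09ut.binbang.vm.gnt` prints nothing when all four are back on z2-6.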

10) migrate the IP back in the OVH interface.
10bis) test the services.