1) done move the mekensleep VMs' secondaries to z2-4 (because there is not enough memory on z2-5):
z2-2:~# gnt-instance replace-disks -n z2-4.host.gnt wetball.mekensleep.vm.gnt
z2-2:~# gnt-instance replace-disks -n z2-4.host.gnt hanabi.mekensleep.vm.gnt
z2-2:~# gnt-instance replace-disks -n z2-4.host.gnt proxy.mekensleep.vm.gnt
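The three commands above differ only in the instance name, so they can be generated by a loop. This is a minimal sketch, not the procedure that was actually run: the helper name `move_secondaries` and the `DRY_RUN` variable are assumptions made for illustration and are not part of ganeti. By default it only prints the commands so they can be reviewed before being run on the master node.

```shell
#!/bin/sh
# Hypothetical helper (not part of ganeti): move the secondary of each
# listed instance to the given node, one instance at a time.
# With DRY_RUN=1 (the default) the commands are only printed.
move_secondaries() {
    new_node=$1
    shift
    for instance in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "gnt-instance replace-disks -n $new_node $instance"
        else
            # Stop at the first failure so a broken replace is noticed.
            gnt-instance replace-disks -n "$new_node" "$instance" || return 1
        fi
    done
}

move_secondaries z2-4.host.gnt \
    wetball.mekensleep.vm.gnt \
    hanabi.mekensleep.vm.gnt \
    proxy.mekensleep.vm.gnt
```

Setting DRY_RUN=0 before the call would execute the commands instead of printing them.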
2) done fail over the VMs running on z2-6
z2-2:~# gnt-instance failover wetball.mekensleep.vm.gnt
z2-2:~# gnt-instance failover hanabi.mekensleep.vm.gnt
z2-2:~# gnt-instance failover proxy.mekensleep.vm.gnt
z2-2:~# gnt-instance failover dtv09ut.binbang.vm.gnt
z2-2:~# gnt-instance shutdown dtv09ut.binbang.vm.gnt
VMs that will be down during the failover:
- dtv09ut.binbang.vm.gnt
- hanabi.mekensleep.vm.gnt
- proxy.mekensleep.vm.gnt
- wetball.mekensleep.vm.gnt
3) done switch the mekensleep IP ( 91.121.57.196 ) from z2-6 to z2-4 on the ovh interface.
3bis) check that the mekensleep services are up
- Restart the VMs to trigger the ganeti hook which adds the local route on the new host (z2-4)
z2-2:~# gnt-instance shutdown proxy.mekensleep.vm.gnt
z2-2:~# gnt-instance start proxy.mekensleep.vm.gnt
z2-2:~# gnt-instance shutdown wetball.mekensleep.vm.gnt
z2-2:~# gnt-instance start wetball.mekensleep.vm.gnt
z2-2:~# gnt-instance shutdown hanabi.mekensleep.vm.gnt
z2-2:~# gnt-instance start hanabi.mekensleep.vm.gnt
- Delete the old local routes on the previous host (z2-6)
z2-6:~# host wetball.mekensleep.vm.gnt
wetball.mekensleep.vm.gnt has address 10.10.1.37
z2-6:~# ip route | grep 10.10.1.37
10.10.1.37 dev br0 scope link
z2-6:~# ip route del 10.10.1.37
z2-6:~# ip route | grep 10.10.1.37
10.10.1.37 via 10.1.4.6 dev tun4 proto zebra metric 20
z2-6:~# host hanabi.mekensleep.vm.gnt
hanabi.mekensleep.vm.gnt has address 10.10.1.45
z2-6:~# ip route | grep 10.10.1.45
10.10.1.45 dev br0 scope link
z2-6:~# ip route del 10.10.1.45
z2-6:~# ip route | grep 10.10.1.45
10.10.1.45 via 10.1.4.6 dev tun4 proto zebra metric 20
z2-6:~# host proxy.mekensleep.vm.gnt
proxy.mekensleep.vm.gnt has address 10.10.1.44
z2-6:~# ip route | grep 10.10.1.44
10.10.1.44 dev br0 scope link
z2-6:~# ip route del 10.10.1.44
z2-6:~# ip route | grep 10.10.1.44
10.10.1.44 via 10.1.4.6 dev tun4 proto zebra metric 20
z2-6:~# host dtv09ut.binbang.vm.gnt
dtv09ut.binbang.vm.gnt has address 10.10.1.38
z2-6:~# ip route | grep 10.10.1.38
10.10.1.38 dev br0 scope link
z2-6:~# ip route del 10.10.1.38
z2-6:~# ip route | grep 10.10.1.38
z2-6:~#
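The same resolve, check, delete sequence is repeated for each instance above. It could be wrapped as follows. This is only a sketch under assumptions: the function names are hypothetical, the bridge is assumed to be br0 as in the transcript, and the parsing relies on the `host` and `ip route` output formats shown above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of ganeti or iproute2).

# Extract the address from `host NAME` output read on stdin,
# e.g. "wetball.mekensleep.vm.gnt has address 10.10.1.37".
address_of() {
    awk '/ has address / { print $NF; exit }'
}

# Succeed if the `ip route` output on stdin still holds a local bridge
# route ("<ip> dev br0 ...") for the given address.
has_local_route() {
    awk -v a="$1" '$1 == a && $2 == "dev" && $3 == "br0" { found = 1 }
                   END { exit !found }'
}

# Delete the stale local route of one instance; do nothing when only the
# route learned through zebra is left.
drop_stale_route() {
    addr=$(host "$1" | address_of)
    [ -n "$addr" ] || return 1
    if ip route | has_local_route "$addr"; then
        ip route del "$addr"
    fi
}
```

On the previous host, `drop_stale_route wetball.mekensleep.vm.gnt` would then replace the four manual commands for that instance.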
4) done move the secondaries from z2-6 to z2-5 (this will work because we don't start the instances there)
z2-2:~# gnt-instance replace-disks -n z2-5.host.gnt wetball.mekensleep.vm.gnt
z2-2:~# gnt-instance replace-disks -n z2-5.host.gnt hanabi.mekensleep.vm.gnt
z2-2:~# gnt-instance replace-disks -n z2-4.host.gnt dtv09ut.binbang.vm.gnt
z2-2:~# gnt-instance replace-disks -n z2-5.host.gnt pokme.vm.gnt
z2-2:~# gnt-instance replace-disks -n z2-5.host.gnt mediagateusa.binbang.vm.gnt
z2-2:~# gnt-instance replace-disks -n z2-5.host.gnt proxy.mekensleep.vm.gnt
Sat Nov 14 16:03:55 2009 STEP 1/6 check device existence
Sat Nov 14 16:03:55 2009 - INFO: checking volume groups
Sat Nov 14 16:03:56 2009 - INFO: checking disk/0 on z2-4.host.gnt
Sat Nov 14 16:03:56 2009 STEP 2/6 check peer consistency
Sat Nov 14 16:03:56 2009 - INFO: checking disk/0 consistency on z2-4.host.gnt
Sat Nov 14 16:03:56 2009 STEP 3/6 allocate new storage
Sat Nov 14 16:03:56 2009 - INFO: adding new local storage on z2-5.host.gnt for disk/0
Sat Nov 14 16:03:56 2009 STEP 4/6 changing drbd configuration
Sat Nov 14 16:03:56 2009 - INFO: activating a new drbd on z2-5.host.gnt for disk/0
Failure: command execution error: Can't create block device <DRBD8(hosts=z2-4.host.gnt/23-z2-5.host.gnt/14, port=None, configured as 10.10.0.5:None 10.10.0.4:None, backend=<LogicalVolume(/dev/all/0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_data, not visible, size=10240m)>, metadev=<LogicalVolume(/dev/all/0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_meta, not visible, size=128m)>, not visible, size=10240m)> on node z2-5.host.gnt for instance proxy.mekensleep.vm.gnt: Can't assemble device after creation, very unusual event: drbd14: can't attach local disk: /dev/drbd14: Failure: (114) Lower device is already claimed. This usually means it is mounted.
z2-5:~# lvremove /dev/all/0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_data
Do you really want to remove active logical volume 0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_data? [y/n]: y
Logical volume "0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_data" successfully removed
z2-5:~# lvremove /dev/all/0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_meta
Do you really want to remove active logical volume 0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_meta? [y/n]: y
Logical volume "0be159e5-85c2-4e7a-8b12-b25286148cf9.disk0_meta" successfully removed
z2-5:~# cat /proc/drbd | grep Unconfigured
12: cs:Unconfigured
13: cs:Unconfigured
14: cs:Unconfigured
15: cs:Unconfigured
z2-5:~# drbdsetup /dev/drbd12 down
z2-5:~# drbdsetup /dev/drbd13 down
z2-5:~# drbdsetup /dev/drbd14 down
z2-5:~# drbdsetup /dev/drbd15 down
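The cleanup of leftover minors above can be derived directly from /proc/drbd. This is a sketch under assumptions: the helper name `unconfigured_down_cmds` is hypothetical, and the parsing assumes the "N: cs:Unconfigured" line format shown in the transcript. It only prints the `drbdsetup` commands so they can be reviewed before anything is torn down.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/drbd-style text on stdin and print one
# "drbdsetup /dev/drbdN down" command per minor in state Unconfigured.
unconfigured_down_cmds() {
    awk '$2 == "cs:Unconfigured" {
        sub(/:$/, "", $1)            # "12:" -> "12"
        print "drbdsetup /dev/drbd" $1 " down"
    }'
}

# Print the commands for this host, if DRBD is loaded here.
if [ -r /proc/drbd ]; then
    unconfigured_down_cmds < /proc/drbd
fi
```

Once the list looks right, piping the output to `sh` would run the commands.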
Filed: http://code.google.com/p/ganeti/issues/detail?id=78
z2-2:~# gnt-node info z2-6.host.gnt
Node name: z2-6.host.gnt
primary ip: 10.10.0.6
secondary ip: 10.10.0.6
master candidate: True
drained: False
offline: False
primary for no instances
secondary for instances:
- proxy.mekensleep.vm.gnt
z2-2:~# gnt-instance replace-disks -n z2-7.host.gnt proxy.mekensleep.vm.gnt
Mon Nov 16 15:47:35 2009 STEP 1/6 check device existence
Mon Nov 16 15:47:35 2009 - INFO: checking volume groups
Mon Nov 16 15:47:36 2009 - INFO: checking disk/0 on z2-4.host.gnt
Mon Nov 16 15:47:36 2009 STEP 2/6 check peer consistency
Mon Nov 16 15:47:36 2009 - INFO: checking disk/0 consistency on z2-4.host.gnt
Mon Nov 16 15:47:36 2009 STEP 3/6 allocate new storage
Mon Nov 16 15:47:36 2009 - INFO: adding new local storage on z2-7.host.gnt for disk/0
Mon Nov 16 15:47:36 2009 STEP 4/6 changing drbd configuration
Mon Nov 16 15:47:36 2009 - INFO: activating a new drbd on z2-7.host.gnt for disk/0
Mon Nov 16 15:47:36 2009 - INFO: shutting down drbd for disk/0 on old node
Mon Nov 16 15:47:37 2009 - INFO: detaching primary drbds from the network (=> standalone)
Mon Nov 16 15:47:37 2009 - INFO: updating instance configuration
Mon Nov 16 15:47:37 2009 - INFO: attaching primary drbds to new secondary (standalone => connected)
Mon Nov 16 15:47:37 2009 STEP 5/6 sync devices
Mon Nov 16 15:47:37 2009 - INFO: Waiting for instance proxy.mekensleep.vm.gnt to sync disks.
Mon Nov 16 15:47:37 2009 - INFO: - device disk/0: 0.00% done, no time estimate
Mon Nov 16 15:47:37 2009 - INFO: - device disk/0: 0.00% done, no time estimate
Mon Nov 16 15:47:38 2009 - INFO: - device disk/0: 0.00% done, no time estimate
Mon Nov 16 15:47:38 2009 - INFO: - device disk/0: 0.10% done, 26213 estimated seconds remaining
Mon Nov 16 15:48:38 2009 - INFO: - device disk/0: 0.80% done, 13024 estimated seconds remaining
Mon Nov 16 15:49:38 2009 - INFO: - device disk/0: 1.40% done, 116475 estimated seconds remaining
Mon Nov 16 15:50:38 2009 - INFO: - device disk/0: 0.70% done, 1659 estimated seconds remaining
Mon Nov 16 15:51:38 2009 - INFO: - device disk/0: 1.40% done, 4122 estimated seconds remaining
Mon Nov 16 15:52:39 2009 - INFO: - device disk/0: 2.00% done, 1058 estimated seconds remaining
Timeout while talking to the master daemon. Error:
Nov 16 10:06:36 z2-7 kernel: [1033136.244033] block drbd3: peer( Primary -> Unknown ) conn( SyncTarget -> Timeout ) pdsk( UpToDate -> DUnknown )
Nov 16 10:06:36 z2-7 kernel: [1033136.244050] block drbd3: short sent RSWriteAck size=32 sent=11
Nov 16 10:06:36 z2-7 kernel: [1033136.244064] block drbd3: drbd_pp_alloc interrupted!
Nov 16 10:06:36 z2-7 kernel: [1033136.244069] block drbd3: alloc_ee: Allocation of a page failed
Nov 16 10:06:36 z2-7 kernel: [1033136.244074] block drbd3: error receiving RSDataReply, l: 4120!
Nov 16 10:06:36 z2-7 kernel: [1033136.245974] block drbd3: process_done_ee() = NOT_OK
Nov 16 10:06:36 z2-7 kernel: [1033136.246001] block drbd3: asender terminated
Nov 16 10:06:36 z2-7 kernel: [1033136.246006] block drbd3: Terminating asender thread
Nov 16 10:06:36 z2-7 kernel: [1033136.246970] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
Nov 16 10:06:36 z2-7 kernel: [1033136.247018] IP: [<ffffffff8040d3ab>] sk_stream_wait_memory+0x88/0x1e5
Nov 16 10:06:36 z2-7 kernel: [1033136.247051] PGD 41c536067 PUD 41c5b9067 PMD 0
Nov 16 10:06:36 z2-7 kernel: [1033136.247078] Oops: 0002 [#1] SMP
Nov 16 10:06:36 z2-7 kernel: [1033136.247103] last sysfs file: /sys/devices/virtual/block/drbd3/removable
Nov 16 10:06:36 z2-7 kernel: [1033136.247132] CPU 0
Nov 16 10:06:36 z2-7 kernel: [1033136.247152] Modules linked in: hmac nfs lockd fscache nfs_acl auth_rpcgss sunrpc kvm_amd kvm iptable_filter ip_tables x_tables tun bridge stp drbd cn loop snd_pcsp snd_pcm snd_timer i2c_nforce2 snd soundcore snd_page_alloc i2c_core k8temp shpchp pci_hotplug serio_raw evdev psmouse button processor ext3 jbd mbcache dm_mod usbhid hid sd_mod crc_t10dif ata_generic ide_pci_generic ohci_hcd ehci_hcd amd74xx sata_nv ide_core forcedeth libata scsi_mod floppy thermal fan thermal_sys [last unloaded: scsi_wait_scan]
Nov 16 10:06:36 z2-7 kernel: [1033136.247392] Pid: 29255, comm: drbd3_worker Not tainted 2.6.30-2-amd64 #1 H8DMR-82
Nov 16 10:06:36 z2-7 kernel: [1033136.247435] RIP: 0010:[<ffffffff8040d3ab>] [<ffffffff8040d3ab>] sk_stream_wait_memory+0x88/0x1e5
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] RSP: 0018:ffff88021dda5a40 EFLAGS: 00010246
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] RAX: 0000000000000000 RBX: 00000000000005dc RCX: 000000000000afce
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] RDX: 0000000000000008 RSI: 0000000000000000 RDI: ffffffff804065c6
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] RBP: ffff88041c472380 R08: 0000000000000000 R09: ffff88041c472380
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] R10: ffff88021d1a7114 R11: ffff88021dda5b08 R12: 00000000000005dc
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] R13: 0000000000000000 R14: ffff88021dda5b08 R15: 7fffffffffffffff
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] FS: 00007f0a8edef790(0000) GS:ffffc20000000000(0000) knlGS:0000000000000000
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] CR2: 0000000000000008 CR3: 000000041c5b6000 CR4: 00000000000006e0
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] Process drbd3_worker (pid: 29255, threadinfo ffff88021dda4000, task ffff88021cdb2ab0)
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] Stack:
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] 0000000000000000 ffff88021cdb2ab0 ffffffff80254742 ffff88021dda5a58
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] ffff88021dda5a58 0000000000000000 ffff88041d48a8e8 ffff88041c57f6c0
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] ffff88041c472380 ffff88021dda5b14 ffff88021cdb1000 0000000000000000
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] Call Trace:
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffff80254742>] ? autoremove_wake_function+0x0/0x2e
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffff8043efda>] ? tcp_sendmsg+0x6fa/0x85b
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffff80403f24>] ? sock_sendmsg+0xa3/0xbb
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffff8023bc9a>] ? default_wake_function+0x0/0x9
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffff80254742>] ? autoremove_wake_function+0x0/0x2e
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffff8020e5a9>] ? __switch_to+0xae/0x263
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffff80235f65>] ? dequeue_entity+0xf/0x11f
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffff804041f2>] ? kernel_sendmsg+0x2c/0x3e
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffffa024dbb5>] ? drbd_send+0xb9/0x1cf [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffff804b45e8>] ? schedule+0x9/0x1e
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffffa024e4f1>] ? _drbd_send_cmd+0x16f/0x183 [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffffa024e81c>] ? drbd_send_cmd+0x64/0x8d [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffffa024e998>] ? drbd_send_b_ack+0x37/0x40 [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffffa023bdfd>] ? drbd_may_finish_epoch+0x122/0x2f8 [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffffa023c305>] ? w_flush+0x54/0x5d [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffffa02368be>] ? drbd_worker+0x4c6/0x4d3 [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffff804b47df>] ? schedule_timeout+0x9b/0xb6
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffff804b47cf>] ? schedule_timeout+0x8b/0xb6
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffffa024ce5b>] ? drbd_thread_setup+0x16f/0x230 [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffff80210aca>] ? child_rip+0xa/0x20
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffffa024ccec>] ? drbd_thread_setup+0x0/0x230 [drbd]
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] [<ffffffff80210ac0>] ? child_rip+0x0/0x20
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] Code: f4 ff ba 32 00 00 00 89 d1 31 d2 f7 f1 83 c2 02 41 89 d4 4d 89 e5 49 bf ff ff ff ff ff ff ff 7f 48 8b 85 e8 01 00 00 48 8d 50 08 <f0> 80 48 08 01 48 8b 7d 78 ba 01 00 00 00 48 89 e6 e8 09 75 e4
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] RIP [<ffffffff8040d3ab>] sk_stream_wait_memory+0x88/0x1e5
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] RSP <ffff88021dda5a40>
Nov 16 10:06:36 z2-7 kernel: [1033136.250009] CR2: 0000000000000008
Nov 16 10:06:36 z2-7 kernel: [1033136.254274] ---[ end trace 2ddd1cdd4c0c8cf4 ]---
z2-2:~# gnt-instance info proxy.mekensleep.vm.gnt | grep drbd
- disk/0: drbd8, size 10.0G
on primary: /dev/drbd23 (147:23) in sync, status *DEGRADED*
on secondary: /dev/drbd3 (147:3) in sync, status *DEGRADED* *MISSING DISK*
z2-2:~# gnt-instance info proxy.mekensleep.vm.gnt | head
Instance name: proxy.mekensleep.vm.gnt
State: configured to be up, actual state is up
Nodes:
- primary: z2-4.host.gnt
- secondaries: z2-7.host.gnt
z2-2:~# gnt-node info z2-6.host.gnt
Node name: z2-6.host.gnt
primary ip: 10.10.0.6
secondary ip: 10.10.0.6
master candidate: True
drained: False
offline: False
primary for no instances
secondary for no instances
So now the mekensleep VMs have their primaries on z2-4 and their secondaries on z2-5 (except proxy, whose secondary is on z2-7 as shown above), and dtv09ut.binbang.vm.gnt has z2-5 as primary and z2-4 as secondary.
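Before removing the node, the check done by eye on the `gnt-node info` output above can be scripted. This is a sketch under assumptions: `node_is_empty` is a hypothetical helper, and it relies on the exact "primary for no instances" / "secondary for no instances" wording printed by this ganeti version.

```shell
#!/bin/sh
# Hypothetical helper: read `gnt-node info NODE` output on stdin and
# succeed only if the node is neither primary nor secondary for any
# instance.
node_is_empty() {
    awk '/primary for no instances/   { p = 1 }
         /secondary for no instances/ { s = 1 }
         END { exit !(p && s) }'
}
```

It could be used as a guard, for example `gnt-node info z2-6.host.gnt | node_is_empty && echo "z2-6 can be removed"`.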
4.1) done back up z2-6 to rosiers:
[1]+ Done nohup rsync --delete -avHz --numeric-ids --exclude='/sys' --exclude='/proc' z2-6.pokersource.info:/ /mnt/z2-6-2009-11-16/ > /home/loic/z2-6.out 2>&1 (wd: ~)
5) done remove the node :
z2-2:~# gnt-node remove z2-6.host.gnt
Failure: command execution error: list.remove(x): x not in list
z2-2:~# gnt-node list
Node          DTotal  DFree MTotal MNode MFree Pinst Sinst
z2-1.host.gnt   1.3T 649.3G   3.9G  2.2G  1.2G     8    10
z2-2.host.gnt   1.3T   1.1T   3.8G  2.1G  1.1G    10     9
z2-3.host.gnt   1.3T   1.1T   3.9G  2.5G  1.6G     8     8
z2-4.host.gnt   1.3T 569.8G   7.6G  3.4G  4.7G    11    13
z2-5.host.gnt   1.3T   1.2T   3.9G  2.5G  680M    10     6
z2-7.host.gnt 911.0G 375.5G  15.7G  410M 15.5G     1     2
6) done install the new node following http://trac.dunnewind.net/dunnewind/wiki/GanetiOspfHowto with hostname z2-6.host.gnt
z2-2:~# gnt-node list
Node          DTotal  DFree MTotal MNode MFree Pinst Sinst
z2-1.host.gnt   1.3T 639.2G   3.9G  2.6G  1.2G     9    10
z2-2.host.gnt   1.3T   1.1T   3.8G  2.3G  944M    10     9
z2-3.host.gnt   1.3T   1.1T   3.9G  2.4G  1.5G     8     9
z2-4.host.gnt   1.3T 569.8G   7.6G  4.2G  4.7G    11    13
z2-5.host.gnt   1.3T   1.2T   3.9G  2.5G  676M    10     6
z2-6.host.gnt   1.8T   1.8T  11.8G  427M 11.5G     0     0
z2-7.host.gnt 911.0G 375.5G  15.7G  663M 15.2G     1     2
7) done add the node to the cluster :
z2-2:~# gnt-node add z2-6.host.gnt
8) recreate the disks' secondaries on z2-6 :
z2-2:~# gnt-instance replace-disks -n z2-6.host.gnt wetball.mekensleep.vm.gnt
z2-2:~# gnt-instance replace-disks -n z2-6.host.gnt hanabi.mekensleep.vm.gnt
z2-2:~# gnt-instance replace-disks -n z2-6.host.gnt proxy.mekensleep.vm.gnt
z2-2:~# gnt-instance replace-disks -n z2-6.host.gnt dtv09ut.binbang.vm.gnt
9) move the instances back to z2-6 :
z2-2:~# gnt-instance failover wetball.mekensleep.vm.gnt
z2-2:~# gnt-instance failover hanabi.mekensleep.vm.gnt
z2-2:~# gnt-instance failover proxy.mekensleep.vm.gnt
z2-2:~# gnt-instance failover dtv09ut.binbang.vm.gnt
10) migrate the IP back on the ovh interface
10bis) test services