Today, I ran into an interesting problem: I tried to create a primary-primary (dual-primary) DRBD cluster on Proxmox.
First, we need a fully configured Proxmox two-node cluster, like this:
https://pve.proxmox.com/wiki/Proxmox_VE_4.x_Cluster
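Before touching DRBD, it is worth checking that the cluster itself is healthy; a quick check, run on either node:

root@cl3-amd-node1:~# pvecm status
root@cl3-amd-node1:~# pvecm nodes

With both nodes online, the cluster should be quorate and both nodes listed.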
We need a correct /etc/hosts configuration, so the node names resolve to IP addresses:
root@cl3-amd-node1:/etc/drbd.d# cat /etc/hosts
127.0.0.1       localhost.localdomain localhost
192.168.1.104   cl3-amd-node1 pvelocalhost
192.168.1.108   cl3-amd-node2

root@cl3-amd-node2:/etc/drbd.d# cat /etc/hosts
127.0.0.1       localhost.localdomain localhost
192.168.1.104   cl3-amd-node1
192.168.1.108   cl3-amd-node2 pvelocalhost
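A quick way to verify that the names resolve to the addresses above (a simple sanity check, run on both nodes):

root@cl3-amd-node1:~# getent hosts cl3-amd-node1 cl3-amd-node2
root@cl3-amd-node1:~# ping -c 1 cl3-amd-node2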
One server was built on a hardware RAID PCI-E LSI 9240-4i (/dev/sdb), the second on a software RAID via mdadm (/dev/md1), both on Debian Jessie with the Proxmox packages installed. So the backend for the DRBD device was a hardware RAID on one side and a software RAID on the other. We must create two partitions with the same size (in sectors):
root@cl3-amd-node1: fdisk -l /dev/sdb
Disk /dev/sdb: 1.8 TiB, 1998998994944 bytes, 3904294912 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes

Device     Boot Start        End    Sectors   Size Id Type
/dev/sdb1        2048 1953260927 1953258880 931.4G 83 Linux

root@cl3-amd-node2: fdisk -l /dev/md1
Disk /dev/md1: 931.4 GiB, 1000069595136 bytes, 1953260928 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes

Device     Boot Start        End    Sectors   Size Id Type
/dev/md1p1       2048 1953260927 1953258880 931.4G 83 Linux
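If the partitions do not exist yet, one way to get an identical sector count on both backing devices is to script it with sfdisk; a minimal sketch, assuming the 1953258880-sector size from the listings above:

# on cl3-amd-node1 (hardware RAID)
echo 'start=2048, size=1953258880, type=83' | sfdisk /dev/sdb
# on cl3-amd-node2 (software RAID)
echo 'start=2048, size=1953258880, type=83' | sfdisk /dev/md1
# verify that both partitions report the same number of 512-byte sectors
blockdev --getsz /dev/sdb1    # on cl3-amd-node1
blockdev --getsz /dev/md1p1   # on cl3-amd-node2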
Now we need a direct network link between the servers for the DRBD traffic, which will be heavy. I use a bond of two gigabit network cards:
#cl3-amd-node1: cat /etc/network/interfaces
auto bond0
iface bond0 inet static
        address 192.168.5.104
        netmask 255.255.255.0
        slaves eth2 eth1
        bond_miimon 100
        bond_mode balance-rr

#cl3-amd-node2: cat /etc/network/interfaces
auto bond0
iface bond0 inet static
        address 192.168.5.108
        netmask 255.255.255.0
        slaves eth1 eth2
        bond_miimon 100
        bond_mode balance-rr
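Once the bond is up, its mode and slave state can be checked through the bonding driver (the exact output depends on the kernel version):

root@cl3-amd-node1:~# cat /proc/net/bonding/bond0 | grep -E 'Bonding Mode|Slave Interface|MII Status'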
And we can test the speed of this network with the iperf package:
apt-get install iperf
We start an iperf server instance on one node with this command:
#cl3-amd-node2
iperf -s -p 888
And from the other, we connect to this instance for 20 seconds:
#cl3-amd-node1
iperf -c 192.168.5.108 -p 888 -t 20
#and the conclusion
------------------------------------------------------------
Client connecting to 192.168.5.108, TCP port 888
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.5.104 port 49536 connected with 192.168.5.108 port 888
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-20.0 sec  4.39 GBytes  1.88 Gbits/sec
So we can see that the bond of two network cards gives a resulting speed of almost 2 Gbit/s.
Now, we can continue with installing and setting up the drbd resource.
apt-get install drbd-utils drbdmanage
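A quick sanity check that the userland tools and the kernel module are in place (the module is normally loaded by the drbd service anyway, so loading it by hand is optional):

root@cl3-amd-node1:~# drbdadm --version
root@cl3-amd-node1:~# modprobe drbd && cat /proc/drbd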
All aspects of DRBD are controlled in its configuration file, /etc/drbd.conf. Normally, this configuration file is just a skeleton with the following contents:
include "/etc/drbd.d/global_common.conf";
include "/etc/drbd.d/*.res";
The simplest configuration is:
cat /etc/drbd.d/global_common.conf
global {
        usage-count yes;
}
common {
        net {
                protocol C;
        }
}
And the configuration of the resource itself. It must be the same on both nodes:
root@cl3-amd-node1:/etc/drbd.d# cat /etc/drbd.d/r0.res
resource r0 {
        disk {
                c-plan-ahead 15;
                c-fill-target 24M;
                c-min-rate 90M;
                c-max-rate 150M;
        }
        net {
                protocol C;
                allow-two-primaries yes;
                data-integrity-alg md5;
                verify-alg md5;
        }
        on cl3-amd-node1 {
                device /dev/drbd0;
                disk /dev/sdb1;
                address 192.168.5.104:7789;
                meta-disk internal;
        }
        on cl3-amd-node2 {
                device /dev/drbd0;
                disk /dev/md1p1;
                address 192.168.5.108:7789;
                meta-disk internal;
        }
}

The file /etc/drbd.d/r0.res on cl3-amd-node2 is identical.
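To make sure both nodes really parse the same configuration, something like this can be used (a small sketch, assuming passwordless ssh between the nodes, which the Proxmox cluster already provides):

root@cl3-amd-node1:~# diff <(drbdadm dump r0) <(ssh cl3-amd-node2 drbdadm dump r0)

No output means the resource definitions match.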
Now we must create and initialize the DRBD metadata on the backing devices, on both nodes:
drbdadm create-md r0
#answer yes to destroy possible data on devices
Now, we can start the drbd service, on both nodes:
root@cl3-amd-node2:/etc/drbd.d# /etc/init.d/drbd start
[ ok ] Starting drbd (via systemctl): drbd.service.
root@cl3-amd-node1:/etc/drbd.d# /etc/init.d/drbd start
[ ok ] Starting drbd (via systemctl): drbd.service.
Or we can bring the resource up manually on both nodes:
drbdadm up r0
And we can see that the device is Inconsistent and both nodes are Secondary:
root@cl3-amd-node1:~# drbdadm status
r0 role:Secondary
  disk:Inconsistent
  cl3-amd-node2 role:Secondary
    peer-disk:Inconsistent
Start the initial full synchronization. This step must be performed on only one node, only on initial resource configuration, and only on the node you selected as the synchronization source. To perform this step, issue this command:
root@cl3-amd-node1:# drbdadm primary --force r0
And we can see the status of our drbd storage:
root@cl3-amd-node2:~# drbdadm status
r0 role:Secondary
  disk:Inconsistent
  cl3-amd-node1 role:Primary
    replication:SyncTarget peer-disk:UpToDate done:3.10
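To follow the progress of the initial sync, a simple watch is enough (just a convenience, not a required step):

root@cl3-amd-node2:~# watch -n5 drbdadm status r0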
After the synchronization successfully finishes, we promote our secondary server to primary as well:
root@cl3-amd-node2:~# drbdadm status
r0 role:Secondary
  disk:UpToDate
  cl3-amd-node1 role:Primary
    peer-disk:UpToDate
root@cl3-amd-node2:~# drbdadm primary r0
And we can see the status of this dual-primary (primary-primary) DRBD storage resource:
root@cl3-amd-node2:~# drbdadm status
r0 role:Primary
  disk:UpToDate
  cl3-amd-node1 role:Primary
    peer-disk:UpToDate
Now we have a new block device on both servers:
root@cl3-amd-node2:~# fdisk -l /dev/drbd0
Disk /dev/drbd0: 931.4 GiB, 1000037986304 bytes, 1953199192 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
We can configure this DRBD block device as a physical volume for LVM, so LVM sits on top of DRBD and we can continue as if it were a physical disk. Do it only on one server; the change will be reflected on the second server thanks to the dual-primary DRBD device:
pvcreate /dev/drbd0
  Physical volume "/dev/drbd0" successfully created
As we can see, we must adapt /etc/lvm/lvm.conf to our needs, because LVM scans all block devices by default and we can find duplicate PV entries:
root@cl3-amd-node2:~# pvs
  Found duplicate PV WXwDGteoexfmLxN6GQvt6Nd3jJxgvT2z: using /dev/drbd0 not /dev/md1p1
  Found duplicate PV WXwDGteoexfmLxN6GQvt6Nd3jJxgvT2z: using /dev/md1p1 not /dev/drbd0
  Found duplicate PV WXwDGteoexfmLxN6GQvt6Nd3jJxgvT2z: using /dev/drbd0 not /dev/md1p1
  PV         VG  Fmt  Attr PSize   PFree
  /dev/drbd0     lvm2 ---  931.36g 931.36g
  /dev/md0   pve lvm2 a--  931.38g       0
So we must edit the filter option in this configuration. Look at our resource configuration r0.res: we must exclude our backend devices (/dev/sdb1 on one server and /dev/md1p1 on the other), or we can reject all devices and allow only specific ones. I prefer to reject everything and allow only what we want, so edit the filter variable:
root@cl3-amd-node1:~# cat /etc/lvm/lvm.conf | grep drbd
    filter = [ "a|/dev/drbd0|", "a|/dev/sda3|", "r|.*|" ]

root@cl3-amd-node2:~# cat /etc/lvm/lvm.conf | grep drbd
    filter = [ "a|/dev/drbd0|", "a|/dev/md0|", "r|.*|" ]
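Because the initramfs usually carries its own copy of lvm.conf, it may be worth regenerating it after this change so the filter also applies during early boot (a suggestion; do it on both nodes):

update-initramfs -u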
Now we no longer see the duplicates and we can create a volume group, only on one server:
root@cl3-amd-node2:~# vgcreate drbd0-vg /dev/drbd0
  Volume group "drbd0-vg" successfully created
...
root@cl3-amd-node2:~# pvs
  PV         VG       Fmt  Attr PSize   PFree
  /dev/drbd0 drbd0-vg lvm2 a--  931.36g 931.36g
  /dev/md0   pve      lvm2 a--  931.38g       0
And finally we add the LVM volume group to Proxmox. It can be done via the web interface: go to Datacenter, click on Storage and add an LVM storage.
Then choose an ID (this is the name of your storage and it cannot be changed later, for example drbd0-vg); next you will see the previously created volume group drbd0-vg, so select it and enable sharing by ticking the 'Shared' box.
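The same storage definition can also be added from the command line with pvesm; a sketch, assuming you keep the storage ID drbd0-vg:

root@cl3-amd-node1:~# pvesm add lvm drbd0-vg --vgname drbd0-vg --content images --shared 1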
Now we can create virtual machines on this LVM storage and, thanks to DRBD, migrate them from one server to the other without downtime, because there is one shared storage. When the migration starts, the machine is started on the other server, the content of its RAM is migrated through an SSH tunnel, and after a few seconds it is running there.
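From the command line, such an online migration would look roughly like this (VM ID 100 is only an example):

root@cl3-amd-node1:~# qm migrate 100 cl3-amd-node2 --online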
Sometimes, after the replication network disconnects and reconnects, a split-brain is detected. If this happens, don't panic. Both servers are marked as StandAlone and the DRBD storage starts to diverge: from that moment, different writes land on each side. We must pick one of the servers as the victim, because one of them has the "right" data and the other has the "wrong" data. The only way out is to back up the running virtual machines on the victim, then destroy/discard its data on the DRBD storage and synchronize it from the other server, which has the "right" data. When this happens, you will see it in the logs:
root@cl3-amd-node1:~# dmesg | grep -i brain
[499210.096185] drbd r0/0 drbd0 cl3-amd-node1: helper command: /sbin/drbdadm initial-split-brain
[499210.097306] drbd r0/0 drbd0 cl3-amd-node1: helper command: /sbin/drbdadm initial-split-brain exit code 0 (0x0)
[499210.097313] drbd r0/0 drbd0: Split-Brain detected but unresolved, dropping connection!
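You can confirm that the nodes dropped the replication connection by checking the connection state on both of them; it will typically report StandAlone:

root@cl3-amd-node1:~# drbdadm cstate r0
StandAlone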
We must solve this problem manually. I chose cl3-amd-node1 as the victim. We must demote this node to secondary:
drbdadm secondary r0
And now we must disconnect it (if it is not already StandAlone) and connect it back, marking its data to be discarded:
root@cl3-amd-node1:~# drbdadm connect --discard-my-data r0
And after the synchronization, promote it back to primary:
root@cl3-amd-node1:~# drbdadm primary r0
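If the surviving node (cl3-amd-node2 in my case) also shows StandAlone, it needs a plain reconnect so the victim has something to sync from. A short recap of the whole recovery, under that assumption:

# on the victim (cl3-amd-node1)
drbdadm secondary r0
drbdadm connect --discard-my-data r0
# on the survivor (cl3-amd-node2), only if it is StandAlone as well
drbdadm connect r0
# on the victim, after the resynchronization has finished
drbdadm primary r0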
And in log, we can see:
cl3-amd-node1 kernel: [246882.068518] drbd r0/0 drbd0: Split-Brain detected, manually solved. Sync from peer node
Have fun.