DRBD Dual-Primary Obstacles

I have DRBD installed and configured on two servers, and everything works as expected, but disk I/O read/write speed is terrible and network latency is through the roof. My shared partition is 35 GB; a full sync between the two servers takes days, and copying a 100 MB directory takes 15 minutes. I also configured automatic split-brain recovery, and it works, but when a split-brain does happen, the faulty node ends up demoted to secondary, so its device isn't auto-mounted unless I promote it to primary manually. Any idea how to overcome the slow speed, and how to automatically promote a node to primary after it recovers from split-brain and fully syncs with its peer, so I can auto-mount the device during boot (or even after boot)? I don't want to babysit the servers on every single reboot.

global { usage-count no; }
common { syncer { al-extents 3389; rate 150M; } }
resource web {
  protocol C;
  startup {
    wfc-timeout 30;
    outdated-wfc-timeout 20;
    degr-wfc-timeout 30;
    become-primary-on both;
  }
  net {
    sndbuf-size 0;
    max-buffers 8000;
    max-epoch-size 8000;
    unplug-watermark 16;
    # cram-hmac-alg sha1;
    # shared-secret PASSWORD;
    allow-two-primaries;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }
  on first-node {
    device /dev/drbd0;
    disk /dev/xvdc0;
    address 192.168.1.11:7789;
    meta-disk internal;
  }
  on second-node {
    device /dev/drbd0;
    disk /dev/xvdc0;
    address 192.168.1.12:7789;
    meta-disk internal;
  }
  disk {
    no-disk-barrier;
    no-disk-flushes;
    on-io-error detach;
    fencing resource-and-stonith;
  }
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
  #  fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
  #  after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  #  local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emerg$
  #  pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/$
  #  pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/$
  }
}
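On the automation part of the question: DRBD runs the after-resync-target handler (already stubbed out in the handlers section above) on the node that just finished resyncing as the sync target, which is exactly the moment the recovered node is safe to promote again. A minimal sketch of such a hook follows; the script path and the /mnt/web mount point are assumptions, the resource name "web" matches the config above, and DRBD exports the resource name to handlers in DRBD_RESOURCE. Wire it up with: after-resync-target "/usr/local/sbin/drbd-promote-and-mount.sh"; in the handlers block.

```shell
#!/bin/sh
# /usr/local/sbin/drbd-promote-and-mount.sh  (hypothetical path)
# Invoked by DRBD's after-resync-target handler once resynchronization
# completes on this node: promote it back to primary and remount.
RES=${DRBD_RESOURCE:-web}   # DRBD sets DRBD_RESOURCE for handler scripts
MNT=/mnt/web                # assumed mount point; adjust to your setup

# Promote now that the local disk is UpToDate again; with
# allow-two-primaries set, both nodes may be primary at once.
drbdadm primary "$RES" || exit 1
# Mount only if it is not already mounted.
mountpoint -q "$MNT" || mount /dev/drbd0 "$MNT"
```

Test a handler like this by triggering a resync manually (disconnect/reconnect a node) before trusting it across real split-brain events.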

Answer

Eliminate some areas where the error could be. First, check whether the network interfaces are behaving normally: look for errors and verify the link speed.

ifconfig -a   # (some lines removed)
 eth0      Link encap:Ethernet  HWaddr f0:X
      inet addr:172.2.2.11  Bcast:172.2.2.255  Mask:255.255.252.0
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:1345026 errors:0 dropped:0 overruns:0 frame:0
      TX packets:1465184 errors:2 dropped:0 overruns:0 carrier:2
      collisions:35456 txqueuelen:1000 
      RX bytes:897944732 (897.9 MB)  TX bytes:185044496 (185.0 MB)

Check that there aren't a lot of errors. I have 2 transmit errors out of a total of 1465184 packets, which I don't find alarming.
My collision count, however, is higher than I'd like it to be.

# mii-tool eth0
eth0: negotiated 100baseTx-HD, link ok  

Only 100 Mb/s. In my case, the network would improve if I got a Gigabit switch. I'm also negotiating HD, which is half duplex — also bad. My network interface is capable of 1000baseT-FD, so the bottleneck is the switch.

In your case, if the servers have a second network interface, you could connect them directly to each other and keep the replication traffic off the switch entirely. Also, ethtool eth0 gives nicer output than mii-tool.
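When neither mii-tool nor ethtool is handy, the same link information is exposed under sysfs. A small sketch (the interface name eth0 is only an example; speed/duplex are not reported for virtual interfaces, hence the guarded reads):

```shell
#!/bin/sh
# Print link speed, duplex and error counters for an interface via sysfs.
# Usage: check_iface eth0
check_iface() {
    iface=$1
    # speed/duplex are unavailable for virtual interfaces (e.g. lo);
    # suppress the error there instead of failing.
    cat /sys/class/net/"$iface"/speed  2>/dev/null   # Mb/s
    cat /sys/class/net/"$iface"/duplex 2>/dev/null   # full / half
    cat /sys/class/net/"$iface"/statistics/tx_errors
    cat /sys/class/net/"$iface"/statistics/collisions
}
```

A half-duplex reading here would explain the collision counter climbing under load.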

If there are no network problems, check for hard disk errors.

# smartctl -t short /dev/sda    # test the harddisk

# smartctl -H /dev/sda           
smartctl 6.2 2013-04-20 r3812 [x86_64-linux-3.11.0-15-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

# smartctl --all  /dev/sda   # for all information

After that, check the log files for errors. Is the CPU load high? Maybe the encryption is taking a long time.
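A quick way to check the remaining suspects in one sweep (hedged: /proc/drbd is only present while the DRBD 8.x kernel module is loaded, hence the guard):

```shell
#!/bin/sh
# Quick health sweep: DRBD state if the module is loaded, then CPU load.
[ -e /proc/drbd ] && cat /proc/drbd   # connection state + resync progress
cat /proc/loadavg                     # 1/5/15-min load, run queue, last pid
```

In /proc/drbd, look at the cs: (connection state) field and the resync progress bar; a load average far above the core count points at CPU contention rather than the disk or network.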

Attribution
Source: Link, Question Author: user204252, Answer Author: jris198944
