Sunday, 5 February 2012

A Basic Ceph Storage & KVM Virtualisation Tutorial

So I had been meaning to give CEPH & KVM virtualisation a whirl in the lab for quite some time now. Here is the set of command-by-command instructions I used to set it up on a single host. The goal here is really just to get it to the 'working' stage for some basic functional experimentation.

Ideally you would want to set this up on multiple hosts with proper network separation so you can see how it performs under real-world network conditions.



Overview

For those who don't know it, CEPH is a distributed storage solution that allows you to scale horizontally with multiple machines/heads instead of the more traditional methodologies which use centralised heads with large amounts of storage attached to them.

The principle here is that you should be able to buy lots of inexpensive computers with a bunch of direct attached storage and just cluster them to achieve scalability. Also, without a central point of failure or performance bottleneck you should be able to scale beyond the limitations of our past storage architectures.

So CEPH, like most distributed storage solutions, really has 3 main components:

  • Object Storage Device (OSD): this is where the blocks get stored. You would usually have lots of these.
  • Meta-Data Server (MDS): this is where the metadata gets stored. You would have fewer of these. They are used for looking up where the blocks are stored and for storing metadata about files and blocks.
  • Monitor (MON): cluster management, configuration and state. This component keeps track of the state of the cluster.
Here is a basic diagram provided by the official site (so yes, I stole it - I hope that's okay):



As you can see, ideally these components are meant to be run on different sets of systems, with the OSD component being the most numerous. I'm just going to run them all on the same host for this demo, which is useful for a functional test, but not for a destructive or performance test.

By the way, the OSD component can use different types of backends and filesystems for storage; in this example I've chosen BTRFS.

So CEPH itself supports several different ways of accessing its storage, which makes it quite a flexible solution.

In this demo I'm going to concentrate only on the RBD and Ceph DFS mechanisms.

This installation was tested with:

  • ceph 0.4.0
  • qemu-kvm 1.0
  • libvirt 0.9.8
  • debian 7.0 (wheezy)
I'm using the bleeding-edge versions of these components because CEPH is still in heavy development, and it's better to follow the main line of development to get a clearer picture.

OS Preparation


This was tested on real hardware hosted by Hetzner in Germany. The box specs were roughly:
  • Quad-Core processor
  • 16 GB of RAM
  • 2 x 750 GB disks (no hardware raid)

To begin, I built a Debian 6.0 system (because that's all Hetzner offers you within its Robot tool) with a spare partition that I later used for the OSD/BTRFS volume. The layout was something like:

  • /dev/md0: /boot, 512MB
  • /dev/md1: LVM, 74 GB
  • /dev/md2: the rest of the disk

And in the LVM partition I defined the following logical volumes:

  • /: 50 GB (ext4)
  • swap: 20 GB

I reserved the device /dev/md2 for BTRFS. I believe a better configuration would be to skip the MD device and just use /dev/sda2 & /dev/sda3 directly, letting BTRFS do the mirroring. I have no data or performance statistics to prove this at the moment, however.
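
If you want to double-check the layout before going any further, something like the following should do. This is just a minimal sketch based on the md/LVM layout above; device names will differ on your hardware:

# Show the software RAID devices and their members
cat /proc/mdstat
# Show the LVM physical volume, volume group and logical volumes
pvs; vgs; lvs
# Confirm /dev/md2 is not mounted or otherwise in use yet
mount | grep md2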

Upgrading the system from Debian 6 to 7 is fairly straightforward. First, update the APT sources list.

/etc/apt/sources.list:

deb     http://ftp.de.debian.org/debian/ wheezy main contrib non-free
deb     http://ftp.de.debian.org/debian/ wheezy-proposed-updates main contrib non-free

Then run the following to pull in the latest updates:

apt-get update
apt-get -y dist-upgrade

The kernel will have been upgraded, so you should reboot at this point.
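
After the reboot, a quick sanity check doesn't hurt (the exact version strings you see will depend on when you run the upgrade):

# Should now report a wheezy (7.0) userland
cat /etc/debian_version
# Should show the newer kernel pulled in by the dist-upgrade
uname -r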



CEPH Installation

Install CEPH using the ceph package. This should pull in all the dependencies you need.

apt-get install ceph

Create some directories for the various CEPH components:

mkdir -p /srv/ceph/{osd,mon,mds}

I used a configuration file like the one below. Obviously you will need to change the various parts to suit your environment. I've left out authentication in this demo for simplicity, although if you want to do real destructive and load testing you should always enable it.

/etc/ceph/ceph.conf:

[global]
        log file = /var/log/ceph/$name.log
        pid file = /var/run/ceph/$name.pid

[mon]
        mon data = /srv/ceph/mon/$name

[mon.<your_short_hostname>]
        host = <your_hostname>
        mon addr = <your_ip_address>:6789

[mds]

[mds.<your_short_hostname>]
        host = <your_hostname>

[osd]
        osd data = /srv/ceph/osd/$name
        osd journal = /srv/ceph/osd/$name/journal
        osd journal size = 1000 ; journal size, in megabytes

[osd.0]
        host = <your_hostname>
        btrfs devs = /dev/md2
        btrfs options = rw,noatime

Now for configuration: CEPH's tooling SSHes into remote boxes and configures things for you. I believe this is nice for people who are just getting started, but I'm not sure it's the right approach going forward if you already have your own configuration management tool like Puppet, Chef or CFEngine.
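
The -a switches used below drive every host listed in ceph.conf over SSH, so if you ever grow this beyond one box you will want passwordless root SSH in place first. A rough sketch, assuming root logins over SSH are allowed on your machines:

# Generate a key for root if one doesn't exist yet
test -f /root/.ssh/id_rsa || ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
# Authorise it on the host named in ceph.conf
ssh-copy-id root@<your_hostname>
# Verify a non-interactive login works
ssh root@<your_hostname> true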

So to begin with, there is a command that will initialise your CEPH filesystems based on the configuration you have provided:

/sbin/mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs --no-copy-conf

Starting the daemons was a little strange (note the -a switch):

/etc/init.d/ceph -a start

So just to test it's all working, let's mount the CEPH DFS volume onto the local system:

mount -t ceph <your_hostname>:/ /mnt

What you are looking at here is the CEPH distributed filesystem (backed by the object store) mounted at /mnt. This is shared storage - you should be able to have multiple hosts mount it, just like NFS. As mentioned beforehand, however, this is not the only way of getting access to the CEPH storage cluster.
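
Before going further it's worth confirming the cluster actually reports itself as healthy. A quick sketch (output formats vary a little between CEPH versions):

# Overall cluster health - ideally this reports HEALTH_OK
ceph health
# Fuller status summary: mon, mds and osd states plus usage
ceph -s
# The mounted DFS should show up with the cluster's capacity
df -h /mnt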

CEPH DFS & Directory Snapshots

So I just wanted to segue a little and talk about this neat feature. CEPH, via the ceph-based mount point above, has the capability to do per-directory snapshots, which could come in useful. The interface is quite simple as well.

making a snapshot:

mkdir /mnt/test
cd /mnt/test
touch a b c
mkdir .snap/my_snapshot

deleting a snapshot:

rmdir .snap/my_snapshot

finding a snapshot:

The .snap directory won't show up when you do an ls -la in the directory.

Simply assume it's there and do something like:

ls -la .snap

... in the directory and the snapshots should show up under the names you created them with.
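
The snapshot contents are read-only, but you can copy files back out of them, which is handy if you delete something by accident. A minimal sketch using the my_snapshot example from above:

# Oops, removed a file we wanted
rm /mnt/test/a
# Copy it back out of the snapshot
cp /mnt/test/.snap/my_snapshot/a /mnt/test/a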

RADOS Block Device

So an alternative way of using your CEPH storage is RBD. The RBD interface gives you the capability to expose an image stored in the cluster to a remote system as a block device. Obviously this has the same caveats as any block device, so multiple hosts that mount the same device must ensure they use some sort of clustered filesystem such as OCFS2.

So first, if it's not already loaded, load the rbd kernel module:

modprobe rbd

Using the 'rbd' command line tool, create an image (the size is in megabytes):

rbd create mydisk --size 10000

You can list the current images if you want:

rbd list

Now, to map the actual device, you just have to tell the kernel about it:

echo "<your_ip_address> name=admin rbd mydisk" > /sys/bus/rbd/add

It should create a device like /dev/rbd/rbd/mydisk. Let's now format it with a real filesystem and mount it:

mkfs -t ext4 /dev/rbd/rbd/mydisk
mkdir /srv/mydisk
mount -t ext4 /dev/rbd/rbd/mydisk /srv/mydisk
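
When you are done with the device, the reverse operations are a little less obvious: unmount it, then tell the kernel to drop the mapping by writing the device's id into the sysfs remove file. A sketch - the id of 0 below is an assumption that this was the first and only mapped RBD device:

umount /srv/mydisk
# Find the numeric id of the mapped device (usually 0 for the first one)
ls /sys/bus/rbd/devices
# Remove the mapping - substitute the id you found above
echo 0 > /sys/bus/rbd/remove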

KVM/Qemu Support

QEMU (and libvirt for that matter) at some point merged in patches that allow you to specify an 'rbd' store as the backend for a QEMU virtual instance. I'm going to focus on using an Intel/KVM image for this tutorial.

So let's start by installing KVM & QEMU and the various other pieces we'll need:

apt-get install kvm libvirt-bin virtinst iptables-persistent

We probably want to create a pool for VM disks, separate from the pre-existing ones. You can create as many of these as you need:

rados mkpool vm_disks
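
You can verify the pool exists (and keep an eye on how much space it uses later) with the rados tool. A quick sketch:

# List all pools - vm_disks should be in the list
rados lspools
# Per-pool usage statistics
rados df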

Now create a QEMU image inside the pool. Notice we are just using 'qemu-img' to do this?

qemu-img create -f rbd rbd:vm_disks/box1_disk1 10G
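
To confirm the image really landed in the cluster rather than on a local disk, you can ask both tools about it. A sketch - this assumes your qemu-img was built with rbd support, which it must have been for the create command above to work:

# List images in the pool from CEPH's point of view
rbd ls vm_disks
# And inspect the image from QEMU's point of view
qemu-img info rbd:vm_disks/box1_disk1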

Create yourself a bridge network by modifying the correct Debian configuration file.

/etc/network/interfaces:

auto virbr0
iface virbr0 inet static
  bridge_ports none
  address 192.168.128.1
  netmask 255.255.255.0
  network 192.168.128.0
  broadcast 192.168.128.255

And now bring up the interface:

ifup --verbose virbr0
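
A quick check that the bridge came up with the right address (brctl comes from the bridge-utils package, which the bridge_ports option in the stanza above relies on):

# The bridge should be up with 192.168.128.1/24 assigned
ip addr show virbr0
# Show the bridge itself (no ports attached yet - libvirt adds them at VM start)
brctl show virbr0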

We'll need some firewall rules so that NAT works in this case. Obviously your network needs may vary here.

/etc/iptables/rules.v4:

*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A FORWARD -s 192.168.128.0/24 -m comment --comment "100 allow forwarding from internal" -j ACCEPT
-A FORWARD -d 192.168.128.0/24 -m comment --comment "100 allow forwarding to internal" -j ACCEPT
COMMIT
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A POSTROUTING -s 192.168.128.0/24 -o eth0 -m comment --comment "500 outbound nat for internal" -j MASQUERADE
COMMIT

And restart iptables-persistent to load the rules:

service iptables-persistent restart
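
You can check that the rules actually loaded with a quick sketch like this:

# The MASQUERADE rule for 192.168.128.0/24 should be listed here
iptables -t nat -L POSTROUTING -n -v
# And the two forwarding rules here
iptables -L FORWARD -n -v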

Turn on forwarding for IPv4:

echo 1 > /proc/sys/net/ipv4/ip_forward
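
That setting won't survive a reboot on its own; to make it permanent you can drop it into sysctl.conf as well. A minimal sketch:

# Persist IPv4 forwarding across reboots
echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf
sysctl -p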

Now that the network is done, we want to create a script to help us launch our VM instance.

First of all, create a device definition file called disk.xml with the following contents. This allows us to work around a limitation in virt-install, as it doesn't yet support these extra options as command-line arguments.

/root/disk.xml:

<disk type='network' device='disk'>
  <source protocol='rbd' name='vm_disks/box1_disk1'/>
  <target dev='vda' bus='virtio'/>
</disk>

Now let's create our script.

/root/virt.sh:

#!/bin/bash

set -x

virt-install \
  --name=box1 \
  --ram=512 \
  --vcpus=1 \
  --location=http://ftp.de.debian.org/debian/dists/wheezy/main/installer-amd64/ \
  --extra-args="console=ttyS0" \
  --serial=pty \
  --console=pty,target_type=serial \
  --os-type=linux \
  --os-variant=debiansqueeze \
  --network=bridge=virbr0,model=virtio \
  --graphics=none \
  --virt-type=kvm \
  --noautoconsole \
  --nodisks

# This is because virt-install doesn't support passing rbd 
# style disk settings yet.
# Attaching it quickly before system boot however seems to work
virsh attach-device box1 disk.xml --persistent

And finally we should be able to run it:

./virt.sh

Now attach to the console and go through the standard installation steps for the OS.

virsh console box1


Note: There is no DHCP or DNS server set up - for this test I just provided a static IP and used my own DNS servers.
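
For reference, inside the guest that just means a static stanza in its own /etc/network/interfaces, along the lines of the sketch below. The address and DNS server here are only examples chosen to match the bridge network above; substitute your own:

# /etc/network/interfaces inside the guest (example values)
auto eth0
iface eth0 inet static
  address 192.168.128.10
  netmask 255.255.255.0
  gateway 192.168.128.1
  # dns-nameservers needs the resolvconf package; otherwise edit /etc/resolv.conf by hand
  dns-nameservers 8.8.8.8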

As you go through the setup, the RBD disk we defined and created should be available like a normal disk as you would expect. After installation you shouldn't really notice any major functional difference.

Once installation is complete, you should be able to boot the system:

virsh start box1

And then access the console:

# virsh console box1
Connected to domain box1
Escape character is ^]

Debian GNU/Linux 6.0 box1 ttyS0

box1 login: 

And then you're done.

Summary

So this is quite an interesting exercise and one worth doing, but the software is still very much early-release. They even admit this themselves.

I'm wary of performance and stability more than anything, something I can't test with just a single host - so if I ever get the time I'd really like to run this thing properly.

I had a brief look at the operations guide, and it seems the instructions for adding and removing a host in the OSD cluster are not as automatic as I would like. Ideally, you really want the kind of behaviour that ElasticSearch offers on this level, so that adding and removing nodes is almost a brain-dead task. Having said that, adding a node seems easier than in some of the storage systems/solutions I've seen about the place :-).

So regardless of my concerns - I think this kind of storage is definitely the future and I'm certainly cheering the CEPH team on for this one. The functionality was fun (and yes, kind of exciting) to play with, and I can see the real-world possibilities for such a solution in the open-source arena becoming quite realistic now.

Other things to try from here
  • Check out another alternative: Sheepdog, which also seems to be gaining ground, but only on the QEMU storage front. It's a very specific solution, as opposed to CEPH's generic storage solution.
  • Test CEPH integration with OpenNebula and OpenStack so you can see it within a full cloud provisioning case. This might require some custom scripts to support cloning RBD-stored base images etc. (see the sketch after this list), but should be interesting.
  • Test the S3 emulation provided by the RadosGW component.
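
On that cloning point, a very rough sketch of what such a script might do is simply a server-side copy of a 'golden' base image into a new per-VM image. Assuming your rbd tool has the copy subcommand and that vm_disks/debian-base is a hypothetical, already-installed base image:

# Hypothetical example: copy a prepared base image to a new VM disk
rbd copy vm_disks/debian-base vm_disks/box2_disk1
# Then point a new disk.xml / virt-install run at vm_disks/box2_disk1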

3 comments:

  1. Hi,
    I have set up a ceph system with a client, mon and mds on one system, which is
    connected to 2 osds. The ceph setup worked fine and I did some tests on it and
    they too worked fine. Now I want to set up virtual machines on my system and
    want to run multiple virtual machine instances in parallel, but I don't know
    the exact installation steps that need to be done after the ceph installation.
    I followed your article, but it talks about setting up a single virtual machine
    instance rather than multiple instances. Can you please help me with the changes
    that need to be made in my case?

    Thanks in advance.

    --Udit

  2. Udit,

    You should be able to create many virtual machines, just make sure the disk device is uniquely named. In my example I show a sample disk.xml with the name for an RBD device. I think as long as the disk device is unique for each machine you should be okay.

    ken.

  3. Another tip if you don't like writing XML. You can create a virtual machine with virt-manager and then edit the XML file under /etc/libvirt/qemu, replacing the actual disk image with the virtual disk. At the moment, Virsh does not support RBD directly though.
