Ideally you would set this up on multiple hosts with proper network separation, so you can see how it performs under real-world network conditions.
Overview
For those who don't know it, CEPH is a distributed storage solution that lets you scale horizontally across multiple machines/heads, instead of the more traditional approach of centralised heads with large amounts of storage attached to them.
The principle is that you should be able to buy lots of inexpensive computers with a bunch of direct-attached storage and simply cluster them to achieve scalability. And without a central point of failure or performance bottleneck, you should be able to scale beyond the limitations of our past storage architectures.
So CEPH, like most distributed storage solutions, really has three main components:
- Object Storage Device (OSD): this is where the blocks get stored. You would usually have lots of these.
- Meta-Data Server (MDS): this is where the metadata gets stored. You would have fewer of these. They are used for looking up where the blocks are stored and for storing metadata about files and blocks.
- Monitor (MON): cluster management, configuration and state. This component keeps track of the state of the cluster.
Here is a basic diagram provided on the official site (so yes, I stole it - I hope that's okay):
As you can see, ideally these components are meant to be run on different sets of systems, with the OSD component being the most numerous. I'm just going to run them all on the same host for this demo, which is useful for a functional test, but not for a destructive or performance test.
By the way, the OSD part can use different types of backends and filesystems for storage, but in this example I've chosen BTRFS.
So CEPH itself supports multiple different ways of mounting its storage, which makes it quite a flexible solution.
In this demo I'm going to concentrate only on the RBD and Ceph DFS mechanisms.
This installation was tested with:
- ceph 0.4.0
- qemu-kvm 1.0
- libvirt 0.9.8
- debian 7.0 (wheezy)
I'm using the bleeding-edge versions of these components because CEPH is still in heavy development; it's better to follow the main line of development to get a clearer picture.
OS Preparation
The installation was tested on real hardware hosted by Hetzner in Germany. The box specs were roughly:
- Quad-Core processor
- 16 GB of RAM
- 2 x 750 GB disks (no hardware raid)
To begin, I built a Debian 6.0 system (because that's all Hetzner offers you within its Robot tool) with a spare partition that I later used for the OSD/BTRFS volume. The layout was something like:
- /dev/md0: /boot, 512MB
- /dev/md1: LVM, 74 GB
- /dev/md2: the rest of the disk
And in the LVM partition I defined the following logical volumes:
- /: 50 GB (ext4)
- swap: 20 GB
I reserved the device /dev/md2 for BTRFS. I believe a more optimal configuration would be to skip the MD device and use /dev/sda2 & /dev/sda3 directly, letting BTRFS do the mirroring. I have no data or performance statistics to prove this at the moment, however.
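If you want to experiment with that idea, the BTRFS side of it would look roughly like the sketch below. The partition names are purely illustrative (one data partition per physical disk is assumed) - adjust them to your own layout:

# Hypothetical example: let BTRFS mirror data and metadata across two raw
# partitions (one per disk) instead of layering it on top of an MD device.
mkfs.btrfs -d raid1 -m raid1 /dev/sda3 /dev/sdb3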
Getting the system upgraded from Debian 6 to 7 is fairly straightforward. First, update the APT sources list.
/etc/apt/sources.list:
deb http://ftp.de.debian.org/debian/ wheezy main contrib non-free
deb http://ftp.de.debian.org/debian/ wheezy-proposed-updates main contrib non-free
Then run the following to get the latest updates:
apt-get update
apt-get -y dist-upgrade
The kernel will have been upgraded, so you should reboot at this point.
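For completeness, the reboot and a quick check that you came back up on the new kernel:

reboot
# once the box is back up:
uname -r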
CEPH Installation
Install CEPH using the ceph package. This should pull in all the dependencies you need.
apt-get install ceph
Create some directories for the various CEPH components:
mkdir -p /srv/ceph/{osd,mon,mds}
I used a configuration file like the one below. Obviously you will need to change the various parts to suit your environment. I've left authentication out of this demo for simplicity, although if you want to do real destructive and load testing you should always include it.
/etc/ceph/ceph.conf:
[global]
log file = /var/log/ceph/$name.log
pid file = /var/run/ceph/$name.pid
[mon]
mon data = /srv/ceph/mon/$name
[mon.<your_short_hostname>]
host = <your_hostname>
mon addr = <your_ip_address>:6789
[mds]
[mds.<your_short_hostname>]
host = <your_hostname>
[osd]
osd data = /srv/ceph/osd/$name
osd journal = /srv/ceph/osd/$name/journal
osd journal size = 1000 ; journal size, in megabytes
[osd.0]
host = <your_hostname>
btrfs devs = /dev/md2
btrfs options = rw,noatime
Now for configuration: CEPH's tooling chooses to SSH into remote boxes and configure things. I believe this is nice for people who are just getting started, but I'm not sure it's the right approach going forward if you already have your own configuration management tool like Puppet, Chef or CFEngine.
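One practical note: because of this SSH-based approach, the -a style commands below may try to SSH to the host named in your configuration even in a single-box setup, so it's worth having passwordless SSH to yourself in place. A rough sketch if you don't already have keys:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# confirm it works without a password prompt
ssh <your_hostname> true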
So to begin with, there is a command that will initialise your CEPH filesystems based on the configuration you have provided:
/sbin/mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs --no-copy-conf
Starting the daemons is a little unusual (note the -a switch, which acts on all the nodes listed in the configuration):
/etc/init.d/ceph -a start
So just to test it's all working, let's mount the CEPH DFS volume onto the local system:
mount -t ceph <your_hostname>:/ /mnt
What you are looking at here is the CEPH object store mounted in /mnt. This is a shared object store - you should be able to have multiple hosts mount it, just like NFS. As mentioned beforehand, however, this is not the only way of getting access to the CEPH storage cluster.
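As a quick sanity check, the cluster status command and a df on the mount point should both report sensible numbers (your output will obviously differ):

ceph -s
df -h /mnt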
CEPH DFS & Directory Snapshots
So I just wanted to segue a little and talk about this neat feature. The ceph-based mount point above has the capability to do per-directory snapshots, which could come in useful. The interface is quite simple as well.
making a snapshot:
mkdir /mnt/test
cd /mnt/test
touch a b c
mkdir .snap/my_snapshot
deleting a snapshot:
rmdir .snap/my_snapshot
finding a snapshot:
The .snap directory won't show up when you do an ls -la in the directory.
Simply assume it's there and do something like:
ls -la .snap
... in the directory, and the snapshots should show up under the names you created them with.
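Restoring from a snapshot is just a copy back out of the .snap tree. For example, assuming the my_snapshot snapshot created above still exists:

# bring the file 'a' back out of the snapshot
cp -a /mnt/test/.snap/my_snapshot/a /mnt/test/a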
RADOS Block Device
So an alternative way of using your CEPH storage is RBD. The RBD interface gives you the capability to expose an object onto a remote system as a block device. Obviously this has the same caveats as any block device, so multiple hosts that mount the same device must use some sort of clustered file system such as OCFS2.
So first, if it's not already loaded, load the rbd kernel module:
modprobe rbd
Using the 'rbd' command line tool, create an image (size is in megabytes):
rbd create mydisk --size 10000
You can list the current images if you want:
rbd list
Now to map the actual device, you just have to tell the kernel about it first:
echo "<your_ip_address> name=admin rbd mydisk" > /sys/bus/rbd/add
It should create a device like /dev/rbd/rbd/mydisk. Let's now format it with a real filesystem and mount it:
mkfs -t ext4 /dev/rbd/rbd/mydisk
mkdir /srv/mydisk
mount -t ext4 /dev/rbd/rbd/mydisk /srv/mydisk
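When you're done with the device, the teardown is roughly the reverse. The 0 written to the remove file is the device id, which you can check under /sys/bus/rbd/devices:

umount /srv/mydisk
echo "0" > /sys/bus/rbd/remove
rbd rm mydisk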
KVM/Qemu Support
QEMU (and libvirt for that matter) at some point merged in patches to allow you to specify an 'rbd' store as a backend for a QEMU virtual instance. I'm going to focus on using an Intel/KVM image for this tutorial.
So let's start by installing KVM & QEMU and the various other pieces we'll need:
apt-get install kvm libvirt-bin virtinst iptables-persistent
We probably want to create a pool for VM disks, separate from the pre-existing ones. You can create as many of these as you need:
rados mkpool vm_disks
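You can confirm the pool was created (and see the default pools alongside it) with:

rados lspools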
Now create a qemu image inside the pool. Notice we are just using 'qemu-img' to do this?
qemu-img create -f rbd rbd:vm_disks/box1_disk1 10G
Create yourself a bridge network by modifying the correct Debian configuration file.
/etc/network/interfaces:
auto virbr0
iface virbr0 inet static
    bridge_ports none
    address 192.168.128.1
    netmask 255.255.255.0
    network 192.168.128.0
    broadcast 192.168.128.255
And now bring up the interface:
ifup --verbose virbr0
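A quick check that the bridge came up with the address we expect:

ip addr show virbr0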
We'll need some firewall rules so that NAT works in this case. Obviously your network needs may vary here.
/etc/iptables/rules.v4:
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A FORWARD -s 192.168.128.0/24 -m comment --comment "100 allow forwarding from internal" -j ACCEPT
-A FORWARD -d 192.168.128.0/24 -m comment --comment "100 allow forwarding to internal" -j ACCEPT
COMMIT
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A POSTROUTING -s 192.168.128.0/24 -o eth0 -m comment --comment "500 outbound nat for internal" -j MASQUERADE
COMMIT
And restart iptables-persistent to load the rules:
service iptables-persistent restart
Turn on forwarding for IPv4:
echo 1 > /proc/sys/net/ipv4/ip_forward
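That only lasts until the next reboot; to make it permanent you can also persist the setting via sysctl (the path shown is the stock Debian one):

echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf
sysctl -p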
Now that the network is done, we want to create a script to help us launch our VM instance.
First of all, create a device definition file called disk.xml with the following contents. This allows us to work around limitations in virt-install, as it doesn't yet support these extra options as command-line arguments.
/root/disk.xml:
<disk type='network' device='disk'>
  <source protocol='rbd' name='vm_disks/box1_disk1'/>
  <target dev='vda' bus='virtio'/>
</disk>
Now let's create our script.
/root/virt.sh:
#!/bin/bash
set -x
virt-install \
  --name=box1 \
  --ram=512 \
  --vcpus=1 \
  --location=http://ftp.de.debian.org/debian/dists/wheezy/main/installer-amd64/ \
  --extra-args="console=ttyS0" \
  --serial=pty \
  --console=pty,target_type=serial \
  --os-type=linux \
  --os-variant=debiansqueeze \
  --network=bridge=virbr0,model=virtio \
  --graphics=none \
  --virt-type=kvm \
  --noautoconsole \
  --nodisks
# This is because virt-install doesn't support passing rbd
# style disk settings yet.
# Attaching it quickly before system boot however seems to work
virsh attach-device box1 disk.xml --persistent
And finally we should be able to run it:
./virt.sh
Now attach to the console and go through the standard installation steps for the OS.
virsh console box1
Note: There is no DHCP or DNS server setup - for this test I just provided a static IP and used my own DNS servers.
As you go through the setup, the RBD disk we defined and created should be available like a normal disk as you would expect. After installation you shouldn't really notice any major functional difference.
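The RBD disk just appears as a plain virtio device (vda, given the target we set in disk.xml), so from a shell in the installer or the installed system you can verify it like any other disk:

# run inside the guest
fdisk -l /dev/vda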
Once installation is complete, you should be able to boot the system:
virsh start box1
And then access the console:
# virsh console box1
Connected to domain box1
Escape character is ^]
Debian GNU/Linux 6.0 box1 ttyS0
box1 login:
And then you're done.
Summary
So this was quite an interesting exercise and one worth doing, but the software is still very much an early release. They even admit this themselves.
I'm wary of performance and stability more than anything, something I can't test with just a single host - so if I ever get the time I'd really like to run this thing properly.
I had a brief look at the operations guide, and the instructions for adding and removing a host in the OSD cluster look less automatic than I would like. Ideally, you really want the kind of behaviour that ElasticSearch offers here, so that adding and removing nodes is almost a brain-dead task. Having said that, adding a node seems easier than in some of the storage systems/solutions I've seen around the place :-).
So regardless of my concerns - I think this kind of storage is definitely the future and I'm certainly cheering the CEPH team on for this one. The functionality was fun (and yes, kind of exciting) to play with, and I can see real-world possibilities for such a solution in the open-source arena now.
Other things to try from here
- Check out another alternative: Sheepdog, which also seems to be gaining ground, but only on the QEMU storage front. It's a very specific solution, as opposed to CEPH's generic storage solution.
- Test CEPH integration with OpenNebula and OpenStack so you can see it within a full cloud provisioning use case. This might require some custom scripts to support cloning RBD-stored base images etc., but should be interesting.
- Test the S3 emulation provided by the RadosGW component.
Comments
Hi,
I have set up a ceph system with a client, mon and mds on one machine, connected to 2 OSDs. The ceph setup worked fine and the tests I ran on it also worked fine. Now I want to set up virtual machines on my system and run multiple virtual machine instances in parallel, but I don't know the exact installation steps needed after the ceph installation. I followed your article, but it covers setting up a single virtual machine instance rather than multiple instances. Can you please help me with the changes that need to be made in my case?
Thanks in advance.
--Udit
Udit,
You should be able to create many virtual machines; just make sure the disk device is uniquely named. In my example I show a sample disk.xml with the name for an RBD device. I think as long as the disk device is unique for each machine you should be okay.
ken.
Another tip if you don't like writing XML: you can create a virtual machine with virt-manager and then edit the XML file under /etc/libvirt/qemu, replacing the actual disk image with the virtual disk. At the moment, Virsh does not support RBD directly though.