Archive for December, 2011

Infiniband IPoIB & SRP target setup on Ubuntu 11.10

Saturday, December 24th, 2011

In a previous blog article, I covered how to get the infiniband fabric up on Ubuntu 10.10 (Maverick Meerkat). This time I’ll cover IPoIB on Ubuntu 11.10 (Oneiric Ocelot), and add how to get the SRP target configured and running. This guide does not require you to rebuild the linux kernel, and works from the stock install of Ubuntu 11.10. So you should be up and running with an SRP target in about 15 minutes, all going well.
As before, I’m using an Ubuntu box for the disk subsystem, and a Windows 7 machine as the client (initiator).

Here are a few acronyms:

SCSI – we should know what this one is (Small Computer Systems Interface)
iSCSI – SCSI protocol over a network.  (Internet SCSI).
SCST – SCSI Target subsystem for linux with target drivers for iSCSI, Fibre Channel, SRP, SAS, FCoE, etc. We’re most interested in the SRP target.
RDMA – Remote Direct Memory Access – fast way of copying chunks of memory from one machine to another.
SRP – SCSI RDMA Protocol – wraps it all up by carrying the SCSI protocol over RDMA.

Stage 1 – Setting up IPoIB

This section shows how to get the basic fabric up and running with TCP/IP over Infiniband (IPoIB). Once you’ve done this, you’ll be able to send pings between the machines, ssh from one to the other across the fabric, etc.

For the Windows 7 machine, it’s a simple case of installing the OFED driver package from openfabrics.org. While installing the Windows drivers, make sure you select the SRP options, as later in this guide we’ll be setting up the RAID box as a drive on the Windows box using SRP (SCSI RDMA Protocol).

To enable improved throughput, enable connected mode on the link by updating the “Connected Mode” and “Connected Mode Payload Mtu size” in the properties of the interface. Open up “Device Manager” and find your Infiniband adapter:

Right click on the adapter and bring up the properties dialog:

 

So, on to the linux box. I started with a fresh install of Ubuntu 11.10.

First, install Ubuntu. Update all the packages to the latest using

sudo apt-get update
sudo apt-get upgrade

Everything below is done as root. This avoids having to type ‘sudo’ before every command, so I just call “sudo bash”.

Note: No need to edit the udev rules any more as per the Ubuntu 10.10 HOWTO, as they are correctly set by the kernel 3.0.0 and newer.

Edit /etc/modules and add the following modules:

ib_sa
ib_cm
ib_umad
ib_addr
ib_uverbs
ib_ipoib
ib_ipath
ib_qib

Next,

apt-get install opensm

This will install the subnet manager and all the relevant dependencies, libibverbs, etc.
Then add the relevant entries for the interface into /etc/network/interfaces file:

auto ib0
iface ib0 inet static
address 192.168.1.1
netmask 255.255.255.0

Then reboot. This will create the relevant infiniband entries in /sys, load the ipoib modules, and bring up the infiniband port with an ip address. You should now have a functioning infiniband port on your Ubuntu machine, and you should be able to ping the remote machine.
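Before moving on, it can help to confirm that ib0 really did come up. Here is a minimal sketch, assuming the interface is named ib0 as above. The peer address 192.168.1.2 in the comment is made up for illustration (the address of the Windows machine is not given here), and the optional second argument exists only so the function can be tried against a scratch directory instead of the real /sys.

```shell
# Print the kernel's view of the link state for a network interface by
# reading its operstate attribute in sysfs.
# $1 = interface name, $2 = sysfs root (optional, defaults to /sys).
ipoib_link_state() {
    iface="$1"
    sysfs_root="${2:-/sys}"
    state_file="$sysfs_root/class/net/$iface/operstate"
    if [ -r "$state_file" ]; then
        cat "$state_file"
    else
        # No such interface: the ipoib module is probably not loaded.
        echo "missing"
        return 1
    fi
}

# Typical use on the Ubuntu box:
#   ipoib_link_state ib0      # prints the state, e.g. "up" or "down"
#   ping -c 3 192.168.1.2     # hypothetical address of the Windows machine
```

If the function prints "missing", check that ib_ipoib is listed in /etc/modules before looking at anything else.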

The next thing is to enable connected mode for the infiniband connection. I found that this increased the TCP/IP netperf benchmarks from 3Gbps to 7Gbps.

root@raid:~# echo connected >`find /sys -name mode | grep ib0`
root@raid:~# echo 65520 >`find /sys -name mtu | grep ib0`

To make this happen when the ib0 interface is brought up, modify the /etc/network/interfaces file as follows:

auto ib0
iface ib0 inet static
address 192.168.1.1
netmask 255.255.255.0
up echo connected >`find /sys -name mode | grep ib0`
up echo 65520 >`find /sys -name mtu | grep ib0`
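The find-and-grep approach relies on exactly one matching path turning up under /sys. As an alternative sketch, the same two settings can be written straight to the interface's attributes under /sys/class/net. The function name and the optional sysfs root argument are my own additions for illustration; the mode has to be switched before the larger MTU is written.

```shell
# Switch an IPoIB interface to connected mode and then raise its MTU by
# writing to the interface's sysfs attributes directly.
# $1 = interface, $2 = MTU (default 65520), $3 = sysfs root (default /sys).
set_ipoib_connected() {
    iface="$1"
    mtu="${2:-65520}"
    sysfs_root="${3:-/sys}"
    net_dir="$sysfs_root/class/net/$iface"
    if [ ! -e "$net_dir/mode" ]; then
        echo "no IPoIB mode attribute for $iface" >&2
        return 1
    fi
    # Connected mode first, then the MTU that only connected mode allows.
    echo connected > "$net_dir/mode" || return 1
    echo "$mtu" > "$net_dir/mtu"
}

# On the real machine, as root:
#   set_ipoib_connected ib0
```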

 

Stage 2 – Setting up SRP Target

The next step, should you wish to take it, is to set up the SRP targets. This is a much better option than using samba shares, as it gives you a massive boost in throughput, and uses much less CPU. For example, on my setup using IPoIB and samba shares, I was able to achieve no more than 125MB/sec. Using a ramdisk set up as an SRP target, I was able to achieve 900MB/sec between my two machines.
So, first get a few packages:

apt-get install libmthca1
apt-get install iscsitarget
apt-get install open-iscsi
apt-get install lsscsi
apt-get install scsitools

Then change /etc/default/iscsitarget to:

ISCSITARGET_ENABLE=true

We now need to get scstadmin. I typically pull the latest version from subversion, so we first need to get svn:

apt-get install subversion

There’s a choice to be made now. According to the guide at http://iscsi-scst.sourceforge.net/iscsi-scst-howto.txt , you get a slight performance increase by patching and re-compiling the kernel. That’s a lot of work, so I skipped that step for this guide. I went for the easier option, which is just to run the few commands below. I’m not sure what the difference in performance is, but I can still pull 900MB/sec across the fabric from a ramdisk set up as an SRP target. And not having to rebuild the kernel makes this a 15 minute procedure rather than a 3 hour one. :)

cd ~
svn co https://scst.svn.sourceforge.net/svnroot/scst/trunk scst
cd scst
make scst scst_install
make iscsi iscsi_install
make scstadm scstadm_install
make srpt srpt_install

N.B.: You’ll need to rebuild these each time Ubuntu upgrades the kernel, as the modules get out of sync. Just re-run the makes.
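To spot a stale build after a kernel upgrade, one option is to look for the scst module under the running kernel's module tree and rebuild only when it is missing. This is a sketch, not part of SCST: both function names are made up, it assumes the checkout lives in ~/scst as above, and it searches by file name because I am not assuming which subdirectory the modules are installed into.

```shell
# Return 0 if a module file with the given name exists for a kernel release.
# $1 = module name, $2 = kernel release (default: running kernel),
# $3 = modules root (default /lib/modules).
module_built_for() {
    mod_name="$1"
    kernel_rel="${2:-$(uname -r)}"
    modules_root="${3:-/lib/modules}"
    found=$(find "$modules_root/$kernel_rel" -name "$mod_name.ko*" 2>/dev/null | head -n 1)
    [ -n "$found" ]
}

# Re-run the makes from the checkout only when the running kernel has no
# scst module yet. Runs in a subshell so the caller's directory is kept.
rebuild_scst_if_needed() {
    if module_built_for scst; then
        echo "scst module present for $(uname -r)"
    else
        (
            cd ~/scst || exit 1
            make scst scst_install &&
            make iscsi iscsi_install &&
            make scstadm scstadm_install &&
            make srpt srpt_install
        )
    fi
}
```

Calling rebuild_scst_if_needed as root after a kernel upgrade and reboot would then do nothing if the modules are already in place.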

Let’s now use scstadmin to create an SRP target:

First we modprobe a couple of kernel modules. That will make the relevant drivers and targets available to scstadmin. We should then be able to see these drivers/targets with a few scstadmin ‘list’ commands:

modprobe scst_vdisk
modprobe ib_srpt
root@raid:~# scstadmin -list_handler

Collecting current configuration: done.
 Handler
 -------------
 vdisk_fileio
 vdisk_blockio
 vdisk_nullio
 vcdrom

All done.
root@raid:~# scstadmin -list_driver

Collecting current configuration: done.

 Driver
 -------
 ib_srpt

All done.
root@raid:~# scstadmin -list_device

Collecting current configuration: done.

 Handler           Device
 ------------------------
 vdisk_nullio     -
 vdisk_fileio     -
 vdisk_blockio    DISK01
 vcdrom           -

All done.
root@raid:~#

So, if you can see those handlers, targets, and drivers, you’re good to go with the next step.
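If you would rather not eyeball the listings, the check can be scripted. The helper below is a made-up name, not an scstadmin feature: it reads a listing on stdin and fails if any of the names you pass in is absent. It only does a whole-word text match on the output shown above, so treat it as a rough sketch.

```shell
# Read scstadmin listing output on stdin and succeed only if every name
# given as an argument appears in it as a whole word.
listing_has_all() {
    listing=$(cat)
    for wanted in "$@"; do
        if ! printf '%s\n' "$listing" | grep -qw -- "$wanted"; then
            echo "not found: $wanted" >&2
            return 1
        fi
    done
}

# Typical use:
#   scstadmin -list_handler | listing_has_all vdisk_fileio vdisk_blockio
#   scstadmin -list_driver  | listing_has_all ib_srpt
```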

Summary:

The following command list is the quick list to use (and modify) if you’ve done this kind of thing before. The detail section below gives a bit of explanation on each of the lines.

scstadmin -disable_target ib_srpt_target_0 -driver ib_srpt
scstadmin -clear_config -force
scstadmin -open_dev DISK01 -handler vdisk_blockio -attributes filename=/dev/sda
scstadmin -set_dev_attr DISK01 -attributes t10_dev_id=0x2346,threads_num=4
scstadmin -add_group HOST01 -driver ib_srpt -target ib_srpt_target_0
scstadmin -add_lun 0 -driver ib_srpt -target ib_srpt_target_0 -group HOST01 -device DISK01 -attributes read_only=0
scstadmin -add_init 0x0002c9020021f9fc0002c902002200bc -driver ib_srpt -target ib_srpt_target_0 -group HOST01
scstadmin -enable_target ib_srpt_target_0 -driver ib_srpt
scstadmin -write_config /etc/scst.conf

Detail:

Clear the current config (only if that’s ok, and you don’t have any other config that you want to keep. I do this because I’m starting with a clean slate).

scstadmin -clear_config -force

Create DISK01, assigning it to a partition (/dev/sdg1, /dev/md0p1, etc., etc.). I’m using a disk partition in this example.

scstadmin -open_dev DISK01 -handler vdisk_blockio -attributes filename=/dev/sdg1

Now set the drive attributes.

scstadmin -set_dev_attr DISK01 -attributes t10_dev_id=0x2345

Now add a group

scstadmin -add_group HOST01 -driver ib_srpt  -target ib_srpt_target_0

Add a LUN to the group, assigning it to DISK01

scstadmin -add_lun 0 -driver ib_srpt -target ib_srpt_target_0 -group HOST01 -device DISK01 -attributes read_only=0

Add an initiator to the group, allowing the initiator to connect to our new target. I got this from watching /var/log/messages while I was disabling and enabling the Infiniband SRP miniport in device manager on the Win7 box. This caused the SRP miniport to attempt to connect to the Ubuntu target, and that attempt is shown in the /var/log/messages file along with the initiator ID.

scstadmin -add_init 0x0002c9020021f9fc0002c902002200bc -driver ib_srpt -target ib_srpt_target_0 -group HOST01
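Fishing the initiator ID out of the log by eye gets tedious. The sketch below pulls anything that looks like one (0x followed by 32 hex digits) from log lines that mention srpt. The exact wording of the log message is not shown in this guide, so the filter is deliberately loose and the function name is just for illustration.

```shell
# Print each distinct 128-bit initiator ID (0x followed by 32 hex digits)
# found on log lines that mention srpt. Reads the log on stdin.
extract_initiator_ids() {
    grep -i 'srpt' | grep -oiE '0x[0-9a-f]{32}' | sort -u
}

# Typical use, after disabling and enabling the SRP miniport on the Win7 box:
#   extract_initiator_ids < /var/log/messages
```

If more than one ID comes back, each one is a different initiator that has tried to log in, so pick the one that belongs to the machine you want to allow.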

Finally enable the target

scstadmin -enable_target ib_srpt_target_0 -driver ib_srpt

And write the config.

scstadmin -write_config /etc/scst.conf
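When repeating this for a different disk or initiator, it is easy to mistype one of the long lines. One way around that is a small generator that prints the commands from the walkthrough with your own values filled in, without running anything. The function name and argument order are made up for this sketch, and it leaves out the -clear_config and -set_dev_attr steps, so add those back if you need them.

```shell
# Print the scstadmin commands for one device, one group and one initiator.
# Nothing is executed; review the output, then run it as root.
# $1 = device name, $2 = block device path, $3 = initiator ID,
# $4 = group (default HOST01), $5 = target (default ib_srpt_target_0).
print_srp_target_commands() {
    dev_name="$1"
    dev_path="$2"
    initiator="$3"
    group="${4:-HOST01}"
    target="${5:-ib_srpt_target_0}"
    cat <<EOF
scstadmin -open_dev $dev_name -handler vdisk_blockio -attributes filename=$dev_path
scstadmin -add_group $group -driver ib_srpt -target $target
scstadmin -add_lun 0 -driver ib_srpt -target $target -group $group -device $dev_name -attributes read_only=0
scstadmin -add_init $initiator -driver ib_srpt -target $target -group $group
scstadmin -enable_target $target -driver ib_srpt
scstadmin -write_config /etc/scst.conf
EOF
}

# Typical use:
#   print_srp_target_commands DISK01 /dev/sdg1 0x0002c9020021f9fc0002c902002200bc
```

Printing first and running second is a deliberate choice here, since pointing filename= at the wrong partition is not something you want to find out about afterwards.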

At this point you should see a new drive appear on the Win7 box, asking you to format it. If not, you could try disabling and enabling the SRP miniport driver again, and take a look at /var/log/messages to see what’s happening.

Once the new drive appears on the Windows host, you should be able to format it and start using it. If you don’t see it immediately, have a look in “Disk Management”.

I found that the /etc/init.d/scst script was not getting called at boot, so I added a softlink in /etc/rcS.d:

cd /etc/rcS.d
ln -s ../init.d/scst S26scst

This is so it will start after scsitools, opensm and open-iscsi. Using the /etc/init.d/scst script also resolved a problem I’ve had for a long time, in that I used to have some modules in /etc/modules, so the SRP miniport would attempt to connect too early in the linux boot sequence, and fail, so that I had to go into the device manager and disable/enable to get it working. Now it pops up perfectly every boot of the linux box, and I  don’t have to go into device manager on the Win7 box any more.

I’d suggest you run a benchmark on it also, just to see what kind of speed you’re getting out of it in real-world usage. There’s a very handy (free) benchmarking tool available from attotech.com. Oh, and please do drop a comment below if you do get decent speeds. I’m always interested to hear what people are getting. Oh, and also, comment if you find this guide useful!

 

References:

SCST – http://scst.sourceforge.net/

Installing iSCSI-SCST – http://iscsi-scst.sourceforge.net/iscsi-scst-howto.txt

 

Performance Testing.

If you have plenty of memory in your linux machine, you might like to do a test of a ramdisk SRP target over the fabric.

There’s a ramdisk set up by default in Ubuntu at /dev/ram0, but it’s quite small, at 64K. I like to bump it up to 1Gig or 2Gig, as I have 4Gig in my linux box and it doesn’t need all that for normal operation. Adding an extra parameter to the kernel line in /boot/grub/grub.cfg does the trick nicely.

ramdisk_size=1024000 for a 1Gig ramdisk at /dev/ram0

ramdisk_size=2048000 for a 2Gig ramdisk at /dev/ram0

so the full line would look like:

linux   /boot/vmlinuz-3.0.0-12-generic root=UUID=39331d95-e9ba-48a5-9dd9-09978e0503c4 ro   quiet splash vt.handoff=7 ramdisk_size=1024000
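A note on the numbers: the kernel takes ramdisk_size in units of 1024 bytes, so 1024000 works out to 1000 MiB, a little under a full 1 GiB. If an exact power-of-two size matters, the value can be worked out as below. The function name is made up for this sketch.

```shell
# ramdisk_size is given in KiB, so a size in MiB is simply MiB * 1024.
ramdisk_size_for_mib() {
    echo $(( $1 * 1024 ))
}

ramdisk_size_for_mib 1024     # prints 1048576 (exactly 1 GiB)
ramdisk_size_for_mib 2048     # prints 2097152 (exactly 2 GiB)
echo $(( 1024000 / 1024 ))    # prints 1000, i.e. 1024000 KiB is 1000 MiB
```

Bear in mind that /boot/grub/grub.cfg is regenerated whenever update-grub runs, so a hand edit there can disappear. Putting the parameter in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and then running update-grub keeps it across kernel upgrades.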

Then using scstadmin, we add another disk to the script above, but we use vdisk_fileio rather than vdisk_blockio.

scstadmin -open_dev DISK02 -handler vdisk_fileio -attributes filename=/dev/ram0

Then when you open Disk Management in Windows, you should see a disk requesting a new MBR. Go ahead and add it, then format the disk and run a few benchmarks on it.

 

900MB/sec Network at Home

Thursday, December 22nd, 2011

I’ve just made a breakthrough in my research into setting up an infiniband fabric at home. I started out at about 135MB/sec, made slow progress up to 200-220MB/sec, but this evening broke the 900MB/sec barrier. Almost 1 GIGABYTE per second. That’s well over a CD every second! Here’s the ATTO benchmark output…

In previous posts, I’d identified several bottlenecks, and as I got around each one, I slowly got the speed up. However, everything seemed to max out at 200-220MB/sec. I’d identified PCI-express slots as one potential problem but when I set up a ramdisk on the linux box and exported that as an SRP target, I was still only getting 200MB/sec. Then I thought – could it be my i7 Win7 box? Surely not. That’s sooo much more powerful than the athlon64 x2 3GHz box that I’d installed linux on. But for a laugh I had a look at the PCI express slot that I’d inserted the Mellanox Infiniband card into. Even though it’s a 16x sized slot, my jaw dropped when I saw PCIEx4 written on the motherboard beside it. But the Mellanox card is a PCIEx8 card! So I moved the card into the spare PCIEx16 slot, and re-ran my tests, only to smash all previous records, achieving almost 950MB/sec for some packet sizes.

Just to put this into context:

Machine A:

  • 3GHz Athlon 64 dual core, Asus A8N-SLI Premium, 4Gig ram, Mellanox MHEA28-XTC 10Gbps Infiniband card.
  • Running Ubuntu 11.10, 2Gig Ramdisk set up as an SRP target

Machine B:

  • 3.20GHz Intel i7 quad core, Gigabyte GA-X58-USB3, 12Gig ram, Mellanox MHEA28-XTC 10Gbps Infiniband card
  • Running Windows 7 Home Premium, OFED infiniband drivers package, SRP Initiator. ATTO Disk Benchmark utility.

Both machines are connected directly together, no switch was needed.

Consumer grade hardware for the most part, with the magical Mellanox ingredient that pushes this beyond what I previously thought possible.
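As a rough sanity check on the 900MB/sec figure, here is the ceiling of the link itself. This assumes the 10Gbps quoted for the cards is the signalling rate of a 4x SDR link, which uses 8b/10b encoding, so only 8 of every 10 bits on the wire carry data. The function name is made up for the sketch.

```shell
# Usable data rate in MB/s for a link with 8b/10b encoding:
# signalling rate in Gbit/s -> Mbit/s, keep 8 of every 10 bits,
# then divide by 8 bits per byte.
ib_8b10b_data_rate_mb() {
    echo $(( $1 * 1000 * 8 / 10 / 8 ))
}

ib_8b10b_data_rate_mb 10      # prints 1000
```

On that estimate the link tops out at around 1000MB/sec, so 900 to 950MB/sec is already close to the limit.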

Also, it’s worth noting that this is a stock Ubuntu 11.10 install; a custom kernel was NOT required. I’m working on a HOWTO to get the Infiniband fabric up and SRP targets configured, and I’ll be posting that soon. I reckon it’ll be possible to configure an Ubuntu box up to the 900MB/sec speeds in under 15 minutes.

Here’s the HDTune benchmark of the same setup. Not quite 900MB/sec, but not far off….

“You can learn from my mistakes” (putting the infiniband card in the wrong slot)  :)

More to come soon…..

 

 

Additional Notes:

Here’s a shot of the Linux Box. It’s changed a bit since this photo was taken, there’s now a raid card in there where the graphics card is shown, but it gives you an idea of the “consumer grade” hardware I’m doing this with….

 

RAID controller installation & benchmarks

Saturday, December 17th, 2011

Continuing on my research into a high-bandwidth infiniband fabric at home, mainly with a view to fast backups of the 2000Gigs of photos I now have, I was now on the lookout on eBay for a good-value hardware RAID controller card. After a couple of weeks of looking at cards and prices, I came across an auction for a Dell PERC 5/i 8-port card. Bidding was pretty low, probably due to the comment on the auction that the seller could not guarantee that the card was in fully working condition. Turned out later to be in great working order. :)

I ended up getting it for €31 including shipping to Ireland from Finland. Most of these cards go for three times that price. I was just lucky that the auction ended at 3am, and I don’t think that many people bothered to stay up to keep an eye on it.

About a week later the card arrived, along with the separate order of a couple of SFF8484 to 4-way SATA cables, and I was ready to go.

I’d already freed up one of the PCI-e 16x slots in my Ubuntu box, so the Perc 5/i slotted right in. I hooked up the 5 x 1TB drives that I was configuring in a RAID5 volume, and booted up.

Upon boot, the raid card was detected, and I hit CTRL-R to enter the setup. It was fairly straightforward, I selected the 5 drives that I’d connected up to it, and initialised a new RAID5 volume.

Anyway, the drive then appeared to Linux as a 4TB volume, which I partitioned into one big volume using parted.
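The 4TB figure is what you would expect from five 1TB drives: RAID5 gives up one drive's worth of space to parity, whereas RAID0 uses the lot. A tiny sketch of the arithmetic, with made-up function names:

```shell
# Usable capacity in TB. $1 = number of drives, $2 = size of each in TB.
raid5_capacity_tb() { echo $(( ($1 - 1) * $2 )); }   # one drive's worth goes to parity
raid0_capacity_tb() { echo $(( $1 * $2 )); }         # no redundancy, all space usable

raid5_capacity_tb 5 1     # prints 4
raid0_capacity_tb 5 1     # prints 5
```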

There were a couple of problems once I’d installed the drive: read speed and write speed. I was getting 200MB/sec reads and 50MB/sec writes. I did a bit of googling and found an excellent thread on the Dell PERC 5/i controller (ref1), so I flashed the firmware up to the very latest from LSI. I initially did not have a battery, but the latest firmware allowed me to force Write-Back and Adaptive Read-Ahead (not safe settings without battery, but it’d be ok till the battery arrived, another eBay purchase), which should improve the throughput. Also, I found that when you do a quick init of the array, it says that it completes very quickly, but it’ll go ahead and do a full background init afterwards, thereby confusing my throughput results. I found that the best thing to do was to do a full init in the RAID configuration utility and let it sit there for 2.5 hours to complete the full init of 4TB (5x1TB). That way I’d know that the initialisation process would not be skewing my measurements later. Before I did the full init, though, I was getting 170MB/sec read/write (across the infiniband fabric) even when the background init was in progress, so the firmware upgrade was definitely a step in the right direction. I was looking forward to full speed measurements in a couple of hours.

Once the init finished, I was disappointed to find that I was only getting just under 200MB/sec read/write.

So, I tore down the array again and went back to a 5 disk RAID0 array, for max speed. This gave me the following:

This was only marginally better than the RAID5 readings, and worse read speed than using software RAID and motherboard SATA connectors. So to eliminate my Linux box motherboard and CPU, I took the RAID controller out of the Linux box and inserted it into my i7 desktop (i7-950@3.07GHz 12G RAM), leaving the drives in the linux box, and running the SATA cables between the two machines. The throughput from the raid card was quite different.

So, now I was getting 300-350MB/sec writes, and up to 600MB/sec reads, with 500MB/sec reads quite common above block size of 64K. That proved that my linux box motherboard is now the bottleneck. Maybe the fact that I’ve two PCIe-8x cards plugged into the two PCIe-16x slots intended for graphics cards is a problem?

At least now I knew what the Perc 5/i RAID controller was capable of. I just had to get the machine that it was hosted in up to spec to be able to drive it at full speed. Maybe then we’ll get more out of the infiniband link between the two machines. I’ll look at re-organising the cards in the machine, maybe look at getting a new motherboard with faster PCI-express lanes.

I’m rapidly coming to the conclusion that trying to re-cycle 5-year-old components to get uber-speed throughput is a lot more difficult than it seems at first. Might be something to do with the hardware not being up to the task when pushed to the limit. The motherboard I’m using in the Linux box with the RAID controller is an Asus A8N-SLI Premium, which was a top-end consumer-grade motherboard in 2005. It has two PCI-e 16x slots, each theoretically capable of 2GB/sec (PCI-e 1.0 8 lanes) in SLI mode, yet I’m only able to get 200-300MB/sec through them? What’s with that? Surely there’s gotta be a helluva lot more data going through those slots when there’s two big fat graphics cards in them.
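For reference, the 2GB/sec figure comes from PCI-e 1.0 carrying roughly 250MB/sec per lane in each direction. A quick sketch of that arithmetic (the function name is made up):

```shell
# Theoretical PCI-e 1.0 bandwidth in MB/s, per direction.
# $1 = number of lanes; each lane carries about 250MB/s.
pcie1_bandwidth_mb() {
    echo $(( $1 * 250 ))
}

pcie1_bandwidth_mb 8      # prints 2000
pcie1_bandwidth_mb 4      # prints 1000
```

Either way, the 200-300MB/sec I am seeing is a small fraction of what even four lanes should carry on paper.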

Stay posted for more updates soon….

Update: Just to help narrow down the bottleneck, I managed to get a 1Gigabyte ramdisk set up as an SRP target. Here’s the throughput across the fabric:

So, the infiniband card in the PCI-express socket with 8 lanes assigned is only getting 200MB/sec through it from RAM. CPU was about 25% busy through the whole test.

 

 

 

References:

http://www.overclock.net/t/359025/perc-5-i-raid-card-tips-and-benchmarks

 

Service Resumes….

Thursday, December 1st, 2011

Apologies to all who were trying to get at the website yesterday. Due to a non-delivery of an email last month, I was not reminded to renew my hosting. So, yesterday, they suspended my account. 30 minutes later, I had paid the fees, yet it took almost 24 hours for my website to be re-activated. I’m not happy.

Anyway, welcome back all….

Regards,

Dave.