Xen 3.4 – 4.0 Bug On AMD 6100 Series (Magny-Cours) Opteron
Xen 3.4 through 4.0 will not boot on the new AMD 6100 series (Magny-Cours) Opteron CPU (Socket G34). There seems to be a bug in the Machine Check Exception (MCE) handling code that causes Xen to panic. The 'nomce' (3.4) and 'no-mce' (4.0) boot options do not properly turn off the MCE code, so neither option can be used to avoid the issue.
There is a patch in Xen 4.0 unstable that fixes the 'nomce' boot option so that it works properly. Unfortunately, most of the Linux distros have not yet back-ported this change to Xen 3.4.
I have applied the fix to Xen 3.4.1 for OpenSUSE 11.2 and built RPMs for the affected packages. This did the trick and my AMD 6100 series systems are now running Xen just fine with the 'nomce' boot option. Read on for more details and the link to the RPMs with the fix.
Xen Panic Output
If you have an AMD 6100 series CPU and are getting the following output, then you are probably running into the MCE bug:
(XEN) Xen BUG at amd_nonfatal.c:165
(XEN) ----[ Xen-3.4.2 x86_64 debug=n Not tainted ]----
(XEN) CPU: 0
(XEN) RIP: e008:[<ffff828c801778f9>] mce_amd_work_fn+0x1d9/0x1f0
(XEN) RFLAGS: 0000000000010246 CONTEXT: hypervisor
(XEN) rax: 0000000000000ffe rbx: ffff828c8024ff28 rcx: 0000000000000000
(XEN) rdx: c0080ffe01000000 rsi: 0000000000000413 rdi: 0000000000000000
(XEN) rbp: 000000025f13f8e0 rsp: ffff828c8024fe60 r8: ffff828c8028f800
(XEN) r9: 0000000000000000 r10: 0000000000000005 r11: 0000000000000000
(XEN) r12: 0000000000000000 r13: ffff828c80177720 r14: ffff83081fd7b190
(XEN) r15: ffff83081fd7b190 cr0: 000000008005003b cr4: 00000000000006f0
(XEN) cr3: 00000004ca4a6000 cr2: 000000000083c770
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff828c8024fe60:
(XEN) 0000000000000000 c0080ffe01000000 ffff828c80221180 ffff828c8011a12c
(XEN) ffff8300dfc2c060 ffff828c80221180 ffff83081fd7b198 ffff828c8011a20d
(XEN) 000000024ab06880 0000000000000000 ffff828c8024ff28 ffff828c80267900
(XEN) ffff828c80266900 0000000000000000 ffff828c80221100 ffff828c801185b8
(XEN) 000000000000e008 ffff828c8024ff28 ffff828c80266900 ffff828c802215b0
(XEN) 000000025e3b7f20 ffff828c80138fcc 0000000000000000 ffff8300dfafc000
(XEN) ffff8300dfc2c000 0000000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000246
(XEN) 0000000000000008 00000000ffff8e54 0000000000000054 0000000000000000
(XEN) ffffffff802053aa 0000000000000001 0000000000000000 0000000000000001
(XEN) 0000010000000000 ffffffff802053aa 000000000000e033 0000000000000246
(XEN) ffffffff80511f50 000000000000e02b 0000000000000000 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000000 ffff8300dfafc000
(XEN) Xen call trace:
(XEN) [<ffff828c801778f9>] mce_amd_work_fn+0x1d9/0x1f0
(XEN) [<ffff828c8011a12c>] execute_timer+0x2c/0x50
(XEN) [<ffff828c8011a20d>] timer_softirq_action+0xbd/0x2e0
(XEN) [<ffff828c801185b8>] do_softirq+0x58/0x80
(XEN) [<ffff828c80138fcc>] idle_loop+0x4c/0xa0
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Xen BUG at amd_nonfatal.c:165
(XEN) ****************************************
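Because the panic happens at boot, the output usually has to be captured over a serial console and saved to a file. As a rough sketch, a saved log could be checked for the signature like this. The function name `has_amd_mce_bug` and the log path in the usage line are made up for illustration and are not part of Xen or any distro tooling.

```shell
# Hypothetical helper (not part of Xen): succeeds (exit status 0) if the
# given console log contains the BUG signature shown in the panic output.
has_amd_mce_bug() {
    # -q: print nothing, only report through the exit status
    grep -q 'Xen BUG at amd_nonfatal\.c' "$1"
}
```

Usage, assuming the serial console output was saved to a file of your choosing: `has_amd_mce_bug ./xen-console.log && echo "Looks like the MCE bug"`. The pattern deliberately leaves out the line number (165), since that may differ between Xen builds.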
Xen Community Discussion
Since these CPUs are just starting to hit the market, I was not able to find a lot of details on this issue. The only real discussion I was able to find was this thread on the [Xen-devel] mailing list:
[Xen-devel] (XEN) Xen BUG at amd_nonfatal.c:165 on a new amd g34 board
There is a patch attached in the thread and it works. For those not wanting to get into building the package from source, I have created some patched RPMs for all the platforms I currently have access to.
Binary Package With Fix
I have built RPMs for the current stable version of OpenSUSE (11.2) and will be building packages for SLES 11 shortly. I will then be back-porting to OpenSUSE 11.1 and posting those RPMs within a week. Maybe I will do OpenSUSE 11.0 and SLES 10. Leave a comment if you need one of these platforms and, if there is demand, I will build packages for them.
Using The Patch
Untar the archive with the proper RPMs for your platform. Install the packages with 'zypper', which will fetch any required dependencies from your configured repositories:
# zypper install xen-3.4.1_20360_04-2.1.x86_64.rpm \
xen-tools-3.4.1_20360_04-2.1.x86_64.rpm \
xen-libs-3.4.1_20360_04-2.1.x86_64.rpm
After you have installed the patched RPMs, you need to add the 'nomce' boot option to your Xen entry in GRUB. It goes on the line that reads 'kernel /xen.gz ...'. Below is an example from one of my patched hosts:
title Xen -- openSUSE 11.2 - 2.6.31.12-0.2
root (hd0,0)
kernel /xen.gz nomce noreboot com2=115200,8n1 console=com2
module /vmlinuz-2.6.31.12-0.2-xen root=/dev/rootvg/rootlv
module /initrd-2.6.31.12-0.2-xen
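If you would rather script the GRUB change than edit the file by hand, the following is a minimal sketch. It assumes a GRUB legacy menu file whose Xen entry has a 'kernel' line referencing 'xen.gz', as in the example above. The function name `add_nomce` is invented here, and it prints the modified file to stdout instead of editing in place so you can review the result first.

```shell
# Hypothetical helper: print a copy of the GRUB menu file given as $1 with
# 'nomce' added right after 'xen.gz' on every Xen kernel line.
# Lines that already carry 'nomce' are left alone, so re-running is safe.
add_nomce() {
    awk '
        # Match "kernel ... xen.gz ..." lines that lack a standalone nomce
        /^[ \t]*kernel[ \t]/ && /xen\.gz/ && !/(^|[ \t])nomce([ \t]|$)/ {
            sub(/xen\.gz/, "xen.gz nomce")
        }
        # Print every line, modified or not
        { print }
    ' "$1"
}
```

Usage, assuming your menu file lives at /boot/grub/menu.lst (check your own system): `add_nomce /boot/grub/menu.lst > /tmp/menu.lst.new`, then compare the two files with `diff` before copying the new one into place.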
OpenSUSE 11.2 Packages
These RPMs are based on the latest (as of 2010-09-06) source RPM package for 'xen'. I have artificially inflated the version number so that the package will be applied as an upgrade to an existing install if necessary. This may interfere with future official OpenSUSE updates to this package. If it does, you will need to remove these RPMs manually and then re-install the package.
I will try to maintain up-to-date versions of these RPMs for as long as official OpenSUSE updates come out without the patch. Check back often, as I will set up a mailing list this week to send notifications when I update these packages.
Simply install the included RPMs with 'zypper', which will fetch the dependencies from your other repositories.
Xen 3.4.1 for OpenSUSE 11.2 with 'nomce' option fix.
Novell SUSE Linux Enterprise Server 11 Packages
My day job runs SLES 11, so I will have an opportunity to build SLES 11 packages with the fix this week. Check back after 2010-09-09 for these packages.
UPDATE: Things have been busy at work and I have not gotten around to building a SLES 11 package. I am probably not going to build one unless someone asks for it, so if you need it, leave a comment!
September 30th, 2010 - 05:07
Thanks a lot for this post. It has been very helpful for me. Good work.