[Openstack-operators] well tested distro/kernel combination for production

Discussion:

sylecn

2014-02-08 04:02:54 UTC

Hi,

I have been seeing "rcu_sched detected stalls on CPUs/tasks" errors in Ubuntu
VMs. They leave the VM dead, so that it can't be rebooted or deleted, and I
believe the cause is a bug in either the hypervisor kernel or the guest kernel.

I'd like to know which OS version and kernel version you use in
production. Reports from both public and private clouds are welcome. My
company plans to run a small (to medium) private cloud. The hypervisors run
Ubuntu 12.04, and the first guest OSes will be Ubuntu 12.04 and CentOS 6, so
known-good kernel versions for those would be much appreciated.

Is there a wiki page about this?

PS. Here is a combination that has the above-mentioned error:

hypervisor os: ubuntu 12.04.3
hypervisor kernel: 3.8.0-35-generic
vm os: ubuntu 12.04
vm kernel: 3.2.0-56-virtual
openstack: havana
libvirt: 1.1.1-0ubuntu8~cloud2

Relevant old bugs on similar issues:
rhel5.5 running as kvm guest hangs randomly
https://bugzilla.redhat.com/show_bug.cgi?id=619798

Bug #503138 "Lucid & Natty, KVM, After kernel message hrtimer: ..." : Bugs
: "kvm" package : Ubuntu
https://bugs.launchpad.net/ubuntu/+source/kvm/+bug/503138

I don't have a reliable way to reproduce the problem, but it happens quite
often, whether the VM is idle or loaded, which is not acceptable in
production.

Thanks,
Yuanle
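
For anyone who wants to check their own logs for the same symptom, here is a
minimal Python sketch that picks RCU stall warnings out of kernel log text. The
pattern is an assumption based on the message quoted above; the exact wording
differs between kernel versions, so adjust it to what your kernel prints. The
name find_rcu_stalls is made up for this example.

```python
import re

# Assumed message shape, modelled on the text quoted in this thread.
# Kernel versions word this differently, so treat it as a starting point.
RCU_STALL = re.compile(
    r"rcu_(?:sched|bh|preempt) (?:self-)?detected stalls? on CPUs?"
)


def find_rcu_stalls(log_text):
    """Return the log lines that look like RCU stall warnings."""
    return [line for line in log_text.splitlines() if RCU_STALL.search(line)]
```

It takes plain text, so it can be fed the saved output of dmesg or the
contents of a kernel log file collected from the guest or the hypervisor.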

Narayan Desai

2014-02-08 14:01:57 UTC

We had stability problems with both host and guest kernels when running
Ubuntu 12.04. We found that upgrading the hosts to the backported 3.8 kernel
helped a lot with that. We also had problems specifically on guests with a
lot of resources (32 cores, 1 TB of memory). Upgrading to a 3.11 kernel in
the guest (also a backport) resolved that issue.
-nld
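
To compare a machine against the versions mentioned above (3.8 on hosts, 3.11
in large guests), here is a small Python sketch that parses a kernel release
string and checks it against a minimum. The helper names kernel_version and
at_least are made up for this example, and only the leading numeric part of
the release is compared, so distro ABI suffixes such as "-35-generic" are
ignored.

```python
import re


def kernel_version(release):
    """Parse the leading 'major.minor[.patch]' of a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    # A missing patch level is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def at_least(release, minimum):
    """True if 'release' is the same as or newer than 'minimum'."""
    return kernel_version(release) >= kernel_version(minimum)
```

On a live system, platform.release() from the standard library returns the
running kernel's release string, so at_least(platform.release(), "3.11") is
one way to use it inside a guest.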

_______________________________________________
OpenStack-operators mailing list
OpenStack-operators at lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

George Shuklin

2014-02-09 00:40:37 UTC

I can't say which kernel is stable, but just yesterday I hit a rather
unfunny error on my lab stand with 3.8.0-35-generic (x86_64): vim went
into IO and did not come back (it stayed in the D+ state). The disk was
fine and other software was fine, but the disk's in_flight time was at
100% and the kernel started to report a 'stall' for the hung vim. I
played around for some time, but none of my tricks was able to 'free' vim
(neither reinitializing the disk nor rescanning the PCI bus).

In my case it happened after a rather brutal test: creating a snapshot
during 32 concurrent read/write operations from an instance.
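
The "D" in the state above is uninterruptible sleep, usually a process
waiting on IO (the "+" that ps adds only marks a foreground process). Here is
a rough Python sketch for listing such processes by reading /proc directly.
The function names are made up for this example, and the proc_root parameter
exists only so the scan can be pointed at a test directory.

```python
import os


def parse_stat(stat_text):
    """Extract (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the last ')' rather than on whitespace.
    left, right = stat_text.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    return int(pid), comm, right.split()[0]


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, command) for every process currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (IOError, OSError):
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found
```

A process that shows up here on every run, minutes apart, is a reasonable
candidate for the kind of hang described above; a single sighting is normal
for anything doing disk IO.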

Narayan Desai

2014-02-09 03:31:05 UTC

Host or guest?
-nld

George Shuklin

2014-02-09 15:51:54 UTC

Host.

I usually don't bother with guest problems.

I'm not sure, but I think this was the second time I hit that problem (the
first time was also during snapshot creation, but I did not dig deep enough).
The most obvious symptom is '100% disk utilization' in atop regardless of
actual IO; the second is 'stalled' messages in dmesg after 120 sec.
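
The "100% busy with no actual IO" symptom can also be checked from the
kernel's per-device counters. Here is a Python sketch assuming the
/sys/block/<dev>/stat layout documented by the kernel, where the ninth field
is the number of requests in flight and the tenth is the time in milliseconds
the device has had requests outstanding. The function names and the 0.9
threshold are arbitrary choices for this example, not something taken from
atop.

```python
def parse_block_stat(stat_text):
    """Pick the counters of interest out of /sys/block/<dev>/stat."""
    fields = [int(value) for value in stat_text.split()]
    return {
        "reads": fields[0],        # read requests completed
        "writes": fields[4],       # write requests completed
        "in_flight": fields[8],    # requests issued but not yet completed
        "io_ticks_ms": fields[9],  # time with requests outstanding, in ms
    }


def looks_stuck(before, after, interval_ms):
    """Heuristic: busy for the whole interval, yet nothing completed."""
    completed = (after["reads"] - before["reads"]) + (
        after["writes"] - before["writes"]
    )
    busy_ms = after["io_ticks_ms"] - before["io_ticks_ms"]
    return (
        after["in_flight"] > 0
        and completed == 0
        and busy_ms >= 0.9 * interval_ms
    )
```

Take two samples of the same device a known interval apart and pass both to
looks_stuck. A device that stays "busy" while completing nothing is
consistent with a request that never returns, rather than with a heavily
loaded disk.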
