kernel-packages team mailing list archive

[Bug 1524259] [NEW] igb: Detected Tx Unit Hang with stack trace

 

You have been subscribed to a public bug:

Hello.

For some time now we have had a problem with one of our servers. It
happens sporadically (once every day or two) and the cause is still
unknown. We searched Launchpad and tried many suggested solutions,
but nothing helped. We tried the vanilla Ubuntu 14.04.3 kernel (3.16.x)
as well as 3.19.0-25-generic and linux-image-3.19.0-33-generic, with the
same symptoms on all of these versions. We also tried rolling back to 3.13
(3.13.0-43-generic and 3.13.0-62-generic), but the problem persists.

Our current configuration is Ubuntu 14.04.3 with kernel 3.13.0-43.72
and Xen 4.4.2-0ubuntu0.14.04.3 (this host is used as a Xen hypervisor
with an iSCSI initiator, in case that matters). Here is what happens:

kernel: [135522.062941] igb 0000:01:00.1: Detected Tx Unit Hang
kernel: [135522.062941]   Tx Queue             <5>
kernel: [135522.062941]   TDH                  <e>
kernel: [135522.062941]   TDT                  <21>
kernel: [135522.062941]   next_to_use          <21>
kernel: [135522.062941]   next_to_clean        <e>
kernel: [135522.062941] buffer_info[next_to_clean]
kernel: [135522.062941]   time_stamp           <10203c3ca>
kernel: [135522.062941]   next_to_watch        <ffff8800bac590f0>
kernel: [135522.062941]   jiffies              <10203c4e6>
kernel: [135522.062941]   desc.status          <1c8200>
kernel: [135526.063054]   desc.status          <0>
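
For anyone triaging a report like the one above, here is a minimal Python sketch that turns the hex fields into numbers. It is not part of the igb driver or any tool. The function names are made up for illustration, and the ring size of 256 descriptors is an assumption (the actual size on this host can be checked with `ethtool -g eth1`).

```python
import re

# Excerpt of the report above, with the "kernel: [timestamp]" prefix stripped.
HANG_REPORT = """\
igb 0000:01:00.1: Detected Tx Unit Hang
  Tx Queue             <5>
  TDH                  <e>
  TDT                  <21>
  next_to_use          <21>
  next_to_clean        <e>
  time_stamp           <10203c3ca>
  jiffies              <10203c4e6>
"""

# Matches lines of the form "  name   <hexvalue>".
FIELD_RE = re.compile(r"^\s*(\S.*?)\s+<([0-9a-fA-F]+)>\s*$")

# ASSUMPTION: Tx ring of 256 descriptors; not stated anywhere in the report.
ASSUMED_RING_SIZE = 256


def parse_hang_report(text):
    """Return a dict mapping field name to its value parsed as hexadecimal."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


def pending_descriptors(fields, ring_size=ASSUMED_RING_SIZE):
    """Descriptors between the head (TDH) and tail (TDT) pointers, allowing for wrap-around."""
    return (fields["TDT"] - fields["TDH"]) % ring_size


def stalled_jiffies(fields):
    """Jiffies elapsed between the buffer's time_stamp and the time of the report."""
    return fields["jiffies"] - fields["time_stamp"]


fields = parse_hang_report(HANG_REPORT)
print(pending_descriptors(fields))  # 19
print(stalled_jiffies(fields))      # 284
```

With these values, 19 descriptors were queued but not yet completed on queue 5, and the oldest one had been waiting 284 jiffies when the hang was reported.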

There are many Tx Unit Hang messages like the one above. Right after them we see reports such as:
kernel: [135526.982825]  connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4328767466, last ping 4328768718, now 4328769972
kernel: [135526.982911]  connection2:0: detected conn error (1011)
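
The three counters in the ping timeout line appear to be jiffies values, so their differences can be converted to seconds. The sketch below does that arithmetic in Python. The tick rate of 250 Hz is an assumption (it can be checked with `grep CONFIG_HZ= /boot/config-$(uname -r)`), and `ping_timing` is a name made up for this example.

```python
import re

# The message from the log above, without the kernel timestamp prefix.
LINE = (
    "connection2:0: ping timeout of 5 secs expired, recv timeout 5, "
    "last rx 4328767466, last ping 4328768718, now 4328769972"
)

# ASSUMPTION: the kernel tick rate is 250 Hz; verify against the running kernel's config.
ASSUMED_HZ = 250


def ping_timing(line, hz=ASSUMED_HZ):
    """Return jiffies and seconds elapsed since the last receive and the last ping."""
    match = re.search(r"last rx (\d+), last ping (\d+), now (\d+)", line)
    if match is None:
        raise ValueError("not an iSCSI ping timeout line")
    last_rx, last_ping, now = (int(g) for g in match.groups())
    return {
        "jiffies_since_last_rx": now - last_rx,
        "jiffies_since_last_ping": now - last_ping,
        "seconds_since_last_rx": (now - last_rx) / hz,
        "seconds_since_last_ping": (now - last_ping) / hz,
    }


timing = ping_timing(LINE)
print(timing["jiffies_since_last_ping"])  # 1254
print(timing["jiffies_since_last_rx"])    # 2506
```

Under the 250 Hz assumption this works out to roughly 5 seconds since the last ping and roughly 10 seconds since anything was received, which is consistent with the 5 second ping and receive timeouts named in the message.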

And finally:

kernel: [135527.014836] WARNING: CPU: 8 PID: 0 at /build/buildd/linux-3.13.0/net/sched/sch_generic.c:264 dev_watchdog+0x276/0x280()
kernel: [135527.014839] NETDEV WATCHDOG: eth1 (igb): transmit queue 4 timed out
kernel: [135527.014841] Modules linked in: xt_physdev xen_netback xen_blkback cls_u32 sch_sfq sch_htb xt_tcpudp iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xen_gntdev xen_evtchn xenfs xen_privcmd ip6_tables ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi gpio_ich joydev ioatdma serio_raw mac_hid shpchp lpc_ich i7core_edac intel_powerclamp coretemp edac_core lp parport hid_generic usbhid hid raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear iptable_raw nf_nat nf_conntrack iptable_mangle iptable_filter psmouse ip_tables igb x_tables ahci libahci i2c_algo_bit dca ptp bridge pps_core 8021q garp stp llc mrp
kernel: [135527.014903] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 3.13.0-43-generic #72-Ubuntu
kernel: [135527.014905] Hardware name: Supermicro X8DTU/X8DTU, BIOS 2.1c       08/03/2012
kernel: [135527.014907]  0000000000000009 ffff880268103d98 ffffffff81720bf6 ffff880268103de0
kernel: [135527.014912]  ffff880268103dd0 ffffffff810677cd 0000000000000004 ffff880250b18000
kernel: [135527.014916]  ffff8800030e5940 0000000000000008 0000000000000008 ffff880268103e30
kernel: [135527.014920] Call Trace:
kernel: [135527.014923]  <IRQ>  [<ffffffff81720bf6>] dump_stack+0x45/0x56
kernel: [135527.014934]  [<ffffffff810677cd>] warn_slowpath_common+0x7d/0xa0
kernel: [135527.014937]  [<ffffffff8106783c>] warn_slowpath_fmt+0x4c/0x50
kernel: [135527.014943]  [<ffffffff81645686>] dev_watchdog+0x276/0x280
kernel: [135527.014947]  [<ffffffff81645410>] ? dev_graft_qdisc+0x80/0x80
kernel: [135527.014952]  [<ffffffff81074386>] call_timer_fn+0x36/0x100
kernel: [135527.014955]  [<ffffffff81645410>] ? dev_graft_qdisc+0x80/0x80
kernel: [135527.014959]  [<ffffffff8107531f>] run_timer_softirq+0x1ef/0x2f0
kernel: [135527.014964]  [<ffffffff8106cc1c>] __do_softirq+0xec/0x2c0
kernel: [135527.014969]  [<ffffffff8106d165>] irq_exit+0x105/0x110
kernel: [135527.014976]  [<ffffffff814340f5>] xen_evtchn_do_upcall+0x35/0x50
kernel: [135527.014981]  [<ffffffff8173313e>] xen_do_hypervisor_callback+0x1e/0x30
kernel: [135527.014982]  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
kernel: [135527.014990]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
kernel: [135527.014996]  [<ffffffff81009e20>] ? xen_safe_halt+0x10/0x20
kernel: [135527.015001]  [<ffffffff8101caaf>] ? default_idle+0x1f/0xc0
kernel: [135527.015005]  [<ffffffff8101d376>] ? arch_cpu_idle+0x26/0x30
kernel: [135527.015010]  [<ffffffff810bef35>] ? cpu_startup_entry+0xc5/0x290
kernel: [135527.015015]  [<ffffffff810101b8>] ? cpu_bringup_and_idle+0x18/0x20
kernel: [135527.015018] ---[ end trace 431e88429488f9a4 ]---
kernel: [135527.015044] igb 0000:01:00.1 eth1: Reset adapter

After that, the network connection to this machine is dead. It tries to
reconnect continuously, but without success.

After rolling back to the 3.13.0-43 kernel we had no problems for about
a week, but now it keeps crashing with the above error. I'm not sure how
to diagnose this, so I need assistance. Thanks.

This is what dmesg shows about the NICs:
[   15.220822] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.0.5-k
[   15.220882] igb: Copyright (c) 2007-2013 Intel Corporation.
[   15.421684] igb 0000:01:00.0: added PHC on eth0
[   15.421770] igb 0000:01:00.0: Intel(R) Gigabit Ethernet Network Connection
[   15.421827] igb 0000:01:00.0: eth0: (PCIe:2.5Gb/s:Width x4) 00:25:90:00:cc:fc
[   15.421885] igb 0000:01:00.0: eth0: PBA No: Unknown
[   15.421939] igb 0000:01:00.0: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[   15.621679] igb 0000:01:00.1: added PHC on eth1
[   15.621747] igb 0000:01:00.1: Intel(R) Gigabit Ethernet Network Connection
[   15.621815] igb 0000:01:00.1: eth1: (PCIe:2.5Gb/s:Width x4) 00:25:90:00:cc:fd
[   15.621885] igb 0000:01:00.1: eth1: PBA No: Unknown
[   15.621949] igb 0000:01:00.1: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[   24.581560] igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[   30.941733] igb: eth1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[   30.941851] igb 0000:01:00.1 eth1: Link Speed was downgraded by SmartSpeed
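
One detail in the dmesg output above is easy to miss: eth1, the interface that hangs, came up at 100 Mbps after a SmartSpeed downgrade, while eth0 came up at 1000 Mbps. The following Python sketch extracts that from the log lines. The regular expressions are written against the exact messages shown here and are an assumption for any other driver version. `link_summary` is a hypothetical helper name.

```python
import re

# The three link messages from the dmesg output above.
DMESG = """\
[   24.581560] igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[   30.941733] igb: eth1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[   30.941851] igb 0000:01:00.1 eth1: Link Speed was downgraded by SmartSpeed
"""

LINK_UP_RE = re.compile(r"igb: (\w+) NIC Link is Up (\d+) Mbps")
DOWNGRADE_RE = re.compile(r"igb \S+ (\w+): Link Speed was downgraded by SmartSpeed")


def link_summary(text):
    """Map each interface to its negotiated speed and whether SmartSpeed downgraded it."""
    speeds = {m.group(1): int(m.group(2)) for m in LINK_UP_RE.finditer(text)}
    downgraded = set(DOWNGRADE_RE.findall(text))
    return {
        ifname: {"mbps": mbps, "smartspeed_downgrade": ifname in downgraded}
        for ifname, mbps in speeds.items()
    }


for ifname, info in link_summary(DMESG).items():
    print(ifname, info["mbps"], info["smartspeed_downgrade"])
# eth0 1000 False
# eth1 100 True
```

A SmartSpeed downgrade usually means the link could not be established at gigabit speed, so the cable and switch port on eth1 may be worth checking alongside the driver. That is a suggestion, not a confirmed cause.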

And here is the ethtool output:
Features for eth1:
rx-checksumming: on
tx-checksumming: on
	tx-checksum-ipv4: on
	tx-checksum-ip-generic: off [fixed]
	tx-checksum-ipv6: on
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: on
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: off [fixed]
	tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-mpls-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: on
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
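
When narrowing down a problem like this, it can help to know which offloads are currently on and not marked `[fixed]`, since only those can be changed with `ethtool -K`. Below is a minimal Python sketch that filters a feature listing in the format shown above. It works on an excerpt of the list, and `toggleable_on` is a hypothetical helper name. Whether turning any of these off affects the hang is untested here.

```python
# Excerpt of the feature list above; sub-features are tab-indented as in the original.
FEATURES = """\
Features for eth1:
rx-checksumming: on
tx-checksumming: on
\ttx-checksum-ipv4: on
\ttx-checksum-ip-generic: off [fixed]
scatter-gather: on
tcp-segmentation-offload: on
\ttx-tcp-segmentation: on
\ttx-tcp-ecn-segmentation: off [fixed]
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
highdma: on [fixed]
"""


def toggleable_on(text):
    """Return names of features that are 'on' and not marked '[fixed]'."""
    names = []
    for line in text.splitlines():
        name, sep, state = line.strip().partition(": ")
        if not sep:
            continue  # header line such as "Features for eth1:"
        words = state.split()
        if words and words[0] == "on" and "[fixed]" not in words:
            names.append(name)
    return names


for name in toggleable_on(FEATURES):
    print(name)
```

For the excerpt above this prints eight names, from rx-checksumming through generic-receive-offload, and skips highdma because it is fixed.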

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: bot-comment
-- 
igb: Detected Tx Unit Hang with stack trace
https://bugs.launchpad.net/bugs/1524259
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.