kernel-packages team mailing list archive
Message #132797
[Bug 1331513] Re: 14e4:165f tg3 eth1: transmit timed out, resetting on BCM5720
I was able to consistently reproduce this issue on Debian Wheezy by connecting two Dell PowerEdge R620 servers directly, running constant scp transfers of large files back and forth, and bringing the interface down and up in a loop until it breaks (while [ true ]; do ip link set ${DEV} down; sleep 1; ip link set ${DEV} up; sleep 9; done).
I then updated the tg3 driver to version 3.137h, but the issue was still reproducible.
I tried RedHat 7.1 and it works fine (kernel 3.10.0-229.el7.x86_64, tg3 3.137).
I then tried Debian Jessie (kernel 3.16, tg3 3.137), and the issue is not reproducible there.
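The link-flapping half of the reproduction above could also be scripted in Python. This is only a sketch under assumptions: the function name flap_link, its parameters, and the injectable run/sleep hooks are hypothetical and were not part of the original test. The only commands it issues are the "ip link set ... down/up" calls from the loop above, with the same 1 s and 9 s pauses. It needs root and is meant to run alongside the scp transfers.

```python
import subprocess
import time


def flap_link(dev, cycles, down_secs=1, up_secs=9, run=None, sleep=time.sleep):
    """Toggle a network interface down and up `cycles` times.

    `run` and `sleep` are hooks (an assumption of this sketch) so the
    sequence can be dry-run without touching a real interface.
    Returns the list of commands that were issued, in order.
    """
    if run is None:
        # Default: really execute the command and fail loudly on errors.
        def run(cmd):
            subprocess.run(cmd, check=True)

    issued = []
    for _ in range(cycles):
        # Same order and timing as the shell loop: down, 1 s, up, 9 s.
        for state, pause in (("down", down_secs), ("up", up_secs)):
            cmd = ["ip", "link", "set", dev, state]
            run(cmd)
            issued.append(cmd)
            sleep(pause)
    return issued
```

A dry run such as flap_link("eth0", 2, run=print, sleep=lambda s: None) prints the four commands without changing any interface state.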
Logs from Debian Wheezy:
Aug 28 14:16:27 bond-111 kernel: [ 519.336495] ------------[ cut here ]------------
Aug 28 14:16:27 bond-111 kernel: [ 519.336505] WARNING: at /build/linux-l1NKWv/linux-3.2.68/net/sched/sch_generic.c:256 dev_watchdog+0xf2/0x151()
Aug 28 14:16:27 bond-111 kernel: [ 519.336508] Hardware name: PowerEdge R620
Aug 28 14:16:27 bond-111 kernel: [ 519.336510] NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
Aug 28 14:16:27 bond-111 kernel: [ 519.336513] Modules linked in: drbd lru_cache nf_conntrack_tftp nf_conntrack virtio_net virtio_blk virtio_pci virtio_ring virtio kvm bonding sb_edac coretemp snd_pcm crc32c_intel ghash_clmulni_intel aesni_intel snd_page_alloc snd_timer snd soundcore aes_x86_64 edac_core joydev pcspkr shpchp iTCO_wdt iTCO_vendor_support evdev dcdbas aes_generic cryptd processor button thermal_sys acpi_power_meter wmi ext3 mbcache jbd microcode usbhid hid sg sr_mod sd_mod cdrom crc_t10dif ahci libahci libata ehci_hcd megaraid_sas scsi_mod usbcore usb_common tg3 libphy [last unloaded: drbd]
Aug 28 14:16:27 bond-111 kernel: [ 519.336580] Pid: 29511, comm: scp Not tainted 3.2.0-4-amd64 #1 Debian 3.2.68-1+deb7u2
Aug 28 14:16:27 bond-111 kernel: [ 519.336583] Call Trace:
Aug 28 14:16:27 bond-111 kernel: [ 519.336585] <IRQ> [<ffffffff81046dbd>] ? warn_slowpath_common+0x78/0x8c
Aug 28 14:16:27 bond-111 kernel: [ 519.336599] [<ffffffff81046e69>] ? warn_slowpath_fmt+0x45/0x4a
Aug 28 14:16:27 bond-111 kernel: [ 519.336605] [<ffffffff812a8c91>] ? netif_tx_lock+0x40/0x75
Aug 28 14:16:27 bond-111 kernel: [ 519.336609] [<ffffffff812a8e01>] ? dev_watchdog+0xf2/0x151
Aug 28 14:16:27 bond-111 kernel: [ 519.336613] [<ffffffff810525f4>] ? run_timer_softirq+0x19a/0x261
Aug 28 14:16:27 bond-111 kernel: [ 519.336615] [<ffffffff812a8d0f>] ? netif_tx_unlock+0x49/0x49
Aug 28 14:16:27 bond-111 kernel: [ 519.336618] [<ffffffff8104c46a>] ? __do_softirq+0xb9/0x177
Aug 28 14:16:27 bond-111 kernel: [ 519.336622] [<ffffffff813583ec>] ? call_softirq+0x1c/0x30
Aug 28 14:16:27 bond-111 kernel: [ 519.336627] [<ffffffff8100fa91>] ? do_softirq+0x3c/0x7b
Aug 28 14:16:27 bond-111 kernel: [ 519.336630] [<ffffffff8104c6d2>] ? irq_exit+0x3c/0x99
Aug 28 14:16:27 bond-111 kernel: [ 519.336632] [<ffffffff8100f66a>] ? do_IRQ+0x82/0x98
Aug 28 14:16:27 bond-111 kernel: [ 519.336637] [<ffffffff813513ee>] ? common_interrupt+0x6e/0x6e
Aug 28 14:16:27 bond-111 kernel: [ 519.336638] <EOI> [<ffffffff811b43cd>] ? copy_user_generic_string+0x2d/0x40
Aug 28 14:16:27 bond-111 kernel: [ 519.336647] [<ffffffff810b4d6a>] ? iov_iter_copy_from_user_atomic+0x70/0x93
Aug 28 14:16:27 bond-111 kernel: [ 519.336651] [<ffffffff810b580f>] ? generic_file_buffered_write+0x143/0x259
Aug 28 14:16:27 bond-111 kernel: [ 519.336656] [<ffffffff810b6618>] ? __generic_file_aio_write+0x248/0x278
Aug 28 14:16:27 bond-111 kernel: [ 519.336659] [<ffffffff81036638>] ? should_resched+0x5/0x23
Aug 28 14:16:27 bond-111 kernel: [ 519.336663] [<ffffffff810b66a5>] ? generic_file_aio_write+0x5d/0xb5
Aug 28 14:16:27 bond-111 kernel: [ 519.336668] [<ffffffff810fae68>] ? do_sync_write+0xb4/0xec
Aug 28 14:16:27 bond-111 kernel: [ 519.336672] [<ffffffff811656d1>] ? security_file_permission+0x16/0x2d
Aug 28 14:16:27 bond-111 kernel: [ 519.336674] [<ffffffff810fb559>] ? vfs_write+0xa2/0xe9
Aug 28 14:16:27 bond-111 kernel: [ 519.336677] [<ffffffff810fb736>] ? sys_write+0x45/0x6b
Aug 28 14:16:27 bond-111 kernel: [ 519.336680] [<ffffffff813561b2>] ? system_call_fastpath+0x16/0x1b
Aug 28 14:16:27 bond-111 kernel: [ 519.336682] ---[ end trace 5a2a84798c0630dd ]---
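To notice the failure automatically while the loop runs, one could scan the kernel log for the watchdog message shown in the trace above. A minimal sketch, assuming the message always has the same shape as the NETDEV WATCHDOG line in this log; the function name and the regular expression are illustrative, not taken from any tool used in the report.

```python
import re

# Matches e.g. "NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<dev>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(lines):
    """Return (device, driver, queue) for every watchdog timeout line."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append(
                (match.group("dev"), match.group("driver"), int(match.group("queue")))
            )
    return hits
```

Feeding it the lines of a syslog file, for instance via open(path) with a path of your choosing, yields one tuple per timeout event.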
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1331513
Title:
14e4:165f tg3 eth1: transmit timed out, resetting on BCM5720
Status in The Dell-poweredge project:
Triaged
Status in linux package in Ubuntu:
Triaged
Bug description:
We have a problem with Dell PowerEdge machines that have the Broadcom
5720 chip. We see this problem on generation 12 systems, across
different models (R420, R620), with several combinations of BIOS
firmware, lifecycle firmware, etc. We see this on several versions
of the Linux kernel, ranging from 3.2.x up to 3.11, with several
versions of the tg3 driver, including a manually compiled latest
version (3.133d) loaded in a 3.11 kernel. The latest machine on which
we can reproduce the problem has Ubuntu Precise installed, but we also
see this behaviour on Debian machines. We run Xen on it, running HVM
hosts on it. Storage is handled over iSCSI, and the iSCSI interface is
the one on which we can trigger this bug reproducibly. We have the
impression it also happens on other interfaces, but there we don't
have a reproducible setup.
All this points to the tg3 driver and/or the hardware below it not
handling certain data streams or data patterns correctly, and
eventually crashing the system. It seems unrelated to the kernel
version, the Xen version, the number of VMs running, the firmware
versions and revisions, etc.
We have been trying to pinpoint this for over a year now, without
being able to create a scenario in which we could reproduce it. As of
this week, we finally have a specific setup in which we can trigger
the error within a reasonable time.
The error is triggered by running a certain VM on the Xen stack and,
inside that VM, importing a mysqldump into a running mysql instance.
The VM has its traffic on an iSCSI volume, so this effectively
generates a data stream over the eth1 interface of the machine. Within
a short amount of time, the system crashes in two steps. We first see
a timeout in the tg3 driver on the eth1 interface (dmesg output
section attached). This sometimes repeats two or three times, and
finally, in step 2, the machine freezes and reboots.
While debugging, we noticed that the bug goes away when we disable sg
offloading with ethtool.
If you need any additional info, feel free to ask.
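The workaround mentioned above (disabling sg offloading with ethtool) could be applied and checked as sketched below. The report does not give the exact command, so this assumes the usual "ethtool -K <dev> sg off" form and reads the state back from "ethtool -k <dev>". The function names are hypothetical, and the parser assumes the feature is listed on a line starting with "scatter-gather:".

```python
import subprocess


def sg_off_command(dev):
    """Build the ethtool command that turns scatter-gather off for `dev`."""
    return ["ethtool", "-K", dev, "sg", "off"]


def scatter_gather_state(features_text):
    """Extract the top-level scatter-gather state from `ethtool -k` output.

    Returns "on", "off", or None if no such line is present.
    """
    for line in features_text.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and name == "scatter-gather":
            fields = value.split()
            return fields[0] if fields else None
    return None


def disable_sg(dev, run=subprocess.run):
    """Disable scatter-gather on `dev` (needs root) and return the new state."""
    run(sg_off_command(dev), check=True)
    result = run(["ethtool", "-k", dev], check=True, capture_output=True, text=True)
    return scatter_gather_state(result.stdout)
```

On the machine from this report, disable_sg("eth1") would be the call to try; a return value of "off" indicates the setting took effect. The setting does not persist across reboots unless it is also added to the interface configuration.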
ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.11.0-19-generic 3.11.0-19.33~precise1
ProcVersionSignature: Ubuntu 3.11.0-19.33~precise1-generic 3.11.10.5
Uname: Linux 3.11.0-19-generic x86_64
AlsaDevices:
total 0
crw-rw---T 1 root audio 116, 1 Jun 18 16:36 seq
crw-rw---T 1 root audio 116, 33 Jun 18 16:36 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.0.1-0ubuntu17.6
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Date: Wed Jun 18 16:47:27 2014
HibernationDevice: RESUME=UUID=f3577e02-64e3-4cab-b6e7-f30efa111565
InstallationMedia: Ubuntu-Server 12.04.4 LTS "Precise Pangolin" - Release amd64 (20140204)
MachineType: Dell Inc. PowerEdge R420
MarkForUpload: True
PciMultimedia:
ProcFB:
ProcKernelCmdLine: placeholder root=UUID=bbc71780-90bf-4647-b579-e48d5d8c2bce ro vga=0x317
RelatedPackageVersions:
linux-restricted-modules-3.11.0-19-generic N/A
linux-backports-modules-3.11.0-19-generic N/A
linux-firmware 1.79.12
RfKill: Error: [Errno 2] No such file or directory
SourcePackage: linux-lts-saucy
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 01/20/2014
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.1.2
dmi.board.name: 0JD6X3
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr2.1.2:bd01/20/2014:svnDellInc.:pnPowerEdgeR420:pvr:rvnDellInc.:rn0JD6X3:rvrA00:cvnDellInc.:ct23:cvr:
dmi.product.name: PowerEdge R420
dmi.sys.vendor: Dell Inc.