← Back to team overview

kernel-packages team mailing list archive

[Bug 1446064] [NEW] ISST-SAN: Filesystem converted into read only after interface failover

 

You have been subscribed to a public bug:

== Comment: #0  ==
I was running interface failover tests on the Storage Texan2 ( TMS9840). zop03-01 has disks coming from Texan2 via NIPV.

Interface failover basically brings down a interface, so that other
inetrfaces take over, waits for 10 min and then brings it back up. There
are 4 interfaces on texan2.


While the third one was brought down, the IO on zop03-01 was stopped. I was running tests to create directories and files on the FS which were created on the multipath disks. The tests suddenly started to fail when the interface failover happened on the 3rd interface.

Now i see that when i login to the system everything is read only.

root@zop03-01:~# touch abc
touch: cannot touch ?abc?: Read-only file system
root@zop03-01:~#


root@zop03-01:~# lsb_release -sc; uname -m; uname -r
vivid
ppc64le
3.19.0-9-generic


root@zop03-01:~# less /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
/dev/mapper/mpath10-part2 /               ext4    errors=remount-ro 0       1
/dev/mapper/mpath10-part3 none            swap    sw              0       0
kte:/kte        /kte        nfs     soft,rw,nolock,auto,exec       0 0
kte:/data       /data        nfs     soft,rw,nolock,auto,exec       0 0
kte:/distros    /distros         nfs     soft,rw,nolock,auto,exec       0 0
kte:/images     /images        nfs     soft,rw,nolock,auto,exec       0 0
root@zop03-01:~# 


root@zop03-01:~# df -lh
Filesystem      Size  Used Avail Use% Mounted on
udev            7.5G     0  7.5G   0% /dev
tmpfs           1.6G   38M  1.5G   3% /run
/dev/sdah2       48G   29G   16G  65% /
tmpfs           7.6G     0  7.6G   0% /dev/shm
tmpfs           5.0M  128K  4.9M   3% /run/lock
tmpfs           7.6G     0  7.6G   0% /sys/fs/cgroup
tmpfs           1.6G     0  1.6G   0% /run/user/0
root@zop03-01:~# 


root@zop03-01:~# fsck.ext4 /dev/sdah2
e2fsck 1.42.12 (29-Aug-2014)
/dev/sdah2: recovering journal
/dev/sdah2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (4957824, counted=5069276).
Fix<y>? yes
Free inodes count wrong (2952662, counted=2953167).
Fix<y>? yes

/dev/sdah2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sdah2: ***** REBOOT LINUX *****
/dev/sdah2: 186417/3139584 files (1.2% non-contiguous), 7488804/12558080 blocks
root@zop03-01:~# 
=================

== Comment: #6  ==

There is no issue now. System is all fine. But i can recreate it easily.
If i start        IO tests and parallelly start interface failover
tests, it can be recreated.

But i just dont want it to get into the same state. and you do an fsck,
reboot solves the problem.

We need to get into the root cause for why the system is going to bad
state.


== Comment: #8  ==

Re-creating the issue with more details:

The IO tests were running ( IO tests basically create directories and files on the FS).
I started the interface failover tests. It basically fails the interface on the SAN subsystem and failsover to the next available interface.There are such 4 interfaces available.The SAN is texan2.


WHen the interface was failed, The IO halted for a moment, by that i mean:

Created Directory /FS3-part1/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS4-part1/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS3-part2/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS1-part3/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS1-part1/test1 rc = 0 at 04/17/2015 12:24:06
Created Directory /FS7-part1/test1 rc = 0 at 04/17/2015 12:24:07
Created Directory /FS0-part2/test1 rc = 0 at 04/17/2015 12:24:08
Created Directory /FS0-part1/test1 rc = 0 at 04/17/2015 12:24:08
Created Directory /FS0-part3/test1 rc = 0 at 04/17/2015 12:24:09



=======> Here it halted

On the console, i was seeing messages like:

root@zop03-01:~# [35488.779299] sd 4:0:0:5: [sdab] Command (2A) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35488.779344] sd 4:0:0:5: [sdab] Command (2A) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35489.803377] sd 2:0:1:0: [sdf] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35489.803646] sd 2:0:1:0: [sdf] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.529306] sd 3:0:0:0: [sdk] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.538252] sd 3:0:0:1: [sdl] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.538306] sd 5:0:0:0: [sdai] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.548175] sd 5:0:0:1: [sdaj] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.548179] sd 3:0:0:2: [sdm] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.548216] sd 3:0:0:2: [sdm] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.549267] sd 5:0:0:2: [sdak] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.549340] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.549361] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.558355] sd 5:0:0:3: [sdal] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.558360] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.558367] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.568328] sd 5:0:0:4: [sdam] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.841752] sd 5:0:0:1: [sdaj] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.842736] sd 5:0:0:2: [sdak] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.843690] sd 5:0:0:0: [sdai] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.844659] sd 5:0:0:3: [sdal] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.845630] sd 5:0:0:4: [sdam] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.481757] sd 3:0:0:0: [sdk] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.482930] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.482944] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.483939] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.483954] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 

Till here the machine was all fine, in rw state.


Now, after 4 minutes of time:

I suddenly saw the errors of IO failing:
Write BLAST Failed on /blast/log/BLAST_zop03-01_root_04_17_2015_12_20_31.prt prt file - error = 30 EROFS         Read-only file system
Error accessing file  /blast/log/BLAST_zop03-01_root_04_17_2015_12_20_31.prt
Error Code 30
---> Replacement print file will be created ...Error accessing file  /blast/log/BLAST_zop03-01_root_04_17_2015_12_28_17.prt
Error Code -1

***********************


On Console:
2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
[35618.852794] ibmvfc 30000003: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
[35640.184808] INFO: task jbd2/sdab2-8:13145 blocked for more than 120 seconds.
[35640.184829]       Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.184833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.184904] INFO: task kworker/u320:1:18262 blocked for more than 120 seconds.
[35640.184910]       Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.184913] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185000] INFO: task latest_blast:18543 blocked for more than 120 seconds.
[35640.185005]       Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185008] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185057] INFO: task latest_blast:18544 blocked for more than 120 seconds.
[35640.185062]       Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185065] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185123] INFO: task latest_blast:18545 blocked for more than 120 seconds.
[35640.185128]       Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185187] INFO: task latest_blast:18546 blocked for more than 120 seconds.
[35640.185192]       Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185195] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185242] INFO: task latest_blast:18547 blocked for more than 120 seconds.
[35640.185247]       Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185297] INFO: task latest_blast:18548 blocked for more than 120 seconds.
[35640.185302]       Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185350] INFO: task latest_blast:18549 blocked for more than 120 seconds.
[35640.185355]       Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185358] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185405] INFO: task latest_blast:18550 blocked for more than 120 seconds.
[35640.185410]       Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185413] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35676.081382] blk_update_request: I/O error, dev sdf, sector 0
[35676.264929] blk_update_request: I/O error, dev sdj, sector 0
[35676.264951] blk_update_request: I/O error, dev sdg, sector 0
[35676.264992] blk_update_request: I/O error, dev sdh, sector 0
[35676.265026] blk_update_request: I/O error, dev sdi, sector 0
[35677.253544] blk_update_request: I/O error, dev sdab, sector 46601528
[35677.253550] blk_update_request: I/O error, dev sdaa, sector 0
[35677.253609] blk_update_request: I/O error, dev sdy, sector 0
[35677.253658] blk_update_request: I/O error, dev sdz, sector 0
[35677.253690] blk_update_request: I/O error, dev sdw, sector 0
[35677.254020] Aborting journal on device sdab2-8.
[35677.254037] EXT4-fs error (device sdab2) in add_dirent_to_buf:1699: Journal has aborted
[35677.254138] EXT4-fs error (device sdab2) in ext4_dx_add_entry:2077: Journal has aborted
[35677.254151] EXT4-fs error (device sdab2) in __ext4_new_inode:850: Journal has aborted
[35677.254204] EXT4-fs error (device sdab2) in __ext4_new_inode:850: Journal has aborted
[35677.255088] EXT4-fs (sdab2): Remounting filesystem read-only
[35677.255096] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.255100] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.255101] EXT4-fs error (device sdab2): ext4_journal_check_start:56:
[35677.255102] EXT4-fs (sdab2): Remounting filesystem read-only
[35677.255103] EXT4-fs (sdab2): Remounting filesystem read-only
[35677.255123] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.255130] EXT4-fs (sdab2): ext4_writepages: jbd2_start: 3648 pages, ino 1835459; err -30
[35677.255179] EXT4-fs error (device sdab2) in ext4_evict_inode:243: Journal has aborted
[35677.255557] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal[35677.255570] EXT4-fs error (device sdab2) in ext4_evict_inode:243: Journal has aborted
[35677.255581] EXT4-fs error (device sdab2): ext4_journal_check_start:56: EXT4-fs error (device sdab2): ext4_journal_check_start:56:
[35677.266659] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266680] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266697] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal[35677.266712] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266725] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266737] EXT4-fs error (device sdab2): ext4_journal_check_start:56:
[35677.266759] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266777] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266789] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266801] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266829] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266857] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266883] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266907] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266981] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35690.568831] ibmvfc 30000004: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
[35691.175548] ibmvfc 30000006: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7


and then i see that

root@zop03-01:~# touch abc
touch: cannot touch ?abc?: Read-only file system
root@zop03-01:~#


It has gone into RO mode again. I will atatch the dmesg and syslog in separate attachments.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: architecture-ppc64le bot-comment bugnameltc-124029 severity-critical targetmilestone-inin---
-- 
ISST-SAN: Filesystem converted into read only after interface failover
https://bugs.launchpad.net/bugs/1446064
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.