kernel-packages team mailing list archive
-
kernel-packages team
-
Mailing list archive
-
Message #132074
[Bug 1446064] Re: ISST-SAN: Filesystem converted into read only after interface failover
[Expired for linux (Ubuntu) because there has been no activity for 60
days.]
** Changed in: linux (Ubuntu)
Status: Incomplete => Expired
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1446064
Title:
ISST-SAN: Filesystem converted into read only after interface failover
Status in linux package in Ubuntu:
Expired
Bug description:
== Comment: #0 ==
I was running interface failover tests on the Storage Texan2 ( TMS9840). zop03-01 has disks coming from Texan2 via NIPV.
Interface failover basically brings down a interface, so that other
inetrfaces take over, waits for 10 min and then brings it back up.
There are 4 interfaces on texan2.
While the third one was brought down, the IO on zop03-01 was stopped. I was running tests to create directories and files on the FS which were created on the multipath disks. The tests suddenly started to fail when the interface failover happened on the 3rd interface.
Now i see that when i login to the system everything is read only.
root@zop03-01:~# touch abc
touch: cannot touch ?abc?: Read-only file system
root@zop03-01:~#
root@zop03-01:~# lsb_release -sc; uname -m; uname -r
vivid
ppc64le
3.19.0-9-generic
root@zop03-01:~# less /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/mapper/mpath10-part2 / ext4 errors=remount-ro 0 1
/dev/mapper/mpath10-part3 none swap sw 0 0
kte:/kte /kte nfs soft,rw,nolock,auto,exec 0 0
kte:/data /data nfs soft,rw,nolock,auto,exec 0 0
kte:/distros /distros nfs soft,rw,nolock,auto,exec 0 0
kte:/images /images nfs soft,rw,nolock,auto,exec 0 0
root@zop03-01:~#
root@zop03-01:~# df -lh
Filesystem Size Used Avail Use% Mounted on
udev 7.5G 0 7.5G 0% /dev
tmpfs 1.6G 38M 1.5G 3% /run
/dev/sdah2 48G 29G 16G 65% /
tmpfs 7.6G 0 7.6G 0% /dev/shm
tmpfs 5.0M 128K 4.9M 3% /run/lock
tmpfs 7.6G 0 7.6G 0% /sys/fs/cgroup
tmpfs 1.6G 0 1.6G 0% /run/user/0
root@zop03-01:~#
root@zop03-01:~# fsck.ext4 /dev/sdah2
e2fsck 1.42.12 (29-Aug-2014)
/dev/sdah2: recovering journal
/dev/sdah2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (4957824, counted=5069276).
Fix<y>? yes
Free inodes count wrong (2952662, counted=2953167).
Fix<y>? yes
/dev/sdah2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sdah2: ***** REBOOT LINUX *****
/dev/sdah2: 186417/3139584 files (1.2% non-contiguous), 7488804/12558080 blocks
root@zop03-01:~#
=================
== Comment: #6 ==
There is no issue now. System is all fine. But i can recreate it
easily. If i start IO tests and parallelly start interface
failover tests, it can be recreated.
But i just dont want it to get into the same state. and you do an
fsck, reboot solves the problem.
We need to get into the root cause for why the system is going to bad
state.
== Comment: #8 ==
Re-creating the issue with more details:
The IO tests were running ( IO tests basically create directories and files on the FS).
I started the interface failover tests. It basically fails the interface on the SAN subsystem and failsover to the next available interface.There are such 4 interfaces available.The SAN is texan2.
WHen the interface was failed, The IO halted for a moment, by that i mean:
Created Directory /FS3-part1/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS4-part1/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS3-part2/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS1-part3/test1 rc = 0 at 04/17/2015 12:24:05
Created Directory /FS1-part1/test1 rc = 0 at 04/17/2015 12:24:06
Created Directory /FS7-part1/test1 rc = 0 at 04/17/2015 12:24:07
Created Directory /FS0-part2/test1 rc = 0 at 04/17/2015 12:24:08
Created Directory /FS0-part1/test1 rc = 0 at 04/17/2015 12:24:08
Created Directory /FS0-part3/test1 rc = 0 at 04/17/2015 12:24:09
=======> Here it halted
On the console, i was seeing messages like:
root@zop03-01:~# [35488.779299] sd 4:0:0:5: [sdab] Command (2A) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35488.779344] sd 4:0:0:5: [sdab] Command (2A) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35489.803377] sd 2:0:1:0: [sdf] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35489.803646] sd 2:0:1:0: [sdf] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.529306] sd 3:0:0:0: [sdk] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.538252] sd 3:0:0:1: [sdl] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.538306] sd 5:0:0:0: [sdai] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.548175] sd 5:0:0:1: [sdaj] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.548179] sd 3:0:0:2: [sdm] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.548216] sd 3:0:0:2: [sdm] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.549267] sd 5:0:0:2: [sdak] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.549340] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.549361] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.558355] sd 5:0:0:3: [sdal] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.558360] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.558367] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35494.568328] sd 5:0:0:4: [sdam] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.841752] sd 5:0:0:1: [sdaj] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.842736] sd 5:0:0:2: [sdak] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.843690] sd 5:0:0:0: [sdai] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.844659] sd 5:0:0:3: [sdal] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35524.845630] sd 5:0:0:4: [sdam] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.481757] sd 3:0:0:0: [sdk] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.482930] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.482944] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.483939] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
[35525.483954] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status:
Till here the machine was all fine, in rw state.
Now, after 4 minutes of time:
I suddenly saw the errors of IO failing:
Write BLAST Failed on /blast/log/BLAST_zop03-01_root_04_17_2015_12_20_31.prt prt file - error = 30 EROFS Read-only file system
Error accessing file /blast/log/BLAST_zop03-01_root_04_17_2015_12_20_31.prt
Error Code 30
---> Replacement print file will be created ...Error accessing file /blast/log/BLAST_zop03-01_root_04_17_2015_12_28_17.prt
Error Code -1
***********************
On Console:
2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
[35618.852794] ibmvfc 30000003: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
[35640.184808] INFO: task jbd2/sdab2-8:13145 blocked for more than 120 seconds.
[35640.184829] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.184833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.184904] INFO: task kworker/u320:1:18262 blocked for more than 120 seconds.
[35640.184910] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.184913] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185000] INFO: task latest_blast:18543 blocked for more than 120 seconds.
[35640.185005] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185008] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185057] INFO: task latest_blast:18544 blocked for more than 120 seconds.
[35640.185062] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185065] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185123] INFO: task latest_blast:18545 blocked for more than 120 seconds.
[35640.185128] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185187] INFO: task latest_blast:18546 blocked for more than 120 seconds.
[35640.185192] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185195] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185242] INFO: task latest_blast:18547 blocked for more than 120 seconds.
[35640.185247] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185297] INFO: task latest_blast:18548 blocked for more than 120 seconds.
[35640.185302] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185350] INFO: task latest_blast:18549 blocked for more than 120 seconds.
[35640.185355] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185358] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35640.185405] INFO: task latest_blast:18550 blocked for more than 120 seconds.
[35640.185410] Not tainted 3.19.0-9-generic #9-Ubuntu
[35640.185413] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35676.081382] blk_update_request: I/O error, dev sdf, sector 0
[35676.264929] blk_update_request: I/O error, dev sdj, sector 0
[35676.264951] blk_update_request: I/O error, dev sdg, sector 0
[35676.264992] blk_update_request: I/O error, dev sdh, sector 0
[35676.265026] blk_update_request: I/O error, dev sdi, sector 0
[35677.253544] blk_update_request: I/O error, dev sdab, sector 46601528
[35677.253550] blk_update_request: I/O error, dev sdaa, sector 0
[35677.253609] blk_update_request: I/O error, dev sdy, sector 0
[35677.253658] blk_update_request: I/O error, dev sdz, sector 0
[35677.253690] blk_update_request: I/O error, dev sdw, sector 0
[35677.254020] Aborting journal on device sdab2-8.
[35677.254037] EXT4-fs error (device sdab2) in add_dirent_to_buf:1699: Journal has aborted
[35677.254138] EXT4-fs error (device sdab2) in ext4_dx_add_entry:2077: Journal has aborted
[35677.254151] EXT4-fs error (device sdab2) in __ext4_new_inode:850: Journal has aborted
[35677.254204] EXT4-fs error (device sdab2) in __ext4_new_inode:850: Journal has aborted
[35677.255088] EXT4-fs (sdab2): Remounting filesystem read-only
[35677.255096] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.255100] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.255101] EXT4-fs error (device sdab2): ext4_journal_check_start:56:
[35677.255102] EXT4-fs (sdab2): Remounting filesystem read-only
[35677.255103] EXT4-fs (sdab2): Remounting filesystem read-only
[35677.255123] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.255130] EXT4-fs (sdab2): ext4_writepages: jbd2_start: 3648 pages, ino 1835459; err -30
[35677.255179] EXT4-fs error (device sdab2) in ext4_evict_inode:243: Journal has aborted
[35677.255557] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal[35677.255570] EXT4-fs error (device sdab2) in ext4_evict_inode:243: Journal has aborted
[35677.255581] EXT4-fs error (device sdab2): ext4_journal_check_start:56: EXT4-fs error (device sdab2): ext4_journal_check_start:56:
[35677.266659] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266680] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266697] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal[35677.266712] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266725] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266737] EXT4-fs error (device sdab2): ext4_journal_check_start:56:
[35677.266759] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266777] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266789] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266801] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266829] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266857] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266883] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266907] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35677.266981] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
[35690.568831] ibmvfc 30000004: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
[35691.175548] ibmvfc 30000006: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
and then i see that
root@zop03-01:~# touch abc
touch: cannot touch ?abc?: Read-only file system
root@zop03-01:~#
It has gone into RO mode again. I will atatch the dmesg and syslog in separate attachments.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1446064/+subscriptions