← Back to team overview

kernel-packages team mailing list archive

[Bug 1446064] Re: ISST-SAN: Filesystem converted into read only after interface failover

 

** Changed in: linux (Ubuntu)
       Status: New => Confirmed

** Changed in: linux (Ubuntu)
   Importance: Undecided => Medium

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1446064

Title:
  ISST-SAN: Filesystem converted into read only after interface failover

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  == Comment: #0  ==
  I was running interface failover tests on the Storage Texan2 ( TMS9840). zop03-01 has disks coming from Texan2 via NIPV.

  Interface failover basically brings down a interface, so that other
  inetrfaces take over, waits for 10 min and then brings it back up.
  There are 4 interfaces on texan2.

  
  While the third one was brought down, the IO on zop03-01 was stopped. I was running tests to create directories and files on the FS which were created on the multipath disks. The tests suddenly started to fail when the interface failover happened on the 3rd interface.

  Now i see that when i login to the system everything is read only.

  root@zop03-01:~# touch abc
  touch: cannot touch ?abc?: Read-only file system
  root@zop03-01:~#

  
  root@zop03-01:~# lsb_release -sc; uname -m; uname -r
  vivid
  ppc64le
  3.19.0-9-generic

  
  root@zop03-01:~# less /etc/fstab
  # /etc/fstab: static file system information.
  #
  # Use 'blkid' to print the universally unique identifier for a
  # device; this may be used with UUID= as a more robust way to name devices
  # that works even if disks are added and removed. See fstab(5).
  #
  # <file system> <mount point>   <type>  <options>       <dump>  <pass>
  /dev/mapper/mpath10-part2 /               ext4    errors=remount-ro 0       1
  /dev/mapper/mpath10-part3 none            swap    sw              0       0
  kte:/kte        /kte        nfs     soft,rw,nolock,auto,exec       0 0
  kte:/data       /data        nfs     soft,rw,nolock,auto,exec       0 0
  kte:/distros    /distros         nfs     soft,rw,nolock,auto,exec       0 0
  kte:/images     /images        nfs     soft,rw,nolock,auto,exec       0 0
  root@zop03-01:~# 

  
  root@zop03-01:~# df -lh
  Filesystem      Size  Used Avail Use% Mounted on
  udev            7.5G     0  7.5G   0% /dev
  tmpfs           1.6G   38M  1.5G   3% /run
  /dev/sdah2       48G   29G   16G  65% /
  tmpfs           7.6G     0  7.6G   0% /dev/shm
  tmpfs           5.0M  128K  4.9M   3% /run/lock
  tmpfs           7.6G     0  7.6G   0% /sys/fs/cgroup
  tmpfs           1.6G     0  1.6G   0% /run/user/0
  root@zop03-01:~# 

  
  root@zop03-01:~# fsck.ext4 /dev/sdah2
  e2fsck 1.42.12 (29-Aug-2014)
  /dev/sdah2: recovering journal
  /dev/sdah2 contains a file system with errors, check forced.
  Pass 1: Checking inodes, blocks, and sizes
  Pass 2: Checking directory structure
  Pass 3: Checking directory connectivity
  Pass 4: Checking reference counts
  Pass 5: Checking group summary information
  Free blocks count wrong (4957824, counted=5069276).
  Fix<y>? yes
  Free inodes count wrong (2952662, counted=2953167).
  Fix<y>? yes

  /dev/sdah2: ***** FILE SYSTEM WAS MODIFIED *****
  /dev/sdah2: ***** REBOOT LINUX *****
  /dev/sdah2: 186417/3139584 files (1.2% non-contiguous), 7488804/12558080 blocks
  root@zop03-01:~# 
  =================

  == Comment: #6  ==

  There is no issue now. System is all fine. But i can recreate it
  easily. If i start        IO tests and parallelly start interface
  failover tests, it can be recreated.

  But i just dont want it to get into the same state. and you do an
  fsck, reboot solves the problem.

  We need to get into the root cause for why the system is going to bad
  state.

  
  == Comment: #8  ==

  Re-creating the issue with more details:

  The IO tests were running ( IO tests basically create directories and files on the FS).
  I started the interface failover tests. It basically fails the interface on the SAN subsystem and failsover to the next available interface.There are such 4 interfaces available.The SAN is texan2.

  
  WHen the interface was failed, The IO halted for a moment, by that i mean:

  Created Directory /FS3-part1/test1 rc = 0 at 04/17/2015 12:24:05
  Created Directory /FS4-part1/test1 rc = 0 at 04/17/2015 12:24:05
  Created Directory /FS3-part2/test1 rc = 0 at 04/17/2015 12:24:05
  Created Directory /FS1-part3/test1 rc = 0 at 04/17/2015 12:24:05
  Created Directory /FS1-part1/test1 rc = 0 at 04/17/2015 12:24:06
  Created Directory /FS7-part1/test1 rc = 0 at 04/17/2015 12:24:07
  Created Directory /FS0-part2/test1 rc = 0 at 04/17/2015 12:24:08
  Created Directory /FS0-part1/test1 rc = 0 at 04/17/2015 12:24:08
  Created Directory /FS0-part3/test1 rc = 0 at 04/17/2015 12:24:09


  
  =======> Here it halted

  On the console, i was seeing messages like:

  root@zop03-01:~# [35488.779299] sd 4:0:0:5: [sdab] Command (2A) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35488.779344] sd 4:0:0:5: [sdab] Command (2A) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35489.803377] sd 2:0:1:0: [sdf] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35489.803646] sd 2:0:1:0: [sdf] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35494.529306] sd 3:0:0:0: [sdk] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35494.538252] sd 3:0:0:1: [sdl] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35494.538306] sd 5:0:0:0: [sdai] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35494.548175] sd 5:0:0:1: [sdaj] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35494.548179] sd 3:0:0:2: [sdm] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35494.548216] sd 3:0:0:2: [sdm] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35494.549267] sd 5:0:0:2: [sdak] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35494.549340] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35494.549361] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35494.558355] sd 5:0:0:3: [sdal] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35494.558360] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35494.558367] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35494.568328] sd 5:0:0:4: [sdam] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35524.841752] sd 5:0:0:1: [sdaj] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35524.842736] sd 5:0:0:2: [sdak] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35524.843690] sd 5:0:0:0: [sdai] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35524.844659] sd 5:0:0:3: [sdal] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35524.845630] sd 5:0:0:4: [sdam] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35525.481757] sd 3:0:0:0: [sdk] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35525.482930] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35525.482944] sd 3:0:0:3: [sdn] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35525.483939] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 0
  [35525.483954] sd 3:0:0:4: [sdo] Command (28) failed: transaction cancelled (200:600) flags: 0 fcp_rsp: 0, resid=0, scsi_status: 

  Till here the machine was all fine, in rw state.

  
  Now, after 4 minutes of time:

  I suddenly saw the errors of IO failing:
  Write BLAST Failed on /blast/log/BLAST_zop03-01_root_04_17_2015_12_20_31.prt prt file - error = 30 EROFS         Read-only file system
  Error accessing file  /blast/log/BLAST_zop03-01_root_04_17_2015_12_20_31.prt
  Error Code 30
  ---> Replacement print file will be created ...Error accessing file  /blast/log/BLAST_zop03-01_root_04_17_2015_12_28_17.prt
  Error Code -1

  ***********************

  
  On Console:
  2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
  [35618.852794] ibmvfc 30000003: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
  [35640.184808] INFO: task jbd2/sdab2-8:13145 blocked for more than 120 seconds.
  [35640.184829]       Not tainted 3.19.0-9-generic #9-Ubuntu
  [35640.184833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [35640.184904] INFO: task kworker/u320:1:18262 blocked for more than 120 seconds.
  [35640.184910]       Not tainted 3.19.0-9-generic #9-Ubuntu
  [35640.184913] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [35640.185000] INFO: task latest_blast:18543 blocked for more than 120 seconds.
  [35640.185005]       Not tainted 3.19.0-9-generic #9-Ubuntu
  [35640.185008] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [35640.185057] INFO: task latest_blast:18544 blocked for more than 120 seconds.
  [35640.185062]       Not tainted 3.19.0-9-generic #9-Ubuntu
  [35640.185065] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [35640.185123] INFO: task latest_blast:18545 blocked for more than 120 seconds.
  [35640.185128]       Not tainted 3.19.0-9-generic #9-Ubuntu
  [35640.185131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [35640.185187] INFO: task latest_blast:18546 blocked for more than 120 seconds.
  [35640.185192]       Not tainted 3.19.0-9-generic #9-Ubuntu
  [35640.185195] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [35640.185242] INFO: task latest_blast:18547 blocked for more than 120 seconds.
  [35640.185247]       Not tainted 3.19.0-9-generic #9-Ubuntu
  [35640.185250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [35640.185297] INFO: task latest_blast:18548 blocked for more than 120 seconds.
  [35640.185302]       Not tainted 3.19.0-9-generic #9-Ubuntu
  [35640.185305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [35640.185350] INFO: task latest_blast:18549 blocked for more than 120 seconds.
  [35640.185355]       Not tainted 3.19.0-9-generic #9-Ubuntu
  [35640.185358] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [35640.185405] INFO: task latest_blast:18550 blocked for more than 120 seconds.
  [35640.185410]       Not tainted 3.19.0-9-generic #9-Ubuntu
  [35640.185413] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [35676.081382] blk_update_request: I/O error, dev sdf, sector 0
  [35676.264929] blk_update_request: I/O error, dev sdj, sector 0
  [35676.264951] blk_update_request: I/O error, dev sdg, sector 0
  [35676.264992] blk_update_request: I/O error, dev sdh, sector 0
  [35676.265026] blk_update_request: I/O error, dev sdi, sector 0
  [35677.253544] blk_update_request: I/O error, dev sdab, sector 46601528
  [35677.253550] blk_update_request: I/O error, dev sdaa, sector 0
  [35677.253609] blk_update_request: I/O error, dev sdy, sector 0
  [35677.253658] blk_update_request: I/O error, dev sdz, sector 0
  [35677.253690] blk_update_request: I/O error, dev sdw, sector 0
  [35677.254020] Aborting journal on device sdab2-8.
  [35677.254037] EXT4-fs error (device sdab2) in add_dirent_to_buf:1699: Journal has aborted
  [35677.254138] EXT4-fs error (device sdab2) in ext4_dx_add_entry:2077: Journal has aborted
  [35677.254151] EXT4-fs error (device sdab2) in __ext4_new_inode:850: Journal has aborted
  [35677.254204] EXT4-fs error (device sdab2) in __ext4_new_inode:850: Journal has aborted
  [35677.255088] EXT4-fs (sdab2): Remounting filesystem read-only
  [35677.255096] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
  [35677.255100] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
  [35677.255101] EXT4-fs error (device sdab2): ext4_journal_check_start:56:
  [35677.255102] EXT4-fs (sdab2): Remounting filesystem read-only
  [35677.255103] EXT4-fs (sdab2): Remounting filesystem read-only
  [35677.255123] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
  [35677.255130] EXT4-fs (sdab2): ext4_writepages: jbd2_start: 3648 pages, ino 1835459; err -30
  [35677.255179] EXT4-fs error (device sdab2) in ext4_evict_inode:243: Journal has aborted
  [35677.255557] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal[35677.255570] EXT4-fs error (device sdab2) in ext4_evict_inode:243: Journal has aborted
  [35677.255581] EXT4-fs error (device sdab2): ext4_journal_check_start:56: EXT4-fs error (device sdab2): ext4_journal_check_start:56:
  [35677.266659] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
  [35677.266680] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266697] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal[35677.266712] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266725] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266737] EXT4-fs error (device sdab2): ext4_journal_check_start:56:
  [35677.266759] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
  [35677.266777] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
  [35677.266789] EXT4-fs error (device sdab2): ext4_journal_check_start:56: [35677.266801] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
  [35677.266829] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
  [35677.266857] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
  [35677.266883] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
  [35677.266907] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
  [35677.266981] EXT4-fs error (device sdab2): ext4_journal_check_start:56: Detected aborted journal
  [35690.568831] ibmvfc 30000004: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7
  [35691.175548] ibmvfc 30000006: 298400: Port Login failed: unable to establish (2100:100) unknown fc type (0) no additional explanation (0) rc=0xF7


  and then i see that

  root@zop03-01:~# touch abc
  touch: cannot touch ?abc?: Read-only file system
  root@zop03-01:~#

  
  It has gone into RO mode again. I will atatch the dmesg and syslog in separate attachments.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1446064/+subscriptions