On 10/07/2012 01:38 AM, jim s wrote:
> Recently I had a system with a 1.5 TB Seagate grow a count of
> "uncorrectable offline sector count" errors. [...] To complicate
> things a bit, this was part of an LVM ext3 RAID 5 set.
When you have a drive go bad in a RAID 5, it's best to pull the drive,
put a new one in its place, and start a rebuild. That's the point of
using RAID 5. Trying to recover data from the failing drive is mostly a
waste of time. Of course, if the drive wasn't in a RAID 5, mirror,
etc., you wouldn't have that option.
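The rebuild works because the RAID 5 parity block is just the XOR of the data blocks in each stripe, so any one missing member can be recomputed from the survivors. A minimal sketch of that property (illustrative byte strings, not a real array):

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Two data blocks plus their parity, as one stripe of a 3-member RAID 5.
d0 = b"ABCDEFGH"
d1 = b"12345678"
parity = xor_blocks([d0, d1])

# If the disk holding d1 dies, its contents fall out of the survivors.
rebuilt = xor_blocks([d0, parity])
assert rebuilt == d1
```

The same arithmetic is what the controller runs, stripe by stripe, when you swap in the replacement drive.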
The drives do all sorts of magic "under the hood" to try to recover your
data, and they'll remap sectors that get marginal, but when sectors
suddenly become completely unreadable, they don't remap them. If they
did, a read would report no error but return data different from what
was originally written, which would be completely unacceptable.
If you try to write to the bad sectors, the drive may remap them.
However, once a drive starts reporting hard errors, I consider it time
to scrap it.
I've done an absurd amount of data recovery from failing drives,
corrupted RAID arrays, etc., and it is no fun at all. I've had to write
a lot of my own tools along the way, but most of them were so specific
to a particular problem as not to be worth sharing. One possible
exception was a tool I used to recover a 3ware RAID 5 array on which the
RAID metadata had accidentally been destroyed. The content data was
still completely intact. I had to figure out how the parity rotation of
the RAID stripes worked, and write my own code to reconstruct it onto a
new array.
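The core of that kind of reconstruction can be sketched as below. This assumes a left-symmetric parity rotation (the Linux md default; the layout 3ware actually used, and the chunk size, have to be worked out by inspection as described above), and `members`/`chunk` are hypothetical names, not any real tool's API:

```python
def reassemble_raid5(members, chunk):
    """Concatenate the data chunks of a RAID 5 array from raw member
    images, assuming left-symmetric parity rotation.

    members: list of equal-length bytes objects, one per disk, in slot order
    chunk:   stripe unit size in bytes
    """
    n = len(members)
    stripes = len(members[0]) // chunk
    out = bytearray()
    for s in range(stripes):
        # Parity rotates backward through the disks, one stripe at a time.
        parity_disk = n - 1 - (s % n)
        # Left-symmetric: the stripe's data chunks start on the disk just
        # after the parity chunk and wrap around the members in order.
        for i in range(n - 1):
            disk = (parity_disk + 1 + i) % n
            out += members[disk][s * chunk:(s + 1) * chunk]
    return bytes(out)
```

Getting the rotation and the data ordering right is exactly the trial-and-error part: dump a few stripes under each candidate layout and look for the one that yields a recognizable filesystem.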
Eric