Origin of "partition" in storage devices

Warner Losh imp at bsdimp.com
Tue Feb 1 18:50:35 CST 2022


On Tue, Feb 1, 2022 at 5:32 PM Paul Koning <paulkoning at comcast.net> wrote:

>
>
> > On Feb 1, 2022, at 6:00 PM, Warner Losh via cctalk <
> cctalk at classiccmp.org> wrote:
> >
> > On Tue, Feb 1, 2022 at 12:42 PM Grant Taylor via cctalk <
> > cctalk at classiccmp.org> wrote:
> >
> >> On 2/1/22 2:14 AM, Joshua Rice via cctalk wrote:
> >>> There's several advantages to doing it that way, including balancing
> >>> wear on a disk (especially today, with SSDs), as a dedicated swap
> >>> partition could put undue wear on certain areas of disk.
> >>
> >> I thought avoiding this very problem was the purpose of the wear
> >> leveling functions in SSD controllers.
> >
> > All modern SSD firmware that I'm aware of decouples the physical
> > location from the LBA. It implements some variation of an 'append
> > store log' that abstracts the LBAs away from the chips the data is
> > stored in. One big reason for this is so that one worn-out 'erase
> > block' doesn't cause a hole in the LBA range the drive can store
> > data on. You expect to retire hundreds or thousands of erase blocks
> > in today's NAND over the life of the drive, and coupling LBAs to a
> > physical location makes that impossible.
>
> Another reason is that the flash memory write block size is larger than
> the sector size exposed to the host, and the erase block size is much
> larger than the write block size.  So the firmware has to keep track of
> retired data, move stuff around to collect an erase block's worth of it,
> then erase that block to make it available again for incoming writes.
>

Yes. That larger block size is the 'erase block' I was talking about.
The garbage collection of old data makes or breaks drive performance.
There are a number of techniques used these days to hide it, though.
The extra erase blocks that are free are usually SLC, so they can be
written quickly (and for many workloads, recently written blocks are
likely to be rewritten soon, so when you GC the SLC pages you're only
copying small portions of them). This buffer of available blocks is one
reason drives need more raw capacity than they export.
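
To make the indirection concrete, here's a toy log-structured FTL in C.
The geometry, names, and GC policy are all invented for illustration;
real firmware adds wear accounting, power-loss safety, reserved blocks,
and the SLC staging described above.

/* Toy log-structured FTL: a sketch, not any real drive's firmware.
 * Invented geometry: 8 erase blocks of 4 pages, one LBA per page,
 * half the physical space exported to the host. */
#include <stdio.h>

#define BLOCKS     8
#define PAGES_PER  4
#define PHYS_PAGES (BLOCKS * PAGES_PER)  /* 32 physical pages */
#define LBAS       16                    /* exported capacity: half */
#define FREE  (-1)
#define STALE (-2)

static int lba_to_phys[LBAS];            /* the indirection table */
static int phys_state[PHYS_PAGES];       /* FREE, STALE, or owning LBA */
static int free_count;

static int find_free(int avoid_block)
{
    for (int p = 0; p < PHYS_PAGES; p++)
        if (phys_state[p] == FREE && p / PAGES_PER != avoid_block)
            return p;
    return -1;
}

static void gc(void)
{
    /* Pick the erase block with the most stale pages, copy its live
     * pages forward, then erase the whole block. */
    int victim = 0, most = -1;
    for (int b = 0; b < BLOCKS; b++) {
        int stale = 0;
        for (int p = 0; p < PAGES_PER; p++)
            if (phys_state[b * PAGES_PER + p] == STALE)
                stale++;
        if (stale > most) { most = stale; victim = b; }
    }
    for (int p = 0; p < PAGES_PER; p++) {
        int src = victim * PAGES_PER + p, st = phys_state[src];
        if (st >= 0) {                   /* live page: copy it out */
            int dst = find_free(victim);
            phys_state[dst] = st;
            lba_to_phys[st] = dst;
            phys_state[src] = FREE;      /* net free count unchanged */
        } else if (st == STALE) {
            phys_state[src] = FREE;      /* reclaimed garbage */
            free_count++;
        }
    }
}

static void write_lba(int lba)
{
    if (free_count <= PAGES_PER)         /* keep headroom for GC moves */
        gc();
    int p = find_free(-1);               /* append at any free page */
    if (lba_to_phys[lba] >= 0)           /* old copy becomes garbage */
        phys_state[lba_to_phys[lba]] = STALE;
    lba_to_phys[lba] = p;
    phys_state[p] = lba;
    free_count--;
}

int main(void)
{
    for (int i = 0; i < LBAS; i++) lba_to_phys[i] = -1;
    for (int i = 0; i < PHYS_PAGES; i++) phys_state[i] = FREE;
    free_count = PHYS_PAGES;

    for (int i = 0; i < 100; i++)        /* hammer four hot LBAs */
        write_lba(i % 4);
    printf("LBA 0 now lives at physical page %d\n", lba_to_phys[0]);
    return 0;
}

Note how a worn block just stops being a GC destination: nothing in the
LBA range has to change, which is the point made above.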

Another twist: recent NAND has a 16k-64k page size, which is one reason
you'll see drives report an emulated sector size of 4k (since the drive
exposes 4k LBAs) but a native size of 16k or 64k. This helps
partitioning software align partitions on a boundary the drive is
likely able to handle better.
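
The alignment math itself is trivial; here's a toy version in C,
assuming the 4k-LBA / 16k-native-page numbers above:

/* Round a proposed partition start up to a native-page boundary.
 * All sizes are illustrative, per the discussion above. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t lba_size  = 4096;     /* emulated sector size */
    const uint64_t page_size = 16384;    /* native NAND page size */
    const uint64_t lbas_per_page = page_size / lba_size;  /* 4 */

    uint64_t proposed = 34;              /* some proposed start LBA */
    /* Classic round-up-to-multiple idiom. */
    uint64_t aligned = (proposed + lbas_per_page - 1)
                       / lbas_per_page * lbas_per_page;
    printf("start LBA %llu -> aligned LBA %llu\n",
           (unsigned long long)proposed, (unsigned long long)aligned);
    return 0;
}

In practice most partitioning tools just align to 1 MiB, which covers
any of these page sizes.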


> The spare capacity of an SSD can be pretty substantial.  I remember one
> some years ago that had a bug which, in a subtle way, exposed the internal
> structure of the device.  It turned out the exposed capacity was 49/64th of
> the physical flash space.  Strange fraction, I don't think we were ever
> told why, but the supplier did confirm we analyzed it correctly.
>

The other reason you'd need additional capacity is to meet the
endurance requirements of the drive. If the vendor says it will last N
years with M months of retention and Z writes, then you have to make
sure that after Z writes some margin above your capacity point remains
in service. You can usually keep in service only those blocks whose
error rates are low enough that you can safely store data on them for M
months, and that gets much harder at the end of life. With larger
over-provisioning you can increase N, M, or Z (but usually not all of
them).
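
A back-of-the-envelope sketch of that margin, with entirely made-up
numbers:

/* Endurance margin arithmetic; every figure here is invented. */
#include <stdio.h>

int main(void)
{
    const double physical_gb = 512.0;  /* raw NAND on the drive */
    const double exported_gb = 480.0;  /* the capacity point */
    const double margin_gb   = 8.0;    /* working space GC still needs */
    const double eb_mb       = 8.0;    /* assumed erase-block size */

    /* Retirement budget: NAND that may wear out before the drive can
     * no longer back its exported capacity plus the GC margin. */
    double budget_gb = physical_gb - exported_gb - margin_gb;
    printf("retire up to %.0f GB, ~%.0f erase blocks\n",
           budget_gb, budget_gb * 1024.0 / eb_mb);
    return 0;
}

A bigger budget is what lets the vendor promise a larger N, M, or Z;
spending it is what the error-rate culling above does.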

Warner

