On Tue, Feb 1, 2022 at 5:32 PM Paul Koning <paulkoning at comcast.net> wrote:
On Feb 1, 2022, at 6:00 PM, Warner Losh via cctalk <cctalk at classiccmp.org> wrote:
On Tue, Feb 1, 2022 at 12:42 PM Grant Taylor via cctalk <cctalk at classiccmp.org> wrote:
On 2/1/22 2:14 AM, Joshua Rice via cctalk wrote:
There are several advantages to doing it that way, including balancing wear on a disk (especially today, with SSDs), as a dedicated swap partition could put undue wear on certain areas of the disk.
I thought avoiding this very problem was the purpose of the wear
leveling functions in SSD controllers.
All modern SSD firmware that I'm aware of decouples the physical location from the LBA. It implements some variation of an 'append store log' that abstracts the LBAs away from the chips the data is stored on. One big reason for this is so that one worn-out 'erase block' doesn't cause a hole in the LBA range the drive can store data on. You expect to retire hundreds or thousands of erase blocks in today's NAND over the life of the drive, and coupling LBAs to a physical location makes that impossible.
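The append-log indirection described above can be sketched in a few lines. This is a deliberately simplified, hypothetical model (a flat LBA-to-page map with a single log head; real FTLs track erase blocks, stale counts, and wear per block), just to show how a rewrite lands in a fresh physical page while the LBA stays the same:

```python
# Minimal sketch of an FTL's 'append store log' indirection.
# Hypothetical simplification: real firmware is far more elaborate.

class AppendLogFTL:
    def __init__(self, num_pages):
        self.flash = [None] * num_pages  # physical NAND pages
        self.map = {}                    # LBA -> physical page index
        self.next_free = 0               # append point of the log

    def write(self, lba, data):
        """Every write goes to the next free page; the LBA is remapped
        and the old copy (if any) simply becomes stale, awaiting GC."""
        self.flash[self.next_free] = (lba, data)
        self.map[lba] = self.next_free
        self.next_free += 1

    def read(self, lba):
        return self.flash[self.map[lba]][1]

ftl = AppendLogFTL(8)
ftl.write(0, "v1")
ftl.write(0, "v2")        # rewrite of LBA 0 lands in a new page
assert ftl.read(0) == "v2"
assert ftl.map[0] == 1    # LBA 0 now points at physical page 1
```

Because the map is the only thing tying an LBA to a location, a worn-out page can be retired without punching a hole in the LBA range, which is the point Warner makes above.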
Another reason is that the flash memory write block size is larger than
the sector size exposed to the host, and the erase block size is much
larger than the write block size. So the firmware has to keep track of
retired data, move stuff around to collect an erase block worth of that,
then erase it to make it available again to receive incoming writes.
Yes. That larger sector size is the 'erase block' that I was talking about. The whole garbage collection of old data makes or breaks drive performance. There are a number of techniques used these days to hide it, though. The extra erase blocks that are free are usually SLC, so they can be written quickly (and for many workloads, recently written blocks are likely to be rewritten soon, so when you GC the SLC pages, you're only doing small portions of them). This buffer of available blocks is one reason you need larger capacity.
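The garbage-collection step Paul and Warner describe can be sketched with a greedy victim-selection policy. This is an illustrative toy (the greedy "least live data" policy and the four-page blocks are assumptions; real firmware also weighs wear and data age), showing why mostly-stale blocks are cheap to reclaim:

```python
# Toy sketch of greedy GC: reclaim the erase block with the least
# live data, since that minimizes the copy-forward traffic.
# Assumed layout: each block is a fixed list of pages; None = stale.

PAGES_PER_BLOCK = 4

def pick_victim(blocks):
    """Return the index of the block with the fewest live pages."""
    return min(range(len(blocks)),
               key=lambda i: sum(p is not None for p in blocks[i]))

def gc(blocks, victim):
    """Copy the victim's live pages out, 'erase' the block, and
    return the live pages so they can be rewritten at the log head."""
    live = [p for p in blocks[victim] if p is not None]
    blocks[victim] = [None] * PAGES_PER_BLOCK
    return live

blocks = [["a", None, None, "b"], ["c", "d", "e", None]]
v = pick_victim(blocks)
assert v == 0                      # block 0 holds only 2 live pages
assert gc(blocks, v) == ["a", "b"] # only those 2 pages get copied
```

The rewrite-soon effect Warner mentions makes this cheaper still: data in the SLC buffer tends to go stale before GC touches it, so little survives to be copied forward.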
Another twist: recent NAND has a 16k-64k page size, which is one reason you'll see drives report an emulated sector size of 4k (since the drive has 4k LBAs) but a native size of 16k or 64k. This helps drive-provisioning software align partitions on a boundary that the drive is likely able to handle better.
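The alignment arithmetic behind that is simple. A rough sketch, assuming 512-byte LBAs and a 16 KiB native page (both illustrative values; a real tool would read the sizes the drive reports):

```python
# Sketch: round a proposed partition start LBA up to the drive's
# native NAND page boundary. Sizes below are assumed examples.

LBA_SIZE = 512
NAND_PAGE = 16 * 1024                   # assumed native page size
LBAS_PER_PAGE = NAND_PAGE // LBA_SIZE   # 32 LBAs per 16 KiB page

def align_up(start_lba):
    """Ceiling-align an LBA to the next page boundary."""
    return -(-start_lba // LBAS_PER_PAGE) * LBAS_PER_PAGE

assert align_up(63) == 64      # legacy CHS-style start 63 is misaligned
assert align_up(2048) == 2048  # a 1 MiB start already falls on a boundary
```

This is why modern partitioning tools default to 1 MiB alignment: it is a multiple of every plausible native page size, so partitions start on a boundary the drive handles well regardless of the actual NAND geometry.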
The spare capacity of an SSD can be pretty substantial. I remember one some years ago that had a bug which, in a subtle way, exposed the internal structure of the device. It turned out the exposed capacity was 49/64ths of the physical flash space. A strange fraction; I don't think we were ever told why, but the supplier did confirm we analyzed it correctly.
The other reason you'd need additional capacity is to meet the endurance requirements of the drive. If the vendor says it will last N years with M months of retention and Z writes, then you have to make sure that after Z writes some margin above your capacity point remains in service. You can usually only keep in service those blocks whose error rates are low enough that you can safely store data on them for M months, and that gets much harder at end of life. With larger overprovisioning, you can increase N, M, or Z (but usually not all of them).
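The endurance trade-off above is roughly back-of-the-envelope arithmetic. A sketch, with all the numbers (capacity, P/E cycle rating, write amplification) invented for illustration rather than taken from any real drive:

```python
# Illustrative endurance math: the writes (Z) a drive can absorb are
# bounded by physical NAND capacity times P/E cycles, divided by the
# write amplification that GC adds. All numbers below are made up.

def rated_tbw(physical_gb, pe_cycles, write_amp):
    """Terabytes-written the NAND can take before hitting its
    program/erase limit, discounted by write amplification."""
    return physical_gb * pe_cycles / write_amp / 1000

# Hypothetical drive: 1024 GB of NAND exposing less to the host,
# 3000 P/E-cycle NAND, write amplification of 2.
assert rated_tbw(1024, 3000, 2) == 1536.0  # ~1536 TBW budget
```

This shows why overprovisioning buys endurance: more physical NAND behind the same exposed capacity both raises the numerator and, by giving GC more free blocks to work with, lowers the write amplification in the denominator.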
Warner