[OmniOS-discuss] [developer] NVMe Performance

Richard Elling richard.elling at richardelling.com
Sun Apr 17 02:15:50 UTC 2016

> On Apr 15, 2016, at 7:49 PM, Richard Yao <ryao at gentoo.org> wrote:
> On 04/15/2016 10:24 PM, Josh Coombs wrote:
>> On Fri, Apr 15, 2016 at 9:26 PM, Richard Yao <ryao at gentoo.org> wrote:
>>> The first is to make sure that ZFS uses proper alignment on the device.
>>> According to what I learned via Google searches, the Intel DC P3600
>>> supports both 512-byte sectors and 4096-byte sectors, but is low-level
>>> formatted to 512-byte sectors by default. You could run fio to see how the
>>> random IO performance differs on 512-byte IOs at 512-byte formatting vs 4KB
>>> IOs at 4KB formatting, but I expect that you will find it performs best in
>>> the 4KB case like Intel's enterprise SATA SSDs do. If the 512-byte random
>>> IO performance was notable, Intel would have advertised it, but they did
>>> not do that:
>>> http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-p3600-spec.pdf
>>> http://www.cadalyst.com/%5Blevel-1-with-primary-path%5D/how-configure-oracle-redo-intel-pcie-ssd-dc-p3700-23534
>> So, I played around with this.  Intel's isdct tool will let you secure
>> erase the P3600 and set it up as a 4k sector device, or a 512, with a few
>> other options as well.  I have to re-look but it might support 8k sectors
>> too.  Unfortunately the NVMe driver doesn't play well with the SSD
>> formatted for anything other than 512 byte sectors.  I noted my findings in
>> Illumos bug #6912.
> The documentation does not say that it will do 8192-byte sectors,
> although ZFS in theory should be okay with them. My tests on the Intel
> DC S3700 suggested that 4KB vs 8KB was too close to tell. I recall
> deciding that Intel did a good enough job at 4KB that it should go into
> ZoL's quirks list as a 4KB drive.

ZIL traffic is all 4K, unless phys blocksize is larger. There are a number of Flash SSDs
that prefer 8k, and you can tell by the “optimal transfer size.” Since the bulk of the market
driving SSD sales is running NTFS, 4K is the market sweet spot.

> The P3600 is probably similar because its NAND flash controller "is an
> evolution of the design used in the S3700/S3500":
> http://www.anandtech.com/show/8104/intel-ssd-dc-p3700-review-the-pcie-ssd-transition-begins-with-nvme
>> I need to look at how Illumos partitions the devices if you just feed zpool
>> the device rather than a partition. I didn't look to see if it was aligning
>> things correctly on its own.
> It will put the first partition at a 1MB boundary and set an internal
> alignment shift consistent with what the hardware reports.
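
As a rough illustration of the alignment point above: the alignment shift is the base-2 logarithm of the sector size, and a partition offset is aligned when it is a whole multiple of that sector size. The sketch below is not taken from ZFS or the NVMe driver, and the helper names are made up for this example. It only shows the arithmetic for the sector sizes mentioned in this thread.

```python
def ashift_for(sector_size):
    """Return log2(sector_size), the value ZFS calls the alignment shift."""
    if sector_size <= 0 or sector_size & (sector_size - 1):
        raise ValueError("sector size must be a power of two")
    return sector_size.bit_length() - 1


def is_aligned(offset_bytes, sector_size):
    """True when the offset is a whole multiple of the sector size."""
    return offset_bytes % sector_size == 0


MIB = 1 << 20  # the 1MB boundary used for the first partition

for size in (512, 4096, 8192):
    print(size, ashift_for(size), is_aligned(MIB, size))
```

A 1MB starting offset is a multiple of 512, 4096 and 8192 bytes, so it stays aligned whichever of those sector sizes the device reports.
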
>>> The second is that it is possible to increase IOPS beyond Intel's
>>> specifications by doing a secure erase, giving SLOG a tiny 4KB aligned
>>> partition and leaving the rest of the device unused. Intel's numbers are
>>> for steady state performance where almost every flash page is dirty. If you
>>> leave a significant number of pages clean (i.e. unused following a secure
>>> erase), the drive should perform better than what Intel claims by virtue of
>>> the internal bookkeeping and garbage collection having to do less. Anandtech
>>> has benchmark numbers showing this effect on older consumer SSDs on
>>> Windows in a comparison with the Intel DC S3700:
>> Using isdct I have mine set to 50% over-provisioning, so they show up as
>> 200GB devices now.  As noted in bug 6912 you have to secure erase after
>> changing that setting or the NVMe driver REALLY gets unhappy.
> If you are using it as a SLOG, you would probably want something like
> 98% overprovisioning to match the ZeusRAM, which was designed for use as
> a ZFS SLOG device and was very well regarded until it was discontinued:
> https://www.hgst.com/sites/default/files/resources/[FAQ]_ZeusRAM_FQ008-EN-US.pdf
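
To put numbers on the over-provisioning levels mentioned above, here is a small sketch of the arithmetic. It assumes a 400GB raw device (inferred from the 200GB reported at 50%, not stated outright in the thread) and treats the percentage as the share of raw capacity hidden from the host. How isdct actually defines its setting may differ.

```python
def visible_capacity_gb(raw_gb, overprovision_pct):
    """Capacity left visible to the host after hiding overprovision_pct percent."""
    if not 0 <= overprovision_pct < 100:
        raise ValueError("over-provisioning must be in the range [0, 100)")
    return raw_gb * (100 - overprovision_pct) / 100


RAW_GB = 400  # assumption: raw size implied by "200GB at 50%"

for pct in (0, 50, 98):
    print(pct, visible_capacity_gb(RAW_GB, pct))
```

Under those assumptions, 98% over-provisioning leaves roughly 8GB visible.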

ZeusRAM was great for its time, but the 12G replacements perform similarly. The
biggest difference between ZeusRAMs and Flash SSDs seems to be in the garbage
collection. In my testing, low DWPD drives have less consistent performance as the
garbage collection is less optimized. For the 3 DWPD drives we’ve tested, the performance
for slog workloads is more consistent than the 1 DWPD drives.

> ZFS generally does not need much more from a SLOG device. The way to
> ensure that you do not overprovision more/less than ZFS is willing to
> use on your system would be to look at zfs_dirty_data_max.
> That being said, you likely will want to run fio random IO benchmarks at
> different overprovisioning levels after a secure erase and a dd
> if=/dev/urandom of=/path/to/device so you can see the difference in
> performance yourself. Happy benchmarking. :)

/dev/urandom is (intentionally) too slow for this. You’ll bottleneck there.
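
One possible workaround, sketched below, is to read the kernel generator only once to build a small pool of random blocks and then cycle through that pool while filling the target. This is an illustration, not a tested procedure for these drives. The function name and defaults are made up, and it assumes the drive does not deduplicate or compress repeated blocks. Writing to a real device destroys its contents.

```python
import io
import os


def fill_with_random(target, total_bytes, block_size=1 << 20, pool_blocks=16):
    """Write total_bytes to target, cycling a pool of pre-generated random blocks."""
    # Pay the cost of the kernel random generator once, up front.
    pool = [os.urandom(block_size) for _ in range(pool_blocks)]
    written = 0
    i = 0
    while written < total_bytes:
        chunk = pool[i % pool_blocks]
        remaining = total_bytes - written
        if remaining < block_size:
            chunk = chunk[:remaining]  # final partial block
        # Assumes a buffered file object, whose write() accepts the whole chunk.
        target.write(chunk)
        written += len(chunk)
        i += 1
    return written


# Harmless demonstration against an in-memory buffer instead of a device.
demo = io.BytesIO()
print(fill_with_random(demo, 5000, block_size=1024, pool_blocks=2))
```

For a real run the target would be a device opened in binary write mode, with the pool sized so that the repeat distance is large.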

Richard’s advice is good: test with random workloads. Contrary to popular belief, the ZIL
workload is not contiguous for most environments, so testing with “sequential” tests will not
deliver useful results.
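
As a starting point for that kind of test, the sketch below builds the text of a fio job file for small synchronous random writes at queue depth 1. The job name, device path and runtime are placeholders, and the option choices are only one guess at what approximates a slog workload. Adjust them for your system. Running the resulting job against a real device overwrites its data.

```python
def slog_fio_job(device, block_size="4k", runtime_s=60):
    """Return fio job-file text for a synchronous random-write test."""
    lines = [
        "[slog-randwrite]",       # hypothetical job name
        f"filename={device}",     # raw device or partition under test
        "rw=randwrite",           # random rather than sequential writes
        f"bs={block_size}",       # I/O size, e.g. 4k or 512
        "ioengine=psync",         # plain pwrite-based engine
        "sync=1",                 # open with O_SYNC so each write is synchronous
        "iodepth=1",              # one outstanding I/O at a time
        "time_based",
        f"runtime={runtime_s}",   # seconds
    ]
    return "\n".join(lines) + "\n"


# "/dev/rdsk/EXAMPLE" is a placeholder, not a real device name.
print(slog_fio_job("/dev/rdsk/EXAMPLE"))
```

Generating one job per block size (512 vs 4k) and per over-provisioning level makes it easier to compare the runs side by side.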

The other unfortunate problem with Flash SSDs is the parallelism. For a given product line,
the smaller drives also have lower performance. This is particularly important for slogs because
you may think that the smallest size is all you need, but the published benchmarks are for the
largest size in the product line, which can be 2x or 4x faster (latency) than the smaller versions.
 — richard


Richard.Elling at RichardElling.com
