Commit d008ec2: Merge pull request #12 from letsencrypt/footnotes ("Convert footnotes to use markdown feature"), parents 933bcf7 + 97aef3a. 1 file changed: README.md, 53 additions and 53 deletions.
## Preparing the drives
Many modern storage drives can present different sector sizes (LBA formats) to the host system. Only one (or none) will be their internal, best-performing sector size. This is often the largest sector size they can natively support, e.g. "4Kn."[^1][^2][^3][^4] We currently use Intel NVMe drives, which have a changeable "Variable Sector Size."[^5] Intel's online documentation and specifications don't list the sector size options for the P4610 model we use, but scanning it showed us two possible values: 0 (512B) or 1 (4KB). flashbench[^6][^7] results strongly suggest that the internal sector size is 8KB.
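
As a quick check, nvme-cli can list the LBA formats a namespace supports; a minimal sketch, assuming the drive appears as `/dev/nvme0n1` (device path hypothetical):

```
# "Data Size" is the sector size presented to the host; the format currently
# in use is marked, and Relative Performance hints at the drive's preference.
sudo nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"
```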
### Implementation
We use the Intel Memory & Storage Tool[^8] to set the Variable Sector Size to 4,096, the best-performing of the available options.

**WARNING:** This erases all data.
```
for driveIndex in {0..23}; do
  # The full command is truncated in this diff hunk; the lines below are a
  # reconstructed sketch. LBAFormat=1 selects the 4KB sector size (see the
  # Intel Memory & Storage Tool CLI user guide for the exact syntax).
  sudo intelmas start \
    -nvmeformat LBAFormat=1 \
    -intelssd ${driveIndex}
done
```
[^5][^9]

## ZFS kernel module settings

* Almost all of the I/O to our datasets will be done by the InnoDB database engine, which has its own prefetching logic. Since ZFS's prefetching would be redundant and less well optimized, we disable it: `zfs_prefetch_disable=1`.[^1][^10]
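
One way to make this persistent, as a sketch assuming the conventional modprobe configuration path:

```
# /etc/modprobe.d/zfs.conf (assumed path): applied when the zfs module loads.
options zfs zfs_prefetch_disable=1
```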

## Building the vdevs, pool & datasets

### Basic concepts, from the bottom up[^11]
* ZFS acts as both a volume manager and a filesystem.
### Vdevs & pool

* We match our drives' best-performing 8KB sector size: `ashift=13`.[^1][^2][^3][^4]

* We want to automatically activate hot spare drives if another drive fails: `autoreplace=on`.[^3]

* We use `/dev/disk/by-id/` paths to identify drives, in case they're swapped around to different drive bays or the OS' device naming schema changes.[^3]

* We use RAID-1+0, in order to achieve the best possible performance without being vulnerable to a single-drive failure.[^3][^10][^12][^13][^14]

* We balance vdevs across controllers, buses, and backplane segments, in order to improve throughput and fault tolerance.[^3]

* We store data in datasets, not directly in pools, in order to allow easier management of properties, quotas, and snapshots.[^3]

#### Implementation
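
The pool-creation command itself is elided from this hunk; what follows is a minimal sketch under assumptions (the pool name, drive pairing, and `by-id` paths are hypothetical), combining the options above:

```
sudo zpool create \
  -o ashift=13 \
  -o autoreplace=on \
  -O atime=off \
  -O compression=lz4 \
  -O primarycache=metadata \
  -O recordsize=128k \
  -O xattr=sa \
  -O dnodesize=auto \
  db \
  mirror /dev/disk/by-id/nvme-EXAMPLE0 /dev/disk/by-id/nvme-EXAMPLE1 \
  mirror /dev/disk/by-id/nvme-EXAMPLE2 /dev/disk/by-id/nvme-EXAMPLE3 \
  spare /dev/disk/by-id/nvme-EXAMPLE4
```

Here `-o` sets pool properties and `-O` sets properties on the pool's root dataset, which is what the inheritance notes below refer to.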
* These properties are inherited by child datasets, unless overridden.

* We've no need to incur the overhead of tracking when files were last accessed: `atime=off`.[^1][^15]

* We currently use LZ4 compression, which is extremely efficient and may even improve performance by reducing I/O to the drives: `compression=lz4`.[^1][^2][^3][^4][^11][^13][^14]

  * The performance effects may be more mixed since our record size is only twice the sector size, meaning that compression can prevent relatively few sector writes. We might re-evaluate this choice. See [#9](https://github.com/letsencrypt/openzfs-nvme-databases/issues/9).

* Just like with prefetching, InnoDB has its own caching logic, so ZFS's caching would be redundant and less well optimized. We have ZFS cache only metadata: `primarycache=metadata`.[^1][^2][^10][^13]

* ZFS's default record size of 128KB is appropriate for medium-sequential writes, i.e. general use including database backups, which may also use this dataset. We set it explicitly, `recordsize=128k`,[^1][^2][^10][^13][^14][^15] on this parent dataset, and override it on the InnoDB child dataset.

* We store extended attributes in inodes, instead of hidden subdirectories, to reduce I/O overhead for SELinux: `xattr=sa`.[^1][^4][^16][^17] Use of this flag is further supported given that we rely on SELinux and POSIX ACLs in our systems. Without the flag, even the root user attempting to set an ACL on a folder/file on a ZFS mount will receive `Operation not permitted`. According to the zfs man page:[^21]

  > The use of system attribute based xattrs is strongly encouraged for users of SELinux or POSIX ACLs. Both of these features heavily rely of extended attributes and benefit significantly from the reduced access time.

* We also allow larger dnodes, in order to accommodate this: `dnodesize=auto`.[^4] N.b. this does break our pools' compatibility with non-Linux ZFS implementations.

#### Implementation
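
The commands here are likewise elided; a minimal sketch, assuming a hypothetical `db/mysql` parent dataset and, per the POSIX ACL discussion above, `acltype=posixacl` (an assumption; only the `zfs get acltype` check is visible in this hunk):

```
# acltype=posixacl is assumed from the ACL discussion above.
sudo zfs create -o acltype=posixacl db/mysql
sudo zfs get acltype
```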
### InnoDB child dataset

* Although we're not using a ZIL device because all our drives are the same (fast) speed, we still hint to ZFS that throughput is more important than latency for our workload: `logbias=throughput`.[^1][^2][^14]

  * ZIL may still have major benefits in this scenario. See [#7](https://github.com/letsencrypt/openzfs-nvme-databases/issues/7).

* InnoDB's default page size is 16KB. (This would be interesting to experiment with.) We know every write will be that size, and it's a multiple of the drives' sector size. So, we set the tablespace dataset's record size to match: `recordsize=16k`.[^1][^2][^10][^13][^14][^15]

* ZFS stores an *extra* copy of all metadata by default, beyond the redundancy provided by mirroring. Because we're prioritizing performance for a write-intensive workload, we lower this level of redundancy: `redundant_metadata=most`.[^2][^13]

#### Implementation
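
Again the command is elided; a minimal sketch, assuming a hypothetical `db/mysql/innodb` tablespace dataset and the overrides above:

```
sudo zfs create \
  -o logbias=throughput \
  -o recordsize=16k \
  -o redundant_metadata=most \
  db/mysql/innodb
```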

## MariaDB settings

## MariaDB settings

* ZFS has very efficient checksumming that's integral to its operation. So, we turn off InnoDB's checksums, which would be redundant: `innodb_checksum_algorithm=none`.[^1]

* Because ZFS writes are atomic and we've aligned page/record sizes, we disable the doublewrite buffer in order to reduce overhead: `innodb_doublewrite=0`.[^1][^2][^10][^14][^15]

* We store tables in individual files, for much easier backup, recovery, or relocation: `innodb_file_per_table=ON`.[^13]

* We reduce writes by setting the redo log's write-ahead block size to match the InnoDB dataset's record size, 16KB: `innodb_log_write_ahead_size=16384`.[^1] Some articles suggest using a larger block size for logs, but MySQL caps this value at the tablespace's record size.[^1][^18]

* We disable AIO, which performs poorly on Linux: `innodb_use_native_aio=0`, `innodb_use_atomic_writes=0`.[^2]

* We disable proactively flushing pages in the same extent, because group writes are not an issue with aligned page/record sizes: `innodb_flush_neighbors=0`.[^22][^23]

* We increase target & max IOPS above the defaults. We still use conservative values to avoid excessive SSD wear,[^24] but the defaults were tuned for spinning disks: `innodb_io_capacity=1000`, `innodb_io_capacity_max=2500`.[^23]
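
Collected into a configuration snippet, as a sketch (the file path and section name are assumptions; the values are exactly those listed above):

```
# /etc/my.cnf.d/zfs-tuning.cnf (assumed path)
[mysqld]
innodb_checksum_algorithm = none
innodb_doublewrite = 0
innodb_file_per_table = ON
innodb_log_write_ahead_size = 16384
innodb_use_native_aio = 0
innodb_use_atomic_writes = 0
innodb_flush_neighbors = 0
innodb_io_capacity = 1000
innodb_io_capacity_max = 2500
```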

## Operations

## Operations
* We'll run regular scrubs (integrity checks) of zpools; see the cron sketch after this list.[^3][^11][^19]

* We'll monitor zpools' health using Prometheus' node_exporter.[^20]
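
A sketch of the scrub schedule, assuming a hypothetical pool named `db` and a monthly cadence:

```
# /etc/cron.d/zpool-scrub (assumed path): scrub the pool monthly, overnight.
0 3 1 * * root /usr/sbin/zpool scrub db
```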

## References

## References
[^1]: https://shatteredsilicon.net/blog/2020/06/05/mysql-mariadb-innodb-on-zfs/

[^2]: https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html

[^3]: https://pthree.org/2012/12/13/zfs-administration-part-viii-zpool-best-practices-and-caveats/

[^4]: https://wiki.debian.org/ZFS#Advanced_Topics

[^5]: https://www.intel.com/content/www/us/en/support/articles/000016238/memory-and-storage/data-center-ssds.html

[^6]: https://github.com/bradfa/flashbench

[^7]: https://old.reddit.com/r/zfs/comments/aq913n/anyone_know_what_the_physical_sector_size_ashift/egjudnr/

[^8]: https://downloadcenter.intel.com/product/140108/intel-ssd-dc-p4610-series

[^9]: https://downloadmirror.intel.com/29821/eng/Intel_Memory_And_Storage_Tool_User%20Guide-Public-342245-004US.pdf

[^10]: http://assets.en.oreilly.com/1/event/21/Optimizing%20MySQL%20Performance%20with%20ZFS%20Presentation.pdf

[^11]: https://www.freebsd.org/doc/handbook/zfs-term.html

[^12]: https://old.reddit.com/r/zfs/comments/47i2wi/innodb_and_arc_for_workload_thats_50_update/d0erson/

[^13]: https://www.usenix.org/system/files/login/articles/login_winter16_09_jude.pdf

[^14]: https://old.reddit.com/r/zfs/comments/3mvv8e/does_anyone_run_mysql_or_postgresql_on_zfs/cvlbyjz/

[^15]: https://wiki.freebsd.org/ZFSTuningGuide#MySQL

[^16]: https://openzfs.org/wiki/Features#SA_based_xattrs

[^17]: https://old.reddit.com/r/zfs/comments/89xe9u/zol_xattrsa/

[^18]: https://dev.mysql.com/doc/refman/5.7/en/optimizing-innodb-logging.html

[^19]: https://www.bouncybouncy.net/blog/incremental-zpool-scrub/

[^20]: https://github.com/prometheus/node_exporter/pull/1632

[^21]: https://zfsonlinux.org/manpages/0.8.4/man8/zfs.8.html

[^22]: https://blog.pythian.com/exposing-innodb-internals-via-system-variables-part-3-io-table-data/

[^23]: https://dev.mysql.com/doc/refman/5.7/en/optimizing-innodb-diskio.html

[^24]: https://www.percona.com/blog/2019/12/18/give-love-to-your-ssds-reduce-innodb_io_capacity_max/