Providing Atomic Sector Updates in Software for Persistent Memory
Vishal Verma
vishal.l.verma@intel.com
Vault 2015

Agenda
● Introduction
● The Block Translation Table
● Read and Write Flows
● Synchronization
● Performance/Efficiency
● BTT vs. DAX

NVDIMMs and Persistent Memory
[Diagram: the memory/storage pyramid — CPU caches, DRAM, persistent memory, traditional storage; speed increases toward the top, capacity toward the bottom]
● NVDIMMs are byte-addressable
● We won't talk of "Total System Persistence"
  – The focus is on using persistent memory DIMMs for storage
● Drivers present this as a block device – "pmem"

Problem Statement
● Byte addressability is great
  – But not for writing a sector atomically
[Diagram: a userspace write() enters the 'pmem' driver (/dev/pmem0), which memcpy()s the sector onto the NVDIMM's blocks]

Problem Statement
● On a power failure, there are three possibilities:
  1. No blocks are torn (common on modern drives)
  2. A block was torn, but reads back with an ECC error
  3. A block was torn, but reads back without an ECC error (very rare on modern drives)
● With pmem, we use memcpy()
  – ECC is correct between any two stores
  – Torn sectors will almost never trigger ECC on the NVDIMM
  – Case 3 becomes the most common!
  – Only file systems with data checksums will survive this case

Naive solution
● Full data journaling
  – Write every block to the journal first
● 2x latency
● 2x media wear

Slightly better solution
● Maintain an 'on-disk' indirection table and an in-memory free block list
● The map/indirection table holds LBA → actual block offset mappings
● New writes grab a block from the free list
● On completing the write, atomically swap the free list entry and the map entry (a code sketch of this scheme follows the caveats below)

  Example state:

    Map (LBA → Actual):  0 → 42, 1 → 5050, 2 → 314, 3 → 3
    Free list:           {0, 12, 2}

  A write to LBA 3 grabs block 0 from the free list, writes the new data into block 0, then atomically swaps the entries: map[3] = 0 and the free list becomes {3, 12, 2}. Readers see either the old block or the new one, never a torn mix.

Slightly better solution
● Easy enough to implement
● Should be performant
● Caveat:
  – The only way to recreate the free list is to read the entire map
  – Consider a 512GB volume with bs=512 => reading 1,073,741,824 map entries
  – Map entries have to be 64-bit, so we end up reading 8GB at startup
  – Could save the free list to media on clean shutdown
  – But... clunky at best
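To make the map/free-list swap concrete, here is a minimal user-space sketch of the scheme just described. It is illustrative only — all names (blk_map, free_list, sector_write) are hypothetical, not from the pmem driver, and persistence ordering (cache flushes, fences) is reduced to a comment:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS  8    /* tiny hypothetical device */
    #define NFREE    2    /* blocks held back for the free list */
    #define BLK_SIZE 512

    static uint8_t  media[NBLOCKS][BLK_SIZE];  /* stand-in for the NVDIMM */
    static uint64_t blk_map[NBLOCKS - NFREE];  /* on-media LBA -> block map */
    static uint64_t free_list[NFREE];          /* in-memory free block list */

    /* Write one sector "atomically": data goes to a free block first,
     * then a single aligned 8-byte map store flips readers over to it. */
    static void sector_write(uint64_t lba, const uint8_t *buf)
    {
        uint64_t new_blk = free_list[0];       /* grab a free block */
        uint64_t old_blk = blk_map[lba];

        memcpy(media[new_blk], buf, BLK_SIZE); /* may tear, but this block
                                                  is not yet visible */
        /* On real media: flush + fence here, before the map update. */
        __atomic_store_n(&blk_map[lba], new_blk, __ATOMIC_RELEASE);
        free_list[0] = old_blk;                /* old block becomes free */
    }

    int main(void)
    {
        for (uint64_t i = 0; i < NBLOCKS - NFREE; i++)
            blk_map[i] = i;                    /* identity map to start */
        free_list[0] = 6;
        free_list[1] = 7;

        uint8_t sector[BLK_SIZE] = { 0xab };
        sector_write(3, sector);
        printf("LBA 3 now maps to block %llu\n",
               (unsigned long long)blk_map[3]);
        return 0;
    }

Note how the startup caveat is visible even here: free_list lives only in memory, and the map alone is what survives a reboot.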
The Block Translation Table

[Diagram: the backing store is divided into arenas of up to 512G each; within an arena: Info Block (4K), Data Blocks, nfree reserved blocks, BTT Map, BTT Flog (8K), Info Block Copy (4K)]

● Info block: info about the arena – offsets, lbasize, etc.
● nfree: the number of free blocks held in reserve
● Flog: portmanteau of free list + log
  – Has nfree entries
  – Each entry has two 'slots' that 'flip-flop'
  – Each slot has: block being written / old mapping / new mapping / sequence number
● External LBA: LBA as visible to upper layers
● ABA: Arena Block Address – block offset within an arena
● Premap/Postmap ABA: the block offset into the data area as seen prior to / post indirection from the map

What's in a lane?
● The idea of "lanes" is purely logical
● num_lanes = min(num_cpus, nfree)
● lane = cpu % num_lanes
● If num_cpus > num_lanes, we need locking on lanes
  – But if not, we can simply preempt_disable() and need not take a lock

  Example state (CPUs 0/1/2 each call get_lane() and get lanes 0/1/2):

    Lane  Free list {blk, seq, slot}  Flog slot 0 {LBA, old, new, seq}  Flog slot 1 {LBA, old, new, seq}
    0     {2, 0b10, 0}                {5, 32, 2, 0b10}                  {XX, XX, XX, XX}
    1     {6, 0b10, 1}                {XX, XX, XX, XX}                  {8, 38, 6, 0b10}
    2     {14, 0b01, 0}               {42, 42, 14, 0b01}                {XX, XX, XX, XX}

    Map: 5 → 2, 8 → 6, 42 → 14

Read and Write Flows

BTT – Reading a block
● Convert external LBA to arena number + pre-map ABA
● Get a lane (and take lane_lock if needed)
● Read the map to get the mapping
● If the ZERO flag is set, return zeroes
● If the ERROR flag is set, return an error
● Read data from the block the map points to
● Release the lane (and lane_lock)

[Diagram: read() of LBA 5 on CPU 0 takes lane 0, finds map entry 5 → 10, reads data from block 10, releases lane 0]

BTT – Writing a block
● Convert external LBA to arena number + pre-map ABA
● Get a lane (and take lane_lock if needed)
● Use the lane to index into the free list; write data to this free block
● Read the map to get the existing mapping
● Write the flog entry: [premap ABA / old postmap ABA / new postmap ABA / seq]
● Write the new post-map ABA into the map
● Calculate the next sequence number; write it and the old post-map ABA into the free list entry
● Release the lane (and lane_lock)

  Example: write() to LBA 5 on CPU 0, lane 0, free block 2, old map[5] = 10:

    1. write data to block 2
    2. flog[0][0] = {5, 10, 2, 0b10}
    3. map[5] = 2
    4. free[0] = {10, 0b11, 1}
    5. release lane 0

(Both flows are modeled in the two sketches below.)
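Tying the arena bookkeeping and lane logic together, a rough user-space rendering might look like this. Struct and field names are my own invention, not the driver's on-media format:

    #include <stdint.h>

    struct flog_slot {
        uint32_t lba;      /* premap ABA being written */
        uint32_t old_map;  /* old postmap ABA */
        uint32_t new_map;  /* new postmap ABA */
        uint32_t seq;      /* 2-bit sequence number, written last */
    };

    struct flog_entry {
        struct flog_slot slot[2];  /* two 'flip-flop' slots */
    };

    struct free_entry {
        uint32_t block;  /* the free postmap ABA owned by this lane */
        uint32_t seq;    /* next sequence number to stamp into the flog */
        uint32_t slot;   /* which flog slot the next write overwrites */
    };

    /* The sequence number cycles 01 -> 10 -> 11 -> 01, skipping 0b00. */
    static inline uint32_t next_seq(uint32_t seq)
    {
        return (seq == 0b11) ? 0b01 : seq + 1;
    }

    /* num_lanes = min(num_cpus, nfree); lane = cpu % num_lanes.
     * In the kernel: if num_cpus <= num_lanes, entering a lane is just
     * preempt_disable(); otherwise a per-lane lock is taken as well. */
    static inline unsigned get_lane(unsigned cpu, unsigned num_cpus,
                                    unsigned nfree)
    {
        unsigned num_lanes = num_cpus < nfree ? num_cpus : nfree;
        return cpu % num_lanes;
    }

The flip-flop slots are what make a torn flog write survivable: a writer alternates between the two slots, so the previous transaction's slot is always intact.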
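Building on those structures, the read and write flows could be sketched as follows. Again a simplification: single arena, no ZERO/ERROR map flags, no cache flushes or fences, and no locking yet (see the synchronization slides); data_read()/data_write() are assumed helpers:

    #define NFREE 256

    extern uint32_t map[];                  /* premap ABA -> postmap ABA */
    extern struct flog_entry flog[NFREE];
    extern struct free_entry free_list[NFREE];

    extern void data_read(uint32_t postmap_aba, void *buf);
    extern void data_write(uint32_t postmap_aba, const void *buf);

    static void btt_read(uint32_t premap, void *buf)
    {
        uint32_t postmap = map[premap];  /* would check ZERO/ERROR here */
        data_read(postmap, buf);
    }

    static void btt_write(uint32_t premap, const void *buf, unsigned lane)
    {
        struct free_entry *fe = &free_list[lane];
        struct flog_slot *fs = &flog[lane].slot[fe->slot];
        uint32_t new_blk = fe->block;
        uint32_t old_blk;

        data_write(new_blk, buf);        /* 1. data into our free block */
        old_blk = map[premap];           /* 2. read the existing mapping */

        fs->lba = premap;                /* 3. log the transaction ...   */
        fs->old_map = old_blk;
        fs->new_map = new_blk;
        fs->seq = fe->seq;               /* ... seq last, after a fence  */

        map[premap] = new_blk;           /* 4. flip the map entry        */

        fe->block = old_blk;             /* 5. recycle the old block     */
        fe->seq = next_seq(fe->seq);
        fe->slot ^= 1;                   /* flip-flop to the other slot  */
    }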
BTT – Analysis of a write

Each gap between the five steps of the example write is an opportunity for interruption/power failure:

● Failure before or during step 1 (the data write) – on reboot:
  – No visible on-disk change had happened; everything comes back up as normal

● Failure between steps 1 and 2 – on reboot:
  – The map hasn't been updated
  – Reads will continue to get the 5 → 10 mapping
  – The flog will still show '2' as free and ready to be written to

● Failure between steps 2 and 3 – on reboot:
  – Read flog[0][0] = {5, 10, 2, 0b10}
  – The flog claims map[5] should have been '2', but map[5] is still '10' (== flog.old)
  – Since the flog and map disagree, the recovery routine detects an incomplete transaction
  – The flog is assumed to be "true", since it is always written before the map
  – The recovery routine completes the transaction: map[5] = 2; free[0] = 10

● Special case – the flog write in step 2 is itself torn:
  – The bit sequence for flog.seq is 01 → 10 → 11 → 01 (old ← → new)
  – On reboot: read flog[0][0] = {5, 10, X, 0b11}; flog[0][1] = {X, X, X, 0b01}
  – Since seq is written last, the half-written flog entry does not show up as "new"
  – The free list is reconstructed using the newest non-torn flog entry – flog[0][1] in this case
  – map[5] remains '10', and '2' remains free

● Failure between steps 3 and 4 – on reboot:
  – Since both the flog and the map were updated, free list reconstruction happens as usual

(A code sketch of this recovery pass follows the synchronization slides.)

Synchronization

Let's Race! Write vs. Write

    CPU 1                           CPU 2
    ------------------------------  ------------------------------
    write LBA 0                     write LBA 0
    get-free[1] = 5                 get-free[2] = 6
    write data – postmap ABA 5      write data – postmap ABA 6
    ...                             ...
    read old_map[0] = 10            read old_map[0] = 10
    write log 0/10/5/xx             write log 0/10/6/xx
    write map = 5                   write map = 6
    write free[1] = 10              write free[2] = 10

● Everything from reading the old map entry to updating the free list is a critical section
● Here both CPUs read old_map[0] = 10, so block 10 lands on two free lists and one of the newly written blocks is leaked

Let's Race! Write vs. Write
● Solution: an array of map_locks, indexed by a hash of the premap ABA

    CPU 1                                   CPU 2
    --------------------------------------  --------------------------------------
    write LBA 0; get-free[1] = 5;           write LBA 0; get-free[2] = 6;
    write_data to 5                         write_data to 6
    lock map_lock[0 % nfree]
      read old_map[0] = 10
      write log 0/10/5/xx; write map = 5;
      free[1] = 10
    unlock map_lock[0 % nfree]
                                            lock map_lock[0 % nfree]
                                              read old_map[0] = 5
                                              write log 0/5/6/xx; write map = 6;
                                              free[2] = 5
                                            unlock map_lock[0 % nfree]

Let's Race! Read vs. Write

    CPU 1 (Reader)                   CPU 2 (Writer)
    -------------------------------  -----------------------------------
    read LBA 0                       write LBA 0
    ...                              get-free[2] = 6
    read map[0] = 5                  write data to postmap block 6
    start reading postmap block 5    write meta: map[0] = 6, free[2] = 5
    ...                              another write LBA 12
    ...                              get-free[2] = 5
    ...                              write data to postmap block 5
    finish reading postmap block 5

● BUG! – writing to a block that is being read from
● This doesn't corrupt the on-disk layout, but the read appears torn

Let's Race! Read vs. Write
● Solution: a Read Tracking Table (RTT) indexed by lane, tracking in-progress reads

    CPU 1 (Reader)                   CPU 2 (Writer)
    -------------------------------  -----------------------------------
    read LBA 0                       write LBA 0
    read map[0] = 5                  get-free[2] = 6; write data
    write rtt[1] = 5
    start reading postmap block 5    write meta: map[0] = 6, free[2] = 5
    ...                              another write LBA 12
    ...                              get-free[2] = 5
    ...                              scan RTT – '5' is present – wait!
    finish reading postmap block 5   ...
    clear rtt[1]                     ...
                                     write data to postmap block 5

(Both fixes are sketched in code below.)
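Circling back to the write-analysis slides, a boot-time recovery pass over one flog entry might look like the sketch below. This is my reading of the rules those slides lay out, not the driver's actual code; it reuses the hypothetical structures from earlier. Because seq is written last, a torn slot keeps its previous sequence number and can never be selected as the newer one:

    /* In the cycle 01 -> 10 -> 11 -> 01, the slot whose seq is the
     * successor of the other's is the newer one. */
    static int newer_slot(const struct flog_entry *fe)
    {
        return (next_seq(fe->slot[0].seq) == fe->slot[1].seq) ? 1 : 0;
    }

    static void btt_recover_lane(unsigned lane)
    {
        struct flog_entry *fe = &flog[lane];
        int n = newer_slot(fe);
        struct flog_slot *fs = &fe->slot[n];

        /* The newest slot's old_map is always this lane's free block:
         * either the write completed (old block was recycled), or it
         * didn't and we roll it forward below. */
        free_list[lane].block = fs->old_map;
        free_list[lane].seq   = next_seq(fs->seq);
        free_list[lane].slot  = n ^ 1;  /* overwrite the older slot next */

        /* If flog and map disagree, the transaction was interrupted
         * between the flog write and the map write; the flog wins. */
        if (map[fs->lba] == fs->old_map)
            map[fs->lba] = fs->new_map;
    }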
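And the two races could be closed roughly as the slides describe: a map_lock array hashed by premap ABA for write-vs-write, and an RTT scan before reusing a freed block for read-vs-write. A hypothetical user-space rendering, with pthreads standing in for kernel locks and no pretense of being memory-ordering correct:

    #include <pthread.h>
    #include <stdint.h>

    #define NFREE       256
    #define INVALID_RTT 0xffffffffu

    static pthread_mutex_t  map_locks[NFREE];  /* write-vs-write */
    static volatile uint32_t rtt[NFREE];       /* one slot per lane */

    /* Writers serialize per premap ABA, so the old-map read, flog write,
     * map update, and free-list update form one critical section. */
    static void map_lock(uint32_t premap)
    {
        pthread_mutex_lock(&map_locks[premap % NFREE]);
    }

    static void map_unlock(uint32_t premap)
    {
        pthread_mutex_unlock(&map_locks[premap % NFREE]);
    }

    /* Readers advertise the postmap block they are reading. */
    static void rtt_set(unsigned lane, uint32_t postmap) { rtt[lane] = postmap; }
    static void rtt_clear(unsigned lane)                 { rtt[lane] = INVALID_RTT; }

    /* Before writing data into a just-freed block, a writer waits until
     * no in-flight read still references it. */
    static void wait_for_readers(uint32_t postmap)
    {
        for (unsigned i = 0; i < NFREE; i++)
            while (rtt[i] == postmap)
                ;  /* spin; the kernel would use cpu_relax() here */
    }

In use: a writer calls wait_for_readers(new_blk) before writing data into its free block, and brackets the metadata steps with map_lock()/map_unlock(); a reader brackets its data read with rtt_set()/rtt_clear() after looking up the map entry.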
Performance/Efficiency

That's Great... but is it Fast?

                         512B block size    4K block size
    Write amplification  ~4.6% [536B]       ~0.5% [4120B]
    Capacity overhead    ~0.8%              ~0.1%

● Overall, the BTT introduces a ~10% performance overhead
● We think there is still room for improvement

BTT vs. DAX

BTT vs. DAX
● DAX stands for Direct Access
● Patchset by Matthew Wilcox, merged into 4.0-rc1
● Allows mapping a pmem range directly into userspace via mmap (a small sketch follows the Q&A slide)
● DAX is fundamentally incompatible with the idea of the BTT
● If the application is aware of persistent, byte-addressable memory, and can use it to its advantage, DAX is the best path for it
● If the application relies on atomic sector-update semantics, it must use the BTT
  – It may not know that it relies on this...
● XFS relies on journal updates being sector-atomic
  – For xfs-dax, we'd need to use logdev=/dev/[btt-partition]

Resources
● http://pmem.io – general persistent memory resources; focuses on the NVML, a library to make persistent memory programming easier
● The 'pmem' driver on GitHub: https://github.com/01org/prd
● The linux-nvdimm mailing list: https://lists.01.org/mailman/listinfo/linux-nvdimm
● linux-nvdimm patchwork: https://patchwork.kernel.org/project/linux-nvdimm/list/
● #pmem on OFTC

Q&A
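Backup: to make the DAX bullet concrete, mapping a pmem block device into userspace looks roughly like the following. This is a generic mmap sketch, not code from DAX or NVML; the device path is an assumption, and the direct-to-media behavior requires a DAX-capable kernel:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 4096;
        int fd = open("/dev/pmem0", O_RDWR);  /* hypothetical pmem device */
        if (fd < 0) { perror("open"); return 1; }

        void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        /* A byte-addressable store straight to persistent memory --
         * exactly the access pattern with no atomic-sector guarantee. */
        memcpy(addr, "hello, pmem", 12);
        /* A real application must also flush CPU caches (clflush, or
         * pmem_persist() in NVML) before the data is durable. */

        munmap(addr, len);
        close(fd);
        return 0;
    }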