Block Translation Table

Providing Atomic Sector Updates in
Software for Persistent Memory
Vishal Verma
vishal.l.verma@intel.com
Vault 2015
Introduction
The Block Translation Table
Read and Write Flows
Synchronization
Performance/Efficiency
BTT vs. DAX
NVDIMMs and Persistent Memory
[Diagram: memory hierarchy pyramid – CPU caches, DRAM, persistent memory, traditional storage; speed grows toward the top, capacity toward the bottom]

● NVDIMMs are byte-addressable
● We won't talk of “Total System Persistence”
  – But using persistent memory DIMMs for storage
● Drivers to present this as a block device – “pmem”
Problem Statement
• Byte addressability is great
  – But not for writing a sector atomically

[Diagram: a userspace write() goes to the 'pmem' driver (/dev/pmem0), which memcpy()s the data into NVDIMM blocks 0-3]
Problem Statement
• On a power failure, there are three possibilities:
  1. No blocks are torn (common on modern drives)
  2. A block was torn, but reads back with an ECC error
  3. A block was torn, but reads back without an ECC error (very rare on modern drives)
• With pmem, we use memcpy()
  – ECC is correct between two stores
  – Torn sectors will almost never trigger ECC on the NVDIMM
  – Case 3 becomes the most common!
  – Only file systems with data checksums will survive this case
Naive solution
• Full Data Journaling
• Write every block to the journal first
• 2x latency
• 2x media wear
Slightly better solution
• Maintain an 'on-disk' indirection table and an in-memory free block list
• The map/indirection table has LBA -> actual block offset mappings
• New writes grab a block from the free list
• On completing the write, atomically swap the free list entry and map entry

Map (LBA -> actual):  0 -> 42,  1 -> 5050,  2 -> 314,  3 -> 3
Free List: {0, 12, 2}
NVDIMM blocks: 0 – Free;  3 – LBA 3;  42 – LBA 0;  314 – LBA 2
Slightly better solution (contd.)
• write( to LBA 3 ) arrives: grab free block 0 from the free list and write the data into it
• On completion, atomically swap: map[3] becomes 0, and the old block 3 goes onto the free list

Map (LBA -> actual):  0 -> 42,  1 -> 5050,  2 -> 314,  3 -> 0
Free List: {3, 12, 2}
NVDIMM blocks: 0 – LBA 3;  3 – Free;  42 – LBA 0;  314 – LBA 2
Slightly better solution
• Easy enough to implement
• Should be performant
• Caveat:
  – The only way to recreate the free list is to read the entire map
  – Consider a 512GB volume, bs=512 => reading 1073741824 (2^30) map entries
  – Map entries have to be 64-bit, so we end up reading 8GB at startup
  – Could save the free list to media on clean shutdown
  – But... clunky at best
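To see why this is clunky, here is a hedged sketch of the only rebuild strategy available to the naive scheme: scan every map entry at startup and collect the unreferenced blocks. All names and types are illustrative, not from any real driver:

/* Hedged sketch: rebuilding the free list for the naive scheme means
 * reading the entire map (2^30 entries for 512GB at bs=512). */
#include <stdint.h>
#include <stdlib.h>

size_t rebuild_free_list(const uint64_t *map, size_t n_lbas,
                         size_t n_blocks, uint64_t *free_out)
{
    uint8_t *used = calloc(n_blocks, 1);
    size_t nfree = 0;

    if (!used)
        return 0;

    for (size_t lba = 0; lba < n_lbas; lba++)    /* the full 8GB map scan */
        used[map[lba]] = 1;

    for (uint64_t blk = 0; blk < n_blocks; blk++)
        if (!used[blk])                          /* unreferenced => free */
            free_out[nfree++] = blk;

    free(used);
    return nfree;
}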
The Block Translation Table
• On-media layout: the backing store is divided into arenas of up to 512G each (Arena 0, Arena 1, ...). Each arena contains:
  – Arena Info Block (4K): info about the arena – offsets, lbasizes, etc.
  – Data Blocks
  – nfree reserved blocks
  – BTT Map
  – BTT Flog (8K)
  – Info Block Copy (4K)
• nfree: The number of free blocks in reserve
• Flog: Portmanteau of free list + log
  – Has nfree entries
  – Each entry has two 'slots' that 'flip-flop'
  – Each slot has: [block being written | old mapping | new mapping | sequence num]
• External LBA: LBA as visible to upper layers
• ABA: Arena Block Address – block offset within an arena
• Premap/Postmap ABA: The block offset into the data area as seen prior to / after indirection through the map
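As a mental model, a flog slot can be sketched as a small C record. This is a hedged illustration following the slide's field list, with 32-bit fields assumed; it is not the exact on-media format:

/* Hedged sketch of a flog slot, following the slide's
 * [block being written | old mapping | new mapping | sequence num]
 * field list; 32-bit fields assumed, not the exact on-media layout. */
#include <stdint.h>

struct flog_slot {
    uint32_t lba;       /* premap ABA being written */
    uint32_t old_map;   /* old postmap ABA */
    uint32_t new_map;   /* new postmap ABA */
    uint32_t seq;       /* 2-bit cyclic sequence number */
};

/* One flog entry per free block: two slots that 'flip-flop',
 * so a torn update never destroys the last valid record. */
struct flog_entry {
    struct flog_slot slots[2];
};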
What's in a lane?
• Each CPU grabs a lane: CPU 0 → get_lane() = 0, CPU 1 → get_lane() = 1, CPU 2 → get_lane() = 2
• A lane indexes a free list entry and a flog entry (two flip-flop slots each):

          Free List            Flog slot 0            Flog slot 1
         blk   seq  slot    LBA  old  new   seq    LBA` old` new` seq`
Lane 0:    2  0b10     0      5   32    2  0b10     XX   XX   XX   XX
Lane 1:    6  0b10     1     XX   XX   XX    XX      8   38    6  0b10
Lane 2:   14  0b01     0     42   42   14  0b01     XX   XX   XX   XX

Map: 5 → 2,  8 → 6,  42 → 14

• The idea of “lanes” is purely logical
• num_lanes = min(num_cpus, nfree)
• lane = cpu % num_lanes
• If num_cpus > num_lanes, we need locking on lanes
  – But if not, we can simply preempt_disable() and need not take a lock
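A minimal sketch of the lane arithmetic above; the userspace framing and function signatures are illustrative (per the slide, the kernel can pin the CPU with preempt_disable() rather than passing cpu around):

/* Hedged sketch of lane selection: lanes are purely logical. */
static unsigned int num_lanes(unsigned int num_cpus, unsigned int nfree)
{
    return num_cpus < nfree ? num_cpus : nfree;
}

static unsigned int get_lane(unsigned int cpu, unsigned int num_cpus,
                             unsigned int nfree)
{
    /* If num_cpus > num_lanes, two CPUs can share a lane, so a lane
     * lock is needed; otherwise disabling preemption is enough. */
    return cpu % num_lanes(num_cpus, nfree);
}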
Read and Write Flows
BTT – Reading a block
read() LBA 5 (CPU 0, lane 0)
• Convert external LBA to Arena number + pre-map ABA
• Get a lane (and take lane_lock if needed)
• Read map to get the mapping: pre 5 → post 10
• If ZERO flag is set, return zeroes
• If ERROR flag is set, return an error
• Read data from the block that the map points to: read data from 10
• Release lane (and lane_lock): release Lane 0
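Condensing the read flow into code: a minimal, self-contained C sketch over a toy in-memory arena. The flag bit positions, struct, and 512B block size are assumptions for illustration, not the real on-media encoding:

/* Hedged sketch of the read flow; the arena, flag positions, and
 * names are illustrative stand-ins for the real on-media structures. */
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MAP_ERROR_FLAG (1u << 30)       /* assumed flag position */
#define MAP_ZERO_FLAG  (1u << 31)       /* assumed flag position */
#define MAP_ABA_MASK   ((1u << 30) - 1)

struct toy_arena {
    uint32_t *map;          /* premap ABA -> postmap ABA + flags */
    uint8_t (*data)[512];   /* data blocks; 512B lbasize assumed */
};

int btt_read_block(struct toy_arena *a, uint32_t premap_aba, void *buf)
{
    /* get_lane() / lane_lock elided; see the Synchronization section */
    uint32_t entry = a->map[premap_aba];

    if (entry & MAP_ZERO_FLAG) {        /* never-written block: zeroes */
        memset(buf, 0, 512);
        return 0;
    }
    if (entry & MAP_ERROR_FLAG)         /* block marked bad */
        return -EIO;

    /* follow the indirection and read the postmap block */
    memcpy(buf, a->data[entry & MAP_ABA_MASK], 512);
    return 0;
}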
BTT – Writing a block

write() LBA 5 (CPU 0, lane 0; old map: pre 5 → post 10; Free List[0] = {blk 2, seq 0b10, slot 0})
• Convert external LBA to Arena number + pre-map ABA
• Get a lane (and take lane_lock if needed)
• Use lane to index into the free list, write data to this free block: write data to 2
• Read map to get the existing mapping: 5 → 10
• Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq]: flog[0][0] = {5, 10, 2, 0b10}
• Write new post-map ABA into map: map[5] = 2
• Write old post-map entry into the free list; calculate the next sequence number and write it into the free list entry: free[0] = {10, 0b11, 1}
• Release lane (and lane_lock): release Lane 0

Resulting map: pre 5 → post 2
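The same flow as a self-contained C sketch over toy in-memory structures. Everything here is illustrative, but the ordering (data, then flog, then map, then free list) is the point:

/* Hedged sketch of the write-flow ordering; arrays stand in for the
 * on-media map, flog, and data area. All names are illustrative. */
#include <stdint.h>
#include <string.h>

struct toy_free_entry { uint32_t blk, seq, slot; };
struct toy_slot { uint32_t lba, old_map, new_map, seq; };

/* 2-bit cyclic sequence: 01 -> 10 -> 11 -> 01 */
static uint32_t next_seq(uint32_t s) { return s == 3 ? 1 : s + 1; }

void btt_write_block(uint32_t *map, uint8_t (*data)[512],
                     struct toy_slot (*flog)[2],
                     struct toy_free_entry *free_list,
                     unsigned int lane, uint32_t premap_aba,
                     const void *buf)
{
    struct toy_free_entry *fe = &free_list[lane];
    uint32_t new_post = fe->blk;              /* lane's free block */

    memcpy(data[new_post], buf, 512);         /* 1. data first */
    uint32_t old_post = map[premap_aba];      /* 2. existing mapping */

    /* 3. flog is always written before the map */
    flog[lane][fe->slot] = (struct toy_slot){
        premap_aba, old_post, new_post, fe->seq };

    map[premap_aba] = new_post;               /* 4. the commit point */

    /* 5. old block becomes this lane's free block; seq and slot flip */
    *fe = (struct toy_free_entry){ old_post, next_seq(fe->seq),
                                   fe->slot ^ 1 };
}

With the slide's example (lane 0, premap 5, free block {2, 0b10, slot 0}, old map 5 → 10), this produces exactly flog[0][0] = {5, 10, 2, 0b10}, map[5] = 2, and free[0] = {10, 0b11, 1}.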
BTT – Analysis of a write

State: write() LBA 5 on CPU 0, lane 0; old map 5 → 10; Free List[0] = {blk 2, seq 0b10, slot 0}. The write sequence is:
  write data to 2
  flog[0][0] = {5, 10, 2, 0b10}
  map[5] = 2
  free[0] = {10, 0b11, 1}
  Release Lane 0
• Every gap between these steps is an opportunity for interruption/power failure; the next slides examine each one
BTT – Analysis of a write

• Interruption at the first opportunity, before/during 'write data to 2'
• On reboot:
  – No on-disk change had happened, everything comes back up as normal (block 2 was free, so any torn data in it is unreferenced)
BTT – Analysis of a write

• Interruption after the data write, before the flog write
• On reboot:
  – Map hasn't been updated
  – Reads will continue to get the 5 → 10 mapping
  – Flog will still show '2' as free and ready to be written to
BTT – Analysis of a write

• Interruption after the flog write, before the map write
• On reboot:
  – Read flog[0][0] = {5, 10, 2, 0b10}
  – Flog claims map[5] should have been '2', but map[5] is still '10' (== flog.old)
  – Since flog and map disagree, the recovery routine detects an incomplete transaction
  – Flog is assumed to be “true” since it is always written before the map
  – The recovery routine completes the transaction by updating map[5] = 2; free[0] = 10
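The recovery rule reduces to a small check per lane. A hedged sketch, reusing the toy types and next_seq() from the write-flow sketch earlier; 'newest' is assumed to be the newest valid flog slot for this lane:

/* Hedged sketch of per-lane recovery, reusing toy_slot/toy_free_entry
 * and next_seq() from the write-flow sketch. 'newest' is the newest
 * valid flog slot for this lane, found at slot index 'slot_idx'. */
void btt_recover_lane(uint32_t *map, const struct toy_slot *newest,
                      struct toy_free_entry *fe, unsigned int slot_idx)
{
    if (map[newest->lba] == newest->old_map) {
        /* flog and map disagree: the flog was written but the map
         * wasn't, so complete the transaction (flog is "true"). */
        map[newest->lba] = newest->new_map;
    }
    /* Completed or not, the old postmap block is this lane's free
     * block, e.g. map[5] = 2 and free[0] = 10 in the example above. */
    *fe = (struct toy_free_entry){ newest->old_map,
                                   next_seq(newest->seq), slot_idx ^ 1 };
}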
BTT – Analysis of a write

• Special case: the flog write itself is torn
• Bit sequence for flog.seq: 01 -> 10 -> 11 -> 01 (old → new)
• On reboot:
  – Read flog[0][0] = {5, 10, X, 0b11}; flog[0][1] = {X, X, X, 0b01}
  – Since seq is written last, the half-written flog entry does not show up as “new”
  – The free list is reconstructed using the newest non-torn flog entry, flog[0][1] in this case
  – map[5] remains '10', and '2' remains free
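Picking the newest valid slot is just cyclic-sequence comparison. A hedged sketch, reusing next_seq() from the write-flow sketch and assuming seq 0b00 marks a never-written slot:

/* Hedged sketch: pick the newer of a flog entry's two slots using the
 * 2-bit cyclic sequence 01 -> 10 -> 11 -> 01. Because the slots
 * flip-flop, two valid seq values are always cyclically adjacent. */
static int newer_slot(uint32_t seq0, uint32_t seq1)
{
    if (seq0 == 0)
        return 1;                       /* slot 0 never written */
    if (seq1 == 0)
        return 0;                       /* slot 1 never written */
    return seq1 == next_seq(seq0) ? 1 : 0;
}

For the torn example above, newer_slot(0b11, 0b01) picks slot 1, since 0b01 follows 0b11 in the cycle; the torn slot 0 is ignored.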
BTT – Analysis of a write

• Interruption after both the flog and the map were written
• On reboot:
  – Since both flog and map were updated, free list reconstruction will happen as usual
Synchronization
Let's Race! Write vs. Write

CPU 1                          CPU 2
write LBA 0                    write LBA 0
get-free[1] = 5                get-free[2] = 6
write data - postmap ABA 5     write data - postmap ABA 6
...                            ...
read old_map[0] = 10           read old_map[0] = 10
write log 0/10/5/xx            write log 0/10/6/xx
write map = 5                  write map = 6
write free[1] = 10             write free[2] = 10

• Both writers read the same old mapping, so block 10 ends up on two free lists and will eventually be handed out twice, while one of the two new blocks (5 or 6) leaks
• Everything from reading old_map through updating the free list is a critical section
Let's Race! Write vs. Write

● Solution: An array of map_locks indexed by a hash of the premap ABA

CPU 1: write LBA 0; get-free[1] = 5; write_data to 5
CPU 2: write LBA 0; get-free[2] = 6; write_data to 6

CPU 1:
  lock map_lock[0 % nfree]
  read old_map[0] = 10
  write log 0/10/5/xx; write map = 5; free[1] = 10
  unlock map_lock[0 % nfree]

CPU 2:
  lock map_lock[0 % nfree]
  read old_map[0] = 5
  write log 0/5/6/xx; write map = 6; free[2] = 5
  unlock map_lock[0 % nfree]
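A hedged userspace sketch of the map_lock idea: one lock per hash bucket of the premap ABA, guarding the read-old-map / write-flog / write-map / update-free-list span. pthread mutexes stand in for kernel locking, and the modulo hash follows the slide:

/* Hedged sketch of the map_lock array; pthread mutexes stand in for
 * kernel locking, and NFREE here is illustrative. */
#include <pthread.h>
#include <stdint.h>

#define NFREE 256   /* illustrative */

static pthread_mutex_t map_locks[NFREE];

static void map_locks_init(void)
{
    for (int i = 0; i < NFREE; i++)
        pthread_mutex_init(&map_locks[i], NULL);
}

static pthread_mutex_t *map_lock_for(uint32_t premap_aba)
{
    return &map_locks[premap_aba % NFREE];  /* hash: modulo nfree */
}

/* The write path's critical section then becomes:
 *
 *   pthread_mutex_lock(map_lock_for(premap_aba));
 *   old_post = map[premap_aba];
 *   ...write flog, write map, update free list...
 *   pthread_mutex_unlock(map_lock_for(premap_aba));
 */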
Let's Race! Read vs. Write

CPU 1 (Reader)                     CPU 2 (Writer)
read LBA 0                         write LBA 0
...                                get-free[2] = 6
read map[0] = 5                    write data to postmap block 6
start reading postmap block 5      write meta: map[0] = 6, free[2] = 5
...                                another write LBA 12
...                                get-free[2] = 5
...                                write data to postmap block 5
finish reading postmap block 5

● BUG! – writing a block that is being read from
● This doesn't corrupt the on-disk layout, but the read appears torn
Let's Race! Read vs. Write

● Solution: A Read Tracking Table indexed by lane, tracking in-progress reads

CPU 1 (Reader)                     CPU 2 (Writer)
read LBA 0                         write LBA 0
read map[0] = 5                    get-free[2] = 6; write data
write rtt[1] = 5                   write meta: map[0] = 6, free[2] = 5
start reading postmap block 5      another write LBA 12
...                                get-free[2] = 5
...                                scan RTT – '5' is present - wait!
finish reading postmap block 5     ...
clear rtt[1]                       ...
                                   write data to postmap block 5
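A hedged sketch of the Read Tracking Table: readers advertise the postmap block they are reading, and writers wait before reusing it. C11 atomics and all names are illustrative stand-ins for the kernel's primitives, and a real implementation needs careful memory ordering around these accesses:

/* Hedged sketch of the RTT: one entry per lane, advertising the
 * postmap block a reader is using. Entries must be initialized to
 * RTT_INVALID at startup (zero would collide with block 0). */
#include <stdatomic.h>
#include <stdint.h>

#define MAX_LANES   256         /* illustrative */
#define RTT_INVALID UINT32_MAX  /* "no read in flight on this lane" */

static _Atomic uint32_t rtt[MAX_LANES];

static void rtt_begin_read(unsigned int lane, uint32_t postmap_aba)
{
    atomic_store(&rtt[lane], postmap_aba);  /* before touching the data */
}

static void rtt_end_read(unsigned int lane)
{
    atomic_store(&rtt[lane], RTT_INVALID);
}

/* Writer side: before writing data into a block just pulled off the
 * free list, wait for any in-flight read of that block to drain. */
static void rtt_wait_for_readers(uint32_t postmap_aba, unsigned int nlanes)
{
    for (unsigned int i = 0; i < nlanes; i++)
        while (atomic_load(&rtt[i]) == postmap_aba)
            ;   /* spin: reads are short and bounded */
}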
Performance/Efficiency
That's Great...but is it Fast?
                      512B Block size    4K Block size
Write Amplification   ~4.6% [536B]       ~0.5% [4120B]
Capacity Overhead     ~0.8%              ~0.1%

● Overall, BTT introduces a ~10% performance overhead
● We think there is still room for improvement
BTT vs. DAX
● DAX stands for Direct Access
● Patchset by Matthew Wilcox, merged into 4.0-rc1
● Allows mapping a pmem range directly into userspace via mmap
● DAX is fundamentally incompatible with the idea of BTT
● If the application is aware of persistent, byte-addressable memory, and can use it to its advantage, DAX is the best path for it
● If the application relies on atomic sector update semantics, it must use the BTT
  – It may not know that it relies on this...
● XFS relies on journal updates being sector-atomic
  – For xfs-dax, we'd need to use logdev=/dev/[btt-partition]
Resources
● http://pmem.io – General persistent memory resources. Focuses on the NVML, a library to make persistent memory programming easier
● The 'pmem' driver on github: https://github.com/01org/prd
● linux-nvdimm mailing list: https://lists.01.org/mailman/listinfo/linux-nvdimm
● linux-nvdimm patchwork: https://patchwork.kernel.org/project/linux-nvdimm/list/
● #pmem on OFTC
Q&A