Deriving and Comparing Deduplication Techniques Using a Model-Based Classification
Data Deduplication
•  Technique to save storage capacity
•  Exploits redundancy in the stored data
•  Commonly used for backup systems
•  Commercial examples: Data Domain, HP, …

Goals
Goal 1: Uncouple core concepts from implementations
•  Traditional view: systems as a set of inherently linked characteristics
•  But: systems consist of independent core concepts
•  A deduplication approach is defined by its prefetching approach
•  Classification: prefetching vs. deduplication exactness
Basic technique
•  Split the data up into chunks
•  Fingerprint the chunks
•  Compare by hash
•  Remove duplicates
Each approach can run in an exact or an approximate variant.
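The basic technique above can be sketched in a few lines. This is a minimal illustration, not any of the systems on this poster: it uses fixed-size chunks (real deduplication systems typically use content-defined chunking) and keeps the whole chunk index in a dictionary.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 8 * 1024):
    """Split data into chunks, fingerprint them, and drop duplicates."""
    index = {}   # fingerprint -> chunk ("chunk index")
    recipe = []  # ordered fingerprints needed to reconstruct the data
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        fp = hashlib.sha256(chunk).hexdigest()  # compare-by-hash
        if fp not in index:                     # duplicates are removed
            index[fp] = chunk
        recipe.append(fp)
    return index, recipe


def restore(index, recipe):
    """Rebuild the original data from the index and the recipe."""
    return b"".join(index[fp] for fp in recipe)
```

For example, four 8 KB chunks of which three are identical are stored as only two chunks plus a four-entry recipe.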
[Figure: memory/disk layout of Container Caching (CC), Block Locality Caching (BLC), and Sparse Indexing (SI), each in an exact and an approximate variant]
•  CC (exact): disk: chunk index, container storage; memory: Bloom filter, container cache
•  BLC (exact): disk: chunk index, block index; memory: Bloom filter, BLC
•  SI (exact): disk: chunk index, segment storage; memory: sparse index, manifest cache
•  CC (approx.): disk: container storage; memory: sparse index, container cache
•  BLC (approx.): disk: block index; memory: sparse index, BLC
•  SI (approx.): disk: segment storage; memory: sparse index, manifest cache

Different Data Sets and Chunk Sizes

Different Data Sets
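The exact variants keep a Bloom filter in memory in front of the on-disk chunk index, so that lookups for fingerprints that were never stored usually avoid a disk access. A minimal sketch (bit-array size and double-hashing scheme are illustrative choices, not taken from the poster):

```python
import hashlib


class BloomFilter:
    """In-memory pre-check in front of the on-disk chunk index."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by hashing the item with a counter prefix.
        for i in range(self.num_hashes):
            h = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        # False => definitely new chunk: the disk lookup can be skipped.
        # True  => possibly stored: consult the on-disk chunk index.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A negative answer is always reliable; only positive answers (with a small false-positive probability) require touching the disk.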
•  Weekly backups
•  3x university home directories (HOME), 1x Windows machines (Meyer et al.)
[Figure: deduplication results per data set, exact vs. approximate]
Goal 2: Extensive comparison of all approaches
•  Container Caching and Sparse Indexing have not previously been compared on the same data sets
•  Comparison of the approaches: {CC, BLC, SI} × {exact, approximate} × {chunk sizes} …
•  IO patterns
Different RAM Sizes for Approximate Approaches
•  So far: 8 GB total memory (cache size relative to data set size)
•  But: today 128 GB and more are common
Different Chunk Sizes
•  Tradeoff: smaller chunks detect more duplicates but need more memory
•  4-16 KB (HOME), 8 and 16 KB (Microsoft)
[Figure: results for different chunk sizes, exact vs. approximate]
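The memory side of this tradeoff can be made concrete with a back-of-the-envelope calculation: the chunk index needs roughly one entry per stored chunk, so halving the chunk size doubles the index. The 32-byte entry size below is an assumption for illustration only.

```python
def index_size_bytes(data_bytes: int, chunk_size: int,
                     entry_bytes: int = 32) -> int:
    """Approximate chunk-index size: one fixed-size entry per chunk."""
    return (data_bytes // chunk_size) * entry_bytes


TB = 1 << 40
KB = 1 << 10
for cs in (4 * KB, 8 * KB, 16 * KB):
    gib = index_size_bytes(TB, cs) / (1 << 30)
    print(f"{cs // KB:2d} KB chunks -> {gib:.1f} GiB index per TB of data")
# 4 KB chunks need a 4x larger index than 16 KB chunks (8.0 vs. 2.0 GiB/TB).
```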
IO Access Patterns
[Figure: IO access patterns of Container Caching, Block Locality Caching, and Sparse Indexing]
Jürgen Kaiser, André Brinkmann, Tim Süß, Johannes Gutenberg University Mainz, {kaiserj, brinkman, suesst}@uni-mainz.de
Dirk Meister, Pure Storage, dirk@purestorage.com
https://research.zdv.uni-mainz.de