Correlating Multiple TB of Performance Data to User Jobs

Michael Kluge, ZIH (michael.kluge@tu-dresden.de)
Lustre User Group 2015, Denver, Colorado

Zellescher Weg 12
Willers-Bau A 208
Tel. +49 351 463 34217
About us
ZIH:
– HPC and service provider
– Research Institute
– Big Data Competence Center
Main research areas:
– Performance Analysis Tools
– Energy Efficiency
– Computer Architecture
– Standardization Efforts (OpenACC, OpenMP)
ZIH HPC and BigData Infrastructure
[Overview diagram: ZIH backbone with a 100 Gbit/s cluster uplink, connected to TU Dresden, HTW, Chemnitz, Leipzig, Freiberg, Erlangen, and Potsdam; Home and Archive storage]
– Parallel machine: BULL HPC, Haswell processors, 2 x 612 nodes, ~30,000 cores
– Other clusters: Megware cluster, IBM iDataPlex
– High throughput: BULL islands with Westmere, Sandy Bridge, and Haswell processors, GPU nodes, 700 nodes, ~15,000 cores
– Shared memory: SGI Ultraviolet 2000, 512 Sandy Bridge cores, 8 TB NUMA RAM
– Storage
HRSK-II, Phase 2, currently installed
[Overview diagram, all islands connected via FDR InfiniBand]
– GPU isle: 64 nodes, Haswell 2 x 12 cores, 64 GB RAM + 2 x Nvidia K80
– Throughput isle: 232 nodes, Haswell 2 x 12 cores, 64/128/256 GB RAM, 128 GB SSD
– 2 x HPC isle: 612 nodes each, Haswell 2 x 12 cores, 64 GB RAM, 128 GB SSD
– 2 x export nodes (64 GB RAM), 2 x SMP nodes (2 TB RAM, 48 cores)
– 4 x login nodes, 4 x 10GE to campus (uplink, home, backup)
– 24 Lustre servers: 5 PB NL-SATA at 80 GB/s and 400,000 IOPS; 40 TB SSD at 10 GB/s and 1 million IOPS
~1.2 PF peak in CPUs, ~35,000 cores total
~300 TF peak in GPUs, 128 cards
Architecture of our storage concept (FASS)
[Architecture diagram: the ZIH infrastructure and the Flexible Storage System (FASS). Users A … Z reach the HPC component and the throughput component through login and export nodes and switches 1 and 2; the batch system schedules onto both. File systems (Scratch, Checkpoint, Transaction SCRATCH) are provided by storage servers 1 … N on SSD, SAS, and SATA devices. Access statistics feed analysis, optimization, and control.]
Performance Analysis for I/O Activities (1)
What is of interest:
– Application requests, interface types
– Access sizes/patterns
– Network/server/storage utilization
Level of detail:
– Record everything
– Record data every 5 to 10 seconds (a minimal sampling sketch follows below)
Challenge:
– How to analyze this?
– How to deal with anonymous data?
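To illustrate the "record data every 5 to 10 seconds" level of detail, here is a minimal sketch of a sampling loop over the per-OST stats counters, not the ZIH collector. It assumes the stats files live under /proc/fs/lustre/obdfilter; the exact path and line format differ between Lustre versions (newer releases may expose them elsewhere), so treat both as assumptions.

```python
import glob
import time

STATS_GLOB = "/proc/fs/lustre/obdfilter/*/stats"   # assumed path; differs between Lustre versions
INTERVAL = 10                                       # sample every 10 seconds

def read_counters():
    """Return {ost: {counter: operation_count}} from the per-OST stats files."""
    snapshot = {}
    for path in glob.glob(STATS_GLOB):
        ost = path.split("/")[-2]
        counters = {}
        with open(path) as f:
            for line in f:
                fields = line.split()
                # typical line: "write_bytes  123 samples [bytes] 4096 1048576 98765432"
                # fields[1] is the number of operations since server start
                if len(fields) >= 2 and fields[1].isdigit():
                    counters[fields[0]] = int(fields[1])
        snapshot[ost] = counters
    return snapshot

prev = read_counters()
while True:
    time.sleep(INTERVAL)
    cur = read_counters()
    now = int(time.time())
    for ost, counters in cur.items():
        for name, total in counters.items():
            delta = total - prev.get(ost, {}).get(name, total)
            print(now, ost, name, delta)   # a real agent would push this to the collection system
    prev = cur
```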
I/O and Job Correlation: Architecture/Software
[Architecture diagram: the HPC file system (compute nodes, MDS, OSS, RAID, SAN, I/O network) and the SLURM batch system are instrumented via agent interfaces of the Dataheap data collection system. Inputs: batch system log files, global I/O data, application events, node-local events, server utilization and errors, backend/cache metrics, throughput and IOPS. Data processing and database storage feed a correlation system that performs cluster formation and answers data requests, yielding job clusters with high and low I/O demands.]
Dataheap Data Collection System: https://fusionforge.zih.tu-dresden.de/projects/dataheap/
Visualization of Live Data for Phase 1
Visualization of Live Data for Phase 2
Performance Analysis for I/O Activities (2)
Amount of collected data:
– 240 OSTs, 20 OSS servers, …
– Looking at stats and brw_stats (how to store a histogram in a database? one option is sketched below)
→ ~75,000 tables
– about 1 GB of data per hour
Analysis is supposed to cover 6 months:
– 4 TB of data
– No way to analyze this with any serial approach
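One possible answer to the histogram question, purely as a sketch and not the scheme used at ZIH: store every brw_stats histogram as rows of (timestamp, OST, metric, bucket, counts) in a single table instead of one table per histogram, so buckets stay queryable and the table count does not grow. Table and column names below are invented for illustration.

```python
import sqlite3
import time

# Hypothetical schema: one row per histogram bucket instead of one table per OST/metric.
conn = sqlite3.connect("iostats.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS brw_hist (
        ts      INTEGER,   -- sample time (epoch seconds)
        ost     TEXT,      -- OST name
        metric  TEXT,      -- e.g. 'pages per bulk r/w'
        bucket  TEXT,      -- histogram bucket label, e.g. '256'
        reads   INTEGER,
        writes  INTEGER
    )
""")

def store_histogram(ts, ost, metric, buckets):
    """buckets: list of (label, read_count, write_count) parsed from brw_stats."""
    conn.executemany(
        "INSERT INTO brw_hist VALUES (?, ?, ?, ?, ?, ?)",
        [(ts, ost, metric, label, r, w) for label, r, w in buckets],
    )
    conn.commit()

# Example with made-up numbers for one OST
store_histogram(int(time.time()), "scratch-OST0001", "pages per bulk r/w",
                [("1", 10, 3), ("256", 420, 3900)])

# Re-assemble the histogram for one OST over a time window
rows = conn.execute(
    "SELECT bucket, SUM(reads), SUM(writes) FROM brw_hist "
    "WHERE ost = ? AND ts BETWEEN ? AND ? GROUP BY bucket",
    ("scratch-OST0001", 0, int(time.time()))).fetchall()
print(rows)
```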
Analysis Process
map operations to a set of tables:
– modify all tables (align data, scale, aggregate over time)
– create a new table from an input set of tables (aggregation)
– plot (time series, histograms)
currently: single-threaded Haskell program
Swift/T port is work in progress (a workflow sketch follows below):
– 4 TB means 50 seconds on an 80 GB/s file system for the initial read
– Swift/T has nice ways to describe data dependencies and to parallelize this kind of workflow (open for suggestions!)
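The following sketch shows the same workflow structure in plain Python rather than Swift/T or the ZIH Haskell program: per-table operations (align, scale, aggregate) are independent and fan out in parallel, and a final aggregation depends on all of them. Function names and the toy data are invented.

```python
from concurrent.futures import ProcessPoolExecutor

def align(table):
    """Align (timestamp, value) samples to a common 10 s time grid (placeholder)."""
    return [(ts - ts % 10, value) for ts, value in table]

def aggregate_over_time(table, window=60):
    """Sum values per time window (placeholder for the scale/aggregate steps)."""
    out = {}
    for ts, value in table:
        out[ts - ts % window] = out.get(ts - ts % window, 0) + value
    return sorted(out.items())

def process_table(table):
    # one independent pipeline per input table -> trivially parallel
    return aggregate_over_time(align(table))

def merge(tables):
    """Create a new table from a set of processed tables (aggregation)."""
    out = {}
    for table in tables:
        for ts, value in table:
            out[ts] = out.get(ts, 0) + value
    return sorted(out.items())

if __name__ == "__main__":
    # made-up input: a few (timestamp, value) series standing in for ~75,000 DB tables
    inputs = [[(t, i) for t in range(0, 600, 5)] for i in range(8)]
    with ProcessPoolExecutor() as pool:
        processed = list(pool.map(process_table, inputs))
    print(merge(processed)[:3])
```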
Approach to Looking at the Raw Data
analysis of the raw performance data:
– take the original data and create derived metrics
– look at plots and histograms
correlate metrics with user jobs:
– take the job log and split jobs into classes
– check for correlations between job classes and performance metrics (a minimal sketch follows below)
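As an illustration of the last step, and not the ZIH tool itself, a correlation between a per-class activity series (how many jobs of one class were running in each interval) and an anonymous performance metric can be computed with a plain Pearson coefficient. All jobs and bandwidth values below are invented.

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def class_activity(jobs, t0, t1, step):
    """Number of jobs of one class running in each interval [t, t+step)."""
    return [sum(1 for start, end in jobs if start < t + step and end > t)
            for t in range(t0, t1, step)]

# invented example: three jobs of one class and a write-bandwidth series (GB/s per 40 s interval)
jobs = [(0, 120), (60, 200), (400, 520)]
bandwidth = [5, 9, 8, 7, 6, 1, 0, 1, 6, 7, 8, 2, 1]
activity = class_activity(jobs, 0, 520, 40)
print(round(pearson(activity, bandwidth), 2))
```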
Metric Example: NetApp vs. OST Bandwidth (20s)
Metric Example: NetApp vs. OST Bandwidth (60s)
Metric Example: NetApp vs. OST Bandwidth (10 minutes)
Metric Example: 3 Months of Bandwidth vs. IOPS
I/O and Job Correlation: Starting Point (Bandwidth)
Building Job Classes
a class is defined by:
– user id, core count
– similar runtime
– similar start time (time isolation)
sort all jobs by runtime, split the list at gaps
split these groups again if they are not overlapping in time (long gap between starts)
challenges: runtime fluctuation, large runtime variations (a minimal sketch of the splitting follows below)
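Here is a minimal sketch of the splitting heuristic described above, not the actual ZIH code: within one (user id, core count) group, jobs are sorted by runtime and the list is split wherever the runtime jumps by more than a threshold; each resulting group is then split again at large gaps between start times. Thresholds, field names, and the example job log are invented.

```python
def split_at_gaps(jobs, key, gap):
    """Sort jobs by key(job) and split the sorted list wherever
    consecutive key values differ by more than `gap`."""
    jobs = sorted(jobs, key=key)
    groups, current = [], [jobs[0]]
    for job in jobs[1:]:
        if key(job) - key(current[-1]) > gap:
            groups.append(current)
            current = []
        current.append(job)
    groups.append(current)
    return groups

def build_classes(jobs, runtime_gap=300, start_gap=3600):
    """jobs: dicts with user, cores, start, runtime (seconds). Returns job classes."""
    classes = []
    by_user_cores = {}
    for job in jobs:
        by_user_cores.setdefault((job["user"], job["cores"]), []).append(job)
    for group in by_user_cores.values():
        # first split: similar runtime
        for runtime_group in split_at_gaps(group, lambda j: j["runtime"], runtime_gap):
            # second split: similar start time (time isolation)
            classes.extend(split_at_gaps(runtime_group, lambda j: j["start"], start_gap))
    return classes

# invented job log entries
log = [
    {"user": "a", "cores": 24, "start": 0,     "runtime": 600},
    {"user": "a", "cores": 24, "start": 700,   "runtime": 620},
    {"user": "a", "cores": 24, "start": 90000, "runtime": 610},   # similar runtime, isolated in time
    {"user": "a", "cores": 24, "start": 100,   "runtime": 7200},  # very different runtime
]
for i, cls in enumerate(build_classes(log)):
    print(i, [(j["start"], j["runtime"]) for j in cls])
```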
I/O and Job Correlation: Preliminary Result (1)
I/O and Job Correlation: Preliminary Result (2)
I/O and Job Correlation: Correlation Factors
Open Topics
How to store a histogram in a database?
Other approaches than Swift/T
Deal with the zeros
Cache intermediate results (one possible direction is sketched below)
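For the caching item, one simple direction, purely a sketch and not an existing ZIH component, is to key derived tables by a hash of the operation, its parameters, and its inputs and keep them on disk, so repeated analysis runs over the same six-month window skip recomputation. The cache location and the aggregation example are made up.

```python
import hashlib
import json
import os
import pickle

CACHE_DIR = "derived_cache"  # hypothetical location for cached derived tables

def cached(op_name, params, inputs, compute):
    """Return compute(inputs), reusing a disk copy keyed by operation + parameters + inputs."""
    key = hashlib.sha256(
        json.dumps([op_name, params, inputs], sort_keys=True, default=str).encode()
    ).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute(inputs)
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# usage: aggregate a made-up time series to 60 s windows, cached across runs
series = [(t, 1.0) for t in range(0, 600, 5)]
agg = cached("aggregate", {"window": 60}, series,
             lambda s: [(t, sum(v for ts, v in s if t <= ts < t + 60))
                        for t in range(0, 600, 60)])
print(agg[:2])
```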
Conclusion / Questions
it is possible to map anonymous performance data back to user jobs
advance towards storage controller analysis (caches, backend bandwidth, …)