Bad Things Happen to Good People: How to Minimize the Pain

NEODBUG – November 21st, 2013
Bad Things Happen to Good People:
How to Minimize the Pain
DB2 for Linux, UNIX, and Windows
Kelly Schlamb (kschlamb@ca.ibm.com)
Executive IT Specialist, Worldwide Information Management Technical Sales
IBM Canada Ltd.
© 2013 IBM Corporation
Disclaimer
© Copyright IBM Corporation 2013. All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE
MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED
“AS IS” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM'S CURRENT
PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE
FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER
DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY
WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF
ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.
IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion. Information
regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or
functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any
future features or functionality described for our products remains at our sole discretion.
IBM, the IBM logo, ibm.com, Information Management, DB2, DB2 Connect, DB2 OLAP Server, pureScale, System Z, Cognos, solidDB, Informix,
Optim, InfoSphere, and z/OS are trademarks or registered trademarks of International Business Machines Corporation in the United States, other
countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or
™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks
may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and
trademark information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
2
© 2013 IBM Corporation
Logging and Recovery
October 28, 2013
© 2013 IBM Corporation
What Could Possibly Go Wrong? Famous Last Words.
As the Boy Scouts say … "Be prepared!"
Database server
goes down (planned
or unplanned)
Underlying storage
becomes unusable
Site disaster
Database is
accidentally dropped
Table is accidentally
dropped or
data deleted
4
Hardware component
failure
Table space is
accidentally dropped
© 2013 IBM Corporation
Develop a Recovery Plan
ƒ Some questions to ask yourself when developing
a recovery plan
– Does the database need to be recoverable?
– What are your RPO and RTO requirements?
• Recovery Point Objective – how much data loss is acceptable, if any, if a major
incident occurs?
• Recovery Time Objective – what is an acceptable length of time to perform the
recovery while the system is unavailable?
–
–
–
–
How frequent do the backup operations need to be?
How much storage space can be used for backups and archived logs?
Where do you want the backups and archived logs to go?
Will table space level backups be sufficient, or will full database backups be
necessary?
– Do the backups need to be automated?
– Is high availability (HA) a consideration?
– Is off-site disaster recovery (DR) a consideration?
5
© 2013 IBM Corporation
Logging & Log File Management Recommendations
ƒ Use archive logging (not circular logging) for
production environments
– Provides better recovery characteristics (lower RPO)
– Permits use of online backups (better availability) and table space level
backups (more granular)
– Can setup two archive log paths for best protection
ƒ Include logs in backup images (default behavior)
– Allows restoring of an online backup if all you have is the backup image
(e.g. disaster recovery)
ƒ Configure mirrored logging on separate file systems
– Protects against file system corruption or accidental deletion of log files
ƒ For archive logging, consider using automated log file management
– Choose how long to retain recovery objects like backups and archived logs and
when to automatically prune them
– See NUM_DB_BACKUPS, REC_HIS_RETENTN, and AUTO_DEL_REC_OBJ
database configuration parameters
6
© 2013 IBM Corporation
Backup & Recovery
ƒ Various methods and options available for backing up and
recovering the data in your databases
–
–
–
–
–
–
Offline or online backups
Database or table space backups
Split mirror / flash copy backups
Incremental backups
Rebuild database from table space backup images
Backup compression
ƒ As database size increases, consider using more frequent, online
table space level backups
ƒ Built-in autonomics for backup command provides optimal values for
number of buffers, buffer size, and parallelism
– Values calculated based on amount of utility heap memory available, the
number of processors available, and the database configuration
ƒ Automatic backup maintenance can be enabled
7
© 2013 IBM Corporation
Automatic Backups
ƒ Automatic database backups simplify backup management
by ensuring that recent full backup of database is performed
ƒ The need to perform a backup is based on one or more of the
following criteria
– You have never created a full database backup
– The time elapsed since the last full backup is more than a specified number of hours
– The transaction log space consumed since the last backup is more than a specified
number of 4 KB pages (in archive logging mode only).
ƒ Backups can be configured to be offline or online
ƒ Supports disk, tape, TSM, and vendor DLL media types
ƒ Feature is enabled/disabled by using auto_db_backup and auto_maint
database configuration parameters
ƒ Policies can be defined via SYSPROC.AUTOMAINT_SET_POLICY and
SYSPROC.AUTOMAINT_SET_POLICYFILE stored procedures
8
© 2013 IBM Corporation
Incremental Backups
ƒ Incremental backups contain data changed since previous backups
(depending on the type of backup – incremental or delta)
– In addition to data, each incremental backup image also contains all of the database
metadata
ƒ Incremental (cumulative) backup image
– Contains all database data that has changed since the most recent, successful, full
backup operation
– Cumulative because a series of incremental backups taken over time will each have the
contents of the previous incremental backup image
ƒ Delta backup image
– Contains a copy of all database data that has changed since the last successful backup
(full, incremental, or delta) of the table space in question
ƒ Combinations of database and table space incremental backups are
permitted, in both online and offline modes of operation
ƒ To restore a database or table space to a consistent state, the recovery
process must begin with a consistent image of the object to be restored,
followed by the application of the appropriate incremental backup images
ƒ TRACKMOD database configuration parameter must be set to YES
9
© 2013 IBM Corporation
Incremental Backups: Examples
ƒ Weekly full backups, daily
incremental (cumulative)
backups:
ƒ Weekly full backups, daily
delta backups:
ƒ Weekly full backups (Sunday),
incremental (cumulative)
backup half-way through week,
delta backups all other days:
ƒ Recovery required (prior to
Saturday's delta being taken):
10
© 2013 IBM Corporation
Database Rebuild
ƒ Automatic restoring of necessary backup images
ƒ Rebuild a database from table space backup images
– Means no longer having to take as many full database backups, which is
becoming less possible as databases grow in size
– Instead, take more frequent table space backups
ƒ In a recovery situation, if you need to bring a subset of table spaces
online faster than others, you can do a partial database rebuild
– May also be used for
• Creating a separate database for QA purposes
• Data recovery purposes
ƒ Can choose which table spaces to restore as part of the rebuild
–
–
–
–
11
All table spaces in the database at a time that the backup was taken
All table spaces included in a selected table space backup
A specific list of table spaces specified as part of the restore command
All table spaces except those in the specific list provided
© 2013 IBM Corporation
Database Disaster Recovery using Rebuild
ƒ Necessary table spaces will be restored automatically based on
recovery history data
Recovery
History File
SalesDB
SYSCATSPACE
2
SMS01
DMS01
3
Backup
Database
SALESDB
SUN
Backup
TS DMS01
MON
TUE
Backup
TS DMS02
WED
THU
DMS02
1
Backup
TS DMS01
FRI
Database
Fails !
SAT
1. RESTORE DB SALESDB
REBUILD WITH ALL TABLESPACES IN DATABASE TAKEN AT <Friday>
(DB2 restores Friday, Sunday and Thursday)
2. ROLLFORWARD DB SALESDB TO END OF LOGS AND STOP
12
© 2013 IBM Corporation
Database Recovery - Rebuild from Table Space Backups
ƒ Database recovery using multiple table space backups
Recovery
History File
SalesDB
SYSCATSPACE
2
Backup
TS DMS02
DMS01
SUN
TUE
DMS01
3
DMS02
1
Backup
TS SMS01
SYSCATSPACE
MON
SMS01
Backup
TS DMS02
DMS01
WED
THU
Backup
TS DMS01
FRI
Database
Fails !
SAT
1. RESTORE DB SALESDB
REBUILD WITH ALL TABLESPACES IN DATABASE TAKEN AT <Friday>
(DB2 restores Friday, Tuesday and Thursday)
2. ROLLFORWARD DB SALESDB TO END OF LOGS AND STOP
13
© 2013 IBM Corporation
Database Partial Copy using Rebuild
ƒ Can create a database copy with a subset of table spaces
SalesDB
SalesDB
SYSCATSPACE
2
DMS01
1.
DMS01
DMS02
1
Backup
Database
SALESDB
SUN
SYSCATSPACE
SMS01
Backup
TS DMS01
MON
TUE
Backup
TS DMS02
WED
THU
Backup
TS DMS01
FRI
Copy for
testing
SAT
RESTORE DB SALESDB
REBUILD WITH TABLESPACE(SYSCATSPACE,DMS01) TAKEN AT Friday
(DB2 Restores Friday, and Sunday (just SYSCATSPACE TS)
OR
RESTORE DB SALESDB REBUILD WITH ALL TABLESPACES IN DATABASE
EXCEPT TABLESPACE(SMS01,DMS02) TAKEN AT Friday
2. ROLLFORWARD DB SALESDB TO END OF LOGS AND STOP
14
© 2013 IBM Corporation
Transportable Schemas
ƒ Efficient schema movement between databases
ƒ Using a backup image as the source, allows you to copy a set of
table spaces and SQL schemas from one database into another
ƒ A database schema must be transported in its entirety
– If a table space contains both the schema you want to transport, as well as
another schema, you must transport all data objects from both schemas
– These self contained (from a table space perspective) sets of schemas that
have no references to other database schemas are called transportable sets
ƒ Restore will do multiple operations under the covers
–
–
–
–
Restore SYSCATSPACE and specified table spaces from backup image
Roll them forward to a point of consistency
Validate the schemas specified
Transfer ownership of the specified table spaces (including containers) to the
target database
– Recreate the schema in the target database
15
© 2013 IBM Corporation
Transportable Schemas: Example
Not a valid transportable set
Valid transportable set
The database contains the following valid transportable sets:
ƒ
ƒ
ƒ
ƒ
mydata1: schema1 + schema2
mydata2 + myindex: schema3
multidata1 + multiuser2 + multiindex1: schema4 + schema5
Any combination of the above transportable sets
To Move All Table Spaces:
restore db old_db \
tablespace (“mydata1”,”mydata2”,”myindex”,”multidata1”,”multiindex1”,”multiuser2”)\
schema (“schema1”,”schema2”,”schema3”,”schema4”,”schema5”) transport into new_db
16
© 2013 IBM Corporation
DB2's Dropped Table Recovery Feature
ƒ Recover the contents of a dropped table using DB2's
table space restore and rollforward operations
– When rolling forward through the drop of the table, the data is
exported prior to the replay of the drop
ƒ Requires that the table space be enabled for dropped table recovery
– Enabled by default at table space creation time
ƒ When a table is dropped, an entry is made in the transaction log files as well
as in the recovery history file
ƒ You can recover a dropped table by doing the following:
1. Identify the dropped table by invoking the LIST HISTORY DROPPED TABLE command
2. Restore a database- or table space-level backup image taken before the table
was dropped
3. Create an export directory to which files containing the table data are to be written
4. Roll forward to a point in time after the table was dropped (or to end of logs) by using the
RECOVER DROPPED TABLE parameter on the ROLLFORWARD DATABASE command
5. Re-create the table by using the CREATE TABLE statement from the recovery history file
6. Import the table data that was exported during the rollforward operation into the table
17
© 2013 IBM Corporation
Proactive Checking of Database and Recovery Objects
ƒ db2dart – Database analysis and reporting tool
– Examines database for architectural correctness
ƒ INSPECT – Database inspection command
– Inspects database for architectural integrity
– Can be run while the database is online
ƒ db2ckbkp – Check backup command
– Test the integrity of a backup image
ƒ db2cklog – Check archive log file validity command
– Check the validity of archive log files in order to determine whether or not the
log files can be used during rollforward recovery of a database or table space
18
© 2013 IBM Corporation
IBM DB2 Recovery Expert
Granular and Flexible Data Recovery
Part of the DB2
Advanced
Recovery Feature
ƒ Faster
– Simplifies and optimizes database recovery by reducing disruption during the
recovery process
• DBAs can quickly restore or correct erroneous data
– Log Analysis enables organizations to monitor changes that allow for
quick recovery
ƒ Smarter
– Provides intelligent analysis of DB2 and DB2 recovery assets to find the most
efficient recovery path
ƒ Simpler
– Facilitates process of rebuilding data assets to a specified point in time, often
without taking operations offline
“AFS is establishing a disaster recovery policy with our Vision Application. DB2 Recovery Expert
provides us with the functionality to roll back both databases to a point where the tables are
consistent. This will help us meet 100% of our needs for this project. The product itself is awesome
and WEB UI is very nice.”
-Kirk B. Spadt, Principal Architect, Automated Financial Systems
19
© 2013 IBM Corporation
IBM DB2 Merge Backup (MBK)
Part of the DB2
Advanced
Recovery Feature
ƒ Merge full DB2 backups with DB2 incremental/delta
backups to build a new up-to-date full DB2 backup image
ƒ Eliminates the need to take DB2 full backups
ƒ Use online or offline, and table space or database backup images
ƒ Split out table space backup images from a full backup image
ƒ Run MBK on the database server or on a standalone remote machine
ƒ Benefits include
– Reduce backup intensive resources on the database server
– Reduce backup storage footprint
• Eliminate full DB2 backups by backing up only what needs to be backed up
– Reduce number of objects required during recovery
• Simplify recovery process through restore of up-to-date full backup images
• Speed up recovery with up-to-date more recent full backup images
20
© 2013 IBM Corporation
Merge Backup Overview
Time
Full DB2 Backup
from DB2
DB2 Incremental Backups
Merge Backup
Processing
Full DB2 Backup
from MBK
21
© 2013 IBM Corporation
Eliminate the Need to Take Full DB2 Backups
MBK
Full DB2
Full DB2
Backup
Backup
MBK
MBK
...
22
Full DB2
Full DB2
Backup
Backup
© 2013 IBM Corporation
High Availability and Disaster Recovery
October 28, 2013
© 2013 IBM Corporation
What Does Downtime Mean To Your Business?
Damaged
DamagedReputation
Reputation
Negative
NegativePress
Press
Loss
Lossof
ofMarket
MarketShare
Share
to
Competitors
to Competitors
Customer
Customer
Dissatisfaction
Dissatisfaction
Lost
LostRevenue
Revenue
Availability
Downtime per Year
99.999%
5 minutes
99.99%
52 minutes
99.9%
8 hours, 45 minutes
99%
3.65 days
95%
18.25 days
64% of organizations require
that their databases deliver a
minimum of 99.99% or better
uptime for their most mission
critical applications *
24
Breach
Breachof
of
SLA/Contract
SLA/Contract
Reduced
ReducedIT/Employee
IT/Employee
Productivity
Productivity
Reduced
ReducedEnd-User
End-User
Productivity
Productivity
Reduced
ReducedCompany/
Company/
Shareholder
ShareholderValue
Value
* ITIC Paper "Two-Thirds of Corporations Now Require 99.99% Database Uptime, Reliability ", Laura DiDio, July 10th, 2013
© 2013 IBM Corporation
DB2 Has the Right Solution to Meet Your HA Needs
HADR
HADR
25
CF
CF
pureScale
Integrated
Clustering
© 2013 IBM Corporation
High Availability and/or Disaster Recovery Options
ƒ Server failover
– Shared disk or remote disk mirroring
ƒ HADR
–
–
–
–
High Availability and/or Disaster Recovery
Easy to set up and manage
Automatic failover with TSA integration
Fast failover
ƒ Q-Replication
– Flexible – can handle database subsets
– Active/Active
– Asynchronous
ƒ pureScale (Active / Active)
– Continuous Availability
– Load Balancing
– Easy to set up and manage
26
© 2013 IBM Corporation
Server-Based Failover
ƒ DB2 ships with an
integrated TSA cluster
manager
–
–
–
–
tx
tx
Node Failure Detection
Disk takeover
IP takeover
Restart DB2
ƒ Management framework
included to keep the
cluster topology in sync
Active Server
27
© 2013 IBM Corporation
Storage Replication
ƒ Uses remote disk mirroring technology
– Maximum distance between sites is typically 100s of km
(for synchronous, 1000s of km for asynchronous)
– For example: IBM Metro Mirror, EMC SRDF
ƒ Transactions run against primary site only,
DR site is passive
– If primary site fails, database at DR site can be brought online
ƒ All data and logs must be mirrored to the DR site
– Synchronous replication guarantees no data loss
– Writes are synchronous and therefore ordered, but “consistency groups” are
still needed
• If failure to update one volume, don’t want other volumes to get updated (leaving data
and logs out of sync)
28
© 2013 IBM Corporation
High Availability Disaster Recovery (HADR)
ƒ Provides local High Availability and/or Disaster Recovery
– Keeps two copies of a database in sync with each other on two
different servers
ƒ Simple to setup and manage
ƒ DB2 9.5 adds an integrated cluster manager for automatic failover
ƒ DB2 10 adds multiple standbys, time delay and log buffering to
handle network spikes
29
© 2013 IBM Corporation
HADR: DB2 Delivers Fast Failover at Low Cost
ƒ Redundant copy of the database to protect against site or storage failure
ƒ Support for Rolling Upgrades
ƒ Failover typically under 15 seconds
ƒ Example: Real SAP workload, 600 SAP users – database available in 11 sec.
ƒ 100% performance after primary failure
Automatic Client Reroute
Client application transparently resumes on Standby
tx
Network Connection
tx
tx
tx
HADR
TSA for server monitoring
Built in cluster manager
Monitors primary & performs takeover
Keeps the two
servers in sync
Standby Server
Server
Primary
30
Standby Server
© 2013 IBM Corporation
HADR Synchronization Modes
Synchronous, Near Synchronous,
Asynchronous and Super Asynchronous
HADR
HADR
send()
receive()
ro
ch
yn
As
log writer
us
no
ra
e
N
Co
m
m
log
file
it
R
Sup
er A
syn
c hr
ro
h
nc
y
S
u
no
s
s
on ou
r
h
c
n
Sy
eq
ue
st
log
file
ono
us
Commit Succeeded
31
© 2013 IBM Corporation
HADR Failover
ƒ Single command called "TAKEOVER"
– Change the standby into a primary
– Switch the roles of a healthy primary-standby pair
– No db2start / restart database / rollforward etc.
ƒ Integrated TSA provides heartbeat monitoring & automated
“TAKEOVER”
– Set up for you during DB2 installation
– Use a network tiebreaker to avoid split brain scenarios
– Configuration is available in this whitepaper
ƒ Automatic client re-route (ACR) provides transparent failover
– And will rerun the statement that was running when the failure occurred as long
as it’s the first statement of a transaction with no data yet returned
32
© 2013 IBM Corporation
HADR Multiple Standbys
Principal Standby
Any
Primary
e
mod
c
n
sy
Super async mode
Super as
ync mod
e
Auxiliary Standby
Auxiliary Standby
ƒ Allows for one standby for high availability and up to two other standbys for
disaster recovery
– Rolling fix pack updates of standbys and primary without losing HA
ƒ Reads on standby supported on all standbys
ƒ TSA Automation for takeover is only to Principal Standby
ƒ Takeover (forced and non-forced) supported from any standby
– After takeover, configuration parameters on new primary’s standbys will be changed
automatically so they point to the new primary
33
© 2013 IBM Corporation
HADR Log Spooling on the Standby
ƒ When enabled, allows the standby to spool log records arriving from
the primary
ƒ Decouples log replay on the standby from receiving of the log data
from the primary
ƒ Supported with any synchronization mode
Primary
Standby
Any sync mode
Spooled logs
on standby
34
34
© 2013 IBM Corporation
HADR Time-Delayed Apply on the Standby
ƒ Helps recover from application errors
– For example, accidental deletion of important table data
– Must be noticed before time delay on standby results in change
being replayed
ƒ Enabled via the new HADR_REPLAY_DELAY database
configuration parameter
– Specifies a delay in seconds for applying changes on a standby
– A value of 0 means no time delay (the default)
Standby
Primary
super async mode
Spooled logs
on standby
35
© 2013 IBM Corporation
Q Replication
ƒ High-throughput, low latency logical data replication
– Distance between sites can be up to thousands of km
ƒ Asynchronous replication
ƒ Includes support for:
–
–
–
–
Delayed apply
Multiple targets
Replicating a subset of data
Data transformation
ƒ DR site can be active
– Bi-directional replication is supported for updates on both primary and DR sites
36
© 2013 IBM Corporation
Q Replication
ƒ Each message
represents a transaction
ƒ Highly parallel apply
process
ƒ Differentiated conflict
detection and resolution
37
© 2013 IBM Corporation
Continuous Availability Using Q Replication
Q
Q Replication
Replication provides
provides aa solution
solution for
for continuous
continuous availability
availability where
where the
the active
active
secondary
secondary system
system is
is also
also available
available for
for other
other applications
applications
38
© 2013 IBM Corporation
Introducing DB2 pureScale
ƒ Extreme capacity
– Buy only what you need,
add capacity as your needs grow
ƒ Application transparency
– Avoid the risk and cost of
application changes
ƒ Continuous availability
– Deliver uninterrupted access to your
data with consistent performance
Learning from the undisputed Gold Standard... System z
39
© 2013 IBM Corporation
DB2 pureScale
Scalability, Performance, and Always Available Transactions
ƒ DB2 pureScale
– Robust infrastructure for OLTP workloads
– Provides improved availability, performance
and scalability
– Application transparency
– Scales to >100 members
– Leverages z/OS cluster technology
ƒ Highlights of pureScale enhancements
in DB2 10.5
– Rich disaster recovery options, now including
integrated HADR support
– Backup and restore between pureScale and non-pureScale environments
– Online fix pack updates
– Add members online for additional capacity
– Included in Advanced Workgroup and Advanced Enterprise editions
40
© 2013 IBM Corporation
DB2 pureScale
Architected for extreme scale and availability
•
Multiple DB2 members for scalable and
available database environment
•
Client application connects into any DB2
member to execute transactions
Clients
•
CS
CS
CS
CS
Member
Member
Member
Member
•
Shared storage for database data and
transaction logs
•
Cluster caching facilities (CF) provide
centralized global locking and page
cache management for highest levels of
availability and scalability
Cluster Interconnect
CFCS
Logs
Logs
Logs
Logs
Primary CF
CFCS
Secondary CF
Automatic workload balancing
•
Duplexed, for no single point of failure
•
High speed, low latency interconnect for
efficient and scalable communication
between members and CFs
•
DB2 Cluster Services provides integrated
failure detection, recovery automation
and the clustered file system
Database
Shared Storage
DB2 pureScale Cluster (Instance)
41
© 2013 IBM Corporation
Scale with Ease
ƒ Scale up or out… without changing
your applications
Add
member
online
– Efficient coherency protocols designed
to scale without application changes
– Applications automatically and
transparently workload balanced
across members
– Up to 128 members
Member
ƒ Without impacting availability
– Members can be added while
cluster remains online
Log
CF
Member
Log
Member
Log
Member
Member
Log
Log
CF
ƒ Without administrative complexity
– No data redistribution required
“DB2 pureScale is the only solution we found that provided near linear scalability... It scales 100 percent, which
means when I add servers and resources to the cluster, I get 100 percent of the benefit. Before, we had to ‘oversize’ our
servers, and used only 50 - 60 percent of the available capacity so we could scale them when we needed.”
-- Robert M. Collins Jr. (Kent), Database Engineer, BNSF Railway Inc.
42
Online Recovery from Failures
ƒ DB2 pureScale design point is to
maximize availability during
failure recovery processing
X
ƒ When a database member fails, only
in-flight data remains locked until
member recovery completes
CF
CF
– In-flight = data being updated on the
failed member at the time it failed
Database member failure
“We pulled cards, we powered off systems, we uninstalled
devices, we did everything we could do to make the cluster
go out of service, and we couldn’t make it happen.”
-- Robert M. Collins Jr. (Kent), Database Engineer, BNSF Railway Inc.
Only data in-flight updates locked
during recovery
% of Data Available
ƒ Target time to availability of rows
associated with in-flight updates on
failed member in seconds
100
50
Time (~seconds)
Stealth System Maintenance
ƒ Allows DBAs to apply system
maintenance without negotiating
an outage window
ƒ Example: Upgrade the OS in a
rolling fashion across the cluster
Single Database View
ƒ Procedure:
1. Drain (a.k.a. Quiesce)
Wait for transactions to end their
life naturally; new transactions
routed to other members
2. Remove & maintain
3. Reintegrate into cluster
Workload balancing starts
sending it work as a least
loaded machine
4. Repeat until done
44
DB2
DB2
DB2
DB2
© 2013 IBM Corporation
Rolling Database Fix Pack Updates
New in
DB2 10.5
ƒ Transparently install pureScale fix packs or perform system
maintenance in an online rolling fashion
ƒ No outage experienced by applications
ƒ Single installFixPack command
run on each member/CF for fix pack update
– Quiesces member
• Existing transactions allowed to finish
(configurable timeout)
• New transactions sent to other members
CF
CF
– Installs binaries
– Updates instance
• Member still behaves as if running on previous fix pack level
– Unquiesces member
ƒ Final installFixPack command to complete and commit updates
– Instance now running at new fix pack level
45
© 2013 IBM Corporation
Rolling Fix Pack Updates (cont.)
Cluster is effectively
running at: GA FP1
Transactions routed away from
member undergoing
maintenance, so no application
outages experienced. Workload
balancing brings work back after
maintenance finished
Member
Code level:
3 >> installFixPack
installFixPack –online
–online
CF S
Code level:
Code level:
GA FP1
4 >> installFixPack
installFixPack –online
–online
Member
Code level:
GA FP1
5 >> installFixPack
installFixPack –online
–online
CF P
GA FP1
1 >> installFixPack
installFixPack –online
–online
46
6 >> installFixPack
installFixPack –check_commit
–check_commit
> installFixPack –commit_level
7
Member
GA FP1
Cluster not
running at
new level
until commit
is performed
Code level:
GA FP1
2 >> installFixPack
installFixPack –online
–online
© 2013 IBM Corporation
Disaster Recovery Options for pureScale
ƒ Variety of disaster recovery
options to meet your needs
– HADR (new in DB2 10.5)
– Storage Replication
– Q Replication
– InfoSphere Change Data
Capture (CDC)
– Geographically Dispersed
pureScale Cluster (GDPC)
– Manual Log Shipping
47
© 2013 IBM Corporation
Geographically Dispersed pureScale Clusters (GDPC)
ƒ A “stretch” or geographically dispersed pureScale cluster spans two sites
–
–
–
–
At distances of tens of km
Active/active DR, where half of the cluster is at site A, other half at site B
Enables a level of DR support suitable for many types of disasters
Supported for AIX (using InfiniBand) and RedHat Linux (using 10 Gigabit Ethernet)
ƒ Both sites active and available for transactions during normal operation
ƒ On failures, client connections are automatically redirected to surviving members
– Applies to both individual members within sites and total site failure
M1
M3 CFP
Tens of km
CFS M2
Site A
M4
Site B
Workload fully balanced
M1
M3 CFP
Site A
Tens of km
CFS M2
M4
Site B
Workload rebalanced on hardware failure
48
M1
M3 CFP
Site A
Tens of km
CFS M2
M4
Site B
Workload rebalanced on site failure
© 2013 IBM Corporation
HA & DR Scenarios
October 28, 2013
© 2013 IBM Corporation
Local Cluster Failover
DB2 automation with built-in cluster manager
HA Cluster
Local Standby
Database
Primary
Database
DB1
Pros:
ƒ Cons:
ƒ Inexpensive local failover solution
ƒ No protection from disk failure
ƒ Protection from software and server failure ƒ No protection from site failure
ƒ DB2 9.5 integrated TSA cluster manager
ƒ Failover times vary from 5 to 15 minutes
50
© 2013 IBM Corporation
HADR Local or Remote with Read on Standby
Primary Connection
HADR Cluster
Standby
Database
Primary
Database
DB1
Pros:
ƒ Inexpensive local failover or DR solution
ƒ Protection from software, server, storage,
and site failures
ƒ Simple to setup and monitor
ƒ Failover time in the range of 30 sec
ƒ Reporting on standby without increase in
failover time
51
DB1a
Read only
Cons:
ƒ Two full copies of the database (a
plus from a redundancy perspective)
ƒ Only read transactions can run on
standby
© 2013 IBM Corporation
HADR With Disk Mirroring to Remote DR Site
Primary Connection
HADR Cluster
Automatic client reroute
Disaster Recovery
Site
Local Standby
Database
Primary
Database
Remote Disk Mirror
Technology
DB1
Pros:
ƒ Very fast local failover with DR capability
ƒ Protection from software, server, storage,
and site failures
ƒ Local failover time in the range 30 seconds
52
DB1aa
DB1a
Cons:
ƒ Three full copies of the database (a
plus from a redundancy perspective)
ƒ More costly than HADR for just DR
© 2013 IBM Corporation
HADR With Multiple Standby’s (DB2 10)
Primary Connection
HADR Cluster
Automatic client reroute
Local Standby
Database
Primary
Database
DB1
DB1a
Pros:
ƒ Very fast local failover with DR capability
ƒ Protection from software, server, storage,
and site failures
ƒ Allows for time delay on auxiliary standbys
ƒ Local failover time in the range 30 seconds
53
Disaster Recovery
Site
DB1aa
Remote Standby
Cons:
ƒ Three full copies of the database (a
plus from a redundancy perspective)
ƒ Super Async only for DR site
© 2013 IBM Corporation
Q Replication
Multiple alternate standby
servers
…
Read/Write
write on
standby
Primary Connection
Site A
Primary
Database
Site C
Site B
Remote
Standby
DB3
Q-based SQL replication to
logical standby’s
DB2
DB1
Pros:
ƒ Protected from software, server, storage,
and site failures
ƒ Failover time is “instant”
ƒ Standby can be full or subset and is fully
accessible (read and/or write)
ƒ Multiple standby servers
54
Cons:
ƒ More complex to setup and monitor (but
more flexibility) vs. HADR
ƒ Asynchronous
© 2013 IBM Corporation
High Availability Disaster Recovery (HADR) Options
ƒ Local HA – fast failover with server and storage protection
Site A
Server
A
Server
B
Primary
Standby
ƒ Disaster Recovery – server, storage and site protection
Site A
Any Distance
Server
A
Primary
Server
B
Site B
Standby
ƒ Both – fast local failover with server, storage and site protection
Any Distance
Site A
55
Server
A
Server
B
Server
C
Primary
Standby
Standby
Site B
© 2013 IBM Corporation
HADR with Replication – Best Practice for HA and DR
ƒ HADR Pairs with Replication
Site A
Server
A
Server
B
Primary
Standby
Q Replication
Any Distance
Server
C
Server
D
Active
Standby
Site B
ƒ Delivers:
–
–
–
–
–
–
Fast Local Failover
Active / Active DR
Rolling patch upgrades
Rolling version upgrades
Online database on-disk modifications
Schema modifications online/rolling
ƒ Can replace HADR at each site with pureScale for even better HA
56
© 2013 IBM Corporation
DB2 pureScale Availability Options
ƒ Local only
– Online recovery, active-active, protection from server failure
Site A
Member
1
Member
2
CF
CF
ƒ Geographically dispersed cluster
– Online recovery, active-active, protection from server, storage and site failure
Site A
Member
1
Member
3
Member
2
CF
Member
4
CF
< 80km
Site B
ƒ Local pureScale plus DR replication
– Online recovery, active-active, protection from server, storage and site failure
Site A
57
Member
1
CF
Member
2
Member
1
Any distance
CF
CF
Disk Replication
or
IBM Replication Server
Member
2
CF
Site B
© 2013 IBM Corporation
DB2 pureScale Availability Option in DB2 10.5
ƒ DB2 10.5 supports HADR with pureScale
– Online recovery, protection from server, storage and site failure
– Easy to set up and manage
– Any distance (ASYNC or SuperAsync)
Site A
Member
1
CF
Member
2
CF
CF
HADR
58
Member
1
Any distance
Member
2
CF
Site B
© 2013 IBM Corporation
HA & DR Summary
ƒ When it comes to HA and DR, one size does not fit all
ƒ There are many availability options, each with their own advantages
–
–
–
–
–
Server failover
HADR
Q-Replication
pureScale
More likely a combination of several of the above
ƒ Choose the one that best suits your deployment
– Determine the right solution considering
•
•
•
•
59
Cost (hardware/software/network/site)
Availability requirements
Management costs
Application requirements
© 2013 IBM Corporation
Kelly Schlamb (kschlamb@ca.ibm.com)
Executive IT Specialist, Worldwide Information Management Technical Sales
IBM Canada Ltd.
60
© 2013 IBM Corporation