Hosted Service Strategy Guide

Prepared by
Jason Gaudreau, Senior Technical Account Manager
VMware Professional Services
jgaudreau@vmware.com
Revision History
Date          Author            Comments    Reviewers
11/20/2014    Jason Gaudreau                VMware
© 2015 VMware, Inc. All rights reserved. U.S. and international copyright and intellectual property laws
protect this product. This product is covered by one or more patents listed at
http://www.vmware.com/download/patents.html.
VMware, the VMware “boxes” logo and design, Virtual SMP and vMotion are registered trademarks or
trademarks of VMware, Inc. in the United States and/or other jurisdictions. All other marks and names
mentioned herein may be trademarks of their respective companies.
VMware, Inc
3401 Hillview Ave
Palo Alto, CA 94304
www.vmware.com
Hosted Tiers and Services Strategy Guide
Contents
Introduction
Executive Summary
Key Functional Requirements for Service Tiers
Infrastructure Resiliency
Recovery Point Objective (RPO)
Recovery Time Objective (RTO)
Infrastructure Performance
Factors for Building Service Tiers
Host Systems
Server Form Factor
Host Resource Capacity
Host High Availability
VM Restart Priority
Live Migration
Trickle Down Servernomics
Storage
Data Protection
Storage Performance
Multipathing
Virtual Disks
Networking
Multi-NIC Configuration
Virtual Machines
VM-VM Anti-Affinity Rules
Reservations
Limits
Shares
Resource Pools
Infrastructure Maintenance and Deployment Management
Maintenance and Support
Managed
Unmanaged
Creating Service Offerings
Proprietary VMWare © 2015 VMware, Inc. All rights reserved. VMware is a registered trademark of VMware, Inc.
Introduction
Mission critical applications, such as customer facing applications and financial systems, are vital
to the smooth operation of the company’s business. These applications are core to the company’s
mission, and system downtime translates to financial losses to the organization. Other
applications, like general purpose printing, software media libraries, and infrastructure monitoring
tools, don’t require the same service level capabilities as mission critical applications.
Customer facing applications provide new business opportunities and improved business
capability; however, they are driving the need for shorter recovery time objectives and more
stringent service levels from a service availability perspective. We cannot treat all IT data center
systems the same; some systems are more critical to the operation than others. Business
requirements have changed over the past few years for applications that drive business
revenue. Systems such as online retail are now expected to be available 24/7.
Because of the exploding number of applications, infrastructure growth, and the high cost of
downtime, IT organizations need to use their infrastructure resources to build service tiers that
they offer their business partners. Not only will this help improve application performance,
reliability, and availability, but it can also increase host density ratios, provide cost transparency for
meeting the expected business requirements, and enable you to create process improvements
from a service management perspective.
We will not be focusing on complete disaster recovery solutions in this paper; this assessment
will strictly center on IT system availability and performance.
Executive Summary
Vital business applications that run on mission critical systems must be able to recover quickly
and require high availability solutions. Core business applications should be supported by an N+2
failover solution. This will be enforced with vSphere admission control. Data protection should
include RAID 0+1 to protect the array in the event of a multi-drive failure and
to increase storage I/O performance. Virtual disks should use the Thick Provisioned Eager
Zeroed format to guarantee that there is no oversubscription of the datastores and to maximize
performance.
Stateless web applications that use network load balancing (NLB) will have VM-VM anti-affinity
rules set to ensure that each virtual instance is on a separate host node.
Availability comes at a cost because higher levels of availability require redundancy, automated
recovery, and mirrored-pair solutions. The greater the need for availability, the higher the IT
system price tag. To address this, it is important to adopt a tiered approach to
providing high availability based on the criticality of the IT system to our business partners. The
proposed solution above provides a cost efficient approach to create a highly stable and available
platform to host mission critical systems.
Key Functional Requirements for Service Tiers
Infrastructure Resiliency
IT system resiliency is determined by redundant system components including servers,
networking, storage, and system recoverability. The components must be highly available to meet
mission critical system needs and minimize downtime. A failure in any link in the infrastructure
chain could result in the loss of IT system availability to the business. As a result, redundancy
must be applied to all infrastructure components to ensure high availability.
Recovery Point Objective (RPO)
The system data on our critical systems dictates the amount of data that can be lost as the result
of a failure. Generally, mission critical systems cannot sustain any data loss and require a very
low recovery point objective. Systems that are not mission critical IT systems often can sustain
some amount of data loss or lost transactions resulting from a system failure.
Recovery Time Objective (RTO)
Recovery time objectives (RTOs) spell out the maximum allowable time to restore IT services.
RTOs are typically associated with recoverability, whereas Quality of Service (QoS) needs are
associated with availability. Most organizations use RTOs to express disaster recovery
requirements. For our purpose, we are going to focus on availability solutions for protecting our IT
systems from downtime caused by individual system outages, component outages, and
maintenance activity.
Infrastructure Performance
For mission critical applications, it is no longer sufficient for an application to just meet functional
requirements. You need to ensure the application satisfies the desired performance parameters
for the consumer.
With a focus on controlling costs, IT leaders must run at just enough resource capacity to meet
the requirements of the business. For many organizations, day-to-day operations include
running several generations of physical servers, which provide varying degrees of performance. It
is important to use the latest hardware with the largest feature set to run the core business
applications. This will provide the best application performance and hardware reliability.
Factors for Building Service Tiers
• Vital business functions need to be highly available and operate with minimum disruption
• Highly resilient infrastructure design to withstand failures
• Decrease the number of system outages to mission critical systems
• Operate within scheduled maintenance and deployment release dates
• Provide necessary performance to meet application requirements
Host Systems
Server Form Factor
VMware vSphere allows organizations to spread the virtual machines (servers) across multiple
physical hosts, with the ability to consolidate workloads into each server. Essentially, a scale up
design uses a small number of large powerful servers, as opposed to a scale out solution design
that revolves around smaller servers. Both aim to achieve the computing power that is required to
run business applications, but they scale differently and have a different impact
on support.
Scale up advantages:
• Better resource management: Larger servers can take better advantage of the
  hypervisor’s resource optimization capabilities. Scaling out doesn’t make as efficient use
  of the resources because they are more limited on an individual node.
• Cost: Scaling up is cheaper.
• Fewer Hypervisors: With fewer servers loaded with the hypervisor, it is easier to maintain
  hypervisor upgrades, hypervisor patching, and BIOS and firmware upgrades, and there is a smaller
  footprint for system monitoring.
• Larger VMs possible: Scale up is more flexible with large VMs because of resource
  scaling.
• Power and cooling: In general, scaling up requires less power and cooling because there are
  fewer host nodes.
Scale out advantages:
• Less impact during a host failure: Having fewer VMs per server reduces the risk if a
  physical host failure should occur. By scaling out to small servers, fewer VMs are
  affected at once.
• Less expensive host redundancy: It is significantly cheaper to maintain an N+2 host
  policy.
Although scaling up hosts saves money on OPEX and infrastructure costs, the
recommendation for mission critical applications is to scale out so that the VM impact is
minimized in the event of a system failure. vSphere High Availability (HA) uses a restart
of the virtual machine as the mechanism for addressing host failures. This means there is
a period of downtime between the host failing and the VMs completing their reboot on a different
host.
Host Resource Capacity
vSphere clustering has the capability of admission control to ensure that capacity is available for
maintenance and host failure. Failover capacity is calculated by determining how many hosts can
fail and still leave enough capacity to satisfy the requirements of all powered-on virtual machines.
In an N+2 solution, N is the number of physical servers needed to host the VMs, and two
additional physical servers provide spare capacity. This allows for an unexpected
system failure while one host is out of the cluster for maintenance. This cluster design can sustain
the loss of two hosts without disrupting mission critical systems.
This ensures that we are not over-committed in host resource allocation which can lead to poor
performance on the VMs should there be a multi-host failure.
For non-mission critical applications, running at N+1 allows for non-disruptive maintenance of the
underlying host systems and tolerates the impact of a single host outage without business impact.
Required Host Resources for Fail-Over

Number of Hosts    N+1 Resource Capacity    N+2 Resource Capacity
2 hosts            50%                      NA
3 hosts            67%                      33%
4 hosts            75%                      50%
5 hosts            80%                      60%
6 hosts            83%                      67%
7 hosts            86%                      71%
8 hosts            87%                      75%
9 hosts            89%                      78%
10 hosts           90%                      80%
Figure 1 Resources Allocations for Host Failure
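The values in Figure 1 follow from simple arithmetic: with identical hosts, the share of cluster resources that can be consumed while still tolerating k host failures is (hosts - k) / hosts. The following is a minimal Python sketch of that calculation. The function names are illustrative only, and the sketch assumes every host contributes the same capacity; it does not model how vSphere admission control itself computes failover capacity.

```python
from fractions import Fraction
from typing import Optional


def usable_capacity(hosts: int, spare_hosts: int) -> Optional[Fraction]:
    """Share of total cluster resources that may be consumed while still
    tolerating the loss of `spare_hosts` hosts. Assumes identical hosts."""
    if hosts <= spare_hosts:
        return None  # nothing left to run workloads ("NA" in the table)
    return Fraction(hosts - spare_hosts, hosts)


def as_percent(value: Optional[Fraction]) -> str:
    """Format a capacity share for display."""
    return "NA" if value is None else f"{float(value):.1%}"


if __name__ == "__main__":
    print("Hosts  N+1     N+2")
    for hosts in range(2, 11):
        n1 = as_percent(usable_capacity(hosts, 1))
        n2 = as_percent(usable_capacity(hosts, 2))
        print(f"{hosts:<6} {n1:<7} {n2}")
```

The sketch prints one decimal place, so a row may differ from the table by a rounding step.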
Host High Availability
vSphere High Availability is a clustering solution to detect failed physical hosts and recover virtual
machines. If vSphere HA discovers that a host node is down, it quickly restarts the host’s virtual
machines on other servers in the cluster. This enables us to protect virtual machines and their
workloads.
Figure 2 vSphere HA
VM Restart Priority
If a host fails and its virtual machines need to be restarted, you can control the order in which this
is done with the VM restart priority setting. VM restart priority determines the relative order in
which virtual machines are restarted on a new host after an outage. vSphere HA attempts to restart
the highest priority virtual machines first and continues with lower priority virtual machines until all
virtual machines are running or no cluster resources remain available.
Assigning mission critical applications a VM restart priority of High helps ensure critical
applications are back online quickly.
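The ordering described above can be illustrated with a small Python model. This is a simplified sketch under stated assumptions, not vSphere HA's actual placement logic: the priority labels, the VM names, and the use of memory as the only constrained resource are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical priority labels; a lower rank is restarted first.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


@dataclass
class VM:
    name: str
    priority: str
    memory_gb: int


def plan_restarts(vms: List[VM], free_memory_gb: int) -> Tuple[List[str], List[str]]:
    """Walk the VMs from highest to lowest priority and restart each one
    that still fits in the remaining capacity (memory only, for simplicity)."""
    restarted, not_restarted = [], []
    for vm in sorted(vms, key=lambda v: PRIORITY_RANK[v.priority]):
        if vm.memory_gb <= free_memory_gb:
            free_memory_gb -= vm.memory_gb
            restarted.append(vm.name)
        else:
            not_restarted.append(vm.name)
    return restarted, not_restarted


if __name__ == "__main__":
    failed_host_vms = [
        VM("print-server", "low", 8),
        VM("erp-db", "high", 32),
        VM("monitoring", "medium", 16),
    ]
    print(plan_restarts(failed_host_vms, free_memory_gb=48))
```

In this model, when surviving capacity is short, the low priority VM is the one left waiting, which is the behavior a service tier design relies on.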
Live Migration
vSphere vMotion provides the ability to perform live migrations of a virtual machine from one
ESXi host to another ESXi host without service interruption. This is a no-downtime operation;
network connections are not dropped and applications continue running uninterrupted.
This makes vMotion an effective tool for load balancing VMs across host nodes within a cluster.
Additionally, if a host node needs to be powered off for hardware maintenance, you will use
vMotion to migrate all the active virtual machines from the host going offline to another host to
ensure there is no business disruption.
Trickle Down Servernomics
VMware recommends that all hosts in a cluster have similar CPU and memory configurations to
have a balanced cluster and optimal HA resource calculations. This will not only help you in the
event of a physical server outage in a cluster, it can help improve performance by taking
advantage of all the capabilities in your latest generation servers.
In order to have multiple processor architectures in a single cluster you need to enable Enhanced
vMotion Compatibility (EVC) mode. EVC mode allows migration of virtual machines between
different generations of CPUs, making it possible to aggregate older and new server hardware
generations in a single cluster.
However, despite the obvious advantages of EVC mode, you need to factor in the costs
associated with this feature. Some applications will potentially lose performance due to certain
advanced CPU features not being made available to the guest, even though the underlying host
supports them. When an ESXi host with a newer generation CPU joins the cluster, the baseline
will automatically hide the CPU features that are new and unique to that CPU generation. The
table below (Figure 3) lists the EVC levels and describes the features that each one enables.
Figure 3 EVC Modes
To illustrate some of the performance variations, VMware ran tests that replicated
applications in customer environments to measure the impact of EVC mode. The tests used
several guest virtual machines running workloads with different EVC modes ranging from Intel
Merom to Intel Westmere. For the Java-based server-side applications, performance on an
ESXi host varied by a negligible 0.0007% between EVC modes as old as Merom and as new as
Westmere. For OpenSSL (AES), the Intel Westmere EVC mode outperformed the other modes
by more than three times. The improved performance is due to the encryption acceleration made
possible by the AES-NI instruction set, which was introduced on Intel processors with
Westmere (Figure 4).
Figure 4 OpenSSL with EVC Mode
Another key aspect of a balanced cluster is avoiding large variations in the resources
available within a single cluster, which happens when mixing different server generations. For instance,
the HP ProLiant DL380 Generation 5 (G5) server was available with two
quad-core Intel Xeon processors with 12 MB of L2 cache memory and a maximum of 64 GB of
memory. The Generation 8 (G8) version, the HP ProLiant DL380p, allowed for two 12-core Intel
Xeon processors with 30 MB of L3 cache memory and a maximum of 768 GB of memory. That is a
dramatic difference!
A solid estimate for the number of vCPUs per processor core for production workloads is 4.
Using that theoretical density ratio, we can expect 32 vCPUs on the HP ProLiant DL380 G5
(8 cores) and 96 vCPUs on the HP ProLiant DL380p G8 (24 cores).
Total vCPUs = Processor Cores x 4
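As a quick check of the formula, the short Python sketch below reproduces the two density figures from the server configurations described above. The function and constant names are illustrative, and the ratio of 4 vCPUs per core is the estimate given in this guide, not a fixed limit.

```python
VCPUS_PER_CORE = 4  # planning estimate for production workloads from this guide


def total_vcpus(sockets: int, cores_per_socket: int,
                vcpus_per_core: int = VCPUS_PER_CORE) -> int:
    """Total vCPUs = processor cores x vCPUs per core."""
    return sockets * cores_per_socket * vcpus_per_core


if __name__ == "__main__":
    print("DL380 G5: ", total_vcpus(sockets=2, cores_per_socket=4))    # 32
    print("DL380p G8:", total_vcpus(sockets=2, cores_per_socket=12))   # 96
```

A single G8 host in this estimate carries three times the vCPU load of a G5 host, which is why losing the newer host in a mixed cluster is hard for the older hosts to absorb.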
Unbalanced clusters can have a significant impact on VMware HA when a new generation server
fails and legacy systems need to pick up the additional workload.
Furthermore, to mix these two server generations in a cluster, you would need to enable EVC mode L1 for
the Penryn processor architecture, hiding all the CPU features from L2 through L4.
By defining cluster service tiers, you can ensure that applications with critical workloads that are
vital to the business have the latest generation features and your clusters are balanced. You
“trickle down” older generation servers to hosting tiers that don’t have the same SLA and
performance requirements as your core business applications.
Figure 5 Cluster Service Tiers
Additionally, you can incorporate a strategy that includes scale out for mission critical applications
and scale up for non-mission critical workloads. Your clusters should be created to meet a
business requirement or a functional requirement. For instance, you could create clusters based
on service levels as illustrated above (Figure 5), or you can construct clusters based on
functional requirements, such as a SQL cluster. The database cluster might require a high
number of processor cores so it stays within NUMA architecture and it may need some of the
resources reserved to meet performance requirements.
Figure 6 Cluster Tiers
Storage
Data Protection
It’s all about recovery; data protection design protects against all relevant types of failure and
minimizes data loss. While disk capacity has increased more than 1,000-fold since RAID levels
were introduced in 1987, disk I/O rates have only increased by 150-fold. This means that when a
disk in a RAID set does fail, it can take hours to repair and re-establish full redundancy.
RAID levels:
• RAID-1: An exact copy (or mirror) of a set of data on two disks.
• RAID-5: Uses block-level striping with parity data distributed across all member disks.
• RAID-6: Extends RAID-5 by adding an additional parity block; thus it uses block-level
  striping with two parity blocks distributed across all member disks.
• RAID 10: Arrays consisting of a top-level RAID-0 array (or stripe set) composed of two or
  more RAID-1 arrays (or mirrors). A single-drive failure in a RAID 10 configuration results
  in one of the lower-level mirrors entering degraded mode, but the top-level stripe may be
  configured to perform normally (except for the performance hit), as both of its constituent
  storage elements are still operable; this is application-specific.
• RAID 0+1: In contrast to RAID 10, RAID 0+1 arrays consist of a top-level RAID-1 mirror
  composed of two or more RAID-0 stripe sets. A single-drive failure in a RAID 0+1
  configuration results in one of the lower-level stripes completely failing (as RAID 0 is not
  fault tolerant), while the top-level mirror enters degraded mode.
For mission critical systems, being able to survive the overlapping failure of two disks in a RAID
set is important to protect against data loss. RAID 0+1 mirrors data across two stripe sets. This
approach gives an excellent level of redundancy, because every block of data is written to a
second disk. Rebuild times are also short in comparison to other RAID types.
RAID 0+1 has increased read performance because it can read from the mirrored copies of the
data in parallel. Furthermore, there is a dramatic improvement in write performance:
RAID 0+1 only needs to write to two disks at a time, whereas RAID-5 has to
perform four steps for each write. It needs to read the old data, then read the
parity, then write the new data, and then write the parity. This is known as the RAID-5 write
penalty.
Mirrored RAID volumes offer high degrees of protection, but at the cost of a 50 percent loss of
usable capacity.
Storage Performance
The types of drives in the storage array and the IO activity have a dramatic impact on application
performance. In the diagram below (Figure 7), you can see the typical IOPS expected from today’s
magnetic and flash drives.
Figure 7 Drive IOPS
Mission critical applications and applications that have a heavy IO workload can benefit from
incorporating flash drives. For instance, if we size a 3PAR storage array with just 15K FC disks,
we need 106 total disks to meet a requirement of 16,000 IOPS. That would put us into the
3PAR 7400 with 9 drive enclosures and 152 disks for redundancy. The cost is around $400,000.00
for the storage array.
If we look at using a 3PAR storage array with a mix of SSD and 10K drives with Adaptive
Optimization, we need 16 SATA disks and 8 SLC SSD disks to meet the 16,000 IOPS
requirement. This drops us down to the 3PAR 7200 with 3 drive enclosures and 24 disks for
redundancy. The cost is around $100,000.00 for this storage array.
Figure 8 Disk Reduction
As you can see, by leveraging SSD you can dramatically reduce the price of building out the
infrastructure necessary for applications that have a significant IOPS requirement. Sub-LUN
auto-tiering, such as HP 3PAR’s Adaptive Optimization, enables automatic storage tiering on the
array. With this feature, the storage system analyzes IO and then migrates regions of 128 MB
between different storage tiers. Frequently accessed regions of volumes are moved to higher
tiers, and less frequently accessed regions are shifted to lower tiers.
Figure 9 Sub-LUN Auto-Tiering
As mentioned previously, when working with application workloads you must be mindful of the
write penalty of RAID sets. With RAID-5, each write to the disk involves four steps:
read the old data, read the parity, write the new data, and
write the parity. This is known as the RAID-5 write penalty.
Figure 10 RAID Penalty
When calculating your IO workload, use the following formula:
Read IOPS + (Write IOPS * Raid Penalty) = Total IOPS
In our example, we have a corporate application that requires 1,280 average IOPS with a
read-write ratio of 50/50 on a RAID-5 volume. The formula would be 640 Read IOPS + (640 Write
IOPS * 4) = 3,200 IOPS. Including the RAID penalty in your IO calculations is
very important: if you size storage for your application’s IO requirements based on the 1,280 IOPS
instead of the 3,200 IOPS, you can degrade the performance of your application.
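The same calculation can be expressed as a small Python helper, shown here as a sketch. The function name and the read-fraction parameter are illustrative; the penalty of 4 for RAID-5 comes from the four write steps described in this guide, and the penalty of 2 for mirrored sets reflects the two disk writes per block mentioned earlier.

```python
def backend_iops(frontend_iops: float, read_fraction: float, raid_penalty: int) -> float:
    """Total IOPS = Read IOPS + (Write IOPS * RAID penalty)."""
    read_iops = frontend_iops * read_fraction
    write_iops = frontend_iops * (1 - read_fraction)
    return read_iops + write_iops * raid_penalty


if __name__ == "__main__":
    # Worked example from the text: 1,280 IOPS, 50/50 read-write, RAID-5.
    print(backend_iops(1280, read_fraction=0.5, raid_penalty=4))  # 3200.0
    # Same workload on a mirrored set, where each write lands on two disks.
    print(backend_iops(1280, read_fraction=0.5, raid_penalty=2))  # 1920.0
```

Comparing the two results shows how much of the backend load comes from the RAID level rather than from the application itself.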
Multipathing
vSphere hosts use host bus adapters (HBAs) through fabric switches to connect to the storage array’s
storage processor ports. By using multiple HBA devices for redundancy, more than one path is
created to the LUNs. The hosts use a technique called “multipathing” which provides several
features such as load balancing, path failover management, and aggregated bandwidth.
Virtual Disks
Virtual disks (VMDKs) are how virtual machines encapsulate their disk devices. Virtual disks
come in three formats: Thin Provision, Thick Provisioned Lazy Zeroed, and Thick Provisioned
Eager Zeroed.
• Thick Provision Lazy Zeroed: Creates a virtual disk in a default thick format. Space
  required for the virtual disk is allocated when the virtual disk is created. Data remaining
  on the physical device is not erased during creation, but is zeroed out on demand at a
  later time on first write from the virtual machine.
• Thick Provision Eager Zeroed: Space required for the virtual disk is allocated at creation
  time. In contrast to the lazy zeroed (flat) format, the data remaining on the physical device is
  zeroed out when the virtual disk is created. It might take much longer to create disks in this
  format than to create other types of disks, but you can see a slight performance improvement.
• Thin Provision: Use this format to save storage space. For the thin disk, you provision as
  much datastore space as the disk would require based on the value that you enter for the
  disk size. However, the thin disk starts small and at first uses only as much datastore
  space as the disk needs for its initial operations.
Figure 11 Disk Provisioning
Thick Provisioned Eager Zeroed virtual disks are true thick disks. In this format, the size of the
VMDK file on the datastore is the size of the virtual disk that you create, and the file is pre-zeroed. For
example, if you create a 500 GB virtual disk and place 100 GB of data on it, the VMDK file will be
500 GB on the datastore filesystem. As I/O occurs in the guest, the VMkernel (host OS kernel)
does not need to zero the blocks prior to the I/O occurring. The result is slightly improved I/O
latency and fewer backend storage I/O operations.
Because zeroing takes place at run-time for a thin disk, there will be some performance impact
for write-intensive applications while writing data for the first time. After all of a thin disk’s blocks
are allocated and zeroed out, the thin disk is no different from a thick disk in terms of
performance. Some storage array manufacturers implement thin provisioning behind the LUN.
Although in most instances array-based thin provisioning will perform better than VMFS thin
provisioning, you still need to take into account the higher CPU, disk, and memory overhead
required to keep the LUNs thin.
Another benefit of Thick Provisioned Eager Zeroed is that you can’t over-subscribe the LUN as
you can with Thin Provisioned disks.
Thick Provisioned Eager Zeroed ensures disk resources are committed to mission critical
systems and provides slight disk I/O improvement. The drawback to this disk format is it requires
more storage capacity than Thin Provisioning because you are committing the entire disk
allocation to the datastore.
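To make the over-subscription point concrete, the Python sketch below compares the total provisioned size of a set of virtual disks against datastore capacity. The function names and the sample disk sizes are hypothetical and chosen only for illustration. With thin disks the ratio can exceed 1.0, while thick disks commit their full size up front, so the datastore would simply refuse an allocation that does not fit.

```python
from typing import Iterable


def commit_ratio(provisioned_gb: Iterable[float], datastore_capacity_gb: float) -> float:
    """Ratio of provisioned virtual disk space to datastore capacity."""
    return sum(provisioned_gb) / datastore_capacity_gb


def is_oversubscribed(provisioned_gb: Iterable[float], datastore_capacity_gb: float) -> bool:
    """True when guests could, in total, write more data than the datastore holds."""
    return commit_ratio(provisioned_gb, datastore_capacity_gb) > 1.0


if __name__ == "__main__":
    thin_disks_gb = [500, 500, 500]  # hypothetical thin disks on a 1,000 GB datastore
    print(commit_ratio(thin_disks_gb, 1000))       # 1.5
    print(is_oversubscribed(thin_disks_gb, 1000))  # True
```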
Networking
vSphere Networking
A vSphere standard switch works much like a physical switch. It is a software-based switch that
keeps track of which virtual machines are connected to each of its virtual ports and then uses that
information to forward traffic to other virtual machines. A vSphere standard switch (vSS) can be
connected to a physical switch by physical uplink adapters; this gives the virtual machines the
ability to communicate with the external networking environment and other physical resources.
Even though the vSphere standard switch emulates a physical switch, it lacks most of the
advanced functionality of physical switches. A vSphere distributed switch (vDS) is a
software-based switch that acts as a single switch to provide traffic management across all associated
hosts in a datacenter. This enables administrators to maintain a consistent network configuration
across multiple hosts.
A distributed port is a logical object on a vSphere distributed switch that connects to a host’s
VMkernel or to a virtual machine’s network adapter. A port group shares port configuration
options; these can include traffic shaping, security settings, NIC teaming, and VLAN tagging
policies for each member port. Typically, a single standard switch is associated with one or more
port groups.
A distributed port group is a port group associated with a vSphere distributed switch; it specifies
port configuration options for each member port. Distributed port groups define how a connection
is made through the vSphere distributed switch to the network.
Additionally, vSphere distributed switches provide advanced features like Private VLANs, network
vMotion, bi-directional traffic shaping, and third party virtual switch support.
Multi-NIC Configuration
Most corporate environments use multiple 1 gigabit Ethernet (1GbE) adapters deployed as
their physical uplinks. In the diagram below (Figure 12), we are using 6 uplink adapters
connected to a combination of vSphere standard switches and a vSphere distributed switch. By
using multiple network adapters, we can separate the VMkernel (host OS kernel) traffic,
which includes management, vMotion, and fault tolerance, from virtual machine traffic.
In the example below, VM traffic goes through the vSphere distributed switch and VMkernel
traffic stays on the vSphere standard switches, providing further isolation. Additionally, this provides
redundancy for all components except fault tolerance, which may not be a requirement for all
companies due to its limitation of supporting one vCPU.
[Diagram: ESXi host with six 1 Gb uplinks: vmnic0 and vmnic1 (Onboard 0 and 1) and vmnic2
through vmnic5 (PCI A and PCI B, ports 0 and 1). vSphere standard switches (vSS) carry the
Mgmt (Service Console), vMotion, and FT traffic; the vDS (dvSwitch) carries the VM VLAN port
groups over its dvUplinks. Physical switch ports are set to 1000 Auto with trunk teams; iLO is set
to 100 Full on VLAN 90. Example VLANs shown: 178, 180, 98, and * for virtual machines.]
Figure 12 1 GbE Network Connection
Today, many virtualized datacenters are shifting to 10 gigabit Ethernet (10GbE) network
adapters, which replace multiple 1GbE network cards. With 10GbE, ample bandwidth is available
for multiple traffic flows to coexist and share the same physical 10GbE link. Flows that were
limited to the bandwidth of a single 1GbE link can now use as much as 10 Gbps.
Now let’s take a look at 10 gigabit Ethernet configurations and their impact on your environment.
Because we don't have as many uplink adapters, the way we approach traffic shaping and
network isolation is different. I am going to demonstrate two scenarios: the first provides traffic
shaping and isolation by uplink adapter, and the second is a more dynamic approach that
takes advantage of vSphere Network I/O Control (NIOC).
In the first scenario, we segment the virtual machine traffic to dvUplink1 with failover to
dvUplink0; this provides physical isolation of your virtual machine traffic from your
management traffic. The VMkernel traffic is pointed to dvUplink0 with dvUplink1 as the failover
adapter. If security controls dictate that you segment your traffic, this is a good solution, but there
is a good chance that you won't be using the full capabilities of both of your 10GbE network
adapters.
[Diagram: ESXi host with one pair of 10GbE uplinks. A single vDS (dvSwitch) has port groups for
Mgmt, vMotion, FT, and VM NICs, with dvUplink0 and dvUplink1 mapped to Onboard 0 and
Onboard 1 (10G Auto, VLAN *) on the physical switch. Traffic at the port group is segmented by
different VLANs (multiple VLANs for VM NICs). iLO connects at 1G auto on VLAN 90.]
Traffic Type      VLAN (Example)   Teaming Policy      Active dvUplink   Standby dvUplink
Management        178              Explicit failover   dvUplink0         dvUplink1
vMotion           180              Explicit failover   dvUplink0         dvUplink1
FT                98               Explicit failover   dvUplink0         dvUplink1
Virtual Machine   *                Explicit failover   dvUplink1         dvUplink0
Figure 13 10 GbE Static Network Design
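The explicit failover order in Figure 13 can be sketched in a few lines of Python. This is only an illustration of the selection logic, not how the vDS implements it: the PortGroupTeaming class, the select_uplink function, and the link_up dictionary are hypothetical names introduced here, and the port group and uplink names are taken from the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortGroupTeaming:
    """Explicit failover order for one port group (values from Figure 13)."""
    name: str
    active: str
    standby: str


# Hypothetical in-memory copy of the teaming table in Figure 13.
TEAMING = [
    PortGroupTeaming("Management", "dvUplink0", "dvUplink1"),
    PortGroupTeaming("vMotion", "dvUplink0", "dvUplink1"),
    PortGroupTeaming("FT", "dvUplink0", "dvUplink1"),
    PortGroupTeaming("Virtual Machine", "dvUplink1", "dvUplink0"),
]


def select_uplink(pg, link_up):
    """Return the uplink that carries pg's traffic, or None if both are down.

    The active uplink is preferred; the standby is used only when the
    active uplink reports link down.
    """
    for uplink in (pg.active, pg.standby):
        if link_up.get(uplink, False):
            return uplink
    return None


def placement(link_up):
    """Map every port group to the uplink it would use for a given link state."""
    return {pg.name: select_uplink(pg, link_up) for pg in TEAMING}
```

With both links up, virtual machine traffic is isolated on dvUplink1 and VMkernel traffic on dvUplink0. If dvUplink1 fails, placement({"dvUplink0": True, "dvUplink1": False}) puts all four traffic types on dvUplink0, which is the point at which the isolation is traded for availability.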
In our second scenario, we are going to use network resource pools to determine the bandwidth
that different network traffic types are given on a vSphere distributed switch.
With vSphere Network I/O Control (NIOC), diverse workloads can converge on a single
networking pipe to take full advantage of 10GbE. The NIOC concept revolves around resource
pools that are similar in many ways to the ones already existing for CPU and memory.
In the diagram below (Figure 14), all the traffic goes through the active dvUplinks 0 and 1. We
are going to use a load-based teaming (LBT) policy, which was introduced in vSphere 4.1, to
provide traffic-load awareness and ensure physical NIC capacity in the NIC team is optimized.
Last, we are going to set our NIOC share values. I have set virtual machine traffic to High (100
shares), management and fault tolerance to Medium (50 shares), and vMotion to Low (25
shares). The share values are based on the relative importance we placed on the individual traffic
roles in our environment. Furthermore, you can enforce traffic bandwidth limits on the overall vDS
set of dvUplinks.
Network I/O Control provides the dynamic capability necessary to take full advantage of your
10GbE uplinks. It gives the vSphere administrator sufficient controls, in the form of limits
and shares parameters, to ensure predictable network performance when multiple
traffic types contend for the same physical network resources.
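As a rough illustration of how shares translate into bandwidth, the following Python sketch divides a 10 Gbps uplink among the traffic types that are actively contending, in proportion to the share values quoted in the paragraph above (100, 50, 50, and 25). The function name and the simple proportional model are assumptions made for illustration; the real scheduler is more involved and also honors limits, which this sketch ignores.

```python
# Share values as stated in the text: VM High, management and FT Medium,
# vMotion Low.
SHARES = {"Virtual Machine": 100, "Management": 50, "FT": 50, "vMotion": 25}


def bandwidth_under_contention(link_gbps, shares, active=None):
    """Split link_gbps among the active traffic types in proportion to shares.

    Shares only matter when traffic types compete for the link, so only
    the types listed in `active` take part in the split. When `active`
    is None, every traffic type in `shares` is assumed to be contending.
    """
    active = list(shares) if active is None else list(active)
    total = sum(shares[t] for t in active)
    if total == 0:
        return {t: 0.0 for t in active}
    return {t: link_gbps * shares[t] / total for t in active}
```

For example, if only virtual machine traffic and a vMotion operation are competing for the link, bandwidth_under_contention(10, SHARES, active=["Virtual Machine", "vMotion"]) gives virtual machines 8.0 Gbps and vMotion 2.0 Gbps. When nothing else is contending, a traffic type is free to use the whole link.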
[Diagram: ESXi host with one pair of 10GbE uplinks. A single vDS (dvSwitch) has port groups for
Mgmt, vMotion, FT, and VM NICs, with dvUplink0 and dvUplink1 mapped to Onboard 0 and
Onboard 1 (10G Auto, VLAN *) on the physical network. Traffic at the port group is segmented by
different VLANs (multiple VLANs for VM NICs). iLO connects at 1G auto on VLAN 90.]
Traffic Type      VLAN (Example)   NIOC Shares   Teaming Policy   Active dvUplink   Standby dvUplink
Management        178              50            LBT              dvUplink0,1       None
vMotion           180              25            LBT              dvUplink0,1       None
FT                98               25            LBT              dvUplink0,1       None
Virtual Machine   *                100           LBT              dvUplink0,1       None
Figure 14 10 GbE Dynamic Network Design
Virtual Machines
VM-VM Anti-Affinity Rules
A VM-VM Anti-Affinity rule specifies which virtual machines are not allowed to run on the same
host. Anti-Affinity rules can be used to offer host failure resiliency to mission-critical services
provided by multiple virtual machines using network load balancing (NLB). They also allow you to
separate virtual machines with network-intensive workloads; if they were placed on one host, they
might saturate the host’s networking capacity.
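To make the rule concrete, here is a minimal Python sketch that checks a placement against one anti-affinity rule. The function name, the placement dictionary, and the VM and host names in the usage note are hypothetical; in practice DRS enforces these rules itself, so this only illustrates what a violation means.

```python
from collections import defaultdict
from itertools import combinations


def anti_affinity_violations(placement, rule):
    """Return (host, vm_a, vm_b) for every pair of rule members sharing a host.

    placement maps VM name -> host name.
    rule is the set of VM names that must be kept on separate hosts.
    """
    by_host = defaultdict(list)
    for vm, host in placement.items():
        if vm in rule:
            by_host[host].append(vm)

    violations = []
    for host, vms in sorted(by_host.items()):
        # Any two rule members on the same host break the rule.
        for vm_a, vm_b in combinations(sorted(vms), 2):
            violations.append((host, vm_a, vm_b))
    return violations
```

For a load-balanced web farm of web01, web02, and web03, a placement that puts web01 and web03 on the same host is reported as a single violation, while virtual machines outside the rule are ignored.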
Reservations
A reservation is the guaranteed minimum amount of host resources allocated to a virtual
machine to avoid overcommitment. It ensures the virtual machine has sufficient resources to run
efficiently. vCenter Server or ESXi allows you to power on a virtual machine only if there are
enough unreserved resources to satisfy the reservation of the virtual machine. The server
guarantees that amount even when the physical server is heavily loaded. After a virtual machine
has accessed its full memory reservation, it is allowed to retain that amount of memory and the
memory is not reclaimed, even if the virtual machine becomes idle.
For example, assume you have 2 GB of memory available for two virtual machines. You specify a
reservation of 1 GB of memory for VM1 and 1 GB of memory for VM2. Now each virtual machine
is guaranteed to get 1 GB of memory if it needs it. However, if VM1 is only using 500 MB of
memory and hasn’t accessed its full reservation, then VM2 can use 1.5 GB of memory until VM1’s
resource demand increases to 1 GB.
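The power-on check and the 2 GB example above can be sketched as follows. This is a simplified model under stated assumptions: the function names are hypothetical, all values are in MB, and virtualization overhead memory is ignored.

```python
def unreserved_mb(capacity_mb, reservations_mb):
    """Memory not yet promised to any powered-on virtual machine."""
    return capacity_mb - sum(reservations_mb.values())


def can_power_on(capacity_mb, reservations_mb, new_reservation_mb):
    """Admission check: allow power-on only if the reservation can be met.

    reservations_mb maps each powered-on VM to its memory reservation.
    """
    return new_reservation_mb <= unreserved_mb(capacity_mb, reservations_mb)


def usable_by_other_vm_mb(capacity_mb, consumed_by_first_vm_mb):
    """Memory a second VM can use while the first VM has not yet touched
    its full reservation (the situation described in the example above)."""
    return capacity_mb - consumed_by_first_vm_mb
```

With 2048 MB available and VM1 already holding a 1024 MB reservation, can_power_on(2048, {"VM1": 1024}, 1024) admits VM2, but a third virtual machine with any reservation above zero would be refused. While VM1 is only consuming 500 MB, usable_by_other_vm_mb(2048, 500) returns 1548 MB, which is the roughly 1.5 GB that VM2 can use in the example.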
If a customer-facing or mission-critical application needs a guaranteed memory allocation, the
reservation needs to be specified carefully because it may impact the performance
of other virtual machines and significantly reduce consolidation ratios.
Figure 15 Virtual Memory Configuration
Limits
A limit is the upper threshold of the host resources allocated to a virtual machine. A server will
never allocate more resources to a virtual machine than the limit. The default is set to unlimited,
which means the amount of resources configured for the virtual machine when it is created
becomes the effective limit. For example, if you configured 2 GB of memory when you created a
virtual machine but set a limit of 1 GB, the virtual machine would never be able to access more
than 1 GB of memory even when the application demand required more resources. If this value is
misconfigured, users may experience application performance issues even though the host has
plenty of resources available.
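A small Python sketch of the rule above, using hypothetical function names and None to stand for the default setting of unlimited, shows how a misconfigured limit can be spotted.

```python
def effective_memory_ceiling_mb(configured_mb, limit_mb=None):
    """Most memory the VM can ever be given.

    limit_mb=None models the default (unlimited), in which case the
    configured memory size is the effective limit.
    """
    if limit_mb is None:
        return configured_mb
    return min(configured_mb, limit_mb)


def is_limit_constraining(configured_mb, limit_mb=None):
    """True when the limit sits below the configured size, so the guest
    can never use all the memory it believes it has."""
    return limit_mb is not None and limit_mb < configured_mb
```

For the example in the text, effective_memory_ceiling_mb(2048, 1024) returns 1024 and is_limit_constraining(2048, 1024) returns True, flagging a virtual machine whose users may see performance problems even on a lightly loaded host.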
Shares
Shares specify a virtual machine’s relative priority for the host’s resources. If the host’s
memory is overcommitted, and a mission-critical virtual machine is not achieving an acceptable
performance level, the virtualization administrator can raise the virtual machine’s shares to
escalate its relative priority so that the hypervisor allocates more host memory to the
mission-critical virtual machine.
Shares can be set to Low, Normal, or High, which specify share values in a 1:2:4 ratio,
respectively.
Resource Pools
A resource pool is a logical abstraction for flexible management of resources. Resource pools
can be grouped into hierarchies and used to hierarchically partition available CPU and memory
resources.
Each standalone host and each DRS cluster has an (invisible) root resource pool that groups the
resources of that host or cluster. The root resource pool does not appear because the resources
of the host (or cluster) and the root resource pool are always the same.
Users can create child resource pools of the root resource pool or of any user-created child
resource pool. Each child resource pool owns some of the parent’s resources and can, in turn,
have a hierarchy of child resource pools to represent successively smaller units of computational
capability.
A resource pool can contain child resource pools, virtual machines, or both. You can create a
hierarchy of shared resources. The resource pools at a higher level are called parent resource
pools. Resource pools and virtual machines that are at the same level are called siblings. The
cluster itself represents the root resource pool. If you do not create child resource pools, only the
root resource pool exists.
For each resource pool, you specify the reservation, limit, shares, and whether the reservation
should be expandable. The resource pool resources are then available to child resource pools and
virtual machines.
For example, assume a host has a number of virtual machines (Figure 16). The marketing
department uses three of the virtual machines and the QA department uses two virtual machines.
Because the QA department needs larger amounts of CPU and memory, the administrator
creates one resource pool for each group. The administrator sets CPU Shares to High for the QA
department pool and to Normal for the Marketing department pool so that the QA department
users can run automated tests. The second resource pool with fewer CPU and memory
resources is sufficient for the lighter load of the marketing staff. Whenever the QA department is
not fully using its allocation, the marketing department can use the available resources.
Figure 16 Resource Pool Example
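The QA and Marketing example can be approximated with a short Python sketch. It is a simplified model, not the DRS algorithm: it assumes sibling pools under contention receive CPU in proportion to their shares (Low, Normal, and High in the 1:2:4 ratio), that no pool receives more than it demands, and that any unused portion is redistributed to the remaining pools. Reservations and limits are ignored, and the function name and MHz figures are hypothetical.

```python
SHARE_WEIGHT = {"Low": 1, "Normal": 2, "High": 4}


def divide_cpu(capacity_mhz, pools):
    """Divide capacity among sibling pools by shares, capped at demand.

    pools maps pool name -> (share level, demand in MHz).
    Returns pool name -> allocated MHz.
    """
    alloc = {name: 0.0 for name in pools}
    remaining = float(capacity_mhz)
    unsatisfied = set(pools)

    while unsatisfied and remaining > 1e-9:
        total_weight = sum(SHARE_WEIGHT[pools[n][0]] for n in unsatisfied)
        granted = 0.0
        for name in list(unsatisfied):
            level, demand = pools[name]
            # Offer a share-proportional slice of what is still unallocated.
            offer = remaining * SHARE_WEIGHT[level] / total_weight
            take = min(offer, demand - alloc[name])
            alloc[name] += take
            granted += take
            if demand - alloc[name] <= 1e-9:
                unsatisfied.discard(name)
        remaining -= granted
        if granted <= 1e-9:
            break
    return alloc
```

On a hypothetical 6000 MHz host where both pools want everything, divide_cpu(6000, {"QA": ("High", 6000), "Marketing": ("Normal", 6000)}) gives QA 4000 MHz and Marketing 2000 MHz. If QA only needs 1000 MHz, Marketing receives the remaining 5000 MHz, which matches the behavior described above when the QA department is not fully using its allocation.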
By using resource pools, you can create customer service level definitions tailored toward service
offerings. The chart below demonstrates service class definitions, which incorporate limits,
shares, and reservations. This can help with compute resource micro-segmentation. Each
resource pool will have different shares, CPU and memory limits, and expansion
capabilities. This helps to prioritize the virtual machine workloads in accordance with service
guidelines.
Figure 17 Resource Pool Tiers
Infrastructure Maintenance and Deployment Management
Maintenance and Support
All IT organizations have limits on their resources: people, time, and money. Therefore, it is
critical to determine what the vital business functions are. By creating a small infrastructure cell
dedicated to mission-critical core systems, you can enhance your infrastructure maintenance and
deployment processes. Moreover, it can be risky to take more than two hosts out of the cluster at
a time to perform maintenance and upgrades. By creating a small cluster, you can ensure that
changes to the mission-critical cell are only performed on approved infrastructure release dates.
This will help to minimize the risk to vital business functions by enforcing more rigid change
and release management processes.
Furthermore, by defining service levels with infrastructure resources, you can define infrastructure
operations support levels to match application availability expectations. As infrastructure growth
continues at 20% year-over-year, head count to support the increased infrastructure remains flat.
There are only a handful of options to meet this challenge.
1. Do nothing, which will degrade the ability to be strategic and place your infrastructure
engineers in firefighting mode.
2. Maintain a steady FTE ratio and regularly add headcount.
3. Create support offerings to balance out time and effort based on application availability.
This is no different than vendor support offerings. For Gold service level, the IT operations team
would provide 24x7 support with root cause analysis, Silver service might be 24x5 support, and
Bronze support would be Monday through Friday 8 am to 8 pm.
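The support windows described above can be expressed as a small Python check. This is a sketch under stated assumptions: 24x5 is taken to mean Monday through Friday, times are local, holidays are ignored, and the function name is hypothetical.

```python
from datetime import datetime


def is_supported(tier, when):
    """Return True if `when` falls inside the support window for `tier`."""
    is_weekday = when.weekday() < 5  # Monday is 0, Friday is 4
    if tier == "Gold":
        return True  # 24x7
    if tier == "Silver":
        return is_weekday  # 24x5, assumed Monday through Friday
    if tier == "Bronze":
        return is_weekday and 8 <= when.hour < 20  # 8 am to 8 pm
    raise ValueError(f"unknown service tier: {tier}")
```

For an incident raised at 3 am on a Saturday, only the Gold tier is inside its support window; the same incident at 3 am on a Thursday is covered by Gold and Silver, and Bronze coverage begins at 8 am.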
As mentioned in the introduction, mission-critical applications, such as customer-facing
applications and financial systems, are core to the company’s mission, and system downtime
translates to financial losses for the organization. Other applications, like general purpose
printing, software media libraries, and infrastructure monitoring tools, don’t require the same
service level capabilities as mission-critical applications.
Managed
In a managed server environment, infrastructure services generally takes the responsibility of
providing all server builds, any infrastructure or application software upgrades, and general
maintenance such as reboots and hardware issues. The managed server type could apply to your
Gold and Silver environments.
A managed server leaves all the management duties of running the server under the control of
infrastructure services.
In a managed server environment the application support team will not have any administration
rights to the server unless a business justified exception is approved by Enterprise Information
Security & Risk Management.
Unmanaged
A private cloud provides IT business partners the equivalent of their own personal datacenter.
The infrastructure team allocates each owner a pool of resources (compute, memory, and disk),
helps them with a catalog of standard server build templates, and then allows them to create,
manage, and delete their virtual instances through a cloud management portal.
The Bronze service tier with self-service provisioning should be considered “unmanaged” unless
special arrangements are made on an exception basis.
Unmanaged virtual machines give application support teams complete server administration
along with the responsibility that goes with it. Unmanaged server hosting, despite its name, does
not really leave application support teams to their own devices; all of the application support
teams are still bound to adhere to corporate security guidelines and standards.
Infrastructure services support for an unmanaged server is limited. Infrastructure services would
still monitor the overall host and cluster performance, resolve problems with infrastructure
related software, and troubleshoot operating system and connectivity issues.
In the event that an issue occurs on an unmanaged server and is due to a change made by the
application support team, infrastructure services could provide a limited amount of engineering
time to resolve the issue (for example, 30 minutes).
Creating Service Offerings
By defining service offerings, you provide your business partners the framework to make the right
decisions and the accountability to encourage the desired behavior through cost transparency. By
using a multi-faceted approach with all the technology capabilities available, you can provide the
expected business outcome by matching up technology with business requirements. In the
diagram below, we start to put the components together into our service levels. For instance, our
Gold service level includes the Gold Cluster, Tier 1 and Tier 2 storage offerings, Platinum and
Gold+ Resource Pools, Restart Order of High, Managed service, and 24 x 7 infrastructure
support.
Figure 18 Service Tiers
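One way to make a service level definition actionable is to capture it as data and validate provisioning requests against it. The Python sketch below encodes only the Gold service level as described in the text; the ServiceLevel class, its field names, and the placement_errors function are hypothetical, and other tiers would need to be filled in from Figure 18.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    """Components that make up one service level."""
    name: str
    cluster: str
    storage_tiers: frozenset
    resource_pools: frozenset
    restart_priority: str
    management: str
    support: str


# Gold service level, using the components listed in the text.
GOLD = ServiceLevel(
    name="Gold",
    cluster="Gold Cluster",
    storage_tiers=frozenset({"Tier 1", "Tier 2"}),
    resource_pools=frozenset({"Platinum", "Gold+"}),
    restart_priority="High",
    management="Managed",
    support="24x7",
)


def placement_errors(level, cluster, storage_tier, resource_pool):
    """List the ways a requested placement departs from the service level."""
    errors = []
    if cluster != level.cluster:
        errors.append(f"cluster {cluster!r} is not {level.cluster!r}")
    if storage_tier not in level.storage_tiers:
        errors.append(f"storage {storage_tier!r} is not offered in {level.name}")
    if resource_pool not in level.resource_pools:
        errors.append(f"resource pool {resource_pool!r} is not offered in {level.name}")
    return errors
```

A request for the Gold Cluster with Tier 1 storage in the Platinum resource pool returns an empty list, while a request that asks for a storage tier or resource pool outside the Gold definition returns one message per mismatch.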
Also, creating a hosting services heat map provides further clarity: it helps define which
application services are approved for specific operational infrastructure hosting tiers. This can
include options for external public cloud service providers. Through these measured steps, you
become a service broker to your business partners.
Figure 19 Hosting Services Heat Map
Infrastructure investments are capital expenditures made for corporate-wide consumption to
support business capabilities through IT operations. Traditionally, infrastructure investments
tended to be more tactical and required more effort to identify, quantify, and calculate benefits
and costs. However, by operating at a more service oriented level, the infrastructure investments
can closely align with the strategic technical business plan and provide a greater return on
investment.
This hosted service strategy will help business leadership when assessing four major enterprise
goals:
1. Cost-effective use of infrastructure
2. Effective use of asset utilization for business requirements
3. Application availability and resiliency
4. Maintaining appropriate staffing levels
The job of a CIO is determining the trade-off between the cost of technology and meeting
business requirements. Becoming a service-oriented organization will encourage you to become
a more cohesive partner to the business. Business leadership will have more information
available to make decisions, and you will become a greater stakeholder in influencing IT
decisions.