MPI-3 Is Here: Optimize and Perform with Intel MPI Tools
Gergana Slavova (gergana.s.slavova@intel.com)
Technical Consulting Engineer, Intel Software and Services Group

Faster Code Faster: Intel® Parallel Studio XE 2015
Faster code:
- Explicit vector programming speeds up more code
- Optimizations for the Intel® Xeon Phi™ coprocessor and the Broadwell and Skylake microarchitectures
- The MPI library now supports the latest MPI-3 standard
- Faster processing of small matrices
- Parallel direct sparse solvers for clusters
Code faster:
- Comprehensive compiler optimization reports
- Analyze Windows* or Linux* profile data on a Mac*
- Latest standards support: MPI-3, OpenMP* 4, full C++11, and Fortran 2003

How Intel® Parallel Studio XE 2015 helps make Faster Code Faster for HPC
- Cluster Edition (HPC cluster): multi-fabric MPI library; MPI error checking and tuning; MPI messages
- Professional Edition: threading design and prototyping; parallel performance tuning; memory and thread correctness
- Composer Edition (vectorized and threaded node): Intel® C++ and Fortran compilers; parallel models (e.g., OpenMP*); optimized libraries

Achieve High Performance for MPI Cluster Applications with Intel® MPI Library
Elevate development tools to a comprehensive shared, distributed, and hybrid application development suite.
- Intel® MPI Library: MPICH-based, supports the MPI-3.0 standard, and includes benchmarks and tuning utilities
- Combine with compilers, performance libraries, and analysis tools
- Available standalone or in the cluster suite: also part of Intel® Parallel Studio XE 2015 Cluster Edition (available for Windows* and Linux*)

Intel® MPI Library Overview
- Optimized MPI application performance: application-specific tuning and automatic tuning
- Lower latency and multi-vendor interoperability: industry-leading latency; performance-optimized support for the latest OFED capabilities through DAPL 2.x
- Faster MPI communication: optimized collectives; iWARP support
- Sustainable scalability beyond 150K cores: native InfiniBand* interface support allows for lower latencies, higher bandwidth, and reduced memory requirements
- More robust MPI applications: seamless interoperability with Intel® Trace Analyzer and Collector

Reduced Latency Means Faster Performance
[Charts: Superior performance with Intel® MPI Library 5.0. Relative (geomean) MPI latency benchmarks, higher is better, for message sizes from 4 bytes to 4 Mbytes, comparing Intel MPI 5.0 against Platform MPI 9.1.2 CE, MVAPICH2 2.0rc2, and OpenMPI 1.7.3 at 64 and 192 processes on 8 nodes (InfiniBand + shared memory), Linux* 64; Intel MPI 5.0 shows speedups of up to 3.4x.]
Configuration 1: dual Intel® Xeon® E5-2697 v2 @ 2.70 GHz, 64 GB RAM; Mellanox Technologies* MT27500 family [ConnectX*-3] FDR interconnect; RedHat* RHEL 6.2; OFED 3.5-2; Intel® MPI Library 5.0; Intel® MPI Benchmarks 3.2.4 (default parameters, built with Intel® C++ Compiler XE 13.1.1 for Linux*).
Configuration 2: Intel® Xeon® CPU E5-2680 @ 2.70 GHz, 64 GB RAM; InfiniBand interconnect, ConnectX adapters, FDR; MIC: C0 KNC @ 1238095 kHz, 61 cores, 15872 MB RAM per card; RHEL 6.2; OFED 1.5.4.1; MPSS 3.2; Intel® C/C++ Compiler XE 13.1.1; Intel® MPI Benchmarks 3.2.4.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Benchmark source: Intel Corporation. Other brands and names are the property of their respective owners.
Intel® MPI Library 5.0: What's New
- MPI-3 standard support: non-blocking collectives, fast RMA, large counts
- MPICH ABI compatibility: compatible with MPICH* v3.1, IBM* MPI v1.4, and Cray* MPT v7.0
- Performance and scaling: memory-consumption optimizations; scaling up to 150K ranks*; gains of up to a 35% reduction on collectives
- Hydra is now the default job manager on Windows*

What is in MPI-3? Is it supported in Intel® MPI Library 5.0?
Topic | Motivation | Main result | Supported?
Collective operations | Collective performance | Non-blocking and sparse collectives | Yes
Remote memory access | Cache coherence, PGAS support | Fast RMA | Yes
Backward compatibility | Buffers > 2 GB | Large buffer support, const buffers | Yes, partial
Fortran bindings | Fortran 2008 | Fortran 2008 bindings; C++ bindings removed | No support in MPICH 3.0
Tools support | PMPI limitations | MPI_T interface | Yes
Hybrid programming | Core count growth | MPI_Mprobe, shared-memory windows | Yes
Fault tolerance | Node count growth | None. Next time? | N/A

I want a complete comm/comp overlap
Problem: computation/communication overlap is not possible with the blocking collective operations.
Solution: non-blocking collectives
- Add non-blocking equivalents of the existing blocking collectives
- Do not mix non-blocking and blocking collectives on different ranks in the same operation
Example (C):
  // Start synchronization
  MPI_Ibarrier(comm, &req);
  // Do extra computation
  …
  // Complete synchronization
  MPI_Test(&req, …);

I have a sparse communication network
Problem: neighbor exchanges are poorly served by the current collective operations (memory and performance losses).
Solution: sparse collectives
- Add blocking and non-blocking Allgather* and Alltoall* collectives based on neighborhoods
Example (Fortran):
  call MPI_NEIGHBOR_ALLGATHER(&
  &    sendbuf, sendcount, sendtype,&
  &    recvbuf, recvcount, recvtype,&
  &    graph_comm, ierror)

I want to use one-sided calls to reduce sync overhead
Problem: MPI-2 one-sided operations are too general to work efficiently on cache-coherent systems and to compete with PGAS languages.
Solution: fast remote memory access
- Eliminate unnecessary overheads by adding a 'unified' memory model
- Simplify the usage model by supporting request-based (MPI_Request) non-blocking calls, extra synchronization calls, relaxed restrictions, shared memory, and much more
Example (Fortran):
  call MPI_WIN_GET_ATTR(win, MPI_WIN_MODEL, &
       memory_model, flag, ierror)
  if (memory_model .eq. MPI_WIN_UNIFIED) then
     ! private and public copies coincide
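To make the fast RMA feature concrete, here is a minimal, self-contained C sketch (an illustration, not taken from the slides): it allocates a window, queries MPI_WIN_MODEL just as the Fortran fragment above does, and performs a passive-target put using the MPI_Win_lock_all/MPI_Win_flush synchronization calls added in MPI-3. The one-integer window, the neighbor-rank choice, and the variable names are illustrative assumptions.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Expose one integer per process through an RMA window. */
      int *buf;
      MPI_Win win;
      MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &buf, &win);
      *buf = -1;
      /* Make sure every rank has initialized its window before remote access. */
      MPI_Barrier(MPI_COMM_WORLD);

      /* MPI-3: query which memory model the window uses. */
      int *memory_model, flag;
      MPI_Win_get_attr(win, MPI_WIN_MODEL, &memory_model, &flag);
      if (rank == 0 && flag && *memory_model == MPI_WIN_UNIFIED)
          printf("unified model: private and public copies coincide\n");

      /* Passive-target epoch: each rank puts its id into the next rank's window. */
      int target = (rank + 1) % size;
      MPI_Win_lock_all(0, win);
      MPI_Put(&rank, 1, MPI_INT, target, 0, 1, MPI_INT, win);
      MPI_Win_flush(target, win);      /* the put is complete at the target */
      MPI_Win_unlock_all(win);

      /* Under the unified model checked above, a barrier after the epoch is
         enough before each rank reads its own window memory directly. */
      MPI_Barrier(MPI_COMM_WORLD);
      printf("rank %d received %d\n", rank, *buf);

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }

MPI_Win_lock_all, MPI_Win_flush, and MPI_Win_unlock_all are among the relaxed passive-target synchronization calls that MPI-3 adds on top of the MPI-2 one-sided interface.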
I'm sending *very* large messages
Problem: the original MPI counts are limited to 2 giga-units, while applications want to send much more.
Solution: large buffer support
- "Hide" the long counts inside derived MPI datatypes
- Add new datatype query calls to manipulate long counts
Example (C):
  // mpi_count may be, e.g., 64-bit long
  MPI_Get_elements_x(&status, datatype, &mpi_count);

Achieve High Performance for MPI Cluster Applications with Intel® Trace Analyzer and Collector
A component of Intel® Parallel Studio XE Cluster Edition, the comprehensive shared, distributed, and hybrid application development suite that also includes compilers, performance libraries, and analysis tools.

Intel® Trace Analyzer and Collector Overview
Intel® Trace Analyzer and Collector helps the developer:
- Visualize and understand parallel application behavior
- Evaluate profiling statistics and load balancing
- Identify communication hotspots
Features:
- Event-based approach
- Low overhead
- Excellent scalability
- Powerful aggregation and filtering functions
- Idealizer
[Workflow diagram: source code is compiled and linked (instrumented via the API, -tcollect, or -trace) against the Intel® Trace Collector; running the binary produces a trace file (.stf) that is examined in the Intel® Trace Analyzer.]

Strengths of Event-based Tracing
Collect, record, predict:
- Detailed MPI program behavior
- Exact sequence of program states, keeping timing consistent
- Information about the exchange of messages: at what times and in which order
An event-based approach is able to detect temporal dependencies.

Using the Intel® Trace Analyzer and Collector is easy
Step 1: run your binary and create a trace file:
  $ mpirun -trace -n 2 ./test
Step 2: view the results:
  $ traceanalyzer &

Intel® Trace Analyzer and Collector Summary Page
[Screenshot callouts: chart showing how the MPI processes interact (blue = computation, red = communication); time interval shown; aggregation of shown data; tagging and filtering; settings; imbalance diagram; idealizer; performance assistant; compare view for the event timelines of two communication profiles.]

Intel® Trace Analyzer and Collector 9.0: What's New
- MPI communications profile summary overview
- Expanded standards support with MPI-3.0
- Automatic performance assistant: automatically detects common MPI performance issues and their impact on runtime, with automated tips on potential solutions

MPI-3 supported in Intel® Trace Analyzer and Collector 9.0
Support for major MPI-3.0 features:
- Non-blocking collectives, e.g., non-blocking Allreduce (MPI_Iallreduce)
- Fast RMA
- Large counts
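As a companion to the non-blocking collectives described earlier and the MPI_Iallreduce support listed above, here is a minimal C sketch (an illustration, not taken from the slides) that overlaps a global sum with independent local computation. The array length and the local_work routine are hypothetical placeholders. Built against an MPI-3 library such as Intel® MPI Library 5.0, it can be launched with mpirun -trace and examined in the Trace Analyzer exactly as in the two steps shown earlier.

  #include <mpi.h>
  #include <stdio.h>

  #define N 1024   /* illustrative array length */

  /* Hypothetical computation that does not depend on the reduction result. */
  static void local_work(double *x, int n)
  {
      for (int i = 0; i < n; i++)
          x[i] = 0.5 * x[i] + 1.0;
  }

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      double send[N], recv[N], other[N];
      for (int i = 0; i < N; i++) {
          send[i]  = rank + i;
          other[i] = i;
      }

      /* Start the global sum; it progresses in the background. */
      MPI_Request req;
      MPI_Iallreduce(send, recv, N, MPI_DOUBLE, MPI_SUM,
                     MPI_COMM_WORLD, &req);

      /* Overlap: do independent work while the collective completes. */
      local_work(other, N);

      /* Complete the collective before recv is used. */
      MPI_Wait(&req, MPI_STATUS_IGNORE);

      if (rank == 0)
          printf("recv[0] = %g\n", recv[0]);

      MPI_Finalize();
      return 0;
  }

All ranks must call MPI_Iallreduce and, as noted on the non-blocking collectives slide, blocking and non-blocking variants must not be mixed across ranks in the same operation.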
Call to Action
- Explore MPI-3 for your application today
- Download Intel® Parallel Studio XE 2015 Cluster Edition
- Tell us about your experiences on the Intel® Clusters and HPC Technology forums: software.intel.com/en-us/forums/intel-clusters-and-hpc-technology
- Intel® MPI Library product page (LEARN tab): www.intel.com/go/mpi

Learn More for Your Customers: Intel Software Developer Webinars
Sign up: https://software.intel.com/en-us/articles/intel-software-tools-technical-webinar-series
Topic | Date & time (Pacific)
Succeed by "Modernizing Code": Parallelize, Vectorize; Examples, Tips, Advice | Sep 9, 2014, 1:15-2:15 PM
Knights Corner: Your Path to Knights Landing | Sep 17, 2014, 9:00-10:00 AM
Update Now: What's New in Intel® Compilers and Libraries | Sep 23, 2014, 9:00-10:00 AM
Got Errors? Need to Thread? Intel® Software Dev Tools to the Rescue | Sep 30, 2014, 9:00-10:00 AM
MPI-3 Is Here: Optimize and Perform with Intel MPI Tools | Oct 7, 2014, 9:00-10:00 AM
How an Oil and Gas Application Gained 10x Performance | Oct 14, 2014, 9:00-10:00 AM
Accelerate Complex Simulations: An Example from Manufacturing | Oct 21, 2014, 9:00-10:00 AM
New Intel® Math Kernel Library Features Boost Performance for Tiny and Gigantic Computations | Nov 4, 2014, 9:00-10:00 AM
Design Code that Scales | (date not listed)

Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors.
Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804.
*Other names and brands may be claimed as the property of others.