PGAS BoF
The Partitioned Global Address Space (PGAS) Programming Languages
Organizers: Tarek El-Ghazawi, Lauren Smith, Bill Carlson and Kathy Yelick
www.pgas.org

Agenda
• Welcome – Tarek El-Ghazawi and Lauren Smith (2 min)
• PGAS 2014 Summary – Jeff Hammond (8 min)
• PGAS 2015 Plans – Tarek (3 min)
• PGAS Announcements (2 min)
• Quick Updates (20 min)
  – Applications: Kathy Yelick, Irene Moulitsas and Christian Simmendinger
  – Technology: Sameer Shende, Oscar R. Hernandez, Nenad Vukicevic, Yili Zheng, Hitoshi Murai, Vivek Kumar, Jacob Nelson, Deepak Majeti, Olivier Serres, Sung-Eun Choi and Salvatore Filippone
• Questions (25 min)

Summary of PGAS 2014 (one person's impression…)
Material to read:
• http://www.csm.ornl.gov/OpenSHMEM2014/index.html
• http://nic.uoregon.edu/pgas14/schedule.php
• http://nic.uoregon.edu/pgas14/keynote.php

OpenSHMEM User Group
• New features proposed: thread support, flexible memory allocation, fault tolerance.
• Multiple papers on how to implement them efficiently.
• Major progress towards an open community standardization discussion (resembled the MPI Forum).

Intel PGAS Tutorial
• Intel: prototype SHMEM over SFI
• Intel: MPI-3 features for PGAS
• LBNL: UPC/GASNet
• OSU: MVAPICH2-X
• Intel: OpenSHMEM over MPI-3
Users are interested in both SFI and MPI-3 as new PGAS runtimes.

PGAS is not just about UPC…
• Day-long OpenSHMEM User Group event
• Full session on OpenSHMEM in PGAS
• Runtimes: SFI, GASNet, MPI-3, OpenCoarrays
• Models: HabaneroUPC++, HPX, Chapel, UPC++
• Applications using MPI+OpenSHMEM, UPC++, CAF
The big winners at PGAS14 were C++ and OpenSHMEM…

Best Paper
"Native Mode-Based Optimizations of Remote Memory Accesses in OpenSHMEM for Intel Xeon Phi"
Naveen Namashivayam, Sayan Ghosh, Dounia Khaldi, Deepak Eachempati and Barbara Chapman.
Right: NVIDIA Tesla CTO Steve Oberlin presenting the Best Paper award to Naveen.

Panel summary
• Need better hardware support for PGAS in general.
• 128-bit pointers and X1E-like behavior desired…
• Questions about heterogeneity (vendor/ISA/core size).
• Debate about active messages.
• Unclear if/how Chapel will ever take off. What is the killer app/library here?
• The Python ecosystem is overwhelming. How do we get this for PGAS?

PGAS 2015 Plans
• Tarek El-Ghazawi, General Chair; D.K. Panda, Program Chair
• Tentative dates: 9/16-9/18, 2015 – start on the morning of the 16th, end around noon on the 18th
• Location: GW Main Campus, D.C. city center, at the Foggy Bottom Metro stop
  – Metro in from anywhere in Greater DC
  – Walk to the White House
  – Walk to tens of DC restaurants
  – Visit the Smithsonians
  – Visit Georgetown for fun, or NSF, DARPA, DoE, … for funding!
Keep an eye on www.pgas.org for emerging details.

PGAS Announcements
• PGAS booth at SC2014: #2255
• BoFs, Wednesday 5:30pm - 7pm:
  – Application Experiences with Emerging PGAS APIs: MPI-3, OpenSHMEM and GASPI – room 386
  – Chapel Users Group Meeting – room 383
  – OpenSHMEM: Further Developing the SHMEM Standard for the HPC Community – room 294
• Mailing list: to register, send an empty email to pgas+subscribe@googlegroups.com
• Announcements from the audience

Low-Overhead Atomic Updates Enable a Genomics Assembly Grand Challenge
• The Meraculous assembler is used in production at the Joint Genome Institute.
  – Wheat assembly is a "grand challenge".
  – The hardest part is contig generation (a large in-memory hash table).
• Meraculous assembly pipeline (reads → k-mers → contigs):
  – Reads: new fast I/O using SeqDB over HDF5.
  – k-mers: new analysis filters errors using a probabilistic Bloom filter.
  – Contigs: the graph algorithm (connected components) scales to 15K cores on NERSC's Edison. Human: 44 hours down to 20 secs; wheat: "doesn't run" down to 32 secs.
• Ongoing work: scaffolds using scalable alignment (uses dynamic aggregation); a new genome mapping algorithm anchors 92% of the wheat chromosome.
• UPC gives tera- to petabyte "shared" memory and combines with parallel I/O.
Kathy Yelick

Dynamic Runtime for Productive Asynchronous Remote Updates Enables Scalable Data Fusion
• Old PGAS: e.g., *p = … or … = a[i]; new PGAS: asynchronous invocation, e.g., finish { … async f(x) … }
• Uses UPC++ async unpack.
• Seismic modeling for energy applications "fuses" observational data into the simulation.
• Uses the PGAS illusion of scalable shared memory to construct the matrix and measure data "fit".
• The new UPC++ dialect supports PGAS libraries; a distributed data structure library is planned.
[Figure: scaling results on 48, 192, 768, 3K and 12K cores]
Kathy Yelick

Lattice Boltzmann solver using UPC
Dr. Irene Moulitsas (I.Moulitsas@cranfield.ac.uk), School of Engineering
• The LB method is a mesoscopic approach for fluid dynamics. Its governing equation describes how a density distribution function changes in time. Numerically, this is resolved in certain directions and the equation is solved in two steps: a collision step and a streaming step. The method is able to resolve the velocity, pressure and density fields for incompressible flows.
• Validation: the captured von Kármán vortex street behind a cylinder; the velocity magnitudes and streamlines in the lid-driven cavity flow.
• Intra-node speedup of the UPC version vs. the serial version on ASTRAL (SGI) on an 800x800 mesh.
• Inter-node speedup of the UPC version vs. the serial version on ASTRAL (SGI) on a 1600x1600 mesh.
Irene Moulitsas

Navier-Stokes solver using CAF
Dr. Irene Moulitsas (I.Moulitsas@cranfield.ac.uk), School of Engineering
• We solve the compressible Navier-Stokes equations on mixed-type unstructured meshes employing different numerical schemes: first order, MUSCL-2, MUSCL-3, WENO-3, WENO-4, WENO-5.
• Speedup of the Coarray Fortran version vs. the MPI version on ASTRAL (SGI) using the Intel compiler, and on ARCHER (Cray XC-30) using the Cray compiler.
• Validation studies were performed for the RAE 2822 and NACA 0012 aerofoils; experimental data was obtained from the NPARC Alliance Verification and Validation archive.
Irene Moulitsas

GASPI/GPI2
GASPI: a failure-tolerant PGAS API for asynchronous dataflow on heterogeneous architectures.
GPI2-1.1.1
• Support for GPU/Xeon Phi.
• Minor fixes.
PGAS community benchmarks – CFD-Proxy
• Multithreaded OpenMP/MPI/GASPI calculation of Green-Gauss gradients for a 2-million-point aircraft mesh with halo (ghost cell) exchange.
• Strong scaling benchmark; we aim for ~100 points per thread/core.
Christian Simmendinger

CFD-Proxy
• See the BoF "Application Experiences with Emerging PGAS APIs: MPI-3, OpenSHMEM and GASPI", Wednesday, 17:30.
• https://github.com/PGAS-community-benchmarks/CFD-Proxy
Christian Simmendinger

TAU Performance System®
• Parallel profiling and tracing toolkit; supports UPC, SHMEM and Co-Array Fortran.
• Cray CCE compiler support for instrumentation of UPC.
• Added support for sampling; initial support for rewriting binaries.
• Notify, fence, barrier, loops and forall instrumented.
• Compiler-based instrumentation.
• Runtime-layer instrumentation, DMAPP layer.
• 3D communication matrix, trace-based views.
• Other compilers supported: Berkeley UPC, IBM XLUPC, GNU UPC.
• Planned: PDT update with EDG 4.9 UPC.
• Support for OpenSHMEM, Mellanox OpenSHMEM, Cray SHMEM, SGI SHMEM.
• HPCLinux LiveDVD/OVA [http://www.hpclinux.org]
• Please stop by the PGAS booth (#2255) for more information.
http://tau.uoregon.edu
Sameer Shende

OpenSHMEM Highlights
Oscar Hernandez, Pavel Shamis, Manju Venkata – ORNL, UH & UTK team
Overview
• ORNL and the University of Houston are driving the OpenSHMEM specification.
• The OpenSHMEM 1.1 specification was ratified in June 2014.
• We have defined a new roadmap for OpenSHMEM, versions 1.1, 1.2, 1.5 and 2.0, with community input.
• Recent work includes building a community and a tools ecosystem for OpenSHMEM.
Accomplishments
• The OpenSHMEM 1.1 specification was released; we are working with the community on the OpenSHMEM 1.2 specification.
• The OpenSHMEM reference implementation is integrated with UCCS and runs on InfiniBand and uGNI.
• OpenSHMEM User Group Meeting (OUG14): http://www.csm.ornl.gov/OpenSHMEM2014/
• The programming environment was enhanced with new tools (Vampir/Score-P, OpenSHMEM Analyzer, TAU).

OpenSHMEM Roadmap
• OpenSHMEM v1.1 (June 2014): errata and bug fixes; ratified (100+ tickets resolved).
• OpenSHMEM v1.2 (November 2014, 20+ tickets): API naming convention; finalize(), global_exit(); consistent data type support; version information; clarifications (zero-length, wait); shmem_local_ptr(); exit codes.
• OpenSHMEM v1.5 (late 2015): non-blocking communication semantics (RMA, AMO); thread safety; fault tolerance; locality.
• OpenSHMEM v1.6: non-blocking collectives.
• OpenSHMEM v1.7: thread safety update.
• OpenSHMEM Next Generation (2.0): let's go wild (Exascale!); teams and groups; active set + memory context; I/O.
• White paper: OpenSHMEM Tools API.

OpenSHMEM User Meeting 2014
• Co-located with PGAS'14 in Eugene, Oregon – October 7th, 2014.
• Two invited speakers: NVIDIA, Mellanox.
• Working meeting:
  – Updates from the vendors/community.
  – OpenSHMEM work-in-progress papers [10 short papers].
  – OpenSHMEM 1.2 ratification and roadmap discussions.
• Call for WIP (work-in-progress) papers: addendum to the PGAS proceedings. Goal: the community presented extension proposals and discussed their OpenSHMEM experiences.
• Website: www.csm.ornl.gov/workshops/oug2014
Oscar R. Hernandez
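To make the API style above concrete, here is a minimal OpenSHMEM sketch in C using the v1.2-style names from the roadmap (shmem_init/shmem_finalize and the shmem_-prefixed symmetric allocation calls). It is an illustrative example written for this summary, not code taken from the specification or the meeting material; it would typically be built with an implementation's wrapper compiler (e.g., oshcc) and launched with one process per PE.

    #include <shmem.h>
    #include <stdio.h>

    int main(void) {
        shmem_init();                       /* v1.2-style initialization */
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric allocation: the buffer exists at the same symmetric
           address on every PE. */
        int *dest = (int *)shmem_malloc(sizeof(int));
        int src = me;

        /* One-sided put: write my PE number into my right neighbor's buffer. */
        shmem_int_put(dest, &src, 1, (me + 1) % npes);
        shmem_barrier_all();                /* make remote updates visible */

        printf("PE %d received %d\n", me, *dest);

        shmem_free(dest);
        shmem_finalize();                   /* added in OpenSHMEM 1.2 */
        return 0;
    }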
GNU/Clang UPC, Intrepid Technology
• Clang UPC 3.4.1 released (clangupc.github.io)
  – UPC 1.3 specification compliant.
  – Integrated into the latest Berkeley UPC toolset (2.20.0).
  – SMP and InfiniBand/Portals4 runtimes (clangupc.github.io/portals4).
• Clang UPC2C translator (clangupc.github.io/clang-upc2c)
  – Berkeley UPC translator compatible.
  – Integrated into the latest Berkeley UPC toolset (2.20.0).
• Clang UPC with remote pointers in LLVM (clangupc.github.io/clang-upc-ir)
  – Experimental branch with UPC shared pointers in the LLVM IR.
  – Passes all UPC tests; full integration expected in 2015.
• Clang UPC with libfabric runtime (clangupc.github.io/libfabric)
  – InfiniBand-based runtime that supports the OFIWG libfabric specification.
• GNU UPC 5.0 (www.gccupc.org)
  – Soon to be released; planned merge into the GCC main trunk.
Nenad Vukicevic

UPC++ Adds Data Structures, Hierarchy, and Async to Traditional (UPC) PGAS
Yili Zheng and Amir Kamil, LBNL, leads
• Traditional (UPC) PGAS: a PGAS address space with put/get; default SPMD (all parallel, all the time).
• UPC++ adds: hierarchical parallelism (on-node, …); remote async (some threads dedicated to handling); a distributed data structure library in progress (multi-D arrays, hash tables); a programmer-selected runtime (task queue, DAG, etc.).
• Hierarchical teams, locales and coarrays support portable code for deep memory hierarchies.
• PGAS locality and lightweight communication match NUMA, small memory per core, and software-managed memories.
[Figure: strawman exascale node architecture – memory stacks on package (low capacity, high bandwidth), wide cores, latency-optimized DRAM/DIMMs (high capacity, low bandwidth), NVRAM burst buffers / rack-local storage, "NIC" on board]
• Domain-specific data structures (arrays, hash tables, etc.) and one-sided communication improve programmability.
[Figure: two-sided message passing between a local and a remote box requires 4 steps through send/recv buffers; a UPC++ 3-D array transfer uses only 1 step, with the details handled by the library]
Yili Zheng

XcalableMP (www.xcalablemp.org)
Directive-based PGAS extension for Fortran and C
• Proposed by the XMP Spec. WG of the PC Cluster Consortium.
• Ver. 1.2.1 spec. is available; XMP/C++ is now on the table.
• Adopted by the Post-T2K and Post-K projects in Japan.
• Supports two parallelization paradigms:
  – Global-view (with HPF-like data/work mapping directives).
  – Local-view (with coarrays).
• Allows mixture with MPI and/or OpenMP.
Example (data mapping, stencil communication and work mapping directives):
  !$xmp nodes p(2,2)
  !$xmp template t(n,n)
  !$xmp distribute t(block,block) onto p
        real a(n,n)
  !$xmp align a(i,j) with t(i,j)
  !$xmp shadow a(1,1)
  ...
  !$xmp reflect (a)
  !$xmp loop (i,j) on t(i,j)
        do j = 2, n-1
          do i = 2, n-1
            w = a(i-1,j) + a(i+1,j) + ...
            ...
Hitoshi Murai

Omni XMP compiler (omni-compiler.org)
• Reference implementation being developed by RIKEN and U. Tsukuba.
• The latest version, 0.9.0, is now released as OSS.
• Platforms: K computer, Cray, IBM BlueGene, NEC SX, Hitachi SR, Linux clusters, etc.
• HPC Challenge class 2 award in 2013.
• Applications: plasma (3D fluid), seismic imaging (3D stencil), fusion (particle-in-cell), etc.
[Figure: HPL performance on the K computer (2013) – performance (TFLOPS) vs. number of nodes, XMP vs. peak]
Hitoshi Murai

HabaneroUPC++: a Compiler-free PGAS Library
Vivek Kumar, Yili Zheng, Vincent Cavé, Zoran Budimlić, Vivek Sarkar
[Figure: LULESH performance for HabaneroUPC++]
See the PGAS 2014 paper and the SC14 PGAS booth poster for more details!
GitHub site: http://habanero-rice.github.io/habanero-upc/
Vivek Kumar

Grappa: Latency-Tolerant PGAS (http://grappa.io)
Can a PGAS system like Grappa be the platform for in-memory "big data" frameworks?
Lines of code / speedup compared with existing frameworks:
• In-memory MapReduce: 152 lines, 10x faster
• Vertex-centric graph API: 60 lines, 1.3x faster
• Relational query execution: 700 lines, 12.5x faster
Example (global view, parallel loops; context switch on long-latency remote operations; computation migrates when the return value is not needed):
  Graph<VertexData, EdgeData> g;
  while (g->active_vertices > 0) {
    // gather values from neighboring vertices
    forall(g, [=](Vertex& v){
      if (!v->delta_cache_enabled) {
        forall<async>(v.in_edges(), [=](Edge& e){
          v->value += on(e.source, [=](Vertex& src){
            return gather_edge(src, e);
          });
        });
      }
    });
    // apply phase
    forall(g, [=](Vertex& v){ apply_vertex(v); });
    // scatter phase (also updates cached gather value)
    forall(g, [=](Vertex& v){
      if (v->active) {
        forall<async>(v.out_edges(), [=](Edge& e){
          on<async>(e.target, [=](Vertex& tgt){
            tgt.delta_cache += scatter_edge(e);
          });
        });
      }
    });
  }
Jacob Nelson

Chapel on HSA + XTQ
[Figure: current Chapel framework (Chapel apps → C program → task runtime → threads + GASNet on CPUs) vs. proposed Chapel framework (Chapel apps → C program + HSAIL → task runtime → HSA + GASNet-X/XTQ on APUs)]
• Current Chapel framework (Chapel threading + GASNet): local tasks via threads; remote tasks via GASNet active messages.
• Proposed Chapel framework (Chapel on HSA + XTQ): local tasks via HSA; remote tasks via XTQ.
Contact: Mauricio Breternitz (Mauricio.Breternitz@amd.com), AMD; Deepak Majeti (deepak@rice.edu), Rice University
Deepak Majeti

Hardware Support for Efficient PGAS Programming
• Mapping of the PGAS memory model to virtual memory through hardware support and instruction extensions.
  – Low latency; handles local accesses and random patterns.
• Prototype hardware using FPGAs.
• Full-system simulation (gem5):
  – Prototype compiler support based on Berkeley UPC over GASNet.
  – 5.5x performance increase without the need for manual optimizations; within 10% (faster or slower) of manually optimized code.
• O. Serres, A. Kayi, A. Anbar, and T. El-Ghazawi, "Hardware support for address mapping in PGAS languages; a UPC case study," in CF '14: Proceedings of the ACM Conference on Computing Frontiers, pp. 22:1-22:2, ACM, 2014.
• O. Serres, A. Kayi, A. Anbar, and T. El-Ghazawi, "Enabling PGAS productivity with hardware support for shared address mapping; a UPC case study," in HPCC 2014: the 16th IEEE International Conference on High Performance Computing and Communications, 2014.
Olivier Serres
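For context, the following minimal UPC sketch (written for this summary with an arbitrary array name and size; it is not taken from the cited papers) shows the kind of fine-grained pointer-to-shared accesses whose (thread, address) translation the proposed hardware support accelerates, avoiding the usual manual privatization and casting optimizations.

    #include <upc_relaxed.h>
    #include <stdio.h>

    #define N 1024

    /* Default cyclic layout: element A[i] has affinity to thread i % THREADS. */
    shared double A[N * THREADS];

    int main(void) {
        int i;

        /* Each thread initializes the elements it has affinity to. */
        upc_forall (i = 0; i < N * THREADS; i++; &A[i])
            A[i] = (double)i;
        upc_barrier;

        /* Fine-grained shared accesses on thread 0: each A[i] dereference
           resolves a pointer-to-shared into a (thread, local offset) pair,
           and most accesses are remote; this address-mapping overhead is
           what the proposed hardware removes without hand tuning. */
        if (MYTHREAD == 0) {
            double sum = 0.0;
            for (i = 0; i < N * THREADS; i++)
                sum += A[i];
            printf("sum = %f on %d threads\n", sum, THREADS);
        }
        return 0;
    }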
Chapel: A Condensed Digest
Past: a DARPA HPCS-funded research prototype
● Designed from first principles rather than by extension:
  ● supports modern features while avoiding sequential baggage;
  ● NOT SPMD – global multitasking is more general and suitable for next-generation architectures;
  ● namespace defined by lexical scoping: more natural and intuitive.
● Open source and collaborative.
Present: towards production grade
● Performance improvements and language features.
● Next-generation architectures (e.g., massively multithreaded, accelerators) – come hear more at the Emerging Technologies Booth (#233).
● New collaborations (e.g., Colorado State, AMD, ETH Zurich).
● Other application areas (e.g., Big Data, interpreted environments).
● Increase the user base (e.g., co-design centers, universities, industry).
Future: the Chapel Foundation
● An independent entity to oversee the language specification and the open-source implementation.
● Come hear more at the CHUG BoF (5:30-7pm, room 383-84-85).
Sung-Eun Choi

OpenCoarrays: Coarrays in GNU Fortran
• OpenCoarrays is a free and efficient transport layer that supports coarray Fortran compilers.
  – The GNU Fortran compiler already uses it as an external library.
• There are currently two versions: MPI-based and GASNet-based.
• GNU Fortran + OpenCoarrays performance:
  – gfortran is better than the Intel compiler in almost every coarray transfer.
  – gfortran is better than Cray for some common communication patterns.
• OpenCoarrays is distributed under the BSD-3 license; scheduled for GCC 5.0.
• For more information, please visit opencoarrays.org.
Salvatore Filippone

PGAS
• PGAS booth at SC2014: #2255
• BoFs, Wednesday 5:30pm - 7pm:
  – Application Experiences with Emerging PGAS APIs: MPI-3, OpenSHMEM and GASPI – room 386
  – Chapel Users Group Meeting – room 383
  – OpenSHMEM: Further Developing the SHMEM Standard for the HPC Community – room 294
• Mailing list: to register, send an empty email to pgas+subscribe@googlegroups.com
• Website: www.pgas.org
Any other PGAS announcements?