Silicon Valley May 8-11, 2017

Schedule Planner

PANEL


S7526 - Scaling Deep Learning on High-Performance Computers for Use in Scientific Workloads

Fernanda Foertter HPC User Assistance and Outreach Group, Oak Ridge National Laboratory
Highly-Rated Speaker
Fernanda Foertter is a member of the User Assistance Team at the National Center for Computational Sciences (NCCS) located at Oak Ridge National Laboratory (ORNL). This team is responsible for assisting all users at the Oak Ridge Leadership Computing Facility (OLCF). Fernanda is responsible for the training program at the center and represents OLCF at both the OpenACC and OpenMP organizations.
Jack Wells Director of Science, Oak Ridge Leadership Computing Facility
Jack Wells is the Director of Science for the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science national user facility, and the Titan supercomputer, located at Oak Ridge National Laboratory (ORNL). Wells is responsible for the scientific outcomes of the OLCF's user programs. Jack has previously led both ORNL's Computational Materials Sciences group in the Computer Science and Mathematics Division and the Nanomaterials Theory Institute in the Center for Nanophase Materials Sciences. Prior to joining ORNL as a Wigner Fellow in 1997, Wells was a postdoctoral fellow within the Institute for Theoretical Atomic and Molecular Physics at the Harvard-Smithsonian Center for Astrophysics. Jack has a Ph.D. in physics from Vanderbilt University, and has authored or co-authored over 80 scientific papers and edited one book, spanning nanoscience, materials science and engineering, nuclear and atomic physics, computational science, applied mathematics, and text-based data analytics.
Steven Young Research Scientist in Deep Learning, Oak Ridge National Laboratory
Steven Young is a researcher at Oak Ridge National Laboratory working in the Computational Data Analytics Group. His research focuses on applying deep learning to challenging datasets using HPC to enable faster training and quicker discovery. He has a Ph.D. in computer engineering from the University of Tennessee, where he studied machine learning in the Machine Intelligence Lab.
William Tang Principal Research Physicist, Princeton University
William Tang of Princeton University is a principal research physicist at the Princeton Plasma Physics Laboratory, for which he served as chief scientist (1997-2009), and is currently a lecturer with the rank and title of professor in astrophysical sciences and a member of the executive board for the Princeton Institute for Computational Science and Engineering, which he helped establish and served as associate director (2003-2009). William is internationally recognized for expertise in the mathematical formalism and associated computational applications dealing with electromagnetic kinetic plasma behavior in complex geometries -- with over 200 publications, including more than 150 peer-reviewed papers, an "h-index" or "impact factor" of 44 on the Web of Science, and well over 7,000 total citations. William has taught for over 30 years and has supervised numerous Ph.D. students, including recipients of the Presidential Early Career Award for Scientists and Engineers in 2000 and 2005. He is also head of the Intel Parallel Computing Center at the Princeton Institute for Computational Science & Engineering at Princeton University.
Daniel George Scientist, University of Illinois at Urbana-Champaign, National Center for Supercomputing Applications
Daniel George is a Ph.D. student in astronomy, pursuing the computational science and engineering concentration, at the University of Illinois at Urbana-Champaign. He obtained his bachelor's degree in engineering physics from IIT Bombay. He is currently a research assistant in the Gravity Group at the National Center for Supercomputing Applications and a member of the LIGO collaboration working at the interface of deep learning, high performance computing, and gravitational wave and multimessenger astrophysics. His long-term interests lie in applying cutting-edge computer science and technology, especially machine learning and artificial intelligence, to accelerate discoveries in the fundamental sciences.

Deep learning has become a popular tool for gaining insight into problems where deterministic models don't yet exist. Recent development of deep learning frameworks using GPUs has allowed the application of deep learning to problems where fast solutions are required. The scientific community has traditionally sought to develop deterministic models to describe physical phenomena, using highly scalable systems to simulate problems with ever-increasing fidelity. While many science domains have developed robust predictive methods, there are still problems lacking models that can describe observed phenomena. In many of these cases, the problem may contain unknown variables, or be fundamentally hard to solve, where the simulation cannot fully predict observations. These areas include biological systems, chaotic systems, and medical research. There are also fields where a priori models do exist, but surveying the parameter space through simulation of large datasets would have very long time-to-solutions. These areas include instrument data analysis and materials by design. We'll explore how the scientific community is using deep learning to conduct leading-edge research outside of traditional modeling techniques. We'll also explore opportunities and obstacles to scaling deep learning workloads on high performance computing systems.

Level:
Type: Panel
Tags: HPC and Supercomputing; Deep Learning and AI
Industry Segments: Higher Education / Research

Day: TBD
Time: TBD
Location: TBD


TALK


S7113 - GPU-Accelerated Graph Analytics

Howie Huang Associate Professor, The George Washington University
Howie Huang is an associate professor in the Department of Electrical and Computer Engineering at George Washington University.

Future high-performance computing systems will enable fast processing of large datasets, as highlighted by President Obama's executive order on the National Strategic Computing Initiative. Of significant interest is the need for analyzing big graphs arising from a variety of areas -- from social networks and biology, to national security. We'll present our ongoing efforts at George Washington University in accelerating big graph analytics on GPUs. We've developed a GPU-based graph analytics system that delivers exceptional performance through efficient scheduling of a large number of GPU threads and effective utilization of GPU memory hierarchy. Our system is one of the best GPU-based implementations, consistently ranking highly on Graph500 and Green Graph500.

Level: All
Type: Talk
Tags: Accelerated Analytics; HPC and Supercomputing
Industry Segments: Higher Education / Research

Day: TBD
Time: TBD
Location: TBD

S7122 - CUDA Optimization Tips, Tricks and Techniques

Stephen Jones Principal Software Engineer, NVIDIA
Stephen Jones is a principal software engineer in the CUDA group at NVIDIA, working on making the CUDA language and programming model span the needs of parallel programming from high performance computing to artificial intelligence. Prior to NVIDIA, he led the Simulation & Analytics group at SpaceX, where he worked on various projects, including large-scale simulation of combustion processes in rocket engines. His background is in computational fluid mechanics and plasma physics, but he has worked in diverse industries, including networking, CAD/CAM, and scientific computing.

Optimizing your code can be one of the most challenging tasks in GPU programming, but also one of the most rewarding: the performance difference between an initial version and well-tuned code can be a factor of 10 or more. Some optimizations can be quite straightforward while others require care and deep understanding of how the code is executing. A particular focus will be on optimization of the CPU part of your code, which is frequently overlooked even though it is often easier to tune and just as effective. Sometimes the biggest obstacle is just knowing what to look for, so we'll cover a range of techniques that everyone from beginners to CUDA ninjas might not have thought of before.
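A hedged sketch of one classic technique of the kind this talk surveys (illustrative only, not session material): overlapping host-device transfers with kernel execution by combining pinned host memory with CUDA streams. All names and sizes here are made up for the example.

#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int N = 1 << 22, CHUNK = 1 << 20, NSTREAMS = 4;
    float *h, *d;
    cudaMallocHost((void **)&h, N * sizeof(float));  // pinned memory enables truly async copies
    cudaMalloc((void **)&d, N * sizeof(float));
    cudaStream_t s[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&s[i]);

    // Chunks issued to different streams overlap copy-in, compute, and copy-out.
    for (int off = 0, i = 0; off < N; off += CHUNK, ++i) {
        cudaStream_t st = s[i % NSTREAMS];
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        scale<<<(CHUNK + 255) / 256, 256, 0, st>>>(d + off, CHUNK, 2.0f);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}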

Level: Intermediate
Type: Talk
Tags: Performance Optimization; Accelerated Analytics; Algorithms; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7126 - Red Blood Cells Simulations with Chemical Transport Properties

Ansel Blumers Graduate Student, Brown University
Ansel Blumers is a graduate student in the Department of Physics at Brown University.

We'll explore new techniques in GPU-accelerated red blood cell simulations. The desire to study the underlying chemical influences on red blood cell functionality motivates the use of a method that can capture the diffusion and reaction processes. To take advantage of the GPU's parallelism, the new technique involves a stream-diversion tactic and non-blocking MPI communications to streamline the computation. The speed is then tested against the CPU counterpart. Strong-scaling and weak-scaling tests are performed to characterize scalability.

Level: All
Type: Talk
Tags: Computational Biology; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7128 - How to Enable NVIDIA CUDA Stream Synchronous Communications Using GPUDirect

Davide Rossetti Senior Software Engineer, NVIDIA
Davide Rossetti is lead engineer for GPUDirect at NVIDIA. Previously, he spent more than 15 years at the Italian National Institute for Nuclear Physics as a researcher and member of the APE experiment.
Elena Agostini Ph.D. and Intern at NVIDIA, University of Rome
Elena Agostini received her Ph.D. in computer science from the University of Rome “La Sapienza” in collaboration with the National Research Council of Italy. The main topics of her research are GPUs used for cryptanalysis and communications, parallel computing, HPC, and network protocols. During her first internship at NVIDIA headquarters in Santa Clara, California, she collaborated with the CUDA team on the GPUDirect Async technology, recently released by NVIDIA. She is currently doing her second internship at NVIDIA, helping to improve the technology.

Learn how to enable CUDA stream synchronous communications in your applications by employing novel GPUDirect features.

Level: Advanced
Type: Talk
Tags: HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7133 - Multi-GPU Programming with MPI

Jiri Kraus Senior Devtech Compute, NVIDIA
Highly-Rated Speaker
Jiri Kraus is a senior developer on the European DevTech team at NVIDIA, where he focuses on multi-GPU programming models and NVIDIA's collaboration with the Juelich Supercomputing Centre.

Learn how to program multi-GPU systems or GPU clusters using the message passing interface (MPI) and OpenACC or NVIDIA CUDA. We'll start with a quick introduction to MPI and how it can be combined with OpenACC or CUDA. Then we'll cover advanced topics like CUDA-aware MPI and how to overlap communication with computation to hide communication times. We'll also cover the latest improvements with CUDA-aware MPI, interaction with Unified Memory, the multi-process service (MPS, aka Hyper-Q for MPI), and MPI support in NVIDIA performance analysis tools.
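To make the CUDA-aware MPI idea concrete, here is a minimal sketch, assuming an MPI library built with CUDA support (for example, a CUDA-aware Open MPI or MVAPICH2): device pointers go straight into MPI calls, and the library handles staging, pipelining, or GPUDirect internally. The GPUs-per-node mapping below is an assumption for illustration.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;
    float *d_buf;
    cudaSetDevice(rank % 4);   // assumes 4 GPUs per node (illustrative)
    cudaMalloc((void **)&d_buf, N * sizeof(float));

    // With CUDA-aware MPI, no explicit staging through host memory is needed.
    if (rank == 0)
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}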

Level: Intermediate
Type: Talk
Tags: HPC and Supercomputing; Programming Languages

Day: TBD
Time: TBD
Location: TBD

S7136 - DNA Sequences Alignment in Multi-GPUs: Energy Payoff on Speculative Executions

Manuel Ujaldon Full Professor and NVIDIA CUDA Fellow, University of Malaga (Spain), Computer Architecture Department
Manuel Ujaldon is a full professor in computer architecture at the University of Malaga. He earned a B.S. in computer science from the University of Granada (Spain, 1991) and an M.S. and Ph.D. in computer science from the University of Malaga (Spain, 1993 and 1996).

Find out the energy cost of launching speculative executions when handling data dependencies to enhance parallelism on multi-GPU platforms. We present CUDAlign 4.0 as a case study: a multi-GPU execution for an optimal alignment of huge DNA sequences using the exact Smith-Waterman algorithm. Our speculative approach easily attains a 10-20x speedup versus the baseline pipelined version, where GPUs are idle waiting for dependencies to be resolved. But when working on mispredictions, GPUs waste energy. In the green computing era, where GFLOPS/w is the trending metric, we need to know which is worse: wasting time or power. Our experimental study analyzes speculation hit ratios to evaluate the extra performance and measures the energy spent on mispredictions, to conclude to what extent the speculative approach jeopardizes the GFLOPS/w ratio.

Level: All
Type: Talk
Tags: Computational Biology; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7142 - Multi-GPU Programming Models

Jiri Kraus Senior Devtech Compute, NVIDIA
Highly-Rated Speaker
Jiri Kraus is a senior developer on the European DevTech team at NVIDIA. He focuses on multi-GPU programming models and NVIDIA's collaboration with the Juelich Supercomputing Centre.
Sreeram Potluri Senior Software Engineer, NVIDIA CORP
Sreeram Potluri is a senior software engineer at NVIDIA. His work focuses on parallel programming models and communication runtimes for GPU clusters.

Do you need to compute larger problems, or compute faster, than a single GPU allows? Learn how to scale your application to multiple GPUs, how to use the different available multi-GPU programming models, and what their individual advantages are. All programming models will be introduced using the same example, applying a domain decomposition strategy, as in the sketch below.
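As a rough illustration of the single-process flavor of this idea (a generic sketch under assumptions, not the session's example code), each GPU owns one slab of a decomposed domain and kernels are launched per device:

#include <cuda_runtime.h>

__global__ void step(float *slab, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) slab[i] += 1.0f;   // placeholder for the real per-cell update
}

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    const int N = 1 << 24;
    const int chunk = N / ngpus;  // one slab per GPU

    float *slab[16];              // sketch assumes at most 16 GPUs
    for (int g = 0; g < ngpus; ++g) {
        cudaSetDevice(g);
        cudaMalloc((void **)&slab[g], chunk * sizeof(float));
        // Launches on different devices proceed concurrently.
        step<<<(chunk + 255) / 256, 256>>>(slab[g], chunk);
    }
    for (int g = 0; g < ngpus; ++g) { cudaSetDevice(g); cudaDeviceSynchronize(); }
    for (int g = 0; g < ngpus; ++g) { cudaSetDevice(g); cudaFree(slab[g]); }
    return 0;
}

A real decomposition also exchanges halo cells between neighboring slabs, which is where peer-to-peer copies or MPI come in.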

Level: Intermediate
Type: Talk
Tags: Programming Languages; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7155 - Optimized Inter-GPU Collective Operations with NCCL

Sylvain Jeaugey Senior Communication and Computing Engineer, NVIDIA
Sylvain Jeaugey has been optimizing HPC communication libraries for 10+ years with a strong focus on collective operations. Before joining NVIDIA, Sylvain worked for Bull, optimizing MPI libraries to work on 100,000+ cores, then designing the BXI high-speed network for HPC. He is the main developer of the NCCL library.

We'll present the functionalities of NCCL (pronounced "Nickel"), a standalone library of standard collective communication routines, such as all-gather, reduce, broadcast, etc., that have been optimized to achieve high bandwidth over GPU topologies. NCCL can be used in either single- or multi-process (for example, MPI) applications.
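For flavor, here is a hedged single-process sketch in the NCCL 1.x style (not code from the talk): one communicator per GPU, and an in-place all-reduce that sums each device's buffer across all devices.

#include <nccl.h>
#include <cuda_runtime.h>

int main() {
    const int nDev = 2;                  // illustrative device count
    const int N = 1 << 20;
    int devs[nDev] = {0, 1};
    ncclComm_t comms[nDev];
    ncclCommInitAll(comms, nDev, devs);  // one communicator per GPU

    float *buf[nDev];
    cudaStream_t s[nDev];
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc((void **)&buf[i], N * sizeof(float));
        cudaStreamCreate(&s[i]);
    }
    // Sum the per-GPU buffers in place across all devices.
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        ncclAllReduce(buf[i], buf[i], N, ncclFloat, ncclSum, comms[i], s[i]);
    }
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(s[i]);
    }
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaFree(buf[i]);
        cudaStreamDestroy(s[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}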

Level: Intermediate
Type: Talk
Tags: HPC and Supercomputing; Deep Learning and AI

Day: TBD
Time: TBD
Location: TBD

S7160 - NVIDIA GPU Support for Apache Mesos and DC/OS

Kevin Klues Senior Software Engineer, Mesosphere
Kevin Klues is a senior software engineer at Mesosphere working on the Mesos core team. Prior to joining Mesosphere, he worked at Google on an experimental operating system for data centers called Akaros. He and a few others founded the Akaros project while working on their Ph.D.s at UC Berkeley. In a past life, Kevin was a lead developer of the TinyOS project, working at Stanford, the Technical University of Berlin, and the CSIRO in Australia. When not working, you can usually find Kevin on a snowboard or up in the mountains in some capacity or another.

DC/OS is a distributed operating system based on the Apache Mesos distributed systems kernel. DC/OS can be used to automate resource management, schedule process placement, facilitate inter-process communication, and simplify the installation and management of distributed services. In the past, Mesos was famous for helping Twitter eradicate the "Fail Whale." More recently, DC/OS has been adopted in production by high-profile companies such as Autodesk, Esri, Time Warner Cable, Verizon, and Wellframe. Until recently, however, Mesos did not support GPUs as an allocatable resource. Users who wished to use GPUs had to maintain separate GPU clusters outside of their primary Mesos or DC/OS installation. By adding first-class support for GPUs, companies can now leverage the power of the GPU in the same shared cluster as their other workloads. We'll introduce how we added support for GPUs to Mesos, including a demo of TensorFlow jobs running on a standard DC/OS installation.

Level: Intermediate
Type: Talk
Tags: Data Center and Cloud Computing; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7161 - OpenStack + AWS, HPC (aaS), and GPUs - a Pragmatic Guide

Martijn de Vries Chief Technology Officer, Bright Computing
Martijn de Vries serves as Chief Technology Officer of Bright Computing, where he is responsible for Bright's software development. Prior to Bright Computing, Martijn was head of software development at ClusterVision. Martijn taught distributed programming in the Computer Science department at Hanze Polytechnic in the Netherlands and programmed for the New York-based internet startup Shop.com.

Why do HPC in a cloud? How do you do HPC (aaS), with GPU passthrough, in OpenStack? How do you create a full GPU HPC cluster from scratch, on demand, in under five minutes, equipped with NVIDIA's DCGM, the CUDA environment, and deep learning libraries/frameworks? What about hybrid clouds with GPUs spanning OpenStack and AWS? How do you easily and automatically move HPC user data and workloads between the private and public cloud? How do you dynamically scale a virtualized HPC cluster, both horizontally (within the private cloud) and vertically (to the public cloud)? We'll answer these questions during a deep dive into the world of HPC on top of OpenStack and AWS. We'll discuss many ways OpenStack private clouds can be used for bursting HPC workloads, HPC-as-a-service, XaaS (anything-as-a-service), and creating hybrid clouds composed of an on-prem private/community-cloud OpenStack deployment that dynamically scales to public clouds, like AWS. The session includes a demo.

Level: Beginner
Type: Talk
Tags: Data Center and Cloud Computing; HPC and Supercomputing; Federal

Day: TBD
Time: TBD
Location: TBD

S7166 - Implementing High-Resolution Fluid Dynamics Solver in a Performance Portable Way

Pierre Kestener Research Engineer, CEA
Pierre Kestener is a research engineer at CEA (France's National Nuclear Energy Research Center) within the Maison de la Simulation, a research division for high-performance computing. His main interest is in helping domain scientists from astrophysics or computational fluid dynamics to design, develop, and optimize production-level code for large computing platforms. He is the lead developer of the code RamsesGPU, dedicated to magneto-hydrodynamics turbulent flow studies. He also recently started the design of CanoP, a new application platform for CFD with adaptive mesh refinement. As part of the CUDA Research and Teaching Center program, he is involved in teaching GPU programming to master's-level students as well as to researchers and engineers during training courses at France's PRACE advanced training center.

We'll report on the use of the Kokkos C++ library for designing new performance-portable implementations of the algorithms used in astrophysics computational fluid dynamics applications. Among other libraries with similar features, Kokkos, which is developed at Sandia National Laboratories, provides a very promising way of designing high-performance computing parallel applications with performance portability across multiple hardware architectures, code readability, and high productivity in mind. Many scientific domains use community codes developed by tens of developers, and such a high-level library approach will help them use today's GPUs and the next generations productively. We'll illustrate several advantages of our new Kokkos-based implementation of the computationally intensive compressible magneto-hydrodynamics kernels involved in the code RamsesGPU, and demonstrate its efficiency on a multi-GPU platform (NVIDIA Pascal(TM) P100).
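For readers new to Kokkos, a minimal generic sketch (not RamsesGPU code) of the single-source style the talk builds on: the same loop body is compiled for the CUDA, OpenMP, or serial backend depending on the build configuration.

#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int N = 1 << 20;
        // Device-resident array; the memory layout is chosen per backend.
        Kokkos::View<double *> u("u", N);

        // The same functor runs on GPU or CPU depending on the
        // default execution space selected at build time.
        Kokkos::parallel_for("init", N, KOKKOS_LAMBDA(const int i) {
            u(i) = 1.0 / (i + 1);
        });
        Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
}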

Level: All
Type: Talk
Tags: Computational Fluid Dynamics; HPC and Supercomputing; Computer Aided Engineering

Day: TBD
Time: TBD
Location: TBD

S7175 - Exploratory Visualization of Petascale Particle Data in NVIDIA DGX-1

Benjamin Hernandez Computer Scientist, Oak Ridge National Laboratory
Benjamin Hernandez is a computer scientist in the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory. His research interests are in the intersection of crowd simulations, scientific visualization, interactive computer graphics, and human computer interaction using HPC systems.

Learn to leverage the visualization capabilities of the NVIDIA® DGX-1™ system to visualize particle data. We'll cover techniques suitable for exploratory visualization such as parallel dataset reading and reduction on demand with ADIOS I/O library, GPU-based optimization techniques for particle rendering such as radar view frustum culling, occlusion culling, texture-less point sprites, and OpenGL near zero driver overhead methods. We'll also include implementation details to take advantage of the eight NVIDIA Pascal™ GPUs included in the NVIDIA DGX-1.

Level: All
Type: Talk
Tags: In-Situ and Scientific Visualization; Real-Time Graphics; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7177 - Using Containers for GPU-Accelerated Applications

Felix Abecassis Systems Software Engineer, NVIDIA
Felix Abecassis is a systems software engineer at NVIDIA working on making GPU applications easier to deploy and manage in data centers. He focuses on supporting GPU-accelerated machine learning frameworks. He holds an M.S. in computer science from the French engineering school EPITA.
Jonathan Calmels Systems Software Engineer, NVIDIA
Jonathan Calmels is a systems software engineer at NVIDIA working primarily on GPU data center software and hyperscale solutions for deep learning. Jonathan holds an M.S. in computer science and engineering.

We'll showcase how to leverage GPUs inside Linux containers using NVIDIA-docker. Containerizing GPU applications provides multiple benefits: 1) Developers can have reproducible builds and deploy their software seamlessly. 2) GPU applications can run across heterogeneous OS/driver/toolkit environments with no performance overhead. 3) GPU devices can be isolated and assigned to different users or different tasks. We'll go through the particularities of GPU containers and demonstrate how to use container images, from the most basic NVIDIA(R) CUDA(R) application to the most complicated deep learning frameworks. We may also present other container technologies besides Docker/NVIDIA-docker, for instance the Singularity project from Lawrence Berkeley National Laboratory, if not already covered by other speakers.

Level: Intermediate
Type: Talk
Tags: Data Center and Cloud Computing; HPC and Supercomputing; Deep Learning and AI

Day: TBD
Time: TBD
Location: TBD

S7192 - OmpSs+OpenACC: Multi-Target Task-Based Programming Model Exploiting OpenACC GPU Kernels

Guray Ozen Research Assistant, Barcelona Supercomputing Center
Guray Ozen works on compiler- and runtime-based accelerator programming systems as a researcher on the programming models team at the Barcelona Supercomputing Center. The aim of his research is to investigate how to improve the parallelisation of existing sequential applications by using static or dynamic compilation pipelines. His research also explores programming languages for exploiting accelerators. He is the creator of the current implementation of the MACC compiler, which supports OpenMP 4.5 directives for automatic GPU offloading. MACC, a research compiler for investigating the directive-based OpenMP accelerator model on GPUs, is built on top of the Mercurium source-to-source compiler framework and supports OmpSs and almost all directives of the OpenMP accelerator model.

Discover how the OmpSs programming model enables you to combine different programming models, such as OpenACC, multi-threaded programming, CUDA, and OpenCL, while providing a single address space and directionality compiler directives. OmpSs is a flagship project at the Barcelona Supercomputing Center, as well as a forerunner of OpenMP. We'll present the advantages in terms of coding productivity and performance brought by our recent work integrating OpenACC kernels within the OmpSs programming model, as a step forward from our previous OmpSs + CUDA support. We'll also show how our runtime system can use hybrid GPU and CPU execution together without any code modification.

Level: Intermediate
Type: Talk
Tags: Programming Languages; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7193 - Achieving Portable Performance for GTC-P with OpenACC on GPU, Multi-Core CPU, and Sunway Many-Core Processor

Stephen Wang GPU Specialist, Shanghai Jiao Tong University
Stephen Wang is a GPU specialist at the Center for HPC at Shanghai Jiao Tong University. Stephen's research focuses on using OpenACC to achieve portable performance for HPC applications on present-day supercomputers.

The Gyrokinetic Toroidal Code developed at Princeton (GTC-P) delivers highly scalable plasma turbulence simulations at extreme scales on world-leading supercomputers such as Tianhe-2 and Titan. The aim of this work is to achieve portable performance in a single source code for GTC-P. We developed the first OpenACC implementation for GPU, CPU, and the Sunway processor. The results showed that the OpenACC version achieved nearly 90% of the performance of the NVIDIA® CUDA® version on the GPU and of the OpenMP version on the CPU; the Sunway OpenACC version achieved a 2.5X speedup over the entire code. Our work demonstrates that OpenACC can deliver portable performance to complex real-science codes like GTC-P. In addition, we propose adding thread-id support to the OpenACC standard to avoid expensive atomic operations for reductions.
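As a reminder of what the single-source OpenACC style looks like (a generic sketch, not GTC-P code), the same annotated loop can target a GPU, a multicore CPU, or the Sunway backend; the reduction clause below is exactly the kind of construct where the authors note atomics become a cost concern.

#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N];
    double sum = 0.0;

    // One annotated source; the compiler generates the parallel
    // version for whichever target it is built for.
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 0; i < N; ++i) {
        x[i] = (double)i / N;
        sum += x[i];
    }
    printf("sum = %f\n", sum);
    return 0;
}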

Level: Intermediate
Type: Talk
Tags: HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7196 - FMM with Periodic Boundaries Support on GPU

Bartosz Kohnke Software Developer, Max Planck Institute for Biophysical Chemistry
Bartosz Kohnke is a software developer at the Max Planck Institute for Biophysical Chemistry in Göttingen, in the Department of Theoretical and Computational Biophysics. His job is the CUDA parallelization and optimization of the fast multipole method, which will become part of the GROMACS software. Before that, Bartosz worked on efficient implementations of super-resolution fluctuation imaging algorithms, researching different parallelization techniques in the Laboratory of Cellular Dynamics at MPI Göttingen. He holds an M.S. in applied computer science from Georg-August-Universität Göttingen, Germany, with a specialization in scientific computing.

The direct solution of the N-body problem is a simple, yet scientifically important and ubiquitous, showcase algorithm for modern GPUs. However, its computational complexity is O(N^2). The fast multipole method (FMM) is an algorithm that reduces the runtime and complexity to an optimal O(N) for any required precision. We'll present an optimized, fully NVIDIA(R) CUDA(R)-enabled, templated C++ implementation of the FMM, which considers all stages of the method, from particle input to force extraction. We compare different parallelization approaches and show the performance improvement when going from a dynamic parallelization to a presorted list-based approach that fits particular system constraints such as periodic boundary conditions. We'll discuss how to exploit the FMM operators such that both the memory access overhead and the number of complex multiplications are minimized. This moves the kernels into the compute-bound range and increases performance.
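For contrast with the FMM, here is the textbook direct O(N^2) kernel the abstract alludes to, in a hedged, illustrative form (the softening parameter and sizes are assumptions; the FMM replaces this all-pairs loop with a hierarchy of multipole operators):

#include <cuda_runtime.h>

struct Body { float x, y, z, m; };

__global__ void direct_forces(const Body *b, float3 *f, int n, float eps2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 acc = make_float3(0.f, 0.f, 0.f);
    for (int j = 0; j < n; ++j) {   // O(N) work per particle -> O(N^2) total
        float dx = b[j].x - b[i].x;
        float dy = b[j].y - b[i].y;
        float dz = b[j].z - b[i].z;
        float r2 = dx * dx + dy * dy + dz * dz + eps2;  // softening avoids r = 0
        float inv_r = rsqrtf(r2);
        float s = b[j].m * inv_r * inv_r * inv_r;
        acc.x += s * dx; acc.y += s * dy; acc.z += s * dz;
    }
    f[i] = acc;
}

int main() {
    const int n = 4096;
    Body *d_b; float3 *d_f;
    cudaMalloc((void **)&d_b, n * sizeof(Body));   // particle data assumed filled elsewhere
    cudaMalloc((void **)&d_f, n * sizeof(float3));
    direct_forces<<<(n + 127) / 128, 128>>>(d_b, d_f, n, 1e-4f);
    cudaDeviceSynchronize();
    cudaFree(d_b); cudaFree(d_f);
    return 0;
}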

Level: Intermediate
Type: Talk
Tags: Computational Physics; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7199 - Interactive HPC: Large Scale In-situ Visualization using NVIDIA Index in ALYA MultiPhysics

Vishal Mehta Senior Engineer, Barcelona Supercomputing Center
Vishal Mehta works as a senior engineer at the Barcelona Supercomputing Center. He is motivated by a co-design approach driven by ambitious applications and influencing the software stack for the development of next-generation, exascale-ready HPC ecosystems. Vishal's fields of interest include computational mechanics, linear algebra, and GPU algorithms for computational science. He has six years of experience working with GPUs in the HPC ecosystem.
Christopher Lux Senior Graphics Software Engineer, NVIDIA IndeX R&D, NVIDIA
Christopher Lux is a senior graphics software engineer at the NVIDIA Advanced Rendering Center. He received his Ph.D. in computer science in 2013 from the Bauhaus-Universität Weimar, Germany. Through his interest in real-time computer graphics and scientific visualization, he focused his work early on the interactive visualization of large-scale datasets from the geo-scientific and medical domains.
Marc Nienhaus Sr. Engineering Manager, Product Technology Lead, NVIDIA IndeX, NVIDIA
Marc Nienhaus is the product technology lead of the NVIDIA IndeX(TM) commercial software at NVIDIA. He manages the NVIDIA IndeX software engineering team and is responsible for the overall product architecture and applications in various domains. Before joining mental images' R&D rendering department and NVIDIA, Marc was a postdoc at Northwestern University and led research projects at the University of Potsdam. His research interests include parallel and distributed rendering and computing, scientific visualization, GPU-based rendering, and photorealistic and non-photorealistic expressive depictions. He holds a master's in mathematics with a minor in computer science from the University of Muenster and a Ph.D. in computer science from the Hasso Plattner Institute at the University of Potsdam. Marc has published various papers on GPU-based real-time rendering and non-photorealistic rendering.

We'll discuss how NVIDIA IndeX™ Advanced Rendering Tools are helping researchers get more insight through in-situ visualizations. HPC applications have always been centered around large computations, small input, and extremely large simulated output. HPC applications running on big supercomputers are executed using a queuing system, where researchers have to wait a couple of hours before analyzing the outputs. We've designed essential software components that allow in-situ visualizations of sparse volume data from ALYA multiphysics simulation code (Barcelona Supercomputing Center) using NVIDIA IndeX. ALYA multiphysics is one of the two European exascale benchmarks and is used in targeted medicine, cardiac modeling, renewable energy, etc. We'll guide you through techniques that have been used in enabling in-situ rendering and analysis of data.

Level: Intermediate
Type: Talk
Tags: In-Situ and Scientific Visualization; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7204 - HPC and Deep Learning on Azure

Karan Batta Program Manager, Big Compute/HPC Team, Microsoft
Karan Batta is a program manager in the Big Compute/HPC team in Microsoft's Azure, where he leads the vision and deployment of the new Azure GPU N-Series as part of broader Azure Compute IaaS capabilities. Additionally, he leads the media and entertainment vertical solutions as part of the Azure Batch HPC service.

Learn how you can scale your traditional HPC-based applications or workloads in Azure using powerful NVIDIA(R) Tesla(R)-based GPUs and Azure's low-latency networking. Additionally, learn how our customers are running deep learning and AI workloads using these GPUs in Azure to create the best speech recognition models, natural language processing, and image/object detection for scenarios such as digital assistants or autonomous cars.

Level: All
Type: Talk
Tags: Data Center and Cloud Computing; Deep Learning and AI; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7205 - High-End Design & Visualizations on Azure

Karan Batta Program Manager, Big Compute, HPC Team, Microsoft
Karan Batta is a program manager in the Big Compute/HPC team in Microsoft's Azure, where he leads the vision and deployment of the new Azure GPU N-Series as part of broader Azure Compute IaaS capabilities. Additionally, he leads the media and entertainment vertical solutions as part of the Azure Batch HPC service.


Level: All
Type: Talk
Tags: Data Center and Cloud Computing; HPC and Supercomputing; Accelerated Analytics

Day: TBD
Time: TBD
Location: TBD

S7240 - Efficient Correlation-Free Many-States Lattice Monte Carlo on GPUs

Jeffrey Kelling Scientist, Helmholtz-Zentrum Dresden-Rossendorf
Jeffrey Kelling is a scientist in the Computational Science group at Helmholtz-Zentrum Dresden-Rossendorf, Germany, concerned with high performance computing.

We'll present a method for highly efficient lattice Monte Carlo simulations with correlation-free updates. Achieving freedom from erroneous correlations requires random selection of lattice sites for updates, which must be restricted by a suitable domain decomposition to create parallelism. While approaches based on caching limit the number of allowed states, the multisurface-type approach presented here allows arbitrarily complex states. The effectiveness of the method is illustrated by the fact that it allowed us to solve a long-standing dispute around surface growth under random kinetic deposition in the KPZ universality class. The method has also been applied to Potts models and is suitable for spin-glass simulations, such as those required to test quantum annealers like D-Wave.
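A much-simplified sketch of the underlying idea, under stated assumptions (the talk's multisurface scheme is more elaborate): split the lattice into tiles, give each thread one tile, and update a randomly chosen site inside it, so threads never race on a site. Real correlation-free schemes also randomize the decomposition between sweeps.

#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void init_rng(curandState *states, int n, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) curand_init(seed, i, 0, &states[i]);
}

__global__ void mc_sweep(int *lattice, int L, int tile, curandState *states) {
    int tx = blockIdx.x * blockDim.x + threadIdx.x;   // tile index in x
    int ty = blockIdx.y * blockDim.y + threadIdx.y;   // tile index in y
    int tilesPerRow = L / tile;
    if (tx >= tilesPerRow || ty >= tilesPerRow) return;
    int id = ty * tilesPerRow + tx;
    curandState st = states[id];
    // Random site inside this thread's private tile; tiles do not
    // overlap, so no two threads touch the same site.
    int x = tx * tile + (curand(&st) % tile);
    int y = ty * tile + (curand(&st) % tile);
    lattice[y * L + x] ^= 1;                          // placeholder update rule
    states[id] = st;
}

int main() {
    const int L = 1024, tile = 8;
    const int tiles = (L / tile) * (L / tile);
    int *lattice; curandState *states;
    cudaMalloc((void **)&lattice, L * L * sizeof(int));
    cudaMemset(lattice, 0, L * L * sizeof(int));
    cudaMalloc((void **)&states, tiles * sizeof(curandState));
    init_rng<<<(tiles + 255) / 256, 256>>>(states, tiles, 1234ULL);

    dim3 block(16, 16), grid((L / tile + 15) / 16, (L / tile + 15) / 16);
    for (int sweep = 0; sweep < 100; ++sweep)
        mc_sweep<<<grid, block>>>(lattice, L, tile, states);
    cudaDeviceSynchronize();
    cudaFree(lattice); cudaFree(states);
    return 0;
}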

Level: All
Type: Talk
Tags: Computational Physics; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7241 - Spectral Clustering of Large Networks

Alexandre Fender Software Engineer, NVIDIA
Alexandre Fender has worked at NVIDIA for three years as a software engineer, focusing on emerging applications in graph analytics and sparse iterative methods. Alex is involved in the development of CUDA libraries such as nvGRAPH. In parallel, he is finishing his Ph.D. on accelerated iterative eigenvalue solvers based on Krylov methods for networks analysis.
Maxim Naumov Senior Research Scientist, NVIDIA
Highly-Rated Speaker
Maxim Naumov is a senior research scientist at NVIDIA. His interests include parallel algorithms, numerical linear algebra, optimization, and graphs. He also contributes to the Data Analytics nvGRAPH library. Maxim has led the development of the AmgX library, which provides distributed Algebraic Multigrid, Krylov and Relaxation-based schemes. He has worked on the cuBLAS, cuSPARSE, and cuSOLVER(RF) libraries that are part of the CUDA Toolkit. Previously, Maxim held different positions on NVIDIA's CUDA Platform and Emerging Applications teams and Intel's Microprocessor Technology Lab and Computational Software Lab. He received his Ph.D. in computer science, with specialization in computational science and engineering, in 2009 and his B.S. in computer science and mathematics in 2003, from Purdue University - West Lafayette.

We'll explore techniques for expressing graph clustering as an eigenvalue problem. Attendees will learn how to express different metrics, including minimum balanced cut, modularity, and Jaccard, through associated matrices and how to use their eigenvectors to find the clustering of the graph into multiple partitions. We'll also show how to take advantage of efficient implementation of Lanczos and LOBPCG eigenvalue solvers and k-means algorithm on the GPU to compute clustering using our general framework. Finally, we'll highlight the performance and quality of our approach versus existing state-of-the-art techniques.
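The session's solvers are Lanczos and LOBPCG; as a far simpler stand-in that still shows the shape of GPU eigenvector computation with library calls, here is a hedged power-iteration sketch using cuBLAS (dense and illustrative only; the matrix is assumed to be filled elsewhere, e.g., from a graph):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Repeatedly apply A and normalize: x converges to the dominant eigenvector.
void power_iteration(cublasHandle_t h, const float *A, float *x, float *y,
                     int n, int iters) {
    const float one = 1.0f, zero = 0.0f;
    for (int k = 0; k < iters; ++k) {
        cublasSgemv(h, CUBLAS_OP_N, n, n, &one, A, n, x, 1, &zero, y, 1);  // y = A*x
        float nrm;
        cublasSnrm2(h, n, y, 1, &nrm);
        float inv = 1.0f / nrm;
        cublasSscal(h, n, &inv, y, 1);                                     // normalize
        cudaMemcpy(x, y, n * sizeof(float), cudaMemcpyDeviceToDevice);
    }
}

int main() {
    const int n = 512;
    float *A, *x, *y;
    cudaMalloc((void **)&A, n * n * sizeof(float));  // assumed filled elsewhere
    cudaMalloc((void **)&x, n * sizeof(float));
    cudaMalloc((void **)&y, n * sizeof(float));
    std::vector<float> ones(n, 1.0f);                // all-ones starting vector
    cudaMemcpy(x, ones.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cublasHandle_t h;
    cublasCreate(&h);
    power_iteration(h, A, x, y, n, 50);
    cublasDestroy(h);
    cudaFree(A); cudaFree(x); cudaFree(y);
    return 0;
}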

Level: Intermediate
Type: Talk
Tags: Accelerated Analytics; Algorithms; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7243 - Lightweight Compression Methods Achieving 120GBps and More

Piotr Przymus Postdoc, Aix-Marseille University, France
Piotr Przymus obtained his Ph.D. in 2014 from the University of Warsaw, Poland. Currently he is a postdoc at Aix-Marseille University, France. Piotr has been an active researcher and promoter of GPUs since 2011.

We'll investigate new approaches to parallel lossless lightweight compression methods based on fixed-length minimum bit encoding for GPU processors. We'll discuss various memory access patterns and develop a new optimal memory organization. By utilizing new inter-thread and inter-warp communication abilities, we propose algorithms that suit the GPU architecture better. As a result, we significantly improve compression ratio and bandwidth. This allows for many new applications in computational clusters as well as in computational algorithms. Our claims are supported by tests conducted using simulated data and TPC-H database benchmarking tools.
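To ground the term, here is a hedged sketch of fixed-length minimum bit encoding (not the authors' optimized kernels, which add the memory-layout and warp-communication improvements described above): every value is known to fit in `bits` bits, and threads pack values into a dense word stream with atomicOr.

#include <cuda_runtime.h>
#include <stdint.h>

__global__ void pack_bits(const uint32_t *in, uint32_t *out, int n, int bits) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    uint64_t bitpos = (uint64_t)i * bits;   // absolute bit offset of value i
    int word = (int)(bitpos >> 5);          // index of the 32-bit output word
    int shift = (int)(bitpos & 31);
    uint32_t v = in[i] & ((bits < 32) ? ((1u << bits) - 1u) : 0xffffffffu);
    atomicOr(&out[word], v << shift);
    if (shift + bits > 32)                  // value straddles a word boundary
        atomicOr(&out[word + 1], v >> (32 - shift));
}

int main() {
    const int n = 1 << 20, bits = 10;       // e.g., values known to fit in 10 bits
    const int words = (int)(((long long)n * bits + 31) / 32);
    uint32_t *d_in, *d_out;
    cudaMalloc((void **)&d_in, n * sizeof(uint32_t));   // input assumed filled elsewhere
    cudaMalloc((void **)&d_out, words * sizeof(uint32_t));
    cudaMemset(d_out, 0, words * sizeof(uint32_t));     // atomicOr needs zeroed output
    pack_bits<<<(n + 255) / 256, 256>>>(d_in, d_out, n, bits);
    cudaDeviceSynchronize();
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}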

Level: Beginner
Type: Talk
Tags: In-Situ and Scientific Visualization; HPC and Supercomputing; Media and Entertainment; Algorithms

Day: TBD
Time: TBD
Location: TBD

S7253 - Kokkos Hierarchical Task-Data Parallelism for C++ HPC Applications

H. Carter Edwards Principal Member of Technical Staff, Sandia National Laboratories
Highly-Rated Speaker
H. Carter Edwards is the principal investigator and architect for the Kokkos project at Sandia National Laboratories. Carter has over three decades of experience in modeling and simulation software development and over two decades of experience in HPC, parallel processing, and C++ software development. For the last several years, his HPC focus has been on algorithms and programming models for thread-scalable and performance portable parallelism across next-generation platform node architectures. Carter has a B.S. and M.S. in aerospace engineering and a Ph.D. in computational mathematics. He represents Sandia on the ISO C++ language standard committee.

The Kokkos library provides C++ HPC applications with a performance portable programming model for disparate manycore architectures such as NVIDIA® Pascal™, AMD Fusion, and Intel Xeon Phi. Until last year Kokkos supported only composition of data parallel patterns (foreach, reduce, and scan) with range and hierarchical team parallel execution policies. Our latest parallel pattern is a dynamic, directed acyclic graph (DAG) of heterogeneous tasks where each task supports internal data parallelism. At GTC16 we presented preliminary results based upon just-in-time access to an early release of NVIDIA CUDA® 8. We've had a year to mature this highly challenging task-DAG capability and present results using the NVIDIA Pascal GPU.

Level: Intermediate
Type: Talk
Tags: HPC and Supercomputing; Tools and Libraries

Day: TBD
Time: TBD
Location: TBD

S7272 - Urban Scale Crowd Data Analysis, Simulation, and Visualization

Isaac Rudomin Senior Researcher, Barcelona Supercomputing Center
Isaac Rudomin is a senior researcher at the Barcelona Supercomputing Center, which he joined in 2012. His focus is on crowd rendering and simulation, including generating, simulating, animating, and rendering large and varied crowds, using GPUs in consumer-level machines and in heterogeneous HPC clusters. Previously, Isaac was on the faculty at Tecnologico de Monterrey, Campus Estado de Mexico. He finished his Ph.D. at the University of Pennsylvania under Norman Badler on the topic of cloth modeling. Dr. Dmitri Terzopoulos was a member of the committee.

We'll dive deep into how we use heterogeneous clusters with GPUs for accelerating urban-scale crowd data analysis, simulation, and visualization. Our main contributions are the development of new behavior models that conform to real data; the ability to scale the system by adding computing resources as needed, without programming modifications; and the combination of analysis, simulation, and visualization techniques that help us achieve large-scale crowd simulations with realistic behavior.

Level: All
Type: Talk
Tags: In-Situ and Scientific Visualization; HPC and Supercomputing; Computational Physics; Deep Learning and AI

Day: TBD
Time: TBD
Location: TBD

S7277 - Computer Virtual Experiment on Fluidized Beds Using GPU Accelerated CFD-DEM Method

Ji Xu Associate Professor, Institute of Process Engineering, Chinese Academy of Sciences
Ji Xu is an associate professor at the Institute of Process Engineering, Chinese Academy of Sciences.

Learn how to use GPUs to accelerate CFD-DEM, the computational fluid dynamics - discrete element method, to achieve computer virtual experiments on fluidized beds in the chemical engineering field. We'll discuss how to organize the gas- and solid-phase equations so they are solved concurrently by CPUs and GPUs in a heterogeneous supercomputing system. With systematic optimization of the model, numerical method, software, and hardware, we can simulate lab- to pilot-scale fluidized beds at quasi-realtime speed, and we'll conduct demos of such systems. Our method enables real applications that need very long simulations.

Level: All
Type: Talk
Tags: Computer Aided Engineering; Computational Fluid Dynamics; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7281 - Device Lending: Dynamic Sharing of GPUs in a PCIe Cluster

Jonas Markussen PhD student, Simula Research Laboratory
Jonas Markussen is a Ph.D. student at Simula Research Laboratory and the University of Oslo, Norway. Jonas's research interests are distributed processing, computer networks, and high-speed interconnects. At Simula, he is involved in the Unified PCIe IO project, a project in collaboration with Dolphin Interconnect Solutions.

Learn how GPUs can be time-shared between multiple hosts connected in a PCIe cluster using a method called device lending. Unlike approaches for sharing GPUs that typically require specific programming models, device lending makes a GPU appear to the operating system as if it is locally installed. This allows the GPU to be controlled and used by a remote host without any modifications to existing software. We'll present how device lending is implemented using standard PCIe and non-transparent bridging. As a proof-of-concept, we accelerate EIR, a computer-aided medical diagnosis system that uses machine learning and computer vision to do polyp detection, from being an offline tool to giving real-time feedback by dynamically borrowing remote GPU resources.

Level: Intermediate
Type: Talk
Tags: Data Center and Cloud Computing; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7285 - Unified Memory on the Latest GPU Architectures

Nikolay Sakharnykh Senior Developer Technology Engineer, NVIDIA
Nikolay Sakharnykh is a senior developer technology engineer at NVIDIA, where he works on accelerating HPC and graph analytics applications on GPUs.

Learn about the new features of the Unified Memory programming model for heterogeneous architectures. We'll take a deep dive into the architecture and software changes related to Unified Memory, what they mean for developers, and how they enable easier data management and new capabilities for your applications. We'll cover in detail Unified Memory features such as on-demand paging, memory oversubscription, memory coherence, and system-wide atomics. Use cases in HPC, deep learning, and graph analytics will be provided along with initial performance results. We'll also discuss common pitfalls and optimization guidelines so you can take full advantage of Unified Memory to increase your productivity.
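A minimal sketch of the features named above, assuming CUDA 8 and a Pascal-class GPU (illustrative sizes; raise N beyond device memory to exercise oversubscription):

#include <cuda_runtime.h>

__global__ void inc(float *x, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const size_t N = 1 << 26;
    float *x;
    cudaMallocManaged(&x, N * sizeof(float));   // one pointer valid on CPU and GPU

    for (size_t i = 0; i < N; ++i) x[i] = 0.0f; // pages first touched on the CPU

    // Optional: prefetch to the GPU to avoid demand page faults in the kernel.
    cudaMemPrefetchAsync(x, N * sizeof(float), 0 /* device 0 */, 0);
    inc<<<(int)((N + 255) / 256), 256>>>(x, N);
    cudaDeviceSynchronize();

    float first = x[0];   // pages migrate back on demand when the CPU touches them
    (void)first;
    cudaFree(x);
    return 0;
}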

Level: All
Type: Talk
Tags: Programming Languages; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7296 - CloudLighting: Merging GPU-based HPC with Cloud Services

Anne C Elster Professor of High Performance Computing, Norwegian University of Science & Technology / University of Texas at Austin
Anne Elster is a professor at the Norwegian University of Science and Technology, a GPU Research Center, where she runs the Heterogeneous and Parallel Computing Lab (HPC-Lab). She is also a visiting scientist at the University of Texas at Austin, a CUDA Teaching Center.

Learn how you can integrate GPU-enabled HPC and cloud computing by building on recent container technologies. We'll highlight the efforts we are making as part of the EU Horizon 2020 project CloudLighting, where we look at how to integrate heterogeneous computing with cloud technologies.

Level: Intermediate
Type: Talk
Tags: Data Center and Cloud Computing; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7303 - Finding Parallelism in General-Purpose Linear Programming

Daniel Thuerck Ph.D. Student, Technical University Darmstadt, Graphics, Capture and Massively Parallel Computing
Daniel Thuerck is a first-year Ph.D. student at GCC, TU Darmstadt. He earned his B.S. in computer science and M.S. in computational engineering, with a strong focus on optimization, both at TU Darmstadt. Daniel's research focuses mainly on parallel optimization algorithms, especially with applications in visual computing. He has interned with NVIDIA Research twice in the past year and currently works on parallel linear programming.
Maxim Naumov Senior Research Scientist, NVIDIA
Highly-Rated Speaker
Maxim Naumov is a senior research scientist at NVIDIA. His interests include parallel algorithms, numerical linear algebra, optimization, and graphs. Maxim contributes to data analytics nvGRAPH library and has led the development of the AmgX library, which provides distributed Algebraic Multigrid, Krylov and Relaxation-based schemes. He has also worked on the cuBLAS, cuSPARSE, and cuSOLVER(RF) libraries that are part of the CUDA toolkit. In the past, Maxim held different positions at NVIDIA, including on the CUDA Platform and Emerging Applications teams, and at Intel in the Microprocessor Technology Lab and Computational Software Lab. Maxim received his Ph.D. in computer science, with a specialization in computational science and engineering, in 2009 and his B.S. in computer science and mathematics in 2003, all from Purdue University - West Lafayette.

Get to know two different techniques for retrieving the parallelism hidden in general-purpose linear programs (LPs), which are broadly used in operations research, computer vision, and machine learning. With conventional solvers often restricted to serial computation, we'll show two ways of retrieving inherent parallelism, using: (1) parallel sparse linear algebra techniques with an interior-point method, and (2) a higher-level automatic LP decomposition. After a quick introduction to the topic, we'll present details and results for a diverse range of applications on the GPU.

Level: Intermediate
Type: Talk
Tags: Algorithms; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7305 - Super GPU: Make Programming of Multi-GPU Systems Easy

Michael Frumkin Sr. Compute Architect, NVIDIA
Michael Alex Frumkin is a senior compute architect at NVIDIA, working on performance optimization and analysis of large-scale applications. Previously, Michael worked at Google, Intel, and NASA on traffic management, performance optimization of multicore systems, and benchmarking of large-scale systems. He holds an M.S. in mathematics from Moscow State University and a Ph.D. in computer sciences from the Graduate School of the Soviet Academy of Sciences. He is an author of more than 70 scientific papers and holds two patents.

Learn a natural way to program multi-GPU systems. The super-GPU programming concept is a natural extension of the NVIDIA® CUDA® tiling hierarchy into multi-GPU systems. It allows you to write super-kernels that run on super-GPUs. Tiling simplifies the challenging problem of programming a multi-GPU system, which requires coordination of multiple kernels running on nodes connected via a heterogeneous network. We'll illustrate the super-GPU programming model on several applications and benchmarks, including SpMV, Integer Sort, Transpose, FFT, GEMM, and RTM. Use of a super GPU provides super-linear speedup of SpMV due to better utilization of the combined L2 caches of several GPUs. For Sort, FFT, and GEMM, the speedup is close to linear. Multi-GPU Transpose attains the limit imposed by the interconnecting network.

Level: Intermediate
Type: Talk
Tags: Tools and Libraries; HPC and Supercomputing; Deep Learning and AI

Day: TBD
Time: TBD
Location: TBD

S7319 - New Approaches to the Direct Solution of Large-Scale Banded Linear Systems

Samuel Rodriguez Bernabeu Research Engineer, Barcelona Supercomputing Center (BSC)
Samuel Rodriguez Bernabeu is an aerospace engineer with strong knowledge in applied math, parallel programming models, and computer architecture. Samuel works at the Spanish National Supercomputing Institute (BSC) on the design and development of new algorithms for solving large-scale systems of linear equations on supercomputers using direct methods. Before that, he was responsible for the optimization of critical on-board software components for different aerospace projects. His research interests lie in the fields of numerical linear algebra, accuracy and stability of numerical methods, and parallel computing.

We approach the problem of solving large-scale, extremely ill-conditioned banded linear systems using direct methods. Unlike traditional approaches, we focus on limiting the memory footprint of the algorithms rather than the FLOP count. To reduce the memory demand, BLAS-3 pre- and post-processing of the linear system is required. While this considerably increases the number of calculations required to solve the system, most of this work can be done very efficiently on the GPU. In this way, using GPUs allows us to solve much larger problems than state-of-the-art banded direct solvers on modern architectures. We'll present results for problems arising from realistic oil and gas scenarios, and we'll show that these techniques allow us to solve systems of tens of millions of equations using significantly less memory than currently available direct banded solvers.

Level: Intermediate
Type: Talk
Tags: HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7322 - Persistent Kernel: Real-Time, Low-Latency and High-Performance Computation on Pascal

Julien Bernard Software Engineer, Laboratoire d'études spatiales et d'instrumentation en astrophysique
Julien Bernard has been a research associate at LESIA - Observatoire de Paris since 2015. Julien is responsible for the design and development of accelerated solutions in real-time environments within the Green Flash European Project (Horizon 2020). He graduated with honors in embedded software engineering from École d'Ingénieur Denis Diderot (EIDD - Engineering school) at Paris University.

Learn how to design real-time, low-latency, high-throughput systems based on GPUs, using GPUDirect for efficient data transfer. We'll demonstrate how a persistent kernel provides the ability to handle a continuous data stream without any intermediary, bypassing CPU execution and reducing latency and jitter. We'll also see how this strategy is used in our NVIDIA DGX-1-based demonstrator in the context of Green Flash, a European project that aims to build a prototype for the next-generation real-time controller targeting the European Extremely Large Telescope's adaptive optics instrumentation.
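A skeletal, heavily hedged illustration of the persistent-kernel pattern (single block for simplicity; requires concurrent managed memory access, i.e., a Pascal-class GPU; in the real system the mailbox would be fed by GPUDirect rather than the host):

#include <cuda_runtime.h>

__global__ void persistent(volatile int *flag, volatile float *inbox,
                           float *out, int n) {
    __shared__ int state;
    for (;;) {
        if (threadIdx.x == 0) state = *flag;   // one reader per block polls the mailbox
        __syncthreads();
        if (state < 0) return;                 // negative flag = shutdown
        if (state == 1) {
            int i = threadIdx.x;
            if (i < n) out[i] = inbox[i] * 2.0f;   // placeholder processing
            __syncthreads();
            if (threadIdx.x == 0) *flag = 0;   // signal "frame consumed"
        }
        __syncthreads();
    }
}

int main() {
    const int n = 256;
    int *flag; float *inbox; float *out;
    cudaMallocManaged(&flag, sizeof(int));
    cudaMallocManaged(&inbox, n * sizeof(float));
    cudaMalloc((void **)&out, n * sizeof(float));
    *flag = 0;
    persistent<<<1, n>>>(flag, inbox, out, n);  // kernel stays resident, no relaunches
    *flag = 1;    // hand one frame to the GPU
    // ... producer keeps writing inbox and toggling flag ...
    *flag = -1;   // shut down
    cudaDeviceSynchronize();
    cudaFree(out); cudaFree(inbox); cudaFree(flag);
    return 0;
}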

Level: Intermediate
Type: Talk
Tags: Astronomy and Astrophysics; Performance Optimization; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7324 - Bringing NVIDIA GPUs to the PGAS/OpenSHMEM World: Challenges and Solutions

Dhabaleswar K. (DK) Panda Professor and University Distinguished Scholar, The Ohio State University
Highly-Rated Speaker
Dhabaleswar K. (DK) Panda is a professor and University Distinguished Scholar of Computer Science and Engineering at Ohio State University. D.K. has published over 400 papers in major journals and international conferences. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP, and RoCE) open-source software package, developed by his research group, is used by more than 2,675 organizations in 83 countries. This software has enabled several InfiniBand clusters to get into the latest TOP500 ranking during the last decade, including the current No. 1. More than 402,000 downloads of this software have taken place from the project's website alone. He is an IEEE fellow and a member of ACM.

Learn about techniques and solutions that bring GPU computing to the world of partitioned global address space (PGAS) models, especially with the emerging OpenSHMEM paradigm. PGAS models are gaining attention for providing shared memory abstractions that make it easy to develop parallel applications with dynamic and irregular communication patterns. However, the existing OpenSHMEM standards do not support direct communication on GPU memory. We'll discuss simple extensions to the OpenSHMEM model to address this issue. We'll also present challenges and solutions in designing NVIDIA® CUDA®-aware runtimes to support these extensions and optimize data movement using CUDA IPC and GPUDirect RDMA features. And we'll demonstrate the impact of these concepts to application performance.

Level: Intermediate
Type: Talk
Tags: HPC and Supercomputing; Programming Languages; Tools and Libraries

Day: TBD
Time: TBD
Location: TBD

S7331 - Massively Parallel Landscape-Evolution Modelling using General Purpose Graphical Processing Units

Stephen McGough Senior Lecturer (Associate Professor), Durham University
Stephen McGough is a senior lecturer (equivalent to an associate professor in the U.S.) in computing sciences at Durham University, UK. Stephen obtained his Ph.D. in the area of parallel simulation and has worked for many years in the areas of parallel computing and simulation. This has led to over 50 publications in the area of parallel computing, including the NVIDIA best paper award at HiPC 2012. His research focuses on the use of novel computing technologies to solve real-world challenges, and he is a key player in Durham's NVIDIA CUDA Research Centre.

Landscape Evolution Modeling (LEM) is used to understand how landscapes evolve over millions of years. It is based on a regular grid of cells, each representing a height in the landscape. For each simulated year, we compute how water flows between cells, the total amount of water flowing through each cell, and the amount of erosion/deposition. Traditionally, due to computational complexity, such simulations have only been performed on trivially small landscapes of 5,000 cells, taking 5 hours to compute 1,000 years. However, researchers wish to perform simulations on massive landscapes (50+ million cells) over millions of years. We demonstrate here PARALEM, a GPGPU-enabled LEM capable of a two to three orders of magnitude speedup in comparison to the best-in-class LEM software.
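As a toy illustration of the per-cell flow-routing step such models perform each simulated year (a D8-style steepest-descent sketch, not the PARALEM implementation), each grid cell is assigned the neighbor it drains to:

#include <cuda_runtime.h>

__global__ void flow_directions(const float *height, int *dir, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= nx - 1 || y >= ny - 1) return;  // skip border cells

    float h = height[y * nx + x];
    float steepest = 0.0f;
    int best = -1;   // -1 = no downhill neighbor (a pit)
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            float drop = h - height[(y + dy) * nx + (x + dx)];
            if (drop > steepest) { steepest = drop; best = (dy + 1) * 3 + (dx + 1); }
        }
    dir[y * nx + x] = best;   // each cell drains to its steepest downhill neighbor
}

int main() {
    const int nx = 1024, ny = 1024;   // a real run would use 50+ million cells
    float *height; int *dir;
    cudaMalloc((void **)&height, nx * ny * sizeof(float));  // heights assumed loaded elsewhere
    cudaMalloc((void **)&dir, nx * ny * sizeof(int));
    dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
    flow_directions<<<grid, block>>>(height, dir, nx, ny);
    cudaDeviceSynchronize();
    cudaFree(height); cudaFree(dir);
    return 0;
}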

Level: All
Type: Talk
Tags: Computational Physics; HPC and Supercomputing; Earth Systems Modeling

Day: TBD
Time: TBD
Location: TBD

S7332 - Accelerated Astrophysics: Using NVIDIA® DGX-1™ to Simulate and Understand the Universe

Brant Robertson Associate Professor of Astronomy and Astrophysics, University of California, Santa Cruz
Brant Robertson is an Associate Professor in the Department of Astronomy and Astrophysics at the University of California, Santa Cruz. His research interests include theoretical topics related to galaxy formation, dark matter, hydrodynamics, and numerical simulation methodologies. Brant was previously an assistant professor at the University of Arizona from 2011-2015, held a Hubble Fellowship in the Astronomy Department at the California Institute of Technology from 2009-2011, and a Spitzer and Institute Fellowship at the Kavli Institute for Cosmological Physics and Enrico Fermi Institute at the University of Chicago from 2006-2009. Brant earned his Ph.D. in astronomy from Harvard University in 2006, and received his B.S. in physics and astronomy at the University of Washington, Seattle in 2001. He can be found on Twitter at @brant_robertson.

Get an overview of how GPUs are used by computational astrophysicists to perform numerical simulations and process massive survey data. Astrophysics represents one of the most computationally heavy sciences, where supercomputers are used to analyze enormous amounts of data or to simulate physical processes that cannot be reproduced in the lab. Astrophysicists strive to stay on the cutting edge of computational methods to simulate the universe or process data faster and with more fidelity. We'll discuss two important applications of GPU supercomputing in astrophysics. We'll describe the astrophysical fluid dynamics code CHOLLA that runs on the GPU-enabled supercomputer Titan at Oak Ridge National Lab and can perform some of the largest astrophysical simulations ever attempted. Then we'll describe the MORPHEUS deep learning framework that classifies galaxy morphologies using the NVIDIA DGX-1 deep learning system.

Level: All
Type: Talk
Tags: Astronomy and Astrophysics; HPC and Supercomputing
Industry Segments: Higher Education / Research; Government / National Labs

Day: TBD
Time: TBD
Location: TBD

S7340 - Hydra: A Framework for Data Analysis in Massively Parallel Platforms

Antonio Augusto Alves Junior Post-doc, University of Cincinnati
Antonio Augusto started his activities in research as an undergraduate student in physics, studying the modeling of dissipative quantum systems. As a masters student, Antonio undertook a thesis in theoretical physics devoted to the studies of the electromagnetic field confined in cavities with moving boundaries. His collaboration with the LHCb experiment at CERN began in 2005 when he obtained a Ph.D. position at the Centro Brasileiro de Pesquisas Fisicas in Rio de Janeiro, joining the local group involved in the construction of the detector. In 2009, Antonio started a two-year INFN fellowship for foreign physicists at the "Sezione di Roma." His skills in physics, together with his ability to deal with software and to develop specific analysis tools and methodologies, were decisive in forming a close and effective working group. In 2015, he joined the LHCb group at the University of Cincinnati to work on the development of software for data analysis.

We'll discuss Hydra, a templatized, header-only, C++11-compliant library for data analysis on massively parallel platforms, targeting, but not limited to, the field of high-energy physics research. Hydra supports the description of particle decays via phase-space Monte Carlo generation, generic function evaluation, data fitting, multidimensional adaptive numerical integration, and histogramming.

Level: Intermediate
Type: Talk
Tags: Computational Physics; Accelerated Analytics; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7344 - Kokkos – The C++ Performance Portability Programming Model

Christian Trott Senior Member of Technical Staff, Sandia National Laboratories
Christian Trott is a high performance computing expert with experience in designing and implementing software for GPU and MIC compute clusters. Christian's prior scientific work focused on computational materials research using ab initio calculations, molecular dynamics simulations, and Monte Carlo methods. As of 2015, Christian is a senior member of technical staff at the Sandia National Laboratories. He is a core developer of the Kokkos programming model with a large role in advising applications on adopting Kokkos to achieve performance portability for next-generation supercomputers. He earned a Ph.D. from the University of Technology Ilmenau in theoretical physics.
H. Carter Edwards Principal Member of Technical Staff, Sandia National Laboratories
Highly-Rated Speaker
H. Carter Edwards is the principal investigator and architect for the Kokkos project at Sandia National Laboratories. Carter has over three decades of experience in modeling and simulation software development and over two decades of experience in HPC, parallel processing, and C++ software development. For the last several years, his HPC focus has been on algorithms and programming models for thread-scalable and performance portable parallelism across next-generation platform node architectures. Carter has a B.S. and M.S. in aerospace engineering and a Ph.D. in computational mathematics. He represents Sandia on the ISO C++ language standard committee.

Kokkos is a programming model developed at Sandia National Laboratories for enabling application developers to achieve performance portability for C++ codes. It is now the primary programming model at Sandia to port production-level applications to modern architectures, including GPUs. We'll discuss the core abstractions of Kokkos for parallel execution as well as data management, and how they are used to provide a critically important set of capabilities for the efficient implementation of a wide range of HPC algorithms. We'll present performance evaluations on a range of platforms to demonstrate the state of the art of performance portability. This will include data from Intel KNL-based systems as well as IBM Power8 with NVIDIA® NVLink™-connected NVIDIA Tesla® P100 GPUs. We'll also provide an overview of how Kokkos fits into the larger exascale project at the Department of Energy, and how it is used to advance the development of parallel programming support in the C++ language standard.
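For flavor, here is a minimal sketch of the two core abstractions described above: a Kokkos::View for data management and parallel_for/parallel_reduce for execution, which Kokkos maps to CUDA, OpenMP, or other backends at compile time. The specific example (a vector scale and sum) is ours, not one from the talk.

```
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1 << 20;
    // View: Kokkos' portable data abstraction; layout is chosen per backend.
    Kokkos::View<double*> x("x", N);

    // parallel_for: portable execution; compiles to a CUDA kernel on GPUs.
    Kokkos::parallel_for("scale", N, KOKKOS_LAMBDA(const int i) {
      x(i) = 2.0 * i;
    });

    // parallel_reduce: portable reduction over the same index range.
    double sum = 0.0;
    Kokkos::parallel_reduce("sum", N, KOKKOS_LAMBDA(const int i, double& s) {
      s += x(i);
    }, sum);
  }
  Kokkos::finalize();
  return 0;
}
```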

Level: All
Type: Talk
Tags: HPC and Supercomputing; Programming Languages

Day: TBD
Time: TBD
Location: TBD

S7345 - High-Performance Broadcast Designs for Streaming Applications on Multi-GPU InfiniBand Clusters

Dhabaleswar K. (DK) Panda Professor and University Distinguished Scholar, The Ohio State University
Highly-Rated Speaker
Dhabaleswar K. (DK) Panda is a professor and University Distinguished Scholar of Computer Science and Engineering at Ohio State University. D.K. has published over 400 papers in major journals and international conferences. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP, and RoCE) open-source software package, developed by his research group, is used by more than 2,675 organizations in 83 countries. This software has enabled several InfiniBand clusters to get into the latest TOP500 ranking during the last decade, including the current No. 1. More than 402,000 downloads of this software have taken place from the project's website alone. He is an IEEE fellow and a member of ACM.

Learn recent developments in middleware design to boost performance of GPU-based streaming applications. Several runtimes already support and optimize GPU communication using various NVIDIA® CUDA® features. Similarly, some runtimes use InfiniBand hardware multicast to boost broadcast performance for host-based communications. We'll focus on the challenges in combining and fully utilizing GPUDirect RDMA (GDR) and hardware InfiniBand multicast technologies in tandem to design support for high-performance heterogeneous broadcast operation for streaming applications. Further, we present associated challenges and designs in supporting reliability for clusters with multi-HCA and multi-GPU configurations. Performance evaluation of the proposed designs on various system configurations will be presented and analyzed.

Level: Intermediate
Type: Talk
Tags: HPC and Supercomputing; Tools and Libraries; Data Center and Cloud Computing

Day: TBD
Time: TBD
Location: TBD

S7350 - Deep Learning as a Service: Experiences in Building GPU-Enabled HPC Clusters

Brian Belgodere Research Software Engineer, IBM Research
Brian Belgodere is a software engineer at IBM Research, working on the Cognitive Computing Cluster and building tightly coupled development tools for composing cognitive solutions. Brian has worked in distributed systems, security and compliance, service management, and systems automation. Previously, he worked for IBM's Global Business Services Division as part of the Spatial Analytics and Smarter Water practices. Brian holds a B.S. in finance and in economics from Carnegie Mellon University and a J.D. from the University of Pittsburgh.

Conducting deep learning research and development requires a combination of cutting-edge hardware, elastic software frameworks, and a collaborative research community. We'll provide the scaffolding for participants to construct an enterprise-scale, GPU-enabled high performance computing solution for machine learning and data science by drawing on the experiences gained while IBM Research built its Cognitive Computing Cluster. We'll start by discussing how to build a secure, shared-resource computing cluster optimized for deep learning. Next, we'll cover how to provide deep learning frameworks supporting speech, vision, language, and text processing and their underlying primitives. Finally, we'll discuss how to build a best practice knowledge base to improve research quality and accelerate discovery.

Level: Intermediate
Type: Talk
Tags: Data Center and Cloud Computing; HPC and Supercomputing; Deep Learning and AI

Day: TBD
Time: TBD
Location: TBD

S7354 - NUFFT Volume Reconstruction for Synchrotron MicroTomographic Data Using GPUs

Dinesh Kumar Computational Post Doctorate Fellow, Lawrence Berkeley National Laboratory
Dinesh Kumar is a computational postdoctoral fellow at Lawrence Berkeley National Laboratory. Dinesh is involved in developing HPC tools for analyzing synchrotron X-ray data, such as tomography and scattering. His previous work includes non-rigid image registration for GYN cancer patients at Virginia Commonwealth University and simulation of geophysical mass flows over natural terrains during his Ph.D. work at the University at Buffalo.

We'll discuss a GPU implementation of non-uniform fast Fourier transform-based volume reconstruction of synchrotron tomographic data. A Python interface manages the workflow through either a GUI or a CLI.

Level: All
Type: Talk
Tags: In-Situ and Scientific Visualization; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7356 - MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

Dhabaleswar K. (DK) Panda Professor and University Distinguished Scholar, The Ohio State University
Highly-Rated Speaker
Dhabaleswar K. (DK) Panda is a professor and University Distinguished Scholar of Computer Science and Engineering at Ohio State University. D.K. has published over 400 papers in major journals and international conferences. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP, and RoCE) open-source software package, developed by his research group, is used by more than 2,675 organizations in 83 countries. This software has enabled several InfiniBand clusters to get into the latest TOP500 ranking during the last decade, including the current No. 1. More than 402,000 downloads of this software have taken place from the project's website alone. He is an IEEE fellow and a member of ACM.
Khaled Hamidouche Research Scientist, The Ohio State University
Khaled Hamidouche is a research scientist in the Department of Computer Science and Engineering at Ohio State University. Khaled is a member of the Network-Based Computing Laboratory, led by Dr. D.K. Panda. His research interests include high-performance interconnects, parallel programming models, accelerator computing, and high-end computing applications. His current focus is on designing high-performance unified MPI, PGAS, and hybrid MPI+PGAS runtimes for InfiniBand clusters and their support for accelerators. Khaled is involved in the design and development of the popular MVAPICH2 library and its derivatives MVAPICH2-MIC, MVAPICH2-GDR, and MVAPICH2-X. He has published over 50 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences, and is a member of ACM.
Hari Subramoni Research Scientist, Ohio State University
Dr. Hari Subramoni is a research scientist in the Department of Computer Science and Engineering at the Ohio State University, USA. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data and cloud computing.

Learn about the latest developments in the MVAPICH2-GDR library that help MPI developers exploit maximum performance and scalability on HPC clusters with NVIDIA GPUs. Multiple designs, focusing on GPUDirect RDMA (GDR) async transfers, non-blocking collectives, and support for unified memory and datatype processing, will be highlighted to boost the performance of HPC applications. Furthermore, targeting emerging deep learning frameworks, we'll present novel designs and enhancements to the MVAPICH2-GDR library to accommodate the large-message and dense GPU computing requirements of DL frameworks. Using a co-designed scheme between MVAPICH2-GDR and the Caffe workflow, we'll present OSU-Caffe, an MPI-based distributed and scalable DL framework. Performance and scalability numbers of OSU-Caffe for various system configurations and datasets will also be presented.
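A defining convenience of CUDA-aware MPI libraries such as MVAPICH2-GDR is that device pointers can be passed directly to MPI calls, letting the library drive the transfer (via GPUDirect RDMA where available). A minimal hedged sketch, with error handling omitted:

```
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;
    float* d_buf;
    cudaMalloc(&d_buf, N * sizeof(float));

    // With a CUDA-aware MPI such as MVAPICH2-GDR, the device pointer
    // goes straight into MPI; the library handles the GPU-side transfer.
    if (rank == 0)
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```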

Level: Intermediate
Type: Talk
Tags: HPC and Supercomputing; Deep Learning and AI; Tools and Libraries

Day: TBD
Time: TBD
Location: TBD

S7363 - Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Cheng-Han Du Postdoctoral Researcher, National Taiwan University
Cheng-han Du is a postdoctoral researcher in the Institute of Applied Mathematical Sciences, National Taiwan University. His research interests are high performance computing and hardware acceleration for deep learning and photonic simulation. Cheng-han received his B.S. from the Department of Computer Science and Information Engineering, National Taiwan University, in 2007, and his M.S. and Ph.D. from the Graduate Institute of Photonics and Optoelectronics at the same university in 2009 and 2014, respectively.

We propose several techniques for efficient multi-GPU acceleration of a direct linear system solver designed for finite-difference frequency-domain analysis of photonic structures. The algorithm is based on the compressed hierarchical Schur method (CHiS), where redundant computation is avoided using knowledge of duplicated physical structures and the numerical elimination process. Since high-intensity matrix computations are the major workload in the CHiS algorithm, they can be divided into multiple panels and processed by multiple GPUs. Our implementation uses multithreading to control multiple GPUs. Performance analysis shows that this workload division yields significantly better scale-up with four GPUs compared with naive GPU acceleration.
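The one-thread-per-GPU control structure mentioned above typically binds each host thread to a device before launching its share of work. A minimal sketch of that pattern, assuming a placeholder panel-processing body (the names and division of work are illustrative, not the CHiS implementation):

```
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Placeholder for processing one device's share of matrix panels.
void processPanels(int device) {
    cudaSetDevice(device);            // bind this host thread to one GPU
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // ... enqueue this device's panel updates on `stream` ...
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);

    // One host thread per GPU; each drives its device independently.
    std::vector<std::thread> workers;
    for (int d = 0; d < numDevices; ++d)
        workers.emplace_back(processPanels, d);
    for (auto& t : workers) t.join();
    return 0;
}
```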

Level: Advanced
Type: Talk
Tags: Computational Physics; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7366 - Building a GPU-enabled OpenStack Cloud for HPC

Blair Bethwaite Senior HPC Consultant, Monash University
Blair Bethwaite has worked in distributed computing at Monash University, in Melbourne, Australia, for 10 years, and with OpenStack for the last four. Having served as team lead, architect, administrator, user, researcher, and occasional hacker, Blair's unique perspective as a science power-user, developer, and system architect has helped guide the evolution of the research computing engine central to Monash's 21st Century Microscope.

M3 is the latest generation system of the MASSIVE project, an HPC facility specializing in characterization science (imaging and visualization). Using OpenStack as the compute provisioning layer, M3 is a hybrid HPC/cloud system, custom-integrated by Monash's R@CMon Research Cloud team. Built to support Monash University's high-throughput instrument processing requirements, M3 is split roughly half-and-half between GPU-accelerated and CPU-only nodes. We'll discuss the design and technology used to build this innovative platform, and detail the approaches and challenges involved in building GPU-enabled and HPC clouds.

Level: All
Type: Talk
Tags: Data Center and Cloud Computing; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7382 - GPUs Unleashed: Analysis of Petascale Molecular Simulations with VMD

John Stone Senior Research Programmer, University of Illinois Urbana-Champaign
Highly-Rated Speaker
John Stone is a senior research programmer in the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology, and associate director of the NVIDIA CUDA Center of Excellence at the University of Illinois. John is the lead developer of VMD, a high-performance molecular visualization tool used by researchers all over the world. His research interests include molecular visualization, GPU computing, parallel processing, ray tracing, haptics, and virtual environments. John was awarded as an NVIDIA CUDA Fellow in 2010. In 2015, he joined the Khronos Group Advisory Panel for the Vulkan Graphics API. He also provides consulting services for projects involving computer graphics, GPU computing, and high performance computing.

We'll showcase recent successes in the use of GPUs to accelerate challenging molecular simulation analysis tasks on the latest NVIDIA® Tesla® P100 GPUs on both Intel and IBM/OpenPOWER hardware platforms, and large-scale runs on petascale computers such as Titan and Blue Waters. We'll highlight the performance benefits obtained from die-stacked memory on the Tesla P100, the NVIDIA NVLink™ interconnect on the IBM "Minsky" platform, and the use of NVIDIA CUDA® just-in-time compilation to increase the performance of data-driven algorithms. We'll present results obtained with OpenACC parallel programming directives, current challenges, and future opportunities. Finally, we'll describe GPU-accelerated machine learning algorithms for tasks such as clustering of structures resulting from molecular dynamics simulations.

Level: Intermediate
Type: Talk
Tags: HPC and Supercomputing; Accelerated Analytics; Computational Chemistry

Day: TBD
Time: TBD
Location: TBD

S7388 - Developing an Improved Generalized Eigensolver with Limited CPU Offloading

Joshua Romero Graduate Student, Stanford University
Joshua Romero is a graduate student at Stanford University. His graduate research includes the numerical analysis of high-order computational methods for computational fluid dynamics simulations and the development of software to perform simulations using these techniques on modern computing hardware, with emphasis on GPUs. His recent work has focused on the development of general scientific computing applications for GPU hardware.

We'll explore strategies to reduce CPU dependencies within existing hybrid CPU/GPU LAPACK routines, such as those implemented in the open-source MAGMA library. This will be carried out within the context of developing an improved generalized eigensolver, written in CUDA Fortran for the open-source Quantum ESPRESSO library. The solver aims to replace offloaded subblock CPU computations within the existing hybrid algorithms with GPU-resident subblock computations to limit dependencies on available CPU resources. We'll cover performance considerations and strategies used in developing the solver, including the use of profiling tools available within the CUDA toolkit. Additionally, we'll provide an example of developing software using CUDA Fortran.

Level: Intermediate
Type: Talk
Tags: Algorithms; Computational Physics; Performance Optimization; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7391 - Turbocharging VMD Molecular Visualizations with State-of-the-Art Rendering and VR Technologies

John Stone Senior Research Programmer, University of Illinois Urbana-Champaign
Highly-Rated Speaker
John Stone is a senior research programmer in the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology, and associate director of the NVIDIA CUDA Center of Excellence at the University of Illinois. John is the lead developer of VMD, a high-performance molecular visualization tool used by researchers all over the world. His research interests include molecular visualization, GPU computing, parallel processing, ray tracing, haptics, and virtual environments. John was awarded as an NVIDIA CUDA Fellow in 2010. In 2015, he joined the Khronos Group advisory panel for the Vulkan graphics API. He also provides consulting services for projects involving computer graphics, GPU computing, and high performance computing.

State-of-the-art molecular simulations pose many challenges for effective visualization and analysis due to their size, timescale, and the growing complexity of the structures under study. Fortunately, a panoply of new and emerging technologies can address these challenges. We'll describe our experiences and progress adapting VMD, a widely used molecular visualization and analysis tool, to exploit new rasterization APIs such as EGL and Vulkan, and the NVIDIA OptiX™ ray tracing API for interactive, in-situ, and post-hoc molecular visualization on workstations, clouds, and supercomputers, highlighting the latest results on IBM POWER hardware. Commodity VR headsets offer a tremendous opportunity to make immersive molecular visualization broadly available to molecular scientists, but they present many performance challenges for both rasterization- and ray tracing-based visualization. We'll present results from our ongoing work adapting VMD to support popular VR HMDs.

Level: Intermediate
Type: Talk
Tags: In-Situ and Scientific Visualization; Virtual Reality and Augmented Reality; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7393 - Maximizing GPU Throughput Across Multiple Streams - Tips and Tricks

Chuck Seberino Principal Software Engineer, GPU Sequencing, Roche Sequencing Solutions
Chuck Seberino is a principal software engineer at Roche Sequencing Solutions, where he tackles big data problems processing genomic data in real time. Previously, he was at Complete Genomics doing similar work on its high-throughput sequencing. Before his time in life sciences, Chuck worked for government, defense, and robotics companies, including Raytheon Missile Systems and Silicon Graphics. He has developed software for graphics, visual simulation, and GPGPU applications for over 20 years. Chuck has refocused his GPU and HPC expertise into the life sciences space, where he is pursuing an M.S. in bioinformatics at Stanford. He holds a degree in electrical engineering from the University of Arizona.

Efficiently utilizing one or more GPUs involves finding the right balance in three areas of CUDA programming: data movement, hardware architecture, and multi-level parallelism. CUDA Streams can be a powerful way to increase processor throughput if you can manage them properly. We'll go through some use case examples, synchronization pitfalls, and profiler cases to help identify ways to speed up your application.
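The canonical pattern for stream-based throughput is to chunk the data and pipeline host-to-device copies, kernels, and device-to-host copies across streams so they overlap. A hedged sketch under that assumption (the kernel, chunk size, and stream count are illustrative):

```
#include <cuda_runtime.h>

__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;               // stand-in for real work
}

int main() {
    const int nStreams = 4, chunk = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, nStreams * chunk * sizeof(float)); // pinned: required for async copies
    cudaMalloc(&d, nStreams * chunk * sizeof(float));

    cudaStream_t s[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&s[i]);

    // Work in different streams may overlap: copy-in, compute, copy-out.
    for (int i = 0; i < nStreams; ++i) {
        size_t off = (size_t)i * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    for (int i = 0; i < nStreams; ++i) cudaStreamSynchronize(s[i]);

    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```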

Level: Intermediate
Type: Talk
Tags: Performance Optimization; Tools and Libraries; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7401 - Daino: A High-level Framework for Parallel and Efficient AMR on GPUs

Mohamed Wahib Postdoctoral Researcher, RIKEN Advanced Institute for Computational Science
Mohamed Wahib is a postdoctoral researcher in the HPC Programming Framework Research Team at RIKEN Advanced Institute for Computational Science. Mohamed joined RIKEN AICS in 2012 after he received a Ph.D. in computer science from Hokkaido University, Japan. Prior to his graduate studies, he worked as a researcher at Texas Instruments R&D for four years. Mohamed's research is focused on accelerators and data-centric programming models.

We'll present a high-level framework for producing parallel and efficient adaptive mesh refinement code on GPU-accelerated supercomputers. AMR methods reduce the computational requirements of problems by increasing resolution only for areas of interest. However, in practice, efficient AMR implementations are difficult, considering that the mesh hierarchy management must be optimized for the underlying hardware. The architectural complexity of GPUs can make efficient AMR particularly challenging on GPU-accelerated supercomputers. We'll present a compiler-based, high-level framework that can automatically transform serial uniform mesh code annotated by the user into parallel adaptive mesh code optimized for GPU-accelerated supercomputers. We show experimental results on three production applications. The speedups of code generated by our framework are comparable to hand-written AMR code while achieving good strong and weak scaling up to 3,640 GPUs.

Level: Intermediate
Type: Talk
Tags: HPC and Supercomputing; Programming Languages; Performance Optimization

Day: TBD
Time: TBD
Location: TBD

S7418 - Low-Communication FFT with Fast Multipole Method

Cris Cecka Senior Research Scientist, NVIDIA
Cris Cecka joined NVIDIA Research in 2015 to combine and deploy his interests in developing advanced numerical algorithms and software. Previously, Cris worked at the new Institute for Applied Computational Science at Harvard University as a lecturer and research scientist, where he developed courses on parallel computing and robust software development for scientific computing. He also worked in the Mathematics Department at the Massachusetts Institute of Technology as a research associate, where he focused on developing and applying integral equation methods and generalized N-body problems using hierarchical methods. He received his Ph.D. from Stanford University in computational and mathematical engineering in 2011.

We'll review a successful method for accelerating the 1D FFT by reducing the amount of communication required. The resulting method discards nearly two-thirds of the communication in exchange for the application of many hierarchically structured dense matrices, which can be applied efficiently via the fast multipole method (FMM). This FMM is formulated to be maximally computationally efficient on modern architectures and to require little auxiliary space and data. We'll review the formulation, stages of computation, free parameters and heuristics for choosing them, and efficient implementation strategies for an optimized FMM-FFT distributed across many GPUs. We'll present results obtained on up to eight Tesla P100 GPUs that show 1.2-2.2x speedup over the distributed 1D FFT provided by cuFFT-XT 8.0.

Level: Intermediate
Type: Talk
Tags: Algorithms; Performance Optimization; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7428 - Singularity: Containers for Scientific Reproducibility, Mobility and High Performance Computing

Gregory Kurtzer HPC Systems Architect and developer, Lawrence Berkeley National Laboratory
Gregory Kurtzer has created many open source initiatives related to HPC, including CentOS Linux, Warewulf, Perceus, and, most recently, Singularity. Gregory serves as a member of the OpenHPC Technical Steering Committee and is the IT HPC systems architect and software developer for Lawrence Berkeley National Laboratory.

Learn about Singularity, a container system designed to support the computational needs of scientists, including scientific reproducibility, extreme mobility of compute, and easy integration with high-performance computational resources. Singularity supports all standard computational workflows, is resource manager agnostic, and natively supports GPUs.

Level: All
Type: Talk
Tags: HPC and Supercomputing; Tools and Libraries

Day: TBD
Time: TBD
Location: TBD

S7435 - Adapting DL to New Data: An Evolutionary Algorithm for Optimizing Deep Networks

Steven Young Research Scientist in Deep Learning, Oak Ridge National Laboratory
Steven Young is a researcher at Oak Ridge National Laboratory working in the Computational Data Analytics Group. His research focuses on applying deep learning to challenging datasets using HPC to enable faster training and quicker discovery. He has a Ph.D. in computer engineering from the University of Tennessee, where he studied machine learning in the Machine Intelligence Lab.

There has been a surge of success in using deep learning in imaging and speech applications, thanks to its relatively automatic feature generation and, particularly for convolutional neural networks, high-accuracy classification abilities. While these models learn their parameters through data-driven methods, model selection (i.e., architecture construction) through hyper-parameter choices remains a tedious and highly intuition-driven task. To address this, multi-node evolutionary neural networks for deep learning (MENNDL) is proposed as a method for automating network selection on computational clusters through hyper-parameter optimization performed via genetic algorithms. MENNDL is capable of evolving not only numeric hyper-parameters (for example, the number of hidden nodes or the convolutional kernel size), but also the arrangement of layers within the network.
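In outline, a genetic search over hyper-parameters repeatedly mutates a population and keeps the fittest candidates, with training accuracy serving as the fitness. The toy sketch below illustrates only that loop; the genome, operators, fitness, and all names here are our own simplifications, and MENNDL's actual representation and distributed evaluation are far more sophisticated.

```
#include <algorithm>
#include <cstdlib>
#include <random>
#include <vector>

// Toy hyper-parameter genome; MENNDL also evolves layer arrangements.
struct Genome { int numLayers; int kernelSize; };

// Stand-in fitness: in practice, train the network and return its accuracy.
double evaluate(const Genome& g) {
    return -std::abs(g.numLayers - 6) - std::abs(g.kernelSize - 5);
}

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> layers(1, 12), kernels(1, 11);

    std::vector<Genome> pop(16);
    for (auto& g : pop) g = { layers(rng), kernels(rng) };

    for (int gen = 0; gen < 50; ++gen) {
        // Rank by fitness; keep the best half, refill by mutation.
        std::sort(pop.begin(), pop.end(), [](const Genome& a, const Genome& b) {
            return evaluate(a) > evaluate(b);
        });
        for (size_t i = pop.size() / 2; i < pop.size(); ++i) {
            Genome child = pop[i - pop.size() / 2];
            child.numLayers  = std::max(1, child.numLayers  + int(rng() % 3) - 1);
            child.kernelSize = std::max(1, child.kernelSize + int(rng() % 3) - 1);
            pop[i] = child;  // mutated offspring of a surviving parent
        }
    }
    return 0;
}
```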

Level: Intermediate
Type: Talk
Tags: Deep Learning and AI; HPC and Supercomputing
Industry Segments: Higher Education / Research; Government / National Labs

Day: TBD
Time: TBD
Location: TBD

S7444 - What the Profiler is Telling You: Optimizing GPU Kernels

Christoph Angerer Developer Technology Engineer, NVIDIA
Christoph Angerer is a developer in NVIDIA's European Developer Technology team. Based in Munich, Germany, he works with developers accelerating applications on GPUs. He holds a Ph.D. in computer science from ETH Zurich in Switzerland.
Jakob Progsch Developer Technology Engineer, NVIDIA
Jakob Progsch is a member of NVIDIA's European developer technology team working on scientific and machine learning applications. Jakob graduated with a master's in computational science and engineering from ETH Zurich in Switzerland.

In this session, we'll explore how to analyze and optimize the performance of kernels running on the GPU. Working with a real-world example, we'll walk through an analysis-driven process leading to a series of kernel-level optimizations, using NVIDIA's profiling tools as an example. Attendees will learn about the fundamental performance limiters (instruction throughput, memory throughput, and latency), and we'll present strategies to identify and tackle each type of limiter. This session is accompanied by Session S7445, which considers performance optimization at the application level.
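As a flavor of the memory-throughput limiter, coalescing is often the first thing a profiler points at: threads in a warp should touch adjacent addresses. The sketch below contrasts the two access patterns; it is a generic illustration we supply, not the session's example code.

```
#include <cuda_runtime.h>

// Strided access: threads in a warp touch addresses 'stride' apart,
// so each warp's loads split into many memory transactions.
__global__ void strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride % n];
}

// Coalesced access: consecutive threads read consecutive elements,
// so a warp's loads collapse into a few wide transactions.
__global__ void coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 22;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    strided<<<(n + 255) / 256, 256>>>(in, out, n, 32);
    coalesced<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();   // compare the two kernels under the profiler
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```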

Level: All
Type: Talk
Tags: Performance Optimization; Algorithms; Tools and Libraries; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7445 - What the Profiler is Telling You: Optimizing Whole Application Performance

Jakob Progsch Developer Technology Engineer, NVIDIA
Jakob Progsch is a member of the European developer technology team at NVIDIA, where he works on scientific and machine learning applications. Jakob graduated with a master's in computational science and engineering from ETH Zurich in Switzerland.
Mathias Wagner Sr. Developer Technology Engineer, NVIDIA
Mathias Wagner is a member of the European developer technology team at NVIDIA, where he works on high performance computing and scientific applications. Before joining NVIDIA, he worked as a postdoc in high-energy physics in Europe and the U.S. focusing on lattice quantum chromodynamics simulations using GPUs. Mathias holds a Ph.D. in theoretical physics from Darmstadt University of Technology.

In this session, we'll explore how to analyze and optimize the performance of GPU-accelerated applications. Working with a real-world example, attendees will learn how to analyze application performance by measuring data transfers, unified memory page migrations, and inter-GPU communication, and by performing critical path analysis. Using the example application, and NVIDIA's profiling tools as an example tool set, we'll walk through various optimizations and discuss their impact on the performance of the whole application. This session is accompanied by Session S7444, which considers performance optimization of GPU kernels.
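One application-level pattern a profiler frequently exposes is unified-memory page migration on first touch; prefetching data to the device ahead of the kernel hides that cost. A minimal sketch using the CUDA 8 prefetch API (the kernel and sizes are our illustrative assumptions):

```
#include <cuda_runtime.h>

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 24;
    float* x;
    cudaMallocManaged(&x, n * sizeof(float));   // unified memory
    for (int i = 0; i < n; ++i) x[i] = 1.0f;    // pages first touched on the host

    int device = 0;
    cudaGetDevice(&device);
    // Prefetch to the GPU so the kernel doesn't stall on page migrations.
    cudaMemPrefetchAsync(x, n * sizeof(float), device);
    scale<<<(n + 255) / 256, 256>>>(x, n);

    // Prefetch back before host access to avoid fault-driven migration.
    cudaMemPrefetchAsync(x, n * sizeof(float), cudaCpuDeviceId);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```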

Level: Intermediate
Type: Talk
Tags: Performance Optimization; Algorithms; Tools and Libraries; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7449 - Driving the Assembly of the Zebrafish Connectome through Deep Learning

Ishtar Nyawira Co-President, Timmy Global Health: Pitt Chapter, University of Pittsburgh
Ishtar Nyawira is a computer science major at the University of Pittsburgh (class of 2018). Upon entering her freshman year, she chose to study biology but quickly grew interested in computer science, despite having little background in the field. After changing her major in her third year, she became wholly dedicated to educating herself, inside and outside of the classroom, in the field of computer science. After she graduates with a B.S. in computer science and a minor in Korean, she will pursue a Ph.D. in machine learning or computer science. She works at the Pittsburgh Supercomputing Center on a machine learning project that will harness the power of deep learning to automate the process of high-resolution biomedical image annotation. Her current research interests include machine learning and deep learning, natural language processing and computational linguistics, software engineering, biological modeling and simulation, and the pairing of HPC and AI.
Nick Nystrom Senior Director of Research, Pittsburgh Supercomputing Center
Nick Nystrom is senior director of research at the Pittsburgh Supercomputing Center. Nick leads the scientific research and future technology teams of PSC, including the user support for scientific applications, biomedical, and public health applications groups, as well as a core team targeting strategic applications, allocations, and project management. He is principal investigator for "Bridges," a new kind of supercomputer that converges HPC and HPDA and aims to aid researchers who are new to HPC. His research interests include machine learning and data analytics, genomics, causal modeling, coupling HPC applications and AI, graph algorithms, hardware and software architecture, software engineering for HPC, and performance modeling. Nick earned his B.S. in chemistry, math, and physics and his Ph.D. in quantum chemistry from the University of Pittsburgh.

Tracing pathways through large volumes of data is an incredibly tedious, time-consuming process that significantly encumbers progress in neuroscience and the tracing of neurons through an organism. We'll explore the potential for applying deep learning to the automation of high-resolution scanning electron microscope image data segmentation. We've started with neural pathway tracing through 5.1GB of whole-brain serial-section slices from larval zebrafish collected by the Center for Brain Science at Harvard. This kind of manual image segmentation requires years of careful work to properly trace the neural pathways in an organism as small as a zebrafish larva, which is approximately 5mm in total body length. Automating this process could vastly improve productivity, leading to faster data analysis and more breakthroughs in understanding the complexity of the brain.

Level: All
Type: Talk
Tags: Deep Learning and AI; HPC and Supercomputing
Industry Segments: Healthcare & Life Sciences

Day: TBD
Time: TBD
Location: TBD

S7452 - Cutting Edge OptiX Ray Tracing Techniques for Visualization of Biomolecular and Cellular Simulations in VMD

John Stone Senior Research Programmer, University of Illinois at Urbana-Champaign
Highly-Rated Speaker
John Stone is a senior research programmer in the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology, and associate director of the NVIDIA CUDA Center of Excellence at the University of Illinois. John is the lead developer of VMD, a high-performance molecular visualization tool used by researchers all over the world. His research interests include molecular visualization, GPU computing, parallel processing, ray tracing, haptics, and virtual environments. John was awarded as an NVIDIA CUDA Fellow in 2010. In 2015, he joined the Khronos Group advisory panel for the Vulkan graphics API. He also provides consulting services for projects involving computer graphics, GPU computing, and high performance computing.

We'll present the latest advances in the use of NVIDIA® OptiX™ for high-fidelity rendering of state-of-the-art biomolecular and cellular simulations. We'll cover the latest technical advances in the OptiX-based ray-tracing engines in VMD, which are heavily used both for interactive progressive ray tracing (local and remote) and for batch-mode in-situ or post-hoc visualization of petascale molecular dynamics simulations.

Level: All
Type: Talk
Tags: Rendering and Ray Tracing; In-Situ and Scientific Visualization; Healthcare and Life Sciences; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7469 - Parallel Depth-First Search on GPU

Maxim Naumov Sr. Research Scientist, NVIDIA
Highly-Rated Speaker
Maxim Naumov is a senior research scientist at NVIDIA. His interests include parallel algorithms, numerical linear algebra, optimization, and graphs. Maxim contributes to the nvGRAPH data analytics library and has led the development of the AmgX library, which provides distributed algebraic multigrid, Krylov, and relaxation-based schemes. He has also worked on the cuBLAS, cuSPARSE, and cuSOLVER(RF) libraries that are part of the CUDA toolkit. In the past, Maxim held different positions at NVIDIA, including on the CUDA Platform and Emerging Applications teams, and at Intel in the Microprocessor Technology Lab and Computational Software Lab. Maxim received his Ph.D. in computer science, with a specialization in computational science and engineering, in 2009 and his B.S. in computer science and mathematics in 2003, both from Purdue University, West Lafayette.
Alysson Vrielink Sr. Research Scientist, SLAC National Accelerator Laboratory, Stanford University
Alysson Vrielink is currently working towards her Ph.D. in electrical engineering at Stanford University, conducting research on high frequency, high average power radiofrequency (RF) sources at SLAC National Accelerator Laboratory. Her interests include high performance parallel computing for scientific applications, applied mathematics and numerical methods, classical electromagnetism, and novel RF structure design. She is a Siemann Graduate Fellow, holds an NSERC postgraduate doctoral award, and was recently elected as student representative to the American Physical Society, Division of Beam Physics. She received her B.S. in engineering physics from the University of British Columbia in 2013.

The Depth-First Search (DFS) algorithm is a fundamental building block used in many higher level applications, such as topological sort and connectivity and planarity testing of graphs. We'll briefly review prior results and propose two novel variations of parallel DFS on DAGs. The first traverses the graph three times in a breadth-first search-like fashion. The second assigns a weight to each edge, such that the shortest path from root to a node corresponds to the DFS path. The parallel algorithm visits all nodes in the graph multiple times and as a result computes the DFS parent relationship, pre- (discovery) and post-order (finish) time for every node. In some cases, the parallel DFS on GPU can outperform sequential DFS on CPU by up to 6x. However, the performance of the algorithm depends highly on the structure of the graph, and is related to the length of the longest path and the degree of nodes in the graph.

Level: All
Type: Talk
Tags: Algorithms; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7471 - Combining NVIDIA Docker and Databases to Enhance Agile Development and Optimize Resource Allocation

Sophie Voisin Research & Development Associate, Oak Ridge National Laboratory
Sophie Voisin is an R&D associate at Oak Ridge National Laboratory developing high performance computing methods for geospatial data analysis for the GIST group. Sophie received her Ph.D. in computer science and image processing from the Université de Bourgogne (France) in 2008 and joined ORNL in 2010 to work on numerous image processing-related projects, successively performing quantitative analysis of neutron 2D and 3D image data; developing new techniques for eye-gaze data analysis, for which she is a co-recipient of an R&D 100 award (2014); and now implementing multidimensional image processing algorithms on GPU platforms for high performance computing of satellite imagery.
Christopher Davis Geospatial Software Engineer, Oak Ridge National Laboratory
Chris Davis is a geospatial software engineer at Oak Ridge National Laboratory. He is the engineering lead for a high performance computing effort for solving geospatial data problems. He has over 15 years of scientific and engineering software development experience, ranging from proof-of-concept to full production code bases. Over this time, he has accumulated domain experience with EO, multi-spectral, hyper-spectral, LiDAR, and FMV data processing. He has contributed to several components in the ENVI remote sensing data processing software package. He holds a B.S. in electrical engineering from George Mason University.

Learn how to use NVIDIA Docker combined with database analysis to improve your agile development process, generalize hardware requirements, speed up deployment, and identify optimal configurations. Discover how to leverage the resource isolation of Docker containers to test performance across different GPU architectures and resource allocations, optimizing system use and maximizing processing throughput. Learn how to test this resource isolation using agile methods, including the development of a processing chain from multi-threaded CPU, to single-GPU, and finally to multi-GPU architectures. Hear our observations about compilation timing, execution performance, resource allocation, and generation of CUDA binaries within containers while we showcase an automated image registration pipeline.

Level: Intermediate
Type: Talk
Tags: Data Center and Cloud Computing; HPC and Supercomputing; Deep Learning and AI

Day: TBD
Time: TBD
Location: TBD

S7472 - Comparative Study of CNN Models for Detection of Clouds in Overhead Imagery

Byung Hoon Park R&D Staff Scientist, Oak Ridge National Laboratory
Byung Hoon Park, an R&D staff member at Oak Ridge National Laboratory, has worked as a data scientist focusing on high performance computing. His past research includes computational statistics, data mining, discriminative probabilistic graphical models, and distributed parallel machine learning algorithms. Recently he has participated in a number of HPC software R&D projects that aim to bring HPC capabilities into scientific and application domains, including biology, climate, healthcare, and quality assurance of airborne imagery. He also serves as an adjunct associate professor at the Department of Business Analytics and Statistics of the University of Tennessee, Knoxville.

Learn how to improve pixel-wise image quality and geolocation accuracy by leveraging high-end hybrid computing resources. This particular test case uses deep learning to detect and mask cloud objects, and other imagery content that reduces image quality and usability, in overhead imagery. Timely results are attained by expediting the selection and deployment of a deep learning model for the cloud detection problem. An optimal model is selected by evaluating a set of convolutional neural networks for their ability to detect cloud objects; each network is evaluated using a number of open-source neural network packages to give comparative performance results. In addition, two complementary image segmentation techniques are implemented in parallel, one operating on CPUs and the other on GPUs, to rapidly obtain candidate regions for cloud objects at fine resolution.

Level: Intermediate
Type: Talk
Tags: Deep Learning and AI; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7473 - Towards a Fully Automated, High-Performance Pipeline for Stereo Reconstruction from Earth Observing Satellites

Dave Kelbe Postdoctoral Research Associate; Imaging Scientist, Oak Ridge National Laboratory
Dave Kelbe is an imaging scientist with a passion for applying satellite remote sensing technology to global geospatial challenges. Currently embedded in the Computational Science and Engineering Division at Oak Ridge National Laboratory in Knoxville, Tennessee, Dave's research relies heavily upon high-performance GPU computing to process large volumes of satellite imagery at scale. Dave received a Ph.D. in imaging science with an emphasis on remote sensing and 3D image processing from Rochester Institute of Technology.

Learn how CPU-GPU parallelization is used for high-throughput 3D surface point cloud generation from Earth-observing satellites. Stereo photogrammetry, used in computer vision applications, analyzes the parallax between image pairs to estimate depth. However, extending this workflow to satellite imagery presents computational challenges, notably near-continuous streams of gigapixel-sized images. We leverage multicore CPUs and multiple Tesla K80 GPUs to assemble a fully automated pipeline capable of rapidly processing large image streams. Initial timings demonstrated an 89x performance improvement over the publicly available version (roughly 10x of which comes from OpenMP multicore scaling). We'll share lessons learned in extending stereo reconstruction algorithms to satellite imaging at scale.

Level: Intermediate
Type: Talk
Tags: Computer Vision and Machine Vision; HPC and Supercomputing; Deep Learning and AI

Day: TBD
Time: TBD
Location: TBD

S7503 - Massively Parallel Algorithm and Implementation of RI-MP2 Energy Calculations for Multi-GPU Supercomputers

Michio Katouda Research Scientist, RIKEN
Michio Katouda is a researcher in theoretical and computational chemistry. He received his Ph.D. in chemistry from Waseda University in Tokyo in 2011. Afterwards, he was appointed as a research scientist in the Computational Molecular Science Research Team at RIKEN Advanced Institute for Computational Science. His main research interests are the development of efficient computational techniques and massively parallel algorithms for molecular electronic structure theories, such as Moller-Plesset perturbation theory and density functional theory, for large molecules and extended systems. He is a developer of the massively parallel RI-MP2 code in the NTChem and GAMESS-US software packages.

We developed a massively parallel multi-GPU implementation of resolution-of-identity second-order Moller-Plesset perturbation (RI-MP2) energy calculation, suitable for calculations of large molecules on CPU/GPU hybrid supercomputers. We'll give an overview of the implementation and the results of a performance evaluation using up to 1,349 nodes and 4,047 GPUs of the TSUBAME 2.5 supercomputer. GPU computation speeds up the RI-MP2 calculations considerably (4.1-6.6 times), and the implementation scales well with the number of nodes. A measured peak performance of 514.7 TFLOPS is attained for the GPU job of (C96H24)2 using 1,349 nodes and 4,047 GPUs of TSUBAME 2.5, much higher than that of the CPU jobs (87.5 TFLOPS). We'll also present an application to the inter-molecular interaction analysis of nano-carbon molecular assemblies such as nanographenes.

Level: Advanced
Type: Talk
Tags: Computational Chemistry; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7524 - Impacts and Paradigms Enabled by GPUs in Engineering Simulations of Discrete Elements

Nicolin Govender Senior Scientist, Research Center Pharmaceutical Engineering GmbH / CSIR: Center for High Performance Computing
Nicolin Govender is a senior research scientist at the Research Center Pharmaceutical Engineering (RCPE GmbH) in Austria and the Center for High Performance Computing (CHPC) in South Africa. Nicolin is associated with several world-leading institutes as an affiliated scientist. He is a member of the ATLAS Collaboration at CERN in Geneva, Switzerland, where he works on projects associated with computing in high-energy physics. He is also a visiting researcher at the Ecole Mines Douai and University of Lille in France, where he works on collaborative projects on granular mechanics, and at the University of Utah, working on collaborative projects related to mining and minerals engineering. Nicolin has published over 25 journal papers spanning high-energy physics, DEM, and high performance computing, with an h-index of 25. He is the developer of BLAZE-DEM, currently the fastest DEM code in the world and capable of simulating polyhedral particles on the GPU.
Daniel N. Wilke Senior Lecturer (PhD) in the Department of Mechanical and Aeronautical Engineering, University of Pretoria
Daniel N. Wilke is a senior lecturer (PhD) in the Department of Mechanical and Aeronautical Engineering at the University of Pretoria. Daniel is a design optimization researcher who investigates computational finite and discrete element applications within the Centre for Asset and Integrity Management. C-AIM is focused on life cycle management of physical assets for key industries in South Africa. This includes the optimization of industrial processes, which requires computationally demanding large-scale analyses. He proposed gradient-only optimization in 2006 as an alternative optimization formulation that allows for multi-fidelity simulation models to be used when accurate sensitivities are available. He is also a co-developer of the discrete element computational platform BlazeDEM-GPU. Since 2015, Daniel has been a Tuks Young Research Leadership Fellow, which is aimed at driving and developing research excellence within Africa.

We'll explore the impact of the GPU on engineering simulations of discrete elements and glimpse into the future of simulations and engineering training. We consider the roles played by the open-source BlazeDEM-GPU framework we developed, as well as the commercial framework XPS, developed specifically for the pharmaceutical industry by RCPE GmbH (Research Center Pharmaceutical Engineering GmbH), which allows engineers to simulate process changes before they are actually implemented. Industrial-scale discrete element simulations remain a big challenge, but the GPU architecture is changing that perception fast, as both frameworks demonstrate. However, engineering simulation remains characterized by the analyze-wait-modify-analyze cycle or, more recently, its batched equivalent. The GPU is enabling a new and alternative paradigm, denoted interactive simulation and design (ISD), as demonstrated by BlazeDEM-GPU. We'll explore the algorithmic development of BlazeDEM-GPU in detail, with a short historical tour outlining its development as GPU architectures changed from Kepler to Pascal, enabling higher-fidelity models, the natural progression from the conventional analysis cycle towards ISD, and the various roles machine learning can play.

Level: All
Type: Talk
Tags: Computational Physics; Computational Biology; Computational Chemistry; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7527 - Unstructured Low-Order Finite-Element Earthquake Simulation Using OpenACC on Pascal GPUs

Takuma Yamaguchi Master's Student, The University of Tokyo
Takuma Yamaguchi is a master's student in the Department of Civil Engineering at the University of Tokyo. His research focus is on high performance computing targeting earthquake simulations; more specifically, his work accelerates crustal deformation computations using GPUs. Takuma has a B.E. from the University of Tokyo.

We'll show a method that decreases random memory accesses on GPUs by splitting up calculations appropriately. The target application is unstructured low-order finite element analysis, a core application for manufacturing analyses. To reduce memory access costs, we apply the element-by-element method for matrix-vector multiplication, which computes local matrix-vector products for each element in parallel. Atomic and cache hardware in GPUs has improved, and we can exploit the data locality in the element-node connectivity by using atomic functions to accumulate local results. We port the code to GPUs using OpenACC directives and attain high performance with low development costs. We'll also describe performance on the NVIDIA DGX-1, which contains eight Pascal GPUs.
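The talk uses OpenACC, but the core pattern (a per-element local matrix-vector product whose results are atomically scattered into the global vector) can be sketched as a CUDA kernel. The element type, names, and connectivity layout below are illustrative assumptions, not the authors' code.

```
// Element-by-element matrix-vector product: one thread per element
// computes its local Ke * ue and atomically scatters into the result.
// A 4-node element with one scalar DOF per node is assumed for brevity.
__global__ void ebeMatVec(const float* Ke,    // nElem x 16 local matrices
                          const int* conn,    // nElem x 4 node indices
                          const float* u, float* y, int nElem)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nElem) return;

    const float* K = Ke + 16 * e;
    const int* c = conn + 4 * e;

    for (int a = 0; a < 4; ++a) {
        float s = 0.0f;
        for (int b = 0; b < 4; ++b)
            s += K[4 * a + b] * u[c[b]];
        atomicAdd(&y[c[a]], s);   // hardware atomics merge element contributions
    }
}
```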

Level: Intermediate
Type: Talk
Tags: HPC and Supercomputing; Computational Physics; Computer Aided Engineering; Manufacturing Industries; Computational Fluid Dynamics

Day: TBD
Time: TBD
Location: TBD

S7535 - Potential Field Solutions of the Solar Corona: Converting a PCG Solver from MPI to MPI+OpenACC

Ronald Caplan Computational Scientist, Predictive Science Inc.
Ronald Caplan is a computational scientist whose main interests are in developing and optimizing numerical methods for simulating physics-based models and their implementations in parallel high performance computing environments. His research currently focuses on the continued development and optimization of Predictive Science's magnetohydrodynamic codes used to study the solar corona and heliosphere, as well as providing computational solutions for additional projects.

We'll describe a real-world example of adding OpenACC to a legacy MPI Fortran preconditioned conjugate gradient (PCG) code, and show timing results for multi-node, multi-GPU runs. The code's application is obtaining 3D spherical potential field (PF) solutions of the solar corona using observational boundary conditions. PF solutions yield approximations of the coronal magnetic field structure and can be used as initial/boundary conditions for MHD simulations with applications to space weather prediction. We'll highlight key tips and strategies used when converting the MPI code to MPI+OpenACC, including linking Fortran code to the cuSPARSE library, using CUDA-aware MPI, maintaining performance portability, and dealing with multi-node, multi-GPU run-time environments. We'll show timing results for three increasing-sized problems for running the code with MPI-only (up to 1,728 CPU cores) and with MPI+GPU (up to 60 GPUs) using NVIDIA K80 and P100 GPUs.
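For reference, the sparse matrix-vector product at the heart of such a PCG iteration can be delegated to cuSPARSE; in the CUDA 8 era this was the csrmv routine. A hedged C++ fragment showing handle/descriptor setup (the data arrays and their setup are assumed to exist elsewhere; the talk's actual code is Fortran linked against cuSPARSE):

```
#include <cusparse.h>

// y = A*x for a CSR matrix A, as used inside each PCG iteration.
// Device arrays (csrVal, csrRowPtr, csrColInd, x, y) are assumed
// to be allocated and populated elsewhere.
void spmv(cusparseHandle_t handle, int m, int n, int nnz,
          const double* csrVal, const int* csrRowPtr, const int* csrColInd,
          const double* x, double* y)
{
    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);
    cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

    const double alpha = 1.0, beta = 0.0;
    // CUDA 8-era cuSPARSE CSR matrix-vector product.
    cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   csrVal, csrRowPtr, csrColInd, x, &beta, y);

    cusparseDestroyMatDescr(descr);
}
```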

Level: Intermediate
Type: Talk
Tags: Astronomy and Astrophysics; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7539 - Petascale Molecular Dynamics Simulations from Titan to Summit

James Phillips Senior Research Programmer, University of Illinois
Highly-Rated Speaker
James Phillips is a senior research programmer in the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. He has a Ph.D. in physics from the University of Illinois. Since 1999, James has been the lead developer of the highly scalable parallel molecular dynamics program NAMD, for which he received a Gordon Bell Award in 2002. His research interests include improving the performance and accuracy of biomolecular simulations through parallelization, optimization, hardware acceleration, better algorithms, and new methods.

The highly parallel molecular dynamics code NAMD is used on the GPU-accelerated Cray XK7 Blue Waters and ORNL Titan machines to perform petascale biomolecular simulations, including a 64-million-atom model of the HIV virus capsid. In 2007, NAMD was one of the first codes to run on a GPU cluster, and it's now being prepared for the ORNL Summit supercomputer, which will feature IBM Power9 CPUs, NVIDIA GPUs, and the NVLink CPU-GPU interconnect. Learn the opportunities and pitfalls of taking GPU computing to the petascale, along with recent NAMD performance advances and early results from the Summit Power8+/P100 "Minsky" development cluster.

Level: Intermediate
Type: Talk
Tags: HPC and Supercomputing; Computational Chemistry

Day: TBD
Time: TBD
Location: TBD

S7540 - CAE Productivity and GPU Technology

Wayne Mindle Director of Sales & Marketing, CertaSIM, LLC
Highly-Rated Speaker
Wayne Mindle is director of Sales and Marketing at CertaSIM, LLC, the U.S. and Canadian distributor of the IMPETUS Afea Software Suite. Wayne has worked for several major aerospace companies, a consulting company for the FAA, and, prior to his association with CertaSIM, spent 15 years at Livermore Software Technology Corp. as the lead technical sales engineer. He earned his Ph.D. from Northwestern University in the area of applied mechanics, more specifically finite element analysis as applied to the area of nonlinear explicit transient dynamic problems.

We'll present performance results for the NVIDIA Tesla P100. Simulation is the key to greater productivity in many areas of product development and GPU technology plays a crucial role in achieving that goal. We'll use the simulation of a full 3D particle compaction process to compare run times with the NVIDIA Tesla K40. The results are generated from a commercially available nonlinear explicit transient dynamic finite element solver that takes full advantage of GPU technology for parallelization. The commercial software used to create the finite element mesh includes newly developed meshing techniques that make it easy to create the model. We'll also discuss details of the commercially available hardware used to perform the simulation, which has been certified for the P100.

Level: All
Type: Talk
Tags: Computer Aided Engineering; HPC and Supercomputing; Computational Fluid Dynamics; Manufacturing Industries

Day: TBD
Time: TBD
Location: TBD

S7549 - Deep Learning Acceleration of Progress toward Delivery of Fusion Energy

William Tang Principal Research Physicist, Princeton Plasma Physics Laboratory, Princeton University
William Tang of Princeton University is principal research physicist at the Princeton Plasma Physics Laboratory for which he served as chief scientist (1997-2009) and is currently lecturer with rank and title of professor in astrophysical sciences, and member of the executive board for the Princeton Institute for Computational Science and Engineering, which he helped establish and served as associate director (2003-2009). William is internationally recognized for expertise in the mathematical formalism and associated computational applications dealing with electromagnetic kinetic plasma behavior in complex geometries -- with over 200 publications with more than 150 peer-reviewed papers and an "h-index" or "impact factor" of 44 on the Web of Science, including well over 7,000 total citations. William has taught for over 30 years and has supervised numerous Ph.D. students, including recipients of the Presidential Early Career Award for Scientists and Engineers in 2000 and 2005. He is also head of the Intel Parallel Computing Center at the Princeton Institute for Computational Science & Engineering at Princeton University.

Expediting delivery of fusion power -- identified by the 2015 CNN "Moonshots for the 21st Century" series as one of six grand challenges for the modern world -- can be enabled by engaging big-data-driven machine/deep learning predictive methods. Princeton's associated project has access to over a half-petabyte of the EUROFUSION/JET disruption database, and its new FRNN (Fusion Recurrent Neural Net) code exhibits excellent scaling to nearly 200 GPUs. We aim to extend this trend on NVIDIA's powerful SATURN V to its nearly 1,000 GPUs (124 nodes with eight Pascal P100 GPUs per node) in time for presentation at GTC 2017.

Level: All
Type: Talk
Tags: Deep Learning and AI; Computational Physics; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7561 - GPU Acceleration of a Large Eddy Simulation Software for High-Pressure, Supercritical Reacting Flows

Ramanan Sankaran Computational Scientist, Oak Ridge National Laboratory
Ramanan Sankaran is a computational scientist at the Oak Ridge National Laboratory. He received his Ph.D. in mechanical engineering from the University of Michigan, Ann Arbor. Ramanan performs numerical studies of reacting and multiphase flows using high performance computing to understand the fundamental characteristics of fluid flows in engineering applications. He is an expert in computational combustion and engineering with more than 17 years of experience in the modeling and simulation of combustion. His research focuses on numerically studying various combustion phenomena such as auto-ignition, turbulent premixed, and non-premixed combustion and combustion chemical kinetics. He also develops scalable and massively parallel software tools for combustion and engineering simulations and analysis of large simulation datasets.

RAPTOR is a massively parallel flow solver for the simulation of turbulent combustion. In preparation for the upcoming Summit system at the Oak Ridge Leadership Computing Facility, a performance-portable and GPU-ready version of RAPTOR has been developed. A combination of programming models has been used to convert the distributed-memory parallel code into a hybrid parallel code with multiple levels of parallelism. Major performance-critical kernels have been reimplemented in C++ using the Kokkos programming model, and the main flow solver has been accelerated using OpenMP compiler directives. We'll present the performance characteristics of RAPTOR on the IBM Minsky system for a high-pressure, supercritical reacting flow problem with applications in the aerospace and energy industries.
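For readers unfamiliar with Kokkos, reimplementing a kernel in it looks roughly like the following generic sketch (our illustration, not RAPTOR code): the loop body is written once as a lambda, and the backend chosen at compile time maps it to CUDA threads on the GPU or OpenMP threads on the CPU.

#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    // Views allocate in the default execution space's memory
    // (device memory on a CUDA build).
    Kokkos::View<double*> rho("rho", n), flux("flux", n);

    // A performance-critical loop expressed once, portably.
    Kokkos::parallel_for("flux_update", n, KOKKOS_LAMBDA(const int i) {
      flux(i) = 0.5 * rho(i);
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}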

Level: Intermediate
Type: Talk
Tags: Computational Fluid Dynamics; HPC and Supercomputing; Computer Aided Engineering

Day: TBD
Time: TBD
Location: TBD

S7562 - Deep Learning to Enable Real-Time Gravitational Wave and Multimessenger Astrophysics

Daniel George Scientist, University of Illinois at Urbana-Champaign, National Center for Supercomputing Applications
Daniel George is a Ph.D. student in astronomy, pursuing the computational science and engineering concentration, at the University of Illinois at Urbana-Champaign. He obtained his bachelor's degree in engineering physics from IIT Bombay. He is currently a research assistant in the Gravity Group at the National Center for Supercomputing Applications and a member of the LIGO collaboration working at the interface of deep learning, high performance computing, and gravitational wave and multimessenger astrophysics. His long-term interests lie in applying cutting-edge computer science and technology, especially machine learning and artificial intelligence, to accelerate discoveries in the fundamental sciences.
Eliu Huerta Gravity Group Leader, University of Illinois at Urbana-Champaign, National Center for Supercomputing Applications
Eliu Huerta is the head of the Gravity Group at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign. Eliu obtained a master's degree in applied mathematics and theoretical physics and a Ph.D. in theoretical astrophysics at the University of Cambridge, U.K. His work is at the interface of analytical and numerical general relativity, and on the exploitation of advanced cyberinfrastructure facilities to create scenarios for multi-messenger astrophysics. He is a member of the LIGO Scientific Collaboration, the NANOGrav Consortium, and the Dark Energy Survey.

The Advanced Laser Interferometer Gravitational-Wave Observatory (aLIGO) came online last year and rapidly produced data confirming Einstein's prediction of gravitational waves. This discovery, and the success of the detector itself, opens the door to combining gravitational-wave observations with electromagnetic instruments (optical telescopes, radio telescopes, etc.), dramatically increasing our potential to understand deep space and the astronomical phenomena at the origins of the universe. The project used data produced by HPC simulations with the Cactus framework to build datasets for training a deep neural network with the MXNet framework. Prediction accuracy improved over classical waveform analysis, while the hardware needed for detection shrank from hundreds of CPUs to a single GPU, with predictions delivered at a latency of about one millisecond. The work was done on the Blue Waters supercomputer and at the Innovation Lab at NCSA. The reduction in "pipeline size" (the number of processors needed to make a detection) and the improved latency open up the potential for multi-messenger astrophysics, where an observation "heard" by the gravitational-wave detector can tell a detector in the visible or electromagnetic spectrum where to look.

Level: All
Type: Talk
Tags: HPC and Supercomputing; Deep Learning and AI; Astronomy and Astrophysics; Computational Physics
Industry Segments: Higher Education / Research

Day: TBD
Time: TBD
Location: TBD

S7569 - High-Performance Data Loading and Augmentation for Deep Neural Network Training

Trevor Gale Student, Northeastern University
Trevor Gale is a computer engineering student at Northeastern University. His interests include high performance computing, machine learning, and general-purpose graphics processors. He has previously worked on scalable deep neural network training on many-GPU distributed systems, developed graph algorithms for GPU clusters, and built tools to study the memory reliability of GPUs. In 2016, Trevor interned at Samsung Research America on the General-Purpose Acceleration Framework team, where he worked on the dMath distributed mathematics library and the Expresso deep learning framework.
Steven Eliuk Project Lead, Samsung
Steven Eliuk is a graduate of the University of Alberta Computing Science department, where he completed his Ph.D. in distributed algorithms in applied sciences. He received numerous awards from the Natural Sciences and Engineering Research Council of Canada, the Alberta Ingenuity Fund, the University of Alberta, and more. Previous experience includes IBM, where he was director at the Servier Virtual Cardiac Center, which focused on reducing radiation and enhancing CT scans in pediatrics. He currently leads a distributed algorithms group at Samsung Electronics, which has focused on primitive acceleration for machine learning since 2013. The team has produced the most performant version of distributed Caffe while maintaining numerical stability and, where possible, strict mathematical guarantees, with better accuracy and no loss of precision when scaling from one to 128 GPUs.

Next-generation GPUs have revealed that data loading and augmentation can be a major bottleneck to accelerating deep neural network training on many-GPU distributed systems. This work presents the design and implementation of a high-performance data loading and augmentation system for the Expresso deep learning framework developed by Samsung. Our system leverages multiple levels of parallelism and automatic runtime performance tuning to achieve speedups of 15.5% on average across our experiments.
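The details of Expresso's loader are the talk's subject, but the general shape of such a system -- a producer thread that decodes and augments batches into a bounded prefetch queue while the training loop consumes them -- can be sketched generically (all names and sizes here are illustrative):

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

using Batch = std::vector<float>;

// Bounded producer/consumer queue: the producer blocks when the prefetch
// queue is full, the consumer blocks when it is empty.
class BatchQueue {
  std::queue<Batch> q_;
  std::mutex m_;
  std::condition_variable not_full_, not_empty_;
  static const std::size_t kCap = 4;  // prefetch depth
public:
  void push(Batch b) {
    std::unique_lock<std::mutex> lk(m_);
    not_full_.wait(lk, [this] { return q_.size() < kCap; });
    q_.push(std::move(b));
    not_empty_.notify_one();
  }
  Batch pop() {
    std::unique_lock<std::mutex> lk(m_);
    not_empty_.wait(lk, [this] { return !q_.empty(); });
    Batch b = std::move(q_.front());
    q_.pop();
    not_full_.notify_one();
    return b;
  }
};

int main() {
  BatchQueue queue;
  const int nbatches = 100;

  // Producer: stand-in for decode + augment, running ahead of training.
  std::thread loader([&] {
    for (int i = 0; i < nbatches; ++i)
      queue.push(Batch(3 * 224 * 224, 0.5f));
  });

  // Consumer: the training loop pops a ready batch; loading latency is
  // hidden as long as the queue stays non-empty.
  for (int i = 0; i < nbatches; ++i) {
    Batch b = queue.pop();
    (void)b;  // ...run the forward/backward pass on b...
  }
  loader.join();
  return 0;
}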

Level: Advanced
Type: Talk
Tags: Deep Learning and AI; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

S7600 - ChainerMN: Scalable Distributed Deep Learning with Chainer

Takuya Akiba Researcher, Preferred Networks, Inc.
Takuya Akiba is a researcher at Preferred Networks, Inc., working on research and development for making deep learning faster and more scalable. He received a Ph.D. in information science and technology from the University of Tokyo, Japan, in 2015.

We'll present ChainerMN, a multi-node distributed deep learning framework, together with the basics of distributed deep learning. Even though GPUs are continuously gaining more computational throughput, it is still very time-consuming to train state-of-the-art deep neural network models. For better scalability and productivity, it is paramount to accelerate the training process by using multiple GPUs. To enable high-performance and flexible distributed training, we developed ChainerMN, built on top of Chainer. We'll first introduce the basic approaches to distributed deep learning. Then, we'll explain the design choices, basic usage, and implementation details of Chainer and ChainerMN. We'll report benchmark results and discuss future directions for distributed deep learning.
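ChainerMN itself is a Python framework, but the data-parallel approach at the heart of most distributed deep learning reduces to one step: every worker computes gradients on its own shard of the batch, then the workers average gradients with an allreduce before the optimizer update. A minimal sketch of that step in C++/MPI (illustrative only; names are ours):

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int nparams = 1 << 16;
  // Stand-in for gradients computed on this rank's shard of the batch.
  std::vector<float> grad(nparams, static_cast<float>(rank));
  std::vector<float> avg(nparams);

  // The communication step that dominates scaling: sum across ranks...
  MPI_Allreduce(grad.data(), avg.data(), nparams, MPI_FLOAT, MPI_SUM,
                MPI_COMM_WORLD);
  // ...then average, so every rank applies an identical update.
  for (float& g : avg) g /= static_cast<float>(size);

  MPI_Finalize();
  return 0;
}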

Level: Intermediate
Type: Talk
Tags: Deep Learning and AI; HPC and Supercomputing; AI Startup

Day: TBD
Time: TBD
Location: TBD

S7635 - Comparison of OpenACC and OpenMP4.5 Offloading: Speeding Up Simulations of Stellar Explosions

Tom Papatheodore Solutions Architect, NVIDIA
Tom Papatheodore is a solutions architect at NVIDIA, located at Oak Ridge National Laboratory, where he helps users and staff members take advantage of NVIDIA GPUs within the facility's current and upcoming (Summit) supercomputers.

Learn about a case study comparing OpenACC and OpenMP 4.5 in the context of stellar explosions. Modeling supernovae requires multi-physics simulation codes to capture hydrodynamics, nuclear burning, gravitational forces, etc. As a nuclear detonation burns through the stellar material, it also increases the temperature. An equation of state (EOS) is then required to determine, say, the new pressure associated with this temperature increase. In fact, an EOS is needed after the thermodynamic conditions are changed by any physics routine, which means it is called many times throughout a simulation and makes a fast EOS implementation essential. Fortunately, these calculations can be performed independently during each time step, so the work can be offloaded to GPUs. Using the IBM/NVIDIA early test system (precursor to the upcoming Summit supercomputer) at Oak Ridge National Laboratory, we use a hybrid MPI+OpenMP (traditional CPU threads) driver program to offload work to GPUs. We'll compare the performance results as well as some of the currently available features of OpenACC and OpenMP 4.5.
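To make the comparison concrete, here is one independent-per-cell loop (a hypothetical stand-in for the EOS evaluation, not the talk's code) offloaded both ways in C++:

// Hypothetical per-cell EOS: pressure from density and temperature.
#pragma acc routine seq
#pragma omp declare target
inline double eos_pressure(double rho, double T) { return rho * T * 8.314; }
#pragma omp end declare target

void eos_openacc(const double* rho, const double* T, double* p, int n) {
  // OpenACC: one directive parallelizes the loop and moves the data.
  #pragma acc parallel loop copyin(rho[0:n], T[0:n]) copyout(p[0:n])
  for (int i = 0; i < n; ++i) p[i] = eos_pressure(rho[i], T[i]);
}

void eos_openmp45(const double* rho, const double* T, double* p, int n) {
  // OpenMP 4.5: target offload with explicit map clauses.
  #pragma omp target teams distribute parallel for \
      map(to: rho[0:n], T[0:n]) map(from: p[0:n])
  for (int i = 0; i < n; ++i) p[i] = eos_pressure(rho[i], T[i]);
}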

Level: Intermediate
Type: Talk
Tags: Astronomy and Astrophysics; HPC and Supercomputing
Industry Segments: Higher Education / Research

Day: TBD
Time: TBD
Location: TBD

S7668 - Practical Aspects of Porting Monte Carlo Exotic Derivative Pricing Engines to IBM Power 8+ with Tesla P100 GPUs

Oleg Rasskazov Executive Director, Quantitative Research, JP Morgan Chase
Oleg Rasskazov is an executive director of QR Analytics at JP Morgan Chase. Oleg has spent 10 years in quantitative research at JP Morgan, working on high-performance compute for equities, commodities, and foreign exchange. Oleg has a Ph.D. in applied math focusing on computer-assisted proofs.

The Pascal generation of GPUs brings increased compute density to data centers, and NVLink on IBM Power8 CPUs makes this compute density ever more accessible to HPC applications. However, reduced memory-to-compute ratios present some unique challenges for the cost of throughput-oriented compute. We'll present a case study of moving production Monte Carlo GPU codes to IBM's "Minsky" S822LC servers with NVIDIA Tesla P100 GPUs.
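For context, the throughput-oriented shape of such an engine, reduced to a toy European call under Black-Scholes dynamics (plain C++ for illustration; a production engine evaluates these independent paths in GPU kernels, which is exactly what makes the memory-to-compute ratio matter):

#include <algorithm>
#include <cmath>
#include <random>

// Toy Monte Carlo price of a European call option.
double mc_call_price(double S0, double K, double r, double sigma, double T,
                     int npaths, unsigned seed = 42) {
  std::mt19937_64 rng(seed);
  std::normal_distribution<double> z(0.0, 1.0);
  double sum = 0.0;
  for (int i = 0; i < npaths; ++i) {
    // Terminal spot under geometric Brownian motion.
    double ST = S0 * std::exp((r - 0.5 * sigma * sigma) * T
                              + sigma * std::sqrt(T) * z(rng));
    sum += std::max(ST - K, 0.0);  // call payoff
  }
  return std::exp(-r * T) * sum / npaths;  // discounted average payoff
}

Each path is independent, so the natural GPU mapping is one path (or batch of paths) per thread, with the payoff average computed as a device-side reduction.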

Level: Intermediate
Type: Talk
Tags: Finance; HPC and Supercomputing
Industry Segments: Financial Services

Day: TBD
Time: TBD
Location: TBD

S7676 - Half Precision Benchmarking for HPC

Piotr Luszczek Research Director, University of Tennessee
Piotr Luszczek works on matrix factorizations and hardware and software benchmarking, including development of established industry benchmarks such as HPL, HPCC, and HPCG, at the University of Tennessee. Piotr's work at MathWorks was concentrated on parallel language design and its implementation with particular emphasis on high-performance programming, and resulted in three patent awards. The work there refocused his research on performance modeling and evaluation in the context of tuning of parallelizing compilers as well as energy-conscious aspects of heterogeneous and embedded computing. He also investigates how scientific codes are influenced by power and energy constraints and how to include performance-conscious optimizations into sustainable computational science while maintaining a balanced system design with large-scale resilient solvers in mind.

With the Tegra X1 and the Pascal-architecture Tesla P100 GPUs, NVIDIA introduced hardware-based computation on FP16 numbers, also called half-precision arithmetic. We'll introduce the steps required to build a viable benchmark for this new arithmetic format, including the connections to established IEEE floating-point standards and existing HPC benchmarks. The discussion will focus on the performance and numerical stability issues that are important for this kind of benchmarking, and how they relate to NVIDIA platforms.
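One numerical idea that makes low-precision hardware useful for HPC-style benchmarks is mixed-precision iterative refinement: solve in the fast, low precision, then recover accuracy by computing residuals in a higher precision. The sketch below uses float as a stand-in for FP16 and double for the higher precision (our illustration of the technique, not the benchmark itself):

#include <cmath>
#include <cstdio>

int main() {
  // Fixed 2x2 system A x = b.
  const double A[2][2] = {{4.0, 1.0}, {1.0, 3.0}};
  const double b[2] = {1.0, 2.0};

  // "Fast" low-precision solve via the explicit 2x2 inverse
  // (float stands in for FP16 here).
  auto solve_low = [&](const double r[2], double x[2]) {
    const float det = 4.0f * 3.0f - 1.0f * 1.0f;
    x[0] = ( 3.0f * (float)r[0] - 1.0f * (float)r[1]) / det;
    x[1] = (-1.0f * (float)r[0] + 4.0f * (float)r[1]) / det;
  };

  double x[2];
  solve_low(b, x);
  for (int it = 0; it < 5; ++it) {
    // High-precision residual r = b - A x.
    double r[2] = {b[0] - (A[0][0] * x[0] + A[0][1] * x[1]),
                   b[1] - (A[1][0] * x[0] + A[1][1] * x[1])};
    std::printf("iter %d: |r| = %.3e\n", it, std::hypot(r[0], r[1]));
    double d[2];
    solve_low(r, d);              // cheap low-precision correction
    x[0] += d[0]; x[1] += d[1];   // accumulate in high precision
  }
  return 0;
}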

Level: Intermediate
Type: Talk
Tags: Algorithms; Deep Learning and AI; HPC and Supercomputing
Industry Segments: Government / National Labs; Higher Education / Research

Day: TBD
Time: TBD
Location: TBD

S7724 - Tofu: Parallelizing Deep Learning Using Automatic Tensor Tiling

Minjie Wang Student, New York University
Minjie Wang is a third-year Ph.D. student at New York University and a member of the NYU systems group. Before joining NYU, Minjie earned his master's and bachelor's degrees at Shanghai Jiao Tong University. He also spent two years as a research intern at Microsoft Research Asia, where he found his research interest in machine learning systems and built his first deep learning system, Minerva. Minjie was also one of the first members of the Deep Machine Learning Community. He is one of the main developers of the MXNet, NNVM, and MinPy projects, and a recipient of a 2016 NVIDIA Graduate Fellowship.

We'll cover how to automatically select the best parallelism for a deep learning algorithm. Current deep learning systems like TensorFlow and MXNet focus on one specific parallelization strategy, data parallelism, which requires large training batch sizes to scale. The alternative, model parallelism, has no such requirement but is inefficient when the model parameter size is large. Choosing the right parallelism is tedious for users because it requires extensive analysis of the whole program. We therefore propose Tofu, a system that automatically parallelizes a deep learning program. We first cast the problem of finding the best parallelization strategy as the problem of finding the tiling that partitions the computation with the least overall communication, and propose an algorithm that is provably optimal; the resulting solution is a hybrid of data and model parallelism. Tofu then automatically transforms the dataflow graph captured by an existing deep learning system frontend into a parallel dataflow graph based on the optimal tiling it has found.
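The flavor of the underlying cost comparison: for a dense layer Y = X·W trained on k workers, partitioning along the batch (data parallelism) communicates the W-sized gradient each step, while partitioning W by output columns (model parallelism) communicates the activations. A toy model of this trade-off (our illustration of the idea, not Tofu's algorithm):

#include <cstdio>

// Rough per-step communication volume, in elements, for Y = X * W
// with X: batch x in and W: in x out, on k workers.
long long comm_data_parallel(long long in, long long out) {
  // Every worker holds a full W and must allreduce its gradient;
  // constant factors of the allreduce are omitted.
  return in * out;
}

long long comm_model_parallel(long long in, long long batch, int k) {
  // W is split by output columns; the activations X must be exchanged
  // among the k workers each step.
  return batch * in * (k - 1) / k;
}

int main() {
  const long long in = 4096, out = 4096;
  const int k = 8;
  for (long long batch : {32LL, 256LL, 8192LL}) {
    long long dp = comm_data_parallel(in, out);
    long long mp = comm_model_parallel(in, batch, k);
    std::printf("batch %5lld: data-parallel %lld vs model-parallel %lld -> %s\n",
                batch, dp, mp, dp < mp ? "partition batch" : "partition model");
  }
  return 0;
}

Small batches favor partitioning the model; large batches favor partitioning the batch, matching the scaling behavior described above. Tofu generalizes this comparison over every tensor dimension in the dataflow graph.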

Level: Advanced
Type: Talk
Tags: Deep Learning and AI; HPC and Supercomputing
Industry Segments: Higher Education / Research

Day: TBD
Time: TBD
Location: TBD

S7796 - Heterogeneous Hierarchical Async Tasking: Making it Real

CJ Newburn Principal HPC Architect for Compute SW, NVIDIA
Chris J. Newburn (CJ) is the principal HPC architect for NVIDIA compute software, with a special focus on programming models for scale. He has contributed to a combination of hardware and software technologies over the last twenty years and holds over 80 patents. He has a passion for architecting the richness of heterogeneous platforms so that they are easier to use and have lasting value. He wrote a binary-optimizing, multi-grained parallelizing compiler as part of his Ph.D. at Carnegie Mellon University. Before grad school, in the 80s, he did stints at a couple of start-ups, working on a voice recognizer and a VLIW supercomputer. He's delighted to have worked on volume products that his mom uses.

The HiHAT (hierarchical heterogeneous async tasking) effort is gaining steam with U.S. government funders and with tasking-framework and language-runtime developers in academia and industry. Come find out why! See how it enables retargetability by offering a sound architecture into which hardware vendors can plug highly tuned implementations, and a platform on which various runtimes can get rich functionality with razor-thin overhead. Hear leading customers present what matters most to them: their priorities, provisioning constraints, and interface requirements. Get a first peek at the functionality HiHAT is designed to provide. The session will give an overview of implementation plans and outline next steps for the overall project.

Level: Advanced
Type: Talk
Tags: HPC and Supercomputing; Data Center and Cloud Computing
Industry Segments: General; Government / National Labs

Day: TBD
Time: TBD
Location: TBD


INSTRUCTOR-LED LAB

Presentation
Details

L7106 - Best GPU Code Practices Combining OpenACC, CUDA, and OmpSs

Pau Farre Software Engineer, Barcelona Supercomputing Center (BSC)
Pau Farre is a software engineer at the Barcelona Supercomputing Center, where, as a member of the BSC/UPC NVIDIA GPU Center of Excellence, he works on porting BSC applications to GPUs. Pau completed his undergraduate degree and joined BSC in 2014, and received his master's degree from UPC in 2016.
Antonio J. Peña Senior Researcher, Barcelona Supercomputing Center (BSC)
Antonio Peña has served as the manager of the BSC/UPC NVIDIA GPU Center of Excellence since March 2016. Antonio is a former postdoctoral appointee at Argonne National Laboratory. He earned his Ph.D. from Jaume I University, in Spain.

We'll guide you step by step through porting and optimizing an oil-and-gas miniapplication to efficiently leverage the computing power of NVIDIA GPUs. While OpenACC focuses on coding productivity and portability, CUDA enables extracting the maximum performance from NVIDIA GPUs. OmpSs, on the other hand, is a GPU-aware task-based programming model that may be combined with CUDA, and recently with OpenACC as well. Using OpenACC, we'll start benefiting from GPU computing, obtaining great coding productivity and a nice performance improvement. We can then fine-tune the critical application parts by developing CUDA kernels to hand-optimize the problem. OmpSs combined with either OpenACC or CUDA enables seamless task parallelism across all system devices. Prerequisites: basic knowledge of OpenACC and CUDA. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.
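A taste of the overlap this combination builds toward: OpenACC async queues let independent kernels run concurrently with each other and with the host, which is the same property OmpSs exposes as tasks (a generic C++ sketch, not the lab's miniapplication):

#include <vector>

int main() {
  const int n = 1 << 20;
  std::vector<float> a(n, 1.0f), b(n, 2.0f);
  float* ap = a.data();
  float* bp = b.data();

  #pragma acc data copy(ap[0:n], bp[0:n])
  {
    // Two independent kernels on different async queues may overlap on
    // the device, much like two OmpSs tasks with no dependence edge.
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i) ap[i] *= 2.0f;

    #pragma acc parallel loop async(2)
    for (int i = 0; i < n; ++i) bp[i] += 1.0f;

    #pragma acc wait  // join both queues before data is copied back
  }
  return 0;
}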

Level: Intermediate
Type: Instructor-Led Lab
Tags: Programming Languages; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

L7107 - Kokkos, Manycore Performance Portability Made Easy for C++ HPC Applications

H. Carter Edwards Principal Member of Technical Staff, Sandia National Laboratories
Highly-Rated Speaker
H. Carter Edwards is the principal investigator and architect for the Kokkos project at Sandia National Laboratories. Carter has over three decades of experience in modeling and simulation software development and over two decades of experience in HPC, parallel processing, and C++ software development. For the last several years, his HPC focus has been on algorithms and programming models for thread-scalable and performance portable parallelism across next-generation platform node architectures. Carter has a B.S. and M.S. in aerospace engineering and a Ph.D. in computational mathematics. He represents Sandia on the ISO C++ language standard committee.
Christian Trott Senior Member Technical Staff, Sandia National Laboratories
Christian Trott is a high performance computing expert with experience in designing and implementing software for GPU and MIC compute clusters. He earned a Dr. rer. nat. in theoretical physics from the University of Technology Ilmenau. His prior scientific work focused on computational materials research using ab initio calculations, molecular dynamics simulations, and Monte Carlo methods. Since 2015, Christian has been a senior member of technical staff at Sandia National Laboratories. He is a core developer of the Kokkos programming model, with a large role in advising applications on adopting Kokkos to achieve performance portability on next-generation supercomputers.
Fernanda Foertter HPC User Support Specialist/Programmer, Oak Ridge National Laboratory
Highly-Rated Speaker
(tbd: on file with GTC)

The Kokkos C++ library enables development of HPC scientific applications that are performance portable across disparate manycore architectures such as NVIDIA® Kepler™, AMD Fusion, and Intel Xeon Phi. Kokkos leverages the NVIDIA CUDA® 8.0 device lambda capability to provide a highly intuitive and easy-to-use parallel programming model. Kokkos simplifies data management for heterogeneous memory (CPU, GPU, UVM, etc.) through a unique polymorphic multidimensional array interface. View polymorphism includes mutable multidimensional layout, transparent overloads for atomic operations, and simplified access to GPU texture hardware. Kokkos' advanced features culminate in portable team parallelism that maps efficiently onto CUDA grids, blocks, and shared memory.
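As a small taste of the lab's material: the same 2D View compiles to a row-major layout for CPU threads and a column-major (coalesced) layout for CUDA, so one source gets good memory access patterns on both, and reductions are expressed just as portably (a minimal sketch under our own naming, not lab code):

#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1024;
    // Layout is chosen per execution space; the indexing code below is
    // unchanged whether it runs on the CPU or the GPU.
    Kokkos::View<double**> A("A", n, n);

    Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
      for (int j = 0; j < n; ++j) A(i, j) = (i == j) ? 1.0 : 0.0;
    });

    double trace = 0.0;
    Kokkos::parallel_reduce("trace", n,
        KOKKOS_LAMBDA(const int i, double& sum) { sum += A(i, i); }, trace);
    Kokkos::fence();  // trace now holds n on every backend
  }
  Kokkos::finalize();
  return 0;
}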

Level: Intermediate
Type: Instructor-Led Lab
Tags: HPC and Supercomputing; Tools and Libraries

Day: TBD
Time: TBD
Location: TBD

L7114 - Multi GPU Programming with MPI and OpenACC

Jiri Kraus Senior Devtech Compute, NVIDIA
Highly-Rated Speaker
Jiri Kraus is a senior developer in the European DevTech team at NVIDIA. He focuses on multi-GPU programming models and NVIDIA's collaboration with the Juelich Super Computing Center.
Robert Henschel Director Science Community Tools, Indiana University
Robert Henschel received his M.Sc. from Technische Universität Dresden, Germany. He joined Indiana University in 2008, first as the manager for the Scientific Applications Group and since 2016 as the director for Science Community Tools. His responsibilities include making sure that the scientists from IU can utilize the IU IT resources to enable scientific discoveries and breakthroughs.
Guido Juckeland Head of Computational Science Group, Helmholtz-Zentrum Dresden-Rossendorf
Guido Juckeland received his Ph.D. from Technische Universität Dresden for his work on trace-based performance analysis for hardware accelerators. He has a long history of working with GPUs and teaching GPU programming. In 2016, he joined HZDR to head its newly founded computational science group. His responsibilities include working with researchers on better utilizing central IT resources for scientific purposes.

Learn how to program multi-GPU systems and GPU clusters using the Message Passing Interface (MPI) and OpenACC. We'll start with a quick introduction to MPI and how an NVIDIA(R) CUDA(R)-aware MPI implementation can be used with OpenACC. Other topics covered will include how to handle GPU affinity in multi-GPU systems and how to use NVIDIA performance analysis tools. As we'll be using GPUs hosted in the cloud, all you need to bring is a laptop with a modern browser. Prerequisites: C or Fortran; basic OpenACC and MPI experience is strongly recommended but not required. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.
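Handling GPU affinity, one of the covered topics, usually amounts to binding each MPI rank on a node to its own GPU. A common recipe, sketched here under the assumption of one rank per GPU:

#include <mpi.h>
#include <openacc.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Ranks sharing a node get consecutive local ranks via an MPI-3
  // shared-memory sub-communicator.
  MPI_Comm local;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &local);
  int local_rank;
  MPI_Comm_rank(local, &local_rank);

  // Bind this rank to one of the node's GPUs, round-robin.
  const int ngpus = acc_get_num_devices(acc_device_nvidia);
  acc_set_device_num(local_rank % ngpus, acc_device_nvidia);

  // ...all subsequent OpenACC regions in this rank use that GPU...

  MPI_Comm_free(&local);
  MPI_Finalize();
  return 0;
}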

Level: Intermediate
Type: Instructor-Led Lab
Tags: Programming Languages; HPC and Supercomputing

Day: TBD
Time: TBD
Location: TBD

L7115 - In-Depth Performance Analysis for OpenACC/CUDA®/OpenCL Applications with Score-P and Vampir

Guido Juckeland Head of Computational Science Group, Helmholtz-Zentrum Dresden-Rossendorf (HZDR)
Guido Juckeland joined HZDR in 2016 to head its newly founded computational science group. His responsibilities include working with researchers on better utilizing central IT resources for scientific purposes. Guido received his Ph.D. from Technische Universität Dresden for his work on trace-based performance analysis for hardware accelerators. He has a long history of working with GPUs and teaching GPU programming.
Robert Henschel Director Science Community Tools, Indiana University
Robert received his M.Sc. from Technische Universität Dresden, Germany. He joined Indiana University in 2008, first as the manager for the Scientific Applications Group and since 2016 as the director for Science Community Tools. His responsibilities include making sure that the scientists from IU can utilize the IU IT resources to enable scientific discoveries and breakthroughs.
Jiri Kraus Senior Devtech Compute, NVIDIA
Highly-Rated Speaker
Jiri Kraus is a senior developer in NVIDIA's European DevTech team. In his work he focuses on multi GPU programming models and the NVIDIA collaborations with the Juelich Super Computing Centre.

Work with Score-P and Vampir to learn how to dive into the execution properties of CUDA and OpenACC applications. We'll show how to use Score-P to generate a trace file and how to study it with Vampir. Additionally, we'll use the newly established OpenACC tools interface to show how OpenACC applications can be analyzed for performance bottlenecks. This lab uses GPU resources in the cloud, so bring your laptop.

Level: Intermediate
Type: Instructor-Led Lab
Tags: HPC and Supercomputing; Tools and Libraries

Day: TBD
Time: TBD
Location: TBD

L7130 - Introduction to OpenACC Directives

Jeff Larkin DevTech Software Engineer, NVIDIA
Highly-Rated Speaker
TBD
Fernanda Foertter HPC User Support Specialist, Oak Ridge National Lab
Highly-Rated Speaker
Fernanda Foertter is a member of the User Assistance Team at the National Center for Computational Sciences (NCCS) located at Oak Ridge National Laboratory (ORNL). This team is responsible for assisting all users at the Oak Ridge Leadership Computing Facility (OLCF). Additionally, Fernanda is responsible for the training program at the center, having created the scientific GPU Hackathons. She also has interests in data engineering and programming paradigms on heterogeneous architectures. Her interest in high performance computing started in 2006, when she was a computational materials science graduate student working on molecular dynamics simulations.
Guido Juckeland Head of Computational Science Group, Helmholtz-Zentrum Dresden-Rossendorf (HZDR)
Guido Juckeland joined HZDR in 2016 to head its newly founded computational science group. His responsibilities include working with researchers on better utilizing central IT resources for scientific purposes. Guido received his Ph.D. from Technische Universität Dresden for his work on trace-based performance analysis for hardware accelerators. He has a long history of working with GPUs and teaching GPU programming.

During this lab, you'll learn about OpenACC, a user-driven standard that offers a directive-based programming model for the scientific community to port their codes to multiple platforms without significant programming effort. The lab will introduce how to analyze and parallelize your code, as well as how to perform optimizations such as managing data movement. With access to a variety of supercomputers, researchers are looking for a solution that allows their codes to run not only on GPUs but on any architecture with minimal or no code change. Scientists report 2-10x performance increases with as little as a few weeks' effort using OpenACC. Prerequisites: while the lab does not assume any previous experience with OpenACC directives or GPU programming in general, programming experience with C or Fortran is desirable. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.
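The canonical exercise for this analyze-parallelize-optimize progression is a Jacobi iteration: first parallelize the stencil sweeps with directives, then add a data region so the grids stay resident on the GPU across iterations instead of being copied every sweep. A generic C++ sketch of the pattern (not necessarily the lab's exact code):

#include <cmath>
#include <vector>

int main() {
  const int n = 512;
  std::vector<double> A(n * n, 0.0), Anew(n * n, 0.0);
  for (int j = 0; j < n; ++j) A[j] = 1.0;  // hot top boundary row
  double* a = A.data();
  double* anew = Anew.data();
  double err = 1.0;
  int iter = 0;

  // Optimization step: keep both grids on the GPU for all iterations.
  #pragma acc data copy(a[0:n*n]) create(anew[0:n*n])
  while (err > 1e-4 && iter++ < 1000) {
    err = 0.0;
    // Parallelization step: offload the stencil sweep.
    #pragma acc parallel loop reduction(max:err)
    for (int i = 1; i < n - 1; ++i)
      for (int j = 1; j < n - 1; ++j) {
        anew[i*n + j] = 0.25 * (a[i*n + j - 1] + a[i*n + j + 1]
                              + a[(i-1)*n + j] + a[(i+1)*n + j]);
        err = fmax(err, fabs(anew[i*n + j] - a[i*n + j]));
      }
    #pragma acc parallel loop
    for (int i = 1; i < n - 1; ++i)
      for (int j = 1; j < n - 1; ++j) a[i*n + j] = anew[i*n + j];
  }
  return 0;
}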

Level: Beginner
Type: Instructor-Led Lab
Tags: HPC and Supercomputing; Programming Languages
Industry Segments: Higher Education / Research; Government / National Labs; Software

Day: TBD
Time: TBD
Location: TBD