Main Page/hybrid prog

This page documents the hybrid programming proposal for NekCEM (some notes are taken from the DOE Exascale workshop).

Programming Model Approaches

  • hybrid/evolutionary: MPI+ __ ?
    • MPI for inter-node programming, since the number of nodes and inter-node concerns are not expected to change dramatically
      • support for hybrid programming/interoperability
      • purer one-sided communications; active messages
      • asynchronous collectives
    • something else for intra-node
      1. OpenMP (Shared memory, aka Global Address Space)
        • introduction of locality-oriented concepts?
        • efforts in OpenMP 3.0 ?
      2. PGAS languages (Partitioned Global Address Space)
        • already support a notion of locality in a shared namespace
        • UPC (Unified Parallel C)/CAF need to relax their strictly SPMD execution model
      3. Sequoia
        • supports a strong notion of vertical locality
  • unified/holistic: __ ?
    • a single notation for inter- and intra-node programming?
    • traditional PGAS languages: UPC, CAF, Titanium
      • require extension to handle nested parallelism, vertical locality
    • HPCS languages: Chapel, X10, Fortress(?)
      • designed with locality and post-SPMD parallelism in mind
    • other candidates: Charm++, Global Arrays, ParalleX, ...
  • others
    • mainstream multi-core/GPU language: (sufficient promise to be funded?)
    • domain-specific language
      • fit your problem?
      • should focus on more general solutions
    • functional languages
      • never heavily adopted in mainstream or HPC
      • copy-on-write optimization and alias analysis?
    • parallel scripting languages?

Pros and Cons of Pthread and OpenMP

  • Pthread
    • Pros: low-level control of the program; well supported
    • Cons: the codebase must be recast into a threaded model, which requires considerable threading-specific code; thread counts tend to be hard-coded, so the result is not very portable
    • Misc: a thread pool can be used when the machine's processor details are unknown (more portable than a hard-coded thread count)
  • OpenMP
    • Pros: medium-grained control over threading; adjusts automatically to machine specifics; uses pragmas (rather than an API), so it does not interfere with the single-threaded codebase; easy to debug
    • Cons: compiler support on BG/P?
    • Misc:
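
To make the trade-off above concrete, here is a minimal sketch that sums an array both ways. It is an illustration only, not code from NekCEM: the names `slice_t`, `sum_slice`, `sum_pthreads`, and `sum_openmp` are hypothetical, and the Pthreads version assumes a system where `sysconf(_SC_NPROCESSORS_ONLN)` is available (e.g. Linux) so that the thread count does not have to be hard-coded.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Per-thread work item (hypothetical layout, for illustration only). */
typedef struct {
    const double *data;
    size_t begin, end;  /* half-open index range [begin, end) */
    double partial;     /* partial sum written by the worker */
    int started;        /* nonzero if a thread was created for this slice */
} slice_t;

static void *sum_slice(void *arg)
{
    slice_t *s = arg;
    double acc = 0.0;
    for (size_t i = s->begin; i < s->end; ++i)
        acc += s->data[i];
    s->partial = acc;
    return NULL;
}

/* Pthreads: thread creation, work splitting, and joining are all explicit. */
double sum_pthreads(const double *data, size_t n)
{
    /* Assumption: query the online core count instead of hard-coding it. */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    size_t nthreads = (cores > 0) ? (size_t)cores : 1;
    if (nthreads > n)
        nthreads = (n > 0) ? n : 1;

    pthread_t *tid = malloc(nthreads * sizeof *tid);
    slice_t *slice = malloc(nthreads * sizeof *slice);
    if (!tid || !slice) {
        /* Out of memory: fall back to a serial sum. */
        slice_t all = { data, 0, n, 0.0, 0 };
        free(tid);
        free(slice);
        sum_slice(&all);
        return all.partial;
    }

    for (size_t t = 0; t < nthreads; ++t) {
        slice[t].data = data;
        slice[t].begin = n * t / nthreads;
        slice[t].end = n * (t + 1) / nthreads;
        slice[t].partial = 0.0;
        slice[t].started =
            (pthread_create(&tid[t], NULL, sum_slice, &slice[t]) == 0);
        if (!slice[t].started)
            sum_slice(&slice[t]);  /* creation failed: do the work here */
    }

    double total = 0.0;
    for (size_t t = 0; t < nthreads; ++t) {
        if (slice[t].started)
            pthread_join(tid[t], NULL);
        total += slice[t].partial;
    }
    free(tid);
    free(slice);
    return total;
}

/* OpenMP: one pragma on the serial loop. If OpenMP is not enabled at
 * compile time, the pragma is ignored and the loop simply runs serially. */
double sum_openmp(const double *data, size_t n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)n; ++i)
        total += data[i];
    return total;
}
```

Both functions return the same value for the same input; the difference is in how much threading-specific code surrounds the loop. With GCC, for example, the OpenMP version needs `-fopenmp` to actually run in parallel.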

Expectation

  • parallelism: nested, dynamic, loosely-coupled, data-driven (i.e. post-SPMD programming/execution models)
    • to take advantage of architecture
    • to better support load balancing and resilience
  • locality: concepts for vertical control as well as horizontal (i.e. locality within a node rather than simply between nodes)
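
As one possible illustration of nested, dynamic parallelism inside a node, the sketch below uses the task construct introduced in OpenMP 3.0 to sum an array by recursive splitting. It is a hedged example rather than a proposal for NekCEM: `TASK_CUTOFF`, `sum_range`, and `task_sum` are hypothetical names, and the cutoff value is an arbitrary assumption.

```c
#include <stddef.h>

/* Assumption: below this many elements, spawning tasks costs more than it saves. */
#define TASK_CUTOFF 1024

/* Recursively sum x[lo, hi). Each level may spawn two child tasks, so the
 * amount of parallelism is decided at run time rather than fixed up front. */
static double sum_range(const double *x, size_t lo, size_t hi)
{
    if (hi - lo <= TASK_CUTOFF) {
        double acc = 0.0;
        for (size_t i = lo; i < hi; ++i)
            acc += x[i];
        return acc;
    }

    size_t mid = lo + (hi - lo) / 2;
    double left = 0.0, right = 0.0;

    #pragma omp task shared(left)
    left = sum_range(x, lo, mid);

    #pragma omp task shared(right)
    right = sum_range(x, mid, hi);

    /* Wait for both child tasks before combining their results. */
    #pragma omp taskwait
    return left + right;
}

/* Entry point: one thread creates the root of the task tree and the
 * rest of the team picks up tasks as they become available. */
double task_sum(const double *x, size_t n)
{
    double total = 0.0;
    #pragma omp parallel
    #pragma omp single
    total = sum_range(x, 0, n);
    return total;
}
```

Because idle threads pick up whatever tasks are pending, uneven subtrees are balanced by the runtime instead of by a static decomposition. Compiled without OpenMP support, the pragmas are ignored and the recursion runs serially.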

Tools: Debuggers, perf. analysis

  • challenges
    • need aggregation to hide details
    • need to report info in user's terms
  • good area for innovation (e.g. execution visualization to understand mapping of code to hardware)

Misc Notes

  • three main types of distributed memory programming:
    • active messages: a lower-level mechanism that can be used to implement data parallelism or message passing efficiently.
    • data parallel (aka loop-level parallelism): emphasizes the distributed (parallelized) nature of the data, as opposed to the processing (task parallelism).
    • message passing
  • multi-core and many-core processors
    • A multi-core processor is composed of two or more independent cores. One can describe it as an integrated circuit which has two or more individual processors (called cores in this sense). Manufacturers typically integrate the cores onto a single integrated circuit die (known as a chip multiprocessor or CMP), or onto multiple dies in a single chip package.
    • A many-core processor is one in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient, largely because of congestion in supplying sufficient instructions and data to the many processors. This threshold is roughly in the range of several tens of cores and probably requires a network on chip.
  • ILP and TLP
    • Some instruction-level parallelism (ILP) methods like superscalar pipelining are suitable for many applications, but are inefficient for others that tend to contain difficult-to-predict code.
    • Many applications are better suited to thread-level parallelism (TLP) methods, and using multiple independent CPUs is one common way to increase a system's overall TLP. The multithreading paradigm has become more popular as efforts to further exploit instruction-level parallelism have stalled since the late 1990s.
  • multi-threading vs. multi-processing
    • multi-threading advantage
      • If a thread gets a lot of cache misses, the other thread(s) can continue, taking advantage of the unused computing resources. This can lead to faster overall execution, as these resources would have been idle if only a single thread were executing.
    • multi-threading disadvantage
      • Execution times of a single thread are not improved and can be degraded, even when only one thread is executing. This is due to slower frequencies and/or additional pipeline stages that are necessary to accommodate thread-switching hardware.
    • Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system. The term also refers to the ability of a system to support more than one processor and/or the ability to allocate tasks between them.
    • SMP MIMD multiprocessing
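
The active-message entry in the notes above can be sketched as follows: each message names a handler, and the receiver runs that handler on the payload as soon as the message arrives. This is a single-process illustration of the dispatch idea only, with no network transport; every name in it (`am_message`, `am_endpoint`, `am_register`, `am_pack`, `am_deliver`, `am_accumulate`) is hypothetical and does not correspond to any particular active-message library.

```c
#include <stddef.h>
#include <string.h>

#define AM_MAX_HANDLERS 8
#define AM_MAX_PAYLOAD  32

/* A handler receives the endpoint's state plus the message payload. */
typedef void (*am_handler_fn)(void *state, const void *payload, size_t len);

/* The message carries the index of the handler to run on arrival. */
typedef struct {
    int handler_id;
    size_t len;
    unsigned char payload[AM_MAX_PAYLOAD];
} am_message;

/* Receiver side: a handler table and the state the handlers act on. */
typedef struct {
    am_handler_fn table[AM_MAX_HANDLERS];
    void *state;
} am_endpoint;

int am_register(am_endpoint *ep, int id, am_handler_fn fn)
{
    if (id < 0 || id >= AM_MAX_HANDLERS)
        return -1;
    ep->table[id] = fn;
    return 0;
}

/* Sender side: pack a handler id and a small payload into a message. */
int am_pack(am_message *msg, int id, const void *data, size_t len)
{
    if (len > AM_MAX_PAYLOAD)
        return -1;
    msg->handler_id = id;
    msg->len = len;
    memcpy(msg->payload, data, len);
    return 0;
}

/* On arrival, look up the handler and run it immediately on the payload. */
int am_deliver(am_endpoint *ep, const am_message *msg)
{
    if (msg->handler_id < 0 || msg->handler_id >= AM_MAX_HANDLERS)
        return -1;
    am_handler_fn fn = ep->table[msg->handler_id];
    if (!fn)
        return -1;
    fn(ep->state, msg->payload, msg->len);
    return 0;
}

/* Example handler: add a double from the payload to the receiver's state. */
void am_accumulate(void *state, const void *payload, size_t len)
{
    double v;
    if (len != sizeof v)
        return;
    memcpy(&v, payload, sizeof v);
    *(double *)state += v;
}
```

A remote accumulate, for instance, becomes `am_pack` on the sender and `am_deliver` on the receiver, which is the kind of building block on which message passing or data-parallel operations can be layered.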

Questions

  • thread-safe MPI implementation?
    • thread-safe usually means MPI_THREAD_MULTIPLE
    • only MPI_THREAD_FUNNELED is needed for the master-only style.