Main Page/hybrid prog

This page documents the hybrid programming proposal for NekCEM (some notes are taken from a DOE Exascale workshop).

Programming Model Approaches

hybrid/evolutionary: MPI+ __ ?
- MPI for inter-node programming, since the # of nodes and inter-node concerns are not expected to change dramatically
  - support for hybrid programming/interoperability
  - purer one-sided communications; active messages
  - asynchronous collectives
- something else for intra-node
  1. OpenMP (shared memory, i.e. a single global address space)
    - introduction of locality-oriented concepts?
    - efforts in OpenMP 3.0 ?
  2. PGAS languages (Partitioned Global Address Space)
    - already support a notion of locality in a shared namespace
    - UPC (Unified Parallel C) and CAF (Co-Array Fortran) need to relax their strictly SPMD execution model
  3. Sequoia
    - supports a strong notion of vertical locality

unified/holistic: __ ?
- a single notation for inter- and intra-node programming?
- traditional PGAS languages: UPC, CAF, Titanium
  - require extensions to handle nested parallelism and vertical locality
- HPCS languages: Chapel, X10, Fortress(?)
  - designed with locality and post-SPMD parallelism in mind
- other candidates: Charm++, Global Arrays, ParalleX, ...

others
- mainstream multi-core/GPU language: (sufficient promise to be funded?)
- domain-specific language
  - fit your problem?
  - should focus on more general solutions
- functional languages
  - never heavily adopted in mainstream or HPC
  - copy-on-write optimization and alias analysis?
- parallel scripting languages?

Pros and Cons of Pthread and OpenMP

Pthread
- Pros: low-level control of program, well supported
- Cons: the codebase must be recast into a threaded model, which requires considerable threading-specific code; thread counts etc. tend to be hard-coded, which is not very portable
- Misc: a thread pool can be used when the machine's processor details are unknown (more portable than a hard-coded thread count)
OpenMP
- Pros: medium-grained control over threading functionality; adjusts automatically to machine specifics; uses pragmas (rather than API calls), so it does not interfere with the single-threaded codebase; easy to debug as well
- Cons: compiler support on BG/P?
- Misc:
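
To make the trade-off above concrete, here is a minimal sketch of the same array sum written both ways. All names (`slice_t`, `sum_pthreads`, `sum_openmp`, `online_cpus`, `MAX_THREADS`) are hypothetical and not part of NekCEM. `online_cpus` assumes a system where `sysconf(_SC_NPROCESSORS_ONLN)` is available (e.g. Linux), as one way to avoid a hard-coded thread count.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_THREADS 64 /* illustrative upper bound, not a tuned value */

/* Work description for one Pthreads worker: a slice of the input array. */
typedef struct {
    const double *data;
    size_t begin;   /* first index owned by this worker */
    size_t end;     /* one past the last index owned by this worker */
    double partial; /* partial sum written by the worker */
} slice_t;

static void *sum_slice(void *arg)
{
    slice_t *s = (slice_t *)arg;
    double acc = 0.0;
    for (size_t i = s->begin; i < s->end; ++i)
        acc += s->data[i];
    s->partial = acc;
    return NULL;
}

/* Ask the OS for the processor count instead of hard-coding it
 * (assumes _SC_NPROCESSORS_ONLN is supported on the target system). */
long online_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}

/* Pthreads version: explicit slicing, thread creation, and joining. */
double sum_pthreads(const double *data, size_t n, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    slice_t slice[MAX_THREADS];
    int spawned[MAX_THREADS];
    double total = 0.0;

    assert(data != NULL || n == 0);
    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

    for (int t = 0; t < nthreads; ++t) {
        slice[t].data = data;
        slice[t].begin = n * (size_t)t / (size_t)nthreads;
        slice[t].end = n * (size_t)(t + 1) / (size_t)nthreads;
        slice[t].partial = 0.0;
        spawned[t] = (pthread_create(&tid[t], NULL, sum_slice, &slice[t]) == 0);
        if (!spawned[t])
            sum_slice(&slice[t]); /* creation failed: do the work here */
    }
    for (int t = 0; t < nthreads; ++t) {
        if (spawned[t])
            pthread_join(tid[t], NULL);
        total += slice[t].partial;
    }
    return total;
}

/* OpenMP version: the serial loop plus one pragma. Compiled without
 * OpenMP support, the pragma is ignored and the loop runs serially. */
double sum_openmp(const double *data, size_t n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)n; ++i)
        total += data[i];
    return total;
}
```

A caller might use `sum_pthreads(a, n, (int)online_cpus())`. The Pthreads version needs its own data structure, worker function, and join logic, while the OpenMP version leaves the single-threaded loop intact.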

Expectation

parallelism: nested, dynamic, loosely-coupled, data-driven (i.e. post-SPMD programming/execution models)
- to take advantage of architecture
- to better support load balancing and resilience
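
As one possible illustration of nested, dynamic parallelism inside a node, the sketch below uses OpenMP tasks (introduced in OpenMP 3.0) to sum an array recursively. The names `sum_range`, `sum_tasks`, and `SERIAL_CUTOFF` are hypothetical, and the cutoff value is an untuned assumption.

```c
#include <assert.h>
#include <stddef.h>

#define SERIAL_CUTOFF 1024 /* below this size, sum serially (untuned) */

/* Recursively split [lo, hi) and hand each half to an OpenMP task.
 * The runtime decides dynamically which thread executes which task. */
static double sum_range(const double *data, size_t lo, size_t hi)
{
    if (hi - lo <= SERIAL_CUTOFF) {
        double acc = 0.0;
        for (size_t i = lo; i < hi; ++i)
            acc += data[i];
        return acc;
    }

    size_t mid = lo + (hi - lo) / 2;
    double left = 0.0, right = 0.0;

    #pragma omp task shared(left)
    left = sum_range(data, lo, mid);
    #pragma omp task shared(right)
    right = sum_range(data, mid, hi);
    #pragma omp taskwait /* wait for both child tasks before combining */

    return left + right;
}

/* Entry point: one thread creates the root of the task tree while the
 * other threads in the team pick up the generated tasks. */
double sum_tasks(const double *data, size_t n)
{
    double total = 0.0;
    assert(data != NULL || n == 0);
    #pragma omp parallel
    #pragma omp single
    total = sum_range(data, 0, n);
    return total;
}
```

Unlike a statically scheduled loop, the task tree adapts to however the work splits at run time, which is the property the load-balancing point above relies on. Compiled without OpenMP support, the pragmas are ignored and the recursion runs serially.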

locality: concepts for vertical control as well as horizontal (i.e. locality within a node rather than simply between nodes)

Tools: Debuggers, perf. analysis

challenges
- need aggregation to hide details
- need to report info in user's terms
good area for innovation (e.g. execution visualization to understand mapping of code to hardware)

Misc Notes

three main types of distributed memory programming:
- active messages: actually a lower-level mechanism that can be used to implement data parallelism or message passing efficiently.
- data parallel (aka loop-level parallelism): emphasizes the distributed (parallelized) nature of the data, as opposed to the processing (task parallelism).
- message passing
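
In the data parallel style, each process owns a piece of the data. A minimal sketch of a block decomposition follows, assuming `n` elements are divided as evenly as possible over `nranks` processes. The names `block_t` and `block_partition` are hypothetical. In an MPI code, `nranks` and `rank` would typically come from `MPI_Comm_size` and `MPI_Comm_rank`.

```c
#include <assert.h>
#include <stddef.h>

/* Contiguous block of a global index range owned by one process. */
typedef struct {
    size_t start; /* first global index owned */
    size_t count; /* number of elements owned */
} block_t;

/* Split n elements over nranks processes. The first (n % nranks)
 * ranks receive one extra element, so block sizes differ by at most 1. */
block_t block_partition(size_t n, int nranks, int rank)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks);

    size_t p = (size_t)nranks;
    size_t r = (size_t)rank;
    size_t base = n / p; /* minimum block size */
    size_t rem = n % p;  /* number of ranks that get one extra element */

    block_t b;
    b.start = r * base + (r < rem ? r : rem);
    b.count = base + (r < rem ? 1 : 0);
    return b;
}
```

For example, 10 elements over 3 ranks gives blocks starting at 0, 4, and 7 with 4, 3, and 3 elements. Each process then loops only over its own block, which is where the "loop-level" name comes from.
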
multi-core and many-core processors
- A multi-core processor is composed of two or more independent cores. One can describe it as an integrated circuit which has two or more individual processors (called cores in this sense).[1] Manufacturers typically integrate the cores onto a single integrated circuit die (known as a chip multiprocessor or CMP), or onto multiple dies in a single chip package.
- A many-core processor is one in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient — largely due to issues with congestion supplying sufficient instructions and data to the many processors. This threshold is roughly in the range of several tens of cores and probably requires a network on chip.
ILP and TLP
- Some instruction-level parallelism (ILP) methods like superscalar pipelining are suitable for many applications, but are inefficient for others that tend to contain difficult-to-predict code.
- Many applications are better suited to thread-level parallelism (TLP) methods, and using multiple independent CPUs is one common way to increase a system's overall TLP. The multithreading paradigm has become more popular as efforts to further exploit instruction-level parallelism have stalled since the late 1990s.
multi-threading vs. multi-processing
- multi-threading advantage
  - If a thread gets a lot of cache misses, the other thread(s) can continue, taking advantage of the unused computing resources, which thus can lead to faster overall execution, as these resources would have been idle if only a single thread was executed.
- multi-threading disadvantage
  - Execution times of a single thread are not improved and can be degraded, even when only one thread is executing. This is due to slower frequencies and/or additional pipeline stages that are necessary to accommodate thread-switching hardware.
- Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system. The term also refers to the ability of a system to support more than one processor and/or the ability to allocate tasks between them.
- SMP MIMD multiprocessing

Questions

thread-safe MPI implementation?
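- thread-safe usually means MPI_THREAD_MULTIPLE (any thread may call MPI at any time); the level is requested through MPI_Init_thread, and the library reports the level it actually provides
- only MPI_THREAD_FUNNELED is needed for the master-only style, where only the main thread makes MPI calls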
