Nekcem - User contributions [en]

Main Page/DAT

2012-06-13T20:43:15Z

Jingfu:

== SIZEu file ==

ldim: dimension

lxi: the polynomial degree
lx1: the number of grid points in one direction of an element (lx1 = lxi+1)
ly1=lx1; lz1=lx1

lelt: the maximum number of elements per core

lp : the maximum number of cores

Use (E,lelt,lx1,lp) to represent the size of a problem:

E = total number of elements, lelt = number of elements per core,
lx1 = grid points in one direction, lp = number of cores.

There are many different rea files: c3d_6 (E=136K), c3d_7 (E=273K), etc.

Even for a fixed number of elements, as with c3d_7 (E=273K), memory usage differs
for different numbers of cores (lp=32k, 65k, 131k).

A major change in the code gave a 2x reduction in memory usage, which made it
possible to go further up from the 1.1 billion to the 2.2 billion grid-point case:

from (E=273k, lx1=16, lp=131k), the limit in the past ---> (E=546k, lx1=16, lp=131k)

The memory usage for (E=999k, lx1=16, lp=131k) was 500M, so I couldn't run it on BGP. But it would be OK on XK6,
even with lp=262k.

If we assume "nc" is approximately the same as the total number of grid points "n",
we have the following, per grid point:

For the header:
(1) coordinates => 3 columns * 4 bytes
(2) cell data => 9 columns * 4 bytes
(3) cell type => 1 column * 4 bytes

For each of the 8 fields:
3 columns * 4 bytes

So, we have 275M*(8 fields*3*4) + 275M*(3+9+1)*4 = 275M*148 bytes = 40.7 GB

Or, neglecting the cell type, we get 275M*(8*3*4+12*4) = 275M*144 bytes = 39.6 GB

The problem size on 32k cores was npt = E*lx1*lx1*lx1 = 546000*16*16*16
(where lx1=lxi+1), i.e., npt = 2,236,416,000 = 2.2 billion grid points.

The output size with 4 fields will be:
2236416000*(4*3*4) + 2236416000*(1+9+3)*4 = 223,641,600,000 bytes (about 223 GB)
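
The size estimates above can be reproduced with a short script. This is only a sketch under the assumptions stated in these notes (4-byte values, 3 columns per field, and a header of 3 coordinate + 9 cell-data + 1 cell-type columns per grid point); the function names `grid_points` and `output_bytes` are made up for illustration and are not part of NekCEM.

```python
def grid_points(num_elements: int, lx1: int) -> int:
    """Total grid points npt = E * lx1^3 for a 3D problem."""
    return num_elements * lx1 ** 3


def output_bytes(npt: int, nfields: int, with_cell_type: bool = True,
                 bytes_per_value: int = 4) -> int:
    """Estimated output size, assuming nc is about the same as npt.

    Header: 3 (coordinates) + 9 (cell data) + 1 (cell type) columns.
    Fields: 3 columns per field.
    """
    header_columns = 3 + 9 + (1 if with_cell_type else 0)
    field_columns = nfields * 3
    return npt * (field_columns + header_columns) * bytes_per_value


if __name__ == "__main__":
    npt = grid_points(546_000, 16)
    print(npt)                                   # 2236416000
    print(output_bytes(npt, nfields=4))          # 223641600000
    print(output_bytes(275_000_000, nfields=8))  # 40700000000
```

The same function can be used to size an I/O buffer once the per-writer share of npt is known.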

Memory required is 420M.

TODO:
make sure the thread is joined at the last step;
make sure the buffer size is optimized according to the formula given above;

Main Page/DAT

2012-06-13T20:43:04Z

Jingfu:

Main Page/DAT

2012-06-13T20:42:54Z

Jingfu:

Main Page/DAT

2012-06-13T20:41:54Z

Jingfu:

== SIZEu file ==

ldim: dimension

lxi: the polynomial degree
lx1: the number of grid points in one direction of an element (lx1 = lxi+1)
ly1=lx1; lz1=lx1

lelt: the maximum number of elements per core

lp : the maximum number of cores

We'll have to use (E,lelt,lx1,lp) to represent the size of a problem, instead of c3d.rea.

E = total number of elements, lelt = number of elements per core,
lx1 = grid points in one direction, lp = number of cores.

I had many different rea files: c3d_6 (E=136K), c3d_7 (E=273K), etc.

Even for a fixed number of elements, as with c3d_7 (E=273K), memory usage differs
for different numbers of cores (lp=32k, 65k, 131k).

A major change in the code gave a 2x reduction in memory usage, which made it
possible to go further up from the 1.1 billion to the 2.2 billion grid-point case:

from (E=273k, lx1=16, lp=131k), the limit in the past ---> (E=546k, lx1=16, lp=131k)

The memory usage for (E=999k, lx1=16, lp=131k) was 500M, so I couldn't run it on BGP. But it would be OK on XK6,
even with lp=262k.

If we assume "nc" is approximately the same as the total number of grid points "n",
we have the following, per grid point:

For the header:
(1) coordinates => 3 columns * 4 bytes
(2) cell data => 9 columns * 4 bytes
(3) cell type => 1 column * 4 bytes

For each of the 8 fields:
3 columns * 4 bytes

So, we have 275M*(8 fields*3*4) + 275M*(3+9+1)*4 = 275M*148 bytes = 40.7 GB

Or, neglecting the cell type, we get 275M*(8*3*4+12*4) = 275M*144 bytes = 39.6 GB

The problem size on 32k cores was npt = E*lx1*lx1*lx1 = 546000*16*16*16
(where lx1=lxi+1), i.e., npt = 2,236,416,000 = 2.2 billion grid points.

The output size with 4 fields will be:
2236416000*(4*3*4) + 2236416000*(1+9+3)*4 = 223,641,600,000 bytes (about 223 GB)

Memory required is 420M.

TODO:
make sure the thread is joined at the last step;
make sure the buffer size is optimized according to the formula given above;

Main Page/DAT

2012-03-25T22:39:55Z

Jingfu:

== SIZEu file ==

ldim: dimension

lxi: the polynomial degree
lx1: the number of grid points in one direction of an element (lx1 = lxi+1)
ly1=lx1; lz1=lx1

lelt: the maximum number of elements per core

lp : the maximum number of cores

We'll have to use (E,lelt,lx1,lp) to represent the size of a problem, instead of c3d.rea.

E = total number of elements, lelt = number of elements per core,
lx1 = grid points in one direction, lp = number of cores.

I had many different rea files: c3d_6 (E=136K), c3d_7 (E=273K), etc.

Even for a fixed number of elements, as with c3d_7 (E=273K), memory usage differs
for different numbers of cores (lp=32k, 65k, 131k). So... sorry, I wouldn't know which
case it is if it's just c3d...

By the way, please remember I have made a major change in the code so far, for a 2x
reduction in memory usage, to go further up from the 1.1 billion to the 2.2 billion grid-point case:

from (E=273k, lx1=16, lp=131k), the limit in the past ---> (E=546k, lx1=16, lp=131k)

The memory usage for (E=999k, lx1=16, lp=131k) was 500M, so I couldn't run it on BGP. But it would be OK on XK6,
even with lp=262k.

If you still have the old version of the code, you can compile it and
see what the memory usage was. In the example below, the fourth number (92352484) is always
the memory usage.

If we assume "nc" is approximately the same as the total number of grid points "n",
we have the following, per grid point:

For the header:
(1) coordinates => 3 columns * 4 bytes
(2) cell data => 9 columns * 4 bytes
(3) cell type => 1 column * 4 bytes

For each of the 8 fields:
3 columns * 4 bytes

So, we have 275M*(8 fields*3*4) + 275M*(3+9+1)*4 = 275M*148 bytes = 40.7 GB ?

Or, neglecting the cell type, we get 275M*(8*3*4+12*4) = 275M*144 bytes = 39.6 GB ?

The problem size on 32k cores was npt = E*lx1*lx1*lx1 = 546000*16*16*16
(where lx1=lxi+1), i.e., npt = 2,236,416,000 = 2.2 billion grid points.

The output size with 4 fields will be:
2236416000*(4*3*4) + 2236416000*(1+9+3)*4 = 223,641,600,000 bytes (about 223 GB)

Memory required is 420M (I don't know how to get a breakdown of memory between computation and I/O).
But let me know if there is any way to get that info -- we can work on it.

TODO:
make sure the thread is joined at the last step;
make sure the buffer size is optimized according to the formula given above;

Main Page/DAT

2012-03-25T04:00:51Z

Jingfu:

Main Page/DAT

2012-03-25T03:31:20Z

Jingfu:

Main Page/DAT

2012-03-25T03:28:35Z

Jingfu:

Main Page/DAT

2012-03-25T03:27:05Z

Jingfu:

If you still have the old version of the code, you can compile it and
see what the memory usage was. In the example below, the fourth number (92352484) is always
the memory usage.

======
jl_sparse_cholesky.o jl_poly.o jl_tensor.o jl_findpt.o jl_pfindpt.o comm_mpi2.o rbIO_nekcem.o vtkbin.o coIO_nekcem.o coIO_nekcem_read.o io_util.o mpiio_util.o io_driver.o -llapack -lblas
text data bss dec hex filename
4173564 266664 87912256 92352484 5812fe4 nekcem
I am done
======

Let me know if not clear on this --
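
For reference, the fourth number in that output is the "dec" column, which is the sum of the text, data and bss columns. A minimal sketch of reading that line, assuming it has the six-column layout shown above (the helper name `parse_size_line` is hypothetical):

```python
def parse_size_line(line: str) -> dict:
    """Split one 'text data bss dec hex filename' row into named values."""
    text, data, bss, dec, hex_total, filename = line.split()
    return {
        "text": int(text),
        "data": int(data),
        "bss": int(bss),
        "dec": int(dec),            # total, in decimal
        "hex": int(hex_total, 16),  # the same total, in hexadecimal
        "filename": filename,
    }


if __name__ == "__main__":
    row = parse_size_line("4173564 266664 87912256 92352484 5812fe4 nekcem")
    # The total should equal text + data + bss.
    print(row["dec"], row["text"] + row["data"] + row["bss"])
```

In this example almost all of the total comes from bss, i.e. statically allocated arrays.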

Main Page/RUN

2012-03-15T04:27:41Z

Jingfu: /* Execute */

== Getting the Source ==

NEKCEM is available for download via the Subversion repository:

svn co https://svn.mcs.anl.gov/repos/NEKCEM

It is also recommended to download ParaView.

== Contents of NEKCEM package ==

The NEKCEM package contains the source code, scripts, examples,
libraries used, and documentation.

* src: source code
* bin: a collection of scripts for building and running NEKCEM
** makenek: to compile, run ../../bin/makenek under an 'example' dir; see makenek --help for options
** nek: to run, use ../../bin/nek; see nek --help for options
** cleanall: to clean, run ../../bin/cleanall
* examples: sample problems including SIZEu, *.rea, *.map, *.usr (some special cases have additional files)
* libs: BLAS and LAPACK can be placed here if not already installed on your system
* tool: source codes for other utilities, mainly for meshing (detail below)
* doc: documentation

== Compile ==

cd NEKCEM/trunk/examples/cylwave
../../bin/makenek cylwave

== Execute ==

cd NEKCEM/trunk/examples/cylwave
../../bin/nek cylwave #np

Note: on Jaguar, run ../../bin/nek cylwave #np1 #np2, where #np1 is the actual number of cores you need,
and #np2 is the number of cores you request from the system, which has to be a multiple of 16.
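
One way to pick #np2 is to round #np1 up to the next multiple of 16. The sketch below assumes that the smallest such multiple is what you want to request; the note above only says that #np2 must be a multiple of 16, and the function name `jaguar_core_request` is hypothetical.

```python
def jaguar_core_request(np1: int, multiple: int = 16) -> int:
    """Return the smallest multiple of `multiple` that is >= np1."""
    if np1 <= 0:
        raise ValueError("np1 must be a positive core count")
    return ((np1 + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    for np1 in (4, 16, 100):
        np2 = jaguar_core_request(np1)
        # Print the command line that would be used for this core count.
        print(f"../../bin/nek cylwave {np1} {np2}")
```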

Main Page/RUN

2012-03-15T04:27:26Z

Jingfu: /* Execute */

Main Page/RUN

2012-03-15T04:27:00Z

Jingfu:


Main Page/PIO

2011-12-17T04:36:08Z

Jingfu: /* Usage Introduction */

This is the document page for parallel I/O library developed for [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM].

== Background ==
;File Format
:Binary (used for production, compact size), or ASCII (used for debugging, human-readable)

== Usage Introduction ==

Users select the I/O method for a specific example by setting parameter 81 (param(81)) to a value from 1 to 8 in the .rea file.

Several '''advanced parallel I/O''' algorithms based on the MPI-IO library were developed. 
The performance analysis of these approaches is detailed in ''Parallel I/O
performance for application-level checkpointing on the Blue Gene/P, by Jing Fu et al.'' [http://www.mcs.anl.gov/~mmin]

proc=processors
'''param(81) = 4''': collective IO for N proc to 1 file ---> mpi-binary-N1-xxx.vtk
'''param(81) = 5''': collective IO for N proc to multiple M-files ---> mpi-binary-NM-xxx.vtk
'''param(81) = 6''': reduced-blocking IO for N proc to 1 file with M writers ---> mpi-binary-NM1-xxx.vtk
'''param(81) = 7''': reduced-blocking IO for N proc to 1 file with M writers ---> mpi-ascii-NM1-xxx.vtk
'''param(81) = 8''': reduced-blocking IO for N proc to multiple M-files with M writers ---> mpi-binary-NMM-xxx.vtk

'''Note''' that param(82) and param(83) need to be set correctly in *.rea file.
'''param(80)''' = ''exact'' total number of fields to be written
'''param(82)''' = number of output files
'''param(83)''' = frequency of restart output files, 0 (no restart output), #nn (iostep*nn)
'''param(84)''' = invoked with dump_number from the name of the restarting output file

----
For '''traditional I/O''' approaches based on one file per processor, using ''old libraries'', one can use

'''param(81) = 2''': use Fortran I/O library (ASCII, VTK format) ---> ascii-xxx.vtk
'''param(81) = 3''': use C-POSIX I/O libraries (binary, VTK format) ---> binary-xxx.vtk

'''param(81) = 1''': use nek5000's old output format (ASCII) -> xxx.fld
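
The param(81) choices listed above can be summarized as a lookup table. This is only an illustrative sketch: the table restates the descriptions and file name patterns from this page, while the names `IO_MODES` and `describe_io_mode` are made up and do not exist in NekCEM.

```python
# param(81) value -> (I/O approach, output file name pattern)
IO_MODES = {
    1: ("nek5000 old output format (ASCII)", "xxx.fld"),
    2: ("Fortran I/O library (ASCII, VTK format)", "ascii-xxx.vtk"),
    3: ("C-POSIX I/O libraries (binary, VTK format)", "binary-xxx.vtk"),
    4: ("collective IO, N proc to 1 file", "mpi-binary-N1-xxx.vtk"),
    5: ("collective IO, N proc to multiple M-files", "mpi-binary-NM-xxx.vtk"),
    6: ("reduced-blocking IO, N proc to 1 file with M writers", "mpi-binary-NM1-xxx.vtk"),
    7: ("reduced-blocking IO, N proc to 1 file with M writers", "mpi-ascii-NM1-xxx.vtk"),
    8: ("reduced-blocking IO, N proc to multiple M-files with M writers", "mpi-binary-NMM-xxx.vtk"),
}


def describe_io_mode(param81: int) -> str:
    """Return a one-line summary for a param(81) value."""
    if param81 not in IO_MODES:
        raise ValueError(f"param(81) must be 1..8, got {param81}")
    approach, pattern = IO_MODES[param81]
    kind = "MPI-IO" if param81 >= 4 else "traditional"
    return f"param(81) = {param81} [{kind}]: {approach} ---> {pattern}"


if __name__ == "__main__":
    for value in sorted(IO_MODES):
        print(describe_io_mode(value))
```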

== Visualization ==

The output files with param(81)=2,3,4,5,6,7,8 can be visualized with '''ParaView''' and '''VisIt'''. 
The output with param(81)=1 can be visualized with nek5000 visualization tool, '''postx'''

Main Page/bgQ

2011-08-18T06:08:29Z

Jingfu: Blanked the page

Main Page/hybrid prog

2011-06-13T16:31:13Z

Jingfu: /* Questions */

This is the document page for hybrid programming proposal for [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM] (some notes taken from DOE Exascale workshop).

==Programming Model Approaches ==
* hybrid/evolutionary: MPI+ __ ?
** MPI for inter-node programming, since the number of nodes and inter-node concerns are not expected to change dramatically
*** support for hybrid programming/interoperability
*** purer one-sided communications; active messages
*** asynchronous collectives
** something else for intra-node
**# OpenMP (Shared memory, aka Global Address Space)
**#* introduction of locality-oriented concepts?
**#* efforts in OpenMP 3.0 ?
**# PGAS languages (Partitioned Global Address Space)
**#* already support a notion of locality in a shared namespace
**#* UPC (Unified Parallel C)/CAF need to relax strictly SPMD execution model
**# Sequoia
**#* supports a strong notion of vertical locality

* unified/holistic: __ ?
** a single notation for inter- and intra-node programming?
** traditional PGAS languages: UPC, CAF, Titanium
*** require extension to handle nested parallelism, vertical locality
** HPCS languages: Chapel, X10, Fortress(?)
*** designed with locality and post-SPMD parallelism in mind
** other candidates: Charm++, Global Arrays, Parallel X, ...

* others
** mainstream multi-core/GPU language: (sufficient promise to be funded?)
**domain-specific language
*** fit your problem?
*** should focus on more general solutions
**functional languages
*** never heavily adopted in mainstream or HPC
*** copy-on-write optimization and alias analysis?
** parallel scripting languages?

==Pros and Cons of Pthread and OpenMP==
*Pthread
** Pros: low-level control of program, well supported
** Cons: need to cast the codebase into a threaded model, which requires considerable threading-specific code; hard-coded thread count etc., not very portable
** Misc: can use a thread pool when not sure about machine processor details (more portable than a hard-coded thread count)
*OpenMP
** Pros: medium-grained control over threading functionality; auto-adjusts according to machine specifics; uses pragmas (rather than an API), so it does not interfere with the single-threaded codebase; easy to debug as well
** Cons: compiler support on BG/P?
** Misc:

==Expectation==
*parallelism: nested, dynamic, loosely-coupled, data-driven (i.e. ''post-SPMD'' programming/execution models)
** to take advantage of architecture
** to better support load balancing and resilience

* locality: concepts for vertical control as well as horizontal (i.e. locality within a node rather than simply between nodes)

== Tools: Debuggers, perf. analysis==
* challenges
** need aggregation to hide details
** need to report info in user's terms
*good area for innovation (e.g. execution visualization to understand mapping of code to hardware)

==Misc Notes==
* three main types of distributed memory programming:
** active message. Active Messages are actually a lower-level mechanism that can be used to implement data parallel or message passing efficiently.
** data parallel (aka loop-level parallelism). Data parallelism emphasizes the distributed (parallelized) nature of the data, as opposed to the processing (task parallelism).
** message passing
* multi-core and many-core processors
** A multi-core processor is composed of two or more independent cores. One can describe it as an integrated circuit which has two or more individual processors (called cores in this sense). Manufacturers typically integrate the cores onto a single integrated circuit die (known as a chip multiprocessor or CMP), or onto multiple dies in a single chip package.
** A many-core processor is one in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient — largely due to issues with congestion supplying sufficient instructions and data to the many processors. This threshold is roughly in the range of several tens of cores and probably requires a network on chip.
* ILP and TLP
** Some instruction-level parallelism (ILP) methods like superscalar pipelining are suitable for many applications, but are inefficient for others that tend to contain difficult-to-predict code.
** Many applications are better suited to thread level parallelism (TLP) methods, and multiple independent CPUs is one common method used to increase a system's overall TLP. The Multithreading paradigm has become more popular as efforts to further exploit instruction level parallelism have stalled since the late-1990s.
* multi-threading vs. multi-processing
** multi-threading advantage
*** If a thread gets a lot of cache misses, the other thread(s) can continue, taking advantage of the unused computing resources, which thus can lead to faster overall execution, as these resources would have been idle if only a single thread was executed.
**multi-threading disadvantage
*** Execution times of a single-thread are not improved but can be degraded, even when only one thread is executing. This is due to slower frequencies and/or additional pipeline stages that are necessary to accommodate thread-switching hardware.
** Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system. The term also refers to the ability of a system to support more than one processor and/or the ability to allocate tasks between them.
** SMP MIMD multiprocessing

==Questions==
* thread-safe MPI implementation?
** thread-safe usually means MPI_THREAD_MULTIPLE
** only need MPI_THREAD_FUNNELED for master-only style.

Main Page/PIO

2011-06-13T16:23:39Z

Jingfu:

This is the document page for parallel I/O library developed for [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM].

== Background ==
;File Format
:Binary (used for production, compact size), or ASCII (used for debugging, human-readable)

== Usage Introduction ==

Users select the I/O method for a specific example by setting parameter 81 (param(81)) to a value from 1 to 8 in the .rea file.

Several '''advanced parallel I/O''' algorithms based on the MPI-IO library were developed. 
The performance analysis of these approaches is detailed in ''Parallel I/O
performance for application-level checkpointing on the Blue Gene/P, by Jing Fu et al.'' [http://www.mcs.anl.gov/~mmin]

proc=processors
'''param(81) = 4''': collective IO for N proc to 1 file ---> mpi-binary-N1-xxx.vtk
'''param(81) = 5''': collective IO for N proc to multiple M-files ---> mpi-binary-NM-xxx.vtk
'''param(81) = 6''': reduced-blocking IO for N proc to 1 file with M writers ---> mpi-binary-NM1-xxx.vtk
'''param(81) = 7''': reduced-blocking IO for N proc to 1 file with M writers ---> mpi-ascii-NM1-xxx.vtk
'''param(81) = 8''': reduced-blocking IO for N proc to multiple M-files with M writers ---> mpi-binary-NMM-xxx.vtk

'''Note''' that param(82) and param(83) need to be set correctly in *.rea file.
'''param(82)''' = number of output files
'''param(83)''' = max number of fields to be written

----
For '''traditional I/O''' approaches based on one file per processor, using ''old libraries'', one can use

'''param(81) = 2''': use Fortran I/O library (ASCII, VTK format) ---> ascii-xxx.vtk
'''param(81) = 3''': use C-POSIX I/O libraries (binary, VTK format) ---> binary-xxx.vtk

'''param(81) = 1''': use nek5000's old output format (ASCII) -> xxx.fld

== Visualization ==

The output files with param(81)=2,3,4,5,6,7,8 can be visualized with '''ParaView''' and '''VisIt'''. 
The output with param(81)=1 can be visualized with nek5000 visualization tool, '''postx'''

Main Page/C Fortran

2011-05-19T18:49:55Z

Jingfu:

== PGI compiler name mangling convention ==
When programs are compiled using one of the PGI Fortran compilers on UNIX systems, an underscore is appended to Fortran global names (names of functions, subroutines and common blocks).
This mechanism distinguishes Fortran name space from C/C++ name space.

If you call a C/C++ function from Fortran, you should rename the C/C++ function by appending an underscore (or use C$PRAGMA C in the Fortran program, refer to Chapter 9, Optimization Directives and Pragmas, for details on C$PRAGMA C)

If you call a Fortran function from C/C++, you should append an underscore to the Fortran function name in the calling program.
(source: http://www.tacc.utexas.edu/services/userguides/pgi/pgiws_ug/pgi32u07.htm#Heading93)

==Name mangling in Fortran (from Wikipedia)==

Name mangling is also necessary in Fortran compilers, originally because the language is case insensitive. Further mangling requirements were imposed later in the evolution of the language because of the addition of modules and other features in the Fortran 90 standard. The case mangling, especially, is a common issue that must be dealt with in order to call Fortran libraries (such as LAPACK) from other languages (such as C).

Because of the case insensitivity, the name of a subroutine or function "FOO" must be converted to a canonical case and format by the Fortran compiler so that it will be linked in the same way regardless of case. Different compilers have implemented this in various ways, and no standardization has occurred.

The AIX and HP-UX Fortran compilers convert all identifiers to lower case ("foo"), while the Cray Unicos Fortran compilers converted identifiers to all upper case ("FOO").

The GNU g77 compiler converts identifiers to lower case plus an underscore ("foo_"), except that identifiers already containing an underscore ("FOO_BAR") have two underscores appended ("foo_bar__"), following a convention established by f2c.

Many other compilers, including SGI's IRIX compilers, gfortran, and Intel's Fortran compiler, convert all identifiers to lower case plus an underscore ("foo_" and "foo_bar_").

Identifiers in Fortran 90 modules must be further mangled, because the same subroutine name may apply to different routines in different modules.
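
The conventions described above can be written down as a small function that returns the symbol name a C caller would have to use. This is a sketch of the rules as quoted here, not a tool that inspects a real compiler; the compiler labels and the function name `mangle` are hypothetical, and module-level mangling is not covered.

```python
def mangle(name: str, compiler: str) -> str:
    """Return the external symbol name for a Fortran global `name`."""
    lower = name.lower()
    if compiler in ("aix", "hp-ux"):
        # lower case, no underscore
        return lower
    if compiler == "cray-unicos":
        # all upper case
        return name.upper()
    if compiler == "g77":
        # f2c convention: two underscores if the name already contains one
        return lower + ("__" if "_" in lower else "_")
    if compiler in ("irix", "gfortran", "intel"):
        # lower case plus a single underscore
        return lower + "_"
    raise ValueError(f"no convention recorded for compiler {compiler!r}")


if __name__ == "__main__":
    for compiler in ("aix", "cray-unicos", "g77", "gfortran"):
        print(compiler, mangle("FOO", compiler), mangle("FOO_BAR", compiler))
```

When calling a Fortran library from C, the mangled name is what must appear in the C source (for example `foo_` for a routine declared as `FOO` under gfortran).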

2.1- SUBROUTINE

For a Fortran CALL SUB the corresponding C routine has to be named:

;SUB
:all upper case on Cray with cft77 compiler
;sub
:all lower case on Apollo with ftn compiler
:case insensitive on IBM/370 and VMS
;sub_
:lower case with underscore added on all other systems

Main Page/C Fortran

2011-05-19T18:49:40Z

Jingfu:


Main Page/C Fortran

2011-05-19T18:48:23Z

Jingfu:


Main Page/C Fortran

2011-05-19T18:47:38Z

Jingfu:

Main Page/C Fortran

2011-05-19T18:47:03Z

Jingfu:

Main Page/C Fortran

2011-05-19T18:30:01Z

Jingfu: Created page with "2.1- SUBROUTINE For a Fortran CALL SUB the corresponding C routine has to be named: SUB all upper case on Cray with cft77 compiler sub all lower case on Apollo with ft…"


Main Page/faq

2011-05-19T18:28:33Z

Jingfu:

This is the resource listing page for [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM].
==Resource Links==
*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/PIO Parallel I/O of NekCEM]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/hybrid_prog Hybrid programming (proposed work)]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/aio Sync/Async Blocking/non-blocking I/O]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/bgQ Blue Gene/Q and Mira]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/C_Fortran C Fortran mixed programming]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/Meeting_Notes Meeting Notes]

== Implementation ==

=== I/O code ===

* I/O functions are invoked from the cem_out function in cem_dg.F (and cem_dg2.F).
* The parallel I/O routines are implemented in vtkbin.c and rbIO_nekcem.c.
* vtkcommon.c and vtkcommon.h hold common functions and global variables.

* cem_out_fields3 (in cem_dg.F)
** openfile3(dumpno, nid) !vtkbin.c
** vtk_dump_header3
*** writeheader3() !vtkbin.c
*** writenodes3() !vtkbin.c
*** write2dcells3 !vtkbin.c or write3dcells3 !vtkbin.c
** vtk_dump_field3
*** writefield3 !vtkbin.c
** close_file3 !vtkbin.c

* Binary file → ASCII file: transfer double/float/int/read to chars then write out
** float (4 bytes) → %18.8E
** int (4 bytes) → %10d
** long long (8 bytes) → %18lld
** elemType → %4d
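
The conversions listed above can be sketched with snprintf. This is an illustration only: the helper names are hypothetical and are not the functions in vtkbin.c; only the format strings come from the notes.

```c
#include <stdio.h>

/* Each helper formats one value into buf using the field width from the
 * notes and returns the number of characters written (excluding '\0'). */
int format_float(char *buf, size_t size, float v)
{
    return snprintf(buf, size, "%18.8E", v);   /* float (4 bytes) */
}

int format_int(char *buf, size_t size, int v)
{
    return snprintf(buf, size, "%10d", v);     /* int (4 bytes) */
}

int format_long_long(char *buf, size_t size, long long v)
{
    return snprintf(buf, size, "%18lld", v);   /* long long (8 bytes) */
}

int format_elem_type(char *buf, size_t size, int v)
{
    return snprintf(buf, size, "%4d", v);      /* elemType */
}
```

Because every format has a fixed minimum width, each value occupies a predictable number of characters as long as it fits in its field, which makes the size of the ASCII output easy to estimate from the value counts.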

=== NekCEM notes===
* scaling
** strong scaling: defined as how the solution time varies with the number of processors for a fixed total problem size.
** weak scaling: defined as how the solution time varies with the number of processors for a fixed problem size per processor.
* pre-compute file size
** #grid point = nx * ny * nz * nelt; size = #grid point * 3 * float
** cell type: 2d → 4 * #cell * int + 1* #cell * int (3d → 9)
** #field = nfields * 3 * #grid point; size = #field * float;
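
The file-size bookkeeping above can be written as a small function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, floats and ints are taken to be 4 bytes in the file, and the number of cells is passed in by the caller because the notes do not say how it is derived. Following the notes, cell_cols is 4 for 2D and 9 for 3D, with one extra int per cell for the cell type.

```c
#include <stdint.h>

/* Estimate the size in bytes of one binary dump: coordinates, cell
 * connectivity, cell types, and nfields three-component fields. */
int64_t vtk_dump_bytes(int64_t npts, int64_t ncells,
                       int cell_cols, int nfields)
{
    const int64_t word = 4;                       /* bytes per float or int */
    int64_t coords = npts * 3 * word;             /* x, y, z per grid point */
    int64_t cells  = ncells * cell_cols * word;   /* connectivity columns */
    int64_t types  = ncells * 1 * word;           /* one cell-type int per cell */
    int64_t fields = (int64_t)nfields * 3 * npts * word;
    return coords + cells + types + fields;
}
```

For example, with npts = ncells = 2,236,416,000 (546,000 elements of 16*16*16 points), cell_cols = 9 and 4 fields, the function returns 223,641,600,000 bytes. 64-bit integers are used because these counts overflow a 32-bit int.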

* .box → num elements in x,y,z
* .rea → input data
* SIZEu → SIZE parameters:
** lxi ?
** lp = #proc
** lelx = 20 each dimension
** lelv = max # of elements allocated per proc
* .usr → subuser.F
* cem() in cem_dg.F is the main solver and application entry point

* only CELL and point data need to be re-computed

* compile and run NekCEM
** in a specific case, ../../bin/cleanall, ../../bin/makenek, ../../bin/nek "case_name" #proc (e.g. in cylwave, ../../bin/nek cylwave 4)

== To-do List ==
*More tests on BG/P for config with ng = M and 1< nf < M
*Tests on Kraken and Jaguar
*Pthread + MPI for I/O
*OpenMP/Pthread + MPI for NekCEM computation
*Parallel I/O for reading .rea file

== Miscellaneous notes ==
* A Fortran-generated binary file may not be read correctly in C.
* -lstdc++ for link
* libF77 and libI77
* common.h and common_c.h
* write() in Fortran: unit 6 refers to the screen, and * writes to the screen as well.
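
One likely reason for the first note, stated here as an assumption about what was meant: Fortran unformatted sequential files wrap every record in length markers, so a plain fread in C sees extra bytes before and after the data. The sketch below reads one such record. It assumes 4-byte markers in native byte order (gfortran's default, for example; other compilers and options may differ), and the function name is hypothetical.

```c
#include <stdio.h>
#include <stdint.h>

/* Read one Fortran unformatted sequential record into buf.
 * Assumes 4-byte record markers in native byte order.
 * Returns the payload length in bytes, or -1 on error. */
long read_fortran_record(FILE *fp, void *buf, size_t bufsize)
{
    uint32_t head, tail;

    if (fread(&head, sizeof head, 1, fp) != 1)
        return -1;                       /* end of file or short read */
    if (head > bufsize)
        return -1;                       /* caller's buffer is too small */
    if (head > 0 && fread(buf, 1, head, fp) != head)
        return -1;                       /* truncated payload */
    if (fread(&tail, sizeof tail, 1, fp) != 1 || tail != head)
        return -1;                       /* leading and trailing markers must match */
    return (long)head;
}
```

Writing the file with stream access on the Fortran side, or writing it from C in the first place, avoids the markers altogether.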

Main Page/hybrid prog

2011-03-21T02:01:56Z

Jingfu:

Main Page/hybrid prog

2011-03-21T02:01:37Z

Jingfu:

This is the documentation page for the hybrid programming proposal for [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM] (some notes taken from the DOE Exascale workshop).

==Programming Model Approaches ==
* hybrid/evolutionary: MPI+ __ ?
** MPI for inter-node programming, since the # of nodes and inter-node concerns are not expected to change dramatically
*** support for hybrid programming/interoperability
*** purer one-sided communications; active messages
*** asynchronous collectives
** something else for intra-node
**# OpenMP (Shared memory, aka Global Address Space)
**#* introduction of locality-oriented concepts?
**#* efforts in OpenMP 3.0 ?
**# PGAS languages (Partitioned Global Address Space)
**#* already support a notion of locality in a shared namespace
**#* UPC (Unified Parallel C)/CAF need to relax strictly SPMD execution model
**# Sequoia
**#* support a strong notion of vertical locality

* unified/holistic: __ ?
** a single notation for inter- and intra-node programming?
** traditional PGAS languages: UPC, CAF, Titanium
*** require extension to handle nested parallelism, vertical locality
** HPCS languages: Chapel, X10, Fortress(?)
*** designed with locality and post-SPMD parallelism in mind
** other candidates: Charm++, Global Arrays, Parallel X, ...

* others
** mainstream multi-core/GPU language: (sufficient promise to be funded?)
**domain-specific language
*** fit your problem?
*** should focus on more general solutions
**functional languages
*** never heavily adopted in mainstream or HPC
*** copy-on-write optimization and alias analysis?
** parallel scripting languages?

==Pros and Cons of Pthread and OpenMP==
*Pthread
** Pros: low-level control of program, well supported
** Cons: need to cast the codebase into a threaded model, which requires considerable threading-specific code; thread counts etc. tend to be hard-coded, so not very portable
** Misc: can use thread pool when not sure about machine processor details (to be more portable than hard-coded thread#)
*OpenMP
** Pros: medium-grained control over threading functionality; auto-adjusts according to machine specifics; uses pragmas (rather than API calls), which do not interfere with the single-threaded codebase; easy to debug as well
** Cons: compiler support on BG/P?
** Misc:
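
The point about pragmas not interfering with the single-threaded codebase can be illustrated with a minimal sketch. The function below is a generic example, not taken from NekCEM: when the compiler's OpenMP option is enabled the loop is shared among threads, and when it is not the pragma is ignored and the same source builds as ordinary serial C.

```c
#include <stddef.h>

/* Sum an array. With OpenMP enabled the iterations are divided among
 * threads and the partial sums are combined by the reduction clause;
 * without OpenMP the pragma is ignored and the loop runs serially. */
double array_sum(const double *x, size_t n)
{
    double sum = 0.0;
    long i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; ++i)
        sum += x[i];

    return sum;
}
```

An equivalent Pthread version would need explicit thread creation, per-thread work ranges, and a join step, which is the "threading-specific code" cost listed under the Pthread cons.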

==Expectation==
*parallelism: nested, dynamic, loosely-coupled, data-driven (i.e. ''post-SPMD'' programming/execution models)
** to take advantage of architecture
** to better support load balancing and resilience

* locality: concepts for vertical control as well as horizontal (i.e. locality within a node rather than simply between nodes)

== Tools: Debuggers, perf. analysis==
* challenges
** need aggregation to hide details
** need to report info in user's terms
*good area for innovation (e.g. execution visualization to understand mapping of code to hardware)

==Misc Notes==
* three main types of distributed memory programming:
** active message. Active Messages are actually a lower-level mechanism that can be used to implement data parallel or message passing efficiently.
** data parallel(aka loop-level parallelism). Data parallelism emphasizes the distributed (parallelized) nature of the data, as opposed to the processing (task parallelism).
** message passing
* multi-core and many-core processors
** A multi-core processor is composed of two or more independent cores. One can describe it as an integrated circuit which has two or more individual processors (called cores in this sense). Manufacturers typically integrate the cores onto a single integrated circuit die (known as a chip multiprocessor or CMP), or onto multiple dies in a single chip package.
** A many-core processor is one in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient — largely due to issues with congestion supplying sufficient instructions and data to the many processors. This threshold is roughly in the range of several tens of cores and probably requires a network on chip.
* ILP and TLP
** Some instruction-level parallelism (ILP) methods like superscalar pipelining are suitable for many applications, but are inefficient for others that tend to contain difficult-to-predict code.
** Many applications are better suited to thread-level parallelism (TLP) methods, and using multiple independent CPUs is one common way to increase a system's overall TLP. The multithreading paradigm has become more popular as efforts to further exploit instruction-level parallelism have stalled since the late 1990s.
* multi-threading vs. multi-processing
** multi-threading advantage
*** If a thread gets a lot of cache misses, the other thread(s) can continue, taking advantage of the unused computing resources, which thus can lead to faster overall execution, as these resources would have been idle if only a single thread was executed.
**multi-threading disadvantage
*** The execution time of a single thread is not improved and can be degraded, even when only one thread is executing. This is due to slower frequencies and/or additional pipeline stages that are necessary to accommodate thread-switching hardware.
** Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system. The term also refers to the ability of a system to support more than one processor and/or the ability to allocate tasks between them.
** SMP MIMD multiprocessing

==Questions==
* Which MPI implementations are thread-safe?

Main Page/faq

2011-02-25T01:49:38Z

Jingfu:

This is the resource listing page for [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM].
==Resource Links==
*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/PIO Parallel I/O of NekCEM]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/hybrid_prog Hybrid programming (proposed work)]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/aio Sync/Async Blocking/non-blocking I/O]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/bgQ Blue Gene/Q and Mira]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/Meeting_Notes Meeting Notes]

== Implementation ==

=== I/O code ===

* I/O functions are invoked from the cem_out function in cem_dg.F (and cem_dg2.F).
* The parallel I/O routines are implemented in vtkbin.c and rbIO_nekcem.c.
* vtkcommon.c and vtkcommon.h hold common functions and global variables.

* cem_out_fields3 (in cem_dg.F)
** openfile3(dumpno, nid) !vtkbin.c
** vtk_dump_header3
*** writeheader3() !vtkbin.c
*** writenodes3() !vtkbin.c
*** write2dcells3 !vtkbin.c or write3dcells3 !vtkbin.c
** vtk_dump_field3
*** writefield3 !vtkbin.c
** close_file3 !vtkbin.c

* Binary file → ASCII file: transfer double/float/int/read to chars then write out
** float (4 bytes) → %18.8E
** int (4 bytes) → %10d
** long long (8 bytes) → %18lld
** elemType → %4d

=== NekCEM notes===
* scaling
** strong scaling: defined as how the solution time varies with the number of processors for a fixed total problem size.
** weak scaling: defined as how the solution time varies with the number of processors for a fixed problem size per processor.
* pre-compute file size
** #grid point = nx * ny * nz * nelt; size = #grid point * 3 * float
** cell type: 2d → 4 * #cell * int + 1* #cell * int (3d → 9)
** #field = nfields * 3 * #grid point; size = #field * float;

* .box → num elements in x,y,z
* .rea → input data
* SIZEu → SIZE parameters:
** lxi ?
** lp = #proc
** lelx = 20 each dimension
** lelv = max # of elements allocated per proc
* .usr → subuser.F
* cem() in cem_dg.F is the main solver and application entry point

* only CELL and point data need to be re-computed

* compile and run NekCEM
** in a specific case, ../../bin/cleanall, ../../bin/makenek, ../../bin/nek "case_name" #proc (e.g. in cylwave, ../../bin/nek cylwave 4)

== To-do List ==
*More tests on BG/P for config with ng = M and 1< nf < M
*Tests on Kraken and Jaguar
*Pthread + MPI for I/O
*OpenMP/Pthread + MPI for NekCEM computation
*Parallel I/O for reading .rea file

== Miscellaneous notes ==
* A Fortran-generated binary file may not be read correctly in C.
* -lstdc++ for link
* libF77 and libI77
* common.h and common_c.h
* write() in Fortran: unit 6 refers to the screen, and * writes to the screen as well.

Main Page/faq

2011-02-25T01:48:51Z

Jingfu:

This is the resource listing page for [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM].
==Resource Links==
*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/PIO Parallel I/O of NekCEM]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/hybrid_prog Hybrid programming (proposed work)]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/aio Sync/Async Blocking/non-blocking I/O]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/bgQ Blue Gene/Q and Mira]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/Meeting_Notes Meeting Notes]

== Implementation ==

=== I/O code ===

* I/O functions are invoked from the cem_out function in cem_dg.F (and cem_dg2.F).
* The parallel I/O routines are implemented in vtkbin.c and rbIO_nekcem.c.
* vtkcommon.c and vtkcommon.h hold common functions and global variables.

* cem_out_fields3 (in cem_dg.F)
** openfile3(dumpno, nid) !vtkbin.c
** vtk_dump_header3
*** writeheader3() !vtkbin.c
*** writenodes3() !vtkbin.c
*** write2dcells3 !vtkbin.c or write3dcells3 !vtkbin.c
** vtk_dump_field3
*** writefield3 !vtkbin.c
** close_file3 !vtkbin.c

* Binary file → ASCII file: transfer double/float/int/read to chars then write out
** float (4 bytes) → %18.8E
** int (4 bytes) → %10d
** long long (8 bytes) → %18lld
** elemType → %4d

=== NekCEM notes===
* scaling
** strong scaling: defined as how the solution time varies with the number of processors for a fixed total problem size.
** weak scaling: defined as how the solution time varies with the number of processors for a fixed problem size per processor.
* pre-compute file size
** #grid point = nx * ny * nz * nelt; size = #grid point * 3 * float
** cell type: 2d → 4 * #cell * int + 1* #cell * int (3d → 9)
** #field = nfields * 3 * #grid point; size = #field * float;

* .box → num elements in x,y,z
* .rea → input data
* SIZEu → SIZE parameters:
** lxi ?
** lp = #proc
** lelx = 20 each dimension
** lelv = max # of elements allocated per proc
* .usr → subuser.F
* cem() in cem_dg.F is the main solver and application entry point

* only CELL and point data need to be re-computed

* compile and run NekCEM
** in a specific case, ../../bin/cleanall, ../../bin/makenek, ../../bin/nek "case_name" #proc
** e.g., in cylwave, ../../bin/nek cylwave 4

== To-do List ==
*More tests on BG/P for config with ng = M and 1< nf < M
*Tests on Kraken and Jaguar
*Pthread + MPI for I/O
*OpenMP/Pthread + MPI for NekCEM computation
*Parallel I/O for reading .rea file

== Miscellaneous notes ==
* A Fortran-generated binary file may not be read correctly in C.
* -lstdc++ for link
* libF77 and libI77
* common.h and common_c.h
* write() in Fortran: unit 6 refers to the screen, and * writes to the screen as well.

Main Page/faq

2011-02-25T01:48:24Z

Jingfu:

This is the resource listing page for [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM].
==Resource Links==
*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/PIO Parallel I/O of NekCEM]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/hybrid_prog Hybrid programming (proposed work)]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/aio Sync/Async Blocking/non-blocking I/O]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/bgQ Blue Gene/Q and Mira]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/Meeting_Notes Meeting Notes]

== Implementation ==

=== I/O code ===

* I/O functions are invoked from the cem_out function in cem_dg.F (and cem_dg2.F).
* The parallel I/O routines are implemented in vtkbin.c and rbIO_nekcem.c.
* vtkcommon.c and vtkcommon.h hold common functions and global variables.

* cem_out_fields3 (in cem_dg.F)
** openfile3(dumpno, nid) !vtkbin.c
** vtk_dump_header3
*** writeheader3() !vtkbin.c
*** writenodes3() !vtkbin.c
*** write2dcells3 !vtkbin.c or write3dcells3 !vtkbin.c
** vtk_dump_field3
*** writefield3 !vtkbin.c
** close_file3 !vtkbin.c

* Binary file → ASCII file: transfer double/float/int/read to chars then write out
** float (4 bytes) → %18.8E
** int (4 bytes) → %10d
** long long (8 bytes) → %18lld
** elemType → %4d

=== NekCEM notes===
* compile and run NekCEM
** in a specific case, ../../bin/cleanall, ../../bin/makenek, ../../bin/nek "case_name" #proc
** e.g., in cylwave, ../../bin/nek cylwave 4

* scaling
** strong scaling: defined as how the solution time varies with the number of processors for a fixed total problem size.
** weak scaling: defined as how the solution time varies with the number of processors for a fixed problem size per processor.
* pre-compute file size
** #grid point = nx * ny * nz * nelt; size = #grid point * 3 * float
** cell type: 2d → 4 * #cell * int + 1* #cell * int (3d → 9)
** #field = nfields * 3 * #grid point; size = #field * float;

* .box → num elements in x,y,z
* .rea → input data
* SIZEu → SIZE parameters:
** lxi ?
** lp = #proc
** lelx = 20 each dimension
** lelv = max # of elements allocated per proc
* .usr → subuser.F
* cem() in cem_dg.F is the main solver and application entry point

* only CELL and point data need to be re-computed
== To-do List ==
*More tests on BG/P for config with ng = M and 1< nf < M
*Tests on Kraken and Jaguar
*Pthread + MPI for I/O
*OpenMP/Pthread + MPI for NekCEM computation
*Parallel I/O for reading .rea file

== Miscellaneous notes ==
* A Fortran-generated binary file may not be read correctly in C.
* -lstdc++ for link
* libF77 and libI77
* common.h and common_c.h
* write() in Fortran: unit 6 refers to the screen, and * writes to the screen as well.

Main Page/Meeting Notes

2011-02-22T22:16:56Z

Jingfu:

This page records meeting notes about [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM].
==Notes==
=== 02/17/2011, Misun, Jing Fu===
* Paper
** will keep iterating on the NekCEM I/O paper in accordance with review/feedback from ICS
** Chris suggests [http://www.ppam.pl/ PPAM] (submit: 4/30, notify: 6/15) over ICPP and Cluster
** Results on Jaguar are going to take a while (allocation application turnaround time, compile NekCEM on Cray machines, tune performance, summarize results in paper)
** Results from Jaguar/Lustre can go into follow-up journal paper (most likely by the end of summer)

* hybrid model
** I/O w/ pthread is coming along on SMP machines, will try Intrepid soon
** OpenMP should get strong support (according to Bronis etc.) and will likely be used for the computation frame
** pthread should be acceptable for I/O tasks
** MPI task?
** test sub-communicator collective performance degradation (for potential total comm split)
*** for all_reduce, going through a sub-communicator forces the routine onto the torus network rather than the collective tree network, causing a 10x-30x performance drop for double and integer all_reduce compared with MPI_COMM_WORLD; tested on 32k and 64k

* summer time frame
** Flexible, Misun only absent July 26-29

Main Page/Meeting Notes

2011-02-22T22:16:20Z

Jingfu:

Main Page/Meeting Notes

2011-02-22T22:06:30Z

Jingfu:

Main Page/faq

2011-02-22T22:01:29Z

Jingfu:

This is the resource listing page for [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM].
==Resource Links==
*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/PIO Parallel I/O of NekCEM]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/hybrid_prog Hybrid programming (proposed work)]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/aio Sync/Async Blocking/non-blocking I/O]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/bgQ Blue Gene/Q and Mira]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/Meeting_Notes Meeting Notes]

== Implementation ==

=== I/O code ===

* I/O functions are invoked from the cem_out function in cem_dg.F (and cem_dg2.F).
* The parallel I/O routines are implemented in vtkbin.c and rbIO_nekcem.c.
* vtkcommon.c and vtkcommon.h hold common functions and global variables.

* cem_out_fields3 (in cem_dg.F)
** openfile3(dumpno, nid) !vtkbin.c
** vtk_dump_header3
*** writeheader3() !vtkbin.c
*** writenodes3() !vtkbin.c
*** write2dcells3 !vtkbin.c or write3dcells3 !vtkbin.c
** vtk_dump_field3
*** writefield3 !vtkbin.c
** close_file3 !vtkbin.c

* Binary file → ASCII file: transfer double/float/int/read to chars then write out
** float (4 bytes) → %18.8E
** int (4 bytes) → %10d
** long long (8 bytes) → %18lld
** elemType → %4d

=== NekCEM notes===
* scaling
** strong scaling: defined as how the solution time varies with the number of processors for a fixed total problem size.
** weak scaling: defined as how the solution time varies with the number of processors for a fixed problem size per processor.
* pre-compute file size
** #grid point = nx * ny * nz * nelt; size = #grid point * 3 * float
** cell type: 2d → 4 * #cell * int + 1* #cell * int (3d → 9)
** #field = nfields * 3 * #grid point; size = #field * float;

* .box → num elements in x,y,z
* .rea → input data
* SIZEu → SIZE parameters:
** lxi ?
** lp = #proc
** lelx = 20 each dimension
** lelv = max # of elements allocated per proc
* .usr → subuser.F
* cem() in cem_dg.F is the main solver and application entry point

* only CELL and point data need to be re-computed
== To-do List ==
*More tests on BG/P for config with ng = M and 1< nf < M
*Tests on Kraken and Jaguar
*Pthread + MPI for I/O
*OpenMP/Pthread + MPI for NekCEM computation
*Parallel I/O for reading .rea file

== Miscellaneous notes ==
* A Fortran-generated binary file may not be read correctly in C.
* -lstdc++ for link
* libF77 and libI77
* common.h and common_c.h
* write() in Fortran: unit 6 refers to the screen, and * writes to the screen as well.

Main Page/Meeting Notes

2011-02-22T22:01:02Z

Jingfu:

This page records meeting notes about [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM].
==Notes==
=== 02/17/2011, Misun, Jing Fu===
*
**

Main Page/Meeting Notes

2011-02-22T22:00:31Z

Jingfu: Created page with "This page records meeting notes about [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM]. ==Notes== === 02/17/2011 === * Misun, Jing Fu **"

This page records meeting notes about [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM].
==Notes==
=== 02/17/2011 ===
* Misun, Jing Fu
**

Main Page/bgQ

2011-02-17T20:44:48Z

Jingfu:

Argonne National Laboratory is planning to move up to a 10-petaflop Blue Gene/Q supercomputer next year, supporting the DOE lab's scientific research. The new machine continues Argonne's six-year Blue Gene tradition, which has installed every iteration of the architecture in IBM's BG franchise.

The Mira system is based on IBM's next-generation PowerPC SoC, in this case the 16-core Power A2 processor ([https://wiki.mcs.anl.gov/nekcem/index.php/File:Wire.pdf PDF]), a 64-bit CPU capable of handling 4 threads simultaneously. The processor has 32 KB of L1 cache -- 16 KB for data and 16 KB for instructions. L2 cache is made up of 8 MB of embedded DRAM (eDRAM), a high-density on-chip memory technology that IBM uses for Blue Gene and its latest Power7 processors. Memory and I/O controllers are integrated on-chip.

Each server node will contain a single A2 processor and sport either 8 or 16 GB of memory. A fully populated Blue Gene/Q rack contains 1024 nodes, representing 16K cores. I/O has been split from the server nodes so that configurations can scale compute and I/O independently. A rack can accommodate between 8 and 128 I/O nodes. Conveniently, the I/O nodes use the same Power A2 chip as the compute servers.

Server-to-server communication is performed over a 5D Torus, which is capable of up to 40 gigabits per second, four times the speed of the Blue Gene/P interconnect. The 5D Torus employs fiber optics, the first Blue Gene design to do so.

Compute performance is delivered by using a large number of relatively low-speed cores -- a hallmark of the Blue Gene architecture. Unlike the speedy 3.3 GHz Power7 chips that will go into the future Blue Waters supercomputer at the NCSA, the A2 processor for Blue Gene hums along at a modest 1.6 GHz (although faster versions of this chip can hit 3 GHz). According to IBM, Mira will encapsulate 750K cores, which works out to about 48,000 CPUs. Total memory is 750 TB, backed by 70 petabytes of disk storage.

The low-speed, high-core approach makes for a very energy-efficient package. A Blue Gene/Q prototype grabbed first place on the November 2010 Green500 list, with a Linpack rating of 1684.2 megaflops/watt. That bested even the latest Fermi GPU accelerated supers, like the TSUBAME 2.0 system recently installed at Tokyo Tech, as well as IBM's fastest Cell (PowerXCell 8i) processor-accelerated QS22 clusters. To further boost energy efficiency and maintain reliability, all Blue Gene/Q racks are water cooled.

Because of its size, Argonne is looking at Mira as a stepping stone to exaflop supercomputing. With less than a million cores though, programmers will have to use some imagination to scale their codes to the hundreds of millions of cores envisioned in a true exascale system.

However, by the time IBM and others start building such machines, the Blue Gene PowerPC-based architecture is likely to be subsumed into the company's Power-based line-up (which at the processor ISA level, at least, is quite similar). Based on a recent conversation with Herb Schultz, marketing manager for IBM's Deep Computing unit, the Power and Blue Gene lines may merge around the middle of this decade. That would suggest that Blue Gene/Q could very well be the last in the Blue Gene lineage.

Main Page/bgQ

2011-02-17T20:42:07Z

Jingfu:

The Mira system is based on IBM's next-generation PowerPC SoC, in this case the 16-core Power A2 processor ([https://wiki.mcs.anl.gov/nekcem/index.php/File:Wire.pdf PDF]), a 64-bit CPU capable of handling 4 threads simultaneously. The processor has 32 KB of L1 cache -- 16 KB for data and 16 KB for instructions. L2 cache is made up of 8 MB of embedded DRAM (eDRAM), a high-density on-chip memory technology that IBM uses for Blue Gene and its latest Power7 processors. Memory and I/O controllers are integrated on-chip.

Each server node will contain a single A2 processor and sport either 8 or 16 GB of memory. A fully populated Blue Gene/Q rack contains 1024 nodes, representing 16K cores. I/O has been split from the server nodes so that configurations can scale compute and I/O independently. A rack can accommodate between 8 and 128 I/O nodes. Conveniently, the I/O nodes use the same Power A2 chip as the compute servers.

Server-to-server communication is performed over a 5D Torus, which is capable of up to 40 gigabits per second, four times the speed of the Blue Gene/P interconnect. The 5D Torus employs fiber optics, the first Blue Gene design to do so.

Compute performance is delivered by using a large number of relatively low-speed cores -- a hallmark of the Blue Gene architecture. Unlike the speedy 3.3 GHz Power7 chips that will go into the future Blue Waters supercomputer at the NCSA, the A2 processor for Blue Gene hums along at a modest 1.6 GHz (although faster versions of this chip can hit 3 GHz). According to IBM, Mira will encapsulate 750K cores, which works out to about 48,000 CPUs. Total memory is 750 TB, backed by 70 petabytes of disk storage.

The low-speed, high-core approach makes for a very energy-efficient package. A Blue Gene/Q prototype grabbed first place on the November 2010 Green500 list, with a Linpack rating of 1684.2 megaflops/watt. That bested even the latest Fermi GPU accelerated supers, like the TSUBAME 2.0 system recently installed at Tokyo Tech, as well as IBM's fastest Cell (PowerXCell 8i) processor-accelerated QS22 clusters. To further boost energy efficiency and maintain reliability, all Blue Gene/Q racks are water cooled.

Because of its size, Argonne is looking at Mira as a stepping stone to exaflop supercomputing. With less than a million cores though, programmers will have to use some imagination to scale their codes to the hundreds of millions of cores envisioned in a true exascale system.

However, by the time IBM and others start building such machines, the Blue Gene PowerPC-based architecture is likely to be subsumed into the company's Power-based line-up (which at the processor ISA level, at least, is quite similar). Based on a recent conversation with Herb Schultz, marketing manager for IBM's Deep Computing unit, the Power and Blue Gene lines may merge around the middle of this decade. That would suggest that Blue Gene/Q could very well be the last in the Blue Gene lineage.

File:Wire.pdf

2011-02-17T20:41:28Z

Jingfu:

Main Page/faq

2011-02-17T20:35:40Z

Jingfu:

This is the resource listing page for [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM].
==Resource Links==
*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/PIO Parallel I/O of NekCEM]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/hybrid_prog Hybrid programming (proposed work)]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/aio Sync/Async Blocking/non-blocking I/O]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/bgQ Blue Gene/Q and Mira]

== Implementation ==

=== I/O code ===

* I/O functions are invoked from the cem_out function in cem_dg.F (and cem_dg2.F).
* The parallel I/O routines are implemented in vtkbin.c and rbIO_nekcem.c.
* vtkcommon.c and vtkcommon.h hold common functions and global variables.

* cem_out_fields3 (in cem_dg.F)
** openfile3(dumpno, nid) !vtkbin.c
** vtk_dump_header3
*** writeheader3() !vtkbin.c
*** writenodes3() !vtkbin.c
*** write2dcells3 !vtkbin.c or write3dcells3 !vtkbin.c
** vtk_dump_field3
*** writefield3 !vtkbin.c
** close_file3 !vtkbin.c

* Binary file → ASCII file: transfer double/float/int/read to chars then write out
** float (4 bytes) → %18.8E
** int (4 bytes) → %10d
** long long (8 bytes) → %18lld
** elemType → %4d

=== NekCEM notes===
* scaling
** strong scaling: defined as how the solution time varies with the number of processors for a fixed total problem size.
** weak scaling: defined as how the solution time varies with the number of processors for a fixed problem size per processor.
* pre-compute file size
** #grid point = nx * ny * nz * nelt; size = #grid point * 3 * float
** cell type: 2d → 4 * #cell * int + 1* #cell * int (3d → 9)
** #field = nfields * 3 * #grid point; size = #field * float;

* .box → num elements in x,y,z
* .rea → input data
* SIZEu → SIZE parameters:
** lxi ?
** lp = #proc
** lelx = 20 each dimension
** lelv = max # of elements allocated per proc
* .usr → subuser.F
* cem() in cem_dg.F is the main solver and application entry point

* only CELL and point data need to be re-computed
== To-do List ==
*More tests on BG/P for config with ng = M and 1< nf < M
*Tests on Kraken and Jaguar
*Pthread + MPI for I/O
*OpenMP/Pthread + MPI for NekCEM computation
*Parallel I/O for reading .rea file

== Miscellaneous notes ==
* A Fortran-generated binary file may not be read correctly in C.
* -lstdc++ for link
* libF77 and libI77
* common.h and common_c.h
* write() in Fortran: unit 6 refers to the screen, and * writes to the screen as well.

Main Page/bgQ

2011-02-17T20:35:07Z

Jingfu: Created page with "The Mira system is based on IBM's next-generation PowerPC SoC, in this case the 16-core Power A2 processor (PDF), a 64-bit CPU capable of handling 4 threads simultaneously. The p…"

The Mira system is based on IBM's next-generation PowerPC SoC, in this case the 16-core Power A2 processor (PDF), a 64-bit CPU capable of handling 4 threads simultaneously. The processor has 32 KB of L1 cache -- 16 KB for data and 16 KB for instructions. L2 cache is made up of 8 MB of embedded DRAM (eDRAM), a high-density on-chip memory technology that IBM uses for Blue Gene and its latest Power7 processors. Memory and I/O controllers are integrated on-chip.

Each server node will contain a single A2 processor and sport either 8 or 16 GB of memory. A fully populated Blue Gene/Q rack contains 1024 nodes, representing 16K cores. I/O has been split from the server nodes so that configurations can scale compute and I/O independently. A rack can accommodate between 8 and 128 I/O nodes. Conveniently, the I/O nodes use the same Power A2 chip as the compute servers.

Server-to-server communication is performed over a 5D Torus, which is capable of up to 40 gigabits per second, four times the speed of the Blue Gene/P interconnect. The 5D Torus employs fiber optics, the first Blue Gene design to do so.

Compute performance is delivered by using a large number of relatively low-speed cores -- a hallmark of the Blue Gene architecture. Unlike the speedy 3.3 GHz Power7 chips that will go into the future Blue Waters supercomputer at the NCSA, the A2 processor for Blue Gene hums along at a modest 1.6 GHz (although faster versions of this chip can hit 3 GHz). According to IBM, Mira will encapsulate 750K cores, which works out to about 48,000 CPUs. Total memory is 750 TB, backed by 70 petabytes of disk storage.

The low-speed, high-core approach makes for a very energy-efficient package. A Blue Gene/Q prototype grabbed first place on the November 2010 Green500 list, with a Linpack rating of 1684.2 megaflops/watt. That bested even the latest Fermi GPU accelerated supers, like the TSUBAME 2.0 system recently installed at Tokyo Tech, as well as IBM's fastest Cell (PowerXCell 8i) processor-accelerated QS22 clusters. To further boost energy efficiency and maintain reliability, all Blue Gene/Q racks are water cooled.

Because of its size, Argonne is looking at Mira as a stepping stone to exaflop supercomputing. With less than a million cores though, programmers will have to use some imagination to scale their codes to the hundreds of millions of cores envisioned in a true exascale system.

However, by the time IBM and others start building such machines, the Blue Gene PowerPC-based architecture is likely to be subsumed into the company's Power-based line-up (which at the processor ISA level, at least, is quite similar). Based on a recent conversation with Herb Schultz, marketing manager for IBM's Deep Computing unit, the Power and Blue Gene lines may merge around the middle of this decade. That would suggest that Blue Gene/Q could very well be the last in the Blue Gene lineage.

Main Page/hybrid prog

2011-02-09T03:37:45Z

Jingfu:

This page documents the hybrid programming proposal for [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM] (some notes are taken from the DOE Exascale workshop).

==Programming Model Approaches ==
* hybrid/evolutionary: MPI+ __ ?
** MPI for inter-node programming, since the number of nodes and inter-node concerns are not expected to change dramatically
*** support for hybrid programming/interoperability
*** purer one-sided communications; active messages
*** asynchronous collectives
** something else for intra-node
**# OpenMP (Shared memory, aka Global Address Space)
**#* introduction of locality-oriented concepts?
**#* efforts in OpenMP 3.0 ?
**# PGAS languages (Partitioned Global Address Space)
**#* already support a notion of locality in a shared namespace
**#* UPC (Unified Parallel C) and CAF (Co-Array Fortran) need to relax their strictly SPMD execution model
**# Sequoia
**#* supports a strong notion of vertical locality

* unified/holistic: __ ?
** a single notation for inter- and intra-node programming?
** traditional PGAS languages: UPC, CAF, Titanium
*** require extensions to handle nested parallelism and vertical locality
** HPCS languages: Chapel, X10, Fortress(?)
*** designed with locality and post-SPMD parallelism in mind
** other candidates: Charm++, Global Arrays, ParalleX, ...

* others
** mainstream multi-core/GPU languages (sufficient promise to be funded?)
** domain-specific languages
*** fit your problem?
*** should focus on more general solutions
** functional languages
*** never heavily adopted in mainstream or HPC
*** copy-on-write optimization and alias analysis?
** parallel scripting languages?

==Expectation==
*parallelism: nested, dynamic, loosely-coupled, data-driven (i.e. ''post-SPMD'' programming/execution models)
** to take advantage of architecture
** to better support load balancing and resilience

* locality: concepts for vertical control as well as horizontal (i.e. locality within a node rather than simply between nodes)

== Tools: Debuggers, perf. analysis==
* challenges
** need aggregation to hide details
** need to report info in user's terms
*good area for innovation (e.g. execution visualization to understand mapping of code to hardware)

==Misc Notes==
* three main types of distributed memory programming:
** active messages: a lower-level mechanism that can be used to implement data-parallel or message-passing models efficiently.
** data parallel (aka loop-level parallelism): emphasizes the distributed (parallelized) nature of the data, as opposed to the processing (task parallelism).
** message passing
* multi-core and many-core processors
** A multi-core processor is composed of two or more independent cores. One can describe it as an integrated circuit which has two or more individual processors (called cores in this sense). Manufacturers typically integrate the cores onto a single integrated circuit die (known as a chip multiprocessor or CMP), or onto multiple dies in a single chip package.
** A many-core processor is one in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient, largely due to congestion in supplying sufficient instructions and data to the many processors. This threshold is roughly in the range of several tens of cores and probably requires a network on chip.
* ILP and TLP
** Some instruction-level parallelism (ILP) methods like superscalar pipelining are suitable for many applications, but are inefficient for others that tend to contain difficult-to-predict code.
** Many applications are better suited to thread-level parallelism (TLP) methods, and using multiple independent CPUs is one common way to increase a system's overall TLP. The multithreading paradigm has become more popular as efforts to further exploit instruction-level parallelism have stalled since the late 1990s.
* multi-threading vs. multi-processing
** multi-threading advantage
*** If a thread gets a lot of cache misses, the other thread(s) can continue, taking advantage of the unused computing resources, which thus can lead to faster overall execution, as these resources would have been idle if only a single thread was executed.
** multi-threading disadvantage
*** Execution times of a single thread are not improved and can even be degraded, even when only one thread is executing. This is due to slower frequencies and/or additional pipeline stages that are necessary to accommodate thread-switching hardware.
** Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system. The term also refers to the ability of a system to support more than one processor and/or the ability to allocate tasks between them.
** SMP MIMD multiprocessing
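
To make the data-parallel bullet above concrete, here is a minimal Python sketch: the same operation is applied to disjoint chunks of one array and the partial results are concatenated in order. The names <code>scale_chunk</code> and <code>data_parallel_scale</code> and the use of a thread pool are illustrative assumptions, not NekCEM code. In CPython, threads will not speed up this pure-Python loop, so the sketch only shows the decomposition pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_chunk(chunk, factor):
    """The per-chunk kernel: every worker runs the same operation."""
    return [factor * x for x in chunk]


def data_parallel_scale(data, factor, nworkers=4):
    """Split data into contiguous chunks and process them concurrently."""
    # Ceiling division so that at most nworkers chunks are produced.
    size = max(1, -(-len(data) // nworkers))
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        # map() preserves chunk order, so the result keeps the input order.
        parts = list(pool.map(scale_chunk, chunks, [factor] * len(chunks)))
    return [x for part in parts for x in part]
```

A message-passing version would instead give each rank its own chunk and exchange only boundary data explicitly.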

==Questions==
* Which MPI implementations are thread-safe?

Main Page/hybrid prog

2011-02-09T03:37:03Z

Jingfu:

==Questions==
* Is the MPI implementation thread-safe?

Main Page/hybrid prog

2011-02-09T03:04:11Z

Jingfu:

Main Page/hybrid prog

2011-02-09T01:52:07Z

Jingfu:

Main Page/faq

2011-02-06T16:28:26Z

Jingfu:

This is the resource listing page for [http://www.mcs.anl.gov/~mmin/nekcem.html NekCEM].
==Resource Links==
*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/PIO Parallel I/O of NekCEM]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/hybrid_prog Hybrid programming (proposed work)]

*[https://wiki.mcs.anl.gov/nekcem/index.php/Main_Page/aio Sync/Async Blocking/non-blocking I/O]

== Implementation ==

=== I/O code ===

* I/O is initiated from the cem_out function in cem_dg.F (and cem_dg2.F).
* The parallel I/O routines are implemented in vtkbin.c and rbIO_nekcem.c.
* vtkcommon.c and vtkcommon.h hold common functions and global variables.

* cem_out_fields3 (in cem_dg.F)
** openfile3(dumpno, nid) !vtkbin.c
** vtk_dump_header3
*** writeheader3() !vtkbin.c
*** writenodes3() !vtkbin.c
*** write2dcells3 !vtkbin.c or write3dcells3 !vtkbin.c
** vtk_dump_field3
*** writefield3 !vtkbin.c
** close_file3 !vtkbin.c

* Binary file → ASCII file: convert double/float/int/read to chars, then write out
** float (4 bytes) → %18.8E
** int (4 bytes) → %10d
** long long (8 bytes) → %18lld
** elemType → %4d
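
As a rough illustration of what the conversion above costs in file size, the following Python sketch formats one value with each of the listed field widths and compares the result with its binary size. The helper names and the little-endian <code>struct</code> codes are assumptions made for this example. Python has no <code>ll</code> length modifier, so <code>%18d</code> stands in for C's <code>%18lld</code>.

```python
import struct

# Field widths from the list above ("%18d" replaces C's "%18lld").
ASCII_FORMATS = {
    "float": "%18.8E",
    "int": "%10d",
    "long long": "%18d",
    "elemType": "%4d",
}
# Binary layouts assumed for comparison: 4-byte float, 4-byte int, 8-byte int.
BINARY_FORMATS = {"float": "<f", "int": "<i", "long long": "<q"}


def ascii_field(kind, value):
    """Format one value as the fixed-width ASCII field for its type."""
    return ASCII_FORMATS[kind] % value


def expansion_ratio(kind, value):
    """ASCII characters written per binary byte for one value."""
    binary = struct.pack(BINARY_FORMATS[kind], value)
    return len(ascii_field(kind, value)) / len(binary)
```

Under these assumptions a float grows from 4 bytes to 18 characters (4.5x) and an int from 4 bytes to 10 characters (2.5x), which is worth keeping in mind when estimating ASCII output sizes.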

=== NekCEM notes===
* scaling
** strong scaling: defined as how the solution time varies with the number of processors for a fixed total problem size.
** weak scaling: defined as how the solution time varies with the number of processors for a fixed problem size per processor.
* pre-compute file size
** #grid point = nx * ny * nz * nelt; size = #grid point * 3 * float
** cell type: 2d → 4 * #cell * int + 1* #cell * int (3d → 9)
** #field = nfields * 3 * #grid point; size = #field * float;
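
The file-size estimate above can be written as a small Python function. This is a sketch under stated assumptions: a 3D mesh with 9 ints of cell data and 1 int of cell type per cell, 4-byte floats and ints, and a cell count taken to be roughly equal to the number of grid points unless given explicitly. The function name <code>vtk_file_size</code> is hypothetical.

```python
FLOAT_BYTES = 4
INT_BYTES = 4


def vtk_file_size(nx, ny, nz, nelt, nfields, ncell=None):
    """Estimate the binary output size in bytes for a 3D run."""
    npts = nx * ny * nz * nelt
    if ncell is None:
        ncell = npts  # assumption: about as many cells as grid points
    coords = npts * 3 * FLOAT_BYTES             # x, y, z per grid point
    cells = ncell * 9 * INT_BYTES               # 3D cell data
    cell_types = ncell * 1 * INT_BYTES          # one type code per cell
    fields = nfields * 3 * npts * FLOAT_BYTES   # 3 components per field
    return coords + cells + cell_types + fields
```

For example, <code>vtk_file_size(16, 16, 16, 546000, 4)</code> covers 2,236,416,000 grid points at 100 bytes each, giving 223,641,600,000 bytes.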

* .box → num elements in x,y,z
* .rea → input data
* SIZEu → SIZE parameters:
** lxi ?
** lp = #proc
** lelx = 20 each dimension
** lelv = alloc max # of element per proc
* .usr → subuser.F
* cem() in cem_dg.F is the main solver and application entry point

* only CELL and point data need to be re-computed
== To-do List ==
*More tests on BG/P for config with ng = M and 1< nf < M
*Tests on Kraken and Jaguar
*Pthread + MPI for I/O
*OpenMP/Pthread + MPI for NekCEM computation
*Parallel I/O for reading .rea file
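
The "Pthread + MPI for I/O" item could be prototyped along the following lines. This is only a sketch of the threading pattern: Python's <code>threading</code> and <code>queue</code> modules stand in for pthreads, MPI is left out entirely, and the names <code>run_with_io_thread</code> and <code>sink</code> are hypothetical. The compute loop hands finished buffers to a dedicated writer thread through a bounded queue and joins the thread after the last step.

```python
import io
import queue
import threading


def run_with_io_thread(steps, sink):
    """Run a compute loop whose output is written by a separate thread."""
    # Bounded queue: limits how many output buffers are held in memory.
    work = queue.Queue(maxsize=2)

    def writer():
        while True:
            buf = work.get()
            if buf is None:  # sentinel: no more output will arrive
                break
            sink.write(buf)

    thread = threading.Thread(target=writer)
    thread.start()
    for step in range(steps):
        data = bytes([step % 256]) * 4  # stand-in for a computed field
        work.put(data)  # blocks only when the queue is full
    work.put(None)
    thread.join()  # the writer must finish before the run ends
    return sink
```

A usage example: <code>run_with_io_thread(3, io.BytesIO())</code> returns a buffer holding the three 4-byte blocks in step order. The queue size trades memory for overlap between computation and I/O.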

== Miscellaneous notes ==
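* A Fortran-generated binary file may not be read correctly in C (for example, unformatted sequential files carry record-length markers around each record).
* link with -lstdc++
* libF77 and libI77
* common.h and common_c.h
* write() in Fortran: unit 6 refers to the screen (standard output), and * writes to the screen as well.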

Main Page/aio

2011-02-06T15:47:08Z

Jingfu:

*Synchronous I/O means the OS services your I/O request as part of the call, so you learn the outcome (the data, or an error such as "would block") when the call returns.

*Asynchronous I/O means the call only submits the request. The OS carries it out later, for example because it is busy with other work or the device is not yet available, and reports completion separately.

*Blocking I/O means you call the I/O function to get the data, and the function does not return until the data is available.

*Non-blocking I/O means you call the I/O function and it returns right away, whether or not the data is available. In the asynchronous non-blocking case, the function just posts a request to a work queue; another process or thread fetches the request from the queue and performs the I/O. Once the data is ready, you are notified.
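
The difference between a blocking and a non-blocking call can be seen with an ordinary pipe. The sketch below is an assumption-laden illustration, not NekCEM code: it uses Python's <code>os.pipe</code> and <code>os.set_blocking</code> (POSIX systems), and the function name <code>demo_nonblocking_read</code> is hypothetical. A read on an empty non-blocking pipe returns immediately with an error instead of waiting, and the same read succeeds once data has been written.

```python
import os


def demo_nonblocking_read():
    """Read from an empty non-blocking pipe, then again after a write."""
    r, w = os.pipe()
    os.set_blocking(r, False)  # reads on r now return immediately
    try:
        try:
            os.read(r, 16)
            first = "data"
        except BlockingIOError:
            # No data yet: a blocking read would have waited here instead.
            first = "would block"
        os.write(w, b"ready")
        second = os.read(r, 16)  # data is available, so this succeeds
    finally:
        os.close(r)
        os.close(w)
    return first, second
```

This is the synchronous non-blocking case: the caller has to retry or poll. Asynchronous I/O removes the polling by having the system notify the caller on completion.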

http://www.ibm.com/developerworks/linux/library/l-async/?S_TACT=105AGX52&S_CMP=cn-a-l

Main Page/aio

2011-02-06T15:46:54Z

Jingfu:

Main Page/aio

2011-02-06T15:45:25Z

Jingfu: Created page with "synchronous means that the OS will work on your IO request once they get you r request. asynchronous means the OS will delay to work on your IO request. Maybe it is because OS i…"

Main Page/hybrid prog

2011-02-04T16:22:20Z

Jingfu:

Main Page/faq

2011-02-01T03:45:40Z

Jingfu:

Main Page/faq

2011-02-01T03:44:48Z

Jingfu:
