par_mem - memory parallelism benchmark

NAME

       par_mem - memory parallelism benchmark

SYNOPSIS

       par_mem [ -L <line size> ] [ -M <len> ] [ -W <warmups> ]
               [ -N <repetitions> ]

DESCRIPTION

       par_mem measures the available parallelism in the memory hierarchy,  up
       to  len  bytes.   Modern  processors  can often service multiple memory
       requests in parallel, while older processors typically blocked on  LOAD
       instructions and had no available parallelism (other than that provided
       by cache prefetching).  par_mem measures the available parallelism at a
       variety of memory sizes, since the available parallelism is often a
       function of where the data resides in the memory hierarchy.

       To measure the available parallelism, par_mem runs a series of
       experiments at each memory size, one for each level of parallelism.
       It builds a pointer chain of the desired length, then creates an
       array of pointers to chain entries that are evenly spaced across the
       chain, and advances those pointers through the chain in parallel.
       This yields the average memory latency for each level of
       parallelism.  The available parallelism is the average memory
       latency at parallelism 1 divided by the minimum average memory
       latency observed across all levels of parallelism.
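
       The chain and cursor setup described above can be sketched as
       follows.  This is a minimal illustration, not the par_mem source:
       the names build_chain, place_cursors and chase are hypothetical,
       and the chain is linked in simple sequential order for clarity,
       which a hardware prefetcher would likely detect.  How the real
       benchmark orders its chain entries is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: link n slots into one circular pointer chain.
 * Slot i stores the address of slot (i + 1) % n, so following the
 * pointers visits every slot once before returning to the start. */
char **build_chain(size_t n)
{
    assert(n > 0);
    char **chain = malloc(n * sizeof(*chain));
    if (chain == NULL)
        return NULL;
    for (size_t i = 0; i < n; ++i)
        chain[i] = (char *)&chain[(i + 1) % n];
    return chain;
}

/* Hypothetical helper: point k cursors at chain entries that are
 * evenly spaced (n / k slots apart) across the chain. */
void place_cursors(char **chain, size_t n, char **cursors[], size_t k)
{
    assert(k > 0 && k <= n);
    for (size_t j = 0; j < k; ++j)
        cursors[j] = &chain[j * (n / k)];
}

/* Advance every cursor by `steps` loads.  Each cursor's next load
 * depends on its own previous load, but the k cursors are independent
 * of one another, so the hardware is free to overlap them. */
void chase(char **cursors[], size_t k, size_t steps)
{
    for (size_t s = 0; s < steps; ++s)
        for (size_t j = 0; j < k; ++j)
            cursors[j] = (char **)*cursors[j];
}
```

       With an 8-entry chain and two cursors, the cursors start at
       entries 0 and 4 and stay four entries apart as they advance.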

       For  example,  the  inner  loop which measures parallelism 2 would look
       something like:

       for (i = 0; i < N; ++i) {
               p0 = (char **)*p0;
               p1 = (char **)*p1;
       }

       The overhead of the for loop is not significant, because the body of
       the actual loop is unrolled to 100 loads.  In this case, if the
       hardware can process two LOAD operations in parallel, then the
       overall latency of the loop should be equivalent to that of a loop
       with a single pointer chain, so the measured parallelism would be
       roughly two.  If, however, the hardware can only process a single
       LOAD operation at a time, or if there is significant resource
       contention between the two LOAD operations, then the loop will be
       much slower than a loop with a single pointer chain, so the measured
       parallelism will be less than two, though probably no smaller than
       one.
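
       One way to turn the per-level latencies into a single parallelism
       figure is sketched below.  It assumes the ratio is the single-chain
       latency divided by the lowest average latency seen at any level;
       the function name available_parallelism and the array layout are
       hypothetical, and the latency values in the usage note are made up
       for illustration, not measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: latency[i] is the average per-load latency
 * measured with (i + 1) pointer chains in flight, so latency[0] is
 * the single-chain baseline.  Returns baseline / lowest latency.
 * Non-positive entries are treated as failed measurements and
 * skipped. */
double available_parallelism(const double *latency, size_t levels)
{
    assert(levels > 0 && latency[0] > 0.0);
    double best = latency[0];
    for (size_t i = 1; i < levels; ++i)
        if (latency[i] > 0.0 && latency[i] < best)
            best = latency[i];
    return latency[0] / best;
}
```

       For illustrative latencies of 100, 50, 50 and 50 at levels 1
       through 4, the result is 2.0, which matches the two-LOAD case
       described above.  If no level improves on the single-chain
       latency, the result is 1.0.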

OUTPUT

       Output format is intended as input to xgraph or  some  similar  program
       (we use a perl script that produces pic input).  There is a set of data
       produced for each stride.  The data set title is the  stride  size  and
       the  data points are the array size in megabytes (floating point value)
       and the load latency over all points in that array.

AUTHOR

       Carl Staelin and Larry McVoy

       Comments, suggestions, and bug reports are always welcome.

(c)2000 Carl Staelin and Larry McVoy
