lat_mem_rd - memory read latency benchmark

NAME

       lat_mem_rd - memory read latency benchmark

SYNOPSIS

       lat_mem_rd  [  -P <parallelism> ] [ -W <warmups> ] [ -N <repetitions> ]
       size_in_megabytes stride [ stride stride...  ]

DESCRIPTION

       lat_mem_rd measures memory read latency for varying  memory  sizes  and
       strides.   The  results  are  reported in nanoseconds per load and have
       been verified accurate to within a few nanoseconds on an SGI Indy.

       The entire  memory  hierarchy  is  measured,  including  onboard  cache
       latency and size, external cache latency and size, main memory latency,
       and TLB miss latency.

       Only data accesses are measured; the instruction cache is not measured.

       The benchmark runs as two nested loops.  The outer loop iterates over
       the stride sizes and the inner loop over the array sizes.  For each
       array size, the benchmark creates a ring of pointers, each pointing
       backward one stride.  The array is traversed by

            p = (char **)*p;

       in a for loop (the overhead of the for loop is not significant; the
       loop is unrolled, 100 loads long).
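
       The ring construction and traversal described above might be
       sketched as follows.  This is an illustration, not the lmbench
       source: the names build_ring and chase are hypothetical, and the
       sketch assumes the stride is a multiple of the pointer size and the
       buffer is suitably aligned (for example, obtained from malloc).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from lmbench): lay out a ring of pointers in
 * a buffer of `size` bytes.  Slots sit `stride` bytes apart; each slot
 * holds the address of the slot one stride before it, and the first
 * slot wraps around to the last.  Returns the last slot as the start.
 */
char **build_ring(char *buf, size_t size, size_t stride)
{
    /* Assumption: stride keeps every slot pointer-aligned. */
    assert(stride >= sizeof(char *) && stride % sizeof(char *) == 0);
    assert(size >= stride);

    size_t last = (size / stride - 1) * stride;  /* offset of last slot */

    for (size_t off = stride; off <= last; off += stride)
        *(char **)(buf + off) = buf + off - stride;  /* point backward */
    *(char **)buf = buf + last;                      /* close the ring */

    return (char **)(buf + last);
}

/*
 * Follow the chain for `loads` dependent loads.  Each load needs the
 * result of the previous one, so the loads cannot overlap.  Returning
 * the final pointer keeps the compiler from discarding the loop.
 */
char **chase(char **p, size_t loads)
{
    while (loads-- > 0)
        p = (char **)*p;
    return p;
}
```

       With a 1024-byte buffer and a stride of 128, the ring has eight
       slots, so eight loads starting from the pointer returned by
       build_ring arrive back at that same pointer.  A real measurement
       would time a large number of such loads and divide by the count.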

       The  size  of  the  array  varies  from  512 bytes to (typically) eight
       megabytes.  For the small sizes, the cache will have an effect, and the
       loads  will  be  much faster.  This becomes much more apparent when the
       data is plotted.

       Since this benchmark uses fixed-stride offsets in the pointer chain, it
       may   be   vulnerable  to  smart,  stride-sensitive  cache  prefetching
       policies.   Older  machines  were  typically  able  to   prefetch   for
       sequential  access patterns, and some were able to prefetch for strided
       forward access patterns, but only a few  could  prefetch  for  backward
       strided  patterns.   These capabilities are becoming more widespread in
       newer processors.

OUTPUT

       Output format is intended as input to xgraph or  some  similar  program
       (we use a perl script that produces pic input).  There is a set of data
       produced for each stride.  The data set title is the  stride  size  and
       the  data points are the array size in megabytes (floating point value)
       and the load latency over all points in that array.
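
       A reader of this output might pick out the data points as sketched
       below.  The sketch assumes that each data point is a line holding
       two whitespace-separated floating point values (array size, then
       latency); that layout, the struct lat_point type, and the
       parse_point function are assumptions made for illustration and
       should be checked against the actual output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record for one data point. */
struct lat_point {
    double size_mb;     /* array size in megabytes */
    double latency_ns;  /* load latency in nanoseconds */
};

/*
 * Try to read one data point from `line`.  Returns 1 if the line held
 * two numbers, 0 otherwise (for example, a data set title line).
 */
int parse_point(const char *line, struct lat_point *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%lf %lf", &out->size_mb, &out->latency_ns) == 2;
}
```

       Lines for which parse_point returns 0 can be treated as the start
       of a new data set, one per stride.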

INTERPRETING THE OUTPUT

       The output is best examined as a graph, which typically shows four
       plateaus.  The graph should be plotted with log base 2 of the array
       size on the X axis and the latency on the Y axis.  Each stride is
       then plotted as a curve.  The plateaus that appear correspond to the
       onboard cache (if present), external cache (if present), main memory
       latency, and TLB miss latency.

       As  a  rough  guide,  you  may  be able to extract the latencies of the
       various parts as follows, but you should really  look  at  the  graphs,
       since these rules of thumb do not always work (some systems do not have
       onboard cache, for example).

       onboard cache   Try stride of 128 and array size of .00098.

       external cache  Try stride of 128 and array size of .125.

       main memory     Try stride of 128 and array size of 8.

       TLB miss        Try the largest stride and the largest array.
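
       The array sizes in the list above are in megabytes.  Assuming a
       megabyte here means 2^20 bytes (an inference from the values, not
       something this page states), .00098 is a 1 KB array rounded to five
       decimal places, .125 is 128 KB, and 8 is 8 MB.  The hypothetical
       helper below shows the conversion.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert an array size in bytes to megabytes,
 * assuming 1 megabyte = 1024 * 1024 bytes.
 */
double bytes_to_mb(size_t bytes)
{
    assert(bytes > 0);
    return (double)bytes / (1024.0 * 1024.0);
}
```

       For example, bytes_to_mb(1024) is 0.0009765625, which rounds to
       .00098, and bytes_to_mb(131072) is exactly 0.125.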

BUGS

       This program is dependent on the correct operation of mhz(8).   If  you
       are  getting  numbers  that seem off, check that mhz(8) is giving you a
       clock rate that you believe.

ACKNOWLEDGEMENT

       Funding  for  the  development  of  this  tool  was  provided  by   Sun
       Microsystems Computer Corporation.

AUTHOR

       Carl Staelin and Larry McVoy

       Comments, suggestions, and bug reports are always welcome.

(c)1994 Larry McVoy                 $Date$
