NAME
israndom - randomness testing using data compressors over fixed-size
alphabets
SYNOPSIS
israndom [-a alphasize] [-c compressor] [-s samplecount] [-qhnr]
[filename]
DESCRIPTION
israndom tests a sequence of symbols for randomness. israndom tries to
determine if a given sequence of trials could reasonably be assumed to
be from a random uniform distribution over a fixed-size alphabet of
2-256 symbols.
israndom assumes that each symbol (or sample trial) is represented by
exactly one byte. The only exceptions to this rule are the -n and -r
options, which ignore newlines and carriage returns, respectively (see
below).
israndom is based on the mathematical ideas of Shannon, Kolmogorov, and
Cilibrasi and uses the following formula to determine an expected size
for a sample of k trials of a uniform distribution over an
alphasize-symbol alphabet. Each symbol takes log2(alphasize) bits, so
the total cost c (in bits) for the ensemble of samples is
k * log2(alphasize). This number is rounded up to the nearest byte and
increased by one to arrive at the final estimate of the expected
communication cost on the assumption of uniform randomness.
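As a worked instance of the formula above, the following Python sketch
computes the expected size. The function name expected_size_bytes is
hypothetical. The logarithm is assumed to be base 2 (since the cost is
stated in bits), and the final increment is assumed to be one byte.
None of this is taken from the israndom source.

```python
import math

def expected_size_bytes(k: int, alphasize: int) -> int:
    """Expected size in bytes of k uniform trials over alphasize symbols.

    Assumption: log base 2, and "increased by one" means one byte.
    """
    if not 2 <= alphasize <= 256:
        raise ValueError("alphasize must be between 2 and 256")
    bits = k * math.log2(alphasize)   # total cost in bits
    return math.ceil(bits / 8) + 1    # round up to whole bytes, then add one

if __name__ == "__main__":
    # Default sample count with a full byte alphabet and a binary alphabet.
    print(expected_size_bytes(393216, 256))
    print(expected_size_bytes(393216, 2))
```

With a 256-symbol alphabet each sample costs a full byte, so the
estimate is simply the sample count plus one.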
If the compressed size of k samples is less than c, this represents a
randomness deficiency and the randomness test fails: israndom exits
with a nonzero exit status. If israndom indicates that a source is
nonrandom, this fact is effectively certain provided the compression
module is correct and invertible. If the compressed size is at least
the threshold value c, the file appears to be random and passes the
test, and israndom exits with a return value of 0. In either case, it
prints the alphabet size, expected compressed size, sample count, and
randomness difference before exiting with the appropriate return code.
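The decision rule can be sketched in Python with the standard bz2
module standing in for the bzlib compressor. This is an illustration
under stated assumptions, not the israndom implementation: the name
looks_random is hypothetical, the threshold is computed in bytes with a
base-2 logarithm, and the "randomness difference" is assumed to be the
compressed size minus the expected size.

```python
import bz2
import math
import os

def looks_random(samples: bytes, alphasize: int):
    """Return (passed, expected_bytes, difference) for one-byte samples.

    Assumption: difference = compressed size - expected size, so a
    negative value indicates a randomness deficiency.
    """
    k = len(samples)
    expected = math.ceil(k * math.log2(alphasize) / 8) + 1
    compressed = len(bz2.compress(samples))
    return compressed >= expected, expected, compressed - expected

if __name__ == "__main__":
    # A strictly alternating two-symbol sequence compresses far below
    # the threshold, so it fails the test.
    print(looks_random(bytes([0, 1]) * 50000, 2))
    # Bytes from the operating system's random source should not
    # compress, so they are expected to pass.
    print(looks_random(os.urandom(100000), 256))
```

A command-line wrapper would map the first element of the result to an
exit status of 0 (pass) or nonzero (fail).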
The default number of samples is 393216. Larger sample counts should
increase accuracy, whereas using too few samples leaves the method
unable to resolve randomness in certain situations. This is a
theoretically unavoidable limitation of all effective randomness tests.
If a filename is given, it is read to find the samples to analyze. If
the filename "-" is given, or no filename is given at all, then
israndom reads from standard input.
If text files are to be used, it is important to specify one or both of
-n and -r since without these, end of line characters will be
misinterpreted as samples.
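To see why line endings matter, the following Python sketch mimics the
effect of -n and -r on a small text sample. The function name
strip_line_endings and its parameters are hypothetical and only model
the behaviour described here.

```python
def strip_line_endings(data: bytes,
                       ignore_newlines: bool = True,
                       ignore_returns: bool = True) -> bytes:
    """Drop newline and/or carriage return bytes before sampling."""
    drop = b""
    if ignore_newlines:
        drop += b"\n"
    if ignore_returns:
        drop += b"\r"
    # bytes.translate with a None table only deletes the listed bytes.
    return data.translate(None, drop)

if __name__ == "__main__":
    text = b"0110\r\n1001\n"
    # Without filtering, the line endings count as two extra symbols.
    print(len(set(text)))
    # With both filters applied, only the two real symbols remain.
    print(len(set(strip_line_endings(text))))
```

Without the filtering, a binary sequence stored as text would be
treated as having more than two symbols and a longer sample count.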
OPTIONS
-c compressor_name
set compressor explicitly to compressor_name instead of the
default, bzlib. For basic analysis, bzlib is usually
sufficient. For detecting complex or subtle biases, a more
powerful compression module such as lzma (lzmax) or ppmd (ppmdx)
will detect more types of non-randomness. Because Lempel-Ziv
types are universal, all effective randomness tests can be
captured as a kind of compression discriminant function.
-n ignore newlines (so that text files may be used)
-r ignore carriage returns (so that text files may be used)
-a alphasize
set alphabet size to alphasize, an integer between 2 and 256. If
you do not specify an alphabet size, it is automatically
determined by the contents of the samples.
-s samplecount
Use samplecount samples instead of the default of 393216. Using
a number that is too small here will reduce the accuracy of the
test, causing everything to appear to be random. If 0 is used,
it means to read until EOF.
-q quiet mode, with no extra status messages
-h print help and exit.
EXAMPLES
First, we can verify that the cryptographically strong random
number generator is correct:
israndom /dev/urandom
Next, we can notice that the "od" command, without extra options, is
not random because it prints out addresses and spaces predictably.
Most compressors can tell by the regular spaces that it is not random:
od /dev/urandom | israndom -n -r
but if we remove spaces using 'tr', then a more powerful compressor,
lzmax, is required to demonstrate the non-randomness of the sequence:
od /dev/urandom | tr -d ' ' | israndom -n -r -c lzmax
Removing the address lines using an od option yields the expected
result once again that the sequence is effectively random:
od -An /dev/urandom | tr -d ' ' | israndom -n -r -c lzmax
The above sequence is not actually random, because every third octal
digit only ranges from 0 to 3, since 377 octal is the same as 255
decimal. This subtle pattern is detectable using 10 million samples and
the advanced ppmdx compressor:
od -An /dev/urandom | tr -d ' ' | israndom -n -r -c ppmdx -s 10000000
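The digit bias can be checked directly with a short Python sketch. It
assumes each byte is printed as three octal digits (as od -b does); the
default od format may group bytes differently, in which case the
restricted digit falls elsewhere. The function name octal_digit_maxima
is hypothetical.

```python
def octal_digit_maxima():
    """Largest value seen at each of the three octal digit positions
    when every byte value 0..255 is printed as three octal digits."""
    digits = [format(b, "03o") for b in range(256)]
    return [max(int(d[i]) for d in digits) for i in range(3)]

if __name__ == "__main__":
    # The largest byte value, 255, is 377 in octal.
    print(format(255, "o"))
    # The leading digit never exceeds 3; the other two reach 7.
    print(octal_digit_maxima())
```

A digit position that uses only four of the eight possible values
carries two bits instead of three, which is the kind of deficiency a
sufficiently strong compressor can exploit given enough samples.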
As a sanity check, we see that even in extreme analysis as above,
/dev/urandom still checks out okay as random, even with newlines and
carriage returns removed for good measure:
cat /dev/urandom | israndom -n -r -c ppmdx -s 10000000
ENVIRONMENT
No environment variables.
BUGS
Please report bugs to the Debian BTS.
AUTHOR
Rudi Cilibrasi <cilibrar@cilibrar.com>
SEE ALSO
complearn(5), ncd(1)