1
00:00:00,800 --> 00:00:05,360
We present LXM, a family of algorithms
for pseudorandom number generation.
2
00:00:06,160 --> 00:00:10,240
In this talk we give background for the
problem and explain some of the prior art,
3
00:00:10,240 --> 00:00:17,520
then present the algorithm itself in one slide,
now that we have set the context. Then we discuss
4
00:00:17,520 --> 00:00:22,000
some of the properties of the algorithm and then
present the specific contributions of our paper.
5
00:00:24,720 --> 00:00:30,400
Pseudorandom number generators have been around
for about seven decades now. Back in the 1960s what
6
00:00:30,400 --> 00:00:35,680
we typically wanted was a stream of floating-point
values drawn uniformly from the range zero to one
7
00:00:36,240 --> 00:00:40,720
and this is approximated by drawing uniformly
from a set of specific values. You would generate
8
00:00:40,720 --> 00:00:46,080
an integer between zero and p and then divide by p.
And it was considered okay if the low-order bits
9
00:00:46,080 --> 00:00:50,160
were not very random because they would end up in
the lower part of the floating point significand.
10
00:00:51,040 --> 00:00:54,720
Nowadays we still want to be able to generate
that kind of floating-point value drawn
11
00:00:54,720 --> 00:00:59,280
uniformly from zero to one, but we typically
expect the value p to be much, much larger.
12
00:01:00,000 --> 00:01:05,440
In addition we would also like to be able to
extract a stream of w-bit integers drawn uniformly
13
00:01:06,160 --> 00:01:12,080
from the w-bit integers 0 to (2-to-the-w)-minus-1,
and we expect all bits of each integer to
14
00:01:12,080 --> 00:01:17,200
be equally random, because these integers may be
put to various purposes: as bit strings, or to take
15
00:01:17,200 --> 00:01:22,560
a remainder, where it matters that the lower
bits be as random as the upper bits, and
16
00:01:22,560 --> 00:01:27,840
so on: there are many different applications, and we
want all the bits to be good. Other things have
17
00:01:27,840 --> 00:01:31,680
also changed over the years to change the nature
of the problem of pseudorandom number generation.
18
00:01:32,400 --> 00:01:38,720
Moore's law for about 40 years made computers
ever faster and as a result nowadays applications
19
00:01:38,720 --> 00:01:43,920
can draw many more numbers in the course of their
execution. So a pseudo-random number generator that
20
00:01:43,920 --> 00:01:49,040
repeats its sequence after generating only
2-to-the-32 values is not considered okay anymore.
21
00:01:50,160 --> 00:01:54,880
Furthermore the test suites used to verify the
statistical properties of pseudorandom number
22
00:01:54,880 --> 00:02:00,240
generators have become much more discriminating.
Nowadays we routinely test trillions of generated
23
00:02:00,240 --> 00:02:05,280
values, rather than just millions, looking
for subtle statistical anomalies. This has
24
00:02:05,280 --> 00:02:10,800
improved the quality of the PRNGs but has
also made them harder to design. In addition
25
00:02:10,800 --> 00:02:16,320
computers now have a lot of parallelism of
various kinds, either SIMD or multi-threading,
26
00:02:16,320 --> 00:02:21,360
and so we want to be able to run not just one
pseudorandom number generator, but perhaps dozens
27
00:02:21,360 --> 00:02:26,480
or millions of them simultaneously and we need
them to behave as if statistically independent.
28
00:02:28,240 --> 00:02:32,880
Here is a model of the basic structure of a
pseudorandom number generator. There is a state,
29
00:02:32,880 --> 00:02:38,320
which we can think of as residing in a register
or a variable, and at every tick of the clock
30
00:02:38,320 --> 00:02:42,400
the value of the state is run through a state
transition function and the new value is then
31
00:02:42,400 --> 00:02:47,600
stored back into the state. In addition, that new
computed value is put through an additional output
32
00:02:47,600 --> 00:02:52,960
function and the result of the output function is
a stream of values, one at each tick of the clock.
33
00:02:55,200 --> 00:03:01,520
To describe it a little bit more mathematically,
we will call the initial state s0 and for every
34
00:03:01,520 --> 00:03:08,320
clock tick, s-sub-k is computed by applying
the transition function f to s-sub-(k-minus-one)
35
00:03:08,320 --> 00:03:14,560
and so we are constantly producing a new stream of
states s1, s2, s3. The output function g is applied
36
00:03:14,560 --> 00:03:20,640
to each of these state values resulting in a
stream of values t-sub-k, producing t-sub-1, t-sub-2,
37
00:03:20,640 --> 00:03:27,360
and t-sub-3. We assume the function f is bijective,
and as a result the sequence of states is a cycle,
38
00:03:28,000 --> 00:03:32,560
and the smallest k for which s-sub-k equals
s-sub-0 is called the period of the generator;
39
00:03:32,560 --> 00:03:36,080
this is the total length of
the circular sequence of states.
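This state/transition/output model, and the notion of period, can be sketched in a few lines of Python. This is a toy illustration only: the functions f and g below are tiny stand-ins, not a real generator.

```python
def f(s):
    # toy bijective transition function on 3-bit states
    return (s + 3) % 8

def g(s):
    # toy output function, applied to the state directly
    return s ^ 1

def period(f, s0):
    # smallest k > 0 with s_k == s_0: the length of the state cycle
    s, k = f(s0), 1
    while s != s0:
        s, k = f(s), k + 1
    return k
```

Because f is bijective, every starting state lies on a cycle; here gcd(3, 8) = 1, so every state cycle has the full length 8.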
40
00:03:37,600 --> 00:03:41,440
On this slide we make a slight adjustment to
the algorithm. It's kind of an engineering hack:
41
00:03:42,240 --> 00:03:46,240
rather than taking the output of f
and feeding that as the input to g
42
00:03:46,240 --> 00:03:52,400
instead g takes the state directly. This is a trick
that allows f and g to be computed in parallel.
43
00:03:52,400 --> 00:03:58,720
As a result g computes the sequence of
values t0, t1, t2 rather than t1, t2, t3,
44
00:03:58,720 --> 00:04:01,440
but it doesn't matter because the
overall sequence of states is a cycle.
45
00:04:03,200 --> 00:04:08,960
Now we look at a specific example that's
been widely used: a Linear Congruential PRNG
46
00:04:08,960 --> 00:04:15,680
with a prime modulus. So given an integer state s,
a modulus p, and a multiplier m, the update function
47
00:04:15,680 --> 00:04:21,840
multiplies s by m, takes the result mod p and
stores it back. The output function divides by p
48
00:04:21,840 --> 00:04:26,160
and this gives a sequence of values between zero
and one. In fact they will all be non-zero; the
49
00:04:26,160 --> 00:04:31,760
reason is that we must choose a non-zero initial state,
or else all of the produced values will be zero.
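This prime-modulus scheme can be sketched concretely. The constants below are the classic "minimal standard" choices (p = 2-to-the-31 minus 1, m = 16807), used here only as one well-known example:

```python
def lehmer(seed, n):
    # multiplicative LCG with prime modulus: s <- m*s mod p
    p, m = 2**31 - 1, 16807     # Park-Miller "minimal standard" parameters
    s = seed % p                # must be non-zero for non-zero output
    values = []
    for _ in range(n):
        s = (s * m) % p         # state transition
        values.append(s / p)    # output: floating-point divide by p
    return values
```

Since p is prime and the seed is non-zero, the state never reaches zero, so every output lies strictly between 0 and 1.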
50
00:04:32,400 --> 00:04:38,400
It's also important to choose p and m carefully;
typically the modulus p is a prime. One drawback of
51
00:04:38,400 --> 00:04:43,440
this specific algorithm is that there is a floating-point
division in the output function, and division
52
00:04:43,440 --> 00:04:49,920
tends to be expensive. An alternative more suited
to modern requirements is a linear congruential
53
00:04:49,920 --> 00:04:55,840
generator using a power-of-two modulus. We
use k bits of state and a modulus of 2-to-the-k,
54
00:04:56,720 --> 00:05:01,920
and the multiplier m must be odd. The output
function, instead of requiring a floating-point
55
00:05:01,920 --> 00:05:07,760
divide, simply retains the high w bits of
the k bits of state to produce a w-bit output.
56
00:05:07,760 --> 00:05:11,840
This is much faster than floating-point division,
and converting the resulting bit string
57
00:05:11,840 --> 00:05:16,480
to a floating-point number is also reasonably
fast. This is a pretty good generator when k is
58
00:05:16,480 --> 00:05:22,960
at least twice w, but when k equals w, the low-order
bits tend to have small period; in fact, the
59
00:05:22,960 --> 00:05:28,000
lowest-order bit is always a 1. The overall period
cannot be larger than two-to-the-(k-minus-two).
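A sketch of this power-of-two variant, with k = 64 and w = 16. The multiplier here is just an illustrative odd constant with m % 8 == 5, not a recommendation; note that the low bit of the state is stuck at 1, which is exactly why only the high bits are kept.

```python
def lcg_pow2(seed, n, k=64, w=16):
    # multiplicative LCG modulo 2^k; output keeps only the high w bits
    mask = (1 << k) - 1
    m = 0xD1342543DE82EF95       # illustrative odd multiplier, m % 8 == 5
    s = (seed | 1) & mask        # the state must be odd
    highs, lows = [], []
    for _ in range(n):
        s = (s * m) & mask       # s <- m*s mod 2^k: just a multiply, no divide
        highs.append(s >> (k - w))  # the w-bit output: high bits only
        lows.append(s & 1)          # low bit of the full state
    return highs, lows
```

Since odd times odd is odd, the state stays odd forever: the lowest-order state bit has period 1.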
60
00:05:30,080 --> 00:05:35,600
A further improvement to this strategy is to
introduce an additive parameter a, which must
61
00:05:35,600 --> 00:05:42,240
be odd, and require the multiplier mod 8 to be
equal to 5. With this set of constraints the
62
00:05:42,240 --> 00:05:47,280
period will be fully 2-to-the-k. Every possible
w-bit value can be generated; every possible
63
00:05:47,280 --> 00:05:53,680
k-bit state can be generated. When k equals w, all
two-to-the-w possible values can be generated and
64
00:05:53,680 --> 00:05:59,120
therefore we say that this generator is exactly
equi-distributed, but the low-order bits still have
65
00:05:59,120 --> 00:06:04,320
low period. The lowest-order bit has a period
of two; it alternates between 0 and 1.
66
00:06:04,320 --> 00:06:09,920
The next bit has a period of four (0 0 1 1),
the next bit has a period of eight, and so on.
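The full-period behavior and the low-order-bit weakness are both easy to see at toy size. Below, k = 8 so the whole cycle is visible; m = 5 is the smallest multiplier with m % 8 == 5, chosen purely for the demo.

```python
def lcg_affine(seed, a, n, k=8):
    # s <- (m*s + a) mod 2^k with m % 8 == 5 and a odd: period is exactly 2^k
    mask = (1 << k) - 1
    m = 5                        # smallest multiplier with m % 8 == 5
    s = seed & mask
    states = []
    for _ in range(n):
        s = (m * s + (a | 1)) & mask
        states.append(s)
    return states
```

Over 256 steps the generator visits every 8-bit state exactly once, yet bit 0 of the state just alternates 1, 0, 1, 0 and bit 1 repeats with period four.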
67
00:06:12,240 --> 00:06:17,440
The introduction of this additive parameter a
suggests an interesting idea: can we use this
68
00:06:17,440 --> 00:06:22,800
parameter to provide for independent parallel
streams to be used by multi-threaded or simd
69
00:06:22,800 --> 00:06:27,600
operations? The idea would be to create many
instances of the algorithm, each with its own
70
00:06:27,600 --> 00:06:33,840
state s and with a different value for a, and then
each parallel thread would use its own instance.
71
00:06:35,600 --> 00:06:40,160
Versions of the algorithm without an additive
parameter have also been used for multi-threading.
72
00:06:40,160 --> 00:06:44,000
The idea is simply to have one big
state cycle as shown here on the left,
73
00:06:44,000 --> 00:06:48,400
and have the different threads shown here as
t1 through t4 traverse different parts of the
74
00:06:48,400 --> 00:06:52,800
state cycle. It's important to initialize
the states for the threads so they will
75
00:06:52,800 --> 00:06:56,960
traverse different parts of the state cycle and
won't pick up accidental statistical correlations.
76
00:06:57,920 --> 00:07:03,520
The idea on the right is for each thread to have
its own state cycle, which could be smaller, but
77
00:07:03,520 --> 00:07:07,600
if each of these individual cycles behaves
as if statistically independent of the others
78
00:07:08,240 --> 00:07:12,720
then in effect all these
threads can use the same algorithm.
79
00:07:12,720 --> 00:07:15,840
The question is: can the parameter
a be used to provide different cycles?
80
00:07:18,240 --> 00:07:24,240
It turns out that the answer to this question is
NO: this is not a good idea. There's a theorem that
81
00:07:24,240 --> 00:07:29,280
if you have two instances of this algorithm, one
with additive parameter a and one with a-prime, for
82
00:07:29,280 --> 00:07:34,800
any choice of a and a-prime you can find constants
i and r such that in effect the state cycle of one
83
00:07:34,800 --> 00:07:42,080
is equal to the other one offset by i positions and
then having the value r added to it mod 2-to-the-k.
84
00:07:42,080 --> 00:07:49,920
In visual terms, changing a doesn't change
the shape of the graph that plots t-sub-j versus j;
85
00:07:49,920 --> 00:07:54,480
it only translates it both in time and then
vertically on the graph modulo 2-to-the-k.
86
00:07:57,200 --> 00:08:01,040
This theorem allows us to show an equivalence
between two different algebraic structures.
87
00:08:01,040 --> 00:08:04,400
The upper one is the linear congruential
generator that we've already seen with the
88
00:08:04,400 --> 00:08:09,680
additive parameter a; the one at the bottom
replaces a with a constant 1 and introduces
89
00:08:09,680 --> 00:08:15,360
a value r that is added in to the output just
before the high bits are kept. If r is chosen
90
00:08:15,360 --> 00:08:20,560
appropriately as a function of a then these two
generators produce the exact same cycles of values.
91
00:08:22,160 --> 00:08:26,320
But there is an additional improvement we can make
that makes this idea work out pretty well after
92
00:08:26,320 --> 00:08:32,160
all: after the linear congruential generator we add
an additional hash function to the output function.
93
00:08:32,960 --> 00:08:37,680
Think of this hash function as a "bit mixer";
ideally it has good avalanche statistics so that
94
00:08:37,680 --> 00:08:42,240
whenever any bit that comes into it is
changed, half of the output bits tend to change.
95
00:08:42,960 --> 00:08:46,880
This generator works pretty well for parallel
threads and it solves the problem of low period
96
00:08:46,880 --> 00:08:52,160
bits because it allows the high period
upper bits to affect the low period bits
97
00:08:52,160 --> 00:08:55,680
through the use of the hash function.
But the period is still only 2-to-the-k.
98
00:08:57,920 --> 00:09:02,800
If you want to have a generator with a very
large period, we need to make the state larger.
99
00:09:02,800 --> 00:09:08,160
Unfortunately, then the multiplication operation in
the linear congruential algorithm becomes expensive,
100
00:09:08,160 --> 00:09:13,840
because it has quadratic cost. An alternate
approach is to use an F2-linear, or xor-based,
101
00:09:13,840 --> 00:09:19,840
generator. Here the state is treated as a vector of
n bits and the update function multiplies it by a
102
00:09:19,840 --> 00:09:26,160
fixed n-by-n bit matrix. The overall period can be
as large as (two-to-the-n)-minus-one. In this case
103
00:09:26,160 --> 00:09:32,160
again x, the state, cannot be all zero or else the
outputs would all be zero. A feature of this is
104
00:09:32,160 --> 00:09:39,520
that the output is n-over-w equidistributed; that
is, if n-over-w is 2 or 3 or 4 then
105
00:09:39,520 --> 00:09:44,800
2-tuples or 3-tuples or 4-tuples generated
by the generator are equidistributed. Furthermore
106
00:09:44,800 --> 00:09:48,960
we can choose the bit matrix carefully so
instead of having quadratic time execution
107
00:09:48,960 --> 00:09:54,240
it has constant time execution using a fixed
series of w-bit SHIFT, ROTATE, and XOR instructions.
108
00:09:55,840 --> 00:10:01,520
The statistical properties of the output of
an F2-linear generator are also not perfect
109
00:10:01,520 --> 00:10:06,160
but this can be considerably improved by the same
trick of adding a hash function onto the output.
110
00:10:08,400 --> 00:10:14,480
The statistical deficiencies of both the linear
congruential algorithm and the F2-linear algorithm
111
00:10:15,600 --> 00:10:20,640
can be ameliorated by combining them into a
compound PRNG algorithm by using one instance of
112
00:10:20,640 --> 00:10:27,360
each and then summing their outputs. Both Marsaglia,
and L'Ecuyer and Granger-Piché, have studied such
113
00:10:27,360 --> 00:10:33,760
compound generators in detail. One advantage is
that the overall period is the product of their
114
00:10:33,760 --> 00:10:38,560
individual periods, because the period of the
linear congruential generator is a power of two
115
00:10:39,280 --> 00:10:47,120
and the period of the XOR-based generator is odd.
An additional advantage to these generators is
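The period-multiplication argument is easy to check on toy components: a stream with period 8 (a power of two) combined with one of period 7 (odd) repeats only after 8 times 7 = 56 steps, because the two periods are coprime.

```python
def sum_period(pa, pb):
    # combine two tiny cyclic streams by summing, then measure
    # the smallest shift after which the combined stream repeats
    seq = [i % pa + i % pb for i in range(4 * pa * pb)]
    for p in range(1, len(seq)):
        if all(seq[j] == seq[j + p] for j in range(len(seq) - p)):
            return p
    return len(seq)
```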
116
00:10:47,120 --> 00:10:52,480
that they are equidistributed. Their output is
still slightly weak but as Marsaglia commented,
117
00:10:52,480 --> 00:11:00,160
if the numbers produced are not random they are at
least higgledy-piggledy. And now, ta-da! With all of
118
00:11:00,160 --> 00:11:04,960
that as background, we can show you the diagram
for the LXM algorithm presented in our paper.
119
00:11:05,920 --> 00:11:09,760
It is a simple combination of elements we've
already seen. There is a linear congruential
120
00:11:09,760 --> 00:11:15,280
generator and an XOR-based generator; those results
are summed, and then the result is run through a
121
00:11:15,280 --> 00:11:22,080
hash function, also known as the mixer; and it is
our thesis that this compound generator with a
122
00:11:22,080 --> 00:11:27,840
hash function added will successfully solve
many of the problems we faced heretofore.
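Putting the three pieces together, an LXM-style generator can be sketched as below. This is a simplified illustration, not the JDK implementation: the LCG multiplier, the xorshift step, and the SplitMix64-style mixer are all stand-in component choices.

```python
MASK64 = (1 << 64) - 1

def mix64(z):
    # SplitMix64-style mixer used as the output hash
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

class LXMSketch:
    def __init__(self, a, s, x):
        self.a = a | 1               # additive parameter: odd, selects the stream
        self.s = s & MASK64          # LCG state
        self.x = (x & MASK64) or 1   # XBG state: must be non-zero
    def next64(self):
        out = mix64((self.s + self.x) & MASK64)   # sum the two states, then mix
        # LCG update: odd multiplier with m % 8 == 5 gives full period 2^64
        self.s = (self.s * 0xD1342543DE82EF95 + self.a) & MASK64
        # XBG update: Marsaglia 64-bit xorshift
        x = self.x
        x ^= (x << 13) & MASK64
        x ^= x >> 7
        x ^= (x << 17) & MASK64
        self.x = x
        return out
```

Two instances that differ only in the additive parameter a diverge from the second output onward: their LCG states differ after one update, and the mixer is a bijection, so different sums give different outputs.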
123
00:11:29,600 --> 00:11:35,360
A Java implementation of this algorithm has
already been deployed in JDK17 and this is part
124
00:11:35,360 --> 00:11:41,120
of a larger API design that includes a new
RandomGenerator interface, which we expect to make it
125
00:11:41,120 --> 00:11:45,680
easier for new PRNG algorithms to be created
and also to make it easier for applications
126
00:11:45,680 --> 00:11:51,200
to switch among PRNG algorithms. In addition
to members of the LXM family of algorithms,
127
00:11:52,000 --> 00:11:58,880
this JDK17 update also includes versions of
the XOR-based algorithms "xoroshiro" and "xoshiro".
128
00:12:02,000 --> 00:12:06,880
But, wait! The title of our paper promises that
this algorithm is splittable. What does that mean?
129
00:12:06,880 --> 00:12:12,720
The idea is that given one instance of
your algorithm, you can dynamically create
130
00:12:12,720 --> 00:12:16,880
the state for a new one that is statistically
independent. This is a generalization of the
131
00:12:16,880 --> 00:12:20,560
well-known idea of jumping to a randomly
chosen point within a single state cycle.
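The shape of a split operation is simply to draw the child's entire state from the parent's own output stream. The sketch below is hypothetical in its details (the function names and the stand-in parent source are invented for illustration); it is not SplitMix's or LXM's exact procedure.

```python
import itertools

def split(next64):
    # draw a complete child state from the parent generator's outputs
    a = next64() | 1       # child's additive parameter, forced odd
    s = next64()           # child's LCG state: any 64-bit value is fine
    x = next64() or 1      # child's XBG state, forced non-zero
    return a, s, x

# stand-in "parent" stream for illustration: a Weyl sequence, not a real PRNG
_ctr = itertools.count(1)
parent = lambda: (next(_ctr) * 0x9E3779B97F4A7C15) & ((1 << 64) - 1)
```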
132
00:12:21,600 --> 00:12:25,920
This idea appeared in the SplitMix
algorithm presented seven years ago at OOPSLA.
133
00:12:26,480 --> 00:12:31,520
The idea was just to generate the new state
data at random, but this approach was derived
134
00:12:31,520 --> 00:12:36,560
methodically by stepwise refinement of an
earlier algorithm called DotMix from 2012.
135
00:12:37,600 --> 00:12:42,560
But not all states for the SplitMix
algorithm work equally well, so it's necessary
136
00:12:42,560 --> 00:12:47,280
to reject certain state configurations that are
known to be weak. This was deployed as the Java
137
00:12:47,280 --> 00:12:54,240
class SplittableRandom in JDK8. Unfortunately,
other configurations also turned out to be weak;
138
00:12:54,240 --> 00:13:00,960
see this paper for an explanation of why. LXM
uses the same approach of splitting new instances,
139
00:13:00,960 --> 00:13:06,240
creating their states at random, but we have good
theoretical and empirical reasons to believe that LXM
140
00:13:06,240 --> 00:13:10,080
does not have any weak configurations.
We could be wrong, but this is the hope.
141
00:13:12,320 --> 00:13:15,920
Other strengths of the LXM algorithm stem
from its specific combination of components.
142
00:13:16,560 --> 00:13:21,440
Thanks to the linear congruential generator,
the w-bit results are exactly equidistributed.
143
00:13:22,000 --> 00:13:25,760
The additive parameter a makes it easy
to provide independent parallel streams,
144
00:13:26,960 --> 00:13:33,120
and the LCG greatly improves the tuple
equidistribution of the XBG. Thanks to that
145
00:13:33,120 --> 00:13:38,160
XOR-based generator, the period can be made very
large without a large speed penalty, and if the
146
00:13:38,160 --> 00:13:45,440
XBG produces equidistributed tuples of a specific
length, so does LXM, only better. The mixing function
147
00:13:45,440 --> 00:13:50,880
eliminates linear artifacts, particularly the
low-period low-order bits, and the mixing function
148
00:13:50,880 --> 00:13:56,320
is crucial to producing independent parallel streams:
in effect the additive parameter of the LCG
149
00:13:57,040 --> 00:14:01,840
selects a specific variant of the hash
function provided by the mixing function.
150
00:14:02,400 --> 00:14:07,040
LXM is a simple, even incremental, idea,
but apparently not in the prior literature.
151
00:14:07,040 --> 00:14:10,880
We've seen many combinations of two of
these elements before, but not all three.
152
00:14:12,080 --> 00:14:17,280
To conclude, the contributions of this paper are:
explanations of why these specific components were
153
00:14:17,280 --> 00:14:22,880
chosen and why we combine them in a specific way;
analysis of certain properties of this combination,
154
00:14:22,880 --> 00:14:27,040
including the period, equidistribution, and
probability of accidental correlations;
155
00:14:27,920 --> 00:14:31,520
we provide a comparison of this algebraic
structure to a great deal of prior work;
156
00:14:32,400 --> 00:14:36,880
we did extensive quality testing using
the TestU01 and PractRand test suites,
157
00:14:37,440 --> 00:14:41,920
and these test results are reported
in the paper; we did studies of scaling,
158
00:14:41,920 --> 00:14:47,520
testing up to 2-to-the-24 parallel streams using
various splitting strategies, and also testing very
159
00:14:47,520 --> 00:14:53,360
small versions of the algorithm using as few
as 48 bits of state; and finally we conducted timing
160
00:14:53,360 --> 00:14:59,920
tests that demonstrate that LXM is indeed almost
as fast as SplitMix. Thank you for your attention.