Slides for a talk to be given at ACM PPoPP on February 8, 2017. This 25-minute talk builds on the paper as accepted by PPoPP (Archivist 2016-057) and a previous version of the slides presented at NVIDIA GTC 2016 (Archivist 2016-0055). *** We describe a SIMD technique for drawing values from multiple discrete distributions,
such as sampling from the random variables of a mixture model,
that avoids computing a complete table of partial
sums of the relative probabilities. A table of alternate
("butterfly-patterned") form is faster to compute, making
better use of coalesced memory accesses; from this table,
complete partial sums are computed on the fly during a binary search.
Measurements using CUDA 7.5 on an NVIDIA Titan Black GPU show that
this technique makes an entire machine-learning application that
uses a Latent Dirichlet Allocation topic model with 1024 topics
is about 13% faster (when using single-precision floating-point data)
or about 35% faster (when using double-precision floating-point data)
than doing a straightforward matrix transposition after using coalesced accesses.