Research - Randorithms

I work on randomized algorithms for scalable machine learning. By replacing expensive exact algorithms with lightweight approximate methods, we can substantially reduce the resources needed to run a program. Machine learning is an ideal application area because learning algorithms can adapt to the noise introduced by the approximation.

My current work is on efficient approximate algorithms for low-level building blocks of machine learning, such as kernel sums and near-neighbor search, as well as fast training and inference. I am particularly interested in simple methods with theoretical guarantees that also work well in a web-scale production environment.

Selected Publications [Google Scholar]

One-Pass Diversified Sampling with Application to Terabyte-Scale Genomic Sequence Streams

Benjamin Coleman*, Benito Geordie*, Li Chou, R. A. Leo Elworth, Todd J. Treangen, and Anshumali Shrivastava 2022. International Conference on Machine Learning. (ICML22)

*Equal contribution, random order

Practical Near Neighbor Search via Group Testing

Joshua Engels*, Benjamin Coleman*, Anshumali Shrivastava 2021. Neural Information Processing Systems. (NeurIPS21) [Spotlight Talk - Top 3%]

*Equal contribution, random order

A One-Pass Distributed and Private Sketch for Kernel Sums with Applications to Machine Learning at Scale

Benjamin Coleman, Anshumali Shrivastava 2021. ACM Conference on Computer and Communications Security. (CCS21)

Sub-linear RACE Sketches for Approximate Kernel Density Estimation on Streaming Data

Benjamin Coleman, Anshumali Shrivastava 2020. The Web Conference. (WWW20)

Sub-linear Memory Sketches for Near Neighbor Search on Streaming Data

Benjamin Coleman, Richard G Baraniuk, Anshumali Shrivastava 2020. International Conference on Machine Learning. (ICML20)

arXiv

Revisiting Consistent Hashing with Bounded Loads

John Chen, Benjamin Coleman, Anshumali Shrivastava 2021. AAAI Conference on Artificial Intelligence. (AAAI21)

arXiv

Fast processing and querying of 170TB of genomics data via a repeated and merged bloom filter (RAMBO)

Gaurav Gupta, Minghao Yan, Benjamin Coleman, Bryce Kille, R. A. Leo Elworth, Tharun Medini, Todd J. Treangen, Anshumali Shrivastava 2021. ACM SIGMOD International Conference on Management of Data. (SIGMOD21)