Mukul Gagrani

I am a staff research scientist at Qualcomm AI Research, where I work on improving the efficiency of Large Language Models (LLMs). In particular, I am interested in efficient LLM inference for deployment on edge devices. In the past I have worked on Machine Learning for Combinatorial Optimization, Reinforcement Learning, and stochastic control.

I obtained my PhD in Electrical & Computer Engineering from the University of Southern California (USC) in 2020 under the supervision of Dr. Ashutosh Nayyar and Dr. Rahul Jain. Before that, I completed my undergraduate degree in Electrical Engineering at IIT Kanpur in 2013.

selected publications

CVPR

On Speculative Decoding for Multimodal Large Language Models

Mukul Gagrani*, Raghavv Goel*, Wonseok Jeon, Junyoung Park, Mingu Lee, and Christopher Lott

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024

Selected as a spotlight paper at the ELVM workshop at CVPR 2024

Inference with Multimodal Large Language Models (MLLMs) is slow due to their large-language-model backbone, which suffers from a memory-bandwidth bottleneck and generates tokens auto-regressively. In this paper, we explore the application of speculative decoding to enhance the inference efficiency of MLLMs, specifically the LLaVA 7B model. We show that a language-only model can serve as a good draft model for speculative decoding with LLaVA 7B, removing the need for image tokens and their associated processing components in the draft model. Our experiments across three different tasks show that speculative decoding can achieve a memory-bound speedup of up to 2.37x using a 115M parameter language model that we trained from scratch. Additionally, we introduce a compact LLaVA draft model incorporating an image adapter, which shows marginal performance gains in image captioning while maintaining comparable results in other tasks.
Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement

Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, and Christopher Lott

arXiv preprint arXiv:2402.14160, 2024

Speculative decoding is an inference-acceleration method for large language models (LLMs) where a small language model generates a draft-token sequence which is then verified by the target LLM in parallel. Recent works have advanced this method by establishing a draft-token tree, achieving superior performance over single-sequence speculative decoding. However, those works independently generate tokens at each level of the tree, not leveraging the tree’s entire diversifiability. Besides, their empirical superiority has been shown for a fixed sequence length, implicitly granting more computational resources to the LLM for the tree-based methods. None of the existing works has conducted empirical studies with fixed target computational budgets despite their importance to resource-bounded devices. We present Recursive Speculative Decoding (RSD), a novel tree-based method that samples draft tokens without replacement and maximizes the diversity of the tree. During RSD’s drafting, the tree is built by either the Gumbel-Top-k trick, which draws tokens without replacement in parallel, or Stochastic Beam Search, which samples sequences without replacement while early-truncating unlikely draft sequences and reducing the computational cost of the LLM. We empirically evaluate RSD with Llama 2 and OPT models, showing that RSD outperforms the baseline methods, consistently for fixed draft sequence length and in most cases for fixed computational budgets at the LLM.
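The Gumbel-Top-k trick mentioned in the abstract can be illustrated with a minimal sketch. This is not the RSD implementation: the function name `gumbel_top_k` and its arguments are hypothetical, and it only shows the generic trick of adding independent Gumbel noise to log-probabilities and keeping the k largest, which yields k distinct tokens sampled without replacement.

```python
import math
import random


def gumbel_top_k(log_probs, k, rng=None):
    """Sample k distinct indices without replacement via the Gumbel-Top-k trick.

    `log_probs` holds (possibly unnormalized) log-probabilities of the tokens.
    This is an illustrative helper, not code from the paper.
    """
    rng = rng or random.Random()
    if not 0 <= k <= len(log_probs):
        raise ValueError("k must be between 0 and len(log_probs)")
    perturbed = []
    for index, lp in enumerate(log_probs):
        # If E ~ Exp(1), then -log(E) follows a standard Gumbel distribution.
        e = max(rng.expovariate(1.0), 1e-300)  # guard against log(0)
        perturbed.append((lp + -math.log(e), index))
    # The k largest perturbed scores give a sample without replacement.
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


probs = [0.5, 0.3, 0.15, 0.05]
draft = gumbel_top_k([math.log(p) for p in probs], k=2, rng=random.Random(0))
```

Because all perturbed scores are computed independently, the k draft tokens at one tree level can be drawn in a single parallel step, and no token is drawn twice.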
ICLR

Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs

Raghavv Goel, Mukul Gagrani, Wonseok Jeon, Junyoung Park, Mingu Lee, and Christopher Lott

ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024

Text generation with Large Language Models (LLMs) is known to be memory bound due to the combination of their auto-regressive nature, huge parameter counts, and limited memory bandwidths, often resulting in low token rates. Speculative decoding has been proposed as a solution for LLM inference acceleration. However, since draft models are often unavailable in modern open-source LLM families, e.g., for Llama 2 7B, training a high-quality draft model is required to enable inference acceleration via speculative decoding. In this paper, we propose a simple draft model training framework for direct alignment to chat-capable target models. With the proposed framework, we train Llama 2 Chat Drafter 115M, a draft model for Llama 2 Chat 7B or larger, with only 1.64% of the original size. Our training framework consists only of pretraining, distillation dataset generation, and finetuning with knowledge distillation, with no additional alignment procedure. For the finetuning step, we use instruction-response pairs generated by the target model for distillation in a plausible data distribution, and propose a new Total Variation Distance++ (TVD++) loss that incorporates variance reduction techniques inspired by the policy gradient method in reinforcement learning. Our empirical results show that Llama 2 Chat Drafter 115M with speculative decoding achieves up to 2.3 block efficiency and 2.4x speed-up relative to autoregressive decoding on various tasks with no further task-specific fine-tuning.
ICLR

Neural DAG scheduling via one-shot priority sampling

Wonseok Jeon*, Mukul Gagrani*, Burak Bartan, Weiliang Will Zeng, Harris Teague, Piero Zappi, and Christopher Lott

ICLR, 2022

We consider the problem of scheduling operations/nodes, the dependency among which is characterized by a Directed Acyclic Graph (DAG). Due to its NP-hard nature, heuristic algorithms were traditionally used to acquire reasonably good solutions, and more recent works have proposed Machine Learning (ML) heuristics that can generalize to unseen graphs and outperform the non-ML heuristics. However, it is computationally costly to generate solutions using existing ML schedulers since they adopt the episodic reinforcement learning framework that necessitates multi-round neural network processing. We propose a novel ML scheduler that uses a one-shot neural network encoder to sample node priorities, which are converted by list scheduling to the final schedules. Since the one-shot encoder can efficiently sample the priorities in parallel, our algorithm runs significantly faster than existing ML baselines and has a run time comparable to that of the fast traditional heuristics. We empirically show that our algorithm generates better schedules than both non-neural and neural baselines across various real-world and synthetic scheduling tasks.
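As a rough illustration of how node priorities can be turned into a schedule, the sketch below orders the nodes of a DAG by always picking the ready node with the highest priority. It is a single-machine simplification of list scheduling, not the paper's scheduler: the function `schedule_by_priority`, the adjacency-dict graph format, and the hand-picked priorities (which in the paper would come from the neural encoder) are all assumptions made for the example.

```python
import heapq


def schedule_by_priority(successors, priority):
    """Return a topological order that greedily picks the highest-priority ready node.

    `successors` maps every node to the list of nodes that depend on it;
    `priority` maps every node to a number (higher means schedule earlier).
    """
    indegree = {node: 0 for node in successors}
    for succs in successors.values():
        for s in succs:
            indegree[s] += 1
    # heapq is a min-heap, so priorities are negated to pop the largest first.
    ready = [(-priority[n], n) for n, d in indegree.items() if d == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, node = heapq.heappop(ready)
        order.append(node)
        for s in successors[node]:
            indegree[s] -= 1
            if indegree[s] == 0:  # all predecessors are scheduled
                heapq.heappush(ready, (-priority[s], s))
    if len(order) != len(successors):
        raise ValueError("graph contains a cycle")
    return order


dag = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
priorities = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.2}
order = schedule_by_priority(dag, priorities)  # ["b", "a", "c", "d"]
```

The priorities only need to be computed once per graph, after which the conversion to a schedule is a cheap greedy pass.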
NeurIPS

Neural topological ordering for computation graphs

Mukul Gagrani*, Corrado Rainone*, Yang Yang, Harris Teague, Wonseok Jeon, Herke Van Hoof, Will Zeng, Piero Zappi, Christopher Lott, and Roberto Bondesan

NeurIPS, 2022

Recent works on machine learning for combinatorial optimization have shown that learning-based approaches can outperform heuristic methods in terms of speed and performance. In this paper, we consider the problem of finding an optimal topological order on a directed acyclic graph, with a focus on the memory minimization problem which arises in compilers. We propose an end-to-end machine-learning-based approach for topological ordering using an encoder-decoder framework. Our encoder is a novel attention-based graph neural network architecture called Topoformer which uses different topological transforms of a DAG for message passing. The node embeddings produced by the encoder are converted into node priorities which are used by the decoder to generate a probability distribution over topological orders. We train our model on a dataset of synthetically generated graphs called layered graphs. We show that our model outperforms, or is on par with, several topological ordering baselines while being significantly faster on synthetic graphs with up to 2k nodes. We also train and test our model on a set of real-world computation graphs, showing performance improvements.
Posterior sampling-based reinforcement learning for control of unknown linear systems

Yi Ouyang, Mukul Gagrani, and Rahul Jain

IEEE Transactions on Automatic Control, 2019

We propose a posterior sampling-based learning algorithm for the linear quadratic (LQ) control problem with unknown system parameters. The algorithm is called posterior sampling-based reinforcement learning for LQ regulator (PSRL-LQ) where two stopping criteria determine the lengths of the dynamic episodes in posterior sampling. The first stopping criterion controls the growth rate of episode length. The second stopping criterion is triggered when the determinant of the sample covariance matrix is less than half of the previous value. We show under some conditions on the prior distribution that the expected (Bayesian) regret of PSRL-LQ accumulated up to time $T$ is bounded by $\tilde{O}(\sqrt{T})$. Here, $\tilde{O}(\cdot)$ hides constants and logarithmic factors. Numerical simulations are provided to illustrate the performance of PSRL-LQ.
NeurIPS

Learning unknown Markov decision processes: A Thompson sampling approach

Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain

NeurIPS, 2017

We consider the problem of learning an unknown Markov Decision Process (MDP) that is weakly communicating in the infinite horizon setting. We propose a Thompson Sampling-based reinforcement learning algorithm with dynamic episodes (TSDE). At the beginning of each episode, the algorithm generates a sample from the posterior distribution over the unknown model parameters. It then follows the optimal stationary policy for the sampled model for the rest of the episode. The duration of each episode is dynamically determined by two stopping criteria. The first stopping criterion controls the growth rate of episode length. The second stopping criterion is triggered when the number of visits to any state-action pair is doubled. We establish $\tilde{O}(HS\sqrt{AT})$ bounds on expected regret under a Bayesian setting, where $S$ and $A$ are the sizes of the state and action spaces, $T$ is time, and $H$ is the bound of the span. This regret bound matches the best available bound for weakly communicating MDPs. Numerical results show it to perform better than existing algorithms for infinite horizon MDPs.
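The two stopping criteria described above can be sketched as a single check performed at every time step. This is only an illustration: the function name, its arguments, and the exact strict inequalities are assumptions, chosen so that an episode can be at most one step longer than the previous one and so that it ends once some state-action visit count exceeds twice its value at the start of the episode.

```python
def episode_should_end(t, episode_start, prev_episode_len, visits_now, visits_at_start):
    """Return True if either stopping criterion fires at time t.

    `visits_now` and `visits_at_start` map (state, action) pairs to visit
    counts at time t and at the start of the current episode, respectively.
    """
    # Criterion 1: the current episode has become longer than the previous
    # one, so episode length grows by at most one step per episode.
    if t - episode_start > prev_episode_len:
        return True
    # Criterion 2: some state-action pair has been visited more than twice
    # as often as it had been when the episode started.
    return any(
        count > 2 * visits_at_start.get(pair, 0)
        for pair, count in visits_now.items()
    )


# The previous episode lasted 2 steps and the current one started at t = 3.
ends_at_5 = episode_should_end(5, 3, 2, {}, {})  # False
ends_at_6 = episode_should_end(6, 3, 2, {}, {})  # True
```

When the check returns True, the algorithm would draw a fresh model from the posterior and switch to that model's optimal policy for the next episode.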