Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

Nov 25, 2024 | By Bud Ecosystem

In the research paper “Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting,” the authors introduce a new framework called Kangaroo designed to make large language models (LLMs) generate text faster. The framework enables training a lightweight draft model in a cost-effective way.

This new framework is introduced to speed up the text generation process of LLMs, building on a method called Speculative Decoding (SD). Here’s a simplified breakdown of the challenges the authors wanted to solve with their research:

  1. Memory Bandwidth Bottleneck: Even though LLMs perform a lot of mathematical computation, the real slowdown comes from the time spent moving data in and out of memory (read/write operations) where the model’s weights are kept. Every generated token requires reading essentially all of the weights, so a 7B-parameter model stored in FP16 moves roughly 14 GB per token.
  2. Improving Decoding Speed: To make text generation faster, earlier research proposed Speculative Decoding (SD): a small “draft” model quickly proposes several candidate next tokens, and the large model then verifies them in a single parallel pass instead of generating one token at a time (a minimal sketch of this draft-then-verify loop follows this list). However, this approach has two main issues:
     • Draft Model Training: Training a draft model that predicts well often requires significant resources and time, which isn’t always practical.
     • Draft Model Inference Speed: If the draft model itself is slow, it erodes much of the speedup it is supposed to provide.
  3. Self-Drafting Models: Some methods, like LLMA and REST, try to address these issues by generating draft tokens through other strategies that need no separate draft model. Medusa, for example, attaches additional neural network components to the LLM itself to create draft tokens, but it still has limitations in both the acceptance quality of those tokens and the speed of generating them.
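
To make the draft-then-verify idea concrete, below is a minimal sketch of one round of greedy speculative decoding. Here `draft_model` and `target_model` are hypothetical callables standing in for the small drafter and the full LLM (each maps a token sequence to per-position next-token logits); the names and interface are illustrative, not the paper’s API.

```python
from typing import Callable, List

import torch


def speculative_step(
    tokens: List[int],
    draft_model: Callable[[List[int]], torch.Tensor],
    target_model: Callable[[List[int]], torch.Tensor],
    num_draft: int = 4,
) -> List[int]:
    """One draft-then-verify round of greedy speculative decoding."""
    # Draft phase: the small model proposes `num_draft` tokens autoregressively.
    draft = list(tokens)
    for _ in range(num_draft):
        logits = draft_model(draft)                # [len(draft), vocab_size]
        draft.append(int(logits[-1].argmax()))

    # Verify phase: a single forward pass of the large model scores every
    # drafted position at once; preds[i] is its greedy choice after draft[:i+1].
    preds = target_model(draft).argmax(dim=-1).tolist()
    accepted = list(tokens)
    for i in range(len(tokens), len(draft)):
        if preds[i - 1] == draft[i]:
            accepted.append(draft[i])              # verified: keep the draft token
        else:
            accepted.append(preds[i - 1])          # first mismatch: take the large
            break                                  # model's token, drop the rest
    else:
        accepted.append(preds[-1])                 # all drafts accepted: the verify
                                                   # pass yields one extra token free
    return accepted
```

Because every emitted token is exactly what the large model would have produced greedily, the output is lossless; the speedup comes from verifying several draft tokens with one forward pass of the large model.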

Kangaroo Framework

The authors tackled the challenges of speeding up text generation in large language models using their new framework, Kangaroo. Here’s how they addressed the issues:

  • Autoregressive Self-Draft Model: They designed a lightweight and efficient “autoregressive self-draft model.” It is built on a fixed (frozen) shallow portion of the large LLM itself, topped with a small, additional adapter module.

The adapter network, which consists of just one multi-head attention layer and two normalization layers, has only 11.3% as many parameters as Medusa-1’s heads. Despite its simplicity, this design proves to be both efficient and powerful.
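
As a rough illustration, here is a minimal PyTorch sketch of an adapter of that shape: one multi-head attention layer plus two normalization layers, applied to hidden states taken from the frozen shallow exit layer. The hidden size, head count, residual connection, and the reuse of the frozen LLM’s own LM head are assumptions made for the sketch, not the paper’s exact wiring.

```python
import torch
import torch.nn as nn


class DraftAdapter(nn.Module):
    """Sketch of a Kangaroo-style adapter: one attention layer, two norms."""

    def __init__(self, hidden_size: int = 4096, num_heads: int = 32):
        super().__init__()
        self.norm_in = nn.LayerNorm(hidden_size)    # normalization layer 1
        self.attn = nn.MultiheadAttention(          # the single attention layer
            hidden_size, num_heads, batch_first=True
        )
        self.norm_out = nn.LayerNorm(hidden_size)   # normalization layer 2

    def forward(self, shallow_hidden: torch.Tensor) -> torch.Tensor:
        # shallow_hidden: [batch, seq, hidden] activations from the early-exit
        # layer of the frozen LLM.
        h = self.norm_in(shallow_hidden)
        seq_len = h.size(1)
        # Causal mask so each position cannot attend to future positions.
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=h.device),
            diagonal=1,
        )
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        h = shallow_hidden + attn_out               # residual connection (assumed)
        return self.norm_out(h)


# Draft logits would reuse the frozen LLM's existing LM head (hypothetical
# attribute names): draft_logits = llm.lm_head(adapter(shallow_hidden))
```

Because the adapter rides on activations the large model computes anyway, the shallow layers’ work and KV cache are shared between drafting and verification rather than duplicated.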

  • Early Exiting Mechanism: To further reduce latency, they added an early exiting mechanism to the draft-token generation phase: drafting stops as soon as the self-draft model’s confidence in the next token drops, so no computation is wasted drafting difficult tokens that the large model would likely reject anyway (see the sketch after this list).
  • Self-Speculative Decoding Framework: The Kangaroo framework combines these ideas into a double early-exit mechanism. First, the self-draft model exits early from the shallow layers of the large LLM, feeding those hidden states into the adapter network to generate draft tokens. Second, it applies the confidence-based early exit during the drafting phase to minimize computational overhead on difficult tokens.
  • Low-Cost Training and Deployment: Kangaroo offers a cost-effective way to train a lightweight model. Since the self-draft model and the large LLM share some of the KV cache and computation, the primary additional requirement for deployment is a small adapter network.
  • Performance Validation: The authors validated Kangaroo’s effectiveness through experiments on the Spec-Bench benchmark. Kangaroo achieved up to a 1.7× speedup compared to Medusa-1 while using 88.7% fewer additional parameters (67 million vs. 591 million).
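
The drafting-phase early exit mentioned above can be sketched as a simple confidence threshold: keep drafting while the self-draft model’s top-1 probability stays high, and stop the moment it drops, handing control back to the large model. The `draft_logits_fn` callable and the threshold value are illustrative assumptions; in Kangaroo that function would correspond to the shallow layers, the adapter, and the shared LM head.

```python
import torch


def draft_with_early_exit(tokens, draft_logits_fn, threshold=0.6, max_draft=8):
    """Draft until the next-token confidence falls below `threshold`."""
    draft = list(tokens)
    for _ in range(max_draft):
        logits = draft_logits_fn(draft)        # [vocab_size] logits for next token
        probs = torch.softmax(logits, dim=-1)
        conf, tok = probs.max(dim=-1)
        if conf.item() < threshold:
            break                              # hard token: stop drafting early and
                                               # let the large model take over
        draft.append(int(tok))
    return draft
```

Tuning the threshold trades drafting depth against acceptance rate: a higher threshold drafts fewer but more reliable tokens.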

In summary, the authors improved text generation speed by creating a lightweight self-draft model with a simple architecture and an efficient early exiting mechanism, reducing computational cost and latency while keeping generation lossless.
