Divide-or-Conquer? LLM Distillation Strategies
Presentation and analysis of the Apple × Cornell research paper on LLM distillation, which separates decomposition (planning) from resolution (solving) to reduce inference costs while maintaining performance.
Business Context
LLMs are powerful on complex reasoning tasks but expensive to run and difficult to customize. This Apple × Cornell paper explores whether reasoning can be split into decomposition (planning) and resolution (solving), and which part benefits most from distillation into smaller models.
Strategic Problem
Can we effectively separate decomposition and resolution in LLM reasoning to reduce inference costs, facilitate local adaptation via fine-tuning/distillation, and still maintain good performance?
Data Sources
Three benchmark datasets: GSM8K (7.5K grade-school math word problems, scored by Exact Match), DROP (77.4K reading-comprehension questions over long passages, scored by F1), and Bamboogle (125 compositional multi-hop questions, scored by Accuracy). Models tested: GPT (the large model) and Vicuna-13B (the smaller distillation target).
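To make the metrics concrete, here is a minimal sketch of Exact Match and token-level F1 scorers. The normalization rules follow the common SQuAD-style convention and are an assumption; the paper's exact scoring scripts are not reproduced here.

```python
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace (assumed convention).
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    # GSM8K-style scoring: 1.0 only if the normalized strings are identical.
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    # DROP-style scoring: harmonic mean of token precision and recall.
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("72", "72"))                                 # 1.0
print(round(token_f1("the Eiffel Tower", "Eiffel Tower"), 2))  # 0.8
```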
Methodology
Evaluated three prompting strategies: Single-Stage (the model answers directly), Two-Stage (a decomposer generates all sub-questions up front, then a solver answers them), and Self-Ask/Interactive (sub-questions are generated dynamically, conditioned on intermediate answers). Tested distilling the decomposer (planning) into a smaller model while keeping a large solver, and vice versa, and compared static vs. dynamic decomposition on token efficiency. A sketch of the two-stage pipeline follows.
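The sketch below illustrates the Two-Stage pipeline being compared: a (potentially small, distilled) decomposer plans all sub-questions up front, and a separate (potentially large) solver answers them. Function names, prompts, and the `Model` callable type are illustrative assumptions, not the paper's code.

```python
from typing import Callable, List

Model = Callable[[str], str]  # prompt in, completion out (hypothetical interface)

def decompose(decomposer: Model, question: str) -> List[str]:
    # Static decomposition: all sub-questions are generated in one pass,
    # without seeing any intermediate answers (unlike Self-Ask).
    prompt = f"Break this question into numbered sub-questions:\n{question}"
    return [line.strip() for line in decomposer(prompt).splitlines() if line.strip()]

def two_stage_answer(decomposer: Model, solver: Model, question: str) -> str:
    # The solver answers each sub-question in turn, then the final question.
    sub_questions = decompose(decomposer, question)
    context = ""
    for sq in sub_questions:
        answer = solver(f"{context}\nQ: {sq}\nA:")
        context += f"\nQ: {sq}\nA: {answer}"
    return solver(f"{context}\nGiven the above, answer: {question}\nA:")

if __name__ == "__main__":
    stub: Model = lambda prompt: "1. toy sub-question"  # placeholder model
    print(two_stage_answer(stub, stub, "toy question"))
```

The distillation experiments then vary which role gets the small model: small decomposer + large solver, or the reverse.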
Key Results
Key finding: distilling the decomposer yields the best cost-performance tradeoff; a small distilled decomposer paired with a large solver achieves performance close to the all-GPT baseline at a fraction of the inference cost. Static two-stage decomposition also uses roughly 4x fewer tokens than the dynamic Self-Ask approach with comparable accuracy.
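A back-of-envelope sketch of the cost arithmetic behind that tradeoff; all prices and token counts below are assumed purely for illustration and do not come from the paper.

```python
# Assumed per-million-token prices (illustrative, not the paper's figures).
LARGE_PRICE = 10.0   # large solver model
SMALL_PRICE = 0.5    # small distilled decomposer

def cost(decomposer_tokens: int, solver_tokens: int,
         decomposer_price: float, solver_price: float) -> float:
    # Total dollar cost of one query, split between the two roles.
    return (decomposer_tokens * decomposer_price
            + solver_tokens * solver_price) / 1_000_000

# Hypothetical query: 200 planning tokens, 800 solving tokens.
baseline  = cost(200, 800, LARGE_PRICE, LARGE_PRICE)  # all-large pipeline
distilled = cost(200, 800, SMALL_PRICE, LARGE_PRICE)  # small decomposer
print(f"baseline:  ${baseline:.6f} per query")
print(f"distilled: ${distilled:.6f} per query")  # planning cost drops ~20x
```

Under these assumed numbers only the planning share of the bill shrinks, which is consistent with the summary's framing: the larger win is that a small decomposer is also far cheaper to fine-tune and adapt locally than the full model.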
Business Impact
Demonstrates deep understanding of LLM reasoning architectures, knowledge distillation, and the cost-performance tradeoffs of deploying AI systems at scale.