Cache & Distil: Optimising API Calls to Large Language Models

The paper focuses on optimising API calls to Large Language Models (LLMs) through the concept of neural caching. This involves training a smaller student model on the LLM's responses so that it can handle user requests itself, thereby reducing the frequency of expensive API calls.
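The neural caching loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `neural_cache`, `student_predict`, `call_llm`, `should_query` and `budget` are all hypothetical, and the sketch assumes the student returns a list of class probabilities and that the LLM call returns a class index.

```python
from typing import Callable, List, Sequence, Tuple


def neural_cache(
    requests: Sequence[str],
    student_predict: Callable[[str], List[float]],  # hypothetical: class probabilities
    call_llm: Callable[[str], int],                 # hypothetical: expensive API call
    should_query: Callable[[List[float]], bool],    # hypothetical: query policy
    budget: int,                                    # assumed cap on LLM calls
) -> Tuple[List[int], List[Tuple[str, int]]]:
    """Answer each request with the student unless the policy asks for the LLM."""
    responses: List[int] = []
    train_buffer: List[Tuple[str, int]] = []  # LLM-labelled examples for later distillation
    for x in requests:
        probs = student_predict(x)
        if budget > 0 and should_query(probs):
            label = call_llm(x)  # pay for an LLM annotation
            budget -= 1
            train_buffer.append((x, label))
        else:
            # Serve the student's most probable class at no API cost.
            label = max(range(len(probs)), key=probs.__getitem__)
        responses.append(label)
    return responses, train_buffer
```

In this sketch the examples collected in `train_buffer` would be used to retrain the student periodically, so that over time fewer requests need to reach the LLM.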


The study examines how active learning (AL) strategies, such as Margin Sampling and Query by Committee, can be used to decide which requests are sent to the LLM, and how this choice affects the performance of the student model.
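As a rough illustration of the two strategies named above, the scores they rely on can be computed as below. This is a sketch under stated assumptions: the function names are hypothetical, Margin Sampling is taken to mean the gap between the two most probable classes, and vote entropy is used as the committee disagreement measure, which is one common choice and not necessarily the one used in the paper.

```python
import math
from collections import Counter
from typing import Sequence


def margin(probs: Sequence[float]) -> float:
    """Gap between the two highest class probabilities (small gap = uncertain)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def vote_entropy(votes: Sequence[int]) -> float:
    """Entropy of the committee's predicted labels (high entropy = disagreement)."""
    n = len(votes)
    counts = Counter(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def query_by_margin(probs: Sequence[float], threshold: float) -> bool:
    """Assumed policy: call the LLM when the student's margin is below a threshold."""
    return margin(probs) < threshold


def query_by_committee(votes: Sequence[int], threshold: float) -> bool:
    """Assumed policy: call the LLM when committee disagreement exceeds a threshold."""
    return vote_entropy(votes) > threshold
```

Both policies send a request to the LLM only when the student (or a committee of students) appears unsure; the thresholds here are illustrative parameters that would need tuning against the available query budget.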


The experimental setup involves four classification tasks: ISEAR, RT-Polarity, FEVER, and Openbook. These tasks range from emotion annotation to fact-checking and question-answering. The datasets are split into online and test portions, with classes uniformly distributed. The paper also discusses the annotation process by the LLM to simulate the online setup.


The findings reveal the benefits of AL-based policies in improving the student model’s performance. Margin Sampling and Query by Committee consistently outperform baselines, indicating the robustness of the student model to noise introduced by the LLM. The study suggests that smart LLM query allocation and online knowledge distillation can play a crucial role in optimising API calls to LLMs.


The paper concludes that there is potential for smart LLM query allocation in continuously distilling LLMs into student models. Leveraging AL strategies in the online setup can lead to significant improvements in performance and efficiency. The authors include Alexandra Birch, Head of Aveni Labs.


