In-depth Papers Highlight 1st - 10th August
KV-Cache management as a way to optimize LLM inference, and better AI alignment via self-improving alignment with LLM-as-a-Meta-Judge, are two of our favorite publications this week.
A. Notable Survey/Review
Keep the Cost Down: A Review on Methods to Optimize LLM’s KV-Cache Consumption
Goal/Motivation: The efficiency of LLMs is challenged by the Transformer architecture’s difficulty in handling long texts. KV-Cache has emerged as a pivotal solution to this issue, reducing the time complexity of token generation from quadratic to linear, albeit with GPU memory overhead that grows proportionally with conversation length. The paper surveys methods for KV-Cache optimization, which is critical to enhancing LLM performance in longer contexts.
Methodology:
Training-stage optimization: KV-Cache compression methods can be applied during model pre-training. The survey focuses on modifications to the self-attention architecture that, while retaining the desirable properties of attention, reduce the size of the generated key and value vectors to a quarter or even less.
Multi-Query Attention (MQA), built on Multi-Head Attention: retain only one head for keys and values while still achieving good model performance.
Grouped-Query Attention (GQA): query heads no longer all compute attention scores against the same single key head; instead, they are divided into n_g groups, each sharing one key/value head.
These methods are usually the most effective, but they are not suitable for modifying existing models or for scenarios with low computational power. A minimal sketch of the resulting KV-Cache size reduction follows.
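To make the memory arithmetic concrete, here is a minimal sketch of how MQA and GQA shrink the per-token KV-Cache. The model dimensions (32 layers, 32 query heads, head dimension 128, fp16 cache) are illustrative assumptions, not figures from the survey:

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV-Cache stored per generated token (keys + values)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

n_layers, head_dim = 32, 128  # hypothetical 7B-class model, fp16 cache

for name, n_kv_heads in [("MHA", 32), ("GQA, 8 groups", 8), ("MQA", 1)]:
    per_token = kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim)
    print(f"{name:>14}: {per_token / 1024:.0f} KiB per token, "
          f"{per_token * 4096 / 2**20:.0f} MiB for a 4096-token context")
```

Under these assumptions, moving from MHA to 8-group GQA cuts the cache by 4x, and MQA by 32x, which is exactly the lever the training-stage methods above pull.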
Deploy-Stage Optimization: An inference system specifically designed for the KV-Cache’s high-frequency, small incremental growth pattern is an important way to improve KV-Cache efficiency.
Different frameworks optimize KV-Cache usage, including: the PagedAttention mechanism and the vLLM framework built on it; DistAttention and the DistKV-LLM built on it; ChunkAttention, which lets the model avoid repeated calculation of shared tokens in the pre-fill stage; and InfLLM, based on the idea of the Sliding Window Attention mechanism.
These methods do not heavily modify the KV-Cache itself, but can significantly improve its efficiency in the same environment.
Post-Training Optimizations:
Eviction: Eviction methods concern policies for discarding unnecessary tokens. Two lines of approaches exist: static policies, which are designed manually, and dynamic policies, which utilize attention scores or other information to identify important tokens (a minimal sketch follows this list).
Quantization: This approach effectively compresses data by mapping tensor values, originally in full precision, to discrete levels and storing them at a reduced precision. There are two primary categories of KV-Cache quantization: Full quantization and KV-Cache only quantization.
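As a concrete illustration of a dynamic eviction policy, here is a hedged sketch that keeps the most-attended older tokens plus a window of the most recent ones and drops the rest. The cache budget, the always-keep-recent window, and the accumulated-attention scoring are illustrative assumptions, not a specific method from the survey:

```python
import torch

def evict_kv(keys, values, attn_scores, budget: int, keep_recent: int = 32):
    """keys, values: [seq_len, n_kv_heads, head_dim]; attn_scores: [seq_len]
    accumulated attention each cached token has received."""
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values
    # Always keep the most recent tokens.
    recent = torch.arange(seq_len - keep_recent, seq_len)
    # Among older tokens, keep those with the highest accumulated attention.
    older_scores = attn_scores[: seq_len - keep_recent]
    top_older = torch.topk(older_scores, budget - keep_recent).indices
    keep = torch.cat([top_older.sort().values, recent])  # preserve original order
    return keys[keep], values[keep]
```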
Data Sources: LongBench (21 datasets across 6 task categories), Passkey Retrieval, Needle in a Haystack, and Few-shot Testing.
Experiment & Results:
Evaluation metrics: per-token GPU memory usage, throughput and latency, perplexity.
Key takeaways:
Principles of KV-Cache Optimization: The core of optimizing KV-Cache is to reduce memory consumption by compressing ‘K’ (Keys) or ‘V’ (Values) in the KV pairs. These techniques have an impact on the efficiency of the models, especially memory usage and processing speed.
Trade-offs in Deletion vs. Compression: Whether to delete less important KV pairs to save memory or to compress the KV-Cache without deletion remains an open question. While deletion might offer immediate memory relief, it could compromise the model’s performance. In contrast, compression techniques strive to retain information integrity while reducing the memory footprint.
Extremes in KV-Cache Management: storing the KV-Cache externally, possibly on a different storage medium. While this could reduce memory usage on primary devices, it would introduce complexities in retrieval and integration.
Discussion & Next steps: Future directions in storage and retrieval technologies: innovations in how KV-Cache is managed, stored, and accessed could open up new avenues for making LLMs more efficient and versatile.
B. Notable Research
1. Improving Retrieval Augmented Language Model with Self-Reasoning
Goal/Motivation: Retrieval-Augmented Language Models (RALMs) perform well on knowledge-intensive tasks by incorporating external knowledge during inference, which mitigates the factual hallucinations inherent in large language models (LLMs). A remaining challenge is that retrieving irrelevant documents may result in unhelpful responses or even degrade LLM performance. The authors therefore propose a novel self-reasoning framework aimed at improving the reliability and traceability of RALMs, whose core idea is to leverage reasoning trajectories generated by the LLM itself.
Methodology:
The framework includes three processes (a minimal sketch follows this list):
A Relevance-Aware Process: use DPR (Karpukhin et al., 2020) and Contriever (Izacard et al., 2021) as the default retriever R to recall the top-k relevant documents. The output should include two fields: relevant and relevant reason.
An Evidence-Aware Selective Process: This process of citing the document facilitates reading comprehension and can serve as a technique for combining multiple short answers to address various aspects.
A Trajectory Analysis Process: consolidate all the self-reasoning trajectories (τr and τe) in the previous processes together to form a chain of reasoning snippets, thereby enhancing the overall performance of the retrieval augmentation generation.
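A minimal sketch of how the three processes could be chained, assuming generic `retriever` and `llm` callables; the prompt wording and output fields are illustrative stand-ins, not the paper’s actual prompts or schemas:

```python
def self_reasoning_answer(question: str, retriever, llm, top_k: int = 5) -> str:
    docs = retriever(question, top_k=top_k)

    # 1. Relevance-Aware Process: judge document relevance and state the reason.
    relevance = llm(f"Question: {question}\nDocuments: {docs}\n"
                    "Output JSON with fields 'relevant' (true/false) and 'relevant_reason'.")

    # 2. Evidence-Aware Selective Process: cite key sentences and why they are useful.
    evidence = llm(f"Question: {question}\nDocuments: {docs}\n"
                   "Select and cite the key sentences that support an answer, "
                   "with the reason each citation is useful.")

    # 3. Trajectory Analysis Process: consolidate both trajectories into a final answer.
    return llm(f"Question: {question}\nRelevance trajectory: {relevance}\n"
               f"Evidence trajectory: {evidence}\n"
               "Analyse the reasoning above and give a concise final answer.")
```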
Model training: They observed that it is more challenging to ensure the correctness of an LLM with 13B parameters when generating long reasoning trajectories than short ones. They hypothesize that an LLM’s effective reasoning length is limited and exceeding this limit might lead to error accumulation during the inference stage. Therefore, they propose a gradual training method by employing stage-wise masking strategies to gradually learn to generate long trajectories.
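A hedged sketch of what stage-wise loss masking could look like, assuming the trajectory is split into relevance, evidence, and analysis segments; the segment boundaries, the -100 ignore index (HuggingFace-style), and the stage schedule are illustrative assumptions rather than the paper’s exact recipe:

```python
def stage_mask(labels, segment_ends, stage):
    """labels: list of target token ids; segment_ends: end index of each trajectory
    segment (relevance, evidence, analysis); stage: 1-based training stage."""
    cutoff = segment_ends[min(stage, len(segment_ends)) - 1]
    # Tokens beyond the current stage's cutoff are ignored by the loss,
    # so earlier stages only learn the shorter part of the trajectory.
    return [tok if i < cutoff else -100 for i, tok in enumerate(labels)]
```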
Data Sources: 4 datasets: two short-form QA datasets (including PopQA), one long-form QA dataset (ASQA), and one fact verification dataset (FEVER).
Experiment & Results:
For short-form QA evaluations, the performance of LLMs with augmented retrieval is consistently better than that of basic ones, affirming the effectiveness of the augmented approach.
In long-form QA evaluations, the EM-recall metric requires comprehending multiple documents and merging answers.
Discussion & Next steps:
The work has not explored more challenging scenarios, such as multi-hop reasoning, code generation, and arithmetic reasoning. In future, more challenging reasoning tasks, such as arithmetic reasoning, should be explored with the self-reasoning framework.
The framework can effectively mitigate factual hallucinations in LLMs and improve the robustness of RALMs. However, there is still a risk that the method generates hallucinations.
2. Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
Goal/Motivation: While improving LLM models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024c) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training.
Methodology: Meta-Rewarding adds a new step to the self-improvement process, where the model judges its own judgements and uses that feedback to refine its judgment skills.
The model assumes three main roles:
An actor: generates responses to given prompts
A judge: evaluates and scores its own responses
A meta-judge: compares the quality of its own judgments.
Actor preference dataset creation: consists of three main steps: 1/ sample responses from the actor, 2/ aggregate multiple judgements, and 3/ preference data selection with length-control.
Judge preference dataset creation: the meta-judge operates in a pairwise mode, comparing two given judgements. Three steps generate and select chosen and rejected pairs: 1/ response selection, 2/ pairwise meta-judge evaluations, 3/ Elo score and pair selection (a minimal sketch of this step follows).
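A hedged sketch of how pairwise meta-judge verdicts could be turned into Elo scores for picking chosen and rejected judgements; the K-factor, initial rating, and update rule are illustrative assumptions rather than the paper’s exact procedure:

```python
import itertools

def elo_from_pairwise(judgements, meta_judge, k: float = 16.0):
    """judgements: list of judgement texts;
    meta_judge(a, b) -> 1.0 if a wins, 0.0 if b wins, 0.5 for a tie."""
    ratings = [1000.0] * len(judgements)
    # Evaluate every ordered pair (both orders) to soften positional bias.
    for i, j in itertools.permutations(range(len(judgements)), 2):
        outcome = meta_judge(judgements[i], judgements[j])
        expected = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400))
        ratings[i] += k * (outcome - expected)
        ratings[j] += k * ((1 - outcome) - (1 - expected))
    chosen = judgements[max(range(len(judgements)), key=ratings.__getitem__)]
    rejected = judgements[min(range(len(judgements)), key=ratings.__getitem__)]
    return chosen, rejected, ratings
```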
Data Sources: the Evaluation Fine-Tuning (EFT) dataset from Yuan et al. (2024c). More specifically, for the Meta-Rewarding iterations they use 20,000 prompts from Yuan et al. (2024c) that were generated by Llama-2-70B-Chat using an 8-shot prompt. For each iteration, they sample 5,000 prompts from this seed set and conduct four iterations in total.
Experiments & Results:
Meta-Rewarding iterations significantly improve the win rate: overall, a substantial increase from 22.9% to 39.4%, outperforming GPT-4 and coming close to the Claude Opus model.
The meta-judge and length-control mechanism are important.
Meta-Rewarding improves nearly all instruction categories.
Meta-Rewarding improves answering of complex and hard questions.
Meta-Rewarding does not sacrifice multi-turn ability despite training only on single-turn.
Github:
Discussion & Next steps:
This scoring method often results in ties due to minimal quality differences between responses, necessitating careful averaging of multiple judgments to differentiate between them.
As training progressed, responses increasingly approached the maximum score, making further improvements difficult to detect.
A limitation lies in the judge training process: despite efforts to mitigate the positional bias of the meta-judge, the issue persists and hindered further improvements in Iteration 3.
The judge showed limited improvement in evaluating non-self-generated responses in the authors’ evaluations.
3. Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval
Goal/Motivation: Retrieval-augmented generation (RAG) has significantly advanced large language models (LLMs) by enabling dynamic information retrieval to mitigate knowledge gaps and hallucinations in generated content. However, these systems often falter with complex reasoning and consistency across diverse queries.
Most recent RAG implementations rely heavily on vector retrieval of text. Vector embeddings are numerical representations of words, phrases, or entire documents that capture their meanings in a semantic space.
There are some limitations: 1/ shallow correlation capture, 2/ difficulty in aggregating diverse facts, and 3/ inability to handle complex logic. Incorporating structured external knowledge graphs (KGs) into RAG, as in ToG (Sun et al., 2024), enables searching for valid information based on the triple relationships in the Wikipedia KG, which enhances the transparency and explainability of these systems. Meanwhile, challenges remain, including identifying and mitigating noise, ambiguity and incompleteness in KGs.
The synergy between knowledge graphs and unstructured documents for RAG is becoming crucial. Think-on-Graph 2.0 (ToG2.0) is presented as an enhanced RAG framework that aligns questions with the knowledge graph and uses it as a navigational tool, which deepens and refines the RAG paradigm for information collection and integration.
Methodology: Think-on-Graph 2.0 (ToG2.0), an advanced RAG paradigm with graph-guided knowledge retrieval for deep and interpretable reasoning.
ToG2.0 effectively integrates unstructured knowledge from documents with structured insights from knowledge graphs (KGs), serving as a roadmap to enhance complex problem-solving. By aligning questions with the knowledge graph and using it as a navigational tool, this approach deepens and refines the RAG paradigm for information collection and integration, which not only ensures semantic similarity at the level of factual consistency but also fosters long-range associations to uphold logical consistency.
The proposed paradigm makes the LLM reason and solve problems more like a human: examining current clues, associating potential entities based on its existing knowledge, and continuously digging into the topic until the answer is found (a minimal sketch of this loop follows).
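A hedged sketch of a graph-guided retrieval loop in the spirit of ToG 2.0, assuming hypothetical `kg.neighbors`, `retrieve_docs`, and `llm` helpers; the paper’s actual topic pruning, relation scoring, and stopping criteria are more elaborate:

```python
def graph_guided_answer(question, topic_entities, kg, retrieve_docs, llm, max_depth=3):
    entities, clues = list(topic_entities), []
    for _ in range(max_depth):
        # Expand the knowledge graph one hop from the current entities.
        candidates = [nbr for e in entities for nbr in kg.neighbors(e)]
        # Let the LLM prune to the most promising entities (one per line).
        entities = llm(
            f"Question: {question}\nCandidate entities: {candidates}\n"
            "List, one per line, the few entities most likely to lead to the answer."
        ).splitlines()
        # Retrieve documents about the kept entities and keep them as clues.
        clues.extend(retrieve_docs(entities, question))
        verdict = llm(
            f"Question: {question}\nClues so far: {clues}\n"
            "Answer the question if the clues are sufficient; otherwise reply CONTINUE."
        )
        if verdict.strip().upper() != "CONTINUE":
            return verdict
    return llm(f"Question: {question}\nClues: {clues}\nGive your best answer.")
```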
Data Sources: two multi-hop KBQA datasets, WebQSP (Yih et al., 2016) and QALD-10-en (Usbeck et al., 2023); a multi-hop complex document QA dataset, HotpotQA (Yang et al., 2018); and a fact verification dataset, FEVER (Thorne et al., 2018).
Experiment & Results: The study conducts experiments on two LLMs: GPT-3.5-turbo and Llama-2-13B-chat. GPT-3.5-turbo is accessed via the OpenAI API, while Llama-2-13B-chat is deployed on 8 A100-40G GPUs without quantization. Wikidata is chosen as the knowledge source for all experiments. During the TP, RC, relation pruning, and reasoning stages, a 2-shot demonstration is used for all prompts.
Evaluation metric: accuracy for the fact verification dataset FEVER; Exact Match (EM) for WebQSP, HotpotQA and QALD-10-en.
Results: ToG2.0 outperforms other baselines on WebQSP, HotpotQA and QALD-10-en, demonstrating the advantages of the “KG+context” framework in addressing complex problems. Although performance on the fact-checking dataset FEVER is slightly inferior to CoK, this may be due to CoK utilizing more knowledge sources and an additional LLM self-verification mechanism.
Discussion & Next steps: To save computational costs and reduce inference latency, the authors ultimately decided not to use a self-verification mechanism, which could be added based on application requirements in the future.
4. RAG Foundry: A framework for enhancing LLMs for Retrieval Augmented Generation
Goal/Motivation: Implementing Retrieval-Augmented Generation (RAG) systems is inherently complex, requiring a deep understanding of data, use cases, and intricate design decisions. RAG FOUNDRY is an open-source framework for augmenting LLMs for RAG use cases. It is highly customizable, facilitating rapid prototyping and experimentation across all aspects of RAG, including data selection, aggregation and filtering, retrieval, text processing, document ranking, few-shot generation, prompt design using templates, fine-tuning, inference, and evaluation.
Methodology: The library is composed of four modules: dataset creation, training, inference, and evaluation (a minimal sketch of this flow follows the module descriptions).
Data creation and processing: The processing module facilitates the creation of context-enhanced datasets by persisting RAG interactions, which are essential for RAG-oriented training and inference. These interactions include: dataset loading, column normalization, data aggregation, information retrieval, template-based prompt creation and other forms of pre-processing. The processing module supports the handling of multiple datasets at once, through global dataset sharing. Furthermore, the module includes step caching which caches each pipeline step locally to improve compute efficiency, and facilitates easy reproduction.
Training: a training module fine-tunes models on the datasets created by the processing module. It relies on the well-established training framework TRL and supports advanced, efficient training techniques, e.g. LoRA (Hu et al., 2021).
Inference: The inference module generates predictions given the processed datasets created by the processing module. Additionally, one can run multiple evaluations on a single, prepared inference results file.
Evaluation: The evaluation module allows users to run collections of metrics to evaluate RAG techniques and tuning processes. Metrics are also implemented in the library and are classified into Local metrics and Global metrics.
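To show how the four modules fit together end to end, here is a hedged sketch with hypothetical `retriever`, `finetune`, `model.generate`, and metric callables; these names are illustrative stand-ins, not RAG Foundry’s actual API, which is driven through its own configuration files:

```python
def run_rag_experiment(dataset, retriever, finetune, metrics):
    # 1. Dataset creation/processing: persist RAG interactions as context-enhanced prompts.
    processed = [
        {"prompt": f"Context: {retriever(ex['question'])}\nQuestion: {ex['question']}",
         "answer": ex["answer"]}
        for ex in dataset
    ]
    # 2. Training: fine-tune a model on the processed dataset (e.g. with LoRA).
    model = finetune(processed)
    # 3. Inference: generate predictions from the processed prompts.
    predictions = [model.generate(ex["prompt"]) for ex in processed]
    # 4. Evaluation: apply local (per-example) and global (dataset-level) metrics.
    answers = [ex["answer"] for ex in processed]
    return {name: metric(predictions, answers) for name, metric in metrics.items()}
```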
Data Sources: knowledge-intensive question-answering datasets including TriviaQA (Joshi et al., 2017), PubmedQA (Jin et al., 2019), and ASQA (Stelmakh et al., 2022).
Experiment & Results:
Experiments on two representative models: Llama-3 (Touvron et al., 2023; AI@Meta, 2024) and Phi-3 (Abdin et al., 2024).
Evaluation metrics: Exact Match (EM) for TriviaQA, STR-EM for ASQA, accuracy and F1 for PubmedQA. Results are presented below:
The best method is model-dependent for each dataset:
For TriviaQA: retrieved context improves the results; fine-tuning the RAG setting improves the results; fine-tuning on CoT reasoning (which includes training on a combination of gold passages and distractor passages) decreases performance.
For ASQA: CoT reasoning produces a consistent improvement in both models, as does fine-tuning of the CoT configuration, which performs best.
For PubmedQA: almost all methods improve upon the baseline (with one exception); CoT reasoning improves upon the untrained RAG setting, but after fine-tuning, the RAG method appears to perform best in both models.
Discussion & Next steps:
The framework is demonstrated only on a subset of tasks and datasets. Future work can expand the evaluation to other tasks and implement other RAG techniques and evaluations.
There might be specific workflows which will be difficult to run as-is and some code changes may be required.
There could be some missing details regarding some functionality or specific use-cases.
Enjoy these paper highlights? Subscribe, share and stay tuned for new series coming up.