
RAG or Fine Tuning? Fine-tune Embedding models for Retrieval Augmented Generation (RAG)

RAG or Fine Tuning? A simple feature comparison to decide which technique you should use!

For customizing LLMs, in addition to RAG, another optimization technique is fine-tuning.

RAG is akin to providing a textbook to the model, allowing it to retrieve information based on specific queries. This approach is suitable for scenarios where the model needs to address particular information retrieval tasks. However, RAG is not suitable for teaching the model to understand broad domains or learn new languages, formats, or styles.

Fine-tuning is similar to enabling students to internalize knowledge through extensive learning. Fine-tuning can enhance the performance of non-fine-tuned models and make interactions more efficient. It is particularly suitable for emphasizing existing knowledge in the base model, modifying or customizing the model’s output, and providing complex directives to the model.

Choosing between the two approaches is not always straightforward, which is why this guide helps you work out which technique better fits your use case!

 Finetuning or RAG ?

RAG in Production: The importance of a Solid Data Strategy 💥

Retrieval-Augmented Generation (RAG) has become one of the hottest topics in Generative AI, providing powerful ways to enhance model responses with real-world data. But let’s be honest, without a solid data strategy, you’re setting yourself up for a meme-worthy fail. 😂

📈 Why RAG Needs a Data Strategy:

  1. Data Quality: Garbage in, garbage out. Your model is only as good as the data it retrieves.
  2. Relevance: Ensure your data is relevant to your use case.
  3. Scalability: Manage and scale your data efficiently to keep up with growing demands.

Remember, a well-thought-out data strategy is the backbone of any successful RAG implementation.

🚀 Conclusion: Don’t let your RAG use case fall flat. Invest in your data strategy and watch your AI soar! 🌟

Fine-Tuning Embedding Models for RAG: Significant Performance Gains

Retrieve: How can we improve RAG performance through embedding model fine-tuning? What techniques enable domain-specific optimization?

Embedding models are crucial for RAG applications, but general-purpose models often fall short on domain-specific tasks. Fine-tuning an embedding model can significantly boost retrieval performance, as demonstrated in a study using a financial RAG application.

Performance Improvements

| Metric | Improvement | Impact |
|---|---|---|
| Overall Performance | 7.4% to 22.55% boost | ⬆️ Significant |
| Training Samples | Only 6.3k samples needed | ⬇️ Efficient |
| Training Time | ~5 minutes on consumer GPUs | ⚡ Fast |
| Model Size | 6x smaller with Matryoshka | ⬇️ Efficient |
| Dimension Efficiency | 128-dim > 768-dim baseline | ⬆️ Better |

Fine-Tuning Workflow

graph TB
    A[Base Embedding Model] --> B[Domain Data]
    B --> C[Fine-Tuning Process]
    C --> D[Matryoshka Learning]
    D --> E[Optimized Embeddings]
    
    F[Synthetic Data] --> C
    G[Evaluation] --> C
    H[Sentence Transformers v3] --> C
    
    E --> I[RAG System]
    I --> J[Improved Retrieval]
    
    style A fill:#e1f5ff
    style C fill:#fff3cd
    style E fill:#d4edda
    style J fill:#f8d7da

Key Techniques

1. Matryoshka Representation Learning (MRL)

Innovate: MRL enables variable-dimension embeddings that maintain performance at smaller sizes.

from sentence_transformers import SentenceTransformer, losses

# Initialize model (all-MiniLM-L6-v2 produces 384-dimensional embeddings)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Wrap the base ranking loss in MatryoshkaLoss so the leading dimensions
# of every embedding are trained to be useful on their own
base_loss = losses.MultipleNegativesRankingLoss(model)
train_loss = losses.MatryoshkaLoss(model, base_loss, matryoshka_dims=[384, 256, 128, 64])

# Training with Matryoshka
# (train_dataloader: a DataLoader of InputExample query/document pairs)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    output_path='./fine-tuned-embedding-model'
)

# Use different dimensions by truncating the full embedding
# (re-normalize the truncated vectors if you rely on dot-product similarity)
embeddings = model.encode(texts, convert_to_numpy=True)
embeddings_128 = embeddings[:, :128]
embeddings_256 = embeddings[:, :256]

Benefits:

  • 🪆 99% performance at 6x smaller size
  • 📈 128-dim model outperforms 768-dim baseline by 6.51%
  • 💾 Reduced storage and compute requirements
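To make the truncation step concrete, here is a minimal pure-Python sketch of cutting an embedding down to its leading dimensions and rescaling it to unit length. The 8-dimensional vectors and the helper names are illustrative assumptions, not output from any real model.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 8-dim vectors standing in for real model output (made-up values)
query = [0.9, 0.1, 0.3, 0.2, 0.05, 0.01, 0.02, 0.03]
doc = [0.8, 0.2, 0.25, 0.1, 0.9, 0.7, 0.6, 0.5]

q4 = truncate_and_normalize(query, 4)
d4 = truncate_and_normalize(doc, 4)
similarity_4d = cosine(q4, d4)
```

With a model trained using a Matryoshka-style loss, the truncated vectors are meant to stay useful for retrieval; with an ordinary model, truncation like this would usually hurt quality.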

2. Synthetic Data Generation

Retrieve: Generate training data automatically for fine-tuning:

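from sentence_transformers import InputExample
import random

def create_query_variation(text):
    """Placeholder query generator; in practice an LLM would write the question."""
    return f"What does this passage say about: {text[:60]}?"

def generate_synthetic_pairs(base_texts, num_pairs=1000):
    """Generate synthetic (query, document) positive pairs"""
    pairs = []
    for _ in range(num_pairs):
        # The passage a query is generated from is its relevant document,
        # so no separate lookup step is needed
        document = random.choice(base_texts)
        query = create_query_variation(document)
        pairs.append(InputExample(texts=[query, document]))
    return pairs

# Use synthetic data for fine-tuning
# (financial_documents: your own list of text chunks)
synthetic_pairs = generate_synthetic_pairs(financial_documents, num_pairs=6300)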

3. Baseline Creation & Evaluation

During Training:

  • ✅ Continuous evaluation
  • ✅ Performance tracking
  • ✅ Early stopping
  • ✅ Best model selection
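The early stopping and best model selection items can be sketched independently of any training library. The function below is an illustrative assumption about how such logic might look, and the evaluation scores are made-up numbers, not results from the study.

```python
def select_best_checkpoint(scores, patience=2):
    """Return (best_step, best_score, stopped_at) for a sequence of eval scores.

    `scores` is a list of (step, score) tuples where higher is better.
    Training stops once `patience` consecutive evaluations fail to improve.
    """
    best_step, best_score = None, float("-inf")
    bad_evals = 0
    stopped_at = None
    for step, score in scores:
        if score > best_score:
            best_step, best_score = step, score
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                stopped_at = step
                break
    return best_step, best_score, stopped_at

# Made-up evaluation history: (training step, retrieval score)
history = [(500, 0.61), (1000, 0.66), (1500, 0.65), (2000, 0.64), (2500, 0.70)]
result = select_best_checkpoint(history, patience=2)
```

Here the checkpoint from step 1000 is kept and training stops at step 2000, so the later improvement at step 2500 is never seen. That is the usual trade-off when choosing a small patience value.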

Implementation Example

from sentence_transformers import InputExample, SentenceTransformer, losses, evaluation
from sentence_transformers.datasets import NoDuplicatesDataLoader

# Load base model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Prepare training data
train_examples = [
    InputExample(texts=["Query about revenue", "Document about financial revenue"]),
    # ... more examples
]

train_dataloader = NoDuplicatesDataLoader(train_examples, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

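# Evaluation during training
# test_queries: {query_id: text}, test_corpus: {doc_id: text},
# test_relevant_docs: {query_id: set of relevant doc_ids}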
evaluator = evaluation.InformationRetrievalEvaluator(
    queries=test_queries,
    corpus=test_corpus,
    relevant_docs=test_relevant_docs
)

# Fine-tune
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    evaluator=evaluator,
    evaluation_steps=500,
    output_path='./financial-embedding-model'
)
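The example above leaves test_queries, test_corpus, and test_relevant_docs undefined. One possible way to build them from held-out (query, document) pairs is sketched below; the helper name, the ID scheme, and the sample pairs are assumptions made for illustration.

```python
def build_ir_eval_sets(pairs):
    """Turn (query, document) pairs into query, corpus, and relevance dicts."""
    queries, corpus, relevant_docs = {}, {}, {}
    doc_ids = {}  # document text -> corpus id, so repeated documents share one id
    for i, (query, document) in enumerate(pairs):
        qid = f"q{i}"
        if document not in doc_ids:
            doc_ids[document] = f"d{len(doc_ids)}"
            corpus[doc_ids[document]] = document
        queries[qid] = query
        relevant_docs.setdefault(qid, set()).add(doc_ids[document])
    return queries, corpus, relevant_docs

# Made-up held-out pairs; two queries point at the same document
held_out_pairs = [
    ("What was total revenue?", "Revenue section of the filing"),
    ("How did revenue change?", "Revenue section of the filing"),
    ("What are the main risks?", "Risk factors section of the filing"),
]
test_queries, test_corpus, test_relevant_docs = build_ir_eval_sets(held_out_pairs)
```

Keeping these pairs separate from the training pairs matters: evaluating on queries the model was trained on would overstate the improvement.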

Results Summary

| Dimension | Performance | vs. Baseline | Storage |
|---|---|---|---|
| 128-dim | 6.51% better | Outperforms 768-dim | 6x smaller |
| 256-dim | Near-optimal | 99% of full performance | 3x smaller |
| 512-dim | Optimal | Full performance | 1.5x smaller |
| 768-dim | Baseline | Reference | Full size |
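The storage column follows directly from the ratio of dimensions. As a rough back-of-the-envelope sketch, assuming float32 values, a flat index with no overhead, and a hypothetical corpus of one million chunks:

```python
def index_size_mb(num_vectors, dim, bytes_per_value=4):
    """Raw size of a flat float32 index in megabytes (index overhead ignored)."""
    return num_vectors * dim * bytes_per_value / 1_000_000

num_chunks = 1_000_000  # hypothetical corpus size, not from the study

sizes = {}
for dim in (768, 512, 256, 128):
    sizes[dim] = index_size_mb(num_chunks, dim)
    ratio = 768 / dim
    print(f"{dim:>4}-dim: {sizes[dim]:8.1f} MB ({ratio:.1f}x smaller than 768-dim)")
```

Real vector stores add their own index structures and metadata, so actual footprints will differ, but the relative savings scale with the dimension in the same way.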

Use Case: Financial RAG

Dataset: NVIDIA’s 2023 SEC Filing dataset

Results:

  • 🚀 7.4% to 22.55% performance improvement
  • ⏱️ Fast training (5 minutes on consumer GPUs)
  • 🧬 Synthetic data generation
  • 🪆 Matryoshka efficiency

Resources

👉 Original Article: https://www.philschmid.de/fine-tune-embedding-model-for-rag

👉 Code Repository: https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/fine-tune-embedding-model-for-rag.ipynb

👉 Hugging Face RAG Documentation: https://huggingface.co/docs/transformers/model_doc/rag

Key Takeaways

Retrieve: Fine-tuning embedding models with domain-specific data can boost RAG performance by 7-22%, even with small datasets (6.3k samples).

Innovate: Techniques like Matryoshka Representation Learning enable efficient embeddings that maintain performance at smaller sizes, reducing computational requirements while improving results.

Curiosity → Retrieve → Innovation: Start with curiosity about improving RAG performance, retrieve knowledge about embedding fine-tuning techniques, and innovate by applying these methods to your specific domain.

Next Steps:

  • Explore the implementation code
  • Fine-tune embeddings for your domain
  • Experiment with Matryoshka learning
  • Measure performance improvements

How to Select an Embedding Model for Your RAG Application?

Embeddings form the foundation for achieving precise and contextually relevant LLM outputs across different tasks.

The encoder you select to generate embeddings is a critical decision that strongly affects the overall success of the RAG system. Low-quality embeddings lead to poor retrieval.

When selecting an embedding model, consider the vector dimension, average retrieval performance, and model size.
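Retrieval performance can be compared across candidate models with a couple of simple metrics. The sketch below shows recall@k and reciprocal rank; these are common choices used here as an assumption, since the text does not say which metric to use, and the document IDs are made up.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Made-up ranking returned by some model for one query
ranked = ["d3", "d1", "d7"]
relevant = {"d1", "d9"}

recall_2 = recall_at_k(ranked, relevant, k=2)
rr = reciprocal_rank(ranked, relevant)
```

Averaging these values over a set of held-out queries from your own domain gives a like-for-like number for each candidate model, which can then be weighed against its vector dimension and model size.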

Companies such as OpenAI, Cohere, and Voyage consistently release enhanced embedding models.

Different types of embeddings are designed to address unique challenges and requirements in different domains.

 Types of Embedding Models For RAG

⮕ Dense embeddings are continuous, real-valued vectors that represent information in a high-dimensional space.

Curiosity: In the context of RAG applications, dense embeddings, such as those generated by models like OpenAI’s Ada or sentence transformers, contain non-zero values for every element.

⮕ Sparse embeddings, on the other hand, are representations where most values are zero, emphasizing only relevant information.

In RAG applications, sparse vectors are essential for scenarios with many rare keywords or specialized terms.
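As a small illustration of why sparse vectors suit rare keywords, they can be stored as a mapping from term to weight, so only the non-zero entries take space and only shared terms contribute to the score. The terms and weights below are made-up values, not the output of any particular sparse model.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller dict so the cost depends on the shorter vector
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[term] for term, weight in a.items() if term in b)

# Made-up sparse vectors: only terms with non-zero weight are stored
query_vec = {"ebitda": 1.2, "guidance": 0.8}
doc_vec = {"ebitda": 0.9, "quarter": 0.4, "revenue": 0.3}

score = sparse_dot(query_vec, doc_vec)
```

Only the shared term "ebitda" contributes to the score, which is how an exact match on a specialized term can dominate the ranking.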

⮕ Multi-vector embedding models like ColBERT feature late interaction, where the interaction between query and document representations occurs late in the process, after both have been independently encoded.
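Late interaction is commonly scored with a "MaxSim" operation: each query token vector is matched with its best document token vector, and the per-token maxima are summed. The sketch below uses tiny made-up 2-dimensional token vectors and is a simplified illustration, not ColBERT's actual implementation.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vectors, doc_vectors):
    """Sum over query tokens of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)

# Made-up token-level embeddings: one vector per token
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.1, 0.2], [0.3, 0.1]]

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
```

Because documents are encoded independently of the query, their token vectors can be computed and stored ahead of time; only this scoring step runs at query time.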

⮕ Long documents have always posed a particular challenge for embedding models.

The limitation on maximum sequence lengths, often rooted in architectures like BERT, leads to practitioners segmenting documents into smaller chunks. Unfortunately, this segmentation can result in fragmented semantic meanings and misrepresentation of entire paragraphs.
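One common way to soften the fragmentation problem is to let neighbouring chunks overlap, so text near a boundary appears in both. The following is a minimal word-based sketch; the function name and default sizes are illustrative assumptions, and production systems usually count tokens rather than words.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into word windows of `chunk_size` that overlap by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

# Tiny example: ten placeholder words, windows of 4 with 2 words of overlap
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=2)
```

Overlap increases the number of chunks to embed and store, so the window and overlap sizes are worth tuning against retrieval quality on your own data.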

⮕ Variable dimension embeddings are a unique concept built on Matryoshka Representation Learning (MRL).

MRL learns lower-dimensional embeddings that are nested into the original embedding, akin to a series of Matryoshka Dolls.

⮕ Code embeddings are a recent development used to integrate AI-powered capabilities into Integrated Development Environments (IDEs), fundamentally transforming how developers interact with codebases.

Curiosity: What insights can we retrieve from this? How does this connect to innovation in the field?

There are several factors that need to be considered while selecting an embedding model.

Know more about embeddings and models in this article: https://www.rungalileo.io/blog/mastering-rag-how-to-select-an-embedding-model

This post is licensed under CC BY 4.0 by the author.