
RLHF from Scratch: Learning to Summarize from Human Feedback on a Single DGX Spark

🤔 Curiosity: Can We Train RLHF on a Single GPU Node?

What if you could reproduce the entire RLHF pipeline—from supervised fine-tuning to reward modeling to policy optimization—on a single DGX Spark? What if you could train models that learn to summarize from human feedback without needing an 8-H100 cluster?

*(Figure: SFT training loss)*

Curiosity: RLHF typically requires massive compute resources. But what if we could scale it down to a single node while maintaining the core learning principles? How would that change accessibility to RLHF?

Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning language models with human preferences. However, most implementations require significant computational resources—often 8+ H100 GPUs. This repository demonstrates how to reproduce the complete RLHF pipeline on a single DGX Spark, making RLHF more accessible.

The question: How do we implement the full RLHF pipeline—SFT, reward modeling, DPO, and GRPO—on limited hardware? What are the key implementation details that make this possible?

As someone who’s trained RLHF systems, I know the challenges of scaling down while maintaining training quality. This implementation provides valuable insights into making RLHF more accessible.


📚 Retrieve: Understanding the RLHF Pipeline

The Complete Pipeline

RLHF consists of four main stages:

graph TB
    subgraph Stage1["Stage 1: Supervised Fine-Tuning"]
        Base[Base Model<br/>Qwen2.5-0.5B] --> SFT[SFT Model]
        SFT_Data[TL;DR Dataset] --> SFT
    end
    
    subgraph Stage2["Stage 2: Reward Model Training"]
        RM_Base[Qwen2.5-1.5B] --> RM[Reward Model]
        Comparison[Comparison Dataset] --> RM
    end
    
    subgraph Stage3["Stage 3: DPO Training"]
        SFT --> DPO_Ref[Reference Model]
        SFT --> DPO_Policy[Policy Model]
        Comparison --> DPO_Policy
        DPO_Ref --> DPO_Policy
    end
    
    subgraph Stage4["Stage 4: GRPO Training"]
        SFT --> GRPO_Policy[Policy Model]
        RM --> GRPO_Reward[Reward Signal]
        GRPO_Policy --> GRPO_Reward
        GRPO_Reward --> GRPO_Policy
    end
    
    style SFT fill:#4ecdc4,stroke:#0a9396,stroke-width:2px,color:#fff
    style RM fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    style DPO_Policy fill:#ffe66d,stroke:#f4a261,stroke-width:2px,color:#000
    style GRPO_Policy fill:#95e1d3,stroke:#2d8659,stroke-width:2px,color:#000

Key Insight: Each stage builds on the previous one, creating a pipeline that progressively aligns the model with human preferences.

Implementation Details

Hardware Requirements:

  • Single DGX Spark (or equivalent single-node GPU setup)
  • Docker for containerization
  • Hugging Face token for model access

💡 Innovation: Stage-by-Stage Implementation

Stage 1: Supervised Fine-Tuning (SFT)

Goal: Train a base model to generate summaries in the desired format.

Base Model: Qwen/Qwen2.5-0.5B

Dataset: summarize_from_feedback_tldr_3_filtered

Key Preprocessing:

  • Prepend a whitespace and append a trailing `<|endoftext|>` (EOS) token to each reference summary (a minimal preprocessing sketch follows below)
  • Maximum response length: 63 tokens
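
As a concrete illustration, here is a minimal preprocessing sketch. The helper and field names (`format_sft_example`, `prompt`, `summary`) are hypothetical placeholders, not the repository's actual schema:

```python
# Hypothetical sketch of the preprocessing described above; field names are assumed.
def format_sft_example(example, eos_token="<|endoftext|>"):
    # Prepend a space and append the EOS token to the reference summary
    completion = " " + example["summary"].strip() + eos_token
    return {"prompt": example["prompt"], "completion": completion}
```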

Training Configuration:

| Parameter | Value |
| --- | --- |
| `num_train_epochs` | 1 (116,722 episodes) |
| `batch_size` | 16 |
| `gradient_accumulation_steps` | 8 |
| Effective batch size | 128 |
| `learning_rate` | 3e-06 |
| `min_learning_rate` | 3e-07 |
| `warmup_steps` | 10 |
| `lr_decay_steps` | 800 |
| `grad_clip` | 1 |
| `use_eight_bit_optimizer` | true |

*(Figure: SFT validation loss)*

Results:

The SFT model produces much more coherent summaries than the base model:

| Input | Base Model | SFT Model |
| --- | --- | --- |
| Reddit post about a relationship | “Is it okay to ask if everything is okay or am I being pushy?” | “Guy I’m dating hasn’t been texting me in a month and I asked if everything was okay and he said yes. Am I being pushy or too clingy asking if everything is okay?” |

Implementation Steps:

```bash
# 1. Build Docker container
sudo docker build --build-arg HF_TOKEN=$HF_TOKEN -t summary_from_human_feedback .

# 2. Run container
sudo sh launch_docker.sh

# 3. Process SFT dataset
python3 -m dgx_spark_summary_from_human_feedback.process_sft_dataset

# 4. Train SFT model
python sft.py
```

Stage 2: Reward Model Training

Goal: Train a model to predict which summary is better (chosen vs. rejected).

Dataset: openai/summarize_from_feedback

Key Finding: Using the SFT model to initialize the reward model didn’t work well (likely due to 0.5B model size). Better results with:

  • Qwen2.5-0.5B: Baseline performance
  • Qwen2.5-1.5B: Better training and validation performance ✅
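
For context, reward models for this stage are typically trained with a pairwise Bradley–Terry loss, which pushes the chosen summary's scalar reward above the rejected one's. A generic sketch (not necessarily the repository's exact code):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Maximize the log-sigmoid of the reward margin between chosen and rejected summaries
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```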

*(Figure: Qwen reward model comparison)*

Training Configuration (Qwen2.5-1.5B):

| Parameter | Value |
| --- | --- |
| `batch_size` | 8 |
| `gradient_accumulation_steps` | 16 |
| `learning_rate` | 0.00005 |
| `warmup_ratio` | 0.03 |
| `num_train_epochs` | 1 |
| `grad_clip` | 1 |
| `use_eight_bit_optimizer` | true |

Reward Normalization:

Following best practice, compute the average reward on the SFT dataset and normalize the reward model output by subtracting it:

```python
# Normalize reward by subtracting mean reward on SFT dataset
normalized_reward = reward_model_output - mean_sft_reward
```
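
One way to estimate that mean, as a sketch (the helper name and reward-model call signature are assumptions, not the repository's API):

```python
import torch

@torch.no_grad()
def estimate_mean_sft_reward(reward_model, sft_dataloader, device="cuda"):
    # Average the scalar reward over reference (SFT) summaries
    total, count = 0.0, 0
    for batch in sft_dataloader:
        rewards = reward_model(batch["input_ids"].to(device),
                               batch["attention_mask"].to(device))  # assumed scalar outputs
        total += rewards.sum().item()
        count += rewards.numel()
    return total / count
```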

Upload to Hugging Face:

```bash
python3 -m dgx_spark_summary_from_human_feedback.load_local_reward_model_and_push_to_hf \
    --model_path /path/to/checkpoint \
    --push_to_hf \
    --hf_repo_id seangogo/Qwen2.5-1.5B_reward_model_v2
```

Stage 3: DPO Training

Goal: Align the model with human preferences without a reward model.

Key Innovation: DPO (Direct Preference Optimization) eliminates the need for a separate reward model by directly optimizing on preference pairs.

DPO Loss Function:

\[\mathcal{L}_{\text{DPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_c, y_r) \sim \mathcal{D}_{\text{PREF}}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_c | x)}{\pi^{\text{SFT}}(y_c | x)} - \beta \log \frac{\pi_\theta(y_r | x)}{\pi^{\text{SFT}}(y_r | x)} \right) \right]\]

Where:

  • $\pi_\theta$: Current policy model
  • $\pi^{\text{SFT}}$: SFT reference model
  • $(x, y_c, y_r)$: Prompt, chosen response, rejected response
  • $\beta$: Hyperparameter controlling confidence in comparison data
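
In code, the loss above (with the label smoothing used in the configuration below) can be sketched as follows; this is a generic formulation over per-response log-probabilities, not the repository's exact implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, label_smoothing=0.1):
    # Log-ratios of policy vs. frozen SFT reference for chosen and rejected responses
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    # Label smoothing hedges against noisy preference labels (conservative DPO variant)
    loss = (-(1.0 - label_smoothing) * F.logsigmoid(logits)
            - label_smoothing * F.logsigmoid(-logits))
    return loss.mean()
```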

Training Configuration:

| Parameter | Value |
| --- | --- |
| `batch_size` | 8 |
| `gradient_accumulation_steps` | 16 |
| `learning_rate` | 0.00005 |
| `beta` | 0.1 |
| `label_smoothing` | 0.1 |
| `num_train_epochs` | 1 |

*(Figure: DPO training curve)*

Results:

DPO produces more detailed summaries compared to SFT:

| Model | Summary Quality |
| --- | --- |
| Base | Repetitive, low quality |
| SFT | Coherent but basic |
| DPO | More detailed, better captures context |

Training Command:

```bash
python3 dpo.py \
    --output_dir dpo_output \
    --sft_model_path <local_or_hf_path_to_sft_model>
```

Stage 4: GRPO Training

Goal: Optimize the policy using reinforcement learning with the reward model.

GRPO (Group Relative Policy Optimization) is a variant of PPO optimized for language model training.

Key Implementation Details:

1. Generation Configuration:

```python
generation_config = GenerationConfig(
    max_new_tokens=63,  # response_length
    temperature=0.7 + 1e-7,
    top_k=0,
    top_p=1.0,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,  # Must be explicit!
    pad_token_id=tokenizer.pad_token_id,
)
```

Critical Notes:

  • The EOS token must be explicitly set; the default is `None`, which causes generation to continue until `max_new_tokens`
  • Do not set `min_new_tokens`; setting it equal to `max_new_tokens` prevents EOS generation
  • Use left padding for prompts to ensure consistent prompt lengths (see the batched-generation sketch below)
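
A hedged sketch of how left-padded, batched sampling might look with Hugging Face `transformers`; the checkpoint name and prompt string are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder; use your SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"          # left padding keeps prompt endings aligned
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompts = ["SUBREDDIT: r/relationships\nPOST: ...\nTL;DR:"]  # illustrative prompt format
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

generation_config = GenerationConfig(
    max_new_tokens=63,
    do_sample=True,
    temperature=0.7,
    top_k=0,
    top_p=1.0,
    eos_token_id=tokenizer.eos_token_id,  # explicit EOS, as noted above
    pad_token_id=tokenizer.pad_token_id,
)
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```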

2. Optimization Process:

Multiple epochs of mini-batch gradient updates (see the loop sketch below):

  1. Split rollout data into mini-batches
  2. Further split mini-batches into micro-batches for gradient accumulation
  3. Update policy weights after processing all micro-batches

Total weight update steps:

\[\frac{\text{num\_epochs} \times \text{total\_samples} \times \text{num\_responses\_per\_prompt}}{\text{batch\_size} \times \text{mini\_batch\_size}}\]
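
Structurally, one rollout update can be sketched like this (an illustrative helper, not the repository's training loop):

```python
import torch

def split(data, size):
    """Chunk a sequence of samples into consecutive pieces of `size`."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def run_grpo_update(policy, optimizer, rollout_batch, loss_fn,
                    update_per_rollout=4, mini_batch_size=64,
                    micro_batch_size=8, grad_clip=1.0):
    """Illustrative structure of one GRPO rollout update."""
    accum_steps = mini_batch_size // micro_batch_size
    for _ in range(update_per_rollout):                      # multiple passes over the rollout
        for mini_batch in split(rollout_batch, mini_batch_size):
            optimizer.zero_grad()
            for micro_batch in split(mini_batch, micro_batch_size):
                loss = loss_fn(policy, micro_batch) / accum_steps
                loss.backward()                              # accumulate gradients
            torch.nn.utils.clip_grad_norm_(policy.parameters(), grad_clip)
            optimizer.step()                                 # weights update once per mini-batch
```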

3. Loss Functions:

Vanilla PPO Loss:

  • Bounded when advantage is positive
  • Unbounded when advantage is negative (problem!)

*(Figure: vanilla PPO loss)*

Dual Clip Loss:

  • Clips the loss when the advantage is negative and the ratio exceeds `c`
  • Stabilizes training by bounding the negative-advantage case

*(Figure: dual clip loss)*
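
Below is a sketch of a dual-clip PPO policy loss. The dual-clip constant `c` is not listed in the configuration table, so the default here (3.0, a common choice) is an assumption:

```python
import torch

def dual_clip_ppo_loss(log_probs, old_log_probs, advantages,
                       clip_ratio=0.2, dual_clip_c=3.0):
    """Sketch of a dual-clip PPO policy loss (generic formulation, c > 1 assumed)."""
    ratio = torch.exp(log_probs - old_log_probs)
    # Standard clipped surrogate, written as a loss (negative objective)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    ppo_loss = -torch.min(surr1, surr2)
    # Dual clip: for negative advantages, cap the loss at -c * A so it stays bounded
    capped = torch.min(ppo_loss, -dual_clip_c * advantages)
    return torch.where(advantages < 0, capped, ppo_loss).mean()
```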

Training Configuration:

| Category | Hyperparameter | Value |
| --- | --- | --- |
| Training | `num_train_epochs` | 1 |
| | `learning_rate` | 5e-5 |
| | `warmup_ratio` | 0.03 |
| | `grad_clip` | 1.0 |
| Batch Sizes | `batch_size` | 16 |
| | `mini_batch_size` | 64 |
| | `micro_batch_size` | 8 |
| | `num_responses_per_group` | 8 |
| GRPO | `update_per_rollout` | 4 |
| | `clip_ratio` | 0.2 |
| | `kl_coeff` | 0.05 |
| | `kl_penalty_mode` | “k3” |
| | `normalize_adv_by_std_of_group` | true |
| | `no_eos_penalty` | -1.0 |
| Generation | `response_length` | 63 |
| | `temperature` | 0.7 |
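
The `kl_penalty_mode` of “k3” most likely refers to the k3 estimator from Schulman's "Approximating KL Divergence" note; assuming so, the per-token penalty can be sketched as:

```python
import torch

def k3_kl_penalty(policy_logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # k3 estimator of KL(policy || ref): r - 1 - log(r), with r = ref_prob / policy_prob.
    # Unbiased and always non-negative, which keeps the penalty well behaved.
    log_ratio = ref_logprobs - policy_logprobs
    return torch.exp(log_ratio) - log_ratio - 1.0
```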

*(Figure: GRPO reward climbing during training)*

Results Comparison:

| Model | Summary |
| --- | --- |
| Base | “Is it okay to ask if everything is okay or am I being pushy?” |
| SFT | “Guy I’m dating hasn’t been texting me in a month…” |
| DPO | “I have been dating this guy for 1 month and he hasn’t responded…” |
| GRPO | “Dating guy is acting different and never responds to anything I ask. Is it okay to ask if everything is okay or am I being pushy?” |

Training Command:

```bash
sh run_grpo.sh
```

🎯 Key Implementation Insights

1. Model Size Matters for Reward Models

Finding: SFT model (0.5B) didn’t work well as reward model initialization. Qwen2.5-1.5B performed significantly better.

Implication: Reward modeling may require larger models than policy models for effective learning.

2. Generation Configuration is Critical

Key Issues:

  • The EOS token must be explicitly set
  • `min_new_tokens` can prevent EOS generation
  • Proper padding ensures consistent batch processing

Solution: Careful configuration of generation parameters is essential for RLHF training.

3. Dual Clip Loss Stabilizes Training

Problem: Vanilla PPO loss is unbounded for negative advantages.

Solution: Dual clip loss bounds the loss magnitude, preventing training instability.

4. Reward Normalization Improves Performance

Practice: Normalize reward model output by subtracting mean reward on SFT dataset.

Benefit: Better reward signal stability during policy optimization.

5. Efficient Batch Processing

Strategy: Use gradient accumulation with micro-batches to fit large models in limited GPU memory.

Formula: effective batch size = `batch_size × gradient_accumulation_steps × micro_batch_size`


📊 Performance Comparison

| Stage | Model | Key Metric | Improvement |
| --- | --- | --- | --- |
| Base | Qwen2.5-0.5B | Baseline | - |
| SFT | Fine-tuned 0.5B | Coherent summaries | ✅ Significant |
| Reward | Qwen2.5-1.5B | Preference prediction | ✅ Better than 0.5B |
| DPO | DPO-aligned 0.5B | More detailed summaries | ✅ Over SFT |
| GRPO | GRPO-optimized 0.5B | Best quality | ✅ Over DPO |

Key Takeaways:

  1. SFT provides foundation - Essential baseline for all subsequent stages
  2. Larger reward models help - 1.5B > 0.5B for reward modeling
  3. DPO is efficient - No reward model needed, faster training
  4. GRPO achieves best quality - RL optimization produces highest quality summaries

🔧 Technical Deep Dive

Docker Setup

Build Container:

```bash
sudo docker build \
    --build-arg HF_TOKEN=$HF_TOKEN \
    --build-arg CUDA_VERSION=13.0 \
    -t summary_from_human_feedback .
```

Run Container:

```bash
sudo sh launch_docker.sh
```

Dataset Processing

SFT Dataset:

  • Source: `vwxyzjn/summarize_from_feedback_tldr_3_filtered`
  • Processing: add whitespace and EOS tokens
  • Max length: 63 tokens

Comparison Dataset:

  • Source: `openai/summarize_from_feedback`
  • Format: (prompt, chosen, rejected) triplets (see the loading sketch below)
  • Max length: 133 tokens
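
Loading the comparison data with the `datasets` library might look like the sketch below; the config and field names are recalled from the dataset card and should be verified:

```python
from datasets import load_dataset

# Each row holds a post plus two candidate summaries and a human choice (0 or 1).
ds = load_dataset("openai/summarize_from_feedback", "comparisons", split="train")

example = ds[0]
prompt = example["info"]["post"]
chosen = example["summaries"][example["choice"]]["text"]
rejected = example["summaries"][1 - example["choice"]]["text"]
```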

Memory Optimization

Strategies:

  • 8-bit optimizer for SFT and reward model
  • Gradient accumulation for large effective batch sizes
  • Micro-batching for GRPO training
  • Efficient attention mechanisms
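
For the 8-bit optimizer, the usual route is `bitsandbytes`; whether the repository uses it or a different implementation is an assumption here:

```python
import torch
import bitsandbytes as bnb

# Toy module stands in for the policy / reward model
model = torch.nn.Linear(2048, 2048).cuda()

# 8-bit AdamW keeps optimizer state quantized, cutting optimizer memory roughly 4x
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=3e-6)
```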

🎯 Key Takeaways

| Insight | Implication | Action |
| --- | --- | --- |
| Single-node RLHF is feasible | RLHF accessible on limited hardware | Use efficient implementations |
| Model size matters for rewards | Larger models better for reward modeling | Use 1.5B+ for reward models |
| Generation config is critical | EOS handling affects training | Explicitly set EOS tokens |
| Dual clip stabilizes PPO | Better than vanilla PPO loss | Use dual clip for GRPO |
| Reward normalization helps | More stable reward signals | Normalize by SFT mean |

Why This Matters

As someone who’s trained RLHF systems, here’s what stands out:

  1. Accessibility: Single-node implementation makes RLHF more accessible
  2. Complete Pipeline: All stages from SFT to GRPO are covered
  3. Practical Details: Real implementation details that matter in practice
  4. Reproducibility: Docker setup ensures reproducible environment
  5. Open Source: All code and models available on Hugging Face

What I’d Try First:

  • Run the Docker setup on a single GPU
  • Experiment with different model sizes for reward modeling
  • Compare DPO vs. GRPO performance
  • Test reward normalization impact

🤔 New Questions This Raises

  1. Scaling: How does performance scale with more GPUs? What’s the optimal GPU count?

  2. Model Size: What’s the minimum model size for effective reward modeling?

  3. DPO vs. GRPO: When is DPO sufficient vs. when do you need GRPO?

  4. Reward Normalization: How does normalization affect different reward model architectures?

  5. Generation Config: What other generation parameters significantly affect RLHF training?

  6. Memory Efficiency: Can we further optimize memory usage for larger models?

Next Steps: Experiment with the implementation, try different hyperparameters, and explore scaling to multiple GPUs.



This post is licensed under CC BY 4.0 by the author.