
๐ŸŽ Post-training in the Apple Intelligence paper

Architecture of Apple Intelligence, with adapters for the on-device and server language models

Apple Intelligence Foundation Language Models

👉 Paper: https://arxiv.org/pdf/2407.21075

Two models:

  • On-device: ~3B parameters with task-specific LoRA adapters (see the sketch below)
  • Server: ~70B parameters (my estimate); almost no details are given
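
For context, a LoRA adapter keeps the base weights frozen and adds a trainable low-rank update on top of selected layers, which is what lets a single ~3B base model serve many tasks by swapping small adapters. Here is a minimal sketch of the mechanism in PyTorch; the class name, rank, and dimensions are placeholders, not Apple's implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # the base model stays frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection A
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_b.weight)               # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha / r) * B(A(x)); only A and B are task-specific and trained
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Swapping task adapters amounts to loading a different (A, B) pair over the same base
layer = LoRALinear(nn.Linear(2048, 2048), rank=16)
out = layer(torch.randn(1, 2048))
```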

Human-annotated and synthetic data:

  • Mathematics: A @WizardLM_AI-like evol strategy to create diverse problems and solutions
  • Tool use: Focuses on single-tool use cases first, then progresses to multi-tool scenarios
  • Coding: Self-instruct method with rejection sampling and execution-based validation (a sketch follows this list)

  • Mixture ratio optimization: Treats the combination of different data sources as an optimization problem. It doesn't seem super interesting ¯\_(ツ)_/¯
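
To make the coding bullet concrete, here is a rough sketch of what execution-based rejection sampling can look like: generate several candidate solutions for a self-instructed prompt, run each one together with its unit tests in a subprocess, and keep only the candidates that pass. The `generate` and `make_tests` callables are hypothetical stand-ins for LLM calls, not anything from the paper:

```python
import subprocess
import sys
import tempfile

def passes_tests(solution: str, test_code: str, timeout: float = 5.0) -> bool:
    """Execute a candidate solution together with its unit tests; accept only if they pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # hung or too slow: reject

def rejection_sample_code(prompt: str, generate, make_tests, n: int = 16) -> list[str]:
    """Generate n candidate solutions and keep only those that pass execution-based validation."""
    tests = make_tests(prompt)
    return [c for c in (generate(prompt) for _ in range(n)) if passes_tests(c, tests)]
```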

Two new algorithms:

  • iTeC (Iterative Teaching Committee): An iterative RLHF framework that combines various preference optimization algorithms, uses a diverse “model committee” for data collection, and scales up distillation to improve models across all sizes.
  • MDLOO (Mirror Descent with Leave-One-Out estimation): An online RLHF algorithm that uses a Leave-One-Out estimator for advantage estimation and Mirror Descent Policy Optimization for policy updates, designed to maximize KL-penalized reward functions.
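
The leave-one-out estimator in MDLOO reads like the standard trick of baselining each response against the mean reward of the other K−1 responses sampled for the same prompt. A small sketch under that assumption (my reading of the description, not Apple's code):

```python
import numpy as np

def leave_one_out_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantage for K responses sampled for the same prompt.

    Each rewards[i] is assumed to already be the KL-penalized reward of response i,
    i.e. r(x, y_i) - beta * log(pi_theta(y_i | x) / pi_ref(y_i | x)).
    """
    k = rewards.shape[0]
    # baseline for sample i = mean reward of the other K - 1 samples
    baselines = (rewards.sum() - rewards) / (k - 1)
    return rewards - baselines

# Example: 4 sampled responses to one prompt
adv = leave_one_out_advantages(np.array([1.2, 0.4, 0.9, -0.3]))
```

These per-response advantages would then drive the mirror-descent-style policy update; the exact objective is spelled out in the paper.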

Apple uses a classic post-training pipeline.

Some details are surprising, like the size of the code instruction dataset (only 12k samples).

The RLHF side is the most interesting aspect of the pipeline.
