🌟 LLM Providers' Releases (2024 1H) & 📝 2024 LLM Surveys (on Training / Data / RAG / Serving / Agent)

🌟 LLM Providers' Releases (2024 1H)

I have selected 21 LLMs released by various providers in 2024 1H. [📣 Release news] and [📋 Tech report] links are in this post; [📘 API docs] and [🤗 HF models] links are in the comments.

  1. Qwen-2 (Alibaba Group, 2024.06.07)
  2. Solar-Mini-ja (Upstage, 2024.05.22)
  3. Yi-Large (01.AI, 2024.05.13)
  4. Yi-1.5 (01.AI, 2024.05.13)
  5. GPT-4o (OpenAI, 2024.05.13)
  6. Qwen-Max (Alibaba Group, 2024.05.11)
  7. DeepSeek-V2 (DeepSeek, 2024.05.07)
  8. Snowflake-Arctic (Snowflake, 2024.04.24)
  9. Phi-3 (Microsoft, 2024.04.22)
  10. Llama-3 (Meta / Facebook, 2024.04.18)
  11. Mixtral-8x22B (Mistral AI, 2024.04.17)
  12. Reka-Core (Reka AI, 2024.04.15)
  13. Command-R-Plus (Cohere, 2024.04.04)
  14. DBRX (Databricks, 2024.03.27)
  15. Gemini-1.5 (Google, 2024.03.08)
  16. Claude-3 (Anthropic, 2024.03.04)
  17. Mistral-Large (Mistral AI, 2024.02.26)
  18. Gemma (Google, 2024.02.21)
  19. Qwen-1.5 (Alibaba Group, 2024.02.04)
  20. Solar-Mini (Upstage, 2024.01.25)
  21. Solar-10.7B (Upstage, 2023.12.23)

📝 2024 LLM Surveys (on Training / Data / RAG / Serving / Agent)

I'll read all of the surveys below!

[Image: LLM 2024 survey]

Training

📌 A Survey on Self-Evolution of Large Language Models (2024.04.22)

📌 Continual Learning of Large Language Models: A Comprehensive Survey (2024.04.25)

📌 Continual Learning with Pre-Trained Models: A Survey (2024.01.29)

Data

📌 Datasets for Large Language Models: A Comprehensive Survey (2024.02.28)

📌 A Survey on Data Selection for Language Models (2024.02.26)

📌 A Survey on Data Selection for LLM Instruction Tuning (2024.02.04)

RAG

📌 RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing (2024.04.30)

📌 Retrieval-Augmented Generation for AI-Generated Content: A Survey (2024.02.29)

📌 Retrieval-Augmented Generation for Large Language Models: A Survey (2023.12.18)

Serving

📌 LLM Inference Unveiled: Survey and Roofline Model Insights (2024.02.26)

📌 A Survey on Effective Invocation Methods of Massive LLM Services (2024.02.05)

📌 Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models (2024.01.01)

Agent

📌 Large Multimodal Agents: A Survey (2024.02.23)

📌 Large Language Model based Multi-Agents: A Survey of Progress and Challenges (2024.01.21)

📌 Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security (2024.01.10)

Information about Tokens in LLMs

Why do we keep talking about "tokens" in LLMs instead of words?

It turns out to be much more efficient for model performance to break words into sub-word units (tokens)!

The typical strategy used in most modern LLMs since GPT-1 is Byte Pair Encoding (BPE). The idea is to use, as tokens, sub-word units that appear often in the training data. The algorithm works as follows (a toy Python sketch follows the list):

  • We start with a character-level tokenization
  • We count the frequencies of adjacent symbol pairs
  • We merge the most frequent pair
  • We repeat the process until the dictionary is as big as we want it to be
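
Here is a minimal sketch of that loop, just to make the merge step concrete. The tiny corpus, the `train_bpe` name, and the merge count are made up for illustration; real tokenizers work on bytes, handle whitespace and pre-tokenization rules, and are far more optimized.

```python
# Toy BPE trainer (illustrative only; not the GPT/Llama implementation).
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a list of words."""
    # Start with a character-level tokenization of each word.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

corpus = ["low", "low", "lower", "newest", "newest", "newest", "widest"]
print(train_bpe(corpus, num_merges=5))
# Frequent pairs such as ('w', 'e') and ('s', 't') are merged first (exact order
# depends on tie-breaking), and repeated merges grow into longer sub-word units.
```

Production tokenizers (byte-level BPE, SentencePiece, Hugging Face tokenizers) add byte fallback and pre-tokenization on top, but the training loop is this same count-and-merge.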

The size of the dictionary becomes a hyperparameter that we can adjust based on our training data. For example, GPT-1 uses ~40K merges; GPT-2, GPT-3, and ChatGPT use a vocabulary of ~50K tokens; and Llama 3 uses ~128K.
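
To check such numbers yourself, the snippet below uses the `tiktoken` package (my choice here, not something the post mentions) to print the vocabulary size of a few public OpenAI encodings and how each splits a single word; Llama 3's ~128K-token vocabulary ships with its Hugging Face tokenizer instead.

```python
# Inspect real BPE vocabularies with OpenAI's tiktoken package
# (assumes `pip install tiktoken`).
import tiktoken

# r50k_base ~ GPT-3 era, cl100k_base ~ GPT-3.5/GPT-4, o200k_base ~ GPT-4o.
for name in ["r50k_base", "cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    ids = enc.encode("tokenization")
    pieces = [enc.decode([i]) for i in ids]
    print(f"{name}: vocab size = {enc.n_vocab}, 'tokenization' -> {pieces}")
# Each encoding reports its vocabulary size (roughly 50K, 100K, 200K) and a
# slightly different sub-word split of the same word.
```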

[Image: Tokens from Words in LLMs]

This post is licensed under CC BY 4.0 by the author.