Top Papers in Computer Vision, NLP, Speech, Multimodal AI, Core ML, RecSys, and Graph ML

Posted May 18, 2024

By Fodev JEO

Distilled AI: https://aman.ai/papers/
aman AI: https://aman.ai/

👉🏼 I’ve put together a summary of key papers in #AI and grouped them into (i) need-to-know and (ii) good-to-know.

🔹 Vision

Image Classification (from CNN architectures such as AlexNet, VGGNet, InceptionNet, and ResNet to Transformer architectures such as ViT, DeiT, BEiT, and MAE)
Object Detection (YOLO v1-v8, Fast/er R-CNN, Mask R-CNN, CenterNet, Pix2Seq, DETR, Detic, Focal Loss)
Semantic/Instance Segmentation (U-Net, Mask R-CNN, Segment Anything)
NeRF (Instant NeRF, Block-NeRF)
SSL Contrastive Learning (SimCLR, MoCo, DINO v1 & v2)

🔹 NLP

Transformers (original paper)
Semantic Representation Encoders (BERT and its variants: RoBERTa, DistilBERT, ELECTRA, XLNet, MPNet, ALBERT)
Autoregressive Decoders (GPT-n, Llama 1/2/3, Alpaca, Vicuna)
Augmented LMs (RAG, Toolformer, HuggingGPT, Gorilla)
Supervised Fine-tuning (Instruction tuning/FLAN, LIMA, LESS)
LLM Alignment (RLHF/InstructGPT, PPO, DPO, KTO, GPO, IPO)
Encoder + Decoder Architectures (T0, T5, BART)
Machine Translation (M2M-100, NLLB-200)
Contrastive Learning (SNCSE, InfoNCE, Sentence-BERT)
Prompting (CoT, Auto-CoT, Self-Consistency, ToT, GoT, ReAct, APE, ART)
PEFT (Prefix-tuning, Adapters, LoRA, LLaMA-Adapter v1 and v2, QLoRA, QA-LoRA, DoRA, NOLA)

🔹 Speech

SSL Pre-Training (WavLM, AudioMAE, HuBERT)
Automatic Speech Recognition/Keyword Spotting (GMM-HMM, DNN-HMM, all-neural architectures such as LAS/Whisper, streaming architectures such as RNN-T/Transformer-T)
Speaker Identification (i/d/x-vectors, GE2E loss, AAM loss)
Text-to-Speech (HiFi-GAN, Tacotron v1 and v2, Voicebox)
Text-to-Audio/Music (MusicGen, AudioGen)

🔹 Multimodal

SSL Pre-Training (ViLT, MLIM, UNITER, LXMERT, VisualBERT, Data2Vec v1 and v2, I-Code, VL-BEIT, ImageBind)
V+L Prompting (Flamingo, Frozen, InstructBLIP)
Text-to-Image (DALL-E 1/2/3, Imagen, Latent Diffusion, Make-A-Scene, Make-A-Video)
Translation (SeamlessM4T)
Contrastive Learning (InfoNCE, CLIP, CLAP, AudioCLIP)

🔹 Core ML

Training Regularizer (Dropout)
Training/Inference Efficiency (ZeRO, ZeRO-Infinity, FlashAttention, FlashAttention-2)
Training Stability (Batch/Layer/Group/Instance Norm, Residual/Skip Connections)
Explainable AI (Guided Backprop, Grad-CAM, CAV, Influence functions, Representer points, TracIn)

🔹 RecSys

ML-based Collaborative Filtering (Factorization Machines)
DL-based Algorithms (Collaborative Deep Learning, Wide & Deep, DNNs for YouTube Recommendations, Product-based DNNs, NCF, Deep & Cross v1 and v2, DeepFM, Deep Interest Network, Behavior Sequence Transformer)

🔹 Graph ML

This post is licensed under CC BY 4.0 by the author.