Sora research report
An Image is Worth 16x16 Words- Transformers for Image Recognition at Scale
Attention Is All You Need
Auto-Encoding Variational Bayes
Denoising Diffusion Probabilistic Models
Elucidating the Design Space of Diffusion-Based Generative Models
Generating Long Videos of Dynamic Scenes
High-Resolution Image Synthesis with Latent Diffusion Models
Imagen Video- High Definition Video Generation with Diffusion Models
Improved Denoising Diffusion Probabilistic Models
Masked Autoencoders Are Scalable Vision Learners
MoCoGAN- Decomposing Motion and Content for Video Generation
Scalable Diffusion Models with Transformers
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
SDEdit- Guided Image Synthesis and Editing with Stochastic Differential Equations
ViViT- A Video Vision Transformer
World Models
Adversarial Video Generation on Complex Datasets
Align your Latents- High-Resolution Video Synthesis with Latent Diffusion Models
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
Diffusion Models Beat GANs on Image Synthesis
Generating Videos with Scene Dynamics
Generative Pretraining From Pixels
Hierarchical Text-Conditional Image Generation with CLIP Latents
Improving Image Generation with Better Captions
Language Models are Few-Shot Learners
NÜWA- Visual Synthesis Pre-training for Neural visUal World creAtion
Patch n' Pack- NaViT, a Vision Transformer for any Aspect Ratio and Resolution
Photorealistic Video Generation with Diffusion Models
Recurrent Environment Simulators
Unsupervised Learning of Video Representations using LSTMs
VideoGPT- Video Generation using VQ-VAE and Transformers
Zero-Shot Text-to-Image Generation