Contrastive Language-Image Pre-training & Latent Diffusion
David Valecký

CLIP
●Learning method driven by natural language supervision
●Trained on a dataset of 400 million (image, text) pairs
●Learns to perform a wide set of tasks during pre-training
○OCR
○geo-localization
○action recognition

Comparison with other methods
●Benchmarked on 30 existing datasets
●Strong at zero-shot classification
●Competitive with or better than publicly available ImageNet-trained models

Dual-input model architecture
●Two types of inputs: an image and a text prompt
●Vision encoder for images
●Language encoder for text
●Both encoders project into a joint multi-modal embedding space (a contrastive-loss code sketch follows the references)

Image Encoder
●Two variants in CLIP: a modified ResNet (CNN-based) or a Vision Transformer (ViT)
●The ViT adapts the Transformer, an architecture originally designed for natural language processing tasks, to images
●Employs self-attention mechanisms
○Each token gets 3 vectors - Query (Q), Key (K), Value (V)
○Self-attention can be computed in parallel
●Outputs high-dimensional vector representations - the CLIP image embeddings
○A space where images and text are jointly represented

Text encoder
●Transformer-based architecture
●No pre-trained weights - trained from scratch
●A linear projection maps each encoder's representation into the multi-modal embedding space
●Text processing is simplified because many texts in CLIP's pre-training dataset are only a single sentence

Performance
●Largest ResNet model - RN50x64
○18 days to train on 592 V100 GPUs
●Largest Vision Transformer - ViT-L/14
○12 days to train on 256 V100 GPUs

Zero-Shot Transfer
●Performing tasks the model was never explicitly trained on (a zero-shot classification sketch follows the references)
●Standard image classification datasets are evaluated with the generically pre-trained model
○At the time of publication, only the second study of zero-shot transfer to existing image classification datasets

Applications and Use Cases
●Image classification
●Text-to-image retrieval
●Ability to generalize to a wide array of tasks

Latent Diffusion
●Autoencoder (VAE)
●U-Net
●Text encoder - pre-trained CLIP

Latent diffusion model
(architecture overview figure)

Autoencoder (VAE)
●Transforms a high-dimensional image (512x512x3) into a compact latent representation (64x64x4)
●These "latents" serve as the input to the U-Net
●The conversion reduces memory requirements roughly 48x compared to pixel-space diffusion models
●The decoder reconstructs the original image from the latent representation
●Only the decoder is needed at inference to convert denoised latents into actual images
●The latent space is regularized toward an isotropic Gaussian distribution

U-Net
●Predicts the noise present in the noisy latent representations
●Subtracting the predicted noise from the noisy latents yields the denoised latents (a noise-prediction training sketch follows the references)
●Structure:
○Encoder (12 blocks)
○Middle block
○Skip-connected decoder (12 blocks)
○25 blocks in total
●8 blocks are down-sampling or up-sampling convolution layers
●17 blocks each contain four ResNet layers and two Vision Transformers (ViTs)

Text Encoder
●Stable Diffusion uses a pre-trained CLIP text encoder
●The same latent space can be used to train multiple generative models
●Downstream applications include single-image CLIP-guided synthesis

References
●https://arxiv.org/pdf/2112.10752.pdf
●https://cdn.openai.com/papers/dall-e-2.pdf
●https://arxiv.org/pdf/2103.00020.pdf
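
Appendix: CLIP contrastive objective (sketch)
A minimal, self-contained PyTorch sketch of the symmetric contrastive loss behind CLIP's joint image-text embedding space. The two linear projections stand in for the real vision and language encoders, and all dimensions and the temperature value are illustrative assumptions, not the paper's exact settings.

import torch
import torch.nn.functional as F

# Toy stand-ins for the vision and language encoders: random features
# projected into a shared multi-modal embedding space (hypothetical sizes).
batch, img_dim, txt_dim, embed_dim = 8, 2048, 512, 256
image_features = torch.randn(batch, img_dim)
text_features = torch.randn(batch, txt_dim)
img_proj = torch.nn.Linear(img_dim, embed_dim)
txt_proj = torch.nn.Linear(txt_dim, embed_dim)

# L2-normalize so the dot product is cosine similarity.
img_emb = F.normalize(img_proj(image_features), dim=-1)
txt_emb = F.normalize(txt_proj(text_features), dim=-1)

# Pairwise similarities scaled by a temperature (illustrative constant here).
logit_scale = torch.tensor(100.0)
logits_per_image = logit_scale * img_emb @ txt_emb.t()
logits_per_text = logits_per_image.t()

# Matching (image, text) pairs lie on the diagonal; the symmetric
# cross-entropy pulls them together and pushes mismatched pairs apart.
labels = torch.arange(batch)
loss = (F.cross_entropy(logits_per_image, labels)
        + F.cross_entropy(logits_per_text, labels)) / 2
print(loss.item())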
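
Appendix: zero-shot classification with CLIP (sketch)
A sketch of zero-shot image classification using the released openai/CLIP package, following its published usage; the image path and the candidate class prompts are placeholders.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate class prompts.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    # Cosine-similarity logits between the image and every prompt.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(dict(zip(prompts, probs[0].tolist())))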
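
Appendix: latent diffusion noise-prediction step (sketch)
A toy, self-contained sketch of the noise-prediction training objective on VAE latents. The single Conv2d layer is a stand-in for the real U-Net, which also conditions on a timestep embedding and CLIP text embeddings via cross-attention, and the noise schedule here is a simplified assumption.

import torch
import torch.nn.functional as F

# A 512x512x3 image compresses to a 64x64x4 latent: a 48x reduction.
print((512 * 512 * 3) / (64 * 64 * 4))  # -> 48.0

# Stand-in for the U-Net over 4-channel latents (illustrative only).
unet = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)

# One training step of the noise-prediction (denoising) objective.
latents = torch.randn(2, 4, 64, 64)        # latents from the VAE encoder
noise = torch.randn_like(latents)          # target noise to be predicted
t = torch.rand(2, 1, 1, 1)                 # simplified noise level in [0, 1]
noisy_latents = (1 - t).sqrt() * latents + t.sqrt() * noise

pred_noise = unet(noisy_latents)           # model predicts the added noise
loss = F.mse_loss(pred_noise, noise)       # subtracting it would recover the latents
loss.backward()
print(loss.item())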