
From Stable Diffusion to Z-Image: The Evolution of Text-to-Image AI

A technical journey through the architectures that shaped modern AI image generation—from latent diffusion to flow matching, and the rise of Chinese open-source models.

ShineficY · 5 min read

Stable Diffusion 1.x: The Latent Space Revolution

Released in August 2022 by Stability AI alongside researchers from LMU Munich and Runway ML, Stable Diffusion fundamentally changed AI image generation by operating in compressed latent space rather than pixel space.

The architecture consists of three components:

  • Variational Autoencoder (VAE): Compresses 512×512×3 images into 64×64×4 latent representations—48× less memory than pixel-space processing
  • U-Net: An 860 million parameter conditional denoising network with ResNet blocks and cross-attention layers
  • CLIP Text Encoder: The frozen clip-vit-large-patch14 model (123M parameters) providing text conditioning

This design enabled consumer GPU operation where previous diffusion models required enterprise hardware. The model was trained on 512×512 images from a subset of the LAION-5B database.
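
The 48× figure follows directly from the tensor shapes; a quick sanity check in plain Python (no model weights involved):

```python
# Pixel-space tensor: a 512x512 RGB image
pixel_elements = 512 * 512 * 3        # 786,432 values

# Latent-space tensor produced by the VAE: 64x64 grid with 4 channels
latent_elements = 64 * 64 * 4         # 16,384 values

compression = pixel_elements / latent_elements
print(compression)  # 48.0 -> the "48x" figure above
```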

Sources: Hugging Face Diffusers Documentation, NVIDIA NeMo Framework

SDXL: Scaling the Architecture

Stable Diffusion XL, released July 26, 2023, introduced significant architectural changes documented in the paper "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis".

Key specifications:

  • Base Model: 3.5 billion parameters (3× larger UNet backbone)
  • Refiner Model: adds a post-hoc image-to-image enhancement stage; the combined base-plus-refiner ensemble totals 6.6 billion parameters
  • Dual Text Encoders: CLIP ViT-L (768-dimensional) + OpenCLIP ViT-bigG (1,280-dimensional), concatenated to 2,048-dimensional embeddings
  • Native Resolution: 1024×1024 with multi-aspect-ratio training

The UNet backbone was reorganized into three spatial resolution levels, discarding the deepest downsampling level. Transformer block allocation shifted toward coarser levels: none at the highest resolution, 2 at the intermediate level, and 10 at the lowest.
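
The dual-encoder conditioning can be illustrated with stand-in tensors (random values with the real shapes; the actual hidden states come from the CLIP and OpenCLIP checkpoints):

```python
import numpy as np

seq_len = 77  # CLIP's token context length

# Stand-ins for the two text encoders' per-token hidden states
clip_vit_l = np.random.randn(seq_len, 768)      # CLIP ViT-L
openclip_bigg = np.random.randn(seq_len, 1280)  # OpenCLIP ViT-bigG

# SDXL concatenates along the feature axis to form the conditioning
cond = np.concatenate([clip_vit_l, openclip_bigg], axis=-1)
print(cond.shape)  # (77, 2048)
```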

Sources: arXiv:2307.01952, Open Laboratory

Flux: Transformers Replace U-Nets

Black Forest Labs—founded by Robin Rombach, Andreas Blattmann, and Patrick Esser (former Stability AI researchers who created VQGAN, Latent Diffusion, and Stable Diffusion)—released Flux in August 2024.

The architecture marked a fundamental shift:

  • Parameters: 12 billion
  • Architecture: Hybrid multimodal and parallel diffusion transformer blocks (MM-DiT)
  • Training: Rectified flow matching instead of score-based diffusion
  • Text Encoders: Three encoders—two CLIP-based plus T5
  • Efficiency Features: Rotary positional embeddings, parallel attention layers

MM-DiT processes text and image tokens through separate learnable streams with two-way information flow, applying Query-Key Normalization before attention to stabilize training. Rectified flow uses optimal transport to establish deterministic straight-line paths between noise and data distributions, improving few-step sampling.
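
The "straight-line paths" of rectified flow are easy to state concretely: a sample at time t is a linear interpolation between noise and data, and the regression target for the velocity field is the constant difference between them. A toy NumPy sketch with tiny stand-in latents:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=4)   # data sample (stand-in for an image latent)
x0 = rng.normal(size=4)   # pure noise

def interpolate(t):
    """Rectified-flow path: x_t = (1 - t) * x0 + t * x1."""
    return (1 - t) * x0 + t * x1

# The velocity target is constant along the whole path
target_velocity = x1 - x0

# Finite-difference check: the path's slope matches the target everywhere
dt = 1e-6
slope = (interpolate(0.5 + dt) - interpolate(0.5)) / dt
print(np.allclose(slope, target_velocity, atol=1e-4))  # True
```

Because the path is a straight line, an ODE solver can take very large steps with little discretization error, which is why few-step sampling works so well.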

Model variants:

  • Flux.1 [schnell]: Latent adversarial diffusion distillation, 1-4 steps
  • Flux.1 [dev]: Guidance distillation, improved efficiency
  • Flux.1 [pro]: API-only, highest quality

Sources: Black Forest Labs Announcement, Hugging Face FLUX.1-dev, MarkTechPost Analysis

Z-Image: Single-Stream Efficiency

Alibaba's Tongyi Wanxiang team released Z-Image in November 2025, introducing the Scalable Single-Stream DiT (S3-DiT) architecture.

Deep dive: For a detailed comparison of Z-Image Turbo vs Flux performance, quality, and ecosystem, see our comprehensive Z-Image Turbo vs Flux comparison.

Technical specifications from the official GitHub and Hugging Face model card:

  • Parameters: 6 billion
  • Architecture: Text, visual semantic tokens, and image VAE tokens concatenated at sequence level as unified input stream
  • Inference: 8 NFEs (Number of Function Evaluations)
  • Hardware: Sub-second latency on H800; compatible with 16GB VRAM consumer devices
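
The NFE count simply tallies calls to the denoising network. A generic Euler sampler makes this concrete; the velocity model below is a toy stand-in, not Z-Image's actual S3-DiT:

```python
import numpy as np

nfe = {"count": 0}

def velocity_model(x, t):
    """Stand-in for the distilled generator; each call is one NFE."""
    nfe["count"] += 1
    return -x  # toy dynamics, just something to integrate

def euler_sample(x, steps=8):
    """Plain Euler integration from t=0 to t=1 in `steps` steps."""
    ts = np.linspace(0.0, 1.0, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * velocity_model(x, t0)
    return x

out = euler_sample(np.ones(4), steps=8)
print(nfe["count"])  # 8 -> "8 NFEs"
```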

Distillation Approach: Decoupled-DMD (Distribution Matching Distillation) separates two mechanisms:

  • CFG Augmentation (CA): Primary training engine
  • Distribution Matching (DM): Regularizer

DMDR methodology integrates Reinforcement Learning with DMD during post-training, where "RL unlocks the performance of DMD" while "DMD effectively regularizes RL."

Model variants:

  • Z-Image-Turbo: Distilled 8-step variant
  • Z-Image-Base: Non-distilled foundation for fine-tuning
  • Z-Image-Edit: Image-to-image editing

Sources: GitHub Tongyi-MAI/Z-Image, Hugging Face Model Card

Seedream: ByteDance's Architectural Innovations

ByteDance's Seed team released Seedream 3.0 with detailed technical documentation in their official report.

Cross-Modality RoPE: Text features are treated as 2D tensors with shape [1, L], enabling a unified 2D RoPE across modalities. This "improves modeling of inter-modal relationships and intra-modal relative positions."
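
One common way to realize a unified 2D RoPE is to rotate half of the channels by the row index and the other half by the column index, so a text token at sequence position j gets grid coordinates (0, j) while an image patch keeps its true (i, j). A NumPy sketch of that idea (an illustration of the general mechanism, not necessarily Seedream's exact implementation):

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D RoPE rotation of the last dim of x at position pos."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angle = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(angle) - x2 * np.sin(angle),
                           x1 * np.sin(angle) + x2 * np.cos(angle)], axis=-1)

def rope_2d(x, row, col):
    """2D RoPE: rotate half the channels by row, half by column."""
    half = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[..., :half], row),
                           rope_1d(x[..., half:], col)], axis=-1)

q = np.random.randn(16)               # a single query vector
text_q = rope_2d(q, row=0, col=5)     # text token: [1, L] grid -> row 0
image_q = rope_2d(q, row=3, col=5)    # image patch at grid position (3, 5)
# Row 0 means zero rotation on the row half, so text positions reduce
# to ordinary 1D RoPE over the column half.
```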

Training Innovations:

  • Defect-aware data expansion: 20%+ usable dataset increase through selective retention with latent space masking
  • Visuo-semantic sampling: Hierarchical clustering for visual diversity, TF-IDF weighting for long-tail text distributions
  • Multi-resolution hybrid training: Phase 1 at 256×256, Phase 2 mixed from 512×512 to 2048×2048
  • Flow matching loss replaced score matching; REPA feature alignment for faster convergence
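
The flow matching objective regresses a predicted velocity toward the straight-line velocity between noise and data; with a linear path, that target is simply x1 - x0 at every timestep. A toy training-step sketch in NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)

def fm_loss(v_pred, x0, x1):
    """Flow matching loss for a linear path: the target velocity is
    x1 - x0, identical at every timestep t."""
    return float(np.mean((v_pred - (x1 - x0)) ** 2))

x1 = rng.normal(size=8)      # data latent
x0 = rng.normal(size=8)      # noise
t = rng.uniform()
xt = (1 - t) * x0 + t * x1   # the network would be evaluated at (xt, t)

# A perfect velocity predictor drives the loss to zero
print(fm_loss(x1 - x0, x0, x1))  # 0.0
```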

Distillation: An important-timestep sampling network predicts the optimal sampling distribution for each sample, completing distillation within 64 GPU-days. Result: 1K-resolution images in ~3 seconds end-to-end.

Seedream 4.0 unified image generation and editing into a single architecture, supporting up to 4K resolution with faster inference.

Sources: Seedream 3.0 Technical Report, Seedream 4.0 Announcement

Architectural Evolution Summary

Model            Parameters                  Architecture   Key Innovation
SD 1.5           860M UNet + 123M CLIP       Latent U-Net   Consumer GPU operation
SDXL             3.5B base, 6.6B ensemble    Scaled U-Net   Dual text encoders, 1024² native
Flux             12B                         MM-DiT         Rectified flow, transformer backbone
Z-Image-Turbo    6B                          S3-DiT         Single-stream, 8-step distillation
Seedream 3.0     (not stated)                DiT            Cross-Modality RoPE, 64 GPU-day distillation

The trajectory: latent space processing remained constant while backbones evolved from U-Nets to transformers. Training shifted from score matching to flow matching. Distillation techniques enabled dramatic step reduction—from 50+ steps to under 10.


Experience Z-Image-Turbo's speed advantage at Z-Image.vip—generate images in seconds on consumer hardware.

