Stability AI has released a preview of Stable Diffusion 3.0, its flagship next-generation artificial intelligence model for generating images from text descriptions. Stable Diffusion 3.0 will be available in several versions, built on neural networks ranging from 800 million to 8 billion parameters.
Image source: Stable Diffusion 3.0
Over the past year, Stability AI has continuously improved its technology, releasing several neural networks, each showing increasing sophistication and quality. The release of SDXL in July greatly improved the base Stable Diffusion model, and now the company is looking to go much further.
The new Stable Diffusion 3.0 model is designed to deliver improved image quality and better performance when generating images from complex prompts. The new neural network will also provide significantly better typography than previous versions of Stable Diffusion, enabling more accurate text within generated images. Typography has been a weak point of Stable Diffusion in the past, as it has been for many other AI image generators.
Stable Diffusion 3.0 is not just a new version of the previous Stability AI model: it is based on a new architecture. “Stable Diffusion 3 is a diffusion transformer, a new type of architecture similar to the one used in the recently introduced OpenAI Sora model,” Emad Mostaque, CEO of Stability AI, told VentureBeat. “It is a true successor to the original Stable Diffusion.”
Stability AI is experimenting with several image generation approaches. Earlier this month, the company released a preview version of Stable Cascade, which uses the Würstchen architecture to improve performance and accuracy. Stable Diffusion 3.0 takes a different approach, using diffusion transformers. “Stable Diffusion didn't have a transformer before,” Mostaque said.
Transformers underlie most of the modern neural networks that launched the artificial intelligence revolution, and they are widely used as the basis of text generation models, while image generation has largely remained the domain of diffusion models. The research paper introducing Diffusion Transformers (DiT) describes a new architecture for diffusion models that replaces the widely used U-Net backbone with a transformer operating on latent patches of the image. The DiT approach makes more efficient use of computing power and outperforms other approaches to diffusion image generation.
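To make the idea concrete, here is a minimal, hypothetical sketch in PyTorch of what a DiT-style block looks like: latents are split into patch tokens and processed by a transformer block rather than a U-Net. The `PatchEmbed` and `DiTBlock` names, dimensions, and the adaptive-LayerNorm conditioning are illustrative simplifications of the DiT paper, not Stable Diffusion 3's actual code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Turn a latent image (B, C, H, W) into a sequence of patch tokens."""
    def __init__(self, in_channels=4, patch_size=2, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

class DiTBlock(nn.Module):
    """One transformer block; timestep conditioning via adaptive LayerNorm."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Map the conditioning vector (e.g. a timestep embedding) to scales/shifts.
        self.ada = nn.Linear(dim, 4 * dim)

    def forward(self, tokens, cond):
        s1, b1, s2, b2 = self.ada(cond).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(tokens) * (1 + s1) + b1
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(tokens) * (1 + s2) + b2
        return tokens + self.mlp(h)

# Usage: a batch of 4-channel latents and a matching timestep embedding.
latents = torch.randn(2, 4, 32, 32)
t_emb = torch.randn(2, 256)
tokens = PatchEmbed()(latents)   # (2, 256 tokens, 256 dims)
out = DiTBlock()(tokens, t_emb)  # same shape as tokens
```

The key point of the design is that the denoiser becomes a plain token-sequence model, so it scales the same way transformers scale in language models.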
Another important innovation that Stable Diffusion 3.0 takes advantage of is flow matching. The Flow Matching research paper describes it as a new method for training Continuous Normalizing Flows (CNFs) to model complex data distributions. According to the researchers, using flow matching with optimal transport paths results in faster training, more efficient sampling, and better performance compared to diffusion paths.
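As a rough illustration of the idea (not Stability AI's training code), the sketch below shows the core of a conditional flow matching update with straight-line optimal transport paths, here with the minimum noise scale set to zero for brevity. The toy `model`, the `flow_matching_step` helper, and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

# Toy velocity network: input is 2-D point plus time t, output is a 2-D velocity.
model = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def flow_matching_step(x1):
    """One conditional flow matching update on a batch of data x1."""
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(x1.size(0), 1)             # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1                # point on the straight OT path
    target = x1 - x0                          # constant velocity along that path
    pred = model(torch.cat([xt, t], dim=-1))  # predicted velocity field
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage: fit the toy model to a 2-D Gaussian blob shifted to (3, 3).
for step in range(100):
    data = torch.randn(128, 2) + 3.0
    loss = flow_matching_step(data)
```

Because the regression target is a simple velocity along a straight path, training needs no simulation of the flow, which is where the claimed efficiency gains over diffusion-style paths come from.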
The improved typography in Stable Diffusion 3.0 is the result of several improvements that Stability AI has built into the new model. As Mostaque explained, high-quality generation of text within images became possible thanks to the diffusion transformer model and additional text encoders. With Stable Diffusion 3.0, it is now possible to generate full sentences within images in a coherent writing style.
While Stable Diffusion 3.0 is initially being demonstrated as a text-to-image AI technology, it will be the basis for much more. In recent months, Stability AI has also been building neural networks for generating 3D objects and video.
“We create open models that can be used anywhere and adapted to any need,” Mostaque said. “This is a series of models in different sizes that will serve as the basis for the development of our next generation of visual models, including video, 3D, and more.”