ML | stable-diffusion

stable-diffusion-xl-base-1.0

The stable-diffusion-xl-base-1.0 is a text-to-image generative model developed by Stability AI.

  • Inputs
    • Text prompt: A description of the desired image, such as "a beautiful sunset over a mountain landscape".
  • Outputs
    • Generated image: An image that corresponds to the input text prompt, generated using the model's diffusion-based text-to-image capabilities.
micromamba env create -n SD
micromamba activate SD
micromamba install python==3.10
pip install diffusers --upgrade
pip install invisible_watermark transformers accelerate safetensors gradio
micromamba install git-lfs
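
A quick sanity check that the new environment sees the GPU (a minimal sketch; PyTorch is pulled in as a dependency of the packages above):

import torch
import diffusers

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers", diffusers.__version__)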
  • When using torch >= 2.0, you can improve the inference speed by 20-30% with torch.compile. Simply wrap the UNet with torch.compile before running the pipeline (see the sketch after this list):
    • pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
  • If you are limited by GPU VRAM, you can enable CPU offloading by calling pipe.enable_model_cpu_offload() instead of .to("cuda"):
    • - pipe.to("cuda")
    • + pipe.enable_model_cpu_offload()
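
A minimal sketch of how these two options slot into the pipeline setup (pick one of the two, not both; the model ID and settings are the same ones used later in this post):

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, use_safetensors=True, variant="fp16",
)

# Option A: enough VRAM -- keep the pipeline on the GPU and compile the UNet (torch >= 2.0).
pipe.to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# Option B: limited VRAM -- offload sub-models to the CPU and move them to the GPU on demand.
# Use this INSTEAD of pipe.to("cuda").
# pipe.enable_model_cpu_offload()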

The models are saved at:

6.7G	/home/wpsze/.cache/huggingface/hub/models--stabilityai--stable-diffusion-xl-base-1.0
4.4G /home/wpsze/.cache/huggingface/hub/models--stabilityai--stable-diffusion-xl-refiner-1.0

base model

To just use the base model, you can run:

#!/bin/python

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

prompt = "An astronaut riding a green horse"

image = pipe(prompt=prompt).images[0]

image.save("astronaut_rides_horse.png") # 1024 x 1024

GPU: ~ 10GB
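
To make runs repeatable, a seeded generator and explicit sampling parameters can be passed to the same pipe; a minimal sketch (the step count and CFG scale are the illustrative values used in the Gradio app below):

import torch

g = torch.Generator(device="cuda").manual_seed(42)  # same seed + prompt -> same image

image = pipe(
    prompt="An astronaut riding a green horse",
    num_inference_steps=40,
    guidance_scale=7.5,
    generator=g,
).images[0]
image.save("astronaut_rides_horse_seed42.png")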

base + refiner pipeline

To use the whole base + refiner pipeline as an ensemble of experts, you can run:

from diffusers import DiffusionPipeline
import torch

# load both base & refiner
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
base.to("cuda")
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")

# Define how many steps and what % of steps are run on each expert (80/20) here
n_steps = 40
high_noise_frac = 0.8

prompt = "A majestic lion jumping from a big stone at night"

# run both experts
image = base(
    prompt=prompt,
    num_inference_steps=n_steps,
    denoising_end=high_noise_frac,
    output_type="latent",
).images
image = refiner(
    prompt=prompt,
    num_inference_steps=n_steps,
    denoising_start=high_noise_frac,
    image=image,
).images[0]

image.save("refiner.png")

GPU: ~ 14GB
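
The refiner can also be used on its own as an image-to-image model, which is what the "Image-to-Image" tab of the Gradio app below does; a minimal sketch, assuming the refiner pipeline loaded above and a local input.png (the file name and strength are illustrative):

from diffusers.utils import load_image

init_image = load_image("input.png")  # hypothetical input image

image = refiner(
    prompt="A majestic lion jumping from a big stone at night",
    image=init_image,
    strength=0.3,             # how far to move away from the input image
    num_inference_steps=40,
).images[0]
image.save("refined_img2img.png")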

Demo

  • Left: Base model, Right: Base+refiner model

    • prompt = "An astronaut is doing Numerical Weather Prediction simulation on Space."

Where does Hugging Face's transformers save models?

The cache location has changed again, and is now ~/.cache/huggingface/hub/

$ du -sh ~/.cache/huggingface/hub/*
6.3G /home/wpsze/.cache/huggingface/hub/models--deepseek-ai--deepseek-vl2-tiny
414M /home/wpsze/.cache/huggingface/hub/models--dslim--bert-base-NER
961M /home/wpsze/.cache/huggingface/hub/models--ecmwf--aifs-single
1.7G /home/wpsze/.cache/huggingface/hub/models--microsoft--DialoGPT-medium
172K /home/wpsze/.cache/huggingface/hub/models--microsoft--Florence-2-large
6.7G /home/wpsze/.cache/huggingface/hub/models--stabilityai--stable-diffusion-xl-base-1.0
4.4G /home/wpsze/.cache/huggingface/hub/models--stabilityai--stable-diffusion-xl-refiner-1.0
4.0K /home/wpsze/.cache/huggingface/hub/version.txt
4.0K /home/wpsze/.cache/huggingface/hub/version_diffusers_cache.txt
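
The same cache can be inspected from Python, and relocated by setting an environment variable before any model is loaded; a minimal sketch (the target directory is just an example):

from huggingface_hub import scan_cache_dir

info = scan_cache_dir()
for repo in info.repos:
    print(f"{repo.size_on_disk / 1e9:5.1f} GB  {repo.repo_id}")

# To move the cache elsewhere, set HF_HOME (or HF_HUB_CACHE) before loading models, e.g.
#   export HF_HOME=/data/huggingface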

web page

First, clone AUTOMATIC1111's TorchDeepDanbooru (used by the "Interrogate DeepBooru" button below) and download its weights:

git clone https://github.com/AUTOMATIC1111/TorchDeepDanbooru.git
cd TorchDeepDanbooru
wget https://github.com/AUTOMATIC1111/TorchDeepDanbooru/releases/download/v1/model-resnet_custom_v3.pt

and then create web.py:

#!/bin/python

import numpy as np
import gradio as gr
from diffusers import DiffusionPipeline, StableDiffusionXLImg2ImgPipeline
import torch
import tqdm
from datetime import datetime

from TorchDeepDanbooru import deep_danbooru_model

MODEL_BASE = "stabilityai/stable-diffusion-xl-base-1.0"
MODEL_REFINER = "stabilityai/stable-diffusion-xl-refiner-1.0"

print("Loading model", MODEL_BASE)
base = DiffusionPipeline.from_pretrained(MODEL_BASE, torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
base.to("cuda")
print("Loading model", MODEL_REFINER)
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(MODEL_REFINER, text_encoder_2=base.text_encoder_2, vae=base.vae, torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
refiner.to("cuda")

# Define how many steps and what % of steps are run on each expert (80/20) here.
# base = high-noise expert, refiner = low-noise expert;
# the base model runs default_n_steps * default_high_noise_frac steps.
default_n_steps = 40
default_high_noise_frac = 0.8
default_num_images = 2


def predict_txt2img(prompt, negative_prompt, model_selected, num_images, n_steps, high_noise_frac, cfg_scale):
    # run both experts
    start = datetime.now()

    num_images = int(num_images)
    n_steps = int(n_steps)
    prompt, negative_prompt = [prompt] * num_images, [negative_prompt] * num_images
    images_list = []
    high_noise_frac = float(high_noise_frac)
    cfg_scale = float(cfg_scale)

    g = torch.Generator(device="cuda")

    if model_selected == "sd-xl-base-1.0" or model_selected == "sd-xl-base-refiner-1.0":
        images = base(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=n_steps,
            denoising_end=high_noise_frac,
            guidance_scale=cfg_scale,
            output_type="latent" if model_selected == "sd-xl-base-refiner-1.0" else "pil",
            generator=g,
        ).images

    if model_selected == "sd-xl-base-refiner-1.0":
        images = refiner(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=n_steps,
            denoising_start=high_noise_frac,
            guidance_scale=cfg_scale,
            image=images,
        ).images

    for image in images:
        images_list.append(image)

    torch.cuda.empty_cache()

    cost_time = (datetime.now() - start).seconds
    print(f"cost time={cost_time},{datetime.now()}")

    return images_list


def predict_img2img(prompt, negative_prompt, init_image, model_selected, n_steps, high_noise_frac, cfg_scale, strength):
    start = datetime.now()

    n_steps = int(n_steps)
    high_noise_frac = float(high_noise_frac)
    cfg_scale = float(cfg_scale)
    strength = float(strength)

    if model_selected == "sd-xl-refiner-1.0":
        images = refiner(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=n_steps,
            denoising_start=high_noise_frac,
            guidance_scale=cfg_scale,
            strength=strength,
            image=init_image,
            # target_size=(1024, 1024)
        ).images

    torch.cuda.empty_cache()

    cost_time = (datetime.now() - start).seconds
    print(f"cost time={cost_time},{datetime.now()}")

    return images[0]


def interrogate_deepbooru(pil_image, threshold=0.5):
    model = deep_danbooru_model.DeepDanbooruModel()
    model.load_state_dict(torch.load('/home/wpsze/ML/huggingface_hub/stable-diffusion-xl-base-1.0/TorchDeepDanbooru/model-resnet_custom_v3.pt'))
    model.eval().half().cuda()

    pic = pil_image.convert("RGB").resize((512, 512))
    a = np.expand_dims(np.array(pic, dtype=np.float32), 0) / 255

    with torch.no_grad(), torch.autocast("cuda"):
        x = torch.from_numpy(a).cuda()

        # first run
        y = model(x)[0].detach().cpu().numpy()

        # measure performance
        for n in tqdm.tqdm(range(10)):
            model(x)

    result_tags_out = []
    for i, p in enumerate(y):
        if p >= threshold:
            result_tags_out.append(model.tags[i])
            print(model.tags[i], p)

    prompt = ', '.join(result_tags_out).replace('_', ' ').replace(':', ' ')
    print(f"prompt={prompt}")
    return prompt


def clear_txt2img(prompt, negative_prompt):
    prompt = ""
    negative_prompt = ""
    return prompt, negative_prompt


def clear_img2img(prompt, negative_prompt, image_input, image_output):
    prompt = ""
    negative_prompt = ""
    image_input = None
    image_output = None
    return prompt, negative_prompt, image_input, image_output


with gr.Blocks(title="Stable Diffusion", theme=gr.themes.Default(primary_hue=gr.themes.colors.blue)) as demo:

    with gr.Tab("Text-to-Image"):
        # gr.Markdown("Stable Diffusion XL Base + Refiner.")
        model_selected = gr.Radio(["sd-xl-base-refiner-1.0", "sd-xl-base-1.0"], show_label=False, value="sd-xl-base-refiner-1.0")
        with gr.Row():
            with gr.Column(scale=4):
                prompt = gr.Textbox(label="Prompt", lines=3)
                negative_prompt = gr.Textbox(label="Negative Prompt", lines=1)
                with gr.Row():
                    with gr.Column():
                        n_steps = gr.Slider(20, 60, value=default_n_steps, label="Steps", info="Choose between 20 and 60")
                        high_noise_frac = gr.Slider(0, 1, value=0.8, label="Denoising Start at")
                    with gr.Column():
                        num_images = gr.Slider(1, 3, value=default_num_images, label="Generated Images", info="Choose between 1 and 3")  # num_images=4 overflows GPU memory on an A10
                        cfg_scale = gr.Slider(1, 20, value=7.5, label="CFG Scale")
            with gr.Column(scale=1):
                with gr.Row():
                    txt2img_button = gr.Button("Generate", size="sm")
                    clear_button = gr.Button("Clear", size="sm")
        gallery = gr.Gallery(label="Generated images", show_label=False, elem_id="gallery", columns=int(num_images.value), height=800, object_fit='fill')

        txt2img_button.click(predict_txt2img, inputs=[prompt, negative_prompt, model_selected, num_images, n_steps, high_noise_frac, cfg_scale], outputs=[gallery])
        clear_button.click(clear_txt2img, inputs=[prompt, negative_prompt], outputs=[prompt, negative_prompt])

    with gr.Tab("Image-to-Image"):
        model_selected = gr.Radio(["sd-xl-refiner-1.0"], value="sd-xl-refiner-1.0", show_label=False)
        with gr.Row():
            with gr.Column(scale=1):
                prompt = gr.Textbox(label="Prompt", lines=2)
            with gr.Column(scale=1):
                negative_prompt = gr.Textbox(label="Negative Prompt", lines=2)
        with gr.Row():
            with gr.Column(scale=3):
                image_input = gr.Image(type="pil", height=512)
            with gr.Column(scale=3):
                image_output = gr.Image(height=512)
            with gr.Column(scale=1):
                img2img_deepbooru = gr.Button("Interrogate DeepBooru", size="sm")
                # img2img_clip = gr.Button("Interrogate CLIP", size="sm")
                img2img_button = gr.Button("Generate", size="lg")
                clear_button = gr.Button("Clear", size="sm")

        n_steps = gr.Slider(20, 60, value=40, step=10, label="Steps")
        high_noise_frac = gr.Slider(0, 1, value=0.8, step=0.1, label="Denoising Start at")
        cfg_scale = gr.Slider(1, 20, value=7.5, step=0.1, label="CFG Scale")
        strength = gr.Slider(0, 1, value=0.3, step=0.1, label="Denoising strength")

        img2img_deepbooru.click(fn=interrogate_deepbooru, inputs=image_input, outputs=[prompt])
        img2img_button.click(predict_img2img, inputs=[prompt, negative_prompt, image_input, model_selected, n_steps, high_noise_frac, cfg_scale, strength], outputs=image_output)
        clear_button.click(clear_img2img, inputs=[prompt, negative_prompt, image_input], outputs=[prompt, negative_prompt, image_input, image_output])

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=8000)
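
Run it with python web.py and open http://<server>:8000 in a browser; since demo.launch binds to 0.0.0.0:8000, the page is also reachable from other machines on the network.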

Access to model is restricted (gated model, access not yet granted)

Cannot access gated repo for url https://huggingface.co/stabilityai/stable-diffusion-3.5-large/resolve/main/model_index.json.
Access to model stabilityai/stable-diffusion-3.5-large is restricted. You must have access to it and be authenticated to access it. Please log in.

Hugging Face -- sign up -- Access Tokens -- generate a token

You need to sign in to Hugging Face, accept the license agreement on the SD 3.5 model page, go to Settings in your HF profile, find your access token, and enter it (for example in the SD.Next config file).
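
As an alternative to the CLI login shown below, the token can also be set from Python or via an environment variable; a minimal sketch (the token string is a placeholder):

from huggingface_hub import login

login(token="hf_xxx")  # or: export HF_TOKEN=hf_xxx before starting Python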

$ huggingface-cli login

[Hugging Face ASCII art banner]

To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `sd` has been saved to /home/wpsze/.cache/huggingface/stored_tokens
Your token has been saved to /home/wpsze/.cache/huggingface/token
Login successful.
The current active token is: `sd`

Then, downloading still fails, because the fine-grained token does not yet have access to public gated repositories enabled:

403 Forbidden: Please enable access to public gated repositories in your fine-grained token settings to view this repository..
Cannot access content at: https://huggingface.co/stabilityai/stable-diffusion-3.5-large/resolve/main/model_index.json.
Make sure your token has the correct permissions.

References

  1. Getting started quickly with Stability AI's SDXL 1.0 release using Docker
  2. Running Stable Diffusion models on MacBooks with M1 and M2 chips
  3. GPU: Building an SDXL inference application with Diffusers and Gradio
  4. A Muggle walks into the magic world I: Gradio
