8-6 Microsoft: Florence Model
Learning Objectives
Download the Florence model from the Hugging Face platform using a Python program, and use multimodal (text + image) input to ask Florence questions and get answers.

What is Florence?
Florence is a multimodal large model developed by Microsoft that can simultaneously understand text and images, acting like an AI assistant that can look at an image and then describe it in words.
What Can Florence Do?
1. Image caption generation: Input an image, and the AI automatically generates descriptive text.
2. Object detection: Input an image, and the AI automatically generates the coordinates of object bounding boxes.
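In the Transformers integration, each of these capabilities is selected by placing a special task tag in the prompt. A minimal sketch of that mapping (the tag strings come from the microsoft/Florence-2-base model card; the dictionary and helper name here are illustrative):

```python
# Task tags Florence-2 recognizes in the prompt (per the model card)
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
}

def build_prompt(task: str) -> str:
    """Look up the prompt tag that selects the given capability."""
    return TASK_PROMPTS[task]

print(build_prompt("object_detection"))  # <OD>
```

Swapping the tag (for example, using the caption tag instead of the detection tag) changes what the same model call returns, with no other code changes.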
How to Get Started?
1. Here is a sample program that lets Florence generate object detection bounding boxes and labels.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
# Set device and data type
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base",
    torch_dtype=torch_dtype,
    trust_remote_code=True,  # Trust remote code execution
    cache_dir="./model",  # Specify model storage path (default: ~/.cache/huggingface)
).to(device)
# Load the processor
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base",
    trust_remote_code=True,  # Trust remote code execution
    cache_dir="./model",  # Specify model storage path (default: ~/.cache/huggingface)
)
# Set the prompt - Object Detection tag
prompt = "<OD>"
# Load and open the test image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
# Preprocess text and image into model-compatible input format
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)
# Generate model response
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3,
)
# Decode the generated text
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Parse model output based on generated text and image size, then print results
parsed_answer = processor.post_process_generation(
    generated_text, task="<OD>", image_size=(image.width, image.height)
)
for bbox, label in zip(parsed_answer['<OD>']['bboxes'], parsed_answer['<OD>']['labels']):
    print(f"{label}: {bbox}")
2. After running it, the program prints one line per detected object: the label followed by its bounding-box pixel coordinates, in the form label: [x1, y1, x2, y2].
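Beyond printing coordinates, you may want to see the detections on the image itself. Here is a minimal sketch using Pillow's ImageDraw (the draw_boxes helper is illustrative, not part of Florence; it assumes the parsed_answer format shown above, with each bbox as [x1, y1, x2, y2] in pixel coordinates):

```python
from PIL import Image, ImageDraw

def draw_boxes(image, parsed, task="<OD>"):
    """Draw each bounding box and its label onto a copy of the image."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for bbox, label in zip(parsed[task]["bboxes"], parsed[task]["labels"]):
        x1, y1, x2, y2 = bbox
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1, max(y1 - 12, 0)), label, fill="red")
    return annotated

# Example with a dummy result in the same shape as parsed_answer above
img = Image.new("RGB", (200, 150), "white")
result = {"<OD>": {"bboxes": [[20, 20, 120, 100]], "labels": ["car"]}}
draw_boxes(img, result).save("annotated.jpg")
```

In the real script you would pass the image and parsed_answer from the detection run instead of the dummy values.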

Reference:
microsoft/Florence-2-base · Hugging Face