8-6 Microsoft: Florence Model
Learning Objectives
Download the Florence model from the Hugging Face platform using a Python program, and use multimodal (text + image) input to ask Florence questions and get answers.

What is Florence?
Florence is a multimodal large model developed by Microsoft that can simultaneously understand text and images, acting like an AI assistant that can look at an image and then describe it in words.
What Can Florence Do?
1. Image caption generation: Input an image, and the AI automatically generates descriptive text.
2. Object detection: Input an image, and the AI automatically generates the coordinates of object bounding boxes.
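In the Transformers integration, each of these capabilities is selected by placing a special task tag in the prompt. A minimal sketch of that mapping (the tag strings come from the microsoft/Florence-2-base model card; the dictionary and helper name here are illustrative):

```python
# Task tags Florence-2 recognizes in the prompt (per the model card)
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
}

def build_prompt(task: str) -> str:
    """Look up the prompt tag that selects the given capability."""
    return TASK_PROMPTS[task]

print(build_prompt("object_detection"))  # <OD>
```

Swapping the tag (for example, using the caption tag instead of the detection tag) changes what the same model call returns, with no other code changes.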
How to Get Started?
1. Here is a sample program that lets Florence generate object detection bounding boxes and labels.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
# Set device and data type
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base",
    torch_dtype=torch_dtype,
    trust_remote_code=True,  # Trust remote code execution
    cache_dir="./model",  # Specify model storage path (default: ~/.cache/huggingface)
).to(device)
# Load the processor
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base",
    trust_remote_code=True,  # Trust remote code execution
    cache_dir="./model",  # Specify model storage path (default: ~/.cache/huggingface)
)
# Set the prompt - Object Detection tag
prompt = "<OD>"
# Load and open the test image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
# Preprocess text and image into model-compatible input format
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)
# Generate model response
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3,
)
# Decode the generated text
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Parse model output based on generated text and image size, then print results
parsed_answer = processor.post_process_generation(
    generated_text, task="<OD>", image_size=(image.width, image.height)
)
for bbox, label in zip(parsed_answer['<OD>']['bboxes'], parsed_answer['<OD>']['labels']):
    print(f"{label}: {bbox}")
2. After running it, the program prints one line per detected object: the label followed by its bounding-box pixel coordinates, in the form label: [x1, y1, x2, y2].
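Beyond printing coordinates, you may want to see the detections on the image itself. Here is a minimal sketch using Pillow's ImageDraw (the draw_boxes helper is illustrative, not part of Florence; it assumes the parsed_answer format shown above, with each bbox as [x1, y1, x2, y2] in pixel coordinates):

```python
from PIL import Image, ImageDraw

def draw_boxes(image, parsed, task="<OD>"):
    """Draw each bounding box and its label onto a copy of the image."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for bbox, label in zip(parsed[task]["bboxes"], parsed[task]["labels"]):
        x1, y1, x2, y2 = bbox
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1, max(y1 - 12, 0)), label, fill="red")
    return annotated

# Example with a dummy result in the same shape as parsed_answer above
img = Image.new("RGB", (200, 150), "white")
result = {"<OD>": {"bboxes": [[20, 20, 120, 100]], "labels": ["car"]}}
draw_boxes(img, result).save("annotated.jpg")
```

In the real script you would pass the image and parsed_answer from the detection run instead of the dummy values.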

Reference:
microsoft/Florence-2-base · Hugging Face