8-6 Microsoft: Florence Model

Learning Objectives

Use a Python program to download the Florence model from the Hugging Face platform, ask Florence questions with multimodal (text + image) input, and get answers.

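What Is Florence?

Florence is a multimodal large model developed by Microsoft. It understands text and images together, so it acts like an AI assistant that can look at an image and describe it in words.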

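What Florence Can Do

1. Image captioning: given an input image, the model automatically generates a description of it.

2. Object detection: given an input image, the model automatically generates the bounding box coordinates of the objects in it.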
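
Each capability is selected by a task tag passed to the model as the text prompt. The sketch below shows one way to keep that mapping in a single place. The `<OD>` and `<CAPTION>` tags are the ones listed on the Florence-2 model card; the dictionary `TASK_PROMPTS` and the helper `task_prompt` are hypothetical names introduced here for illustration and are not part of any library.

```python
# Hypothetical helper: maps the two capabilities above to Florence-2 task tags.
# The tag strings follow the Florence-2 model card; the names are illustrative.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
}


def task_prompt(capability: str) -> str:
    """Return the task tag for a capability, or raise ValueError if unknown."""
    try:
        return TASK_PROMPTS[capability]
    except KeyError:
        raise ValueError(f"unsupported capability: {capability!r}") from None


print(task_prompt("object_detection"))
```

The returned string would then be used both as the `text` argument to the processor and as the `task` argument when post-processing the output.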

Getting Started

1. The following sample program uses Florence to generate object detection bounding boxes and labels.

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Set device and data type
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base",
    torch_dtype=torch_dtype,
    trust_remote_code=True,  # Allow the model's custom code from the Hub to run
    cache_dir="./model",  # Model storage path (default: ~/.cache/huggingface/hub)
).to(device)

# Load the processor
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base",
    trust_remote_code=True,  # Allow the processor's custom code from the Hub to run
    cache_dir="./model",  # Model storage path (default: ~/.cache/huggingface/hub)
)

# Set the prompt: the object detection task tag
prompt = "<OD>"

# Download and open the test image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess text and image into the input format the model expects
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)

# Generate the model response
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3,
)

# Decode the generated text, keeping special tokens because they encode the box locations
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Parse the output using the task tag and the image size, then print the results
parsed_answer = processor.post_process_generation(
    generated_text, task="<OD>", image_size=(image.width, image.height)
)

for bbox, label in zip(parsed_answer["<OD>"]["bboxes"], parsed_answer["<OD>"]["labels"]):
    print(f"{label}: {bbox}")

2. When you run the program, it prints one line per detected object: the label followed by its bounding box coordinates.
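
The parsed result can be processed further once it is available. The sketch below assumes the object detection result is a dictionary keyed by the task tag, with parallel `bboxes` and `labels` lists, and that each box is `[x1, y1, x2, y2]` in pixels, as shown on the Florence-2 model card. The function name `summarize_detections` and the sample values are made up for illustration; they are not real model output.

```python
def summarize_detections(parsed_answer, task="<OD>"):
    """Turn a parsed detection result into one record per detected object."""
    result = parsed_answer[task]
    rows = []
    # Assumption: each box is [x1, y1, x2, y2] in pixel coordinates.
    for (x1, y1, x2, y2), label in zip(result["bboxes"], result["labels"]):
        rows.append(
            {
                "label": label,
                "width": x2 - x1,
                "height": y2 - y1,
                "center": ((x1 + x2) / 2, (y1 + y2) / 2),
            }
        )
    return rows


# Made-up sample in the assumed format, NOT actual Florence output.
sample = {
    "<OD>": {
        "bboxes": [[10.0, 20.0, 110.0, 70.0], [0.0, 0.0, 50.0, 50.0]],
        "labels": ["car", "wheel"],
    }
}

for row in summarize_detections(sample):
    print(f"{row['label']}: {row['width']} x {row['height']} px, center {row['center']}")
```

In the sample program, `parsed_answer` could be passed to this function in place of `sample`.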

References:

microsoft/Florence-2-base · Hugging Face