8-8 Google: Gemma Model

Learning Objectives

Use a Python program to download the Gemma model from the Hugging Face platform, then ask Gemma a question with multimodal input (text and an image) and retrieve its answer.

What is Gemma?

Gemma is a multimodal large model developed by Google that can understand text and images together. Like an AI assistant, it can interpret the content of an image and describe it in words.

What Gemma Can Do

1. Image captioning: given an input image, the model automatically generates a description of it.
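
As a rough illustration, a captioning request can be phrased in the same chat-message format that the sample code later in this section passes to the pipeline. The helper name `build_caption_messages` and the default prompt wording are assumptions made for this sketch; they are not part of the Gemma or Transformers API.

```python
def build_caption_messages(image_url, prompt="Describe this image in one sentence."):
    """Return a chat-style message list asking the model to caption one image.

    Hypothetical helper: the name and the default prompt are illustrative only.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},  # the image to describe
                {"type": "text", "text": prompt},     # the captioning instruction
            ],
        }
    ]


# Build a request for the same sample image used later in this section.
messages = build_caption_messages(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"
)
print(messages[0]["content"][1]["text"])
```

The resulting list could be passed as the `text` argument of the pipeline call in place of the question-answering messages.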

Getting Started

1. Go to google/gemma-3-4b-it (https://huggingface.co/google/gemma-3-4b-it), submit the information required to request access, and wait for approval.

2. gemma-3-4b-it requires a specific version of the Transformers library. Install the required version first.

pip install transformers==4.51.3 --index-url https://pypi.jetson-ai-lab.io/jp6/cu126
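
After installing, it may help to confirm that the pinned version is the one Python actually sees. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `is_required_version` are hypothetical and exist only for this example.

```python
from importlib import metadata

REQUIRED_VERSION = "4.51.3"  # the version pinned by the install command above


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_required_version(found, required=REQUIRED_VERSION):
    """Return True only when the found version string exactly equals the pinned one."""
    return found == required


found = installed_version("transformers")
if is_required_version(found):
    print("transformers", found, "matches the pinned version")
else:
    print("transformers version is", found, "but", REQUIRED_VERSION, "is expected")
```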

3. With the following sample code, Gemma can generate an answer based on an image and a question.

(1) Replace the token with the Hugging Face token you generated earlier.

import torch
from transformers import pipeline
from huggingface_hub import login

# Log in to Hugging Face
login(token="hf_XXXXXXXXXXXXXXXX")  # Replace with your own Hugging Face token

# Create an image-text-to-text pipeline
pipe = pipeline(
    "image-text-to-text",  # Specify the task type
    model="google/gemma-3-4b-it",  # Specify the model ID
    device="cuda",  # Select the device (GPU)
    torch_dtype=torch.bfloat16,  # Load the weights in bfloat16
    model_kwargs={"cache_dir": "./model"},  # Specify the model cache directory (default is ~/.cache/huggingface)
)

# Define the conversation messages; each has a role and a list of content parts
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]

# Generate text using the pipeline and print the last message (the model's reply)
output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
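
The sample above writes the token directly into the source file. One possible alternative, sketched here, is to read it from an environment variable so that it does not end up in shared code. The helper name `get_hf_token` and the choice of `HF_TOKEN` as the variable name are assumptions for this example.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from the environment instead of the source file.

    Hypothetical helper: raises RuntimeError when the variable is unset or empty.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your Hugging Face token")
    return token


# Usage with the sample above (commented out so this sketch runs without network access):
# from huggingface_hub import login
# login(token=get_hf_token())
```

With this in place, the token would be set once in the shell before the script is started, and the `login` line no longer contains a secret.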

4. Running the script prints the model's answer to the question.
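
To make the final `print` line of the sample easier to follow, here is a small sketch of how the reply is picked out of the result. The structure is inferred from the indexing used in the sample code (`output[0]["generated_text"][-1]["content"]`); the helper name `last_reply` and the stand-in data are hypothetical, and the reply text is a placeholder rather than real model output.

```python
def last_reply(output):
    """Return the content of the final message in the first pipeline result."""
    conversation = output[0]["generated_text"]  # the full message list, prompt included
    return conversation[-1]["content"]          # the last message is the model's reply


# Hand-written stand-in for a pipeline result; the reply is a placeholder string.
fake_output = [
    {
        "generated_text": [
            {"role": "user", "content": [{"type": "text", "text": "What animal is on the candy?"}]},
            {"role": "assistant", "content": "<model reply goes here>"},
        ]
    }
]
print(last_reply(fake_output))
```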

References:

google/gemma-3-4b-it · Hugging Face