8-8 Google: Gemma model

Learning Objectives

Use a Python program to download the Gemma model from the Hugging Face platform, and use multimodal inputs (text and images) to ask questions to Gemma and obtain answers.

What Is Gemma?

Gemma is a multimodal large model developed by Google that can understand text and images together, like an AI assistant that comprehends the content of an image and then describes it in words.

What Can Gemma Do?

1. Image caption generation: Input an image, and the AI automatically generates descriptive text.

2. Visual question answering: Ask a question about an image and get an answer, as the example below demonstrates.

How to Get Started?

1. Submit your information to request access for google/gemma-3-4b-it (https://huggingface.co/google/gemma-3-4b-it) and wait for approval.

2. gemma-3-4b-it requires a specific version of the transformers library; please install it first.

pip install transformers==4.51.3 --index-url https://pypi.jetson-ai-lab.io/jp6/cu126
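If you are unsure whether the right transformers version is installed, you can check it against the pinned release before running the example. The snippet below is a minimal sketch; the helper names are ours, and the version string comes from the pip command above.

```python
from importlib.metadata import version, PackageNotFoundError

PINNED = "4.51.3"  # the version pinned by the pip command above

def parse_version(v: str) -> tuple:
    # "4.51.3" -> (4, 51, 3) for numeric comparison
    return tuple(int(part) for part in v.split("."))

def matches_pinned(installed: str, pinned: str = PINNED) -> bool:
    # The pip command pins the version exactly, so check for equality
    return parse_version(installed) == parse_version(pinned)

try:
    installed = version("transformers")
    status = "OK" if matches_pinned(installed) else f"expected {PINNED}"
    print(f"transformers {installed}: {status}")
except PackageNotFoundError:
    print("transformers is not installed yet")
```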

3. The following example code will let Gemma generate answers based on an image and a question.

(1) Please replace the token with the Hugging Face token you generated earlier.

import torch
from transformers import pipeline
from huggingface_hub import login

# Log in to Hugging Face
login(token="hf_XXXXXXXXXXXXXXXX") # Replace with your own Hugging Face token

# Create an image-text-to-text pipeline
pipe = pipeline(
    "image-text-to-text",  # Specify the task type
    model="google/gemma-3-4b-it",  # Specify the model ID
    device="cuda",  # Select the device (GPU)
    torch_dtype=torch.bfloat16,
    model_kwargs={"cache_dir": "./model"},  # Specify the model cache directory (default is ~/.cache/huggingface)
)

# Define conversation messages, roles and content
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]

# Generate text using the pipeline and print the result
output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
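The example above fetches the image from a URL, but the same message structure can point at a local file, since the transformers image loader resolves both URLs and local paths. A hedged sketch of building the messages for a local image; the file path and helper name here are illustrative, not part of the library:

```python
def build_messages(image_path: str, question: str) -> list:
    """Build the chat-style message list used by the image-text-to-text
    pipeline, pointing at a local image file instead of a URL.
    (Assumes the pipeline's image loader accepts local paths,
    as recent transformers releases do.)"""
    return [
        {"role": "system",
         "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user",
         "content": [
             {"type": "image", "url": image_path},  # local path in place of a URL
             {"type": "text", "text": question},
         ]},
    ]

messages = build_messages("./photos/candy.jpg", "What animal is on the candy?")
# output = pipe(text=messages, max_new_tokens=200)  # same call as above
```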

4. After running it, the program prints Gemma's answer to the question about the image.
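The last line of the example (`output[0]["generated_text"][-1]["content"]`) works because the pipeline returns the whole conversation with the model's reply appended as the final message. A sketch with a mocked result; the reply text is a placeholder, not real Gemma output:

```python
# Mocked pipeline result illustrating the nesting the print line relies on;
# the assistant text is a placeholder, not actual model output.
mock_output = [{
    "generated_text": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What animal is on the candy?"},
        {"role": "assistant", "content": "<model's answer goes here>"},
    ]
}]

def last_assistant_reply(output: list) -> str:
    # First result -> its conversation -> the final message's body
    return output[0]["generated_text"][-1]["content"]

print(last_assistant_reply(mock_output))
```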

Reference:

google/gemma-3-4b-it · Hugging Face


Copyright © 2026 YUAN High-Tech Development Co., Ltd.
All rights reserved.