8-8 Google: Gemma model

Learning Objectives

Use a Python program to download the Gemma model from the Hugging Face platform, and use multimodal inputs (text and images) to ask questions to Gemma and obtain answers.

What Is Gemma?

Gemma is a multimodal large model developed by Google that can understand text and images together, like an AI assistant that comprehends the content of an image and then describes it in words.

What Can Gemma Do?

1. Image caption generation: Input an image, and the AI automatically generates descriptive text.

2. Visual question answering: Ask a question about an image and get an answer, as the example below demonstrates.

How to Get Started?

1. Submit your information to request access for google/gemma-3-4b-it (https://huggingface.co/google/gemma-3-4b-it) and wait for approval.

2. gemma-3-4b-it requires a specific version of the transformers library; please install it first.

pip install transformers==4.51.3 --index-url https://pypi.jetson-ai-lab.io/jp6/cu126
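If you are unsure whether the right transformers version is installed, you can check it against the pinned release before running the example. The snippet below is a minimal sketch; the helper names are ours, and the version string comes from the pip command above.

```python
from importlib.metadata import version, PackageNotFoundError

PINNED = "4.51.3"  # the version pinned by the pip command above

def parse_version(v: str) -> tuple:
    # "4.51.3" -> (4, 51, 3) for numeric comparison
    return tuple(int(part) for part in v.split("."))

def matches_pinned(installed: str, pinned: str = PINNED) -> bool:
    # The pip command pins the version exactly, so check for equality
    return parse_version(installed) == parse_version(pinned)

try:
    installed = version("transformers")
    status = "OK" if matches_pinned(installed) else f"expected {PINNED}"
    print(f"transformers {installed}: {status}")
except PackageNotFoundError:
    print("transformers is not installed yet")
```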

3. The following example code will let Gemma generate answers based on an image and a question.

(1) Please replace the token with the Hugging Face token you generated earlier.

import torch
from transformers import pipeline
from huggingface_hub import login

# Log in to Hugging Face
login(token="hf_XXXXXXXXXXXXXXXX") # Replace with your own Hugging Face token

# Create an image-text-to-text pipeline
pipe = pipeline(
    "image-text-to-text",  # Specify the task type
    model="google/gemma-3-4b-it",  # Specify the model ID
    device="cuda",  # Select the device (GPU)
    torch_dtype=torch.bfloat16,
    model_kwargs={"cache_dir": "./model"},  # Specify the model cache directory (default is ~/.cache/huggingface)
)

# Define conversation messages, roles and content
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]

# Generate text using the pipeline and print the result
output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
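The example above fetches the image from a URL, but the same message structure can point at a local file, since the transformers image loader resolves both URLs and local paths. A hedged sketch of building the messages for a local image; the file path and helper name here are illustrative, not part of the library:

```python
def build_messages(image_path: str, question: str) -> list:
    """Build the chat-style message list used by the image-text-to-text
    pipeline, pointing at a local image file instead of a URL.
    (Assumes the pipeline's image loader accepts local paths,
    as recent transformers releases do.)"""
    return [
        {"role": "system",
         "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user",
         "content": [
             {"type": "image", "url": image_path},  # local path in place of a URL
             {"type": "text", "text": question},
         ]},
    ]

messages = build_messages("./photos/candy.jpg", "What animal is on the candy?")
# output = pipe(text=messages, max_new_tokens=200)  # same call as above
```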

4. After running it, the program prints Gemma's answer to the question about the image.
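The last line of the example (`output[0]["generated_text"][-1]["content"]`) works because the pipeline returns the whole conversation with the model's reply appended as the final message. A sketch with a mocked result; the reply text is a placeholder, not real Gemma output:

```python
# Mocked pipeline result illustrating the nesting the print line relies on;
# the assistant text is a placeholder, not actual model output.
mock_output = [{
    "generated_text": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What animal is on the candy?"},
        {"role": "assistant", "content": "<model's answer goes here>"},
    ]
}]

def last_assistant_reply(output: list) -> str:
    # First result -> its conversation -> the final message's body
    return output[0]["generated_text"][-1]["content"]

print(last_assistant_reply(mock_output))
```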

Reference:

google/gemma-3-4b-it · Hugging Face


Copyright © 2026 YUAN High-Tech Development Co., Ltd.
All rights reserved.