8-7 Microsoft: Phi-4 Model

Learning Objectives

Use a Python program to download the Phi-4 multimodal model from the Hugging Face platform, then ask Phi-4 questions using multimodal inputs (text, images, audio) and obtain answers.

What Is Phi-4?

Phi-4 (here, the Phi-4-multimodal-instruct variant) is a multimodal large model developed by Microsoft that can understand text, images, and audio together. Think of it as an AI brain that follows instructions: you give it text, photos, or recordings, and it responds according to what you ask.

What Can Phi-4 Do?

1. Image Understanding: Input images and questions to generate answers.

2. Speech Recognition/Translation: Input audio to transcribe into text or translate into a specific language.
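
Both tasks are posed to the model through the same chat-style prompt: a user tag, a placeholder for the attached image or audio clip, the question, and then the end and assistant tags. The sketch below assembles such a prompt. The tags are the ones used in the example program later in this section; the helper name build_prompt, its parameters, and the image-before-audio ordering are assumptions introduced only for illustration.

```python
# Chat tags taken from the example program in this section.
USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"
END_TAG = "<|end|>"


def build_prompt(question: str, image: bool = False, audio: bool = False) -> str:
    """Assemble a single-turn prompt (hypothetical helper, not part of any library)."""
    placeholders = ""
    if image:
        placeholders += "<|image_1|>"  # placeholder for the attached image
    if audio:
        placeholders += "<|audio_1|>"  # placeholder for the attached audio clip
    return f"{USER_TAG}{placeholders}{question}{END_TAG}{ASSISTANT_TAG}"


if __name__ == "__main__":
    print(build_prompt("What is shown in this image?", image=True))
    print(build_prompt("Transcribe the audio to text.", audio=True))
```

Calling build_prompt("What is shown in this image?", image=True) yields the same string as the image prompt in the example program.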

How to Get Started?

1. Here is an example program that lets Phi-4 generate image descriptions as well as speech transcription and translation.

import io

import requests
import soundfile as sf
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"

# Load model and processor (downloaded from Hugging Face on first run)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",  # places the model on the GPU, so no extra .cuda() call is needed
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation='flash_attention_2',  # requires the flash-attn package
    cache_dir="./model",  # store the downloaded weights in ./model
)

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path, cache_dir="./model")

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open the image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')

# Generate model response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
# Keep only the newly generated tokens (drop the echoed prompt)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

# Part 2: Audio Processing
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open the audio file (a browser-like User-Agent is sent with the request)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/143.0.0.0 Safari/537.36"
}
audio_response = requests.get(audio_url, headers=headers)
audio_response.raise_for_status()
audio, samplerate = sf.read(io.BytesIO(audio_response.content))
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')

# Generate model response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
# Keep only the newly generated tokens (drop the echoed prompt)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
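
The speech prompt asks the model to put a separator between the transcript and the French translation, so the audio reply can be split into its two parts afterwards. The following is a minimal sketch, assuming the separator is the literal string <sep> and that the model actually follows the instruction; split_transcript is a hypothetical helper name, not part of the transformers library.

```python
from typing import Tuple


def split_transcript(reply: str, separator: str = "<sep>") -> Tuple[str, str]:
    """Split a reply into (transcript, translation) at the first separator.

    If the separator is missing (an assumption worth guarding against),
    the whole reply is returned as the transcript and the translation is empty.
    """
    transcript, found, translation = reply.partition(separator)
    if not found:
        return reply.strip(), ""
    return transcript.strip(), translation.strip()


if __name__ == "__main__":
    # Made-up reply text, used only to show the call.
    example_reply = "Hello world <sep> Bonjour le monde"
    transcript, translation = split_transcript(example_reply)
    print("Transcript: ", transcript)
    print("Translation:", translation)
```

In the example program, this helper could be applied to the response string from the audio part.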

2. After running it, the program prints each prompt followed by Phi-4's response: first a description of the image, then the audio transcript and its French translation.

Reference:

microsoft/Phi-4-multimodal-instruct · Hugging Face