8-7 Microsoft: Phi-4 Model
Learning Objectives
Use a Python program to download the Florence model from the Hugging Face platform, and use multimodal inputs (text, images, audio) to ask questions to Phi-4 and obtain answers.

What is Phi-4?
Phi-4 is a multimodal large model developed by Microsoft that can simultaneously understand text, images, and audio. It’s like an AI brain that can "understand spoken commands" — you give it text, photos, or recordings, and it will act according to your instructions.
What Can Phi-4 Do?
1. Image Understanding: Input images and questions to generate answers.
2. Speech Recognition/Translation: Input audio to transcribe into text or translate into a specific language.
How to Get Started?
1. Here is an example program that lets Phi-4 generate image descriptions as well as speech transcription and translation.
import requests
import torch
import os
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from urllib.request import urlopen
# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"
# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_path,
device_map="cuda",
torch_dtype="auto",
trust_remote_code=True,
attn_implementation='flash_attention_2',
cache_dir="./model",
).cuda()
# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path, cache_dir="./model")
# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
# Download and open the image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')
# Generate model response
generate_ids = model.generate(
**inputs,
max_new_tokens=1000,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
# Part 2: Audio Processing
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
# Download and open the audio file
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/143.0.0.0 Safari/537.36"
}
response = requests.get(audio_url, headers=headers)
response.raise_for_status()
audio, samplerate = sf.read(io.BytesIO(response.content))
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')
# Generate model response
generate_ids = model.generate(
**inputs,
max_new_tokens=1000,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
2. After running it, you will see a response similar to the following:

Reference:
microsoft/Phi-4-multimodal-instruct · Hugging Face