Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines

arXiv CS Monday 08 June 2026, 04:00 UTC By Jatin Kishnani, Mayank Goel, Amit Singh, Pulkit Agrawal, Sairanjan Mishra 1 min read

Key Points

Announce Type: replace Abstract: We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU and GPU platforms for large language model adaptation. Using LoRA on a Google TPU v5p-8 for training and TPU v6e-8 (Trillium) for inference, we document the full set of code-level adaptations required to port a GPU-native training recipe - built on PyTorch, HuggingFace TRL, and FSDP - to the JAX +...

arXiv:2605.25645v3 Announce Type: replace Abstract: We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU and GPU platforms for large language model adaptation. Using LoRA on a Google TPU v5p-8 for training and TPU v6e-8 (Trillium) for inference, we document the full set of code-level adaptations required to port a GPU-native training recipe - built on PyTorch, HuggingFace TRL, and FSDP - to the JAX + Tunix/Qwix stack. These adaptations span mesh configuration, LoRA module naming conventions, sharding annotation corrections, gradient checkpoint, data pipeline restructuring, and a custom Orbax-to-safetensor checkpoint merging procedure. For inference, we detail the vLLM-TPU Docker setup necessary to serve Gemma 4 on v6e-8 and characterize the resulting latency and throughput profile. Compared with a similar-costing 2xH100 GPU baseline under identical hyperparameters, TPU training completes 1.61x faster at 2.12x lower cost. For inference, we cover the vLLM-TPU Docker setup required to serve Gemma 4 on v6e-8 and explain the observed latency and throughput characteristics across a QPS sweep spanning 512 to 16k input tokens. Across both workloads we compare performance and cost against a 2xH100 GPU baseline running identical hyperparameters. The TPU completes training 1.61x faster at 2.12x lower cost. For inference, TPU v6e-8 matches GPU at short context (<=2048 tokens) and decisively outperforms at long context: 66% higher throughput and 23.6x faster TTFT at 4096-token inputs (61 ms vs 1,443 ms at QPS=4). Our work removes a critical gap in the open tooling ecosystem and provides practitioners with a recipe for Gemma 4 Dense 31B deployment on the TPU infrastructure.

Technical Comparison (ORG) GPU (ORG) Google (ORG) Gemma 4 31B (ORG) TPU (ORG) Trillium (LOCATION) PyTorch (ORG) HuggingFace TRL (ORG) FSDP (ORG) the JAX + Tunix/Qwix (LOCATION) LoRA (ORG) Orbax (PERSON) QPS (ORG)

Originally published by arXiv CS Read original →

Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines

Related Stories

Trump Risks Key Surveillance Authority Over ‘Unqualified’ Spy-Chief Pick

Is predictive text giving you mistakes and 'hallucinations'? You're not alone

Valve will stop producing physical Steam gift cards because of scammers

Oracle Reports Higher-Than-Expected Data Center Spending