We have been iterating on our enterprise Java code assistant since early 2025. Our previous best -- a Qwen 2.5 Coder 14B fine-tune -- scored 2.19 on Java enterprise tasks. Today we are reporting a significant step forward: a Gemma 4 31B fine-tune that scores 3.53 on the same benchmark.
Training Setup
- Base model: Google Gemma 4 31B IT (multimodal; text backbone extracted for fine-tuning)
- Method: QLoRA (4-bit NF4), LoRA r=16, alpha=32
- Dataset: ~70k examples (Cassandra, Elasticsearch, Hibernate, Kafka, Spring, WildFly) plus Jakarta/best-practices supplements, deduplicated, class-level extraction
- Hardware: RTX 5090 32GB, CUDA 13.1
- Quantization: Q4_K_M → 17.8 GB final model size
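For context, a back-of-envelope sketch of what these settings imply. The hidden size used below is an assumption for illustration, not a published Gemma 4 dimension; the 17.8 GB / 31B figures come from the setup above:

```python
# Back-of-envelope arithmetic for the QLoRA setup above.
# NOTE: the 5120 hidden size is an illustrative assumption, not from the post.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable params LoRA adds to one d_in x d_out projection: A (d_in x r) + B (r x d_out)."""
    return d_in * r + r * d_out

r, alpha = 16, 32
scale = alpha / r                        # LoRA scaling applied to the adapter output
per_proj = lora_params(5120, 5120, r)    # adapter params for one assumed 5120x5120 projection

# Effective bits per weight implied by the reported Q4_K_M size (17.8 GB for ~31B params):
bits_per_weight = 17.8e9 * 8 / 31e9

print(scale)                       # 2.0
print(per_proj)                    # 163840
print(round(bits_per_weight, 2))   # ~4.59
```

The ~4.6 bits/weight is consistent with Q4_K_M keeping some tensors at higher precision than 4 bits.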
Results
| Model | Java Enterprise | General Code |
|---|---|---|
| Gemma 4 31B FT (Q4_K_M) | 3.53 | 4.25 |
| Qwen 3.5 9B FT (Q4_K_M) | 1.81 | 3.50 |
| Qwen 2.5 14B FT (Q4_K_M) | 2.19 | 3.45 |
All models were evaluated under identical conditions: Claude Sonnet 4.6 as judge, temperature 0.2, 5 completions per task, 13 tasks across two suites.
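A hypothetical sketch of how per-model scores could be aggregated under this protocol -- average the judge scores over the 5 completions per task, then average across tasks so each task weighs equally. The task names and scores below are illustrative, not real data:

```python
# Illustrative aggregation: per-task mean over 5 completions, then suite mean.
from statistics import mean

def suite_score(scores_per_task: dict[str, list[float]]) -> float:
    """Mean of per-task means, so every task weighs equally."""
    return mean(mean(s) for s in scores_per_task.values())

scores = {  # illustrative judge scores, not real data
    "spring-controller": [4.0, 3.5, 3.5, 4.0, 3.0],
    "kafka-consumer":    [3.0, 3.5, 3.0, 4.0, 3.5],
}
print(round(suite_score(scores), 2))  # 3.5
```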
Latency
- Gemma 4 31B: 64.9 tok/s, TTFT 205ms
- Qwen 3.5 9B: 173.4 tok/s, TTFT 120ms
- Qwen 2.5 14B: 148.5 tok/s, TTFT 59ms
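To turn these numbers into wall-clock expectations, total latency is roughly TTFT plus token count divided by throughput. A small sketch using the figures above (the 500-token completion length is an arbitrary example; real decode speed also varies with context length):

```python
# Rough end-to-end latency estimate: total = TTFT + tokens / throughput.
# The 500-token completion length is an arbitrary illustration.

def gen_time_s(n_tokens: int, tok_per_s: float, ttft_ms: float) -> float:
    return ttft_ms / 1000 + n_tokens / tok_per_s

t_gemma = gen_time_s(500, 64.9, 205)    # ~7.9 s for a 500-token completion
t_qwen9 = gen_time_s(500, 173.4, 120)   # ~3.0 s
```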
The 31B model is the slowest of the three, but 65 tok/s on a single RTX 5090 is still comfortably interactive.
Deployment Notes
Gemma 4 uses a `<|channel>thought` reasoning mode: the model emits hidden reasoning before the visible output. This adds some first-token latency but improves output quality, particularly on multi-step generation tasks.
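A serving layer typically has to strip this hidden channel before returning text to the IDE. A minimal sketch, where the delimiter strings are placeholders -- the actual channel tokens depend on the model's chat template and are not specified here:

```python
# Strip a hidden reasoning channel from raw model output.
# The delimiter strings are PLACEHOLDERS, not confirmed Gemma 4 tokens.
THOUGHT_OPEN, THOUGHT_CLOSE = "<|channel>thought", "<|channel>final"

def visible_text(raw: str) -> str:
    """Drop everything between the thought marker and the final-channel marker."""
    if THOUGHT_OPEN in raw and THOUGHT_CLOSE in raw:
        head, _, rest = raw.partition(THOUGHT_OPEN)
        _, _, tail = rest.partition(THOUGHT_CLOSE)
        return head + tail
    return raw  # no reasoning channel present; pass through unchanged

out = visible_text("<|channel>thoughtplan the class...<|channel>finalpublic class Foo {}")
# out == "public class Foo {}"
```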
GGUF conversion required patching llama.cpp to support the Gemma4ForCausalLM text-only architecture. The upstream converter expects the full multimodal model; extracting just the text backbone needed targeted modifications to the conversion script.
The model is served via Ollama on-premises through our OpenClaw gateway. It is now the default model for all enterprise Java code generation tasks.
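For reference, a minimal client sketch against a local Ollama instance using its standard `/api/generate` endpoint with `stream: false`. The model tag `gemma4-java` and the localhost URL are assumptions for illustration; in our setup requests actually route through the gateway:

```python
# Minimal Ollama client sketch. Model tag "gemma4-java" is a placeholder.
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "gemma4-java") -> dict:
    # stream=False makes Ollama return one JSON object instead of NDJSON chunks.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str) -> str:
    req = request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["response"]  # Ollama puts the completion here
```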
Key Takeaway
Model size matters more than we expected for enterprise Java. The jump from 9B/14B to 31B closed the domain gap that smaller models could not bridge through fine-tuning alone. Gemma 4's architecture handles complex multi-class patterns -- Spring controllers, Kafka consumers, WildFly EJBs -- that consistently tripped up the smaller models.