We have been iterating on our enterprise Java code assistant since early 2025, fine-tuning successive base models on our proprietary Java enterprise corpus. Our previous best -- a Qwen 2.5 Coder 14B fine-tune -- scored 2.19 out of 5 on Java enterprise tasks. Today, we are reporting a significant step forward: our fine-tune of Google's Gemma 4 31B scores 3.53, a 61% relative improvement.

Training Setup

Base model: Google Gemma 4 31B IT (multimodal, text backbone extracted for fine-tuning)
Method: QLoRA (4-bit NF4), LoRA r=16, alpha=32
Dataset: ~70k examples (Cassandra, Elasticsearch, Hibernate, Kafka, Spring, WildFly) + Jakarta/best-practices supplements, deduplicated, class-level extraction
Hardware: RTX 5090 32GB, CUDA 13.1
Quantization: Q4_K_M → 17.8 GB final model size
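
To make the LoRA hyperparameters above concrete, here is a minimal sketch of what r=16 and alpha=32 imply for one adapted linear layer. It assumes the standard LoRA scaling of alpha / r. The 4096x4096 layer shape is a hypothetical example and not a Gemma 4 dimension, and the helper names are ours for illustration.

```python
# Sketch: what r=16, alpha=32 mean for a single LoRA-adapted linear layer.
# Assumption: standard LoRA scaling (alpha / r). The layer shape below is
# hypothetical and is NOT taken from the Gemma 4 architecture.


def lora_scaling(r: int, alpha: int) -> float:
    """The low-rank update B @ A is multiplied by alpha / r."""
    return alpha / r


def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """A has shape (r, d_in) and B has shape (d_out, r)."""
    return r * (d_in + d_out)


R, ALPHA = 16, 32          # values from the training setup
D_IN = D_OUT = 4096        # hypothetical projection size

scaling = lora_scaling(R, ALPHA)
added = lora_added_params(D_IN, D_OUT, R)
full = D_IN * D_OUT
trainable_fraction = added / full

print(f"scaling factor: {scaling}")
print(f"adapter weights: {added:,} of {full:,} ({trainable_fraction:.2%})")
```

For this hypothetical layer the adapter trains well under 1% of the weights, which is why a 31B model can be fine-tuned on a single 32GB card once the frozen base weights are held in 4-bit NF4.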

Results

All three models below are our own fine-tunes, trained on the same enterprise Java dataset using the same QLoRA pipeline and quantized to Q4_K_M. This is a head-to-head comparison of fine-tuned models, not off-the-shelf base models.

Model (all fine-tuned)      Java Enterprise   General Code
Gemma 4 31B FT (Q4_K_M)     3.53              4.25
Qwen 3.5 9B FT (Q4_K_M)     1.81              3.50
Qwen 2.5 14B FT (Q4_K_M)    2.19              3.45

All models were evaluated under identical conditions: Claude Sonnet 4.6 as judge, temperature 0.2, 5 completions per task, 13 tasks across two suites. Scores are on a 1-5 scale (an aggregate of correctness, completeness, code quality, and best practices).
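
As a rough illustration of how per-completion judge ratings could roll up into a suite score, the sketch below assumes an unweighted mean at every level (criteria, then completions, then tasks). The post does not specify the exact weighting, so treat that as an assumption. The ratings in `example_suite` are made-up placeholders, not real evaluation data.

```python
from statistics import mean

# Criteria named in the evaluation description.
CRITERIA = ("correctness", "completeness", "code_quality", "best_practices")


def completion_score(ratings: dict) -> float:
    """Assumed aggregation: unweighted mean of the four 1-5 criteria."""
    return mean(ratings[c] for c in CRITERIA)


def task_score(completions: list) -> float:
    """Mean over the completions sampled for one task."""
    return mean(completion_score(c) for c in completions)


def suite_score(tasks: list) -> float:
    """Mean over all tasks in a suite."""
    return mean(task_score(t) for t in tasks)


def relative_improvement(new: float, old: float) -> float:
    return (new - old) / old


# Hypothetical ratings, for illustration only.
example_suite = [
    [  # task A, two completions
        {"correctness": 4.0, "completeness": 3.0, "code_quality": 4.0, "best_practices": 3.0},
        {"correctness": 5.0, "completeness": 4.0, "code_quality": 4.0, "best_practices": 3.0},
    ],
    [  # task B, one completion
        {"correctness": 2.0, "completeness": 2.0, "code_quality": 3.0, "best_practices": 1.0},
    ],
]

print(f"example suite score: {suite_score(example_suite):.3f}")
# Reported Java Enterprise scores: 3.53 (Gemma 4 31B FT) vs 2.19 (Qwen 2.5 14B FT).
print(f"relative improvement: {relative_improvement(3.53, 2.19):.1%}")
```

The last line reproduces the headline figure: (3.53 - 2.19) / 2.19 is about 61%.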

Latency

  • Gemma 4 31B FT: 64.9 tok/s, time to first token (TTFT) 205ms
  • Qwen 3.5 9B FT: 173.4 tok/s, TTFT 120ms
  • Qwen 2.5 14B FT: 148.5 tok/s, TTFT 59ms

The 31B model generates at less than half the throughput of the smaller fine-tunes, but at 65 tok/s on a single RTX 5090 it is still fast enough for interactive use.
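
To put these numbers in terms of wall-clock time, here is a back-of-the-envelope sketch that estimates response time as TTFT plus output tokens divided by throughput. It assumes a constant decode rate and a hypothetical 500-token response. The `LatencyProfile` class is ours for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyProfile:
    name: str
    tokens_per_second: float
    ttft_seconds: float

    def estimated_seconds(self, output_tokens: int) -> float:
        """Simple model: time to first token plus steady-state decoding."""
        return self.ttft_seconds + output_tokens / self.tokens_per_second


# Measured values from the latency list above.
PROFILES = [
    LatencyProfile("Gemma 4 31B FT", 64.9, 0.205),
    LatencyProfile("Qwen 3.5 9B FT", 173.4, 0.120),
    LatencyProfile("Qwen 2.5 14B FT", 148.5, 0.059),
]

OUTPUT_TOKENS = 500  # hypothetical response length

for profile in PROFILES:
    seconds = profile.estimated_seconds(OUTPUT_TOKENS)
    print(f"{profile.name}: ~{seconds:.1f}s for {OUTPUT_TOKENS} tokens")
```

Under these assumptions the 31B model needs roughly 8 seconds for a 500-token completion, against roughly 3 to 3.5 seconds for the smaller models.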

Deployment Notes

Gemma 4 uses a <|channel>thought reasoning mode -- it thinks before generating visible output. This adds some latency before the first visible token but improves output quality, particularly on multi-step generation tasks.

GGUF conversion required patching llama.cpp to support the Gemma4ForCausalLM text-only architecture. The upstream converter expects the full multimodal model; extracting just the text backbone needed targeted modifications to the conversion script.

The model is served via Ollama on-premises through our OpenClaw gateway. It is now the default model for all enterprise Java code generation tasks.
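
For readers who want to try a similar setup, the following is a minimal client sketch against Ollama's `/api/generate` endpoint, using only the Python standard library. It assumes a direct connection to Ollama's default local address rather than our gateway. The model tag `gemma4-java-ft` is a hypothetical placeholder, and the temperature of 0.2 is simply carried over from the evaluation settings.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint. A gateway in front of it
# would expose a different address.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "gemma4-java-ft"  # hypothetical model tag


def build_generate_request(prompt: str,
                           model: str = MODEL_NAME,
                           temperature: float = 0.2) -> urllib.request.Request:
    """Build a non-streaming generate request for Ollama."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a chunk stream
        "options": {"temperature": temperature},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With a model loaded under that tag, a call such as `generate("Write a Spring REST controller for an Order entity.")` would return the completion as a string.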

Key Takeaway

Model size matters more than we expected for enterprise Java. The jump from 9B/14B to 31B raised the Java Enterprise score from 1.81 and 2.19 to 3.53, a domain gap that the smaller models could not bridge through fine-tuning alone. The Gemma 4 fine-tune handles complex multi-class patterns -- Spring controllers, Kafka consumers, WildFly EJBs -- that consistently tripped up the smaller models.