Enterprise Java codebases are not well-served by general-purpose coding models. WildFly configuration, Spring batch jobs, Kafka consumer patterns, Hibernate lifecycle callbacks: these patterns are underrepresented in the public repos most base models train on. The result: models that hallucinate JBoss deployment descriptors and confidently suggest deprecated Hibernate APIs.
Dataset Construction
The training corpus comprised 73,910 Java enterprise examples across WildFly/JBoss, Spring, Apache Kafka, Elasticsearch, and Hibernate/JPA. Each example follows an instruction-response format. Stream processing patterns were deliberately over-represented because that is where general-purpose models fail hardest.
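An instruction-response record in a corpus like this is typically stored as one JSON object per line (JSONL). A minimal sketch of that layout, where the field names ("instruction", "response", "domain") are illustrative assumptions rather than the project's actual schema:

```python
import json

# One illustrative training record; field names are assumptions,
# not the corpus's actual schema.
record = {
    "instruction": "Configure a Kafka consumer that commits offsets "
                   "manually, only after successful processing.",
    "response": 'Properties props = new Properties();\n'
                'props.put("enable.auto.commit", "false");\n'
                "// ... poll, process, then consumer.commitSync();",
    "domain": "kafka",
}

# JSONL: one serialized object per line, appended per example.
line = json.dumps(record)
assert set(json.loads(line)) == {"instruction", "response", "domain"}
```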
Training Setup
Training ran on an RTX 5090 (32GB VRAM) with a custom PyTorch build for Blackwell/sm_120 architecture.
Method: LoRA (r=16, alpha=32, dropout=0.1)
Target modules: q_proj, v_proj, k_proj, o_proj
Batch size: 4 (gradient accumulation x8 = effective 32)
Learning rate: 2e-4 with cosine decay
Epochs: 3
Quantization: Q4_K_M post-training (3.9GB final)
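The hyperparameters above map onto a Hugging Face PEFT configuration roughly as follows. This is a sketch under the stated settings, not the project's actual training script; the output path is a placeholder:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter config matching the listed hyperparameters.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Batch size 4 with 8-step gradient accumulation -> effective batch of 32.
args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    output_dir="out",  # placeholder path
)
```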
Final eval: loss 0.4092, perplexity 1.51.
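The two final numbers are mutually consistent, since perplexity is just the exponential of the cross-entropy loss:

```python
import math

# Perplexity = exp(cross-entropy loss).
loss = 0.4092
perplexity = math.exp(loss)  # ~1.5056, reported rounded to 1.51
assert round(perplexity, 2) == 1.51
```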
Why Two Judges Matter
We evaluated with two judges: Claude Sonnet and a local Ollama-hosted model. The Ollama judge scored both models comparably on syntactic correctness. The Claude judge caught substantive failures: incorrect Kafka offset commit strategies that would cause message loss under failure, Hibernate fetch patterns generating N+1 queries at scale, Spring Security configs missing CSRF protection.
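The Kafka offset failure is worth spelling out: if a consumer commits an offset before processing the record, a crash between the commit and the processing silently drops that record on restart. A pure-Python simulation of the two commit orderings (no Kafka client involved; names are illustrative):

```python
def consume(messages, crash_index, commit_before_processing):
    """Simulate a consumer that crashes just before processing the
    record at crash_index. Returns (processed, committed_offset)."""
    processed = []
    committed = 0
    for i, msg in enumerate(messages):
        if commit_before_processing:
            committed = i + 1            # at-most-once: commit first
        if i == crash_index:
            return processed, committed  # crash between commit and process
        processed.append(msg)
        if not commit_before_processing:
            committed = i + 1            # at-least-once: commit after success
    return processed, committed

msgs = ["m0", "m1", "m2", "m3"]

# Commit-first: the crashed record's offset is already committed, so a
# restarted consumer resumes at offset 3 and m2 is lost forever.
done, offset = consume(msgs, crash_index=2, commit_before_processing=True)
assert done == ["m0", "m1"] and offset == 3

# Process-first: restart re-reads from offset 2, so m2 is retried.
# Duplicates are possible (at-least-once), but nothing is lost.
done, offset = consume(msgs, crash_index=2, commit_before_processing=False)
assert done == ["m0", "m1"] and offset == 2
```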
The fine-tuned model scored roughly twice as high as DeepSeek Coder 6.7B on stream-processing tasks when evaluated by the Claude judge. For domain-specific evaluation, a capable external judge is worth the cost.
What's Next
The next candidate base model is Qwen 3.5 9B (dense, March 2026). At roughly 60% of the parameter count, it is a more deployable artifact for on-premises appliance packaging.