Enterprise Java codebases are not well-served by general-purpose coding models. WildFly configuration, Spring batch jobs, Kafka consumer patterns, Hibernate lifecycle callbacks: these patterns are underrepresented in the public repos most base models train on. The result: models that hallucinate JBoss deployment descriptors and confidently suggest deprecated Hibernate APIs.
Dataset Construction
The training corpus comprised 73,910 Java enterprise examples across WildFly/JBoss, Spring, Apache Kafka, Elasticsearch, and Hibernate/JPA. Each example follows an instruction-response format. Stream processing patterns were deliberately over-represented because that is where general-purpose models fail hardest.
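An instruction-response record in a corpus like this is typically stored as one JSON object per line (JSONL). A minimal sketch of that layout, where the field names ("instruction", "response", "domain") are illustrative assumptions rather than the project's actual schema:

```python
import json

# One illustrative training record; field names are assumptions,
# not the corpus's actual schema.
record = {
    "instruction": "Configure a Kafka consumer that commits offsets "
                   "manually, only after successful processing.",
    "response": 'Properties props = new Properties();\n'
                'props.put("enable.auto.commit", "false");\n'
                "// ... poll, process, then consumer.commitSync();",
    "domain": "kafka",
}

# JSONL: one serialized object per line, appended per example.
line = json.dumps(record)
assert set(json.loads(line)) == {"instruction", "response", "domain"}
```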
Training Setup
Training ran on an RTX 5090 (32GB VRAM) with a custom PyTorch build for Blackwell/sm_120 architecture.
Method: LoRA (r=16, alpha=32, dropout=0.1)
Target modules: q_proj, v_proj, k_proj, o_proj
Batch size: 4 (gradient accumulation x8 = effective 32)
Learning rate: 2e-4 with cosine decay
Epochs: 3
Quantization: Q4_K_M post-training (3.9GB final)
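The hyperparameters above map onto a Hugging Face PEFT configuration roughly as follows. This is a sketch under the stated settings, not the project's actual training script; the output path is a placeholder:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter config matching the listed hyperparameters.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Batch size 4 with 8-step gradient accumulation -> effective batch of 32.
args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    output_dir="out",  # placeholder path
)
```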
Final eval: loss 0.4092, perplexity 1.51.
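The two final numbers are mutually consistent, since perplexity is just the exponential of the cross-entropy loss:

```python
import math

# Perplexity = exp(cross-entropy loss).
loss = 0.4092
perplexity = math.exp(loss)  # ~1.5056, reported rounded to 1.51
assert round(perplexity, 2) == 1.51
```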
Why Two Judges Matter
We evaluated with two judges: Claude Sonnet and a local Ollama-hosted model. The Ollama judge scored both models comparably on syntactic correctness. The Claude judge caught substantive failures: incorrect Kafka offset commit strategies that would cause message loss under failure, Hibernate fetch patterns generating N+1 queries at scale, Spring Security configs missing CSRF protection.
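The Kafka offset failure is worth spelling out: if a consumer commits an offset before processing the record, a crash between the commit and the processing silently drops that record on restart. A pure-Python simulation of the two commit orderings (no Kafka client involved; names are illustrative):

```python
def consume(messages, crash_index, commit_before_processing):
    """Simulate a consumer that crashes just before processing the
    record at crash_index. Returns (processed, committed_offset)."""
    processed = []
    committed = 0
    for i, msg in enumerate(messages):
        if commit_before_processing:
            committed = i + 1            # at-most-once: commit first
        if i == crash_index:
            return processed, committed  # crash between commit and process
        processed.append(msg)
        if not commit_before_processing:
            committed = i + 1            # at-least-once: commit after success
    return processed, committed

msgs = ["m0", "m1", "m2", "m3"]

# Commit-first: the crashed record's offset is already committed, so a
# restarted consumer resumes at offset 3 and m2 is lost forever.
done, offset = consume(msgs, crash_index=2, commit_before_processing=True)
assert done == ["m0", "m1"] and offset == 3

# Process-first: restart re-reads from offset 2, so m2 is retried.
# Duplicates are possible (at-least-once), but nothing is lost.
done, offset = consume(msgs, crash_index=2, commit_before_processing=False)
assert done == ["m0", "m1"] and offset == 2
```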
The fine-tuned model scored roughly twice as high as DeepSeek Coder 6.7B on stream-processing tasks when evaluated by the Claude judge. For domain-specific evaluation, a capable external judge is worth the cost.
What's Next
The next candidate base model is Qwen 3.5 9B (dense, March 2026). At roughly 60% of the parameter count, it is a more deployable artifact for on-premises appliance packaging.