At Aspexilary AI, our goal is to deploy capable language models entirely on-premises inside regulated environments. That means no API calls leave the building, but it also means the models need to actually work on the code our clients write.

Enterprise Java is a specific dialect. WildFly deployment descriptors, Spring Boot configuration, Kafka consumer groups, Hibernate mappings — these patterns are underrepresented in general-purpose training data. We decided to fix that.

The Dataset

We curated ~73,910 Java enterprise examples from open-source repositories spanning WildFly, Spring, Kafka, Elasticsearch, and Hibernate. Each example is a (prompt, completion) pair structured around realistic tasks: generating service classes, writing entity mappings, configuring message listeners.
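
To make that concrete, here is what a single record might look like. This is a hypothetical illustration; the field names and task text are placeholders, not our production schema.

```python
# Hypothetical (prompt, completion) record; field names and content are
# illustrative placeholders, not the production schema.
record = {
    "prompt": (
        "Write a Spring Kafka listener that consumes OrderCreated events "
        "from the 'orders' topic and hands each one to OrderService."
    ),
    "completion": (
        "@Component\n"
        "public class OrderCreatedListener {\n"
        "    @KafkaListener(topics = \"orders\", groupId = \"order-service\")\n"
        "    public void onOrderCreated(OrderCreated event) { ... }\n"
        "}"
    ),
}
```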

Dataset quality matters more than quantity here. We filtered aggressively — no auto-generated boilerplate, no trivial getters/setters, no examples without meaningful context.
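
A rough sketch of what that filtering looks like in practice. The markers and thresholds below are assumptions made for the sketch; the real pipeline applies more heuristics than these.

```python
import re

# Simplified illustration of the filtering pass. Markers and thresholds are
# illustrative; the real pipeline uses more checks than these.
GENERATED_MARKERS = ("@Generated", "DO NOT EDIT", "Auto-generated by")
ACCESSOR = re.compile(r"\bpublic\s+\w+\s+(get|set|is)[A-Z]\w*\s*\(")

def keep(record: dict) -> bool:
    code = record["completion"]
    if any(marker in code for marker in GENERATED_MARKERS):
        return False                                  # auto-generated boilerplate
    lines = [ln for ln in code.splitlines() if ln.strip()]
    accessor_lines = sum(1 for ln in lines if ACCESSOR.search(ln))
    if lines and accessor_lines / len(lines) > 0.5:   # mostly trivial getters/setters
        return False
    if len(record["prompt"].split()) < 10:            # prompt too thin to carry context
        return False
    return True
```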

Training Setup

Base model: Qwen 2.5 Coder 14B
Method: LoRA (r=16, alpha=32) on q/k/v/o projections (config sketch after this list)
Hardware: RTX 5090 (32GB VRAM), Ubuntu 24.04, CUDA 13.1
Quantization: Q4_K_M → 3.9GB final model size
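
The adapter settings above map onto a peft configuration along these lines. This is a sketch rather than our full training script; the dropout value and the module naming (standard for Qwen2-style checkpoints) are assumptions.

```python
from peft import LoraConfig

# Sketch of the adapter config implied by the settings above; data collation,
# trainer setup, and the rest of the training script are omitted.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,          # assumption: dropout value not stated above
    task_type="CAUSAL_LM",
)
```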

The RTX 5090's Blackwell architecture required a custom PyTorch build from source (sm_120 target). Standard pip installs don't support it yet.
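
After building, a quick sanity check confirms the wheel actually targets Blackwell. This is an illustrative check, not part of the training code:

```python
import torch

# Verify the source build targets Blackwell: the arch list should include
# sm_120, and an RTX 5090 should report compute capability (12, 0).
print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_arch_list())           # expect 'sm_120' in this list
print(torch.cuda.get_device_capability(0))  # expect (12, 0) on an RTX 5090
```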

Results

Eval loss landed at 0.4092, perplexity 1.51. On stream processing tasks, the fine-tuned model scored roughly twice as high as DeepSeek models of comparable size.
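
For reference, the perplexity figure is just the exponentiated eval loss:

```python
import math

# Perplexity is the exponential of the mean cross-entropy eval loss.
eval_loss = 0.4092
perplexity = math.exp(eval_loss)   # about 1.51, matching the reported figure
```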

We used a dual-judge evaluation approach: Claude as the primary judge (catching substantive failures) with an Ollama-based judge for throughput. The Ollama judge missed meaningful failures on complex multi-class scenarios; Claude caught them. Both have a place in the pipeline.
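
In rough terms, the routing looks like this. The helper functions, model tags, and the escalation rule are illustrative assumptions rather than our exact pipeline; the fixed points are Claude via the Anthropic SDK and a local judge behind Ollama's HTTP API.

```python
import anthropic
import requests

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ollama_judge(prompt: str) -> str:
    # Fast local judge via Ollama's /api/generate endpoint.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen2.5:14b", "prompt": prompt, "stream": False},
        timeout=120,
    )
    return resp.json()["response"]

def claude_judge(prompt: str) -> str:
    # Slower, more reliable judge for substantive failures.
    msg = claude.messages.create(
        model="claude-sonnet-4-20250514",  # swap for whichever Claude model you use
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def judge(prompt: str, complex_case: bool) -> str:
    # Illustrative routing rule: complex multi-class scenarios go to Claude,
    # everything else to the local judge for throughput.
    return claude_judge(prompt) if complex_case else ollama_judge(prompt)
```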

Key Lessons

Tool-calling needs its own training data. Our fine-tune has no tool-use capability because the training set had no tool-use examples. Future runs will address this directly.

Quantization at this scale is nearly lossless. Q4_K_M at 14B parameters retains enough weight precision that downstream task performance is not meaningfully degraded versus the full-precision model.

Evaluation is harder than training. A model that scores well on perplexity can still fail at the specific patterns clients care about. Domain-specific evaluation suites are non-negotiable.
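
As a simplified example of what a domain-specific check looks like, beyond perplexity: assert that generated code actually carries the constructs the task asked for. The assertions below are illustrative, not our actual suite.

```python
# Illustrative domain check: does a generated Kafka listener actually look
# like one? The real suite is broader, but the shape is the same.
def check_kafka_listener(generated_java: str) -> list[str]:
    failures = []
    if "@KafkaListener" not in generated_java:
        failures.append("missing @KafkaListener annotation")
    if "topics" not in generated_java:
        failures.append("listener does not bind to a topic")
    if "class" not in generated_java:
        failures.append("no class definition emitted")
    return failures
```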

The model is deployed via Ollama on-premises, served through our OpenClaw gateway. Next step: benchmarking Qwen 3.5 9B as the base for the next fine-tuning run.
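
For completeness, here is roughly what a client call looks like. The model tag is a placeholder, and in production requests go through the OpenClaw gateway rather than straight to Ollama; the sketch below hits Ollama's chat endpoint directly.

```python
import requests

# Minimal client-side sketch. The URL and model tag are placeholders; in
# production the OpenClaw gateway sits in front of Ollama.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "aspexilary-java-coder:14b",   # placeholder tag for the fine-tune
        "messages": [{
            "role": "user",
            "content": "Write a Hibernate entity for an Invoice with a line-item relation.",
        }],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```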