DataSci-Coder: Fine-Tuned LLM for Data Science

Model on HuggingFace: jsmall12/DataSci-Coder-14B-LoRA  |  GitHub: jacksonSmall/DataSci-Coder

Overview

DataSci-Coder is a fine-tuned large language model specialized for data science code generation. Starting from Qwen2.5-Coder-14B-Instruct, I applied QLoRA (4-bit quantization) to train a domain-specific adapter on 10,795 curated examples covering statistics, machine learning, deep learning, and data engineering.

The core insight: general-purpose code models like the base Qwen2.5-Coder are excellent but treat all code domains equally. By fine-tuning on DS-specific examples with strict instruction-following requirements, the model learns to generate cleaner, more compliant data science code with zero explanatory text pollution.

Status: Published on HuggingFace (~40 downloads)Trained: Spring 2026

Training Setup

ParameterValue
Base Modelunsloth/Qwen2.5-Coder-14B-Instruct-bnb-4bit
MethodQLoRA (4-bit quantization)
LoRA Rankr=16, alpha=32, dropout=0.05
LoRA Targetsq/k/v/o_proj + gate/up/down_proj
Training Examples10,795
Epochs1
Batch Size1 (grad_accum=16, effective=16)
Learning Rate3e-5
GPULightning AI L40S (44GB VRAM)
Precisionbf16
Training Time1.9 hours
Final Loss0.5933

Training Data

10,795 examples from 6+ sources, filtered to data science Python code:

  • Public HuggingFace datasets (6 sources, max 3,000/dataset): Python code instructions, CodeAlpaca, Evol-Instruct variants
  • Newsletter content: Daily Dose of Data Science and similar — scraped, cleaned, formatted to Q&A pairs
  • Class materials: Stats, ML, and DL coursework formatted to instruction-following examples
  • Curated examples: Hand-picked, high-quality examples weighted 3x in training mix

Pipeline: All sources → deduplication → quality filter → code-only filter → category cap (3,000/category) → stratified 90/10 train/val split


Evaluation Results

Hard Eval (12 keyword-based tasks)

All 12 tasks completed correctly. 3 apparent “failures” were false negatives from strict keyword matching:

  • Stacking ensemble: Fine-tuned model built manually with StratifiedKFold (deeper understanding) instead of sklearn’s StackingClassifier wrapper — eval required the exact class name
  • SHAP: Fine-tuned model used shap.Explainer (newer, general-purpose API) instead of shap.TreeExplainer (legacy) — eval required legacy class name
  • GAN: Fine-tuned model wrote correct Generator/Discriminator architecture with proper structure but used variable names like z/output rather than fake/real

All 12 implementations were correct, runnable, and complete. Fine-tuned model matched or beat the base on every task.

Constraint Eval (10 multi-constraint tests)

MetricFine-TunedBaseDelta
Constraint Score93.3%91.4%+1.9%
Code-Only Compliance10/106/10+4
Code Ratio100%87.9%+12.1%

Won 3/10 tests, tied 5/10, lost 2/10.

Key win: The fine-tuned model follows “no explanations” instructions 100% of the time. The base model adds explanatory text 40% of the time even when explicitly instructed not to.


Key Technical Insights

  • DO NOT call merge_and_unload() on 4-bit models — produces garbage output. Keep adapter as a PEFT layer.
  • Load adapter with FastLanguageModel.from_pretrained(model_name=ADAPTER_DIR), not PeftModel
  • Set use_cache=False in generate() to bypass Unsloth RoPE decode bug
  • L40S supports bf16; T4 only fp16 — unsloth errors on mismatch
  • SFTTrainer + SFTConfig required (not raw Trainer) — unsloth patches expect it


Technologies Used

Python • PyTorch • Hugging Face Transformers • QLoRA/PEFT • Unsloth • TRL (SFTTrainer) • Lightning AI