
Fine-Tuning

Adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset. Usually done with parameter-efficient methods like LoRA, which train a small set of added weights amounting to less than 1% of the model's parameters.

Fine-tuning is the transfer-learning technique of taking a pre-trained model and continuing training on a smaller, task-specific dataset. Training from scratch requires enormous datasets and compute. Fine-tuning instead starts from a model that has already learned general patterns and refines it for a specific use case, which dramatically reduces the data, compute, and time required to reach high task performance.

The dominant fine-tuning approach in production is **LoRA** (Low-Rank Adaptation). LoRA freezes the pre-trained weights and adds small trainable rank-decomposition matrices to existing model layers, so the trainable parameters typically amount to less than 1% of the total. This reduces memory and compute requirements enough that LoRAs can be trained on a single GPU for most open-weight models, and dozens of LoRAs can be served concurrently on top of a shared base model. LoRA fine-tunes are also small enough to version, deploy, and roll back like ordinary application configuration.
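To make the low-rank idea concrete, here is a minimal pure-Python sketch of a single LoRA-adapted layer. It assumes the common formulation in which a frozen weight matrix `W` is augmented by the product of two small matrices, `B` and `A`, scaled by `alpha / rank`. The function names (`lora_forward`, `lora_param_fraction`, `init_adapter`) and the layer sizes are illustrative assumptions, not part of any particular library.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_param_fraction(d_out, d_in, rank):
    """Adapter parameters as a fraction of the frozen weight matrix's size."""
    base_params = d_out * d_in
    adapter_params = rank * d_in + d_out * rank  # A is rank x d_in, B is d_out x rank
    return adapter_params / base_params

def init_adapter(d_out, d_in, rank, seed=0):
    """A starts small and random, B starts at zero, so the adapter is a no-op at first."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return A, B

def lora_forward(x, W, A, B, alpha):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen, only A and B would train."""
    rank = len(A)
    base = matvec(W, x)                # frozen pre-trained path
    update = matvec(B, matvec(A, x))   # low-rank path, never forms the full d_out x d_in update
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]
```

For a hypothetical 4096 x 4096 projection with rank 8, `lora_param_fraction(4096, 4096, 8)` evaluates to 0.00390625, or about 0.39% of that layer's weights. Because `B` is initialised to zero, the adapted layer produces exactly the base model's output until training moves `B` away from zero, which is also why an adapter can be attached or detached without touching the shared base weights.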

Fine-tuning is most useful for workflows where a smaller model could plausibly handle the task if it learned the specific patterns of your domain. Customer support classification, structured extraction, tone matching, and tool-call generation are typical wins: a fine-tuned 1B or 3B model often matches a frontier model on a narrow workflow at a fraction of the cost and latency.

An inference platform that owns observability and routing can spot fine-tune candidates automatically. It identifies workflows where a smaller model would meet the quality bar, trains LoRAs on captured production traffic, evaluates them in shadow against the live model, and migrates traffic gradually once the data supports the swap. This is the Fine-tune subsystem in Cumulus.
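The shadow-evaluation and gradual-migration steps can be sketched roughly as follows. This is an illustrative outline only, not a description of how Cumulus implements them: the function names, the quality-score inputs, and the thresholds (`min_samples`, `max_drop`, `step`) are all assumptions chosen for the example.

```python
import hashlib
from statistics import mean

def candidate_meets_quality(live_scores, candidate_scores,
                            min_samples=200, max_drop=0.01):
    """True when the shadow candidate's mean score is within max_drop of the live model's.

    Scores are assumed to be paired: both models answered the same requests.
    """
    if len(live_scores) != len(candidate_scores):
        raise ValueError("shadow evaluation needs paired scores")
    if len(live_scores) < min_samples:
        return False  # not enough evidence yet to support a swap
    return mean(candidate_scores) >= mean(live_scores) - max_drop

def next_traffic_share(current_share, meets_quality, step=0.1):
    """Advance the rollout by one step, or roll back to zero if quality slipped."""
    if not meets_quality:
        return 0.0
    return min(1.0, round(current_share + step, 6))

def route_to_candidate(request_id, share):
    """Deterministically send a stable fraction of request IDs to the candidate."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # integer bucket in 0..9999
    return bucket < share * 10_000
```

Hashing the request ID keeps routing stable, so the same request lands on the same model each time it is replayed, and rolling back is just a matter of setting the share to zero. A production system would likely use a statistical test rather than a plain difference of means before moving traffic.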