Inference Glossary

Fine-Tuning

Adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset. Usually done with parameter-efficient methods like LoRA, which freeze the base weights and train added parameters amounting to less than 1% of the model.

Fine-tuning is the transfer-learning technique of taking a pre-trained model and continuing training on a smaller, task-specific dataset. Instead of training from scratch — which requires enormous datasets and compute — fine-tuning starts from a model that has already learned general patterns and refines it for a specific use case. This dramatically reduces the data, compute, and time required to reach high task performance.

The dominant fine-tuning approach in production is **LoRA** (Low-Rank Adaptation). LoRA freezes the pre-trained weights and adds small trainable rank-decomposition matrices to existing model layers, so the trainable parameters amount to less than 1% of the total. This reduces memory and compute requirements enough that LoRA adapters can be trained on a single GPU for most open-weight models, and dozens of adapters can be served concurrently on top of a shared base model. LoRA fine-tunes are also small enough to version, deploy, and roll back like ordinary application configuration.
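To make the low-rank idea concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer. It assumes the standard LoRA formulation, in which the layer output is `W x + (alpha / rank) * B A x`, with `W` frozen and only `A` and `B` trained. The names `LoRALinear`, `matvec`, and `trainable_fraction`, and the layer sizes used, are illustrative assumptions rather than part of any particular library.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """A frozen linear layer plus a trainable low-rank update (illustrative)."""

    def __init__(self, d_in, d_out, rank, alpha, seed=0):
        rng = random.Random(seed)
        # Frozen pre-trained weight W, shape (d_out, d_in); never updated.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factor A, shape (rank, d_in), with a small random init.
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # Trainable factor B, shape (d_out, rank), starts at zero so the
        # adapter leaves the base model unchanged before training.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # W x
        update = matvec(self.B, matvec(self.A, x))  # B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]


def trainable_fraction(d_in, d_out, rank):
    """Trainable adapter parameters as a share of the frozen weight's parameters."""
    return rank * (d_in + d_out) / (d_in * d_out)


layer = LoRALinear(d_in=8, d_out=8, rank=2, alpha=4)
x = [1.0] * 8
print(layer.forward(x) == matvec(layer.W, x))  # True: B is zero before training
print(trainable_fraction(4096, 4096, 8))       # 0.00390625, i.e. about 0.39%
```

For a hypothetical 4096 x 4096 projection at rank 8, the adapter adds 65,536 trainable parameters next to roughly 16.8 million frozen ones. That ratio is why an adapter is cheap to train and small enough to ship as a separate artifact.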

Fine-tuning is most useful for workflows where a smaller model could plausibly handle the task if it learned the specific patterns of your domain. Customer support classification, structured extraction, tone matching, and tool-call generation are typical wins: a fine-tuned 1B- or 3B-parameter model often matches a frontier model on a narrow workflow at a fraction of the cost and latency.

An inference platform that owns observability and routing can spot fine-tune candidates automatically. It identifies workflows where a smaller model would meet the quality bar, trains LoRAs on captured production traffic, evaluates them in shadow against the live model, and migrates traffic gradually once the evaluation data supports the swap. This is the Fine-tune subsystem in Cumulus.
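The evaluate-then-migrate loop described above can be sketched roughly as follows. This is a hypothetical illustration, not the Cumulus implementation: the names `ShadowResult`, `candidate_meets_quality`, and `next_traffic_share`, along with the tolerance, minimum sample count, and step size, are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ShadowResult:
    production_score: float  # grade of the live model's response (served to the user)
    candidate_score: float   # grade of the candidate's response (never served)


def candidate_meets_quality(results, tolerance=0.02, min_samples=100):
    """True when the candidate's mean score is within `tolerance` of production."""
    if len(results) < min_samples:
        return False  # not enough shadow traffic to support a decision yet
    prod = sum(r.production_score for r in results) / len(results)
    cand = sum(r.candidate_score for r in results) / len(results)
    return cand >= prod - tolerance


def next_traffic_share(current, passed, step=0.1):
    """Raise the candidate's traffic share by `step` on a pass; roll back to 0 on a fail."""
    if not passed:
        return 0.0
    # Round to avoid float drift such as 0.30000000000000004, and cap at 100%.
    return min(1.0, round(current + step, 10))


# Hypothetical batch of graded shadow traffic.
results = [ShadowResult(production_score=0.90, candidate_score=0.89)] * 200
share = next_traffic_share(0.2, candidate_meets_quality(results))
print(share)
```

Rolling back to zero on any failed check is a deliberately conservative choice for the sketch. A real system might instead hold the current share, or require several consecutive failures before reverting.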

Related Terms

LLM Evaluation

The practice of grading model outputs against a target — using a stack of synthetic data, deterministic heuristics, calibrated LLM judges, and shadow evaluation against production traffic.

Shadow Evaluation

Running a candidate model in parallel with the production model on real traffic, serving the production response to users, and grading the candidate's response asynchronously. The cleanest way to evaluate a model swap.

Model Weights

The learned numerical parameters of a neural network, stored as large multi-dimensional arrays. The artifact that defines what a trained model does.

Inference

Running a trained model on new inputs to produce outputs — predictions, classifications, generations, embeddings. The production half of machine learning.