Cumulus Labs

Production-grade inference.
Routed, evaluated, and fine-tuned.

Built on our own NVIDIA Grace and Blackwell fleet with custom attention kernels.

Founded by alumni from

Georgia Tech
NASA
Palantir
Space Force
Blackstone
UW Madison
Drop-in

Change one line.
Keep your code.

client.py
  import os
  from openai import OpenAI

  client = OpenAI(
      api_key=os.environ["CUMULUS_KEY"],
-     base_url="https://api.openai.com/v1",
+     base_url="https://api.cumuluslabs.io/v1",
  )
← The whole pitch
Works with OpenAI SDK · Anthropic SDK · LangChain · LlamaIndex · Vercel AI SDK
The Premise

Workflows optimized
top down.

Optimize each node. The whole workflow gets faster, cheaper, and more reliable.

workflow · inbox-triage · p50 218ms · 47k req/day · all healthy
Input: user request
Cumulus · Router: declared rule (api.cumuluslabs.io)
Classify: llama-3.2-1b on Ion · custom kernels
Score urgency: gpt-5-mini via OpenAI
Draft reply: claude-sonnet via Anthropic
Output: JSON response
P50 latency: 218ms
Quality · judge: 97.8%
Cache hit: 62%
Status: ● Live
Why Cumulus

Production-grade,
not demo-grade.

30 to 50%
Throughput

Custom kernels on NVIDIA Grace.

Our Ion engine beats stock vLLM and SGLang on the same chip. More requests per second, lower cost per token.

Failover
Reliability

Provider outages routed around.

Continuous health checks across every provider. When one drops, traffic reroutes before users notice.

Continuous
Quality

More than judge consensus.

Synthetic data, heuristic checks, and LLM judges. Every workflow graded against production traffic in shadow.

The Status Quo

Production is brittle.
Five vendors. Five blind spots.

Disconnected
Routing: Vendor A
Observability: Vendor B
Evaluation: Vendor C
Fine-tuning: Vendor D
Inference: Rented GPU
5 → 1 · Cumulus
Integrated
Cumulus · api.cumuluslabs.io
01 Gateway
02 Router
03 Cache
04 Observability
05 Evaluation
06 Fine-tune
07 Custom hosting
08 Ion
The Platform

Eight subsystems.
One platform.

Designed to work together. Inference at the core, everything else built on top.

01
Gateway
Translate any provider.

OpenAI-compatible HTTP layer. One client works against every provider.
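
To make "translate any provider" concrete, here is a minimal sketch of one translation step: turning an OpenAI-style chat request into the shape Anthropic's Messages API expects (system prompt as a top-level field, `max_tokens` required). The helper name `to_anthropic` and the default of 1024 tokens are assumptions for illustration, not the gateway's actual code.

```python
from __future__ import annotations

from typing import Any


def to_anthropic(request: dict[str, Any], default_max_tokens: int = 1024) -> dict[str, Any]:
    """Hypothetical helper: convert an OpenAI-style chat request to an Anthropic-style one."""
    system_parts: list[str] = []
    messages: list[dict[str, str]] = []
    for msg in request["messages"]:
        if msg["role"] == "system":
            # Anthropic takes the system prompt as a top-level field, not a message.
            system_parts.append(msg["content"])
        else:
            messages.append({"role": msg["role"], "content": msg["content"]})

    translated: dict[str, Any] = {
        "model": request["model"],
        "messages": messages,
        # max_tokens is required on the Anthropic side; fall back to an assumed default.
        "max_tokens": request.get("max_tokens", default_max_tokens),
    }
    if system_parts:
        translated["system"] = "\n\n".join(system_parts)
    return translated
```

A real gateway would also need to map streaming, tool calls, and error formats; this sketch covers only the message layout.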

02
Router
Per-workflow routing.

Declared routing rules pick the model, the provider, the infrastructure. Deterministic and traceable.
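
One way to picture declared, deterministic routing is a static rule table keyed by workflow and step. The sketch below mirrors the inbox-triage example; the `Route` type, the `RULES` table, the step names, and treating Ion as a provider label are all assumptions for illustration, not the actual rule format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    provider: str


# Hypothetical rule table based on the inbox-triage workflow shown earlier.
RULES = {
    ("inbox-triage", "classify"): Route("llama-3.2-1b", "ion"),
    ("inbox-triage", "score_urgency"): Route("gpt-5-mini", "openai"),
    ("inbox-triage", "draft_reply"): Route("claude-sonnet", "anthropic"),
}


def resolve(workflow: str, step: str) -> Route:
    """Return the declared route; the same input always yields the same route."""
    try:
        return RULES[(workflow, step)]
    except KeyError:
        raise LookupError(f"no routing rule declared for {workflow}/{step}") from None
```

Because the lookup is a pure function of the declared table, any routing decision can be traced back to a single rule.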

03
Cache
40 to 70% fewer tokens.

Exact-match, prefix, and an optional semantic cache. Stacked. Cuts input tokens dramatically.
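
The stacking order can be sketched as follows: try an exact match first, then fall back to a shared-prefix hit. This is a simplified model that treats the system prompt as the reusable prefix and leaves out the semantic tier; the `StackedCache` class and its method names are hypothetical.

```python
import hashlib


class StackedCache:
    """Hypothetical two-tier cache: exact match first, then shared prefix."""

    def __init__(self):
        self._exact = {}        # hash of full prompt -> cached response
        self._prefixes = set()  # hashes of system prompts already processed

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def lookup(self, system_prompt: str, user_input: str):
        """Return (tier, response); response is None unless the tier is 'exact'."""
        full_key = self._key(system_prompt + "\x00" + user_input)
        if full_key in self._exact:
            return "exact", self._exact[full_key]
        if self._key(system_prompt) in self._prefixes:
            # Prefix hit: the shared prefix need not be reprocessed, but the
            # model still has to generate a fresh response.
            return "prefix", None
        return "miss", None

    def store(self, system_prompt: str, user_input: str, response: str) -> None:
        self._exact[self._key(system_prompt + "\x00" + user_input)] = response
        self._prefixes.add(self._key(system_prompt))
```

An exact hit skips the model call entirely, while a prefix hit only saves the input tokens of the shared prefix.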

04
Observability
Every request, logged.

Input, output, model, latency, cost, quality. Replayable audit log. Real-time dashboards.

05
Evaluation
Synthetic data + heuristics.

More than judge consensus. Auto-generated rubrics, synthetic test data, deterministic heuristic checks, plus LLM judges. Continuous shadow evaluation on production traffic.
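
Deterministic heuristic checks are the easiest part of this to illustrate. The sketch below scores a batch of shadow outputs against two such checks; the check functions and the `urgency` and `reply` keys are assumptions for illustration, and the rubric and LLM-judge layers are left out.

```python
import json


def valid_json(output: str) -> bool:
    """Heuristic check: the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def has_required_keys(output: str, keys=("urgency", "reply")) -> bool:
    """Heuristic check: the output is a JSON object with the (hypothetical) required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in keys)


def pass_rate(outputs, checks) -> float:
    """Fraction of outputs that pass every check; 0.0 for an empty batch."""
    if not outputs:
        return 0.0
    passed = sum(all(check(o) for check in checks) for o in outputs)
    return passed / len(outputs)
```

Running the same checks over a candidate model's shadow outputs and the production model's outputs gives two pass rates that can be compared directly.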

06
Fine-tune
One-click LoRA training.

Spots candidate workflows from traffic. Trains LoRAs on our fleet. Migrates traffic gradually.
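
Gradual migration is commonly done with a deterministic percentage split, so a given request always lands on the same side while the rollout percentage is raised over time. The sketch below assumes that approach; `use_candidate` and the request-ID bucketing are illustrative, not the platform's documented mechanism.

```python
import hashlib


def use_candidate(request_id: str, rollout_percent: int) -> bool:
    """Hypothetical split: send a stable slice of requests to the fine-tuned model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the request ID to a stable bucket in [0, 100).
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

Because the bucket depends only on the request ID, raising `rollout_percent` from 10 to 50 keeps every request that was already on the candidate model and adds new ones.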

07
Custom hosting
Bring open weights.

Host open-weight models or your own fine-tunes on Ion. Cheaper than direct cloud GPU rental.

08
Ion
30 to 50% more throughput.

Our inference engine on NVIDIA Grace. Custom attention kernels beat vLLM and SGLang.

Real Deployments

Where it lands.

Anonymized examples. Real deployment shapes.

01 SaaS · Multi-Provider Reliability

Routing logic in if-else statements.

Situation

Three providers across reasoning, summarization, and classification. Failover paths written, never tested. One outage takes the product down for six hours.

Cumulus

Declared routing rules. Every provider health-checked continuously. Failover routed automatically before the user notices.
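
In contrast to hand-written if-else failover, the pattern described here can be sketched as health state plus an ordered preference list. The `ProviderHealth` class, the three-failure threshold, and `pick_provider` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProviderHealth:
    name: str
    healthy: bool = True
    consecutive_failures: int = 0

    def record(self, ok: bool, threshold: int = 3) -> None:
        """Update health from one check; mark unhealthy after `threshold` straight failures."""
        if ok:
            self.consecutive_failures = 0
            self.healthy = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= threshold:
                self.healthy = False


def pick_provider(preferred_order) -> str:
    """Return the first healthy provider in preference order."""
    for provider in preferred_order:
        if provider.healthy:
            return provider.name
    raise RuntimeError("no healthy provider available")
```

Since the health checks run continuously rather than on the request path, the failover branch is exercised all the time instead of only during an outage.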

02 Healthtech · Continuous Quality

Frontier-only by default.

Situation

Mistakes have legal consequences. The team suspects cheaper models could carry most traffic, but can't justify the engineering work to verify which ones.

Cumulus

Shadow evaluation runs against production traffic. Synthetic data plus heuristic checks plus LLM judges surface the safe swaps. Approve one workflow at a time.

03 Voice AI · Sub-Second Latency

Frontier latency too high. Open-weight too slow.

Situation

Voice agents need responses under a second. Frontier model latency overshoots. Rented GPUs add cold-start time and unpredictable throughput.

Cumulus

Fine-tuned LoRA trained on production traffic. Served on Ion's custom kernels. Throughput uplift translates directly to lower end-to-end latency.

04 Enterprise IT · Unified Telemetry

Four tools across three clouds.

Situation

AI tools built across Azure OpenAI, Bedrock, and Vertex AI. Different APIs, different dashboards, different audit logs. Per-tool attribution missing.

Cumulus

One client across all three clouds. Per-tool attribution by default. A single audit log built for compliance review.