
bun:llm

GGUF LLM inference, BERT sentence encoders, Whisper STT, and an OpenAI-compatible HTTP server — all built into the runtime.

```ts
import llm from "bun:llm";
```

bun:llm is an in-tree native inference stack. Models are mmapped off disk; weights stay device-resident on CUDA / Metal so per-token traffic is a 4-byte argmax. Three model classes ship today:

- LLM: GGUF chat and completion models
- Encoder: BERT-family sentence embeddings
- WhisperModel: speech-to-text

Plus llm.serve(...) — an OpenAI-compatible HTTP wrapper that points any of the above at :11434.

LLM — chat and completion

```ts
import { LLM } from "bun:llm";

using m = await LLM.load("./Llama-3.2-1B-Instruct-Q4_K_M.gguf");

for await (const piece of m.chat([
  { role: "system", content: "You are helpful and concise." },
  { role: "user", content: "What is the capital of France?" },
])) {
  process.stdout.write(piece);
}
```

LLM.load(path, opts?)

Loads a GGUF file. Detects the architecture (general.architecture) and chat template (tokenizer.chat_template). Returns an LLM instance.

| Option | Default | Description |
| --- | --- | --- |
| device | auto | "cuda", "metal", or "cpu". Auto-probes in [metal, cuda, cpu] order. |
| kvCacheSize | 4096 | Token capacity of the KV cache (per request). |
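
For example, pinning the device and enlarging the KV cache:

```ts
import { LLM } from "bun:llm";

using m = await LLM.load("./Llama-3.2-1B-Instruct-Q4_K_M.gguf", {
  device: "metal",   // skip auto-probing
  kvCacheSize: 8192, // room for longer conversations per request
});
```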

The instance is Disposable: using m = ... releases device buffers on scope exit. Manual m[Symbol.dispose]() works too.
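
Where using isn't convenient, a manual-dispose sketch:

```ts
import { LLM } from "bun:llm";

const m = await LLM.load("./Llama-3.2-1B-Instruct-Q4_K_M.gguf");
try {
  // ... chat / generate calls ...
} finally {
  m[Symbol.dispose](); // release device buffers explicitly
}
```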

m.chat(messages, opts?)

Async iterator yielding text pieces. messages is an array of { role, content } with role in {system, user, assistant}. The chat template detected from the GGUF wraps messages with the model's expected delimiters.

| Option | Default | Description |
| --- | --- | --- |
| maxTokens | 512 | Hard ceiling on generated tokens. |
| stopTokens | model-specific | Token IDs that end generation. Defaults to EOS + chat-template terminator (e.g. <\|eot_id\|>). Pass [] to disable. |
| includePrompt | false | Echo the rendered prompt as the first piece. |
| temperature | 0 | 0 = greedy / argmax. >0 enables sampling. |
| topK, topP | | Nucleus filter applied before sampling. |
| seed | random | Mulberry32 seed for reproducible sampling. |
| grammar | | GBNF source. Only tokens that keep the grammar in an acceptable state are sampled. |
| schema | | JSON Schema. Compiled to a grammar internally. |
| logitBias | | Map<tokenId, number> added to logits before sampling. |
| prefix | | Reuse a PrefixCache from m.prefix(text) to skip prefill on a shared system prompt. |

Mutually exclusive: pass either grammar or schema, not both.
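
For example, constraining output with a JSON Schema (a sketch; the schema itself is illustrative):

```ts
import { LLM } from "bun:llm";

using m = await LLM.load("./Llama-3.2-1B-Instruct-Q4_K_M.gguf");

// The schema is compiled to a grammar internally, so only schema-valid
// tokens are ever sampled and the accumulated output parses as JSON.
let out = "";
for await (const piece of m.chat(
  [{ role: "user", content: "Give me a city and its country as JSON." }],
  {
    maxTokens: 128,
    schema: {
      type: "object",
      properties: { city: { type: "string" }, country: { type: "string" } },
      required: ["city", "country"],
    },
  },
)) {
  out += piece;
}

console.log(JSON.parse(out)); // e.g. { city: "Paris", country: "France" }
```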

m.generate(prompt, opts?)

Same options as chat, but takes a raw string. No template wrap — useful when you want the bare BPE-tokenizer / decoder pipeline.
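
A short sketch of raw completion with sampling, using the options from the table above:

```ts
import { LLM } from "bun:llm";

using m = await LLM.load("./Llama-3.2-1B-Instruct-Q4_K_M.gguf");

let text = "";
for await (const piece of m.generate("Once upon a time", {
  maxTokens: 64,
  temperature: 0.8, // >0 switches from greedy to sampling
  topP: 0.9,
  seed: 42,         // reproducible Mulberry32 sampling
})) {
  text += piece;
}
console.log(text);
```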

m.embed(text) (if the GGUF has tied embedding weights)

Returns the model's hidden state for the last token. Most decoder-only chat models don't expose this usefully — for sentence embeddings, use the Encoder class instead.

m.prefix(text) / m.prefixChat(messages)

Returns a PrefixCache — the KV cache snapshot after running text (or the templated chat prefix) through the model. Pass it as opts.prefix on subsequent calls to skip the prefill cost. Useful when many requests share a common system prompt.
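
A sketch of reusing a shared system prompt across requests; the await on prefixChat is defensive and harmless if the call is synchronous:

```ts
import { LLM } from "bun:llm";

using m = await LLM.load("./Llama-3.2-1B-Instruct-Q4_K_M.gguf");

// Prefill the system prompt once...
const sys = await m.prefixChat([
  { role: "system", content: "You are helpful and concise." },
]);

// ...then reuse the snapshot so each request only prefills the user turn.
for (const question of ["Capital of France?", "Capital of Japan?"]) {
  for await (const piece of m.chat([{ role: "user", content: question }], {
    prefix: sys,
  })) {
    process.stdout.write(piece);
  }
  process.stdout.write("\n");
}
```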

Encoder — BERT-family sentence embeddings

```ts
import { Encoder } from "bun:llm";

using enc = await Encoder.load("./bge-small-en-v1.5.gguf");
const vec = enc.embed("hello world");              // Float32Array of length dModel
const vecs = enc.embedBatch(["hello", "world"]);   // Float32Array[]
```

Targets general.architecture="bert" GGUFs. Bidirectional attention, post-LN residuals, GELU FFN, WordPiece tokenizer. Pooling defaults to whatever the GGUF says; pass { pool: "cls" | "mean" } to override. Outputs are L2-normalized by default — toggle with { normalize: false }.
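
Because outputs are L2-normalized by default, a plain dot product gives cosine similarity:

```ts
import { Encoder } from "bun:llm";

using enc = await Encoder.load("./bge-small-en-v1.5.gguf");

const a = enc.embed("the cat sat on the mat");
const b = enc.embed("a feline resting on a rug");

// With normalized vectors, cosine similarity reduces to a dot product.
let cos = 0;
for (let i = 0; i < a.length; i++) cos += a[i] * b[i];
console.log(cos.toFixed(3));
```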

WhisperModel — speech-to-text

```ts
import llm from "bun:llm";
import audio from "bun:audio";

const wav = audio.readWav(new Uint8Array(await Bun.file("clip.wav").arrayBuffer()));
const m = await llm.WhisperModel.load("./ggml-tiny.en.bin");
const text = m.transcribe(wav.samples, { language: "auto", beamSize: 5 });
```

Loads whisper.cpp ggml-*.bin files. Both formats supported:

Tensor types: F32, F16, Q4_0, Q5_0, Q5_1, Q8_0. Quantized weights are dequantized at load time.

WhisperModel.load(path)

Reads the .bin, transposes encoder + cross-attn weights for gpu.matmul, wraps decoder weights in GpuFloat32Array for gpu.matVec. The model object is reusable — load once per process and cache.

m.transcribe(audio, opts?)

Single high-level call. audio is mono 16 kHz Float32Array PCM in [-1, 1]. Audio longer than 30 s is split into non-overlapping 30-second chunks; chunks whose RMS falls below 1e-4 are skipped. Whisper's literal silence annotations ([BLANK_AUDIO], [silent], [music], [inaudible]) are stripped from the output.

| Option | Default | Description |
| --- | --- | --- |
| language | "en" | ISO-639-1 code, or "auto" to detect (multilingual only). |
| maxTokens | 224 | Per-chunk token ceiling. |
| beamSize | 1 | 1 = greedy. >1 runs beam search with cumulative log-prob ranking and KV state forking. Cross-attn K/V is shared by reference across beams. |

m.transcribeMel(mel, T, opts?)

Lower-level entry point. mel is a flat [nMels, T] row-major Float32Array. Use this when you've already computed the mel spectrogram (e.g. via audio.melSpectrogram(audio, { mode: "whisper" })) or when integrating with a custom audio source.
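
A sketch of the lower-level path; the return shape of audio.melSpectrogram (flat mel data plus frame count, destructured below) is an assumption, not documented here:

```ts
import llm from "bun:llm";
import audio from "bun:audio";

const wav = audio.readWav(new Uint8Array(await Bun.file("clip.wav").arrayBuffer()));
const m = await llm.WhisperModel.load("./ggml-tiny.en.bin");

// Hypothetical return shape: flat [nMels, T] row-major data plus frame count T.
const { data: mel, frames: T } = audio.melSpectrogram(wav.samples, { mode: "whisper" });

const text = m.transcribeMel(mel, T, { language: "en" });
console.log(text);
```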

m.detectLanguage(mel, T) / m.detectLanguageFromEncoder(encoded)

Multilingual only. Runs the encoder + a single decoder step from [<|startoftranscript|>] and picks the language token with the highest logit. Returns { language: string, prob: number }.
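
Continuing the mel sketch above, language detection on a multilingual model might look like this:

```ts
// mel and T as in the previous sketch, but with a multilingual model
// (e.g. ggml-tiny.bin rather than the English-only ggml-tiny.en.bin).
const { language, prob } = m.detectLanguage(mel, T);
console.log(`detected ${language} (p=${prob.toFixed(2)})`);
```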

Performance

Release build, NVIDIA RTX 4070 Ti, JFK 11-second sample on ggml-tiny.en:

| Stage | Time | Notes |
| --- | --- | --- |
| CPU debug, no cache | 93 s | Original implementation. |
| + KV cache | 29 s | Cross-attn K/V cached per encode; self-attn K/V appended per step. |
| + release build | 14.8 s | JIT optimizations. |
| + CUDA encoder + decoder | 1.6 s | Encoder im2col conv + matmuls + per-head batched attention; decoder per-token matVecs + LM head. |

base.en (4× the parameters of tiny.en) runs in 3.07 s for 11 s of audio (~3.6× real-time). Beam search ≥ 2 typically adds <10% wall-clock thanks to early termination when no active beam can catch up to a finished one.

llm.serve — OpenAI-compatible HTTP server

```ts
import llm from "bun:llm";

const m = await llm.LLM.load("./Llama-3.2-1B-Instruct-Q4_K_M.gguf");
llm.serve({ engine: m, modelId: "llama-3.2-1b", port: 11434 });
```

Routes:

Default port is 11434, matching ollama. Optional apiKey enables Authorization: Bearer ... checks. maxConcurrent (default 1) is a FIFO concurrency gate — useful when single-GPU inference doesn't pipeline.
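
Assuming the server exposes the standard OpenAI route names (implied by "OpenAI-compatible" but not listed here), a client call might look like:

```ts
// Hypothetical client call; the /v1/chat/completions route and payload shape
// follow the OpenAI convention and are assumed rather than documented above.
const res = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama-3.2-1b",
    messages: [{ role: "user", content: "What is the capital of France?" }],
  }),
});
const body = await res.json();
console.log(body.choices?.[0]?.message?.content);
```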

The engine argument is duck-typed: anything with .chat() / .generate() / .embed() works. Plug in fakes / test doubles for local dev.
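
A minimal fake engine for tests (a sketch; it assumes the server consumes .chat() / .generate() as async iterables, matching the LLM class above):

```ts
import llm from "bun:llm";

// Stub engine: satisfies the duck-typed contract without loading a model.
const fake = {
  async *chat(_messages: { role: string; content: string }[]) {
    yield "stub reply";
  },
  async *generate(_prompt: string) {
    yield "stub completion";
  },
  embed: (_text: string) => new Float32Array(384),
};

llm.serve({ engine: fake, modelId: "fake-model", port: 11434 });
```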

Low-level building blocks

When the high-level classes don't fit, the underlying components are exported:

Use them when you want to drive the forward pass yourself, share weights across instances, or stream KV-cache snapshots between requests.

Limits