bun:gpu
GPU-accelerated vector + matrix primitives. Metal on macOS, CUDA on Linux + Windows, CPU fallback everywhere.
import gpu from "bun:gpu";

bun:gpu is the device-dispatch layer. The same API works on Metal, CUDA, and CPU — backends register themselves via probe + capability, and bun:gpu picks the best one available. The CPU backend forwards to bun:simd, so unsupported hosts still get vectorized code paths.
Backend
activeBackend() / hasBackend(name) / setBackend(choice)
gpu.activeBackend(); // "cuda" | "metal" | "cpu"
gpu.hasBackend("cuda"); // boolean — does the binary include the backend AND does the host support it
gpu.setBackend("cpu"); // force CPU; useful for tests
gpu.setBackend("auto"); // re-probeProbe order is [metal, cpu] on macOS and [cuda, cpu] elsewhere. cpu always probes true — it's the floor.
winsForSize(op, n, elemBytes)
Returns true when the active backend's calibrated crossover says GPU beats CPU at this size. Use it to gate dispatch:
if (gpu.winsForSize("matVec", nRows * nCols, 4)) {
  return gpu.matVec(matrix, vector, nRows, nCols);
}
return simd.matVec(matrix, vector, nRows, nCols);

The CPU backend always returns false, so if (winsForSize(...)) falls through to the bun:simd fallback.
calibrate()
Sweeps the real GPU kernel against bun:simd at a small set of sizes, persists the measured crossover under ~/.cache/parabun/gpu-calibrate-<hash>.json, and rehydrates it on subsequent process starts. Intended to be called once at app boot — the sweep takes 200–500ms. Setting BUN_PARABUN_SKIP_CALIBRATION=1 bypasses the cache read on module load.
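A minimal boot-time sketch, assuming calibrate() can be called synchronously as described (the size passed to winsForSize is illustrative):

import gpu from "bun:gpu";

gpu.calibrate(); // first run: 200–500ms sweep, persisted under ~/.cache/parabun/
// later starts rehydrate the cached crossover instead of re-sweeping
gpu.winsForSize("matVec", 4096 * 4096, 4); // now answered from measured data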
Residency
GPU calls take typed arrays or device-resident handles. Wrap a Float32Array once with GpuFloat32Array and the bytes are HtoD-uploaded at construction; subsequent ops use the device buffer with no extra crossing. Disposal is GC-finalized, but explicit using disposal is preferred.
import gpu from "bun:gpu";
using mat = new gpu.GpuFloat32Array(weights); // HtoD on construction
for (const q of queries) {
  const scores = gpu.matVec(mat, q, M, K); // q HtoDs, mat is already there
}
// `mat` released at scope exit

Manual residency:
const handle = gpu.hold(typedArray); // returns GpuHandle
gpu.matVec(handle, vector, M, K);
gpu.release(handle);

holdQ4K / holdQ6K accept raw quantized weight bytes — the device buffer holds the Q4_K / Q6_K super-block layout and dispatches an on-chip dequant kernel inside matVec. Used by bun:llm for the Q4_K_M / Q6_K Llama paths.
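A hedged sketch of the quantized path, assuming holdQ4K returns a GpuHandle like hold() does and q4kBytes already contains the Q4_K super-block layout:

const qWeights = gpu.holdQ4K(q4kBytes);            // raw Q4_K bytes, uploaded once
const logits = gpu.matVec(qWeights, hidden, M, K); // on-chip dequant inside the kernel
gpu.release(qWeights);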
Vector ops
dot(a, b)
Vector dot product. Accepts typed arrays or handles for either side.
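For example, mixing a resident handle with a plain typed array:

const a = gpu.hold(new Float32Array([1, 2, 3]));
const b = new Float32Array([4, 5, 6]);
gpu.dot(a, b); // 32
gpu.release(a);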
matVec(matrix, vector, nRows, nCols)
matrix is [nRows, nCols] row-major, vector is length nCols. Returns [nRows]. The hot path inside bun:llm's decoder step — every Q/K/V/O projection plus the LM head goes through here.
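A worked shape example, assuming the result comes back as a Float32Array of length nRows:

// 2×3 row-major matrix times a length-3 vector → length-2 result
const A = new Float32Array([
  1, 2, 3,
  4, 5, 6,
]);
const x = new Float32Array([1, 0, -1]);
gpu.matVec(A, x, 2, 3); // Float32Array [-2, -2]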
matmul(A, B, m, k, n, out?)
A is [m, k], B is [k, n], returns [m, n]. CUDA backend uses an 8×8 register-tiled NVRTC kernel. Pass out to write into a caller-owned destination buffer (avoids one allocation per call when sweeping).
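A sketch of the out-buffer reuse the docs mention (dimensions are illustrative):

const m = 64, k = 128, n = 64;
const A = new Float32Array(m * k);
const B = new Float32Array(k * n);
const out = new Float32Array(m * n); // caller-owned destination
for (let step = 0; step < 100; step++) {
  gpu.matmul(A, B, m, k, n, out); // written in place, no per-call allocation
}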
simdMap(fn, a)
Element-wise map. fn is a JS function (x, i) => number. The runtime translates supported function bodies to PTX (CUDA) or MSL (Metal) and dispatches them as a single kernel — no per-element call overhead. Supported subset: arithmetic, Math.*, ternaries, and conditional if. Bodies outside that subset fall back to the CPU path.
const y = gpu.simdMap(x => x * x + 1, input); // compiled to PTX/MSL

Reductions
gpu.reduce(input, "sum"); // | "min" | "max"
gpu.scan(input); // exclusive prefix sum
gpu.argMin(input);
gpu.argMax(input);
gpu.histogram(input, bins, min, max);
gpu.median(input);
gpu.quantile(input, q);
gpu.variance(input, ddof?);
gpu.stddev(input, ddof?);

CUDA reduce (sum/min/max) and atomic-privatized histogram ship as device kernels today. Scan, argMin/argMax, variance, median/quantile have device kernels for some shapes and CPU correctness paths for the rest — all on the same dispatch surface, so the call site doesn't change as kernels land.
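A quick tour of the reduction surface (expected values in comments, assuming the usual even-length median convention):

const xs = new Float32Array([3, 1, 4, 1, 5, 9, 2, 6]);
gpu.reduce(xs, "sum");       // 31
gpu.argMax(xs);              // 5 (index of the 9)
gpu.median(xs);              // 3.5
gpu.histogram(xs, 4, 0, 10); // counts per bin over [0, 10)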
Image
conv2D(input, kernel, iH, iW, kH, kW)
2D valid-mode correlation. Used by bun:image for blur / sharpen / edge-detect. f32 only for v1.
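Valid mode means the output is (iH - kH + 1) × (iW - kW + 1); a box-blur sketch:

// 4×4 input, 3×3 box kernel → 2×2 output
const img = new Float32Array(16).fill(1);
const box = new Float32Array(9).fill(1 / 9);
gpu.conv2D(img, box, 4, 4, 3, 3); // Float32Array of length 4, each ≈ 1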
imageBlurRGBA(input, width, height, sigma)
Separable Gaussian on RGBA8 — used internally by image.blur and image.sharpen's prefilter. Calls into the same CUDA / Metal kernel paths.
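A sketch, assuming the RGBA8 input is an interleaved Uint8Array of width * height * 4 bytes:

const width = 256, height = 256;
const rgba = new Uint8Array(width * height * 4); // interleaved RGBA8 pixels
const blurred = gpu.imageBlurRGBA(rgba, width, height, 2.0); // sigma = 2.0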
Allocators
alloc(n, type)
Returns a typed array (Float32Array | Float64Array) backed by pinned host memory when the active backend benefits from it. On CUDA, pinned memory cuts HtoD latency by ~2-3× on large transfers.
isAligned(arr)
True when the underlying buffer satisfies the active backend's alignment requirement (16-byte for current CUDA / Metal kernels).
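A sketch combining the two, assuming type is a string tag like "f32" / "f64" matching the documented return types:

const buf = gpu.alloc(1 << 20, "f32"); // pinned Float32Array on CUDA hosts
gpu.isAligned(buf);                    // expected true: allocator meets the 16-byte requirement
const plain = new Float32Array(7);
gpu.isAligned(plain);                  // may be false, depending on the host allocator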
Backend specifics
CUDA
Driver API via bun:ffi against libcuda.so.1. NVRTC compiles dynamic kernels (simdMap); static PTX is shipped for matVec, matmul, dot, reduce, histogram, and the quantized-matVec variants. Shared-memory tile sizes are tuned for SM 8.x (Ampere) and SM 9.x (Hopper); SM 7.x (Turing) hits a fallback launch shape.
Metal
Obj-C FFI to MTLDevice + MTLComputePipelineState. Zero-copy via Apple's unified memory — hold() is essentially free. MSL source is generated from JS for simdMap; static MSL is shipped for the rest.
CPU
Forwards every op to bun:simd. Always available — useful for tests and CI hosts without a GPU.
Limits
- f64 matmul / matVec are CUDA-only on NVIDIA's higher-precision SKUs; consumer cards trap to a much slower path. bun:gpu runs f64 on CPU instead.
- The dynamic kernel compiler (simdMap) supports arithmetic + Math.* + ternaries. No control flow beyond that — branch-heavy bodies stay on CPU.
- Two GpuHandles on different backends can't be mixed in one call (e.g. you can't pass a CUDA handle to a Metal kernel). The active backend at the call site determines which backend the inputs must belong to.