parabun

bun:arrow

In-memory columnar tables, compute kernels, and an Arrow IPC reader/writer — wire-compatible with apache-arrow, pyarrow, polars, and duckdb.

```ts
import arrow from "bun:arrow";
```

Apache Arrow's columnar model, in-process, with no npm dep on apache-arrow. RecordBatches are typed-array views with optional validity bitmaps; tables are sequences of batches sharing a schema. The Arrow IPC streaming + file formats round-trip both directions against the canonical implementations.

Building tables

recordBatch({ ... })

Takes a map of column name → values and infers each column's type. Values can be typed arrays, arrays of strings, or arrays of nested arrays (inferred as list types):

```ts
const batch = arrow.recordBatch({
  age:    new Int32Array([25, 30, 35]),
  score:  new Float64Array([0.95, 0.82, 0.71]),
  name:   ["alice", "bob", "carol"],
  tags:   [["a", "b"], [], ["c", "d", "e"]],   // list<utf8>
});

batch.numRows;                    // 3
batch.column("age").get(0);       // 25
batch.column("tags").get(2);      // ["c", "d", "e"]
```

table(batches)

Concatenates batches that share a schema. A Table has .column(name), which returns a ConcatColumn — a virtual view across batches. Pass it to any compute function to get a table-wide aggregate.
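A minimal sketch (assuming the compute functions such as mean hang off the default export, like filter and sort):

```ts
const a = arrow.recordBatch({ age: new Int32Array([25, 30]) });
const b = arrow.recordBatch({ age: new Int32Array([35, 40]) });

// One schema, two batches; .column() spans both without copying.
const tbl = arrow.table([a, b]);
arrow.mean(tbl.column("age"));   // 32.5, aggregated across batch boundaries
```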

fromRows(rows, opts?) / toRows(source)

Bridge between row-shaped JS data and the columnar form. fromRows is the typical entry point from bun:csv output:

```ts
import csv from "bun:csv";
import arrow from "bun:arrow";

const rows: any[] = [];
for await (const r of csv.parseCsv(Bun.file("data.csv"), { header: true })) rows.push(r);
const tbl = arrow.fromRows(rows);
```
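toRows goes the other way, handy for spot-checking columnar data as plain objects (a sketch; the row shape is assumed to match the one filter predicates see):

```ts
const back = arrow.toRows(tbl);   // one object per row, keyed by column name
back.length;                      // row count of the table
```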

Compute primitives

All take a Column or ConcatColumn. Numeric reductions return a single scalar; the predicate-style functions return a column or a new batch.

| Function | Description |
| --- | --- |
| sum, mean | Kahan-compensated sum / mean. |
| min, max | Skips nulls. NaN propagation matches IEEE 754. |
| argMin, argMax | First-occurrence tie-break, NaN-aware. |
| count | Counts non-null entries. |
| variance(col, { ddof? }), stddev | Welford accumulator. ddof=0 (population) by default. |
| quantile(col, q), median(col) | Sorts internally; honors nulls. |
| distinct(col) | Returns the unique values as a typed array (or string set for utf8). |
| cumsum(col), diff(col) | New column of running totals / first differences. |
| concat(col) | Materializes a ConcatColumn into a single typed array. |
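A few of these in use, assuming (as above) that they are exposed on the default export:

```ts
const score = batch.column("score");

arrow.sum(score);                      // Kahan-compensated total
arrow.variance(score, { ddof: 1 });    // sample variance; default ddof=0 is population
arrow.quantile(score, 0.9);            // sorts internally
arrow.distinct(batch.column("name"));  // unique values
```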

filter(batch, predicate)

Returns a new RecordBatch keeping rows where predicate(row) is truthy. Predicate sees a row-shaped object keyed by column name.

```ts
const adults = arrow.filter(batch, row => row.age >= 30);
```

groupBy(batch, keys, aggs)

Hash group-by. keys is a string or array of column names; aggs is a map of output-name → { column, op }. Supported ops: sum, mean, min, max, count, variance, stddev, distinct.

```ts
const result = arrow.groupBy(batch, "city", {
  rows:    { column: "name",  op: "count" },
  avgAge:  { column: "age",   op: "mean"  },
  topScore:{ column: "score", op: "max"   },
});
```

sort(batch, by, opts?)

Stable sort by one or more keys. by is string | string[] | { name, descending?: boolean }[]. Returns a new batch with rows reordered.
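For example, a sketch of the multi-key form (the city column is assumed, as in the groupBy example):

```ts
// City ascending, then age descending within each city. Stable, so rows
// with equal keys keep their original relative order.
const sorted = arrow.sort(batch, [
  { name: "city" },
  { name: "age", descending: true },
]);
```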

Arrow IPC

Streaming format

```ts
const bytes = arrow.toIPC(table);          // Uint8Array
const restored = arrow.fromIPC(bytes);     // Table
```

The stream is continuation-prefixed Schema + RecordBatch messages with FlatBuffers metadata (hand-rolled builder/reader; no npm dep), 8-byte-aligned body buffers, and an EOS marker. DictionaryBatch decode is implemented for round-tripping apache-arrow's default Dictionary<Utf8> for string columns.
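The resulting byte layout, per the Arrow IPC spec (lengths are little-endian int32; a comment sketch, not actual output):

```ts
// [0xFFFFFFFF][int32 metaLen][Schema flatbuffer]                   <- schema message
// [0xFFFFFFFF][int32 metaLen][RecordBatch flatbuffer][body bufs]   <- one per batch
// ...
// [0xFFFFFFFF][0x00000000]                                         <- end-of-stream
```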

File format

Pass "file" as the second arg to write the ARROW1-bracketed file format:

```ts
const fileBytes = arrow.toIPC(table, "file");   // ARROW1 + messages + EOS + Footer + len + ARROW1
const restored = arrow.fromIPC(fileBytes);      // auto-detects via head/tail magic
```

The Footer flatbuffer carries a redundant copy of the schema plus a list of Block { offset, metaDataLength, bodyLength } entries pointing at each RecordBatch / DictionaryBatch — random-access on read.

fromIPC auto-detects the format: if the bytes start with ARROW1\0\0 and end with ARROW1, the file-format path is taken (the Footer's schema and Block list drive the decode); otherwise it falls through to the streaming reader. Same callsite, both formats.
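A sketch of that check (assumed equivalent, not the actual source):

```ts
const dec = new TextDecoder();
const isFile =
  dec.decode(bytes.subarray(0, 6)) === "ARROW1" &&            // head magic (padded to 8 bytes)
  dec.decode(bytes.subarray(bytes.length - 6)) === "ARROW1";  // tail magic
```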

Type coverage

| Logical kind | In-memory storage | IPC type ID | Notes |
| --- | --- | --- | --- |
| int32 | Int32Array | Int(32, signed) | Reads narrow int8/int16/uint8/uint16 by widening. |
| int64 | BigInt64Array | Int(64, signed) | Reads uint32 by widening (zero-extend). uint64 throws — no lossless target. |
| float32 | Float32Array | FloatingPoint(SINGLE) | |
| float64 | Float64Array | FloatingPoint(DOUBLE) | |
| bool | Uint8Array (one byte/value) | Bool | Bit-packed on the wire. |
| utf8 | string[] | Utf8 | |
| list&lt;T&gt; | Int32Array offsets + recursive child column | List | Depth-first FieldNode + buffer walk. Lists of lists work. |

Date / Time / Timestamp from upstream Arrow streams are coerced to int32 / int64 on read (unit and timezone metadata dropped). Round-tripping re-emits them as plain ints.
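So a pyarrow stream carrying, say, timestamp[ms, tz=UTC] still round-trips its values, just not its metadata (bytesFromPyarrow is a hypothetical input; the int64 return shape is assumed):

```ts
const t = arrow.fromIPC(bytesFromPyarrow);
t.column("ts").get(0);   // raw epoch milliseconds as int64; unit and tz are gone
```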

Wire compat

bench/parabun-arrow-ipc-interop/ round-trips both directions against apache-arrow@21.1.0:

- Mixed-type table (Int8, Uint16, Uint32, Int32, Float64, Date64, Dictionary<Utf8>, List<Float64>) round-trips bit-for-bit through both formats.
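The same check is easy to reproduce inline with the canonical JS implementation (tableFromIPC / tableToIPC are the apache-arrow APIs):

```ts
import { tableFromIPC, tableToIPC } from "apache-arrow";

const theirs = tableFromIPC(arrow.toIPC(table));   // Parabun -> apache-arrow
const ours = arrow.fromIPC(tableToIPC(theirs));    // apache-arrow -> Parabun
```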

Output you can read elsewhere

The bytes Parabun produces are the same wire format pyarrow, arrow-rs, nanoarrow, polars, and duckdb consume on the streaming + file paths. Save with a .arrow extension:

```ts
await Bun.write("data.arrow", arrow.toIPC(table, "file"));
```

Then in Python:

```py
import pyarrow.feather as feather

tbl = feather.read_table("data.arrow")   # Feather V2 is the Arrow IPC file format
```

Parquet

fromParquet(bytes) reads and toParquet(source, opts?) writes Apache Parquet files: a hand-rolled Thrift compact-protocol codec, Snappy compressor + decompressor, dictionary + RLE + bit-pack hybrid decoders, and an RLE writer for definition levels — no npm dep.

```ts
// Read
const bytes = new Uint8Array(await Bun.file("rows.parquet").arrayBuffer());
const tbl = arrow.fromParquet(bytes);

// Write
const out = arrow.toParquet(tbl, { compression: "snappy" });
await Bun.write("rows.parquet", out);
```

toParquet options:

| Option | Default | Description |
| --- | --- | --- |
| compression | "snappy" | "uncompressed" \| "snappy" \| "gzip". |

Feature coverage, read vs. write:

| Feature | Read | Write |
| --- | --- | --- |
| Physical types | BOOLEAN, INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY (utf8). INT96 + FIXED_LEN_BYTE_ARRAY pending. | Same set. |
| Encodings | PLAIN, PLAIN_DICTIONARY (alias), RLE_DICTIONARY, RLE. | PLAIN for values, RLE for def levels (no dictionary yet — strings PLAIN-encoded; less compact than pyarrow but correct). |
| Compression | UNCOMPRESSED, SNAPPY, GZIP. LZ4, BROTLI, ZSTD follow when wired. | UNCOMPRESSED, SNAPPY, GZIP. |
| Pages | V1 data pages with def-level null reconstruction; dictionary pages. V2 pages pending. | V1 data pages only. |
| Row groups | Multi-row-group reads. | Single row group. |
| Schemas | Flat columns, required + optional. Nested types need rep-level reconstruction — pending. | Same. |
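A quick round-trip sketch with the gzip option (numRows on the decoded table is an assumption, mirroring the RecordBatch API):

```ts
const gz = arrow.toParquet(tbl, { compression: "gzip" });
arrow.fromParquet(gz).numRows === tbl.numRows;   // true when the round-trip holds
```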

Verified end-to-end against pyarrow output.

What's not here yet