Skip to content

@para/csv

import csv from "@para/csv";

Two output modes:

  • parseCsv(input, opts?) — async iterator of row objects (or string arrays). The classical shape; works for everything.
  • parseColumns(input, { schema, ... }) — Promise of per-column TypedArray buffers. Numeric data lands in compute-ready memory with no per-row Object allocation and no per-cell boxed Number. The thing the JS ecosystem was missing between row-objects libs and DuckDB-WASM.

Both are state machines over UTF-8 bytes; the row-iterator path never materializes the full file. The columnar path materializes the columns (it has to — the result IS the materialization), but each column ends as one tight TypedArray rather than 1M boxed values.

parseColumns(input, { schema, headers?, delimiter?, quote? })

Section titled “parseColumns(input, { schema, headers?, delimiter?, quote? })”
const cols = await csv.parseColumns(Bun.file("./sensors.csv"), {
schema: { ts: "f64", temp: "f32", sensorId: "i32", label: "string" },
});
// cols.ts is a Float64Array
// cols.temp is a Float32Array
// cols.sensorId is an Int32Array
// cols.label is a string[]

schema maps each column name to a type:

schema valueOutput type
"f32"Float32Array
"f64"Float64Array
"i8" / "u8"Int8Array / Uint8Array
"i16" / "u16"Int16Array / Uint16Array
"i32" / "u32"Int32Array / Uint32Array
"string"string[] (TypedArrays can’t hold strings)

headerstrue (default) treats the first row as headers and matches schema keys against header cell names. false maps schema keys to column indices in declaration order. Or pass an explicit array of header names to skip the lookup.

All of parseCsv’s dialect options apply: delimiter, quote, escape, comment, trim, skipEmptyLines. UTF-8 BOM is stripped automatically.

Empty / missing numeric cells become NaN for floats, 0 for ints. The result objects share no backing buffers with the input.

Most CSV libraries return Array<{col1, col2}> row objects. Each row is a JS Object (~56-byte header) plus a boxed Number for each numeric cell — ~24 bytes per number. For a 1M-row, 4-numeric-column CSV that’s ~120 MB of boxing overhead before you’ve done any actual work.

parseColumns writes straight into TypedArray buffers (one per column), grows them exponentially as rows arrive, and tight-fits the result at end-of-stream. For a 200K-row × 4-numeric-column CSV: 1.4× faster than the row-objects path on this codebase, and the result is 3 MB of contiguous bytes — ready to hand to @para/simd, @para/arrow.fromColumns(), GPU upload, or any other consumer that expects packed numeric data.

import csv from "@para/csv";
import { sum, mean } from "@para/simd";
const cols = await csv.parseColumns("./sensors.csv", {
schema: { temperature: "f32", humidity: "f32" },
});
const meanTemp = sum(cols.temperature) / cols.temperature.length;

parseBatches(input, { schema, batchSize?, ... })

Section titled “parseBatches(input, { schema, batchSize?, ... })”

Async iterator over fixed-size columnar chunks. Lets a caller process arbitrarily large CSVs in O(N) time and O(batchSize) memory without materializing the full column buffers. Default batchSize is 8192 rows.

for await (const batch of csv.parseBatches(Bun.file("./big.csv"), {
schema: { temp: "f32", ts: "f64" },
batchSize: 8192,
})) {
// batch.temp is a Float32Array of up to 8192 rows
// batch.ts is a Float64Array of up to 8192 rows
// process batch — feed to @para/simd, append to an Arrow stream, etc.
}

The final batch is tight-fit to the actual remaining row count; full-size batches share their backing buffers with the typed-array result.

reduceColumns(input, { schema, reducers, ... })

Section titled “reduceColumns(input, { schema, reducers, ... })”

Single-pass reduction over CSV columns — never materializes the data at all. Streaming aggregates per column with O(1) memory per column regardless of input size.

const stats = await csv.reduceColumns(Bun.file("./big.csv"), {
schema: { temp: "f32", humidity: "f32", sensor: "string" },
reducers: {
temp: ["count", "sum", "min", "max", "mean", "stddev"],
humidity: ["mean"],
sensor: ["count"],
},
});
// stats.temp = { count, sum, min, max, mean, stddev }
// stats.humidity = { mean }
// stats.sensor = { count }

Available reducers: count, sum, min, max, mean, variance, stddev. Numeric reductions skip NaN cells. variance and stddev use Welford’s online algorithm — numerically stable even at billions of rows. String columns can only meaningfully be count-reduced.

The classical async iterator path. Returns row objects (or string arrays with header: false). The parser is a state machine over UTF-8 bytes; it never materializes the full file in memory regardless of size.

input can be:

  • Bun.BunFile (recommended for files on disk).
  • ReadableStream<Uint8Array> or AsyncIterable<Uint8Array> (for fetched content, pipes, sockets).
  • Uint8Array or string (for in-memory).
import csv from "@para/csv";
for await (const row of csv.parseCsv(Bun.file("data.csv"), { header: true })) {
process(row.id, row.name, row.score);
}
OptionDefaultDescription
headerstrueWhen true, the first row is the column names; subsequent rows are emitted as objects keyed by column. When false, rows are string[]. Pass an explicit string[] to use them as headers and treat every input row as data.
delimiter","Single-character cell separator. Pass "" to auto-detect from the first non-comment line — the lexer counts , \t ; `
quote"\""Single-character quote that wraps cells with embedded delimiters / newlines.
escapesame as quoteRFC 4180 doubles the quote ("") to escape. Set to "\\" for backslash-escape dialects. When escape !== quote the parser stops treating "" as an escape.
comment"" (off)If set, lines starting with this character (when no field has been opened yet) are skipped entirely.
trimfalseStrip leading and trailing whitespace from each cell. Quoted cells are preserved verbatim — quoting exists to keep their whitespace.
skipEmptyLinestrueDrop wholly-blank rows. With trim: true, all-whitespace rows count as blank. Set false to surface them.
typeInferencetruePer-cell type inference: numeric → number, "true" / "false"boolean, empty → null. Plain strings pass through. Disabled automatically without headers.
skipLines0Skip this many leading rows before header detection.
maxRowsInfinityCap the number of data rows yielded. The header row does not count. Useful for previews.
transformHeadernone(header, index) => string. Maps each header cell before it becomes the object key (or, on the columnar paths, the schema-lookup name). Common use: normalize case so the CSV’s "First Name" matches a first_name schema key.
transformnone(value, column) => string. Maps each cell value before type inference. column is the header name when headers is set, otherwise the column index. parseCsv only — the columnar paths write straight into TypedArrays where a string→string mapping doesn’t fit.
parallelfalseSee Parallel mode below.

A leading UTF-8 BOM (U+FEFF) is stripped from the first chunk automatically; you do not need to handle it. Without headers, every row is an array of strings (no inference).

parallel: true chunks the input across @para/parallel’s worker pool when the input has no quoted cells (the byte-boundary heuristic doesn’t work otherwise). It runs the parse off the main thread.

for await (const row of csv.parseCsv(Bun.file("data.csv"), { header: true, parallel: true })) {
// row processed off main thread
}

This is not a per-file speedup. The serial state machine is already memory-bandwidth-bound, and the parallel path’s materialize-and-fork overhead grows with input size. Sweep on a 16-core x86 release build:

FixtureSerial (med)Parallel (med)Speedup
5 MB · 128k rows152 ms129 ms1.18×
50 MB · 1.25M rows1446 ms1528 ms0.95×
200 MB · 4.92M rows5892 ms6363 ms0.93×

Use parallel: true to keep the event loop responsive while parsing (parsing N files concurrently does scale across cores), not because you expect bigger files to go faster. bench/parabun-csv-parallel/ reproduces the numbers.

@para/csv rows pair naturally with @para/arrow’s fromRows:

import csv from "@para/csv";
import arrow from "@para/arrow";
const rows: any[] = [];
for await (const row of csv.parseCsv(Bun.file("data.csv"), { header: true })) rows.push(row);
const tbl = arrow.fromRows(rows);
arrow.mean(tbl.column("score"));

For very large CSVs, batch the bridge — call arrow.fromRows per N rows instead of materializing them all first.

The inverse of parseCsv — take rows in memory and emit RFC 4180 CSV text.

import csv from "@para/csv";
const text = csv.stringify([
{ id: 1, name: "Ada, Lovelace", note: 'said "hi"' },
{ id: 2, name: "Grace", note: "ok" },
]);
// id,name,note
// 1,"Ada, Lovelace","said ""hi"""
// 2,Grace,ok

rows can be either an array of objects (header row inferred from the union of keys, in first-seen order) or an array of arrays (no header row unless you pass headers explicitly).

OptionDefaultDescription
delimiter","Field delimiter.
quote"\""Quote character.
escapesame as quoteDoubled-quote ("") by default; set to "\\" for backslash-escape output.
newline"\r\n"Row terminator. CRLF is what Excel expects; pass "\n" for Unix-only readers.
headerstrue for object rows / false for array rowstrue emits the inferred header row, false skips it, or pass string[] to specify exactly which columns and in what order.
bomfalsePrefix the output with a UTF-8 BOM (U+FEFF) for tools that need it (mainly Excel for non-ASCII text).

Cells are quoted only when they contain the delimiter, the quote character, the escape character (if distinct), or a CR/LF. null / undefined round-trip as empty cells. Date values stringify as ISO 8601.

// Round-trip preserves the data.
const text = csv.stringify(rows);
const back = [];
for await (const r of csv.parseCsv(text, { headers: true })) back.push(r);
  • Multi-byte delimiters / quotes aren’t supported. RFC 4180 specifies single-byte for both.
  • Parallel mode requires the input has no quoted cells (otherwise byte-boundary chunking can split a quoted region) and a default dialect (no comment, trim, or distinct escape).
  • Type inference is per-cell — there’s no whole-column type promotion. If column score has mostly numbers and one "N/A", you get a mix of number and string; coerce on your end if that’s a problem.