Roadmap

timeline TD
0.1.0 · 9 Mar 2026 : Core DataFrame & Column
                   : SparkSession
                   : Built-in functions
0.2.0 · 15 Mar 2026 : Caching & repartitioning
                    : cube / rollup / pivot / unpivot
                    : DataFrameStat
0.3.0 · 29 Mar 2026 : DataFrameReader/Writer shortcuts
                    : DataFrameWriterV2
                    : Typed error hierarchy
0.4.0 · 14 May 2026 : Documentation site
            : Catalog parity with PySpark
            : TLS, bearer token, connection string
            : Reattach, retry, interrupts
            : RuntimeConfig (spark.conf)
            : Error trailer decode + FetchErrorDetails
            : Java UDF / UDAF registration via spark.udf
0.5.0 · 4 Jul 2026 : Structured Streaming
      : Watermarks & event-time windows
      : Type-driven Arrow decode
      : createDataFrame from local rows
0.6.0 : MERGE INTO builder
      : Session controls & artifacts
      : ~120 more functions
      : DataFrame long-tail
0.7.0 : Arrow batch UDFs
      : Table-valued functions
0.8.0 : Managed platform connectors, Databricks, EMR, possibly more
0.9.0 : Additional runtime & framework integrations

Shipped

See the Changelog for full details on each release.

0.1.0 · 9 March 2026 · core DataFrame, Column, SparkSession, built-in functions
0.2.0 · 15 March 2026 · caching, repartitioning, cube/rollup/pivot/unpivot, DataFrameStat
0.3.0 · 29 March 2026 · reader/writer shortcuts, DataFrameWriterV2, typed error hierarchy
0.4.0 · 14 May 2026 · catalog parity, docs site, transport (TLS, bearer, retry, reattach, interrupts), RuntimeConfig, error-trailer decoding, Java UDF registration
0.5.0 · 4 July 2026 · Structured Streaming, watermarks and event-time windows, type-driven Arrow decode, createDataFrame from local rows, typed row access

0.4.0 · Catalog parity, docs site, transport

Catalog parity with PySpark: the full spark.catalog surface.
spark.udf.registerJavaFunction and registerJavaUDAF for binding Java UDFs already on the server’s classpath to a SQL function name.
Documentation site (this site).
Full sc:// connection-string grammar: use_ssl=true for TLS, bearer token, user_id, user_agent, session_id, grpc_max_message_size, plus arbitrary metadata pass-through. A bearer token over a non-TLS connection is rejected.
Resilience for long-running queries: per-request operation IDs, ReattachExecute iterator that resumes server-streaming responses after transient gRPC drops, configurable RetryPolicy, and interrupts (interruptAll, interruptTag, interruptOperation).
RuntimeConfig on spark.conf (get, set, unset, getAll, isModifiable).
SparkSession.version().
Error-trailer decoding: errorClass, sqlState, messageParameters, plus FetchErrorDetails fallback for errorTypeHierarchy and serverStackTrace.
client_observed_server_side_session_id echo for stale-session detection.
node-tls-behind-proxy example.
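
To make the connection-string grammar in this release concrete, here is a minimal sketch of how such a string could be parsed. It is an illustration only, not this library's implementation: `parseConnectionString` and `ParsedConnection` are hypothetical names, the default port of 15002 is an assumption based on the conventional Spark Connect port, and IPv6 hosts are not handled.

```typescript
// Illustrative only: a hypothetical parser, not this library's implementation.
interface ParsedConnection {
  host: string;
  port: number;
  useSsl: boolean;
  token: string | undefined;
  params: Map<string, string>; // every key=value pair, including pass-through metadata
}

function parseConnectionString(input: string): ParsedConnection {
  const scheme = "sc://";
  if (!input.startsWith(scheme)) {
    throw new Error(`expected an sc:// connection string, got: ${input}`);
  }

  // Split "host:port" from the ";key=value" parameter list that follows "/".
  const rest = input.slice(scheme.length);
  const slash = rest.indexOf("/");
  const authority = slash === -1 ? rest : rest.slice(0, slash);
  const paramText = slash === -1 ? "" : rest.slice(slash + 1);

  const colon = authority.lastIndexOf(":");
  const host = colon === -1 ? authority : authority.slice(0, colon);
  // Assumption: fall back to 15002 when no port is given.
  const port = colon === -1 ? 15002 : Number(authority.slice(colon + 1));
  if (host === "" || !Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid host or port in: ${authority}`);
  }

  const params = new Map<string, string>();
  for (const segment of paramText.split(";")) {
    if (segment === "") continue; // tolerate leading and trailing ";"
    const eq = segment.indexOf("=");
    if (eq <= 0) throw new Error(`malformed parameter: ${segment}`);
    params.set(segment.slice(0, eq), decodeURIComponent(segment.slice(eq + 1)));
  }

  const useSsl = params.get("use_ssl") === "true";
  const token = params.get("token");
  // Mirrors the rule above: a token is only accepted together with TLS.
  if (token !== undefined && !useSsl) {
    throw new Error("a bearer token requires use_ssl=true");
  }
  return { host, port, useSsl, token, params };
}
```

With this sketch, `sc://localhost:15002/;use_ssl=true;token=abc` parses successfully, while the same string without `use_ssl=true` throws before any connection is attempted.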

0.5.0 · Structured Streaming

readStream and writeStream builders, with Trigger factories and output modes.
StreamingQuery: id, runId, name, isActive, stop, awaitTermination, status, lastProgress, recentProgress, processAllAvailable, exception, explain.
StreamingQueryManager on spark.streams: active, get, awaitAnyTermination, resetTerminated, addListener/removeListener.
Listener callbacks: onQueryStarted, onQueryProgress, onQueryIdle, onQueryTerminated.
Event-time aggregation: withWatermark, window(), session_window().
Type-driven Arrow decode: temporals as Date, decimals as fixed-point strings, maps as Map<K, V>, longs always as bigint (count() returns bigint), applied recursively through structs and arrays. show() renders in Spark’s display style.
createDataFrame(rows) from plain JS objects, encoded to Arrow on the client, with an arrowEncoder builder hook for custom runtimes.
Typed access: df.as<Schema>() and the row accessor namespace. df.agg(...) and df.col(name).
Column methods wrap raw primitives as literals, and filter/where accept SQL string predicates.
lit(null) emits a typed NULL literal; lit(undefined) is rejected.
Transport: channel handshake deadline, no-progress reattach ceiling, isSessionInvalidated predicate.
Connection strings reject non-sc:// schemes and userinfo in the host.
node-streaming example.
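
Because longs now always decode as bigint, a result such as count() cannot be mixed directly with number arithmetic. The sketch below shows one way calling code might narrow such a value, and why the decoder avoids number for longs in the first place. `toSafeNumber` is a hypothetical helper, not part of the library, and `rowCount` is a stand-in for a value that would normally come from a query.

```typescript
// Hypothetical helper, not part of the library: narrow a decoded long to a
// JS number only when the conversion is lossless.
function toSafeNumber(value: bigint): number {
  const max = BigInt(Number.MAX_SAFE_INTEGER);
  if (value > max || value < -max) {
    throw new RangeError(`${value} does not fit in a JS number without loss`);
  }
  return Number(value);
}

// Stand-in for a decoded count() result; a real value would come from a query.
const rowCount: bigint = BigInt(1234);
const pages = Math.ceil(toSafeNumber(rowCount) / 100);

// Why number is unsuitable for longs: beyond 2^53, neighbouring integers
// convert to the same double.
const a = BigInt("9007199254740993"); // 2^53 + 1
const collides = Number(a) === Number(a - BigInt(1));

console.log(pages, collides);
```

Keeping values as bigint until a range check passes avoids silent precision loss on large identifiers or counts.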

Planned

0.6.0 · Advanced Features

DataFrameWriterV2.mergeInto: fluent MERGE INTO builder with whenMatched/whenNotMatched/whenNotMatchedBySource and schema evolution.
SparkSession enhancements: newSession, active()/getActiveSession(), addArtifact/addArtifacts, copyFromLocalToFs, progress handlers, executionInfo.
DataFrame long-tail: checkpoint, localCheckpoint, observe, withMetadata, inputFiles, isLocal, transpose, sampleBy, colRegex, to(schema), lateralJoin, toArrow(), toJSON().
~120 additional built-in functions: Variant, XML, URL, geospatial, partition transforms, bitmap and sketch aggregates, extra time helpers, regex variants, try_* variants.
Integration coverage for the remaining file formats: extend tests/integration/ to round-trip Avro, XML, JDBC, and Hive via the generic .format() path. Currently only CSV / JSON / Parquet / ORC / text are exercised end-to-end; the I/O guide carries an “untested” caveat that this item lifts.

0.7.0 · UDFs and Table Functions

Arrow batch UDFs, contingent on Spark Connect protocol support without a JS runtime on executors.
Table-valued function helpers: explode, inline, posexplode, json_tuple, range, stack, variants, collations, sql_keywords.
TableArg with partitionBy/orderBy/withSinglePartition.

callFunction(name, ...cols) already works for server-side UDFs registered by name; closure-based UDFs are what this milestone adds.

0.8.0 · Managed platform connectors

Support for managed Spark Connect deployments. Databricks and AWS EMR are confirmed in scope; other providers may follow, depending on demand and on what the transport layer needs per provider.

The work here is per-provider auth, transport, and connection-string plumbing rather than new DataFrame surface. Every supported provider ships with a runnable, CI-tested example. No example, no support claim.

0.9.0 · Additional runtime & framework integrations

Runtime and framework integrations beyond Node.js.

Which runtimes and frameworks land here depends on user demand and what the session/lifecycle API needs from each. The list isn’t fixed yet.