Comparison to PySpark

spark-connect-js tracks PySpark closely; the differences come from TypeScript vs Python and from the Spark Connect client model itself.

At a glance

The same query in both clients.

# PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, desc

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

result = (
    spark.table("events")
        .filter(col("status") == "active")
        .groupBy("region")
        .agg(sum("revenue").alias("total"))
        .sort(desc("total"))
        .collect()
)

spark.stop()

// spark-connect-js
import { connect, col, sum, desc } from "@spark-connect-js/node";

const spark = connect("sc://localhost:15002");

const result = await spark
  .table("events")
  .filter(col("status").eq("active"))
  .groupBy("region")
  .agg(sum("revenue").alias("total"))
  .sort(desc("total"))
  .collect();

await spark.stop();

Imports

PySpark spreads its public API across submodules. spark-connect-js exports everything from one entry point.

# PySpark
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, lit, when, sum, count, regexp_replace
from pyspark.sql.types import StructType, StructField

// spark-connect-js
import {
  connect, Window,
  col, lit, when, sum, count, regexp_replace,
  StructType, StructField,
} from "@spark-connect-js/node";

Method names are camelCase

PySpark's DataFrame methods are camelCase as well, but a few also have lowercase aliases, such as groupby for groupBy. spark-connect-js ships only the camelCase spelling.

# PySpark
df.groupby("department").agg(count("*").alias("n"))
df.withColumnRenamed("old", "new")
df.createOrReplaceTempView("events")

// spark-connect-js
df.groupBy("department").agg(count("*").alias("n"));
df.withColumnRenamed("old", "new");
df.createOrReplaceTempView("events");

Column operators are methods

TypeScript has no operator overloading, so comparisons and arithmetic on Column are methods. The methods accept raw JS primitives and wrap them as literals, matching PySpark’s implicit coercion.

# PySpark
df.filter((col("age") > 30) & (col("country") == "US"))
df.withColumn("total", col("price") * col("qty"))
df.sort(col("salary").desc())

// spark-connect-js
df.filter(col("age").gt(30).and(col("country").eq("US")));
df.withColumn("total", col("price").multiply(col("qty")));
df.sort(col("salary").desc());

PySpark	spark-connect-js
`==` / `!=`	`.eq()` / `.neq()` (also `.eqNullSafe()` for SQL's null-safe `<=>`)
`>` / `>=`	`.gt()` / `.gte()`
`<` / `<=`	`.lt()` / `.lte()`
`&` / `|`	`.and()` / `.or()`
`+` / `-` / `*` / `/`	`.plus()` / `.minus()` / `.multiply()` / `.divide()`

filter and where also take SQL string predicates, so df.filter("age > 21") ports from PySpark unchanged.

`when()` returns a builder, not a Column

PySpark’s F.when(...) returns a Column. Ours returns a WhenBuilder, terminated with .otherwise(default) or .toColumn() for a NULL default. The separate type means the compiler stops a half-built chain from passing as a column.

when(col("age").gt(18), lit("adult"))
  .when(col("age").gt(12), lit("teen"))
  .otherwise(lit("child"));
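
To see why a separate builder type helps, here is a small self-contained model of the pattern. It is a sketch written for this page, not the library's implementation: ToyColumn, ToyWhenBuilder, the sql field, and select are hypothetical stand-ins, and only the method names when, otherwise, and toColumn come from the API described above.

```typescript
// Toy model of the builder pattern; NOT the spark-connect-js implementation.
// ToyColumn and ToyWhenBuilder are hypothetical stand-ins for Column and WhenBuilder.
class ToyColumn {
  readonly sql: string;
  constructor(sql: string) {
    this.sql = sql;
  }
}

interface ToyBranch {
  condition: ToyColumn;
  value: ToyColumn;
}

class ToyWhenBuilder {
  private readonly branches: readonly ToyBranch[];

  private constructor(branches: readonly ToyBranch[]) {
    this.branches = branches;
  }

  static start(condition: ToyColumn, value: ToyColumn): ToyWhenBuilder {
    return new ToyWhenBuilder([{ condition, value }]);
  }

  // Adds another branch and returns a new builder; the chain is still unfinished.
  when(condition: ToyColumn, value: ToyColumn): ToyWhenBuilder {
    return new ToyWhenBuilder([...this.branches, { condition, value }]);
  }

  // Terminates the chain with an explicit default.
  otherwise(fallback: ToyColumn): ToyColumn {
    return new ToyColumn(`${this.head()} ELSE ${fallback.sql} END`);
  }

  // Terminates the chain without an ELSE branch, i.e. a NULL default.
  toColumn(): ToyColumn {
    return new ToyColumn(`${this.head()} END`);
  }

  private head(): string {
    const arms = this.branches.map((b) => `WHEN ${b.condition.sql} THEN ${b.value.sql}`);
    return `CASE ${arms.join(" ")}`;
  }
}

// A function that only accepts finished columns.
function select(column: ToyColumn): string {
  return `SELECT ${column.sql}`;
}

const chain = ToyWhenBuilder
  .start(new ToyColumn("age > 18"), new ToyColumn("'adult'"))
  .when(new ToyColumn("age > 12"), new ToyColumn("'teen'"));

// select(chain); // rejected by the compiler: ToyWhenBuilder has no `sql` property
const finishedSql = select(chain.otherwise(new ToyColumn("'child'")));
```

Because the builder and the column share no structure, passing an unterminated chain where a column is expected fails at compile time rather than at query time.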

Every action is async

Actions return Promise<T>. Transformations stay synchronous because they only build the plan.

# PySpark
rows = df.collect()
df.show()
n = df.count()

// spark-connect-js
const rows = await df.collect();
await df.show();
const n = await df.count();

The action set is the same as PySpark: collect, count, show, first, head, take, isEmpty, plus the DataFrameWriter save methods.
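
The split between synchronous transformations and async actions can be illustrated with a toy model. This is a sketch of the general idea only: ToyFrame and ToyExecutor are hypothetical names, and the real client builds a Spark Connect plan rather than an array of strings.

```typescript
// Toy model: transformations extend a plan synchronously, actions run it asynchronously.
// ToyFrame and ToyExecutor are hypothetical; they are not part of spark-connect-js.
type ToyExecutor = (plan: readonly string[]) => Promise<unknown[]>;

class ToyFrame {
  readonly plan: readonly string[];
  private readonly execute: ToyExecutor;

  constructor(plan: readonly string[], execute: ToyExecutor) {
    this.plan = plan;
    this.execute = execute;
  }

  // Transformation: returns a new frame immediately, no I/O involved.
  filter(predicate: string): ToyFrame {
    return new ToyFrame([...this.plan, `filter(${predicate})`], this.execute);
  }

  // Transformation: same idea, only the plan grows.
  limit(n: number): ToyFrame {
    return new ToyFrame([...this.plan, `limit(${n})`], this.execute);
  }

  // Action: the only place the executor (the "server") is contacted.
  collect(): Promise<unknown[]> {
    return this.execute(this.plan);
  }
}
```

In this model, chaining filter and limit never touches the executor; only collect does, which is why only the action needs a Promise.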

Row output is a plain object

collect() returns Record<string, unknown>[], not instances of a Row class. There’s no row.asDict() because a row already is one.

# PySpark
row = df.first()
row["name"]
row.name
row.asDict()

// spark-connect-js
const row = await df.first();
row?.name;
row?.["name"];
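
Since field values are typed unknown, they need narrowing before use. The helpers below are a minimal sketch of one way to do that; getString and toJson are hypothetical names, not library functions. The JSON helper also accounts for LONG columns arriving as bigint, which JSON.stringify cannot serialize on its own.

```typescript
// Rows are plain objects with `unknown` values, so narrow before use.
type PlainRow = Record<string, unknown>;

// Hypothetical helper: returns the field as a string, or throws if it is
// missing or holds some other type.
function getString(row: PlainRow, field: string): string {
  const value = row[field];
  if (typeof value !== "string") {
    throw new TypeError(`Expected "${field}" to be a string, got ${typeof value}`);
  }
  return value;
}

// Hypothetical helper: serializes rows to JSON, rendering bigint values as
// strings because JSON.stringify throws a TypeError on bigint.
function toJson(rows: PlainRow[]): string {
  return JSON.stringify(rows, (_key: string, value: unknown) =>
    typeof value === "bigint" ? value.toString() : value,
  );
}
```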

Sessions need an explicit stop

PySpark relies on interpreter shutdown to close the session. A long-running Node process doesn’t get that for free.

const spark = connect("sc://localhost:15002");
try {
  await doWork(spark);
} finally {
  await spark.stop();
}

stop() releases server-side session state (temp views, cached tables, in-flight queries) and closes the gRPC channel.
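
If the try/finally pattern recurs, it can be wrapped once. The following is a sketch, assuming only that the session exposes an async stop() as shown above; withSession and Stoppable are hypothetical names, not part of the library.

```typescript
// Minimal structural type: anything with an async stop(), such as the
// session returned by connect().
interface Stoppable {
  stop(): Promise<void>;
}

// Hypothetical helper: runs `work` and always stops the session afterwards,
// whether `work` resolves or rejects.
async function withSession<S extends Stoppable, T>(
  session: S,
  work: (session: S) => Promise<T>,
): Promise<T> {
  try {
    return await work(session);
  } finally {
    await session.stop();
  }
}
```

A call would then look like await withSession(connect("sc://localhost:15002"), (spark) => doWork(spark)). If work rejects, the caller still sees that rejection after stop() has run.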

Errors

PySpark raises AnalysisException, ParseException, IllegalArgumentException, and friends. spark-connect-js folds all server-side failures into one type, SparkConnectError, carrying a gRPC status code (INVALID_ARGUMENT, INTERNAL, UNAVAILABLE, …). Errors thrown locally before any RPC are SparkClientError subclasses (InvalidConfigError, InvalidInputError, UnsupportedOperationError).

See Error handling for the full hierarchy.

Type coercion at the edges

Some Arrow types don’t round-trip cleanly into JavaScript. The full mapping is in the architecture documentation. Four notable mismatches:

Spark type	PySpark	spark-connect-js
`LONG` / `BIGINT`	`int`	`bigint`
`DECIMAL(p, s)`	`Decimal`	`string`
`TIMESTAMP`	`datetime` (μs)	`Date` (ms)
`MAP<K, V>`	`dict`	`Map<K, V>`

The Long rule extends to count(), which returns bigint where PySpark returns int. Cast on the server (CAST(amount AS DOUBLE), CAST(ts AS STRING)) when you need different representations.
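
When a bigint has to become a regular number on the client, for example to feed count() into code that expects number, a guarded conversion avoids silent precision loss. This is a minimal sketch; toSafeNumber is a hypothetical helper, not a library function.

```typescript
// Hypothetical helper: converts a bigint (e.g. the result of count()) to a
// number, refusing values that a JS number cannot represent exactly.
function toSafeNumber(value: bigint): number {
  const max = BigInt(Number.MAX_SAFE_INTEGER);
  if (value > max || value < -max) {
    throw new RangeError(`${value} is outside the safe integer range`);
  }
  return Number(value);
}
```

Values within ±(2^53 − 1) convert exactly; anything larger throws instead of being rounded.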

Structured Streaming

The streaming surface mirrors PySpark’s: spark.readStream, df.writeStream, StreamingQuery, spark.streams, listeners, withWatermark, and the window()/session_window() grouping functions. The differences:

awaitTermination(timeoutMs) and awaitAnyTermination(timeoutMs) take milliseconds, matching the Scala client and the wire field. PySpark takes seconds, so awaitTermination(10) ported verbatim waits 10ms, not 10s.
isActive() is an async method, where PySpark exposes isActive as a property. Like every other inspection method on the query, it crosses the wire, so a sync getter would misrepresent where the value comes from.
explain() returns the plan string instead of printing it.
streams.get(id) returns null on a miss, like PySpark. The Scala client throws.
listListeners() is not exposed. It reports server-side Java listeners only, which a JS client never registers, so the result would always be empty.
Listeners are cleared when the event subscription dies non-recoverably. Re-add them to resume. PySpark logs a warning through warnings.warn instead.
foreach and foreachBatch need JS UDF execution and are not yet available.
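
The seconds-versus-milliseconds difference in awaitTermination is easy to miss when porting. One way to make the unit explicit is a small wrapper, sketched below. secondsToMs, awaitTerminationSeconds, and the TerminationAwaitable type are hypothetical names; the resolved value is typed unknown because this page does not state what awaitTermination resolves to.

```typescript
// Hypothetical helper for porting PySpark timeouts, which are given in seconds.
function secondsToMs(seconds: number): number {
  if (!Number.isFinite(seconds) || seconds < 0) {
    throw new RangeError(`Invalid timeout: ${seconds}`);
  }
  return Math.round(seconds * 1000);
}

// Minimal structural type covering the one method used here.
interface TerminationAwaitable {
  awaitTermination(timeoutMs: number): Promise<unknown>;
}

// Hypothetical wrapper: accepts seconds, as PySpark does, and converts to the
// milliseconds that awaitTermination expects.
function awaitTerminationSeconds(
  query: TerminationAwaitable,
  seconds: number,
): Promise<unknown> {
  return query.awaitTermination(secondsToMs(seconds));
}
```

With this in place, PySpark's awaitTermination(10) ports as awaitTerminationSeconds(query, 10) and waits ten seconds.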

See the Structured Streaming guide for the full walkthrough.

Not in spark-connect-js

toPandas(), pandas_api, mapInPandas, mapInArrow, DataFrame.plot: Python-specific. Use await df.collect() for plain JS objects and any JS charting library for visualisation.
Arbitrary closure UDFs: a JS runtime on Spark executors is a cluster change, not a client change. Java UDFs already on the server’s classpath can be bound to a SQL function name via spark.udf.registerJavaFunction(...) / registerJavaUDAF(...), and any SQL function (built-in or registered) is callable from a DataFrame via callFunction.