
Quickstart

You need:

  • Node.js 22 or later.
  • Docker, or a JDK 17+ install for native Spark.

If you already have access to a Spark Connect endpoint, skip to Install the client.

Pick whichever option fits your environment. All produce a server reachable at localhost:15002.

docker run --rm -p 15002:15002 \
  apache/spark:3.5.3 \
  /opt/spark/sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.5.3

Stop the server with Ctrl+C (Docker) or the matching stop-connect-server.sh (native installs). Spark 3.4 or newer works the same way; the version pinned above is just a stable target.

Install the client

npm install @spark-connect-js/node

@spark-connect-js/node depends on @spark-connect-js/core and re-exports its public API. SparkSession, DataFrame, Column, the built-in functions, and the catalog are all available from a single import.

import { connect } from "@spark-connect-js/node";

const spark = connect("sc://localhost:15002");
try {
  const rows = await spark
    .sql("SELECT 1 AS n, 'hello' AS greeting")
    .collect();
  console.log(rows);
  // [ { n: 1, greeting: 'hello' } ]
} finally {
  await spark.stop();
}

  • connect(url) is shorthand for SparkSession.builder().remote(url).getOrCreate(). The returned object is a handle; session state lives on the server.
  • sql(...) returns a lazy DataFrame. Nothing crosses the network until an action runs: collect, count, show, first, head, take, toLocalIterator, or a DataFrameWriter save method.
  • stop() closes the gRPC channel and releases the server-side session. Use try/finally for anything longer-lived than a script.
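
The difference between building a plan and running an action can be sketched with a toy model. This is not the @spark-connect-js API: LazyFrame, from, and the executions counter are hypothetical names invented for illustration, and the "server" is just a local function. It only shows the shape of the idea, where a transformation returns a new plan and an action is what triggers work.

```typescript
// Toy model of lazy evaluation. NOT the @spark-connect-js API:
// LazyFrame and everything on it are hypothetical, for illustration only.
type Row = Record<string, unknown>;
type Step = (rows: Row[]) => Row[];

class LazyFrame {
  // Counts simulated executions so the laziness is observable.
  static executions = 0;

  private readonly source: () => Row[];
  private readonly steps: Step[];

  private constructor(source: () => Row[], steps: Step[]) {
    this.source = source;
    this.steps = steps;
  }

  static from(source: () => Row[]): LazyFrame {
    return new LazyFrame(source, []);
  }

  // Transformation: records a step and returns a new plan. Runs nothing.
  filter(predicate: (row: Row) => boolean): LazyFrame {
    const step: Step = (rows) => rows.filter(predicate);
    return new LazyFrame(this.source, [...this.steps, step]);
  }

  // Action: executes the recorded plan and returns the rows.
  collect(): Row[] {
    LazyFrame.executions += 1;
    return this.steps.reduce((rows, step) => step(rows), this.source());
  }
}

const frame = LazyFrame.from(() => [{ n: 1 }, { n: 2 }, { n: 3 }]);
const plan = frame.filter((row) => (row.n as number) > 1);

// Only a plan exists so far; nothing has executed.
const executionsBeforeAction = LazyFrame.executions;

// The action triggers execution of the whole plan.
const result = plan.collect();

console.log(executionsBeforeAction, LazyFrame.executions, result);
```

In the real client the plan is sent to the server when the action runs, which is why chaining transformations is cheap and actions are where latency shows up.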

The snippet below builds an in-memory table, finds the top earners, and computes per-department aggregates. The same program lives in examples/node-quickstart.

import { connect, col, lit, avg, count } from "@spark-connect-js/node";

const spark = connect("sc://localhost:15002");
try {
  const employees = spark.sql(`
    SELECT * FROM VALUES
      ('Alice', 'Engineering', 90000),
      ('Bob', 'Engineering', 85000),
      ('Carol', 'Marketing', 70000),
      ('Dave', 'Marketing', 72000),
      ('Eve', 'Engineering', 95000)
    AS employees(name, department, salary)
  `);

  const topEarners = employees
    .filter(col("salary").gt(lit(75_000)))
    .sort(col("salary").desc());
  console.table(await topEarners.collect());

  const byDepartment = employees
    .groupBy("department")
    .agg(
      count("*").alias("headcount"),
      avg("salary").alias("avg_salary"),
    );
  console.table(await byDepartment.collect());
} finally {
  await spark.stop();
}
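
As a cross-check on what the two tables should contain, the same filter, sort, and per-department aggregation can be worked through in plain TypeScript over a local array. No Spark is involved here, and the Employee interface and variable names are illustrative. The values Spark returns may be typed differently, and the order of grouped rows is not guaranteed without an explicit sort.

```typescript
// Plain TypeScript over a local array; no Spark involved. It mirrors the
// filter/sort and groupBy/agg steps only to show which values to expect.
interface Employee {
  name: string;
  department: string;
  salary: number;
}

const employees: Employee[] = [
  { name: "Alice", department: "Engineering", salary: 90000 },
  { name: "Bob", department: "Engineering", salary: 85000 },
  { name: "Carol", department: "Marketing", salary: 70000 },
  { name: "Dave", department: "Marketing", salary: 72000 },
  { name: "Eve", department: "Engineering", salary: 95000 },
];

// Local equivalent of filter(salary > 75000) then sort(salary desc).
// filter() returns a new array, so sort() does not reorder `employees`.
const topEarners = employees
  .filter((e) => e.salary > 75_000)
  .sort((a, b) => b.salary - a.salary);

// Local equivalent of groupBy("department").agg(count, avg).
const totals = new Map<string, { headcount: number; sum: number }>();
for (const e of employees) {
  const t = totals.get(e.department) ?? { headcount: 0, sum: 0 };
  t.headcount += 1;
  t.sum += e.salary;
  totals.set(e.department, t);
}
const byDepartment = [...totals].map(([department, t]) => ({
  department,
  headcount: t.headcount,
  avg_salary: t.sum / t.headcount,
}));

console.table(topEarners);   // Eve, Alice, Bob
console.table(byDepartment); // Engineering: 3 / 90000, Marketing: 2 / 71000
```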

Three more scripts covering streaming, writes, and Arrow-batch iteration live in examples/.

connect("sc://host:443/;use_ssl=true;token=abc") opens a TLS connection with a bearer token. Spark Connect itself listens in plaintext, so production deployments put a reverse proxy in front to terminate TLS; the client connects through it. The Configuration page covers every URL parameter, and the node-tls-behind-proxy example shows the proxy setup.
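
To make the shape of that URL concrete, here is a minimal sketch that splits an sc:// string into host, port, and ;key=value parameters. parseConnectUrl and ConnectTarget are hypothetical names, not part of @spark-connect-js, and the client's own parser may behave differently. The fallback port of 15002 is an assumption taken from the port used earlier on this page, and IPv6 hosts are not handled.

```typescript
// Hypothetical helper for illustration; not part of @spark-connect-js.
interface ConnectTarget {
  host: string;
  port: number;
  params: Record<string, string>;
}

function parseConnectUrl(url: string): ConnectTarget {
  const prefix = "sc://";
  if (!url.startsWith(prefix)) {
    throw new Error(`expected an sc:// URL, got: ${url}`);
  }
  const rest = url.slice(prefix.length);

  // Everything before the first "/" is host[:port];
  // everything after it is a run of ";key=value" pairs.
  const slash = rest.indexOf("/");
  const authority = slash === -1 ? rest : rest.slice(0, slash);
  const tail = slash === -1 ? "" : rest.slice(slash + 1);

  // Assumption: fall back to 15002 when no port is given. IPv6 is not handled.
  const colon = authority.lastIndexOf(":");
  const host = colon === -1 ? authority : authority.slice(0, colon);
  const port = colon === -1 ? 15002 : Number(authority.slice(colon + 1));
  if (host === "" || !Number.isInteger(port) || port <= 0) {
    throw new Error(`invalid host or port in: ${url}`);
  }

  const params: Record<string, string> = {};
  for (const pair of tail.split(";")) {
    if (pair === "") continue; // skips the empty piece before the first ";"
    const eq = pair.indexOf("=");
    if (eq === -1) {
      throw new Error(`expected key=value, got: ${pair}`);
    }
    const key = decodeURIComponent(pair.slice(0, eq));
    params[key] = decodeURIComponent(pair.slice(eq + 1));
  }
  return { host, port, params };
}

const secure = parseConnectUrl("sc://host:443/;use_ssl=true;token=abc");
const local = parseConnectUrl("sc://localhost:15002");
console.log(secure, local);
```

Note that parameter values arrive as strings, so use_ssl is the text "true" here rather than a boolean.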