SQL and DataFrame guide

Spark has two ways to express a query: SQL strings and the DataFrame API. Both compile to the same logical plan in Catalyst, so mixing them in one pipeline is fine.

Creating a DataFrame

A DataFrame can come from a SQL query, a table name, a numeric range, a file read, or local data. All of these constructors are lazy: nothing is sent to the server until an action runs.

const a = spark.sql("SELECT * FROM my_table WHERE x > 10");
const b = spark.table("my_table");
const c = spark.range(0, 1_000_000);
const d = spark.read.parquet("s3://bucket/events/");
const e = spark.createDataFrame([{ id: 1n, name: "alice" }, { id: 2n, name: "bob" }]);

createDataFrame encodes plain row objects to Arrow on the client and infers each column’s type from its first non-null value. The supported value types are string, number, boolean, bigint, and Date. For richer types (decimal, struct, array, map, binary), build the Arrow IPC stream bytes yourself and pass the resulting Uint8Array instead.
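The "first non-null value" rule can be pictured with a small standalone sketch. This is not the client's actual encoder: the names kindOf and inferColumnKinds are hypothetical, and throwing on unsupported values is an assumption made for the illustration.

```typescript
// Hypothetical sketch of first-non-null type inference; not the library's encoder.
type JsKind = "string" | "number" | "boolean" | "bigint" | "date";

// Classify one value, or return null when it carries no type information.
function kindOf(value: unknown): JsKind | null {
  if (value === null || value === undefined) return null;
  if (value instanceof Date) return "date";
  switch (typeof value) {
    case "string": return "string";
    case "number": return "number";
    case "boolean": return "boolean";
    case "bigint": return "bigint";
    default:
      // Assumption: richer values (objects, arrays) are rejected in this sketch.
      throw new TypeError(`unsupported value type: ${typeof value}`);
  }
}

// For each column, keep the kind of the first non-null value seen across the rows.
function inferColumnKinds(rows: Array<Record<string, unknown>>): Map<string, JsKind | null> {
  const kinds = new Map<string, JsKind | null>();
  for (const row of rows) {
    for (const [name, value] of Object.entries(row)) {
      // Only columns that are still undecided look at the current value.
      if (kinds.get(name) == null) {
        kinds.set(name, kindOf(value));
      }
    }
  }
  return kinds;
}
```

Under this rule, a column whose first row holds null is typed by a later row, and a column that is null in every row stays undecided.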

Lazy evaluation

DataFrames are immutable. Each transformation returns a new DataFrame with an extended plan:

const young = employees.filter(col("age").lt(30));
const named = young.select("name", "age");
// Neither `young` nor `named` has touched the server yet.

const rows = await named.collect();
// This is the call that sends the plan and returns rows.

The server receives the full plan on each action. Catalyst’s analyzer and optimizer run there, and results come back as Arrow IPC batches. The optimizer pushes predicates down, so placing filter before or after select usually produces the same physical plan.
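The plan-building behaviour described above can be modelled in a few lines. The sketch below is a toy, not the client's implementation: the Plan type and the LazyFrame class are hypothetical, and collect is synchronous here only to keep the example short.

```typescript
// Toy model of lazy plan building; these names are illustrative only.
type Plan =
  | { op: "table"; name: string }
  | { op: "filter"; condition: string; child: Plan }
  | { op: "select"; columns: string[]; child: Plan };

class LazyFrame {
  readonly plan: Plan;
  private readonly execute: (plan: Plan) => unknown[];

  constructor(plan: Plan, execute: (plan: Plan) => unknown[]) {
    this.plan = plan;
    this.execute = execute;
  }

  // Transformations never call `execute`; they return a new frame with a larger plan.
  filter(condition: string): LazyFrame {
    return new LazyFrame({ op: "filter", condition, child: this.plan }, this.execute);
  }

  select(...columns: string[]): LazyFrame {
    return new LazyFrame({ op: "select", columns, child: this.plan }, this.execute);
  }

  // The action is the only place where the full plan is handed to the executor.
  collect(): unknown[] {
    return this.execute(this.plan);
  }
}
```

Because each transformation wraps the previous plan instead of mutating it, the original frame stays usable and several derived frames can share one parent.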

Transformations

Projection

df.select("name", "age");            // by name
df.select(col("name"), col("age"));  // by Column
df.selectExpr("name", "age * 12 AS age_in_months");
df.withColumn("full_name", concat(col("first"), lit(" "), col("last")));
df.withColumns({ a: col("x"), b: col("y") });
df.drop("internal_id");
df.withColumnRenamed("age", "years");
df.toDF("a", "b", "c");  // positional rename of all columns

Filtering

filter and where are aliases. Both take a Column predicate or a SQL string, parsed server-side:

df.filter(col("age").gt(18));
df.filter("age > 18 AND country IN ('US', 'CA')");
df.filter(col("name").rlike("^A"));
df.where(col("age").between(18, 65));
df.filter(col("country").isin("US", "CA", "UK"));

Grouping and aggregation

import { count, sum, avg, max } from "@spark-connect-js/node";

df.groupBy("department")
  .agg(
    count("*").alias("headcount"),
    avg("salary").alias("avg_salary"),
    max("salary").alias("max_salary"),
  );

// Shorthand methods on GroupedData:
df.groupBy("department").count();
df.groupBy("department").sum("salary");

// Multi-dimensional aggregation:
df.cube("department", "role").count();
df.rollup("year", "month").agg(sum("revenue"));

// Whole-frame aggregation, no grouping:
df.agg(count("*").alias("rows"), avg("salary").alias("mean"));
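For the cube and rollup calls above, it can help to see which grouping sets each one asks for. In Spark, rollup aggregates over every prefix of the column list and cube over every subset. The helper names below are hypothetical and exist only for this illustration.

```typescript
// rollup: every prefix of the column list, down to the empty (grand total) set.
function rollupSets(columns: string[]): string[][] {
  const sets: string[][] = [];
  for (let n = columns.length; n >= 0; n--) {
    sets.push(columns.slice(0, n));
  }
  return sets;
}

// cube: every subset of the column list, keeping the original column order.
function cubeSets(columns: string[]): string[][] {
  const sets: string[][] = [];
  const total = 1 << columns.length;
  for (let mask = total - 1; mask >= 0; mask--) {
    // Bit i of the mask decides whether column i is part of this grouping set.
    sets.push(columns.filter((_, i) => (mask & (1 << i)) !== 0));
  }
  return sets;
}
```

So rollup("year", "month") yields three levels (year and month, year alone, grand total), while cube("department", "role") yields four, adding the role-only level.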

Sorting and limits

df.sort(col("salary").desc());
df.orderBy(col("dept").asc(), col("salary").desc_nulls_last());
df.limit(100);
df.offset(200).limit(50);  // pagination
df.sortWithinPartitions("id");

Joins

df1.join(df2, df1.col("id").eq(df2.col("id")));                   // arbitrary condition
df1.join(df2, condition, "left_outer");                           // inner | left_outer | right_outer | full_outer | left_semi | left_anti | cross
df1.join(df2, undefined, "cross");
df1.crossJoin(df2);

join takes a Column condition, not a column-name string or an array of names. Build equi-joins with col(...).eq(col(...)).

In a self-join both sides share every column name, so a plain col("id") is ambiguous. df.col(name) binds the reference to that specific DataFrame:

const employees = spark.table("employees");
const managers = spark.table("employees");
employees.join(managers, employees.col("manager_id").eq(managers.col("id")));

Set operations

df1.union(df2);
df1.unionByName(df2, true);  // allowMissingColumns
df1.intersect(df2);
df1.except(df2);             // rows in df1 not in df2

Deduplication and sampling

df.distinct();
df.dropDuplicates("user_id", "event_date");
df.sample(0.1);                       // 10% without replacement
df.sample(0.1, false, 42);            // with seed
df.randomSplit([0.8, 0.2], 42);       // returns [DataFrame, DataFrame]

Missing values

df.dropna();                                 // drop rows with any null
df.dropna("all");                            // drop rows where every column is null
df.dropna("any", ["name", "email"]);         // scoped to specific columns
df.fillna(0);                                // fill nulls with 0 across all numeric columns
df.fillna("unknown", ["name"]);              // scoped to specific columns
df.replace({ "N/A": null, "": null });       // value replacement (map form)
df.replace({ "N/A": null }, ["country"]);    // scoped to specific columns

fillna takes a single scalar and an optional column subset. replace takes a Record<string, scalar | null> mapping old values to new.

Reshaping

df.groupBy("year").pivot("quarter").sum("revenue");
df.groupBy("year").pivot("quarter", ["Q1", "Q2", "Q3", "Q4"]).sum("revenue");

df.unpivot(
  ["id"],                           // keep
  ["jan", "feb", "mar"],            // unpivot
  "month",
  "value",
);
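To make the wide-to-long reshaping concrete, here is a local sketch of what unpivot does to each row. It runs on plain objects and is not how the server executes the operation. The function name unpivotRows is hypothetical.

```typescript
type WideRow = Record<string, unknown>;

// Local illustration: emit one output row per (input row, value column) pair.
function unpivotRows(
  rows: WideRow[],
  ids: string[],
  values: string[],
  variableColumn: string,
  valueColumn: string,
): WideRow[] {
  const out: WideRow[] = [];
  for (const row of rows) {
    for (const name of values) {
      const longRow: WideRow = {};
      // Identifier columns are copied unchanged into every output row.
      for (const id of ids) longRow[id] = row[id];
      longRow[variableColumn] = name;   // which source column the value came from
      longRow[valueColumn] = row[name]; // the value itself
      out.push(longRow);
    }
  }
  return out;
}
```

A single input row { id: 1, jan: 10, feb: 20, mar: 30 } becomes three rows, one per month, each carrying id, month, and value.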

Partitioning and caching

Repartitioning is a plain transformation: it modifies the plan and takes effect on the next action. The caching calls differ in that they contact the server immediately.

df.repartition(200);
df.repartition(200, "user_id");
df.coalesce(10);
df.repartitionByRange(50, col("timestamp"));

await df.cache();                     // MEMORY_AND_DISK
await df.persist(MEMORY_ONLY);
await df.unpersist();

cache, persist, and unpersist round-trip to the server and return Promise<DataFrame>. Await them before running a query that depends on the cache being warm.

Actions

Actions trigger execution and return a Promise.

await df.collect();              // Row[]
await df.count();                // bigint
await df.first();                // Row | null
await df.head(10);               // Row[]
await df.take(5);                // alias for head
await df.show(20, false);        // print to stdout; returns void. Args: numRows, truncate.
await df.isEmpty();              // boolean

for await (const row of df.toLocalIterator()) {
  // Stream rows without buffering the whole result.
}

await df.forEach(row => { ... }); // streams rows from the server and applies fn to each one on the client

Rows decode to plain objects. Column values map to JS types by Spark type:

Integer types other than LongType and floating-point types arrive as number; BooleanType arrives as boolean.
LongType arrives as bigint, which is why count() returns one. Wrap it in Number(...) when the count is known to fit a JS safe integer.
DateType and TimestampType arrive as Date. Sub-millisecond precision is truncated.
DecimalType arrives as a fixed-point string like "1.50".
MapType arrives as Map<K, V>.
Structs arrive as nested objects and arrays as JS arrays, with the same rules applied recursively.
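Two of these mappings usually need a small conversion step in application code. The helpers below are a sketch, not part of the client: toSafeNumber guards the Number(...) conversion suggested for LongType, and decimalToScaledBigInt turns a fixed-point string into exact scaled units (cents, for scale 2). Both names are hypothetical.

```typescript
// Convert a LongType value to number, refusing values that would lose precision.
function toSafeNumber(value: bigint): number {
  if (value > BigInt(Number.MAX_SAFE_INTEGER) || value < BigInt(Number.MIN_SAFE_INTEGER)) {
    throw new RangeError(`${value} does not fit a JS safe integer`);
  }
  return Number(value);
}

// Parse a fixed-point decimal string such as "1.50" into an exact integer of scaled units.
function decimalToScaledBigInt(text: string, scale: number): bigint {
  const match = /^(-?)(\d+)(?:\.(\d+))?$/.exec(text);
  if (match === null) throw new SyntaxError(`not a fixed-point decimal: ${text}`);
  const sign = match[1] ?? "";
  const whole = match[2] ?? "";
  const fraction = match[3] ?? "";
  if (fraction.length > scale) {
    throw new RangeError(`more than ${scale} fractional digits: ${text}`);
  }
  // Pad the fraction to the requested scale, then read all digits as one integer.
  const magnitude = BigInt(whole + fraction.padEnd(scale, "0"));
  return sign === "-" ? -magnitude : magnitude;
}
```

Keeping decimals as scaled bigint values avoids the rounding that parseFloat would introduce for amounts such as "0.10".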

Typed access

collect() returns rows as Record<string, unknown>. df.as<Schema>() narrows them at compile time:

const rows = await spark.table("people").as<{ id: bigint; name: string }>().collect();
rows[0].name;  // string

as<Schema>() is an assertion, not a validation: nothing is checked at runtime. For runtime checks, use the row accessors. Each getter verifies that the column exists and that the value matches the expected type, returning null for NULL and throwing on a mismatch:

import { row } from "@spark-connect-js/node";

const [stats] = await df.agg(count("*").alias("n"), avg("salary").alias("mean")).collect();
row.getLong(stats, "n");       // bigint | null
row.getDouble(stats, "mean");  // number | null
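The contract of such a getter is small enough to sketch by hand. The function below is a hypothetical re-implementation for illustration, not the library's row.getLong. The error types and messages are assumptions.

```typescript
type PlainRow = Record<string, unknown>;

// Hypothetical checked getter: column must exist, NULL maps to null, wrong type throws.
function getLongChecked(r: PlainRow, name: string): bigint | null {
  if (!Object.prototype.hasOwnProperty.call(r, name)) {
    throw new Error(`no such column: ${name}`);
  }
  const value = r[name];
  if (value === null) return null;
  if (typeof value !== "bigint") {
    throw new TypeError(`column ${name} is ${typeof value}, expected bigint`);
  }
  return value;
}
```

The difference from as<Schema>() is that a wrong assumption fails at the point of access instead of flowing through the program as a mistyped value.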

Schema and metadata

Metadata queries are actions too; they round-trip to the server via AnalyzePlan.

await df.schema();          // schema tree as a plain object
await df.columns();         // string[]
await df.dtypes();          // Array<[string, string]>
await df.printSchema();     // prints to stdout
await df.explain();         // prints physical plan
await df.explain("extended");

Writing

df.write returns a DataFrameWriter. df.writeTo(table) returns the v2 writer. See the I/O guide. For continuous queries, df.writeStream starts a streaming write, covered in the Structured Streaming guide.

The Column DSL

col(name) is a reference to a column in the plan. The methods on Column build expression trees that the server evaluates.

import { col, lit } from "@spark-connect-js/node";

// Comparisons
col("age").eq(30);
col("age").gt(30);
col("age").lte(65);
col("name").neq("admin");

// Arithmetic
col("price").multiply(1.08);
col("a").plus(col("b"));
col("total").divide(col("count"));

// Logical
col("a").and(col("b"));
col("status").eq("active").or(col("priority").gt(5));
col("deleted").eq(false);

// Null and NaN handling
col("email").isNull();
col("email").isNotNull();
col("score").isNaN();

// Membership and ranges
col("country").isin("US", "CA", "UK");
col("age").between(18, 65);

// String matching
col("name").like("A%");
col("email").rlike("^[^@]+@example\\.com$");
col("path").startsWith("/home/");
col("path").endsWith(".log");
col("body").contains("error");

// Casting
col("age").cast("int");
col("amount").cast("decimal(18,2)");

// Ordering
col("salary").asc();
col("salary").desc_nulls_last();

// Aliasing
col("x").plus(col("y")).alias("sum");

Literals

lit(value) wraps a JavaScript value as a Column. Comparison, arithmetic, and bitwise methods wrap primitives automatically, so lit(...) is for the places a constant stands alone as a Column:

df.withColumn("is_vip", lit(true));
df.select(lit(1).alias("one"), col("name"));
df.withColumn("bucket", col("score").divide(10).cast("int"));

TypeScript has no operator overloading, so comparisons and arithmetic are methods: col("x").gt(30), col("a").plus(col("b")). lit(null) produces a typed NULL literal. Refine it with .cast(...) when the server needs a concrete type. lit(undefined) throws, since an absent value is almost always a bug rather than an intended NULL.

SQL strings and raw expressions

For cases where the DataFrame API is awkward, fall back to SQL:

df.selectExpr("age * 365.25 AS age_in_days", "upper(name) AS upper_name");

import { expr } from "@spark-connect-js/node";
df.filter(expr("age > 18 AND country IN ('US', 'CA')"));
df.withColumn("age_group", expr("CASE WHEN age < 18 THEN 'minor' ELSE 'adult' END"));

filter and where accept the same SQL fragments directly. expr("...") lifts one into a Column for anywhere else an expression goes, and selectExpr(...) covers SQL projections.

Temp views bridge the two worlds:

await df.createOrReplaceTempView("events");
const hourly = spark.sql(`
  SELECT date_trunc('hour', ts) AS hour, count(*) AS n
  FROM events
  GROUP BY 1
`);

Conditional expressions

import { when, coalesce } from "@spark-connect-js/node";

df.withColumn(
  "tier",
  when(col("spend").gt(1000), lit("gold"))
    .when(col("spend").gt(100), lit("silver"))
    .otherwise(lit("bronze")),
);

df.withColumn("display_name", coalesce(col("full_name"), col("email"), lit("anonymous")));

User-defined identifiers

Column and table names are identifiers, not strings; they follow SQL identifier rules. If a name contains special characters or clashes with a reserved word, backtick-quote it inside the string:

df.select("`order-id`", "`from`");
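When names come from user input or another system, quoting them programmatically is safer than quoting by hand. The helper below is a minimal sketch. quoteIdentifier is a hypothetical name, not a function of the client. It relies on Spark SQL's rule that a backtick inside a quoted identifier is escaped by doubling it.

```typescript
// Wrap a name in backticks, doubling any backtick inside it (Spark SQL's escape rule).
function quoteIdentifier(name: string): string {
  return "`" + name.replace(/`/g, "``") + "`";
}
```

With it, the call above could be written as df.select(quoteIdentifier("order-id"), quoteIdentifier("from")). Quoting also matters for names that contain a dot, which would otherwise be read as a path into a struct or a qualified column.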