Catalog

session.catalog is the client-side handle to Spark’s CatalogManager. Use it to list catalogs and databases, inspect tables and functions, manage temp views, and control the storage-level cache.

const catalog = spark.catalog;

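Listing and lookup methods (the list* and get* calls) return DataFrames, so they compose with the rest of the API; call collect() to materialize the rows. Existence checks, current-value getters, and most state-changing methods return a Promise<T>.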

A Spark server can register multiple catalogs. spark_catalog is built in; others (Iceberg, Delta, JDBC, Hive) are configured server-side in spark-defaults.conf. The client forwards catalog operations as-is, so what you can list and query depends on the server’s configuration.

await catalog.currentCatalog(); // string
await catalog.listCatalogs().collect(); // Row[]
await catalog.setCurrentCatalog("iceberg");
await catalog.currentDatabase(); // "default"
await catalog.listDatabases().collect();
await catalog.databaseExists("analytics"); // boolean
await catalog.getDatabase("analytics").collect();
await catalog.setCurrentDatabase("analytics");
await catalog.listTables().collect();
await catalog.listTables("analytics").collect(); // in a specific database
await catalog.tableExists("events");
await catalog.getTable("events").collect();
await catalog.listColumns("events").collect();
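
The existence checks pair naturally with the setters. The sketch below shows one way to switch databases only when the target exists. `useDatabaseIfExists` and the `DatabaseCatalog` interface are hypothetical names introduced here, not part of the library; the interface mirrors only the two calls shown above, and the return type of `setCurrentDatabase` is assumed to be some Promise.

```typescript
// Minimal structural view of the catalog: only the two methods used here.
// The interface name is hypothetical; the method names come from the calls above.
interface DatabaseCatalog {
  databaseExists(name: string): Promise<boolean>;
  setCurrentDatabase(name: string): Promise<unknown>;
}

// Switch to `name` only when the server reports that it exists.
// Resolves to true if the switch happened, false otherwise.
async function useDatabaseIfExists(
  catalog: DatabaseCatalog,
  name: string,
): Promise<boolean> {
  if (!(await catalog.databaseExists(name))) {
    return false;
  }
  await catalog.setCurrentDatabase(name);
  return true;
}

// Usage against a live session (assumes `spark.catalog` matches the interface):
//   const switched = await useDatabaseIfExists(spark.catalog, "analytics");
```

Typing the helper against a small interface rather than the concrete catalog class keeps it easy to exercise with a stub in unit tests.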

Temp views are session-scoped; they disappear when the session ends. Global temp views live in the global_temp database and are visible across sessions of the same Spark application until they are dropped or the application stops.

await df.createOrReplaceTempView("events");
await df.createOrReplaceGlobalTempView("events_global");
await catalog.dropTempView("events"); // returns boolean
await catalog.dropGlobalTempView("events_global");
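
Because temp views linger until the session ends, it can help to scope one to a single unit of work. The sketch below registers a view, runs a callback, and drops the view afterwards even if the callback throws. `withTempView`, `ViewSource`, and `ViewCatalog` are hypothetical names introduced for illustration; the interfaces assume only the `createOrReplaceTempView` and `dropTempView` calls shown above.

```typescript
// Structural types covering only the calls used here (names are hypothetical).
interface ViewSource {
  createOrReplaceTempView(name: string): Promise<unknown>;
}
interface ViewCatalog {
  dropTempView(name: string): Promise<boolean>;
}

// Register `df` as a temp view for the duration of `body`, then drop it.
async function withTempView<T>(
  catalog: ViewCatalog,
  df: ViewSource,
  name: string,
  body: () => Promise<T>,
): Promise<T> {
  await df.createOrReplaceTempView(name);
  try {
    return await body();
  } finally {
    // Runs on success and on failure, so the view never outlives the call.
    await catalog.dropTempView(name);
  }
}

// Usage against a live session (assumes `spark` and `df` exist):
//   const rows = await withTempView(spark.catalog, df, "events", () =>
//     spark.sql("SELECT * FROM events").collect());
```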

createTable registers a managed or external table backed by a file format:

import { StructType, StructField } from "@spark-connect-js/node";

const schema = new StructType([
  new StructField("id", "long"),
  new StructField("name", "string"),
  new StructField("value", "double"),
]);

const created = catalog.createTable("demo_table", {
  source: "parquet",
  schema,
  path: "/tmp/demo", // optional, omit for a managed table
  options: { compression: "snappy" },
});
await created.collect(); // returns an empty DataFrame

For INSERT / OVERWRITE / MERGE semantics, use the DataFrame writer instead.
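
Setup scripts often need table creation to be safe to re-run. A minimal sketch, assuming only the `tableExists` and `createTable(...).collect()` calls shown above: `createTableIfMissing`, `Collectable`, and `TableCatalog` are hypothetical names, and the options type is left generic because its exact shape is not specified here. The check and the create are two separate requests, so this is not atomic if another client creates the table in between.

```typescript
// Structural types covering only the calls used here (names are hypothetical).
interface Collectable {
  collect(): Promise<unknown[]>;
}
interface TableCatalog<Opts> {
  tableExists(name: string): Promise<boolean>;
  createTable(name: string, options: Opts): Collectable;
}

// Create `name` only if the catalog does not already report it.
// Resolves to true when a table was created, false when it already existed.
async function createTableIfMissing<Opts>(
  catalog: TableCatalog<Opts>,
  name: string,
  options: Opts,
): Promise<boolean> {
  if (await catalog.tableExists(name)) {
    return false;
  }
  // Mirror the original snippet: collect the DataFrame returned by createTable.
  await catalog.createTable(name, options).collect();
  return true;
}
```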

await catalog.listFunctions().collect(); // every SQL function registered on the server
await catalog.functionExists("count");
await catalog.getFunction("count").collect(); // metadata row with return type, signature, etc.
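
One use for `functionExists` is a preflight check before running SQL that depends on server-side functions. The sketch below returns the names that the server does not report. `missingFunctions` and `FunctionCatalog` are hypothetical names; the interface assumes only the `functionExists` call shown above. The checks run one at a time to avoid assuming anything about concurrent requests on a session.

```typescript
// Structural type covering only the call used here (name is hypothetical).
interface FunctionCatalog {
  functionExists(name: string): Promise<boolean>;
}

// Return the subset of `names` that the catalog does not know about,
// preserving the input order.
async function missingFunctions(
  catalog: FunctionCatalog,
  names: string[],
): Promise<string[]> {
  const missing: string[] = [];
  for (const name of names) {
    if (!(await catalog.functionExists(name))) {
      missing.push(name);
    }
  }
  return missing;
}

// Usage against a live session (`my_udf` is a made-up function name):
//   const missing = await missingFunctions(spark.catalog, ["count", "my_udf"]);
```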

The catalog cache controls in-memory persistence for named tables and views. It is useful when you plan to query the same relation several times in a session.

await catalog.cacheTable("events");
await catalog.isCached("events"); // true
await catalog.uncacheTable("events");
await catalog.cacheTable("events");
await catalog.clearCache(); // drops every cached relation

For caching intermediate DataFrames (not named tables), use df.cache() / df.persist(...) / df.unpersist() directly.
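
To keep a cached table from outliving the work that needed it, the cache calls can be wrapped in a scoped helper. A minimal sketch, assuming only the `isCached`, `cacheTable`, and `uncacheTable` calls shown above: `withCachedTable` and `CacheCatalog` are hypothetical names. The helper leaves the table cached if it was already cached before the call, so it does not undo someone else's caching decision.

```typescript
// Structural type covering only the calls used here (name is hypothetical).
interface CacheCatalog {
  isCached(name: string): Promise<boolean>;
  cacheTable(name: string): Promise<unknown>;
  uncacheTable(name: string): Promise<unknown>;
}

// Cache `name` while `body` runs, then restore the previous cache state.
async function withCachedTable<T>(
  catalog: CacheCatalog,
  name: string,
  body: () => Promise<T>,
): Promise<T> {
  const wasCached = await catalog.isCached(name);
  if (!wasCached) {
    await catalog.cacheTable(name);
  }
  try {
    return await body();
  } finally {
    // Only uncache what this helper cached.
    if (!wasCached) {
      await catalog.uncacheTable(name);
    }
  }
}
```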

Spark caches file listings and partition metadata. After an out-of-band change to underlying storage, refresh explicitly:

await catalog.refreshTable("events");
await catalog.refreshByPath("s3://bucket/events/");
await catalog.recoverPartitions("events"); // re-discovers Hive-style partitions

A condensed end-to-end example:

import { connect, StructType, StructField } from "@spark-connect-js/node";

const spark = connect("sc://localhost:15002");
const catalog = spark.catalog;

console.log("Catalog:", await catalog.currentCatalog());
console.log("Database:", await catalog.currentDatabase());

const employees = spark.sql(`
  SELECT * FROM VALUES
    ('Alice', 'Engineering', 90000),
    ('Bob', 'Marketing', 75000)
  AS employees(name, department, salary)
`);
await employees.createOrReplaceTempView("employees");

console.log(await catalog.tableExists("employees"));
console.table(await catalog.listColumns("employees").collect());

await catalog.cacheTable("employees");
console.log("cached?", await catalog.isCached("employees"));

await catalog.dropTempView("employees");
await spark.stop();

The full runnable version is in examples/node-catalog.