
DataFrame

Defined in: data-frame.ts:51

A distributed collection of rows with a named schema, obtained from a SparkSession (for example via spark.read.parquet(path) or spark.sql(...)).

DataFrame is lazy. Transformation methods (select, filter, join, withColumn, etc.) return a new DataFrame that wraps an extended logical plan; no work is performed on the server until an action (collect, count, show, write.save, etc.) is called.

const df = await spark.read.parquet("s3://bucket/events");
const recent = df
  .filter(col("ts").gte(lit("2026-01-01")))
  .groupBy("country")
  .count();
const rows = await recent.collect();
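
The lazy behaviour described above can be pictured with a small stand-in. This is a minimal sketch, not this library's implementation: `LazySketch`, `PlanStep`, and the in-memory `source` array are hypothetical names introduced only for illustration. Transformations append a step to a plan, and work happens only when the `collect()` action runs.

```typescript
// Hypothetical stand-in for a lazy DataFrame; not the library's real class.
type SketchRow = Record<string, unknown>;
type PlanStep = (rows: SketchRow[]) => SketchRow[];

class LazySketch {
  // Counts how many times a plan was actually executed.
  static executions = 0;
  private readonly source: SketchRow[];
  private readonly steps: PlanStep[];

  constructor(source: SketchRow[], steps: PlanStep[] = []) {
    this.source = source;
    this.steps = steps;
  }

  // Transformation: returns a new object wrapping an extended plan.
  filter(predicate: (row: SketchRow) => boolean): LazySketch {
    const step: PlanStep = (rows) => rows.filter(predicate);
    return new LazySketch(this.source, [...this.steps, step]);
  }

  // Transformation: also only extends the plan.
  limit(n: number): LazySketch {
    const step: PlanStep = (rows) => rows.slice(0, n);
    return new LazySketch(this.source, [...this.steps, step]);
  }

  // Action: runs every recorded step in order.
  collect(): SketchRow[] {
    LazySketch.executions += 1;
    let rows = this.source;
    for (const step of this.steps) {
      rows = step(rows);
    }
    return rows;
  }
}

const lazyEvents = new LazySketch([{ country: "DE" }, { country: "FR" }, { country: "DE" }]);
const lazyGerman = lazyEvents.filter((row) => row.country === "DE").limit(1);
const executionsBeforeAction = LazySketch.executions; // nothing has run yet
const lazyResult = lazyGerman.collect(); // the action triggers execution
```

In the real client the plan is sent to the server rather than run locally, but the split between plan-building calls and actions is the same idea.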

Spark source: Dataset.scala

get stat(): DataFrameStat;

Defined in: data-frame.ts:631

Access statistical functions (corr, cov, crosstab, etc.).

DataFrameStat


get write(): DataFrameWriter;

Defined in: data-frame.ts:638

Returns a DataFrameWriter for persisting the contents of this DataFrame.

DataFrameWriter

alias(name): DataFrame;

Defined in: data-frame.ts:387

Assign an alias to this DataFrame, useful for self-joins.

Parameter  Type
name       string

DataFrame


cache(): Promise<DataFrame>;

Defined in: data-frame.ts:656

Persist this DataFrame with the default storage level (MEMORY_AND_DISK). The returned promise resolves to this DataFrame, so further calls can be chained after awaiting it.

Promise<DataFrame>


coalesce(numPartitions): DataFrame;

Defined in: data-frame.ts:509

Return a new DataFrame that is reduced to the given number of partitions. Unlike repartition(), coalesce avoids a full shuffle and tries to combine existing partitions.

Parameter      Type    Description
numPartitions  number  Target number of partitions

DataFrame
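
The difference between coalesce and repartition can be illustrated on plain arrays. This is a rough model under stated assumptions, not how Spark or this client moves data: `coalesceSketch` and `repartitionSketch` are hypothetical helpers, and a "partition" here is just an array of values.

```typescript
// Illustrative model only: partitions are plain arrays, not Spark partitions.
function coalesceSketch<T>(partitions: T[][], target: number): T[][] {
  // coalesce never increases the partition count.
  const count = Math.max(1, Math.min(target, partitions.length));
  const merged: T[][] = [];
  for (let i = 0; i < count; i++) merged.push([]);
  // Each existing partition is appended whole to one output partition.
  partitions.forEach((partition, index) => {
    const bucket = merged[index % count];
    if (bucket) bucket.push(...partition);
  });
  return merged;
}

function repartitionSketch<T>(partitions: T[][], target: number): T[][] {
  const count = Math.max(1, target);
  const shuffled: T[][] = [];
  for (let i = 0; i < count; i++) shuffled.push([]);
  // Every row is redistributed individually, which models a full shuffle.
  let position = 0;
  for (const partition of partitions) {
    for (const row of partition) {
      const bucket = shuffled[position % count];
      if (bucket) bucket.push(row);
      position += 1;
    }
  }
  return shuffled;
}

const fourPartitions = [[1, 2], [3], [4, 5], [6]];
const coalesced = coalesceSketch(fourPartitions, 2);
const repartitioned = repartitionSketch(fourPartitions, 2);
```

In the model, coalescing keeps the rows of each input partition together, while repartitioning spreads them evenly at the cost of touching every row.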


collect(): Promise<Row[]>;

Defined in: data-frame.ts:776

Execute the plan and collect all result rows into a JS array.

For large datasets, prefer toLocalIterator() or forEach() to avoid loading everything into memory.

Promise<Row[]>


columns(): Promise<string[]>;

Defined in: data-frame.ts:904

Return the column names as a string array. Uses the AnalyzePlan.Schema RPC to resolve the schema without executing the query.

Promise<string[]>


count(): Promise<number>;

Defined in: data-frame.ts:791

Return the number of rows. Uses an aggregate count plan. The full dataset is not collected.

Promise<number>


createGlobalTempView(viewName): Promise<void>;

Defined in: data-frame.ts:739

Register as a global temporary view. Throws if the view already exists.

Parameter  Type
viewName   string

Promise<void>


createOrReplaceGlobalTempView(viewName): Promise<void>;

Defined in: data-frame.ts:728

Register as a global temporary view, replacing if it already exists.

Parameter  Type
viewName   string

Promise<void>


createOrReplaceTempView(viewName): Promise<void>;

Defined in: data-frame.ts:706

Register this DataFrame as a temporary view with the given name. The view is session-scoped and will be dropped when the session ends.

Parameter  Type
viewName   string

Promise<void>


createTempView(viewName): Promise<void>;

Defined in: data-frame.ts:717

Register as a temporary view. Throws if the view already exists.

Parameter  Type
viewName   string

Promise<void>


crossJoin(other): DataFrame;

Defined in: data-frame.ts:188

Alias for join with joinType="cross".

Parameter  Type
other      DataFrame

DataFrame


cube(...columns): GroupedData;

Defined in: data-frame.ts:100

Multi-dimensional cube aggregation (all grouping-column combinations).

Parameter  Type
columns    (string | Column)[]

GroupedData
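
To make "all grouping-column combinations" concrete, the sketch below enumerates the grouping sets that a cube over some columns covers. `cubeGroupingSets` is a hypothetical helper for illustration; it is not part of this API and does no aggregation itself.

```typescript
// Enumerate the grouping sets that a cube over the given columns covers.
function cubeGroupingSets(columns: string[]): string[][] {
  const sets: string[][] = [];
  const total = 1 << columns.length; // 2^n subsets
  for (let mask = total - 1; mask >= 0; mask--) {
    // Each bit of the mask decides whether one column is included;
    // the highest bit corresponds to the first column.
    sets.push(
      columns.filter((_, i) => (mask & (1 << (columns.length - 1 - i))) !== 0),
    );
  }
  return sets;
}

const cubeSets = cubeGroupingSets(["country", "city"]);
// cubeSets: [["country", "city"], ["country"], ["city"], []]
```

The empty set corresponds to the grand total row.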


describe(...cols): DataFrame;

Defined in: data-frame.ts:548

Compute summary statistics (count, mean, stddev, min, max) for columns.

Parameter  Type
cols       string[]

DataFrame


distinct(): DataFrame;

Defined in: data-frame.ts:266

Alias for dropDuplicates() with no arguments.

DataFrame


drop(...columnNames): DataFrame;

Defined in: data-frame.ts:193

Drop one or more columns by name.

Parameter    Type
columnNames  string[]

DataFrame


dropDuplicates(...columnNames): DataFrame;

Defined in: data-frame.ts:256

Remove duplicate rows, optionally considering only a subset of columns.

Parameter    Type
columnNames  string[]

DataFrame


dropna(how?, cols?): DataFrame;

Defined in: data-frame.ts:362

Drop rows with null values.

Parameter  Type           Default value
how        "all" | "any"  "any"
cols       string[]       []

DataFrame
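
The effect of the how parameter can be shown on plain objects. The following is a sketch under assumptions, not the library's code: `dropnaSketch` is a hypothetical helper, and it assumes that an empty cols list means "consider every column", which this page states for fillna but not explicitly for dropna.

```typescript
type NullableRow = Record<string, unknown>;

function dropnaSketch(
  rows: NullableRow[],
  how: "any" | "all" = "any",
  cols: string[] = [],
): NullableRow[] {
  return rows.filter((row) => {
    // Assumption: an empty cols list means every column is considered.
    const considered = cols.length > 0 ? cols : Object.keys(row);
    const nullCount = considered.filter(
      (c) => row[c] === null || row[c] === undefined,
    ).length;
    // "any": drop when at least one considered value is null.
    // "all": drop only when every considered value is null.
    const drop =
      how === "any"
        ? nullCount > 0
        : considered.length > 0 && nullCount === considered.length;
    return !drop;
  });
}

const nullableRows: NullableRow[] = [
  { a: 1, b: null },
  { a: null, b: null },
  { a: 1, b: 2 },
];
const keptAny = dropnaSketch(nullableRows); // only { a: 1, b: 2 } survives
const keptAll = dropnaSketch(nullableRows, "all"); // only the all-null row is dropped
```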


dtypes(): Promise<[string, string][]>;

Defined in: data-frame.ts:914

Return column names and their data types as [name, type] pairs. Uses the AnalyzePlan.Schema RPC.

Promise<[string, string][]>


except(other): DataFrame;

Defined in: data-frame.ts:307

Return rows in this but not in other (distinct).

Parameter  Type
other      DataFrame

DataFrame


exceptAll(other): DataFrame;

Defined in: data-frame.ts:312

Return rows in this but not in other (duplicates kept).

Parameter  Type
other      DataFrame

DataFrame
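
The "distinct" versus "duplicates kept" wording for except and exceptAll corresponds to set versus multiset difference. The sketch below shows that difference on plain arrays. The helper names are hypothetical, and comparing rows by their JSON form is a simplification made for this example.

```typescript
// Set difference: distinct rows of left that do not occur in right.
function exceptDistinctSketch<T>(left: T[], right: T[]): T[] {
  const exclude = new Set(right.map((r) => JSON.stringify(r)));
  const seen = new Set<string>();
  return left.filter((row) => {
    const key = JSON.stringify(row);
    if (exclude.has(key) || seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Multiset difference: each right-hand row cancels one matching left-hand row.
function exceptAllSketch<T>(left: T[], right: T[]): T[] {
  const remaining = new Map<string, number>();
  for (const r of right) {
    const key = JSON.stringify(r);
    remaining.set(key, (remaining.get(key) ?? 0) + 1);
  }
  return left.filter((row) => {
    const key = JSON.stringify(row);
    const count = remaining.get(key) ?? 0;
    if (count > 0) {
      remaining.set(key, count - 1);
      return false;
    }
    return true;
  });
}

const exceptDistinctResult = exceptDistinctSketch([1, 1, 1, 2, 3], [1, 2]); // [3]
const exceptAllResult = exceptAllSketch([1, 1, 1, 2, 3], [1, 2]); // [1, 1, 3]
```

intersect and intersectAll differ in the same way: the first works on distinct rows, the second counts occurrences.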


explain(mode?): Promise<string>;

Defined in: data-frame.ts:947

Return the query execution plan as a string.

Parameter  Type                                                      Default value  Description
mode       "simple" | "extended" | "codegen" | "cost" | "formatted"  "simple"       Explain mode: "simple", "extended", "codegen", "cost", or "formatted"

Promise<string>


fillna(value, cols?): DataFrame;

Defined in: data-frame.ts:352

Replace null values. If cols is empty, applies to all columns.

Parameter  Type                       Default value
value      string | number | boolean  undefined
cols       string[]                   []

DataFrame


filter(condition): DataFrame;

Defined in: data-frame.ts:70

Filter rows by a boolean Column expression.

Parameter  Type
condition  Column

DataFrame


first(): Promise<Row | null>;

Defined in: data-frame.ts:866

Return the first row as a Row object, or null if the DataFrame is empty.

Promise<Row | null>


forEach(fn): Promise<void>;

Defined in: data-frame.ts:844

Process each row with a callback as it streams from the server.

Parameter  Type
fn         (row) => void

Promise<void>

await df.forEach((row) => console.log(row.name, row.salary));

getStorageLevel(): Promise<StorageLevel>;

Defined in: data-frame.ts:693

Get the storage level used for caching this DataFrame. Returns the StorageLevel if cached, or NONE if not cached.

Promise<StorageLevel>


groupBy(...columns): GroupedData;

Defined in: data-frame.ts:94

Group by one or more columns, returning a GroupedData handle for aggregation.

Parameter  Type
columns    (string | Column)[]

GroupedData


head(n?): Promise<Row[]>;

Defined in: data-frame.ts:874

Return the first n rows as an array (equivalent to limit(n) followed by collect()).

Parameter  Type    Default value
n          number  1

Promise<Row[]>


hint(name, ...parameters): DataFrame;

Defined in: data-frame.ts:403

Attach an optimizer hint to this DataFrame.

Parameter   Type
name        string
parameters  (string | number | boolean)[]

DataFrame

df.hint("broadcast")
df.join(right.hint("broadcast"), ...)

intersect(other): DataFrame;

Defined in: data-frame.ts:297

Return rows present in both DataFrames (distinct).

Parameter  Type
other      DataFrame

DataFrame


intersectAll(other): DataFrame;

Defined in: data-frame.ts:302

Return rows present in both DataFrames (duplicates kept).

Parameter  Type
other      DataFrame

DataFrame


isEmpty(): Promise<boolean>;

Defined in: data-frame.ts:924

Returns true if the DataFrame has no rows. Uses head(1) to check and stops after the first row.

Promise<boolean>


join(other, condition?, joinType?): DataFrame;

Defined in: data-frame.ts:160

Join with another DataFrame.

Parameter   Type                                                                                         Default value  Description
other       DataFrame                                                                                    undefined      The right side DataFrame
condition?  Column                                                                                       undefined      Join condition (a boolean Column expression)
joinType?   "inner" | "full_outer" | "left_outer" | "right_outer" | "left_semi" | "left_anti" | "cross"  "inner"        Type of join (default: "inner")

DataFrame
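
Of the join types listed, "left_semi" and "left_anti" are the least self-explanatory: both return only left-side rows, filtered by whether a match exists on the right. The sketch below illustrates that on plain arrays. `semiAntiSketch` and the sample data are hypothetical, and the predicate function stands in for the Column join condition.

```typescript
function semiAntiSketch<L, R>(
  left: L[],
  right: R[],
  matches: (l: L, r: R) => boolean,
  joinType: "left_semi" | "left_anti",
): L[] {
  return left.filter((l) => {
    const hasMatch = right.some((r) => matches(l, r));
    // left_semi keeps rows with a match, left_anti keeps rows without one.
    return joinType === "left_semi" ? hasMatch : !hasMatch;
  });
}

const customersSketch = [{ id: 1 }, { id: 2 }, { id: 3 }];
const ordersSketch = [{ customerId: 1 }, { customerId: 1 }, { customerId: 3 }];
const sameCustomer = (c: { id: number }, o: { customerId: number }) =>
  c.id === o.customerId;

const withOrders = semiAntiSketch(customersSketch, ordersSketch, sameCustomer, "left_semi");
const withoutOrders = semiAntiSketch(customersSketch, ordersSketch, sameCustomer, "left_anti");
```

Customer 1 appears once in `withOrders` even though two orders match, which is what separates a semi join from an inner join.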


limit(n): DataFrame;

Defined in: data-frame.ts:112

Limit the number of rows.

Parameter  Type
n          number

DataFrame


melt(ids, values, variableColumnName, valueColumnName): DataFrame;

Defined in: data-frame.ts:621

Alias for unpivot().

Parameter           Type
ids                 (string | Column)[]
values              (string | Column)[] | undefined
variableColumnName  string
valueColumnName     string

DataFrame


offset(n): DataFrame;

Defined in: data-frame.ts:271

Skip the first N rows.

Parameter  Type
n          number

DataFrame


orderBy(...columns): DataFrame;

Defined in: data-frame.ts:149

Alias for sort().

Parameter  Type
columns    (string | Column)[]

DataFrame


persist(storageLevel?): Promise<DataFrame>;

Defined in: data-frame.ts:666

Persist this DataFrame with the given storage level. The returned promise resolves to this DataFrame, so further calls can be chained after awaiting it.

Parameter     Type          Default value    Description
storageLevel  StorageLevel  MEMORY_AND_DISK  How to store the cached data

Promise<DataFrame>


printSchema(): Promise<void>;

Defined in: data-frame.ts:962

Print the schema to the console in a tree format. Convenience method that calls schema() and formats the output.

Promise<void>


randomSplit(weights, seed?): DataFrame[];

Defined in: data-frame.ts:580

Randomly split this DataFrame into multiple DataFrames by weight.

Parameter  Type
weights    number[]
seed?      number

DataFrame[]
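
One way to picture a split "by weight" is to normalise the weights and turn them into adjacent sampling ranges over [0, 1), which is how Spark's own randomSplit is built. Whether this client does the same is an assumption. `splitBoundaries` is a hypothetical helper that computes only the ranges, not the sampling.

```typescript
// Turn weights into adjacent [lower, upper) ranges that together cover [0, 1).
function splitBoundaries(weights: number[]): Array<[number, number]> {
  const total = weights.reduce((sum, w) => sum + w, 0);
  if (weights.length === 0 || total <= 0 || weights.some((w) => w < 0)) {
    throw new Error("weights must be non-negative with a positive sum");
  }
  const bounds: Array<[number, number]> = [];
  let lower = 0;
  for (const w of weights) {
    // Weights are normalised, so they do not need to sum to 1.
    const upper = lower + w / total;
    bounds.push([lower, upper]);
    lower = upper;
  }
  return bounds;
}

const trainTestBounds = splitBoundaries([3, 1]);
// trainTestBounds: [[0, 0.75], [0.75, 1]]
```

A row whose random draw falls in the first range would land in the first split, and so on. Fixing the seed is what makes the draws, and therefore the splits, repeatable.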


repartition(numPartitions, ...columns): DataFrame;

Defined in: data-frame.ts:484

Return a new DataFrame partitioned by the given number of partitions. This results in a full shuffle of the data.

Parameter      Type                 Description
numPartitions  number               Target number of partitions
columns        (string | Column)[]  Optional partitioning columns

DataFrame


repartitionByRange(numPartitions, ...columns): DataFrame;

Defined in: data-frame.ts:524

Return a new DataFrame partitioned by the given columns using range partitioning.

Parameter      Type                 Description
numPartitions  number               Target number of partitions
columns        (string | Column)[]  Partitioning columns

DataFrame


replace(to, subset?): DataFrame;

Defined in: data-frame.ts:566

Replace matching values using the to mapping (each key is a value to look for, each value its replacement), optionally restricted to a column subset.

Parameter  Type                                              Default value
to         Record<string, string | number | boolean | null>  undefined
subset     string[]                                          []

DataFrame


rollup(...columns): GroupedData;

Defined in: data-frame.ts:106

Multi-dimensional rollup aggregation (hierarchical subtotals).

Parameter  Type
columns    (string | Column)[]

GroupedData
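
"Hierarchical subtotals" means the grouping sets are the prefixes of the column list, from the full list down to the grand total. The sketch below enumerates them. `rollupGroupingSets` is a hypothetical helper for illustration and performs no aggregation.

```typescript
// Enumerate the grouping sets of a rollup: every prefix of the column list.
function rollupGroupingSets(columns: string[]): string[][] {
  const sets: string[][] = [];
  for (let size = columns.length; size >= 0; size--) {
    sets.push(columns.slice(0, size));
  }
  return sets;
}

const rollupSets = rollupGroupingSets(["year", "month", "day"]);
// rollupSets: [["year", "month", "day"], ["year", "month"], ["year"], []]
```

A rollup over n columns therefore yields n + 1 grouping sets, whereas a cube yields 2^n.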


sameSemantics(other): Promise<boolean>;

Defined in: data-frame.ts:750

Returns true if both DataFrames have the same logical plan.

Parameter  Type
other      DataFrame

Promise<boolean>


sample(fraction, withReplacement?, seed?): DataFrame;

Defined in: data-frame.ts:338

Return a random sample of rows.

Parameter        Type     Default value
fraction         number   undefined
withReplacement  boolean  false
seed?            number   undefined

DataFrame


schema(): Promise<Record<string, unknown>>;

Defined in: data-frame.ts:934

Return the schema of the DataFrame as a plain object. Uses the AnalyzePlan.Schema RPC to resolve column names and types without executing the query.

Promise<Record<string, unknown>>


select(...columns): DataFrame;

Defined in: data-frame.ts:84

Project (select) a subset of columns.

Parameter  Type
columns    (string | Column)[]

DataFrame


selectExpr(...exprs): DataFrame;

Defined in: data-frame.ts:421

Select columns using SQL expression strings. Each string is parsed by the server as an expression.

Parameter  Type
exprs      string[]

DataFrame

df.selectExpr("age * 2 as doubled_age", "name")

semanticHash(): Promise<number>;

Defined in: data-frame.ts:760

Returns a hash code of the logical plan.

Promise<number>


show(numRows?, truncate?): Promise<void>;

Defined in: data-frame.ts:974

Pretty-print the first numRows rows to the console as an ASCII table.

Mirrors PySpark's df.show() behaviour. If truncate is true, strings longer than 20 characters are truncated and end with "...".

Parameter  Type     Default value
numRows    number   20
truncate   boolean  true

Promise<void>
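
The truncation rule can be sketched as a small function. This follows PySpark's behaviour, where a truncated cell keeps its first 17 characters and gains a three-character ellipsis so that it is exactly 20 wide. That this client produces the same widths is an assumption, and `truncateCellSketch` is a hypothetical name.

```typescript
// Truncate a cell value the way PySpark's show() does for truncate=true.
function truncateCellSketch(value: string, maxWidth = 20): string {
  if (value.length <= maxWidth) return value;
  // Leave room for the ellipsis so the result is exactly maxWidth characters.
  return value.slice(0, maxWidth - 3) + "...";
}

const shortCell = truncateCellSketch("Berlin");
const longCell = truncateCellSketch("a-very-long-value-that-overflows");
```

Here `shortCell` is returned unchanged, while `longCell` is cut to 20 characters including the trailing dots.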


sort(...columns): DataFrame;

Defined in: data-frame.ts:124

Sort by one or more columns (ascending by default). Use col("x").desc() for descending order.

Parameter  Type
columns    (string | Column)[]

DataFrame


sortWithinPartitions(...columns): DataFrame;

Defined in: data-frame.ts:451

Sort within each partition (non-global sort).

Parameter  Type
columns    (string | Column)[]

DataFrame


summary(...statistics): DataFrame;

Defined in: data-frame.ts:557

Compute specified statistics for numeric and string columns.

Parameter   Type
statistics  string[]

DataFrame


tail(n): Promise<Row[]>;

Defined in: data-frame.ts:891

Return the last n rows as an array.

Maps to Spark Connect’s Relation.Tail.

Parameter  Type
n          number

Promise<Row[]>


take(n): Promise<Row[]>;

Defined in: data-frame.ts:882

Return the first n rows as an array. Alias for head(). Matches PySpark’s take() semantics.

Parameter  Type
n          number

Promise<Row[]>


toDF(...columnNames): DataFrame;

Defined in: data-frame.ts:374

Return a new DataFrame with renamed columns (positional).

Parameter    Type
columnNames  string[]

DataFrame


toLocalIterator(): AsyncIterableIterator<Row>;

Defined in: data-frame.ts:827

Async iterator that yields rows one at a time. Only one batch is in memory at a time.

AsyncIterableIterator<Row>

for await (const row of df.toLocalIterator()) {
  console.log(row);
}
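
The "one batch in memory at a time" property follows naturally from async generators. The sketch below is a stand-in, not this library's implementation: `fetchBatchesSketch` is a hypothetical source that plays the role of the server's streamed responses, and `rowsSketch` flattens its batches into single rows.

```typescript
type BatchRow = Record<string, unknown>;

// Hypothetical batch source standing in for the server's streamed responses.
async function* fetchBatchesSketch(): AsyncGenerator<BatchRow[]> {
  yield [{ id: 1 }, { id: 2 }];
  yield [{ id: 3 }];
}

// Flatten batches into rows; only the current batch is referenced at any time.
async function* rowsSketch(
  batches: AsyncIterable<BatchRow[]>,
): AsyncGenerator<BatchRow> {
  for await (const batch of batches) {
    for (const row of batch) {
      yield row;
    }
  }
}

// Consume the rows without ever building the full result array.
async function sumIdsSketch(): Promise<number> {
  let total = 0;
  for await (const row of rowsSketch(fetchBatchesSketch())) {
    total += Number(row["id"]);
  }
  return total;
}
```

Because the consumer pulls rows one by one, the next batch is requested only after the previous one has been fully processed. This is why the pattern suits results too large for collect().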

transform<T>(fn): T;

Defined in: data-frame.ts:444

Apply a user-defined function to this DataFrame and return the result. This is purely client-side; it just calls fn(this).

Enables fluent pipeline composition, as the example at the end of this entry shows.

Type Parameter
T extends DataFrame

Parameter  Type
fn         (df) => T

T

df.transform(withDoubledAge).transform(withSalaryBand)
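
The one-liner above leaves out what `withDoubledAge` and `withSalaryBand` look like. The sketch below fills that in against a stand-in class, since the real DataFrame needs a live session. `FrameSketch` is hypothetical, and the bodies of the two helpers are invented for illustration: they only track column names.

```typescript
// Hypothetical stand-in with just enough surface to show transform().
class FrameSketch {
  readonly columnNames: string[];

  constructor(columnNames: string[]) {
    this.columnNames = columnNames;
  }

  // Simplified withColumn that only records the new column's name.
  withColumn(name: string): FrameSketch {
    return new FrameSketch([...this.columnNames, name]);
  }

  // Purely client-side: hands this object to fn and returns whatever fn returns.
  transform<T extends FrameSketch>(fn: (df: FrameSketch) => T): T {
    return fn(this);
  }
}

// Reusable pipeline steps: each takes a frame and returns a new one.
const withDoubledAge = (df: FrameSketch) => df.withColumn("doubled_age");
const withSalaryBand = (df: FrameSketch) => df.withColumn("salary_band");

const piped = new FrameSketch(["name", "age", "salary"])
  .transform(withDoubledAge)
  .transform(withSalaryBand);
```

Writing steps as standalone functions keeps them testable in isolation and lets the chain read top to bottom instead of as nested calls such as `withSalaryBand(withDoubledAge(df))`.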

union(other): DataFrame;

Defined in: data-frame.ts:282

Return a new DataFrame with rows from both this and other (duplicates kept).

Parameter  Type
other      DataFrame

DataFrame


unionAll(other): DataFrame;

Defined in: data-frame.ts:287

Alias for union().

Parameter  Type
other      DataFrame

DataFrame


unionByName(other, allowMissingColumns?): DataFrame;

Defined in: data-frame.ts:292

Union by column name (rather than position), keeping duplicates.

Parameter            Type       Default value
other                DataFrame  undefined
allowMissingColumns  boolean    false

DataFrame
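
The difference between union (by position) and unionByName shows up as soon as the two inputs order their columns differently. The sketch below models frames as a column list plus row arrays. `NamedFrame` and both helpers are hypothetical, and the null-filling for allowMissingColumns follows Spark's documented behaviour, which this client is assumed to share.

```typescript
type NamedFrame = { columns: string[]; rows: unknown[][] };

// Positional union: rows are appended as-is; column names of b are ignored.
function unionSketch(a: NamedFrame, b: NamedFrame): NamedFrame {
  return { columns: a.columns, rows: [...a.rows, ...b.rows] };
}

// Union by name: values are aligned on column names before appending.
function unionByNameSketch(
  a: NamedFrame,
  b: NamedFrame,
  allowMissingColumns = false,
): NamedFrame {
  const extra = b.columns.filter((c) => a.columns.indexOf(c) === -1);
  const missing = a.columns.filter((c) => b.columns.indexOf(c) === -1);
  if (!allowMissingColumns && (extra.length > 0 || missing.length > 0)) {
    throw new Error("column sets differ and allowMissingColumns is false");
  }
  const columns = [...a.columns, ...extra];
  // Reorder each row to the combined column list, filling gaps with null.
  const align = (frame: NamedFrame): unknown[][] =>
    frame.rows.map((row) =>
      columns.map((name) => {
        const index = frame.columns.indexOf(name);
        return index === -1 ? null : row[index];
      }),
    );
  return { columns, rows: [...align(a), ...align(b)] };
}

const frameA: NamedFrame = { columns: ["id", "name"], rows: [[1, "a"]] };
const frameB: NamedFrame = { columns: ["name", "id"], rows: [["b", 2]] };
const byPosition = unionSketch(frameA, frameB); // second row keeps its swapped order
const byName = unionByNameSketch(frameA, frameB); // second row is realigned
```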


unpersist(blocking?): Promise<DataFrame>;

Defined in: data-frame.ts:680

Remove this DataFrame from the cache.

Parameter  Type     Default value  Description
blocking   boolean  false          Whether to block until the operation completes

Promise<DataFrame>


unpivot(ids, values, variableColumnName, valueColumnName): DataFrame;

Defined in: data-frame.ts:602

Unpivot from wide format to long format.

Parameter           Type
ids                 (string | Column)[]
values              (string | Column)[] | undefined
variableColumnName  string
valueColumnName     string

DataFrame
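
The wide-to-long reshaping can be shown on plain objects. This is a minimal sketch of the idea, not the library's implementation: `unpivotSketch` is a hypothetical helper that takes column names as strings only and requires an explicit values list.

```typescript
type WideRow = Record<string, unknown>;

// Emit one output row per (input row, value column) pair.
function unpivotSketch(
  rows: WideRow[],
  ids: string[],
  values: string[],
  variableColumnName: string,
  valueColumnName: string,
): WideRow[] {
  const long: WideRow[] = [];
  for (const row of rows) {
    for (const valueColumn of values) {
      const out: WideRow = {};
      // Identifier columns are copied unchanged into every output row.
      for (const id of ids) out[id] = row[id];
      out[variableColumnName] = valueColumn; // which column the value came from
      out[valueColumnName] = row[valueColumn]; // the value itself
      long.push(out);
    }
  }
  return long;
}

const wideSales: WideRow[] = [{ id: 1, q1: 10, q2: 20 }];
const longSales = unpivotSketch(wideSales, ["id"], ["q1", "q2"], "quarter", "sales");
// longSales: [{ id: 1, quarter: "q1", sales: 10 }, { id: 1, quarter: "q2", sales: 20 }]
```

Each input row produces as many output rows as there are value columns, so the result is longer and narrower than the input.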


where(condition): DataFrame;

Defined in: data-frame.ts:79

Alias for filter().

Parameter  Type
condition  Column

DataFrame


withColumn(name, expression): DataFrame;

Defined in: data-frame.ts:206

Add or replace a column.

Parameter   Type
name        string
expression  Column

DataFrame

df.withColumn("doubled", col("value").multiply(lit(2)))

withColumnRenamed(existing, newName): DataFrame;

Defined in: data-frame.ts:230

Rename a single column.

Parameter  Type
existing   string
newName    string

DataFrame


withColumns(colMap): DataFrame;

Defined in: data-frame.ts:217

Add or replace multiple columns at once.

Parameter  Type
colMap     Record<string, Column>

DataFrame


withColumnsRenamed(colsMap): DataFrame;

Defined in: data-frame.ts:243

Rename multiple columns at once.

Parameter  Type                    Description
colsMap    Record<string, string>  Mapping of { existingName: newName }

DataFrame


writeTo(tableName): DataFrameWriterV2;

Defined in: data-frame.ts:646

Returns a DataFrameWriterV2 for writing to the given table using the DataSource V2 API (catalog-aware, supports create/replace/append/overwrite).

Parameter  Type
tableName  string

DataFrameWriterV2