TableFrame API Guide
This guide provides an in-depth exploration of the Tabsdata TableFrame API, extending the concepts introduced in Working with Tables. The TableFrame is the core data structure in Tabsdata for representing and manipulating tabular data within Tabsdata functions.
Overview
A TableFrame is Tabsdata’s lazy evaluation data structure that represents tabular data organized in rows and columns. Unlike eager evaluation frameworks, TableFrame operations are not executed immediately—instead, they build a query plan that is optimized and executed when the data is materialized.
Key characteristics:
Lazy Evaluation: Operations are deferred until explicitly collected, enabling query optimization.
Immutable: Each operation returns a new TableFrame; the original is unchanged.
Columnar: Data is stored and processed column-wise for efficient analytical operations.
Type-safe: Each column has a defined data type that is enforced throughout operations.
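The first two characteristics can be made concrete with a minimal plain-Python sketch (illustrative only, not Tabsdata internals): each operation returns a new object carrying an extended plan, and no work happens until `collect()` is called.

```python
# Illustrative sketch of lazy evaluation and immutability (plain Python,
# not the actual Tabsdata implementation).
class LazyPipeline:
    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = steps  # the deferred "query plan"

    def filter(self, predicate):
        # Returns a NEW pipeline; the original is left unchanged.
        return LazyPipeline(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns):
        return LazyPipeline(self._rows, self._steps + (("select", columns),))

    def collect(self):
        # Work happens only here, once the whole plan is known.
        rows = self._rows
        for kind, arg in self._steps:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:  # "select"
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows

base = LazyPipeline([{"name": "Alice", "age": 25}, {"name": "Bob", "age": 17}])
adults = base.filter(lambda r: r["age"] > 18)  # no work done yet
names = adults.select("name").collect()        # the plan runs here
```

A real TableFrame additionally optimizes the plan before executing it, which is why deferring work pays off.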
Creating TableFrames
While TableFrames are typically received as input in publishers and transformers, you can also create them programmatically for testing or data generation purposes.
From a Dictionary
Use TableFrame.from_dict() to create a TableFrame from a Python dictionary:
import tabsdata as td
data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["New York", "London", "Paris"]
}
tf = td.TableFrame.from_dict(data)
From Pandas
Convert a Pandas DataFrame to a TableFrame:
import pandas as pd
import tabsdata as td
df = pd.DataFrame({
    "product": ["Widget", "Gadget"],
    "price": [19.99, 29.99]
})
tf = td.TableFrame.from_pandas(df)
From Polars
Convert from Polars LazyFrame or DataFrame:
import polars as pl
import tabsdata as td
# From Polars LazyFrame
lf = pl.LazyFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
tf = td.TableFrame.from_polars(lf)
# From Polars DataFrame
pdf = pl.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
tf = td.TableFrame.from_polars(pdf)
Converting TableFrames
Export TableFrames to other formats:
# To dictionary
data_dict = tf.to_dict()
# To Pandas DataFrame
pandas_df = tf.to_pandas()
# To Polars DataFrame
polars_df = tf.to_polars_df()
# To Polars LazyFrame
polars_lf = tf.to_polars_lf()
Column Selection and Projection
Projection operations allow you to select, transform, and reshape columns in your data.
Selecting Columns
Use select() to choose specific columns. Multiple syntax forms are supported:
# Using td.col() expressions
tf.select(td.col("first_name"), td.col("last_name"))
# Using column names directly
tf.select("first_name", "last_name")
# Using a list of column names
tf.select(["first_name", "last_name"])
Column Selectors
Tabsdata provides selectors for choosing columns dynamically by data type, name pattern, or position:
By Data Type:
import tabsdata.tableframe.selectors as cs
# Select all numeric columns
tf.select(cs.numeric())
# Select all string columns
tf.select(cs.string())
# Select all temporal columns (date, datetime, time)
tf.select(cs.temporal())
# Select by specific dtype
tf.select(cs.by_dtype(td.Int64, td.Float64))
By Pattern:
# Columns starting with a prefix
tf.select(cs.starts_with("customer_"))
# Columns ending with a suffix
tf.select(cs.ends_with("_id"))
# Columns containing a substring
tf.select(cs.contains("price"))
# Columns matching a regex pattern
tf.select(cs.matches(r"^amount_\d+$"))
By Position:
# First N columns
tf.select(cs.first(3))
# Last N columns
tf.select(cs.last(2))
# By index (0-based)
tf.select(cs.by_index(0, 2, 4))
Excluding Columns:
# All columns except specific ones
tf.select(cs.exclude("internal_id", "metadata"))
# Combine with other selectors
tf.select(cs.numeric() - cs.ends_with("_raw"))
Adding and Modifying Columns
Use with_columns() to add new columns or modify existing ones without dropping other columns:
tf.with_columns(
    # Add a new calculated column
    (td.col("price") * td.col("quantity")).alias("total"),
    # Modify an existing column
    td.col("name").str.to_uppercase().alias("name"),
    # Add a literal value column
    td.lit("USD").alias("currency")
)
Dropping Columns
Use drop() to remove columns:
# Drop single column
tf.drop("temporary_column")
# Drop multiple columns
tf.drop("col1", "col2", "col3")
# Drop using selectors
tf.drop(cs.ends_with("_raw"))
Renaming Columns
Use rename() to change column names:
tf.rename({
    "old_name": "new_name",
    "customer_id": "cust_id"
})
Column Aliasing
Use alias() on expressions to name computed columns:
tf.select(
    (td.col("first_name") + " " + td.col("last_name")).alias("full_name"),
    (td.col("price") * 1.1).alias("price_with_tax")
)
Filtering Data
Filter operations allow you to select rows based on conditions.
Basic Filtering
Use filter() with boolean expressions:
# Simple comparison
tf.filter(td.col("age") > 18)
# Equality check
tf.filter(td.col("status") == "active")
# Not equal
tf.filter(td.col("category") != "archived")
Combining Conditions
Combine multiple conditions using logical operators:
# AND conditions (two syntaxes)
tf.filter((td.col("age") > 18) & (td.col("age") < 65))
tf.filter((td.col("age") > 18).and_(td.col("age") < 65))
# OR conditions
tf.filter((td.col("status") == "active") | (td.col("status") == "pending"))
tf.filter((td.col("status") == "active").or_(td.col("status") == "pending"))
# NOT condition
tf.filter(~(td.col("is_deleted")))
tf.filter(td.col("is_deleted").not_())
# XOR condition
tf.filter(td.col("flag_a").xor(td.col("flag_b")))
Range and Membership Checks
# Between (inclusive)
tf.filter(td.col("price").is_between(10.0, 100.0))
# In a set of values
tf.filter(td.col("country").is_in(["USA", "Canada", "Mexico"]))
# Null checks
tf.filter(td.col("email").is_not_null())
tf.filter(td.col("optional_field").is_null())
# NaN checks (for floating point)
tf.filter(td.col("ratio").is_not_nan())
Limiting Results
# First N rows
tf.head(10)
tf.limit(10)
tf.first(10)
# Last N rows
tf.tail(10)
tf.last(10)
# Slice with offset
tf.slice(offset=100, length=50)
# Single row access
first_row = tf.first_row()
last_row = tf.last_row()
Unique Values
Remove duplicate rows (de-duplication):
# Unique across all columns
tf.unique()
# Unique based on specific columns
tf.unique(subset=["customer_id", "date"])
# Keep first or last occurrence
tf.unique(subset=["customer_id"], keep="first")
tf.unique(subset=["customer_id"], keep="last")
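The effect of `keep` can be pictured with a small plain-Python sketch (illustrative only, not Tabsdata code):

```python
# Plain-Python illustration of unique(subset=..., keep=...) semantics:
# keep either the first or the last row seen for each key.
def unique_by(rows, key, keep="first"):
    kept = {}
    for row in rows:
        k = row[key]
        if keep == "last" or k not in kept:
            kept[k] = row  # "last" overwrites; "first" keeps the original
    return list(kept.values())

orders = [
    {"customer_id": 1, "date": "2024-01-01"},
    {"customer_id": 1, "date": "2024-02-01"},
    {"customer_id": 2, "date": "2024-01-15"},
]
first_kept = unique_by(orders, "customer_id", keep="first")
last_kept = unique_by(orders, "customer_id", keep="last")
```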
String Operations
The .str namespace provides comprehensive string manipulation functions.
Case Transformations
td.col("name").str.to_uppercase()
td.col("name").str.to_lowercase()
td.col("name").str.to_titlecase()
Pattern Matching
# Check if string starts/ends with pattern
td.col("email").str.ends_with("@example.com")
td.col("phone").str.starts_with("+1")
# Contains substring
td.col("description").str.contains("urgent")
# Contains any of multiple patterns
td.col("text").str.contains_any(["error", "warning", "critical"])
# Regex matching
td.col("code").str.contains(r"^[A-Z]{3}-\d{4}$")
String Extraction and Replacement
# Extract using regex groups
td.col("email").str.extract(r"@(\w+)\.", group_index=1)
# Replace patterns
td.col("text").str.replace("old", "new")
td.col("text").str.replace_all(r"\s+", " ") # Collapse whitespace runs into single spaces
# Replace multiple patterns
td.col("text").str.replace_many(["a", "b", "c"], ["x", "y", "z"])
String Manipulation
# Trimming
td.col("name").str.strip_chars() # Both sides
td.col("name").str.strip_chars_start() # Left side
td.col("name").str.strip_chars_end() # Right side
td.col("path").str.strip_prefix("/api/")
td.col("filename").str.strip_suffix(".txt")
# Padding
td.col("id").str.pad_start(10, "0") # Left pad with zeros
td.col("code").str.pad_end(20, " ") # Right pad with spaces
td.col("number").str.zfill(5) # Zero-fill to length
# Slicing
td.col("text").str.head(10) # First 10 characters
td.col("text").str.tail(5) # Last 5 characters
td.col("text").str.slice(offset=5, length=10) # Substring
# Reverse
td.col("text").str.reverse()
String Length and Position
# Length
td.col("description").str.len_chars() # Character count
td.col("data").str.len_bytes() # Byte count
# Find position
td.col("text").str.find("@") # First occurrence index
# Count matches
td.col("log").str.count_matches(r"\berror\b") # Count pattern matches
Grok Parsing
Parse semi-structured text using Grok patterns:
# Parse Apache log format
tf.grok(
    "log_line",
    r"%{IP:client_ip} - - \[%{HTTPDATE:timestamp}\] \"%{WORD:method} %{URIPATHPARAM:path}\""
)
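Under the hood, a Grok pattern is essentially a regex with named capture groups. A simplified plain-Python equivalent of the pattern above (illustrative only; the real `%{IP}` and `%{HTTPDATE}` patterns are more permissive):

```python
import re

# Roughly what the Grok pattern compiles down to: a regex with named groups.
LOG_RE = re.compile(
    r'(?P<client_ip>\d{1,3}(?:\.\d{1,3}){3}) - - '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<path>\S+)"'
)

line = '192.168.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /api/users"'
fields = LOG_RE.match(line).groupdict()
```

Each named group becomes a column in the resulting TableFrame.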
Numeric Operations
Mathematical and numeric operations on columns.
Arithmetic Operations
# Basic arithmetic
td.col("price") + 10
td.col("price") - td.col("discount")
td.col("quantity") * td.col("unit_price")
td.col("total") / td.col("count")
# Method syntax
td.col("a").add(td.col("b"))
td.col("a").sub(td.col("b"))
td.col("a").mul(td.col("b"))
td.col("a").truediv(td.col("b"))
td.col("a").mod(td.col("b")) # Modulo
Rounding and Clipping
# Rounding
td.col("price").round(2) # Round to 2 decimal places
td.col("value").round_sig_figs(3) # Round to 3 significant figures
td.col("amount").floor() # Round down
td.col("amount").ceil() # Round up
# Clipping (constrain to range)
td.col("score").clip(lower_bound=0, upper_bound=100)
Mathematical Functions
# Absolute value and sign
td.col("change").abs()
td.col("value").sign() # Returns -1, 0, or 1
td.col("value").neg() # Negate
# Powers and roots
td.col("x").pow(2) # Square
td.col("x").sqrt() # Square root
td.col("x").cbrt() # Cube root
# Logarithms
td.col("x").log() # Natural logarithm
td.col("x").log10() # Base-10 logarithm
td.col("x").log1p() # log(1 + x)
td.col("x").exp() # e^x
Trigonometric Functions
# Basic trig functions
td.col("angle").sin()
td.col("angle").cos()
td.col("angle").tan()
td.col("angle").cot()
# Inverse trig functions
td.col("x").arcsin()
td.col("x").arccos()
td.col("x").arctan()
# Hyperbolic functions
td.col("x").sinh()
td.col("x").cosh()
td.col("x").tanh()
# Angle conversions
td.col("degrees").radians() # Degrees to radians
td.col("radians").degrees() # Radians to degrees
Difference and Comparison
# Row-to-row difference
td.col("value").diff() # Difference from previous row
td.col("value").diff(n=2) # Difference from 2 rows ago
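The semantics of `diff()` can be sketched in plain Python (illustrative only): each value minus the value `n` rows earlier, with no result for the first `n` rows.

```python
# Plain-Python sketch of diff() semantics (None stands in for null).
def diff(values, n=1):
    return [None if i < n else values[i] - values[i - n]
            for i in range(len(values))]

daily = [100, 103, 101, 108]
changes = diff(daily)        # difference from the previous row
two_back = diff(daily, n=2)  # difference from 2 rows ago
```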
Date and Time Operations
The .dt namespace provides datetime manipulation functions.
Extracting Components
# Date components
td.col("date").dt.year()
td.col("date").dt.month()
td.col("date").dt.day()
td.col("date").dt.quarter()
td.col("date").dt.week()
td.col("date").dt.weekday() # ISO weekday: 1 = Monday, 7 = Sunday
td.col("date").dt.ordinal_day() # Day of year (1-366)
td.col("date").dt.is_leap_year()
# Time components
td.col("timestamp").dt.hour()
td.col("timestamp").dt.minute()
td.col("timestamp").dt.second()
td.col("timestamp").dt.millisecond()
td.col("timestamp").dt.microsecond()
td.col("timestamp").dt.nanosecond()
# ISO calendar
td.col("date").dt.iso_year()
Date Arithmetic
# Add/subtract time intervals
td.col("date").dt.offset_by("1w") # Add 1 week
td.col("date").dt.offset_by("-3d") # Subtract 3 days
td.col("date").dt.offset_by("2mo") # Add 2 months
# Business days
td.col("date").dt.add_business_days(5)
# Month boundaries
td.col("date").dt.month_start() # First day of month
td.col("date").dt.month_end() # Last day of month
Truncation
# Truncate to time unit
td.col("timestamp").dt.truncate("1h") # Truncate to hour
td.col("timestamp").dt.truncate("1d") # Truncate to day
td.col("timestamp").dt.truncate("1mo") # Truncate to month
td.col("timestamp").dt.truncate("1w") # Truncate to week
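Truncation snaps each timestamp down to the start of its bucket. With the standard library, the same idea looks like this (illustrative only):

```python
from datetime import datetime

# Truncation zeroes out the units smaller than the bucket.
ts = datetime(2024, 3, 15, 14, 37, 52)

to_hour = ts.replace(minute=0, second=0, microsecond=0)                  # like truncate("1h")
to_day = ts.replace(hour=0, minute=0, second=0, microsecond=0)           # like truncate("1d")
to_month = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)  # like truncate("1mo")
```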
Total Durations
# Convert duration to total units
td.col("duration").dt.total_days()
td.col("duration").dt.total_hours()
td.col("duration").dt.total_minutes()
td.col("duration").dt.total_seconds()
td.col("duration").dt.total_milliseconds()
td.col("duration").dt.total_microseconds()
Timezone Operations
# Convert timezone
td.col("timestamp").dt.convert_time_zone("America/New_York")
# Replace timezone (without conversion)
td.col("timestamp").dt.replace_time_zone("UTC")
# Get timezone offsets
td.col("timestamp").dt.base_utc_offset()
td.col("timestamp").dt.dst_offset()
Formatting
# Format as string
td.col("date").dt.to_string("%Y-%m-%d")
td.col("timestamp").dt.strftime("%Y-%m-%d %H:%M:%S")
# Get epoch timestamp
td.col("timestamp").dt.epoch(time_unit="s") # Seconds since epoch
td.col("timestamp").dt.timestamp() # Microseconds since epoch
Aggregation and Grouping
Aggregate data to compute summary statistics.
Basic Aggregations
# Count
td.col("id").count() # Non-null count
td.col("id").len() # Total count including nulls
# Sum and mean
td.col("amount").sum()
td.col("amount").mean()
td.col("amount").median()
# Min and max
td.col("date").min()
td.col("value").max()
# First and last
td.col("timestamp").first()
td.col("timestamp").last()
# Unique counts
td.col("category").n_unique()
Group By Operations
Use group_by() to aggregate by groups:
# Single column grouping
tf.group_by("category").agg(
    td.col("sales").sum().alias("total_sales"),
    td.col("quantity").mean().alias("avg_quantity")
)
# Multiple column grouping
tf.group_by("region", "product_type").agg(
    td.col("revenue").sum().alias("total_revenue"),
    td.col("order_id").count().alias("order_count")
)
# Group by expression
tf.group_by(
    td.col("timestamp").dt.year().alias("year"),
    td.col("timestamp").dt.month().alias("month")
).agg(
    td.col("amount").sum().alias("monthly_total")
)
Shorthand Group Aggregations
For simple aggregations:
# Count per group
tf.group_by("category").count()
# Sum per group
tf.group_by("category").sum()
# Mean per group
tf.group_by("category").mean()
# Min/Max per group
tf.group_by("category").min()
tf.group_by("category").max()
Ranking
Assign ranks to values:
tf.with_columns(
    td.col("score").rank(method="ordinal").alias("rank"),
    td.col("score").rank(method="dense").alias("dense_rank"),
    td.col("score").rank(method="min").alias("min_rank"),
    td.col("score").rank(method="max").alias("max_rank"),
    td.col("score").rank(method="average").alias("avg_rank")
)
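These methods differ only in how ties are handled. A plain-Python sketch of three of them, following the usual ranking conventions (ascending, 1-based; illustrative only, not Tabsdata internals):

```python
# Tie-handling for three common rank methods.
def rank(values, method="min"):
    order = sorted(range(len(values)), key=lambda i: values[i])
    if method == "ordinal":
        # Ties broken by row order: every value gets a distinct rank.
        ranks = [0] * len(values)
        for pos, i in enumerate(order, start=1):
            ranks[i] = pos
        return ranks
    if method == "dense":
        # Tied values share a rank; no gaps after a tie.
        seen = {}
        for i in order:
            seen.setdefault(values[i], len(seen) + 1)
        return [seen[v] for v in values]
    if method == "min":
        # Tied values share the lowest rank in the tie; gaps follow.
        first_pos = {}
        for pos, i in enumerate(order, start=1):
            first_pos.setdefault(values[i], pos)
        return [first_pos[v] for v in values]
    raise ValueError(method)

scores = [50, 80, 80, 90]
```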
Joining Tables
Combine data from multiple TableFrames.
Join Types
# Inner join (default)
tf1.join(tf2, on="key_column")
tf1.join(tf2, on="key_column", how="inner")
# Left join
tf1.join(tf2, on="key_column", how="left")
# Right join
tf1.join(tf2, on="key_column", how="right")
# Outer (full) join
tf1.join(tf2, on="key_column", how="full")
# Semi join (keep left rows that have matches)
tf1.join(tf2, on="key_column", how="semi")
# Anti join (keep left rows that don't have matches)
tf1.join(tf2, on="key_column", how="anti")
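Semi and anti joins never add columns from the right table; they only filter the left one. In plain Python (illustrative only):

```python
# Semi/anti join semantics: both return rows and columns from the LEFT
# table only, filtered by key membership in the right table.
left = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
right_keys = {1, 3}

semi = [row for row in left if row["id"] in right_keys]      # rows WITH a match
anti = [row for row in left if row["id"] not in right_keys]  # rows WITHOUT a match
```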
Join on Multiple Columns
# Same column names in both tables
tf1.join(tf2, on=["customer_id", "date"])
# Different column names
tf1.join(
    tf2,
    left_on=["cust_id", "order_date"],
    right_on=["customer_id", "date"]
)
Handling Duplicate Column Names
# Add suffix to duplicate columns
tf1.join(tf2, on="id", suffix="_right")
Join Example
import tabsdata as td
@td.transformer(
    input_tables=["orders", "customers"],
    output_tables=["order_details"],
)
def enrich_orders(orders: td.TableFrame, customers: td.TableFrame):
    return orders.join(
        customers.select("customer_id", "name", "region"),
        on="customer_id",
        how="left"
    )
Union Operations
Combine rows from multiple TableFrames.
Concatenation
Use td.concat() to stack TableFrames vertically:
import tabsdata as td
# Concatenate multiple TableFrames
combined = td.concat([tf1, tf2, tf3])
# With rechunking for optimization
combined = td.concat([tf1, tf2, tf3], rechunk=True)
Note
All TableFrames must have the same schema (column names and types) for concatenation.
Data Manipulation
Handle missing values and transform data.
Handling Null Values
# Drop rows with any null values
tf.drop_nulls()
# Drop rows with nulls in specific columns
tf.drop_nulls(subset=["email", "phone"])
# Fill null values with a constant
tf.fill_null(0)
td.col("status").fill_null("unknown")
# Fill nulls in the entire TableFrame
tf.fill_null({
    "quantity": 0,
    "status": "pending",
    "price": 0.0
})
Handling NaN Values
For floating-point columns:
# Drop rows with NaN values
tf.drop_nans()
tf.drop_nans(subset=["ratio", "percentage"])
# Fill NaN values
tf.fill_nan(0.0)
td.col("rate").fill_nan(0.0)
Sorting
# Sort by single column (ascending)
tf.sort("date")
# Sort descending
tf.sort("date", descending=True)
# Sort by multiple columns
tf.sort(["category", "date"], descending=[False, True])
# Nulls placement
tf.sort("value", nulls_last=True)
Type Casting
# Cast expression type
td.col("id").cast(td.Int64)
td.col("price").cast(td.Float64)
td.col("is_active").cast(td.Boolean)
# Cast entire TableFrame columns
tf.cast({"id": td.String, "amount": td.Float64})
# String to date/time
td.col("date_str").str.to_date("%Y-%m-%d")
td.col("ts_str").str.to_datetime("%Y-%m-%d %H:%M:%S")
td.col("time_str").str.to_time("%H:%M:%S")
td.col("num_str").str.to_integer()
# Date to string
td.col("date").dt.to_string("%Y-%m-%d")
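The format strings use the familiar strftime/strptime directives. The Python standard library uses the same ones, so the round trip can be checked in plain Python:

```python
from datetime import datetime

# "%Y-%m-%d" uses the standard directives:
# %Y = 4-digit year, %m = zero-padded month, %d = zero-padded day.
parsed = datetime.strptime("2024-03-15", "%Y-%m-%d").date()
formatted = parsed.strftime("%Y-%m-%d")
```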
TableFrame Attributes
Access metadata about your TableFrame.
Schema Information
# Get column data types
dtypes = tf.dtypes
# Check if columns exist
tf.has_cols("column_name")
tf.has_cols(["col1", "col2"])
# Assert columns exist (raises error if not)
tf.assert_has_cols(["required_col1", "required_col2"])
# Compare schemas
tf.has_same_schema(other_tf)
Empty Checks
# Check if TableFrame is empty
tf.is_empty()
# Create an empty TableFrame with same schema
empty_tf = tf.empty()
# Clear all rows
cleared_tf = tf.clear()
Query Explanation
For debugging and optimization:
# Get query plan explanation
plan = tf.explain()
print(plan)
# Visualize query graph
tf.show_graph()
Literal Values
Create constant value columns:
# Add literal value columns
tf.with_columns(
    td.lit(1).alias("version"),
    td.lit("active").alias("status"),
    td.lit(True).alias("is_valid"),
    td.lit(3.14159).alias("pi")
)
Extracting Data
Extract specific data patterns from tables.
Extract as Columns
Pivot unique values into columns:
# Extract unique values from a column as separate columns
tf.extract_as_columns(
    column="metric_name",
    value_column="metric_value"
)
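A plain-Python sketch of the long-to-wide reshape this performs (illustrative only; `host` is a hypothetical extra key column, and the other names mirror the call above):

```python
# Rows keyed by (host, metric_name) become one row per host with one
# column per distinct metric_name.
long_rows = [
    {"host": "a", "metric_name": "cpu", "metric_value": 0.7},
    {"host": "a", "metric_name": "mem", "metric_value": 0.4},
    {"host": "b", "metric_name": "cpu", "metric_value": 0.2},
]
wide = {}
for row in long_rows:
    target = wide.setdefault(row["host"], {"host": row["host"]})
    target[row["metric_name"]] = row["metric_value"]
wide_rows = list(wide.values())
```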
Extract as Rows
Unpivot columns into rows:
# Melt multiple columns into rows
tf.extract_as_rows(
    columns=["jan_sales", "feb_sales", "mar_sales"],
    variable_name="month",
    value_name="sales"
)
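The inverse reshape, wide-to-long (a melt), in plain Python (illustrative only; names mirror the call above):

```python
# One wide row becomes one long row per melted column.
wide_row = {"region": "EU", "jan_sales": 10, "feb_sales": 12, "mar_sales": 9}
melt_columns = ["jan_sales", "feb_sales", "mar_sales"]

long_rows = [
    {"region": wide_row["region"], "month": col, "sales": wide_row[col]}
    for col in melt_columns
]
```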
Unnesting
Flatten nested structures:
# Unnest struct columns
tf.unnest("address") # Expands struct fields to columns
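The effect on a single row can be pictured in plain Python (illustrative only): the struct's fields are promoted to top-level columns.

```python
# Unnesting a struct column: nested fields become sibling columns.
row = {"id": 1, "address": {"city": "Paris", "zip": "75001"}}
unnested = {"id": row["id"], **row["address"]}
```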
Best Practices
Chain Operations: TableFrame operations return new TableFrames, enabling fluent method chaining:
result = (
    tf
    .filter(td.col("status") == "active")
    .with_columns(
        (td.col("price") * td.col("quantity")).alias("total")
    )
    .group_by("category")
    .agg(td.col("total").sum().alias("category_total"))
    .sort("category_total", descending=True)
)
Use Selectors for Dynamic Column Selection: Instead of hardcoding column names, use selectors for more maintainable code.
Prefer Native Operations Over UDFs: TableFrame expressions are optimized; use UDFs only when necessary.
Filter Early: Apply filters as early as possible in your pipeline to reduce data volume.
Use Aliases: Always alias computed columns for clarity and to avoid naming conflicts.
See Also
Working with Tables - Basic table operations and CLI commands
User-defined Functions (UDFs) - Custom UDF implementation
TableFrame API Reference - Complete API documentation