Data Quality
Tabsdata Data Quality (DQ) lets you attach declarative quality checks to the output of publishers and transformers. Every run inspects the data that was just produced, enriches it with quality signals, and optionally quarantines or rejects rows that fail your criteria.
The feature augments each output with quality columns, per-classifier summaries, and optional quarantine/select tables. These artifacts help teams monitor ingestion quality, steer remediation, and deliver trustworthy data for downstream decisions.
In the code below, a transformer reads from a new_visitors table and outputs to a visitors table with data quality checks attached. Three classifiers inspect the data: one checks for NULL values in name and email (tagged as "required"), another checks for NULL values in country, and a third verifies that ok_to_contact is true. Three operators then act on these signals: Summary() generates a visitor_dq_summary table with aggregated quality metrics, Select() copies rows passing all checks to a visitor_leads table, and Filter() removes rows with NULL name or email values from the output. Note that rows with NULL country or false ok_to_contact remain in the output since the filter only targets the "required" tag.
Example Code:
import tabsdata as td
import tabsdata.dataquality as dq

@td.transformer(
    input_tables=["new_visitors"],
    output_tables=["visitors"],
    on_output_tables=[
        dq.DataQuality(
            table="visitors",
            classifiers=[
                dq.IsNotNull(column_names=["name", "email"], tags="required"),
                dq.IsNotNull("country"),
                dq.IsTrue("ok_to_contact"),
            ],
            operators=[
                # creates table `visitor_dq_summary`
                dq.Summary(),
                # creates table `visitor_leads`
                dq.Select(
                    to_table="visitor_leads",
                    criteria=dq.AllOk(),
                ),
                # drop rows without name or email from `visitors`
                dq.Filter(
                    criteria=dq.AnyFailed(tags="required"),
                ),
            ],
        )
    ],
)
def process_visitors(tf: td.TableFrame) -> td.TableFrame:
    return tf
DataQuality Action
The DataQuality action is the entry point for attaching quality checks to an output table. Declared under the on_tables parameter of publisher decorators and the on_output_tables parameter of transformer decorators, a DataQuality action binds a target output table to a declarative, programmatic set of quality checks and follow-up reactions so that every run can inspect and act on the data it just produced.
Example: You have a publisher writing customer data to a customers table. By attaching a DataQuality action to this table, you can automatically validate every batch of customer records before they become available to downstream consumers.
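A minimal sketch of that setup (the file name, classifier, and operator choices are illustrative, not prescribed):

import tabsdata as td
import tabsdata.dataquality as dq

@td.publisher(
    source=td.LocalFileSource("customers.csv"),  # hypothetical source file
    output_tables=["customers"],
    on_tables=[
        dq.DataQuality(
            table="customers",
            # flag records with a missing email before consumers see them
            classifiers=[dq.IsNotNull("email")],
            # aggregate pass/fail counts into a summary table
            operators=[dq.Summary()],
        )
    ],
)
def publish_customers(tf: td.TableFrame) -> td.TableFrame:
    return tf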
Data Quality actions have these core ingredients:
- Classifiers
Inspect one or more columns and emit boolean or categorical signals (e.g., is_positive, is_null). Each classifier examines specific columns and creates a data quality column with the values generated by the classifier.
Example: Use dq.IsNull("email") to detect rows with missing email addresses, or dq.IsPositive("order_amount") to flag transactions with zero or negative values that might indicate data entry errors.
- Operators
Consume the signals emitted by classifiers to enrich tables, build summaries, route/retain rows, or fail a run. Operators determine what happens after classifiers have evaluated the data.
Example: After classifying rows with null customer IDs, use dq.Filter() to quarantine those rows into a separate bad_records table for manual review, while clean rows flow to downstream analytics. Or use dq.Summary() to generate a quality report showing how many records passed or failed each check.
- Criteria & Thresholds
Describe which classifier outcomes an operator cares about and when to trigger actions such as filtering or failure. Criteria specify the conditions (e.g., "any null values"), while thresholds define acceptable limits (e.g., "fail if more than 5% of rows are affected").
Example: Use dq.AnyFailed(tags="nulls") as criteria to match rows that failed null checks, combined with dq.PercentThreshold(5) to abort the pipeline if more than 5% of incoming records have missing required fields.
- Tags (optional)
Label every classifier output column. Operators can refer to those labels to focus on specific checks, all checks, or only the untagged ones. Tags help organize multiple classifiers and allow operators to selectively act on specific quality dimensions.
Example: Tag null-checking classifiers with tags="completeness" and range-checking classifiers with tags="validity". Then configure one Filter operator to quarantine rows failing completeness checks and another to quarantine rows failing validity checks into separate tables for different remediation workflows, as sketched below.

When a classifier runs, every generated column inherits the tags declared on that classifier (or stays untagged if no tags were provided). Operators read those tags to decide which columns they should process. Aside from the no-duplicates rules called out below, there is no cap on how many actions, classifiers, or operators can be attached to a single function.
Note
Running a DQ action can yield additional derived tables (enriched rows, summary tables, quarantine tables) alongside the regular outputs. Downstream steps can consume these results in the same function run.
All configuration and consistency checks run during function registration, so once a transformer or publisher is accepted, authors know every DQ action attached to it is structurally valid.
Examples
Example 1 – Enriching Output with Quality Columns
The following example shows how to enrich output rows with quality flag columns. A publisher writes order data to an orders table, and the Enrich() operator adds a total_is_positive column to the output orders table indicating whether each row’s total value is positive.
import tabsdata as td
import tabsdata.dataquality as dq

@td.publisher(
    source=td.LocalFileSource("orders.csv"),
    output_tables=["orders"],
    on_tables=[
        dq.DataQuality(
            table="orders",
            classifiers=[dq.IsPositive("total")],
            operators=[dq.Enrich()],
        )
    ],
)
def publish_orders(tf: td.TableFrame) -> td.TableFrame:
    return tf
Example 2 – Failing a Run Based on Row Count Threshold
In this example, a transformer validates that critical fields (order_id, customer_id) are not NULL and aborts the run when too many rows fail the check. The Fail() operator uses RowCountThreshold to enforce the limit: if 10 or more rows have NULL values in order_id or customer_id, the entire run fails with a DataQualityProcessingError.
import tabsdata as td
import tabsdata.dataquality as dq

@td.transformer(
    input_tables=["raw_orders"],
    output_tables=["validated_orders"],
    on_output_tables=[
        dq.DataQuality(
            table="validated_orders",
            classifiers=[dq.IsNotNull(["order_id", "customer_id"])],
            operators=[
                dq.Fail(
                    criteria=dq.AnyFailed(),
                    threshold=dq.RowCountThreshold(10),
                )
            ],
        )
    ],
)
def validate_orders(tf: td.TableFrame) -> td.TableFrame:
    return tf
Classifiers
There are two families of classifiers:
Boolean Classifiers
Boolean classifiers add <column>_<classifier> UInt8 columns. Values follow a fixed mapping:
- 0 = False
- 1 = True
- 252 = underflow
- 253 = overflow
- 254 = NaN
- 255 = NULL
If an operator's criteria set none_is_ok=True and/or nan_is_ok=True, and the classifier isn't expressly checking for nullability (e.g., IsPositive), missing values are treated as True instead of receiving sentinel codes.
Nullability classifiers:
- IsNull / IsNotNull
- IsNullOrNan / IsNotNullNorNan
- IsNan / IsNotNan
Zero and sign checks:
- IsZero / IsNotZero
- IsPositive / IsPositiveOrZero
- IsNegative / IsNegativeOrZero
Range checks:
- IsBetween / IsNotBetween – with min_val, max_val, and closed ∈ {"none", "lower", "upper", "both"}
Set membership:
- IsIn / IsNotIn
Pattern and length:
- Matches / DoesNotMatch
- HasLength
All boolean classifiers support:
- column_names – include multiple columns at once.
- on_missing_column – "ignore" or "fail" to decide whether schema drift aborts the run.
- on_wrong_type – "ignore" or "fail" for type mismatches.
- Optional tags.
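For instance, a nullability check combining these options might look like this (column names are illustrative):

import tabsdata.dataquality as dq

dq.IsNotNull(
    column_names=["order_id", "customer_id"],  # check several columns at once
    on_missing_column="fail",   # abort the run if a column disappears
    on_wrong_type="ignore",     # tolerate type mismatches
    tags="completeness",        # label the generated DQ columns
)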
Categorical Classifiers
Categorical classifiers extend the abstract Categorizer. The available implementation is ScaleCategorizer, which emits a <column>_scale_category UInt8 column representing bin numbers plus special sentinels ("none", "nan", "underflow", "overflow").
It accepts numeric inputs and one of the following scales:
- IdentityScale
IdentityScale(min_val, max_val, use_bin_zero=False) – integer-to-bucket mapping.
- LinearScale
LinearScale(min_val, max_val, bins=MAX_BINS, use_bin_zero=False) – equal-width bins.
- MonomialScale
MonomialScale(min_val, max_val, power, bins=MAX_BINS, use_bin_zero=False) – power-law spacing. If min_val is negative, the values are shifted so the minimum maps to 0 before binning.
- LogarithmicScale
LogarithmicScale(min_val, max_val, base, bins=MAX_BINS, use_bin_zero=False) – logarithmic spacing. Negative min_val inputs are shifted so the minimum becomes just above zero to keep the domain positive.
- ExponentialScale
ExponentialScale(min_val, max_val, base, bins=MAX_BINS, use_bin_zero=False) – exponential widening. Negative ranges are handled like the logarithmic scale.
MAX_BINS is 100. When use_bin_zero=True, bin 0 is reserved for the minimum and the usable bin count shrinks accordingly.
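As a sketch, a logarithmic binning of request latencies could be declared like this (the column name and range are illustrative, and we assume a plain column name is accepted in place of a (source, alias) pair):

import tabsdata.dataquality as dq

# 10 logarithmically spaced bins between 1 ms and 10 s;
# bin 0 is reserved for the minimum because use_bin_zero=True
latency_scale = dq.LogarithmicScale(
    min_val=1.0,
    max_val=10_000.0,
    base=10,
    bins=10,
    use_bin_zero=True,
)

# emits a `latency_ms_scale_category` UInt8 column by default
dq.ScaleCategorizer("latency_ms", scale=latency_scale)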
Default Column Names
Boolean classifier outputs:
| Classifier | Default Column Name |
|---|---|
| IsNull | <column>_is_null |
| IsNotNull | <column>_is_not_null |
| IsNullOrNan | <column>_is_null_or_nan |
| IsNotNullNorNan | <column>_is_not_null_nor_nan |
| IsNan | <column>_is_nan |
| IsNotNan | <column>_is_not_nan |
| IsZero | <column>_is_zero |
| IsNotZero | <column>_is_not_zero |
| IsPositive | <column>_is_positive |
| IsPositiveOrZero | <column>_is_positive_or_zero |
| IsNegative | <column>_is_negative |
| IsNegativeOrZero | <column>_is_negative_or_zero |
| IsBetween | <column>_is_between |
| IsNotBetween | <column>_is_not_between |
| IsIn | <column>_is_in |
| IsNotIn | <column>_is_not_in |
| Matches | <column>_matches |
| DoesNotMatch | <column>_does_not_match |
| HasLength | <column>_has_length |
Categorizer outputs:
| Classifier | Default Column Name |
|---|---|
| ScaleCategorizer | <column>_scale_category |
When you supply (source, alias) pairs, the alias overrides these defaults.
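For example (the alias is illustrative, and we assume boolean classifiers accept the same pair form shown for the categorizer later in this page):

import tabsdata.dataquality as dq

# default: generates `total_is_positive`
dq.IsPositive("total")

# (source, alias) pair: generates `total_ok` instead
dq.IsPositive(("total", "total_ok"))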
Operators
Operators consume classifier signals and produce actions on your data.
Enrich
Enrich(to_table=None, tags=None)
Adds classifier columns to either the original table (default, in-place overwrite) or a dedicated new table. When tags are provided, only classifiers with matching tags are materialized.
Summary
Summary(tags=None, table=None)
Always emits a new table. When you omit table, the engine names it <source>_dq_summary.
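For instance, to summarize only the completeness checks into a custom table (names illustrative):

import tabsdata.dataquality as dq

# summarize only classifiers tagged "completeness",
# writing to `orders_quality` instead of `<source>_dq_summary`
dq.Summary(tags="completeness", table="orders_quality")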
Filter
Filter(criteria, to_table=None, include_quality_columns="none")
Removes rows matching criteria from the output table. With to_table, rejected rows are written to a separate table; otherwise they are dropped.
include_quality_columns options:
"none"– bare data rows."criteria"– only the DQ columns referenced by the criteria (boolean or categorical)."all"– all classifier outputs.
Select
Select(criteria, to_table, include_quality_columns="none")
Copies rows matching criteria to another table without mutating the source table. A target name is mandatory.
Fail
Fail(criteria, threshold)
Aborts the entire transformer/publisher run when the threshold is exceeded.
Thresholds:
- RowCountThreshold(row_count) – fail if matching rows >= count.
- PercentThreshold(percent) – fail if the percentage of matching rows exceeds the given value.
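A percentage-based gate might look like this (the 5% limit is illustrative):

import tabsdata.dataquality as dq

# abort the run when more than 5% of rows fail any boolean check
dq.Fail(
    criteria=dq.AnyFailed(),
    threshold=dq.PercentThreshold(5),
)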
Derived Table Names
| Operator | Default Table Name | Behavior with to_table / table |
|---|---|---|
| Enrich | Overwrites source table in place | Uses your exact to_table name |
| Filter | Rows dropped; no table emitted | Uses your exact to_table name |
| Select | Target name is mandatory | Uses your exact to_table name |
| Summary | <source>_dq_summary | Uses your exact table name |
Cross-Action Constraints
- Only one Enrich per output table may redirect rows by setting to_table, because that operator substitutes the original table with the enriched copy.
- A given derived table name (to_table in Filter/Select or table in Summary) must be declared once across all actions attached to the function; producing the same non-output table multiple times is disallowed.
Criteria and Thresholds
Boolean criteria:
- AllOk / AnyFailed – operate on boolean classifier columns and expose none_is_ok / nan_is_ok toggles. Set these to True when you want operators to treat missing/NaN values as acceptable rather than as sentinel codes.
Categorical criteria:
- InBins / NotInBins – target categorizer outputs. Bins can include integers or the special labels "none", "nan", "underflow", "overflow".
Each criterion holds onto the classifier tags it needs, so operators automatically align with the right DQ columns. Threshold choices (row count / percent) are validated to catch impossible values.
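As a sketch (assuming InBins accepts a bins list, as the labels above suggest), a Filter could quarantine categorized readings that fall outside the configured range:

import tabsdata.dataquality as dq

# assumes a ScaleCategorizer emitted a categorical DQ column;
# route missing, NaN, and out-of-range readings to a separate table
dq.Filter(
    criteria=dq.InBins(bins=["none", "nan", "underflow", "overflow"]),
    to_table="sensor_readings_out_of_range",
)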
Showcased Workflows
1. Enriching Output Rows and Building Summaries
dq.DataQuality(
    table="table_o",
    classifiers=[dq.IsPositive("value")],
    operators=[dq.Summary(), dq.Enrich()],
)
This produces table_o with the extra value_is_positive column (UInt8 status codes) and table_o_dq_summary with boolean counts (dq.true, dq.false, dq.none, dq.nan). This is the easiest drop-in for surfacing quality metrics alongside the data.
2. Failing a Run When Breaches Cross a Threshold
dq.DataQuality(
    table="clean",
    classifiers=[dq.IsNotNull(["customer_id", "order_id"])],
    operators=[
        dq.Fail(
            criteria=dq.AnyFailed(none_is_ok=False),
            threshold=dq.RowCountThreshold(2),
        )
    ],
)
If 2 or more rows contain NULL in either column, the transformer raises DataQualityProcessingError and the orchestration layer can retry or alert. Swapping to PercentThreshold(5) switches to percentage-based gating.
3. Categorizing Continuous Metrics
temp_scale = dq.LinearScale(
    min_val=-20.0,
    max_val=50.0,
    bins=7,
    use_bin_zero=True,
)

dq.DataQuality(
    table="sensor_readings",
    classifiers=[
        dq.ScaleCategorizer(("temp_c", "temp_bucket"), scale=temp_scale),
    ],
    operators=[dq.Enrich(), dq.Summary()],
)
- temp_bucket holds UInt8 bin ids, where 0 means "exactly the minimum" because use_bin_zero=True.
- Special status values mark NULL, NaN, underflow, and overflow readings when they fall outside the configured range.
- The summary table reports dq.bins, dq.under, dq.over, and dq.p000 … dq.p007 counts so dashboards can aggregate distribution drift.
4. Guarding Against Schema Drift
dq.IsNull(
    ["legacy_col"],
    on_missing_column="fail",  # abort if the publisher forgets the column
)

dq.IsNan(
    "ratio",
    on_wrong_type="fail",  # catch accidental casting to string
)
These flags are essential when the source team wants DQ to double as guardrails against structural regressions.
Summary Table Schema
Reference for the schema emitted by dq.Summary() at runtime:
| Column | Meaning |
|---|---|
| dq.true / dq.false | Counts of boolean passes/failures |
| dq.none / dq.nan | Counts of rows with None / NaN |
| dq.bins | Number of configured bins (categorizer only, else None) |
| dq.under / dq.over | Underflow/overflow counts for categorizers |
| dq.p000 … | Bin occupancy; unused bins remain None |

The summary also records the engine version (1 today), the result column name, and the stringified classifier configuration.
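Because the summary is an ordinary derived table, a downstream transformer can consume it like any other input (table names follow the first example on this page; the pass-through body is illustrative):

import tabsdata as td

@td.transformer(
    input_tables=["visitor_dq_summary"],
    output_tables=["dq_alerts"],
)
def alert_on_quality(summary: td.TableFrame) -> td.TableFrame:
    # inspect failure counts here and emit whatever alerting
    # rows your workflow needs; this sketch passes them through
    return summary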
Known Limitations
- Subscribers do not emit TableFrames and therefore cannot run DQ yet.
- Input-table DQ (validating data as it enters a transformer) is not implemented. Today, users must stage raw data or build a two-step pipeline.
- Max of 100 bins per categorizer (101 if use_bin_zero=True). Heavy skew scenarios may require multiple categorizer columns.