Data Quality

Tabsdata Data Quality (DQ) lets you attach declarative quality checks to the output of publishers and transformers. Every run inspects the data that was just produced, enriches it with quality signals, and optionally quarantines or rejects rows that fail your criteria.

The feature augments each output with quality columns, per-classifier summaries, and optional quarantine/select tables. These artifacts help teams monitor ingestion quality, steer remediation, and deliver trustworthy data for downstream decisions.

In the code below, a transformer reads from a new_visitors table and outputs to a visitors table with data quality checks attached. Three classifiers inspect the data: one checks that name and email are not NULL (tagged as "required"), another checks that country is not NULL, and a third verifies that ok_to_contact is true. Three operators then act on these signals: Summary() generates a visitors_dq_summary table with aggregated quality metrics, Select() copies rows passing all checks to a visitor_leads table, and Filter() removes rows with NULL name or email values from the output. Note that rows with NULL country or false ok_to_contact remain in the output, since the filter only targets the "required" tag.

Example Code:

import tabsdata as td
import tabsdata.dataquality as dq

@td.transformer(
    input_tables=["new_visitors"],
    output_tables=["visitors"],
    on_output_tables=[
        dq.DataQuality(
            table="visitors",
            classifiers=[
                dq.IsNotNull(column_names=["name", "email"], tags="required"),
                dq.IsNotNull("country"),
                dq.IsTrue("ok_to_contact"),
            ],
            operators=[
                # creates table `visitors_dq_summary`
                dq.Summary(),

                # creates table `visitor_leads`
                dq.Select(
                    to_table="visitor_leads",
                    criteria=dq.AllOk(),
                ),
                # drop rows without name or email from `visitors`
                dq.Filter(
                    criteria=dq.AnyFailed(tags="required"),
                ),
            ],
        )
    ],
)
def process_visitors(tf: td.TableFrame) -> td.TableFrame:
    return tf

DataQuality Action

The DataQuality action is the entry point for attaching quality checks to an output table. Declared via the on_tables parameter of the publisher decorator or the on_output_tables parameter of the transformer decorator, a DataQuality action binds a target output table to a declarative, programmatically defined set of quality checks and follow-up reactions, so every run can inspect and act on the data that was just produced.

Example: You have a publisher writing customer data to a customers table. By attaching a DataQuality action to this table, you can automatically validate every batch of customer records before they become available to downstream consumers.

Data Quality actions have these core ingredients:

Classifiers

Inspect one or more columns and emit boolean or categorical signals (e.g., is_positive, is_null). Each classifier examines specific columns and adds a data quality column holding the values it generates.

Example: Use dq.IsNull("email") to detect rows with missing email addresses, or dq.IsPositive("order_amount") to flag transactions with zero or negative values that might indicate data entry errors.
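
As a minimal sketch, both checks can sit in one action; the orders table and the Enrich() step here are illustrative:

import tabsdata.dataquality as dq

order_checks = dq.DataQuality(
    table="orders",
    classifiers=[
        dq.IsNull("email"),             # adds `email_is_null`
        dq.IsPositive("order_amount"),  # adds `order_amount_is_positive`
    ],
    operators=[dq.Enrich()],  # materialize the flag columns on `orders`
)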

Operators

Consume the signals emitted by classifiers to enrich tables, build summaries, route/retain rows, or fail a run. Operators determine what happens after classifiers have evaluated the data.

Example: After classifying rows with null customer IDs, use dq.Filter() to quarantine those rows into a separate bad_records table for manual review, while clean rows flow to downstream analytics. Or use dq.Summary() to generate a quality report showing how many records passed or failed each check.
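
A sketch of that routing; the customers and bad_records table names are illustrative:

import tabsdata.dataquality as dq

customer_checks = dq.DataQuality(
    table="customers",
    classifiers=[dq.IsNull("customer_id")],
    operators=[
        # Quarantine rows with NULL customer IDs into `bad_records`.
        dq.Filter(criteria=dq.AnyFailed(), to_table="bad_records"),
        # Report pass/fail counts per check in `customers_dq_summary`.
        dq.Summary(),
    ],
)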

Criteria & Thresholds

Describe which classifier outcomes an operator cares about and when to trigger actions such as filtering or failure. Criteria specify the conditions (e.g., “any null values”), while thresholds define acceptable limits (e.g., “fail if more than 5% of rows are affected”).

Example: Use dq.AnyFailed(tags="nulls") as criteria to match rows that failed null checks, combined with dq.PercentThreshold(5) to abort the pipeline if more than 5% of incoming records have missing required fields.
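
A sketch of that gate; the incoming_records table and its columns are illustrative:

import tabsdata.dataquality as dq

ingest_gate = dq.DataQuality(
    table="incoming_records",
    classifiers=[dq.IsNotNull(["customer_id", "email"], tags="nulls")],
    operators=[
        dq.Fail(
            criteria=dq.AnyFailed(tags="nulls"),  # which outcomes to count
            threshold=dq.PercentThreshold(5),     # abort above 5% of rows
        )
    ],
)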

Tags (optional)

Label every classifier output column. Operators can refer to those labels to focus on specific checks, all checks, or only the untagged ones. Tags help organize multiple classifiers and allow operators to selectively act on specific quality dimensions.

Example: Tag null-checking classifiers with tags="completeness" and range-checking classifiers with tags="validity". Then configure one Filter operator to quarantine rows failing completeness checks and another to quarantine rows failing validity checks into separate tables for different remediation workflows.
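
A sketch of that split; the transactions table, its columns, and the IsBetween range are illustrative:

import tabsdata.dataquality as dq

txn_checks = dq.DataQuality(
    table="transactions",
    classifiers=[
        dq.IsNotNull("customer_id", tags="completeness"),
        dq.IsBetween("amount", min_val=0, max_val=10000, tags="validity"),
    ],
    operators=[
        # Each filter only sees classifier columns carrying its tag.
        dq.Filter(criteria=dq.AnyFailed(tags="completeness"),
                  to_table="incomplete_rows"),
        dq.Filter(criteria=dq.AnyFailed(tags="validity"),
                  to_table="out_of_range_rows"),
    ],
)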

When a classifier runs, every generated column inherits the tags declared on that classifier (or stays untagged if no tags were provided). Operators read those tags to decide which columns they should process. Aside from the no-duplicates rules called out below, there is no cap on how many actions, classifiers, or operators can be attached to a single function.

Note

Running a DQ action can yield additional derived tables (enriched rows, summary tables, quarantine tables) alongside the regular outputs. In such a case, downstream steps can consume the results in the same function run.

All configuration and consistency checks run during function registration, so once a transformer or publisher is accepted, authors know every DQ action attached to it is structurally valid.

Examples

Example 1 – Enriching Output with Quality Columns

The following example shows how to enrich output rows with quality flag columns. A publisher writes order data to an orders table, and the Enrich() operator adds a total_is_positive column to the output orders table indicating whether each row’s total value is positive.

import tabsdata as td
import tabsdata.dataquality as dq

@td.publisher(
    source=td.LocalFileSource("orders.csv"),
    output_tables=["orders"],
    on_tables=[
        dq.DataQuality(
            table="orders",
            classifiers=[dq.IsPositive("total")],
            operators=[dq.Enrich()],
        )
    ],
)
def publish_orders(tf: td.TableFrame) -> td.TableFrame:
    return tf

Example 2 – Failing a Run Based on Row Count Threshold

In this example, a transformer validates that critical fields (order_id, customer_id) are not NULL and aborts the run if 10 or more rows fail the check. The Fail() operator uses RowCountThreshold to enforce this limit. If 10 or more rows have NULL values in order_id or customer_id, the entire run fails with a DataQualityProcessingError.

import tabsdata as td
import tabsdata.dataquality as dq

@td.transformer(
    input_tables=["raw_orders"],
    output_tables=["validated_orders"],
    on_output_tables=[
        dq.DataQuality(
            table="validated_orders",
            classifiers=[dq.IsNotNull(["order_id", "customer_id"])],
            operators=[
                dq.Fail(
                    criteria=dq.AnyFailed(),
                    threshold=dq.RowCountThreshold(10),
                )
            ],
        )
    ],
)
def validate_orders(tf: td.TableFrame) -> td.TableFrame:
    return tf

Example 3 – Using Multiple Classifiers and Operators with Tags

In the code below, a transformer reads from a landing table and outputs to a clean table with data quality checks attached. Two classifiers inspect the data: one checks for NULL values in customer_id (tagged as "nulls"), and another verifies that amount is positive (tagged as "positive"). Three operators then act on these signals: Enrich() adds the quality flag columns to the output table, Summary() generates a clean_dq_summary table with aggregated quality metrics, and Filter() removes rows with NULL customer IDs, routing them to a separate clean_nulls quarantine table for review. Note that rows with non-positive amounts remain in the output (flagged but not filtered) since the filter only targets the "nulls" tag.

import tabsdata as td
import tabsdata.dataquality as dq

@td.transformer(
    input_tables=["landing"],
    output_tables=["clean"],
    on_output_tables=[
        dq.DataQuality(
            table="clean",
            classifiers=[
                dq.IsNull("customer_id", tags="nulls"),
                dq.IsPositive("amount", tags="positive"),
            ],
            operators=[
                dq.Enrich(),           # add dq columns to `clean`
                dq.Summary(),          # writes `clean_dq_summary`
                dq.Filter(
                    criteria=dq.AnyFailed(tags="nulls", none_is_ok=False),
                    to_table="clean_nulls",
                    include_quality_columns="criteria",
                ),
            ],
        )
    ],
)
def enrich(tf: td.TableFrame) -> td.TableFrame:
    return tf

Key parameters:

  • classifiers and operators accept either a single object or a list, which lets authors write compact one-off rules or longer pipelines.

  • tags on both classifiers and operators constrain which classifier results are visible to a given operator:

    • tags=None – read every classifier column (default).

    • tags=[] – target only untagged classifier columns.

    • One or more tag strings – narrow the scope to matching classifiers.

  • Boolean operators can decide how to treat missing or NaN inputs via none_is_ok / nan_is_ok. These flags matter only when the classifier itself isn’t already checking for NULLs or NaNs (e.g., IsPositive). If a classifier explicitly tests for nullability (IsNull, IsNan, IsNullOrNan, etc.), the corresponding operator flags are ignored.

  • column_names can be a string, a list of strings, or (col_name, dq_col_name) tuples. The alias becomes the DQ column name. For example, ScaleCategorizer(column_names=("value", "value_scale_category"), ...) creates value_scale_category.

Within a single action, the resulting column names (original + DQ columns) must remain unique. Configuring two classifiers that would emit the same alias is rejected at registration time.
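
For illustration, a sketch of the alias tuple next to the default naming (column names are hypothetical):

import tabsdata.dataquality as dq

classifiers = [
    # Tuple form: the alias `amount_ok` becomes the DQ column name.
    dq.IsPositive(("amount", "amount_ok")),
    # String form: the DQ column defaults to `total_is_positive`.
    dq.IsPositive("total"),
    # A further classifier aliased to `amount_ok` would collide and be
    # rejected at registration time.
]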

Classifiers

There are two families of classifiers:

Boolean Classifiers

Boolean classifiers add <column>_<classifier> UInt8 columns. Values follow a fixed mapping:

  • 0 = False

  • 1 = True

  • 252 = underflow

  • 253 = overflow

  • 254 = NaN

  • 255 = NULL

If your operator sets none_is_ok=True and/or nan_is_ok=True and the classifier isn’t expressly checking for nullability (e.g., IsPositive), missing values are treated as True instead of receiving sentinel codes.
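
For instance, applying the mapping above to a hypothetical IsPositive("value") check:

# value column:       5     -2    None   NaN
# value_is_positive:  1      0    255    254
#
# With none_is_ok=True / nan_is_ok=True on the consuming operator, the
# None and NaN rows count as True (1) instead of the 255/254 sentinels.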

Nullability classifiers:

  • IsNull / IsNotNull

  • IsNullOrNan / IsNotNullNorNan

  • IsNan / IsNotNan

Zero and sign checks:

  • IsZero / IsNotZero

  • IsPositive / IsPositiveOrZero

  • IsNegative / IsNegativeOrZero

Range checks:

  • IsBetween / IsNotBetween – with min_val, max_val, and closed (one of "none", "lower", "upper", "both")

Set membership:

  • IsIn / IsNotIn

Pattern and length:

  • Matches / DoesNotMatch

  • HasLength

All boolean classifiers support:

  • column_names – include multiple columns at once.

  • on_missing_column – "ignore" or "fail" to decide whether schema drift aborts the run.

  • on_wrong_type – "ignore" or "fail" for type mismatches.

  • Optional tags.

Categorical Classifiers

Categorical classifiers extend the abstract Categorizer. The available implementation is ScaleCategorizer, which emits a <column>_scale_category UInt8 column representing bin numbers plus special sentinels ("none", "nan", "underflow", "overflow").

It accepts numeric inputs and one of the following scales:

IdentityScale

IdentityScale(min_val, max_val, use_bin_zero=False) – integer-to-bucket mapping.

LinearScale

LinearScale(min_val, max_val, bins=MAX_BINS, use_bin_zero=False) – equal-width bins.

MonomialScale

MonomialScale(min_val, max_val, power, bins=MAX_BINS, use_bin_zero=False) – power-law spacing. If min_val is negative, the values are shifted so the minimum maps to 0 before binning.

LogarithmicScale

LogarithmicScale(min_val, max_val, base, bins=MAX_BINS, use_bin_zero=False) – logarithmic spacing. Negative min_val inputs are shifted so the minimum becomes just above zero to keep the domain positive.

ExponentialScale

ExponentialScale(min_val, max_val, base, bins=MAX_BINS, use_bin_zero=False) – exponential widening. Negative ranges are handled like the logarithmic scale.

MAX_BINS is 100. When use_bin_zero=True, bin 0 is reserved for the minimum and the usable bin count shrinks accordingly.
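
A sketch constructing two scales over comparable ranges and feeding one into a categorizer; the latency_ms column and bin counts are illustrative:

import tabsdata.dataquality as dq

# Equal-width bins: 0–100, 100–200, ..., 900–1000.
linear = dq.LinearScale(min_val=0, max_val=1000, bins=10)

# Logarithmic bins widen with the value, e.g., roughly 1–10, 10–100, 100–1000.
log = dq.LogarithmicScale(min_val=1, max_val=1000, base=10, bins=3)

# The categorizer writes UInt8 bin ids into the aliased `latency_bucket` column.
bucketize = dq.ScaleCategorizer(("latency_ms", "latency_bucket"), scale=log)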

Default Column Names

Boolean classifier outputs:

Classifier            Default Column Name
IsNull                <source_column>_is_null
IsNotNull             <source_column>_is_not_null
IsNullOrNan           <source_column>_is_null_or_nan
IsNotNullNorNan       <source_column>_is_not_null_nor_nan
IsNan                 <source_column>_is_nan
IsNotNan              <source_column>_is_not_nan
IsZero                <source_column>_is_zero
IsNotZero             <source_column>_is_not_zero
IsPositive            <source_column>_is_positive
IsPositiveOrZero      <source_column>_is_positive_or_zero
IsNegative            <source_column>_is_negative
IsNegativeOrZero      <source_column>_is_negative_or_zero
IsBetween             <source_column>_is_between
IsNotBetween          <source_column>_is_not_between
IsIn                  <source_column>_is_in
IsNotIn               <source_column>_is_not_in
Matches               <source_column>_matches
DoesNotMatch          <source_column>_does_not_match
HasLength             <source_column>_has_length

Categorizer outputs:

Classifier                             Default Column Name
ScaleCategorizer (no alias provided)   <source_column>_scale_category

When you supply (source, alias) pairs, the alias overrides these defaults.

Operators

Operators consume classifier signals and produce actions on your data.

Enrich

Enrich(to_table=None, tags=None)

Adds classifier columns to either the original table (default, in-place overwrite) or a dedicated new table. When tags are provided, only classifiers with matching tags are materialized.

Summary

Summary(tags=None, table=None)

Always emits a new table. When you omit table, the engine names it <source>_dq_summary.

Filter

Filter(criteria, to_table=None, include_quality_columns="none")

Removes rows matching criteria from the output table. With to_table, rejected rows are written to a separate table; otherwise they are dropped.

include_quality_columns options:

  • "none" – bare data rows.

  • "criteria" – only the DQ columns referenced by the criteria (boolean or categorical).

  • "all" – all classifier outputs.

Select

Select(criteria, to_table, include_quality_columns="none")

Copies rows matching criteria to another table without mutating the source table. A target name is mandatory.

Fail

Fail(criteria, threshold)

Aborts the entire transformer/publisher run when the threshold is exceeded.

Thresholds:

  • RowCountThreshold(row_count) – fail if matching rows >= count.

  • PercentThreshold(percent) – fail if the percentage of matching rows exceeds percent.

Derived Table Names

Operator   Default Table Name                 Behavior with to_table/table
Enrich     Overwrites source table in place   Uses your exact to_table name for the enriched copy
Filter     Rows dropped; no table emitted     Uses your exact to_table name for rejected rows
Select     Target name is mandatory           Uses your exact to_table name for selected rows
Summary    <output_table_name>_dq_summary     Uses your exact table name when provided

Cross-Action Constraints

  • Only one Enrich per output table may redirect rows by setting to_table, because that operator substitutes the original table with the enriched copy.

  • A given derived table name (to_table in Filter/Select or table in Summary) must be declared once across all actions attached to the function; producing the same non-output table multiple times is disallowed.
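
A sketch that satisfies both constraints (table names are illustrative):

import tabsdata.dataquality as dq

dq.DataQuality(
    table="events",
    classifiers=[dq.IsNotNull("event_id")],
    operators=[
        dq.Summary(table="events_quality"),  # explicit, unique summary name
        dq.Filter(criteria=dq.AnyFailed(), to_table="events_rejected"),
        # A second Filter or Select writing to `events_rejected` (or any
        # other already-declared derived table) would be rejected.
    ],
)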

Criteria and Thresholds

Boolean criteria:

  • AllOk / AnyFailed – operate on boolean classifier columns and expose none_is_ok / nan_is_ok toggles. Set these to True when you want operators to treat missing/NaN values as acceptable rather than as sentinel codes.

Categorical criteria:

  • InBins / NotInBins – target categorizer outputs. Bins can include integers or the special labels "none", "nan", "underflow", "overflow".

Each criterion holds onto the classifier tags it needs, so operators automatically align with the right DQ columns. Threshold values (row count / percent) are validated at registration time to catch impossible values.
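
For example, categorical criteria can mix numeric bins with the special labels; the temp_bucket column is illustrative:

import tabsdata.dataquality as dq

# Match rows whose reading fell below or above the configured range.
out_of_range = dq.InBins("temp_bucket", ["underflow", "overflow"])

# Match rows landing in the two lowest bins.
low_bins = dq.InBins("temp_bucket", [0, 1])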

Showcased Workflows

1. Enriching Output Rows and Building Summaries

dq.DataQuality(
    table="table_o",
    classifiers=[dq.IsPositive("value")],
    operators=[dq.Summary(), dq.Enrich()],
)

This produces table_o with the extra value_is_positive column (UInt8 status codes) and table_o_dq_summary with boolean counts (dq.true, dq.false, dq.none, dq.nan). This is the easiest drop-in for surfacing quality metrics alongside the data.

2. Quarantining Bad Rows with Tags

scale = dq.LinearScale(min_val=0, max_val=30, bins=3)

dq.DataQuality(
    table="scores",
    classifiers=[
        dq.IsPositive("score", tags="positive", none_is_ok=False),
        dq.ScaleCategorizer(
            ("score_float", "score_float_scale_category"),
            scale=scale
        ),
    ],
    operators=[
        dq.Filter(
            criteria=dq.AnyFailed(tags="positive", none_is_ok=False),
            to_table="scores_bad",
            include_quality_columns="criteria",  # adds only score_is_positive
        ),
        dq.Select(
            criteria=dq.InBins("score_float_scale_category", [0, 1]),
            to_table="scores_low_bins",
            include_quality_columns="all",
        ),
    ],
)

  • Rows with invalid or zero scores are removed from the main output. With to_table=None they would be discarded completely.

  • Because "criteria" is used, scores_bad only exposes the DQ columns that justified the move, keeping tables narrow.

  • The Select example shows how categorical bins get propagated to downstream tables for auditing.

3. Failing a Run When Breaches Cross a Threshold

dq.DataQuality(
    table="clean",
    classifiers=[dq.IsNotNull(["customer_id", "order_id"])],
    operators=[
        dq.Fail(
            criteria=dq.AnyFailed(none_is_ok=False),
            threshold=dq.RowCountThreshold(2),
        )
    ],
)

If 2 or more rows contain NULL in either column, the transformer raises DataQualityProcessingError and the orchestration layer can retry or alert. Swapping to PercentThreshold(5) switches to percentage-based gating.

4. Categorizing Continuous Metrics

temp_scale = dq.LinearScale(
    min_val=-20.0,
    max_val=50.0,
    bins=7,
    use_bin_zero=True
)

dq.DataQuality(
    table="sensor_readings",
    classifiers=[
        dq.ScaleCategorizer(("temp_c", "temp_bucket"), scale=temp_scale),
    ],
    operators=[dq.Enrich(), dq.Summary()],
)

  • temp_bucket holds UInt8 bin ids, where 0 means “exactly the minimum” because use_bin_zero=True.

  • Special status values mark NULL, NaN, underflow, and overflow readings when they fall outside the configured range.

  • The summary table reports dq.bins, dq.under, dq.over, and dq.p000 – dq.p007 counts so dashboards can track distribution drift.

5. Guarding Against Schema Drift

dq.IsNull(
    ["legacy_col"],
    on_missing_column="fail",   # explode if publisher forgets the column
)

dq.IsNan(
    "ratio",
    on_wrong_type="fail",       # catch accidental casting to string
)

These flags are essential when the source team wants DQ to double as guardrails against structural regressions.

Summary Table Schema

Reference for the schema emitted by dq.Summary() at runtime:

Column              Meaning
dq.version          Engine version (1 today)
dq.classifier       Result column name (e.g., amount_is_positive)
dq.type             "boolean" or "categorical"
dq.bins             Number of configured bins (categorizer only, else None)
dq.false / dq.true  Counts of boolean failures/passes
dq.none / dq.nan    Rows with None / NaN
dq.under / dq.over  Underflow/overflow counts for categorizers
dq.p000 – dq.p100   Bin occupancy; unused bins remain None
dq.parameters       Stringified classifier configuration (str(classifier))

Known Limitations

  • Subscribers do not emit TableFrames and therefore cannot run DQ yet.

  • Input-table DQ (validating data as it enters a transformer) is not implemented. Today, users must stage raw data or build a two-step pipeline.

  • Max of 100 bins per categorizer (101 if use_bin_zero=True). Heavy skew scenarios may require multiple categorizer columns.