tabsdata.tableframe.lazyframe.frame.TableFrame.udf
- TableFrame.udf(on: td_typing.IntoExpr | list[td_typing.IntoExpr], function: td_udf.UDF) -> TableFrame
Apply a user-defined function (UDF) to the columns resolved by on.
The selected columns are supplied to function, which can implement either on_batch or on_element. An on_batch implementation receives a list of Polars series representing the selected columns and must return a list of Polars series with matching length. An on_element implementation receives a list of Python scalars for each row and returns a list of scalars; the framework wraps this in an efficient batch executor, so data still flows in batches even when authoring row-wise logic. In both cases the returned series become new columns appended to the original TableFrame.
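The row-wise versus batch distinction above can be illustrated with a small pure-Python model. This is only a sketch of the idea, not tabsdata's implementation: plain lists stand in for series, and the helper name run_on_element_in_batches is hypothetical.

```python
from typing import Any, Callable, List, Sequence


def run_on_element_in_batches(
    on_element: Callable[[List[Any]], List[Any]],
    columns: Sequence[Sequence[Any]],
) -> List[List[Any]]:
    """Apply a row-wise callable to column-oriented data.

    `columns` holds equally long input columns (plain lists standing in
    for series). The result is a list of output columns.
    Hypothetical helper for illustration; not part of tabsdata.
    """
    # zip(*columns) walks the batch row by row: one scalar per input column.
    rows_out = [on_element(list(row)) for row in zip(*columns)]
    # Transpose the per-row results back into column form.
    return [list(col) for col in zip(*rows_out)]


def ratio(values: List[Any]) -> List[Any]:
    # Row-wise logic: returns one output scalar per row.
    return [values[0] / values[1]]


batch = [[10, 20, 30], [2, 5, 10]]  # numerator column, denominator column
result = run_on_element_in_batches(ratio, batch)
print(result)  # [[5.0, 4.0, 3.0]]
```

The author writes only the per-row function; the adapter receives and returns whole columns, which is the shape an on_batch implementation works with directly.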
- Creating UDFs:
  - Subclass tabsdata.tableframe.udf.function.UDF.
  - Implement __init__ to call super().__init__(output_columns), where output_columns is a tuple or a list of tuples (name, data type) specifying the UDF default output schema (column names and data types). Each tuple must contain a column name (string) and a data type (DataType).
  - Override exactly one of on_batch or on_element to implement the UDF logic.
  - Return a list of TabsData Series (for on_batch) or a list of supported TabsData scalar values (for on_element). In both cases, the number of elements in the returned list must match the number of entries in the output_columns provided to the UDF constructor.
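The rule that the returned list must have as many elements as the declared output schema can be sketched as a simple check. The helper check_output_arity is hypothetical and the data types are placeholder strings; tabsdata performs its own validation, whose exact behavior and error messages are not shown here.

```python
from typing import Any, List, Sequence, Tuple


def check_output_arity(
    output_columns: Sequence[Tuple[str, Any]], returned: List[Any]
) -> None:
    """Raise ValueError if a UDF returned the wrong number of outputs.

    Hypothetical helper for illustration; not part of tabsdata.
    """
    if len(returned) != len(output_columns):
        names = [name for name, _ in output_columns]
        raise ValueError(
            f"UDF declared {len(output_columns)} output column(s) {names} "
            f"but returned {len(returned)} value(s)"
        )


# Placeholder schema: two output columns, types given as plain strings.
schema = [("total", "Int64"), ("diff", "Int64")]

# Two returned columns for two declared columns: passes silently.
check_output_arity(schema, [[11, 22], [9, 18]])

# One returned column for two declared columns: rejected.
try:
    check_output_arity(schema, [[11, 22]])
    message = ""
except ValueError as exc:
    message = str(exc)
print(message)
```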
- Using UDFs:
  - Instantiate a UDF class created as above.
  - Pass the instance to the TableFrame method udf().
  - Optionally use UDF.output_columns() to override output column names or data types after instantiation.
- Parameters:
  - on – Expression, or list of expressions, selecting the input column(s) of the UDF.
  - function – Instance of tabsdata.tableframe.udf.function.UDF defining on_batch or on_element to produce the output series.
Examples
>>> import tabsdata as td
>>> import tabsdata.tableframe as tdf
>>>
>>> class SumUDF(tdf.UDF):
...     def __init__(self):
...         super().__init__(("total", tdf.Int64))
...
...     def on_batch(self, series):
...         # series[0] and series[1] are the selected columns "a" and "b".
...         return [series[0] + series[1]]
>>>
>>> tf = td.TableFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
>>> print(tf)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 10  │
│ 2   ┆ 20  │
│ 3   ┆ 30  │
└─────┴─────┘
>>> # udf() returns a new TableFrame, so keep the result.
>>> tf = tf.udf(td.col("a", "b"), SumUDF())
>>> print(tf)
┌─────┬─────┬───────┐
│ a   ┆ b   ┆ total │
│ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═══════╡
│ 1   ┆ 10  ┆ 11    │
│ 2   ┆ 20  ┆ 22    │
│ 3   ┆ 30  ┆ 33    │
└─────┴─────┴───────┘

>>> class RatioUDF(tdf.UDF):
...     def __init__(self):
...         super().__init__(("ratio", tdf.Float64))
...
...     def on_element(self, values):
...         # values holds one scalar per selected column for the current row.
...         return [values[0] / values[1]]
>>>
>>> tf = td.TableFrame({"numerator": [10, 20, 30],
...                     "denominator": [2, 5, 10]})
>>> print(tf)
┌───────────┬─────────────┐
│ numerator ┆ denominator │
│ ---       ┆ ---         │
│ i64       ┆ i64         │
╞═══════════╪═════════════╡
│ 10        ┆ 2           │
│ 20        ┆ 5           │
│ 30        ┆ 10          │
└───────────┴─────────────┘
>>> # udf() returns a new TableFrame, so keep the result.
>>> tf = tf.udf(td.col("numerator", "denominator"), RatioUDF())
>>> print(tf)
┌───────────┬─────────────┬───────┐
│ numerator ┆ denominator ┆ ratio │
│ ---       ┆ ---         ┆ ---   │
│ i64       ┆ i64         ┆ f64   │
╞═══════════╪═════════════╪═══════╡
│ 10        ┆ 2           ┆ 5.0   │
│ 20        ┆ 5           ┆ 4.0   │
│ 30        ┆ 10          ┆ 3.0   │
└───────────┴─────────────┴───────┘