tabsdata.tableframe.lazyframe.frame.TableFrame.udf

TableFrame.udf(on: td_typing.IntoExpr | list[td_typing.IntoExpr], function: td_udf.UDF) → TableFrame

Apply a user-defined function (UDF) to the columns selected by on.

The selected columns are supplied to function, which can implement either on_batch or on_element. An on_batch implementation receives a list of Polars series representing the selected columns and must return a list of Polars series with matching length. An on_element implementation receives a list of Python scalars for each row and returns a list of scalars; the framework wraps this in an efficient batch executor, so data still flows in batches even when authoring row-wise logic. In both cases the returned series become new columns appended to the original TableFrame.
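
To illustrate how the two hooks relate, the sketch below lifts a row-wise function into a column-wise one. It uses plain Python lists in place of Polars series, and lift_on_element is a hypothetical helper written for this illustration, not part of tabsdata; the framework's actual batch executor is internal and may work differently.

```python
from typing import Any, Callable, List

Row = List[Any]
Column = List[Any]


def lift_on_element(
    on_element: Callable[[Row], Row],
) -> Callable[[List[Column]], List[Column]]:
    """Turn a row-wise function into a batch-wise one (illustrative only)."""

    def on_batch(columns: List[Column]) -> List[Column]:
        # Transpose the selected columns into rows: one list of scalars per row.
        rows = [list(row) for row in zip(*columns)]
        # Apply the row-wise logic to every row of the batch.
        results = [on_element(row) for row in rows]
        # Transpose back so each output column holds one value per input row.
        return [list(col) for col in zip(*results)]

    return on_batch


# Row-wise logic: one output value (the ratio) per row.
ratio_batch = lift_on_element(lambda values: [values[0] / values[1]])
ratio_columns = ratio_batch([[10, 20, 30], [2, 5, 10]])
print(ratio_columns)  # [[5.0, 4.0, 3.0]]
```

The author writes only the per-row function; the wrapper is what sees whole columns, which is why data can still flow in batches.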

Creating UDFs:
  1. Subclass tabsdata.tableframe.udf.function.UDF.

  2. Implement __init__ so that it calls super().__init__(output_columns), where output_columns is a single (name, data type) tuple or a list of such tuples defining the UDF's default output schema. Each tuple must contain a column name (string) and a data type (DataType).

  3. Override exactly one of on_batch or on_element, to implement the UDF function logic.

  4. Return a list with exactly one entry per column in the output_columns passed to the UDF constructor: TabsData Series when overriding on_batch, or supported TabsData scalar values when overriding on_element.
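
The contract described in these steps can be sketched with a stand-in base class. SketchUDF and run_batch are hypothetical names written for this illustration, data types are represented as plain strings, and columns as Python lists; none of this is the tabsdata implementation. The sketch only shows two rules: the constructor accepts one (name, data type) tuple or a list of them, and the list returned by the hook must have one entry per declared output column.

```python
from typing import Any, List, Sequence, Tuple, Union

ColumnSpec = Tuple[str, str]  # (column name, data type); types are plain strings here


class SketchUDF:
    """Hypothetical stand-in for a UDF base class; not the tabsdata class."""

    def __init__(self, output_columns: Union[ColumnSpec, Sequence[ColumnSpec]]):
        # Accept a single (name, dtype) tuple or a list of such tuples.
        if (
            isinstance(output_columns, tuple)
            and output_columns
            and isinstance(output_columns[0], str)
        ):
            output_columns = [output_columns]
        self.schema: List[ColumnSpec] = list(output_columns)

    def on_batch(self, series: List[List[Any]]) -> List[List[Any]]:
        raise NotImplementedError

    def run_batch(self, series: List[List[Any]]) -> List[List[Any]]:
        result = self.on_batch(series)
        # The number of returned columns must match the declared schema.
        if len(result) != len(self.schema):
            raise ValueError(
                f"expected {len(self.schema)} output column(s), got {len(result)}"
            )
        return result


class SumSketch(SketchUDF):
    def __init__(self):
        super().__init__(("total", "Int64"))

    def on_batch(self, series):
        # Element-wise sum of the first two input columns.
        return [[a + b for a, b in zip(series[0], series[1])]]


totals = SumSketch().run_batch([[1, 2, 3], [10, 20, 30]])
print(totals)  # [[11, 22, 33]]
```

A hook that returns more or fewer columns than declared fails the length check in run_batch, which mirrors the requirement in step 4.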

Using UDFs:
  1. Instantiate a UDF class created as above.

  2. Pass the instance to the TableFrame method udf().

  3. Optionally use UDF.output_columns() to override output column names or data types after instantiation.

Parameters:
  • on – Expression selecting the input column(s) of the UDF.

  • function – Instance of tabsdata.tableframe.udf.function.UDF defining on_batch or on_element to produce the output series.

Examples

>>> import tabsdata as td
>>> import tabsdata.tableframe as tdf
>>>
>>> class SumUDF(tdf.UDF):
...     def __init__(self):
...         super().__init__(("total", tdf.Int64))
...
...     def on_batch(self, series):
...         return [series[0] + series[1]]
>>>
>>> tf = td.TableFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
>>> print(tf)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 10  │
│ 2   ┆ 20  │
│ 3   ┆ 30  │
└─────┴─────┘
>>> tf = tf.udf(td.col("a", "b"), SumUDF())
>>> print(tf)
┌─────┬─────┬───────┐
│ a   ┆ b   ┆ total │
│ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═══════╡
│ 1   ┆ 10  ┆ 11    │
│ 2   ┆ 20  ┆ 22    │
│ 3   ┆ 30  ┆ 33    │
└─────┴─────┴───────┘
>>> class RatioUDF(tdf.UDF):
...     def __init__(self):
...         super().__init__(("ratio", tdf.Float64))
...
...     def on_element(self, values):
...         return [values[0] / values[1]]
>>>
>>> tf = td.TableFrame({"numerator": [10, 20, 30],
...                     "denominator": [2, 5, 10]})
>>> print(tf)
┌───────────┬──────────────┐
│ numerator ┆ denominator  │
│ ---       ┆ ---          │
│ i64       ┆ i64          │
╞═══════════╪══════════════╡
│ 10        ┆ 2            │
│ 20        ┆ 5            │
│ 30        ┆ 10           │
└───────────┴──────────────┘
>>> tf = tf.udf(td.col("numerator", "denominator"), RatioUDF())
>>> print(tf)
┌───────────┬──────────────┬───────┐
│ numerator ┆ denominator  ┆ ratio │
│ ---       ┆ ---          ┆ ---   │
│ i64       ┆ i64          ┆ f64   │
╞═══════════╪══════════════╪═══════╡
│ 10        ┆ 2            ┆ 5.0   │
│ 20        ┆ 5            ┆ 4.0   │
│ 30        ┆ 10           ┆ 3.0   │
└───────────┴──────────────┴───────┘