Working with Transformers#
A transformer reads data from one or more Tabsdata tables, transforms them, and writes transformed data to new or existing tables in the Tabsdata server.
Example (Transformer)#
Here is an example transformer named process_sales. It reads three Tabsdata tables, sums the sales_value data by sales_rep in each one, and writes the resulting tables back to the Tabsdata server. This transformer executes automatically as soon as a new commit occurs on any of its input tables:
import tabsdata as td


@td.transformer(
    input_tables=["us-sales", "eu-sales", "emea-sales"],
    output_tables=[
        "us-sales-by-sales-rep",
        "eu-sales-by-sales-rep",
        "emea-sales-by-sales-rep",
    ],
)
def process_sales(tf1: td.TableFrame, tf2: td.TableFrame, tf3: td.TableFrame):
    # Sum sales_value per sales_rep in each regional table.
    tf1 = tf1.group_by(td.col("sales_rep")).agg(td.col("sales_value").sum())
    tf2 = tf2.group_by(td.col("sales_rep")).agg(td.col("sales_value").sum())
    tf3 = tf3.group_by(td.col("sales_rep")).agg(td.col("sales_value").sum())
    # One returned TableFrame per entry in output_tables.
    return tf1, tf2, tf3
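To make the aggregation concrete, the following sketch reproduces in plain Python what grouping by sales_rep and summing sales_value computes for one table. It does not use the Tabsdata API: the helper sum_sales_by_rep and the sample rows are hypothetical and only illustrate the shape of each output table.

```python
from collections import defaultdict


def sum_sales_by_rep(rows):
    """Sum sales_value per sales_rep (plain-Python stand-in for group_by + agg)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["sales_rep"]] += row["sales_value"]
    return dict(totals)


# Hypothetical sample rows standing in for the "us-sales" table.
us_sales = [
    {"sales_rep": "alice", "sales_value": 100.0},
    {"sales_rep": "bob", "sales_value": 50.0},
    {"sales_rep": "alice", "sales_value": 25.0},
]

us_sales_by_rep = sum_sales_by_rep(us_sales)
print(us_sales_by_rep)  # {'alice': 125.0, 'bob': 50.0}
```

Each sales representative appears once in the result, with the total of all their sales_value entries.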
When you define a transformer, you specify the name of the transformer, the Tabsdata tables to read, and the names of the Tabsdata tables to write to. By default a transformer is triggered by a new commit to any of its input tables.
To create and run a transformer, complete the following tasks:
Define a transformer - Define a transformer to configure the parameters that govern the transformations of tables in the Tabsdata server.
Register the transformer - Register the transformer with a Tabsdata collection. A transformer can read from tables in any Tabsdata collection, but can only write output tables to the collection that it is registered with. For information about registering functions, see here.
Execute the transformer - Execute the transformer by initiating a trigger. Once executed, the transformer processes data from the specified input tables as defined in the function logic, then writes the results to tables in the Tabsdata server. By default, a commit on any of the input tables triggers a transformer. However, you can use any Tabsdata table as a trigger. For more information about executing a function, see here.
Setup (Transformer)#
The following code uses placeholder values for defining a transformer:
import tabsdata as td


@td.transformer(
    input_tables=["<input_table1>", "<collection_name>/<input_table2>"],
    output_tables=["<output_table1>", "<output_table2>"],
    trigger_by=["<trigger_table1>", "<trigger_table2>"],
)
def <transformer_name>(<table_frame1>: td.TableFrame, <table_frame2>: td.TableFrame):
    <function_logic>
    return <table_frame_output1>, <table_frame_output2>
The following properties are defined in the setup code above:
Input Tables#
<input_table1> is the name of an input table in the collection that the transformer is registered with.
<collection_name>/<input_table2> is an input table from a collection that the transformer is not registered with.
Tabsdata stores all the commits of a table in the Tabsdata server. Hence, you can have multiple commits of the same table in the input. You can read more about it here.
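The two input forms above differ only in whether a collection prefix is present. The sketch below shows one way to read such a reference. It is a hypothetical helper, not part of the Tabsdata API: split_table_ref, the collection name "sales", and the table "finance/fx-rates" are made up for illustration, and the code assumes a reference contains at most one "/".

```python
def split_table_ref(ref, registered_collection):
    """Return (collection, table) for a table reference string."""
    collection, sep, table = ref.partition("/")
    if not sep:
        # No prefix: the table lives in the collection the
        # transformer is registered with.
        return registered_collection, ref
    # Prefixed form: "<collection_name>/<table_name>".
    return collection, table


print(split_table_ref("us-sales", "sales"))          # ('sales', 'us-sales')
print(split_table_ref("finance/fx-rates", "sales"))  # ('finance', 'fx-rates')
```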
Output Tables#
<output_table1>, <output_table2> … are the names of the Tabsdata tables to publish to.
Triggers#
<trigger_table1>, <trigger_table2> … are the names of tables in the Tabsdata server. A new commit to any of these tables triggers the transformer. All listed trigger tables must exist in the server before you register the transformer.
Defining trigger tables is optional. If you don’t define the trigger_by property, the transformer is triggered by a new commit to any of its input tables. If you define the trigger_by property, then only the tables listed in it can automatically trigger the transformer.
For more information, see Working with Triggers.
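The rule for choosing trigger tables can be summarized in a few lines of plain Python. This is an illustrative sketch only, not how Tabsdata implements triggering: effective_triggers and the NOT_DEFINED sentinel are hypothetical, and the sentinel simply stands for "the trigger_by property was not defined".

```python
# Hypothetical sentinel meaning "trigger_by was not defined".
NOT_DEFINED = object()


def effective_triggers(input_tables, trigger_by=NOT_DEFINED):
    """Return the tables whose new commits trigger the transformer."""
    if trigger_by is NOT_DEFINED:
        # Default: a commit to any input table triggers the transformer.
        return list(input_tables)
    # Explicit trigger_by: only the listed tables trigger it automatically.
    return list(trigger_by)


inputs = ["us-sales", "eu-sales"]
print(effective_triggers(inputs))                  # ['us-sales', 'eu-sales']
print(effective_triggers(inputs, ["daily-tick"]))  # ['daily-tick']
```

Note that a trigger table does not have to be one of the input tables, which is why the second call returns a table that the transformer never reads.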
Name#
<transformer_name> is the name of the transformer that you are configuring. It has to be unique among all functions registered in a collection.
Function Logic#
<function_logic> governs the processing performed by the transformer. The function logic can pass the data through unchanged or perform additional processing, such as dropping nulls, before the data is written to the output tables. For more information about the function logic that you can include, see Working with Tables.
<table_frame1>, <table_frame2> … are the names of the variables that temporarily store source data for processing.
<table_frame_output1>, <table_frame_output2> … are the outputs of the function, which are stored as Tabsdata tables with the names defined in the output_tables property. Consequently, the number of tables returned from the function has to exactly match the number of tables defined in the output_tables property.
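The matching requirement can be pictured as a simple length check before results are stored. The sketch below is a plain-Python illustration, not Tabsdata code: pair_outputs is a hypothetical helper, and it assumes that returned values are paired with output table names by position.

```python
def pair_outputs(output_tables, returned):
    """Pair each returned frame with its output table name, in order."""
    # Treat a single returned value as a one-element tuple.
    if not isinstance(returned, tuple):
        returned = (returned,)
    if len(returned) != len(output_tables):
        raise ValueError(
            f"expected {len(output_tables)} tables, got {len(returned)}"
        )
    # Assumption: the first returned frame goes to the first output table, etc.
    return dict(zip(output_tables, returned))


# Strings stand in for TableFrame objects in this sketch.
print(pair_outputs(["out1", "out2"], ("frame1", "frame2")))
# {'out1': 'frame1', 'out2': 'frame2'}
```

Returning one frame for two declared output tables would raise a ValueError in this sketch, mirroring the rule that the counts must match.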