Local File System#
You can use this built-in Tabsdata subscriber to write files to a file system that is accessible to the Tabsdata server, such as a local file system or a NAS. Subscribers can write in the following file formats: CSV, jsonl, ndjson, and parquet.
Example (Subscriber - Local File System)#
Here is an example subscriber named write_employees. It reads the departments table and multiple employees tables from Tabsdata, and writes them to the output HR folder. This subscriber executes automatically as soon as a new commit occurs on any of its input tables.
import tabsdata as td

@td.subscriber(
    tables=["departments", "employees_1", "employees_2"],
    destination=td.LocalFileDestination(
        [
            "/users/username/opt/hr/output/departments.csv",
            "/users/username/opt/hr/output/employees_1.csv",
            "/users/username/opt/hr/output/employees_2.csv",
        ]
    ),
)
def write_employees(tf1: td.TableFrame, tf2: td.TableFrame, tf3: td.TableFrame):
    return tf1, tf2, tf3
Note: After defining the subscriber, you need to register it with a Tabsdata collection. For more information, see Register a Function.
Setup (Subscriber - Local File System)#
The following code uses placeholder values for defining a subscriber that reads data from the Tabsdata server and writes it to the Local File System:
import tabsdata as td

@td.subscriber(
    tables=["<output_table1>", "<output_table2>"],
    destination=td.LocalFileDestination(["<path_to_file1>", "<path_to_file2>"]),
    trigger_by=["<trigger_table1>", "<trigger_table2>"],
)
def <subscriber_name>(<table_frame1>: td.TableFrame, <table_frame2>: td.TableFrame):
    <function_logic>
    return <table_frame_output1>, <table_frame_output2>
Note: After defining the subscriber, you need to register it with a Tabsdata collection. For more information, see Register a Function.
The following properties are defined in the setup code above:
destination#
<path_to_files>#
<path_to_file1>, <path_to_file2>… are the full system paths of the files to write to. They are usually of the format /users/username/....
All the destination files in a subscriber need to have the same extension. The following file formats are currently supported: CSV, jsonl, ndjson, and parquet.
You must use the absolute system path (e.g., /user/username/project/employees.csv) in the code instead of a relative one (e.g., ./employees.csv). Since these functions are registered in the Tabsdata server, an absolute path is necessary to ensure proper access to the destination.
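The absolute-path requirement can be checked up front in plain Python before registering the function. This is a minimal sketch; the `require_absolute` helper is illustrative and not part of the Tabsdata API:

```python
from pathlib import PurePosixPath

def require_absolute(path: str) -> str:
    # Reject relative paths such as "./employees.csv" early,
    # before the subscriber is registered with the server.
    if not PurePosixPath(path).is_absolute():
        raise ValueError(f"destination must be an absolute path, got: {path}")
    return path

require_absolute("/users/username/project/employees.csv")  # passes
```

Catching a relative path at definition time gives a clearer error than a failed write inside the server.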
You can specify as many file paths as needed. Additionally, you can use the following variables within the file paths to make the files unique or identifiable by a Tabsdata ID, e.g. td-iceberg/customers-$EXPORT_TIMESTAMP.parquet as used in this Tabsdata tutorial.
Field | Description
---|---
 | The execution plan ID.
 | Datetime the function started executing, in epoch millis.
 | The function run ID.
 | Datetime when the scheduler set the function run to running, in epoch millis.
 | Datetime when the execution plan was triggered, in epoch millis.
 | The transaction ID.
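As a sketch of how such a variable renders in a destination path, the substitution below is an illustration only; Tabsdata performs the actual replacement at export time, and the epoch-millis rendering here is an assumption for demonstration:

```python
import time

# Hypothetical path template containing a Tabsdata variable.
template = "/users/username/opt/hr/output/departments-$EXPORT_TIMESTAMP.csv"

# A rendered path carries a concrete epoch-millis value
# in place of the placeholder.
rendered = template.replace("$EXPORT_TIMESTAMP", str(int(time.time() * 1000)))
print(rendered)
```

Each run therefore produces a distinct file name, so successive exports do not overwrite one another.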
You can define the destination files in the following ways:
File Path
To write by file path where the file extension is included as part of the file path, define the destination as follows:
destination=td.LocalFileDestination([
    "<path_to_file1.ext>", "<path_to_file2.ext>"
]),
<path_to_file1.ext>, <path_to_file2.ext>… are paths to files with the file extension included in the file name.
File Format
To write files by file format where the format is declared separately and not in the file name, define the destination as follows:
destination=td.LocalFileDestination([
    "<path_to_file1_no_extension>",
    "<path_to_file2_no_extension>",
], format="<format_name>"),
<path_to_file1_no_extension>, <path_to_file2_no_extension>… are paths to files without the file extension in the file name. The extension for all files is specified separately in format.
Custom delimiter for CSV
To define a custom delimiter for writing a CSV file, define the destination as follows:
destination=td.LocalFileDestination([
    "<path_to_file1_no_extension>",
    "<path_to_file2_no_extension>",
], format=td.CSVFormat(separator="<separator_character>")),
<path_to_file1_no_extension>, <path_to_file2_no_extension>… are paths to CSV files with a custom delimiter, without the file extension in the file name. The delimiter is a single-byte character, such as a colon (:), semicolon (;), or period (.), that separates the fields in the given file instead of a comma (,). You define the character in separator.
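For illustration, here is what a semicolon-separated file looks like when written with Python's standard csv module. This is a sketch of the output format only; Tabsdata handles the actual writing through the separator setting above:

```python
import csv
import io

rows = [["id", "name"], ["1", "Alice"], ["2", "Bob"]]

# Write the rows with a semicolon as the field separator.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=";", lineterminator="\n")
writer.writerows(rows)

print(buf.getvalue())
# id;name
# 1;Alice
# 2;Bob
```

Downstream consumers of the exported file must be configured with the same separator, or the fields will not split correctly.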
None as an input and output#
A subscriber may receive and return a None value instead of TableFrames.
When a subscriber receives a None value instead of a TableFrame, it means that the requested table dependency version does not exist.
When a subscriber returns a None value instead of a TableFrame, it means there is no new data to write to the external system. This helps avoid creating multiple copies of the same data.
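The pattern inside the function body can be sketched as follows. This is a minimal illustration; the has_changed deduplication check is hypothetical and not a Tabsdata API:

```python
def has_changed(tf, previous):
    # Hypothetical deduplication check: only write when the data
    # differs from what was last exported.
    return tf != previous

def write_employees(tf, previous=None):
    if tf is None:
        # Input is None when the requested table dependency
        # version does not exist: nothing to process.
        return None
    if not has_changed(tf, previous):
        # Returning None tells Tabsdata there is nothing new to
        # write, avoiding duplicate copies of the same data.
        return None
    return tf
```

The key point is that returning None short-circuits the export, so unchanged data is never rewritten to the destination files.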
[Optional] trigger_by#
<trigger_table1>, <trigger_table2>… are the names of the tables in the Tabsdata server. A new commit to any of these tables triggers the subscriber. All listed trigger tables must exist in the server before registering the subscriber.
Defining trigger tables is optional. If you don't define the trigger_by property, the subscriber is triggered by a new commit to any of its input tables. If you define the trigger_by property, then only the tables listed in the property can automatically trigger the subscriber.
For more information, see Working with Triggers.
<subscriber_name>#
<subscriber_name> is the name for the subscriber that you are configuring.
<function_logic>#
<function_logic> governs the processing performed by the subscriber. You can specify the function logic to be a simple write, or to perform additional processing as needed. For more information about the function logic that you can include, see Working with Tables.
<table_frame1>, <table_frame2>… are the names of the variables that temporarily store source data for processing.
<table_frame_output1>, <table_frame_output2>… are the outputs from the function that are written to the external system.
Data Drift Support#
This section describes how Tabsdata handles data drift in the input data for this subscriber connector.
As this is a file-based subscriber, the exported files will reflect the updated schema. No special handling is required in this case, as the target external system serves purely as a storage layer without schema-dependent logic.
However, this applies only if the subscriber function does not contain schema-dependent logic. If such logic exists, changes in the schema may conflict with the function’s expectations, leading to execution errors.