Amazon S3#
You can use a subscriber to write tables from the Tabsdata server to Amazon S3 as files. Subscribers can write the following file formats: CSV, jsonl, ndjson, and parquet.
Example (Subscriber - Amazon S3)#
Here is an example subscriber named write_employees. It reads the departments table and multiple employees tables from Tabsdata, and writes them as files to the output HR folder in Amazon S3. This subscriber executes automatically as soon as a new commit occurs on any of its input tables.
import tabsdata as td

s3_credentials = td.S3AccessKeyCredentials(
    aws_access_key_id=td.HashiCorpSecret("path-to-secret", "S3_ACCESS_KEY"),
    aws_secret_access_key=td.HashiCorpSecret("path-to-secret", "S3_SECRET_KEY"),
)

@td.subscriber(
    tables=["departments", "employees_1", "employees_2"],
    destination=td.S3Destination(
        [
            "s3://opt/hr/departments.csv",
            "s3://opt/hr/employees_*.csv",
        ],
        credentials=s3_credentials,
        region="us-east-2",
    ),
)
def write_employees(tf1: td.TableFrame, tf2: td.TableFrame, tf3: td.TableFrame):
    return tf1, tf2, tf3
Where:
S3_ACCESS_KEY is the value of your Amazon S3 access key.
S3_SECRET_KEY is the value of your Amazon S3 secret key.
Note: After defining the function, you need to register it with a Tabsdata collection. For more information, see here.
Setup (Subscriber - Amazon S3)#
The following code uses placeholder values for defining a subscriber that reads tables from Tabsdata tables and writes files to Amazon S3:
import tabsdata as td

s3_credentials = td.S3AccessKeyCredentials(
    aws_access_key_id=td.HashiCorpSecret("path-to-secret", "S3_ACCESS_KEY"),
    aws_secret_access_key=td.HashiCorpSecret("path-to-secret", "S3_SECRET_KEY"),
)

@td.subscriber(
    tables=["<input_table1>", "<input_table2>"],
    destination=td.S3Destination(
        ["s3://<path_to_file1>", "s3://<path_to_file2>"],
        credentials=s3_credentials,
        region="<region_name>",
    ),
    trigger_by=["<trigger_table1>", "<trigger_table2>"],
)
def <subscriber_name>(<table_frame1>: td.TableFrame, <table_frame2>: td.TableFrame):
    <function_logic>
    return <table_frame_output1>, <table_frame_output2>
Note: After defining the function, you need to register it with a Tabsdata collection. For more information, see here.
The following properties are defined in the setup code above:
tables#
<input_table1>, <input_table2>… are the names of the Tabsdata tables to be written to the external system.
destination#
<path_to_file1>, <path_to_file2>… are the full paths to the files to write in Amazon S3.
All the destination files in a subscriber need to have the same extension. The following file formats are currently supported: CSV, jsonl, ndjson, and parquet.
You can specify as many file paths as needed.
You can define the destination files in the following ways:
File Path
To write by file path where the file extension is included as part of the file path, define the destination as follows:
destination=td.S3Destination(["s3://<path_to_file1.ext>", "s3://<path_to_file2.ext>"], s3_credentials),
<path_to_file1.ext>, <path_to_file2.ext>… have the file extension included in the file name as part of the path.
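For instance, a destination that writes two CSV files by full path could look like the following; the bucket name, folder, and region are illustrative:
destination=td.S3Destination(
    [
        "s3://my-bucket/hr/departments.csv",
        "s3://my-bucket/hr/employees.csv",
    ],
    s3_credentials,
    region="us-east-2",
),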
File Format
To write files by file format where format is declared separately and not in the file name, define the destination as follows:
destination=td.S3Destination(
    [
        "s3://<path_to_file1_no_extension>",
        "s3://<path_to_file2_no_extension>",
    ],
    s3_credentials,
    format="<format_name>",
    region="<region_name>",
),
"<path_to_file1_no_extension>"
, "<path_to_file2_no_extension>"
… don’t have the extension in the file name. The extension to all files is mentioned separately in format
.
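For instance, the following destination writes both outputs as parquet files; the bucket name and region are illustrative:
destination=td.S3Destination(
    [
        "s3://my-bucket/hr/departments",
        "s3://my-bucket/hr/employees",
    ],
    s3_credentials,
    format="parquet",
    region="us-east-2",
),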
Custom delimiter for CSV
To define a custom delimiter for writing a CSV file, use the format code as follows:
destination=td.S3Destination(
    [
        "s3://<path_to_file1_no_extension>",
        "s3://<path_to_file2_no_extension>",
    ],
    s3_credentials,
    format=td.CSVFormat(separator="<separator_character>"),
    region="<region_name>",
),
"<path_to_file1_no_extension>"
, "<path_to_file2_no_extension>"
… are paths to CSV files with a custom delimiter, with extensions of the file not included in the file name. The delimiter is a single byte character such as colon (:), semicolon (;), and period (.) that separate the fields in the given file instead of a comma(,). You define the character in separator
.
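For example, to write semicolon-delimited CSV files (bucket name and region are illustrative):
destination=td.S3Destination(
    [
        "s3://my-bucket/hr/departments",
        "s3://my-bucket/hr/employees",
    ],
    s3_credentials,
    format=td.CSVFormat(separator=";"),
    region="us-east-2",
),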
credentials#
A subscriber needs credentials to write files to Amazon S3. Here the value is defined using a variable s3_credentials. The variable is an object of class S3AccessKeyCredentials with the following values:
S3_ACCESS_KEY is the value of your Amazon S3 access key.
S3_SECRET_KEY is the value of your Amazon S3 secret key.
path-to-secret is the path to the secret in the HashiCorp key-value store where the credential values are stored.
You can use different ways to store the credentials, which are highlighted here in the documentation.
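As one illustration, if your deployment exposes the keys as environment variables, a variant along these lines may be possible. This is a sketch that assumes a td.EnvironmentSecret helper as described in the secrets documentation; check that page for the exact secret classes available:
# A sketch, not the canonical method: read the access keys from environment
# variables instead of HashiCorp Vault. Assumes a td.EnvironmentSecret helper
# as described in the secrets documentation.
s3_credentials = td.S3AccessKeyCredentials(
    aws_access_key_id=td.EnvironmentSecret("S3_ACCESS_KEY"),
    aws_secret_access_key=td.EnvironmentSecret("S3_SECRET_KEY"),
)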
region#
"<region_name>"
is where you have your S3 bucket is located.
trigger_by#
[optional] <trigger_table1>, <trigger_table2>… are the names of the tables in the Tabsdata server. A new commit to any of these tables triggers the subscriber. All listed trigger tables must exist in the server before registering the subscriber.
Defining trigger tables is optional. If you don’t define the trigger_by property, the subscriber is triggered by a new commit to any of its input tables. If you define the trigger_by property, then only the tables listed in the property can automatically trigger the subscriber.
For more information, see Working with Triggers.
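For instance, reusing the example above, the following configuration runs write_employees only when departments receives a new commit, even though the function still reads both employees tables:
@td.subscriber(
    tables=["departments", "employees_1", "employees_2"],
    destination=td.S3Destination(
        ["s3://opt/hr/departments.csv", "s3://opt/hr/employees_*.csv"],
        credentials=s3_credentials,
        region="us-east-2",
    ),
    # Commits to employees_1 or employees_2 no longer trigger this subscriber.
    trigger_by=["departments"],
)
def write_employees(tf1: td.TableFrame, tf2: td.TableFrame, tf3: td.TableFrame):
    return tf1, tf2, tf3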
<subscriber_name>#
<subscriber_name> is the name of the subscriber that you are configuring.
<function_logic>#
<function_logic> governs the processing performed by the subscriber. You can specify function logic to be a simple write or to perform additional processing as needed. For more information about the function logic that you can include, see Working with Tables.
<table_frame1>, <table_frame2>… are the names of the variables that temporarily store source data for processing.
<table_frame_output1>, <table_frame_output2>… are the outputs from the function that are written to the external system.
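As an illustration, the following subscriber filters one table before writing it. This is a minimal sketch assuming the Polars-style TableFrame operations described in Working with Tables; the status column, bucket name, and region are illustrative:
# A minimal sketch of function logic beyond a simple write. Assumes the
# Polars-style TableFrame API described in Working with Tables; the "status"
# column, bucket name, and region are illustrative.
import tabsdata as td

@td.subscriber(
    tables=["employees_1"],
    destination=td.S3Destination(
        ["s3://my-bucket/hr/active_employees.csv"],
        credentials=s3_credentials,
        region="us-east-2",
    ),
)
def write_active_employees(tf1: td.TableFrame):
    # Keep only rows for active employees before the file is written to S3.
    return tf1.filter(td.col("status") == "active")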