BigQuery#
You can use this built-in Tabsdata subscriber to write to Google BigQuery.
Installing the connector#
To work with the BigQuery connector, you need to install its dependencies separately; this keeps the size of the core package down. Run one of the following commands in your terminal to install the dependencies.
$ pip install "tabsdata[bigquery]"
or,
$ pip install 'tabsdata[bigquery]'
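Once installed, the connector classes used throughout this page should be importable. As a quick sanity check (a sketch; if it raises an error, re-run the install command above):

import tabsdata as td

# These classes are provided by the "bigquery" extra; referencing them
# verifies that the optional dependencies were installed.
print(td.BigQueryConn)
print(td.BigQueryDest)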
Example (Subscriber - BigQuery)#
The following example creates a subscriber named write_sales. It writes two Tabsdata tables to the database. The subscriber is automatically triggered by a new commit to any of its input tables, and writes data to the database without any modification.
import tabsdata as td

@td.subscriber(
    tables=["vendors", "items"],
    destination=td.BigQueryDest(
        td.BigQueryConn(
            gcs_folder="gcs://acme_inc/staging",
            credentials=td.GCPServiceAccountKeyCredentials("<ACCOUNT_KEY>"),
            project="sales",
            dataset="procurement",
        ),
        tables=["vendors", "items"],
        if_table_exists="replace",
        schema_strategy="strict",
    ),
)
def write_sales(tf1: td.TableFrame, tf2: td.TableFrame):
    return tf1, tf2
Note: After defining the function, you need to register it with a Tabsdata collection and execute it. For more information, see Register a Function and Execute a Function.
Setup (Subscriber - BigQuery)#
The following code uses placeholder values for defining a subscriber that reads Tabsdata tables and writes to BigQuery:
import tabsdata as td

@td.subscriber(
    tables=["<input_table1>", "<input_table2>"],
    destination=td.BigQueryDest(
        td.BigQueryConn(
            gcs_folder="<gcs://path_to_folder>",
            credentials=td.GCPServiceAccountKeyCredentials("<ACCOUNT_KEY>"),
            project="<project_name>",
            dataset="<dataset_name>",
            enforce_connection_params="<connection_params_boolean>",
            cx_dst_configs_gcs={"<gcs_param1>": "<gcs_value1>", "<gcs_param2>": "<gcs_value2>"},
            cx_dst_configs_bigquery={"<bq_param1>": "<bq_value1>", "<bq_param2>": "<bq_value2>"},
        ),
        tables=["<output_table1>", "<output_table2>"],
        if_table_exists="replace",
        schema_strategy="strict",
    ),
    trigger_by=["<trigger_table1>", "<trigger_table2>"],
)
def <subscriber_name>(<table_frame1>: td.TableFrame, <table_frame2>: td.TableFrame):
    <function_logic>
    return <table_frame_output1>, <table_frame_output2>
Note: After defining the function, you need to register it with a Tabsdata collection and execute it. For more information, see Register a Function and Execute a Function.
The following properties are defined in the code above:
tables#
<input_table1>, <input_table2>… are the names of the Tabsdata tables to be written to the external system.
destination#
BigQueryConn#
gcs_folder#
A gcs:// URI indicating the folder where the table(s) will be staged before being uploaded to BigQuery. After the upload is done, the files created in this folder will be deleted.
credentials#
GCP credentials to use for authentication.
project#
The default GCP project to use. If not provided, the project must be specified in the table names provided in the destination. Defaults to None.
dataset#
The default BigQuery dataset to use. If not provided, the dataset must be specified in the table names provided in the destination. Defaults to None.
enforce_connection_params#
If set to True, the project and dataset provided in the connection are enforced and cannot be overridden by the table names in the destination. If set to False, the connection allows the project and/or dataset to be overridden by the table names provided in the destination. Defaults to True.
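For illustration, here is a sketch of a connection that allows per-table overrides; the fully qualified name form (project.dataset.table) is an assumption based on standard BigQuery naming:

import tabsdata as td

# Sketch: the connection supplies defaults, and enforce_connection_params=False
# lets a fully qualified table name override them.
conn = td.BigQueryConn(
    gcs_folder="gcs://acme_inc/staging",
    credentials=td.GCPServiceAccountKeyCredentials("<ACCOUNT_KEY>"),
    project="sales",
    dataset="procurement",
    enforce_connection_params=False,
)

dest = td.BigQueryDest(
    conn,
    tables=[
        "vendors",                  # resolves to sales.procurement.vendors
        "finance.reporting.items",  # hypothetical fully qualified override
    ],
)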
cx_dst_configs_gcs#
Additional configuration parameters to pass to the GCS client. Passed as a dictionary. Defaults to None.
cx_dst_configs_bigquery#
Additional configuration parameters to pass to the BigQuery client. Passed as a dictionary. Defaults to None.
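As a sketch, the extra client options are supplied as dictionaries; the keys below are placeholders, since the accepted parameters depend on the underlying GCS and BigQuery clients:

import tabsdata as td

# Sketch with placeholder option keys; consult the underlying client
# documentation for the parameters your setup actually accepts.
conn = td.BigQueryConn(
    gcs_folder="gcs://acme_inc/staging",
    credentials=td.GCPServiceAccountKeyCredentials("<ACCOUNT_KEY>"),
    project="sales",
    dataset="procurement",
    cx_dst_configs_gcs={"<gcs_param>": "<gcs_value>"},
    cx_dst_configs_bigquery={"<bq_param>": "<bq_value>"},
)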
tables#
The table(s) to store the data in. If multiple tables are provided, they must be provided as a list. If None, the table names will be those of the input tables for the function. Defaults to None.
if_table_exists#
This is an optional property that defines the strategy to follow when the table already exists. 'replace' overwrites the existing table, and 'append' appends to the existing data in the table. Defaults to 'append'.
schema_strategy#
This defines the strategy to follow when appending to a table with an existing schema. 'update' updates the schema with any new columns present in the TableFrame. 'strict' does not modify the schema, and fails if there is any difference.
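For example, a destination configured to append rows while letting the schema grow with new columns might look like this (a sketch reusing a connection defined as above):

# Sketch: append to existing tables and add any new columns from the data.
dest = td.BigQueryDest(
    conn,  # a td.BigQueryConn as shown earlier
    tables=["vendors", "items"],
    if_table_exists="append",
    schema_strategy="update",
)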
None as an input and output#
A subscriber may receive and return a None value instead of TableFrames.
When a subscriber receives a None value instead of a TableFrame it means that the requested table dependency version does not exist.
When a subscriber returns a None value instead of a TableFrame it means there is no new data to write to the external system. This helps avoid creating multiple copies of the same data.
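A sketch of a subscriber that handles None on both sides (the pass-through logic is illustrative):

import tabsdata as td

@td.subscriber(
    tables=["vendors"],
    destination=td.BigQueryDest(
        td.BigQueryConn(
            gcs_folder="gcs://acme_inc/staging",
            credentials=td.GCPServiceAccountKeyCredentials("<ACCOUNT_KEY>"),
            project="sales",
            dataset="procurement",
        ),
        tables=["vendors"],
    ),
)
def write_vendors(tf: td.TableFrame):
    # tf is None when the requested table version does not exist.
    if tf is None:
        return None  # nothing new to write; avoids duplicating data
    return tf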
[Optional] trigger_by#
<trigger_table1>, <trigger_table2>… are the names of the tables in the Tabsdata server. A new commit to any of these tables triggers the subscriber. All listed trigger tables must exist in the server before registering the subscriber.
Defining trigger tables is optional. If you don’t define the trigger_by property, the subscriber will be triggered by any of its input tables. If you define the trigger_by property, then only those tables listed in the property can automatically trigger the subscriber.
For more information, see Working with Triggers.
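For example, this sketch reads from two tables but is only triggered by commits to one of them:

# Sketch: both tables are inputs, but only a new commit to "vendors"
# triggers an execution of the subscriber.
@td.subscriber(
    tables=["vendors", "items"],
    destination=dest,  # a td.BigQueryDest as defined earlier
    trigger_by=["vendors"],
)
def write_sales(tf1: td.TableFrame, tf2: td.TableFrame):
    return tf1, tf2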
<subscriber_name>#
<subscriber_name> is the name for the subscriber that you are configuring.
<function_logic>#
<function_logic> governs the processing performed by the subscriber. You can specify function logic to be a simple write or to perform additional processing as needed. For more information about the function logic that you can include, see Working with Tables.
<table_frame1>, <table_frame2>… are the names for the variables that temporarily store source data for processing.
<table_frame_output1>, <table_frame_output2>… are the output from the function that are written to the external system.
Data Drift Support#
This section describes how Tabsdata handles data drift in the output data for this subscriber connector.
The behavior depends on how if_table_exists and schema_strategy are configured in the subscriber.
For a BigQuery subscriber with if_table_exists="replace" (schema_strategy is ignored), the table is dropped and recreated on each execution. The resulting schema exactly matches the newly exported data, and type changes are fully reflected in the new table schema.
For a BigQuery subscriber with if_table_exists="append" and schema_strategy="update", the existing table schema is augmented to include new columns. Columns that are no longer present in the source remain in the table.
For a BigQuery subscriber with if_table_exists="append" and schema_strategy="strict", any schema change (added, removed, or modified columns) causes execution to fail.