tabsdata
- class AWSGlue(
- definition: dict,
- tables: str | list[str],
- auto_create_at: list[str | None] | str | None = None,
- if_table_exists: Literal['append', 'replace'] = 'append',
- partitioned_table: bool = False,
- schema_strategy: Literal['update', 'strict'] = 'update',
- s3_credentials: S3Credentials = None,
- s3_region: str = None,
- **kwargs,
Bases:
Catalog
- property definition: dict
- property if_table_exists: Literal['append', 'replace']
The strategy to follow when the table already exists.
- property s3_credentials: S3Credentials | None
- property schema_strategy: Literal['update', 'strict']
The strategy to follow when appending to a table with an existing schema.
- class AvroFormat(
- chunk_size: int = 50000,
Bases:
FileFormat
- Categories:
file-format
Avro file format.
- class AwsGlueSchemaRegistry(
- **kwargs,
Bases:
KafkaSchemaRegistry
Configuration for AWS Glue Schema Registry.
- property configs: dict
Additional configuration properties for the AWS Glue client.
- property credentials: S3AccessKeyCredentials
Credentials for authenticating with AWS Glue.
- property region_name: str
The AWS region of the Schema Registry.
- property registry_name: str
Registry name for AWS Glue Schema Registry.
- property schema_name: str
Schema name for AWS Glue Schema Registry.
- AwsGlueSchemaRegistryCredentialsSpec
alias of
S3AccessKeyCredentials
- class AzureAccountKeyCredentials( )
Bases:
AzureCredentials
- Categories:
credentials
Credentials class to access Azure using account key credentials (account name and account key).
- Variables:
account_name (Secret) – The Azure account name.
account_key (Secret) – The Azure account key.
to_dict() -> dict – Convert the AzureAccountKeyCredentials object to a dictionary.
- property account_key: Secret
The Azure account key.
- Type:
Secret
- property account_name: Secret
The Azure account name.
- Type:
Secret
- class AzureConfig(
- container: str,
- base_path: str,
- credentials: AzureAccountKeyCredentials,
Bases:
object
Configuration for staging files in Azure Blob Storage.
- Parameters:
container – Blob container name.
base_path – Blob path prefix for staged files (e.g. "staging/clickhouse").
credentials – Azure account-key credentials used for uploading staged files (via azure-storage-blob).
- base_path: str
- container: str
- credentials: AzureAccountKeyCredentials
- class AzureDestination( )
Bases:
DestinationPlugin
- Categories:
destination
Azure-file-based data outputs.
- class SupportedFormats(
- *values,
Bases:
Enum
Enum for the supported formats for the AzureDestination.
- avro = <class 'tabsdata._format.AvroFormat'>
- csv = <class 'tabsdata._format.CSVFormat'>
- ndjson = <class 'tabsdata._format.NDJSONFormat'>
- parquet = <class 'tabsdata._format.ParquetFormat'>
- chunk(
- working_dir: str,
- *results,
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: This method should not materialize the data, it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- property credentials: AzureCredentials
The credentials required to access Azure.
- Type:
AzureCredentials
- property format: FileFormat
The format of the file. If not provided, it will be inferred from the file extension of the URI.
- Type:
FileFormat
- stream(
- working_dir: str,
- *results,
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- class AzureSource(
- uri: str | list[str],
- credentials: AzureCredentials,
- format: str | FileFormat = None,
- initial_last_modified: str | datetime = None,
- file_info: str | FileInfo | None = None,
Bases:
SourcePlugin
- Categories:
source
Azure-file-based data inputs.
- class SupportedFormats(
- *values,
Bases:
Enum
Enum for the supported formats for the AzureSource.
- avro = <class 'tabsdata._format.AvroFormat'>
- csv = <class 'tabsdata._format.CSVFormat'>
- log = <class 'tabsdata._format.LogFormat'>
- ndjson = <class 'tabsdata._format.NDJSONFormat'>
- parquet = <class 'tabsdata._format.ParquetFormat'>
- chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless directly implementing streaming. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files: return ["file1.parquet", ["file2.parquet", "file3.parquet"]]
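The positional mapping described above can be sketched with a stand-in class. MyAzureSource is hypothetical: a real connector would subclass tabsdata's SourcePlugin and download Azure blobs as parquet files; empty placeholder files stand in here.

```python
import os
import tempfile

class MyAzureSource:
    """Illustrative only; not part of tabsdata."""

    def chunk(self, working_dir: str):
        paths = []
        for name in ("file1.parquet", "file2.parquet", "file3.parquet"):
            path = os.path.join(working_dir, name)
            open(path, "wb").close()  # placeholder for real parquet output
            paths.append(path)
        # First dataset argument receives one file, second receives two.
        return [paths[0], [paths[1], paths[2]]]

with tempfile.TemporaryDirectory() as d:
    result = MyAzureSource().chunk(d)
    assert len(result) == 2
```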
- property credentials: AzureCredentials
The credentials required to access Azure.
- Type:
AzureCredentials
- property file_info: FileInfo | None
The file information to add as system columns.
- Type:
FileInfo | None
- property format: FileFormat
The format of the file. If not provided, it will be inferred from the file extension of the data.
- Type:
FileFormat
- property initial_last_modified: str
The date and time after which the files were modified.
- property initial_values: dict
Return a dictionary with the initial values to be stored after execution of the plugin. They will be accessible in the next execution of the plugin. The dictionary must have the parameter names as keys and the initial values as values, all the type string.
- Returns:
A dictionary with the initial values of the parameters of the plugin.
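The initial_values contract above requires every key and value to be a string, so richer state is serialized before being returned. build_initial_values is a hypothetical helper, not part of tabsdata.

```python
from datetime import datetime, timezone

def build_initial_values(last_modified: datetime) -> dict:
    # Stored as strings; handed back to the plugin on its next execution.
    return {"initial_last_modified": last_modified.isoformat()}

state = build_initial_values(datetime(2024, 5, 1, tzinfo=timezone.utc))
assert all(isinstance(k, str) and isinstance(v, str) for k, v in state.items())
```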
- class BigQueryConn(
- gcs_folder: str,
- credentials: GCPCredentials,
- project: str = None,
- dataset: str = None,
- enforce_connection_params: bool = True,
- cx_dst_configs_gcs: dict = None,
- cx_dst_configs_bigquery: dict = None,
Bases:
Conn
Connection configuration for BigQueryDest.
- property credentials: GCPCredentials
- property cx_dst_configs_bigquery: dict
- property cx_dst_configs_gcs: dict
- property enforce_connection_params: bool
- property gcs_folder: str
- class BigQueryDest(
- conn: BigQueryConn,
- tables: TableSpec | None = None,
- if_table_exists: IfTableExistStrategySpec = 'append',
- schema_strategy: SchemaStrategySpec = 'update',
Bases:
DestinationPlugin
- Categories:
destination
BigQuery based data outputs. The data is first stored in parquet files in a GCS bucket, and then loaded into the BigQuery tables.
- property conn: BigQueryConn
- property if_table_exists: Literal['append', 'replace']
The strategy to follow when the table already exists.
- property schema_strategy: Literal['update', 'strict']
The strategy to follow when appending to a table with an existing schema.
- stream( )
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- class CSVFormat(
- separator: str | int = ',',
- quote_char: str | int = '"',
- eol_char: str | int = '\n',
- input_encoding: str = 'Utf8',
- input_null_values: list | None = None,
- input_missing_is_null: bool = True,
- input_truncate_ragged_lines: bool = False,
- input_comment_prefix: str | int | None = None,
- input_try_parse_dates: bool = False,
- input_decimal_comma: bool = False,
- input_has_header: bool = True,
- input_skip_rows: int = 0,
- input_skip_rows_after_header: int = 0,
- input_raise_if_empty: bool = True,
- input_ignore_errors: bool = False,
- output_include_header: bool = True,
- output_datetime_format: str | None = None,
- output_date_format: str | None = None,
- output_time_format: str | None = None,
- output_float_scientific: bool | None = None,
- output_float_precision: int | None = None,
- output_null_value: str | None = None,
- output_quote_style: str | None = None,
- output_maintain_order: bool = True,
Bases:
FileFormat
- Categories:
file-format
CSV file format.
- Variables:
quote_char (str | int) – The quote character of the CSV file.
eol_char (str | int) – The end of line character of the CSV file.
input_encoding (str) – The encoding of the CSV file. Only used when importing data.
input_null_values (list | None) – The null values of the CSV file. Only used when importing data.
input_missing_is_null (bool) – Whether missing values should be marked as null. Only used when importing data.
input_truncate_ragged_lines (bool) – Whether to truncate ragged lines of the CSV file. Only used when importing data.
input_comment_prefix (str | int | None) – The comment prefix of the CSV file. Only used when importing data.
input_try_parse_dates (bool) – Whether to try parse dates of the CSV file. Only used when importing data.
input_decimal_comma (bool) – Whether the CSV file uses decimal comma. Only used when importing data.
input_has_header (bool) – If the CSV file has header. Only used when importing data.
input_skip_rows (int) – How many rows should be skipped in the CSV file. Only used when importing data.
input_skip_rows_after_header (int) – How many rows should be skipped after the header in the CSV file. Only used when importing data.
input_raise_if_empty (bool) – If an error should be raised for an empty CSV. Only used when importing data.
input_ignore_errors (bool) – If the errors loading the CSV must be ignored. Only used when importing data.
output_include_header (bool) – Whether to include header in the CSV output. Only used when exporting data.
output_datetime_format (str | None) – A format string, with the specifiers defined by the chrono Rust crate. If no format specified, the default fractional-second precision is inferred from the maximum timeunit found in the frame’s Datetime cols (if any). Only used when exporting data.
output_date_format (str | None) – A format string, with the specifiers defined by the chrono Rust crate. Only used when exporting data.
output_time_format (str | None) – A format string, with the specifiers defined by the chrono Rust crate. Only used when exporting data.
output_float_scientific (bool | None) – Whether to use scientific form always (true), never (false), or automatically (None). Only used when exporting data.
output_float_precision (int | None) – Number of decimal places to write. Only used when exporting data.
output_null_value (str | None) – A string representing null values (defaulting to the empty string). Only used when exporting data.
output_quote_style (str | None) – Determines the quoting strategy used. Only used when exporting data. * necessary (default): puts quotes around fields only when necessary, i.e. when a field contains a quote, separator or record terminator, or when writing an empty record (which is indistinguishable from a record with one empty field). * always: puts quotes around every field. * never: never puts quotes around fields, even if that results in invalid CSV data (e.g. by not quoting strings containing the separator). * non_numeric: puts quotes around all fields that are non-numeric; when a field does not parse as a valid float or integer, quotes are used even if they aren't strictly necessary.
output_maintain_order (bool) – Maintain the order in which data is processed. Setting this to False will be slightly faster. Only used when exporting data.
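The four output_quote_style values roughly mirror the quoting modes of Python's stdlib csv module; the analogue below is an approximation for illustration, not tabsdata's own code.

```python
import csv
import io

def render(rows, quoting):
    # Serialize rows with the given csv quoting mode and return the text.
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting).writerows(rows)
    return buf.getvalue()

assert render([["a", "1"]], csv.QUOTE_ALL).strip() == '"a","1"'     # like "always"
assert render([["a", "1"]], csv.QUOTE_MINIMAL).strip() == "a,1"     # like "necessary"
assert render([["a", 1]], csv.QUOTE_NONNUMERIC).strip() == '"a",1'  # like "non_numeric"
```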
- class CdcConn
Base connection configuration for Change Data Capture (CDC) connectors.
- credentials: UserPasswordCredentials | None
- uri: str
- class CdcTrigger(
- **kwargs,
Bases:
StageTrigger
Base StageTrigger for Change Data Capture (CDC) connectors.
This abstract trigger provides the common polling loop, buffering, and staging logic for all CDC connectors. Database-specific subclasses implement the ingestion details for their respective replication protocols.
- property buffer_max_bytes: int | None
Maximum bytes in memory before flushing to the working directory.
- property buffer_max_rows: int
Maximum rows in memory before flushing to the working directory.
- property buffer_max_secs: int
Maximum seconds before flushing memory data to the working directory.
- property cdc_format: CdcFormat
Output format specification for CDC data.
- abstractmethod collect_tables(
- session: Any,
- cdc_format: CdcFormat = CdcFormat(values_format='columns', flatten_values=True),
- property conn: Conn
CDC connection configuration.
- abstractmethod harvest_tables(
- session: Any,
- cdc_format: CdcFormat = CdcFormat(values_format='columns', flatten_values=True),
- property poll_interval_seconds: int
Interval in seconds between polling for new changes.
- run(
- context: StageTriggerContext,
Continuously runs the stage trigger logic, using the received context to trigger and monitor executions. Exits SUCCESS (code 0) when the StageTrigger.end datetime is reached; ERROR (code 1) if any issue stops it from working or recovering; STOPPED (code 0) upon a signal.
- property trigger_max_bytes: int | None
Maximum total bytes before staging from working directory to the stage location.
- property trigger_max_rows: int | None
Maximum total rows before staging from working directory to the stage location.
- property trigger_max_secs: int
Maximum seconds before staging from working directory to the stage location.
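The buffer_max_* properties above imply a flush decision of roughly this shape (illustrative logic and default values, not tabsdata's implementation):

```python
import time

def should_flush(rows, nbytes, started_at,
                 max_rows=10_000, max_bytes=None, max_secs=60):
    # Flush when any configured threshold is reached.
    if rows >= max_rows:
        return True
    # buffer_max_bytes may be None, meaning no byte-based limit.
    if max_bytes is not None and nbytes >= max_bytes:
        return True
    return time.monotonic() - started_at >= max_secs

now = time.monotonic()
assert should_flush(10_000, 0, now)             # row threshold reached
assert not should_flush(1, 10**9, now)          # no byte limit configured
assert should_flush(1, 512, now, max_bytes=256) # byte threshold reached
```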
- class ClickHouseConn(
- **kwargs,
Bases:
Conn
Represents a connection configuration for ClickHouse with cloud-staged bulk loading.
ClickHouse ingests data via cloud storage (S3, Azure Blob, or GCS): files are first uploaded to the configured staging area and then loaded into ClickHouse using the matching table function (s3(), azureBlobStorage(), or s3() pointed at GCS).
- property credentials: UserPasswordCredentials | None
The credentials used to authenticate with ClickHouse. None if no credentials were provided (the ClickHouse default user, no password).
- property cx_dst_configs_clickhouse: dict
Destination-specific configuration parameters for the ClickHouse connection.
- property cx_src_configs_clickhouse: dict
Source-specific configuration parameters for the ClickHouse connection.
- property database: str
The target ClickHouse database name.
- property enforce_connection_params: bool
Whether to enforce connection parameters for the ClickHouse connection.
- property host: str
The ClickHouse server hostname or IP address.
- property port: int
The HTTP(S) port of the ClickHouse server.
- property secure: bool
Whether to connect to ClickHouse over HTTPS/TLS.
- property staging: S3Config | AzureConfig | GCSConfig
The cloud storage staging configuration (S3Config, AzureConfig, or GCSConfig).
- class ClickHouseDest(
- **kwargs,
Bases:
DestinationPlugin
- Categories:
destination
Destination plugin for writing data to ClickHouse using cloud-staged bulk loading.
Data is first written as Parquet files to the configured cloud storage staging area (S3, Azure Blob, or GCS) and then ingested into ClickHouse via its native table functions (s3(), azureBlobStorage()).
- property conn: ClickHouseConn
The ClickHouse connection configuration.
- property if_table_exists: Literal['append', 'replace']
The strategy to follow when the table already exists.
- property schema_evolution: Literal['none', 'iceberg']
The schema evolution strategy.
- stream( )
Write each result as a Parquet file and bulk load it into ClickHouse via the cloud staging area.
- Parameters:
working_dir – Temporary directory for intermediate Parquet files.
results – One polars LazyFrame (or None) per destination table, in the same order as destination_tables.
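The staged-load step described above amounts to a query of this shape: once the parquet files are in the staging area, ClickHouse's s3() table function reads them back. Table and bucket names here are illustrative.

```python
# Build an INSERT ... SELECT over the staged parquet files (names assumed).
table = "events"
stage_url = "https://my-bucket.s3.amazonaws.com/staging/events/*.parquet"
insert_sql = (
    f"INSERT INTO {table} "
    f"SELECT * FROM s3('{stage_url}', 'Parquet')"
)
assert insert_sql.startswith("INSERT INTO events")
```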
- class ConfluentSchemaRegistry(
- **kwargs,
Bases:
KafkaSchemaRegistry
Configuration for Confluent Schema Registry.
- property configs: dict
Additional configuration properties for the Schema Registry client.
- property credentials: UserPasswordCredentials | None
Credentials for authenticating with the Schema Registry.
- property url: str
The URL of the Schema Registry.
- ConfluentSchemaRegistryCredentialsSpec
alias of
UserPasswordCredentials
- class Conn
Bases:
object
Represents a connection configuration. This is a base class.
- class CronTrigger( )
Bases:
object
- Categories:
trigger
Represents a cron-based trigger for scheduling Tabsdata publishers, subscribers and transformers.
This class defines a trigger based on a cron expression, with optional start and end times to constrain its activation period. It validates the cron mask and time boundaries to ensure the trigger is well-formed.
- property end: str | None
Gets the end time of the trigger as a timezone-aware ISO 8601 string with ‘Z’ timezone.
- Returns:
The end time in ISO format, or None if not set.
- property mask: str
Gets the cron expression mask.
- Returns:
The cron expression string.
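The trigger validates its cron mask on construction; a minimal stand-in for such a check is sketched below (tabsdata's real validation is more thorough than field counting).

```python
def looks_like_cron(mask: str) -> bool:
    # A standard cron mask has five fields:
    # minute, hour, day-of-month, month, day-of-week.
    fields = mask.split()
    return len(fields) == 5

assert looks_like_cron("0 2 * * *")
assert not looks_like_cron("0 2 *")
```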
- class CustomException(
- message: str,
- error_code=None,
Bases:
Exception
Custom exception for execution errors.
- class DatabricksDestination(
- host_url: str,
- token: str | Secret,
- tables: list[str] | str,
- volume: str,
- catalog: str | None = None,
- schema: str | None = None,
- warehouse: str | None = None,
- warehouse_id: str | None = None,
- if_table_exists: Literal['append', 'replace'] = 'append',
- schema_strategy: Literal['update', 'strict'] = 'update',
- **kwargs,
Bases:
DestinationPlugin
- Categories:
destination
Databricks based data outputs.
- chunk( ) -> List[None | str]
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: This method should not materialize the data, it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- property host_url: str
- property if_table_exists: Literal['append', 'replace']
The strategy to follow when the table already exists.
- property schema_strategy: Literal['update', 'strict']
The strategy to follow when appending to a table with an existing schema.
- stream( )
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- property token: Secret
- write(
- files,
This method is used to write the files to Databricks. It is called from the stream method and is not intended to be called directly.
- class DatabricksSource(
- host_url: str,
- token: str | Secret,
- queries: list[str] | str,
- initial_values: dict | None = None,
- warehouse: str | None = None,
- warehouse_id: str | None = None,
- **kwargs,
Bases:
SourcePlugin
- Categories:
source
Databricks based data inputs.
- chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless directly implementing streaming. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files: return ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- property host_url: str
- property initial_values: dict
The initial values for the parameters in the SQL queries.
- property token: Secret
- class Db2CdcTrigger(
- **kwargs,
Bases:
CdcTrigger
- Categories:
trigger
A CdcTrigger that captures changes from a Db2 database using SQL Replication.
This trigger connects to a Db2 server, reads change data capture events (inserts, updates, deletes) on specified tables via SQL Replication, and stages them.
- property blocking_timeout_seconds: int
Timeout in seconds for blocking reads from the replication stream.
- collect_tables(
- session: Any,
- cdc_format: CdcFormat = CdcFormat(values_format='columns', flatten_values=True),
- harvest_tables(
- session: Any,
- cdc_format: CdcFormat = CdcFormat(values_format='columns', flatten_values=True),
- class Db2Conn(
- **kwargs,
Bases:
Conn
Represents a connection configuration to Db2.
- property credentials: UserPasswordCredentials | None
The credentials required to access the Db2 database. If no credentials were provided, it will return None.
- property cx_dst_configs_db2: dict
Destination-specific configuration parameters for the Db2 connection.
- property cx_src_configs_db2: dict
Source-specific configuration parameters for the Db2 connection.
- property enforce_connection_params: bool
Whether to enforce connection parameters for the Db2 connection.
- property uri: str
The URI of the database where the data is located.
- class Db2Src(
- **kwargs,
Bases:
SourcePlugin
- Categories:
source
Source plugin for reading data from Db2.
- property chunk_size: int
The chunk size in records for reading large queries in batches.
- property conn: Db2Conn
The Db2 connection configuration.
- property initial_values: dict
The initial values for the parameters in the SQL queries.
- property loader: Literal['polars_sqlalchemy', 'pandas']
The data processing loader to use for executing SQL imports.
- property schema_overrides: list[dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64]] | None
The schema overrides for the tables being read.
- stream(
- working_dir: str,
Execute the query and load the results as a list of TableFrames.
- Parameters:
working_dir – The directory where the Parquet files will be saved.
- Returns:
A list of TableFrames containing the query results.
- property transactional: bool
Whether to use transactions for reading to ensure consistency.
- class DestinationPlugin
Bases:
object
- Categories:
plugin
Parent class for subscriber connectors.
- trigger_output(working_dir, *args, **kwargs): Trigger the exporting of the data. This function will receive the resulting data from the dataset function and must store it in the desired location.
- chunk(
- working_dir: str,
- *results: VALID_PLUGIN_RESULT,
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: This method should not materialize the data, it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- stream(
- working_dir: str,
- *results: VALID_PLUGIN_RESULT,
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- write( )
Given a file or a list of files, write to the desired destination. Note: this method might materialize the data in the files it receives, so chunks should be of an appropriate size.
- Parameters:
files (str) – The file or files to be stored in the final destination.
- DestinationPlugin.chunk(
- working_dir: str,
- *results: VALID_PLUGIN_RESULT,
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: This method should not materialize the data, it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- DestinationPlugin.stream(
- working_dir: str,
- *results: VALID_PLUGIN_RESULT,
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- DestinationPlugin.write( )
Given a file or a list of files, write to the desired destination. Note: this method might materialize the data in the files it receives, so chunks should be of an appropriate size.
- Parameters:
files (str) – The file or files to be stored in the final destination.
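The chunk/stream/write contract above can be sketched with a minimal stand-in class (plain Python, no tabsdata import; CSV files stand in for the Parquet chunks a real plugin would write, and the class name is hypothetical):

```python
import csv
import os
import shutil


class ToyDestination:
    """Illustrative stand-in for a DestinationPlugin subclass."""

    def __init__(self, target_dir: str):
        self.target_dir = target_dir

    def chunk(self, working_dir: str, *results) -> list:
        # Store each result in working_dir without materializing it;
        # return the intermediate files created (None for None results).
        files = []
        for i, rows in enumerate(results):
            if rows is None:
                files.append(None)
                continue
            path = os.path.join(working_dir, f"chunk_{i}.csv")
            with open(path, "w", newline="") as f:
                csv.writer(f).writerows(rows)
            files.append(path)
        return files

    def write(self, files):
        # Move the intermediate files to the final destination.
        os.makedirs(self.target_dir, exist_ok=True)
        for path in files:
            if path is not None:
                shutil.copy(path, self.target_dir)

    def stream(self, working_dir: str, *results) -> None:
        # Default wiring: chunk locally, then write to the destination.
        self.write(self.chunk(working_dir, *results))
```

The split keeps chunking (cheap, local) separate from the final write (which may materialize each chunk), which is why chunks should be sized appropriately.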
- class DruidConn(
- **kwargs,
Bases:
Conn
Represents a connection configuration for Apache Druid with cloud-staged bulk loading.
Data is first uploaded to the configured cloud staging area (S3, Azure Blob, or GCS) and then ingested into Druid via the Overlord REST API using a native batch
index_parallel ingestion task.
- property credentials: UserPasswordCredentials | None
The credentials used for HTTP Basic authentication.
None if no credentials were provided.
- property http_timeout_sec: int
Timeout in seconds for HTTP requests to the Overlord API.
- property max_concurrent_subtasks: int
Maximum number of parallel sub-tasks inside the ingestion task.
- property max_rows_in_memory: int
In-memory buffer size during ingestion.
- property max_rows_per_segment: int
Maximum number of rows per segment (dynamic partitioning threshold).
- property overlord_url: str
The base URL of the Druid Overlord REST API.
- property segment_granularity: Literal['ALL', 'DAY', 'HOUR', 'MONTH', 'YEAR']
Time-based partitioning granularity for new segments.
- property staging: S3Config | AzureConfig | GCSConfig
The cloud storage staging configuration (S3Config, AzureConfig, or GCSConfig).
- class DruidDest(
- **kwargs,
Bases:
DestinationPlugin
- Categories:
destination
Destination plugin for writing data to Apache Druid using cloud-staged bulk loading.
Data is first written as Parquet files to the configured cloud storage staging area (S3, Azure Blob, or GCS) and then ingested into Druid via a native batch
index_parallel task submitted to the Overlord REST API.
- property conn: DruidConn
The Druid connection configuration.
- property if_table_exists: Literal['append', 'replace']
The strategy to follow when the datasource already exists.
- stream( )
Write each result as a Parquet file and bulk load it into Druid via the cloud staging area.
- Parameters:
working_dir – Temporary directory for intermediate Parquet files.
results – One polars LazyFrame (or None) per destination datasource, in the same order as destination_tables.
- DruidDest.stream( )
Write each result as a Parquet file and bulk load it into Druid via the cloud staging area.
- Parameters:
working_dir – Temporary directory for intermediate Parquet files.
results – One polars LazyFrame (or None) per destination datasource, in the same order as destination_tables.
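A rough sketch of the kind of native batch task this plugin submits may help. Field names follow Druid's index_parallel ingestion spec; the concrete values, the helper name, and the schema-discovery choices here are illustrative assumptions, not the plugin's actual spec:

```python
import json


def build_index_parallel_task(datasource: str, staging_uris: list,
                              segment_granularity: str = "DAY",
                              max_rows_per_segment: int = 5_000_000,
                              max_concurrent_subtasks: int = 2) -> dict:
    """Illustrative Druid native-batch ingestion spec for staged Parquet files."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "__time", "format": "auto"},
                "dimensionsSpec": {"useSchemaDiscovery": True},
                "granularitySpec": {"segmentGranularity": segment_granularity},
            },
            "ioConfig": {
                "type": "index_parallel",
                # inputSource type depends on the staging backend (s3 here).
                "inputSource": {"type": "s3", "uris": staging_uris},
                "inputFormat": {"type": "parquet"},
            },
            "tuningConfig": {
                "type": "index_parallel",
                "maxNumConcurrentSubTasks": max_concurrent_subtasks,
                "partitionsSpec": {"type": "dynamic",
                                   "maxRowsPerSegment": max_rows_per_segment},
            },
        },
    }
```

A spec like this would be POSTed as JSON to the Overlord's task endpoint (`/druid/indexer/v1/task` under overlord_url), which is how max_rows_per_segment, max_concurrent_subtasks, and segment_granularity reach Druid.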
- class EnvironmentSecret(
- environment_variable_name: str,
Bases:
Secret
- Categories:
secret
Secrets class representing a secret obtained from an environment variable in the server.
- Variables:
environment_variable_name (str) – Name of the environment variable from which we will obtain the secret value.
dict (to_dict() ->) – Convert the EnvironmentSecret object to a dictionary.
str (secret_value() ->) – Get the secret value.
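The behavior described above can be sketched with a stand-in class (not the tabsdata implementation): the secret serializes only the variable name, and the value is resolved from the server's environment on demand.

```python
import os


class EnvSecretSketch:
    """Illustrative stand-in for EnvironmentSecret."""

    def __init__(self, environment_variable_name: str):
        self.environment_variable_name = environment_variable_name

    def to_dict(self) -> dict:
        # Only the variable name is serialized; the value stays server-side.
        return {"environment_variable_name": self.environment_variable_name}

    @property
    def secret_value(self) -> str:
        # Resolved lazily from the environment at execution time.
        value = os.environ.get(self.environment_variable_name)
        if value is None:
            raise KeyError(
                f"Environment variable {self.environment_variable_name!r} not set"
            )
        return value
```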
- class FileInfo(
- *values,
Bases:
StrEnum
- class GCPServiceAccountKeyCredentials(
- service_account_key: str | Secret,
Bases:
GCPCredentials
- Categories:
credentials
Credentials class to store the credentials needed to access GCS using account key credentials (service account key).
- Variables:
service_account_key (Secret) – The GCS service account key.
dict (to_dict() ->) – Convert the GCPServiceAccountKeyCredentials object to a dictionary
- property service_account_key: Secret
The GCS service account key.
- Type:
Secret
- class GCSConfig(
- bucket: str,
- base_path: str,
- upload_credentials: GCPServiceAccountKeyCredentials | None = None,
- hmac_credentials: HMACCredentials | None = None,
Bases:
object
Configuration for staging files in Google Cloud Storage.
- Parameters:
bucket – GCS bucket name.
base_path – Object path prefix for staged files (e.g. "staging/clickhouse").
upload_credentials – Optional GCP service-account credentials used for uploading staged files via the GCS SDK. When None, application default credentials (ADC) are used.
hmac_credentials – Optional HMACCredentials carrying the GCS HMAC access key and secret for S3-compatible access. When None, the storage backend operates without explicit credentials (suitable for public buckets or server-side IAM).
- base_path: str
- bucket: str
- hmac_credentials: HMACCredentials | None = None
- upload_credentials: GCPServiceAccountKeyCredentials | None = None
- class GCSDestination( )
Bases:
DestinationPlugin
- Categories:
destination
GCS-file-based data outputs.
- class SupportedFormats(
- *values,
Bases:
Enum
Enum for the supported formats for the GCSDestination.
- avro = <class 'tabsdata._format.AvroFormat'>
- csv = <class 'tabsdata._format.CSVFormat'>
- ndjson = <class 'tabsdata._format.NDJSONFormat'>
- parquet = <class 'tabsdata._format.ParquetFormat'>
- chunk(
- working_dir: str,
- *results,
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: This method should not materialize the data, it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- property credentials: GCPCredentials
The credentials required to access GCS.
- Type:
GCPCredentials
- property format: FileFormat
The format of the file. If not provided, it will be inferred from the file extension of the URI.
- Type:
FileFormat
- stream(
- working_dir: str,
- *results,
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- GCSDestination.chunk(
- working_dir: str,
- *results,
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: This method should not materialize the data, it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- GCSDestination.stream(
- working_dir: str,
- *results,
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- class GCSSource(
- uri: str | list[str],
- credentials: GCPCredentials,
- format: str | FileFormat = None,
- initial_last_modified: str | datetime = None,
- file_info: str | FileInfo | None = None,
Bases:
SourcePlugin
- Categories:
source
GCS-file-based data inputs.
- class SupportedFormats(
- *values,
Bases:
Enum
Enum for the supported formats for the GCSSource.
- avro = <class 'tabsdata._format.AvroFormat'>
- csv = <class 'tabsdata._format.CSVFormat'>
- log = <class 'tabsdata._format.LogFormat'>
- ndjson = <class 'tabsdata._format.NDJSONFormat'>
- parquet = <class 'tabsdata._format.ParquetFormat'>
- chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless directly implementing streaming. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order: if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files, return: return ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- property credentials: GCPCredentials
The credentials required to access GCS.
- Type:
GCPCredentials
- property file_info: FileInfo | None
The file information to add as system columns.
- Type:
FileInfo | None
- property format: FileFormat
The format of the file. If not provided, it will be inferred from the file extension of the data.
- Type:
FileFormat
- property initial_last_modified: str
The date and time after which the files were modified.
- Type:
- property initial_values: dict
Return a dictionary with the initial values to be stored after execution of the plugin. They will be accessible in the next execution of the plugin. The dictionary must have the parameter names as keys and the initial values as values, all of type string.
- Returns:
A dictionary with the initial values of the parameters of the plugin.
- Return type:
- GCSSource.chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless directly implementing streaming. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order: if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files, return: return ["file1.parquet", ["file2.parquet", "file3.parquet"]]
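The positional-mapping convention described above can be shown in isolation (illustrative: the real framework loads the parquet files into TableFrames before calling the function; `dispatch` and `example_fn` are hypothetical names):

```python
def dispatch(dataset_function, chunk_result: list):
    """Map a chunk() return value onto the dataset function.

    Each top-level entry becomes one positional argument; a nested
    list means that argument receives several files.
    """
    return dataset_function(*chunk_result)


def example_fn(first, pair):
    # first gets a single file; pair gets a list of two files.
    return first, list(pair)
```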
- class HMACCredentials( )
Bases:
Credentials
- Categories:
credentials
Generic HMAC credentials consisting of an access key and a secret key.
These credentials follow the same shape as AWS S3 access keys and are compatible with any service that uses HMAC-based authentication over the S3-compatible API, such as GCS HMAC keys or MinIO credentials.
- Variables:
access_key (Secret) – The HMAC access key id.
secret_key (Secret) – The HMAC secret key.
dict (to_dict() ->) – Convert the HMACCredentials object to a dictionary.
- property access_key: Secret
The HMAC access key id.
- Type:
Secret
- property secret_key: Secret
The HMAC secret key.
- Type:
Secret
- class HashiCorpSecret( )
Bases:
Secret
- Categories:
secret
Secrets class representing a secret stored in HashiCorp Vault.
- Variables:
path (str) – The path to the secret in HashiCorp Vault.
name (str) – The name of the secret in HashiCorp Vault.
vault (str) – If multiple vaults exist in the system, the name of the vault to use. When executing in the server, the URL and token associated with that specific vault will be used. The name can only contain uppercase letters, numbers and underscores, and can’t begin with a number. Defaults to “HASHICORP”.
dict (to_dict() ->) – Convert the HashiCorpSecret object to a dictionary.
str (secret_value() ->) – Get the secret value.
- property secret_value: str
Get the secret value pointed at by the secret. To be used only during execution in the backend.
- Returns:
The secret value.
- Return type:
- property vault: str
- class KafkaConn(
- **kwargs,
Bases:
Conn
Kafka Consumer Connection configuration.
- property credentials: UserPasswordCredentials
Kafka credentials.
- property cx_src_configs: dict
Source-specific configuration parameters for the Kafka connection.
- property enforce_connection_params: bool
Whether to enforce connection parameters.
- property schema_registry: ConfluentSchemaRegistry | AwsGlueSchemaRegistry | None
Kafka schema registry configuration.
- KafkaCredentialsSpec
alias of
UserPasswordCredentials
- class KafkaTrigger(
- **kwargs,
Bases:
StageTrigger
- Categories:
trigger
A StageTrigger that consumes messages from a Kafka topic and stages them.
This trigger connects to a Kafka cluster, consumes messages from a specified topic, and uses rollover conditions (time, size, and message count) to buffer and stage the data.
- property conn: KafkaConn
Kafka consumer connection configuration.
- property data_format: Literal['avro', 'json'] | Protobuf
Data format of the messages (“avro”, “json”, “protobuf”).
- property messages_rollover: Annotated[int, Gt(gt=0)] | None
Message count-based rollover configuration.
- run(
- context: StageTriggerContext,
Run the Kafka stage trigger to consume messages, buffer them, and stage them based on rollover conditions.
- Parameters:
context – The StageTriggerContext instance.
- property schema: str
Schema string for deserializing messages.
- property size_rollover_mb: Annotated[int, Gt(gt=0)] | None
Size-based rollover configuration. Size is specified in megabytes.
- property time_rollover_secs: int
Time-based rollover configuration. Time is specified in seconds.
- property topic: str
Kafka topic to consume from.
- KafkaTrigger.run(
- context: StageTriggerContext,
Run the Kafka stage trigger to consume messages, buffer them, and stage them based on rollover conditions.
- Parameters:
context – The StageTriggerContext instance.
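The rollover logic described above (buffer messages, stage when a time, size, or count condition trips) can be sketched as a small stand-in buffer; the class and its thresholds mirror the trigger's properties but are not the tabsdata implementation:

```python
import time


class RolloverBuffer:
    """Illustrative buffer that signals staging when any rollover
    condition (seconds, megabytes, or message count) is reached."""

    def __init__(self, time_rollover_secs=None, size_rollover_mb=None,
                 messages_rollover=None):
        self.time_rollover_secs = time_rollover_secs
        self.size_rollover_mb = size_rollover_mb
        self.messages_rollover = messages_rollover
        self.reset()

    def reset(self):
        self.messages = []
        self.bytes = 0
        self.started = time.monotonic()

    def add(self, message: bytes) -> bool:
        """Buffer a message; return True when the buffer should be staged."""
        self.messages.append(message)
        self.bytes += len(message)
        return self.should_rollover()

    def should_rollover(self) -> bool:
        if self.messages_rollover and len(self.messages) >= self.messages_rollover:
            return True
        if self.size_rollover_mb and self.bytes >= self.size_rollover_mb * 1024 * 1024:
            return True
        if (self.time_rollover_secs
                and time.monotonic() - self.started >= self.time_rollover_secs):
            return True
        return False
```

After staging, a real trigger would call something like `reset()` and continue consuming from the topic.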
- class LocalFileDestination( )
Bases:
DestinationPlugin
- Categories:
destination
LocalFile-based data outputs.
- class SupportedFormats(
- *values,
Bases:
Enum
Enum for the supported formats for the LocalFileDestination.
- avro = <class 'tabsdata._format.AvroFormat'>
- csv = <class 'tabsdata._format.CSVFormat'>
- ndjson = <class 'tabsdata._format.NDJSONFormat'>
- parquet = <class 'tabsdata._format.ParquetFormat'>
- chunk(
- working_dir: str,
- *results,
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: This method should not materialize the data, it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- property format: FileFormat
The format of the file or files. If not provided, it will be inferred from the file extension in the path.
- Type:
FileFormat
- stream(
- working_dir: str,
- *results,
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- LocalFileDestination.chunk(
- working_dir: str,
- *results,
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: This method should not materialize the data, it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- LocalFileDestination.stream(
- working_dir: str,
- *results,
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- class LocalFileSource(
- path: str | list[str],
- format: str | FileFormat = None,
- initial_last_modified: str | datetime = None,
- file_info: str | FileInfo | None = None,
Bases:
SourcePlugin
- Categories:
source
Local-file-based data inputs.
- class SupportedFormats(
- *values,
Bases:
Enum
Enum for the supported formats for the LocalFileSource.
- avro = <class 'tabsdata._format.AvroFormat'>
- csv = <class 'tabsdata._format.CSVFormat'>
- log = <class 'tabsdata._format.LogFormat'>
- ndjson = <class 'tabsdata._format.NDJSONFormat'>
- parquet = <class 'tabsdata._format.ParquetFormat'>
- chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless directly implementing streaming. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order: if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files, return: return ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- property file_info: FileInfo | None
The file information to add as system columns.
- Type:
FileInfo | None
- property format: FileFormat
The format of the file or files. If not provided, it will be inferred from the file extension in the path.
- Type:
FileFormat
- property initial_last_modified: str
The date and time after which the files were modified.
- Type:
- property initial_values: dict
Return a dictionary with the initial values to be stored after execution of the plugin. They will be accessible in the next execution of the plugin. The dictionary must have the parameter names as keys and the initial values as values, all of type string.
- Returns:
A dictionary with the initial values of the parameters of the plugin.
- Return type:
- LocalFileSource.chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless directly implementing streaming. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order: if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files, return: return ["file1.parquet", ["file2.parquet", "file3.parquet"]]
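The initial_last_modified behavior (pick up only files modified after a cutoff, as a file source would between runs) can be sketched with the stdlib; the function name is hypothetical and this is not the plugin's actual selection code:

```python
import os
from datetime import datetime


def files_modified_after(directory: str, initial_last_modified: datetime) -> list:
    """Return files in directory whose mtime is after the cutoff."""
    cutoff = initial_last_modified.timestamp()
    selected = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) > cutoff:
            selected.append(path)
    return selected
```

In the real plugin the cutoff would then be advanced via initial_values so the next execution only sees newer files.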
- class LogFormat
Bases:
FileFormat
- Categories:
file-format
Log file format.
- class MSSQLConn(
- connection_string_or_uri: str,
- credentials: UserPasswordCredentials | None = None,
- enforce_connection_params: bool = True,
- cx_src_configs_mssql: dict | None = None,
- cx_dst_configs_mssql: dict | None = None,
Bases:
Conn
Represents a connection configuration to Microsoft SQL Server (MSSQL).
- property connection_string_or_uri: str
The connection string or URI for the MSSQL database.
- property credentials: UserPasswordCredentials | None
The credentials used to authenticate with the database.
- property cx_dst_configs_mssql: dict
The destination configuration parameters for the MSSQL connection.
- property cx_src_configs_mssql: dict
The source configuration parameters for the MSSQL connection.
- property enforce_connection_params: bool
Whether to enforce connection parameters for the MSSQL connection. Applicable only for a connection to an MSSQL destination.
- property uri: str
The URI for the MSSQL database.
- class MSSQLDestination(
- connection_string: str,
- destination_table: str | list[str],
- credentials: dict | UserPasswordCredentials | None = None,
- server: str | Secret = None,
- database: str | Secret = None,
- driver: str | Secret = None,
- if_table_exists: Literal['append', 'replace'] = 'append',
- chunk_size: int = 50000,
- **kwargs,
Bases:
DestinationPlugin
- Categories:
destination
MSSQL-based data outputs.
- chunk( ) → list[None | str]
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: This method should not materialize the data, it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- property connection_string: str
Get the connection string for the database.
- Returns:
The connection string.
- Return type:
- property credentials: UserPasswordCredentials | None
The credentials required to access Microsoft SQL Server. If no credentials were provided, it will return None.
- Type:
UserPasswordCredentials | None
- property if_table_exists: Literal['append', 'replace']
Returns the value of the if_table_exists property. This property determines what to do if the table already exists.
- stream( )
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- write(
- files,
This method is used to write the files to the database. It is called from the stream method, and it is not intended to be called directly.
- MSSQLDestination.chunk( ) → list[None | str]
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: This method should not materialize the data, it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- MSSQLDestination.stream( )
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- MSSQLDestination.write(
- files,
This method is used to write the files to the database. It is called from the stream method, and it is not intended to be called directly.
- class MSSQLSource(
- **kwargs,
Bases:
SourcePlugin
- Categories:
source
Microsoft SQL Server based data inputs.
- chunk(
- working_dir: str,
Execute the query and yield chunks of data.
- Parameters:
working_dir (str) – The working directory for temporary files.
- Yields:
pd.DataFrame – A chunk of data from the query result.
- property connection_string: str
Get the connection string for the database.
- Returns:
The connection string.
- Return type:
- property credentials: UserPasswordCredentials | None
The credentials required to access Microsoft SQL Server. If no credentials were provided, it will return None.
- Type:
UserPasswordCredentials | None
- MSSQLSource.chunk(
- working_dir: str,
Execute the query and yield chunks of data.
- Parameters:
working_dir (str) – The working directory for temporary files.
- Yields:
pd.DataFrame – A chunk of data from the query result.
- class MSSQLSrc(
- conn: MSSQLConn,
- queries: str | list[str],
- initial_values: dict[str, Any] | None = None,
- transactional: bool = True,
- chunk_size: Annotated[int, Gt(gt=0)] = 100000,
- loader: Literal['polars_sqlalchemy', 'pandas'] = 'polars_sqlalchemy',
- schema_overrides: dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64] | list[dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64]] | None = None,
Bases:
SourcePlugin
- Categories:
source
Source Plugin for reading data from a Microsoft SQL Server (MSSQL) database.
- property chunk_size: int
The chunk size in records for reading large queries in batches.
- property conn: MSSQLConn
The MSSQL connection configuration.
- property initial_values: dict
The initial values for the parameters in the SQL queries.
- property loader: Literal['polars_sqlalchemy', 'pandas']
The data processing loader to use for executing SQL imports.
- property schema_overrides: list[dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64]] | None
The schema overrides for the tables being read.
- stream(
- working_dir: str,
Execute the query and load the results as a list of TableFrames.
- Parameters:
working_dir – The directory where the Parquet files will be saved.
- Returns:
A list of TableFrames containing the query results.
- property transactional: bool
Whether to use transactions for reading to ensure consistency.
- MSSQLSrc.stream(
- working_dir: str,
Execute the query and load the results as a list of TableFrames.
- Parameters:
working_dir – The directory where the Parquet files will be saved.
- Returns:
A list of TableFrames containing the query results.
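The chunk_size batching that MSSQLSrc applies to large queries can be sketched with the stdlib; sqlite3 stands in for the MSSQL connection here, and the point is the fetchmany-sized reads rather than one large fetchall:

```python
import sqlite3


def read_in_chunks(conn: sqlite3.Connection, query: str, chunk_size: int = 100_000):
    """Yield query results in batches of at most chunk_size rows."""
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows
```

Each yielded batch would be written out as a Parquet chunk in working_dir before the next batch is fetched, keeping memory bounded.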
- class MariaDBConn(
- uri: str,
- credentials: UserPasswordCredentials | None = None,
- enforce_connection_params: bool = True,
- cx_src_configs_mariadb: dict | None = None,
- cx_dst_configs_mariadb: dict | None = None,
Bases:
Conn
Represents a connection configuration to MariaDB.
- property credentials: UserPasswordCredentials | None
The credentials required to access the MariaDB database. If no credentials were provided, it will return None.
- property cx_dst_configs_mariadb: dict
Destination-specific configuration parameters for the MariaDB connection.
- property cx_src_configs_mariadb: dict
Source-specific configuration parameters for the MariaDB connection.
- property enforce_connection_params: bool
Whether to enforce connection parameters for the MariaDB connection.
- property uri: str
The URI of the database where the data is located.
- class MariaDBDest(
- conn: MariaDBConn,
- destination_tables: str | list[str],
- if_table_exists: Literal['append', 'replace'] = 'append',
- transactional: bool = True,
- chunk_size: Annotated[int, Gt(gt=0)] = 100000,
- loader: Literal['polars_sqlalchemy'] = 'polars_sqlalchemy',
Bases:
DestinationPlugin
- Categories:
destination
Destination plugin for writing data to MariaDB.
- property chunk_size: int
The chunk size for writing large results in batches.
- property conn: MariaDBConn
The MariaDB connection configuration.
- property destination_tables: list[str]
The table(s) to create. If multiple tables are provided, they must be provided as a list.
- property if_table_exists: Literal['append', 'replace']
The strategy to follow when the table already exists.
- property loader: Literal['polars_sqlalchemy']
The data processing loader to use for executing DB write operation.
- stream( )
Store the results into the MariaDB database.
- Parameters:
working_dir – The working directory where the results are stored.
results – The results to store in the SQL destination.
- property transactional: bool
Whether to use transactions for writing data to ensure consistency.
- MariaDBDest.stream( )
Store the results into the MariaDB database.
- Parameters:
working_dir – The working directory where the results are stored.
results – The results to store in the SQL destination.
- class MariaDBDestination(
- **kwargs,
Bases:
DestinationPlugin- Categories:
destination
MariaDB-based data outputs.
- chunk( ) list[str | None]
Store the results in the SQL destination.
- Parameters:
working_dir – The working directory where the results will be stored.
results (list[pl.LazyFrame | None]) – The results to store in the SQL destination.
- property credentials: UserPasswordCredentials
The credentials required to access the MariaDB database.
- Type:
- property destination_table: str | List[str]
The table(s) to create. If multiple tables are provided, they must be provided as a list.
- property if_table_exists: Literal['append', 'replace']
The strategy to follow when the table already exists.
- Type:
- MariaDBDestination.chunk( ) list[str | None]
Store the results in the SQL destination.
- Parameters:
working_dir – The working directory where the results will be stored.
results (list[pl.LazyFrame | None]) – The results to store in the SQL destination.
- MariaDBDestination.write( )
Given a file or a list of files, write to the desired destination. Note: this method might materialize the data in the files it receives, so chunks should be of an appropriate size.
- Parameters:
files (str) – The file or files to be stored in the final destination.
- class MariaDBSource(
- **kwargs,
Bases:
SourcePlugin- Categories:
source
MariaDB-based data inputs.
- chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless streaming is implemented directly. The method receives a folder where it must store the data as parquet files, and returns a list of the paths of the files created. These files are then loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files: return ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- property credentials: UserPasswordCredentials | None
The credentials required to access MariaDB. If no credentials were provided, it will return None.
- Type:
UserPasswordCredentials | None
- MariaDBSource.chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless streaming is implemented directly. The method receives a folder where it must store the data as parquet files, and returns a list of the paths of the files created. These files are then loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files: return ["file1.parquet", ["file2.parquet", "file3.parquet"]]
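The positional mapping described above can be sketched with plain files. Everything here is hypothetical: the file names are made up and the empty files stand in for real parquet output; only the shape of the returned list matters.

```python
# Hypothetical sketch of a chunk() return value: one entry per
# positional argument of the dataset function, where a nested list
# delivers several files to a single argument.
import os
import tempfile

def chunk(working_dir: str):
    paths = []
    for name in ("file1.parquet", "file2.parquet", "file3.parquet"):
        path = os.path.join(working_dir, name)
        open(path, "wb").close()  # placeholder for real parquet output
        paths.append(path)
    # First argument gets one file, second argument gets two files.
    return [paths[0], [paths[1], paths[2]]]

with tempfile.TemporaryDirectory() as d:
    result = chunk(d)
    print(len(result))                  # 2
    print(isinstance(result[1], list))  # True
```

With this return value, the dataset function's first parameter receives a single file and its second parameter receives both remaining files.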
- class MariaDBSrc(
- conn: MariaDBConn,
- queries: str | list[str],
- initial_values: dict[str, Any] | None = None,
- transactional: bool = True,
- chunk_size: Annotated[int, Gt(gt=0)] = 100000,
- loader: Literal['polars_sqlalchemy', 'pandas'] = 'polars_sqlalchemy',
- schema_overrides: dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64] | list[dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64]] | None = None,
Bases:
SourcePlugin- Categories:
source
Source plugin for reading data from MariaDB.
- property chunk_size: int
The chunk size in records for reading large queries in batches.
- property conn: MariaDBConn
The MariaDB connection configuration.
- property initial_values: dict
The initial values for the parameters in the SQL queries.
- property loader: Literal['polars_sqlalchemy', 'pandas']
The data processing loader to use for executing SQL imports.
- property schema_overrides: list[dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64]] | None
The schema overrides for the tables being read.
- stream(
- working_dir: str,
Execute the query and load the results as a list of TableFrames.
- Parameters:
working_dir – The directory where the Parquet files will be saved.
- Returns:
A list of TableFrames containing the query results.
- property transactional: bool
Whether to use transactions for reading to ensure consistency.
- MariaDBSrc.stream(
- working_dir: str,
Execute the query and load the results as a list of TableFrames.
- Parameters:
working_dir – The directory where the Parquet files will be saved.
- Returns:
A list of TableFrames containing the query results.
- class MongoDBDestination(
- uri: str,
- collections_with_ids: tuple[str, str | None] | List[tuple[str, str | None]],
- credentials: UserPasswordCredentials = None,
- connection_options: dict = None,
- if_collection_exists: Literal['append', 'replace'] = 'append',
- use_trxs: bool = False,
- docs_per_trx: int = 1000,
- maintain_order: bool = False,
- update_existing: bool = True,
- fail_on_duplicate_key: bool = True,
- log_intermediate_files: bool = False,
- **kwargs,
Bases:
DestinationPlugin- Categories:
destination
MongoDB-based data outputs.
- chunk( ) List[None | List[str]]
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: this method should not materialize the data; it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- property connection_options: dict
- property credentials: UserPasswordCredentials | None
- property if_collection_exists: Literal['append', 'replace']
- stream( )
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- property uri: str
- write(
- files,
This method is used to write the files to the database. It is called from the stream method, and it is not intended to be called directly.
- MongoDBDestination.chunk( ) List[None | List[str]]
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: this method should not materialize the data; it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- MongoDBDestination.stream( )
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- MongoDBDestination.write(
- files,
This method is used to write the files to the database. It is called from the stream method, and it is not intended to be called directly.
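The collections_with_ids argument pairs each collection name with an optional string, per the declared type tuple[str, str | None]. Reading the second element as the result column used for the document _id is an assumption; the collection and column names below are made up. A small sketch of the expected shape:

```python
# Hypothetical values: collection names and id columns are invented
# for illustration. None presumably lets MongoDB assign ObjectIds
# (an assumption, not stated in the reference).
from typing import List, Optional, Tuple

collections_with_ids: List[Tuple[str, Optional[str]]] = [
    ("orders", "order_id"),  # pair a collection with an id column
    ("events", None),        # no explicit id column for this one
]

for collection, id_column in collections_with_ids:
    assert isinstance(collection, str)
    assert id_column is None or isinstance(id_column, str)
```

A single `("orders", "order_id")` tuple (not wrapped in a list) is also accepted by the declared signature.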
- class MySQLCdcTrigger(
- **kwargs,
Bases:
CdcTrigger- Categories:
trigger
A CdcTrigger that captures changes from a MySQL database using binlog replication.
This trigger connects to a MySQL server, reads the binary log for change data capture events (inserts, updates, deletes) on specified tables, and stages them.
- property blocking_timeout_seconds: int
Timeout in seconds for blocking reads from the binlog stream.
- collect_tables(
- session: Any,
- cdc_format: CdcFormat = CdcFormat(values_format='columns', flatten_values=True),
- harvest_tables(
- session: Any,
- cdc_format: CdcFormat = CdcFormat(values_format='columns', flatten_values=True),
- property server_id: int
MySQL server ID for binlog replication.
- MySQLCdcTrigger.collect_tables(
- session: Any,
- cdc_format: CdcFormat = CdcFormat(values_format='columns', flatten_values=True),
- MySQLCdcTrigger.harvest_tables(
- session: Any,
- cdc_format: CdcFormat = CdcFormat(values_format='columns', flatten_values=True),
- class MySQLConn(
- uri: str,
- credentials: UserPasswordCredentials | None = None,
- enforce_connection_params: bool = True,
- cx_src_configs_mysql: dict | None = None,
- cx_dst_configs_mysql: dict | None = None,
Bases:
Conn
Represents a connection configuration to MySQL.
- property credentials: UserPasswordCredentials | None
The credentials required to access the MySQL database. If no credentials were provided, it will return None.
- property cx_dst_configs_mysql: dict
The destination configuration parameters for the MySQL connection.
- property cx_src_configs_mysql: dict
The source configuration parameters for the MySQL connection.
- property enforce_connection_params: bool
Whether to enforce connection parameters for the MySQL connection.
- property uri: str
The URI of the database where the data is located.
- class MySQLDest(
- conn: MySQLConn,
- destination_tables: str | list[str],
- if_table_exists: Literal['append', 'replace'] = 'append',
- transactional: bool = True,
- chunk_size: Annotated[int, Gt(gt=0)] = 100000,
- loader: Literal['polars_sqlalchemy'] = 'polars_sqlalchemy',
Bases:
DestinationPlugin- Categories:
destination
Destination plugin for writing data to MySQL.
- property chunk_size: int
The chunk size for writing large results in batches.
- property conn: MySQLConn
The MySQL connection configuration.
- property destination_tables: list[str]
The table(s) to create. If multiple tables are provided, they must be provided as a list.
- property if_table_exists: Literal['append', 'replace']
The strategy to follow when the table already exists.
- property loader: Literal['polars_sqlalchemy']
The data processing loader to use for executing DB write operations.
- stream( )
Store the results into the MySQL database.
- Parameters:
working_dir – The working directory where the results are stored.
results – The results to store in the SQL destination.
- property transactional: bool
Whether to use transactions for writing data to ensure consistency.
- MySQLDest.stream( )
Store the results into the MySQL database.
- Parameters:
working_dir – The working directory where the results are stored.
results – The results to store in the SQL destination.
- class MySQLDestination(
- **kwargs,
Bases:
DestinationPlugin- Categories:
destination
MySQL-based data outputs.
- property credentials: UserPasswordCredentials
The credentials required to access the MySQL database.
- Type:
- property destination_table: str | List[str]
The table(s) to create. If multiple tables are provided, they must be provided as a list.
- property if_table_exists: Literal['append', 'replace']
The strategy to follow when the table already exists.
- Type:
- MySQLDestination.write( )
Given a file or a list of files, write to the desired destination. Note: this method might materialize the data in the files it receives, so chunks should be of an appropriate size.
- Parameters:
files (str) – The file or files to be stored in the final destination.
- class MySQLSource(
- **kwargs,
Bases:
SourcePlugin- Categories:
source
MySQL-based data inputs.
- chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless streaming is implemented directly. The method receives a folder where it must store the data as parquet files, and returns a list of the paths of the files created. These files are then loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files: return ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- property credentials: UserPasswordCredentials | None
The credentials required to access the MySQL database. If no credentials were provided, it will return None.
- Type:
UserPasswordCredentials | None
- MySQLSource.chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless streaming is implemented directly. The method receives a folder where it must store the data as parquet files, and returns a list of the paths of the files created. These files are then loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files: return ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- class MySQLSrc(
- conn: MySQLConn,
- queries: str | list[str],
- initial_values: dict[str, Any] | None = None,
- transactional: bool = True,
- chunk_size: Annotated[int, Gt(gt=0)] = 100000,
- loader: Literal['polars_sqlalchemy', 'pandas'] = 'polars_sqlalchemy',
- schema_overrides: dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64] | list[dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64]] | None = None,
Bases:
SourcePlugin- Categories:
source
Source plugin for reading data from MySQL.
- property chunk_size: int
The chunk size in records for reading large queries in batches.
- property conn: MySQLConn
The MySQL connection configuration.
- property initial_values: dict
The initial values for the parameters in the SQL queries.
- property loader: Literal['polars_sqlalchemy', 'pandas']
The data processing loader to use for executing SQL imports.
- property schema_overrides: list[dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64]] | None
The schema overrides for the tables being read.
- stream(
- working_dir: str,
Execute the query and load the results as a list of TableFrames.
- Parameters:
working_dir – The directory where the Parquet files will be saved.
- Returns:
A list of TableFrames containing the query results.
- property transactional: bool
Whether to use transactions for reading to ensure consistency.
- MySQLSrc.stream(
- working_dir: str,
Execute the query and load the results as a list of TableFrames.
- Parameters:
working_dir – The directory where the Parquet files will be saved.
- Returns:
A list of TableFrames containing the query results.
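chunk_size controls how many records each batch holds when reading large queries. The batching itself can be sketched in plain Python; the loader's actual mechanics are internal to tabsdata, so this is only an illustration of the splitting behavior.

```python
# Sketch of chunk_size batching: split a row stream into batches of
# at most chunk_size rows. Not tabsdata code, just the general idea.
from itertools import islice
from typing import Iterable, Iterator, List

def batched(rows: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield successive batches of at most chunk_size rows."""
    it = iter(rows)
    while batch := list(islice(it, chunk_size)):
        yield batch

# 10 rows with chunk_size=4 -> batches of 4, 4, and 2 rows.
sizes = [len(b) for b in batched(range(10), 4)]
print(sizes)  # [4, 4, 2]
```

The default chunk_size of 100000 means a 250,000-row query would be read in three batches under this scheme.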
- class NDJSONFormat
Bases:
FileFormat- Categories:
file-format
NDJSON (newline-delimited JSON) file format.
- class OracleConn(
- uri: str,
- credentials: UserPasswordCredentials | None = None,
- enforce_connection_params: bool = True,
- cx_src_configs_oracle: dict | None = None,
- cx_dst_configs_oracle: dict | None = None,
Bases:
Conn
Represents a connection configuration to Oracle.
- property credentials: UserPasswordCredentials | None
The credentials required to access the Oracle database. If no credentials were provided, it will return None.
- property cx_dst_configs_oracle: dict
Destination-specific configuration parameters for the Oracle connection.
- property cx_src_configs_oracle: dict
Source-specific configuration parameters for the Oracle connection.
- property enforce_connection_params: bool
Whether to enforce connection parameters for the Oracle connection.
- property uri: str
The URI of the database where the data is located.
- class OracleDestination(
- uri: str,
- destination_table: List[str] | str,
- credentials: UserPasswordCredentials = None,
- if_table_exists: Literal['append', 'replace'] = 'append',
Bases:
DestinationPlugin- Categories:
destination
Oracle-based data outputs.
- property credentials: UserPasswordCredentials
The credentials required to access the Oracle database.
- Type:
- property destination_table: str | List[str]
The table(s) to create. If multiple tables are provided, they must be provided as a list.
- property if_table_exists: Literal['append', 'replace']
The strategy to follow when the table already exists.
- Type:
- OracleDestination.write( )
Given a file or a list of files, write to the desired destination. Note: this method might materialize the data in the files it receives, so chunks should be of an appropriate size.
- Parameters:
files (str) – The file or files to be stored in the final destination.
- class OracleSource(
- **kwargs,
Bases:
SourcePlugin- Categories:
source
Oracle-based data inputs.
- chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless streaming is implemented directly. The method receives a folder where it must store the data as parquet files, and returns a list of the paths of the files created. These files are then loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files: return ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- property credentials: UserPasswordCredentials | None
The credentials required to access Oracle. If no credentials were provided, it will return None.
- Type:
UserPasswordCredentials | None
- OracleSource.chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless streaming is implemented directly. The method receives a folder where it must store the data as parquet files, and returns a list of the paths of the files created. These files are then loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files: return ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- class OracleSrc(
- conn: OracleConn,
- queries: str | list[str],
- initial_values: dict[str, Any] | None = None,
- transactional: bool = True,
- chunk_size: Annotated[int, Gt(gt=0)] = 100000,
- loader: Literal['polars_sqlalchemy', 'pandas'] = 'polars_sqlalchemy',
- schema_overrides: dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64] | list[dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64]] | None = None,
Bases:
SourcePlugin- Categories:
source
Source plugin for reading data from Oracle.
- property chunk_size: int
The chunk size in records for reading large queries in batches.
- property conn: OracleConn
The Oracle connection configuration.
- property initial_values: dict
The initial values for the parameters in the SQL queries.
- property loader: Literal['polars_sqlalchemy', 'pandas']
The data processing loader to use for executing SQL imports.
- property schema_overrides: list[dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64]] | None
The schema overrides for the tables being read.
- stream(
- working_dir: str,
Execute the query and load the results as a list of TableFrames.
- Parameters:
working_dir – The directory where the Parquet files will be saved.
- Returns:
A list of TableFrames containing the query results.
- property transactional: bool
Whether to use transactions for reading to ensure consistency.
- OracleSrc.stream(
- working_dir: str,
Execute the query and load the results as a list of TableFrames.
- Parameters:
working_dir – The directory where the Parquet files will be saved.
- Returns:
A list of TableFrames containing the query results.
- class ParquetFormat
Bases:
FileFormat- Categories:
file-format
Parquet file format.
- class PostgresCdcConn(
- **kwargs,
Bases:
CdcConn, PostgresConn
PostgreSQL CDC connection configuration.
- class PostgresCdcTrigger(
- **kwargs,
Bases:
CdcTrigger- Categories:
trigger
A CdcTrigger that captures changes from a PostgreSQL database using WAL logical replication.
This trigger connects to a PostgreSQL server, reads the write-ahead log via the wal2json output plugin for change data capture events (inserts, updates, deletes) on specified tables, and stages them.
- property blocking_timeout_seconds: int
Timeout in seconds for blocking reads from the replication stream.
- collect_tables(
- session: Any,
- cdc_format: CdcFormat = CdcFormat(values_format='columns', flatten_values=True),
- harvest_tables(
- session: Any,
- cdc_format: CdcFormat = CdcFormat(values_format='columns', flatten_values=True),
- property slot_behavior: Literal['create', 'reuse']
Whether to create a new replication slot or reuse an existing one.
- PostgresCdcTrigger.collect_tables(
- session: Any,
- cdc_format: CdcFormat = CdcFormat(values_format='columns', flatten_values=True),
- PostgresCdcTrigger.harvest_tables(
- session: Any,
- cdc_format: CdcFormat = CdcFormat(values_format='columns', flatten_values=True),
- class PostgresConn(
- uri: str,
- credentials: UserPasswordCredentials | None = None,
- enforce_connection_params: bool = True,
- cx_src_configs_postgres: dict | None = None,
- cx_dst_configs_postgres: dict | None = None,
Bases:
Conn
Represents a connection configuration to Postgres.
- property credentials: UserPasswordCredentials | None
The credentials required to access the Postgres database. If no credentials were provided, it will return None.
- property cx_dst_configs_postgres: dict
Destination-specific configuration parameters for the Postgres connection.
- property cx_src_configs_postgres: dict
Source-specific configuration parameters for the Postgres connection.
- property enforce_connection_params: bool
Whether to enforce connection parameters for the Postgres connection.
- property uri: str
The URI of the database where the data is located.
- class PostgresDest(
- conn: PostgresConn,
- destination_tables: str | list[str],
- if_table_exists: Literal['append', 'replace'] = 'append',
- transactional: bool = True,
- chunk_size: Annotated[int, Gt(gt=0)] = 100000,
- loader: Literal['polars_sqlalchemy'] = 'polars_sqlalchemy',
Bases:
DestinationPlugin- Categories:
destination
Destination plugin for writing data to PostgreSQL.
- property chunk_size: int
The chunk size for writing large results in batches.
- property conn: PostgresConn
The Postgres connection configuration.
- property destination_tables: list[str]
The table(s) to create. If multiple tables are provided, they must be provided as a list.
- property if_table_exists: Literal['append', 'replace']
The strategy to follow when the table already exists.
- property loader: Literal['polars_sqlalchemy']
The data processing loader to use for executing DB write operations.
- stream( )
Store the results into the PostgreSQL database.
- Parameters:
working_dir – The working directory where the results are stored.
results – The results to store in the SQL destination.
- property transactional: bool
Whether to use transactions for writing data to ensure consistency.
- PostgresDest.stream( )
Store the results into the PostgreSQL database.
- Parameters:
working_dir – The working directory where the results are stored.
results – The results to store in the SQL destination.
- class PostgresDestination(
- **kwargs,
Bases:
DestinationPlugin- Categories:
destination
Postgres-based data outputs.
- property credentials: UserPasswordCredentials
The credentials required to access the Postgres database.
- Type:
- property destination_table: str | List[str]
The table(s) to create. If multiple tables are provided, they must be provided as a list.
- property if_table_exists: Literal['append', 'replace']
The strategy to follow when the table already exists.
- Type:
- PostgresDestination.write( )
Given a file or a list of files, write to the desired destination. Note: this method might materialize the data in the files it receives, so chunks should be of an appropriate size.
- Parameters:
files (str) – The file or files to be stored in the final destination.
- class PostgresSource(
- **kwargs,
Bases:
SourcePlugin- Categories:
source
Postgres-based data inputs.
- chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless streaming is implemented directly. The method receives a folder where it must store the data as parquet files, and returns a list of the paths of the files created. These files are then loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files: return ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- property credentials: UserPasswordCredentials | None
The credentials required to access the Postgres database. If no credentials were provided, it will return None.
- Type:
UserPasswordCredentials | None
- PostgresSource.chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless streaming is implemented directly. The method receives a folder where it must store the data as parquet files, and returns a list of the paths of the files created. These files are then loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files: return ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- class PostgresSrc(
- conn: PostgresConn,
- queries: str | list[str],
- initial_values: dict[str, Any] | None = None,
- transactional: bool = True,
- chunk_size: Annotated[int, Gt(gt=0)] = 100000,
- loader: Literal['polars_sqlalchemy', 'pandas'] = 'polars_sqlalchemy',
- schema_overrides: dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64] | list[dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64]] | None = None,
Bases:
SourcePlugin- Categories:
source
Source plugin for reading data from Postgres.
- property chunk_size: int
The chunk size in records for reading large queries in batches.
- property conn: PostgresConn
The Postgres connection configuration.
- property initial_values: dict
The initial values for the parameters in the SQL queries.
- property loader: Literal['polars_sqlalchemy', 'pandas']
The data processing loader to use for executing SQL imports.
- property schema_overrides: list[dict[str, Boolean | Categorical | Date | Datetime | Decimal | Duration | Enum | Float16 | Float32 | Float64 | Int8 | Int16 | Int64 | Int32 | Int128 | Null | String | Time | UInt8 | UInt16 | UInt32 | UInt64]] | None
The schema overrides for the tables being read.
- stream(
- working_dir: str,
Execute the query and load the results as a list of TableFrames.
- Parameters:
working_dir – The directory where the Parquet files will be saved.
- Returns:
A list of TableFrames containing the query results.
- property transactional: bool
Whether to use transactions for reading to ensure consistency.
- PostgresSrc.stream(
- working_dir: str,
Execute the query and load the results as a list of TableFrames.
- Parameters:
working_dir – The directory where the Parquet files will be saved.
- Returns:
A list of TableFrames containing the query results.
- class S3AccessKeyCredentials( )
Bases:
S3Credentials- Categories:
credentials
Credentials class to access an S3 bucket using access key credentials (access key id and secret access key).
- Variables:
aws_access_key_id (Secret) – The AWS access key id.
aws_secret_access_key (Secret) – The AWS secret access key.
dict (to_dict() ->) – Convert the S3AccessKeyCredentials object to a dictionary
- property aws_access_key_id: Secret
The AWS access key id.
- Type:
Secret
- property aws_secret_access_key: Secret
The AWS secret access key.
- Type:
Secret
- class S3Config(
- bucket: str,
- region: str,
- base_path: str,
- credentials: S3AccessKeyCredentials,
Bases:
object
Configuration for staging files in Amazon S3.
- Parameters:
bucket – S3 bucket name.
region – AWS region (e.g. "us-east-1").
base_path – Key prefix under which staged files are stored (e.g. "staging/clickhouse").
credentials – AWS access-key credentials used for uploading staged files (via boto3).
- base_path: str
- bucket: str
- credentials: S3AccessKeyCredentials
- region: str
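Wiring this configuration together might look like the following sketch. All values are placeholders, and the positional form of the S3AccessKeyCredentials constructor is an assumption; in practice, secrets should come from an environment variable or secret store rather than literals.

```python
# Hypothetical staging configuration; every value here is a placeholder.
staging = S3Config(
    bucket="my-staging-bucket",
    region="us-east-1",
    base_path="staging/clickhouse",
    credentials=S3AccessKeyCredentials(
        "<aws_access_key_id>",
        "<aws_secret_access_key>",
    ),
)
```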
- class S3Destination(
- uri: str | list[str],
- credentials: S3Credentials,
- format: str | FileFormat = None,
- region: str = None,
- catalog: AWSGlue = None,
Bases:
DestinationPlugin- Categories:
destination
S3-file-based data outputs.
- class SupportedFormats(
- *values,
Bases:
Enum
Enum for the supported formats for the S3Destination.
- avro = <class 'tabsdata._format.AvroFormat'>
- csv = <class 'tabsdata._format.CSVFormat'>
- ndjson = <class 'tabsdata._format.NDJSONFormat'>
- parquet = <class 'tabsdata._format.ParquetFormat'>
- property catalog: AWSGlue
The catalog to store the data in.
- Type:
Catalog
- chunk(
- working_dir: str,
- *results,
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: this method should not materialize the data; it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- property credentials: S3Credentials | S3AccessKeyCredentials
The credentials required to access the S3 bucket.
- Type:
S3Credentials
- property format: FileFormat
The format of the file. If not provided, it will be inferred from the file.
- Type:
FileFormat
- stream(
- working_dir: str,
- *results,
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
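Constructing the destination might look like the following sketch; the URI, region, and credential values are placeholders following the constructor signature shown above.

```python
# Hypothetical wiring; all values are placeholders.
destination = S3Destination(
    uri="s3://my-bucket/exports/data.parquet",
    credentials=S3AccessKeyCredentials(
        "<aws_access_key_id>",
        "<aws_secret_access_key>",
    ),
    format="parquet",
    region="us-east-1",
)
```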
- S3Destination.chunk(
- working_dir: str,
- *results,
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: this method should not materialize the data; it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- S3Destination.stream(
- working_dir: str,
- *results,
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- class S3Source(
- uri: str | list[str],
- credentials: S3Credentials,
- format: str | FileFormat = None,
- initial_last_modified: str | datetime = None,
- region: str = None,
- file_info: str | FileInfo | None = None,
Bases:
SourcePlugin- Categories:
source
S3-file-based data inputs.
- class SupportedFormats(
- *values,
Bases:
Enum
Enum for the supported formats for the S3Source.
- avro = <class 'tabsdata._format.AvroFormat'>
- csv = <class 'tabsdata._format.CSVFormat'>
- log = <class 'tabsdata._format.LogFormat'>
- ndjson = <class 'tabsdata._format.NDJSONFormat'>
- parquet = <class 'tabsdata._format.ParquetFormat'>
- chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless directly implementing streaming. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files, return: ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- property credentials: S3Credentials
The credentials required to access the S3 bucket.
- Type:
S3Credentials
- property file_info: FileInfo | None
The file information to add as system columns.
- Type:
FileInfo | None
- property format: FileFormat
The format of the file. If not provided, it will be inferred from the file.
- Type:
FileFormat
- property initial_last_modified: str
The date and time after which the files were modified.
- Type:
- property initial_values: dict
Return a dictionary with the initial values to be stored after execution of the plugin. They will be accessible in the next execution of the plugin. The dictionary must have the parameter names as keys and the initial values as values, all of type string.
- Returns:
A dictionary with the initial values of the parameters of the plugin.
- Return type:
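A configured source might look like the following sketch; the URI pattern, timestamp, and credential values are placeholders following the constructor signature shown above.

```python
# Hypothetical wiring; all values are placeholders. initial_last_modified
# restricts the first run to files modified after the given timestamp.
source = S3Source(
    uri="s3://my-bucket/landing/data.csv",
    credentials=S3AccessKeyCredentials(
        "<aws_access_key_id>",
        "<aws_secret_access_key>",
    ),
    format="csv",
    initial_last_modified="2024-01-01T00:00:00",
    region="us-east-1",
)
```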
- S3Source.chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless directly implementing streaming. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files, return: ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- class SalesforceReportSource(
- credentials: SalesforceCredentials,
- report: str | list[str],
- column_name_strategy: Literal['columnName', 'label'],
- find_report_by: Literal['id', 'name'] = None,
- filter: tuple[str, str, str] | list[tuple[str, str, str]] = None,
- filter_logic: str = None,
- instance_url: str = None,
- last_modified_column: str = None,
- initial_last_modified: str = None,
- **kwargs,
Bases:
SourcePlugin- Categories:
source
Salesforce Reports based data inputs.
- chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless directly implementing streaming. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files, return: ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- property column_name_strategy: Literal['columnName', 'label']
- property credentials: SalesforceCredentials | SalesforceTokenCredentials
- property find_report_by: Literal['id', 'name']
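A configured report source might look like the following sketch; the report name and credential values are placeholders following the constructor signature shown above.

```python
# Hypothetical wiring; all values are placeholders.
source = SalesforceReportSource(
    credentials=SalesforceTokenCredentials(
        "<username>", "<password>", "<security_token>"
    ),
    report="Quarterly Pipeline",
    column_name_strategy="label",
    find_report_by="name",
)
```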
- SalesforceReportSource.chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless directly implementing streaming. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files, return: ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- class SalesforceSource(
- credentials: SalesforceCredentials,
- query: str | list[str],
- instance_url: str = None,
- include_deleted: bool = False,
- initial_last_modified: str = None,
- **kwargs,
Bases:
SourcePlugin- Categories:
source
Salesforce (SOQL query) based data inputs (not Salesforce Reports).
- chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless directly implementing streaming. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files, return: ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- property credentials: SalesforceCredentials | SalesforceTokenCredentials
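A configured SOQL source might look like the following sketch; the query and credential values are placeholders following the constructor signature shown above.

```python
# Hypothetical wiring; all values are placeholders.
source = SalesforceSource(
    credentials=SalesforceTokenCredentials(
        "<username>", "<password>", "<security_token>"
    ),
    query="SELECT Id, Name FROM Account",
)
```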
- SalesforceSource.chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless directly implementing streaming. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files, return: ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- class SalesforceTokenCredentials( )
Bases:
SalesforceCredentials- Categories:
credentials
Credentials class to store the credentials needed to access Salesforce using a username, a password and a security token.
- Variables:
username (Secret) – The username to access Salesforce.
password (Secret) – The password to access Salesforce.
security_token (Secret) – The security token to access Salesforce.
dict (to_dict() ->) – Convert the SalesforceTokenCredentials object to a dictionary
- property password: Secret
The password to access Salesforce.
- Type:
Secret
- property security_token: Secret
The security token to access Salesforce.
- Type:
Secret
- property username: Secret
The username to access Salesforce.
- Type:
Secret
- class SnowflakeDestination(
- connection_parameters: dict,
- destination_table: List[str] | str,
- if_table_exists: Literal['append', 'replace'] = 'append',
- stage: str | None = None,
- **kwargs,
Bases:
DestinationPlugin- Categories:
destination
Snowflake based data outputs.
- chunk( ) List[None | str]
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: this method should not materialize the data; it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- property if_table_exists: Literal['append', 'replace']
Returns the value of the if_table_exists property. This property determines what to do if the table already exists.
- stream( )
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- write(
- files,
This method is used to write the files to the database. It is called from the stream method, and it is not intended to be called directly.
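A configured destination might look like the following sketch. The connection_parameters keys follow common Snowflake Python connector conventions and, like every value here, are assumptions rather than a confirmed schema.

```python
# Hypothetical wiring; all values are placeholders.
destination = SnowflakeDestination(
    connection_parameters={
        "account": "<account_identifier>",
        "user": "<user>",
        "password": "<password>",
        "database": "<database>",
        "schema": "<schema>",
        "warehouse": "<warehouse>",
    },
    destination_table="PUBLIC.ORDERS",
    if_table_exists="append",
)
```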
- SnowflakeDestination.chunk( ) List[None | str]
Trigger the exporting of the data to local parquet chunks. This method will receive the resulting data from the user function and must store it in the local system as parquet files, using the working_dir. Note: this method should not materialize the data; it should only store it in the local system.
- Parameters:
working_dir (str) – The folder where any files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
A list of the intermediate files created
- SnowflakeDestination.stream( )
Trigger the exporting of the data. This method will receive the resulting data from the user function and must store it in the desired location. Note: this method might materialize the data provided in a single chunk generated by the chunk function if invoked, so chunks should be of an appropriate size.
- Parameters:
working_dir (str) – The folder where any intermediate files generated must be stored (this refers to temporary files that will be deleted after the execution of the plugin, not the final destination of the data)
results – The data to be exported. It is a list of polars LazyFrames or None.
- Returns:
None
- SnowflakeDestination.write(
- files,
This method is used to write the files to the database. It is called from the stream method, and it is not intended to be called directly.
- class SourcePlugin
Bases:
object- Categories:
plugin
Parent class for publisher connectors.
- def chunk(working_dir) -> Union[str, Tuple[str, …], List[str]]
Trigger the import of the data. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files, return: ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless directly implementing streaming. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files, return: ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- property initial_values: dict
Return a dictionary with the initial values to be stored after execution of the plugin. They will be accessible in the next execution of the plugin. The dictionary must have the parameter names as keys and the initial values as values, all of type string.
- Returns:
A dictionary with the initial values of the parameters of the plugin.
- Return type:
- stream(
- working_dir: str,
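The chunk contract above can be sketched with a stand-in class. This is an illustration of the return shape only, not the real SourcePlugin base class; a real implementation would write genuine parquet data into working_dir.

```python
import os

class TwoArgumentSource:
    """Stand-in illustrating the chunk contract: the returned list maps
    files to the dataset function's positional parameters."""

    def chunk(self, working_dir: str):
        names = ["first.parquet", "second.parquet", "third.parquet"]
        for name in names:
            # A real plugin would write actual parquet content here.
            open(os.path.join(working_dir, name), "wb").close()
        # The first parameter receives one file, the second receives two.
        return [names[0], [names[1], names[2]]]
```

The nested list in the return value is what makes the second dataset-function argument receive two files.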
- SourcePlugin.chunk(
- working_dir: str,
Trigger the import of the data. This must be implemented in any class that inherits from this class unless directly implementing streaming. The method will receive a folder where it must store the data as parquet files, and return a list of the paths of the files created. These files will then be loaded and mapped to the dataset function in positional order, so if you want file.parquet to be the first argument of the dataset function, you must return it first. If you want a parameter to receive multiple files, return a list of the paths. For example, to provide a first argument with a single file and a second argument with two files, return: ["file1.parquet", ["file2.parquet", "file3.parquet"]]
- SourcePlugin.stream(
- working_dir: str,
- class Stage(
- use_existing_data: bool = True,
Bases:
SourcePlugin- Categories:
source
Inputs that have been placed in a stage, usually from a StageTrigger.
- stream(
- working_dir: str,
- property use_existing_data: bool
- Stage.stream(
- working_dir: str,
- class StageTrigger(
- **kwargs,
Bases:
ABC
- property out: dict[int, str] | None
Returns a mapping of positional index to table name for the outputs produced by this trigger, or None when the StageTrigger implementation only supports generating a single table.
- abstractmethod run(
- ctx: StageTriggerContext,
Continuously runs the stage trigger logic, using the received context to trigger and monitor executions. Exit statuses: SUCCESS (exit code 0) when the StageTrigger.end datetime is reached; ERROR (exit code 1) if any issue stops it from working or recovering; STOPPED (exit code 0) upon a signal.
- abstractmethod StageTrigger.run(
- ctx: StageTriggerContext,
Continuously runs the stage trigger logic, using the received context to trigger and monitor executions. Exit statuses: SUCCESS (exit code 0) when the StageTrigger.end datetime is reached; ERROR (exit code 1) if any issue stops it from working or recovering; STOPPED (exit code 0) upon a signal.
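The exit-status contract can be sketched like this. It is a stand-in, not the real StageTrigger base class, and the ctx.trigger()/ctx.monitor() methods are assumptions standing in for whatever the StageTriggerContext actually exposes.

```python
import datetime as dt

class SketchTrigger:
    """Stand-in illustrating the run contract and its exit statuses."""

    def __init__(self, end: dt.datetime):
        self.end = end

    def run(self, ctx) -> int:
        try:
            # Continuously trigger and monitor executions until `end`.
            while dt.datetime.now() < self.end:
                ctx.trigger()   # assumed method: start an execution
                ctx.monitor()   # assumed method: watch it to completion
            return 0            # SUCCESS: end datetime reached
        except Exception:
            return 1            # ERROR: stopped working and could not recover
```

Signal handling (the STOPPED status) is omitted here; a real implementation would also exit 0 on receiving a stop signal.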
- class StarRocksConn(
- **kwargs,
Bases:
Conn
Represents a connection configuration for StarRocks with cloud-staged bulk loading.
StarRocks ingests data via cloud storage (S3, Azure Blob, or GCS): files are first uploaded to the configured staging area and then loaded into StarRocks using the FILES() table function in an INSERT INTO … SELECT * FROM FILES(...) statement.
- property credentials: UserPasswordCredentials | None
The credentials used to authenticate with StarRocks.
None if no credentials were provided (StarRocks root user, no password).
- property cx_dst_configs_starrocks: dict
Destination-specific configuration parameters for the StarRocks connection.
- property cx_src_configs_starrocks: dict
Source-specific configuration parameters for the StarRocks connection.
- property database: str
The target StarRocks database name.
- property enforce_connection_params: bool
Whether to enforce connection parameters for the StarRocks connection.
- property host: str
The StarRocks FE hostname or IP address.
- property port: int
The MySQL protocol port of the StarRocks FE server.
- property staging: S3Config | AzureConfig | GCSConfig
The cloud storage staging configuration (S3Config, AzureConfig, or GCSConfig).
- class StarRocksDest(
- **kwargs,
Bases:
DestinationPlugin- Categories:
destination
Destination plugin for writing data to StarRocks using cloud-staged bulk loading.
Data is first written as Parquet files to the configured cloud storage staging area (S3, Azure Blob, or GCS) and then ingested into StarRocks via its FILES() table function in an INSERT INTO … SELECT * FROM FILES(...) statement.
- property conn: StarRocksConn
The StarRocks connection configuration.
- property if_table_exists: Literal['append', 'replace']
The strategy to follow when the table already exists.
- property schema_evolution: Literal['none', 'iceberg']
The schema evolution strategy.
- stream( )
Write each result as a Parquet file and bulk load it into StarRocks via the cloud staging area.
- Parameters:
working_dir – Temporary directory for intermediate Parquet files.
results – One polars LazyFrame (or None) per destination table, in the same order as destination_tables.
- StarRocksDest.stream( )
Write each result as a Parquet file and bulk load it into StarRocks via the cloud staging area.
- Parameters:
working_dir – Temporary directory for intermediate Parquet files.
results – One polars LazyFrame (or None) per destination table, in the same order as destination_tables.
- class TableFrame(
- df: Mapping[str, Sequence[object] | Mapping[str, Sequence[object]]] | TableFrame | None = None,
- *,
- origin: TableFrameOrigin | None = TableFrameOrigin.INIT,
- properties: TableFrameProperties | None = None,
Bases:
object
> Private Functions
- assert_has_cols( ) None
- Categories:
tableframe
Ensures that the (non-system) columns in the TableFrame match the expected columns.
Raises an exception if the expectation is not met.
If exact is True, the check verifies that the TableFrame contains exactly the expected columns, with no extra or missing ones.
- Parameters:
cols – The expected column name(s). Can be a string or a list of strings.
exact – If True, checks that the TableFrame contains exactly the specified columns.
- Raises:
ValueError – If expected columns are missing or unexpected columns are present in the TableFrame.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> tf.assert_has_cols("a") >>> tf.assert_has_cols(["a", "b"], exact=True)
- cast(
- dtypes: Mapping[td_typing.ColumnNameOrSelector | td_typing.DataType, td_typing.DataType] | td_typing.DataType,
- *,
- strict: bool = True,
- Categories:
manipulation
Cast columns to a new data type.
- Parameters:
dtypes – Mapping of the column name(s) to the new data type(s).
strict – If True, raises an error if the cast cannot be performed.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ │ null ┆ 20 │ │ F ┆ null │ └──────┴──────┘ >>> >>> tf.cast({"b":td.Float32}).collect() >>> ┌──────┬───────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ f32 │ ╞══════╪═══════╡ │ A ┆ 1.0 │ │ X ┆ 10.0 │ │ C ┆ 3.0 │ │ D ┆ 5.0 │ │ M ┆ 9.0 │ │ A ┆ 100.0 │ │ M ┆ 50.0 │ │ null ┆ 20.0 │ │ F ┆ null │ └──────┴───────┘
- clear(
- n: int = 0,
- Categories:
tableframe
Clears all rows of the TableFrame preserving the schema.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ │ null ┆ 20 │ │ F ┆ null │ └──────┴──────┘ >>> >>> tf.clear() >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ └──────┴──────┘
- drop(
- *columns: td_typing.ColumnNameOrSelector | Iterable[td_typing.ColumnNameOrSelector],
- strict: bool = True,
- Categories:
projection
Discard columns from the TableFrame.
- Parameters:
columns – Columns to drop.
strict – If True, raises an error if a column does not exist.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ x ┆ y │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.0 ┆ 2.0 │ │ 2.0 ┆ 2.0 │ │ 10.0 ┆ 10.0 │ │ 4.0 ┆ 10.0 │ │ 5.0 ┆ null │ │ null ┆ null │ └──────┴──────┘ >>> >>> tf.drop("y") >>> ┌──────┐ │ x │ │ --- │ │ f64 │ ╞══════╡ │ 1.0 │ │ 2.0 │ │ 10.0 │ │ 4.0 │ │ 5.0 │ │ null │ └──────┘
- drop_nans(
- subset: td_typing.ColumnNameOrSelector | Collection[td_typing.ColumnNameOrSelector] | None = None,
- Categories:
manipulation
Drop rows with NaN values.
- Parameters:
subset – Columns to check for NaN values. If None, all columns are considered.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┬──────┐ │ ss ┆ u ┆ ff │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ f64 │ ╞══════╪══════╪══════╡ │ A ┆ 1 ┆ 1.1 │ │ B ┆ 0 ┆ 0.0 │ │ A ┆ 2 ┆ 2.2 │ │ B ┆ 3 ┆ 3.3 │ │ B ┆ 4 ┆ 4.4 │ │ C ┆ 5 ┆ -1.1 │ │ C ┆ 6 ┆ -2.2 │ │ C ┆ 7 ┆ -3.3 │ │ D ┆ 8 ┆ inf │ │ F ┆ 9 ┆ NaN │ │ null ┆ null ┆ null │ └──────┴──────┴──────┘ >>> >>> tf.drop_nans("ff") ┌─────┬─────┬──────┐ │ ss ┆ u ┆ ff │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ f64 │ ╞═════╪═════╪══════╡ │ A ┆ 1 ┆ 1.1 │ │ B ┆ 0 ┆ 0.0 │ │ A ┆ 2 ┆ 2.2 │ │ B ┆ 3 ┆ 3.3 │ │ B ┆ 4 ┆ 4.4 │ │ C ┆ 5 ┆ -1.1 │ │ C ┆ 6 ┆ -2.2 │ │ C ┆ 7 ┆ -3.3 │ │ D ┆ 8 ┆ inf │ └─────┴─────┴──────┘
- drop_nulls(
- subset: td_typing.ColumnNameOrSelector | Collection[td_typing.ColumnNameOrSelector] | None = None,
- Categories:
manipulation
Drop rows with null values.
- Parameters:
subset – Columns to evaluate for null values. If None, all columns are considered.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┬──────┐ │ ss ┆ u ┆ ff │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ f64 │ ╞══════╪══════╪══════╡ │ A ┆ 1 ┆ 1.1 │ │ B ┆ 0 ┆ 0.0 │ │ A ┆ 2 ┆ 2.2 │ │ B ┆ 3 ┆ 3.3 │ │ B ┆ 4 ┆ 4.4 │ │ C ┆ 5 ┆ -1.1 │ │ C ┆ 6 ┆ -2.2 │ │ C ┆ 7 ┆ -3.3 │ │ D ┆ 8 ┆ inf │ │ F ┆ 9 ┆ NaN │ │ G ┆ null ┆ 2.3 │ └──────┴──────┴──────┘ >>> >>> tf.drop_nulls("u") >>> ┌─────┬─────┬──────┐ │ ss ┆ u ┆ ff │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ f64 │ ╞═════╪═════╪══════╡ │ A ┆ 1 ┆ 1.1 │ │ B ┆ 0 ┆ 0.0 │ │ A ┆ 2 ┆ 2.2 │ │ B ┆ 3 ┆ 3.3 │ │ B ┆ 4 ┆ 4.4 │ │ C ┆ 5 ┆ -1.1 │ │ C ┆ 6 ┆ -2.2 │ │ C ┆ 7 ┆ -3.3 │ │ D ┆ 8 ┆ inf │ │ F ┆ 9 ┆ NaN │ └─────┴─────┴──────┘
- classmethod empty(
- schema: SimpleSchema = None,
- Categories:
tableframe
Creates an empty (no columns, no rows) TableFrame.
- explain(
- **kwargs,
Provide private accessibility level wrapper.
- extract_as_columns( ) dict[str, list[Any]]
- Categories:
filters
Extract a slice of rows from the table as a column-oriented dictionary.
The result is a mapping of column names to lists of values from the selected rows.
- Parameters:
offset – The row index at which the slice starts.
length – The number of rows in the slice.
- Returns:
A dictionary where each key is a column name, and its value is a list of values from the selected slice.
- Return type:
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.extract_as_columns(offset=0, length=2) { "a": ["A", "X"], "b": [1, 10] }
- extract_as_rows( ) list[dict[str, Any]]
- Categories:
filters
Extract a slice of rows from the TableFrame as a list of dictionaries.
Each dictionary represents one row, where keys are column names and values are the corresponding cell values.
- Parameters:
offset – The row index at which the slice starts.
length – The number of rows in the slice.
- Returns:
A list of row dictionaries.
- Return type:
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.extract_as_rows(offset=0, length=2) [ {"a": "A", "b": 1}, {"a": "X", "b": 10}, ]
- fill_nan( ) TableFrame
- Categories:
manipulation
Replace all NaN values in the TableFrame with the given value.
- Parameters:
value – The value to replace NaN with.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ x ┆ y │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.0 ┆ 2.0 │ │ 2.0 ┆ 2.0 │ │ NaN ┆ NaN │ │ 4.0 ┆ NaN │ │ 5.0 ┆ null │ │ null ┆ null │ └──────┴──────┘ >>> >>> tf.fill_nan(10) >>> ┌──────┬──────┐ │ x ┆ y │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.0 ┆ 2.0 │ │ 2.0 ┆ 2.0 │ │ 10.0 ┆ 10.0 │ │ 4.0 ┆ 10.0 │ │ 5.0 ┆ null │ │ null ┆ null │ └──────┴──────┘
- fill_null( ) TableFrame
- Categories:
manipulation
Replace all null values in the TableFrame with the given value.
- Parameters:
value – The value to replace null with.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ x ┆ y │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.0 ┆ 2.0 │ │ 2.0 ┆ 2.0 │ │ NaN ┆ NaN │ │ 4.0 ┆ NaN │ │ 5.0 ┆ null │ │ null ┆ null │ └──────┴──────┘ >>> >>> tf.fill_null(20) >>> ┌──────┬──────┐ │ x ┆ y │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.0 ┆ 2.0 │ │ 2.0 ┆ 2.0 │ │ NaN ┆ NaN │ │ 4.0 ┆ NaN │ │ 5.0 ┆ 20.0 │ │ 20.0 ┆ 20.0 │ └──────┴──────┘
- filter(
- *predicates: td_typing.IntoExprColumn | Iterable[td_typing.IntoExprColumn] | bool | list[bool] | np.ndarray[Any, Any],
- Categories:
filters
Filter the TableFrame based on the given predicates.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ └─────┴─────┘ >>> >>> tf.filter(td.col("a").is_in(["A", "C"]).or_(td.col("b").eq(10))) >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ A ┆ 100 │ └─────┴─────┘
- first() TableFrame
- Categories:
filters
Return a TableFrame with the first row.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.first() >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ └─────┴─────┘
- first_row(
- named: bool = False,
- Categories:
filters
Return a tuple or dictionary with the first row, or None if no row.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ A ┆ B │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ a ┆ 1 │ │ b ┆ 2 │ │ c ┆ 3 │ └─────┴─────┘ >>> >>> tf.first_row() >>> ('a', 1) >>> >>> tf.first_row(named=True) >>> {'A': 'a', 'B': 1}
- classmethod from_dict( ) TableFrame
- Categories:
tableframe
Creates a TableFrame from a dictionary, or None. None produces an empty (no columns, no rows) TableFrame.
- Parameters:
data – Input data.
- classmethod from_pandas(
- data: pd.DataFrame | None = None,
- Categories:
tableframe
Creates a TableFrame from a pandas DataFrame, or None. None produces an empty (no columns, no rows) TableFrame.
- Parameters:
data – Input data.
- classmethod from_polars(
- data: LazyFrame | DataFrame | None = None,
- Categories:
tableframe
Creates a TableFrame from a polars DataFrame or LazyFrame, or None. None produces an empty (no columns, no rows) TableFrame.
- Parameters:
data – Input data.
- grok( ) TableFrame
- Categories:
string
Parse log text into structured fields using a Grok pattern.
Applies a Grok pattern to the given column or expression and directly appends one new column per named capture in the pattern to the output TableFrame. Rows that do not match the pattern will contain null values for the extracted columns.
- Parameters:
expr (IntoExpr) – Column name or expression that resolves to a single string column containing log lines.
pattern (str) – Grok pattern with named captures (e.g., %{WORD:user}).
schema (dict[str, td_col.Column]) – A mapping where each capture name is associated with its corresponding column definition, specifying both the column name and its data type.
- Returns:
A new TableFrame with one column per Grok capture added.
- Return type:
Example
>>> import tabsdata as td >>> tf = td.TableFrame({"logs": [ ... "alice-login-2023", ... "bob-logout-2024", ... ]}) >>> >>> # Capture 3 fields: user, action, year >>> log_pattern = r"%{WORD:user}-%{WORD:action}-%{INT:year}" >>> log_schema = { ... "user": td_col.Column("user", td.String), ... "action": td_col.Column("action", td.String), ... "year": td_col.Column("year", td.Int64), ... } >>> out = tf.grok("logs", log_pattern, log_schema) >>> out.collect() ┌──────────────────┬───────┬────────┬──────┐ │ logs ┆ user ┆ action ┆ year │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str ┆ i64 │ ╞══════════════════╪═══════╪════════╪══════╡ │ alice-login-2023 ┆ alice ┆ login ┆ 2023 │ │ bob-logout-2024 ┆ bob ┆ logout ┆ 2024 │ └──────────────────┴───────┴────────┴──────┘
Notes
The function automatically expands the Grok captures into separate columns.
Non-matching rows will show null for the extracted columns.
If a pattern defines duplicate capture names, numeric suffixes like field, field[1] will be used to disambiguate them.
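For intuition, the capture-to-column expansion can be approximated in plain Python: Grok patterns such as %{WORD:user}-%{WORD:action}-%{INT:year} compile down to regexes with named captures. The helper below is a hypothetical sketch of that mechanism, not the library's implementation.

```python
import re

# Stand-in for the compiled form of %{WORD:user}-%{WORD:action}-%{INT:year}:
# each named group becomes one output column; non-matching rows yield None.
pattern = re.compile(r"(?P<user>\w+)-(?P<action>\w+)-(?P<year>\d+)")

def grok_like(lines):
    names = list(pattern.groupindex)            # capture names become column names
    columns = {name: [] for name in names}
    for line in lines:
        m = pattern.match(line)
        for name in names:
            columns[name].append(m.group(name) if m else None)
    return columns

out = grok_like(["alice-login-2023", "bob-logout-2024", "garbage"])
assert out["user"] == ["alice", "bob", None]
assert out["year"] == ["2023", "2024", None]
```

Note that this sketch leaves all captures as strings; the schema argument of grok is what assigns each capture column its final data type.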
- group_by(
- *by: td_typing.IntoExpr | Iterable[td_typing.IntoExpr],
- Categories:
aggregation
Perform a group by on the TableFrame.
- Parameters:
by – Columns or expressions to group by.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ └─────┴─────┘ >>> >>> tf.group_by(td.col("a")).agg(td.col("b").sum()) >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ M ┆ 59 │ │ A ┆ 101 │ │ C ┆ 3 │ │ D ┆ 5 │ │ X ┆ 10 │ └─────┴─────┘
- has_cols( ) Tuple[bool, set[str], set[str]]
- Categories:
tableframe
Verifies the presence of (non-system) columns in the TableFrame.
If exact is True, the check ensures that the TableFrame contains exactly the specified columns (excluding system columns), with no extras or omissions.
- Parameters:
cols – The column name(s) to verify. Can be a string or a list of strings.
exact – If True, checks that the TableFrame contains exactly the specified columns.
- Returns:
A boolean indicating whether the check was successful.
A set of columns in cols missing in the TableFrame.
A set of columns in the TableFrame missing in cols.
- Return type:
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ └──────┴──────┘ >>> >>> tf.has_cols("a") >>> (True, set(), {"b"}) >>> >>> tf.has_cols(["a", "c", "d"]) >>> (False, {"c", "d"}, {"b"}) >>> >>> tf.has_cols("a", exact=True) >>> (False, set(), {"b"}) >>> >>> tf.has_cols(["a", "b"], exact=True) >>> (True, set(), set())
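The three-part return value follows directly from set algebra. The helper below is a hypothetical sketch of the check's logic over plain Python sets, not the library's implementation:

```python
# Set-algebra sketch of the (ok, missing, extra) contract of has_cols.
def has_cols_like(frame_cols, cols, exact=False):
    wanted = {cols} if isinstance(cols, str) else set(cols)
    present = set(frame_cols)
    missing = wanted - present      # requested but absent from the frame
    extra = present - wanted        # present in the frame but not requested
    ok = not missing and (not extra if exact else True)
    return ok, missing, extra

# Frame with columns {"a", "b"}:
assert has_cols_like(["a", "b"], "a") == (True, set(), {"b"})
assert has_cols_like(["a", "b"], ["a", "c", "d"]) == (False, {"c", "d"}, {"b"})
assert has_cols_like(["a", "b"], "a", exact=True) == (False, set(), {"b"})
```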
- has_same_schema(
- tf: TableFrame,
- Categories:
tableframe
Verifies whether the schema of the current TableFrame is the same as that of the provided TableFrame.
- Parameters:
tf – The TableFrame to compare with.
- Returns:
Whether the condition is met or not.
- Return type:
Example:
>>> import tabsdata as td >>> >>> tf1: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ └──────┴──────┘ >>> >>> tf2: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ c │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ └──────┴──────┘ >>> tf1.has_same_schema(tf2) >>> False >>> >>> tf1: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ └──────┴──────┘ >>> >>> tf2: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ str │ ╞══════╪══════╡ │ A ┆ 1 │ └──────┴──────┘ >>> tf1.has_same_schema(tf2) >>> False
- head(
- n: int = 5,
- Categories:
filters
Return a TableFrame with the first n rows.
- Parameters:
n – The number of rows to return.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.head(2) >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ └─────┴─────┘
- inspect(
- **kwargs,
Provide private accessibility level wrapper.
- is_empty() bool
- Categories:
tableframe
Checks if a TableFrame has no rows.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ └──────┴──────┘ >>> >>> tf.is_empty() >>> False >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ └──────┴──────┘ >>> >>> tf.is_empty() >>> True
- item() Any
- Categories:
projection
Returns a scalar value if the TableFrame contains exactly one user column and one row.
Raises an exception if there is more than one user column or more than one row.
Returns None if the TableFrame is empty.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┐ │ a │ │ --- │ │ str │ ╞═════╡ │ A │ └─────┘ >>> >>> tf.item() >>> 'A'
- join(
- other: TableFrame,
- on: str | Expr | Sequence[str | Expr] | None = None,
- how: Literal['inner', 'left', 'right', 'full', 'semi', 'anti', 'cross', 'outer'] = 'inner',
- *,
- left_on: str | Expr | Sequence[str | Expr] | None = None,
- right_on: str | Expr | Sequence[str | Expr] | None = None,
- suffix: str = '_right',
- join_nulls: bool = False,
- coalesce: bool | None = None,
- Categories:
join
Join the TableFrame with another TableFrame.
- Parameters:
other – The TableFrame to join.
on – Name(s) of the columns to join on. The column name(s) must be present in both TableFrames. Do not use this parameter together with left_on and right_on, or when how="cross".
how –
Join strategy:
inner: An inner join.
left: A left join.
right: A right join.
full: A full join.
cross: The cartesian product.
semi: Rows from the left TableFrame that have a match in the right TableFrame, returning only the left columns.
anti: Rows from the left TableFrame that have no match in the right TableFrame.
left_on – Name(s) of the columns to join on from the left TableFrame. It must be used together with the right_on parameter. It cannot be used with the on parameter.
right_on – Name(s) of the columns to join on from the right TableFrame. It must be used together with the left_on parameter. It cannot be used with the on parameter.
suffix – Suffix appended to duplicate column names coming from the right TableFrame.
join_nulls – Whether null values should match and produce join rows.
coalesce – Collapse join columns into a single column.
Example:
>>> import tabsdata as td >>> >>> tf1: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ │ null ┆ 20 │ │ F ┆ null │ └──────┴──────┘ >>> >>> tf2: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 3 │ │ Y ┆ 4 │ │ Z ┆ 5 │ │ A ┆ 0 │ │ M ┆ 6 │ │ null ┆ 8 │ │ F ┆ null │ └──────┴──────┘ >>> An inner join: >>> >>> tf1.join(tf2, on="a", how="inner") >>> ┌─────┬──────┬─────────┐ │ a ┆ b ┆ b_right │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪══════╪═════════╡ │ A ┆ 1 ┆ 3 │ │ A ┆ 1 ┆ 0 │ │ M ┆ 9 ┆ 6 │ │ A ┆ 100 ┆ 3 │ │ A ┆ 100 ┆ 0 │ │ M ┆ 50 ┆ 6 │ │ F ┆ null ┆ null │ └─────┴──────┴─────────┘ >>> A left join: >>> >>> tf1.join(tf2, on="a", how="left") >>> ┌──────┬──────┬─────────┐ │ a ┆ b ┆ b_right │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞══════╪══════╪═════════╡ │ A ┆ 1 ┆ 3 │ │ A ┆ 1 ┆ 0 │ │ X ┆ 10 ┆ null │ │ C ┆ 3 ┆ null │ │ D ┆ 5 ┆ null │ │ … ┆ … ┆ … │ │ A ┆ 100 ┆ 3 │ │ A ┆ 100 ┆ 0 │ │ M ┆ 50 ┆ 6 │ │ null ┆ 20 ┆ null │ │ F ┆ null ┆ null │ └──────┴──────┴─────────┘ >>> Turning off column coalesce: >>> >>> tf1.join(tf2, on="a", coalesce=False) >>> ┌─────┬──────┬─────────┬─────────┐ │ a ┆ b ┆ a_right ┆ b_right │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ str ┆ i64 │ ╞═════╪══════╪═════════╪═════════╡ │ A ┆ 1 ┆ A ┆ 3 │ │ A ┆ 1 ┆ A ┆ 0 │ │ M ┆ 9 ┆ M ┆ 6 │ │ A ┆ 100 ┆ A ┆ 3 │ │ A ┆ 100 ┆ A ┆ 0 │ │ M ┆ 50 ┆ M ┆ 6 │ │ F ┆ null ┆ F ┆ null │ └─────┴──────┴─────────┴─────────┘
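The join strategies can be illustrated over plain lists of dicts. The helper below is a minimal, hypothetical sketch of the semantics (nested-loop, with null keys never matching, mirroring join_nulls=False); the real operation runs on columnar data:

```python
# Illustrative join over lists of dicts; join_rows is not a library API.
def join_rows(left, right, key, how="inner", suffix="_right"):
    out = []
    for lrow in left:
        matches = [r for r in right
                   if lrow[key] is not None and r[key] == lrow[key]]
        if how == "semi":
            if matches:
                out.append(dict(lrow))
        elif how == "anti":
            if not matches:
                out.append(dict(lrow))
        else:  # "inner" or "left"
            for rrow in matches:
                merged = dict(lrow)
                for k, v in rrow.items():
                    if k != key:
                        # duplicate right-hand columns receive the suffix
                        merged[k + suffix if k in lrow else k] = v
                out.append(merged)
            if not matches and how == "left":
                # a left join pads non-matching rows with None
                out.append({**lrow,
                            **{k + suffix if k in lrow else k: None
                               for k in right[0] if k != key}})
    return out

left = [{"a": "A", "b": 1}, {"a": "X", "b": 10}]
right = [{"a": "A", "b": 3}]
assert join_rows(left, right, "a") == [{"a": "A", "b": 1, "b_right": 3}]
assert join_rows(left, right, "a", how="anti") == [{"a": "X", "b": 10}]
```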
- last() TableFrame
- Categories:
filters
Return a TableFrame with the last row.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.last() >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ M ┆ 9 │ └─────┴─────┘
- last_row(
- named: bool = False,
- Categories:
filters
Return a tuple or dictionary with the last row, or None if no row.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ A ┆ B │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ a ┆ 1 │ │ b ┆ 2 │ │ c ┆ 3 │ └─────┴─────┘ >>> >>> tf.last_row() >>> ('c', 3) >>> >>> tf.last_row(named=True) >>> {'A': 'c', 'B': 3}
- limit(
- n: int = 5,
- Categories:
filters
Return a TableFrame with the first n rows. This is equivalent to head.
- Parameters:
n – The number of rows to return.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.limit(2) >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ └─────┴─────┘
- rename( ) TableFrame
- Categories:
projection
Rename columns from the TableFrame.
- Parameters:
mapping – A dictionary mapping column names to their new names. The operation will fail if any specified column name does not exist.
Examples:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ i ┆ f │ │ --- ┆ --- │ │ i32 ┆ f64 │ ╞══════╪══════╡ │ 1 ┆ 3.1 │ │ 2 ┆ 4.1 │ │ 3 ┆ 5.9 │ │ 4 ┆ 2.6 │ │ 5 ┆ 53.5 │ │ 6 ┆ 8.97 │ └──────┴──────┘ >>> >>> tf.rename({"i": "index", "f": "amount"}) >>> ┌───────┬────────┐ │ index ┆ amount │ │ --- ┆ --- │ │ i32 ┆ f64 │ ╞═══════╪════════╡ │ 1 ┆ 3.1 │ │ 2 ┆ 4.1 │ │ 3 ┆ 5.9 │ │ 4 ┆ 2.6 │ │ 5 ┆ 53.5 │ │ 6 ┆ 8.97 │ └───────┴────────┘
- property schema: Schema
- select(
- *exprs: td_typing.IntoExpr | Iterable[td_typing.IntoExpr],
- **named_exprs: td_typing.IntoExpr,
- Categories:
projection
Select column(s) from the TableFrame.
- Parameters:
exprs – Columns or expressions to select.
named_exprs – Named expressions to select.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ └─────┴─────┘ >>> >>> tf.select(td.col("a"), td.col("b").mul(2).alias("bx2"),) >>> ┌─────┬─────┐ │ a ┆ bx2 │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 2 │ │ X ┆ 20 │ │ C ┆ 6 │ │ D ┆ 10 │ │ M ┆ 18 │ │ A ┆ 200 │ │ M ┆ 100 │ └─────┴─────┘
- show_graph(
- **kwargs,
Provide private accessibility level wrapper.
- slice( ) TableFrame
- Categories:
filters
Return a TableFrame with a slice of the original TableFrame.
- Parameters:
offset – Slice starting index.
length – The length of the slice. None means all the way to the end.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.slice(2,2) >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ C ┆ 3 │ │ D ┆ 5 │ └─────┴─────┘
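The offset/length pair maps directly onto Python's start/stop slicing. The helper below is a one-line hypothetical equivalence, shown on a plain list:

```python
# offset/length as Python slicing: rows[offset : offset + length],
# with length=None meaning "to the end".
def slice_like(rows, offset, length=None):
    return rows[offset:] if length is None else rows[offset:offset + length]

rows = ["A", "X", "C", "D", "M"]
assert slice_like(rows, 2, 2) == ["C", "D"]   # mirrors tf.slice(2, 2)
assert slice_like(rows, 3) == ["D", "M"]      # open-ended slice
```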
- sort(
- by: td_typing.IntoExpr | Iterable[td_typing.IntoExpr],
- *more_by: td_typing.IntoExpr,
- descending: bool | Sequence[bool] = False,
- nulls_last: bool | Sequence[bool] = False,
- maintain_order: bool = False,
- Categories:
tableframe
Sort the TableFrame by the given column(s) or expression(s).
- Parameters:
by – Column(s) or expression(s) to sort by.
more_by – Additional columns to sort by.
descending – Specifies if the sorting should be descending or not.
nulls_last – Specifies if null values should be placed last.
maintain_order – Preserve the order of equal rows.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ │ null ┆ 20 │ │ F ┆ null │ └──────┴──────┘ >>> >>> tf.sort(td.col("a"), descending = True) >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ null ┆ 20 │ │ X ┆ 10 │ │ M ┆ 9 │ │ M ┆ 50 │ │ F ┆ null │ │ D ┆ 5 │ │ C ┆ 3 │ │ A ┆ 1 │ │ A ┆ 100 │ └──────┴──────┘
- tail(
- n: int = 5,
- Categories:
filters
Return a TableFrame with the last n rows.
- Parameters:
n – The number of rows to return.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.tail(2) >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘
- to_pandas() pd.DataFrame
- Categories:
tableframe
Creates a pandas dataframe from this tableframe.
- to_polars_df() DataFrame
- Categories:
tableframe
Creates a polars dataframe from this tableframe.
- to_polars_lf() LazyFrame
- Categories:
tableframe
Creates a polars lazyframe from this tableframe.
- udf(
- on: td_typing.IntoExpr | list[td_typing.IntoExpr],
- function: td_udf.UDF,
- Categories:
projection
Apply a user-defined function (UDF) to the columns resolved by expr.
The selected columns are supplied to function, which can implement either on_batch or on_element. An on_batch implementation receives a list of Polars series representing the selected columns and must return a list of Polars series with matching length. An on_element implementation receives a list of Python scalars for each row and returns a list of scalars; the framework wraps this in an efficient batch executor, so data still flows in batches even when authoring row-wise logic. In both cases the returned series become new columns appended to the original TableFrame.
By default, both methods receive their inputs as a list. Override the signature property in your UDF class to return “unpacked” to have each column passed as a separate argument instead.
- Creating UDFs:
Subclass tabsdata.tableframe.udf.function.UDF.
Implement __init__ to call super().__init__(output_columns), where output_columns is a tuple or a list of tuples (name, data type) specifying the UDF default output schema (column names and data types). Each tuple must contain a column name (string) and a data type (DataType).
Override exactly one of on_batch or on_element to implement the UDF function logic.
Return a list of TabsData Series (for on_batch) or TabsData supported scalars (for on_element) with the same length as specified in the output schema.
If overriding the on_batch method, the return type must be a list of TabsData Series. If overriding the on_element method, the return type must be a list of supported TabsData scalar values. For both cases, the number of elements in the returned lists must match the number of elements in the output_columns list provided to the UDF constructor.
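The batching of on_element described above can be sketched in plain Python. This is an illustration of the contract only; run_element_udf and the ratio lambda are hypothetical stand-ins, not framework APIs:

```python
# Drive a row-wise on_element over columnar input, producing columnar output.
# The row-wise function sees one scalar per input column per call, while data
# still flows through whole columns.
def run_element_udf(on_element, input_columns):
    n_rows = len(input_columns[0])
    outputs = None
    for i in range(n_rows):
        row_values = [col[i] for col in input_columns]  # scalars for this row
        result = on_element(row_values)                 # list of output scalars
        if outputs is None:
            outputs = [[] for _ in result]
        for out_col, value in zip(outputs, result):
            out_col.append(value)
    return outputs or []

# A row-wise ratio, analogous to an on_element implementation:
ratio = lambda values: [values[0] / values[1]]
assert run_element_udf(ratio, [[10, 20, 30], [2, 5, 10]]) == [[5.0, 4.0, 3.0]]
```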
Using UDFs:
Instantiate a function created as above.
Pass it to TableFrame method udf().
Optionally use
UDF.output_columns() to override output column names or data types after instantiation.
- Parameters:
on – Expression selecting the input column(s) of the UDF.
function – Instance of
tabsdata.tableframe.udf.function.UDF defining on_batch or on_element to produce the output series.
Examples
>>> import tabsdata as td >>> import tabsdata.tableframe as tdf >>> >>> class SumUDF(tdf.UDF): ... def __init__(self): ... super().__init__(("total", tdf.Int64)) ... ... def on_batch(self, series): ... return [series[0] + series[1]] >>> >>> tf = td.TableFrame({"a": [1, 2, 3], "b": [10, 20, 30]}) >>> print(tf) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 10 │ │ 2 ┆ 20 │ │ 3 ┆ 30 │ └─────┴─────┘ >>> tf.udf(td.col("a", "b"), SumUDF()) >>> print(tf) ┌─────┬─────┬───────┐ │ a ┆ b ┆ total │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═══════╡ │ 1 ┆ 10 ┆ 11 │ │ 2 ┆ 20 ┆ 22 │ │ 3 ┆ 30 ┆ 33 │ └─────┴─────┴───────┘
>>> class RatioUDF(tdf.UDF): ... def __init__(self): ... super().__init__(("ratio", tdf.Float64)) ... ... def on_element(self, values): ... return [values[0] / values[1]] >>> >>> tf = td.TableFrame({"numerator": [10, 20, 30], >>> "denominator": [2, 5, 10],}) >>> print(tf) ┌───────────┬──────────────┐ │ numerator ┆ denominator │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═══════════╪══════════════╡ │ 10 ┆ 2 │ │ 20 ┆ 5 │ │ 30 ┆ 10 │ └───────────┴──────────────┘ >>> tf.udf(td.col("numerator", "denominator"), RatioUDF()).collect() >>> print(tf) ┌───────────┬──────────────┬──────┐ │ numerator ┆ denominator ┆ ratio│ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ f64 │ ╞═══════════╪══════════════╪══════╡ │ 10 ┆ 2 ┆ 5.0 │ │ 20 ┆ 5 ┆ 4.0 │ │ 30 ┆ 10 ┆ 3.0 │ └───────────┴──────────────┴──────┘
Using signature property to receive individual arguments:
>>> class RatioUnpackedUDF(tdf.UDF): ... def __init__(self): ... super().__init__(("ratio", tdf.Float64)) ... ... @property ... def signature(self): ... return "unpacked" ... ... def on_element(self, numerator, denominator): ... return [numerator / denominator] >>> >>> tf = td.TableFrame({"numerator": [10, 20, 30], >>> "denominator": [2, 5, 10],}) >>> tf.udf(td.col("numerator", "denominator"), RatioUnpackedUDF()).collect() >>> print(tf) ┌───────────┬──────────────┬──────┐ │ numerator ┆ denominator ┆ ratio│ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ f64 │ ╞═══════════╪══════════════╪══════╡ │ 10 ┆ 2 ┆ 5.0 │ │ 20 ┆ 5 ┆ 4.0 │ │ 30 ┆ 10 ┆ 3.0 │ └───────────┴──────────────┴──────┘
- unique(
- subset: td_typing.ColumnNameOrSelector | Collection[td_typing.ColumnNameOrSelector] | None = None,
- *,
- keep: td_typing.UniqueKeepStrategy = 'any',
- maintain_order: bool = False,
- Categories:
filters
Deduplicate rows from the TableFrame.
- Parameters:
subset – Columns to evaluate for duplicate values. If None, all columns are considered.
keep – Strategy to keep duplicates: first, last, any, none (eliminate all duplicated rows).
maintain_order – Preserve the order of the rows.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ │ null ┆ 20 │ │ F ┆ null │ └──────┴──────┘ >>> >>> tf.unique("a", keep="last") >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ D ┆ 5 │ │ C ┆ 3 │ │ X ┆ 10 │ │ A ┆ 100 │ │ M ┆ 50 │ │ F ┆ null │ │ null ┆ 20 │ └──────┴──────┘
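The keep strategies can be sketched over a list of dicts. The helper below is hypothetical, and treats "any" like "first" (the real "any" makes no guarantee about which duplicate survives):

```python
from collections import Counter

# Illustrative dedup with first/last/none keep strategies.
def unique_like(rows, key, keep="any"):
    counts = Counter(r[key] for r in rows)
    if keep == "none":
        return [r for r in rows if counts[r[key]] == 1]  # drop every duplicated key
    source = rows if keep in ("first", "any") else list(reversed(rows))
    seen, out = set(), []
    for r in source:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out if keep in ("first", "any") else list(reversed(out))

rows = [{"a": "A", "b": 1}, {"a": "M", "b": 9}, {"a": "A", "b": 100}]
assert unique_like(rows, "a", keep="first") == [{"a": "A", "b": 1}, {"a": "M", "b": 9}]
assert unique_like(rows, "a", keep="last") == [{"a": "M", "b": 9}, {"a": "A", "b": 100}]
assert unique_like(rows, "a", keep="none") == [{"a": "M", "b": 9}]
```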
- unnest(
- columns: td_typing.ColumnNameOrSelector | Collection[td_typing.ColumnNameOrSelector],
- *more_columns: td_typing.ColumnNameOrSelector,
- Categories:
projection
Expand one or more struct columns so that each field within the struct becomes a separate column in the TableFrame.
The resulting TableFrame will place these new columns in the same position as the original struct column, effectively replacing it. This makes it easier to work directly with the inner fields as standard columns.
- Parameters:
columns – Name of the struct column(s) to expand.
more_columns – Additional struct columns to expand, provided as positional arguments.
Example:
>>> import tabsdata as td >>> >>> tf = td.TableFrame({ ... "id": [1, 2], ... "info": [ ... {"name": "Alice", "age": 30}, ... {"name": "Bob", "age": None}, ... ], ... "status": ["active", "inactive"], ... }) >>> >>> tf >>> ┌─────┬───────────────┬───────────┐ │ id ┆ info ┆ status │ │ --- ┆ --- ┆ --- │ │ i64 ┆ struct[2] ┆ str │ ╞═════╪═══════════════╪═══════════╡ │ 1 ┆ {"Alice",30} ┆ active │ │ 2 ┆ {"Bob",null} ┆ inactive │ └─────┴───────────────┴───────────┘ >>> >>> tf.unnest("info") >>> ┌─────┬───────┬──────┬───────────┐ │ id ┆ name ┆ age ┆ status │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ i64 ┆ str │ ╞═════╪═══════╪══════╪═══════════╡ │ 1 ┆ Alice ┆ 30 ┆ active │ │ 2 ┆ Bob ┆ null ┆ inactive │ └─────┴───────┴──────┴───────────┘
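The struct expansion can be pictured over dict-valued rows. The helper below is a hypothetical sketch of the behavior, not the library's implementation:

```python
# Expand a dict-valued ("struct") column into one column per field, keeping
# the new columns in the struct's original position.
def unnest_like(rows, column):
    out = []
    for row in rows:
        new_row = {}
        for k, v in row.items():
            if k == column:
                new_row.update(v)   # fields replace the struct column in place
            else:
                new_row[k] = v
        out.append(new_row)
    return out

rows = [
    {"id": 1, "info": {"name": "Alice", "age": 30}, "status": "active"},
    {"id": 2, "info": {"name": "Bob", "age": None}, "status": "inactive"},
]
assert unnest_like(rows, "info") == [
    {"id": 1, "name": "Alice", "age": 30, "status": "active"},
    {"id": 2, "name": "Bob", "age": None, "status": "inactive"},
]
```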
- property width: int
- with_columns(
- *exprs: td_typing.IntoExpr | Iterable[td_typing.IntoExpr],
- **named_exprs: td_typing.IntoExpr,
- Categories:
projection
Add columns to the TableFrame.
- Parameters:
exprs – Columns or expressions to add.
named_exprs – Named expressions to add.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ x ┆ y │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.0 ┆ 2.0 │ │ 2.0 ┆ 2.0 │ │ 10.0 ┆ 10.0 │ │ 4.0 ┆ 10.0 │ │ 5.0 ┆ null │ │ null ┆ null │ └──────┴──────┘ >>> >>> tf.with_columns(td.col("x").mul(td.col("y")).alias("z")) >>> ┌──────┬──────┬──────┐ │ x ┆ y ┆ z │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f64 │ ╞══════╪══════╪══════╡ │ 1.0 ┆ 2.0 ┆ 2.0 │ │ 2.0 ┆ 2.0 ┆ 4.0 │ │ NaN ┆ NaN ┆ NaN │ │ 4.0 ┆ NaN ┆ NaN │ │ 5.0 ┆ null ┆ null │ │ null ┆ null ┆ null │ └──────┴──────┴──────┘
- TableFrame.assert_has_cols( ) None
- Categories:
tableframe
Ensures that the (non-system) columns in the TableFrame match the expected columns.
Raises an exception if the expectation is not met.
If exact is True, the check verifies that the TableFrame contains exactly the expected columns, with no extra or missing ones.
- Parameters:
cols – The expected column name(s). Can be a string or a list of strings.
exact – If True, checks that the TableFrame contains exactly the specified columns.
- Raises:
ValueError – If expected columns are missing or unexpected columns are present in the TableFrame.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> tf.assert_has_cols("a") >>> tf.assert_has_cols(["a", "b"], exact=True)
- TableFrame.cast(
- dtypes: Mapping[td_typing.ColumnNameOrSelector | td_typing.DataType, td_typing.DataType] | td_typing.DataType,
- *,
- strict: bool = True,
- Categories:
manipulation
Cast columns to a new data type.
- Parameters:
dtypes – Mapping of the column name(s) to the new data type(s).
strict – If True, raises an error if the cast cannot be performed.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ │ null ┆ 20 │ │ F ┆ null │ └──────┴──────┘ >>> >>> tf.cast({"b":td.Float32}).collect() >>> ┌──────┬───────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ f32 │ ╞══════╪═══════╡ │ A ┆ 1.0 │ │ X ┆ 10.0 │ │ C ┆ 3.0 │ │ D ┆ 5.0 │ │ M ┆ 9.0 │ │ A ┆ 100.0 │ │ M ┆ 50.0 │ │ null ┆ 20.0 │ │ F ┆ null │ └──────┴───────┘
- TableFrame.clear(
- n: int = 0,
- Categories:
tableframe
Clears all rows of the TableFrame preserving the schema.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ │ null ┆ 20 │ │ F ┆ null │ └──────┴──────┘ >>> >>> tf.clear() >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ └──────┴──────┘
- TableFrame.drop(
- *columns: td_typing.ColumnNameOrSelector | Iterable[td_typing.ColumnNameOrSelector],
- strict: bool = True,
- Categories:
projection
Discard columns from the TableFrame.
- Parameters:
columns – Columns to drop.
strict – If True, raises an error if a column does not exist.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ x ┆ y │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.0 ┆ 2.0 │ │ 2.0 ┆ 2.0 │ │ 10.0 ┆ 10.0 │ │ 4.0 ┆ 10.0 │ │ 5.0 ┆ null │ │ null ┆ null │ └──────┴──────┘ >>> >>> tf.drop("y") >>> ┌──────┐ │ x │ │ --- │ │ f64 │ ╞══════╡ │ 1.0 │ │ 2.0 │ │ 10.0 │ │ 4.0 │ │ 5.0 │ │ null │ └──────┘
- TableFrame.drop_nans(
- subset: td_typing.ColumnNameOrSelector | Collection[td_typing.ColumnNameOrSelector] | None = None,
- Categories:
manipulation
Drop rows with NaN values.
- Parameters:
subset – Columns to check for NaN values. If None, all columns are considered.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┬──────┐ │ ss ┆ u ┆ ff │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ f64 │ ╞══════╪══════╪══════╡ │ A ┆ 1 ┆ 1.1 │ │ B ┆ 0 ┆ 0.0 │ │ A ┆ 2 ┆ 2.2 │ │ B ┆ 3 ┆ 3.3 │ │ B ┆ 4 ┆ 4.4 │ │ C ┆ 5 ┆ -1.1 │ │ C ┆ 6 ┆ -2.2 │ │ C ┆ 7 ┆ -3.3 │ │ D ┆ 8 ┆ inf │ │ F ┆ 9 ┆ NaN │ │ null ┆ null ┆ null │ └──────┴──────┴──────┘ >>> >>> tf.drop_nans() >>> ┌─────┬─────┬──────┐ │ ss ┆ u ┆ ff │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ f64 │ ╞═════╪═════╪══════╡ │ A ┆ 1 ┆ 1.1 │ │ B ┆ 0 ┆ 0.0 │ │ A ┆ 2 ┆ 2.2 │ │ B ┆ 3 ┆ 3.3 │ │ B ┆ 4 ┆ 4.4 │ │ C ┆ 5 ┆ -1.1 │ │ C ┆ 6 ┆ -2.2 │ │ C ┆ 7 ┆ -3.3 │ │ D ┆ 8 ┆ inf │ └─────┴─────┴──────┘
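drop_nans targets floating-point NaN specifically, which is a different value from null and from inf. The helper below is a hypothetical plain-Python sketch of that distinction, not the library's implementation:

```python
import math

# Drop rows containing a float NaN; None (null) and inf are not NaN
# and pass through this filter untouched.
def drop_nans_like(rows):
    def has_nan(row):
        return any(isinstance(v, float) and math.isnan(v) for v in row.values())
    return [row for row in rows if not has_nan(row)]

rows = [
    {"ss": "F", "ff": float("nan")},   # dropped: contains NaN
    {"ss": "G", "ff": None},           # kept: null is not NaN
    {"ss": "D", "ff": float("inf")},   # kept: inf is not NaN
]
assert [r["ss"] for r in drop_nans_like(rows)] == ["G", "D"]
```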
- TableFrame.drop_nulls(
- subset: td_typing.ColumnNameOrSelector | Collection[td_typing.ColumnNameOrSelector] | None = None,
- Categories:
manipulation
Drop rows with null values.
- Parameters:
subset – Columns to evaluate for null values. If None, all columns are considered.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┬──────┐ │ ss ┆ u ┆ ff │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ f64 │ ╞══════╪══════╪══════╡ │ A ┆ 1 ┆ 1.1 │ │ B ┆ 0 ┆ 0.0 │ │ A ┆ 2 ┆ 2.2 │ │ B ┆ 3 ┆ 3.3 │ │ B ┆ 4 ┆ 4.4 │ │ C ┆ 5 ┆ -1.1 │ │ C ┆ 6 ┆ -2.2 │ │ C ┆ 7 ┆ -3.3 │ │ D ┆ 8 ┆ inf │ │ F ┆ 9 ┆ NaN │ │ G ┆ null ┆ 2.3 │ └──────┴──────┴──────┘ >>> >>> tf.drop_nulls("u") >>> ┌─────┬─────┬──────┐ │ ss ┆ u ┆ ff │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ f64 │ ╞═════╪═════╪══════╡ │ A ┆ 1 ┆ 1.1 │ │ B ┆ 0 ┆ 0.0 │ │ A ┆ 2 ┆ 2.2 │ │ B ┆ 3 ┆ 3.3 │ │ B ┆ 4 ┆ 4.4 │ │ C ┆ 5 ┆ -1.1 │ │ C ┆ 6 ┆ -2.2 │ │ C ┆ 7 ┆ -3.3 │ │ D ┆ 8 ┆ inf │ │ F ┆ 9 ┆ NaN │ └─────┴─────┴──────┘
- classmethod TableFrame.empty(
- schema: SimpleSchema = None,
- Categories:
tableframe
Creates an empty (no column - no row) TableFrame.
- TableFrame.extract_as_columns( ) dict[str, list[Any]]
- Categories:
filters
Extract a slice of rows from the table as a column-oriented dictionary.
The result is a mapping of column names to lists of values from the selected rows.
- Parameters:
offset – Slice starting index.
length – The number of rows to extract.
- Returns:
A dictionary where each key is a column name, and its value is a list of values from the selected slice.
- Return type:
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.extract_as_columns(offset=0, length=2) { "a": ["A", "X"], "b": [1, 10] }
- TableFrame.extract_as_rows( ) list[dict[str, Any]]
- Categories:
filters
Extract a slice of rows from the TableFrame as a list of dictionaries.
Each dictionary represents one row, where keys are column names and values are the corresponding cell values.
- Parameters:
offset – Slice starting index.
length – The number of rows to extract.
- Returns:
A list of row dictionaries.
- Return type:
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.extract_as_rows(offset=0, length=2) [ {"a": "A", "b": 1}, {"a": "X", "b": 10}, ]
- TableFrame.fill_nan( ) TableFrame
- Categories:
manipulation
Replace all NaN values in the TableFrame with the given value.
- Parameters:
value – The value to replace NaN with.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ x ┆ y │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.0 ┆ 2.0 │ │ 2.0 ┆ 2.0 │ │ NaN ┆ NaN │ │ 4.0 ┆ NaN │ │ 5.0 ┆ null │ │ null ┆ null │ └──────┴──────┘ >>> >>> tf.fill_nan(10) >>> ┌──────┬──────┐ │ x ┆ y │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.0 ┆ 2.0 │ │ 2.0 ┆ 2.0 │ │ 10.0 ┆ 10.0 │ │ 4.0 ┆ 10.0 │ │ 5.0 ┆ null │ │ null ┆ null │ └──────┴──────┘
- TableFrame.fill_null( ) TableFrame
- Categories:
manipulation
Replace all null values in the TableFrame with the given value.
- Parameters:
value – The value to replace null with.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ x ┆ y │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.0 ┆ 2.0 │ │ 2.0 ┆ 2.0 │ │ NaN ┆ NaN │ │ 4.0 ┆ NaN │ │ 5.0 ┆ null │ │ null ┆ null │ └──────┴──────┘ >>> >>> tf.fill_null(20) >>> ┌──────┬──────┐ │ x ┆ y │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.0 ┆ 2.0 │ │ 2.0 ┆ 2.0 │ │ NaN ┆ NaN │ │ 4.0 ┆ NaN │ │ 5.0 ┆ 20.0 │ │ 20.0 ┆ 20.0 │ └──────┴──────┘
- TableFrame.filter(
- *predicates: td_typing.IntoExprColumn | Iterable[td_typing.IntoExprColumn] | bool | list[bool] | np.ndarray[Any, Any],
- Categories:
filters
Filter the TableFrame based on the given predicates.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ └─────┴─────┘ >>> >>> tf.filter(td.col("a").is_in(["A", "C"]).or_(td.col("b").eq(10))) >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ A ┆ 100 │ └─────┴─────┘
- TableFrame.first() TableFrame
- Categories:
filters
Return a TableFrame with the first row.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.first() >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ └─────┴─────┘
- TableFrame.first_row(
- named: bool = False,
- Categories:
filters
Return a tuple or dictionary with the first row, or None if no row.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ A ┆ B │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ a ┆ 1 │ │ b ┆ 2 │ │ c ┆ 3 │ └─────┴─────┘ >>> >>> tf.first_row() >>> ('a', 1) >>> >>> tf.first_row(named=True) >>> {'A': 'a', 'B': 1}
- classmethod TableFrame.from_dict( ) TableFrame
- Categories:
tableframe
Creates a TableFrame from a dictionary, or None. None produces an empty (no column, no row) TableFrame.
- Parameters:
data – Input data.
- classmethod TableFrame.from_pandas(
- data: pd.DataFrame | None = None,
- Categories:
tableframe
Creates a TableFrame from a pandas DataFrame, or None. None produces an empty (no column, no row) TableFrame.
- Parameters:
data – Input data.
- classmethod TableFrame.from_polars(
- data: LazyFrame | DataFrame | None = None,
- Categories:
tableframe
Creates a TableFrame from a polars DataFrame or LazyFrame, or None. None produces an empty (no column, no row) TableFrame.
- Parameters:
data – Input data.
- TableFrame.grok( ) TableFrame
- Categories:
string
Parse log text into structured fields using a Grok pattern.
Applies a Grok pattern to the given column or expression and directly appends one new column per named capture in the pattern to the output TableFrame. Rows that do not match the pattern will contain null values for the extracted columns.
- Parameters:
expr (IntoExpr) – Column name or expression that resolves to a single string column containing log lines.
pattern (str) – Grok pattern with named captures (e.g., %{WORD:user}).
schema (dict[str, td_col.Column]) – A mapping where each capture name is associated with its corresponding column definition, specifying both the column name and its data type.
- Returns:
A new TableFrame with one column per Grok capture added.
- Return type:
Example
>>> import tabsdata as td >>> tf = td.TableFrame({"logs": [ ... "alice-login-2023", ... "bob-logout-2024", ... ]}) >>> >>> # Capture 3 fields: user, action, year >>> log_pattern = r"%{WORD:user}-%{WORD:action}-%{INT:year}" >>> log_schema = { ... "user": td_col.Column("user", td.String), ... "action": td_col.Column("action", td.String), ... "year": td_col.Column("year", td.Int64), ... } >>> out = tf.grok("logs", log_pattern, log_schema) >>> out.collect() ┌──────────────────┬───────┬────────┬──────┐ │ logs ┆ user ┆ action ┆ year │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str ┆ i64 │ ╞══════════════════╪═══════╪════════╪══════╡ │ alice-login-2023 ┆ alice ┆ login ┆ 2023 │ │ bob-logout-2024 ┆ bob ┆ logout ┆ 2024 │ └──────────────────┴───────┴────────┴──────┘
Notes
The function automatically expands the Grok captures into separate columns.
Non-matching rows will show null for the extracted columns.
If a pattern defines duplicate capture names, numeric suffixes like field, field[1] will be used to disambiguate them.
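Under the hood, Grok patterns compile down to regular expressions with named captures. The capture-to-column expansion can be illustrated with plain Python `re` (an illustrative analogue of the behavior described above, not the tabsdata implementation; the regex is a simplified stand-in for the WORD and INT Grok patterns):

```python
import re

# Simplified regex equivalent of the Grok pattern %{WORD:user}-%{WORD:action}-%{INT:year}
pattern = re.compile(r"(?P<user>\w+)-(?P<action>\w+)-(?P<year>\d+)")

logs = ["alice-login-2023", "bob-logout-2024", "not a log line"]

# One output column per named capture; non-matching rows yield None (null)
columns = {name: [] for name in pattern.groupindex}
for line in logs:
    m = pattern.match(line)
    for name in pattern.groupindex:
        columns[name].append(m.group(name) if m else None)

print(columns["user"])  # ['alice', 'bob', None]
print(columns["year"])  # ['2023', '2024', None]
```

As in grok, each named capture becomes its own column, and rows that fail to match contribute nulls.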
- TableFrame.group_by(
- *by: td_typing.IntoExpr | Iterable[td_typing.IntoExpr],
- Categories:
aggregation
Perform a group by on the TableFrame.
- Parameters:
by – Columns or expressions to group by.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ └─────┴─────┘ >>> >>> tf.group_by(td.col("a")).agg(td.col("b").sum()) >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ M ┆ 59 │ │ A ┆ 101 │ │ C ┆ 3 │ │ D ┆ 5 │ │ X ┆ 10 │ └─────┴─────┘
- TableFrame.has_cols( ) Tuple[bool, set[str], set[str]]
- Categories:
tableframe
Verifies the presence of (non-system) columns in the TableFrame.
If exact is True, the check ensures that the TableFrame contains exactly the specified columns (excluding system columns), with no extras or omissions.
- Parameters:
cols – The column name(s) to verify. Can be a string or a list of strings.
exact – If True, checks that the TableFrame contains exactly the specified columns.
- Returns:
A boolean indicating whether the check was successful.
A set of columns in cols missing in the TableFrame.
A set of columns in the TableFrame missing in cols.
- Return type:
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ └──────┴──────┘ >>> >>> tf.has_cols("a") >>> (True, set(), {"b"}) >>> >>> tf.has_cols(["a", "c", "d"]) >>> (False, {"c", "d"}, {"b"}) >>> >>> tf.has_cols("a", exact=True) >>> (False, set(), {"b"}) >>> >>> tf.has_cols(["a", "b"], exact=True) >>> (True, set(), set())
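The three return values follow plain set arithmetic. A minimal pure-Python sketch of the semantics (a hypothetical helper for illustration, not the tabsdata implementation):

```python
def has_cols(frame_cols, cols, exact=False):
    """Return (ok, missing_in_frame, extra_in_frame) using set arithmetic."""
    wanted = {cols} if isinstance(cols, str) else set(cols)
    present = set(frame_cols)
    missing = wanted - present  # requested but absent from the frame
    extra = present - wanted    # in the frame but not requested
    ok = not missing and (not extra if exact else True)
    return ok, missing, extra

# A frame with columns "a" and "b"
print(has_cols(["a", "b"], "a"))                     # (True, set(), {'b'})
print(has_cols(["a", "b"], ["a", "c", "d"]))         # (False, {'c', 'd'}, {'b'})
print(has_cols(["a", "b"], "a", exact=True))         # (False, set(), {'b'})
print(has_cols(["a", "b"], ["a", "b"], exact=True))  # (True, set(), set())
```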
- TableFrame.has_same_schema(
- tf: TableFrame,
- Categories:
tableframe
Verifies if the schema of the current TableFrame is the same as that of the provided TableFrame.
- Parameters:
tf – The TableFrame to compare with.
- Returns:
Whether the condition is met or not.
- Return type:
Example:
>>> import tabsdata as td >>> >>> tf1: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ └──────┴──────┘ >>> >>> tf2: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ c │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ └──────┴──────┘ >>> tf1.has_same_schema(tf2) >>> False >>> >>> tf1: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ └──────┴──────┘ >>> >>> tf2: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ str │ ╞══════╪══════╡ │ A ┆ 1 │ └──────┴──────┘ >>> tf1.has_same_schema(tf2) >>> False
- TableFrame.head(
- n: int = 5,
- Categories:
filters
Return a TableFrame with the first n rows.
- Parameters:
n – The number of rows to return.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.head(2) >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ └─────┴─────┘
- TableFrame.is_empty() bool
- Categories:
tableframe
Checks if a TableFrame has no rows.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ └──────┴──────┘ >>> >>> tf.is_empty() >>> False >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ └──────┴──────┘ >>> >>> tf.is_empty() >>> True
- TableFrame.item() Any
- Categories:
projection
Returns a scalar value if the TableFrame contains exactly one user column and one row.
Raises an exception if there is more than one user column or more than one row.
Returns None if the TableFrame is empty.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┐ │ a │ │ --- │ │ str │ ╞═════╡ │ A │ └─────┘ >>> >>> tf.item() >>> A
- TableFrame.join(
- other: TableFrame,
- on: str | Expr | Sequence[str | Expr] | None = None,
- how: Literal['inner', 'left', 'right', 'full', 'semi', 'anti', 'cross', 'outer'] = 'inner',
- *,
- left_on: str | Expr | Sequence[str | Expr] | None = None,
- right_on: str | Expr | Sequence[str | Expr] | None = None,
- suffix: str = '_right',
- join_nulls: bool = False,
- coalesce: bool | None = None,
- Categories:
join
Join the TableFrame with another TableFrame.
- Parameters:
other – The TableFrame to join.
on – Name(s) of the columns to join on. The column name(s) must be in both TableFrames. Do not use this parameter together with the left_on and right_on parameters, or when how="cross".
how –
Join strategy:
inner: An inner join.
left: A left join.
right: A right join.
full: A full join.
cross: The cartesian product.
semi: An inner join but only returning the columns from left TableFrame.
anti: Rows from the left TableFrame that have no match in the right TableFrame.
left_on – Name(s) of the columns to join on from the left TableFrame. It must be used together with the right_on parameter. It cannot be used with the on parameter.
right_on – Name(s) of the columns to join on from the right TableFrame. It must be used together with the left_on parameter. It cannot be used with the on parameter.
suffix – Suffix appended to duplicate columns coming from the right TableFrame.
join_nulls – Whether null values should be considered as matching join keys.
coalesce – Collapse join columns into a single column.
Example:
>>> import tabsdata as td >>> >>> tf1: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ │ null ┆ 20 │ │ F ┆ null │ └──────┴──────┘ >>> >>> tf2: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 3 │ │ Y ┆ 4 │ │ Z ┆ 5 │ │ A ┆ 0 │ │ M ┆ 6 │ │ null ┆ 8 │ │ F ┆ null │ └──────┴──────┘ >>> An inner join: >>> >>> tf1.join(tf2, on="a", how="inner") >>> ┌─────┬──────┬─────────┐ │ a ┆ b ┆ b_right │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪══════╪═════════╡ │ A ┆ 1 ┆ 3 │ │ A ┆ 1 ┆ 0 │ │ M ┆ 9 ┆ 6 │ │ A ┆ 100 ┆ 3 │ │ A ┆ 100 ┆ 0 │ │ M ┆ 50 ┆ 6 │ │ F ┆ null ┆ null │ └─────┴──────┴─────────┘ >>> A left join: >>> >>> tf1.join(tf2, on="a", how="left") >>> ┌──────┬──────┬─────────┐ │ a ┆ b ┆ b_right │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞══════╪══════╪═════════╡ │ A ┆ 1 ┆ 3 │ │ A ┆ 1 ┆ 0 │ │ X ┆ 10 ┆ null │ │ C ┆ 3 ┆ null │ │ D ┆ 5 ┆ null │ │ … ┆ … ┆ … │ │ A ┆ 100 ┆ 3 │ │ A ┆ 100 ┆ 0 │ │ M ┆ 50 ┆ 6 │ │ null ┆ 20 ┆ null │ │ F ┆ null ┆ null │ └──────┴──────┴─────────┘ >>> Turning off column coalesce: >>> >>> tf1.join(tf2, on="a", coalesce=False) >>> ┌─────┬──────┬─────────┬─────────┐ │ a ┆ b ┆ a_right ┆ b_right │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ str ┆ i64 │ ╞═════╪══════╪═════════╪═════════╡ │ A ┆ 1 ┆ A ┆ 3 │ │ A ┆ 1 ┆ A ┆ 0 │ │ M ┆ 9 ┆ M ┆ 6 │ │ A ┆ 100 ┆ A ┆ 3 │ │ A ┆ 100 ┆ A ┆ 0 │ │ M ┆ 50 ┆ M ┆ 6 │ │ F ┆ null ┆ F ┆ null │ └─────┴──────┴─────────┴─────────┘
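The semi and anti strategies can be pictured as membership filters on the join keys: semi keeps left rows whose key exists on the right, anti keeps the rest, and neither adds right-hand columns. A rough pure-Python analogue (illustrative only, not how tabsdata executes joins):

```python
# Left rows as (key, value) pairs; only the right table's keys matter here
left = [("A", 1), ("X", 10), ("M", 9), ("F", None)]
right_keys = {"A", "M", "F"}

# semi: left rows with a match on the right (left columns only, no duplication)
semi = [row for row in left if row[0] in right_keys]

# anti: left rows with no match on the right
anti = [row for row in left if row[0] not in right_keys]

print(semi)  # [('A', 1), ('M', 9), ('F', None)]
print(anti)  # [('X', 10)]
```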
- TableFrame.last() TableFrame
- Categories:
filters
Return a TableFrame with the last row.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.last() >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ M ┆ 9 │ └─────┴─────┘
- TableFrame.last_row(
- named: bool = False,
- Categories:
filters
Return a tuple or dictionary with the last row, or None if no row.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ A ┆ B │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ a ┆ 1 │ │ b ┆ 2 │ │ c ┆ 3 │ └─────┴─────┘ >>> >>> tf.last_row() >>> ('c', 3) >>> >>> tf.last_row(named=True) >>> {'A': 'c', 'B': 3}
- TableFrame.limit(
- n: int = 5,
- Categories:
filters
Return a TableFrame with the first n rows. This is equivalent to head.
- Parameters:
n – The number of rows to return.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.limit(2) >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ └─────┴─────┘
- TableFrame.rename( ) TableFrame
- Categories:
projection
Rename columns from the TableFrame.
- Parameters:
mapping – A dictionary mapping column names to their new names. The operation will fail if any specified column name does not exist.
Examples:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ i ┆ f │ │ --- ┆ --- │ │ i32 ┆ f64 │ ╞══════╪══════╡ │ 1 ┆ 3.1 │ │ 2 ┆ 4.1 │ │ 3 ┆ 5.9 │ │ 4 ┆ 2.6 │ │ 5 ┆ 53.5 │ │ 6 ┆ 8.97 │ └──────┴──────┘ >>> >>> tf.rename({"i": "index", "f": "amount"}) >>> ┌───────┬────────┐ │ index ┆ amount │ │ --- ┆ --- │ │ i32 ┆ f64 │ ╞═══════╪════════╡ │ 1 ┆ 3.1 │ │ 2 ┆ 4.1 │ │ 3 ┆ 5.9 │ │ 4 ┆ 2.6 │ │ 5 ┆ 53.5 │ │ 6 ┆ 8.97 │ └───────┴────────┘
- TableFrame.select(
- *exprs: td_typing.IntoExpr | Iterable[td_typing.IntoExpr],
- **named_exprs: td_typing.IntoExpr,
- Categories:
projection
Select column(s) from the TableFrame.
- Parameters:
exprs – Columns or expressions to select.
named_exprs – Named expressions to select.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ └─────┴─────┘ >>> >>> tf.select(td.col("a"), td.col("b").mul(2).alias("bx2"),) >>> ┌─────┬─────┐ │ a ┆ bx2 │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 2 │ │ X ┆ 20 │ │ C ┆ 6 │ │ D ┆ 10 │ │ M ┆ 18 │ │ A ┆ 200 │ │ M ┆ 100 │ └─────┴─────┘
- TableFrame.slice( ) TableFrame
- Categories:
filters
Return a TableFrame with a slice of the original TableFrame.
- Parameters:
offset – Slice starting index.
length – The length of the slice. None means all the way to the end.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.slice(2,2) >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ C ┆ 3 │ │ D ┆ 5 │ └─────┴─────┘
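The offset/length semantics mirror Python list slicing, with length=None running to the end. A tiny sketch over a plain list of rows (a hypothetical helper for illustration, covering only non-negative offsets):

```python
def slice_rows(rows, offset, length=None):
    # length=None means "all the way to the end", as in TableFrame.slice
    end = None if length is None else offset + length
    return rows[offset:end]

rows = [("A", 1), ("X", 10), ("C", 3), ("D", 5), ("M", 9)]
print(slice_rows(rows, 2, 2))  # [('C', 3), ('D', 5)]
print(slice_rows(rows, 3))     # [('D', 5), ('M', 9)]
```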
- TableFrame.sort(
- by: td_typing.IntoExpr | Iterable[td_typing.IntoExpr],
- *more_by: td_typing.IntoExpr,
- descending: bool | Sequence[bool] = False,
- nulls_last: bool | Sequence[bool] = False,
- maintain_order: bool = False,
- Categories:
tableframe
Sort the TableFrame by the given column(s) or expression(s).
- Parameters:
by – Column(s) or expression(s) to sort by.
more_by – Additional columns to sort by.
descending – Specifies if the sorting should be descending or not.
nulls_last – Specifies if null values should be placed last.
maintain_order – Preserve the order of equal rows.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ │ null ┆ 20 │ │ F ┆ null │ └──────┴──────┘ >>> >>> tf.sort(td.col("a"), descending = True) >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ null ┆ 20 │ │ X ┆ 10 │ │ M ┆ 9 │ │ M ┆ 50 │ │ F ┆ null │ │ D ┆ 5 │ │ C ┆ 3 │ │ A ┆ 1 │ │ A ┆ 100 │ └──────┴──────┘
- TableFrame.tail(
- n: int = 5,
- Categories:
filters
Return a TableFrame with the last n rows.
- Parameters:
n – The number of rows to return.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘ >>> >>> tf.tail(2) >>> ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ D ┆ 5 │ │ M ┆ 9 │ └─────┴─────┘
- TableFrame.to_dict() dict[str, list[Any]]
- Categories:
tableframe
Creates a dictionary from this tableframe.
- TableFrame.to_pandas() pd.DataFrame
- Categories:
tableframe
Creates a pandas dataframe from this tableframe.
- TableFrame.to_polars_df() DataFrame
- Categories:
tableframe
Creates a polars dataframe from this tableframe.
- TableFrame.to_polars_lf() LazyFrame
- Categories:
tableframe
Creates a polars lazyframe from this tableframe.
- TableFrame.udf(
- on: td_typing.IntoExpr | list[td_typing.IntoExpr],
- function: td_udf.UDF,
- Categories:
projection
Apply a user-defined function (UDF) to the columns resolved by expr.
The selected columns are supplied to function, which can implement either on_batch or on_element. An on_batch implementation receives a list of Polars series representing the selected columns and must return a list of Polars series with matching length. An on_element implementation receives a list of Python scalars for each row and returns a list of scalars; the framework wraps this in an efficient batch executor, so data still flows in batches even when authoring row-wise logic. In both cases the returned series become new columns appended to the original TableFrame.
By default, both methods receive their inputs as a list. Override the signature property in your UDF class to return “unpacked” to have each column passed as a separate argument instead.
- Creating UDFs:
Subclass tabsdata.tableframe.udf.function.UDF.
Implement __init__ to call super().__init__(output_columns), where output_columns is a tuple or list of tuples (name, data type) specifying the UDF default output schema (column names and data types). Each tuple must contain a column name (string) and a data type (DataType).
Override exactly one of on_batch or on_element, to implement the UDF function logic.
Return a list of TabsData Series (for on_batch) or TabsData supported scalars (for on_element) with the same length as specified in the output schema.
If overriding the on_batch method, the return type must be a list of TabsData Series. If overriding the on_element method, the return type must be a list of supported TabsData scalar values. For both cases, the number of elements in the returned lists must match the number of elements in the output_columns list provided to the UDF constructor.
Using UDFs:
Instantiate a function created as above.
Pass it to TableFrame method udf().
Optionally use UDF.output_columns() to override output column names or data types after instantiation.
- Parameters:
on – Expression selecting the input column(s) of the UDF.
function – Instance of tabsdata.tableframe.udf.function.UDF defining on_batch or on_element to produce the output series.
Examples
>>> import tabsdata as td >>> import tabsdata.tableframe as tdf >>> >>> class SumUDF(tdf.UDF): ... def __init__(self): ... super().__init__(("total", tdf.Int64)) ... ... def on_batch(self, series): ... return [series[0] + series[1]] >>> >>> tf = td.TableFrame({"a": [1, 2, 3], "b": [10, 20, 30]}) >>> print(tf) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 10 │ │ 2 ┆ 20 │ │ 3 ┆ 30 │ └─────┴─────┘ >>> tf.udf(td.col("a", "b"), SumUDF()) >>> print(tf) ┌─────┬─────┬───────┐ │ a ┆ b ┆ total │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═══════╡ │ 1 ┆ 10 ┆ 11 │ │ 2 ┆ 20 ┆ 22 │ │ 3 ┆ 30 ┆ 33 │ └─────┴─────┴───────┘
>>> class RatioUDF(tdf.UDF): ... def __init__(self): ... super().__init__(("ratio", tdf.Float64)) ... ... def on_element(self, values): ... return [values[0] / values[1]] >>> >>> tf = td.TableFrame({"numerator": [10, 20, 30], >>> "denominator": [2, 5, 10],}) >>> print(tf) ┌───────────┬──────────────┐ │ numerator ┆ denominator │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═══════════╪══════════════╡ │ 10 ┆ 2 │ │ 20 ┆ 5 │ │ 30 ┆ 10 │ └───────────┴──────────────┘ >>> tf.udf(td.col("numerator", "denominator"), RatioUDF()).collect() >>> print(tf) ┌───────────┬──────────────┬──────┐ │ numerator ┆ denominator ┆ ratio│ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ f64 │ ╞═══════════╪══════════════╪══════╡ │ 10 ┆ 2 ┆ 5.0 │ │ 20 ┆ 5 ┆ 4.0 │ │ 30 ┆ 10 ┆ 3.0 │ └───────────┴──────────────┴──────┘
Using signature property to receive individual arguments:
>>> class RatioUnpackedUDF(tdf.UDF): ... def __init__(self): ... super().__init__(("ratio", tdf.Float64)) ... ... @property ... def signature(self): ... return "unpacked" ... ... def on_element(self, numerator, denominator): ... return [numerator / denominator] >>> >>> tf = td.TableFrame({"numerator": [10, 20, 30], >>> "denominator": [2, 5, 10],}) >>> tf.udf(td.col("numerator", "denominator"), RatioUnpackedUDF()).collect() >>> print(tf) ┌───────────┬──────────────┬──────┐ │ numerator ┆ denominator ┆ ratio│ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ f64 │ ╞═══════════╪══════════════╪══════╡ │ 10 ┆ 2 ┆ 5.0 │ │ 20 ┆ 5 ┆ 4.0 │ │ 30 ┆ 10 ┆ 3.0 │ └───────────┴──────────────┴──────┘
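The batch execution model described above, where an on_element function is applied row by row over the selected columns, can be sketched in plain Python. This is a conceptual sketch of the wrapping, with a hypothetical run_on_element helper, not the actual tabsdata executor:

```python
def run_on_element(on_element, columns, n_outputs):
    """Apply a row-wise function over column lists, collecting output columns."""
    outputs = [[] for _ in range(n_outputs)]
    for values in zip(*columns):           # one tuple of scalars per row
        result = on_element(list(values))  # one scalar per declared output column
        for out, value in zip(outputs, result):
            out.append(value)
    return outputs

numerator = [10, 20, 30]
denominator = [2, 5, 10]
# Mimics RatioUDF.on_element from the example above
ratio, = run_on_element(lambda v: [v[0] / v[1]], [numerator, denominator], 1)
print(ratio)  # [5.0, 4.0, 3.0]
```

Even though the authored logic is row-wise, the framework drives it in batches, which is why on_element UDFs remain efficient.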
- TableFrame.unique(
- subset: td_typing.ColumnNameOrSelector | Collection[td_typing.ColumnNameOrSelector] | None = None,
- *,
- keep: td_typing.UniqueKeepStrategy = 'any',
- maintain_order: bool = False,
- Categories:
filters
Deduplicate rows from the TableFrame.
- Parameters:
subset – Columns to evaluate for duplicate values. If None, all columns are considered.
keep – Strategy to keep duplicates: first, last, any, none (eliminates all duplicated rows).
maintain_order – Preserve the order of the rows.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 1 │ │ X ┆ 10 │ │ C ┆ 3 │ │ D ┆ 5 │ │ M ┆ 9 │ │ A ┆ 100 │ │ M ┆ 50 │ │ null ┆ 20 │ │ F ┆ null │ └──────┴──────┘ >>> >>> tf.unique("a", keep="last") >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ D ┆ 5 │ │ C ┆ 3 │ │ X ┆ 10 │ │ A ┆ 100 │ │ M ┆ 50 │ │ F ┆ null │ │ null ┆ 20 │ └──────┴──────┘
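The keep strategies can be sketched over a plain list of rows keyed by one column. This is illustrative only: "any" in the real API may pick an arbitrary duplicate, so only the deterministic "first", "last", and "none" strategies are shown here:

```python
from collections import Counter

def unique(rows, keep):
    """Deduplicate (key, value) rows by key, mimicking keep strategies."""
    counts = Counter(key for key, _ in rows)
    if keep == "none":  # drop every row whose key is duplicated
        return [r for r in rows if counts[r[0]] == 1]
    picked = {}
    for row in rows:
        if keep == "last" or row[0] not in picked:
            picked[row[0]] = row  # "first" keeps the first hit, "last" overwrites
    return list(picked.values())

rows = [("A", 1), ("X", 10), ("A", 100), ("M", 9), ("M", 50)]
print(unique(rows, "first"))  # [('A', 1), ('X', 10), ('M', 9)]
print(unique(rows, "last"))   # [('A', 100), ('X', 10), ('M', 50)]
print(unique(rows, "none"))   # [('X', 10)]
```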
- TableFrame.unnest(
- columns: td_typing.ColumnNameOrSelector | Collection[td_typing.ColumnNameOrSelector],
- *more_columns: td_typing.ColumnNameOrSelector,
- Categories:
projection
Expand one or more struct columns so that each field within the struct becomes a separate column in the TableFrame.
The resulting TableFrame will place these new columns in the same position as the original struct column, effectively replacing it. This makes it easier to work directly with the inner fields as standard columns.
- Parameters:
columns – Name of the struct column(s) to expand.
more_columns – Additional struct columns to expand, provided as positional arguments.
Example:
>>> import tabsdata as td >>> >>> tf = td.TableFrame({ ... "id": [1, 2], ... "info": [ ... {"name": "Alice", "age": 30}, ... {"name": "Bob", "age": None}, ... ], ... "status": ["active", "inactive"], ... }) >>> >>> tf >>> ┌─────┬───────────────┬───────────┐ │ id ┆ info ┆ status │ │ --- ┆ --- ┆ --- │ │ i64 ┆ struct[2] ┆ str │ ╞═════╪═══════════════╪═══════════╡ │ 1 ┆ {"Alice",30} ┆ active │ │ 2 ┆ {"Bob",null} ┆ inactive │ └─────┴───────────────┴───────────┘ >>> >>> tf.unnest("info") >>> ┌─────┬───────┬──────┬───────────┐ │ id ┆ name ┆ age ┆ status │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ i64 ┆ str │ ╞═════╪═══════╪══════╪═══════════╡ │ 1 ┆ Alice ┆ 30 ┆ active │ │ 2 ┆ Bob ┆ null ┆ inactive │ └─────┴───────┴──────┴───────────┘
- TableFrame.with_columns(
- *exprs: td_typing.IntoExpr | Iterable[td_typing.IntoExpr],
- **named_exprs: td_typing.IntoExpr,
- Categories:
projection
Add columns to the TableFrame.
- Parameters:
exprs – Columns or expressions to add.
named_exprs – Named expressions to add.
Example:
>>> import tabsdata as td >>> >>> tf: td.TableFrame ... >>> ┌──────┬──────┐ │ x ┆ y │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.0 ┆ 2.0 │ │ 2.0 ┆ 2.0 │ │ 10.0 ┆ 10.0 │ │ 4.0 ┆ 10.0 │ │ 5.0 ┆ null │ │ null ┆ null │ └──────┴──────┘ >>> >>> tf.with_columns(td.col("x").mul(td.col("y")).alias("z")) >>> ┌──────┬──────┬───────┐ │ x ┆ y ┆ z │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f64 │ ╞══════╪══════╪═══════╡ │ 1.0 ┆ 2.0 ┆ 2.0 │ │ 2.0 ┆ 2.0 ┆ 4.0 │ │ 10.0 ┆ 10.0 ┆ 100.0 │ │ 4.0 ┆ 10.0 ┆ 40.0 │ │ 5.0 ┆ null ┆ null │ │ null ┆ null ┆ null │ └──────┴──────┴───────┘
- class TableInput( )
Bases:
SourcePlugin
Table-based data inputs.
- TableInput.stream(
- working_dir: str,
- class TableOutput( )
Bases:
DestinationPlugin
- Categories:
destination
Table-based data outputs.
- class TabsdataFunction(
- func: Callable,
- name: str | None,
- input: SourcePlugin = None,
- output: DestinationPlugin = None,
- trigger_by: str | list[str] | CronTrigger | StageTrigger | None = None,
- output_actions: DataQuality | list[DataQuality] | None = None,
Bases:
object
Class to decorate a function with metadata and methods for use in a Tabsdata environment.
Attributes:
- property input: SourcePlugin | None
The data to be used when running the function.
- Type:
SourcePlugin | None
- property original_file
The file where the original function is defined, on the user's computer.
- Type:
- property original_folder: str
The folder where the original function is defined, as a local path in the user’s computer.
- Type:
- property original_function: Callable
The original function that was decorated, without any behaviour modifications.
- Type:
Callable
- property output: DestinationPlugin | None
The location where the function results will be saved when run.
- Type:
- property output_actions: list[OnTablesAction]
Actions to perform on the output tables.
- Type:
list[OnTablesAction]
- property trigger_by: list[str] | CronTrigger | StageTrigger | None
The trigger(s) that will cause the function to execute. It must be another table or tables in the system, a CronTrigger, or a StageTrigger in the case of a publisher.
- Type:
List[str]
- class UserPasswordCredentials( )
Bases:
Credentials
- Categories:
credentials
Represents credentials consisting of a user and a password.
This class is used to manage and securely handle user and password credentials. It includes functionalities to set user and password details, convert them into a dictionary representation, and provide a string representation for debugging purposes.
- property password: Secret
The password.
- Type:
Secret
- property user: Secret
The user.
- Type:
Secret
- concat(
- items: Iterable[TdType] | None,
- how: Literal['vertical', 'vertical_relaxed', 'diagonal', 'diagonal_relaxed'] = 'vertical',
- Categories:
union
Combine multiple TableFrames by stacking their rows.
- Parameters:
items – The TableFrames to concatenate. If None or empty, returns None. None entries within the iterable are ignored. If all entries are None, returns None.
how –
{‘vertical’, ‘vertical_relaxed’, ‘diagonal’, ‘diagonal_relaxed’}
vertical: Appends the rows of each input below the previous one. All inputs must have exactly the same column names and types; otherwise the operation fails.
vertical_relaxed: Same as vertical, but if columns with the same name have different data types across inputs, they are converted to a common type (e.g. Int32 → Int64).
diagonal: Aligns columns by name across all inputs. If a column is missing from a particular input, that input is padded with null values for the missing column. Matching columns keep their original type if consistent.
diagonal_relaxed: Same as diagonal, but when matching columns have different data types, they are converted to a common type (e.g. Int32 → Int64).
Example:
>>> import tabsdata as td >>> >>> tf1: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ a ┆ 1 │ │ b ┆ 2 │ └──────┴──────┘ >>> >>> tf2: td.TableFrame ... >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ x ┆ 10 │ │ y ┆ 20 │ └──────┴──────┘ >>> >>> tf = td.concat([tf1, tf2]) >>> ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ a ┆ 1 │ │ b ┆ 2 │ │ x ┆ 10 │ │ y ┆ 20 │ └──────┴──────┘
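The diagonal strategy's column alignment and null padding can be illustrated with plain dict-of-lists tables (a conceptual sketch with a hypothetical concat_diagonal helper, not the tabsdata implementation):

```python
def concat_diagonal(tables):
    """Align columns by name across inputs, padding missing columns with nulls."""
    # Union of column names, preserving first-seen order
    names = []
    for t in tables:
        for name in t:
            if name not in names:
                names.append(name)
    out = {name: [] for name in names}
    for t in tables:
        height = len(next(iter(t.values())))
        for name in names:
            # An input missing this column contributes nulls for its rows
            out[name].extend(t.get(name, [None] * height))
    return out

tf1 = {"a": ["a", "b"], "b": [1, 2]}
tf2 = {"a": ["x", "y"], "c": [10, 20]}
print(concat_diagonal([tf1, tf2]))
# {'a': ['a', 'b', 'x', 'y'], 'b': [1, 2, None, None], 'c': [None, None, 10, 20]}
```

With strict vertical concatenation these two inputs would fail, since their column sets differ.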
- publisher(
- source: AzureSource | LocalFileSource | MariaDBSource | MySQLSource | OracleSource | PostgresSource | S3Source | SourcePlugin,
- tables: TableOutput | str | List[str],
- name: str = None,
- trigger_by: PublisherTriggerBySpec = None,
- on_tables: OnTablesSpec = None,
- Categories:
publisher
Decorator to define a function to publish data to Tabsdata.
- Parameters:
source – Where to obtain the data that will be provided as an input to the function.
tables – Where to store the output of the function.
name – The name with which the function will be registered. If not provided, the current function name will be used.
trigger_by – The trigger that will cause the function to execute. It can be a table in the system, a list of tables, None (in which case it must be triggered manually), or a CronTrigger.
on_tables (OnTablesSpec, optional) – Actions to perform on output tables after executing the function. Can be a single action or a list of them. Defaults to None.
- Returns:
The function converted to a Tabsdata Function.
- Return type:
callable
- subscriber(
- tables: TableInput | str | List[str],
- destination: AzureDestination | LocalFileDestination | MariaDBDestination | MySQLDestination | OracleDestination | PostgresDestination | S3Destination | DestinationPlugin,
- name: str = None,
- trigger_by: SubscriberTriggerBySpec = '*',
- Categories:
subscriber
Decorator to define a function to subscribe to data present in Tabsdata.
- Parameters:
tables – Where to obtain the data that will be provided as an input to the function.
destination – Where to store the output of the function.
name – The name with which the function will be registered. If not provided, the current function name will be used.
trigger_by – The trigger that will cause the function to execute. It can be a table in the system, a list of tables, None (in which case it must be triggered manually), or a CronTrigger. Defaults to all dependencies.
- Returns:
The function converted to a Tabsdata Function.
- Return type:
callable
- transformer(
- input_tables: TableInput | str | List[str],
- output_tables: TableOutput | str | List[str],
- name: str = None,
- trigger_by: TransformerTriggerBySpec = '*',
- on_output_tables: OnTablesSpec = None,
- Categories:
transformer
Decorator to define a function to transform data present in Tabsdata.
- Parameters:
input_tables (TableInput | str | List[str]) – Where to obtain the data that will be provided as an input to the function.
output_tables (TableOutput | str | List[str]) – Where to store the output of the function.
name (str, optional) – The name with which the function will be registered. If not provided, the current function name will be used.
trigger_by (TransformerTriggerBySpec, optional) – The trigger that will cause the function to execute. It can be a table in the system, a list of tables, None (in which case it must be triggered manually), or a CronTrigger. Defaults to all dependencies.
on_output_tables (OnTablesSpec, optional) – Actions to perform on output tables after executing the function. Can be a single action or a list of them. Defaults to None.
- Returns:
The function converted to a Tabsdata Function.
- Return type:
callable