Working with Tables#

A table is the basic unit of operation inside Tabsdata. A Tabsdata table is data organized in rows and columns, similar to a database table. Each column has a name and all the values in the column have the same data type. A Tabsdata table only exsists inside the Tabsdata server.

Tables are created in Tabsdata by publishers importing external source data as tables inside the Tabsdata server or by transformers processing some tables in the server to create new ones.

You can use any Tabsdata table, including previous versions of tables, as input to create or modify other tables in Tabsdata. You can also export Tabsdata tables to external systems as CSV, jsonl, ndjson, and parquet files.

Table Operations#

Some of the key table operations are demonstrated in this section using the following persons Tabsdata table:


┌────────────┬────────────┬────────────┬─────────────┬────────────┬────────┐
 identifier  first_name  last_name   nationality  birthdate   height 
 ---         ---         ---         ---          ---         ---    
 str         str         str         str          str         f64    
╞════════════╪════════════╪════════════╪═════════════╪════════════╪════════╡
 74-93/03    Bennett     Chandler    Portuguese   1985-10-11  1.59   
                                                                     
 68-52/94    Elijah      Lara        Costa Rican  2003-10-16  1.75   
 37-41/89    Thresa      Carroll     Cuban        2005-02-03  1.53   
 66-27/19    Alethia     Wilkinson   Colombian    1999-04-17  1.86   
 35-07/51    Elin        English     Brazilian    1998-02-12  1.88   
                                                                     
                                                               
 85-61/58    Dudley      Valenzuela  Iranian      1998-04-12  1.77   
 36-24/33    Willard     Hodge       Israeli      2002-07-31  1.77   
                                                                     
 05-43/20    Salena      Waller      Colombian    2014-04-19  1.88   
 49-35/73    August      Dillard     Guatemalan   1988-05-25  1.83   
 40-49/72    Coy         Goodman     Uruguayan    1999-02-17  1.56   
                                                                     
└────────────┴────────────┴────────────┴─────────────┴────────────┴────────┘

To work with the table, you define a transformer as follows,

import tabsdata as td

    @td.transformer(
        source = ["persons"],
        tables = ["persons_modified"],
    )

    def transf (tf: td.TableFrame):
        tf_modified = <function_logic>
        return tf_modified

where <function_logic> can be substituted with various table operations that are demonstrated in the following sections.

Projection#

Projection operations derive projected views from underlying tables.

Column selection#

Here is an example code to select specific columns from a table.

tf.select(td.col("first_name"), td.col("last_name"))
tf.select("first_name", "last_name")
tf.select(["first_name", "last_name"])

Result:

┌────────────┬────────────┐
 first_name  last_name  
 ---         ---        
 str         str        
╞════════════╪════════════╡
 Bennett     Chandler   
 Elijah      Lara       
 Thresa      Carroll    
 Alethia     Wilkinson  
 Elin        English    
                      
 Dudley      Valenzuela 
 Willard     Hodge      
 Salena      Waller     
 August      Dillard    
 Coy         Goodman    
└────────────┴────────────┘

Dropping a column:#

Here is an example code to drop specific columns from a table.

tf.drop("last_name")

Result:

┌────────────┬────────────┬─────────────┬────────────┬────────┐
 identifier  first_name  nationality  birthdate   height 
 ---         ---         ---          ---         ---    
 str         str         str          str         f64    
╞════════════╪════════════╪═════════════╪════════════╪════════╡
 74-93/03    Bennett     Portuguese   1985-10-11  1.59   
 68-52/94    Elijah      Costa Rican  2003-10-16  1.75   
 37-41/89    Thresa      Cuban        2005-02-03  1.53   
 66-27/19    Alethia     Colombian    1999-04-17  1.86   
 35-07/51    Elin        Brazilian    1998-02-12  1.88   
                                                    
 85-61/58    Dudley      Iranian      1998-04-12  1.77   
 36-24/33    Willard     Israeli      2002-07-31  1.77   
 05-43/20    Salena      Colombian    2014-04-19  1.88   
 49-35/73    August      Guatemalan   1988-05-25  1.83   
 40-49/72    Coy         Uruguayan    1999-02-17  1.56   
└────────────┴────────────┴─────────────┴────────────┴────────┘

Column concatenation#

Here is an example code to concatenate 2 columns from a table and call the output column full_name.

tf.select((td.col("first_name") + " " + td.col("last_name")).alias("full_name"))

Result:

┌───────────────────┐
 full_name         
 ---               
 str               
╞═══════════════════╡
 Bennett Chandler  
 Elijah Lara       
 Thresa Carroll    
 Alethia Wilkinson 
 Elin English      
                  
 Dudley Valenzuela 
 Willard Hodge     
 Salena Waller     
 August Dillard    
 Coy Goodman       
└───────────────────┘

Column string expressions#

Here is an example code to convert the last_name column to upper case.

tf.select(
    (td.col("first_name") + " " + td.col("last_name").str.to_uppercase()).alias(
        "full_name"
    )
)

Result:

┌───────────────────┐
 full_name         
 ---               
 str               
╞═══════════════════╡
 Bennett CHANDLER  
 Elijah LARA       
 Thresa CARROLL    
 Alethia WILKINSON 
 Elin ENGLISH      
                  
 Dudley VALENZUELA 
 Willard HODGE     
 Salena WALLER     
 August DILLARD    
 Coy GOODMAN       
└───────────────────┘

Column type casting and datetime expressions#

Here is an example code to convert the values in the birthdate column to polars Date format, select the year from the converted value, and call the resultant column birth_year.

tf.select(
    td.col("first_name"),
    td.col("last_name"),
    td.col("birthdate").cast(td.Date).dt.year().alias("birth_year"),
)

Result:

┌────────────┬────────────┬────────────┐
 first_name  last_name   birth_year 
 ---         ---         ---        
 str         str         i32        
╞════════════╪════════════╪════════════╡
 Bennett     Chandler    1985       
 Elijah      Lara        2003       
 Thresa      Carroll     2005       
 Alethia     Wilkinson   1999       
 Elin        English     1998       
                                 
 Dudley      Valenzuela  1998       
 Willard     Hodge       2002       
 Salena      Waller      2014       
 August      Dillard     1988       
 Coy         Goodman     1999       
└────────────┴────────────┴────────────┘

Column numeric expressions#

Here is an example code to select the height column from the table, multiply the values to hundred, and name the result column height_in_cm.

tf.select(
    td.col("first_name"),
    td.col("last_name"),
    (td.col("height") * 100).alias("height_in_cm"),
)

Result:

┌────────────┬────────────┬──────────────┐
 first_name  last_name   height_in_cm 
 ---         ---         ---          
 str         str         f64          
╞════════════╪════════════╪══════════════╡
 Bennett     Chandler    159.0        
 Elijah      Lara        175.0        
 Thresa      Carroll     153.0        
 Alethia     Wilkinson   186.0        
 Elin        English     188.0        
                                   
 Dudley      Valenzuela  177.0        
 Willard     Hodge       177.0        
 Salena      Waller      188.0        
 August      Dillard     183.0        
 Coy         Goodman     156.0        
└────────────┴────────────┴──────────────┘

Filter#


Simple condition on column value#

Here is an example code to filter the values of the last_name column by the first letter as “A”.

tf.filter(td.col("last_name").str.starts_with("A"))

Result:

┌────────────┬────────────┬───────────┬─────────────┬────────────┬────────┐
 identifier  first_name  last_name  nationality  birthdate   height 
 ---         ---         ---        ---          ---         ---    
 str         str         str        str          str         f64    
╞════════════╪════════════╪═══════════╪═════════════╪════════════╪════════╡
 10-08/26    Milan       Austin     Swedish      1981-08-28  1.94   
                                                                    
 18-23/80    Gustavo     Ayers      Russian      1991-08-27  1.84   
                                                                    
 43-25/03    Barton      Alvarez    Swiss        1995-01-08  1.55   
 46-09/51    Marvis      Alston     Ecuadorian   2009-03-30  1.55   
 84-84/87    Alfred      Andrews    Ethiopian    2010-08-29  1.82   
                                                              
 99-98/21    Doretta     Albert     Ukrainian    2011-11-10  1.92   
 13-33/25    Sharika     Aguirre    Romanian     1985-10-20  1.61   
                                                                    
 60-97/39    Krysta      Albert     Greek        2017-10-20  1.99   
 97-89/11    Elden       Alvarado   Dominican    2014-05-09  1.8    
                                                                    
 17-61/97    Rosario     Austin     Argentinian  1982-12-18  1.77   
                                                                    
└────────────┴────────────┴───────────┴─────────────┴────────────┴────────┘

Multiple conditions on column values#

Here is an example code to filter by multiple conditions on height and last_name column.

tf.filter((td.col("height") > 1.80).and_(td.col("last_name").str.starts_with("A")))
tf.filter((td.col("height") > 1.80) & td.col("last_name").str.starts_with("A"))

Result:

┌────────────┬────────────┬───────────┬─────────────┬────────────┬────────┐
 identifier  first_name  last_name  nationality  birthdate   height 
 ---         ---         ---        ---          ---         ---    
 str         str         str        str          str         f64    
╞════════════╪════════════╪═══════════╪═════════════╪════════════╪════════╡
 10-08/26    Milan       Austin     Swedish      1981-08-28  1.94   
                                                                    
 18-23/80    Gustavo     Ayers      Russian      1991-08-27  1.84   
                                                                    
 84-84/87    Alfred      Andrews    Ethiopian    2010-08-29  1.82   
 98-38/79    Cody        Allison    Australian   2023-12-11  1.99   
                                                                    
 33-44/94    Walton      Avery      Chinese      1987-11-20  1.85   
                                                                    
                                                              
 75-70/19    Ngoc        Anderson   Ukrainian    2005-11-10  1.86   
                                                                    
 65-39/04    Meg         Andrews    Colombian    2006-04-26  1.86   
                                                                    
 35-32/81    Julienne    Alston     Chilean      2001-08-02  1.9    
                                                                    
 99-98/21    Doretta     Albert     Ukrainian    2011-11-10  1.92   
 60-97/39    Krysta      Albert     Greek        2017-10-20  1.99   
└────────────┴────────────┴───────────┴─────────────┴────────────┴────────┘

Aggregation#

Here is an example code to group by the birth year and count the number of people born each year. It first converts the birthdate column to a Date type, extracts the year, and names it birth_year. Then, it groups the data by birth_year and counts the occurrences of first_name, storing the result in the people_born column.

tf.group_by(td.col("birthdate").cast(td.Date).dt.year().alias("birth_year")).agg(
    td.col("first_name").count().alias("people_born")
)

Result:

┌────────────┬─────────────┐
 birth_year  people_born 
 ---         ---         
 i32         u32         
╞════════════╪═════════════╡
 2013        17          
 2008        17          
 1997        27          
 2014        26          
 2007        27          
                       
 1999        26          
 2017        25          
 2004        20          
 1983        24          
 1993        16          
└────────────┴─────────────┘

Join#

Here is an example code that uses aggregates, joins and select, to calculate the average birth year for each nationality and filter individuals born after their nationality’s average birth year.

First, it groups the dataset by nationality and computes the mean birth year, storing the result in the birth_year_mean column.

Then, it joins this aggregated data back to the original dataset on nationality and filters records where an individual’s birth year is greater than their nationality’s average.

Finally, the code selects only the last_name, birth_year_mean, and birthdate columns to display the relevant information.

Note that the following example is structured differently from the previous examples, because we need to derive the birth_y_mean table from persons table, to demonstrate the joins operation.

import tabsdata as td


@td.transformer(
    input_tables=["persons"],
    output_tables=["persons_modified"],
)
def transf(tf: td.TableFrame):
    birth_y_mean = tf.group_by("nationality").agg(
        td.col("birthdate").cast(td.Date).dt.year().mean().alias("birth_year_mean")
    )

    res = tf.join(birth_y_mean, on="nationality").filter(
        td.col("birthdate").cast(td.Date).dt.year() > td.col("birth_year_mean")
    )

    tf_modified = res.select(
        td.col("last_name"), td.col("birth_year_mean"), td.col("birthdate")
    )

    return tf_modified

Result:

┌───────────┬─────────────────┬────────────┐
 last_name  birth_year_mean  birthdate  
 ---        ---              ---        
 str        f64              str        
╞═══════════╪═════════════════╪════════════╡
 Carroll    2000.25          2005-02-03 
 Brock      1999.545455      2013-10-12 
 Shields    2003.172414      2013-12-31 
 Decker     2005.769231      2006-06-26 
 Rodriguez  2002.866667      2019-06-14 
                                     
 Chan       2002.181818      2017-08-26 
 Vargas     1998.925926      2003-07-12 
 Crosby     2006.769231      2010-01-17 
 Jensen     1998.925926      2005-08-09 
 Waller     2004.444444      2014-04-19 
└───────────┴─────────────────┴────────────┘

Table Commits#

A table commit is generated by a successful execution of a publisher or transformer. When a function reads a table, it reads its latest commit by default.

A commit creation is based on the successful execution of the function and not on changes in the data of the table.

The Tabsdata server retains previous commits and metadata of all tables for auditing and traceability. Once created, the data of a table commit is immutable.

Working with Table Commits#

Tabsdata uses the HEAD command to refer to the latest commit of a table.

import tabsdata as td


@td.transformer(
    input_tables=[
        "minors_g@HEAD^",
        "adults_g@HEAD-5",
    ],
    output_tables=["minors_g_oldcommit", "adults_g_oldcommit"],
)
def transfs(tf1: td.TableFrame, tf2: td.TableFrame):
    t1 = tf1.drop_nulls()
    t2 = tf2.drop_nulls()
    return t1, t2

In the above example, you define a transformer that takes two tables: 1 table commit behind of minors_g and 5 table commits behind of adults_g, drops nulls from them, and returns the tables as minors_g_oldcommit and adults_g_oldcommit.

Following commands can be used in the input of transformers and subscribers to refer to the previous commits:

  • HEAD-10: Refers to the table commit 10 versions before the current one.

  • HEAD^^^: Refers to the table commit 3 versions before the current one.

  • HEAD-3..7: Refers to the list of table commits between 3 and 7 versions before the current one. (3 and 7 are included)

  • HEAD, HEAD^^^, HEAD-7..9: Refers to the list of table commits containing the current commit, the 3rd commit before the current version, and between 7 and 9 versions before the current one.