Table

A Table object provides a centralized location for metadata about a source table. This metadata determines the type of operations that can be applied to the table's views and includes essential information for feature engineering.

Important

A source table can only be associated with one active Table object in a catalog at a time. This means that the active Table object in the catalog is the source of truth for the metadata of the source table. If a Table object becomes deprecated, a new Table object can be registered with the same source table.

Registering Tables

Before registering tables, ensure that the catalog you want to work with is active.

catalog = fb.Catalog.activate(<catalog_name>)

Select the source table you are interested in.

ds = fb.FeatureStore.get("playground").get_data_source()
source_table = ds.get_source_table(
    database_name="spark_catalog",
    schema_name="GROCERY",
    table_name="GROCERYINVOICE"
)

To create a Table object from a SourceTable object, use the creation method specific to the type of data contained in the source table: for example, create_event_table() for event data, create_item_table() for item data, create_dimension_table() for dimension data, or create_scd_table() for slowly changing dimension data.

Registering a table according to its type determines the types of feature engineering operations that are possible on the table's views and enforces guardrails accordingly.

Example of registering an event table using the create_event_table() method:

invoice_table = source_table.create_event_table(
    name="GROCERYINVOICE",
    event_id_column="GroceryInvoiceGuid",
    event_timestamp_column="Timestamp",
    event_timestamp_timezone_offset_column="tz_offset",
    record_creation_timestamp_column="record_available_at"
)

Implementing Default Job Settings for Consistency

A default feature job setting is established at the table level to streamline the configuration of feature job settings and ensure consistency across features developed by different team members.

CRON Job Settings

CRON job settings consist of four key parameters:

  1. Crontab — Defines the cron schedule for the feature job, specifying when the job should run.
  2. Time Zone — Determines the time zone in which the cron schedule operates.
  3. Blind Spot — Specifies the period of time immediately preceding the job execution that should be excluded to avoid data leakage from late-arriving records.
  4. Reference Time Zone — Defines the time zone used for calendar-based aggregation periods (e.g., daily, weekly, or monthly). This ensures consistent calendar boundaries across data sources in different time zones.

    • For example, if a scheduled job runs at 2025/01/31 23:00 UTC and the reference time zone is Asia/Singapore, the corresponding calendar date is 2025/02/01. Therefore, the aggregation for the most recent full month would cover January.
    • Typically, the reference time zone should be the westernmost time zone among those associated with the data’s timestamps, ensuring that each aggregation window fully includes all relevant observations.
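The calendar-date shift in the example above can be checked with Python's standard-library zoneinfo module (an illustrative sketch only, not part of the FeatureByte API):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A job that runs at 2025-01-31 23:00 UTC...
job_run = datetime(2025, 1, 31, 23, 0, tzinfo=timezone.utc)

# ...corresponds to 2025-02-01 07:00 in the Asia/Singapore reference
# time zone, so the most recent full calendar month is January.
local_date = job_run.astimezone(ZoneInfo("Asia/Singapore")).date()
print(local_date)  # 2025-02-01
```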

Use the CronFeatureJobSetting class together with the update_default_feature_job_setting() method to update the default feature job setting.

# Update the default CRON feature job setting
job_setting = fb.CronFeatureJobSetting(
    crontab="0 13 1 * *",
    timezone="America/Los_Angeles",
)
credit_card_balance.update_default_feature_job_setting(job_setting)

The time zones are defined by the IANA Time Zone Database (commonly known as the tz database).

# Get all available time zones
from pydantic_extra_types.timezone_name import get_timezones
get_timezones()
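As a cross-check outside FeatureByte, the same IANA identifiers are available from Python's standard library (shown here purely for illustration):

```python
from zoneinfo import available_timezones

# The stdlib ships the same IANA names used in the example above
tz_names = available_timezones()
print("America/Los_Angeles" in tz_names)  # True
```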

Periodic Job Settings for Event Tables

For an EventTable, the default feature job setting can also be initialized based on an automated analysis of the availability and freshness of the table's data. This configuration follows a Periodic Job Setting model defined by three key parameters:

  1. Period: Specifies how often the batch process should run.

    • Example: A period of 60m indicates that the feature job executes every 60 minutes.
  2. Offset: Defines the time delay from the end of the period to when the feature job starts.

    • Example: With period: 60m and offset: 130s, the feature job starts 2 minutes and 10 seconds after each hour—at 00:02:10, 01:02:10, 02:02:10, ..., 23:02:10.
  3. Blind Spot: Specifies the period of time immediately preceding the job execution that should be excluded to avoid data leakage from late-arriving records.
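The run times implied by the period and offset in the example above can be sketched with the standard library (illustrative only, not FeatureByte code):

```python
from datetime import timedelta

period = timedelta(minutes=60)
offset = timedelta(seconds=130)

# With period=60m and offset=130s, jobs run 2m10s after each hour
run_times = [i * period + offset for i in range(3)]
print([str(t) for t in run_times])  # ['0:02:10', '1:02:10', '2:02:10']
```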

This analysis depends on the presence of record creation timestamps in the source table, typically populated during data warehouse updates.

You can initialize the default feature job setting using the initialize_default_feature_job_setting() method:

invoice_table.initialize_default_feature_job_setting()

Note

ItemTable objects automatically inherit the default feature job setting from their associated EventTable objects.

Managing Periodic Feature Job Settings

You can manage and refine default feature job settings using the following methods:

# Create a new analysis over a specific time period
analysis = invoice_table.create_new_feature_job_setting_analysis(
    analysis_date=pd.Timestamp('2023-04-10'),
    analysis_length=3600 * 24 * 28,  # 28 days, in seconds
)
# List previous analyses
invoice_table.list_feature_job_setting_analysis()
# Retrieve a specific analysis
analysis = fb.FeatureJobSettingAnalysis.get_by_id(<analysis_id>)
# Backtest a manual setting
manual_setting = fb.FeatureJobSetting(
    blind_spot="135s",
    period="60m",
    offset="90s",
)
backtest_result = analysis.backtest(feature_job_setting=manual_setting)
# Update the default feature job setting
invoice_table.update_default_feature_job_setting(manual_setting)
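To build intuition for the manual setting above (period 60m, offset 90s, blind spot 135s), here is a small stdlib sketch of the data cutoff for a single job run; this is illustrative and not part of the FeatureByte API:

```python
from datetime import datetime, timedelta, timezone

offset = timedelta(seconds=90)
blind_spot = timedelta(seconds=135)

# With offset=90s, the job for the 01:00 hour runs at 01:01:30 UTC
run_time = datetime(2023, 4, 10, 1, 0, tzinfo=timezone.utc) + offset

# Records arriving after run_time - blind_spot are excluded to avoid
# leakage from late-arriving records
cutoff = run_time - blind_spot
print(cutoff.time())  # 00:59:15
```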

Enhancing Feature Engineering with Metadata

Optionally, after creating a table, you can add metadata at the column level to further support feature engineering.

This could involve identifying columns that reference specific entities using the as_entity() method:

# Tag the entities for the grocery invoice table
invoice_table.GroceryInvoiceGuid.as_entity("groceryinvoice")
invoice_table.GroceryCustomerGuid.as_entity("grocerycustomer")

This could also involve defining default cleaning operations using the update_critical_data_info() method:

# Discount amount should not be negative
items_table.Discount.update_critical_data_info(
    cleaning_operations=[
        fb.MissingValueImputation(imputed_value=0),
        fb.ValueBeyondEndpointImputation(
            type="less_than", end_point=0, imputed_value=0
        ),
    ]
)
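The two operations above amount to the following per-value semantics, shown here as a plain-Python sketch rather than FeatureByte code:

```python
import math

def clean_discount(value):
    # MissingValueImputation(imputed_value=0): missing values become 0
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0
    # ValueBeyondEndpointImputation(type="less_than", end_point=0,
    # imputed_value=0): values below 0 become 0
    if value < 0:
        return 0
    return value

print([clean_discount(v) for v in [None, float("nan"), -5.0, 2.5]])
# [0, 0, 0, 2.5]
```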

For more details, refer to the TableColumn documentation page.

Managing Table Status

When a table is created, it is automatically added to the active catalog with its status set to 'PUBLIC_DRAFT'. Once the table is prepared for feature engineering, you can modify its status to 'PUBLISHED'.

Note

If a table needs to be deprecated, update its status to 'DEPRECATED'.

After deprecating a table, you can register a new Table object with the same source table.

To obtain the current status of a table, use the status property. To change the status, use the update_status() method:

print(invoice_table.status)
invoice_table.update_status("PUBLISHED")

Accessing a Table from the Catalog

Existing tables can be accessed through the catalog using the list_tables() and get_table() methods.

# List tables in the catalog
catalog.list_tables()
# Retrieve a table
invoice_table = catalog.get_table("GROCERYINVOICE")

You can also retrieve a Table object by its Object ID using the get_table_by_id() method.

table = catalog.get_table_by_id(<table_id>)

Exploring a Table

To explore a table, you can:

  • obtain detailed information using the info() method
  • acquire descriptive statistics using the describe() method
  • obtain a selection of rows using the preview() method
  • obtain a larger random selection of rows based on a specified time range, size, and seed using the sample() method

# Obtain detailed information on a table
invoice_table.info()
# Acquire descriptive statistics for a table
invoice_table.describe()
# Obtain a selection of table rows
df = invoice_table.preview(limit=20)
# Obtain a random selection of table rows based on a specified time range, size, and seed
df = invoice_table.sample(
    from_timestamp=pd.Timestamp('2023-04-01'),
    to_timestamp=pd.Timestamp('2023-05-01'),
    size=100, seed=23
)

By default, the statistics and materialization are computed before applying cleaning operations defined at the table level. To include these cleaning operations, set the after_cleaning parameter to True.

invoice_table.describe(after_cleaning=True)

Creating Views to Prepare Data Before Defining Features

To prepare data before defining features, View objects are created from Table objects using the get_view() method.

customer_table = catalog.get_table("GROCERYCUSTOMER")
invoice_view = invoice_table.get_view()

Besides EventView, ItemView, SnapshotsView, TimeSeriesView, DimensionView, and SCDView, another type of view can be created from an SCDTable: the Change View. A Change View provides a way to analyze changes in a specific attribute of the natural key of the SCD table. To get a Change View, use the get_change_view() method:

address_changed_view = customer_table.get_change_view(
    track_changes_column="StreetAddress"
)

Updating Description

Table and column descriptions are automatically fetched from your Data Warehouse when they are available. If these descriptions are missing or incomplete, you have the option to edit and update them.

To see a description of a Table object, use the description property:

invoice_table.description

To update the description of a Table object, use the update_description() method:

invoice_table.update_description(
    'Grocery invoice details, containing the timestamp and the total amount of the invoice'
)