2. Register Tables

Registering tables in the catalog¶

Our catalog is created and we can start registering tables in it.

First of all, let's activate our catalog.¶

We will be repeating this command in following notebooks.

In [1]:

            
                Copied!
                
import featurebyte as fb

# Set your profile to the tutorial environment
fb.use_profile("tutorial")

catalog_name = "Grocery Dataset Tutorial"
catalog = fb.Catalog.activate(catalog_name)
import featurebyte as fb

# Set your profile to the tutorial environment
fb.use_profile("tutorial")

catalog_name = "Grocery Dataset Tutorial"
catalog = fb.Catalog.activate(catalog_name)

10:39:15 | WARNING  | Service endpoint is inaccessible: http://featurebyte-server:8088/
10:39:15 | INFO     | Using profile: tutorial
10:39:15 | INFO     | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
10:39:15 | INFO     | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
10:39:15 | INFO     | SDK version: 2.0.1.dev67
10:39:15 | INFO     | No catalog activated.
10:39:16 | INFO     | Catalog activated: Grocery Dataset Tutorial
16:04:52 | INFO     | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
16:04:52 | INFO     | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
16:04:52 | WARNING  | Remote SDK version (1.1.0.dev7) is different from local (1.1.0.dev1). Update local SDK to avoid unexpected behavior.
16:04:52 | INFO     | No catalog activated.
16:04:52 | INFO     | Catalog activated: Grocery Dataset Tutorial

Get data source¶

To be able to get source tables from the data warehouse we need to get data source, which our catalog has access to.

This data source contains collection of tables in our database, and we can use it to explore our DB schema:

In [2]:

            
                Copied!
                
ds = catalog.get_data_source()
ds = catalog.get_data_source()

Here we see we have access to a number of databases. For these tutorials we will use the one called 'DEMO_DATASETS' and the 'GROCERY' schema under it.

In [3]:

            
                Copied!
                
ds.list_databases()
ds.list_databases()

Out[3]:

['DEMO_DATASETS', 'TUTORIAL']

In [4]:

            
                Copied!
                
database_name = 'DEMO_DATASETS'
ds.list_schemas(database_name=database_name)
database_name = 'DEMO_DATASETS'
ds.list_schemas(database_name=database_name)

Out[4]:

['CREDITCARD', 'GROCERY', 'HEALTHCARE', 'INFORMATION_SCHEMA']

In [5]:

            
                Copied!
                
schema_name = 'GROCERY'
ds.list_source_tables(database_name=database_name, schema_name=schema_name)
schema_name = 'GROCERY'
ds.list_source_tables(database_name=database_name, schema_name=schema_name)

Out[5]:

['GROCERYCUSTOMER', 'INVOICEITEMS', 'GROCERYPRODUCT', 'GROCERYINVOICE']

Get source tables¶

We identified database and schema we want to work with, it is time to get source tables and register them in the catalog.

In [6]:

            
                Copied!
                
                    
                    
                
                

        
customer_source_table = ds.get_source_table(
    database_name=database_name,
    schema_name=schema_name,
    table_name="GROCERYCUSTOMER"
)
invoice_source_table = ds.get_source_table(
    database_name=database_name,
    schema_name=schema_name,
    table_name="GROCERYINVOICE"
)
items_source_table = ds.get_source_table(
    database_name=database_name,
    schema_name=schema_name,
    table_name="INVOICEITEMS"
)
product_source_table = ds.get_source_table(
    database_name=database_name,
    schema_name=schema_name,
    table_name="GROCERYPRODUCT"
)
customer_source_table = ds.get_source_table(
    database_name=database_name,
    schema_name=schema_name,
    table_name="GROCERYCUSTOMER"
)
invoice_source_table = ds.get_source_table(
    database_name=database_name,
    schema_name=schema_name,
    table_name="GROCERYINVOICE"
)
items_source_table = ds.get_source_table(
    database_name=database_name,
    schema_name=schema_name,
    table_name="INVOICEITEMS"
)
product_source_table = ds.get_source_table(
    database_name=database_name,
    schema_name=schema_name,
    table_name="GROCERYPRODUCT"
)

Exploring Source Tables¶

You can obtain descriptive statistics, preview a selection of rows, or collect additional information on their columns.

In [7]:

            
                Copied!
                
# Obtain descriptive statistics
invoice_source_table.describe()
# Obtain descriptive statistics
invoice_source_table.describe()

Done! |████████████████████████████████████████| 100% in 12.2s (0.08%/s)

Out[7]:

	GroceryInvoiceGuid	GroceryCustomerGuid	Timestamp	tz_offset	record_available_at	Amount
dtype	VARCHAR	VARCHAR	TIMESTAMP	VARCHAR	TIMESTAMP	FLOAT
unique	74446	500	74386	4	18440	6647
%missing	0.0	0.0	0.0	0.0	0.0	0.0
%empty	0	0	NaN	0	NaN	NaN
entropy	6.214608	5.864452	NaN	0.77445	NaN	NaN
top	003224bd-aad1-4e34-9182-1f8f3a6b0a57	cea213d4-36e4-48c3-ae8d-c7a25911e11c	NaN	+02:00	NaN	NaN
freq	1.0	1324.0	NaN	42695.0	NaN	NaN
mean	NaN	NaN	NaN	NaN	NaN	19.195901
std	NaN	NaN	NaN	NaN	NaN	23.729811
min	NaN	NaN	2022-01-01T04:17:46.000000000	NaN	2022-01-01T05:01:00.000000000	0.0
25%	NaN	NaN	NaN	NaN	NaN	4.29
50%	NaN	NaN	NaN	NaN	NaN	10.605
75%	NaN	NaN	NaN	NaN	NaN	24.55
max	NaN	NaN	2024-12-18T01:14:32.000000000	NaN	2024-12-18T02:01:00.000000000	360.84

Done! |████████████████████████████████████████| 100% in 9.1s (0.11%/s)

Out[7]:

	GroceryInvoiceGuid	GroceryCustomerGuid	Timestamp	tz_offset	record_available_at	Amount
dtype	VARCHAR	VARCHAR	TIMESTAMP	VARCHAR	TIMESTAMP	FLOAT
unique	61513	500	61461	4	15181	6647
%missing	0.0	0.0	0.0	0.0	0.0	0.0
%empty	0	0	NaN	0	NaN	NaN
entropy	6.214608	5.860121	NaN	0.782711	NaN	NaN
top	003224bd-aad1-4e34-9182-1f8f3a6b0a57	cea213d4-36e4-48c3-ae8d-c7a25911e11c	2022-02-08 14:47:42.000	+02:00	2022-08-23 15:01:00.000	1
freq	1.0	1072.0	2.0	33673.0	17.0	1043.0
mean	NaN	NaN	NaN	NaN	NaN	19.165033
std	NaN	NaN	NaN	NaN	NaN	23.732982
min	NaN	NaN	2022-01-01T04:17:46.000000000	NaN	2022-01-01T05:01:00.000000000	0.0
25%	NaN	NaN	NaN	NaN	NaN	4.28
50%	NaN	NaN	NaN	NaN	NaN	10.58
75%	NaN	NaN	NaN	NaN	NaN	24.53
max	NaN	NaN	2024-06-12T07:29:25.000000000	NaN	2024-06-12T08:01:00.000000000	360.84

In [8]:

            
                Copied!
                
# Preview a selection of rows
invoice_source_table.preview()
# Preview a selection of rows
invoice_source_table.preview()

Out[8]:

	GroceryInvoiceGuid	GroceryCustomerGuid	Timestamp	tz_offset	record_available_at	Amount
0	753a59e9-1291-4882-bc7a-39633607e192	07c21f1d-1b16-4a92-bfd2-04d62cfa35ee	2022-01-04 17:15:23	+01:00	2022-01-04 18:01:00	6.17
1	040c86f7-9e16-4468-bf9f-b80afc4a3610	07c21f1d-1b16-4a92-bfd2-04d62cfa35ee	2022-01-17 16:38:55	+01:00	2022-01-17 17:01:00	5.58
2	460fe41e-258c-409d-85bb-b1b639659a02	07c21f1d-1b16-4a92-bfd2-04d62cfa35ee	2022-02-02 18:09:57	+01:00	2022-02-02 19:01:00	242.15
3	46c48917-06a9-4f53-994e-d7ab45717073	07c21f1d-1b16-4a92-bfd2-04d62cfa35ee	2022-03-13 17:38:52	+01:00	2022-03-13 18:01:00	76.63
4	29f43ed8-c684-45a3-8a6e-e5e09f228549	07c21f1d-1b16-4a92-bfd2-04d62cfa35ee	2022-03-19 13:32:36	+01:00	2022-03-19 14:01:00	40.33
5	b649add7-08fa-4185-be0a-3dc351befcd1	07c21f1d-1b16-4a92-bfd2-04d62cfa35ee	2022-03-26 13:38:56	+01:00	2022-03-26 14:01:00	3.52
6	b9353637-8a7e-4ee9-9095-380d2df051f6	07c21f1d-1b16-4a92-bfd2-04d62cfa35ee	2022-04-16 12:53:14	+02:00	2022-04-16 13:01:00	22.30
7	82b59f81-4e08-48f4-b191-06311b429dd7	07c21f1d-1b16-4a92-bfd2-04d62cfa35ee	2022-04-29 19:24:24	+02:00	2022-04-29 20:01:00	51.71
8	8cc9e4cc-7593-4f67-84fc-7b9107f2cb57	07c21f1d-1b16-4a92-bfd2-04d62cfa35ee	2022-05-14 12:27:20	+02:00	2022-05-14 13:01:00	40.20
9	26c8945d-e1d4-4dbf-9841-f46bbf58f556	07c21f1d-1b16-4a92-bfd2-04d62cfa35ee	2022-05-24 13:08:21	+02:00	2022-05-24 14:01:00	2.50

In [9]:

            
                Copied!
                
# Collect additional information on their columns
invoice_source_table.columns_info
# Collect additional information on their columns
invoice_source_table.columns_info

Out[9]:

[ColumnInfo(name='GroceryInvoiceGuid', dtype='VARCHAR', description='Unique identifier of each row in the table, in GUID format. Uniquely identifies each invoice.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='GroceryCustomerGuid', dtype='VARCHAR', description='Unique identifier for each customer, in GUID format.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='Timestamp', dtype='TIMESTAMP', description='The GMT timestamp of when this invoice transaction event occurred.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='tz_offset', dtype='VARCHAR', description='The local timezone offset of the invoice event.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='record_available_at', dtype='TIMESTAMP', description='A timestamp for when this row was added to the database.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='Amount', dtype='FLOAT', description='The total amount of the invoice, including all items and any discounts applied. Cannot be negative.', entity_id=None, semantic_id=None, critical_data_info=None)]

Registering Tables in the Catalog¶

This step, though slightly more intricate than our previous actions, is vital for our subsequent feature engineering tasks.

For accurate feature derivation, FeatureByte needs to understand the 'roles' of various tables.

We categorize tables into four types:

Event tables - These capture distinct business events occurring at specific moments. An example would be customer invoices, specifically, the "Grocery Invoice" in our scenario.
Item tables - These delve into the specifics about an event, such as the products a customer purchased. In our context, this is represented by "Invoice Items".
Slowly Changing Dimension tables - These denote data similar to dimensions but may evolve over time. For instance, customers might shift addresses or update other details. In our use case, this is the "Grocery Customer".
Dimension tables - These tables contain unchanging descriptive data, like information on particular products retailed in a store, exemplified by "Grocery Product" in our setting.

Feel free to explore more on tables and table types, but at this point understanding just basic differences is more than enough.

Let's register each table using respective type:

In [10]:

            
                Copied!
                
                    
                    
                
                

        
customer_table = customer_source_table.create_scd_table(
    name="GROCERYCUSTOMER",
    surrogate_key_column='RowID',
    natural_key_column="GroceryCustomerGuid",
    effective_timestamp_column="ValidFrom",
    current_flag_column ="CurrentRecord",
    record_creation_timestamp_column="record_available_at"
)
invoice_table = invoice_source_table.create_event_table(   
    name="GROCERYINVOICE",
    event_id_column="GroceryInvoiceGuid",
    event_timestamp_column="Timestamp",
    event_timestamp_timezone_offset_column="tz_offset",
    record_creation_timestamp_column="record_available_at"
)
items_table = items_source_table.create_item_table(
    name="INVOICEITEMS",
    event_id_column="GroceryInvoiceGuid",
    item_id_column="GroceryInvoiceItemGuid",
    event_table_name="GROCERYINVOICE",
    record_creation_timestamp_column="record_available_at"    
)
product_table = product_source_table.create_dimension_table(
    name="GROCERYPRODUCT",
    dimension_id_column="GroceryProductGuid"
)
customer_table = customer_source_table.create_scd_table(
    name="GROCERYCUSTOMER",
    surrogate_key_column='RowID',
    natural_key_column="GroceryCustomerGuid",
    effective_timestamp_column="ValidFrom",
    current_flag_column ="CurrentRecord",
    record_creation_timestamp_column="record_available_at"
)
invoice_table = invoice_source_table.create_event_table(   
    name="GROCERYINVOICE",
    event_id_column="GroceryInvoiceGuid",
    event_timestamp_column="Timestamp",
    event_timestamp_timezone_offset_column="tz_offset",
    record_creation_timestamp_column="record_available_at"
)
items_table = items_source_table.create_item_table(
    name="INVOICEITEMS",
    event_id_column="GroceryInvoiceGuid",
    item_id_column="GroceryInvoiceItemGuid",
    event_table_name="GROCERYINVOICE",
    record_creation_timestamp_column="record_available_at"    
)
product_table = product_source_table.create_dimension_table(
    name="GROCERYPRODUCT",
    dimension_id_column="GroceryProductGuid"
)

After this we will be able to see our tables in the catalog:

In [11]:

            
                Copied!
                
display(catalog.list_tables())
display(catalog.list_tables())

	id	name	type	status	entities	created_at
0	676235f9375a850179e42f8f	GROCERYPRODUCT	dimension_table	PUBLIC_DRAFT	[]	2024-12-18T02:39:53.572000
1	676235f7375a850179e42f8e	INVOICEITEMS	item_table	PUBLIC_DRAFT	[]	2024-12-18T02:39:51.944000
2	676235f5375a850179e42f8d	GROCERYINVOICE	event_table	PUBLIC_DRAFT	[]	2024-12-18T02:39:50.049000
3	676235f3375a850179e42f8c	GROCERYCUSTOMER	scd_table	PUBLIC_DRAFT	[]	2024-12-18T02:39:48.264000

Initialize feature job settings¶

The last step we need to tackle is setting up the feature job settings.

Essentially, these settings determine when and how frequently we want to update the feature store. It also sets a 'blind spot' period, which is the time gap between when a feature is computed and the latest available event.

For instance, in our grocery context, if we aim to predict customer spending for the upcoming two weeks, we might consider using all of their invoices up to the present. However, our data pipeline may not capture the most recent invoices immediately due to the time required for data collection from edge devices, processing through ETL, and other steps. If we base our predictions on data up to a certain point, our production scenario might not be accurately represented, leading to poor prediction accuracy.

Lukily, featurebyte is smart enough to compute those settings for us:

In [12]:

            
                Copied!
                
invoice_table.initialize_default_feature_job_setting()
invoice_table.initialize_default_feature_job_setting()

Done! |████████████████████████████████████████| 100% in 9.1s (0.11%/s)

Feature Job Setting Analysis

Feature Job Setting Analysis Report

Warehouse Jobs Statistics

The analysis is for the event table: GROCERYINVOICE
The analysis period starts at 2024-11-20 01:14:32 and ends at 2024-12-18 01:14:32
The column used for the event timestamp is Timestamp
The column used for the record creation timestamp for GROCERYINVOICE is record_available_at

STATISTICS ON TIME BETWEEN GROCERYINVOICE RECORDS CREATIONS
- Average time is 4171.399594320487 s
- Median time is 3600.0 s
- Lowest time is 3600.0 s
- Largest time is 54000.0 s
based on a total of 480 unique record creation timestamps.

The BEST ESTIMATE FOR GROCERYINVOICE UPDATE FREQUENCY is every 1 hour

The longer time between record creations are due to 184 MISSING UPDATES.

GROCERYINVOICE UPDATE TIME starts 1.0 minute and ends 1.0 minute after the start of each 1 hour

Job Frequency Recommendation

The RECOMMENDED FEATURE JOB FREQUENCY is 2 minutes after the start of each 1 hour.
This includes a buffer of 60 s to allow for late jobs.

The 59 jobs that occurred after missing jobs don't seem to have processed significantly older records.

Blind Spot Search

The OPTIMAL BLIND SPOT setting is 120 s to keep late data at less than 0.0 %
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 120 s
- blind spot for 99.9 % of events to land: 120 s
- blind spot for 99.95 % of events to land: 120 s
- blind spot for 99.99 % of events to land: 120 s
- blind spot for 99.995 % of events to land: 120 s
- blind spot for 100.0 % of events to land: 120 s

Feature Job Setting Recommendation

The RECOMMENDED BLIND SPOT setting is 120 s

In SUMMARY, the recommended FEATUREJOB DEFAULT setting is:
period: 3600
offset: 120
blind_spot: 120

The resulting FEATURE CUTOFF offset is 0 s.

Feature Tiles Event Landing Time

For a feature cutoff at 0 s:
- time for 99.5 % of events to land: 120 s
- time for 99.9 % of events to land: 120 s
- time for 99.95 % of events to land: 120 s
- time for 99.99 % of events to land: 120 s
- time for 99.995 % of events to land: 120 s
- time for 100.0 % of events to land: 120 s

Backtest Result

For the feature job setting:
- Period = 3600 s / Offset = 120 s / Blind spot = 120 s
The backtest found that all records would have been processed on time.

Summary

Key findings:

Based on the past records created from 2024-11-20 01:00:00 to 2024-12-18 01:00:00, the table is regularly updated 1.0 minute after the start of each 1 hour within a interval. No job failure or late job has been detected.
The features computation jobs are recommended to be scheduled after the table updates completion and be set 2 minutes after the start of each 1 hour.
Based on the analysis of the records latency, the blind_spot parameter used to determine the window of the features aggregation is recommended to be set at 120 s.

The recommended Default Feature Job setting is:

period: 3600 s
offset: 120 s
blind_spot: 120 s

Done! |████████████████████████████████████████| 100% in 12.1s (0.08%/s)

Feature Job Setting Analysis

Feature Job Setting Analysis Report

Warehouse Jobs Statistics

The analysis is for the event table: GROCERYINVOICE
The analysis period starts at 2024-05-15 07:29:25 and ends at 2024-06-12 07:29:25
The column used for the event timestamp is Timestamp
The column used for the record creation timestamp for GROCERYINVOICE is record_available_at

STATISTICS ON TIME BETWEEN GROCERYINVOICE RECORDS CREATIONS
- Average time is 4176.910911966263 s
- Median time is 3600.0 s
- Lowest time is 3600.0 s
- Largest time is 36000.0 s
based on a total of 481 unique record creation timestamps.

The BEST ESTIMATE FOR GROCERYINVOICE UPDATE FREQUENCY is every 1 hour

The longer time between record creations are due to 182 MISSING UPDATES.

GROCERYINVOICE UPDATE TIME starts 1.0 minute and ends 1.0 minute after the start of each 1 hour

Job Frequency Recommendation

The RECOMMENDED FEATURE JOB FREQUENCY is 2 minutes after the start of each 1 hour.
This includes a buffer of 60 s to allow for late jobs.

The 63 jobs that occurred after missing jobs don't seem to have processed significantly older records.

Blind Spot Search

The OPTIMAL BLIND SPOT setting is 120 s to keep late data at less than 0.0 %
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 120 s
- blind spot for 99.9 % of events to land: 120 s
- blind spot for 99.95 % of events to land: 120 s
- blind spot for 99.99 % of events to land: 120 s
- blind spot for 99.995 % of events to land: 120 s
- blind spot for 100.0 % of events to land: 120 s

Feature Job Setting Recommendation

The RECOMMENDED BLIND SPOT setting is 120 s

In SUMMARY, the recommended FEATUREJOB DEFAULT setting is:
period: 3600
offset: 120
blind_spot: 120

The resulting FEATURE CUTOFF offset is 0 s.

Feature Tiles Event Landing Time

For a feature cutoff at 0 s:
- time for 99.5 % of events to land: 120 s
- time for 99.9 % of events to land: 120 s
- time for 99.95 % of events to land: 120 s
- time for 99.99 % of events to land: 120 s
- time for 99.995 % of events to land: 120 s
- time for 100.0 % of events to land: 120 s

Backtest Result

For the feature job setting:
- Period = 3600 s / Offset = 120 s / Blind spot = 120 s
The backtest found that all records would have been processed on time.

Summary

Key findings:

Based on the past records created from 2024-05-15 07:00:00 to 2024-06-12 07:00:00, the table is regularly updated 1.0 minute after the start of each 1 hour within a interval. No job failure or late job has been detected.
The features computation jobs are recommended to be scheduled after the table updates completion and be set 2 minutes after the start of each 1 hour.
Based on the analysis of the records latency, the blind_spot parameter used to determine the window of the features aggregation is recommended to be set at 120 s.

The recommended Default Feature Job setting is:

period: 3600 s
offset: 120 s
blind_spot: 120 s

You can always override default feature job settings, see update_default_feature_job_setting.

That's it for this tutorial, now we are ready for modeling our entities.

2. Register Tables

Registering tables in the catalog¶

First of all, let's activate our catalog.¶

Get data source¶

Get source tables¶

Exploring Source Tables¶

Registering Tables in the Catalog¶

Initialize feature job settings¶

More details about concepts we discussed in this tutorial:¶

SDK reference for¶