Quick Start Tutorial: Model Training¶
Learning Objectives¶
In this tutorial you will learn:
- How to design an observation set for your use case
- How to materialize training data
- How your ML training environment can consume training data
Set up the prerequisites¶
Learning Objectives
In this section you will:
- start your local featurebyte server
- import libraries
- learn about catalogs
- activate a pre-built catalog
# library imports
import pandas as pd
import numpy as np
import random
# load the featurebyte SDK
import featurebyte as fb
# start the local server, then wait for it to be healthy before proceeding
fb.playground()
14:59:05 | INFO | Using configuration file at: /Users/jevonyeoh/.featurebyte/config.yaml
14:59:05 | WARNING | No valid profile specified. Update config file or specify valid profile name with "use_profile".
14:59:05 | INFO | (1/4) Starting featurebyte services
Container redis Running
Container spark-thrift Running
Container mongo-rs Running
Container featurebyte-worker Running
Container featurebyte-server Running
Container mongo-rs Waiting
Container redis Waiting
Container mongo-rs Waiting
Container redis Healthy
Container mongo-rs Healthy
Container mongo-rs Healthy
14:59:06 | INFO | (2/4) Creating local spark feature store
14:59:08 | INFO | (3/4) Import datasets
14:59:28 | INFO | Dataset grocery already exists, skipping import
14:59:28 | INFO | Dataset healthcare already exists, skipping import
14:59:28 | INFO | Dataset creditcard already exists, skipping import
14:59:28 | INFO | (4/4) Playground environment started successfully. Ready to go! 🚀
Create a pre-built catalog for this tutorial, with the data, metadata, and features already set up¶
Note that creating a pre-built catalog is not a step you would do in a real-life project. This function is specific to the quick-start tutorials: it skips over many of the preparatory steps and gets you to a point where you can materialize features.
In a real-life project you would do the data modeling yourself, declaring the tables, entities, and associated metadata. This is not a frequent task, but it forms the basis for best-practice feature engineering.
# get the functions to create a pre-built catalog
from prebuilt_catalogs import *
# create a new catalog for this tutorial
catalog = create_tutorial_catalog(PrebuiltCatalog.QuickStartModelTraining)
Cleaning up existing tutorial catalogs
14:59:28 | INFO | Catalog activated: quick start model training 20230814:1439
Cleaning catalog: quick start model training 20230814:1439
2 observation tables
Done! |████████████████████████████████████████| 100% in 12.3s (0.08%/s)
Done! |████████████████████████████████████████| 100% in 20.3s (0.05%/s)
Building a quick start catalog for model training named [quick start model training 20230814:1500]
Creating new catalog
15:00:09 | INFO | Catalog activated: quick start model training 20230814:1500
Catalog created
Registering the source tables
Registering the entities
Tagging the entities to columns in the data tables
Populating the feature store with example features
Done! |████████████████████████████████████████| 100% in 6.1s (0.17%/s)
Done! |████████████████████████████████████████| 100% in 6.1s (0.17%/s)
Done! |████████████████████████████████████████| 100% in 6.1s (0.17%/s)
Loading Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.1s
Done! |████████████████████████████████████████| 100% in 6.1s (0.16%/s)
Done! |████████████████████████████████████████| 100% in 6.1s (0.17%/s)
Done! |████████████████████████████████████████| 100% in 6.2s (0.16%/s)
Done! |████████████████████████████████████████| 100% in 6.2s (0.16%/s)
Done! |████████████████████████████████████████| 100% in 6.1s (0.17%/s)
Loading Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.1s
Done! |████████████████████████████████████████| 100% in 6.1s (0.17%/s)
Loading Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.1s
Done! |████████████████████████████████████████| 100% in 6.2s (0.16%/s)
Loading Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.1s
Setting feature readiness
Done! |████████████████████████████████████████| 100% in 6.1s (0.16%/s)
Loading Feature(s) |████████████████████████████████████████| 8/8 [100%] in 0.1s
Catalog created and pre-populated with data and features
Example: Create views from tables in the Catalog¶
# create the views
grocery_customer_view = catalog.get_view("GROCERYCUSTOMER")
grocery_invoice_view = catalog.get_view("GROCERYINVOICE")
grocery_items_view = catalog.get_view("INVOICEITEMS")
grocery_product_view = catalog.get_view("GROCERYPRODUCT")
Create an observation set for your use case¶
Learning Objectives
In this section you will learn:
- the purpose of observation sets
- the relationship between entities, point in time, and observation sets
- how to design an observation set suitable for training data
Case Study: Predicting Customer Spend¶
Your chain of grocery stores wants to target market customers immediately after each purchase. As one step in this marketing campaign, they want to predict future customer spend in the 14 days after a purchase.
Concept: Materialization¶
A feature in FeatureByte is defined by the logical plan for its computation. The act of computing the feature is known as Feature Materialization.
Features are materialized on demand to fulfill historical requests. For prediction purposes, feature values are instead generated by a batch process known as a "Feature Job", which is scheduled according to the feature job settings associated with each feature.
Concept: Observation set¶
An observation set combines entity key values and historical points in time for which you wish to materialize feature values.
The observation set can be a pandas DataFrame or an ObservationTable object representing an observation set in the feature store. The column containing the entity values must use an accepted serving name, the column containing the points in time must be labelled "POINT_IN_TIME", and the point-in-time timestamps must be in UTC.
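As a minimal sketch, an observation set built by hand as a pandas DataFrame might look like this (the GUID values are copied from the sample dataset output; any valid entity values work):

```python
import pandas as pd

# a hand-built observation set: one row per (entity value, point in time)
# the entity column uses the serving name GROCERYCUSTOMERGUID, and the
# point-in-time column is named POINT_IN_TIME with timestamps in UTC
observation_set = pd.DataFrame(
    {
        "POINT_IN_TIME": pd.to_datetime(
            ["2022-06-01 10:00:00", "2022-06-15 14:30:00"], utc=True
        ).tz_localize(None),
        "GROCERYCUSTOMERGUID": [
            "5c96089d-95f7-4a12-ab13-e082836253f1",
            "afeec4ce-0a90-41f1-802b-7ff2bb42b292",
        ],
    }
)
print(observation_set.dtypes)
```

A DataFrame like this can be passed directly wherever an observation set is expected, which is convenient for small ad-hoc requests; for large observation sets, an ObservationTable in the feature store (as created below) scales better.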
Concept: Point in time¶
A point-in-time for a feature refers to a specific moment in the past with which the feature's values are associated.
It is a crucial aspect of historical feature serving, which allows machine learning models to make predictions based on historical data. By providing a point-in-time, a feature can be used to train and test models on past data, enabling them to make accurate predictions for similar situations in the future.
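The key property of a point in time is that a feature value may only use data available strictly before it. A minimal pandas sketch of a 28-day spend feature computed as of a point in time (hypothetical events and amounts, not the FeatureByte implementation):

```python
import pandas as pd

# toy invoice events for one customer (hypothetical amounts)
events = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(
            ["2022-05-01", "2022-05-20", "2022-06-10", "2022-07-01"]
        ),
        "Amount": [10.0, 20.0, 30.0, 40.0],
    }
)

def spend_28d_as_of(point_in_time: pd.Timestamp) -> float:
    """Sum of amounts in the 28 days strictly before the point in time.

    Only events before the point in time are visible, which is what
    prevents label leakage when building training data.
    """
    window_start = point_in_time - pd.Timedelta(days=28)
    mask = (events["Timestamp"] >= window_start) & (events["Timestamp"] < point_in_time)
    return events.loc[mask, "Amount"].sum()

print(spend_28d_as_of(pd.Timestamp("2022-06-15")))  # sees only the May 20 and June 10 events
```

Shifting the point in time changes which events are visible, which is why each row of an observation set pairs an entity value with its own point in time.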
# get the target object
import json
customer_target = catalog.get_target("next_customer_sales_14d")
# display details about the target
info = customer_target.info()
display_info = {
key: info[key] for key in ("id", "target_name", "entities", "window", "primary_table")
}
print(json.dumps(display_info, indent=4))
{
    "id": "64d9d165b15645afd3dacd3f",
    "target_name": "next_customer_sales_14d",
    "entities": [
        {
            "name": "grocerycustomer",
            "serving_names": [
                "GROCERYCUSTOMERGUID"
            ],
            "catalog_name": "quick start model training 20230814:1500"
        }
    ],
    "window": "14d",
    "primary_table": [
        {
            "name": "GROCERYINVOICE",
            "status": "PUBLIC_DRAFT",
            "catalog_name": "quick start model training 20230814:1500"
        }
    ]
}
# create a large observation table from a view
# filter the view to exclude points in time that won't have data for historical windows
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-04-01")) & (
grocery_invoice_view["Timestamp"] < pd.to_datetime("2023-04-01")
)
observation_set_view = grocery_invoice_view[filter].copy()
# create a new observation table
observation_table = observation_set_view.create_observation_table(
name="10,000 Customers immediately after each purchase from Apr-22 to Mar-23",
sample_rows=10000,
columns=["Timestamp", "GroceryCustomerGuid"],
columns_rename_mapping={
"Timestamp": "POINT_IN_TIME",
"GroceryCustomerGuid": "GROCERYCUSTOMERGUID",
},
)
# if the observation table isn't too large, you can materialize it
display(observation_table.to_pandas())
Done! |████████████████████████████████████████| 100% in 24.7s (0.04%/s)
Downloading table |████████████████████████████████████████| 10000/10000 [100%]
| | POINT_IN_TIME | GROCERYCUSTOMERGUID |
|---|---|---|
0 | 2022-04-05 18:55:03 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
1 | 2022-04-11 11:49:05 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
2 | 2022-04-18 15:29:26 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
3 | 2022-05-14 15:00:07 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
4 | 2022-05-20 13:03:26 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
... | ... | ... |
9995 | 2022-05-14 16:02:35 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
9996 | 2022-05-17 09:40:52 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
9997 | 2022-05-29 15:29:15 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
9998 | 2022-06-06 18:40:34 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
9999 | 2022-06-08 09:10:29 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
10000 rows × 2 columns
Materialize Training Data¶
Learning Objectives
In this section you will learn:
- how to create a target observation table
- how to create historical training data using the target observation table
Example: Get target values¶
# Materialize the target
training_data_target_table = customer_target.compute_target_table(
observation_table, observation_table_name="target_observation_table"
)
display(training_data_target_table.to_pandas())
Done! |████████████████████████████████████████| 100% in 24.9s (0.04%/s)
Downloading table |████████████████████████████████████████| 10000/10000 [100%]
| | POINT_IN_TIME | GROCERYCUSTOMERGUID | next_customer_sales_14d |
|---|---|---|---|
0 | 2022-11-14 14:07:14 | abdef773-ab72-43b6-8e77-050804c1c5fc | 111.22 |
1 | 2022-11-05 19:48:29 | 776ed61f-ae99-40b4-989b-1195e4901090 | 15.42 |
2 | 2022-11-18 12:24:01 | 9b1b8037-8506-4a54-981a-3b7e694a489f | 81.43 |
3 | 2022-10-12 14:05:21 | 2b068f1d-d99b-4c2f-a737-46f619a76cc8 | 49.44 |
4 | 2022-07-25 11:16:04 | 94127b9f-1366-4bbe-afea-7cd77225da52 | 53.55 |
... | ... | ... | ... |
9995 | 2022-08-15 15:46:59 | f6a783f7-5091-46fa-8ebf-aa13ec868234 | 136.39 |
9996 | 2022-08-16 14:34:40 | f6a783f7-5091-46fa-8ebf-aa13ec868234 | 136.88 |
9997 | 2022-07-31 13:32:24 | ff38d86f-cd9a-4860-9b0a-eb387bfe0a10 | 0.00 |
9998 | 2022-12-30 08:00:14 | ff38d86f-cd9a-4860-9b0a-eb387bfe0a10 | 40.55 |
9999 | 2022-11-09 17:41:38 | b2261da2-2d30-481b-b40a-6e0c803787b6 | 16.30 |
10000 rows × 3 columns
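Conceptually, next_customer_sales_14d is a forward-looking aggregation: for each point in time, it sums the invoice amounts in the following 14 days. A minimal pandas sketch with hypothetical invoices (not the FeatureByte implementation):

```python
import pandas as pd

# toy invoices for one customer (hypothetical amounts)
invoices = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(["2022-07-26", "2022-08-02", "2022-08-20"]),
        "Amount": [12.5, 41.05, 9.99],
    }
)

def next_sales_14d(point_in_time: pd.Timestamp) -> float:
    # sum invoice amounts strictly after the point in time, up to 14 days later
    window_end = point_in_time + pd.Timedelta(days=14)
    mask = (invoices["Timestamp"] > point_in_time) & (invoices["Timestamp"] <= window_end)
    return invoices.loc[mask, "Amount"].sum()

print(next_sales_14d(pd.Timestamp("2022-07-25")))
```

Because the window looks forward, rows whose point in time falls within 14 days of the end of the available data would have incomplete targets, which is one reason to filter the observation set's time range.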
Example: Get historical values with target¶
# list the feature lists
display(catalog.list_feature_lists())
| | id | name | num_feature | status | deployed | readiness_frac | online_frac | tables | entities | primary_entities | created_at |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 64d9d15fb15645afd3dacd3b | Features | 8 | DRAFT | False | 1.0 | 0.0 | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [grocerycustomer, frenchstate] | [grocerycustomer] | 2023-08-14T07:01:51.756000 |
# get the feature list
feature_list = catalog.get_feature_list("Features")
Loading Feature(s) |████████████████████████████████████████| 8/8 [100%] in 0.1s
# Compute the historical feature table by passing in the observation table that contains the target values
training_table_features = feature_list.compute_historical_feature_table(
training_data_target_table,
historical_feature_table_name="customer training table - invoices Apr-22 to Mar-23 - features only",
)
# display the training data
training_data = training_table_features.to_pandas()
display(training_data)
Done! |████████████████████████████████████████| 100% in 2:23.8 (0.01%/s)
Downloading table |████████████████████████████████████████| 10000/10000 [100%]
| | POINT_IN_TIME | GROCERYCUSTOMERGUID | next_customer_sales_14d | StatePopulation | StateAvgInvoiceAmount_28d | StateMeanLatitude | StateMeanLongitude | CustomerInventoryStability_14d28d | CustomerStateSimilarity_28d | CustomerSpend_28d | CustomerAvgInvoiceAmount_28d |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-04-01 08:43:25 | 352d1de1-4419-40e5-b2a5-6d6922384b05 | 6.79 | 183 | 18.021939 | 48.740582 | 2.237559 | 0.866025 | 0.367632 | 15.98 | 7.990000 |
1 | 2022-04-01 09:57:05 | ed56f1f6-310d-4b7c-9f5b-554103282f15 | 94.06 | 3 | 15.970000 | 48.354199 | -1.871965 | 1.000000 | 0.871330 | 35.84 | 35.840000 |
2 | 2022-04-01 10:41:29 | 8759ff7c-4cad-44e7-82dd-f89c925699be | 15.66 | 14 | 15.516444 | 43.404298 | 3.330159 | 0.931614 | 0.578746 | 54.94 | 10.988000 |
3 | 2022-04-01 11:29:55 | 4b348211-553b-4831-8463-8a1e936f67d4 | 12.85 | 183 | 18.001381 | 48.740582 | 2.237559 | 0.842424 | 0.462505 | 47.62 | 7.936667 |
4 | 2022-04-01 12:20:01 | b21ae11c-83cf-4146-832e-1163413a3295 | 21.73 | 5 | 8.032955 | 49.185500 | -0.530407 | 0.968960 | 0.823614 | 94.11 | 2.940938 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | 2022-12-31 13:16:54 | e8828f69-2a66-4ef2-a66d-5db523f03174 | 71.28 | 18 | 24.036600 | 47.401700 | -1.075038 | 0.818199 | 0.733589 | 160.74 | 40.185000 |
9996 | 2022-12-31 13:45:31 | 5fc2332e-03ac-448d-bf34-f3322cdc295e | 162.79 | 181 | 19.284127 | 48.739038 | 2.242254 | 0.916700 | 0.724985 | 265.07 | 15.592353 |
9997 | 2022-12-31 14:15:06 | b429441a-8a9f-4d54-8aba-835be15192c4 | 2.99 | 181 | 19.304609 | 48.739038 | 2.242254 | 0.874818 | 0.550837 | 44.71 | 8.942000 |
9998 | 2022-12-31 14:30:35 | 2b068f1d-d99b-4c2f-a737-46f619a76cc8 | 38.62 | 13 | 20.206724 | 49.391777 | 0.934599 | 0.246718 | 0.688752 | 122.79 | 15.348750 |
9999 | 2022-12-31 16:42:07 | df0b0c04-f51b-48a5-b330-772cae5b9283 | 17.87 | 8 | 18.112222 | 48.815086 | 4.386779 | 0.937464 | 0.675442 | 51.71 | 5.171000 |
10000 rows × 11 columns
Consuming training data¶
Learning Objectives
In this section you will learn:
- how to save a training file
- how to use a pandas data frame
Example: Save the training data to a file¶
# save training data as a csv file
training_data.to_csv("training_data.csv", index=False)
# save the training file as a parquet file
training_data.to_parquet("training_data.parquet")
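To load these files back later, for example in a separate training job, the matching pandas readers apply. A small self-contained round-trip sketch (hypothetical values; note that CSV does not preserve dtypes, so the point-in-time column must be re-parsed):

```python
import os
import tempfile

import pandas as pd

# a tiny stand-in for the real training data
df = pd.DataFrame(
    {
        "POINT_IN_TIME": pd.to_datetime(["2022-04-05 18:55:03"]),
        "next_customer_sales_14d": [111.22],
    }
)
path = os.path.join(tempfile.gettempdir(), "training_data_roundtrip.csv")
df.to_csv(path, index=False)

# parse_dates restores the datetime dtype that CSV storage loses
reloaded = pd.read_csv(path, parse_dates=["POINT_IN_TIME"])
print(reloaded.dtypes)
```

Parquet, by contrast, preserves dtypes, so pd.read_parquet("training_data.parquet") needs no extra arguments.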
Example: Training a scikit learn model¶
Note that you will need to install scikit-learn; see https://scikit-learn.org/stable/install.html for instructions.
# EDA on the training data
training_data.describe()
| | next_customer_sales_14d | StatePopulation | StateAvgInvoiceAmount_28d | StateMeanLatitude | StateMeanLongitude | CustomerInventoryStability_14d28d | CustomerStateSimilarity_28d | CustomerSpend_28d | CustomerAvgInvoiceAmount_28d |
|---|---|---|---|---|---|---|---|---|---|
count | 10000.000000 | 10000.00000 | 9998.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 9608.000000 |
mean | 65.452547 | 79.67350 | 18.213624 | 45.447747 | 3.160511 | 0.758742 | 0.588101 | 134.637051 | 18.029736 |
std | 68.585633 | 75.67829 | 3.626862 | 9.410332 | 8.875622 | 0.300981 | 0.220383 | 125.196521 | 14.809287 |
min | 0.000000 | 1.00000 | 4.000000 | -12.713308 | -50.017299 | 0.000000 | 0.000000 | 0.000000 | 0.620000 |
25% | 14.865000 | 14.00000 | 16.569308 | 44.663768 | 2.237559 | 0.727607 | 0.481496 | 42.247500 | 8.290000 |
50% | 44.275000 | 33.00000 | 17.812956 | 48.211446 | 2.241215 | 0.887224 | 0.625734 | 96.640000 | 14.594881 |
75% | 93.922500 | 180.00000 | 20.154028 | 48.739799 | 5.054081 | 0.946350 | 0.741034 | 191.910000 | 23.050256 |
max | 487.200000 | 183.00000 | 47.358750 | 50.669452 | 45.189819 | 1.000000 | 1.000000 | 837.360000 | 332.300000 |
# do any columns in the training data contain missing values?
training_data.isna().any()
POINT_IN_TIME                        False
GROCERYCUSTOMERGUID                  False
next_customer_sales_14d              False
StatePopulation                      False
StateAvgInvoiceAmount_28d             True
StateMeanLatitude                    False
StateMeanLongitude                   False
CustomerInventoryStability_14d28d    False
CustomerStateSimilarity_28d          False
CustomerSpend_28d                    False
CustomerAvgInvoiceAmount_28d          True
dtype: bool
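Two columns contain missing values. HistGradientBoostingRegressor, used below, handles NaNs natively, but many estimators do not. A minimal median-imputation sketch in plain pandas (column names reused from above, values hypothetical):

```python
import numpy as np
import pandas as pd

# tiny stand-in for the columns with missing values
df = pd.DataFrame(
    {
        "StateAvgInvoiceAmount_28d": [16.5, np.nan, 20.1],
        "CustomerAvgInvoiceAmount_28d": [8.3, 14.6, np.nan],
    }
)
# fill each numeric column's missing values with that column's median
imputed = df.fillna(df.median(numeric_only=True))
print(imputed.isna().any().any())  # False: no missing values remain
```

For production pipelines, scikit-learn's SimpleImputer offers the same strategy as a reusable transformer that remembers the training-set medians.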
! pip install scikit-learn
Collecting scikit-learn
  Obtaining dependency information for scikit-learn from https://files.pythonhosted.org/packages/e1/5f/0b5b11fd766b674b0eb887e15006175503f23c230ced2a22fb186262e1e5/scikit_learn-1.3.0-cp310-cp310-macosx_12_0_arm64.whl.metadata
  Using cached scikit_learn-1.3.0-cp310-cp310-macosx_12_0_arm64.whl.metadata (11 kB)
Requirement already satisfied: numpy>=1.17.3 in /Users/jevonyeoh/Library/Caches/pypoetry/virtualenvs/featurebyte-tzUCsHv9-py3.10/lib/python3.10/site-packages (from scikit-learn) (1.24.4)
Requirement already satisfied: scipy>=1.5.0 in /Users/jevonyeoh/Library/Caches/pypoetry/virtualenvs/featurebyte-tzUCsHv9-py3.10/lib/python3.10/site-packages (from scikit-learn) (1.9.3)
Collecting joblib>=1.1.1 (from scikit-learn)
  Obtaining dependency information for joblib>=1.1.1 from https://files.pythonhosted.org/packages/10/40/d551139c85db202f1f384ba8bcf96aca2f329440a844f924c8a0040b6d02/joblib-1.3.2-py3-none-any.whl.metadata
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=2.0.0 (from scikit-learn)
  Obtaining dependency information for threadpoolctl>=2.0.0 from https://files.pythonhosted.org/packages/81/12/fd4dea011af9d69e1cad05c75f3f7202cdcbeac9b712eea58ca779a72865/threadpoolctl-3.2.0-py3-none-any.whl.metadata
  Using cached threadpoolctl-3.2.0-py3-none-any.whl.metadata (10.0 kB)
Using cached scikit_learn-1.3.0-cp310-cp310-macosx_12_0_arm64.whl (9.5 MB)
Downloading joblib-1.3.2-py3-none-any.whl (302 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.2/302.2 kB 6.9 MB/s eta 0:00:00
Using cached threadpoolctl-3.2.0-py3-none-any.whl (15 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.3.2 scikit-learn-1.3.0 threadpoolctl-3.2.0
# use sklearn to train a gradient boosting regression model on the training data
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# split the data into training and test sets
# drop the identifier columns and the target itself from the features,
# otherwise the model would see the answer it is trying to predict (label leakage)
X_train, X_test, y_train, y_test = train_test_split(
    training_data.drop(
        columns=["GROCERYCUSTOMERGUID", "POINT_IN_TIME", "next_customer_sales_14d"]
    ),
    training_data["next_customer_sales_14d"],
    test_size=0.2,
    random_state=42,
)
# train the model
model = HistGradientBoostingRegressor()
model.fit(X_train, y_train)
# get predictions
y_pred = model.predict(X_test)
# calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error: ", mse)
# save the model
import joblib
joblib.dump(model, "model.pkl")
['model.pkl']
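A useful sanity check on any regression MSE is to compare it against a naive baseline that always predicts the training-set mean. A self-contained numpy sketch, with synthetic values whose mean and spread roughly match the describe() output above:

```python
import numpy as np

# synthetic target values loosely matching the describe() statistics
rng = np.random.default_rng(0)
y_train = rng.normal(loc=65.0, scale=68.0, size=8000)
y_test = rng.normal(loc=65.0, scale=68.0, size=2000)

# baseline: always predict the training-set mean
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_mse = np.mean((y_test - baseline_pred) ** 2)
print(baseline_mse)
```

A trained model only adds value if its test MSE comes in well below this baseline, which for a target with this spread is on the order of the target variance (roughly 68² ≈ 4600).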
Next Steps¶
Now that you've completed the quick-start model training tutorial, you can put your knowledge into practice or learn more:
- Learn more about materializing features via the "Deep Dive Materializing Features" tutorial
- Put your knowledge into practice by creating features in the "credit card dataset feature engineering playground" or "healthcare dataset feature engineering playground" workspaces
- Learn more about feature governance via the "Quick Start Feature Governance" tutorial