Deep Dive Tutorial: Materializing Features¶
Learning Objectives¶
In this tutorial you will learn:
- How to construct an observation set
- How features, entities, and observation sets are used together
- How to preview features
- How to get historical values
- How and why to deploy features
- How to serve and consume deployed features
Set up the prerequisites¶
Learning Objectives
In this section you will:
- start your local featurebyte server
- import libraries
- learn about catalogs
- activate a pre-built catalog
Load the featurebyte library and connect to the local instance of featurebyte¶
# library imports
import pandas as pd
import numpy as np
# load the featurebyte SDK
import featurebyte as fb
# start the local server, then wait for it to be healthy before proceeding
fb.playground()
15:12:54 | INFO | Using configuration file at: /Users/jevonyeoh/.featurebyte/config.yaml
15:12:54 | WARNING | No valid profile specified. Update config file or specify valid profile name with "use_profile".
15:12:54 | INFO | (1/4) Starting featurebyte services
Container spark-thrift Running
Container redis Running
Container mongo-rs Running
Container featurebyte-server Running
Container featurebyte-worker Running
Container mongo-rs Waiting
Container mongo-rs Waiting
Container redis Waiting
Container mongo-rs Healthy
Container mongo-rs Healthy
Container redis Healthy
15:12:56 | INFO | (2/4) Creating local spark feature store
15:12:56 | INFO | (3/4) Import datasets
15:13:02 | INFO | Dataset grocery already exists, skipping import
15:13:02 | INFO | Dataset healthcare already exists, skipping import
15:13:02 | INFO | Dataset creditcard already exists, skipping import
15:13:02 | INFO | (4/4) Playground environment started successfully. Ready to go! 🚀
Create a pre-built catalog for this tutorial, with the data, metadata, and features already set up¶
Note that creating a pre-built catalog is not a step you will do in real life. It is a function specific to this quick-start tutorial that skips over many of the preparatory steps and gets you to a point where you can materialize features.
In a real-life project you would do data modeling, declaring the tables, entities, and the associated metadata. This would not be a frequent task, but forms the basis for best-practice feature engineering.
# get the functions to create a pre-built catalog
from prebuilt_catalogs import *
# create a new catalog for this tutorial
catalog = create_tutorial_catalog(PrebuiltCatalog.DeepDiveMaterializingFeatures)
Cleaning up existing tutorial catalogs
15:13:03 | INFO | Catalog activated: deep dive materializing features 20230814:1507
Cleaning catalog: deep dive materializing features 20230814:1507 2 observation tables Done! |████████████████████████████████████████| 100% in 9.2s (0.11%/s) |████████████████████████████████████████| █▆▄ 100% in 7s (~0s, 0.1%/ Done! |████████████████████████████████████████| 100% in 9.3s (0.11%/s)
15:13:25 | INFO | Catalog activated: deep dive materializing features 20230814:1513
Building a deep dive catalog for materializing features named [deep dive materializing features 20230814:1513]
Creating new catalog
Catalog created
Registering the source tables
Registering the entities
Tagging the entities to columns in the data tables
Populating the feature store with example features
Done! |████████████████████████████████████████| 100% in 6.1s (0.17%/s)
Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 0.1s
Catalog created and pre-populated with data and features
Load the tables for this catalog¶
# get the tables for this catalog
grocery_customer_table = catalog.get_table("GROCERYCUSTOMER")
grocery_items_table = catalog.get_table("INVOICEITEMS")
grocery_invoice_table = catalog.get_table("GROCERYINVOICE")
grocery_product_table = catalog.get_table("GROCERYPRODUCT")
Create views for the tables in this catalog¶
# create the views
grocery_customer_view = grocery_customer_table.get_view()
grocery_invoice_view = grocery_invoice_table.get_view()
grocery_items_view = grocery_items_table.get_view()
grocery_product_view = grocery_product_table.get_view()
How to construct an observation set¶
Learning Objectives
In this section you will learn:
- the purpose of observation sets
- the relationship between entities, point in time, and observation sets
- how to construct an observation set
Concept: Materialization¶
A feature in FeatureByte is defined by the logical plan for its computation. The act of computing the feature is known as Feature Materialization.
Features are materialized on demand to fulfill historical requests. For prediction purposes, feature values are instead generated by a batch process called a "Feature Job", which is scheduled according to the settings defined for each feature.
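As a rough illustration of how a Feature Job schedule could follow from its settings, the sketch below uses the values that appear in the feature definition shown later in this tutorial (frequency of 3600s, time_modulo_frequency of 90s, blind_spot of 0s). It assumes that jobs run at epoch-aligned multiples of the frequency plus the offset; the helper `next_job_time` is hypothetical and is not part of the featurebyte SDK.

```python
import pandas as pd


def next_job_time(now, frequency_s, time_modulo_frequency_s):
    """Return the first assumed job time at or after `now` (hypothetical helper)."""
    # whole seconds since the Unix epoch (naive timestamps are treated as UTC)
    now_s = now.value // 10**9
    # seconds elapsed since the most recent scheduled slot
    elapsed = (now_s - time_modulo_frequency_s) % frequency_s
    wait = 0 if elapsed == 0 else frequency_s - elapsed
    return pd.Timestamp(now_s + wait, unit="s")


now = pd.Timestamp("2023-08-14 07:13:57")
job_time = next_job_time(now, frequency_s=3600, time_modulo_frequency_s=90)
# assumption: the blind spot is subtracted from the job time to get the data cutoff
data_cutoff = job_time - pd.Timedelta(seconds=0)
print(job_time, data_cutoff)
```

Under these assumptions the job runs 90 seconds after the top of each hour, and a blind spot of zero means the job would consider events right up to the job time.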
Concept: Observation set¶
An observation set combines entity key values and historical points-in-time, for which you wish to materialize feature values.
The observation set can be a Pandas DataFrame or an ObservationTable object representing an observation set in the feature store.
Concept: Point in time¶
A point-in-time for a feature refers to a specific moment in the past with which the feature's values are associated.
It is a crucial aspect of historical feature serving, which allows machine learning models to make predictions based on historical data. By providing a point-in-time, a feature can be used to train and test models on past data, enabling them to make accurate predictions for similar situations in the future.
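To make the idea concrete, here is a minimal pandas-only sketch of a point-in-time-correct aggregation: for each observation row, only events before the point in time (and within a 4-week window) contribute to the value. The toy data, the `spend_4w` column, and the exact window boundaries are assumptions for illustration; FeatureByte performs the real computation in the feature store.

```python
import pandas as pd

# toy invoice events (made-up data)
invoices = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": ["a", "a", "a", "b"],
        "Timestamp": pd.to_datetime(
            ["2022-03-01", "2022-03-20", "2022-04-10", "2022-03-25"]
        ),
        "Amount": [10.0, 20.0, 40.0, 5.0],
    }
)

# one row per entity key and point in time
toy_observation_set = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": ["a", "b"],
        "POINT_IN_TIME": pd.to_datetime(["2022-04-01", "2022-04-01"]),
    }
)


def spend_in_window(row, window=pd.Timedelta(weeks=4)):
    """Sum the customer's invoice amounts in the window ending at the point in time."""
    mask = (
        (invoices["GROCERYCUSTOMERGUID"] == row["GROCERYCUSTOMERGUID"])
        & (invoices["Timestamp"] < row["POINT_IN_TIME"])
        & (invoices["Timestamp"] >= row["POINT_IN_TIME"] - window)
    )
    return invoices.loc[mask, "Amount"].sum()


toy_observation_set["spend_4w"] = toy_observation_set.apply(spend_in_window, axis=1)
print(toy_observation_set)
```

Customer "a" has an invoice on 2022-04-10, but it is ignored because it falls after the point in time; using it would leak future information into the feature.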
An observation set is created as a Pandas DataFrame containing the keys for the primary entity, and points in time. The column name for the primary entity must be its serving name, and the column name for the point in time must be "POINT_IN_TIME".
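The naming rules above can be checked before sending an observation set to the feature store. The sketch below is a hypothetical helper, not part of the featurebyte SDK; the datetime check is an extra assumption about what a well-formed point-in-time column looks like.

```python
import pandas as pd


def validate_observation_set(df, serving_name):
    """Return a list of problems found in an observation set (hypothetical helper)."""
    problems = []
    for column in (serving_name, "POINT_IN_TIME"):
        if column not in df.columns:
            problems.append(f"missing column: {column}")
    # assumption: points in time should be stored as datetimes, not strings
    if "POINT_IN_TIME" in df.columns and not pd.api.types.is_datetime64_any_dtype(
        df["POINT_IN_TIME"]
    ):
        problems.append("POINT_IN_TIME is not a datetime column")
    return problems


good = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": ["a900e82a-5742-4929-aaf7-7e79ed5383f2"],
        "POINT_IN_TIME": pd.to_datetime(["2022-04-14 20:01:23"]),
    }
)
bad = good.rename(columns={"POINT_IN_TIME": "Timestamp"})

print(validate_observation_set(good, "GROCERYCUSTOMERGUID"))
print(validate_observation_set(bad, "GROCERYCUSTOMERGUID"))
```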
Example: Create an observation set based upon events¶
Some use cases are about events, and require predictions to be triggered when a specified event occurs.
For a use case requiring predictions about a grocery customer whenever an invoice event occurs, your observation set may be sampled from historical invoices.
# show the serving name for grocery customer
entity_list = catalog.list_entities()
display(entity_list[entity_list.name == "grocerycustomer"])
id | name | serving_names | created_at | |
---|---|---|---|---|
3 | 64d9d431b9f8dba844210a3e | grocerycustomer | [GROCERYCUSTOMERGUID] | 2023-08-14T07:13:53.350000 |
# get a sample of 200 customer IDs and invoice event timestamps from 01-Apr-2022 to 31-Mar-2023
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-04-01")) & (
grocery_invoice_view["Timestamp"] <= pd.to_datetime("2023-03-31")
)
observation_set = (
grocery_invoice_view[filter]
.sample(200)[["GroceryCustomerGuid", "Timestamp"]]
.rename(
{
"Timestamp": "POINT_IN_TIME",
"GroceryCustomerGuid": "GROCERYCUSTOMERGUID",
},
axis=1,
)
)
display(observation_set)
GROCERYCUSTOMERGUID | POINT_IN_TIME | |
---|---|---|
0 | a900e82a-5742-4929-aaf7-7e79ed5383f2 | 2022-04-14 20:01:23 |
1 | 7a024068-3f99-4114-9d90-3a61f679be51 | 2022-07-05 16:03:08 |
2 | 5b185248-658c-4dbe-bbb7-70d215fb6a05 | 2022-12-20 07:59:08 |
3 | 12c2d702-1b92-4375-8fd4-5b3bd18f7d87 | 2022-11-13 17:09:40 |
4 | d7316f3d-6ea9-49b6-97f0-3d20ea9d1331 | 2023-01-31 18:11:55 |
... | ... | ... |
195 | 4eb4ee84-ee13-4eec-9c26-61b6eb4ba35b | 2022-10-31 09:22:10 |
196 | 2b54ef0e-8b02-4f1e-896a-767d23a6162a | 2022-09-03 12:17:46 |
197 | 3eb57343-4b91-4e06-bed5-c763514c4e64 | 2022-04-05 18:52:48 |
198 | 144a0fe4-2137-43f6-b266-411b9eb7cb31 | 2023-01-30 14:21:34 |
199 | 888aa655-927f-41c8-a0ba-7dab2872fca8 | 2022-08-13 15:18:07 |
200 rows × 2 columns
Concept: Observation table¶
An ObservationTable object is a representation of an observation set in the feature store. Unlike a local Pandas DataFrame, the ObservationTable is part of the catalog and can be shared or reused.
ObservationTable objects can be created from a source table or from a view after subsampling.
Example: Create an observation table based upon events¶
# create a large observation table from a view
# observation tables are the recommended workflow for training data
# filter the view to exclude points in time that won't have data for historical windows
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-04-01")) & (
grocery_invoice_view["Timestamp"] < pd.to_datetime("2023-04-01")
)
observation_set_view = grocery_invoice_view[filter].copy()
# create a new observation table
observation_table = observation_set_view.create_observation_table(
name="10000 customers who were active between 01-Apr-2022 and 31-Mar-2023",
sample_rows=10000,
columns=["Timestamp", "GroceryCustomerGuid"],
columns_rename_mapping={
"Timestamp": "POINT_IN_TIME",
"GroceryCustomerGuid": "GROCERYCUSTOMERGUID",
},
)
# if the observation table isn't too large, you can materialize it
display(observation_table.to_pandas())
Done! |████████████████████████████████████████| 100% in 25.3s (0.04%/s) Downloading table |████████████████████████████████████████| 10000/10000 [100%]
POINT_IN_TIME | GROCERYCUSTOMERGUID | |
---|---|---|
0 | 2022-04-05 06:51:50 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
1 | 2022-04-05 18:55:03 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
2 | 2022-04-08 13:10:00 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
3 | 2022-04-11 11:49:05 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
4 | 2022-04-18 15:29:26 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
... | ... | ... |
9995 | 2022-08-22 09:26:49 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
9996 | 2022-09-09 16:02:52 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
9997 | 2022-09-09 16:06:59 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
9998 | 2022-09-23 16:05:45 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
9999 | 2022-10-04 16:38:45 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
10000 rows × 2 columns
Example: Create an observation set based upon regularly scheduled batch predictions¶
Some use cases require predictions to be triggered at regular time periods. Some use cases have conditions for which only a subset of entities require predictions.
A use case requiring monthly predictions for recently active customers may use an observation set containing sample customer IDs combined with predefined timestamps.
# define a function to list a sample of the customers who were active in a given month
def get_recently_active_customers(month_number):
    # filter the invoices by month
    filter = (grocery_invoice_view["Timestamp"].dt.month == month_number) & (
        grocery_invoice_view["Timestamp"].dt.year == 2022
    )
    # get a list of customers who made an invoice in the month
    recently_active_customers = (
        grocery_invoice_view[filter].sample(200)["GroceryCustomerGuid"].unique()
    )
    # get the start of the month
    point_in_time = pd.Timestamp(f"2022-{month_number}-01")
    # get the end of the month
    end_of_month = point_in_time + pd.DateOffset(months=1)
    # get the point in time by subtracting 0.001 second from the end of the month
    point_in_time = end_of_month - pd.Timedelta(seconds=0.001)
    # combine the point in time with the customer IDs
    recently_active_customers = pd.DataFrame(
        {
            "GROCERYCUSTOMERGUID": recently_active_customers,
            "POINT_IN_TIME": point_in_time,
        }
    )
    return recently_active_customers
# create an observation set comprised of up to 200 customers per month who were active in that month in the second half of 2022
observation_set = pd.concat(
[get_recently_active_customers(month_number) for month_number in range(7, 13)],
ignore_index=True,
)
display(observation_set)
GROCERYCUSTOMERGUID | POINT_IN_TIME | |
---|---|---|
0 | 575ceb64-e6ef-446d-9a38-929e35e4cbef | 2022-07-31 23:59:59.999 |
1 | b95f380e-7e7b-4bca-9762-fd9a4fd07419 | 2022-07-31 23:59:59.999 |
2 | cfd39ed9-3140-4af5-9f72-77881aa6c2a8 | 2022-07-31 23:59:59.999 |
3 | 79b85aee-d548-4e6d-89b0-6969fcce5feb | 2022-07-31 23:59:59.999 |
4 | db2d5721-8869-40f7-984c-a94d614fdf69 | 2022-07-31 23:59:59.999 |
... | ... | ... |
856 | ff38d86f-cd9a-4860-9b0a-eb387bfe0a10 | 2022-12-31 23:59:59.999 |
857 | 5fc2332e-03ac-448d-bf34-f3322cdc295e | 2022-12-31 23:59:59.999 |
858 | 6132395b-aa85-4fc7-849d-8b8bbd47e1f9 | 2022-12-31 23:59:59.999 |
859 | c6ef9073-3351-4f54-869a-4c926a479520 | 2022-12-31 23:59:59.999 |
860 | 20f61507-e7d7-450d-b44f-665d1dfd889f | 2022-12-31 23:59:59.999 |
861 rows × 2 columns
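The month-end points in time seen in this output can also be generated in one vectorized step. The following is a small pandas-only sketch of that arithmetic, offered as an alternative illustration rather than as the tutorial's method.

```python
import pandas as pd

# first day of each month from Jul-2022 to Dec-2022
month_starts = pd.date_range("2022-07-01", "2022-12-01", freq="MS")
# step to the start of the following month, then back by one millisecond
month_end_points = month_starts + pd.DateOffset(months=1) - pd.Timedelta(milliseconds=1)
print(month_end_points)
```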
Previewing features¶
Learning Objectives
In this section you will learn:
- how to preview features
- the limitations of previews
Example: Preview features¶
During feature prototyping, new features may not yet have been saved to the catalog. A data scientist will want to preview sample feature values to sanity-check the feature declaration.
# create a lookup feature that is the city in which the customer resides
customer_city_lookup = grocery_customer_view.City.as_feature("CustomerCity")
# preview materialized values for the unsaved feature
display(customer_city_lookup.preview(observation_set.sample(5)))
GROCERYCUSTOMERGUID | POINT_IN_TIME | CustomerCity | |
---|---|---|---|
266 | dd1dcef9-26b3-4de6-95b0-36410c1ecf98 | 2022-08-31 23:59:59.999 | TOURCOING |
19 | 9e88c6d9-7c42-4a00-96b0-0012d79a1e15 | 2022-07-31 23:59:59.999 | THIAIS |
789 | ea512344-adc5-45ac-a419-9613c61a8e98 | 2022-12-31 23:59:59.999 | LA CIOTAT |
172 | 32a95683-859c-486d-ab56-8621affcce2a | 2022-08-31 23:59:59.999 | LILLE |
555 | 08d9c64b-b5e1-40d3-9964-0b3e216ff0c7 | 2022-10-31 23:59:59.999 | SAINT-LOUIS |
Feature previews are not suited to creating training files or feature serving. Previews have a limitation of 50 rows and do not create an audit trail.
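Given the 50-row limit, it can help to cap the size of whatever is passed to a preview. The sketch below shows one way to do that with plain pandas; `sample_for_preview` is a hypothetical helper and the toy observation set is made up.

```python
import pandas as pd

PREVIEW_ROW_LIMIT = 50  # the preview limit stated in this tutorial


def sample_for_preview(observation_set, limit=PREVIEW_ROW_LIMIT, seed=0):
    """Return at most `limit` randomly chosen rows (hypothetical helper)."""
    n_rows = min(limit, len(observation_set))
    return observation_set.sample(n=n_rows, random_state=seed)


# toy observation set with 200 rows (made-up keys)
toy_observation_set = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": [f"customer-{i}" for i in range(200)],
        "POINT_IN_TIME": pd.Timestamp("2022-07-31 23:59:59.999"),
    }
)
preview_rows = sample_for_preview(toy_observation_set)
print(len(preview_rows))
```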
Create training data¶
Learning Objectives
In this section you will learn:
- how to design an observation set suitable for training data
- how to get historical values for the target
- how to get historical values for a feature list, and create training data
Design an Observation Set for Training¶
Observation Training Design: A training data observation set should typically meet the following criteria:
- is collected from a time period that starts after the earliest data availability timestamp plus the longest time window in the features
- is collected from a time period that ends before the latest data timestamp less the time window of the target value
- uses points in time that align with the anticipated timing of the use case inference, whether that is based on a regular schedule, triggered by an event, or any other timing mechanism
- does not have duplicate rows
- has a column containing the primary entity of the use case, named with its serving name
- has a column, named "POINT_IN_TIME", containing the points in time
- for the same entity key, has points in time separated by intervals greater than the horizon of the target, to avoid leakage
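Several of these criteria can be checked mechanically. The sketch below is a hypothetical pandas helper (not part of the featurebyte SDK) that flags duplicate rows, out-of-range points in time, and points in time for the same entity that are closer together than the target horizon. The toy data and the date bounds are assumptions chosen for illustration.

```python
import pandas as pd


def check_training_observation_set(
    df, entity_column, earliest_start, latest_end, target_horizon
):
    """Return a list of design issues found (hypothetical helper)."""
    issues = []
    if df.duplicated().any():
        issues.append("duplicate rows")
    if (df["POINT_IN_TIME"] < earliest_start).any():
        issues.append("points in time before the earliest safe start")
    if (df["POINT_IN_TIME"] > latest_end).any():
        issues.append("points in time after the latest safe end")
    # gap between consecutive points in time for the same entity
    gaps = (
        df.sort_values("POINT_IN_TIME")
        .groupby(entity_column)["POINT_IN_TIME"]
        .diff()
        .dropna()
    )
    if (gaps <= target_horizon).any():
        issues.append("points in time for the same entity closer than the target horizon")
    return issues


toy = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": ["a", "a", "b"],
        "POINT_IN_TIME": pd.to_datetime(["2022-03-01", "2022-03-05", "2022-01-10"]),
    }
)
issues = check_training_observation_set(
    toy,
    entity_column="GROCERYCUSTOMERGUID",
    earliest_start=pd.Timestamp("2022-01-29"),
    latest_end=pd.Timestamp("2023-06-16"),
    target_horizon=pd.Timedelta(days=14),
)
print(issues)
```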
Case Study: Predicting Customer Spend¶
Your chain of grocery stores wants to target market customers immediately after each purchase. As one step in this marketing campaign, they want to predict future customer spend in the 14 days after a purchase.
Example: Create an observation table for training data¶
# describe the customer view
display(grocery_customer_view.describe())
RowID | GroceryCustomerGuid | ValidFrom | Gender | Title | GivenName | MiddleInitial | Surname | StreetAddress | City | State | PostalCode | BrowserUserAgent | DateOfBirth | Latitude | Longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dtype | VARCHAR | VARCHAR | TIMESTAMP | VARCHAR | VARCHAR | VARCHAR | VARCHAR | VARCHAR | VARCHAR | VARCHAR | VARCHAR | VARCHAR | VARCHAR | DATE | FLOAT | FLOAT |
unique | 530 | 500 | 530 | 2 | 4 | 347 | 26 | 352 | 512 | 300 | 27 | 353 | 82 | 495 | 530 | 530 |
%missing | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
%empty | 0 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN |
entropy | 6.214608 | 6.191446 | NaN | 0.692285 | 1.146938 | 5.726251 | 2.925542 | 5.749627 | 6.201803 | 5.435211 | 2.49532 | 5.763347 | 3.814598 | NaN | NaN | NaN |
top | 0069200d-adf5-490a-acca-14bdf78072a0 | 0b7196a2-2dab-4218-a234-e193f7bc4470 | 2019-01-01 07:23:45 | male | Mr. | Joanna | A | Saindon | 1 cours Jean Jaures | PARIS | Île-de-France | 75004 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | NaN | -12.704022 | -0.102024 |
freq | 1.0 | 3.0 | 1.0 | 276.0 | 264.0 | 5.0 | 66.0 | 6.0 | 2.0 | 25.0 | 189.0 | 5.0 | 51.0 | NaN | 1.0 | 1.0 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 46.50512 | 2.383389 |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.108698 | 7.822694 |
min | NaN | NaN | 2019-01-01T07:23:45.000000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1937-07-02T00:00:00.000000000 | -12.71811 | -61.12404 |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 44.861372 | 1.959153 |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 48.555884 | 2.40135 |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 48.912734 | 4.734203 |
max | NaN | NaN | 2023-01-30T19:14:03.000000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003-11-05T00:00:00.000000000 | 51.11185 | 45.214809 |
Note that there are 500 unique customers (the GroceryCustomerGuid column has 500 unique values).
# describe the invoice view
display(grocery_invoice_view.describe())
GroceryInvoiceGuid | GroceryCustomerGuid | Timestamp | tz_offset | Amount | |
---|---|---|---|---|---|
dtype | VARCHAR | VARCHAR | TIMESTAMP | VARCHAR | FLOAT |
unique | 39828 | 500 | 39803 | 4 | 6668 |
%missing | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
%empty | 0 | 0 | NaN | 0 | NaN |
entropy | 6.214608 | 5.824943 | NaN | 0.817283 | NaN |
top | 000949fe-1884-40bb-939d-a52df200981f | 3019bdbf-667c-4081-acb5-26cd2d559c5e | 2022-01-05 11:34:17 | +02:00 | 1.0 |
freq | 1.0 | 639.0 | 2.0 | 22375.0 | 834.0 |
mean | NaN | NaN | NaN | NaN | 18.355359 |
std | NaN | NaN | NaN | NaN | 22.735611 |
min | NaN | NaN | 2022-01-01T04:17:46.000000000 | NaN | 0.0 |
25% | NaN | NaN | NaN | NaN | 4.0 |
50% | NaN | NaN | NaN | NaN | 10.2 |
75% | NaN | NaN | NaN | NaN | 23.52 |
max | NaN | NaN | 2023-06-30T21:19:35.000000000 | NaN | 354.39 |
Note that the earliest data timestamp is at the beginning of 2022, and the timestamps end in the present.
# get the customer feature list
customer_feature_list = catalog.get_feature_list("CustomerFeatures")
# display details about the features in the customer feature list
display(customer_feature_list.list_features())
Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 0.2s
id | name | version | dtype | readiness | online_enabled | tables | primary_tables | entities | primary_entities | created_at | is_default | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 64d9d432b9f8dba844210a46 | StateMeanLongitude | V230814 | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-08-14T07:13:57.195000 | True |
1 | 64d9d432b9f8dba844210a45 | StateMeanLatitude | V230814 | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-08-14T07:13:56.928000 | True |
2 | 64d9d432b9f8dba844210a44 | CustomerInventoryMostFrequent_4w | V230814 | VARCHAR | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-08-14T07:13:56.680000 | True |
3 | 64d9d432b9f8dba844210a43 | CustomerInventoryEntropy_4w | V230814 | FLOAT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-08-14T07:13:55.752000 | True |
Note that the longest time window in the features is 4 weeks.
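One of the features listed above, CustomerInventoryEntropy_4w, is an entropy over counts per product group. As a hedged illustration of what an entropy value means, the sketch below computes natural-log Shannon entropy from a set of counts. Natural log is an assumption, although it is consistent with the Gender column of the customer describe output (530 rows, 2 values, the top value seen 276 times, entropy 0.692285). Whether the SDK's entropy transform uses exactly this formula is not confirmed here, and the basket counts are made up.

```python
import numpy as np


def shannon_entropy(counts):
    """Natural-log Shannon entropy of a collection of category counts."""
    counts = np.asarray(list(counts), dtype=float)
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log(probs)).sum())


# Gender column in the customer describe output: 276 of 530 rows share the top value
gender_entropy = shannon_entropy([276, 530 - 276])

# a made-up customer's item counts per product group over four weeks
basket_entropy = shannon_entropy({"Pains": 6, "Fromages": 3, "Sirops": 1}.values())

print(round(gender_entropy, 6), round(basket_entropy, 6))
```

A customer who buys from a single product group has entropy 0; the value grows as purchases spread more evenly across more groups.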
# get the target
import json
next_customer_sales_14d_target = catalog.get_target("next_customer_sales_14d")
# display details about the target
info = next_customer_sales_14d_target.info()
display_info = {
key: info[key] for key in ("id", "target_name", "entities", "window", "primary_table")
}
print(json.dumps(display_info, indent=4))
{ "id": "64d9d439b9f8dba844210a49", "target_name": "next_customer_sales_14d", "entities": [ { "name": "grocerycustomer", "serving_names": [ "GROCERYCUSTOMERGUID" ], "catalog_name": "deep dive materializing features 20230814:1513" } ], "window": "14d", "primary_table": [ { "name": "GROCERYINVOICE", "status": "PUBLIC_DRAFT", "catalog_name": "deep dive materializing features 20230814:1513" } ] }
Note that the time window for the target is 14 days
We can conclude that it would be safe for the training data observation set's points in time to commence on 29-Jan-2022 and end 14 days before the present.
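The arithmetic behind this conclusion can be written out as follows. This sketch uses the minimum and maximum timestamps from the invoice describe output, and treats the maximum data timestamp as a stand-in for "the present", which is an assumption.

```python
import pandas as pd

# minimum and maximum Timestamp values from the invoice describe output
earliest_timestamp = pd.Timestamp("2022-01-01 04:17:46")
latest_timestamp = pd.Timestamp("2023-06-30 21:19:35")

longest_feature_window = pd.Timedelta(weeks=4)  # longest window in the feature list
target_window = pd.Timedelta(days=14)  # window of next_customer_sales_14d

# earliest point in time that still has a full feature window of history
safe_start = earliest_timestamp + longest_feature_window
# latest point in time that still has a full target window of future data
safe_end = latest_timestamp - target_window

print(safe_start, safe_end)
```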
We will create an observation set for invoice dates from Feb-22 to Jan-23.
# create a large observation table from a view
# filter to get Feb-22 to Jan-23
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-02-01")) & (
    grocery_invoice_view["Timestamp"] < pd.to_datetime("2023-02-01")
)
observation_set_view = grocery_invoice_view[filter].copy()
# create a new observation table
observation_table_large = observation_set_view.create_observation_table(
name="1000 customers who were active between 01-Feb-2022 and 31-Jan-2023",
sample_rows=1000,
columns=["Timestamp", "GroceryCustomerGuid"],
columns_rename_mapping={
"Timestamp": "POINT_IN_TIME",
"GroceryCustomerGuid": "GROCERYCUSTOMERGUID",
},
)
# if the observation table isn't too large, you can materialize it
display(observation_table_large.to_pandas())
Done! |████████████████████████████████████████| 100% in 21.9s (0.05%/s) Downloading table |████████████████████████████████████████| 1000/1000 [100%] in
POINT_IN_TIME | GROCERYCUSTOMERGUID | |
---|---|---|
0 | 2022-04-11 11:49:05 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
1 | 2022-09-28 16:24:45 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
2 | 2022-11-12 19:24:41 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
3 | 2022-02-22 10:35:03 | abdef773-ab72-43b6-8e77-050804c1c5fc |
4 | 2022-04-07 14:27:49 | abdef773-ab72-43b6-8e77-050804c1c5fc |
... | ... | ... |
995 | 2022-08-16 15:53:16 | bfb599c9-404c-42c1-addf-84b7b1b42ca8 |
996 | 2022-10-25 14:32:56 | 5fb1a274-1d0a-422a-8d48-a6a55ddc22a7 |
997 | 2022-02-11 04:14:42 | c9bdbb70-27e7-4ca1-a429-17b67703c06b |
998 | 2022-12-21 05:06:41 | c9bdbb70-27e7-4ca1-a429-17b67703c06b |
999 | 2022-04-14 18:17:22 | bbaff8e5-44ab-4f61-a4e6-405f274bf429 |
1000 rows × 2 columns
Example: Get historical values¶
# use compute_historical_features to get the feature values for the observation set
training_data_features = customer_feature_list.compute_historical_features(observation_set)
display(training_data_features)
Done! |████████████████████████████████████████| 100% in 57.6s (0.02%/s) Downloading table |████████████████████████████████████████| 861/861 [100%] in 0 Done! |████████████████████████████████████████| 100% in 9.2s (0.11%/s)
GROCERYCUSTOMERGUID | POINT_IN_TIME | CustomerInventoryEntropy_4w | CustomerInventoryMostFrequent_4w | StateMeanLatitude | StateMeanLongitude | |
---|---|---|---|---|---|---|
0 | 575ceb64-e6ef-446d-9a38-929e35e4cbef | 2022-07-31 23:59:59.999 | 2.257205 | Pizza Surgelées | 48.906913 | 4.453320 |
1 | b95f380e-7e7b-4bca-9762-fd9a4fd07419 | 2022-07-31 23:59:59.999 | 1.927392 | Colas, Thés glacés et Sodas | 48.737227 | 2.240549 |
2 | cfd39ed9-3140-4af5-9f72-77881aa6c2a8 | 2022-07-31 23:59:59.999 | 3.309872 | Pains | 48.737227 | 2.240549 |
3 | 79b85aee-d548-4e6d-89b0-6969fcce5feb | 2022-07-31 23:59:59.999 | 2.614161 | Colas, Thés glacés et Sodas | 47.176003 | 6.023457 |
4 | db2d5721-8869-40f7-984c-a94d614fdf69 | 2022-07-31 23:59:59.999 | 2.153532 | Bières et Cidres | 48.737227 | 2.240549 |
... | ... | ... | ... | ... | ... | ... |
856 | ff38d86f-cd9a-4860-9b0a-eb387bfe0a10 | 2022-12-31 23:59:59.999 | 3.133063 | Sirops | 45.500198 | 5.054081 |
857 | 5fc2332e-03ac-448d-bf34-f3322cdc295e | 2022-12-31 23:59:59.999 | 3.254689 | Fromages | 48.739038 | 2.242254 |
858 | 6132395b-aa85-4fc7-849d-8b8bbd47e1f9 | 2022-12-31 23:59:59.999 | 2.474379 | Chips et Tortillas | 48.739038 | 2.242254 |
859 | c6ef9073-3351-4f54-869a-4c926a479520 | 2022-12-31 23:59:59.999 | 2.378475 | Pizza Surgelées | 43.456104 | 5.887195 |
860 | 20f61507-e7d7-450d-b44f-665d1dfd889f | 2022-12-31 23:59:59.999 | 3.442462 | Glaces et Sorbets | 47.662871 | 1.349651 |
861 rows × 6 columns
Example: Get target values¶
We can materialize the target values by calling compute_target or compute_target_table.
# materialize the target values for the observation table using compute_target_table
training_data_target_table = next_customer_sales_14d_target.compute_target_table(
observation_table_large, observation_table_name="next_customer_sales_14d_target_table"
)
training_data_target = training_data_target_table.to_pandas()
display(training_data_target)
Done! |████████████████████████████████████████| 100% in 31.6s (0.03%/s) Downloading table |████████████████████████████████████████| 1000/1000 [100%] in
POINT_IN_TIME | GROCERYCUSTOMERGUID | next_customer_sales_14d | |
---|---|---|---|
0 | 2022-09-19 17:37:03 | 42839dea-155f-46c3-b724-48b4c2328aed | 0.00 |
1 | 2022-10-19 13:01:26 | c52ea584-e4e4-4edb-8ba4-99cecfdaa08d | 175.65 |
2 | 2022-06-21 08:56:47 | d7316f3d-6ea9-49b6-97f0-3d20ea9d1331 | 317.74 |
3 | 2022-11-12 11:06:42 | c5c71432-a165-4f38-9fad-48ed116802ce | 25.99 |
4 | 2022-10-05 11:52:39 | 550a1960-1ca0-4965-b85f-227d1b68ccd4 | 105.70 |
... | ... | ... | ... |
995 | 2022-06-26 11:33:49 | b21ae11c-83cf-4146-832e-1163413a3295 | 52.38 |
996 | 2022-03-13 13:10:58 | 888aa655-927f-41c8-a0ba-7dab2872fca8 | 63.55 |
997 | 2022-02-04 17:10:47 | 34be2f38-fe5b-4c18-863d-178b7ad6ff4e | 142.95 |
998 | 2022-03-12 14:04:01 | 32dd07d0-2c16-4b34-8cc9-01f258e0b935 | 18.64 |
999 | 2022-02-27 16:11:56 | 0b2de469-1034-4a6a-ac20-f040e7e40cbd | 101.02 |
1000 rows × 3 columns
Concept: Historical feature table¶
A HistoricalFeatureTable object represents a table in the feature store containing historical feature values from a historical feature request. The historical feature values can also be obtained as a Pandas DataFrame, but using a HistoricalFeatureTable object has some benefits such as handling large tables, storing the data in the feature store for reuse, and offering full lineage of the training and test data.
# the syntax is different when using an observation table to create a historical feature table
# Compute the historical feature table
training_table = customer_feature_list.compute_historical_feature_table(
training_data_target_table,
historical_feature_table_name="customer training table on 1000 customers who were active between 01-Feb-2022 and 31-Jan-2023",
)
training_data = training_table.to_pandas()
# display the training data
display(training_data)
Done! |████████████████████████████████████████| 100% in 1:07.3 (0.01%/s) Downloading table |████████████████████████████████████████| 1000/1000 [100%] in
POINT_IN_TIME | GROCERYCUSTOMERGUID | next_customer_sales_14d | CustomerInventoryEntropy_4w | CustomerInventoryMostFrequent_4w | StateMeanLatitude | StateMeanLongitude | |
---|---|---|---|---|---|---|---|
0 | 2022-02-01 11:25:09 | 33a71f26-f36f-423f-8ddb-a7d674102d3b | 81.36 | 3.289136 | Café | 48.740263 | 2.236025 |
1 | 2022-02-01 16:49:20 | d0251d4c-f16a-4db2-a4d2-f025cb90b3be | 190.88 | 3.033688 | Fromages | 43.688197 | 1.489452 |
2 | 2022-02-01 19:09:15 | 9eb1b37c-a1f8-498c-b201-55c948a5887f | 174.55 | 2.562665 | Viande Surgelée | 48.740263 | 2.236025 |
3 | 2022-02-02 09:16:14 | fadcc70c-8050-47bd-9dee-cafae17f3397 | 104.84 | 2.205598 | Pains | 45.475549 | 5.041807 |
4 | 2022-02-02 16:40:50 | 9359ef7b-7fd8-4587-bc40-e89f6acc1218 | 55.94 | 2.242005 | Soupe | 48.740263 | 2.236025 |
... | ... | ... | ... | ... | ... | ... | ... |
995 | 2022-12-30 15:26:35 | 38d069c8-794b-4c5a-aaf6-3323f1244752 | 34.49 | 3.514241 | Colas, Thés glacés et Sodas | 43.456104 | 5.887195 |
996 | 2022-12-30 17:55:57 | a484f740-1168-48ad-b7fa-3317d61d9a87 | 189.43 | 3.243454 | Chips et Tortillas | 43.456104 | 5.887195 |
997 | 2022-12-31 12:06:17 | 11465751-0f71-413d-b8c9-90b5a8f26c5f | 0.00 | 2.390462 | Colas, Thés glacés et Sodas | 48.739038 | 2.242254 |
998 | 2022-12-31 15:53:59 | c6ef9073-3351-4f54-869a-4c926a479520 | 174.75 | 2.755016 | Jus Frais | 43.456104 | 5.887195 |
999 | 2022-12-31 18:20:47 | 7aafa7b6-67fd-49ec-a799-4b8751fc7867 | 14.35 | 3.535194 | Chips et Tortillas | 44.676056 | -0.494788 |
1000 rows × 7 columns
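If feature values and target values were instead computed as separate Pandas DataFrames over the same observation set, they could be combined on the two key columns. The sketch below shows that join with made-up values; it is an illustration of the idea, not the workflow used above, where the target table itself serves as the observation table.

```python
import pandas as pd

keys = ["GROCERYCUSTOMERGUID", "POINT_IN_TIME"]

# made-up feature values for two observation rows
features = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": ["a", "b"],
        "POINT_IN_TIME": pd.to_datetime(["2022-02-01 11:25:09", "2022-02-01 16:49:20"]),
        "CustomerInventoryEntropy_4w": [3.29, 3.03],
    }
)

# made-up target values for the same observation rows, in a different order
target = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": ["b", "a"],
        "POINT_IN_TIME": pd.to_datetime(["2022-02-01 16:49:20", "2022-02-01 11:25:09"]),
        "next_customer_sales_14d": [190.88, 81.36],
    }
)

# validate="one_to_one" raises if either side has duplicate key rows
training = features.merge(target, on=keys, how="inner", validate="one_to_one")
print(training)
```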
Deploying features¶
Learning Objectives
In this section you will learn:
- about feature readiness levels
- about feature list status levels
- how to deploy a feature list
Feature readiness¶
To help differentiate features that are in the prototype stage and features that are ready for production, a feature version can have one of four readiness levels:
- PRODUCTION_READY: ready for deployment in production environments.
- PUBLIC_DRAFT: shared for feedback purposes.
- DRAFT: in the prototype stage.
- DEPRECATED: not advised for use in either training or prediction.
# view the readiness of the features
catalog.list_features()
id | name | dtype | readiness | online_enabled | tables | primary_tables | entities | primary_entities | created_at | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 64d9d432b9f8dba844210a46 | StateMeanLongitude | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-08-14T07:13:57.201000 |
1 | 64d9d432b9f8dba844210a45 | StateMeanLatitude | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-08-14T07:13:56.935000 |
2 | 64d9d432b9f8dba844210a44 | CustomerInventoryMostFrequent_4w | VARCHAR | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-08-14T07:13:56.689000 |
3 | 64d9d432b9f8dba844210a43 | CustomerInventoryEntropy_4w | FLOAT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-08-14T07:13:55.761000 |
When a feature has been reviewed and is ready for production, its readiness can be upgraded.
# get CustomerInventoryEntropy_4w
customer_inventory_entropy_4w = catalog.get_feature("CustomerInventoryEntropy_4w")
# check feature definition file
customer_inventory_entropy_4w.definition
# Generated by SDK version: 0.4.2
from bson import ObjectId
from featurebyte import DimensionTable
from featurebyte import FeatureJobSetting
from featurebyte import ItemTable
# item_table name: "INVOICEITEMS", event_table name: "GROCERYINVOICE"
item_table = ItemTable.get_by_id(ObjectId("64d9d42ab9f8dba844210a3c"))
item_view = item_table.get_view(
event_suffix=None,
view_mode="manual",
drop_column_names=[],
column_cleaning_operations=[],
event_drop_column_names=["record_available_at"],
event_column_cleaning_operations=[],
event_join_column_names=[
"Timestamp",
"GroceryInvoiceGuid",
"GroceryCustomerGuid",
"tz_offset",
],
)
# dimension_table name: "GROCERYPRODUCT"
dimension_table = DimensionTable.get_by_id(ObjectId("64d9d431b9f8dba844210a3d"))
dimension_view = dimension_table.get_view(
view_mode="manual", drop_column_names=[], column_cleaning_operations=[]
)
joined_view = item_view.join(
dimension_view, on="GroceryProductGuid", how="left", rsuffix=""
)
grouped = joined_view.groupby(
by_keys=["GroceryCustomerGuid"], category="ProductGroup"
).aggregate_over(
value_column=None,
method="count",
windows=["4w"],
feature_names=["CustomerInventory_4w"],
feature_job_setting=FeatureJobSetting(
blind_spot="0s", frequency="3600s", time_modulo_frequency="90s"
),
skip_fill_na=True,
)
feat = grouped["CustomerInventory_4w"]
feat_1 = feat.cd.entropy()
feat_1.name = "CustomerInventoryEntropy_4w"
output = feat_1
output.save(_id=ObjectId("64d9d432b9f8dba844210a43"))
# change the readiness to production ready
customer_inventory_entropy_4w.update_readiness("PRODUCTION_READY")
# view the readiness of the features
catalog.list_features()
id | name | dtype | readiness | online_enabled | tables | primary_tables | entities | primary_entities | created_at | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 64d9d432b9f8dba844210a46 | StateMeanLongitude | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-08-14T07:13:57.201000 |
1 | 64d9d432b9f8dba844210a45 | StateMeanLatitude | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-08-14T07:13:56.935000 |
2 | 64d9d432b9f8dba844210a44 | CustomerInventoryMostFrequent_4w | VARCHAR | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-08-14T07:13:56.689000 |
3 | 64d9d432b9f8dba844210a43 | CustomerInventoryEntropy_4w | FLOAT | PRODUCTION_READY | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-08-14T07:13:55.761000 |
Feature list status¶
Feature lists can be assigned one of five status levels to differentiate between experimental feature lists and those suitable for deployment or already deployed.
- DEPLOYED: Assigned to feature lists with at least one deployed version.
- TEMPLATE: For feature lists that serve as reference templates or safe starting points.
- PUBLIC_DRAFT: For feature lists shared for feedback purposes.
- DRAFT: For feature lists in the prototype stage.
- DEPRECATED: For outdated or unnecessary feature lists.
# view the status of the feature lists
display(catalog.list_feature_lists())
id | name | num_feature | status | deployed | readiness_frac | online_frac | tables | entities | primary_entities | created_at | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 64d9d432b9f8dba844210a47 | CustomerFeatures | 4 | DRAFT | False | 0.25 | 0.0 | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [grocerycustomer, frenchstate] | [grocerycustomer] | 2023-08-14T07:13:57.299000 |
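The `readiness_frac` of 0.25 in the listing above is consistent with one of the four features being PRODUCTION_READY. A minimal sketch of that arithmetic, assuming the column is simply the share of production-ready features (the helper name `production_ready_fraction` is hypothetical and not part of the SDK):

```python
from typing import Dict


def production_ready_fraction(readiness_by_feature: Dict[str, str]) -> float:
    """Share of features whose readiness is PRODUCTION_READY (assumed meaning of readiness_frac)."""
    if not readiness_by_feature:
        return 0.0
    ready = sum(1 for r in readiness_by_feature.values() if r == "PRODUCTION_READY")
    return ready / len(readiness_by_feature)


# readiness values copied from the feature listing earlier in this tutorial
features = {
    "StateMeanLongitude": "DRAFT",
    "StateMeanLatitude": "DRAFT",
    "CustomerInventoryMostFrequent_4w": "DRAFT",
    "CustomerInventoryEntropy_4w": "PRODUCTION_READY",
}
print(production_ready_fraction(features))  # 0.25
```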
When a feature list is ready for review, its status can be updated.
# get the CustomerFeatures feature list
customer_feature_list = catalog.get_feature_list("CustomerFeatures")
# update the status to PUBLIC_DRAFT
customer_feature_list.update_status("PUBLIC_DRAFT")
# view the status of the feature lists
display(catalog.list_feature_lists())
Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 0.1s
id | name | num_feature | status | deployed | readiness_frac | online_frac | tables | entities | primary_entities | created_at | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 64d9d432b9f8dba844210a47 | CustomerFeatures | 4 | PUBLIC_DRAFT | False | 0.25 | 0.0 | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [grocerycustomer, frenchstate] | [grocerycustomer] | 2023-08-14T07:13:57.299000 |
Deploying a feature list¶
# deploy the customer feature list
deployment = customer_feature_list.deploy(make_production_ready=True)
deployment.enable()
# view the status of the feature lists
display(catalog.list_feature_lists())
Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 0.1s
Done! |████████████████████████████████████████| 100% in 6.2s (0.16%/s)
Done! |████████████████████████████████████████| 100% in 54.1s (0.02%/s)
id | name | num_feature | status | deployed | readiness_frac | online_frac | tables | entities | primary_entities | created_at | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 64d9d432b9f8dba844210a47 | CustomerFeatures | 4 | DEPLOYED | True | 1.0 | 1.0 | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [grocerycustomer, frenchstate] | [grocerycustomer] | 2023-08-14T07:13:57.299000 |
Why deploy?¶
When you deploy a feature list, behind the scenes the Feature Store starts regularly pre-calculating and caching feature values. This can significantly reduce the latency of feature serving.
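To make "regularly" concrete, recall that the feature definition shown earlier uses `FeatureJobSetting(blind_spot="0s", frequency="3600s", time_modulo_frequency="90s")`. The sketch below assumes that jobs run whenever the epoch time in seconds, modulo the frequency, equals the time modulo frequency (here, 90 seconds past every hour). The helper `latest_job_time` is hypothetical and only illustrates that arithmetic; it is not how the Feature Store schedules jobs internally.

```python
from datetime import datetime, timezone


def latest_job_time(now: datetime, frequency_s: int, time_modulo_frequency_s: int) -> datetime:
    """Most recent scheduled job time at or before `now` (assumed scheduling rule)."""
    epoch_s = int(now.timestamp())
    # seconds elapsed since the last instant where epoch_s % frequency_s == time_modulo_frequency_s
    elapsed = (epoch_s - time_modulo_frequency_s) % frequency_s
    return datetime.fromtimestamp(epoch_s - elapsed, tz=timezone.utc)


now = datetime(2023, 8, 14, 7, 21, 5, tzinfo=timezone.utc)
job_time = latest_job_time(now, frequency_s=3600, time_modulo_frequency_s=90)
print(job_time)  # 2023-08-14 07:01:30+00:00
```

Under this assumption, a request arriving at 07:21:05 UTC would be answered from values cached by the 07:01:30 job rather than computed on demand.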
Serving and consuming features¶
Learning Objectives
In this section you will learn:
- the point in time used for production serving
- how to create a Python function to consume a feature list
- how to consume a feature list
Point in time for deployment¶
The production feature serving API uses the current time as its point in time, so requests do not include a timestamp. To consume the feature list, send only the values of the primary entity, keyed by its serving name.
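As a small illustration, the request body can be built as below. The `entity_serving_names` key and the `GROCERYCUSTOMERGUID` serving name are taken from the generated template later in this tutorial; the helper name `build_online_request` is hypothetical.

```python
import json
from typing import Any, Dict, List


def build_online_request(serving_name: str, entity_values: List[Any]) -> Dict[str, Any]:
    """Build a request body with one row per entity value and no point-in-time field."""
    return {"entity_serving_names": [{serving_name: value} for value in entity_values]}


payload = build_online_request(
    "GROCERYCUSTOMERGUID", ["0041bdff-4917-42d5-bd6d-5a555ac616c5"]
)
print(json.dumps(payload, indent=2))
```

Note that the body carries only entity values; no timestamp column is added.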
Automatically create a Python function for consuming the API¶
You can generate either a Python template or a shell script. The generated shell script uses the curl command to send the request.
For the Python template, set the language parameter to 'python'. For the shell script, set the language parameter to 'sh'.
# get a python template for consuming the feature serving API
deployment.get_online_serving_code(language="python")
Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 0.1s
from typing import Any, Dict
import pandas as pd
import requests
def request_features(entity_serving_names: Dict[str, Any]) -> pd.DataFrame:
"""
Send POST request to online serving endpoint
Parameters
----------
entity_serving_names: Dict[str, Any]
Entity serving name values to used for serving request
Returns
-------
pd.DataFrame
"""
response = requests.post(
url="http://localhost:8088/deployment/64d9d595b9f8dba844210a52/online_features",
headers={"Content-Type": "application/json", "active-catalog-id": "64d9d415b9f8dba844210a39"},
json={"entity_serving_names": entity_serving_names},
)
assert response.status_code == 200, response.json()
return pd.DataFrame.from_dict(response.json()["features"])
request_features([{"GROCERYCUSTOMERGUID": "0041bdff-4917-42d5-bd6d-5a555ac616c5"}])
Copy the online serving code that was generated above, paste it into the cell below, then run it.
# replace the contents of this Python code cell with the output from deployment.get_online_serving_code(language="python")
Concept: Batch request table¶
A BatchRequestTable object is a representation of a table in the feature store that specifies entity values for batch serving.
# this is a new use case, a daily batch run for customers who were active in the latest 24 hours
# filter the invoice view to get customers who had an invoice in the latest 24 hours
batch_request_timestamp = pd.Timestamp.now(tz="utc")
# named recent_filter to avoid shadowing Python's built-in filter()
recent_filter = grocery_invoice_view["Timestamp"] > batch_request_timestamp - pd.to_timedelta(
24, unit="hour"
)
recently_active_view = grocery_invoice_view[recent_filter].copy()
display(recently_active_view.preview())
GroceryInvoiceGuid | GroceryCustomerGuid | Timestamp | tz_offset | Amount |
---|---|---|---|---|
# create a batch request table from the filtered view
# note that the table does not contain a prediction point in time
# batch requests use the batch run time as the point in time
batch_request_table = recently_active_view.create_batch_request_table(
"customer batch request for customers active in the latest 24 hours as at "
+ str(batch_request_timestamp),
columns=["GroceryCustomerGuid"],
columns_rename_mapping={"GroceryCustomerGuid": "GROCERYCUSTOMERGUID"},
)
Done! |████████████████████████████████████████| 100% in 22.5s (0.04%/s)
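The filtering and renaming performed above can be pictured with plain Python data. This is only a sketch of the logic (keep invoices from the latest 24 hours, keep the customer column, rename it to the serving name); the helper `build_batch_request_rows` and the sample invoices are hypothetical, and the SDK performs the real work inside the feature store.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def build_batch_request_rows(
    invoices: List[Dict[str, Any]],
    as_at: datetime,
    window: timedelta = timedelta(hours=24),
) -> List[Dict[str, Any]]:
    """Keep invoices newer than the cutoff and expose the customer column under its serving name."""
    cutoff = as_at - window
    return [
        {"GROCERYCUSTOMERGUID": row["GroceryCustomerGuid"]}
        for row in invoices
        if row["Timestamp"] > cutoff
    ]


# hypothetical invoices: one inside the 24-hour window, one outside
as_at = datetime(2023, 8, 14, 7, 21, 5, tzinfo=timezone.utc)
invoices = [
    {"GroceryCustomerGuid": "customer-a", "Timestamp": as_at - timedelta(hours=2)},
    {"GroceryCustomerGuid": "customer-b", "Timestamp": as_at - timedelta(days=3)},
]
rows = build_batch_request_rows(invoices, as_at)
print(rows)  # [{'GROCERYCUSTOMERGUID': 'customer-a'}]
```

As in the tutorial code, the rows carry no point-in-time column, because batch requests use the batch run time.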
Concept: Batch feature table¶
A BatchFeatureTable object is a representation of a table in the feature store that contains feature values from batch serving. The object includes metadata on the Deployment and the BatchRequestTable used to create it.
# enable the deployment - this is a pre-requisite
if not deployment.enabled:
deployment.enable()
# request batch features
batch_features = deployment.compute_batch_feature_table(
batch_request_table=batch_request_table,
batch_feature_table_name="customer batch feature data for customers active in the latest 24 hours as at "
+ str(batch_request_timestamp),
)
Done! |████████████████████████████████████████| 100% in 34.4s (0.03%/s)
# display the contents of the batch feature table
display(batch_features.to_pandas())
Downloading table |████████████████████████████████████████| 0 in 0.1s (0.00/s)
GROCERYCUSTOMERGUID | CustomerInventoryEntropy_4w | CustomerInventoryMostFrequent_4w | StateMeanLatitude | StateMeanLongitude |
---|---|---|---|---|
# display the batch feature table metadata
batch_features.info()
name | customer batch feature data for customers active in the latest 24 hours as at 2023-08-14 07:21:05.319086+00:00
created_at | 2023-08-14 07:22:03
updated_at | None
description | None
batch_request_table_name | customer batch request for customers active in the latest 24 hours as at 2023-08-14 07:21:05.319086+00:00
deployment_name | Deployment with CustomerFeatures_V230814
table_details |
|
Disable a deployment¶
# disable the feature list deployment
deployment.disable()
Next Steps¶
Now that you've completed the deep dive materializing features tutorial, you can put your knowledge into practice or learn more:
- Put your knowledge into practice by creating features in the "credit card dataset feature engineering playground" or "healthcare dataset feature engineering playground" catalogs
- Learn more about feature governance via the "Quick Start Feature Governance" tutorial
- Learn about data modeling via the "Deep Dive Data Modeling" tutorial