Feature
A Feature object contains the logical plan (also referred to as a blueprint) to compute a feature.
The feature values are computed by:
- Using a set of observations for training purposes.
- Enumerating the values of the feature's associated entity for prediction.
Features can sometimes be extracted directly from existing attributes in the source tables. However, in many cases, features are created through a sequence of operations like row transformations, joins, filters, and aggregates.
In FeatureByte, the computational blueprint for Feature objects can be defined from View objects in three ways: as Lookup features, as Aggregate features, or as Cross Aggregate features.
Feature objects can also be formed as transformations of one or more Feature objects or as new versions of existing features.
Lookup features¶
Lookup features are simple features extracted directly from entity attributes in a view without the need for aggregation. For example, features extracted from a column in a specific view reflect characteristics of the entity linked with that view's primary key.
Consider the Grocery dataset used in our tutorials. Here, you can designate the "Amount" column from the "GROCERYINVOICE" table view as a feature for the "groceryinvoice" entity using the as_feature() method:
invoice_view = catalog.get_view("GROCERYINVOICE")
invoice_view["Amount"].as_feature("Invoice_Amount")
For a Slowly Changing Dimension (SCD) view, where attributes change over time, the feature is linked to the entity identified by the table's natural key. By default, the feature value is acquired by selecting:
- The active attribute at the points-in-time of the observation set used by the historical feature request, and
- The current attribute for prediction.
In the following example, the "UsesWindows" feature indicates whether a customer is using Windows. It is a Lookup feature for the "grocerycustomer" entity that is identified by the natural key "GroceryCustomerGuid" of the SCD table "GROCERYCUSTOMER".
customer_view = catalog.get_view("GROCERYCUSTOMER")
# Extract operating system from BrowserUserAgent column
customer_view["OperatingSystemIsWindows"] = \
customer_view.BrowserUserAgent.str.contains("Windows")
# Create a feature from the OperatingSystemIsWindows column
uses_windows = customer_view.OperatingSystemIsWindows.as_feature("UsesWindows")
For an SCD view, you can specify an offset if you want the attribute value from a specific point in the past.
In the following example, we use an offset of 28 days to create a feature that indicates the attribute value four weeks prior to the observation point.
uses_windows_28d_ago = customer_view.OperatingSystemIsWindows.as_feature(
"UsesWindows_28d_ago", offset='28d'
)
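The lookup semantics described above (the value active at the point-in-time, optionally shifted back by an offset) can be illustrated without the SDK. The following is a minimal plain-Python sketch, not FeatureByte's implementation; the `history` data and the `lookup_as_at` helper are hypothetical names introduced only for illustration.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Hypothetical attribute history for one customer: (effective_from, value),
# sorted by effective timestamp. The data is illustrative only.
history = [
    (datetime(2022, 1, 1), False),
    (datetime(2022, 11, 20), True),
]


def lookup_as_at(history, point_in_time, offset=timedelta(0)):
    """Return the value active at (point_in_time - offset), or None if none is active."""
    target = point_in_time - offset
    effective_times = [ts for ts, _ in history]
    # Index of the last version whose effective time is <= target.
    idx = bisect_right(effective_times, target) - 1
    return history[idx][1] if idx >= 0 else None


pit = datetime(2022, 12, 17, 12, 12, 40)
current_value = lookup_as_at(history, pit)                             # True
value_28d_ago = lookup_as_at(history, pit, offset=timedelta(days=28))  # False
```

With a 28-day offset the lookup target falls on 2022-11-19, one day before the second version takes effect, so the earlier value is returned.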
Aggregate features¶
Aggregate features are an important type of feature, created by applying aggregation functions to a collection of data points grouped by an entity (or a tuple of entities). Supported aggregation functions include the latest, count, sum, average, minimum, maximum, and standard deviation.
Below is the two-step process to define an aggregate feature:
- Determine the level of analysis by grouping view rows based on columns representing one or more entities in the view using the groupby() method.
- Select the aggregation type according to the view type and the level of analysis. There are three types of aggregations:
  - Non-Temporal Aggregates: Features created through aggregation operations without considering temporal aspects.
  - Aggregates Over a Window: Features generated by aggregating data within a specific time frame, commonly used for analyzing event data, item data, and change view data.
  - Aggregates “As At” a Point-in-Time: Features generated by aggregating data active at a specific moment in time, available only for SCD views.
Non-Temporal Aggregate example¶
A Non-Temporal Aggregate is obtained using the aggregate() method on a GroupBy object:
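# Group items by the column GroceryInvoiceGuid that references the invoice entity
# (the "INVOICEITEMS" view name is assumed from the Grocery tutorial dataset)
items_view = catalog.get_view("INVOICEITEMS")
items_by_invoice = items_view.groupby("GroceryInvoiceGuid")
# Get the number of items in each invoice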
invoice_item_count = items_by_invoice.aggregate(
None,
method=fb.AggFunc.COUNT,
feature_name="InvoiceItemCount",
)
Important
To avoid time leakage, non-temporal aggregates are supported only for Item views, and only when the grouping key is the event key of the Item view. An example of such a feature is the count of items in an order.
Aggregate Over a Window example¶
An Aggregate Over a Window is obtained using the aggregate_over() method on a GroupBy object:
# Group items by the column GroceryCustomerGuid that references the customer entity
items_by_customer = items_view.groupby("GroceryCustomerGuid")
# Declare features that measure the discount received by customer
customer_discounts = items_by_customer.aggregate_over(
"Discount",
method=fb.AggFunc.SUM,
feature_names=["CustomerDiscounts_7d", "CustomerDiscounts_28d"],
fill_value=0,
windows=['7d', '28d']
)
Note
The output is a FeatureGroup object as the operation can support multiple window settings.
To extract a Feature object from a FeatureGroup object, subset it by the Feature object's name, for example customer_discounts["CustomerDiscounts_7d"].
Note
By default, Aggregate Over a Window features use the default feature job setting defined at their primary table level.
You can set a different feature job setting when defining the feature.
# Set a different feature job setting
customer_discount = items_by_customer.aggregate_over(
"Discount",
method=fb.AggFunc.SUM,
feature_names=["CustomerDiscounts_7d", "CustomerDiscounts_28d"],
fill_value=0,
windows=['7d', '28d'],
feature_job_setting=fb.FeatureJobSetting(
blind_spot="135s",
period="60m",
offset="90s",
)
)
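The text above does not define the three FeatureJobSetting parameters. One plausible reading is that jobs run every `period`, shifted by `offset` from the start of each period, and that data newer than `blind_spot` before the job time is excluded. The sketch below works through that arithmetic in plain Python; the interpretation, the epoch alignment, and the helper names `latest_job_time` and `data_cutoff` are all assumptions, so check the FeatureJobSetting reference before relying on it.

```python
from datetime import datetime, timedelta, timezone

# Assumption: job schedules are aligned to the Unix epoch in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def latest_job_time(request_time, period, offset):
    """Most recent scheduled job time at or before request_time."""
    completed_periods = (request_time - EPOCH - offset) // period
    return EPOCH + completed_periods * period + offset


def data_cutoff(request_time, period, offset, blind_spot):
    """Latest data timestamp the job is assumed to include."""
    return latest_job_time(request_time, period, offset) - blind_spot


request_time = datetime(2022, 12, 17, 12, 12, 40, tzinfo=timezone.utc)
period = timedelta(minutes=60)
offset = timedelta(seconds=90)
blind_spot = timedelta(seconds=135)

job_time = latest_job_time(request_time, period, offset)      # 12:01:30 UTC
cutoff = data_cutoff(request_time, period, offset, blind_spot)  # 11:59:15 UTC
```

Under these assumptions, a request at 12:12:40 is served by the job scheduled at 12:01:30, which only sees data up to 11:59:15.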
You can specify an offset if you want the window to end a given duration before the observation point. The window is shifted, but its size remains the same.
In the following example, we use an offset of 7 days to create a feature that aggregates 28 days of events where the aggregation window ends 7 days prior to the observation point:
customer_discounts = items_by_customer.aggregate_over(
"Discount",
method=fb.AggFunc.SUM,
feature_names=["CustomerDiscounts_28d_offset_7d"],
windows=['28d'],
offset='7d',
)
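To make the window and offset semantics concrete, here is a minimal plain-Python sketch of a windowed sum. It is not the SDK's implementation: the `events` data and the `windowed_sum` helper are hypothetical, the half-open window boundaries are an assumption, and feature job settings (such as the blind spot) are ignored.

```python
from datetime import datetime, timedelta

# Hypothetical (timestamp, discount) events for one customer.
events = [
    (datetime(2022, 12, 16, 9, 0), 1.5),
    (datetime(2022, 12, 8, 9, 0), 2.0),
    (datetime(2022, 11, 25, 9, 0), 4.0),
    (datetime(2022, 11, 1, 9, 0), 8.0),
]


def windowed_sum(events, point_in_time, window, offset=timedelta(0)):
    """Sum values whose timestamp falls in [end - window, end), where end = point_in_time - offset."""
    window_end = point_in_time - offset
    window_start = window_end - window
    return sum(value for ts, value in events if window_start <= ts < window_end)


pit = datetime(2022, 12, 17, 12, 0)
sum_7d = windowed_sum(events, pit, timedelta(days=7))    # 1.5
sum_28d = windowed_sum(events, pit, timedelta(days=28))  # 7.5
sum_28d_offset_7d = windowed_sum(
    events, pit, timedelta(days=28), offset=timedelta(days=7)
)  # 6.0
```

The offset version drops the most recent event (which now falls after the window end) while still covering a full 28 days.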
Aggregate As At a Point-in-Time example¶
An Aggregate As At a Point-in-Time is obtained using the aggregate_asat() method on a GroupBy object:
# get view
customer_view = catalog.get_view("GROCERYCUSTOMER")
# Group rows by the State column referencing the state entity
groupby_state = customer_view.groupby("State")
# Declare feature that counts the number of customers in the State
state_customers_count = groupby_state.aggregate_asat(
None,
method=fb.AggFunc.COUNT,
feature_name="StateCustomersCount"
)
Note
- The key used to create aggregate features based on a specific point-in-time should not be the natural key of the SCDView. This is because, for any given natural key value, there can only be one active row at a particular point-in-time.
- You can specify an offset if you want the aggregation to be done on rows active at a specific time before the point-in-time specified by the feature request.
# Declare same feature as at 28 days before the point-in-time
state_customers_count_28d = groupby_state.aggregate_asat(
None,
method=fb.AggFunc.COUNT,
feature_name="StateCustomersCount_28d",
offset='28d'
)
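The "as at" semantics can be sketched in plain Python: only rows active at the target time contribute to each group. This is an illustrative sketch rather than the SDK's implementation; the row layout (`valid_from`, `valid_to`), the sample data, and the `count_active_by_state` helper are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical SCD rows: (customer_id, state, valid_from, valid_to).
# valid_to is None for the row that is still active.
rows = [
    ("c1", "Bretagne", datetime(2022, 1, 1), datetime(2022, 12, 1)),
    ("c1", "Ile-de-France", datetime(2022, 12, 1), None),
    ("c2", "Ile-de-France", datetime(2022, 6, 1), None),
    ("c3", "Bretagne", datetime(2022, 12, 10), None),
]


def count_active_by_state(rows, point_in_time, offset=timedelta(0)):
    """Count rows active at (point_in_time - offset), grouped by state."""
    target = point_in_time - offset
    return Counter(
        state
        for _, state, valid_from, valid_to in rows
        if valid_from <= target and (valid_to is None or target < valid_to)
    )


pit = datetime(2022, 12, 17)
counts_now = count_active_by_state(rows, pit)
counts_28d_ago = count_active_by_state(rows, pit, offset=timedelta(days=28))
```

Grouping by the natural key (`customer_id` here) would always yield a count of at most one, which is why the note above advises against it.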
Cross Aggregate features¶
Cross Aggregate features are a type of Aggregate Feature that involves aggregating data across different categories. This enables the creation of features that capture patterns in an entity across these categories.
Below is the two-step process to define a Cross Aggregate feature:
- Determine the level of analysis by grouping view rows based on columns representing one or more entities in the view, and designate a categorical column for performing operations across categories, using the groupby() method.
# Join product view to items view
product_view = catalog.get_view("GROCERYPRODUCT")
items_view = items_view.join(product_view)
# Group items by the column GroceryCustomerGuid that references the customer entity
# and use ProductGroup as the column to perform operations across
items_by_customer_across_product_group = items_view.groupby(
    by_keys="GroceryCustomerGuid", category="ProductGroup"
)
- Select the aggregation type according to the view type. Similar to Aggregate features, three types are supported: Non-Temporal Aggregate, Aggregate Over a Window, and Aggregate As At a Point-in-Time.
Note
After materialization, the value of a Cross Aggregate feature is a dictionary whose keys are the categories of the categorical column and whose values are the aggregated values for each category.
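The dictionary shape of a Cross Aggregate value can be illustrated with a small plain-Python sketch. The `items` data and the `cross_aggregate_sum` helper are hypothetical and only mimic the output structure; they do not reflect how the SDK computes the feature.

```python
from collections import defaultdict

# Hypothetical item rows: (customer, product_group, total_cost).
items = [
    ("c1", "Fruits", 4.0),
    ("c1", "Dairy", 2.5),
    ("c1", "Fruits", 1.0),
    ("c2", "Dairy", 3.0),
]


def cross_aggregate_sum(items):
    """Sum values per entity, broken down by category."""
    result = defaultdict(lambda: defaultdict(float))
    for entity, category, value in items:
        result[entity][category] += value
    # Convert nested defaultdicts to plain dicts for readability.
    return {entity: dict(per_category) for entity, per_category in result.items()}


customer_inventory = cross_aggregate_sum(items)
# customer_inventory["c1"] == {"Fruits": 5.0, "Dairy": 2.5}
```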
Transforming Features¶
Feature objects can be derived from multiple Feature objects through generic, numeric, string, datetime, and dictionary transforms.
When a feature is derived from features with different primary entities, the entity relationships determine the primary entity, and the lowest level entity is selected as the primary entity. If the entities have no relationship, the primary entity becomes a tuple of those entities.
Generic Transforms¶
Generic transformations applicable to ViewColumn objects can also be applied to Feature objects of any data type. The list of generic transforms can be found in the provided glossary.
Numeric Transforms¶
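Numeric Features can be combined using the built-in arithmetic operators (+, -, *, /). For example, dividing customer_discounts["CustomerDiscounts_7d"] by customer_discounts["CustomerDiscounts_28d"] yields a new Feature object.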
In addition to these arithmetic operations, other numeric transformations that are applicable to ViewColumn objects can also be applied to Feature objects.
String Transforms¶
String Feature objects can be concatenated directly.
Other String transforms that are applicable to ViewColumn objects can also be applied to Feature objects.
Datetime Transforms¶
Datetime Feature objects can be transformed in several ways, such as calculating differences, adding time intervals, or extracting date components. The glossary provides a list of supported date part transforms that are applicable to ViewColumn objects and can also be used with Feature objects.
Point-in-time Transforms¶
Features can be derived from the points-in-time provided during feature materialization.
This allows the creation of "time since" features that compare the latest event timestamp with the point-in-time provided in the feature request.
# Create feature that retrieves the timestamp of the latest invoice of a Customer
invoice_view = catalog.get_view("GROCERYINVOICE")
latest_invoice = invoice_view.groupby("GroceryCustomerGuid").aggregate_over(
value_column="Timestamp",
method="latest",
windows=[None],
feature_names=["Customer Latest Visit"],
)
# Create feature that computes the time since the latest invoice
feature = (
fb.RequestColumn.point_in_time() - latest_invoice["Customer Latest Visit"]
).dt.hour
feature.name = "Customer number of hours since last visit"
Note
For historical feature requests, the point-in-time values are provided by the "POINT_IN_TIME" column of the observation set.
For online and batch serving, the point-in-time value is the timestamp when the feature request is made.
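The underlying "time since" arithmetic is a simple timestamp difference. The plain-Python sketch below shows it with hypothetical timestamps and a hypothetical `hours_since` helper; whether the SDK returns fractional or whole hours is not stated here, so the fractional result is an assumption of this sketch.

```python
from datetime import datetime


def hours_since(latest_event, point_in_time):
    """Elapsed time between the latest event and the point-in-time, in hours."""
    return (point_in_time - latest_event).total_seconds() / 3600


point_in_time = datetime(2022, 12, 17, 12, 0)
latest_invoice_ts = datetime(2022, 12, 16, 6, 30)
hours = hours_since(latest_invoice_ts, point_in_time)  # 29.5
```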
Dictionary Transforms¶
Additional transformations are supported for features resulting from Cross Aggregate features. These include:
- get_value: Retrieves the value based on the key provided.
- most_frequent: Retrieves the most frequent key.
- unique_count: Computes the number of distinct keys.
- entropy: Computes the entropy over the keys.
- get_rank: Computes the rank of a particular key.
- get_relative_frequency: Computes the relative frequency of a particular key.
- cosine_similarity: Computes the cosine similarity with another Cross Aggregate feature.
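To show what some of these transforms compute, here is a plain-Python sketch operating on ordinary dictionaries. It uses textbook definitions (natural-log entropy, standard cosine similarity); the SDK's exact conventions, such as tie-breaking, log base, or handling of empty dictionaries, may differ, and all function names below are illustrative rather than SDK calls.

```python
import math


def most_frequent(d):
    """Key with the largest value, or None for an empty dictionary."""
    return max(d, key=d.get) if d else None


def unique_count(d):
    """Number of distinct keys."""
    return len(d)


def relative_frequency(d, key):
    """Share of the total held by `key` (None if the total is zero)."""
    total = sum(d.values())
    return d.get(key, 0) / total if total else None


def entropy(d):
    """Shannon entropy (natural log) of the normalized positive values."""
    total = sum(d.values())
    probs = [v / total for v in d.values() if v > 0]
    return -sum(p * math.log(p) for p in probs)


def cosine_similarity(a, b):
    """Cosine similarity, treating missing keys as zero."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


customer = {"Fruits": 3.0, "Dairy": 4.0}
state = {"Fruits": 6.0, "Dairy": 8.0}
similarity = cosine_similarity(customer, state)  # proportional baskets give 1.0
```

A similarity close to 1 means the two dictionaries spread their totals across categories in the same proportions, regardless of scale.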
In this example, the cosine_similarity() method is applied to two Cross Aggregate features to measure how similar a customer's purchases are to the purchases of customers living in the same state:
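# Cross Aggregate feature of the customer's purchases
# across product group over the past 4 weeks
customer_inventory_28d = items_by_customer_across_product_group.aggregate_over(
    "TotalCost",
    method=fb.AggFunc.SUM,
    feature_names=["CustomerInventory_28d"],
    windows=['28d']
)
# Join customer view to items view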
items_view = items_view.join(customer_view)
# Cross Aggregate feature of purchases of customers living in the same state
# across product group over the past 4 weeks
state_inventory_28d = items_view.groupby(
by_keys="State", category="ProductGroup"
).aggregate_over(
"TotalCost",
method=fb.AggFunc.SUM,
feature_names=["StateInventory_28d"],
windows=['28d']
)
# Create a feature that measures the similarity of customer purchases
# and purchases of customers living in the same state
customer_state_similarity_28d = \
customer_inventory_28d["CustomerInventory_28d"].cd.cosine_similarity(
state_inventory_28d["StateInventory_28d"]
)
customer_state_similarity_28d.name = \
"Customer Similarity with purchases in the same state over 28 days"
Conditional Transforms¶
You can apply if-then-else logic by using conditional statements, which include other Feature objects related to the same entity.
cond = customer_state == "Ile-de-France"
customer_spent_over_7d[cond] = 100 + customer_spent_over_7d[cond]
Previewing a Feature¶
First, verify the primary entity of a Feature, which indicates the entities that can be used to serve the feature. A feature can be served by its primary entity or any descendant serving entities.
You can obtain the primary entity of a feature by using the primary_entity property, as shown below:
# This should show the name of the primary entity together with its serving names.
# The only accepted serving_name in this example is 'GROCERYCUSTOMERGUID'.
display(customer_state_similarity_28d.primary_entity)
Note
You can preview a Feature object with the preview() method, using a small observation set of up to 50 rows. Unlike the compute_historical_features() method, preview() does not store partial aggregations (tiles) to speed up future computation. Instead, it computes the feature values on the fly and should be used only for small observation sets for debugging or exploring unsaved features.
The small observation set must combine historical points-in-time and key values of the primary entity from the feature. Associated serving entities can also be utilized.
An accepted serving name should be used for the column containing the entity values.
The historical points-in-time must be timestamps in UTC and must be contained in a column named 'POINT_IN_TIME'.
The preview() method returns a pandas DataFrame.
import pandas as pd
observation_set = pd.DataFrame({
'GROCERYCUSTOMERGUID': ["30e3fbe4-3cbe-4d51-b6ca-1f990ef9773d"],
'POINT_IN_TIME': [pd.Timestamp("2022-12-17 12:12:40")]
})
display(customer_state_similarity_28d.preview(observation_set))
Adding a Feature Object to the Catalog¶
Before saving a feature derived from transformations and adding it to the catalog, assign it a name.
Saving a Feature Object makes the object persistent and adds it to the catalog.
Note
After it is saved, a Feature object cannot be modified. New Feature objects with the same namespace can be created to support versioning. Refer to the versioning section for more details.
Listing Unsaved Features¶
Features that have not been saved will not be persisted once you close your notebook. Use the list_unsaved_features() method to check which features are still unsaved, and save the ones you wish to keep.
Setting Feature Readiness¶
To help differentiate Feature objects that are in the prototype stage from objects that are ready for production, a Feature object can have one of four readiness levels:
- PRODUCTION_READY: Assigned to Feature objects ready for deployment in production environments.
- PUBLIC_DRAFT: For Feature objects shared for feedback purposes.
- DRAFT: For Feature objects in the prototype stage.
- DEPRECATED: For Feature objects not advised for training or online serving.
By default, new Feature objects are assigned the DRAFT status. You can delete only DRAFT Feature objects, and you cannot revert other statuses to DRAFT.
Important
Only one Feature object belonging to a group of Feature objects with the same namespace can be designated as PRODUCTION_READY at a time.
When a Feature object is promoted to PRODUCTION_READY, guardrails are applied automatically to compare the Feature object's cleaning operations and feature job setting with the latest defaults. If you are confident in the promoted Feature object's settings, you can bypass these guardrails by setting ignore_guardrails to True.
You can change the readiness state of a Feature object using the update_readiness() method:
display(customer_state_similarity_28d.readiness)
customer_state_similarity_28d.update_readiness("PUBLIC_DRAFT")
Managing Feature Versions¶
A new feature version is a new Feature object with the same namespace as the original Feature object. The new Feature object has its own Object ID and version name.
New feature versions allow for reusing a Feature with different feature job settings or cleaning operations. If the source table's availability or freshness changes, new feature versions can be created with updated feature job settings. If data quality in the source table changes, new feature versions can be generated with cleaning operations to address new quality issues. Older feature versions can continue to be served without disrupting ML tasks that rely on the feature.
A new version can be created by updating the current feature's feature job setting (if provided) and table cleaning operations (if provided) using the create_new_version() method. The new version's readiness is set to "DRAFT" by default.
new_version = customer_state_similarity_28d.create_new_version(
table_feature_job_settings=[
fb.TableFeatureJobSetting(
table_name="GROCERYINVOICE",
feature_job_setting=fb.FeatureJobSetting(
blind_spot="60s",
period="3600s",
offset="90s",
)
)
]
)
print(new_version.readiness)
The Object ID and version name of the new Feature object can be accessed using the id and version properties. The name remains the same as the original Feature object.
print("new_version.name", new_version.name)
print("new_version.id", new_version.id)
print("new_version.version", new_version.version)
You can list Feature objects (versions) with the same namespace from any Feature object using the list_versions() method.
Setting a Default Feature Version¶
The default version simplifies feature reuse by providing the most appropriate version when none is explicitly specified. By default, the feature's default version mode is automatic, selecting the highest readiness level version. The most recent one becomes the default if multiple versions have the same readiness level.
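The automatic selection rule described above (highest readiness first, then most recent) can be sketched in plain Python. The ranking of the four readiness levels, the version names, and the `pick_default_version` helper are assumptions made for illustration; this is not the SDK's implementation.

```python
from datetime import datetime

# Assumed ordering of readiness levels, from lowest to highest.
READINESS_RANK = {
    "DEPRECATED": 0,
    "DRAFT": 1,
    "PUBLIC_DRAFT": 2,
    "PRODUCTION_READY": 3,
}


def pick_default_version(versions):
    """Pick the version with the highest readiness, breaking ties by recency.

    `versions` is a list of (version_name, readiness, created_at) tuples.
    """
    best = max(versions, key=lambda v: (READINESS_RANK[v[1]], v[2]))
    return best[0]


# Hypothetical versions of one feature namespace.
versions = [
    ("V230101", "PUBLIC_DRAFT", datetime(2023, 1, 1)),
    ("V230201", "DRAFT", datetime(2023, 2, 1)),
    ("V230301", "PUBLIC_DRAFT", datetime(2023, 3, 1)),
]
default_version_name = pick_default_version(versions)  # "V230301"
```

The newer DRAFT version is skipped because a PUBLIC_DRAFT version exists, and the more recent of the two PUBLIC_DRAFT versions wins the tie.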
You can change the feature's default version mode using the update_default_version_mode() method.
When a feature's default version mode is set to manual, you can use the as_default_version() method to designate a specific Feature object among the highest readiness level versions as the default version for Feature objects with the same namespace (as opposed to the most recent one in automatic mode).
new_version.update_readiness("PUBLIC_DRAFT") # new_version becomes the default version
new_version.update_default_version_mode("MANUAL")
customer_state_similarity_28d.as_default_version()
To reset the default version mode of the feature and make the original feature version the default, use the following code:
customer_state_similarity_28d.update_default_version_mode("AUTO")
customer_state_similarity_28d.is_default
Accessing a Feature from the Catalog¶
You can view a list of the existing features in the catalog, including their detailed information, using the list_features() method.
Note
The list_features() method returns the default version of each feature.
To obtain the default version of a feature, use its namespace when calling the get_feature() method. If you want to retrieve a specific version, provide the version name as well.
default_version = catalog.get_feature("CustomerStateSimilarity_28d")
new_version_added_to_catalog = catalog.get_feature(
"CustomerStateSimilarity_28d", version=new_version.version
)
You can also retrieve a Feature object by its Object ID with the get_feature_by_id() method.
Accessing the Feature Definition file of a Feature object¶
The feature definition file is a Feature object's single source of truth. The file is generated automatically after a feature is declared in the SDK.
This file uses the same SDK syntax as the feature declaration and provides an explicit outline of the intended operations of the feature declaration, including those inherited but not explicitly declared by you. For example, these operations may include feature job settings and cleaning operations inherited from table metadata.
The feature definition file is the basis for generating the final logical execution graph, which is then transpiled into platform-specific SQL (e.g. SnowSQL, SparkSQL) for feature materialization.
The file can be easily displayed in the SDK using the definition property.