
featurebyte.FeatureList.compute_historical_features

compute_historical_features(
observation_set: DataFrame,
serving_names_mapping: Optional[Dict[str, str]]=None
) -> DataFrame

Description

Returns a DataFrame of feature values for analysis, model training, or evaluation. The request takes an observation set that combines historical points in time with key values of the feature list's primary entity.

Key values of associated serving entities can also be used.

The first computation may take longer; subsequent calls are faster because partially aggregated data (tiles) is pre-computed and stored.

A training data observation set should typically meet the following criteria:

  • be collected from a time period that starts after the earliest data availability timestamp plus the longest time window in the features
  • be collected from a time period that ends before the latest data timestamp less the time window of the target value
  • use points in time that align with the anticipated timing of the use case inference, whether it is based on a regular schedule, triggered by an event, or driven by any other timing mechanism
  • have no duplicate rows
  • have, for the same entity key, points in time separated by intervals greater than the horizon of the target, to avoid leakage
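
Most of the criteria above can be checked with plain pandas before features are requested. The sketch below is illustrative and not part of the featurebyte API: the helper name `check_observation_set`, its parameters, and the example dates are assumptions, and it treats the "time window of the target value" and the "horizon of the target" as the same quantity. The criterion about aligning with inference timing depends on the use case and is not checked.

```python
import pandas as pd


def check_observation_set(
    observation_set: pd.DataFrame,
    entity_column: str,
    earliest_data: pd.Timestamp,
    latest_data: pd.Timestamp,
    longest_feature_window: pd.Timedelta,
    target_horizon: pd.Timedelta,
) -> list:
    """Hypothetical helper: return a list of problems (empty if all checks pass)."""
    problems = []
    points = pd.to_datetime(observation_set["POINT_IN_TIME"])

    # Must start after earliest data + longest feature window.
    if points.min() <= earliest_data + longest_feature_window:
        problems.append("starts before features have a full window of history")

    # Must end before latest data - target horizon.
    if points.max() >= latest_data - target_horizon:
        problems.append("ends too late for the target to be fully observed")

    # Must not contain duplicate rows.
    if observation_set.duplicated().any():
        problems.append("contains duplicate rows")

    # For each entity key, consecutive points in time must be more than
    # one target horizon apart.
    ordered = observation_set.assign(POINT_IN_TIME=points).sort_values(
        [entity_column, "POINT_IN_TIME"]
    )
    gaps = ordered.groupby(entity_column)["POINT_IN_TIME"].diff().dropna()
    if (gaps <= target_horizon).any():
        problems.append("same-entity points in time are within the target horizon")

    return problems


# Example values below are assumptions chosen for illustration.
observation_set = pd.DataFrame({
    "POINT_IN_TIME": pd.date_range(start="2022-04-15", end="2022-04-30", freq="2D"),
    "GROCERYCUSTOMERGUID": ["a2828c3b-036c-4e2e-9bd6-30c9ee9a20e3"] * 8,
})
problems = check_observation_set(
    observation_set,
    entity_column="GROCERYCUSTOMERGUID",
    earliest_data=pd.Timestamp("2022-01-01"),
    latest_data=pd.Timestamp("2022-06-30"),
    longest_feature_window=pd.Timedelta(days=60),
    target_horizon=pd.Timedelta(days=7),
)
print(problems)
```

With a 7-day target horizon, points in time spaced two days apart for the same customer are flagged by the last check; spacing them more than seven days apart, or using a shorter horizon, clears it.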

Parameters

  • observation_set: DataFrame
    Observation set DataFrame that combines historical points in time with values of the feature list's primary entity or its descendants (serving entities). The column containing the point-in-time values should be named POINT_IN_TIME, while the columns representing entity values should be named using accepted serving names for the entity.

  • serving_names_mapping: Optional[Dict[str, str]]
    Optional serving names mapping, for use when the observation set has serving name columns different from those defined in Entities. Maps each original serving name to its new name.
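
The column naming requirement for observation_set can be verified before the call. The helper below, `missing_observation_columns`, is a hypothetical name used for illustration; it only inspects column names with pandas and makes no claim about how featurebyte validates its input.

```python
import pandas as pd


def missing_observation_columns(observation_set: pd.DataFrame, serving_names: list) -> list:
    """Hypothetical helper: list required columns absent from the observation set."""
    required = ["POINT_IN_TIME", *serving_names]
    return [name for name in required if name not in observation_set.columns]


# An observation set whose entity column does not use the expected serving name.
observation_set = pd.DataFrame({
    "POINT_IN_TIME": pd.to_datetime(["2022-04-15", "2022-04-17"]),
    "CUSTOMERGUID": ["a2828c3b-036c-4e2e-9bd6-30c9ee9a20e3"] * 2,
})
missing = missing_observation_columns(observation_set, ["GROCERYCUSTOMERGUID"])
print(missing)
```

A non-empty result suggests either renaming the column or supplying serving_names_mapping.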

Returns

  • DataFrame
    Materialized historical features.

Note: POINT_IN_TIME values will be converted to UTC time.
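
To see what this conversion means for time zone aware inputs, the sketch below converts points in time recorded at UTC+08:00 to UTC with pandas. The timestamps are assumed example values, and the snippet shows only the pandas conversion; whether the returned DataFrame carries time zone information is not stated here.

```python
import pandas as pd

# Points in time recorded with a +08:00 offset (assumed example values).
local_points = pd.Series(
    pd.to_datetime(["2022-04-15 09:00:00+08:00", "2022-04-17 09:00:00+08:00"])
)

# Convert to UTC: 09:00 at UTC+08:00 is 01:00 UTC on the same day.
utc_points = local_points.dt.tz_convert("UTC")
print(utc_points)
```

Converting up front makes it easier to match returned rows against the original observation set.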

Examples

Create a feature list with two features.

>>> feature_list = fb.FeatureList(
...     [
...         catalog.get_feature("InvoiceCount_60days"),
...         catalog.get_feature("InvoiceAmountAvg_60days"),
...     ],
...     name="InvoiceFeatures",
... )

Prepare an observation set with POINT_IN_TIME and serving name columns.

>>> observation_set = pd.DataFrame({
...     "POINT_IN_TIME": pd.date_range(start="2022-04-15", end="2022-04-30", freq="2D"),
...     "GROCERYCUSTOMERGUID": ["a2828c3b-036c-4e2e-9bd6-30c9ee9a20e3"] * 8,
... })

Retrieve materialized historical features.

>>> feature_list.compute_historical_features(observation_set)
  POINT_IN_TIME                   GROCERYCUSTOMERGUID  InvoiceCount_60days  InvoiceAmountAvg_60days
0    2022-04-15  a2828c3b-036c-4e2e-9bd6-30c9ee9a20e3                  9.0                10.223333
1    2022-04-17  a2828c3b-036c-4e2e-9bd6-30c9ee9a20e3                  9.0                10.223333
2    2022-04-19  a2828c3b-036c-4e2e-9bd6-30c9ee9a20e3                  9.0                10.223333
3    2022-04-21  a2828c3b-036c-4e2e-9bd6-30c9ee9a20e3                 10.0                 9.799000
4    2022-04-23  a2828c3b-036c-4e2e-9bd6-30c9ee9a20e3                 10.0                 9.799000
5    2022-04-25  a2828c3b-036c-4e2e-9bd6-30c9ee9a20e3                  9.0                 9.034444
6    2022-04-27  a2828c3b-036c-4e2e-9bd6-30c9ee9a20e3                 10.0                 9.715000
7    2022-04-29  a2828c3b-036c-4e2e-9bd6-30c9ee9a20e3                 10.0                 9.715000

Retrieve materialized historical features with serving names mapping.

>>> historical_features = feature_list.compute_historical_features(
...     observation_set=observation_set,
...     serving_names_mapping={"GROCERYCUSTOMERGUID": "CUSTOMERGUID"},
... )
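
The observation set built earlier in these examples names its entity column GROCERYCUSTOMERGUID. If the mapping is read as the parameter description suggests (keys are the serving names defined in Entities, values are the column names present in the observation set), a matching observation set could be prepared with plain pandas as sketched below. This reading is an assumption, and the snippet only renames a column; it does not call featurebyte.

```python
import pandas as pd

# Same mapping as in the example: original serving name -> new name.
serving_names_mapping = {"GROCERYCUSTOMERGUID": "CUSTOMERGUID"}

observation_set = pd.DataFrame({
    "POINT_IN_TIME": pd.date_range(start="2022-04-15", end="2022-04-30", freq="2D"),
    "GROCERYCUSTOMERGUID": ["a2828c3b-036c-4e2e-9bd6-30c9ee9a20e3"] * 8,
})

# Rename the entity column so it carries the new name from the mapping
# (assumed interpretation; POINT_IN_TIME is left untouched).
renamed_observation_set = observation_set.rename(columns=serving_names_mapping)
print(list(renamed_observation_set.columns))
```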

See Also