Feature EDA

Prerequisites

This page uses the wait_for_task helper defined in API Overview.

Run EDA on features to analyze their distributions and relationships with the target. You can supply the features as a list of feature IDs, a feature ideation, or a feature list.

When the ideation pipeline runs EDA, it uses Residual EDA if a naive prediction is available (typical for forecast use cases). Instead of scoring features against the raw target, Residual EDA measures what each feature adds on top of the naive baseline: plots are computed on residuals (target minus naive) or ratios (target divided by naive), and scoring uses the Incremental Predictive Score. Features are therefore evaluated on their incremental contribution rather than on signal the naive prediction already captures. See Residual EDA for details.
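To make the two structures concrete, the following minimal sketch computes additive residuals and multiplicative ratios for a handful of made-up values. The numbers and variable names are illustrative only, and this is not the platform's internal implementation.

```python
# Illustrative values only: actual target and a naive forecast for four rows.
target = [120.0, 80.0, 100.0, 150.0]
naive = [100.0, 100.0, 100.0, 120.0]

# Additive structure: residual = target - naive
residuals = [t - n for t, n in zip(target, naive)]

# Multiplicative structure: ratio = target / naive (naive must be non-zero)
ratios = [t / n for t, n in zip(target, naive)]

print(residuals)  # [20.0, -20.0, 0.0, 30.0]
print(ratios)     # [1.2, 0.8, 1.0, 1.25]
```

A feature that explains why the second row fell 20 below the naive forecast is adding information the baseline does not already carry.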

You can trigger Residual EDA explicitly via POST /eda by including the naive_prediction field in the request body.
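A Residual EDA request body might be assembled as in the sketch below. The field names are taken from the parameter table on this page; the helper name build_residual_eda_payload and the placeholder IDs are hypothetical and exist only for illustration.

```python
def build_residual_eda_payload(use_case_id, feature_ids, naive_feature_id, structure="additive"):
    """Assemble a POST /eda body that includes naive_prediction (hypothetical helper)."""
    # The parameter table lists only these two structures.
    if structure not in ("additive", "multiplicative"):
        raise ValueError("structure must be 'additive' or 'multiplicative'")
    return {
        "feature_ids": list(feature_ids),
        "use_case_id": use_case_id,
        "naive_prediction": {
            "feature_id": naive_feature_id,
            "structure": structure,
        },
    }


# Placeholder IDs; substitute real ones from your catalog.
payload = build_residual_eda_payload(
    use_case_id="<use_case_id>",
    feature_ids=["<feature_id_1>", "<feature_id_2>"],
    naive_feature_id="<naive_feature_id>",
    structure="multiplicative",
)
# The payload would then be sent with: client.post("/eda", json=payload)
```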

From Feature IDs

import featurebyte as fb

client = fb.Configurations().get_client()

response = client.post(
    "/eda",
    json={
        "feature_ids": [feature_id_1, feature_id_2],
        "use_case_id": use_case_id,
    },
)
task_id = response.json()["id"]
task = wait_for_task(client, task_id)

eda_id = task.get("payload", {}).get("output_document_id")

From a Feature Ideation

response = client.post(
    "/eda",
    json={
        "feature_ideation_id": feature_ideation_id,
        "use_case_id": use_case_id,
    },
)
task_id = response.json()["id"]
task = wait_for_task(client, task_id)

eda_id = task.get("payload", {}).get("output_document_id")

From a Feature List

response = client.post(
    "/eda",
    json={
        "feature_list_id": feature_list_id,
        "use_case_id": use_case_id,
    },
)
task_id = response.json()["id"]
task = wait_for_task(client, task_id)

eda_id = task.get("payload", {}).get("output_document_id")

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| use_case_id | string | Yes | ID of the use case providing context for EDA |
| feature_ids | array | One of | List of feature IDs to analyze |
| feature_ideation_id | string | One of | ID of a feature ideation to analyze all its features |
| feature_list_id | string | One of | ID of a feature list to analyze all its features |
| naive_prediction | object | No | Include to trigger Residual EDA (see Residual EDA) |
| naive_prediction.feature_id | string | | ID of the naive prediction feature |
| naive_prediction.structure | string | | "additive" (residuals: target - naive) or "multiplicative" (ratios: target / naive) |
| overwrite | boolean | No | Overwrite existing EDA results (default: false) |
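The "One of" rows can be checked client-side before sending a request. The sketch below assumes that exactly one feature source is expected per request, which is an interpretation of the table rather than documented server behavior; build_eda_payload is a hypothetical helper name.

```python
SOURCE_KEYS = ("feature_ids", "feature_ideation_id", "feature_list_id")


def build_eda_payload(use_case_id, overwrite=False, **source):
    """Build a POST /eda body with exactly one feature source (hypothetical helper)."""
    unknown = set(source) - set(SOURCE_KEYS)
    if unknown:
        raise ValueError(f"Unknown source field(s): {sorted(unknown)}")
    # Assumption: exactly one of the three sources should be supplied.
    provided = [key for key in SOURCE_KEYS if source.get(key)]
    if len(provided) != 1:
        raise ValueError(f"Provide exactly one of {SOURCE_KEYS}, got {provided}")
    payload = {"use_case_id": use_case_id, provided[0]: source[provided[0]]}
    if overwrite:
        # Only send the flag when overriding the documented default (false).
        payload["overwrite"] = True
    return payload


payload = build_eda_payload("<use_case_id>", feature_list_id="<feature_list_id>", overwrite=True)
print(payload)
```

Failing early on an ambiguous request avoids queuing a task that would have to be rerun.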

Get Feature EDA Details

Retrieve full EDA results for a specific feature, including power scores and analysis metadata:

response = client.get(f"/eda/{feature_eda_id}")
eda = response.json()

print(f"Feature: {eda.get('feature_id')}")
print(f"Predictive power score: {eda.get('predictive_power_score')}")
print(f"Feature type: {eda.get('feature_type')}")
print(f"Target type: {eda.get('target_type')}")
print(f"Target categories: {eda.get('target_categories')}")

Response fields:

| Field | Type | Description |
|---|---|---|
| id | string | Feature EDA ID |
| feature_id | string | Associated feature ID |
| feature_type | string | Feature type: "numerical", "categorical", "text", "dict", "embedding" |
| target_type | string | Target type: "REGRESSION", "BINARY_CLASSIFICATION", etc. |
| predictive_power_score | float | Predictive power score (higher = more predictive) |
| null_power_score | float | Power score from null values alone |
| no_bucket_power_score | float | Power score without bucketing |
| target_categories | array | Available target categories (for classification) |
| feature_categories | array | Available feature categories (dictionary keys or embedding dimensions to filter by) |
| feature_source | string | Source of the feature (e.g., "CATALOG") |
| definition_hash | string | Hash of the feature definition |
| version | object | Feature version identifier |
| error_reason | string | Error description if the analysis failed |
| plots | array | Plot data objects, each containing an info field with summary statistics (see below) |
| use_case_id | string | Associated use case ID |
| context_id | string | Associated context ID |
| observation_table_id | string | Observation table used for EDA |
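Once several EDA documents have been fetched, they can be ranked by predictive_power_score. The following is a minimal sketch that works on plain dictionaries shaped like the response fields above; the sample records are invented, and skipping entries with an error_reason is a design choice rather than documented behavior.

```python
def rank_by_predictive_power(eda_results):
    """Return (feature_id, score) pairs sorted by descending score, skipping failed analyses."""
    scored = [
        (eda["feature_id"], eda["predictive_power_score"])
        for eda in eda_results
        if not eda.get("error_reason") and eda.get("predictive_power_score") is not None
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented sample records for illustration.
sample = [
    {"feature_id": "f1", "predictive_power_score": 0.12},
    {"feature_id": "f2", "predictive_power_score": 0.47},
    {"feature_id": "f3", "predictive_power_score": None, "error_reason": "analysis failed"},
]
print(rank_by_predictive_power(sample))  # [('f2', 0.47), ('f1', 0.12)]
```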

Summary Statistics

Each item in plots contains an info object with distribution and target statistics:

eda = client.get(f"/eda/{feature_eda_id}").json()

for plot in eda.get("plots", []):
    info = plot.get("info", {})
    print(f"Count: {info.get('count')}, Unique: {info.get('unique')}")
    print(f"Mean: {info.get('mean')}, Std: {info.get('stddev')}")
    print(f"Min: {info.get('min_val')}, Max: {info.get('max_val')}")
    print(f"Missing: {info.get('num_missing')}, Zeros: {info.get('num_zeros')}")
    print(f"Target mean (non-missing): {info.get('target_mean_non_missing')}")

info fields (numeric features):

| Field | Type | Description |
|---|---|---|
| count | integer | Total number of observations |
| unique | integer | Number of distinct values |
| mean | float | Mean value |
| stddev | float | Standard deviation |
| min_val | float | Minimum value |
| max_val | float | Maximum value |
| q01, q05, q10, q25, q50, q75, q90, q95, q99 | float | Percentiles |
| num_missing | integer | Number of missing values |
| num_non_missing | integer | Number of non-missing values |
| num_zeros | integer | Number of zero values |
| num_non_zeros | integer | Number of non-zero values |
| pct_zeros | float | Percentage of zeros |
| num_lower_outliers | integer | Number of lower outliers |
| num_upper_outliers | integer | Number of upper outliers |
| target_mean_non_missing | float | Mean target value for non-missing feature values |
| target_mean_missing | float | Mean target value for missing feature values |
| target_mean_zeros | float | Mean target value for zero feature values |
| target_mean_non_zeros | float | Mean target value for non-zero feature values |
| target_mean_lower_outliers | float | Mean target value for lower outlier feature values |
| target_mean_upper_outliers | float | Mean target value for upper outlier feature values |
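A few derived quantities are often useful when reading these fields, for example the share of missing values or how the target differs when the feature is missing. The sketch below computes them from an info dictionary; the helper name summarize_info, the derived keys, and the sample values are hypothetical and not part of the API.

```python
def summarize_info(info):
    """Derive a few convenience metrics from a numeric info object (hypothetical helper)."""
    count = info.get("count") or 0
    num_missing = info.get("num_missing") or 0
    # Fraction of observations with a missing feature value.
    missing_rate = num_missing / count if count else None

    tm_missing = info.get("target_mean_missing")
    tm_non_missing = info.get("target_mean_non_missing")
    # Difference in mean target between missing and non-missing rows.
    gap = None
    if tm_missing is not None and tm_non_missing is not None:
        gap = tm_missing - tm_non_missing

    q25, q75 = info.get("q25"), info.get("q75")
    # Interquartile range from the reported percentiles.
    iqr = q75 - q25 if q25 is not None and q75 is not None else None

    return {"missing_rate": missing_rate, "missing_target_gap": gap, "iqr": iqr}


# Invented sample values for illustration.
summary = summarize_info({
    "count": 1000,
    "num_missing": 50,
    "target_mean_missing": 0.5,
    "target_mean_non_missing": 0.25,
    "q25": 2.0,
    "q75": 6.0,
})
print(summary)
```

A large gap between the two target means suggests that missingness itself carries signal, which is what null_power_score is meant to capture.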

Get Plot Options

Check what plot options are available for a feature (e.g., target categories for classification, or dictionary/embedding keys):

response = client.request(
    "OPTIONS",
    f"/eda/{feature_id}/plots",
    params={"use_case_id": use_case_id},
)
options = response.json()

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| use_case_id | string | No | ID of the use case |
| context_id | string | No | ID of the context (alternative to use_case_id) |
| feature_table_id | string | No | ID of a feature table |

Response fields:

| Field | Type | Description |
|---|---|---|
| target_categories | array | Available target categories (classification use cases) |
| feature_categories | array | Available feature categories (dictionary and embedding features: keys/dimensions to filter by) |

Get EDA Plots

Retrieve rendered plots for a specific feature. The feature_id must be one of the features included in the EDA.

response = client.get(
    f"/eda/{feature_id}/plots",
    params={
        "use_case_id": use_case_id,
        "height": 500,
        "width": 1000,
        "font_size": 16,
        "output_format": "html",
    },
)
plots = response.json()

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| use_case_id | string | No | ID of the use case |
| context_id | string | No | ID of the context (alternative to use_case_id) |
| target_category | any | No | Filter by target category (from OPTIONS response) |
| feature_category | any | No | Filter by dictionary key or embedding dimension (from OPTIONS response) |
| height | integer | No | Plot height in pixels (default: 500) |
| width | integer | No | Plot width in pixels (default: 1000) |
| font_size | integer | No | Font size in pixels (default: 16) |
| output_format | string | No | "html" (default) or "json" |
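The OPTIONS response and the target_category parameter can be combined to request one set of plots per target category. The sketch below only builds the query parameters; plot_request_params is a hypothetical helper, and falling back to a single unfiltered request when no categories are returned is an assumption.

```python
def plot_request_params(options, use_case_id, output_format="html"):
    """Build one params dict per target category from an OPTIONS response (hypothetical helper)."""
    base = {"use_case_id": use_case_id, "output_format": output_format}
    # Assumption: with no target categories (e.g. regression), issue one unfiltered request.
    categories = options.get("target_categories") or [None]
    params_list = []
    for category in categories:
        params = dict(base)
        if category is not None:
            params["target_category"] = category
        params_list.append(params)
    return params_list


# Invented OPTIONS payload for illustration.
options = {"target_categories": ["yes", "no"], "feature_categories": []}
for params in plot_request_params(options, "<use_case_id>"):
    print(params)
    # Each dict could then be passed as: client.get(f"/eda/{feature_id}/plots", params=params)
```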

Response fields (each item):

| Field | Type | Description |
|---|---|---|
| plot_type | string | Type of plot (e.g., distribution, feature vs target) |
| plots | array | List of rendered plot objects with content (HTML string) |

The response is a plot list. See Displaying Plots for how to render plots in Jupyter, save as HTML, or embed in a web application.
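As a rough illustration of the response shape, the sketch below flattens a plot list into (plot_type, content) pairs. It assumes each rendered plot object exposes its HTML under a content key, as the response fields table suggests; collect_plot_html and the sample data are hypothetical.

```python
def collect_plot_html(plot_list):
    """Flatten a plot list into (plot_type, html) pairs (hypothetical helper)."""
    pairs = []
    for item in plot_list:
        for plot in item.get("plots", []):
            # Assumption: the rendered HTML string is stored under "content".
            content = plot.get("content")
            if content:
                pairs.append((item.get("plot_type"), content))
    return pairs


# Invented sample response for illustration.
sample_plots = [
    {"plot_type": "distribution", "plots": [{"content": "<div>plot A</div>"}]},
    {"plot_type": "feature_vs_target", "plots": [{"content": "<div>plot B</div>"}, {}]},
]
for plot_type, html in collect_plot_html(sample_plots):
    print(plot_type, len(html))
```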