Feature EDA
Prerequisites
This page uses the wait_for_task helper defined in API Overview.
Run EDA on features to analyze their distributions and relationships with the target. You can provide features from different sources.
When the ideation pipeline runs EDA, it uses Residual EDA if a naive prediction is available (typical for forecast use cases). Instead of scoring features against the raw target, Residual EDA measures what each feature adds on top of the naive baseline — plots are computed on residuals (target minus naive) or ratios (target divided by naive), and scoring uses the Incremental Predictive Score. This ensures features are evaluated on their incremental contribution rather than on signal the naive prediction already captures. See Residual EDA for details.
You can trigger Residual EDA explicitly via POST /eda by including the naive_prediction field in the request body.
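To make the residual and ratio definitions above concrete, here is a minimal sketch of the arithmetic in plain Python. The function name `residual_target` and the sample numbers are illustrative only. The service computes these quantities itself, and how it treats a zero naive prediction in the multiplicative case is not stated on this page.

```python
def residual_target(target, naive, structure="additive"):
    """Return the quantity analyzed in place of the raw target.

    "additive"       -> residuals: target - naive
    "multiplicative" -> ratios:    target / naive
    """
    if structure == "additive":
        return [t - n for t, n in zip(target, naive)]
    if structure == "multiplicative":
        # A zero naive value raises ZeroDivisionError in this sketch.
        return [t / n for t, n in zip(target, naive)]
    raise ValueError(f"unknown structure: {structure!r}")


# Made-up values, for illustration only.
target = [120.0, 90.0, 150.0]
naive = [100.0, 100.0, 120.0]

residuals = residual_target(target, naive, "additive")
ratios = residual_target(target, naive, "multiplicative")
print(residuals)  # [20.0, -10.0, 30.0]
print(ratios)     # [1.2, 0.9, 1.25]
```

A feature that only tracks the naive prediction explains little of these residuals or ratios, which is why it would score low under Residual EDA even if it correlates with the raw target.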
From Feature IDs
import featurebyte as fb
client = fb.Configurations().get_client()
# Submit an EDA task for an explicit list of features
response = client.post(
    "/eda",
    json={
        "feature_ids": [feature_id_1, feature_id_2],
        "use_case_id": use_case_id,
    },
)
task_id = response.json()["id"]
# wait_for_task is the helper defined in API Overview
task = wait_for_task(client, task_id)
eda_id = task.get("payload", {}).get("output_document_id")
From a Feature Ideation
response = client.post(
    "/eda",
    json={
        "feature_ideation_id": feature_ideation_id,
        "use_case_id": use_case_id,
    },
)
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
eda_id = task.get("payload", {}).get("output_document_id")
From a Feature List
response = client.post(
    "/eda",
    json={
        "feature_list_id": feature_list_id,
        "use_case_id": use_case_id,
    },
)
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
eda_id = task.get("payload", {}).get("output_document_id")
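The three requests above differ only in the source field, so they can be folded into one small wrapper. The sketch below is a convenience under stated assumptions, not part of the FeatureByte SDK: `run_eda` is a hypothetical name, and the `wait_for_task` helper from API Overview is passed in as an argument so the block stays self-contained.

```python
def run_eda(client, wait_for_task, use_case_id, **source):
    """Submit POST /eda for one feature source and return the EDA document ID.

    `source` is expected to hold one of feature_ids, feature_ideation_id,
    or feature_list_id, mirroring the request bodies shown above.
    """
    payload = {"use_case_id": use_case_id, **source}
    response = client.post("/eda", json=payload)
    task_id = response.json()["id"]
    # Block until the task finishes, then read the output document ID.
    task = wait_for_task(client, task_id)
    return task.get("payload", {}).get("output_document_id")


# Usage, given the client and IDs from the examples above:
# eda_id = run_eda(client, wait_for_task, use_case_id, feature_list_id=feature_list_id)
```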
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| use_case_id | string | Yes | ID of the use case providing context for EDA |
| feature_ids | array | One of | List of feature IDs to analyze |
| feature_ideation_id | string | One of | ID of a feature ideation to analyze all its features |
| feature_list_id | string | One of | ID of a feature list to analyze all its features |
| naive_prediction | object | No | Include to trigger Residual EDA (see Residual EDA) |
| naive_prediction.feature_id | string | - | ID of the naive prediction feature |
| naive_prediction.structure | string | - | "additive" (residuals: target - naive) or "multiplicative" (ratios: target / naive) |
| overwrite | boolean | No | Overwrite existing EDA results (default: false) |
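As a sketch of how these parameters combine, the helper below assembles a request body and checks it before sending. It assumes "One of" means exactly one source field per request, which this page does not state explicitly. `build_eda_payload` and the placeholder IDs are hypothetical.

```python
SOURCE_KEYS = ("feature_ids", "feature_ideation_id", "feature_list_id")
STRUCTURES = ("additive", "multiplicative")


def build_eda_payload(use_case_id, *, naive_feature_id=None,
                      naive_structure="additive", overwrite=False, **source):
    """Build a POST /eda body from the parameters in the table above."""
    unknown = set(source) - set(SOURCE_KEYS)
    if unknown:
        raise ValueError(f"unknown source field(s): {sorted(unknown)}")
    provided = {k: v for k, v in source.items() if v is not None}
    # Assumption: exactly one feature source per request.
    if len(provided) != 1:
        raise ValueError(f"provide exactly one of {SOURCE_KEYS}")

    payload = {"use_case_id": use_case_id, **provided}
    if naive_feature_id is not None:
        if naive_structure not in STRUCTURES:
            raise ValueError(f"structure must be one of {STRUCTURES}")
        # Including naive_prediction is what triggers Residual EDA.
        payload["naive_prediction"] = {
            "feature_id": naive_feature_id,
            "structure": naive_structure,
        }
    if overwrite:
        payload["overwrite"] = True
    return payload


# Placeholder IDs, for illustration only.
payload = build_eda_payload(
    "use-case-id",
    feature_ids=["feature-id-1", "feature-id-2"],
    naive_feature_id="naive-feature-id",
    naive_structure="multiplicative",
)
```

The resulting dictionary can be passed as `json=payload` to `client.post("/eda", ...)` in the same way as the request bodies shown earlier.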
Get Feature EDA Details
Retrieve full EDA results for a specific feature, including power scores and analysis metadata:
response = client.get(f"/eda/{feature_eda_id}")
eda = response.json()
print(f"Feature: {eda.get('feature_id')}")
print(f"Predictive power score: {eda.get('predictive_power_score')}")
print(f"Feature type: {eda.get('feature_type')}")
print(f"Target type: {eda.get('target_type')}")
print(f"Target categories: {eda.get('target_categories')}")
Response fields:
| Field | Type | Description |
|---|---|---|
| id | string | Feature EDA ID |
| feature_id | string | Associated feature ID |
| feature_type | string | Feature type: "numerical", "categorical", "text", "dict", "embedding" |
| target_type | string | Target type: "REGRESSION", "BINARY_CLASSIFICATION", etc. |
| predictive_power_score | float | Predictive power score (higher = more predictive) |
| null_power_score | float | Power score from null values alone |
| no_bucket_power_score | float | Power score without bucketing |
| target_categories | array | Available target categories (for classification) |
| feature_categories | array | Available feature categories (dictionary keys or embedding dimensions to filter by) |
| feature_source | string | Source of the feature (e.g., "CATALOG") |
| definition_hash | string | Hash of the feature definition |
| version | object | Feature version identifier |
| error_reason | string | Error description if the analysis failed |
| plots | array | Plot data objects, each containing an info field with summary statistics (see below) |
| use_case_id | string | Associated use case ID |
| context_id | string | Associated context ID |
| observation_table_id | string | Observation table used for EDA |
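When several features have been analyzed, these fields are enough to order them by predictive power. The following is a minimal sketch, assuming each item is a response body shaped like the table above and that a non-empty `error_reason` means the scores should not be trusted. `rank_by_power` and the sample documents are illustrative.

```python
def rank_by_power(eda_docs):
    """Return successful EDA documents sorted by predictive_power_score, highest first."""
    usable = [
        doc for doc in eda_docs
        if not doc.get("error_reason")
        and doc.get("predictive_power_score") is not None
    ]
    return sorted(usable, key=lambda doc: doc["predictive_power_score"], reverse=True)


# Made-up documents, for illustration only.
docs = [
    {"feature_id": "f1", "predictive_power_score": 0.2, "null_power_score": 0.0},
    {"feature_id": "f2", "predictive_power_score": 0.5, "null_power_score": 0.1},
    {"feature_id": "f3", "error_reason": "analysis failed"},
]
ranked_ids = [doc["feature_id"] for doc in rank_by_power(docs)]
print(ranked_ids)  # ['f2', 'f1']
```

Comparing `predictive_power_score` with `null_power_score` for the same feature is a simple way to see how much of the signal comes from missingness alone.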
Summary Statistics
Each item in plots contains an info object with distribution and target statistics:
eda = client.get(f"/eda/{feature_eda_id}").json()
for plot in eda.get("plots", []):
    # Each plot carries its own summary statistics under "info"
    info = plot.get("info", {})
    print(f"Count: {info.get('count')}, Unique: {info.get('unique')}")
    print(f"Mean: {info.get('mean')}, Std: {info.get('stddev')}")
    print(f"Min: {info.get('min_val')}, Max: {info.get('max_val')}")
    print(f"Missing: {info.get('num_missing')}, Zeros: {info.get('num_zeros')}")
    print(f"Target mean (non-missing): {info.get('target_mean_non_missing')}")
info fields (numeric features):
| Field | Type | Description |
|---|---|---|
| count | integer | Total number of observations |
| unique | integer | Number of distinct values |
| mean | float | Mean value |
| stddev | float | Standard deviation |
| min_val | float | Minimum value |
| max_val | float | Maximum value |
| q01, q05, q10, q25, q50, q75, q90, q95, q99 | float | Percentiles |
| num_missing | integer | Number of missing values |
| num_non_missing | integer | Number of non-missing values |
| num_zeros | integer | Number of zero values |
| num_non_zeros | integer | Number of non-zero values |
| pct_zeros | float | Percentage of zeros |
| num_lower_outliers | integer | Number of lower outliers |
| num_upper_outliers | integer | Number of upper outliers |
| target_mean_non_missing | float | Mean target value for non-missing feature values |
| target_mean_missing | float | Mean target value for missing feature values |
| target_mean_zeros | float | Mean target value for zero feature values |
| target_mean_non_zeros | float | Mean target value for non-zero feature values |
| target_mean_lower_outliers | float | Mean target value for lower outlier feature values |
| target_mean_upper_outliers | float | Mean target value for upper outlier feature values |
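A few derived quantities are often easier to read than the raw counts. The sketch below computes them from one `info` object. It assumes `count` includes missing values, which the table suggests but does not confirm. The function name and sample numbers are illustrative.

```python
def derive_info_metrics(info):
    """Compute missing rate, outlier rate, IQR, and the missing-value target gap."""
    count = info.get("count") or 0
    if count == 0:
        return None  # nothing to normalize by

    outliers = (info.get("num_lower_outliers") or 0) + (info.get("num_upper_outliers") or 0)
    q25, q75 = info.get("q25"), info.get("q75")
    tm_missing = info.get("target_mean_missing")
    tm_present = info.get("target_mean_non_missing")

    return {
        # Assumption: count is the total including missing values.
        "missing_rate": (info.get("num_missing") or 0) / count,
        "outlier_rate": outliers / count,
        "iqr": q75 - q25 if q25 is not None and q75 is not None else None,
        # Positive gap: the target is higher on average when the feature is missing.
        "missing_target_gap": (
            tm_missing - tm_present
            if tm_missing is not None and tm_present is not None
            else None
        ),
    }


# Made-up statistics, for illustration only.
sample_info = {
    "count": 200, "num_missing": 10,
    "num_lower_outliers": 2, "num_upper_outliers": 6,
    "q25": 1.5, "q75": 4.0,
    "target_mean_missing": 0.5, "target_mean_non_missing": 0.25,
}
metrics = derive_info_metrics(sample_info)
```

A large `missing_target_gap` alongside a high `missing_rate` hints that missingness itself carries signal, which is what `null_power_score` is meant to capture.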
Get Plot Options
Check what plot options are available for a feature (e.g., target categories for classification, or dictionary/embedding keys):
response = client.request(
    "OPTIONS",
    f"/eda/{feature_id}/plots",
    params={"use_case_id": use_case_id},
)
options = response.json()
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| use_case_id | string | No | ID of the use case |
| context_id | string | No | ID of the context (alternative to use_case_id) |
| feature_table_id | string | No | ID of a feature table |
Response fields:
| Field | Type | Description |
|---|---|---|
| target_categories | array | Available target categories (classification use cases) |
| feature_categories | array | Available feature categories (dictionary and embedding features: keys or dimensions to filter by) |
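The OPTIONS response can be turned into query parameters for the plot endpoint, one set per category combination. This is a sketch that assumes either list may be absent or empty when it does not apply to the feature. `plot_param_grid` is a hypothetical helper, and the `target_category` and `feature_category` names come from the plot parameters documented on this page.

```python
from itertools import product


def plot_param_grid(options, use_case_id):
    """Build one params dict per (target category, feature category) combination."""
    # Fall back to a single "no filter" entry when a list is missing or empty.
    target_cats = options.get("target_categories") or [None]
    feature_cats = options.get("feature_categories") or [None]

    grid = []
    for target_cat, feature_cat in product(target_cats, feature_cats):
        params = {"use_case_id": use_case_id}
        if target_cat is not None:
            params["target_category"] = target_cat
        if feature_cat is not None:
            params["feature_category"] = feature_cat
        grid.append(params)
    return grid


# Made-up options payload, for illustration only.
grid = plot_param_grid(
    {"target_categories": ["yes", "no"], "feature_categories": None},
    "use-case-id",
)
print(len(grid))  # 2
```

Each dictionary in `grid` can be passed as `params=` to a GET request on the plots endpoint.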
Get EDA Plots
Retrieve rendered plots for a specific feature. The feature_id must be one of the features included in the EDA.
response = client.get(
    f"/eda/{feature_id}/plots",
    params={
        "use_case_id": use_case_id,
        "height": 500,
        "width": 1000,
        "font_size": 16,
        "output_format": "html",
    },
)
plots = response.json()
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| use_case_id | string | No | ID of the use case |
| context_id | string | No | ID of the context (alternative to use_case_id) |
| target_category | any | No | Filter by target category (from OPTIONS response) |
| feature_category | any | No | Filter by dictionary key or embedding dimension (from OPTIONS response) |
| height | integer | No | Plot height in pixels (default: 500) |
| width | integer | No | Plot width in pixels (default: 1000) |
| font_size | integer | No | Font size in pixels (default: 16) |
| output_format | string | No | "html" (default) or "json" |
Response fields (each item):
| Field | Type | Description |
|---|---|---|
| plot_type | string | Type of plot (e.g., distribution, feature vs target) |
| plots | array | List of rendered plot objects with content (HTML string) |
The response is a plot list. See Displaying Plots for how to render plots in Jupyter, save as HTML, or embed in a web application.
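Before rendering, it can help to flatten the nested response into simple pairs. The sketch below assumes the response was requested with `output_format` set to "html" and that each rendered plot object exposes its markup under `content`, as the response table describes. `iter_plot_html` is a hypothetical helper and does not replace the guidance in Displaying Plots.

```python
def iter_plot_html(plot_list):
    """Yield (plot_type, html) pairs from the GET plots response."""
    for group in plot_list:
        plot_type = group.get("plot_type")
        for plot in group.get("plots") or []:
            content = plot.get("content")
            if content:  # skip entries without rendered markup
                yield plot_type, content


# Made-up response, for illustration only.
sample_response = [
    {"plot_type": "distribution", "plots": [{"content": "<div>a</div>"}, {"content": "<div>b</div>"}]},
    {"plot_type": "feature vs target", "plots": []},
]
pairs = list(iter_plot_html(sample_response))
print(len(pairs))  # 2
```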