# Evaluation
See also
Forecast UI Tutorial: Predict and Evaluate | Concepts: Leaderboard | Concepts: Regression Evaluation | Concepts: Binary Classification Evaluation | API Tutorial: Credit Default — Step 12 | API Tutorial: Store Sales Forecast — Step 7
Prerequisites
This page uses the client and wait_for_task helpers defined in API Overview.
This page covers how to evaluate models using the validation leaderboard, holdout leaderboard, and evaluation plots.
## Validation Leaderboard
The validation leaderboard ranks all models trained with a validation set. Models are automatically added to the leaderboard when they are trained with a validation observation table.
### Find the Leaderboard

```python
client = fb.Configurations().get_client()
response = client.get(
    "/catalog/leaderboard",
    params={
        "observation_table_id": validation_table_id,
        "observation_table_purpose": "validation",
        "role": "OUTCOME",
    },
)
leaderboard = response.json()["data"][0]
leaderboard_id = leaderboard["_id"]
primary_metric = leaderboard["primary_metric"]
sort_dir = leaderboard.get("sort_order", "desc")
print(f"Leaderboard: {leaderboard['name']}, metric: {primary_metric}")
```
Leaderboard query parameters:
| Parameter | Type | Description |
|---|---|---|
| `observation_table_id` | string | Filter by associated observation table |
| `observation_table_purpose` | string | Filter by purpose: `"validation"`, `"training"`, or `"holdout"` |
| `role` | string | Leaderboard role: `"OUTCOME"` |
Leaderboard response fields:
| Field | Type | Description |
|---|---|---|
| `id` | string | Leaderboard ID |
| `name` | string | Leaderboard name |
| `primary_metric` | string | Metric used for ranking |
| `sort_order` | string | `"asc"` (lower is better) or `"desc"` (higher is better) |
| `evaluation_metrics` | array | List of metrics computed |
### List Models in the Leaderboard
Use `GET /catalog/ml_model` with the `leaderboard_id` to list models sorted by metric:

```python
response = client.get(
    "/catalog/ml_model",
    params={
        "leaderboard_id": leaderboard_id,
        "sort_by": primary_metric,
        "sort_dir": sort_dir,
        "sort_by_metric": True,
        "show_refits": True,
        "leaderboard_role": "OUTCOME",
        "page_size": 100,
    },
)
models = response.json()["data"]
for m in models:
    scores = {
        s["metric_name"]: round(s["score"], 4)
        for s in m.get("evaluation_scores", [])
        if s.get("score") is not None
    }
    print(f"  {m['name']}: {scores}")

# Best model is first (already sorted by metric)
best_model_id = models[0]["_id"]
```
Query parameters:
| Parameter | Type | Description |
|---|---|---|
| `leaderboard_id` | string | Filter to models in this leaderboard |
| `sort_by` | string | Metric name to sort by (e.g., `"auc"`, `"rmse"`, `"gini_norm"`) |
| `sort_dir` | string | `"asc"` or `"desc"` |
| `sort_by_metric` | boolean | Must be `true` to sort by evaluation metric name |
| `show_refits` | boolean | Include refit models (default: `false`) |
| `leaderboard_role` | string | `"OUTCOME"` |
Model response fields (each item in `data`):

| Field | Type | Description |
|---|---|---|
| `id` | string | Model ID |
| `name` | string | Model name |
| `evaluation_scores` | array | List of `{metric_name, score}` pairs |
| `feature_list_id` | string | Feature list used |
| `model_template_type` | string | Template type (e.g., `"LIGHTGBM"`) |
| `is_pipeline_generated` | boolean | Whether the model was created by a pipeline |
## Holdout Leaderboard
A holdout leaderboard is created automatically when predictions are generated on a holdout observation table that has a target. Generate predictions to trigger it, then retrieve the leaderboard.

```python
response = client.post(
    f"/ml_model/{ml_model_id}/prediction_table",
    json={
        "request_input": {
            "request_type": "observation_table",
            "table_id": holdout_table_id,
        },
        "include_input_features": False,
    },
)
task_id = response.json()["id"]
wait_for_task(client, task_id)
```
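Once the prediction task completes, the holdout leaderboard can be found with the same `GET /catalog/leaderboard` query shown earlier, filtered to `observation_table_purpose: "holdout"`. A minimal sketch (the helper name is illustrative, not part of the API):

```python
def get_leaderboard_id(client, table_id: str, purpose: str) -> str:
    """Return the ID of the first leaderboard matching an observation table.

    Illustrative helper wrapping the documented GET /catalog/leaderboard query.
    """
    response = client.get(
        "/catalog/leaderboard",
        params={
            "observation_table_id": table_id,
            "observation_table_purpose": purpose,
            "role": "OUTCOME",
        },
    )
    return response.json()["data"][0]["_id"]

# leaderboard_id = get_leaderboard_id(client, holdout_table_id, "holdout")
```

The same helper works for the validation leaderboard by passing `purpose="validation"`.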
### View Holdout Leaderboard Results
List models in the holdout leaderboard sorted by its primary metric:

```python
response = client.get(
    "/catalog/ml_model",
    params={
        "leaderboard_id": leaderboard_id,  # ID of the holdout leaderboard
        "sort_by": "rmse",  # the primary metric, e.g. rmse for a regression use case
        "sort_dir": "asc",
        "sort_by_metric": True,
        "show_refits": True,
        "leaderboard_role": "OUTCOME",
        "page_size": 100,
    },
)
models = response.json()["data"]
for m in models:
    scores = {
        s["metric_name"]: round(s["score"], 4)
        for s in m.get("evaluation_scores", [])
        if s.get("score") is not None
    }
    print(f"  {m['name']}: {scores}")
```
### Preview Leaderboard Predictions
Inspect the predictions a specific model made on the holdout set:

```python
response = client.post(
    f"/leaderboard/{leaderboard_id}/ml_model/{ml_model_id}/preview",
)
preview = response.json()
```
Response fields:
| Field | Type | Description |
|---|---|---|
| `columns` | array | Column names in the preview |
| `data` | array | Rows of prediction data (entity IDs, point in time, predicted values, actuals) |
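Because `columns` and `data` are parallel, each row can be zipped into a dict for easier inspection. A small sketch (the helper name is illustrative):

```python
def preview_rows(preview: dict) -> list[dict]:
    """Pair each data row with the column names from a preview payload."""
    return [dict(zip(preview["columns"], row)) for row in preview["data"]]

# for row in preview_rows(preview):
#     print(row)
```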
## Evaluation Plots
The API generates interactive Bokeh plots for model evaluation. The endpoint returns self-contained HTML with embedded JavaScript — no external dependencies needed.
### Get Available Plot Options
The available plot types depend on the model type:
Response fields:
| Field | Type | Description |
|---|---|---|
| `options` | array | Available plot types for this model (e.g., `"distribution"`, `"roc_curve"`) |
| `holdout_tables` | array | Observation tables available for evaluation, each with `table_type`, `table_id`, and `table_name` |
Regression models:

- `distribution` — predicted vs actual distributions
- `predicted_vs_actual` — scatter plot of predicted vs actual values
- `predicted_vs_actual_per_bin` — binned predicted vs actual

Binary classification models:

- `roc_curve` — ROC curve with AUC
- `precision_recall_curve` — precision-recall tradeoff
- `ks_and_gain_curve` — KS statistic and gain curve
- `lift_curve` — lift chart
- `gain_report` — cumulative gain report
- `predicted_vs_actual_per_bin` — calibration plot
- `distribution` — score distributions
- `confusion_matrix` — confusion matrix

Uplift models:

- `incremental_uplift` — incremental uplift curve
- `qini_curve` — Qini coefficient curve
- `gain_report` — uplift gain report
- `predicted_vs_actual_per_bin` — binned uplift
- `distribution` — uplift score distributions
The response also includes `holdout_tables`, a list of observation tables available for evaluation.
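As a sketch of how this payload might be used, the helper below (illustrative, not part of the API) validates a requested plot type against `options` and builds the `holdout_table` argument for the evaluate call. It assumes `options_payload` already holds the parsed JSON response; fetching it is omitted here.

```python
def choose_plot(options_payload: dict, option: str) -> dict:
    """Validate a plot choice and assemble request arguments for evaluation.

    Illustrative helper: raises if the plot type is unavailable for this
    model, and selects the first table from holdout_tables.
    """
    if option not in options_payload["options"]:
        raise ValueError(
            f"Plot {option!r} not available; choose from {options_payload['options']}"
        )
    table = options_payload["holdout_tables"][0]  # or select by table_name
    return {
        "option": option,
        "holdout_table": {
            "table_type": table["table_type"],
            "table_id": table["table_id"],
        },
    }
```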
### Create an Evaluation Plot

```python
response = client.post(
    f"/ml_model/{ml_model_id}/evaluate",
    json={
        "option": "predicted_vs_actual",
        "plot_params": {
            "height": 500,
            "width": 1000,
            "font_size": 16,
        },
        "holdout_table": {
            "table_type": "observation_table",
            "table_id": validation_table_id,
        },
    },
)
html_content = response.json()["content"]
```
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `option` | string | Yes | Plot type (see options list above) |
| `plot_params` | object | No | Plot sizing configuration |
| `plot_params.height` | integer | No | Plot height in pixels (default: 500) |
| `plot_params.width` | integer | No | Plot width in pixels (default: 1000) |
| `plot_params.font_size` | integer | No | Font size in pixels (default: 16) |
| `holdout_table` | object | No | Observation table to evaluate against |
| `holdout_table.table_type` | string | Yes | Must be `"observation_table"` |
| `holdout_table.table_id` | string | Yes | ID of the observation table |
The `content` field contains a self-contained Bokeh HTML document. See Displaying Plots for how to render it in Jupyter, save it as HTML, or embed it in a web application.
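For instance, a minimal way to save the document for viewing in a browser (the helper and file name are illustrative):

```python
from pathlib import Path

def save_plot(html_content: str, path: str) -> None:
    """Write the self-contained Bokeh HTML document to a file."""
    Path(path).write_text(html_content, encoding="utf-8")

# save_plot(html_content, "predicted_vs_actual.html")
```

Because the HTML embeds its own JavaScript, the saved file opens directly in any browser with no server required.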
## Forecast Comparison
For forecast use cases, the API can generate interactive plots comparing predictions vs actual target values across forecast points. The plot shows one prediction line per point-in-time, with an optional target (actual) line overlay.
### List Available Entities
Before creating a forecast comparison, retrieve the distinct entity values in the prediction table:

```python
# Submit the entity extraction task (first time only)
response = client.post(f"/prediction_table/{prediction_table_id}/prediction_entities")
task_id = response.json()["id"]
wait_for_task(client, task_id)

# Get the available entity values
response = client.get(f"/prediction_table/{prediction_table_id}/prediction_entities")
entities = response.json()

# entity_data.columns: entity column names (serving names)
# entity_data.data: distinct entity value combinations
print(f"Entity columns: {entities['entity_data']['columns']}")
for row in entities["entity_data"]["data"][:5]:
    print(f"  {row}")
```
Response fields:
| Field | Type | Description |
|---|---|---|
| `prediction_table_id` | string | Prediction table ID |
| `entity_data.columns` | array | Entity column names (serving names) |
| `entity_data.data` | array | Distinct entity value combinations (each row is a list of values matching the columns) |
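Because each row of `entity_data.data` lines up with `entity_data.columns`, a row can be zipped directly into the `entity_filter` object used in the next step. A small sketch (the helper name is illustrative):

```python
def build_entity_filter(entity_data: dict, row_index: int = 0) -> dict:
    """Map entity column names to the values of one distinct entity row."""
    return dict(zip(entity_data["columns"], entity_data["data"][row_index]))

# entity_filter = build_entity_filter(entities["entity_data"])
```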
### Create a Forecast Comparison Plot

```python
response = client.post(
    f"/prediction_table/{prediction_table_id}/forecast_comparison",
    json={
        "entity_filter": {
            "item_store_id": "FOODS_3_001_CA_1",
        },
        "plot_params": {
            "height": 500,
            "width": 1000,
            "font_size": 16,
        },
    },
)
task_id = response.json()["id"]
wait_for_task(client, task_id)
```
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `entity_filter` | object | Yes | Key-value pairs mapping entity column names to values (e.g., `{"item_store_id": "FOODS_3_001_CA_1"}`) |
| `plot_params` | object | No | Plot sizing configuration |
| `plot_params.height` | integer | No | Plot height in pixels (default: 500) |
| `plot_params.width` | integer | No | Plot width in pixels (default: 1000) |
| `plot_params.font_size` | integer | No | Font size in pixels (default: 16) |
The `entity_filter` specifies which entity to plot (e.g., a specific item-store combination). The plot is generated asynchronously.
### Retrieve the Plot

```python
# List forecast comparisons for a prediction table
response = client.get(
    f"/prediction_table/{prediction_table_id}/forecast_comparison",
)
comparisons = response.json()["data"]

# Get a specific forecast comparison result
forecast_comparison_id = comparisons[0]["id"]
response = client.get(
    f"/prediction_table/{prediction_table_id}/forecast_comparison/{forecast_comparison_id}",
)
html_content = response.json()["content"]
```
The `content` field contains a self-contained Bokeh HTML plot, just like evaluation plots. It includes:
- Target line (grey) — actual values, when the observation table has a target
- Prediction lines (colored) — one line per point-in-time, showing how the model predicted each forecast point
- Interactive widgets — "From" and "To" point-in-time selectors to filter which prediction lines are visible
See Displaying Plots for how to render the plot in Jupyter, save as HTML, or embed in a web application.