Development Dataset¶
Prerequisites
This page uses the client and wait_for_task helpers defined in API Overview.
A development dataset is a time-bounded, sampled copy of your source tables used for feature engineering exploration. It allows the ideation pipeline to run EDA and feature selection on a manageable subset of data, significantly reducing iteration time.
Prerequisites: Tables must be registered and an EDA observation table must be set on the use case.
Ideation and development datasets
An ideation pipeline can only use a development dataset if its EDA observation table is the same table used to generate the dataset (or a subset of it generated by FeatureByte). If the observation tables don't match, the pipeline will run on the full data instead.
The entity selection used by the development dataset also constrains the ideation. See Entity Selection — Development Datasets for details.
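The matching rule above can be sketched as a simple check. This is illustrative only; in practice FeatureByte also accepts observation tables it derived as subsets of the dataset's table:

```python
def dataset_usable_by_pipeline(pipeline_obs_table_id: str, dataset_obs_table_id: str) -> bool:
    """Illustrative sketch: the ideation pipeline uses the development
    dataset only when its EDA observation table matches the table the
    dataset was generated from; otherwise it falls back to full data."""
    return pipeline_obs_table_id == dataset_obs_table_id
```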
Workflow¶
Creating a development dataset is a multi-step process:

1. Get defaults (`POST /development_plan/defaults`)
2. Create plan (`POST /development_plan`)
3. Create sampling tables (`PATCH /development_plan/{id}/sampling_tables`)
4. Materialize tables (`PATCH /development_plan/{id}/sampled_tables`)
The plan progresses through statuses: DRAFT → ENTITY_SAMPLING → ACTIVE.
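The status order can be expressed as a small sketch. The server drives these transitions; this only documents the expected order:

```python
# Illustrative: the order in which a development plan's status advances.
STATUS_ORDER = ("DRAFT", "ENTITY_SAMPLING", "ACTIVE")

def is_later_status(current: str, new: str) -> bool:
    """True if `new` comes after `current` in the plan lifecycle."""
    return STATUS_ORDER.index(new) > STATUS_ORDER.index(current)
```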
Step 1: Get Default Plan Configuration¶
```python
client = fb.Configurations().get_client()

response = client.post(
    "/development_plan/defaults",
    json={
        "observation_table_id": eda_observation_table_id,
    },
)
plan_defaults = response.json()
print(f"Feature lookback: {plan_defaults.get('feature_lookback_in_months')} months")
print(f"Tables to sample: {len(plan_defaults.get('table_ids_subset', []))}")
```
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `observation_table_id` | string | Yes | ID of the observation table to base sampling on (typically the EDA table) |
| `development_dataset_id` | string | No | ID of an existing development dataset to reuse |
Step 2: Create the Plan¶
Creating a plan automatically creates a development dataset in DRAFT status.
```python
# Optionally customize before creating
plan_defaults["feature_lookback_in_months"] = 25
plan_defaults["max_sample_to_full_ratio"] = 0.15

response = client.post("/development_plan", json=plan_defaults)
development_plan = response.json()
development_plan_id = development_plan["_id"]
development_dataset_id = development_plan.get("development_dataset_id")
print(f"Plan: {development_plan_id}, Dataset: {development_dataset_id}")
```
Plan create parameters (returned by defaults, can be customized):
| Parameter | Type | Description |
|---|---|---|
| `observation_table_id` | string | ID of the observation table |
| `development_dataset_name` | string | Name for the development dataset |
| `entity_selection` | object | Entity selection metadata (suggested, final, eligible). See Entity Selection. |
| `feature_lookback_in_months` | integer | How far back to look for feature data (default varies by data) |
| `storing_database_name` | string | Database where sampled tables will be stored |
| `storing_schema_name` | string | Schema where sampled tables will be stored |
| `table_ids_subset` | array | IDs of tables to include in sampling |
| `tables_skip_sampling` | array | IDs of tables to skip sampling (use full table) |
| `max_sample_to_full_ratio` | float | Maximum ratio of sampled rows to full table (default: 0.05). Tables exceeding this ratio are not materialized unless you increase the threshold. |
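The `max_sample_to_full_ratio` guard can be sketched as follows. This is a minimal illustration; the real row counts come from the warehouse:

```python
def would_materialize(sampled_rows: int, full_rows: int, max_ratio: float = 0.05) -> bool:
    """Illustrative: a sampled table is materialized only if its row count
    stays within `max_ratio` of the full table's row count."""
    return full_rows > 0 and sampled_rows / full_rows <= max_ratio
```

For example, with the default 0.05, a 10,000-row sample of a 100,000-row table (ratio 0.1) would not be materialized unless the threshold is raised to 0.1 or above.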
Step 3: Create Sampling Tables¶
Compute the distinct entity IDs for each table:
```python
response = client.patch(f"/development_plan/{development_plan_id}/sampling_tables")
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
print(f"Sampling tables: {task['status']}")
```
After this step, you can review the SQL plan to see how tables will be sampled:
```python
response = client.get(f"/development_plan/{development_plan_id}/sql_plan")
sql_plan = response.json()
```
Step 4: Materialize Development Tables¶
Create the actual sampled tables in the warehouse:
```python
response = client.patch(f"/development_plan/{development_plan_id}/sampled_tables", json={})
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
print(f"Development tables: {task['status']}")
```
The dataset status transitions to ACTIVE once materialization completes.
Quick Create (Steps 1-2 Only)¶
For most use cases, you can skip the manual sampling/materialization steps. The ideation pipeline will automatically handle dataset preparation when you provide the development_dataset_id:
```python
# Get defaults and create plan (dataset auto-created)
response = client.post(
    "/development_plan/defaults",
    json={"observation_table_id": eda_observation_table_id},
)
plan_defaults = response.json()

response = client.post("/development_plan", json=plan_defaults)
development_plan = response.json()
development_dataset_id = development_plan.get("development_dataset_id")
```
Get Development Dataset Details¶
Fetch a single dataset with `GET /development_dataset/{development_dataset_id}` (the same resource path used by the info and delete endpoints below).

Response fields:
| Field | Type | Description |
|---|---|---|
| `id` | string | Development dataset ID |
| `name` | string | Display name |
| `status` | string | `"DRAFT"`, `"ENTITY_SAMPLING"`, or `"ACTIVE"` |
| `source_type` | string | `"SOURCE_TABLES"` or `"OBSERVATION_TABLE"` |
| `development_plan_id` | string | ID of the associated development plan |
| `observation_table_id` | string | ID of the source observation table |
| `sample_from_timestamp` | datetime | Start of the sampling time range |
| `sample_to_timestamp` | datetime | End of the sampling time range |
| `development_tables` | array | List of sampled tables |
| `feature_lookback_in_months` | integer | How far back feature data was sampled |
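For example, the length of the sampling window can be derived from the two timestamp fields. The values below are illustrative, not real output:

```python
from datetime import datetime

# Illustrative record showing only the two timestamp fields.
record = {
    "sample_from_timestamp": "2023-01-01T00:00:00",
    "sample_to_timestamp": "2023-07-01T00:00:00",
}
start = datetime.fromisoformat(record["sample_from_timestamp"])
end = datetime.fromisoformat(record["sample_to_timestamp"])
window_days = (end - start).days
print(f"Sampling window: {window_days} days")
```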
Get Detailed Info¶
```python
response = client.get(
    f"/development_dataset/{development_dataset_id}/info",
    params={"verbose": True},
)
info = response.json()
```
Response fields:
| Field | Type | Description |
|---|---|---|
| `sample_from_timestamp` | datetime | Start of the sampling time range |
| `sample_to_timestamp` | datetime | End of the sampling time range |
| `development_tables` | array | List of sampled tables with names, feature store, and status |
| `status` | string | Dataset status |
| `source_type` | string | How the dataset was created |
List Development Datasets¶
```python
response = client.get(
    "/development_dataset",
    params={"page": 1, "page_size": 20},
)
datasets = response.json()["data"]
```
To filter by use case, pass the appropriate query parameter (such as a use case or context ID) to the list request.
Get Development Plan¶
List plans:
```python
response = client.get(
    "/development_plan",
    params={"context_id": context_id},
)
plans = response.json()["data"]
```
Delete¶
```python
# Delete development dataset
response = client.delete(f"/development_dataset/{development_dataset_id}")
task_id = response.json()["id"]
wait_for_task(client, task_id)

# Delete development plan
response = client.delete(f"/development_plan/{development_plan_id}")
task_id = response.json()["id"]
wait_for_task(client, task_id)
```