Development Dataset¶
Prerequisites
This page uses the client and wait_for_task helpers defined in API Overview.
A development dataset is a time-bounded, sampled copy of your source tables used for feature engineering exploration. It allows the ideation pipeline to run EDA and feature selection on a manageable subset of data, significantly reducing iteration time.
Prerequisites: Tables must be registered and an EDA observation table must be set on the use case.
Ideation and development datasets
An ideation pipeline can only use a development dataset if its EDA observation table is the same table used to generate the dataset (or a subset of it generated by FeatureByte). If the observation tables don't match, the pipeline will run on the full data instead.
The entity selection used by the development dataset also constrains the ideation. See Entity Selection — Development Datasets for details.
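The matching rule above can be sketched as a simple check. This is illustrative only; in practice FeatureByte also accepts observation tables it derived as subsets of the dataset's table:

```python
def dataset_usable_by_pipeline(pipeline_obs_table_id: str, dataset_obs_table_id: str) -> bool:
    """Illustrative sketch: the ideation pipeline uses the development
    dataset only when its EDA observation table matches the table the
    dataset was generated from; otherwise it falls back to full data."""
    return pipeline_obs_table_id == dataset_obs_table_id
```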
Workflow¶
Creating a development dataset is a multi-step process:

1. Get defaults (`POST /development_plan/defaults`)
2. Create plan (`POST /development_plan`)
3. Create sampling tables (`PATCH /development_plan/{id}/sampling_tables`)
4. Materialize tables (`PATCH /development_plan/{id}/sampled_tables`)
The plan progresses through statuses: DRAFT → ENTITY_SAMPLING → ACTIVE.
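The status order can be expressed as a small sketch. The server drives these transitions; this only documents the expected order:

```python
# Illustrative: the order in which a development plan's status advances.
STATUS_ORDER = ("DRAFT", "ENTITY_SAMPLING", "ACTIVE")

def is_later_status(current: str, new: str) -> bool:
    """True if `new` comes after `current` in the plan lifecycle."""
    return STATUS_ORDER.index(new) > STATUS_ORDER.index(current)
```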
Step 1: Get Default Plan Configuration¶
```python
client = fb.Configurations().get_client()

response = client.post(
    "/development_plan/defaults",
    json={
        "observation_table_id": eda_observation_table_id,
    },
)
plan_defaults = response.json()
print(f"Feature lookback: {plan_defaults.get('feature_lookback_in_months')} months")
print(f"Tables to sample: {len(plan_defaults.get('table_ids_subset', []))}")
```
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `observation_table_id` | string | Yes | ID of the observation table to base sampling on (typically the EDA table) |
| `development_dataset_id` | string | No | ID of an existing development dataset to reuse |
Step 2: Create the Plan¶
Creating a plan automatically creates a development dataset in DRAFT status.
```python
# Optionally customize before creating
plan_defaults["feature_lookback_in_months"] = 25
plan_defaults["max_sample_to_full_ratio"] = 0.15

response = client.post("/development_plan", json=plan_defaults)
development_plan = response.json()
development_plan_id = development_plan["_id"]
development_dataset_id = development_plan.get("development_dataset_id")
print(f"Plan: {development_plan_id}, Dataset: {development_dataset_id}")
```
Plan create parameters (returned by defaults, can be customized):
| Parameter | Type | Description |
|---|---|---|
| `observation_table_id` | string | ID of the observation table |
| `development_dataset_name` | string | Name for the development dataset |
| `entity_selection` | object | Entity selection metadata (suggested, final, eligible). See Entity Selection. |
| `feature_lookback_in_months` | integer | How far back to look for feature data (default varies by data) |
| `storing_database_name` | string | Database where sampled tables will be stored |
| `storing_schema_name` | string | Schema where sampled tables will be stored |
| `table_ids_subset` | array | IDs of tables to include in sampling |
| `tables_skip_sampling` | array | IDs of tables to skip sampling (use full table) |
| `max_sample_to_full_ratio` | float | Maximum ratio of sampled rows to full table (default: 0.05). Tables exceeding this ratio are not materialized unless you increase the threshold. |
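The `max_sample_to_full_ratio` guard can be sketched as follows. This is a minimal illustration; the real row counts come from the warehouse:

```python
def would_materialize(sampled_rows: int, full_rows: int, max_ratio: float = 0.05) -> bool:
    """Illustrative: a sampled table is materialized only if its row count
    stays within `max_ratio` of the full table's row count."""
    return full_rows > 0 and sampled_rows / full_rows <= max_ratio
```

For example, with the default 0.05, a 10,000-row sample of a 100,000-row table (ratio 0.1) would not be materialized unless the threshold is raised to 0.1 or above.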
Step 3: Create Sampling Tables¶
Compute the distinct entity IDs for each table:
```python
response = client.patch(f"/development_plan/{development_plan_id}/sampling_tables")
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
print(f"Sampling tables: {task['status']}")
```
After this step, you can review the SQL plan to see how tables will be sampled:
```python
response = client.get(f"/development_plan/{development_plan_id}/sql_plan")
sql_plan = response.json()
```
Step 4: Materialize Development Tables¶
Create the actual sampled tables in the warehouse:
```python
response = client.patch(f"/development_plan/{development_plan_id}/sampled_tables", json={})
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
print(f"Development tables: {task['status']}")
```
The dataset status transitions to ACTIVE once materialization completes.
Quick Create (Steps 1-2 Only)¶
For most use cases, you can skip the manual sampling/materialization steps. The ideation pipeline will automatically handle dataset preparation when you provide the development_dataset_id:
```python
# Get defaults and create plan (dataset auto-created)
response = client.post(
    "/development_plan/defaults",
    json={"observation_table_id": eda_observation_table_id},
)
plan_defaults = response.json()

response = client.post("/development_plan", json=plan_defaults)
development_plan = response.json()
development_dataset_id = development_plan.get("development_dataset_id")
```
Get Development Dataset Details¶
Fetch a single dataset with `GET /development_dataset/{development_dataset_id}` (the same resource path used by the info and delete endpoints below).

Response fields:
| Field | Type | Description |
|---|---|---|
| `id` | string | Development dataset ID |
| `name` | string | Display name |
| `status` | string | `"DRAFT"`, `"ENTITY_SAMPLING"`, or `"ACTIVE"` |
| `source_type` | string | `"SOURCE_TABLES"` or `"OBSERVATION_TABLE"` |
| `development_plan_id` | string | ID of the associated development plan |
| `observation_table_id` | string | ID of the source observation table |
| `sample_from_timestamp` | datetime | Start of the sampling time range |
| `sample_to_timestamp` | datetime | End of the sampling time range |
| `development_tables` | array | List of sampled tables |
| `feature_lookback_in_months` | integer | How far back feature data was sampled |
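For example, the length of the sampling window can be derived from the two timestamp fields. The values below are illustrative, not real output:

```python
from datetime import datetime

# Illustrative record showing only the two timestamp fields.
record = {
    "sample_from_timestamp": "2023-01-01T00:00:00",
    "sample_to_timestamp": "2023-07-01T00:00:00",
}
start = datetime.fromisoformat(record["sample_from_timestamp"])
end = datetime.fromisoformat(record["sample_to_timestamp"])
window_days = (end - start).days
print(f"Sampling window: {window_days} days")
```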
Get Detailed Info¶
```python
response = client.get(
    f"/development_dataset/{development_dataset_id}/info",
    params={"verbose": True},
)
info = response.json()
```
Response fields:
| Field | Type | Description |
|---|---|---|
| `sample_from_timestamp` | datetime | Start of the sampling time range |
| `sample_to_timestamp` | datetime | End of the sampling time range |
| `development_tables` | array | List of sampled tables with names, feature store, and status |
| `status` | string | Dataset status |
| `source_type` | string | How the dataset was created |
List Development Datasets¶
```python
response = client.get(
    "/development_dataset",
    params={"page": 1, "page_size": 20},
)
datasets = response.json()["data"]
```
To filter by use case, pass the appropriate query parameter (such as a use case or context ID) to the list request.
Get Development Plan¶
List plans:
```python
response = client.get(
    "/development_plan",
    params={"context_id": context_id},
)
plans = response.json()["data"]
```
Delete¶
```python
# Delete development dataset
response = client.delete(f"/development_dataset/{development_dataset_id}")
task_id = response.json()["id"]
wait_for_task(client, task_id)

# Delete development plan
response = client.delete(f"/development_plan/{development_plan_id}")
task_id = response.json()["id"]
wait_for_task(client, task_id)
```