Development Dataset

Prerequisites

This page uses the client and wait_for_task helpers defined in API Overview.

A development dataset is a time-bounded, sampled copy of your source tables used for feature engineering exploration. It allows the ideation pipeline to run EDA and feature selection on a manageable subset of data, significantly reducing iteration time.

Data prerequisites: tables must be registered and an EDA observation table must be set on the use case.

Ideation and development datasets

An ideation pipeline can only use a development dataset if its EDA observation table is the same table used to generate the dataset (or a subset of it generated by FeatureByte). If the observation tables don't match, the pipeline will run on the full data instead.

The entity selection used by the development dataset also constrains the ideation. See Entity Selection — Development Datasets for details.
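The matching rule above can be sketched as a small check. This is illustrative only: it assumes the dataset detail payload described later on this page (status, observation_table_id) and a hypothetical eda_table_id field on the use case record, so adjust the field names to your deployment.

```python
def dataset_usable_for_ideation(use_case: dict, dataset: dict) -> bool:
    """Check whether an ideation pipeline can use a development dataset.

    The dataset must be materialized (ACTIVE) and sampled from the same
    observation table the use case designates for EDA; otherwise the
    pipeline falls back to running on the full data.
    """
    return (
        dataset.get("status") == "ACTIVE"
        and use_case.get("eda_table_id") is not None
        and use_case.get("eda_table_id") == dataset.get("observation_table_id")
    )
```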

Workflow

Creating a development dataset is a multi-step process:

1. Get defaults  →  2. Create plan  →  3. Create sampling tables  →  4. Materialize tables
   (POST defaults)     (POST plan)       (PATCH sampling_tables)      (PATCH sampled_tables)

The plan progresses through statuses: DRAFT → ENTITY_SAMPLING → ACTIVE.
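The four steps can be wired together into a single helper. This is a sketch built from the calls shown below, using the client and wait_for_task helpers from API Overview; it is not an official SDK function.

```python
def build_development_dataset(client, observation_table_id, wait_for_task):
    """Run the full workflow: defaults -> plan -> sampling -> materialize.

    Returns the development dataset ID once tables are materialized.
    """
    # Step 1: fetch the default plan configuration for the observation table
    defaults = client.post(
        "/development_plan/defaults",
        json={"observation_table_id": observation_table_id},
    ).json()

    # Step 2: create the plan (this also creates a DRAFT dataset)
    plan = client.post("/development_plan", json=defaults).json()
    plan_id = plan["_id"]

    # Steps 3-4: compute sampling tables, then materialize them,
    # waiting for each async task to finish before continuing
    for step in ("sampling_tables", "sampled_tables"):
        task_id = client.patch(f"/development_plan/{plan_id}/{step}").json()["id"]
        wait_for_task(client, task_id)

    return plan.get("development_dataset_id")
```

Customizations (lookback, sample ratio, table subsets) can be applied to defaults between steps 1 and 2, exactly as in the step-by-step examples below.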

Step 1: Get Default Plan Configuration

client = fb.Configurations().get_client()

response = client.post(
    "/development_plan/defaults",
    json={
        "observation_table_id": eda_observation_table_id,
    },
)
plan_defaults = response.json()
print(f"Feature lookback: {plan_defaults.get('feature_lookback_in_months')} months")
print(f"Tables to sample: {len(plan_defaults.get('table_ids_subset', []))}")

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| observation_table_id | string | Yes | ID of the observation table to base sampling on (typically the EDA table) |
| development_dataset_id | string | No | ID of an existing development dataset to reuse |

Step 2: Create the Plan

Creating a plan automatically creates a development dataset in DRAFT status.

# Optionally customize before creating
plan_defaults["feature_lookback_in_months"] = 25
plan_defaults["max_sample_to_full_ratio"] = 0.15

response = client.post("/development_plan", json=plan_defaults)
development_plan = response.json()
development_plan_id = development_plan["_id"]
development_dataset_id = development_plan.get("development_dataset_id")
print(f"Plan: {development_plan_id}, Dataset: {development_dataset_id}")

Plan create parameters (returned by defaults, can be customized):

| Parameter | Type | Description |
| --- | --- | --- |
| observation_table_id | string | ID of the observation table |
| development_dataset_name | string | Name for the development dataset |
| entity_selection | object | Entity selection metadata (suggested, final, eligible). See Entity Selection. |
| feature_lookback_in_months | integer | How far back to look for feature data (default varies by data) |
| storing_database_name | string | Database where sampled tables will be stored |
| storing_schema_name | string | Schema where sampled tables will be stored |
| table_ids_subset | array | IDs of tables to include in sampling |
| tables_skip_sampling | array | IDs of tables to skip sampling for (the full table is used) |
| max_sample_to_full_ratio | float | Maximum ratio of sampled rows to full-table rows (default: 0.05). Tables exceeding this ratio are not materialized unless you raise the threshold. |

Step 3: Create Sampling Tables

Compute the distinct entity IDs for each table:

response = client.patch(f"/development_plan/{development_plan_id}/sampling_tables")
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
print(f"Sampling tables: {task['status']}")

After this step, you can review the SQL plan to see how tables will be sampled:

response = client.get(f"/development_plan/{development_plan_id}/sql_plan")
sql_plan = response.json()
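The shape of the sql_plan payload is not documented here and may vary by deployment. An assumption-free way to review it is to pretty-print the JSON:

```python
import json

def show_sql_plan(sql_plan: dict) -> str:
    """Render the SQL plan response as indented JSON for manual review."""
    return json.dumps(sql_plan, indent=2, sort_keys=True)
```

For example, print(show_sql_plan(sql_plan)) after the GET call above.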

Step 4: Materialize Development Tables

Create the actual sampled tables in the warehouse:

response = client.patch(f"/development_plan/{development_plan_id}/sampled_tables", json={})
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
print(f"Development tables: {task['status']}")

The dataset status transitions to ACTIVE once materialization completes.
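If you drive the steps separately, you may want to block until the dataset is usable. A minimal polling helper, sketched against the GET endpoint shown later on this page (not part of the API itself):

```python
import time

def wait_until_active(client, development_dataset_id, timeout_s=600, poll_s=10):
    """Poll the development dataset until its status becomes ACTIVE."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.get(
            f"/development_dataset/{development_dataset_id}"
        ).json()["status"]
        if status == "ACTIVE":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"dataset {development_dataset_id} is still {status}"
            )
        time.sleep(poll_s)
```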

Quick Create (Steps 1-2 Only)

For most use cases, you can skip the manual sampling/materialization steps. The ideation pipeline will automatically handle dataset preparation when you provide the development_dataset_id:

# Get defaults and create plan (dataset auto-created)
response = client.post(
    "/development_plan/defaults",
    json={"observation_table_id": eda_observation_table_id},
)
plan_defaults = response.json()

response = client.post("/development_plan", json=plan_defaults)
development_plan = response.json()
development_dataset_id = development_plan.get("development_dataset_id")

Get Development Dataset Details

response = client.get(f"/development_dataset/{development_dataset_id}")
dataset = response.json()

Response fields:

| Field | Type | Description |
| --- | --- | --- |
| id | string | Development dataset ID |
| name | string | Display name |
| status | string | "DRAFT", "ENTITY_SAMPLING", or "ACTIVE" |
| source_type | string | "SOURCE_TABLES" or "OBSERVATION_TABLE" |
| development_plan_id | string | ID of the associated development plan |
| observation_table_id | string | ID of the source observation table |
| sample_from_timestamp | datetime | Start of the sampling time range |
| sample_to_timestamp | datetime | End of the sampling time range |
| development_tables | array | List of sampled tables |
| feature_lookback_in_months | integer | How far back feature data was sampled |

Get Detailed Info

response = client.get(
    f"/development_dataset/{development_dataset_id}/info",
    params={"verbose": True},
)
info = response.json()

Response fields:

| Field | Type | Description |
| --- | --- | --- |
| sample_from_timestamp | datetime | Start of the sampling time range |
| sample_to_timestamp | datetime | End of the sampling time range |
| development_tables | array | List of sampled tables with names, feature store, and status |
| status | string | Dataset status |
| source_type | string | How the dataset was created |

List Development Datasets

response = client.get(
    "/development_dataset",
    params={"page": 1, "page_size": 20},
)
datasets = response.json()["data"]

Filter by use case:

response = client.get(
    "/catalog/development_dataset",
    params={"use_case_id": use_case_id},
)

Get Development Plan

response = client.get(f"/development_plan/{development_plan_id}")
plan = response.json()

List plans:

response = client.get(
    "/development_plan",
    params={"context_id": context_id},
)
plans = response.json()["data"]

Delete

# Delete development dataset
response = client.delete(f"/development_dataset/{development_dataset_id}")
task_id = response.json()["id"]
wait_for_task(client, task_id)

# Delete development plan
response = client.delete(f"/development_plan/{development_plan_id}")
task_id = response.json()["id"]
wait_for_task(client, task_id)
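Both deletions follow the same task-based pattern, so they can be wrapped in one cleanup helper. A sketch, reusing the client and wait_for_task helpers; the dataset-then-plan order mirrors the example above:

```python
def delete_development_resources(client, wait_for_task,
                                 development_dataset_id=None,
                                 development_plan_id=None):
    """Delete a development dataset and/or plan, waiting on each async task."""
    for prefix, resource_id in (
        ("development_dataset", development_dataset_id),
        ("development_plan", development_plan_id),
    ):
        if resource_id is None:
            continue
        task_id = client.delete(f"/{prefix}/{resource_id}").json()["id"]
        wait_for_task(client, task_id)
```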