Credit Default: End-to-End SDK + API Tutorial¶
This tutorial replicates the Credit Default UI Tutorials using Python code. The SDK handles catalog setup, table registration, and entity management. The REST API handles ideation, training, evaluation, and deployment.
Prerequisites:
- FeatureByte instance with the playground feature store connected to DEMO_DATASETS.CREDIT_DEFAULT
- Python environment with the featurebyte SDK installed
- Profile tutorial configured (see SDK Setup)
What you'll build:
- Register 7 source tables and tag entities (SDK)
- Run table EDA and semantic detection (API)
- Formulate a use case and create observation tables (SDK)
- Run an automated ideation pipeline (API)
- Refine features and train a standalone model (API)
- Evaluate on a holdout set and deploy (API)
Setup¶
import time
import featurebyte as fb
fb.use_profile("tutorial")
DATABASE_NAME = "DEMO_DATASETS"
SCHEMA_NAME = "CREDIT_DEFAULT"
CATALOG_NAME = "Credit Default API Tutorial"
06:46:32 | WARNING | Service endpoint is inaccessible: http://127.0.0.1:5000/api/v1
06:46:32 | INFO | Using profile: tutorial
06:46:32 | INFO | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
06:46:32 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
06:46:32 | INFO | SDK version: 3.4.1.dev7
06:46:32 | INFO | No catalog activated.
def wait_for_task(client, task_id, poll_interval=30):
    """Poll a task until completion. Returns the full task response."""
    while True:
        task = client.get(f"/task/{task_id}").json()
        if task["status"] in ("SUCCESS", "FAILURE"):
            if task["status"] == "FAILURE":
                print(f"Task FAILED: {task.get('traceback', 'no traceback')}")
            return task
        print(f" status: {task['status']}...")
        time.sleep(poll_interval)
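To see the polling logic in action without a live server, the helper can be exercised against a stub client that reports PENDING once and then SUCCESS. The stub classes below are purely illustrative (not part of the FeatureByte SDK), and the helper is repeated so the snippet is self-contained:

```python
import time

class FakeResponse:
    """Illustrative stand-in for an HTTP response object."""
    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload

class FakeClient:
    """Illustrative stand-in for the API client: PENDING on the first poll, then SUCCESS."""
    def __init__(self):
        self._statuses = iter(["PENDING", "SUCCESS"])

    def get(self, path):
        return FakeResponse({"status": next(self._statuses)})

def wait_for_task(client, task_id, poll_interval=30):
    """Poll a task until completion. Returns the full task response."""
    while True:
        task = client.get(f"/task/{task_id}").json()
        if task["status"] in ("SUCCESS", "FAILURE"):
            if task["status"] == "FAILURE":
                print(f"Task FAILED: {task.get('traceback', 'no traceback')}")
            return task
        print(f" status: {task['status']}...")
        time.sleep(poll_interval)

task = wait_for_task(FakeClient(), "demo-task", poll_interval=0)
print(task["status"])  # SUCCESS
```

The same pattern is used throughout the rest of this tutorial: submit a POST, read the task `id` from the response, then poll until the task reaches a terminal status.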
Step 1: Create Catalog¶
Corresponds to UI Tutorial: Create Catalog
catalog = fb.Catalog.create(CATALOG_NAME, "playground")
catalog.activate(CATALOG_NAME)
ds = catalog.get_data_source()
client = fb.Configurations().get_client()
# Get the feature store ID for API calls
feature_store = fb.FeatureStore.get("playground")
feature_store_id = str(feature_store.id)
print(f"Catalog '{catalog.name}' created. Feature store: playground")
06:46:33 | INFO | Catalog activated: Credit Default API Tutorial
Catalog 'Credit Default API Tutorial' created. Feature store: playground
SDK Reference: Catalog | FeatureStore | DataSource
Step 1b: Analyze Source Tables (API)¶
Corresponds to UI Tutorial: Register Tables — the "magic wand" feature. API docs: Source Data Exploration
Before registering, use the API to analyze source tables and detect their types.
# Generate AI-powered summaries for all tables
table_names = [
"NEW_APPLICATION", "CLIENT_PROFILE", "BUREAU",
"PREVIOUS_APPLICATION", "LOAN_STATUS",
"INSTALLMENTS_PAYMENTS", "CREDIT_CARD_MONTHLY_BALANCE",
]
response = client.post(
"/table/source_table_summary",
json={
"feature_store_id": feature_store_id,
"database_name": DATABASE_NAME,
"schema_name": SCHEMA_NAME,
"table_names": table_names,
},
)
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
print(f"Table summaries generated: {task['status']}")
# List tables with summaries
response = client.get(
f"/feature_store/{feature_store_id}/table",
params={
"database_name": DATABASE_NAME,
"schema_name": SCHEMA_NAME,
},
)
for t in response.json():
    name = t["name"]
    summary = t.get("summary", "")
    print(f"{name}: {summary[:200] if summary else '(no summary)'}...")
 status: STARTED...
Table summaries generated: SUCCESS
BUREAU: The BUREAU table contains information about credits taken by clients from other financial institutions, as reported to the credit bureau. It includes details such as the client's ID, unique identifier...
CLIENT_PROFILE: The CLIENT_PROFILE table provides detailed information about each client's profile. It includes various attributes such as the client's unique identifier (ClientID), personal details like birthdate an...
CREDIT_CARD_MONTHLY_BALANCE: The CREDIT_CARD_MONTHLY_BALANCE table provides a comprehensive summary of monthly balances for credit cards. It includes detailed information about each credit card, such as the card ID and the associ...
INSTALLMENTS_PAYMENTS: The INSTALLMENTS_PAYMENTS table records the details of monthly installment payments for loans. It includes information about each installment, such as its unique ID, the associated loan application ID...
LOAN_STATUS: The LOAN_STATUS table is designed to track the status of loans, specifically focusing on whether a loan has been terminated or is still active. It includes key identifiers such as the loan ID and appl...
NEW_APPLICATION: The NEW_APPLICATION table contains detailed information about new loan applications submitted by clients. It includes various attributes related to the application itself, the client, and the client's...
OBSERVATIONS_WITH_TARGET: The OBSERVATIONS_WITH_TARGET table is designed for training purposes and contains data related to loan applications. It includes a timestamp indicating the specific point in time for each prediction, ...
OBSERVATION_EDA_TABLE: The OBSERVATION_EDA_TABLE is designed for training purposes and contains data related to loan applications. It includes a timestamp indicating the specific point in time for each prediction, an intege...
PREVIOUS_APPLICATION: The PREVIOUS_APPLICATION table contains detailed information about prior loan applications made by clients. It includes various attributes related to the application process, such as the application I...
# Analyze a source table to detect its type
response = client.post(
"/table/source_table_analysis",
json={
"feature_store_id": feature_store_id,
"database_name": DATABASE_NAME,
"schema_name": SCHEMA_NAME,
"table_name": "BUREAU",
},
)
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
# Get analysis results
analysis_id = task.get("payload", {}).get("output_document_id")
response = client.get(f"/table/source_table_analysis/{analysis_id}")
analysis = response.json()
print(f"Table: BUREAU\n")
print("-"*8)
print(f'Suggested type:\n{analysis["table_type"]}\n')
print("-"*8)
print(f'Type explanation:\n{analysis["type_explanation"]}\n')
print("-"*8)
print(f'Setting explanation:\n{analysis["setting_explanation"]}\n')
print("-"*8)
print(f'Warnings: {analysis["warnings"]}')
 status: PENDING...
 status: PENDING...
Table: BUREAU
--------
Suggested type:
event_table
--------
Type explanation:
The BUREAU table fits the definition of an event_table because it captures unique events related to credit activities reported to the credit bureau. Each row in the table represents a distinct credit event associated with a client, as indicated by the unique identifier `SK_ID_BUREAU`, which is a recoded ID for each credit bureau credit related to a loan application. The presence of multiple timestamps, such as `bureau_application_time`, `credit_end_date`, `credit_end_fact`, `credit_update`, and `available_at`, further supports the classification as an event_table, as these timestamps capture specific points in time when events related to the credit activities occurred. Additionally, the table includes status fields like `CREDIT_ACTIVE` and various financial metrics that describe the state of the credit at the time of the event, aligning with the characteristics of an event_table that records distinct events with associated details.
--------
Setting explanation:
**Event ID**: The Event ID is set to `SK_ID_BUREAU`, which serves as the unique identifier for each credit event recorded in the BUREAU table. This ID is crucial for distinguishing between different credit events associated with a client, as each row in the table represents a distinct credit activity reported to the credit bureau. By using `SK_ID_BUREAU` as the Event ID, we ensure that each event can be uniquely identified and tracked throughout the feature engineering and machine learning processes.
**Event Timestamp**: The Event Timestamp is designated as `credit_update`, which indicates the specific point in time when the last information about the Credit Bureau credit was updated. This timestamp is essential for understanding the temporal aspect of each credit event, allowing us to analyze the sequence and timing of credit activities. The `credit_update` timestamp is assumed to be recorded in UTC, which provides a standardized time reference for all events.
**Event Timestamp Schema**: The Event Timestamp Schema specifies that the `credit_update` timestamp follows a Timestamp data type with no additional format string required. It is recorded in Coordinated Universal Time (UTC), ensuring consistency and comparability across different events and datasets. This schema helps in accurately interpreting the timing of events and aligning them with other time-based data.
**Record Creation Timestamp**: The Record Creation Timestamp is set to `available_at`, which denotes when the record was added to the data warehouse. This timestamp is important for tracking the data ingestion process and understanding when the information became available for analysis. It provides a reference point for the data's availability and can be used to assess the timeliness and currency of the data in the context of feature engineering and machine learning.
--------
Warnings: The credit_update column is assumed to be recorded in UTC. Please confirm.
Step 2: Register Tables¶
Corresponds to UI Tutorial: Register Tables
We register 7 tables from DEMO_DATASETS.CREDIT_DEFAULT:
- NEW_APPLICATION (Dimension) — loan application data
- CLIENT_PROFILE (SCD) — client demographics over time
- BUREAU (Event) — bureau credit reports
- PREVIOUS_APPLICATION (Event) — prior loan applications
- LOAN_STATUS (SCD) — loan repayment status over time
- INSTALLMENTS_PAYMENTS (Event) — installment payment events
- CREDIT_CARD_MONTHLY_BALANCE (Time Series) — monthly credit card balances
def get_source(table_name):
    return ds.get_source_table(
        database_name=DATABASE_NAME,
        schema_name=SCHEMA_NAME,
        table_name=table_name,
    )
# Dimension table
new_application = get_source("NEW_APPLICATION").create_dimension_table(
name="NEW_APPLICATION",
dimension_id_column="SK_ID_CURR",
record_creation_timestamp_column="available_at",
)
print("Registered NEW_APPLICATION (Dimension)")
Registered NEW_APPLICATION (Dimension)
# SCD tables
client_profile = get_source("CLIENT_PROFILE").create_scd_table(
name="CLIENT_PROFILE",
natural_key_column="ClientID",
effective_timestamp_column="SCD_effective_timestamp",
end_timestamp_column="SCD_end_timestamp",
record_creation_timestamp_column="available_at",
)
print("Registered CLIENT_PROFILE (SCD)")
loan_status = get_source("LOAN_STATUS").create_scd_table(
name="LOAN_STATUS",
natural_key_column="LOAN_ID",
effective_timestamp_column="SCD_Effective_Timestamp",
end_timestamp_column="SCD_End_Timestamp",
record_creation_timestamp_column="available_at",
)
print("Registered LOAN_STATUS (SCD)")
Registered CLIENT_PROFILE (SCD) Registered LOAN_STATUS (SCD)
# Event tables
bureau = get_source("BUREAU").create_event_table(
name="BUREAU",
event_id_column="SK_ID_BUREAU",
event_timestamp_column="credit_update",
record_creation_timestamp_column="available_at",
)
bureau.initialize_default_feature_job_setting()
print("Registered BUREAU (Event)")
previous_application = get_source("PREVIOUS_APPLICATION").create_event_table(
name="PREVIOUS_APPLICATION",
event_id_column="APPLICATION_ID",
event_timestamp_column="decision_date",
record_creation_timestamp_column="available_at",
)
previous_application.initialize_default_feature_job_setting()
print("Registered PREVIOUS_APPLICATION (Event)")
installments = get_source("INSTALLMENTS_PAYMENTS").create_event_table(
name="INSTALLMENTS_PAYMENTS",
event_id_column="INSTALMENT_ID",
event_timestamp_column="actual_installment_date",
record_creation_timestamp_column="available_at",
)
installments.update_default_feature_job_setting(
feature_job_setting=fb.FeatureJobSetting(
blind_spot="1h",
period="24h",
offset="19h",
)
)
print("Registered INSTALLMENTS_PAYMENTS (Event)")
Done! |████████████████████████████████████████| 100% in 12.4s (0.08%/s)
The analysis period starts at 2026-02-28 10:03:14 and ends at 2026-03-28 10:03:14
The column used for the event timestamp is credit_update
The column used for the record creation timestamp for BUREAU is available_at
STATISTICS ON TIME BETWEEN BUREAU RECORDS CREATIONS
- Average time is 1482.132273220094 s
- Median time is 60.0 s
- Lowest time is 60.0 s
- Largest time is 83520.0 s
based on a total of 1199 unique record creation timestamps.
BUREAU UPDATE TIME starts 6.0 hours and ends 6.0 hours 59.0 minutes after the start of each 24 hours
This includes a buffer of 600 s to allow for late jobs.
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 25200 s
- blind spot for 99.9 % of events to land: 25800 s
- blind spot for 99.95 % of events to land: 25800 s
- blind spot for 99.99 % of events to land: 25800 s
- blind spot for 99.995 % of events to land: 25800 s
- blind spot for 100.0 % of events to land: 25800 s
In SUMMARY, the recommended FEATUREJOB DEFAULT setting is:
period: 86400
offset: 25740
blind_spot: 25800
The resulting FEATURE CUTOFF offset is 86340 s.
For a feature cutoff at 86340 s:
- time for 99.5 % of events to land: 25800 s
- time for 99.9 % of events to land: 25800 s
- time for 99.95 % of events to land: 25800 s
- time for 99.99 % of events to land: 25800 s
- time for 99.995 % of events to land: 25800 s
- time for 100.0 % of events to land: 25800 s
- Period = 86400 s / Offset = 25740 s / Blind spot = 25800 s
The backtest found that all records would have been processed on time.
- Based on the past records created from 2026-02-28 10:00:00 to 2026-03-28 10:00:00, the table is regularly updated 6.0 hours after the start of each 24 hours within a 59.0 minutes interval. No job failure or late job has been detected.
- The features computation jobs are recommended to be scheduled after the table updates completion and be set 7 hours 9 minutes after the start of each 24 hours.
- Based on the analysis of the records latency, the blind_spot parameter used to determine the window of the features aggregation is recommended to be set at 25800 s.
- period: 86400 s
- offset: 25740 s
- blind_spot: 25800 s
Registered BUREAU (Event)
Done! |████████████████████████████████████████| 100% in 9.3s (0.11%/s)
The analysis period starts at 2026-02-25 02:25:16 and ends at 2026-03-25 02:25:16
The column used for the event timestamp is decision_date
The column used for the record creation timestamp for PREVIOUS_APPLICATION is available_at
STATISTICS ON TIME BETWEEN PREVIOUS_APPLICATION RECORDS CREATIONS
- Average time is 9508.470588235294 s
- Median time is 240.0 s
- Lowest time is 60.0 s
- Largest time is 85920.0 s
based on a total of 234 unique record creation timestamps.
PREVIOUS_APPLICATION UPDATE TIME starts 6.0 hours and ends 6.0 hours 59.0 minutes after the start of each 24 hours
This includes a buffer of 600 s to allow for late jobs.
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 25200 s
- blind spot for 99.9 % of events to land: 25200 s
- blind spot for 99.95 % of events to land: 25200 s
- blind spot for 99.99 % of events to land: 25200 s
- blind spot for 99.995 % of events to land: 25200 s
- blind spot for 100.0 % of events to land: 25200 s
In SUMMARY, the recommended FEATUREJOB DEFAULT setting is:
period: 86400
offset: 25740
blind_spot: 25200
The resulting FEATURE CUTOFF offset is 540 s.
For a feature cutoff at 540 s:
- time for 99.5 % of events to land: 25200 s
- time for 99.9 % of events to land: 25200 s
- time for 99.95 % of events to land: 25200 s
- time for 99.99 % of events to land: 25200 s
- time for 99.995 % of events to land: 25200 s
- time for 100.0 % of events to land: 25200 s
- Period = 86400 s / Offset = 25740 s / Blind spot = 25200 s
The backtest found that all records would have been processed on time.
- Based on the past records created from 2026-02-25 02:00:00 to 2026-03-25 02:00:00, the table is regularly updated 6.0 hours after the start of each 24 hours within a 59.0 minutes interval. No job failure or late job has been detected.
- The features computation jobs are recommended to be scheduled after the table updates completion and be set 7 hours 9 minutes after the start of each 24 hours.
- Based on the analysis of the records latency, the blind_spot parameter used to determine the window of the features aggregation is recommended to be set at 25200 s.
- period: 86400 s
- offset: 25740 s
- blind_spot: 25200 s
Registered PREVIOUS_APPLICATION (Event)
Registered INSTALLMENTS_PAYMENTS (Event)
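The recommended settings in the analyses above are reported in seconds, while the narrative lines describe them as clock times. A small conversion helper (purely illustrative, not part of the SDK) shows the two agree, e.g. the 25740 s offset is exactly the "7 hours 9 minutes after the start of each 24 hours" job time mentioned in the logs:

```python
def seconds_to_hm(seconds: int) -> str:
    """Render a second count as 'Xh Ym' to compare against the analysis output."""
    hours, rem = divmod(seconds, 3600)
    return f"{hours}h {rem // 60}m"

# Recommended settings reported by the feature job analyses above.
print(seconds_to_hm(86400))  # 24h 0m -> daily period
print(seconds_to_hm(25740))  # 7h 9m  -> jobs scheduled 7 hours 9 minutes into each day
print(seconds_to_hm(25800))  # 7h 10m -> BUREAU blind spot
print(seconds_to_hm(25200))  # 7h 0m  -> PREVIOUS_APPLICATION blind spot
```

The blind spot slightly exceeds the table's 6:00–6:59 daily update window, leaving room for late-landing records.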
# Time series table
cc_balance = get_source("CREDIT_CARD_MONTHLY_BALANCE").create_time_series_table(
name="CREDIT_CARD_MONTHLY_BALANCE",
series_id_column="CARD_ID",
reference_datetime_column="balance_month",
reference_datetime_schema=fb.TimestampSchema(
format_string="%Y-%m",
timezone="America/Los_Angeles",
),
time_interval=fb.TimeInterval(value=1, unit="MONTH"),
record_creation_timestamp_column="available_at",
)
cc_balance.update_default_feature_job_setting(
feature_job_setting=fb.CronFeatureJobSetting(
crontab="0 13 1 * *",
timezone="America/Los_Angeles",
)
)
print("Registered CREDIT_CARD_MONTHLY_BALANCE (Time Series)")
Registered CREDIT_CARD_MONTHLY_BALANCE (Time Series)
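The crontab string passed above drives the monthly feature job. Read field by field (minute, hour, day-of-month, month, day-of-week), it fires at 13:00 on the 1st of every month, in America/Los_Angeles time. A minimal, illustrative breakdown:

```python
# Split the five standard cron fields of the schedule used above.
fields = dict(zip(
    ["minute", "hour", "day_of_month", "month", "day_of_week"],
    "0 13 1 * *".split(),
))
print(fields)
# minute 0, hour 13 -> 13:00; day_of_month 1 -> the 1st;
# month and day_of_week '*' -> every month, any weekday
```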
Step 3: Register Entities¶
Corresponds to UI Tutorial: Register Entities
# Create entities
entity_new_app = fb.Entity.create(name="New Application", serving_names=["SK_ID_CURR"])
entity_client = fb.Entity.create(name="Client", serving_names=["ClientID"])
entity_bureau = fb.Entity.create(name="BureauReportedCredit", serving_names=["SK_ID_BUREAU"])
entity_prior_app = fb.Entity.create(name="PriorApplication", serving_names=["APPLICATION_ID"])
entity_loan = fb.Entity.create(name="Loan", serving_names=["LOAN_ID"])
entity_installment = fb.Entity.create(name="Installment", serving_names=["INSTALMENT_ID"])
print(f"Created {len(catalog.list_entities())} entities")
Created 6 entities
# Tag entity columns on each table
new_application["SK_ID_CURR"].as_entity("New Application")
new_application["ClientID"].as_entity("Client")
client_profile["ClientID"].as_entity("Client")
bureau["SK_ID_BUREAU"].as_entity("BureauReportedCredit")
bureau["ClientID"].as_entity("Client")
previous_application["APPLICATION_ID"].as_entity("PriorApplication")
previous_application["ClientID"].as_entity("Client")
loan_status["LOAN_ID"].as_entity("Loan")
loan_status["APPLICATION_ID"].as_entity("PriorApplication")
installments["INSTALMENT_ID"].as_entity("Installment")
installments["APPLICATION_ID"].as_entity("PriorApplication")
cc_balance["CARD_ID"].as_entity("BureauReportedCredit")
cc_balance["ClientID"].as_entity("Client")
print("Entity tagging complete")
Entity tagging complete
SDK Reference: Entity | TableColumn.as_entity()
Step 4: Formulate Use Case¶
Corresponds to UI Tutorial: Formulate Use Cases
Create a context (what we're predicting for), a target (what we're predicting), and a use case that ties them together.
context = fb.Context.create(
name="New Loan Application",
primary_entity=["New Application"],
description="Loan application under review.",
)
print(f"Context: {context.name}")
Context: New Loan Application
# Create target from the NEW_APPLICATION table
# The target column indicates whether the client defaulted within 6 months
target_name = "Loan_Default"
target = fb.TargetNamespace.create(
name=target_name,
window="182d",
dtype="INT",
primary_entity=["New Application"],
target_type=fb.TargetType.CLASSIFICATION,
positive_label=1,
)
print(f"Target: {target.name}")
Target: Loan_Default
use_case = fb.UseCase.create(
name="Loan Default by client",
target_name="Loan_Default",
context_name="New Loan Application",
    description="Predict a client's payment difficulties over the next 6 months for a new loan, based on its application data and prior credit history.",
)
use_case_id = str(use_case.id)
print(f"Use case: {use_case.name} (id: {use_case_id})")
Use case: Loan Default by client (id: 69d97e5806b1c7ee52d4fbab)
Step 5: Create Observation Tables¶
Corresponds to UI Tutorial: Create Observation Tables
We create observation tables from the pre-built OBSERVATIONS_WITH_TARGET source table.
Each table serves a different purpose: EDA (50K sample), training, validation, and holdout.
obs_source = get_source("OBSERVATIONS_WITH_TARGET")
# Full training table
training_table = obs_source.create_observation_table(
name="Applications up to Sept 2024",
columns_rename_mapping={"SK_ID_CURR": "SK_ID_CURR", "Loan_Default": "Loan_Default"},
context_name="New Loan Application",
target_column="Loan_Default",
sample_from_timestamp="2019-04-01",
sample_to_timestamp="2024-10-01",
)
training_table.update_purpose(fb.Purpose.TRAINING)
use_case.add_observation_table("Applications up to Sept 2024")
training_table_id = str(training_table.id)
print(f"Training table: {training_table.name} (id: {training_table_id})")
06:48:57 | WARNING | Primary entities will be a mandatory parameter in SDK version 0.7.
Done! |████████████████████████████████████████| 100% in 18.6s (0.05%/s)
Training table: Applications up to Sept 2024 (id: 69d97e5906b1c7ee52d4fbac)
# Validation table (Q4 2024)
validation_table = obs_source.create_observation_table(
name="Applications Q4 2024",
columns_rename_mapping={"SK_ID_CURR": "SK_ID_CURR", "Loan_Default": "Loan_Default"},
context_name="New Loan Application",
target_column="Loan_Default",
sample_from_timestamp="2024-10-01",
sample_to_timestamp="2025-01-01",
)
validation_table.update_purpose(fb.Purpose.VALIDATION_TEST)
use_case.add_observation_table("Applications Q4 2024")
validation_table_id = str(validation_table.id)
print(f"Validation table: {validation_table.name} (id: {validation_table_id})")
06:49:17 | WARNING | Primary entities will be a mandatory parameter in SDK version 0.7.
Done! |████████████████████████████████████████| 100% in 18.6s (0.05%/s)
Validation table: Applications Q4 2024 (id: 69d97e6d06b1c7ee52d4fbad)
# EDA table (50K sample)
eda_table = obs_source.create_observation_table(
name="50K applications",
columns_rename_mapping={"SK_ID_CURR": "SK_ID_CURR", "Loan_Default": "Loan_Default"},
context_name="New Loan Application",
target_column="Loan_Default",
sample_rows=50000,
)
eda_table.update_purpose(fb.Purpose.EDA)
use_case.add_observation_table("50K applications")
use_case.update_default_eda_table("50K applications")
eda_table_id = str(eda_table.id)
print(f"EDA table: {eda_table.name} (id: {eda_table_id})")
06:49:37 | WARNING | Primary entities will be a mandatory parameter in SDK version 0.7.
Done! |████████████████████████████████████████| 100% in 21.7s (0.05%/s)
EDA table: 50K applications (id: 69d97e8106b1c7ee52d4fbae)
SDK Reference: ObservationTable | SourceTable.create_observation_table()
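The sample windows passed above can be sanity-checked with plain dates: the training window ends exactly where the validation window begins, so the two sets are contiguous and (assuming the end bound is exclusive, which is an assumption here) disjoint:

```python
from datetime import date

# Sample windows used for the observation tables above.
train_window = (date(2019, 4, 1), date(2024, 10, 1))   # "Applications up to Sept 2024"
valid_window = (date(2024, 10, 1), date(2025, 1, 1))   # "Applications Q4 2024"

# Contiguous: training ends exactly where validation begins.
assert train_window[1] == valid_window[0]
# Properly ordered: both windows are non-empty and validation follows training.
assert train_window[0] < train_window[1] < valid_window[1]
print("training and validation windows are contiguous and disjoint")
```

Keeping the validation window strictly later in time than the training window avoids temporal leakage when the model is later evaluated.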
Step 6: Table EDA¶
Corresponds to UI Tutorial: Set Default Cleaning Operations. API docs: Table EDA
Run EDA on each table to discover data quality issues and review column distributions.
# Run EDA on the CLIENT_PROFILE table as an example
client_profile_id = str(client_profile.id)
response = client.post(
"/table_eda",
json={
"table_id": client_profile_id,
"analysis_size": 10000,
"seed": 42,
},
)
task_id = response.json()["id"]
print(f"Table EDA started (task: {task_id})")
task = wait_for_task(client, task_id)
print(f"Table EDA: {task['status']}")
Table EDA started (task: 0e836836-49b6-4a02-99e4-f47a9ba49fb8)
 status: PENDING...
Table EDA: SUCCESS
# List EDA runs and view column analysis
response = client.get(f"/table_eda?table_id={client_profile_id}")
eda_runs = response.json()["data"]
table_analysis_id = eda_runs[0]["_id"]
response = client.patch(f"/column_analysis?table_analysis_id={table_analysis_id}")
columns = response.json()["data"]
print(f"Analyzed {len(columns)} columns")
for col in columns:
    print("-" * 8)
    print(f"{col['column_name']}: {col['summary']}")
Analyzed 10 columns
--------
ClientID: The 'ClientID' column is a numeric unique identifier with no detected issues or cleaning operations applied. The dataset contains 10,000 entries, all of which are non-missing, with 9,863 unique values. The mean ClientID is 278,192.36, with a standard deviation of 101,910.21. The values range from a minimum of 100,002 to a maximum of 456,214. The distribution is fairly uniform across the range, as indicated by the histogram, with no zeros or outliers present. The quartiles are well-distributed, with the 25th percentile at 189,863.75, the median at 278,951, and the 75th percentile at 365,607.75. There are no excluded rows in the EDA plot, and the data does not require any cleaning.
--------
BIRTHDATE: The BIRTHDATE column, which is of VARCHAR type, represents client birthdates and is suspected to be in a string-based datetime format of YYYY-MM-DD. No cleaning operations were applied to this column. The raw data analysis shows that it is categorical with 10,000 entries, all of which are non-missing, and 7,431 unique values. The most frequent birthdate is 1964-06-13, appearing with a frequency of 0.0005. The top 20 birthdates each have a frequency of 5, except for the last few which have a frequency of 4, while the remaining dates not in the top 20 collectively account for 9,909 entries. There is no additional analysis on cleaned data as no cleaning was performed.
--------
GENDER: The GENDER column in the dataset is a categorical variable with no detected issues or cleaning operations applied. The column contains data for 10,000 clients, with no missing values. There are two unique categories: 'F' and 'M'. The most frequent category is 'F', which appears 6,588 times, accounting for approximately 65.88% of the data, while 'M' appears 3,412 times. The distribution of gender is visualized in a bar plot, showing a higher frequency of 'F' compared to 'M'. No further cleaning or analysis was conducted on this column.
--------
FLAG_OWN_CAR: The column "FLAG_OWN_CAR" is a categorical variable indicating whether a client owns a car, with no detected issues or cleaning operations applied. The raw data consists of 10,000 entries with no missing values. There are two unique categories: 'N' and 'Y', with 'N' being the most frequent category, appearing 65.8% of the time (6,580 occurrences), while 'Y' appears 34.2% of the time (3,420 occurrences). The data distribution is visualized in a plot showing the frequency of each category, confirming the dominance of the 'N' category. No further analysis on cleaned data is available as no cleaning was necessary.
--------
FLAG_OWN_REALTY: The column "FLAG_OWN_REALTY" is a categorical variable indicating whether a client owns a house or flat, with no detected issues or cleaning operations applied. The raw data consists of 10,000 entries with no missing values. There are two unique categories: 'Y' and 'N'. The majority of the entries, 69.68%, are labeled as 'Y', indicating that most clients own a house or flat. The remaining 30.32% are labeled as 'N'. The frequency plot confirms this distribution, showing a higher count for 'Y' compared to 'N'. No further analysis on cleaned data is available as no cleaning was necessary.
--------
CNT_CHILDREN: The column "CNT_CHILDREN" represents the number of children a client has and is of integer type with no detected issues or cleaning operations applied. The exploratory data analysis on the raw data reveals that the column is numeric with a total of 10,000 non-missing entries and no missing values. The data shows a high percentage of zeros (70.56%), indicating that most clients do not have children. The mean number of children is 0.4081, with a standard deviation of 0.7091. The values range from 0 to 6, with the majority of clients having between 0 and 1 child, as indicated by the 75th percentile being 1. The distribution is right-skewed, with a few clients having up to 6 children. There are no outliers or zeros excluded from the EDA plot, which shows the frequency of each number of children, with 0 children being the most common. No cleaning operations were necessary, so the cleaned data analysis is not applicable.
--------
INCOME_TYPE: The INCOME_TYPE column is a categorical variable with no detected issues, thus no cleaning operations were applied. The column contains data on clients' income types, such as businessman, working, and maternity leave, among others. The dataset consists of 10,000 entries with no missing values. There are 5 unique categories, with "Working" being the most frequent category, accounting for 51.47% of the data. Other notable categories include "Commercial associate" and "Pensioner," with frequencies of 2,297 and 1,848, respectively. The least frequent category is "Student," with only 2 occurrences. The distribution of income types is visualized in a bar plot, highlighting the dominance of the "Working" category.
--------
EDUCATION_TYPE: The EDUCATION_TYPE column in the dataset represents the highest level of education achieved by clients and is of categorical type with no detected issues or cleaning operations applied. The raw data consists of 10,000 entries with no missing values and five unique categories. The most frequent category is "Secondary / secondary special," which accounts for 71% of the data, followed by "Higher education" with a frequency of 24.39%. Other categories include "Incomplete higher," "Lower secondary," and "Academic degree," with frequencies of 3.37%, 1.19%, and 0.5%, respectively. The data distribution indicates a significant majority of clients have completed secondary education, with a smaller proportion achieving higher education levels. No further analysis on cleaned data is available as no cleaning operations were necessary.
--------
FAMILY_STATUS: The FAMILY_STATUS column is a categorical variable representing the family status of clients, with no detected issues or cleaning operations applied. The dataset contains 10,000 entries with no missing values and five unique categories. The most frequent category is "Married," comprising 63.79% of the data, followed by "Single / not married" at 15.01%, "Civil marriage" at 9.79%, "Separated" at 6.34%, and "Widow" at 5.07%. The distribution of family status is visualized in a bar plot, highlighting the predominance of the "Married" category. No further analysis on cleaned data is available as no cleaning was necessary.
--------
HOUSING_TYPE: The "HOUSING_TYPE" column is a categorical variable with no detected issues or cleaning operations applied. It contains data on the housing situation of clients, such as renting or living with parents. The dataset consists of 10,000 entries with no missing values. There are six unique categories, with "House / apartment" being the most common, accounting for 89.13% of the entries. Other categories include "Co-op apartment," "Municipal apartment," "Office apartment," "Rented apartment," and "With parents," with significantly lower frequencies. The distribution indicates that the majority of clients reside in houses or apartments, while a smaller proportion live in other types of housing arrangements.
# View columns with issues only
response = client.patch(f"/column_analysis?table_analysis_id={table_analysis_id}&issues=true")
columns_with_issues = response.json()["data"]
for col in columns_with_issues:
    print(f"{col['column_name']} issues: {col['issues']}")
    print(f"summary: {col['summary']}")
BIRTHDATE issues: Suspected string-based datetime format found: YYYY-MM-DD. summary: The BIRTHDATE column, which is of VARCHAR type, represents client birthdates and is suspected to be in a string-based datetime format of YYYY-MM-DD. No cleaning operations were applied to this column. The raw data analysis shows that it is categorical with 10,000 entries, all of which are non-missing, and 7,431 unique values. The most frequent birthdate is 1964-06-13, appearing with a frequency of 0.0005. The top 20 birthdates each have a frequency of 5, except for the last few which have a frequency of 4, while the remaining dates not in the top 20 collectively account for 9,909 entries. There is no additional analysis on cleaned data as no cleaning was performed.
# Apply a cleaning operation based on EDA insight
client_profile["BIRTHDATE"].update_critical_data_info(
    cleaning_operations=[
        fb.AddTimestampSchema(
            timestamp_schema=fb.TimestampSchema(
                is_utc_time=False,
                format_string="YYYY-MM-DD",
                timezone="America/Los_Angeles",
            ),
        ),
    ]
)
response = client.patch(f"/column_analysis?table_analysis_id={table_analysis_id}")
columns = response.json()["data"]
for col in columns:
    outdated = col['outdated']
    if outdated:
        print(f"{col['column_name']} is outdated")
BIRTHDATE is outdated
# Rerun EDA
response = client.post(
    "/table_eda",
    json={
        "table_id": client_profile_id,
        "analysis_size": 10000,
        "seed": 42,
    },
)
task_id = response.json()["id"]
print(f"Table EDA started (task: {task_id})")
task = wait_for_task(client, task_id)
print(f"Table EDA: {task['status']}")
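For reference, the `wait_for_task` helper used throughout this tutorial (defined in the setup section) can be sketched as below. The sketch assumes a `GET /task/{task_id}` endpoint returning a JSON document with a `status` field; the real helper may differ in details:

```python
import time

def wait_for_task(client, task_id, poll_interval=5, timeout=7200):
    """Poll the task endpoint until it reaches a terminal state (sketch)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        task = client.get(f"/task/{task_id}").json()
        if task["status"] in ("SUCCESS", "FAILURE"):
            return task
        print(f"status: {task['status']}...")
        time.sleep(poll_interval)
    raise TimeoutError(f"Task {task_id} did not finish within {timeout}s")
```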
# List EDA runs and view column analysis
response = client.get(f"/table_eda?table_id={client_profile_id}")
eda_runs = response.json()["data"]
table_analysis_id = eda_runs[0]["_id"]
response = client.patch(f"/column_analysis?table_analysis_id={table_analysis_id}&outdated=false")
columns = response.json()["data"]
for col in columns:
    if col['column_name'] == "BIRTHDATE":
        print(f"{col['column_name']}: {col['summary']}")
        column_analysis_id = col["_id"]
        break
Table EDA started (task: 60d4a52c-b18b-49f7-89dc-f7e9e3169532) status: STARTED... Table EDA: SUCCESS BIRTHDATE: The BIRTHDATE column, initially in a string-based datetime format (YYYY-MM-DD), was cleaned and converted to a timestamp format with the time zone adjusted to UTC from America/Los_Angeles. In the raw data, the column was treated as categorical with 10,000 non-missing entries and 7,431 unique values, with the most frequent date being 1964-06-13, appearing 0.05% of the time. The cleaned data, now in timestamp format, shows a mean birthdate of October 18, 1978, with a standard deviation of approximately 12.3 years. The birthdates range from September 30, 1950, to December 7, 2005. The distribution is fairly uniform with no outliers, as indicated by the histogram, which shows a gradual increase in frequency from the 1950s to the 1980s, peaking around the late 1980s, and then tapering off towards the 2000s. The data is complete with no missing values in both raw and cleaned datasets.
cleaned_plots = client.patch(
    f"/column_eda/{column_analysis_id}",
    params={"cleaned": True},
).json()
from IPython.display import HTML, display
display(HTML(cleaned_plots[0]["plots"][0]["content"]))
# Apply cleaning to DAYS_EMPLOYED in NEW_APPLICATION too
new_application["DAYS_EMPLOYED"].update_critical_data_info(
    cleaning_operations=[
        fb.DisguisedValueImputation(disguised_values=[365243], imputed_value=None),
    ]
)
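For intuition, `DisguisedValueImputation` treats each listed sentinel value as missing. A pure-Python illustration of the assumed semantics (the real cleaning is applied by FeatureByte in the warehouse, not in Python; the sample values are illustrative):

```python
# DisguisedValueImputation(disguised_values=[365243], imputed_value=None)
# replaces the sentinel 365243 in DAYS_EMPLOYED with a missing value.
DISGUISED_VALUES = {365243}

days_employed = [-2329, 365243, -1188, 365243, -4542]
cleaned = [None if v in DISGUISED_VALUES else v for v in days_employed]

print(cleaned)  # -> [-2329, None, -1188, None, -4542]
```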
SDK Reference: TableColumn.update_critical_data_info() | AddTimestampSchema | DisguisedValueImputation
Step 7: Semantic Detection¶
Corresponds to UI Tutorial: Update Descriptions and Tag Semantics API docs: Semantic Detection
Run AI-powered semantic detection to identify column types (currency, ratio, identifier, etc.). The ideation pipeline will use these semantics to generate better features.
# Run semantic detection on the BUREAU table
bureau_table = catalog.get_table("BUREAU")
bureau_table_id = str(bureau_table.id)
response = client.post(
    "/semantic_detection/column_semantic_detection",
    json={
        "table_id": bureau_table_id,
        "sample_enabled": True,
    },
)
task_id = response.json()["id"]
print(f"Semantic detection started (task: {task_id})")
task = wait_for_task(client, task_id)
print(f"Semantic detection: {task['status']}")
Semantic detection started (task: d6931081-85ce-456f-bebb-6b66ffc66602) status: STARTED... Semantic detection: SUCCESS
# Review detection results
semantic_detection_id = task.get("payload", {}).get("output_document_id")
response = client.get(f"/semantic_detection/{semantic_detection_id}")
detection = response.json()
# Show a few suggested semantics
for item in detection.get("suggested_semantics", [])[:10]:
    existing = item.get("existing_semantic_tag", "-")
    proposed = item.get("proposed_semantic_tag", "-")
    print(f" {item['table_name']}.{item['column_name']}: {existing} -> {proposed}")
BUREAU.available_at: record_creation_timestamp -> record_creation_timestamp BUREAU.credit_update: event_timestamp -> event_timestamp BUREAU.SK_ID_BUREAU: event_id -> event_id BUREAU.ClientID: None -> unique_identifier BUREAU.bureau_application_time: None -> timestamp_field BUREAU.credit_end_date: None -> timestamp_field BUREAU.credit_end_fact: None -> timestamp_field BUREAU.CNT_CREDIT_PROLONG: None -> count BUREAU.AMT_CREDIT_SUM: None -> non_negative_amount BUREAU.AMT_CREDIT_SUM_DEBT: None -> unbounded_amount
# Apply suggested semantics to table columns
# This sets the ground truth so ideation uses them directly
for item in detection.get("suggested_semantics", []):
    table_name = item.get("table_name")
    column_name = item.get("column_name")
    final_tag = item.get("final_semantic_tag")
    if final_tag:
        table_obj = catalog.get_table(table_name)
        table_type = table_obj.type.lower()
        table_id = str(table_obj.id)
        client.patch(
            f"/{table_type}/{table_id}/column_semantic",
            json={
                "column_semantic_updates": [
                    {"column_name": column_name, "semantic": final_tag}
                ],
            },
        )
print("Semantic tags applied to table columns")
Semantic tags applied to table columns
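The loop above issues one PATCH per column. As a hypothetical optimization, the suggestions could first be grouped by table so each table receives a single request carrying all of its `column_semantic_updates`. The sample items below are illustrative (including the made-up `SOME_COLUMN`), not actual detection output:

```python
from collections import defaultdict

suggestions = [
    {"table_name": "BUREAU", "column_name": "ClientID", "final_semantic_tag": "unique_identifier"},
    {"table_name": "BUREAU", "column_name": "CNT_CREDIT_PROLONG", "final_semantic_tag": "count"},
    {"table_name": "BUREAU", "column_name": "SOME_COLUMN", "final_semantic_tag": None},
]

# Collect per-table updates, skipping columns without a final tag
updates_by_table = defaultdict(list)
for item in suggestions:
    if item["final_semantic_tag"]:
        updates_by_table[item["table_name"]].append(
            {"column_name": item["column_name"], "semantic": item["final_semantic_tag"]}
        )

# One payload per table, matching the column_semantic_updates schema above
payloads = {
    table: {"column_semantic_updates": updates}
    for table, updates in updates_by_table.items()
}
print(len(payloads["BUREAU"]["column_semantic_updates"]))  # -> 2
```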
Step 8: Create Development Dataset¶
Corresponds to UI Tutorial: Create Development Dataset API docs: Development Dataset
A development dataset is a sampled copy of source tables used to speed up ideation.
# Get default plan configuration from the EDA table
response = client.post(
    "/development_plan/defaults",
    json={"observation_table_id": eda_table_id},
)
plan_defaults = response.json()
print(f"Default feature lookback: {plan_defaults.get('feature_lookback_in_months')} months")
print(f"Tables to sample: {len(plan_defaults.get('table_ids_subset', []))}")
Default feature lookback: 15 months Tables to sample: 7
# Customize the plan to match the UI tutorial settings
# - Increase feature lookback to 25 months for broader historical data
# - Increase max sample ratio to 0.15 to ensure materialization
plan_defaults["feature_lookback_in_months"] = 25
plan_defaults["max_sample_to_full_ratio"] = 0.15
# Create the development plan
response = client.post("/development_plan", json=plan_defaults)
development_plan = response.json()
development_plan_id = development_plan["_id"]
development_dataset_id = development_plan.get("development_dataset_id")
print(f"Development plan created: {development_plan_id}")
print(f"Development dataset: {development_dataset_id}")
print(f"Feature lookback: {plan_defaults['feature_lookback_in_months']} months")
print(f"Max sample ratio: {plan_defaults['max_sample_to_full_ratio']}")
Development plan created: 69d97effc4e0e91875fd94c0 Development dataset: 69d97effc4e0e91875fd94c1 Feature lookback: 25 months Max sample ratio: 0.15
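One plausible reading of `max_sample_to_full_ratio`, sketched here purely for intuition (illustrative, not the actual planner logic): sampling is only applied when the requested sample stays within the cap relative to the table's full row count; otherwise the table is kept whole, which is what the SQL plan's "tables to skip sampling" count refers to.

```python
def plan_sample(full_rows: int, requested_rows: int, max_ratio: float) -> int:
    """Rows to materialize under the ratio cap (assumed semantics)."""
    if requested_rows / full_rows > max_ratio:
        return full_rows  # sampling skipped; table copied as-is
    return requested_rows

print(plan_sample(1_000_000, 10_000, 0.15))  # -> 10000 (sampled)
print(plan_sample(50_000, 10_000, 0.15))     # -> 50000 (cap exceeded, kept whole)
```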
# Create sampling tables (compute distinct entity IDs per table)
response = client.patch(f"/development_plan/{development_plan_id}/sampling_tables")
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
print(f"Sampling tables: {task['status']}")
# Review the SQL sampling plan
response = client.get(f"/development_plan/{development_plan_id}/sql_plan")
sql_plan = response.json()
print(f"Tables to skip sampling: {len(sql_plan.get('tables_skip_sampling', []))}")
print(f"Max sample ratio: {sql_plan.get('max_sample_to_full_ratio')}")
print(f"SQL: {sql_plan['sql_str'][:1000]}...........")
status: STARTED... Sampling tables: SUCCESS Tables to skip sampling: 0 Max sample ratio: 0.15 SQL: ------------------------------------------------------- -- 1) CREATE TABLES WITH DISTINCT ENTITY IDS ------------------------------------------------------- -- Build a Context list of distinct New_Application IDs -- Distinct keys (SK_ID_CURR) from 50K applications CREATE TABLE "TUTORIAL"."TUTORIAL_PROD"."DEV_69d97effc4e0e91875fd94c0_New_Application_CONTEXT_IDS" AS SELECT DISTINCT base."SK_ID_CURR" FROM "TUTORIAL"."TUTORIAL_PROD"."OBSERVATION_TABLE_69d97e8598a542e55a3c2fb2" AS base -- Build a Context list of distinct Client IDs -- Join NEW_APPLICATION with DEV_69d97effc4e0e91875fd94c0_New_Application_CONTEXT_IDS CREATE TABLE "TUTORIAL"."TUTORIAL_PROD"."DEV_69d97effc4e0e91875fd94c0_Client_CONTEXT_IDS" AS SELECT DISTINCT base."ClientID" FROM "DEMO_DATASETS"."CREDIT_DEFAULT"."NEW_APPLICATION" AS base INNER JOIN "TUTORIAL"."TUTORIAL_PROD"."DEV_69d97effc4e0e91875fd94c0_New_Application_CONTEXT_IDS" AS join_0 ON base."SK_ID_CURR" = join_0."SK_ID_CURR" -- Build a Sampling list of disti...........
# Materialize development tables in the warehouse
response = client.patch(
    f"/development_plan/{development_plan_id}/sampled_tables",
    json={},
)
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
print(f"Development tables materialized: {task['status']}")
# Verify dataset is active
response = client.get(f"/development_dataset/{development_dataset_id}")
dataset = response.json()
print(f"Dataset status: {dataset['status']}")
print(f"Tables: {len(dataset.get('development_tables', []))}")
status: STARTED... Development tables materialized: SUCCESS Dataset status: Active Tables: 7
Step 9: Run Ideation Pipeline¶
Corresponds to UI Tutorial: Ideate Features and Models API docs: Ideation
The ideation pipeline automates the full feature engineering workflow: table selection, semantic detection, transforms, feature generation, EDA, feature selection, and model training.
# Create a pipeline with the development dataset for faster ideation
response = client.post(
    "/pipeline",
    json={
        "action": "create",
        "use_case_id": use_case_id,
        "pipeline_type": "FEATURE_IDEATION",
        "development_dataset_id": development_dataset_id,
    },
)
pipeline_id = response.json()["_id"]
print(f"Pipeline created: {pipeline_id}")
Pipeline created: 69d97f3d49c7d923c28449fa
# Configure training and validation tables
response = client.patch(
    f"/pipeline/{pipeline_id}/step_configs",
    json={
        "step_type": "model-train-setup-v2",
        "data_source": {
            "type": "train_valid_observation_tables",
            "training_table_id": training_table_id,
            "validation_table_id": validation_table_id,
        },
    },
)
print("Configured training/validation tables")
Configured training/validation tables
# Run the pipeline to completion
response = client.patch(
    f"/pipeline/{pipeline_id}",
    json={"action": "advance", "step_type": "end"},
)
pipeline_task = response.json()["pipeline_runner_task"]
if pipeline_task:
    task_id = pipeline_task["task_id"]
    print(f"Pipeline running (task: {task_id})")
    print("This will take a while...")
    task = wait_for_task(client, task_id, poll_interval=180)
    print(f"Pipeline: {task['status']}")
Pipeline running (task: 12c6a7c6-e211-4f49-9d2f-e1d29bf9efe4) This will take a while... status: STARTED... (repeated while the pipeline runs) Pipeline: SUCCESS
# Monitor pipeline status
response = client.get(f"/pipeline/{pipeline_id}")
data = response.json()
print(f"Current step: {data['current_step_type']}")
for group in data["groups"]:
    for step in group["steps"]:
        marker = "+" if step["step_status"] == "completed" else " "
        print(f" [{marker}] {step['step_type']}: {step['step_status']}")
Current step: end [+] start: completed [+] table-selection: completed [+] semantic-detection: completed [+] transform: completed [+] filter: completed [+] ideation-metadata: completed [+] feature-ideation: completed [+] eda: completed [+] feature-selection: completed [+] model-train-setup-v2: completed [+] model-train: completed
for group in data["groups"]:
    for step in group["steps"]:
        if step["step_type"] == "model-train-setup-v2":
            primary_metric = step.get("primary_metric")
            print("primary_metric:", primary_metric)
primary_metric: roc_auc
# Get the best model from the pipeline's validation leaderboard
response = client.get(
    "/catalog/ml_model",
    params={
        "use_case_id": use_case_id,
        "pipeline_id": pipeline_id,
        "pipeline_id_to_mark": pipeline_id,
        "page": 1,
        "page_size": 100,
        "sort_by": primary_metric,
        "sort_by_metric": True,
    },
)
models = response.json()["data"]
for m in models:
    tag = " (best)" if m == models[0] else ""
    print(f" {m['name']}{tag}: {m['metrics']['valid'][primary_metric]}")
ideation_model_id = models[0]["_id"]
print(f"\nBest model: {models[0]['name']} (id: {ideation_model_id})")
LightGBM [358 features: Loan Default Risk Assessment Features] (best): 0.7976001150827741 XGBoost [358 features: Loan Default Risk Assessment Features]: 0.7953798526034435 Best model: LightGBM [358 features: Loan Default Risk Assessment Features] (id: 69d98b9b885d3d27f3fca1d0)
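The sorting above happens server-side; picking the best model client-side is simply a `max` over the validation metric. A minimal sketch, with values mirroring the leaderboard output above:

```python
# Each row carries its validation metrics, as in the /catalog/ml_model response
leaderboard_rows = [
    {"name": "LightGBM [358 features]", "metrics": {"valid": {"roc_auc": 0.7976}}},
    {"name": "XGBoost [358 features]", "metrics": {"valid": {"roc_auc": 0.7954}}},
]
best = max(leaderboard_rows, key=lambda m: m["metrics"]["valid"]["roc_auc"])
print(best["name"])  # -> LightGBM [358 features]
```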
# View the pipeline report
response = client.get(f"/pipeline/{pipeline_id}/report")
report = response.json()
print(f"Summary: {report.get('summary', 'N/A')[:500]}")
Summary: The use case focuses on predicting loan defaults within the next 6 months using three aggregation time windows: **26 weeks**, **52 weeks**, and **104 weeks**. Key tables involved are **INSTALLMENTS_PAYMENTS**, **PREVIOUS_APPLICATION**, **BUREAU**, **NEW_APPLICATION**, **CLIENT_PROFILE**, and **LOAN_STATUS**, which provide insights into clients' payment behaviors and credit history. Data transformations and feature creation resulted in **2385 ideated features** across **27 themes**, with most fea
# Download report as PDF
response = client.get(f"/pipeline/{pipeline_id}/report_pdf")
with open("ideation_report.pdf", "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)
print("Report saved to ideation_report.pdf")
Report saved to ideation_report.pdf
Step 9b: Explore Ideated Features¶
API docs: Ideated Features
Browse the features generated by ideation. Each feature includes its SDK code, relevance score, and a construction lineage — useful for understanding, reproducing, or modifying features.
# Get the feature ideation ID from the pipeline
response = client.get(f"/pipeline/{pipeline_id}/feature_ideation")
feature_ideation_id = response.json().get("feature_ideation_id")
# List top features by Predictive Score
response = client.get(
    f"/catalog/feature_ideation/{feature_ideation_id}/suggested_features",
    params={"page_size": 10, "sort_by": "predictive_power_score", "sort_dir": "desc"},
)
suggested_features = response.json()["data"]
for feat in suggested_features[:10]:
    print(feat["feature_name"])
    print(f" Predictive Score: {feat.get('predictive_power_score', 'N/A')}")
    print(f" Type: {feat.get('signal_type', 'N/A')}, Table: {feat.get('primary_table', [])}")
    print()
NEW_APPLICATION_EXT_SOURCE_3 Predictive Score: 0.30629863467855545 Type: attribute, Table: ['NEW_APPLICATION'] NEW_APPLICATION_EXT_SOURCE_2 Predictive Score: 0.28749787401070903 Type: attribute, Table: ['NEW_APPLICATION'] CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w Predictive Score: 0.22078103627818924 Type: stats, Table: ['BUREAU'] CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_52w Predictive Score: 0.21296984363327365 Type: stats, Table: ['BUREAU'] CLIENT_Min_of_BureauReportedCredits_Available_Credits_26w Predictive Score: 0.20329257137420864 Type: stats, Table: ['BUREAU'] CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_26w Predictive Score: 0.2020876017095521 Type: stats, Table: ['BUREAU'] CLIENT_Min_of_BureauReportedCredits_Available_Credits_52w Predictive Score: 0.20138730853691222 Type: stats, Table: ['BUREAU'] CLIENT_Min_of_BureauReportedCredits_Available_Credits_104w Predictive Score: 0.1988541944357085 Type: stats, Table: ['BUREAU'] CLIENT_Avg_of_BureauReportedCredits_Available_Credits_52w Predictive Score: 0.18368288675295674 Type: stats, Table: ['BUREAU'] CLIENT_Avg_of_BureauReportedCredits_Available_Credits_26w Predictive Score: 0.18346759290787085 Type: stats, Table: ['BUREAU']
# View the SDK code for a feature
response = client.get(
    f"/catalog/feature_ideation/{feature_ideation_id}/suggested_features",
    params={"page_size": 100, "sort_by": "predictive_power_score", "sort_dir": "desc"},
)
suggested_features = response.json()["data"]
for feature in suggested_features:
    if feature["feature_name"] == "CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w":
        break
print(f"Feature: {feature['feature_name']}")
print(f"Description: {feature.get('feature_description', '')}")
print(f"\nSDK Code:\n{feature['code']}")
Feature: CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w
Description: Max of BureauReportedCredits AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUMS for the Client over a 104w period.
SDK Code:
"""
SDK code to create CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w
Feature description:
Max of BureauReportedCredits AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUMS for the Client over a 104w
period.
"""
import featurebyte as fb
#==================================================================================================
# Activate catalog
#==================================================================================================
catalog = fb.Catalog.activate("Credit Default API Tutorial")
#==================================================================================================
# Get view from table
#==================================================================================================
# Get view from BUREAU event table.
bureau_view = catalog.get_view("BUREAU")
#==================================================================================================
# Create ratio column
#==================================================================================================
bureau_view["AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUM"] = (
    bureau_view["AMT_CREDIT_SUM_DEBT"] / bureau_view["AMT_CREDIT_SUM"]
)
#==================================================================================================
# Do window aggregation from BUREAU
#==================================================================================================
# Group BUREAU view by Client entity (ClientID).
bureau_view_by_client = bureau_view.groupby(['ClientID'])
#--------------------------------------------------------------------------------------------------
# Get Max of AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUM for the Client over time.
client_max_of_bureaureportedcredits_amt_credit_sum_debt_to_amt_credit_sums_104w = bureau_view_by_client.aggregate_over(
    "AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUM",
    method="max",
    feature_names=["CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w"],
    windows=["104w"],
)["CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w"]
#==================================================================================================
# Save feature
#==================================================================================================
# Save feature
client_max_of_bureaureportedcredits_amt_credit_sum_debt_to_amt_credit_sums_104w.save()
#==================================================================================================
# Update feature type
#==================================================================================================
# Update feature type
client_max_of_bureaureportedcredits_amt_credit_sum_debt_to_amt_credit_sums_104w.update_feature_type(
    "numeric"
)
#==================================================================================================
# Add description
#==================================================================================================
# Add description
client_max_of_bureaureportedcredits_amt_credit_sum_debt_to_amt_credit_sums_104w.update_description(
    "Max of BureauReportedCredits AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUMS "
    "for the Client over a 104w period."
)
# Get full lineage for a feature — shows step-by-step construction
response = client.get(
    f"/catalog/feature_ideation/suggested_feature_metadata/{feature['_id']}",
)
metadata = response.json()
# Show each construction step
lineage = metadata.get("lineage", {})
for node in lineage.get("nodes", []):
    print(f"Step: {node['title']}")
    print(f" {node['description']}")
    print(f"Code:\n{node['code']}")
    print()
Step: Create View from Table
Create view from BUREAU event table.
Code:
# Get view from BUREAU event table.
bureau_view = catalog.get_view("BUREAU")
Step: Transform based on a Transform Object
Divide AMT_CREDIT_SUM_DEBT by AMT_CREDIT_SUM within bureau_view.
Code:
bureau_view["AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUM"] = (
    bureau_view["AMT_CREDIT_SUM_DEBT"] / bureau_view["AMT_CREDIT_SUM"]
)
Step: GroupBy
Group bureau_view by ClientID.
Code:
# Group BUREAU view by Client entity (ClientID).
bureau_view_by_client = bureau_view.groupby(["ClientID"])
Step: Aggregate BureauReportedCredits by Client
Maximum 'AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUM' by 'ClientID' Over a 104w Period using 'bureau_view_by_client'.
Code:
# Get Max of AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUM for the Client over time.
client_max_of_bureaureportedcredits_amt_credit_sum_debt_to_amt_credit_sums_104w = bureau_view_by_client.aggregate_over(
    "AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUM",
    method="max",
    feature_names=[
        "CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w"
    ],
    windows=["104w"],
)[
    "CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w"
]
Step 10: Feature Refinement¶
Corresponds to UI Tutorial: Refine Ideation and Create Feature List API docs: Feature Refinement
Extract the top features by importance from the ideation model and create a refined feature list.
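A sketch of how `top_n` and `importance_threshold_percentage` plausibly combine (assumed semantics; the helper name and importance values are illustrative): rank feature keys by importance and keep them until either cumulative importance reaches the threshold or the `top_n` cap is hit.

```python
def select_by_importance(importances, top_n=200, threshold=0.90):
    """Keep ranked feature keys up to a cumulative-importance threshold."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(importances.values())
    kept, cum = [], 0.0
    for name, imp in ranked:
        if len(kept) >= top_n or cum >= threshold * total:
            break
        kept.append(name)
        cum += imp
    return kept

imps = {"f1": 0.5, "f2": 0.3, "f3": 0.15, "f4": 0.05}
print(select_by_importance(imps, top_n=3))  # -> ['f1', 'f2', 'f3']
```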
# Create a refined feature list from model key importance
response = client.post(
    "/feature_list_from_model",
    json={
        "mode": "Feature key importance based",
        "ml_model_id": ideation_model_id,
        "top_n": 200,
        "importance_threshold_percentage": 0.90,
    },
)
task_id = response.json()["id"]
print(f"Feature refinement started (task: {task_id})")
task = wait_for_task(client, task_id)
print(f"Feature refinement: {task['status']}")
Feature refinement started (task: 6366404d-5149-4469-83a2-9ca5d0c4e5cc) status: STARTED... status: STARTED... status: STARTED... status: STARTED... status: STARTED... Feature refinement: SUCCESS
# Inspect the refined feature list
feature_list_from_model_id = task.get("payload", {}).get("output_document_id")
response = client.get(f"/feature_list_from_model/{feature_list_from_model_id}")
result = response.json()
feature_list_id = result["feature_list_id"]
print(f"Feature keys selected: {result['feature_keys_created_count']}")
print(f"Total features: {result['features_selected_count']}")
# Get feature list details
response = client.get(f"/feature_list/{feature_list_id}")
feature_list = response.json()
print(f"Feature list: {feature_list['name']} ({len(feature_list['feature_ids'])} features)")
Feature keys selected: 37 Total features: 200 Feature list: 200 Features from LightGBM [358 features: Loan Default Risk Assessment Features] (Top 200 Feature Keys) (200 features)
Step 10b: Feature EDA¶
API docs: Feature EDA
Run EDA on the refined features to analyze their distributions and relationship with the target.
# Run EDA on the refined feature list
response = client.post(
    "/eda",
    json={
        "feature_list_id": feature_list_id,
        "use_case_id": use_case_id,
    },
)
task_id = response.json()["id"]
print(f"Feature EDA started (task: {task_id})")
task = wait_for_task(client, task_id)
print(f"Feature EDA: {task['status']}")
Feature EDA started (task: 7a8412ee-707e-4595-adfb-c790e6925679) status: STARTED... Feature EDA: SUCCESS
# View EDA plots for the first feature
from IPython.display import HTML, display
response = client.get(f"/feature_list/{feature_list_id}")
first_feature_id = response.json()["feature_ids"][0]
response = client.get(
    f"/eda/{first_feature_id}/plots",
    params={"use_case_id": use_case_id},
)
plots = response.json()
for plot in plots:
    for p in plot.get("plots", []):
        if "content" in p:
            display(HTML(p["content"]))
Step 11: Train Standalone Model¶
Corresponds to UI Tutorial: Create New Feature Lists and Models API docs: Model Training
Train a model on the refined feature list using the recommended settings.
# Get suggested model settings
response = client.get(
    f"/use_case/{use_case_id}/ml_model_template_setting",
    params={
        "training_table_id": training_table_id,
        "validation_table_id": validation_table_id,
        "feature_list_id": feature_list_id,
        "machine_learning_role": "OUTCOME",
    },
)
settings = response.json()
print(f"Objective: {settings['objective']}")
print(f"Metric: {settings['metric']}")
print(f"Calibration: {settings.get('calibration_method')}")
Objective: binary Metric: area_under_curve Calibration: None
# Get available model templates
response = client.get(
    f"/use_case/{use_case_id}/ml_model_template",
    params={
        "feature_list_id": feature_list_id,
        "training_table_id": training_table_id,
        "objective": settings["objective"],
        "metric": settings["metric"],
        "machine_learning_role": "OUTCOME",
    },
)
templates = response.json()["data"]
for t in templates:
    print(f" Template: {t['type']} (id: {t['_id']})")
# Use the first template
template = templates[0]
Template: NCTsDE_XGB (id: 6838241967efe8d7542bb238) Template: NCTsDE_LGB (id: 6838241967efe8d7542bb23b) Template: NCTDE_XGB (id: 6838241967efe8d7542bb232) Template: NCTDE_LGB (id: 6838241967efe8d7542bb235)
# Extract default parameters from template
node_name_to_parameters = {}
for preprocessor in template.get("preprocessors", []):
    params = {
        p["name"]: p["default_value"]
        for p in preprocessor.get("parameters_metadata", [])
        if p.get("default_value") is not None
    }
    if params:
        node_name_to_parameters[preprocessor["node_name"]] = params
model_info = template.get("model", {})
if model_info:
    params = {
        p["name"]: p["default_value"]
        for p in model_info.get("parameters_metadata", [])
        if p.get("default_value") is not None
    }
    if params:
        node_name_to_parameters[model_info["node_name"]] = params
print(f"Nodes configured: {list(node_name_to_parameters.keys())}")
Nodes configured: ['transformer_2', 'estimator_1']
# Train the model
payload = {
    "use_case_id": use_case_id,
    "model_name": "Credit Default - Refined Features",
    "data_source": {
        "type": "train_valid_observation_tables",
        "training_table_id": training_table_id,
        "validation_table_id": validation_table_id,
    },
    "feature_list_id": feature_list_id,
    "model_template_type": template["type"],
    "objective": settings["objective"],
    "metric": settings["metric"],
    "node_name_to_parameters": node_name_to_parameters,
    "role": "OUTCOME",
}
if settings.get("calibration_method"):
    payload["calibration_method"] = settings["calibration_method"]
response = client.post("/ml_model", json=payload)
task_id = response.json()["id"]
print(f"Model training started (task: {task_id})")
task = wait_for_task(client, task_id)
ml_model_id = task.get("payload", {}).get("output_document_id")
print(f"Model trained: {ml_model_id}")
Model training started (task: 1c792cdf-d1bf-4714-8a5d-1c2b51ce964f) status: STARTED... (repeated while training runs) Model trained: 69d99d9dc4e0e91875fd9ef0
# View model details
response = client.get(f"/catalog/ml_model/{ml_model_id}")
model = response.json()
print(f"Model: {model['name']}")
print(f"Template: {model['model_template_type']}")
print(f"Features: {len(model.get('feature_importance', []))}")
# Show top 5 features by importance
for fi in sorted(model.get("feature_importance", []), key=lambda x: x["importance"], reverse=True)[:5]:
    print(f" {fi['feature']}: {fi['importance_percent'] * 100:.1f}%")
Model: Credit Default - Refined Features Template: NCTsDE_XGB Features: 200 NEW_APPLICATION_EXT_SOURCE_2: 8.1% NEW_APPLICATION_EXT_SOURCE_3: 7.0% NEW_APPLICATION_EXT_SOURCE_1: 3.5% CLIENT_GENDER: 3.2% NEW_APPLICATION_DAYS_EMPLOYED: 2.9%
Step 12: Evaluate on Holdout¶
Corresponds to UI Tutorial: Refit Model API docs: Evaluation | Batch Predictions
Create a holdout observation table, generate predictions, and evaluate. The leaderboard is created automatically when predictions are generated on an observation table with a target.
# Create holdout observation table (Q1 2025)
holdout_table = obs_source.create_observation_table(
    name="Applications Q1 2025",
    columns_rename_mapping={"SK_ID_CURR": "SK_ID_CURR", "Loan_Default": "Loan_Default"},
    context_name="New Loan Application",
    target_column="Loan_Default",
    sample_from_timestamp="2025-01-01",
    sample_to_timestamp="2025-04-01",
)
holdout_table.update_purpose(fb.Purpose.VALIDATION_TEST)
use_case.add_observation_table("Applications Q1 2025")
holdout_table_id = str(holdout_table.id)
print(f"Holdout table: {holdout_table.name} (id: {holdout_table_id})")
09:27:57 | WARNING | Primary entities will be a mandatory parameter in SDK version 0.7. WARNING :featurebyte.api.source_table:Primary entities will be a mandatory parameter in SDK version 0.7.
Done! |████████████████████████████████████████| 100% in 15.5s (0.06%/s) Holdout table: Applications Q1 2025 (id: 69d9a39d06b1c7ee52d4fbaf)
# Generate predictions on holdout
response = client.post(
    f"/ml_model/{ml_model_id}/prediction_table",
    json={
        "request_input": {
            "request_type": "observation_table",
            "table_id": holdout_table_id,
        },
        "include_input_features": False,
    },
)
)
task_id = response.json()["id"]
print(f"Prediction started (task: {task_id})")
task = wait_for_task(client, task_id)
prediction_table_id = task.get("payload", {}).get("output_document_id")
print(f"Prediction table: {prediction_table_id}")
Prediction started (task: 08cb2d92-88a0-4956-8d88-04d66462560b) status: STARTED... status: STARTED... status: STARTED... status: STARTED... status: STARTED... status: STARTED... status: STARTED... Prediction table: 69d9a3b149c7d923c2844a05
# Download predictions as Parquet
import io
import pyarrow.parquet as pq
response = client.get(
f"/prediction_table/parquet/{prediction_table_id}",
stream=True,
)
buffer = io.BytesIO()
for chunk in response.iter_content(chunk_size=8192):
if chunk:
buffer.write(chunk)
buffer.seek(0)
predictions_df = pq.read_table(buffer).to_pandas()
print(f"Downloaded {len(predictions_df)} predictions")
predictions_df.head()
Downloaded 12527 predictions
| | __FB_TABLE_ROW_INDEX | POINT_IN_TIME | SK_ID_CURR | Loan_Default | prediction | prediction_prob |
|---|---|---|---|---|---|---|
| 0 | 1 | 2025-01-10 17:14:33 | 100011 | 0 | 0 | 0.062988 |
| 1 | 2 | 2025-02-12 11:40:00 | 100024 | 0 | 0 | 0.089458 |
| 2 | 3 | 2025-03-05 20:38:59 | 100060 | 0 | 0 | 0.029231 |
| 3 | 4 | 2025-02-07 18:08:40 | 100082 | 0 | 0 | 0.042675 |
| 4 | 5 | 2025-02-08 10:00:18 | 100084 | 0 | 0 | 0.145059 |
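With the predictions downloaded, the leaderboard's ROC AUC can be cross-checked locally without further API calls. A minimal pure-Python sketch of the metric (the pairwise Mann-Whitney formulation), demonstrated on a toy sample shaped like `predictions_df` — on the real frame you would pass `predictions_df["Loan_Default"]` and `predictions_df["prediction_prob"]`:

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outscores a
    random negative (ties count half) -- the Mann-Whitney formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy stand-in for the Loan_Default / prediction_prob columns:
labels = [0, 0, 1, 0, 1]
scores = [0.06, 0.32, 0.45, 0.03, 0.30]
print(round(roc_auc(labels, scores), 3))  # → 0.833
```

For the full 12,527-row holdout this brute-force pairwise loop is still fast enough, but `sklearn.metrics.roc_auc_score` gives the same number in one call if scikit-learn is available.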
# The holdout leaderboard is created automatically when predictions
# are generated on an observation table with a target.
# Find it by observation table ID.
response = client.get(
"/catalog/leaderboard",
params={
"observation_table_id": holdout_table_id,
"observation_table_purpose": "holdout",
"role": "OUTCOME",
},
)
leaderboard = response.json()["data"][0]
leaderboard_id = leaderboard["_id"]
primary_metric = leaderboard["primary_metric"]
sort_dir = leaderboard.get("sort_order", "desc")
print(f"Leaderboard: {leaderboard['name']} (metric: {primary_metric}, {sort_dir})")
Leaderboard: Applications Q1 2025_holdout_leaderboard (metric: roc_auc, desc)
# View leaderboard results - list models sorted by metric
response = client.get(
"/catalog/ml_model",
params={
"leaderboard_id": leaderboard_id,
"sort_by": primary_metric,
"sort_dir": sort_dir,
"sort_by_metric": True,
"show_refits": True,
"leaderboard_role": "OUTCOME",
"page_size": 100,
},
)
models = response.json()["data"]
for i, m in enumerate(models):
    tag = " (best)" if i == 0 else ""
    print(f" {m['name']}{tag}: {m['leaderboard_evaluation_scores'][primary_metric]}")
ideation_model_id = models[0]["_id"]
print(f"\nBest model: {models[0]['name']} (id: {ideation_model_id})")
Credit Default - Refined Features (best): 0.7955597455713544 Best model: Credit Default - Refined Features (id: 69d99d9dc4e0e91875fd9ef0)
# Generate evaluation plots
response = client.request("OPTIONS", f"/ml_model/{ml_model_id}/evaluate")
options = response.json()
print(f"Available plots: {options.get('options', [])}")
Available plots: ['roc_curve', 'precision_recall_curve', 'ks_and_gain_curve', 'lift_curve', 'gain_report', 'predicted_vs_actual_per_bin', 'distribution', 'confusion_matrix']
from IPython.display import HTML, display
# Kolmogorov-Smirnov (KS) / gain curve (binary classification)
response = client.post(
f"/ml_model/{ml_model_id}/evaluate",
json={
"option": "ks_and_gain_curve",
"plot_params": {"height": 500, "width": 800, "font_size": 14},
"holdout_table": {"table_type": "observation_table", "table_id": holdout_table_id},
},
)
display(HTML(response.json()["content"]))
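Outside a notebook, `IPython.display` is unavailable, but the evaluation endpoint returns plain HTML that can be written to disk and opened in a browser. A small sketch (the file name is arbitrary):

```python
from pathlib import Path

def save_plot(html: str, path: str = "ks_and_gain_curve.html") -> Path:
    """Write an evaluation plot's HTML payload to disk for viewing in a browser."""
    out = Path(path)
    out.write_text(html, encoding="utf-8")
    return out

# e.g. save_plot(response.json()["content"])
```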
Step 12b: Refit Model¶
Corresponds to UI Tutorial: Refit Model API docs: Model Training — Refit
Refit the best model on more recent data while keeping the same feature list and hyperparameters.
# Create a more recent training table
refit_training_table = obs_source.create_observation_table(
name="Applications up to Dec 2024",
columns_rename_mapping={"SK_ID_CURR": "SK_ID_CURR", "Loan_Default": "Loan_Default"},
context_name="New Loan Application",
target_column="Loan_Default",
sample_from_timestamp="2019-04-01",
sample_to_timestamp="2025-01-01",
)
refit_training_table.update_purpose(fb.Purpose.TRAINING)
use_case.add_observation_table("Applications up to Dec 2024")
refit_training_table_id = str(refit_training_table.id)
print(f"Refit training table: {refit_training_table.name}")
09:31:49 | WARNING | Primary entities will be a mandatory parameter in SDK version 0.7.
Done! |████████████████████████████████████████| 100% in 18.6s (0.05%/s) Refit training table: Applications up to Dec 2024
# Refit the model on more recent data
response = client.post(
f"/ml_model/{ml_model_id}/refit",
json={
"data_source": {
"type": "train_valid_observation_tables",
"training_table_id": refit_training_table_id,
"validation_table_id": None,
},
"model_name": "Credit Default - Refit Dec 2024",
},
)
task_id = response.json()["id"]
print(f"Refit started (task: {task_id})")
task = wait_for_task(client, task_id)
refit_model_id = task.get("payload", {}).get("output_document_id")
print(f"Refit model: {refit_model_id}")
Refit started (task: f762121a-e9fd-4c0c-97ee-0d1f764f0379) status: STARTED... Refit model: 69d9a49949c7d923c2844a08
Step 13: Deploy¶
Corresponds to UI Tutorial: Deploy and Serve API docs: Deployment
# Get the feature list object via SDK
feature_list_obj = fb.FeatureList.get(feature_list["name"])
# Deploy with make_production_ready=True
# This upgrades all features to PRODUCTION_READY and creates the deployment
deployment = feature_list_obj.deploy(
deployment_name="Credit Default - Refined Model",
make_production_ready=True,
use_case_name="Loan Default by client",
)
print(f"Deployment: {deployment.name}")
Loading Feature(s) |████████████████████████████████████████| 200/200 [100%] Done! |████████████████████████████████████████| 100% in 40.2s (0.02%/s) Done! |████████████████████████████████████████| 100% in 6.2s (0.16%/s) Deployment: Credit Default - Refined Model
# Enable the deployment
deployment.enable()
print("Deployment enabled")
Done! |████████████████████████████████████████| 100% in 10:20.9 (0.00%/s) Deployment enabled
# Verify deployment is active
catalog.list_deployments()
| | id | name | feature_list_name | feature_list_version | num_feature | enabled |
|---|---|---|---|---|---|---|
| 0 | 69d9aa2f06b1c7ee52d4fbb1 | Credit Default - Refined Model | 200 Features from LightGBM [358 features: Loan... | V260411 | 200 | True |
deployment_id = str(deployment.id)
# Generate deployment SQL for batch serving
response = client.post("/deployment_sql", json={"deployment_id": deployment_id})
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
response = client.get("/deployment_sql", params={"deployment_id": deployment_id})
sql_result = response.json()
print("Deployment SQL generated. Schedule this in your warehouse for batch feature computation.")
status: STARTED... Deployment SQL generated. Schedule this in your warehouse for batch feature computation.
print(sql_result["data"][0]["feature_table_sqls"][0]["sql_code"][:2000], "....")
SELECT
L."SK_ID_CURR",
R."CLIENT_Proportion_of_BureauReportedCredits_AMT_CREDIT_SUMS_when_BureauReportedCredit_CREDIT_ACTIVE_is_Closed_104w",
R."CLIENT_Proportion_of_BureauReportedCredits_AMT_CREDIT_SUMS_when_BureauReportedCredit_CREDIT_ACTIVE_is_Closed_26w",
R."CLIENT_Proportion_of_BureauReportedCredits_AMT_CREDIT_SUMS_when_BureauReportedCredit_CREDIT_ACTIVE_is_Closed_52w",
R."CLIENT_Proportion_of_BureauReportedCredits_AMT_CREDIT_SUMS_when_BureauReportedCredit_CREDIT_TYPE_is_Consumer_credit_104w",
R."CLIENT_Proportion_of_BureauReportedCredits_AMT_CREDIT_SUMS_when_BureauReportedCredit_CREDIT_TYPE_is_Consumer_credit_26w",
R."CLIENT_Proportion_of_BureauReportedCredits_AMT_CREDIT_SUMS_when_BureauReportedCredit_CREDIT_TYPE_is_Consumer_credit_52w",
R."CLIENT_Proportion_of_BureauReportedCredits_AMT_CREDIT_SUMS_when_BureauReportedCredit_CREDIT_TYPE_is_Mortgage_104w",
R."CLIENT_Proportion_of_Count_of_BureauReportedCredits_when_BureauReportedCredit_CREDIT_ACTIVE_is_Active_52w",
R."CLIENT_Proportion_of_Count_of_BureauReportedCredits_when_BureauReportedCredit_CREDIT_ACTIVE_is_Closed_104w",
R."CLIENT_Latest_BureauReportedCredit_AMT_CREDIT_MAX_OVERDUE_104w",
R."CLIENT_Latest_BureauReportedCredit_AMT_CREDIT_SUM_104w",
R."CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_MAX_OVERDUES_104w",
R."CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUMS_104w",
R."CLIENT_Min_of_BureauReportedCredits_AMT_CREDIT_MAX_OVERDUES_52w",
R."CLIENT_Na_count_of_BureauReportedCredits_AMT_CREDIT_MAX_OVERDUES_52w",
R."CLIENT_Na_count_of_BureauReportedCredits_AMT_CREDIT_SUM_LIMITS_26w",
R."CLIENT_Sum_of_BureauReportedCredits_AMT_CREDIT_SUMS_104w",
R."CLIENT_Time_To_Latest_BureauReportedCredit_bureau_application_time_104w",
R."CLIENT_Time_To_Latest_BureauReportedCredit_credit_end_date_104w",
{{ CURRENT_TIMESTAMP }} AS "POINT_IN_TIME"
FROM (
WITH ENTITY_UNIVERSE AS (
SELECT
{{ CURRENT_TIMESTAMP }} AS "POINT_IN_TIME",
"SK_ID_CURR"
FROM (
SELE ....
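Note that the generated SQL is a template: the `{{ CURRENT_TIMESTAMP }}` placeholder (visible in the output above) must be substituted with your warehouse's point-in-time expression before the query can be scheduled. A minimal sketch — the `CURRENT_TIMESTAMP()` default shown is an assumption; use whatever expression your warehouse and scheduler expect:

```python
def render_deployment_sql(template: str, timestamp_expr: str = "CURRENT_TIMESTAMP()") -> str:
    """Substitute the point-in-time placeholder in generated deployment SQL."""
    return template.replace("{{ CURRENT_TIMESTAMP }}", timestamp_expr)

# e.g. sql = render_deployment_sql(sql_result["data"][0]["feature_table_sqls"][0]["sql_code"])
```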
# Disable the deployment
deployment.disable()
print("Deployment disabled")
Deployment disabled
SDK Reference: FeatureList.deploy() | Deployment
Summary¶
This tutorial walked through the full FeatureByte workflow:
| Step | Method | What we did |
|---|---|---|
| 1 | SDK | Created catalog, registered 7 tables, created 6 entities |
| 1b | API | Analyzed source tables and generated AI summaries |
| 2-5 | SDK | Registered tables, formulated use case, created observation tables |
| 6 | API | Ran table EDA and applied cleaning operations |
| 7 | API | Ran semantic detection and applied semantic tags |
| 8 | API | Created development dataset for faster ideation |
| 9 | API | Ran automated ideation pipeline, downloaded report PDF |
| 9b | API | Explored ideated features — SDK code, relevance scores, lineage |
| 10 | API | Refined features using key importance (90% threshold) |
| 10b | API | Ran feature EDA on refined features with plots |
| 11 | API | Trained standalone model on refined feature list |
| 12 | API | Evaluated on holdout set with leaderboard, plots, and Parquet download |
| 12b | API | Refit model on more recent data |
| 13 | API | Created and enabled deployment with batch SQL |
Next steps:
- Try different entity selections by running parallel ideation pipelines
- Run standalone feature selection with custom parameters
- Explore the Store Sales Forecast tutorial for a time series forecasting example