Credit Default: End-to-End SDK + API Tutorial¶
This tutorial replicates the Credit Default UI Tutorials using Python code. The SDK handles catalog setup, table registration, and entity management. The REST API handles ideation, training, evaluation, and deployment.
Prerequisites:
- FeatureByte instance with the playground feature store connected to DEMO_DATASETS.CREDIT_DEFAULT
- Python environment with the featurebyte SDK installed
- Profile tutorial configured (see SDK Setup)
What you'll build:
- Register 7 source tables and tag entities (SDK)
- Run table EDA and semantic detection (API)
- Formulate a use case and create observation tables (SDK)
- Run an automated ideation pipeline (API)
- Refine features and train a standalone model (API)
- Evaluate on a holdout set and deploy (API)
Setup¶
import time
import featurebyte as fb
fb.use_profile("tutorial")
DATABASE_NAME = "DEMO_DATASETS"
SCHEMA_NAME = "CREDIT_DEFAULT"
CATALOG_NAME = "Credit Default API Tutorial"
06:46:32 | WARNING | Service endpoint is inaccessible: http://127.0.0.1:5000/api/v1
06:46:32 | INFO | Using profile: tutorial
06:46:32 | INFO | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
06:46:32 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
06:46:32 | INFO | SDK version: 3.4.1.dev7
06:46:32 | INFO | No catalog activated.
def wait_for_task(client, task_id, poll_interval=30):
    """Poll a task until completion. Returns the full task response."""
    while True:
        task = client.get(f"/task/{task_id}").json()
        if task["status"] in ("SUCCESS", "FAILURE"):
            if task["status"] == "FAILURE":
                print(f"Task FAILED: {task.get('traceback', 'no traceback')}")
            return task
        print(f" status: {task['status']}...")
        time.sleep(poll_interval)
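To see the polling logic in action without a live server, the helper can be exercised against a stub client that reports PENDING once and then SUCCESS. The stub classes below are purely illustrative (not part of the FeatureByte SDK), and the helper is repeated so the snippet is self-contained:

```python
import time

class FakeResponse:
    """Illustrative stand-in for an HTTP response object."""
    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload

class FakeClient:
    """Illustrative stand-in for the API client: PENDING on the first poll, then SUCCESS."""
    def __init__(self):
        self._statuses = iter(["PENDING", "SUCCESS"])

    def get(self, path):
        return FakeResponse({"status": next(self._statuses)})

def wait_for_task(client, task_id, poll_interval=30):
    """Poll a task until completion. Returns the full task response."""
    while True:
        task = client.get(f"/task/{task_id}").json()
        if task["status"] in ("SUCCESS", "FAILURE"):
            if task["status"] == "FAILURE":
                print(f"Task FAILED: {task.get('traceback', 'no traceback')}")
            return task
        print(f" status: {task['status']}...")
        time.sleep(poll_interval)

task = wait_for_task(FakeClient(), "demo-task", poll_interval=0)
print(task["status"])  # SUCCESS
```

The same pattern is used throughout the rest of this tutorial: submit a POST, read the task `id` from the response, then poll until the task reaches a terminal status.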
Step 1: Create Catalog¶
Corresponds to UI Tutorial: Create Catalog
catalog = fb.Catalog.create(CATALOG_NAME, "playground")
catalog.activate(CATALOG_NAME)
ds = catalog.get_data_source()
client = fb.Configurations().get_client()
# Get the feature store ID for API calls
feature_store = fb.FeatureStore.get("playground")
feature_store_id = str(feature_store.id)
print(f"Catalog '{catalog.name}' created. Feature store: playground")
06:46:33 | INFO | Catalog activated: Credit Default API Tutorial
Catalog 'Credit Default API Tutorial' created. Feature store: playground
SDK Reference: Catalog | FeatureStore | DataSource
Step 1b: Analyze Source Tables (API)¶
Corresponds to UI Tutorial: Register Tables — the "magic wand" feature. API docs: Source Data Exploration
Before registering, use the API to analyze source tables and detect their types.
# Generate AI-powered summaries for all tables
table_names = [
"NEW_APPLICATION", "CLIENT_PROFILE", "BUREAU",
"PREVIOUS_APPLICATION", "LOAN_STATUS",
"INSTALLMENTS_PAYMENTS", "CREDIT_CARD_MONTHLY_BALANCE",
]
response = client.post(
"/table/source_table_summary",
json={
"feature_store_id": feature_store_id,
"database_name": DATABASE_NAME,
"schema_name": SCHEMA_NAME,
"table_names": table_names,
},
)
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
print(f"Table summaries generated: {task['status']}")
# List tables with summaries
response = client.get(
f"/feature_store/{feature_store_id}/table",
params={
"database_name": DATABASE_NAME,
"schema_name": SCHEMA_NAME,
},
)
for t in response.json():
    name = t["name"]
    summary = t.get("summary", "")
    print(f"{name}: {summary[:200] if summary else '(no summary)'}...")
 status: STARTED...
Table summaries generated: SUCCESS
BUREAU: The BUREAU table contains information about credits taken by clients from other financial institutions, as reported to the credit bureau. It includes details such as the client's ID, unique identifier...
CLIENT_PROFILE: The CLIENT_PROFILE table provides detailed information about each client's profile. It includes various attributes such as the client's unique identifier (ClientID), personal details like birthdate an...
CREDIT_CARD_MONTHLY_BALANCE: The CREDIT_CARD_MONTHLY_BALANCE table provides a comprehensive summary of monthly balances for credit cards. It includes detailed information about each credit card, such as the card ID and the associ...
INSTALLMENTS_PAYMENTS: The INSTALLMENTS_PAYMENTS table records the details of monthly installment payments for loans. It includes information about each installment, such as its unique ID, the associated loan application ID...
LOAN_STATUS: The LOAN_STATUS table is designed to track the status of loans, specifically focusing on whether a loan has been terminated or is still active. It includes key identifiers such as the loan ID and appl...
NEW_APPLICATION: The NEW_APPLICATION table contains detailed information about new loan applications submitted by clients. It includes various attributes related to the application itself, the client, and the client's...
OBSERVATIONS_WITH_TARGET: The OBSERVATIONS_WITH_TARGET table is designed for training purposes and contains data related to loan applications. It includes a timestamp indicating the specific point in time for each prediction, ...
OBSERVATION_EDA_TABLE: The OBSERVATION_EDA_TABLE is designed for training purposes and contains data related to loan applications. It includes a timestamp indicating the specific point in time for each prediction, an intege...
PREVIOUS_APPLICATION: The PREVIOUS_APPLICATION table contains detailed information about prior loan applications made by clients. It includes various attributes related to the application process, such as the application I...
# Analyze a source table to detect its type
response = client.post(
"/table/source_table_analysis",
json={
"feature_store_id": feature_store_id,
"database_name": DATABASE_NAME,
"schema_name": SCHEMA_NAME,
"table_name": "BUREAU",
},
)
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
# Get analysis results
analysis_id = task.get("payload", {}).get("output_document_id")
response = client.get(f"/table/source_table_analysis/{analysis_id}")
analysis = response.json()
print(f"Table: BUREAU\n")
print("-"*8)
print(f'Suggested type:\n{analysis["table_type"]}\n')
print("-"*8)
print(f'Type explanation:\n{analysis["type_explanation"]}\n')
print("-"*8)
print(f'Setting explanation:\n{analysis["setting_explanation"]}\n')
print("-"*8)
print(f'Warnings: {analysis["warnings"]}')
 status: PENDING...
 status: PENDING...
Table: BUREAU
--------
Suggested type:
event_table
--------
Type explanation:
The BUREAU table fits the definition of an event_table because it captures unique events related to credit activities reported to the credit bureau. Each row in the table represents a distinct credit event associated with a client, as indicated by the unique identifier `SK_ID_BUREAU`, which is a recoded ID for each credit bureau credit related to a loan application. The presence of multiple timestamps, such as `bureau_application_time`, `credit_end_date`, `credit_end_fact`, `credit_update`, and `available_at`, further supports the classification as an event_table, as these timestamps capture specific points in time when events related to the credit activities occurred. Additionally, the table includes status fields like `CREDIT_ACTIVE` and various financial metrics that describe the state of the credit at the time of the event, aligning with the characteristics of an event_table that records distinct events with associated details.
--------
Setting explanation:
**Event ID**: The Event ID is set to `SK_ID_BUREAU`, which serves as the unique identifier for each credit event recorded in the BUREAU table. This ID is crucial for distinguishing between different credit events associated with a client, as each row in the table represents a distinct credit activity reported to the credit bureau. By using `SK_ID_BUREAU` as the Event ID, we ensure that each event can be uniquely identified and tracked throughout the feature engineering and machine learning processes.
**Event Timestamp**: The Event Timestamp is designated as `credit_update`, which indicates the specific point in time when the last information about the Credit Bureau credit was updated. This timestamp is essential for understanding the temporal aspect of each credit event, allowing us to analyze the sequence and timing of credit activities. The `credit_update` timestamp is assumed to be recorded in UTC, which provides a standardized time reference for all events.
**Event Timestamp Schema**: The Event Timestamp Schema specifies that the `credit_update` timestamp follows a Timestamp data type with no additional format string required. It is recorded in Coordinated Universal Time (UTC), ensuring consistency and comparability across different events and datasets. This schema helps in accurately interpreting the timing of events and aligning them with other time-based data.
**Record Creation Timestamp**: The Record Creation Timestamp is set to `available_at`, which denotes when the record was added to the data warehouse. This timestamp is important for tracking the data ingestion process and understanding when the information became available for analysis. It provides a reference point for the data's availability and can be used to assess the timeliness and currency of the data in the context of feature engineering and machine learning.
--------
Warnings: The credit_update column is assumed to be recorded in UTC. Please confirm.
Step 2: Register Tables¶
Corresponds to UI Tutorial: Register Tables
We register 7 tables from DEMO_DATASETS.CREDIT_DEFAULT:
- NEW_APPLICATION (Dimension) — loan application data
- CLIENT_PROFILE (SCD) — client demographics over time
- BUREAU (Event) — bureau credit reports
- PREVIOUS_APPLICATION (Event) — prior loan applications
- LOAN_STATUS (SCD) — loan repayment status over time
- INSTALLMENTS_PAYMENTS (Event) — installment payment events
- CREDIT_CARD_MONTHLY_BALANCE (Time Series) — monthly credit card balances
def get_source(table_name):
    return ds.get_source_table(
        database_name=DATABASE_NAME,
        schema_name=SCHEMA_NAME,
        table_name=table_name,
    )
# Dimension table
new_application = get_source("NEW_APPLICATION").create_dimension_table(
name="NEW_APPLICATION",
dimension_id_column="SK_ID_CURR",
record_creation_timestamp_column="available_at",
)
print("Registered NEW_APPLICATION (Dimension)")
Registered NEW_APPLICATION (Dimension)
# SCD tables
client_profile = get_source("CLIENT_PROFILE").create_scd_table(
name="CLIENT_PROFILE",
natural_key_column="ClientID",
effective_timestamp_column="SCD_effective_timestamp",
end_timestamp_column="SCD_end_timestamp",
record_creation_timestamp_column="available_at",
)
print("Registered CLIENT_PROFILE (SCD)")
loan_status = get_source("LOAN_STATUS").create_scd_table(
name="LOAN_STATUS",
natural_key_column="LOAN_ID",
effective_timestamp_column="SCD_Effective_Timestamp",
end_timestamp_column="SCD_End_Timestamp",
record_creation_timestamp_column="available_at",
)
print("Registered LOAN_STATUS (SCD)")
Registered CLIENT_PROFILE (SCD) Registered LOAN_STATUS (SCD)
# Event tables
bureau = get_source("BUREAU").create_event_table(
name="BUREAU",
event_id_column="SK_ID_BUREAU",
event_timestamp_column="credit_update",
record_creation_timestamp_column="available_at",
)
bureau.initialize_default_feature_job_setting()
print("Registered BUREAU (Event)")
previous_application = get_source("PREVIOUS_APPLICATION").create_event_table(
name="PREVIOUS_APPLICATION",
event_id_column="APPLICATION_ID",
event_timestamp_column="decision_date",
record_creation_timestamp_column="available_at",
)
previous_application.initialize_default_feature_job_setting()
print("Registered PREVIOUS_APPLICATION (Event)")
installments = get_source("INSTALLMENTS_PAYMENTS").create_event_table(
name="INSTALLMENTS_PAYMENTS",
event_id_column="INSTALMENT_ID",
event_timestamp_column="actual_installment_date",
record_creation_timestamp_column="available_at",
)
installments.update_default_feature_job_setting(
feature_job_setting=fb.FeatureJobSetting(
blind_spot="1h",
period="24h",
offset="19h",
)
)
print("Registered INSTALLMENTS_PAYMENTS (Event)")
Done! |████████████████████████████████████████| 100% in 12.4s (0.08%/s)
The analysis period starts at 2026-02-28 10:03:14 and ends at 2026-03-28 10:03:14
The column used for the event timestamp is credit_update
The column used for the record creation timestamp for BUREAU is available_at
STATISTICS ON TIME BETWEEN BUREAU RECORDS CREATIONS
- Average time is 1482.132273220094 s
- Median time is 60.0 s
- Lowest time is 60.0 s
- Largest time is 83520.0 s
based on a total of 1199 unique record creation timestamps.
BUREAU UPDATE TIME starts 6.0 hours and ends 6.0 hours 59.0 minutes after the start of each 24 hours
This includes a buffer of 600 s to allow for late jobs.
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 25200 s
- blind spot for 99.9 % of events to land: 25800 s
- blind spot for 99.95 % of events to land: 25800 s
- blind spot for 99.99 % of events to land: 25800 s
- blind spot for 99.995 % of events to land: 25800 s
- blind spot for 100.0 % of events to land: 25800 s
In SUMMARY, the recommended FEATUREJOB DEFAULT setting is:
period: 86400
offset: 25740
blind_spot: 25800
The resulting FEATURE CUTOFF offset is 86340 s.
For a feature cutoff at 86340 s:
- time for 99.5 % of events to land: 25800 s
- time for 99.9 % of events to land: 25800 s
- time for 99.95 % of events to land: 25800 s
- time for 99.99 % of events to land: 25800 s
- time for 99.995 % of events to land: 25800 s
- time for 100.0 % of events to land: 25800 s
- Period = 86400 s / Offset = 25740 s / Blind spot = 25800 s
The backtest found that all records would have been processed on time.
- Based on the past records created from 2026-02-28 10:00:00 to 2026-03-28 10:00:00, the table is regularly updated 6.0 hours after the start of each 24 hours within a 59.0 minutes interval. No job failure or late job has been detected.
- The features computation jobs are recommended to be scheduled after the table updates completion and be set 7 hours 9 minutes after the start of each 24 hours.
- Based on the analysis of the records latency, the blind_spot parameter used to determine the window of the features aggregation is recommended to be set at 25800 s.
- period: 86400 s
- offset: 25740 s
- blind_spot: 25800 s
Registered BUREAU (Event)
Done! |████████████████████████████████████████| 100% in 9.3s (0.11%/s)
The analysis period starts at 2026-02-25 02:25:16 and ends at 2026-03-25 02:25:16
The column used for the event timestamp is decision_date
The column used for the record creation timestamp for PREVIOUS_APPLICATION is available_at
STATISTICS ON TIME BETWEEN PREVIOUS_APPLICATION RECORDS CREATIONS
- Average time is 9508.470588235294 s
- Median time is 240.0 s
- Lowest time is 60.0 s
- Largest time is 85920.0 s
based on a total of 234 unique record creation timestamps.
PREVIOUS_APPLICATION UPDATE TIME starts 6.0 hours and ends 6.0 hours 59.0 minutes after the start of each 24 hours
This includes a buffer of 600 s to allow for late jobs.
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 25200 s
- blind spot for 99.9 % of events to land: 25200 s
- blind spot for 99.95 % of events to land: 25200 s
- blind spot for 99.99 % of events to land: 25200 s
- blind spot for 99.995 % of events to land: 25200 s
- blind spot for 100.0 % of events to land: 25200 s
In SUMMARY, the recommended FEATUREJOB DEFAULT setting is:
period: 86400
offset: 25740
blind_spot: 25200
The resulting FEATURE CUTOFF offset is 540 s.
For a feature cutoff at 540 s:
- time for 99.5 % of events to land: 25200 s
- time for 99.9 % of events to land: 25200 s
- time for 99.95 % of events to land: 25200 s
- time for 99.99 % of events to land: 25200 s
- time for 99.995 % of events to land: 25200 s
- time for 100.0 % of events to land: 25200 s
- Period = 86400 s / Offset = 25740 s / Blind spot = 25200 s
The backtest found that all records would have been processed on time.
- Based on the past records created from 2026-02-25 02:00:00 to 2026-03-25 02:00:00, the table is regularly updated 6.0 hours after the start of each 24 hours within a 59.0 minutes interval. No job failure or late job has been detected.
- The features computation jobs are recommended to be scheduled after the table updates completion and be set 7 hours 9 minutes after the start of each 24 hours.
- Based on the analysis of the records latency, the blind_spot parameter used to determine the window of the features aggregation is recommended to be set at 25200 s.
- period: 86400 s
- offset: 25740 s
- blind_spot: 25200 s
Registered PREVIOUS_APPLICATION (Event)
Registered INSTALLMENTS_PAYMENTS (Event)
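The recommended settings in the analyses above are reported in seconds, while the narrative lines describe them as clock times. A small conversion helper (purely illustrative, not part of the SDK) shows the two agree, e.g. the 25740 s offset is exactly the "7 hours 9 minutes after the start of each 24 hours" job time mentioned in the logs:

```python
def seconds_to_hm(seconds: int) -> str:
    """Render a second count as 'Xh Ym' to compare against the analysis output."""
    hours, rem = divmod(seconds, 3600)
    return f"{hours}h {rem // 60}m"

# Recommended settings reported by the feature job analyses above.
print(seconds_to_hm(86400))  # 24h 0m -> daily period
print(seconds_to_hm(25740))  # 7h 9m  -> jobs scheduled 7 hours 9 minutes into each day
print(seconds_to_hm(25800))  # 7h 10m -> BUREAU blind spot
print(seconds_to_hm(25200))  # 7h 0m  -> PREVIOUS_APPLICATION blind spot
```

The blind spot slightly exceeds the table's 6:00–6:59 daily update window, leaving room for late-landing records.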
# Time series table
cc_balance = get_source("CREDIT_CARD_MONTHLY_BALANCE").create_time_series_table(
name="CREDIT_CARD_MONTHLY_BALANCE",
series_id_column="CARD_ID",
reference_datetime_column="balance_month",
reference_datetime_schema=fb.TimestampSchema(
format_string="%Y-%m",
timezone="America/Los_Angeles",
),
time_interval=fb.TimeInterval(value=1, unit="MONTH"),
record_creation_timestamp_column="available_at",
)
cc_balance.update_default_feature_job_setting(
feature_job_setting=fb.CronFeatureJobSetting(
crontab="0 13 1 * *",
timezone="America/Los_Angeles",
)
)
print("Registered CREDIT_CARD_MONTHLY_BALANCE (Time Series)")
Registered CREDIT_CARD_MONTHLY_BALANCE (Time Series)
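The crontab string passed above drives the monthly feature job. Read field by field (minute, hour, day-of-month, month, day-of-week), it fires at 13:00 on the 1st of every month, in America/Los_Angeles time. A minimal, illustrative breakdown:

```python
# Split the five standard cron fields of the schedule used above.
fields = dict(zip(
    ["minute", "hour", "day_of_month", "month", "day_of_week"],
    "0 13 1 * *".split(),
))
print(fields)
# minute 0, hour 13 -> 13:00; day_of_month 1 -> the 1st;
# month and day_of_week '*' -> every month, any weekday
```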
Step 3: Register Entities¶
Corresponds to UI Tutorial: Register Entities
# Create entities
entity_new_app = fb.Entity.create(name="New Application", serving_names=["SK_ID_CURR"])
entity_client = fb.Entity.create(name="Client", serving_names=["ClientID"])
entity_bureau = fb.Entity.create(name="BureauReportedCredit", serving_names=["SK_ID_BUREAU"])
entity_prior_app = fb.Entity.create(name="PriorApplication", serving_names=["APPLICATION_ID"])
entity_loan = fb.Entity.create(name="Loan", serving_names=["LOAN_ID"])
entity_installment = fb.Entity.create(name="Installment", serving_names=["INSTALMENT_ID"])
print(f"Created {len(catalog.list_entities())} entities")
Created 6 entities
# Tag entity columns on each table
new_application["SK_ID_CURR"].as_entity("New Application")
new_application["ClientID"].as_entity("Client")
client_profile["ClientID"].as_entity("Client")
bureau["SK_ID_BUREAU"].as_entity("BureauReportedCredit")
bureau["ClientID"].as_entity("Client")
previous_application["APPLICATION_ID"].as_entity("PriorApplication")
previous_application["ClientID"].as_entity("Client")
loan_status["LOAN_ID"].as_entity("Loan")
loan_status["APPLICATION_ID"].as_entity("PriorApplication")
installments["INSTALMENT_ID"].as_entity("Installment")
installments["APPLICATION_ID"].as_entity("PriorApplication")
cc_balance["CARD_ID"].as_entity("BureauReportedCredit")
cc_balance["ClientID"].as_entity("Client")
print("Entity tagging complete")
Entity tagging complete
SDK Reference: Entity | TableColumn.as_entity()
Step 4: Formulate Use Case¶
Corresponds to UI Tutorial: Formulate Use Cases
Create a context (what we're predicting for), a target (what we're predicting), and a use case that ties them together.
context = fb.Context.create(
name="New Loan Application",
primary_entity=["New Application"],
description="Loan application under review.",
)
print(f"Context: {context.name}")
Context: New Loan Application
# Create target from the NEW_APPLICATION table
# The target column indicates whether the client defaulted within 6 months
target_name = "Loan_Default"
target = fb.TargetNamespace.create(
name=target_name,
window="182d",
dtype="INT",
primary_entity=["New Application"],
target_type=fb.TargetType.CLASSIFICATION,
positive_label=1,
)
print(f"Target: {target.name}")
Target: Loan_Default
use_case = fb.UseCase.create(
name="Loan Default by client",
target_name="Loan_Default",
context_name="New Loan Application",
    description="Predict a client's payment difficulties over the next 6 months for a new loan, based on its application data and prior credit history.",
)
use_case_id = str(use_case.id)
print(f"Use case: {use_case.name} (id: {use_case_id})")
Use case: Loan Default by client (id: 69d97e5806b1c7ee52d4fbab)
Step 5: Create Observation Tables¶
Corresponds to UI Tutorial: Create Observation Tables
We create observation tables from the pre-built OBSERVATIONS_WITH_TARGET source table.
Each table serves a different purpose: EDA (50K sample), training, validation, and holdout.
obs_source = get_source("OBSERVATIONS_WITH_TARGET")
# Full training table
training_table = obs_source.create_observation_table(
name="Applications up to Sept 2024",
columns_rename_mapping={"SK_ID_CURR": "SK_ID_CURR", "Loan_Default": "Loan_Default"},
context_name="New Loan Application",
target_column="Loan_Default",
sample_from_timestamp="2019-04-01",
sample_to_timestamp="2024-10-01",
)
training_table.update_purpose(fb.Purpose.TRAINING)
use_case.add_observation_table("Applications up to Sept 2024")
training_table_id = str(training_table.id)
print(f"Training table: {training_table.name} (id: {training_table_id})")
06:48:57 | WARNING | Primary entities will be a mandatory parameter in SDK version 0.7.
Done! |████████████████████████████████████████| 100% in 18.6s (0.05%/s)
Training table: Applications up to Sept 2024 (id: 69d97e5906b1c7ee52d4fbac)
# Validation table (Q4 2024)
validation_table = obs_source.create_observation_table(
name="Applications Q4 2024",
columns_rename_mapping={"SK_ID_CURR": "SK_ID_CURR", "Loan_Default": "Loan_Default"},
context_name="New Loan Application",
target_column="Loan_Default",
sample_from_timestamp="2024-10-01",
sample_to_timestamp="2025-01-01",
)
validation_table.update_purpose(fb.Purpose.VALIDATION_TEST)
use_case.add_observation_table("Applications Q4 2024")
validation_table_id = str(validation_table.id)
print(f"Validation table: {validation_table.name} (id: {validation_table_id})")
06:49:17 | WARNING | Primary entities will be a mandatory parameter in SDK version 0.7.
Done! |████████████████████████████████████████| 100% in 18.6s (0.05%/s)
Validation table: Applications Q4 2024 (id: 69d97e6d06b1c7ee52d4fbad)
# EDA table (50K sample)
eda_table = obs_source.create_observation_table(
name="50K applications",
columns_rename_mapping={"SK_ID_CURR": "SK_ID_CURR", "Loan_Default": "Loan_Default"},
context_name="New Loan Application",
target_column="Loan_Default",
sample_rows=50000,
)
eda_table.update_purpose(fb.Purpose.EDA)
use_case.add_observation_table("50K applications")
use_case.update_default_eda_table("50K applications")
eda_table_id = str(eda_table.id)
print(f"EDA table: {eda_table.name} (id: {eda_table_id})")
06:49:37 | WARNING | Primary entities will be a mandatory parameter in SDK version 0.7.
Done! |████████████████████████████████████████| 100% in 21.7s (0.05%/s)
EDA table: 50K applications (id: 69d97e8106b1c7ee52d4fbae)
SDK Reference: ObservationTable | SourceTable.create_observation_table()
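The sample windows passed above can be sanity-checked with plain dates: the training window ends exactly where the validation window begins, so the two sets are contiguous and (assuming the end bound is exclusive, which is an assumption here) disjoint:

```python
from datetime import date

# Sample windows used for the observation tables above.
train_window = (date(2019, 4, 1), date(2024, 10, 1))   # "Applications up to Sept 2024"
valid_window = (date(2024, 10, 1), date(2025, 1, 1))   # "Applications Q4 2024"

# Contiguous: training ends exactly where validation begins.
assert train_window[1] == valid_window[0]
# Properly ordered: both windows are non-empty and validation follows training.
assert train_window[0] < train_window[1] < valid_window[1]
print("training and validation windows are contiguous and disjoint")
```

Keeping the validation window strictly later in time than the training window avoids temporal leakage when the model is later evaluated.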
Step 6: Table EDA¶
Corresponds to UI Tutorial: Set Default Cleaning Operations. API docs: Table EDA
Run EDA on each table to discover data quality issues and review column distributions.
# Run EDA on the CLIENT_PROFILE table as an example
client_profile_id = str(client_profile.id)
response = client.post(
"/table_eda",
json={
"table_id": client_profile_id,
"analysis_size": 10000,
"seed": 42,
},
)
task_id = response.json()["id"]
print(f"Table EDA started (task: {task_id})")
task = wait_for_task(client, task_id)
print(f"Table EDA: {task['status']}")
Table EDA started (task: 0e836836-49b6-4a02-99e4-f47a9ba49fb8)
 status: PENDING...
Table EDA: SUCCESS
# List EDA runs and view column analysis
response = client.get(f"/table_eda?table_id={client_profile_id}")
eda_runs = response.json()["data"]
table_analysis_id = eda_runs[0]["_id"]
response = client.patch(f"/column_analysis?table_analysis_id={table_analysis_id}")
columns = response.json()["data"]
print(f"Analyzed {len(columns)} columns")
for col in columns:
    print("-" * 8)
    print(f"{col['column_name']}: {col['summary']}")
Analyzed 10 columns
--------
ClientID: The 'ClientID' column is a numeric unique identifier with no detected issues or cleaning operations applied. The dataset contains 10,000 entries, all of which are non-missing, with 9,863 unique values. The mean ClientID is 278,192.36, with a standard deviation of 101,910.21. The values range from a minimum of 100,002 to a maximum of 456,214. The distribution is fairly uniform across the range, as indicated by the histogram, with no zeros or outliers present. The quartiles are well-distributed, with the 25th percentile at 189,863.75, the median at 278,951, and the 75th percentile at 365,607.75. There are no excluded rows in the EDA plot, and the data does not require any cleaning.
--------
BIRTHDATE: The BIRTHDATE column, which is of VARCHAR type, represents client birthdates and is suspected to be in a string-based datetime format of YYYY-MM-DD. No cleaning operations were applied to this column. The raw data analysis shows that it is categorical with 10,000 entries, all of which are non-missing, and 7,431 unique values. The most frequent birthdate is 1964-06-13, appearing with a frequency of 0.0005. The top 20 birthdates each have a frequency of 5, except for the last few which have a frequency of 4, while the remaining dates not in the top 20 collectively account for 9,909 entries. There is no additional analysis on cleaned data as no cleaning was performed.
--------
GENDER: The GENDER column in the dataset is a categorical variable with no detected issues or cleaning operations applied. The column contains data for 10,000 clients, with no missing values. There are two unique categories: 'F' and 'M'. The most frequent category is 'F', which appears 6,588 times, accounting for approximately 65.88% of the data, while 'M' appears 3,412 times. The distribution of gender is visualized in a bar plot, showing a higher frequency of 'F' compared to 'M'. No further cleaning or analysis was conducted on this column.
--------
FLAG_OWN_CAR: The column "FLAG_OWN_CAR" is a categorical variable indicating whether a client owns a car, with no detected issues or cleaning operations applied. The raw data consists of 10,000 entries with no missing values. There are two unique categories: 'N' and 'Y', with 'N' being the most frequent category, appearing 65.8% of the time (6,580 occurrences), while 'Y' appears 34.2% of the time (3,420 occurrences). The data distribution is visualized in a plot showing the frequency of each category, confirming the dominance of the 'N' category. No further analysis on cleaned data is available as no cleaning was necessary.
--------
FLAG_OWN_REALTY: The column "FLAG_OWN_REALTY" is a categorical variable indicating whether a client owns a house or flat, with no detected issues or cleaning operations applied. The raw data consists of 10,000 entries with no missing values. There are two unique categories: 'Y' and 'N'. The majority of the entries, 69.68%, are labeled as 'Y', indicating that most clients own a house or flat. The remaining 30.32% are labeled as 'N'. The frequency plot confirms this distribution, showing a higher count for 'Y' compared to 'N'. No further analysis on cleaned data is available as no cleaning was necessary.
--------
CNT_CHILDREN: The column "CNT_CHILDREN" represents the number of children a client has and is of integer type with no detected issues or cleaning operations applied. The exploratory data analysis on the raw data reveals that the column is numeric with a total of 10,000 non-missing entries and no missing values. The data shows a high percentage of zeros (70.56%), indicating that most clients do not have children. The mean number of children is 0.4081, with a standard deviation of 0.7091. The values range from 0 to 6, with the majority of clients having between 0 and 1 child, as indicated by the 75th percentile being 1. The distribution is right-skewed, with a few clients having up to 6 children. There are no outliers or zeros excluded from the EDA plot, which shows the frequency of each number of children, with 0 children being the most common. No cleaning operations were necessary, so the cleaned data analysis is not applicable.
--------
INCOME_TYPE: The INCOME_TYPE column is a categorical variable with no detected issues, thus no cleaning operations were applied. The column contains data on clients' income types, such as businessman, working, and maternity leave, among others. The dataset consists of 10,000 entries with no missing values. There are 5 unique categories, with "Working" being the most frequent category, accounting for 51.47% of the data. Other notable categories include "Commercial associate" and "Pensioner," with frequencies of 2,297 and 1,848, respectively. The least frequent category is "Student," with only 2 occurrences. The distribution of income types is visualized in a bar plot, highlighting the dominance of the "Working" category.
--------
EDUCATION_TYPE: The EDUCATION_TYPE column in the dataset represents the highest level of education achieved by clients and is of categorical type with no detected issues or cleaning operations applied. The raw data consists of 10,000 entries with no missing values and five unique categories. The most frequent category is "Secondary / secondary special," which accounts for 71% of the data, followed by "Higher education" with a frequency of 24.39%. Other categories include "Incomplete higher," "Lower secondary," and "Academic degree," with frequencies of 3.37%, 1.19%, and 0.5%, respectively. The data distribution indicates a significant majority of clients have completed secondary education, with a smaller proportion achieving higher education levels. No further analysis on cleaned data is available as no cleaning operations were necessary.
--------
FAMILY_STATUS: The FAMILY_STATUS column is a categorical variable representing the family status of clients, with no detected issues or cleaning operations applied. The dataset contains 10,000 entries with no missing values and five unique categories. The most frequent category is "Married," comprising 63.79% of the data, followed by "Single / not married" at 15.01%, "Civil marriage" at 9.79%, "Separated" at 6.34%, and "Widow" at 5.07%. The distribution of family status is visualized in a bar plot, highlighting the predominance of the "Married" category. No further analysis on cleaned data is available as no cleaning was necessary.
--------
HOUSING_TYPE: The "HOUSING_TYPE" column is a categorical variable with no detected issues or cleaning operations applied. It contains data on the housing situation of clients, such as renting or living with parents. The dataset consists of 10,000 entries with no missing values. There are six unique categories, with "House / apartment" being the most common, accounting for 89.13% of the entries. Other categories include "Co-op apartment," "Municipal apartment," "Office apartment," "Rented apartment," and "With parents," with significantly lower frequencies. The distribution indicates that the majority of clients reside in houses or apartments, while a smaller proportion live in other types of housing arrangements.
# View columns with issues only
response = client.patch(f"/column_analysis?table_analysis_id={table_analysis_id}&issues=true")
columns_with_issues = response.json()["data"]
for col in columns_with_issues:
    print(f"{col['column_name']} issues: {col['issues']}")
    print(f"summary: {col['summary']}")
BIRTHDATE issues: Suspected string-based datetime format found: YYYY-MM-DD. summary: The BIRTHDATE column, which is of VARCHAR type, represents client birthdates and is suspected to be in a string-based datetime format of YYYY-MM-DD. No cleaning operations were applied to this column. The raw data analysis shows that it is categorical with 10,000 entries, all of which are non-missing, and 7,431 unique values. The most frequent birthdate is 1964-06-13, appearing with a frequency of 0.0005. The top 20 birthdates each have a frequency of 5, except for the last few which have a frequency of 4, while the remaining dates not in the top 20 collectively account for 9,909 entries. There is no additional analysis on cleaned data as no cleaning was performed.
# Apply a cleaning operation based on EDA insight
client_profile["BIRTHDATE"].update_critical_data_info(
    cleaning_operations=[
        fb.AddTimestampSchema(
            timestamp_schema=fb.TimestampSchema(
                is_utc_time=False,
                format_string="YYYY-MM-DD",
                timezone="America/Los_Angeles",
            ),
        ),
    ]
)
response = client.patch(f"/column_analysis?table_analysis_id={table_analysis_id}")
columns = response.json()["data"]
for col in columns:
    outdated = col['outdated']
    if outdated:
        print(f"{col['column_name']} is outdated")
BIRTHDATE is outdated
# Rerun EDA
response = client.post(
    "/table_eda",
    json={
        "table_id": client_profile_id,
        "analysis_size": 10000,
        "seed": 42,
    },
)
task_id = response.json()["id"]
print(f"Table EDA started (task: {task_id})")
task = wait_for_task(client, task_id)
print(f"Table EDA: {task['status']}")
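For reference, the `wait_for_task` helper used throughout this tutorial (defined in the setup section) can be sketched as below. The sketch assumes a `GET /task/{task_id}` endpoint returning a JSON document with a `status` field; the real helper may differ in details:

```python
import time

def wait_for_task(client, task_id, poll_interval=5, timeout=7200):
    """Poll the task endpoint until it reaches a terminal state (sketch)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        task = client.get(f"/task/{task_id}").json()
        if task["status"] in ("SUCCESS", "FAILURE"):
            return task
        print(f"status: {task['status']}...")
        time.sleep(poll_interval)
    raise TimeoutError(f"Task {task_id} did not finish within {timeout}s")
```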
# List EDA runs and view column analysis
response = client.get(f"/table_eda?table_id={client_profile_id}")
eda_runs = response.json()["data"]
table_analysis_id = eda_runs[0]["_id"]
response = client.patch(f"/column_analysis?table_analysis_id={table_analysis_id}&outdated=false")
columns = response.json()["data"]
for col in columns:
    if col['column_name'] == "BIRTHDATE":
        print(f"{col['column_name']}: {col['summary']}")
        column_analysis_id = col["_id"]
        break
Table EDA started (task: 60d4a52c-b18b-49f7-89dc-f7e9e3169532) status: STARTED... Table EDA: SUCCESS BIRTHDATE: The BIRTHDATE column, initially in a string-based datetime format (YYYY-MM-DD), was cleaned and converted to a timestamp format with the time zone adjusted to UTC from America/Los_Angeles. In the raw data, the column was treated as categorical with 10,000 non-missing entries and 7,431 unique values, with the most frequent date being 1964-06-13, appearing 0.05% of the time. The cleaned data, now in timestamp format, shows a mean birthdate of October 18, 1978, with a standard deviation of approximately 12.3 years. The birthdates range from September 30, 1950, to December 7, 2005. The distribution is fairly uniform with no outliers, as indicated by the histogram, which shows a gradual increase in frequency from the 1950s to the 1980s, peaking around the late 1980s, and then tapering off towards the 2000s. The data is complete with no missing values in both raw and cleaned datasets.
cleaned_plots = client.patch(
    f"/column_eda/{column_analysis_id}",
    params={"cleaned": True},
).json()
from IPython.display import HTML, display
display(HTML(cleaned_plots[0]["plots"][0]["content"]))
# Apply cleaning to DAYS_EMPLOYED in NEW_APPLICATION too
new_application["DAYS_EMPLOYED"].update_critical_data_info(
    cleaning_operations=[
        fb.DisguisedValueImputation(disguised_values=[365243], imputed_value=None),
    ]
)
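For intuition, `DisguisedValueImputation` treats each listed sentinel value as missing. A pure-Python illustration of the assumed semantics (the real cleaning is applied by FeatureByte in the warehouse, not in Python; the sample values are illustrative):

```python
# DisguisedValueImputation(disguised_values=[365243], imputed_value=None)
# replaces the sentinel 365243 in DAYS_EMPLOYED with a missing value.
DISGUISED_VALUES = {365243}

days_employed = [-2329, 365243, -1188, 365243, -4542]
cleaned = [None if v in DISGUISED_VALUES else v for v in days_employed]

print(cleaned)  # -> [-2329, None, -1188, None, -4542]
```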
SDK Reference: TableColumn.update_critical_data_info() | AddTimestampSchema | DisguisedValueImputation
Step 7: Semantic Detection¶
Corresponds to UI Tutorial: Update Descriptions and Tag Semantics API docs: Semantic Detection
Run AI-powered semantic detection to identify column types (currency, ratio, identifier, etc.). The ideation pipeline will use these semantics to generate better features.
# Run semantic detection on the BUREAU table
bureau_table = catalog.get_table("BUREAU")
bureau_table_id = str(bureau_table.id)
response = client.post(
    "/semantic_detection/column_semantic_detection",
    json={
        "table_id": bureau_table_id,
        "sample_enabled": True,
    },
)
task_id = response.json()["id"]
print(f"Semantic detection started (task: {task_id})")
task = wait_for_task(client, task_id)
print(f"Semantic detection: {task['status']}")
Semantic detection started (task: d6931081-85ce-456f-bebb-6b66ffc66602) status: STARTED... Semantic detection: SUCCESS
# Review detection results
semantic_detection_id = task.get("payload", {}).get("output_document_id")
response = client.get(f"/semantic_detection/{semantic_detection_id}")
detection = response.json()
# Show a few suggested semantics
for item in detection.get("suggested_semantics", [])[:10]:
    existing = item.get("existing_semantic_tag", "-")
    proposed = item.get("proposed_semantic_tag", "-")
    print(f" {item['table_name']}.{item['column_name']}: {existing} -> {proposed}")
BUREAU.available_at: record_creation_timestamp -> record_creation_timestamp BUREAU.credit_update: event_timestamp -> event_timestamp BUREAU.SK_ID_BUREAU: event_id -> event_id BUREAU.ClientID: None -> unique_identifier BUREAU.bureau_application_time: None -> timestamp_field BUREAU.credit_end_date: None -> timestamp_field BUREAU.credit_end_fact: None -> timestamp_field BUREAU.CNT_CREDIT_PROLONG: None -> count BUREAU.AMT_CREDIT_SUM: None -> non_negative_amount BUREAU.AMT_CREDIT_SUM_DEBT: None -> unbounded_amount
# Apply suggested semantics to table columns
# This sets the ground truth so ideation uses them directly
for item in detection.get("suggested_semantics", []):
    table_name = item.get("table_name")
    column_name = item.get("column_name")
    final_tag = item.get("final_semantic_tag")
    if final_tag:
        table_obj = catalog.get_table(table_name)
        table_type = table_obj.type.lower()
        table_id = str(table_obj.id)
        client.patch(
            f"/{table_type}/{table_id}/column_semantic",
            json={
                "column_semantic_updates": [
                    {"column_name": column_name, "semantic": final_tag}
                ],
            },
        )
print("Semantic tags applied to table columns")
Semantic tags applied to table columns
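The loop above issues one PATCH per column. As a hypothetical optimization, the suggestions could first be grouped by table so each table receives a single request carrying all of its `column_semantic_updates`. The sample items below are illustrative (including the made-up `SOME_COLUMN`), not actual detection output:

```python
from collections import defaultdict

suggestions = [
    {"table_name": "BUREAU", "column_name": "ClientID", "final_semantic_tag": "unique_identifier"},
    {"table_name": "BUREAU", "column_name": "CNT_CREDIT_PROLONG", "final_semantic_tag": "count"},
    {"table_name": "BUREAU", "column_name": "SOME_COLUMN", "final_semantic_tag": None},
]

# Collect per-table updates, skipping columns without a final tag
updates_by_table = defaultdict(list)
for item in suggestions:
    if item["final_semantic_tag"]:
        updates_by_table[item["table_name"]].append(
            {"column_name": item["column_name"], "semantic": item["final_semantic_tag"]}
        )

# One payload per table, matching the column_semantic_updates schema above
payloads = {
    table: {"column_semantic_updates": updates}
    for table, updates in updates_by_table.items()
}
print(len(payloads["BUREAU"]["column_semantic_updates"]))  # -> 2
```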
Step 8: Create Development Dataset¶
Corresponds to UI Tutorial: Create Development Dataset API docs: Development Dataset
A development dataset is a sampled copy of source tables used to speed up ideation.
# Get default plan configuration from the EDA table
response = client.post(
    "/development_plan/defaults",
    json={"observation_table_id": eda_table_id},
)
plan_defaults = response.json()
print(f"Default feature lookback: {plan_defaults.get('feature_lookback_in_months')} months")
print(f"Tables to sample: {len(plan_defaults.get('table_ids_subset', []))}")
Default feature lookback: 15 months Tables to sample: 7
# Customize the plan to match the UI tutorial settings
# - Increase feature lookback to 25 months for broader historical data
# - Increase max sample ratio to 0.15 to ensure materialization
plan_defaults["feature_lookback_in_months"] = 25
plan_defaults["max_sample_to_full_ratio"] = 0.15
# Create the development plan
response = client.post("/development_plan", json=plan_defaults)
development_plan = response.json()
development_plan_id = development_plan["_id"]
development_dataset_id = development_plan.get("development_dataset_id")
print(f"Development plan created: {development_plan_id}")
print(f"Development dataset: {development_dataset_id}")
print(f"Feature lookback: {plan_defaults['feature_lookback_in_months']} months")
print(f"Max sample ratio: {plan_defaults['max_sample_to_full_ratio']}")
Development plan created: 69d97effc4e0e91875fd94c0 Development dataset: 69d97effc4e0e91875fd94c1 Feature lookback: 25 months Max sample ratio: 0.15
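One plausible reading of `max_sample_to_full_ratio`, sketched here purely for intuition (illustrative, not the actual planner logic): sampling is only applied when the requested sample stays within the cap relative to the table's full row count; otherwise the table is kept whole, which is what the SQL plan's "tables to skip sampling" count refers to.

```python
def plan_sample(full_rows: int, requested_rows: int, max_ratio: float) -> int:
    """Rows to materialize under the ratio cap (assumed semantics)."""
    if requested_rows / full_rows > max_ratio:
        return full_rows  # sampling skipped; table copied as-is
    return requested_rows

print(plan_sample(1_000_000, 10_000, 0.15))  # -> 10000 (sampled)
print(plan_sample(50_000, 10_000, 0.15))     # -> 50000 (cap exceeded, kept whole)
```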
# Create sampling tables (compute distinct entity IDs per table)
response = client.patch(f"/development_plan/{development_plan_id}/sampling_tables")
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
print(f"Sampling tables: {task['status']}")
# Review the SQL sampling plan
response = client.get(f"/development_plan/{development_plan_id}/sql_plan")
sql_plan = response.json()
print(f"Tables to skip sampling: {len(sql_plan.get('tables_skip_sampling', []))}")
print(f"Max sample ratio: {sql_plan.get('max_sample_to_full_ratio')}")
print(f"SQL: {sql_plan['sql_str'][:1000]}...........")
status: STARTED... Sampling tables: SUCCESS Tables to skip sampling: 0 Max sample ratio: 0.15 SQL: ------------------------------------------------------- -- 1) CREATE TABLES WITH DISTINCT ENTITY IDS ------------------------------------------------------- -- Build a Context list of distinct New_Application IDs -- Distinct keys (SK_ID_CURR) from 50K applications CREATE TABLE "TUTORIAL"."TUTORIAL_PROD"."DEV_69d97effc4e0e91875fd94c0_New_Application_CONTEXT_IDS" AS SELECT DISTINCT base."SK_ID_CURR" FROM "TUTORIAL"."TUTORIAL_PROD"."OBSERVATION_TABLE_69d97e8598a542e55a3c2fb2" AS base -- Build a Context list of distinct Client IDs -- Join NEW_APPLICATION with DEV_69d97effc4e0e91875fd94c0_New_Application_CONTEXT_IDS CREATE TABLE "TUTORIAL"."TUTORIAL_PROD"."DEV_69d97effc4e0e91875fd94c0_Client_CONTEXT_IDS" AS SELECT DISTINCT base."ClientID" FROM "DEMO_DATASETS"."CREDIT_DEFAULT"."NEW_APPLICATION" AS base INNER JOIN "TUTORIAL"."TUTORIAL_PROD"."DEV_69d97effc4e0e91875fd94c0_New_Application_CONTEXT_IDS" AS join_0 ON base."SK_ID_CURR" = join_0."SK_ID_CURR" -- Build a Sampling list of disti...........
# Materialize development tables in the warehouse
response = client.patch(
    f"/development_plan/{development_plan_id}/sampled_tables",
    json={},
)
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
print(f"Development tables materialized: {task['status']}")
# Verify dataset is active
response = client.get(f"/development_dataset/{development_dataset_id}")
dataset = response.json()
print(f"Dataset status: {dataset['status']}")
print(f"Tables: {len(dataset.get('development_tables', []))}")
status: STARTED... Development tables materialized: SUCCESS Dataset status: Active Tables: 7
Step 9: Run Ideation Pipeline¶
Corresponds to UI Tutorial: Ideate Features and Models API docs: Ideation
The ideation pipeline automates the full feature engineering workflow: table selection, semantic detection, transforms, feature generation, EDA, feature selection, and model training.
# Create a pipeline with the development dataset for faster ideation
response = client.post(
    "/pipeline",
    json={
        "action": "create",
        "use_case_id": use_case_id,
        "pipeline_type": "FEATURE_IDEATION",
        "development_dataset_id": development_dataset_id,
    },
)
pipeline_id = response.json()["_id"]
print(f"Pipeline created: {pipeline_id}")
Pipeline created: 69d97f3d49c7d923c28449fa
# Configure training and validation tables
response = client.patch(
    f"/pipeline/{pipeline_id}/step_configs",
    json={
        "step_type": "model-train-setup-v2",
        "data_source": {
            "type": "train_valid_observation_tables",
            "training_table_id": training_table_id,
            "validation_table_id": validation_table_id,
        },
    },
)
print("Configured training/validation tables")
Configured training/validation tables
# Run the pipeline to completion
response = client.patch(
    f"/pipeline/{pipeline_id}",
    json={"action": "advance", "step_type": "end"},
)
pipeline_task = response.json()["pipeline_runner_task"]
if pipeline_task:
    task_id = pipeline_task["task_id"]
    print(f"Pipeline running (task: {task_id})")
    print("This will take a while...")
    task = wait_for_task(client, task_id, poll_interval=180)
    print(f"Pipeline: {task['status']}")
Pipeline running (task: 12c6a7c6-e211-4f49-9d2f-e1d29bf9efe4) This will take a while... status: STARTED... (repeated while the pipeline runs) Pipeline: SUCCESS
# Monitor pipeline status
response = client.get(f"/pipeline/{pipeline_id}")
data = response.json()
print(f"Current step: {data['current_step_type']}")
for group in data["groups"]:
    for step in group["steps"]:
        marker = "+" if step["step_status"] == "completed" else " "
        print(f" [{marker}] {step['step_type']}: {step['step_status']}")
Current step: end [+] start: completed [+] table-selection: completed [+] semantic-detection: completed [+] transform: completed [+] filter: completed [+] ideation-metadata: completed [+] feature-ideation: completed [+] eda: completed [+] feature-selection: completed [+] model-train-setup-v2: completed [+] model-train: completed
for group in data["groups"]:
    for step in group["steps"]:
        if step["step_type"] == "model-train-setup-v2":
            primary_metric = step.get("primary_metric")
            print("primary_metric:", primary_metric)
primary_metric: roc_auc
# Get the best model from the pipeline's validation leaderboard
response = client.get(
    "/catalog/ml_model",
    params={
        "use_case_id": use_case_id,
        "pipeline_id": pipeline_id,
        "pipeline_id_to_mark": pipeline_id,
        "page": 1,
        "page_size": 100,
        "sort_by": primary_metric,
        "sort_by_metric": True,
    },
)
models = response.json()["data"]
for m in models:
    tag = " (best)" if m == models[0] else ""
    print(f" {m['name']}{tag}: {m['metrics']['valid'][primary_metric]}")
ideation_model_id = models[0]["_id"]
print(f"\nBest model: {models[0]['name']} (id: {ideation_model_id})")
LightGBM [358 features: Loan Default Risk Assessment Features] (best): 0.7976001150827741 XGBoost [358 features: Loan Default Risk Assessment Features]: 0.7953798526034435 Best model: LightGBM [358 features: Loan Default Risk Assessment Features] (id: 69d98b9b885d3d27f3fca1d0)
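The sorting above happens server-side; picking the best model client-side is simply a `max` over the validation metric. A minimal sketch, with values mirroring the leaderboard output above:

```python
# Each row carries its validation metrics, as in the /catalog/ml_model response
leaderboard_rows = [
    {"name": "LightGBM [358 features]", "metrics": {"valid": {"roc_auc": 0.7976}}},
    {"name": "XGBoost [358 features]", "metrics": {"valid": {"roc_auc": 0.7954}}},
]
best = max(leaderboard_rows, key=lambda m: m["metrics"]["valid"]["roc_auc"])
print(best["name"])  # -> LightGBM [358 features]
```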
# View the pipeline report
response = client.get(f"/pipeline/{pipeline_id}/report")
report = response.json()
print(f"Summary: {report.get('summary', 'N/A')[:500]}")
Summary: The use case focuses on predicting loan defaults within the next 6 months using three aggregation time windows: **26 weeks**, **52 weeks**, and **104 weeks**. Key tables involved are **INSTALLMENTS_PAYMENTS**, **PREVIOUS_APPLICATION**, **BUREAU**, **NEW_APPLICATION**, **CLIENT_PROFILE**, and **LOAN_STATUS**, which provide insights into clients' payment behaviors and credit history. Data transformations and feature creation resulted in **2385 ideated features** across **27 themes**, with most fea
# Download report as PDF
response = client.get(f"/pipeline/{pipeline_id}/report_pdf")
with open("ideation_report.pdf", "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)
print("Report saved to ideation_report.pdf")
Report saved to ideation_report.pdf
Step 9b: Explore Ideated Features¶
API docs: Ideated Features
Browse the features generated by ideation. Each feature includes its SDK code, relevance score, and a construction lineage — useful for understanding, reproducing, or modifying features.
# Get the feature ideation ID from the pipeline
response = client.get(f"/pipeline/{pipeline_id}/feature_ideation")
feature_ideation_id = response.json().get("feature_ideation_id")
# List top features by Predictive Score
response = client.get(
    f"/catalog/feature_ideation/{feature_ideation_id}/suggested_features",
    params={"page_size": 10, "sort_by": "predictive_power_score", "sort_dir": "desc"},
)
suggested_features = response.json()["data"]
for feat in suggested_features[:10]:
    print(feat["feature_name"])
    print(f" Predictive Score: {feat.get('predictive_power_score', 'N/A')}")
    print(f" Type: {feat.get('signal_type', 'N/A')}, Table: {feat.get('primary_table', [])}")
    print()
NEW_APPLICATION_EXT_SOURCE_3 Predictive Score: 0.30629863467855545 Type: attribute, Table: ['NEW_APPLICATION'] NEW_APPLICATION_EXT_SOURCE_2 Predictive Score: 0.28749787401070903 Type: attribute, Table: ['NEW_APPLICATION'] CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w Predictive Score: 0.22078103627818924 Type: stats, Table: ['BUREAU'] CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_52w Predictive Score: 0.21296984363327365 Type: stats, Table: ['BUREAU'] CLIENT_Min_of_BureauReportedCredits_Available_Credits_26w Predictive Score: 0.20329257137420864 Type: stats, Table: ['BUREAU'] CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_26w Predictive Score: 0.2020876017095521 Type: stats, Table: ['BUREAU'] CLIENT_Min_of_BureauReportedCredits_Available_Credits_52w Predictive Score: 0.20138730853691222 Type: stats, Table: ['BUREAU'] CLIENT_Min_of_BureauReportedCredits_Available_Credits_104w Predictive Score: 0.1988541944357085 Type: stats, Table: ['BUREAU'] CLIENT_Avg_of_BureauReportedCredits_Available_Credits_52w Predictive Score: 0.18368288675295674 Type: stats, Table: ['BUREAU'] CLIENT_Avg_of_BureauReportedCredits_Available_Credits_26w Predictive Score: 0.18346759290787085 Type: stats, Table: ['BUREAU']
# View the SDK code for a feature
response = client.get(
    f"/catalog/feature_ideation/{feature_ideation_id}/suggested_features",
    params={"page_size": 100, "sort_by": "predictive_power_score", "sort_dir": "desc"},
)
suggested_features = response.json()["data"]
for feature in suggested_features:
    if feature["feature_name"] == "CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w":
        break
print(f"Feature: {feature['feature_name']}")
print(f"Description: {feature.get('feature_description', '')}")
print(f"\nSDK Code:\n{feature['code']}")
Feature: CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w
Description: Max of BureauReportedCredits AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUMS for the Client over a 104w period.
SDK Code:
"""
SDK code to create CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w
Feature description:
Max of BureauReportedCredits AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUMS for the Client over a 104w
period.
"""
import featurebyte as fb
#==================================================================================================
# Activate catalog
#==================================================================================================
catalog = fb.Catalog.activate("Credit Default API Tutorial")
#==================================================================================================
# Get view from table
#==================================================================================================
# Get view from BUREAU event table.
bureau_view = catalog.get_view("BUREAU")
#==================================================================================================
# Create ratio column
#==================================================================================================
bureau_view["AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUM"] = (
    bureau_view["AMT_CREDIT_SUM_DEBT"] / bureau_view["AMT_CREDIT_SUM"]
)
#==================================================================================================
# Do window aggregation from BUREAU
#==================================================================================================
# Group BUREAU view by Client entity (ClientID).
bureau_view_by_client = bureau_view.groupby(['ClientID'])
#--------------------------------------------------------------------------------------------------
# Get Max of AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUM for the Client over time.
client_max_of_bureaureportedcredits_amt_credit_sum_debt_to_amt_credit_sums_104w = bureau_view_by_client.aggregate_over(
    "AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUM",
    method="max",
    feature_names=["CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w"],
    windows=["104w"],
)["CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w"]
#==================================================================================================
# Save feature
#==================================================================================================
# Save feature
client_max_of_bureaureportedcredits_amt_credit_sum_debt_to_amt_credit_sums_104w.save()
#==================================================================================================
# Update feature type
#==================================================================================================
# Update feature type
client_max_of_bureaureportedcredits_amt_credit_sum_debt_to_amt_credit_sums_104w.update_feature_type(
    "numeric"
)
#==================================================================================================
# Add description
#==================================================================================================
# Add description
client_max_of_bureaureportedcredits_amt_credit_sum_debt_to_amt_credit_sums_104w.update_description(
    "Max of BureauReportedCredits AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUMS "
    "for the Client over a 104w period."
)
# Get full lineage for a feature — shows step-by-step construction
response = client.get(
    f"/catalog/feature_ideation/suggested_feature_metadata/{feature['_id']}",
)
metadata = response.json()
# Show each construction step
lineage = metadata.get("lineage", {})
for node in lineage.get("nodes", []):
    print(f"Step: {node['title']}")
    print(f" {node['description']}")
    print(f"Code:\n{node['code']}")
    print()
Step: Create View from Table
Create view from BUREAU event table.
Code:
# Get view from BUREAU event table.
bureau_view = catalog.get_view("BUREAU")
Step: Transform based on a Transform Object
Divide AMT_CREDIT_SUM_DEBT by AMT_CREDIT_SUM within bureau_view.
Code:
bureau_view["AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUM"] = (
    bureau_view["AMT_CREDIT_SUM_DEBT"] / bureau_view["AMT_CREDIT_SUM"]
)
Step: GroupBy
Group bureau_view by ClientID.
Code:
# Group BUREAU view by Client entity (ClientID).
bureau_view_by_client = bureau_view.groupby(["ClientID"])
Step: Aggregate BureauReportedCredits by Client
Maximum 'AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUM' by 'ClientID' Over a 104w Period using 'bureau_view_by_client'.
Code:
# Get Max of AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUM for the Client over time.
client_max_of_bureaureportedcredits_amt_credit_sum_debt_to_amt_credit_sums_104w = bureau_view_by_client.aggregate_over(
    "AMT_CREDIT_SUM_DEBT To AMT_CREDIT_SUM",
    method="max",
    feature_names=[
        "CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w"
    ],
    windows=["104w"],
)[
    "CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUM_DEBT_To_AMT_CREDIT_SUMS_104w"
]
Step 10: Feature Refinement¶
Corresponds to UI Tutorial: Refine Ideation and Create Feature List API docs: Feature Refinement
Extract the top features by importance from the ideation model and create a refined feature list.
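A sketch of how `top_n` and `importance_threshold_percentage` plausibly combine (assumed semantics; the helper name and importance values are illustrative): rank feature keys by importance and keep them until either cumulative importance reaches the threshold or the `top_n` cap is hit.

```python
def select_by_importance(importances, top_n=200, threshold=0.90):
    """Keep ranked feature keys up to a cumulative-importance threshold."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(importances.values())
    kept, cum = [], 0.0
    for name, imp in ranked:
        if len(kept) >= top_n or cum >= threshold * total:
            break
        kept.append(name)
        cum += imp
    return kept

imps = {"f1": 0.5, "f2": 0.3, "f3": 0.15, "f4": 0.05}
print(select_by_importance(imps, top_n=3))  # -> ['f1', 'f2', 'f3']
```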
# Create a refined feature list from model key importance
response = client.post(
    "/feature_list_from_model",
    json={
        "mode": "Feature key importance based",
        "ml_model_id": ideation_model_id,
        "top_n": 200,
        "importance_threshold_percentage": 0.90,
    },
)
task_id = response.json()["id"]
print(f"Feature refinement started (task: {task_id})")
task = wait_for_task(client, task_id)
print(f"Feature refinement: {task['status']}")
Feature refinement started (task: 6366404d-5149-4469-83a2-9ca5d0c4e5cc) status: STARTED... status: STARTED... status: STARTED... status: STARTED... status: STARTED... Feature refinement: SUCCESS
# Inspect the refined feature list
feature_list_from_model_id = task.get("payload", {}).get("output_document_id")
response = client.get(f"/feature_list_from_model/{feature_list_from_model_id}")
result = response.json()
feature_list_id = result["feature_list_id"]
print(f"Feature keys selected: {result['feature_keys_created_count']}")
print(f"Total features: {result['features_selected_count']}")
# Get feature list details
response = client.get(f"/feature_list/{feature_list_id}")
feature_list = response.json()
print(f"Feature list: {feature_list['name']} ({len(feature_list['feature_ids'])} features)")
Feature keys selected: 37 Total features: 200 Feature list: 200 Features from LightGBM [358 features: Loan Default Risk Assessment Features] (Top 200 Feature Keys) (200 features)
Step 10b: Feature EDA¶
API docs: Feature EDA
Run EDA on the refined features to analyze their distributions and relationship with the target.
# Run EDA on the refined feature list
response = client.post(
    "/eda",
    json={
        "feature_list_id": feature_list_id,
        "use_case_id": use_case_id,
    },
)
task_id = response.json()["id"]
print(f"Feature EDA started (task: {task_id})")
task = wait_for_task(client, task_id)
print(f"Feature EDA: {task['status']}")
Feature EDA started (task: 7a8412ee-707e-4595-adfb-c790e6925679) status: STARTED... Feature EDA: SUCCESS
# View EDA plots for the first feature
from IPython.display import HTML, display
response = client.get(f"/feature_list/{feature_list_id}")
first_feature_id = response.json()["feature_ids"][0]
response = client.get(
    f"/eda/{first_feature_id}/plots",
    params={"use_case_id": use_case_id},
)
plots = response.json()
for plot in plots:
    for p in plot.get("plots", []):
        if "content" in p:
            display(HTML(p["content"]))
Step 11: Train Standalone Model¶
Corresponds to UI Tutorial: Create New Feature Lists and Models API docs: Model Training
Train a model on the refined feature list using the recommended settings.
# Get suggested model settings
response = client.get(
    f"/use_case/{use_case_id}/ml_model_template_setting",
    params={
        "training_table_id": training_table_id,
        "validation_table_id": validation_table_id,
        "feature_list_id": feature_list_id,
        "machine_learning_role": "OUTCOME",
    },
)
settings = response.json()
print(f"Objective: {settings['objective']}")
print(f"Metric: {settings['metric']}")
print(f"Calibration: {settings.get('calibration_method')}")
Objective: binary Metric: area_under_curve Calibration: None
# Get available model templates
response = client.get(
    f"/use_case/{use_case_id}/ml_model_template",
    params={
        "feature_list_id": feature_list_id,
        "training_table_id": training_table_id,
        "objective": settings["objective"],
        "metric": settings["metric"],
        "machine_learning_role": "OUTCOME",
    },
)
templates = response.json()["data"]
for t in templates:
    print(f" Template: {t['type']} (id: {t['_id']})")
# Use the first template
template = templates[0]
Template: NCTsDE_XGB (id: 6838241967efe8d7542bb238) Template: NCTsDE_LGB (id: 6838241967efe8d7542bb23b) Template: NCTDE_XGB (id: 6838241967efe8d7542bb232) Template: NCTDE_LGB (id: 6838241967efe8d7542bb235)
# Extract default parameters from template
node_name_to_parameters = {}
for preprocessor in template.get("preprocessors", []):
    params = {
        p["name"]: p["default_value"]
        for p in preprocessor.get("parameters_metadata", [])
        if p.get("default_value") is not None
    }
    if params:
        node_name_to_parameters[preprocessor["node_name"]] = params
model_info = template.get("model", {})
if model_info:
    params = {
        p["name"]: p["default_value"]
        for p in model_info.get("parameters_metadata", [])
        if p.get("default_value") is not None
    }
    if params:
        node_name_to_parameters[model_info["node_name"]] = params
print(f"Nodes configured: {list(node_name_to_parameters.keys())}")
Nodes configured: ['transformer_2', 'estimator_1']
# Train the model
payload = {
    "use_case_id": use_case_id,
    "model_name": "Credit Default - Refined Features",
    "data_source": {
        "type": "train_valid_observation_tables",
        "training_table_id": training_table_id,
        "validation_table_id": validation_table_id,
    },
    "feature_list_id": feature_list_id,
    "model_template_type": template["type"],
    "objective": settings["objective"],
    "metric": settings["metric"],
    "node_name_to_parameters": node_name_to_parameters,
    "role": "OUTCOME",
}
if settings.get("calibration_method"):
    payload["calibration_method"] = settings["calibration_method"]
response = client.post("/ml_model", json=payload)
task_id = response.json()["id"]
print(f"Model training started (task: {task_id})")
task = wait_for_task(client, task_id)
ml_model_id = task.get("payload", {}).get("output_document_id")
print(f"Model trained: {ml_model_id}")
Model training started (task: 1c792cdf-d1bf-4714-8a5d-1c2b51ce964f) status: STARTED... (repeated while training runs) Model trained: 69d99d9dc4e0e91875fd9ef0
# View model details
response = client.get(f"/catalog/ml_model/{ml_model_id}")
model = response.json()
print(f"Model: {model['name']}")
print(f"Template: {model['model_template_type']}")
print(f"Features: {len(model.get('feature_importance', []))}")
# Show top 5 features by importance
for fi in sorted(model.get("feature_importance", []), key=lambda x: x["importance"], reverse=True)[:5]:
    print(f" {fi['feature']}: {fi['importance_percent'] * 100:.1f}%")
Model: Credit Default - Refined Features Template: NCTsDE_XGB Features: 200 NEW_APPLICATION_EXT_SOURCE_2: 8.1% NEW_APPLICATION_EXT_SOURCE_3: 7.0% NEW_APPLICATION_EXT_SOURCE_1: 3.5% CLIENT_GENDER: 3.2% NEW_APPLICATION_DAYS_EMPLOYED: 2.9%
Step 12: Evaluate on Holdout¶
Corresponds to UI Tutorial: Refit Model API docs: Evaluation | Batch Predictions
Create a holdout observation table, generate predictions, and evaluate. The leaderboard is created automatically when predictions are generated on an observation table with a target.
# Create holdout observation table (Q1 2025)
holdout_table = obs_source.create_observation_table(
    name="Applications Q1 2025",
    columns_rename_mapping={"SK_ID_CURR": "SK_ID_CURR", "Loan_Default": "Loan_Default"},
    context_name="New Loan Application",
    target_column="Loan_Default",
    sample_from_timestamp="2025-01-01",
    sample_to_timestamp="2025-04-01",
)
holdout_table.update_purpose(fb.Purpose.VALIDATION_TEST)
use_case.add_observation_table("Applications Q1 2025")
holdout_table_id = str(holdout_table.id)
print(f"Holdout table: {holdout_table.name} (id: {holdout_table_id})")
09:27:57 | WARNING | Primary entities will be a mandatory parameter in SDK version 0.7. WARNING :featurebyte.api.source_table:Primary entities will be a mandatory parameter in SDK version 0.7.
Done! |████████████████████████████████████████| 100% in 15.5s (0.06%/s) Holdout table: Applications Q1 2025 (id: 69d9a39d06b1c7ee52d4fbaf)
# Generate predictions on holdout
response = client.post(
    f"/ml_model/{ml_model_id}/prediction_table",
    json={
        "request_input": {
            "request_type": "observation_table",
            "table_id": holdout_table_id,
        },
        "include_input_features": False,
    },
)
)
task_id = response.json()["id"]
print(f"Prediction started (task: {task_id})")
task = wait_for_task(client, task_id)
prediction_table_id = task.get("payload", {}).get("output_document_id")
print(f"Prediction table: {prediction_table_id}")
Prediction started (task: 08cb2d92-88a0-4956-8d88-04d66462560b) status: STARTED... status: STARTED... status: STARTED... status: STARTED... status: STARTED... status: STARTED... status: STARTED... Prediction table: 69d9a3b149c7d923c2844a05
# Download predictions as Parquet
import io
import pyarrow.parquet as pq
response = client.get(
f"/prediction_table/parquet/{prediction_table_id}",
stream=True,
)
buffer = io.BytesIO()
for chunk in response.iter_content(chunk_size=8192):
if chunk:
buffer.write(chunk)
buffer.seek(0)
predictions_df = pq.read_table(buffer).to_pandas()
print(f"Downloaded {len(predictions_df)} predictions")
predictions_df.head()
Downloaded 12527 predictions
| | __FB_TABLE_ROW_INDEX | POINT_IN_TIME | SK_ID_CURR | Loan_Default | prediction | prediction_prob |
|---|---|---|---|---|---|---|
| 0 | 1 | 2025-01-10 17:14:33 | 100011 | 0 | 0 | 0.062988 |
| 1 | 2 | 2025-02-12 11:40:00 | 100024 | 0 | 0 | 0.089458 |
| 2 | 3 | 2025-03-05 20:38:59 | 100060 | 0 | 0 | 0.029231 |
| 3 | 4 | 2025-02-07 18:08:40 | 100082 | 0 | 0 | 0.042675 |
| 4 | 5 | 2025-02-08 10:00:18 | 100084 | 0 | 0 | 0.145059 |
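With the predictions downloaded, the leaderboard's ROC AUC can be cross-checked locally without further API calls. A minimal pure-Python sketch of the metric (the pairwise Mann-Whitney formulation), demonstrated on a toy sample shaped like `predictions_df` — on the real frame you would pass `predictions_df["Loan_Default"]` and `predictions_df["prediction_prob"]`:

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outscores a
    random negative (ties count half) -- the Mann-Whitney formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy stand-in for the Loan_Default / prediction_prob columns:
labels = [0, 0, 1, 0, 1]
scores = [0.06, 0.32, 0.45, 0.03, 0.30]
print(round(roc_auc(labels, scores), 3))  # → 0.833
```

For the full 12,527-row holdout this brute-force pairwise loop is still fast enough, but `sklearn.metrics.roc_auc_score` gives the same number in one call if scikit-learn is available.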
# The holdout leaderboard is created automatically when predictions
# are generated on an observation table with a target.
# Find it by observation table ID.
response = client.get(
"/catalog/leaderboard",
params={
"observation_table_id": holdout_table_id,
"observation_table_purpose": "holdout",
"role": "OUTCOME",
},
)
leaderboard = response.json()["data"][0]
leaderboard_id = leaderboard["_id"]
primary_metric = leaderboard["primary_metric"]
sort_dir = leaderboard.get("sort_order", "desc")
print(f"Leaderboard: {leaderboard['name']} (metric: {primary_metric}, {sort_dir})")
Leaderboard: Applications Q1 2025_holdout_leaderboard (metric: roc_auc, desc)
# View leaderboard results - list models sorted by metric
response = client.get(
"/catalog/ml_model",
params={
"leaderboard_id": leaderboard_id,
"sort_by": primary_metric,
"sort_dir": sort_dir,
"sort_by_metric": True,
"show_refits": True,
"leaderboard_role": "OUTCOME",
"page_size": 100,
},
)
models = response.json()["data"]
for i, m in enumerate(models):
    tag = " (best)" if i == 0 else ""
    print(f" {m['name']}{tag}: {m['leaderboard_evaluation_scores'][primary_metric]}")
ideation_model_id = models[0]["_id"]
print(f"\nBest model: {models[0]['name']} (id: {ideation_model_id})")
Credit Default - Refined Features (best): 0.7955597455713544 Best model: Credit Default - Refined Features (id: 69d99d9dc4e0e91875fd9ef0)
# Generate evaluation plots
response = client.request("OPTIONS", f"/ml_model/{ml_model_id}/evaluate")
options = response.json()
print(f"Available plots: {options.get('options', [])}")
Available plots: ['roc_curve', 'precision_recall_curve', 'ks_and_gain_curve', 'lift_curve', 'gain_report', 'predicted_vs_actual_per_bin', 'distribution', 'confusion_matrix']
from IPython.display import HTML, display
# Kolmogorov-Smirnov (KS) / gain curve (binary classification)
response = client.post(
f"/ml_model/{ml_model_id}/evaluate",
json={
"option": "ks_and_gain_curve",
"plot_params": {"height": 500, "width": 800, "font_size": 14},
"holdout_table": {"table_type": "observation_table", "table_id": holdout_table_id},
},
)
display(HTML(response.json()["content"]))
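Outside a notebook, `IPython.display` is unavailable, but the evaluation endpoint returns plain HTML that can be written to disk and opened in a browser. A small sketch (the file name is arbitrary):

```python
from pathlib import Path

def save_plot(html: str, path: str = "ks_and_gain_curve.html") -> Path:
    """Write an evaluation plot's HTML payload to disk for viewing in a browser."""
    out = Path(path)
    out.write_text(html, encoding="utf-8")
    return out

# e.g. save_plot(response.json()["content"])
```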
Step 12b: Refit Model¶
Corresponds to UI Tutorial: Refit Model API docs: Model Training — Refit
Refit the best model on more recent data while keeping the same feature list and hyperparameters.
# Create a more recent training table
refit_training_table = obs_source.create_observation_table(
name="Applications up to Dec 2024",
columns_rename_mapping={"SK_ID_CURR": "SK_ID_CURR", "Loan_Default": "Loan_Default"},
context_name="New Loan Application",
target_column="Loan_Default",
sample_from_timestamp="2019-04-01",
sample_to_timestamp="2025-01-01",
)
refit_training_table.update_purpose(fb.Purpose.TRAINING)
use_case.add_observation_table("Applications up to Dec 2024")
refit_training_table_id = str(refit_training_table.id)
print(f"Refit training table: {refit_training_table.name}")
09:31:49 | WARNING | Primary entities will be a mandatory parameter in SDK version 0.7.
Done! |████████████████████████████████████████| 100% in 18.6s (0.05%/s) Refit training table: Applications up to Dec 2024
# Refit the model on more recent data
response = client.post(
f"/ml_model/{ml_model_id}/refit",
json={
"data_source": {
"type": "train_valid_observation_tables",
"training_table_id": refit_training_table_id,
"validation_table_id": None,
},
"model_name": "Credit Default - Refit Dec 2024",
},
)
task_id = response.json()["id"]
print(f"Refit started (task: {task_id})")
task = wait_for_task(client, task_id)
refit_model_id = task.get("payload", {}).get("output_document_id")
print(f"Refit model: {refit_model_id}")
Refit started (task: f762121a-e9fd-4c0c-97ee-0d1f764f0379) status: STARTED... Refit model: 69d9a49949c7d923c2844a08
Step 13: Deploy¶
Corresponds to UI Tutorial: Deploy and Serve API docs: Deployment
# Get the feature list object via SDK
feature_list_obj = fb.FeatureList.get(feature_list["name"])
# Deploy with make_production_ready=True
# This upgrades all features to PRODUCTION_READY and creates the deployment
deployment = feature_list_obj.deploy(
deployment_name="Credit Default - Refined Model",
make_production_ready=True,
use_case_name="Loan Default by client",
)
print(f"Deployment: {deployment.name}")
Loading Feature(s) |████████████████████████████████████████| 200/200 [100%] Done! |████████████████████████████████████████| 100% in 40.2s (0.02%/s) Done! |████████████████████████████████████████| 100% in 6.2s (0.16%/s) Deployment: Credit Default - Refined Model
# Enable the deployment
deployment.enable()
print("Deployment enabled")
Done! |████████████████████████████████████████| 100% in 10:20.9 (0.00%/s) Deployment enabled
# Verify deployment is active
catalog.list_deployments()
| | id | name | feature_list_name | feature_list_version | num_feature | enabled |
|---|---|---|---|---|---|---|
| 0 | 69d9aa2f06b1c7ee52d4fbb1 | Credit Default - Refined Model | 200 Features from LightGBM [358 features: Loan... | V260411 | 200 | True |
deployment_id = str(deployment.id)
# Generate deployment SQL for batch serving
response = client.post("/deployment_sql", json={"deployment_id": deployment_id})
task_id = response.json()["id"]
task = wait_for_task(client, task_id)
response = client.get("/deployment_sql", params={"deployment_id": deployment_id})
sql_result = response.json()
print("Deployment SQL generated. Schedule this in your warehouse for batch feature computation.")
status: STARTED... Deployment SQL generated. Schedule this in your warehouse for batch feature computation.
print(sql_result["data"][0]["feature_table_sqls"][0]["sql_code"][:2000], "....")
SELECT
L."SK_ID_CURR",
R."CLIENT_Proportion_of_BureauReportedCredits_AMT_CREDIT_SUMS_when_BureauReportedCredit_CREDIT_ACTIVE_is_Closed_104w",
R."CLIENT_Proportion_of_BureauReportedCredits_AMT_CREDIT_SUMS_when_BureauReportedCredit_CREDIT_ACTIVE_is_Closed_26w",
R."CLIENT_Proportion_of_BureauReportedCredits_AMT_CREDIT_SUMS_when_BureauReportedCredit_CREDIT_ACTIVE_is_Closed_52w",
R."CLIENT_Proportion_of_BureauReportedCredits_AMT_CREDIT_SUMS_when_BureauReportedCredit_CREDIT_TYPE_is_Consumer_credit_104w",
R."CLIENT_Proportion_of_BureauReportedCredits_AMT_CREDIT_SUMS_when_BureauReportedCredit_CREDIT_TYPE_is_Consumer_credit_26w",
R."CLIENT_Proportion_of_BureauReportedCredits_AMT_CREDIT_SUMS_when_BureauReportedCredit_CREDIT_TYPE_is_Consumer_credit_52w",
R."CLIENT_Proportion_of_BureauReportedCredits_AMT_CREDIT_SUMS_when_BureauReportedCredit_CREDIT_TYPE_is_Mortgage_104w",
R."CLIENT_Proportion_of_Count_of_BureauReportedCredits_when_BureauReportedCredit_CREDIT_ACTIVE_is_Active_52w",
R."CLIENT_Proportion_of_Count_of_BureauReportedCredits_when_BureauReportedCredit_CREDIT_ACTIVE_is_Closed_104w",
R."CLIENT_Latest_BureauReportedCredit_AMT_CREDIT_MAX_OVERDUE_104w",
R."CLIENT_Latest_BureauReportedCredit_AMT_CREDIT_SUM_104w",
R."CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_MAX_OVERDUES_104w",
R."CLIENT_Max_of_BureauReportedCredits_AMT_CREDIT_SUMS_104w",
R."CLIENT_Min_of_BureauReportedCredits_AMT_CREDIT_MAX_OVERDUES_52w",
R."CLIENT_Na_count_of_BureauReportedCredits_AMT_CREDIT_MAX_OVERDUES_52w",
R."CLIENT_Na_count_of_BureauReportedCredits_AMT_CREDIT_SUM_LIMITS_26w",
R."CLIENT_Sum_of_BureauReportedCredits_AMT_CREDIT_SUMS_104w",
R."CLIENT_Time_To_Latest_BureauReportedCredit_bureau_application_time_104w",
R."CLIENT_Time_To_Latest_BureauReportedCredit_credit_end_date_104w",
{{ CURRENT_TIMESTAMP }} AS "POINT_IN_TIME"
FROM (
WITH ENTITY_UNIVERSE AS (
SELECT
{{ CURRENT_TIMESTAMP }} AS "POINT_IN_TIME",
"SK_ID_CURR"
FROM (
SELE ....
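Note that the generated SQL is a template: the `{{ CURRENT_TIMESTAMP }}` placeholder (visible in the output above) must be substituted with your warehouse's point-in-time expression before the query can be scheduled. A minimal sketch — the `CURRENT_TIMESTAMP()` default shown is an assumption; use whatever expression your warehouse and scheduler expect:

```python
def render_deployment_sql(template: str, timestamp_expr: str = "CURRENT_TIMESTAMP()") -> str:
    """Substitute the point-in-time placeholder in generated deployment SQL."""
    return template.replace("{{ CURRENT_TIMESTAMP }}", timestamp_expr)

# e.g. sql = render_deployment_sql(sql_result["data"][0]["feature_table_sqls"][0]["sql_code"])
```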
# Disable the deployment
deployment.disable()
print("Deployment disabled")
Deployment disabled
SDK Reference: FeatureList.deploy() | Deployment
Summary¶
This tutorial walked through the full FeatureByte workflow:
| Step | Method | What we did |
|---|---|---|
| 1 | SDK | Created catalog, registered 7 tables, created 6 entities |
| 1b | API | Analyzed source tables and generated AI summaries |
| 2-5 | SDK | Registered tables, formulated use case, created observation tables |
| 6 | API | Ran table EDA and applied cleaning operations |
| 7 | API | Ran semantic detection and applied semantic tags |
| 8 | API | Created development dataset for faster ideation |
| 9 | API | Ran automated ideation pipeline, downloaded report PDF |
| 9b | API | Explored ideated features — SDK code, relevance scores, lineage |
| 10 | API | Refined features using key importance (90% threshold) |
| 10b | API | Ran feature EDA on refined features with plots |
| 11 | API | Trained standalone model on refined feature list |
| 12 | API | Evaluated on holdout set with leaderboard, plots, and Parquet download |
| 12b | API | Refit model on more recent data |
| 13 | API | Created and enabled deployment with batch SQL |
Next steps:
- Try different entity selections by running parallel ideation pipelines
- Run standalone feature selection with custom parameters
- Explore the Store Sales Forecast tutorial for a time series forecasting example