Entity Selection

How Entity Selection Works

For each table in the catalog, the system identifies three tiers of entities:

| Tier | Description |
| --- | --- |
| Eligible | All entities that could be used for feature generation, based on the data model relationships. This is the broadest set. |
| Suggested | A curated subset of eligible entities recommended by the system: the use case primary entity, the use case event/item entities, and their direct parents. Used as the default selection. |
| Final | The entities actually used for ideation. Defaults to `suggested` unless explicitly overridden. Must be a subset of `eligible`. |

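The relationship is: suggested is a subset of eligible, and final is a subset of eligible. Final equals suggested by default, but an override may include eligible entities that were not suggested.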
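The tier rules from the table can be sketched with plain Python sets. This is an illustration only, not the system's implementation; `resolve_final` and the entity names are hypothetical.

```python
def resolve_final(eligible, suggested, override=None):
    """Return the final entity set: the override if given, else suggested.

    Hypothetical helper that mirrors the documented rules:
    final defaults to suggested and must be a subset of eligible.
    """
    eligible, suggested = set(eligible), set(suggested)
    if not suggested <= eligible:
        raise ValueError("suggested must be a subset of eligible")
    final = suggested if override is None else set(override)
    extra = final - eligible
    if extra:
        raise ValueError(f"not eligible: {sorted(extra)}")
    return final


# Illustrative entity names only.
eligible = {"Store", "State", "Item", "Category"}
suggested = {"Store", "State"}

default_final = resolve_final(eligible, suggested)
custom_final = resolve_final(eligible, suggested, override={"Store", "Category"})
```

Here `default_final` is the suggested set, while `custom_final` is an explicit override that stays within the eligible set.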

Default Selection (Suggested)

The system suggests entities that are most likely to produce useful features:

- Use case primary entity: always included (e.g., Store for a store-level forecast)
- Use case event/item entities: entities directly involved in the events being analyzed
- Direct parents of event/item entities: one level up in the entity hierarchy (e.g., State as parent of Store)

This default avoids feature explosion by excluding entities that are far from the prediction target in the relationship graph.
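As a rough sketch of the default described above, the suggested set can be thought of as the primary entity, the event/item entities, and one level of parents. The function `suggest_entities` and the `parents` mapping are hypothetical and only illustrate the rule; the real system derives this from the data model.

```python
def suggest_entities(primary, event_entities, parents):
    """Primary entity, event/item entities, then their direct parents only."""
    suggested = [primary]
    for entity in event_entities:
        if entity not in suggested:
            suggested.append(entity)
    for entity in event_entities:
        # Only one level up: grandparents are deliberately not followed.
        for parent in parents.get(entity, []):
            if parent not in suggested:
                suggested.append(parent)
    return suggested


# Hypothetical hierarchy: Store -> State -> Country, Item -> Category.
parents = {"Store": ["State"], "State": ["Country"], "Item": ["Category"]}
suggested = suggest_entities("Store", ["Store", "Item"], parents)
print(suggested)
```

In this example `Country` is two levels above `Store`, so it stays eligible but is not suggested.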

Feature Explosion

Each selected entity generates a dedicated set of features per table: window aggregates across multiple time windows, cross-entity features, stability features, similarity features, and representation features. For example, a single table with 3 entities selected might generate 200+ features, while adding a 4th entity could add another 70+.

Entity pairs such as (item, state) generate cross-entity interaction features on top of the single-entity features. Because n entities allow up to n(n-1)/2 distinct pairs, the number of candidate pairs, and with it the feature count, can grow much faster than the number of entities.
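To make the growth concrete, the following sketch counts the single-entity groups and all possible entity pairs for a given number of entities. It is an upper bound for illustration: the system only generates features for the groups actually listed in the selection, and `candidate_groups` is a hypothetical helper.

```python
from itertools import combinations


def candidate_groups(entities):
    """All single-entity groups plus every unordered entity pair."""
    singles = [[e] for e in entities]
    pairs = [list(p) for p in combinations(entities, 2)]
    return singles + pairs


names = ["item", "store", "state", "category"]
for n in (3, 4):
    groups = candidate_groups(names[:n])
    print(f"{n} entities -> {len(groups)} candidate groups")
```

With 3 entities there are 6 candidate groups (3 singles and 3 pairs); a 4th entity raises this to 10 (4 singles and 6 pairs).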

View Entity Selection

After creating a pipeline, inspect the default entity selection:

response = client.patch(
    f"/pipeline/{pipeline_id}/step_configs",
    json={"step_type": "ideation-metadata"},
)
config = response.json()

# Show the entity selection tiers
pending_configs = config.get("pending_step_configurations", [])
for cfg in pending_configs:
    if cfg.get("step_type") == "ideation-metadata":
        entity_sel = cfg.get("entity_selection", {})
        print(f"Eligible: {len(entity_sel.get('eligible', []))} table selections")
        print(f"Suggested: {len(entity_sel.get('suggested', []))} table selections")
        print(f"Final: {len(entity_sel.get('final', []))} table selections")

Each table selection contains:

| Field | Type | Description |
| --- | --- | --- |
| `table_id` | string | ID of the registered table |
| `entities` | array of arrays | Entity ID groups. `["id1"]` = single entity. `["id1", "id2"]` = cross-entity pair. |
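A small helper can make this structure easier to read when inspecting a tier. The sketch below assumes only the two fields listed above; `summarize_selection` and the placeholder IDs are hypothetical.

```python
def summarize_selection(table_selections):
    """Count single-entity groups and cross-entity groups per table."""
    summary = {}
    for sel in table_selections:
        groups = sel.get("entities", [])
        summary[sel["table_id"]] = {
            "single": sum(1 for g in groups if len(g) == 1),
            "cross": sum(1 for g in groups if len(g) > 1),
        }
    return summary


# Placeholder IDs, not real table or entity IDs.
example = [
    {"table_id": "table-a", "entities": [["id1"], ["id2"], ["id1", "id2"]]},
    {"table_id": "table-b", "entities": [["id1"]]},
]
summary = summarize_selection(example)
print(summary)
```

The same function could be applied to the `eligible`, `suggested`, or `final` list to compare how many groups each tier holds per table.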

Override Entity Selection

To use an entity selection other than the default, set the `final` field:

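# Build a custom entity selection.
# `client` and `pipeline_id` are assumed to be defined already, as in the previous example.
# Assumption: `fb` is the FeatureByte SDK imported under its usual alias.
import featurebyte as fb

catalog = fb.Catalog.get_active()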
entities_df = catalog.list_entities().set_index("name")

entity_selection = [
    {
        "table_id": str(catalog.get_table("SALES").id),
        "entities": [
            [str(entities_df.loc["Store", "id"])],          # single entity
            [str(entities_df.loc["State", "id"])],           # parent entity
        ],
    },
    {
        "table_id": str(catalog.get_table("STORE_STATE").id),
        "entities": [
            [str(entities_df.loc["Store", "id"])],
        ],
    },
]

response = client.patch(
    f"/pipeline/{pipeline_id}/step_configs",
    json={
        "step_type": "ideation-metadata",
        "entity_selection": {
            "final": entity_selection,
        },
    },
)
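Since `final` must be a subset of `eligible`, it can be useful to check a custom selection locally before sending it. The sketch below is a hypothetical client-side check, not part of the API; it assumes both lists use the table selection structure described earlier and treats entity order within a group as irrelevant.

```python
def find_ineligible(final, eligible):
    """Return (table_id, group) pairs in `final` that are absent from `eligible`."""
    allowed = {
        (sel["table_id"], frozenset(group))
        for sel in eligible
        for group in sel["entities"]
    }
    return [
        (sel["table_id"], group)
        for sel in final
        for group in sel["entities"]
        if (sel["table_id"], frozenset(group)) not in allowed
    ]


# Placeholder IDs for illustration.
eligible_example = [{"table_id": "t1", "entities": [["a"], ["b"], ["a", "b"]]}]
final_example = [
    {"table_id": "t1", "entities": [["b", "a"], ["c"]]},
    {"table_id": "t2", "entities": [["a"]]},
]
problems = find_ineligible(final_example, eligible_example)
print(problems)
```

An empty result means every group in the custom selection also appears in the eligible tier for the same table.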

Entity Selection and Development Datasets

When a development dataset exists, it constrains the entity selection for ideation. The development dataset contains pre-sampled tables, so only entities present in the sampled data can be used.

The constraint works as follows:

1. The development plan records which entity selection was used to create the sampled tables.
2. When ideation runs with that development dataset, both `suggested` and `eligible` are intersected with the development plan's entity selection.
3. Entities not present in the development dataset are removed from all tiers.
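The intersection step can be sketched as follows. This is an assumption-laden illustration: it supposes the development plan's entity selection has the same table selection shape as the tiers, matches groups regardless of entity order, and drops tables left with no groups. `intersect_selection` is a hypothetical name.

```python
def intersect_selection(tier, plan):
    """Keep only the groups of `tier` that also appear in `plan` for the same table."""
    plan_groups = {}
    for sel in plan:
        plan_groups.setdefault(sel["table_id"], set()).update(
            frozenset(group) for group in sel["entities"]
        )
    result = []
    for sel in tier:
        allowed = plan_groups.get(sel["table_id"], set())
        kept = [g for g in sel["entities"] if frozenset(g) in allowed]
        if kept:  # assumption: tables with no remaining groups are dropped
            result.append({"table_id": sel["table_id"], "entities": kept})
    return result


# Placeholder IDs for illustration.
eligible_tier = [
    {"table_id": "sales", "entities": [["store"], ["state"], ["store", "item"]]},
    {"table_id": "store_state", "entities": [["store"]]},
]
plan_selection = [{"table_id": "sales", "entities": [["store"], ["store", "item"]]}]

constrained = intersect_selection(eligible_tier, plan_selection)
print(constrained)
```

In this example the `state` group and the whole `store_state` table disappear from the tier because the plan did not include them.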

Parent entities and development datasets

Selecting entity parents (e.g., State as parent of Store) when creating a development dataset may reduce or eliminate sampling opportunities. If the parent entity has few distinct values, the sampled tables may not contain enough variation for meaningful feature generation.

For this reason, the default development plan typically uses only the primary entity and its direct relationships. If you need parent entity features, either:

- Include the parent entity when creating the development plan
- Run ideation without a development dataset (on the full data, which takes longer)

Check Development Dataset Entity Selection

# See which entities the development plan used
response = client.get(f"/development_dataset/{development_dataset_id}")
dataset = response.json()
development_plan_id = dataset.get("development_plan_id")

response = client.get(f"/development_plan/{development_plan_id}")
plan = response.json()
plan_entities = plan.get("entity_selection", {})
print(f"Development plan entities: {plan_entities}")

Strategies

Different entity selections produce different feature sets. A common pattern is to run multiple ideation pipelines in parallel with different entity selections and compare model performance.

| Strategy | Entity selection | When to use |
| --- | --- | --- |
| Default | Use the suggested selection | Good starting point; avoids feature explosion |
| Add parent entities | Include parent entities like `State` or `Category` | When parent-level aggregations (e.g., "average sales per state") may be predictive |
| Cross-entity | Add entity pairs like `["item_id", "state_id"]` | When interactions between entities may be predictive (e.g., item popularity by region) |
| Minimal | Primary entity only | When you want a fast, focused ideation with fewer features |
| Broad | Include all eligible entities | When you want maximum coverage and can tolerate longer runs and more features to filter |
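Three of these strategies can be derived mechanically from the tiers returned by the API. The sketch below is one possible way to build the `final` payloads, assuming the table selection structure described earlier; `build_strategies` and the placeholder IDs are hypothetical.

```python
def build_strategies(suggested, eligible, primary_id):
    """Build candidate `final` selections for the default, minimal, and broad strategies."""
    minimal = []
    for sel in eligible:
        # Keep only the single-entity group for the primary entity.
        groups = [g for g in sel["entities"] if g == [primary_id]]
        if groups:
            minimal.append({"table_id": sel["table_id"], "entities": groups})
    return {"default": suggested, "minimal": minimal, "broad": eligible}


# Placeholder IDs for illustration.
suggested_tier = [
    {"table_id": "sales", "entities": [["store"], ["state"]]},
    {"table_id": "store_state", "entities": [["store"]]},
]
eligible_tier = [
    {"table_id": "sales", "entities": [["store"], ["state"], ["item"], ["store", "item"]]},
    {"table_id": "store_state", "entities": [["store"], ["state"]]},
]

strategies = build_strategies(suggested_tier, eligible_tier, "store")
for name, selection in strategies.items():
    total = sum(len(sel["entities"]) for sel in selection)
    print(f"{name}: {total} entity groups")
```

Each value in `strategies` could then be sent as the `final` field of a separate pipeline, so the resulting models can be compared.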

See Parallel Ideation for how to run multiple entity selection strategies simultaneously.