Entity Selection
Prerequisites
This page uses the client and wait_for_task helpers defined in API Overview.
Entity selection determines which entities are used to generate features for each table in an ideation pipeline. It is one of the most impactful configuration choices — selecting too many entities causes feature explosion, while selecting too few misses important signals.
How Entity Selection Works
For each table in the catalog, the system identifies three tiers of entities:
| Tier | Description |
|---|---|
| Eligible | All entities that could be used for feature generation based on the data model relationships. This is the broadest set. |
| Suggested | A curated subset of eligible entities recommended by the system. Defaults to the use case primary entity, use case event entity, and their direct parents. This is the default selection. |
| Final | The entities actually used for ideation. Defaults to suggested unless explicitly overridden. Must be a subset of eligible. |
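The relationship is: suggested is a subset of eligible, and final must also be a subset of eligible. By default final equals suggested, but an override may include eligible entities that were not suggested.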
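The tier relationship can be checked on the client side before a configuration is submitted. The sketch below is illustrative only: it assumes each tier is a list of table selections with `table_id` and `entities` fields, and `flatten` and `is_within` are hypothetical helper names, not part of the API. It also treats the order of IDs inside a group as significant, which is an assumption.

```python
def flatten(selections):
    """Turn a list of table selections into a set of (table_id, entity_group) pairs."""
    return {
        (sel["table_id"], tuple(group))
        for sel in selections
        for group in sel["entities"]
    }


def is_within(inner, outer):
    """Return True if every (table, entity group) in `inner` also appears in `outer`."""
    return flatten(inner) <= flatten(outer)


# Hypothetical IDs, for illustration only
eligible = [{"table_id": "t1", "entities": [["store"], ["state"], ["item"]]}]
suggested = [{"table_id": "t1", "entities": [["store"], ["state"]]}]
final = [{"table_id": "t1", "entities": [["store"]]}]

print("suggested within eligible:", is_within(suggested, eligible))
print("final within eligible:", is_within(final, eligible))
```

A check like this can catch an override that names an entity group outside the eligible tier before the request is sent.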
Default Selection (Suggested)
The system suggests entities that are most likely to produce useful features:
- Use case primary entity: always included (e.g., Store for a store-level forecast)
- Use case event/item entities: entities directly involved in the events being analyzed
- Direct parents of event/item entities: one level up in the entity hierarchy (e.g., State as parent of Store)
This default avoids feature explosion by excluding entities that are far from the prediction target in the relationship graph.
Feature Explosion
Each selected entity generates a dedicated set of features per table: window aggregates across multiple time windows, cross-entity features, stability features, similarity features, and representation features. For example, a single table with 3 entities selected might generate 200+ features, while adding a 4th entity could add another 70+.
Entity pairs (cross-entity features like (item, state)) generate interaction features on top of single-entity features. Because every added entity can also pair with the entities already selected, the feature count can grow faster than linearly as entities are added.
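To get a feel for the combinatorics, the sketch below counts how many single-entity groups and unordered entity pairs are possible for a given number of entities. It is a counting exercise only: `candidate_groups` is a hypothetical helper, and whether the system generates features for every possible pair is not stated on this page, so the numbers are an upper bound on groups, not a feature count.

```python
from itertools import combinations


def candidate_groups(entities):
    """List every single-entity group and every unordered entity pair."""
    singles = [(e,) for e in entities]
    pairs = list(combinations(entities, 2))
    return singles + pairs


for n in range(1, 6):
    names = [f"entity_{i}" for i in range(n)]  # placeholder entity names
    groups = candidate_groups(names)
    print(f"{n} entities -> {len(groups)} candidate groups")
```

With n entities there are n single groups and n(n-1)/2 possible pairs, so the fourth entity adds four new candidate groups where the second added only two.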
View Entity Selection
After creating a pipeline, inspect the default entity selection:
response = client.patch(
    f"/pipeline/{pipeline_id}/step_configs",
    json={"step_type": "ideation-metadata"},
)
config = response.json()
# Show the entity selection tiers
pending_configs = config.get("pending_step_configurations", [])
for cfg in pending_configs:
    if cfg.get("step_type") == "ideation-metadata":
        entity_sel = cfg.get("entity_selection", {})
        print(f"Eligible: {len(entity_sel.get('eligible', []))} table selections")
        print(f"Suggested: {len(entity_sel.get('suggested', []))} table selections")
        print(f"Final: {len(entity_sel.get('final', []))} table selections")
Each table selection contains:
| Field | Type | Description |
|---|---|---|
| table_id | string | ID of the registered table |
| entities | array of arrays | Entity ID groups. ["id1"] = single entity. ["id1", "id2"] = cross-entity pair. |
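Because each table selection mixes single entities and cross-entity pairs in one `entities` array, it can help to summarize a selection before comparing tiers. The following is a minimal sketch that assumes only the two fields described above; `summarize_selection` and the IDs in the example are hypothetical.

```python
def summarize_selection(selections):
    """Count single-entity groups and cross-entity pairs per table."""
    summary = {}
    for sel in selections:
        groups = sel.get("entities", [])
        summary[sel["table_id"]] = {
            "single": sum(1 for g in groups if len(g) == 1),
            "cross": sum(1 for g in groups if len(g) == 2),
        }
    return summary


# Hypothetical table and entity IDs
example = [
    {
        "table_id": "table_a",
        "entities": [["store_id"], ["state_id"], ["item_id", "state_id"]],
    },
    {"table_id": "table_b", "entities": [["store_id"]]},
]

for table_id, counts in summarize_selection(example).items():
    print(table_id, counts)
```

The same helper could be applied to the `eligible`, `suggested`, and `final` lists in turn to see where the tiers differ.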
Override Entity Selection
To use a different entity selection than the default, set the final field:
import featurebyte as fb  # assumption: `fb` is the FeatureByte SDK
# Build a custom entity selection
catalog = fb.Catalog.get_active()
entities_df = catalog.list_entities().set_index("name")
entity_selection = [
    {
        "table_id": str(catalog.get_table("SALES").id),
        "entities": [
            [str(entities_df.loc["Store", "id"])],  # single entity
            [str(entities_df.loc["State", "id"])],  # parent entity
        ],
    },
    {
        "table_id": str(catalog.get_table("STORE_STATE").id),
        "entities": [
            [str(entities_df.loc["Store", "id"])],
        ],
    },
]
# Submit the custom selection as the final tier
response = client.patch(
    f"/pipeline/{pipeline_id}/step_configs",
    json={
        "step_type": "ideation-metadata",
        "entity_selection": {
            "final": entity_selection,
        },
    },
)
Entity Selection and Development Datasets
When a development dataset exists, it constrains the entity selection for ideation. This is because the development dataset contains pre-sampled tables — only entities present in the sampled data can be used.
The constraint works as follows:
- The development plan records which entity selection was used to create the sampled tables
- When ideation runs with that development dataset, both suggested and eligible are intersected with the development plan's entity selection
- Entities not present in the development dataset are removed from all tiers
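The intersection described above can be pictured with a small sketch. This is one plausible reading, not the server's actual logic: it assumes the development plan's entity selection has the same list-of-table-selections shape as the tiers, and it intersects at the level of whole entity groups. `intersect_selection` and all IDs are hypothetical.

```python
def intersect_selection(tier, plan_selection):
    """Keep only the entity groups of `tier` that the development plan also contains."""
    plan_groups = {
        sel["table_id"]: {tuple(g) for g in sel["entities"]}
        for sel in plan_selection
    }
    result = []
    for sel in tier:
        allowed = plan_groups.get(sel["table_id"], set())
        kept = [g for g in sel["entities"] if tuple(g) in allowed]
        if kept:  # drop tables that have no entity groups left
            result.append({"table_id": sel["table_id"], "entities": kept})
    return result


# Hypothetical IDs: the plan sampled table t1 by store only
suggested = [
    {"table_id": "t1", "entities": [["store"], ["state"]]},
    {"table_id": "t2", "entities": [["item"]]},
]
plan_selection = [{"table_id": "t1", "entities": [["store"]]}]

constrained = intersect_selection(suggested, plan_selection)
print(constrained)
```

In this example the state group and the whole of table t2 drop out, which mirrors how entities missing from the development dataset disappear from every tier.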
Parent entities and development datasets
Selecting entity parents (e.g., State as parent of Store) when creating a development dataset may reduce or eliminate sampling opportunities. If the parent entity has few distinct values, the sampled tables may not contain enough variation for meaningful feature generation.
For this reason, the default development plan typically uses only the primary entity and its direct relationships. If you need parent entity features, either:
- Include the parent entity when creating the development plan
- Run ideation without a development dataset (on the full data, which takes longer)
Check Development Dataset Entity Selection
# See which entities the development plan used
response = client.get(f"/development_dataset/{development_dataset_id}")
dataset = response.json()
development_plan_id = dataset.get("development_plan_id")
response = client.get(f"/development_plan/{development_plan_id}")
plan = response.json()
plan_entities = plan.get("entity_selection", {})
print(f"Development plan entities: {plan_entities}")
Strategies
Different entity selections produce different feature sets. A common pattern is to run multiple ideation pipelines in parallel with different entity selections and compare model performance.
| Strategy | Entity selection | When to use |
|---|---|---|
| Default | Use the suggested selection | Good starting point; avoids feature explosion |
| Add parent entities | Include parent entities like State or Category | When parent-level aggregations (e.g., "average sales per state") may be predictive |
| Cross-entity | Add entity pairs like ["item_id", "state_id"] | When interactions between entities may be predictive (e.g., item popularity by region) |
| Minimal | Primary entity only | When you want a fast, focused ideation with fewer features |
| Broad | Include all eligible entities | When you want maximum coverage and can tolerate longer runs and more features to filter |
See Parallel Ideation for how to run multiple entity selection strategies simultaneously.
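Several of the strategies in the table can be derived mechanically from the tiers returned for a pipeline. The sketch below builds the Default, Minimal, and Broad variants as candidate values for the `final` field. It assumes the tiers are lists of table selections with `table_id` and `entities` fields; `minimal_selection`, `build_strategies`, and the IDs are hypothetical names used for illustration.

```python
def minimal_selection(suggested, primary_entity_id):
    """Keep only the single-entity group for the primary entity in each table."""
    result = []
    for sel in suggested:
        kept = [g for g in sel["entities"] if g == [primary_entity_id]]
        if kept:
            result.append({"table_id": sel["table_id"], "entities": kept})
    return result


def build_strategies(eligible, suggested, primary_entity_id):
    """Map a strategy name to the entity selection to submit as `final`."""
    return {
        "default": suggested,
        "minimal": minimal_selection(suggested, primary_entity_id),
        "broad": eligible,
    }


# Hypothetical tiers for a single table
eligible = [{"table_id": "t1", "entities": [["store"], ["state"], ["item"]]}]
suggested = [{"table_id": "t1", "entities": [["store"], ["state"]]}]

for name, selection in build_strategies(eligible, suggested, "store").items():
    group_count = sum(len(sel["entities"]) for sel in selection)
    print(f"{name}: {group_count} entity groups")
```

Each resulting selection could then be submitted as `final` on its own pipeline, so that the runs differ only in entity selection.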