Table EDA
See also
UI Tutorial: Set Default Cleaning Operations | Concepts: Table EDA | API Tutorial: Credit Default — Step 6
Prerequisites
This page uses the client and wait_for_task helpers defined in API Overview.
After registering a table via the SDK, you can run exploratory data analysis through the API to discover data quality issues at the table and column level.
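Run Table EDA
client = fb.Configurations().get_client()
# Start an EDA task on a 10,000-row sample of the table
response = client.post(
    "/table_eda",
    json={
        "table_id": table_id,
        "analysis_size": 10000,
        "seed": 42,
    },
)
# The request returns a task ID; block until that task finishes
task_id = response.json()["id"]
wait_for_task(client, task_id)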
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `table_id` | string | Yes | ID of the registered table |
| `analysis_size` | integer | Yes | Number of rows to sample for analysis |
| `seed` | integer | No | Random seed for reproducibility |
| `from_timestamp` | string | No | Start timestamp filter (ISO format) |
| `to_timestamp` | string | No | End timestamp filter (ISO format) |
| `timestamp_column` | string | No | Column to use for timestamp filtering |
| `overwrite` | boolean | No | Overwrite existing EDA results (default: false) |
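The optional parameters can be assembled into a request body as in the sketch below. `build_table_eda_payload` is a hypothetical helper, not part of the SDK. The client-side checks (positive sample size, start not later than end) are assumptions about sensible input, not documented server rules, and the table ID and column name in the example are placeholders.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def build_table_eda_payload(
    table_id: str,
    analysis_size: int,
    seed: Optional[int] = None,
    from_timestamp: Optional[datetime] = None,
    to_timestamp: Optional[datetime] = None,
    timestamp_column: Optional[str] = None,
    overwrite: bool = False,
) -> Dict[str, Any]:
    """Assemble a JSON body for POST /table_eda (hypothetical helper)."""
    # Assumed sanity checks; the server may apply its own validation.
    if analysis_size <= 0:
        raise ValueError("analysis_size must be a positive integer")
    if from_timestamp and to_timestamp and from_timestamp > to_timestamp:
        raise ValueError("from_timestamp must not be later than to_timestamp")

    payload: Dict[str, Any] = {"table_id": table_id, "analysis_size": analysis_size}
    optional = {
        "seed": seed,
        # Timestamps are sent as ISO-format strings.
        "from_timestamp": from_timestamp.isoformat() if from_timestamp else None,
        "to_timestamp": to_timestamp.isoformat() if to_timestamp else None,
        "timestamp_column": timestamp_column,
    }
    # Leave out optional parameters that were not supplied.
    payload.update({key: value for key, value in optional.items() if value is not None})
    if overwrite:
        payload["overwrite"] = True
    return payload


# Example: restrict the analysis to January 2024 (placeholder ID and column name).
payload = build_table_eda_payload(
    "my-table-id",
    10000,
    seed=42,
    from_timestamp=datetime(2024, 1, 1),
    to_timestamp=datetime(2024, 2, 1),
    timestamp_column="event_timestamp",
)
```

The resulting dictionary can be passed as the `json` argument of the `client.post("/table_eda", ...)` call shown above.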
List Table EDA Runs
response = client.get(
    "/table_eda",
    params={"table_id": table_id, "page": 1, "page_size": 20},
)
eda_runs = response.json()["data"]
# Keep the ID of the first run in the list for the column-level calls below
table_analysis_id = eda_runs[0]["_id"]
Response fields (each item in data):
| Field | Type | Description |
|---|---|---|
| `id` | string | Table EDA ID |
| `table_id` | string | Table that was analyzed |
| `analysis_size` | integer | Number of rows sampled |
| `seed` | integer | Random seed used |
| `last_seen` | datetime | When the analysis was last run |
| `from_timestamp` | datetime | Start of analysis time range (if filtered) |
| `to_timestamp` | datetime | End of analysis time range (if filtered) |
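When a table has more runs than fit on one page, the `page` and `page_size` parameters can be stepped through in a loop. The sketch below is a hypothetical helper, and its stopping rule is an assumption: it treats a page with fewer than `page_size` items as the last one, since this page does not describe any total-count field.

```python
from typing import Any, Dict, Iterator


def iter_table_eda_runs(
    client: Any, table_id: str, page_size: int = 20
) -> Iterator[Dict[str, Any]]:
    """Yield every EDA run for a table, fetching one page at a time."""
    page = 1
    while True:
        response = client.get(
            "/table_eda",
            params={"table_id": table_id, "page": page, "page_size": page_size},
        )
        items = response.json()["data"]
        yield from items
        # Assumption: a page shorter than page_size is the final page.
        if len(items) < page_size:
            break
        page += 1
```

Calling `list(iter_table_eda_runs(client, table_id))` collects all runs into a single list.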
View Column-Level Analysis
List Column Analyses
response = client.patch(
    "/column_analysis",
    json={
        "table_analysis_id": table_analysis_id,
    },
)
columns = response.json()["data"]
Response fields (each item in data):
| Field | Type | Description |
|---|---|---|
| `id` | string | Column analysis ID |
| `table_analysis_id` | string | Parent table EDA ID |
| `table_id` | string | Table ID |
| `column_name` | string | Column being analyzed |
| `column_dtype` | string | Data type (e.g., "INT", "FLOAT", "VARCHAR") |
| `column_description` | string | Column description |
| `issues` | array | Detected data quality issues (missing values, outliers, etc.) |
| `cleaning_reco` | array | Recommended cleaning operations |
| `semantic_level_1` | string | Detected semantic type |
| `summary` | object | Summary statistics (count, unique, min, max, mean, etc.) |
| `outdated` | boolean | Whether the analysis needs to be refreshed |
Filter to only show columns with issues:
response = client.patch(
    "/column_analysis",
    json={
        "table_analysis_id": table_analysis_id,
        "issues": True,
    },
)
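Once the column analysis list has been retrieved, a compact client-side overview can help decide which columns to look at first. The following is a minimal sketch with a hypothetical helper name. It relies only on the response fields listed above and makes no assumption about the structure of the individual `issues` or `cleaning_reco` entries beyond their being arrays.

```python
from typing import Any, Dict, List


def summarize_column_analyses(columns: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one summary row per column that has issues or cleaning recommendations."""
    rows: List[Dict[str, Any]] = []
    for col in columns:
        issues = col.get("issues") or []
        recos = col.get("cleaning_reco") or []
        if not issues and not recos:
            continue  # nothing to act on for this column
        rows.append(
            {
                "column_name": col["column_name"],
                "column_dtype": col.get("column_dtype"),
                "n_issues": len(issues),
                "n_cleaning_reco": len(recos),
                # An outdated analysis may warrant a fresh EDA run first.
                "outdated": bool(col.get("outdated", False)),
            }
        )
    # Columns with the most detected issues come first.
    rows.sort(key=lambda row: row["n_issues"], reverse=True)
    return rows
```

Passing the `columns` list from the call above returns rows that can be printed directly or loaded into a DataFrame.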
Get Detailed Column Analysis
column_analysis_id = columns[0]["_id"]
response = client.patch(f"/column_analysis/{column_analysis_id}")
analysis = response.json()
The response contains the same fields as each item in the column analysis list above.
Column EDA Plots
Generate interactive Bokeh plots for a column's data distribution.
Get Plot Options
Check which categories are available for the column (e.g., dictionary keys or embedding dimensions):
response = client.request(
    "OPTIONS",
    f"/column_eda/{column_analysis_id}",
    params={"cleaned": False},
)
options = response.json()
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `cleaned` | boolean | Yes | `false` for raw data categories, `true` for categories after cleaning operations |
Response fields:
| Field | Type | Description |
|---|---|---|
| `column_categories` | array | Available categories for the column (dictionary keys or embedding dimensions to filter by) |
Get Rendered Plots
response = client.patch(
    f"/column_eda/{column_analysis_id}",
    params={
        "cleaned": False,
        "height": 500,
        "width": 1000,
        "font_size": 16,
        "output_format": "html",
    },
)
plots = response.json()
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `cleaned` | boolean | Yes | `false` for raw data, `true` for data after cleaning operations are applied |
| `column_category` | any | No | Filter by dictionary key or embedding dimension (from OPTIONS response) |
| `height` | integer | No | Plot height in pixels (default: 500) |
| `width` | integer | No | Plot width in pixels (default: 1000) |
| `font_size` | integer | No | Font size in pixels (default: 16) |
| `output_format` | string | No | Output format: "html" (default) or "json" |
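For dictionary or embedding columns, one plot request per category can be prepared from the OPTIONS response. The sketch below only builds the query parameters; `plot_params_per_category` is a hypothetical helper. It assumes that an empty or missing `column_categories` array means the column needs no category filter, which this page does not state explicitly.

```python
from typing import Any, Dict, List


def plot_params_per_category(
    options: Dict[str, Any],
    cleaned: bool = False,
    output_format: str = "html",
) -> List[Dict[str, Any]]:
    """Build one set of plot query parameters per available column category."""
    if output_format not in ("html", "json"):
        raise ValueError('output_format must be "html" or "json"')
    base: Dict[str, Any] = {"cleaned": cleaned, "output_format": output_format}
    categories = options.get("column_categories") or []
    if not categories:
        # Assumption: no categories means a single, unfiltered request.
        return [base]
    return [{**base, "column_category": category} for category in categories]
```

Each returned dictionary can be passed as `params` in the plot request shown above, with `options` taken from the OPTIONS call made with the same `cleaned` value.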
Response fields (each item in the response array):
| Field | Type | Description |
|---|---|---|
| `plot_type` | string | Type of plot (e.g., "distribution", "histogram") |
| `plots` | array | List of rendered plot objects, each with a `content` field containing self-contained Bokeh HTML |
The response is a plot list. See Displaying Plots for how to extract and render plots in Jupyter, save as HTML, or embed in a web application.
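As a rough illustration of the response shape described above, the sketch below writes each plot's `content` to its own HTML file. It applies to `output_format="html"` only. `save_plots` and the file naming scheme are hypothetical, and the sample data is a stand-in for a real response.

```python
import re
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def save_plots(plot_list: List[Dict[str, Any]], out_dir: str) -> List[Path]:
    """Write each plot's HTML content to its own file and return the paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for group_index, group in enumerate(plot_list):
        # Keep file names filesystem-safe whatever the plot_type value is.
        label = re.sub(r"[^A-Za-z0-9_-]+", "_", str(group.get("plot_type", "plot")))
        for plot_index, plot in enumerate(group.get("plots", [])):
            path = target / f"{group_index:02d}_{label}_{plot_index}.html"
            path.write_text(plot["content"], encoding="utf-8")
            written.append(path)
    return written


# Stand-in for `plots = response.json()`; the HTML strings are placeholders.
sample_plots = [
    {
        "plot_type": "distribution",
        "plots": [{"content": "<html>a</html>"}, {"content": "<html>b</html>"}],
    },
    {"plot_type": "histogram", "plots": [{"content": "<html>c</html>"}]},
]

with tempfile.TemporaryDirectory() as tmp:
    saved_names = [path.name for path in save_plots(sample_plots, tmp)]
```

The group index prefix keeps files distinct when two entries share the same `plot_type`.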
Compare Raw vs Cleaned
Generate plots for both raw and cleaned data to see the effect of cleaning operations: