Table EDA

Prerequisites

This page uses the client and wait_for_task helpers defined in API Overview.

After registering a table via the SDK, you can run exploratory data analysis through the API to discover data quality issues at the table and column level.

Run Table EDA

import featurebyte as fb  # SDK import; skip if already done in API Overview

client = fb.Configurations().get_client()

response = client.post(
    "/table_eda",
    json={
        "table_id": table_id,
        "analysis_size": 10000,
        "seed": 42,
    },
)
task_id = response.json()["id"]
wait_for_task(client, task_id)

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| table_id | string | Yes | ID of the registered table |
| analysis_size | integer | Yes | Number of rows to sample for analysis |
| seed | integer | No | Random seed for reproducibility |
| from_timestamp | string | No | Start timestamp filter (ISO format) |
| to_timestamp | string | No | End timestamp filter (ISO format) |
| timestamp_column | string | No | Column to use for timestamp filtering |
| overwrite | boolean | No | Overwrite existing EDA results (default: false) |
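
The optional parameters above can be assembled programmatically. The following is a minimal sketch, not part of the API: build_table_eda_payload is a hypothetical helper name, and the check on timestamp ordering is a client-side assumption rather than documented server behavior. It omits parameters that were not supplied and serializes datetimes to ISO format.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def build_table_eda_payload(
    table_id: str,
    analysis_size: int,
    seed: Optional[int] = None,
    from_timestamp: Optional[datetime] = None,
    to_timestamp: Optional[datetime] = None,
    timestamp_column: Optional[str] = None,
    overwrite: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble the JSON body for POST /table_eda."""
    # Client-side sanity check (an assumption, not documented server behavior).
    if from_timestamp and to_timestamp and from_timestamp > to_timestamp:
        raise ValueError("from_timestamp must not be later than to_timestamp")

    payload: Dict[str, Any] = {
        "table_id": table_id,
        "analysis_size": analysis_size,
        "overwrite": overwrite,
    }
    optional = {
        "seed": seed,
        "from_timestamp": from_timestamp.isoformat() if from_timestamp else None,
        "to_timestamp": to_timestamp.isoformat() if to_timestamp else None,
        "timestamp_column": timestamp_column,
    }
    # Leave out parameters that were not supplied so the server defaults apply.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

The result can be passed directly as the json argument of client.post("/table_eda", ...).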

List Table EDA Runs

response = client.get(
    "/table_eda",
    params={"table_id": table_id, "page": 1, "page_size": 20},
)
eda_runs = response.json()["data"]
table_analysis_id = eda_runs[0]["_id"]

Response fields (each item in data):

| Field | Type | Description |
|---|---|---|
| id | string | Table EDA ID |
| table_id | string | Table that was analyzed |
| analysis_size | integer | Number of rows sampled |
| seed | integer | Random seed used |
| last_seen | datetime | When the analysis was last run |
| from_timestamp | datetime | Start of analysis time range (if filtered) |
| to_timestamp | datetime | End of analysis time range (if filtered) |
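
When a table has more runs than fit on one page, the page and page_size parameters can be walked in a loop. The sketch below uses a hypothetical helper, iter_table_eda_runs, and assumes that a page containing fewer than page_size items is the last one. The response may expose other pagination fields that this page does not document.

```python
from typing import Any, Dict, Iterator


def iter_table_eda_runs(
    client: Any, table_id: str, page_size: int = 20
) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: yield every table EDA run for a table, page by page."""
    page = 1
    while True:
        response = client.get(
            "/table_eda",
            params={"table_id": table_id, "page": page, "page_size": page_size},
        )
        items = response.json()["data"]
        yield from items
        # Assumption: a page shorter than page_size means nothing is left.
        if len(items) < page_size:
            break
        page += 1
```

Usage: eda_runs = list(iter_table_eda_runs(client, table_id)).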

View Column-Level Analysis

List Column Analyses

response = client.patch(
    "/column_analysis",
    json={
        "table_analysis_id": table_analysis_id,
    },
)
columns = response.json()["data"]

Response fields (each item in data):

| Field | Type | Description |
|---|---|---|
| id | string | Column analysis ID |
| table_analysis_id | string | Parent table EDA ID |
| table_id | string | Table ID |
| column_name | string | Column being analyzed |
| column_dtype | string | Data type (e.g., "INT", "FLOAT", "VARCHAR") |
| column_description | string | Column description |
| issues | array | Detected data quality issues (missing values, outliers, etc.) |
| cleaning_reco | array | Recommended cleaning operations |
| semantic_level_1 | string | Detected semantic type |
| summary | object | Summary statistics (count, unique, min, max, mean, etc.) |
| outdated | boolean | Whether the analysis needs to be refreshed |

Filter to only show columns with issues:

response = client.patch(
    "/column_analysis",
    json={
        "table_analysis_id": table_analysis_id,
        "issues": True,
    },
)
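
Once the column list is available, a quick overview of where the problems are can be built client-side. This sketch uses a hypothetical helper, summarize_issues, and relies only on the column_name and issues fields listed above. It treats each entry in issues as opaque, since this page does not specify their structure.

```python
from typing import Any, Dict, List


def summarize_issues(columns: List[Dict[str, Any]]) -> Dict[str, int]:
    """Hypothetical helper: map column_name to its number of detected issues."""
    summary: Dict[str, int] = {}
    for column in columns:
        issues = column.get("issues") or []
        if issues:  # skip columns without detected issues
            summary[column["column_name"]] = len(issues)
    # Columns with the most issues come first.
    return dict(sorted(summary.items(), key=lambda item: item[1], reverse=True))
```

Usage: summarize_issues(response.json()["data"]).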

Get Detailed Column Analysis

column_analysis_id = columns[0]["_id"]

response = client.patch(f"/column_analysis/{column_analysis_id}")
analysis = response.json()

The response contains the same fields as each item in the column analysis list above.

Column EDA Plots

Generate interactive Bokeh plots for a column's data distribution.

Get Plot Options

Check which categories are available for the column (for example, dictionary keys or embedding dimensions):

response = client.request(
    "OPTIONS",
    f"/column_eda/{column_analysis_id}",
    params={"cleaned": False},
)
options = response.json()

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| cleaned | boolean | Yes | false for raw data categories, true for categories after cleaning operations |

Response fields:

| Field | Type | Description |
|---|---|---|
| column_categories | array | Available categories for the column (dictionary keys or embedding dimensions to filter by) |

Get Rendered Plots

response = client.patch(
    f"/column_eda/{column_analysis_id}",
    params={
        "cleaned": False,
        "height": 500,
        "width": 1000,
        "font_size": 16,
        "output_format": "html",
    },
)
plots = response.json()

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| cleaned | boolean | Yes | false for raw data, true for data after cleaning operations are applied |
| column_category | any | No | Filter by dictionary key or embedding dimension (from OPTIONS response) |
| height | integer | No | Plot height in pixels (default: 500) |
| width | integer | No | Plot width in pixels (default: 1000) |
| font_size | integer | No | Font size in pixels (default: 16) |
| output_format | string | No | Output format: "html" (default) or "json" |
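
The column_category parameter pairs naturally with the OPTIONS response: fetch the available categories first, then request plots for each one. The sketch below uses a hypothetical helper, fetch_plots_by_category, and mirrors the request calls shown on this page. It assumes that category values can be passed back unchanged as query parameters.

```python
from typing import Any, List, Tuple


def fetch_plots_by_category(
    client: Any, column_analysis_id: str, cleaned: bool = False, **plot_params: Any
) -> List[Tuple[Any, Any]]:
    """Hypothetical helper: return (category, plot list) pairs for one column."""
    options = client.request(
        "OPTIONS",
        f"/column_eda/{column_analysis_id}",
        params={"cleaned": cleaned},
    ).json()

    results = []
    for category in options.get("column_categories", []):
        # Same call as in "Get Rendered Plots", with the category filter added.
        response = client.patch(
            f"/column_eda/{column_analysis_id}",
            params={"cleaned": cleaned, "column_category": category, **plot_params},
        )
        results.append((category, response.json()))
    return results
```

Extra keyword arguments such as height=300 are forwarded as plot parameters.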

Response fields (each item in the response array):

| Field | Type | Description |
|---|---|---|
| plot_type | string | Type of plot (e.g., "distribution", "histogram") |
| plots | array | List of rendered plot objects, each with a content field containing self-contained Bokeh HTML |

The response is a plot list. See Displaying Plots for how to extract and render plots in Jupyter, save as HTML, or embed in a web application.
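
As a quick illustration of the response shape, the sketch below writes every plot's content field to its own HTML file. The helper name save_plots and the file naming scheme are illustrative choices, not part of the API. It assumes output_format is "html" and that plot_type values are safe to use in file names.

```python
from pathlib import Path
from typing import Any, Dict, List


def save_plots(plot_list: List[Dict[str, Any]], out_dir: str) -> List[Path]:
    """Hypothetical helper: write each plot's HTML content to a file in out_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    for group_index, group in enumerate(plot_list):
        for plot_index, plot in enumerate(group["plots"]):
            # Example file name: 0_distribution_0.html
            target = out_path / f"{group_index}_{group['plot_type']}_{plot_index}.html"
            target.write_text(plot["content"], encoding="utf-8")
            written.append(target)
    return written
```

Usage: save_plots(plots, "eda_plots") returns the list of files written.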

Compare Raw vs Cleaned

Generate plots for both raw and cleaned data to see the effect of cleaning operations:

# Raw data
raw_plots = client.patch(
    f"/column_eda/{column_analysis_id}",
    params={"cleaned": False},
).json()

# After cleaning operations
cleaned_plots = client.patch(
    f"/column_eda/{column_analysis_id}",
    params={"cleaned": True},
).json()
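
To view the two results side by side, the plot lists can be matched on plot_type. This is a minimal sketch with a hypothetical helper, pair_by_plot_type. It assumes each plot_type appears at most once per response; a plot type missing from one side is paired with an empty list.

```python
from typing import Any, Dict, List, Tuple


def pair_by_plot_type(
    raw_plots: List[Dict[str, Any]], cleaned_plots: List[Dict[str, Any]]
) -> Dict[str, Tuple[List[Any], List[Any]]]:
    """Hypothetical helper: map plot_type to (raw plots, cleaned plots)."""
    raw = {group["plot_type"]: group["plots"] for group in raw_plots}
    cleaned = {group["plot_type"]: group["plots"] for group in cleaned_plots}

    # Keep the order of the raw response, then append types only seen after cleaning.
    plot_types = list(raw) + [pt for pt in cleaned if pt not in raw]
    return {pt: (raw.get(pt, []), cleaned.get(pt, [])) for pt in plot_types}
```

Usage: pairs = pair_by_plot_type(raw_plots, cleaned_plots).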