featurebyte.TableColumn.sample

sample(
size: int=10,
seed: int=1234,
from_timestamp: Union[datetime, str, NoneType]=None,
to_timestamp: Union[datetime, str, NoneType]=None,
after_cleaning: bool=False
) -> DataFrame

Description

Returns a DataFrame containing a random sample of rows from the table column. The sample is controlled by the requested size, the random seed, and an optional time range. By default, rows are sampled before any cleaning operations defined at the column level are applied; set after_cleaning to True to sample the cleaned values instead.

Parameters

  • size: int
    default: 10
    Maximum number of rows to sample, with an upper bound of 10,000 rows.

  • seed: int
    default: 1234
    Seed to use for random sampling.

  • from_timestamp: Union[datetime, str, NoneType]
    default: None
    Start of date range to sample from.

  • to_timestamp: Union[datetime, str, NoneType]
    default: None
    End of date range to sample from.

  • after_cleaning: bool
    default: False
    Whether to sample the column after its cleaning operations have been applied.
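
To make the interaction between size, seed, and the time range concrete, here is a minimal local sketch in plain Python. It is not featurebyte code and does not reflect how the library samples from the data warehouse: the helper sample_values, the constant MAX_SAMPLE_SIZE, the inclusive treatment of both bounds, and the error raised for oversized requests are all assumptions made for illustration.

```python
import random
from datetime import datetime

# Upper bound taken from the size parameter description above.
MAX_SAMPLE_SIZE = 10_000


def sample_values(rows, size=10, seed=1234, from_timestamp=None, to_timestamp=None):
    """Hypothetical stand-in for the sampling semantics; not the featurebyte API.

    rows is a list of (timestamp, value) pairs.
    """
    # Assumption: an oversized request is rejected rather than silently capped.
    if size > MAX_SAMPLE_SIZE:
        raise ValueError(f"size must not exceed {MAX_SAMPLE_SIZE}")
    # Assumption: both bounds are inclusive; None means "no bound on this side".
    selected = [
        value
        for ts, value in rows
        if (from_timestamp is None or ts >= from_timestamp)
        and (to_timestamp is None or ts <= to_timestamp)
    ]
    # size is a maximum: fewer rows come back when the range holds fewer.
    k = min(size, len(selected))
    # A dedicated generator seeded with `seed` makes the draw repeatable.
    return random.Random(seed).sample(selected, k)


# Toy data: one value per year, dated 1 June of 2018 through 2025.
rows = [(datetime(2018 + i, 6, 1), float(i)) for i in range(8)]

first = sample_values(
    rows, size=3, seed=111,
    from_timestamp=datetime(2019, 1, 1),
    to_timestamp=datetime(2023, 12, 31),
)
second = sample_values(
    rows, size=3, seed=111,
    from_timestamp=datetime(2019, 1, 1),
    to_timestamp=datetime(2023, 12, 31),
)
```

In this sketch, first and second are identical because the seed and inputs match, and asking for 10 rows over the same range would return only the 5 rows that fall inside it.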

Returns

  • DataFrame
    Sampled rows from the table column.

Examples

Sample 3 rows from the table column.

>>> sample = catalog.get_table("GROCERYINVOICE")["Amount"].sample(3)

Sample 3 rows from the table column within a time range, after cleaning operations have been applied.

>>> from datetime import datetime
>>> import featurebyte as fb
>>> event_table = catalog.get_table("GROCERYINVOICE")
>>> event_table["Amount"].update_critical_data_info(
...   cleaning_operations=[
...     fb.MissingValueImputation(imputed_value=0),
...   ]
... )
>>> event_table["Amount"].sample(
...   size=3,
...   seed=111,
...   from_timestamp=datetime(2019, 1, 1),
...   to_timestamp=datetime(2023, 12, 31),
...   after_cleaning=True,
... )