Catalog
smallcat.catalog.Catalog
Bases: BaseModel
A collection of named datasets with associated loader configuration.
The catalog maps user-defined keys to concrete dataset entries (e.g., CSV or Excel). It can be constructed from an in-memory dictionary, an Airflow Variable (JSON), or a YAML file.
Attributes:
| Name | Type | Description |
|---|---|---|
| entries | dict[str, CatalogEntry] | Mapping of dataset names to their configurations. |
Source code in src/smallcat/catalog.py
entries (class-attribute, instance-attribute)
from_dict (staticmethod)
from_dict(dictionary: dict) -> Catalog
Create a catalog from a Python dictionary.
The dictionary must conform to the Catalog schema (i.e., include an
entries key mapping names to valid CatalogEntry objects).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dictionary | dict | A dictionary matching the Catalog schema. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Catalog | Catalog | A validated Catalog instance. |
Raises:
| Type | Description |
|---|---|
| pydantic.ValidationError | If the dictionary does not match the schema. |
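To make the expected input shape concrete, the sketch below builds a dictionary with a top-level entries key and runs it through a small stand-in check. The helper check_catalog_dict, the entry name sales, and the example values are hypothetical. Only entries, file_format, location, load_options, and save_options are names taken from this page, and real entries may require fields not shown here. The real validation is done by pydantic, not by this helper.

```python
from typing import Any

# Illustrative catalog dictionary. The entry name and values are made up;
# real entries may require further fields that this page does not list.
catalog_dict: dict[str, Any] = {
    "entries": {
        "sales": {
            "file_format": "csv",
            "location": "sales.csv",
            "load_options": None,
            "save_options": None,
        },
    },
}


def check_catalog_dict(dictionary: dict[str, Any]) -> list[str]:
    """Stand-in check (not smallcat's validator): return the entry names.

    Raises ValueError when the top-level ``entries`` mapping is missing,
    loosely mirroring the schema error the real method would raise.
    """
    entries = dictionary.get("entries")
    if not isinstance(entries, dict):
        raise ValueError("expected an 'entries' mapping at the top level")
    return sorted(entries)


print(check_catalog_dict(catalog_dict))
```

With smallcat installed, a dictionary of this shape would be the argument passed to Catalog.from_dict.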
from_airflow_variable (staticmethod)
from_airflow_variable(variable_id: str) -> Catalog
Create a catalog from an Airflow Variable containing JSON.
The Airflow Variable should contain a JSON object compatible with the
Catalog schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| variable_id | str | The Airflow Variable ID to read (expects JSON). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Catalog | Catalog | A Catalog instance. |
Raises:
| Type | Description |
|---|---|
| KeyError | If the Airflow Variable does not exist. |
| pydantic.ValidationError | If the JSON payload is invalid for the model. |
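The lookup-then-decode flow can be sketched without Airflow. In the example below, the FAKE_VARIABLES dictionary stands in for Airflow's Variable store, and read_catalog_json and the variable ID my_catalog are hypothetical names. It shows only the JSON decoding step and the KeyError for a missing variable, not how smallcat reads the Variable.

```python
import json

# Hypothetical stand-in for Airflow's Variable store: variable ID -> JSON text.
FAKE_VARIABLES = {
    "my_catalog": (
        '{"entries": {"sales": {"file_format": "csv", "location": "sales.csv"}}}'
    ),
}


def read_catalog_json(variable_id: str) -> dict:
    """Look up a variable and decode its JSON payload.

    A missing ID raises KeyError, matching the error documented above.
    """
    raw = FAKE_VARIABLES[variable_id]  # KeyError if the variable is absent
    return json.loads(raw)


payload = read_catalog_json("my_catalog")
print(sorted(payload["entries"]))
```

The decoded payload has the same shape as the dictionary accepted by from_dict.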
from_yaml (staticmethod)
from_yaml(dictionary_path: str | Path) -> Catalog
Create a catalog from a YAML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dictionary_path | str \| Path | Path to a YAML file whose content matches the Catalog schema. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Catalog | Catalog | A Catalog instance. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the YAML file cannot be found. |
| pydantic.ValidationError | If the YAML content is invalid for the model. |
get_dataset
get_dataset(key: str) -> BaseDataset
Instantiate a concrete dataset for a given catalog entry.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | str | The name of the catalog entry to resolve. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| BaseDataset | BaseDataset | A dataset instance ready to load/save the data. |
Raises:
| Type | Description |
|---|---|
| KeyError | If the key is not present in the catalog. |
| ValueError | If the entry's … |
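One plausible way to resolve a key to a concrete dataset is to dispatch on the entry's file_format. The sketch below is a stdlib-only illustration under that assumption. FakeDataset and DATASET_FACTORIES are hypothetical, and the ValueError condition (an unsupported file_format) is a guess, since the description above is truncated.

```python
from dataclasses import dataclass


@dataclass
class FakeDataset:
    """Stand-in for a concrete dataset class (hypothetical)."""

    kind: str


# Hypothetical registry from file_format tag to a dataset factory.
DATASET_FACTORIES = {
    "csv": lambda: FakeDataset("csv"),
    "parquet": lambda: FakeDataset("parquet"),
}


def get_dataset(entries: dict[str, dict], key: str) -> FakeDataset:
    """Resolve a catalog key to a dataset instance."""
    entry = entries[key]  # KeyError when the key is not in the catalog
    file_format = entry["file_format"]
    if file_format not in DATASET_FACTORIES:
        # Assumption: an unsupported format is reported as ValueError.
        raise ValueError(f"unsupported file_format: {file_format!r}")
    return DATASET_FACTORIES[file_format]()


entries = {"sales": {"file_format": "csv"}, "odd": {"file_format": "xml"}}
print(get_dataset(entries, "sales"))
```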
load_pandas
load_pandas(key: str, where: str | None = None) -> pd.DataFrame
Load a dataset from the catalog into a pandas DataFrame.
Resolves the catalog entry identified by key and delegates to
EntryBase.load_pandas. With entry = self.entries[key], this is equivalent to:
`entry.build_dataset().load_pandas(entry.location)`
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | str | The catalog entry name to load. | required |
| where | str \| None | Optional SQL filter predicate forwarded to the dataset. | None |
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | The loaded tabular data. |
Raises:
| Type | Description |
|---|---|
| KeyError | If key is not present in the catalog. |
| Exception | Any error propagated from the underlying dataset's loader. |
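The delegation chain (catalog to entry to dataset) can be sketched with stand-in classes. FakeCatalog, FakeEntry, and FakeDataset below are hypothetical, and rows are returned as a list of dictionaries instead of a DataFrame so the example needs nothing beyond the standard library.

```python
from dataclasses import dataclass


class FakeDataset:
    """Hypothetical dataset that returns rows instead of a DataFrame."""

    def load_pandas(self, location: str) -> list[dict]:
        return [{"source": location}]


@dataclass
class FakeEntry:
    location: str

    def build_dataset(self) -> FakeDataset:
        return FakeDataset()

    def load_pandas(self) -> list[dict]:
        # Entry-level load: build the dataset, then pass this entry's location.
        return self.build_dataset().load_pandas(self.location)


@dataclass
class FakeCatalog:
    entries: dict[str, FakeEntry]

    def load_pandas(self, key: str) -> list[dict]:
        # Catalog-level load: look up the entry (KeyError if absent), delegate.
        return self.entries[key].load_pandas()


catalog = FakeCatalog(entries={"sales": FakeEntry(location="sales.csv")})
print(catalog.load_pandas("sales"))
```

The catalog method adds only the key lookup. Everything format-specific lives on the entry and the dataset it builds.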
save_pandas
save_pandas(key: str, df: pd.DataFrame) -> None
Save a pandas DataFrame to a dataset in the catalog.
Resolves the catalog entry identified by key and delegates to
EntryBase.save_pandas. This writes to the entry's configured
location with any format-specific save options applied.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | str | The catalog entry name to write to. | required |
| df | pd.DataFrame | The DataFrame to persist. | required |
Raises:
| Type | Description |
|---|---|
| KeyError | If key is not present in the catalog. |
| Exception | Any error propagated from the underlying dataset's saver. |
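To illustrate what "format-specific save options applied" can mean in practice, the sketch below writes and re-reads a small delimited file with the standard csv module. The functions save_rows and load_rows and the delimiter option name are hypothetical. smallcat's actual option names are defined by its own option classes, which this page does not spell out.

```python
import csv
import tempfile
from pathlib import Path


def save_rows(location: Path, rows: list[dict], save_options: dict) -> None:
    """Write rows as delimited text; 'delimiter' is a hypothetical option."""
    delimiter = save_options.get("delimiter", ",")
    with location.open("w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=list(rows[0]), delimiter=delimiter
        )
        writer.writeheader()
        writer.writerows(rows)


def load_rows(location: Path, load_options: dict) -> list[dict]:
    """Read the rows back using the matching option."""
    delimiter = load_options.get("delimiter", ",")
    with location.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


rows = [{"id": "1", "name": "a"}, {"id": "2", "name": "b"}]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sales.csv"
    save_rows(target, rows, {"delimiter": ";"})
    first_line = target.read_text().splitlines()[0]
    restored = load_rows(target, {"delimiter": ";"})

print(first_line)
print(restored == rows)
```

The load and save options must agree for the round trip to work, which is one reason to keep both on the same catalog entry.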
load_arrow
load_arrow(key: str, where: str | None = None) -> pa.Table
Load a dataset from the catalog into an Apache Arrow Table.
Resolves the catalog entry identified by key and delegates to
EntryBase.load_arrow. With entry = self.entries[key], this is equivalent to:
`entry.build_dataset().load_arrow_table(entry.location)`
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | str | The catalog entry name to load. | required |
| where | str \| None | Optional SQL filter predicate forwarded to the dataset. | None |
Returns:
| Type | Description |
|---|---|
| pa.Table | The loaded Arrow table. |
Raises:
| Type | Description |
|---|---|
| KeyError | If key is not present in the catalog. |
| Exception | Any error propagated from the underlying dataset's loader. |
save_arrow
save_arrow(key: str, table: pa.Table) -> None
Save an Apache Arrow Table to a dataset in the catalog.
Resolves the catalog entry identified by key and delegates to
EntryBase.save_arrow. This writes to the entry's configured
location with any format-specific save options applied.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | str | The catalog entry name to write to. | required |
| table | pa.Table | The Arrow table to persist. | required |
Raises:
| Type | Description |
|---|---|
| KeyError | If key is not present in the catalog. |
| Exception | Any error propagated from the underlying dataset's saver. |
Entries
smallcat.catalog.CSVEntry
Bases: EntryBase
Catalog entry describing a CSV dataset.
Attributes:
| Name | Type | Description |
|---|---|---|
| file_format | Literal['csv'] | Literal string identifying the file format: 'csv'. |
| load_options | CSVLoadOptions \| None | Options controlling CSV reading (see CSVLoadOptions). |
| save_options | CSVSaveOptions \| None | Options controlling CSV writing (see CSVSaveOptions). |
build_dataset
build_dataset() -> CSVDataset
Build a CSVDataset using this entry's configuration.
Returns:
| Name | Type | Description |
|---|---|---|
| CSVDataset | CSVDataset | A dataset configured with the resolved connection and options. |
load_pandas
load_pandas(where: str | None = None) -> pd.DataFrame
Load this entry's dataset into a pandas DataFrame.
This method builds the concrete dataset via build_dataset and
delegates to its load_pandas method using this entry's location.
Any dataset-specific load options configured on the entry are respected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| where | str \| None | Optional SQL filter predicate forwarded to the dataset. | None |
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | The loaded tabular data. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the target path/table at location does not exist. |
| ValueError | If the data cannot be parsed as tabular data. |
| Exception | Any other error raised by the underlying dataset implementation. |
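This page describes where only as an SQL filter predicate forwarded to the dataset. As one possible reading, the sketch below appends the predicate to a SELECT as a WHERE clause. It uses the standard sqlite3 module purely as a stand-in query engine, and the function load_rows and the table sales are hypothetical. It does not describe how smallcat evaluates the predicate internally.

```python
import sqlite3
from typing import Optional


def load_rows(
    conn: sqlite3.Connection, table: str, where: Optional[str] = None
) -> list[tuple]:
    """Select all rows, optionally narrowed by a caller-supplied predicate.

    The predicate is spliced into the SQL text, so it must come from a
    trusted source; this mirrors the idea only, not smallcat's internals.
    """
    query = f"SELECT id, amount FROM {table}"
    if where is not None:
        query += f" WHERE {where}"
    return conn.execute(query + " ORDER BY id").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [(1, 5.0), (2, 50.0), (3, 500.0)]
)

print(load_rows(conn, "sales"))
print(load_rows(conn, "sales", where="amount > 10"))
```

Leaving where as None returns every row. Passing a predicate narrows the result before any rows reach the caller.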
save_pandas
save_pandas(df: pd.DataFrame) -> None
Save a pandas DataFrame to this entry's dataset location.
This method builds the concrete dataset via build_dataset and
delegates to its save_pandas method using this entry's location.
Any dataset-specific save options configured on the entry are respected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame | The DataFrame to persist. | required |
Raises:
| Type | Description |
|---|---|
| PermissionError | If the target cannot be written to. |
| ValueError | If the DataFrame is incompatible with the target format/options. |
| Exception | Any other error raised by the underlying dataset implementation. |
smallcat.catalog.ExcelEntry
Bases: EntryBase
Catalog entry describing an Excel dataset.
Attributes:
| Name | Type | Description |
|---|---|---|
| file_format | Literal['excel'] | Literal string identifying the file format: 'excel'. |
| load_options | ExcelLoadOptions \| None | Options controlling Excel reading (see ExcelLoadOptions). |
| save_options | ExcelSaveOptions \| None | Options controlling Excel writing (see ExcelSaveOptions). |
build_dataset
build_dataset() -> ExcelDataset
Build an ExcelDataset using this entry's configuration.
Returns:
| Name | Type | Description |
|---|---|---|
| ExcelDataset | ExcelDataset | A dataset configured with the resolved connection and options. |
load_pandas
load_pandas(where: str | None = None) -> pd.DataFrame
Load this entry's dataset into a pandas DataFrame.
This method builds the concrete dataset via build_dataset and
delegates to its load_pandas method using this entry's location.
Any dataset-specific load options configured on the entry are respected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| where | str \| None | Optional SQL filter predicate forwarded to the dataset. | None |
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | The loaded tabular data. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the target path/table at location does not exist. |
| ValueError | If the data cannot be parsed as tabular data. |
| Exception | Any other error raised by the underlying dataset implementation. |
save_pandas
save_pandas(df: pd.DataFrame) -> None
Save a pandas DataFrame to this entry's dataset location.
This method builds the concrete dataset via build_dataset and
delegates to its save_pandas method using this entry's location.
Any dataset-specific save options configured on the entry are respected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame | The DataFrame to persist. | required |
Raises:
| Type | Description |
|---|---|
| PermissionError | If the target cannot be written to. |
| ValueError | If the DataFrame is incompatible with the target format/options. |
| Exception | Any other error raised by the underlying dataset implementation. |
smallcat.catalog.ParquetEntry
Bases: EntryBase
Catalog entry describing a Parquet dataset.
Attributes:
| Name | Type | Description |
|---|---|---|
| file_format | Literal['parquet'] | Literal string identifying the file format: 'parquet'. |
| load_options | ParquetLoadOptions \| None | Optional configuration controlling Parquet reading behavior (see ParquetLoadOptions). |
| save_options | ParquetSaveOptions \| None | Optional configuration controlling Parquet writing behavior (see ParquetSaveOptions). |
build_dataset
build_dataset() -> ParquetDataset
Build a ParquetDataset using this entry's configuration.
Returns:
| Name | Type | Description |
|---|---|---|
| ParquetDataset | ParquetDataset | A dataset configured with the resolved connection and Parquet-specific options. |
load_pandas
load_pandas(where: str | None = None) -> pd.DataFrame
Load this entry's dataset into a pandas DataFrame.
This method builds the concrete dataset via build_dataset and
delegates to its load_pandas method using this entry's location.
Any dataset-specific load options configured on the entry are respected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| where | str \| None | Optional SQL filter predicate forwarded to the dataset. | None |
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | The loaded tabular data. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the target path/table at location does not exist. |
| ValueError | If the data cannot be parsed as tabular data. |
| Exception | Any other error raised by the underlying dataset implementation. |
save_pandas
save_pandas(df: pd.DataFrame) -> None
Save a pandas DataFrame to this entry's dataset location.
This method builds the concrete dataset via build_dataset and
delegates to its save_pandas method using this entry's location.
Any dataset-specific save options configured on the entry are respected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame | The DataFrame to persist. | required |
Raises:
| Type | Description |
|---|---|
| PermissionError | If the target cannot be written to. |
| ValueError | If the DataFrame is incompatible with the target format/options. |
| Exception | Any other error raised by the underlying dataset implementation. |
smallcat.catalog.DeltaTableEntry
Bases: EntryBase
Catalog entry describing a Delta Lake table dataset.
This entry specifies configuration for reading from or writing to Delta Lake tables, typically stored on local or cloud-backed storage. It includes both connection details and Delta-specific load/save options.
Attributes:
| Name | Type | Description |
|---|---|---|
| file_format | Literal['delta_table'] | Literal string identifying the file format: 'delta_table'. |
| load_options | DeltaTableLoadOptions \| None | Optional configuration controlling Delta table reading behavior (see DeltaTableLoadOptions). |
| save_options | DeltaTableSaveOptions \| None | Optional configuration controlling Delta table writing behavior (see DeltaTableSaveOptions). |
file_format (class-attribute, instance-attribute)
file_format: Literal['delta_table'] = 'delta_table'
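A Literal-typed field with a matching default acts as a fixed tag that identifies the entry type. The sketch below shows the idea with a plain dataclass. DeltaLikeEntry and its manual check are hypothetical, since the real class is a pydantic model and pydantic enforces Literal types itself.

```python
from dataclasses import dataclass
from typing import Literal, get_args, get_type_hints


@dataclass
class DeltaLikeEntry:
    """Hypothetical stand-in for an entry with a fixed format tag."""

    location: str
    file_format: Literal["delta_table"] = "delta_table"

    def __post_init__(self) -> None:
        # Dataclasses do not enforce Literal types, so check by hand.
        allowed = get_args(get_type_hints(type(self))["file_format"])
        if self.file_format not in allowed:
            raise ValueError(f"file_format must be one of {allowed}")


entry = DeltaLikeEntry(location="tables/sales")
print(entry.file_format)
```

Because the tag has a default, callers constructing the entry in code can omit it, while any other value is rejected.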
build_dataset
build_dataset() -> DeltaTableDataset
Build a DeltaTableDataset using this entry's configuration.
Returns:
| Name | Type | Description |
|---|---|---|
| DeltaTableDataset | DeltaTableDataset | A dataset configured with the resolved connection and Delta Lake options. |
load_pandas
load_pandas(where: str | None = None) -> pd.DataFrame
Load this entry's dataset into a pandas DataFrame.
This method builds the concrete dataset via build_dataset and
delegates to its load_pandas method using this entry's location.
Any dataset-specific load options configured on the entry are respected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| where | str \| None | Optional SQL filter predicate forwarded to the dataset. | None |
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | The loaded tabular data. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the target path/table at location does not exist. |
| ValueError | If the data cannot be parsed as tabular data. |
| Exception | Any other error raised by the underlying dataset implementation. |
save_pandas
save_pandas(df: pd.DataFrame) -> None
Save a pandas DataFrame to this entry's dataset location.
This method builds the concrete dataset via build_dataset and
delegates to its save_pandas method using this entry's location.
Any dataset-specific save options configured on the entry are respected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame | The DataFrame to persist. | required |
Raises:
| Type | Description |
|---|---|
| PermissionError | If the target cannot be written to. |
| ValueError | If the DataFrame is incompatible with the target format/options. |
| Exception | Any other error raised by the underlying dataset implementation. |