Datasets¶
Dataset constructors for common table/file formats.
Re-exports the most common dataset classes so callers can import from
smallcat.datasets directly.
Available datasets
- ParquetDataset
- CSVDataset
- ExcelDataset
- DeltaTableDataset
- BaseDataset (abstract)
CSV¶
CSV dataset using DuckDB's CSV reader/writer.
This module defines CSVDataset, a concrete dataset for CSV/TSV/DSV
files using DuckDB (read_csv_auto / COPY ... WITH (FORMAT CSV)).
All paths are treated as relative to the dataset's base URI.
Features
- Auto-schema inference (delimiter, header, types) with overrides.
- Large-file handling via DuckDB streaming.
- Optional Hive-style partitioning on write.
Example
ds = CSVDataset.from_conn_id("local_fs")
tbl = ds.load_arrow_table("bronze/raw/users.csv")
ds.save_arrow_table("silver/users/", tbl)
Typical options (suggested):
* Load: header, delimiter, columns, nullstr, types.
* Save: header, delimiter, partition_by, overwrite.
Note
An implementation typically builds SQL like:
SELECT * FROM read_csv_auto(?, ...options...) for reading and
COPY (SELECT * FROM tmp_input) TO ? WITH (FORMAT CSV, ...) for writing.
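A fuller sketch with explicit options (field names come from the CSVLoadOptions / CSVSaveOptions models documented below; the connection ID and paths are illustrative):

```python
from smallcat.datasets.csv_dataset import CSVDataset, CSVLoadOptions, CSVSaveOptions

# "local_fs" is a hypothetical Airflow connection ID; it supplies the base URI.
ds = CSVDataset.from_conn_id(
    "local_fs",
    load_options=CSVLoadOptions(sep="|", header=True, columns={"id": "INTEGER"}),
    save_options=CSVSaveOptions(header=True, sep=","),
)

tbl = ds.load_arrow_table("bronze/raw/users.csv")  # pipe-delimited input
ds.save_arrow_table("silver/users.csv", tbl)       # comma-delimited output
```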
smallcat.datasets.csv_dataset.CSVDataset ¶
Bases: BaseDataset[CSVLoadOptions, CSVSaveOptions]
Dataset that reads/writes CSV using DuckDB.
- Paths are resolved relative to the dataset's connection base (local filesystem, gs://, etc.).
- Reading uses DuckDBPyConnection.read_csv under the hood and returns a pyarrow.Table.
- Writing uses Relation.write_csv to materialize a table to CSV.
Notes:¶
- Use CSVLoadOptions to override auto-detection (separator, header, per-column types).
- Use CSVSaveOptions to control delimiter, header, and overwrite behavior.
Source code in src/smallcat/datasets/csv_dataset.py
load_arrow_record_batch_reader ¶
Stream CSV rows as RecordBatches with an optional filter.
Source code in src/smallcat/datasets/csv_dataset.py
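A minimal streaming sketch; the `where` keyword is an assumption based on the "optional filter" wording above and the `where` parameter documented for load_pandas (connection ID, path, and column are illustrative):

```python
from smallcat.datasets.csv_dataset import CSVDataset

ds = CSVDataset.from_conn_id("local_fs")  # hypothetical connection ID

reader = ds.load_arrow_record_batch_reader(
    "bronze/raw/users.csv",
    where="signup_date >= DATE '2024-01-01'",  # assumed keyword; hypothetical column
)

# Count matching rows without materializing the whole file in memory.
total_rows = 0
for batch in reader:  # each item is a pyarrow.RecordBatch
    total_rows += batch.num_rows
print(total_rows)
```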
save_arrow_table ¶
save_arrow_table(path: str, table: pa.Table) -> None
Write a PyArrow Table to CSV.
Parameters¶
path: Destination path (file or pattern) relative to the connection base. Compression is inferred from the extension (e.g. '.gz', '.zst').
table: The Arrow table to write.
Raises:¶
duckdb.IOException: If the destination is not writable.
Source code in src/smallcat/datasets/csv_dataset.py
from_conn_id classmethod ¶
from_conn_id(conn_id: str, *, load_options: L | None = None, save_options: S | None = None) -> BaseDataset[L, S]
Construct an instance by looking up an Airflow connection ID.
Uses airflow.hooks.base.BaseHook (or the SDK alternative) to fetch
the connection and then calls the class constructor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| conn_id | str | Airflow connection ID to resolve. | required |
| load_options | L \| None | Optional load options model. | None |
| save_options | S \| None | Optional save options model. | None |

Returns:

| Type | Description |
|---|---|
| BaseDataset[L, S] | A fully constructed dataset instance. |
Source code in src/smallcat/datasets/base_dataset.py
load_pandas ¶
Load data as a pandas DataFrame.
This is a convenience wrapper over load_arrow_table and pushes down
filters when provided.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Relative dataset path. | required |
| where | str \| None | Optional SQL filter predicate injected into the query. | None |

Returns:

| Type | Description |
|---|---|
| pd.DataFrame | A pandas DataFrame. |
Source code in src/smallcat/datasets/base_dataset.py
save_pandas ¶
save_pandas(path: str, df: pd.DataFrame) -> None
Persist a pandas DataFrame.
Converts the DataFrame to a pyarrow.Table and delegates to
save_arrow_table.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Relative dataset path. | required |
| df | pd.DataFrame | DataFrame to persist. | required |

Returns:

| Type | Description |
|---|---|
| None | None |
Source code in src/smallcat/datasets/base_dataset.py
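A short sketch combining the two pandas convenience wrappers (connection ID, paths, and the filter column are illustrative):

```python
import pandas as pd

from smallcat.datasets.csv_dataset import CSVDataset

ds = CSVDataset.from_conn_id("local_fs")

# The filter is pushed down into the DuckDB query rather than applied in pandas.
adults: pd.DataFrame = ds.load_pandas("bronze/raw/users.csv", where="age >= 18")

ds.save_pandas("silver/adult_users.csv", adults)
```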
Options¶
smallcat.datasets.csv_dataset.CSVLoadOptions ¶
Bases: BaseModel
Options that control how CSV files are read.
These mirror DuckDB's read_csv_auto parameters we expose.
All fields are optional; unset values defer to DuckDB defaults.
Attributes:¶
columns: Optional mapping of column names to logical types (e.g. {"id": "INTEGER", "amount": "DOUBLE"}) used to override DuckDB's type inference when auto-detection is not good enough.
sep: Field separator (e.g. ",", "|", "\t"). If None, DuckDB will try to detect it.
header: Whether the first row contains column names. If None, DuckDB will detect it.
sample_size: Number of rows to sample for schema detection. If None, the DuckDB default applies.
all_varchar: If True, read all columns as VARCHAR (string). Useful when types are messy.
Source code in src/smallcat/datasets/csv_dataset.py
columns class-attribute instance-attribute ¶
columns: Mapping[str, str] | None = Field(None, description="Override inferred types per column, e.g. {'id': 'INTEGER'}.")
sep class-attribute instance-attribute ¶
sep: str | None = Field(None, description="Field separator (e.g. ',', '|', '\\t'); auto-detected if None.")
header class-attribute instance-attribute ¶
header: bool | None = Field(None, description='Whether the first row is a header; auto-detected if None.')
smallcat.datasets.csv_dataset.CSVSaveOptions ¶
Bases: BaseModel
Options that control how CSV files are written.
Attributes:¶
header: Whether to write a header row with column names.
sep: Field separator to use when writing (e.g. ',', '|', '\t').
overwrite: If True, allow overwriting existing files at the destination. Compression is inferred from the file extension ('.gz', '.zst', …).
Source code in src/smallcat/datasets/csv_dataset.py
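A small write-side sketch relying on the extension-based compression noted above (connection ID and path are illustrative):

```python
import pyarrow as pa

from smallcat.datasets.csv_dataset import CSVDataset, CSVSaveOptions

ds = CSVDataset.from_conn_id(
    "local_fs",
    save_options=CSVSaveOptions(header=True, sep=";", overwrite=True),
)

tbl = pa.table({"id": [1, 2], "name": ["ada", "bob"]})

# Gzip compression is inferred from the ".gz" extension.
ds.save_arrow_table("silver/users.csv.gz", tbl)
```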
Excel¶
Excel (.xlsx) dataset via DuckDB's excel extension.
This module provides ExcelDataset for reading/writing .xlsx files
(legacy .xls is not supported). Paths are relative to the configured
base URI; the DuckDB excel extension is installed/loaded at runtime.
Capabilities
- Read a whole sheet or an A1 range with optional header handling.
- Coerce empty columns or all columns to VARCHAR for schema stability.
- Write Arrow tables to a specific sheet (with optional header row).
Example
ds = ExcelDataset.from_conn_id("fs_conn")
tbl = ds.load_arrow_table("inputs/budget.xlsx")  # first sheet by default
ds.save_arrow_table("outputs/budget_out.xlsx", tbl)
smallcat.datasets.excel_dataset.ExcelDataset ¶
Bases: BaseDataset[ExcelLoadOptions, ExcelSaveOptions]
Reads and writes .xlsx files via DuckDB's excel extension.
Notes
- Legacy .xls format is not supported.
- Paths are treated as relative to this dataset's base URI (e.g., file:// or gs://); use the connection extras to set the base.
Source code in src/smallcat/datasets/excel_dataset.py
load_arrow_record_batch_reader ¶
Stream .xlsx rows as record batches with an optional filter.
Source code in src/smallcat/datasets/excel_dataset.py
save_arrow_table ¶
save_arrow_table(path: str, table: pa.Table) -> None
Write a PyArrow table to an .xlsx file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Relative path of the output .xlsx file (joined under the dataset base). | required |
| table | pa.Table | The Arrow table to write. | required |
Notes
Uses DuckDB's COPY ... TO ... WITH (FORMAT xlsx ...) from the
excel extension. Save-time options are translated into COPY options.
Source code in src/smallcat/datasets/excel_dataset.py
from_conn_id classmethod ¶
from_conn_id(conn_id: str, *, load_options: L | None = None, save_options: S | None = None) -> BaseDataset[L, S]
Construct an instance by looking up an Airflow connection ID.
Uses airflow.hooks.base.BaseHook (or the SDK alternative) to fetch
the connection and then calls the class constructor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| conn_id | str | Airflow connection ID to resolve. | required |
| load_options | L \| None | Optional load options model. | None |
| save_options | S \| None | Optional save options model. | None |

Returns:

| Type | Description |
|---|---|
| BaseDataset[L, S] | A fully constructed dataset instance. |
Source code in src/smallcat/datasets/base_dataset.py
load_pandas ¶
Load data as a pandas DataFrame.
This is a convenience wrapper over load_arrow_table and pushes down
filters when provided.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Relative dataset path. | required |
| where | str \| None | Optional SQL filter predicate injected into the query. | None |

Returns:

| Type | Description |
|---|---|
| pd.DataFrame | A pandas DataFrame. |
Source code in src/smallcat/datasets/base_dataset.py
save_pandas ¶
save_pandas(path: str, df: pd.DataFrame) -> None
Persist a pandas DataFrame.
Converts the DataFrame to a pyarrow.Table and delegates to
save_arrow_table.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Relative dataset path. | required |
| df | pd.DataFrame | DataFrame to persist. | required |

Returns:

| Type | Description |
|---|---|
| None | None |
Source code in src/smallcat/datasets/base_dataset.py
Options¶
smallcat.datasets.excel_dataset.ExcelLoadOptions ¶
Bases: BaseModel
Options that control how an .xlsx file is read.
Attributes:

| Name | Type | Description |
|---|---|---|
| header | bool \| None | If True, treat the first row as column headers. |
| sheet | str \| None | Optional worksheet name to read. If omitted, the first sheet is used. |
| range | str \| None | Excel A1-style range to read (e.g., "A1:D100"). If omitted, the full sheet is read. |
| all_varchar | bool \| None | If True, coerce all columns to VARCHAR (strings). |
| empty_as_varchar | bool \| None | If True, treat empty columns as VARCHAR instead of NULL/typed. |
Source code in src/smallcat/datasets/excel_dataset.py
smallcat.datasets.excel_dataset.ExcelSaveOptions ¶
Bases: BaseModel
Options that control how an Arrow table is written to .xlsx.
Attributes:

| Name | Type | Description |
|---|---|---|
| header | bool \| None | If True, include column headers in the output file. |
| sheet | str \| None | Optional worksheet name to write into (created if missing). |
Source code in src/smallcat/datasets/excel_dataset.py
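A sketch pairing the read and write options above (sheet names, the A1 range, connection ID, and paths are illustrative):

```python
from smallcat.datasets.excel_dataset import ExcelDataset, ExcelLoadOptions, ExcelSaveOptions

ds = ExcelDataset.from_conn_id(
    "fs_conn",
    load_options=ExcelLoadOptions(sheet="Budget2024", range="A1:D100", header=True),
    save_options=ExcelSaveOptions(sheet="Summary", header=True),
)

tbl = ds.load_arrow_table("inputs/budget.xlsx")      # reads only A1:D100 of "Budget2024"
ds.save_arrow_table("outputs/budget_out.xlsx", tbl)  # writes into the "Summary" sheet
```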
Parquet¶
Parquet dataset backed by DuckDB.
This module provides ParquetDataset, a concrete implementation of
smallcat.datasets.base_dataset.BaseDataset that reads/writes Parquet
via DuckDB. Paths passed to public methods are relative to the configured
base (e.g., file:// or gs://).
Features
- Read from a single file, directory, or glob pattern.
- Hive partition discovery and schema union (optional).
- Write with optional partitioning and overwrite.
Example
ds = ParquetDataset.from_conn_id("gcs_conn")
tbl = ds.load_arrow_table("bronze/events/**/*.parquet")
ds.save_arrow_table("silver/events/", tbl)
smallcat.datasets.parquet_dataset.ParquetDataset ¶
Bases: BaseDataset
Parquet dataset backed by DuckDB's Parquet reader/writer.
Paths passed to public methods are treated as relative to the dataset's
configured base (e.g., file:// or gs://). Reads return a PyArrow
table.
Notes
- You can pass a single file, a directory, or any glob DuckDB understands (e.g., /path/**/*.parquet).
Source code in src/smallcat/datasets/parquet_dataset.py
load_arrow_record_batch_reader ¶
Stream Parquet rows as record batches with an optional filter.
Source code in src/smallcat/datasets/parquet_dataset.py
save_arrow_table ¶
save_arrow_table(path: str, table: pa.Table) -> None
Write a PyArrow table to Parquet.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Relative output path (file or directory) joined under the dataset base URI. | required |
| table | pa.Table | The Arrow table to write. | required |
Notes
Uses Relation.write_parquet with parameters from
save_options_dict().
Source code in src/smallcat/datasets/parquet_dataset.py
from_conn_id classmethod ¶
from_conn_id(conn_id: str, *, load_options: L | None = None, save_options: S | None = None) -> BaseDataset[L, S]
Construct an instance by looking up an Airflow connection ID.
Uses airflow.hooks.base.BaseHook (or the SDK alternative) to fetch
the connection and then calls the class constructor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| conn_id | str | Airflow connection ID to resolve. | required |
| load_options | L \| None | Optional load options model. | None |
| save_options | S \| None | Optional save options model. | None |

Returns:

| Type | Description |
|---|---|
| BaseDataset[L, S] | A fully constructed dataset instance. |
Source code in src/smallcat/datasets/base_dataset.py
load_pandas ¶
Load data as a pandas DataFrame.
This is a convenience wrapper over load_arrow_table and pushes down
filters when provided.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Relative dataset path. | required |
| where | str \| None | Optional SQL filter predicate injected into the query. | None |

Returns:

| Type | Description |
|---|---|
| pd.DataFrame | A pandas DataFrame. |
Source code in src/smallcat/datasets/base_dataset.py
save_pandas ¶
save_pandas(path: str, df: pd.DataFrame) -> None
Persist a pandas DataFrame.
Converts the DataFrame to a pyarrow.Table and delegates to
save_arrow_table.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Relative dataset path. | required |
| df | pd.DataFrame | DataFrame to persist. | required |

Returns:

| Type | Description |
|---|---|
| None | None |
Source code in src/smallcat/datasets/base_dataset.py
Options¶
smallcat.datasets.parquet_dataset.ParquetLoadOptions ¶
Bases: BaseModel
Options that control how Parquet is read via DuckDB.
Attributes:

| Name | Type | Description |
|---|---|---|
| binary_as_string | bool \| None | If True, interpret BINARY columns as strings. |
| file_row_number | bool \| None | If True, include a synthetic row-number column per file. |
| hive_partitioning | bool \| None | If True, parse Hive-style directory partitions. |
| union_by_name | bool \| None | If True, align/union schemas by column name across files. |
Source code in src/smallcat/datasets/parquet_dataset.py
smallcat.datasets.parquet_dataset.ParquetSaveOptions ¶
Bases: BaseModel
Options that control how Parquet is written via DuckDB.
Attributes:

| Name | Type | Description |
|---|---|---|
| overwrite | bool \| None | If True, overwrite existing output. |
| partition_by | list[str] \| None | Columns to partition by (Hive-style layout). |
| write_partition_columns | bool \| None | If True, also materialize partition columns in files. |
Source code in src/smallcat/datasets/parquet_dataset.py
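A sketch pairing the partition-aware read and write options above (connection ID, paths, and the partition column are illustrative):

```python
from smallcat.datasets.parquet_dataset import ParquetDataset, ParquetLoadOptions, ParquetSaveOptions

ds = ParquetDataset.from_conn_id(
    "gcs_conn",
    load_options=ParquetLoadOptions(hive_partitioning=True, union_by_name=True),
    save_options=ParquetSaveOptions(partition_by=["event_date"], overwrite=True),
)

# Read a Hive-partitioned directory; schemas are unioned by column name across files.
tbl = ds.load_arrow_table("bronze/events/")

# Rewrite under a new prefix with a Hive-style event_date=... layout.
ds.save_arrow_table("silver/events/", tbl)
```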
Delta Table¶
Delta Lake dataset using delta-rs (deltalake) with Smallcat.
This module implements DeltaTableDataset, a Delta Lake reader/writer
powered by deltalake (delta-rs). It resolves relative paths against the
connection base (e.g., gs://bucket/prefix) and returns/accepts Arrow tables.
Storage backends
- Local filesystem (fs) - no extra options.
- Google Cloud Storage (google_cloud_platform) - credentials derived from connection extras: keyfile_dict / keyfile / key_path.
- Databricks - minimal env vars exported (workspace URL and token).
Example
ds = DeltaTableDataset.from_conn_id("gcs_delta")
tbl = ds.load_arrow_table("bronze/events_delta")
ds.save_arrow_table("silver/events_delta", tbl)
Notes
For Databricks, this module sets:
DATABRICKS_WORKSPACE_URL and DATABRICKS_ACCESS_TOKEN before access.
smallcat.datasets.delta_table_dataset.DeltaTableDataset ¶
Bases: BaseDataset[DeltaTableLoadOptions, DeltaTableSaveOptions]
Delta Lake dataset that reads/writes via delta-rs (DeltaTable / write_deltalake).
Paths passed to public methods are treated as relative to the dataset's
configured base (e.g., local file:// or gs://). Reads return a
PyArrow table.
Notes
- For Google Cloud Storage, credentials are derived from the connection's extras (e.g., keyfile_dict, keyfile, or key_path).
- For conn_type == "databricks", environment variables are set to support Databricks-hosted Delta.
Source code in src/smallcat/datasets/delta_table_dataset.py
_delta_storage_options ¶
_delta_storage_options() -> dict
Build storage_options for delta-rs reads/writes.
Returns:

| Type | Description |
|---|---|
| dict | A mapping suitable for the storage_options argument of delta-rs reads/writes; the keys depend on the backend and are derived from the connection extras. |

Raises:

| Type | Description |
|---|---|
| ValueError | If the connection type is not supported. |
Source code in src/smallcat/datasets/delta_table_dataset.py
_set_databricks_acces_variables ¶
Export minimal environment variables for Databricks-hosted Delta.
Sets:
- DATABRICKS_WORKSPACE_URL from self.conn.host
- DATABRICKS_ACCESS_TOKEN from self.conn.password
Notes
These variables are used by delta-rs when accessing Databricks.
Source code in src/smallcat/datasets/delta_table_dataset.py
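Conceptually, the export amounts to the following sketch, based only on the notes above (the helper name and parameters here are illustrative, not the library's internals):

```python
import os

def export_databricks_env(host: str, token: str) -> None:
    """Sketch: export the two variables delta-rs needs for Databricks-hosted Delta."""
    os.environ["DATABRICKS_WORKSPACE_URL"] = host   # from self.conn.host in the real method
    os.environ["DATABRICKS_ACCESS_TOKEN"] = token   # from self.conn.password in the real method
```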
load_arrow_record_batch_reader ¶
Stream Delta Lake rows via DuckDB with an optional filter.
Source code in src/smallcat/datasets/delta_table_dataset.py
save_arrow_table ¶
save_arrow_table(path: str, table: pa.Table) -> None
Write a PyArrow table to Delta Lake using delta-rs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Relative path to the target Delta table (joined under the dataset's base URI). | required |
| table | pa.Table | The Arrow table to write. | required |
Notes
- If conn_type == "databricks", this method sets Databricks environment variables via _set_databricks_acces_variables. (The write is otherwise handled by delta-rs for non-Databricks.)
Source code in src/smallcat/datasets/delta_table_dataset.py
from_conn_id classmethod ¶
from_conn_id(conn_id: str, *, load_options: L | None = None, save_options: S | None = None) -> BaseDataset[L, S]
Construct an instance by looking up an Airflow connection ID.
Uses airflow.hooks.base.BaseHook (or the SDK alternative) to fetch
the connection and then calls the class constructor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
conn_id
|
str
|
Airflow connection ID to resolve. |
required |
load_options
|
L | None
|
Optional load options model. |
None
|
save_options
|
S | None
|
Optional save options model. |
None
|
Returns:
| Type | Description |
|---|---|
BaseDataset[L, S]
|
A fully constructed |
Source code in src/smallcat/datasets/base_dataset.py
load_pandas ¶
Load data as a pandas DataFrame.
This is a convenience wrapper over load_arrow_table and pushes down
filters when provided.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Relative dataset path. | required |
| where | str \| None | Optional SQL filter predicate injected into the query. | None |

Returns:

| Type | Description |
|---|---|
| pd.DataFrame | A pandas DataFrame. |
Source code in src/smallcat/datasets/base_dataset.py
save_pandas ¶
save_pandas(path: str, df: pd.DataFrame) -> None
Persist a pandas DataFrame.
Converts the DataFrame to a pyarrow.Table and delegates to
save_arrow_table.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Relative dataset path. | required |
| df | pd.DataFrame | DataFrame to persist. | required |

Returns:

| Type | Description |
|---|---|
| None | None |
Source code in src/smallcat/datasets/base_dataset.py
Options¶
smallcat.datasets.delta_table_dataset.DeltaTableLoadOptions ¶
Bases: BaseModel
Options controlling how a Delta table is read.
Attributes:

| Name | Type | Description |
|---|---|---|
| version | int \| None | Optional table version to read. |
| without_files | bool \| None | If True, skip listing data files (metadata-only read). |
| log_buffer_size | int \| None | Buffer size for reading Delta logs. |
Source code in src/smallcat/datasets/delta_table_dataset.py
smallcat.datasets.delta_table_dataset.DeltaTableSaveOptions ¶
Bases: BaseModel
Options controlling how a Delta table is written.
Attributes:

| Name | Type | Description |
|---|---|---|
| mode | WriteMode \| None | Write mode to apply if the table exists. |
| partition_by | list[str] \| None | Columns to partition by (Hive-style directory layout). |
| schema_mode | SchemaMode \| None | Strategy to reconcile schema differences during write. |
Source code in src/smallcat/datasets/delta_table_dataset.py
mode class-attribute instance-attribute ¶
partition_by class-attribute instance-attribute ¶
partition_by: list[str] | None = Field(None, description='Columns to partition by (Hive-style directory layout).')
schema_mode class-attribute instance-attribute ¶
schema_mode: SchemaMode | None = Field(None, description='How to handle schema differences on write.')
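A combined sketch of the Delta options: a pinned-version read and a partitioned write (connection ID, paths, version number, and the partition column are illustrative; valid mode / schema_mode values are defined by the WriteMode / SchemaMode types, which are not reproduced here):

```python
from smallcat.datasets.delta_table_dataset import (
    DeltaTableDataset,
    DeltaTableLoadOptions,
    DeltaTableSaveOptions,
)

ds = DeltaTableDataset.from_conn_id(
    "gcs_delta",
    load_options=DeltaTableLoadOptions(version=42),                  # time-travel read
    save_options=DeltaTableSaveOptions(partition_by=["event_date"]),
)

tbl = ds.load_arrow_table("bronze/events_delta")   # reads table version 42
ds.save_arrow_table("silver/events_delta", tbl)    # partitioned by event_date
```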