Supported formats

Unibox routes files by extension and URI scheme. This page summarizes the current mapping.

File types by extension

Type     Extensions                  Loader            Notes
Tabular  .csv                        CSVLoader         Loads to DataFrame
Tabular  .parquet                    ParquetLoader     Loads to DataFrame
Tabular  .cdc.parquet, .parquet.cdc  CdcParquetLoader  Parquet with content-defined chunking enabled
JSON     .json                       JSONLoader        Loads to dict/list
JSONL    .jsonl                      JSONLLoader       Loads to list of dicts
Text     .txt, .md, .markdown        TxtLoader         Loads to string
Images   common image types          ImageLoader       Returns PIL images or arrays
Config   .yaml, .yml                 YAMLLoader        Loads to dict
Config   .toml                       TOMLLoader        Loads to dict
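
The routing in the table above can be pictured as a suffix lookup. The following is a minimal sketch, not Unibox's actual implementation: the EXTENSION_TO_LOADER dict and the pick_loader function are hypothetical names, loaders are represented as plain strings, and image extensions are left out because they are defined elsewhere. It assumes compound suffixes such as .cdc.parquet should win over .parquet, so it tries the longest suffix first.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical mapping that mirrors the table; values are loader names as
# strings, not the real Unibox classes.
EXTENSION_TO_LOADER = {
    ".csv": "CSVLoader",
    ".parquet": "ParquetLoader",
    ".cdc.parquet": "CdcParquetLoader",
    ".parquet.cdc": "CdcParquetLoader",
    ".json": "JSONLoader",
    ".jsonl": "JSONLLoader",
    ".txt": "TxtLoader",
    ".md": "TxtLoader",
    ".markdown": "TxtLoader",
    ".yaml": "YAMLLoader",
    ".yml": "YAMLLoader",
    ".toml": "TOMLLoader",
}


def pick_loader(path: str) -> Optional[str]:
    """Return the loader name for a path, or None if no suffix matches."""
    name = PurePosixPath(path).name.lower()
    # Assumption: check longer suffixes first so ".cdc.parquet" is not
    # swallowed by the plain ".parquet" entry.
    for ext in sorted(EXTENSION_TO_LOADER, key=len, reverse=True):
        if name.endswith(ext):
            return EXTENSION_TO_LOADER[ext]
    return None


if __name__ == "__main__":
    for example in ["data/train.cdc.parquet", "data/train.parquet", "README.md"]:
        print(example, "->", pick_loader(example))
```

Matching on the lowercased file name is also an assumption; whether Unibox treats extensions case-insensitively is not stated on this page.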

Note

Image extensions are defined in src/unibox/utils/constants.py.

Tip

For parquet saves, enable content-defined chunking by passing cdc=True, as in ub.saves(df, "path.parquet", cdc=True), or by passing use_content_defined_chunking=True.

Hugging Face URIs

  • hf://owner/repo (no file extension) is treated as a dataset.
  • hf://owner/repo/path/file.ext is treated as a file and uses the extension mapping above.
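
The two rules above can be sketched as a small classifier. This is an illustration under stated assumptions, not Unibox's parser: classify_hf_uri is a hypothetical helper, the repo id is assumed to be exactly the first two path segments, and only the part after owner/repo is checked for an extension.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

HF_PREFIX = "hf://"


def classify_hf_uri(uri: str) -> Tuple[str, str, Optional[str]]:
    """Split an hf:// URI into (kind, repo_id, path_in_repo).

    kind is "dataset" when no file extension is present and "file" otherwise.
    """
    if not uri.startswith(HF_PREFIX):
        raise ValueError(f"not an hf:// URI: {uri!r}")

    parts = [p for p in uri[len(HF_PREFIX):].split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected hf://owner/repo[/path], got {uri!r}")

    # Assumption: the repo id is always "owner/repo".
    repo_id = "/".join(parts[:2])
    inner = "/".join(parts[2:])

    # Assumption: only the path inside the repo decides file vs. dataset.
    if inner and PurePosixPath(inner).suffix:
        return ("file", repo_id, inner)
    return ("dataset", repo_id, inner or None)


if __name__ == "__main__":
    print(classify_hf_uri("hf://owner/repo"))
    print(classify_hf_uri("hf://owner/repo/path/file.ext"))
```

A URI classified as "file" would then go through the extension mapping in the table, using the suffix of path_in_repo.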

JSON-like saves to Hugging Face

When saving to a dataset URI, ub.saves also accepts JSON-like inputs:

  • dict
  • list of dicts (JSONL-style)
  • list of scalars

These are converted into a DataFrame and then uploaded as a dataset.
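
One way to picture that conversion is to normalize each accepted input into a list of row dicts, a shape that pandas.DataFrame accepts directly. The sketch below is hypothetical and may differ from what Unibox does internally: to_rows is an invented helper, a single dict is assumed to become one row, and scalars are assumed to land in a column named "value".

```python
from typing import Any, Dict, List


def to_rows(data: Any, scalar_column: str = "value") -> List[Dict[str, Any]]:
    """Normalize a JSON-like input into a list of row dicts."""
    # Assumption: a single dict is treated as one row.
    if isinstance(data, dict):
        return [data]

    if isinstance(data, list):
        # List of dicts (JSONL-style): each dict is already a row.
        if all(isinstance(item, dict) for item in data):
            return list(data)
        # List of scalars: wrap each value in a one-column row.
        # Assumption: the column name "value" is illustrative only.
        if not any(isinstance(item, (dict, list)) for item in data):
            return [{scalar_column: item} for item in data]

    raise TypeError(f"unsupported JSON-like input: {type(data).__name__}")


if __name__ == "__main__":
    print(to_rows({"id": 1, "text": "a"}))
    print(to_rows([{"id": 1}, {"id": 2}]))
    print(to_rows([1, 2, 3]))
```

Passing the result to pandas.DataFrame(rows) would then produce a table with one row per dict. Mixed lists (dicts alongside scalars) are rejected here, since this page does not say how they are handled.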

Next steps