# Datasets
How datasets work in the Bivariant platform: structured data storage with partitioning, S3-compatible object storage, and support for the CSV, JSON, and Parquet formats.
Datasets provide structured, bulk data storage on the platform. While collections are designed for transactional record management, datasets are optimized for larger volumes of data used in analytics, training, and batch processing.
## When to use datasets
| Use case | Recommended store |
|---|---|
| Application records (CRM contacts, tickets) | Collections |
| Training data for models | Datasets |
| Analytics and reporting data | Datasets |
| Bulk imports and exports | Datasets |
| Logs and event archives | Datasets |
## Dataset structure
A dataset consists of:
| Component | Description |
|---|---|
| Schema | Column names, types, and constraints |
| Partitions | Logical divisions of data (by date, category, or custom key) |
| Storage | The underlying storage backend (S3-compatible) |
| Format | The data serialization format |
## Supported formats
Datasets support three data formats:
| Format | Best for | Characteristics |
|---|---|---|
| CSV | Human-readable data, spreadsheet interoperability | Text-based, widely compatible, larger file size |
| JSON | Nested structures, API responses | Flexible schema, human-readable |
| Parquet | Analytical workloads, large volumes | Columnar, compressed, fast query performance |
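The trade-offs are easy to see by writing the same data in each format. The sketch below uses pandas (with pyarrow installed for Parquet support); the file names and sample table are illustrative, not platform APIs.

```python
import os

import pandas as pd

# A small sample table; columnar formats shine as row counts grow.
df = pd.DataFrame({
    "event_id": range(10_000),
    "region": ["eu-west", "us-east"] * 5_000,
    "value": [i * 0.5 for i in range(10_000)],
})

# Write the same data in each supported format.
df.to_csv("events.csv", index=False)         # text-based, widely compatible
df.to_json("events.json", orient="records")  # flexible, nested-friendly
df.to_parquet("events.parquet")              # columnar, compressed

for path in ("events.csv", "events.json", "events.parquet"):
    print(f"{path}: {os.path.getsize(path):,} bytes")
```

On a table like this, the Parquet file is typically the smallest of the three by a wide margin, while the JSON output is the largest because every record repeats the column names.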
## Partitioning
Partitioning divides a dataset into segments for efficient access and management:
```
my-dataset/
  year=2024/
    month=01/
      data.parquet
    month=02/
      data.parquet
  year=2025/
    month=01/
      data.parquet
```

### Benefits of partitioning
- Query performance — read only the partitions you need
- Data management — delete or archive partitions independently
- Parallel processing — process partitions concurrently
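The layout above is the hive-style `key=value` scheme, which common tooling can read directly. As a local sketch of how partition pruning works (using pyarrow rather than the platform API):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Build a table that includes the partition columns.
table = pa.table({
    "year":  [2024, 2024, 2025],
    "month": [1, 2, 1],
    "value": [10.0, 20.0, 30.0],
})

# Write hive-style partitions: my-dataset/year=.../month=.../*.parquet
pq.write_to_dataset(table, root_path="my-dataset",
                    partition_cols=["year", "month"])

# Read with a partition filter; directories that cannot match the
# predicate are never opened.
dataset = ds.dataset("my-dataset", format="parquet", partitioning="hive")
print(dataset.to_table(filter=ds.field("year") == 2025).to_pandas())
```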
### Partition keys
You define partition keys when creating the dataset. Common partition strategies:
| Strategy | Keys | Example use cases |
|---|---|---|
| Time-based | year, month, day | Log data, event streams |
| Category-based | region, type | Multi-region datasets |
| Source-based | source, provider | Data from multiple integrations |
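For a time-based strategy, partition values are usually derived from a timestamp field on each record. A minimal sketch with a hypothetical helper (the platform derives these paths itself once partition keys are defined):

```python
from datetime import datetime, timezone

def partition_path(dataset: str, ts: datetime) -> str:
    """Build a hive-style path for year/month/day partition keys.

    Hypothetical helper, for illustration only.
    """
    return f"{dataset}/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"

print(partition_path("event-logs", datetime(2025, 1, 15, tzinfo=timezone.utc)))
# event-logs/year=2025/month=01/day=15
```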
## Storage
Datasets are stored on S3-compatible object storage. The platform manages:
- Bucket allocation — each space has dedicated storage
- Access control — data is accessible only within the space
- Lifecycle management — automatic cleanup based on retention policies
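Because the backend speaks the S3 API, a partition file is ultimately just an object under the dataset prefix. The sketch below uses boto3 with a placeholder endpoint and bucket; in practice the platform allocates these per space and scopes the credentials.

```python
import boto3

# Placeholder endpoint and bucket: the platform provisions the real
# values per space; credentials are resolved from the environment.
s3 = boto3.client("s3", endpoint_url="https://storage.example.com")

# A partition file is an ordinary object under the dataset prefix.
s3.upload_file(
    Filename="data.parquet",
    Bucket="space-1234-datasets",
    Key="my-dataset/year=2025/month=01/data.parquet",
)
```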
## Operations
### Creating a dataset
Define the schema, format, and partition keys:
- Schema — column names and types
- Format — CSV, JSON, or Parquet
- Partition keys — the fields used for partitioning
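Sketched as an API call, this might look like the following. The endpoint URL and payload shape are assumptions; only the three required inputs above are confirmed by this page.

```python
import requests

# Hypothetical endpoint and payload shape; the confirmed inputs are the
# schema, the format, and the partition keys.
payload = {
    "name": "event-logs",
    "format": "parquet",
    "schema": [
        {"name": "event_id",    "type": "string"},
        {"name": "occurred_at", "type": "timestamp"},
        {"name": "value",       "type": "double"},
    ],
    "partition_keys": ["year", "month"],
}

resp = requests.post(
    "https://api.example.com/v1/datasets",        # placeholder URL
    json=payload,
    headers={"Authorization": "Bearer <token>"},  # placeholder token
)
resp.raise_for_status()
```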
### Writing data
Data can be written to datasets through:
- Bulk upload — upload files directly
- Flow actions — write data as part of a flow execution
- API — programmatic writes from external systems
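As a local sketch of the bulk path (continuing the pyarrow example; the platform's upload routes would accept the same files): appending a new month's data only adds a partition directory and leaves existing partitions untouched.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# New rows for a partition that does not exist yet.
new_rows = pa.table({
    "year":  [2025, 2025],
    "month": [2, 2],
    "value": [40.0, 50.0],
})

# Creates my-dataset/year=2025/month=02/ without rewriting other partitions.
pq.write_to_dataset(new_rows, root_path="my-dataset",
                    partition_cols=["year", "month"])
```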
### Reading data
Data can be read in several ways:
- By partition — retrieve data from specific partitions
- Full scan — read all partitions (use with caution on large datasets)
- API — programmatic reads with partition filters
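Continuing the pyarrow sketch, the difference between a partition-filtered read and a full scan looks like this:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("my-dataset", format="parquet", partitioning="hive")

# By partition: only year=2025/month=02 directories are opened.
recent = dataset.to_table(
    filter=(ds.field("year") == 2025) & (ds.field("month") == 2)
)

# Full scan: every partition is read; costly on large datasets.
everything = dataset.to_table()

print(recent.num_rows, everything.num_rows)
```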
## Integration with flows
Datasets integrate with the flow engine through dedicated actions:
- Read Dataset — load data from a dataset into a flow step
- Write Dataset — write flow output to a dataset partition
- Transform Dataset — apply transformations across partitions
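A flow wiring these actions together might look like the sketch below. The action names come from the list above; the configuration shape is an assumption, not the platform's documented flow schema.

```python
# Hypothetical flow definition; only the three dataset action names are
# taken from this page.
flow = {
    "name": "monthly-rollup",
    "steps": [
        {"action": "read_dataset",
         "dataset": "event-logs",
         "partitions": {"year": 2025, "month": 1}},
        {"action": "transform_dataset",
         "transform": "sum value by region"},  # placeholder transform
        {"action": "write_dataset",
         "dataset": "monthly-totals",
         "partition": {"year": 2025, "month": 1}},
    ],
}
```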
## Related concepts
- Collections — transactional record storage for application data
- Assets — file and media storage
- Flows — orchestrate data processing with dataset actions