Datasets

How datasets work in the Bivariant platform — structured data storage with partitioning, multiple storage modes, and support for CSV, JSON, and Parquet formats.

Datasets provide structured, bulk data storage on the platform. While collections are designed for transactional record management, datasets are optimized for larger volumes of data used in analytics, training, and batch processing.

When to use datasets

| Use case | Recommended store |
| --- | --- |
| Application records (CRM contacts, tickets) | Collections |
| Training data for models | Datasets |
| Analytics and reporting data | Datasets |
| Bulk imports and exports | Datasets |
| Logs and event archives | Datasets |

Dataset structure

A dataset consists of:

| Component | Description |
| --- | --- |
| Schema | Column names, types, and constraints |
| Partitions | Logical divisions of data (by date, category, or custom key) |
| Storage | The underlying storage backend (S3-compatible) |
| Format | The data serialization format |

Supported formats

Datasets support three data formats:

| Format | Best for | Characteristics |
| --- | --- | --- |
| CSV | Human-readable data, spreadsheet interoperability | Text-based, widely compatible, larger file size |
| JSON | Nested structures, API responses | Flexible schema, human-readable |
| Parquet | Analytical workloads, large volumes | Columnar, compressed, fast query performance |
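
To get a local feel for the trade-offs, the same table can be written in all three formats with pandas. This is a sketch outside the platform: the file names and columns are illustrative, and `to_parquet` assumes pyarrow (or fastparquet) is installed.

```python
import pandas as pd

# Illustrative table; the column names are made up for this example.
df = pd.DataFrame({
    "event_id": [1, 2, 3],
    "region": ["eu", "us", "eu"],
    "value": [0.42, 1.07, 0.99],
})

df.to_csv("events.csv", index=False)         # text-based, widely compatible
df.to_json("events.json", orient="records")  # flexible schema, human-readable
df.to_parquet("events.parquet")              # columnar and compressed; needs pyarrow
```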

Partitioning

Partitioning divides a dataset into segments for efficient access and management:

```
my-dataset/
  year=2024/
    month=01/
      data.parquet
    month=02/
      data.parquet
  year=2025/
    month=01/
      data.parquet
```
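
The `year=.../month=...` layout above is Hive-style partitioning. As a sketch of how such a layout is produced outside the platform, pyarrow can write it directly; the column names mirror the tree above, though note that pyarrow does not zero-pad partition values (`month=1` rather than `month=01`).

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "year": [2024, 2024, 2025],
    "month": [1, 2, 1],
    "value": [10.0, 12.5, 9.8],
})

# Creates my-dataset/year=2024/month=1/... with one file per partition.
pq.write_to_dataset(table, root_path="my-dataset", partition_cols=["year", "month"])
```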

Benefits of partitioning

  • Query performance — read only the partitions you need
  • Data management — delete or archive partitions independently
  • Parallel processing — process partitions concurrently
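
The query-performance benefit comes from partition pruning: a filter on the partition keys means only matching directories are opened. A minimal sketch with pyarrow, assuming the Hive-style layout shown above:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("my-dataset", format="parquet", partitioning="hive")

# Only files under year=2024/month=1/ are read; other partitions are skipped.
table = dataset.to_table(filter=(ds.field("year") == 2024) & (ds.field("month") == 1))
```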

Partition keys

You define partition keys when creating the dataset. Common partition strategies:

| Strategy | Keys | Example |
| --- | --- | --- |
| Time-based | year, month, day | Log data, event streams |
| Category-based | region, type | Multi-region datasets |
| Source-based | source, provider | Data from multiple integrations |

Storage

Datasets are stored on S3-compatible object storage. The platform manages:

  • Bucket allocation — each space has dedicated storage
  • Access control — data is accessible only within the space
  • Lifecycle management — automatic cleanup based on retention policies
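
Because the backend is S3-compatible, partitioned data can be inspected with standard S3 tooling where direct access is available. A sketch with boto3; the endpoint, bucket, and prefix names here are hypothetical, since the platform allocates storage per space:

```python
import boto3

# Hypothetical endpoint and bucket names; in practice the platform manages these.
s3 = boto3.client("s3", endpoint_url="https://objects.example.com")

resp = s3.list_objects_v2(Bucket="space-1234-datasets", Prefix="my-dataset/year=2024/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```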

Operations

Creating a dataset

Define the schema, format, and partition keys:

  • Schema — column names and types
  • Format — CSV, JSON, or Parquet
  • Partition keys — the fields used for partitioning
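
The platform's SDK is not shown in this section, so the following is a hypothetical client sketch: `BivariantClient`, `create_dataset`, and all parameter names are assumptions, used only to show the three parts of a definition together.

```python
# Hypothetical client; class, method, and parameter names are illustrative.
from bivariant import BivariantClient  # assumed package name

client = BivariantClient(space="analytics")

client.create_dataset(
    name="web-events",
    schema={"event_id": "int64", "region": "string", "value": "float64"},
    format="parquet",                   # CSV, JSON, or Parquet
    partition_keys=["year", "month"],   # fields used for partitioning
)
```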

Writing data

Data can be written to datasets through:

  • Bulk upload — upload files directly
  • Flow actions — write data as part of a flow execution
  • API — programmatic writes from external systems
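
Continuing the hypothetical client sketch from above (`write_dataset` and its parameters are assumptions), a programmatic write targeting a single partition might look like this:

```python
# Hypothetical client; method and parameter names are illustrative.
from bivariant import BivariantClient  # assumed package name
import pandas as pd

client = BivariantClient(space="analytics")
batch = pd.DataFrame({"event_id": [1, 2], "region": ["eu", "us"], "value": [0.42, 1.07]})

# Write a batch into the year=2024/month=1 partition.
client.write_dataset("web-events", batch, partition={"year": 2024, "month": 1})
```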

Reading data

Data can be read in several ways:

  • By partition — retrieve data from specific partitions
  • Full scan — read all partitions (use with caution on large datasets)
  • API — programmatic reads with partition filters
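
In the same hypothetical client sketch (`read_dataset` is an assumption), the difference between a partition read and a full scan:

```python
# Hypothetical client; method and parameter names are illustrative.
from bivariant import BivariantClient  # assumed package name

client = BivariantClient(space="analytics")

# By partition: only year=2024/month=1 is scanned.
january = client.read_dataset("web-events", partition={"year": 2024, "month": 1})

# Full scan: reads every partition; use with caution on large datasets.
everything = client.read_dataset("web-events")
```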

Integration with flows

Datasets integrate with the flow engine through dedicated actions:

  • Read Dataset — load data from a dataset into a flow step
  • Write Dataset — write flow output to a dataset partition
  • Transform Dataset — apply transformations across partitions
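
As a sketch of how the three actions might chain together in a flow (every name below is an assumption, not the real flow-engine API):

```python
# Hypothetical flow definition; module, class, and parameter names are illustrative.
from bivariant.flows import Flow, ReadDataset, TransformDataset, WriteDataset  # assumed

flow = Flow("monthly-rollup")

raw = flow.add(ReadDataset(dataset="web-events", partition={"year": 2024, "month": 1}))
agg = flow.add(TransformDataset(input=raw, group_by="region", aggregate={"value": "sum"}))
flow.add(WriteDataset(input=agg, dataset="web-events-summary",
                      partition={"year": 2024, "month": 1}))
```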

Related

  • Collections — transactional record storage for application data
  • Assets — file and media storage
  • Flows — orchestrate data processing with dataset actions