
A Crowdee dataset is a named, multi-modal file collection scoped to your organization. When you create one, you give it a name and declare its modality — image, audio, video, text, document, or multimodal — and you can also set a stable identifier that you control, so you can reference the dataset consistently instead of relying on an internal ID.
Datasets are not flat file lists — they are containers for versions. Each version is a snapshot of the dataset at a point in time. You upload files to the raw version, and those files become the starting point for all downstream processing. Versions are never mutated after they are created; instead, cleaning and enrichment operations produce new derived versions that keep a link back to the version they came from. This immutability guarantee means you can always trace how a file got to its current state.
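The container-of-versions model can be pictured with a small sketch. This is an illustration only, not Crowdee's internal schema: the class names, field names, and the `derive` helper are all hypothetical, and the example dataset name and identifiers are made up.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Modality(Enum):
    # The six modalities named in this article.
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    TEXT = "text"
    DOCUMENT = "document"
    MULTIMODAL = "multimodal"


@dataclass(frozen=True)
class DatasetVersion:
    """A snapshot; frozen=True rejects attribute reassignment after creation."""
    version_id: str                   # hypothetical field name
    files: tuple[str, ...]
    parent_id: Optional[str] = None   # None for the raw version


@dataclass
class Dataset:
    name: str
    modality: Modality
    stable_id: str                    # the identifier you control
    versions: list[DatasetVersion] = field(default_factory=list)

    def derive(self, parent: DatasetVersion, version_id: str,
               files: tuple[str, ...]) -> DatasetVersion:
        """Add a new version linked to its parent; the parent is not modified."""
        child = DatasetVersion(version_id, files, parent_id=parent.version_id)
        self.versions.append(child)
        return child


dataset = Dataset("call-recordings", Modality.AUDIO, "calls-2024")
raw = DatasetVersion("v1", ("a.wav", "b.wav"))
dataset.versions.append(raw)
cleaned = dataset.derive(raw, "v2", ("a_clean.wav", "b_clean.wav"))
```

In this sketch the derived version carries only a reference to its parent, so the raw snapshot stays exactly as it was uploaded.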
Every dataset version carries a status that moves through a defined set of states. The lifecycle for a version that goes through both cleaning and enrichment looks like this:
| State | Meaning |
|---|---|
| raw | Files have been uploaded; no processing has started |
| cleaning | A cleaning operation is running |
| cleaned | Cleaning completed successfully |
| enriching | Per-file enrichment is running |
| cleaned_and_enriched | Enrichment completed on a cleaned version |
| failed | The most recent operation failed |
Not every version visits every state. A version that is enriched directly without cleaning will go raw → enriching → cleaned_and_enriched (the state name reflects the end result, not the path taken to get there). While an operation is running, Crowdee keeps track of it internally so that progress can be reported back as files are processed.
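One way to reason about these states is as a small state machine. The transition map below is an assumption inferred from the table and the two paths described above. It is not an official specification, and it deliberately leaves out what happens after `failed`, since retry behavior is not described here.

```python
from enum import Enum


class Status(Enum):
    RAW = "raw"
    CLEANING = "cleaning"
    CLEANED = "cleaned"
    ENRICHING = "enriching"
    CLEANED_AND_ENRICHED = "cleaned_and_enriched"
    FAILED = "failed"


# Assumed transitions, inferred from the article; not an official list.
ALLOWED = {
    Status.RAW: {Status.CLEANING, Status.ENRICHING},
    Status.CLEANING: {Status.CLEANED, Status.FAILED},
    Status.CLEANED: {Status.ENRICHING},
    Status.ENRICHING: {Status.CLEANED_AND_ENRICHED, Status.FAILED},
    Status.CLEANED_AND_ENRICHED: set(),
    Status.FAILED: set(),  # retry behavior is not specified in this article
}


def is_valid_path(path):
    """Return True if every consecutive pair of states is an allowed transition."""
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(path, path[1:]))


full_path = [Status.RAW, Status.CLEANING, Status.CLEANED,
             Status.ENRICHING, Status.CLEANED_AND_ENRICHED]
direct_enrichment = [Status.RAW, Status.ENRICHING, Status.CLEANED_AND_ENRICHED]
skips_running_state = [Status.RAW, Status.CLEANED]
```

Under these assumptions, `is_valid_path` accepts both `full_path` and `direct_enrichment` and rejects `skips_running_state`.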
You can list all versions of a dataset and pull a paginated file listing for any specific version through the Crowdee API — see the API reference for details. Checking a version's status this way is the right approach for finding out whether a cleaning or enrichment step has finished.
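A status check like this is usually wrapped in a polling loop. The sketch below avoids guessing at the Crowdee API: `fetch_status` is a hypothetical callable that you would implement yourself using whatever call the API reference documents, and the interval and timeout values are arbitrary defaults.

```python
import time

# States in which an operation is still in progress, per the table above.
RUNNING = {"cleaning", "enriching"}


def wait_until_settled(fetch_status, interval_s=5.0, timeout_s=600.0,
                       sleep=time.sleep):
    """Poll until the version leaves a running state, then return its status.

    fetch_status: hypothetical zero-argument callable returning the status string.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status not in RUNNING:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"version still {status!r} after {timeout_s}s")
        sleep(interval_s)


# Demo with canned statuses instead of a real API call.
fake_statuses = iter(["cleaning", "cleaning", "cleaned"])
final_status = wait_until_settled(lambda: next(fake_statuses),
                                  sleep=lambda seconds: None)
```

The function returns on `failed` as well, so the caller should still check the returned value before moving on to the next step.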
When a cleaning operation produces a derived version, it records, for every output file, which input file it came from. This is how the platform answers the question "what was this file before it was cleaned?"
Provenance chains can span multiple hops. If you clean a version and then enrich the cleaned version, each enriched file traces back to the cleaned version, and each cleaned file traces back to the raw version. The full ancestry of any file in the system is therefore recoverable by following that chain back, step by step.
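Following the chain back can be sketched as a simple loop over parent links. The mapping format and the file identifiers below are hypothetical and chosen only for illustration; the API reference describes how provenance is actually exposed.

```python
# Hypothetical provenance records: output file ID -> input file ID.
# Files in the raw version have no entry because nothing precedes them.
derived_from = {
    "enriched/a.wav": "cleaned/a.wav",
    "cleaned/a.wav": "raw/a.wav",
}


def trace_ancestry(file_id, derived_from):
    """Return the chain from file_id back to its raw ancestor, newest first."""
    chain = [file_id]
    seen = {file_id}
    while chain[-1] in derived_from:
        parent = derived_from[chain[-1]]
        if parent in seen:
            # Guard against malformed records that would loop forever.
            raise ValueError(f"provenance cycle detected at {parent!r}")
        chain.append(parent)
        seen.add(parent)
    return chain


ancestry = trace_ancestry("enriched/a.wav", derived_from)
# ancestry == ["enriched/a.wav", "cleaned/a.wav", "raw/a.wav"]
```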
This design also enables partial failure recovery. If a cleaning operation fails on some files but succeeds on others, the successfully processed files keep their provenance link, and the version as a whole transitions to failed rather than discarding the partial output. You can inspect which files were processed and which were not before deciding whether to retry.
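Because each successful output keeps its provenance link, the files that were not processed can be found by comparing the input listing with the recorded links. This is a minimal sketch with made-up file names, and it assumes you have already fetched both the input file list and the output-to-input records.

```python
def unprocessed_inputs(input_files, provenance):
    """Return input files that no output file traces back to.

    provenance: hypothetical mapping of output file ID -> input file ID.
    """
    processed = set(provenance.values())
    return [f for f in input_files if f not in processed]


inputs = ["raw/a.wav", "raw/b.wav", "raw/c.wav"]
partial_provenance = {
    "cleaned/a.wav": "raw/a.wav",
    "cleaned/c.wav": "raw/c.wav",
}
to_retry = unprocessed_inputs(inputs, partial_provenance)
# to_retry == ["raw/b.wav"]
```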
Cleaning is triggered by picking a version and choosing a cleaning pipeline — for example, one that strips leading and trailing silence from audio files. Crowdee creates a new derived version linked back to the current one, marks it as cleaning, and runs the pipeline in the background. The original version is left untouched.
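To make the silence-stripping example concrete, here is a toy version of the idea. It is not Crowdee's pipeline: it assumes audio as a list of float samples and uses an arbitrary amplitude threshold, whereas a production pipeline would typically work on frames and energy levels. Like a cleaning operation, it returns new data and leaves its input untouched.

```python
def trim_silence(samples, threshold=0.01):
    """Return a copy of samples without leading and trailing near-silent values.

    threshold is an illustrative amplitude cutoff, not a Crowdee setting.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return []  # the whole clip is below the threshold
    return samples[loud[0]:loud[-1] + 1]


original = [0.0, 0.0, 0.5, 0.0, -0.3, 0.0]
trimmed = trim_silence(original)
# trimmed == [0.5, 0.0, -0.3]; silence inside the clip is preserved
```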
You can discover which cleaning pipelines are available for your dataset's modality directly in the product, or via the API reference, which lists each pipeline's expected inputs and configuration options.
Enrichment works a bit differently: instead of creating a new version, it runs against the files already in the existing version and updates their metadata in place, moving the version status through enriching to cleaned_and_enriched, whether the version started out raw or had already been cleaned. Behind the scenes, Crowdee routes each file to the right enrichment process based on its type (audio, image, video, or text), so you don't have to think about which tool handles which format.
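Routing by file type is a common dispatch pattern, sketched below. Everything here is hypothetical: the extension table, the enricher functions, and the metadata they return are placeholders, and how Crowdee actually detects file types is not described in this article.

```python
from pathlib import PurePosixPath

# Hypothetical extension table; real type detection may inspect file contents.
KIND_BY_EXTENSION = {
    ".wav": "audio", ".mp3": "audio",
    ".jpg": "image", ".png": "image",
    ".mp4": "video",
    ".txt": "text",
}


def enrich_audio(path):
    return {"file": path, "enricher": "audio"}


def enrich_image(path):
    return {"file": path, "enricher": "image"}


def enrich_video(path):
    return {"file": path, "enricher": "video"}


def enrich_text(path):
    return {"file": path, "enricher": "text"}


ENRICHERS = {
    "audio": enrich_audio,
    "image": enrich_image,
    "video": enrich_video,
    "text": enrich_text,
}


def route(path):
    """Pick an enricher from the file extension and run it on the file."""
    kind = KIND_BY_EXTENSION.get(PurePosixPath(path).suffix.lower())
    if kind is None:
        raise ValueError(f"no enricher registered for {path!r}")
    return ENRICHERS[kind](path)


results = [route(p) for p in ["calls/a.WAV", "scans/page.png", "notes.txt"]]
```

Keeping the lookup in a table means that supporting another format is a one-line change rather than a new branch in the routing logic.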
When you export a version, you get back a set of temporary, secure download links, one per file in the version, that expire automatically after 24 hours. Because these links are time-limited and don't require your API key, they're safe to hand off to downstream systems. Batch downloads, pipeline hand-offs, and ML training data exports all use this export step as the hand-off point out of Crowdee.
Exporting works on any version regardless of its status, so you can export a partially processed version if needed. For production workflows, however, waiting until the version reaches cleaned_and_enriched ensures all metadata is available and all files have been processed to completion.
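A downstream job can encode both rules, the 24-hour link lifetime and the recommendation to wait for cleaned_and_enriched, as two small checks. The function names are hypothetical, and the sketch assumes you record the time at which the export was issued, since this article does not say whether the export response includes an expiry timestamp.

```python
from datetime import datetime, timedelta, timezone

LINK_TTL = timedelta(hours=24)  # link lifetime stated in this article


def link_is_usable(issued_at, now=None):
    """Return True while a link issued at issued_at is within its 24-hour window."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now < issued_at + LINK_TTL


def ready_for_production_export(status):
    """Export works in any status; production jobs may prefer to wait for this one."""
    return status == "cleaned_and_enriched"


issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
still_valid = link_is_usable(issued, now=issued + timedelta(hours=23))
expired = not link_is_usable(issued, now=issued + timedelta(hours=24))
```

If a job may run longer than the link lifetime, requesting a fresh export is safer than reusing stored links.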