File upload and the enrichment pipeline: the missing prerequisite for verification

Uploading files to a project

Uploading a file to a project is a straightforward step: you send the file, and the platform stores it securely and hands you back a file ID that you use to reference it in everything that follows. For the exact request and response format, see the API reference at docs.crowdee.ai.

At this point the file exists in storage but carries no derived metadata. It has a file type, a size, and a name — nothing more. The verification pipelines need much more than this to do their job. They need to know codec details, signal characteristics, image dimensions, language indicators, and extracted text before they can produce meaningful verdicts. That is the job of the enrichment pipeline.

The enrichment step

Enrichment is an asynchronous background process. When you trigger it, the platform queues a processing job for the file and immediately returns control to you — you don't wait around for it to finish. Behind the scenes, the file is routed to the right processing path based on its type: images, audio, video, and text/documents each get their own dedicated handling.
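The routing idea can be sketched as a small dispatch on the file's MIME type. This is only an illustration: the handler names, the `route_for` function, and the choice to detect the type from the filename are assumptions, not the platform's actual implementation.

```python
import mimetypes

# Hypothetical names for the four processing paths described above.
ROUTES = {
    "image": "image_enrichment",
    "audio": "audio_enrichment",
    "video": "video_enrichment",
    "text": "document_enrichment",
}

# Document formats whose MIME type does not start with "text/".
DOCUMENT_MIME_TYPES = {"application/pdf"}


def route_for(filename: str) -> str:
    """Pick a processing path from the file's guessed MIME type."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is None:
        raise ValueError(f"cannot determine file type for {filename!r}")
    if mime in DOCUMENT_MIME_TYPES:
        return ROUTES["text"]
    major = mime.split("/", 1)[0]
    if major not in ROUTES:
        raise ValueError(f"unsupported file type {mime!r}")
    return ROUTES[major]
```

A real router would more likely inspect the file contents than trust the extension; the filename lookup just keeps the sketch short.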

The platform scales the number of enrichment jobs it runs in parallel to match load, so heavy upload volume doesn't create a bottleneck. Once a job completes, the extracted metadata is attached to the file and becomes available through the API.

What enrichment extracts by modality

The metadata extracted depends on the file type:

Modality	Extracted metadata
Image	Dimensions (width × height), color space, EXIF tags (GPS, camera make/model, capture timestamp)
Audio	Codec, sample rate, channel count, duration, signal-level analysis
Video	Dimensions, frame rate, codec, duration, embedded audio track extraction
Text / Document	Language detection; OCR for documents that are image-based (see below)

For images, EXIF data is particularly valuable for verification pipelines. A pipeline checking whether an image was captured at a claimed location can read the GPS coordinates directly from the enrichment metadata without re-parsing the file. For audio, duration and codec help the verification pipeline decide whether a file is plausibly authentic given its claimed origin. For video, the audio track extraction metadata feeds into cross-modal verification pipelines that check consistency between visual and audio content.
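As an illustration of the location check described above, here is a minimal sketch that compares GPS coordinates from enrichment metadata against a claimed location. The metadata shape (`exif` → `gps` → `lat`/`lon`), the function names, and the 1 km tolerance are all assumptions made for the example; the real metadata schema is documented at docs.crowdee.ai.

```python
import math
from typing import Optional, Tuple


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def gps_from_metadata(metadata: dict) -> Optional[Tuple[float, float]]:
    """Read (lat, lon) from a hypothetical enrichment metadata layout."""
    gps = metadata.get("exif", {}).get("gps")
    if not gps:
        return None
    return gps["lat"], gps["lon"]


def matches_claimed_location(
    metadata: dict, claimed: Tuple[float, float], max_km: float = 1.0
) -> Optional[bool]:
    """True/False if GPS data is present, None if the image carries none."""
    coords = gps_from_metadata(metadata)
    if coords is None:
        return None
    return haversine_km(coords[0], coords[1], claimed[0], claimed[1]) <= max_km
```

Returning `None` for missing GPS data keeps "no evidence" distinct from "evidence of a mismatch", which matters when the result feeds a verdict.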

Why enrichment gates verification

Most verification pipelines check whether a file has finished enrichment before they will start. If enrichment has not completed, the pipeline refuses to create a run and returns an error indicating that the file is not ready. This gating is intentional: a verification check that relies on GPS coordinates from image metadata or the codec of an audio file would fail silently or produce a misleading verdict if that information were missing.
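The gating behaviour can be pictured with a short sketch. The status values, the `FileRecord` type, and the `FileNotReadyError` name are hypothetical stand-ins; the actual error format returned by the API is described at docs.crowdee.ai.

```python
from dataclasses import dataclass, field


class FileNotReadyError(Exception):
    """Raised when a run is requested for a file that is not yet enriched."""


@dataclass
class FileRecord:
    file_id: str
    # Hypothetical status values: "pending", "running", "complete", "failed".
    enrichment_status: str = "pending"
    metadata: dict = field(default_factory=dict)


def create_run(file: FileRecord, pipeline: str) -> dict:
    """Refuse to create a verification run until enrichment has completed."""
    if file.enrichment_status != "complete":
        raise FileNotReadyError(
            f"file {file.file_id} is not ready: enrichment is {file.enrichment_status}"
        )
    return {"pipeline": pipeline, "file_id": file.file_id}
```

Failing loudly at run creation is the point of the gate: the caller learns immediately that enrichment is the missing step, rather than receiving a verdict built on absent metadata.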

The enrichment metadata is captured as part of the verification run at the moment the run is created. This makes the run fully self-contained: even if you re-enrich the file later and the metadata changes, the historical run continues to reference the metadata that existed when it was created. Verification results therefore remain reproducible and auditable.
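One way to picture this snapshot behaviour is a deep copy taken when the run is created. The sketch below assumes metadata is a plain nested dictionary and uses a hypothetical `metadata_snapshot` field; how the platform actually stores the snapshot is not specified here.

```python
import copy


def create_run(file_metadata: dict, pipeline: str) -> dict:
    """Create a run record that carries its own copy of the metadata."""
    return {
        "pipeline": pipeline,
        # Deep copy so later changes to nested values do not leak into the run.
        "metadata_snapshot": copy.deepcopy(file_metadata),
    }


# Usage: re-enriching the file afterwards leaves the historical run untouched.
file_metadata = {"exif": {"gps": {"lat": 52.52, "lon": 13.405}}}
run = create_run(file_metadata, "geo-check")
file_metadata["exif"]["gps"]["lat"] = 0.0  # simulated re-enrichment
```

A shallow copy would not be enough in this sketch, because the nested `exif` dictionary would still be shared between the file and the run.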

For operators building integrations, the practical workflow is: upload files → trigger enrichment → wait until enrichment is complete → start verification. The platform UI enforces this order through a requirements-gating checklist in the pipeline tab, and the API enforces the same order on its side. See docs.crowdee.ai for the exact requests involved.
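The "wait until enrichment is complete" step is typically a polling loop on the integration side. The sketch below is deliberately independent of any real client: `get_status` is a caller-supplied function that returns the file's enrichment status, and the status strings, timeout, and interval are assumptions rather than documented values.

```python
import time
from typing import Callable


def wait_for_enrichment(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until the status is "complete"; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "complete":
            return
        if status == "failed":
            raise RuntimeError("enrichment failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"enrichment still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

In an integration, `get_status` would wrap whatever request the API reference specifies for reading a file's enrichment state, and verification would be started only after this function returns.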

The OCR path

Text extraction for documents and images follows a two-path strategy. For digital PDFs, where a text layer is embedded in the document itself, the platform reads the text straight from that layer. This is fast and accurate, and it avoids any heavier processing.

For scanned PDFs and standalone images, which have no embedded text layer, the platform falls back to a vision-based extraction step that reads the text out of the image itself. This path is slower than direct extraction, but it handles the broad class of scanned documents that are common in verification use cases.

The choice between the two paths is made automatically: the platform attempts direct text extraction first and falls back to the vision-based step if the extracted text is empty or too short to be meaningful. Both paths write the result to the same place in the enrichment metadata, so downstream pipeline stages do not need to know which path was used.
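The try-direct-then-fall-back logic can be sketched as follows. The two extractors are passed in as plain functions so the example stays runnable; the 20-character threshold and the `extracted_text` metadata key are assumptions for illustration, not the platform's actual values.

```python
from typing import Callable

MIN_TEXT_CHARS = 20  # hypothetical cutoff for "too short to be meaningful"


def extract_text(
    data: bytes,
    direct: Callable[[bytes], str],
    vision: Callable[[bytes], str],
    min_chars: int = MIN_TEXT_CHARS,
) -> str:
    """Try the embedded text layer first; fall back to vision-based extraction."""
    text = direct(data).strip()
    if len(text) >= min_chars:
        return text
    return vision(data).strip()


def enrich_text(
    metadata: dict,
    data: bytes,
    direct: Callable[[bytes], str],
    vision: Callable[[bytes], str],
) -> dict:
    """Write the result to one key, regardless of which path produced it."""
    metadata["extracted_text"] = extract_text(data, direct, vision)
    return metadata
```

Because both paths end up in the same key, a downstream stage reads `extracted_text` without branching on how the text was obtained.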