Input data: per-task variation in crowd jobs

What input data is

A crowd job without input data is a static survey: every worker sees the same questions with the same fixed text. That works for a handful of task types, but most meaningful crowd annotation requires each worker to judge a specific piece of content, such as a particular news article, a specific image, or a product description from a known source. Input data is the mechanism that supplies this per-task variation.

You create an input data set from a batch of source content — say, 500 news articles — and attach it to a crowd job so that workers can be assigned one article each. Sets can live inside a single project or be shared across your whole organization, so the same content library can feed multiple jobs over time. For the exact API reference, see docs.crowdee.ai.

The data model: sets, lists, groups, variants

Conceptually, input data is organized in four levels, from the outermost container down to the actual content, each with a specific role.

Level	Role
Set	Top-level named container for a batch of content, linked to your organization or a project
List	A named list within a set; one set can hold multiple lists
Group	The set of variants that represent the same real-world item
Variant	One representation of the item (e.g. one language or one format)

A group can appear in more than one list at once, so the same content can be reused across different groupings without duplicating it.

The variant level is where the actual content lives. A group representing one news article might have two variants: the original English text and a machine-translated German version. A worker assigned to that group gets one variant; the system keeps track of exactly which one they saw. This allows the same underlying item to be presented in different forms to different workers, or the same form to be reused across jobs without duplicating the content.
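
The four levels can be pictured with a small Python sketch. This is only an illustration of the structure described above: the class names, field names, and the scope values are assumptions made for this example and are not Crowdee's actual schema or API (see docs.crowdee.ai for that).

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    """One representation of an item, e.g. one language or one format."""
    label: str
    fields: dict[str, str]


@dataclass
class Group:
    """All variants of the same real-world item."""
    name: str
    variants: list[Variant] = field(default_factory=list)


@dataclass
class InputList:
    """A named list of groups; it holds references, not copies."""
    name: str
    groups: list[Group] = field(default_factory=list)


@dataclass
class InputSet:
    """Top-level container for a batch of content."""
    name: str
    scope: str  # hypothetical values: "project" or "organization"
    lists: list[InputList] = field(default_factory=list)


# One news article with two variants (illustrative content only).
article = Group(
    name="article-001",
    variants=[
        Variant("en-original", {"body": "Original English text"}),
        Variant("de-machine", {"body": "Maschinell übersetzter Text"}),
    ],
)

# The same Group object is referenced from two lists, so nothing is duplicated.
politics = InputList("politics", [article])
pilot = InputList("pilot-batch", [article])
news = InputSet("news-articles", scope="organization", lists=[politics, pilot])
```

Because both lists point at the same group, a change to the group's variants is visible through either list, which is the reuse-without-duplication property described above.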

Attaching input data to a crowd job

When creating a crowd job, you link it to an input data set. Crowdee then assigns workers to groups from that set. At assignment time, each worker's task is locked to a snapshot of the exact content they were shown — not a live reference to it. That snapshot is immutable: even if the input data set is edited later, the worker's task always reflects what they actually saw.
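
The difference between a snapshot and a live reference can be shown in a few lines of Python. This is a sketch of the general idea only; the function name `take_snapshot` and the field names are hypothetical, and how Crowdee stores snapshots internally is not described here.

```python
import copy
from types import MappingProxyType


def take_snapshot(variant_fields: dict) -> MappingProxyType:
    """Return a read-only copy that is detached from the live data."""
    return MappingProxyType(copy.deepcopy(variant_fields))


# Live input data as it exists at assignment time (illustrative values).
live_variant = {"headline": "Original headline", "body": "Original body"}
task_content = take_snapshot(live_variant)

# Editing the live input data later does not change the worker's task.
live_variant["headline"] = "Edited headline"
print(task_content["headline"])
```

The printed value is still the original headline, and an attempt to assign into `task_content` raises a `TypeError`, which mirrors the immutability described above.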

Jobs can require that no two workers are ever assigned the same group — the right setting for annotation tasks where each piece of content should be judged independently, without workers influencing each other. Alternatively, the same group can be assigned to multiple workers on purpose, which is useful when you want several independent judgments on the same item for inter-annotator agreement analysis.

A finer-grained option lets you cap how many times a given group can be reused across workers, without requiring every group to be assigned exactly once. This is useful when your input data set is smaller than your target annotation count, or when you want weighted coverage of certain groups.
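
The three assignment behaviours (each group once, a group shared on purpose, a capped number of reuses) can be thought of as one rule with a different cap. The following Python sketch illustrates that idea under stated assumptions: `assign_group`, `max_uses`, and the least-used-first selection are invented for this example, as is the rule that a worker never receives the same group twice. It is not how Crowdee necessarily schedules work.

```python
from collections import Counter
from typing import Optional


def assign_group(
    worker_id: str,
    group_ids: list[str],
    usage: Counter,
    seen: dict[str, set[str]],
    max_uses: Optional[int] = 1,
) -> Optional[str]:
    """Pick the least-used group that is under the cap and new to this worker.

    max_uses=1    -> no two workers ever share a group
    max_uses=3    -> up to three judgments per group
    max_uses=None -> no cap on reuse
    """
    already = seen.setdefault(worker_id, set())
    candidates = [
        g for g in group_ids
        if g not in already and (max_uses is None or usage[g] < max_uses)
    ]
    if not candidates:
        return None  # nothing left to hand out under this policy
    chosen = min(candidates, key=lambda g: usage[g])
    usage[chosen] += 1
    already.add(chosen)
    return chosen


usage, seen = Counter(), {}
groups = ["g1", "g2"]
assignments = [
    assign_group(w, groups, usage, seen, max_uses=2)
    for w in ["w1", "w2", "w3", "w4", "w5"]
]
print(assignments)
```

With two groups and a cap of two, the first four workers receive `g1`, `g2`, `g1`, `g2`, and the fifth receives `None` because every group has reached its cap. This is the situation described above where the input data set is smaller than the target annotation count.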

Per-task variation in practice

The full per-task variation mechanism works as follows: the survey template contains placeholder tokens for the pieces of content that should vary (for example, a token for the post text and one for the source URL). The input data variant for the assigned group contains matching values for those tokens. At task assignment time, Crowdee loads the survey template, substitutes the placeholders with the variant's values, and locks in the result as the worker's task. The worker's survey is rendered from that locked-in version, not from the live template or the live input data.

This separation is important: the locked-in version records what the worker actually saw, which is essential for reproducible quality analysis. If you later want to know why a specific worker gave an unexpected answer, you can look at the exact survey they were shown, not a reconstructed version of it.

Placeholder substitution also works for structured content. A variant might carry several related fields at once — a headline, a body, and a publication date — and the template can reference each one independently. This keeps variants easy to author and reason about, even for richer content types.
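
The substitution step can be sketched with Python's standard `string.Template`. The `${...}` token syntax, the field names, and the sample content are assumptions made for this illustration; Crowdee's real placeholder syntax is documented at docs.crowdee.ai.

```python
from string import Template

# Hypothetical survey template; the ${...} token syntax is an assumption.
survey_template = Template(
    "Headline: ${headline}\n"
    "Published: ${published}\n"
    "\n"
    "${body}\n"
    "\n"
    "Is the headline supported by the article body?"
)

# Values carried by the variant of the assigned group (made-up content).
variant_fields = {
    "headline": "City council approves new budget",
    "published": "2024-03-01",
    "body": "The council voted on Tuesday to approve the budget.",
}


def render_task(template: Template, fields: dict[str, str]) -> str:
    """Fill every placeholder with the variant's value.

    substitute() raises KeyError for a missing field, so a worker is never
    shown an unfilled token.
    """
    return template.substitute(fields)


# The rendered string is what would be locked in as the worker's task.
locked_survey = render_task(survey_template, variant_fields)
print(locked_survey)
```

Using the strict `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: a variant that lacks a field the template expects fails at assignment time instead of producing a broken survey.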

The answers-to-input-data loop

One of the more powerful features of the crowd pipeline is the ability to use accepted crowd answers as the input data for a subsequent job. Crowdee can take the accepted answers from a completed job and convert them directly into a new input data set.

This creates a feedback loop. For example, Job A asks workers to transcribe audio clips. The accepted transcriptions are converted into an input data set. Job B then asks a different set of workers to review and correct those transcriptions, using them as the varying content in the review survey. The output of Job B is a higher-quality transcription set. This pipeline pattern (generate in Job A, review in Job B) is reusable for any content type where crowd generation and crowd review are separable steps.

Accepted answers can also be converted into a dataset instead of an input data set, which is useful when the answer content should feed into a broader data pipeline rather than another crowd job. Either way, only accepted answers are used — pending and rejected answers are filtered out automatically. See docs.crowdee.ai for the full set of options.
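
The conversion from answers to input data can be sketched as a filter followed by a reshaping step. Everything named here is hypothetical: the `Answer` class, the status strings, and the one-group-per-answer layout are assumptions chosen for illustration, not Crowdee's export format.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    task_id: str
    worker_id: str
    status: str  # assumed values: "accepted", "pending", "rejected"
    text: str


def answers_to_input_groups(answers: list[Answer]) -> list[dict]:
    """Turn accepted answers into one group per answer, each with one variant."""
    return [
        {
            "group": f"{a.task_id}-{a.worker_id}",
            "variants": [{"transcription": a.text}],
        }
        for a in answers
        if a.status == "accepted"  # pending and rejected answers are dropped
    ]


# Made-up answers from a transcription job (Job A).
job_a_answers = [
    Answer("clip-1", "w1", "accepted", "hello world"),
    Answer("clip-2", "w2", "rejected", "???"),
    Answer("clip-3", "w3", "pending", "good morning"),
]

# Input data for the review job (Job B): only the accepted answer survives.
job_b_input = answers_to_input_groups(job_a_answers)
print(job_b_input)
```

In this sketch the field name `transcription` would then be the placeholder that the Job B review survey references.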