Fundamentals

TL;DR

This page builds shared language for the rest of the docs. It explains how AI uses samples, labels, and validations; how Codatta models them as atomic contributions; how they bundle into a data asset (the minimal commercial unit); and why we use blockchain to make assetification and royalties programmable.

Data basics for Artificial Intelligence (AI)

Sample (math notation: X): A raw observation the model will learn from (image, audio clip, text span, time-series window, multi-sensor frame, etc.).
Label (math notation: Y or y): A structured interpretation of a sample (or group of samples): class, bounding box, segmentation mask, span, rating, relation, events over time, etc.
Validation: A quality judgment or evidence check on a sample or label. This can be consensus voting, rubric scoring, re-labeling, or automated checks plus human adjudication.

Why quality matters to models

Signal-to-noise – mislabeled or low-information data reduces effective batch size and slows convergence.
Bias & leakage – inconsistent schema, shortcut features, or label leakage harm generalization and fairness.
Heterogeneous tasks – multi-task/chain-of-thought models depend on clear, consistent instructions and traceable provenance to debug and improve.

Take-away: Better samples + clearer labels + verifiable validations = more useful gradient steps and fewer surprises in production.
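As a rough illustration of the consensus-voting style of validation mentioned above, the sketch below accepts a label only when enough annotators agree. The function name `consensus_label` and the 0.6 agreement threshold are assumptions made for this example, not part of Codatta's protocol.

```python
from collections import Counter

def consensus_label(votes, min_agreement=0.6):
    """Return (label, agreement); label is None when agreement is too low.

    `min_agreement` is a hypothetical threshold chosen for illustration.
    """
    if not votes:
        return None, 0.0
    # Most frequent label and how many annotators chose it.
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return (label if agreement >= min_agreement else None), agreement

# Two of three annotators agree, so "cat" is accepted.
accepted, agreement = consensus_label(["cat", "cat", "dog"])
```

A split vote such as `["cat", "dog"]` falls below the threshold and returns no label, which is where rubric scoring or human adjudication would take over.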

Codatta’s data model

Figure 1. Samples, labels, and validations aggregate into a Data Asset; assets are then selected into Datasets.

Why it matters: The Data Asset is the unit of ownership & royalty and the unit of licensing most buyers actually consume.

A. Atomic Contribution (AC)
One unit of work produced by a human or agent:

sample – the observation
label – the interpretation
validation – the quality/evidence decision

Every AC is assigned a Contribution Fingerprint (CF):

A tamper-evident identifier (hash + metadata + parent links) that proves who did what, when, to which payload.
CFs make contributions discoverable, deduplicable, and auditable.
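One way to picture a fingerprint built from "hash + metadata + parent links" is the minimal sketch below. The field names, the use of SHA-256, and the canonical-JSON encoding are all assumptions for illustration; the actual CF format may differ.

```python
import hashlib
import json

def contribution_fingerprint(payload, contributor, kind, timestamp, parents=()):
    """Hypothetical CF: SHA-256 over a canonical record of who/what/when/which."""
    record = {
        "payload_sha256": hashlib.sha256(payload).hexdigest(),  # which payload
        "contributor": contributor,                             # who
        "kind": kind,                                           # what: sample/label/validation
        "timestamp": timestamp,                                 # when
        "parents": sorted(parents),                             # links to upstream CFs
    }
    # Sorted keys and fixed separators make the encoding deterministic.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Illustrative values only.
cf_sample = contribution_fingerprint(b"raw image bytes", "alice", "sample", "2025-01-01T00:00:00Z")
cf_label = contribution_fingerprint(b'{"class": "cat"}', "bob", "label",
                                    "2025-01-02T00:00:00Z", parents=[cf_sample])
```

Because the record is hashed as a whole, changing the payload, the metadata, or a parent link yields a different fingerprint, which is what makes the identifier tamper-evident. Identical inputs yield the same fingerprint, which supports deduplication.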

B. Data Asset (DA)
A composite, minimal commercial unit created by aggregating ACs that belong together (e.g., one image + its accepted labels + validations). Ownership and licensing are enforced at the asset level because that’s what AI teams actually use.

C. Dataset (View / Collection)
A curated selection of Data Assets for a particular model, vertical, or evaluation—defined by a saved query or manifest. Datasets inherit all ownership, licensing, and lineage from the assets they include.
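The three layers can be sketched as plain data structures. This is an illustrative model only: the class names, fields, and the idea of representing a saved query as a Python predicate are assumptions, not the protocol's schema.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass(frozen=True)
class AtomicContribution:
    cf: str            # Contribution Fingerprint (opaque string here)
    kind: str          # "sample", "label", or "validation"
    contributor: str

@dataclass
class DataAsset:
    asset_id: str
    vertical: str      # hypothetical tag used by the query below
    contributions: List[AtomicContribution] = field(default_factory=list)

@dataclass
class Dataset:
    name: str
    query: Callable[[DataAsset], bool]   # stands in for a saved query

    def select(self, assets):
        # A dataset is a view: it references assets rather than copying them.
        return [a for a in assets if self.query(a)]

x0 = AtomicContribution("cf-x0", "sample", "alice")
y0 = AtomicContribution("cf-y0", "label", "bob")
v0 = AtomicContribution("cf-v0", "validation", "carol")
asset_1 = DataAsset("asset-1", "vision", [x0, y0, v0])
asset_2 = DataAsset("asset-2", "speech", [])

vision_set = Dataset("vision-train", lambda a: a.vertical == "vision")
selected = vision_set.select([asset_1, asset_2])   # contains only asset_1
```

Since the dataset holds references to the selected assets, each asset's contributors (and therefore its ownership and lineage) remain reachable from the dataset.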

Typical real-world scenarios

The figures below are conceptual and focus on relationships and flow; field names and formats may evolve as the protocol is finalized.

Scenario A: One sample, multiple label sets

One sample → two assets via different label bundles

Figure 2. One sample (X0) gets labeled by task01 and task02. Bundling X0 + {Y0, Y1} forms Asset-A (Vertical AI “a”), while bundling X0 + {Y2} forms Asset-B (Vertical AI “b”).

Why it matters: The same raw sample can power different products simply by bundling different labels, each with its own royalties and license terms.

Scenario B: Cross-sample composite

Figure 3. Two samples (X0, X1) are combined into a composite asset for a new task (task03). A downstream label (Y3) annotates the composite, not a single sample.

Why it matters: Many tasks (dialog pairs, multi-turn contexts, video/action segments) require relationships across samples. Codatta supports composite assets and keeps derivation links for proper attribution and payouts.

Scenario C: Label-on-label (meta-labeling)

Figure 4. A downstream label (Y4) targets an upstream label (Y0), which itself annotated X0. Royalties propagate to the meta-labelers and origin labelers (and to the original sample as policy dictates).

Why it matters: You can annotate interpretations, not just raw data: rubrics, explanations, confidence judgments, or evaluator notes, all with lineage and revenue inheritance.
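The derivation links in Scenarios B and C can be pictured as a small graph in which each contribution points at its parents. The sketch below walks those links to find everyone upstream of a given label. The dictionary representation, the function name `ancestors`, and the composite node name `C01` are assumptions for illustration.

```python
def ancestors(node, parents):
    """Return all upstream nodes of `node`, nearest first (breadth-first)."""
    seen, order, frontier = set(), [], [node]
    while frontier:
        next_frontier = []
        for n in frontier:
            for p in parents.get(n, []):
                if p not in seen:          # visit each upstream node once
                    seen.add(p)
                    order.append(p)
                    next_frontier.append(p)
        frontier = next_frontier
    return order

# Scenario C: Y4 annotates Y0, which annotates X0.
# Scenario B: Y3 annotates a composite (called C01 here) built from X0 and X1.
parents = {
    "Y4": ["Y0"], "Y0": ["X0"],
    "Y3": ["C01"], "C01": ["X0", "X1"],
    "X0": [], "X1": [],
}

lineage_c = ancestors("Y4", parents)   # ["Y0", "X0"]
lineage_b = ancestors("Y3", parents)   # ["C01", "X0", "X1"]
```

The resulting list is the set of contributions that a payout policy would consider when a meta-label or composite label is used.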

Assetification & blockchain technology

Why assetify?
Traditional labeling produces files that are hard to track, share, or value. Assetification turns work into on-chain objects with provenance and programmable rights:

Provenance via Contribution Fingerprints (hash + metadata + parents) → who did what, when, to which payload.
Ownership via fractional tokens on Data Assets (not just on files), so multiple contributors/validators can participate in revenue.
Licensing & metering with policy-gated access (public vs. restricted), and usage receipts (read/train/infer) that drive royalty routing.
Derivation that carries inheritance rules (child assets point to parents; royalties propagate per policy).
Privacy-by-design: hybrid storage + secure compute (e.g., TEEs) so models can use data without exposing raw content.
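To make the interaction of fractional ownership, derivation, and royalty routing concrete, here is a minimal sketch. The policy it encodes (a derived asset passes a fixed fraction `pass_up` of each payment to its parent assets, split equally, and the rest goes to its own holders by share) is entirely hypothetical; real inheritance rules are set per policy.

```python
def route_royalty(amount, asset, parents, shares, pass_up=0.2):
    """Split `amount` for one usage of `asset` among holders and upstream assets.

    Assumes the derivation graph has no cycles and that each asset's
    shares sum to 1. `pass_up` is a made-up policy parameter.
    """
    payouts = {}

    def pay(asset_id, amt):
        upstream_assets = parents.get(asset_id, [])
        # Root assets (no parents) keep the full amount.
        upstream = amt * pass_up if upstream_assets else 0.0
        local = amt - upstream
        for holder, share in shares[asset_id].items():
            payouts[holder] = payouts.get(holder, 0.0) + local * share
        for parent_id in upstream_assets:
            pay(parent_id, upstream / len(upstream_assets))

    pay(asset, amount)
    return payouts

parents = {"child": ["parent"], "parent": []}
shares = {
    "child": {"carol": 1.0},
    "parent": {"alice": 0.5, "bob": 0.5},
}
payouts = route_royalty(100, "child", parents, shares)
```

In this example a usage receipt worth 100 on the child asset pays carol 80, while alice and bob receive 10 each through the parent link. The total paid out always equals the amount received, whatever value `pass_up` takes.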

What blockchain adds

Trust minimization: public, append-only record of contributions, ownership, and usage events.
Composability: assets can be queried, bundled, and re-licensed across apps while preserving lineage and payouts.
Incentives: contributors and validators earn when assets are used—aligning quality with long-term value.
