The data model
Loom's input is a Metaflow data object — a Metaflow Artifact
referenced by pathspec (e.g. IngestDataset/123) and read only through
the Metaflow Client API. There is exactly one boundary where outside data crosses
in — loom ingest — and from then on every verb threads references, never the
data itself.
Your input is a Metaflow data object
Loom never invents its own storage format. The unit of data is a Metaflow
Artifact — the same versioned, content-addressed object Metaflow produces from any run —
identified by a pathspec like IngestDataset/123. Wherever a verb takes
data, it takes that pathspec as its --dataset:
loom eda --dataset IngestDataset/123 --target is_fraud
An ingested data object is a small, well-defined set of artifacts: train and
optional test (DataFrames) plus a schema dict. Loom reads them only through
the Metaflow Client API — concretely metaflow.Run(pathspec).data.<artifact> — and
that read happens in exactly one module (loom/dataio.py), the single data door for the
whole engine.
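A minimal sketch of what a single data door could look like, assuming the artifact names train, test, and schema described above. Only metaflow.Run(pathspec).data comes from the text; DataObject, load_data_object, and the injectable run_factory parameter are hypothetical names used for illustration, not the contents of loom/dataio.py.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class DataObject:
    """The artifacts of an ingested data object (hypothetical container)."""
    train: Any
    test: Optional[Any]
    schema: dict


def _metaflow_run(pathspec: str) -> Any:
    # Imported lazily so this module stays importable without Metaflow installed.
    from metaflow import Run
    return Run(pathspec)


def load_data_object(
    pathspec: str,
    run_factory: Callable[[str], Any] = _metaflow_run,
) -> DataObject:
    """Resolve a pathspec to its artifacts through run.data and nothing else."""
    data = run_factory(pathspec).data
    # `test` is optional, so check membership before reading it.
    test = data.test if "test" in data else None
    return DataObject(train=data.train, test=test, schema=data.schema)
```

Keeping the Client API call behind one function is what makes the "single door" checkable: any other module that imports metaflow directly stands out in review. The run_factory hook exists only so the function can be exercised without a live datastore.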
A pathspec is the whole identity of the data — versioned and stable. The same
IngestDataset/123 resolves to the same object whether Loom runs locally on your laptop
today or in a cluster later. You pass the pathspec around; Loom resolves the bytes on demand.
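Because the pathspec is the whole identity, it can be treated as a small typed value rather than a free-form string. The sketch below assumes dataset pathspecs always have the two-segment FlowName/run_id shape seen in the examples; RunRef and parse_run_pathspec are hypothetical names, not part of Loom.

```python
from typing import NamedTuple


class RunRef(NamedTuple):
    """A two-part reference such as IngestDataset/123 (hypothetical helper)."""
    flow: str
    run_id: str

    def __str__(self) -> str:
        return f"{self.flow}/{self.run_id}"


def parse_run_pathspec(pathspec: str) -> RunRef:
    """Split 'FlowName/run_id' and reject anything with a different shape."""
    parts = pathspec.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'FlowName/run_id', got {pathspec!r}")
    return RunRef(parts[0], parts[1])
```

Parsing carries no information about where the bytes live, which is the point: the reference is stable across environments and the datastore stays out of it.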
The datastore is opaque
Where those artifacts physically live — a local store, or object storage
(S3 / minio) — is an implementation detail that Metaflow owns. You
configure it once in your Metaflow profile/environment (METAFLOW_PROFILE /
METAFLOW_*); after that, Loom is agnostic to it.
- Loom code never talks to that storage directly. There is no object-storage SDK, no bucket-URI literal, and no raw datastore handle anywhere in the engine — a test scans the source to keep it that way.
- Switching the local-dev minio store for a real cloud bucket is a Metaflow-profile change. No Loom verb changes, because no verb ever named the bucket.
- This is what lets a tenant bring their own perimeter: point METAFLOW_PROFILE at your own Metaflow endpoint and your data objects live in your datastore, never Loom's.
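The source-scanning test mentioned in the first bullet could be as small as the sketch below. The deny-list patterns and the function name find_violations are assumptions for illustration; the text does not say which patterns the real test checks.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Assumed deny-list: object-storage SDK imports and bucket-URI string literals.
FORBIDDEN = [
    re.compile(r"^\s*(import|from)\s+(boto3|botocore|minio)\b"),
    re.compile(r"""["']s3a?://"""),
]


def find_violations(root: Path) -> List[Tuple[str, int, str]]:
    """Return (relative path, line number, line) for every forbidden match."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if any(pattern.search(line) for pattern in FORBIDDEN):
                hits.append((str(path.relative_to(root)), lineno, line.strip()))
    return hits
```

A test would then assert that find_violations(Path("loom")) returns an empty list, so a stray SDK import fails CI instead of quietly coupling the engine to one store.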
In local dev the datastore is a minio bucket you can browse at the minio console
(http://localhost:9001) — that's where ingested data objects and run artifacts land.
See Getting started for standing it up, and
Configuration & providers for the profile knobs.
One boundary in: loom ingest
There is exactly one place where data from outside Loom crosses into the model:
loom ingest. It runs the IngestDataset flow once to turn a local directory
or CSV into Metaflow artifacts, and prints the resulting pathspec — the
dataset_ref every downstream verb will use. ingest is the
boundary tier; datasets, which lists what has been ingested, is
read-only.
# Ingest once — a local dir/CSV becomes a Metaflow data object; prints the dataset_ref.
loom ingest --source ./your_task --name my-dataset # -> dataset_ref : IngestDataset/123
# See what's ingested (read-only, via the Client API).
loom datasets
The source shape is small and predictable. A directory source contains
train.csv (plus optional test.csv / sample_submission.csv); a
single .csv is split for you. From there, solutions read from ./input/ and
write predictions to ./working/submission.csv — the same on-disk workspace shape no
matter where the bytes physically came from.
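The source shape above is simple enough to check before anything is ingested. This is a sketch under the stated file names only; describe_source and its return format are hypothetical, and how Loom actually validates or splits a source is not specified here.

```python
from pathlib import Path
from typing import Dict, List, Union

OPTIONAL_FILES = ("test.csv", "sample_submission.csv")


def describe_source(source: Path) -> Dict[str, Union[str, List[str]]]:
    """Classify an ingest source as a directory or a single CSV, or reject it."""
    if source.is_dir():
        if not (source / "train.csv").is_file():
            raise ValueError(f"{source} is a directory but has no train.csv")
        extras = [name for name in OPTIONAL_FILES if (source / name).is_file()]
        return {"kind": "directory", "train": "train.csv", "optional": extras}
    if source.is_file() and source.suffix.lower() == ".csv":
        # A single CSV is accepted as-is; the split happens during ingest.
        return {"kind": "single_csv", "train": source.name, "optional": []}
    raise ValueError(f"{source} is neither a directory nor a .csv file")
```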
Don't run a verb against a raw store, a file path mid-analysis, or a freshly fetched
MCP source. MCP data tools locate and fetch;
the discipline is then to loom ingest the result into a Metaflow data object and work
the pathspec. Everything past ingest is a reference, and references are what make composition
safe.
Verbs thread references, not data
Once data is a pathspec, the verbs hand off to each other by passing references
— pathspecs, card_paths, experiment-ids — never the data. A verb that
writes (like features) produces a new data object with its own
pathspec, and that pathspec becomes the --dataset for the next verb.
# Build engineered features into a NEW data object; --from drops eda-flagged leakage columns.
loom features --dataset IngestDataset/123 --target target --from EdaFlow/7 # -> FeaturesFlow/9
# The new FeaturesFlow/9 pathspec is the --dataset for every downstream verb.
loom validate --dataset FeaturesFlow/9 --target target
# Promotion asserts the upstream validate VERDICT by reference — never by re-reading data.
loom deploy --validate ValidateFlow/12 --apply
This is how the lifecycle composes: a features data object feeds
pipeline / validate; a validate VERDICT==PASS is
what deploy asserts before it will promote (a sub-threshold validate
BLOCKS deploy). The check happens on the reference and the typed summary, not on the
bytes.
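The deploy gate can be pictured as a check on a small typed summary rather than on data. In the sketch below, ValidateSummary, DeployBlocked, and assert_promotable are hypothetical names, and the summary fields are an assumption; the text only establishes that deploy asserts VERDICT==PASS on the referenced validate run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidateSummary:
    """Typed summary of a validate run (assumed fields, for illustration)."""
    pathspec: str  # e.g. "ValidateFlow/12"
    verdict: str   # "PASS", or any other value for a sub-threshold run


class DeployBlocked(Exception):
    """Raised when the referenced validate run did not pass."""


def assert_promotable(summary: ValidateSummary) -> str:
    """Return the validate pathspec if it may gate a deploy, else raise."""
    if summary.verdict != "PASS":
        raise DeployBlocked(
            f"{summary.pathspec} has VERDICT={summary.verdict}; deploy is blocked"
        )
    return summary.pathspec
```

Nothing in the gate touches a DataFrame: the decision depends only on the reference and the verdict attached to it.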
| You pass | What it is | Example |
|---|---|---|
| --dataset | a Metaflow data-object pathspec | IngestDataset/123, FeaturesFlow/9 |
| --run / --solution | a run pathspec (a candidate / flow run) | EvalCandidate/42 |
| --validate | a validate run whose VERDICT gates deploy | ValidateFlow/12 |
| --from | an upstream run to compose on (e.g. drop leakage cols) | EdaFlow/7 |
| --experiment | a grouping id for report/lineage | loom-abc123 |
The privacy line: bulk data never reaches the LLM
The most load-bearing consequence of the data model is a privacy line: your bulk data never reaches the LLM. Datasets and transactions live as Metaflow data objects in your datastore/perimeter; candidate code processes them there. The model only ever sees small derived context — schema, a preview, code, metrics — never raw rows, never keys.
- The CLI's discipline mirrors this. Verbs operate on datasets / runs / pathspecs; nothing dumps, moves, or pastes raw data through the agent. You thread references between verbs and summarize outcomes in prose, keeping the structured JSON for machine checks.
- Telemetry keeps the same posture. Content is redacted by default to <REDACTED:kind>; only schema/preview/metrics enter the corpus, never raw rows, never keys.
- The small derived prompts do go to a third-party LLM (that's true of using any model at all); the data itself does not.
Keep it that way with prompt hygiene: don't paste large data or log blobs into
context. If fetched material is data to model, bring it in via loom ingest — a
Metaflow data object — never by streaming it through chat.
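The redact-by-default posture for telemetry can be sketched as an allow-list. This assumes a telemetry event is a flat dict keyed by content kind, which is a simplification for illustration; redact_event and ALLOWED_KINDS are hypothetical names, and Loom's actual event format is not described here.

```python
from typing import Any, Dict

# Only these kinds pass through; everything else is replaced by a marker.
ALLOWED_KINDS = frozenset({"schema", "preview", "metrics"})


def redact_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Keep allow-listed kinds and replace all other values with <REDACTED:kind>."""
    return {
        kind: (value if kind in ALLOWED_KINDS else f"<REDACTED:{kind}>")
        for kind, value in event.items()
    }
```

An allow-list fails closed: a new content kind is redacted until someone decides it is safe, which matches the default described above.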
Local vs. Metaflow input
Loom has two execution paths, and they differ only in how the input is named — both converge on
the same ./input workspace shape so a solution reads it identically.
| | --mlops local | --mlops metaflow (default) |
|---|---|---|
| Input | a local --data directory | a Metaflow data object (--dataset <pathspec> from loom ingest) |
| Data | local | a Metaflow artifact; the datastore (local or S3/minio) stays in your perimeter, opaque to Loom |
| Covers | the loom run search only | the whole lifecycle (eda, features, validate, deploy, …) |
The local path is the on-ramp for a quick "does it work" with no datastore. The
lifecycle verbs need --mlops metaflow — they run versioned flows, and
each candidate becomes a real Metaflow run rather than an in-process one. The
dataset_ref pathspec is the through-line either way.
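The difference between the two paths comes down to how the input is named, which a small resolver can make explicit. This is a sketch only: resolve_input is a hypothetical function, and rejecting a mix of --data and --dataset is an assumption about the desired behavior, not documented Loom behavior.

```python
from typing import Optional, Tuple


def resolve_input(
    mlops: str = "metaflow",
    data: Optional[str] = None,
    dataset: Optional[str] = None,
) -> Tuple[str, str]:
    """Map the execution path to the one way its input may be named."""
    if mlops == "local":
        if data is None or dataset is not None:
            raise ValueError("--mlops local takes --data <directory>, not --dataset")
        return ("directory", data)
    if mlops == "metaflow":
        if dataset is None or data is not None:
            raise ValueError("--mlops metaflow takes --dataset <pathspec>, not --data")
        return ("pathspec", dataset)
    raise ValueError(f"unknown --mlops value: {mlops!r}")
```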
The same IngestDataset seam is reused for durable corpus scale:
loom telemetry export --to-dataset loom-ds-1 ingests the trajectory corpus through the
identical boundary, so it becomes a versioned, lossless Metaflow data object with its own pathspec
— one ingest door, used everywhere data enters.