The data model

Loom's input is a Metaflow data object — a Metaflow Artifact referenced by pathspec (e.g. IngestDataset/123) and read only through the Metaflow Client API. There is exactly one boundary where outside data crosses in — loom ingest — and from then on every verb threads references, never the data itself.

Your input is a Metaflow data object

Loom never invents its own storage format. The unit of data is a Metaflow data object: the same versioned, content-addressed artifacts Metaflow produces from any run, identified by a run pathspec like IngestDataset/123. Wherever a verb takes data, it takes that pathspec as its --dataset:

loom eda --dataset IngestDataset/123 --target is_fraud

An ingested data object is a small, well-defined set of artifacts: train and optional test (DataFrames) plus a schema dict. Loom reads them only through the Metaflow Client API — concretely metaflow.Run(pathspec).data.<artifact> — and that read happens in exactly one module (loom/dataio.py), the single data door for the whole engine.
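As a minimal sketch of what that single read path might look like, the helper below resolves a pathspec through metaflow.Run(pathspec).data and picks out the train, test, and schema artifacts named above. The names load_data_object, DataObject, and the run_factory parameter (present only so the sketch can be exercised without a live datastore) are assumptions for illustration, not the actual contents of loom/dataio.py.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class DataObject:
    """Hypothetical container for the artifacts of one ingested data object."""
    pathspec: str
    train: Any            # a DataFrame in practice
    test: Optional[Any]   # optional DataFrame
    schema: dict


def load_data_object(
    pathspec: str,
    run_factory: Optional[Callable[[str], Any]] = None,
) -> DataObject:
    """Resolve a pathspec to its artifacts via the Metaflow Client API."""
    if run_factory is None:
        # Imported lazily so this sketch can be loaded without Metaflow installed.
        from metaflow import Run
        run_factory = Run
    data = run_factory(pathspec).data
    return DataObject(
        pathspec=pathspec,
        train=data.train,
        test=getattr(data, "test", None),  # test is optional
        schema=data.schema,
    )
```

With Metaflow configured, load_data_object("IngestDataset/123") would return the artifacts of that run; callers only ever hold the pathspec until the moment they need the data.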

A pathspec is the whole identity of the data — versioned and stable. The same IngestDataset/123 resolves to the same object whether Loom runs locally on your laptop today or in a cluster later. You pass the pathspec around; Loom resolves the bytes on demand.

The datastore is opaque

Where those artifacts physically live — a local store, or object storage (S3 / minio) — is an implementation detail that Metaflow owns. You configure it once in your Metaflow profile/environment (METAFLOW_PROFILE / METAFLOW_*); after that, Loom is agnostic to it.

Loom code never talks to that storage directly. There is no object-storage SDK, no bucket-URI literal, and no raw datastore handle anywhere in the engine — a test scans the source to keep it that way.
Switching the local-dev minio store for a real cloud bucket is a Metaflow-profile change. No Loom verb changes, because no verb ever named the bucket.
This is what lets a tenant bring their own perimeter: point METAFLOW_PROFILE at your own Metaflow endpoint and your data objects live in your datastore, never Loom's.
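The source-scanning test mentioned above could be approximated as follows. This is a sketch under stated assumptions: the forbidden patterns (imports of boto3, botocore, or minio, and s3:// literals) and the function names are hypothetical, since the real test's rules are not shown here.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Hypothetical rules: object-storage SDK imports and bucket-URI literals.
FORBIDDEN = [
    re.compile(r"^\s*(?:import|from)\s+(?:boto3|botocore|minio)\b", re.MULTILINE),
    re.compile(r"s3://"),
]


def find_violations(source: str) -> List[str]:
    """Return the forbidden snippets found in one source string."""
    hits = []
    for pattern in FORBIDDEN:
        hits.extend(m.group(0).strip() for m in pattern.finditer(source))
    return hits


def scan_tree(root: Path) -> List[Tuple[Path, str]]:
    """Scan every .py file under root and collect (file, snippet) pairs."""
    found = []
    for path in sorted(root.rglob("*.py")):
        for hit in find_violations(path.read_text(encoding="utf-8")):
            found.append((path, hit))
    return found
```

A test would then assert that scan_tree(engine_root) is empty. It should be pointed at the engine package rather than the tests directory, because a file that defines these patterns contains the literals itself.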

In local dev the datastore is a minio bucket you can browse at the minio console (http://localhost:9001) — that's where ingested data objects and run artifacts land. See Getting started for standing it up, and Configuration & providers for the profile knobs.

One boundary in: `loom ingest`

There is exactly one place where data from outside Loom crosses into the model: loom ingest. It runs the IngestDataset flow once to turn a local directory or CSV into Metaflow artifacts, and prints the resulting pathspec, the dataset_ref that every downstream verb will use. loom ingest is the boundary tier; loom datasets, which lists what has been ingested, is read-only.

# Ingest once — a local dir/CSV becomes a Metaflow data object; prints the dataset_ref.
loom ingest --source ./your_task --name my-dataset   # -> dataset_ref : IngestDataset/123

# See what's ingested (read-only, via the Client API).
loom datasets

The source shape is small and predictable. A directory source contains train.csv (plus optional test.csv / sample_submission.csv); a single .csv is split for you. From there, solutions read from ./input/ and write predictions to ./working/submission.csv — the same on-disk workspace shape no matter where the bytes physically came from.
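To make the source shape concrete, here is a small sketch that checks a candidate --source before ingesting it. The function name describe_source and its return shape are assumptions for illustration; only the file names come from the description above, and how Loom itself validates or splits a source is not shown here.

```python
from pathlib import Path
from typing import Dict

REQUIRED = ("train.csv",)
OPTIONAL = ("test.csv", "sample_submission.csv")


def describe_source(source: Path) -> Dict[str, Path]:
    """Map the recognised files of an ingest source to their paths."""
    if source.is_file():
        # A single CSV is accepted as-is; splitting it is left to ingest.
        if source.suffix.lower() != ".csv":
            raise ValueError(f"expected a .csv file, got {source.name}")
        return {"single_csv": source}
    if not source.is_dir():
        raise FileNotFoundError(source)
    missing = [name for name in REQUIRED if not (source / name).is_file()]
    if missing:
        raise ValueError(f"directory source is missing {missing}")
    found = {name: source / name for name in REQUIRED}
    found.update(
        {name: source / name for name in OPTIONAL if (source / name).is_file()}
    )
    return found
```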

⚠

Don't run a verb against a raw store, a file path mid-analysis, or a freshly fetched MCP source. MCP data tools locate and fetch; the discipline is then to run loom ingest on the result, turning it into a Metaflow data object, and to work from its pathspec. Everything past ingest is a reference, and references are what make composition safe.

Verbs thread references, not data

Once data is a pathspec, the verbs hand off to each other by passing references — pathspecs, card_paths, experiment-ids — never the data. A verb that writes (like features) produces a new data object with its own pathspec, and that pathspec becomes the --dataset for the next verb.

# Build engineered features into a NEW data object; --from drops eda-flagged leakage columns.
loom features --dataset IngestDataset/123 --target target --from EdaFlow/7   # -> FeaturesFlow/9

# The new FeaturesFlow/9 pathspec is the --dataset for every downstream verb.
loom validate --dataset FeaturesFlow/9 --target target

# Promotion asserts the upstream validate VERDICT by reference — never by re-reading data.
loom deploy --validate ValidateFlow/12 --apply

This is how the lifecycle composes: a features data object feeds pipeline / validate; a validate VERDICT==PASS is what deploy asserts before it will promote (a sub-threshold validate BLOCKS deploy). The check happens on the reference and the typed summary, not on the bytes.
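The gate described above can be sketched as a check on a small typed summary rather than on the data. The ValidateSummary type, the DeployBlocked exception, and assert_promotable are hypothetical names; the only behavior taken from the text is that deploy proceeds when VERDICT is PASS and is blocked otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidateSummary:
    """Hypothetical typed summary of a validate run."""
    pathspec: str  # e.g. "ValidateFlow/12"
    verdict: str   # "PASS", or any other value for a failed validate


class DeployBlocked(Exception):
    """Raised when the referenced validate run did not pass."""


def assert_promotable(summary: ValidateSummary) -> str:
    """Return the validate pathspec if it may gate a deploy, else raise."""
    if summary.verdict != "PASS":
        raise DeployBlocked(
            f"{summary.pathspec}: VERDICT={summary.verdict}, deploy blocked"
        )
    return summary.pathspec
```

Nothing in this check touches a DataFrame: the reference and the verdict are enough to decide.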

| You pass | What it is | Example |
| --- | --- | --- |
| `--dataset` | a Metaflow data-object pathspec | `IngestDataset/123`, `FeaturesFlow/9` |
| `--run` / `--solution` | a run pathspec (a candidate / flow run) | `EvalCandidate/42` |
| `--validate` | a validate run whose VERDICT gates deploy | `ValidateFlow/12` |
| `--from` | an upstream run to compose on (e.g. drop leakage cols) | `EdaFlow/7` |
| `--experiment` | a grouping id for report/lineage | `loom-abc123` |

The privacy line: bulk data never reaches the LLM

The most load-bearing consequence of the data model is a privacy line: your bulk data never reaches the LLM. Datasets and transactions live as Metaflow data objects in your datastore/perimeter; candidate code processes them there. The model only ever sees small derived context — schema, a preview, code, metrics — never raw rows, never keys.

The CLI's discipline mirrors this. Verbs operate on datasets / runs / pathspecs; nothing dumps, moves, or pastes raw data through the agent. You thread references between verbs and summarize outcomes in prose, keeping the structured JSON for machine checks.
Telemetry keeps the same posture. Content is redacted by default to <REDACTED:kind> — only schema/preview/metrics enter the corpus, never raw rows, never keys.
The small derived prompts do go to a third-party LLM (that's true of using any model at all); the data itself does not.
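As an illustration of the redact-by-default posture, the sketch below keeps only the kinds of content named above (schema, preview, metrics) and replaces everything else with a <REDACTED:kind> placeholder. Treating each dictionary key as the "kind" is an assumption made for this example; the actual telemetry record layout and redaction rules are not described here.

```python
from typing import Any, Dict

# Kinds allowed into the corpus, per the text above.
ALLOWED_KINDS = frozenset({"schema", "preview", "metrics"})


def redact(record: Dict[str, Any]) -> Dict[str, Any]:
    """Keep allowed kinds; replace every other value with a placeholder."""
    return {
        kind: value if kind in ALLOWED_KINDS else f"<REDACTED:{kind}>"
        for kind, value in record.items()
    }
```

The allow-list direction matters: a new, unanticipated field is redacted unless someone explicitly permits it.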

⚠

Keep it that way with prompt hygiene: don't paste large data or log blobs into context. If fetched material is data to model, bring it in via loom ingest — a Metaflow data object — never by streaming it through chat.

Local vs. Metaflow input

Loom has two execution paths, and they differ only in how the input is named — both converge on the same ./input workspace shape so a solution reads it identically.

| | `--mlops local` | `--mlops metaflow` (default) |
| --- | --- | --- |
| Input | a local `--data` directory | a Metaflow data object (`--dataset <pathspec>` from `loom ingest`) |
| Data | local | a Metaflow artifact; the datastore (local or S3/minio) stays in your perimeter, opaque to Loom |
| Covers | the `loom run` search only | the whole lifecycle (`eda`, `features`, `validate`, `deploy`, …) |

The local path is the on-ramp for a quick "does it work" with no datastore. The lifecycle verbs need --mlops metaflow — they run versioned flows, and each candidate becomes a real Metaflow run rather than an in-process one. The dataset_ref pathspec is the through-line either way.
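The difference in how the input is named can be sketched as a small resolver. The InputRef type and resolve_input function are hypothetical; the sketch only mirrors the pairing of --mlops local with --data and --mlops metaflow with --dataset, and does not reflect how Loom's CLI is actually implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InputRef:
    """Hypothetical named input: which path is used and what it points at."""
    mlops: str  # "local" or "metaflow"
    ref: str    # a directory for local, a pathspec for metaflow


def resolve_input(
    mlops: str = "metaflow",
    data: Optional[str] = None,
    dataset: Optional[str] = None,
) -> InputRef:
    """Pick the input name that matches the chosen execution path."""
    if mlops == "local":
        if not data:
            raise ValueError("--mlops local needs a --data directory")
        return InputRef("local", data)
    if mlops == "metaflow":
        if not dataset:
            raise ValueError("--mlops metaflow needs a --dataset pathspec")
        return InputRef("metaflow", dataset)
    raise ValueError(f"unknown --mlops value: {mlops!r}")
```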

The same IngestDataset seam is reused for durable, corpus-scale storage: loom telemetry export --to-dataset loom-ds-1 ingests the trajectory corpus through the identical boundary, so it becomes a versioned, lossless Metaflow data object with its own pathspec. There is one ingest door, and it is used everywhere data enters.