Mage is an open-source data pipeline tool built around a notebook-style UI where pipelines are composed from typed blocks — loaders, transformers, exporters, sensors — each in its own file, each executable in isolation or as part of a DAG.
Why I starred it
The data engineering tool space is crowded. Airflow is the incumbent but it's painful to develop locally — you write a DAG, push it, wait, debug. Prefect and Dagster added nicer developer UX but still require a fairly heavyweight mental model. Mage's pitch is simpler: the same block you run interactively in the browser is the exact code that runs in production. No wrapper functions, no DAG decorator gymnastics.
What caught me was the executor dispatch layer. Most orchestration tools pick one execution model and build around it. Mage routes blocks to completely different backends — local Python, PySpark on EMR, ECS tasks, GCP Cloud Run, or Kubernetes — without changing the block code itself. That's a meaningful architectural choice.
How it works
The CLI is built with Typer in mage_ai/cli/main.py and exposes three main commands: init, start, and run. start launches a Tornado-based server plus the React frontend. run loads a pipeline and hands it to the executor factory.
The block system is where the interesting design lives. Every block type — data_loader, transformer, data_exporter, sensor — is declared as a decorator in mage_ai/data_preparation/decorators.py. The actual decorator bodies there are stubs:
def data_loader(function):
    return function

def transformer(function):
    return function
The real decorator injection happens at runtime inside execute_block in mage_ai/data_preparation/models/block/__init__.py. The static stubs exist solely so that block files import cleanly as valid Python; the actual resolution of a function's block type is dynamic. It's a slightly odd pattern, but it means block files are plain Python modules you can import and test entirely outside of Mage.
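That testability claim is easy to demonstrate. Here's a minimal sketch (the data and the local transformer stand-in are illustrative, not Mage's code): because the stub decorator is a pass-through, the decorated function is just a function you can call directly.

```python
# Stand-in mimicking the stub in mage_ai/data_preparation/decorators.py:
# a pass-through decorator, so the block function is unchanged at import time.
def transformer(function):
    return function

# A block body like you'd write in a Mage project, but with plain Python data
# so it runs anywhere (illustrative; real blocks typically use DataFrames).
@transformer
def transform(rows, *args, **kwargs):
    # keep only positive amounts, standing in for real transform logic
    return [r for r in rows if r['amount'] > 0]

# Exercised directly, no Mage runtime involved:
result = transform([{'amount': 10}, {'amount': -5}, {'amount': 3}])
```

A plain `pytest` file that imports the block module and asserts on its output works the same way, which is the practical payoff of the pass-through stubs.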
The executor dispatch in mage_ai/data_preparation/executors/executor_factory.py is worth reading. ExecutorFactory.get_block_executor inspects the block's executor_type field, checks the pipeline type, and also looks at whether the block contains Spark code:
if pipeline.type == PipelineType.PYSPARK and (
    block.type != BlockType.SENSOR or is_pyspark_code(block.content)
):
    executor_type = ExecutorType.PYSPARK
It falls through cleanly to local Python by default. The DEFAULT_EXECUTOR_TYPE environment variable overrides the default for the entire instance, which is how you'd configure a production deployment to route all blocks to Kubernetes without touching pipeline definitions.
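The resolution order described above can be sketched as a small function. This is a simplified illustration of the pattern, not Mage's actual implementation: the enum values, the `'spark' in content` heuristic (a crude stand-in for `is_pyspark_code`), and the function names are assumptions.

```python
import os
from enum import Enum

class ExecutorType(str, Enum):
    LOCAL_PYTHON = 'local_python'
    PYSPARK = 'pyspark'
    K8S = 'k8s'

def resolve_executor_type(block_executor_type, pipeline_is_pyspark, content):
    # 1. An explicit per-block executor_type wins.
    if block_executor_type:
        return ExecutorType(block_executor_type)
    # 2. PySpark pipelines route Spark-looking blocks to the Spark backend
    #    ('spark' in content is a toy stand-in for is_pyspark_code).
    if pipeline_is_pyspark and 'spark' in content:
        return ExecutorType.PYSPARK
    # 3. Otherwise fall through to the instance-wide default, which an env
    #    var like DEFAULT_EXECUTOR_TYPE=k8s overrides in production.
    default = os.getenv('DEFAULT_EXECUTOR_TYPE', ExecutorType.LOCAL_PYTHON.value)
    return ExecutorType(default)
```

The design point is that routing lives entirely in the factory: block code never mentions its backend, so the same pipeline can run locally in development and on Kubernetes in production.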
Pipeline definitions live in metadata.yaml files inside each pipeline directory under pipelines/. The Pipeline class in mage_ai/data_preparation/models/pipeline.py loads these configs and builds a DAG from block_configs entries. The cycle detection is a simple depth-first check — CYCLE_DETECTION_ERR_MESSAGE = 'A cycle was detected in this pipeline' — nothing fancy, but it's there.
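A depth-first cycle check of the kind described is only a few lines. This sketch is illustrative (the edge format and function name are assumptions, not Mage's API); only the error message string is taken from the source.

```python
CYCLE_DETECTION_ERR_MESSAGE = 'A cycle was detected in this pipeline'

def check_for_cycles(edges):
    """edges maps each block uuid to its downstream block uuids."""
    visiting, done = set(), set()

    def dfs(node):
        if node in visiting:
            # revisiting a node still on the current DFS path means a cycle
            raise ValueError(CYCLE_DETECTION_ERR_MESSAGE)
        if node in done:
            return
        visiting.add(node)
        for child in edges.get(node, []):
            dfs(child)
        visiting.discard(node)
        done.add(node)

    for node in edges:
        dfs(node)

# An acyclic load -> transform -> export chain passes without raising:
check_for_cycles({'load': ['transform'], 'transform': ['export'], 'export': []})
```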
Block outputs are persisted as variables under .variables/ using VariableManager. This means you can re-run a single block without re-running upstream blocks, which is what makes the interactive notebook experience actually work without being slow.
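The idea is simple enough to sketch. This is not the VariableManager API, just an illustration of the caching scheme under assumed names and a JSON serialization: each block's output lands in its own directory under .variables/, and a downstream re-run reads the cache instead of re-executing upstream blocks.

```python
import json
import tempfile
from pathlib import Path

def write_variable(base_dir, block_uuid, data):
    # one directory per block under .variables/, holding its persisted output
    path = Path(base_dir) / '.variables' / block_uuid
    path.mkdir(parents=True, exist_ok=True)
    (path / 'output.json').write_text(json.dumps(data))

def read_variable(base_dir, block_uuid):
    path = Path(base_dir) / '.variables' / block_uuid / 'output.json'
    return json.loads(path.read_text())

# Re-running a downstream block only needs the cached upstream output:
with tempfile.TemporaryDirectory() as project_dir:
    write_variable(project_dir, 'load_users', [{'id': 1}, {'id': 2}])
    cached = read_variable(project_dir, 'load_users')
```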
Using it
pip install mage-ai
mage init my_project
mage start my_project
# opens http://localhost:6789
Or headless via CLI:
mage run my_project my_pipeline_uuid
# with runtime variables
mage run my_project my_pipeline_uuid --runtime-vars '{"start_date": "2023-01-01"}'
Blocks are plain Python files in my_project/data_loaders/, transformers/, etc. A transformer block looks like:
from mage_ai.data_preparation.decorators import transformer
import pandas as pd

@transformer
def transform(data: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    return data[data['amount'] > 0]
The decorator is the only Mage-specific code. The rest is regular pandas, polars, or whatever you're using.
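Runtime variables like the --runtime-vars JSON above reach the block through its **kwargs, so you can simulate a parameterized run by calling the function directly. A sketch with plain Python data (the local transformer stand-in and the date-filter logic are illustrative, not from Mage):

```python
def transformer(function):  # stand-in for the Mage pass-through stub
    return function

@transformer
def transform(rows, *args, **kwargs):
    # a runtime variable, e.g. '2023-01-01' supplied via --runtime-vars
    start_date = kwargs.get('start_date')
    return [r for r in rows if r.get('date', '') >= (start_date or '')]

# Simulate a run with the runtime variable passed as a keyword argument:
out = transform(
    [{'date': '2022-12-31'}, {'date': '2023-02-01'}],
    start_date='2023-01-01',
)
```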
Rough edges
The codebase is large and moves fast — the mage_ai/data_preparation/models/block/__init__.py file is enormous, doing too much in one place. The dynamic decorator injection pattern is clever but makes it harder to trace what actually runs without a debugger.
Test coverage is uneven. The mage_ai/tests/ directory exists but the coverage of the executor layer is thin — I didn't find tests for the ECS or Cloud Run executor paths, which are exactly the production codepaths where bugs hurt.
The frontend is a full React app bundled inside the Python package under mage_ai/frontend/. That's a 100+ MB install depending on your platform. For a headless deployment where you only ever use the CLI, you're carrying a lot of dead weight.
Documentation at the time of starring was improving but the gap between what the UI makes easy and what the docs explain was noticeable. The integration-specific connectors under mage_integrations/ are largely undocumented beyond the source.
Bottom line
Mage is the right choice if you want to run ETL pipelines locally with fast iteration and a visual debugger, then push the same code to Kubernetes or ECS in production. If you're already invested in Airflow or Dagster and have a working deployment, the migration cost is real.
