
Quickstart

Build and run your first Ubunye pipeline in under 5 minutes.


1. Install

pip install ubunye-engine

2. Scaffold a task

ubunye init -d pipelines -u demo -p etl -t hello_world

This creates:

pipelines/demo/etl/hello_world/
    config.yaml              ← I/O and compute config
    transformations.py       ← your Python transform
    notebooks/
        hello_world_dev.ipynb  ← interactive dev notebook
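
The layout above can be checked from Python before moving on. The sketch below is a hypothetical helper, not part of Ubunye. It assumes the task directory is formed by joining the `-d`, `-u`, `-p`, and `-t` values in that order, as the example path suggests, and that the notebook is always named `<task>_dev.ipynb`, which generalizes from the single example shown.

```python
from __future__ import annotations

from pathlib import Path


def expected_scaffold(task_dir: Path) -> list[Path]:
    """Files the quickstart lists as created by `ubunye init` for a task."""
    task_name = task_dir.name
    return [
        task_dir / "config.yaml",
        task_dir / "transformations.py",
        # Assumption: the notebook name follows the <task>_dev.ipynb pattern.
        task_dir / "notebooks" / f"{task_name}_dev.ipynb",
    ]


def missing_scaffold_files(task_dir: Path) -> list[Path]:
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_scaffold(task_dir) if not p.is_file()]


if __name__ == "__main__":
    # Assumption: the directory mirrors the -d/-u/-p/-t values, in that order.
    task_dir = Path("pipelines", "demo", "etl", "hello_world")
    for path in missing_scaffold_files(task_dir):
        print(f"missing: {path}")
```

An empty result means every file from the listing is in place.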

3. Edit the config

Open pipelines/demo/etl/hello_world/config.yaml:

MODEL: etl
VERSION: "1.0.0"

CONFIG:
  inputs:
    source:
      format: hive
      db_name: default
      tbl_name: sample_data

  transform:
    type: noop        # pass-through; replace with your transform type

  outputs:
    sink:
      format: delta
      path: /tmp/ubunye_demo/output
      mode: overwrite

No Spark handy?

Swap the connectors for REST API or JDBC to run without a Hive metastore. See the Connectors overview.


4. (Optional) Add a transform

Edit transformations.py:

from ubunye.core.interfaces import Task

class HelloWorldTask(Task):
    def transform(self, sources: dict) -> dict:
        df = sources["source"]
        return {"sink": df.filter("value IS NOT NULL")}

Then reference it in config.yaml:

  transform:
    type: task          # loads transformations.py automatically
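
The dictionary keys in `transform` line up with the example config: `sources["source"]` matches the input named `source`, and the returned `"sink"` key matches the output named `sink`. The sketch below exercises that pairing without Spark or Ubunye installed. `FakeFrame` and `HelloWorldLogic` are hypothetical stand-ins for a Spark DataFrame and the `Task` subclass; only the `transform` body is taken from the step above, and the idea that returned keys must equal the configured output names is an assumption drawn from the example.

```python
class FakeFrame:
    """Hypothetical stand-in for a Spark DataFrame; records filter calls."""

    def __init__(self, filters=()):
        self.filters = tuple(filters)

    def filter(self, condition: str) -> "FakeFrame":
        # Like Spark, return a new object instead of mutating in place.
        return FakeFrame(self.filters + (condition,))


class HelloWorldLogic:
    """Same transform body as HelloWorldTask, minus the Ubunye base class."""

    def transform(self, sources: dict) -> dict:
        df = sources["source"]
        return {"sink": df.filter("value IS NOT NULL")}


# Names taken from the inputs/outputs blocks of the example config.yaml.
CONFIG_INPUTS = {"source"}
CONFIG_OUTPUTS = {"sink"}

sources = {name: FakeFrame() for name in CONFIG_INPUTS}
outputs = HelloWorldLogic().transform(sources)

# Assumption: returned keys should match the output names in the config.
assert set(outputs) == CONFIG_OUTPUTS
print(outputs["sink"].filters)
```

A check like this catches a renamed input or output in config.yaml before a full run is attempted.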

5. Validate the config

ubunye validate -d pipelines -u demo -p etl -t hello_world

Expected output:

[OK] Config is valid.

6. Preview the execution plan

ubunye plan -d pipelines -u demo -p etl -t hello_world

This prints the execution plan as a DAG (inputs → transform → outputs) without executing anything.
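
To make the shape of that plan concrete, here is a rough sketch that derives an inputs → transform → outputs listing from a config held as a Python dict. It is an illustration only: `render_plan` is a hypothetical helper, and the real `ubunye plan` output format may differ.

```python
def render_plan(config: dict) -> list:
    """Describe each edge of a linear inputs -> transform -> outputs plan."""
    cfg = config["CONFIG"]
    transform = f"transform[{cfg['transform']['type']}]"
    lines = []
    for name, spec in cfg["inputs"].items():
        lines.append(f"input:{name} ({spec['format']}) -> {transform}")
    for name, spec in cfg["outputs"].items():
        lines.append(f"{transform} -> output:{name} ({spec['format']})")
    return lines


# Mirrors the example config.yaml from step 3.
config = {
    "MODEL": "etl",
    "VERSION": "1.0.0",
    "CONFIG": {
        "inputs": {
            "source": {
                "format": "hive",
                "db_name": "default",
                "tbl_name": "sample_data",
            }
        },
        "transform": {"type": "noop"},
        "outputs": {
            "sink": {
                "format": "delta",
                "path": "/tmp/ubunye_demo/output",
                "mode": "overwrite",
            }
        },
    },
}

for line in render_plan(config):
    print(line)
```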


7. Run

ubunye run -d pipelines -u demo -p etl -t hello_world --profile dev

Optionally capture lineage:

ubunye run -d pipelines -u demo -p etl -t hello_world --profile dev --lineage

View recorded runs:

ubunye lineage list
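
Steps 5 to 7 repeat the same four flags, so a script or CI job can build the commands from one set of values. The sketch below is a hypothetical wrapper around the CLI calls shown above. It uses only the subcommands and flags that appear in this quickstart, and it assumes that `ubunye` is on `PATH` and that a failing command exits with a non-zero status.

```python
from __future__ import annotations

import subprocess

# The task coordinates used throughout this quickstart.
TASK_FLAGS = ["-d", "pipelines", "-u", "demo", "-p", "etl", "-t", "hello_world"]


def ubunye_cmd(subcommand: str, *extra: str) -> list[str]:
    """Build the argv for one CLI call, e.g. validate, plan, or run."""
    return ["ubunye", subcommand, *TASK_FLAGS, *extra]


def validate_plan_run(profile: str = "dev", lineage: bool = False) -> None:
    """Run validate, plan, then run; stop at the first failing command."""
    run_extra = ["--profile", profile] + (["--lineage"] if lineage else [])
    for cmd in (
        ubunye_cmd("validate"),
        ubunye_cmd("plan"),
        ubunye_cmd("run", *run_extra),
    ):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Calling `validate_plan_run(lineage=True)` would issue the same three commands as steps 5 to 7, with lineage capture on the final run.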


8. (Optional) Run from Python

On Databricks or in a notebook, use the Python API instead of the CLI:

import ubunye

outputs = ubunye.run_task(
    task_dir="pipelines/demo/etl/hello_world",
    mode="DEV",
)

The Python API detects an active SparkSession, such as the one Databricks provides, and reuses it instead of creating a new one.
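
For context, PySpark 3.0+ exposes `SparkSession.getActiveSession()`, which returns the current session or `None`. The sketch below shows one way detection of this kind can be written. It is not Ubunye's implementation, and `find_active_spark` is a hypothetical name.

```python
def find_active_spark():
    """Return the active SparkSession, or None if there is none.

    Also returns None when PySpark is not installed.
    """
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        return None
    # Returns None rather than creating a session when none is active.
    return SparkSession.getActiveSession()


spark = find_active_spark()
if spark is None:
    print("No active SparkSession; one would have to be created.")
else:
    print(f"Reusing SparkSession, Spark version {spark.version}")
```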


What's next?

Topic                              Link
Full YAML schema                   Config Reference
All built-in connectors            Connectors
Python API reference               API Reference
Deploying to Databricks            Deployment
Training and versioning ML models  Model Contract
CLI flags and sub-commands         CLI Reference
Writing custom plugins             Plugin Guide