# Quickstart
Build and run your first Ubunye pipeline in under 5 minutes.
## 1. Install
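Assuming Ubunye is distributed on PyPI under that name (an assumption, not confirmed here), installation would look like:

```shell
pip install ubunye
```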
## 2. Scaffold a task
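A plausible scaffold invocation, assuming the CLI exposes a `new` sub-command (the sub-command name is hypothetical):

```shell
ubunye new demo/etl/hello_world
```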
This creates:

```
pipelines/demo/etl/hello_world/
├── config.yaml              ← I/O and compute config
├── transformations.py       ← your Python transform
└── notebooks/
    └── hello_world_dev.ipynb  ← interactive dev notebook
```
## 3. Edit the config
Open pipelines/demo/etl/hello_world/config.yaml:
```yaml
MODEL: etl
VERSION: "1.0.0"
CONFIG:
  inputs:
    source:
      format: hive
      db_name: default
      tbl_name: sample_data
  transform:
    type: noop  # pass-through; replace with your transform type
  outputs:
    sink:
      format: delta
      path: /tmp/ubunye_demo/output
      mode: overwrite
```
> **No Spark handy?** Swap the connectors for REST API or JDBC to run without a Hive metastore. See the Connectors overview.
## 4. (Optional) Add a transform
Edit transformations.py:
```python
from ubunye.core.interfaces import Task


class HelloWorldTask(Task):
    def transform(self, sources: dict) -> dict:
        df = sources["source"]
        return {"sink": df.filter("value IS NOT NULL")}
```
Then reference it in config.yaml:
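A plausible config fragment, assuming the transform is referenced by a dotted class path (the `type: python` and `class` keys are assumptions about the schema):

```yaml
transform:
  type: python
  class: transformations.HelloWorldTask
```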
## 5. Validate the config
Expected output:
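A sketch of this step, assuming a `validate` sub-command (hypothetical) that prints a success message when the YAML conforms to the schema:

```shell
ubunye validate pipelines/demo/etl/hello_world/config.yaml
```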
## 6. Preview the execution plan
Prints a DAG: inputs → transform → outputs. Nothing is executed.
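A sketch of the preview invocation, assuming a `plan` sub-command (the name is an assumption):

```shell
ubunye plan pipelines/demo/etl/hello_world/config.yaml
```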
## 7. Run
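Assuming a `run` sub-command that takes the config path (hypothetical):

```shell
ubunye run pipelines/demo/etl/hello_world/config.yaml
```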
Optionally capture lineage:
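A sketch, assuming lineage capture is enabled with a flag (the `--lineage` flag name is an assumption):

```shell
ubunye run pipelines/demo/etl/hello_world/config.yaml --lineage
```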
View recorded runs:
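Assuming a `runs` sub-command for browsing recorded executions (hypothetical):

```shell
ubunye runs list
```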
## 8. (Optional) Run from Python
On Databricks or in a notebook, use the Python API instead of the CLI:
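A sketch of what the Python entry point might look like (the module path and `run_pipeline` function name are assumptions, not confirmed API):

```python
from ubunye import run_pipeline

result = run_pipeline("pipelines/demo/etl/hello_world/config.yaml")
```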
The Python API auto-detects an active SparkSession (Databricks) and reuses it.
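The auto-detection described above can be sketched in plain PySpark terms. This is a minimal sketch of the pattern, not Ubunye's actual implementation:

```python
def get_spark_session():
    """Reuse an active SparkSession when one exists (e.g. on Databricks);
    otherwise build one, or return None if PySpark is unavailable."""
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        return None  # no PySpark on this machine
    # getActiveSession() returns the session bound to the current thread,
    # or None when no session has been started yet.
    return SparkSession.getActiveSession() or SparkSession.builder.getOrCreate()
```

On Databricks, `getActiveSession()` returns the notebook's existing session, so no second session is created.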
## What's next?
| Topic | Link |
|---|---|
| Full YAML schema | Config Reference |
| All built-in connectors | Connectors |
| Python API reference | API Reference |
| Deploying to Databricks | Deployment |
| Training and versioning ML models | Model Contract |
| CLI flags and sub-commands | CLI Reference |
| Writing custom plugins | Plugin Guide |