Ubunye Engine
Ubunye Engine is a config-first, Spark-native framework for building ETL and ML pipelines that run everywhere — locally, on-prem, or in the cloud (Databricks, EMR, Glue).
Define your pipeline in YAML. Write a Python class. Run it.

Why Ubunye?
| Concern | How Ubunye handles it |
|---|---|
| I/O boilerplate | Declarative connectors (Hive, JDBC, S3, Delta, REST API, Unity Catalog) |
| Config drift between environments | Jinja2 templating + per-profile Spark overrides |
| ML lifecycle management | Library-independent UbunyeModel contract + built-in registry |
| Observability | Pluggable lineage tracking, Prometheus, OpenTelemetry, MLflow |
| Orchestration | One-command export to Airflow, Databricks, Prefect, Dagster |
| Testing | Spark-free unit-test patterns; 288 tests in CI |
Key Features
- Config-first: all I/O, compute, and orchestration are YAML-driven; no hardcoded credentials or paths.
- Plugin system: extend via Python entry points (`ubunye.readers`, `ubunye.writers`, `ubunye.transforms`).
- Model registry: version, promote, roll back, and gate ML models without coupling to any ML library.
- Lineage tracking: automatic run provenance written to `.ubunye/lineage/`.
- Telemetry-ready: Prometheus, OpenTelemetry, and JSON event logs via the `monitors` protocol.
- Orchestration export: generate Airflow DAGs or Databricks job JSON from the same config.
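
The plugin groups named in the feature list are ordinary Python entry-point groups, so installed plugins can be listed with the standard library alone. The following is a minimal sketch: `list_plugins` is a hypothetical helper for illustration, not part of Ubunye's API, and the way the engine itself discovers plugins may differ.

```python
from importlib.metadata import entry_points

# Entry-point group names taken from the feature list above.
PLUGIN_GROUPS = ("ubunye.readers", "ubunye.writers", "ubunye.transforms")


def list_plugins(groups=PLUGIN_GROUPS):
    """Map each group to the sorted names of its installed entry points.

    Hypothetical helper for illustration only.
    """
    eps = entry_points()
    found = {}
    for group in groups:
        # Python 3.10+ exposes .select(); older versions return a plain dict.
        selected = eps.select(group=group) if hasattr(eps, "select") else eps.get(group, [])
        found[group] = sorted(ep.name for ep in selected)
    return found


if __name__ == "__main__":
    for group, names in list_plugins().items():
        print(f"{group}: {', '.join(names) or '(none installed)'}")
```

A third-party package would register under one of these groups in its own packaging metadata; the exact class or callable each group expects is described in the Developer Guide.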
Two entry points
Ubunye can be run from the CLI or from the Python API. The Python API detects and reuses an active SparkSession on Databricks; when no session exists, it creates one, just as the CLI does.
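
The reuse-or-create behaviour can be sketched as a small function. This is an illustration of the pattern only, not Ubunye's actual code: `resolve_session` is a hypothetical name, and the session lookup and factory are passed in as callables so the sketch runs without Spark installed.

```python
from typing import Callable, Optional, TypeVar

S = TypeVar("S")


def resolve_session(get_active: Callable[[], Optional[S]], create: Callable[[], S]) -> S:
    """Return the active session if there is one, otherwise create a new one."""
    session = get_active()
    return session if session is not None else create()


# With PySpark installed, the wiring could look like this (an assumption,
# not the engine's real implementation):
#   from pyspark.sql import SparkSession
#   spark = resolve_session(SparkSession.getActiveSession,
#                           SparkSession.builder.getOrCreate)
```

On Databricks the notebook or job already owns a session, so the first branch is taken and the cluster configuration is left untouched. Locally, the factory runs and a new session is built.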
Quick Example

    # pipelines/fraud/etl/claims/config.yaml
    MODEL: etl
    VERSION: "1.0.0"
    CONFIG:
      inputs:
        raw_claims:
          format: hive
          db_name: raw
          tbl_name: claims
      transform:
        type: noop
      outputs:
        clean_claims:
          format: delta
          path: s3://data-lake/clean/claims
          mode: overwrite

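
The example config has three top-level keys, with named inputs and outputs that each declare a `format`. A quick shape check along those lines might look like the sketch below. `check_config` and the rules it enforces are assumptions made for illustration; the authoritative schema is in the Config Reference and may require more, or less, than this.

```python
# Top-level keys seen in the example config; treating them as required
# is an assumption made for this sketch.
REQUIRED_TOP_LEVEL = ("MODEL", "VERSION", "CONFIG")


def check_config(cfg: dict) -> list:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = [f"missing top-level key: {key}" for key in REQUIRED_TOP_LEVEL if key not in cfg]
    body = cfg.get("CONFIG") or {}
    for section in ("inputs", "outputs"):
        entries = body.get(section) or {}
        if not entries:
            problems.append(f"CONFIG.{section} is empty or missing")
        for name, spec in entries.items():
            if not isinstance(spec, dict) or "format" not in spec:
                problems.append(f"CONFIG.{section}.{name} has no 'format'")
    return problems


# The quick example, written out as the dict a YAML loader would produce.
example = {
    "MODEL": "etl",
    "VERSION": "1.0.0",
    "CONFIG": {
        "inputs": {"raw_claims": {"format": "hive", "db_name": "raw", "tbl_name": "claims"}},
        "transform": {"type": "noop"},
        "outputs": {
            "clean_claims": {
                "format": "delta",
                "path": "s3://data-lake/clean/claims",
                "mode": "overwrite",
            }
        },
    },
}

print(check_config(example))  # → []
```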
Documentation Map
- Installation: install, verify, extras
- Quickstart: end-to-end in 5 minutes
- Config Reference: full YAML schema
- Connectors: all built-in readers and writers
- ML Model Contract: the `UbunyeModel` ABC
- ML Registry: versioning, promotion, gates
- CLI Reference: all commands and flags
- API Reference: Python API (`run_task`, `run_pipeline`)
- Deployment: Databricks Asset Bundles + GitHub Actions
- Developer Guide: architecture, plugins, testing