Ubunye Engine¶

Ubunye Engine is a config-first, Spark-native framework for building ETL and ML pipelines that run everywhere — locally, on-prem, or in the cloud (Databricks, EMR, Glue).

Define your pipeline in YAML. Write a Python class. Run it.

Ubunye Engine deployment overview

Get started in 5 minutes → GitHub →

Why Ubunye?¶

Concern	How Ubunye handles it
I/O boilerplate	Declarative connectors (Hive, JDBC, S3, Delta, REST API, Unity Catalog)
Config drift between environments	Jinja2 templating + per-profile Spark overrides
ML lifecycle management	Library-independent `UbunyeModel` contract + built-in registry
Observability	Pluggable lineage tracking, Prometheus, OpenTelemetry, MLflow
Orchestration	One-command export to Airflow, Databricks, Prefect, Dagster
Testing	Spark-free unit-test patterns; 288 tests in CI

Key Features¶

Config-first — all I/O, compute, and orchestration is YAML-driven; no hardcoded credentials or paths.
Plugin system — extend via Python entry points (ubunye.readers, ubunye.writers, ubunye.transforms).
Model registry — version, promote, rollback, and gate ML models without coupling to any ML library.
Lineage tracking — automatic run provenance written to .ubunye/lineage/.
Telemetry-ready — Prometheus, OpenTelemetry, and JSON event logs via the monitors protocol.
Orchestration export — generate Airflow DAGs or Databricks job JSON from the same config.

Two entry points¶

CLIPython API (Databricks)

ubunye run -d pipelines -u fraud -p etl -t claims -m PROD --lineage

import ubunye

outputs = ubunye.run_task(
    task_dir="pipelines/fraud/etl/claims",
    mode="PROD",
    dt="2024-06-01",
)

The Python API auto-detects and reuses an active SparkSession on Databricks. When no session exists, it creates one — same as the CLI.

Quick Example¶

# pipelines/fraud/etl/claims/config.yaml
MODEL: etl
VERSION: "1.0.0"

CONFIG:
  inputs:
    raw_claims:
      format: hive
      db_name: raw
      tbl_name: claims
  transform:
    type: noop
  outputs:
    clean_claims:
      format: delta
      path: s3://data-lake/clean/claims
      mode: overwrite

ubunye run -d pipelines -u fraud -p etl -t claims -m PROD

Documentation Map¶

Installation — install, verify, extras
Quickstart — end-to-end in 5 minutes
Config Reference — full YAML schema
Connectors — all built-in readers and writers
ML — Model Contract — the UbunyeModel ABC
ML — Registry — versioning, promotion, gates
CLI Reference — all commands and flags
API Reference — Python API (run_task, run_pipeline)
Deployment — Databricks Asset Bundles + GitHub Actions
Developer Guide — architecture, plugins, testing