# Changelog
All notable changes to Ubunye Engine will be documented here.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
## [Unreleased]

## [0.1.7] — 2026-04-21

### Changed
- `MODEL` and `VERSION` are now optional at the top level of `config.yaml`. `MODEL` defaults to `etl` and `VERSION` defaults to `"0.0.0-dev"`. The semver validator now accepts an optional pre-release suffix (e.g. `1.0.0-rc1`, `0.0.0-dev`) in addition to plain `MAJOR.MINOR.PATCH`; a sketch of the accepted shapes follows this list. Existing configs that set these fields explicitly continue to work unchanged. Rationale: these two lines were boilerplate in every scaffolded pipeline; the defaults cover the common case. Set them explicitly when job type or version is load-bearing (lineage, model registry, orchestrator metadata).
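A sketch of the accepted version shapes, assuming a simple regex-style check (the engine's actual validator lives in its config schema and may be implemented differently):

```python
import re

# Hypothetical approximation of the relaxed rule described above:
# MAJOR.MINOR.PATCH plus an optional "-<pre-release>" suffix.
SEMVER_RE = re.compile(r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?$")

for candidate in ("1.0.0", "1.0.0-rc1", "0.0.0-dev", "1.0"):
    print(candidate, bool(SEMVER_RE.match(candidate)))  # only "1.0" is rejected
```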
### Added
- Production reference example: Titanic multi-task pipeline (local) — `examples/production/titanic_multitask_local/` demonstrates sequential task chaining via `ubunye run -t clean_data -t aggregate`. Task 1 reads the Titanic CSV, cleans it, and writes intermediate Parquet. Task 2 reads that Parquet and computes survival rates by class and age group. Exercises `run_pipeline()`, sibling-module isolation between tasks, and cross-task lineage (a minimal Python sketch of the chaining appears after this list). CI workflow (`.github/workflows/multitask_local.yml`) runs Spark unit tests, validates both configs, runs the full pipeline, and verifies output. See the example's `README.md`.
- Production reference example: Titanic multi-task pipeline (Databricks) — `examples/production/titanic_multitask_databricks/` is the Databricks counterpart of the local multi-task example. Same `transformations.py` (byte-identical, CI-enforced), but task chaining uses Unity Catalog Delta tables instead of Parquet files. Task 1 writes `titanic_cleaned`; task 2 reads it and writes `survival_summary`. Runs via `ubunye.run_pipeline()` on serverless compute. CI workflow enforces the portability diff against the local example and validates/deploys the Asset Bundle.
- Production reference example: Titanic ML end-to-end on Databricks — `examples/production/titanic_ml_databricks/` demonstrates the full ML lifecycle: an `UbunyeModel` subclass (sklearn RandomForest), MLflow param/metric/artifact logging, a filesystem-backed `ModelRegistry` on a UC volume, and a `PromotionGate` on validation AUC. Two serverless jobs share one Asset Bundle — `titanic_train` registers and auto-promotes; `titanic_predict` loads the current production/staging model and writes a Unity Catalog Delta predictions table. A one-row `training_metrics` audit record is appended per training run. CI (`.github/workflows/titanic_ml_databricks.yml`) runs pandas/sklearn unit tests, diffs the two `model.py` copies for drift, and validates/deploys the bundle when Databricks secrets are configured. See the example's `README.md`.
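A minimal, non-authoritative sketch of driving the same two-task chain from Python rather than the CLI. `run_pipeline()` itself is real (see the Python API entry under 0.1.6); the positional config path and the `tasks=` keyword below are assumptions made for illustration only.

```python
# Hypothetical call, mirroring the CLI form `ubunye run -t clean_data -t aggregate`.
# The config path and the `tasks=` keyword are illustrative, not a documented signature.
from ubunye import run_pipeline

run_pipeline(
    "examples/production/titanic_multitask_local/config.yaml",  # assumed config location
    tasks=["clean_data", "aggregate"],                           # assumed keyword argument
)
```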
### Changed
- GitHub Actions bumped to Node 24-capable majors. `actions/checkout@v4` → `@v6`, `actions/setup-python@v5` → `@v6` across all nine workflow files. Resolves the deprecation warning surfaced on 2026-04-16 runs, ahead of the Node 20 removal deadline (2026-09-16). No behavioural change.
### Fixed
- Undefined CLI template variables silently leaked into resolved configs. `resolve_config` pre-checked `{{ env.X }}` references but not bare `{{ var }}` identifiers. Jinja2's `DebugUndefined` left unresolved expressions verbatim, so `path: "file:///{{ dt }}"` with no `dt` provided would pass validation and then hand Spark a literal `file:///{{ dt }}` at runtime. The fix adds a post-render residue scan that names the offending variable and suggests a CLI flag, env var, or `| default()` filter. Regression tests in `tests/unit/config/test_resolver.py`. Pre-existing configs under `examples/` and `pipelines/` all use `| default()` on CLI-derived vars, so no downstream config needs updating.
- Sibling modules leaked between sequential tasks in `run_pipeline`. `_with_task_dir_on_path` added the task dir to `sys.path` but never cleaned up `sys.modules` on exit. Two tasks that each shipped their own `model.py` (or `utils.py`, etc.) would silently run the first task's module when the second task imported it — Python's import cache was keying on the shared short name. The fix evicts only modules whose source file lives under the exiting task dir; stdlib and site-packages are untouched (a simplified sketch of the eviction follows this list). Regression test in `tests/unit/test_task_runner.py`. Caught by offline audit on the overnight branch, ahead of the planned multi-task DAG example (`tasks/todo/task-04.md`).
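A simplified sketch of the eviction approach described above (illustrative only; the engine's actual `_with_task_dir_on_path` differs in naming and detail):

```python
import sys
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def task_dir_on_path(task_dir: str):
    """Put the task dir on sys.path, then evict only its modules on exit."""
    root = str(Path(task_dir).resolve())
    sys.path.insert(0, root)
    try:
        yield
    finally:
        sys.path.remove(root)
        for name, module in list(sys.modules.items()):
            src = getattr(module, "__file__", None)
            # Drop only modules whose source file lives under the exiting task dir;
            # stdlib and site-packages modules stay cached.
            if src and Path(src).resolve().is_relative_to(root):
                del sys.modules[name]
```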
## [0.1.6] — 2026-04-15

### Changed
- `titanic_databricks` DAB switched to serverless + Unity Catalog. Removed the `new_cluster` block and the DBFS bootstrap; the notebook now provisions a UC volume, downloads the Titanic CSV into it at runtime, and the writer emits a Unity Catalog managed Delta table (`workspace.titanic.survival_by_class` by default). Validated end-to-end on Databricks Free Edition. The portability contract with `titanic_local` (byte-identical `transformations.py`) is unchanged — only the deployment wrapper (config + notebook + DAB) differs.
- `jhb_weather_databricks` DAB switched to serverless compute. Removed the `new_cluster` block (and the `existing_cluster_id` escape hatch) from `databricks.yml`. Notebook tasks with no cluster spec route to serverless on both Free Edition and paid workspaces; the notebook installs `ubunye-engine` at runtime via `%pip`. Default `weather_catalog` changed from `main` to `workspace` because Free Edition auto-provisions only the `workspace` catalog. Paid-workspace users override via `--var="weather_catalog=main"`. README documents the rationale.
- `databricks_deploy.yml` (titanic) skips deploy gracefully when secrets are absent. Mirrors the soft-skip pattern in `jhb_weather_databricks.yml` so PRs from forks (or repos that have not configured `DATABRICKS_HOST`/`DATABRICKS_TOKEN`) still exercise the unit tests and portability diff instead of failing the workflow.
- Production examples switched from pandas twins to PySpark tests. Every `transformations.py` under `examples/production/` now exposes a single Spark implementation. Tests use a session-scoped `SparkSession` fixture (`local[1]`, 512 MB driver, shuffle partitions=1; sketched below) so the production code is the code under test. Eliminates the dual-maintenance burden and the risk of silent drift between pandas and Spark paths. Unit-test CI steps now install Java 17 + `ubunye-engine[spark,dev]` on the runner.
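A sketch of such a fixture, assuming pytest; the fixture actually shipped with the examples may differ in detail, but the stated settings map onto standard Spark options:

```python
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    """Small local SparkSession so the production transformations are the code under test."""
    session = (
        SparkSession.builder
        .master("local[1]")
        .appName("ubunye-unit-tests")
        .config("spark.driver.memory", "512m")
        .config("spark.sql.shuffle.partitions", "1")
        .getOrCreate()
    )
    yield session
    session.stop()
```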
### Added
- Production reference example: JHB hourly weather (REST API → Unity Catalog) — end-to-end example at `examples/production/jhb_weather_databricks/` demonstrating REST ingestion with the `rest_api` reader against the free Open-Meteo API (lat/lon for Johannesburg, no auth required), a Spark transform that explodes parallel hourly arrays into a tidy one-row-per-hour DataFrame, and a Unity Catalog Delta writer partitioned by `forecast_date`. Ships a Databricks Asset Bundle with a scheduled daily job (06:00 `Africa/Johannesburg`), a notebook wrapper around `ubunye.run_task()`, seven pandas unit tests over a hand-built fixture response, and a CI workflow (`.github/workflows/jhb_weather_databricks.yml`) that runs the tests, smoke-checks the endpoint, and validates/deploys the bundle when Databricks secrets are configured. See the example's `README.md`.
- Production reference example: Titanic (local runtime) — end-to-end example at `examples/production/titanic_local/` demonstrating a CSV → Parquet pipeline with dev/prod profiles, Jinja-templated paths, pandas unit tests (no Spark), a committed golden Parquet, and a GitHub Actions workflow (`.github/workflows/local_pipeline.yml`) that validates config, runs the pipeline on a real SparkSession, and diffs the output against the golden. One half of the portability demo — the Databricks counterpart shares `transformations.py` verbatim. See the example's `README.md`.
- Production reference example: Titanic (Databricks Community Edition) — the Databricks half of the portability demo at `examples/production/titanic_databricks/`. Ships a Databricks Asset Bundle (`databricks.yml`) sized for CE's single-node / DBFS / no-UC constraints, a notebook entry (`notebooks/run_titanic.py`) that calls `ubunye.run_task()` against the active SparkSession, the same pandas unit tests, and a CI workflow (`.github/workflows/databricks_deploy.yml`) that installs the Go-based Databricks CLI, validates and deploys the bundle, and enforces the portability contract by diffing `transformations.py` against the local example. Known CE limitations (no service principals, restricted Jobs API, DBFS deprecation) are documented honestly rather than worked around.
- Cross-runtime reference index — `examples/production/README.md` explains the portability contract and provides a side-by-side config comparison of the two examples, a decision guide for choosing between the local and Databricks runtimes, and a migration table covering what changes when moving from Community Edition to a standard Databricks workspace.
- Hook abstraction for observability (`ubunye/core/hooks.py`) — `Hook` base class and `HookChain` multiplexer. Tasks and steps are now wrapped in hook context managers so the Engine no longer imports telemetry modules directly. Built-in hooks shipped under `ubunye/telemetry/hooks/`: `EventLoggerHook`, `OTelHook`, `PrometheusHook`, `LegacyMonitorsHook`. Third parties can register custom hooks (Slack alerts, audit logs, drift checks) without modifying the Engine; a sketch of such a hook appears after this list. See `docs/patterns/hooks.md`.
- `ubunye.hooks` entry point group (`pyproject.toml`) — third-party packages can register `Hook` subclasses as entry points and have them auto-discovered by the Engine. The three built-in telemetry hooks (events, otel, prometheus) are registered via this mechanism and gated on `UBUNYE_TELEMETRY=1`.
- Python API (`ubunye/api.py`) — `run_task()` and `run_pipeline()` for running Ubunye tasks from Python code (Databricks notebooks, scripts, tests) without the CLI. Auto-detects and reuses active SparkSessions. Exported from `ubunye.__init__`.
- DatabricksBackend (`ubunye/backends/databricks_backend.py`) — backend that wraps an existing SparkSession instead of creating one. `stop()` is a no-op since we don't own the session.
- Dev notebook scaffolding — `ubunye init` now generates `notebooks/<task>_dev.ipynb` alongside `config.yaml` and `transformations.py`. The notebook uses `DatabricksBackend`, `dbutils.widgets`, and `display()`. The Load step is commented out by default.
- Deployment docs — `docs/deployment.md` covering the Databricks Asset Bundles pattern, GitHub Actions CI/CD, and the Python API on Databricks. DABs belong in the usecase repo, not the engine.
- Deploy workflow — `.github/workflows/deploy.yml` validates configs on PR and runs unit tests. Bundle deployment is handled in the usecase repo.
- `ubunye test run` CLI sub-command — runs tasks with a test profile and reports PASS/FAIL.
- Model Registry (`ubunye/models/`) — library-independent ML lifecycle management (a sketch of the `UbunyeModel` contract appears after this list).
    - `UbunyeModel` abstract contract: `train`, `predict`, `save`, `load`, `metadata`, `validate`.
    - `ModelRegistry` — filesystem-backed versioning with stages: development → staging → production → archived.
    - `PromotionGate` — configurable metric thresholds (`min_*`, `max_*`, `require_drift_check`).
    - `load_model_class()` — dynamic model file importer; mirrors the task-dir import pattern.
    - `ModelTransform` plugin (`type: model`) — train and predict from config YAML.
    - `ubunye models` CLI sub-commands: `list`, `info`, `promote`, `demote`, `rollback`, `archive`, `compare`.
    - `RegistryConfig` and `ModelTransformParams` Pydantic schema additions.
- Lineage tracking (`ubunye/lineage/`) — automatic run provenance.
    - `RunContext`, `LineageRecorder`, `FileSystemLineageStore`, `hash_dataframe`.
    - `ubunye lineage` CLI sub-commands: `show`, `list`, `compare`, `search`, `trace`.
    - `--lineage` flag on `ubunye run`.
- REST API connector — paginated HTTP reader and writer.
    - Pagination strategies: offset, cursor, next_link.
    - Auth: bearer, api_key (header or query param), basic.
    - Rate limiting with configurable `requests_per_second`.
    - Exponential backoff retry on configurable status codes.
    - Optional explicit schema declaration.
- Config validation — `ubunye validate` command with full Pydantic v2 schema.
    - Format-specific field validation in `IOConfig`.
    - Jinja2 rendering before Pydantic validation.
    - Semver validation on the `VERSION` field.
- `ubunye export airflow|databricks` CLI — the `AirflowExporter` and `DatabricksExporter` under `ubunye/orchestration/` are now reachable from the command line. The command loads the task's `config.yaml`, pulls defaults from its `ORCHESTRATION` block, and writes the artifact to `--output`. Airflow emits a DAG Python file; Databricks emits a Jobs API `job.json`. Classes are now exported from `ubunye.orchestration.__init__`.
- Test infrastructure — 288 unit tests, all Spark-free, in `tests/unit/`.
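For the hook abstraction above, a rough sketch of what a third-party hook might look like. The `Hook` base class and the Slack-alert use case come from this changelog; the callback names (`on_task_start`, `on_task_end`) are assumptions, not the documented interface; see `docs/patterns/hooks.md` for the real contract.

```python
# Illustrative only: the real Hook interface is defined in ubunye/core/hooks.py
# and may expose different callback names.
from ubunye.core.hooks import Hook


class SlackAlertHook(Hook):
    """Hypothetical hook that reports task failures to a Slack webhook."""

    def __init__(self, webhook_url: str):
        self.webhook_url = webhook_url

    def on_task_start(self, task_name: str) -> None:            # assumed callback name
        pass

    def on_task_end(self, task_name: str, error=None) -> None:  # assumed callback name
        if error is not None:
            # An HTTP POST to self.webhook_url would go here; omitted to keep
            # the sketch dependency-free.
            print(f"[slack] task {task_name} failed: {error}")
```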
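For the Model Registry above, a sketch of the `UbunyeModel` contract. The six method names come from this changelog; the signatures shown are assumptions for illustration.

```python
# Method names (train, predict, save, load, metadata, validate) come from the
# changelog entry above; the parameters are illustrative, not the documented API.
from ubunye.models import UbunyeModel


class TitanicClassifier(UbunyeModel):
    """Hypothetical sklearn-backed model implementing the abstract contract."""

    def train(self, df):        # fit on a training DataFrame
        ...

    def predict(self, df):      # return predictions for new rows
        ...

    def save(self, path):       # persist model artefacts under `path`
        ...

    def load(self, path):       # restore model artefacts from `path`
        ...

    def metadata(self):         # params/metrics worth recording in the registry
        return {}

    def validate(self, df):     # validation metrics a PromotionGate can threshold on
        return {"auc": 0.0}
```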
### Changed
- `databricks_expoter.py` renamed to `databricks_exporter.py` (typo fix). Not previously exported from `ubunye.orchestration`, so external callers are unaffected.
- Unified execution path (`ubunye/core/task_runner.py`) — `api.py`, `cli/main.py run`, and `cli/test_cmd.py run` previously each reimplemented the read → transform → write loop and called `load_monitors`/`safe_call` directly. They now delegate to `execute_user_task()`, which wraps the user's `Task.transform()` as an ephemeral Transform plugin and runs it through `Engine`. Single code path, single hook lifecycle; `MonitorHook` adapts the legacy lineage recorder to a `Hook`. `Engine.__init__` gained `extra_hooks=` (append to defaults) and `manage_backend=` (caller-owned vs engine-owned backend lifecycle). `run_task`/`run_pipeline` accept `hooks=` for notebook callers who want to swap in custom hook chains.
- Engine runtime refactored — `ubunye/core/runtime.py` reduced from 374 to 255 lines. The `Engine.run()` body shrank from ~220 lines to ~35 by delegating telemetry plumbing to hooks. The engine no longer imports from `ubunye.telemetry.*` — only from `ubunye.core.hooks`. The `UBUNYE_TELEMETRY` and `UBUNYE_PROM_PORT` env vars are still honored; user monitors in `CONFIG.monitors` continue to work via `LegacyMonitorsHook`. `Engine.__init__` gained an optional `hooks=` argument.
- `ubunye/config/schema.py` — added `RegistryConfig`, `ModelTransformParams`, `FormatType.REST_API`.
- `ubunye/__init__.py` — exports `run_task` and `run_pipeline`.
- `ubunye/cli/main.py` — mounted `models_app`, `lineage_app`, `test_app` Typer sub-apps; added notebook scaffolding to `init`.
- `pyproject.toml` — added `model` entry point under `ubunye.transforms`.
### Fixed
- Unity Catalog writer conflated plugin dispatch with the Spark source format. `UnityTableWriter.write()` was reading `cfg["format"]` and passing it to Spark's `DataFrameWriter.format(...)`. But `cfg["format"]` is the Ubunye plugin selector and is always `"unity"` by the time the writer runs, so Spark raised `[DATA_SOURCE_NOT_FOUND] Failed to find the data source: unity`. Switched the writer to read `cfg["file_format"]` (defaulting to `delta`), matching the convention already used by the s3 writer. Caught while deploying the `jhb_weather_databricks` example to a serverless workspace.
## [0.1.0] — 2025-09-11

### Added
- First alpha release of Ubunye Engine.
- Config-first ETL framework built on Apache Spark.
- Plugin system for Readers, Writers, and Transforms via Python entry points.
- Built-in connectors: Hive, JDBC, Delta, Unity Catalog, S3, binary.
- CLI commands: `init`, `run`, `plan`, `config`, `plugins`, `version`.
- Orchestration exporters: Airflow DAG Python file, Databricks Jobs API JSON.
- Internal ML wrappers: `SklearnModel`, `SparkMLModel`, `BatchPredictMixin`, `MLflowLoggingMixin`.
- Telemetry modules: JSON event log, Prometheus, OpenTelemetry.
- Example tasks: `fraud_detection/claims/claim_etl`, `rest_api/customer_sync`.
- `SparkBackend` with context manager and safe multiple-start support.