Changelog
All notable changes to Ubunye Engine will be documented here.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
[0.2.0] - 2026-05-20
Added
- Interactive notebook API (ubunye.notebook). The new ubunye.notebook() factory returns a NotebookContext for step-by-step task execution in Databricks notebooks. Data scientists can call ctx.read(), ctx.transform(), and ctx.write() in separate cells, inspecting DataFrames between stages. Environment variables referenced via {{ env.VAR }} in config.yaml are auto-resolved from Databricks widgets and secrets, so no manual os.environ setup is required.
- Public step methods on Engine. Engine.read_inputs(), Engine.apply_transforms(), and Engine.write_outputs() expose the pipeline stages individually for interactive and advanced use cases.
- extract_env_references() utility. Scans raw YAML text for {{ env.VAR }} patterns before Jinja resolution, enabling automatic env-var discovery.
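As an illustration of the env-var discovery idea, here is a minimal sketch of scanning raw YAML text for {{ env.VAR }} references before any template rendering. The function name find_env_references and the regular expression are assumptions made for this example; the engine's own extract_env_references() may behave differently.

```python
import re
from typing import List

# Assumed pattern: "{{", optional whitespace, "env.", then an identifier.
# The closing braces are not matched, so filters after the name still work.
_ENV_REF = re.compile(r"\{\{\s*env\.([A-Za-z_][A-Za-z0-9_]*)")


def find_env_references(raw_yaml: str) -> List[str]:
    """Return referenced env var names in first-seen order, without duplicates."""
    names: List[str] = []
    for match in _ENV_REF.finditer(raw_yaml):
        name = match.group(1)
        if name not in names:
            names.append(name)
    return names


SAMPLE = """
inputs:
  source:
    url: "{{ env.API_URL }}/v1"
    token: "{{env.API_TOKEN | default('') }}"
    mirror: "{{ env.API_URL }}/mirror"
"""

print(find_env_references(SAMPLE))
```

Because the scan runs on the raw text, it needs neither a YAML parser nor a Jinja environment, and it can run before any values are available.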
[0.1.9] - 2026-05-19
Added
- Formal typing.Protocol interfaces for four pluggable seams. DeployAdapter, RegistryBackend, LineageBackend, and AuthBackend in ubunye.interfaces define structural contracts that backends satisfy without inheriting from a base class. Cross-boundary dataclasses (DeployContext, DeployResult, ModelVersionInfo, LineageRecord, Credentials) accompany each protocol so callers never depend on a concrete backend's internal types.
- Background metadata worker (Decision 1: non-blocking writes). ubunye._internal.MetadataWorker dispatches lineage and registry metadata writes to a daemon thread with a bounded queue (default 1 000, configurable via UBUNYE_METADATA_QUEUE_SIZE). Queue overflow drops the oldest pending write and logs a warning. Flush timeout is configurable via UBUNYE_METADATA_FLUSH_TIMEOUT (default 30 s). Uses a worker-thread pattern because async/await fights Spark's threading model.
- Graceful degradation with fallback manifests (Decision 2). When a metadata write fails, the record is appended to ~/.ubunye/fallback/{run_id}/{kind}.jsonl and pipeline execution continues. Auth failures are excluded: they propagate immediately.
- ubunye sync CLI command. Replays fallback manifests against configured backends with idempotent deduplication (key: run_id + task + recorded_at). Sub-commands: ubunye sync lineage, ubunye sync registry. Processed manifests are archived to ~/.ubunye/fallback/synced/.
- Entry-point discovery for backend groups. Four new entry-point groups in pyproject.toml: ubunye.deploy_adapters, ubunye.registry_backends, ubunye.lineage_backends, ubunye.auth_backends. Third-party packages register backends by adding entry points under these groups.
- Environment-based auto-detection (ubunye._internal.auto_detect). Registry: MLflow on Databricks, filesystem elsewhere. Lineage: Delta when UBUNYE_LINEAGE_TABLE is set, filesystem otherwise. Auth: service principal when both DATABRICKS_CLIENT_ID and DATABRICKS_CLIENT_SECRET are set, token when DATABRICKS_TOKEN is set, raises AuthNotFoundError otherwise.
- Schema evolution (Decision 3). Every cross-boundary dataclass separates strict core fields from a flexible metadata: Dict[str, str]. Core fields never change without a major version bump. Every record stamps engine_version. Extra metadata keys survive round-trip through the filesystem lineage store.
- Interfaces documentation page (docs/interfaces.md) covering protocols, dataclasses, design principles, discovery, auto-detection, fallback manifests, and how to write a custom backend.
- 39 conformance and failure-mode tests covering protocol isinstance checks, entry-point discovery, auto-detect logic, lineage/registry backend surfaces, non-blocking write timing, queue overflow, fallback manifest creation, sync dedup, auth propagation, and schema evolution metadata flexibility.
- ServicePrincipalAuthBackend: OAuth M2M authentication using DATABRICKS_CLIENT_ID + DATABRICKS_CLIENT_SECRET. Entry point ubunye.auth_backends:service_principal. Takes priority over token auth when both are available (via auto-detect).
- MLflowRegistryBackend: model registry that combines filesystem storage with MLflow experiment/run logging. Logs metrics, params, and stage transitions to MLflow when installed; falls back gracefully when MLflow is unavailable. Entry point ubunye.registry_backends:mlflow.
- DeltaLineageBackend: lineage backend targeting Delta tables via Spark SQL on Databricks. Falls back to JSONL-based local storage when no active SparkSession is available, enabling unit tests without Spark. Entry point ubunye.lineage_backends:delta.
- 24 Phase 2 backend tests covering service principal auth (protocol conformance, env var resolution, error cases, entry-point discovery), MLflow registry (full CRUD lifecycle, promotion gates, metadata round-trip), and Delta lineage (record/get/search, metadata preservation, not-found errors).
- Performance benchmark suite (benchmarks/bench_engine.py). Nine benchmarks covering config loading (plain and Jinja), entry-point discovery (cached and cold), lineage I/O (write, read, search), metadata worker throughput, and registry registration. Reports ops/sec, p50/p99 latencies, mean, and stdev. Results saved to benchmarks/results.json for before/after comparison.
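The non-blocking write behaviour described for the background metadata worker can be sketched roughly as follows. This is a simplified illustration, not the engine's MetadataWorker: the class name DropOldestWorker, its methods, and the sink callable are all hypothetical, and the fallback-manifest handling is left out.

```python
import logging
import threading
from collections import deque
from typing import Any, Callable, Deque

log = logging.getLogger("metadata_worker_sketch")


class DropOldestWorker:
    """Hypothetical stand-in for a non-blocking metadata writer."""

    def __init__(self, sink: Callable[[Any], None], max_size: int = 1000) -> None:
        self._sink = sink
        self._max_size = max_size
        self._pending: Deque[Any] = deque()
        self._in_flight = 0
        self._cond = threading.Condition()
        self.dropped = 0
        # Daemon thread: it never keeps the interpreter alive on exit.
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, record: Any) -> None:
        """Enqueue without blocking; on overflow, drop the oldest pending record."""
        with self._cond:
            if len(self._pending) >= self._max_size:
                self._pending.popleft()
                self.dropped += 1
                log.warning("metadata queue full; dropped oldest pending write")
            self._pending.append(record)
            self._cond.notify_all()

    def flush(self, timeout: float = 30.0) -> bool:
        """Wait until everything queued has been written, or until the timeout."""
        with self._cond:
            return self._cond.wait_for(
                lambda: not self._pending and self._in_flight == 0, timeout
            )

    def _run(self) -> None:
        while True:
            with self._cond:
                self._cond.wait_for(lambda: len(self._pending) > 0)
                record = self._pending.popleft()
                self._in_flight += 1
            try:
                self._sink(record)  # the slow write happens outside the lock
            except Exception:
                log.exception("metadata write failed")
            finally:
                with self._cond:
                    self._in_flight -= 1
                    self._cond.notify_all()
```

The caller's submit() only touches an in-memory deque, so a slow or failing backend cannot stall the pipeline; flush() gives the caller a bounded wait at shutdown.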
Changed
- Filtered entry-point loading. _load_group() in ubunye._internal.discovery now calls importlib.metadata.entry_points(group=group) instead of loading all groups and filtering. Cached discovery is near-zero cost.
- O(1) lineage lookup by run ID. FileSystemLineageStore.get_run() uses a run_id → Path index instead of rglob("*.json"). The index is built lazily on first lookup and updated incrementally on writes. ~34% faster reads.
- In-memory RunContext cache for lineage search. Parsed RunContext objects are cached on first load and reused by subsequent search() and list_runs() calls, avoiding repeated JSON parsing and disk I/O. ~51% faster searches across 100 records.
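The lazy run_id → Path index can be pictured with a small sketch like the one below. IndexedRunStore and its file layout (one JSON file per run, named after the run ID) are assumptions for illustration only; the real FileSystemLineageStore may organise files differently.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


class IndexedRunStore:
    """Hypothetical store: one JSON file per run, file stem equals the run ID."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self._index: Optional[Dict[str, Path]] = None  # built on first lookup

    def _ensure_index(self) -> Dict[str, Path]:
        if self._index is None:
            # Single directory walk; later lookups are plain dict access.
            self._index = {p.stem: p for p in self.root.rglob("*.json")}
        return self._index

    def write_run(self, run_id: str, record: Dict[str, Any]) -> Path:
        path = self.root / f"{run_id}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(record))
        if self._index is not None:
            self._index[run_id] = path  # incremental update, no rescan
        return path

    def get_run(self, run_id: str) -> Dict[str, Any]:
        path = self._ensure_index().get(run_id)
        if path is None:
            raise KeyError(run_id)
        return json.loads(path.read_text())
```

The directory is scanned at most once per store instance; every write after that keeps the index current, so repeated lookups avoid walking the tree again.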
Fixed
- flood_risk_databricks tests: Added missing pandas/pyarrow to CI workflow. Fixed test_geocode_addresses.py module-resolution conflict (both tasks share transformations.py; the geocode test now re-orders sys.path and evicts the cached module). Fixed FakeSession URL matching to use urllib.parse.quote so encoded commas/spaces match. Fixed _make_post to return only the items matching the batch request instead of the full payload (3 batches were returning 3x the results).
- device_mapping_etl_databricks tests: Replaced inferred column-name list with an explicit MI_SCHEMA (StructType) so PySpark can handle None values in nullable columns without CANNOT_DETERMINE_TYPE.
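To show why the FakeSession fix needed urllib.parse.quote, the sketch below builds a lookup key the same way a client would encode an address into a URL path. The helper name build_lookup_url, the base URL, and the canned response are invented for this example and are not taken from the test suite.

```python
from urllib.parse import quote

BASE = "https://geocoder.invalid/search"  # placeholder host, not a real endpoint


def build_lookup_url(address: str) -> str:
    """Percent-encode the address so spaces and commas survive in a URL path."""
    return f"{BASE}/{quote(address)}.json"


# A fake HTTP session can key canned responses by the encoded URL, so the
# lookup matches what the code under test actually requests.
CANNED = {build_lookup_url("12 Main Rd, Cape Town"): {"lat": -33.9, "lon": 18.4}}

requested = build_lookup_url("12 Main Rd, Cape Town")
print(requested)
print(CANNED[requested])
```

Keying the fake by the raw, unencoded address would miss, because the client sends %20 for spaces and %2C for commas.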
[0.1.8] - 2026-05-19
Added
- ubunye deploy databricks command. End-to-end deployment from a local config.yaml to a running Databricks job: handles auth, file upload to /Workspace/, wrapper notebook generation, and idempotent job creation/update via the Databricks SDK. Uses a two-level targets.yaml lookup (usecase-level defaults, task-level overrides). Supports --dry-run for previewing the job spec without deploying. New optional dependency: pip install ubunye-engine[databricks].
- Deploy error types. DeployError base class with AuthNotFoundError, AuthInvalidError, TargetNotFoundError, WorkspaceUploadError, and BundleDeployError. All follow the dual-inheritance pattern (DeployError(UbunyeError, RuntimeError)).
- Structured error messages across the engine. Every user-facing exception now inherits from UbunyeError (with optional context dict and hint string) and the stdlib type it replaces (dual inheritance for backward compatibility). New exception classes: TaskNotFoundError, TaskClassMissingError, ReaderNotFoundError, WriterNotFoundError, TransformNotFoundError, TransformOutputError, MonitorNotFoundError, SourceReadError, SinkWriteError, SparkSessionError, ModelLoadError, ModelNotFittedError, VersionExistsError, VersionNotFoundError, PromotionBlockedError, RegistryNotFoundError, LineageRecordNotFoundError, and ConfigProfileError. Existing config errors (ConfigFieldError, ConfigTemplateError) migrated from ubunye.config.loader to ubunye.core.errors. See the new Error Reference for the full hierarchy and examples.
- Hook failure logging. HookChain now logs a structured warning when a hook's task() or step() context manager raises, instead of silently swallowing the exception.
- CI workflows link to the Databricks job UI after bundle run, so users can find notebook cell output that databricks bundle run does not stream to stdout.
- ubunye init github-actions command. Generates a GitHub Actions workflow for any pipeline: validates config and runs tests on PRs, deploys to Databricks via ubunye deploy databricks on merge to main. Supports --no-deploy for CI-only workflows, --extras for pip install extras (auto-includes Java setup when spark is present), and --target for the Databricks deploy target. The init command is now a sub-app: ubunye init pipeline (formerly ubunye init) and ubunye init github-actions.
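The dual-inheritance pattern mentioned for the error types can be sketched as below. The class names mirror the ones listed in this release, but the constructor signature (keyword-only context and hint) is an assumption for illustration; consult the Error Reference for the actual definitions.

```python
from typing import Any, Dict, Optional


class UbunyeError(Exception):
    """Sketch of a base error carrying structured context and a hint."""

    def __init__(
        self,
        message: str,
        *,
        context: Optional[Dict[str, Any]] = None,
        hint: Optional[str] = None,
    ) -> None:
        super().__init__(message)
        self.context = dict(context or {})
        self.hint = hint


# Dual inheritance: each error is an UbunyeError *and* the stdlib type that
# older callers may already be catching.
class DeployError(UbunyeError, RuntimeError):
    pass


class PromotionBlockedError(UbunyeError, ValueError):
    pass


try:
    raise PromotionBlockedError(
        "validation AUC below threshold",
        context={"metric": "auc", "value": 0.71},
        hint="Retrain or lower the gate threshold.",
    )
except ValueError as exc:  # legacy handler still matches
    print(type(exc).__name__, exc.context, exc.hint)
```

The same property that keeps old handlers working also means a broad except ValueError will catch these errors, so handlers that should let gate failures through need to re-raise them.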
Fixed
- Promotion gate failure now propagates. PromotionBlockedError was silently caught by the titanic training task's except ValueError (via dual inheritance). Gate failures now re-raise, so CI goes red when a model fails its quality gate instead of reporting green.
- typer[all] replaced with plain typer. Typer >= 0.24 dropped the [all] extra, causing pip warnings on install.
- merged_spark_conf() / resolved_catalog() / resolved_schema() now raise ConfigProfileError when profiles are defined but the requested profile doesn't match. Previously these methods silently returned the base config, hiding typos in --mode / --profile flags.
Changed
- ubunye init is now a sub-app with two subcommands: ubunye init pipeline (the previous ubunye init) and ubunye init github-actions. This is a breaking CLI change: update any scripts that call ubunye init -d ... to ubunye init pipeline -d ....
- [ml] extra no longer installs torch. Use [ml-torch] for PyTorch workloads. This saves ~1 GB of CUDA wheel downloads in CI for sklearn-only pipelines.
- CONFIG.transform.type is now optional. When omitted, the engine defaults to loading the user's Task class from transformations.py, so no type: noop declaration is needed. Existing configs with type: noop continue to work but emit a DeprecationWarning advising removal. All scaffolded configs, production examples, test fixtures, and docs have been updated to omit the field.
- Unknown config fields are now rejected at load time. All Pydantic models (except IOConfig) use extra="forbid". A typo like ENGNE raises ConfigFieldError with a "Did you mean 'ENGINE'?" suggestion via difflib.get_close_matches. IOConfig retains extra="allow" so plugin-specific keys (REST API headers, pagination, etc.) pass through to connectors. Breaking: configs with unknown top-level or nested fields that were previously silently ignored will now fail.
- Undefined Jinja template variables fail immediately. The resolver now uses StrictUndefined instead of DebugUndefined. A reference to {{ ds }} without passing ds as a CLI variable raises ConfigTemplateError listing available variables. The | default() filter continues to work.
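The "Did you mean" suggestion for unknown fields relies on the standard library's difflib.get_close_matches. A minimal sketch of that lookup follows; the helper name suggest_field and the KNOWN_FIELDS list are assumptions for the example, and the engine's message formatting is not reproduced here.

```python
import difflib
from typing import Iterable, Optional


def suggest_field(unknown: str, known: Iterable[str]) -> Optional[str]:
    """Return the closest known field name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(unknown, list(known), n=1)
    return matches[0] if matches else None


# Hypothetical set of valid top-level keys, used only for this illustration.
KNOWN_FIELDS = ["MODEL", "VERSION", "ENGINE", "CONFIG"]

typo = "ENGNE"
suggestion = suggest_field(typo, KNOWN_FIELDS)
if suggestion is not None:
    print(f"Unknown field {typo!r}. Did you mean {suggestion!r}?")
```

get_close_matches uses a default similarity cutoff of 0.6, so unrelated keys produce no suggestion rather than a misleading one.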
Added
- Production reference example: device mapping ETL (Databricks, paid workspace). examples/production/device_mapping_etl_databricks/ ports a legacy policy-device mapping script onto Ubunye. Three Unity Catalog reads, one Unity Catalog Delta sink, a monthly Databricks Jobs schedule (2nd of month, 06:00 UTC), and a dedicated GitHub Actions workflow (.github/workflows/device_mapping_etl_databricks.yml) that gates bundle deploy on OAuth service-principal secrets plus TELM_CATALOG / TELM_SCHEMA environment secrets. No catalog/schema identifiers are committed: values flow through DAB --var flags at deploy time, keeping confidential Unity Catalog names out of source control. Corrects a latent bug in the original script where the IMEI-first-detection-adjusted installation_datetime_final was computed but never selected into the final output.
- Production reference example: flood-risk (two-task pipeline, paid Databricks + Unity Catalog). A port of the legacy flood-detection notebook to Ubunye as examples/production/flood_risk_databricks/, split into two chained tasks:
  - geocode_addresses: reads (id, address) rows from a UC source table, calls TomTom Search (top-1 per id, three-step parameter fallback, 429 retry), writes address_geocoded.
  - flood_risk: reads address_geocoded, calls JBA floodscores and flooddepths in batches of 10, merges the two responses on id, renames ~60 nested keys to snake-case, writes address_flood_risk.
  Quarterly schedule (1st of Jan/Apr/Jul/Oct at 06:00 UTC). TomTom and JBA credentials live in a Databricks secret scope (flood-risk by default) and are read by the notebook wrapper; Unity Catalog identifiers and the source-table name come from GitHub environment secrets through --var at deploy time. Also corrects the legacy "Base " → "Basic " auth-header typo and strips the Zscaler cert probing block that belongs on corporate networks, not serverless. OAuth-gated GH Actions workflow (.github/workflows/flood_risk_databricks.yml) runs Spark + fake-HTTP unit tests, validates the bundle, and deploys to the nonprod target when all three UC secrets are present. See the example's README.md.
Added
- Service principal + OAuth auth for Databricks CI. The four Databricks example workflows (databricks_deploy.yml, jhb_weather_databricks.yml, multitask_databricks.yml, titanic_ml_databricks.yml) now pass DATABRICKS_CLIENT_ID / DATABRICKS_CLIENT_SECRET alongside the existing DATABRICKS_TOKEN env var. The secrets gate accepts either flow: PAT (HOST + TOKEN) or OAuth (HOST + CLIENT_ID + CLIENT_SECRET). The Databricks CLI auto-selects OAuth when both client vars are set. New reference page docs/databricks-auth.md walks through service principal creation, workspace/UC grants, secret rotation, and verification.
[0.1.7] - 2026-04-21
Changed
- MODEL and VERSION are now optional at the top level of config.yaml. MODEL defaults to etl and VERSION defaults to "0.0.0-dev". The semver validator now accepts an optional pre-release suffix (e.g. 1.0.0-rc1, 0.0.0-dev) in addition to plain MAJOR.MINOR.PATCH. Existing configs that set these fields explicitly continue to work unchanged. Rationale: these two lines were boilerplate in every scaffolded pipeline; the defaults cover the common case. Set them explicitly when job type or version is load-bearing (lineage, model registry, orchestrator metadata).
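A version check of the shape described above (MAJOR.MINOR.PATCH with an optional pre-release suffix) could look like the following sketch. The exact regular expression and the function name is_valid_version are assumptions; the engine's validator may accept a slightly different grammar.

```python
import re

# Assumed grammar: three dot-separated integers, then optionally "-" followed
# by at least one alphanumeric, dot, or hyphen character.
_VERSION = re.compile(r"\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?")


def is_valid_version(value: str) -> bool:
    """True when the whole string matches the assumed version grammar."""
    return _VERSION.fullmatch(value) is not None


for candidate in ("1.2.3", "1.0.0-rc1", "0.0.0-dev", "1.0", "v1.0.0"):
    print(candidate, is_valid_version(candidate))
```

Using fullmatch rather than match keeps trailing text such as a stray newline or extra segment from being accepted.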
Added
- Production reference example: Titanic multi-task pipeline (local). examples/production/titanic_multitask_local/ demonstrates sequential task chaining via ubunye run -t clean_data -t aggregate. Task 1 reads the Titanic CSV, cleans it, and writes intermediate Parquet. Task 2 reads that Parquet and computes survival rates by class and age group. Exercises run_pipeline(), sibling-module isolation between tasks, and cross-task lineage. CI workflow (.github/workflows/multitask_local.yml) runs Spark unit tests, validates both configs, runs the full pipeline, and verifies output. See the example's README.md.
- Production reference example: Titanic multi-task pipeline (Databricks). examples/production/titanic_multitask_databricks/ is the Databricks counterpart of the local multi-task example. Same transformations.py (byte-identical, CI-enforced), but task chaining uses Unity Catalog Delta tables instead of Parquet files. Task 1 writes titanic_cleaned, task 2 reads it and writes survival_summary. Runs via ubunye.run_pipeline() on serverless compute. CI workflow enforces portability diff against the local example and validates/deploys the Asset Bundle.
- Production reference example: Titanic ML end-to-end on Databricks. examples/production/titanic_ml_databricks/ demonstrates the full ML lifecycle: UbunyeModel subclass (sklearn RandomForest), MLflow param/metric/artifact logging, filesystem-backed ModelRegistry on a UC volume, and a PromotionGate on validation AUC. Two serverless jobs share one Asset Bundle: titanic_train registers and auto-promotes; titanic_predict loads the current production/staging model and writes a Unity Catalog Delta predictions table. A one-row training_metrics audit row is appended per training run. CI (.github/workflows/titanic_ml_databricks.yml) runs pandas/sklearn unit tests, diffs the two model.py copies for drift, and validates/deploys the bundle when Databricks secrets are configured. See the example's README.md.
Changed
- GitHub Actions bumped to Node 24-capable majors. actions/checkout@v4 → @v6, actions/setup-python@v5 → @v6 across all nine workflow files. Resolves the deprecation warning surfaced on 2026-04-16 runs ahead of the Node 20 removal deadline (2026-09-16). No behavioural change.
Fixed
- Undefined CLI template variables silently leaked into resolved configs. resolve_config pre-checked {{ env.X }} references but not bare {{ var }} identifiers. Jinja2's DebugUndefined left unresolved expressions verbatim, so path: "file:///{{ dt }}" with no dt provided would pass validation and then hand Spark a literal file:///{{ dt }} at runtime. Fix adds a post-render residue scan that names the offending variable and suggests a CLI flag, env var, or | default() filter. Regression tests in tests/unit/config/test_resolver.py. Pre-existing configs under examples/ and pipelines/ all use | default() on CLI-derived vars, so no downstream config needs updating.
- Sibling modules leaked between sequential tasks in run_pipeline. _with_task_dir_on_path added the task dir to sys.path but never cleaned up sys.modules on exit. Two tasks that each shipped their own model.py (or utils.py, etc.) would silently run the first task's module when the second task imported it, because Python's import cache was keying on the shared short name. Fix evicts only modules whose source file lives under the exiting task dir; stdlib and site-packages are untouched. Regression test in tests/unit/test_task_runner.py. Caught by offline audit on the overnight branch, ahead of the planned multi-task DAG example (tasks/todo/task-04.md).
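The sibling-module fix can be illustrated with a small context manager that puts a task directory on sys.path and, on exit, evicts only the modules loaded from that directory. This is a sketch of the technique under that description; the name task_dir_on_path and the details are assumptions, not the engine's _with_task_dir_on_path.

```python
import contextlib
import sys
from pathlib import Path
from typing import Iterator


@contextlib.contextmanager
def task_dir_on_path(task_dir: str) -> Iterator[None]:
    """Temporarily import from task_dir, then forget modules that came from it."""
    root = Path(task_dir).resolve()
    sys.path.insert(0, str(root))
    try:
        yield
    finally:
        with contextlib.suppress(ValueError):
            sys.path.remove(str(root))
        for name, module in list(sys.modules.items()):
            source = getattr(module, "__file__", None)
            if not source:
                continue  # built-ins and namespace packages have no file
            try:
                Path(source).resolve().relative_to(root)
            except ValueError:
                continue  # lives elsewhere (stdlib, site-packages): keep it
            del sys.modules[name]
```

With the eviction in place, two tasks that each ship a module with the same short name get their own copy, because the second import no longer hits the first task's cached entry.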
[0.1.6] - 2026-04-15
Changed
- titanic_databricks DAB switched to serverless + Unity Catalog. Removed the new_cluster block and the DBFS bootstrap; the notebook now provisions a UC volume, downloads the Titanic CSV into it at runtime, and the writer emits a Unity Catalog managed Delta table (workspace.titanic.survival_by_class by default). Validated end-to-end on Databricks Free Edition. The portability contract with titanic_local (byte-identical transformations.py) is unchanged: only the deployment wrapper (config + notebook + DAB) differs.
- jhb_weather_databricks DAB switched to serverless compute. Removed the new_cluster block (and the existing_cluster_id escape hatch) from databricks.yml. Notebook tasks with no cluster spec route to serverless on both Free Edition and paid workspaces; the notebook installs ubunye-engine at runtime via %pip. Default weather_catalog changed from main to workspace because Free Edition auto-provisions only the workspace catalog. Paid-workspace users override via --var="weather_catalog=main". README documents the rationale.
- databricks_deploy.yml (titanic) skips deploy gracefully when secrets are absent. Mirrors the soft-skip pattern in jhb_weather_databricks.yml so PRs from forks (or repos that have not configured DATABRICKS_HOST / DATABRICKS_TOKEN) still exercise the unit tests and portability diff instead of failing the workflow.
- Production examples switched from pandas twins to PySpark tests. Every transformations.py under examples/production/ now exposes a single Spark implementation. Tests use a session-scoped SparkSession fixture (local[1], 512 MB driver, shuffle partitions=1) so the production code is the code under test. Eliminates the dual-maintenance burden and the risk of silent drift between pandas and Spark paths. Unit-test CI steps now install Java 17 + ubunye-engine[spark,dev] on the runner.
Added
- Production reference example: JHB hourly weather (REST API → Unity Catalog). End-to-end example at examples/production/jhb_weather_databricks/ demonstrating REST ingestion with the rest_api reader against the free Open-Meteo API (lat/lon for Johannesburg, no auth required), a Spark transform that explodes parallel hourly arrays into a tidy one-row-per-hour DataFrame, and a Unity Catalog Delta writer partitioned by forecast_date. Ships a Databricks Asset Bundle with a scheduled daily job (06:00 Africa/Johannesburg), a notebook wrapper around ubunye.run_task(), seven pandas unit tests over a hand-built fixture response, and a CI workflow (.github/workflows/jhb_weather_databricks.yml) that runs the tests, smoke-checks the endpoint, and validates/deploys the bundle when Databricks secrets are configured. See the example's README.md.
- Production reference example: Titanic (local runtime). End-to-end example at examples/production/titanic_local/ demonstrating a CSV → Parquet pipeline with dev/prod profiles, Jinja-templated paths, pandas unit tests (no Spark), a committed golden Parquet, and a GitHub Actions workflow (.github/workflows/local_pipeline.yml) that validates config, runs the pipeline on a real SparkSession, and diffs the output against the golden. One half of the portability demo: the Databricks counterpart shares transformations.py verbatim. See the example's README.md.
- Production reference example: Titanic (Databricks Community Edition). The Databricks half of the portability demo at examples/production/titanic_databricks/. Ships a Databricks Asset Bundle (databricks.yml) sized for CE's single-node / DBFS / no-UC constraints, a notebook entry (notebooks/run_titanic.py) that calls ubunye.run_task() against the active SparkSession, the same pandas unit tests, and a CI workflow (.github/workflows/databricks_deploy.yml) that installs the Go-based Databricks CLI, validates and deploys the bundle, and enforces the portability contract by diffing transformations.py against the local example. Known CE limitations (no service principals, restricted Jobs API, DBFS deprecation) are documented honestly rather than worked around.
- Cross-runtime reference index. examples/production/README.md explains the portability contract, provides a side-by-side config comparison of the two examples, a decision guide for choosing between the local and Databricks runtimes, and a migration table covering what changes when moving from Community Edition to a standard Databricks workspace.
- Hook abstraction for observability (ubunye/core/hooks.py). Hook base class and HookChain multiplexer. Tasks and steps are now wrapped in hook context managers so the Engine no longer imports telemetry modules directly. Built-in hooks shipped under ubunye/telemetry/hooks/: EventLoggerHook, OTelHook, PrometheusHook, LegacyMonitorsHook. Third parties can register custom hooks (Slack alerts, audit logs, drift checks) without modifying the Engine. See docs/patterns/hooks.md.
- ubunye.hooks entry point group (pyproject.toml). Third-party packages can register Hook subclasses as entry points and have them auto-discovered by the Engine. The three built-in telemetry hooks (events, otel, prometheus) are registered via this mechanism and gated on UBUNYE_TELEMETRY=1.
- Python API (ubunye/api.py). run_task() and run_pipeline() for running Ubunye tasks from Python code (Databricks notebooks, scripts, tests) without the CLI. Auto-detects and reuses active SparkSessions. Exported from ubunye.__init__.
- DatabricksBackend (ubunye/backends/databricks_backend.py). Backend that wraps an existing SparkSession instead of creating one. stop() is a no-op since we don't own the session.
- Dev notebook scaffolding. ubunye init now generates notebooks/<task>_dev.ipynb alongside config.yaml and transformations.py. The notebook uses DatabricksBackend, dbutils.widgets, and display(). The Load step is commented out by default.
- Deployment docs. docs/deployment.md covering Databricks Asset Bundles pattern, GitHub Actions CI/CD, and Python API on Databricks. DABs belong in the usecase repo, not the engine.
- Deploy workflow. .github/workflows/deploy.yml validates configs on PR and runs unit tests. Bundle deployment is handled in the usecase repo.
- ubunye test run CLI sub-command. Runs tasks with a test profile and reports PASS/FAIL.
- Model Registry (ubunye/models/). Library-independent ML lifecycle management.
  - UbunyeModel abstract contract: train, predict, save, load, metadata, validate.
  - ModelRegistry: filesystem-backed versioning with stages: development → staging → production → archived.
  - PromotionGate: configurable metric thresholds (min_*, max_*, require_drift_check).
  - load_model_class(): dynamic model file importer; mirrors the task-dir import pattern.
  - ModelTransform plugin (type: model): train and predict from config YAML.
  - ubunye models CLI sub-commands: list, info, promote, demote, rollback, archive, compare.
  - RegistryConfig and ModelTransformParams Pydantic schema additions.
- Lineage tracking (ubunye/lineage/). Automatic run provenance.
  - RunContext, LineageRecorder, FileSystemLineageStore, hash_dataframe.
  - ubunye lineage CLI sub-commands: show, list, compare, search, trace.
  - --lineage flag on ubunye run.
- REST API connector. Paginated HTTP reader and writer.
  - Pagination strategies: offset, cursor, next_link.
  - Auth: bearer, api_key (header or query param), basic.
  - Rate limiting with configurable requests_per_second.
  - Exponential backoff retry on configurable status codes.
  - Optional explicit schema declaration.
- Config validation. ubunye validate command with full Pydantic v2 schema.
  - Format-specific field validation in IOConfig.
  - Jinja2 rendering before Pydantic validation.
  - Semver validation on VERSION field.
- ubunye export airflow|databricks CLI. The AirflowExporter and DatabricksExporter under ubunye/orchestration/ are now reachable from the command line. The command loads the task's config.yaml, pulls defaults from its ORCHESTRATION block, and writes the artifact to --output. Airflow emits a DAG Python file; Databricks emits a Jobs API job.json. Classes are now exported from ubunye.orchestration.__init__.
- Test infrastructure. 288 unit tests, all Spark-free in tests/unit/.
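The hook abstraction listed above (a Hook base class plus a HookChain multiplexer built on context managers) can be sketched as follows. Only the class names Hook and HookChain and the task() method come from this changelog; the signatures and the RecordingHook helper are assumptions for illustration.

```python
import contextlib
from typing import Iterator, List, Sequence


class Hook:
    """Base hook: wraps a task in a context manager; does nothing by default."""

    @contextlib.contextmanager
    def task(self, name: str) -> Iterator[None]:
        yield


class HookChain(Hook):
    """Fans one task() call out to several hooks, unwinding in reverse order."""

    def __init__(self, hooks: Sequence[Hook]) -> None:
        self._hooks = list(hooks)

    @contextlib.contextmanager
    def task(self, name: str) -> Iterator[None]:
        with contextlib.ExitStack() as stack:
            for hook in self._hooks:
                stack.enter_context(hook.task(name))
            yield


class RecordingHook(Hook):
    """Hypothetical custom hook that records enter/exit events."""

    def __init__(self, label: str, events: List[str]) -> None:
        self.label = label
        self.events = events

    @contextlib.contextmanager
    def task(self, name: str) -> Iterator[None]:
        self.events.append(f"{self.label}:enter:{name}")
        try:
            yield
        finally:
            self.events.append(f"{self.label}:exit:{name}")


events: List[str] = []
chain = HookChain([RecordingHook("audit", events), RecordingHook("metrics", events)])
with chain.task("clean_data"):
    events.append("body")
print(events)
```

Because the caller only sees a single Hook, adding or removing observers does not change the code that runs the task.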
Changed
- databricks_expoter.py renamed to databricks_exporter.py (typo fix). Not previously exported from ubunye.orchestration, so external callers are unaffected.
- Unified execution path (ubunye/core/task_runner.py). api.py, cli/main.py run and cli/test_cmd.py run previously each reimplemented the read → transform → write loop and called load_monitors / safe_call directly. They now delegate to execute_user_task(), which wraps the user's Task.transform() as an ephemeral Transform plugin and runs it through Engine. Single code path, single hook lifecycle, MonitorHook adapts the legacy lineage recorder to a Hook. Engine.__init__ gained extra_hooks= (append to defaults) and manage_backend= (caller-owned vs engine-owned backend lifecycle). run_task / run_pipeline accept hooks= for notebook callers who want to swap in custom hook chains.
- Engine runtime refactored. ubunye/core/runtime.py reduced from 374 to 255 lines. Engine.run() body shrank from ~220 lines to ~35 by delegating telemetry plumbing to hooks. The engine no longer imports from ubunye.telemetry.*, only from ubunye.core.hooks. UBUNYE_TELEMETRY and UBUNYE_PROM_PORT env vars are still honored; user monitors in CONFIG.monitors continue to work via LegacyMonitorsHook. Engine.__init__ gained an optional hooks= argument.
- ubunye/config/schema.py: added RegistryConfig, ModelTransformParams, FormatType.REST_API.
- ubunye/__init__.py: exports run_task and run_pipeline.
- ubunye/cli/main.py: mounted models_app, lineage_app, test_app Typer sub-apps; added notebook scaffolding to init.
- pyproject.toml: added model entry point under ubunye.transforms.
Fixed
- Unity Catalog writer collapsed plugin dispatch with Spark source format. UnityTableWriter.write() was reading cfg["format"] and passing it to Spark's DataFrameWriter.format(...). But cfg["format"] is the Ubunye plugin selector and is always "unity" by the time the writer runs, so Spark raised [DATA_SOURCE_NOT_FOUND] Failed to find the data source: unity. Switched the writer to read cfg["file_format"] (defaulting to delta), matching the convention already used by the s3 writer. Caught while deploying the jhb_weather_databricks example to a serverless workspace.
[0.1.0] - 2025-09-11
Added
- First alpha release of Ubunye Engine.
- Config-first ETL framework built on Apache Spark.
- Plugin system for Readers, Writers, and Transforms via Python entry points.
- Built-in connectors: Hive, JDBC, Delta, Unity Catalog, S3, binary.
- CLI commands: init, run, plan, config, plugins, version.
- Orchestration exporters: Airflow DAG Python file, Databricks Jobs API JSON.
- Internal ML wrappers: SklearnModel, SparkMLModel, BatchPredictMixin, MLflowLoggingMixin.
- Telemetry modules: JSON event log, Prometheus, OpenTelemetry.
- Example tasks: fraud_detection/claims/claim_etl, rest_api/customer_sync.
- SparkBackend with context manager and safe multiple-start support.