Orchestration

The `ORCHESTRATION` section provides metadata for exporting a task to an orchestration platform. It has no effect on how `ubunye run` executes the task; only `ubunye export` reads it.

Structure

ORCHESTRATION:
  type: airflow           # required — airflow | databricks | prefect | dagster
  schedule: "0 2 * * *"  # cron expression
  retries: 3
  owner: data-engineering
  tags:
    - fraud
    - etl
  databricks:             # Databricks-specific cluster settings
    cluster_id: "0123-456789-abcde"
    node_type_id: "Standard_DS3_v2"
    num_workers: 4

Fields

| Field | Type | Default | Description |
|---|---|---|---|
| `type` | `airflow` \| `databricks` \| `prefect` \| `dagster` | required | Target orchestration platform |
| `schedule` | string | `null` | Cron expression for the DAG/workflow schedule |
| `retries` | int | `2` | Number of automatic retries on failure |
| `owner` | string | `null` | Team or person responsible (shown in Airflow UI) |
| `tags` | list of strings | `[]` | Labels for filtering in the orchestration UI |
| `databricks` | dict | `null` | Databricks-specific job cluster settings |

Extra fields are allowed and passed through to the relevant exporter.
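
The defaults and the pass-through rule above can be pictured with a small Python sketch. This is not ubunye's actual schema code: `normalize_orchestration`, `ALLOWED_TYPES`, `DEFAULTS`, and the extra field `sla_minutes` are hypothetical names used for illustration. Only the allowed `type` values, the defaults, and the pass-through behaviour are taken from this page.

```python
from typing import Any

# Allowed platforms, as listed in the Fields table.
ALLOWED_TYPES = {"airflow", "databricks", "prefect", "dagster"}

# Defaults from the Fields table; `type` has none because it is required.
DEFAULTS: dict[str, Any] = {
    "schedule": None,
    "retries": 2,
    "owner": None,
    "tags": [],
    "databricks": None,
}


def normalize_orchestration(section: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: apply documented defaults and keep extra fields."""
    if "type" not in section:
        raise ValueError("ORCHESTRATION.type is required")
    if section["type"] not in ALLOWED_TYPES:
        raise ValueError(f"unsupported type: {section['type']!r}")

    # Copy list defaults so the shared DEFAULTS entry is never mutated.
    result = {
        key: (list(value) if isinstance(value, list) else value)
        for key, value in DEFAULTS.items()
    }
    # User values override defaults; unknown keys are passed through untouched.
    result.update(section)
    return result


# `sla_minutes` is a made-up extra field, used only to show the pass-through.
minimal = normalize_orchestration({"type": "airflow", "sla_minutes": 30})
print(minimal["retries"])      # the documented default
print(minimal["sla_minutes"])  # the extra field, unchanged
```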

Airflow export

ubunye export airflow \
    -c pipelines/fraud/etl/claims/config.yaml \
    -o dags/claims_etl.py \
    --profile prod

The generated DAG contains a single BashOperator that runs:

ubunye run -d pipelines -u fraud -p etl -t claims --profile prod

Schedule, retries, owner, and tags are read from ORCHESTRATION.
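
As a rough illustration of that mapping, the sketch below derives the `ubunye run` command from the config path and collects the `ORCHESTRATION` values into keyword arguments of the kind Airflow's `DAG` accepts. The helper names `run_command` and `dag_kwargs`, the `dag_id` value, and the assumption that the path always has four directories before `config.yaml` (mapped in order to `-d`, `-u`, `-p`, `-t`, as in the example on this page) are illustrative guesses, not the exporter's real code.

```python
from pathlib import PurePosixPath
from typing import Any, Optional


def run_command(config_path: str, profile: Optional[str] = None) -> str:
    """Build the `ubunye run` command for a config path (assumed layout)."""
    parts = PurePosixPath(config_path).parent.parts
    if len(parts) != 4:
        raise ValueError("expected four directories before config.yaml")
    d, u, p, t = parts
    cmd = f"ubunye run -d {d} -u {u} -p {p} -t {t}"
    return f"{cmd} --profile {profile}" if profile else cmd


def dag_kwargs(dag_id: str, orch: dict[str, Any]) -> dict[str, Any]:
    """Collect ORCHESTRATION values as Airflow-style DAG arguments."""
    default_args: dict[str, Any] = {"retries": orch.get("retries", 2)}
    if orch.get("owner"):
        # Only set the owner when the config provides one.
        default_args["owner"] = orch["owner"]
    return {
        "dag_id": dag_id,
        "schedule": orch.get("schedule"),  # cron string or None
        "tags": orch.get("tags", []),
        "default_args": default_args,
    }


orch = {
    "type": "airflow",
    "schedule": "30 1 * * *",
    "retries": 2,
    "owner": "fraud-team",
    "tags": ["fraud", "daily", "etl"],
}

command = run_command("pipelines/fraud/etl/claims/config.yaml", profile="prod")
kwargs = dag_kwargs("claims_etl", orch)
print(command)
print(kwargs)
```

In a real DAG file, values like these would be passed to `DAG(...)` and the command to `BashOperator(bash_command=...)`, along with a `start_date`. Airflow versions before 2.4 name the `schedule` argument `schedule_interval`.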

Databricks export

ubunye export databricks \
    -c pipelines/fraud/etl/claims/config.yaml \
    -o jobs/claims_etl.json \
    --profile prod

The generated JSON can be submitted with the Databricks CLI:

databricks jobs create --json-file jobs/claims_etl.json
databricks jobs run-now --job-id <ID>

Cluster settings from ORCHESTRATION.databricks are embedded in the job JSON.
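
One plausible shape for that embedding is sketched below. The field names `existing_cluster_id`, `new_cluster`, `node_type_id`, `num_workers`, and `task_key` come from the Databricks Jobs API. The rule for choosing between an existing and a new cluster, the helper names `cluster_spec` and `job_task`, and the job name are assumptions made for illustration. The part of the task that actually runs `ubunye` is omitted, so the actual exported JSON will differ.

```python
import json
from typing import Any


def cluster_spec(databricks: dict[str, Any]) -> dict[str, Any]:
    """Assumed rule: reuse a cluster if `cluster_id` is set, else define a new one."""
    if "cluster_id" in databricks:
        return {"existing_cluster_id": databricks["cluster_id"]}
    # Pass the remaining settings through as the new cluster definition.
    return {"new_cluster": dict(databricks)}


def job_task(task_key: str, databricks: dict[str, Any]) -> dict[str, Any]:
    """Attach the cluster settings to a single job task."""
    task: dict[str, Any] = {"task_key": task_key}
    task.update(cluster_spec(databricks))
    return task


# Settings as they might appear under ORCHESTRATION.databricks.
settings = {"node_type_id": "Standard_DS3_v2", "num_workers": 4}
job = {"name": "claims_etl", "tasks": [job_task("claims", settings)]}
print(json.dumps(job, indent=2))
```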

Example: full Airflow config

MODEL: etl
VERSION: "1.0.0"

CONFIG:
  inputs:
    raw:
      format: hive
      db_name: raw
      tbl_name: claims
  transform:
    type: noop
  outputs:
    clean:
      format: delta
      path: s3://datalake/clean/claims
      mode: overwrite

ORCHESTRATION:
  type: airflow
  schedule: "30 1 * * *"    # 01:30 UTC daily
  retries: 2
  owner: fraud-team
  tags:
    - fraud
    - daily
    - etl