Deploy to Databricks¶
Deploy a pipeline task to Databricks as a scheduled job, going from config.yaml
to a running job in one command.
Prerequisites¶
Install the engine with the Databricks extra:
Set your Databricks credentials:
export DATABRICKS_HOST="https://adb-123456789.azuredatabricks.net"
export DATABRICKS_TOKEN="dapi..."
Create targets.yaml¶
Define your deploy targets in a targets.yaml file at the usecase level
(shared across all pipelines and tasks):
pipelines/
  fraud_detection/
    targets.yaml            ← usecase-level targets
    ingestion/
      claim_etl/
        config.yaml
        transformations.py
# fraud_detection/targets.yaml
targets:
  dev:
    host: "https://adb-dev-123.azuredatabricks.net"
    token_env: DATABRICKS_TOKEN        # env var holding the PAT
    workspace_path: /Workspace/ubunye  # where files are uploaded
    spark_version: "13.3.x-scala2.12"
    node_type_id: "i3.xlarge"
    num_workers: 0                     # 0 = single-node cluster
  prod:
    host: "https://adb-prod-456.azuredatabricks.net"
    token_env: DATABRICKS_TOKEN_PROD
    workspace_path: /Workspace/ubunye
    node_type_id: "i3.2xlarge"
    num_workers: 4
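The token_env field names an environment variable rather than holding the token itself, which keeps the secret out of targets.yaml. The sketch below illustrates that lookup, assuming the file has already been parsed into a plain dict. The resolve_token helper is hypothetical and is not part of Ubunye's API; the engine's own lookup may behave differently.

```python
import os

# Parsed form of the targets.yaml above (only the fields needed here).
targets = {
    "dev": {
        "host": "https://adb-dev-123.azuredatabricks.net",
        "token_env": "DATABRICKS_TOKEN",
    },
    "prod": {
        "host": "https://adb-prod-456.azuredatabricks.net",
        "token_env": "DATABRICKS_TOKEN_PROD",
    },
}


def resolve_token(target: dict, environ=os.environ) -> str:
    """Hypothetical helper: read the PAT from the env var named by token_env."""
    # Fall back to the documented default variable name.
    var_name = target.get("token_env", "DATABRICKS_TOKEN")
    token = environ.get(var_name)
    if not token:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return token


# A fake environment is passed in so no real credentials are needed.
fake_env = {"DATABRICKS_TOKEN": "dapi-dev-example"}
print(resolve_token(targets["dev"], fake_env))
```

Because dev and prod point at different variables, both tokens can be exported in the same shell without one overwriting the other.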
Task-level overrides¶
A task can override specific fields by adding its own targets.yaml:
# fraud_detection/ingestion/claim_etl/targets.yaml
targets:
  dev:
    num_workers: 2  # this task needs more compute
    spark_conf:
      spark.sql.shuffle.partitions: "200"
The task-level file is deep-merged on top of the usecase-level one. Fields you don't override are inherited.
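To make the merge semantics concrete, here is a minimal sketch of a recursive deep merge applied to the two files above. The deep_merge function is an illustration written for this page, not Ubunye's implementation, and it assumes that non-dict values (including lists) are replaced wholesale rather than combined.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override wins and nested dicts merge recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys replace whatever was there.
            merged[key] = deepcopy(value)
    return merged


usecase_level = {"targets": {"dev": {
    "host": "https://adb-dev-123.azuredatabricks.net",
    "node_type_id": "i3.xlarge",
    "num_workers": 0,
}}}
task_level = {"targets": {"dev": {
    "num_workers": 2,
    "spark_conf": {"spark.sql.shuffle.partitions": "200"},
}}}

effective = deep_merge(usecase_level, task_level)["targets"]["dev"]
print(effective["num_workers"])   # 2, overridden by the task-level file
print(effective["node_type_id"])  # i3.xlarge, inherited from the usecase level
```

The inputs are left untouched, so the usecase-level values can be reused when resolving other tasks.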
Preview with --dry-run¶
ubunye deploy databricks \
-d pipelines -u fraud_detection -p ingestion -t claim_etl \
--target dev --dry-run
Prints the job spec JSON without touching Databricks.
Deploy¶
Run the same command without --dry-run to deploy. This:
- Validates the task config.yaml
- Authenticates to the Databricks workspace
- Uploads config.yaml, transformations.py, and helpers to /Workspace/ubunye/fraud_detection/ingestion/claim_etl/
- Generates a wrapper notebook that calls ubunye.run_task()
- Creates (or updates) a Databricks job named ubunye-fraud_detection-ingestion-claim_etl-dev
If the job already exists (same name), it is updated in place. Every Ubunye-managed
job is tagged with ubunye_managed: "true" so it can be identified later.
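The job name and upload path in the example follow a visible pattern. The sketch below reproduces that pattern and shows roughly where the tag would sit in a Databricks Jobs API payload. Everything here is inferred from the single example on this page: the functions are hypothetical, the notebook name is a placeholder, and the spec is an illustrative subset, not the JSON that --dry-run prints.

```python
import json


def job_name(usecase: str, package: str, task: str, target: str) -> str:
    # Pattern inferred from the example job name above.
    return f"ubunye-{usecase}-{package}-{task}-{target}"


def upload_dir(workspace_path: str, usecase: str, package: str, task: str) -> str:
    # Pattern inferred from the example upload path above.
    return f"{workspace_path.rstrip('/')}/{usecase}/{package}/{task}/"


def minimal_job_spec(usecase: str, package: str, task: str, target: str, cfg: dict) -> dict:
    """Illustrative subset of a Jobs API payload; the real spec may differ."""
    folder = upload_dir(cfg["workspace_path"], usecase, package, task)
    return {
        "name": job_name(usecase, package, task, target),
        "tags": {"ubunye_managed": "true"},
        "tasks": [{
            "task_key": task,
            # "wrapper_notebook" is a placeholder; the real name is not documented here.
            "notebook_task": {"notebook_path": folder + "wrapper_notebook"},
            "new_cluster": {
                "spark_version": cfg["spark_version"],
                "node_type_id": cfg["node_type_id"],
                "num_workers": cfg["num_workers"],
            },
        }],
    }


dev_cfg = {
    "workspace_path": "/Workspace/ubunye",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 0,
}
spec = minimal_job_spec("fraud_detection", "ingestion", "claim_etl", "dev", dev_cfg)
print(json.dumps(spec, indent=2))
```

Since the name is derived only from the usecase, package, task, and target, redeploying the same task to the same target yields the same name, which is what allows an existing job to be updated in place.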
Ad-hoc deploy (no targets.yaml)¶
For quick one-off deploys, pass the host directly:
ubunye deploy databricks \
-d pipelines -u fraud_detection -p ingestion -t claim_etl \
--host "https://adb-123.azuredatabricks.net" \
--dry-run
Configuration reference¶
DatabricksTargetConfig fields¶
| Field | Type | Default | Description |
|---|---|---|---|
| host | str | required | Databricks workspace URL |
| token_env | str | DATABRICKS_TOKEN | Environment variable holding the PAT |
| workspace_path | str | /Workspace/ubunye | Root path for uploaded files |
| spark_version | str | 13.3.x-scala2.12 | Databricks Runtime version |
| node_type_id | str | i3.xlarge | Instance type for the job cluster |
| num_workers | int | 0 | Number of workers (0 = single-node) |
| spark_conf | dict | {} | Spark configuration overrides |
| aws_attributes | dict | {} | AWS-specific cluster attributes |
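The defaults in this table can be read as a typed record. The dataclass below is a stand-in written for illustration, under the assumption that unset fields simply fall back to the listed defaults. TargetConfigSketch is a hypothetical name, and the real DatabricksTargetConfig may be implemented differently (for example with validation).

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class TargetConfigSketch:
    """Stand-in mirroring the documented fields and defaults."""
    host: str  # required, no default
    token_env: str = "DATABRICKS_TOKEN"
    workspace_path: str = "/Workspace/ubunye"
    spark_version: str = "13.3.x-scala2.12"
    node_type_id: str = "i3.xlarge"
    num_workers: int = 0  # 0 = single-node
    spark_conf: Dict[str, str] = field(default_factory=dict)
    aws_attributes: Dict[str, Any] = field(default_factory=dict)


# The prod target from the earlier targets.yaml omits spark_version,
# so under this sketch it falls back to the default.
prod = TargetConfigSketch(
    host="https://adb-prod-456.azuredatabricks.net",
    token_env="DATABRICKS_TOKEN_PROD",
    node_type_id="i3.2xlarge",
    num_workers=4,
)
print(prod.spark_version)
print(prod.spark_conf)
```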
CLI flags¶
| Flag | Short | Default | Description |
|---|---|---|---|
| --usecase-dir | -d | required | Root directory of pipelines |
| --usecase | -u | required | Usecase name |
| --package | -p | required | Pipeline/package name |
| --task | -t | required | Task name |
| --target | | dev | Deploy target name |
| --mode | -m | PROD | Config profile for the job |
| --data-timestamp | -dt | | Data timestamp passed to the task |
| --dry-run | | false | Print job spec without deploying |
| --host | | | Ad-hoc workspace URL (skips targets.yaml) |
| --token | | | Env var name for the token (ad-hoc) |