7/18/2025

Databricks CI/CD with Asset Bundles: Developer Hands-On Guide

 


Here's a simple hands-on guide for developers, built around a Databricks CI/CD pipeline that leverages Databricks Asset Bundles:

This guide will walk you through setting up, developing, testing, and deploying your Databricks project using Databricks Asset Bundles and integrating it with an Azure DevOps CI/CD pipeline.


1. Local Development Setup

To begin local development, you'll need to set up your environment:


  1. Install Databricks CLI:
    Download and install the Databricks CLI following the official documentation.


  2. Authenticate to Databricks Workspace:
    Open your terminal and run the following command to authenticate with your Databricks workspace:
    Open your terminal and run the following command to authenticate with your Databricks workspace:

Bash

$ databricks configure


You’ll then be prompted for:

Databricks Host (should begin with https://): https://adb-1234567890123456.7.azuredatabricks.net

Token: dapi1234567890abcdef1234567890abcdef

Where to get the Host and Token?


  1. Host:
    Go to your Databricks workspace and copy the base URL (it always starts with https://).

  2. Token:
    Generate a personal access token in your Databricks user settings.

  3. Alternatively, you can log in using interactive mode:

$ databricks auth login

Databricks Host: https://adb-1234567890123456.7.azuredatabricks.net

The browser window opened for login...

Login successful.

Profile [default] created in ~/.databrickscfg
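
Either way, your credentials are saved as a profile. As a rough sketch (using the placeholder values from above; an interactive OAuth login stores an auth_type entry instead of a token), ~/.databrickscfg will contain something like:

[DEFAULT]
host  = https://adb-1234567890123456.7.azuredatabricks.net
token = dapi1234567890abcdef1234567890abcdef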


  3. Now you can list the available clusters in the workspace:

$ databricks clusters list


  4. Install Development Dependencies:
    Install the required Python packages for local development from requirements-dev.txt:
    Install the required Python packages for local development from requirements-dev.txt:

Bash

$ pip install -r requirements-dev.txt
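
The exact contents of requirements-dev.txt depend on the project, but given the tools used elsewhere in this guide (pytest for testing; wheel and setuptools for packaging), a typical file might contain:

pytest
wheel
setuptools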

  5. Initialize a new Databricks Asset Bundle project using the default-python template:

Bash

databricks bundle init default-python 


  6. Deploy the asset bundle:

Bash

databricks bundle deploy 
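
Tip: before deploying, you can check the bundle configuration for errors with the validate command:

Bash

databricks bundle validate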

2. Project Overview

The databricks-cicd-demo is a sample Python project generated using the default-python template for Databricks. It demonstrates how to package Python code into a wheel file, define Databricks jobs, and deploy them using Databricks Asset Bundles. It also includes an example of a Jupyter notebook and unit tests.


Project Structure

Here's the project structure in a tree-like format:

├── README.md
├── azure-pipelines-release.yml       # Azure DevOps deploy/CD pipeline
├── databricks.yml                    # Asset bundle configuration
├── pytest.ini                        # Pytest configuration (test paths and Python paths)
├── requirements-dev.txt              # Dependencies for local development
├── setup.py                          # Builds and packages the project into a wheel file
├── build/                            # Python app build artifacts
│   └── lib/
│       └── my_db_python_project/
│           ├── __init__.py
│           └── main.py
├── resources/
│   └── my_db_python_project.job.yml  # Databricks job, notebook, and cluster config
├── scratch/                          # Personal, exploratory notebooks; not committed to Git
│   ├── README.md
│   └── exploration.ipynb
├── src/                              # Source folder for PySpark apps and Databricks notebooks
│   ├── my_db_python_project/         # Sample PySpark app
│   │   ├── __init__.py
│   │   └── main.py
│   ├── my_db_python_project.egg-info/  # Packaging metadata
│   │   ├── PKG-INFO
│   │   ├── SOURCES.txt
│   │   ├── dependency_links.txt
│   │   ├── entry_points.txt
│   │   ├── requires.txt
│   │   └── top_level.txt
│   └── notebook_first.ipynb          # Python Databricks notebook
└── tests/
    └── main_test.py                  # Unit tests for the main.py functions
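
The generated setup.py isn't reproduced here, but given that the job's python_wheel_task calls the main entry point of the my_db_python_project package (see section 4), a minimal sketch might look like the following (the version and exact metadata are illustrative):

Python

# setup.py -- minimal sketch; the template-generated file contains more metadata
from setuptools import setup, find_packages

setup(
    name="my_db_python_project",
    version="0.0.1",  # illustrative version number
    packages=find_packages(where="src"),
    package_dir={"": "src"},
    entry_points={
        # "main" is the entry point the job's python_wheel_task invokes
        "console_scripts": ["main=my_db_python_project.main:main"],
    },
)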



3. Deploying with Databricks Asset Bundles Using CLI

Databricks Asset Bundles allow you to manage and deploy your Databricks resources (jobs, notebooks, ML models) as code.

Deployment Targets:

The databricks.yml file defines two deployment targets, dev and prod (a sketch of the file follows the target descriptions below).

  • dev (Development Mode):

    • Resources are prefixed with [dev your_user_name].

    • Job schedules and triggers are paused by default.

    • Root path: ~/.bundle/${bundle.name}/${bundle.target}.


  • prod (Production Mode):

    • No prefix on deployed resources.

    • Job schedules run as defined.

    • Root path: /Workspace/Shared/exploration/.bundle/${bundle.name}/${bundle.target}.

    • Permissions are set for the users group to CAN_MANAGE.
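
For reference, here is a minimal sketch of what a databricks.yml with these two targets might look like (the host URL is the placeholder from section 1; your actual file will differ):

YAML

bundle:
  name: my_db_python_project

include:
  - resources/*.yml

targets:
  dev:
    mode: development   # prefixes resources with [dev <user>] and pauses schedules
    default: true
    workspace:
      host: https://adb-1234567890123456.7.azuredatabricks.net

  prod:
    mode: production
    workspace:
      host: https://adb-1234567890123456.7.azuredatabricks.net
      root_path: /Workspace/Shared/exploration/.bundle/${bundle.name}/${bundle.target}
    permissions:
      - group_name: users
        level: CAN_MANAGE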

Deployment Steps:

  1. Deploy to Development:
    To deploy a development copy of your project, run:

Bash

databricks bundle deploy --target dev

    (The --target dev flag is optional, since dev is the default target.) This will deploy a job named [dev yourname] my_db_python_project_job to your workspace under Workflows.


  2. Deploy to Production:
    To deploy a production copy, run:
    To deploy a production copy, run:

Bash

databricks bundle deploy --target prod


  3. Run a Job:
    To manually run a deployed job or pipeline, use:
    To manually run a deployed job or pipeline, use:

Bash

databricks bundle run
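
Depending on your CLI version, databricks bundle run may prompt you to pick a resource; you can also pass the resource key and target explicitly. Assuming the job's resource key is my_db_python_project_job (matching the job deployed above):

Bash

databricks bundle run --target dev my_db_python_project_job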


4. Understanding the Databricks Job Definition


The job is defined in resources/my_db_python_project.job.yml and consists of two tasks (a sketch of the file follows the list):

  1. notebook_first_task:

    • Executes the ../src/notebook_first.ipynb notebook.

    • Uses an existing_cluster_id: 1166-092678-1n0456.

  2. main_task:

    • Depends on notebook_first_task.

    • Executes a Python wheel task: package_name: my_db_python_project, entry_point: main.

    • Uses an existing_cluster_id: 1112-092926-1166.

    • Includes the generated .whl file as a library dependency.
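
Putting those pieces together, a sketch of resources/my_db_python_project.job.yml might look like this (the wheel path under libraries is an assumption; the template typically points at the built artifact):

YAML

resources:
  jobs:
    my_db_python_project_job:
      name: my_db_python_project_job
      tasks:
        - task_key: notebook_first_task
          existing_cluster_id: 1166-092678-1n0456
          notebook_task:
            notebook_path: ../src/notebook_first.ipynb

        - task_key: main_task
          depends_on:
            - task_key: notebook_first_task
          existing_cluster_id: 1112-092926-1166
          python_wheel_task:
            package_name: my_db_python_project
            entry_point: main
          libraries:
            - whl: ../dist/*.whl   # assumed path to the built wheel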

5. Testing

The project is configured for unit testing using pytest.

  • Test File: tests/main_test.py contains a test function test_main that verifies get_taxis returns more than 5 rows.

  • Pytest Configuration: pytest.ini specifies tests as the testpaths and src as the pythonpath. (Sketches of both files follow.)
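
As a rough sketch (assuming get_taxis returns a Spark DataFrame, as in the default-python template; your generated files may differ):

Python

# tests/main_test.py -- minimal sketch
from my_db_python_project.main import get_taxis


def test_main():
    # get_taxis is assumed to return a Spark DataFrame of taxi trips
    assert get_taxis().count() > 5

And the corresponding pytest.ini:

[pytest]
testpaths = tests
pythonpath = src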

To run tests locally, navigate to the project root and execute:

Bash

pytest



6. CI/CD with Azure DevOps

The azure-pipelines-release.yml file defines an Azure DevOps release pipeline for continuous deployment.

Pipeline Stages:

The pipeline has a DeployToDatabricks stage with a single Deploy job, which runs on an ubuntu-latest VM image.

Pipeline Steps:

  1. Use Python Version:
    Ensures Python 3.x is used.

  2. Install Databricks CLI:
    Installs the Databricks CLI, wheel, and setuptools.

  3. Run Databricks configure CLI Commands:
    Configures the Databricks CLI using the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, which are populated from an Azure DevOps variable group named UTILITY_DATABRICKS_VARIABLES (see the variable names below).
    Configures the Databricks CLI using environment variables DATABRICKS_HOST and DATABRICKS_TOKEN. These variables should be set in an Azure DevOps variable group named UTILITY_DATABRICKS_VARIABLES.

  4. Deploy bundle:
    Deploys the Databricks Asset Bundle to the prod target.

  5. Upload Notebooks to Databricks Workspace:
    Imports the entire current directory (./) to /Workspace/Shared/exploration in the Databricks workspace. This is useful for making notebooks available directly in the workspace UI. (A sketch of the full pipeline YAML follows below.)
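
The pipeline YAML itself isn't reproduced here, but based on the steps above, a minimal sketch of azure-pipelines-release.yml might look like the following (the CLI install command and the import-dir flags are assumptions; adjust to match your actual pipeline):

YAML

trigger: none  # release pipeline; run manually or from a release trigger

stages:
  - stage: DeployToDatabricks
    jobs:
      - job: Deploy
        pool:
          vmImage: ubuntu-latest
        variables:
          - group: UTILITY_DATABRICKS_VARIABLES
        steps:
          - task: UsePythonVersion@0
            inputs:
              versionSpec: '3.x'
            displayName: Use Python 3.x

          - script: |
              curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
              pip install wheel setuptools
            displayName: Install Databricks CLI

          - script: databricks bundle deploy --target prod
            displayName: Deploy bundle
            env:
              DATABRICKS_HOST: $(DATABRICKS_MY_ANALYTICS_DEV_HOST)
              DATABRICKS_TOKEN: $(DATABRICKS_MY_ANALYTICS_DEV_TOKEN)

          - script: databricks workspace import-dir ./ /Workspace/Shared/exploration --overwrite
            displayName: Upload notebooks to Databricks workspace
            env:
              DATABRICKS_HOST: $(DATABRICKS_MY_ANALYTICS_DEV_HOST)
              DATABRICKS_TOKEN: $(DATABRICKS_MY_ANALYTICS_DEV_TOKEN)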
    Imports the entire current directory (./) to /Workspace/Shared/exploration in Databricks Workspace. This is useful for making notebooks available directly in the workspace UI.

Setting up Azure DevOps Variables:

You need to create a variable group named UTILITY_DATABRICKS_VARIABLES in your Azure DevOps project and add the following variables:

  • DATABRICKS_MY_ANALYTICS_DEV_HOST: Your Databricks workspace URL (e.g., https://adb-xxxxxxxxxxxxxxxx.xx.azuredatabricks.net).

  • DATABRICKS_MY_ANALYTICS_DEV_TOKEN: A Databricks personal access token with sufficient permissions to deploy resources.

By following this guide, you can effectively manage and automate the deployment of your Databricks projects using Databricks Asset Bundles and Azure DevOps.

