Posted on 24 February 2021, updated on 9 August 2021.
In this article, you will find an example of a CI/CD pipeline for Databricks and the issues I encountered while creating it. Some useful configurations for your Databricks workspace are detailed too, for example how to keep credentials out of the code and how to secure access!
When working on setting up this pipeline, one of the main issues I encountered was the lack of information. To help future Ops and share my experience, I wrote this article! I hope it will answer several of your questions and give you a good starting point.
Basic knowledge of Databricks resources is required to understand this tutorial. You will also need the following prerequisites:
- A Resource Group with a Databricks instance
- An Azure DevOps Repo
- Configure your repo following this tutorial
- Create a Databricks Access Token
The goal of the CI pipeline is to ensure the validity of the code. To do so, we start by testing the code: Pytest, Black …
Due to the specificity of our project, we had to run a “CI Integration Test” job in Databricks to validate the code.
This job trains an AI model and outputs a score. If the score is over a threshold, the job succeeds; otherwise, it fails.
Executing this job takes several steps:
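As an illustration, the pass/fail rule of such a job can be sketched in a few lines of Python. The threshold value and the function name are hypothetical (the real values are not given here); the only mechanism relied on is that an uncaught exception in a notebook task marks the Databricks job run as failed.

```python
# Hypothetical threshold: the real value used in the project is not stated.
SCORE_THRESHOLD = 0.8


def check_score(score: float, threshold: float = SCORE_THRESHOLD) -> None:
    """Raise if the model score is not over the threshold.

    Called at the end of the CI notebook: an uncaught exception makes the
    job run fail, which in turn makes the pipeline fail.
    """
    if score <= threshold:
        raise ValueError(
            f"Model score {score:.3f} is not over the threshold {threshold:.3f}"
        )
    print(f"Model score {score:.3f} is over the threshold {threshold:.3f}")
```

In the notebook, the last cell would simply call check_score with the score produced by the training step.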
- Deploy notebooks in a temporary folder in your Databricks workspace
- Deploy the “CI” Job linked to a notebook in the temporary folder
- Run the “CI” Job and wait for its results
When we started the project, the feature to link a Git repo to a Databricks workspace was still in Preview, so we chose to add all our notebooks to our Git repository.
Now that we have our Notebooks in our Repository we need to synchronize them with our Databricks workspace.
To deploy all our notebooks in a temporary folder we use the databricks CLI:
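A minimal sketch of this step, assuming the legacy Databricks CLI and its "workspace import_dir" command. It is wrapped in Python so the pipeline can call it from a script; the folder names in the usage note are hypothetical.

```python
import subprocess
from typing import List


def import_dir_command(source_dir: str, target_dir: str, profile: str = "AZDO") -> List[str]:
    """Build the CLI call that uploads a local folder of notebooks."""
    return [
        "databricks", "workspace", "import_dir",
        "--overwrite",            # replace notebooks left over from a previous run
        source_dir, target_dir,
        "--profile", profile,     # profile created by the Azure DevOps task
    ]


def deploy_notebooks(source_dir: str, target_dir: str, profile: str = "AZDO") -> None:
    """Run the upload; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(import_dir_command(source_dir, target_dir, profile), check=True)
```

For example, deploy_notebooks("notebooks", "/tmp/ci-build") would upload the repo's notebooks folder to a temporary workspace folder (both paths are placeholders).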
You might notice the strange profile that is specified: it is due to the Azure DevOps tasks!
When you install the Databricks CLI using the task provided by Azure DevOps, it does not configure the default profile but a profile called AZDO in the “~/.databrickscfg” file.
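To check what the task wrote, the file can be read with Python's configparser, since “~/.databrickscfg” uses the INI format. The sketch below works on a sample string; the host and token values are placeholders, not real ones.

```python
import configparser

# Placeholder content mimicking what the Azure DevOps task writes.
SAMPLE_CFG = """
[AZDO]
host = https://adb-0000000000000000.0.azuredatabricks.net
token = dapi-placeholder
"""


def read_profile(cfg_text: str, profile: str = "AZDO") -> dict:
    """Return the settings of one named profile from a .databrickscfg content."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if not parser.has_section(profile):
        raise KeyError(f"Profile {profile!r} not found in the configuration")
    return dict(parser[profile])
```

On a real agent, the text would come from reading the file in the home directory instead of SAMPLE_CFG.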
Deploy and Run a CI job
To deploy the job we will use the dbx CLI. It is a CLI that helps you deploy jobs with your library attached to them. If you’ve followed the prerequisites, you should have configured your repo with dbx in it.
Here is a short description of the folders in your repo:
- src: contains the library which will be packaged and attached to your jobs
- conf: contains the definition of all jobs
- tools: contains the dbx installer
- notebooks: contains all our notebooks, required for the previous CI stage
- tests: contains all the unit tests
First, you need to define the job you want to deploy using dbx. By default, the configuration file is ‘conf/deployment.json’. Several examples are given in the dbx tutorial. If you want to reuse a job you’ve already created, the dbx job definition is nearly identical to the settings section of the job description returned by the following command:
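A sketch of this lookup, assuming the legacy Databricks CLI "jobs get" command, whose JSON output contains a "settings" section next to the job ID. The helper names are hypothetical.

```python
import json
import subprocess
from typing import List


def jobs_get_command(job_id: int, profile: str = "AZDO") -> List[str]:
    """Build the CLI call that prints the description of an existing job."""
    return ["databricks", "jobs", "get", "--job-id", str(job_id), "--profile", profile]


def extract_settings(job_description_json: str) -> dict:
    """Keep only the "settings" section, the part that resembles a dbx job definition."""
    return json.loads(job_description_json)["settings"]


def fetch_job_settings(job_id: int, profile: str = "AZDO") -> dict:
    """Call the CLI and return the settings of the job."""
    result = subprocess.run(
        jobs_get_command(job_id, profile), check=True, capture_output=True, text=True
    )
    return extract_settings(result.stdout)
```

The returned dictionary is a starting point for an entry in the deployment file; it still has to be adapted by hand, since the two formats are close but not identical.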
Once you’ve defined a job, you need to configure dbx (dbx configure).
Then you need to deploy the selected job (dbx deploy).
Now your job is deployed and you can see it in the Jobs interface in Databricks. The last step is to run it (dbx launch).
If you use the “--trace” option, the Azure DevOps task fails when the job run fails.
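The configure, deploy and launch steps can be sketched as follows. The flag names are those of the dbx releases from the period of this article and are an assumption for any other version (check "dbx --help"); the job name in the usage note is hypothetical.

```python
import subprocess
from typing import List


def dbx_configure(environment: str = "default", profile: str = "AZDO") -> List[str]:
    """Link a dbx environment to the CLI profile created by the Azure DevOps task."""
    return ["dbx", "configure", "--environment", environment, "--profile", profile]


def dbx_deploy(job_name: str, environment: str = "default",
               deployment_file: str = "conf/deployment.json") -> List[str]:
    """Deploy one job from the deployment file."""
    return ["dbx", "deploy", "--environment", environment,
            "--deployment-file", deployment_file, "--jobs", job_name]


def dbx_launch(job_name: str, environment: str = "default", trace: bool = True) -> List[str]:
    """Launch the job; with --trace the command waits for the run and reports its result."""
    command = ["dbx", "launch", "--environment", environment, "--job", job_name]
    if trace:
        command.append("--trace")
    return command


def run_ci_job(job_name: str) -> None:
    """Run the three steps in order and stop at the first failing command."""
    for command in (dbx_configure(), dbx_deploy(job_name), dbx_launch(job_name)):
        subprocess.run(command, check=True)
```

Calling run_ci_job("ci-integration-test") from a pipeline script raises CalledProcessError as soon as one command exits with a non-zero code, which fails the pipeline step.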
Deploy in a new workspace
As you can see, our CI pipeline is quite complete now. In fact, to deploy your environment in a new workspace you only need to reuse some of its steps:
- Configure the Databricks CLI and dbx
- Restart the jobs (this needs a little scripting to retrieve all job names)
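The "little scripting" can look like the sketch below. It assumes the job list comes from the legacy CLI command "databricks jobs list --output JSON", whose output holds a "jobs" array with the name under "settings"; the dbx flags are the same assumption as in the CI stage.

```python
import json
from typing import List


def job_names(jobs_list_json: str) -> List[str]:
    """Extract the job names from the JSON printed by the jobs list command."""
    payload = json.loads(jobs_list_json)
    return sorted(job["settings"]["name"] for job in payload.get("jobs", []))


def launch_commands(jobs_list_json: str, environment: str = "default") -> List[List[str]]:
    """Build one dbx launch command per job found in the workspace."""
    return [
        ["dbx", "launch", "--environment", environment, "--job", name]
        for name in job_names(jobs_list_json)
    ]
```

Each returned command can then be passed to subprocess.run(command, check=True) to restart the corresponding job.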
Configuring our Databricks Workspace took a lot more time than expected. We’ve encountered the following issues.
How to remove credentials from the code
The best solution I found is to link a Databricks Secret Scope with an Azure Key Vault.
Here is a link to the official documentation on how to do it.
Once you are done configuring the Secret Scope, you can use dbutils to access all the secrets of the linked Key Vault. Here is an example:
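A minimal sketch, assuming a Secret Scope named "key-vault-scope" and a secret named "storage-account-key" (both names are hypothetical). In a notebook, dbutils already exists as a global; it is passed as a parameter here so the helper can also live in a library.

```python
# Hypothetical names: use the scope and secret names defined in your workspace.
SECRET_SCOPE = "key-vault-scope"


def get_secret(dbutils, key: str, scope: str = SECRET_SCOPE) -> str:
    """Read one secret from the Key Vault-backed Secret Scope.

    dbutils.secrets.get returns the secret value; Databricks redacts it
    if you try to display it in a notebook output.
    """
    return dbutils.secrets.get(scope=scope, key=key)


# In a notebook cell:
# storage_key = get_secret(dbutils, "storage-account-key")
```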
If you have multiple Databricks workspaces to separate different environments, you can create a Secret Scope with the same name in each workspace, linked to the Key Vault of the corresponding environment. The same code then works in every environment!
How Azure Pipelines can access Databricks
Using a predefined Databricks token is not the best solution in terms of security and durability. To fix this issue, we needed a service account (a service principal) to log in to Databricks. I found this documentation, which will help you grant a service principal access to Databricks.
Now that you have a service principal that can access Databricks, you need to generate a Databricks token. To do so you can follow this documentation.
Be careful: this token is only valid for a short time, so you will need to add a step in your pipeline to generate it!
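Such a pipeline step can be sketched as two HTTP calls: an Azure AD client-credentials request scoped to the Azure Databricks resource, then a call to the Databricks Token API with the resulting Azure AD token. This is a sketch under assumptions, not necessarily the exact flow of the linked documentation: it assumes the service principal has already been added to the workspace, and the tenant, client and workspace values are placeholders.

```python
import json
import urllib.parse
import urllib.request

# Application ID of the Azure Databricks resource in Azure AD.
DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"


def aad_token_request(tenant_id: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build the client-credentials request that returns an Azure AD access token."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": f"{DATABRICKS_RESOURCE_ID}/.default",
    }).encode()
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    return urllib.request.Request(url, data=body, method="POST")


def databricks_token_request(workspace_url: str, aad_token: str,
                             lifetime_seconds: int = 3600) -> urllib.request.Request:
    """Build the Token API request that creates a short-lived Databricks token."""
    body = json.dumps({"lifetime_seconds": lifetime_seconds, "comment": "CI/CD pipeline"}).encode()
    request = urllib.request.Request(
        f"{workspace_url.rstrip('/')}/api/2.0/token/create", data=body, method="POST"
    )
    request.add_header("Authorization", f"Bearer {aad_token}")
    request.add_header("Content-Type", "application/json")
    return request


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

In the pipeline, send(aad_token_request(...))["access_token"] gives the Azure AD token, and send(databricks_token_request(workspace_url, aad_token))["token_value"] gives the Databricks token to store in the CLI profile. The client secret itself should come from a secret pipeline variable.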
All-purpose clusters library management
This issue may be the most time-consuming one I had. When users started to deploy jobs regularly in our environment, clusters began to fail to install the libraries needed by the jobs.
To fix this issue, we added several steps to our pipeline:
- Stop all running jobs: this avoids any issue with streaming jobs or jobs writing into a file (a file was corrupted once)
- Remove all dbx artifacts: this stops jobs that restart on their schedule from installing an old artifact
- Uninstall the libraries and restart the clusters
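The three steps above can be sketched as a plan of Databricks REST API 2.0 calls. The function below only builds the ordered list of endpoints and payloads from the responses of "jobs/runs/list" (with active_only) and "libraries/cluster-status"; sending them is left to the pipeline. The artifact path is hypothetical and depends on how dbx is configured.

```python
from typing import List, Tuple

ApiCall = Tuple[str, dict]


def cleanup_plan(active_runs: dict, cluster_id: str, library_status: dict,
                 artifact_path: str) -> List[ApiCall]:
    """Return the POST calls to send, in order, before redeploying."""
    calls: List[ApiCall] = []

    # 1. Stop all running jobs.
    for run in active_runs.get("runs", []):
        calls.append(("/api/2.0/jobs/runs/cancel", {"run_id": run["run_id"]}))

    # 2. Remove the old dbx artifacts so restarted jobs cannot install them.
    calls.append(("/api/2.0/dbfs/delete", {"path": artifact_path, "recursive": True}))

    # 3. Uninstall the libraries, then restart the cluster so the uninstall takes effect.
    libraries = [status["library"] for status in library_status.get("library_statuses", [])]
    if libraries:
        calls.append(("/api/2.0/libraries/uninstall",
                      {"cluster_id": cluster_id, "libraries": libraries}))
    calls.append(("/api/2.0/clusters/restart", {"cluster_id": cluster_id}))
    return calls
```

Keeping the plan separate from the HTTP calls makes it easy to print and review what the pipeline is about to do before it touches a shared cluster.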
To avoid this issue altogether, I strongly recommend using Job Compute instead of all-purpose clusters. It is also way cheaper!
Deploying this CI/CD pipeline was quite challenging, but in the end the feedback from the Data Scientists is great and deployment to a new environment is fully automated.
This tutorial helps you create a CI/CD pipeline on an already existing infrastructure. The next step will be to transform this existing infrastructure into IaC (Infrastructure as Code). To do so, a Databricks provider exists in Terraform. If you wish to implement it, I advise you to read this article on Terraform first!