Datalore
Collaborative data science platform for teams
How to Work With Git in Datalore
Git is a tool that is commonly used by data science teams. In this tutorial, we’ll describe the ways you can work with Git in Datalore, our collaborative data science platform.
Read on to learn how to install Git repositories, edit the content of these repositories, and version your work with Datalore.
How to install a Git repository in your Datalore notebook environment
If you or your team develop a collection of Python scripts or a pip-compatible package stored in Git, you can conveniently access this repository from Jupyter notebooks in Datalore.
There are three ways to do this. We recommend choosing the method that best suits your repository access level and type.
| | Using Environment > Repositories | Using Tools > Terminal or IPython magic commands | Using a team’s base environment (Enterprise-only feature) |
|---|---|---|---|
| Repository access level | From a chosen notebook only | From a chosen notebook or from each notebook in a chosen workspace | Any notebook of any team member in any workspace |
| Repository type | Public Git repositories and private Git repositories via SSH | Any private or public Git or non-Git repositories (Artifactory, Space Packages, privately hosted PyPI repositories) | Any private or public Git or non-Git repositories (Artifactory, Space Packages, privately hosted PyPI repositories) |
| Installation specifics | Installed on demand, can be refreshed at any time from the UI | Installed on demand using Git CLI, certain options can be automated with init.sh and installed on notebook computation start | Installed as part of the custom Docker image |
| Refresh type | Refresh button and restart kernel | Using Git CLI via Terminal | Rebuild Docker image |
| Available actions | Clone, pull | Clone, pull, push | Clone on image creation |
The main purpose of cloning Git repositories to Datalore is to gain access to custom Python modules, scripts, or functions, and edit them collaboratively in Datalore. However, it is currently not possible to edit Jupyter notebooks that were cloned as part of the Git repository.
Using Environment | Repositories
Using Environment | Repositories is the easiest way to install a publicly available Git repository into a single Datalore notebook. You can choose the repository’s branch and refresh the connection directly from the user interface.
If you want to access a private Git repository, you can do so by providing SSH keys in Environment | Repositories | Keys.
If you want to access a private Git repository with a personal token or via a username and password, use an init.sh script or Terminal.
Using Terminal and init.sh scripts
To clone a Git repository from Terminal, open a notebook, go to Tools | Terminal, and use Git CLI commands to clone a repository. If you want to use a repository in one notebook only, clone it to Notebook files. If you want to use a repository in all of the workspace notebooks, clone it to Workspace files.
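As a rough illustration of the cloning step, the sketch below builds the same `git clone` command you would type in Terminal, so that it can also be run from a code cell. The repository URL is a placeholder, the helper name `build_clone_command` is hypothetical, and the paths `/data/notebook_files` and `/data/workspace_files` are assumptions about where Notebook files and Workspace files are mounted. Check the actual paths in your own notebook before running it.

```python
import subprocess
from pathlib import PurePosixPath

# Assumed mount points for Attached data; verify them in your notebook.
NOTEBOOK_FILES = PurePosixPath("/data/notebook_files")
WORKSPACE_FILES = PurePosixPath("/data/workspace_files")


def build_clone_command(url, target_root, branch=None):
    """Return the `git clone` argument list for cloning `url` under `target_root`."""
    # Derive the folder name from the URL, e.g. ".../my-utils.git" -> "my-utils".
    name = url.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    command = ["git", "clone"]
    if branch is not None:
        command += ["--branch", branch]
    command += [url, str(PurePosixPath(target_root) / name)]
    return command


# Placeholder URL; replace it with your own repository.
clone_command = build_clone_command(
    "https://example.com/team/my-utils.git", WORKSPACE_FILES, branch="main"
)
print(" ".join(clone_command))

# Uncomment to actually run the clone (needs Git and network access):
# subprocess.run(clone_command, check=True)
```

Passing `NOTEBOOK_FILES` instead of `WORKSPACE_FILES` would keep the clone local to a single notebook, matching the distinction described above.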
To access the repository contents from the notebook, import the necessary functions. Datalore provides code completion and documentation popups for imported Python modules.
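If the cloned folder is not already on Python's import path, one common approach is to add it to `sys.path` before importing. The sketch below is self-contained: it creates a temporary folder that stands in for a cloned repository, and the names `my_repo`, `datalore_demo_helpers`, and `double` are all hypothetical. In a real notebook you would point `repo_dir` at the folder that `git clone` created.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for a cloned repository; in a real notebook this would be the
# folder created by `git clone` under Notebook files or Workspace files.
repo_dir = Path(tempfile.mkdtemp()) / "my_repo"  # hypothetical name
repo_dir.mkdir()
(repo_dir / "datalore_demo_helpers.py").write_text(
    "def double(x):\n    return 2 * x\n"
)

# Put the repository root on the import path so its modules can be found.
if str(repo_dir) not in sys.path:
    sys.path.insert(0, str(repo_dir))
importlib.invalidate_caches()

# Import the module by name and call one of its functions.
helpers = importlib.import_module("datalore_demo_helpers")
result = helpers.double(21)
print(result)
```

For a pip-compatible package, installing it into the environment is usually a cleaner option than editing `sys.path`.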
If you want to automate running a set of Terminal commands on each notebook start, you can use an init.sh shell script.
For example, you can configure access to your privately owned repositories, configure usage of your personal tokens, install non-Python dependencies, and mount file directories. The script runs automatically before the pip or conda environment manager executes the base environment setup.
If you need to specify a username or email to access or push files to a repository, add the following configurations to your init.sh script:
git config --global user.email "email@example.com"
git config --global user.name "your name"
To make the init.sh script available for each notebook in the workspace, make sure Workspace files are attached and move the init.sh file from Notebook files to Workspace files.
Using a team’s base environment
If you want to provide centralized access to a certain repository for your team, you can make this repository part of a custom base environment.
Base environments are custom Docker images that can be easily used as pre-built configurations when creating a new notebook in Datalore.
Custom base environments are available for Enterprise users only. To configure a custom base environment for Datalore Enterprise, please use this guide.
How to edit Git repository contents in Datalore
If you want to edit Python scripts or files available in your Git repository, you can clone the repository to Attached data using:
- Tools | Terminal: This opens a terminal session and allows you to execute Git CLI commands.
- IPython magic commands inside a notebook’s code cells.
If you want to clone the repository and edit it from one notebook, make sure to clone it to Notebook files. If you want to edit the repository from any notebook in the workspace, clone it to Workspace files. For Home workspace files, you might need to attach Workspace files to a notebook explicitly.
After cloning the repository to Attached data, you can edit the file’s contents collaboratively.
For Python files, you also get code completion and syntax highlighting. To use the updated functions in your notebook, make sure to restart the kernel or use an autoreload extension:
%load_ext autoreload
%autoreload 2
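Besides restarting the kernel or using autoreload, the Python standard library offers `importlib.reload` for re-running a single module on demand. This is a different technique from the autoreload extension, shown here only as a minimal sketch. The folder and module names (`demo_repo`, `demo_reload_module`) are hypothetical, and the temporary folder stands in for a cloned repository.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for a cloned repository folder (hypothetical names throughout).
repo_dir = Path(tempfile.mkdtemp()) / "demo_repo"
repo_dir.mkdir()
module_path = repo_dir / "demo_reload_module.py"
module_path.write_text("def greeting():\n    return 'hello'\n")

sys.path.insert(0, str(repo_dir))
importlib.invalidate_caches()
import demo_reload_module

before = demo_reload_module.greeting()

# Simulate a collaborator editing the file after it was first imported.
module_path.write_text("def greeting():\n    return 'hello, team'\n")

# Re-execute the module's source so the notebook sees the new definition.
importlib.reload(demo_reload_module)
after = demo_reload_module.greeting()

print(before, "->", after)
```

Note that `importlib.reload` updates the module object only. Names bound earlier with `from demo_reload_module import greeting` keep pointing at the old function until they are imported again.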
⚠️ Currently it is not possible to edit Jupyter notebooks that are part of your cloned Git repository. To view Jupyter notebooks from the repository, you can double-click them and Datalore will open the notebook in a new tab. If you are particularly interested in this workflow, please see the last paragraph of this blog post.
How to version your data science work with Git and Datalore
Jupyter notebooks are first-class citizens in Datalore. To keep track of changes in the notebook, we recommend using Datalore’s History tool.
Go to Tools | History, which allows you to:
- Revert to previously saved states.
- See the difference between the current version of the notebook and the checkpoints.
- Create new custom checkpoints by pressing Ctrl/Cmd+S.
- See any edits made by your collaborators.
Additionally, Datalore automatically creates checkpoints so that you can recover from potentially dangerous actions, such as the deletion of a cell from the notebook.
To version the Python files you have developed within Datalore, you can use Terminal to commit or push specific files or folders to Git.
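The commit-and-push step can be sketched as follows. The helper builds the Git commands you would otherwise type in Terminal and prints them, and only runs them if you uncomment the last line. The helper names, the repository path, the file names, and the commit message are all hypothetical, and the sketch assumes that the repository is already cloned and that your Git identity and credentials are configured.

```python
import subprocess


def build_version_commands(repo_dir, paths, message, push=True):
    """Return the Git commands that stage `paths`, commit, and optionally push."""
    git = ["git", "-C", str(repo_dir)]  # -C runs Git inside the given directory
    commands = [
        # Stage only the listed files or folders.
        git + ["add", "--"] + [str(p) for p in paths],
        # Commit whatever is currently staged in the repository.
        git + ["commit", "-m", message],
    ]
    if push:
        commands.append(git + ["push"])
    return commands


def run_commands(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Hypothetical repository location and file names.
commands = build_version_commands(
    "/data/workspace_files/my-utils",
    ["helpers.py", "pipelines/"],
    "Update helper functions",
)
for command in commands:
    print(" ".join(command))

# Uncomment to execute (needs Git, a cloned repository, and push access):
# run_commands(commands)
```

Keeping command construction separate from execution makes it easy to review exactly what will be staged and pushed before anything touches the remote repository.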
How to import a Jupyter notebook from Git in Datalore
You can import a single Jupyter notebook from Git from the Workspace file system in Datalore. Click the down arrow next to the New notebook button and paste the notebook URL.
Roadmap for future Git support improvements in Datalore
We are working on a deeper integration with GitHub, which will be rolled out later in 2023, and we’d like to learn more about the particular workflows and use cases that are of interest to you.
If you are part of a data science team and some of the workflows you would like are missing in Datalore, please talk to us! We offer eligible candidates a $30 Amazon gift card in exchange for a 30-minute interview.
There are many ways you can work with Git repositories in Datalore. We believe that Datalore’s internal History tool and live collaboration features help you focus on data science tasks rather than on working with Git. However, if you need to share your scripts, access internal repositories, or push changes to them, you can always do so with Datalore’s Terminal, init.sh scripts, and Environment manager.
Kind regards
The Datalore team