How to Work With Git in Datalore

Git is a tool that is commonly used by data science teams. In this tutorial, we’ll describe the ways you can work with Git in Datalore, our collaborative data science platform. 

Read on to learn how to install Git repositories, edit the content of these repositories, and version your work with Datalore. 

How to install a Git repository in your Datalore notebook environment

If you or your team develop a collection of Python scripts or a pip-compatible package stored in Git, you can conveniently access this repository from Jupyter notebooks in Datalore. 

There are 3 ways to do this. We recommend choosing the method that best suits your repository access level and type.

Using Environment | Repositories:
  • Repository access level: From a chosen notebook only
  • Repository type: Public Git repositories and private Git repositories via SSH
  • Installation specifics: Installed on demand, can be refreshed at any time from the UI
  • Refresh type: Refresh button and restart kernel
  • Available actions: Clone, pull

Using Tools | Terminal or IPython magic commands:
  • Repository access level: From a chosen notebook or from each notebook in a chosen workspace
  • Repository type: Any private or public Git or non-Git repositories (Artifactory, Space Packages, privately hosted PyPI repositories)
  • Installation specifics: Installed on demand using Git CLI, certain options can be automated with init.sh and installed on notebook computation start
  • Refresh type: Using Git CLI via Terminal
  • Available actions: Clone, pull, push

Using a team’s base environment (Enterprise-only feature):
  • Repository access level: Any notebook of any team member in any workspace
  • Repository type: Any private or public Git or non-Git repositories (Artifactory, Space Packages, privately hosted PyPI repositories)
  • Installation specifics: Installed as part of the custom Docker image
  • Refresh type: Rebuild Docker image
  • Available actions: Clone on image creation

The main purpose of cloning Git repositories to Datalore is to gain access to custom Python modules, scripts, or functions, and edit them collaboratively in Datalore. However, it is currently not possible to edit Jupyter notebooks that were cloned as part of the Git repository. 

Using Environment | Repositories

Environment | Repositories is the easiest way to install a publicly available Git repository into a single Datalore notebook directly from the user interface. You can choose the repository’s branch and refresh the connection from the same place.

If you want to access a private Git repository, you can do so by providing SSH keys in Environment | Repositories | Keys.

If you want to access a private Git repository with a personal token or via a username and password, use an init.sh script or Terminal.

[Image: Install a publicly available Git repository in Datalore]

Using Terminal and init.sh scripts

To clone a Git repository from Terminal, open a notebook, go to Tools | Terminal, and use Git CLI commands to clone the repository. If you want to use a repository in one notebook only, clone it to Notebook files. If you want to use a repository in all of the workspace notebooks, clone it to Workspace files.
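
As a rough sketch of the choice between Notebook files and Workspace files, the snippet below builds the Git CLI command you would type in Terminal. The folder paths `/data/notebook_files` and `/data/workspace_files`, the helper name `clone_or_pull_command`, and the example URL are assumptions made for illustration, so check the actual locations in your own notebook before relying on them.

```python
from pathlib import Path

# Assumed mount points for Attached data; verify them in your own notebook.
SCOPE_DIRS = {
    "notebook": Path("/data/notebook_files"),
    "workspace": Path("/data/workspace_files"),
}


def clone_or_pull_command(repo_url, scope="notebook", base_dirs=SCOPE_DIRS):
    """Return the git command (as an argument list) for the chosen scope.

    If the target folder already exists, the command updates it with
    `git pull`; otherwise it creates the folder with `git clone`.
    """
    repo_name = repo_url.rstrip("/").rsplit("/", 1)[-1]
    if repo_name.endswith(".git"):
        repo_name = repo_name[: -len(".git")]
    target = base_dirs[scope] / repo_name
    if target.exists():
        return ["git", "-C", str(target), "pull"]
    return ["git", "clone", repo_url, str(target)]


# Hypothetical repository URL, used only to show the resulting command.
command = clone_or_pull_command(
    "https://example.com/team/analytics-utils.git", scope="workspace"
)
print(" ".join(command))
```

The function only assembles the command, so nothing is cloned until you run the printed line in Terminal.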

To access the repository contents from the notebook, import the necessary functions. Datalore provides code completion and documentation popups for imported Python modules. 
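
For illustration, here is a minimal sketch of importing a function from a cloned repository when its folder is not already on Python's import path. A temporary folder stands in for the cloned repository so the snippet runs anywhere; the repository name `analytics-utils`, the module `cleaning_helpers`, and the function `normalize_name` are hypothetical.

```python
import sys
import tempfile
from pathlib import Path

# Stand-in for a cloned repository. In Datalore, point this at the folder
# you cloned into instead of creating a temporary one.
repo_dir = Path(tempfile.mkdtemp()) / "analytics-utils"
repo_dir.mkdir()
(repo_dir / "cleaning_helpers.py").write_text(
    "def normalize_name(name):\n"
    "    return name.strip().lower()\n"
)

# Make the repository root importable if it is not already.
if str(repo_dir) not in sys.path:
    sys.path.insert(0, str(repo_dir))

from cleaning_helpers import normalize_name

cleaned = normalize_name("  Alice ")
print(cleaned)
```

If the repository is a pip-compatible package, installing it into the environment is usually cleaner than editing `sys.path`.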

If you want to automate running a set of Terminal commands on each notebook start, you can use an init.sh shell script. 

For example, you can configure access to your privately owned repositories, configure usage of your personal tokens, install non-Python dependencies, and mount file directories. These commands run automatically before the pip or conda environment manager executes the base environment setup.

[Image: Use Git CLI commands to clone a repository in Datalore]

If you need to specify a username or email to access or push files to a repository, add the following configurations to your init.sh script: 

git config --global user.email "email@example.com"
git config --global user.name "your name"

[Image: Automate running a set of Terminal commands with an init.sh script]

To make the init.sh script available for each notebook in the workspace, make sure Workspace files are attached and move the init.sh file from Notebook files to Workspace files.

Using a team’s base environment

If you want to provide centralized access to a certain repository for your team, you can make this repository part of a custom base environment. 

Base environments are custom Docker images that can be easily used as pre-built configurations when creating a new notebook in Datalore.

[Image: Provide centralized access to a Git repository through a custom base environment]

Custom base environments are available for Enterprise users only. To configure a custom base environment for Datalore Enterprise, please use this guide.

How to edit Git repository contents in Datalore

If you want to edit Python scripts or files available in your Git repository, you can clone the repository to Attached data using:

  • Tools | Terminal: This opens a terminal session and allows you to execute Git CLI commands. 
  • IPython magic commands inside a notebook’s code cells.

[Image: Edit Python scripts or files available in your Git repository in Datalore]

If you want to clone the repository and edit it from one notebook, make sure to clone it to Notebook files. If you want to edit the repository from any notebook in the workspace, clone it to Workspace files. For Home workspace files, you might need to attach Workspace files to a notebook explicitly.

After cloning the repository to Attached data, you can edit its files collaboratively.

For Python files, you also get code completion and syntax highlighting. To use the updated functions in your notebook, make sure to restart the kernel or use an autoreload extension:

%load_ext autoreload
%autoreload 2

[Image: Collaborate on editing Python files]
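
If you would rather refresh one module by hand than restart the kernel or enable autoreload, the standard library's `importlib.reload` is an alternative. The sketch below is a self-contained demonstration under that assumption: it writes a small hypothetical module, `pricing_demo`, to a temporary folder, edits it, and reloads it.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip bytecode caching so that two quick edits within the same second
# cannot be masked by a stale cache (only relevant for this fast demo).
sys.dont_write_bytecode = True

module_dir = Path(tempfile.mkdtemp())
module_file = module_dir / "pricing_demo.py"
module_file.write_text("def version():\n    return 'v1'\n")
sys.path.insert(0, str(module_dir))

import pricing_demo

before = pricing_demo.version()

# Simulate an edit made in the Datalore file editor.
module_file.write_text("def version():\n    return 'v2 (edited)'\n")

# Re-execute the module's source so the notebook sees the new definition.
pricing_demo = importlib.reload(pricing_demo)
after = pricing_demo.version()

print(before, "->", after)
```

Note that names imported earlier with `from pricing_demo import version` keep pointing at the old function after a manual reload.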

⚠️ Currently, it is not possible to edit Jupyter notebooks that are part of your cloned Git repository. To view a Jupyter notebook from the repository, double-click it and Datalore will open it in a new tab. If you are particularly interested in this workflow, please see the roadmap section at the end of this blog post.

How to version your data science work with Git and Datalore

Jupyter notebooks are first-class citizens in Datalore. To keep track of changes in the notebook, we recommend using Datalore’s History tool. 

Go to Tools | History, which allows you to:

  • Revert to previously saved states.
  • See the difference between the current version of the notebook and the checkpoints.
  • Create new custom checkpoints by pressing Ctrl/Cmd+S.
  • See any edits made by your collaborators.

Additionally, Datalore automatically creates checkpoints so that you can recover from potentially dangerous actions, such as the deletion of a cell from the notebook.

[Image: Version your notebooks in Datalore]

To version the Python files you have developed within Datalore, you can use Terminal to commit or push specific files or folders to Git.

[Image: Version your Git repositories in Datalore]
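
As an illustration of committing only specific files or folders, the sketch below assembles the corresponding Git CLI commands. The helper names, the example paths, and the defaults `origin` and `main` are assumptions; replace them with your repository's remote and branch.

```python
import shlex
import subprocess


def commit_and_push_commands(paths, message, remote="origin", branch="main"):
    """Build the git commands that stage, commit, and push only `paths`."""
    if not paths:
        raise ValueError("List at least one file or folder to commit.")
    return [
        # Stage the chosen paths (needed for files git does not track yet).
        ["git", "add", "--", *paths],
        # Passing the paths to `git commit` limits the commit to them,
        # even if other changes are already staged.
        ["git", "commit", "-m", message, "--", *paths],
        ["git", "push", remote, branch],
    ]


def run_in_repo(commands, repo_dir):
    """Run each command inside the cloned repository, stopping on the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)


# Hypothetical paths inside a cloned repository.
commands = commit_and_push_commands(
    ["src/features.py", "configs"], "Update feature helpers"
)
for command in commands:
    print(shlex.join(command))
```

The printed lines can be pasted into Terminal as they are. Calling `run_in_repo(commands, repo_dir)` from a code cell would execute them instead, provided Git access is already configured.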

How to import a Jupyter notebook from Git in Datalore

You can import a single Jupyter notebook from Git into Datalore from the Workspace file system. Click the down arrow next to the New notebook button and paste the notebook URL.

[Image: Import a Jupyter notebook from Git in Datalore]

Roadmap for future Git support improvements in Datalore 

We are working on a deeper integration with GitHub, which will be rolled out later in 2023, and we’d like to learn more about the particular workflows and use cases that are of interest to you.

If you are part of a data science team and some of the workflows you would like are missing in Datalore, please talk to us! We offer eligible candidates a $30 Amazon gift card in exchange for a 30-minute interview.

There are many ways you can work with Git repositories in Datalore. We believe that Datalore’s internal History tool and live collaboration features help you focus more on data science tasks than on working with Git. However, if you need to share your scripts, or to access or change internal repositories, you can always do so with Datalore’s Terminal, init.sh scripts, and Environment manager.

Kind regards

The Datalore team
