Distributed datasets with DataLad
You can find this page at https://folio.vastcloud.org/datalad
What is DataLad?
DataLad is a version control system for your data. It is built on top of Git and git-annex, and is available both as a command-line tool and as a Python API.
Git
Git is a version control system designed to keep track of software projects and their history, to merge edits from multiple authors, and to work with branches (distinct project copies) that can be merged back into the main project. Since Git was designed for version control of text files, it can also be applied to writing projects, such as manuscripts, theses, website repositories, etc.
I assume that most attendees are familiar with Git, but we can certainly do a quick command-line Git demo.
Git can also keep track of binary (non-text) and/or large data files, but putting such files under version control, and especially modifying them, will inflate the size of the repository.
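To illustrate why repositories inflate, here is a minimal throw-away demo (hypothetical file names; assumes Git is already configured): every committed version of a binary file is stored in full.
cd ~/tmp
git init inflate-demo && cd inflate-demo
dd if=/dev/urandom of=image.dat bs=1024 count=1024 # create a 1 MB binary file
git add image.dat && git commit -m "version 1"
du -sh .git # roughly 1 MB
dd if=/dev/urandom of=image.dat bs=1024 count=1024 # "modify" the file
git add image.dat && git commit -m "version 2"
du -sh .git # roughly 2 MB: random data cannot be compressed, so both versions are stored in full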
git-annex
Git-annex was built on top of Git and was designed to share and synchronize large files in a distributed fashion. The file content is managed separately from the dataset’s structure / metadata – the latter is kept under Git version control, while file contents are stored in a separate directory (the annex). If you look inside a git-annex repository, you will see that annexed files are replaced with symbolic links, and in fact you don’t have to keep the actual data stored locally, e.g. if you want to reduce disk space usage.
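Here is a minimal sketch of this behaviour (a hypothetical throw-away repository; requires git-annex):
cd ~/tmp
git init annex-demo && cd annex-demo
git annex init "demo repository"
dd if=/dev/urandom of=data.bin bs=1024 count=1024
git annex add data.bin # the content moves into .git/annex/objects/...
git commit -m "added data.bin"
ls -l data.bin # a symbolic link into the annex, not a plain file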
DataLad
DataLad builds on top of Git and git-annex, retaining all their features, and adds a few more:
- Datasets can be nested, and most DataLad commands have a --recursive option that will traverse subdatasets and do “the right thing” (see the short sketch after this list).
- DataLad can run commands on data, and if a dataset is not present locally, DataLad will automatically get the required input files from a remote repository.
- DataLad can keep track of data provenance, e.g. datalad download-url will download files, add them to the repository, and keep a record of data origin.
- A few other features.
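A minimal sketch of nesting and recursion (the dataset names here are hypothetical):
datalad create parent # top-level dataset
datalad create -d parent parent/sub # a subdataset, registered in the parent
datalad status -d parent --recursive # traverses the subdataset as well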
As you will see in this workshop, most DataLad workflows involve running commands from all three tools – git, git annex, and datalad – so we’ll be using the functionality of all three layers.
Installation
On a Mac with Homebrew installed:
brew upgrade
brew install git-annex
brew install datalad
With pip (Python’s package manager) use one of these two:
pip install datalad # if you don't run into permission problems
pip install --user datalad # to force installation into user space
With conda:
conda install -c conda-forge datalad
conda update -c conda-forge datalad
DataLad also needs Git and git-annex; install them if they are not already present. For more information, visit the official installation guide.
On a cluster you can install DataLad into your $HOME directory:
module load git-annex # need this each time you use DataLad
module load python
virtualenv --no-download ~/datalad-env
source ~/datalad-env/bin/activate
pip install --no-index --upgrade pip
pip install datalad
deactivate
alias datalad=$HOME/datalad-env/bin/datalad # best to add this line to your ~/.bashrc file
Alternatively, you can install DataLad into your group’s /project directory:
module load git-annex # need this each time you use DataLad
module load python
cd ~/projects/def-sponsor00/shared
virtualenv --no-download datalad-env
source datalad-env/bin/activate
pip install --no-index --upgrade pip
pip install datalad
deactivate
chmod -R og+rX datalad-env
Then everyone in the group can activate DataLad with:
module load git-annex # need this each time you use DataLad
alias datalad=/project/def-sponsor00/shared/datalad-env/bin/datalad # best to add this line to your ~/.bashrc file
Initial configuration
All these settings go into ~/.gitconfig:
git config --global --add user.name "First Last" # your name
git config --global --add user.email name@domain.ca # your email address
git config --global init.defaultBranch main
Basics
Create a new dataset
Some files in your dataset will be stored as plain files, while others will be put in the annex, i.e. they will be replaced with symbolic links and might not even be stored locally. Annexed files cannot be modified directly (more on this later). The command datalad run-procedure --discover shows you a list of available configuration procedures (you can run it yourself, as shown after this list). On my computer they are:
- text2git: do not put anything that is a text file in the annex, i.e. process them with regular Git
- yoda: configure a dataset according to the yoda principles
- noannex: put everything under regular Git control
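To check the list on your own machine:
datalad run-procedure --discover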
cd ~/tmp
datalad create --description "our first dataset" -c text2git test # use `text2git` configuration
cd test
ls
git log
Add some data
Let’s use some file examples from the official DataLad handbook:
mkdir books
wget -q https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf -O books/theLinuxCommandLine.pdf
wget -q https://homepages.uc.edu/~becktl/byte_of_python.pdf -O books/aByteOfPython.pdf
ls books
datalad status
datalad save -m "added a couple of books on Linux and Python"
ls books
git log -n 1 # check last commit
git log -n 1 -p # check last commit in details
git config --global alias.one "log --graph --date-order --date=short --pretty=format:'%C(cyan)%h %C(yellow)%ar %C(auto)%s%+b %C(green)%ae'"
git one # custom alias
git log --oneline # a short alternative
Let’s add another couple of books using a built-in downloading command:
datalad download-url https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf \
--dataset . -m "added a reference book about git" -O books/proGit.pdf
datalad download-url http://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf \
--dataset . -m "added bash guide for beginners" -O books/bashGuideForBeginners.pdf
ls books
tree
datalad status # nothing to be saved
git log # `datalad download-url` took care of that
git annex whereis books/proGit.pdf # show the available copies (including the URL source)
git annex whereis books # show the same for all books
Create and commit a short text file:
cat << EOT > notes.txt
We have downloaded 4 books.
EOT
datalad save -m "added notes.txt"
git log -n 1 # see the last commit
git log -n 1 -p # and its file changes
Notice that the text file was not annexed: there is no symbolic link. This means that we can modify it easily:
echo "Text files are not in the annex." >> notes.txt
datalad save -m "edited notes.txt"Subdatasets
Subdatasets
Let’s clone a remote dataset and store it locally as a subdataset:
datalad clone --dataset . https://github.com/datalad-datasets/machinelearning-books # get its structure
tree
du -s machinelearning-books # not much data there (large files were not downloaded)
cd machinelearning-books
datalad status --annex # if all files were present: 9 annex'd files (74.4 MB recorded total size)
datalad status --annex all # check how much data we have locally: 0.0 B/74.4 MB present/total size
datalad status --annex all A.Shashua-Introduction_to_Machine_Learning.pdf # 683.7 KB
Ok, this file is not too large, so we can download it easily:
datalad get A.Shashua-Introduction_to_Machine_Learning.pdf
datalad status --annex all # now we have 683.7 KB/74.4 MB present/total size
open A.Shashua-Introduction_to_Machine_Learning.pdf # it should open
datalad drop A.Shashua-Introduction_to_Machine_Learning.pdf # delete the local copy
git log # this particular dataset's history (none of our commands show here: we did not modify it)
cd ..
Running scripts
cd machinelearning-books # from ~/tmp/test
git annex find --not --in=here # show remote files
mkdir code
cat << EOT > code/titles.sh
for file in \$(git annex find --not --in=here); do
echo \$file | sed 's/^.*-//' | sed 's/\.pdf//' | sed 's/_/ /g'
done
EOT
cat code/titles.sh
datalad save -m "added a short script to write a list of book titles"
datalad run -m "create a list of books" "bash code/titles.sh > list.txt"
cat list.txt
git log # the command run record went into the log
Now we will modify and rerun this script:
datalad unlock code/titles.sh # move the script out of the annex to allow edits
cat << EOT > code/titles.sh
for file in \$(git annex find --not --in=here); do
title=\$(echo \$file | sed 's/^.*-//' | sed 's/\.pdf//' | sed 's/_/ /g')
echo \"\$title\"
done
EOT
datalad save -m "correction: enclose titles into quotes" code/titles.sh
git log -n 5 # note the hash of the last commit
datalad rerun ba90706
more list.txt
datalad diff --from ba90706 --to f88e2ce # show the filenames only
Finally, let’s extract the title page from one of the books, A.Shashua-Introduction_to_Machine_Learning.pdf. First, let’s open the book itself:
open A.Shashua-Introduction_to_Machine_Learning.pdf # this book is not here!
The book is not here … That’s not a problem for DataLad, as it can process a file that is stored remotely (as long as it is part of the dataset) 🡲 it will automatically get the required input file.
datalad run -m "extract the title page" \
--input "A.Shashua-Introduction_to_Machine_Learning.pdf" \
--output "title.pdf" \
"convert -density 300 {inputs}[0] -quality 90 {outputs}"
git log
git annex find --in=here # show local files: it downloaded the book, extracted the first page
open title.pdf
Five workflows
- two users on a shared cluster filesystem working with the same dataset,
- one user, one dataset spread over multiple drives, with data redundancy,
- publish a dataset on GitHub with annexed files in a special private remote,
- publish a dataset on GitHub with publicly-accessible annexed files on Nextcloud, and
- if we have time: manage multiple Git repos under one dataset
(2) one user, one dataset spread over multiple drives, with data redundancy
Initially I created this scenario with two external USB drives. In the interest of time, I simplified it to a single external drive, but it can easily be extended to any number of drives.
First, let’s create an always-present dataset on the computer that will also keep track of all data stored in its clone on a removable USB drive:
cd ~/tmp
datalad create --description "Central location" -c text2git distributed
cd distributed
git config receive.denyCurrentBranch updateInstead # allow clones to update this dataset
mkdir books
wget -q https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf -O books/theLinuxCommandLine.pdf
wget -q https://homepages.uc.edu/~becktl/byte_of_python.pdf -O books/aByteOfPython.pdf
datalad save -m "added a couple of books"
ls books
du -s . # 4.9M stored here
Create a clone on a portable USB drive:
cd /Volumes/t7
datalad clone --description "t7" ~/tmp/distributed distributed
cd distributed
du -s . # no actual data was copied, just the links
git remote rename origin central
cd books
wget -q https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf -O proGit.pdf
wget -q http://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf -O bashGuideForBeginners.pdf
datalad save -m "added two more books"
git log # we have history from both drives (all 4 books)
git annex find --in=here # but only 2 books are stored here
git annex find --not --in=here # and 2 books are stored not here
for book in $(git annex find --not --in=here); do
git annex whereis $book # show their location: they are in central
done
datalad push --to central --data nothing # push metadata to central
Operations from the central dataset:
cd ~/tmp/distributed
git annex find --in=here # show local files: 2 books
git annex find --not --in=here # show remote files: 2 books
datalad status --annex all # check local data usage: 4.6 MB/17.6 MB present/total size
git annex find --lackingcopies 0 # show files that are stored only in one place
git annex whereis books/* # show locations
Let’s mount t7 and get one of its books:
datalad get books/bashGuideForBeginners.pdf # try getting this book from a remote => error
... get(error): books/bashGuideForBeginners.pdf (file) [not available]
git remote # nothing: central does not know where the remotes are stored
datalad siblings add -d . --name t7 --url /Volumes/t7/distributed
git remote # now it knows where to find the remotes
datalad get books/bashGuideForBeginners.pdf # successful!
Now unmount t7.
git annex whereis books/bashGuideForBeginners.pdf # 2 copies (here and t7)
open books/bashGuideForBeginners.pdf
Let’s remove the local copy of this book:
datalad drop books/bashGuideForBeginners.pdf # error: it tried to reach t7 to verify the remaining physical copy
datalad drop --reckless availability books/bashGuideForBeginners.pdf # do not check remotes (potentially dangerous)
git annex whereis books/bashGuideForBeginners.pdf # only 1 copy left on t7
Letting remotes know about central changes
Let’s add a DataLad book to central:
cd ~/tmp/distributed
datalad download-url http://handbook.datalad.org/_/downloads/en/stable/pdf/ \
--dataset . -m "added the DataLad Handbook" -O books/datalad.pdfThe remote knows nothing about this new book. Let’s push this update out! Make sure to mount t7 and then run the following:
cd /Volumes/t7/distributed
git config receive.denyCurrentBranch updateInstead # allow clones to update this dataset
cd ~/tmp/distributed
datalad push --to t7 --data nothing # push metadata, but not the data
Alternatively, we could update from the USB drive:
cd /Volumes/t7/distributed
datalad update -s central --how=merge
Now let’s check things from t7’s perspective:
cd /Volumes/t7/distributed
ls books/ # datalad.pdf is there
git annex whereis books/datalad.pdf # it is in central only (plus on the web)
Data redundancy
Now imagine that we want to back up all files that are currently stored in a single location, so that each file always has a second copy on the other drive.
cd /Volumes/t7/distributed
for file in $(git annex find --lackingcopies 0); do
datalad get $file
done
datalad push --to central --data nothing # update the central
git annex find --lackingcopies 0 # still two files have only 1 copy
git annex find --in=here # but they are both here already ==> makes sense
Let’s go to central and do the same:
cd ~/tmp/distributed
for file in $(git annex find --lackingcopies 0); do
datalad get $file
done
git annex find --lackingcopies 0 # none: now all files have at least two copies
git annex whereis # see where everything is
The file books/datalad.pdf is in two locations, although one of them is the web. You can correct that manually: go to t7 and run get there.
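Here is a sketch of that manual fix (assuming t7 is mounted):
cd /Volumes/t7/distributed
datalad get books/datalad.pdf # store a physical copy on t7
datalad push --to central --data nothing # let central know about the new copy
cd ~/tmp/distributed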
Try dropping a local file:
datalad drop books/theLinuxCommandLine.pdf # successful, since t7 is also mounted
datalad get books/theLinuxCommandLine.pdf # get it back
Set the minimum number of copies and try dropping again:
git annex numcopies 2
datalad drop books/theLinuxCommandLine.pdf # can't: need minimum 2 copies!
(3) publish a dataset on GitHub with annexed files in a special private remote
At some stage, you might want to publish a dataset on GitHub that contains some annexed data. The problem is that annexed data could be large, and you can quickly run into problems with GitHub’s storage/bandwidth limitations. Moreover, free accounts on GitHub do not support working with annexed data.
With DataLad, however, you can host large/annexed files elsewhere and still have the dataset published on GitHub. This is done with so-called special remotes. The published dataset on GitHub stores the information about where to obtain the annexed file contents when you run datalad get.
Special remotes can point to Amazon S3, Dropbox, Google Drive, WebDAV, sftp servers, etc.
Let’s create a small dataset with an annexed file:
cd ~/tmp
chmod -R u+wX publish && /bin/rm -r publish # annexed files are write-protected, hence chmod before removing a previous copy
datalad create --description "published dataset" -c text2git publish
cd publish
dd if=/dev/urandom of=test1 bs=1024 count=$(( RANDOM + 1024 ))
datalad save -m "added test1"Next, we can set up a special remote on the Alliance’s Nextcloud service. DataLad talks to special remotes via rclone protocol, so we need to install it (along with git-annex-remote-rclone utility) and then configure an rclone remote of type WebDAV:
brew install rclone
brew install git-annex-remote-rclone
rclone config
new remote
Name: nextcloud
Type of storage: 46 / WebDAV
URL: https://nextcloud.computecanada.ca/remote.php/webdav/
Vendor: 1 / Nextcloud
User name: razoumov
Password: type and confirm your password
no bearer_token
no advanced config
keep this remote
quit
Inside our dataset we now set up a nextcloud remote that will write into the directory annexedData:
git annex initremote nextcloud type=external externaltype=rclone encryption=none target=nextcloud prefix=annexedData
git remote -v
datalad siblings
datalad push --to nextcloud --data anything
If you want to share your annexedData folder with another CCDB user, log in to https://nextcloud.computecanada.ca with your CC credentials, click “share” on annexedData, then optionally type in the name/username of the user to share with.
Next, we publish the dataset on GitHub. The following command creates an empty repository called testPublish on GitHub and sets a publication dependency: all new annexed content will automatically go to Nextcloud when we push to GitHub.
datalad create-sibling-github -d . testPublish --publish-depends nextcloud
datalad siblings # +/- indicates the presence/absence of a remote data annex at this remote
datalad push --to github
Now let’s add another annexed file and push again:
dd if=/dev/urandom of=test2 bs=1024 count=$(( RANDOM + 1024 ))
datalad save -m "added test2"
datalad push --to github # automatically pushes test2 to nextcloud!
Imagine we are another user trying to download the dataset. In this demo I will use the same credentials, but in principle this could be another researcher (at least for reading only):
user001
module load git-annex # need this each time you use DataLad
alias datalad=/project/def-sponsor00/shared/datalad-env/bin/datalad # best to add this line to your ~/.bashrc file
datalad clone https://github.com/razoumov/testPublish.git publish # note that access to nextcloud is not enabled yet
cd publish
du -s . # the annexed file is not here
git annex whereis --in=here # no annexed file stored locally
git annex whereis test* # two copies: "published dataset" and nextcloud
datalad update --how merge # if you need to update the local copy (analogue of `git pull`)
rclone config # set up exactly the same configuration as before
datalad siblings -d . enable --name nextcloud # enable access to this special remote
datalad siblings # should now see nextcloud
datalad get test1
git annex whereis --in=here # now we have a local copy
dd if=/dev/urandom of=test3 bs=1024 count=$(( RANDOM + 1024 ))
datalad save -m "added test3"
datalad push --to origin # push non-annexed files to GitHub
datalad push --to nextcloud # push annexed files
datalad push --to origin # update GitHub with the new annex location info
Back in the original “published dataset” on my laptop:
datalad update --how merge
ls # now can see test3
datalad get test3
git annex whereis test3 # it is here
(4) publish a dataset on GitHub with publicly-accessible annexed files on Nextcloud
Starting from scratch, let’s push some files to Nextcloud:
cd ~/tmp
chmod -R u+wX publish && /bin/rm -r publish
dd if=/dev/urandom of=test1 bs=1024 count=$(( RANDOM + 1024 ))
rclone copy test1 nextcloud: # works since we've already set up the `nextcloud` remote in rclone
Log in to https://nextcloud.computecanada.ca with your CC credentials, on test1 click “share” followed by “share link” and “copy link”. Add /download to the copied link to form something like https://nextcloud.computecanada.ca/index.php/s/YeyNrjJfpQQ7WTq/download.
datalad create --description "published dataset" -c text2git publish
cd publish
cat << EOF > list.csv
file,link
test1,https://nextcloud.computecanada.ca/index.php/s/YeyNrjJfpQQ7WTq/download
EOF
datalad addurls --fast list.csv '{link}' '{file}' # --fast means do not download, just add URL
git annex whereis test1 # one copy (web)
Later, when needed, we can download this file with datalad get test1.
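For example (a quick check within the dataset we just created):
datalad get test1 # downloads from the recorded Nextcloud URL
git annex whereis test1 # two copies: here and the web
datalad drop test1 # we don't need the local copy right now
Next, let’s publish this dataset on GitHub: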
datalad create-sibling-github -d . testPublish2 # create an empty repo on GitHub
datalad siblings # +/- indicates the presence/absence of a remote data annex at this remote
datalad push --to github
user001
module load git-annex # need this each time you use DataLad
alias datalad=/project/def-sponsor00/shared/datalad-env/bin/datalad # best to add this line to your ~/.bashrc file
chmod -R u+wX publish && /bin/rm -r publish
datalad clone https://github.com/razoumov/testPublish2.git publish # "remote origin not usable by git-annex"
cd publish
git annex whereis test1 # one copy (web)
datalad get test1
git annex whereis test1 # now we have a local copy
(5) if we have time: manage multiple Git repos under one dataset
Create a new dataset and inside clone a couple of subdatasets:
cd ~/tmp
datalad create -c text2git envelope
cd envelope
# let's clone a few regular Git (not DataLad!) repos
datalad clone --dataset . https://github.com/razoumov/radiativeTransfer.git projects/radiativeTransfer
datalad clone --dataset . https://github.com/razoumov/sharedSnippets projects/sharedSnippets
git log # can see those two new subdatasets
Go into one of these subdatasets, modify a file, and commit it to GitHub:
cd projects/sharedSnippets
>>> add an empty line to mpiContainer.md
git status
git add mpiContainer.md
git commit -m "added another line to mpiContainer.md"
git push
This directory is still a pure Git repository (cloned from GitHub), i.e. there are no DataLad files here.
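A quick way to confirm this (run inside projects/sharedSnippets):
ls -a # there should be no .datalad/ directory and no annexed symlinks here
git log --oneline -n 3 # plain Git history, straight from GitHub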
Let’s clone our entire dataset to another location:
cd ~/tmp
datalad install --description "copy of envelope" -r -s envelope copy # `clone` has no recursive option
cd copy
cd projects/sharedSnippets
git log # cloned as of the moment of that dataset's creation; no recent update there yet
Recursively update all child Git repositories:
git remote -v # remote is origin = GitHub
cd ../.. # to ~/tmp/copy
git remote -v # remote is origin = ../envelope
# pull recent changes from "proper origin" for each subdataset
datalad update -s origin --how=merge --recursive
cd projects/sharedSnippets
git log