Version control

Stat 550

Daniel J. McDonald

Last modified – 03 April 2024


Why version control?

Much of this lecture is based on material from Colin Rundel and Karl Broman


  • Simple formal system for tracking all changes to a project
  • Time machine for your projects
    • Track blame and/or praise
    • Remove the fear of breaking things
  • Learning curve is steep, but when you need it you REALLY need it

Words of wisdom

Your closest collaborator is you six months ago, but you don’t reply to emails.

Paul Wilson

Why Git

  • You could use something like Box or Dropbox
  • These are poor-man’s version control
  • Git is much more appropriate
  • It works with large groups
  • It’s very fast
  • It’s much better at fixing mistakes
  • Tech companies use it (so it’s in your interest to have some experience)

This will hurt, but what doesn’t kill you, makes you stronger.

Overview

  • git is a command line program that lives on your machine
  • If you want to track changes in a directory, you type git init
  • This creates a (hidden) directory called .git
  • The .git directory contains a history of all changes made to “versioned” files
  • This top directory is referred to as a “repository” or “repo”
  • http://github.com is a service that hosts a repo remotely and has other features: issues, project boards, pull requests, renders .ipynb & .md
  • Some IDEs (PyCharm, RStudio, VS Code) have built-in git support
  • git/GitHub is broad and complicated. Here, just what you need
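
A minimal sketch of what git init does, run in a throwaway directory. The temporary directory and the variable name demo are assumptions for illustration, not part of the course material.

```shell
set -eu
demo=$(mktemp -d)   # hypothetical sandbox so nothing real is touched
cd "$demo"
git init -q         # start tracking changes in this directory
ls -a               # the hidden .git directory now appears
git status          # reports a repo with no commits yet
```

Deleting the .git directory removes the history but leaves your files alone.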

Aside on “Built-in” & “Command line”

Tip

First things first, RStudio and the Terminal

  • Command line is the “old” type of computing. You type commands at a prompt and the computer “does stuff”.

  • You may not have seen where this is. RStudio has one built in called “Terminal”

  • The Mac System version is also called “Terminal”. If you have a Linux machine, this should all be familiar.

  • Windows is not great at this.

  • To get the most out of Git, you have to use the command line.

Typical workflow

  1. Download a repo from GitHub
git clone https://github.com/stat550-2021/lecture-slides.git
  2. Create a branch and switch to it
git branch <branchname>
git checkout <branchname>
  3. Make changes to your files.
  4. Add your changes to be tracked (“stage” them)
git add <name/of/tracked/file>
  5. Commit your changes
git commit -m "Some explanatory message"

Repeat 3–5 as needed. Once you’re satisfied

  • Push to GitHub
# if the branch already exists on GitHub
git push
# first push of a new branch
git push -u origin <branchname>
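
The whole loop can be tried offline. This is a sketch under stated assumptions: a local repo in a temporary directory stands in for GitHub, and the names work, my-feature, analysis.R, and the Demo User identity are all made up for the example.

```shell
set -eu
work=$(mktemp -d)   # hypothetical sandbox
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# Stand-in for GitHub: a local repo that already has one commit
git init -q "$work/remote"
git -C "$work/remote" commit -q --allow-empty -m "chore: initial commit"

# 1. Download (clone) the repo
git clone -q "$work/remote" "$work/local"
cd "$work/local"
# 2. Create a branch and switch to it
git checkout -q -b my-feature
# 3. Make changes to your files
echo "x <- rnorm(10)" > analysis.R
# 4. Stage the change
git add analysis.R
# 5. Commit it
git commit -q -m "feat: add simulation script"
# Push, creating the branch on the remote
git push -q -u origin my-feature
git -C "$work/remote" branch --list   # my-feature now exists on the remote
```

With a real GitHub repo, only the clone URL changes; the remaining commands are the same.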

What should be tracked?


Definitely
code, markdown documentation, tex files, bash scripts/makefiles, …


Possibly
logs, jupyter notebooks, images (that won’t change), …


Questionable
processed data, static pdfs, …


Definitely not
full data, continually updated pdfs, other things compiled from source code, …

What things to track

  • You decide what is “versioned”.

  • A file called .gitignore tells git which files or file types it should never track

# History files
.Rhistory
.Rapp.history

# Session Data files
.RData

# Compiled junk
*.o
*.so
*.DS_Store
  • Shortcut to track everything (use carefully):
git add .
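
A small sketch showing that git add . respects .gitignore. The file names and the sandbox directory are assumptions chosen for the example.

```shell
set -eu
demo=$(mktemp -d)   # hypothetical sandbox
cd "$demo"
git init -q
printf '%s\n' '.Rhistory' '.RData' '*.o' > .gitignore
touch .Rhistory model.o analysis.R
git add .                     # stage "everything"
git status --short            # only .gitignore and analysis.R are staged
git check-ignore -v model.o   # shows which pattern caused the file to be ignored
```

git check-ignore is handy when a file is mysteriously missing from git status.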

What’s a PR?

  • This exists on GitHub (not git)
  • Demonstration

Some things to be aware of

  • master vs main
  • If you think you did something wrong, stop and ask for help
  • The hardest part is the initial setup. Then, this should all be rinse-and-repeat.
  • This book is great: Happy Git with R
    1. See Chapter 6 if you have install problems.
    2. See Chapter 9 for credential caching (avoid typing a password all the time)
    3. See Chapter 13 if RStudio can’t find git

The main/develop/branch workflow

  • When working on your own
    1. Don’t NEED branches (but you should use them, really)
    2. I make a branch if I want to try a modification without breaking what I have.
  • When working on a large team with production grade software
    1. main is protected, released version of software (maybe renamed to release)
    2. develop contains things not yet on main, but thoroughly tested
    3. On a schedule (once a week, once a month) develop gets merged to main
    4. You work on a feature branch off develop to build your new feature
    5. You do a PR against develop. Supervisors review your contributions

I, like many DS/CS/Stat faculty, use this workflow with my lab.
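
A rough local sketch of the main/develop/feature pattern. Assumptions: the branch name feature/cv and the file sim.R are invented, and a plain git merge stands in for the PR review step, which really happens on GitHub.

```shell
set -eu
demo=$(mktemp -d)   # hypothetical sandbox
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main   # make sure the first branch is called main
git config user.name "Demo User"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "chore: initial release"

git branch develop                      # long-lived integration branch
git checkout -q -b feature/cv develop   # feature branch off develop
echo "cv <- TRUE" > sim.R
git add sim.R
git commit -q -m "feat: add cross validation to simulation"

# After review (the PR on GitHub), the feature lands on develop
git checkout -q develop
git merge -q --no-ff -m "merge feature/cv into develop" feature/cv

# On the release schedule, develop is merged into main
git checkout -q main
git merge -q --no-ff -m "release: merge develop into main" develop
git --no-pager log --oneline --graph
```

The --no-ff flag forces a merge commit, so the graph shows where each branch came in.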

Protection

  • Typical for your PR to trigger tests to make sure you don’t break things

  • Typical for team members or supervisors to review your PR for compliance

Tip

I suggest (require?) you adopt the “production” version for your HW 2

Operations in RStudio

  1. Stage
  2. Commit
  3. Push
  4. Pull
  5. Create a branch

Covers:

  • Everything to do your HW / Project if you’re careful
  • Plus most other things you “want to do”

Command line versions (of the same)

git add <name/of/file>

git commit -m "some useful message"

git push

git pull

git checkout -b <name/of/branch>

Other useful stuff (but command line only)

Initializing

git config --global user.name "Daniel J. McDonald"
git config --global user.email "daniel@stat.ubc.ca"
git config --global core.editor nano 
# or emacs or ... (default is vim)
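
To check what git thinks your settings are, something like the following should work. This sketch deliberately omits --global so it only touches a throwaway repo; the Demo User identity is an assumption.

```shell
set -eu
demo=$(mktemp -d)   # hypothetical sandbox
cd "$demo"
git init -q
# Without --global, the setting applies to this repo only
git config user.name "Demo User"
git config user.email "demo@example.com"
git config --get user.name        # prints the value just set
git config --list --show-origin   # every setting, with the file it came from
```

Repo-level settings override global ones, which is useful if you commit under different emails for different projects.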

Staging

git add name/of/file # stage 1 file
git add . # stage all

Committing

# stage all modified tracked files and commit simultaneously
# (new, untracked files still need git add)
git commit -am "message" 

# open editor to write long commit message
git commit 

Pushing

# If branchname doesn't exist
# on remote, create it and push
git push -u origin branchname

Branching

# switch to branchname, error if uncommitted changes would be overwritten
git checkout branchname 
# switch to a previous commit
git checkout aec356

# create a new branch
git branch newbranchname
# create a new branch and check it out
git checkout -b newbranchname

# merge changes from branch2 into branch1
git checkout branch1
git merge branch2

# grab a file from branch2 and put it on current
git checkout branch2 -- name/of/file

git branch -v # list local branches (add -a to include remote ones)

Check the status

git status
git remote -v # list remotes
git log # show recent commits, msgs

Commit messages

  1. Write meaningful messages. Not “fixed stuff” or “oops? maybe done?”
  2. These appear in the log and help you determine what you’ve done.
  3. Think imperative mood: “add cross validation to simulation”
  4. Best to have each commit “do one thing”

Conventions: (see here for details)

  • feat: – a new feature is introduced with the changes
  • fix: – a bug fix has occurred
  • chore: – changes that do not relate to a fix or feature (e.g., updating dependencies)
  • refactor: – refactored code that neither fixes a bug nor adds a feature
  • docs: – updates to documentation such as the README or other markdown files
  • style: – changes that do not affect the function of the code
  • test: – including new or correcting previous tests
  • perf: – performance improvements
  • ci: – continuous integration related
git commit -m "feat: add cross validation to simulation, closes #251"
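
If you want to check a message against these prefixes before committing, a small shell function is enough. This is only a sketch: is_conventional is a hypothetical helper, not a git command, and the pattern covers just the prefixes listed above.

```shell
# Succeeds when the message starts with one of the listed prefixes,
# followed by a colon, a space, and some text.
is_conventional() {
  printf '%s\n' "$1" | grep -Eq '^(feat|fix|chore|refactor|docs|style|test|perf|ci): .+'
}

is_conventional "feat: add cross validation to simulation, closes #251" && echo "ok"
is_conventional "oops? maybe done?" || echo "rejected"
```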

Conflicts

  • Sometimes you merge things and “conflicts” happen.

  • Meaning that changes on one branch would overwrite changes on a different branch.

They look like this:

Here are lines that are either unchanged
from the common ancestor, or cleanly
resolved because only one side changed.

But below we have some troubles
<<<<<<< yours:sample.txt
Conflict resolution is hard;
let's go shopping.
=======
Git makes conflict resolution easy.
>>>>>>> theirs:sample.txt

And here is another line that is cleanly 
resolved or unmodified.

You decide what to keep

  1. Your changes (above ======)
  2. Their changes (below ======)
  3. Both.
  4. Neither.

Always delete the <<<<<<<, =======, and >>>>>>> lines.

Once you’re satisfied, stage the file and commit to resolve the conflict.
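
You can manufacture a conflict safely to practise. A sketch, assuming a throwaway repo, a branch called theirs, and a one-line sample.txt; in a real merge the marker labels show your own branch names.

```shell
set -eu
demo=$(mktemp -d)   # hypothetical sandbox
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "original line" > sample.txt
git add sample.txt
git commit -q -m "docs: add sample"

git checkout -q -b theirs
echo "Git makes conflict resolution easy." > sample.txt
git commit -q -am "docs: their edit"

git checkout -q main
echo "Conflict resolution is hard;" > sample.txt
git commit -q -am "docs: my edit"

git merge theirs || true   # reports the conflict and exits non-zero
cat sample.txt             # now contains the three kinds of marker lines

# Resolve: keep their version, with the markers gone
echo "Git makes conflict resolution easy." > sample.txt
git add sample.txt
git commit -q -m "fix: resolve conflict in sample.txt"
```

If you get lost halfway through a real merge, git merge --abort puts things back as they were before the merge started.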

Some other pointers

  • Commits have long names: 32b252c854c45d2f8dfda1076078eae8d5d7c81f
    • If you want to use it, you need “enough to be unique”: 32b25
  • Online help uses directed graphs in ways different from statistics:
    • In stats, arrows point from cause to effect, forward in time
    • In git docs, it’s reversed, they point to the thing on which they depend
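
Both points can be seen in a throwaway repo. A sketch with assumed names (demo, full, short); the hashes you get will differ from the ones on this slide.

```shell
set -eu
demo=$(mktemp -d)   # hypothetical sandbox
cd "$demo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "chore: first commit"
git commit -q --allow-empty -m "chore: second commit"

full=$(git rev-parse HEAD)            # the long name
short=$(git rev-parse --short HEAD)   # a prefix git considers unique
echo "$full"
echo "$short"
git rev-parse "$short"                # expands the prefix back to the long name

# Each commit records its parent, so the arrow points backward in time
git --no-pager log --format='%h -> %p'
```

The first commit has no parent, so its line ends with an empty right-hand side.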

Cheat sheet

https://training.github.com/downloads/github-git-cheat-sheet.pdf

How to undo in 3 scenarios

  • Suppose we’re concerned about a file named README.md
  • Often, git status will give some of these as suggestions

1. Saved but not staged

In RStudio, select the file, click the gear icon (“More”), then select Revert…

# grab the old committed version
git checkout -- README.md 

2. Staged but not committed

In RStudio, uncheck the box by the file, then use the method above.

# first unstage, then same as 1
git reset HEAD README.md
git checkout -- README.md

3. Committed

Not easy to do in RStudio…

# check the log to find the change 
git log
# go one step before that 
# (e.g., to commit 32b252)
# and grab that earlier version
git checkout 32b252 -- README.md


# alternatively, if it happens
# to also be on another branch
git checkout otherbranch -- README.md
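
All three scenarios can be rehearsed in a throwaway repo. This is a sketch: the contents v1, v2, and oops are placeholders, and HEAD~1 is used in place of a hash from git log.

```shell
set -eu
demo=$(mktemp -d)   # hypothetical sandbox
cd "$demo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "v1" > README.md
git add README.md
git commit -q -m "docs: add README"
echo "v2" > README.md
git commit -q -am "docs: update README"

# 1. Saved but not staged
echo "oops" > README.md
git checkout -- README.md      # file is back to the committed v2

# 2. Staged but not committed
echo "oops" > README.md
git add README.md
git reset -q HEAD README.md    # unstage
git checkout -- README.md      # file is back to v2 again

# 3. Committed: grab the version from one commit earlier
git checkout HEAD~1 -- README.md
cat README.md                  # shows the earlier content, v1
```

Scenario 3 leaves the old version staged, so a new commit is still needed to record the rollback.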

Recovering from things

  1. Accidentally did work on main
# make a new branch with everything, but stay on main
git branch newbranch
# find out where to go to
git log
# undo everything after ace2193
git reset --hard ace2193
git checkout newbranch
  2. Made a branch, did lots of work, realized it’s trash, and you want to burn it
git checkout main
# -D (capital) is needed because the branch was never merged
git branch -D badbranch
  3. Anything more complicated, either post to Slack or LMGTFY
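
Both recoveries can be practised where nothing matters. A sketch under assumptions: the variable good plays the role of ace2193, and model.R and the commit messages are invented.

```shell
set -eu
demo=$(mktemp -d)   # hypothetical sandbox
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "chore: initial commit"
good=$(git rev-parse HEAD)    # the commit main should go back to

# 1. Work committed to main by accident
echo "fit <- lm(y ~ x)" > model.R
git add model.R
git commit -q -m "feat: add model"
git branch newbranch          # keep the work on a new branch
git reset -q --hard "$good"   # move main back
git checkout -q newbranch     # carry on working here

# 2. A branch that turned out to be trash
git checkout -q -b badbranch
git commit -q --allow-empty -m "chore: failed experiment"
git checkout -q main
git branch -D badbranch       # -D deletes even though it is unmerged
git branch --list
```

git reset --hard throws away uncommitted changes, so run git status first when doing this for real.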

Rules for HW 2

  • Each team has their own repo
  • Make a PR against main to submit
  • Tag me and all the assigned reviewers
  • Peer evaluations are done via PR review (also send to Estella)
  • YOU must make at least 5 commits (fewer will lead to deductions)
  • I review your work and merge the PR

Important

☠️☠️ Read all the instructions in the repo! ☠️☠️

Practice time…

dajmcdon/sugary-beverages