DANL 310 Lecture 15


Collaboration via Git and GitHub

Byeong-Hak Choe

April 7, 2022

1 / 36

Git and GitHub


2 / 36

Git(Hub) solves this problem

Git

  • Git is a distributed version control system. (Wait, what?)

  • Okay, try this: Imagine if Dropbox and the "Track changes" feature in MS Word had a baby. Git would be that baby.

  • In fact, it's even better than that because Git is optimized for the things that economists and data scientists spend a lot of time working on (e.g. code).

  • There is a learning curve, but I promise you it's worth it.

GitHub

  • It's important to realize that Git and GitHub are distinct things.

  • GitHub is an online hosting platform that provides an array of services built on top of the Git system. (Similar platforms include Bitbucket and GitLab.)

  • Just like we don't need RStudio to run R code, we don't need GitHub to use Git... But it will make our lives so much easier.

3 / 36

Git(Hub) for scientific projects

From software development...

Git and GitHub's role in global software development is not in question.

  • There's a high probability that your favourite app, program or package is built using Git-based tools. (RStudio is a case in point.)

... to scientific works

Data scientists and academic researchers are relying on it too.

  • Benefits of version control and collaboration tools aside, Git(Hub) helps to put the ideals of open science and reproducibility into practice.

  • Data science teams have increasingly strict requirements regarding reproducibility and data access. GitHub makes this easy.

4 / 36

Git(Hub) + RStudio


5 / 36

Seamless integration

One of the (many) great features of RStudio is how well it integrates version control into your everyday workflow.

  • Even though Git is a completely separate program to R, they feel like part of the same "thing" in RStudio.

  • This next section is about learning the basic Git(Hub) commands and the recipe for successful project integration with RStudio.

6 / 36

Link a GitHub repo to an RStudio Project

The starting point for our workflow is to link a GitHub repository (i.e. "repo") to an RStudio Project. Here are the steps we're going to follow:

  1. Create the repo on GitHub and initialize with a README.
  2. Copy the HTTPS/SSH link (the green "Code" button, previously labeled "Clone or download").¹
  3. Open up RStudio.
  4. Navigate to File -> New Project -> Version Control -> Git.
  5. Paste your copied link into the "Repository URL:" box.
  6. Choose the project path ("Create project as subdirectory of:") and click Create Project.

¹ It's easiest to start with HTTPS, but SSH is advised for more advanced users.

Now, I want you to practice these steps by creating your own repo on GitHub (call it "test") and cloning it via an RStudio Project.

7 / 36

Main Git operations

Now that you've cloned your first repo and made some local changes, it's time to learn the four main Git operations.

  1. Stage (or "add")
    • Tell Git that you want to add changes to the repo history (file edits, additions, deletions, etc.)
  2. Commit
    • Tell Git that, yes, you are sure these changes should be part of the repo history.
  3. Pull
    • Get any new changes made on the GitHub repo (i.e. the upstream remote), either by your collaborators or you on another machine.
  4. Push
    • Push any (committed) local changes to the GitHub repo.

For the moment, it will be useful to group the first two operations and last two operations together. (They are often combined in practice too, although you'll soon get a sense of when and why they should be split up.)

8 / 36

Why this workflow?

Creating the repo on GitHub first means that it will always be "upstream" of your (and any other) local copies.

  • In effect, this allows GitHub to act as the central node in the network of version control.
  • Especially valuable when you are collaborating on a project with others — more on this later — but also has advantages when you are working alone.

RStudio Projects are great.

  • Again, they interact seamlessly with Git(Hub), as we've just seen.
  • They also solve absolute vs. relative path problems, since the .Rproj file acts as an anchor point for all other files in the repo.
9 / 36

Git from the shell


10 / 36

Why bother with the shell?

The GitHub + RStudio Project combo is ideal for new users.

  • RStudio's Git integration and built-in GUI cover all the major operations.
  • RStudio Projects FTW.

However, I want to go over Git shell commands so that you can internalise the basics.

  • The shell is more powerful and flexible. Does some things that the RStudio Git GUI can't.
  • Potentially more appropriate for projects that aren't primarily based in R. (Although, no real harm in using RStudio Projects to clone a non-R repo.)
11 / 36

Main Git shell commands

Clone a repo.

$ git clone REPOSITORY-URL

See the commit history (hit spacebar to scroll down or q to exit).

$ git log

What has changed?

$ git status
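
The three commands above can be tried safely in a sandbox. This is a minimal sketch, assuming a throwaway local repository stands in for the GitHub URL; the directory names (seed, work) and the demo identity are hypothetical.

```shell
set -eu
demo=$(mktemp -d)                 # throwaway sandbox directory
cd "$demo"

# A small repo with one commit, standing in for the GitHub repo.
git init --quiet seed
cd seed
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"
echo "# test" > README.md
git add README.md
git commit --quiet -m "Initial commit"
cd ..

# Clone it, exactly as you would with a REPOSITORY-URL.
git clone --quiet seed work
cd work

git log --oneline                 # commit history, one line per commit
echo "notes" > notes.txt          # create a file Git does not know about yet
status=$(git status --short)      # compact form of `git status`
echo "$status"
```

In the short format, untracked files are flagged with "??", which is a quick way to spot new files you have not staged yet.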
12 / 36

Main Git shell commands (cont.)

Stage ("add") a file or group of files.

$ git add NAME-OF-FILE-OR-FOLDER

You can use wildcard characters to stage a group of files (e.g. sharing a common prefix). There are a bunch of useful flag options too:

  • Stage all files.
    $ git add -A
  • Stage updated files only (modified or deleted, but not new).
    $ git add -u
  • Stage new and updated files.
    $ git add .
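
To see how the flags differ, here is a small sketch in a throwaway repo. The file names and the demo identity are made up for illustration.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
git init --quiet .
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

echo "v1" > tracked.txt
git add tracked.txt
git commit --quiet -m "Add tracked.txt"

echo "v2" > tracked.txt        # modify a file Git already tracks
echo "new" > untracked.txt     # create a brand-new file

git add -u                     # stages the modification, skips the new file
staged_u=$(git diff --cached --name-only)

git add -A                     # stages the new file as well
staged_all=$(git diff --cached --name-only)

printf 'after -u:\n%s\nafter -A:\n%s\n' "$staged_u" "$staged_all"
```

In Git 2.x, `git add .` picks up the same kinds of changes as `git add -A`, but only at or below the directory you run it from.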
13 / 36

Main Git shell commands (cont.)

Commit your changes.

$ git commit -m "Helpful message"

Pull from the upstream repository (i.e. GitHub).

$ git pull

Push any local changes that you've committed to the upstream repo (i.e. GitHub).

$ git push
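
The full cycle can be rehearsed without touching GitHub. The sketch below assumes a bare repository on disk (hub.git, a hypothetical name) plays the role of the upstream repo; the identity is also made up.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"

git init --quiet --bare hub.git              # stand-in for the GitHub repo
git clone --quiet hub.git work 2>/dev/null   # silence the "empty repository" warning
cd work
git config user.name "Demo User"             # hypothetical identity
git config user.email "demo@example.com"

echo "# test" > README.md
git add README.md
git commit --quiet -m "Add README"

branch=$(git symbolic-ref --short HEAD)      # whatever the default branch is called
git push --quiet -u origin "$branch"         # first push also sets the upstream

echo "More text." >> README.md
git add README.md
git commit --quiet -m "Expand README"
git pull --quiet                             # nothing new upstream, so nothing changes
git push --quiet

local_head=$(git rev-parse HEAD)
remote_head=$(git --git-dir="$demo/hub.git" rev-parse "$branch")
echo "local:  $local_head"
echo "remote: $remote_head"
```

After the final push, the two hashes should match, which is the sign that the upstream repo has everything you committed locally.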
14 / 36

Merge conflicts


15 / 36

Collaboration time

Turn to the person next to you. You are now partners. (Congratulations.)

  • P1: Invite P2 to join you as a collaborator on the "test" GitHub repo that you created earlier. (See the Settings tab of your repo.)

  • P2: Clone P1's repo to your local machine.¹ Make some edits to the README (e.g. delete lines of text and add your own). Stage, commit and push these changes.

  • P1: Make your own changes to the README on your local machine. Stage, commit and then try to push them (after pulling from the GitHub repo first).

¹ Change into a new directory first, or give the clone a different name, to avoid conflicts with your own "test" repo. Don't worry, Git tracking will still work if you change the repo name locally.

Did P1 encounter a merge conflict error?

  • Good, that's what we were trying to trigger.
  • Now, let's learn how to fix them.
16 / 36

Merge conflicts

Let's confirm what's going on.

$ git status

As part of the response, you should see something like:

Unmerged paths:
(use "git add <file>..." to mark resolution)
* both modified: README.md

Git is protecting P1 by refusing the merge. It wants to make sure that you don't accidentally overwrite all of your changes by pulling P2's version of the README.

  • In this case, the source of the problem was obvious. Once we start working on bigger projects, however, git status can provide a helpful summary to see which files are in conflict.
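
If you want to reproduce the situation on a single machine, the sketch below plays both partners. It assumes a local bare repo (hub.git) in place of GitHub and two clones (p1, p2); all names are hypothetical. The pull uses `--no-rebase` so that Git attempts a merge explicitly, since recent Git versions may otherwise ask you to choose a strategy.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"

# Shared "GitHub" repo, seeded by P1.
git init --quiet --bare hub.git
git clone --quiet hub.git p1 2>/dev/null
cd p1
git config user.name "Partner 1"
git config user.email "p1@example.com"
printf '# README\nSome text here.\n' > README.md
git add README.md
git commit --quiet -m "Initial commit"
branch=$(git symbolic-ref --short HEAD)
git push --quiet -u origin "$branch"
cd ..

# P2 clones, edits the second line and pushes first.
git clone --quiet --branch "$branch" hub.git p2
cd p2
git config user.name "Partner 2"
git config user.email "p2@example.com"
printf '# README\nText added by Partner 2.\n' > README.md
git commit --quiet -am "P2 edits README"
git push --quiet
cd ..

# P1 edits the same line, commits, then pulls.
cd p1
printf '# README\nText added by Partner 1.\n' > README.md
git commit --quiet -am "P1 edits README"
git pull --quiet --no-rebase || true      # the merge stops on the conflict

conflicted=$(git diff --name-only --diff-filter=U)
echo "Files in conflict: $conflicted"
git status --short                        # conflicted files are flagged "UU"
```

`git diff --name-only --diff-filter=U` is a compact alternative to reading the full `git status` output when many files are involved.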
17 / 36

Merge conflicts (cont.)

Okay, let's see what's happening here by opening up the README file in RStudio.

You should see something like:

# README
Some text here.
<<<<<<< HEAD
Text added by Partner 1.
=======
Text added by Partner 2.
>>>>>>> 814e09178910383c128045ce67a58c9c1df3f558
More text here.
18 / 36

Merge conflicts (cont.)

What do these symbols mean?

# README
Some text here.
<<<<<<< HEAD
Text added by Partner 1.
=======
Text added by Partner 2.
>>>>>>> 814e09178910383c128045ce67a58c9c1df3f558
More text here.
  • <<<<<<< HEAD Indicates the start of the merge conflict. The lines that follow are your own local version (here, P1's).
  • ======= Indicates the break point used for comparison.
  • >>>>>>> <long string> Indicates the end of the lines that had a merge conflict. The lines just above it are the incoming version (here, P2's), and the long string is the hash of the incoming commit.
19 / 36

Merge conflicts (cont.)

Fixing these conflicts is a simple matter of (manually) editing the README file.

  • Delete the lines of the text that you don't want.
  • Then, delete the special Git merge conflict symbols.

Once that's done, you should be able to stage, commit, pull and finally push your changes to the GitHub repo without any errors.
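
The fix itself can be sketched end to end in a sandbox. To keep it short, this example assumes a single throwaway repo in which a branch (partner2, a hypothetical name) stands in for P2's pushed work, so the conflict comes from `git merge` rather than `git pull`. The resolution steps are the same.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
git init --quiet .
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

printf '# README\nSome text here.\n' > README.md
git add README.md
git commit --quiet -m "Initial commit"
main_branch=$(git symbolic-ref --short HEAD)

# A branch standing in for P2's work.
git checkout --quiet -b partner2
printf '# README\nText added by Partner 2.\n' > README.md
git commit --quiet -am "P2 edits README"

# Back on the main branch, P1 edits the same line.
git checkout --quiet "$main_branch"
printf '# README\nText added by Partner 1.\n' > README.md
git commit --quiet -am "P1 edits README"

git merge --quiet partner2 || true        # stops with a conflict in README.md

# Resolve: write the text you want to keep, leaving no conflict markers.
printf '# README\nText agreed by both partners.\n' > README.md
git add README.md                         # staging marks the conflict as resolved
git commit --quiet -m "Resolve README conflict"

remaining=$(git diff --name-only --diff-filter=U)
echo "Unresolved files: ${remaining:-none}"
```

Staging the edited file is what tells Git the conflict is settled; the commit that follows records the merge with both histories as parents.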

Caveats

  • P1 gets to decide what to keep because they fixed the merge conflict.
  • On the other hand, the full commit history is preserved, so P2 can always recover their changes if desired.
  • A more elegant and democratic solution to merge conflicts (and repo changes in general) is provided by Git branches. We'll get there next.
20 / 36

Aside: Line endings and different OSs

Problem

During your collaboration, you may have encountered a situation where Git is highlighting differences on seemingly unchanged sentences.

  • If that is the case, check whether your partner is using a different OS to you.

The "culprit" is the fact that Git evaluates an invisible character at the end of every line. This is how Git tracks changes. (More info here and here.)

  • For Linux and macOS, that ending is "LF"
  • For Windows, that ending is "CRLF" (of course it is...)

Solution

Open up the shell and enter

$ git config --global core.autocrlf input

(Windows users: Change input to true).
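
To check what the setting does, you can try it in a throwaway repo. This sketch sets the option for that one repo only (no `--global`), so it leaves your real configuration alone; the file name is hypothetical.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
git init --quiet .

# Same setting as above, but scoped to this one repo.
git config core.autocrlf input
setting=$(git config --get core.autocrlf)
echo "core.autocrlf = $setting"

# A file with Windows-style (CRLF) line endings in the working tree.
printf 'line one\r\nline two\r\n' > windows.txt
git add windows.txt 2>/dev/null           # silence the CRLF-to-LF warning

# Compare line endings in the index (i/) and the working tree (w/).
eol=$(git ls-files --eol windows.txt)
echo "$eol"
```

With `input`, Git should store the file with LF endings in the repo while leaving the copy on disk untouched, so collaborators on different systems see the same committed content.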

21 / 36

Branches and forking


22 / 36

What are branches and why use them?

Branches are one of Git's coolest features.

  • Allow you to take a snapshot of your existing repo and try out a whole new idea without affecting your main (i.e. "master") branch.
  • Only once you (and your collaborators) are 100% satisfied, would you merge it back into the master branch.¹
    • This is how most new features in modern software and apps are developed.
    • It is also how bugs are caught and fixed.
    • But researchers can easily (and should!) use it to try out new ideas and analysis (e.g. robustness checks, revisions, etc.)
  • If you aren't happy, then you can just delete the experimental branch and continue as if nothing happened.

¹ You can actually have branches of branches (of branches). But let's not get ahead of ourselves.

23 / 36

Branch shell commands

Create a new branch on your local machine and switch to it:

$ git checkout -b NAME-OF-YOUR-NEW-BRANCH

Push the new branch to GitHub:

$ git push origin NAME-OF-YOUR-NEW-BRANCH

List all branches on your local machine:

$ git branch

Switch back to (e.g.) the master branch:

$ git checkout master

Delete a branch (first locally, then on GitHub):

$ git branch -d NAME-OF-YOUR-FAILED-BRANCH
$ git push origin :NAME-OF-YOUR-FAILED-BRANCH
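
Here is a sandboxed run through the branch commands above. It assumes a local bare repo (hub.git) in place of GitHub; the branch name new-idea and the identity are hypothetical. The default branch name is read from Git rather than assumed, since newer setups often use "main" instead of "master".

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
git init --quiet --bare hub.git              # stand-in for the GitHub repo
git clone --quiet hub.git work 2>/dev/null
cd work
git config user.name "Demo User"             # hypothetical identity
git config user.email "demo@example.com"
echo "# test" > README.md
git add README.md
git commit --quiet -m "Initial commit"
main_branch=$(git symbolic-ref --short HEAD)
git push --quiet -u origin "$main_branch"

git checkout --quiet -b new-idea             # create the branch and switch to it
git push --quiet origin new-idea             # publish it to the remote
git branch                                   # the current branch is marked with *

git checkout --quiet "$main_branch"          # switch back
git branch -d new-idea                       # delete the local branch
git push --quiet origin --delete new-idea    # same effect as `git push origin :new-idea`

local_left=$(git branch --list new-idea)
remote_left=$(git ls-remote --heads origin new-idea)
echo "local: ${local_left:-gone}, remote: ${remote_left:-gone}"
```

On Git 2.23 or later, `git switch -c new-idea` and `git switch BRANCH` are alternatives to the two `git checkout` forms used here.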
24 / 36

Merging branches + Pull requests

You have two options:

1. Locally

  • Commit your final changes to the new branch (say we call it "new-idea").
  • Switch back to the master branch: $ git checkout master
  • Merge in the new-idea branch changes: $ git merge new-idea
  • Delete the new-idea branch (optional): $ git branch -d new-idea

2. Remotely (i.e. pull requests on GitHub)

  • PRs are a way to notify collaborators — or yourself! — that you have completed a feature.
  • You write a summary of all the changes contained in the branch.
  • You then assign suggested reviewers of your code — including yourself potentially — who are then able to approve these changes ("Merge pull request") on GitHub.
  • Let's practice this now in class...
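
The local option can be sketched as follows, in a throwaway repo so that it does not interfere with your class repo. File names and identity are hypothetical, and the default branch name is read from Git instead of assuming "master".

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
git init --quiet .
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"
echo "# test" > README.md
git add README.md
git commit --quiet -m "Initial commit"
main_branch=$(git symbolic-ref --short HEAD)

# Work on the experimental branch.
git checkout --quiet -b new-idea
echo "robustness check" > analysis.txt
git add analysis.txt
git commit --quiet -m "Try a new idea"

# Merge it back and tidy up.
git checkout --quiet "$main_branch"
git merge --quiet new-idea                # fast-forward: the main branch had no new commits
git branch -d new-idea                    # optional clean-up

merged_file=$(git ls-files analysis.txt)
echo "Now on $main_branch: $merged_file"
```

Because the main branch did not change in the meantime, Git simply moves it forward to the new commit. If both branches had new commits, the same command would create a merge commit (or stop on a conflict).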
25 / 36

Your first pull request

You know that "new-idea" branch we just created a few slides back? Switch over to it if you haven't already.

  • Remember: $ git checkout new-idea (or just click on the branches tab in RStudio)

Make some local changes and then commit + push them to GitHub.

  • The changes themselves don't really matter. Add text to the README, add some new files, whatever.

After pushing these changes, head over to your repo on GitHub.

  • You should see a new green button with "Compare & pull request". Click it.
  • Add a meta description of what this PR accomplishes. You can also change the title if you want.
  • Click "Create pull request".
  • (Here's where you or your collaborators would review all the changes.)
  • Once satisfied, click "Merge pull request" and then confirm.
26 / 36

See instructions here.

Forks

Git forks lie somewhere between cloning a repo and branching from it.

  • In fact, if you fork a repo then you are really creating a copy of it.

Forking a repo on GitHub is very simple; just click the "Fork" button in the top-right corner of said repo.

  • This will create an independent copy of the repo under your GitHub account.
  • Try this now. Use one of my repos if you can't think of anyone else's.

Once you fork a repo, you are free to do anything you want to it. (It's yours.) However, forking — in combination with pull requests — is actually how much of the world's software is developed. For example:

  • Outside user B forks A's repo. She adds a new feature (or fixes a bug she's identified) and then issues an upstream pull request.
  • A is notified and can then decide whether to merge B's contribution with the main project.
27 / 36

Forks (cont.)

Creating forks is super easy, as we've just seen. However, maintaining them involves some more legwork if you want to stay up to date with the original repo.

  • GitHub: "Syncing a fork"
  • On the other hand, this isn't going to be an issue for completed projects. E.g. Forking the repo that contains the code and data of a published paper.
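
One common way to keep a fork current is to add the original repo as a second remote and merge from it. The sketch below simulates this locally, assuming hypothetical stand-ins: original-src for A's repo, fork.git for B's fork on GitHub, and "upstream" as the (conventional, but arbitrary) name of the extra remote.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"

# A's original repo.
git init --quiet original-src
cd original-src
git config user.name "User A"
git config user.email "a@example.com"
echo "# project" > README.md
git add README.md
git commit --quiet -m "Initial commit"
branch=$(git symbolic-ref --short HEAD)
cd ..

# B "forks" it (a bare copy stands in for the fork on GitHub) and clones the fork.
git clone --quiet --bare original-src fork.git
git clone --quiet fork.git local
cd local
git remote add upstream "$demo/original-src"   # point back at the original

# Meanwhile, A keeps working on the original.
cd ../original-src
echo "new feature" >> README.md
git commit --quiet -am "A adds a feature"
cd ../local

# Sync: fetch from the original, merge, then update the fork.
git fetch --quiet upstream
git merge --quiet "upstream/$branch"
git push --quiet origin "$branch"

git log --oneline
```

After these three commands, the local clone, the fork and the original should all point at the same latest commit.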
28 / 36

Other tips


29 / 36

README

README files are special in GitHub because they act as repo landing pages.

  • This is where you should be explicit about the goal of the data science project.

README files can also be added to the sub-directories of a repo, where they will act as landing pages too.

30 / 36

.gitignore

A .gitignore file tells Git what to — wait for it — ignore.

This is especially useful if you want to exclude whole folders or a class of files (e.g. based on size or type).

  • Proprietary data files should be ignored from the beginning if you intend to make a repo public at some point.
  • Very large individual files (>100 MB) exceed GitHub's maximum allowable size and should be ignored regardless. See here and here.

  • Reduces redundant version control history, where the main thing is the code that produces the compiled dataset, not the end CSV in and of itself. ("Source is real.")

31 / 36

.gitignore (cont.)

You can create a .gitignore file in multiple ways.

  • A .gitignore file was automatically generated if you cloned your repo with an RStudio Project.
  • You could also have the option of adding one when you first create a repo on GitHub.
  • Or, you can create one with your preferred text editor. (Must be saved as ".gitignore".)

Once the .gitignore file is created, simply add in lines of text corresponding to the files that should be ignored.

  • To ignore a single file: FILE-I-WANT-TO-IGNORE.csv
  • To ignore a whole folder (and all of its contents, subfolders, etc.): FOLDER-NAME/**
  • The standard shell wildcard (glob) patterns and special characters apply.
    • E.g. Ignore all CSV files in the repo: *.csv
    • E.g. Ignore all files beginning with "test": test*
    • E.g. Don't ignore a particular file: !somefile.txt
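
The patterns above can be checked in a sandbox before you rely on them. This is a small sketch with made-up file and folder names; it writes a .gitignore and then asks Git which untracked files it still sees and which it ignores.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
git init --quiet .

cat > .gitignore <<'EOF'
# Ignore every CSV file in the repo...
*.csv
# ...except this one
!keep-me.csv
# Ignore a whole folder and its contents
raw-data/**
# Ignore files whose names begin with "test"
test*
EOF

mkdir raw-data
touch results.csv keep-me.csv raw-data/big.dat test-run.log analysis.R

# Untracked files that Git would still offer to stage.
visible=$(git ls-files --others --exclude-standard)
# Untracked files that match an ignore rule.
ignored=$(git ls-files --others --ignored --exclude-standard)

printf 'visible:\n%s\n\nignored:\n%s\n' "$visible" "$ignored"
```

Note that the order of rules matters: the `!keep-me.csv` exception only works because it comes after the `*.csv` rule it overrides.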
32 / 36

GitHub Issues

GitHub Issues are another great way to interact with your collaborators and/or package maintainers.

33 / 36

Summary


34 / 36

Recipe

  1. Create a repo on GitHub and initialize with a README.

  2. Clone the repo to your local machine. Preferably using an RStudio Project, but as you wish. (E.g. Shell command: $ git clone REPOSITORY-URL)

  3. Stage any changes you make: $ git add -A or $ git add .

  4. Commit your changes: $ git commit -m "Helpful message"

  5. Pull from GitHub: $ git pull --rebase or $ git pull

  6. (Fix any merge conflicts.)

  7. Push your changes to GitHub: $ git push -u origin BRANCH-NAME

Repeat steps 3 to 7 (but especially steps 3 and 4) often.

Check out Git Guides.

35 / 36
