Git is a distributed version control system. (Wait, what?)
Okay, try this: Imagine if Dropbox and the "Track changes" feature in MS Word had a baby. Git would be that baby.
In fact, it's even better than that because Git is optimized for the things that economists and data scientists spend a lot of time working on (e.g. code).
There is a learning curve, but I promise you it's worth it.
It's important to realize that Git and GitHub are distinct things.
GitHub is an online hosting platform that provides an array of services built on top of the Git system. (Similar platforms include Bitbucket and GitLab.)
Just like we don't need RStudio to run R code, we don't need GitHub to use Git... But it will make our lives so much easier.
Git and GitHub's role in global software development is not in question.
Data scientists and academic researchers are relying on it too.
Benefits of version control and collaboration tools aside, Git(Hub) helps to operationalize the ideals of open science and reproducibility.
Data science teams have increasingly strict requirements regarding reproducibility and data access. GitHub makes this easy.
One of the (many) great features of RStudio is how well it integrates version control into your everyday workflow.
Even though Git is a completely separate program to R, they feel like part of the same "thing" in RStudio.
This next section is about learning the basic Git(Hub) commands and the recipe for successful project integration with RStudio.
The starting point for our workflow is to link a GitHub repository (i.e. "repo") to an RStudio Project. Here are the steps we're going to follow:
1 It's easiest to start with HTTPS, but SSH is advised for more advanced users.
Now, I want you to practice these steps by creating your own repo on GitHub (call it "test") and cloning it via an RStudio Project.
Now that you've cloned your first repo and made some local changes, it's time to learn the four main Git operations.
For the moment, it will be useful to group the first two operations and last two operations together. (They are often combined in practice too, although you'll soon get a sense of when and why they should be split up.)
Creating the repo on GitHub first means that it will always be "upstream" of your (and any other) local copies.
RStudio Projects are great.
The GitHub + RStudio Project combo is ideal for new users.
However, I want to go over Git shell commands so that you can internalise the basics.
Clone a repo.
$ git clone REPOSITORY-URL
See the commit history (hit spacebar to scroll down or q to exit).
$ git log
What has changed?
$ git status
Stage ("add") a file or group of files.
$ git add NAME-OF-FILE-OR-FOLDER
You can use wildcard characters to stage a group of files (e.g. sharing a common prefix). There are a bunch of useful flag options too:
$ git add -A
$ git add -u
$ git add .
Commit your changes.
$ git commit -m "Helpful message"
Pull from the upstream repository (i.e. GitHub).
$ git pull
Push any local changes that you've committed to the upstream repo (i.e. GitHub).
$ git push
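Of the commands above, the staging flags are the easiest to mix up, so here is a minimal sketch you can run in a throwaway directory. Everything in it is an assumption made for illustration: the temporary folder, the file names (notes.txt, script.R) and the dummy commit identity are hypothetical, and no GitHub remote is involved.

```shell
set -eu

# Throwaway repo in a temporary directory (hypothetical paths and names).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Test User"
git config user.email "test@example.com"

# One tracked file, committed once.
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# Modify the tracked file and create a brand-new, untracked one.
echo "second draft" >> notes.txt
echo "x <- 1" > script.R

# -u stages changes to files Git already tracks, so script.R stays untracked.
git add -u
after_u=$(git status --short)
echo "$after_u"

# -A stages everything, including the new file.
git add -A
after_A=$(git status --short)
echo "$after_A"

git commit -q -m "Update notes and add script"
git log --oneline
```

Running git status --short between the two git add calls is the quickest way to see what each flag actually picked up.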
Turn to the person next to you. You are now partners. (Congratulations.)
P1: Invite P2 to join you as a collaborator on the "test" GitHub repo that you created earlier. (See the Settings tab of your repo.)
P2: Clone P1's repo to your local machine.1 Make some edits to the README (e.g. delete lines of text and add your own). Stage, commit and push these changes.
P1: Make your own changes to the README on your local machine. Stage, commit and then try to push them (after pulling from the GitHub repo first).
1 Change into a new directory first or give it a different name to avoid conflicts with your own "test" repo. Don't worry, Git tracking will still work if you change the repo name locally.
Did P1 encounter a merge conflict error?
Let's confirm what's going on.
$ git status
As part of the response, you should see something like:
Unmerged paths:
  (use "git add <file>..." to mark resolution)
    both modified:   README.md
Git is protecting P1 by refusing the merge. It wants to make sure that you don't accidentally overwrite all of your changes by pulling P2's version of the README.
git status can provide a helpful summary to see which files are in conflict.
Okay, let's see what's happening here by opening up the README file in RStudio.
You should see something like:
# README
Some text here.
<<<<<<< HEAD
Text added by Partner 2.
=======
Text added by Partner 1.
>>>>>>> 814e09178910383c128045ce67a58c9c1df3f558.
More text here.
What do these symbols mean?
<<<<<<< HEAD: Indicates the start of the merge conflict.
=======: Indicates the break point used for comparison.
>>>>>>> <long string>: Indicates the end of the lines that had a merge conflict.
Fixing these conflicts is a simple matter of (manually) editing the README file.
Once that's done, you should be able to stage, commit, pull and finally push your changes to the GitHub repo without any errors.
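If you'd like to replay the whole exercise on your own, the sketch below reproduces a conflict and then resolves it. It rests on several assumptions that are not part of the exercise itself: a local bare repo (hub.git) stands in for GitHub, the two partners are just two folders (p1 and p2) on one machine, the identities are dummies, and the "manual edit" is done with printf rather than in RStudio.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# A bare repo stands in for GitHub (hypothetical local setup).
git init -q --bare hub.git

# Helper: give a clone a dummy commit identity.
set_identity() {
  git -C "$1" config user.name "$2"
  git -C "$1" config user.email "$2@example.com"
}

# P1 creates the first commit and pushes it.
git clone -q hub.git p1
set_identity p1 partner1
printf '# README\nSome text here.\n' > p1/README.md
git -C p1 add README.md
git -C p1 commit -q -m "Initial README"
git -C p1 push -q origin HEAD
branch=$(git -C p1 symbolic-ref --short HEAD)

# P2 clones, edits the second line, and pushes first.
git clone -q hub.git p2
set_identity p2 partner2
printf '# README\nText added by Partner 2.\n' > p2/README.md
git -C p2 commit -q -am "P2 edit"
git -C p2 push -q

# P1 edits the same line, commits, then pulls: Git stops with a conflict.
printf '# README\nText added by Partner 1.\n' > p1/README.md
git -C p1 commit -q -am "P1 edit"
git -C p1 pull -q --no-rebase origin "$branch" || echo "Merge conflict, as expected."

# "UU" in the short status means both sides modified the file.
conflict_state=$(git -C p1 status --short)
echo "$conflict_state"
cat p1/README.md

# Resolve by rewriting the file without the markers (here: keep both lines).
printf '# README\nText added by Partner 2.\nText added by Partner 1.\n' > p1/README.md
git -C p1 add README.md
git -C p1 commit -q -m "Resolve merge conflict in README"
git -C p1 push -q origin "$branch"
```

The cat call prints the README with the conflict markers still in it, so you can compare it with the example above before the markers are edited away.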
Caveats
During your collaboration, you may have encountered a situation where Git is highlighting differences on seemingly unchanged sentences.
The "culprit" is the fact that Git evaluates an invisible character at the end of every line. This is how Git tracks changes. (More info here and here.)
Open up the shell and enter
$ git config --global core.autocrlf input
(Windows users: Change input to true.)
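To make the "invisible character" a bit less invisible, here is a small sketch that writes the same sentence twice, once with a Unix-style line ending (LF) and once with a Windows-style one (CRLF). The file names are hypothetical and Git itself isn't involved; the point is only that the two files differ even though they look identical in an editor.

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"

# Same visible text, different line endings.
printf 'Some text here.\n'   > unix.txt      # LF
printf 'Some text here.\r\n' > windows.txt   # CRLF

# Count bytes (tr strips the padding that some versions of wc add).
unix_bytes=$(wc -c < unix.txt | tr -d ' ')
windows_bytes=$(wc -c < windows.txt | tr -d ' ')
echo "unix.txt: $unix_bytes bytes, windows.txt: $windows_bytes bytes"

# cmp -s is silent and only reports through its exit status.
if cmp -s unix.txt windows.txt; then same=yes; else same=no; fi
echo "identical: $same"
```

The extra byte in windows.txt is the carriage return, which is why a line can show up as changed when nobody touched its text.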
Branches are one of Git's coolest features.
1 You can actually have branches of branches (of branches). But let's not get ahead of ourselves.
Create a new branch on your local machine and switch to it:
$ git checkout -b NAME-OF-YOUR-NEW-BRANCH
Push the new branch to GitHub:
$ git push origin NAME-OF-YOUR-NEW-BRANCH
List all branches on your local machine:
$ git branch
Switch back to (e.g.) the master branch:
$ git checkout master
Delete a branch
$ git branch -d NAME-OF-YOUR-FAILED-BRANCH
$ git push origin :NAME-OF-YOUR-FAILED-BRANCH
You have two options for merging a branch. One is to merge it locally from the shell:
$ git checkout master
$ git merge new-idea
$ git branch -d new-idea
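Here is the same branch-then-merge sequence as a self-contained sketch you can run without touching a real project. The temporary repo, the analysis.R file and the dummy identity are all hypothetical. One more assumption: instead of hard-coding "master", the script records whatever the default branch is called on your machine, since newer Git setups often use "main".

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Test User"
git config user.email "test@example.com"

# A first commit on the default branch.
echo "stable analysis" > analysis.R
git add analysis.R
git commit -q -m "Initial commit"
main_branch=$(git symbolic-ref --short HEAD)

# Create the new branch, switch to it, and commit an experiment there.
git checkout -q -b new-idea
echo "experimental tweak" >> analysis.R
git commit -q -am "Try a new idea"

# Switch back, merge the experiment in, then delete the branch.
git checkout -q "$main_branch"
git merge -q new-idea
git branch -d new-idea

# Only the default branch should be left.
git branch
```

Because nothing changed on the default branch in the meantime, the merge here is a simple fast-forward; a branch that has diverged can raise merge conflicts just like a pull can.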
You know that "new-idea" branch we just created a few slides back? Switch over to it if you haven't already.
$ git checkout new-idea
(or just click on the branches tab in RStudio)
Make some local changes and then commit + push them to GitHub.
After pushing these changes, head over to your repo on GitHub.
See instructions here.
Git forks lie somewhere between cloning a repo and branching from it.
Forking a repo on GitHub is very simple; just click the "Fork" button in the top-right corner of said repo.
Once you fork a repo, you are free to do anything you want to it. (It's yours.) However, forking — in combination with pull requests — is actually how much of the world's software is developed. For example:
Creating forks is super easy as we've just seen. However, maintaining them involves some more leg work if you want to stay up to date with the original repo.
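One common way to do that leg work is to add the original repo as a second remote and merge from it. The sketch below is a local simulation under loud assumptions: original.git and fork.git are bare repos on your own disk standing in for the two GitHub repos, "upstream" is just a conventional remote name rather than anything Git requires, and the identities and file contents are dummies.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# The original project, with one commit pushed by its maintainer.
git init -q --bare original.git
git clone -q original.git maintainer
git -C maintainer config user.name "Maintainer"
git -C maintainer config user.email "maintainer@example.com"
echo "v1" > maintainer/README.md
git -C maintainer add README.md
git -C maintainer commit -q -m "First release"
git -C maintainer push -q origin HEAD
branch=$(git -C maintainer symbolic-ref --short HEAD)

# "Forking" is a server-side copy; cloning the fork gives you a working copy.
git clone -q --bare original.git fork.git
git clone -q fork.git myfork

# The original moves on after you forked.
echo "v2" >> maintainer/README.md
git -C maintainer commit -q -am "Second release"
git -C maintainer push -q origin HEAD

# Catch up: add the original as a remote, fetch it, merge, and update the fork.
git -C myfork remote add upstream "$work/original.git"
git -C myfork fetch -q upstream
git -C myfork merge -q "upstream/$branch"
git -C myfork push -q origin "$branch"

git -C myfork log --oneline
```

With a real fork, the path in git remote add would be replaced by the URL of the original GitHub repo.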
README files are special in GitHub because they act as repo landing pages.
README files can also be added to the sub-directories of a repo, where they will act as landing pages too.
A .gitignore file tells Git what to — wait for it — ignore.
This is especially useful if you want to exclude whole folders or a class of files (e.g. based on size or type).
Very large individual files (>100 MB) exceed GitHub's maximum allowable size and should be ignored regardless. See here and here.
Ignoring such files also reduces redundant version control history, where the main thing is the code that produces the compiled dataset, not the end CSV in and of itself. ("Source is real.")
You can create a .gitignore file in multiple ways.
Once the .gitignore file is created, simply add in lines of text corresponding to the files that should be ignored.
FILE-I-WANT-TO-IGNORE.csv
FOLDER-NAME/**
*.csv
test*
!somefile.txt
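To check that a set of patterns does what you expect, you can ask Git which untracked files it still "sees". The sketch below does this in a throwaway repo. The file names are hypothetical, and the patterns are a small variation on the ones above: ignore all CSVs, ignore anything starting with "test", but make an exception for important.csv.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q

# One pattern per line; a leading "!" re-includes a file matched earlier.
printf '%s\n' '*.csv' 'test*' '!important.csv' > .gitignore

# Some empty files to test the patterns against.
touch raw.csv important.csv test-scratch.R analysis.R

# List untracked files that are NOT ignored.
visible=$(git ls-files --others --exclude-standard)
echo "$visible"
```

Here raw.csv and test-scratch.R should be missing from the list, while important.csv survives thanks to the "!" exception.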
GitHub Issues are another great way to interact with your collaborators and/or package maintainers.
1. Create a repo on GitHub and initialize with a README.
2. Clone the repo to your local machine. Preferably using an RStudio Project, but as you wish. (E.g. Shell command: $ git clone REPOSITORY-URL)
3. Stage any changes you make: $ git add -A or $ git add .
4. Commit your changes: $ git commit -m "Helpful message"
5. Pull from GitHub: $ git pull --rebase or $ git pull
6. (Fix any merge conflicts.)
7. Push your changes to GitHub: $ git push -u origin BRANCH-NAME
Repeat steps 3 to 7 (but especially steps 3 and 4) often.
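The recipe can be rehearsed end to end without a GitHub account. The sketch below makes some assumptions to get there: a local bare repo (hub.git) plays the GitHub repo, a helper clone called seed creates the initial README that GitHub would normally create for you, and the identities are dummies. The comments map each command back to the step numbers in the recipe.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Step 1 (stand-in): a bare repo with an initial README plays the GitHub repo.
git init -q --bare hub.git
git clone -q hub.git seed
git -C seed config user.name "Seed"
git -C seed config user.email "seed@example.com"
echo "# test" > seed/README.md
git -C seed add README.md
git -C seed commit -q -m "Initial commit"
git -C seed push -q origin HEAD

# Step 2: clone the repo to your "local machine".
git clone -q hub.git test
cd test
git config user.name "Test User"
git config user.email "test@example.com"
branch=$(git symbolic-ref --short HEAD)

# Steps 3 and 4: make a change, stage it, commit it.
echo "A new line." >> README.md
git add -A
git commit -q -m "Helpful message"

# Step 5: pull first (nothing new upstream here, so step 6 is a no-op).
git pull -q --rebase

# Step 7: push, setting the upstream branch explicitly.
git push -q -u origin "$branch"

git status --short --branch
```

After the push, the final status line should show the local branch alongside its origin counterpart with nothing ahead or behind.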
Check out Git Guides.