class: center, middle, inverse, title-slide

# Intro to R
## CodeOn! Bootcamp Session #2
### Kevin Stachelek
### 2019-08-06

---

# Learning Objectives

1. Understand the components of the [RStudio IDE](#rstudio_ide)
2. Type commands into the [console](#console)
3. Understand [function syntax](#function_syx)
4. Install a [package](#install-package)
5. [Organise a project](#projects)
6. Appropriately [structure an R script or R Markdown file](#structure)
7. Create and compile an [R Markdown document](#rmarkdown)

---

# Resources

* [Chapter 1: Introduction](http://r4ds.had.co.nz/introduction.html) in *R for Data Science*
* [RStudio IDE Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf)
* [Introduction to R Markdown](https://rmarkdown.rstudio.com/lesson-1.html)
* [R Markdown Cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf)
* [R Markdown Reference](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf)

---

# What is R?

<img src="images/01/new_R_logo.png" width="10%" />

R is a programming environment for data processing and statistical analysis. It is well suited to **reproducible research**.

--

R allows you to write scripts that combine data files, clean data, and run analyses. There are many other ways to do this, but R has certain important advantages:

--

1. R is free
2. It has powerful plotting and graphics capabilities
3. It is well suited to interactive, trial-and-error data analysis
4. It has an active and vibrant community, especially in the life sciences
5. It is welcoming to newcomers and to underrepresented groups commonly marginalized in coding generally

???

Reproducible research refers to being able to document and reproduce all of the steps between raw data and results.
---

# Why Use R

<img src="images/memes/why_use_R.png" width="50%" style="display: block; margin: auto;" />

---

# The Base R Console

If you open up the application called R, you will see an "R Console" window that looks something like this.

<div class="figure">
<img src="images/01/r_console.png" alt="The R Console window." width="1968" />
<p class="caption">The R Console window.</p>
</div>

You can close R and never open it again. We'll be working entirely in RStudio in this class.

<div class="warning">
<p>ALWAYS REMEMBER: Launch R through the RStudio IDE</p>
<p>Launch <img src="images/01/rstudio_icon.png" style="height: 2em; vertical-align: middle;" alt="RStudio.app"> (RStudio.app), not <img src="images/01/new_R_logo.png" style="height: 2em; vertical-align: middle;" alt="R.app"> (R.app).</p>
</div>

---

# RStudio

[RStudio](http://www.rstudio.com) is an Integrated Development Environment (IDE): a program that serves as a text editor and file manager, and provides many functions to help you read and write R code.

<div class="figure">
<img src="images/01/rstudio.png" alt="The RStudio IDE" width="70%" />
<p class="caption">The RStudio IDE</p>
</div>

---

# RStudio

RStudio is arranged into four window <a class='glossary' target='_blank' title='' href='https://psyteachr.github.io/glossary/p#panes'>panes</a>.

By default, the upper left pane is the **source pane**, where you view and edit source code from files.

The bottom left pane is usually the **console pane**, where you can type in commands and view output messages.

The right panes have several different tabs that show you information about your code.

You can change the location of panes and which tabs are shown under **`Preferences > Pane Layout`**.

---

# Configure RStudio

You should learn how to develop **reproducible scripts**.
This means scripts that completely and transparently perform some analysis from start to finish in a way that yields the same result for different people using the same software on different computers.

--

<img src="images/memes/forgetting.jpg" width="50%" style="display: block; margin: auto;" />

---

# Reproducibility

When you do things reproducibly, others can understand and check your work.

+ The most important person who will benefit from a reproducible script is your __future self__.

--

Two RStudio tweaks to maximize reproducibility:

1. Go to the preferences/settings menu, and uncheck the box that says **`Restore .RData into workspace at startup`**
1. Set **`Save workspace to .RData on exit`** to **`Never`**

<div class="figure">
<img src="images/01/repro.png" alt="Alter these settings for increased reproducibility." width="50%" />
<p class="caption">Alter these settings for increased reproducibility.</p>
</div>

???

If you keep things around in your workspace, things will get messy, and unexpected things will happen. You should always start with a clear workspace. This also means that you never want to save your workspace when you exit, so set this to **`Never`**. The only things you want to save are your scripts.

---

# Getting Started

### Console commands

You can consider the console a kind of **sandbox** where you can try out lines of code and adapt them.

--

You can type into the script editor window (either into an R script or an R Markdown file), then send the commands to the console by placing the cursor on the line and holding down the Ctrl key (Cmd on macOS) while you press Enter.

--

The Ctrl+Enter key sequence sends the command in the script to the console.

--

<img src="images/memes/typos.jpg" width="40%" style="display: block; margin: auto;" />

---

One simple way to learn about the R console is to use it as a calculator. Enter the lines of code below and see if your results match. Be prepared to make lots of typos (at first).
```r 1 + 1 ``` ``` ## [1] 2 ``` --- The R console remembers a history of the commands you typed in the past. Use the up and down arrow keys on your keyboard to scroll backwards and forwards through your history. It's a lot faster than re-typing. ```r 1 + 1 + 3 ``` ``` ## [1] 5 ``` You can break up math expressions over multiple lines; R waits for a complete expression before processing it. ```r # here comes a long expression # let's break it over multiple lines 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 ``` ``` ## [1] 55 ``` --- Text inside quotes is called a <a class='glossary' target='_blank' title='' href='https://psyteachr.github.io/glossary/s#string'>string</a>. ```r "Good afternoon" ``` ``` ## [1] "Good afternoon" ``` --- You can break up text over multiple lines; R waits for a close quote before processing it. If you want to include a double quote inside this quoted string, <a class='glossary' target='_blank' title='' href='https://psyteachr.github.io/glossary/e#escape'>escape</a> it with a backslash. ```r africa <- "I hear the drums echoing tonight But she hears only whispers of some quiet conversation She's coming in, 12:30 flight The moonlit wings reflect the stars that guide me towards salvation I stopped an old man along the way Hoping to find some old forgotten words or ancient melodies He turned to me as if to say, \"Hurry boy, it's waiting there for you\" - Toto" cat(africa) # cat() prints the string ``` ``` ## I hear the drums echoing tonight ## But she hears only whispers of some quiet conversation ## She's coming in, 12:30 flight ## The moonlit wings reflect the stars that guide me towards salvation ## I stopped an old man along the way ## Hoping to find some old forgotten words or ancient melodies ## He turned to me as if to say, "Hurry boy, it's waiting there for you" ## ## - Toto ``` --- # Variables Often you want to store the result of some computation for later use. 
You can store it in a <a class='glossary' target='_blank' title='' href='https://psyteachr.github.io/glossary/v#variable'>variable</a>.

A variable name in R:

* contains only letters, numbers, periods, and underscores
* starts with a letter, or with a full stop followed by a letter
* distinguishes uppercase and lowercase letters (`rickastley` is not the same as `RickAstley`)

--

.pull-left[The following are valid and different variables:

* songdata
* SongData
* song_data
* song.data
* .song.data
* never_gonna_give_you_up_never_gonna_let_you_down
]

.pull-right[The following are not valid variables:

* _song_data
* 1song
* .1song
* song data
* song-data]

---

Use the assignment operator `<-` to assign the value on the right to the variable named on the left.

```r
# use the assignment operator '<-'
# R stores the number in the variable
x <- 5
```

--

Now that we have set `x` to a value, we can do something with it:

```r
x * 2
```

```
## [1] 10
```

```r
# R evaluates the expression and stores the result in the variable
boring_calculation <- 2 + 2
```

---

Note that R doesn't print the result back at you when it's stored. To view the result, just type the variable name on a blank line.

```r
boring_calculation
```

```
## [1] 4
```

--

Once a variable is assigned a value, its value doesn't change unless you reassign the variable, even if the variables you used to calculate it change. Predict what the code below does and test yourself:

```r
this_year <- 2019
my_birth_year <- 1976
my_age <- this_year - my_birth_year
this_year <- 2020
```

???
After all the code above is run:

* `this_year` = <select class='solveme' data-answer='["2020"]'> <option></option> <option>43</option> <option>44</option> <option>1976</option> <option>2019</option> <option>2020</option></select>
* `my_birth_year` = <select class='solveme' data-answer='["1976"]'> <option></option> <option>43</option> <option>44</option> <option>1976</option> <option>2019</option> <option>2020</option></select>
* `my_age` = <select class='solveme' data-answer='["43"]'> <option></option> <option>43</option> <option>44</option> <option>1976</option> <option>2019</option> <option>2020</option></select>

---

# The environment

Anytime you assign something to a new variable, R creates a new object in the <a class='glossary' target='_blank' title='' href='https://psyteachr.github.io/glossary/g#global-environment'>global environment</a>. Objects in the global environment exist until you end your session; then they disappear forever (unless you save them).

--

The **Environment** tab in the upper right pane lists all of the variables you have created. Click the broom icon to clear all of the variables and start fresh. You can also use the following functions in the console to view all variables, remove one variable, or remove all variables.

```r
ls()            # print the variables in the global environment
rm("x")         # remove the variable named x from the global environment
rm(list = ls()) # clear out the global environment
```

---

# Whitespace

When you see `>` at the beginning of a line, that means R is waiting for you to start a new command.

--

However, if you see a `+` instead of `>` at the start of the line, that means R is waiting for you to finish a command you started on a previous line.

--

If you want to cancel whatever command you started, just press the Esc key in the console window and you'll get back to the `>` command prompt.
--

```r
# R waits until next line for evaluation
(3 + 2) *
  5
```

```
## [1] 25
```

---

It is often useful to break up long function calls over several lines.

```r
cat("row, row, row your boat",
    "gently down the stream",
    "merrily, merrily, merrily, merrily",
    "life is but a dream.",
    sep = " \n")
```

```
## row, row, row your boat 
## gently down the stream 
## merrily, merrily, merrily, merrily 
## life is but a dream.
```

---

# Function syntax

A lot of what you do in R involves calling a <a class='glossary' target='_blank' title='' href='https://psyteachr.github.io/glossary/f#function'>function</a> and storing the results. A function is a named section of code that can be reused.

--

For example, `sd` is a function that returns the standard deviation of the <a class='glossary' target='_blank' title='' href='https://psyteachr.github.io/glossary/v#vector'>vector</a> of numbers that you provide as the input <a class='glossary' target='_blank' title='' href='https://psyteachr.github.io/glossary/a#argument'>argument</a>. Functions are set up like this:

`function_name(argument1, argument2 = "value")`.

---

The arguments in parentheses can be named, as in `argument1 = 10`, or you can skip the names if you supply them in exactly the order in which they're defined in the function.

--

You can check this order by typing `?sd` (or whatever function name you're looking up) into the console; the Help pane will show you the default order under **Usage**. You can also skip arguments that have a default value specified.

---

# Functions

Most functions return a value, but may also produce side effects like printing to the console.

--

To illustrate, the function `rnorm()` generates random numbers from a normal distribution (by default, the standard normal). The help page for `rnorm()` (accessed by typing `?rnorm` in the console) shows that it has the syntax

`rnorm(n, mean = 0, sd = 1)`

???

where `n` is the number of randomly generated numbers you want, `mean` is the mean of the distribution, and `sd` is the standard deviation.
The default mean is 0, and the default standard deviation is 1. There is no default for `n`, which means you'll get an error if you don't specify it:

---

# Functions

```r
rnorm()
```

```
## Error in rnorm(): argument "n" is missing, with no default
```

If you want 10 random numbers from a distribution with a mean of 0 and a standard deviation of 1, you can just use the defaults.

```r
rnorm(10)
```

```
##  [1] -0.44279207  0.18092826  0.08268665  1.46985007  0.78244027
##  [6]  1.66898589  1.35925566 -0.47803497 -0.31298952  0.10098632
```

---

# Functions

If you want 10 numbers from a distribution with a mean of 100:

```r
rnorm(10, 100)
```

```
##  [1]  99.82251 100.24522 101.57654 100.14618  99.35093 101.14487  99.96880
##  [8]  99.77554  98.94785 100.77235
```

This would be an equivalent but more verbose way of calling the function:

```r
rnorm(n = 10, mean = 100)
```

```
##  [1]  99.01359  99.51432 100.41640  99.48055 101.07178  99.89537 101.33241
##  [8]  98.76678 100.15448 100.10581
```

---

We don't need to name the arguments because R will recognize that we intended to fill in the first and second arguments by their position in the function call.

However, if we want to change the default for an argument coming later in the list, then we need to name it.

For instance, if we wanted to keep the default `mean = 0` but change the standard deviation to 100, we would do it this way:

```r
rnorm(10, sd = 100)
```

```
##  [1]  -3.541378 146.267044   3.638628   3.624617  87.180718  37.116243
##  [7]  18.868506 -86.068371 -65.151184 -28.849075
```

---

# Functions

Some functions give a list of options after an argument; this means the default value is the first option. The usage entry for the `power.t.test()` function looks like this:

```r
power.t.test(n = NULL, delta = NULL, sd = 1, sig.level = 0.05,
             power = NULL,
             type = c("two.sample", "one.sample", "paired"),
             alternative = c("two.sided", "one.sided"),
             strict = FALSE, tol = .Machine$double.eps^0.25)
```

???

* What is the default value for `sd`?
<select class='solveme' data-answer='["1"]'> <option></option> <option>NULL</option> <option>1</option> <option>0.05</option> <option>two.sample</option></select>
* What is the default value for `type`? <select class='solveme' data-answer='["two.sample"]'> <option></option> <option>NULL</option> <option>two.sample</option> <option>one.sample</option> <option>paired</option></select>
* Which is equivalent to `power.t.test(100, 0.5)`? <select class='solveme' data-answer='["power.t.test(delta = 0.5, n = 100)"]'> <option></option> <option>power.t.test(100, 0.5, sig.level = 1, sd = 0.05)</option> <option>power.t.test()</option> <option>power.t.test(n = 100)</option> <option>power.t.test(delta = 0.5, n = 100)</option></select>

---

# Getting help

Start up help in a browser using the function `help.start()`.

--

If a function is in base R or a loaded package, you can use the `help("function_name")` function or the `?function_name` shortcut to access the help file.

--

If the package isn't loaded, specify the package name as the second argument to the help function.

```r
# these methods are all equivalent ways of getting help
help("rnorm")
?rnorm
help("rnorm", package="stats")
```

--

When the package isn't loaded or you aren't sure what package the function is in, use the shortcut `??function_name`.

???

* What is the first argument to the `mean` function? <select class='solveme' data-answer='["x"]'> <option></option> <option>trim</option> <option>na.rm</option> <option>mean</option> <option>x</option></select>
* What package is `read_excel` in? <select class='solveme' data-answer='["readxl"]'> <option></option> <option>readr</option> <option>readxl</option> <option>base</option> <option>stats</option></select>

---

# Add-on packages

One of the great things about R is that it is **user extensible**: anyone can create a new add-on software package that extends its functionality.
There are currently thousands of add-on packages that R users have created to solve many different kinds of problems or to have fun.

???

There are packages for data visualisation, machine learning, neuroimaging, eyetracking, web scraping, and playing games such as Sudoku.

---

# Packages

Add-on packages are not distributed with base R, but have to be downloaded and installed from an archive, in the same way that you would, for instance, download and install an app on your smartphone.

--

The main repository where packages reside is called CRAN, the Comprehensive R Archive Network. A package has to pass strict tests devised by the R core team to be allowed to be part of the CRAN archive.

--

You can install from the CRAN archive through R using the `install.packages()` function.

--

There is an important distinction between **installing** a package and **loading** a package.

---

# Installing a package

<img src="images/memes/pokemon.gif" width="50%" />

This is done using `install.packages()`.

--

This is like installing an app on your phone: you only have to do it once and the app will remain installed until you remove it.

--

When you install a package, the package will be available (but not *loaded*) every time you open up R.

???

For instance, if you want to use PokemonGo on your phone, you install it once from the App Store or Play Store, and you don't have to re-install it each time you want to use it. Once you launch the app, it will run in the background until you close it or restart your phone.

<div class="warning">
<p>You may only be able to permanently install packages if you are using R on your own system; you may not be able to do this on public workstations if you lack the appropriate privileges.</p>
</div>

---

Install the `fortunes` package on your system:

```r
install.packages("fortunes")
```

If you don't get an error message, the installation was successful.

---

# Loading a package

This is done using `library(packagename)`.
--

This is like **launching** an app on your phone: the functionality is only there while the app is launched and remains there until you close the app or restart.

When you run `library(packagename)` within a session, the package referred to by `packagename` will be available for your R session.

--

The next time you start R, you will need to run the `library()` function again if you want to use it.

---

You can load the functions in `fortunes` for your current R session as follows:

```r
library(fortunes)
```

--

Once you have typed this, you can run the function `fortune()`, which spouts random wisdom from one of the R help lists:

```r
fortune()
```

```
## 
## Hans Ole Orka: I try to reproduce the SAS proc reg stepwise model
## selection procedure in R.
## Brian D. Ripley: But why? If you want 1950s statistical methods why not
## use a 1960s package? There are enough problems with stepwise selection
## (see e.g. the book by Frank Harrell and many postings here) even with a
## well-defined criterion like AIC, but that is better than an ad hoc
## algorithm, especially one based on forwards selection.
##    -- Hans Ole Orka and Brian D. Ripley
##       R-help (September 2007)
```

--

The convention `package::function()` is used to indicate in which add-on package a function resides.

--

For instance, if you see `readr::read_csv()`, that refers to the function `read_csv()` in the `readr` add-on package.

---

# Install from GitHub

Many R packages are not yet on CRAN or Bioconductor because they are still in development.

--

Increasingly, datasets and code for papers are available as packages you can download from GitHub.

--

You'll need to install the devtools package to be able to install packages from GitHub.

--

Check if you have a package installed by trying to load it (e.g., if you don't have devtools installed, `library("devtools")` will display an error message)

--

or by searching for it in the packages tab in the lower right pane.
All listed packages are installed; all checked packages are currently loaded.

<div class="figure">
<img src="images/01/packages.png" alt="Check installed and loaded packages in the packages tab in the lower right pane." width="100%" />
<p class="caption">Check installed and loaded packages in the packages tab in the lower right pane.</p>
</div>

---

# Install a package from GitHub

```r
install.packages("devtools")
devtools::install_github("adam-gruer/goodshirt")
```

After you install the goodshirt package, load it using the `library()` function and display some quotes using the functions below.

```r
library(goodshirt)

# quotes from The Good Place
chidi()
```

```
## 
##  I missed my mom's back surgery because I had already promised my landlord's nephew that I would help him figure out his new phone.
## 
##  ~ Chidi
```

```r
eleanor()
```

```
## 
##  I'll miss you too, you sexy skyscraper.
## 
##  ~ Eleanor
```

???

<div class="try">
<p>How many different ways can you find to discover what functions are available in the goodshirt package?</p>
</div>

---

# Organising a project

Projects in RStudio are a way to group all of the files you need for one project. Most projects include scripts, data files, and output files like the PDF version of the script or images.

<div class="try">
<p>Make a new directory where you will keep all of your materials for this class. If you’re using a lab computer, make sure you make this directory in your network drive so you can access it from other computers.</p>
<p>Choose <strong><code>New Project...</code></strong> under the <strong><code>File</code></strong> menu to create a new project called <code>01-intro</code> in this directory.</p>
</div>

---

# An Example Script

Here is what an R script looks like. Don't worry about the details for now.
```r
# load add-on packages
library(tidyverse)

# set variables ----
n <- 100

# simulate data ----
data <- data.frame(
  id = 1:n,
  dv = c(rnorm(n/2, 0), rnorm(n/2, 1)),
  condition = rep(c("A", "B"), each = n/2)
)

# plot data ----
ggplot(data, aes(condition, dv)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.25,
               aes(fill = condition),
               show.legend = FALSE)

# save plot ----
ggsave("sim_data.png", width = 8, height = 6)
```

---

# Script Structure

It's best to follow this structure when developing your own scripts:

1. load in any add-on packages you need to use
1. define any custom functions
1. load or simulate the data you will be working with
1. work with the data
1. save anything you need to save

???

Often when you are working on a script, you will realize that you need to load another add-on package. Don't bury the call to `library(package_I_need)` way down in the script. Put it at the top, so the user has an overview of what packages are needed.

---

# Add Comments to your Script (The Why)

You can add comments to an R script with the hash symbol (`#`).

The R interpreter will ignore characters from the hash to the end of the line.

```r
# comments: any text from '#' on is ignored until end of line
22 / 7  # approximation to pi
```

```
## [1] 3.142857
```

---

# Reproducible reports with R Markdown

We will make reproducible reports following the principles of [literate programming](https://en.wikipedia.org/wiki/Literate_programming).

--

We have the text of the report together with the code needed to perform all analyses and generate the tables.

--

The report is "compiled" from the original format into some other, more portable format, such as HTML or PDF.

???

This is different from traditional cutting and pasting approaches where, for instance, you create a graph in Microsoft Excel or a statistics program like SPSS and then paste it into Microsoft Word.
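
---

The five-step layout from the Script Structure slide can be sketched with base R only. This is a minimal illustration, not a template to copy verbatim: the helper `sem()`, the object names, and the output file name are made up for this example, and the file is written to a temporary folder so that running the sketch does not clutter your project.

```r
# 1. load add-on packages ----
# 'stats' ships with R and is attached by default; the call only marks
# where your library() calls belong
library(stats)

# 2. define custom functions ----
# hypothetical helper: standard error of the mean
sem <- function(x) {
  sd(x) / sqrt(length(x))
}

# 3. load or simulate data ----
set.seed(8675309)  # fix the random seed so the simulation is repeatable
n <- 100
dat <- data.frame(
  id = 1:n,
  dv = c(rnorm(n / 2, 0), rnorm(n / 2, 1)),
  condition = rep(c("A", "B"), each = n / 2)
)

# 4. work with the data ----
# mean of dv per condition, plus the standard error from our helper
summary_table <- aggregate(dv ~ condition, data = dat, FUN = mean)
summary_table$sem <- as.numeric(tapply(dat$dv, dat$condition, sem))

# 5. save anything you need to save ----
# tempdir() is used for illustration; a real script would save into the project
out_file <- file.path(tempdir(), "sim_summary.csv")
write.csv(summary_table, out_file, row.names = FALSE)
```

Keeping the custom function near the top, right after the `library()` calls, means a reader sees every dependency of the script before any data is touched.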
---

# Reproducible reports with R Markdown

We will use [R Markdown](http://rmarkdown.rstudio.com/lesson-1.html) to create reproducible reports, which enables mixing of text and code.

--

A reproducible script will contain sections of code in code blocks.

--

A code block starts and ends with three backtick symbols in a row, with some information about the code between curly brackets, such as `{r chunk-name, echo=FALSE}` (this runs the code, but does not show the text of the code block in the compiled document).

--

The text outside of code blocks is written in <a class='glossary' target='_blank' title='' href='https://psyteachr.github.io/glossary/m#markdown'>markdown</a>, which is a way to specify formatting, such as headers, paragraphs, lists, bolding, and links.

<div class="figure">
<img src="images/01/reproducibleScript.png" alt="A reproducible script." width="2235" />
<p class="caption">A reproducible script.</p>
</div>

---

# Reproducible reports with R Markdown

If you open up a new R Markdown file from a template, you will see an example document with several code blocks in it.

--

To create an HTML or PDF report from an R Markdown (Rmd) document, you compile it.

--

Compiling a document is called <a class='glossary' target='_blank' title='' href='https://psyteachr.github.io/glossary/k#knitting'>knitting</a> in RStudio.

--

There is a button that looks like a ball of yarn with needles through it that you click on to compile your file into a report.

---

# Working Directory

Where should I put all my files?

--

When developing an analysis, you usually want to have all of your scripts and data files in one subtree of your computer's directory structure.

--

Usually there is a single **working directory** where your data and scripts are stored.

--

Your script should only reference files in three locations, using the appropriate format.

| Where                    | Example |
|--------------------------|---------|
| on the web               | "https://psyteachr.github.io/msc-data-skills/data/disgust_scores.csv" |
| in the working directory | "disgust_scores.csv" |
| in a subdirectory        | "data/disgust_scores.csv" |

---

# Working Directory

If you are working with an R Markdown file, it will automatically use the same directory the .Rmd file is in as the working directory.

--

If you are working with R scripts, store your main script file in the top-level directory and manually set your working directory to that location.

--

If your script needs a file in a subdirectory of `new_analysis`, say, `data/questionnaire.csv`, load it in using a <a class='glossary' target='_blank' title='' href='https://psyteachr.github.io/glossary/r#relative-path'>relative path</a>:

```r
dat <- read_csv("data/questionnaire.csv")  # right way
```

???

Do not load it in using an absolute path:

```r
dat <- read_csv("C:/Carla's_files/thesis22/my_thesis/new_analysis/data/questionnaire.csv")   # wrong
```

<div class="info">
<p>Also note the convention of using forward slashes, unlike the Windows-specific convention of using backward slashes. This is to make references to files platform independent.</p>
</div>

---

# Closing Thoughts

.pull-left[
<img src="images/memes/googling.jpg" width="70%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="images/memes/changing-stuff.jpg" width="70%" style="display: block; margin: auto;" />
]
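
---

To make the relative-path advice from the Working Directory slides concrete, here is a small sketch you can run anywhere. It assumes a made-up project folder called `new_analysis` created inside a temporary directory, fills it with invented data, and uses base R's `read.csv()` rather than `readr::read_csv()` so that no add-on package is needed.

```r
# hypothetical project layout, created in a temporary folder for illustration
project_dir <- file.path(tempdir(), "new_analysis")
dir.create(file.path(project_dir, "data"),
           recursive = TRUE, showWarnings = FALSE)

# write a small made-up data file into the data/ subdirectory
write.csv(data.frame(id = 1:3, score = c(4, 7, 5)),
          file.path(project_dir, "data", "questionnaire.csv"),
          row.names = FALSE)

# make the project folder the working directory;
# setwd() returns the previous working directory, so we can restore it later
old_wd <- setwd(project_dir)

# the relative path is resolved against the current working directory
dat <- read.csv("data/questionnaire.csv")

# go back to where we started
setwd(old_wd)
```

Because the script only ever mentions `data/questionnaire.csv`, the whole `new_analysis` folder can be moved to another computer and the script still runs once the working directory is set to that folder.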