Chapter 2 Introduction

FIXME: general introduction.

2.1 Who are these lessons for?

2.1.1 Exton Excel

  1. Exton taught business at a community college for several years, and now does community management for an event management company. He still teaches Marketing 101 every year to help people with backgrounds like his.
  2. Exton uses Excel to keep track of who is registered for webinars, workshops, and training sessions. He doesn’t think of himself as a programmer, but spends hours creating complicated lookup tables to figure out how many webinar attendees turn into community contributors, who answers forum posts most frequently, and so on.
  3. Exton knows there are better ways to do what he’s doing, but feels overwhelmed by the blog posts, tweets, and “helpful” recommendations from the company’s engineering team.
  4. Exton is a single parent; the one evening a week he spends teaching is the only out-of-work time he can take away from family responsibilities.

2.1.2 Nina Newbie

  1. Nina is 18 years old and in the first year of an undergraduate degree in urban planning. She has read lots of gushing articles about data science, and was excited by the prospect of learning how to do it, but dropped her CS 101 course after six weeks because nothing made sense. She is doing better in her intro to statistics course (which uses a little bit of R), but still spends most of her time copying, pasting, and swearing.
  2. Nina did well in her high school math classes, and built herself a home page with HTML and CSS in a weekend workshop in grade 11. She has accounts on nine different social media sites, and attends all of her morning classes online.
  3. Nina wants self-paced tutorials with practice exercises, plus forums where she can ask for help.
  4. After a few bruising conversations with CS majors, Nina is reluctant to reveal how little she knows about programming: she would rather get a low grade and blame it on partying than let her classmates see that she is floundering.

2.2 What will these lessons teach you?

During this course, you will learn how to:

  • Write programs in R that read data, clean it up, and perform simple statistical analyses on it.
  • Build visualizations to help you understand your data and communicate your findings.
  • Find and install R packages to help you do these things.
  • Create software that other people can understand and re-run.
  • Track your work in version control using Git and GitHub.
  • Publish your results on a web site using R Markdown and GitHub Pages.
  • Make your data, software, and reports citable using ORCIDs and DOIs, and cite the work of others.
  • Select open source and Creative Commons licenses that allow you and others to share data, software, and reports.
  • Be an active participant in open, inclusive projects.

2.2.1 What do you need to have and know to start?

This course assumes that you:

  • Have a laptop that you can install software on, or access to a web-based programming system like [RStudio Cloud][rstudio-cloud].
  • Know what mean and variance are.
  • Are willing to invest about 30 hours in reading or lectures and another 100 hours doing practice exercises.

We have tried to make these lessons accessible to people with visual or motor challenges, but recognize that some parts (particularly data visualization) may still be difficult. We welcome suggestions for improvements.

2.3 What examples will we use?

FIXME: introduce running examples

2.4 What’s the big picture? {#intro-bigpicture}

We now swim in a sea of data and generate more each day. That data can help us understand the world, but it can also be used to manipulate us and invade our privacy. Learning how to analyze data will help you do the former and guard against the latter.

This course is therefore about people, programs, and data. Data can live in three places (FIXME: diagram):

  1. In the computer’s memory. It has to be here for the computer to use it, but when a program stops running or the computer is shut down, the contents of memory evaporate.

  2. On the computer’s hard drive. This is much larger than memory—terabytes instead of gigabytes—and its contents are organized into files and directories (also called folders). What’s on the hard drive stays there even when programs aren’t running or the computer is switched off. A program must read data from files into memory to work with it, and write data to files to save it permanently.

  3. On some other computer on the network. “The cloud” and “the web” are just other people’s computers; if we want to use data that’s on the other side of a URL, we need to download it (i.e., copy it to our hard drive) or read it directly into memory (which is called [streaming][streaming-data]).
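
The three locations above can be sketched in a few lines of R. This is only an illustration, not part of the course's running examples: the data frame `readings`, its column names and values, and the URL in the final comment are all made up, and a temporary file stands in for a real project file.

```r
# Data in memory: a small data frame that exists only while R is running.
# (The names and values here are invented for illustration.)
readings <- data.frame(
  site = c("A", "B", "C"),
  co2 = c(410.5, 412.1, 409.8)
)

# Data on the hard drive: write the data frame to a file so that it
# survives after the program stops. tempfile() gives a throwaway path;
# a real project would use a named file in a project directory.
path <- tempfile(fileext = ".csv")
write.csv(readings, path, row.names = FALSE)

# Discard the in-memory copy, then read the file back into memory.
rm(readings)
restored <- read.csv(path)
average_co2 <- mean(restored$co2)

# Data on another computer: read.csv() also accepts a URL, and
# download.file() copies a remote file to the hard drive first.
# The address below is hypothetical, so these lines are left commented out.
# remote <- read.csv("https://example.org/readings.csv")
# download.file("https://example.org/readings.csv", "readings.csv")
```

After `rm(readings)` the only copy of the data is the one in the file, which is why the program has to read it back in before it can calculate anything.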

Programs also live in three places (FIXME: enhanced diagram):

  1. In the computer’s memory. A program has to be in memory for the computer to run it.

  2. On the computer’s hard drive. A program is a file like any other. Instead of containing text, pixels, or CO2 measurements, it contains instructions. In order for the computer to run it, those instructions have to be copied into memory. (This is part of what your computer is doing when it launches an application.) Your program typically uses other pieces of software called [packages][package] or [libraries][library] that provide common operations like searching in text, changing the colors of pixels, or calculating averages. When your program is loaded into memory, your computer also loads those packages.

  3. On some other computer on the network. R and Python both have catalogs of packages that people have written and shared. In order to use one of these, you must [install][install] it on your computer by copying its files from the catalog site to your hard drive. There’s usually more to this than simply copying one file, so R and Python both come with tools to help you find, install, and manage packages.
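
As a rough sketch of how those three locations look from inside R: `installed.packages()`, `library()`, and `install.packages()` are standard R functions, while the choice of `dplyr` as the package to look for is just an assumption made for the example. Nothing is downloaded when this runs; it only reports what would need to be installed.

```r
# On the hard drive: the names of the packages already installed
# in this computer's R library.
installed <- rownames(installed.packages())

# In memory: library() loads an installed package so that its functions
# can be called. The "tools" package ships with R, so it is always there.
library(tools)
extension <- file_ext("survey.csv")  # the part of the file name after the dot

# On the network: install.packages() copies a package's files from the
# catalog site to the hard drive. "dplyr" is used purely as an
# illustration; this code only prints a hint rather than installing it.
pkg <- "dplyr"
if (pkg %in% installed) {
  message(pkg, " is already on the hard drive")
} else {
  message(pkg, " is not installed; install.packages(\"", pkg, "\") would fetch it")
}
```

A package only needs to be installed once per computer, but it has to be loaded with `library()` in every new R session that uses it.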

Finally, we come to people (FIXME: enhanced diagram):

  1. This course starts by teaching you how to write programs that run on your computer, analyze data that is on your computer or on the web, and use packages written by other people.

  2. It will also teach you how to write programs that your collaborators can understand and run. (One of those collaborators is your future self, who will be grateful three months from now that you did the right thing today.) These skills will make you a productive member of a small team, and this course will also explain how to make such teams open and inclusive.

  3. The third group of people includes collaborators or reviewers who want to reproduce your work. This course will show you how to publish your results on the web, how to get credit for your own work, and how to give credit to others for theirs.