Chapter 1 Overview

NOTE: This book is very much in-development so many parts are incomplete, as a draft, or not even started. Check back later!

It’s still magic even if you know how it’s done.

— Terry Pratchett

FIXME: general introduction

1.1 Why isn’t all of this normal already?

Nobody argues that research should be irreproducible or unsustainable, but “not against it” and actively supporting it are very different things. Academia doesn’t yet know how to reward people for writing useful software, so while you may be thanked, the extra effort you put in may not translate into job security or decent pay.

And some people still argue against openness, Being open is a big step toward a (non-academic) career path, which is where approximately 80% of PhDs go, and for those staying in academia, open work is cited more often than closed (FIXME: citation). However, some people still worry that if they make their data and code generally available, someone else will use it and publish a result they have come up with themselves. This is almost unheard of in practice, but that doesn’t stop people using it as a boogeyman.

Other people are afraid of looking foolish or incompetent by sharing code that might contain bugs. This isn’t just impostor syndrome: members of marginalized groups are frequently judged more harshly than others (FIXME: citation).

1.2 Contributing

If you want to contribute, please check out the CONTRIBUTING guidelines.

1.3 Acknowledgments

This book owes its existence to the hundreds of researchers we met through the Carpentries. We are also grateful to Insight Data Science for sponsoring the early stages of this work, to everyone who has contributed, and to:

1.4 Goals of this course

This outline describes the questions that the novice courses on R and Python will answer. The advanced course can then assume that learners have hands-on experience with these topics but nothing more.

1.4.1 Personas

1.4.1.1 Anya

Anya is a professor of neuropsychology who is responsible for teaching her department’s introduction to statistics to 1100 first-year students every year. (Students complain that the Stats department’s introductory course is too theoretical and requires more programming knowledge than they have.) When she finds time for it, her research focuses on color perception in infants.

Over the past nine years, Anya has designed and run a dozen experiments on 50-100 infant subjects each and analyzed the results using SPSS and more recently R (which she taught herself during a sabbatical). She has never taken a programming course, and suffers from impostor syndrome when talking to colleagues who are using things like GitHub and R Markdown.

Anya would like to figure out how to use R to teach her intro stats course, which currently uses a mixture of Excel and SPSS. She would like to learn more about time series analysis to support her research, and about tools like Git and R Markdown.

This guide has modular lessons and exercises that she can adapt to use in her course, and suggestions for how to make learning interactive with a large class size. She also finds helpful instructions for applying time series analysis to data using R.

1.4.1.2 Exton

Exton taught business at a community college before joining a friend’s startup, and now does community management for a company that builds healthcare software. He still teaches Marketing 101 every year to help people with backgrounds like his.

Exton uses Excel to keep track of who is registered for webinars, workshops, and training sessions. Some of these spreadsheets are created from CSV files produced by a web-scraping script a summer intern wrote for him a couple of years ago. Exton doesn’t think of himself as a programmer, but spends hours creating complicated lookup tables in multi-sheet spreadsheets to help him figure out how many webinar attendees turn into community contributors, who answers forum posts most frequently, and so on.

Exton knows there are better ways to do what he’s doing, but feels overwhelmed by the flood of blog posts, tweets, and “helpful” recommendations he receives from members of the company’s engineering team. He wants someone to tell him where he should start and how long it will take whatever he learns to pay off.

Exton finds ‘Merely Useful’ after some Googling, and sees an example of data analysis with spreadsheet data that looks really similar to what he’s trying to do. He carefully works through that particular example, then goes back and works through some of the earlier material in the book. He can tell that it won’t take long to get this to work with his data.

1.4.1.3 Irwin

Irwin, 18, is five months into an undergraduate degree in urban planning. He’s read lots of gushing articles in Wired about data science, and was excited by the prospect of learning how to do it, but dropped his CS 101 course after six weeks because nothing made sense. (His university’s computer science department uses Haskell as an introductory programming language…) He is doing better in Anya’s course (which he is taking as an elective) but still spends most of his time copying, pasting, and swearing.

Irwin did well in his high school math classes, and built himself a home page with HTML and CSS in a weekend workshop in grade 11. He knows how to do simple calculations in Excel, has accounts on nine different social media sites, and attends all of his morning classes online.

Anya mentions this guide in one of her classes, and Irwin downloads the PDF to read on the bus. He loves the examples that use urban data, and right away he has tons of ideas about where to get more cool data to analyze. His urban data science blog is already taking shape in his head.

1.4.1.4 Camilla

Camilla recently started a job as an assistant professor. Her department (Medieval Studies) is trying to develop a digital humanities data-science-heavy undergraduate program, and the undergraduate chair thinks that Camilla has the most programming experience in the department and has asked her to develop an introduction to programming course for humanities students.

Camilla has dabbled in natural language processing and has learned Python over the course of her previous work, but she has no experience teaching progamming and she’s not sure what the best way is to teach beginners. She doesn’t want to start from scratch to create a course out of nothing. She also isn’t sure which programming language the new program should focus on.

She finds ‘Merely Useful’ and feels relieved: she can pretty much use the book as-is for her course. She looks up the examples of text and image analysis and compares how both R and Python approach those kinds of data to help her make a decision about which language to teach.

1.4.1.5 Jordan

Jordan is a third-year undergraduate student in ecology. Two months ago she started working part-time for a professor in her department, and she’s beginning to collect and analyze data from her own experiments with fruit flies. Her professor has asked her to learn R to do her analysis and suggested that she sign up for the introduction to quantitative data analysis in R course that the ecology department offers. The course is just starting, and it uses ‘Merely Useful’ as the textbook.

Jordan can’t wait to apply her new programming knowledge to her data, so she starts reading ahead and trying to use her own data in some of the book’s examples. As she works through examples, she realizes that she’ll need to change a few things about how she records her data in spreadsheets so that it will be easier to analyze in R.

1.4.2 Outline

1.4.2.1 Getting started

  • What are the different ways I can interact with software?
    • console
    • scripts
  • How can I find and view help?
    • In the IDE
    • Stack Overflow
  • How can I inspect data while I’m working on it?
    • table viewers

1.4.2.2 Data manipulation

  • How can I read tabular data into a program?
    • what CSV is, where it comes from, and why people use it
    • reading files
  • How can I select subsets of my data?
    • select
    • filter
    • arrange
    • Boolean conditions
  • How can I calculate new values?
    • mutate
    • ifelse
  • How can I tell what’s gone wrong in my programs?
    • reading error messages
    • the difference between syntax and runtime errors
  • How can I operate on subsets of my data?
    • group
    • summarize
    • split-apply-combine
  • How can I work with two or more datasets?
    • join
  • How can I save my results?
    • writing files
  • What isn’t included?
    • anything other than reasonably tidy tabular data
    • map
    • loops and conditionals

1.4.2.3 Plotting

  • Why plot?
    • summary statistics can mislead
    • [Anscombe’s Quartet and the DataSaurus dozen][anscombe-datasaurus]
  • What are the core elements of every plot?
    • data
    • geometric objects
    • aesthetic mapping
  • How can I create different kinds of plots?
    • scatter plot
    • line plot
    • histogram
    • bar plot
    • which to use when
  • How can I plot multiple datasets at once?
    • grouping
    • faceting
  • How can I make misleading plots?
    • showing a single central tendency data point instead of the individual observations
    • saturated plots instead of for example violins or 2D histograms
    • picking unreasonable axes limits to intentionally misrepresent the underlying data
    • not using perceptually uniform colormaps to indicate quantities
    • not thinking about color blindness
  • What isn’t included?
    • outliers
    • interactive plots
    • maps
    • 3D visualization

1.4.2.4 Development

  • How can I make my own functions?
    • declaring functions
    • declaring parameters
    • default values
    • common conventions
  • How can I make my programs tell me that something has gone wrong?
    • validation (did we build the right thing) vs. verification (did we build the thing correctly)
    • assertions for sanity checks
  • How can I ask for help?
    • creating a reproducible example (reprex)
  • How do I install software?
    • what is a package?
    • package manager
  • What isn’t included?
    • code browsers, multiple cursors, and other fancy IDE tricks
    • virtual environments
    • debuggers

1.4.2.5 Data analysis

  • How can I represent and manage missing values?

    • NA Python Learning Objectives
    • Identify missing values (e.g., NaN) in your datasets
    • Examine the impact of missing valus on your datasets
    • Demonstrate and use commands that process missing values
    • Demonstrate how to replace missing values

    Python Lecture Notes (general info)

    • Missing Completely at Random (MCAR)
    • Missing at Random (MAR)
    • Missing Not at Random (MNAR)
    • Missing data at scale (what’s a little and what’s alot?)
    • Imputation and Multiple Imputation
    • When summing the data, missing values will be treated as zero
    • If all values are missing, the sum will be equal to NaN
    • cumsum() and cumprod() methods ignore missing values but preserve them in the resulting arrays
    • .groupby() method see more in descriptive stats section
      • Missing values in .groupby() method are excluded (just like in R)
      • Another .groupby() fun fact: .groupby() with observed=True includes NaNs for categoricals.
    • Many descriptive statistics methods have skipna option to control if missing data should be excluded . This value is set to True by default (unlike R)

    Python Lecture Notes (useful commands)

    • Table: df.method() / description
    • dropna() / drop missing observations
    • dropna(how=‘all’) / drop observations where all cells are NA
    • dropna(axis=1,how=‘all’) / drop coloumn if all values are missing
    • dropna(thresh=5) / drop rows that contain less than 5 non-missing values
    • fillna(0) / replace missing values with zeros
    • isna() returns True if the value is missing
    • notna() / Returns True for non-missing values

    T/F Exercises (#Review4Everybody)

    • Imputation is really just making up data to artificially inflate results. It’s better to just drop cases with missing data than to impute.
    • (need descriptive stats lesson first) I can just impute the mean for any missing data. It won’t affect results, and improves power.
    • Missing data isn’t really a problem if I’m just doing simple statistics.
    • When imputing, it’s important that the imputations be plausible data points.

    Short Answer Exercises (#Review4Everybody)

    • What is imputation of missing data?
    • How does missing values affect results?
    • How do you computationally mark missing values in your dataset?
    • How could you handle missing data?
  • How can I get a feel for my data?

    • summary/descrptive statistics

    Python Learning Objectives

    • Recognize the difference between statistics and descriptive statistics
    • Define four variable types: nomial, ordinal, interval and ratio
    • Describe the types of descriptive statistics: organize data and summarize data
    • Demonstrate and use commands that help you describe data

    Python Lecture Notes (general info)

    • Discuss basic probability & statistics overview
    • Descriptive statistics quantitatively describes or summarizes features of a dataset.
    • Descriptive statistics are used by researchers/analysts to report on populations and samples.
    • Descriptive statistics speed up and simplify comprehension of a group’s characteristics
    • Variable types
      • Nominal: Unordered categorical variables
      • Ordinal: There is an ordering but no implication of equal distance between the different points of the scale.
      • Interval: There are equal differences between successive points on the scale but the position of zero is arbitrary.
      • Ratio: The relative magnitudes of scores and the differences between them matter. The position of zero is fixed.
    • Types of descriptive statistics
      • Organize data: tables and graphs
      • Summarize data: central tendency and variation
    • Tables
      • frequency distributions, relative frequency distributions
    • Graphs
      • bar chart, histogram, boxplot
    • Central tendency
      • mean, median, mode
    • Variation
      • range, interquartile range, variance, standard deviation

    Python Lecture Notes (useful commands)

    • Import python libraries:
      • import pandas as pd
      • import seaborn as sns
    • Read data into a dataframe (likely review from another section)
      • df = pd.read_csv(“your-filename.csv”) #read a csv file, it better be tidy first!
      • Note: you can read data from other file formats like excel, stata and sas
    • df.method() / description
      • head([n]), tail([n]) / first/last n rows
      • describe() / generate descriptive statistics (for numeric columns only)
      • shape / return a tuple representing the dimensionality of the dataframe
      • info() / prints a dataframe’s summary information, e.g., index dtype and column dtypes, non-null values and memory usage
      • max(), min() / return max/min values for all numeric columns
      • mean(), median() / return mean/median values for all numeric columns
      • std() / standard deviation
      • sample([n]) / returns a random sample of the data frame
    • .groupby() method
      • Splits data into groups based on some criteria you define
      • Then you can calculate statistics (or apply a function) to each group
      • A few performance notes:
        • It’s customary to enact the grouping or splitting of data until it’s needed.
        • Creating the groupby object only verifies that you have passed a valid mapping
        • By default, the group keys are sorted during the groupby operation. You may want to pass sort=False for potential speedup
    • Graphs (Intro to seaborn)
      • Seaborn package is built on matplotlib but provides high level interface for drawing attractive statistical graphics
      • It specifically targets statistical data visualization
      • Reference: https://seaborn.pydata.org/introduction.html
      • import seaborn as sns
      • sns.set(): applies the default default seaborn theme, scaling, and color palette
      • load dataset: e.g., tips = sns.load_dataset(“tips”) where tips is a tidy df
      • relplot() method: designed to visualize many different statistical relationships
        • Example: fmri = sns.load_dataset(“fmri”) where fmri is a tidy df
        • Example: sns.relplot(x=“timepoint,” y=“signal,” col=“region,” hue=“event,” style=“event,” kind=“line,” data=fmri)
        • explain each parameter in kind
      • catplot() method: helps visualize the relationship between one numeric variable and one (or more) categorical variables
        • Example: sns.catplot(x=“day,” y=“total_bill,” hue=“smoker,” kind=“swarm,” data=tips)
        • kind parameter can be changed to: violin, bar, etc.
      • pairplot() method: shows all pairwise relationships and the marginal distributions, optionally conditioned on a categorical variable
        • Example: sns.pairplot(data=iris, hue=“species”)
        • my personal fav when paired w/ .describe() for some initial EDA

    Live-coding Exercises

    • Try to read the first 10, 20, 50 records of your dataframe?
    • Can you guess how to view the last few records of your dataframe?
    • Find how many records this data frame has
    • How many elements are there?
    • What are the column names?
    • What types of columns we have in this data frame?
    • Give the summary for the numeric columns in the dataset
    • Calculate standard deviation for all numeric columns
    • What are the mean values of the first 50 records in the dataset?
    • Calculate the basic statistics for column XXX (XXX refers to a TBD attribute selected lecture/presentation)
    • Find how many values in column XXX (hint: use the count method)
    • Calculate the average of column XXX
  • How can I create a simple model of my data? Python Learning Objectives

    • Describe clustering and regression
    • Define k-means clustering and linear regression analyses
    • Demonstrate appropriate applications or uses of k-means clustering analysis
    • Demonstrate appropriate applications or uses of linear regression analysis
    • State and specify limitations of both analyses

    Lecture Notes (K-means Cluster Analysis)

    • Resource: Information to Information Retrieval (https://nlp.stanford.edu/IR-book/information-retrieval-book.html), Chapter 16.
    • Clustering: the process of grouping a set of elemens into classes of similar elements
    • Clustering is also called unsupervised learning, e.g., “learning” from raw data
    • K-means clustering refines goups iteratively where each element belongs to only one group (a nod to flat algorithm approaches w/ hard clustering practices, if we want to expand in later books)
    • Algorithm setup
      • Assumes elementss are real-valued vectors
      • Clusters based on centroids (or mean) of points in a cluster
      • Reassignment of elements to clusters is based on distance to the current cluster centroids.
    • Algorithm Pseudocode
      • Step 0: Select K random elements {s1, s2,…,sK} as seeds. These are your initial K centroids.
      • Step 1: Assign points to closest centroid
      • Step 2: Recompute cluster centroids
      • Step 3: Goto Step 1
    • A Few FYIs
      • You (designer/coder/developer) must select K
      • K-means always maintains exactly K clusters
      • Clusters may be sensitive to initial seed selection
      • Execution of K-means algorithm tends to converge quickly

    Live Coding Experience for K-Means (instructor w/ class) - run through https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html - Note: Expectation Maximization terminology is not explicitly mentioned, should we discuss it? I don’t want to be too term-heavy for novices. Thoughts welcomed.

    Live Coding Exercises for K-Means (learner on “their own”) - TBD based on at least one of the datasets selected by the group

    Lecture Notes (Linear Regression)

    • In science, we frequently measure two or more variables on the same individual (case, object, etc). We do this to explore the nature of the relationship among these variables. There are two basic types of relationships.
    • Cause-and-effect relationships and functional relationships. Let’s consider functional relationships.
    • Function: a mathematical relationship enabling us to predict what values of one variable (Y) correspond to given values of another variable (X)
    • Y: is referred to as the dependent variable, the response variable or the predicted variable.
    • X: is referred to as the independent variable, the explanatory variable or the predictor variable.
    • In each case, the statement can be read as; Y is a function of X.
    • Two kinds of explanatory variables:
      • Those we can control
      • Those over which we have little or no control.
    • Example: The time needed to fill a soft drink vending machine
    • Motivating questions
      • What is the association between Y and X?
      • How can changes in Y be explained by changes in X?
      • What are the functional relationships between Y and X?
    • The main formula (ta-da..)
      • Y = b_0 + b_1*X where b_0 = intercept and b_1 = slope
      • b_0 = tells you the impact of X being set to 0
      • b_1 = tells you the magnitude change between X and X+1
    • Steps in regression analysis
      • Examine the scatterplot of the data. Does the relationship look linear? Are there points in locations they shouldn’t be? Do we need a transformation?
      • Assuming a linear function looks appropriate, estimate the regression parameters. How do we do this? (Method of Least Squares)
      • Test whether there really is a statistically significant linear relationship. Just because we assumed a linear function it does not follow that the data support this assumption. How do we test this? (F-test for Variances)
      • If there is a significant linear relationship, estimate the response, Y, for the given values of X, and compute the residuals.
      • Examine the residuals for systematic inadequacies in the linear model as fit to the data. Is there evidence that a more complicated relationship (say a polynomial) should be considered; are there problems with the regression assumptions? (Residual analysis). Are there specific data points which do not seem to follow the proposed relationship? (Examined using influence measures).

    Live Coding Experience for Linear Regression (instructor w/ class) - Seaborn version: https://seaborn.pydata.org/tutorial/regression.html
    - [Optional] Matplotlib version: https://www.geeksforgeeks.org/linear-regression-python-implementation/ (for those who want more theory)

    Live Coding Exercises for Linear Regression (learners on “their own”) - TBD based on at least one of the datasets selected by the group

    Misc. Notes (remove?)

    • formulas
    • frame these as exploratory tools for revealing structure in the data, rather than modelling or inferential tools
    • adding a best fit straight line on a scatterplot (linear regression)
    • understanding what the error bands on a “best fit” straight line mean (linear regression)
  • How can I put people at risk?

    • algorithmic bias
      • Watch this video and discuss w/ class (https://www.youtube.com/watch?v=UG_X_7g63rY) How I’m fighting bias in algorithms by Joy Buolamwini TedTalk Abstract: MIT grad student Joy Buolamwini was working with facial analysis software when she noticed a problem: the software didn’t detect her face – because the people who coded the algorithm hadn’t taught it to identify a broad range of skin tones and facial structures. Now she’s on a mission to fight bias in machine learning, a phenomenon she calls the “coded gaze.” It’s an eye-opening talk about the need for accountability in coding … as algorithms take over more and more aspects of our lives.
    • de-anonymization
  • What isn’t included?

    • statistical tests
    • multiple linear regression
    • anything with “machine learning” in its name

1.4.2.6 Version control

  • What is a version control system?
    • a smarter kind of backup
  • What goes where and why?
    • local vs. remote storage (physically)
    • local vs. remote storage (ethical/privacy issues)
  • How do I track my work locally?
    • diff
    • add
    • commit
    • log
  • How do I view or recover an old version of a file?
    • diff
    • checkout
  • How do I save work remotely?
    • push and pull
  • How do I manage conflicts?
    • merge
  • What isn’t included?
    • forking
    • branching
    • pull requests
    • git reflow –substantive –single-afferent-cycle –ia-ia-rebase-fhtagn …

1.4.2.7 Publishing

  • How do static websites work?
    • URLs
    • servers
    • request/response cycle
    • pages
  • How do I create a simple HTML page?
    • head/body
    • basic elements
    • images
    • links
    • relative vs. absolute paths
  • How can I create a simple website?
    • GitHub pages
  • How can I give pages a standard appearance?
    • layouts
  • How can I avoid writing all those tags?
    • Markdown
  • How can I share values between pages?
    • flat per-site and per-page configuration
    • variable expansion
  • What isn’t included?
    • templating
    • filters
    • inclusions

1.4.2.8 Reproducibility

  • How can I make programs easy to read?
    • coding style
    • linters
    • documentation
  • How can I make programs easy to re-use?
    • Taschuk’s Rules
  • How can I combine explanations, code, and results?
    • notebooks
  • Where does stuff actually live on my computer?
    • directory structure on Windows and Unix
    • absolute vs. relative paths
    • significance of the working directory
    • data on disk vs. data in memory
  • How should I organize my projects?
    • Noble’s Rules
    • RStudio projects
  • How should I keep track of my data?
    • simple manifests
  • What isn’t included?
    • build tools (Make and its kin)
    • continuous integration
    • documentation generators

1.4.2.9 Collaboration

  • What kinds of licenses are there?
    • open vs. closed
    • copyright
  • Who gets to decide what license to use?
    • it depends…
  • What license should I use for my publications?
    • CC-something
  • What license should I use for my software?
    • MIT/BSD vs. GPL
  • What license should I use for my data?
    • CC0
  • How should I identify myself and my work?
    • DOIs
    • ORCIDs
  • How do I credit someone else’s code?
    • citing packages, citing something from GitHub, giving credit for someone’s answer on StackOverflow…
  • What’s the difference between open and welcoming?
    • evidence for systematic exclusion
    • mechanics of exclusion
  • How can I help create a level playing field?
    • what’s wrong with deficit models
    • allyship, advocacy, and sponsorship
    • Code of Conduct (remove negatives)
    • [curb cuts][curb-cuts] (adding positives for some people helps everyone else too)
  • What isn’t included?
    • how to run a meeting
    • community management
    • mental health
    • assessment of this course