C Key Points

This appendix lists the key points for each chapter.

C.1 Getting Started

  • Make tidiness a habit, rather than cleaning up your project files later.
  • Include a few standard files in all your projects, such as README, LICENSE, CONTRIBUTING, CONDUCT, and CITATION.
  • Put runnable code in a bin/ directory.
  • Put raw/original data in a data/ directory and never modify it.
  • Put results in a results/ directory. This includes cleaned-up data and figures (i.e., everything created using what’s in bin and data).
  • Put documentation and manuscripts in a docs/ directory.
  • Refer to The Carpentries software installation guide if you’re having trouble.

C.2 The Basics of the Unix Shell

  • A shell is a program that reads commands and runs other programs.
  • The filesystem manages information stored on disk.
  • Information is stored in files, which are located in directories (folders).
  • Directories can also store other directories, which forms a directory tree.
  • pwd prints the user’s current working directory.
  • / on its own is the root directory of the whole filesystem.
  • ls prints a list of files and directories.
  • An absolute path specifies a location from the root of the filesystem.
  • A relative path specifies a location in the filesystem starting from the current directory.
  • cd changes the current working directory.
  • .. means the parent directory.
  • . on its own means the current directory.
  • mkdir creates a new directory.
  • cp copies a file.
  • rm removes (deletes) a file.
  • mv moves (renames) a file or directory.
  • * matches zero or more characters in a filename.
  • ? matches any single character in a filename.
  • wc counts lines, words, and characters in its inputs.
  • man displays the manual page for a given command; some commands also have a --help option.
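
The two wildcard rules above can be tried out without a shell. The following is a minimal sketch that uses Python’s standard fnmatch module, which follows shell-style rules for * and ?; the filenames are invented for illustration. Unlike the shell, fnmatch does not treat leading dots or slashes specially, so it is only an approximation of what the shell does.

```python
from fnmatch import fnmatchcase

# Hypothetical filenames, invented for this illustration.
names = ["frankenstein.txt", "dracula.txt", "notes.md", "a.txt", "ab.txt"]


def select(pattern, candidates):
    """Return the candidates that match a shell-style wildcard pattern."""
    return [name for name in candidates if fnmatchcase(name, pattern)]


star_matches = select("*.txt", names)       # * matches zero or more characters
prefix_matches = select("a*.txt", names)    # here * matches "" in a.txt and "b" in ab.txt
question_matches = select("?.txt", names)   # ? matches exactly one character

print(star_matches)
print(prefix_matches)
print(question_matches)
```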

C.3 Building Tools with the Unix Shell

  • cat displays the contents of its inputs.
  • head displays the first few lines of its input.
  • tail displays the last few lines of its input.
  • sort sorts its inputs.
  • Use the up-arrow key to scroll up through previous commands to edit and repeat them.
  • Use history to display recent commands and !number to repeat a command by number.
  • Every process in Unix has an input channel called standard input and an output channel called standard output.
  • > redirects a command’s output to a file, overwriting any existing content.
  • >> appends a command’s output to a file.
  • < redirects a command’s input so that it is read from a file.
  • A pipe | sends the output of the command on the left to the input of the command on the right.
  • A for loop repeats commands once for each item in a list.
  • Every for loop must have a variable to refer to the thing it is currently operating on and a body containing commands to execute.
  • Use $name or ${name} to get the value of a variable.

C.4 Going Further with the Unix Shell

  • Save commands in files (usually called shell scripts) for re-use.
  • bash filename runs the commands saved in a file.
  • $@ refers to all of a shell script’s command-line arguments.
  • $1, $2, etc., refer to the first command-line argument, the second command-line argument, etc.
  • Place variables in quotes if the values might have spaces or other special characters in them.
  • find prints a list of files with specific properties or whose names match patterns.
  • $(command) inserts a command’s output in place.
  • grep selects lines in files that match patterns.
  • Use the .bashrc file in your home directory to set shell variables each time the shell runs.
  • Use alias to create shortcuts for things you type frequently.

C.5 Building Command-Line Programs in Python

  • Write command-line Python programs that can be run in the Unix shell like other command-line tools.
  • If the user does not specify any input files, read from standard input.
  • If the user does not specify any output files, write to standard output.
  • Place all import statements at the start of a module.
  • Use the value of __name__ to determine if a file is being run directly or being loaded as a module.
  • Use argparse to handle command-line arguments in standard ways.
  • Use short options for common controls and long options for less common or more complicated ones.
  • Use docstrings to document functions and scripts.
  • Place functions that are used across multiple scripts in a separate file that those scripts can import.
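
Several of these points fit together in one small script. The following is a minimal sketch rather than a definitive implementation: the line-counting task, the function names, and the -o/--outfile option are all assumptions made for illustration.

```python
"""Count the lines in the named files, or in standard input if none are named."""
import argparse
import sys


def count_lines(reader):
    """Return the number of lines in an open text stream."""
    return sum(1 for _ in reader)


def parse_args(argv=None):
    """Parse command-line arguments (argv=None means use sys.argv)."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("infiles", nargs="*",
                        help="input files (default: standard input)")
    parser.add_argument("-o", "--outfile", default=None,
                        help="output file (default: standard output)")
    return parser.parse_args(argv)


def main(argv=None):
    """Count lines in the inputs and report the total."""
    args = parse_args(argv)
    if args.infiles:
        total = 0
        for filename in args.infiles:
            with open(filename, "r") as reader:
                total += count_lines(reader)
    else:
        # No input files were given, so read from standard input.
        total = count_lines(sys.stdin)
    if args.outfile:
        with open(args.outfile, "w") as writer:
            print(total, file=writer)
    else:
        # No output file was given, so write to standard output.
        print(total)
    return total


# Run main only when this file is executed directly, not when it is imported.
if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as countlines.py, the script could be run as "python countlines.py a.txt b.txt" or placed at the end of a pipeline such as "cat a.txt | python countlines.py".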

C.6 Using Git at the Command Line

  • Use git config with the --global option to configure your username, email address, and other preferences once per machine.
  • git init initializes a repository.
  • Git stores all repository management data in the .git subdirectory of the repository’s root directory.
  • git status shows the status of a repository.
  • git add puts files in the repository’s staging area.
  • git commit saves the staged content as a new commit in the local repository.
  • git log lists previous commits.
  • git diff shows the difference between two versions of the repository.
  • Synchronize your local repository with a remote repository on a forge such as GitHub.
  • git remote manages bookmarks pointing at remote repositories.
  • git push copies changes from a local repository to a remote repository.
  • git pull copies changes from a remote repository to a local repository.
  • git restore and git checkout recover old versions of files.
  • The .gitignore file tells Git what files to ignore.

C.7 Going Further with Git

  • Use a branch-per-feature workflow to develop new features while leaving the master branch in working order.
  • git branch creates a new branch.
  • git checkout switches between branches.
  • git merge merges changes from another branch into the current branch.
  • Conflicts occur when files or parts of files are changed in different ways on different branches.
  • Version control systems do not allow people to overwrite changes silently; instead, they highlight conflicts that need to be resolved.
  • Forking a repository makes a copy of it on a server.
  • Cloning a repository with git clone creates a local copy of a remote repository.
  • Create a remote called upstream to point to the repository a fork was derived from.
  • Create pull requests to submit changes from your fork to the upstream repository.

C.8 Working in Teams

  • Welcome and nurture community members proactively.
  • Create an explicit Code of Conduct for your project modeled on the Contributor Covenant.
  • Include a license in your project so that it’s clear who can do what with the material.
  • Create issues for bugs, enhancement requests, and discussions.
  • Label issues to identify their purpose.
  • Triage issues regularly and group them into milestones to track progress.
  • Include contribution guidelines in your project that specify its workflow and its expectations of participants.
  • Make rules about governance explicit.
  • Use common-sense rules to make project meetings fair and productive.
  • Manage conflict between participants rather than hoping it will take care of itself.

C.9 Automating Analyses with Make

  • Make is a widely used build manager.
  • A build manager re-runs commands to update files that are out of date.
  • A build rule has targets, prerequisites, and a recipe.
  • A target can be a file or a phony target that simply triggers an action.
  • When a target is out of date with respect to its prerequisites, Make executes the recipe associated with its rule.
  • Make executes as many rules as it needs to when updating files, but always respects prerequisite order.
  • Make defines automatic variables such as $@ (target), $^ (all prerequisites), and $< (first prerequisite).
  • Pattern rules can use % as a placeholder for parts of filenames.
  • Makefiles can define variables using NAME=value.
  • Make also has functions such as $(wildcard...) and $(patsubst...).
  • Use specially formatted comments to create self-documenting Makefiles.
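
The out-of-date rule above can be sketched in a few lines of Python. This is not how Make is implemented; it is a simplified illustration that compares file modification times and ignores phony targets, chains of rules, and pattern rules. The file names and the count_lines recipe are hypothetical.

```python
import os
import tempfile


def is_out_of_date(target, prerequisites):
    """Return True if target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_time for p in prerequisites)


def update(target, prerequisites, recipe):
    """Run recipe only if target is out of date; report whether it ran."""
    if is_out_of_date(target, prerequisites):
        recipe(target, prerequisites)
        return True
    return False


def count_lines(target, prerequisites):
    """Hypothetical recipe: write the line count of the first prerequisite."""
    with open(prerequisites[0]) as reader:
        n = sum(1 for _ in reader)
    with open(target, "w") as writer:
        writer.write(f"{n}\n")


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "book.txt")
    result = os.path.join(workdir, "book.csv")
    with open(source, "w") as writer:
        writer.write("one\ntwo\n")

    first = update(result, [source], count_lines)    # target missing: recipe runs
    second = update(result, [source], count_lines)   # target up to date: skipped

    # Pretend the source was edited ten seconds after the result was built.
    later = os.path.getmtime(result) + 10
    os.utime(source, (later, later))
    third = update(result, [source], count_lines)    # source is newer: recipe runs

print(first, second, third)
```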

C.10 Configuring Programs

  • Overlay configuration specifies settings for a program in layers, each of which overrides previous layers.
  • Use a system-wide configuration file for general settings.
  • Use a user-specific configuration file for personal preferences.
  • Use a job-specific configuration file with settings for a particular run.
  • Use command-line options to change things that commonly change.
  • Use YAML or some other standard syntax to write configuration files.
  • Save configuration information to make your research reproducible.
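
The layering described above can be sketched with plain dictionaries. This is a minimal illustration under stated assumptions: the setting names and values are invented, the merge is shallow (nested settings would need a recursive merge), and JSON is used to save the result only because it ships with Python, whereas a real project might well use YAML as suggested above.

```python
import json


def overlay(*layers):
    """Merge configuration layers; each layer overrides the ones before it."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical settings, from most general to most specific.
system_config = {"fontsize": 12, "color": "black", "saveas": None}
user_config = {"color": "blue"}             # personal preference
job_config = {"fontsize": 14}               # settings for one particular run
command_line = {"saveas": "plot.png"}       # things that commonly change

config = overlay(system_config, user_config, job_config, command_line)

# Save the settings actually used so that the run can be reproduced later.
saved = json.dumps(config, sort_keys=True)

print(config)
print(saved)
```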

C.11 Testing Software

  • Test software to convince people (including yourself) that software is correct enough and to make tolerances on “enough” explicit.
  • Add assertions to code so that it checks itself as it runs.
  • Write unit tests to check individual pieces of code.
  • Write integration tests to check that those pieces work together correctly.
  • Write regression tests to check if things that used to work no longer do.
  • A test framework finds and runs tests written in a prescribed fashion and reports their results.
  • Test coverage is the fraction of lines of code that are executed by a set of tests.
  • Continuous integration re-builds and/or re-tests software every time something changes.
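
A minimal sketch of the first few points, assuming a hypothetical mean function as the code under test. The test functions follow the test_ naming convention that frameworks such as pytest use to discover tests, but here they are simply called directly so that the example needs nothing beyond the standard library.

```python
import math


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    # The assertion makes the function check its own input as it runs.
    assert len(values) > 0, "cannot take the mean of an empty sequence"
    return sum(values) / len(values)


def test_mean_of_integers():
    assert mean([1, 2, 3]) == 2


def test_mean_within_tolerance():
    # Floating-point results are rarely exact, so state the tolerance explicitly.
    assert math.isclose(mean([0.1, 0.2]), 0.15, rel_tol=1e-9)


def test_mean_rejects_empty_input():
    try:
        mean([])
    except AssertionError:
        return
    raise RuntimeError("mean([]) should have failed its assertion")


# A test framework would find and run these automatically; here we call them by hand.
for test in (test_mean_of_integers,
             test_mean_within_tolerance,
             test_mean_rejects_empty_input):
    test()
print("all tests passed")
```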

C.12 Handling Errors

  • Signal errors by raising exceptions.
  • Use try/except blocks to catch and handle exceptions.
  • Python organizes its standard exceptions in a hierarchy so that programs can catch and handle them selectively.
  • “Throw low, catch high,” i.e., raise exceptions immediately but handle them at a higher level.
  • Write error messages that help users figure out what to do to fix the problem.
  • Store error messages in a lookup table to ensure consistency.
  • Use a logging framework instead of print statements to report program activity.
  • Separate logging messages into DEBUG, INFO, WARNING, ERROR, and CRITICAL levels.
  • Use logging.basicConfig to define basic logging parameters.
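
The points above can be combined in one short sketch. Everything specific here is an assumption made for illustration: the task of summing counts, the function names, the logger name, and the wording of the messages.

```python
import logging

# Keeping messages in one lookup table makes their wording consistent.
ERRORS = {
    "not_csv": "{fname}: file name must end in .csv",
    "bad_value": "{fname}: count must be a non-negative integer, got {value!r}",
}

logger = logging.getLogger("collate")  # hypothetical logger name


def parse_count(text, fname):
    """Convert text to a non-negative integer or raise ValueError."""
    try:
        value = int(text)
    except ValueError:
        message = ERRORS["bad_value"].format(fname=fname, value=text)
        raise ValueError(message) from None
    if value < 0:
        raise ValueError(ERRORS["bad_value"].format(fname=fname, value=text))
    return value


def total_counts(lines, fname):
    """Throw low: raise an exception as soon as a problem is detected."""
    if not fname.endswith(".csv"):
        raise OSError(ERRORS["not_csv"].format(fname=fname))
    logger.debug("reading %s", fname)
    return sum(parse_count(line, fname) for line in lines)


def main(lines, fname):
    """Catch high: handle the selected exception types in one place."""
    try:
        total = total_counts(lines, fname)
    except (OSError, ValueError) as error:
        logger.error("%s", error)
        return None
    logger.info("total for %s is %d", fname, total)
    return total


# Report INFO and above; DEBUG messages are suppressed at this level.
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

good = main(["1", "2", "3"], "counts.csv")
bad_value = main(["1", "x"], "counts.csv")
bad_name = main(["1"], "counts.txt")
```

Because main catches only OSError and ValueError, any other exception would still propagate, which is what selective handling through the exception hierarchy is meant to allow.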

C.13 Tracking Provenance

  • Publish data and code as well as papers.
  • Use DOIs to identify reports, datasets, and software releases.
  • Use an ORCID to identify yourself as an author of a report, dataset, or software release.
  • Data should be FAIR: findable, accessible, interoperable, and reusable.
  • Put small datasets in version control repositories; store large ones on data sharing sites.
  • Describe your software environment, analysis scripts, and data processing steps in reproducible ways.
  • Make your analyses inspectable as well as reproducible.

C.14 Creating Packages with Python

  • Use setuptools to build and distribute Python packages.
  • Create a directory named mypackage that contains a setup.py script and a subdirectory, also called mypackage, that holds the package’s source files.
  • Use semantic versioning for software releases.
  • Use a virtual environment to test how your package installs without disrupting your main Python installation.
  • Use pip to install Python packages.
  • The default repository for Python packages is PyPI.
  • Use TestPyPI to test the distribution of your package.
  • Use a README file for package-level documentation.
  • Use Sphinx to generate documentation for a package.
  • Use Read the Docs to host package documentation online.
  • Create a DOI for your package using GitHub’s Zenodo integration.
  • Publish details of your package in a software journal so others can cite it.
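
To illustrate the point about semantic versioning, the following sketch treats a version as a MAJOR.MINOR.PATCH triple of integers. It is a simplification: it ignores the pre-release and build-metadata parts of the full specification, and the function names are hypothetical.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def bump(version, change):
    """Return the next version after a 'major', 'minor', or 'patch' change."""
    major, minor, patch = version
    if change == "major":      # incompatible changes
        return (major + 1, 0, 0)
    if change == "minor":      # new, backward-compatible features
        return (major, minor + 1, 0)
    if change == "patch":      # backward-compatible bug fixes
        return (major, minor, patch + 1)
    raise ValueError(f"unknown kind of change: {change!r}")


def format_version(version):
    """Turn a version tuple back into 'MAJOR.MINOR.PATCH'."""
    return ".".join(str(part) for part in version)


current = parse_version("1.4.2")
after_fix = format_version(bump(current, "patch"))
after_feature = format_version(bump(current, "minor"))
after_break = format_version(bump(current, "major"))

# Compare as integer tuples, not as strings: "0.10.0" sorts before "0.9.3" as text.
newer = parse_version("0.10.0") > parse_version("0.9.3")

print(after_fix, after_feature, after_break, newer)
```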