C Key Points
This appendix lists the key points for each chapter.
C.1 Getting Started
- Make tidiness a habit, rather than cleaning up your project files later.
- Include a few standard files in all your projects, such as README, LICENSE, CONTRIBUTING, CONDUCT and CITATION.
- Put runnable code in a `bin/` directory.
- Put raw/original data in a `data/` directory and never modify it.
- Put results in a `results/` directory. This includes cleaned-up data and figures (i.e., everything created using what’s in `bin` and `data`).
- Put documentation and manuscripts in a `docs/` directory.
- Refer to The Carpentries software installation guide if you’re having trouble.
C.2 The Basics of the Unix Shell
- A shell is a program that reads commands and runs other programs.
- The filesystem manages information stored on disk.
- Information is stored in files, which are located in directories (folders).
- Directories can also store other directories, which forms a directory tree.
- `pwd` prints the user’s current working directory.
- `/` on its own is the root directory of the whole filesystem.
- `ls` prints a list of files and directories.
- An absolute path specifies a location from the root of the filesystem.
- A relative path specifies a location in the filesystem starting from the current directory.
- `cd` changes the current working directory.
- `..` means the parent directory.
- `.` on its own means the current directory.
- `mkdir` creates a new directory.
- `cp` copies a file.
- `rm` removes (deletes) a file.
- `mv` moves (renames) a file or directory.
- `*` matches zero or more characters in a filename.
- `?` matches any single character in a filename.
- `wc` counts lines, words, and characters in its inputs.
- `man` displays the manual page for a given command; some commands also have a `--help` option.
C.3 Building Tools with the Unix Shell
- `cat` displays the contents of its inputs.
- `head` displays the first few lines of its input.
- `tail` displays the last few lines of its input.
- `sort` sorts its inputs.
- Use the up-arrow key to scroll up through previous commands to edit and repeat them.
- Use `history` to display recent commands and `!number` to repeat a command by number.
- Every process in Unix has an input channel called standard input and an output channel called standard output.
- `>` redirects a command’s output to a file, overwriting any existing content.
- `>>` appends a command’s output to a file.
- `<` redirects input to a command.
- A pipe `|` sends the output of the command on the left to the input of the command on the right.
- A `for` loop repeats commands once for every thing in a list.
- Every `for` loop must have a variable to refer to the thing it is currently operating on and a body containing commands to execute.
- Use `$name` or `${name}` to get the value of a variable.
C.4 Going Further with the Unix Shell
- Save commands in files (usually called shell scripts) for re-use.
- `bash filename` runs the commands saved in a file.
- `$@` refers to all of a shell script’s command-line arguments.
- `$1`, `$2`, etc., refer to the first command-line argument, the second command-line argument, etc.
- Place variables in quotes if the values might have spaces or other special characters in them.
- `find` prints a list of files with specific properties or whose names match patterns.
- `$(command)` inserts a command’s output in place.
- `grep` selects lines in files that match patterns.
- Use the `.bashrc` file in your home directory to set shell variables each time the shell runs.
- Use `alias` to create shortcuts for things you type frequently.
C.5 Building Command-Line Programs in Python
- Write command-line Python programs that can be run in the Unix shell like other command-line tools.
- If the user does not specify any input files, read from standard input.
- If the user does not specify any output files, write to standard output.
- Place all `import` statements at the start of a module.
- Use the value of `__name__` to determine if a file is being run directly or being loaded as a module.
- Use `argparse` to handle command-line arguments in standard ways.
- Use short options for common controls and long options for less common or more complicated ones.
- Use docstrings to document functions and scripts.
- Place functions that are used across multiple scripts in a separate file that those scripts can import.
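Several of these points can be seen together in one small script. The following is only a minimal sketch, assuming a hypothetical line-counting tool; the names `count_lines`, `parse_args`, `infiles`, and `--outfile` are illustrative and do not come from the chapter.

```python
"""Count lines in files, or in standard input if no files are given."""

# All imports sit at the start of the module.
import argparse
import sys

DESCRIPTION = "Count lines in files or in standard input."


def count_lines(stream):
    """Return the number of lines in an open text stream."""
    return sum(1 for _ in stream)


def parse_args(argv=None):
    """Parse command-line arguments (argv=None means use sys.argv)."""
    parser = argparse.ArgumentParser(description=DESCRIPTION)
    # Hypothetical arguments chosen for illustration.
    parser.add_argument("infiles", nargs="*",
                        help="input files (default: standard input)")
    parser.add_argument("-o", "--outfile", default=None,
                        help="output file (default: standard output)")
    return parser.parse_args(argv)


def main(argv=None):
    """Report a line count for each input."""
    args = parse_args(argv)
    out = open(args.outfile, "w") if args.outfile else sys.stdout
    try:
        if not args.infiles:
            # No input files were named, so read standard input.
            print(count_lines(sys.stdin), file=out)
        for name in args.infiles:
            with open(name) as reader:
                print(name, count_lines(reader), file=out)
    finally:
        if out is not sys.stdout:
            out.close()


# True only when the file is run directly, not when it is imported.
if __name__ == "__main__":
    main()
```

Because the work is done in functions rather than at the top level, another script could import `count_lines` from this file without triggering `main`.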
C.6 Using Git at the Command Line
- Use `git config` with the `--global` option to configure your username, email address, and other preferences once per machine.
- `git init` initializes a repository.
- Git stores all repository management data in the `.git` subdirectory of the repository’s root directory.
- `git status` shows the status of a repository.
- `git add` puts files in the repository’s staging area.
- `git commit` saves the staged content as a new commit in the local repository.
- `git log` lists previous commits.
- `git diff` shows the difference between two versions of the repository.
- Synchronize your local repository with a remote repository on a forge such as GitHub.
- `git remote` manages bookmarks pointing at remote repositories.
- `git push` copies changes from a local repository to a remote repository.
- `git pull` copies changes from a remote repository to a local repository.
- `git restore` and `git checkout` recover old versions of files.
- The `.gitignore` file tells Git what files to ignore.
C.7 Going Further with Git
- Use a branch-per-feature workflow to develop new features while leaving the master branch in working order.
- `git branch` creates a new branch.
- `git checkout` switches between branches.
- `git merge` merges changes from another branch into the current branch.
- Conflicts occur when files or parts of files are changed in different ways on different branches.
- Version control systems do not allow people to overwrite changes silently; instead, they highlight conflicts that need to be resolved.
- Forking a repository makes a copy of it on a server.
- Cloning a repository with `git clone` creates a local copy of a remote repository.
- Create a remote called `upstream` to point to the repository a fork was derived from.
- Create pull requests to submit changes from your fork to the upstream repository.
C.8 Working in Teams
- Welcome and nurture community members proactively.
- Create an explicit Code of Conduct for your project modeled on the Contributor Covenant.
- Include a license in your project so that it’s clear who can do what with the material.
- Create issues for bugs, enhancement requests, and discussions.
- Label issues to identify their purpose.
- Triage issues regularly and group them into milestones to track progress.
- Include contribution guidelines in your project that specify its workflow and its expectations of participants.
- Make rules about governance explicit.
- Use common-sense rules to make project meetings fair and productive.
- Manage conflict between participants rather than hoping it will take care of itself.
C.9 Automating Analyses with Make
- Make is a widely used build manager.
- A build manager re-runs commands to update files that are out of date.
- A build rule has targets, prerequisites, and a recipe.
- A target can be a file or a phony target that simply triggers an action.
- When a target is out of date with respect to its prerequisites, Make executes the recipe associated with its rule.
- Make executes as many rules as it needs to when updating files, but always respects prerequisite order.
- Make defines automatic variables such as `$@` (target), `$^` (all prerequisites), and `$<` (first prerequisite).
- Pattern rules can use `%` as a placeholder for parts of filenames.
- Makefiles can define variables using `NAME=value`.
- Make also has functions such as `$(wildcard...)` and `$(patsubst...)`.
- Use specially formatted comments to create self-documenting Makefiles.
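The idea of a target being "out of date" can be modeled in a few lines of Python. This is a simplified sketch of the timestamp comparison, not Make's actual implementation: the function name `is_out_of_date` is hypothetical, and it assumes every prerequisite exists and ignores phony targets.

```python
import os


def is_out_of_date(target, prerequisites):
    """Return True if target is missing or older than any prerequisite.

    Simplified model: assumes all prerequisites exist on disk.
    """
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    # The target is stale if any prerequisite was modified after it.
    return any(os.path.getmtime(p) > target_time for p in prerequisites)
```

A build manager built on this check would run a rule's recipe only when the function returns `True`, which is why unchanged files are not rebuilt.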
C.10 Configuring Programs
- Overlay configuration specifies settings for a program in layers, each of which overrides previous layers.
- Use a system-wide configuration file for general settings.
- Use a user-specific configuration file for personal preferences.
- Use a job-specific configuration file with settings for a particular run.
- Use command-line options to change things that commonly change.
- Use YAML or some other standard syntax to write configuration files.
- Save configuration information to make your research reproducible.
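The layering described above can be sketched with plain dictionaries. This is an illustration under stated assumptions: the `overlay` function and all setting names are hypothetical, the layers are written inline rather than loaded from YAML files, and the merge is shallow (nested settings would need extra handling).

```python
def overlay(*layers):
    """Merge configuration layers; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical settings, listed from most general to most specific.
system_config = {"font": "serif", "dpi": 100, "verbose": False}
user_config = {"font": "sans"}
job_config = {"dpi": 300}
command_line = {"verbose": True}

settings = overlay(system_config, user_config, job_config, command_line)
print(settings)
```

Saving the final `settings` alongside the results records exactly which values a particular run used.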
C.11 Testing Software
- Test software to convince people (including yourself) that software is correct enough and to make tolerances on “enough” explicit.
- Add assertions to code so that it checks itself as it runs.
- Write unit tests to check individual pieces of code.
- Write integration tests to check that those pieces work together correctly.
- Write regression tests to check if things that used to work no longer do.
- A test framework finds and runs tests written in a prescribed fashion and reports their results.
- Test coverage is the fraction of lines of code that are executed by a set of tests.
- Continuous integration re-builds and/or re-tests software every time something changes.
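As a small illustration of the first few points, the sketch below pairs an assertion inside a function with two unit tests. The `mean` function and the test names are hypothetical; the `test_` prefix follows the naming convention that pytest uses to discover tests, which is an assumption about the framework in use.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    # The assertion makes the function check its own input as it runs.
    assert len(values) > 0, "cannot take the mean of an empty sequence"
    return sum(values) / len(values)


def test_mean_of_single_value():
    assert mean([4.0]) == 4.0


def test_mean_with_tolerance():
    # Floating-point results are compared within an explicit tolerance.
    assert abs(mean([0.1, 0.2, 0.3]) - 0.2) < 1e-12
```

The second test makes the tolerance on "correct enough" explicit instead of relying on exact floating-point equality.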
C.12 Handling Errors
- Signal errors by raising exceptions.
- Use `try`/`except` blocks to catch and handle exceptions.
- Python organizes its standard exceptions in a hierarchy so that programs can catch and handle them selectively.
- “Throw low, catch high,” i.e., raise exceptions immediately but handle them at a higher level.
- Write error messages that help users figure out what to do to fix the problem.
- Store error messages in a lookup table to ensure consistency.
- Use a logging framework instead of `print` statements to report program activity.
- Separate logging messages into `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL` levels.
- Use `logging.basicConfig` to define basic logging parameters.
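The sketch below combines several of these points: a lookup table of error messages, an exception raised where the problem is detected, handling at a higher level, and logging in place of `print`. The names `ERRORS`, `check_count`, and `total_count` are hypothetical, and skipping bad entries is just one possible handling policy.

```python
import logging

# Define basic logging parameters once, near the start of the program.
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

# Lookup table that keeps error messages consistent.
ERRORS = {
    "negative_count": "Word count must be non-negative, got {value}.",
}


def check_count(value):
    """Raise ValueError as soon as an invalid count is seen (throw low)."""
    if value < 0:
        raise ValueError(ERRORS["negative_count"].format(value=value))
    return value


def total_count(values):
    """Sum the valid counts, handling errors at this level (catch high)."""
    total = 0
    for value in values:
        try:
            total += check_count(value)
        except ValueError as error:
            logging.warning("Skipping entry: %s", error)
    logging.info("Processed %d entries", len(values))
    return total
```

Catching only `ValueError` relies on Python's exception hierarchy: unrelated errors, such as a `TypeError` from a non-numeric entry, still propagate to the caller.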
C.13 Tracking Provenance
- Publish data and code as well as papers.
- Use DOIs to identify reports, datasets, and software releases.
- Use an ORCID to identify yourself as an author of a report, dataset, or software release.
- Data should be FAIR: findable, accessible, interoperable, and reusable.
- Put small datasets in version control repositories; store large ones on data sharing sites.
- Describe your software environment, analysis scripts, and data processing steps in reproducible ways.
- Make your analyses inspectable as well as reproducible.
C.14 Creating Packages with Python
- Use `setuptools` to build and distribute Python packages.
- Create a directory named `mypackage` containing a `setup.py` script with a subdirectory also called `mypackage` containing the package’s source files.
- Use semantic versioning for software releases.
- Use a virtual environment to test how your package installs without disrupting your main Python installation.
- Use `pip` to install Python packages.
- The default repository for Python packages is PyPI.
- Use TestPyPI to test the distribution of your package.
- Use a README file for package-level documentation.
- Use Sphinx to generate documentation for a package.
- Use Read the Docs to host package documentation online.
- Create a DOI for your package using GitHub’s Zenodo integration.
- Publish details of your package in a software journal so others can cite it.