Chapter 9 Automating Analyses with Make
The three rules of the Librarians of Time and Space are: 1) Silence; 2) Books must be returned no later than the last date shown; and 3) Do not interfere with the nature of causality.
— Terry Pratchett
It’s easy to run one program to process a single data file, but what happens when our analysis depends on many files, or when we need to re-do the analysis every time new data arrives? What should we do if the analysis has several steps that we have to do in a particular order?
If we try to keep track of this ourselves, we will inevitably forget some crucial steps and it will be hard for other people to pick up our work. Instead, we should use a build manager to keep track of what depends on what and run our analysis programs automatically. These tools were invented to help programmers compile complex software, but can be used to automate any workflow.
Our Zipf’s Law project currently includes these files:
zipf/
├── .gitignore
├── CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE.md
├── README.md
├── bin
│   ├── book_summary.sh
│   ├── collate.py
│   ├── countwords.py
│   ├── plotcounts.py
│   ├── script_template.py
│   └── utilities.py
├── data
│   ├── README.md
│   ├── dracula.txt
│   ├── frankenstein.txt
│   └── ...
└── results
    ├── dracula.csv
    ├── dracula.png
    └── ...
Now that the project’s main building blocks are in place, we’re ready to automate our analysis using a build manager. We will use a program called Make to do this so that every time we add a new book to our data, we can create a new plot of the word count distribution with a single command. Make works as follows:
1. Every time the operating system creates, reads, or changes a file, it updates a timestamp on the file to show when the operation took place. Make can compare these timestamps to figure out whether files are newer or older than one another.
2. A user can describe which files depend on each other by writing rules in a Makefile. For example, one rule could say that results/moby_dick.csv depends on data/moby_dick.txt, while another could say that the plot results/comparison.png depends on all of the CSV files in the results directory.
3. Each rule also tells Make how to update an out-of-date file. For example, the rule for Moby Dick could tell Make to run bin/countwords.py if the result file is older than either the raw data file or the program.
4. When the user runs Make, the program checks all of the rules in the Makefile and runs the commands needed to update any that are out of date. If there are transitive dependencies (i.e., if A depends on B and B depends on C), then Make will trace them through and run all of the commands it needs to in the right order.
9.1 Updating a Single File
To start automating our analysis,
let’s create a file called
Makefile in the root of our project
and add the following:
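A minimal version of this Makefile, consistent with the description that follows and with the fuller listing in Section 9.3, might look like this. Note that the two recipe lines must begin with a tab character, not spaces.

```make
# Regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt
	python bin/countwords.py \
	  data/moby_dick.txt > results/moby_dick.csv
```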
As in the shell and many other programming languages,
# indicates that the first line is a comment.
The second and third lines form a build rule:
the target of the rule is results/moby_dick.csv,
its single prerequisite is the file data/moby_dick.txt,
and the two are separated by a single colon (:).
There is no limit on the length of statement lines in Makefiles,
but to aid readability we have used a backslash (\) character to split
the lengthy third line in this example.
The target and prerequisite tell Make what depends on what.
The line below them describes the recipe
that will update the target if it is out of date.
The recipe consists of one or more shell commands,
each of which must be prefixed by a single tab character.
Spaces cannot be used instead of tabs here,
which can be confusing as they are interchangeable in most other programming languages.
In the rule above,
the recipe is “run bin/countwords.py on the raw data file
and put the output in a CSV file in the results directory.”
To test our rule, run make in the shell.
Make automatically looks for a file called Makefile,
follows the rules it contains,
and prints the commands that were executed.
In this case it displays:
python bin/countwords.py \
  data/moby_dick.txt > results/moby_dick.csv
If a Makefile indents a rule with spaces rather than tabs, Make produces an error message like this:
Makefile:3: *** missing separator. Stop.
When Make follows the rules in our Makefile, one of three things will happen:
1. If results/moby_dick.csv doesn’t exist, Make runs the recipe to create it.
2. If data/moby_dick.txt is newer than results/moby_dick.csv, Make runs the recipe to update the results.
3. If results/moby_dick.csv is newer than its prerequisite, Make does nothing.
In the first two cases, Make prints the commands it runs, along with anything those commands print to the screen via standard output or standard error. There is no screen output in this case, so we only see the command.
No matter what happened the first time we ran make,
if we run it again right away it does nothing
because our rule’s target is now up to date.
It tells us this by displaying the message:
make: `results/moby_dick.csv' is up to date.
We can check that it is telling the truth by listing the files with their timestamps, ordered by how recently they have been updated (for example, with ls -l -t data/moby_dick.txt results/moby_dick.csv):
-rw-r--r-- 1 amira staff 274967 Nov 29 12:58 results/moby_dick.csv
-rw-r--r-- 1 amira staff 1253891 Nov 27 20:56 data/moby_dick.txt
As a further test:
1. Delete results/moby_dick.csv and run make again. This is case #1, so Make runs the recipe.
2. Use touch data/moby_dick.txt to update the timestamp on the data file, then run make. This is case #2, so again, Make runs the recipe.
We don’t have to call our file
Makefile: if we prefer something like
workflows.mk, we can tell Make to read recipes from that file using
make -f workflows.mk.
9.2 Managing Multiple Files
Our Makefile documents exactly how to reproduce one specific result. Let’s add another rule to reproduce another result:
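Assuming the second result is the one for Jane Eyre (as the output later in this section suggests), the Makefile with both rules might read as follows. The new rule simply mirrors the first one with the file names changed.

```make
# Regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt
	python bin/countwords.py \
	  data/moby_dick.txt > results/moby_dick.csv

# Regenerate results for "Jane Eyre"
results/jane_eyre.csv : data/jane_eyre.txt
	python bin/countwords.py \
	  data/jane_eyre.txt > results/jane_eyre.csv
```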
When we run
make it tells us:
make: `results/moby_dick.csv' is up to date.
By default, Make only attempts to update the first target it finds in the Makefile,
which is called the default target.
In this case,
the first target is results/moby_dick.csv,
which is already up to date.
To update something else,
we need to tell Make specifically what we want
by naming the target, as in make results/jane_eyre.csv:
python bin/countwords.py \
  data/jane_eyre.txt > results/jane_eyre.csv
If we have to run
make once for each result,
we’re right back where we started.
However, we can add a rule to our Makefile to update all of our results at once.
We do this by creating a phony target
that doesn’t correspond to an actual file.
Let’s add this line to the top of our Makefile:
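Based on the complete Makefile shown in Section 9.4, the rule for this phony target would look something like the sketch below. It has prerequisites but no recipe.

```make
# Regenerate all results.
all : results/moby_dick.csv results/jane_eyre.csv
```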
There is no file called all,
and this rule doesn’t have any recipes of its own,
but when we run make all,
Make finds everything that
all depends on,
then brings each of those prerequisites up to date (Figure 9.1).
The order in which rules appear in the Makefile does not necessarily determine the order in which recipes are run. Make is free to run commands in any order so long as nothing is updated before its prerequisites are up to date.
We can use phony targets to automate and document other steps in our workflow.
For example, let’s add another target to our Makefile to delete all of the result files we have generated
so that we can start afresh.
By convention this target is called clean,
and we’ll place it below the two existing targets:
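A sketch of that rule, matching the version that appears in the full listing in Section 9.4, is shown here. The recipe line must start with a tab.

```make
# Remove all generated files.
clean :
	rm -f results/*.csv
```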
The -f flag to
rm means “force removal”:
if it is present,
rm won’t complain if the files we have told it to remove are already gone.
If we now run make clean,
Make will delete any results files we have.
This is a lot safer than typing
rm -f results/*.csv at the command line each time,
because if we mistakenly put a space before the *,
we would delete all of the CSV files in the project’s root directory.
Phony targets are very useful, but there is a catch. Try creating a directory called clean (for example, with mkdir clean) and then running make clean again:
make: `clean' is up to date.
Since there is a directory called clean,
Make thinks the target
clean in the Makefile refers to this directory.
Since the rule has no prerequisites,
it can’t be out of date,
so no recipes are executed.
We can unconfuse Make by putting this line at the top of Makefile to explicitly state which targets are phony:
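The declaration below is taken from the full Makefile shown in Section 9.4. It lists the targets that Make should treat as names of actions rather than as files or directories.

```make
.PHONY : all clean
```

With this in place, make clean runs its recipe even if a file or directory called clean exists.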
9.3 Updating Files When Programs Change
Our current Makefile says that each result file depends on the corresponding data file. That’s not entirely true: each result also depends on the program used to generate it. If we change our program, we should regenerate our results. To get Make to do that, we can change our prerequisites to include the program:
# Regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt bin/countwords.py
	python bin/countwords.py \
	  data/moby_dick.txt > results/moby_dick.csv

# Regenerate results for "Jane Eyre"
results/jane_eyre.csv : data/jane_eyre.txt bin/countwords.py
	python bin/countwords.py \
	  data/jane_eyre.txt > results/jane_eyre.csv
To run both of these rules,
we can type make all.
Alternatively, since all is the first target in our Makefile,
Make will use it if we just type
make on its own:
python bin/countwords.py \
  data/moby_dick.txt > results/moby_dick.csv
python bin/countwords.py \
  data/jane_eyre.txt > results/jane_eyre.csv
The exercises will explore how we can write a rule to tell us whether our results will be different after a change to a program without actually updating them. Rules like this can help us test our programs: if we don’t think an addition or modification ought to affect the results, but it would, we may have some debugging to do.
9.4 Reducing Repetition in a Makefile
Our Makefile now mentions
bin/countwords.py four times.
If we ever change the name of the program or move it to a different location,
we will have to find and replace each of those occurrences.
More importantly, this redundancy makes our Makefile harder to understand,
just as scattering magic numbers through programs
makes them harder to understand.
The solution is the same one we use in programs: define and use variables. Let’s modify the results regeneration code by creating variables for the word-counting script and the command used to run it. The entire file should now read:
.PHONY : all clean

COUNT=bin/countwords.py
RUN_COUNT=python $(COUNT)

# Regenerate all results.
all : results/moby_dick.csv results/jane_eyre.csv

# Regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt $(COUNT)
	$(RUN_COUNT) data/moby_dick.txt > results/moby_dick.csv

# Regenerate results for "Jane Eyre"
results/jane_eyre.csv : data/jane_eyre.txt $(COUNT)
	$(RUN_COUNT) data/jane_eyre.txt > results/jane_eyre.csv

# Remove all generated files.
clean :
	rm -f results/*.csv
Each definition takes the form NAME=value.
Variables are written in upper case by convention
so that they’ll stand out from filenames
(which are usually in lower case),
but Make doesn’t require this.
What is required is using parentheses to refer to the variable,
i.e., writing $(NAME) and not $NAME.
Why the Parentheses?
For historical reasons, Make interprets $NAME to be a variable called N followed by the three characters “AME”. If no variable called N exists, $NAME becomes AME, which is almost certainly not what we want.
As in programs, variables don’t just cut down on typing. They also tell readers that several things are always and exactly the same, which reduces cognitive load.
9.5 Automatic Variables
We could add a third rule to analyze a third novel and a fourth to analyze a fourth, but that won’t scale to hundreds or thousands of novels. Instead, we can write a generic rule that does what we want for every one of our data files.
To do this,
we need to understand Make’s automatic variables.
The first step is to use the very cryptic expression
$@ in the rule’s recipe
to mean “the target of the rule.”
It lets us write the target’s name once, in the rule’s first line, rather than repeating it in the recipe.
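As a sketch of the change, using the Moby Dick rule from Section 9.4 (the variable definitions are repeated so the fragment stands alone), the earlier form of the rule is kept as a comment for comparison:

```make
COUNT=bin/countwords.py
RUN_COUNT=python $(COUNT)

# Before:
#   results/moby_dick.csv : data/moby_dick.txt $(COUNT)
#       $(RUN_COUNT) data/moby_dick.txt > results/moby_dick.csv

# After: $@ stands in for the target, results/moby_dick.csv.
results/moby_dick.csv : data/moby_dick.txt $(COUNT)
	$(RUN_COUNT) data/moby_dick.txt > $@
```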
Make defines a value of
$@ separately for each rule,
so it always refers to that rule’s target.
$@ is an unfortunate name:
$TARGET would have been easier to understand,
but we’re stuck with it now.
The next step is to replace the explicit list of prerequisites in the recipe
with the automatic variable $^,
which means “all the prerequisites in the rule”:
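Applied to the Moby Dick rule, that substitution would presumably look like the fragment below. The comment notes what $^ expands to for this rule.

```make
COUNT=bin/countwords.py
RUN_COUNT=python $(COUNT)

# $^ expands to every prerequisite:
#   data/moby_dick.txt bin/countwords.py
results/moby_dick.csv : data/moby_dick.txt $(COUNT)
	$(RUN_COUNT) $^ > $@
```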
However, this doesn’t work.
The rule’s prerequisites are the novel and the word-counting program.
When Make expands the recipe,
the resulting command tries to process the program
as if it were a data file
(python bin/countwords.py data/moby_dick.txt bin/countwords.py > results/moby_dick.csv).
Make solves this problem with another automatic variable, $<,
which means “only the first prerequisite”.
Using it lets us rewrite our rule as:
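A sketch of the rewritten rule, again for Moby Dick and with the variable definitions repeated for completeness:

```make
COUNT=bin/countwords.py
RUN_COUNT=python $(COUNT)

# $< is the first prerequisite (the novel); $@ is the target.
results/moby_dick.csv : data/moby_dick.txt $(COUNT)
	$(RUN_COUNT) $< > $@
```

This only works because the data file is listed first among the prerequisites. If the order were reversed, $< would refer to the program instead.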
If you use this approach, the rule for Jane Eyre should be updated as well. The next section, however, includes instructions for generalizing rules.
9.6 Generic Rules
$< > $@ is even harder to read than
$@ on its own,
but we can now replace all the rules for generating results files
with one pattern rule
using the wildcard %,
which matches one or more characters in a filename.
Whatever matches % in the target also matches in the prerequisites,
so a rule whose target is results/%.csv and whose prerequisites are data/%.txt and $(COUNT)
will handle Jane Eyre, Moby Dick, The Time Machine, and every other novel in the data directory.
% cannot be used in recipes,
which is why $< and $@ are needed.
With this rule in place, our entire Makefile is reduced to:
.PHONY: all clean

COUNT=bin/countwords.py
RUN_COUNT=python $(COUNT)

# Regenerate all results.
all : results/moby_dick.csv results/jane_eyre.csv \
  results/time_machine.csv

# Regenerate result for any book.
results/%.csv : data/%.txt $(COUNT)
	$(RUN_COUNT) $< > $@

# Remove all generated files.
clean :
	rm -f results/*.csv
We now have fewer lines of text, but we’ve also included a third book. To test our shortened Makefile, let’s delete all of the results files by running make clean:
rm -f results/*.csv
and then re-create them by running make:
python bin/countwords.py data/moby_dick.txt > results/moby_dick.csv
python bin/countwords.py data/jane_eyre.txt > results/jane_eyre.csv
python bin/countwords.py data/time_machine.txt > results/time_machine.csv
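We can still rebuild individual files if we want, since Make will take the target filename we give on the command line and see if a pattern rule matches it. For example, if we update the timestamp of the data file with touch data/jane_eyre.txt and then run make results/jane_eyre.csv, Make displays:
python bin/countwords.py data/jane_eyre.txt > results/jane_eyre.csv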
9.7 Defining Sets of Files
Our analysis is still not fully automated:
if we add another book to the data directory,
we have to remember to add its name to the
all target in the Makefile as well.
Once again we will fix this in steps.
First, imagine that all the results files already exist
and we just want to update them.
We can define a variable called RESULTS
to be a list of all the results files
using the same wildcards we would use in the shell:
We can then rewrite
all to depend on that list:
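Taken together, the variable definition and the revised all rule might read as follows. This is a sketch of the intermediate step only; the variable name RESULTS matches the one used in the final Makefile.

```make
# Results files that already exist, using a shell-style wildcard.
RESULTS=results/*.csv

# Regenerate all results.
all : $(RESULTS)
```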
However, this only works if the results files already exist.
If one doesn’t,
its name won’t be included in RESULTS,
and Make won’t realize that we want to generate it.
What we really want is to generate the list of results files
based on the list of books in the data directory.
We can create that list using Make’s wildcard function:
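The definition below is the one used in the complete Makefile later in the chapter. It stores the matching file names in a variable called DATA.

```make
DATA=$(wildcard data/*.txt)
```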
This calls the function
wildcard with the argument data/*.txt.
The result is a list of all the text files in the data directory,
just as we would get with
data/*.txt in the shell.
The syntax is odd because functions were added to Make long after it was first written,
but at least they have readable names.
To check that this line does the right thing,
we can add another phony target called settings
that uses the shell command
echo to print the names and values of our variables:
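A sketch of such a target is shown below, with the variable definitions it relies on repeated so the fragment stands alone. The comment wording is an assumption; the echo commands are inferred from the output that follows.

```make
.PHONY : all clean settings

COUNT=bin/countwords.py
DATA=$(wildcard data/*.txt)

# Show variables' values.
settings :
	echo COUNT: $(COUNT)
	echo DATA: $(DATA)
```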
Let’s run make settings:
echo COUNT: bin/countwords.py
COUNT: bin/countwords.py
echo DATA: data/dracula.txt data/frankenstein.txt data/jane_eyre.txt data/moby_dick.txt data/sense_and_sensibility.txt data/sherlock_holmes.txt data/time_machine.txt
DATA: data/dracula.txt data/frankenstein.txt data/jane_eyre.txt data/moby_dick.txt data/sense_and_sensibility.txt data/sherlock_holmes.txt data/time_machine.txt
The output appears twice
because Make shows us the command it’s going to run before running it.
Adding @ before the command in the recipe (for example, @echo COUNT: $(COUNT)) prevents this,
which makes the output easier to read:
COUNT: bin/countwords.py
DATA: data/dracula.txt data/frankenstein.txt data/jane_eyre.txt data/moby_dick.txt data/sense_and_sensibility.txt data/sherlock_holmes.txt data/time_machine.txt
We now have the names of our input files.
To create a list of corresponding output files,
we use Make’s patsubst function
(short for pattern substitution):
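The call below is the one that appears in the complete Makefile later in the chapter. The DATA definition is repeated so that the fragment stands alone.

```make
DATA=$(wildcard data/*.txt)

# Turn each data/NAME.txt into results/NAME.csv.
RESULTS=$(patsubst data/%.txt,results/%.csv,$(DATA))
```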
The first argument to
patsubst is the pattern to look for,
which in this case is a text file in the data directory.
We use % to match the stem of the file’s name,
which is the part we want to keep.
The second argument is the replacement we want.
As in a pattern rule,
Make replaces % in this argument with whatever matched
% in the pattern,
which creates the name of the result file we want.
Finally, the third argument is what to do the substitution in,
which is our list of books’ names.
Let’s check the
RESULTS variable by adding another command, @echo RESULTS: $(RESULTS), to the settings target
and running make settings again:
COUNT: bin/countwords.py
DATA: data/dracula.txt data/frankenstein.txt data/jane_eyre.txt data/moby_dick.txt data/sense_and_sensibility.txt data/sherlock_holmes.txt data/time_machine.txt
RESULTS: results/dracula.csv results/frankenstein.csv results/jane_eyre.csv results/moby_dick.csv results/sense_and_sensibility.csv results/sherlock_holmes.csv results/time_machine.csv
DATA has the names of the files we want to process
and RESULTS automatically has the names of the corresponding result files.
Why haven’t we included
RUN_COUNT when assessing our variables’ values?
This is another place we can streamline our script,
by removing RUN_COUNT from the list of variables
and changing our regeneration rule:
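The revised rule, as it appears in the complete Makefile in Section 9.8, calls python on $(COUNT) directly:

```make
COUNT=bin/countwords.py

# Regenerate result for any book.
results/%.csv : data/%.txt $(COUNT)
	python $(COUNT) $< > $@
```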
Since the phony target
all depends on $(RESULTS)
(i.e., all the files whose names appear in the variable RESULTS),
we can regenerate all the results in one step.
Running make clean first displays:
rm -f results/*.csv
and running make all then displays:
python bin/countwords.py data/dracula.txt > results/dracula.csv
python bin/countwords.py data/frankenstein.txt > results/frankenstein.csv
python bin/countwords.py data/jane_eyre.txt > results/jane_eyre.csv
python bin/countwords.py data/moby_dick.txt > results/moby_dick.csv
python bin/countwords.py data/sense_and_sensibility.txt > results/sense_and_sensibility.csv
python bin/countwords.py data/sherlock_holmes.txt > results/sherlock_holmes.csv
python bin/countwords.py data/time_machine.txt > results/time_machine.csv
Our workflow is now just two steps: add a data file and run Make. This is a big improvement over running things manually, particularly as we start to add more steps like merging data files and generating plots.
9.8 Documenting a Makefile
Every well-behaved program should tell people how to use it (Taschuk and Wilson 2017).
If we run make --help,
we get a (very) long list of options that Make understands,
but nothing about our specific workflow.
We could create another phony target called
help that prints a list of available commands:
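One way such a hand-written target could look is sketched below. The exact messages are illustrative assumptions, modeled on the comments used elsewhere in this chapter.

```make
.PHONY : help

# Show help (a hand-maintained list of targets).
help :
	@echo "all : regenerate all results."
	@echo "results/%.csv : regenerate result for any book."
	@echo "clean : remove all generated files."
	@echo "settings : show variables' values."
	@echo "help : show this message."
```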
but sooner or later we will add a target or rule and forget to update this list.
A better approach is to format some comments in a special way
and then extract and display those comments when asked to.
We can use ## (a double comment marker) to indicate the lines we want displayed
and grep (Section 4.5) to pull these lines out of the file:
.PHONY: all clean help settings

COUNT=bin/countwords.py
DATA=$(wildcard data/*.txt)
RESULTS=$(patsubst data/%.txt,results/%.csv,$(DATA))

## all : regenerate all results.
all : $(RESULTS)

## results/%.csv : regenerate result for any book.
results/%.csv : data/%.txt $(COUNT)
	python $(COUNT) $< > $@

## clean : remove all generated files.
clean :
	rm -f results/*.csv

## settings : show variables' values.
settings :
	@echo COUNT: $(COUNT)
	@echo DATA: $(DATA)
	@echo RESULTS: $(RESULTS)

## help : show this message.
help :
	@grep '^##' ./Makefile
Running make help now displays:
## all : regenerate all results.
## results/%.csv : regenerate result for any book.
## clean : remove all generated files.
## settings : show variables' values.
## help : show this message.
The exercises will explore how to format this more readably.
9.9 Automating Entire Analyses
To finish our discussion of Make,
let’s automatically generate a collated list of word frequencies.
The target is a file called results/collated.csv
that depends on the results generated by countwords.py.
To create it,
we add or change these lines in our Makefile:
# ...phony targets and previous variable definitions...

COLLATE=bin/collate.py

## all : regenerate all results.
all : results/collated.csv

## results/collated.csv : collate all results.
results/collated.csv : $(RESULTS) $(COLLATE)
	mkdir -p results
	python $(COLLATE) $(RESULTS) > $@

# ...other rules...

## settings : show variables' values.
settings :
	@echo COUNT: $(COUNT)
	@echo DATA: $(DATA)
	@echo RESULTS: $(RESULTS)
	@echo COLLATE: $(COLLATE)

# ...help rule...
The first two lines tell Make about the collation program,
while the change to
all tells it what the final target of our pipeline is.
Since this target depends on the results files for single novels,
make all will regenerate all of those automatically.
The rule to regenerate
results/collated.csv should look familiar by now:
it tells Make that all of the individual results have to be up-to-date
and that the final result should be regenerated if the program used to create it has changed.
One difference between the recipe in this rule and the recipes we’ve seen before
is that this recipe uses
$(RESULTS) directly instead of an automatic variable.
We have written the rule this way because
there isn’t an automatic variable that means “all but the last prerequisite,”
so there’s no way to use automatic variables that wouldn’t result in us trying to process our program.
We can also add the
plotcounts.py script to this workflow
and update the all and
settings rules accordingly.
Note that there is no
> needed before the $@,
because the default action of
plotcounts.py is to write to a file
rather than to standard output.
# ...phony targets and previous variable definitions...

PLOT=bin/plotcounts.py

## all : regenerate all results.
all : results/collated.png

## results/collated.png: plot the collated results.
results/collated.png : results/collated.csv
	python $(PLOT) $< --outfile $@

# ...other rules...

## settings : show variables' values.
settings :
	@echo COUNT: $(COUNT)
	@echo DATA: $(DATA)
	@echo RESULTS: $(RESULTS)
	@echo COLLATE: $(COLLATE)
	@echo PLOT: $(PLOT)

# ...help...
make all should now generate the new
collated.png plot (Figure 9.2):
python bin/collate.py results/time_machine.csv results/moby_dick.csv results/jane_eyre.csv results/dracula.csv results/sense_and_sensibility.csv results/sherlock_holmes.csv results/frankenstein.csv > results/collated.csv
python bin/plotcounts.py results/collated.csv --outfile results/collated.png
alpha: 1.1712445413685917
Finally, we can update the clean target
to only remove files created by the Makefile.
It is a good habit to do this rather than using the asterisk wildcard to remove all files,
since you might manually place files in the results directory
and forget that these will be cleaned up when you run make clean.
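The narrower clean rule, as it appears in the final Makefile, names each generated file explicitly. The variable definitions are repeated here so the fragment stands alone.

```make
DATA=$(wildcard data/*.txt)
RESULTS=$(patsubst data/%.txt,results/%.csv,$(DATA))

## clean : remove all generated files.
clean :
	rm $(RESULTS) results/collated.csv results/collated.png
```

Because this recipe uses rm without -f, it reports an error if any of the listed files is already missing.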
Make’s reliance on shell commands instead of direct calls to functions in Python sometimes makes it clumsy to use. However, that also makes it very flexible: a single Makefile can run shell commands and programs written in a variety of languages, which makes it a great way to assemble pipelines out of whatever is lying around.
Programmers have created many replacements for Make in the 45 years since it was first created—so many, in fact, that none have attracted enough users to displace it. If you would like to explore them, check out Snakemake (for Python). If you want to go deeper, Smith (2011) describes the design and implementation of several build managers.
9.11 Exercises
Our Makefile currently reads as follows:
.PHONY: all clean help settings

COUNT=bin/countwords.py
COLLATE=bin/collate.py
PLOT=bin/plotcounts.py
DATA=$(wildcard data/*.txt)
RESULTS=$(patsubst data/%.txt,results/%.csv,$(DATA))

## all : regenerate all results.
all : results/collated.png

## results/collated.png: plot the collated results.
results/collated.png : results/collated.csv
	python $(PLOT) $< --outfile $@

## results/collated.csv : collate all results.
results/collated.csv : $(RESULTS) $(COLLATE)
	@mkdir -p results
	python $(COLLATE) $(RESULTS) > $@

## results/%.csv : regenerate result for any book.
results/%.csv : data/%.txt $(COUNT)
	python $(COUNT) $< > $@

## clean : remove all generated files.
clean :
	rm $(RESULTS) results/collated.csv results/collated.png

## settings : show variables' values.
settings :
	@echo COUNT: $(COUNT)
	@echo DATA: $(DATA)
	@echo RESULTS: $(RESULTS)
	@echo COLLATE: $(COLLATE)
	@echo PLOT: $(PLOT)

## help : show this message.
help :
	@grep '^##' ./Makefile
A number of the exercises below ask you to make further edits to Makefile.
9.11.1 Report results that would change
How can you get
make to show the commands it would run
without actually running them?
(Hint: look at the manual page.)
9.11.2 Useful options
- What does Make’s -B option do and when is it useful?
- What about the
- What about the
9.11.3 Make sure the output directory exists
One of our build recipes includes mkdir -p results.
What does this do and why is it useful?
9.11.4 Print the title and author
The build rule for regenerating the result for any book is currently:
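For reference, this is the rule as it appears in the Makefile listed at the start of these exercises (the COUNT definition is repeated so the fragment stands alone):

```make
COUNT=bin/countwords.py

## results/%.csv : regenerate result for any book.
results/%.csv : data/%.txt $(COUNT)
	python $(COUNT) $< > $@
```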
Add an extra line to the recipe that uses the book_summary.sh script
to print the title and author of the book to the screen.
Use @bash so that the command itself isn’t printed to the screen
and don’t forget to update the settings build rule to include the book_summary.sh script.
If you’ve successfully made those changes, you should get the following output for Dracula:
Title: Dracula
Author: Bram Stoker
python bin/countwords.py data/dracula.txt > results/dracula.csv
9.11.5 Create all results
The default target of our final Makefile is all.
Add a target to Makefile so that
make results creates or updates any result files that are missing or out of date,
but does not regenerate results/collated.csv.
9.11.6 The perils of shell wildcards
What is wrong with writing the rule for
results/collated.csv like this:
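The rule in question is not reproduced above. Judging from the exercise title and the hint that follows, it presumably uses a shell wildcard as its prerequisite list, along the lines of this hypothetical sketch:

```make
COLLATE=bin/collate.py

# Hypothetical version of the rule, assumed for illustration only.
results/collated.csv : results/*.csv
	python $(COLLATE) $^ > $@
```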
(The fact that the result no longer depends on the program used to create it isn’t the biggest problem.)
9.11.7 Making documentation more readable
We can format the documentation in our Makefile more readably using this command:
Using man and online search,
explain what every part of this recipe does.
A next step in automating this analysis might include
moving some of the variable definitions
into a separate file
and using the
include command to access those definitions in the existing Makefile.
Under what circumstances would this strategy be useful?
9.12 Key Points
- Make is a widely used build manager.
- A build manager re-runs commands to update files that are out of date.
- A build rule has targets, prerequisites, and a recipe.
- A target can be a file or a phony target that simply triggers an action.
- When a target is out of date with respect to its prerequisites, Make executes the recipe associated with its rule.
- Make executes as many rules as it needs to when updating files, but always respects prerequisite order.
- Make defines automatic variables such as $@ (target), $^ (all prerequisites), and $< (first prerequisite).
- Pattern rules can use % as a placeholder for parts of filenames.
- Makefiles can define variables using NAME=value.
- Make also has functions such as $(wildcard ...) and $(patsubst ...).
- Use specially formatted comments to create self-documenting Makefiles.