A Solutions

The exercises included in this book represent a wide variety of problems, from multiple choice questions to larger coding tasks. It’s relatively straightforward to indicate a correct answer for the former, though there may be unanticipated cases in which the specific software version you’re using leads to alternative answers being preferable. It’s more difficult to identify the “right” answer for the latter, since there are often many ways to accomplish the same task with code. Here we present possible solutions that the authors generally agree represent “good” code, but we encourage you to explore additional approaches.

Commits noted in a solution reference Amira’s zipf repository on GitHub, which allows you to see the specific lines of a file modified to arrive at the answer.

Chapter 2

Exercise 2.10.1

The -l option makes ls use a long listing format, showing not only the file/directory names but also additional information, such as the file size and the time of its last modification. Using the -h option together with -l makes the file size “human readable”, i.e., displayed as something like 5.3K instead of 5369.
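
For example (the file name, owner, group, and date here are illustrative):

$ ls -l titles.txt
-rw-r--r--  1 amira  staff  5369 Jan 15 10:42 titles.txt
$ ls -l -h titles.txt
-rw-r--r--  1 amira  staff  5.3K Jan 15 10:42 titles.txt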

Exercise 2.10.2

The command ls -R -t recursively lists the contents of each directory, with the contents of each directory sorted by time of last change.

Exercise 2.10.3

  1. No: . stands for the current directory.
  2. No: / stands for the root directory.
  3. No: Amira’s home directory is /Users/amira.
  4. No: This goes up two levels, i.e., ends in /Users.
  5. Yes: ~ stands for the user’s home directory, in this case /Users/amira.
  6. No: This would navigate into a directory home in the current directory if it exists.
  7. Yes: Starting from the home directory ~, this command goes into data then back (using ..) to the home directory.
  8. Yes: Shortcut to go back to the user’s home directory.
  9. Yes: Goes up one level.
  10. Yes: Same as the previous answer, but with an unnecessary . (indicating the current directory).

Exercise 2.10.4

  1. No: There is a directory backup in /Users.
  2. No: This is the content of /Users/sami/backup, but with .. we asked for one level further up.
  3. No: Same as previous explanation, but results shown as directories (which is what the -F option specifies).
  4. Yes: ../backup/ refers to /Users/backup/.

Exercise 2.10.5

  1. No: pwd is not the name of a directory.
  2. Yes: ls without directory argument lists files and directories in the current directory.
  3. Yes: Uses the absolute path explicitly.

Exercise 2.10.6

The touch command updates a file’s timestamp. If no file exists with the given name, touch will create one. Assuming you don’t already have my_file.txt in your working directory, touch my_file.txt will create the file. When you inspect the file with ls -l, note that the size of my_file.txt is 0 bytes. In other words, it contains no data. If you open my_file.txt using your text editor, it is blank.
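
For example (the owner, group, and date shown are illustrative):

$ touch my_file.txt
$ ls -l my_file.txt
-rw-r--r--  1 amira  staff  0 Jan 15 10:42 my_file.txt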

Some programs do not generate output files themselves, but instead require that empty files have already been generated. When the program is run, it searches for an existing file to populate with its output. The touch command allows you to efficiently generate a blank text file to be used by such programs.

Exercise 2.10.7

rm: remove regular file 'my_file.txt'? y

The -i option will prompt before (every) removal (use y to confirm deletion or n to keep the file). The Unix shell doesn’t have a trash bin, so all the files removed will disappear forever. By using the -i option, we have the chance to check that we are deleting only the files that we want to remove.

Exercise 2.10.8

$ mv ../data/chapter1.txt ../data/chapter2.txt .

Recall that .. refers to the parent directory (i.e., one above the current directory) and that . refers to the current directory.

Exercise 2.10.9

  1. No: While this would create a file with the correct name, the incorrectly named file still exists in the directory and would need to be deleted.
  2. Yes: This would work to rename the file.
  3. No: The period (.) indicates where to move the file, but does not provide a new filename; identical filenames cannot be created.
  4. No: The period (.) indicates where to copy the file, but does not provide a new filename; identical filenames cannot be created.

Exercise 2.10.10

We start in the /Users/amira/data directory, containing a single file, books.dat. We create a new folder called doc and move (mv) the file books.dat to that new folder. Then we make a copy (cp) of the file we just moved, naming the copy books-saved.dat.

The tricky part here is the location of the copied file. Recall that .. means “go up a level,” so the copied file is now in /Users/amira. Notice that .. is interpreted with respect to the current working directory, not with respect to the location of the file being copied. So, the only thing that will show using ls (in /Users/amira/data) is the doc folder.

  1. No: books-saved.dat is located at /Users/amira
  2. Yes.
  3. No: books.dat is located at /Users/amira/data/doc
  4. No: books-saved.dat is located at /Users/amira

Exercise 2.10.11

If given more than one filename followed by a directory name (i.e., the destination directory must be the last argument), cp copies the files to the named directory.

If given three filenames, cp throws an error because it is expecting a directory name as the last argument.
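
For example, with GNU coreutils the error message looks something like this:

$ cp dracula.txt frankenstein.txt jane_eyre.txt
cp: target 'jane_eyre.txt' is not a directory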

Exercise 2.10.12

  1. Yes: Shows all files whose names contain any two characters (each ? matches a single character) followed by the letter n, then zero or more characters (*) followed by txt.

  2. No: Shows all files whose names start with zero or more characters (*) followed by e_, zero or more characters (*), then txt. The output includes the two desired books, but also time_machine.txt.

  3. No: Shows all files whose names start with zero or more characters (*) followed by n, zero or more characters (*), then txt. The output includes the two desired books, but also frankenstein.txt and time_machine.txt.

  4. No: Shows all files whose names start with zero or more characters (*) followed by n, a single character ?, e, zero or more characters (*), then txt. The output shows frankenstein.txt and sense_and_sensibility.txt.

Exercise 2.10.13

$ mv *.txt data

Amira needs to move her files books.txt and titles.txt to the data directory. The shell will expand *.txt to match all .txt files in the current directory. The mv command then moves the list of .txt files to the data directory.

Exercise 2.10.14

  1. Yes: This accurately re-creates the directory structure.

  2. Yes: This accurately re-creates the directory structure.

  3. No: The first line of this code set gives an error:

    mkdir: 2016-05-20/data: No such file or directory

    mkdir won’t create a subdirectory for a directory that doesn’t yet exist (unless you use an option like -p that explicitly creates parent directories).

  4. No: This creates raw and processed directories at the same level as data:

    2016-05-20/
        ├── data
        ├── processed
        └── raw

Exercise 2.10.15

  1. A solution using two wildcard expressions:

    $ ls s*.txt
    $ ls t*.txt

  2. When there are no files beginning with s and ending in .txt, or when there are no files beginning with t and ending in .txt.

Exercise 2.10.16

  1. No: This would remove only .csv files with one-character names.
  2. Yes: This removes only files ending in .csv.
  3. No: The shell would expand * to match everything in the current directory, so the command would try to remove all matched files and an additional file called .csv.
  4. No: The shell would expand *.* to match all files with any extension, so this command would delete all files in the current directory.

Exercise 2.10.17

novel-????-[ab]*.{txt,pdf} matches:

  • Files whose names start with novel-,
  • which is then followed by exactly four characters (since each ? matches one character),
  • followed by another literal -,
  • followed by either the letter a or the letter b,
  • followed by zero or more other characters (the *),
  • followed by .txt or .pdf.

Chapter 3

Exercise 3.8.1

echo hello > testfile01.txt writes the string “hello” to testfile01.txt, but the file gets overwritten each time we run the command.

echo hello >> testfile02.txt writes “hello” to testfile02.txt, but appends the string to the file if it already exists (i.e., when we run it for the second time).
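
Running each command twice and then inspecting the files makes the difference visible:

$ echo hello > testfile01.txt
$ echo hello > testfile01.txt
$ cat testfile01.txt
hello
$ echo hello >> testfile02.txt
$ echo hello >> testfile02.txt
$ cat testfile02.txt
hello
hello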

Exercise 3.8.2

  1. No: This results from only running the first line of code (head).
  2. No: This results from only running the second line of code (tail).
  3. Yes: The first line writes the first three lines of dracula.txt, the second line appends the last two lines of dracula.txt to the same file.
  4. No: We would need to pipe the commands to obtain this answer (head -n 3 dracula.txt | tail -n 2 > extracted.txt).

Exercise 3.8.3

Try running each line of code in the data directory.

  1. No: This incorrectly uses redirect (>), and will result in an error.
  2. No: The number of lines desired for head is reported incorrectly; this will result in an error.
  3. No: This will extract the first three lines from the wc output, which has not yet been sorted by number of lines.
  4. Yes: This output correctly orders and connects each of the commands.

Exercise 3.8.4

To obtain a list of unique results from these data, we need to run:

$ sort genres.txt | uniq

It makes sense that uniq is almost always run after sort, because sorting places duplicate lines next to each other, and uniq only compares adjacent lines. If uniq did not rely on adjacency, it would have to compare each line to every other line. For a small file those extra comparisons wouldn’t matter much, but for large files they would make the command far too slow.
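
As a quick illustration, suppose a file contained these three lines (hypothetical contents):

mystery
horror
mystery

Running uniq directly on the file would print all three lines, because the two mystery lines are not adjacent; sorting first and then running uniq prints only horror and mystery.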

Exercise 3.8.5

When used on a single file, cat prints the contents of that file to the screen. In this case, the contents of titles.txt are sent as input to head -n 5, so the first five lines of titles.txt are output. These five lines are used as the input for tail -n 3, which results in lines 3–5 as output. These are used as input to the final command, which sorts them in reverse order. The results are written to the file final.txt, the contents of which are:

Sense and Sensibility,1811
Moby Dick,1851
Jane Eyre,1847

Exercise 3.8.6

cut selects substrings from a line by:

  • breaking the string into pieces wherever it finds a separator (-d ,), which in this case is a comma, and
  • keeping one or more of the resulting fields/columns (-f 2).

In this case, the output is only the dates from titles.txt, since this is in the second column.

$ cut -d , -f 2 titles.txt
1897
1818
1847
1851
1811
1892
1897
1895
1847

Exercise 3.8.7

  1. No: This sorts by the book title.
  2. No: This results in an error because sort is being used incorrectly.
  3. No: There are duplicate dates in the output because they have not been sorted first.
  4. Yes: This results in the output shown below.
  5. No: This extracts the desired data (below), but then counts the number of lines, resulting in the incorrect answer.
   1 1811
   1 1818
   2 1847
   1 1851
   1 1892
   1 1895
   2 1897

If you have difficulty understanding the answers above, try running the commands or sub-sections of the pipelines (e.g., the code between pipes).

Exercise 3.8.8

The difference between the versions is whether the code after echo is inside quotation marks.

The first version redirects the output from echo analyze $file to a file (analyzed-$file). This doesn’t allow us to preview the commands, but instead creates files (analyzed-$file) containing the text analyze $file.

The second version will allow us to preview the commands. This prints to screen everything enclosed in the quotation marks, expanding the loop variable name (prefixed with $).

Try both versions for yourself to see the output. Be sure to open the analyzed-* files to view their contents.

Exercise 3.8.9

The first version gives the same output on each iteration through the loop. Bash expands the wildcard *.txt to match all files ending in .txt and then lists them using ls. The expanded loop would look like this (we’ll only show the first two data files):

$ for datafile in dracula.txt  frankenstein.txt ...
> do
>   ls dracula.txt  frankenstein.txt ...
dracula.txt  frankenstein.txt ...
dracula.txt  frankenstein.txt ...
...

The second version lists a different file on each loop iteration. The value of the datafile variable is evaluated using $datafile, and then listed using ls.

dracula.txt
frankenstein.txt
jane_eyre.txt
moby_dick.txt
sense_and_sensibility.txt
sherlock_holmes.txt
time_machine.txt

Exercise 3.8.10

The first version results in only dracula.txt output, because it is the only file beginning in “d”.

The second version results in the following, because these files all contain a “d” with zero or more characters before and after:

README.md
dracula.txt
moby_dick.txt
sense_and_sensibility.txt

Exercise 3.8.11

Both versions write the first 16 lines (head -n 16) of each book to a file (headers.txt).

The first version results in the text from each file being overwritten in each iteration because of use of > as a redirect.

The second version uses >>, which appends the lines to the existing file. This is preferable because the final headers.txt includes the first 16 lines from all files.

Exercise 3.8.12

If a command causes something to crash or hang, it might be useful to know what that command was, in order to investigate the problem. If the command were only recorded after it finished running, we would not have a record of the last command run in the event of a crash.

Chapter 4

Exercise 4.8.1

$ cd ~/zipf

Change into the zipf directory, which is located in the home directory (designated by ~).

$ for file in $(find . -name "*.bak")
> do
>   rm $file
> done

Find all the files ending in .bak and remove them one by one.

$ rm bin/summarize_all_books.sh

Remove the summarize_all_books.sh script.

$ rm -r results

Recursively remove each file in the results directory and then remove the directory itself. (It is necessary to remove all the files first because you cannot remove a non-empty directory.)

Exercise 4.8.2

Running this script with the given parameters will print the first and last line from each file in the directory ending in .txt.

  1. No: This answer misinterprets the lines printed.
  2. Yes.
  3. No: This answer includes the wrong files.
  4. No: Leaving off the quotation marks would result in an error.

Exercise 4.8.3

One possible script (longest.sh) to accomplish this task:

# Shell script which takes two arguments:
#    1. a directory name
#    2. a file extension
# and prints the name of the file in that directory
# with the most lines which matches the file extension.
#
# Usage: bash longest.sh directory/ txt

wc -l $1/*.$2 | sort -n | tail -n 2 | head -n 1
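
Note that wc prints a total line when given more than one file; after sorting numerically it is the last line of the output, so tail -n 2 | head -n 1 selects the line for the largest file while skipping that total.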

Exercise 4.8.4

  1. script1.sh will print the names of all files in the directory on a single line, e.g., README.md dracula.txt frankenstein.txt jane_eyre.txt moby_dick.txt script1.sh sense_and_sensibility.txt sherlock_holmes.txt time_machine.txt. Although *.txt is included when running the script, the commands run by the script do not reference $1.
  2. script2.sh will print the contents of the first three files ending in .txt; the three variables ($1, $2, $3) refer to the first, second, and third argument entered after the script, respectively.
  3. script3.sh will print the name of each file ending in .txt, since $@ refers to all the arguments (e.g., filenames) given to a shell script. The list of files would be followed by .txt: dracula.txt frankenstein.txt jane_eyre.txt moby_dick.txt sense_and_sensibility.txt sherlock_holmes.txt time_machine.txt.txt.

Exercise 4.8.5

  1. No: This command extracts any line containing “he”, either as a word or within a word.
  2. No: This results in the same output as the answer for #1. -E allows the search term to represent an extended regular expression, but the search term is simple enough that it doesn’t make a difference in the result.
  3. Yes: -w means to return only matches for the word “he”.
  4. No: -i means to invert the search result; this would return all lines except the one we desire.

Exercise 4.8.6

# Obtain unique years from multiple comma-delimited
# lists of titles and publication years
#
# Usage: bash year.sh file1.txt file2.txt ...

for filename in $*
do
  cut -d , -f 2 $filename | sort -n | uniq
done

Exercise 4.8.7

One possible solution:

for sister in Elinor Marianne
do
    echo $sister:
    grep -o -w $sister sense_and_sensibility.txt | wc -l
done

The -o option prints only the matching part of a line.

An alternative (but possibly less accurate) solution is:

for sister in Elinor Marianne
do
    echo $sister:
    grep -o -c -w $sister sense_and_sensibility.txt
done

This solution is potentially less accurate because grep -c only reports the number of lines matched. The total number of matches reported by this method will be lower if there is more than one match per line.

Exercise 4.8.8

  1. Yes: This returns data/jane_eyre.txt.
  2. Maybe: This option may work on your computer, but may not behave consistently across all shells because expansion of the wildcard (*e.txt) may prevent piping from working correctly. We recommend enclosing *e.txt in quotation marks, as in answer 1.
  3. No: This searches the contents of files for lines matching “machine”, rather than the filenames.
  4. See above.

Exercise 4.8.9

  1. Find all files with a .dat extension recursively from the current directory.
  2. Count the number of lines each of these files contains.
  3. Sort the output from step 2 numerically.

Exercise 4.8.10

The following command works if your working directory is Desktop/ and you replace “username” with your username on your computer. The argument to -mtime is negative because -1 means “modified less than one day ago.”

$ find . -type f -mtime -1 -user username

Chapter 5

Exercise 5.11.1

Running a Python statement directly from the command line is useful as a basic calculator and for simple string operations, since these commands occur in one line of code. More complicated commands will require multiple statements; when run using python -c, statements must be separated by semi-colons:

$ python -c "import math; print(math.log(123))"

Multiple statements, therefore, quickly become more troublesome to run in this manner.

Exercise 5.11.2

The my_ls.py script could read as follows:

"""List the files in a given directory with a given suffix."""

import argparse
import glob


def main(args):
    """Run the program."""
    dir = args.dir if args.dir[-1] == '/' else args.dir + '/'
    glob_input = dir + '*.' + args.suffix
    glob_output = sorted(glob.glob(glob_input))
    for item in glob_output:
        print(item)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('dir', type=str, help='Directory')
    parser.add_argument('suffix', type=str,
                        help='File suffix (e.g. py, sh)')
    args = parser.parse_args()
    main(args)

Exercise 5.11.3

The sentence_endings.py script could read as follows:

"""Count the occurrence of different sentence endings."""

import argparse


def main(args):
    """Run the command line program."""
    text = args.infile.read()
    for ending in ['.', '?', '!']:
        count = text.count(ending)
        print(f'Number of {ending} is {count}')


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', type=argparse.FileType('r'),
                        nargs='?', default='-',
                        help='Input file name')
    args = parser.parse_args()
    main(args)

Exercise 5.11.4

While there may be other ways for plotcounts.py to meet the requirements of the exercise, we’ll be using this script in subsequent chapters, so we recommend that the script read as follows:

"""Plot word counts."""

import argparse

import pandas as pd


def main(args):
    """Run the command line program."""
    df = pd.read_csv(args.infile, header=None,
                     names=('word', 'word_frequency'))
    df['rank'] = df['word_frequency'].rank(ascending=False,
                                           method='max')
    df['inverse_rank'] = 1 / df['rank']
    ax = df.plot.scatter(x='word_frequency',
                         y='inverse_rank',
                         figsize=[12, 6],
                         grid=True,
                         xlim=args.xlim)
    ax.figure.savefig(args.outfile)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', type=argparse.FileType('r'),
                        nargs='?', default='-',
                        help='Word count csv file name')
    parser.add_argument('--outfile', type=str,
                        default='plotcounts.png',
                        help='Output image file name')
    parser.add_argument('--xlim', type=float, nargs=2,
                        metavar=('XMIN', 'XMAX'),
                        default=None, help='X-axis limits')
    args = parser.parse_args()
    main(args)

Chapter 6

Exercise 6.11.1

Amira does not need to make the heaps-law subdirectory a Git repository because the zipf repository will track everything inside it regardless of how deeply nested.

Amira shouldn’t run git init in heaps-law because nested Git repositories can interfere with each other. If someone commits something in the inner repository, Git will not know whether to record the changes in that repository, the outer one, or both.

Exercise 6.11.2

git status now shows:

On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)
    example.txt

nothing added to commit but untracked files present
(use "git add" to track)

Nothing has happened to the file; it still exists but Git no longer has it in the staging area. git rm --cached is equivalent to the newer git restore --staged. The older commands still work in newer versions of Git, and you may encounter references to them when reading help documentation. If you created this file in your zipf project, we recommend removing it before proceeding.

Exercise 6.11.3

If we make a few changes to .gitignore such that it now reads:

__pycache__ this is a change

this is another change

then git diff would show:

diff --git a/.gitignore b/.gitignore
index bee8a64..5c83419 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,3 @@
-__pycache__
+__pycache__ this is a change
+
+this is another change

Whereas git diff --word-diff shows:

diff --git a/.gitignore b/.gitignore
index bee8a64..5c83419 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,3 @@
__pycache__ {+this is a change+}

{+this is another change+}

Depending on the nature of the changes you are viewing, the latter may be easier to interpret since it shows exactly what has been changed.

Exercise 6.11.4

  1. Maybe: would only create a commit if the file has already been staged.
  2. No: would try to create a new repository, which results in an error if a repository already exists.
  3. Yes: first adds the file to the staging area, then commits.
  4. No: would result in an error, as it would try to commit a file “my recent changes” with the message “myfile.txt.”

Exercise 6.11.5

  1. Go into your home directory with cd ~.
  2. Create a new folder called bio with mkdir bio.
  3. Make the repository your working directory with cd bio.
  4. Turn it into a repository with git init.
  5. Create your biography using nano or another text editor.
  6. Add the file with git add me.txt, then commit it with git commit -m "Some message". (Note that git commit -a only stages changes to files Git is already tracking, so it cannot be used to commit a brand-new file in a single step.)
  7. Modify the file.
  8. Use git diff to see the differences.

Exercise 6.11.6

  1. Create employment.txt using an editor like Nano.
  2. Add both me.txt and employment.txt to the staging area with git add *.txt.
  3. Check that both files are there with git status.
  4. Commit both files at once with git commit.

Exercise 6.11.7

GitHub displays timestamps in a human-readable relative format (i.e., “22 hours ago” or “three weeks ago”), since this makes it easy for anyone in any time zone to know what changes have been made recently. However, if we hover over the timestamp we can see the exact time at which the last change to the file occurred.

Exercise 6.11.8

The answer is 1.

The command git add motivation.txt adds the current version of motivation.txt to the staging area. The changes to the file from the second echo command are only applied to the working copy, not the version in the staging area.

As a result, when git commit -m "Motivate project" is executed, the version of motivation.txt committed to the repository is the content from the first echo.

However, the working copy still has the output from the second echo; git status would show that the file is modified. git checkout HEAD motivation.txt therefore replaces the working copy with the most recently committed version of motivation.txt (the content of the first echo), so cat motivation.txt prints:

Zipf's Law describes the relationship between the frequency and
rarity of words.

Exercise 6.11.9

Add this line to .gitignore:

results/plots/

Exercise 6.11.10

Add the following two lines to .gitignore:

# Ignore all data files...
*.dat
# ...except final.dat.
!final.dat

The exclamation point ! includes a previously excluded entry.

Note also that if we have previously committed .dat files in this repository, they will not be ignored once these rules are added to .gitignore. Only future .dat files will be ignored.
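
To check which rule applies to a given file, we can ask Git directly. The output below is illustrative; the line number depends on where the pattern sits in your .gitignore:

$ git check-ignore -v mydata.dat
.gitignore:2:*.dat      mydata.dat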

Exercise 6.11.11

The left button (with the picture of a clipboard) copies the full identifier of the commit to the clipboard. In the shell, git log shows the full commit identifier for each commit.

The middle button (with seven letters and numbers) shows all of the changes that were made in that particular commit; green shaded lines indicate additions and red lines indicate removals. We can show the same thing in the shell using git diff or git diff FROM..TO (where FROM and TO are commit identifiers).

The right button lets us view all of the files in the repository at the time of that commit. To do this in the shell, we would need to check out the repository as it was at that commit using git checkout ID, where ID is the tag, branch name, or commit identifier. If we do this, we need to remember to put the repository back to the right state afterward.

Exercise 6.11.12

Committing updates our local repository. Pushing sends any commits we have made locally that aren’t yet in the remote repository to the remote repository.

Exercise 6.11.13

When GitHub creates a README.md file while setting up a new repository, it actually creates the repository and then commits the README.md file. When we try to pull from the remote repository to our local repository, Git detects that their histories do not share a common origin and refuses to merge them.

$ git pull origin master
warning: no common commits
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/frances/eniac
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> origin/master
fatal: refusing to merge unrelated histories

We can force Git to merge the two repositories with the option --allow-unrelated-histories. Please check the contents of the local and remote repositories carefully before doing this.
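
In this case, the full command would be:

$ git pull --allow-unrelated-histories origin master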

Exercise 6.11.14

The checkout command restores files from the repository, overwriting the files in our working directory. HEAD indicates the latest version.

  1. No: this can be dangerous; without a filename, git checkout will restore all files in the current directory (and all directories below it) to their state at the commit specified. This command will restore data_cruncher.sh to the latest commit version, but will also reset any other files we have changed to that version, which will erase any unsaved changes you may have made to those files.
  2. Yes: this restores the latest version of only the desired file.
  3. No: this gets the version of data_cruncher.sh from the commit before HEAD, which is not what we want.
  4. Yes: the unique ID (identifier) of the last commit is what HEAD means.
  5. Yes: this is equivalent to the answer to 2.
  6. No: git restore assumes HEAD, so Git will assume you’re trying to restore a file called HEAD, resulting in an error.

Exercise 6.11.15

  1. Compares what has changed between the current bin/plotcounts.py and the same file nine commits ago.
  2. It returns an error: fatal: ambiguous argument 'HEAD~9': unknown revision or path not in the working tree. We don’t have enough commits in history for the command to properly execute.
  3. It compares changes (either staged or unstaged) to the most recent commit.

Exercise 6.11.16

No, using git checkout on a staged file does not unstage it. The changes are in the staging area and checkout would affect the working directory.

Exercise 6.11.17

Each line of output corresponds to a line in the file, and includes the commit identifier, who last modified the line, when that change was made, and what is included on that line. Note that the edit you just committed is not present here; git blame only shows the current lines in the file, and doesn’t report on lines that have been removed.

Chapter 7

Exercise 7.12.1

  1. --oneline shows each commit on a single line with the short identifier at the start and the title of the commit beside it. -n NUMBER limits the number of commits to show.

  2. --since and --after can be used to show commits in a range of dates or times; --author can be used to show commits by a particular person; and -w tells Git to ignore whitespace when comparing commits.

Exercise 7.12.2

An online search for “show Git branch in Bash prompt” turns up several approaches, one of the simplest of which is to add this line to our ~/.bashrc file:

export PS1="\\w + \$(git branch 2>/dev/null | grep '^*' | colrm 1 2) \$ "

Breaking it down:

  1. Setting the PS1 variable defines the primary shell prompt.

  2. \\w in a shell prompt string means “the current directory.”

  3. The + is a literal + sign between the current directory and the Git branch name.

  4. The command that gets the name of the current Git branch is in $(...). (We need to escape the $ as \$ so Bash doesn’t just run it once when defining the string.)

  5. The git branch command shows all the branches, so we pipe that to grep and select the one marked with a *.

  6. Finally, we remove the first column (i.e., the one containing the *) to leave just the branch name.

So what’s 2>/dev/null about? That redirects any error messages to /dev/null, a special “file” that consumes input without saving it. We need that because sometimes we will be in a directory that isn’t inside a Git repository, and we don’t want error messages showing up in our shell prompt.

None of this is obvious, and we didn’t figure it out ourselves. Instead, we did a search and pasted various answers into explainshell.com until we had something we understood and trusted.

Exercise 7.12.3

https://github.com/github/gitignore/blob/master/Python.gitignore ignores 76 files or patterns. Of those, we recognized less than half. Searching online for some of these, like "*.pot file", turns up useful explanations. Searching for others like var/ does not; in that case, we have to look at the category (in this case, “Python distribution”) and set aside time to do more reading.

Exercise 7.12.4

  1. git diff master..same does not print anything because there are no differences between the two branches.

  2. git merge same master prints merging because Git combines the two histories even when the files themselves do not differ. After running this command, git log shows a commit for the merge.

Exercise 7.12.5

  1. Git refuses to delete a branch with unmerged commits because it doesn’t want to destroy our work.

  2. Using the -D (capital-D) option to git branch will delete the branch anyway. This is dangerous because any content that exists only in that branch will be lost.

  3. Even with -D, git branch will not delete the branch we are currently on.

Exercise 7.12.6

  1. Chartreuse has repositories on GitHub and their desktop containing identical copies of README.md and nothing else.
  2. Fuchsia has repositories on GitHub and their desktop with exactly the same content as Chartreuse’s repositories.
  3. fuchsia.txt is in both of Fuchsia’s repositories but not in Chartreuse’s repositories.
  4. fuchsia.txt is still in both of Fuchsia’s repositories but still not in Chartreuse’s repositories.
  5. chartreuse.txt is in both of Chartreuse’s repositories but not yet in either of Fuchsia’s repositories.
  6. chartreuse.txt is in Fuchsia’s desktop repository but not yet in their GitHub repository.
  7. chartreuse.txt is in both of Fuchsia’s repositories.
  8. fuchsia.txt is in Chartreuse’s GitHub repository but not in their desktop repository.
  9. All four repositories contain both fuchsia.txt and chartreuse.txt.

Chapter 8

Exercise 8.14.2

The CONDUCT.md file should have contents that mimic those given in Section 8.3.

Exercise 8.14.3

The newly created LICENSE.md should look something like the example MIT License shown in Section 8.4.1.

Exercise 8.14.4

The text in the README.md might look something like:

## Contributing

Interested in contributing?
Check out the [CONTRIBUTING.md](CONTRIBUTING.md)
file for guidelines on how to contribute.
Please note that this project is released with a
[Contributor Code of Conduct](CONDUCT.md).
By contributing to this project,
you agree to abide by its terms.

Your CONTRIBUTING.md file might look something like the following:

# Contributing

Thank you for your interest
in contributing to the Zipf's Law package!

If you are new to the package and/or
collaborative code development on GitHub,
feel free to discuss any suggested changes via issue or email.
We can then walk you through the pull request process if need be.
As the project grows,
we intend to develop more detailed guidelines for submitting
bug reports and feature requests.

We also have a code of conduct
(see [`CONDUCT.md`](CONDUCT.md)).
Please follow it in all your interactions with the project.

Exercise 8.14.5

Be sure to tag the new issue as a feature request to help triage.

Exercise 8.14.6

We often delete the duplicate label: when we mark an issue that way, we (almost) always add a comment saying which issue it’s a duplicate of, in which case it’s just as sensible to label the issue wontfix.

Exercise 8.14.7

Some solutions could be:

  • Give the team member their own office space so they don’t distract others.
  • Buy noise-cancelling headphones for the employees that find it distracting.
  • Re-arrange the work spaces so that there is a “quiet office” and a regular office space and have the team member with the attention disorder work in the regular office.

Exercise 8.14.8

Possible solutions:

  • Change the rule so that anyone who contributes to the project, in any way, gets included as a co-author.
  • Update the rule to include a contributor list on all projects with descriptions of duties, roles, and tasks the contributor provided for the project.

Exercise 8.14.9

We obviously can’t say which description fits you best, but:

  • Use three sticky notes and interruption bingo to stop Anna from cutting people off.

  • Tell Bao that the devil doesn’t need more advocates, and that he’s only allowed one “but what about” at a time.

  • Hediyeh’s lack of self-confidence will take a long time to remedy. Keeping a list of the times she’s been right and reminding her of them frequently is a start, but the real fix is to create and maintain a supportive environment.

  • Unmasking Kenny’s hitchhiking will feel like nit-picking, but so does the accounting required to pin down other forms of fraud. The most important thing is to have the discussion in the open so that everyone realizes he’s taking credit for everyone else’s work as well as theirs.

  • Melissa needs a running partner—someone to work beside her so that she starts when she should and doesn’t get distracted. If that doesn’t work, the project may need to assign everything mission-critical to someone else (which will probably lead to her leaving).

  • Petra can be managed with a one-for-one rule: each time she builds or fixes something that someone else needs, she can then work on something she thinks is cool. However, she’s only allowed to add whatever it is to the project if someone else will publicly commit to maintaining it.

  • Get Frank and Raj off your project as quickly as you can.

Chapter 9

Exercise 9.11.1

make -n target will show commands without running them.
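
For example, assuming the Makefile developed in this chapter, the (unexecuted) commands might look like:

$ make -n results/dracula.csv
python bin/countwords.py data/dracula.txt > results/dracula.csv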

Exercise 9.11.2

  1. The -B option rebuilds everything, even files that aren’t out of date.

  2. The -C option tells Make to change directories before executing, so that make -C ~/myproject runs Make in ~/myproject regardless of the directory it is invoked from.

  3. By default, Make looks for (and runs) a file called Makefile or makefile. If you use another name for your Makefile (which is necessary if you have multiple Makefiles in the same directory), then you need to specify the name of that Makefile using the -f option.

Exercise 9.11.3

mkdir -p some/path makes one or more nested directories if they don’t exist, and does nothing (without complaining) if they already exist. It is useful for creating the output directories for build rules.
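
For example:

$ mkdir -p results/plots
$ mkdir -p results/plots    # running it again produces no error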

Exercise 9.11.4

The build rule for generating the result for any book should now be:

## results/%.csv : regenerate result for any book.
results/%.csv : data/%.txt $(COUNT)
    @bash $(SUMMARY) $< Title
    @bash $(SUMMARY) $< Author
    python $(COUNT) $< > $@

where SUMMARY is defined earlier in the Makefile as

SUMMARY=bin/book_summary.sh

and the settings build rule now includes:

@echo SUMMARY: $(SUMMARY)

Exercise 9.11.5

Since we already have a variable RESULTS that contains all of the results files, all we need is a phony target that depends on them:

.PHONY: results # and all the other phony targets

## results : regenerate result for all books.
results : ${RESULTS}

Exercise 9.11.6

If we use a shell wildcard in a rule like this:

results/collated.csv : results/*.csv
    python $(COLLATE) $^ > $@

then if results/collated.csv already exists, the rule tells Make that the file depends on itself.
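
One way to avoid the self-dependency is to list the dependencies explicitly rather than with a shell wildcard, for example using the RESULTS variable defined elsewhere in the Makefile:

results/collated.csv : $(RESULTS)
    python $(COLLATE) $^ > $@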

Exercise 9.11.7

Our rule is:

help :
    @grep -h -E '^##' ${MAKEFILE_LIST} | sed -e 's/## //g' \
    | column -t -s ':'

  • The -h option to grep tells it not to print filenames, while the -E option tells it to interpret ^## as a pattern.

  • MAKEFILE_LIST is an automatically defined variable with the names of all the Makefiles in play. (There might be more than one because Makefiles can include other Makefiles.)

  • sed can be used to do string substitution.

  • column formats text nicely in columns.

Exercise 9.11.8

This strategy would be advantageous if in the future we intended to write a number of different Makefiles that all use the countwords.py, collate.py and plotcounts.py scripts.

We discuss configuration strategies in more detail in Chapter 10.

Chapter 10

Exercise 10.8.1

The build rule involving plotcounts.py should now read:

## results/collated.png: plot the collated results.
results/collated.png : results/collated.csv $(PARAMS)
    python $(PLOT) $< --outfile $@ --plotparams $(word 2,$^)

where PARAMS is defined earlier in the Makefile along with all the other variables and also included later in the settings build rule:

COUNT=bin/countwords.py
COLLATE=bin/collate.py
PARAMS=bin/plotparams.yml
PLOT=bin/plotcounts.py
SUMMARY=bin/book_summary.sh
DATA=$(wildcard data/*.txt)
RESULTS=$(patsubst data/%.txt,results/%.csv,$(DATA))
## settings : show variables' values.
settings :
    @echo COUNT: $(COUNT)
    @echo DATA: $(DATA)
    @echo RESULTS: $(RESULTS)
    @echo COLLATE: $(COLLATE)
    @echo PARAMS: $(PARAMS)
    @echo PLOT: $(PLOT)
    @echo SUMMARY: $(SUMMARY)

Exercise 10.8.2

  1. Make the following additions to plotcounts.py:
import matplotlib.pyplot as plt

parser.add_argument('--style', type=str,
                    choices=plt.style.available,
                    default=None, help='matplotlib style')

def main(args):
    """Run the command line program."""
    if args.style:
        plt.style.use(args.style)

  2. Add nargs='*' to the definition of the --style option:
parser.add_argument('--style', type=str, nargs='*',
                    choices=plt.style.available,
                    default=None, help='matplotlib style')

Exercise 10.8.3

The first step is to add a new command-line argument to tell plotcounts.py what we want to do:

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    # ...other options as before...
    parser.add_argument('--saveconfig', type=str, default=None,
                        help='Save configuration to file')
    args = parser.parse_args()
    main(args)

Next, we add three lines to main to act on this option after all of the plotting parameters have been set. For now we use return to exit from main as soon as the parameters have been saved; this lets us test our change without overwriting any of our actual plots.

def save_configuration(fname, params):
    """Save configuration to a file."""
    with open(fname, 'w') as writer:
        yaml.dump(params, writer)


def main(args):
    """Run the command line program."""
    if args.style:
        plt.style.use(args.style)
    set_plot_params(args.plotparams)
    if args.saveconfig:
        save_configuration(args.saveconfig, mpl.rcParams)
        return
    df = pd.read_csv(args.infile, header=None,
                     names=('word', 'word_frequency'))
    # ...carry on producing plot...

Finally, we add a target to Makefile to try out our change. We do the test this way so that we can be sure that we’re testing with the same options we use with the real program; if we were to type in the whole command ourselves, we might use something different. We also save the configuration to /tmp rather than to our project directory to keep it out of version control’s way:

## test-saveconfig : save plot configuration.
test-saveconfig :
    python $(PLOT) --saveconfig /tmp/test-saveconfig.yml \
      --plotparams $(PARAMS)

The output is over 400 lines long, and includes settings for everything from the animation bit rate to the size of y-axis ticks:

!!python/object/new:matplotlib.RcParams
dictitems:
  _internal.classic_mode: false
  agg.path.chunksize: 0
  animation.avconv_args: []
  animation.avconv_path: avconv
  animation.bitrate: -1
  ...
  ytick.minor.size: 2.0
  ytick.minor.visible: false
  ytick.minor.width: 0.6
  ytick.right: false

The beautiful thing about this file is that the entries are automatically sorted alphabetically, which makes it easy for both human beings and the diff command to spot differences. This helps reproducibility because any one of these settings might change in a new release of matplotlib, and any of those changes might affect our plots. Saving the settings allows us to compare what we had when we did our work to what we have when we’re trying to re-create it, which in turn gives us a starting point for debugging if we need to.

Exercise 10.8.4

import configparser


def set_plot_params(param_file):
    """Set the matplotlib parameters."""
    if param_file:
        config = configparser.ConfigParser()
        config.read(param_file)
        for section in config.sections():
            for param in config[section]:
                value = config[section][param]
                mpl.rcParams[param] = value
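
For comparison, a configuration file in Windows INI format might look something like this (the parameter values are hypothetical):

[AXES]
axes.labelsize = x-large

[TICKS]
xtick.labelsize = large
ytick.labelsize = large
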
  1. Most people seem to find Windows INI files easier to write and read, since it’s easier to see what’s a heading and what’s a value.

  2. However, Windows INI files only provide one level of sectioning, so complex configurations are harder to express. Thus, while YAML may be a bit more difficult to get started with, it will take us further.

Exercise 10.8.5

The answer depends on whether we are able to make changes to Program A and Program B. If we can, we can modify them to use overlay configuration and put the shared parameters in a single file that both programs load. If we can’t do that, the next best thing is to create a small helper program that reads their configuration files and checks that common parameters have consistent values. The first solution prevents the problem; the second detects it, which is a lot better than nothing.

Chapter 11

Exercise 11.11.1

  • The first assertion checks that the input sequence values is not empty. An empty sequence such as [] will make it fail.

  • The second assertion checks that each value in the list can be turned into an integer. Input such as [1, 2,'c', 3] will make it fail.

  • The third assertion checks that the total of the list is greater than 0. Input such as [-10, 2, 3] will make it fail.
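
The exercise’s code isn’t reproduced here, but a function with assertions matching these descriptions might look like the following sketch (the function name and details are our reconstruction):

def total(values):
    """Return the total of a sequence of integer values."""
    assert len(values) > 0, 'Cannot total an empty sequence'
    result = 0
    for value in values:
        # Non-numeric input such as 'c' fails on the next line.
        assert int(value) == value, 'Values must be integers'
        result += value
    assert result > 0, 'Total must be positive'
    return result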

Exercise 11.11.2

  1. Remove the comments about inserting preconditions and add the following:
assert len(rect) == 4, 'Rectangles must contain 4 coordinates'
x0, y0, x1, y1 = rect
assert x0 < x1, 'Invalid X coordinates'
assert y0 < y1, 'Invalid Y coordinates'

  2. Remove the comment about inserting postconditions and add the following:
assert 0 < upper_x <= 1.0, \
  'Calculated upper X coordinate invalid'
assert 0 < upper_y <= 1.0, \
  'Calculated upper Y coordinate invalid'

  3. The problem is that the following section of normalize_rectangle should read float(dy) / dx, not float(dx) / dy:
if dx > dy:
    scaled = float(dy) / dx

  4. test_geometry.py should read as follows:
import geometry


def test_tall_skinny():
    """Test normalization of a tall, skinny rectangle."""
    rect = [20, 15, 30, 20]
    expected_result = (0, 0, 1.0, 0.5)
    actual_result = geometry.normalize_rectangle(rect)
    assert actual_result == expected_result

  5. Other tests might include (but are not limited to):
def test_short_wide():
    """Test normalization of a short, wide rectangle."""
    rect = [2, 5, 3, 10]
    expected_result = (0, 0, 0.2, 1.0)
    actual_result = geometry.normalize_rectangle(rect)
    assert actual_result == expected_result


def test_negative_coordinates():
    """Test rectangle normalization with negative coords."""
    rect = [-2, 5, -1, 10]
    expected_result = (0, 0, 0.2, 1.0)
    actual_result = geometry.normalize_rectangle(rect)
    assert actual_result == expected_result

Exercise 11.11.3

There are three approaches to testing when pseudo-random numbers are involved:

  1. Run the function once with a known seed, check and record its output, and then compare the output of subsequent runs to that saved output. (Basically, if the function does the same thing it did the first time, we trust it.)

  2. Replace the pseudo-random number generator with a function of our own that generates a predictable series of values. For example, if we are randomly partitioning a list into two equal halves, we could instead use a function that puts odd-numbered values in one partition and even-numbered values in another (which is a legal but unlikely outcome of truly random partitioning).

  3. Instead of checking for an exact result, check that the result lies within certain bounds, just as we would with the result of a physical experiment.
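
A minimal sketch of the first approach, using a hypothetical function that randomly partitions a list:

import random


def random_split(items, seed=None):
    """Randomly partition items into two halves."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def test_split_is_reproducible():
    """The same seed should always produce the same partition."""
    first = random_split(range(10), seed=12345)
    second = random_split(range(10), seed=12345)
    assert first == second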

Exercise 11.11.4

This result seems counter-intuitive to many people because relative error is a measure of a single value, but in this case we are looking at a distribution of values: each result is off by 0.1 compared to a range of 0–2, which doesn’t “feel” infinite. In this case, a better measure might be the largest absolute error divided by the standard deviation of the data.

Chapter 12

Exercise 12.6.1

Add a new command-line argument to collate.py:

parser.add_argument('-v', '--verbose',
                    action="store_true", default=False,
                    help="Set logging level to DEBUG")

and two new lines to the beginning of the main function:

log_level = logging.DEBUG if args.verbose else logging.WARNING
logging.basicConfig(level=log_level)

such that the full collate.py script now reads as follows:

"""
Combine multiple word count CSV-files
into a single cumulative count.
"""

import csv
import argparse
from collections import Counter
import logging

import utilities as util


ERRORS = {
    'not_csv_suffix' : '{fname}: File must end in .csv',
    }


def update_counts(reader, word_counts):
    """Update word counts with data from another reader/file."""
    for word, count in csv.reader(reader):
        word_counts[word] += int(count)


def main(args):
    """Run the command line program."""
    log_lev = logging.DEBUG if args.verbose else logging.WARNING
    logging.basicConfig(level=log_lev)
    word_counts = Counter()
    logging.info('Processing files...')
    for fname in args.infiles:
        logging.debug(f'Reading in {fname}...')
        if fname[-4:] != '.csv':
            msg = ERRORS['not_csv_suffix'].format(fname=fname)
            raise OSError(msg)
        with open(fname, 'r') as reader:
            logging.debug('Computing word counts...')
            update_counts(reader, word_counts)
    util.collection_to_csv(word_counts, num=args.num)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infiles', type=str, nargs='*',
                        help='Input file names')
    parser.add_argument('-n', '--num',
                        type=int, default=None,
                        help='Output n most frequent words')
    parser.add_argument('-v', '--verbose',
                        action="store_true", default=False,
                        help="Set logging level to DEBUG")
    args = parser.parse_args()
    main(args)

Exercise 12.6.2

Add a new command-line argument to collate.py:

parser.add_argument('-l', '--logfile',
                    type=str, default='collate.log',
                    help='Name of the log file')

and pass the name of the log file to logging.basicConfig using the filename argument:

logging.basicConfig(level=log_lev, filename=args.logfile)

such that the collate.py script now reads as follows:

"""
Combine multiple word count CSV-files
into a single cumulative count.
"""

import csv
import argparse
from collections import Counter
import logging

import utilities as util


ERRORS = {
    'not_csv_suffix' : '{fname}: File must end in .csv',
    }


def update_counts(reader, word_counts):
    """Update word counts with data from another reader/file."""
    for word, count in csv.reader(reader):
        word_counts[word] += int(count)


def main(args):
    """Run the command line program."""
    log_lev = logging.DEBUG if args.verbose else logging.WARNING
    logging.basicConfig(level=log_lev, filename=args.logfile)
    word_counts = Counter()
    logging.info('Processing files...')
    for fname in args.infiles:
        logging.debug(f'Reading in {fname}...')
        if fname[-4:] != '.csv':
            msg = ERRORS['not_csv_suffix'].format(fname=fname)
            raise OSError(msg)
        with open(fname, 'r') as reader:
            logging.debug('Computing word counts...')
            update_counts(reader, word_counts)
    util.collection_to_csv(word_counts, num=args.num)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infiles', type=str, nargs='*',
                        help='Input file names')
    parser.add_argument('-n', '--num',
                        type=int, default=None,
                        help='Output n most frequent words')
    parser.add_argument('-v', '--verbose',
                        action="store_true", default=False,
                        help="Set logging level to DEBUG")
    parser.add_argument('-l', '--logfile',
                        type=str, default='collate.log',
                        help='Name of the log file')
    args = parser.parse_args()
    main(args)

Exercise 12.6.3

  1. The loop in collate.py that reads/processes each input file should now read as follows:
for fname in args.infiles:
    try:
        logging.debug(f'Reading in {fname}...')
        if fname[-4:] != '.csv':
            msg = ERRORS['not_csv_suffix'].format(fname=fname)
            raise OSError(msg)
        with open(fname, 'r') as reader:
            logging.debug('Computing word counts...')
            update_counts(reader, word_counts)
    except Exception as error:
        logging.warning(f'{fname} not processed: {error}')

  2. The loop in collate.py that reads/processes each input file should now read as follows:
for fname in args.infiles:
    try:
        logging.debug(f'Reading in {fname}...')
        if fname[-4:] != '.csv':
            msg = ERRORS['not_csv_suffix'].format(
                fname=fname)
            raise OSError(msg)
        with open(fname, 'r') as reader:
            logging.debug('Computing word counts...')
            update_counts(reader, word_counts)
    except FileNotFoundError:
        msg = f'{fname} not processed: File does not exist'
        logging.warning(msg)
    except PermissionError:
        msg = f'{fname} not processed: No read permission'
        logging.warning(msg)
    except Exception as error:
        msg = f'{fname} not processed: {error}'
        logging.warning(msg)

Exercise 12.6.4

  1. The try/except block in collate.py should begin as follows:

    try:
        process_file(fname, word_counts)
    except FileNotFoundError:
    # ... the other exceptions

  2. The following additions need to be made to test_zipfs.py.

    import collate
    def test_not_csv_error():
        """Error handling test for csv check"""
        fname = 'data/dracula.txt'
        word_counts = Counter()
        with pytest.raises(OSError):
            collate.process_file(fname, word_counts)

  3. The following unit test needs to be added to test_zipfs.py.

    def test_missing_file_error():
        """Error handling test for missing file"""
        fname = 'fake_file.csv'
        word_counts = Counter()
        with pytest.raises(FileNotFoundError):
            collate.process_file(fname, word_counts)
  4. The following sequence of commands is required to test the code coverage.

    $ coverage run -m pytest
    $ coverage html

    Open htmlcov/index.html and click on bin/collate.py to view a coverage summary. The lines of process_file that include the raise OSError and open(fname, 'r') commands should appear in green after clicking the green “run” box in the top left-hand corner of the page.

Exercise 12.6.5

  1. The convention is to use ALL_CAPS_WITH_UNDERSCORES when defining global variables.

  2. Python’s f-strings interpolate variables that are in scope: there is no easy way to interpolate values from a lookup table. In contrast, str.format can be given any number of named keyword arguments (Appendix F), so we can look up a string and then interpolate whatever values we want (see the example after this list).

  3. Once ERRORS has been moved to the utilities module, all references to it in collate.py must be updated to util.ERRORS.
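
To see the second point concretely, here is the ERRORS lookup table from Chapter 12 with a hypothetical filename interpolated via str.format:

ERRORS = {
    'not_csv_suffix': '{fname}: File must end in .csv',
}
msg = ERRORS['not_csv_suffix'].format(fname='mydata.txt')
print(msg)  # prints: mydata.txt: File must end in .csv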

Exercise 12.6.6

A traceback is an object that records where an exception was raised, what stack frames were on the call stack when the error occurred, and other details that are helpful for debugging. Python’s traceback library can be used to get and print information from these objects.

Chapter 13

Exercise 13.4.1

You can get an ORCID by registering at https://orcid.org. Please add this 16-digit identifier to all of your published works and to your online profiles.

Exercise 13.4.2

If possible, compare your answers with those of a colleague who works with the same data. Where did you agree and disagree, and why?

Exercise 13.4.3

  1. The participants were 51 solicitors who were interviewed.

  2. Interview data, and data from a database on court decisions.

  3. This information is not available within the documentation. Information on their jobs and opinions is there, but the participant demographics are only described in the associated article. The difficulty is that the article is not linked in the documentation or the metadata.

  4. We can search for the dataset name and the author’s name to try to find this. A search for the grant information with “National Science Foundation (1228602)” finds the grant page. Two articles are linked there, but both of the DOI links are broken. We can search with the citation for each paper to find them. The Forced Migration article uses a different subset of interviews and does not mention demographics, nor does it link to the deposited dataset. The Boston College Law Review article has the same two problems: different data and no dataset citation.

    Searching more broadly through Meili’s work, we can find Meili (2015). This lists the dataset as a footnote and reports the 51 interviews with demographic data on reported gender of the interviewees. This paper lists data collection as 2010–2014, while the other two say 2010–2013. We might come to a conclusion that this extra year is where the extra 9 interviews come in, but that difference is not explained anywhere.

Exercise 13.4.4

For borstlab/reversephi_paper:

  1. The software requirements are documented in README.md. In addition to the tools used in the zipf/ project (Python, Make and Git), the project also requires ImageMagick. No information on installing ImageMagick or a required version of ImageMagick is provided.

    To re-create the conda environment, you would need the file my_environment.yml. Instructions for creating and using the environment are provided in README.md.

  2. Like zipf, the data processing and analysis steps are documented in a Makefile. The README includes instructions for re-creating the results using make all.

  3. There doesn’t seem to be a DOI for the archived code and data, but the GitHub repo does have a release v1.0 with the description “Published manuscript (1.0)” beside it. A zip file of this release could be downloaded from GitHub.

For the figshare page that accompanies the paper Irving, Wijffels, and Church (2019):

  1. The figshare page includes a “Software environment” section. To re-create the conda environment, you would need the file environment.yml.

  2. figure*_log.txt are log files for each figure in the paper. These files show the computational steps performed in generating the figure, in the form of a list of commands executed at the command line.

    code.zip is a version controlled (using git) file directory containing the code written to perform the analysis (i.e., it contains the scripts referred to in the log files). This code can also be found on GitHub.

  3. The figshare page itself is the archive, and includes a version history for the contents.

For the GitHub repo blab/h3n2-reassortment:

  1. README.md includes an “Install requirements” section that describes setting up the conda environment using the file h3n2_reassortment.yaml.

    The analysis also depends on components from Nextstrain. Instructions for cloning them from GitHub are provided.

  2. The code seems to be spread across the directories jupyter_notebooks, hyphy, flu_epidemiology, and src, but it isn’t clear what order the code should be run in, or how the components depend on each other.

  3. The data itself is not archived, but links are provided in the “Install requirements” section of README.md to documents that describe how to obtain the data. Some intermediate data is also provided in the data/ directory.

    The GitHub repo has a release with files “that are up-to-date with the version of the manuscript that was submitted to Virus Evolution on 31 January 2019.”

Exercise 13.4.6

You’ll know you’ve completed this exercise when you have a URL that points to a zip archive for a specific release of your repository on GitHub, e.g.:

https://github.com/amira-khan/zipf/archive/KhanVirtanen2020.zip

Exercise 13.4.7

Some steps to publishing your project’s code would be:

  1. Upload the code on GitHub.
  2. Use a standard folder and file structure as taught in this book.
  3. Include README, CONTRIBUTING, CONDUCT, and LICENSE files.
  4. Make sure these files explain how to install and configure the required software and tell people how to run the code in the project.
  5. Include a requirements.txt file for Python package dependencies.

Chapter 14

Exercise 14.9.1

description and long_description arguments need to be provided when the setup function is called in setup.py. On the TestPyPI webpage, the user interface displays description in the grey banner and long_description in the section named “Project Description.”

Other metadata that might be added includes the author email address, software license details and a link to the documentation at Read the Docs.
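
A sketch of what such additions to the setup call might look like (all values below are placeholders, not the package’s actual metadata):

from setuptools import setup

setup(
    name='pyzipf',
    version='0.1.0',
    description='Count word frequencies in classic English novels',
    long_description=open('README.md').read(),
    long_description_content_type='text/markdown',
    author_email='amira.khan@example.com',
    url='https://pyzipf.readthedocs.io',
)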

Exercise 14.9.2

The new requirements_dev.txt file will have this inside it:

pytest

Exercise 14.9.3

The answers to the relevant questions from the checklist are shown below.

  • Repository: Is the source code for this software available at the repository url?
    • Yes. The source code is available at PyPI.
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
    • Yes. Our GitHub repository contains LICENSE.md (Section 8.4.1).
  • Installation: Does installation proceed as outlined in the documentation?
    • Yes. Our README says the package can be installed via pip.
  • Functionality: Have the functional claims of the software been confirmed?
    • Yes. The command-line programs countwords, collate, and plotcounts perform as described in the README.
  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
    • Yes. The “Motivation” section of the README explains this.
  • Installation instructions: Is there a clearly stated list of dependencies? Ideally these should be handled with an automated package management solution.
    • Yes. In our setup.py file the install_requires argument lists dependencies.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
    • Yes. There are examples in the README.
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
    • Yes. This information is available on Read the Docs.
  • Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
    • We have unit tests written and available (test_zipfs.py), but our documentation needs to be updated to tell people to run pytest in order to manually run those tests.
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support?
    • Yes. Our CONTRIBUTING file explains this (Section 8.11).

Exercise 14.9.4

The directory tree for the pratchett package is:

pratchett
├── pratchett
│   └── quotes.py
├── README.md
└── setup.py

README.md should contain a basic description of the package and how to install/use it, while setup.py should contain:

from setuptools import setup


setup(
    name='pratchett',
    version='0.1',
    author='Amira Khan',
    packages=['pratchett'],
)

The following sequence of commands will create the development environment, activate it, and then install the package:

$ conda create -n pratchett python
$ conda activate pratchett
(pratchett)$ cd pratchett
(pratchett)$ pip install -e .