A Solutions
The exercises included in this book represent a wide variety of problems, from multiple-choice questions to larger coding tasks. It’s relatively straightforward to indicate a correct answer for the former, though there may be unanticipated cases in which the specific software version you’re using leads to alternative answers being preferable. It’s more difficult to identify the “right” answer for the latter, since there are often many ways to accomplish the same task with code. Here we present possible solutions that the authors generally agree represent “good” code, but we encourage you to explore additional approaches.
Commits noted in a solution reference Amira’s zipf repository on GitHub, which allow you to see the specific lines of a file modified to arrive at the answer.
Chapter 2
Exercise 2.10.1
The -l option makes ls use a long listing format, showing not only the file/directory names but also additional information, such as the file size and the time of its last modification. If you use both the -h and -l options, the file size is displayed as “human readable,” i.e., something like 5.3K instead of 5369.
Exercise 2.10.2
The command ls -R -t
results in the contents of
each directory sorted by time of last change.
Exercise 2.10.3
- No: . stands for the current directory.
- No: / stands for the root directory.
- No: Amira’s home directory is /Users/amira.
- No: This goes up two levels, i.e., ends in /Users.
- Yes: ~ stands for the user’s home directory, in this case /Users/amira.
- No: This would navigate into a directory home in the current directory if it exists.
- Yes: Starting from the home directory ~, this command goes into data then back (using ..) to the home directory.
- Yes: Shortcut to go back to the user’s home directory.
- Yes: Goes up one level.
- Yes: Same as the previous answer, but with an unnecessary . (indicating the current directory).
Exercise 2.10.4
- No: There is a directory backup in /Users.
- No: This is the content of /Users/sami/backup, but with .. we asked for one level further up.
- No: Same as previous explanation, but results shown as directories (which is what the -F option specifies).
- Yes: ../backup/ refers to /Users/backup/.
Exercise 2.10.5
- No: pwd is not the name of a directory.
- Yes: ls without a directory argument lists files and directories in the current directory.
- Yes: Uses the absolute path explicitly.
Exercise 2.10.6
The touch command updates a file’s timestamp. If no file exists with the given name, touch will create one. Assuming you don’t already have my_file.txt in your working directory, touch my_file.txt will create the file. When you inspect the file with ls -l, note that the size of my_file.txt is 0 bytes. In other words, it contains no data. If you open my_file.txt using your text editor, it is blank.
Some programs do not generate output files themselves, but instead require that empty files have already been generated. When the program is run, it searches for an existing file to populate with its output. The touch command allows you to efficiently generate a blank text file to be used by such programs.
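For example, the sequence described above looks like this (a sketch; the details of your listing will differ):

touch my_file.txt
ls -l my_file.txt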
Exercise 2.10.7
The -i
option will prompt before (every) removal
(use y to confirm deletion or n to keep the file).
The Unix shell doesn’t have a trash bin, so all the files removed will disappear forever.
By using the -i
option, we have the chance to check that we are deleting
only the files that we want to remove.
Exercise 2.10.8
Recall that ..
refers to the parent directory (i.e., one above the current directory)
and that .
refers to the current directory.
Exercise 2.10.9
- No: While this would create a file with the correct name, the incorrectly named file still exists in the directory and would need to be deleted.
- Yes: This would work to rename the file.
- No: The period (.) indicates where to move the file, but does not provide a new filename; identical filenames cannot be created.
- No: The period (.) indicates where to copy the file, but does not provide a new filename; identical filenames cannot be created.
Exercise 2.10.10
We start in the /Users/amira/data
directory,
containing a single file, books.dat
.
We create a new folder called doc
and move (mv
) the file books.dat
to that new folder.
Then we make a copy (cp
) of the file we just moved named books-saved.dat
.
The tricky part here is the location of the copied file.
Recall that ..
means “go up a level,” so the copied file is now in /Users/amira
.
Notice that ..
is interpreted with respect to the current working
directory, not with respect to the location of the file being copied.
So, the only thing that will show using ls (in /Users/amira/data) is the doc folder.
- No: books-saved.dat is located at /Users/amira.
- Yes.
- No: books.dat is located at /Users/amira/data/doc.
- No: books-saved.dat is located at /Users/amira.
Exercise 2.10.11
If given more than one filename followed by a directory name (i.e., the destination directory must
be the last argument), cp
copies the files to the named directory.
If given three filenames,
cp
throws an error because it is expecting a directory name as the last argument.
Exercise 2.10.12
- Yes: Shows all files whose names contain two different characters (?) followed by the letter n, then zero or more characters (*) followed by txt.
- No: Shows all files whose names start with zero or more characters (*) followed by e_, zero or more characters (*), then txt. The output includes the two desired books, but also time_machine.txt.
- No: Shows all files whose names start with zero or more characters (*) followed by n, zero or more characters (*), then txt. The output includes the two desired books, but also frankenstein.txt and time_machine.txt.
- No: Shows all files whose names start with zero or more characters (*) followed by n, a single character (?), e, zero or more characters (*), then txt. The output shows frankenstein.txt and sense_and_sensibility.txt.
Exercise 2.10.13
Amira needs to move her files books.txt
and titles.txt
to the data
directory.
The shell will expand *.txt
to match all .txt
files in the current directory.
The mv
command then moves the list of .txt
files to the data
directory.
Exercise 2.10.14
- Yes: This accurately re-creates the directory structure.
- Yes: This accurately re-creates the directory structure.
- No: The first line of this code set gives an error:

  mkdir: 2016-05-20/data: No such file or directory

  mkdir won’t create a subdirectory for a directory that doesn’t yet exist (unless you use an option like -p that explicitly creates parent directories).
- No: This creates raw and processed directories at the same level as data:

  2016-05-20/
  ├── data
  ├── processed
  └── raw
Exercise 2.10.15
A solution using two wildcard expressions:
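A sketch, consistent with the failure cases described next (it assumes the files of interest begin with s and t):

ls s*.txt t*.txt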
This would fail when there are no files beginning with s and ending in .txt, or when there are no files beginning with t and ending in .txt.
Exercise 2.10.16
- No: This would remove only .csv files with one-character names.
- Yes: This removes only files ending in .csv.
- No: The shell would expand * to match everything in the current directory, so the command would try to remove all matched files and an additional file called .csv.
- No: The shell would expand *.* to match all files with any extension, so this command would delete all files in the current directory.
Exercise 2.10.17
novel-????-[ab]*.{txt,pdf} matches:

- Files whose names start with novel-,
- which is then followed by exactly four characters (since each ? matches one character),
- followed by another literal -,
- followed by either the letter a or the letter b,
- followed by zero or more other characters (the *),
- followed by .txt or .pdf.
Chapter 3
Exercise 3.8.1
echo hello > testfile01.txt writes the string “hello” to testfile01.txt, but the file gets overwritten each time we run the command.

echo hello >> testfile02.txt also writes “hello” to testfile02.txt, but appends the string to the file if it already exists (i.e., when we run it for the second time).
Exercise 3.8.2
- No: This results from only running the first line of code (head).
- No: This results from only running the second line of code (tail).
- Yes: The first line writes the first three lines of dracula.txt, the second line appends the last two lines of dracula.txt to the same file.
- No: We would need to pipe the commands to obtain this answer (head -n 3 dracula.txt | tail -n 2 > extracted.txt).
Exercise 3.8.3
Try running each line of code in the data directory.

- No: This incorrectly uses redirect (>), and will result in an error.
- No: The number of lines desired for head is reported incorrectly; this will result in an error.
- No: This will extract the first three files from the wc results, which have not yet been sorted by the number of lines.
- Yes: This output correctly orders and connects each of the commands.
Exercise 3.8.4
To obtain a list of unique results from these data, we need to run:
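A sketch of the command (the filename genres.txt is an assumption; substitute the file you are working with):

sort genres.txt | uniq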
It makes sense that uniq
is almost always run after using sort
,
because that allows a computer to compare only adjacent lines.
If uniq did not compare only adjacent lines, it would need to compare each line to all other lines. For a small file this doesn’t matter much, but it quickly becomes impractical for large files.
Exercise 3.8.5
When used on a single file,
cat
prints the contents of that file to the screen.
In this case,
the contents of titles.txt
are sent as input to head -n 5
,
so the first five lines of titles.txt are output.
These five lines are used as the input for tail -n 3
,
which results in lines 3–5 as output.
This is used as input to the final command,
which sorts them in reverse order.
These results are written to the file final.txt
,
the contents of which are:
Sense and Sensibility,1811
Moby Dick,1851
Jane Eyre,1847
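Putting the steps together, the pipeline described above is (a reconstruction from the description; the exact flags follow the text):

cat titles.txt | head -n 5 | tail -n 3 | sort -r > final.txt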
Exercise 3.8.6
cut selects substrings from a line by:

- breaking the string into pieces wherever it finds a separator (-d ,), which in this case is a comma, and
- keeping one or more of the resulting fields/columns (-f 2).

In this case, the output is only the dates from titles.txt, since these are in the second column.
1897
1818
1847
1851
1811
1892
1897
1895
1847
Exercise 3.8.7
- No: This sorts by the book title.
- No: This results in an error because sort is being used incorrectly.
- No: There are duplicate dates in the output because they have not been sorted first.
- Yes: This results in the output shown below.
- No: This extracts the desired data (below), but then counts the number of lines, resulting in the incorrect answer.
1 1811
1 1818
2 1847
1 1851
1 1892
1 1895
2 1897
If you have difficulty understanding the answers above, try running the commands or sub-sections of the pipelines (e.g., the code between pipes).
Exercise 3.8.8
The difference between the versions is whether the code after echo
is inside quotation marks.
The first version redirects the output from echo analyze $file
to a file (analyzed-$file
).
This doesn’t allow us to preview the commands,
but instead creates files (analyzed-$file
)
containing the text analyze $file
.
The second version will allow us to preview the commands.
This prints to screen everything enclosed in the quotation marks,
expanding the loop variable name (prefixed with $
).
Try both versions for yourself to see the output. Be sure to open the
analyzed-*
files to view their contents.
Exercise 3.8.9
The first version gives the same output on each iteration through
the loop.
Bash expands the wildcard *.txt
to match all files ending in .txt
and then lists them using ls
.
The expanded loop would look like this
(we’ll only show the first two data files):
dracula.txt frankenstein.txt ...
dracula.txt frankenstein.txt ...
...
The second version lists a different file on each loop iteration.
The value of the datafile
variable is evaluated using $datafile
,
and then listed using ls
.
dracula.txt
frankenstein.txt
jane_eyre.txt
moby_dick.txt
sense_and_sensibility.txt
sherlock_holmes.txt
time_machine.txt
Exercise 3.8.10
The first version results in only dracula.txt
output,
because it is the only file beginning in “d”.
The second version results in the following, because these files all contain a “d” with zero or more characters before and after:
README.md
dracula.txt
moby_dick.txt
sense_and_sensibility.txt
Exercise 3.8.11
Both versions write the first 16 lines (head -n 16
)
of each book to a file (headers.txt
).
The first version results in headers.txt being overwritten in each iteration because of the use of > as a redirect.
The second version uses >>
,
which appends the lines to the existing file.
This is preferable because the final headers.txt
includes the first 16 lines from all files.
Exercise 3.8.12
If a command causes something to crash or hang, it might be useful to know what that command was, in order to investigate the problem. If the command were only recorded after running it, we would not have a record of the last command run in the event of a crash.
Chapter 4
Exercise 4.8.1
Change into the zipf
directory,
which is located in the home directory (designated by ~
).
Find all the files ending in .bak
and remove them one by one.
Remove the summarize_all_books.sh
script.
Recursively remove each file in the results
directory
and then remove the directory itself.
(It is necessary to remove all the files first because you
cannot remove a non-empty directory.)
Exercise 4.8.2
Running this script with the given parameters will print
the first and last line from each file in the directory ending in .txt
.
- No: This answer misinterprets the lines printed.
- Yes.
- No: This answer includes the wrong files.
- No: Leaving off the quotation marks would result in an error.
Exercise 4.8.3
One possible script (longest.sh
) to accomplish this task:
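A sketch, assuming the script takes a directory as its first argument and a file extension as its second:

#!/usr/bin/env bash
# Usage: bash longest.sh directory extension
# Print the file in 'directory' ending in 'extension'
# that has the most lines.
# wc -l prints a 'total' line last, so the file with the
# most lines is second from the end of the sorted output.
wc -l $1/*.$2 | sort -n | tail -n 2 | head -n 1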
Exercise 4.8.4
- script1.sh will print the names of all files in the directory on a single line, e.g., README.md dracula.txt frankenstein.txt jane_eyre.txt moby_dick.txt script1.sh sense_and_sensibility.txt sherlock_holmes.txt time_machine.txt. Although *.txt is included when running the script, the commands run by the script do not reference $1.
- script2.sh will print the contents of the first three files ending in .txt; the three variables ($1, $2, $3) refer to the first, second, and third arguments entered after the script, respectively.
- script3.sh will print the name of each file ending in .txt, since $@ refers to all the arguments (e.g., filenames) given to a shell script. The list of files would be followed by .txt: dracula.txt frankenstein.txt jane_eyre.txt moby_dick.txt sense_and_sensibility.txt sherlock_holmes.txt time_machine.txt.txt.
Exercise 4.8.5
- No: This command extracts any line containing “he”, either as a word or within a word.
- No: This results in the same output as the answer for #1. -E allows the search term to represent an extended regular expression, but the search term is simple enough that it doesn’t make a difference in the result.
- Yes: -w means to return only matches for the word “he”.
- No: -v means to invert the search result; this would return all lines except the one we desire.
Exercise 4.8.6
Exercise 4.8.7
One possible solution:
for sister in Elinor Marianne
do
    echo $sister:
    grep -o -w $sister sense_and_sensibility.txt | wc -l
done
The -o
option prints only the matching part of a line.
An alternative (but possibly less accurate) solution is:
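A sketch based on the description below, in which grep -c replaces the grep | wc -l combination:

for sister in Elinor Marianne
do
    echo $sister:
    grep -c -w $sister sense_and_sensibility.txt
done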
This solution is potentially less accurate
because grep -c
only reports the number of lines matched.
The total number of matches reported by this method
will be lower if there is more than one match per line.
Exercise 4.8.8
- Yes: This returns data/jane_eyre.txt.
- Maybe: This option may work on your computer, but may not behave consistently across all shells because expansion of the wildcard (*e.txt) may prevent piping from working correctly. We recommend enclosing *e.txt in quotation marks, as in answer 1.
- No: This searches the contents of files for lines matching “machine”, rather than the filenames.
- See above.
Exercise 4.8.9
- Find all files with a .dat extension recursively from the current directory.
- Count the number of lines each of these files contains.
- Sort the output from step 2 numerically.
Exercise 4.8.10
The following command works if your working directory is Desktop/ and you replace “username” with your username on the computer.
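A sketch of the command (the starting directory and username are assumptions):

find . -type f -mtime -1 -user username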
-mtime
needs to be negative because it is referencing a day prior to the current date.
Chapter 5
Exercise 5.11.1
Running a Python statement directly from the command line is useful as a basic calculator
and for simple string operations,
since these commands occur in one line of code.
More complicated commands will require multiple statements;
when run using python -c
,
statements must be separated by semi-colons:
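For example (any two statements separated by a semi-colon will do):

python -c "import math; print(math.log(123))"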
Multiple statements, therefore, quickly become more troublesome to run in this manner.
Exercise 5.11.2
The my_ls.py
script could read as follows:
"""List the files in a given directory with a given suffix."""
import argparse
import glob
def main(args):
"""Run the program."""
dir = args.dir if args.dir[-1] == '/' else args.dir + '/'
glob_input = dir + '*.' + args.suffix
glob_output = sorted(glob.glob(glob_input))
for item in glob_output:
print(item)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument('dir', type=str, help='Directory')
parser.add_argument('suffix', type=str,
help='File suffix (e.g. py, sh)')
args = parser.parse_args()
main(args)
Exercise 5.11.3
The sentence_endings.py
script could read as follows:
"""Count the occurrence of different sentence endings."""
import argparse
def main(args):
"""Run the command line program."""
text = args.infile.read()
for ending in ['.', '?', '!']:
count = text.count(ending)
print(f'Number of {ending} is {count}')
if __name__ == '__main__':
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument('infile', type=argparse.FileType('r'),
nargs='?', default='-',
help='Input file name')
args = parser.parse_args()
main(args)
Exercise 5.11.4
While there may be other ways for plotcounts.py
to meet the requirements of the exercise,
we’ll be using this script in subsequent chapters so
we recommend that the script reads as follows:
"""Plot word counts."""
import argparse
import pandas as pd
def main(args):
"""Run the command line program."""
df = pd.read_csv(args.infile, header=None,
names=('word', 'word_frequency'))
df['rank'] = df['word_frequency'].rank(ascending=False,
method='max')
df['inverse_rank'] = 1 / df['rank']
ax = df.plot.scatter(x='word_frequency',
y='inverse_rank',
figsize=[12, 6],
grid=True,
xlim=args.xlim)
ax.figure.savefig(args.outfile)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument('infile', type=argparse.FileType('r'),
nargs='?', default='-',
help='Word count csv file name')
parser.add_argument('--outfile', type=str,
default='plotcounts.png',
help='Output image file name')
parser.add_argument('--xlim', type=float, nargs=2,
metavar=('XMIN', 'XMAX'),
default=None, help='X-axis limits')
args = parser.parse_args()
main(args)
Chapter 6
Exercise 6.11.1
Amira does not need to make the heaps-law
subdirectory a Git repository
because the zipf
repository will track everything inside it regardless of how deeply nested.
Amira shouldn’t run git init
in heaps-law
because nested Git repositories can interfere with each other.
If someone commits something in the inner repository,
Git will not know whether to record the changes in that repository,
the outer one,
or both.
Exercise 6.11.2
git status
now shows:
On branch master
Untracked files:
(use "git add <file>..." to include in what will be committed)
example.txt
nothing added to commit but untracked files present
(use "git add" to track)
Nothing has happened to the file;
it still exists but Git no longer has it in the staging area.
git rm --cached
is equivalent to git restore --staged
.
With newer versions of Git,
older commands will still work,
and you may encounter references to them when reading help documentation.
If you created this file in your zipf
project,
we recommend removing it before proceeding.
Exercise 6.11.3
If we make a few changes to .gitignore
such that it now reads:
__pycache__ this is a change
this is another change
then git diff
would show:
diff --git a/.gitignore b/.gitignore
index bee8a64..5c83419 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,3 @@
-__pycache__
+__pycache__ this is a change
+
+this is another change
Whereas git diff --word-diff
shows:
diff --git a/.gitignore b/.gitignore
index bee8a64..5c83419 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,3 @@
__pycache__ {+this is a change+}
{+this is another change+}
Depending on the nature of the changes you are viewing, the latter may be easier to interpret since it shows exactly what has been changed.
Exercise 6.11.4
- Maybe: would only create a commit if the file has already been staged.
- No: would try to create a new repository, which results in an error if a repository already exists.
- Yes: first adds the file to the staging area, then commits.
- No: would result in an error, as it would try to commit a file “my recent changes” with the message “myfile.txt.”
Exercise 6.11.5
- Go into your home directory with cd ~.
- Create a new folder called bio with mkdir bio.
- Make the repository your working directory with cd bio.
- Turn it into a repository with git init.
- Create your biography using nano or another text editor.
- Add it and commit it in a single step with git commit -a -m "Some message".
- Modify the file.
- Use git diff to see the differences.
Exercise 6.11.6
- Create employment.txt using an editor like Nano.
- Add both me.txt and employment.txt to the staging area with git add *.txt.
- Check that both files are there with git status.
- Commit both files at once with git commit.
Exercise 6.11.7
GitHub displays timestamps in a human-readable relative format (i.e., “22 hours ago” or “three weeks ago”), since this makes it easy for anyone in any time zone to know what changes have been made recently. However, if we hover over the timestamp we can see the exact time at which the last change to the file occurred.
Exercise 6.11.8
The answer is 1.
The command git add motivation.txt
adds the current version of motivation.txt
to the staging area.
The changes to the file from the second echo
command are only applied to the working copy,
not the version in the staging area.
As a result,
when git commit -m "Motivate project"
is executed,
the version of motivation.txt
committed to the repository is the content from the first echo
.
However,
the working copy still has the output from the second echo
;
git status
would show that the file is modified.
git restore HEAD motivation.txt
therefore replaces the working copy with
the most recently committed version of motivation.txt
(the content of the first echo
),
so cat motivation.txt
prints:
Zipf's Law describes the relationship between the frequency and
rarity of words.
Exercise 6.11.10
Add the following two lines to .gitignore
:
*.dat        # ignore all data files
!final.dat   # except final.dat
The exclamation point !
includes a previously excluded entry.
Note also that if we have previously committed .dat
files in this repository,
they will not be ignored once these rules are added to .gitignore
.
Only future .dat
files will be ignored.
Exercise 6.11.11
The left button (with the picture of a clipboard)
copies the full identifier of the commit to the clipboard.
In the shell,
git log
shows the full commit identifier for each commit.
The middle button (with seven letters and numbers)
shows all of the changes that were made in that particular commit;
green shaded lines indicate additions and red lines indicate removals.
We can show the same thing in the shell using git diff
or git diff FROM..TO
(where FROM
and TO
are commit identifiers).
The right button lets us view all of the files in the repository at the time of that commit.
To do this in the shell,
we would need to check out the repository as it was at that commit
using git checkout ID
, where ID
is the tag, branch name, or commit identifier.
If we do this,
we need to remember to put the repository back to the right state afterward.
Exercise 6.11.12
Committing updates our local repository. Pushing sends any commits we have made locally that aren’t yet in the remote repository to the remote repository.
Exercise 6.11.13
When GitHub creates a README.md
file while setting up a new repository,
it actually creates the repository and then commits the README.md
file.
When we try to pull from the remote repository to our local repository,
Git detects that their histories do not share a common origin and refuses to merge them.
warning: no common commits
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/frances/eniac
* branch master -> FETCH_HEAD
* [new branch] master -> origin/master
fatal: refusing to merge unrelated histories
We can force Git to merge the two repositories with the option --allow-unrelated-histories
.
Please check the contents of the local and remote repositories carefully before doing this.
Exercise 6.11.14
The checkout
command restores files from the repository,
overwriting the files in our working directory.
HEAD
indicates the latest version.
- No: this can be dangerous; without a filename, git checkout will restore all files in the current directory (and all directories below it) to their state at the commit specified. This command will restore data_cruncher.sh to the latest commit version, but will also reset any other files we have changed to that version, which will erase any unsaved changes you may have made to those files.
- Yes: this restores the latest version of only the desired file.
- No: this gets the version of data_cruncher.sh from the commit before HEAD, which is not what we want.
- Yes: the unique ID (identifier) of the last commit is what HEAD means.
- Yes: this is equivalent to the answer to 2.
- No: git restore assumes HEAD, so Git will assume you’re trying to restore a file called HEAD, resulting in an error.
Exercise 6.11.15
- Compares what has changed between the current bin/plotcounts.py and the same file nine commits ago.
- It returns an error: fatal: ambiguous argument 'HEAD~9': unknown revision or path not in the working tree. We don’t have enough commits in history for the command to properly execute.
- It compares changes (either staged or unstaged) to the most recent commit.
Exercise 6.11.16
No, using git checkout
on a staged file does not unstage it.
The changes are in the staging area and checkout would affect
the working directory.
Exercise 6.11.17
Each line of output corresponds to a line in the file,
and includes the commit identifier,
who last modified the line,
when that change was made,
and what is included on that line.
Note that the edit you just committed is not present here;
git blame
only shows the current lines in the file,
and doesn’t report on lines that have been removed.
Chapter 7
Exercise 7.12.1
- --oneline shows each commit on a single line with the short identifier at the start and the title of the commit beside it.
- -n NUMBER limits the number of commits to show.
- --since and --after can be used to show commits in a range of dates or times; --author can be used to show commits by a particular person; and -w tells Git to ignore whitespace when comparing commits.
Exercise 7.12.2
An online search for “show Git branch in Bash prompt” turns up several approaches,
one of the simplest of which is to add this line to our ~/.bashrc
file:
export PS1="\\w + \$(git branch 2>/dev/null | grep '^*' |
colrm 1 2) \$ "
Breaking it down:
- Setting the PS1 variable defines the primary shell prompt.
- \\w in a shell prompt string means “the current directory.”
- The + is a literal + sign between the current directory and the Git branch name.
- The command that gets the name of the current Git branch is in $(...). (We need to escape the $ as \$ so Bash doesn’t just run it once when defining the string.)
- The git branch command shows all the branches, so we pipe that to grep and select the one marked with a *.
- Finally, we remove the first column (i.e., the one containing the *) to leave just the branch name.
So what’s 2>/dev/null
about?
That redirects any error messages to /dev/null
,
a special “file” that consumes input without saving it.
We need that because sometimes we will be in a directory
that isn’t inside a Git repository,
and we don’t want error messages showing up in our shell prompt.
None of this is obvious, and we didn’t figure it out ourselves. Instead, we did a search and pasted various answers into explainshell.com until we had something we understood and trusted.
Exercise 7.12.3
https://github.com/github/gitignore/blob/master/Python.gitignore
ignores 76 files or patterns.
Of those,
we recognized less than half.
Searching online for some of these,
like "*.pot file"
,
turns up useful explanations.
Searching for others like var/
does not;
in that case,
we have to look at the category (in this case, “Python distribution”)
and set aside time to do more reading.
Exercise 7.12.4
- git diff master..same does not print anything because there are no differences between the two branches.
- git merge same master prints merging because Git combines histories even when the files themselves do not differ. After running this command, git log shows a commit for the merge.
Exercise 7.12.5
Git refuses to delete a branch with unmerged commits because it doesn’t want to destroy our work.

- Using the -D (capital-D) option to git branch will delete the branch anyway. This is dangerous because any content that exists only in that branch will be lost.
- Even with -D, git branch will not delete the branch we are currently on.
Exercise 7.12.6
- Chartreuse has repositories on GitHub and their desktop containing identical copies of README.md and nothing else.
- Fuchsia has repositories on GitHub and their desktop with exactly the same content as Chartreuse’s repositories.
- fuchsia.txt is in both of Fuchsia’s repositories but not in Chartreuse’s repositories.
- fuchsia.txt is still in both of Fuchsia’s repositories but still not in Chartreuse’s repositories.
- chartreuse.txt is in both of Chartreuse’s repositories but not yet in either of Fuchsia’s repositories.
- chartreuse.txt is in Fuchsia’s desktop repository but not yet in their GitHub repository.
- chartreuse.txt is in both of Fuchsia’s repositories.
- fuchsia.txt is in Chartreuse’s GitHub repository but not in their desktop repository.
- All four repositories contain both fuchsia.txt and chartreuse.txt.
Chapter 8
Exercise 8.14.1
Our license is at https://github.com/merely-useful/py-rse/blob/book/LICENSE.md.
Our contribution guidelines are at https://github.com/merely-useful/py-rse/blob/book/CONTRIBUTING.md.
Exercise 8.14.3
The newly created LICENSE.md
should look something like the example
MIT License shown in Section 8.4.1.
Exercise 8.14.4
The text in the README.md
might look something like:
## Contributing
Interested in contributing?
Check out the [CONTRIBUTING.md](CONTRIBUTING.md)
file for guidelines on how to contribute.
Please note that this project is released with a
[Contributor Code of Conduct](CONDUCT.md).
By contributing to this project,
you agree to abide by its terms.
Your CONTRIBUTING.md
file might look something like the following:
# Contributing
Thank you for your interest
in contributing to the Zipf's Law package!
If you are new to the package and/or
collaborative code development on GitHub,
feel free to discuss any suggested changes via issue or email.
We can then walk you through the pull request process if need be.
As the project grows,
we intend to develop more detailed guidelines for submitting
bug reports and feature requests.
We also have a code of conduct
(see [`CONDUCT.md`](CONDUCT.md)).
Please follow it in all your interactions with the project.
Exercise 8.14.6
We often delete the duplicate
label:
when we mark an issue that way,
we (almost) always add a comment saying which issue it’s a duplicate of,
in which case it’s just as sensible to label the issue wontfix
.
Exercise 8.14.7
Some solutions could be:
- Give the team member their own office space so they don’t distract others.
- Buy noise-cancelling headphones for the employees that find it distracting.
- Re-arrange the work spaces so that there is a “quiet office” and a regular office space and have the team member with the attention disorder work in the regular office.
Exercise 8.14.8
Possible solutions:
- Change the rule so that anyone who contributes to the project, in any way, gets included as a co-author.
- Update the rule to include a contributor list on all projects with descriptions of duties, roles, and tasks the contributor provided for the project.
Exercise 8.14.9
We obviously can’t say which description fits you best, but:
Use three sticky notes and interruption bingo to stop Anna from cutting people off.
Tell Bao that the devil doesn’t need more advocates, and that he’s only allowed one “but what about” at a time.
Hediyeh’s lack of self-confidence will take a long time to remedy. Keeping a list of the times she’s been right and reminding her of them frequently is a start, but the real fix is to create and maintain a supportive environment.
Unmasking Kenny’s hitchhiking will feel like nit-picking, but so does the accounting required to pin down other forms of fraud. The most important thing is to have the discussion in the open so that everyone realizes he’s taking credit for everyone else’s work as well as theirs.
Melissa needs a running partner—someone to work beside her so that she starts when she should and doesn’t get distracted. If that doesn’t work, the project may need to assign everything mission-critical to someone else (which will probably lead to her leaving).
Petra can be managed with a one-for-one rule: each time she builds or fixes something that someone else needs, she can then work on something she thinks is cool. However, she’s only allowed to add whatever it is to the project if someone else will publicly commit to maintaining it.
Get Frank and Raj off your project as quickly as you can.
Chapter 9
Exercise 9.11.1
make -n target
will show commands without running them.
Exercise 9.11.2
- The -B option rebuilds everything, even files that aren’t out of date.
- The -C option tells Make to change directories before executing, so that make -C ~/myproject runs Make in ~/myproject regardless of the directory it is invoked from.
- By default, Make looks for (and runs) a file called Makefile or makefile. If you use another name for your Makefile (which is necessary if you have multiple Makefiles in the same directory), then you need to specify the name of that Makefile using the -f option.
Exercise 9.11.3
mkdir -p some/path
makes one or more nested directories if they don’t exist,
and does nothing (without complaining) if they already exist.
It is useful for creating the output directories for build rules.
Exercise 9.11.4
The build rule for generating the result for any book should now be:
## results/%.csv : regenerate result for any book.
results/%.csv : data/%.txt $(COUNT)
	@bash $(SUMMARY) $< Title
	@bash $(SUMMARY) $< Author
	python $(COUNT) $< > $@
where SUMMARY
is defined earlier in the Makefile
as
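A sketch (the script path bin/book_summary.sh is an assumption):

SUMMARY=bin/book_summary.sh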
and the settings build rule now includes:
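Again a sketch, assuming the settings target echoes each variable:

	@echo SUMMARY: $(SUMMARY)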
Exercise 9.11.5
Since we already have a variable RESULTS
that contains all of the results files,
all we need is a phony target that depends on them:
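A minimal sketch of such a target (the target name results is an assumption):

.PHONY : results
results : $(RESULTS)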
Exercise 9.11.6
If we use a shell wildcard in a rule like this:
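A sketch of such a rule (the COLLATE variable and the recipe follow the pattern of the project’s other rules and are assumptions):

results/collated.csv : results/*.csv
	python $(COLLATE) $^ > $@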
then if results/collated.csv
already exists,
the rule tells Make that the file depends on itself.
Exercise 9.11.7
Our rule is:
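A sketch consistent with the explanation below:

.PHONY : help
help :
	@grep -h -E '^##' $(MAKEFILE_LIST) | sed -e 's/## //g' \
	| column -t -s ':'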
- The -h option to grep tells it not to print filenames, while the -E option tells it to interpret ^## as a pattern.
- MAKEFILE_LIST is an automatically defined variable with the names of all the Makefiles in play. (There might be more than one because Makefiles can include other Makefiles.)
- sed can be used to do string substitution.
- column formats text nicely in columns.
Chapter 10
Exercise 10.8.1
The build rule involving plotcounts.py
should now read:
## results/collated.png: plot the collated results.
results/collated.png : results/collated.csv $(PARAMS)
	python $(PLOT) $< --outfile $@ --plotparams $(word 2,$^)
where PARAMS
is defined earlier in the Makefile
along with all the other variables and
also included later in the settings build rule:
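A sketch (the parameter file path is an assumption):

PARAMS=bin/plotparams.yml

and, in the settings rule:

	@echo PARAMS: $(PARAMS)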
Exercise 10.8.2
- Make the following additions to
plotcounts.py
:
parser.add_argument('--style', type=str,
                    choices=plt.style.available,
                    default=None, help='matplotlib style')
- Add nargs='*' to the definition of the --style option:
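A sketch of the revised option (matplotlib’s plt.style.use also accepts a list of styles, so main needs no further change):

parser.add_argument('--style', type=str, nargs='*',
                    choices=plt.style.available,
                    default=None, help='matplotlib style')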
Exercise 10.8.3
The first step is to add a new command-line argument to tell plotcount.py
what we want to do:
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    # ...other options as before...
    parser.add_argument('--saveconfig', type=str, default=None,
                        help='Save configuration to file')
    args = parser.parse_args()
    main(args)
Next, we add three lines to main
to act on this option after all of the plotting parameters have been set.
For now we use return
to exit from main
as soon as the parameters have been saved;
this lets us test our change without overwriting any of our actual plots.
def save_configuration(fname, params):
    """Save configuration to a file."""
    # Requires yaml to be imported at the top of the script.
    with open(fname, 'w') as writer:
        yaml.dump(params, writer)


def main(args):
    """Run the command line program."""
    if args.style:
        plt.style.use(args.style)
    set_plot_params(args.plotparams)
    if args.saveconfig:
        save_configuration(args.saveconfig, mpl.rcParams)
        return
    df = pd.read_csv(args.infile, header=None,
                     names=('word', 'word_frequency'))
    # ...carry on producing plot...
Finally, we add a target to Makefile
to try out our change.
We do the test this way so that we can be sure that
we’re testing with the same options we use with the real program;
if we were to type in the whole command ourselves,
we might use something different.
We also save the configuration to /tmp
rather than to our project directory
to keep it out of version control’s way:
## test-saveconfig : save plot configuration.
test-saveconfig :
	python $(PLOT) --saveconfig /tmp/test-saveconfig.yml \
	--plotparams $(PARAMS)
The output is over 400 lines long, and includes settings for everything from the animation bit rate to the size of y-axis ticks:
!!python/object/new:matplotlib.RcParams
dictitems:
_internal.classic_mode: false
agg.path.chunksize: 0
animation.avconv_args: []
animation.avconv_path: avconv
animation.bitrate: -1
...
ytick.minor.size: 2.0
ytick.minor.visible: false
ytick.minor.width: 0.6
ytick.right: false
The beautiful thing about this file is that the entries are automatically sorted alphabetically,
which makes it easy for both human beings and the diff
command to spot differences.
This helps reproducibility because any one of these settings might change
in a new release of matplotlib
,
and any of those changes might affect our plots.
Saving the settings allows us to compare what we had when we did our work
to what we have when we’re trying to re-create it,
which in turn gives us a starting point for debugging if we need to.
Exercise 10.8.4
import configparser


def set_plot_params(param_file):
    """Set the matplotlib parameters."""
    if param_file:
        config = configparser.ConfigParser()
        config.read(param_file)
        for section in config.sections():
            for param in config[section]:
                value = config[section][param]
                mpl.rcParams[param] = value
Most people seem to find Windows INI files easier to write and read, since it’s easier to see what’s a heading and what’s a value.
However, Windows INI files only provide one level of sectioning, so complex configurations are harder to express. Thus, while YAML may be a bit more difficult to get started with, it will take us further.
Exercise 10.8.5
The answer depends on whether we are able to make changes to Program A and Program B. If we can, we can modify them to use overlay configuration and put the shared parameters in a single file that both programs load. If we can’t do that, the next best thing is to create a small helper program that reads their configuration files and checks that common parameters have consistent values. The first solution prevents the problem; the second detects it, which is a lot better than nothing.
Chapter 11
Exercise 11.11.1
- The first assertion checks that the input sequence values is not empty. An empty sequence such as [] will make it fail.
- The second assertion checks that each value in the list can be turned into an integer. Input such as [1, 2, 'c', 3] will make it fail.
- The third assertion checks that the total of the list is greater than 0. Input such as [-10, 2, 3] will make it fail.
Exercise 11.11.2
- Remove the comments about inserting preconditions and add the following:
assert len(rect) == 4, 'Rectangles must contain 4 coordinates'
x0, y0, x1, y1 = rect
assert x0 < x1, 'Invalid X coordinates'
assert y0 < y1, 'Invalid Y coordinates'
- Remove the comment about inserting postconditions and add the following:
assert 0 < upper_x <= 1.0, \
    'Calculated upper X coordinate invalid'
assert 0 < upper_y <= 1.0, \
    'Calculated upper Y coordinate invalid'
- The problem is that the following section of normalize_rectangle should read float(dy) / dx, not float(dx) / dy.
test_geometry.py
should read as follows:
import geometry


def test_tall_skinny():
    """Test normalization of a tall, skinny rectangle."""
    rect = [20, 15, 30, 20]
    expected_result = (0, 0, 1.0, 0.5)
    actual_result = geometry.normalize_rectangle(rect)
    assert actual_result == expected_result
- Other tests might include (but are not limited to):
def test_short_wide():
    """Test normalization of a short, wide rectangle."""
    rect = [2, 5, 3, 10]
    expected_result = (0, 0, 0.2, 1.0)
    actual_result = geometry.normalize_rectangle(rect)
    assert actual_result == expected_result


def test_negative_coordinates():
    """Test rectangle normalization with negative coords."""
    rect = [-2, 5, -1, 10]
    expected_result = (0, 0, 0.2, 1.0)
    actual_result = geometry.normalize_rectangle(rect)
    assert actual_result == expected_result
Exercise 11.11.3
There are three approaches to testing when pseudo-random numbers are involved:
Run the function once with a known seed, check and record its output, and then compare the output of subsequent runs to that saved output. (Basically, if the function does the same thing it did the first time, we trust it.)
Replace the pseudo-random number generator with a function of our own that generates a predictable series of values. For example, if we are randomly partitioning a list into two equal halves, we could instead use a function that puts odd-numbered values in one partition and even-numbered values in another (which is a legal but unlikely outcome of truly random partitioning).
Instead of checking for an exact result, check that the result lies within certain bounds, just as we would with the result of a physical experiment.
Exercise 11.11.4
This result seems counter-intuitive to many people because relative error is a measure of a single value, but in this case we are looking at a distribution of values: each result is off by 0.1 compared to a range of 0–2, which doesn’t “feel” infinite. In this case, a better measure might be the largest absolute error divided by the standard deviation of the data.
Chapter 12
Exercise 12.6.1
Add a new command-line argument to collate.py
:
parser.add_argument('-v', '--verbose',
                    action="store_true", default=False,
                    help="Set logging level to DEBUG")
and two new lines to the beginning of the main
function:
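log_lev = logging.DEBUG if args.verbose else logging.WARNING
logging.basicConfig(level=log_lev)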
such that the full collate.py
script now reads as follows:
"""
Combine multiple word count CSV-files
into a single cumulative count.
"""
import csv
import argparse
from collections import Counter
import logging
import utilities as util
ERRORS = {
'not_csv_suffix' : '{fname}: File must end in .csv',
}
def update_counts(reader, word_counts):
"""Update word counts with data from another reader/file."""
for word, count in csv.reader(reader):
word_counts[word] += int(count)
def main(args):
"""Run the command line program."""
log_lev = logging.DEBUG if args.verbose else logging.WARNING
logging.basicConfig(level=log_lev)
word_counts = Counter()
logging.info('Processing files...')
for fname in args.infiles:
logging.debug(f'Reading in {fname}...')
if fname[-4:] != '.csv':
msg = ERRORS['not_csv_suffix'].format(fname=fname)
raise OSError(msg)
with open(fname, 'r') as reader:
logging.debug('Computing word counts...')
update_counts(reader, word_counts)
util.collection_to_csv(word_counts, num=args.num)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument('infiles', type=str, nargs='*',
help='Input file names')
parser.add_argument('-n', '--num',
type=int, default=None,
help='Output n most frequent words')
parser.add_argument('-v', '--verbose',
action="store_true", default=False,
help="Set logging level to DEBUG")
args = parser.parse_args()
main(args)
Exercise 12.6.2
Add a new command-line argument to collate.py
:
parser.add_argument('-l', '--logfile',
                    type=str, default='collate.log',
                    help='Name of the log file')
and pass the name of the log file to logging.basicConfig
using the filename
argument:
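logging.basicConfig(level=log_lev, filename=args.logfile)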
such that the collate.py
script now reads as follows:
"""
Combine multiple word count CSV-files
into a single cumulative count.
"""
import csv
import argparse
from collections import Counter
import logging
import utilities as util
ERRORS = {
'not_csv_suffix' : '{fname}: File must end in .csv',
}
def update_counts(reader, word_counts):
"""Update word counts with data from another reader/file."""
for word, count in csv.reader(reader):
word_counts[word] += int(count)
def main(args):
"""Run the command line program."""
log_lev = logging.DEBUG if args.verbose else logging.WARNING
logging.basicConfig(level=log_lev, filename=args.logfile)
word_counts = Counter()
logging.info('Processing files...')
for fname in args.infiles:
logging.debug(f'Reading in {fname}...')
if fname[-4:] != '.csv':
msg = ERRORS['not_csv_suffix'].format(fname=fname)
raise OSError(msg)
with open(fname, 'r') as reader:
logging.debug('Computing word counts...')
update_counts(reader, word_counts)
util.collection_to_csv(word_counts, num=args.num)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument('infiles', type=str, nargs='*',
help='Input file names')
parser.add_argument('-n', '--num',
type=int, default=None,
help='Output n most frequent words')
parser.add_argument('-v', '--verbose',
action="store_true", default=False,
help="Set logging level to DEBUG")
parser.add_argument('-l', '--logfile',
type=str, default='collate.log',
help='Name of the log file')
args = parser.parse_args()
main(args)
Exercise 12.6.3
- The loop in
collate.py
that reads/processes each input file should now read as follows:
for fname in args.infiles:
    try:
        logging.debug(f'Reading in {fname}...')
        if fname[-4:] != '.csv':
            msg = ERRORS['not_csv_suffix'].format(fname=fname)
            raise OSError(msg)
        with open(fname, 'r') as reader:
            logging.debug('Computing word counts...')
            update_counts(reader, word_counts)
    except Exception as error:
        logging.warning(f'{fname} not processed: {error}')
- The loop in
collate.py
that reads/processes each input file should now read as follows:
for fname in args.infiles:
    try:
        logging.debug(f'Reading in {fname}...')
        if fname[-4:] != '.csv':
            msg = ERRORS['not_csv_suffix'].format(
                fname=fname)
            raise OSError(msg)
        with open(fname, 'r') as reader:
            logging.debug('Computing word counts...')
            update_counts(reader, word_counts)
    except FileNotFoundError:
        msg = f'{fname} not processed: File does not exist'
        logging.warning(msg)
    except PermissionError:
        msg = f'{fname} not processed: No read permission'
        logging.warning(msg)
    except Exception as error:
        msg = f'{fname} not processed: {error}'
        logging.warning(msg)
Exercise 12.6.4
- The try/except block in collate.py should begin as follows:
- The following additions need to be made to test_zipfs.py.
- The following unit test needs to be added to test_zipfs.py.
- The following sequence of commands is required to test the code coverage (a sketch of these commands appears after this list).
- Open htmlcov/index.html and click on bin/collate.py to view a coverage summary. The lines of process_files that include the raise OSError and open(fname, 'r') commands should appear in green after clicking the green “run” box in the top left-hand corner of the page.
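A sketch of the coverage commands (assuming the coverage package is installed; htmlcov/index.html is its default report location):

coverage run -m pytest
coverage html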
Exercise 12.6.5
- The convention is to use ALL_CAPS_WITH_UNDERSCORES when defining global variables.
- Python’s f-strings interpolate variables that are in scope: there is no easy way to interpolate values from a lookup table. In contrast, str.format can be given any number of named keyword arguments (Appendix F), so we can look up a string and then interpolate whatever values we want.
- Once ERRORS has been moved to the utilities module, all references to it in collate.py must be updated to util.ERRORS.
Exercise 12.6.6
A traceback is an object that records where an exception was raised, what stack frames were on the call stack when the error occurred, and other details that are helpful for debugging. Python’s traceback library can be used to get and print information from these objects.
Chapter 13
Exercise 13.4.1
You can get an ORCID by registering at https://orcid.org. Please add this 16-digit identifier to all of your published works and to your online profiles.
Exercise 13.4.2
If possible, compare your answers with those of a colleague who works with the same data. Where did you agree and disagree, and why?
Exercise 13.4.3
51 solicitors were interviewed as the participants.
Interview data, and data from a database on court decisions.
This information is not available within the documentation. Information on their jobs and opinions are there, but the participant demographics are only described within the associated article. The difficulty is that the article is not linked within the documentation or the metadata.
We can search the dataset name and author name trying to find this. A search for the grant information with “National Science Foundation (1228602)” finds the grant page. Two articles are linked there, but both the DOI links are broken. We can search with the citation for each paper to find them. The Forced Migration article uses a different subset of interviews and does not mention demographics, nor links to the deposited dataset. The Boston College Law Review article has the same two problems of different data and no dataset citation.
Searching more broadly through Meili’s work, we can find Meili (2015). This lists the dataset as a footnote and reports the 51 interviews with demographic data on reported gender of the interviewees. This paper lists data collection as 2010–2014, while the other two say 2010–2013. We might come to a conclusion that this extra year is where the extra 9 interviews come in, but that difference is not explained anywhere.
Exercise 13.4.4
For borstlab/reversephi_paper:

- The software requirements are documented in README.md. In addition to the tools used in the zipf/ project (Python, Make and Git), the project also requires ImageMagick. No information on installing ImageMagick or a required version of ImageMagick is provided.
- To re-create the conda environment, you would need the file my_environment.yml. Instructions for creating and using the environment are provided in README.md.
- Like zipf, the data processing and analysis steps are documented in a Makefile. The README includes instructions for re-creating the results using make all.
- There doesn’t seem to be a DOI for the archived code and data, but the GitHub repo does have a release v1.0 with the description “Published manuscript (1.0)” beside it. A zip file of this release could be downloaded from GitHub.
For the figshare page that accompanies the paper Irving, Wijffels, and Church (2019):
- The figshare page includes a “Software environment” section. To re-create the conda environment, you would need the file environment.yml.
- figure*_log.txt are log files for each figure in the paper. These files show the computational steps performed in generating the figure, in the form of a list of commands executed at the command line.
- code.zip is a version-controlled (using git) file directory containing the code written to perform the analysis (i.e., it contains the scripts referred to in the log files). This code can also be found on GitHub.
- The figshare page itself is the archive, and includes a version history for the contents.
For the GitHub repo blab/h3n2-reassortment:

- README.md includes an “Install requirements” section that describes setting up the conda environment using the file h3n2_reassortment.yaml.
- The analysis also depends on components from Nextstrain. Instructions for cloning them from GitHub are provided.
- The code seems to be spread across the directories jupyter_notebooks, hyphy, flu_epidemiology, and src, but it isn’t clear what order the code should be run in, or how the components depend on each other.
- The data itself is not archived, but links are provided in the “Install requirements” section of README.md to documents that describe how to obtain the data. Some intermediate data is also provided in the data/ directory.
- The GitHub repo has a release with files “that are up-to-date with the version of the manuscript that was submitted to Virus Evolution on 31 January 2019.”
Exercise 13.4.5
Exercise 13.4.6
You’ll know you’ve completed this exercise when you have a URL that points to a zip archive for a specific release of your repository on GitHub, e.g.:

https://github.com/amira-khan/zipf/archive/KhanVirtanen2020.zip
Exercise 13.4.7
Some steps to publishing your project’s code would be:

- Upload the code on GitHub.
- Use a standard folder and file structure as taught in this book.
- Include README, CONTRIBUTING, CONDUCT, and LICENSE files.
- Make sure these files explain how to install and configure the required software and tell people how to run the code in the project.
- Include a requirements.txt file for Python package dependencies.
Chapter 14
Exercise 14.9.1
A description and a long_description argument need to be provided when the setup function is called in setup.py. On the TestPyPI webpage, the user interface displays the description in the grey banner and the long_description in the section named “Project Description.”
Other metadata that might be added includes the author email address, software license details and a link to the documentation at Read the Docs.
Exercise 14.9.3
The answers to the relevant questions from the checklist are shown below.
- Repository: Is the source code for this software available at the repository url?
- Yes. The source code is available at PyPI.
- License: Does the repository contain a plain-text LICENSE file
with the contents of an OSI approved software license?
- Yes. Our GitHub repository contains LICENSE.md (Section 8.4.1).
- Installation: Does installation proceed as outlined in the documentation?
- Yes. Our README says the package can be installed via pip.
- Functionality: Have the functional claims of the software been confirmed?
- Yes. The command-line programs countwords, collate, and plotcounts perform as described in the README.
- A statement of need: Do the authors clearly state what problems the software
is designed to solve and who the target audience is?
- Yes. The “Motivation” section of the README explains this.
- Installation instructions: Is there a clearly stated list of dependencies?
Ideally these should be handled with an automated package management solution.
- Yes. In our setup.py file the install_requires argument lists dependencies.
- Example usage: Do the authors include examples of how to use the software
(ideally to solve real-world analysis problems).
- Yes. There are examples in the README.
- Functionality documentation: Is the core functionality of the software documented
to a satisfactory level (e.g., API method documentation)?
- Yes. This information is available on Read the Docs.
- Automated tests: Are there automated tests or manual steps described
so that the functionality of the software can be verified?
- We have unit tests written and available (test_zipfs.py), but our documentation needs to be updated to tell people to run pytest in order to manually run those tests.
- Community guidelines: Are there clear guidelines for third parties wishing
to 1) Contribute to the software 2) Report issues or problems
with the software 3) Seek support?
- Yes. Our CONTRIBUTING file explains this (Section 8.11).
Exercise 14.9.4
The directory tree for the pratchett
package is:
pratchett
├── pratchett
│ └── quotes.py
├── README.md
└── setup.py
README.md
should contain a basic description of the package and how to install/use it,
while setup.py
should contain:
from setuptools import setup


setup(
    name='pratchett',
    version='0.1',
    author='Amira Khan',
    packages=['pratchett'],
)
The following sequence of commands will create the development environment, activate it, and then install the package:
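A sketch of those commands (assuming conda is used to manage environments; the environment name pratchett is an assumption):

conda create -n pratchett python
conda activate pratchett
cd pratchett
pip install -e .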