Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Lecture 8: Tools and best practices for mathematical programming
References:
Summary:
In this lecture, we start with discussing the best formats to store code and good coding styles. Then we learn about some tools to debug your code (i.e. find mistakes) and how raise and interpret exceptions. We show how to write good documentation for your code and how you can use it to test whether it does what you want. Finally, we see how you can use a profiler to find the slow portions of a program, and some basic strategies for speeding things up.
In the appendix, you find a short crash course on how to use the console (which is needed e.g. for the tests mentioned above).
Introduction
Until now, the emphasis of the lecture has been on performing concrete computations. In the course of this, we have already seen that it can be useful to write functions and classes that encapsulate algorithms and sub-computations that we want to call repeatedly.
The focus for the rest of the course will shift to this activity of mathematical programming, i.e. how to develop and share good code, which can then be used in more complicated computations.
Formats of storing and sharing code
In the lectures so far we have worked with Jupyter notebooks, and these can be a nice way to share short code snippets and computations. For instance, imagine you thought of the following revolutionary algorithm for computing square roots:
Then you could create a new notebook sqroot.ipynb
, have a cell as above, and even include a cell where you show how to apply the function in practice:
This is fine in principle, but can have a few downsides:
Some people like to work in a console, and they would have to copy-paste your code by hand.
Others might want to integrate your code into their own programs, again having to copy-paste.
This problem only gets worse if you have many lines of code.
A first way of solving this problem is to save the code into a Sage file. Let's try this out (if you haven't done so, see the Appendix on using the console below):
Exercise
Create a new file
sqroot.sage
in the same folder as the current notebook08 Tools and best practices for mathematical programming.ipynb
.Open it using a text editor, paste the code of the function
square_root
above into the file and save it.In the current notebook, restart the kernel via
Kernel -> Restart
and convince yourself that afterwards, the SageMath session no longer knows the functionsquare_root
.Load the content of the file
sqroot.sage
via the following cell, and check that afterwards the functionsquare_root
is known by computing .
Such a Sage file is perfectly adequate for working on and sharing code of moderate length. In the next lecture, we will discuss a slightly more professional way of doing this, via Python files and packages. We'll talk about the additional features that this brings, but for now, we will stick to Sage files.
Addendum
Answering a question from the audience: it is also possible to "load" a Jupyter notebook using the magic command %run
. E.g. if the folder containing the current notebook also contains e.g. the file 04 Analysis and power series.ipynb
that we worked with before, you can execute all code cells in that notebook using the following command:
Then you have access to the variables and functions defined there, e.g. the parabola
-function we saw at the beginning of lecture 4:
Good coding style
To make code easier to read and understand, there are certain conventions which should be followed. You can have a look at the summary of conventions for coding in SageMath or the more detailed PEP 8 style guide for Python code.
The most important points:
Use 4 spaces for indentation (and no Tab-characters).
Use a single space before and after lowest priority operators (like
=, +, *, ...
) in an expression, and after commas:
Use
snake_case
for function names andCamelCase
for class names:
Exercise
Repair the following code to comply with standard conventions:
Solution (click to expand)
There are some automated checkers like flake8 to test if your code satisfies standard conventions, and autopep8 to automatically clean up many issues. For now, don't worry too much about it, but if you plan to get serious about writing code (e.g. by contributing to the SageMath development), you should get familiar with some of these conventions.
Tools for debugging
Now we can write beautiful code, but there might still be an issue: maybe the code does not work! The process of finding errors in code is called debugging and there are some good tools helping us do this.
Syntax errors
When trying to execute the code, you get an error message, pointing out where the syntax is incorrect.
Exercise
What goes wrong here?
Solution (click to expand)
Missing :
after else
:
Understanding value and runtime errors using %debug
Sometimes a program is interrupted because at some point in the program an exception is raised, e.g. a ValueError
to indicate that the value of a certain variable does not make sense in a given context. You then see the chain of function calls leading to this exception.
In this case, we can upen a debugging tool, allowing us to explore the state of the program when it failed as follows:
Click here to see an example session
Useful commands:
Exercise
Several things go wrong here. Can you fix them?
Solution (click to expand)
The argument of
quadruple_of
is calledm
, but the functiondouble_of
is called forn
. This defaults to the global functionn
(which you use to computen(pi)
as3.1415...
).The function
double_of
prints a nice statement, but does not return the double of its input (and thus gives backNone
, the default value when a function ends without having called areturn
).
Here is the corrected code:
You can also use %debug
before something breaks. By typing %debug code_line
, you can start executing a certain line of code in the debugging environment above. Using the commands s
(to execute the next step of the program) and n
(to let the program run until the next line is reached) you can then trace the computation step by step and check the values of variables and order of execution, etc.
Click here to see an example session
We can see that when we call double_of
for the first time, its argument is the function n
computing numerical approximations, instead of the expected number 5
.
Raising exceptions yourself
Note that sometimes it can be good to raise exceptions yourself, since they allow you and others to find errors more easily. Take our good old square root function above, which we now build on to define a fourth root:
The following gives a very confusing error message:
What goes wrong is that even the square root function for input 15
returns None
.
It's better to immediately raise an exception when our program does not find a solution:
It also allows the programmer of fourth_root
to try alternative solutions if our square_root
function fails:
Below is a small list of the most common types of exceptions in Python that you would typically include in your own code:
| Exception | Usage | | --- | --- | | ValueError
| a function receives an argument that has the right type but an inappropriate value | | NotImplementedError
| a function has not been implemented yet (maybe for a particular type of input) | | TypeError
| a function is applied to an object of inappropriate type (e.g. a string
when an int
is expected) | | AssertionError
| raised by assert condition
if condition
is False
; used for debugging | | RuntimeError
| an error is detected that does not fall into any of the other categories |
Here you find the official Python 3 documentation about exceptions.
Documentation and doctests
We have already seen and used many times that we can look at the documentation of a SageMath function by typing something like primes?
.
In this documentation, we get
a description what the function does,
a list of its arguments, their types, default values and meanings,
a block of example computations, showing how to use the function in practice and what the output looks like.
It is an excellent idea to add documentation to your own functions. This is done by specifying a string (enclosed by r""" ... """
) at the beginning of a function definition:
Some remarks:
In very simple cases (as above) the
INPUT
section can be omitted.The
EXAMPLES
section is indented and should look like copy-pasting the text of a SageMath session in a console (hence thesage:
prompts). In particular, running such a session is a good way to generate this part! Also, it is important to be creative with possible inputs, exploring some edge cases (like0
or negative numbers above), and also demonstrating values for optional arguments.Many more good practices, further possible sections, etc can be found in this official guide.
Exercise
Open again the file
sqroot.sage
from the beginning, and add a documentation to it. (Hint: As in many situations, it's a good idea to just copy-paste and modify an existing docstring, like from the functionmultiply
above).Load the file and look at your docstring using
square_root?
.
Solution (click to expand)
Here is a possible docstring:
After saving the changes to sqroot.sage
, we load the file again and look at the documentation as follows:
One cool feature of having a documentation with EXAMPLE
section as above, is that you can use it to test your code and make sure that any changes you make do not break some previously working function. This can be done via SageMath's doctests. The idea is simple: a program goes through all documentations of functions that you wrote, looks for lines starting with sage:
and runs those lines in SageMath. If the output is not identical to the one specified in the documentation (meaning that the string printed in the terminal is the same), it raises a warning. Let's try it out:
Exercise
Open a console and navigate to the folder containing the file sqroot.sage
, which should contain the function square_root
whose documentation has some EXAMPLE
block as above. Call the command
and look at the output.
Remark: This assumes that typing sage
in your console starts a SageMath session. Depending on your installation, you should change sage
to Sage-9.4
or /home/users/donald_duck/sage/sage
(i.e. the file in your local installation).
Bonus: Ideally, the above should produce an ouput saying that all tests passed. You can try changing the function square_root
, e.g. by inserting a line return 42
at the beginning, and repeat the procedure to check that now the doctests produce errors.
Profiling and optimization
In practial applications, it can sometimes happen that you have some code which produces the correct answer, but which is too slow to be useful. In this section, we'll discuss some tools that can help you find the slow parts of the code, and some possibilities for making things faster.
We'll begin with an innocent-looking example:
This is a perfectly correct implementation of the Fibonacci numbers. Using the magic command %time
we can check how it performs relative to the pre-defined function fibonacci
:
While they give the same answer, the fibonacci
function is much faster, and our handmade version even starts struggling at around n=30
to produce an answer in any reasonable time.
One way to explore what happens is using the magic %prun
, which does a profiler run of the command that follows. This means that this command is executed and a record is created, remembering which functions are called how many times, and how long that took. Let's see it in action:
We see that to compute the th Fibonacci number, our function my_fibonacci
is called a total number of times! The reason is that my_fibonacci
does not remember when it has been called previously, so in the course of computing my_fibonacci(n)
, the value of my_fibonacci(n-2)
is computed twice, my_fibonacci(n-3)
is called three times, my_fibonacci(n-4)
is called five times and so on (do you see the pattern?).
See here for a nice explanation and picture of this tree of function calls.
To avoid this, we can use memoization: we make the function my_fibonacci
create a list of all previous inputs and outputs. When it is called, it first looks up in this list if it has already computed the accociated value. If yes, the stored value is returned, if not the function code is executed, the answer is stored and then returned by the function. From a practical point of view, we simply have to add a line @cached_function
before our function definition:
For optimizing the function my_fibonacci
above, the main important output of %prun
was to show us that the function is called a large number of times without memoization. In more complex examples, it is also important to find out how much computation time is spent in different functions.
Take the following example case: somewhere in our computation, a large square matrix of numbers in is created, and we want to compute its characteristic polynomial. However, this seems very slow in practice, so we can start to investigate:
Here, we sorted the output of %prun
by the cumulative time. This means: if the method _add_
has a cumulative time of 3.5 seconds, then the computation spends 3.5 seconds in the method _add_
itself, or in methods called from this. Conversely, the internal time (listed under tottime
) tells us how much time is spent on basic Python operations like for
loops or additions of int
s inside the actual code lines of _add_
.
In my experience, cumulative time is almost always more useful to find the slow parts of your code.
But what to do about the matrix M
above? From the output of %prun
we see that a big chunk of the time is spent on basic addition and multiplication in QQbar
. Taking a look at the actual code, we see that the entries of the matrix M
are not arbitrary numbers in QQbar
but in fact integers. Thus it is much faster to compute the characteristic polynomial with integer coefficients (where addition is quick).
Exercise
Let and consider the -matrix given by The function strange_quantity
below computes the determinant of M
in dependence of n
. Even though the problem seems manageable, the function struggles at n=8
and downright crashes at higher n
. Can you find out which parts are slow? Can you see how to make it faster (and not crash)?
Solution (click to expand)
Using %prun -s cumulative strange_quantity(8)
we can see that almost all the time is spent in the final computation of the determinant and inside there, the time is spent in maxima
, which is the engine handling symbolic expressions in SageMath. A priori this might sound strange: by substituting x=y=1
, we would expect that the entries of M
are constants (in fact, Integers). But that's not the case: after subs
, the entries of M
are still symbolic expressions. To see this, you can insert a line print(type(M[0,0]))
before the final return
statement, which prints:
Thus as before, we can make things faster by making sure we have an integer matrix before computing the determinant. Here is a variant of the code that does this:
Now the code finishes extremely fast (for n
up to 40). Going even higher, it starts getting slow again, and %prun
shows us that now the bottleneck is the derivative
method. If you do need strange_quantity
for higher n
, one possibility is to compute a formula for the entries of M
by hand (it's not hard) and program that formula instead of using symbolic derivatives.
There are many general strategies for writing fast code, e.g.
avoid unnecessary copy operations
prefer list comprehensions over
for
loopsprefer generators over lists
... (see the references at the beginning of the notebook)
These are good to learn and apply in the long term. But for a concrete programming problem, it's often already half the solution once you know which part is slow.
Programming workflow
Above we learned how to write good code, find mistakes in it (I guess it's not that good after all) and speed it up. An important final piece of advice is that this is the right order to approach a programming project. Or, to quote from a guide to code optimization linked in the references above, when approaching a programming problem, you should
Make it run
Make it right
Make it fast
To expand this a bit, a good workflow is to
Write a first version of the program that should work in principle. It often helps to already start writing the documentation here, explaining the precise input and expected output.
Try that version and find all basic coding errors.
Once it runs without errors on some small example, try testing it for bigger examples and check whether it actually outputs the correct result there (this often involves computing some small cases by hand).
Add these basic examples to the documentation as doctests.
Now you can see whether the program is fast enough. If not, begin profiling to work out the bottleneck of the calculation, and then optimize it.
In particular, you should not do step 5 before 1-4, since you might a) optimize parts of the code that are not actually slow, or b) break the code without noticing it.
Appendix: Crash course on consoles
Probably you already used a console to start the Jupyter notebooks. On Linux and Mac OS the program to start a console is often called Terminal
, whereas on Windows (if you use the standard Linux installation) there is the program SageMath 9.X Shell
which serves the same purpose. In the console you enter one command at a time, followed by Enter
. Here are a few examples:
Here is an example session:
Our old friend, Tab
completion also works here, so if you type cd Desk
+Tab
it will complete the directory.
Assignments
Exercise
Find out what goes wrong with the following code for computing the body mass index of a list of patients. This example is taken from the course Software Carpentry: Programming with Python., which has many more nice explanations about debugging.
Solution (click to expand)
For some reason, the program always computes the BMI of the first person, and ignores the others, even though we have a for
loop going through all patients. The explanation is that the code ignores the variable of the for
loop and always manually looks up the data of the first patient. Thus the line
should be replaced by
The second thing going wrong is that the printed BMIs are very small! This you would notice if you compute some example by hand (e.g. when writing a doctest for your function). Investigating again, we see that when calling the function calculate_bmi
, we have switched the order of the arguments. Thus the fully corrected code is:
Exercise
Below is an implementation of the sieve of Eratosthenes for computing the prime numbers up to n
:
Write a documentation for the function, including a few doctests.
Save the function into a Sage file and use a console to run the doctests.
Unfortunately it takes a long time to finish for
n=30000
. Can you find out which parts are slow, and write a version which finishes in under 5 seconds forn=30000
(either by improving the code above, or by writing a new one)?
Hint: You are also allowed to use your mathematical understanding to improve the algorithm above!
Solution (click to expand)
Here is a reasonable documentation for the function eratosthenes
. Since it only has one input variable, we don't need a separate INPUT
section. Note that in the doctests we checked some non-trivial examples, in particular for zero or negative inputs.
After saving the code to a Sage file (e.g. esieve.sage
) we can run our doctests as follows:
Concerning optimization: using %prun
we can see that a significant part of the time is spent on appending new elements (since in each iteration we build an entirely new list by starting with the empty list and appending the elements we like. In general, it is much faster to use list comprehension, or specialized methods like filter
to create new lists or iterators.
Also, mathematically we know that it makes no sense to filter out multiples of numbers d
which we already saw are not prime numbers. So it is better to always just filter out the multiples of the smallest number left in our list (which we know must be a prime number since it was not filtered out earlier). Here is some code which finishes reasonably fast:
By repeating the doctests after our changes, we can make sure that our changes did not break the function.
Exercise
Here is a program solving random systems of equations:
Can you see how to make it faster for (n,m) = (40, 1000)
?
Solution (click to expand)
Looking at the code, we see that for the created matrix M
, it solves linear systems M * x = v
for m
vectors v
. For large numbers m
, it is faster to once compute the inverse matrix of M
, and then solve the system just by multiplying it against v
. Thus the following code is indeed faster: