Building Blocks¶

Premature optimization is the root of all evil

By D. Knuth. AKA “Get it right first, then make it fast”.

For this lecture, we follow much of the content already included in the following tutorials.

We’ll follow the concise Software Carpentry Testing Tutorial authored by Dr. Katy Huff.
Also this Dr. Katy Huff.

Note

There isn’t a clear borderline between software engineers and data analysts.

How would you write unit tests for data analysis? I feel it will be both tricky and unnecessary. For a function/method, if you defined it, you know what its expected output should be. For data, you often don’t know what exactly to expect in the output. For example, when you subset a dataset, how do you know the result is correct?

mtcars2 = dplyr::filter(mtcars, hp > 100)

That is probably not something you, as a data analyst, need to worry about. It is the responsibility of the package author (the software engineer) to write enough unit tests in the package that you are using.

On the other hand, data analysts often do tests in an informal way, too. As they explore the data, they may draw plots or create summary tables, in which they may be able to discover problems (e.g., wrong categories, outliers, and so on). Notebooks are great for these inline output elements, from which you can make quick discoveries.

1. Motivation¶

Let’s start by looking at some of the reasons why continuously testing our code is a good practice that produces better and more reproducible code.

1.1. Numerical precision¶

As we saw in the notebook simple-numerical-chaos.ipynb, even simple arithmetic in computers can produce surprising numerical behavior. This means that, especially when we handle lots of data, we should always strive to validate that our code produces the answers we expect.

In brief, the basic issue is that even two algebraically equivalent forms of the same (simple!) expression, in a computer, may give different results:

def f1(x): return r*x*(1-x)
def f2(x): return r*x - r*x**2

r = 3.9
x = 0.8
print('f1:', f1(x))
print('f2:', f2(x))

print('difference:', (f1(x)-f2(x)))

f1: 0.6239999999999999
f2: 0.6239999999999997
difference: 2.220446049250313e-16

Now, the decimal digits of the difference are just garbage: neither f1(x) nor f2(x) carries any information beyond its last printed digit. The apparent precision in the difference f1(x) - f2(x) is completely spurious.
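One way to make this concrete is to compare the two results with a tolerance instead of exact equality. The sketch below reuses f1 and f2 from above; passing r as a default argument and the choice of rel_tol=1e-12 are assumptions made here to keep the block self-contained, not part of the original notebook.

```python
import math
import sys

def f1(x, r=3.9):
    return r * x * (1 - x)

def f2(x, r=3.9):
    return r * x - r * x**2

x = 0.8
diff = abs(f1(x) - f2(x))

# The difference is on the order of machine epsilon, so it says nothing
# about the underlying mathematical expression.
print("machine epsilon:", sys.float_info.epsilon)
print("difference     :", diff)

# Compare with a relative tolerance rather than with ==.
print("close enough?  :", math.isclose(f1(x), f2(x), rel_tol=1e-12))
```

The tolerance is a modelling decision: it states how much disagreement we are willing to call "the same answer".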

This raises the question of what it means to get the right answer from our code, and what it means to be reproducible in scientific computing.

This short example helps us to understand what is important in the context of computational

1.2. Implementing or changing features¶

Testing also helps us when we want to make significant changes to our code and ensure that its functionality is not affected by those changes. These cases include:

Adding a new function/feature that communicates with other existing pieces of code.
Making changes to the implementation of an existing function, for example by changing the data types or the algorithm we use for certain operations.
Changing the data we feed into our code.

2. Types of tests¶

There are different classes of tests that evaluate the correctness of our code at different levels and scales. In this course, we are going to cover the following tests:

Assertion statements
Exception statements
Unit tests
Regression tests
Integration tests

2.1. Assertions¶

The assert statement in Python evaluates whether a given condition is true or false. If it is false, it raises an AssertionError and interrupts the execution of the code.

assert 1+1 == 2, "One plus one is not two."

As you can see from the previous example, you can also add a short text description of the error. Assertion statements are thus very simple to write and evaluate.

As you can imagine from the discussion in the previous section, we need to be careful when comparing objects in Python. For example, for float types we have

assert 0.1 + 0.2 == 0.3

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[3], line 1
----> 1 assert 0.1 + 0.2 == 0.3

AssertionError:

The problem here is caused by floating-point arithmetic: 0.1 + 0.2 is not exactly equal to 0.3. To avoid raising an AssertionError here, we can compare within a tolerance using numpy.testing.assert_allclose():

from numpy.testing import assert_allclose
assert_allclose(0.1 + 0.2, 0.3)

Since an AssertionError is raised whenever the given condition is not satisfied, we can use any other functionality that returns True/False. Other examples are

import math
assert math.isclose(0.1 + 0.2, 0.3), "Numbers are not close."

import pytest
assert 0.1 + 0.2 == pytest.approx(0.3), "Numbers are not close."
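A detail worth keeping in mind with these helpers is how they behave near zero. The sketch below uses only math.isclose from the standard library; the value 1e-10 and the tolerance 1e-9 are arbitrary choices for illustration. Other tools such as numpy.testing.assert_allclose and pytest.approx have their own default tolerances, so it is worth checking their documentation before relying on them.

```python
import math

tiny = 1e-10

# With the defaults (rel_tol=1e-09, abs_tol=0.0) the comparison is purely
# relative, so no nonzero number is considered close to 0.0.
print(math.isclose(tiny, 0.0))                # False

# An absolute tolerance makes comparisons against zero meaningful.
print(math.isclose(tiny, 0.0, abs_tol=1e-9))  # True
```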

Usually, assertion statements go inside functions or other definitions and help us maintain the correctness of the code. In pair programming, it is the role of the observer to think of cases where the code may not work and of simple assertion statements that will help prevent those errors.
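As a minimal sketch of an assertion placed inside a function, consider the hypothetical mean helper below (it is not part of the lecture code). The assertion documents a precondition and fails early with a readable message.

```python
def mean(values):
    # Precondition: an empty input would otherwise fail further down
    # with a less informative ZeroDivisionError.
    assert len(values) > 0, "mean() needs at least one value."
    return sum(values) / len(values)

print(mean([1.0, 2.0, 3.0]))  # 2.0
```

Note that Python skips assert statements when it runs with the -O flag, so assertions are best suited for catching programming mistakes rather than for validating user input.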

2.2. Exceptions¶

The different kinds of errors that occur as we write code include syntax, runtime, and semantic errors. For runtime errors in particular, Python gives us a clue about what kind of error happened during the execution of our code. For example,

1 / 0

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[7], line 1
----> 1 1 / 0

ZeroDivisionError: division by zero

my_dict = {'a':1, 'b':2}
my_dict['c']

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[8], line 2
      1 my_dict = {'a':1, 'b':2}
----> 2 my_dict['c']

KeyError: 'c'

my_dict + {'c':3}

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 1
----> 1 my_dict + {'c':3}

TypeError: unsupported operand type(s) for +: 'dict' and 'dict'

There are many more kinds of built-in exceptions in Python. You can find some more examples in this link. A general RuntimeError is raised when the detected error doesn’t fall into any of the other categories.

There are different ways of dealing with runtime errors in Python. These include the

try...except clause
raise statement

def division(numerator, denominator):
    try:
        return numerator / denominator
    except ZeroDivisionError:
        return 0

division(1,1)

1.0

division(1,0)

0

Now, when raising an error we would like to show a meaningful message. We can do this by catching the exception and raising a new one with our own message:

def division(numerator, denominator):
    try:
        return numerator / denominator
    except ZeroDivisionError:
        raise ZeroDivisionError(f"You cannot divide by {denominator=}")

division(1,0)

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[13], line 3, in division(numerator, denominator)
      2 try:
----> 3     return numerator / denominator
      4 except ZeroDivisionError:

ZeroDivisionError: division by zero

During handling of the above exception, another exception occurred:

ZeroDivisionError                         Traceback (most recent call last)
Cell In[14], line 1
----> 1 division(1,0)

Cell In[13], line 5, in division(numerator, denominator)
      3     return numerator / denominator
      4 except ZeroDivisionError:
----> 5     raise ZeroDivisionError(f"You cannot divide by {denominator=}")

ZeroDivisionError: You cannot divide by denominator=0
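The traceback above shows two chained exceptions because the new one was raised while the first was being handled. A possible variation, sketched here as an option rather than as the lecture's approach, is to make that link explicit with raise ... from, which stores the original exception in the __cause__ attribute.

```python
def division(numerator, denominator):
    try:
        return numerator / denominator
    except ZeroDivisionError as err:
        # "from err" records the original exception as the direct cause
        # of the new one.
        raise ZeroDivisionError(f"You cannot divide by {denominator=}") from err

try:
    division(1, 0)
except ZeroDivisionError as err:
    print(err)                  # You cannot divide by denominator=0
    print(repr(err.__cause__))  # ZeroDivisionError('division by zero')
```

Writing from None instead hides the original exception from the traceback, which can be useful when the low-level error adds no information.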

If you already know what may cause an error in your code, you can avoid the try / except statement and directly raise an exception when a certain critical condition occurs:

def division(numerator, denominator):
    if denominator == pytest.approx(0.0):
        raise ZeroDivisionError(f"You cannot divide by {denominator=}")
    return numerator / denominator

division(1, 0)

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[16], line 1
----> 1 division(1, 0)

Cell In[15], line 3, in division(numerator, denominator)
      1 def division(numerator, denominator):
      2     if denominator == pytest.approx(0.0):
----> 3         raise ZeroDivisionError(f"You cannot divide by {denominator=}")
      4     return numerator / denominator

ZeroDivisionError: You cannot divide by denominator=0

Something cool about exceptions is that they are classes, so Python allows us to define new exception types.

class LightSpeedBound(Exception):
    """
    Defines a new exception error of my preference.
    """
    pass

def lorentz_factor(v, c=299_792_458):
    if v > c:
        raise LightSpeedBound(f"The current velocity {v} cannot exceed the speed of light")
    return 1 / (1 - v**2/c**2) ** 0.5

lorentz_factor(300_000_000)

---------------------------------------------------------------------------
LightSpeedBound                           Traceback (most recent call last)
Cell In[18], line 1
----> 1 lorentz_factor(300_000_000)

Cell In[17], line 9, in lorentz_factor(v, c)
      7 def lorentz_factor(v, c=299_792_458):
      8     if v > c:
----> 9         raise LightSpeedBound(f"The current velocity {v} cannot exceed the speed of light")
     10     return 1 / (1 - v**2/c**2) ** 0.5

LightSpeedBound: The current velocity 300000000 cannot exceed the speed of light
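A custom exception becomes most useful when the calling code catches it specifically. The sketch below repeats the definitions above so that it runs on its own; safe_lorentz_factor is a hypothetical helper introduced only for illustration.

```python
class LightSpeedBound(Exception):
    """Raised when a velocity exceeds the speed of light."""

def lorentz_factor(v, c=299_792_458):
    if v > c:
        raise LightSpeedBound(f"The current velocity {v} cannot exceed the speed of light")
    return 1 / (1 - v**2 / c**2) ** 0.5

def safe_lorentz_factor(v):
    # Hypothetical helper: reports the problem and returns None instead
    # of letting the exception propagate.
    try:
        return lorentz_factor(v)
    except LightSpeedBound as err:
        print(f"Skipping invalid input: {err}")
        return None

print(safe_lorentz_factor(0))            # 1.0
print(safe_lorentz_factor(300_000_000))  # None, after the message above
```

Because only LightSpeedBound is caught, any other unexpected error still surfaces with its full traceback.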

Note

Python currently supports type hints when defining new functions. Although these are only hints and are not enforced by the interpreter, unlike in statically typed languages, being explicit about the input and output types makes the code more readable:

def division(numerator: float, denominator: float) -> float:
    return numerator / denominator

And with these type hints in place, tools like mypy can be used to provide actual checking of your codebase, bringing some of the benefits of static type systems to the dynamic world of Python.

2.3. Unit Tests¶

In the previous section we discussed the importance of writing clean and modular code. Having small functions that perform very specific tasks helps us design pipelines for testing those small units of code. That is the purpose of unit tests: to test the functions in our code individually.

Writing a unit test consists of defining a function that uses assert statements to check whether the output matches the expected answer.

import numpy as np
import pytest

def division(numerator, denominator):
    if denominator == pytest.approx(0.0):
        raise ZeroDivisionError(f"You cannot divide by {denominator=}")
    return numerator / denominator

def test_float_division():
    assert np.isclose(division(2.0,0.5), 4.0)

test_float_division()

The next step is to scale this up! We can write more than one test per function to evaluate different cases (e.g., different types), and then extend this to all the functions in our code. For example, for the division function we probably want to add a test that pins down the expected behaviour when dividing by zero. Perhaps surprisingly, we can assert that calling a function raises a specific error:

import pytest

def test_division_by_zero():
    with pytest.raises(ZeroDivisionError):
        division(numerator=10.0, denominator=0.0)

test_division_by_zero()
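To cover several cases for the same function without writing one test per case, the inputs and expected outputs can be kept in a table. The sketch below uses only the standard library; the division function is a simplified stand-in with an exact zero check instead of pytest.approx, and the cases are chosen for illustration.

```python
import math

def division(numerator, denominator):
    # Simplified stand-in for the function above (exact zero check).
    if denominator == 0:
        raise ZeroDivisionError(f"You cannot divide by {denominator=}")
    return numerator / denominator

# Each case is (numerator, denominator, expected result).
CASES = [
    (2.0, 0.5, 4.0),     # floats
    (6, 3, 2.0),         # integers
    (-1.0, 4.0, -0.25),  # negative numerator
    (0.0, 5.0, 0.0),     # zero numerator
]

def test_division_cases():
    for numerator, denominator, expected in CASES:
        result = division(numerator, denominator)
        assert math.isclose(result, expected, abs_tol=1e-12), (
            f"division({numerator}, {denominator}) returned {result}"
        )

test_division_cases()
```

pytest offers the pytest.mark.parametrize decorator for the same purpose, with the advantage that each case is reported as a separate test.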

2.4. Integration tests¶

As their name indicates, integration tests are responsible for evaluating how multiple units of code work together, rather than individually. For example, it is easy to see how a simple piece of code that uses the division function can fail even when each unit has been tested independently.

In general, any test that involves more than one function is called an integration test. Let’s look at the following example, which uses class inheritance in Python.

class Person:
    
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    def birthday(self):
        self.age += 1
        
    def append_lastname(self, lastname):
        self.name += " " + lastname
        
class Student(Person):
    
    def __init__(self, name, age, major):
        super().__init__(name, age)
        self.major = major
        self.grades = {}
        
    def add_grade(self, course, grade):
        self.grades[course] = grade

def test_student():
    
    subject = Student("Facu", 28, "Statistics")
    subject.birthday()
    subject.add_grade("Stat 159", "A+")
    assert subject.age == 29 and subject.grades["Stat 159"] == "A+"
    
test_student()
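The earlier remark about the division function can be sketched in the same way. In the example below, normalize is a hypothetical helper built on a simplified division function. Each piece behaves correctly on its own, but their combination fails for inputs whose sum is zero, and an integration test is where that behaviour gets written down.

```python
def division(numerator, denominator):
    if denominator == 0:
        raise ZeroDivisionError(f"You cannot divide by {denominator=}")
    return numerator / denominator

def normalize(values):
    # Hypothetical helper: rescales the values so that they sum to one.
    total = sum(values)
    return [division(v, total) for v in values]

def test_normalize_typical_input():
    assert normalize([1.0, 1.0, 2.0]) == [0.25, 0.25, 0.5]

def test_normalize_zero_total():
    # The units interact badly when the values cancel out.
    try:
        normalize([1.0, -1.0])
    except ZeroDivisionError:
        pass
    else:
        raise AssertionError("Expected ZeroDivisionError")

test_normalize_typical_input()
test_normalize_zero_total()
```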

2.5. Regression tests¶

Regression tests try to fix in time the expected behaviour of a certain piece of code. This is particularly useful when we don’t know what the true output of a piece of code is, but we want to ensure the stability of the code. In a sense, we want to be sure that, as we make changes, we don’t break or change code that was, in principle, working before.
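One common way to do this, sketched below under several assumptions, is to record the output of a trusted run in a file and compare later runs against it. The names logistic_orbit and matches_reference, the JSON file, and the tolerance are all hypothetical choices for illustration.

```python
import json
import math
import tempfile
from pathlib import Path

def logistic_orbit(x0, r=3.9, n=20):
    # Iterates x -> r*x*(1-x), the expression from the numerical
    # precision example.
    xs = [x0]
    for _ in range(n):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

def matches_reference(result, path, rel_tol=1e-12):
    path = Path(path)
    if not path.exists():
        # First run: record the current output as the reference.
        path.write_text(json.dumps(result))
        return True
    reference = json.loads(path.read_text())
    return len(reference) == len(result) and all(
        math.isclose(a, b, rel_tol=rel_tol) for a, b in zip(result, reference)
    )

with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / "orbit_reference.json"
    assert matches_reference(logistic_orbit(0.8), ref)        # records the reference
    assert matches_reference(logistic_orbit(0.8), ref)        # unchanged code passes
    assert not matches_reference(logistic_orbit(0.81), ref)   # changed output is caught
```

In a real project the reference file would be kept under version control, so that any change to it is a deliberate and reviewed decision.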

Another use of regression tests arises after we find and fix a bug in our code. After detecting an error, we may want to include a test for it so that we are sure the bug doesn’t reappear in the future.
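As a sketch of this second use, suppose a hypothetical median function once returned the wrong value for inputs of even length. After fixing it, we keep a test that pins the exact input that exposed the problem.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Fixed: an earlier (hypothetical) version returned ordered[mid] here,
    # which is wrong for inputs of even length.
    return (ordered[mid - 1] + ordered[mid]) / 2

def test_median_even_length_regression():
    # Pins the input that exposed the hypothetical bug.
    assert median([4, 1, 3, 2]) == 2.5

test_median_even_length_regression()
```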