Testing
The correctness of scientific findings relies on that of scientific software. A cautionary tale was reported by Greg Miller, in which five papers had to be retracted due to a bug in the software that produced an incorrect protein structure. The incorrect structure had confused the field even before the retraction, as other researchers doubted their own, different but valid, structures. Proper tests can both reduce the number of errors and inspire confidence in one’s software. We would like to reiterate the importance of considering unit testing and verification for scientific software.
Testing is a central concept in team programming, as test coverage increases trust and allows any member (including the developer's future self) to add new features safely. However, as highlighted in the literature (10.1016/j.infsof.2014.05.006, 10.1109/MS.2008.85, 10.1109/MIC.2014.88), testing scientific software is particularly challenging. Scientific software development requires dedicated testing methods to maintain scientific integrity, since unexpected discoveries can be difficult to distinguish from errors, confounding both the expected outcomes and error detection.
Scientists often evaluate the validity of a model but usually do not rigorously test the underlying code. This oversight arises from the common misconception that a model and its implementation are the same thing (10.1016/j.infsof.2014.05.006). Since they are not, faults in the software can interfere with conclusions about the model itself. The main challenge usually lies in viewing software testing as more than just validating the scientific features of the software on a small test dataset; this requires examining the software implementation from multiple perspectives through several test cases. We view testing not as a form of quality control that acts as a checkpoint but as part of an iterative development process.
Revisiting testing multiple times in our sessions helps build an intuition about its nature and usage. These sessions include discussing how to write unit tests in both Python (pytest and unittest) and R (testthat), as comparing languages allowed us to highlight the similarities and specificities of unit testing implementations. We also extensively discussed which types of functions to test, as bioinformaticians deal with both existing and new code, which require different approaches. We further touched upon the automation of tests via continuous integration (e.g., GitHub Actions), which helps to modify the code base safely throughout development and maintenance.
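As an illustration of this comparison, the same check can be written in the two Python frameworks as sketched below; the add function is a hypothetical stand-in, not part of any package discussed here.

import unittest

def add(a, b):
    # Hypothetical function under test, used only for illustration
    return a + b

# pytest style: a plain function with a bare assert statement
def test_add_pytest():
    assert add(2, 3) == 5

# unittest style: a method on a TestCase subclass using assertion helpers
class TestAdd(unittest.TestCase):
    def test_add(self):
        self.assertEqual(add(2, 3), 5)

Both styles can be collected and run by pytest, which makes a side-by-side comparison straightforward.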
Similarly to modularization, a recurring question was when to start adding tests. We advise testing early and abandoning the mindset of “I will start tomorrow, after implementing this new idea.” Peer reviews challenge the implementation, and code reviews provide an optimal platform to discuss tests. This aspect can be viewed as a scientific endeavor when we envision the edge cases and discuss the scientific problem the software aims to address.
When working with analysis pipelines, we can (though not always safely) assume that the packages used are thoroughly tested by their developers, so what remains for us is to ensure that the steps connect well. This can be viewed as a form of integration testing, in which different test examples are passed through the pipeline. The developer of such a pipeline should think about cases that might break the flow (e.g., missing output from one step). The Nextflow workflow system has further support for testing; to check the behavior of the individual steps, one may consult the nf-test framework.
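A minimal sketch of this idea in Python is shown below; the two step functions are hypothetical placeholders for real pipeline steps, and the point is simply that a small example is passed through the chain and each connection is checked.

from pathlib import Path

# Hypothetical pipeline steps standing in for real tools or workflow tasks
def align_reads(fastq: Path, out_dir: Path) -> Path:
    bam = out_dir / 'aligned.bam'
    bam.write_text('placeholder alignment output')
    return bam

def count_features(bam: Path, out_dir: Path) -> Path:
    counts = out_dir / 'counts.tsv'
    counts.write_text('gene\tcount\n')
    return counts

def test_steps_connect(tmp_path):
    # Pass a tiny example dataset through consecutive steps
    fastq = tmp_path / 'example.fastq'
    fastq.write_text('@read1\nACGT\n+\n!!!!\n')
    bam = align_reads(fastq, tmp_path)
    # Each step must produce the output that the next step consumes
    assert bam.exists(), 'alignment step produced no output'
    counts = count_features(bam, tmp_path)
    assert counts.exists(), 'counting step produced no output'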
Testing examples
Here we present an example of testing novel scientific code, represented by a subset of the tests used by the SPONGE package. The unit tests check the correctness of individual functions. Some of the tests cover the plogp function, which calculates the value of p*log2(p) while treating the zero case correctly. Selected content of tests/test_helper_functions.py is shown below.
import pytest

from sponge import helper_functions

# Parametrize allows testing multiple
# inputs without code duplication
@pytest.mark.parametrize(
    "input,expected_output",
    [
        (0, 0),
        (0.5, -0.5),
        (1, 0)
    ]
)
def test_plogp(input, expected_output):
    assert helper_functions.plogp(input) == expected_output
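For reference, a function passing these tests could look like the following minimal sketch (an assumption for illustration; the actual SPONGE implementation may differ).

from math import log2

def plogp(p):
    # p * log2(p), with the 0 * log2(0) case defined as 0 so that
    # entropy-style sums over probabilities remain well defined
    if p == 0:
        return 0
    return p * log2(p)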
Also tested is the calculation of the information content for individual motifs in the calculate_ic function. The motifs used by the tests are defined in a separate file, test_motifs, and are accessible as pytest fixtures.
import pytest
import pandas as pd
from Bio.motifs.jaspar import Motif
from pyjaspar import jaspardb

# A motif without any information
@pytest.fixture
def no_info_motif():
    no_info_row = [0.25] * 4
    no_info_counts = [no_info_row] * 6
    no_info_pwm = pd.DataFrame(no_info_counts, columns=['A', 'C', 'G', 'T'])
    no_info_motif = Motif(matrix_id='XXX', name='XXX', counts=no_info_pwm)
    yield no_info_motif

# A motif with perfect information
@pytest.fixture
def all_A_motif():
    all_A_row = [1] + [0] * 3
    all_A_counts = [all_A_row] * 6
    all_A_pwm = pd.DataFrame(all_A_counts, columns=['A', 'C', 'G', 'T'])
    all_A_motif = Motif(matrix_id='XXX', name='XXX', counts=all_A_pwm)
    yield all_A_motif

# A real motif for SOX2
@pytest.fixture
def SOX2_motif():
    jdb_obj = jaspardb(release='JASPAR2024')
    SOX2_motif = jdb_obj.fetch_motif_by_id('MA0143.1')
    yield SOX2_motif
Further selected content of tests/test_helper_functions.py is shown below.
import pytest

from test_motifs import *
from sponge.helper_functions import calculate_ic

def test_calculate_ic_no_info(no_info_motif):
    assert calculate_ic(no_info_motif) == 0

def test_calculate_ic_all_the_same(all_A_motif):
    # Length of the test motif is 6, so expected value is 2 * 6 = 12
    assert calculate_ic(all_A_motif) == 12

def test_calculate_ic_SOX2(SOX2_motif):
    assert calculate_ic(SOX2_motif) == pytest.approx(12.95, abs=0.01)
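The expected values follow from the usual definition of information content, 2 plus the sum of p*log2(p) over the bases at each position, summed over the motif length: a uniform motif yields 0 bits and a fully conserved 6-bp motif yields 2 * 6 = 12 bits. A sketch consistent with these tests (again an assumption rather than the actual SPONGE code) could be:

from math import log2

def plogp(p):
    # Zero-safe p * log2(p), as sketched above
    return p * log2(p) if p > 0 else 0

def calculate_ic(motif):
    # Total information content in bits: for each position,
    # 2 + sum of p * log2(p) over the bases A, C, G and T
    return sum(
        2 + sum(plogp(motif.pwm[base][pos]) for base in 'ACGT')
        for pos in range(motif.length)
    )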
The integration tests check that the entire workflow produces the expected output, effectively checking that the components work well together. In this case, the full functionality of SPONGE with the default parameters is checked. Selected content of tests/test_sponge.py is shown below.
import os

import pytest

from sponge.sponge import Sponge

# The test is marked as slow because the download of the bigbed file takes
# a lot of time and the filtering is also time consuming unless parallelised
@pytest.mark.slow
def test_full_default_workflow(tmp_path):
    # Tests the full SPONGE workflow with default values
    ppi_output = os.path.join(tmp_path, 'ppi_prior.tsv')
    motif_output = os.path.join(tmp_path, 'motif_prior.tsv')

    sponge_obj = Sponge(
        run_default=True,
        temp_folder=tmp_path,
        ppi_outfile=ppi_output,
        motif_outfile=motif_output,
    )

    assert os.path.exists(ppi_output)
    assert os.path.exists(motif_output)
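Because the full workflow downloads data, the test is marked as slow and can be deselected during routine development by running pytest -m "not slow". For pytest to recognise the custom marker without warnings, it has to be registered, for example in a conftest.py as sketched below (an assumption; the actual SPONGE configuration may register it differently).

# conftest.py
def pytest_configure(config):
    # Register the custom 'slow' marker so that pytest knows about it
    # and does not emit an unknown-marker warning
    config.addinivalue_line(
        'markers',
        'slow: marks tests that download data and take a long time to run',
    )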