
Modularization

Modular design is one of the most common approaches in modern programming and supports the maintainability and extensibility of a software product. Over time, software may become increasingly challenging to maintain (10.1109/CSEET.2009.44). One can address this growing complexity by refactoring with modularization in mind. This means continuously monitoring the code, recognizing code that has grown too large, and restructuring it into smaller parts. It also involves understanding that code is not a static entity but an ever-growing, ever-changing organism.

Based on the team's guidelines and experience, we found that moving from unstructured scripts to organized code with functions brings several benefits at a low cost. Consequently, this topic was covered several times during our software quality seminars and code review sessions. Specifically, we dedicated software quality seminars to the following topics to improve modularization: object-oriented programming, class diagrams and the Unified Modeling Language in general, design patterns, software architecture, Snakemake, S4 objects, R package development, a case report from the organization of the JASPAR database project, and a review of the book The Pragmatic Programmer.

Modularization can take form on many levels. On a small scale, it means naming and organizing parts of the code into functions. Once the code grows, one can start refactoring into classes and focus on the cohesion and coupling of the parts (Figures 1 and 2). When building a pipeline of scripts, one can identify coherent modules that translate to rules in Snakemake (Figures 3 and 4).

A recurring question is whether a script needs refactoring or can remain a prototype. Taschuk and Wilson suggest a cut-off: a script merits this effort once it is reused, shared with others, or used to produce findings in a publication. Although this definition potentially includes most code written by bioinformaticians, we suggest weighing, on a case-by-case basis, the time spent improving a script against the time required to deal with sub-optimal code. With practice and exposure to a lot of code, modularization becomes the norm, reducing the distance between a prototype and the refactored code.
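To illustrate the small-scale case, the following is a minimal sketch in Python of what naming and organizing parts of a script into functions might look like. It is not taken from our codebase: the task (computing GC content from FASTA-formatted text) and all function names (parse_fasta, gc_content, summarize) are hypothetical and chosen only for illustration.

```python
"""Sketch: a one-off analysis script split into small, named functions.

The task and every name below are hypothetical, for illustration only.
"""
from io import StringIO
from typing import Dict, Optional, TextIO


def parse_fasta(handle: TextIO) -> Dict[str, str]:
    """Read FASTA-formatted text into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are appended to the most recent header.
            records[header] += line.upper()
    return records


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


def summarize(records: Dict[str, str]) -> Dict[str, float]:
    """Compute the GC content of every record."""
    return {name: gc_content(seq) for name, seq in records.items()}


def main() -> None:
    # In-memory input keeps the sketch self-contained; a real script
    # would open a file instead.
    example = StringIO(">seq1\nGGCC\nAATT\n>seq2\nATAT\n")
    for name, gc in summarize(parse_fasta(example)).items():
        print(f"{name}\t{gc:.2f}")


if __name__ == "__main__":
    main()
```

Each function has a single responsibility and can be tested or reused on its own. For example, gc_content could later move into a class or a separate module without touching the parsing code.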

code_previous
Figure 1. Improving the modularization of a small codebase: PREVIOUS. In the previous design, a single script processed data in a one-off manner without consideration for extensibility.

code_current
Figure 2. Improving the modularization of a small codebase: CURRENT. In the current design, we separate different aspects of the analysis into dedicated modules, which can be extended more robustly.

jaspar_old
Figure 3. Improving the modularization of a large codebase: PREVIOUS. In the previous design, the files were arranged by their type. The numbers denote the number of files in each directory represented by the rectangle. mk: makefile.

jaspar_new
Figure 4. Improving the modularization of a large codebase: CURRENT. In the current design, the files are arranged by their function. The numbers denote the number of files in each directory represented by the rectangle. The file counts differ from those in the previous design because of added features and changes beyond the reorganization. pfm: position frequency matrix.