Organizing Source Code for Scientific Programmers: Let's Start a Conversation
Source: Dev.to
Why It Matters
Poor organization creates more than inconvenience—it creates real problems:
- Lost data
- Irreproducible analyses
- Hours wasted searching for files
A well‑organized repository accelerates science, facilitates collaboration, and makes your research reproducible. When everyone on the team can predict where files should be, collaboration becomes intuitive rather than frustrating.
A Starting Point (Not a Rigid Prescription)
The key organizing principle for scientific repositories is to structure by file type and purpose. Below is a proposed directory layout:
project-name/
├── data/
│ ├── raw/
│ └── processed/
├── src/
│ ├── data_processing/
│ ├── analysis/
│ └── visualization/
├── notebooks/
├── scripts/
├── results/
│ ├── figures/
│ └── output/
├── docs/
├── tests/
├── environment/
├── README.md
├── LICENSE
└── .gitignore
Note: This is a starting point for discussion, not a rigid prescription.
Below we break down each component and pose questions for the community.
data/ – The Sacred Ground
Sub‑directories
- data/raw/ – Original, unmodified data files. Treat this as immutable (you can even set files to read‑only).
- data/processed/ – Cleaned, transformed, or analyzed versions of your data.
Why the separation?
It guarantees reproducibility: anyone should be able to run your processing code on the raw data and regenerate the processed data.
Tips
- For large datasets, store only metadata or download scripts in version control.
- Use external services (OSF, Zenodo, institutional repositories) for the actual data files.
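One way to keep large data out of version control is to commit a small download script instead of the files. The sketch below assumes a placeholder URL, destination path, and checksum; swap in your dataset's real values.

```python
"""Sketch of a data-download script kept in version control in place of the
raw files themselves. The URL, paths, and checksum are placeholders."""
import hashlib
import urllib.request
from pathlib import Path

RAW_DIR = Path("data/raw")
DATASET_URL = "https://example.org/dataset.csv"  # placeholder
EXPECTED_SHA256 = "..."  # record your dataset's real digest here

def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def fetch(url: str = DATASET_URL, dest: Path = RAW_DIR / "dataset.csv") -> Path:
    """Download the raw file only if it is not already present, then verify it."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    if EXPECTED_SHA256 != "..." and sha256_of(dest) != EXPECTED_SHA256:
        raise RuntimeError(f"Checksum mismatch for {dest}")
    return dest
```

Recording a checksum alongside the URL means collaborators can detect a silently updated or corrupted download.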
Community question:
Do you organize data differently? How do you handle intermediate processing stages? Do you have a data/interim/ directory?
src/ – Your Core Analysis Code
This is where reusable, production‑quality code lives—the “scientific guts” of your project.
Guidelines
- Organize into logical modules or packages.
- Write clear docstrings and documentation.
- Make the code testable and (ideally) write tests.
- Ensure it can be imported from interactive environments or scripts.
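A minimal sketch of what such a module might look like, say at src/analysis/stats.py; the function name and the z-score example are illustrative, not prescriptive:

```python
"""Sketch of a small module living under src/. Illustrative only."""
from statistics import mean, stdev

def zscores(values):
    """Standardize a sequence: (x - mean) / standard deviation.

    Importable from notebooks (`from src.analysis.stats import zscores`)
    and from batch scripts alike, and easy to unit-test in isolation.
    """
    m = mean(values)
    s = stdev(values)
    return [(x - m) / s for x in values]
```

Keeping logic in plain functions like this, rather than inside a notebook cell, is what makes the "testable and importable" guidelines above achievable.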
Community question:
How do you organize multi‑language projects? Separate directories per language, or a mixed src/?
notebooks/ – Exploration & Prototyping
Interactive environments (Jupyter, R Markdown, Pluto.jl, MATLAB Live Scripts, Mathematica notebooks) are fantastic for exploration, but they can encourage non‑modular code and accumulate cruft.
Best practices
- Use notebooks for exploration, visualization, and prototyping.
- Keep each notebook focused on a specific question or analysis.
- Name them clearly with a numeric prefix for ordering, e.g. 01-data-exploration.ipynb, 02-initial-modeling.Rmd, 03-sensitivity-analysis.jl
- When code matures, move it to src/ and import functions rather than copying code.
- Extract reusable parts into modules in src/ once a notebook becomes unwieldy.
Community question:
Some researchers keep all work in notebooks/scripts; others move everything to modules. What’s your philosophy? Does it depend on the project stage?
scripts/ – Automation & Batch Processing
Scripts are for automated, reproducible workflows. They should:
- Run without interaction.
- Accept command‑line arguments or configuration files.
- Be executable on clusters or in pipelines.
- Orchestrate complete analyses from start to finish.
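A non-interactive script usually boils down to a parser plus a main function. The sketch below assumes hypothetical argument names (--input, --output, --seed) and omits the actual analysis:

```python
"""Sketch of a batch script, e.g. scripts/run_model.py. Illustrative only."""
import argparse
from pathlib import Path

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the model on one dataset.")
    parser.add_argument("--input", type=Path, required=True, help="input data file")
    parser.add_argument("--output", type=Path, required=True, help="where to write results")
    parser.add_argument("--seed", type=int, default=0, help="random seed for reproducibility")
    return parser

def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    # ... load args.input, run the analysis with args.seed, write args.output ...

if __name__ == "__main__":
    main()
```

Because main() accepts an argument list, the same script runs unchanged on a laptop, inside a cluster job, or under a workflow manager, and its argument handling can be tested without launching a real run.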
Typical use cases
- Data download and preprocessing pipelines.
- Running models with different parameters.
- Generating all figures for a paper.
- Batch processing multiple datasets.
A controller script (e.g., run_all.sh, Makefile, Snakefile) that executes the entire analysis workflow in order is extremely valuable for reproducibility.
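For teams not yet using Make or Snakemake, even a plain-Python controller captures the same idea. The step commands below are hypothetical script names:

```python
"""Sketch of a minimal run_all controller in plain Python (a Makefile or
Snakefile would serve the same role). Script names are placeholders."""
import subprocess
import sys

STEPS = [
    ["python", "scripts/download_data.py"],
    ["python", "scripts/preprocess.py"],
    ["python", "scripts/fit_models.py"],
    ["python", "scripts/make_figures.py"],
]

def run_all(steps=STEPS, runner=subprocess.run):
    """Execute each step in order, stopping at the first failure."""
    for cmd in steps:
        result = runner(cmd)
        if result.returncode != 0:
            sys.exit(f"Step failed: {' '.join(cmd)}")

if __name__ == "__main__":
    run_all()
```

Unlike Make, this re-runs every step each time; the trade-off is zero extra tooling and a single file that documents the whole pipeline order.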
Community question:
Do you use workflow managers (Make, Snakemake, Nextflow, Drake, Luigi)? How do you organize pipeline definitions?
results/ – Analysis Outputs
Store generated outputs outside version control (add to .gitignore):
results/
├── figures/ # Plots, visualizations
├── output/ # Tables, statistics, processed results
└── models/ # Trained models, fitted parameters
Why separate from data/?
Results are generated by code and should be reproducible. If you lose them, you can regenerate them. Raw data, however, cannot be regenerated.
Community question:
Do you version‑control any results? How do you handle results that take days/weeks to generate?
docs/ – Documentation
Include:
- Project documentation.
- Analysis notes or lab‑notebook entries.
- Manuscript drafts.
- Supplementary materials.
- API documentation (if auto‑generated).
Community question:
Where do you keep your manuscript? In the repo, a separate repo, Overleaf, Google Docs…?
tests/ – Test Code
Yes, even scientific code should have tests! At minimum, include:
- Unit tests for core functions.
- Integration tests for end‑to‑end workflows (if feasible).
- Regression tests to guard against accidental changes in results.
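A pytest-style sketch of what unit and regression tests can look like for analysis code; the function under test and the stored "known result" are illustrative:

```python
"""Sketch of tests/test_stats.py in pytest style. Illustrative only."""
import math

def rolling_mean(values, window):
    """Toy stand-in for a core analysis function under test."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def test_unit_basic():
    # Unit test: a hand-checkable case for one core function.
    assert rolling_mean([1, 2, 3, 4], 2) == [1.5, 2.5, 3.5]

def test_regression_known_result():
    # Regression test: compare against a previously recorded output
    # so an accidental change to the function is caught immediately.
    expected = [2.0, 3.0]
    got = rolling_mean([1, 2, 3, 4], 3)
    assert all(math.isclose(a, b) for a, b in zip(got, expected))
```

The regression test encodes yesterday's verified output as today's expectation, which is often the cheapest safety net for scientific code.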
environment/ – Reproducible Environments
Store environment specifications:
- environment.yml (conda) or requirements.txt (pip)
- Dockerfile, environment.yml for Binder, or renv.lock for R
These files let anyone recreate the exact software stack used for the analysis.
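For concreteness, here is what a conda environment file can look like; the environment name, packages, and versions below are placeholders, not recommendations:

```yaml
# Illustrative environment.yml -- names and versions are placeholders.
name: project-name
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy
  - pandas
  - matplotlib
  - pip
  - pip:
      - some-pypi-only-package   # hypothetical PyPI-only dependency
```

Committing a file like this lets a collaborator run `conda env create -f environment.yml` and get the same stack.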
Final Thoughts
The structure above is a starting point. Feel free to adapt it to your discipline, team size, and project complexity. The goal is to create a repository that:
- Is intuitive to navigate.
- Supports reproducibility.
- Enables collaboration without friction.
Join the Discussion
- How do you organize your scientific codebases?
- What works? What doesn’t?
- What am I missing?
Your experiences and suggestions will help the community converge on best practices. 🚀
Integration & Validation Tests
- Integration tests for workflows
- Validation tests that check against known results
Testing helps ensure correctness and catches bugs when you modify code.
Question for the community:
What’s your testing philosophy for scientific code?
- Do you test everything?
- Only critical functions?
- Or not at all?
README – The Front Door of Your Project
Your README should contain at least the following sections:
- Project overview & goals
- Installation instructions
- Quick‑start guide
- Project structure explanation
- How to reproduce key results
- Dependencies & requirements
- Citation information
- Contact information
Environment Specification Files
| Language | Typical Files |
|---|---|
| Python | environment.yml (conda), requirements.txt (pip), pyproject.toml (modern packaging) |
| R | renv.lock (renv), DESCRIPTION (R packages), install.R (installation script) |
| Julia | Project.toml & Manifest.toml |
| MATLAB | Dependency list in README or a separate document |
| Multi‑language | Dockerfile (containerised env), separate env files per language, or a shell script that sets up the whole environment |
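For the multi-language case, a container image is one common answer. The sketch below assumes a Python/R mix with hypothetical file names; the base image tag and install commands should be adapted to your stack:

```dockerfile
# Illustrative Dockerfile for a mixed Python/R project -- base image,
# versions, and file names are placeholders.
FROM rocker/r-ver:4.3.1

# Add system Python alongside R
RUN apt-get update && apt-get install -y python3 python3-pip

WORKDIR /project

# One dependency file per language
COPY requirements.txt renv.lock ./
RUN pip3 install -r requirements.txt
RUN R -e 'install.packages("renv"); renv::restore()'

COPY . .
```

The container pins both language stacks in one reproducible artifact, at the cost of a build step and larger storage.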
Question for the community:
How do you handle dependencies that span multiple languages?
- Containers?
- Virtual machines?
- Detailed documentation?
Licensing
If your project is open source, include a license. Common choices for scientific code:
- MIT
- BSD
- GPL
Ignoring Generated Files
Prevent clutter by adding language‑specific .gitignore entries (for example, from GitHub's github/gitignore template collection).
A minimal set for all projects:
data/raw/*
results/*
.DS_Store # macOS
Suggested Directory Structures
Small / Exploratory Projects
project/
├── data/
├── analysis/
├── results/
├── environment/
└── README.md
Works well for class projects or quick prototypes where you don’t expect major expansion.
Larger, Multi‑Year / Multi‑Paper Projects
project/
├── data/
│ ├── study1/
│ ├── study2/
│ └── shared/
├── src/
│ ├── preprocessing/
│ ├── analysis_core/
│ └── utils/
├── analyses/
│ ├── paper1/
│ ├── paper2/
│ └── exploratory/
├── docs/
└── manuscripts/
├── paper1/
└── paper2/
Key idea: Organise analyses by the output they support (paper, report) while keeping shared code in a central src/ directory.
Question for the community:
How do you organise multi‑year, multi‑paper projects?
- One repository or many?
- How do you handle shared code?
Self‑Containedness vs. Duplication
Goal: Everything needed to reproduce the analysis should live inside the project directory.
Pros: Guarantees reproducibility for reviewers and future collaborators.
Cons: May duplicate large data sets across projects.
A colleague should be able to:
- git clone the repository
- Set up the environment (conda, Docker, etc.)
- Run the analysis scripts
- Reproduce the results
Question for the community:
How do you balance self‑containedness with sharing code/data between projects?
What Belongs in the Repository?
| Category | Recommended Inclusion |
|---|---|
| Code | Scripts, interactive notebooks, source files |
| Documentation | README, design docs, API docs |
| Environment specs | environment.yml, requirements.txt, Dockerfile, etc. |
| Small data | Files that are small enough to store in the repository |
The structure above is a starting point, not a final answer.
Core Universal Principles
- Separate concerns – keep data, code, and results in distinct directories.
- Preserve raw data – never modify original files.
- Modularise code – extract reusable functionality.
- Document everything – future you will thank present you.
- Version control – track changes and enable collaboration.
- Enable reproduction – anyone should be able to reproduce your work.
Implementation will vary based on:
- Programming language(s)
- Field conventions
- Team preferences
- Project scale & complexity
- Computing environment (laptop, HPC, cloud)
Your Turn
- What works? How do you organise your scientific code? What directory structure do you use?
- What doesn’t work? What have you tried that failed? Which pain points remain?
- What’s missing? Any essential aspects of scientific code organisation that I overlooked?
- Language‑specific tips? Share tricks that work particularly well in your language(s).
Looking forward to the discussion!
Organizing Scientific Code: Tips, Resources, and Community Discussion
Why Organize Your Code?
- Reproducibility – makes it easier for you and others to reproduce results.
- Collaboration – clear structure reduces friction when multiple people work on the same project.
- Maintainability – a well‑organized repository is simpler to extend, debug, and refactor.
“The goal isn’t perfection, it’s progress.”
Start organizing better today, and iterate as you learn what works for you and your team.
Common Questions
- What language(s) should I use?
- What are the field‑specific conventions?
- What are the norms in my discipline?
Share your experiences in the comments. Let’s build a community knowledge base of what actually works in practice.
Helpful Resources
| # | Resource | What It Offers |
|---|---|---|
| 1 | “Good Enough Practices in Scientific Computing” – Wilson et al. | A comprehensive guide to scientific‑computing best practices. |
| 2 | Software Carpentry | Workshops on version control, testing, and project organization. |
| 3 | “Ten Simple Rules for Taking Advantage of Git and GitHub” – PLOS Computational Biology | Practical rules for using Git/GitHub effectively. |
| 4 | Cookiecutter Data Science | A standardized project‑structure template. |
| 5 | The Turing Way | Handbook for reproducible, ethical, and collaborative data science. |
How Do You Organize Your Scientific Code?
- Share your folder layout, naming conventions, or any scripts you find useful.
- Post tips, ask questions, or suggest additional resources in the comments below!
References
- Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017). Good enough practices in scientific computing. PLOS Computational Biology 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510.
- Software Carpentry. Lessons. https://software-carpentry.org/lessons/.
- Perez‑Riverol Y, Gatto L, Wang R, Sachsenberg T, Uszkoreit J, Leprevost FdV, et al. (2016). Ten Simple Rules for Taking Advantage of Git and GitHub. PLOS Computational Biology 12(7): e1004947. https://doi.org/10.1371/journal.pcbi.1004947.
- DrivenData. Cookiecutter Data Science. https://cookiecutter-data-science.drivendata.org/.
- The Turing Way Community. (2022). The Turing Way: A handbook for reproducible, ethical and collaborative research. Zenodo. https://doi.org/10.5281/zenodo.3233853.