Organizing Source Code for Scientific Programmers: Let's Start a Conversation
Source: Dev.to
Why It Matters
Poor organization creates more than inconvenience—it creates real problems:
- Lost data
- Irreproducible analyses
- Hours wasted searching for files
A well‑organized repository accelerates science, facilitates collaboration, and makes your research reproducible. When everyone on the team can predict where files should be, collaboration becomes intuitive rather than frustrating.
A Starting Point (Not a Rigid Prescription)
The key organizing principle for scientific repositories is to structure by file type and purpose. Below is a proposed directory layout:
project-name/
├── data/
│ ├── raw/
│ └── processed/
├── src/
│ ├── data_processing/
│ ├── analysis/
│ └── visualization/
├── notebooks/
├── scripts/
├── results/
│ ├── figures/
│ └── output/
├── docs/
├── tests/
├── environment/
├── README.md
├── LICENSE
└── .gitignore
Note: This is a starting point for discussion, not a rigid prescription.
Below we break down each component and pose questions for the community.
data/ – The Sacred Ground
Sub‑directories
- data/raw/ – Original, unmodified data files. Treat this as immutable (you can even set files to read‑only).
- data/processed/ – Cleaned, transformed, or analyzed versions of your data.
Why the separation?
It guarantees reproducibility: anyone should be able to run your processing code on the raw data and regenerate the processed data.
Tips
- For large datasets, store only metadata or download scripts in version control.
- Use external services (OSF, Zenodo, institutional repositories) for the actual data files.
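One way to keep large data out of version control is to commit a small download script instead of the files. The sketch below assumes a placeholder URL, destination path, and checksum; swap in your dataset's real values.

```python
"""Sketch of a data-download script kept in version control in place of the
raw files themselves. The URL, paths, and checksum are placeholders."""
import hashlib
import urllib.request
from pathlib import Path

RAW_DIR = Path("data/raw")
DATASET_URL = "https://example.org/dataset.csv"  # placeholder
EXPECTED_SHA256 = "..."  # record your dataset's real digest here

def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def fetch(url: str = DATASET_URL, dest: Path = RAW_DIR / "dataset.csv") -> Path:
    """Download the raw file only if it is not already present, then verify it."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    if EXPECTED_SHA256 != "..." and sha256_of(dest) != EXPECTED_SHA256:
        raise RuntimeError(f"Checksum mismatch for {dest}")
    return dest
```

Recording a checksum alongside the URL means collaborators can detect a silently updated or corrupted download.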
Community question:
Do you organize data differently? How do you handle intermediate processing stages? Do you have a data/interim/ directory?
src/ – Your Core Analysis Code
This is where reusable, production‑quality code lives—the “scientific guts” of your project.
Guidelines
- Organize into logical modules or packages.
- Write clear docstrings and documentation.
- Make the code testable and (ideally) write tests.
- Ensure it can be imported from interactive environments or scripts.
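A minimal sketch of what such a module might look like, say at src/analysis/stats.py; the function name and the z-score example are illustrative, not prescriptive:

```python
"""Sketch of a small module living under src/. Illustrative only."""
from statistics import mean, stdev

def zscores(values):
    """Standardize a sequence: (x - mean) / standard deviation.

    Importable from notebooks (`from src.analysis.stats import zscores`)
    and from batch scripts alike, and easy to unit-test in isolation.
    """
    m = mean(values)
    s = stdev(values)
    return [(x - m) / s for x in values]
```

Keeping logic in plain functions like this, rather than inside a notebook cell, is what makes the "testable and importable" guidelines above achievable.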
Community question:
How do you organize multi‑language projects? Separate directories per language, or a mixed src/?
notebooks/ – Exploration & Prototyping
Interactive environments (Jupyter, R Markdown, Pluto.jl, MATLAB Live Scripts, Mathematica notebooks) are fantastic for exploration, but they can encourage non‑modular code and accumulate cruft.
Best practices
- Use notebooks for exploration, visualization, and prototyping.
- Keep each notebook focused on a specific question or analysis.
- Name them clearly with a numeric prefix for ordering, e.g. 01-data-exploration.ipynb, 02-initial-modeling.Rmd, 03-sensitivity-analysis.jl
- When code matures, move it to src/ and import functions rather than copying code.
- Extract reusable parts into modules in src/ once a notebook becomes unwieldy.
Community question:
Some researchers keep all work in notebooks/scripts; others move everything to modules. What’s your philosophy? Does it depend on the project stage?
scripts/ – Automation & Batch Processing
Scripts are for automated, reproducible workflows. They should:
- Run without interaction.
- Accept command‑line arguments or configuration files.
- Be executable on clusters or in pipelines.
- Orchestrate complete analyses from start to finish.
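A non-interactive script usually boils down to a parser plus a main function. The sketch below assumes hypothetical argument names (--input, --output, --seed) and omits the actual analysis:

```python
"""Sketch of a batch script, e.g. scripts/run_model.py. Illustrative only."""
import argparse
from pathlib import Path

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the model on one dataset.")
    parser.add_argument("--input", type=Path, required=True, help="input data file")
    parser.add_argument("--output", type=Path, required=True, help="where to write results")
    parser.add_argument("--seed", type=int, default=0, help="random seed for reproducibility")
    return parser

def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    # ... load args.input, run the analysis with args.seed, write args.output ...

if __name__ == "__main__":
    main()
```

Because main() accepts an argument list, the same script runs unchanged on a laptop, inside a cluster job, or under a workflow manager, and its argument handling can be tested without launching a real run.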
Typical use cases
- Data download and preprocessing pipelines.
- Running models with different parameters.
- Generating all figures for a paper.
- Batch processing multiple datasets.
A controller script (e.g., run_all.sh, Makefile, Snakefile) that executes the entire analysis workflow in order is extremely valuable for reproducibility.
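For teams not yet using Make or Snakemake, even a plain-Python controller captures the same idea. The step commands below are hypothetical script names:

```python
"""Sketch of a minimal run_all controller in plain Python (a Makefile or
Snakefile would serve the same role). Script names are placeholders."""
import subprocess
import sys

STEPS = [
    ["python", "scripts/download_data.py"],
    ["python", "scripts/preprocess.py"],
    ["python", "scripts/fit_models.py"],
    ["python", "scripts/make_figures.py"],
]

def run_all(steps=STEPS, runner=subprocess.run):
    """Execute each step in order, stopping at the first failure."""
    for cmd in steps:
        result = runner(cmd)
        if result.returncode != 0:
            sys.exit(f"Step failed: {' '.join(cmd)}")

if __name__ == "__main__":
    run_all()
```

Unlike Make, this re-runs every step each time; the trade-off is zero extra tooling and a single file that documents the whole pipeline order.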
Community question:
Do you use workflow managers (Make, Snakemake, Nextflow, Drake, Luigi)? How do you organize pipeline definitions?
results/ – Analysis Outputs
Store generated outputs outside version control (add to .gitignore):
results/
├── figures/ # Plots, visualizations
├── output/ # Tables, statistics, processed results
└── models/ # Trained models, fitted parameters
Why separate from data/?
Results are generated by code and should be reproducible. If you lose them, you can regenerate them. Raw data, however, cannot be regenerated.
Community question:
Do you version‑control any results? How do you handle results that take days/weeks to generate?
docs/ – Documentation
Include:
- Project documentation.
- Analysis notes or lab‑notebook entries.
- Manuscript drafts.
- Supplementary materials.
- API documentation (if auto‑generated).
Community question:
Where do you keep your manuscript? In the repo, a separate repo, Overleaf, Google Docs…?
tests/ – Test Code
Yes, even scientific code should have tests! At minimum, include:
- Unit tests for core functions.
- Integration tests for end‑to‑end workflows (if feasible).
- Regression tests to guard against accidental changes in results.
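A pytest-style sketch of what unit and regression tests can look like for analysis code; the function under test and the stored "known result" are illustrative:

```python
"""Sketch of tests/test_stats.py in pytest style. Illustrative only."""
import math

def rolling_mean(values, window):
    """Toy stand-in for a core analysis function under test."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def test_unit_basic():
    # Unit test: a hand-checkable case for one core function.
    assert rolling_mean([1, 2, 3, 4], 2) == [1.5, 2.5, 3.5]

def test_regression_known_result():
    # Regression test: compare against a previously recorded output
    # so an accidental change to the function is caught immediately.
    expected = [2.0, 3.0]
    got = rolling_mean([1, 2, 3, 4], 3)
    assert all(math.isclose(a, b) for a, b in zip(got, expected))
```

The regression test encodes yesterday's verified output as today's expectation, which is often the cheapest safety net for scientific code.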
environment/ – Reproducible Environments
Store environment specifications:
- environment.yml (conda) or requirements.txt (pip)
- Dockerfile, environment.yml for Binder, or renv.lock for R
These files let anyone recreate the exact software stack used for the analysis.
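For concreteness, here is what a conda environment file can look like; the environment name, packages, and versions below are placeholders, not recommendations:

```yaml
# Illustrative environment.yml -- names and versions are placeholders.
name: project-name
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy
  - pandas
  - matplotlib
  - pip
  - pip:
      - some-pypi-only-package   # hypothetical PyPI-only dependency
```

Committing a file like this lets a collaborator run `conda env create -f environment.yml` and get the same stack.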
Final Thoughts
The structure above is a starting point. Feel free to adapt it to your discipline, team size, and project complexity. The goal is to create a repository that:
- Is intuitive to navigate.
- Supports reproducibility.
- Enables collaboration without friction.
Join the Discussion
- How do you organize your scientific codebases?
- What works? What doesn’t?
- What am I missing?
Your experiences and suggestions will help the community converge on best practices. 🚀
Integration & Validation Tests
- Integration tests for workflows
- Validation tests that check against known results
Testing helps ensure correctness and catches bugs when you modify code.
Question for the community:
What’s your testing philosophy for scientific code?
- Do you test everything?
- Only critical functions?
- Or not at all?
README – The Front Door of Your Project
Your README should contain at least the following sections:
- Project overview & goals
- Installation instructions
- Quick‑start guide
- Project structure explanation
- How to reproduce key results
- Dependencies & requirements
- Citation information
- Contact information
Environment Specification Files
| Language | Typical Files |
|---|---|
| Python | environment.yml (conda), requirements.txt (pip), pyproject.toml (modern packaging) |
| R | renv.lock (renv), DESCRIPTION (R packages), install.R (installation script) |
| Julia | Project.toml & Manifest.toml |
| MATLAB | Dependency list in README or a separate document |
| Multi‑language | Dockerfile (containerised env), separate env files per language, or a shell script that sets up the whole environment |
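For the multi-language case, a container image is one common answer. The sketch below assumes a Python/R mix with hypothetical file names; the base image tag and install commands should be adapted to your stack:

```dockerfile
# Illustrative Dockerfile for a mixed Python/R project -- base image,
# versions, and file names are placeholders.
FROM rocker/r-ver:4.3.1

# Add system Python alongside R
RUN apt-get update && apt-get install -y python3 python3-pip

WORKDIR /project

# One dependency file per language
COPY requirements.txt renv.lock ./
RUN pip3 install -r requirements.txt
RUN R -e 'install.packages("renv"); renv::restore()'

COPY . .
```

The container pins both language stacks in one reproducible artifact, at the cost of a build step and larger storage.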
Question for the community:
How do you handle dependencies that span multiple languages?
- Containers?
- Virtual machines?
- Detailed documentation?
Licensing
If your project is open source, include a license. Common choices for scientific code:
- MIT
- BSD
- GPL
Ignoring Generated Files
Prevent clutter by adding language‑specific .gitignore entries (for example, from GitHub's github/gitignore template collection).
A minimal set for all projects:
data/raw/*
results/*
.DS_Store # macOS
Suggested Directory Structures
Small / Exploratory Projects
project/
├── data/
├── analysis/
├── results/
├── environment/
└── README.md
Works well for class projects or quick prototypes where you don’t expect major expansion.
Larger, Multi‑Year / Multi‑Paper Projects
project/
├── data/
│ ├── study1/
│ ├── study2/
│ └── shared/
├── src/
│ ├── preprocessing/
│ ├── analysis_core/
│ └── utils/
├── analyses/
│ ├── paper1/
│ ├── paper2/
│ └── exploratory/
├── docs/
└── manuscripts/
├── paper1/
└── paper2/
Key idea: Organise analyses by the output they support (paper, report) while keeping shared code in a central src/ directory.
Question for the community:
How do you organise multi‑year, multi‑paper projects?
- One repository or many?
- How do you handle shared code?
Self‑Containedness vs. Duplication
Goal: Everything needed to reproduce the analysis should live inside the project directory.
Pros: Guarantees reproducibility for reviewers and future collaborators.
Cons: May duplicate large data sets across projects.
A colleague should be able to:
- git clone the repository
- Set up the environment (conda, Docker, etc.)
- Run the analysis scripts
- Reproduce the results
Question for the community:
How do you balance self‑containedness with sharing code/data between projects?
What Belongs in the Repository?
| Category | Recommended Inclusion |
|---|---|
| Code | Scripts, interactive notebooks, source files |
| Documentation | README, design docs, API docs |
| Environment specs | environment.yml, requirements.txt, Dockerfile, etc. |
| Small data | Files that are small enough to store in the repository |
The structure above is a starting point, not a final answer.
Core Universal Principles
- Separate concerns – keep data, code, and results in distinct directories.
- Preserve raw data – never modify original files.
- Modularise code – extract reusable functionality.
- Document everything – future you will thank present you.
- Version control – track changes and enable collaboration.
- Enable reproduction – anyone should be able to reproduce your work.
Implementation will vary based on:
- Programming language(s)
- Field conventions
- Team preferences
- Project scale & complexity
- Computing environment (laptop, HPC, cloud)
Your Turn
- What works? How do you organise your scientific code? What directory structure do you use?
- What doesn’t work? What have you tried that failed? Which pain points remain?
- What’s missing? Any essential aspects of scientific code organisation that I overlooked?
- Language‑specific tips? Share tricks that work particularly well in your language(s).
Looking forward to the discussion!
Organizing Scientific Code: Tips, Resources, and Community Discussion
Why Organize Your Code?
- Reproducibility – makes it easier for you and others to reproduce results.
- Collaboration – clear structure reduces friction when multiple people work on the same project.
- Maintainability – a well‑organized repository is simpler to extend, debug, and refactor.
“The goal isn’t perfection, it’s progress.”
Start organizing better today, and iterate as you learn what works for you and your team.
Common Questions
- What language(s) should I use?
- What are the field‑specific conventions?
- What are the norms in my discipline?
Share your experiences in the comments. Let’s build a community knowledge base of what actually works in practice.
Helpful Resources
| # | Resource | What It Offers |
|---|---|---|
| 1 | “Good Enough Practices in Scientific Computing” – Wilson et al. | A comprehensive guide to scientific‑computing best practices. |
| 2 | Software Carpentry | Workshops on version control, testing, and project organization. |
| 3 | “Ten Simple Rules for Taking Advantage of Git and GitHub” – PLOS Computational Biology | Practical rules for using Git/GitHub effectively. |
| 4 | Cookiecutter Data Science | A standardized project‑structure template. |
| 5 | The Turing Way | Handbook for reproducible, ethical, and collaborative data science. |
How Do You Organize Your Scientific Code?
- Share your folder layout, naming conventions, or any scripts you find useful.
- Post tips, ask questions, or suggest additional resources in the comments below!
References
- Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017). Good enough practices in scientific computing. PLOS Computational Biology 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510.
- Software Carpentry. Lessons. https://software-carpentry.org/lessons/.
- Perez‑Riverol Y, Gatto L, Wang R, Sachsenberg T, Uszkoreit J, Leprevost FdV, et al. (2016). Ten Simple Rules for Taking Advantage of Git and GitHub. PLOS Computational Biology 12(7): e1004947. https://doi.org/10.1371/journal.pcbi.1004947.
- DrivenData. Cookiecutter Data Science. https://cookiecutter-data-science.drivendata.org/.
- The Turing Way Community. (2022). The Turing Way: A handbook for reproducible, ethical and collaborative research. Zenodo. https://doi.org/10.5281/zenodo.3233853.