Organization and reports

Stat 550

Daniel J. McDonald

Last modified – 03 April 2024


Topics for today

  1. Organizing your file system
  2. Writing reports that mix output and text
  3. (Avoiding buggy code)

The guiding theme

Organization

  • Students come to my office
  • All their stuff is on their Desktop
  • This is 🤮

I urge you to consult:

Karl Broman’s Notes

Some guiding principles

  1. Avoid naming by date.
    • Your file system already knows the date.
    • Sometimes projects take a while.
    • You can add this inside a particular report: Last updated: 2022-01-07
  2. If you’re going to use a date anywhere, use YYYY-MM-DD or YYYYMMDD, not DD-MMM-YY
  3. This is a process
  4. Don’t get tied down
  5. But don’t reorganize every time you find a better system
  6. Customize to your needs, preferences
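
On the date-format point above: a minimal Python sketch of why YYYY-MM-DD is the safer choice. Sorted as plain text (which is what a file browser does), ISO-style dates come out in chronological order, while DD-MMM-YY dates do not. The dates and the helper `dd_mmm_yy` are made up for illustration.

```python
from datetime import date

# Hypothetical report dates, deliberately out of order.
dates = [date(2022, 1, 7), date(2021, 12, 31), date(2022, 10, 2)]

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def dd_mmm_yy(d: date) -> str:
    """Format a date as DD-MMM-YY (hypothetical helper, locale-independent)."""
    return f"{d.day:02d}-{MONTHS[d.month - 1]}-{d.year % 100:02d}"


# Sort the formatted strings as plain text.
iso_sorted = sorted(d.isoformat() for d in dates)   # YYYY-MM-DD
dmy_sorted = sorted(dd_mmm_yy(d) for d in dates)    # DD-MMM-YY

# The true chronological order, for comparison.
chronological = [d.isoformat() for d in sorted(dates)]

print(iso_sorted == chronological)  # text order matches time order
print(dmy_sorted)                   # text order starts with October 2022
```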

Organizing your stuff

├── Advising
│   ├── arash
│   ├── gian-carlo
├── CV
├── Computing
│   ├── batchtools.slurm.tmpl
│   ├── computecanada_notes.md
│   ├── FKF
│   └── ghclass
├── Grants
│   ├── B&E JSM 2010
│   ├── CANSSI RRP 2020
│   ├── NSERC 2020
├── LettersofRec
├── Manuscripts
│   ├── learning-white-matter
│   ├── rt-est
│   ├── zzzz Old
├── Referee reports
├── Talks
│   ├── JobTalk2020
│   ├── ubc-stat-covid-talk
│   └── utoronto-grad-advice
├── Teaching
│   ├── stat-406
│   ├── stat-550
│   ├── zzzz CMU TA
│   └── zzzz booth
└── Website

Inside a project

.
├── README.md
├── Summary of Goals.rtf
├── cluster_output
├── code
├── data
├── dsges-github.Rproj
├── manuscript
└── waldman-triage
  • Include a README
  • Ideally have a MAKEFILE
  • Under version control, shared with collaborator

Basic principles

  • Be consistent – directory structure; names
    • all project files in 1 directory, not multiples
  • Always separate raw from processed data
  • Always separate code from data
  • It should be obvious what code created what files, and what the dependencies are. (MAKEFILE forces this)
  • No hand-editing of data files
  • Don’t use spaces in file names
  • In code, use relative paths, not absolute paths
    • ../blah not ~/blah or /users/dajmcdon/Documents/Work/proj-1/blah
    • The {here} package in R is great for this
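
For readers working in Python or Jupyter, the relative-path advice can be sketched as below. This is only a rough analogue, in spirit, of what {here} offers in R, not a port of it. The helper `find_project_root` and the choice of marker files (`README.md`, `.git`) are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, markers=("README.md", ".git")) -> Path:
    """Walk upward from `start` until a directory containing a marker is found.

    Hypothetical helper: the marker names are an assumption, pick whatever
    reliably sits at the top of your own projects.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"no project root found at or above {start}")


# Demo in a throwaway directory so nothing depends on a particular machine.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "proj-1"
    (project / "code").mkdir(parents=True)
    (project / "README.md").touch()

    # A script living in code/ locates the project root, then builds
    # every other path relative to it.
    root = find_project_root(project / "code")
    data_file = root / "data" / "blah"
    print(root.name)
    print(data_file.relative_to(root))
```

Because every path is built from the discovered root, the same script runs unchanged on a collaborator's machine, whatever their home directory is called.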

Problem: Coordinating with collaborators

  • Where to put data that multiple people will work with?
  • Where to put intermediate/processed data?
  • Where to indicate the code that created those processed data files?
  • How to divvy up tasks and know who did what?
  • Need to agree on directory structure and file naming conventions

GitHub is (I think) the ideal solution, but not always feasible.

Problem: Collaborators who don’t use GitHub

  • Use GitHub yourself
  • Copy files to/from some shared space
    • Ideally, in an automated way (Dropbox, S3 Bucket)
    • Avoid Word at all costs. Google Docs if needed.
    • Word and Git do not mix
    • Last resort: Word file in Dropbox. Everything else nicely organized on your end. Rmd file with similar structure to Manuscript that does the analysis.
  • Commit their changes.

Overleaf has Git built in (paid tier). I don’t like Overleaf. Costs money, the viewer is crap and so is the editor. I suggest you avoid it.

Reports that mix output and text

Using Rmarkdown/Quarto/Jupyter for most things

Your goal is to avoid, at all costs:

  • “How did I create this plot?”
  • “Why did I decide to omit those six samples?”
  • “Where (on the web) did I find these data?”
  • “What was that interesting gene/feature/predictor?”

Really useful resource:

When I begin a new project

  1. Create a directory structure
    • code/
    • papers/
    • notes/ (maybe?)
    • README.md
    • data/ (maybe?)
  2. Write scripts in the code/ directory
  3. TODO items in the README
  4. Use Rmarkdown/Quarto/Jupyter for reports, render to .pdf
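
Step 1 above can be scripted so that every new project starts from the same skeleton. A minimal Python sketch, assuming the directory names from the list; the function name `create_skeleton` and the README stub text are illustrative only.

```python
import tempfile
from pathlib import Path


def create_skeleton(root: Path, optional=("notes", "data")) -> None:
    """Create the starting layout for a new project (hypothetical helper)."""
    for name in ("code", "papers", *optional):
        (root / name).mkdir(parents=True, exist_ok=True)

    # Start the README with a TODO section, but never overwrite an
    # existing one: rerunning the function must be harmless.
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(f"# {root.name}\n\n## TODO\n\n")


# Demo in a throwaway directory; "proj-1" is a placeholder name.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "proj-1"
    create_skeleton(project)
    created = sorted(p.name for p in project.iterdir())
    print(created)
```

Passing `optional=()` skips the "maybe" directories for projects that do not need them.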

As the project progresses…

Reorganize

  • Some script files move into a package (thoroughly tested); the scripts that remain are specific to the paper
  • These now load the package and run simulations or analyses (that take a while)
  • Maybe add a directory that contains dead-ends (code or text or …)
  • Add manuscript/. I try to go for main.tex and Supplement.Rmd
  • Supplement.Rmd runs anything necessary in code/ and creates all figures in the main doc and the supplement. Also generates any online supplementary material
  • Sometimes, just manuscript/main.Rmd
  • Sometimes main.tex just inputs intro.tex, methods.tex, etc.

The old manuscript (starting in School, persisting too long)

  1. Write lots of LaTeX, R code in separate files
  2. Need a figure. Run R code, get figure, save as .pdf.
  3. Recompile LaTeX. Axes are unreadable. Back to R, rerun R code, …
  4. Recompile LaTeX. Can’t distinguish lines. Back to R, rerun R code, …
  5. Collaborator wants changes to the simulation. Edit the code. Rerun figure script, doesn’t work. More edits… Finally, recompile.
  6. Reviewer: “What if n is bigger?” Hope I can find the right location. But the code isn’t written as functions. Something breaks …
  7. Etc, etc.
  7. Etc, etc.

Now:

  1. R package with documented code, available on GitHub.
  2. One script to run the analysis, one to gather the results.
  3. One .Rmd file to take in the results, do preprocessing, generate all figures.
  4. LaTeX file in the journal’s style.

The optimal

Same as above but with a MAKEFILE to automatically run parts of 1–4 as needed
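
To make "as needed" concrete: the core check `make` performs is whether a target is missing or older than something it depends on. The Python sketch below shows only that check. It is not a Makefile and is not a replacement for `make`, which also orders the steps and rebuilds through chains of dependencies. The file names and the helper `is_stale` are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def is_stale(target: Path, dependencies: Iterable[Path]) -> bool:
    """True if `target` is missing or older than any of its dependencies.

    A missing dependency raises FileNotFoundError, on the view that a step
    with an absent input should fail loudly rather than run silently.
    """
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(dep.stat().st_mtime > target_mtime for dep in dependencies)


# Demo with explicit timestamps so the outcome does not depend on timing.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "run-analysis.R"   # hypothetical analysis script
    result = Path(tmp) / "results.rds"      # hypothetical output file
    script.touch()

    before = is_stale(result, [script])      # results do not exist yet

    result.touch()
    os.utime(script, (1_000, 1_000))         # script last edited at t=1000
    os.utime(result, (2_000, 2_000))         # results produced at t=2000
    fresh = is_stale(result, [script])       # results newer than script

    os.utime(script, (3_000, 3_000))         # script edited again at t=3000
    after_edit = is_stale(result, [script])  # results now out of date

    print(before, fresh, after_edit)
```

Writing down, for each output, the exact list of files passed as `dependencies` is the discipline a MAKEFILE forces: it makes obvious which code created which files.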

Evolution of presentations

  1. LaTeX + Beamer (similar to the manuscript):
    1. Write lots of LaTeX, R code in separate files
    2. Need a figure. Run R code, get figure, save as .pdf.
    3. Rinse and repeat.
  2. Course slides in Rmarkdown + Slidy
  3. Seminars in Rmarkdown + Beamer (with lots of customization)
  4. Seminars in Rmarkdown + Xaringan
  5. Everything in Quarto
  • Easy to use.
  • Easy to customize (defaults are not great)
  • WELL DOCUMENTED

Takeaways