One of the tasks I seem to spend a lot of time thinking about these days is how to name files and organise them into directories so that they follow a consistent logic. This is because my current research involves developing analysis pipelines for Next Generation Sequencing data, where the output file(s) of one program are the input to the next. These processing steps turn raw data straight out of the machine into something that can help answer the biological questions for which the experiments were run in the first place.
File and directory naming conventions may sound trivial, but I have found that their complexity grows quickly as more components are added to a pipeline. To illustrate my current approach to tackling this problem, I present here a simple example. Suppose a project (‘project_name’) runs two programs, ‘program_1’ and ‘program_2’. Each time the pipeline is run the input files may vary, so I create a new ‘job_name’ for each run. I have come up with this directory architecture:
/project_name
/project_name/data
/project_name/data/job_name_1
/project_name/data/job_name_1/input_data_type_1
/project_name/data/job_name_1/input_data_type_2
/project_name/data/job_name_1/input_data_type_3
/project_name/results
/project_name/results/job_name_1/program_1
/project_name/results/job_name_1/program_1/output_1
/project_name/results/job_name_1/program_1/output_2
...
/project_name/results/job_name_1/program_2/output_1
/project_name/results/job_name_1/program_2/output_2
...
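One way to keep such a layout consistent is to generate the paths from a single function rather than typing them by hand. What follows is only a minimal sketch of that idea: the function name `job_layout` and its parameters are my own invention for illustration, and it only builds path objects without creating anything on disk.

```python
from pathlib import PurePosixPath


def job_layout(project, job, input_types, programs):
    """Return the data and results directories for one pipeline run.

    Hypothetical helper: mirrors the layout above, i.e.
    /<project>/data/<job>/<input_type> and /<project>/results/<job>/<program>.
    """
    root = PurePosixPath("/") / project
    # One data directory per input data type.
    data_dirs = [root / "data" / job / input_type for input_type in input_types]
    # One results directory per program, keyed by program name.
    result_dirs = {program: root / "results" / job / program for program in programs}
    return data_dirs, result_dirs


data_dirs, result_dirs = job_layout(
    "project_name",
    "job_name_1",
    ["input_data_type_1", "input_data_type_2", "input_data_type_3"],
    ["program_1", "program_2"],
)

for directory in data_dirs:
    print(directory)
for program, directory in result_dirs.items():
    # Output file names are placeholders, as in the listing above.
    print(directory / "output_1")
```

Because every path comes from the same place, a change to the convention (say, putting the job name above ‘data’ and ‘results’) only has to be made once.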
What would happen if, instead of running 2 programs as I did above, I ran 5 or 6? And what if I had replicates for each input data file? What about maximising the number of steps run in parallel? You can start to see that things really get complicated.
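To get a feel for how quickly the number of directories grows, here is a rough sketch that enumerates one results directory per program and replicate. Note that the extra replicate level (‘replicate_1’, ‘replicate_2’, ...) is purely an assumption made for this illustration; it is not part of the layout I described above, and the function name is made up too.

```python
from itertools import product
from pathlib import PurePosixPath


def replicate_result_dirs(project, job, programs, replicates):
    """Hypothetical helper: one results directory per (program, replicate) pair."""
    base = PurePosixPath("/") / project / "results" / job
    return [
        base / program / replicate
        for program, replicate in product(programs, replicates)
    ]


# Assumed sizes for illustration: 6 programs and 3 replicates.
programs = [f"program_{i}" for i in range(1, 7)]
replicates = [f"replicate_{i}" for i in range(1, 4)]

dirs = replicate_result_dirs("project_name", "job_name_1", programs, replicates)
print(len(dirs))  # 6 programs x 3 replicates = 18 directories
print(dirs[0])
```

Every new dimension (programs, replicates, input data types) multiplies the number of directories, which is why a convention that felt fine for two programs starts to strain.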
File and directory naming conventions are something that I am teaching myself, but any guidelines or systematic methods taught during my years as a computer science student would have come in handy now. In future bioinformatics lectures I will definitely challenge my students to think about this issue very carefully.