Good Practice in computational research

A guide to good practice for collaborative research in data-driven computational modeling and simulation

"You do not really understand something unless you can explain it to your grandmother." -- Albert Einstein

"A mathematical theory is not to be considered complete until you have made it so clear that you can explain it to the first man whom you meet on the street." -- the view of an (unknown) old French mathematician, retold by David Hilbert in his [WWW] address "Mathematical Problems" to the Second International Congress of Mathematicians in Paris, 1900.

Please note: This page is a temporary backup of the original wiki page hosted at Cornell University.

  1. Intended audience
  2. Open and collaborative wiki content
  3. Introduction
  4. The good practice guide
    1. Administrative and mental preparation
    2. Before collecting data
    3. Pilot tests and “proof of concept”
    4. Preparation of acquired data
    5. Data analysis
    6. Modeling issues impacted by experimental approach
  5. External resources and further reading
    1. Professional
      1. The Ten Simple Rules Series
    2. Background reading and advocacy
    3. Technical

1. Intended audience

This document is a working outline of ideas concerning good practice guidelines for collaborative projects between experimentalists and computational modelers / applied mathematicians in the physical sciences. It is intended as a training resource primarily for students of the field, specifically in the context of utilizing experimental data and [WWW] dynamical systems models. It is not directly connected to the PyDSTool software project (see the menu on the sidebar for links to that) despite being hosted on the same wiki at the [WWW] Cornell Center for Applied Mathematics. The contents of this document have been compiled largely from personal experience and contributions from other practitioners. As a result it is presently (and unintentionally) biased towards the fields of biomechanics and neuroscience, but the aim is to keep the advice as generally applicable as possible. The tone is meant to be somewhat informal and not preachy -- it should contain insightful practical tips and warnings but not technical material from a lecture course on mathematical modeling. If it is of use to you, please contribute and link to it.

2. Open and collaborative wiki content

This document is intended to be inherently collaborative in nature, and as such this page is editable by you if you would like to contribute material, or just make minor edits. Because of recent wiki spam the editing of this page is now only available after you set up a simple account. Please contact me (RobClewley) to get access, and learn how to use the MoinMoin wiki markup language (see the HelpContents page). When you have an account you can click "Edit" in the sidebar. Ideally you would add a note about your changes in the "Optional comment about this change" box with your name attached.

3. Introduction

Accountability and maintainability of tools and methods are fundamental aspects of good practice in all branches of science. From this we can aspire to widespread reproducibility of results and to build confidence in our work. The combination of computational modeling methodologies with applied mathematics and experimental scientific research is a new and rapidly evolving domain. Unlike many established branches of science, this area does not always exhibit the same level of maturity in communicating the essential details of its tools, methods, and data. For instance, journal publications often inadequately communicate the assumptions, methodology, and algorithms used in generating modeling results. This can lead to poor reproducibility of results, and to frustration, uncertainty, and potentially even to skepticism from the community.

While administration is rarely an enjoyable task, and should never become more important than the scientific objectives, we can motivate ourselves by remembering that maintaining good records from the beginning of a project will greatly simplify (a) the generation of progress reports, theses, grant proposals, and publications; and (b) the justification and validation of our scientific results. What is more, this material does not necessarily have to be impeccably presented or conform to any formal quality assurance standards in order to achieve these benefits. Realistically, the degree to which we can make detailed records depends on the needs of the project and the expectations of our collaborators and ourselves, as well as the available time resources. However, for anything but the most trivial investigation we can benefit from any effort to systematically document our work. In particular we can avoid the headache and embarrassment of trying to reconstruct faded memories and old computer code detailing how we calculated a certain important result from a year before. Remember that it is typical, not exceptional, for our data, our objectives, and our methods to remain moving targets during a project, even after careful initial planning.

Our goal for the auditing and documentation of a collaborative modeling project should be the ability to reconstruct the process at a later date, and to reproduce our own results, even when the data set, objectives, and methods may have changed in the interim. For an ultimate goal such as publication, we should aim to prepare data sets, algorithms, and code in a form that can be published as supplementary material to the written portion of our work, e.g. in an online repository associated with the research document. (Several such repositories exist, beyond those often provided by journals themselves as part of a regular publication.)

This document presently assumes that you have already worked out a basic plan for your modeling project. More specifically, this should include how you intend to make use of real experimental data in approaching scientific questions through mathematical modeling. From the outset it can be very useful to have potential journals in mind for publication of your results. This can guide the specific planning of your experimental collaboration.

4. The good practice guide

What follows is a somewhat loosely connected list of suggestions, pitfalls, and questions to ask yourself when considering the acquisition and use of data from an experimental lab, whether or not you actually participate in the experiments yourself. Not all of these ideas will necessarily apply -- either in part or in full -- to a particular data-driven modeling problem.

4.1. Administrative and mental preparation

  1. Start a project notebook and document binder to keep your materials together

  2. Learn how to use programming environments, database programs, spreadsheets, etc. to organize your records and files; evolve a filing system that’s suitable to your needs

  3. Consider whether / how to synchronize your written notes with those kept on computer (scanning, transcription, printouts, CDs)

  4. What software and computing resources will you have available?

  5. Which new software tools or mathematical techniques should you learn ahead of time, or expect to learn as you go?

  6. Who are your go-to people (professors, colleagues, lab techs) for help when you get stuck with different types of technical or general problems?

4.2. Before collecting data

  1. Experiments usually turn out to be much more complex to execute in practice than expected – ensure that your ideas are as simple as possible and focused on a small number of specific issues (ideally just ONE)

  2. Familiarize yourself with the experimental setup; spend time in the lab

  3. Detail a plan that connects the types of data you expect to acquire to its expected use in simulations, analysis, etc.

  4. How do you expect to visualize or otherwise present the types of data you want to collect?

  5. Become aware of data acquisition issues

    1. Noise, variability, and uncertainty inherent in the dynamics of the physical system you are observing

    2. Noise, variability, and uncertainty in the instrumentation (tolerances/precision/data sampling rates)

    3. Collect references for how the equipment is calibrated

    4. Record references to the equipment manufacturers for later citation

    5. How can things go wrong during the experiment? Can you detect when this happens?

  6. Prepare documentation jointly with the lab members

    1. List the acquisition processing stages involved from initial measurement to final data set (both automatic and manual steps)

    2. Create a summary chart of the process

    3. Agree upon an export data format (what to do with NaNs, column headings, etc.)

    4. Record units, conversions, and derivations needed to get final data set from raw data set

    5. Record any important references to these methods in the literature

    6. Record inherently problematic derived measurements that might magnify error or introduce uncertainty, distortion, or bias (e.g. reconstruction of angles, discretization, filtering (phase bias), behavior of any implied algorithmic parameter fitting in the derivations)

    7. Are there any “fudge” steps that are not rigorously justified? Why?

    8. Record other anecdotal or general points about the setup for future reference

  7. How much data will you need?

  8. How much time will it take to collect the data?

    • Be realistic about these issues, but try to be generous to yourself – you often need more than you think. Some runs might have problems and need to be discarded; your methods might turn out to need more data to improve accuracy, etc.
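The export-format agreements above (NaN handling, column headings, unit conversions) can be captured directly in a small loader function, so the convention lives in code rather than in memory. The following Python sketch assumes a hypothetical CSV convention -- the file layout, the `knee_angle_deg` column name, and the degree-to-radian conversion are illustrative stand-ins, not a prescribed format:

```python
import csv
import math

# Hypothetical agreed-upon export convention: comma-separated values,
# first row is column headings, missing samples written as "NaN",
# angles exported in degrees and converted to radians on import.
def load_trial(path):
    """Read one exported trial into a dict of column-name -> list of floats."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        headings = next(reader)
        columns = {name: [] for name in headings}
        for row in reader:
            for name, cell in zip(headings, row):
                # "NaN" cells become float('nan') rather than silently 0.0
                columns[name].append(float(cell))
    # Example of a recorded derivation: store the converted column alongside
    # the raw one so the raw export is never overwritten.
    if "knee_angle_deg" in columns:
        columns["knee_angle_rad"] = [math.radians(x)
                                     for x in columns["knee_angle_deg"]]
    return columns
```

Keeping the raw and derived columns side by side makes it easy to audit the conversion later.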

4.3. Pilot tests and “proof of concept”

  1. Analyze "dry runs" of the experiment before collecting massive amounts of data

  2. Re-assess whether you are being too ambitious too soon with your modeling goals; there may be a simpler version of an experiment that needs to be done first in order to demonstrate the viability of your ideas

  3. Ambitious projects typically run into many small obstacles early on, both in the lab and later on the computer

    1. Follow through later modeling and analysis stages with dry run data to see where problems might lie

    2. Some obstacles may present crucial philosophical or practical challenges to your objectives – prioritize these, as tackling all the obstacles at once can be too resource-consuming

    3. Step back and decompose the project into smaller sub-problems that might require preliminary experiments and tests in order to establish the working of your methods in the “bigger picture”

4.4. Preparation of acquired data

  1. Create regular backups of data to CD/DVD/backup server and record their date!

  2. Create a chart summarizing what types of data you have and how it is broadly organized

  3. Create an interactive database for your data

    1. Store names of data types (e.g. the column headings in your arrays) and indices of attributes as “mapping objects” (e.g. the dictionary type in Python, or a struct or containers.Map in MATLAB)

    2. Use a real database program if your data is sufficiently complex

    3. Document how you’ve organized your database for others to use it

  4. Evaluate the quality of data

    1. Decide how to evaluate the technical quality of your data

      1. Encode this in a function that calculates an appropriate measure, if possible

    2. Characterize properties of data

      1. Size, “quality”, variability, skew, etc., of each data set

    3. Does the data “look” reasonable from these properties?

  5. Structure your data

    1. Create different “views” of data, e.g. using database queries and filters

    2. Break down the data set according to where there is greatest variability or uncertainty in measurements – you may want to analyze only a high-quality sub-set of your data
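The mapping-object and quality-measure ideas above can be sketched in a few lines. The column layout, the `quality` criterion, and its threshold below are hypothetical examples, not a recommended standard:

```python
import math

# Hypothetical column layout for a flat numeric array of trial data.
# A mapping object keeps names next to indices so analysis code never
# hard-codes "column 2 is velocity".
COLUMNS = {"time": 0, "position": 1, "velocity": 2}

def column(data, name):
    """Extract one named column from a list of rows."""
    i = COLUMNS[name]
    return [row[i] for row in data]

def quality(values, max_gap_fraction=0.1):
    """A toy technical-quality measure: accept a column only if the
    fraction of missing (NaN) samples is below max_gap_fraction."""
    n_missing = sum(1 for v in values if math.isnan(v))
    return n_missing / len(values) <= max_gap_fraction
```

Encoding the quality criterion as a function, as point 4.1.1 suggests, means the same test can be re-run unchanged whenever new data arrives.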

4.5. Data analysis

  1. Create regular backups of data to CD/DVD/backup server and record their date!

  2. Decide on suitability and accessibility of existing algorithms and code for data analysis; collect these methods together as a “toolkit”

  3. Consider writing your own implementations or translating those that are not already in a convenient form

  4. Add version numbers and dates to your analysis code

  5. Document your code in-line, and use human-readable names, etc. (see good programming practice guides)

  6. Keep your code files well structured on disc (use sub-directories, etc.)

  7. Test your analysis methods in a modular fashion using simplified surrogate data

  8. Surrogate data can mimic properties of real data – this helps you be clear about your expectations of the data and your assumptions about their properties

  9. Record copies of exactly what was run and when (keep old copies of code with it if necessary)
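The surrogate-data idea in points 7–8 can be sketched as follows. The analysis routine and the sinusoid parameters are hypothetical stand-ins for whatever is in your own toolkit; the point is that with constructed data the correct answer is known in advance, so the analysis can be tested against it:

```python
import math
import random

def mean_amplitude(signal):
    """Analysis routine under test: half the peak-to-peak range."""
    return (max(signal) - min(signal)) / 2.0

def surrogate_sine(amplitude, n=1000, noise=0.0, seed=0):
    """Noisy sinusoid mimicking a periodic measurement; seeded for
    reproducibility of the test itself."""
    rng = random.Random(seed)
    return [amplitude * math.sin(2 * math.pi * k / n)
            + rng.gauss(0.0, noise) for k in range(n)]

# Because we built the data, we know the right answer and can bound the error.
clean = surrogate_sine(amplitude=2.0)
assert abs(mean_amplitude(clean) - 2.0) < 0.01
```

Adding noise of a known level to the surrogate then shows how gracefully the analysis degrades, which makes your assumptions about the real data explicit.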

4.6. Modeling issues impacted by experimental approach

[ This section is slightly more off-topic than the rest, so future expansion of it may be moved to a new page. ]

  1. Appropriate type of dynamical model

    1. Continuous vs. discrete (in time), Deterministic vs. stochastic, ODE (spatially discrete) vs. PDE (spatially extended), hybrid systems

    2. Background scientific theory and notation: mechanics, thermodynamics, kinetics, etc.

  2. Appropriate parameters, variables

  3. Explicit or implicit constraints, reductions, symmetries and conserved quantities in model

    1. Use of differential-algebraic equations, averaging, etc.

  4. Establish notation for the model (based on appropriate literature)

  5. Model simulation

    1. Numerical issues

    2. Software environment; integration with data and model analysis tools

  6. Mathematical model analysis

    1. Control and systems theory

    2. Dynamical systems theory

    3. Statistics

  7. Parameter estimation methods to fit models to data?

    1. Identifiability of parameters?

    2. Bias from discretization, filtering, other data processing

  8. Identification of ill-conditioning (stiffness), well-posedness, multiple scales in the candidate models

    1. Understanding of numerical methods suitable for dealing with these problems

  9. Mathematical goals, and roles of different analysis techniques?

    1. E.g., use of bifurcation theory, phase-plane analysis

  10. Can you generate experimentally viable and scientifically interesting predictions with the model?
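As a minimal illustration of the simulation and numerical-issues points above, here is a fixed-step forward Euler integrator applied to the logistic equation as a stand-in model. This is a teaching sketch only: real projects should use a well-tested adaptive integrator (e.g. from PyDSTool or SciPy), especially for stiff systems:

```python
def euler(f, x0, t0, t1, dt):
    """Fixed-step forward Euler; returns lists of times and states."""
    ts, xs = [t0], [x0]
    while ts[-1] < t1:
        xs.append(xs[-1] + dt * f(ts[-1], xs[-1]))
        ts.append(ts[-1] + dt)
    return ts, xs

def logistic(r, K):
    """Stand-in model: dx/dt = r*x*(1 - x/K)."""
    return lambda t, x: r * x * (1.0 - x / K)

# Elementary numerical-issues check: halving the step size should not
# change the final answer much; if it does, the step is too coarse.
_, coarse = euler(logistic(r=1.0, K=10.0), x0=0.5, t0=0.0, t1=10.0, dt=0.01)
_, fine = euler(logistic(r=1.0, K=10.0), x0=0.5, t0=0.0, t1=10.0, dt=0.005)
assert abs(coarse[-1] - fine[-1]) < 0.05
```

The step-halving comparison is a crude but useful first probe for the ill-conditioning and stiffness issues listed in point 8.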

5. External resources and further reading

5.1. Professional

5.1.1. The Ten Simple Rules Series

* [WWW] Gu J, Bourne PE (2007) Ten Simple Rules for Graduate Students. PLoS Comput Biol 3(11): e229. doi:10.1371/journal.pcbi.0030229

* [WWW] Bourne PE (2005) Ten Simple Rules for Getting Published. PLoS Comput Biol 1: e57. doi:10.1371/journal.pcbi.0010057

* [WWW] Bourne PE, Korngreen A (2006) Ten Simple Rules for Reviewers. PLoS Comput Biol 2: e110. doi:10.1371/journal.pcbi.0020110

* [WWW] Bourne PE, Chalupa LM (2006) Ten Simple Rules for Getting Grants. PLoS Comput Biol 2(2): e12. doi:10.1371/journal.pcbi.0020012

* [WWW] Bourne PE, Friedberg I (2006) Ten Simple Rules for Selecting a Postdoctoral Position. PLoS Comput Biol 2(11): e121. doi:10.1371/journal.pcbi.0020121

* [WWW] Vicens Q, Bourne PE (2007) Ten Simple Rules for a Successful Collaboration. PLoS Comput Biol 3(3): e44. doi:10.1371/journal.pcbi.0030044

* [WWW] Bourne PE (2007) Ten Simple Rules for Making Good Oral Presentations. PLoS Comput Biol 3(4): e77. doi:10.1371/journal.pcbi.0030077

* [WWW] Erren TC, Bourne PE (2007) Ten Simple Rules for a Good Poster Presentation. PLoS Comput Biol 3(5): e102. doi:10.1371/journal.pcbi.0030102

5.2. Background reading and advocacy

5.3. Technical

This wiki page is maintained by Robert Clewley, but suggestions and contributions are welcomed and encouraged. This page is editable by you once you sign up for a user account on this wiki.

last edited 2009-03-27 03:37:06 by RobClewley