A guide to good practice for collaborative research in data-driven computational modeling and simulation
"You do not really understand something unless you can explain it to your grandmother." -- Albert Einstein
"A mathematical theory is not to be considered complete until you have made it so clear that you can explain it to the first man whom you meet on the street." -- View of an (unknown) old French mathematician, retold by David Hilbert during his address, titled "Mathematical Problems" to the Second International Congress of Mathematicians in Paris, 1900.
Please note: This page is a temporary backup of the original wiki page hosted at Cornell University.
- Intended audience
- Open and collaborative wiki content
- Motivation
- The good practice guide
- External resources and further reading
1. Intended audience
This document is a working outline of ideas concerning good practice guidelines for collaborative projects between experimentalists and computational modelers / applied mathematicians in the physical sciences. It is intended as a training resource, primarily for students in the field, specifically in the context of using experimental data with dynamical systems models. It is not directly connected to the PyDSTool software project (see the menu on the sidebar for links to that) despite being hosted on the same wiki at the Cornell Center for Applied Mathematics. The contents of this document have been compiled largely from personal experience and from contributions by other practitioners. As a result it is presently (and unintentionally) biased towards the fields of biomechanics and neuroscience, but the aim is to keep the advice as generally applicable as possible. The tone is meant to be somewhat informal and not preachy -- it should contain insightful practical tips and warnings, but not technical material from a lecture course on mathematical modeling. If it is of use to you, please contribute, and link to it as http://www.cam.cornell.edu/~rclewley/cgi-bin/moin.cgi/GoodPractice.
2. Open and collaborative wiki content
This document is intended to be inherently collaborative in nature, and as such this page is editable by you if you would like to contribute material, or just make minor edits. Because of recent wiki spam the editing of this page is now only available after you set up a simple account. Please contact me (RobClewley) to get access, and learn how to use the MoinMoin wiki markup language (see the HelpContents page). When you have an account you can click "Edit" in the sidebar. Ideally you would add a note about your changes in the "Optional comment about this change" box with your name attached.
3. Motivation
Accountability and maintainability of tools and methods are fundamental aspects of good practice in all branches of science: from them we can aspire to widespread reproducibility of results and build confidence in our work. The combination of computational modeling methodologies with applied mathematics and experimental scientific research is a young and rapidly evolving domain. Unlike many established branches of science, it does not always exhibit the same maturity in communicating the essential details of its tools, methods, and data. For instance, journal publications often inadequately communicate the assumptions, methodology, and algorithms used in generating modeling results. This leads to poor reproducibility of results, to frustration and uncertainty, and potentially even to skepticism from the community.
While administration is rarely an enjoyable task, and should never become more important than the scientific objectives, we can motivate ourselves by remembering that maintaining good records from the beginning of a project will greatly simplify (a) the generation of progress reports, theses, grant proposals, and publications; and (b) the justification and validation of our scientific results. What is more, this material does not necessarily have to be impeccably presented or conform to any formal quality assurance standards in order to achieve these benefits. Realistically, the degree to which we can make detailed records depends on the needs of the project and the expectations of our collaborators and ourselves, as well as the available time resources. However, for anything but the most trivial investigation we can benefit from any effort to systematically document our work. In particular we can avoid the headache and embarrassment of trying to reconstruct faded memories and old computer code detailing how we calculated a certain important result from a year before. Remember that it is typical, not exceptional, for our data, our objectives, and our methods to remain moving targets during a project, even after careful initial planning.
Our goal for the auditing and documentation of a collaborative modeling project should be to be able to reconstruct the process at a later date, and reproduce our own results, even when the data set, objectives, and methods may have changed in the interim. For an ultimate goal such as publication, we should aim to prepare data sets, algorithms, and code in a form that can be published as supplementary material to the written portion of our work, e.g. in an online repository associated with the research document. (There are several such repositories for this, even outside of those often provided by journals themselves as part of a regular publication.)
This document presently assumes that you have already worked out a basic plan for your modeling project. More specifically, this should include how you intend to make use of real experimental data in approaching scientific questions through mathematical modeling. From the outset it can be very useful to have potential journals in mind for publication of your results. This can guide the specific planning of your experimental collaboration.
4. The good practice guide
What follows is a somewhat loosely connected list of suggestions, pitfalls, and questions to ask yourself when considering the acquisition and use of data from an experimental lab, whether or not you actually participate in the experiments yourself. Not all of these ideas will necessarily apply -- either in part or in full -- to a particular data-driven modeling problem.
4.1. Administrative and mental preparation
Start a project notebook and document binder to keep your materials together
Learn how to use programming environments, database programs, spreadsheets, etc. to organize your records and files; evolve a filing system that’s suitable to your needs
Consider whether / how to synchronize your written notes with those kept on computer (scanning, transcription, printouts, CDs)
What software and computing resources will you have available?
Which new software tools or mathematical techniques should you learn ahead of time, or expect to learn as you go?
Who are your go-to people (professors, colleagues, lab techs) for help when you get stuck with different types of technical or general problems?
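As one concrete (and entirely optional) starting point for the "filing system" suggestion above, a project layout can be scaffolded in a few lines. The directory names below are just one suggestion, not a standard -- evolve your own:

```python
import os
from datetime import date

def scaffold_project(root):
    """Create a simple, dated filing layout for a modeling project.
    The directory names are only a suggestion -- adapt them."""
    subdirs = [
        "notes",                          # scanned / transcribed notebook pages
        os.path.join("data", "raw"),      # untouched exports from the lab
        os.path.join("data", "derived"),  # processed data, with provenance notes
        os.path.join("code", "analysis"), # versioned analysis scripts
        os.path.join("results", date.today().isoformat()),  # dated results
    ]
    paths = [os.path.join(root, s) for s in subdirs]
    for p in paths:
        os.makedirs(p, exist_ok=True)
    return paths

paths = scaffold_project("my_project")
```

Keeping raw and derived data in separate directories makes it much harder to accidentally overwrite the originals.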
4.2. Before collecting data
Experiments usually turn out to be much more complex to execute in practice than expected -- keep your ideas as simple as possible and focused on a small number of specific issues (ideally just ONE)
Familiarize yourself with the experimental setup; spend time in the lab
Detail a plan that connects the types of data you expect to acquire to its expected use in simulations, analysis, etc.
How do you expect to visualize or otherwise present the types of data you want to collect?
Become aware of data acquisition issues
Noise, variability, and uncertainty inherent in the dynamics of the physical system you are observing
Noise, variability, and uncertainty in the instrumentation (tolerances/precision/data sampling rates)
Collect references for how the equipment is calibrated
Record references to the equipment manufacturers for later citation
How can things go wrong during the experiment? Can you detect when this happens?
Prepare documentation jointly with the lab members
List the acquisition and processing stages involved, from initial measurement to final data set (both automatic and manual steps)
Create a summary chart of the process
Agree upon an export data format (what to do with NaNs, column headings, etc.)
Record units, conversions, and derivations needed to get final data set from raw data set
Record any important references to these methods in the literature
Record inherently problematic derived measurements that might magnify error or introduce uncertainty, distortion, or bias (e.g. reconstruction of angles, discretization, filtering (phase bias), behavior of any implied algorithmic parameter fitting in the derivations)
Are there any “fudge” steps that are not rigorously justified? Why?
Record other anecdotal or general points about the setup for future reference
How much data will you need?
How much time will it take to collect the data?
Be realistic about these issues, but try to be generous to yourself – you often need more than you think. Some runs might have problems and need to be discarded; your methods might turn out to need more data to improve accuracy, etc.
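The kind of export agreement discussed above (column headings, a fixed convention for NaNs, recorded unit conversions) might be captured directly in a small loader. The file layout, the column name, and the degrees-to-radians conversion below are invented purely for illustration:

```python
import csv
import math

def load_export(path, angle_cols=("hip_angle",)):
    """Load an exported data file under a (hypothetical) agreed format:
    comma-separated, first row gives column headings, missing values
    written as the literal string "NaN", and angles exported in degrees
    but converted to radians for use in the model."""
    rows = []
    with open(path, newline="") as f:
        for record in csv.DictReader(f):
            clean = {}
            for key, val in record.items():
                x = float(val)            # the string "NaN" parses to nan
                if key in angle_cols and not math.isnan(x):
                    x = math.radians(x)   # record conversions like this one!
                clean[key] = x
            rows.append(clean)
    return rows
```

Writing the agreement down as code has a side benefit: the loader itself documents the format, and it will fail loudly if the lab's export ever changes shape.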
4.3. Pilot tests and “proof of concept”
Analyze "dry runs" of the experiment before collecting massive amounts of data
Re-assess whether you are being too ambitious too soon with your modeling goals; there may be a simpler version of an experiment that needs to be done first in order to demonstrate the viability of your ideas
Ambitious projects typically run into many small obstacles early on, both in the lab and later on the computer
Follow through later modeling and analysis stages with dry run data to see where problems might lie
Some obstacles may present crucial philosophical or practical challenges to your objectives – prioritize these, as tackling all the obstacles at once can be too resource-consuming
Step back and decompose the project into smaller sub-problems that might require preliminary experiments and tests in order to establish the working of your methods in the “bigger picture”
4.4. Preparation of acquired data
Create regular backups of data to CD/DVD/backup server and record their date!
Create a chart summarizing what types of data you have and how they are broadly organized
Create an interactive database for your data
Store names of data types (e.g. the column headings in your arrays) and indices of attributes as “mapping objects” (e.g. the dictionary type in Python, or a struct / containers.Map object in Matlab)
Use a real database program if your data is sufficiently complex
Document how you’ve organized your database for others to use it
Evaluate the quality of data
Decide how to evaluate the technical quality of your data
Encode this in a function that calculates an appropriate measure, if possible
Characterize properties of data
Size, “quality”, variability, skew, etc., of each data set
Does the data “look” reasonable from these properties?
Structure your data
Create different “views” of data, e.g. using database queries and filters
Break down the data set according to where there is greatest variability or uncertainty in measurements -- you may want to analyze only a high-quality subset of your data
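To make the "quality measure" and "views" ideas above concrete, here is a minimal Python sketch. The trial values, the particular quality measure, and the threshold are all invented for illustration:

```python
import statistics

# Invented example data: each record is one experimental trial.
trials = [
    {"id": 1, "samples": [0.98, 1.01, 1.00, 0.99]},
    {"id": 2, "samples": [0.30, 1.70, 1.10, 0.95]},   # a noisy, suspect run
    {"id": 3, "samples": [1.02, 1.00, 0.97, 1.01]},
]

def quality(trial):
    """One possible technical quality measure: the inverse of the sample
    standard deviation (higher = less variable). Choose a measure that
    reflects how *your* data actually goes wrong."""
    return 1.0 / statistics.stdev(trial["samples"])

# A "view" of the data: only the high-quality subset, for first-pass
# analysis. The threshold of 10 is arbitrary, for illustration only.
good = [t for t in trials if quality(t) > 10.0]
```

Encoding the measure as a function means the same criterion is applied uniformly, and can be re-run (and cited) when the data set grows.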
4.5. Data analysis
Create regular backups of data to CD/DVD/backup server and record their date!
Decide on suitability and accessibility of existing algorithms and code for data analysis; collect these methods together as a “toolkit”
Consider writing your own implementations or translating those that are not already in a convenient form
Add version numbers and dates to your analysis code
Document your code in-line, and use human-readable names, etc. (see good programming practice guides)
Keep your code files well structured on disc (use sub-directories, etc.)
Test your analysis methods in a modular fashion using simplified surrogate data
Surrogate data can mimic properties of real data – this helps you be clear about your expectations of the data and your assumptions about their properties
Record copies of exactly what was run and when (keep old copies of code with it if necessary)
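The surrogate data idea above can be sketched as follows: build artificial data with a KNOWN property (here, a known oscillation period), add noise at the level you assume your instruments produce, and check that the analysis recovers the known answer. The period estimator below is a deliberately simple stand-in for a real analysis routine:

```python
import math
import random

def estimate_period(samples, dt):
    """Estimate an oscillation period from upward zero crossings -- a
    deliberately simple stand-in for a real analysis routine."""
    crossings = [i for i in range(1, len(samples))
                 if samples[i - 1] < 0.0 <= samples[i]]
    if len(crossings) < 2:
        raise ValueError("too few cycles to estimate a period")
    gaps = [b - a for a, b in zip(crossings, crossings[1:])]
    return dt * sum(gaps) / len(gaps)

# Surrogate data with a KNOWN period, plus Gaussian noise at a level we
# assume (an assumption worth writing down!) for the instruments.
random.seed(0)
dt, true_period = 0.01, 0.5
surrogate = [math.sin(2 * math.pi * i * dt / true_period)
             + random.gauss(0.0, 0.01) for i in range(2000)]

estimated = estimate_period(surrogate, dt)
```

If the estimator fails even on surrogate data this clean, that tells you something important before any real data is at stake; conversely, raising the noise level probes how robust the method is to your assumptions.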
4.6. Modeling issues impacted by experimental approach
[ This section is just slightly more off-topic, so perhaps future expansion of this can be moved to a new page. ]
Appropriate type of dynamical model
Continuous vs. discrete (in time); deterministic vs. stochastic; ODE (spatially discrete) vs. PDE (spatially extended); hybrid systems
Background scientific theory and notation: mechanics, thermodynamics, kinetics, etc.
Appropriate parameters, variables
Explicit or implicit constraints, reductions, symmetries and conserved quantities in model
Use of differential-algebraic equations, averaging, etc.
Establish notation for the model (based on appropriate literature)
Software environment; integration with data and model analysis tools
Mathematical model analysis
Control and systems theory
Dynamical systems theory
Parameter estimation methods to fit models to data?
Identifiability of parameters?
Bias from discretization, filtering, other data processing
Identification of ill-conditioning (stiffness), well-posedness, multiple scales in the candidate models
Understanding of numerical methods suitable for dealing with these problems
Mathematical goals, and roles of different analysis techniques?
E.g., use of bifurcation theory, phase-plane analysis
Can you generate experimentally viable and scientifically interesting predictions with the model?
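The stiffness point above deserves a concrete illustration. The classic stiff test equation y' = -λy shows why the choice of numerical method matters: the sketch below compares explicit and implicit Euler at a step size the explicit method cannot handle.

```python
import math

# Classic stiff test equation: y' = -lam*y, y(0) = 1, with lam large.
# The exact solution decays smoothly to zero, yet explicit (forward)
# Euler blows up unless h < 2/lam, while implicit (backward) Euler is
# stable for any step size.
lam, h, steps = 1000.0, 0.01, 100    # h is 5x the explicit stability limit

def explicit_euler():
    y = 1.0
    for _ in range(steps):
        y = y + h * (-lam * y)       # y_{n+1} = y_n + h*f(y_n)
    return y

def implicit_euler():
    y = 1.0
    for _ in range(steps):
        y = y / (1.0 + lam * h)      # solve y_{n+1} = y_n + h*f(y_{n+1})
    return y

exact = math.exp(-lam * h * steps)   # effectively zero
```

Production stiff solvers (e.g. backward differentiation formula methods) generalize the implicit idea; the lesson for a data-driven project is to identify stiffness early, since it dictates which tools are even usable.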
5. External resources and further reading
Laboratory management "how to" guides -- at sciencecareers.org.
The Academic Scientist's Toolkit -- at sciencecareers.org.
ProjectContinuity -- an outline for documenting lab work that will enable project continuity in spite of personnel changes.
Keeping a lab notebook -- Chemistry department, Wellesley College. (Thanks for the great quote!)
5.1. The Ten Simple Rules Series
* Gu J, Bourne PE (2007) Ten Simple Rules for Graduate Students. PLoS Comput Biol 3(11): e229 doi:10.1371/journal.pcbi.0030229
5.2. Background reading and advocacy
Scientific research communication: the promise and current realities of enhanced publications -- Mackenzie Smith, MIT.
Vision papers at the Commons of Science conference, 2006.
Scientific Method -- at wikipedia.
Maturation Phase of the Modeling and Simulation Discipline -- Tuncer Oren, U Ottawa.
Scientific Computing -- at wikipedia.
Numerical Computation in the Information Age -- John Guckenheimer, Cornell University.
Dynamical Systems and Computational Science -- SIAM Past President Lecture -- John Guckenheimer, Cornell University.
Experimental design FAQ -- Department of Statistics, U Southern Denmark.
Mathematical Biology -- at wikipedia.
Data Mining -- at statsoft.com.
Data Acquisition -- at wikipedia.
Exploratory Data Analysis -- at wikipedia.
Scholarpedia -- science-oriented wiki-style encyclopedia (especially oriented to computational neuroscience and dynamical systems).
This wiki page is maintained by Robert Clewley, but suggestions and contributions are welcomed and encouraged. This page is editable by you once you sign up for a user account on this wiki.