Coding Principles
We generally want to follow the Gentzkow and Shapiro code structure and data storage protocols. Also see chapters 6 and 7 of the old Gentzkow and Shapiro guide that cover abstraction and self-documentation. Several basic things to re-emphasize:
- The entire project, from initial data to compiling the paper pdf, can be run from one command, typically GitHub/ProjectName/MakePaper.sh. This shell script will call other files, e.g. Stata, R, Matlab, and Python.
- Do not manually pre-process data, e.g. manipulate Excel sheets, before importing into R or Stata. All data processing, beginning with the original file, should be automated and called from MakePaper.sh.
- To keep MakePaper.sh from getting too long and unwieldy, MakePaper.sh may instead call sub-batch files that group together calls to functionally related Stata, R, etc. code (e.g., BuildPriceData.sh, BuildDrillingData.sh, Estimation.sh).
- Keep code less than 100 characters wide so that it is easy to read.
- Each dataset has a valid (unique, non-missing) key. For example, you might have a dataset of US county characteristics, e.g. square miles and 1969 population, with one row for each county, and the key being the stateFIPS+countyFIPS.
- Keep datasets normalized (meaning that they contain only variables at the same logical level as the key) as late in the data preparation process as possible. Once you merge a state-level dataset with a county-level dataset, the state-level variables are recorded many times (once for each county). This takes a lot of space and can also confuse other aspects of data preparation.
- Code should be system agnostic, in that once user-specific directories are identified at the top of the code, along with a query for the identity of the user, the code can be run from any team member's machine without having to manually change the program's working directory.
- File paths should always use forward slashes (/) rather than backslashes (\) to avoid problems on non-Windows operating systems.
- Best practice is to include a short piece of code in the /Code directory that other routines can call to obtain directory paths. In Stata, this can be accomplished through the use of the include command. Thus, each routine need only point to each user's /Code directory rather than the full set of sub-directories. This approach avoids the need to modify every routine should we decide to re-organize the directory structure at a later date.
- Code should also be version agnostic, so that team members (and future reproducers) get the same results no matter what version of software they are using. In Stata, this can be accomplished by using the version command to set the Stata version to the lowest version in use by the team.
- Whenever using an algorithm that requires random number generation (e.g., Monte Carlo simulation or bootstrapping), set the seed so that the results do not change every time the code is run.
- As much as possible we would like to use object-oriented programming (OOP). It adds a layer of structure and simplification to large, complicated code. A nice introduction to OOP for Economics can be found here.
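Three of the principles above (a shared path helper, a valid key for every dataset, and a fixed random seed) can be sketched in Python. This is only an illustration under stated assumptions, not a prescribed implementation: the user names, the root directories, the Data sub-directory, and the function names project_paths, assert_valid_key, and bootstrap_means are all hypothetical.

```python
import getpass
import random
from pathlib import PurePosixPath

# Hypothetical per-user project roots. The user names and directories
# are placeholders; note that every path uses forward slashes.
USER_ROOTS = {
    "alice": "/Users/alice/GitHub/ProjectName",
    "bob": "C:/Users/bob/GitHub/ProjectName",
}


def project_paths(user=None, roots=USER_ROOTS):
    """Return the project directories for the given (or current) user."""
    user = user or getpass.getuser()
    if user not in roots:
        raise KeyError(f"No project root registered for user {user!r}")
    root = PurePosixPath(roots[user])
    # "Data" is an assumed sub-directory name, used here for illustration.
    return {"root": root, "code": root / "Code", "data": root / "Data"}


def assert_valid_key(rows, key_fields):
    """Raise ValueError unless key_fields form a unique, non-missing key.

    rows is a list of dicts, one dict per observation.
    """
    seen = set()
    for i, row in enumerate(rows):
        key = tuple(row.get(field) for field in key_fields)
        if any(value is None or value == "" for value in key):
            raise ValueError(f"Row {i}: missing key value in {key_fields}")
        if key in seen:
            raise ValueError(f"Row {i}: duplicate key {key}")
        seen.add(key)


def bootstrap_means(values, n_draws, seed):
    """Return n_draws bootstrap sample means, reproducible for a given seed."""
    rng = random.Random(seed)  # local generator, so the seed is explicit
    n = len(values)
    return [sum(rng.choices(values, k=n)) / n for _ in range(n_draws)]
```

A routine would call project_paths() once at the top and build every file name from the returned directories, call assert_valid_key(rows, ["state_fips", "county_fips"]) right after loading a county-level dataset, and pass the same seed to bootstrap_means on every run so that results do not change between runs.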