From the course: Data Curation Foundations

Standardized code

Now, we are going to start talking about text-based curation, with a focus on standardized code. The reason why this is the last topic in this curation course is that it is admittedly kind of a grab bag. Most data curation is either tabular, like a data dictionary or a crosswalk, or a figure, like all those flow diagrams we just looked at. But then, there is everything else. I decided to call the "everything else" text-based curation files because, as you will see, they often come in the form of files with a lot of text, such as reports or code.

As I was saying, in a project, you probably won't have that many text-based curation files. In my experience, you will have very few of them, but the ones involved will be very, very important. And, oddly enough, the most important text-based curation files were often written by someone who is not you or on your team, often someone you do not even know personally.

Before we focus on standardized code as a type of text-based curation, I want to briefly give an overview of the different types of curation covered in this chapter. The topics are on the slide. Right now, I'm going to cover standardized code, but later, I will cover data reports, manuals, cheat sheets, and diagrams. These items will likely be authored by you or someone involved with your team: either another programmer, an admin, or a manager type. What this means is that this type of curation is made by people you work with, so you can have some control over what it is and how it is made.

By contrast, these items, founding documents and project-related agreements, most likely originate outside of your team. Often, these are created by leaders or policy makers. For these, it is just important to have the right ones on hand for reference, so that your team can use them as resources during the project.

All right, let's get back to the focus of this particular video, which is the text-based curation of standardized code.
And remember, I pointed out that this is under the control of your team. Your project team can make the deliberate decision to standardize code for curation purposes. Wow, did I just hear a big, collective groan? Please, let me put your mind at ease. If you are a spaghetti coder, and you know who you are, realize that standardizing code isn't always necessary. When it is necessary, however, you really need to rebuild your code to some sort of policy or standard.

Here are the two situations where you really have to standardize your code. The first is the situation where you are building an analytic dataset for a one-time study. This is so anyone else who replicates the study can replicate the dataset. The second situation where you really need to do it is when developing an ETL protocol for loading data into a data warehouse. And the reason why it is important in these situations is that all analysts need to be synchronized, and these situations need to be easily replicable.

I'm going to give you five tips for standardizing code files. Here are the first two, which I will talk about together: creating modular code files, and naming the files with numbers at the beginning. If you take my R or SAS courses on LinkedIn Learning, you'll see me doing this. See these example code names? They all start with numbers, so they sort in order. And each file does almost nothing. Just one process. For example, all 105 does is remove unwanted columns. That is probably less than ten lines of programming. But doing it this way makes it very easy to manage. If I wanted to keep a column we were dropping routinely, I'd just go edit 105.

Tips 3 and 4 have to do with what is actually in the code. Tip 3 is about formatting the code, and tip 4 is about creating naming conventions for datasets created within the code. See? I made up some example SAS code, but this could be any language.
The top two lines copy a dataset named brfss_b into one called brfss_c. See that naming convention? That's what tip 4 is all about. But, for tip 3, the style guide, please look at the rest of the code formatting on the slide. See how the first command, we call it a DATA step in SAS, lines up with the run command at the bottom? And see the comment in the middle? Those are the kinds of policies you can set up as a style guide for the code.

And finally, even though you'll make a data dictionary, it's helpful to set up a naming convention policy for making new fields. Here is example SAS code. See how the variable VETERAN4 is being made out of the variable VETERAN3? That is a simple naming convention I set up for BRFSS data, because they were always adding numbers at the end of their variables, so I just had us carry on and do the same.

Remember, micromanaging code to this level is not always necessary. It depends on the project. But if you do have to set up a style guide for standardized code, rest assured that your other curation, like your data dictionary and flow diagrams, will be consistent. All the files will work together to support your project.
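The slides are not reproduced in this transcript, so here is a minimal Python sketch (not the SAS shown in the video) of how the naming conventions from tips 1, 2, 4, and 5 could be checked or applied by a small helper. The file names and function names are hypothetical, made up for illustration; only brfss_b, brfss_c, VETERAN3, VETERAN4, and the idea that file 105 removes unwanted columns come from the talk.

```python
import re


def ordered_steps(filenames):
    """Sort code files by their leading number (tips 1 and 2)."""
    def step_number(name):
        match = re.match(r"\d+", name)
        if match is None:
            raise ValueError(f"file name has no numeric prefix: {name!r}")
        return int(match.group(0))

    return sorted(filenames, key=step_number)


def next_dataset_name(name):
    """Advance the trailing letter of a dataset name (tip 4).

    Assumes names end in an underscore plus one lowercase letter,
    as in brfss_b -> brfss_c.
    """
    match = re.fullmatch(r"(.+_)([a-y])", name)
    if match is None:
        raise ValueError(f"dataset name does not follow the convention: {name!r}")
    prefix, letter = match.groups()
    return prefix + chr(ord(letter) + 1)


def next_variable_name(name):
    """Advance the trailing number of a variable name (tip 5).

    Assumes names end in digits, as in VETERAN3 -> VETERAN4.
    """
    match = re.fullmatch(r"(.*?)(\d+)", name)
    if match is None:
        raise ValueError(f"variable name has no trailing number: {name!r}")
    stem, number = match.groups()
    return f"{stem}{int(number) + 1}"


# Hypothetical file names; each file is meant to hold one small process.
files = ["110_recode_veteran.sas", "100_read_data.sas", "105_remove_columns.sas"]
for step in ordered_steps(files):
    print(step)

print(next_dataset_name("brfss_b"))    # brfss_c
print(next_variable_name("VETERAN3"))  # VETERAN4
```

Whether helpers like these are worth having depends on the project. For many teams, a short written style guide that states the same rules is enough.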