From the course: Data Curation Foundations

Data dictionary basics

- [Instructor] To begin our study of backend curation, we must focus on the main tool in our toolbox, which is the data dictionary. As you can probably guess, a data dictionary is a guide for understanding stored data. So let's think a little about stored data for a moment. I put some data on the slide to inspire us. In a logical sense, data are stored in tables, which have columns and rows. The columns are also referred to as fields or variables, and the rows are also referred to as records or observations. So imagine you wanted to make a guide for understanding the data in a data table. What would you need to do? Well, trust me, the first thing you would need to do is look at it. Nothing helps a person understand data like actually viewing the data. But depending upon the format of the data, viewing it isn't always straightforward. You may have noticed that data can be stored in different formats: txt, csv, and custom software formats. Smaller datasets in the more universal formats like txt and csv can be viewed easily in Excel, Notepad, Word, or other common programs. If you try to open the dataset using one of those programs, it might need to go through some sort of conversion process, but usually you can get the data to display so you can see it. But if the datasets are big or the data are particularly hairy, you might end up viewing the data through a database management system application, or DBMS for short. When I worked at a health insurance company, we used a DBMS to view our data because our datasets were so huge. Those of you who know how to use statistical or data management tools like SQL, SAS, or R probably already routinely use the data viewing functions in those programs. In any case, if you want someone to help you with data curation, they are going to have to be able to view the raw data in some way. If there's not a way now for them to view the data, you will have to come up with an access method for them.
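If you don't have a viewer handy, a few lines of code can do the looking for you. Here is a minimal sketch in Python, assuming a small csv dataset. The sample rows are invented for illustration, and the field names are only borrowed from the exercise file's tabs; nothing here is the actual survey data.

```python
import csv
import io

# Invented sample data standing in for a small csv file (not the real survey data).
# To read a real file, you would use open("your_file.csv", newline="") instead.
raw = io.StringIO(
    "RespondentID,YrsSchoolNurse,Level\n"
    "1,A,Elementary\n"
    "2,,Unknown\n"
    "3,B,High\n"
)

reader = csv.DictReader(raw)
records = list(reader)        # each row (record, observation) becomes a dict
fields = reader.fieldnames    # the columns (fields, variables)

print("columns:", fields)
print("number of rows:", len(records))

# Look at the first few records, the same way you would scroll a spreadsheet.
for record in records[:5]:
    print(record)
```

Even a quick look like this tells you the field names, the row count, and whether any values are blank, which is what you need before you start documenting anything.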
You might be wondering why I am freaking out over the fact that backend curators need to be able to see the raw data. That is because it is totally impossible to do backend curation if you can't see the raw data. First of all, you need to see the actual data table itself that you are curating, the name of the table, and its attributes, like how many rows and columns it has. Next, you have to be able to see the actual field names and their attributes. Is this a categorical or a continuous variable? Is it associated with a drop-down list? Is it an index or pointer to another table? You have to be able to see these things for yourself. You also need to be able to get into the actual data and see the values in the fields. Just because a categorical variable is hooked up to a drop-down list doesn't necessarily mean that all levels of that categorical variable are recorded in the data. Aha, sometimes the entire column is blank. Or it just says, "Unknown." You can tell I've been burned before. Once you get access to all of that, you can start making your data dictionary. The data dictionary might have a lot in it by the time you are done, but it will have two main components. The first main component will be the main table documentation, which will list each data field in the table and information about the data field, such as the source of the field and the meaning of the field. Let's look at your exercise file for this video as an example. See this file? It's called School Nurse Survey Data Dictionary. As you probably guessed, it is a data dictionary for survey data, which is study data, and there is one main table, the one with the fields in the survey. See, I named the tab Main. See here, I have two different sets of field names documented. And here is the source. And here are a few other columns I needed that were specific to the project. See these other tabs here? YrsSchoolNurse, NumStudents, Level. These document what I call picklists.
That's the second component of the data dictionary: documenting picklists, or each level of each categorical variable. So basically, the main table will list all the variables, then I'll use the tabs to list the levels in each categorical variable. So you are probably thinking that if there are a lot of categorical variables, and they all have different lists as potential answers, there might be a lot of tabs with picklists. If you are thinking that, you are correct. In fact, we will talk about this issue in the next video, where we examine the differences between curating a production backend versus a study backend.
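To make the two components concrete, here is a rough sketch in Python of a main table listing plus picklists, with a check for the traps mentioned earlier: columns that are entirely blank, or picklist levels that never show up in the data. Everything in it is an assumption for illustration. The sample rows, the picklist levels, and the build_dictionary helper are invented, and only the field names come from the tabs in the exercise file.

```python
import csv
import io

# Invented sample rows; only the field names echo the exercise file's tabs.
SAMPLE_CSV = """RespondentID,YrsSchoolNurse,NumStudents,Level
1,A,,Unknown
2,B,,Unknown
3,A,,Unknown
"""

# Hypothetical picklists: one list of allowed levels per categorical field.
PICKLISTS = {
    "YrsSchoolNurse": ["A", "B", "C"],
    "NumStudents": ["A", "B", "C"],
    "Level": ["Elementary", "Middle", "High", "Unknown"],
}


def build_dictionary(rows, fieldnames, picklists):
    """Return one main-table entry per field, compared against its picklist."""
    main = []
    for name in fieldnames:
        # Distinct non-blank values actually recorded in this field.
        observed = sorted(
            {(row[name] or "").strip() for row in rows} - {""}
        )
        allowed = picklists.get(name)
        main.append({
            "field": name,
            "type": "categorical" if allowed else "other",
            "observed": observed,
            # Picklist levels that never appear in the data.
            "unused_levels": [lv for lv in allowed if lv not in observed] if allowed else [],
            # True when every value in the column is blank.
            "all_blank": not observed,
        })
    return main


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
main_tab = build_dictionary(rows, reader.fieldnames, PICKLISTS)

for entry in main_tab:
    print(entry)
```

In this made-up example, NumStudents comes back as entirely blank and Level only ever says "Unknown", which is exactly the kind of thing you can only find by looking at the values themselves. A real data dictionary would also carry the source and meaning of each field, which no script can fill in for you.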
