From the course: Learning Data Analytics: 1 Foundations

Maintaining original data

From the course: Learning Data Analytics: 1 Foundations

Maintaining original data

- [Lecturer] In a dream world, we would run quality data sets from our systems, but that's not always the case. We might only receive a spreadsheet or have limited access to export. In some cases we'll receive a CSV file. It is important to maintain the original data that you've exported or received. It gives you two things, really the ability to restart without having to start all the way back over and also provide you an audit trail. Let's talk about a real life example. You don't have full access to run your own record set for monthly reports, you receive it from the DBAs and the IT department, they run it and send it to you. You then do your cleanup processes, write your reports and then deliver them to your team month by month. In a scenario like this, I keep the file that was delivered to me in a dedicated folder for these types of requests, I do not rely just on my email as my source. I typically start by creating a folder called Original Data and I'll do this now in my folder in chapter four in the 04_05. So I'll right click and do new folder. Now I like to call mine Original Data so that I know if I ever look inside this folder, the source of my original data is in the original data folder. Okay, you've noticed that there is a HireData CSV file in chapter four 04_04. So what I would like to do now is actually move that HireData to my Original Data folder. So I'll just drag and drop it. Okay, let me open up the Original Data folder. I do want this file to be read only, that way I can't accidentally damage it. I'm going to right click and choose properties and mark it as read only, I'll go ahead and click OK. I could also do the same thing by opening the file up and marking it as read only inside the file tab. Now I need to create a working copy of the file, I'll go ahead and double click HireData, and again, it is CSV and it is currently read only. I'm going to use my favorite shortcut F12, F12 quickly opens up save as, so I'm going to move it up to 04_04 and call it a HireData_WorkingFile. Okay, notice that it's comma separated values which means I can't have multiple tabs, pivots or charts so I'll go ahead and make this an Excel Workbook. Okay. I'll choose save. Okay I just want to confirm all my info tab. Perfect, I don't see any read only here, so I'm good to go I'll go ahead and choose save. Now I have a copy of my Original File and my Working File and these simple techniques can save you a lot of time with not having to start back over. My running joke on this is that the minute your bosses see the information, they suddenly remember all the things they forgot to ask you for. Understand for most people, this is happening because they need to see some of the data to know what to ask for next. Having the dataset to work with without starting over can be helpful, let me show you how I do this. So first of all, there's some cleaning that will always occur with this particular file and I want to say that as my base cleaning. So here's what I'll do, I'll go ahead and hit my select all, I'll go ahead and double-click in between the A and the B column headers to size it all out. I know that every time I work with this file I do not need columns B, C or D. So I'll go ahead and highlight B, C and D and then I'll right click and delete those. Okay, I'll go ahead and click on A1. I also want to bold my header so for this, I'll use a shortcut, I'm going to do shift control and right arrow that will go all the way from where my cursor was placed in A1 to the end of the data which it sees in this scenario as G1. And then I'll bold these headers, I'll do control B. This let's Excel know that this is my header. I will do this every time I work with this particular file I'll delete those columns every time I worked with this file. So therefore this is my base cleaning routine. Okay, so I'm going to go ahead and click on A1, I'm going to save my work and then I'll press function key F12 again and name this my BaseCleaning and then I'll save it. That way If I get a new request or a change to the request, I don't have to re-perform those steps again. Okay, let's navigate back over to our working vow. Okay, so I'll double click my working file. Now that I have my working file ready to go, there is one more step that I will do. Until you've worked out all the details of what you need on this file, sometimes you may find that you need to start back over. So I will make a copy of this sheet so I can return to it if I need to, again, this file is not millions of rows long so this is a simple and easy technique. I'll take my mouse and put it on the sheet tab, I'll hold my control key on my keyboard, I'll left click and hold and drag to the step. You'll notice there's a little arrow there and a little plus sign that indicates copy. I'm going to let go of the mouse and then I'll let go of control. So I have my HireData sheet tab and I have a new copy of HireData. I'll go ahead and rename this one by double-clicking and call this one Working. That way, if I need to start back over with a fresh copy I can just drag and drop a new copy. It's important to create a folder structure that makes sense for you, filing is so important, but it's also personal. You'll need to maintain your original data and you'll want to keep your base cleaning routines. You cannot have too many copies of your data you'll often refer back to these original sets as you work out the exact steps and methods you need to get the data to the state you need that data for reporting or visualizing. There are so many variations of keeping your files based on your organization. You may be saving them to a server folder or a cloud drive. Just be sure you keep up with the work so you don't have to reinvent the wheel over a single mistake.

Contents