From the course: Learning Data Analytics: 1 Foundations

Working with flat files

- [Instructor] People are not always able to directly connect to the backend of a database system and sometimes they're limited to exports of the data or they receive files that are in Excel format or CSV. I would refer to these as flat files and they're disconnected from the data. So what do I mean by flat or disconnected data? You're not directly connected to the system that houses that data which means when the data in the system is updated, your data in that flat file is not updated. You would need another file to capture those data changes. CSV common flat files. CSV stands for comma separated values. With a CSV it's a delimited file, and the comma serves as the delimiter. What that delimiter does is separates each of the columns from each other. There's also other types like tab delimited or fixed width, which isn't a delimiter but an amount of space. I've always had more CSV or tab delimited than any other style. And by default CSV files open in Excel. All too often, I see these individuals stuck just using Excel as their data solution because that's what they receive and it's already in Excel so it makes total sense to stay in the tool but they may not be leveraging the tools in Excel like Power Query to support their efforts. You may receive a comprehensive list of information. It has everything you need to report on and then the next time you received the file and has all of the information you already reported on plus they need new data. You can then change the data out and update your file. You may run a series of macros whatever you need to do to get the necessary changes to the data into your file. I want to show you a way to leverage Power Query to handle this type of solution, because to me it's a whole lot easier than doing a lot of copying and pasting. Let's go take a look at our chapter five folder in the 05_03 folder. You see two files here. They're both CSV files, Research Project_March and Research Project_MarchandApril. So let me set this up. The very first file we ever received is the Research Project_March. You notice the Excel icon has a little a there. This is actually a comma separated value file. So it's the first file we receive. And the next month, they send us a new comprehensive data file that has March and April. So, if this was us and we didn't know Power Query, we might open this April file, we might pull all of the April data out and copy and paste it into March. There's definitely a better way to work with this type of data. Let's go to a new blank workbook. Let's go to my favorite tab of Excel Data and my favorite option, Get Data and choose from File, From Text. I've got in that chapter five 05_03 folder and I'll double click that March file. So now I've connected my Excel file to my CSV. It reads the type of file and gives me some default information. And interesting look, data type detection based on the first 200 rows. Great. So on the bottom right, I'm going to choose Load. We'll talk about transform later. Now the data comes from Power Query and is loaded to my sheet. Notice I have 126 rows loaded. I'm going to build a basic pivot just for demonstration purposes. I'll go to summarize with pivot. I'll click, OK. I'll go ahead and drag the respondent ID to values and change that sum to account. It sees the respondent ID as a number which is why it's summed it. I'll drag my start date to my columns and then I'll drag my current age to my rows. Let me close my pivot table fields. So if you look at the pivot table, you now see each age of the respondent and when they responded to the survey. This is great. I can set it up, I can build more visuals and then when April rolls around, they're going to send me March and April's data. So again, most people would go add the April data to the bottom of the March data, there's such an easier way to do that. So let's take a look. I'm going to right click my Research Project_March, I'll choose edit. I'm in Power Query and on the right hand side, I'll choose "Data source settings." It shows me my current source is the March CSV I'll go change my source on the bottom left and then I'll browse. I'll choose my margin April data. Now worth noting, these file structures are identical. I'll go ahead and choose Import and then, OK and then I'll choose Close. I have 126 rows which I can see on the bottom left of my screen. Let me refresh. And now it's pulled in the April data I have 480 rows of data. I'll go ahead and choose Closing load, it shows me that it's refreshed over here with 480 rows. I'll go to my sheet three, which has my pivot and I'll go ahead and refresh here. I'll go to my Pivot Table Analyze and choose Refresh. Great. Let me close my connections and you'll notice any new data that came in and then also there's my April data. Let's work through another scenario. Let's go ahead and open up a new blank workbook. Sometimes all you receive is the new data. So in your 05_03 folder, do you see the research project data? If you open that up, you see date stamped, datasets. This research project has different datasets that come in 3:23, 3:27, 3:30 and 4:01. These datasets only contain the newest data at that point. I'm going to tell Power Query to read the entire folder. That way if I add a new file to the folder, it's automatically brought into my dataset. Let's go back to our new blank spreadsheet. I'll go back to Data, I'll choose Get Data, I'll choose From File and then I'll select From Folder. I'll go ahead and browse and I'll navigate to my chapter five folder and 05_03. And I'll choose my ResearchProject Data and click OK. And then I'll choose OK again. This lets me see all of the files that are in that folder. I'll go ahead and choose Combine and I'll tell it to Combine and Load. Now it's going to sample everything based on the first file. And again, it's important to note that all of these files are structured with the same headings in each file. I'll go ahead and click OK and then you'll notice that the ResearchProject Data has 246 rows loaded. And then if you look in the source name, you're going to notice that it has the actual source name with the .CSV. If I hit that dropdown, I can see all four of those files from that folder were brought into my dataset. This means when I get let's just say the 415, 2020 file, if I dump it into that folder, I can come directly to my spreadsheet, choose refresh and I'll have all that data included. No matter how you work with flat files, you can leverage the data tools that you have right inside of the Excel to make it an easier process.

Contents