From the course: Data Science Foundations: Data Assessment for Predictive Modeling

Getting creative about data sources


- [Instructor] I want to tell you a brief story to help you picture what it feels like to do all this detective work acquiring data during a real-world project. Some years ago, I was on an insurance fraud project. We were trying to predict risky claims, and specifically we were focused on staged accidents. That's when a criminal gets into an accident on purpose and uses a network of other criminals to generate fake medical bills for reimbursement. It's big money, hundreds of millions of dollars.

I was using my approach of following the data through the various departments related to the problem. The primary focus was claims investigation, but one of my subject matter experts was the executive in charge of claims processing, in other words, the day-to-day paperwork of receiving claims and entering the information. In my effort to understand the business processes that generated the data for each data source, I asked to sit in for a half hour while inbound claims initiation calls were coming in. I was introduced to one of the team members and tried to stay out of her way while she worked. Keep in mind that I had already studied the data dictionaries at this point. I was just there to observe.

There was no call in her queue at the moment, and I noticed that she had a folder and was typing information from paper documents into the computer. I made note of this and was about to ask her about it when the first call came in. Then I noticed that the caller ID appeared when she took the call. I made note of that too. After the call, I asked her if the caller ID was being captured in the data. She wasn't sure, so I made a note to ask the data team about it.

Trust me, these were big discoveries. First, the caller ID. If you're going to make a career out of fraud, you need to make a bunch of fake identities. You can't use the same identity twice, so you put fake names and addresses on the forms. But wouldn't it be strange if several different claims all had the same caller ID?

Then, remember those paper forms I observed being typed into the computer? They were weather descriptions from the accident report. No one had mentioned them to me. The weather turned out to be very helpful, and it had been missed because it wasn't in the data dictionaries I had seen. If you are faking an accident, you don't want to do it in icy weather. Somebody could get hurt, including the criminals themselves. They aren't trying to work that hard. They just want to pretend to be hurt.

So two extremely helpful variables that were used in the final model came from just a half-hour investment. Think about this: a model like this, one that produces a serious return on investment, might only have two to three dozen variables in the final model. So two top-performing variables is a very big deal. If you spent the same half hour observing each business process in your project and another half hour with a subject matter expert in each department, imagine what you might find. It won't always come easy and most of your tries won't work out, but this is the kind of hard work it takes to acquire the data for a world-class model.
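As a rough illustration of the caller ID idea in the story, here is a minimal sketch of how such a variable might be derived once the caller ID is captured with each claim. Everything in it is an assumption made for illustration: the field names (`claim_id`, `claimant_name`, `caller_id`), the sample records, and the function name are hypothetical, and the transcript does not describe how the variable was actually built.

```python
from collections import defaultdict


def shared_caller_id_counts(claims):
    """Map each claim_id to the number of distinct claimant names
    that called in from the same caller ID (0 if no caller ID)."""
    names_by_caller = defaultdict(set)
    for claim in claims:
        caller = claim.get("caller_id")
        if caller:  # skip claims where no caller ID was captured
            # Normalize names lightly so trivial variations don't count twice
            names_by_caller[caller].add(claim["claimant_name"].strip().lower())
    return {
        claim["claim_id"]: len(names_by_caller.get(claim.get("caller_id"), ()))
        for claim in claims
    }


# Hypothetical sample records, not real data
claims = [
    {"claim_id": "C1", "claimant_name": "Ann Lee", "caller_id": "555-0101"},
    {"claim_id": "C2", "claimant_name": "Bob Ray", "caller_id": "555-0101"},
    {"claim_id": "C3", "claimant_name": "Cy Dunn", "caller_id": "555-0199"},
    {"claim_id": "C4", "claimant_name": "Di Fox", "caller_id": None},
]

counts = shared_caller_id_counts(claims)
# Claims whose caller ID is shared by more than one claimant identity
flagged = sorted(cid for cid, n in counts.items() if n > 1)
print(flagged)
```

The count itself, rather than a yes/no flag, could be passed to a model as a candidate variable. A shared number is not proof of fraud on its own, since households and offices share phones, so it is best treated as one signal among many.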
