I have been working with a large financial services company on a “big data” project. Let me qualify the term “big data” for the purpose of this discussion. We are talking about dealing directly with about 10 million records and indirectly with about 100 million records. The direct records (approximately 10 million) are those that are of interest to the organization or group I work with. The indirect records (approximately 100 million) are the entire universe of available records that affect our subset in a meaningful manner but are not directly relevant to our needs.
I understand that many readers will consider this volume of data to be puny within a “big data” framework. Regardless, the data set is large enough to be included in, and relevant to, issues other “big data” implementations might face.
Data usage in the financial services industry presents very interesting and unique legal and regulatory challenges. For example, making a wrong identification of a customer and exposing bank account details to the wrong person is embarrassing at best (if you are really lucky). More likely, it will result in a huge lawsuit accompanied by tons of extremely damaging negative publicity. This is before the regulators start investigating you and slapping fines all over the place.
There is no such downside if a search engine mistakenly thinks I am a 16-year-old girl from Mumbai, India and shows me banner ads for the latest Bollywood blockbuster playing in a local theater. There is no real harm done apart from a lost click-through opportunity. In case you are wondering, I am not a girl. I live about 10,000 miles from Mumbai. And most importantly, I am a few decades removed from when I was 16.
Yet, at a fundamental level, they are both trying to do the same thing using big data – find the right customer or user so that they can share meaningful data with them. It is entirely possible that they are using similar technologies, techniques and algorithms to determine the identity of the person who is interacting with them. What differs is the potential downside of a false identification. In regulated industries like financial services and health care, that downside makes working with large data sets orders of magnitude more complex than the technical challenges alone would suggest.
The real-world problems we deal with as we attempt to identify the ONE right user from a list of 100 million potential candidates are as follows:
- Bad Data – misspelled names, improper addresses, bogus phone numbers, “Mickey Mouse” email addresses and a whole host of similar data entry errors (inadvertent and intentional)
- Incomplete Data – missing contact information, phone numbers, email addresses and so on
- Duplicate Entries – multiple records referring to the same person or entity
- Related Entries – seemingly duplicate entries but valid because the context is different. For example, a person who is a purchasing agent for a company and also a direct customer of the firm.
None of the above is hugely problematic for a search engine attempting to establish the identity of a person. It can afford to get a small percentage of its identifications wrong with no meaningful impact on its bottom line, reputation or business. However, the industries I am talking about have to get them right every single time.
So, how did we solve the problem? Without getting into the details of all the fuzzy logic and probabilistic algorithms we use to make an identification, the short answer is that we use human beings as the arbiters of last resort. This is not going to be a revelation to readers who work in regulated or highly sensitive environments. But it might sound practically medieval to “big data” mavens from other environments.
Simply put, if we are not absolutely sure who we are dealing with, we ask a human actor to intervene and make the right choice. Will they always make the right decision? No. But humans are still way better at fuzzy logic than the technologies and algorithms we use today. That may change in the future but for now there is still no substitute for a human.
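To make the escalation rule concrete, here is a minimal Python sketch. It is not our production logic. The names `Candidate` and `resolve`, and the thresholds `auto_accept=0.98` and `min_margin=0.10`, are assumptions invented for illustration. The idea is simply that the machine accepts a match on its own only when the best candidate scores very high and is clearly ahead of the runner-up. Everything else goes to a person.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    record_id: str
    score: float  # match confidence in [0, 1] from some upstream matcher (assumed)


def resolve(candidates, auto_accept=0.98, min_margin=0.10):
    """Decide whether to accept a match automatically or escalate to a human.

    Returns ("match", record_id), ("human_review", [record_ids]) or
    ("no_match", None). Both thresholds are illustrative values only.
    """
    if not candidates:
        return ("no_match", None)

    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    best = ranked[0]
    runner_up_score = ranked[1].score if len(ranked) > 1 else 0.0

    # Accept automatically only when the top candidate is both very strong
    # and well separated from the next best one.
    if best.score >= auto_accept and best.score - runner_up_score >= min_margin:
        return ("match", best.record_id)

    # Otherwise a person picks from the ranked shortlist.
    return ("human_review", [c.record_id for c in ranked])
```

In a regulated setting you would likely set the thresholds conservatively, so that the cost of an extra human review is traded against the much larger cost of a wrong identification.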
Machines are great at sifting through 100 million records to get you a list of 5, 10 or even 100 possible matches in microseconds. That would be an impossible task for any human. But when only subtle differences separate the records in a result set, we humans are still way better than any algorithm out there today at picking the ONE out of 5 or 10.
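The machine half of that division of labor can be sketched in a few lines of Python. This is a toy, using the standard library's `difflib.SequenceMatcher` as a stand-in for whatever fuzzy matching a real system uses. The function names (`normalize`, `similarity`, `shortlist`) and the record fields are assumptions made up for this example. A linear scan like this would also not scale to 100 million records, where you would need some form of indexing or blocking first.

```python
import heapq
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(value.lower().split())


def similarity(query, record):
    """Average string similarity over the fields present in both query and record."""
    fields = [f for f in query if query[f] and record.get(f)]
    if not fields:
        return 0.0
    total = sum(
        SequenceMatcher(None, normalize(query[f]), normalize(record[f])).ratio()
        for f in fields
    )
    return total / len(fields)


def shortlist(query, records, k=5):
    """Return the k best (score, record) pairs, best first, for a human to review."""
    scored = ((similarity(query, r), r) for r in records)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Hypothetical records, with the kinds of flaws described earlier.
records = [
    {"id": 1, "name": "John Smith", "email": "jsmith@example.com"},
    {"id": 2, "name": "Jane Doe", "email": "jane@example.com"},
    {"id": 3, "name": "jon  smith"},  # misspelled name, missing email
]
result = shortlist({"name": "Jon Smith", "email": ""}, records, k=2)
```

Here the shortlist contains records 3 and 1, in that order. Whether they are one person entered twice or two different people is exactly the call we hand to a human.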
This is why a search engine mistakes me for a 16-year-old girl from Mumbai while the combination of machines and humans knows who I really am, when it really matters, at our financial services firm.