If you’re reading this, it means you’ve already begun to blaze your trail with AI and Salesforce Einstein. You understand more fully the business value of using machine learning and data to be more predictive in your business. You’ve also probably identified a use case or two from the Big Book of Predictions that could provide immediate value to your business. So now it’s time to build your first predictive model. A predictive model produces predictions that fill in missing information, which can then be used to support an action, recommendation or decision.
Fortunately, Salesforce Einstein provides the tools to make creating predictions easy, but there is still some special domain expertise that you as an Admin, analyst or developer will bring to these models. It’s your job to decide what goes into the model, and what stays out.
You’ve heard the term “garbage in, garbage out” before, right? Well, the same is true with predictive models. If your model is full of bad data, the insights will be bad as well. So let’s review a few key points about what data you should be careful to exclude from your models in order to get the best predictions and insights.
Sensitive Data
At Salesforce, Trust is our number 1 value, and proper handling of sensitive customer data is critical to maintaining that trust. Regulations such as GDPR have been designed to ensure consumers’ data is protected, but regulations alone aren’t enough. Everyone has a role to play in protecting customer data, including those in charge of building and maintaining predictive models inside the CRM.
When building any predictive model, you’ll want to be careful to exclude sensitive data. So what qualifies as sensitive data? It depends on your business, but in general you’ll want to exclude things like: government-issued identification numbers, financial information (such as credit or debit card numbers, any related security codes or passwords, and bank account numbers), racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership, information concerning health or sex life, information related to an individual’s physical or mental health, information related to the provision or payment of health care, and any other data that is deemed too sensitive to include in a predictive model.
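One way to make this exclusion routine is to keep a blocklist of sensitive field names and strip those fields from every record before it reaches the model. The snippet below is a minimal sketch in plain Python, not an Einstein feature: the field names (`SSN__c`, `Religion__c` and so on) and the `drop_sensitive_fields` helper are made up for illustration, and your own list will depend on your business.

```python
# Hypothetical field names for illustration; replace with the fields in your own org.
SENSITIVE_FIELDS = {
    "SSN__c",
    "Credit_Card_Number__c",
    "Religion__c",
    "Health_Condition__c",
}


def drop_sensitive_fields(records, sensitive_fields=SENSITIVE_FIELDS):
    """Return copies of the records without the blocklisted fields.

    Field names are compared case-insensitively.
    """
    blocked = {name.lower() for name in sensitive_fields}
    return [
        {field: value for field, value in record.items() if field.lower() not in blocked}
        for record in records
    ]


# Made-up sample data.
leads = [
    {"Id": "001", "Industry": "Retail", "SSN__c": "000-00-0000", "Religion__c": "n/a"},
    {"Id": "002", "Industry": "Tech", "Credit_Card_Number__c": "0000"},
]

training_rows = drop_sensitive_fields(leads)
print(training_rows)
```

Keeping the blocklist in one place means everyone who builds a model starts from the same definition of “too sensitive”.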
Legacy or Stale Data
If you’re lucky, you have a governance process in place that defines all the fields tracked inside Salesforce and denotes whether or not each one holds legacy data. This governance process would clearly state the update frequency and the shelf life of the data. But if you’re like most people, that governance process is still on a wishlist somewhere (whomp, whomp). That means you’re going to have to use other means to identify which data is legacy or stale, because when it comes to creating predictions, stale data is bad data. And you don’t want bad, stale data from the past powering your predictions about the future.
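If you don’t have a governance process to lean on, one possible “other means” is to look at when each field was last filled in. The sketch below assumes you have exported records as Python dictionaries with a `CreatedDate` value, and it flags any field whose most recent non-blank value sits on a record older than a cutoff. The `find_stale_fields` helper, the 365-day threshold, and the `Fax_Campaign__c` field are all assumptions for illustration; pick a shelf life that matches your own data.

```python
from datetime import date, timedelta


def find_stale_fields(records, today, max_age_days=365, date_field="CreatedDate"):
    """Return fields whose newest non-blank value is older than max_age_days.

    Fields that are never populated are reported as stale too.
    """
    cutoff = today - timedelta(days=max_age_days)
    last_seen = {}  # field name -> newest record date with a non-blank value
    for record in records:
        record_date = record[date_field]
        for field, value in record.items():
            if field == date_field:
                continue
            last_seen.setdefault(field, None)
            if value not in (None, ""):
                if last_seen[field] is None or record_date > last_seen[field]:
                    last_seen[field] = record_date
    return sorted(
        field for field, seen in last_seen.items() if seen is None or seen < cutoff
    )


# Made-up sample data: the fax campaign field has not been filled in since 2015.
records = [
    {"CreatedDate": date(2015, 3, 1), "Industry": "Retail", "Fax_Campaign__c": "Spring"},
    {"CreatedDate": date(2018, 5, 1), "Industry": "Tech", "Fax_Campaign__c": ""},
    {"CreatedDate": date(2018, 6, 1), "Industry": "Retail", "Fax_Campaign__c": None},
]

stale = find_stale_fields(records, today=date(2018, 7, 1))
print(stale)
```

A field that shows up here isn’t automatically bad, but it is a good candidate to review with the people who own that data before you include it in a model.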
Leakers
Okay, this one is less intuitive but equally important, so read this carefully: a “leaker”, a problem often referred to as data leakage, happens when you train your model on a dataset that includes information that would not be available at the time of prediction. In other words, a leaker is a piece of data that only shows up after the question you are trying to answer has already been answered. This can be tricky because, on paper, leakers look like highly correlated, really good predictive signals. But they produce unrealistically accurate predictions that won’t hold up on new records, where the leaked information doesn’t exist yet. It’s like bringing an answer sheet into an exam. To avoid leakers, remove any fields or field values from your model that would not be known at the time of the prediction.
The good news is that when you are the domain expert, it can be a lot easier for you to figure out what data is relevant to the process you are trying to predict than it is for someone who isn’t as close to the business. To do this, consider constructing a timeline for the process you are trying to predict, and mark any data that gets populated after the process is completed as a leaker. For example, say you want to build a prediction for lead conversion, and you know that after a lead is converted the “converted timestamp” is set, but before conversion the field is always blank. That means you would just exclude the converted timestamp from your model, and you won’t have to worry about it leaking into your predictive model and corrupting your good predictions.
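The “always blank before, always set after” pattern from the lead conversion example can also be checked against historical data. The sketch below is a rough heuristic in plain Python, not part of Einstein: it flags any field that is filled in on every converted lead and blank on every unconverted one. The `find_leaker_candidates` helper and the sample records are made up for illustration, and a field that passes this check can still be a leaker, so your timeline remains the final word.

```python
def find_leaker_candidates(records, outcome_field):
    """Flag fields that are filled on every positive record and blank on every negative one."""

    def is_blank(value):
        return value is None or value == ""

    positives = [r for r in records if r[outcome_field]]
    negatives = [r for r in records if not r[outcome_field]]
    if not positives or not negatives:
        return []  # both outcomes are needed to make the comparison

    fields = {field for record in records for field in record} - {outcome_field}
    return sorted(
        field
        for field in fields
        if all(not is_blank(r.get(field)) for r in positives)
        and all(is_blank(r.get(field)) for r in negatives)
    )


# Made-up sample data: ConvertedDate is only ever set once a lead has converted.
leads = [
    {"IsConverted": True, "LeadSource": "Web", "ConvertedDate": "2018-04-02"},
    {"IsConverted": False, "LeadSource": "Event", "ConvertedDate": None},
    {"IsConverted": True, "LeadSource": "", "ConvertedDate": "2018-05-10"},
    {"IsConverted": False, "LeadSource": "Web", "ConvertedDate": None},
]

leaker_candidates = find_leaker_candidates(leads, outcome_field="IsConverted")
print(leaker_candidates)
```

Here the converted timestamp is flagged while the lead source is not, because lead source is known (or unknown) regardless of whether the lead eventually converts.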
Legacy, sensitive data and leakers, oh my! Whew, that was a lot! But guess what? It was totally worth it because now you are a lot more prepared to blaze your trail with Einstein. So let’s get cracking! Check out trailhead.einstein.com for a whole host of trails to help you get started.
And be sure to check out the #BeAnInnovator adventure to build your own AI-powered app with Salesforce Einstein! You can read all about how to get involved right here.