Customer Fit: Training and Validation datasets

The Customer Fit model learns from your past prospects and conversions to predict future conversions. It compares the leads and accounts that converted with those that didn't, in order to predict whether a new prospect knocking on your door looks like your ideal customer profile or would instead be a waste of your Sales team's time.

To train this model to recognize good leads versus bad leads, we need to give it a training dataset from which it can learn. Think of it as teaching a baby to differentiate a bear from a dog. You would show them photos of black bears, grizzly bears, polar bears... and Labradors, German shepherds, corgis, etc. The training dataset there is the set of photos, while in your case the training dataset is the set of leads who converted versus the set of leads who didn't convert.

A training dataset is essentially a table with as many observations (rows) as possible and two columns: email which contains the email of the lead, and target with a value of 1 if the lead has converted or 0 otherwise.

Sample: 

email                  target
francis@madkudu.com    1
james@tesla.com        0
john@ibm.com           1
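As an illustration only, the two-column table described above can be sketched in a few lines of Python. The helper names (`build_training_rows`, `to_csv`) are hypothetical and the leads are the sample rows shown above; this is not how MadKudu builds the dataset internally.

```python
import csv
import io

# Sample leads from the table above: (email, converted?) pairs.
leads = [
    ("francis@madkudu.com", True),
    ("james@tesla.com", False),
    ("john@ibm.com", True),
]

def build_training_rows(leads):
    """Return one row per lead with the two columns: email and target (1 or 0)."""
    return [{"email": email, "target": 1 if converted else 0}
            for email, converted in leads]

def to_csv(rows):
    """Serialize the rows as CSV text with an email,target header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["email", "target"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = build_training_rows(leads)
print(to_csv(rows))
```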

This dataset is then enriched with all the available Computations to be able to find common traits (industry, company size, country...) between converters and non-converters. 

A validation dataset is needed to validate the model on a different set of leads than the ones it was trained on, to make sure the model is able to predict conversions. The model gives a score to the leads of the validation dataset, so you can check whether or not it predicted the right score for those leads.
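To make the idea of validation concrete, here is a minimal sketch that compares the observed conversion rate of high-scored and low-scored validation leads. The 0-100 score range, the threshold of 50, and the sample data are assumptions made for illustration; the Data Studio has its own validation reports.

```python
def conversion_rate_by_bucket(scored_leads, threshold=50):
    """Split validation leads at a score threshold and return the observed
    conversion rate on each side.

    scored_leads: iterable of (score, target) pairs, target being 1 or 0.
    The threshold is a hypothetical cut-off, not a MadKudu setting.
    """
    scored_leads = list(scored_leads)
    high = [target for score, target in scored_leads if score >= threshold]
    low = [target for score, target in scored_leads if score < threshold]

    def rate(targets):
        # None when a bucket is empty, to avoid dividing by zero.
        return sum(targets) / len(targets) if targets else None

    return {"high": rate(high), "low": rate(low)}

# Made-up validation results: (score given by the model, actual outcome).
validation = [(90, 1), (80, 1), (70, 0), (30, 0), (20, 0), (10, 1)]
rates = conversion_rate_by_bucket(validation)
print(rates)
```

If the model predicts well, the "high" bucket should convert at a clearly higher rate than the "low" bucket.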

When to load a training dataset?

  • You would like to create a new model from scratch. 

  • You would like to see what your Ideal Customer Profile is on a more recent dataset.
    -> You would load a training dataset either on a duplicate of your live model or on a newly created model.

When NOT to load a training dataset?

  • On a model that is not live but that someone is working on

Why? A Customer Fit tree-based model is built on a training dataset. If you change the training dataset, you lose the model exactly as it was: the model adapts to the new training set. Because of a random component in the building process of the training set, relaunching the upload of a training set with the same parameters would also change the model. So if your co-worker is working on it, it may change what they've built.

How to build and load a training dataset? 

With MadKudu, only a few clicks and no code are needed! You just have to input some parameters and the dataset will be created automatically from your CRM data.

Step 1: Create the training dataset

  1. Go to the Data Studio (studio.madkudu.com or through app.madkudu.com > Predictions > Data Studio)

  2. Create a new model

  3. Click on Import data

  4. Click on Build from your integrations

  5. Configure the following parameters for the training dataset


     

    • The training dataset is built from all the leads created in a specific timeframe. We recommend using a 6-month timeframe that is not too recent. If you include leads created last week, they will not have had time to convert into an opportunity yet and will be labeled as "non converted" (0), so the model will be taught that they are bad leads when they may not be. Try this:

      • Dataset start date: today - 12 months

      • Dataset end date: today - 6 months

      • Use the date picker to select the timeframe included in your training dataset

    • Audience: What is the population of leads you'll want to apply the scoring on? Inbound leads? Outbound leads? North American leads? Select the audience of leads you'd like to focus on, which is configured in the Audience mapping in the app

    • Conversion model: What is the outcome you want the model to predict? Leads who would become any qualified opportunity? Leads who would become paying customers above a certain amount (Enterprise customers)? Use a conversion model name defined in the Conversion mapping in the app.
      This decision depends on what you intend to do with this scoring. For example: if you'd like to use the scoring to route your best leads to your Enterprise sales team, then you'd want the model to flag the leads who look like your very good Enterprise customers. 
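The reason for avoiding a too-recent timeframe can be illustrated with a small sketch. The `label_lead` helper and the dates are hypothetical; the point is only that the same lead gets a different label depending on when the dataset is built.

```python
from datetime import date

def label_lead(converted_on, as_of):
    """Return 1 if the lead had already converted on the `as_of` date, else 0."""
    return 1 if converted_on is not None and converted_on <= as_of else 0

# Hypothetical lead that converts about two months after it was created.
lead = {"created": date(2024, 1, 10), "converted_on": date(2024, 3, 10)}

# Dataset built one week after the lead was created: labeled 0, wrongly "bad".
label_early = label_lead(lead["converted_on"], as_of=date(2024, 1, 17))

# Dataset built six months after the lead was created: labeled 1.
label_late = label_lead(lead["converted_on"], as_of=date(2024, 7, 10))

print(label_early, label_late)
```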

Step 2: Create the validation dataset

To create the validation dataset, you'd want a more recent dataset, but not too recent, for the same reason as for the training dataset: leads need time to convert. We recommend the following dates, which should be adjusted depending on how long it takes your leads to convert into opportunities.

  • Dataset start date: today - 6 months

  • Dataset end date: today - 3 months

  • Use the date picker to select the timeframe included in your validation dataset

When training a model, use the same audience and conversion filters for the validation dataset as for the training dataset, so that the model is validated on the same type of population.
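The recommended timeframes (training: today - 12 months to today - 6 months; validation: today - 6 months to today - 3 months) can be computed with a short sketch like the one below. In the Data Studio you simply pick these dates in the date picker; `months_ago` and `dataset_windows` are hypothetical helpers, and clamping to the last day of a shorter month is an assumption of this sketch.

```python
import calendar
from datetime import date

def months_ago(day, months):
    """Return the date `months` calendar months before `day`.

    If the target month is shorter, the day is clamped to its last day.
    """
    index = day.year * 12 + (day.month - 1) - months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))

def dataset_windows(today):
    """Return (start, end) dates for the training and validation datasets."""
    return {
        "training": (months_ago(today, 12), months_ago(today, 6)),
        "validation": (months_ago(today, 6), months_ago(today, 3)),
    }

windows = dataset_windows(date(2024, 8, 31))
for name, (start, end) in windows.items():
    print(name, start.isoformat(), end.isoformat())
```

The validation window starts where the training window ends, so the model is validated on more recent leads than the ones it learned from.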

Step 3: Launch the loading of the dataset(s)

To launch the loading, click on Build dataset. 
The dataset(s) will be created and enriched from our database, which stores your relevant CRM data and enrichment. This usually takes about 1 hour but, depending on the size of the datasets, can take up to a few hours. You will get a confirmation email once it has finished.

Step 4: Check the size of the datasets loaded 

Once the upload has finished, go back to your model and, in the Overview tab, check the number of records and conversions in the datasets. To build a model that makes statistical sense, the larger the dataset, the better. You need at least 1000 records and 200 conversions; ideally, aim for 2000-3000 records. If you don't have enough records or conversions in your dataset, here is how to proceed:

  • First, take a larger timeframe. You can go back up to the past 18 months. Data older than 18 months might not reflect your current Ideal Customer Profile if you targeted different markets.

  • If this is not enough, the next step is to choose a broader conversion definition. Beware: this means you will be building a different model, since the conversion definition is what the model predicts. We recommend going one step back in the funnel, for example using the Closed Won definition instead of a Closed Won above a certain $ amount definition. You can also create a new custom definition, for example using Qualified Opps with a probability above 20% instead of Closed Won.
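The minimum size rule above (at least 1000 records and 200 conversions) can be expressed as a quick check. The Overview tab already shows these numbers; `check_dataset_size` is a hypothetical helper, shown only to make the thresholds explicit, and the sample dataset is made up.

```python
MIN_RECORDS = 1000      # minimum number of records stated above
MIN_CONVERSIONS = 200   # minimum number of conversions stated above

def check_dataset_size(targets):
    """Check a list of 1/0 targets against the minimum dataset size."""
    records = len(targets)
    conversions = sum(targets)
    problems = []
    if records < MIN_RECORDS:
        problems.append(f"only {records} records, need {MIN_RECORDS}")
    if conversions < MIN_CONVERSIONS:
        problems.append(f"only {conversions} conversions, need {MIN_CONVERSIONS}")
    return {
        "records": records,
        "conversions": conversions,
        "ok": not problems,
        "problems": problems,
    }

# Made-up dataset: enough records, but not enough conversions.
targets = [1] * 150 + [0] * 1050
report = check_dataset_size(targets)
print(report)
```

In this made-up case the next step would be to widen the timeframe or broaden the conversion definition, as described above.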

How to upload a training dataset from a CSV?

You can also upload your own training dataset via CSV, using the following CSV template.

The columns amount, amount_closed_won, and target_closed_won can default to 0 if you don't have the information. 

No transformation or cleaning is applied to the dataset(s) you upload, so make sure there are no null values; otherwise your dataset will be rejected and won't be enriched.
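Since no cleaning is applied on upload, it can help to check the file yourself first. The sketch below is one possible pre-upload check, not part of MadKudu: it assumes the CSV has the columns email, target, amount, amount_closed_won and target_closed_won (check the template for the exact list), fills the three defaultable columns with 0, and reports rows where email or target is empty. The sample rows and amounts are made up.

```python
import csv
import io

# Assumed column names; the actual template may contain others.
REQUIRED = ("email", "target")
DEFAULTABLE = ("amount", "amount_closed_won", "target_closed_won")

def prepare_rows(csv_text):
    """Fill defaultable columns with 0 and separate rows with empty required values.

    Returns (clean_rows, rejected_line_numbers); line 1 is the header.
    """
    clean, rejected = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):
        for column in DEFAULTABLE:
            if not (row.get(column) or "").strip():
                row[column] = "0"
        if any(not (row.get(column) or "").strip() for column in REQUIRED):
            rejected.append(line_number)
        else:
            clean.append(row)
    return clean, rejected

sample = "\n".join([
    "email,target,amount,amount_closed_won,target_closed_won",
    "francis@madkudu.com,1,,,",
    ",0,0,0,0",
    "john@ibm.com,1,5000,5000,1",
])
clean, rejected = prepare_rows(sample)
print(len(clean), "rows ready,", "lines to fix:", rejected)
```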

  1. Duplicate the template

  2. Import your data

  3. Download as .csv

  4. Go to studio.madkudu.com

  5. Create a new model

  6. Click Import data

  7. Click Upload from CSV

  8. If you want to build a full model, upload both a training dataset and a validation dataset.

  9. If you only want to look at the Customer Fit Insights of a dataset, just upload a training dataset, no need for a validation dataset.

  10. Click on Save audience 

It will start enriching your dataset and you should receive a confirmation email. Depending on the size of the datasets, it should take a few minutes (if ~100 records) to a few hours (if >10k records). 

 

F.A.Q

What are the Advanced options parameters for in the training dataset? 

  • This option removes from the training dataset any lead associated with a company that had already converted before the simulated dates, meaning that it removes leads created after the creation date of the opportunity they're attached to. We recommend activating this option when building a model to predict new business opportunities, and disabling it when predicting upsells or expansions.

  • Rebalancing ratio: the ratio between the number of non-converters and the number of converters. A training set is usually created to obtain at least a 20% conversion rate, to avoid a class imbalance problem. The Rebalancing ratio lets you adjust the conversion rate of the training dataset.

    • Ratio of 5 -> 20% conversion rate 

    • Ratio of 10 -> 10% conversion rate

  • Max number of leads per domain: this prevents bias in a dataset. Let's say you get 20 leads from IBM within the timeframe you selected for the training dataset. If IBM has converted, the model would be biased toward giving better scores to companies like IBM. We therefore want to make sure that larger companies, from which it is easier to get many more leads than from a smaller company, don't influence the predictions too much. We recommend keeping the max number of leads per domain between 3 and 5 to reduce the chance of some traits being overrepresented by one domain.
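The last two options can be illustrated with a rough sketch. It rests on two assumptions that are not confirmed by this article: the two examples above (5 -> 20%, 10 -> 10%) are read as a target conversion rate of 1 / ratio, and the cap per domain keeps the first leads encountered. The helper names and data are hypothetical, and MadKudu's actual sampling may differ.

```python
import random
from collections import defaultdict

def cap_leads_per_domain(leads, max_per_domain=3):
    """Keep at most `max_per_domain` leads per email domain, in input order."""
    kept, seen = [], defaultdict(int)
    for email, target in leads:
        domain = email.rsplit("@", 1)[-1].lower()
        if seen[domain] < max_per_domain:
            seen[domain] += 1
            kept.append((email, target))
    return kept

def rebalance(leads, ratio=5, seed=0):
    """Downsample non-converters so that converters are about 1 / ratio of the set.

    Assumption: ratio 5 -> 20% conversion rate, ratio 10 -> 10%.
    """
    converters = [lead for lead in leads if lead[1] == 1]
    non_converters = [lead for lead in leads if lead[1] == 0]
    wanted = len(converters) * (ratio - 1)
    if len(non_converters) > wanted:
        non_converters = random.Random(seed).sample(non_converters, wanted)
    return converters + non_converters

# Made-up leads: 20 converters from one large domain, 10 from other domains,
# and 100 non-converters from 100 different domains.
leads = [(f"user{i}@ibm.com", 1) for i in range(20)]
leads += [(f"buyer@c{i}.com", 1) for i in range(10)]
leads += [(f"visitor@n{i}.com", 0) for i in range(100)]

capped = cap_leads_per_domain(leads, max_per_domain=3)
balanced = rebalance(capped, ratio=5)
rate = sum(target for _, target in balanced) / len(balanced)
print(len(capped), len(balanced), rate)
```

Capping first means the large domain contributes only 3 of the 13 converters left in the dataset, and the rebalancing then keeps enough non-converters to reach the target rate.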

Do you exclude leads coming from our employees' email tests? 

Yes, we exclude leads with the domain of your company from the training and validation dataset to avoid polluting the dataset and creating a bias in the model. 

 

To go further ... 

If you have a data science background, or are just curious about the methodology of creating a training dataset and using boosting and downsampling to fight class imbalance, continue by reading this article.