Amazon S3 - Frequently Asked Questions

You and your team may have questions about how to set up an Amazon S3 integration with MadKudu, the data format, the security controls, etc. If you don't find answers to your questions here, please reach out to product@madkudu.com and we will be happy to assist.

You can find the main documentation on the S3 integration here:

- How to format data and files in the S3 bucket
- How to create an S3 bucket and give MadKudu access 

 

Amazon S3 is a storage service that MadKudu uses to transfer data from data warehouses like Redshift, or from systems with which we don't have a direct integration.

You transfer data to an S3 bucket, from which MadKudu pulls the data to use for scoring.
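As a minimal sketch of that transfer (the bucket name, key prefix, and file name below are placeholders, not values prescribed by MadKudu), uploading an export file with boto3 could look like this:

```python
# Minimal sketch: upload one export file to the S3 bucket MadKudu reads from.
# Bucket name, key prefix, and file name are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")  # uses your local AWS credentials

s3.upload_file(
    Filename="events_2024-01-15.csv",       # local export from your source or warehouse
    Bucket="your-company-madkudu-exports",  # the bucket you host and share with MadKudu
    Key="events/events_2024-01-15.csv",     # key MadKudu will read
)
```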


 

You'll need to host the S3 bucket on your side per our security policy and grant MadKudu access to your S3 bucket with an IAM role (see How to create an S3 bucket and give MadKudu access).
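As a rough sketch of what that role looks like (the MadKudu AWS account ID, external ID, bucket name, and role name below are placeholders; use the exact values from the setup article above), the role and its read-only policy can be created with boto3:

```python
# Sketch only: create an IAM role MadKudu can assume to read the bucket.
# The account ID, external ID, bucket, and role names are placeholders; take
# the real values from "How to create an S3 bucket and give MadKudu access".
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111111111111:root"},  # MadKudu's AWS account (placeholder)
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "YOUR-EXTERNAL-ID"}},
    }],
}

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::your-company-madkudu-exports",
            "arn:aws:s3:::your-company-madkudu-exports/*",
        ],
    }],
}

iam.create_role(RoleName="madkudu-s3-read",
                AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.put_role_policy(RoleName="madkudu-s3-read",
                    PolicyName="madkudu-s3-read-only",
                    PolicyDocument=json.dumps(read_only_policy))
```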

 

You will need an AWS account and a quick favor from someone (very often a data engineer ;) ) who can stream data from your source to the S3 bucket.

 

MadKudu can only ingest events from your S3 bucket. We plan to add support for contact and account attributes in the future.

Which events specifically? You can refer to this article: What type of events can be used in a behavioral segmentation.

Yes, we recommend following the instructions in How to format data and files in the S3 bucket when formatting the data sent from your system or data warehouse to the S3 bucket.

 

We need 9 months of historical data to train the predictive models, plus fresh data every ~4 to 12 hours.

The data should be uploaded as JSON or CSV files in the S3 bucket:

  • either each file contains only the most recent data, in which case load each file separately,

  • or the file contains all the data, including the most recent, in which case you can replace the existing file at each upload.

The idea is to have both fresh and historical data in the bucket at all times, not just the most recent. Both approaches are sketched below.
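For illustration only (file names, prefixes, and the bucket name are hypothetical), the two upload strategies could look like this with boto3:

```python
# Sketch of the two upload strategies; names are placeholders.
import datetime
import boto3

s3 = boto3.client("s3")
bucket = "your-company-madkudu-exports"

# Strategy 1: incremental files - each upload adds a new file containing only recent data.
today = datetime.date.today().isoformat()
s3.upload_file(f"events_{today}.json", bucket, f"events/events_{today}.json")

# Strategy 2: full refresh - one file holds all data (historical + recent)
# and is replaced at every upload.
s3.upload_file("events_full.json", bucket, "events/events_full.json")
```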

 

MadKudu's scoring updates every 4-12 hours, so the data should be uploaded every 4-12 hours.

We highly recommend compressing files to speed up transfers. 
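For instance, gzip-compressing a CSV export before uploading it (file and bucket names are placeholders) could look like this:

```python
# Sketch: gzip a CSV export before uploading it, to speed up the transfer.
import gzip
import shutil
import boto3

with open("events_2024-01-15.csv", "rb") as src, gzip.open("events_2024-01-15.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

boto3.client("s3").upload_file(
    "events_2024-01-15.csv.gz",
    "your-company-madkudu-exports",
    "events/events_2024-01-15.csv.gz",
)
```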

If you plan to send data in the range of multiple billions of records per month, we may look at trimming that number down to only the most relevant data, in order to make the overall process of pulling, scoring, and pushing scores back to you faster.

 

Yes, MadKudu will need 9 months of historical data to train the predictive models.  

 

If sending events:

  • 1 record = 1 event with a timestamp and a user id/email (see the sketch after this list)

  • The data should be sent as events with timestamps, with the oldest timestamp 9 months back.
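As a sketch of that shape (field names are illustrative; the formatting article above is the reference), one event per line in a JSON file could be written like this:

```python
# Sketch: write events as one JSON record per line, each with a timestamp
# and a user identifier (email). Field names here are illustrative only.
import json

events = [
    {"event": "signed_up",      "email": "jane@acme.com", "timestamp": "2023-05-02T14:31:00Z"},
    {"event": "created_report", "email": "jane@acme.com", "timestamp": "2023-05-03T09:12:00Z"},
]

with open("events_full.json", "w") as f:
    for e in events:
        f.write(json.dumps(e) + "\n")
```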

 

Yes, at least once a day is the recommended frequency to provide MadKudu with fresh data. Historical data only needs to be loaded once, covering a fixed timeframe of 9 months.

For specific use cases, the data only needs to be loaded once, for example a historical data dump covering a fixed timeframe. In other scenarios, we enable a continuous live sync to fetch new data as soon as it arrives.

 

From BigQuery, yes! See our BigQuery integration article.

MadKudu does not integrate directly with GCS, but you can stream data from GCS to S3 using gsutil.
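If you prefer a script over gsutil, a rough sketch using the google-cloud-storage and boto3 libraries instead (bucket names are placeholders, and your GCP and AWS credentials are assumed to be configured locally) could copy the objects like this:

```python
# Rough sketch: copy every object from a GCS bucket to the S3 bucket using
# google-cloud-storage and boto3 instead of gsutil. Bucket names are placeholders.
import boto3
from google.cloud import storage

gcs = storage.Client()
s3 = boto3.client("s3")

for blob in gcs.list_blobs("your-gcs-export-bucket"):
    s3.put_object(
        Bucket="your-company-madkudu-exports",
        Key=blob.name,
        Body=blob.download_as_bytes(),
    )
```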

 

Yes! See our S3 integration article to set it up as a destination.