Amazon S3 - Frequently Asked Questions

You and your team may have questions about how to set up an Amazon S3 integration with MadKudu, the data format, the security controls, etc. If you don't find answers to your questions here, please reach out to product@madkudu.com and we will be happy to assist.

You can find the main documentation on the S3 integration here:

- How to format data and files in the S3 bucket
- How to create an S3 bucket and give MadKudu access 

 

Amazon S3 is a storage service that MadKudu uses to transfer data from data warehouses like Redshift, or from systems with which we don't have a direct integration.

You transfer data to an S3 bucket, from which MadKudu pulls the data to use for scoring.
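As a minimal sketch of that transfer (the bucket name, key prefix, and file name below are placeholders, not values prescribed by MadKudu), uploading an export file with boto3 could look like this:

```python
# Minimal sketch: upload one export file to the S3 bucket MadKudu reads from.
# Bucket name, key prefix, and file name are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")  # uses your local AWS credentials

s3.upload_file(
    Filename="events_2024-01-15.csv",       # local export from your source or warehouse
    Bucket="your-company-madkudu-exports",  # the bucket you host and share with MadKudu
    Key="events/events_2024-01-15.csv",     # key MadKudu will read
)
```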


 

You'll need to host the S3 bucket on your side per our security policy and grant MadKudu access to your S3 bucket with an IAM role (see How to create an S3 bucket and give MadKudu access).
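As a rough sketch of what that role looks like (the MadKudu AWS account ID, external ID, bucket name, and role name below are placeholders; use the exact values from the setup article above), the role and its read-only policy can be created with boto3:

```python
# Sketch only: create an IAM role MadKudu can assume to read the bucket.
# The account ID, external ID, bucket, and role names are placeholders; take
# the real values from "How to create an S3 bucket and give MadKudu access".
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111111111111:root"},  # MadKudu's AWS account (placeholder)
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "YOUR-EXTERNAL-ID"}},
    }],
}

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::your-company-madkudu-exports",
            "arn:aws:s3:::your-company-madkudu-exports/*",
        ],
    }],
}

iam.create_role(RoleName="madkudu-s3-read",
                AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.put_role_policy(RoleName="madkudu-s3-read",
                    PolicyName="madkudu-s3-read-only",
                    PolicyDocument=json.dumps(read_only_policy))
```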

 

You will need an AWS account and a quick favor from someone (very often a data engineer ;) ) who can stream data from your source to the S3 bucket.

 

MadKudu can only ingest events from your S3 bucket. We plan to add support for contact and account attributes in the future.

Which events specifically? You can refer to this article: What type of events can be used in a behavioral segmentation.

Yes, we recommend following the instructions in How to format data and files in the S3 bucket when formatting the data sent from your system or data warehouse to the S3 bucket.

 

We need 9 months of historical data to train the predictive models, plus fresh data every ~4 to 12 hours.

The data should be uploaded as JSON or CSV files in the S3 bucket:

  • either each file contains only the most recent data, in which case load each file separately,

  • or the file contains all the data, including the most recent, in which case you can replace the existing file at each upload.

The idea is to have both fresh and historical data in the bucket at all times, not just the most recent. Both approaches are sketched below.
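For illustration only (file names, prefixes, and the bucket name are hypothetical), the two upload strategies could look like this with boto3:

```python
# Sketch of the two upload strategies; names are placeholders.
import datetime
import boto3

s3 = boto3.client("s3")
bucket = "your-company-madkudu-exports"

# Strategy 1: incremental files - each upload adds a new file containing only recent data.
today = datetime.date.today().isoformat()
s3.upload_file(f"events_{today}.json", bucket, f"events/events_{today}.json")

# Strategy 2: full refresh - one file holds all data (historical + recent)
# and is replaced at every upload.
s3.upload_file("events_full.json", bucket, "events/events_full.json")
```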

 

MadKudu's scoring updates every 4-12 hours, so the data should be uploaded every 4-12 hours.

We highly recommend compressing files to speed up transfers. 
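For instance, gzip-compressing a CSV export before uploading it (file and bucket names are placeholders) could look like this:

```python
# Sketch: gzip a CSV export before uploading it, to speed up the transfer.
import gzip
import shutil
import boto3

with open("events_2024-01-15.csv", "rb") as src, gzip.open("events_2024-01-15.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

boto3.client("s3").upload_file(
    "events_2024-01-15.csv.gz",
    "your-company-madkudu-exports",
    "events/events_2024-01-15.csv.gz",
)
```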

If you plan to send data in the range of multiple billions of records per month, we may look at trimming that number down to only the most relevant data, in order to make the overall process of pulling, scoring, and pushing scores back to you faster.

 

Yes, MadKudu will need 9 months of historical data to train the predictive models.  

 

If sending events:

  • 1 record = 1 event with a timestamp and a user id/email (see the sketch after this list)

  • The data should be sent as events with timestamps, with the oldest timestamp 9 months back.
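As a sketch of that shape (field names are illustrative; the formatting article above is the reference), one event per line in a JSON file could be written like this:

```python
# Sketch: write events as one JSON record per line, each with a timestamp
# and a user identifier (email). Field names here are illustrative only.
import json

events = [
    {"event": "signed_up",      "email": "jane@acme.com", "timestamp": "2023-05-02T14:31:00Z"},
    {"event": "created_report", "email": "jane@acme.com", "timestamp": "2023-05-03T09:12:00Z"},
]

with open("events_full.json", "w") as f:
    for e in events:
        f.write(json.dumps(e) + "\n")
```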

 

Yes, at least once a day is the recommended frequency to provide MadKudu with fresh data. Historical data only needs to be loaded once, covering a fixed timeframe of 9 months.

For specific use cases, the data only needs to be loaded once, for example a historical data dump covering a fixed timeframe. In other scenarios, we enable a continuous live sync to fetch new data as soon as it arrives.

 

From BigQuery, yes! See our BigQuery integration article.

MadKudu does not integrate directly with GCS, but you can stream data from GCS to S3 using gsutil.
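If you prefer a script over gsutil, a rough sketch using the google-cloud-storage and boto3 libraries instead (bucket names are placeholders, and your GCP and AWS credentials are assumed to be configured locally) could copy the objects like this:

```python
# Rough sketch: copy every object from a GCS bucket to the S3 bucket using
# google-cloud-storage and boto3 instead of gsutil. Bucket names are placeholders.
import boto3
from google.cloud import storage

gcs = storage.Client()
s3 = boto3.client("s3")

for blob in gcs.list_blobs("your-gcs-export-bucket"):
    s3.put_object(
        Bucket="your-company-madkudu-exports",
        Key=blob.name,
        Body=blob.download_as_bytes(),
    )
```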

 

Yes! See our S3 integration article to set it up as a destination.