Segment - How to send historical data to MadKudu via S3

Prerequisites

When you connect Segment to MadKudu, Segment starts sending data to MadKudu from the day of the connection onward. To train and monitor models, MadKudu needs at least 9 months of historical data as a one-off drop.

If you are on a Business plan or higher with Segment, please refer to this article to start an automatic "replay". Otherwise, stay here: we'll explain how to send the historical data through Amazon S3!

 

To export data from a non-Segment Redshift warehouse, please follow these instructions.

Depending on the volume of data to transfer, the transfer may take anywhere from a few hours to a few days or weeks (for more than 500M records) because of transfer rate limits. We recommend sending only the Segment events necessary to configure MadKudu; otherwise, your implementation will be delayed until we receive the full history of data. Please refer to this documentation or consult our implementation team at success@madkudu.com to understand which events are relevant (and which are not) for configuring your scoring.
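
If your Segment data is already in a Redshift warehouse (see Option 2 below), a quick way to gauge how much history you have and how many rows you would be exporting is to run a count on the tracks table first. The sketch below assumes the standard Segment Redshift schema; [source] is your Segment source schema and the event names in the WHERE clause are hypothetical placeholders for the events relevant to your scoring:

SELECT MIN(sent_at) AS earliest_event,
       MAX(sent_at) AS latest_event,
       COUNT(*)     AS rows_to_export
FROM [source].tracks
WHERE event IN ('signed_up', 'viewed_pricing_page') -- replace with your relevant events
  AND sent_at >= DATEADD(month, -9, CURRENT_DATE);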

How to transfer the data 

There are two options to give us access to your historical Segment data:

(If you use another type of Segment warehouse, such as Postgres or BigQuery, you can still integrate with us. Please follow the instructions to export data to S3.)

If you did not enable any of those destinations before integrating with MadKudu, we might not be able to get your historical data immediately. Make sure to connect Segment to MadKudu so we can start accumulating behavioral data.

Option 1: Segment S3 destination

If you've been sending Segment data to an S3 destination in your AWS account, the easiest solution is to grant us access to that bucket. Please follow the instructions on how to share access to an S3 bucket here.

Option 2: Segment Redshift warehouse

If you're using a Redshift warehouse but not an S3 destination, you will first need to "unload" the data from Redshift into an S3 bucket, then grant MadKudu access to that bucket by following these instructions.

The idea is to use an UNLOAD statement in Redshift to transfer the data to S3, from where we can read it.

If you're using your own bucket, use the following UNLOAD template, where the credentials are the ones you maintain internally.

UNLOAD ("SELECT ... FROM [your_table]")
TO 's3://your_bucket/your_folder/project/track.csv'
CREDENTIALS 'aws_access_key_id=[your_access_key];aws_secret_access_key=[your_secret_key]'
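
If you're not sure what to put in [your_table], you can first list the tables Segment created in your warehouse. This is a minimal sketch assuming the standard Segment Redshift layout, where [source] is the name of your Segment source schema:

SELECT table_name
FROM information_schema.tables
WHERE table_schema = '[source]'
ORDER BY table_name;

You should typically see the tracks, identifies, groups, and pages tables, plus one table per custom event.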

 

Object: Track

For unloading Track data, there are two options:

  • If the event properties are not relevant to share with MadKudu, just UNLOAD the tracks table from Redshift to S3:

UNLOAD ("SELECT id, anonymous_id, user_id, event, event_text, sent_at FROM [source].tracks")
TO 's3://madkudu-data-in-[your_org_id]/[source]/track.csv'
CREDENTIALS 'aws_access_key_id=[your_access_key];aws_secret_access_key=[your_secret_key]'
HEADER
DELIMITER '~'
ADDQUOTES
ESCAPE
MANIFEST
GZIP
ALLOWOVERWRITE
PARALLEL ON
;
  • If the event properties are relevant (path, url, ...), you'll need to do an UNLOAD per event table:

UNLOAD ("SELECT id, anonymous_id, user_id, event, event_text, sent_at, [properties] FROM [source].[event]")
TO 's3://madkudu-data-in-[your_org_id]/[source]/[event].csv'
CREDENTIALS 'aws_access_key_id=[your_access_key];aws_secret_access_key=[your_secret_key]'
HEADER
DELIMITER '~'
ADDQUOTES
ESCAPE
MANIFEST
GZIP
ALLOWOVERWRITE
PARALLEL ON
;

where [properties] is the list of properties you want to send us, separated by commas.
For example: SELECT id, ..., product_name, referrer FROM [source].added_to_cart
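
To make this concrete, here is a fully worked per-event UNLOAD for the hypothetical added_to_cart event above, sharing product_name and referrer as properties (the bucket, source, and credential placeholders follow the same conventions as the templates in this article):

UNLOAD ('SELECT id, anonymous_id, user_id, event, event_text, sent_at, product_name, referrer FROM [source].added_to_cart')
TO 's3://madkudu-data-in-[your_org_id]/[source]/added_to_cart.csv'
CREDENTIALS 'aws_access_key_id=[your_access_key];aws_secret_access_key=[your_secret_key]'
HEADER
DELIMITER '~'
ADDQUOTES
ESCAPE
MANIFEST
GZIP
ALLOWOVERWRITE
PARALLEL ON
;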

 

Object: Identify

Use the following UNLOAD statement to send Identify data from Redshift to S3, where [traits] is the list of traits you want to send us, separated by commas.

UNLOAD ("SELECT id, anonymous_id, user_id, sent_at, [traits] FROM <source>.identifies")
TO 's3://madkudu-data-in-[your_org_id]/[source]/identify.csv'
CREDENTIALS 'aws_access_key_id=[your_access_key];aws_secret_access_key=[your_secret_key]'
HEADER
DELIMITER '~'
ADDQUOTES
ESCAPE
MANIFEST
GZIP
ALLOWOVERWRITE
PARALLEL ON
;
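
For example, if the traits you want to share are email, created_at, and plan (hypothetical column names; use whatever traits exist in your identifies table), the SELECT inside the UNLOAD would be: SELECT id, anonymous_id, user_id, sent_at, email, created_at, plan FROM [source].identifies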

 

Object: Group

Use the following UNLOAD statement to send Group data from Redshift to S3, where [traits] is the list of traits you want to send us, separated by commas.

UNLOAD ("SELECT id, anonymous_id, user_id, group_id, sent_at, [traits] FROM <source>.groups")
TO 's3://madkudu-data-in-[your_org_id]/[source]/page.csv'
CREDENTIALS 'aws_access_key_id=[your_access_key];aws_secret_access_key=[your_secret_key]'
HEADER
DELIMITER '~'
ADDQUOTES
ESCAPE
MANIFEST
GZIP
ALLOWOVERWRITE
PARALLEL ON
;
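
For example, if the group traits you want to share are name, industry, and employees (hypothetical column names; use whatever traits exist in your groups table), the SELECT inside the UNLOAD would be: SELECT id, anonymous_id, user_id, group_id, sent_at, name, industry, employees FROM [source].groups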

 

Object: Page [optional]

Pages can be quite voluminous. Only send them to us if your page tracking contains important information that is not included in track calls. For [properties], we usually recommend at least title and url.

UNLOAD ("SELECT id, anonymous_id, user_id, group_id, sent_at, [properties] FROM <source>.<event>")
TO 's3://madkudu-data-in-[your_org_id]/[source]/identify.csv'
CREDENTIALS 'aws_access_key_id=[your_access_key];aws_secret_access_key=[your_secret_key]'
HEADER
DELIMITER '~'
ADDQUOTES
ESCAPE
MANIFEST
GZIP
ALLOWOVERWRITE
PARALLEL ON
;