How does MadKudu's spam detector work?

Before we start...

An email address is composed of a local part and a domain: [local part]@[domain].
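To make the later rules easier to follow, here is a minimal sketch of splitting an address into its two parts. The function name `split_email` is hypothetical, and lowercasing and splitting at the last "@" are assumptions, not a description of MadKudu's actual code.

```python
def split_email(email: str) -> tuple[str, str]:
    """Return (local part, domain), splitting at the last '@'."""
    # Assumption: comparisons are case-insensitive, so normalize to lowercase.
    local, _, domain = email.strip().lower().rpartition("@")
    return local, domain


local, domain = split_email("Logan@A.com")
print(local, domain)
```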

 

Filters applying to both local and domain parts of the email

Email contains "test"

If the string "test" appears almost anywhere in the email, in either the local part or the domain, the email is flagged as spam.

 

Some characters are repeated many times

If any character is repeated at least 4 times consecutively

OR

If any pair of letters is repeated at least 4 times (tetetete@gmail.com)
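The two repetition rules above can be sketched with regular expressions. This is an illustration only: the function name is hypothetical, and it assumes the pair repetitions must be back to back, as in the tetetete example.

```python
import re

# The same character 4 or more times in a row, e.g. "aaaa".
SAME_CHAR_RUN = re.compile(r"(.)\1{3,}")
# The same pair of letters 4 or more times in a row, e.g. "tetetete".
LETTER_PAIR_RUN = re.compile(r"([a-z]{2})\1{3,}")


def has_repeated_characters(email: str) -> bool:
    text = email.lower()
    return bool(SAME_CHAR_RUN.search(text) or LETTER_PAIR_RUN.search(text))
```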

 

Spammy patterns detected!

If the email is long enough and its most common characters account for more than 70% of all its characters
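A rough sketch of this kind of check follows. Only the 70% figure comes from the rule above. The minimum length (`min_length=8`), the number of "most common" characters counted (`top_n=2`), and the choice to count only letters and digits are all assumptions made for illustration.

```python
from collections import Counter


def has_spammy_pattern(email: str, top_n: int = 2, min_length: int = 8,
                       threshold: float = 0.70) -> bool:
    # Assumption: only letters and digits are counted ("@" and "." are ignored).
    chars = [c for c in email.lower() if c.isalnum()]
    if len(chars) < min_length:  # hypothetical meaning of "long enough"
        return False
    # Total occurrences of the top_n most frequent characters.
    top = sum(count for _, count in Counter(chars).most_common(top_n))
    return top / len(chars) > threshold
```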

 

Domain or local part contains blacklisted word or phrase

If the local part or the domain contains any phrase from an in-house list of blacklisted words/phrases (ex: "noemail" or "nothing")
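A blacklist check of this kind can be sketched as a substring search. The list below holds only the two examples quoted in the rule, since the real in-house list is not public. Following the filter's name, the sketch checks both the local part and the domain, and the function name is hypothetical.

```python
# Only the two examples quoted in the rule; the real list is not public.
BLACKLISTED_PHRASES = ("noemail", "nothing")


def contains_blacklisted_phrase(email: str,
                                phrases: tuple[str, ...] = BLACKLISTED_PHRASES) -> bool:
    local, _, domain = email.lower().rpartition("@")
    # Flag the email if any phrase appears inside either part.
    return any(p in local or p in domain for p in phrases)
```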

 

Local part or domain length is 1

If the local part (a@gmail.com) or the domain (logan@a.com) contains exactly one character
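As a sketch, the length check might look like the following. Based on the logan@a.com example, it assumes that "domain" here means the part before the first period. The function name is hypothetical.

```python
def has_single_character_part(email: str) -> bool:
    local, _, domain = email.rpartition("@")
    # Assumption: the domain length is measured before the first period,
    # so "a.com" counts as a one-character domain.
    domain_name = domain.split(".")[0]
    return len(local) == 1 or len(domain_name) == 1
```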

 

Not F1000 or personal and has numbers in local

If there are two consecutive digits in the local part and the email domain does not belong to a Fortune 1000 company
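The rule above combines a pattern match with a domain lookup. In the sketch below, `F1000_DOMAINS` is a hypothetical placeholder set with a made-up entry, since the actual Fortune 1000 domain list is not part of this article.

```python
import re

# Hypothetical placeholder; a real list would hold Fortune 1000 company domains.
F1000_DOMAINS = frozenset({"bigcorp.example"})


def has_digits_outside_f1000(email: str,
                             f1000_domains: frozenset = F1000_DOMAINS) -> bool:
    local, _, domain = email.lower().rpartition("@")
    # Two digits side by side anywhere in the local part.
    has_two_digits = re.search(r"\d\d", local) is not None
    return has_two_digits and domain not in f1000_domains
```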

 

Filters applying to the email's domain

Domain end is domain

If the strings before and after the period in the domain are the same (logan@hello.hello)
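A minimal sketch of this comparison is shown below. For domains with more than one period, it assumes the last two labels are the ones compared. The function name is hypothetical.

```python
def domain_end_is_domain(email: str) -> bool:
    domain = email.lower().rpartition("@")[2]
    labels = domain.split(".")
    # Assumption: compare the two labels around the final period.
    return len(labels) >= 2 and labels[-1] == labels[-2]
```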

 

Domain contains short gibberish

If the domain contains known gibberish patterns from a list. Examples:

asdef

asdf

etc.

 

Domain is considered disposable

MadKudu maintains a list of disposable domains. If the domain belongs to this list, the email is automatically flagged as spam (ex: "yahooo", "randomail.net").
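A disposable-domain lookup can be sketched as a set membership test. The set below contains only the two examples given above. Because one example is a bare name ("yahooo") and the other a full domain ("randomail.net"), the sketch assumes either form may match. The function name is hypothetical.

```python
# Only the two examples from the text; the real list is maintained by MadKudu.
DISPOSABLE_DOMAINS = frozenset({"yahooo", "randomail.net"})


def is_disposable(email: str, disposable: frozenset = DISPOSABLE_DOMAINS) -> bool:
    domain = email.lower().rpartition("@")[2]
    name = domain.split(".")[0]  # the part before the first period
    # Assumption: an entry may be a full domain or just the domain name.
    return domain in disposable or name in disposable
```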

 

Filters applying to the email's local part

Numbers exceed letters

If there is at least one more digit than there are letters in the local part (1234aa@gmail.com)

OR

If there are at least 6 digits in the local part
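Both conditions of this filter reduce to counting digits and letters in the local part, as in this sketch (the function name is hypothetical):

```python
def numbers_exceed_letters(email: str) -> bool:
    local = email.rpartition("@")[0]
    digits = sum(c.isdigit() for c in local)
    letters = sum(c.isalpha() for c in local)
    # "At least one more digit than letters" is the same as digits > letters.
    return digits > letters or digits >= 6
```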

 

Local part has no letters

If the local part does not contain any letters

 

Local part has no vowels

If the local part of the email is at least 4 characters long, does not contain any numbers, and does not contain any vowels.

 

Local part low vowel ratio

If the local part is at least 5 characters long and the fraction of vowels (vowels / letters ratio) is very low
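The three letter-based checks on the local part (no letters, no vowels, low vowel ratio) can be sketched together. Two points are assumptions: vowels are taken to be a, e, i, o, u only, and "very low" is represented by a hypothetical cutoff `max_ratio=0.2`, since the actual threshold is not stated. All function names are illustrative.

```python
VOWELS = set("aeiou")  # assumption: "y" is not counted as a vowel


def local_part(email: str) -> str:
    return email.lower().rpartition("@")[0]


def has_no_letters(email: str) -> bool:
    return not any(c.isalpha() for c in local_part(email))


def has_no_vowels(email: str) -> bool:
    local = local_part(email)
    return (len(local) >= 4
            and not any(c.isdigit() for c in local)
            and not any(c in VOWELS for c in local))


def has_low_vowel_ratio(email: str, max_ratio: float = 0.2) -> bool:
    local = local_part(email)
    letters = [c for c in local if c.isalpha()]
    if len(local) < 5 or not letters:
        return False
    vowels = sum(c in VOWELS for c in letters)
    # max_ratio is a hypothetical stand-in for "very low".
    return vowels / len(letters) < max_ratio
```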

 

Local part contains short gibberish (AI)

MadKudu uses a predictive model to detect gibberish patterns in groups of 4 letters (ex: dfgh). Every group of 4 letters is scored according to its similarity to known gibberish patterns from the training dataset. When a score exceeds the average score by a certain threshold, MadKudu flags the email as spam.
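The predictive model itself is not described here, so the sketch below only illustrates the general shape of the idea: slide a 4-letter window over the local part, score each group, and flag the email when a score is far enough above an average. The scorer `toy_score`, the tiny `KNOWN_GIBBERISH` set, and the `average_score` and `margin` values are all hypothetical stand-ins, not MadKudu's model.

```python
from typing import Callable, Iterator

# Toy stand-in for patterns learned from a training dataset.
KNOWN_GIBBERISH = ("dfgh", "sdfg", "qwer")


def four_letter_groups(local: str) -> Iterator[str]:
    """Yield every window of 4 consecutive letters in the local part."""
    local = local.lower()
    for i in range(len(local) - 3):
        group = local[i:i + 4]
        if group.isalpha():
            yield group


def toy_score(group: str) -> float:
    """Share of positions that match the closest known gibberish group."""
    return max(sum(a == b for a, b in zip(group, known)) / 4
               for known in KNOWN_GIBBERISH)


def has_gibberish_group(email: str,
                        score_fn: Callable[[str], float] = toy_score,
                        average_score: float = 0.25,
                        margin: float = 0.5) -> bool:
    local = email.rpartition("@")[0]
    # Flag when any group scores at least `margin` above the assumed average.
    return any(score_fn(g) >= average_score + margin
               for g in four_letter_groups(local))
```

A real model would replace `toy_score` with a learned scoring function. The windowing and thresholding steps are the only parts this sketch tries to mirror.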