How does MadKudu's spam detector work?

Before we start...

An email address is composed of a local part and a domain: [local part]@[domain].
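As a minimal sketch of that split (the function name split_email is hypothetical, and lowercasing the address is an assumption, not something MadKudu documents), the two parts could be separated like this:

```python
def split_email(email: str) -> tuple[str, str]:
    """Split an address into (local part, domain) at the last '@'."""
    # Assumption: the address is trimmed and lowercased before any filter runs.
    local, sep, domain = email.strip().lower().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a valid email address: {email!r}")
    return local, domain


print(split_email("Logan@Hello.com"))
```

The filters described below would then run on the local part, the domain, or both.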

 

Filters applying to both local and domain parts of the email

If the email contains "test" pretty much anywhere, it is flagged as spam.

 

If any character is repeated at least 4 times consecutively (aaaa@gmail.com)

OR

If any pair of letters is repeated at least 4 times (tetetete@gmail.com)
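The two repetition rules above could be approximated with regular expressions. This is only a sketch: the names are hypothetical, and it assumes the repeated pair must appear back to back, as in the tetetete example.

```python
import re

# Any single character repeated 4 or more times in a row, e.g. "aaaa".
REPEATED_CHAR = re.compile(r"(.)\1{3,}")
# Any pair of letters repeated 4 or more times in a row, e.g. "tetetete".
# Assumption: the repetitions must be consecutive.
REPEATED_PAIR = re.compile(r"([a-z]{2})\1{3,}", re.IGNORECASE)


def has_repetition(email: str) -> bool:
    """Return True if either repetition pattern occurs anywhere in the email."""
    return bool(REPEATED_CHAR.search(email) or REPEATED_PAIR.search(email))
```

For example, has_repetition("tetetete@gmail.com") is true, while has_repetition("logan@gmail.com") is not.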

 

If the email is long enough and its most common characters account for more than 70% of all its characters
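The exact length cutoff and the number of "most common characters" are not stated, so the sketch below treats them as hypothetical parameters (min_length and top_n); only the 70% share comes from the rule above.

```python
from collections import Counter


def dominated_by_few_chars(
    text: str,
    top_n: int = 2,          # assumption: how many "most common characters" to count
    min_length: int = 8,     # assumption: what "long enough" means
    max_share: float = 0.70  # from the rule: more than 70% of characters
) -> bool:
    """Return True if the top_n most frequent characters exceed max_share of text."""
    if len(text) < min_length:
        return False
    counts = Counter(text)
    top_total = sum(count for _, count in counts.most_common(top_n))
    return top_total / len(text) > max_share
```

With these assumed defaults, a string such as "aaabbbabab" is flagged because two characters make up all of it, while "logansmith" is not.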

 

If the domain contains any word or phrase from an in-house blacklist (e.g. "noemail" or "nothing")

 

If the local part (a@gmail.com) or the domain (logan@a.com) contains exactly one character
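A rough sketch of the single-character rule follows. It assumes that "domain" here means the label before the first period (the "a" in a.com), which matches the example but is not confirmed; the function name is hypothetical.

```python
def has_single_char_part(email: str) -> bool:
    """Return True if the local part or the domain name is one character long."""
    local, _, domain = email.rpartition("@")
    # Assumption: only the label before the first period counts as the domain name.
    domain_name = domain.split(".")[0]
    return len(local) == 1 or len(domain_name) == 1
```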

 

If there are two consecutive numbers in the local part and the email domain does not belong to a Fortune 1000 company

 

Filters applying to the email's domain

If the strings before and after the period in the domain are the same (logan@hello.hello)
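One way this check might look, assuming the domain is split at its first period (the function name is hypothetical):

```python
def domain_labels_match(domain: str) -> bool:
    """Return True if the text before and after the first period is identical."""
    name, sep, suffix = domain.lower().partition(".")
    # A domain without any period cannot match this rule.
    return bool(sep) and name == suffix
```

Under that assumption, "hello.hello" is flagged and "hello.com" is not.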

 

If the domain contains known gibberish patterns from a list. Examples:

asdef

asdf

etc.
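A list-based check like this can be sketched as a simple substring search. The tuple below holds only the two examples given above; the real list is not public, and the names are hypothetical.

```python
# Only the two patterns quoted in this article; the actual list is longer.
GIBBERISH_PATTERNS = ("asdef", "asdf")


def domain_has_gibberish(domain: str) -> bool:
    """Return True if any known gibberish pattern appears inside the domain."""
    domain = domain.lower()
    return any(pattern in domain for pattern in GIBBERISH_PATTERNS)
```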

 

MadKudu maintains a list of disposable domains. If the domain belongs to this list, the email is automatically flagged as spam (e.g. "yahooo", "randomail.net").

 

Filters applying to the email's local part

If there is at least one more digit than there are letters in the local part (1234aa@gmail.com)

OR

If there are at least 6 numbers in the local part
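The two digit rules above can be sketched by counting characters. The sketch assumes that "numbers" means individual digits; the function and parameter names are hypothetical.

```python
def too_many_digits(local: str, max_digits: int = 6) -> bool:
    """Return True if digits outnumber letters, or there are max_digits or more digits."""
    digits = sum(ch.isdigit() for ch in local)
    letters = sum(ch.isalpha() for ch in local)
    # Rule 1: at least one more digit than letters. Rule 2: at least 6 digits.
    return digits >= letters + 1 or digits >= max_digits
```

For 1234aa there are 4 digits and 2 letters, so the first rule applies; for logan12 neither rule does.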

 

If the local part does not contain any letters

 

If the local part is at least 4 characters long, does not contain any numbers, and does not contain any vowels.

 

If the local part is at least 5 characters long and the ratio of vowels to letters is very low
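The three letter and vowel rules above could look roughly like the sketch below. The cutoff for a "very low" vowel ratio is not published, so min_ratio is a placeholder, and treating only a, e, i, o, u as vowels is also an assumption.

```python
VOWELS = set("aeiou")  # assumption: "y" is not counted as a vowel


def no_letters(local: str) -> bool:
    """Rule: the local part does not contain any letters."""
    return not any(ch.isalpha() for ch in local)


def no_vowels_no_digits(local: str) -> bool:
    """Rule: at least 4 characters, no digits, and no vowels."""
    lowered = local.lower()
    return (
        len(lowered) >= 4
        and not any(ch.isdigit() for ch in lowered)
        and not any(ch in VOWELS for ch in lowered)
    )


def low_vowel_ratio(local: str, min_length: int = 5, min_ratio: float = 0.2) -> bool:
    """Rule: at least 5 characters and a very low vowels / letters ratio."""
    letters = [ch for ch in local.lower() if ch.isalpha()]
    if len(local) < min_length or not letters:
        return False
    vowels = sum(ch in VOWELS for ch in letters)
    # min_ratio is a hypothetical threshold for "very low".
    return vowels / len(letters) < min_ratio
```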

 

MadKudu uses a predictive model to detect gibberish patterns in groups of 4 letters (e.g. dfgh). Every group of 4 letters is scored according to its similarity to known gibberish patterns from the training dataset. When the score exceeds the average score by a certain threshold, MadKudu flags the email as spam.
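The model itself is not public. The sketch below only illustrates the sliding-window idea, and it swaps in a plain string-similarity measure (Python's difflib.SequenceMatcher) for the predictive model. The pattern list, the fixed threshold, and all function names are hypothetical, and the comparison against an average score is simplified away.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for the gibberish patterns learned from training data.
KNOWN_GIBBERISH = ["asdf", "dfgh", "qwer"]


def four_letter_groups(local: str) -> list[str]:
    """Return every run of 4 consecutive letters, ignoring non-letter characters."""
    letters = "".join(ch for ch in local.lower() if ch.isalpha())
    return [letters[i:i + 4] for i in range(len(letters) - 3)]


def gibberish_score(group: str) -> float:
    """Score a group by its best similarity (0.0 to 1.0) to a known pattern."""
    return max(SequenceMatcher(None, group, p).ratio() for p in KNOWN_GIBBERISH)


def looks_like_gibberish(local: str, threshold: float = 0.9) -> bool:
    """Flag the local part if any 4-letter group scores at or above the threshold."""
    return any(gibberish_score(g) >= threshold for g in four_letter_groups(local))
```

Under these assumptions, a local part such as xdfgh is flagged because it contains the group dfgh, while logan is not.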