Building Regular Expressions
The DLP engine contains 3000+ predefined data identifiers that can be used in DLP rules. It also supports custom data identifiers that use either a keyword search or a regular expression search. This page describes how to write custom data identifiers for DLP using regular expressions.
Syntax
This section describes the regular expression syntax that the DLP engine supports. The DLP engine parser interprets regular expression syntax identically to UNIX regular expression syntax.
Supported Operators
Operator | Matched Pattern |
---|---|
\ | Quote the next metacharacter. |
^ | Match the beginning of a line. |
$ | Match the end of a line. |
. | Match any character (except newline). |
| | Alternation: match the expression on either side of the operator |
( ) | Used for grouping to force operator precedence |
[xy] | Character x or y |
[x-z] | The range of characters between x and z |
[^z] | Any character except z |
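The operators above can be tried out with a short sketch. It uses Python's `re` module as a stand-in for the DLP engine, which is an assumption: the engine's parser may differ in details. The helper name `find_all` is hypothetical, and the `re.MULTILINE` flag is used so that `^` and `$` match at line boundaries, as described in the table.

```python
import re

def find_all(pattern, text, flags=0):
    """Return the full text of every non-overlapping match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text, flags)]

# Alternation inside a group: "gr", then "a" or "e", then "y".
print(find_all(r"gr(a|e)y", "gray grey groy"))              # ['gray', 'grey']

# Character class, range, and negated class.
print(find_all(r"[xy]", "xyz"))                             # ['x', 'y']
print(find_all(r"[x-z]", "wxyz"))                           # ['x', 'y', 'z']
print(find_all(r"[^z]", "xyz"))                             # ['x', 'y']

# ^ and $ anchor to the beginning and end of each line (MULTILINE in Python).
print(find_all(r"^\w+$", "first\nsecond", re.MULTILINE))    # ['first', 'second']

# "." matches any character except newline; "\." quotes the dot.
print(find_all(r"a.b", "a\nb a-b"))                         # ['a-b']
print(find_all(r"a\.b", "a.b axb"))                         # ['a.b']
```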
Supported Quantifiers
Operator | Matched Pattern |
---|---|
* | Match 0 or more times |
+ | Match 1 or more times |
? | Match 0 or 1 times |
{n} | Match exactly n times |
{n,} | Match at least n times |
{n,m} | Match at least n times, but no more than m times |
Note
Unrestricted greedy quantifiers applied to arbitrary characters, such as .* or .+, are not allowed. If you intend to include these characters in a character class or set, reverse their order. For example, write *. instead of .*.
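The quantifiers can be illustrated with a minimal sketch, again using Python's `re` module as a stand-in for the DLP engine. The helper `full_match` and the pattern `BOUNDED` are hypothetical names. The bounded pattern shows one possible way to avoid `.*` by using a class with an explicit upper limit; whether the engine accepts this particular form is an assumption that this page does not confirm.

```python
import re

def full_match(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(full_match(r"ab*", "a"))            # True:  * allows 0 occurrences of b
print(full_match(r"ab+", "a"))            # False: + needs at least 1 b
print(full_match(r"colou?r", "color"))    # True:  ? makes the u optional
print(full_match(r"\d{3}", "1234"))       # False: {3} means exactly 3 digits
print(full_match(r"\d{2,}", "12345"))     # True:  {2,} means 2 or more
print(full_match(r"\d{2,4}", "12345"))    # False: {2,4} allows at most 4

# Hypothetical bounded alternative to "ID.*\d{6}": at most 10 non-digit
# characters may separate the keyword from the 6-digit number.
BOUNDED = r"ID\D{0,10}\d{6}"
print(re.search(BOUNDED, "ID number: 123456") is not None)                   # True
print(re.search(BOUNDED, "ID is definitely not nearby 123456") is not None)  # False
```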
Metacharacters
Operator | Matched Pattern |
---|---|
\t | Match tab |
\n | Match newline |
\r | Match return |
\f | Match form feed |
\a | Match alarm (bell, beep and so on) |
\e | Match escape |
\v | Match vertical tab |
\021 | Match octal character (in this example, 21 octal) |
\xF0 | Match hex character (in this example, F0 hex) |
\x{263a} | Match wide hex character (Unicode) |
\w | Match word character (alphanum plus '_') |
\W | Match non-word character |
\s | Match whitespace character. This metacharacter also includes \n and \r |
\S | Match non-whitespace character |
\d | Match digit character |
\D | Match non-digit character |
\b | Match word boundary |
\B | Match non-word boundary |
\A | Match start of string (never match at line breaks) |
\Z | Match end of string. Never match at line breaks; only match at the end of the final buffer of text submitted for matching |
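A few of the metacharacters above can be exercised in a short sketch. Python's `re` module is used as a stand-in for the DLP engine, so this is an approximation: Python's `re` does not support `\e` or the `\x{263a}` form, and those are left out. The helper name `find_all` is hypothetical.

```python
import re

def find_all(pattern, text, flags=0):
    """Return the full text of every non-overlapping match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text, flags)]

# \b matches only at word boundaries, so "cat" inside "concat" is skipped.
print(find_all(r"\bcat\b", "cat concat cat."))          # ['cat', 'cat']

# \d matches digits; \s matches whitespace, including tab and newline.
print(find_all(r"\d+", "a1b22"))                        # ['1', '22']
print(find_all(r"\s", "a b\tc\nd"))                     # [' ', '\t', '\n']

# \x41 is the hex escape for "A".
print(find_all(r"\x41", "ABA"))                         # ['A', 'A']

# \A and \Z anchor to the whole string, never to line breaks,
# whereas ^ matches at each line start in multiline mode.
print(find_all(r"^\w+", "one\ntwo", re.MULTILINE))      # ['one', 'two']
print(find_all(r"\A\w+", "one\ntwo", re.MULTILINE))     # ['one']
print(find_all(r"\w+\Z", "one\ntwo", re.MULTILINE))     # ['two']
```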
Examples of Regular Expressions
Regex to detect 16-digit credit card number
Regex
\d{4}-?\d{4}-?\d{4}-?\d{4}
\d - Matches a digit character.
{4} - Matches exactly 4 times, so each group must contain exactly 4 digits.
-? - Allows the groups of digits to be optionally separated by a hyphen. ? indicates 0 or 1 times.
This simple regex checks that the number is a 16-digit number, optionally separated by hyphens.
Example matches
The regex would match 1234-5678-9123-4567 or 1234567891234567.
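To check the pattern before adding it to a rule, it can be run locally. The following is a minimal sketch that uses Python's `re` module as a stand-in for the DLP engine; the names `CARD_RE` and `find_card_numbers` are hypothetical.

```python
import re

# The 16-digit pattern from this example, compiled once for reuse.
CARD_RE = re.compile(r"\d{4}-?\d{4}-?\d{4}-?\d{4}")

def find_card_numbers(text):
    """Return every substring of text that matches the 16-digit pattern."""
    return [m.group(0) for m in CARD_RE.finditer(text)]

print(find_card_numbers("Cards: 1234-5678-9123-4567 and 1234567891234567"))
# ['1234-5678-9123-4567', '1234567891234567']

# Each hyphen is optional on its own, so mixed formatting also matches.
print(find_card_numbers("1234-56789123-4567"))   # ['1234-56789123-4567']

# Fewer than 16 digits does not match.
print(find_card_numbers("1234-5678-9123"))       # []
```

Because the pattern has no anchors or word boundaries, it also matches the first 16 digits of a longer run of digits. Wrapping it in \b on both sides is one way to avoid that, if such matches are not wanted.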
Regex to validate if the 16-digit credit card number is from a major credit card issuer
Matches major credit cards, including Visa (length 16, prefix 4) and MasterCard (length 16, prefix 51-55).
Regex
^((4\d{3})|(5[1-5])\d{2})-?\d{4}-?\d{4}-?\d{4}
^ - Matches the beginning of the line
4 - Checks that the first digit is 4. Visa card numbers start with 4
\d{3} - Followed by 3 digits
| - Alternation matches one of several alternative regular expressions
(5[1-5])\d{2} - Matches MasterCard prefix 51 to 55 followed by 2 digits
-? - Allows the groups of digits to be optionally separated by hyphens. ? indicates 0 or 1 times.
Example matches
The regex would match 4001123456781234 or 5100123456781234.
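The issuer pattern can be tested the same way. This sketch uses Python's `re` module as a stand-in for the DLP engine and assumes the input is a single line of text. The names `ISSUER_RE` and `card_issuer` are hypothetical, and reading the issuer from the capture groups is an illustration rather than something the DLP engine is documented to expose.

```python
import re

# The issuer pattern from this example. Group 2 is the Visa branch (4\d{3});
# group 3 is the MasterCard prefix (5[1-5]).
ISSUER_RE = re.compile(r"^((4\d{3})|(5[1-5])\d{2})-?\d{4}-?\d{4}-?\d{4}")

def card_issuer(line):
    """Return 'Visa' or 'MasterCard' if the line starts with a matching number, else None."""
    m = ISSUER_RE.search(line)
    if m is None:
        return None
    return "Visa" if m.group(2) is not None else "MasterCard"

print(card_issuer("4001123456781234"))        # Visa
print(card_issuer("5100-1234-5678-1234"))     # MasterCard
print(card_issuer("5600123456781234"))        # None: prefix 56 is outside 51-55
print(card_issuer("Card 4001123456781234"))   # None: ^ requires the number at the start
```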
Regex to check the medical record number
Assume you have a medical record number that is 16 characters long: the prefix "NWH", which indicates that the patient record is from Northwestern Hospital, followed by the first 3 letters of the first name, 3 letters of the last name, and 7 digits.
Regex
\b(NWH)-?[a-zA-Z]{3}-?[a-zA-Z]{3}-?\d{7}\b
\b - Matches a word boundary
(NWH) - Looks for the prefix NWH
-? - Allows 0 or 1 occurrence of "-"
[a-zA-Z]{3} - Checks for three alphabetic characters. Each can be any character from a-z or A-Z
\d{7} - Checks for seven digits
Example matches
The regex would match NWHCARVAN0000001 or NWH-TIM-BRO-0000002.
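The medical record pattern can be checked with a similar sketch, using Python's `re` module as a stand-in for the DLP engine. The names `MRN_RE` and `find_mrns` are hypothetical.

```python
import re

# The medical record number pattern from this example.
MRN_RE = re.compile(r"\b(NWH)-?[a-zA-Z]{3}-?[a-zA-Z]{3}-?\d{7}\b")

def find_mrns(text):
    """Return every medical record number found in text."""
    return [m.group(0) for m in MRN_RE.finditer(text)]

print(find_mrns("Records NWHCARVAN0000001 and NWH-TIM-BRO-0000002."))
# ['NWHCARVAN0000001', 'NWH-TIM-BRO-0000002']

# The trailing \b rejects an eighth digit, and the leading \b rejects
# a prefix that is glued to a preceding word character.
print(find_mrns("NWHCARVAN00000012"))    # []
print(find_mrns("XNWHCARVAN0000001"))    # []

# The letter class must end at uppercase Z. A range written as A-z would
# also admit punctuation such as "_", which sits between "Z" and "a" in ASCII.
print(re.fullmatch(r"[A-z]{3}", "a_b") is not None)      # True
print(re.fullmatch(r"[a-zA-Z]{3}", "a_b") is not None)   # False
```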