Regex Primer

Useful references:

Regular expression - Wikipedia  - Everything you would want to know about REGEX

Regex Cheat Sheet | Python, PHP, Perl, JavaScript, Ruby - A regex syntax cheat sheet

Some basics:

^ indicates start of the string the pattern is testing.

$ indicates the end of the string.

() – using parenthesis pairs creates a grouping.

\.  looking for a period

.   any char except newline

[] – square brackets create a set; any one member of the set is a match. For example, [ABDT] matches an “A”, “B”, “D”, or “T”, but only a single character.

\w  Word character [a-zA-Z0-9_]

\W Nonword character [^a-zA-Z0-9_]

\d   Digit character [0-9]

\D  Nondigit character [^0-9]

\s   Whitespace character [ \t\n\r\f\v]  (includes the space character)

\S  Nonwhitespace character [^ \t\n\r\f\v]

|     OR operator,  (a|b|d|t)  matches the same single characters as [abdt]

{x,y} or {x}     number of repetitions of the preceding item,  x=min (can be 0) and y=max; {x} means exactly x.

 \d{0,3}  means zero to a max of three digits (0-9),  \d{2} means 2 digits exactly.
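
The basics above can be tried out with a short Python sketch. Python's built-in re module is an assumption here (the primer itself is tool-neutral), and whole_match is a hypothetical helper name used only for illustration.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    """Return True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# ^ and $ anchor the pattern to the start and end of the string.
print(bool(re.search(r"^\d{2}$", "42")))    # True
print(bool(re.search(r"^\d{2}$", "x42")))   # False: "x" sits before the digits

# A set matches exactly one of its members.
print(whole_match(r"[ABDT]", "D"))          # True
print(whole_match(r"[ABDT]", "DT"))         # False: two characters, one set

# \. is a literal period; a bare . is any character except newline.
print(whole_match(r"a\.b", "a.b"))          # True
print(whole_match(r"a\.b", "axb"))          # False
print(whole_match(r"a.b", "axb"))           # True

# {x,y} bounds the repeat count of the item before it.
print(whole_match(r"\d{0,3}", ""))          # True: zero digits is allowed
print(whole_match(r"\d{0,3}", "1234"))      # False: four digits is too many
```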

Starting with the PDF file extension using groups:

To match the patterns “PDF”, “pdf”, “Pdf”, “pDf”, “pdF”…

REGEX using three groups:  (p|P)(d|D)(f|F)

Group 1: (p|P)    either a “p” or a “P”

Group 2: (d|D)   

Group 3: (f|F)

For the “.” in “.PDF” we use \. because a “.” without the preceding backslash will match “any” character.

To  match “.PDF”  the regex pattern therefore becomes..

REGEX:  \.(p|P)(d|D)(f|F)  
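
As a minimal sketch of the extension pattern in use, the Python below checks a few file names. The re module and the helper name has_pdf_extension are assumptions for illustration, and a trailing $ is added so that the extension must sit at the end of the name.

```python
import re

# The three-group extension pattern from above, anchored to the end of the name.
PDF_EXTENSION = re.compile(r"\.(p|P)(d|D)(f|F)$")

def has_pdf_extension(name: str) -> bool:
    """Return True when the name ends in .pdf in any mix of upper and lower case."""
    return PDF_EXTENSION.search(name) is not None

for name in ["plan.PDF", "plan.pDf", "planXpdf", "plan.pdf.txt"]:
    print(name, has_pdf_extension(name))
```

Where the regex engine offers a case-insensitive option, the same check can be written more briefly, for example re.search(r"\.pdf$", name, re.IGNORECASE) in Python. The grouped form is kept here because it works in tools that lack such an option.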

REGEX pattern groupings that can be assembled together:

.PDF

\.(p|P)(d|D)(f|F) 

Lower- or upper-case PDF, pdf, Pdf, PDf, pDf, pDF,…

0-00

9-99

(\d-\d{2})

where: \d = any 0-9 digit,  {2} indicates 2 digits required

0-00 or 0 00

9-99 or 9 99

(\d(-|\s{1})\d{2})

where: \s = any whitespace character (including a space),  {1} indicates exactly 1 is allowed (the same as leaving {1} off), “|” means OR

000

999

(\d{3})

where: \d = any 0-9 digit,  {3} indicates 3 digits required

000.0

999.9

(\d{3}\.\d)

where: the \. means a period; using “.” without a preceding “\” has special REGEX meaning of “any char”

0.0 or 0.00 or 00.0 or 00.00 or 000.0 or 000.00

9.9 or 9.99 or 99.9 or 99.99 or 999.9 or 999.99

(\d{1,3}\.\d{1,2})

where: the \. means a period; using “.” without a preceding “\” has special REGEX meaning of “any char”

where: \d{1,3} means 1 to 3 digits,  \d{1,2} means 1 to 2 digits
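
The numeric groupings above can be checked side by side with a small Python sketch. The dictionary labels and the helper name fits are hypothetical, chosen only for this illustration, and re.fullmatch is used so that each test string must match the group in full.

```python
import re

# Labels are illustrative; the patterns are the groupings listed above.
NUMBER_PATTERNS = {
    "D-DD": r"(\d-\d{2})",
    "D-DD or D DD": r"(\d(-|\s{1})\d{2})",
    "DDD": r"(\d{3})",
    "DDD.D": r"(\d{3}\.\d)",
    "D.D to DDD.DD": r"(\d{1,3}\.\d{1,2})",
}

def fits(label: str, text: str) -> bool:
    """Return True when text matches the labelled pattern from start to end."""
    return re.fullmatch(NUMBER_PATTERNS[label], text) is not None

print(fits("D-DD", "9-99"))             # True
print(fits("D-DD", "9 99"))             # False: a space is not a dash
print(fits("D-DD or D DD", "9 99"))     # True
print(fits("DDD.D", "9999"))            # False: the period is required
print(fits("D.D to DDD.DD", "00.00"))   # True
print(fits("D.D to DDD.DD", "0000.0"))  # False: more than 3 leading digits
```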

Must have any of the following one or two chars  example set: A,B,C,A-,AI,V,T,TC upper case only

([ABCVT]|AI|A-|TC){1}

where: “[set]” means either A or B or C or V or T; “|” means OR, adding “AI”, “A-” or “TC”; {1} means exactly one occurrence of the group (the same as leaving {1} off)

Must have any of the following one or two chars  example set: A,B,C,A-,AI,V,T,TC either case allowed

([ABCVTabcvt]|AI|Ai|aI|ai|A-|a-|TC|tC|Tc|tc){1}

where: same as above but adding the lower-case options of same
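
A short Python sketch of the two code groups, assuming re.fullmatch so that the whole test string must be one of the permitted codes. The names UPPER_ONLY, EITHER_CASE and is_code are hypothetical.

```python
import re

UPPER_ONLY = r"([ABCVT]|AI|A-|TC){1}"
EITHER_CASE = r"([ABCVTabcvt]|AI|Ai|aI|ai|A-|a-|TC|tC|Tc|tc){1}"

def is_code(pattern: str, text: str) -> bool:
    """Return True when text is exactly one of the codes the pattern permits."""
    return re.fullmatch(pattern, text) is not None

print(is_code(UPPER_ONLY, "A"))     # True: single character from the set
print(is_code(UPPER_ONLY, "AI"))    # True: two-character alternative
print(is_code(UPPER_ONLY, "AB"))    # False: "AB" is not a listed pair
print(is_code(UPPER_ONLY, "ai"))    # False: lower case is not allowed here
print(is_code(EITHER_CASE, "tC"))   # True: mixed case is listed explicitly
```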

Either one space, underbar or dash

(\s|_|-){1}

where: \s = whitespace (a space), “|” means OR,  {1} means exactly one of a space OR “_” OR “-” is allowed

Sheet title string or project string starting with either an underbar or space, can include underbar, & and/or spaces

_any string

 any string     (first char is space)

_any  & string1

_detail_sheet1

((\s|_)([a-zA-Z0-9]|\s|_|&){0,})

where: \s = space, “|” means OR,  [a-zA-Z0-9] is a set of any char a through z, A through Z, or 0-9; the first group requires a leading space OR “_”, and {0,} means zero or more of the allowed characters that follow (the same as *)

Sheet title string, project string starting with either an underbar, dash or space, can include underbar, & and/or space

-any & string1

- detail_sheet1

_any string

 any string     (first char is space)

_any & string1

_detail_sheet1

((\s|_|-|-\s)([a-zA-Z0-9]|\s|_|&){0,})

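where: \s = space, “|” means OR,  [a-zA-Z0-9] is a set of any char a through z, A through Z, or 0-9; the first group requires a leading space OR “_” OR “-” OR “- ”, and {0,} means zero or more of the allowed characters that follow (the same as *)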
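
The two title-string groups can be compared with the Python sketch below, which runs them against the example strings listed above. The names TITLE_SPACE_UNDERBAR, TITLE_WITH_DASH and is_title are hypothetical, and re.fullmatch is an assumption so that any character outside the allowed set causes a rejection.

```python
import re

TITLE_SPACE_UNDERBAR = r"((\s|_)([a-zA-Z0-9]|\s|_|&){0,})"
TITLE_WITH_DASH = r"((\s|_|-|-\s)([a-zA-Z0-9]|\s|_|&){0,})"

def is_title(pattern: str, text: str) -> bool:
    """Return True when the whole text is an acceptable title string."""
    return re.fullmatch(pattern, text) is not None

print(is_title(TITLE_SPACE_UNDERBAR, "_any  & string1"))   # True
print(is_title(TITLE_SPACE_UNDERBAR, " any string"))       # True
print(is_title(TITLE_SPACE_UNDERBAR, "-any & string1"))    # False: dash start
print(is_title(TITLE_WITH_DASH, "-any & string1"))         # True
print(is_title(TITLE_WITH_DASH, "- detail_sheet1"))        # True
print(is_title(TITLE_WITH_DASH, "any string"))             # False: no lead char
```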

Limit the file name to 20 characters or less

^(?!.{21,})

where: (?!exp) is a negative lookahead; placed right after ^ it rejects the string if 21 or more characters of any kind (.) follow, so only strings of 20 characters or less are accepted
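
A minimal Python sketch of the length limit, assuming the lookahead is anchored at the start of the string with ^ and combined with the .PDF extension group. The names PDF_NAME_MAX_20 and is_short_pdf_name are hypothetical.

```python
import re

# ^(?!.{21,}) fails as soon as 21 or more characters follow the start of the
# string; the rest of the pattern then checks for the .PDF extension.
PDF_NAME_MAX_20 = re.compile(r"^(?!.{21,}).*\.(p|P)(d|D)(f|F)$")

def is_short_pdf_name(name: str) -> bool:
    """Return True for a PDF file name of 20 characters or less in total."""
    return PDF_NAME_MAX_20.search(name) is not None

print(is_short_pdf_name("A-001.pdf"))          # True
print(is_short_pdf_name("x" * 16 + ".pdf"))    # True: exactly 20 characters
print(is_short_pdf_name("x" * 17 + ".pdf"))    # False: 21 characters
```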

Most of the naming standard regex patterns can be created by using “grouping”.  Breaking the file name standard into “groups” helps in creating a REGEX pattern for the entire file name. 

Combining groups to check a file name pattern

To match DDD-DD_DD.PDF  or DDD.DD_DD.PDF

This pattern has 4 groups DDD-DD_DD.PDF  or DDD.DD_DD.PDF

Group 1: the DDD   (\d{3})

Group 2: a dash or period    (-|\.)   - either the “-“ OR the “.”

Group 3: the DD_DD   (\d{2}_\d{2})

Group 4: the PDF part   (\.(p|P)(d|D)(f|F)) 

REGEX: (\d{3})(-|\.)(\d{2}_\d{2})(\.(p|P)(d|D)(f|F))
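
One benefit of grouping is that each part of a matching name can be read back by group number. The Python sketch below assumes the re module; FILE_NAME and split_name are hypothetical names. Groups are numbered by their opening parenthesis, so groups 1 to 4 are the four parts described above and groups 5 to 7 are the single letters inside the PDF group.

```python
import re

FILE_NAME = re.compile(r"(\d{3})(-|\.)(\d{2}_\d{2})(\.(p|P)(d|D)(f|F))")

def split_name(name: str):
    """Return the four main groups of a matching name, or None if it fails."""
    m = FILE_NAME.fullmatch(name)
    if m is None:
        return None
    return m.group(1), m.group(2), m.group(3), m.group(4)

print(split_name("123-45_67.PDF"))   # ('123', '-', '45_67', '.PDF')
print(split_name("123.45_67.pdf"))   # ('123', '.', '45_67', '.pdf')
print(split_name("12-45_67.pdf"))    # None: only two leading digits
```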

Optional patterns – using OR on groups…

To match DDD-DD_DD.PDF  or CCDD-DD_DD.PDF,  where first C (char) can be either A,B,C and second C can be B,D,G,Z   (  BB00-01_01.PDF or AZ20-99_01.PDF )

This pattern has 4 groups DDD-DD_DD.PDF  or CCDD-DD_DD.PDF

Group 1: the DDD  (\d{3})

                 OR  the CCDD: one of the set ABC, one of the set BDGZ, and 2 digits   (([ABC])([BDGZ])(\d{2}))

             Together – using an OR for either option, wrapped in one outer group  ((\d{3})|(([ABC])([BDGZ])(\d{2})))

Group 2: the dash  (-)  

Group 3: the DD_DD   (\d{2}_\d{2})

Group 4: the PDF part   (\.(p|P)(d|D)(f|F))  the .PDF

REGEX: ((\d{3})|(([ABC])([BDGZ])(\d{2})))(-)(\d{2}_\d{2})(\.(p|P)(d|D)(f|F))
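
When OR is used between groups, the two options need their own enclosing parentheses, because “|” otherwise splits the entire pattern into two alternatives. The Python sketch below illustrates this and then tests a balanced form of the full pattern. It assumes the dash-only separator described for Group 2; UNGROUPED, GROUPED, FILE_NAME and is_valid are hypothetical names.

```python
import re

# Without an enclosing group, | separates everything on its left from
# everything on its right, so "123" alone satisfies the first alternative.
UNGROUPED = r"\d{3}|[ABC][BDGZ]\d{2}-\d{2}_\d{2}"
GROUPED = r"(\d{3}|[ABC][BDGZ]\d{2})-\d{2}_\d{2}"
print(re.fullmatch(UNGROUPED, "123") is not None)   # True (not intended)
print(re.fullmatch(GROUPED, "123") is not None)     # False

# Balanced form of the full pattern: DDD or CCDD, a dash, DD_DD, then .PDF.
FILE_NAME = re.compile(
    r"((\d{3})|(([ABC])([BDGZ])(\d{2})))(-)(\d{2}_\d{2})(\.(p|P)(d|D)(f|F))"
)

def is_valid(name: str) -> bool:
    """Return True when the whole name fits DDD-DD_DD.PDF or CCDD-DD_DD.PDF."""
    return FILE_NAME.fullmatch(name) is not None

for name in ["BB00-01_01.PDF", "AZ20-99_01.PDF", "123-45_67.pdf",
             "DB00-01_01.PDF", "AA00-01_01.PDF"]:
    print(name, is_valid(name))
```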


In most drawing naming standards, one or two characters in the file name represent the trade or discipline of the content found on the drawing sheet / file.   Only specific one- or two-character “sets” are allowed in the file name.  A group can be created to accept only those specific characters.

Creating a group for specific character set

To create a group that would only allow for these specific characters (upper and lower case) to be accepted:

([GHVBCLSAIQFPDMEWTRXZOghvbclsaiqfpdmewtrxzo])  a single character in this REGEX set will be accepted.


Creating a pattern based on the National CAD Standard construct…  United States National CAD Standard - V6: Uniform Drawing System Module 1

The file naming pattern…

A1A2DDD.DD-UUU.PDF    where .DD is optional,  -UUU is optional

Examples:

A-001.pdf

a-001.pdf

A-001.01-R1.pdf

el001.01-X1.PDF

Assumptions:  1. accept either upper or lower case and 2. no sheet title/project title string (example with this given below)

Permitted characters for “A”

A1 = 1st level:  ABCDEFGHILMOPQRSTUVWXZabcdefghilmopqrstuvwxz

A2 = 2nd level: -ABCDEFGHIJKLMNOPQRSTUVWXYabcdefghijklmnopqrstuvwxy

Create a REGEX group using a REGEX “set”…

A1= ([ABCDEFGHILMOPQRSTUVWXZabcdefghilmopqrstuvwxz])

A2= ([-ABCDEFGHIJKLMNOPQRSTUVWXYabcdefghijklmnopqrstuvwxy])

Create a REGEX group for the DDD

DDD = (\d{3})    total of three digits (000-999)

Create a REGEX group for the .DD

.DD = (\.\d{2})     period (escaped as \.) followed by total of two digits  (00-99)

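Create a REGEX group for the -UUU   (U can be a digit or a character A-Z or a-z)

-UUU = (-(\w|\d){1,3})     a dash followed by at least 1 and up to a max of 3 characters (A-Z or a-z) or digits (0-9). Note that \w already includes the digits and the underbar, so [a-zA-Z0-9] can be used in its place if an underbar should be rejected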

Since .DD and -UUU are optional, and .DD must appear before -UUU if both are present, use two ORs in a combined group and make that group optional with a trailing ?

((\.\d{2})|(-(\w|\d){1,3})|(\.\d{2})(-(\w|\d){1,3}))?    Either .DD or -UUU or .DD-UUU, or neither

Putting the groups together with OR’s for the options .DD and -UUU and adding the PDF group

^([ABCDEFGHILMOPQRSTUVWXZabcdefghilmopqrstuvwxz])([-ABCDEFGHIJKLMNOPQRSTUVWXYabcdefghijklmnopqrstuvwxy])(\d{3})((\.\d{2})|(-(\w|\d){1,3})|(\.\d{2})(-(\w|\d){1,3}))?\.(p|P)(d|D)(f|F)$

The above is one continuous text line for the REGEX pattern:  A1A2DDD.DD-UUU.PDF
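
As a rough sketch, the pattern can also be assembled from its named parts in Python and run against the four example names. Two assumptions are made here: the period in .DD is escaped as \., and .DD and -UUU are each made optional with a trailing ? (a shorter equivalent of the combined OR group). All variable and function names are hypothetical.

```python
import re

A1 = r"([ABCDEFGHILMOPQRSTUVWXZabcdefghilmopqrstuvwxz])"
A2 = r"([-ABCDEFGHIJKLMNOPQRSTUVWXYabcdefghijklmnopqrstuvwxy])"
DDD = r"(\d{3})"
DOT_DD = r"(\.\d{2})"          # assumption: period escaped
DASH_UUU = r"(-(\w|\d){1,3})"  # note: \w also accepts an underbar
PDF = r"\.(p|P)(d|D)(f|F)"

# Each optional part gets its own ?, which keeps .DD ahead of -UUU.
NCS_NAME = re.compile("^" + A1 + A2 + DDD + DOT_DD + "?" + DASH_UUU + "?" + PDF + "$")

def is_ncs_name(name: str) -> bool:
    """Return True when name fits A1A2DDD.DD-UUU.PDF with optional .DD and -UUU."""
    return NCS_NAME.search(name) is not None

for name in ["A-001.pdf", "a-001.pdf", "A-001.01-R1.pdf", "el001.01-X1.PDF",
             "A-001-R1.01.pdf", "J-001.pdf", "A-01.pdf"]:
    print(name, is_ncs_name(name))
```

The first four names are the examples listed earlier and are accepted. The last three are rejected: -UUU placed before .DD, “J” not in the A1 set, and only two digits for DDD.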