Useful references:
Regular expression - Wikipedia - Everything you would want to know about REGEX
Regex Cheat Sheet | Python, PHP, Perl, JavaScript, Ruby - A regex syntax cheat sheet
Some basics:
^ indicates start of the string the pattern is testing.
$ indicates the end of the string.
() – using parenthesis pairs creates a grouping.
\. looking for a period
. any char except newline
[] – using square brackets creates a set – any member of the set is a match. Example: [ABDT] matches an “A”, “B”, “D”, or “T”, but only one character.
\w Word character [a-zA-Z0-9_]
\W Nonword character [^a-zA-Z0-9_]
\d Digit character [0-9]
\D Nondigit character [^0-9]
\s Whitespace character [ \n\r\f\t] (the set includes the space character)
\S Nonwhitespace character [^ \n\r\f\t]
| OR operator, (a|b|d|t) matches the same characters as [abdt]
{x,y} or {x} number of repetitions of the preceding item in the pattern, x=min (can be 0) and y=max; {x} alone means exactly x.
\d{0,3} means zero to a max of three digits (0-9), \d{2} means 2 digits exactly.
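The basics above can be tried out directly. The following is a minimal sketch in Python, assuming the standard `re` module and using `re.fullmatch` so that each pattern has to cover the whole test string; the test strings are made up for illustration.

```python
import re

# Each entry: (pattern, text, expected result of a full-string match)
checks = [
    (r"\d{2}", "42", True),        # exactly two digits
    (r"\d{2}", "4", False),        # too few digits
    (r"\d{0,3}", "", True),        # zero digits is allowed
    (r"\d{0,3}", "1234", False),   # four digits exceeds the max of three
    (r"[ABDT]", "B", True),        # one member of the set
    (r"[ABDT]", "AB", False),      # a set matches one character only
    (r"(a|b|d|t)", "d", True),     # alternation equivalent to [abdt]
    (r"\.", "x", False),           # escaped period matches only "."
    (r".", "x", True),             # bare period matches any character
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in checks]

for (pattern, text, _), got in zip(checks, results):
    print(f"{pattern!r:14} {text!r:8} -> {got}")
```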
Starting with the PDF file extension using groups:
To match the patterns “PDF”, “pdf”, “Pdf”, “pDf”, “pdF”…
REGEX using three groups: (p|P)(d|D)(f|F)
Group 1: (p|P) either a “p” or a “P”
Group 2: (d|D)
Group 3: (f|F)
For the “.” in “.PDF” we use a \. (a backslash followed by a period).
Using the “.” without the backslash will match “any” character.
To match “.PDF” the regex pattern therefore becomes:
REGEX: \.(p|P)(d|D)(f|F)
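As a rough check of the extension pattern, the sketch below runs it against a few made-up file names in Python. It also shows two things that are not part of the pattern above: what happens when the period is left unescaped, and the `re.IGNORECASE` flag as an alternative way to accept either case.

```python
import re

# Pattern from the text: escaped period plus one group per letter.
PDF_GROUPS = re.compile(r"\.(p|P)(d|D)(f|F)")

# Alternative, not from the text: let a flag handle upper/lower case.
PDF_FLAG = re.compile(r"\.pdf", re.IGNORECASE)

names = ["plan.pdf", "plan.PDF", "plan.pDf", "plan_pdf", "plan.pd"]

group_hits = [bool(PDF_GROUPS.search(name)) for name in names]
flag_hits = [bool(PDF_FLAG.search(name)) for name in names]

# Without the backslash, "." matches any character, so "_pdf" is accepted.
unescaped_hit = bool(re.search(r".(p|P)(d|D)(f|F)", "plan_pdf"))

print(group_hits)
print(flag_hits)
print(unescaped_hit)
```

Both versions accept the same names here; the grouped form works in engines or tools where a case-insensitive flag cannot be set.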
REGEX pattern groups that can be assembled together:
To match: lower- or upper-case PDF, pdf, Pdf, PDf, pDf, pDF,…
REGEX: \.(p|P)(d|D)(f|F)
To match: 0-00 through 9-99
REGEX: (\d-\d{2}) where: \d = any 0-9 digit, {2} indicates 2 digits required
To match: 0-00 or 0 00 through 9-99 or 9 99
REGEX: (\d(-|\s{1})\d{2}) where: \s = whitespace (here, a space), {1} indicates exactly 1 space allowed, “|” means OR
To match: 000 through 999
REGEX: (\d{3}) where: \d = any 0-9 digit, {3} indicates 3 digits required
To match: 000.0 through 999.9
REGEX: (\d{3}\.\d) where: the \. means a period; using “.” without a preceding “\” has the special REGEX meaning of “any char”
To match: 0.0, 0.00, 00.0, 00.00, 000.0 or 000.00 through 9.9, 9.99, 99.9, 99.99, 999.9 or 999.99
REGEX: (\d{1,3}\.\d{1,2}) where: \d{1,3} means 1 to 3 digits, \. means a period (“.” without a preceding “\” means “any char”), \d{1,2} means 1 to 2 digits
To match: exactly one of the following one- or two-character codes, upper case only: A, B, C, A-, AI, V, T, TC
REGEX: ([ABCVT]|AI|A-|TC){1} where: “[set]” means either A or B or C or V or T, “|” means OR “AI” or “A-” or “TC”, {1} means exactly one of either the set or the OR’s
To match: the same one- or two-character codes, either case allowed
REGEX: ([ABCVTabcvt]|AI|Ai|aI|ai|A-|a-|TC|tC|Tc|tc){1} where: same as above but adding the lower-case options
To match: one space, underbar or dash
REGEX: (\s|_|-){1} where: \s = space, “|” means OR, {1} means exactly 1 of space OR “_” OR “-” is allowed
To match: a sheet title string or project string starting with either an underbar or space, which can include underbars, “&” and spaces, such as “_any string”, “ any string” (first char is space), “_any & string1”, “_detail_sheet1”
REGEX: ((\s|_)([a-zA-Z0-9]|\s|_|&){0,}) where: \s = space, “|” means OR, “[set]” is any char a through z, A through Z, or 0-9, {0,} means zero or more of the preceding group
To match: a sheet title string or project string starting with either an underbar, dash, dash plus space, or space, which can include underbars, “&” and spaces, such as “-any & string1”, “- detail_sheet1”, “_any string”, “ any string” (first char is space), “_any & string1”, “_detail_sheet1”
REGEX: ((\s|_|-|-\s)([a-zA-Z0-9]|\s|_|&){0,}) where: same as above, with “-” and “-\s” (a dash followed by a space) added as starting options
To match: a file name limited to 20 characters or fewer
REGEX: (?!.{21,}) where: (?!exp) is a negative lookahead; placed at the start of the pattern (right after ^), it rejects the string if 21 or more characters of any kind (.) follow, so only names of 20 characters or fewer are accepted
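To illustrate how the groups listed above can be assembled, here is a minimal Python sketch. The file name layout (three digits, a separator, a code, an optional title, then the extension) and the sample names are hypothetical, chosen only to exercise several groups at once; the length-limit lookahead is placed immediately after ^ so that it inspects the whole name.

```python
import re

LENGTH_LIMIT = r"(?!.{21,})"                      # reject 21 or more characters
DIGITS = r"(\d{3})"                               # 000 through 999
SEPARATOR = r"(\s|_|-){1}"                        # one space, underbar or dash
CODE = r"([ABCVT]|AI|A-|TC){1}"                   # upper-case code group
TITLE = r"((\s|_)([a-zA-Z0-9]|\s|_|&){0,})"       # title starting with space or underbar
PDF = r"\.(p|P)(d|D)(f|F)"                        # .pdf in any case

# Hypothetical layout: DDD, separator, code, optional title, extension.
BODY = DIGITS + SEPARATOR + CODE + TITLE + "?" + PDF
FILE_NAME = re.compile("^" + LENGTH_LIMIT + BODY + "$")
NO_LIMIT = re.compile("^" + BODY + "$")

names = [
    "101-A.pdf",
    "101_TC.PDF",
    "101 AI_a & b.pdf",
    "101-A_detail sheet one.pdf",   # valid layout, but 26 characters long
    "101-X.pdf",                    # X is not one of the permitted codes
]

accepted = [bool(FILE_NAME.match(name)) for name in names]
print(accepted)
```

The fourth name passes `NO_LIMIT` and fails only because of the lookahead, which is one way to confirm the length check is doing the rejecting.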
Most of the naming standard regex patterns can be created by using “grouping”. Breaking the file name standard into “groups” helps in creating a REGEX pattern for the entire file name.
Combining groups to check a file name pattern
To match DDD-DD_DD.PDF or DDD.DD_DD.PDF
This pattern has 4 groups DDD-DD_DD.PDF or DDD.DD_DD.PDF
Group 1: the DDD (\d{3})
Group 2: a dash or period (-|\.) - either the “-“ OR the “.”
Group 3: the DD_DD (\d{2}_\d{2})
Group 4: the PDF part (\.(p|P)(d|D)(f|F))
REGEX: (\d{3})(-|\.)(\d{2}_\d{2})(\.(p|P)(d|D)(f|F))
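The four groups can be joined and checked in a few lines of Python. This is a sketch under one assumption not stated above: `re.fullmatch` is used so the whole file name must fit the pattern, which has the same effect as wrapping it in ^ and $. The sample names are made up.

```python
import re

GROUP_1 = r"(\d{3})"                 # DDD
GROUP_2 = r"(-|\.)"                  # dash or period
GROUP_3 = r"(\d{2}_\d{2})"           # DD_DD
GROUP_4 = r"(\.(p|P)(d|D)(f|F))"     # .PDF in any case

PATTERN = re.compile(GROUP_1 + GROUP_2 + GROUP_3 + GROUP_4)

match = PATTERN.fullmatch("123-45_67.pdf")
# Groups are numbered by their opening parenthesis, left to right.
parts = [match.group(i) for i in (1, 2, 3, 4)]
print(parts)

names = ["123-45_67.pdf", "123.45_67.PDF", "12-45_67.pdf", "123_45_67.pdf"]
accepted = [bool(PATTERN.fullmatch(name)) for name in names]
print(accepted)
```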
Optional patterns – using OR on groups…
To match DDD-DD_DD.PDF or CCDD-DD_DD.PDF, where the first C (char) can be A, B or C and the second C can be B, D, G or Z (for example BB00-01_01.PDF or AZ20-99_01.PDF)
This pattern has 4 groups DDD-DD_DD.PDF or CCDD-DD_DD.PDF
Group 1: the DDD (\d{3})
OR the CCDD: one of the set ABC, one of the set BDGZ, and 2 digits (([ABC])([BDGZ])(\d{2}))
Together – using an OR for either group, wrapped in one outer group so the OR applies only to this part: ((\d{3})|(([ABC])([BDGZ])(\d{2})))
Group 2: the dash (-)
Group 3: the DD_DD (\d{2}_\d{2})
Group 4: the PDF part (\.(p|P)(d|D)(f|F)) the .PDF
REGEX: ((\d{3})|(([ABC])([BDGZ])(\d{2})))(-)(\d{2}_\d{2})(\.(p|P)(d|D)(f|F))
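The sketch below checks an OR of two groups in Python. It assumes a balanced form of the pattern in which both alternatives sit inside one outer group and the separator is the dash described for Group 2; `re.fullmatch` and the sample names are also assumptions for illustration. A second, "loose" pattern without the outer group is included to show why the wrapping matters: there the OR splits the entire pattern in two.

```python
import re

# Both alternatives wrapped in one outer group, then the remaining groups.
STRICT = re.compile(
    r"((\d{3})|(([ABC])([BDGZ])(\d{2})))(-)(\d{2}_\d{2})(\.(p|P)(d|D)(f|F))"
)

# Without the outer group, "|" separates (\d{3}) from everything after it.
LOOSE = re.compile(
    r"(\d{3})|(([ABC])([BDGZ])(\d{2}))(-)(\d{2}_\d{2})(\.(p|P)(d|D)(f|F))"
)

names = [
    "BB00-01_01.PDF",
    "AZ20-99_01.PDF",
    "123-45_67.pdf",
    "DZ20-99_01.PDF",   # D is not in the first set
    "AA20-99_01.pdf",   # A is not in the second set
]

strict_hits = [bool(STRICT.fullmatch(name)) for name in names]
print(strict_hits)

# The loose form accepts a bare "123" and rejects a valid DDD name.
loose_bare = bool(LOOSE.fullmatch("123"))
loose_full = bool(LOOSE.fullmatch("123-45_67.pdf"))
print(loose_bare, loose_full)
```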
Creating a group for specific character set
To create a group that would only allow for these specific characters (upper and lower case) to be accepted:
([GHVBCLSAIQFPDMEWTRXZOghvbclsaiqfpdmewtrxzo]) a single character in this REGEX set will be accepted.
Creating a pattern based on the National CAD Standard construct (see United States National CAD Standard - V6: Uniform Drawing System Module 1)
The file naming pattern…
A1A2DDD.DD-UUU.PDF where .DD is optional, -UUU is optional
Examples:
A-001.pdf
a-001.pdf
A-001.01-R1.pdf
el001.01-X1.PDF
Assumptions: 1. accept either upper or lower case and 2. no sheet title/project title string (example with this given below)
Permitted characters for “A”
A1 = 1st level: ABCDEFGHILMOPQRSTUVWXZabcdefghilmopqrstuvwxz
A2 = 2nd level: -ABCDEFGHIJKLMNOPQRSTUVWXYabcdefghijklmnopqrstuvwxy
Create a REGEX group using a REGEX “set”…
A1= ([ABCDEFGHILMOPQRSTUVWXZabcdefghilmopqrstuvwxz])
A2= ([-ABCDEFGHIJKLMNOPQRSTUVWXYabcdefghijklmnopqrstuvwxy])
Create a REGEX group for the DDD
DDD = (\d{3}) total of three digits (000-999)
Create a REGEX group for the .DD
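.DD = (\.\d{2}) period (escaped with a backslash) followed by a total of two digits (00-99)
Create a REGEX group for the -UUU (U can be a digit or a character A-Z or a-z)
-UUU = (-(\w|\d){1,3}) a dash followed by at least 1 and up to a max of 3 characters (A-Z or a-z) or digits (0-9). Note that \w already covers digits and also accepts the underscore; replace (\w|\d) with [a-zA-Z0-9] if an underscore must be rejected.
Since .DD and -UUU are optional, and .DD must appear before -UUU if both are present, use two ORs as a combined group and make the whole group optional with a trailing ?
((\.\d{2})|(-(\w|\d){1,3})|(\.\d{2})(-(\w|\d){1,3}))? Either .DD or -UUU or .DD-UUU, or neither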
Putting the groups together with OR’s for the options .DD and -UUU and adding the PDF group
^([ABCDEFGHILMOPQRSTUVWXZabcdefghilmopqrstuvwxz])([-ABCDEFGHIJKLMNOPQRSTUVWXYabcdefghijklmnopqrstuvwxy])(\d{3})((\.\d{2})|(-(\w|\d){1,3})|(\.\d{2})(-(\w|\d){1,3}))?\.(p|P)(d|D)(f|F)$
The above is one continuous text line for the REGEX pattern: A1A2DDD.DD-UUU.PDF
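The complete pattern can be checked against the example file names with a short Python sketch. It assumes a form of the pattern with the period in .DD escaped, no space between the A1 and A2 groups, and the optional .DD/-UUU group followed by ? so that a name with neither part is still accepted. The rejected names are made up for illustration.

```python
import re

A1 = r"([ABCDEFGHILMOPQRSTUVWXZabcdefghilmopqrstuvwxz])"
A2 = r"([-ABCDEFGHIJKLMNOPQRSTUVWXYabcdefghijklmnopqrstuvwxy])"
DDD = r"(\d{3})"
DD = r"(\.\d{2})"            # escaped period, then two digits
UUU = r"(-(\w|\d){1,3})"     # dash, then one to three word characters
PDF = r"\.(p|P)(d|D)(f|F)"

# .DD, -UUU, .DD-UUU, or (because of the trailing ?) neither.
OPTIONAL = "(" + DD + "|" + UUU + "|" + DD + UUU + ")?"

NCS = re.compile("^" + A1 + A2 + DDD + OPTIONAL + PDF + "$")

examples = ["A-001.pdf", "a-001.pdf", "A-001.01-R1.pdf", "el001.01-X1.PDF"]
rejects = [
    "A-01.pdf",          # only two digits in DDD
    "A-001-R1.01.pdf",   # -UUU placed before .DD
    "J-001.pdf",         # J is not a permitted A1 character
    "A-001x01.pdf",      # "x" where the period of .DD is required
]

example_hits = [bool(NCS.match(name)) for name in examples]
reject_hits = [bool(NCS.match(name)) for name in rejects]
print(example_hits)
print(reject_hits)
```

The last rejected name is the case an unescaped period would let through, since a bare "." accepts any character in that position.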