CLR Assembly RegEx Functions for SQL Server by Example

来源:互联网 发布:javascript 切换页面 编辑:程序博客网 时间:2024/05/16 12:28
CLR Assembly RegEx Functions for SQL Server by Example
15 April 2009
by Phil Factor

 

Phil Factor presents a simple CLR Assembly to allow SQL Server users to access all the powerful RegEx library methods in .NET. In the spirit of 'get something up and running', he takes a hands-on approach and demonstrates that you needn't be a C# hotshot or Visual Studio expert to  develop a CLR assembly for SQL Server

Writing CLR-based functions and procedures for SQL Server isn’t necessarily a complex task. It doesn’t even need Visual Studio. It is possible to get things up and running very quickly.  For this example, I’ve chosen to revisit the hoary old example of the RegEx package. There is nothing heroic about doing so since it is a tiny smear of code that merely accesses the .NET Framework regular expression engine.  I decided on the CLR RegEx package because I’ve never found a complete example: The sample code that is readily available generally leaves something important out.

I decided to use VB.NET, so as to provide something fresh, and thought it might be interesting to do it without using Visual Studio at all, just to prove it could be done with just SQLExpress.  For the SQL programmer, there is no need to even look at the source code as it can be installed merely from the CREATE ASSEMBLY script which I’ve included. The assembly-creation script can be executed in the same way as any SQL.

The CLR assembly

The Visual Basic Source is here. It was too long to place in the article itself. You don’t need it if you just to try out the package, only if you want to modify or extend it.

This assembly gives you a reasonably full palette of RegEx functions within SQL Server such as…

RegExReplace 
replaces the substring matching the expression with a result string that can include the matches. This works the same way as the SQL Server REPLACE function, except it takes a full RegEx instead of the LIKE expression, and allows substitution of the matches in the replacement string using $0, $1…$n.
RegExReplaceX 
like RegExReplace, except that the parameters are in the more normal order of  pattern, input, replacement, options. The options parameter allows the RegEx to be used with a number of different options
RegExSplit
split a string into a table basing each row-delimiter on the substring that matches the expression, returns the table.
RegExMatch
returns the first substring that matches the expression
RegExIsMatch
returns non-zero if there is a match of the RegEx within the string
RegExMatches  
returns a table of all the substring that match the expression within the string
RegExEscape
returns the input string with all characters that could be interpreted as part of a RegEx expression ‘escaped’. This allows strings to be inserted into RegExes
RegExIndex
returns the index into the string of any substring that matches the RegEx. It works the same way as PatIndex
RegExOptionEnumeration
returns a bitmap of the options you select that controls the way that the RegEx is used.

Although this looks like a comprehensive list, it isn’t complete. The NET RegEx package allows you to specify the maximum number of matches or splits to be returned. I didn’t include it because it seemed less useful, and it would be a simple feature to add. I also avoided returning capturing groups, to RegExMatch and RegExMatches. This would have meant returning a de-normalized table. There are uses for it but the source of such a CLR routine it is already in the SQL Server  sample code if you need it.

SQL Server Assemblies can be written in any language that can produce a .NET assembly, though C# and VB .net are most often used.  They are loaded into SQL Server but are not directly visible to the database developer. Once they are in SQL Server, they can be easily distributed, saved, and loaded into other servers using TSQL.  Once you have them installed in SQL Server, you can get to the source very easily using .NET Reflector with an add-in, so it becomes very easy to modify them. You have no need for Visual Studio either. You can merely use the tools that are already provided within the .NET Framework to create and alter the code for these assemblies.

Creating and using a simple CLR in SQL Server

We’ll create a simple RegEx CLR function without Visual Studio and SSMS just to prove that it isn’t at all complex. Let’s take a very simple example first. We then have something we can expand on once we have it working. Here is a simple, but  useful, implementation of the  RegEx.IsMatch function from the .NET Regular Expression (RegEx) class. First, a bit of Visual Basic.

Imports System

Imports System.Data.Sql

Imports Microsoft.SqlServer.Server

Imports System.Data.SqlTypes

Imports System.Runtime.InteropServices

Imports System.Text.RegularExpressions

Imports System.Collections 'the IEnumerable interface is here 

 

 

Namespace SimpleTalk.Phil.Factor

    Public Class RegularExpressionFunctions

        'RegExIsMatch function

        <SqlFunction(IsDeterministic:=True, IsPrecise:=True)> _

        Public Shared Function RegExIsMatch( _

                                            ByVal pattern As SqlString, _

                                            ByVal input As SqlString, _

                                            ByVal Options As SqlInt32) As SqlBoolean

            If (input.IsNull OrElse pattern.IsNull) Then

                Return SqlBoolean.False

            End If

            Dim RegExOption As New System.Text.RegularExpressions.RegExOptions

            RegExOption = Options

            Return RegEx.IsMatch(input.Value, pattern.Value, RegExOption)

        End Function

    End Class      '

End Namespace

This just used the IsMatch method of the RegEx . It indicates whether the regular expression finds a match in the input string, using the regular expression specified in the pattern parameter and the matching options supplied in the options parameter. The real script is much longer and can be downloaded from the speechbubble.

We simply create an assembly from the command line (just put it into your programmer’s  editor as a ‘tool’-  I did all the development work for this project in EditPad Pro.)

%PATH_TO_FRAMEWORK%/vbc.exe /t:library "%FILE%"

Where "%FILE%" is the full path and name of the file, and %PATH_TO_FRAMEWORK%/ is the path to the framework you are using.

Then a simple bit of SQL will have you up and running (this can be run by SQLCMD) Remember to substitute the file and path of the assembly for %FILE%

sp_configure 'clr enabled', 1

RECONFIGURE WITH OVERRIDE

 

IF EXISTS ( SELECT   1

            FROM     sys.objects

            WHERE    object_id = OBJECT_ID(N'dbo.RegExIsMatch') )

   DROP FUNCTION dbo.RegExIsMatch

go

 

IF EXISTS ( SELECT   1

            FROM     sys.assemblies asms

            WHERE    asms.name = N'RegExFunction ' )

   DROP ASSEMBLY [RegExFunction]

 

CREATE ASSEMBLY RegExFunction

           FROM '%FILE%’

GO

 

CREATE FUNCTION RegExIsMatch

   (

    @Pattern NVARCHAR(4000),

    @Input NVARCHAR(MAX),

    @Options int

   )

RETURNS BIT

AS EXTERNAL NAME

   RegExFunction.[SimpleTalk.Phil.Factor.RegularExpressionFunctions].RegExIsMatch

GO

 

--a few tests

---Is this card a valid credit card?

SELECT dbo.RegExIsMatch ('^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35/d{3})/d{11})$','4241825283987487',1)

--is there a number in this string

SELECT dbo.RegExIsMatch( '/d','there is 1 thing I hate',1)

--Verifies number Returns 1

DECLARE @pattern VARCHAR(255)

SELECT @pattern ='[a-zA-Z0-9]/d{2}[a-zA-Z0-9](-/d{3}){2}[A-Za-z0-9]'

SELECT  dbo.RegExIsMatch (@pattern, '1298-673-4192',1),

        dbo.RegExIsMatch (@pattern,'A08Z-931-468A',1),

        dbo.RegExIsMatch (@pattern,'[A90-123-129X',1),

        dbo.RegExIsMatch (@pattern,'12345-KKA-1230',1),

        dbo.RegExIsMatch (@pattern,'0919-2893-1256',1)

 

Running the full project

So now we are ready for the full project. You can either compile the DLL from the source code, or you can execute the CREATE ASSEMBLY from the build script if you are just interested in taking the code for a trip around the block.

To compile the full project, download the VB source and compile it using the following command in the command-line

%PATH_TO_FRAMEWORK%/vbc.exe /t:library "%FILE%"

Where "%FILE%" is the full path and name of the file, and %PATH_TO_FRAMEWORK%/ is the path to the framework you are using.

Then, from SSMS or SQLCMD run the …
        InstallRegExSample.sql file,
…which can be downloaded from the speech bubble

If you want to avoid compiling the DLL from the VB source, or you want to run the CLR assembly on a remote computer, run the build script,( which can be downloaded from the speech bubble)

followed by, from SSMS or SQLCMD, the …
        InstallRegExAssembly.sql file,
…which can be downloaded from the speech bubble

You can view the two files, the  InstallRegExSample.sql  and  InstallRegExAssembly.sql files. Amongst other things, they will tell you what parameters are passed, and in what order, for every SQLCLR function.

A Trip Around the Block.

/* Probably the simplest routines for a SQL programmer is the RegExReplace and the RegExIndex. This is because they work the same way as does the REPLACE() and PatIndex() */

Select Replace ('this is a revolting view','revolting', 'stunning')

Select dbo.RegExReplace('this is a revolting view','revolting', 'stunning')

/* Both give the same result. However, with RegExReplace, we can use any RegEx pattern instead of a string

We'll create an insert script from a comma-delimited list*/

select dbo.RegExReplace(

'Sprocket,6.26,Paris

widget,2.476,London

Bucket,8.25,New Orleans',

'^("[^"/r/n]*"|[^,/r/n]*),("[^"/r/n]*"|[^,/r/n]*),("[^"/r/n]*"|[^,/r/n]*)',

'insert into MyTable select ''$1'',$2,''$3'' --$0'

)

/* which gives the result...

insert into MyTable select 'Sprocket',6.26,'Paris' --Sprocket,6.26,Paris

insert into MyTable select 'widget',2.476,'London' --widget,2.476,London

insert into MyTable select 'Bucket',8.25,'New Orleans' --Bucket,8.25,New Orleans

 

 

We have a powerful device here. We can put the string that matches the pattern into the result string using the $1, $2 $3 ... convention.

 

We can do exactly the same thing with RegExReplacex, which gives us a finer control over how the RegEx is executed. */

 

select dbo.RegExReplacex('^("[^"/r/n]*"|[^,/r/n]*),("[^"/r/n]*"|[^,/r/n]*),("[^"/r/n]*"|[^,/r/n]*)',

'Sprocket,6.26,Paris

widget,2.476,London

Bucket,8.25,New Orleans',

'insert into MyTable select ''$1'',$2,''$3'' --$0',

 dbo.RegExOptionEnumeration(1,1,0,0,0,0,0,0,0)

)

 

--remove repeated words in text

SELECT  dbo.RegExReplace('Sometimes I cant help help help stuttering','/b(/w+)(?:/s+/1/b)+', '$1')

 

--find a #comment and add a TSQL --

SELECT  dbo.RegExReplace('

# this is a comment

first,second,third,fourth','#.*','--$&',1)

 

--replace a url with an HTML anchor

SELECT  dbo.RegExReplacex(

        '/b(https?|ftp|file)://([-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|])',

         'There is  this amazing site at http://www.simple-talk.com',

        '<a href="$2">$2</a>',1)

 

--strip all HTML elements out of a string

SELECT  dbo.RegExReplace('<a href="http://www.simple-talk.com">Simle Talk is wonderful</a><!--This is a comment --> we all love it','<(?:[^>''"]*|([''"]).*?/1)*>',

   '')

 

 

/*But there are a whole lot of other things we can do.

Let's return the first number in a string*/

SELECT dbo.RegExMatch('/d+', ' somewhere there is a number 4567 and then more ',1)

-- 4567

--escape a literal string so it can be part of a regular expression

SELECT dbo.RegExEscape(' I might need to search for [*///*]')

/* this would become useful if you wanted to insert a literal string into a RegEx. How about, for example, you want to search a string for a substring where two words are near each other (at most four words apart) in either order.*/

Declare @String1 varchar(80), @String2 varchar(80), @RegEx Varchar(200)

Select  @String1= dbo.RegExEscape('often'),

           @String2= dbo.RegExEscape('wrong')

Select @RegEx=

'/b(?:'+@String1+'(?:/W+/w+){0,4}?/W+'+@String2+'|'+@String2+'(?:/W+/w+){0,4}?/W'+@String1+')/b'

SELECT dbo.RegExMatch(@RegEx,'A RegEx expression can often be wrong but it is usually possible to put it right.',1)

--split a string into words

SELECT * FROM dbo.RegExSplit('/W+','this is an exciting  regular   expression',1)

--Find if the words 'Simple' and 'Talk' are within three words distant

Select dbo.RegExIsMatch('/bsimple(?:/W+/w+){0,3}?/W+talk/b',

'It is simple to say that there is talk of

a wonderful website called Simple Talk',1)

--Find the words 'Simple' and 'Talk' within three words distant

Select dbo.RegExIndex('/bsimple(?:/W+/w+){0,3}?/W+talk/b',

'It is simple to say that there is talk of

a wonderful website called Simple Talk',1)

/* we can return a table of every repeating word in a string (along with the index intyo the string and the length of the match) */

select * from RegExMatches(

'/b(/w+)/s+/1/b',--match any repeated word

'i have had my ups and downs

but wotthehell wotthehell

yesterday sceptres and crowns

fried oysters and velvet gowns

and today i herd with bums

but wotthehell wotthehell

i wake the world from sleep

as i caper and sing and leap

when i sing my wild free tune

wotthehell wotthehell

under the blear eyed moon

i am pelted with cast off shoon

but wotthehell wotthehell',3)

 

--get valid dates and convert to SQL Server format

SELECT DISTINCT CONVERT(DATETIME,match,103) FROM dbo.RegExMatches ('/b(0?[1-9]|[12][0-9]|3[01])[- /.](0?[1-9]|1[012])[- /.](19|20?[0-9]{2})/b','

12/2/2006 12:30 <> 13/2/2007

32/3/2007

2-4-2007

25.8.2007

1/1/2005

34/2/2104

2/5/2006',1)

 

 

/* There are a number of ways we can use the RegExIsMatch function. Here are a few simple examples */

--is there a repeated word?

SELECT dbo.RegExIsMatch('/b(/w+)/s+/1/b','this has has been repeated',1)--1

SELECT dbo.RegExIsMatch('/b(/w+)/s+/1/b','this has not been repeated',1)--0

 

--Is the word 'for' and 'last' up to 2 words apart)

SELECT dbo.RegExIsMatch('/bfor(?:/W+/w+){0,2}?/W+last/b',

           'You have failed me for the last time, Admiral',1)--1

SELECT dbo.RegExIsMatch('/bfor(?:/W+/w+){1,2}?/W+last/b',

           'You have failed me for what could be the last time, Admiral',1)--0

 

--is this likely to be a valid credit card?

SELECT dbo.RegExIsMatch('^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6011[0-9]{12}|3(?:0

[0-5]|[68][0-9])[0-9]{11}|3[47][0-9]{13}|(?:2131|1800)/d{11})$','4953129482924435',1)        

--IS this a valid ZIP code

SELECT dbo.RegExIsMatch('^[0-9]{5,5}([- ]?[0-9]{4,4})?$','02115-4653',1)

 

--is this a valid Postcode?

SELECT dbo.RegExIsMatch('^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha

-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))

) {0,1}[0-9][A-Za-z]{2})$','RG35 2AQ',1)

 

--is this a valid European date?

SELECT dbo.RegExIsMatch('^((((31//(0?[13578]|1[02]))|((29|30)//(0?[1,3-9]|1[0-2])))//(1[

6-9]|[2-9]/d)?/d{2})|(29//0?2//(((1[6-9]|[2-9]/d)?(0[48]|[2468][048]|[13579][26])|((16

|[2468][048]|[3579][26])00))))|(0?[1-9]|1/d|2[0-8])//((0?[1-9])|(1[0-2]))//((1[6-9]|[2

-9]/d)?/d{2})) (20|21|22|23|[0-1]?/d):[0-5]?/d:[0-5]?/d$','12/12/2007 20:15:27',1)

 

--is this a valid currency value (dollar)?

SELECT dbo.RegExIsMatch('^/$(/d{1,3}(/,/d{3})*|(/d+))(/./d{2})?$','$34,000.00',1)

 

--is this a valid currency value (Sterling)?

SELECT dbo.RegExIsMatch('^/&pound;(/d{1,3}(/,/d{3})*|(/d+))(/./d{2})?$',

'&pound;34,000.00',1)

 

--A valid email address?

SELECT dbo.RegExIsMatch('^(([a-zA-Z0-9!#/$%/^&/*/{/}''`/+=-_/|//?]+(/.[a-zA-Z0-9!#/$%/^&

/*/{/}''`/+=-_/|//?]+)*){1,64}@(([A-Za-z0-9]+[A-Za-z0-9-_]*){1,63}/.)*(([A-Za-z0-9]+[A

-Za-z0-9-_]*){3,63}/.)+([A-Za-z0-9]{2,4}/.?)+){1,255}$','Phil.Factor@simple-Talk.com',1)

The RegEx Options

RegEx functions can work in different ways by setting the option bitmap which is passed to the various functions. The most important of these is the flag that tells the RegEx to do case-insensitive searches. This project has a special function that translates these options into a bitmap. An example of its use was given in the ‘Trip around the block’, above

select dbo.RegExReplacex('^("[^"/r/n]*"|[^,/r/n]*),("[^"/r/n]*"|[^,/r/n]*),("[^"/r/n]*"|[^,/r/n]*)',

'Sprocket,6.26,Paris

widget,2.476,London

Bucket,8.25,New Orleans',

'insert into MyTable select ''$1'',$2,''$3'' --$0',

 dbo.RegExOptionEnumeration(1,1,0,0,0,0,0,0,0)

)

And the function that implements it  it is..

CREATE FUNCTION RegExOptionEnumeration

     (

     @IgnoreCase bit,

        @MultiLine bit,

        @ExplicitCapture bit,

        @Compiled  bit,

        @SingleLine  bit,

        @IgnorePatternWhitespace  bit,

        @RightToLeft  bit,

        @ECMAScript  bit,

        @CultureInvariant  bit

        )

returns int

AS EXTERNAL NAME

   RegExFunction.[SimpleTalk.Phil.Factor.RegularExpressionFunctions].RegExOptionEnumeration

 

go

These flags, in order, are:

IgnoreCase
By default, the regular expression is case sensitive. In the examples we used, we set it to ignore case

 

The Multiline property
The caret and dollar only match at the very start and very end of the subject string by default. If your subject string consists of multiple lines separated by line breaks, you can make the caret and dollar match at the start and the end of those lines by setting the Multiline property to True. (there is no option to make the dot match line break characters).

 

Explicit Capture
Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>). This allows unnamed parentheses to act as noncapturing groups without the syntactic clumsiness of the expression (?:).

 

Compiled
Specifies that the regular expression is compiled to an assembly. This yields faster execution but increases startup time.

 

Singleline
Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except/n).

 

IgnorePatternWhitespace
Eliminates unescaped white space from the pattern and enables comments marked with #.
RightToLeft
Specifies that the search will be from right to left instead of from left to right
ECMAScript
Enables ECMAScript-compliant behaviour for the expression. This flag can be used only in conjunction with the IgnoreCase, Multiline, and Compiled flags. The use of this flag with any other flags results in an exception.
CultureInvariant
Specifies that cultural differences in language is ignored

 

 

 

So what next?

I’m hoping that this project can be appreciated on several levels. Firstly, it gives what I hope is a working implementation of a SQLCLR RegEx library of functions that is immediately useful. Secondly, I want to show how easy it is to develop  CLR routines in your favourite .NET language. Thirdly, it would be nice to think that it could be developed and extended to  support more complex RegEx operations, such as those involving matching groups .

If you are going to make extensive use of the package, then you'll find everything you need to know in the

  • function definitions within the InstallRegex*.SQL files
  • Documentation for the System.Text.RegularExpressions Namespace
  • VB Source of the CLR assembly  that accompanies the article.

I'd be fascinated to hear of any improvements or bugfixes you make.



This article has been viewed 260 times.

Phil Factor

Author profile: Phil Factor

Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 25 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career. See also :

  • The Phrenetic Phoughts of Phil Factor Phil on Twitter
  • Phil on SQL Server Central
  • Phil on Business Of Software
  • Robyn and Phil's Workbenches

 

Search for other articles by Phil Factor