Thursday, January 12, 2006 - Posts

Script Transform Component to Cleanse Data

In yesterday's post, I showed you how to cleanse data using the Script Component acting as a transform. Let's take it a lot farther by doing some more advanced expressions. In this very typical scenario, you receive data that's all upper case from a mainframe. You want to cleanse the data to where the first letter of every word is upper case and everything else is lower case. To do this, drag over the Script Component and select Transform. Check the input column that you wish to send into the script. For my example, let's use the input name of InputName. I'm going to send into the pipeline a column called OutputValue. I add this column in the Inputs and Outputs page.

Once this is selected, you can select Design Script on the Script page. The script will look like something like the below script. In order to make things a bit simplier,  you'll want to import the System.Text.RegularExpressions namespace. As you can see, I'm using RegularExpressions to parse through the string and find when one word ends and another begins. You'll want to make sure you cleanse the input for Nulls before that chunk of the code executes. The MatchHandler function takes the two groups of strings and upper cases one and lower cases another.

All you would have to essentially do to use this code is change the InputValue and OutputValue columns in the code and you're ready to go.

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports System.Text.RegularExpressions

Public Class ScriptMain
   
Inherits UserComponent

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

   Dim myDelegate As New MatchEvaluator(AddressOf MatchHandler)
   Dim strpattern As String = "\b(\w)(\w+)?\b"

   Dim re As New Text.RegularExpressions.Regex( _
   strpattern, RegexOptions.Multiline
Or _
   RegexOptions.IgnoreCase _
   )

'Will check to see if there is a NULL value before trying the function.

If Row.InputName_IsNull = False Then
   
 Row.OutputName= re.Replace(Row.InputName, myDelegate)
End If

  
End Sub

Private Function MatchHandler(ByVal m As Match) As String
   Return m.Groups(1).Value.ToUpper() & m.Groups(2).Value.ToLower()
End Function

End
Class

It's important to note that I'm not a coder so I'm sure my code may not be the most prestine. :)

For more on how to use the Script Component as a transform, see yesterday's post.

Update: A much cleaner approach to doing proper case was sent to me by Toby Rogers and can be seen in this post.

-- Brian Knight

How to use a Script Transformation

Using a Script transformation is a handy way to extend what transforms are available to you. For example, picture social security data that goes into your Script transformation unencrypted and then out encrypted. Typically, you can also use this type of custom script to extend what a Derived Column task can do. If you have the choice between the two though, always choose a Derived Column.

A good extension of a Derived Column transformation into a Script transformation would be where you need to replace certain text with other text. If you have only two options, you could use the REPLACE expression inside a Derived Column transform. When you go above two, the Replace expression becomes less elegant. So, in comes the Script transform.

After connecting the Script transform to the data flow, go to the Input Column task for the transform and check the columns that will be sent into the script as an input. Next, go to the Inputs and Outputs tab and click Add Output. Name the columns that you wish to output. I’m going to call this one OutputValue for this example.  Next, click Design Script.

In this script you can see that first we localize the variable. Because I can’t predict whether the user inputted mixed, lower or upper case, I use the UCASE function to upper case the input value for comparison purposes. The word InputValue in Row.InputValue would be replaced with whatever you checked in the input screen. Then, the main thing to remember is Row.OutputValue will set the OutputValue output we set earlier will be set to the new value. So this code will take the input of whatever the color is and translate it to the hex value. Then, the OutputValue output will be presented to the pipeline and can be consumed by the destination or the next transform in the pipeline.

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

Public Class ScriptMain
Inherits UserComponent

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

Dim HexValue As String

Select Case UCase(Row.InputValue)
   
Case "RED"
      HexValue = "#FF0000"
   Case "YELLOW"
      HexValue = "#FFFF00"
   Case "GREEN"
      
HexValue = "#008000"
End Select

Row.OutputValue = HexValue
End Sub
End
Class 

-- Brian Knight