Data Flow Tasks
Transforming Data from Source to Destination
Contents
- Data Flow Sources
- Data Flow Destinations
- Data Flow Transformations
- Aggregate Transformation
- Audit Transformation
- Cache Transformation
- Character Map Transformation
- Conditional Split
- Copy Column Transformation
- Data Conversion Transformation
- Data Mining Query
- Derived Column Transformation
- Export Column Transformation
- Fuzzy Grouping Transformation
- Fuzzy Lookup Transformation
- Import Column Transformation
- Lookup Transformation
- Merge Transformation
- Merge Join Transformation
- Multicast
- OLE DB Command Transformation
- Percentage and Row Sampling Transformation
- Pivot Transformation
- Unpivot Transformation
- Row Count Transformation
- Script Component Transformation
- Slowly Changing Dimension Transformation
- Sort Transformation
- Term Extraction Transformation
- Term Lookup Transformation
- Union All Transformation
Each task in the Control Flow of an SSIS package has a task editor, except the Data Flow task, which has its own tab, where various Data Flow tasks can be combined to configure the actual data flow. Data Flows consist of:
- Sources
- Transformations
- Destinations
The tasks used to move from data from a source to a destination are linked by paths connecting the output from one task to the input of another. Each path can be configured with a Data Viewer, that can be used in the development environment to visualise data at each stage of the Data Flow. To add a Data Viewer to a path, right-click the path and choose 'Configure Data Flow Path Editor'. Then click 'Add' to open the 'Configure Data Viewer' dialogue.
The Connection Manager pane shows active connections to sources and destination data containers. Creating a Connection via the solution explorer makes the connection shared across all packages in the solution. Connections for the current package only, are created via the Connection Manager Pane in the current packages Data Flow tab.
Data Flow Sources
Top BottomDefine the source for data for the data flow. There are a variety of Data Flow source types, with configuration options appropriate to the source of the data:
OLE DB Source
Top BottomProvides access to data sources from any OLE DB compliant database. Once a connection is defined for an OLE DB Source task, the data access mode can be set to:
- A Table or View in the Database
- An SQL Command
- An SSIS variable storing either of the above
Using an SQL command instead of a Table or View, allows you to filter the selected columns and rows that are retrieved from the datasource, and thus reduces network traffic and filtering overhead on the client-side. Using a Table or View as the data source, returns all the data to the the client which can then be filtered on the columns page of the data source task editor. Using SQL Commands also allows the use of parameter holders, and the data source task editor will allow you to map the parameter place holders to package/task variables. The Error Output page on the task editor allows you to specify how to handle rows with errors or truncated values. The options are:
- Redirect
- Will send the row in error through a different output path
- Ignore
- Will ignore the error and set the field value to NULL
- Fail
- Will cause the task to return a FAIL execution status
Excel Spreadsheet Source
Top BottomTreats an Excel Workbook as a database and each sheet as a table. Because of the mismatch in datatypes between Excel columns and SQL Server datatypes, conversion transformation may be necessary for Excel data sources
Flat File Source
Top BottomCan handle fixed-width, delimited or ragged-right data-sources. The flat file source is configured from the Connection Manager, and any changes to the source definition will require the Data Flow task to be refreshed to reflect the changes.
Other File Source
Top Bottom- Raw File sources
- a specialised version of Flat File sources, in a format specially optimised for use with SSIS
- XML
- for handling XML Data sources
- ADO.NET
- an ADO.NET data source handler
Data Flow Destinations
Top BottomData Flow Destinations are configured similarly to Data Flow Sources, with the additions of a Mapping page used to map input (External Columns) to output (Output Columns). Data Flow Destinations can not be configured until an input path has been assigned to the task. Data Flow Destinations include:
- OLE DB
- Excel
- Flat File
- Data Mining Model Training
- Data Reader
- Dimension and Partition Processing
- Raw File
- Recordset
- SQL Server and Mobile
Data Flow Transformations
Top BottomTransformations tasks sit between source and destination tasks, and are used to apply transformations to the data flow.
Aggregate Transformation
Top BottomAggregates data by applying a GROUP BY expression to an input column, and then applying aggregate functions to other columns to define the output stream for the transformation.
Available aggregate functions include:
- Group By
- Average
- Count
- Count Distinct
- Minimum
- Maximum
- Sum
Audit Transformation
Top BottomAdds auditing information regarding SSIS package to the output data flow. Available auditing fields to add are:
- Execution Instance GUID
- Package ID
- Package Name
- Version ID
- Execution Start Time
- Machine Name
- User Name
- Task Name
- Task ID
Cache Transformation
Top BottomUsed to load a cache file into the data flow for later use in a lookup transformation
Character Map Transformation
Top BottomPerforms character transformation in the Data Flow
Conditional Split
Top BottomUsed to send data from a single input path to multiple output paths based on one or more conditions written in SSIS expression language. Each condition or case is given a name, corresponding to the name of the transformations output paths. A default output path is also created, for any rows in the input flow that do not match any of the defined conditions. The various output paths can be brought together again with a Union All transformation or sent to different Data Flow Destinations.
| Expression | Description |
|---|---|
| customer_category == "NHS" | Selects records where the customer_category field value is "NHS" |
| customer_category == "NHS" || customer_category == "PP" | Selects records where the customer_category field value is either "NHS" or "PP" |
| customer_category == "NHS" || org_category == "NHS" | Selects records where the customer_category field value is "NHS" or the org_category field value is "NHS" |
| (customer_category == "NHS" || customer_category == "PP") && org_category == "NHS" | Selects records where the customer_category field value is either "NHS" or "PP" and the org_category field value is "NHS" |
Copy Column Transformation
Top BottomAdds a copy of a column from the input path with a new column name to the output path
Data Conversion Transformation
Top BottomAllows you to convert column data from one type to another. Similar in functionality to the CAST or CONVERT TSQL functions. The Data Conversion transformation results in the creation of an additional field in the pipeline representing the data after conversion.
Data Mining Query
Top BottomUsed to fill gaps or predict values for new columns based on the input data stream
Derived Column Transformation
Top BottomCreates a new column based on the value from an existing column using SSIS expressions. For example, the ISNULL operator can be used with the ternary operator to fill in blank fields. Alternatively SUBSTRING or TRIM functions can be used to remove blank spaces or truncate large fields. The derived column can be used either to replace the original column in the output stream, or added as a new column. Cast operators are available in the expression, to ensure the data type of the derived column will be able to match its final destination type.
| Expression | Description |
|---|---|
| (DT_STR,5,1252)"NHS" | Adds the string value "NHS" to the defined column in the dataflow as a 5 character string value |
| (DT_STR,1,1252)3 | Adds the string value "3" to the defined column in the dataflow as a 1 character string value |
| "B" + SUBSTRING(ordercat,1,1) | Returns a string consisting of a "B" followed by the first character of the ordercat field value |
| (DT_DBDATE)GETDATE() | Adds the current date value to the defined column in the dataflow as a database date type |
| LEN(ordertime) == 5 ? ordertime : (DT_STR,5,1252)NULL(DT_STR,5,1252) | Checks if the ordertime field value is 5 characters long. If it is then this value is used. If not, then a NULL string is used. Note that although the expression "NULL(DT_STR, 5, 1252)" is supposed to return a NULL string, the return value still needs to be cast as a string using "(DT_STR, 5, 1252)" |
| LEN(ordertime) == 4 ? SUBSTRING(ordertime,1,2) + ":" + SUBSTRING(ordertime,3,4) : (DT_STR,5,1252)NULL(DT_STR,5,1252) | If the ordertime field value is only 4 characters long, then insert a colon between the 2 and 3 character. Otherwise return a NULL string |
| RTRIM(SUBSTRING(category,1,4)) | Selects the first four characters of the "category" field, then removes any trailing spaces |
Export Column Transformation
Top BottomExports data from the data flow to a file. Can be used to create files from BLOB data in the input flow.
Fuzzy Grouping Transformation
Top BottomUsed to find redundant data in dimension tables for data cleanup
Fuzzy Lookup Transformation
Top BottomUsed to correct poor data quality during an ETL load
Import Column Transformation
Top BottomUsed to import files into BLOB fields
Lookup Transformation
Top BottomPerforms a lookup using columns from the input flow and a lookup table to retrieve additional columns from the lookup table for the output flow, either replacing input columns or adding new columns to the output.
The lookup table used can be specified as a table, view or SQL query. A join is specified on the columns tab between the input flow and the lookup table. Then the additional columns to add to the pipeline are selected. Failed lookups can force the component to fail, the row to be redirected or the lookup columns to be set to NULL.
Rows that fail the Lookup can be sent along a different output path by setting the value of the 'Specify how to handle rows with no matching entries' property. Thus it is possible to use the lookup to split the input data flow, without actually adding new columns of data, based on whether a matching record is found in the lookup or not.
Merge Transformation
Top BottomMerges data from two input paths to a single output path. Useful for joining error paths back to the main flow once the errors have been dealth with or for loading data from two distinct sources. Data from both sources must be sorted beforehand, must have matching metadata and only two input paths can be specified
Merge Join Transformation
Top BottomPerforms a join on two inputs to produce a single output stream
Multicast
Top BottomSends a single input data flow to multiple output paths. No filtering is available for the output data flow and each output path contains the same data flow content.
OLE DB Command Transformation
Top BottomUsed to execute an SQL statement for each row in an input stream. The statement can be a stored procedure call or a prepared SQL statement. The statement will typically contain one or more parameter markings, and the parameters are then mapped to the appropriate fields in the input flow and executed for each row in the data flow.
Percentage and Row Sampling Transformation
Top BottomAllows selection of a sample of an input stream to produce two output streams corresponding to selected and non-selected data
Pivot Transformation
Top BottomProduces cross-tabulated column data by summarising row data from the input stream
Unpivot Transformation
Top BottomUnpivots columns into rows
Row Count Transformation
Top BottomUsed to count rows from the input stream
Script Component Transformation
Top BottomUse custom scripts for sources, transformations and destinations. The first step in configuring a Script Component task is to identify which type of component it is:
- Source
- Destination
- Transformation
Slowly Changing Dimension Transformation
Top BottomDesigned to handle propagation of attribute changes in lookup tables
Sort Transformation
Top BottomUsed to sort data stream by specified columns
Term Extraction Transformation
Top BottomUsed to mine text data for word or phrase frequency
Term Lookup Transformation
Top BottomDoes Term Extraction combined with mapping from a lookup table
Union All Transformation
Top BottomTakes output from multiple sources and combines them into a single resultset. Used to bring together different paths in the data flow.
