Importing excel files having variable headers

Asked 22/11, 2017 at 14:45 Answered 1/12, 2017 at 20:36

Solved sql-server excel ssis etl sql-server-data-tools

I have the SSIS package, which will load the excel file into Database. I have created Excel Source task to map the excel column name to Database table column name and its working fine.

In rare case, We are receiving the excel file column name with some space (for example : Column name is "ABC" but we are receiving "ABC ") and which cause the mapping issue and SSIS got failed.

Is there any possible to trim the column name without opening the excel.

Note : Page name will be dynamic and Column position may change (eg: Column "ABC may exist in first row or second row or ..").

Goldengoldenberg answered 22/11, 2017 at 14:45 Comment(2)

When you say Note : Page name will be dynamic and Column position may change (eg: Column "ABC may exist in first row or second row or ..") are you meaning Note : Page name will be dynamic and Column position may change (eg: Column "ABC may exist in first column or second column or ..")? – Adiaphorous 28/11, 2017 at 18:3

@VigneshKumar you can combine if you consider BHouse and DrHouseofSQL as the first step then you go with Hadi solution. Then you have a 100% working solution. I think that you will not receive a better solution, because you are asking for a complex situation issue – Spano 29/11, 2017 at 15:40

First of all, my solution is based on @DrHouseofSQL and @Bhouse answers, so you have to read @DrHouseofSQL answer first then @BHouse answer then continue with this answer

Problem

Note : Page name will be dynamic and Column position may change (eg: Column "ABC may exist in first row or second row or ...

This situation is a little complex and can be solved using the following workaround:

Solution Overview

Add a script task before the data flow task that import the data
You have to use the script task to open the excel file and get the Worksheet name and the header row
Build the Query and store it in a variable
in the second Data Flow task you have to use the query stored above as source (Note that you have to set Delay Validation property to true)

Solution Details

First create an SSIS variable of type string (i.e. @[User::strQuery])
Add another variable that contains the Excel File Path (i.e. @[User::ExcelFilePath])
Add A Script Task, and select @[User::strQuery] as ReadWrite Variable, and @[User::ExcelFilePath] as ReadOnly Variable (in the script task window)
Set the Script Language to VB.Net and in the script editor window write the following script:

Note: you have to imports System.Data.OleDb

In the code below, we search the excel first 15 rows to find the header, you can increase the number if the header can be found after the 15 rows. Also i assumed that the columns range is from A to I

    m_strExcelPath = Dts.Variables.Item("ExcelFilePath").Value.ToString

    Dim strSheetname As String = String.Empty
    Dim intFirstRow As Integer = 0

    m_strExcelConnectionString = Me.BuildConnectionString()
    Try


        Using OleDBCon As New OleDbConnection(m_strExcelConnectionString)

            If OleDBCon.State <> ConnectionState.Open Then
                OleDBCon.Open()
            End If

            'Get all WorkSheets
            m_dtschemaTable = OleDBCon.GetOleDbSchemaTable(OleDbSchemaGuid.Tables,
                                                               New Object() {Nothing, Nothing, Nothing, "TABLE"})

            'Loop over work sheet to get the first one (the excel may contains temporary sheets or deleted ones

            For Each schRow As DataRow In m_dtschemaTable.Rows
                strSheetname = schRow("TABLE_NAME").ToString

                If Not strSheetname.EndsWith("_") AndAlso strSheetname.EndsWith("$") Then

                    Using cmd As New OleDbCommand("SELECT * FROM [" & strSheetname & "A1:I15]", OleDBCon)

                        Dim dtTable As New DataTable("Table1")


                        cmd.CommandType = CommandType.Text

                        Using daGetDataFromSheet As New OleDbDataAdapter(cmd)

                            daGetDataFromSheet.Fill(dtTable)

                            For intCount As Integer = 0 To 15

                                If Not String.IsNullOrEmpty(dtTable.Rows(intCount)(0).ToString) Then

                                    '+1 because datatable is zero based indexed, +1 because we want to start from the second row
                                    intFirstRow = intCount + 2

                                End If


                            Next



                        End Using

                        If intFirstRow = 0 Then Throw New Exception("header not found")

                    End Using

                    'when the first correct sheet is found there is no need to check others
                    Exit For

                End If
            Next

            OleDBCon.Close()

        End Using

    Catch ex As Exception
        Throw New Exception(ex.Message, ex)
    End Try


    Dts.Variables.Item("strQuery").Value = "SELECT * FROM [" & strSheetname & "A" & intFirstRow.ToString & ":I]"

    Dts.TaskResult = ScriptResults.Success
End Sub

Then you have to add an Excel connection manager, and choose the excel file that you want to import (just select a sample to define the metadata for the first time only)
Assign a default value of Select * from [Sheet1$A2:I] to the variable @[User::strQuery]
In the Data Flow Task add an Excel Source, choose SQL Command from variable, and select @[User::strQuery]
Go to the columns tab and name the columns in the same way that @BHouse suggested

Image taken from @BHouse answer

Set the DataFlow Task Delay Validation property to True
Add other components to DataFlow Task

UPDATE 1:

From the OP comments: sometimes excel with empty data will come.(i.e) we have only header row not not data... in that case it fails entire task

Solution:

If your excel file contains no data (only header) you have to do these steps:

Add an SSIS variable of type boolean *(i.e. @[User::ImportFile])
Add @[User::ImportFile] to the script task ReadWrite variables
In the Script Task check if the file contains rows
If yes Set @[User::ImportFile] = True, else @[User::ImportFile] = False
Double Click on the arrow (precedence constraint) that connect the script task to the DataFlow
Set its type to Constraint and Expression
Write the following expression
```
@[User::ImportFile] == True
```

Note: The new Script Task code is:

    m_strExcelPath = Dts.Variables.Item("ExcelFilePath").Value.ToString

    Dim strSheetname As String = String.Empty
    Dim intFirstRow As Integer = 0

    m_strExcelConnectionString = Me.BuildConnectionString()
    Try


        Using OleDBCon As New OleDbConnection(m_strExcelConnectionString)

            If OleDBCon.State <> ConnectionState.Open Then
                OleDBCon.Open()
            End If

            'Get all WorkSheets
            m_dtschemaTable = OleDBCon.GetOleDbSchemaTable(OleDbSchemaGuid.Tables,
                                                               New Object() {Nothing, Nothing, Nothing, "TABLE"})

            'Loop over work sheet to get the first one (the excel may contains temporary sheets or deleted ones

            For Each schRow As DataRow In m_dtschemaTable.Rows
                strSheetname = schRow("TABLE_NAME").ToString

                If Not strSheetname.EndsWith("_") AndAlso strSheetname.EndsWith("$") Then

                    Using cmd As New OleDbCommand("SELECT * FROM [" & strSheetname & "A1:I15]", OleDBCon)

                        Dim dtTable As New DataTable("Table1")


                        cmd.CommandType = CommandType.Text

                        Using daGetDataFromSheet As New OleDbDataAdapter(cmd)

                            daGetDataFromSheet.Fill(dtTable)

                            For intCount As Integer = 0 To 15

                                If Not String.IsNullOrEmpty(dtTable.Rows(intCount)(0).ToString) Then

                                    '+1 because datatable is zero based indexed, +1 because we want to start from the second row
                                    intFirstRow = intCount + 2

                                End If


                            Next



                        End Using





                    End Using

                    'when the first correct sheet is found there is no need to check others
                    Exit For

                End If
            Next

            OleDBCon.Close()

        End Using

    Catch ex As Exception
        Throw New Exception(ex.Message, ex)
    End Try

                If intFirstRow = 0 OrElse _
                   intFirstRow > dtTable.Rows.Count Then

                    Dts.Variables.Item("ImportFile").Value = False

                Else

                    Dts.Variables.Item("ImportFile").Value = True

                End If                    

    Dts.Variables.Item("strQuery").Value = "SELECT * FROM [" & strSheetname & "A" & intFirstRow.ToString & ":I]"

    Dts.TaskResult = ScriptResults.Success
End Sub

UPDATE 2:

From the OP comments: is there any other work around available to process the data flow task without skipping all data flow task,Actually one of the task will log the filename and data count and all, which are missing here

Solution:

Just add another DATA FLOW task
Connect this dataflow with the script task using another connector and with the expression @[User::ImportFile] == False (same steps of the first connector)
In the DataFlow Task add a SCript Component as a Source
Create the Output columns you want to import to Logs
Create a Row that contains the information you need to import
Add the Log Destination

Or Instead of adding another Data Flow Task, you can add an Execute SQL Task to insert a row in the Log Table

Hypothalamus answered 25/11, 2017 at 23:25 Comment(15)

Thanks for the detailed information.. it is working fine... except one condition.. sometimes excel with empty data will come.(i.e) we have only header row not not data... in that case it fails entire task. – Goldengoldenberg 27/11, 2017 at 3:11

@VigneshKumar, you're right, there is a simple workaround, add an expression to the connector between the script task and the dataflow task. I updated my answer, take a look – Hypothalamus 27/11, 2017 at 3:59

Yes it is working @hadi but is there any other work around available to process the data flow task without skipping all data flow task. – Goldengoldenberg 27/11, 2017 at 9:43

@VigneshKumar why you need to process the data flow task if the source contains no data? when the excel header is dynamic this will make things more complex. – Hypothalamus 27/11, 2017 at 11:47

Actually one of the task will log the filename and data count and all, which are missssig here – Goldengoldenberg 27/11, 2017 at 12:51

@VigneshKumar just add another dataflow task with script component as a source and create a row that contains log values. Connect this dataflow with the script task using another connector and with the expression @[User::ImportFile] == False – Hypothalamus 27/11, 2017 at 13:43

@VigneshKumar in case where no rows are found, data count = 0, and you got the filename, so you can log them using an Execute SQL Task (using expressions), or using a script task. If there is no data rows, even the Data Flow Task is executed, no rows will be processed because source is empty – Hypothalamus 30/11, 2017 at 21:8

if you are able to figure out column names wouldn't it be easier to simply assign the query definition to a variable and conform the source to the expected format? Such as SELECT col1 as ThisColumnName....... then you wouldn't have to worry about mapping differences. Or perhaps even dump the output of your script task to a RecordSet Variable seeing you are already querying the source and then use that recordset in your dataflow? While plausible and I have written a bunch of scripts for SSIS I am questioning why more and more as a console program would actually be easier for this. – Kirkuk 2/12, 2017 at 0:5

@Kirkuk there are 2 things to take into consideration 1) column names are dynamic 2) Header position is dynamic , in my answer i used the same logic that you suggested but instead of writing it as a query Select col1 as ... i gave aliases to columns in the Source Editor which is the same. But i didn't know what is the advantage to import rows to a recordset then using this recordset as Source?? And a console program may be easier but it will be never faster.. – Hypothalamus 2/12, 2017 at 2:29

@Hadi. I am aware of the dynamic nature of column names and the header position. I guess I was saying why use F1, F2, etc. when you could just alias to the intended column name it adds an additional layer of confusion. – Kirkuk 5/12, 2017 at 18:28

@Hypothalamus Anyway, but yes using a recordset as a source and filling that recordset from your script should be more effecient. you are already opening and querying the recordset in your script daGetDataFromSheet.Fill(dtTable) so why not just format that way you want by modifying column names and then dump to a recordset to use as source. Otherwise you have to query it twice once in script and once in data flow which isn't efficient. – Kirkuk 5/12, 2017 at 18:30

@Kirkuk i agree with both solutions, you can use aliases when building the query in the script task or in the source editor – Hypothalamus 5/12, 2017 at 18:30

@Kirkuk daGetDataFromSheet.Fill(dtTable) is used to get the first 15 rows, not all the file. if you check the query it is SELECT * FROM [" & strSheetname & "A1:I15], A1:I15 means that we are getting the range from the first column to the 9 colums and the first 15 rows – Hypothalamus 5/12, 2017 at 18:31

@Hypothalamus ah I missed the range now I see why you are going that route. If the files aren't very big though seeing you are already scripting you could go the recordset route... but yeah if large files doing the 15 rows to get format would be more effecient – Kirkuk 5/12, 2017 at 18:33

@Kirkuk now i agree with that – Hypothalamus 5/12, 2017 at 18:33

This has been documented well in MSDN , running through the steps similar to as @houseofsql mentioned

Step1:

Exclude column names in first row in excel connection, use sql command as data access mode

Step2: Alias column names in output column as matching your destination,

Select * from [Sheet1$A2:I] will select from second row

Finally Add destination as OLEDB destination

Bacchanal answered 22/11, 2017 at 16:46 Comment(8)

Hi Thanks for your idea but the complex is we are using dynamic page name – Goldengoldenberg 23/11, 2017 at 15:12

You mean name is excel sheet changes ? – Bacchanal 23/11, 2017 at 15:20

Yes. Both file name and excel sheet name will be dynamic – Goldengoldenberg 23/11, 2017 at 15:28

Did u tried by following same steps with variable as source path – Bacchanal 23/11, 2017 at 16:58

No... is there any way to open the excel and update the column through script task? – Goldengoldenberg 24/11, 2017 at 5:47

Does this rely on column order always being the same? – Demetrademetre 25/11, 2017 at 13:25

Yes..it will be same for all – Goldengoldenberg 28/11, 2017 at 5:47

@BHouse if the file contains no rows then this will trow an exception – Spano 28/11, 2017 at 13:1

First of all, my solution is based on @DrHouseofSQL and @Bhouse answers, so you have to read @DrHouseofSQL answer first then @BHouse answer then continue with this answer

Problem

Note : Page name will be dynamic and Column position may change (eg: Column "ABC may exist in first row or second row or ...

This situation is a little complex and can be solved using the following workaround:

Solution Overview

Add a script task before the data flow task that import the data
You have to use the script task to open the excel file and get the Worksheet name and the header row
Build the Query and store it in a variable
in the second Data Flow task you have to use the query stored above as source (Note that you have to set Delay Validation property to true)

Solution Details

First create an SSIS variable of type string (i.e. @[User::strQuery])
Add another variable that contains the Excel File Path (i.e. @[User::ExcelFilePath])
Add A Script Task, and select @[User::strQuery] as ReadWrite Variable, and @[User::ExcelFilePath] as ReadOnly Variable (in the script task window)
Set the Script Language to VB.Net and in the script editor window write the following script:

Note: you have to imports System.Data.OleDb

    m_strExcelPath = Dts.Variables.Item("ExcelFilePath").Value.ToString

    Dim strSheetname As String = String.Empty
    Dim intFirstRow As Integer = 0

    m_strExcelConnectionString = Me.BuildConnectionString()
    Try


        Using OleDBCon As New OleDbConnection(m_strExcelConnectionString)

            If OleDBCon.State <> ConnectionState.Open Then
                OleDBCon.Open()
            End If

            'Get all WorkSheets
            m_dtschemaTable = OleDBCon.GetOleDbSchemaTable(OleDbSchemaGuid.Tables,
                                                               New Object() {Nothing, Nothing, Nothing, "TABLE"})

            'Loop over work sheet to get the first one (the excel may contains temporary sheets or deleted ones

            For Each schRow As DataRow In m_dtschemaTable.Rows
                strSheetname = schRow("TABLE_NAME").ToString

                If Not strSheetname.EndsWith("_") AndAlso strSheetname.EndsWith("$") Then

                    Using cmd As New OleDbCommand("SELECT * FROM [" & strSheetname & "A1:I15]", OleDBCon)

                        Dim dtTable As New DataTable("Table1")


                        cmd.CommandType = CommandType.Text

                        Using daGetDataFromSheet As New OleDbDataAdapter(cmd)

                            daGetDataFromSheet.Fill(dtTable)

                            For intCount As Integer = 0 To 15

                                If Not String.IsNullOrEmpty(dtTable.Rows(intCount)(0).ToString) Then

                                    '+1 because datatable is zero based indexed, +1 because we want to start from the second row
                                    intFirstRow = intCount + 2

                                End If


                            Next



                        End Using

                        If intFirstRow = 0 Then Throw New Exception("header not found")

                    End Using

                    'when the first correct sheet is found there is no need to check others
                    Exit For

                End If
            Next

            OleDBCon.Close()

        End Using

    Catch ex As Exception
        Throw New Exception(ex.Message, ex)
    End Try


    Dts.Variables.Item("strQuery").Value = "SELECT * FROM [" & strSheetname & "A" & intFirstRow.ToString & ":I]"

    Dts.TaskResult = ScriptResults.Success
End Sub

Then you have to add an Excel connection manager, and choose the excel file that you want to import (just select a sample to define the metadata for the first time only)
Assign a default value of Select * from [Sheet1$A2:I] to the variable @[User::strQuery]
In the Data Flow Task add an Excel Source, choose SQL Command from variable, and select @[User::strQuery]
Go to the columns tab and name the columns in the same way that @BHouse suggested

Image taken from @BHouse answer

Set the DataFlow Task Delay Validation property to True
Add other components to DataFlow Task

UPDATE 1:

From the OP comments: sometimes excel with empty data will come.(i.e) we have only header row not not data... in that case it fails entire task

Solution:

If your excel file contains no data (only header) you have to do these steps:

Add an SSIS variable of type boolean *(i.e. @[User::ImportFile])
Add @[User::ImportFile] to the script task ReadWrite variables
In the Script Task check if the file contains rows
If yes Set @[User::ImportFile] = True, else @[User::ImportFile] = False
Double Click on the arrow (precedence constraint) that connect the script task to the DataFlow
Set its type to Constraint and Expression
Write the following expression
```
@[User::ImportFile] == True
```

Note: The new Script Task code is:

    m_strExcelPath = Dts.Variables.Item("ExcelFilePath").Value.ToString

    Dim strSheetname As String = String.Empty
    Dim intFirstRow As Integer = 0

    m_strExcelConnectionString = Me.BuildConnectionString()
    Try


        Using OleDBCon As New OleDbConnection(m_strExcelConnectionString)

            If OleDBCon.State <> ConnectionState.Open Then
                OleDBCon.Open()
            End If

            'Get all WorkSheets
            m_dtschemaTable = OleDBCon.GetOleDbSchemaTable(OleDbSchemaGuid.Tables,
                                                               New Object() {Nothing, Nothing, Nothing, "TABLE"})

            'Loop over work sheet to get the first one (the excel may contains temporary sheets or deleted ones

            For Each schRow As DataRow In m_dtschemaTable.Rows
                strSheetname = schRow("TABLE_NAME").ToString

                If Not strSheetname.EndsWith("_") AndAlso strSheetname.EndsWith("$") Then

                    Using cmd As New OleDbCommand("SELECT * FROM [" & strSheetname & "A1:I15]", OleDBCon)

                        Dim dtTable As New DataTable("Table1")


                        cmd.CommandType = CommandType.Text

                        Using daGetDataFromSheet As New OleDbDataAdapter(cmd)

                            daGetDataFromSheet.Fill(dtTable)

                            For intCount As Integer = 0 To 15

                                If Not String.IsNullOrEmpty(dtTable.Rows(intCount)(0).ToString) Then

                                    '+1 because datatable is zero based indexed, +1 because we want to start from the second row
                                    intFirstRow = intCount + 2

                                End If


                            Next



                        End Using





                    End Using

                    'when the first correct sheet is found there is no need to check others
                    Exit For

                End If
            Next

            OleDBCon.Close()

        End Using

    Catch ex As Exception
        Throw New Exception(ex.Message, ex)
    End Try

                If intFirstRow = 0 OrElse _
                   intFirstRow > dtTable.Rows.Count Then

                    Dts.Variables.Item("ImportFile").Value = False

                Else

                    Dts.Variables.Item("ImportFile").Value = True

                End If                    

    Dts.Variables.Item("strQuery").Value = "SELECT * FROM [" & strSheetname & "A" & intFirstRow.ToString & ":I]"

    Dts.TaskResult = ScriptResults.Success
End Sub

UPDATE 2:

Solution:

Just add another DATA FLOW task
Connect this dataflow with the script task using another connector and with the expression @[User::ImportFile] == False (same steps of the first connector)
In the DataFlow Task add a SCript Component as a Source
Create the Output columns you want to import to Logs
Create a Row that contains the information you need to import
Add the Log Destination

Or Instead of adding another Data Flow Task, you can add an Execute SQL Task to insert a row in the Log Table

Hypothalamus answered 25/11, 2017 at 23:25 Comment(15)

Yes it is working @hadi but is there any other work around available to process the data flow task without skipping all data flow task. – Goldengoldenberg 27/11, 2017 at 9:43

@VigneshKumar why you need to process the data flow task if the source contains no data? when the excel header is dynamic this will make things more complex. – Hypothalamus 27/11, 2017 at 11:47

Actually one of the task will log the filename and data count and all, which are missssig here – Goldengoldenberg 27/11, 2017 at 12:51

@Kirkuk i agree with both solutions, you can use aliases when building the query in the script task or in the source editor – Hypothalamus 5/12, 2017 at 18:30

@Kirkuk now i agree with that – Hypothalamus 5/12, 2017 at 18:33

Is the file being created manually or automatically? In either case you could remove the header row (either programmatically or tell the people to delete it before saving the file) from the Excel file altogether. Once you do that, go into the Excel Connection Manager and find the box that indicates 'First row has column names'. If you can clear out that box then map the columns again to the destination that should solve your problem. You would never have to worry about a misspelled (or extra spaces in) the column names.

I think there is also an option in SSIS to skip the first row altogether but I cannot remember where that option is. If you can find that then just skip the first row of the Excel file. Same mappings still remain.

Thank you

Cheerio answered 22/11, 2017 at 15:1 Comment(1)

File will be creating automatically. – Goldengoldenberg 22/11, 2017 at 15:2

I am fairly new to the forum, so if you think this is silly, take it with a grain of salt.

MS Access has much of the same VBA functionality as Excel or you could script a a new stub Excel workbook that parses and formats before your SQL import and then import that (a middle ware if you will).

For the problem regarding trailing or leading spaces I have used the following on many occasion:

myString = trim(msytring) 'This will remove all leading and trailing spaces but not mess around with any spaces between characters. So on import you can run trim on the column headers as you import them.

There is also LTrim and RTrim 'you can guess what those do left and right of the string

https://support.office.com/en-us/article/LTrim-RTrim-and-Trim-Functions-e340ced1-67df-435f-b078-1527a4eddea2

For Uppercase you can use UCase

myString = UCase(Trim(myString))

And Replace always comes in handy if there is a situation as I often deal with where sometimes a user might use a # char and sometimes not.

Example: " Patterson #288 " or " PatTeRson 288 " myString = UCase(Trim(Replace(myString,"#","") 'eliminates the # sign and gets rid of the leading and trailing spaces and also Uppercases the letters in case the user also made a mistake

Pretty handy to run this is loops importing and exporting.

Now if the file name is changing (this is the Workbook name) or if the Worksheet names are changing you could also have your "middleware" always name the workbook to the same name (with the contents of the workbook that you are going to import) same with the sheets, or you can count the # of sheets and record the names (again a chance to standardize and rename them in your "middle ware")

I suppose it is not an SQL answer, but because I am not that good with SQL I would prep the data, in this case an excel Workbook first and standardize it for import so the code doesn't break on the DB side (Server side).

I use excel as a front end to Access with SQL query scripts and it can be linked directly to SQL but it is much more difficult. A .CSV friendly DB like PostGre SQL helps in that regard.

I hope this helps. If you need help formatting the workbook prior to import by make a copy and applying all of your changes (naming, field name convention // column header) let me know. I could probably help with that.

This is similar to V's comment of running a pre-processing script on the workbook. That is how I would approach it.

Cheers, WWC

Merits answered 1/12, 2017 at 20:36 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Problem

Solution Overview

Solution Details

UPDATE 1:

UPDATE 2:

Problem

Solution Overview

Solution Details

UPDATE 1:

UPDATE 2:

Recommended topics

Hot tags