FileTable Delimited String Split
EDIT - To put the goal up front: the test, and my question, is whether there is a way to get the same performance as a temp table without using a temp table.

This feels like it should be an easy question, but I am stuck. I am experimenting with FileTables in SQL Server 2014. I know of a few alternatives that would work well, but the goal is to establish the feasibility of extracting substrings of text from a FileTable.

This test has 35,000 text files, each containing one line of text like the following, averaging about 100 bytes of non-Unicode text per file.

Aaa|Bbb|Ccc|Ddd|Eee|Fff|Ggg

The desired output is one row for each file, with the delimited string split into seven columns.

I have found a fast string-parsing function, but running it against the filestream column has a significant performance impact compared to running it against a plain varchar column.

This query takes 18 seconds to run. I am trying to have the conversion from filestream to varchar execute only once, but I think calling the UDF may be causing it to happen for every row (file).

Create View vAddresses As
Select file_type, Convert(Varchar(8000), file_stream) As TextData /* Into #Temp */
From InputFiles
Where file_type = 'adr'
Go
Select  --TextData,
        dbo.udf_StringSplit(TextData, 1, '|'), dbo.udf_StringSplit(TextData, 2, '|'), dbo.udf_StringSplit(TextData, 3, '|'),
        dbo.udf_StringSplit(TextData, 4, '|'), dbo.udf_StringSplit(TextData, 5, '|'), dbo.udf_StringSplit(TextData, 6, '|'),
        dbo.udf_StringSplit(TextData, 7, '|')--, TextData
    From vAddresses

I have tried it as a view, a CTE, and a subquery. The only thing that seems to help is creating a temp table: populating the temp table takes one second and the query takes one second, so for 35k rows that is 2 seconds total vs. 18 seconds.
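For reference, the CTE variant of the attempts above can be sketched like this (same table, column, and UDF names as the view); it performed no better than the view for me:

```sql
-- Sketch of the CTE attempt; behaves like the view (~18 seconds),
-- not like the temp table.
;With cte As (
    Select Convert(Varchar(8000), file_stream) As TextData
    From InputFiles
    Where file_type = 'adr'
)
Select  dbo.udf_StringSplit(TextData, 1, '|'), dbo.udf_StringSplit(TextData, 2, '|'),
        dbo.udf_StringSplit(TextData, 3, '|'), dbo.udf_StringSplit(TextData, 4, '|'),
        dbo.udf_StringSplit(TextData, 5, '|'), dbo.udf_StringSplit(TextData, 6, '|'),
        dbo.udf_StringSplit(TextData, 7, '|')
From cte
```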

Drop Table #Temp
Select file_type, Convert(Varchar(8000), file_stream) TextData Into #Temp From HumanaInputFiles Where file_type = 'adr'
Select  --TextData,
        dbo.udf_StringSplit(TextData, 1, '|'), dbo.udf_StringSplit(TextData, 2, '|'), dbo.udf_StringSplit(TextData, 3, '|'),
        dbo.udf_StringSplit(TextData, 4, '|'), dbo.udf_StringSplit(TextData, 5, '|'), dbo.udf_StringSplit(TextData, 6, '|'),
        dbo.udf_StringSplit(TextData, 7, '|')--, TextData
    From #Temp

I have read many posts and blogs on both FileTables and temp-table vs. single-query performance, but I can't seem to figure this out. It might have something to do with sargability or statistics? Any advice is much appreciated.

Here is the UDF. I found it on an MSDN blog/forum, and it is the best performer I have found so far.

ALTER FUNCTION [dbo].[udf_StringSplit] (
     @TEXT      varchar(8000)
    ,@COLUMN    tinyint
    ,@SEPARATOR char(1)
) RETURNS varchar(8000)
AS
BEGIN
    DECLARE @POS_START int = 1
    DECLARE @POS_END   int = CHARINDEX(@SEPARATOR, @TEXT, @POS_START)

    -- Walk forward one delimiter at a time until the requested column is reached
    WHILE (@COLUMN > 1 AND @POS_END > 0)
    BEGIN
        SET @POS_START = @POS_END + 1
        SET @POS_END   = CHARINDEX(@SEPARATOR, @TEXT, @POS_START)
        SET @COLUMN    = @COLUMN - 1
    END

    -- Requested column is past the last field: return an empty string
    IF @COLUMN > 1  SET @POS_START = LEN(@TEXT) + 1
    -- Last field has no trailing delimiter
    IF @POS_END = 0 SET @POS_END = LEN(@TEXT) + 1

    RETURN SUBSTRING(@TEXT, @POS_START, @POS_END - @POS_START)
END
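For clarity, the UDF's positional-split logic is equivalent to the following small Python sketch (using 0-based `str.find`, which returns -1 on a miss, in place of T-SQL's 1-based `CHARINDEX`, which returns 0):

```python
def string_split(text: str, column: int, separator: str) -> str:
    """Return the Nth delimited field (1-based), mirroring dbo.udf_StringSplit."""
    pos_start = 0
    pos_end = text.find(separator, pos_start)  # -1 plays the role of CHARINDEX's 0

    # Walk forward one delimiter at a time until the requested column is reached
    while column > 1 and pos_end >= 0:
        pos_start = pos_end + 1
        pos_end = text.find(separator, pos_start)
        column -= 1

    # Requested column is past the last field: return an empty string
    if column > 1:
        pos_start = len(text)
    # Last field has no trailing delimiter
    if pos_end < 0:
        pos_end = len(text)

    return text[pos_start:pos_end]

print(string_split("Aaa|Bbb|Ccc|Ddd|Eee|Fff|Ggg", 2, "|"))  # Bbb
print(string_split("Aaa|Bbb|Ccc|Ddd|Eee|Fff|Ggg", 7, "|"))  # Ggg
```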

This is the execution plan for the temp table.

<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" Version="1.2" Build="12.0.4100.1" xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <BatchSequence>
    <Batch>
      <Statements>
        <StmtSimple StatementCompId="1" StatementEstRows="17486" StatementId="1" StatementOptmLevel="TRIVIAL" CardinalityEstimationModelVersion="120" StatementSubTreeCost="0.166487" StatementText="Select --TextData,&#xD;&#xA;       dbo.udf_StringSplit(TextData, 1, '|'), dbo.udf_StringSplit(TextData, 2, '|'), dbo.udf_StringSplit(TextData, 3, '|'),&#xD;&#xA;      dbo.udf_StringSplit(TextData, 4, '|'), dbo.udf_StringSplit(TextData, 5, '|'), dbo.udf_StringSplit(TextData, 6, '|'),&#xD;&#xA;      dbo.udf_StringSplit(TextData, 7, '|')--, TextData&#xD;&#xA; From #Temp" StatementType="SELECT" QueryHash="0xC4D6F0215D332F3D" QueryPlanHash="0xC50CFAF9494B5DBE" RetrievedFromCache="true">
          <StatementSetOptions ANSI_NULLS="true" ANSI_PADDING="true" ANSI_WARNINGS="true" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="true" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="true" />
          <QueryPlan DegreeOfParallelism="0" NonParallelPlanReason="CouldNotGenerateValidParallelPlan" CachedPlanSize="24" CompileTime="1" CompileCPU="1" CompileMemory="168">
            <MemoryGrantInfo SerialRequiredMemory="0" SerialDesiredMemory="0" />
            <OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="838735" EstimatedPagesCached="419367" EstimatedAvailableDegreeOfParallelism="4" />
            <RelOp AvgRowSize="28023" EstimateCPU="0.0017486" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="17486" LogicalOp="Compute Scalar" NodeId="0" Parallel="false" PhysicalOp="Compute Scalar" EstimatedTotalSubtreeCost="0.166487">
              <OutputList>
                <ColumnReference Column="Expr1003" />
                <ColumnReference Column="Expr1004" />
                <ColumnReference Column="Expr1005" />
                <ColumnReference Column="Expr1006" />
                <ColumnReference Column="Expr1007" />
                <ColumnReference Column="Expr1008" />
                <ColumnReference Column="Expr1009" />
              </OutputList>
              <RunTimeInformation>
                <RunTimeCountersPerThread Thread="0" ActualRows="17486" ActualEndOfScans="1" ActualExecutions="1" />
              </RunTimeInformation>
              <ComputeScalar>
                <DefinedValues>
                  <DefinedValue>
                    <ColumnReference Column="Expr1003" />
                    <ScalarOperator ScalarString="[DmProd01].[dbo].[udf_StringSplit]([tempdb].[dbo].[#Temp].[TextData],(1),'|')">
                      <UserDefinedFunction FunctionName="[DmProd01].[dbo].[udf_StringSplit]">
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[#Temp]" Column="TextData" />
                          </Identifier>
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(1)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="'|'" />
                        </ScalarOperator>
                      </UserDefinedFunction>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Column="Expr1004" />
                    <ScalarOperator ScalarString="[DmProd01].[dbo].[udf_StringSplit]([tempdb].[dbo].[#Temp].[TextData],(2),'|')">
                      <UserDefinedFunction FunctionName="[DmProd01].[dbo].[udf_StringSplit]">
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[#Temp]" Column="TextData" />
                          </Identifier>
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(2)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="'|'" />
                        </ScalarOperator>
                      </UserDefinedFunction>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Column="Expr1005" />
                    <ScalarOperator ScalarString="[DmProd01].[dbo].[udf_StringSplit]([tempdb].[dbo].[#Temp].[TextData],(3),'|')">
                      <UserDefinedFunction FunctionName="[DmProd01].[dbo].[udf_StringSplit]">
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[#Temp]" Column="TextData" />
                          </Identifier>
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(3)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="'|'" />
                        </ScalarOperator>
                      </UserDefinedFunction>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Column="Expr1006" />
                    <ScalarOperator ScalarString="[DmProd01].[dbo].[udf_StringSplit]([tempdb].[dbo].[#Temp].[TextData],(4),'|')">
                      <UserDefinedFunction FunctionName="[DmProd01].[dbo].[udf_StringSplit]">
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[#Temp]" Column="TextData" />
                          </Identifier>
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(4)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="'|'" />
                        </ScalarOperator>
                      </UserDefinedFunction>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Column="Expr1007" />
                    <ScalarOperator ScalarString="[DmProd01].[dbo].[udf_StringSplit]([tempdb].[dbo].[#Temp].[TextData],(5),'|')">
                      <UserDefinedFunction FunctionName="[DmProd01].[dbo].[udf_StringSplit]">
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[#Temp]" Column="TextData" />
                          </Identifier>
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(5)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="'|'" />
                        </ScalarOperator>
                      </UserDefinedFunction>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Column="Expr1008" />
                    <ScalarOperator ScalarString="[DmProd01].[dbo].[udf_StringSplit]([tempdb].[dbo].[#Temp].[TextData],(6),'|')">
                      <UserDefinedFunction FunctionName="[DmProd01].[dbo].[udf_StringSplit]">
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[#Temp]" Column="TextData" />
                          </Identifier>
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(6)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="'|'" />
                        </ScalarOperator>
                      </UserDefinedFunction>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Column="Expr1009" />
                    <ScalarOperator ScalarString="[DmProd01].[dbo].[udf_StringSplit]([tempdb].[dbo].[#Temp].[TextData],(7),'|')">
                      <UserDefinedFunction FunctionName="[DmProd01].[dbo].[udf_StringSplit]">
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[#Temp]" Column="TextData" />
                          </Identifier>
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(7)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="'|'" />
                        </ScalarOperator>
                      </UserDefinedFunction>
                    </ScalarOperator>
                  </DefinedValue>
                </DefinedValues>
                <RelOp AvgRowSize="4011" EstimateCPU="0.0193131" EstimateIO="0.145426" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="17486" LogicalOp="Table Scan" NodeId="1" Parallel="false" PhysicalOp="Table Scan" EstimatedTotalSubtreeCost="0.164739" TableCardinality="17486">
                  <OutputList>
                    <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[#Temp]" Column="TextData" />
                  </OutputList>
                  <RunTimeInformation>
                    <RunTimeCountersPerThread Thread="0" ActualRows="17486" ActualEndOfScans="1" ActualExecutions="1" />
                  </RunTimeInformation>
                  <TableScan Ordered="false" ForcedIndex="false" ForceScan="false" NoExpandHint="false" Storage="RowStore">
                    <DefinedValues>
                      <DefinedValue>
                        <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[#Temp]" Column="TextData" />
                      </DefinedValue>
                    </DefinedValues>
                    <Object Database="[tempdb]" Schema="[dbo]" Table="[#Temp]" IndexKind="Heap" Storage="RowStore" />
                  </TableScan>
                </RelOp>
              </ComputeScalar>
            </RelOp>
          </QueryPlan>
        </StmtSimple>
      </Statements>
    </Batch>
  </BatchSequence>
</ShowPlanXML>

This is the plan for the view.

<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" Version="1.2" Build="12.0.4100.1" xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <BatchSequence>
    <Batch>
      <Statements>
        <StmtSimple StatementCompId="1" StatementEstRows="17486" StatementId="1" StatementOptmLevel="FULL" StatementOptmEarlyAbortReason="GoodEnoughPlanFound" CardinalityEstimationModelVersion="120" StatementSubTreeCost="0.905265" StatementText="Select    --TextData,&#xD;&#xA;       dbo.udf_StringSplit(TextData, 1, '|'), dbo.udf_StringSplit(TextData, 2, '|'), dbo.udf_StringSplit(TextData, 3, '|'),&#xD;&#xA;      dbo.udf_StringSplit(TextData, 4, '|'), dbo.udf_StringSplit(TextData, 5, '|'), dbo.udf_StringSplit(TextData, 6, '|'),&#xD;&#xA;      dbo.udf_StringSplit(TextData, 7, '|')--, TextData&#xD;&#xA; From vAddresses" StatementType="SELECT" QueryHash="0xB4F8A0B288802C4E" QueryPlanHash="0x28DA02D774B1AF53" RetrievedFromCache="true">
          <StatementSetOptions ANSI_NULLS="true" ANSI_PADDING="true" ANSI_WARNINGS="true" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="true" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="true" />
          <QueryPlan DegreeOfParallelism="0" NonParallelPlanReason="CouldNotGenerateValidParallelPlan" CachedPlanSize="32" CompileTime="3" CompileCPU="3" CompileMemory="520">
            <Warnings>
              <PlanAffectingConvert ConvertIssue="Cardinality Estimate" Expression="CONVERT(varchar(8000),[DmProd01].[dbo].[HumanaInputFiles].[file_stream],0)" />
            </Warnings>
            <MemoryGrantInfo SerialRequiredMemory="0" SerialDesiredMemory="0" />
            <OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="838735" EstimatedPagesCached="419367" EstimatedAvailableDegreeOfParallelism="4" />
            <RelOp AvgRowSize="28023" EstimateCPU="0.0017486" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="17486" LogicalOp="Compute Scalar" NodeId="0" Parallel="false" PhysicalOp="Compute Scalar" EstimatedTotalSubtreeCost="0.905265">
              <OutputList>
                <ColumnReference Column="Expr1004" />
                <ColumnReference Column="Expr1005" />
                <ColumnReference Column="Expr1006" />
                <ColumnReference Column="Expr1007" />
                <ColumnReference Column="Expr1008" />
                <ColumnReference Column="Expr1009" />
                <ColumnReference Column="Expr1010" />
              </OutputList>
              <RunTimeInformation>
                <RunTimeCountersPerThread Thread="0" ActualRows="17486" ActualEndOfScans="1" ActualExecutions="1" />
              </RunTimeInformation>
              <ComputeScalar>
                <DefinedValues>
                  <DefinedValue>
                    <ColumnReference Column="Expr1004" />
                    <ScalarOperator ScalarString="[DmProd01].[dbo].[udf_StringSplit]([Expr1011],(1),'|')">
                      <UserDefinedFunction FunctionName="[DmProd01].[dbo].[udf_StringSplit]">
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Column="Expr1011" />
                          </Identifier>
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(1)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="'|'" />
                        </ScalarOperator>
                      </UserDefinedFunction>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Column="Expr1005" />
                    <ScalarOperator ScalarString="[DmProd01].[dbo].[udf_StringSplit]([Expr1011],(2),'|')">
                      <UserDefinedFunction FunctionName="[DmProd01].[dbo].[udf_StringSplit]">
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Column="Expr1011" />
                          </Identifier>
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(2)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="'|'" />
                        </ScalarOperator>
                      </UserDefinedFunction>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Column="Expr1006" />
                    <ScalarOperator ScalarString="[DmProd01].[dbo].[udf_StringSplit]([Expr1011],(3),'|')">
                      <UserDefinedFunction FunctionName="[DmProd01].[dbo].[udf_StringSplit]">
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Column="Expr1011" />
                          </Identifier>
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(3)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="'|'" />
                        </ScalarOperator>
                      </UserDefinedFunction>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Column="Expr1007" />
                    <ScalarOperator ScalarString="[DmProd01].[dbo].[udf_StringSplit]([Expr1011],(4),'|')">
                      <UserDefinedFunction FunctionName="[DmProd01].[dbo].[udf_StringSplit]">
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Column="Expr1011" />
                          </Identifier>
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(4)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="'|'" />
                        </ScalarOperator>
                      </UserDefinedFunction>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Column="Expr1008" />
                    <ScalarOperator ScalarString="[DmProd01].[dbo].[udf_StringSplit]([Expr1011],(5),'|')">
                      <UserDefinedFunction FunctionName="[DmProd01].[dbo].[udf_StringSplit]">
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Column="Expr1011" />
                          </Identifier>
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(5)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="'|'" />
                        </ScalarOperator>
                      </UserDefinedFunction>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Column="Expr1009" />
                    <ScalarOperator ScalarString="[DmProd01].[dbo].[udf_StringSplit]([Expr1011],(6),'|')">
                      <UserDefinedFunction FunctionName="[DmProd01].[dbo].[udf_StringSplit]">
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Column="Expr1011" />
                          </Identifier>
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(6)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="'|'" />
                        </ScalarOperator>
                      </UserDefinedFunction>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Column="Expr1010" />
                    <ScalarOperator ScalarString="[DmProd01].[dbo].[udf_StringSplit]([Expr1011],(7),'|')">
                      <UserDefinedFunction FunctionName="[DmProd01].[dbo].[udf_StringSplit]">
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Column="Expr1011" />
                          </Identifier>
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(7)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="'|'" />
                        </ScalarOperator>
                      </UserDefinedFunction>
                    </ScalarOperator>
                  </DefinedValue>
                </DefinedValues>
                <RelOp AvgRowSize="4019" EstimateCPU="0.0034972" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="17486" LogicalOp="Compute Scalar" NodeId="1" Parallel="false" PhysicalOp="Compute Scalar" EstimatedTotalSubtreeCost="0.88673">
                  <OutputList>
                    <ColumnReference Column="Expr1011" />
                  </OutputList>
                  <ComputeScalar>
                    <DefinedValues>
                      <DefinedValue>
                        <ColumnReference Column="Expr1011" />
                        <ScalarOperator ScalarString="CONVERT(varchar(8000),[DmProd01].[dbo].[HumanaInputFiles].[file_stream],0)">
                          <Convert DataType="varchar" Length="8000" Style="0" Implicit="false">
                            <ScalarOperator>
                              <Identifier>
                                <ColumnReference Database="[DmProd01]" Schema="[dbo]" Table="[HumanaInputFiles]" Column="file_stream" />
                              </Identifier>
                            </ScalarOperator>
                          </Convert>
                        </ScalarOperator>
                      </DefinedValue>
                    </DefinedValues>
                    <RelOp AvgRowSize="4043" EstimateCPU="0.0386262" EstimateIO="0.844606" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="17486" LogicalOp="Table Scan" NodeId="2" Parallel="false" PhysicalOp="Table Scan" EstimatedTotalSubtreeCost="0.883233" TableCardinality="34972">
                      <OutputList>
                        <ColumnReference Database="[DmProd01]" Schema="[dbo]" Table="[HumanaInputFiles]" Column="file_stream" />
                      </OutputList>
                      <RunTimeInformation>
                        <RunTimeCountersPerThread Thread="0" ActualRows="17486" ActualEndOfScans="1" ActualExecutions="1" />
                      </RunTimeInformation>
                      <TableScan Ordered="false" ForcedIndex="false" ForceScan="false" NoExpandHint="false" Storage="RowStore">
                        <DefinedValues>
                          <DefinedValue>
                            <ColumnReference Database="[DmProd01]" Schema="[dbo]" Table="[HumanaInputFiles]" Column="file_stream" />
                          </DefinedValue>
                        </DefinedValues>
                        <Object Database="[DmProd01]" Schema="[dbo]" Table="[HumanaInputFiles]" IndexKind="Heap" Storage="RowStore" />
                        <Predicate>
                          <ScalarOperator ScalarString="[DmProd01].[dbo].[HumanaInputFiles].[file_type]=N'adr'">
                            <Compare CompareOp="EQ">
                              <ScalarOperator>
                                <Identifier>
                                  <ColumnReference Database="[DmProd01]" Schema="[dbo]" Table="[HumanaInputFiles]" Column="file_type" ComputedColumn="true" />
                                </Identifier>
                              </ScalarOperator>
                              <ScalarOperator>
                                <Const ConstValue="N'adr'" />
                              </ScalarOperator>
                            </Compare>
                          </ScalarOperator>
                        </Predicate>
                      </TableScan>
                    </RelOp>
                  </ComputeScalar>
                </RelOp>
              </ComputeScalar>
            </RelOp>
          </QueryPlan>
        </StmtSimple>
      </Statements>
    </Batch>
  </BatchSequence>
</ShowPlanXML>

EDIT: Searching on this differently, I found that one answer is to use TOP and ORDER BY. This got it down to 4 seconds. It seems kind of hacky, and it still doesn't explain how examining the query plans helps figure this out, so I am not going to answer this myself; I'll leave it open.
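The TOP and ORDER BY workaround takes roughly this shape (a sketch only; whether the intermediate result is actually materialized before the UDF calls run is plan-dependent and not guaranteed):

```sql
-- Sketch of the TOP + ORDER BY workaround: the ORDER BY inside the
-- derived table tends to force the converted rows to be materialized
-- (e.g. via a sort) before the seven UDF calls execute.
Select  dbo.udf_StringSplit(TextData, 1, '|'), dbo.udf_StringSplit(TextData, 2, '|'),
        dbo.udf_StringSplit(TextData, 3, '|'), dbo.udf_StringSplit(TextData, 4, '|'),
        dbo.udf_StringSplit(TextData, 5, '|'), dbo.udf_StringSplit(TextData, 6, '|'),
        dbo.udf_StringSplit(TextData, 7, '|')
From (
    Select Top (2147483647)
           Convert(Varchar(8000), file_stream) As TextData
    From InputFiles
    Where file_type = 'adr'
    Order By file_type
) t
```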

Friesian answered 4/7, 2017 at 15:12 Comment(27)
Instead of multistatement scalar UDF try using table function like sqlperformance.com/2016/03/sql-server-2016/string-splitMaharashtra
Thanks for the suggestion but I tried those suggestions and a few other blogs / posts recommending that approach. There was even one using delimitedsplit8k and pivot. None of the table value tests came in under 40 seconds with this test data.Friesian
Using the temp table only takes 2 sec, am I missing something? Are you trying to figure out the reason for the difference in performance or are you looking for a different solution?Scintillant
The test is to determine if we can get temp table performance using a single sql statement with UDF rather than a stored procedure with a temp table.Friesian
Have you tried the following string splitter: https://mcmap.net/q/1011142/-efficient-query-to-split-a-delimited-column-into-separate-rows-in-another-table? It shows an inline TVF which should perform better than a regular one.Effeminize
Thanks for the suggestion. I tried it just now, it takes 10 seconds which is significantly more than the two second performance of a temp table approach. Also it returns one row per delimited item when I want a column per item.Friesian
Maybe you need to decide on what is more important: speed of text extraction (tempt table (from your question), BULK INSERT or SSIS), or clean code with moderate performance penalty.Orthopter
@JoeC please clearify your question. You know that varchar is faster then FileTable and seeking for fastest way to convert FileTable to varchar?Platinotype
Here is a list of several string splitting functions. The best non-native performer, the CLR, uses nvarchars, which means you'd probably want to modify the code.Spratt
I also wonder if you can make a PERSISTED computed column in whatever the base table is as TextData AS CONVERT(Varchar(8000),file_stream), although that would certainly slow INSERT performance and occupy disk space, it would mean you wouldn't need to convert the data when it was queried, too.Spratt
In the end though, the idea that you want to use a FILESTREAM for text files of about 100 bytes when those text files contain delimited data just screams to me: "You're designing your table to work against the relational model. Stop breaking First Normal Form and expecting the database to perform reasonably. Don't be lazy. Parse your data and store it correctly. If you need the original file, that's fine, but that doesn't mean you should be parsing that FILESTREAM."Spratt
Thanks Bacon Bits, I have tried all the examples in that post with the exception of CLR which I am pretty sure will perform great. Hoping to avoid turning on CLR at this time though. Also I had thought of the computed column but file tables do not allow you to add columns. Maybe an index will help I will have to look into that.Friesian
@Bacon Bits, this is a proof of concept. I can easily use another form of import. However using a filetable allows real time data. As soon as a file is changed a sql query reflects the updated data without running any import / parsing process. The fact that the temp table performs so well makes this a very tempting solution. Filetables were created for a reason and are awesome for full text searching, hoping to find another use.Friesian
You can emulate persistent column - create another table and populate it with triggersPlatinotype
That is an interesting suggestion. I wonder if saving a text file will generate an update event. I will try that out, thank you.Friesian
@JoeC Could you provide statistics for both SET STATISTICS TIME, IO ON? Plus full execution plan for temp table (with SELECT ... INTO ...) part? Do you clear buffers/cache before executing each method?Maharashtra
@JoeC, is SQL 2016 an option? On my box, STRING_SPLIT plus unpivoting with XML averaged 1.3 sec with a varchar temp table and 6.4 sec with the FileTable. Simply scanning the FileTable with no conversion or parsing took over 4 seconds longer than the temp table (201ms versions 4300ms) so you'll never approach performance of a relational table.Gough
@Mikhail Lobanov +1 for putting a trigger on the File TableSalaried
@DanGuzman 2016 is not an optionFriesian
@lad2025 I cant add any more to my post, I reached the char limit. I will try to post stats in comments. Also I do not bother to clear cache except for the first time as there was no difference if I did or did not.Friesian
View - (17486 row(s) affected) Table 'InputFiles'. Scan count 1, logical reads 1137, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. (1 row(s) affected) SQL Server Execution Times: CPU time = 7223 ms, elapsed time = 19552 ms.Friesian
Temp table pt 1 Table 'InputFiles'. Scan count 9, logical reads 1137, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.Friesian
Temp table pt 2 SQL Server Execution Times: CPU time = 1046 ms, elapsed time = 919 ms. (17486 row(s) affected) SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 1 ms. (17486 row(s) affected) Table '#Temp__000000000113'. Scan count 1, logical reads 193, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 1700 ms, elapsed time = 1854 ms.Friesian
You might need hybrid data management. In this case you need this kind of persistence. Even so, I am not convinced that the rows have to be spread across files, except when isolation is important for owner rights.Teamwork
A person far smarter than me has a great write-up on this over at SQLServerCentral that I think breaks down the problem, along with a pretty impressive native SQL solution. sqlservercentral.com/articles/Tally+Table/72993Nerves
Thank you @RoyFolkker for the info. However, this question is less about splitter speed than about the view/subquery executing multiple times. At least that is what I believe so far, based on my research into splitters (including what you posted), and as my last edit states, using TOP and ORDER BY gives performance almost equal to using a temp table.Friesian
Thank you for the correction @JoeC. I assumed the performance hit you were seeing was more reminiscent of a tree or segmentation anomaly manifesting in the display, so I thought looking at how you structure your data before it is prepared might be relevant. Sorry for the red herring.Nerves
P
6

There are recommendations on MSDN not to use FileTable in your case:

The FileTable feature builds on top of SQL Server FILESTREAM technology.

For small objects the performance of FILESTREAM is not good. FILESTREAM is designed to work with files of about 1 MB and larger, but yours average only 100 bytes (https://learn.microsoft.com/en-us/sql/relational-databases/blob/filestream-sql-server):

When to Use FILESTREAM

In SQL Server, BLOBs can be standard varbinary(max) data that stores the data in tables, or FILESTREAM varbinary(max) objects that store the data in the file system. The size and use of the data determines whether you should use database storage or file system storage. If the following conditions are true, you should consider using FILESTREAM:

  • Objects that are being stored are, on average, larger than 1 MB.
  • Fast read access is important.
  • You are developing applications that use a middle tier for application logic.

For smaller objects, storing varbinary(max) BLOBs in the database often provides better streaming performance.

You can emulate a persisted column: create another table and populate it with triggers. That way you get the benefits of both approaches.
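A rough sketch of that shadow-table approach, using the InputFiles FileTable from the question (the AddressText table and trigger name are hypothetical; whether non-transactional file-system saves fire DML triggers on a FileTable is exactly what the comments above set out to test, so verify that first):

```sql
-- Hypothetical shadow table holding the one-time varchar conversion.
CREATE TABLE dbo.AddressText (
    stream_id UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
    TextData  VARCHAR(8000)    NOT NULL
);
GO
-- Trigger on the FileTable keeps the shadow table in sync on insert/update.
CREATE TRIGGER trg_InputFiles_SyncAddressText
ON dbo.InputFiles
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Remove stale rows for any files being (re)written.
    DELETE t
    FROM   dbo.AddressText t
    JOIN   inserted i ON i.stream_id = t.stream_id;

    -- Materialize the filestream-to-varchar conversion once, here.
    INSERT dbo.AddressText (stream_id, TextData)
    SELECT i.stream_id, CONVERT(VARCHAR(8000), i.file_stream)
    FROM   inserted i
    WHERE  i.file_type = 'adr';
END;
```

Queries would then split dbo.AddressText.TextData instead of touching file_stream, which is the same one-time conversion the temp table achieves, but persisted.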

P.S. You can use inline TVF + union all to implement @TT 's comment

Platinotype answered 7/7, 2017 at 8:38 Comment(3)
Thank you for the information. However I have read that already and the advantages for this situation make it a desirable solution. Working with a temp table provides excellent performance. The test and my question however is whether there is a way to get the same performance without a temp table.Friesian
ps I don't understand how adding a UNION ALL to the mix will help performance. I would love to try it, though, if you could post an example.Friesian
@JoeC UNION ALL does not improve performance; I wrote it because I thought you had a problem unpivoting data from the inline function.Platinotype
D
3

EDIT - Even Faster Approach

Rather than calling your Parse/Split function 7 times, perhaps this TVF will be more efficient. The following processes 35,000 unique records in 0.773 seconds.

Example

-- Create Some Sample/UNIQUE Data
Select N,TextData =concat(N,TextData )
Into #Temp
From  (values ('Aaa|Bbb|Ccc|Ddd|Eee|Fff|Ggg') ) A (TextData )
Cross Apply (Select Top 35000 N=Row_Number() Over (Order By (Select NULL)) From master..spt_values n1,master..spt_values n2) B

Select B.*
 From  #Temp A
 Cross Apply (
                Select Pos1=max(case when RetSeq=1 then RetVal end)
                      ,Pos2=max(case when RetSeq=2 then RetVal end)
                      ,Pos3=max(case when RetSeq=3 then RetVal end)
                      ,Pos4=max(case when RetSeq=4 then RetVal end)
                      ,Pos5=max(case when RetSeq=5 then RetVal end)
                      ,Pos6=max(case when RetSeq=6 then RetVal end)
                      ,Pos7=max(case when RetSeq=7 then RetVal end)
                 From [dbo].[udf-Str-Parse-8K](A.TextData,'|') B1
             ) B

The UDF, if interested

CREATE FUNCTION [dbo].[udf-Str-Parse-8K] (@String varchar(max),@Delimiter varchar(25))
Returns Table 
As
Return (  
    with   cte1(N)   As (Select 1 From (Values(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) N(N)),
           cte2(N)   As (Select Top (IsNull(DataLength(@String),0)) Row_Number() over (Order By (Select NULL)) From (Select N=1 From cte1 a,cte1 b,cte1 c,cte1 d) A ),
           cte3(N)   As (Select 1 Union All Select t.N+DataLength(@Delimiter) From cte2 t Where Substring(@String,t.N,DataLength(@Delimiter)) = @Delimiter),
           cte4(N,L) As (Select S.N,IsNull(NullIf(CharIndex(@Delimiter,@String,s.N),0)-S.N,8000) From cte3 S)

    Select RetSeq = Row_Number() over (Order By A.N)
          ,RetVal = LTrim(RTrim(Substring(@String, A.N, A.L)))
    From   cte4 A
);
--Original source: http://www.sqlservercentral.com/articles/Tally+Table/72993/

Just to Aid the Visualization

The TVF as a Stand-alone includes the Item Sequence which can be applied to the conditional aggregation.

Select * from [dbo].[udf-Str-Parse-8K]('Aaa|Bbb|Ccc|Ddd|Eee|Fff|Ggg','|')

Returns

RetSeq  RetVal
1       Aaa
2       Bbb
3       Ccc
4       Ddd
5       Eee
6       Fff
7       Ggg
Durban answered 7/7, 2017 at 12:35 Comment(7)
Thank you for the well-formed answer. However, the query was running over 13 minutes so I stopped it. I have tried many split functions that normally perform very well. As I stated in my question, the best ones I have found finish in 18 seconds. The goal is to find an approach that comes close to the 2-second performance of a temp table.Friesian
@JoeC Can't imagine why you are witnessing such poor performance. 7 calls to a looping function vs 1 tally parse? Just for fun, I parsed 91,861 branch address (using your pipe formatted string) and that took 4.031 seconds. I'll noodle it a bit more.Durban
@JoeC Just curious... how does the following perform ... Select min(TextData),max(TextData),count(*) from vAddressesDurban
I believe that the reason for slow performance is that it is retrieving the file stream every time it calls the function. Where using the temp table to do a one time conversion to varchar performs much better. I could not tell from the execution plans if that is indeed happening. The aggregate query you posted took 5 seconds to run. Thanks again for your help.Friesian
@JoeC Fascinating. If the aggregate query took 5 seconds, The parse provided should have taken just North of that. Now, I am really curious. Later today, I will see if I can re-create your structure and results.Durban
Thank you John. The file table was pretty simple --> Create Table [dbo].[InputFiles] As FileTable On [PRIMARY] FileStream_On [FileStreamGroup1] With ( FileTable_Directory = N'HumanaInputFiles', FileTable_Collate_Filename = SQL_Latin1_General_CP1_CI_AS )Friesian
The files are basically addresses with each 'field' averaging 5 characters as usually only the first three have data but I am allowing up to 7Friesian
G
0

[1] I've used the following solution many times: (1.1) convert the source strings to XML and then (1.2) extract each item from the XML using the value method, like so:

USE tempdb
GO
IF OBJECT_ID('dbo.SourceTable') IS NOT NULL
    DROP TABLE dbo.SourceTable
GO
CREATE TABLE dbo.SourceTable (
    ID      INT IDENTITY PRIMARY KEY,
    Col1    VARCHAR(100) NOT NULL
);
INSERT  dbo.SourceTable (Col1) VALUES ('Aaa|Bbb|Ccc|Ddd|Eee|Fff|Ggg')
INSERT  dbo.SourceTable (Col1) VALUES ('hhh|iii|JJJ|kkk')

SELECT  b.ID, c.XmlCol.value('.', 'VARCHAR(100)') AS ItemVal--, ROW_NUMBER() OVER(PARTITION BY b.ID ORDER BY c.XmlCol) AS RowNum
FROM (
    SELECT  a.ID, CONVERT(XML, '<root><i>' + REPLACE(a.Col1, '|', '</i><i>') + '</i></root>') AS Col1AsXML
    FROM    dbo.SourceTable a
) b OUTER APPLY b.Col1AsXML.nodes('root/i') c(XmlCol)
--OPTION(FORCE ORDER)

[2] I would do some performance tests if you provide DDL & DML statements.

[3] It would also help if you could provide the following details:

  1. Is there a maximum number of items per string?
  2. Do all items have the same length (e.g. 3 chars)?
Gentlemanfarmer answered 9/7, 2017 at 7:45 Comment(0)
F
0

I've played with this a bit and the fastest solution barring writing a CLR assembly (which isn't terribly faster) seems to be some type of variant of the following:

CREATE FUNCTION [UDF_Split] (
    @InputStr NVARCHAR(Max),
    @Delimiter NVARCHAR(255)
)
RETURNS TABLE
AS RETURN(

   WITH lv0 AS (SELECT 0 g UNION ALL SELECT 0)
    ,lv1 AS (SELECT 0 g FROM lv0 a CROSS JOIN lv0 b) -- 4
    ,lv2 AS (SELECT 0 g FROM lv1 a CROSS JOIN lv1 b) -- 16
    ,lv3 AS (SELECT 0 g FROM lv2 a CROSS JOIN lv2 b) -- 256
    ,lv4 AS (SELECT 0 g FROM lv3 a CROSS JOIN lv3 b) -- 65,536
    ,lv5 AS (SELECT 0 g FROM lv4 a CROSS JOIN lv4 b) -- 4,294,967,296
    ,Tally (n) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM lv5)

 SELECT SUBSTRING(@InputStr, N, CHARINDEX(@Delimiter, @InputStr + @Delimiter, N) - N) AS TextLine
 FROM   Tally
 WHERE  N BETWEEN 1 AND LEN(@InputStr) + LEN(@Delimiter)
        AND SUBSTRING(@Delimiter + @InputStr, N, LEN(@Delimiter)) = @Delimiter);

Starting with SQL Server 2016, though, we get a built-in split function: STRING_SPLIT ( string , separator )
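For reference, a minimal STRING_SPLIT call against the sample string from the question (requires database compatibility level 130; note that the 2016 version returns no ordinal column, so item order is not guaranteed by the engine):

```sql
-- Returns one row per delimited item in a single column named "value".
SELECT value
FROM STRING_SPLIT('Aaa|Bbb|Ccc|Ddd|Eee|Fff|Ggg', '|');
```

Because there is no ordinal column, pivoting the seven items back into positional columns still needs something like the conditional-aggregation patterns shown in the other answers.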

Ferrand answered 9/7, 2017 at 20:21 Comment(0)
B
0

Since you have a fixed/predetermined number of columns to split out, you don't actually need to use a string splitting function at all.

The following is able to chew through 100K rows in ~2 seconds on a fairly anemic dev server.

Note... This solution is incorporating Adam Machanic's MakeParallel function to force a parallel execution plan.

IF OBJECT_ID('tempdb..#TestData', 'U') IS NOT NULL 
DROP TABLE #TestData;

WITH 
    cte_n1 (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)), 
    cte_n2 (n) AS (SELECT 1 FROM cte_n1 a CROSS JOIN cte_n1 b),
    cte_n3 (n) AS (SELECT 1 FROM cte_n2 a CROSS JOIN cte_n2 b),
    cte_Tally (n) AS (
        SELECT TOP 100000
            ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM
            cte_n3 a CROSS JOIN cte_n3 b
        )
SELECT 
    ID = ISNULL(CAST(t.n AS INT), 0),
    FilePath = CAST(fp.FilePath AS VARCHAR(1000))
    INTO #TestData
FROM
    cte_Tally t
    CROSS APPLY ( VALUES (CONCAT(
                                ABS(CHECKSUM(NEWID())) % 9999 + 1000, '|',
                                ABS(CHECKSUM(NEWID())) % 9999 + 1000, '|',
                                ABS(CHECKSUM(NEWID())) % 9999 + 1000, '|',
                                ABS(CHECKSUM(NEWID())) % 9999 + 1000, '|',
                                ABS(CHECKSUM(NEWID())) % 9999 + 1000, '|',
                                ABS(CHECKSUM(NEWID())) % 9999 + 1000, '|',
                                ABS(CHECKSUM(NEWID())) % 9999 + 1000
                                )
                            ) ) fp (FilePath);

ALTER TABLE #TestData ADD PRIMARY KEY CLUSTERED (ID);

--=============================================================================
DECLARE     -- Dump values into variables to eliminate display rendering from execution time.
    @id INT,
    @Col_1 VARCHAR(5),
    @Col_2 VARCHAR(5),
    @Col_3 VARCHAR(5),
    @Col_4 VARCHAR(5),
    @Col_5 VARCHAR(5),
    @Col_6 VARCHAR(5),
    @Col_7 VARCHAR(5);

SELECT 
    @ID = td.ID,
    @Col_1 = SUBSTRING(td.FilePath, 1, ABS(d1.DelimLocation - 1)),
    @Col_2 = SUBSTRING(td.FilePath, d1.DelimLocation + 1, ABS(d2.DelimLocation - d1.DelimLocation - 1)),
    @Col_3 = SUBSTRING(td.FilePath, d2.DelimLocation + 1, ABS(d3.DelimLocation - d2.DelimLocation - 1)),
    @Col_4 = SUBSTRING(td.FilePath, d3.DelimLocation + 1, ABS(d4.DelimLocation - d3.DelimLocation - 1)),
    @Col_5 = SUBSTRING(td.FilePath, d4.DelimLocation + 1, ABS(d5.DelimLocation - d4.DelimLocation - 1)),
    @Col_6 = SUBSTRING(td.FilePath, d5.DelimLocation + 1, ABS(d6.DelimLocation - d5.DelimLocation - 1)),
    @Col_7 = SUBSTRING(td.FilePath, d6.DelimLocation + 1, 1000)
FROM
    #TestData td
    CROSS APPLY ( VALUES (LEN(td.FilePath) - LEN(REPLACE(td.FilePath, '|', ''))) ) dc (DelimiterCount)
    CROSS APPLY ( VALUES (IIF(dc.DelimiterCount < 1, 1000,  CHARINDEX('|', td.FilePath, 1))) ) d1 (DelimLocation)
    CROSS APPLY ( VALUES (IIF(dc.DelimiterCount < 2, 1000,  CHARINDEX('|', td.FilePath, d1.DelimLocation + 1))) ) d2 (DelimLocation)
    CROSS APPLY ( VALUES (IIF(dc.DelimiterCount < 3, 1000,  CHARINDEX('|', td.FilePath, d2.DelimLocation + 1))) ) d3 (DelimLocation)
    CROSS APPLY ( VALUES (IIF(dc.DelimiterCount < 4, 1000,  CHARINDEX('|', td.FilePath, d3.DelimLocation + 1))) ) d4 (DelimLocation)
    CROSS APPLY ( VALUES (IIF(dc.DelimiterCount < 5, 1000,  CHARINDEX('|', td.FilePath, d4.DelimLocation + 1))) ) d5 (DelimLocation)
    CROSS APPLY ( VALUES (IIF(dc.DelimiterCount < 6, 1000,  CHARINDEX('|', td.FilePath, d5.DelimLocation + 1))) ) d6 (DelimLocation)
    CROSS APPLY dbo.MakeParallel() mp;  -- Forces a parallel execution plan.
                                        -- http://dataeducation.com/next-level-parallel-plan-forcing-an-alternative-to-8649/

Of course, the MakeParallel function can also be used in conjunction with a splitter function. In this case, Jeff Moden's DelimitedSplit8K function.

--=============================================================================
DECLARE     -- Dump values into variables to eliminate display rendering from execution time.
    @id INT,
    @Col_1 VARCHAR(5),
    @Col_2 VARCHAR(5),
    @Col_3 VARCHAR(5),
    @Col_4 VARCHAR(5),
    @Col_5 VARCHAR(5),
    @Col_6 VARCHAR(5),
    @Col_7 VARCHAR(5);

SELECT 
    @ID = td.ID,
    @Col_1 = MAX(CASE WHEN sc.ItemNumber = 1 THEN sc.Item END),
    @Col_2 = MAX(CASE WHEN sc.ItemNumber = 2 THEN sc.Item END),
    @Col_3 = MAX(CASE WHEN sc.ItemNumber = 3 THEN sc.Item END),
    @Col_4 = MAX(CASE WHEN sc.ItemNumber = 4 THEN sc.Item END),
    @Col_5 = MAX(CASE WHEN sc.ItemNumber = 5 THEN sc.Item END),
    @Col_6 = MAX(CASE WHEN sc.ItemNumber = 6 THEN sc.Item END),
    @Col_7 = MAX(CASE WHEN sc.ItemNumber = 7 THEN sc.Item END)
FROM
    #TestData td
    CROSS APPLY dbo.DelimitedSplit8K(td.FilePath, '|') sc
    CROSS APPLY dbo.MakeParallel() mp       -- Forces a parallel execution plan.
                                            -- http://dataeducation.com/next-level-parallel-plan-forcing-an-alternative-to-8649/
GROUP BY
    td.ID;

HTH, Jason

Boustrophedon answered 13/7, 2017 at 0:26 Comment(0)
