Excel UDF to Unpivot (Melt, Reverse pivot, Flatten, Normalize) blocks of data within Tables

Asked 7/8, 2021 at 20:43 Answered 8/8, 2021 at 5:19

vba excel-formula user-defined-functions powerquery dynamic-arrays

This question seeks multiple approaches (LET/LAMBDA, VBA UDF, and Power Query function), so there will be no single right answer; it is a solicitation of approaches to be used as references.

Scott raised a question here about unpivoting a complex table that contains blocks of data instead of individual data points. The basic idea is illustrated in this table:

		Jan	Jan	Jan	Jan	Feb	Feb	Feb	Feb	Mar	Mar	Mar	Mar
State	City	Pressure	Temp	Humidity	CO2	Pressure	Temp	Humidity	CO2	Pressure	Temp	Humidity	CO2
Georgia	Atlanta	1	2	3	4	5	6	7	8	9	10	11	12
Massachusetts	Boston	49	50	51	52	53	54	55	56	57	58	59	60
Texas	Dallas	97	98	99	100	101	102	103	104	105	106	107	108
Louisiana	Jonesboro	145	146	147	148	149	150	151	152	153	154	155	156
California	San Francisco	193	194	195	196	197	198	199	200	201	202	203	204

The data for each city is in blocks of four columns containing Pressure, Temperature, Humidity and CO2 (or PTHC). We want to unpivot the PTHC blocks of values according to their month by the State and City. Here is the desired output:

State	City	month	Pressure	Temp	Humidity	CO2
Georgia	Atlanta	Jan	1	2	3	4
Georgia	Atlanta	Feb	5	6	7	8
Georgia	Atlanta	Mar	9	10	11	12
Massachusetts	Boston	Jan	49	50	51	52
Massachusetts	Boston	Feb	53	54	55	56
Massachusetts	Boston	Mar	57	58	59	60
Texas	Dallas	Jan	97	98	99	100
Texas	Dallas	Feb	101	102	103	104
Texas	Dallas	Mar	105	106	107	108
Louisiana	Jonesboro	Jan	145	146	147	148
Louisiana	Jonesboro	Feb	149	150	151	152
Louisiana	Jonesboro	Mar	153	154	155	156
California	San Francisco	Jan	193	194	195	196
California	San Francisco	Feb	197	198	199	200
California	San Francisco	Mar	201	202	203	204

The order of the rows is not important, so long as they are complete, i.e. the output could be sorted by month, city, state, ...; it does not matter. The output does not need to be a dynamic array that spills; in the case of a Power Query function, for example, it clearly would not be.

It can be assumed that the PTHC block is always consistent, i.e.

it never skips a field value, e.g. PTHC PTC PTHC...
it never changes order, e.g. PTHC PCHT

The months are always presented in groups that are equally sized to the block (in this example, 4, so there will be four Jan columns, four Feb columns, etc.). For example, if there are 7 months, there will be 7 PTHC blocks, or 28 columns of data.

However, the pattern of months can also be interleaved such that the months will increment and the PTHC block will be grouped (i.e. PPP TTT HHH CCC) like this:

		Jan	Feb	Mar	Jan	Feb	Mar	Jan	Feb	Mar	Jan	Feb	Mar
State	City	Pressure	Pressure	Pressure	Temp	Temp	Temp	Humidity	Humidity	Humidity	CO2	CO2	CO2

The UDF would also have to accommodate more or fewer than 4 fields inside the block. The use of months and PTHC is just an illustration. The attribute that represents months in this example will always be a single row (a multi-row approach would be an interesting question, but a new and separate one). The attribute that represents the field values PTHC will also be a single row.
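
To make the requirement concrete outside Excel, here is a minimal Python sketch of the transformation, not one of the requested LET/VBA/Power Query solutions. The function name unpivot_blocks and its parameters are made up for illustration. It looks up each value by its (attribute, field) header pair, so it should cope with both the grouped and the interleaved header layouts, assuming every pair occurs exactly once.

```python
def unpivot_blocks(attr_row, hdr_row, body, by_count, attr_title="month"):
    """Unpivot blocks of fields; attr_row/hdr_row are the two header rows,
    body is the list of data rows, by_count is the number of leading key columns."""
    by_hdr = hdr_row[:by_count]
    # Column position of every (attribute, field) pair,
    # e.g. ("Jan", "Temp") -> 3 in the first sample layout
    pos = {(a, f): c
           for c, (a, f) in enumerate(zip(attr_row, hdr_row))
           if c >= by_count}
    # Unique attributes and fields in first-seen order (dicts keep insertion order)
    attrs = list(dict.fromkeys(attr_row[by_count:]))
    fields = list(dict.fromkeys(hdr_row[by_count:]))
    out = [by_hdr + [attr_title] + fields]
    for row in body:
        for a in attrs:
            out.append(row[:by_count] + [a] + [row[pos[(a, f)]] for f in fields])
    return out
```

Called with by_count=2 on the first sample table, the first data row returned would be Georgia, Atlanta, Jan, 1, 2, 3, 4.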

I will propose a LET function based on Scott's question, but there certainly can be better approaches and both VBA and Power Query have their own strengths. The objective is to create a collection of working approaches.

Amulet answered 7/8, 2021 at 20:43 Comment(0)

LET/LAMBDA Approach

This requires Excel 365. The formula is:

=LET( upValues, C3:N7,  upHdr, C2:N2,  upAttr, C1:N1,
      byBody, A3:B7,  byHdr, A2:B2,
      attrTitle, "month",

         upFields, UNIQUE( upHdr,1 ), blockSize, COUNTA( upFields ),
         byC, COLUMNS( byBody ), upC, COLUMNS( upValues ),
         dmxR, MIN( ROWS( upValues ), ROWS( byBody ) ),
         upCells, dmxR * upC/blockSize,
         tCSeq, SEQUENCE( 1, byC + 1 + blockSize ),  tRSeq, SEQUENCE( upCells + 1,, 0 ),  upSeq, SEQUENCE( upCells,, 0 ),
         hdr, IF( tCSeq <= byC,  INDEX( byHdr, , tCSeq ),
                 IF( tCSeq = byC + 1, attrTitle,
                     INDEX( upFields, 1, tCSeq - byC - 1 ) ) ),
         muxBody, INDEX( byBody, SEQUENCE( upCells, byC, 0 )/byC/upC*blockSize + 1, SEQUENCE( 1, byC ) ),
         muxAttr, INDEX( upAttr, MOD( SEQUENCE( upCells,, 0, blockSize ), upC ) + 1 ),
         muxValues, INDEX( upValues, SEQUENCE( upCells, blockSize, 0 )/upC+1, MOD(SEQUENCE( upCells, blockSize, 0 ),upC)+1),
         table, IF( tCSeq <= byC, muxBody,
                   IF( tCSeq = byC + 1, muxAttr,
                       INDEX( muxValues, upSeq + 1, tCSeq - byC - 1 ) ) ),
         IF( tRSeq = 0, hdr, INDEX( table, tRSeq, tCSeq) )  )

This takes in 6 variables:

upValues - the data that will be unpivoted in blocks
upHdr - the header row that contains the PTHC values
upAttr - the attribute that will be unpivoted i.e. the months row
byBody - the body of values by which the data is unpivoted, i.e. the State and City values
byHdr - the header of the byBody (the titles "State" and "City")
attrTitle - an optional title for the attribute that will be unpivoted

These are better understood in this illustration:

and here it is with the test data and the results shown to make it easier to understand:

The output above can also be illustrated:

The red text marks the internal variables used to construct the result.

The formula has 5 parts as follows:

Taking Dimensions is straightforward: it simply parameterizes the values that will be used repeatedly later. dmxR uses the MIN of the row counts of upValues and byBody in case the user accidentally supplies upValues and byBody ranges of different heights, which would otherwise produce a nonsensical output.
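
As a rough cross-check of these dimensions, the same arithmetic can be sketched in Python. The function name dimensions and the dictionary it returns are made up for illustration; only the variable names in the comments come from the formula.

```python
# Field header row of the sample data: PTHC repeated for three months
up_hdr = ["Pressure", "Temp", "Humidity", "CO2"] * 3

def dimensions(up_hdr, up_rows, by_rows, by_cols):
    block_size = len(dict.fromkeys(up_hdr))   # COUNTA( UNIQUE( upHdr, 1 ) )
    up_c = len(up_hdr)                        # COLUMNS( upValues )
    dmx_r = min(up_rows, by_rows)             # guards against ranges of different heights
    up_cells = dmx_r * up_c // block_size     # output rows, excluding the header
    return {"blockSize": block_size, "byC": by_cols, "upC": up_c,
            "dmxR": dmx_r, "upCells": up_cells}

# Sample shapes: 5 data rows, 2 key columns (State, City)
dims = dimensions(up_hdr, up_rows=5, by_rows=5, by_cols=2)
```

For the sample data this gives blockSize 4, upC 12, dmxR 5 and upCells 15, which matches the 15 rows of the desired output.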

Building Sequences creates three sequences that will be used for indexing the inputs and outputs:

tCSeq (table column sequence) is a column-wise sequence sized to the final output table, which will have byC (byBody columns) + 1 (attribute, i.e. month) + blockSize (values) columns.
tRSeq (table row sequence) is a row-wise sequence sized to the final output table, which will have dmxR*upC/blockSize + 1 (hdr) rows.
upSeq (unpivot sequence) is a row-wise sequence sized to the data rows of the output, i.e. dmxR*upC/blockSize rows (no header).

Create Array Components uses the dimensions and sequences above to construct the parts of the output table.

hdr (header) is the new header with the labels (State & City), the attribute title (month) and the field names (PTHC).
muxBody (multiplexed byBody) repeats each byBody row once per block, i.e. upC/blockSize times, for each of the dmxR rows.
muxAttr (multiplexed upAttr) repeats the sequence of block attributes (one upAttr value per block) for each of the dmxR rows.
muxValues (multiplexed upValues) re-cuts upValues into rows of blockSize columns, giving dmxR*upC/blockSize rows.
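
The index arithmetic behind muxBody, muxAttr and muxValues can be sketched in Python as follows. The function names are made up for illustration, and the integer divisions stand in for INDEX discarding the fractional part of a row or column number. The sketch assumes, as the formula appears to, that each block occupies adjacent columns (the first layout in the question).

```python
def source_indices(r, c, block_size, up_c):
    """Map 0-based output cell (r, c) to the 1-based (row, col) in upValues,
    mirroring SEQUENCE( upCells, blockSize, 0 ) with /upC and MOD( ..., upC )."""
    k = r * block_size + c            # row-major flat index starting at 0
    return k // up_c + 1, k % up_c + 1

def attr_index(r, block_size, up_c):
    """1-based column of upAttr for output row r: MOD( r * blockSize, upC ) + 1."""
    return (r * block_size) % up_c + 1

def body_row(r, block_size, up_c):
    """1-based row of byBody for output row r (integer part of muxBody's row index)."""
    return r * block_size // up_c + 1
```

With blockSize 4 and upC 12, output row 1 (0-based, Atlanta/Feb) starts at upValues row 1, column 5, and output row 3 (Boston/Jan) starts at row 2, column 1.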

The last two lines stitch parts together. First, table stitches muxBody, muxAttr and muxValues in a column-wise integration using tCSeq and a row-wise multiplex using upSeq.

Just because it is mentally easier (and easier to test), I separated the row-wise integration (using tRSeq) of the hdr onto the table in the last line.
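
The stitching logic can be sketched in Python like this. The function name stitch is made up for illustration, and plain lists stand in for the arrays; the comparisons against by_c play the role of the nested IFs on tCSeq.

```python
def stitch(hdr, mux_body, mux_attr, mux_values):
    """Assemble the output by choosing a source per column position."""
    by_c = len(mux_body[0])
    width = by_c + 1 + len(mux_values[0])

    def cell(r, t):                   # r: 0-based data row, t: 1-based column (tCSeq)
        if t <= by_c:                 # key columns come from muxBody
            return mux_body[r][t - 1]
        if t == by_c + 1:             # next column is the attribute (month)
            return mux_attr[r]
        return mux_values[r][t - by_c - 2]   # remaining columns are the block values

    table = [[cell(r, t) for t in range(1, width + 1)]
             for r in range(len(mux_body))]
    return [hdr] + table              # the header is attached last, as in the final IF
```

Keeping the header step separate from the column-wise step mirrors the two-line split described above and makes each piece easy to inspect on its own.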

An alternative to stitching with IF statements is IFERROR(INDEX(...)), which forces errors and then replaces them with the next part of the table, but that is very hard to test and debug even when it is only row-wise or column-wise. With a combination of row-wise and column-wise stitching, it is a nightmare.

Amulet answered 7/8, 2021 at 21:2 Comment(0)

Power Query version. The code is a bit longer to accommodate the possibility of AAAABBBB instead of ABABABAB.

let  Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
// list of months
#"Unpivoted Other Columns" = List.Repeat(Table.UnpivotOtherColumns(Table.FirstN(Source,1), {"Column1", "Column2"}, "Attribute", "Value")[Value],Table.RowCount(Source)-2),
#"Converted to Table" = Table.AddIndexColumn(Table.FromList(#"Unpivoted Other Columns", Splitter.SplitByNothing(), null, null, ExtraValues.Error), "Index", 0, 1),

// list of PTHC
#"Unpivoted Other Columns2" = List.Repeat(Table.UnpivotOtherColumns(Table.FirstN(Table.Skip(Source,1) ,1), {"Column1", "Column2"}, "Attribute", "Value")[Value],Table.RowCount(Source)-2),
#"Converted to Table2" = Table.AddIndexColumn(Table.FromList(#"Unpivoted Other Columns2", Splitter.SplitByNothing(), null, null, ExtraValues.Error), "Index", 0, 1),

// all other data
#"Unpivoted Other Columns1" = Table.UnpivotOtherColumns(Table.Skip(Source,2), {"Column1", "Column2"}, "Attribute", "Value"),
#"Added Index" = Table.AddIndexColumn(#"Unpivoted Other Columns1", "Index", 0, 1),

// merge in months and PTHC
#"Merged Queries" = Table.NestedJoin(#"Added Index",{"Index"},#"Converted to Table",{"Index"},"X1",JoinKind.LeftOuter),
#"Merged Queries2" = Table.NestedJoin(#"Merged Queries" ,{"Index"},#"Converted to Table2",{"Index"},"X2",JoinKind.LeftOuter),
#"Expanded X1" = Table.ExpandTableColumn(#"Merged Queries2", "X1", {"Column1"}, {"Month"}),
#"Expanded X2" = Table.ExpandTableColumn(#"Expanded X1", "X2", {"Column1"}, {"Type"}),

//extra work to pivot in correct format
#"Renamed Columns" = Table.RenameColumns(#"Expanded X2",{{"Column1", "State"}, {"Column2", "City"}}),
#"Removed Columns" = Table.RemoveColumns(#"Renamed Columns",{"Attribute","Index"}),
#"Sorted Rows" = Table.Sort(#"Removed Columns",{{"State", Order.Ascending}, {"City", Order.Ascending}, {"Month", Order.Ascending}, {"Type", Order.Ascending}}),
#"Added Index1" = Table.AddIndexColumn(#"Sorted Rows", "Index", 0, 1),
TypeCount=List.Count(List.Distinct(#"Added Index1"[Type])), 
#"Integer-Divided Column" = Table.TransformColumns(#"Added Index1", {{"Index", each Number.IntegerDivide(_, TypeCount), Int64.Type}}),
#"Pivoted Column" = Table.Pivot(#"Integer-Divided Column", List.Distinct(#"Integer-Divided Column"[Type]), "Type", "Value"),
#"Removed Columns1" = Table.RemoveColumns(#"Pivoted Column",{"Index"})
in #"Removed Columns1"
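
For readers who do not use M, the general idea of the query (unpivot every cell to long form, attach month and type by position, then pivot the types back into columns) can be sketched in Python. This is only an approximation of the approach, not a translation: the function name unpivot_via_long_form is made up, and it groups by a dictionary key where the query sorts and integer-divides an index.

```python
def unpivot_via_long_form(grid, by_count=2):
    """grid[0] = attribute (month) row, grid[1] = field row, grid[2:] = data rows."""
    months, types = grid[0][by_count:], grid[1][by_count:]
    long_rows = []
    for row in grid[2:]:
        key = tuple(row[:by_count])
        for i, value in enumerate(row[by_count:]):
            # Month and type are attached by column position,
            # in the spirit of the Index joins in the query
            long_rows.append((key, months[i], types[i], value))
    # Re-pivot: one record per (key, month), one column per type
    pivoted = {}
    for key, month, typ, value in long_rows:
        pivoted.setdefault((key, month), {})[typ] = value
    fields = list(dict.fromkeys(types))
    return [list(key) + [month] + [rec[f] for f in fields]
            for (key, month), rec in pivoted.items()]
```

Because month and type travel with each value, the column order of the input (AAAABBBB or ABABABAB) should not matter to the result.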

Latoyialatreece answered 7/8, 2021 at 22:31 Comment(5)

I had not realized that the order of the months would present a challenge in M. I intuitively thought that PQ would make quick work of this without much effort (having invested zero effort myself). Your opening approach is much faster than anything I would have done. I will use your techniques in the future. Normally, I would have stopped at #"Removed Columns" for most applications, but because %Hum is a different data type, we cannot stop there - so your final repivoting is difficult and necessary. Nice job. As I see how you did it, I think hardcoding is inevitable i.e. PQ function = impossible. – Amulet 8/8, 2021 at 5:56

slightly changed code to remove the hardcoded number of PTHC unique values – Latoyialatreece 8/8, 2021 at 16:8

Also, not sure why you say a PQ function is impossible. All you have to do is feed a range into this as a function and it will work, no coding changes necessary. Order of months is irrelevant – Latoyialatreece 9/8, 2021 at 12:25

As I looked through your script, I saw places where M must take on explicit values, so (without having tried to tackle it myself) I thought that this would be very difficult to navigate around. e.g. take this line: Table.RenameColumns(#"Expanded X2",{{"Column1", "State"}, {"Column2", "City"}}), If the inputs that I defined in my UDF as byHdr were different, such as "Country", "State", "Parish", "City", "Postal Code", how could this be turned into a dynamic function in PQ? Seems super hard to me. – Amulet 9/8, 2021 at 13:22

There is no reason we need to have named columns, though this is hard coded to assume only two columns before the data kicks in, so in that view, you are correct it would be hard to make this dynamic – Latoyialatreece 9/8, 2021 at 13:39

Not sure if it can be called an improvement to the existing LET solution, but this is both shorter and a little more intuitive to me.

=LET( upValues, C3:N7,  upHdr, C2:N2,  upAttr, C1:N1,
      byBody, A3:B7,  byHdr, A2:B2,
      attrTitle, "month",

          attributes, UNIQUE(upAttr,1), attrcount, COUNTA(attributes),
          vars, UNIQUE(upHdr,1), varcount, COUNTA(vars),
          byC, COLUMNS(byBody),
          rowseq, SEQUENCE(ROWS(byBody)*attrcount),
          colseq, SEQUENCE(1,byC+1+varcount),
          rept, CEILING(rowseq/attrcount,1),
          rept1, IF(MOD(rowseq, attrcount)=0, attrcount, MOD(rowseq, attrcount)),
          header, IF(colseq<=byC, byHdr, IF(colseq=byC+1, attrTitle, INDEX(vars, 1, colseq-byC-1))),
          loc, INDEX(byBody,rept, SEQUENCE(1,byC)),
          attrCol, INDEX(attributes, 1, rept1),
          data, INDEX(upValues, rept, SEQUENCE(1,varcount)+(rept1*varcount)-varcount),
          mydata, IF(colseq<=byC, loc, IF(colseq=byC+1, attrCol, INDEX(data, rowseq, colseq-byC-1))),
          final, IF(SEQUENCE(MAX(rowseq)+1)=1, header, INDEX(mydata, SEQUENCE(ROWS(byBody)*attrcount+1)-1, colseq)),
         final )
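
The row bookkeeping in this version (rept, rept1 and the column offset inside data) can be sketched in Python. The function name row_lookup is made up for illustration; all indices are 1-based to match the formula, and the column offset assumes each attribute's fields sit in adjacent columns.

```python
def row_lookup(rowseq, attrcount, varcount):
    """For 1-based output row rowseq, return
    (source row, attribute index, value columns in upValues)."""
    rept = (rowseq + attrcount - 1) // attrcount   # CEILING(rowseq/attrcount, 1)
    rept1 = rowseq % attrcount or attrcount        # MOD result of 0 is mapped to attrcount
    first = rept1 * varcount - varcount + 1        # first column of this attribute's block
    return rept, rept1, list(range(first, first + varcount))
```

With 3 attributes and 4 fields, output row 3 maps to source row 1, attribute 3 (Mar) and columns 9 to 12, and output row 4 moves on to source row 2, attribute 1.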

Beelzebub answered 8/8, 2021 at 5:19 Comment(3)

Good morning (my time) EDS - This reduces steps by 2. I am still testing it, but you forgot the month column in the output. – Amulet 8/8, 2021 at 6:46

ahh - i just noticed, that you didn't have it in the input - may have missed that point. The original idea is to unpivot the months, but in blocks of measurements. – Amulet 8/8, 2021 at 6:57

Hey - it fully works and I think it is doing it in fewer steps. I proposed some edits to parameterize it into a UDF form. – Amulet 8/8, 2021 at 7:42
