Groupby aggregations and missing combinations of values
Asked Answered
S

1

6

I recently started tinkering with Vega-Lite templates to make a confusion matrix for an open-source data science software called DVC. You can see the template in my PR here, but I'll also repeat a simplified version below:

{
    ...
    "data": {
        "values": [
            {"actual": "Wake", "predicted": "Wake", "rev": "HEAD"},
            {"actual": "Wake", "predicted": "Deep", "rev": "HEAD"},
            {"actual": "Light", "predicted": "Wake", "rev": "HEAD"},
            {"actual": "REM", "predicted": "Light", "rev": "HEAD"},
            ....
        ],
    },
    "spec": {
        "transform": [
            {
                "aggregate": [{"op": "count", "as": "xy_count"}],
                "groupby": ["actual", "predicted"],
            },
            {
                "joinaggregate": [
                    {"op": "max", "field": "xy_count", "as": "max_count"}
                ],
                "groupby": [],
            },
            {
                "calculate": "datum.xy_count / datum.max_count",
                "as": "percent_of_max",
            },
        ],
        "encoding": {
            "x": {"field": "predicted", "type": "nominal", "sort": "ascending"},
            "y": {"field": "actual", "type": "nominal", "sort": "ascending"},
        },
        "layer": [
            {
                "mark": "rect",
                "width": 300,
                "height": 300,
                "encoding": {
                    "color": {
                        "field": "xy_count",
                        "type": "quantitative",
                        "title": "",
                        "scale": {"domainMin": 0, "nice": True},
                    }
                },
            },
            {
                "mark": "text",
                "encoding": {
                    "text": {
                        "field": "xy_count",
                        "type": "quantitative"
                    },
                    "color": {
                        "condition": {
                            "test": "datum.xy_count / datum.max_count > 0.5",
                            "value": "white"
                        },
                        "value": "black"
                    }
                }
            }
        ]
    }
}

So, since I'm doing a groupby aggregation, it's possible for there to be cells in the confusion matrix with no entries. Here's an example output: link

How can I fill in these cells with "fallback" or something. I also looked at using pivot and impute, but couldn't quite figure it out. Help much appreciated :)

Sorrento answered 23/10, 2020 at 16:42 Comment(0)
G
3

You can do this by adding two Impute transforms to the end of your sequence of transforms:

  {"impute": "xy_count", "groupby": ["actual"], "key": "predicted", "keyvals": ["Deep", "Light", "Wake", "REM"], "value": 0},
  {"impute": "xy_count", "groupby": ["predicted"], "key": "actual", "keyvals": ["Deep", "Light", "Wake", "REM"], "value": 0}

The keyvals specify which missing values you would like to be imputed on each axis; you can leave it out if at least one of the groups is present for each keyval.

Goose answered 23/10, 2020 at 17:26 Comment(1)
Thank you! This was almost exactly correct, I just needed to add "rev" to the "groupby" fields, then placed these imputes directly after the first groupby, since that seemed the most logical place. I also dropped the "keyvals", because this is supposed to be a generic template and not dependent on my particular values :)Sorrento

© 2022 - 2024 — McMap. All rights reserved.