Why are Numpy masked arrays useful?

I've been reading through the masked array documentation and I'm confused: how does a MaskedArray differ from just maintaining an array of values and a separate boolean mask? Can someone give me an example where MaskedArrays are far more convenient, or perform better?

Update 6/5

To be more concrete about my question, here is the classic example of how one uses a MaskedArray:

>>> import numpy as np
>>> data = np.arange(12).reshape(3, 4)
>>> mask = np.array([[0., 0., 1., 0.],
...                  [0., 0., 0., 1.],
...                  [0., 1., 0., 0.]])

>>> masked = np.ma.array(data, mask=mask)
>>> masked

masked_array(
  data=[[0, 1, --, 3],
        [4, 5, 6, --],
        [8, --, 10, 11]],
  mask=[[False, False,  True, False],
        [False, False, False,  True],
        [False,  True, False, False]],
  fill_value=999999)

>>> masked.sum(axis=0)

masked_array(data=[12, 6, 16, 14], mask=[False, False, False, False], fill_value=999999)

I could just as easily do the same thing this way:

>>> data = np.arange(12).reshape(3, 4).astype(float)
>>> mask = np.array([[0., 0., 1., 0.],
...                  [0., 0., 0., 1.],
...                  [0., 1., 0., 0.]]).astype(bool)

>>> # The copy keeps the original data reusable, as the MaskedArray would.
>>> # If we only need to perform one operation, we could avoid the copy.
>>> masked = data.copy()
>>> masked[mask] = np.nan
>>> np.nansum(masked, axis=0)

array([12.,  6., 16., 14.])

I suppose the MaskedArray version looks a bit nicer and avoids the copy if you need a reusable array. But doesn't it use just as much memory when converting from a standard ndarray to a MaskedArray? Does it avoid the copy under the hood when applying the mask to the data? Are there other advantages?

Handsel answered 4/5, 2019 at 23:26 Comment(7)
It's not about performance. I've seen occasional SO questions involving masked arrays, but not many. They may have been more useful in pre-pandas days. - Oshaughnessy
Shouldn't it be for a case where a function expects to write to various indices in an existing array and you want to restrict its action to a subset of values? - Pronoun
Here is a fresh example of using MaskedArray to mask the unwanted part of an array and use the result for plotting: #56412087. - Dachi
@Dachi why could you not just use a boolean array to do the masking, rather than having this separate class? - Handsel
@RedPanda: MaskedArray's mask is a boolean array. - Dachi
@Dachi right, I understand that. My question is whether there is something about the implementation of MaskedArray that is better than just Boolean indexing a standard ndarray. - Handsel
@Handsel MaskedArray has two parts, the data and the mask, both with identical shape (i.e. equal rows and columns for 2D), so it is convenient to use with many NumPy operations. It is most convenient when the data need to be preserved while only parts of them are used and the rest is masked. The mask can be changed without affecting the data. The mask can also be applied to other arrays, making a group of arrays compatible in size/shape so that they can be used together. Here is an example where it is useful (missing-data cases): currents.soest.hawaii.edu/ocn760_4/_static/masked_arrays.html. - Dachi

The official answer was reported here:

In theory, IEEE nan was specifically designed to address the problem of missing values, but the reality is that different platforms behave differently, making life more difficult. On some platforms, the presence of nan slows calculations 10-100 times. For integer data, no nan value exists.
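The last point of the quote, that integer data has no nan, can be checked directly. This is a minimal sketch with made-up values (the `counts` array and its mask are illustrative, not taken from the original answer):

```python
import numpy as np

counts = np.array([3, 7, 0, 5])  # integer data (hypothetical values)

# An integer array cannot hold nan, so the "array of nans" approach
# only works after converting the data to float.
try:
    counts_nan = counts.copy()
    counts_nan[2] = np.nan
    nan_assignment_failed = False
except ValueError:
    nan_assignment_failed = True

# A masked array marks the entry as invalid and keeps the integer dtype.
masked_counts = np.ma.array(counts, mask=[False, False, True, False])
total = masked_counts.sum()  # sums only the unmasked entries: 3 + 7 + 5

print(nan_assignment_failed, masked_counts.dtype, total)
```

With float data this distinction disappears, which is why the nan approach in the question needed `.astype(float)`.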

In fact, masked arrays can be quite slow compared to the analogous array of nans:

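import numpy as np

g = np.random.random((5000, 5000))
# 500 random (row, col) positions to flag as missing
indx = np.random.randint(0, 5000, (500, 2))
rows, cols = indx[:, 0], indx[:, 1]

g_nan = g.copy()
g_nan[rows, cols] = np.nan  # note: g_nan[indx] would blank out whole rows

mask = np.full((5000, 5000), False, dtype=bool)
mask[rows, cols] = True
g_mask = np.ma.array(g, mask=mask)

# Timings as reported in the original answer; they will vary by machine.
%timeit (g_mask + g_mask)**2
# 1.27 s ± 35.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit (g_nan + g_nan)**2
# 76.5 ms ± 715 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)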

When are they useful?

In many years of programming, I found them useful on the following occasions:

  • when you want to preserve the values you masked for later processing, without copying the array.
  • when you don't want to get tricked by the strange behaviour of nan operations (although you might be tricked by the behaviour of masked arrays too).
  • when you have to handle many arrays together with their masks: if the mask is part of the array, you avoid extra code and confusion.
  • when you want to give the masked values a different meaning from the nan values. For instance, I use np.nan for missing values but I also mask the values with poor SNR, so I can identify both.
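The first and last points can be sketched as follows. The `signal` values and the `poor_snr` flag are hypothetical; the sketch relies on `np.ma.array` defaulting to `copy=False`, so the masked array wraps the existing data buffer rather than duplicating it:

```python
import numpy as np

signal = np.array([1.0, np.nan, 3.0, 4.0, 5.0])          # nan marks a missing sample
poor_snr = np.array([False, False, True, False, False])  # hypothetical quality flag

# copy=False is the default, so no second copy of the data is made.
masked = np.ma.array(signal, mask=poor_snr)
shares_buffer = np.shares_memory(masked.data, signal)

# The two kinds of "bad" values remain distinguishable.
missing_idx = np.flatnonzero(np.isnan(masked.data))  # samples that are nan
poor_idx = np.flatnonzero(masked.mask)               # samples flagged as poor SNR

# The masked value is hidden from computations but not overwritten.
preserved = masked.data[2]

# To exclude both kinds at once, combine the two conditions into one mask.
good = np.ma.array(signal, mask=poor_snr | np.isnan(signal))
good_mean = good.mean()  # mean of 1.0, 4.0 and 5.0

print(shares_buffer, missing_idx, poor_idx, preserved, good_mean)
```

With the nan approach from the question, the value at index 2 would have been overwritten unless the array was copied first.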

In general, you can consider a masked array as a more compact representation. The best approach is to test, case by case, which solution is more comprehensible and efficient.

Tetanic answered 2/11, 2019 at 21:12 Comment(0)

Masked arrays can be used to greatly speed up data analysis when you are computing many pairwise comparisons and you have missing values.

In my blog I demonstrate how you can obtain a speed advantage of at least 1000-fold when analysing data with missing values, compared with for loops, which are the only other solution when you have realistic data.

Here is a summary of my blog post:

  • Fast computation of all pairwise comparisons among many variables can be achieved with a sequence of matrix multiplications and other matrix operations. This applies to many measures/statistics. I focus on percentage agreement (and Cohen's kappa) in my blog post, but that's just one example.
  • Missing values negate this advantage for normal NumPy arrays. Matrix multiplication on data with missing values results in a lot of wasted data, because in real data different variables have missing values for different observations.
  • What you want for your measure/statistic is pairwise deletion: observations are only removed when a missing value is present for the pair of variables under consideration.
  • Matrix multiplication of masked arrays does pairwise deletion, and it does so very fast!
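The blog post's own code is not reproduced here, so the following is only a sketch of the idea under stated assumptions: binary ratings, made-up toy data, and `np.ma.dot` with its default `strict=False`, which treats masked entries as zero in the product. The helper `agreement_loop` is a hypothetical slow reference used to cross-check the result.

```python
import numpy as np

# Hypothetical toy data: 5 observations (rows) of 3 binary variables (columns).
ratings = np.array([[1, 0, 1],
                    [0, 0, 1],
                    [1, 1, 0],
                    [1, 0, 0],
                    [0, 1, 1]])
missing = np.array([[0, 0, 1],
                    [0, 1, 0],
                    [0, 0, 0],
                    [1, 0, 0],
                    [0, 0, 0]], dtype=bool)
x = np.ma.array(ratings, mask=missing)

# Masked entries contribute nothing to the products, so each (i, j) entry
# only counts observations where both variables are present.
both_one = np.ma.dot(x.T, x)             # both variables equal 1
both_zero = np.ma.dot((1 - x).T, 1 - x)  # both variables equal 0

# Number of observations available for each pair of variables.
valid = (~missing).astype(int)
n_pairs = valid.T @ valid

agreement = (both_one + both_zero) / n_pairs


def agreement_loop(values, missing):
    """Slow reference: explicit pairwise deletion with for loops."""
    n_vars = values.shape[1]
    out = np.empty((n_vars, n_vars))
    for i in range(n_vars):
        for j in range(n_vars):
            keep = ~missing[:, i] & ~missing[:, j]
            out[i, j] = np.mean(values[keep, i] == values[keep, j])
    return out


print(np.allclose(np.ma.filled(agreement, np.nan),
                  agreement_loop(ratings, missing)))
```

The matrix version replaces the double loop over variable pairs with three matrix products, which is where a speed-up on many variables would come from.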
Tangential answered 10/8, 2023 at 11:38 Comment(0)
