MCAR Little's test in Python

Asked 28/9, 2019 at 8:44 Answered 29/6, 2023 at 13:44

python-3.x statistics missing-data imputation hypothesis-test

How can I run Little's test for MCAR (missing completely at random) in Python? I have looked at the R package for this test, but I want to do it in Python. Is there an alternative approach to testing for MCAR?

Brigand answered 28/9, 2019 at 8:44 Comment(3)

What about the impyute library? Little’s MCAR Test (WIP) is in its feature list. – Yorktown 28/9, 2019 at 10:35

@Yorktown The impyute library does not explain how to do it (as far as I have seen). Can you elaborate on the steps or link to proper documentation? – Rozella 13/10, 2019 at 9:33

The impyute library has a ticket to implement Little's MCAR Test, but it's not in progress: github.com/eltonlaw/impyute/issues/71 – Hydrocellulose 26/2, 2020 at 3:16

You can use rpy2 to run R's MCAR test from Python. Note that using rpy2 requires writing some R code.

Set up rpy2 in Google Colab

# rpy2 libraries
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects import globalenv

# Import R's base package
base = importr("base")

# Import R's utility packages
utils = importr("utils")

# Select mirror 
utils.chooseCRANmirror(ind=1)

# For automatic translation of Pandas objects to R
pandas2ri.activate()

# Enable R magic
%load_ext rpy2.ipython

# Make your Pandas dataframe accessible to R
globalenv["r_df"] = df

You can now get R functionality within your Python environment by using R magics. Use %R for a single line of R code and %%R when the whole cell should be interpreted as R code.

To install an R package use: utils.install_packages("package_name")

You may also need to load it before it can be used: %R library(package_name)

For Little's MCAR test, install the naniar package. Its installation is slightly more involved because the remotes package is needed first to download naniar from GitHub; for other packages the general procedure above should be enough.

utils.install_packages("remotes")
%R remotes::install_github("njtierney/naniar")

Load naniar package:

%R library(naniar)

Pass your r_df to the mcar_test function:

# mcar_test on whole df
%R mcar_test(r_df)

If an error occurs, try including only the columns with missing data:

%%R
# mcar_test on columns with missing data
r_dfMissing <- r_df[c("col1", "col2", "col3")]
mcar_test(r_dfMissing)
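The column names in the R snippet above are placeholders. A small pandas sketch for collecting the columns that actually contain missing values before handing the data to R could look like this (the example frame and the names `missing_cols` and `df_missing` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical example data; replace with your own DataFrame
df = pd.DataFrame({
    "col1": [1.0, np.nan, 3.0],
    "col2": [4.0, 5.0, np.nan],
    "col3": [7.0, 8.0, 9.0],
})

# Names of the columns that contain at least one missing value
missing_cols = df.columns[df.isnull().any()].tolist()

# Subset that can be passed to R instead of the full frame
df_missing = df[missing_cols]
print(missing_cols)
```

The subset can then be made available to R with globalenv["r_df"] = df_missing, as in the setup above.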

Tantivy answered 6/5, 2022 at 14:59 Comment(2)

Nice. Can you put a few words on why you would include only variables with missing data? I thought the idea was to assess differences in variables grouped by missing/non-missing, which I cannot imagine will work if we drop cols without missing. – Biocatalyst 18/6, 2023 at 13:15

That's a good question. The only reason I suggested including only the variables with missing data is that mcar_test() raised an error on the full data frame. I am not sure whether this happens in every situation or just with the data I tried it with. – Tantivy 23/8, 2023 at 7:18
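The idea raised in the first comment, comparing a variable between the rows where another variable is missing and the rows where it is observed, can be sketched directly in Python. This is not Little's test; it is a simpler pairwise check using Welch t-tests from scipy. The function name `missingness_t_tests` and the `min_group` parameter are assumptions made for this example:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def missingness_t_tests(data: pd.DataFrame, min_group: int = 2) -> pd.DataFrame:
    """Return Welch t-test p-values.

    Row label: the variable whose missingness defines the two groups.
    Column label: the variable whose values are compared between the groups.
    """
    cols = data.columns
    result = pd.DataFrame(np.nan, index=cols, columns=cols)
    for miss_col in cols:
        is_missing = data[miss_col].isnull()
        for value_col in cols:
            if value_col == miss_col:
                continue
            group_missing = data.loc[is_missing, value_col].dropna()
            group_observed = data.loc[~is_missing, value_col].dropna()
            # Skip pairs where a group is too small for a t-test
            if len(group_missing) < min_group or len(group_observed) < min_group:
                continue
            result.loc[miss_col, value_col] = ttest_ind(
                group_missing, group_observed, equal_var=False
            ).pvalue
    return result
```

Small p-values in a row suggest that the missingness of that variable is related to the values of another variable, which speaks against MCAR. Because many tests are run at once, some correction for multiple comparisons is advisable.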

You can use this function to run Little's MCAR test without any R code:

import numpy as np
import pandas as pd
from scipy.stats import chi2

def little_mcar_test(data, alpha=0.05):
    """
    Performs Little's MCAR (Missing Completely At Random) test on a dataset with missing values.
    
    Parameters:
    data (DataFrame): A pandas DataFrame with n observations and p variables, where some values are missing.
    alpha (float): The significance level for the hypothesis test (default is 0.05).
    
    Returns:
    A tuple containing:
    - A matrix of missing values that represents the pattern of missingness in the dataset.
    - A p-value representing the significance of the MCAR test.
    """
    
    # Calculate the proportion of missing values in each variable
    p_m = data.isnull().mean()
    
    # Calculate the proportion of complete cases for each variable
    p_c = data.dropna().shape[0] / data.shape[0]
    
    # Calculate the correlation matrix for all pairs of variables that have complete cases
    R_c = data.dropna().corr()
    
    # Calculate the correlation matrix for all pairs of variables using all observations
    R_all = data.corr()
    
    # Calculate the difference between the two correlation matrices
    R_diff = R_all - R_c
    
    # Calculate the variance of the R_diff matrix
    V_Rdiff = np.var(R_diff, ddof=1)
    
    # Calculate the expected value of V_Rdiff under the null hypothesis that the missing data is MCAR
    E_Rdiff = (1 - p_c) / (1 - p_m).sum()
    
    # Calculate the test statistic
    T = np.trace(R_diff) / np.sqrt(V_Rdiff * E_Rdiff)
    
    # Calculate the degrees of freedom
    df = data.shape[1] * (data.shape[1] - 1) / 2
    
    # Calculate the p-value using a chi-squared distribution with df degrees of freedom and the test statistic T
    p_value = 1 - chi2.cdf(T ** 2, df)
    
    # Create a matrix of missing values that represents the pattern of missingness in the dataset
    missingness_matrix = data.isnull().astype(int)
    
    # Return the missingness matrix and the p-value
    return missingness_matrix, p_value

Hamlen answered 14/5, 2023 at 11:55 Comment(1)

Cool. What df do you expect as input? And I thought Little's test should return one test with one p-value, not one per column. – Biocatalyst 18/6, 2023 at 13:25
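On the point raised in the comment above: Little's test produces a single chi-square statistic and a single p-value for the whole dataset. A minimal sketch of that statistic follows. It compares the means of the observed variables within each missingness pattern to the overall means, and uses degrees of freedom equal to the total number of observed variables summed over patterns minus the number of variables. It makes one loud simplification: the overall mean and covariance are pandas' available-case estimates (DataFrame.mean() and DataFrame.cov()) rather than the EM maximum-likelihood estimates that Little's derivation assumes, so the p-value should be treated as approximate. The function name `little_mcar_sketch` is made up for this example, and all columns are assumed to be numeric.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def little_mcar_sketch(data: pd.DataFrame):
    """Approximate Little's MCAR test; returns (statistic, dof, p_value)."""
    n_vars = data.shape[1]

    # ASSUMPTION: available-case estimates stand in for the EM (ML) estimates
    grand_mean = data.mean()
    grand_cov = data.cov()

    # Encode each row's missingness pattern as one integer
    missing = data.isnull().to_numpy().astype(np.int64)
    codes = missing @ (2 ** np.arange(n_vars, dtype=np.int64))

    statistic = 0.0
    observed_total = 0
    for _, rows in data.groupby(codes):
        # All rows in a group share one pattern, so the first row is enough
        observed = rows.columns[rows.notna().iloc[0].to_numpy()]
        if len(observed) == 0:
            continue  # rows with every value missing carry no information
        diff = (rows[observed].mean() - grand_mean[observed]).to_numpy()
        cov = grand_cov.loc[observed, observed].to_numpy()
        statistic += len(rows) * (diff @ np.linalg.solve(cov, diff))
        observed_total += len(observed)

    dof = observed_total - n_vars
    p_value = chi2.sf(statistic, dof)
    return statistic, dof, p_value
```

The input is a DataFrame with one row per observation and NaN for missing cells. A small p-value is evidence against MCAR. With data that has no missing values there is only one pattern, the degrees of freedom are zero, and the test is not meaningful.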

The comments suggest using an existing package. Here is an example taken directly from the pyampute documentation:

import pandas as pd
from pyampute.exploration.mcar_statistical_tests import MCARTest
data_mcar = pd.read_table("data/missingdata_mcar.csv")
mt = MCARTest(method="little")
print(mt.little_mcar_test(data_mcar))
# Output: 0.17365464213775494
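The file data/missingdata_mcar.csv comes with the pyampute example and may not be available locally. As a stand-in, a DataFrame with values removed completely at random can be generated with numpy and pandas. This is only a sketch: the column names, the sample size, and the 15% missing rate are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Complete data: 500 rows, 4 independent standard-normal columns
complete = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "d"])

# MCAR mechanism: every cell is dropped independently with probability 0.15
mask = rng.random(complete.shape) < 0.15
data_mcar = complete.mask(mask)

# Share of missing cells per column
print(data_mcar.isnull().mean())
```

Because the mask is drawn independently of the values, a test for MCAR run on data_mcar should usually not reject the null hypothesis.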

Biocatalyst answered 18/6, 2023 at 8:41 Comment(0)

import numpy as np
import pandas as pd
from scipy.stats import chi2

def little_mcar_test(data, alpha=0.05):
    """
    Performs Little's MCAR (Missing Completely At Random) test on a dataset with missing values.
    """
    data = pd.DataFrame(data)
    data.columns = ['x' + str(i) for i in range(data.shape[1])]
    data['missing'] = np.sum(data.isnull(), axis=1)
    n = data.shape[0]
    k = data.shape[1] - 1
    df = k * (k - 1) / 2
    chi2_crit = chi2.ppf(1 - alpha, df)
    chi2_val = ((n - 1 - (k - 1) / 2) ** 2) / (k - 1) / ((n - k) * np.mean(data['missing']))
    p_val = 1 - chi2.cdf(chi2_val, df)
    if chi2_val > chi2_crit:
        print(
            'Reject null hypothesis: Data is not MCAR (p-value={:.4f}, chi-square={:.4f})'.format(p_val, chi2_val)
        )
    else:
        print(
            'Do not reject null hypothesis: Data is MCAR (p-value={:.4f}, chi-square={:.4f})'.format(p_val, chi2_val)
        )

Polytechnic answered 29/6, 2023 at 13:44 Comment(1)

As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. – Welton 1/7, 2023 at 6:10
