How to deal with NaN values when plotting a boxplot
I am using matplotlib to draw a boxplot, but my data contain some missing values (NaN). The boxes are not displayed for the columns that contain NaN values. How can I solve this problem? Here is my code.

import numpy as np
import matplotlib.pyplot as plt

#==============================================================================
# open data
#==============================================================================
filename='C:\\Users\\liren\\OneDrive\\Data\\DATA in the first field-final\\ks.csv'

AllData=np.genfromtxt(filename,delimiter=";",skip_header=0,dtype='str')

TreatmentCode = AllData[1:,0]
RepCode = AllData[1:,1]
KsData= AllData[1:,2:].astype('float')
DepthHeader = AllData[0,2:].astype('float')
TreatmentUnique = np.unique(TreatmentCode)[[3,1,4,2,8,6,9,7,0,5,10],]
nT = TreatmentUnique.size  # nT = number of treatments
# nD = number of depths; nR = number of replications; iT = index of the current treatment
nD = 5
nR = 6
KsData_3D = np.zeros((nT,nD,nR)) 

for iT in range(nT):
    Treatment = TreatmentUnique[iT]

    TreatmentFilter = TreatmentCode == Treatment

    KsData_Filtered = KsData[TreatmentFilter,:]
    
    KsData_3D[iT,:,:] = KsData_Filtered.transpose()

iD = 4  # index of the depth to plot
                      
fig=plt.figure()
ax = fig.add_subplot(111)
plt.boxplot(KsData_3D[:,iD,:].transpose())
ax.set_xticks(range(1,nT+1))
ax.set_xticklabels(TreatmentUnique)
ax.set_title(DepthHeader[iD])

In the resulting figure (image not preserved), the boxes for some of the treatments are missing.

Puckery answered 1/6, 2017 at 11:7 Comment(0)

You can remove the NaNs from the data first, then plot the filtered data.

To do that, first find the NaNs with np.isnan(data), then invert that Boolean array with the ~ operator (bitwise inversion). Indexing the data array with the result filters out the NaNs.

filtered_data = data[~np.isnan(data)]

A complete example (adapted from here):

Tested in python 3.10, matplotlib 3.5.1, seaborn 0.11.2, numpy 1.21.5, pandas 1.4.2

For 1D data:

import matplotlib.pyplot as plt
import numpy as np

# fake up some data
np.random.seed(2022)  # so the same data is created each time
spread = np.random.rand(50) * 100
center = np.ones(25) * 50
flier_high = np.random.rand(10) * 100 + 100
flier_low = np.random.rand(10) * -100
data = np.concatenate((spread, center, flier_high, flier_low), 0)

# Add a NaN
data[40] = np.nan  # np.nan works on all NumPy versions; the np.NaN alias was removed in NumPy 2.0

# Filter data using np.isnan
filtered_data = data[~np.isnan(data)]

# basic plot
plt.boxplot(filtered_data)

plt.show()


For 2D data:

For 2D data, you can't simply use the mask above, since then each column of the data array would have a different length. Instead, we can create a list, with each item in the list being the filtered data for each column of the data array.

A list comprehension can do this in one line, where mask = ~np.isnan(data): [d[m] for d, m in zip(data.T, mask.T)]
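To see why the flat mask is not enough for 2D input, here is a small sketch on a made-up 3x2 array (the values are purely illustrative). Boolean indexing a 2D array returns a flattened 1D array, so the column structure is lost, while the per-column comprehension keeps one array per column.

```python
import numpy as np

# Tiny illustrative array: 3 rows, 2 columns, one NaN in each column
data = np.array([[1.0, 10.0],
                 [np.nan, 20.0],
                 [3.0, np.nan]])
mask = ~np.isnan(data)

# Boolean indexing on a 2D array flattens the result to 1D
flat = data[mask]

# Filtering column by column keeps the columns separate (lengths may differ)
per_column = [d[m] for d, m in zip(data.T, mask.T)]

print(flat.shape)                        # (4,)
print([c.tolist() for c in per_column])  # [[1.0, 3.0], [10.0, 20.0]]
```

Passing flat to plt.boxplot would draw a single box for all values combined, whereas per_column gives one box per column.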

import matplotlib.pyplot as plt
import numpy as np

# fake up some data
np.random.seed(2022)  # so the same data is created each time
spread = np.random.rand(50) * 100
center = np.ones(25) * 50
flier_high = np.random.rand(10) * 100 + 100
flier_low = np.random.rand(10) * -100
data = np.concatenate((spread, center, flier_high, flier_low), 0)

data = np.column_stack((data, data * 2., data + 20.))

# Add a NaN
data[30, 0] = np.nan  # np.nan works on all NumPy versions; np.NaN was removed in NumPy 2.0
data[20, 1] = np.nan

# Filter data using np.isnan
mask = ~np.isnan(data)
filtered_data = [d[m] for d, m in zip(data.T, mask.T)]

# basic plot
plt.boxplot(filtered_data)

plt.show()


I'll leave it as an exercise to the reader to extend this to 3 or more dimensions, but you get the idea.
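For what it's worth, one possible way to handle a 3D array like the one in the question is to slice out a 2D view first and then filter each row. This is only a rough sketch: data_3d, depth_index and the shape (treatments, depths, replications) are hypothetical stand-ins, not the asker's actual data.

```python
import numpy as np

# Hypothetical stand-in for an array shaped (treatments, depths, replications)
rng = np.random.default_rng(0)
data_3d = rng.random((3, 2, 6))
data_3d[0, 1, 2] = np.nan  # one missing measurement

# Pick one depth; the slice has shape (treatments, replications)
depth_index = 1
depth_slice = data_3d[:, depth_index, :]

# One 1D array per treatment, with the NaNs dropped
filtered = [row[~np.isnan(row)] for row in depth_slice]

print([len(f) for f in filtered])  # [5, 6, 6]
```

The resulting list can be passed to plt.boxplot(filtered) to get one box per treatment, the same way as in the 2D example.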


The solution above is how to do this using matplotlib alone. Other alternatives (that use matplotlib under the hood) are available that have this behaviour built in, so no need to filter the data yourself.

  1. Use seaborn, which is a high-level API for matplotlib. seaborn.boxplot filters NaN under the hood.

import seaborn as sns

sns.boxplot(data=data)

(Figures not preserved: the seaborn boxplots for the 1D and 2D data.)


  2. Use pandas. df.plot(kind='box') also ignores NaN; pandas uses matplotlib as its default plotting backend.

import pandas as pd

df = pd.DataFrame(data)

df.plot(kind='box')

(Figures not preserved: the pandas boxplots for the 1D and 2D data.)

Mezzo answered 1/6, 2017 at 12:0 Comment(1)
Excellent summary! One question though: at the beginning, instead of isnan and the bitwise inversion, wouldn't it be simpler to use dropna()? - Marinemarinelli
