GAM with mrf smooth - errors (mismatch between nb/polys area names and data area names

Asked 11/6, 2019 at 11:20 Answered 13/5, 2023 at 0:44

I am trying to fit Polish local government election results in 2015 following the superb blog by @GavinSimpson. https://www.fromthebottomoftheheap.net/2017/10/19/first-steps-with-mrf-smooths/ I joined my xls data with the shp data using a 6 digit identifier (there may be leading 0's). I kept it as a text variable. EDIT, I simplified the identifier and am now using a sequence from 1 to nrow to simplify my question.

library(tidyverse)
library(sf)
library(mgcv)

# Read data
# From https://www.gis-support.pl/downloads/gminy.zip shp file

boroughs_shp <- st_read("../../_mapy/gminy.shp",options = "ENCODING=WINDOWS-1250",
                     stringsAsFactors = FALSE ) %>% 
  st_transform(crs = 4326)%>% 
  janitor::clean_names() %>% 
# st_simplify(preserveTopology = T, dTolerance = 0.01) %>% 
  mutate(teryt=str_sub(jpt_kod_je, 1, 6)) %>% 
  select(teryt, nazwa=jpt_nazwa, geometry)

# From https://parlament2015.pkw.gov.pl/wyniki_zb/2015-gl-lis-gm.zip data file
elections_xls <-
  readxl::read_excel("data/2015-gl-lis-gm.xls",
             trim_ws = T, col_names = T) %>% 
  janitor::clean_names() %>% 
  select(teryt, liczba_wyborcow, glosy_niewazne)

elections <-
  boroughs_shp %>% fortify() %>% 
  left_join(elections_xls, by = "teryt") %>% 
  arrange(teryt) %>%
  mutate(idx = seq.int(nrow(.)) %>% as.factor(), 
         teryt = as.factor(teryt)) 

# Neighbors

boroughs_nb <-spdep::poly2nb(elections, snap = 0.01, queen = F, row.names = elections$idx )
names(boroughs_nb) <- attr(boroughs_nb, "region.id")

# Model

ctrl <- gam.control(nthreads = 4) 
m1 <- gam(glosy_niewazne ~ s(idx, bs = 'mrf', xt = list(nb = boroughs_nb)), 
          data = elections,
          offset = log(liczba_wyborcow), # number of votes
          method = 'REML', 
          control = ctrl,
          family = betar())

Here is the error message:

    Error in smooth.construct.mrf.smooth.spec(object, dk$data, dk$knots) : 
  mismatch between nb/polys supplied area names and data area names
In addition: Warning message:
In if (all.equal(sort(a.name), sort(levels(k))) != TRUE) stop("mismatch between nb/polys supplied area names and data area names") :
  the condition has length > 1 and only the first element will be used

elections$idx is a factor. I am using it to give names to boroughs_nb to be absolutely sure I have the same number of levels. What am I doing wrong?

EDIT: The condition mentioned in error message is met:

> all(sort(names(boroughs_nb)) == sort(levels(elections$idx)))
[1] TRUE

Warlock answered 11/6, 2019 at 11:20 Comment(12)

Make sure that teryt is a factor; at least one of the warnings is complain about that. – Bukharin 12/6, 2019 at 17:41

@GavinSimpson Thank you very much for helping me out. I am trying to follow your hint. It seems that poly2nb output must also be a 'class factor' somehow. But it is a list with names (char) and numeric values for neighbors of varying length. How to make it acceptable for mrf? Can names of poly2nb output be factors at all? What about the unequal lists of numerics for neighbors? – Warlock 13/6, 2019 at 6:43

You are setting up the smooth as s(teryt, bs = "mrf", ...) and the warning implies that teryt is not a factor. Is teryt a factor in df_wybory? Does that factor have exactly the same levels as the names of the object you pass to nb? – Bukharin 13/6, 2019 at 15:39

@GavinSimpson I have simplified the code and shortened the question. Now in the code I am using an idx column which is just an index from 1 to nrow, turn it to factor. Later I will use it to build a nb list, so that I am sure about the number of levels. – Warlock 13/6, 2019 at 16:26

@GavinSimpson the factors part disappeared, now I am dealing with error with one warning only: In if (all.equal(sort(a.name), sort(levels(k))) != TRUE) stop("mismatch between nb/polys supplied area names and data area names") – Warlock 13/6, 2019 at 16:35

That message is telling you that all(sort(names(boroughs_nb)) == sort(levels(idx))) must be TRUE and it isn't; the levels of idx don't match the names of the regions in the neighbour list. Basically you need to have a factor with levels that match the names on the neighbour list you pass to xt. – Bukharin 13/6, 2019 at 16:50

@GavinSimpson I am trying to make index as simple as possible... now the test is passed all(sort(names(boroughs_nb)) == sort(levels(elections$idx))) gives [1] TRUE but the "mismatch" error still remains – Warlock 13/6, 2019 at 16:56

What does all.equal(sort(names(boroughs_nb)), sort(levels(elections$idx))) say? It's failing this check so something isn't right. That's also a bug in mgcv (the warning), but that this is cropping up indicates that mgcv doesn't think these are equal – Bukharin 13/6, 2019 at 17:39

@GavinSimpson all.equal is ok too. I am not fluwbt enough in R to try other approach ie turn nb list items all from numeric to factors with equal levels. Maybe this is expected to be factors too? I was trying this but the sp_touches does not go through al data and throws a single “geometry error and does not return result. poly2nb returns result but the elements of a list are numeric not factor... :https://mcmap.net/q/1626717/-error-when-trying-to-evaluate-markov-random-fields-using-mgcv-gam-quot-mismatch-between-nb-polys-supplied-area-names-and-data-area-names-quot – Warlock 13/6, 2019 at 21:11

Any chance you can make the data available so I can run this and look? Feel free to email me; you'll find my gmail address at the blog post you cited in your Q. – Bukharin 13/6, 2019 at 21:25

@GavinSimpson i sent you an e-mail with my original script and links to data. I am now also thinking maybe it is a problem with corrupt geometry. – Warlock 14/6, 2019 at 16:44

@GavinSimpson I have now also tried to turn all the neighbor lists to factors and tried a method st_touches instead of poly2nb it seems that now I have all I could as factor: boroughs_nb <- lwgeom::st_make_valid(elections) %>% st_touches(sparse = T) %>% map(function(x) factor(x, levels = elections$idx)) names(boroughs_nb) <- elections$idx and all.equal(sort(names(boroughs_nb)), sort(levels(elections$idx))) returns TRUE – Warlock 17/6, 2019 at 9:30

To anyone who also finds this post years later: I don't know if I was facing the same problem as the original poster, but it presented in the same way.

I determined that the cause of my issue was that not all levels of the factor (ix in the above example) had rows associated for them. So in the above example, the equivalent would be boroughs for which there were no records in the dataset.

Because of this, when the model looked at the factor levels in the dataset, it saw only a subset of the levels defined in the neighborhood structure (boroughs_nb in the above example), which would cause the mismatch error.

Once I restructured my data so that every factor level had records associated with it, the error went away and the model fit correctly.

Asparagine answered 13/5, 2023 at 0:44 Comment(0)

It seems that I solved the issue, maybe not quite realizing how it did being stat beginner.

First, not a single NA should be present in modeled data. There was one. After that the mcgv seemed to run, but it took long time (quarter of an hour) and inexplicably for me, only when I limited no of knots to k=50, with poor results (less or more and it did not return any result) and with warning to be cautious about results. Then I tried to remove offset=log(liczba_wyborcow) ie offset number of voters and made number of void votes per 1000 my predicted variable.

elections <-
 boroughs_shp %>%  
 left_join(elections_xls, by = "teryt") %>% na.omit() %>% 
 arrange(teryt) %>% 
 mutate(idx = row_number() %>% as.factor()) %>% 
 mutate(void_ratio=round(glosy_niewazne/liczba_wyborcow,3)*1000)

Now that it is a count, why not try change family = betar() in gam formula to poisson() - still not a good result, and then to negative binomial family = nb() Now my formula looks like

m1 <-
gam(
 void_ratio ~ s(
 idx,
 bs = 'mrf',
 k =500,
 xt = list(nb = boroughs_nb),
 fx = TRUE),
 data = elections_df,
 method = 'REML', 
 control = gam.control(nthreads = 4),
 family = nb()
)

It seems now to be blazingly fast and return valid results with no warnings or errors. On a laptop with 4 cores Intel Core I7 6820HQ @ 2.70GHZ 16GB Win10 it takes now 1-2 minutest to build a model.

In brief, what I changed was: remove a single NA, remove offset from formula and use negative binomial distribution.

Here is the result of what I wanted to achieve, from left to right, real rate of void votes, a rate smoothed by a model and residuals indicating discrepancies. The mcgv code let me do that.

Warlock answered 21/6, 2019 at 13:14 Comment(0)

Recommended topics

Hot tags