RethinkDB - Find documents with missing field

I'm trying to write the most efficient query to find all of the documents that do not have a specific field. Is there a better way to do this than the examples I have listed below?

// Get the ids of all documents missing "location"
r.db("mydb").table("mytable").filter({location: null},{default: true}).pluck("id")

// Get a count of all documents missing "location"
r.db("mydb").table("mytable").filter({location: null},{default: true}).count()

Right now, these queries take about 300-400ms on a table with ~40k documents, which seems rather slow. Furthermore, in this specific case, the "location" attribute contains latitude/longitude and has a geospatial index.

Is there a faster way to accomplish this? Thanks!

Ha answered 18/4, 2015 at 23:25 Comment(0)

A naive suggestion

You could use the hasFields method along with the not method to filter out unwanted documents:

r.db("mydb").table("mytable")
  .filter(function (row) {
    return row.hasFields({ location: true }).not()
  })

This might or might not be faster, but it's worth trying.
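
For reference, the same predicate can also be written with the r.row shorthand in the JavaScript driver and combined with count, mirroring the count query from the question (an untested sketch along the same lines):

// Count of all documents missing "location", using the r.row shorthand
r.db("mydb").table("mytable")
  .filter(r.row.hasFields("location").not())
  .count()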

Using a secondary index

Ideally, you'd want to make location a secondary index and then use getAll or between, since queries that use an index are much faster. One way to work around the missing fields is to give every row that lacks a location a location value of false. Then you create a secondary index on location, and finally you can query the table with getAll as much as you want.

  1. Add a location property to all documents without a location

For that, you'd need to first insert location: false into all rows without a location. You could do this as follows:

r.db("mydb").table("mytable")
  .filter(function (row) {
    return row.hasFields({ location: true }).not()
  })
  .update({
    location: false
  })

After this, you would need to find a way to insert location: false every time you add a document without a location.
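
One way to do that is in application code, before the insert; the helper name below (insertWithLocationDefault) is hypothetical, just to sketch the idea:

// Hypothetical application-side helper: default a missing location to false
function insertWithLocationDefault(conn, doc) {
  if (doc.location === undefined) {
    doc = Object.assign({}, doc, { location: false });
  }
  return r.db("mydb").table("mytable").insert(doc).run(conn);
}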

  2. Create a secondary index for the table

Now that all documents have a location field, we can create a secondary index for location.

r.db("mydb").table("mytable")
 .indexCreate('location')

Keep in mind that you only have to add { location: false } and create the index once.
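
Also note that indexCreate only starts building the index in the background; if you want to query it right away, you can wait for it to finish with indexWait:

r.db("mydb").table("mytable")
  .indexWait('location')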

  3. Use getAll

Now we can just use getAll to query documents using the location index.

r.db("mydb").table("mytable")
 .getAll(false, { index: 'location' })

This will probably be faster than the query above.
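
For example, the two queries from the question would then become something like this (same idea, not benchmarked):

// Ids of all documents missing "location", via the index
r.db("mydb").table("mytable")
  .getAll(false, { index: 'location' })
  .pluck("id")

// Count of all documents missing "location", via the index
r.db("mydb").table("mytable")
  .getAll(false, { index: 'location' })
  .count()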

Using a secondary index (function)

You can also create a secondary index based on a function. Basically, the index stores the result of the function for each document, and you then query those results with getAll. This is probably easier and more straightforward than what I proposed before.

  1. Create the index

Here it is:

r.db("mydb").table("mytable")
 .indexCreate('has_location', 
   function(x) { return x.hasFields('location'); 
 })

  2. Use getAll

Here it is:

r.db("mydb").table("mytable")
 .getAll(false, { index: 'has_location' })
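
Since hasFields returns a boolean, the same index also covers the opposite query; a quick sketch:

// Documents that do have a location, using the same function-based index
r.db("mydb").table("mytable")
  .getAll(true, { index: 'has_location' })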
Paginate answered 19/4, 2015 at 3:34 Comment(10)
I forgot to mention that 'location' does have an index - but it's a geospatial index, so it would never itself be false. It either contains the geo data or is not present at all.Ha
If you really want to make those queries faster, how about this: Create a new field on every document called hasLocation. Make it a secondary index. Every field that doesn't have a location will have hasLocation: false and then you can use getAll with a secondary index. There might be a more elegant way of doing this, but this will solve your problem in the meantime. Would that work?Paginate
Also, quick question: Why do some rows have locations and some don't? What type of data are you working with? It seems that you might need some normalization to distinguish between entries that have a location and places that don't. Maybe there's a better way to structure your data? Maybe you should create two tables depending on the existence of a location? Just a thought.Paginate
I'm working with a table full of businesses and their addresses. In theory, all of the businesses should have geocode data - but I'm working against the limit of 2,500 geocoding requests per day on Google's geocoding API. Also, the number of records is growing every day, so I'm building an incremental process to geocode new addresses and addresses without coordinates daily. Right now the data is pretty sparse, but I actually like the idea of breaking the geodata into its own table (id, location). Then I'm sure I could run this operation and most geospatial operations much faster.Ha
Ok, that makes total sense. I think creating a different table would definitely work, although you might run into slower queries doing joins later. If that happens, then maybe the approach of having location and hasLocation might make more sense (since it doesn't require any joins). You could add the hasLocation: false as a default and let your Google geocode API handler switch it to true once it has updated the location.Paginate
@ambient Just added a new solution, in case you're interested.Paginate
Jorge, the new solution is great! It's very intuitive and optimal in terms of execution time. Since the index is based off of a function, does it still rebuild itself automatically? For example, if I add location data to a document that previously had none, will it automatically fall out of the has_location index? I suppose I could test this pretty easily myself and find out, but I'd like to hear your thoughts too.Ha
Yes, the index will rebuild itself automatically. If you add a new document, it will execute that function on it and save it for future querying.Paginate
How can I do this with multiple conditions? For example, hasLocation: false, isCurrent: true?Aindrea
SO question: #40065764Aindrea
