Should I denormalize or run multiple queries in DocumentDb?
Asked Answered
T

1

10

I'm learning about data modeling in DocumentDb. Here's where I need some advice

Please see what my documents look like down below.

I can take two approaches here both with pros and cons.

Scenario 1:

If I keep the data denormalized (see my documents below) by keeping project team member information i.e. first, last name, email, etc. in the same document as the project, I can get the information I need in one query BUT when Jane Doe gets married and her last name changes, I'd have to update a lot of documents in the Projects collection. I'd also have to be extremely careful in making sure that all collections with documents that contain employee information get updated as well. If, for example, I update Jane Doe's name in Projects collection but forget to update the TimeSheets collection, I'd be in trouble!

Scenario 2:

If I keep data somewhat normalized and keep only EmployeeId in the project documents, I can then run three queries whenever I want to get a projects list:

  • Query 1 returns projects list
  • Query 2 would give me EmployeeId's of all project team members that appear in the first query
  • Query 3 for employee information i.e. first, last name, email, etc. I'd use the result of Query 2 to run this one

I can then combine all the data in my application.

The problem here is that DocumentDb seems to have a lot of limitations now. I may be reading hundreds of projects with hundreds of employees in project teams. Looks like there's no efficient way to get all employee information whose Id's appear in my second query. Again, please keep in mind that I may need to pull hundreds of employee information here. If the following SQL query is what I'd use for employee data, I may have to run the same query a few times to get all the information I need because I don't think I can have hundreds of OR statements:

SELECT e.Id, e.firstName, e.lastName, e.emailAddress
FROM Employees e
WHERE e.Id = 1111 OR e.Id = 2222

I understand that DocumentDb is still in preview and some of these limitations will be fixed. With that said, how should I approach this problem? How can I efficiently both store/manage and retrieve all project data I need -- including project team information? Is Scenario 1 a better solution or Scenario 2 or is there a better third option?

Here's what my documents look like. First, the project document:

{
   id: 789,
   projectName: "My first project",
   startDate: "9/6/2014",
   projectTeam: [
      { id: 1111, firstName: "John", lastName: "Smith", position: "Sr. Engineer" },
      { id: 2222, firstName: "Jane", lastName: "Doe", position: "Project Manager" }
   ]
}

And here are two employee documents which reside in the Employees collection:

{
   id: 1111,
   firstName: "John",
   lastName: "Smith",
   dateOfBirth: "1/1/1967',
   emailAddresses: [
      { email: "[email protected]", isPrimary: "true" },
      { email: "[email protected]", isPrimary: "false" }
   ]
},
{
   id: 2222,
   firstName: "Jane",
   lastName: "Doe",
   dateOfBirth: "3/8/1975',
   emailAddresses: [
      { email: "[email protected]", isPrimary: "true" }
   ]
}
Thicken answered 7/9, 2014 at 2:17 Comment(5)
Why have you added mongodb here?Eolian
Because ultimately the question is about data modeling in document databases and I want to see if there's an approach out there that I'm not yet seeing.Thicken
I am pondering the exact same thing, Have you any comment/insight on your experience since then? I am wondering either to normalise and read the users within the context (eg: associated with a project in your case) and reference the volatile information there (more than 1 query) or de-nornalise and possible have an "Update/Refresh user data" function for admins to refresh the names if required? (I am trying to keep away from stored procedure if I can)Elicia
Lots of lessons learned since I posted this question. The two main take aways are if you're going to use a NoSQL database, you need to get used to the idea of denormalizing. Otherwise, you're not taking advantage of what NoSQL databases offer which is speed and scalability. The second equally important lesson is that you need to have a good data integrity strategy because the NoSQL database is not doing this for you. So you need a good backend worker app such as an Azure function or WebJobs to handle it. Last but not least, do not shy away from stored procedures. They're super useful.Thicken
For whatever it's worth, here's a video I did on my journey from relational to NoSQL. Here it is: youtu.be/15-tb6u4RGM Good luck!Thicken
P
13

I believe you're on the right track in considering the trade-offs between normalizing or de-normalizing your project and employee data. As you've mentioned:

Scenario 1) If you de-normalize your data model (couple projects and employee data together) - you may find yourself having to update many projects when you update an employee.

Scenario 2) If you normalize your data model (decouple projects and employee data) - you would have to query for projects to retrieve employeeIds and then query for the employees if you wanted to get the list of employees belonging to a project.

I would pick the appropriate trade-off given your application's use case. In general, I prefer de-normalizing when you have a read-heavy application and normalizing when you have a write-heavy application.

Note that you can avoid having to make multiple roundtrips between your application and the database by leveraging DocumentDB's store procedures (queries would be performed on DocumentDB-server-side).

Here's an example store procedure for retrieving employees belonging to a specific projectId:

function(projectId) {
  /* the context method can be accessed inside stored procedures and triggers*/
  var context = getContext();
  /* access all database operations - CRUD, query against documents in the current collection */
  var collection = context.getCollection();
  /* access HTTP response body and headers from the procedure */
  var response = context.getResponse();

  /* Callback for processing query on projectId */
  var projectHandler = function(documents) {
    var i;
    for (i = 0; i < documents[0].projectTeam.length; i++) {
      // Query for the Employees
      queryOnId(documents[0].projectTeam[i].id, employeeHandler);
    }
  };

  /* Callback for processing query on employeeId */
  var employeeHandler = function(documents) {
    response.setBody(response.getBody() + JSON.stringify(documents[0]));
  };

  /* Query on a single id and call back */
  var queryOnId = function(id, callbackHandler) {
    collection.queryDocuments(collection.getSelfLink(),
      'SELECT * FROM c WHERE c.id = \"' + id + '\"', {},
      function(err, documents) {
        if (err) {
          throw new Error('Error' + err.message);
        }
        if (documents.length < 1) {
          throw 'Unable to find id';
        }
        callbackHandler(documents);
      }
    );
  };

  // Query on the projectId
  queryOnId(projectId, projectHandler);
}

Even though DocumentDB supports limited OR statements during the preview - you can still get relatively good performance by splitting the employeeId-lookups into a bunch of asynchronous server-side queries.

Phenobarbitone answered 8/9, 2014 at 22:39 Comment(1)
Thank you for your response. I feel the latter approach is a better one too. I just need to master JS stored procedures in DocumentDB. Thanks again. I do appreciate the code sample too.Thicken

© 2022 - 2024 — McMap. All rights reserved.