Survey Data Model - How to avoid EAV and excessive denormalization?

Asked 7/1, 2011 at 15:16 Answered 7/1, 2011 at 19:11

My database skills are mediocre at best and I have to design a data model for survey data. I have spent some thoughts on this and right now I feel that I am stuck between some kind of EAV model and a design involving hundreds of tables, each with hundreds of columns (and thousands of records). There must be a better way to do this and I hope that the wise folks on this forum can help me.

My question is: how should I model the answers to survey questions in an RDBMS? Using SQL Server is mandatory. So alternative data storage systems should be excluded from this discussion. (Sure, some should and will be evaluated, but not here please.) I don't need a solution for the entire data model, for now I'm only interested in the Answers part.

I have already searched various forums, but I couldn't really find a solution. If it has already been given elsewhere, please excuse me and provide me with a link so I can read it up.

Some assumptions about the data I have to deal with:

Each survey consists of 1 to n questionnaires
Each questionnaire consists of 100-2,000 questions (please ignore that 2,000 questions really sound like a lot to answer...)
Questions can be of various types: multiple-choice, free text, a number (like age, income, percentages, ...)
Each survey involves 10-200 countries (These are not the respondents. The respondents are actually people in the countries.)
Depending on the type of questionnaire, each questionnaire is answered by 100-20,000 respondents per country.
A country can adapt the questionnaires for a survey, i.e. add, remove or edit questions
The data for one country is gathered in a separate database in that country. There is no possibility for online integration from the start.
The data for all countries has to be integrated later. This means for example, if a country has deleted a question, that data must somehow be derived from what they sent in order to achieve a uniform design across all countries
I will have to write the integration and cleaning software, which will need to work with every country's data
1. In the end the data needs to be exported to flat files, one rectangular grid per country and questionnaire.

I have already discussed this topic with people from various backgrounds and have not come to a good solution yet. I mainly got two kinds of opinions.

The domain experts, who are used to working with flat files (spreadsheet-style) for data processing and analysis vote for a denormalized structure with loads of tables and columns as I described above (1 table per country and questionnaire). This sounds terrible to me, because I learned that wide tables are to be avoided, it will be annoying to determine which columns are actually in a table when working with it, the database will become cluttered with hundreds of tables (or I even need to set up multiple databases, each with a similar yet a bit differetn design), etc.
O-O-programmers vote for a strongly "normalized" design, which would effectively lead to a central table containing all the answers from all respondents to all questions. This table would either need to contain a column of type sql_variant type or multiple answer columns with different types to store answers of different types (multiple choice, free text, ..). The former would essentially be a EAV model. I tend to follow Joe Celko here, who strongly discourages its use (he calls it OTLT or "One True Lookup Table"). The latter would imply that each row would contain null cells for the not applicable types by design.

Another alternative I could think of would be to create one table per answer type, i.e., one for multiple-choice questions, one for free text questions, etc.. That's not so generic, it would lead to a lot of union joins, I think and I would have to add a table if a new answer type is invented.

Sorry for boring you with all this text and thank you for your input!

Cheers, Alex

PS: I asked the same question here: http://www.eggheadcafe.com/community/aspnet/13/10242616/survey-data-model--how-to-avoid-eav-and-excessive-denormalization.aspx

Gabriellagabrielle answered 7/1, 2011 at 15:16 Comment(6)

To me, this sounds like a good candidate for an EAV solution. What's your objection to going that route? – Delia 7/1, 2011 at 15:19

What about using a document or NoSQL database? Maybe the problem here is adapting your domain model to a relational infrastructure, so why not just avoid it...? See en.wikipedia.org/wiki/NoSQL. – Querida 7/1, 2011 at 15:24

The EAV model seems to make integrity constraints a lot more cumbersome. I'd basically have to squeeze values of different data types into one column. See eggheadcafe.com/software/aspnet/32645959/… – Gabriellagabrielle 7/1, 2011 at 15:36

I think the relational structure per se is not a problem. Similar designs have been made, see #1764935 for an example. – Gabriellagabrielle 7/1, 2011 at 15:39

8 is completely bogus and really can't be answered here. I don't understand how you're going to "complete" my age if I don't fill it in. But I try to answer the rest below – Nordine 7/1, 2011 at 18:1

@Stephanie: I'm asking about the data model. I have given these assumptions to illustrate what will be done with the data. Besides, this point is not "bogus". There are various ways to fill in such missing data. E.g., maybe the respondent can be contacted and asked about her age when she omitted it. Or you you can try to retrieve the repondent's age from another source. Or you can statistically impute it: en.wikipedia.org/wiki/Multiple_imputation. – Gabriellagabrielle 9/1, 2011 at 12:15

alt text Well imgur is down so i'll post the pic later.

I think this is completely feasible within a relational model. I've built a CDM to show how I would do this.

Outbound

It takes 4 entities to define a Country's Survey. Some Parent Survey, the country and a list of questions. Your questions have an internal relationship so when one country "edits" a question, you can track both the question asked by the country and the question it came from. The other thing you need is a Possible Answer entity/table. Each question may have an associated list of possible answers (multiple choice or ranges etc). Those 4 should completely define the "OUTBOUND" side of this.

Inbound

The "INBOUND" side is just 2 new entities, The Respondent and the answer. The respondent is straightforward, just the demographics of that person if you know them and here you can include a relationship back to country. Each respondent answered the survey in a given country. (Person may be 1:n with Respondent if the person travels or has dual citizenship)

The answer is basic; either it is one of the choices listed in the list of Possible Answers or it is provided. Don't get all caught up in the fact that the answer may be a number, date, etc just yet. Either it's a FK or a string of characters.

Reporting

A report is a join over all of these... You'll choose a country and a survey, get the list of questions and answers.

Answer Complexity

Depends on where you want to do your calculations. If you used a Varchar2(4000) column for your user-provided answers, you could add an attribute to question to describe the datatype of the answer. Q: Age? DT: Integer Between (0 and 130). Then your integration layer can do the validation instead of the database enforcing it. Or you can have 4 columns, one for number, date, character and CLOB. And your integration layer will determine the column to use. When you report those answers out, you'll just select all four columns with Coalesce().

Is this an EAV because there's a slight ambiguity to the datatype of "Answer"

No, it's not.

AN EAV model breaks down an Entity into a list of attributes. like so:

Entity     Attribute     Value
  1          Fname         Stephanie
  1          Lname         Page
  1          Age           30

because you see the Answer column of the Survey schema is holding both words and numbers like the Value column does here you think that defines EAV. It does not. Just as if I added 3 datatype columns to this model it wouldn't change it FROM an EAV.

I soooo hate it when

I've had people tell me that the query I'm tuning has to go "as fast as possible". Ok, so give me a billion dollars and 30 years. "Wait, a Billion what?" "As much as", "as fast as" aren't requirements. You can validate anything you want in a database... build a shedload of Before triggers, voila! Validation galore.

What's the datatype of an age column? Or Birthdate column? Depends on what your data source is. Some older records may only have Month and Year, or just year, or 'around' or 'circa' some year. You couldn't have just a number column and do 'as much validation as possible'. and NUMBER(2) may be BETTER validation than just NUMBER. So now you'll have NUMBER(1), NUMBER(2), NUMBER... to have "as much as".

Where I think you are getting tripped up

Think of this as a Conceptual Data Model, not a Physical one. In those terms Survey is an entity. Is Question an entity or just an attribute of Survey. If you built One table PER you're clearly saying that Question is just an Attribute of Survey and storing them vertically makes this an EAV. What this model shows is that Question is actually another entity. There is a relationship between Questions, e.g. 'a country [can] edit questions'. There was the original question and edited one. Each question has a collection of possible answers. And the most important this is that, they are all questions. In an EAV I call fname, lname, bdate, age, major, salary, etc... all very disparate things, just attributes. In this case we're not including the name of the agency who originated the survey and the date it was issued and the date is due back and the etc... as questions.

Let me put this another way. You're Fedex. You want to store timestamps for certain events. Each time a package enters or leaves a facility or vehicle. Time on the picking up truck, time off the truck and into the first facility, time out of that facility and onto a plane, etc. Do you store them Horizontally? How do you know the number of hops in advance? If you store them vertically does that automatically make it an EAV? And if so why.

You're a weather company getting temps from stations around the country. Let's say the sensors are designed to send a reading when the temperature changes +/- a full degree. If you store a sensor_ID|timestamp|temp is a Reading Table is that an EAV? Each reading isn't an attribute of the sensor, they are themselves entities which belong to a collection/series.

One thing that vertical storage of answers has in common with an EAV is its difficulty in performing analytic queries. If you wanted a list of all the people who answered TRUE to question 5 and 10 but FALSE to 6 and 11 would be very difficult when done vertically. Maybe that's why you see this an EAV. If you want to do that, you need a different storage. The relational storage of the question and answers isn't the best reporting database. Let's go back to the Fedex example. It's not simple to do "transit" time reporting when the rows are vertical.

Nordine answered 7/1, 2011 at 19:11 Comment(1)

Thanks for your suggestion. This is another example of an EAV model. I don't want to store the answers as strings, because of all the type casting it would involve. Also I would prefer to have as much validation as possible in the database. So having one column per data type sounds better to me. I have also thought about having one table per data type to reduce the amount of nulls. Now my main goal here is to understand the benefits of any solution. So why would you prefer your solution over an approach with multiple wide tables (see my initial question, point 1 on the bottom)? – Gabriellagabrielle 9/1, 2011 at 13:7

This sounds like you are wrestling with a common problem: how to use a hammer to fasten a screw.

Both alternatives you listed are bad, each for different reasons. But that's because you are trying to stuff your particular data model into a relational database system. A good approach would be to look beyond the relational database at some other database/storage systems, try a couple out, and find the best fit for your project.

I have tried the EAV model and gave up because it was far too complex, and I am afraid to try the multi-tables model with a relational database system. The easiest solution I have found with a relational database is: store each complete response as a single CLOB, serialized into JSON or YAML (or something else lightweight), in a responses table.

create table responses (
  id uuid primary key,
  questionnaire_id uuid references questionnaires.id,
  data text
)

Orlantha answered 7/1, 2011 at 15:39 Comment(6)

Why are you so sure about that? Relational systems are what me and my co-workers here understand best and what we have the infrastructure for. Furthermore similar relational designs have been developed and are being used. See for example the 4 "Questionnaires" designs on databaseanswers.org/data_models. Please don't get me wrong. I'm always eager to learn and use new technologies, but - especially in this case - it needs to pay off. I even need to convince a lot of people here that it makes sense to move away from a chaotic system of flat files in a directory system... – Gabriellagabrielle 7/1, 2011 at 15:51

So he's got a hammer, he's working with a screw, and you recommend pliers? – Richburg 7/1, 2011 at 15:57

My recommendation is: there is a whole range of data stores out there. It's not only SQL. Don't just look in the hammer drawer for the solution to your problem. Go to the Ace Hardware store and find the screwdriver that fits your screw. – Orlantha 7/1, 2011 at 16:1

I've added my relation database approach as well. – Orlantha 7/1, 2011 at 16:10

I appreciate your input, but I would like to stick to RDBMS here. I have edited my initial question to make this clearer. Your relational suggestion looks like an extreme case of EAV, very generic for sure. I will surely keep this in mind as an alternative. But I would prefer to store the data in a way, which allows for processing (checks, reports) with pure SQL. Also I would like assure data integrity as much as possible in the database and not in client apps, of which there will likely be multiple. – Gabriellagabrielle 9/1, 2011 at 12:49

PostgreSQL, MySQL, and MSQL seem to support XML fields to some degree, including some support for querying XML fields using an XML query language (xquery, xpath). – Orlantha 12/1, 2011 at 4:11

If I was using SQL Server, Express will be OK, then I would do this:

Table with list of questions, flags for type (bit), if required flag (bit), the correct answer if exists, etc
Table with list of countries
Table linking of countries and questions (some countries may not get some questions
Table for answers with columns for the question(s) and a xml column for the optional questions including those which are added

If you are not versed in shredding XML then use sparse columns for all the optional questions. I do not recall exactly the limit on the number of sparse columns in a table but I believe it is above 30,000. SQL Server internally stores sparse columns as XML and will shred it when one selects the column and yes it can be indexed

The diagram below show a diagram created with SQL Server. the column AL_A4 will hold the answer to QL_Id = 4 and is of type sparse. The QL_Id in the QuestionList table is not flagged required letting you know to make the column in AnswerList sparse.

Since countries will add questions create QuestionListCustom, QuestiontoCountryCustom and AnswerListCustom tables and add the information from the custom questions.

I am sure there are other ways to design the storage, this is the way I would turn in the homework, if this is not homework then you surely work for the UN.

alt text

Elyssa answered 7/1, 2011 at 17:24 Comment(4)

This completely ignores the fact that questions are grouped into surveys – Nordine 7/1, 2011 at 17:46

@Stephanie you are correct I completely left the survey part out. New table has been added – Elyssa 7/1, 2011 at 22:51

Thanks for your suggestion. I'm not sure if I understand the idea behind AnswerList. Do I get it right that it will contain one column per (mandatory) question? If yes, I doubt that this works well when multiple surveys are entered. I understand that the column names are generic, so question 1 for survey A can be completely different from question 1 in survey B. But they have to be of the same data type, which might or might not be the case. Or do I get it completely wrong? – Gabriellagabrielle 9/1, 2011 at 12:41

No, this is not a homework assignment, but a real project. And I'm not working for the UN. I do work in the field of large-scale sudent assessment: iea.nl – Gabriellagabrielle 9/1, 2011 at 12:42

-1

Have you considered not reinventing the wheel? There are open source survey applications already built. Even if they don't meet your needs, download a few and check out their data models.

Radiolarian answered 7/1, 2011 at 15:52 Comment(1)

I did have a closer look at limesurvey.org. They apparently create flat tables for each survey. Given the amount of tables I'd have to create and maintain, I'd like to avoid that. As I wrote above, EAV-type models are also available. It's not that I don't have any idea about the design at all - I'm having trouble taking a decision and am kindly asking for some input here. – Gabriellagabrielle 7/1, 2011 at 15:59

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags