Which database design gives better performance?

Asked 12/6, 2014 at 19:41 Answered 13/6, 2014 at 11:44

Solved sql database-design database-performance database-normalization

-1

I want to select to retrieve person and also further make some inserts, deletes and updates.

If I want retrieve person who lives in Brazil what will be the best approach?

Make 2 foreign key city and country in table person:

Person(id, name, profession, **id_country**, **id_city**)   
cities (id, city, **id_country**)  
countries (id, country)

or just one foreign key of cities in table person and a other foreign key county in table cities

Person(id, name, profession, **id_city**)   
cities (id, city, **id_country**)   
countries (id, country)

or make a view like the first option?

For inserting, deleting and updating data, are they still the best tables? No difference in performance ? Also I'm confused, what can affect performance in a schema?

Forepaw answered 12/6, 2014 at 19:41 Comment(8)

How do you measure "best" ? – Hydrated 12/6, 2014 at 19:43

in term of performance. and other criteria if they can help me – Forepaw 12/6, 2014 at 19:45

And what are you trying to perform? Inserts? Deletes? updates? specific queries? – Hydrated 12/6, 2014 at 19:47

i want now just make some search query select. and if you want give all the cases for Inserts, Deletes and updates. – Forepaw 12/6, 2014 at 19:48

Performance is a property of commands that retrieve or modify data, not a property of how data is stored. You should clarify your question with examples of queries that you want to perform well, as well as estimates of the sizes of the various tables. – Esperanzaespial 12/6, 2014 at 20:18

for example if i want retrieve person who lives in Brazil what will be the best approach ? – Forepaw 12/6, 2014 at 20:25

For your example the best solution would be Person(id, name, profession, country, city) :) – Bluecollar 12/6, 2014 at 20:27

so if i have other table the best way is to link them with the main table as foreign key ? – Forepaw 12/6, 2014 at 20:31

This question is very similar to the one you made yestarday:

Create many tables or just one

The answer is also similar - that depends on what you want to achieve. Both solutions could work and both have pros and cons and one should do a little trade-off analysis in the light of the specific situation. Out of this context is not possible to answer your question.

The only difference I see in both version is foreign key id_country in Person table:

Person(id, name, profession, ****id_country****, id_city)
cities (id, city, id_country)
countries (id, country)

The question is "do we need it?"

So, the pros and cons of both solutions:

1. Solution: With id_contry:

pros: easier retrival of a Person based on land (simpler query) and better performance of this query
cons: more complex underlaying DB and more redundancy, more chance of inreoducing inconsiscenties in the DB, harder updates

2. Solution: Without id_country:

pros: simpler and cleaner model, no redundancy, easier maintenance
cons: slower performance and more complex query for retrival of a Person based on land (simpler query)

So, the 1st solution effectivelly gives you easier query structure and better performance for retrieving Persons by Country (what you wanted), but it has its cost (see pros and cons). On the other side, pragmatic thinking says that country-city data are quite stabile and not often changed and this fact goes in favor of the 1st solution.

If this denormalization and slight chance of inconsistencies something you can live with, you can take the 1st solution.

Interpol answered 13/6, 2014 at 10:39 Comment(0)

Generally speaking, data integrity is more important than performance, so denormalize¹ only if you have performed measurements on representative amounts of data and the results indicate a strong need for better performance. More often that not, a normalized schema will perform just fine, especially if you got your physical design right (such as indexing).

In this particular case, my hunch would be that the second (normalized) design will work just fine.

That being said, probably the most efficient design for a query: "which persons live in a given country" would look something like this:

enter image description here

And then cluster the PERSON on its PK. This way, all persons belonging to the same COUNTRY_ID are stored physically close together in the database, dramatically reducing the I/O for the query above².

On the other hand, you no longer use simple auto-increment mechanism to generate your CITY_NO and PERSON_NO, the secondary indexes on PERSON are expensive because of clustering leading to other queries becoming slower etc. So, this is not absolutely "better" than your second design, they just both have different strengths/weaknesses and you'll have to decide which is a better trade-off in your particular case.

¹ Which prevents the database from "self-defending" itself from bad data. In your case, the first (denormalized) design would allow a person to reference a country that does not contain the city referenced by the same person.

² I/O tends to be the biggest cost for most queries.

Pleomorphism answered 13/6, 2014 at 11:44 Comment(6)

Why can't you use an auto-gen values for those columns? What about them am I missing? Granted that there would be "gaps", but they'd still be grouped by the leading elements of the key, meaning people in the same country/city would still be grouped together. – Spadix 14/6, 2014 at 6:8

But if i let my DB normalized and i create views ? is it the best way ? create views will affect performance ? – Forepaw 14/6, 2014 at 13:7

@Spadix Yes you could do that, but these "gaps" would either waste storage (if your DBMS has integers in variable-width representation), or make you run out of available values sooner (in fixed-width representation). – Pleomorphism 14/6, 2014 at 19:32

@Youssef Unless these views are materialized, no - views alone would not affect performance. A view is just a convenient way to store a query in the database itself, and generally speaking does not have a different performance from the (properly prepared) query. But don't take my word for it. Seriously, don't. Always measure yourself. – Pleomorphism 14/6, 2014 at 19:35

Okay, so you're doing an increment for every city/country combo, right? What happens if a city moves to a different country? Or a person to a different city? And how do you plan to generate these ids? – Spadix 14/6, 2014 at 23:1

@Spadix Generate new NO. This can be done as MAX(NO) + 1 WHERE COUNTRY_ID = ? ..., but special care needs to be taken to avoid race conditions with concurrent transactions trying to do the same (e.g. through locking, or by retrying the operation after a key violation). – Pleomorphism 14/6, 2014 at 23:59

Depending on what you want to achieve. In both cases you are entering redundant data. It is called denormalization. Second example of denormalization is called "Short-Circuit Keys":

Person(id, name, profession, **id_city**)   
cities (id, city, **id_country**)   
countries (id, country)

The result of that can produce fewer table joins when queries execute. First example is called "The star schema":

Person(id, name, profession, **id_country**, **id_city**)   
cities (id, city, **id_country**)  
countries (id, country)

The star schema consists of one or more fact tables referencing any number of dimension tables. Star schema join logic is generally simpler than the join logic required to retrieve data from a highly normalized transactional schemas. This example is usually seen in data warehouse database design.

There is no impact on performance in any case, you can choose one or the other to retrieve the data that you need.

Cheater answered 12/6, 2014 at 22:35 Comment(6)

thanks for the explanation but for inserting, Deleting and updating data, they still the same ? no difference on performance ? also i'm confused what can affect the performance in a schema ? – Forepaw 13/6, 2014 at 0:58

For inserting, deleting and updating data, there should be a trigger defined on tables. To prevent so-called orphaned data. And inserting, deleting and updating redundant data is always a performance issue. You must decide between selecting and writing joins the easier way or having quicker insert, delete and update. – Cheater 13/6, 2014 at 8:4

But if i let my DB normalized and i create views ? is it the best way ? create views will affect performance ? – Forepaw 14/6, 2014 at 13:6

Yes in some cases views can update performance, depending on what kind of view you are creating. – Cheater 14/6, 2014 at 13:12

so if i use views it will not affect negatively the performance ? – Forepaw 14/6, 2014 at 13:29

You can create both solutions and post a question hire, which of the two is a better performance, with database design and some data in tables, then we will see. :) – Cheater 14/6, 2014 at 13:35

(Your original post did not address performance but normalization, but it has has many edits and "performance" was introduced, maybe because your comments have mentioned it.)

One of the points of the relational model is generic querying with automated implementation with automated optimization. Ignore performance initially. Just make a straightforward design. (Before which you must learn how to make one.) Ids have nothing to do with performance. Normalization has something to do with performance but since you should normalize to 5NF first that is moot. Foreign keys have something to do with performance but since you should be defining them for integrity their role in performance is moot. Proper designs enable later tuning.

Anyway performance is a tradeoff of factors and if you do not know what sorts of things you want to do it is meaningless to discuss performance. (Or if you do not even know what those sorts of things are.) Also, performance-related properties must be measured to even hold an opinion that manual optimization intervention is appropriate. (And again you would have to understand what those factors even are.)

When performance becomes a demonstrated issue because a particular application makes particular patterns of queries or updates then you can address performance. First via indexing and views to make those patterns perform better--always at the expense of others.

The sorts of things you are mentioning (and not mentioning) and the way you are mentioning them suggest that you have misconceptions about performance and its relation to design. Also that your understanding of relational structure, querying and DBMSs is extremely low. Until you learn much more about basic design any advice you get about biasing for performance is misdirected. So just forget about performance. The main thing that adversely affects performance is premature concern for performance getting in the way of a straightforward design.

A simplest design is

person(id, name, profession, city, country)
    -- person [id] is named [name] and practises [profession] in [city], [country]  
city (name, country) -- [name] uniquely names a city within country [country]
country (name) -- [name] uniquely names a country

This has certain keys and FKs, just declare them--which has nothing to do with performance. It is in 5NF.

You may come to understand that the following design (you can add relevant constraints) may be better for you than the previous one--which will have nothing to do with performance. Then you can move to it and offer the previous tables as views to old users--which will have nothing to do with performance.

person(id, name, profession, id_city)
    -- person [id] is named [name] and practises [profession] in [id_city]  
city (id, name, id_country) -- city [id] is named [name] and is in country [id_country]
country (id, name) -- country [id] is named [name]

Here id_country in person would violate 5NF since it would be functionally dependent on a non-key, id_city.

Adanadana answered 13/6, 2014 at 1:43 Comment(2)

In real life, there may be multiple cities with the same name within the same country. – Pleomorphism 13/6, 2014 at 23:41

@BrankoDimitrijevic Aw, you spoiled the surprise. – Adanadana 14/6, 2014 at 1:12