Django Table with Million of rows
Asked Answered
D

7

13

I have a project with 2 applications ( books and reader ).

Books application has a table with 4 milions of rows with this fields:

 book_title = models.CharField(max_length=40)
 book_description = models.CharField(max_length=400)

To avoid to query the database with 4 milions of rows, I am thinking to divide it by subject ( 20 models with 20 tables with 200.000 rows ( book_horror, book_drammatic, ecc ).

In "reader" application, I am thinking to insert this fields:

reader_name = models.CharField(max_length=20, blank=True)
book_subject = models.IntegerField()
book_id = models.IntegerField()

So instead of ForeignKey, I am thinking to use a integer "book_subject" (which allows to access the appropriate table) and "book_id" (which allows to access the book in the table specified in "book_subject").

Is a good solution to avoid to query a table with 4 milions of rows ?

Is there an alternative solution?

Departure answered 12/1, 2010 at 18:47 Comment(4)
4 million isn't a lot, you have a case of premature optimisation.Linzer
The table is queried from ajax using an autocomplete field with this query in the views.py: books.objects.filter(book_title__istartswith=request.GET['q'])[:100]Departure
If you are querying a table on a text field and performance is the problem, you may choose to implement full-text search. However your queried field's size is 40 chars only, and I am not sure if it poses a big problem for the db.Laing
It's not the question you asked, but you may find this helpful: #1567217Linzer
D
18

Like many have said, it's a bit premature to split your table up into smaller tables (horizontal partitioning or even sharding). Databases are made to handle tables of this size, so your performance problem is probably somewhere else.

Indexes are the first step, it sounds like you've done this though. 4 million rows should be ok for the db to handle with an index.

Second, check the number of queries you're running. You can do this with something like the django debug toolbar, and you'll often be surprised how many unnecessary queries are being made.

Caching is the next step, use memcached for pages or parts of pages that are unchanged for most users. This is where you will see your biggest performance boost for the little effort required.

If you really, really need to split up the tables, the latest version of django (1.2 alpha) can handle sharding (eg multi-db), and you should be able to hand write a horizontal partitioning solution (postgres offers an in-db way to do this). Please don't use genre to split the tables! pick something that you wont ever, ever change and that you'll always know when making a query. Like author and divide by first letter of the surname or something. This is a lot of effort and has a number of drawbacks for a database which isn't particularly big --- this is why most people here are advising against it!

[edit]

I left out denormalisation! Put common counts, sums etc in the eg author table to prevent joins on common queries. The downside is that you have to maintain it yourself (until django adds a DenormalizedField). I would look at this during development for clear, straightforward cases or after caching has failed you --- but well before sharding or horizontal partitioning.

Delimitate answered 12/1, 2010 at 20:20 Comment(2)
Ok if I will split the table, I will split it by first letter... indeed is more reasonable :) The table is queried from ajax using an autocomplete field with this query in the views.py: books.objects.filter(book_title__istartswith=request.GET['q'])[:100] So do you recommend me index + memcached ? ThanksDeparture
Make an index on the fist three letters of title (or whatever is the first number you start querying the database) and it will run quite fast.Defrost
B
13

ForeignKey is implemented as IntegerField in the database, so you save little to nothing at the cost of crippling your model.

Edit: And for pete's sake, keep it in one table and use indexes as appropriate.

Beene answered 12/1, 2010 at 18:50 Comment(3)
I am using index but the table has 4 milions of rows and it's often queried. So I don't know if the index is enough :-\Departure
If it isn't then the database needs more memory.Beene
4 million rows is nothing to sneeze at, but databases are built for this kind of thing, especially if you're indexing. I would only worry about it if you are getting up to at least a hundred million rows.Amboceptor
D
2

You haven't mentioned which database you're using. Some databases - like MySQL and PostgreSQL - have extremely conservative settings out-of-the-box, which are basically unusable for anything except tiny databases on tiny servers.

If you tell us which database you're using, and what hardware it's running on, and whether that hardware is shared with other applications (is it also serving the web application, for example) then we may be able to give you some specific tuning advice.

For example, with MySQL, you will probably need to tune the InnoDB settings; for PostgreSQL, you'll need to alter shared_buffers and a number of other settings.

Dup answered 5/3, 2012 at 11:35 Comment(0)
H
1

I'm not familiar with Django, but I have a general understanding of DB.

When you have large databases, it's pretty normal to index your database. That way, retrieving data, should be pretty quick.

When it comes to associate a book with a reader, you should create another table, that links reader to books.

It's not a bad idea to divide the books into subjects. But I'm not sure what you mean by having 20 applications.

Hassett answered 12/1, 2010 at 19:2 Comment(2)
20 applications means 20 tables :) I am using already index but the table has 4 milions of rows and it's often queried. So I don't know if the index is enough :-\Departure
Gods no! Do not split it up into 20 tables! Make an EAR diagram of you tables, and you will see what extra table you need to handle this. E.g. extra tables for relating person to books, category to books and so on.Hassett
L
1

Are you having performance problems? If so, you might need to add a few indexes.

One way to get an idea where an index would help is by looking at your db server's query log (instructions here if you're on MySQL).

If you're not having performance problems, then just go with it. Databases are made to handle millions of records, and django is pretty good at generating sensible queries.

Lamentable answered 12/1, 2010 at 19:10 Comment(2)
Yes, its a performance problem. I am using index but the table has 4 milions of rows and it's often queried. So I don't know if the index is enough :-\Departure
A bigger index (more columns) is probably the way to go, maybe in addition to memcached as @jcm mentions. Often times single column indexes don't help because they don't get used by your queries.Lamentable
F
1

A common approach to this type of problem is Sharding. Unfortunately it's mostly up to the ORM to implement it (Hibernate does it wonderfully) and Django does not support this. However, I'm not sure 4 million rows is really all that bad. Your queries should still be entirely manageable.

Perhaps you should look in to caching with something like memcached. Django supports this quite well.

Filomenafiloplume answered 12/1, 2010 at 19:16 Comment(0)
P
0

You can use a server-side datatable. If you can implement a server-side datatable, you will be able to have more than 4 million records in less than a second.

Pattani answered 19/9, 2022 at 6:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.