Storing File Sizes in a Database
Asked Answered
S

4

20

I am currently working on a system that involves storing multiple studies and details of their contents (A study can typically contain 1 < X < ~2000 images). My colleagues and I were discussing what might be the best method of storing files sizes (specifically the image sizes) in a database would be.

The file sizes typically range from < 1kB to > 20MB.

We are currently debating between storing the images sizes as:

# of kilobytes (as an integer value) 
# of bytes (as a large integer value)
# of megabytes (possibly as a decimal value)
Other Options...

I haven't worked with storing file sizes much and was wondering what might be the most efficient / practical method of accomplishing this?

Stanch answered 6/1, 2011 at 16:31 Comment(0)
K
8

There is no right answer. I like Matt's answer for reasons of precision. I like Abe's answer for reasons of space saving... (Yes, space in a table is much more 'impactful' than on the Filesystem)

The real answer is, for what purpose are you storing the value? Is this for a mechanism to invoice the user storing the data? Then you'd have to rely on the contract. Is this to measure space on a drive... if so, files REALLY take up some number of 'blocks' and NOT some number of bytes. If the minimum block size is 2KB then you should say that EVERY file is increments of 2kb... If you store that value or that value times 2kb is up to you.

Maybe you're storing the value because the retrieval algorithm has 2 optimization paths, one for larger files and one for smaller and that process would like to know the size WITHOUT interrogating the file-system. (in this case maybe just an "is_greater_than_x_kb" flag column is all you need.)

No one here can tell you what your requirement is. The only thing the existing answers give you is an opinion, not a right answer.

Kirsch answered 6/1, 2011 at 17:5 Comment(4)
Thanks Steph- I know there wasn't any precise answer, I just though I would get peoples thoughts on different reasons to store file sizes and the benefits of storing them a certain way. Everyone's answers have been very helpful.Stanch
They are all helpful, but they are no substitute for clear requirements. 95% of problems I see stem from a lack of clear requirements. Also 76% of all statistics are completely fabricated. ;-)Kirsch
I cannot think of a reason to store file size unless it is a requirement. As Stephanie states there is a use for the "is_greater_than_x_kb". If file size was some BA design so the app could check to see if it got the correct file ask them if they would consider a checksum, or other type of unique identifier for the file, so the app know it got the file it was trying to get.Quinton
Me neither RC... which is why I'm pushing to give us a requirement. Too much programming happens because of what we in IT THINK the business wants from us. Maybe it's like in Sharepoint... you'd like to see in your "document library" <-- sharepoint thing> a column for name, a column for uploaded time, a column for size. So before a user tries to d/l the file, he'll know what to expect. But nowadays, 20MB is squat.Kirsch
S
13

If you're going to explicitly store the size at all, store the number of bytes. There is just too much confusion/ambiguity when using other units.

Example: different people might interpret kb as:

  • kilobytes
  • kilobits
  • kibibytes
  • kibibits

...and how big is a kilobyte, anyway?

That said, if you're storing the actual data in your database, I do not see an immediately compelling reason to explicitly store the length of the data at all.

Straley answered 6/1, 2011 at 16:41 Comment(2)
I'm not sure why exactly it is being stored - as I'm not really working on that side of the system. However - I knew it was being stored and thought I would see if anyone had input on the topic or if there was a "best practice" or some other type of suggested use as to store file sizes.Stanch
Storing the length of the data can be useful in cloud environments where file size requests may be metered. In general, recording file size in a row that represents a file stored elsewhere can be more efficient.Rectory
S
8

I store filesizes in bytes as an integer in the database. The (signed) integer field of mysql has a maximum value of 2147483647 so filesizes up to 2GB can be stored without a problem.

Spanish answered 6/1, 2011 at 16:45 Comment(0)
K
8

There is no right answer. I like Matt's answer for reasons of precision. I like Abe's answer for reasons of space saving... (Yes, space in a table is much more 'impactful' than on the Filesystem)

The real answer is, for what purpose are you storing the value? Is this for a mechanism to invoice the user storing the data? Then you'd have to rely on the contract. Is this to measure space on a drive... if so, files REALLY take up some number of 'blocks' and NOT some number of bytes. If the minimum block size is 2KB then you should say that EVERY file is increments of 2kb... If you store that value or that value times 2kb is up to you.

Maybe you're storing the value because the retrieval algorithm has 2 optimization paths, one for larger files and one for smaller and that process would like to know the size WITHOUT interrogating the file-system. (in this case maybe just an "is_greater_than_x_kb" flag column is all you need.)

No one here can tell you what your requirement is. The only thing the existing answers give you is an opinion, not a right answer.

Kirsch answered 6/1, 2011 at 17:5 Comment(4)
Thanks Steph- I know there wasn't any precise answer, I just though I would get peoples thoughts on different reasons to store file sizes and the benefits of storing them a certain way. Everyone's answers have been very helpful.Stanch
They are all helpful, but they are no substitute for clear requirements. 95% of problems I see stem from a lack of clear requirements. Also 76% of all statistics are completely fabricated. ;-)Kirsch
I cannot think of a reason to store file size unless it is a requirement. As Stephanie states there is a use for the "is_greater_than_x_kb". If file size was some BA design so the app could check to see if it got the correct file ask them if they would consider a checksum, or other type of unique identifier for the file, so the app know it got the file it was trying to get.Quinton
Me neither RC... which is why I'm pushing to give us a requirement. Too much programming happens because of what we in IT THINK the business wants from us. Maybe it's like in Sharepoint... you'd like to see in your "document library" <-- sharepoint thing> a column for name, a column for uploaded time, a column for size. So before a user tries to d/l the file, he'll know what to expect. But nowadays, 20MB is squat.Kirsch
F
4

I would personally go for # of kb as an int column (as long as you will never have anything smaller than 1kb). bigint takes up twice as much space (8 bytes vs 4) and as long as it's well documented people shouldn't be too confused.

Frizzell answered 6/1, 2011 at 16:38 Comment(2)
If they are storing files up to 20MB, I would say that saving 4 bytes in describing the size isn't a top priority. ;-)Kirsch
The default unit for storage in most Linux/Unix tools is byte.Odin

© 2022 - 2024 — McMap. All rights reserved.