MySQL: Converting datatypes and collations effect on stored data

Asked 19/12, 2012 at 10:15 Answered 19/12, 2012 at 10:44

Solved mysql type-conversion database-performance collation sqldatatypes

I have a general question about this. There are many times we want to change data-types of fields or collations when lots of data is inserted before. Consider these situations :

converting varchar collation from utf8_general_ci to latin1_swedish_ci: as I know the first has multibyte chars and the second singly byte ones. Does this conversion manipulate stored records correctly? And does this conversion lead to reduction of volume of existing data (maybe 50%)?
Conversion of int(10) to smallint(5): Does the volume of data reduce to 50% correctly?
Or for example: int(10) to unsigned int(10) - text to varchar(1000) - varchar(20) to char(10) , ...

As it is obvious, these actions might be done to increase efficiency, reduce volume of data and ...

Consider I have a table with 1,000,000 records. I want to know if doing such actions have bad effects on stored data, or if it makes low performance for future inserts and selects involving this table.

UPDATE :
When I talk about changing utf8 encoding charset to Latin, of course the values of my field are English (it's obvious if there are Japanese, they will be lost). With this assumption, I'm asking about the resulting table size and performance.

Yser answered 19/12, 2012 at 10:15 Comment(2)

Premature optimization is the root of all evil. Are you actually faced with some performance or storage problem that you are trying to solve? If so, what is it? – Aconite 19/12, 2012 at 10:58

well my table size is growing fast and I wanted to know if changing for example utf8 to latin would make a change. But I understood of your answer this not make a change – Yser 19/12, 2012 at 11:2

Converting varchar collation from utf8_general_ci to latin1_swedish_ci: As I know the first has multibyte chars and the second singly byte ones. Does this conversion manipulate stored records correctly? And does this conversion lead to reduction of volume of existing data (maybe 50%)?

Collation is merely the ordering that is used for string comparisons—it has (almost) nothing to do with the character encoding that is used for data storage. I say almost because collations can only be used with certain character sets, so changing collation may force a change in the character encoding.

To the extent that the character encoding is modified, MySQL will correctly re-encode values to the new character set whether going from single to multi-byte or vice-versa. Beware that any values that become too large for the column will be truncated.

Provided that the new character type is of variable-length and that the values are encoded with fewer bytes in the new encoding than before, there will of course be a reduction in the table's size.
Conversion of int(10) to smallint(5): Does the volume of data reduce to 50% correctly?

INT and SMALLINT respectively occupy 4 and 2 bytes regardless of display width: so yes, the size of the table will reduce accordingly.
Or for example: int(10) to unsigned int(10) - text to varchar(1000) - varchar(20) to char(10), ...
- INT occupies 4 bytes irrespective of whether it is signed, so there will be no change;
- TEXT and VARCHAR(1000) both occupy L+2 bytes (where L is the value's length in bytes), so there will be no change;
- VARCHAR(20) occupies L+1 bytes (where L is the value's length in bytes) whereas CHAR(10) occupies 10×w bytes (where w is the number of bytes required for the maximum-length character in the character set), so there may well be a change but it is dependent on the actual values stored and the character encoding used.

Note that, depending on storage engine, reductions in table size may not immediately be released to the filesystem.

Aconite answered 19/12, 2012 at 10:34 Comment(4)

you provided nice information . but actually in my case , all varchar field values are English strings(no international characters). So how table size will change if I change the collation to latin1_swedish_ci ? – Yser 19/12, 2012 at 10:46

As mentioned in my answer, collation has almost nothing to do with encoding. If you change the encoding from utf8 to latin1 (and your values contain only ASCII characters), then there will be no change: in both encodings, all characters are encoded with exactly one byte. If your values contain characters that are encoded in utf8 with more than one byte then there will be a commensurate reduction (although such characters may be discarded and replaced with a ? placeholder if they do not exist in the latin1 character set). – Aconite 19/12, 2012 at 10:49

if so , why should not always use utf8 instead of latin for example ? are there any extra considerations ? – Yser 19/12, 2012 at 10:54

Well, there are some characters (e.g. accented ones) that require only 1 byte in latin1 but 2 bytes in utf8. I pretty much always use utf8. – Aconite 19/12, 2012 at 10:56

A1. collation does not change your data. it changes sort order in your queries, and possibly changes your indices (?).

A2. length of the data in the column will be reduced, however you always have some overhead per table row, and you cannot change that. moreover, if your data is not unique, you will not see much reduction in index size, because your index looks like this: 33->{row1,row2,row3...},67->{row9,row0,row7} and every row pointer is much larger than an int.

in other words, if you had a table with a hundred int rows, without many indices, and changed all these columns to tinyint, you would see a notable improvement. if it is only one column, don't bother.

http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html http://dev.mysql.com/doc/refman/5.0/en/innodb-physical-record.html

A3. please read up on text vs varchar. the earlier stores data separately from table row, the latter in the row. each has own implications.

p.s. row and index overhead depends a lot on what db engine you use. normally you should use innodb. however for read-only tasks, e.g. data mining, myisam is more efficient.

Deathly answered 19/12, 2012 at 10:36 Comment(0)

Converting varchar collation from utf8_general_ci to latin1_swedish_ci: It can reduce table(file) size, but you can lose not latin symbols, only english words will be stored correctly.
Conversion of int(10) to smallint(5) - it will reduce the volume of data. Conversion of int(10) to unsigned int(10) - it won't reduce. In these cases you should care about the values, you can get an error - out of range value.
Conversion varchar(20) to char(10): CHARs are used for strings that always have the same length (for example - 10), if the strings are different in length, then use VARCHAR data type.

Bedrock answered 19/12, 2012 at 10:44 Comment(0)

Recommended topics

Hot tags