Dynamically determine most appropriate data type for a column
Asked Answered
E

1

7

Ok, I know this is going to seem odd. but, I am trying to figure out how to dynamically determine what data type to set a column to using an SQL query in a MySql database.

I have a number of tables that get created dynamically via script. I initially create them, and size all of columns as 'text' data types with a length of 2000. I do this so that I don't have to worry about what is actually going into the table during the insert process. Once I'm done inserting the data for the table, I go through the table, examine the maximum length of the data stored and what kind of data is stored for each and every column and table. There are a lot of tables -and a lot of columns. I've come up with a process -but, I'm a little concerned that it is going to miss something one of these times. The DB has about 1250+ tables and about 300+ million rows in total. And the really unfortunate thing is that data types are never guaranteed to be consistent. Ever. Which is why I have to handle them after the import.

Basically, I run a query to get the MAX CHAR_LENGTH for the column and store that. Then, I run a query that selects N records in the database and then checks to see if they are numeric or characters. I know, I know. Not the best way to go about it.

Since there is absolutely no way that I can be certain of the incoming data (truly a nightmare) I am trying to figure if there is a way to query to column to see what kind of data resides in it? I am not looking for a query to see how the column has been defined (obviously, as I already defined it with a 'generic' type of "text").

I know I could handle this in the script during the import process but am hoping to just run an ALTER TABLE command instead of writing a bunch of extra code.

Any thoughts or ideas are greatly appreciated!

Epidiascope answered 2/10, 2012 at 23:53 Comment(5)
How do you plan on using this data? In other words, why do you need to convert from text columns to something else?Married
The data typically falls into just a few types: date,varchar,float,or int types. I really just want to make sure the columns have the proper data type defined. I suppose I could keep it as text, but I'd really rather not. Typically, smaller data types are faster and require less space on disk. Strictly speaking, I could probably get away with it though...Epidiascope
I'm assuming this is a one-time process you're after? You won't be doing this again, right? It seems to me to be a case of redefining each column of the table in respect of the first field of data you come across. I imagine you may have to perform some command line operations in MySQL. Either that or you could cop out and dump the data into a text file, find a good parser and do it in a programming language you feel more comfortable with.Sunderance
I wish it was a one time thing. Its an ongoing process. However, its not all 300 million rows -but, it can be a substantial number of them from as few as 100 records to as many as 100 million records. I've thought about just testing the first piece of data that comes down the pipe but sometimes there are characters hidden amongst the numbers. Basically, that means I have to test at least a few thousand to make sure that I have a good chance of getting the right data type.Epidiascope
Its funny that you mention a text file -that is actually what the data source is. And its a horrible mess. I actually parse every row, and every field of every row by VBScript and then insert into the DB. And the worst part is that I can cannot make any assumptions of what the data will be or even how 'clean' it is. Sometimes the fields are separated by tabs, sometimes three spaces, and sometimes the number of data fields don't even line up with with the number of headers. Oh -and sometimes there aren't even headers either!Epidiascope
G
6

Use the PROCEDURE ANALYSE() option:

SELECT * FROM `tablename` PROCEDURE ANALYSE();

Here's an example of the output:

mysql> select * from zip procedure analyse() \G
*************************** 1. row ***************************
             Field_name: zip.zip
              Min_value: 00501
              Max_value: 99950
             Min_length: 5
             Max_length: 5
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 5.0000
                    Std: NULL
      Optimal_fieldtype: MEDIUMINT(5) UNSIGNED NOT NULL
*************************** 2. row ***************************
             Field_name: zip.city
              Min_value: Aaronsburg
              Max_value: Zwolle
             Min_length: 3
             Max_length: 26
       Empties_or_zeros: 1009
                  Nulls: 0
Avg_value_or_avg_length: 8.3869
                    Std: NULL
      Optimal_fieldtype: VARCHAR(26) NOT NULL
*************************** 3. row ***************************
             Field_name: zip.state_id
              Min_value: AA
              Max_value: WY
             Min_length: 2
             Max_length: 2
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 2.0000
                    Std: NULL
      Optimal_fieldtype: ENUM('AA','AE','AK','AL','AP','AR','AS','AZ','CA','CO','CT','DC','DE','FL','FM','GA','GU','HI','IA','ID','IL','IN','KS','KY','LA','MA','MD','ME','MH','MI','MN','MO','MP','MS','MT','NC','ND','NE','NH','NJ','NM','NV','NY','OH','OK','OR','PA','PR','PW','RI','SC','SD','TN','TX','UT','VA','VI','VT','WA','WI','WV','WY') NOT NULL
*************************** 4. row ***************************
             Field_name: zip.latitude
              Min_value: -7.209975
              Max_value: 71.299525
             Min_length: 2
             Max_length: 8
       Empties_or_zeros: 1009
                  Nulls: 0
Avg_value_or_avg_length: 37.599173975674866
                    Std: 7.949323125673274
      Optimal_fieldtype: FLOAT NOT NULL
*************************** 5. row ***************************
             Field_name: zip.longitude
              Min_value: -176.63675
              Max_value: -64.734694
             Min_length: 3
             Max_length: 8
       Empties_or_zeros: 1009
                  Nulls: 0
Avg_value_or_avg_length: -88.79028976104503
                    Std: 20.6017874416888
      Optimal_fieldtype: FLOAT NOT NULL
*************************** 6. row ***************************
             Field_name: zip.timezone_id
              Min_value: 3
              Max_value: 11
             Min_length: 1
             Max_length: 2
       Empties_or_zeros: 0
                  Nulls: 1009
Avg_value_or_avg_length: 8.1563
                    Std: 1.0430
      Optimal_fieldtype: ENUM('3','5','6','7','8','9','11')
*************************** 7. row ***************************
             Field_name: zip.dst
              Min_value: 1
              Max_value: 1
             Min_length: 1
             Max_length: 1
       Empties_or_zeros: 2618
                  Nulls: 0
Avg_value_or_avg_length: 0.9376
                    Std: 0.2419
      Optimal_fieldtype: ENUM('0','1') NOT NULL
*************************** 8. row ***************************
             Field_name: zip.status
              Min_value: Active
              Max_value: Active
             Min_length: 6
             Max_length: 6
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 6.0000
                    Std: NULL
      Optimal_fieldtype: ENUM('Active') NOT NULL
*************************** 9. row ***************************
             Field_name: zip.created
              Min_value: 2010-09-25 11:43:41
              Max_value: 2011-05-13 15:56:00
             Min_length: 19
             Max_length: 19
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 19.0000
                    Std: NULL
      Optimal_fieldtype: ENUM('2010-09-25 11:43:41','2010-09-25 11:43:56','2010-11-01 09:49:32','2011-05-13 15:56:00') NOT NULL
*************************** 10. row ***************************
             Field_name: zip.updated
              Min_value: 2010-09-24 23:13:41
              Max_value: 2011-05-13 15:56:00
             Min_length: 19
             Max_length: 19
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 19.0000
                    Std: NULL
      Optimal_fieldtype: ENUM('2010-09-24 23:13:41','2010-09-24 23:13:56','2010-09-29 12:40:56','2010-11-01 09:49:32','2011-05-13 15:56:00') NOT NULL
10 rows in set (0.15 sec)
Gerta answered 3/10, 2012 at 2:52 Comment(4)
You sir, have just hit the nail right on its head! That is exactly what I was looking for. I truly owe you a beer :DEpidiascope
Ok, this is probably totally crazy to ask... When I run the PROCEDURE ANALYZE through Vbscript it basically returns gibberish. Running the query through the DB returns the results that one would expect. I've tried casting the returned data as strings... but, no dice. Any ideas on what might be causing this?Epidiascope
Are you using PROCEDURE ANALYZE instead of PROCEDURE ANALYSE?Gerta
I figured out the issue. When I ran the query directly on the DB it returned normal looking text. What I wasn't understanding is that it wasn't actually text -it was just displaying as text. I figured out that is a byte array by using the VBScript VarType() method. Here is the function I ran across:Function C8209toStr(body8209) If VarType(body8209) = 8209 Then Dim i ReDim aOut(UBound(body8209)) For i = 1 to UBound(body8209) + 1 aOut(i-1) = chr(ascb(midb(body8209,i,1))) Next C8209toStr = Join(aOut, "") End If End FunctionEpidiascope

© 2022 - 2024 — McMap. All rights reserved.