Tools for discovering de facto foreign keys in databases? [closed]

A good way to quickly survey the information in a database is to apply a tool that automatically creates a database diagram of all tables and all relationships between them.

In my experience, such tools use declared foreign keys as the relationships, and most of the databases I try them on don't contain any. The data does satisfy the constraints that foreign keys would express, but nothing declares or enforces them, so I end up with a 'diagram' consisting of a bunch of unrelated tables.

So what I'm looking for is software that can compute "undeclared foreign keys" and either

  • uses them as table relations in a database diagram, or
  • generates SQL code for corresponding foreign key declarations

Do you know any tools, free if possible, that can already do this?

Mamelon asked 11/8, 2011 at 18:48 Comment(8)
BTW I've fixed my Perl script that makes a stab at this for the project I'm doing, but really, something more ... deliberate ... would be nice.Mamelon
What kind of assumptions would you expect such a tool to make in order to compute these undeclared foreign keys? Explicit foreign keys serve as meta-data which describes the table relationships. Are you thinking that the tool would detect the relationships purely by naming conventions between the referencing column and the referenced table?Hopehopeful
Well, the first thing is to just list inclusion dependencies; for now, my script first lists all candidate primary key columns (the columns with unique values), then tries each column in the database against each of these to check whether the values of the first are all contained in the values of the second, but that can be optimized - see e.g. citeseerx.ist.psu.edu/viewdoc/…Mamelon
The main problem with this is false positives: many tables have autonumbered IDs, so these will often produce inclusions among unrelated keys, especially when the number of values in the candidate primary key is small. So there will need to be some amount of configurability or postprocessing to weed out such cases.Mamelon
I did this a few times on MySQL ISAM tables. I customized SQLFairy to parse MySQL data definition, guess the relations, and output a Graphviz diagram. But each time the primary keys were declared and the foreign keys followed a strict naming convention. If you don't have this, I don't think you can magically find your way out.Cubitiere
@Mytskine: did you just check naming or did you also check inclusion (whether all values of one column are a subset of all values of the other)?Mamelon
@Mytskine: would you be willing to share your modifications? My own script is also in Perl and based on DBI, perhaps we can share our efforts.Mamelon
This is a near-duplicate of https://mcmap.net/q/1016806/-decipher-database-schema-closedMamelon
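
For illustration, the inclusion-dependency approach described in the comments above can be expressed directly in SQL. The following is a minimal, MySQL-flavoured sketch using hypothetical customers/orders tables and column names; a real script would run these checks for every candidate column pair, which is where the quadratic cost comes from:

-- Candidate primary key test: the column has no NULLs and no duplicates,
-- i.e. its distinct count equals the table's row count.
SELECT COUNT(*) = COUNT(DISTINCT customer_id) AS is_candidate_pk
FROM   customers;

-- Inclusion test: every non-NULL value of orders.customer_id also occurs in
-- customers.customer_id (the anti-join must find no orphan values).
SELECT COUNT(*) = 0 AS values_are_included
FROM   orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id
WHERE  o.customer_id IS NOT NULL
AND    c.customer_id IS NULL;
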
8

Interesting question. You're looking to parse a database schema and its data to determine which tables are, or should be, related to each other, without any strict definition of the relationship. In effect, you're trying to infer a relationship.

I see two ways that you can infer such a relationship. First, let me say that your approach might vary depending on the databases you're working with. A number of questions spring to mind (I don't want answers, but they are worth reflecting on):

  • are these in-house enterprise systems that follow some consistent naming convention or pattern?
  • or are they 'in-the-wild' databases that you come across anywhere, at any time?
  • what sort of assumptions are you prepared to make?
  • would you prefer to get more false positives or false negatives in your result?

Note that this type of inference will almost certainly give false results, and is built on a lot of assumptions.

So I offer two approaches that I'd use in concert.

Inferring a relationship through structure / naming (symbolic analysis)

A common database design convention is to name a PK column after the table name (e.g. CustomerId on the Customer table), or alternatively to name the PK column simply Id.

A table with a FK relationship to another often names its related column the same as the related table. In the Order table I'd expect a CustomerId column which refers to the CustomerId / Id column in the Customer table.

This type of analysis would include (a minimal query sketch follows the list):

  • inspecting columns across tables for similar phrases / words
  • looking for columns names that are similar to the names of other tables
  • checking for column names that contain the name of another column (e.g. FirstCustomerId & SecondCustomerId both refer to the CustomerId column in the Customer table)
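
As a rough illustration of this symbolic matching, a query along the following lines pairs each column with any other table whose name appears inside the column name. It is only a sketch against MySQL's information_schema; the schema name is a placeholder, and the substring match is deliberately loose, so expect false positives (and misses where table names are pluralised):

-- Sketch: match each column to any *other* table whose name occurs inside the
-- column name, so FirstCustomerId and SecondCustomerId both match Customer.
SELECT  c.TABLE_NAME  AS child_table,
        c.COLUMN_NAME AS child_column,
        t.TABLE_NAME  AS candidate_parent_table
FROM    information_schema.COLUMNS c
JOIN    information_schema.TABLES  t
        ON  t.TABLE_SCHEMA = c.TABLE_SCHEMA
        AND t.TABLE_NAME  <> c.TABLE_NAME
        AND c.COLUMN_NAME LIKE CONCAT('%', t.TABLE_NAME, '%')
WHERE   c.TABLE_SCHEMA = 'your_db_name';  -- placeholder: the schema to analyse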

Inferring a relationship through data (statistical analysis)

Looking at data, as you suggest you have done in your comments, will allow you to determine 'possible' references. If the CustomerId column in the Order table contains values which don't exist in the Id column of the Customer table, then it's reasonable to question whether this is a valid relationship (although you never know!)

A simple form of data analysis uses dates and times. Rows that were created in close proximity to one another are more likely to be related to one another. If, for every Order row that was created, there also exist between 2 and 5 Item rows created within a few seconds, then a relationship between the two is likely.

A more detailed analysis might look at the range and distribution of used values.

For example, if your Order table has a St_Id column, you might infer using symbolic analysis that the column is likely to relate to either a State table or a Status table. Suppose the St_Id column has 6 discrete values, and 90% of the records are covered by 2 of those values; the State table has 200 rows, and the Status table has 9 rows. You could quite reasonably infer that the St_Id column relates to the Status table - it gives much greater coverage of that table's rows (2/3 of its rows are 'used', whereas only 3% of the rows in the State table would be used).

If you perform data analysis on existing databases to gather 'real life' data, I'd expect some patterns to emerge that could be used as guides to structure inference. When a table with a large number of records has a column with a small number of values repeated many times (not necessarily in order), it's more likely that this column relates to a table with a correspondingly small number of rows.
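
To make the coverage idea concrete, it can be measured with one aggregate query per candidate parent and the ratios compared afterwards. This is a minimal MySQL-flavoured sketch; the Order/Status tables and the Id/St_Id columns are taken from the example above and are assumed names, not real metadata:

-- Coverage of Status by Order.St_Id: 6 matched keys / 9 rows ~ 0.67.
-- Repeat with State in place of Status (6 / 200 = 0.03) and compare.
SELECT
    matched.keys_used,
    parent.row_count,
    matched.keys_used / parent.row_count AS coverage
FROM
    (SELECT COUNT(DISTINCT s.Id) AS keys_used
     FROM   `Order` o
     JOIN   Status s ON s.Id = o.St_Id)        AS matched,
    (SELECT COUNT(*) AS row_count FROM Status) AS parent;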

In summary

Best of luck. It's an interesting problem; I've just thrown some ideas out there, but this is very much a trial & error, data-gathering and performance-tuning situation.

Rolanderolando answered 22/8, 2011 at 3:48 Comment(2)
Thanks for your extensive reply. I really like how you are addressing a more general problem than I was trying to state, and I may sometimes need to solve such a more general problem. When looking for strict foreign keys I can just look for inclusion dependencies, which, if I restrict myself to single-column dependencies, is only quadratic in the size of the database; in my use cases that probably means I can just compute them instead of having to approximate the solution in one way or another.Mamelon
Nevertheless, for larger databases where exhaustive checking of inclusion dependencies is not practical, an implementation based on symbolic and statistical analysis is at github.com/janmotl/linkifierIormina
6

This is a non-trivial exercise in most cases. If you are lucky enough to be analysing a schema for a modern framework, such as Ruby on Rails or CakePHP, and the developers have been stringent about following column naming conventions, then you have a reasonable chance of finding many, but not all, of the implied relationships.

I.e. if your tables use columns like user_id to refer to entries in the users table.

Be aware: some entity names may pluralise irregularly (entity being a good example: entities, not entitys), and these are harder to catch (but still possible). However, keys such as admin_id, which the developers join to the users table on user.id, can't be inferred. You would need to handle those cases manually.

You didn't specify an RDBMS, but I use MySQL a lot, and I'm currently working on this problem for myself.

The following MySQL script will infer most relationships implied by column names. It then lists any relationships that it could not find table names for, so at least you know which ones you're missing. The inferred parent and child are listed, along with singular and plural names, plus the implied relationship:

-- this DB is where MySQL keeps schema information
use information_schema;

-- change this to the DB you want to analyse
set @db_name = "example_DB";

-- infer relationships 
-- NB: this won't catch names that pluralise irregularly like category -> categories or bus_id -> buses etc.
select  LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 )              as inferred_parent_singular
,       CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s")  as inferred_parent_plural 
,       C.TABLE_NAME                                                  as child_table
,       CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME)-3), "s has many ", C.TABLE_NAME) as inferred_relationship
from    COLUMNS C
JOIN    TABLES T on C.TABLE_NAME = T.TABLE_NAME 
        and C.TABLE_SCHEMA = T.TABLE_SCHEMA 
        and T.TABLE_TYPE != "VIEW"  -- filter out views; comment this line if you want to include them
where   COLUMN_NAME like "%_id"     -- look for columns of the form <name>_id
and     C.TABLE_SCHEMA = T.TABLE_SCHEMA and T.TABLE_SCHEMA = @db_name 
-- and     C.TABLE_NAME not like "wwp%"  -- uncomment and set a pattern to filter out any tables you DON'T want included, e.g. wordpress tables
-- finally make sure to filter out any inferred names that aren't really tables
and     CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s") -- this is the inferred_parent_plural, but can't use column aliases in the where clause sadly
          in (select TABLE_NAME from TABLES where TABLE_SCHEMA = @db_name)
;

This will return a result set of inferred relationships (screenshot not reproduced here).

Then you can examine any naming convention exceptions detected with:

-- Now list any inferred parents that weren't real tables to see why (irregular plurals and columns not named according to convention)
select  LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ) as inferred_parent_singular
,       CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s") as inferred_parent_plural 
,       C.TABLE_NAME as child_table
from    COLUMNS C
JOIN    TABLES T  on  C.TABLE_NAME    = T.TABLE_NAME 
                  and C.TABLE_SCHEMA  = T.TABLE_SCHEMA 
                  and T.TABLE_TYPE   != "VIEW"            -- filter out views, comment this line if you want to include them
where   COLUMN_NAME like "%_id"
and     C.TABLE_SCHEMA = T.TABLE_SCHEMA and T.TABLE_SCHEMA = @db_name 
-- and     C.TABLE_NAME not like "wwp%"  -- uncomment and set a pattern to filter out any tables you DON'T want included, e.g. wordpress tables
-- this time only include inferred names that aren't real tables
and     CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s")
          not in (select TABLE_NAME from TABLES where TABLE_SCHEMA = @db_name)
;

This will return the naming-convention exceptions, which you can process manually (screenshot not reproduced here).

You can modify these scripts to spit out whatever is useful to you, including foreign key CREATE statements if you want to. Here, the final column is a simple 'has many' relationship statement. I use this in a tool I've built called pidgin, a rapid modelling tool that draws relationship diagrams on the fly based on relationship statements written in a very simple syntax (called 'pidgin'). You can check it out at http://pidgin.gruffdavies.com

I've run the above script on a demo DB to show you the sort of results you can expect (screenshot not reproduced here).
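
Since the question also asks for generated foreign key declarations, the first query can be adapted to emit ALTER TABLE statements instead of relationship statements. The following is only a sketch: it reuses the information_schema context and the @db_name variable set above, the fk_ constraint-name pattern is invented here, and it assumes every parent table's key column is literally named id, which may well not hold:

-- Emit one ALTER TABLE statement per inferred relationship (sketch only).
select  CONCAT("ALTER TABLE `", C.TABLE_NAME,
               "` ADD CONSTRAINT `fk_", C.TABLE_NAME, "_", COLUMN_NAME,
               "` FOREIGN KEY (`", COLUMN_NAME,
               "`) REFERENCES `",
               CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3), "s"),
               "` (`id`);") as fk_ddl
from    COLUMNS C
JOIN    TABLES T on C.TABLE_NAME = T.TABLE_NAME
        and C.TABLE_SCHEMA = T.TABLE_SCHEMA
        and T.TABLE_TYPE != "VIEW"
where   COLUMN_NAME like "%_id"
and     C.TABLE_SCHEMA = @db_name
and     CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3), "s")
          in (select TABLE_NAME from TABLES where TABLE_SCHEMA = @db_name)
;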

I haven't catered for the irregular plurals in my script, but I might have a go at that too, at least for the case of entities ending in -y. If you want to have a try at that yourself, I'd recommend writing a stored function that takes <name>_id column names as a parameter, strips the _id part and then applies some heuristics to attempt to pluralise correctly.
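
For what it's worth, a sketch of such a stored function might look like the following. The function name and the handful of rules are only my own starting point, and genuinely irregular plurals (person -> people, etc.) would still need a lookup table or manual handling:

-- Sketch of a pluralising helper: strips "_id" and applies a few heuristics.
DELIMITER //
CREATE FUNCTION infer_parent_table(col_name VARCHAR(64))
RETURNS VARCHAR(64) DETERMINISTIC
BEGIN
    DECLARE singular VARCHAR(64);
    SET singular = LEFT(col_name, CHAR_LENGTH(col_name) - 3);  -- drop "_id"
    IF singular REGEXP '[^aeiou]y$' THEN
        -- category -> categories, entity -> entities
        RETURN CONCAT(LEFT(singular, CHAR_LENGTH(singular) - 1), 'ies');
    ELSEIF singular REGEXP '(s|x|z|ch|sh)$' THEN
        -- bus -> buses, box -> boxes, match -> matches
        RETURN CONCAT(singular, 'es');
    ELSE
        RETURN CONCAT(singular, 's');
    END IF;
END//
DELIMITER ;

-- e.g. infer_parent_table('category_id') returns 'categories'

The CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3), "s") expressions in the queries above could then be replaced with a call to this function.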

Hope that's useful!

Ehlers answered 29/8, 2014 at 12:13 Comment(5)
This is clearly useful, and complementary to my own approach, which doesn't look at column names at all, but only at the sets of column values: it checks whether one includes the other and allows the user to impose a minimum size. So you could add that to your tool.Mamelon
So you're doing set analysis of the data to infer relationships? That sounds fun (but computationally expensive?). In the end, I decided to leave the heuristics to a ruby script which I'm building into the tool, so with my new approach you just need to supply a table list and a list of columns and it does the rest, reporting exceptions which you then correct. Works really well!Ehlers
Actually, it's not hard: SQL can express such tests. I'd publish the script if it wasn't so ugly; the principles are straightforward.Mamelon
Sure, it's easy to express set memberships, that's the whole basis of SQL, but I wondered how performant they were on very large sets and hundreds of keys?Ehlers
Hard to say. Thus far I haven't used it on anything big, and of course it depends on indexes being present. (I've done a project in which I imported webserver logs into SQLite to query them for analysis, and I had to add indexes prior to doing any joins. This script doesn't add any indexes.)Mamelon
2

The following products all claim to provide foreign-key discovery abilities:

Statistical methodologies that can provide a kind of similarity rank, such as range distribution and creation time as suggested by Kirk, seem to be the right way. I'd need to implement this using SAS EG or some free tool.

Rasmussen answered 21/9, 2015 at 13:11 Comment(2)
A technical approach might be found here: vldb.org/pvldb/vldb2010/papers/R72.pdf, but a practical one would still be nice ;-)Rasmussen
I still use my script. It finds candidate primary keys by checking whether SELECT COUNT(DISTINCT colname) equals SELECT COUNT(*), and candidate foreign keys by testing whether their set of values is included in the set of values of one of the candidate primary keys, which can also be done with a single SELECT query. This finds a whole lot of false positives.Mamelon
0

I don't know of software that will do exactly what you require, but the following query (T-SQL) will help get you started. It lists all foreign key relationships declared within the current database.

SELECT
    K_Table = FK.TABLE_NAME,
    FK_Column = CU.COLUMN_NAME,
    PK_Table = PK.TABLE_NAME,
    PK_Column = PT.COLUMN_NAME,
    Constraint_Name = C.CONSTRAINT_NAME
FROM
    INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS C
INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS FK
    ON C.CONSTRAINT_NAME = FK.CONSTRAINT_NAME
INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS PK
    ON C.UNIQUE_CONSTRAINT_NAME = PK.CONSTRAINT_NAME
INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE CU
    ON C.CONSTRAINT_NAME = CU.CONSTRAINT_NAME
INNER JOIN (
            SELECT
                i1.TABLE_NAME,
                i2.COLUMN_NAME
            FROM
                INFORMATION_SCHEMA.TABLE_CONSTRAINTS i1
            INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE i2
                ON i1.CONSTRAINT_NAME = i2.CONSTRAINT_NAME
            WHERE
                i1.CONSTRAINT_TYPE = 'PRIMARY KEY'
           ) PT
    ON PT.TABLE_NAME = PK.TABLE_NAME

Hope this helps.

Adina answered 21/8, 2011 at 10:46 Comment(3)
Thank you, but my question is about the case where no such constraints are explicitly defined.Mamelon
I think you should visit this link: ironspeed.com/Designer/8.0.2/WebHelp/Part_II/… Hope this helps you.Adina
Yes, that is the kind of functionality I'm looking for! Except that Ironspeed Designer doesn't appear to list all virtual primary and foreign keys in a database. That is the part I'm asking for.Mamelon
