One SQL query to access multiple data sources in Java (from oracle, excel, sql server)
Asked Answered
K

8

17

I need to develop application that can be getting data from multiple data sources ( Oracle, Excel, Microsoft Sql Server, and so on) using one SQL query. For example:

 SELECT o.employeeId, count(o.orderId) 
    FROM employees@excel e. customers@microsoftsql c, orders@oracle o 
    WHERE o.employeeId = e.employeeId and o.customerId = c.customerId 
    GROUP BY o.employeeId;

This sql and data sources must be changes dynamically by java program. My customers want to write and run sql-like query from different database and storage in same time with group by, having, count, sum and so on in web interface of my aplication. Other requirements is perfomance and light-weight.

I find this way to do it (and what drawbacks I see, please, fix me if I wrong):

  1. Apache Spark (drawbacks: heavy solution, more better for BigData, slow if you need getting up-to-date informations without cached it in Spark),

  2. Distributed queries in SQL server (Database link of Oracle, Linked server of Microsoft SQL Server, Power Query of Excel) - drawbacks: problem with change data sources dynamically by java program and problem with working with Excel,

  3. Prestodb (drawbacks: heavy solution, more better for BigData),

  4. Apache Drill (drawbacks: quite young solution, some problem with not latest odbc drivers and some bugs when working),

  5. Apache Calcite (ligth framework that be used by Apache Drill, drawbacks: quite young solution yet),

  6. Do join from data sources manually (drawbacks: a lot of work to develop correct join, "group by" in result set, find best execution plan and so on)

May be, do you know any other way (using free open-source solutions) or give me any advice from your experience about ways in above? Any help would be greatly appreciated.

Killingsworth answered 20/1, 2016 at 11:3 Comment(11)
SSIS and gather data in central place(SQL Server) and then query it.Tear
Problem, not every times we can use Microsoft SQL Server database, for some case we have only Oracle, Excel and for example mySql databaseKillingsworth
You can insert to any other DB as destination, like MySQL/Postgresql/OracleTear
Use a business intelligence tool like Qlikview to collate the data from different DBMS and then join them within the Qlikview extraction layerEscobedo
"You can insert to any other DB as destination, like MySQL/Postgresql/Oracle" - Yes, it's help for some case, but it we need to join from two very big table from Oracle and SQL server it's isn't possibleKillingsworth
Why does it have to be done within an SQL query? Why can't you query each data source separately and then merge the results?Interrogate
Because, I need do group function for example (group by) using this merged results. My customers want to write any sql from any database (with having, group by, order by, sum(), count() and so on). Of course, I can try to do my own SQL parser for merged results and build own execution plans optimizer, however it's need a lot of work.Killingsworth
Don't know much about it myself, but there is UnityJDBC.Frankiefrankincense
O, thank you. Why do you not write about UnityJDBC in answer? It may be same answer what we find...Killingsworth
@ViacheslavVedenin I answered as requested although UnityJDBC is not my real answer. Good luck with your project.Frankiefrankincense
Apache Metamodel might be a solution for your requirement. metamodel.apache.orgHighspirited
C
4

One of the appropriate solution is DataNucleus platform which has JDO, JPA and REST APIs. It has support for almost every RDBMS (PostgreSQL, MySQL, SQLServer, Oracle, DB2 etc) and NoSQL datastore like Map based, Graph based, Doc based etc, database web services, LDAP, Documents like XLS, ODF, XML etc.

Alternatively you can use EclipseLink, which also has support for RDBMS, NoSQL, database web services and XML.

By using JDOQL which is part of JDO API, the requirement of having one query to access multiple datastore will be met. Both the solutions are open-source, relatively lightweight and performant.

Why did I suggest this solution ?

  • From your requirement its understood that the datastore will be your customer choice and you are not looking for a Big Data solution.
  • You are preferring open-source solutions, which are light weight and performant.
  • Considering your use case you might require a data management platform with polyglot persistence behaviour, which has the ability to leverage multiple datastore, based on your/customer's use cases.

To read more about polyglot persistence

https://dzone.com/articles/polyglot-persistence-future

https://www.mapr.com/products/polyglot-persistence

Coucal answered 26/1, 2016 at 14:20 Comment(0)
F
7

UnityJDBC is a commercial JDBC Driver that wraps multiple datasoruces and allows you to treat them as if they were all part of the same database. It works as follows:

You define a "schema file" to describe each of your databases. The schema file resembles something like:

...
<TABLE>
    <semanticTableName>Database1.MY_TABLE</semanticTableName>
    <tableName>MY_TABLE</tableName>
    <numTuples>2000</numTuples>
 <FIELD>
    <semanticFieldName>MY_TABLE.MY_ID</semanticFieldName>
    <fieldName>MY_ID</fieldName>
    <dataType>3</dataType>
    <dataTypeName>DECIMAL</dataTypeName>
    ...

You also have a central "sources file" that references all of your schema files and gives connection information, and it looks like this:

<SOURCES>
    <DATABASE>
        <URL>jdbc:oracle:thin:@localhost:1521:xe</URL>
        <USER>scott</USER>
        <PASSWORD>tiger</PASSWORD>
        <DRIVER>oracle.jdbc.driver.OracleDriver</DRIVER>
        <SCHEMA>MyOracleSchema.xml</SCHEMA>
    </DATABASE>
    <DATABASE>
        <URL>jdbc:sqlserver://localhost:1433</URL>
        <USER>sa</USER>
        <PASSWORD>Password123</PASSWORD>
        <DRIVER>com.microsoft.sqlserver.jdbc.SQLServerDriver</DRIVER>
        <SCHEMA>MySQLServerSchema.xml</SCHEMA>
    </DATABASE> 
</SOURCES>

You can then use unity.jdbc.UnityDriver to allow your Java code to run SQL that joins across databases, like so:

String sql = "SELECT *\n" +
"FROM MyOracleDB.Whatever, MySQLServerDB.Something\n" +
"WHERE MyOracleDB.Whatever.whatever_id = MySQLServerDB.Something.whatever_id";
stmt.execute(sql);

So it looks like UnityJDBC provides the functionality that you need, however, I have to say that any solution that allows users to execute arbitrary SQL that joins tables across different databases sounds like a recipe to bring your databases to their knees. The solution that I would actually recommend to your type of requirements is to do ETL processes from all of your data sources into a single data warehouse and allow your users to query that; how to define those processes and your data warehouse is definitely too broad for a stackoverflow question.

Frankiefrankincense answered 24/1, 2016 at 8:47 Comment(2)
I agree. This calls for a data warehouse to keep all the information.Brachiate
As ETL may not be possible for everyone. Do you know performance of this solution on average?Eumenides
S
4

SQL is related to the database management system. SQL Server will require other SQL statements than an Oracle SQL server.

My suggestion is to use JPA. It is completely independent from your database management system and makes development in Java much more efficient.

The downside is, that cannot combine several database systems with JPA out of the box (like in an 1:1 relation between SQL Server and Oracle SQL server). You could, however, create several EntityManagerFactories (one for each database) and link them together in your code.

Pros for JPA in this scenario:

  • write database management system independent JPQL queries
  • reduces required java code

Cons for JPA:

  • you cannot relate entities from different databases (like in a 1:1 relationship)
  • you cannot query several databases with one query (combining tables from different databases in a group by or similar)

More information:

Superman answered 22/1, 2016 at 13:59 Comment(5)
Thank you, I know what JPA is. But do you know any JPA solution that can be use entities from different database in one time? For example, can be use one-to-one relationship from Oracle table and MySql table?Killingsworth
Yes, thank you. Problem, my customers want to write and run sql-like query from different database and storage in same time with group by, having, count, sum and so on in web interface of my aplication. It's hard to do using manual merge jpa entites (need to write sql parser, optimization plan optimizer and build jpa entites in run time). But, thank you very much for your help!Killingsworth
@ViacheslavVedenin I absolutely agree - JPA is not a good solution, if you want group by over several databases. I would like to keep that answer for other users with similar requirements - but in your case JPA is probably not the right choice.Superman
Just trying to add my thought to this solution. Might be helpful. Use JPA. Implement factory at database interacting layer(DAO) of code and for each database type write database specific native queries. This can help you if you know from which databases you are going to fetch data. In this approach you can fetch data from different factories by applying logic and then can process it.Piranesi
@Piranesi Thanks for your suggestion. I usually try to avoid DAOs and native queries and use pure JPA instead. If using native queries in this case you will lose the independency from the underlying sql dialect.Superman
C
4

One of the appropriate solution is DataNucleus platform which has JDO, JPA and REST APIs. It has support for almost every RDBMS (PostgreSQL, MySQL, SQLServer, Oracle, DB2 etc) and NoSQL datastore like Map based, Graph based, Doc based etc, database web services, LDAP, Documents like XLS, ODF, XML etc.

Alternatively you can use EclipseLink, which also has support for RDBMS, NoSQL, database web services and XML.

By using JDOQL which is part of JDO API, the requirement of having one query to access multiple datastore will be met. Both the solutions are open-source, relatively lightweight and performant.

Why did I suggest this solution ?

  • From your requirement its understood that the datastore will be your customer choice and you are not looking for a Big Data solution.
  • You are preferring open-source solutions, which are light weight and performant.
  • Considering your use case you might require a data management platform with polyglot persistence behaviour, which has the ability to leverage multiple datastore, based on your/customer's use cases.

To read more about polyglot persistence

https://dzone.com/articles/polyglot-persistence-future

https://www.mapr.com/products/polyglot-persistence

Coucal answered 26/1, 2016 at 14:20 Comment(0)
S
3

I would recommend presto and calcite. performance and lightweight doesn't always go hand in hand.

  • presto : quite a lot of proven usages, as you have said "big data". performs well scales well. I don't quite know what light weight means specifically, if requiring less machines is one of them, you could definitely scale less according to your need

  • calcite : embeded in a lot of data analytic libraries like drill kylin phoenix. does what you needed " connecting to multiple DBs" and most importantly "light weight"

Slade answered 29/1, 2016 at 9:2 Comment(0)
H
2

Having experience with some of the candidates (Apache Spark, Prestodb, Apache Drill) makes me chose Prestodb. Even though it is used in big data mostly I think it is easy to set it up and it has support for (almost) everything your are asking for. There are plenty of resources available online (including running it in Docker) and it also has excellent documentation and active community, also support from two companies (Facebook & Netflix).

Hazlett answered 29/1, 2016 at 8:50 Comment(0)
G
2

Multiple Databases on Multiple Servers from Different Vendors The most challenging case is when the databases are on different servers and some of the servers run different database software. For example, the customers database may be hosted on machine X on Oracle, and the orders database may be hosted on machine Y with Microsoft SQL Server. Even if both databases are hosted on machine X but one is on Oracle and the other on Microsoft SQL Server, the problem is the same: somehow the information in these databases must be shared across the different platforms. Many commercial databases support this feature using some form of federation, integration components, or table linking (e.g. IBM, Oracle, Microsoft), but support in the open-source databases (HSQL, MySQL, PostgreSQL) is limited.

There are various techniques to handling this problem:

  • Table Linking and Federation - link tables from one source into another for querying
  • Custom Code - write code and multiple queries to manually combine the data
  • Data Warehousing/ETL - extract, transform, and load the data into another source
  • Mediation Software - write one query that is translated by a mediator to extract the data required
Glucose answered 29/1, 2016 at 9:55 Comment(0)
T
1

May be wage idea. Try to use Apache solr. User different data sources and import the data in to Apache solr. Once data is available you can write different queries by indexing it.

It is open source search platform, that makes sure your search is faster.

Taxonomy answered 22/1, 2016 at 19:1 Comment(2)
Thank you, it's very intresting ideaKillingsworth
Did you try this option. Does it solved your problemTaxonomy
C
1

That's why Hibernate framework is for, Hibernate has its own query language HQL mostly identical to SQL. Hibernate acts as a middle ware to convert HQL query to database specific queries.

Crowson answered 27/1, 2016 at 17:42 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.