Write Drill query output to csv (or some other format)
I'm using Drill in embedded mode, and I can't figure out how to save query output other than copying and pasting it.

Dove answered 23/6, 2015 at 23:1 Comment(0)

If you're using sqlline, you can create a new table as CSV as follows:

use dfs.tmp; 
alter session set `store.format`='csv';
create table dfs.tmp.my_output as select * from cp.`employee.json`;

Your CSV file(s) will appear in /tmp/my_output.
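Note that Drill picks the fragment file names itself (typically 0_0_0.csv), so if you need a specific output name you have to rename or merge the fragments after the CTAS finishes. A minimal sketch in Python — the paths and the function name are placeholders, not anything Drill provides:

```python
import glob
import os
import shutil

def collect_ctas_output(table_dir, dest_file):
    """Rename/merge Drill CTAS fragment files (e.g. 0_0_0.csv) into one named file.

    Assumes all fragments share the same header line; if headers can
    differ across fragments, see the notes in the other answers.
    """
    fragments = sorted(glob.glob(os.path.join(table_dir, "*.csv")))
    if not fragments:
        raise FileNotFoundError("no CSV fragments in %s" % table_dir)
    if len(fragments) == 1:
        # single fragment: a plain rename is enough
        shutil.move(fragments[0], dest_file)
    else:
        # multiple fragments: concatenate, keeping the header only once
        with open(dest_file, "w") as out:
            for i, frag in enumerate(fragments):
                with open(frag) as f:
                    lines = f.readlines()
                out.writelines(lines if i == 0 else lines[1:])
    return dest_file
```

For example, `collect_ctas_output("/tmp/my_output", "/tmp/employees.csv")` would turn the directory from the query above into a single file with the name you actually want.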

Alcantar answered 29/6, 2015 at 17:35 Comment(2)
It's working for me, but the query saves the output as 0_0_0.csv; I want to save the file with a specific name. Is it possible to give the file a proper name?Polygamist
I tried running this solution in the Apache Drill web interface but got an error. It seems I am unable to run three commands in a single "submit". Detailed problem here: #57340563Kuibyshev

You can specify !record <file_path> to save all output to a particular file. See the Drill docs.

Braggart answered 28/3, 2016 at 10:46 Comment(0)

If you are using SQLLine, use !record <file_path>.

If you are running a set of queries, you need to specify the exact schema to use. This can be done with the USE schema command. Unfortunately, you also must not use your root schema. Ensure that you have created the correct directory on your file system and use the proper storage configuration; an example configuration is below. After this, you can create a CSV via Java using the JDBC driver, or in a tool such as Pentaho. With the proper specification, it is possible to use the REST query tool at localhost:8047/query as well. The query to produce a CSV at /out/data/csv is below, after the configuration example.
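For the REST route mentioned above, the web tool posts JSON to Drill's /query.json endpoint. A hedged sketch of building that request in Python — the host, port, and SQL below are assumptions for illustration, not values from this answer:

```python
import json
import urllib.request

def drill_query_request(sql, host="localhost", port=8047):
    """Build a POST request for Drill's REST query endpoint (/query.json)."""
    payload = {"queryType": "SQL", "query": sql}
    return urllib.request.Request(
        url="http://%s:%d/query.json" % (host, port),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Example: submit the CTAS from this answer through REST instead of sqlline.
req = drill_query_request(
    "CREATE TABLE fs.csvOut.mycsv_out AS SELECT * FROM fs.`my_records_in.json`"
)
# urllib.request.urlopen(req) would send it to a running Drillbit.
```

One design note: session-scoped settings like `store.format` do not persist across separate REST requests the way they do in one sqlline session, which is why the writable-workspace configuration below matters for this route.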

Storage Configuration

{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "config": null,
  "workspaces": {
    "root": {
      "location": "/out",
      "writable": false,
      "defaultInputFormat": null
    },
    "jsonOut": {
      "location": "/out/data/json",
      "writable": true,
      "defaultInputFormat": "json"
    },
    "csvOut": {
      "location": "/out/data/csv",
      "writable": true,
      "defaultInputFormat": "csv"
    }
  },
  "formats": {
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    }
  }
}

Query

USE fs.csvOut;
ALTER SESSION SET `store.format`='csv';
CREATE TABLE fs.csvOut.mycsv_out
AS SELECT * FROM fs.`my_records_in.json`;

This will produce at least one CSV and possibly many with different header specifications at /out/data/csv/mycsv_out.

Each file name should match the following pattern:

\d+_\d+_\d+\.csv

Note: while the query result can be read as a single CSV, the resulting CSV files (if there is more than one) cannot be, because the number of headers will vary between fragments. If this is the case, write the output as JSON instead and read it with code, with Drill, or with another tool.
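If you do end up with fragments whose headers differ, one way to stitch them back together is to read each fragment keyed by its own header and write out the union of columns. A sketch in Python — the file pattern is the 0_0_0.csv-style naming described above, and the function name is made up for illustration:

```python
import csv
import glob

def merge_drill_fragments(pattern, dest_file):
    """Combine CSV fragments with possibly different headers into one file.

    Columns missing from a given fragment are left empty in the output.
    Returns the merged column list.
    """
    rows, columns = [], []
    for frag in sorted(glob.glob(pattern)):
        with open(frag, newline="") as f:
            reader = csv.DictReader(f)
            # grow the column list with any headers we haven't seen yet
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            rows.extend(reader)
    with open(dest_file, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return columns
```

Usage would look like `merge_drill_fragments("/out/data/csv/mycsv_out/*.csv", "/out/data/merged.csv")`.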

Pedlar answered 26/5, 2016 at 22:38 Comment(3)
It's working for me, but the query saves the output as 0_0_0.csv; I want to save the file with a specific name. Is it possible to give the file a proper name?Polygamist
That issue likely has to do with how distributed file systems and memory are used. You could try using !record /path/to/output after a query. Most likely you will need to rename the file afterwards with Python, bash, or another scripting language. superuser.com/questions/486465/…Pedlar
Is it possible to change the name of the file 0_0_0.csv to some other name?Polygamist

UPDATE: REDIRECTING APACHE DRILL SHELL OUTPUT TO A CSV FILE

It's now early 2018, and for some of you (particularly Apache Drill on MapR), the above commands DON'T work. If that's the case, try the following. As of 2018-03-02 this DOES work on MapR 5.2 and MapR 6 :-)

NOTE: I'm using "//" to denote comments alongside actual commands...
NOTE: I'm using "=>" to denote the RESPONSE of the shell to the command...

//FROM INSIDE A DRILL SHELL (ie "SQLLINE")...
//first set the "outputformat" session (shell) variables...

!set outputformat 'csv'

=> you see some output from the shell echoing back the new value...

//next begin "recording" any output to a file...

!record '/user/user01/query_output.csv'

=> again you see some output from the shell echoing that "recording" is ON...

//next actually submit (say) a SELECT query, whose output will now be CSV (even to the screen), as opposed to "TABLE" format...

SELECT * FROM hive.orders;

=> output (formatted as CSV) will begin streaming to both the screen and the file you specified...

//finally you turn OFF the "recording", so the csv file closes...

!record

THAT'S IT - you're DONE! :-) Now you can either process that CSV where it sits in cluster storage, or - if you have a need - TRANSFER that file OUT of the cluster to some other server running Tableau, Kibana, Power BI Desktop, or another visualization tool for further analysis.

Chopstick answered 2/3, 2018 at 23:56 Comment(0)

To save the output of a query in Drill embedded mode, you first need to create a table in the tmp schema. Say you want to extract the first 5 rows of a Parquet file input_file.parquet in your home folder and write the output to output_file.parquet:

CREATE TABLE dfs.tmp.`output_file.parquet`
AS 
(
 SELECT *
 FROM dfs.`/Users/your_user_name/input_file.parquet`
 LIMIT 5
);

The output will be saved under /tmp/output_file.parquet.

You can check the result inside drill with

SELECT * 
FROM dfs.tmp.`output_file.parquet`;
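As with the CSV answers, the path Drill writes here is actually a directory containing fragment files. A quick Python sketch to see what a CTAS produced (the path is the one from this answer; the helper name is made up, adjust to your setup):

```python
from pathlib import Path

def list_ctas_fragments(table_dir):
    """Return the fragment file names inside a Drill CTAS output directory."""
    return sorted(p.name for p in Path(table_dir).iterdir() if p.is_file())

# e.g. list_ctas_fragments("/tmp/output_file.parquet")
# typically yields names like ['0_0_0.parquet']
```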
Mcguire answered 3/9, 2020 at 9:51 Comment(0)

You can set the output format with an ALTER SESSION query as shown below, then run a CREATE TABLE AS query, and your results will be saved as CSV. You can also save output in JSON or Parquet format.

use dfs.tmp; 
alter session set `store.format`='csv';
create table dfs.tmp.my_output as select * from cp.`employee.json`;



Adz answered 9/8, 2022 at 12:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.