Run SQL query on CSV file contents from command-line?
Asked Answered
N

5

5

I have a (syntactically-valid, not-huge) CSV file. I want to run an arbitrary SQL SELECT query on it; let's assume that it can be a complex query, with some inner queries, with self-joins, maybe with window functions etc.

Now, obviously, I could install a DBMS, run the server process, create a new DB, maybe create an appropriate table, load that CSV data into a table, then finally use a client for the DB to send the query and get the result.

But - that seems like quite a hassle, and I don't need the loaded DB table again for another query.

So, is there an easier way to do this, from a command-line/shell session, hopefully without so much overhead and waste of space?

Notes:

  • I'm not averse to writing a short script (bash, Python, Perl).
  • You may assume any specific SQL language variant you like; I'll adapt as necessary.
  • You may assume the operating system is GNU/Linux if that makes it easier.
Nagle answered 22/8, 2023 at 13:48 Comment(16)
"...let's assume that in can be a complex query, with some inner queries, with self-joins, maybe with window functions etc..." -- Well... you need a database engine. This smells like something Amazon Athena could do, nevertheless.Kishke
What language are we talking about here? With most of the languages you don't need to install anything but a simple ODBC\OLEDB connection. ie: You could have an SQLite ODBC driver, you could load the data in an in-memory database and work on it. IOW would you give more information on the environment? OS, language(s) ... Or with a language like Go, you could directly read the CSV and work on it.Panhellenism
Or... you can load the CSV file in an in memory H2 database; you said it's a small data set; no need to install anything in the server. You can use create table my_data as select * from csvread('my_file.csv');. Then, you can run any complex query in the table my_data.Kishke
"From the command line", what you are saying sounds to be a good fit for interactive F# or Python as well. Maybe a Jupyter notebook is what you are looking for.Panhellenism
I run scripts from command line (cron jobs) using H2 database (for complex queries), by combining Java an Ant. Ant can run SQL easily. What I mention is easy -- of course -- if you know these tools.Kishke
@CetinBasoz: The language is SQL (ISO SQL 2003 if I want to use window function). As for the language for a script - bash, maybe python or perl if it's really necessary. See edit. Using SQLite via ODBC is fine by me; could I do it with a simple short script and not much hassle?Nagle
@TheImpaler: Can you make an answer of that? Is that doable without writing a long program, and for an arbitrary CSV file and query?Nagle
@JNevill: I don't think a solution based on assuming someone has paid Microsoft for a Windows and an Office license is reasonable. Regardless - I want to run something simple from the command-line.Nagle
If windows and office is actually a constraint for a solution, I suggest you mention it in your question. A lot of information workers use these tools.Magnificent
@Nick.McDermaid: No Windows, no Office. I said I want to do something on the command-line.Nagle
@Nagle If this site was limited to only open-source technology, it would be much less valuable. If you have constraints in how this is solutioned, those should be noted in the question.Firewater
@Firewater : Can you link to one such question which asks for a way to do this using the command-line?Nagle
You failed to even mention the database you are usingVoss
@JGFMK: I'm not using a DBMS, that's the whole point. I just have a CSV.Nagle
@Nagle thought the whole point of SO was to demonstrate what you had done - not have people do your work for you!Voss
@JGFMK: The point, for question asking, is to ask questions which SO users find useful (i.e. believe it is useful for answers to them to be available on the site). That's it. The answer does not have to be "doing the work for you"; it could just be a sketch of a relevant approach.Nagle
B
8

You can use sqlite for this:

sqlite3 :memory: -cmd '.import -csv taxi.csv taxi' \
    'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count'

There are many wrappers that try to simplify this (e.g. csvsql --query, dsq or q), but you may find they have limitations of their own (e.g. q doesn't support CTEs, though sqlite obiously does).

duckdb, as an analytic-focused alternative to sqlite, makes this use-case very simple and fast out of the box, while providing some flexibility too:

duckdb -c "SELECT * FROM 'data_source.csv.gz'"
Brenan answered 13/11, 2023 at 21:54 Comment(2)
If you prefer a GUI the DB Browser for SQLLite works a treat importing the csv's! DB Browser for SQLiteProfiteer
Thanks, csvkit is great! csvsql --query was just what I needed.Imelda
K
2

You can run complex SQL queries from data in a CSV file, if you have Java installed in your OS (that's pretty common) and by combining Ant (scripting) and H2 (in-memory database).

For example, if you have the file my_file.csv as:

"name",     "sex", "age", "height", "weight"
"Alex",       "M",   41,       74,      170
"Bert",       "M",   42,       66,      166
"Dave",       "M",   39,       72,      167
"Elly",       "F",   30,       70,      124
"Luke",       "M",   34,       72,      163
"Omar",       "M",   38,       69,      145
"Page",       "F",   31,       67,      135
"Ruth",       "F",   28,       65,      131

Then you can create the script1.xml Ant script:

<project default="sql">
 
  <target name="sql">
  
    <sql driver="org.h2.Driver" url="jdbc:h2:mem:" userid="sa" password=""
         classpath="./h2-2.1.214.jar" onerror="abort" print="true"
         showheaders="true" showtrailers="true" showWarnings="true" >

      create table my_data as select * from csvread('my_file.csv');

      select * 
      from my_data a
      join (
        select * from my_data where sex = 'M'
      ) b on b.height &lt; a.height 
      where a.sex = 'F'

    </sql>
      
  </target>
 
</project>

Download and unzip Apache Ant 1.10.14 (or other version) under the folder apache-ant-1.10.14/. Then download the H2 database jar file in the folder. Your dir should look like:

apache-ant-1.10.14/
h2-2.1.214.jar
my_file.csv
script1.xml

Then, just run it:

apache-ant-1.10.14/bin/ant -f script1.xml

The first SQL statement loads the CSV data in memory. The second one is your "complex" query:

[sql] Executing commands

[sql] NAME,SEX,AGE,HEIGHT,WEIGHT,NAME,SEX,AGE,HEIGHT,WEIGHT
[sql] Elly,F,30,70,124,Bert,M,42,66,166
[sql] Elly,F,30,70,124,Omar,M,38,69,145
[sql] Page,F,31,67,135,Bert,M,42,66,166
[sql] 
[sql] 2 of 2 SQL statements executed successfully

BUILD SUCCESSFUL
Total time: 0 seconds

That's it!

Notes:

  • For a variable query string, you could externalize it in a file and then use <transaction src="command.sql" /> instead of the embedded SQL I included. You can use multiple tags as needed.

  • For a variable CSV file name (or any parameter value) you can add -Dcsvfile=my_file.csv in the command line and then use it in the script as ${csvfile}. You can define many parameters, as needed.

  • H2 by definition is not a system-level installed database, but and embeded one. Nevertheless, you can place the JAR file anywhere you want, and then use <classpath> tags to refer to the system classpath, instead of the simpler classpath attrubute in this example. See Ant -> Ant Tasks -> List of Tasks -> Sql and <classpath> for details.

  • As a database, H2 is quite powerful. I would qualify its SQL support to medium-high, so it implements most of the common SQL syntax and features you would find in expensive databases.

  • Note that the SQL query is embedded in the script using XML format. That's why the < symbol is escaped as &lt;. If you don't like this, you can place the query in an external file (no escaping necessary) using the <transaction> tag.

  • You can also output the result as CSV or other format for further processing or to be consumed by other tools.

Kishke answered 23/8, 2023 at 13:36 Comment(7)
So, +1 already; I would just need to generalize this script so that it accepts the CSV filename from the command-line, as well as the query string. Can I do that in Ant itself? If not, perhaps I would have an ant script template, and my script will plug the two strings into it, then run ant. Also, if I want to rely on a system-level-installed version of H2, could I do that with a similarly-simple ant script? e.g. perhaps drop the classpath and have it search the system classpaths?Nagle
Let's go one by one. For a variable query string, you could externalize it in a file and then use <transaction src="command.sql" /> instead of the embedded SQL I included. You can use multiple <transaction> tags as needed.Kishke
For a variable CSV file name (or any parameter value) you can add -Dcsvfile=my_file.csv in the command line and then use it in the script as ${csvfile}. You can define many parameters, as needed.Kishke
One last question: What language standard does this support?Nagle
1. Thanks for the clarifications; please consider integrating them into the answer. 2. One last question: What language standard does this support? 3. Alas, the Python script solution is much shorter and simpler...Nagle
That's actually a very good point, because it seems like the Python sqldf solution is limited to sqlite's subset of SQL, which is more limited. I un-accepted the other answer for now :-)Nagle
@Nagle curious as to which parts of SQL you find missing from sqlite?Brenan
P
2

Since you are not averse to using Python you can use third party modules pandas and pandasql. The code would look something like

from pandasql import sqldf
import pandas as pd

df = pd.read_csv('your_file.csv')
query = 'select * from df'
print(sqldf(query).to_string(index=False))

If you prefer using this from the command line, it can be rewritten as

from pandasql import sqldf
import pandas as pd
import sys

_, _, csv_path, query = sys.argv

df = pd.read_csv(csv_path)
print(sqldf(query).to_string(index=False))

If your script is named script.py, invoke it like

python script.py  /path/to/whatever.csv "select * from df"

For better SQL support you can use duckdb as an alternative to pandasql:

import duckdb
import pandas as pd
import sys

_, _, csv_path, query = sys.argv

df = pd.read_csv(csv_path)
print(duckdb.query(query).to_df().to_string(index=False))
Pronounced answered 23/8, 2023 at 13:55 Comment(6)
I have some potential improvements/notes to this useful answer (e.g. making this a proper executable script taking an arbitrary and query string). Would you prefer I edited them in or write my own, based on yours?Nagle
@einpoklum: I threw in a basic, untested command line solution. Feel free to edit in something better.Pronounced
Wait a minute... is the SQL language supported here only what SQLite3 supports?Nagle
Correct, but I'll add a variation using duckdb, which has better sql support.Pronounced
Funny, I end up with my researcher floor-colleagues' (Hannes, Mark, Pedro) work as the answer to my practical question :-)Nagle
I just found out you can also omit pandas by altering the query to read from csv: duckdb.query("select * from read_csv('test.csv', AUTO_DETECT=TRUE)"). See CSV Loading for adding csv reads to duckdb queries. If you want to avoid Python altogether it looks like you can use duckdb's CLI API.Pronounced
A
1

Came wondering what the state of the art might be in 2024. I'd looked into this before, the SQLite and Duckdb usage are useful to know.

Going to brain dump, my apologies:

Turns out I had come across Duckdb a while back (found but not used).

Enjoy!

Athematic answered 6/10 at 6:33 Comment(0)
D
-1

You can use free and open-source Factotum Data Manipulation Tool. It loads the file into a temporary database and allows you to query its content with SQL. You can also switch between Polars and DuckDB SQL engines. Many other formats are supported, not only the CSV. Works on Windows, Mac, Linux.

Steps.

  1. Launch Factotum
  2. Import a CSV file using Ctrl+O or Cmd+O shortcut
  3. Compose your SQL query
  4. Run your SQL query using Ctrl+Enter or Cmd+Return

There is also a video illustrating these steps in the article.

Disclaimer: I am a developer of Factotum.

Dacoity answered 13/11, 2023 at 8:9 Comment(5)
Suppose I've installed factorum on my system. What would I do then, exactly? Otherwise this is just a stub more than an answer.Nagle
@Nagle I updated the answer with exact steps. Thanks!Dacoity
This question is about running the query from the command-line. It seems your answer does not offer a way of doing that. If it indeed does not - please remove it (and no offense to your tool.)Nagle
Please add an emphasis that you need only a command-line solution in your question. I see that you mention a command-line, but there is no proper highlight that you are going to reject answers that provide a slightly different but still a working solution. You can add it as one more bullet point in Notes. After that I'll remove my answer.Dacoity
The title explicitly says "command-line".Nagle

© 2022 - 2024 — McMap. All rights reserved.