Howto clean comments from raw sql file
Asked Answered
D

5

13

I have problem with cleaning comments and empty lines from already existing sql file. The file has over 10k lines so cleaning it manually is not an option.

I have a little python script, but I have no idea how to handle comments inside multi line inserts.

Code:

f = file( 'file.sql', 'r' )
t = filter( lambda x: not x.startswith('--') \
            and not x.isspace() 
  , f.readlines() )
f.close()
t #<- here the cleaned data should be

How it should work:

This should be cleaned:

-- normal sql comment

This should stay as it is:

CREATE FUNCTION func1(a integer) RETURNS void
    LANGUAGE plpgsql
    AS $$
BEGIN
        -- comment
       [...]
END;
$$;

INSERT INTO public.texts (multilinetext) VALUES ('
and more lines here \'
-- part of text 
\'
[...]

');
Dufrene answered 3/5, 2011 at 15:22 Comment(8)
Can you keep track of how many non-escaped quotes you've passed? An odd number means the comment is part of a string and thus should not be removed.Exemplum
but it will still remove comments from stored proceduresDufrene
I'm slightly off topic, but maybe your real problem is that your SQL file is 10K lines. Or that you're not using version control. Or both. Something like our process might help you. See my answer for #5330565Flagwaving
@Catcall I have both, the file is produced from several smaller files, but for production environment deployment the comments are not needed (they take around 60% of file)Dufrene
It might be easier to change the makefile. It's certainly easier to skip removing comments from functions and stored procedures--just don't add the "remove comments" code to the part of the make that builds the functions (or the file full of functions).Flagwaving
@Catcall but I still need the "remove comments" function, or I`m missing the meaning ?Dufrene
If each function is in a separate file under version control, it's easy to leave the comments in. If you have a 10K line file of SQL, it's harder to strip comments from everything except functions. That's what I was talking about.Flagwaving
sqlparse's strip_comments is worth looking at: sqlparse.format(sql, strip_comments=True)Amphithecium
C
10

Try the sqlparse module.

Updated example: leaving comments inside insert values, and comments within CREATE FUNCTION blocks. You can tweak further to tune the behavior:

import sqlparse
from sqlparse import tokens

queries = '''
CREATE FUNCTION func1(a integer) RETURNS void
    LANGUAGE plpgsql
        AS $$
        BEGIN
                -- comment
       END;
       $$;
SELECT -- comment
* FROM -- comment
TABLE foo;
-- comment
INSERT INTO foo VALUES ('a -- foo bar');
INSERT INTO foo
VALUES ('
a 
-- foo bar'
);

'''

IGNORE = set(['CREATE FUNCTION',])  # extend this

def _filter(stmt, allow=0):
    ddl = [t for t in stmt.tokens if t.ttype in (tokens.DDL, tokens.Keyword)]
    start = ' '.join(d.value for d in ddl[:2])
    if ddl and start in IGNORE:
        allow = 1
    for tok in stmt.tokens:
        if allow or not isinstance(tok, sqlparse.sql.Comment):
            yield tok

for stmt in sqlparse.split(queries):
    sql = sqlparse.parse(stmt)[0]
    print sqlparse.sql.TokenList([t for t in _filter(sql)])

Output:

CREATE FUNCTION func1(a integer) RETURNS void
    LANGUAGE plpgsql
        AS $$
        BEGIN
                -- comment
       END;
       $$;

SELECT * FROM TABLE foo;

INSERT INTO foo VALUES ('a -- foo bar');

INSERT INTO foo
VALUES ('
a
-- foo bar'
);
Chamomile answered 3/5, 2011 at 15:40 Comment(5)
thanks for a tip, but this is also stripping comments from inside functions, is there a way to prevent it ?Dufrene
@Szymon Lukaszczyk : i was just living the same comment for @Chamomile :) , look at my answer if you want something that work with your example.Terza
@Szymon updated to show the raw tokens, which you can filter by iterating over them.Chamomile
almost, still fails on the INSERT from my exampleDufrene
@Szymon I updated it to handle that case. This should provide enough of a baseline which you can extend and modify.Chamomile
S
15

Adding an updated answer :)

import sqlparse

sql_example = """--comment
SELECT * from test;
INSERT INTO test VALUES ('
-- test
a
');
 """
print sqlparse.format(sql_example, strip_comments=True).strip()

Output:

SELECT * from test;
INSERT INTO test VALUES ('
-- test
a
');

It achieves the same result but also covers all other corner cases and more concise

Synaesthesia answered 31/7, 2018 at 13:32 Comment(0)
C
10

Try the sqlparse module.

Updated example: leaving comments inside insert values, and comments within CREATE FUNCTION blocks. You can tweak further to tune the behavior:

import sqlparse
from sqlparse import tokens

queries = '''
CREATE FUNCTION func1(a integer) RETURNS void
    LANGUAGE plpgsql
        AS $$
        BEGIN
                -- comment
       END;
       $$;
SELECT -- comment
* FROM -- comment
TABLE foo;
-- comment
INSERT INTO foo VALUES ('a -- foo bar');
INSERT INTO foo
VALUES ('
a 
-- foo bar'
);

'''

IGNORE = set(['CREATE FUNCTION',])  # extend this

def _filter(stmt, allow=0):
    ddl = [t for t in stmt.tokens if t.ttype in (tokens.DDL, tokens.Keyword)]
    start = ' '.join(d.value for d in ddl[:2])
    if ddl and start in IGNORE:
        allow = 1
    for tok in stmt.tokens:
        if allow or not isinstance(tok, sqlparse.sql.Comment):
            yield tok

for stmt in sqlparse.split(queries):
    sql = sqlparse.parse(stmt)[0]
    print sqlparse.sql.TokenList([t for t in _filter(sql)])

Output:

CREATE FUNCTION func1(a integer) RETURNS void
    LANGUAGE plpgsql
        AS $$
        BEGIN
                -- comment
       END;
       $$;

SELECT * FROM TABLE foo;

INSERT INTO foo VALUES ('a -- foo bar');

INSERT INTO foo
VALUES ('
a
-- foo bar'
);
Chamomile answered 3/5, 2011 at 15:40 Comment(5)
thanks for a tip, but this is also stripping comments from inside functions, is there a way to prevent it ?Dufrene
@Szymon Lukaszczyk : i was just living the same comment for @Chamomile :) , look at my answer if you want something that work with your example.Terza
@Szymon updated to show the raw tokens, which you can filter by iterating over them.Chamomile
almost, still fails on the INSERT from my exampleDufrene
@Szymon I updated it to handle that case. This should provide enough of a baseline which you can extend and modify.Chamomile
T
1

This is an extend of samplebias answer that work with your example :

import sqlparse

sql_example = """--comment
SELECT * from test;
INSERT INTO test VALUES ('
-- test
a
');
"""

new_sql = []

for statement in sqlparse.parse(sql_example):
    new_tockens = [stm for stm in statement.tokens 
                   if not isinstance(stm, sqlparse.sql.Comment)]

    new_statement = sqlparse.sql.TokenList(new_tockens)
    new_sql.append(new_statement.to_unicode())

print sqlparse.format("\n".join(new_sql))

Output:

SELECT * from test;

INSERT INTO test VALUES ('
-- test
a
');
Terza answered 3/5, 2011 at 15:52 Comment(5)
print(sqlparse.format("\n".join(new_sql))) returns u'\n' after coming the example to pythonDufrene
@Szymon Lukaszczyk: i pasted the code above directly from a script that produced the same output as in my answer weird ??!! :)Terza
got (<Statement '--comm...' at 0x8d6009c>, <Statement ' INSER...' at 0x8d6002c>, <Statement ' ' at 0x8d6017c>) after parse instead of proper objects. Don`t know why.Dufrene
I have installed sqlparse from debian sid repo, (ver 0.1.2-2)Dufrene
@Szymon Lukaszczyk: i have the same version in my ubuntu machine, and the output that you said that you are getting is the return value of sqlparse.parse(sql_example) weird !!??Terza
M
1
# Remove comments i.e. lines beginning with whitespace and '--' (using multi-line flag)
re.sub('^\s*--.*\n?', '', query, flags=re.MULTILINE)

Regex string explained:

  • ^ start of line
  • \s whitespace
  • \s* zero or more whitespace characters
  • -- two hypens (static string pattern)
  • .* zero or more of any characters (i.e. the rest of the line)
  • \n newline character
  • ? end of string
  • flags = re.M is the multiline modifier

"When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline)"

See the Python regular expressions documentation for more details:

https://docs.python.org/3/library/re.html

Moulin answered 15/2, 2018 at 10:15 Comment(2)
add some explanation pleaseRociorock
It'd be nice to have this include /* */ comments too, though I'm guessing this would be tricky to accomodate with the -- ones potentially overlapping or lines starting with -- and ending with */ and you missing the closing part.Verse
Y
0

It is possible to do it with regular expressions. First you have to split the file by strings and after this you can split the file by comments. The following Perl program does it:

#! /usr/bin/perl -w

# Read hole file.
my $file = join ('', <>);

# Split by strings including the strings.
my @major_parts = split (/('(?:[^'\\]++|\\.)*+')/, $file);

foreach my $part (@major_parts) {
    if ($part =~ /^'/) {
        # Print the part if it is a string.
        print $part; 
    }
    else {
        # Split by comments removing the comments
        my @minor_parts = split (/^--.*$/m, $part);
        # Print the remaining parts.
        print join ('', @minor_parts);
    }
}
Yarrow answered 3/5, 2011 at 16:4 Comment(1)
he's looking to leave comments inside functions/procedures intact, filtering out the rest.Chamomile

© 2022 - 2024 — McMap. All rights reserved.