How to get lineno of "end-of-statement" in Python ast
D

6

42

I am working on a script that modifies another Python script. The script to be modified has a structure like:

class SomethingRecord(Record):
    description = 'This records something'
    author = 'john smith'

I use ast to locate the line number of the description assignment, and then some code rewrites the original file with a new description string based on that line number. So far so good.

Now the only issue is that description is occasionally a multi-line string, e.g.

    description = ('line 1'
                   'line 2'
                   'line 3')

or

    description = 'line 1' \
        'line 2' \
        'line 3'

and I only have the line number of the first line, not of the following lines. So my one-line replacer would produce

    description = 'new value'
        'line 2' \
        'line 3'

and the code is broken. I figure that if I knew both the start and end line numbers (or the number of lines) of the description assignment, I could repair my code to handle this situation. How do I get that information with the Python standard library?

Delve answered 29/9, 2016 at 20:35 Comment(5)
Do you need to preserve the count of blank lines between the last line of the description assignment and the next statement or is it sufficient to normalize the output?Exsiccate
@BrianCain Yes. I wish to make sure the update of the line does not make the code style change.Delve
The astor library can turn ast back to code.Violinist
When you have two representations of a program, it is really easy for them to get out of sync, which is what is happening in your case. This is why one makes modifications to one representation of the program, e.g., the AST itself instead of the text file, and then prettyprints the AST. Done right, this also handles retaining comments and whitespace. People keep trying to hack solutions to modify code. It is better to use tools that are designed to do this job. See en.wikipedia.org/wiki/Program_transformation for tools that do this "right".Er
This sounds like an XY problem. It sounds like what you really want is a method or tool to replace the description assignment statement reliably. [Maybe you can do that if you know the line numbers and hack at the source text, but that doesn't seem necessary to solve the actual problem.] Please clarify.Er
E
7

I looked at the other answers; it appears people are doing backflips to get around the problems of computing line numbers, when your real problem is one of modifying the code. That suggests the baseline machinery is not helping you the way you really need.

If you use a program transformation system (PTS), you could avoid a lot of this nonsense.

A good PTS will parse your source code to an AST, and then let you apply source-level rewrite rules to modify the AST, and will finally convert the modified AST back into source text. Generically PTSes accept transformation rules of essentially this form:

   if you see *this*, replace it by *that*

[A parser that builds an AST is NOT a PTS. Such parsers don't allow rules like this; you can write ad hoc code to hack at the tree, but that's usually pretty awkward. Nor do they do the AST-to-source-text regeneration.]
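As a rough illustration of the ad hoc tree hacking described above, here is a minimal stdlib-only sketch. It assumes Python 3.9+ for ast.unparse (which did not exist when this was written), and the class name DescriptionRewriter is made up for this example. Note what it loses: the regenerated text has no comments and none of the original layout, which is exactly the gap a PTS is meant to close.

```python
import ast


class DescriptionRewriter(ast.NodeTransformer):
    """Hypothetical helper: replace the value assigned to a given name."""

    def __init__(self, name, new_value):
        self.name = name
        self.new_value = new_value

    def visit_Assign(self, node):
        # Rewrite only assignments whose target is the plain name we want.
        if any(isinstance(t, ast.Name) and t.id == self.name for t in node.targets):
            node.value = ast.Constant(value=self.new_value)
        return node


source = '''\
class SomethingRecord(Record):
    # a comment that will be lost
    description = ('line 1'
                   'line 2')
    author = 'john smith'
'''

tree = DescriptionRewriter('description', 'new value').visit(ast.parse(source))
new_source = ast.unparse(tree)  # requires Python 3.9+
print(new_source)
```

The assignment is replaced without any line-number bookkeeping, but the comment and the original formatting are gone from new_source.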

DMS (my PTS; see bio) could accomplish this. OP's specific example would be handled easily by the following rewrite rule:

 source domain Python; -- tell DMS the syntax of pattern left hand sides
 target domain Python; -- tell DMS the syntax of pattern right hand sides

 rule replace_description(e: expression): statement -> statement =
     " description = \e "
  ->
     " description = ('line 1'
                      'line 2'
                      'line 3')";

The one transformation rule is given a name, replace_description, to distinguish it from all the other rules we might define. The rule parameter (e: expression) indicates the pattern will allow an arbitrary expression as defined by the source language. statement->statement means the rule maps a statement in the source language to a statement in the target language; we could use any other syntax category from the Python grammar provided to DMS. The " used here is a metaquote, used to distinguish the syntax of the rule language from the syntax of the subject language. The second -> separates the source pattern this from the target pattern that.

You'll notice that there is no need to mention line numbers. The PTS converts the rule surface syntax into corresponding ASTs by actually parsing the patterns with the same parser used to parse the source file. The ASTs produced for the patterns are used to effect the pattern match/replacement. Because this is driven from ASTs, the actual layout of the original code (spacing, line breaks, comments) doesn't affect DMS's ability to match or replace. Comments aren't a problem for matching because they are attached to tree nodes rather than being tree nodes; they are preserved in the transformed program. DMS does capture line and precise column information for all tree elements; it just isn't needed to implement transformations. Code layout is also preserved in the output by DMS, using that line/column information.

Other PTSes offer generally similar capabilities.

Er answered 9/10, 2016 at 16:17 Comment(3)
No. It answers op's motivation, a script that manipulates another script. When one gets distracted by "trying solve X using Y", and Y doesn't work very well, then you need to find another way to solve X. Isn't that obvious?Er
There was a constraint "How do I get such information with Python standard library?" There are indeed lots of tools that could do the transformation. As I understand the question, it was about solving the problem using only stdlib (i.e. no extra dependencies).Jordonjorey
If OP's problem is exactly what he shows, and he doesn't have big files, then a line number solution so he can hack the source file could work. If the change he wants to make starts in the middle of one line and ends in the middle of that line (or another), then "line numbers" won't help him, because Python (indeed most source code) doesn't come in quanta of "lines". Worse, if he wants to make many transforms to the code, a source-patching scheme will fail him because source patches compose badly. So a line-number-based scheme is really the wrong answer, even if that's what he asked for.Er
E
6

As a workaround you can change:

    description = 'line 1' \
              'line 2' \
              'line 3'

to:

    description = 'new value'; tmp = 'line 1' \
              'line 2' \
              'line 3'

etc.

It is a simple change, but the code it produces is indeed ugly.

Electrokinetics answered 5/10, 2016 at 16:22 Comment(5)
what if there is a "tmp" variable around?Delve
So you can call it some other garbage name like thisisgarbagevariablemadebyxisscript777Electrokinetics
If you can't assume that a given garbage variable name is unused, then you should check which names exist and generate a different one.Electrokinetics
Or you could use _, which is the traditional python throwaway variable name?Earhart
@Earhart ... but it can still be used by developers, especially those using gettext (_ often represents the i18n function in this scenario)Addition
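The "check which names exist and generate a different one" idea from the comments can be sketched with ast. This is only a sketch: the function name unused_name and the _tmp prefix are invented here, and it cannot see names created dynamically (for example through globals() or exec).

```python
import ast
import itertools


def unused_name(source, prefix='_tmp'):
    """Return an identifier that is not bound or referenced in `source`."""
    used = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            used.add(node.id)
        elif isinstance(node, ast.arg):
            used.add(node.arg)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            used.add(node.name)
        elif isinstance(node, ast.alias):
            # "import a.b" binds "a"; "import a.b as c" binds "c".
            used.add((node.asname or node.name).split('.')[0])
    # Try _tmp0, _tmp1, ... until one is free.
    for n in itertools.count():
        candidate = f'{prefix}{n}'
        if candidate not in used:
            return candidate


sample = "_tmp0 = 1\ndef _tmp1(): pass\nimport os as _tmp2\n"
fresh = unused_name(sample)
print(fresh)
```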
K
5

Since Python 3.8, this is available directly: AST nodes carry an end_lineno attribute (alongside end_col_offset).
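A minimal sketch of reading both line numbers, assuming Python 3.8+ and the class layout from the question. The helper name statement_span is made up for this example.

```python
import ast

source = '''\
class SomethingRecord(Record):
    description = ('line 1'
                   'line 2'
                   'line 3')
    author = 'john smith'
'''


def statement_span(source, name):
    """Return (first_line, last_line), 1-based, of the first assignment to `name`."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == name:
                    return node.lineno, node.end_lineno
    return None  # no such assignment


span = statement_span(source, 'description')
print(span)
```

For the sample above, the assignment starts on line 2 and ends on line 4, so a line-based replacer would need to drop lines 2 through 4 before writing the new statement.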

Kick answered 6/8, 2020 at 19:59 Comment(1)
In 2020, this is the canonical answer. Note, however, that the ast.get_source_segment() function, first introduced with Python 3.8, should typically be called instead. Thanks to the rarely used optional line terminator ;, computing character intervals on the basis of line numbers alone is guaranteed to fail for edge-case code. Line and column numbers must both be considered.Harsh
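To illustrate the point about columns, here is a sketch that replaces the statement using both line and column offsets, assuming Python 3.8+. The function name replace_assignment is hypothetical. One detail worth knowing: col_offset and end_col_offset are UTF-8 byte offsets, so the sketch slices the encoded source rather than the str.

```python
import ast


def replace_assignment(source, name, new_value):
    """Rewrite the first `name = ...` statement in `source` to assign `new_value`."""
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Assign)
                and len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)
                and node.targets[0].id == name):
            break
    else:
        return source  # no matching assignment; leave the text untouched

    # The AST column offsets count UTF-8 bytes, so work on the encoded text.
    data = source.encode('utf-8')
    lines = data.splitlines(keepends=True)

    def offset(lineno, col):
        # Absolute byte offset for a 1-based line number and 0-based column.
        return sum(len(line) for line in lines[:lineno - 1]) + col

    start = offset(node.lineno, node.col_offset)
    end = offset(node.end_lineno, node.end_col_offset)
    replacement = f'{name} = {new_value!r}'.encode('utf-8')
    return (data[:start] + replacement + data[end:]).decode('utf-8')


source = (
    "class SomethingRecord:\n"
    "    description = ('line 1'\n"
    "                   'line 2'); author = 'john smith'\n"
)
result = replace_assignment(source, 'description', 'new value')
print(result)
```

Because only the span of the assignment itself is replaced, the author assignment that shares the last line after the ; survives, which a purely line-based replacement would have destroyed.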
J
3

Indeed, the information you need is not stored in the ast. I don't know the details of what you need, but it looks like you could use the tokenize module from the standard library. The idea is that every logical Python line is ended by a NEWLINE token (a statement can also be ended by a semicolon, but as I understand it, that is not your case). I tested this approach with the following file:

# first comment
class SomethingRecord:
    description = ('line 1'
                   'line 2'
                   'line 3')

class SomethingRecord2:
    description = ('line 1',
                   'line 2',
                   # comment in the middle

                   'line 3')

class SomethingRecord3:
    description = 'line 1' \
                  'line 2' \
                  'line 3'
    whatever = 'line'

class SomethingRecord3:
    description = 'line 1', \
                  'line 2', \
                  'line 3'
                  # last comment

And here is what I propose to do:

import tokenize
from io import BytesIO
from collections import defaultdict

with tokenize.open('testmod.py') as f:
    code = f.read()
    enc = f.encoding

rl = BytesIO(code.encode(enc)).readline
tokens = list(tokenize.tokenize(rl))

token_table = defaultdict(list)  # mapping line numbers to token numbers
for i, tok in enumerate(tokens):
    token_table[tok.start[0]].append(i)

def find_end(start):
    i = token_table[start][-1]  # last token number on the start line
    while tokens[i].exact_type != tokenize.NEWLINE:
        i += 1
    return tokens[i].start[0]

print(find_end(3))
print(find_end(8))
print(find_end(15))
print(find_end(21))

This prints out:

5
12
17
23

This seems to be correct; you can tune this approach depending on what exactly you need. tokenize is more verbose than ast but also more flexible. Of course, the best approach is to use both, each for a different part of your task.
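Using both modules together might look like the sketch below: ast finds the start line of each assignment to a name, and tokenize supplies the row of the NEWLINE token that ends that logical line. The function name statement_spans is invented for this example, and it assumes each assignment starts on its own line and that the source ends with a newline.

```python
import ast
import bisect
import io
import tokenize


def statement_spans(source, name):
    """Return sorted (start_line, end_line) pairs for assignments to `name`."""
    # Rows of all NEWLINE tokens, in ascending order (tokenized once).
    newline_rows = [tok.start[0]
                    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
                    if tok.type == tokenize.NEWLINE]
    spans = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign) and any(
                isinstance(t, ast.Name) and t.id == name for t in node.targets):
            # First NEWLINE token at or after the statement's first line.
            i = bisect.bisect_left(newline_rows, node.lineno)
            spans.append((node.lineno, newline_rows[i]))
    return sorted(spans)


source = '''\
class SomethingRecord:
    description = ('line 1'
                   'line 2'
                   'line 3')
    author = 'john smith'

class SomethingRecord2:
    description = 'line 1' \\
                  'line 2' \\
                  'line 3'
'''

spans = statement_spans(source, 'description')
print(spans)
```

If several statements share a line via semicolons, they also share one NEWLINE token, so this reports the end of the whole logical line rather than of the individual statement.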


EDIT: I tried this in Python 3.4, but I think it should also work in other versions.

Jordonjorey answered 5/10, 2016 at 16:12 Comment(3)
Does it work when the statement doesn't have a trailing newline i.e. it's the last line in a file?Cathrinecathryn
I just scrolled through Grammar/Grammar and tested this with my file, and it looks like it works. If you find a case where it doesn't, just add and tokens[i].exact_type != tokenize.ENDMARKER to the loop condition.Jordonjorey
I work on transforming code systems (in many languages) in which source files vary hugely in size, including some files that are millions of lines long. Parsing is bad enough; why would I want to scan a million line file a second time just to get this?Er
A
1

My solution takes a different path: when I had to change code in another file, I opened the file, found the line, and collected all the following lines that were indented more deeply than the first. The function returns the line number of the first line and the line number of the first subsequent line that isn't indented more deeply. It returns None, None if the text I was looking for can't be found. This is of course incomplete, but I think it's enough to get you through :)

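def get_all_indented(text_lines, text_in_first_line):
    """Return (first_line, end_line) as 0-based indexes into text_lines.

    end_line is the first line after first_line that is not indented more
    deeply. Returns (None, None) if text_in_first_line is not found.
    """
    first_line = None
    indent = None
    for line_num, line in enumerate(text_lines):
        if first_line is not None:
            if not line.startswith(indent):
                return first_line, line_num  # end index is exclusive
        elif text_in_first_line in line:
            first_line = line_num
            # Continuation lines must be indented by at least one more space.
            indent = line[:line.index(text_in_first_line)] + ' '
    if first_line is not None:
        return first_line, len(text_lines)  # statement runs to the end of the file
    return None, None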
Astrogation answered 6/10, 2016 at 13:40 Comment(0)
C
1

There is a new asttokens library that addresses this well: https://github.com/gristlabs/asttokens

import ast, asttokens

code = '''
class SomethingRecord(object):
    desc1 = 'This records something'
    desc2 = ('line 1'
             'line 2'
             'line 3')
    desc3 = 'line 1' \
            'line 2' \
            'line 3'
    author = 'john smith'
'''

atok = asttokens.ASTTokens(code, parse=True)
assign_values = [n.value for n in ast.walk(atok.tree) if isinstance(n, ast.Assign)]

replacements = [atok.get_text_range(n) + ("'new value'",) for n in assign_values]
print(asttokens.util.replace(atok.text, replacements))

produces

class SomethingRecord(object):
    desc1 = 'new value'
    desc2 = ('new value')
    desc3 = 'new value'
    author = 'new value'
Capsicum answered 22/12, 2016 at 21:18 Comment(0)
