Markdown Text Highlighting Performance Issues - Tkinter

Overview

I’m trying to add Markdown syntax highlighting to a text editor for my project, but I’m having trouble making it robust against arbitrary user input while keeping it fast.

Basically, I'm after something like the highlighting in Visual Studio Code's Markdown editor:

[screenshot of Visual Studio Code's Markdown highlighting]

I’m talking about simple highlighting of bold, italic, lists, etc. to indicate the style that will be applied when the user previews their markdown file.

My Solution

I originally set up this method for my project (simplified for the question, with colours to make the styles clearer for debugging):

import re
import tkinter

root = tkinter.Tk()
root.title("Markdown Text Editor")
editor = tkinter.Text(root)
editor.pack()

# bind each key Release to the markdown checker function
editor.bind("<KeyRelease>", lambda event : check_markdown(editor.index('insert').split(".")[0]))


# configure markdown styles
editor.tag_config("bold",           foreground = "#FF0000") # red for debugging clarity
editor.tag_config("italic",         foreground = "#00FF00") # green for debugging clarity
editor.tag_config("bold-italic",    foreground = "#0000FF") # blue for debugging clarity


# regex expressions and empty tag length
# (raw strings, so Python does not treat "\*" as an invalid escape sequence)
search_expressions = {
#   <tag name>    <regex expression>   <empty tag size>
    "italic" :      [r"\*(.*?)\*",           2],
    "bold" :        [r"\*\*(.*?)\*\*",       4],
    "bold-italic" : [r"\*\*\*(.*?)\*\*\*",   6],
}


def check_markdown(current_line):
    # loop through each tag with the matching regex expression
    for tag, expression in search_expressions.items():
        # start and end indices for the search area
        start_index, end_index = f"{current_line}.0", f"{current_line}.end"

        # remove all tag instances
        editor.tag_remove(tag, start_index, end_index)

        # while there is still text to search
        while True:
            length = tkinter.IntVar()
            # get the index of the next match of 'expression' on 'current_line'
            index = editor.search(expression[0], start_index, count=length, stopindex=end_index, regexp=True)

            # stop if there are no more matches on the current line
            if not index:
                break

            # skip empty tags (e.g. '**' is an empty italic)
            if length.get() != expression[1]:
                # apply the tag to the markdown syntax
                editor.tag_add(tag, index, f"{index}+{length.get()}c")

            # continue searching after the markdown
            start_index = f"{index}+{length.get()}c"

            # update the display - stops program freezing
            root.update_idletasks()

root.mainloop()

I reasoned that removing all formatting on each KeyRelease and then rescanning the current line would reduce the amount of misinterpreted syntax (for example bold-italic being read as bold or italic) and stop tags from stacking on top of each other. This works well for a few sentences on a single line, but if the user types a lot of text on one line, performance drops fast, with long waits for the styles to be applied, especially when lots of different markdown syntax is involved.

I used Visual Studio Code's markdown language highlighting as a comparison, and it could handle far more syntax on a single line before it removed the highlighting for "performance reasons".

I understand this is an extremely large amount of looping to be doing on every KeyRelease, but I found the alternatives to be vastly more complicated, while not really improving the performance.

Alternative Solutions

I thought, let’s decrease the load. I’ve tested checking only when the user types markdown syntax such as asterisks and dashes, and validating any tag that has been edited (a key release within a tag's range). But there are so many variables to consider with the user's input. When text is pasted into the editor, for example, it is difficult to determine what effect certain syntax combinations could have on the markdown in the surrounding document, so all of it would need to be checked and validated.

Is there some better and more intuitive method to highlight markdown that I haven’t thought of yet? Is there a way to drastically speed up my original idea? Or are Python and Tkinter simply not able to do what I’m trying to do fast enough?

Thanks in advance.

Salvo answered 8/12, 2020 at 2:43 Comment(0)

If you would rather avoid an external library and keep the code simple, using re.finditer() seems faster than Text.search().

You can use a single regular expression to match all cases:

regexp = re.compile(r"((?P<delimiter>\*{1,3})[^*]+?(?P=delimiter)|(?P<delimiter2>\_{1,3})[^_]+?(?P=delimiter2))")

The length of the "delimiter" group gives you the tag and the span of the match gives you where to apply the tag.
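To see how the delimiter length maps to a tag, here is a minimal sketch that runs the same pattern on a plain string, outside Tkinter. The `classify` helper and the sample text are made up for illustration.

```python
import re

# Same pattern as above: 1-3 asterisks or underscores, closed by the same delimiter.
regexp = re.compile(
    r"((?P<delimiter>\*{1,3})[^*]+?(?P=delimiter)"
    r"|(?P<delimiter2>\_{1,3})[^_]+?(?P=delimiter2))"
)
tags = {1: "italic", 2: "bold", 3: "bold-italic"}


def classify(text):
    """Return (tag, start, end) for every match in text (hypothetical helper)."""
    found = []
    for match in regexp.finditer(text):
        # exactly one of the two named groups is set for any given match
        delim = match.group("delimiter") or match.group("delimiter2")
        start, end = match.span()
        found.append((tags[len(delim)], start, end))
    return found


print(classify("a *b* and **c** and ___d___"))
# [('italic', 2, 5), ('bold', 10, 15), ('bold-italic', 20, 27)]
```

The start and end offsets are character counts from the beginning of the string, which is why they can be appended to a Tk index as "+{start}c" and "+{end}c".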

Here is the code:

import re
import tkinter

root = tkinter.Tk()
root.title("Markdown Text Editor")
editor = tkinter.Text(root)
editor.pack()

# bind each key Release to the markdown checker function
editor.bind("<KeyRelease>", lambda event: check_markdown())

# configure markdown styles
editor.tag_config("bold", foreground="#FF0000") # red for debugging clarity
editor.tag_config("italic", foreground="#00FF00") # green for debugging clarity
editor.tag_config("bold-italic", foreground="#0000FF") # blue for debugging clarity

regexp = re.compile(r"((?P<delimiter>\*{1,3})[^*]+?(?P=delimiter)|(?P<delimiter2>\_{1,3})[^_]+?(?P=delimiter2))")
tags = {1: "italic", 2: "bold", 3: "bold-italic"}  # the length of the delimiter gives the tag


def check_markdown(start_index="insert linestart", end_index="insert lineend"):
    text = editor.get(start_index, end_index)
    # remove all tag instances
    for tag in tags.values():
        editor.tag_remove(tag, start_index, end_index)
    # loop through each match and add the corresponding tag
    for match in regexp.finditer(text):
        groupdict = match.groupdict()
        delim = groupdict["delimiter"] # * delimiter
        if delim is None:
            delim = groupdict["delimiter2"]  # _ delimiter
        start, end = match.span()
        editor.tag_add(tags[len(delim)], f"{start_index}+{start}c", f"{start_index}+{end}c")
    return

root.mainloop()

Note that check_markdown() only works if start_index and end_index are on the same line, otherwise you need to split the text and do the search line by line.
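For the multi-line case, one possible approach is to split the text yourself and build "line.column" indices directly. The sketch below assumes the text starts at column 0 of line `first_line`; `find_spans` is a hypothetical helper name, not part of Tkinter.

```python
import re

regexp = re.compile(
    r"((?P<delimiter>\*{1,3})[^*]+?(?P=delimiter)"
    r"|(?P<delimiter2>\_{1,3})[^_]+?(?P=delimiter2))"
)
tags = {1: "italic", 2: "bold", 3: "bold-italic"}


def find_spans(text, first_line=1):
    """Return (tag, start_index, end_index) tuples with Tk-style 'line.column' indices.

    Assumes `text` begins at column 0 of line `first_line` in the widget.
    """
    spans = []
    for offset, line in enumerate(text.split("\n")):
        lineno = first_line + offset
        # search each line separately so a match never crosses a newline
        for match in regexp.finditer(line):
            delim = match.group("delimiter") or match.group("delimiter2")
            start, end = match.span()
            spans.append((tags[len(delim)], f"{lineno}.{start}", f"{lineno}.{end}"))
    return spans


print(find_spans("*a*\nplain\n**b**"))
# [('italic', '1.0', '1.3'), ('bold', '3.0', '3.5')]
```

In the editor this could be used after pasting, for example by removing the three tags over the whole widget and then calling editor.tag_add(tag, start, end) for each tuple returned by find_spans(editor.get("1.0", "end-1c")).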

Jeffers answered 8/12, 2020 at 16:6 Comment(1)
Works very well, thanks so much! I had originally thought of taking the text out of the editor with Text.get() and running regular expressions on the string with the re module, but I wasn't sure how to translate the found indexes back into Tkinter's line.character format. I see now I could have used +*characters*c on the start, silly me! - Salvo

I don't know if this solution improves performance, but at least it improves the syntax highlighting.

The idea is to make pygments do the job for us, using pygments.lex(text, lexer) to parse the text, where lexer is pygments' lexer for Markdown syntax. This function returns an iterable of (token, text) pairs, so I use str(token) as a tag name, e.g. the tag "Token.Generic.Strong" corresponds to bold text. To avoid configuring the tags one by one, I use one of the predefined pygments styles, which I load with the load_style() function.

Unfortunately, pygments' markdown lexer does not recognize bold-italic so I define a custom Lexer class that extends pygments' one.

import tkinter
from pygments import lex
from pygments.lexers.markup import MarkdownLexer
from pygments.token import Generic
from pygments.lexer import bygroups
from pygments.styles import get_style_by_name


# add markup for bold-italic
class Lexer(MarkdownLexer):
    tokens = {key: val.copy() for key, val in MarkdownLexer.tokens.items()}
    # # bold-italic fenced by '***'
    tokens['inline'].insert(2, (r'(\*\*\*[^* \n][^*\n]*\*\*\*)',
                                bygroups(Generic.StrongEmph)))
    # # bold-italic fenced by '___'
    tokens['inline'].insert(2, (r'(\_\_\_[^_ \n][^_\n]*\_\_\_)',
                                bygroups(Generic.StrongEmph)))
    
def load_style(stylename):
    style = get_style_by_name(stylename)
    syntax_highlighting_tags = []
    for token, opts in style.list_styles():
        kwargs = {}
        fg = opts['color']
        bg = opts['bgcolor']
        if fg:
            kwargs['foreground'] = '#' + fg
        if bg:
            kwargs['background'] = '#' + bg
        font = ('Monospace', 10) + tuple(key for key in ('bold', 'italic') if opts[key])
        kwargs['font'] = font
        kwargs['underline'] = opts['underline']
        editor.tag_configure(str(token), **kwargs)
        syntax_highlighting_tags.append(str(token))
    editor.configure(bg=style.background_color,
                     fg=editor.tag_cget("Token.Text", "foreground"),
                     selectbackground=style.highlight_color)
    editor.tag_configure(str(Generic.StrongEmph), font=('Monospace', 10, 'bold', 'italic'))
    syntax_highlighting_tags.append(str(Generic.StrongEmph))
    return syntax_highlighting_tags    

def check_markdown(start='insert linestart', end='insert lineend'):
    data = editor.get(start, end)
    while data and data[0] == '\n':
        start = editor.index('%s+1c' % start)
        data = data[1:]
    editor.mark_set('range_start', start)
    # clear tags
    for t in syntax_highlighting_tags:
        editor.tag_remove(t, start, "range_start +%ic" % len(data))
    # parse text
    for token, content in lex(data, lexer):
        editor.mark_set("range_end", "range_start + %ic" % len(content))
        for t in token.split():
            editor.tag_add(str(t), "range_start", "range_end")
        editor.mark_set("range_start", "range_end")

root = tkinter.Tk()
root.title("Markdown Text Editor")
editor = tkinter.Text(root, font="Monospace 10")
editor.pack()

lexer = Lexer()
syntax_highlighting_tags = load_style("monokai")

# bind each key Release to the markdown checker function
editor.bind("<KeyRelease>", lambda event: check_markdown())

root.mainloop()

To improve performance, you can bind check_markdown() to only some keys or choose to apply the syntax highlighting only when the user changes line.
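Another option along the same lines is to debounce the handler, so that highlighting runs once after the user pauses typing rather than on every key release. This is only a sketch: the `Debouncer` class and the 150 ms delay are assumptions, and it relies solely on the standard after() and after_cancel() methods of Tkinter widgets.

```python
class Debouncer:
    """Run `callback` once, `delay_ms` after the most recent call to trigger().

    `widget` can be any object with Tkinter's after()/after_cancel() methods.
    """

    def __init__(self, widget, callback, delay_ms=150):
        self.widget = widget
        self.callback = callback
        self.delay_ms = delay_ms
        self._job = None  # id of the pending after() call, if any

    def trigger(self, event=None):
        # cancel the pending run, then schedule a fresh one
        if self._job is not None:
            self.widget.after_cancel(self._job)
        self._job = self.widget.after(self.delay_ms, self._run)

    def _run(self):
        self._job = None
        self.callback()
```

It could be wired up with editor.bind("<KeyRelease>", Debouncer(editor, check_markdown).trigger). Since check_markdown() defaults to the line under the cursor, edits made on other lines during the delay would not be rescanned, so a short delay is probably the safer choice.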

Jeffers answered 8/12, 2020 at 13:37 Comment(1)
It would have been so much better if there were nice documentation for pygments. - Narah
