Best way to save lots of position pairs in Emacs Lisp

Asked 18/9, 2014 at 14:47 Answered 19/9, 2014 at 1:26

I am in the middle of transforming a usual full-restart tokenizer (ported from the original language parser, the language is irrelevant here) into a more advanced incremental one. This implies the following:

a) it has to be fast, very fast;

b) upon every text update (be it insertion or deletion) it has to find damaged tokens and repair the token list accordingly.

The original tokenizer version just builds a list of tokens while going through the buffer text using regexps; every token in the list is a vector of 4 elements (['TOKENTYPE "token-lexeme" linum charnum]). Linum/charnum are plain numbers specifying token position in the buffer at the moment lexing was done. Easy pie.

Now, to the point. Whenever (well.. not every time, but often enough) a user adds or deletes a character (or a string) the new tokenizer has to find a token built using the text in a change position, and, probably, dependant tokens for later deletions/updates.

There are two problems here:

a) token positions should be dynamic (i.e. if a user adds some text in the beginning of a buffer -> we shouldn't bother fixing tokens in the buffer end);

b) a way to get to a damaged token(or lots of tokens in general).

Right now I am trying to use overlays for the task as overlays have a nice interface suiting my needs: overlays-at/overlays-in functions help the search; and overlay start/end move in an apropriate way.

I can happily do that for a smaller file. But it turns out (and I have to admit that I was warned by the docs) that the solution does not scale: even an average 1K LOC file can have CONST * LOC overlays, which is just too much for Emacs.

That was my first try, and it wasn't a successful one. I am considering alternatives such as:

1) managing a handwritten token search tree using plain numbers;

2) same tree, but using markers;

3) some kind of a mixed approach including both plain numbers and markers.

Any alternatives to mentioned approaches? Or maybe there's a better way to handle lots of overlays?

Nonlinearity answered 18/9, 2014 at 14:47 Comment(2)

Related: There are Wisent and semantic in emacs. Probably, you already have checked those ones. There is also some highlighting support e.g., `gnu.org/software/emacs/manual/html_mono/…. – Gunpoint 18/9, 2014 at 15:39

@Gunpoint Yes, of course, I did check them out. Lots of times, actually, even signed the FSF papers and sent a micropatch or two for Java-related code. This particular tokenizer is meant to work in a different way (hopefully, superior :-) ) and is more of an experiment to replace the font-lock-mode. – Nonlinearity 18/9, 2014 at 16:16

Like Lindydancer, I'd suggest you use text-properties rather than overlays. Overlays scale like O(N^2) whereas text-properties scale like O(N log N), so it works much better. I wouldn't use font-lock for any of it, tho.

Of course, an alternative solution is to fix overlays: the C code could be changed to make it O(N log N). I've known how to do that for a while now, but haven't found the time (and am unlikely to find the time in any foreseeable future), so if someone's interested I'd be very happy to help him do it.

Houseline answered 19/9, 2014 at 1:26 Comment(4)

In the end of the day I do not need the text itself, I just need positions and position-related overlay functions. It turns out that using text properties here would be an unnecessary complication... I'll go with some sort of a complex approach with overlays, such as using a single overlay for a group of tokens. You mentioned overlay perfomance improving... I am interested in doing this and getting into Emacs C code hacking in general. That is, if noone is in a hurry (I'm certainly not). How can we cooperate on this one? – Nonlinearity 19/9, 2014 at 6:28

by the way, do you happen to know anything about marker implementation? Are they as potentially problematic as overlays are? – Nonlinearity 19/9, 2014 at 10:15

Markers have performance issues as well, yes (it's part of the reason for overlays being slow, since every overlay internally uses 2 markers). If you're interested in hacking on overlays, then come over to emacs-devel stating your interest in working on this issue and we'll take it from there. – Houseline 19/9, 2014 at 12:52

There was some recent discussion about this on emacs-devel; it would be really nice to have the performance issues fixed :) – Baptiste 21/11, 2016 at 17:30

An alternative to overlays are text properties, they are attached to the text in a way that overlays aren't, so they are much more efficient.

A package that use text properties heavily is font-lock. Normally, it's used for highlighting buffers, but there is nothing that prevents you from piggy-backing on it for your own purposes. That way you will get the entire system for detecting that the user has modified the content of the buffer for free.

In your case, you could replace the regexp in a font-lock keyword with a function that will be called with a search limit. All you would need to do is to scan the relative short section, set your own text properties and you are done. (Also, you must inform font-lock which property you are setting using font-lock-extra-managed-props.)

Zedoary answered 18/9, 2014 at 15:5 Comment(6)

The problem here is that I want to avoid font-lock-mode and replace it with the incremental tokenizing system which would do the syntax highlighting in a more granular (compared to regexps font-lock uses) way. Actually, I already checked font-locking code and it's not too smart anyway. As much as I understood from a quick study it just rescans the buffer starting with the change point (avoiding couple of corner cases). – Nonlinearity 18/9, 2014 at 15:28

Did not think about text properties in such a way... Okay, I can attach a token reference to a piece of text. Another problem arises here. How to find deleted tokens? I.e. tokens that are not pointed to by any text? With overlays that was rather easy. – Nonlinearity 18/9, 2014 at 15:32

Besides performance, the question of whether to use overlay properties or text properties involves this: Do you want to associate the properties and their values with characters in the buffer or buffer positions. If the former, then text properties are a natural choice; if the latter, overlays are. (Your question title suggests that it is buffer positions that this is about, but maybe that's only because your first implementation is based on overlays.) – Palaeobotany 18/9, 2014 at 16:21

FWIW, I'm not sure why font-lock was suggested in this context. You can certainly use text properties (and overlays) without `font-lock-mode'. – Palaeobotany 18/9, 2014 at 16:30

@Palaeobotany I do not need the characters, I get the lexeme (characters) after finding the damages. The overlays (or whatever mechanics I will choose) are just a way to identify the damaged tokens. After the edits all I have is a point (or a region) where the changes happened. This point has to be used to get related tokens. In overlays I can just do "overlays-at". – Nonlinearity 18/9, 2014 at 16:50

In the general, and typical, Emacs case, text in a buffer can change (move, be replaced, etc.). It is especially in this context that one is sometimes interested in the characters (the text) and not buffer positions. If that does not apply in your case, and you care only about buffer postions, not the text currently at those positions, then overlays are a natural choice. But as you have noted, they can be costly. – Palaeobotany 18/9, 2014 at 17:2

Recommended topics

Hot tags