How can I find the position of the list of substrings from the string?
Asked Answered
A

4

5

How can I find the position of the list of substrings from the string?

Given a string:

"The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."

And a list of substring:

['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']

Desired output:

>>> s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
>>> tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
>>> find_offsets(tokens, s)
[(0, 3), (4, 9), (9, 10), (11, 16), (17, 20), (21, 23), (24, 34),
        (34, 35), (36, 43), (44, 46), (47, 52), (52, 54), (55, 60), (61, 67),
        (68, 72), (73, 75), (76, 83), (84, 89), (90, 98), (99, 103), (104, 109),
        (110, 119), (120, 122), (123, 131), (131, 132)]

Explanation of the output, the first substring "The" can be found using the (start, end) index by using the string s. So from the desired output.

So if we loop through all the tuples of integers from the desired output we'll get back the list of substrings, i.e.

>>> [s[start:end] for start, end in out]
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']

I've tried:

def find_offset(tokens, s):
    index = 0
    offsets = []
    for token in tokens:
        start = s[index:].index(token) + index
        index = start + len(token)
        offsets.append((start, index))
    return offsets

Is there another way to find the position of the list of substrings from the string?

Antheridium answered 4/5, 2017 at 4:34 Comment(0)
D
1

If we have no idea about the substrings, there's no way except to scan the whole text anew for each of them.

If, as it seems from data, we know that these are sequential fragments of the text, given in the textual order, it's easy to only scan the rest of the text after each match. There's no point to cut the text every time, though.

def spans(text, fragments):
    result = []
    point = 0  # Where we're in the text.
    for fragment in fragments:
        found_start = text.index(fragment, point)
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
        point = found_end
    return result

Test:

>>> spans('foo in bar', ['foo', 'in', 'bar'])
[(0, 3), (4, 6), (7, 10)]

This assumes that every fragment is present in the text at the right place. Your output format does not provide an example of mismatch reporting. Using .find instead of .index could help that, though only partly.

Deanndeanna answered 4/5, 2017 at 4:52 Comment(0)
G
5

First solution:

#use list comprehension and list.index function.
[tuple((s.index(e),s.index(e)+len(e))) for e in t]

Second solution to correct the issues in the first solution:

def find_offsets(tokens, s):
    tid = [list(e) for e in tokens]
    i = 0
    for id_token,token in enumerate(tid):
        while (token[0]!=s[i]):            
            i+=1
        tid[id_token] = tuple((i,i+len(token)))
        i+=len(token)

    return tid


find_offsets(tokens, s)
Out[201]: 
[(0, 3),
 (4, 9),
 (9, 10),
 (11, 16),
 (17, 20),
 (21, 23),
 (24, 34),
 (34, 35),
 (36, 43),
 (44, 46),
 (47, 52),
 (52, 54),
 (55, 60),
 (61, 67),
 (68, 72),
 (73, 75),
 (76, 83),
 (84, 89),
 (90, 98),
 (99, 103),
 (104, 109),
 (110, 119),
 (120, 122),
 (123, 131),
 (131, 132)]   

#another test
s = 'The plane, plane'
t = ['The', 'plane', ',', 'plane']
find_offsets(t,s)
Out[212]: [(0, 3), (4, 9), (9, 10), (11, 16)]
Godfree answered 4/5, 2017 at 4:46 Comment(3)
Nicely short but also delightfully inefficient, calling .index() twice.Deanndeanna
Also, if there are repeated words, this won't work. .index() would always be fetching only the first instance =(Antheridium
Try s = 'The plane, plane'; t = ['The', 'plane', ',', 'plane']Antheridium
H
3
import re

s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']


for token in tokens:
  pattern = re.compile(re.escape(token))
  print(pattern.search(s).span())

RESULT

(0, 3)
(4, 9)
(9, 10)
(11, 16)
(17, 20)
(21, 23)
(24, 34)
(9, 10)
(36, 43)
(44, 46)
(47, 52)
(52, 54)
(55, 60)
(61, 67)
(68, 72)
(73, 75)
(76, 83)
(84, 89)
(90, 98)
(99, 103)
(104, 109)
(110, 119)
(120, 122)
(123, 131)
(131, 132)
Hobard answered 4/5, 2017 at 17:56 Comment(0)
D
1

If we have no idea about the substrings, there's no way except to scan the whole text anew for each of them.

If, as it seems from data, we know that these are sequential fragments of the text, given in the textual order, it's easy to only scan the rest of the text after each match. There's no point to cut the text every time, though.

def spans(text, fragments):
    result = []
    point = 0  # Where we're in the text.
    for fragment in fragments:
        found_start = text.index(fragment, point)
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
        point = found_end
    return result

Test:

>>> spans('foo in bar', ['foo', 'in', 'bar'])
[(0, 3), (4, 6), (7, 10)]

This assumes that every fragment is present in the text at the right place. Your output format does not provide an example of mismatch reporting. Using .find instead of .index could help that, though only partly.

Deanndeanna answered 4/5, 2017 at 4:52 Comment(0)
B
0
def spans2(text, fragments):
    result = []
    for fragment in fragments:
        found_start = text.index(fragment)
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
    return(result)

Test:

txt = "foo man bit"
frag = ["bit","man"]

spans2(txt,frag)

[(8, 11), (4, 7)]
Balk answered 10/3, 2023 at 18:35 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Sakhuja

© 2022 - 2025 — McMap. All rights reserved.