How to find all occurrences of a substring?
M

33

586

Python has string.find() and string.rfind() to get the index of a substring in a string.

I'm wondering whether there is something like string.find_all() which can return all found indexes (not only the first from the beginning or the first from the end).

For example:

string = "test test test test"

print(string.find('test'))  # 0
print(string.rfind('test'))  # 15

# this is the goal
print(string.find_all('test'))  # [0, 5, 10, 15]

For counting the occurrences, see Count number of occurrences of a substring in a string.

Menial answered 12/1, 2011 at 2:35 Comment(2)
what should 'ttt'.find_all('tt') return?Patriarchate
It should return '0'. Of course, in a perfect world there would also be 'ttt'.rfind_all('tt'), which should return '1'.Menial
G
787

There is no simple built-in string function that does what you're looking for, but you could use the more powerful regular expressions:

import re
[m.start() for m in re.finditer('test', 'test test test test')]
#[0, 5, 10, 15]

If you want to find overlapping matches, lookahead will do that:

[m.start() for m in re.finditer('(?=tt)', 'ttt')]
#[0, 1]

If you want a reverse find-all without overlaps, you can combine positive and negative lookahead into an expression like this:

search = 'tt'
[m.start() for m in re.finditer('(?=%s)(?!.{1,%d}%s)' % (search, len(search)-1, search), 'ttt')]
#[1]

re.finditer returns an iterator, so you could change the [] in the above to () to get a generator expression instead of a list, which is more efficient if you only iterate through the results once.
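Because the substring is used as a regex pattern here, characters such as ( or * in it are interpreted as regex syntax. A minimal sketch that treats the substring as literal text with re.escape; the helper name find_all_regex and its overlapping parameter are hypothetical, chosen only for illustration:

```python
import re


def find_all_regex(haystack, needle, overlapping=False):
    """Return the start indexes of a literal needle (hypothetical helper)."""
    pattern = re.escape(needle)  # treat the needle as plain text, not regex syntax
    if overlapping:
        # A zero-width lookahead consumes nothing, so matches may overlap.
        pattern = "(?=" + pattern + ")"
    return [m.start() for m in re.finditer(pattern, haystack)]


print(find_all_regex("test test test test", "test"))  # [0, 5, 10, 15]
print(find_all_regex("ttt", "tt", overlapping=True))  # [0, 1]
print(find_all_regex("a(b)a(b)", "(b)"))              # [1, 5]
```

Without re.escape, the last call would search for the group (b) instead of the literal text "(b)".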

Gaskins answered 12/1, 2011 at 2:43 Comment(10)
hi, concerning this [m.start() for m in re.finditer('test', 'test test test test')], how can we look for test or text? Does it become much more complicated?Entasis
You want to look into regular expression in general : docs.python.org/2/howto/regex.html. The solution to your question will be : [m.start() for m in re.finditer('te[sx]t', 'text test text test')]Mchale
What will be the time complexity of using this method ?Brandy
@PranjalMittal. Upper or lower bound? Best, worst or average case?Ciccia
@marcog what if the substring contains parentheses or other special characters?Ephraimite
This method doesn't work with overlapping strings, e.g. when searching for "ACA" in string "ACACA", it will return only index 0. In case, someone wants a solution, here it is: https://mcmap.net/q/65734/-finding-multiple-occurrences-of-a-string-within-a-string-in-python-duplicate. Use the find method with index += 1.Cholecystotomy
I would recommend escaping the search strings as well, like this: [m.start() for m in re.finditer(re.escape(search_str), input_str)]Approval
Applied method to search for substrings in a text file, got: "error: nothing to repeat at position 0"Terni
I want overlapping matches. If the substring or string contains leading and trailing spaces, I get an error: expected string or bytes-like object. I need the spaces, because I don't want to match "really mean" to "really meaningful".Abbotsen
This doesn't work with multi-word subwords. For example, THIS SUB-WORD in this sentence Find THIS SUB-WORD in this sentence with THIS SUB-WORD .Imperfection
M
178
>>> help(str.find)
Help on method_descriptor:

find(...)
    S.find(sub [,start [,end]]) -> int

Thus, we can build it ourselves:

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub) # use start += 1 to find overlapping matches

list(find_all('spam spam spam spam', 'spam')) # [0, 5, 10, 15]

No temporary strings or regexes required.
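A variant sketch of the same generator, assuming you want to choose between overlapping and non-overlapping matches with a flag (the overlapping parameter is an addition, not part of the original function). It also advances by at least one character, so an empty substring cannot cause an endless loop:

```python
def find_all(a_str, sub, overlapping=False):
    """Yield the start index of each occurrence of sub in a_str."""
    # Step by 1 for overlapping matches; otherwise skip past the match.
    # max(..., 1) keeps the loop moving when sub is the empty string.
    step = 1 if overlapping else max(len(sub), 1)
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += step


print(list(find_all('spam spam spam spam', 'spam')))                 # [0, 5, 10, 15]
print(list(find_all('GATATATGCATATACTT', 'ATAT')))                   # [1, 9]
print(list(find_all('GATATATGCATATACTT', 'ATAT', overlapping=True))) # [1, 3, 9]
```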

Mcmahan answered 12/1, 2011 at 3:13 Comment(6)
To get overlapping matches, it should suffice to replace start += len(sub) with start += 1.Mcmahan
I believe your previous comment should be a postscript in your answer.Bachman
Your code does not work for finding substr: "ATAT" in "GATATATGCATATACTT"Curt
See the comment I made in addition. That is an example of an overlapping match.Mcmahan
To match the behaviour of re.findall, I'd recommend adding len(sub) or 1 instead of len(sub), otherwise this generator will never terminate on empty substring.Oneupmanship
Personally I think that a_str.find should be replaced with a_str.index so that return isn't needed.Spiry
S
87

Here's a (very inefficient) way to get all (i.e. even overlapping) matches:

>>> string = "test test test test"
>>> [i for i in range(len(string)) if string.startswith('test', i)]
[0, 5, 10, 15]

This solution also works for multi-word subwords.

s = "Find THIS SUB-WORD in this sentence with THIS SUB-WORD"
sub = "THIS SUB-WORD"
[i for i in range(len(s)) if s.startswith(sub, i)]
# [5, 41]
Slick answered 12/1, 2011 at 2:48 Comment(5)
If we want to check many characters by using 1 for loop how can it be done? with this code, I'll have many for loop and the order of time is too high.Incubation
@Slick Very smart way of performing the operation without use of re module. Thanks for the answer!Dorcas
I think I like this answer more since it doesn't need the re module.Carat
Thanks, it worked for me. The accepted answer doesn't work with multi-word subwords. For example, THIS SUB-WORD in this sentence Find THIS SUB-WORD in this sentence with THIS SUB-WORD.Imperfection
This is not very efficient. The time complexity is the "haystack" length multiplied by the "needle" length. The efficient algorithm does it in "haystack" plus "needle" length time.Amphidiploid
W
76

Use re.finditer:

import re
sentence = input("Give me a sentence ")
word = input("What word would you like to find ")
for match in re.finditer(word, sentence):
    print((match.start(), match.end()))

For word = "this" and sentence = "this is a sentence this this" this will yield the output:

(0, 4)
(19, 23)
(24, 28)
Wavemeter answered 3/2, 2016 at 19:1 Comment(2)
I think it's worth pointing out, that it works only for "non-overlapping matches", therefore won't work for: sentence="ababa" and word="aba"Osteology
This will fail if the word contains any characters that have a meaning in regexSubmissive
R
68

Again, old thread, but here's my solution using a generator and plain str.find.

def findall(p, s):
    '''Yields all the positions of
    the pattern p in the string s.'''
    i = s.find(p)
    while i != -1:
        yield i
        i = s.find(p, i+1)

Example

x = 'banananassantana'
[(i, x[i:i+2]) for i in findall('na', x)]

returns

[(2, 'na'), (4, 'na'), (6, 'na'), (14, 'na')]
Rigging answered 23/12, 2015 at 23:9 Comment(3)
this looks beautiful!Advice
tested and it is twice faster than the re.finditer solution: 310 ns ± 5.35 ns per loop for solution with str.find vs 799 ns ± 5.72 ns per loop for solution with re.finditer (on my machine). Confirms what I've noticed in the past: built-in string methods are generally faster than regex (same for nested str.replace vs re.sub)Reglet
Prettiest solution. Note that one can easily generalize by introducing optional parameter overlapping=True and replacing i+1 by i + (1 if overlapping else len(p)).Aquarelle
M
25

You can use re.finditer() for non-overlapping matches.

>>> import re
>>> aString = 'this is a string where the substring "is" is repeated several times'
>>> print([(a.start(), a.end()) for a in re.finditer('is', aString)])
[(2, 4), (5, 7), (38, 40), (42, 44)]

But it won't find overlapping matches:

In [1]: aString = "ababa"

In [2]: print([(a.start(), a.end()) for a in re.finditer('aba', aString)])
Output: [(0, 3)]
Mandrake answered 12/1, 2011 at 2:55 Comment(2)
Why make a list out of an iterator, it just slows the process.Irradiate
aString VS astring ;)Daberath
H
22

Come, let us recurse together.

def locations_of_substring(string, substring):
    """Return a list of locations of a substring."""

    substring_length = len(substring)    
    def recurse(locations_found, start):
        location = string.find(substring, start)
        if location != -1:
            return recurse(locations_found + [location], location+substring_length)
        else:
            return locations_found

    return recurse([], 0)

print(locations_of_substring('this is a test for finding this and this', 'this'))
# prints [0, 27, 36]

No need for regular expressions this way.

Hitherward answered 1/11, 2013 at 3:16 Comment(1)
This code has several problems. Since it's working on open-ended data sooner or later you'll bump into RecursionError if there are many enough occurrences. Another one are two throw-away lists it creates on each iteration just for the sake of appending one element, which is very suboptimal for a string finding function, which possibly could be called a lot of times. Although sometimes recursive functions seem elegant and clear, they should be taken with caution.Apus
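The recursion-depth concern raised in the comment above can be avoided with a plain loop. This is only a sketch of an iterative equivalent that keeps the same name and the same non-overlapping behaviour; it is not the answerer's code:

```python
def locations_of_substring(string, substring):
    """Return a list of non-overlapping locations of a substring (iterative sketch)."""
    locations = []
    # Advance past each match; max(..., 1) guards against an empty substring.
    step = max(len(substring), 1)
    start = string.find(substring)
    while start != -1:
        locations.append(start)  # one list, appended in place
        start = string.find(substring, start + step)
    return locations


print(locations_of_substring('this is a test for finding this and this', 'this'))
# prints [0, 27, 36]

# Thousands of matches are fine because no call stack is built up.
print(len(locations_of_substring('a' * 5000, 'a')))
# prints 5000
```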
B
13

If you're just looking for a single character, this would work:

from functools import reduce  # reduce is not a builtin in Python 3

string = "dooobiedoobiedoobie"
match = 'o'
reduce(lambda count, char: count + 1 if char == match else count, string, 0)
# produces 7

Also,

string = "test test test test"
match = "test"
len(string.split(match)) - 1
# produces 4

My hunch is that neither of these (especially #2) is terribly performant.

Brochu answered 24/9, 2014 at 21:12 Comment(1)
gr8 solution .. i am impressed with use of .. split()Anesthesia
N
12

This is an old thread, but I got interested and wanted to share my solution.

def find_all(a_string, sub):
    result = []
    k = 0
    while k < len(a_string):
        k = a_string.find(sub, k)
        if k == -1:
            return result
        else:
            result.append(k)
            k += 1 #change to k += len(sub) to not search overlapping results
    return result

It should return a list of positions where the substring was found. Please comment if you see an error or room for improvement.

Names answered 1/4, 2015 at 9:23 Comment(0)
A
9

This does the trick for me using re.finditer

import re

text = 'This is sample text to test if this pythonic '\
       'program can serve as an indexing platform for '\
       'finding words in a paragraph. It can give '\
       'values as to where the word is located with the '\
       'different examples as stated'

#  find all occurrences of the word 'as' in the above text

find_the_word = re.finditer('as', text)

for match in find_the_word:
    print('start {}, end {}, search string \'{}\''.
          format(match.start(), match.end(), match.group()))
Ashelman answered 6/7, 2018 at 9:34 Comment(0)
L
7

You can try :

import re
str1 = "This dress looks good; you have good taste in clothes."
substr = "good"
result = [_.start() for _ in re.finditer(substr, str1)]
# result = [17, 32]
Lxx answered 25/10, 2021 at 10:13 Comment(3)
This is no different from the accepted answer.Shaffer
@NatRiddle The presentation Mohammad wrote the answer in is a lot cleaner. This should be the accepted answer.Beitch
Regex is much heavier on CPU than the accepted answerNaquin
W
6

This thread is a little old but this worked for me:

numberString = "onetwothreefourfivesixseveneightninefiveten"
testString = "five"

marker = 0
while marker < len(numberString):
    try:
        # look up the next occurrence once and reuse the result
        marker = numberString.index(testString, marker)
        print(marker)
        marker += 1
    except ValueError:
        # index() raises ValueError once no further occurrence exists
        print("String not found")
        marker = len(numberString)
Whin answered 1/9, 2014 at 12:48 Comment(0)
G
6

You can try :

>>> string = "test test test test"
>>> for index in range(len(string)):
...     if string[index:index+(len("test"))] == "test":
...         print(index)

0
5
10
15
Grooved answered 27/2, 2018 at 6:44 Comment(0)
O
4
src = input() # we will find substring in this string
sub = input() # substring

res = []
pos = src.find(sub)
while pos != -1:
    res.append(pos)
    pos = src.find(sub, pos + 1)
Ojeda answered 16/5, 2020 at 17:5 Comment(1)
While this code may resolve the OP's issue, it is best to include an explanation as to how your code addresses the OP's issue. In this way, future visitors can learn from your post, and apply it to their own code. SO is not a coding service, but a resource for knowledge. Also, high quality, complete answers are more likely to be upvoted. These features, along with the requirement that all posts are self-contained, are some of the strengths of SO as a platform, that differentiates it from forums. You can edit to add additional info &/or to supplement your explanations with source documentationCorrey
E
3

When looking for a large number of keywords in a document, use flashtext:

from flashtext import KeywordProcessor
words = ['test', 'exam', 'quiz']
txt = 'this is a test'
kwp = KeywordProcessor()
kwp.add_keywords_from_list(words)
result = kwp.extract_keywords(txt, span_info=True)

Flashtext runs faster than regex on a large list of search words.

Erikerika answered 28/9, 2018 at 17:29 Comment(0)
T
3

This function does not examine every position in the string, so it does not waste compute resources. My attempt:

def findAll(string,word):
    all_positions=[]
    next_pos=-1
    while True:
        next_pos=string.find(word,next_pos+1)
        if(next_pos<0):
            break
        all_positions.append(next_pos)
    return all_positions

To use it, call it like this:

result=findAll('this word is a big word man how many words are there?','word')
Tinner answered 13/1, 2020 at 12:39 Comment(0)
S
3

I think the cleanest solution uses neither libraries nor yield:

def find_all_occurrences(string, sub):
    index_of_occurrences = []
    current_index = 0
    while True:
        current_index = string.find(sub, current_index)
        if current_index == -1:
            return index_of_occurrences
        else:
            index_of_occurrences.append(current_index)
            current_index += len(sub)

find_all_occurrences("test test test test", "test")  # [0, 5, 10, 15]

Note: find() method returns -1 when it can't find anything

Sung answered 13/10, 2022 at 20:6 Comment(0)
A
2

The solutions provided by others all rely on find() or other built-in methods.

What is the core basic algorithm to find all the occurrences of a substring in a string?

def find_all(string,substring):
    """
    Function: Returning all the index of substring in a string
    Arguments: String and the search string
    Return:Returning a list
    """
    length = len(substring)
    c=0
    indexes = []
    while c < len(string):
        if string[c:c+length] == substring:
            indexes.append(c)
        c=c+1
    return indexes

You can also inherit str class to new class and can use this function below.

class newstr(str):
    def find_all(string,substring):
        """
        Function: Returning all the index of substring in a string
        Arguments: String and the search string
        Return:Returning a list
        """
        length = len(substring)
        c=0
        indexes = []
        while c < len(string):
            if string[c:c+length] == substring:
                indexes.append(c)
            c=c+1
        return indexes

Calling the method

newstr.find_all('Do you find this answer helpful? then upvote this!','this')

Antidepressant answered 15/2, 2018 at 20:2 Comment(0)
H
2

This is the solution to a similar question from HackerRank. I hope it helps.

import re
a = input()
b = input()
if b not in a:
    print((-1,-1))
else:
    # collect the start index of every (possibly overlapping) match
    start_indc = [m.start() for m in re.finditer('(?=' + b + ')', a)]
    for i in range(len(start_indc)):
        print((start_indc[i], start_indc[i]+len(b)-1))

Output:

aaadaa
aa
(0, 1)
(1, 2)
(4, 5)
Halfdan answered 20/1, 2020 at 22:47 Comment(0)
S
2

If you want to do it without re (regex):

find_all = lambda _str,_w : [ i for i in range(len(_str)) if _str.startswith(_w,i) ]

string = "test test test test"
print( find_all(string, 'test') ) # >>> [0, 5, 10, 15]
Supererogation answered 5/11, 2021 at 8:38 Comment(0)
J
2

Here's a solution that I came up with, using assignment expression (new feature since Python 3.8):

string = "test test test test"
phrase = "test"
start = -1
result = [(start := string.find(phrase, start + 1)) for _ in range(string.count(phrase))]

Output:

[0, 5, 10, 15]
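One caveat: string.count counts non-overlapping occurrences, while the find call above steps by one character, so the two can disagree when matches overlap. A sketch that drops the count call and lets the assignment expression drive a while loop instead (the function name find_all_walrus is hypothetical; Python 3.8 or later is assumed):

```python
def find_all_walrus(string, phrase):
    """Collect every start index, including overlapping ones (hypothetical helper)."""
    result = []
    start = -1
    # Keep searching one character past the previous hit until find() gives -1.
    while (start := string.find(phrase, start + 1)) != -1:
        result.append(start)
    return result


print(find_all_walrus("test test test test", "test"))  # [0, 5, 10, 15]
print(find_all_walrus("ababa", "aba"))                 # [0, 2]
```

For "ababa" and "aba", the count-based comprehension would stop after the first index, because count reports 1.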
Janellajanelle answered 8/4, 2022 at 10:6 Comment(0)
H
1

The pythonic way would be:

mystring = 'Hello World, this should work!'
find_all = lambda c,s: [x for x in range(c.find(s), len(c)) if c[x] == s]

# s is the single character to search for
# c is the string to search in

find_all(mystring,'o')    # will return all positions of 'o'

[4, 7, 20, 26]
Hankins answered 10/4, 2018 at 19:40 Comment(2)
1) How does this help a question that was answered 7 years ago? 2) Using lambda this way is not Pythonic and goes against PEP8. 3) This doesn't provide the correct output for the OPs situationHyperbole
Pythonic does not mean "Use as much features of python as you can think of"Torr
R
1

If you only want to use numpy, here is a solution:

import numpy as np

S= "test test test test"
S2 = 'test'
inds = np.cumsum([len(k)+len(S2) for k in S.split(S2)[:-1]])- len(S2)
print(inds)

Revamp answered 10/6, 2021 at 16:46 Comment(0)
M
0

Please look at the code below:

#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''


def get_substring_indices(text, s):
    result = [i for i in range(len(text)) if text.startswith(s, i)]
    return result


if __name__ == '__main__':
    text = "How much wood would a wood chuck chuck if a wood chuck could chuck wood?"
    s = 'wood'
    print(get_substring_indices(text, s))
Mardis answered 16/3, 2017 at 1:14 Comment(1)
simple and best answer.Hendecasyllable
K
0
def find_index(string, let):
    enumerated = [place  for place, letter in enumerate(string) if letter == let]
    return enumerated

for example :

find_index("hey doode find d", "d") 

returns:

[4, 7, 13, 15]
Klaus answered 8/11, 2020 at 13:49 Comment(1)
Have you actually read the question? Try print(find_index('test test test test', 'test')) which is the example the op gave.Clothing
S
0

Not exactly what the OP asked, but you could also use the split function to get a list of the pieces between the occurrences of the substring. The OP didn't specify the end goal of the code, but if your goal is to remove the substrings anyway, this could be a simple one-liner. There are probably more efficient ways to do this with larger strings; regular expressions would be preferable in that case.

# Extract all non-substrings
s = "an-example-string"
s_no_dash = s.split('-')
# >>> s_no_dash
# ['an', 'example', 'string']

# Or extract and join them into a sentence
s_no_dash2 = ' '.join(s.split('-'))
# >>> s_no_dash2
# 'an example string'

Did a brief skim of other answers so apologies if this is already up there.

Slifka answered 19/5, 2021 at 13:43 Comment(0)
A
0
def count_substring(string, sub_string):
    c=0
    for i in range(0, len(string) + 1 - len(sub_string)):
        if string[i:i+len(sub_string)] == sub_string:
            c+=1
    return c

if __name__ == '__main__':
    string = input().strip()
    sub_string = input().strip()
    
    count = count_substring(string, sub_string)
    print(count)
Ammonate answered 2/6, 2021 at 3:24 Comment(2)
Replace line 3, with: for i in range(0,len(string)+1-len(sub_string)):Rhabdomancy
Yeah. Thanks for the correction.Ammonate
H
0

I ran into the same problem and did this:

hw = 'Hello oh World!'
list_hw = list(hw)
o_in_hw = []

while True:
    o = hw.find('o')
    if o != -1:
        o_in_hw.append(o)
        list_hw[o] = ' '
        hw = ''.join(list_hw)
    else:
        print(o_in_hw)
        break

I'm pretty new at coding, so you can probably simplify it (and if you plan to use it repeatedly, of course make it a function).

All in all, it works as intended for what I was doing.

Edit: Please note that this is for single characters only, and it changes your variable, so you have to copy the string into a new variable to preserve it. I didn't put that in the code because it's easy and the code is only meant to show how I made it work.

Highchair answered 25/6, 2021 at 20:18 Comment(0)
A
0

All the answers so far imply inefficient solutions that take O(n*m) time, where n is the "haystack" length and m is the "needle" length. Though I'm not sure whether it's true for the regular expression solution.

The problem can be solved in an O(n+m) time using a Knuth–Morris–Pratt algorithm version that doesn't stop after an occurrence is found:

# Not necessary to comprehend, just copy to your code
def findAll(haystack: str, needle: str):
    n = len(haystack)
    m = len(needle)
    # Key - needle prefix length,
    # Value - the length of the longest other needle prefix that is also a suffix of this prefix.
    longestPrefixSuffix = [0] * m
    length = 0
    suffixEnd = 1 # Last index

    while suffixEnd < m - 1:
        if needle[length] == needle[suffixEnd]:
            length += 1
            suffixEnd += 1
            longestPrefixSuffix[suffixEnd] = length
        elif length > 0:
            # Since needle[0:length] == needle[suffixEnd-length:suffixEnd],
            # needle[0:longestPrefixSuffix[length]] == needle[suffixEnd-longestPrefixSuffix[length]:suffixEnd]
            length = longestPrefixSuffix[length]
            # Try to continue the equal substrings with the shorter prefix
        else:
            suffixEnd += 1
    
    i = 0 # haystack index
    j = 0 # needle index

    while i <= n - m:
        if haystack[i + j] == needle[j]:
            if j + 1 < m:
                j += 1
                continue
            yield i
        if j > 0:
            # Move i to the end of the compared region,
            # unless a part of needle is a prefix of needle
            i = i + j - longestPrefixSuffix[j]
            j = longestPrefixSuffix[j]
        else:
            i += 1
            j = 0

print(list(findAll("test test test test", "test"))) # [0, 5, 10, 15]

The built-in find method in CPython uses a different optimized search (a Boyer-Moore-Horspool variant, not KMP). I wish a findAll function were built in as well.

Amphidiploid answered 24/12, 2023 at 4:1 Comment(0)
Y
-1

By slicing we find all the combinations possible and append them in a list and find the number of times it occurs using count function

s=input()
n=len(s)
l=[]
f=input()
print(s[0])
for i in range(0,n):
    for j in range(1,n+1):
        l.append(s[i:j])
if f in l:
    print(l.count(f))
Yulan answered 30/7, 2019 at 11:44 Comment(2)
When s="test test test test" and f="test" your code prints 4, but OP expected [0,5,10,15]Osteen
Have written for a single word will update the codeYulan
L
-1

To count the occurrences of each character in a given string and return them as a dictionary, e.g. hello gives {'h':1, 'e':1, 'l':2, 'o':1}:

def count(string):
   result = {}
   if(string):
     for i in string:
       result[i] = string.count(i)
     return result
   return {}

Or you can do it like this:

from collections import Counter

def count(string):
    return Counter(string)
Litterbug answered 30/4, 2022 at 8:0 Comment(0)
N
-1

Try this; it worked for me!

x=input('enter the string')
y=input('enter the substring')
z,r=x.find(y),x.rfind(y)
while z!=r:
        print(z,r,end=' ')
        z=z+len(y)
        r=r-len(y)
        z,r=x.find(y,z,r),x.rfind(y,z,r)
Navarino answered 9/6, 2022 at 13:17 Comment(0)
D
-3

You can easily use:

string.count('test')

https://www.programiz.com/python-programming/methods/string/count

Cheers!

Disgorge answered 1/12, 2018 at 19:9 Comment(3)
this should be the answerSuisse
The string count() method returns the number of occurrences of a substring in the given string. Not their location.Felske
This doesn't satisfy all cases: with s = 'banana' and sub = 'ana', sub occurs twice, but s.count('ana') would return 1.Ilka
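The difference pointed out in the last comment can be checked directly. A small sketch, assuming the regex lookahead approach from the accepted answer for the overlapping case:

```python
import re

s = "banana"
sub = "ana"

# str.count only counts non-overlapping occurrences and gives no positions.
non_overlapping_count = s.count(sub)

# A zero-width lookahead finds every start position, overlapping or not.
overlapping_positions = [m.start() for m in re.finditer("(?=" + re.escape(sub) + ")", s)]

print(non_overlapping_count)   # 1
print(overlapping_positions)   # [1, 3]
```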
