I'm trying to make my program print the index of each word and punctuation mark, as it occurs, from a text file. That part works. The problem comes when I try to recreate the original text, punctuation included, from those index positions. Here is my code:
with open('newfiles.txt') as f:
    s = f.read()
import re
#Splitting string into a list using regex and a capturing group:
matches = [x.strip() for x in re.split("([a-zA-Z]+)", s) if x not in ['',' ']]
print (matches)
d = {}
i = 1
list_with_positions = []
# the dictionary entries:
for match in matches:
    if match not in d.keys():
        d[match] = i
        i+=1
    list_with_positions.append(d[match])
print (list_with_positions)
file = open("newfiletwo.txt","w")
file.write (''.join(str(e) for e in list_with_positions))
file.close()
file = open("newfilethree.txt","w")
file.write(''.join(matches))
file.close()
word_base = None
with open('newfilethree.txt', 'rt') as f_base:
    word_base = [None] + [z.strip() for z in f_base.read().split()]
sentence_seq = None
with open('newfiletwo.txt', 'rt') as f_select:
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
print(' '.join(sentence_seq))
As I said, the first part works fine, but then I get this error:
Traceback (most recent call last):
  File "E:\Python\Indexes.py", line 33, in <module>
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
  File "E:\Python\Indexes.py", line 33, in <listcomp>
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
IndexError: cannot fit 'int' into an index-sized integer
This error occurs when the program builds sentence_seq towards the bottom of the code.
newfiles.txt is the original text file: a random article with more than one sentence, including punctuation.
list_with_positions is the list of positions at which each word occurs within the original text.
matches is meant to hold the separated DIFFERENT words: if words repeat in the file (which they do), matches should contain only the distinct words.
Does anyone know why I get the error?
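For reference, the traceback can be reproduced without any input file. The following is only a minimal sketch: the `positions` list and the three-item `word_base` are made-up stand-ins for the real data, chosen so that the joined string is 80 digits long.

```python
# Hypothetical stand-in for list_with_positions: 80 single-digit positions.
positions = [1, 2, 3, 1, 4] * 16

# Joining with '' fuses every position into one 80-character digit string.
fused = ''.join(str(p) for p in positions)

# There is no whitespace to split on, so split() returns a single token.
tokens = fused.split()
print(len(tokens))

# int() will build an 80-digit integer, but a list cannot be indexed by it.
word_base = [None, 'some', 'words']  # made-up contents
try:
    word_base[int(tokens[0])]
except IndexError as exc:
    message = str(exc)
    print(message)
```

On CPython 3 the printed message should match the last line of the traceback above.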
int must be too big for array indexing: probable duplicate of #4752225 (not closing the question yet) – Kirkham
file.write (''.join(str(e) for e in list_with_positions)) writes the data with no spaces, such that when you read it back in, your split() does nothing and actually you're trying to index by an 80-digit number. – Ingmar
sentence_seq = [word_base[int(i)].strip() for i in f_select.read().split()]. I won't write as an answer yet because I can't test any of this – Ingmar
file.write(''.join(matches)) where you join words with no separation. What happens if you change that to file.write(' '.join(matches))? Really, I might be reaching my limit to what I can suggest without a test case here. – Ingmar
with open('newfiles.txt') as f:. I don't have newfiles.txt, that's on your computer. There is the idea of an MCVE here, so that people can replicate the issue easily. I don't know what your file contains, so I don't know if any test case I create is accurate to what you're using and if I can't be assured I can recreate the issue, it's wasted effort on my part to end up giving false advice. It always helps to try pinpoint the issue you have and make it easily reproducible :) – Ingmar
Welcome to Stack Overflow. It's fine that you didn't quite create an MCVE on your first question as otherwise it's quite interesting.
then I'm set? :) – Ingmar
join(). Do you need to stick with your current format? – Ingmar
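One possible way to make the round trip work, offered as a sketch rather than a tested fix for the original file: write both lists with a space between items so that split() can separate them again, and store the distinct tokens (the keys of d in the question's code) rather than matches, which still holds every occurrence. The helper names `encode` and `decode`, the sample sentence, and the `re.findall` tokenizer are all assumptions made for this example; the tokenizer is swapped in for the question's `re.split` so that no token contains whitespace.

```python
import re

def encode(text):
    """Return (unique_tokens, positions); positions are 1-based, as in the question."""
    # Assumed tokenizer: runs of letters, or runs of non-letter, non-space characters.
    tokens = re.findall(r"[a-zA-Z]+|[^a-zA-Z\s]+", text)
    first_seen = {}
    positions = []
    for token in tokens:
        if token not in first_seen:
            first_seen[token] = len(first_seen) + 1
        positions.append(first_seen[token])
    # Dicts keep insertion order (Python 3.7+), so keys come out in first-seen order.
    return list(first_seen), positions

def decode(unique_tokens, positions):
    """Rebuild the token sequence from the distinct tokens and their positions."""
    word_base = [None] + unique_tokens  # pad so that position 1 maps to index 1
    return ' '.join(word_base[p] for p in positions)

text = "the cat sat, the dog sat."  # made-up sample input
unique_tokens, positions = encode(text)

# Serialise with a space between items; these strings are what would go in the files.
positions_line = ' '.join(str(p) for p in positions)
tokens_line = ' '.join(unique_tokens)

# Reading back: split() now recovers every item individually.
restored = decode(tokens_line.split(), [int(p) for p in positions_line.split()])
print(restored)  # the cat sat , the dog sat .
```

Note that the output is not byte-for-byte the original text: joining with a single space puts a space before each punctuation mark and discards the original line breaks. Restoring those exactly would need the whitespace to be recorded as well.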