IndexError: cannot fit 'int' into an index-sized integer
I'm trying to make my program print the index of each word and punctuation mark as it occurs in a text file. That part works. The problem comes when I try to recreate the original text, with punctuation, from those index positions. Here is my code:

with open('newfiles.txt') as f:
    s = f.read()
import re
#Splitting string into a list using regex and a capturing group:
matches = [x.strip() for x in re.split("([a-zA-Z]+)", s) if x not in ['',' ']]
print (matches)
d = {} 
i = 1
list_with_positions = []
# the dictionary entries:
for match in matches:
    if match not in d.keys():
        d[match] = i
        i+=1
    list_with_positions.append(d[match])

print (list_with_positions)
file = open("newfiletwo.txt","w")
file.write (''.join(str(e) for e in list_with_positions))
file.close()
file = open("newfilethree.txt","w")
file.write(''.join(matches))
file.close()
word_base = None
with open('newfilethree.txt', 'rt') as f_base:
    word_base = [None] + [z.strip() for z in f_base.read().split()]

sentence_seq = None
with open('newfiletwo.txt', 'rt') as f_select:
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]

print(' '.join(sentence_seq))

As I said, the first part works fine, but then I get this error:

Traceback (most recent call last):
    File "E:\Python\Indexes.py", line 33, in <module>
       sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
    File "E:\Python\Indexes.py", line 33, in <listcomp>
       sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
IndexError: cannot fit 'int' into an index-sized integer

This error occurs when the program builds 'sentence_seq' towards the bottom of the code.

newfiles is the original text file: a random article with more than one sentence, with punctuation.

list_with_positions is the list of positions at which each word occurs within the original text.

matches is the separated DIFFERENT words: if words repeat in the file (which they do), matches should have only the different words.

Does anyone know why I get the error?

Firebrick answered 19/1, 2017 at 17:13 Comment(15)
your int must be too big for array indexing: probable duplicate of #4752225 (not closing the question yet)Kirkham
@Jean-FrançoisFabre Indeed because we are replacing each word in the text file for integers (it's indexes) - probably around 60-80 words. So, does that mean the only way to overcome this is to use a shorter text file?Firebrick
Stab in the dark here. file.write (''.join(str(e) for e in list_with_positions)) writes the data with no spaces, such that when you read it back in, your split() does nothing and actually you're trying to index by an 80-digit number.Ingmar
@Ingmar Wow, that did solve a lot of the problem, but the final output comes out as " They say it ' s a dog ' s life " instead of "They say it's a dog's life". Is it a whitespace error around the punctuation? This happens for full stops too. I guess all the punctuation gets treated like words because of the way I split the original file. Do you know any way to avoid the unnecessary space around punctuation (you need whitespace after a full stop but not before, etc.)?Firebrick
In that case, try sentence_seq = [word_base[int(i)].strip() for i in f_select.read().split()]. I won't write as an answer yet because I can't test any of thisIngmar
@Ingmar Unfortunately there is no difference. BTW, I just noticed there are random letter 's' characters in the final output, which leaves me totally confused. Here is what it outputs: "They say it ' s a dog ' s ' s life , but s for Estrella". Note the unnecessary 's' letters in the output.Firebrick
This is getting tough for me to visualise; without data it's difficult for anyone to keep track in the debugging. But again you do have file.write(''.join(matches)) where you join words with no separation. What happens if you change that to file.write(' '.join(matches))? Really, I might be reaching my limit to what I can suggest without a test case here.Ingmar
@Ingmar Genuine question, just curious, not being rude: are you not on a computer? Why can't you test it? Is there anything wrong with the code? And I did add the separator to the matches file write when you suggested doing it for the other one, so no luck so far.Firebrick
The first line of your code: with open('newfiles.txt') as f:. I don't have newfiles.txt, that's on your computer. There is the idea of an MCVE here, so that people can replicate the issue easily. I don't know what your file contains, so I don't know if any test case I create is accurate to what you're using and if I can't be assured I can recreate the issue, it's wasted effort on my part to end up giving false advice. It always helps to try pinpoint the issue you have and make it easily reproducible :)Ingmar
Oh, sorry for being stupid - newfiles is just a random article with more than one sentence with punctuation. That's all that matters within the context of the question - just saying in case you are bothered enough :-) I've changed the question - thx for letting me knowFirebrick
So if I create a file containing Welcome to Stack Overflow. It's fine that you didn't quite create an MCVE on your first question as otherwise it's quite interesting. then I'm set? :)Ingmar
Hopefully the last question. I'm really trying to stick with your current code but I'm finding it tough. The issue here is that punctuation cannot be included in the join(). Do you need to stick with your current format?Ingmar
No, not necessarily as long as it's along the same lines and does the requirementsFirebrick
@Jean-FrançoisFabre first dibs here; I've identified the problem in my answer but I don't like my solution. Is there a cleaner way? OP may/may not accept as answer but I will upvote if you find a better way.Ingmar
@Ingmar see my improvement suggestions. I don't want to post a slightly better solution plagarizing yours while you did all the legwork with the OP. Lower part can be improved, upper part cannot with listcomps because you generate 2 lists.Kirkham

The issue with your approach is using ''.join(), as this joins everything with no spaces. The immediate problem is that you then split() what is effectively one long run of digits with no spaces, so you get back a single value with 100+ digits. Converting that to an int produces a number far too large to use as a list index, which is what raises the error. A further problem is that indices go into double digits and beyond; split() has no way to tell where one number ends and the next begins when they are joined without spaces.
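As a minimal sketch of that failure, assuming a made-up list of positions and a tiny stand-in word_base (neither is the asker's real data), joining with '' and joining with ' ' behave very differently when read back:

```python
# Hypothetical positions, including double-digit values.
positions = [1, 2, 3, 10, 11, 12] * 5

joined_without_spaces = ''.join(str(p) for p in positions)
joined_with_spaces = ' '.join(str(p) for p in positions)

# With no separator, split() finds a single token: one long run of digits.
tokens_without_spaces = joined_without_spaces.split()

# With a space separator, the original numbers survive the round trip.
round_trip = [int(p) for p in joined_with_spaces.split()]

# Indexing a list with that one enormous integer raises the IndexError
# from the question (stand-in word list, purely for illustration).
word_base = [None, 'They', 'say']
error_message = None
try:
    word_base[int(tokens_without_spaces[0])]
except IndexError as exc:
    error_message = str(exc)

print(len(tokens_without_spaces), error_message)
```

The only difference between the two writes is the separator string passed to join().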

Beyond that, punctuation isn't handled properly. ' '.join() is equally unsuitable for reconstructing a sentence, because commas, full stops etc. end up with whitespace on either side.
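One possible way around the spacing problem, sketched here as an assumption and not as what the code in this answer does, is to keep whitespace runs as tokens in their own right. Then a plain ''.join() puts everything back exactly where it was. The tokenize helper and its regex are hypothetical:

```python
import re

def tokenize(text):
    """Split text into letter runs, whitespace runs, and single other characters."""
    # Every character matches exactly one alternative, so nothing is dropped.
    return re.findall(r"[A-Za-z]+|\s+|[^A-Za-z\s]", text)

text = "They say it's a dog's life, but for Estrella."
tokens = tokenize(text)
print(tokens)

# Because the spaces are tokens too, no separator is needed when rejoining.
rebuilt = ''.join(tokens)
print(rebuilt == text)
```

The trade-off is that a space becomes an entry in the token list like any word, so the list of positions gets longer.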

I tried my best to stick with your current code/approach (I don't think there's huge value in changing the entire approach when trying to understand where an issue comes from), but it still feels shaky to me. I dropped the regex, though perhaps it was needed. I'm not immediately aware of a library for doing this kind of thing, but almost certainly there must be a better way.

import string

punctuation_list = set(string.punctuation) # Has to be treated differently

word_base = []
index_dict = {}
with open('newfiles.txt', 'r') as infile:
    raw_data = infile.read().split()
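    for index, item in enumerate(raw_data):
        # Note: a repeated word overwrites its earlier entry, so the
        # dictionary ends up holding the index of the LAST occurrence.
        index_dict[item] = index
        word_base.append(item)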

with open('newfiletwo.txt', 'w') as outfile1, open('newfilethree.txt', 'w') as outfile2:
    for item in word_base:
        outfile1.write(str(item) + ' ')
        outfile2.write(str(index_dict[item]) + ' ')

reconstructed = ''
with open('newfiletwo.txt', 'r') as infile1, open('newfilethree.txt', 'r') as infile2:
    indices = infile1.read().split()
    words = infile2.read().split()
    reconstructed = ''.join([item + ' ' if item in punctuation_list else ' ' + item + ' ' for item in word_base])
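For comparison, here is a sketch of the whole round trip that records the position of the first occurrence of each token, as the question's own loop does. It assumes whitespace is kept as a token, and the names encode, decode and positions_line are hypothetical. It works on in-memory strings; saving unique_tokens to a file would need a format that preserves the whitespace tokens (json, for instance).

```python
import re

def encode(text):
    """Return (unique_tokens, positions) using 1-based first-occurrence positions."""
    tokens = re.findall(r"[A-Za-z]+|\s+|[^A-Za-z\s]", text)
    first_seen = {}
    unique_tokens = []
    positions = []
    for token in tokens:
        if token not in first_seen:
            # Only assign a position the first time a token appears.
            unique_tokens.append(token)
            first_seen[token] = len(unique_tokens)
        positions.append(first_seen[token])
    return unique_tokens, positions

def decode(unique_tokens, positions_line):
    """Rebuild the text from a space-separated line of 1-based positions."""
    return ''.join(unique_tokens[int(p) - 1] for p in positions_line.split())

text = "They say it's a dog's life"
unique_tokens, positions = encode(text)

# Stands in for the contents of the positions file; note the ' ' separator.
positions_line = ' '.join(str(p) for p in positions)
print(decode(unique_tokens, positions_line) == text)
```

Checking `if token not in first_seen` before assigning is what keeps the first position instead of overwriting it with the last.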
Ingmar answered 19/1, 2017 at 19:11 Comment(12)
Thx a lot man, thx for your time & skills it really really helps. :-)Firebrick
@TheWorldIn5 you're very welcome, this felt like it became much more difficult than it should have been; in your broader goals I would think there is a library for this. Dealing with punctuation is always going to be an issue otherwise. I wonder if NLTK can help.Ingmar
@TheWorldIn5 again, you're welcome :) I think I've missed the mark on this one, feel free to un-accept my answer, which leaves it open... this means that more people might answer. I am interested in an answer myself for more general things. My answer pinpoints the main issue but I don't like how it deals with it; we can both learn from thisIngmar
How come, when we print 'words', we get a seemingly random list of numbers instead of the position of each word in the file? I checked manually and the integers in the output do not represent the occurrence of each word in the text. In case you want to know what my text actually says, here it is: "They say it's a dog's life, but for Estrella, born without her front legs, she's adapted to more of a kangaroo way of living. The Peruvian mutt hasn't let her disability hold her back, gaining celebrity status in the small town of Tinga Maria. "Firebrick
Instead of - [1, 2, 3, 4, 5, 6, 7, 4, 5, 8, 9, 10, 11, 12, 9, 13, 14, 15, 16, 17, 9, 18, 4, 5, 19, 20, 21, 22, 6, 23, 24, 22, 25, 26, 27, 28, 29, 30, 4, 31, 32, 15, 33, 34, 15, 35, 9, 36, 37, 38, 39, 40, 41, 42, 22, 43, 44, 26] - the actual positions of each word in the text, which i gained by printing 'result' from the question above- i get ['0', '1', '2', '19', '4', '5', '6', '7', '8', '9', '10', '32', '12', '13', '14', '15', '16', '17', '41', '19', '20', '21', '41', '23', '24', '25', '26', '27', '28', '32', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43']Firebrick
These are not the positions of the word occurrences. The other parts of the code work perfectly, just not this part. Or am I printing the wrong variable?Firebrick
Cannot replicate with words however there are two potential things here: dictionaries are not ordered and also dictionaries need unique keys so the index position will be overwritten if a word occurs more than once.Ingmar
your solution looks all right, a few comments: indices = [item for item in infile1.read().split()] => indices = infile1.read().split() (same for line below). Also for item in word_base: if item in punctuation_list: reconstructed += item + ' ' else: reconstructed += ' ' + item + ' ' is ugly & underperformant. I'd write reconstructed = ''.join([item + ' ' if item in punctuation_list else ' ' + item + ' ' for item in word_base]). But this isn't codereview-on-answers.stackexchange.com :)Kirkham
@Jean-FrançoisFabre (s)he did, I asked for it to be revoked as I'm not sure I liked my approach. You have given valid feedback, working on it now. I don't know why I used list comp. for thoseIngmar
@Ingmar The dictionary link you gave me says that we can use the first occurring key if the same word occurs more than once. That is what I want to do here too. How do I do that?Firebrick
@TheWorldIn5 What difference would it make?Ingmar
Lol, that was half of the purpose of my question. I had already done that part, but I don't think I spelled it out in the question, because I assumed it would carry over into what you wrote. Since it doesn't, I'm trying to find a way to combine the two.Firebrick