IndexError: cannot fit 'int' into an index-sized integer

Asked 19/1, 2017 at 17:13 Answered 19/1, 2017 at 19:11

python list append runtime-error indexof

I'm trying to make my program print the index of each word and punctuation mark, as it occurs, from a text file. That part works. The problem comes when I try to recreate the original text, punctuation included, from those index positions. Here is my code:

with open('newfiles.txt') as f:
    s = f.read()
import re
#Splitting string into a list using regex and a capturing group:
matches = [x.strip() for x in re.split("([a-zA-Z]+)", s) if x not in ['',' ']]
print (matches)
d = {} 
i = 1
list_with_positions = []
# the dictionary entries:
for match in matches:
    if match not in d.keys():
        d[match] = i
        i+=1
    list_with_positions.append(d[match])

print (list_with_positions)
file = open("newfiletwo.txt","w")
file.write (''.join(str(e) for e in list_with_positions))
file.close()
file = open("newfilethree.txt","w")
file.write(''.join(matches))
file.close()
word_base = None
with open('newfilethree.txt', 'rt') as f_base:
    word_base = [None] + [z.strip() for z in f_base.read().split()]

sentence_seq = None
with open('newfiletwo.txt', 'rt') as f_select:
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]

print(' '.join(sentence_seq))

As I said, the first part works fine, but then I get this error:

Traceback (most recent call last):
    File "E:\Python\Indexes.py", line 33, in <module>
       sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
    File "E:\Python\Indexes.py", line 33, in <listcomp>
       sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
IndexError: cannot fit 'int' into an index-sized integer

This error occurs when the program reaches 'sentence_seq' towards the bottom of the code.

newfiles is the original text file: a random article with more than one sentence, including punctuation.

list_with_positions is the list of positions at which each word occurs within the original text.

matches holds the separated DIFFERENT words: if words repeat in the file (which they do), matches should contain only the distinct words.

Does anyone know why I get the error?

Firebrick answered 19/1, 2017 at 17:13 Comment(15)

your int must be too big for array indexing: probable duplicate of #4752225 (not closing the question yet) – Kirkham 19/1, 2017 at 17:22

@Jean-FrançoisFabre Indeed, because we are replacing each word in the text file with an integer (its index), probably around 60-80 words. So, does that mean the only way to overcome this is to use a shorter text file? – Firebrick 19/1, 2017 at 17:27

Stab in the dark here. file.write (''.join(str(e) for e in list_with_positions)) writes the data with no spaces, such that when you read it back in, your split() does nothing and actually you're trying to index by an 80-digit number. – Ingmar 19/1, 2017 at 17:28

@Ingmar Wow, that did solve a lot of the problem, but the final output comes out as " They say it ' s a dog ' s life " instead of "They say it's a dog's life". Is it a whitespace error around the punctuation? This happens for full stops too. I guess all the punctuation gets treated like the words because of the way I split the original file. Do you know any way to avoid the unnecessary space around punctuation (you do need whitespace after a full stop but not before, etc.)? – Firebrick 19/1, 2017 at 17:33

In that case, try sentence_seq = [word_base[int(i)].strip() for i in f_select.read().split()]. I won't write as an answer yet because I can't test any of this – Ingmar 19/1, 2017 at 17:36

@Ingmar Unfortunately there is no difference. BTW, I just noticed there are random letter 's's in the final output. This just makes me totally confused. Here is what it outputs: "They say it ' s a dog ' s ' s life , but s for Estrella". Note the unnecessary letter 's's in the output. – Firebrick 19/1, 2017 at 17:40

This is getting tough for me to visualise; without data it's difficult for anyone to keep track in the debugging. But again you do have file.write(''.join(matches)) where you join words with no separation. What happens if you change that to file.write(' '.join(matches))? Really, I might be reaching my limit to what I can suggest without a test case here. – Ingmar 19/1, 2017 at 17:45

@Ingmar Genuine question, just curious, not rude: are you not on a computer or anything? Why can't you test it? Not being rude, genuinely asking: is there anything wrong with the code? And I did separate the matches file write when you suggested doing it for the other one, so no luck so far. – Firebrick 19/1, 2017 at 17:49

The first line of your code: with open('newfiles.txt') as f:. I don't have newfiles.txt, that's on your computer. There is the idea of an MCVE here, so that people can replicate the issue easily. I don't know what your file contains, so I don't know if any test case I create is accurate to what you're using and if I can't be assured I can recreate the issue, it's wasted effort on my part to end up giving false advice. It always helps to try pinpoint the issue you have and make it easily reproducible :) – Ingmar 19/1, 2017 at 17:54

Oh, sorry for being stupid - newfiles is just a random article with more than one sentence with punctuation. That's all that matters within the context of the question - just saying in case you are bothered enough :-) I've changed the question - thx for letting me know – Firebrick 19/1, 2017 at 17:56

So if I create a file containing

Welcome to Stack Overflow. It's fine that you didn't quite create an MCVE on your first question as otherwise it's quite interesting.

then I'm set? :) – Ingmar 19/1, 2017 at 18:1

Hopefully the last question. I'm really trying to stick with your current code but I'm finding it tough. The issue here is that punctuation cannot be included in the join(). Do you need to stick with your current format? – Ingmar 19/1, 2017 at 18:40

No, not necessarily as long as it's along the same lines and does the requirements – Firebrick 19/1, 2017 at 18:43

@Jean-FrançoisFabre first dibs here; I've identified the problem in my answer but I don't like my solution. Is there a cleaner way? OP may/may not accept as answer but I will upvote if you find a better way. – Ingmar 19/1, 2017 at 19:52

@Ingmar see my improvement suggestions. I don't want to post a slightly better solution plagiarizing yours while you did all the legwork with the OP. Lower part can be improved, upper part cannot with listcomps because you generate 2 lists. – Kirkham 19/1, 2017 at 20:38

The issue with your approach is using ''.join() as this joins everything with no spaces. So, the immediate issue is that you attempt to then split() what is effectively a long series of digits with no spaces; what you get back is a single value with 100+ digits. So, the int overflows with a gigantic number when trying to use it as an index. Even more of an issue is that indices might go into double digits etc.; how did you expect split() to deal with that when numbers are joined without spaces?
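To make that concrete, here is a minimal sketch of the failure. It uses made-up data rather than the asker's file, so `positions`, `word_base` and the "wordN" entries are assumptions for illustration only; the error text is what CPython reports.

```python
# Hypothetical data standing in for list_with_positions and word_base.
positions = list(range(1, 81))
word_base = [None] + ["word%d" % n for n in positions]

# Joining with no separator produces one long run of digits...
joined = "".join(str(p) for p in positions)
tokens = joined.split()        # ...so split() hands back a single token
print(len(tokens))             # 1

error_message = None
try:
    word_base[int(tokens[0])]  # one enormous integer used as a list index
except IndexError as exc:
    error_message = str(exc)
print(error_message)

# Writing a space between the numbers makes the round trip work,
# including for multi-digit positions.
joined_ok = " ".join(str(p) for p in positions)
restored = [int(t) for t in joined_ok.split()]
print(restored == positions)   # True
```

The only change needed on the writing side is the separator; `split()` with no arguments then recovers each number regardless of how many digits it has.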

Beyond that, you fail to treat punctuation properly. ' '.join() is equally invalid when trying to reconstruct a sentence because you have commas, full stops etc. getting whitespace on either side.
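One possible way around this, sketched below under the assumption that an exact reconstruction is the goal, is to keep the pieces that `re.split()` returns between the words instead of stripping them. The sample sentence is taken from the asker's text; the variable names are illustrative.

```python
import re

text = "They say it's a dog's life, but for Estrella."

# The question's approach: drop/strip the separators, then re-join with spaces.
# Every punctuation mark ends up padded with whitespace.
stripped = [x.strip() for x in re.split(r"([a-zA-Z]+)", text) if x not in ("", " ")]
padded = " ".join(stripped)
print(padded)

# Alternative: keep every non-empty piece (words AND the text between them).
# The pieces then carry their own spacing, so ''.join() restores the input.
pieces = [x for x in re.split(r"([a-zA-Z]+)", text) if x != ""]
rebuilt = "".join(pieces)
print(rebuilt == text)  # True
```

The trade-off is that separators such as `", "` become entries in their own right, so they need positions in the index list just like words do.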

I tried my best to stick with your current code/approach (I don't think there's huge value in changing the entire approach when trying to understand where an issue comes from) but it still feels shaky to me. I dropped the regex; perhaps that was needed. I'm not immediately aware of a library for doing this kind of thing, but almost certainly there must be a better way.

import string

punctuation_list = set(string.punctuation) # Has to be treated differently

word_base = []
index_dict = {}
with open('newfiles.txt', 'r') as infile:
    raw_data = infile.read().split()
    for index, item in enumerate(raw_data):
        index_dict[item] = index
        word_base.append(item)

with open('newfiletwo.txt', 'w') as outfile1, open('newfilethree.txt', 'w') as outfile2:
    for item in word_base:
        outfile1.write(str(item) + ' ')
        outfile2.write(str(index_dict[item]) + ' ')

reconstructed = ''
with open('newfiletwo.txt', 'r') as infile1, open('newfilethree.txt', 'r') as infile2:
    indices = infile1.read().split()
    words = infile2.read().split()
    reconstructed = ''.join([item + ' ' if item in punctuation_list else ' ' + item + ' ' for item in word_base])

Ingmar answered 19/1, 2017 at 19:11 Comment(12)

Thx a lot man, thx for your time & skills it really really helps. :-) – Firebrick 19/1, 2017 at 19:13

@TheWorldIn5 you're very welcome, this felt like it became much more difficult than it should have been; in your broader goals I would think there is a library for this. Dealing with punctuation is always going to be an issue otherwise. I wonder if NLTK can help. – Ingmar 19/1, 2017 at 19:15

@TheWorldIn5 again, you're welcome :) I think I've missed the mark on this one, feel free to un-accept my answer, which leaves it open... this means that more people might answer. I am interested in an answer myself for more general things. My answer pinpoints the main issue but I don't like how it deals with it; we can both learn from this – Ingmar 19/1, 2017 at 19:46

How come, when we print 'words', instead of the positions of each word in the file we get a seemingly random output of numbers? I checked manually, and the integers output do not represent the occurrence of each word in the text. Just in case you wanted to know what my text actually says, here it is: "They say it's a dog's life, but for Estrella, born without her front legs, she's adapted to more of a kangaroo way of living. The Peruvian mutt hasn't let her disability hold her back, gaining celebrity status in the small town of Tinga Maria. " – Firebrick 19/1, 2017 at 20:2

Instead of [1, 2, 3, 4, 5, 6, 7, 4, 5, 8, 9, 10, 11, 12, 9, 13, 14, 15, 16, 17, 9, 18, 4, 5, 19, 20, 21, 22, 6, 23, 24, 22, 25, 26, 27, 28, 29, 30, 4, 31, 32, 15, 33, 34, 15, 35, 9, 36, 37, 38, 39, 40, 41, 42, 22, 43, 44, 26], the actual positions of each word in the text, which I got by printing 'result' from the question above, I get ['0', '1', '2', '19', '4', '5', '6', '7', '8', '9', '10', '32', '12', '13', '14', '15', '16', '17', '41', '19', '20', '21', '41', '23', '24', '25', '26', '27', '28', '32', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43'] – Firebrick 19/1, 2017 at 20:5

Which is not the positions of the word occurrences? The other parts of the code work perfectly, just not this part. Or is it that I'm printing the wrong variable? – Firebrick 19/1, 2017 at 20:6

Cannot replicate with words however there are two potential things here: dictionaries are not ordered and also dictionaries need unique keys so the index position will be overwritten if a word occurs more than once. – Ingmar 19/1, 2017 at 20:7
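A small sketch of the second point, and of one way to avoid it: only assign a number the first time a token is seen, which is also what the question's own `d[match] = i` loop does. The token list and the names `first_index`, `by_number` are hypothetical, chosen for illustration.

```python
tokens = ["a", "dog", "is", "a", "dog"]  # hypothetical token list

# Overwriting on every occurrence keeps the LAST index of each token.
last_index = {}
for index, token in enumerate(tokens):
    last_index[token] = index
print(last_index["a"])  # 3, not 0

# Assigning only on first sight keeps a stable number per distinct token.
first_index = {}   # token -> 1-based number given at its first occurrence
positions = []     # one number per token, in reading order
for token in tokens:
    if token not in first_index:
        first_index[token] = len(first_index) + 1
    positions.append(first_index[token])
print(positions)  # [1, 2, 3, 1, 2]

# Inverting the mapping turns the positions back into the original tokens.
by_number = {number: token for token, number in first_index.items()}
restored = [by_number[n] for n in positions]
print(restored == tokens)  # True
```

Because lookups go through the token or the number rather than iteration order, this does not depend on whether the dictionary preserves insertion order.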

your solution looks all right, a few comments: indices = [item for item in infile1.read().split()] => indices = infile1.read().split() (same for line below). Also

for item in word_base:
    if item in punctuation_list:
        reconstructed += item + ' '
    else:
        reconstructed += ' ' + item + ' '

is ugly & underperformant. I'd write reconstructed = ''.join([item + ' ' if item in punctuation_list else ' ' + item + ' ' for item in word_base]). But this isn't codereview-on-answers.stackexchange.com :) – Kirkham 19/1, 2017 at 20:36

@Jean-FrançoisFabre (s)he did, I asked for it to be revoked as I'm not sure I liked my approach. You have given valid feedback, working on it now. I don't know why I used list comp. for those – Ingmar 19/1, 2017 at 20:40

@Ingmar In the dictionary link you gave me, it says that we can use the first occurring key if the same word occurs more than once. That is what I want to do here too. How do I do that? – Firebrick 19/1, 2017 at 21:34

@TheWorldIn5 What difference would it make? – Ingmar 19/1, 2017 at 21:35

Lol, that was half of the purpose of my question, which I had already done, but I don't think I wrote it in the question because I thought that, since I had already done it, it would be printed with what you have written. But since it isn't, I'm trying to find a way to combine the two. – Firebrick 19/1, 2017 at 21:38
