Keras: Making a neural network to find a number's modulus

Asked 30/5, 2019 at 14:32 Answered 29/5, 2020 at 22:50

Solved python tensorflow machine-learning keras neural-network

I'm an experienced Python developer, but a complete newbie in machine learning. This is my first attempt to use Keras. Can you tell what I'm doing wrong?

I'm trying to make a neural network that takes a number in binary form, and outputs its modulo when dividing by 7. (My goal was to take a very simple task just to see that everything works.)

In the code below I define the network and I train it on 10,000 random numbers. Then I test it on 500 random numbers.

For some reason the accuracy that I get is around 1/7, which is the accuracy you'd expect from a completely random algorithm, i.e. my neural network isn't doing anything.

Can anyone help me figure out what's wrong?

import keras.models
import numpy as np
from python_toolbox import random_tools

RADIX = 7

def _get_number(vector):
    return sum(x * 2 ** i for i, x in enumerate(vector))

def _get_mod_result(vector):
    return _get_number(vector) % RADIX

def _number_to_vector(number):
    binary_string = bin(number)[2:]
    if len(binary_string) > 20:
        raise NotImplementedError
    bits = (((0,) * (20 - len(binary_string))) +
            tuple(map(int, binary_string)))[::-1]
    assert len(bits) == 20
    return np.c_[bits]


def get_mod_result_vector(vector):
    return _number_to_vector(_get_mod_result(vector))


def main():
    model = keras.models.Sequential(
        (
            keras.layers.Dense(
                units=20, activation='relu', input_dim=20
            ),
            keras.layers.Dense(
                units=20, activation='relu'
            ),
            keras.layers.Dense(
                units=20, activation='softmax'
            )
        )
    )
    model.compile(optimizer='sgd',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    data = np.random.randint(2, size=(10000, 20))
    labels = np.vstack(map(get_mod_result_vector, data))

    model.fit(data, labels, epochs=10, batch_size=50)
    def predict(number):
        foo = model.predict(_number_to_vector(number))
        return _get_number(tuple(map(round, foo[0])))
    def is_correct_for_number(x):
        return bool(predict(x) == x % RADIX)
    predict(7)
    sample = random_tools.shuffled(range(2 ** 20))[:500]
    print('Total accuracy:')
    print(sum(map(is_correct_for_number, sample)) / len(sample))
    print(f'(Accuracy of random algorithm is {1/RADIX:.2f}')


if __name__ == '__main__':
    main()

Cockhorse answered 30/5, 2019 at 14:32 Comment(8)

Perhaps you could check stats.stackexchange.com/questions/157985/… – Pinkie 30/5, 2019 at 14:48

I'm sure there is proper math to say this, but in short the reason is that the relation between bits and the modulo value is pretty much random (other than for powers of two). As in, if you take a big set of numbers and compute mod 7 for values with the i-th bit set to 0, or set to 1 (or if you pick a particular subset of fixed bit values), the frequency of each class will generally be the same. The network cannot pick up patterns on the input because there aren't. It may learn the output for some training examples, but that doesn't inform any new examples. – Dishonest 30/5, 2019 at 17:8

@jdehesa I don't understand the logic in what you're saying. If I were to take the basic exercise of analyzing a picture of a hand-drawn digit, which is far more complicated than this task, that's something a neural network can do. And yet in that case too, the relation between each specific bit in the input and the answer will be random. So why would a neural network fail at the former and succeed at the latter? – Cockhorse 28/5, 2020 at 16:55

@RamRachum the task of classifying digits may sound complicated relative to finding a mod, but the loss surface for a mod is extremely bumpy, it's hard to even converge a simple NN (memorizing) for such tasks let alone generalize. It's not about the capacity of the network, maybe a proper data transformation is needed before applying an NN so that the loss surface is manageable. – Prase 28/5, 2020 at 20:7

@RamRachum Well, it depends on what you understand by "complicated". Recognizing hand-drawn digits is certainly a "bigger" problem in terms of dimensionality of the data, but neural networks can handle big problems. But from the point of view of the relationship between input and output, it's not so complicated. Each pixel value in a hand-drawn digit image has a strong correlation with the class of digit that is drawn. – Dishonest 29/5, 2020 at 9:16

@RamRachum In probability terms, given a pixel value v, the distribution of P(v | image has class c) is generally different than P(v) (more or less, some pixels may always be black or white). In your problem, for a bit value b and output n, P(b | output is n) does not differ much from P(b) (or the other way around). There is also no smoothness or continuity here (e.g. for images a pixel v may be "usually white" for class c, but here there's only 1 or 0, and each value changes the output radically), so the network needs to "learn" every single example almost independently. – Dishonest 29/5, 2020 at 9:25

While @jdehesa and others are generally correct, there are ways to treat your task as something NN can crack. One such way is to add additional structure to the input - e.g. treat it as a sequence and use RNN approach. With some tuning you can make it work (see updated version of my answer below). In some sense it is similar to using CNN layers for images - it allows network to generalize better by trying to learn different shapes no matter on which part of image they appear. – Kendo 29/5, 2020 at 15:30

There is a massive difference between this data set and say the MNIST data set. Namely the following question, "how many random bits of change does it take to flip the label?". In the MNIST dataset the labels are so far apart in the space, that such question is in the hundreds (and probably smoothly, so you don't get random garbage). But for mod7, the answer is one. Change any one bit and the label changes. This makes the data for mod7 (in binary) have extreme entropy. The MNIST dataset is far less complicated than this dataset. – Gruver 29/5, 2020 at 23:50

UPD

After some tinkering I was able to get to a reasonably good solution using RNNs. It trains on less than 5% of all possible unique inputs and gives >90% accuracy on the random test sample. You can increase number of batches to 100 from 40 to make it a bit more accurate (though in some runs there is a chance the model won't converge to the right answer - here it is higher than usually). I have switched to using Adam optimizer here and had to increase number of samples to 50K (10K led to overfitting for me).

Please understand that this solution is a bit of a tongue-in-cheek thing, because it is based on the task-domain knowledge that our target function can be defined by a simple recurring formula on the sequence of input bits (even simpler formula if you reverse your input bit sequence, but using go_backwards=True in LSTM didn't help here).

If you inverse the input bits order (so that we always start with the most significant bit) than the recurring formula for the target function is just F_n = G(F_{n-1}, x_n), where F_n = MOD([x_1,...,x_n], 7), and G(x, y) = MOD(2*x+y, 7) - only has 49 different inputs and 7 possible outputs. So the model kind of have to learn initial state + this G update function. For the sequence starting with the least significant bit the recurring formula is slightly more complicated cause it will also need to keep track on what is current MOD(2**n, 7) on each step, but it seems that this difficulty doesn't matter for training.

Please note - these formulas are only to explain why RNN works here. The net below is just a plain LSTM layer + softmax with original input of bits treated as a sequence.

Full code for the answer using RNN layer:

import keras.models
import numpy as np
from python_toolbox import random_tools

RADIX = 7
FEATURE_BITS = 20

def _get_number(vector):
    return sum(x * 2 ** i for i, x in enumerate(vector))

def _get_mod_result(vector):
    return _get_number(vector) % RADIX

def _number_to_vector(number):
    binary_string = bin(number)[2:]
    if len(binary_string) > FEATURE_BITS:
        raise NotImplementedError
    bits = (((0,) * (FEATURE_BITS - len(binary_string))) +
            tuple(map(int, binary_string)))[::-1]
    assert len(bits) == FEATURE_BITS
    return np.c_[bits]


def get_mod_result_vector(vector):
    v = np.repeat(0, 7)
    v[_get_mod_result(vector)] = 1
    return v


def main():
    model = keras.models.Sequential(
        (
            keras.layers.Reshape(
                (1, -1)
            ),
            keras.layers.LSTM(
                units=100,
            ),
            keras.layers.Dense(
                units=7, activation='softmax'
            )
        )
    )
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    data = np.random.randint(2, size=(50000, FEATURE_BITS))
    labels = np.vstack(map(get_mod_result_vector, data))

    model.fit(data, labels, epochs=40, batch_size=50)
    def predict(number):
        foo = model.predict(_number_to_vector(number))
        return np.argmax(foo)
    def is_correct_for_number(x):
        return bool(predict(x) == x % RADIX)
    sample = random_tools.shuffled(range(2 ** FEATURE_BITS))[:500]
    print('Total accuracy:')
    print(sum(map(is_correct_for_number, sample)) / len(sample))
    print(f'(Accuracy of random algorithm is {1/RADIX:.2f}')


if __name__ == '__main__':
    main()

ORIGINAL ANSWER

I'm not sure how it happened, but the particular task you chose to check your code is extremely difficult for a NN. I think the best explanation would be that NNs are not really good when features are interconnected in such way that changing one feature always change value of your target output completely. One way to look at it would be to see the sets of features when you expect a certain answer - in your case they will look like unions of very large number of parallel hyper planes in 20 dimensional space - and for each of 7 categories these sets of planes are "nicely" interleaved and left for NN to distinguish.

That said - if your number of examples is large, say 10K and number of possible inputs is smaller, say your input bit numbers are only 8 bits large (so 256 unique inputs possible only) - networks should "learn" the right function quite ok (by "remembering" correct answers for every input, without generalization). In your case that doesn't happen because the code has the following bug.

Your labels were 20-dimensional vectors with bits of 0-6 integer (your actual desired label) - so I guess you were pretty much trying to teach NN to learn bits of the answer as separate classifiers (with only 3 bits ever possible to be non-zero). I changed that to what I assume you actually wanted - vectors of length 7 with only one value being 1 and others 0 (so-called one hot encoding which keras actually expects for categorical_crossentropy according to this). If you wanted to try to learn each bit separately you definitely shouldn't have used softmax 20 in the last layer, cause such output generates probabilities on 20 classes which sum up to 1 (in that case you should have trained 20-or-rather-3 binary classifiers instead). Since your code didn't give keras correct input the model you got in the end was kind of random and with rounding you applied was intented to output the same value for 95%-100% of inputs.

Slightly changed code below trains a model which can more or less correctly guess the mod 7 answer for every number 0 to 255 (again, pretty much remembers the correct answer for every input). If you try to increase FEATURE_BITS you will see large degradation of the results. If you actually want to train NN to learn this task as is with 20 or more bits of input (and without supplying NN with all possible inputs and infinite time to train) you will need to apply some task-specific feature transformations and/or some layers carefully designed to exactly be good at task you want to achieve as others already mentioned in comments to your question.

import keras.models
import numpy as np
from python_toolbox import random_tools

RADIX = 7
FEATURE_BITS = 8

def _get_number(vector):
    return sum(x * 2 ** i for i, x in enumerate(vector))

def _get_mod_result(vector):
    return _get_number(vector) % RADIX

def _number_to_vector(number):
    binary_string = bin(number)[2:]
    if len(binary_string) > FEATURE_BITS:
        raise NotImplementedError
    bits = (((0,) * (FEATURE_BITS - len(binary_string))) +
            tuple(map(int, binary_string)))[::-1]
    assert len(bits) == FEATURE_BITS
    return np.c_[bits]


def get_mod_result_vector(vector):
    v = np.repeat(0, 7)
    v[_get_mod_result(vector)] = 1
    return v


def main():
    model = keras.models.Sequential(
        (
            keras.layers.Dense(
                units=20, activation='relu', input_dim=FEATURE_BITS
            ),
            keras.layers.Dense(
                units=20, activation='relu'
            ),
            keras.layers.Dense(
                units=7, activation='softmax'
            )
        )
    )
    model.compile(optimizer='sgd',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    data = np.random.randint(2, size=(10000, FEATURE_BITS))
    labels = np.vstack(map(get_mod_result_vector, data))

    model.fit(data, labels, epochs=100, batch_size=50)
    def predict(number):
        foo = model.predict(_number_to_vector(number))
        return np.argmax(foo)
    def is_correct_for_number(x):
        return bool(predict(x) == x % RADIX)
    sample = random_tools.shuffled(range(2 ** FEATURE_BITS))[:500]
    print('Total accuracy:')
    print(sum(map(is_correct_for_number, sample)) / len(sample))
    print(f'(Accuracy of random algorithm is {1/RADIX:.2f}')


if __name__ == '__main__':
    main()

Kendo answered 29/5, 2020 at 3:3 Comment(12)

This works because the network learns the mapping from a vector to a class, the vector has only 255 configs and it's trained 10000 samples (this task has much smoother loss surface), it converged but did you manage to evaluate the model on an unseen set to understand if it only memorized the inputs or learned a useful mapping? – Prase 29/5, 2020 at 7:32

I'm very positive here that network just memorized the inputs, though talking about unseen inputs in this case is a bit strange. I didn't try it, but I would expect that if I intentionally removed say 1 or 10 out of 256 unique inputs from training, the net won't show better than normal results on average compared to random (averaged across multiple random initialization and decision on which inputs to set aside for test set). – Kendo 29/5, 2020 at 7:49

I also think there is a way to solve this particular task in a generalizable way - e.g. for 20 bit inputs with 7 classes outputs and relatively small number of unique training samples. I think some sort of 1d convolution on bits repeated multiple times with same kernel might work. Actually, an RNN (e.g. LSTM) is very likely to crack this easily and generalize on arbitrary length bit inputs. It will only need to learn how to get from 7 possible states with 2 inputs to one of 7 next states. Can be RNN much simpler than LSTM. – Kendo 29/5, 2020 at 8:7

@ZabirAlNazi - RNN worked! Not as easy as I thought - mostly because it has to converge to correct RNN update starting from random and only getting input after 20 steps. But it is definitely generalizing (more than 98% correct answers on random test with less than 5% of unique inputs used for training). – Kendo 29/5, 2020 at 15:34

That's interesting, but as MLP can memorize the mapping, RNN also should be able to do the same, technically. That's definitely progress, but maybe you can try changing your sampling strategy for training, for example, sample integers from range 1-1000, and see if the model can somewhat predict for the range 1001-1500, so far I think memorization on the data is possible with the vector format, the two challenging aspects for me seems: generalization (somewhat), can we learn without vectorization? – Prase 29/5, 2020 at 18:26

Not sure I fully understand the comment - are you suspecting the model memorizes the inputs here? I don't think this is happening here - I only use ~5% of different inputs and the model gives reasonably good answers everywhere. Also when I switch to using 10K samples instead of 50K (and 5x more epochs) I get good training data loss with close-to-random predictions outside of training sample. Thus I conclude that 50K samples training is actually learning correct RNN update function, while 10K training memorizes training samples (with current net layout). – Kendo 29/5, 2020 at 18:38

Learning without vectorization or other similar trick like 1d convolutions etc. is very likely impossible here as you and others suggested earlier. On the other hand you can say similar thing about learning to classify images based on pixels. – Kendo 29/5, 2020 at 18:42

Hmm wondering about this. A RNN sounds like it would basically learn the equations to carry out long division. – Lollygag 30/5, 2020 at 7:46

That was my intention - to make it learn long division rule in a nutshell. Though here the target is only provided at the end of 20 operations which makes training more difficult. I could add shorter samples of different size - most likely that would help learning such rule. – Kendo 30/5, 2020 at 16:21

@AlexanderPivovarov Thanks for your answer. I chose the mod7 task as a task that would be as simple as possible, and it looks like I was comically wrong. Can you recommend a couple of tasks that actually are very simple for neural networks, and wouldn't require CNN or RNN? Thanks! – Cockhorse 2/6, 2020 at 17:40

Well, turned out the difficulty of this mod 7 task you chose is a bit overestimated by many people :). But it is indeed quite more difficult for a NN than one might expect. Regarding the tasks which are actually simple for NNs consisting of a few Dense layers you can try e.g. the following. Inputs - points in N dimensional space, target - distance from an input point to a fixed point (e.g. 0) with square of error as the loss. – Kendo 2/6, 2020 at 18:49

If you prefer classification tasks with cross entropy loss like here - you can do the same thing, but use distance from a point to construct your classes (e.g. distance < 1 - class 0, distance between 1 and 5 - class 1, otherwise class 3). I think that sounds like a fairly simple task for a NN (unless your dimensionality is high). – Kendo 2/6, 2020 at 18:51

This achieves an accuracy of 99.74% and a validation accuracy of 99.69%.

import tensorflow as tf, numpy as np

def int2bits(i,fill=20): 
    return list(map(int,bin(i)[2:].zfill(fill)))

def bits2int(b):
    return sum(i*2**n for n,i in enumerate(reversed(b)))

# Data. 
I = np.random.randint(0,2**20,size=(250_000,))
X = np.array(list(map(int2bits,I)))
Y = np.array([int2bits(2**i,7) for i in I % 7])

# Test Data. 
It = np.random.randint(0,2**20,size=(10_000,))
Xt = np.array(list(map(int2bits,It)))
Yt = np.array([int2bits(2**i,7) for i in It % 7])

# Model.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1000,'relu'),
    tf.keras.layers.Dense(7,'softmax'), 
])
model.compile('adam','categorical_crossentropy',['accuracy'])

# Train.
model.fit(X,Y,10_000,100,validation_data=(Xt,Yt))

Some take-aways:

1) You had way too little data. You were uniformly sampling points from 0 to 2**20, but only sampled 10,000, which is only about 1% of the possible vectors that the model is suppose to learn about. The point is that a lot of components (in the binary representation) would be mostly fixed at zero or one without any opportunity to learn how they function in the overall data or how they interact with other components.

2) You needed an embedding layer, namely extend the space into some massive higher dimension, so the neurons can move around more easily. This allows the learning to shuffle things better hopefully finding the algorithm your looking for. A single Dense(1000) seems to work.

3) Ran batches of 10_000 (just so I maximize my CPU usage). Ran 100 epochs. Included my validation_data in the training so I could see how the validation set performs at each epoch (including this doesn't effect the training, just makes it easier to see if the model is doing well, while training).

Thanks. :-)

Gruver answered 29/5, 2020 at 22:50 Comment(6)

Very nice. And I know this is just hyperparameter tuning, but I'm always wondering if a hypothesis class needs to be so complex, especially for a problem that should be solvable with a simple equation. So I did some testing, and with a smaller batch size of 1000 and a hidden layer with just 64 nodes, you can achieve similar performance. :) – Lollygag 30/5, 2020 at 10:13

Ya, I noticed that when I used, Dense(100), I was able to get 99% Acc and around 20% Val Acc; so it was learning something, but had some serious overfitting issues. Turning down the batch size could fix that, but also makes training take longer. My GPU did 100 epochs in about <10 secs with the 10,000 batch mark. But ya, I didn't play around with it even to fine tune the parameters. – Gruver 30/5, 2020 at 16:5

Wow! When I first read it I thought no way this can be correct. While I was coming up with excuses on why this problem is so difficult for NN you actually solved it :). I've also tried to adjust the parameters - not only 1000 neurons are not necessary here as @DesmondCheong pointed, you also can decrease number of samples too. So with same 64 neurons and mere 50K samples validation accuracy was reaching 99.7% at roughly 10-15K epochs (took a few minutes to run). Using same 10K as batche size. – Kendo 30/5, 2020 at 16:10

Although, I don't really agree with the idea of trying to save 20KB of memory, but take significantly longer. Ran it with batch_size=1000 and Dense(64), took about 100+ secs to converge to about 95.33% Acc; which is equivalently achieved at epoch 50ish with 10000,Dense(1000) in <5 secs. Unless you are using an FPGA or RasPi or something, there is just no reason to be so conservative with your memory. Faster training is typically better. – Gruver 30/5, 2020 at 16:12

It's not really about saving the memory. I was more interested to see how generalizing the model is. Using 25% of all possible unique examples sounds like quite a lot. Not arguing with you here though (and clearly with 250K samples the model was also generalizing otherwise we would probably see accuracy around 25%+(75%)*1/7 for validation). – Kendo 30/5, 2020 at 16:19

Agreed, the 250K is quite high. The fact that this can be lowered is very interesting. The reason I went with such a high number is that this data set has an extremely high entropy. Changing any 1 bit does change the label; with such a property you really need a lot of cases to learn all the subtle behaviors between the parameters. I am surprised you got it to converge at all with just 50K samples, NICE. – Gruver 30/5, 2020 at 16:23