Is there any method to approximate the softmax probability under special conditions?
I'm trying to find an approach to compute the softmax probability without using exp().

Assume that:

target: to compute f(x1, x2, x3) = exp(x1)/[exp(x1)+exp(x2)+exp(x3)]

conditions:

    1. -64 < x1,x2,x3 < 64

    2. the result only needs to be kept to 3 decimal places.

Is there any way to find a polynomial that approximately represents the result under these conditions?

Settlement answered 4/6, 2020 at 8:25 Comment(1)
You could approximate exp in softmax with piece-wise linear functions (see here); other possibilities are in this answer. It depends on what you're after and why you want to do that. – Distinct

My understanding of Softmax probability

The raw output of a neural network (NN) is not very discriminating. For example, with 3 classes the NN may produce some value a for the correct class and b, c for the others, with a > b and a > c. After the softmax transformation, the values sum to 1, which makes them interpretable as probabilities, and the winning value dominates the others (a >>> b, a >>> c), so the prediction appears much more confident.

So how to go further

To get the first advantage, it is sufficient to use

f(x1)/[f(x1)+f(x2)+f(x3)]
(equation 1)

for any non-negative function f(x) (increasing on the input range, so that the ordering of the scores is preserved)

Softmax chooses f(x)=exp(x). But as you are not comfortable with exp(x), you can choose say f(x)=x^2.

I give some plots below which have a profile similar to the exponential; you may choose from them or use some similar function. To tackle the negative range, you may add a bias of 64 to the output.

[plots of candidate functions with an exponential-like profile]

If you only need a score rather than values that sum to 1, the data-dependent denominator of equation 1 can be replaced by a fixed upper bound, so for simplicity you can use the following instead of equation 1:

[f(x)] / [3*f(xmax)]

In your case xmax = 64 + bias (if you choose to use one). Note that with this simplification the outputs no longer sum to 1; they only preserve the ordering of the scores.
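
A minimal sketch of the f(x) = x^2 idea (my own illustration; the function name squareSoftmax and the example values are not from this answer):

    #include <array>
    #include <cstdio>

    // Softmax-like normalization with f(x) = x^2 and a bias of 64 so that
    // every shifted input is non-negative (the question guarantees -64 < x < 64).
    std::array<float, 3> squareSoftmax(float x1, float x2, float x3) {
        auto f = [](float x) { float y = x + 64.0f; return y * y; };
        float f1 = f(x1), f2 = f(x2), f3 = f(x3);
        float sum = f1 + f2 + f3;      // denominator of equation 1
        // For a mere score, sum could be replaced by the constant 3 * f(64).
        return { f1 / sum, f2 / sum, f3 / sum };
    }

    int main() {
        auto p = squareSoftmax(3.0f, 1.0f, -2.0f);
        std::printf("%.3f %.3f %.3f\n", p[0], p[1], p[2]);  // three normalized scores
    }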

Regards.

Uam answered 4/6, 2020 at 12:38 Comment(0)

Since the activation range is often vastly larger than the range over which exp(x) can be evaluated without overflow, one will mostly need to find the largest activation value m = max(a, b, c) and then subtract it from all of the values.

With a selected (or sorted) to be the largest of the values, i.e. m = a, this is identical to 1 / (1 + exp(b-m) + exp(c-m)), since exp(a-m) = 1.
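
As a concrete sketch (my illustration; topSoftmax is just an illustrative name, and std::exp stands in for whichever exp approximation is used):

    #include <cmath>

    // Softmax probability of activation a, assuming a has been selected/sorted
    // to be the largest of the three values: exp(a-a) == 1, so the term for a
    // collapses and no exponent ever exceeds 0 (no overflow for large inputs).
    float topSoftmax(float a, float b, float c) {
        return 1.0f / (1.0f + std::exp(b - a) + std::exp(c - a));
    }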

The number of exp evaluations is thus reduced a bit; however, it's possible that the sorting is actually more costly than the fastest exp approximations:


For the exp function there is also a well-known first-order approximation available -- of the form (int)(x * 12102203.2f) + 127 * (1 << 23) - 486411, reinterpreted as float -- see Fastest Implementation of the Natural Exponential Function Using SSE.
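
Written out as a scalar sketch (my own spelling-out of the formula above; fastExpApprox is just an illustrative name):

    #include <bit>
    #include <cstdint>

    // First-order exp(x) approximation: scale x by 2^23 / ln(2) so the integer
    // part lands in the float's exponent field, add the exponent bias 127 << 23
    // minus a small correction constant, and reinterpret the bits as a float.
    float fastExpApprox(float x) {
        int32_t i = (int32_t)(x * 12102203.2f) + 127 * (1 << 23) - 486411;
        return std::bit_cast<float>(i);
    }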


I just recently found another method which lacks a little bit of accuracy, but parallelises better on some SIMD implementations (Arm64) without using float <-> int conversions:

    #include <bit>       // std::bit_cast (C++20)
    #include <cstdint>

    // Approximates 2^x: add a carefully chosen constant so that the integer
    // part of x ends up in the exponent field after the left shift below.
    template <typename T>
    T fastExp2(T x) {
        if constexpr (sizeof(x) == 2) {
            // 0 10101 0 01111 xxxx // just 4 bits of fractionals
            x += (T)79.0f;
            // cast back to uint16_t: the shift promotes to int, and
            // std::bit_cast requires source and destination of equal size
            return std::bit_cast<T>((uint16_t)(std::bit_cast<uint16_t>(x) << 6));
        } else if constexpr (sizeof(x) == 4) {
            // 0 10001000 001111111 xxxxx xxxxx xxxx // 14 fractional bits
            x += (T)639.0f;
            return std::bit_cast<T>(std::bit_cast<uint32_t>(x) << 9);
        } else {
            // 0 10000001011 001111111111 xxxx... // 40 fractional bits
            x += (T)5119.0f;
            return std::bit_cast<T>(std::bit_cast<uint64_t>(x) << 12);
        }
    }

If it's not obvious what's happening here: the argument x is offset by a large, carefully selected constant. Some or most of the fractional bits stay intact, whereas the integer part is added onto the exponent bias. At that point the correct (but truncated) result is already embedded in the floating-point number; it just needs to be shifted into the correct position.

One can premultiply the weights of the last convolutional layer by log2(e) == 1.44269504088896, since exp(x) = 2^(x * log2(e)); this avoids the extra scaling step in front of the exponential function.
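
Putting the pieces together, a sketch of the full approximation (my own combination of the statements in this answer; fastExp2d and approxTopSoftmax are illustrative names, a is assumed to be the largest activation, and the log2(e) factor could instead be folded into the previous layer's weights as noted above):

    #include <bit>
    #include <cstdint>

    // Double-precision variant of the fastExp2 trick above (approximates 2^x).
    static double fastExp2d(double x) {
        // 0 10000001011 001111111111 xxxx... // 40 fractional bits
        x += 5119.0;
        return std::bit_cast<double>(std::bit_cast<uint64_t>(x) << 12);
    }

    // Approximate softmax probability of the largest activation a, combining
    // the max-subtraction identity softmax(a) = 1 / (1 + exp(b-a) + exp(c-a))
    // with exp(t) = 2^(t * log2(e)).
    double approxTopSoftmax(double a, double b, double c) {
        constexpr double log2e = 1.44269504088896;
        return 1.0 / (1.0 + fastExp2d((b - a) * log2e) + fastExp2d((c - a) * log2e));
    }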

Henke answered 21/5 at 5:36 Comment(0)
