How to create an aggregate UDF in Snowflake
R

2

7

I'm trying to create an aggregate UDF, for example something like sum or median.

The documentation and examples at https://docs.snowflake.net/manuals/sql-reference/udf-sql.html and https://docs.snowflake.net/manuals/sql-reference/sql/create-function.html don't explain how to do so.

Can someone please explain how and/or provide a MWE?

Reorganize answered 16/4, 2019 at 13:40 Comment(0)
C
3

you want to use a javascript user defined table function of which that have a sum example here

The main gotcha's you want to look out for, is inside the UDF code the SQL parameters names are uppercase, you can see this in the examples, but if you missed noting it, it can lead to lots of head banging. Also Javascript has no int types so, all value have to go in/out via a double, but an int32 can safely be stored in a double, so that is not a major concern.. and if you need more precision, you might want to make you function output not the final "sum" but return multiple values one of which is the an aggregate key, and then sum in SQL space.

Courtesy answered 16/4, 2019 at 22:25 Comment(1)
This answer is now out-dated, as we have Python UDAFs docs.snowflake.com/en/developer-guide/udf/python/…Industrious
I
1

New in 2024: You can write Python user-defined aggregate functions (UDAF) in Snowflake.

Example from docs:

CREATE OR REPLACE AGGREGATE FUNCTION PYTHON_SUM(a INT)
RETURNS INT
LANGUAGE PYTHON
RUNTIME_VERSION=3.8
handler = 'PythonSum'
AS $$
class PythonSum:
  def __init__(self):
    # This aggregate state is a primitive Python data type.
    self._partial_sum = 0

  @property
  def aggregate_state(self):
    return self._partial_sum

  def accumulate(self, input_value):
    self._partial_sum += input_value

  def merge(self, other_partial_sum):
    self._partial_sum += other_partial_sum

  def finish(self):
    return self._partial_sum
$$;
Industrious answered 22/5 at 5:7 Comment(3)
very nice to see a working example for a function that gives 64bit Int safe aggregation.Courtesy
Huh! Never thought about that problem...Industrious
short of my answer summing int32's in double space is 100% bit correct, so the only reason to use python over javascript is the larger native value support. Techincally it would be interesting to performance check 1billion rows of Python verse Javascript. Rightly though, given the partial/merge behavour this answer will scale much better over large sets of the same partition group. Even that tradeoff curve could be interesting to plot.Courtesy

© 2022 - 2024 — McMap. All rights reserved.