Questions
How much more expensive is it to load small files (e.g. 4 KB) using Snowpipe than, say, 16 KB, 500 KB, or 1-10 MB (the recommended file size)? Note: this question assumes it is more expensive to load small files than files in the recommended 1-10 MB range.
I understand best practice is to load files sized 1-10 MB, but I need near real-time delivery (a few minutes). I could concatenate files to make them larger, but I can't wait more than 60 seconds before sending the micro-batch to S3, and therefore to Snowpipe. I currently write whatever I have every 30 seconds, but I see Snowpipe reporting loads every 60 seconds. Does this mean there is no point writing files to S3 more frequently than every 60 seconds? i.e. if I send a file every 30 seconds, will it actually reduce average latency, or is 60 seconds the minimum Snowpipe cycle?
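To make the latency question concrete, a query along these lines (MY_NRT_TABLE is a placeholder for one of my target tables) should show the gap between Snowpipe receiving a file and finishing the load, which is the number I'm trying to minimise:

```sql
-- Sketch: per-file latency from the moment Snowpipe received the file
-- to the moment it finished loading it (last hour only).
SELECT
    file_name,
    pipe_received_time,
    last_load_time,
    DATEDIFF('second', pipe_received_time, last_load_time) AS load_latency_seconds
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'MY_NRT_TABLE',
    START_TIME => DATEADD('hour', -1, CURRENT_TIMESTAMP())
))
ORDER BY last_load_time DESC;
```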
Loading 4 KB files (around 200 MB a day at 4 KB per file) is costing around 20 credits per gigabyte, which is very expensive. What kind of cost per gigabyte should I expect from Snowpipe if I load, for example, CSV files in the 1-10 MB range? Will my cost per gigabyte drop if I keep within the 1-10 MB range?
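In case the measurement itself is suspect, this is roughly how a credits-per-gigabyte figure can be checked (MY_PIPE is a placeholder for the pipe name):

```sql
-- Sketch: total Snowpipe credits, bytes and files for one pipe over the last
-- 7 days, plus a derived credits-per-GB figure.
SELECT
    SUM(credits_used)                    AS total_credits,
    SUM(bytes_inserted) / POWER(1024, 3) AS total_gb,
    SUM(files_inserted)                  AS total_files,
    SUM(credits_used) / NULLIF(SUM(bytes_inserted) / POWER(1024, 3), 0) AS credits_per_gb
FROM TABLE(INFORMATION_SCHEMA.PIPE_USAGE_HISTORY(
    DATE_RANGE_START => DATEADD('day', -7, CURRENT_DATE()),
    PIPE_NAME => 'MY_PIPE'
));
```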
Is there any faster/cheaper alternative to get data into Snowflake? Note: I'm currently using Snowpipe to load Parquet into a VARIANT column, then using STREAMS and TASKS to restructure the data for near real-time analysis. I understand it's cheaper to use Snowpipe than a virtual warehouse. Is this true? I suspect the real answer is "it depends", but depends upon what?
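For reference, a simplified sketch of the STREAMS and TASKS pattern I'm describing (object names, column names and the schedule are placeholders, and my real solution goes through a view first):

```sql
-- Sketch: a stream tracks new VARIANT rows landed by Snowpipe, and a task
-- periodically flattens them into a structured table for analysis.
CREATE OR REPLACE STREAM raw_events_stream ON TABLE raw_events;

CREATE OR REPLACE TASK restructure_events
  WAREHOUSE = transform_wh
  SCHEDULE  = '1 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
AS
  INSERT INTO events_structured (event_id, event_time, payload)
  SELECT
      v:event_id::STRING,
      v:event_time::TIMESTAMP_NTZ,
      v:payload
  FROM raw_events_stream;

ALTER TASK restructure_events RESUME;
```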
In addition to my near real-time requirement, I have a number of systems delivering batch feeds (CSV format, approximately once every 4 hours, with an expected latency of 30 minutes to process and present for analysis). File sizes vary, but most are in the 1 MB to 1 GB range. Should I use the same Snowpipe solution, or am I better off orchestrating the work from Airflow and using a COPY command followed by SQL statements on a dedicated virtual warehouse? Or indeed, what alternative would you recommend?
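To make the alternative concrete, this is roughly the kind of statement Airflow would issue against a dedicated warehouse for each batch feed (stage, table name, pattern and file format options are placeholders):

```sql
-- Sketch: bulk load one 4-hourly CSV feed from an external stage,
-- run on a dedicated warehouse by an external orchestrator such as Airflow.
COPY INTO batch_target
FROM @batch_stage/feed_a/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
PATTERN = '.*[.]csv'
ON_ERROR = 'ABORT_STATEMENT';
```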
I can see that loading 4 KB files with Snowpipe is expensive, and that larger files are probably cheaper. If I load files over 10 MB in size, will these start to become more expensive again? i.e. is the cost per gigabyte a "bell curve" across file sizes, or does it flatten out?
Background
- I'm using Snowpipe to deliver a near real-time (NRT) data load solution.
- I have data being replicated from Kafka to S3 around every 30 seconds from approx 30 tables, and it's being automatically loaded to Snowflake using Snowpipe.
- Data is passed to me in Parquet format and loaded into a VARIANT column; a view then extracts the attributes into a table before Tasks and SQL restructure it for analysis (a rough sketch of this setup is at the end of this post).
- In a single day, I found 50,000 files loaded; file size varies, but the average is around 4 KB per file.
- I can see around 30 files per minute being loaded (i.e. around 100 KB of data per minute).
- I'm trying to balance several non-functional requirements:
  - Efficient use of credits. I'm aware small files are expensive.
  - Low latency. I'm trying to get a pipeline of around 2-5 minutes maximum from Kafka to dashboard.
  - Simplicity. It needs to be easy to understand and maintain, as I expect the solution to grow massively, from around 20 tables to many hundreds of tables, all needing near real-time delivery.
- I will (in the next 3 months) have a number of CSV batch loads arriving every 4 hours. They are entirely independent data sources (separate from the NRT feed) and need much more intensive processing and ELT. I'm wondering whether I should use Snowpipe or COPY for these.
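For completeness, the per-table Snowpipe setup described above looks roughly like this (object names are placeholders; the storage integration/credentials on the stage are omitted for brevity):

```sql
-- Sketch: an external stage over the S3 landing bucket and an auto-ingest pipe
-- that copies each Parquet file into a single VARIANT column.
CREATE OR REPLACE STAGE kafka_s3_stage
  URL = 's3://my-bucket/kafka-landing/'
  FILE_FORMAT = (TYPE = PARQUET);

CREATE OR REPLACE PIPE load_raw_events
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_events (v)
  FROM (SELECT $1 FROM @kafka_s3_stage)
  FILE_FORMAT = (TYPE = PARQUET);
```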