Biggest misconceptions about Snowflake

Francesco Quaratino
7 min read · May 14, 2021

I strongly believe that “failure is the best way to learn”. However, failing because of misconceptions is a completely different story... I wish I had not taken them for granted!

This post aims to address the biggest misconceptions about Snowflake. It is the result of about one year of hands-on experience by a small team of data engineers migrating a data warehouse from SQL Server to Snowflake.

Here you will learn about some of the key concepts in Snowflake that data engineers often misunderstand. It is meant to be ‘food for thought’, and not a full dissertation by any means.

Virtual Warehouses

If you are new to Snowflake, a warehouse is probably not what you think it is. Its real name is ‘Virtual Warehouse’ but — as software developers love abbreviations — the ‘Virtual’ bit soon gets lost, hence the confusion with the (far more) famous concept of ‘Data Warehouse’.

A warehouse in Snowflake has nothing to do with data per se, but rather with compute: the ability to fetch data from storage, process it, and store results back.

Compute translates into a terrific concept that makes this data platform unique, and it is characterized by the following (illustrated in the SQL sketch after this list):

  • Separation of storage and compute: it is possible to run a range of warehouses of different sizes that can perform compute operations on the same data with zero resource contention as each node of a warehouse uses dedicated hardware.
  • Auto Scaling: when a warehouse has ‘Auto-Scale’ enabled, it will start with a single cluster and add more clusters to handle the workload as required. Once the newly added clusters are no longer required, the warehouse will scale back down to the original single cluster.
  • Automatic Suspend: when a warehouse has ‘Auto-Suspend’ enabled, it will automatically suspend after X minutes of inactivity.
  • Automatic Resumption: when a warehouse has ‘Auto-Resume’ enabled, it will automatically resume within milliseconds when a new query is executed.
  • Per Second Pay-as-you-go: Snowflake charges per second after the first minute and no charge is applied to a suspended Warehouse.
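
These settings come together when a warehouse is created. Here is a minimal sketch, assuming a hypothetical warehouse named ETL_WH (the multi-cluster settings require the Enterprise Edition):

    CREATE WAREHOUSE etl_wh
      WAREHOUSE_SIZE = 'SMALL'     -- size of each cluster
      MIN_CLUSTER_COUNT = 1        -- start with a single cluster
      MAX_CLUSTER_COUNT = 4        -- Auto-Scale out to up to 4 clusters
      AUTO_SUSPEND = 60            -- suspend after 60 seconds of inactivity
      AUTO_RESUME = TRUE           -- resume automatically on the next query
      INITIALLY_SUSPENDED = TRUE;  -- consume no credits until first used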

In conclusion, the Virtual Warehouse concept is a “must know” as it is hugely important for running a fast and cost-effective data platform with Snowflake.

Snowflake is Not Fully Relational

Does this come as a surprise to you? If so, I recommend reading on, as this misconception can lead to big disappointment. Merely providing the SQL language for querying a database does not make it relational. But what actually is a relational database? Here is a concise definition provided by the AWS documentation:

A relational database is a collection of data items with pre-defined relationships between them. These items are organized as a set of tables with columns and rows. Tables are used to hold information about the objects to be represented in the database. Each column in a table holds a certain kind of data and a field stores the actual value of an attribute. The rows in the table represent a collection of related values of one object or entity. Each row in a table could be marked with a unique identifier called a primary key, and rows among multiple tables can be made related using foreign keys. This data can be accessed in many different ways without reorganizing the database tables themselves.[1]

According to the above definition, a Snowflake database would qualify as relational. What is missing in Snowflake, however, is the enforcement of constraints such as Primary Key, Unique Key, and Foreign Key. Confusingly, they can all be defined, but there is no mechanism to enforce them (NOT NULL is the only constraint Snowflake actually enforces).
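
A quick sketch makes the point (the table and values are made up): Snowflake accepts the constraint definition, yet both inserts below succeed, leaving a duplicate primary key.

    CREATE TABLE customer (
        customer_id INT PRIMARY KEY,  -- accepted, but not enforced
        name        STRING
    );

    INSERT INTO customer VALUES (1, 'Alice');
    INSERT INTO customer VALUES (1, 'Bob');  -- no error: duplicate key allowed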

The flip side of the coin is that, by not having to verify entity and referential integrity (normally enforced by Primary/Unique Key and Foreign Key constraints, respectively), Snowflake can load data much faster than a traditional RDBMS that checks those constraints on every row. Since Snowflake is a data platform designed for analytical processing (and not for transactional processing), it is usually fine to favor load performance over enforced integrity, which can instead be ensured programmatically while loading data. Nevertheless, it is something to bear in mind, especially for data professionals with a strong relational database background, who may feel quite uncomfortable without entity and referential integrity enforced by the database engine.

Costs

The costs associated with using Snowflake are based on the usage of three distinct functions:

  • Compute resources (implemented as virtual warehouses)
  • Data storage
  • Cloud services

In practice, it is usually Virtual Warehouse (VW) usage that takes the biggest slice of the Snowflake credits. The way it works is quite simple; however, users often get confused.

From the moment a Virtual Warehouse starts until it is suspended, it consumes credits according to its size and the number of clusters running. The problem is the common assumption that a VW only consumes credits while queries are running! That is not the case. Therefore, make sure your VW’s Auto-Suspend setting is configured so that you do not consume credits unnecessarily.

As an extreme example, say you have a query that takes 10 seconds to complete and is executed once a day on VW#1, which is set to start automatically and to suspend 10 hours after the last executed query: if this is the only query running on VW#1, its consumed credits would be based on 10 hours (and not 10 seconds) a day.
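
The fix is a one-line statement. A minimal sketch, assuming a warehouse named VW1 and an arbitrary 60-second timeout:

    ALTER WAREHOUSE vw1 SET AUTO_SUSPEND = 60;  -- the value is in seconds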

Warehouse Auto-Scaling

Virtual Warehouses (VWs) come in different sizes, from X-Small to 4X-Large (using the quite effective t-shirt sizing paradigm), which means they can be scaled vertically (up and down) at any time, as required.

But there is an even more intriguing kind of scaling that a VW can be subjected to, and that is automatic horizontal scaling, or Auto-Scaling Out: a single VW can run multiple clusters of worker nodes (effectively EC2 instances of the same size) and can be set to automatically add clusters to handle an increasing workload and remove them once the workload decreases.

In other words, a VW can stretch and shrink automatically if required. But the question is: can a single query be processed across clusters? Say a big query is executed on a Small VW that runs a minimum of 1 cluster and is set to auto-scale out to a maximum of 4: will Snowflake be able to take advantage of more than one cluster to execute that query?

The misconception here comes from associating Snowflake with the shared-nothing architecture, where tables are horizontally partitioned across nodes, each node having its own local storage system and being responsible for the rows hosted on it.

In Snowflake, by contrast, storage and compute are separate: a query is executed entirely within a single cluster, and the extra clusters added by auto-scaling only absorb new, concurrent queries; they do not speed up a query that is already running. This also means that the VW size needs to be set to support the heaviest expected query.
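
To make the distinction concrete, here is a sketch of the two levers, assuming a hypothetical REPORTING_WH warehouse: resizing speeds up individual queries, while adding clusters handles more concurrent ones.

    -- Scale UP: a bigger cluster makes a single heavy query faster.
    ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';

    -- Scale OUT: more clusters absorb more concurrent queries
    -- (multi-cluster warehouses require the Enterprise Edition).
    ALTER WAREHOUSE reporting_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4;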

No more database backups to run

There is no longer a need to run database backup jobs, because we can take advantage of the Time Travel feature [2], together with Cloning [3], to restore a table, a schema, or an entire database to a point in time within a defined period in the past, based on the DATA_RETENTION_TIME_IN_DAYS parameter.
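
Here is a minimal sketch of what that looks like in practice (the ORDERS table and the timestamp are made up):

    -- Query a table as it was one hour ago...
    SELECT * FROM orders AT (OFFSET => -3600);

    -- ...or "restore" it as of a given moment with a zero-copy clone.
    CREATE TABLE orders_restored CLONE orders
      AT (TIMESTAMP => '2021-05-13 08:00:00'::TIMESTAMP_LTZ);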

It is worth noting that the Standard Edition of Snowflake allows a very short Time Travel window of at most 1 day, whereas in the Enterprise Edition it can be set to a maximum of 90 days. Also, the longer the data retention, the more storage Snowflake needs to use, hence more credits and thus a higher charge.

No more indexes to maintain

There is no longer a need to run index maintenance jobs of any kind since indexes are not a thing in Snowflake!

Now, you may have heard about Clustering in Snowflake and assumed it refers to clustered (B+ tree) indexes. Although there are some similarities between the two, they are completely different concepts. A full dissertation on the subject is beyond the scope of this blog post, but it is worth noting that you are unlikely to need to think about how your data is clustered for tables under 1TB.
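
For very large tables where clustering does matter, a clustering key can be defined with a single statement. A minimal sketch, assuming a hypothetical EVENTS table clustered by date:

    ALTER TABLE events CLUSTER BY (event_date);

    -- Check how well the table is clustered on that key.
    SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date)');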

SQL Stored Procedures

As of today, JavaScript is the only language available for writing stored procedures in Snowflake, which might be quite disappointing for someone who has spent their career developing stored procedures in SQL, and it is certainly not good news when you plan to lift and shift a data warehouse from a conventional RDBMS, which likely relies on quite a few of them.
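
To give a flavor of it, here is a minimal sketch of a JavaScript stored procedure (the procedure and table names are made up):

    CREATE OR REPLACE PROCEDURE truncate_table(table_name STRING)
      RETURNS STRING
      LANGUAGE JAVASCRIPT
      AS
      $$
        // Arguments are exposed to JavaScript in upper case (TABLE_NAME).
        // Note: concatenating identifiers is unsafe for untrusted input.
        var stmt = snowflake.createStatement({
          sqlText: "TRUNCATE TABLE " + TABLE_NAME
        });
        stmt.execute();
        return "Truncated " + TABLE_NAME;
      $$;

    CALL truncate_table('CUSTOMER_STAGING');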

However, the good news is that SQL appears to be the first of a series of additional languages (including Python!) that are going to be made available for writing stored procedures, with its preview estimated for June 2021.

We have looked at various Snowflake concepts that novices often misunderstand, and we have shed some light on them. Data and software engineers can benefit greatly from Snowflake’s elasticity and simplicity; however, when approaching any new technology, misconceptions can lead to big mistakes.

References

[1] What is a Relational Database? https://aws.amazon.com/relational-database/

[2] Understanding & Using Time Travel. https://docs.snowflake.com/en/user-guide/data-time-travel.html

[3] Cloning Considerations. https://docs.snowflake.com/en/user-guide/object-clone.html
