In the last few weeks, I had to munge data on Snowflake, and it had been almost a year since I had briefly played around with the platform. I had exactly one week to crunch stats on close to a million unstructured XML documents and create a demo. A cursory evaluation made me realize that Snowflake was the best platform to ingest and munge XML data. For a quick refresher on the platform, I went through a course on Udemy. This blogpost is a brief summary of a few of my learnings from the course.

Overview

  • Sharing a database in the world of Snowflake means that no copy is made; you are working with the very same database that the owner of the database uses
  • Snowflake calls itself a cloud data platform
  • It was designed from the ground up to be cloud-based, delivered as software-as-a-service
  • It was built to run on AWS, Azure or GCP
  • It allows users to access all their data in one place to make actionable decisions anytime, anywhere
  • At a high level, the snowflake platform has the following components
    • Complete SQL database
    • Self-tuning
    • All your data
      • Designed to hold structured, semi-structured and unstructured data
    • All your users
      • Handle any number of concurrent users
      • Scale up or Scale out to accommodate greater compute needs
    • Pay only for what you use
      • You are not purchasing hardware or software. Storage is inexpensive; you primarily pay for compute time, and only for what you use
    • Live Data Sharing
      • Supports complete sharing without creating a copy of it.
  • Snowflake supports one or more data warehouses, data lakes, data engineering, data exchange, data applications, and data science workloads
    • It does all the above under a single consistent platform
  • Snowflake is the single source of truth
  • It is not a siloed system where you have to worry about keeping your datamarts in sync
  • One platform - One copy of data - many workloads
  • Unlimited Performance and Scale
  • Secured and Governed Access to All Data
  • Near-Zero Maintenance as a Service

High-level Architecture

  • Hybrid of Shared Disk and Shared Nothing architectures
  • Uses a central data repository for persistent data
  • Similar to shared-nothing architectures, Snowflake processes data using massively parallel queries
  • Database storage - when data is loaded into Snowflake, it optimizes the data and stores it in a columnar format on AWS/Azure/GCP
  • Snowflake does not open up its object storage; the data can only be accessed through SQL queries
  • Each virtual warehouse is a massively parallel compute cluster.
  • Each compute cluster is independent and does not share resources with other clusters
  • Cloud Services layer
    • Collection of services that coordinates activities across the platform
    • Infrastructure manager
    • Optimizer
    • Metadata manager
    • Security
  • Scale out Services
    • Optimization
    • Management
    • Transactions
    • Security and Governance
    • Metadata
    • Sharing and Collaboration
  • Centralized Storage layer
  • Cloud Agnostic layer - Runs on AWS or Azure or GCP
  • Cloud - Compute and Storage
  • Any client that can speak JDBC/ODBC can connect to an instance of Snowflake
  • Snowflake is deployed in a VPC
  • Virtual warehouses - in Snowflake, a warehouse is a compute instance, not a database, so the word “warehouse” carries a slightly different meaning
  • The compute layer is completely decoupled from the storage layer
  • There are many sizes of compute instances to suit different data loads
  • One can create a separate virtual warehouse for each type of workload
  • If the data science team complains that queries are taking too long, one can easily scale the warehouse up to a larger size, as sketched below
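
As a concrete illustration of that last point, here is a minimal sketch of scaling a warehouse up and back down; `DS_WH` is a hypothetical warehouse name used purely for illustration:

```sql
-- Resize an existing warehouse when queries are taking too long.
-- Running queries finish on the old size; new queries use the new size.
ALTER WAREHOUSE DS_WH SET WAREHOUSE_SIZE = 'XLARGE';

-- Scale back down once the heavy workload is done, to conserve credits.
ALTER WAREHOUSE DS_WH SET WAREHOUSE_SIZE = 'MEDIUM';
```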

Plans

  • There are four editions - Standard, Enterprise, Business Critical and Virtual Private Snowflake
  • Compute and Storage cost is based on what cloud provider you choose
  • Compute credits are how you are billed for compute usage: you purchase a set number of credits, and they are consumed based on computational load
    • Compute usage
    • Warehouse usage - charged on a per-second basis
  • Compute instances come in 8 different sizes: X-Small has one server, 4X-Large has 128 servers
  • Warehouses are billed for credit usage while they are running
  • The credit numbers shown here are for a full hour of usage; however, credits are billed per second with a 60-second minimum
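
To make the per-second billing concrete, here is a back-of-the-envelope calculation. It assumes a Medium warehouse at 4 credits per full hour, consistent with the size-doubling table above; check your edition's rate card for actual figures:

```sql
-- Credits are billed per second with a 60-second minimum.
SELECT 4 / 3600.0 * 90 AS credits_for_a_90_second_run,  -- 0.1 credits
       4 / 3600.0 * 60 AS credits_for_a_30_second_run;  -- 0.067 credits (60s minimum applies)
```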

Virtual Warehouses

  • The term “virtual warehouse” is a bit of a misnomer; it has nothing to do with a traditional warehouse. It is simply the compute engine in Snowflake that runs queries
  • When you create a virtual warehouse, you are creating a wrapper around a cluster of servers with CPU and memory
  • The larger the warehouse, the more servers in the cluster.
  • Any job that requires power runs on a virtual warehouse
  • Examples
    • Virtual Warehouse 2X - Large
    • Virtual Warehouse Auto Scale - X-Large by 5: you would do this if you have concurrency problems. An X-Large cluster contains a certain number of machines, and up to 5 such clusters are allocated
    • Virtual warehouse Medium
    • Virtual warehouse Large
  • None of the workloads compete for compute resources
  • Every workload hits the same centralized storage
  • Types of virtual warehouses
    • Standard: Single compute cluster - Cannot scale out
    • Multi-Cluster: Possible to create additional compute clusters to respond to complex queries or heavy concurrency (see the sketch after this list)
  • Warehouses are sized in ‘T-shirt’ sizes - XS, S, M, etc.
  • The size determines the number of servers that comprise each cluster in a warehouse
  • Each larger size is double the preceding size and double the cost
  • Users do not have access to the clusters. Snowflake is a managed service; we have no idea where our queries are running. All we care about is whether the query gets executed
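
A minimal sketch of the “X-Large by 5” auto-scale setup mentioned above; `REPORTING_WH` is a hypothetical name, and multi-cluster warehouses require Enterprise edition or higher:

```sql
-- One X-Large cluster normally; up to five under heavy concurrency.
-- Snowflake adds and removes clusters automatically within these bounds.
CREATE WAREHOUSE IF NOT EXISTS REPORTING_WH
  WAREHOUSE_SIZE    = 'XLARGE'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 5
  AUTO_SUSPEND      = 60     -- suspend after 60 idle seconds
  AUTO_RESUME       = TRUE;  -- resume when the next query arrives
```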

Storage Cost

  • On-demand: the easiest and most flexible way to purchase the Snowflake service
  • Customers are charged a fixed rate for the services that are consumed and are billed in arrears every month
  • Common price across regions - about 40 USD per TB per month
  • Pre purchased capacity
    • Snowflake provides customers the option to pre-purchase capacity
    • A capacity purchase is a specific dollar commitment to snowflake
    • Common price across regions
  • For the AWS service, Snowflake charges 23 USD per TB under the pre-purchase plan and 40 USD per TB under the on-demand plan
  • Virtual warehouse cost is the bigger chunk of the cost paid by the customer
  • Snowflake supports a wide range of virtual warehouse sizes: X-Small, Small, Medium, Large, X-Large, 2X-Large, 3X-Large, and 4X-Large
  • The size of the virtual warehouse determines how fast queries will run
  • Keep auto-suspend enabled on your virtual warehouses; a warehouse that is not running consumes no Snowflake credits
  • Virtual warehouses are billed by the second with a one-minute minimum
  • Snowflake credits are used to pay for the consumption of resources on Snowflake. A Snowflake credit is a unit of measure, and it is consumed only when a customer is using resources, such as when a virtual warehouse is running, the cloud services layer is performing work, or serverless features are used
  • Customers who wish to move or copy their data between regions or clouds will incur data transfer charges
  • Make sure the data is compressed as much as possible before storage; in some cases, such as database tables, Snowflake compresses the data automatically
  • Snowflake works better with date or timestamp columns stored as such rather than as VARCHAR
  • Make more use of transient tables: they keep no fail-safe history, which reduces historical storage costs (a sketch follows this list)
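
The last three points translate directly into DDL choices; a minimal sketch, with an invented staging table:

```sql
-- Transient: no fail-safe and at most one day of time travel, so cheaper
-- historical storage; a good fit for data that can easily be rebuilt.
CREATE TRANSIENT TABLE stg_events (
    event_id   NUMBER,
    event_ts   TIMESTAMP_NTZ,  -- store timestamps as timestamps, not VARCHAR
    event_date DATE            -- proper DATE type: better pruning and compression
);
```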

Resource Monitors

  • It is important to track the consumption of the virtual warehouses
  • To control costs and avoid unexpected credit usage caused by running warehouses
  • Can be used to impose limits on the number of credits consumed by virtual warehouses
  • When limits are reached, the resource monitor can trigger various actions, such as sending a notification or suspending the warehouse
  • Can be created only by the ACCOUNTADMIN role; a sketch follows
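
A minimal sketch of a resource monitor that caps monthly spend; the monitor name, warehouse name and thresholds are all illustrative:

```sql
USE ROLE ACCOUNTADMIN;

-- Cap the month at 100 credits; warn early, suspend near the limit.
CREATE RESOURCE MONITOR monthly_cap
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 75  PERCENT DO NOTIFY
           ON 95  PERCENT DO SUSPEND            -- lets running queries finish
           ON 100 PERCENT DO SUSPEND_IMMEDIATE; -- cancels running queries too

-- Attach the monitor to a hypothetical warehouse.
ALTER WAREHOUSE DS_WH SET RESOURCE_MONITOR = monthly_cap;
```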

Fail safe

  • We can perform various DDL and DML operations on the current data
  • The time travel retention period depends on the table type
    • Transient - time travel of at most 1 day
    • Temporary - time travel of at most 1 day
    • Permanent table - time travel of up to 90 days (on Enterprise edition and above)
  • Once the data has completed the time travel retention period, it moves to the fail-safe zone
    • Permanent table - fail-safe period of 7 days
    • Transient table - no fail-safe (0 days)
  • Nobody can recover data from the fail-safe zone except Snowflake; users do not have access to it
  • Fail safe ensures historical data is protected in the event of a system failure or any disaster
  • Provides a non-configurable 7 day period during which the historical data is recoverable by Snowflake
  • Period starts immediately after the time travel retention period ends
  • Fail safe is not provided as a means for accessing historical data after the time travel retention period has ended
  • It is for use only by snowflake to recover data that may have been lost or damaged due to extreme operational failures
  • Any object in the fail-safe zone has an associated storage cost
  • One should use transient tables during testing and dev phase
    • Transient tables do not have fail safe feature
  • Why fail-safe over backups?
    • Data corruption or loss can happen with any database management system
    • To mitigate the risk, DBAs used to perform full and incremental backups, but this can double or even triple overall data storage
    • Data recovery can be painful and costly due to various factors
      • Time required to reload lost data
      • Business downtime during recovery
      • Loss of data since the last backup
    • Fail safe provides an efficient and cost-effective alternative to backup that eliminates the risk and scales with your data
  • How much storage has the fail-safe zone consumed? This can be checked programmatically, as sketched below
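
A minimal sketch answering that question via the account usage views. It assumes access to the SNOWFLAKE database (granted to ACCOUNTADMIN by default), and note that ACCOUNT_USAGE views can lag by an hour or two; the `orders` table in the time travel query is illustrative:

```sql
-- Per-table breakdown of active, time-travel, and fail-safe storage.
SELECT table_catalog,
       table_schema,
       table_name,
       active_bytes,
       time_travel_bytes,
       failsafe_bytes
FROM snowflake.account_usage.table_storage_metrics
ORDER BY failsafe_bytes DESC;

-- While still inside the retention period, time travel itself is just SQL:
SELECT * FROM orders AT (OFFSET => -3600);  -- the table as of one hour ago
```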

Intro to Snowflake

  • A cloud data warehouse runs only in the cloud; the underlying infrastructure can come from any of the cloud providers
  • Snowflake is in high demand because
    • time travel
    • fail safe
    • cloning
    • sharing
  • Charges the customer based on storage and compute
  • “Virtual Warehouse” - computing engine to run queries
  • Data storing on Snowflake - separate cost
  • No upfront payment for compute and storage is needed
  • No need to set up any infrastructure
  • Elastic and highly scalable

Secure Data Sharing

  • This is one of the most appealing features of Snowflake
  • Traditionally, you share data via email, FTP or a Google Drive link
  • Snowflake's secure data sharing streamlines this process
  • One can share data with any person with ease
  • Provider and Consumer accounts
  • Data sharing - ability to share the same data resource with multiple applications or users
  • Data sharing surfaces more relevant data, generating more robust data and analytics to solve business challenges and meet enterprise goals
  • Types of accounts when sharing data on snowflake
    • Provider accounts: any Snowflake account which produces or creates data and shares it with other Snowflake accounts
      • No hard limits on the number of shares you can create or number of accounts you can add
    • Consumer accounts: Any snowflake account which consumes the data
      • No hard limits on the number of shares you can consume from data providers
  • List of objects to share
    • Tables
    • External tables
    • Secure views
    • Secure materialized views
    • Secure UDFs
  • How does sharing work?
    • No actual data is transferred between accounts
    • Consumer accounts do not pay for storage
      • Compute charges for queries are paid by the consumer
    • Sharing is done using cloud services layer and metadata store
    • No DML is possible on shared database
  • Shares can be created by the ACCOUNTADMIN role only
  • You have to grant privileges before sharing data with any consumer account
  • Grants should be provided from the top down to granular-level objects (see the sketch after this list)
    • Database - Schema - Table
  • All sharing with consumer accounts gives read access only
  • You cannot share data across cloud providers and across cloud regions
    • You need to replicate the data and then share it with a different cloud or a different region
  • Once we start consuming data, the shared databases are listed in our account
  • Creating a clone of a shared database is not allowed
  • If you are working on a shared database, you cannot clone any database object from it
    • Cloning takes its reference from metadata in the cloud services layer
    • Since the metadata lives with the provider account, the consumer account cannot clone it
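
A minimal sketch of the provider-to-consumer flow; the share, database and table names and the account locator `xy12345` are all made up:

```sql
-- Provider account: create a share and grant privileges top-down.
USE ROLE ACCOUNTADMIN;
CREATE SHARE sales_share;
GRANT USAGE  ON DATABASE sales_db               TO SHARE sales_share;
GRANT USAGE  ON SCHEMA   sales_db.public        TO SHARE sales_share;
GRANT SELECT ON TABLE    sales_db.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = xy12345;  -- the consumer's account locator

-- Consumer account: mount the share as a read-only database (no DML allowed).
CREATE DATABASE sales_from_provider FROM SHARE provider_acct.sales_share;
```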

Different roles in Snowflake

  • The default role is SYSADMIN
  • ACCOUNTADMIN is the highest role available
  • SECURITYADMIN
    • Role that can monitor and manage users and roles
  • USERADMIN
    • Creates users and roles
  • SYSADMIN
    • create warehouses and databases
  • PUBLIC
    • Automatically granted to every user
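
A small sketch of how these roles divide the work; the user, role and object names are invented:

```sql
-- USERADMIN handles users and roles...
USE ROLE USERADMIN;
CREATE ROLE analyst;
CREATE USER jdoe PASSWORD = 'Chang3MePlease!' DEFAULT_ROLE = analyst;
GRANT ROLE analyst TO USER jdoe;

-- ...while SYSADMIN owns warehouses and databases and hands out access.
USE ROLE SYSADMIN;
GRANT USAGE ON WAREHOUSE ds_wh    TO ROLE analyst;
GRANT USAGE ON DATABASE  sales_db TO ROLE analyst;
```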

Databases, schemas and Tables

  • Tables fall under schemas, and schemas fall under databases
  • You can create multiple tables within a schema; the hierarchy is sketched below
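
The hierarchy in SQL, with invented names:

```sql
CREATE DATABASE demo_db;
CREATE SCHEMA demo_db.raw;

-- Fully qualified name: database.schema.table
CREATE TABLE demo_db.raw.customers (id NUMBER, name VARCHAR);
CREATE TABLE demo_db.raw.orders    (id NUMBER, customer_id NUMBER);  -- many tables per schema
```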

Snowflake architecture

  • Stores all data in a hybrid columnar format
  • Query processing is done by virtual warehouses
    • Can be called the muscle of the system
    • Performs massively parallel processing
    • Can be scaled up or down automatically
  • Cloud Services - BRAIN
    • A collection of independently scalable services
    • Monitors, optimizes and handles other crucial data management tasks
  • Virtual warehouse creation
    • Size
      • Credits used up
    • Clusters - Min max
    • Scaling policy
    • Auto suspension
    • Auto resume
    • Once queries are submitted, the warehouse resumes and executes them
  • Standard vs Economy scaling policy
    • Standard : Cluster starts immediately when either a query is queued or the system detects that there’s one more query than the currently-running clusters can execute
      • Prevents queuing by favoring starting additional clusters over conserving credits
    • Economy: a cluster starts only if the system estimates there’s enough query load to keep it busy for at least 6 minutes
      • Conserves credits by favoring keeping running clusters fully loaded rather than starting additional clusters, which may result in queries being queued and taking longer to complete
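
The scaling policy is a property of the warehouse; a minimal sketch, reusing the hypothetical `REPORTING_WH` from earlier:

```sql
-- ECONOMY favors fully loading existing clusters over starting new ones;
-- STANDARD (the default) favors starting clusters to prevent queuing.
ALTER WAREHOUSE REPORTING_WH SET SCALING_POLICY = 'ECONOMY';
```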

Snowflake Pricing

  • Very important aspect while working on snowflake
  • Separates compute and storage cost
  • Charges based on consumed snowflake credits
  • Value of snowflake credits is based on snowflake edition

Snowflake credit

  • Snowflake credit is a unit of measure
  • Snowflake credits are used to pay for the consumption of resources on Snowflake
  • It is consumed only when a customer is using resources, such as when a virtual warehouse is running, the cloud services layer is performing work
  • User receives 400 dollars worth of free usage upon creation of Snowflake trial account
    • About 100 credits

Snowflake editions

  • Standard
    • A complete data warehouse on which you can run all your queries
    • Share data across regions through the various cloud providers
    • Sharing from one account to another account
    • 1 day of time travel
    • Data inside the snowflake is encrypted
    • Federated Authentication - Map Azure directory structure to enable single sign on
    • Database replication
    • Customer dedicated virtual warehouses
    • External functions
    • Snowsight analytics UI
    • Data marketplace access
  • Enterprise
    • Standard +
    • Up to 90 days of time travel
    • Annual rekey of all encrypted data
    • Materialized views
    • Search optimization service
    • Dynamic data masking
    • External data tokenization
  • Business Critical
  • Virtual Private Snowflake

Serverless features

  • Snowpipe
  • Database replication
  • Materialized views Maintenance
  • Automatic clustering
  • Search Optimization Service

Storage cost

  • On demand
    • Easiest and most flexible way to purchase the snowflake service is on demand
    • Customers are charged a fixed rate for the services that are consumed and are billed in arrears
    • Common price across regions is 40 USD per month per TB
  • Pre purchased
    • Snowflake provides customers the option to pre-purchase capacity
    • A capacity purchase is a specific dollar commitment to Snowflake
    • Common price across regions

Virtual Warehouse cost

  • Wide range of warehouses in tshirt sizes
  • Size determines how fast the queries will run
  • When a virtual warehouse is in sleep mode, it does not consume any Snowflake credits
  • Credits consumed per hour based on the type of warehouse chosen

Cloud Services cost

  • Cloud services are automatically assigned by snowflake based on requirements of the workload
  • Cloud services consumption of up to 10% of daily compute credits is included
  • Below that threshold, customers will not see incremental charges for cloud services cost

Data Transfer cost

  • From Snowflake to an external destination
  • From one snowflake account to another account in different region

Usage

  • Storage
  • Compute
  • Data Transfer
  • Storage usage is reported in three buckets
    • Active bytes
    • Time travel bytes
    • Fail-safe bytes
  • A transient table does not incur fail-safe storage
  • One can programmatically obtain the credits consumed, as sketched below
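
For example, per-warehouse credit consumption can be pulled from the metering history view (assuming access to the SNOWFLAKE database):

```sql
-- Credits consumed per warehouse over the last 7 days.
SELECT warehouse_name,
       SUM(credits_used) AS credits_last_7_days
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_last_7_days DESC;
```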

Optimization methods to reduce Snowflake costs

  • Depending on your location, it is important to choose the cloud region wisely, both to minimize latency and to have access to the required set of features
  • Auto-suspend and auto-resume should be used
  • Workload/data usage monitoring at the account, warehouse, database or table level is necessary
  • Make sure the data is compressed as much as possible before storage
  • Snowflake works better with date or timestamp columns stored as such rather than as VARCHAR
  • Make more use of transient tables: they keep no fail-safe history, which reduces historical storage costs

Takeaways

Some of the courses on Udemy on Snowflake are worth one’s time, especially if you have never experimented with the platform. One can always read the documentation and play around with the platform, which of course one eventually will have to. But to get a sense of the various components that make up the Snowflake platform, Udemy can be a useful starting point.