Road to Snowflake SnowPro Core Certification: Data Loading
Eighth Chapter: Data Loading
Hello everybody. In this chapter, we are going to study the different ways to COPY data into Snowflake tables, either using bulk load or continuous load with Snowpipe.
Remember that all the chapters from the course can be found at the following link.
DATA LOADING
When we load the data into Snowflake, we usually perform the following steps:
- Source system → Snowflake Stage → Snowflake Table.
A source system (for example, your application, or really any machine that generates data, such as sensors) produces data and sends it to a stage, for example, AWS S3. Once the data is in a stage, we copy it into Snowflake tables. There are two different ways to do this, Bulk Load and Continuous Load; let’s study the differences.
BULK LOAD
Bulk load is the process of loading batches of data from files already available at any stage into Snowflake tables. We use the COPY command to do that (it consumes credits from the virtual warehouses). It supports data transformation while loading, using column reordering, column omission, casting… Let’s see how the COPY INTO command works.
COPY INTO
Using this command, you can load data from staged files to an existing table. It copies the data into tables. It works for any stage (internal, external, table, and user stages) or even from an external location like S3, Azure… Some factors affect the loading time, like the physical location of the stage, GZIP compression efficiency (files are automatically compressed using gzip unless compression is explicitly disabled), or the number and types of columns.
With this command you can both load data into tables and unload data from them:
- LOAD → COPY data from the stages into a table. You load data into Snowflake tables.
- UNLOAD → COPY data from the table to stages. You unload data from the tables to a different location (internal stage, external stage, or external location).
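As a quick, hedged illustration of both directions (the table, stage, and path names below are made up for this example):

```sql
-- LOAD: copy staged CSV files into an existing table
COPY INTO my_table
FROM @my_stage/data/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- UNLOAD: write the table contents back out to a stage as files
COPY INTO @my_stage/unload/
FROM my_table
FILE_FORMAT = (TYPE = 'CSV');
```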
Some extra considerations:
- You need to specify the table name where you want to copy the data, the stage where the files are, the file/patterns you want to copy, and the file format.
- 64 days of metadata. Information about the loaded files is stored in Snowflake metadata. You cannot COPY the same file again within the next 64 days unless you explicitly force it (with the “FORCE = TRUE” option).
- You cannot load/unload files directly from/to your local drive with COPY INTO; you first upload them to an internal stage with the PUT command (or download them with GET), as shown in the sketch after this list.
- Some transformations, like FLATTEN, JOIN, GROUP BY, filters, or aggregations, are not supported.
- Using the Snowflake web UI, you can only load files of up to 250 MB. You can copy bigger files using SnowSQL.
- Organizing input data by granular path can improve load performance.
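Since COPY INTO cannot read from your local drive directly, a common pattern is to upload the file to an internal stage first with the PUT command from SnowSQL and then copy it. A minimal sketch (the file path and table name are hypothetical):

```sql
-- Run from SnowSQL: upload a local file to the table's internal stage
PUT file:///tmp/events.csv @%my_table;

-- Then load it into the table with COPY INTO
COPY INTO my_table
FROM @%my_table
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```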
On error:
Loading some of the files might produce errors, for example, when you are copying .csv files and the data inside a file is malformed. You can control this behavior with the ON_ERROR option, which accepts several values:
- ABORT_STATEMENT → Abort the load operation if any error is found in a data file. If you don’t specify any value for the ON_ERROR option, this is the DEFAULT VALUE for bulk loading.
- CONTINUE → Continue loading the file.
- SKIP_FILE → Skip file if there are errors in the files.
- SKIP_FILE_num → Skip the file when the number of error rows in the file equals or exceeds the specified number.
- SKIP_FILE_num% → Skip the file when the percentage of error rows in the file exceeds the specified percentage.
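For example, a hedged sketch of how the option is passed (the table and stage names are invented):

```sql
-- Skip any file in which more than 5% of the rows contain errors
COPY INTO my_table
FROM @my_stage
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
ON_ERROR = 'SKIP_FILE_5%';
```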
Some other options from the COPY INTO command that normally appear in the exam:
- PATTERN = '<pattern>' → Load files from a stage into the table, using pattern matching.
- FORCE = TRUE → Once the files are copied into a table, they cannot be copied again in the next 64 days because of the files’ metadata. If this option is true, it loads all files, regardless of whether they’ve been loaded previously and have not changed since they were loaded.
- PURGE = TRUE → It specifies whether to automatically remove the data files from the stage after loading the data successfully. If the purge operation fails for any reason, no error is returned. An excellent way to check whether there was an error or not would be to list the files in the stage with the LIST command (for example, “LIST @my_stage”).
- MAX_FILE_SIZE → You can specify the maximum size for each file when unloading the data with this option.
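A sketch combining several of these options (the stage, table, and pattern are purely illustrative):

```sql
-- Load only files matching the pattern, reload them even if already seen,
-- and delete them from the stage after a successful load
COPY INTO my_table
FROM @my_stage
PATTERN = '.*sales_.*[.]csv'
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
FORCE = TRUE
PURGE = TRUE;

-- Verify that the purged files are really gone
LIST @my_stage;

-- MAX_FILE_SIZE applies when unloading: cap each output file at ~50 MB
COPY INTO @my_stage/unload/
FROM my_table
MAX_FILE_SIZE = 52428800;
```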
CONTINUOUS LOAD
Load small volumes of data (micro-batches) and incrementally make them available for analysis. There are different ways to do that:
- Snowpipe → The easiest and most popular way to do it. We’ll see it in this chapter.
- Snowflake Connector for Kafka → Reads data from Apache Kafka topics and loads the data into a Snowflake table.
- Third-Party Data Integration Tools → You can do it with other supported integration tools. You can see the list at the following link.
SNOWPIPE
Snowpipe enables loading data when the files are available in any (internal/external) stage. You use it when you have a small volume of frequent data, and you load it continuously (micro-batches). Snowpipe is serverless, which means that it doesn’t use Virtual Warehouses. It is used for Streaming / Near Real-Time data. An important thing to know is that Snowpipe does not guarantee that files will be loaded in the same order as staged. It usually processes the oldest files first, but there is no guarantee.
The real question is, how can Snowpipe detect new files in the stage? There are two different ways to do it:
- Automating Snowpipe using cloud messaging → You can trigger the load when an event occurs in the stage (for example, when a new file arrives), using Amazon SQS (Simple Queue Service) notifications for an S3 bucket; see the pipe sketch after this list. You can follow this tutorial to learn how to do it.
- Calling Snowpipe REST endpoints → Your client application calls a public REST endpoint with the name of a pipe object and a list of data filenames. It requires key pair authentication with JSON Web Token (JWT), and you have the “insertFiles”, “insertReport”, and “loadHistoryScan” APIs to do it. You have more information at the following link.
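To make the auto-ingest flow concrete, here is a minimal pipe sketch (the pipe, table, and stage names are made up, and the external stage is assumed to already point at an S3 bucket with event notifications configured):

```sql
-- Create a pipe that runs its COPY INTO statement whenever the cloud
-- provider notifies Snowflake that new files have landed in the stage
CREATE PIPE my_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO my_table
  FROM @my_ext_stage
  FILE_FORMAT = (TYPE = 'JSON');

-- SHOW PIPES returns the notification channel (e.g., the SQS queue ARN)
-- that you configure on the S3 bucket
SHOW PIPES;

-- Check the pipe's current execution state
SELECT SYSTEM$PIPE_STATUS('my_pipe');
```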
Some extra considerations:
- 14 days of metadata (for COPY INTO it was 64 days). You cannot copy the same files again within these 14 days.
- By default, Snowpipe uses “SKIP_FILE” when there is an error loading files, whereas the default ON_ERROR option for COPY INTO is ABORT_STATEMENT.
TYPICAL SNOWPRO CORE EXAM QUESTIONS
What are the usual data-loading steps in Snowflake?
- Source → Snowflake Stage → Snowflake table
- Source → Snowflake Table → Snowflake stage
- Snowflake table → Source → Snowflake Stage
Solution: 1
What key concepts will need to be considered while loading data into Snowflake?
- Stage
- File Format
- Transformation
- Region of your Snowflake account
- Error validation
Solution: 1, 2, 3, 5.
Using the COPY INTO <location> command, you can unload data from a table into which locations?
- Named internal stage (or table/user stage).
- Named external stage that references an external location (Amazon S3, Google Cloud Storage, or Microsoft Azure).
- An external location like Amazon S3 or Azure.
- Local Drive
Solution: 1, 2, 3. Once the data is in the internal stage, you can download it to your local drive using the GET command. You can also unload data into an external location.
After how many days does the COPY INTO load metadata expire?
- 1 day
- 14 days
- 64 days
- 180 days
Solution: 3
While loading data through the COPY command, you can transform the data. Which of the below transformations are allowed?
- Truncate columns
- Omit columns
- Filters
- Reorder columns
- Cast
- Aggregate
Solution: 1, 2, 4, 5
After a successful load into a Snowflake table with the COPY INTO command, what option will you specify to delete the stage files?
- DELETE = TRUE
- REMOVE = TRUE
- PURGE = TRUE
- TRUNCATE = TRUE
Solution: 3.
In which of the below scenarios is Snowpipe recommended to load data?
- We have a small volume of frequent data
- We have a huge volume of data generated as part of a batch schedule
- In both of the previous scenarios
Solution: 1
Is Snowpipe Serverless?
- True
- False
Solution: 1
After how many days does the load history of Snowpipe expire?
- 1 day
- 14 days
- 90 days
- 180 days
Solution: 2. The load history is stored in the metadata of the pipe for 14 days. It must be requested from Snowflake via a REST endpoint, SQL table function, or ACCOUNT_USAGE view.
Can Snowpipe load a file with the same name if it has been modified later?
- True
- False
Solution: 2. This is because of the Snowpipe load metadata, which tracks files by name. Since the modified file keeps the same name, Snowpipe considers it already loaded and won’t copy it again.
Does Snowpipe guarantee that files are loaded in the same order they are staged?
- True
- False
Solution: 2. It usually processes the oldest files first, but there is no guarantee.
Which of the below APIs are Snowpipe REST APIs?
- insertFiles
- insertReport
- insertHistoryScan
- loadFiles
- loadHistoryScan
Solution: 1, 2, 5
Which data-loading method requires a user-specified warehouse to execute COPY statements?
- Bulk Data Load
- Snowpipe
- Both
Solution: 1
Thanks for Reading!
If you like my work and want to support me…
- The BEST way is to follow me on Medium here.
- Feel free to clap if this post is helpful for you! :)
- More Snowflake SnowPro Core practice exam questions? Find them at FullCertified.com!