As you can see, the Dataiku job planner has combined our three recipes (activities) into a single SQL pipeline operation. Had we executed this same job without pipelines enabled, we would have seen three individual activities listed instead of the single combined activity in this example.
By running this job, we can see in the logs that, rather than writing out each recipe’s output individually, Dataiku exposes the intermediate steps’ results as new, temporary views created in the Snowflake database. For the “final” datasets that are written in the flow, Dataiku then selects the transformed data from those intermediate views.
The SQL statements below, taken from the Dataiku job logs, illustrate how Dataiku creates a temporary view that combines the recipes, then selects from that view to insert into the output dataset.
CREATE VIEW "DSSVIEW_LOAN_REQUESTS_SF_JOINED_PREPARED_thVGD" AS -- creation of a new, temporary view
SELECT
"id",
"Amount_Requested",
"Loan_Purpose",
… [REST OF THE RECIPE LOGIC HERE]
INSERT INTO "APPLICATIONS" -- inserting data from the temporary view into the output dataset
SELECT
"id" AS "id",
"Amount_Requested" AS "Amount_Requested",
…
"Status" AS "Status"
FROM "DSSVIEW_LOAN_REQUESTS_SF_JOINED_PREPARED_thVGD" "__dku_enriched_input"
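For contrast, here is a simplified sketch of what the same job would look like without SQL pipelines. The table names below are illustrative (they follow the dataset names in our flow, not statements taken from the actual logs), and the recipe logic is elided. The point is that each intermediate recipe materializes its full output as a table before the next recipe can read it:

```sql
-- Recipe 1: the join recipe writes a materialized intermediate dataset
CREATE TABLE "LOAN_REQUESTS_SF_JOINED" AS
SELECT ...;

-- Recipe 2: the prepare recipe reads that table and writes another one
CREATE TABLE "LOAN_REQUESTS_SF_JOINED_PREPARED" AS
SELECT ...
FROM "LOAN_REQUESTS_SF_JOINED";

-- Recipe 3: the final recipe inserts into the output dataset
INSERT INTO "APPLICATIONS"
SELECT ...
FROM "LOAN_REQUESTS_SF_JOINED_PREPARED";
```

With pipelines enabled, the first two steps collapse into a single lightweight view, so no intermediate rows are ever stored, and Snowflake can optimize the whole chain as one query when the final INSERT runs.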
Measuring the results
We’ve made the assertion in this article that using SQL pipelines along with "Virtualizable" datasets can not only reduce intermediate storage in your flows, but also reduce the processing time required to execute complex recipe chains. Although the example we’ve walked through here is neither a complex chain of recipes nor a large amount of data, enabling pipelines reduced the execution time of the same three recipes from 11s to 6s: (11 − 6) / 11 ≈ 45%, nearly halving the run time.
While this is a simple example and your results will vary with the size and complexity of your flow and tasks, this result should generally hold: there are significant time and cost savings to be gained by using SQL pipelines in Snowflake.