As you can see, the Dataiku job planner has combined our three recipes (activities) into a single SQL pipeline operation. Had we executed this same job without pipelines enabled, we would have seen three individual activities listed instead of the single combined activity in this example.
By running this job, we can see in the logs that, rather than writing out each recipe’s output individually, Dataiku exposes the intermediate steps’ results as new, temporary views created in the Snowflake database. For the “final” datasets that are written in the flow, Dataiku then selects the transformed data from those intermediate views.
The SQL statements below, taken from the Dataiku job logs, illustrate how Dataiku creates a temporary view that combines the recipes, then selects from that view to insert into the output dataset.
CREATE VIEW "DSSVIEW_LOAN_REQUESTS_SF_JOINED_PREPARED_thVGD" AS -- creation of a new, temporary view
SELECT
"id",
"Amount_Requested",
"Loan_Purpose",
… [REST OF THE RECIPE LOGIC HERE]
INSERT INTO "APPLICATIONS" -- inserting data from the temporary view into the output dataset
SELECT
"id" AS "id",
"Amount_Requested" AS "Amount_Requested",
…
"Status" AS "Status"
FROM "DSSVIEW_LOAN_REQUESTS_SF_JOINED_PREPARED_thVGD" "__dku_enriched_input"
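For contrast, here is a simplified sketch of what the same job would look like without SQL pipelines. The table names below are illustrative (they follow the dataset names in our flow, not statements taken from the actual logs), and the recipe logic is elided. The point is that each intermediate recipe materializes its full output as a table before the next recipe can read it:

```sql
-- Recipe 1: the join recipe writes a materialized intermediate dataset
CREATE TABLE "LOAN_REQUESTS_SF_JOINED" AS
SELECT ...;

-- Recipe 2: the prepare recipe reads that table and writes another one
CREATE TABLE "LOAN_REQUESTS_SF_JOINED_PREPARED" AS
SELECT ...
FROM "LOAN_REQUESTS_SF_JOINED";

-- Recipe 3: the final recipe inserts into the output dataset
INSERT INTO "APPLICATIONS"
SELECT ...
FROM "LOAN_REQUESTS_SF_JOINED_PREPARED";
```

With pipelines enabled, the first two steps collapse into a single lightweight view, so no intermediate rows are ever stored, and Snowflake can optimize the whole chain as one query when the final INSERT runs.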
Measuring the results
We’ve made the assertion in this article that using SQL pipelines along with "Virtualizable" datasets can not only reduce intermediate storage in your flows, but also reduce the processing time required to execute complex recipe chains. Although the example we’ve walked through here is neither a complex chain of recipes nor a large amount of data, enabling pipelines reduced the execution time of the same three recipes from 11s to 6s: (11 − 6) / 11 ≈ 45%, nearly halving the run time.
While this is a simple example and your results will vary with the size and complexity of your flow and tasks, this result should generally hold: there are significant time and cost savings to be gained by using SQL pipelines in Snowflake.