info@lasmart.biz

july 14, 2023

Using MS SQL functionality as a data source to increase the speed of SuperSet operation

When working with big data, it is necessary to pay special attention to query optimization. In this respect, the combination of MS SQL and Apache Superset solutions is very flexible in terms of distributing computation workloads.

In this article we will:

— Consider several examples for building virtual queries: direct query, using functions and stored procedures; 

— We will compare the speed of work for three variants of queries. 

As an example, we will use a fact table with the size of 37,896,435 rows.

Task: it is necessary to build the top 10 by the sum of sales for product categories, by the selected year and month in the filter.

1. Direct query to the database from SuperSet

The query for such a chart will have a structure, using jinja templates (you can read in the article Provisional filtering of the data in the subquery by filter values in Apache Superset.

Let’s create a dashboard for the specified functionality:

Request for chart:

 

select top 10 [ItemBrandID] 

              ,sum(CostNet) as Cost

from [dbo].[t_fact_Sales_by_month] 

where [YearID] in ({{ filter_values(‘year’, )[0] }}) 

and [MonthName] IN (‘{{ filter_values(‘month’, )[0] }}’) 

group by [ItemBrandID] 

order by sum(CostNet) desc

 

2. Function query

Let’s wrap the query from step 1 into a function, we will use the filter parameters as input:

 

create function [dbo].[f_top_10]

(

@YearID int, 

@MonthName nvarchar(4000) = »

)

return table 

as 

return (

select top 10 [ItemBrandID] ,

                   sum(CostNet) as Cost

from [dbo].[t_fact_Sales_by_month]

 where [YearID] in (@YearID)

     and [MonthName] IN (@MonthName)

group by [ItemBrandID]

order by sum(CostNet) desc

)

 

Query for a dataset:

select [ItemBrandID]

      ,[Cost] 

from [dbo].[f_top_10]

(

{{ filter_values(‘year’, )[0] }}

,'{{ filter_values(‘month’, )[0] }}’

)

We get the second table with similar data

 

3. Procedure Query
For a virtual query in SuperSet it is necessary to know what set of columns will be at its output, so it is not possible to use a stored procedure directly. In this case, we need to resort to the openquery construct.
Let’s wrap our query in a stored procedure:

 

create procedure [dbo].[p_top_10]

@YearID INT, 

@MonthName nvarchar(4000) = »

as  

begin

select top 10 [ItemBrandID] ,

           sum(CostNet) as Cost

from [dbo].[t_fact_Sales_by_month]

where [YearID] in (@YearID)

  and [MonthName] IN (@MonthName)

group by [ItemBrandID]

order by sum(CostNet) desc

end

 

 

Query for a dataset:

select 

  [ItemBrandID]

,[Cost]

from openquery([<LinkServerName>], 

‘exec [<DataBaseName>].[dbo].[p_top_10]

@YearID = {{ filter_values(‘year’, )[0] }}

,@MonthName = »{{ filter_values(‘month’, )[0] }}»

with result sets

((

[ItemBrandID] nvarchar(128)

,[Cost] money

))

)

We get a third table with similar data:

 

  1. Let’s compare the performance speed for the three query variants.

To compare the performance speed, we will use the same input data for the query, and measure it in SQLlab:

The input parameters will be: year — 2022 and month — August 

Data volume for testing — 37,896,435 rows

Simple query Function Procedure
Operation time, sec
1,2
0,7
0,67

It can be seen that the speed of execution of an ordinary query is 1.5 times less than the speed using a function or a stored procedure.

The stored procedure wins in terms of execution time, its advantage over the function in this case is small, but the gap will increase significantly as the complexity of calculations increases.

Click on the «Save» button. Log in as the user «user1» to verify the result. The test table should now only display one row.

For more information, please get in touch here:

Contact Form main