Azure Synapse serverless SQL pool

What is Synapse?

Azure Synapse Analytics is an Azure service used to integrate and analyze large volumes of data.

Azure Synapse SQL is a distributed query system in Azure Synapse Analytics that offers two kinds of environment:

  • Serverless SQL pool: used to work with data that is in the data lake.
  • Dedicated SQL pool: an enterprise-scale relational database instance used to host a data warehouse in which data is stored in relational tables.

In this module, we'll focus on a serverless SQL pool, which provides a pay-per-query endpoint to query the data in your data lake.

The benefits of using a serverless SQL pool include:

  • A familiar Transact-SQL syntax to query data in place without the need to copy or load data into a specialized store.
  • Integrated connectivity from a wide range of business intelligence and ad-hoc querying tools, including the most popular drivers.
  • Distributed query processing that is built for large-scale data, and computational functions - resulting in fast query performance.
  • Built-in query execution fault-tolerance, resulting in high reliability and success rates even for long-running queries involving large data sets.
  • No infrastructure to set up or clusters to maintain. A built-in endpoint for this service is provided within every Azure Synapse workspace, so you can start querying data as soon as the workspace is created.
  • No charge for reserved resources; you're only charged for the data processed by the queries you run.
Serverless SQL pool is tailored for querying the data residing in the data lake, so in addition to eliminating the management burden, it eliminates the need to worry about ingesting the data into the system. You just point the query at the data that is already in the lake and run it.

Common use cases for serverless SQL pools include:

  • Data exploration: Data exploration involves browsing the data lake to get initial insights about the data, and is easily achievable with Azure Synapse Studio. You can browse through the files in your linked data lake storage, and use the built-in serverless SQL pool to automatically generate a SQL script that selects the TOP 100 rows from a file or folder, just as you would with a table in SQL Server. From there, you can apply projections, filtering, grouping, and most other operations over the data as if the data were in a regular SQL Server table.
  • Data transformation: While Azure Synapse Analytics provides great data transformation capabilities with Synapse Spark, some data engineers might find data transformation easier to achieve using SQL. Serverless SQL pool enables you to perform SQL-based data transformations, either interactively or as part of an automated data pipeline (see the sketch after this list).
  • Logical data warehouse: After your initial exploration of the data in the data lake, you can define external objects such as tables and views in a serverless SQL database. The data remains stored in the data lake files, but is abstracted by a relational schema that can be used by client applications and analytical tools to query the data as they would in a relational database hosted in SQL Server.
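As a rough illustration of the data transformation use case, here's a minimal CETAS (CREATE EXTERNAL TABLE AS SELECT) sketch that reads CSV files and persists a transformed result as Parquet in the data lake. The object names (dbo.transformed_orders, ParquetFormat) and the columns are assumptions; the external data source files and the file format are assumed to exist and would be created as shown later in this module.

-- Sketch: transform CSV data and persist the result as Parquet (CETAS).
-- Assumes the external data source 'files' and the external file format
-- 'ParquetFormat' already exist.
CREATE EXTERNAL TABLE dbo.transformed_orders
WITH (
    DATA_SOURCE = files,
    LOCATION = 'transformed/orders/',
    FILE_FORMAT = ParquetFormat
)
AS
SELECT order_id,
       CAST(order_date AS DATE) AS order_date,
       UPPER(customer) AS customer
FROM OPENROWSET(
    BULK 'orders/*.csv',
    DATA_SOURCE = 'files',
    FORMAT = 'csv',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS source_rows;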

Query files using a serverless SQL pool



You can query files in the data lake directly using the OPENROWSET function. The general pattern, with an optional WITH clause that defines an explicit schema, looks like this:

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.blob.core.windows.net/data/files/*.csv',
    FORMAT = 'csv')
WITH ([column_name] data_type)  -- optional: define an explicit schema
AS rows

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://storagedata1993.dfs.core.windows.net/retail/stage/moviesDB.csv',
    FORMAT = 'csv',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE) AS rows

OPENROWSET function:

Its parameters include:

  • the schema of the resulting rowset
  • additional formatting options for delimited text files

OPENROWSET must be given an alias (such as rows).

BULK parameter: specifies the data lake location containing the file(s). You can use wildcards in the path to include or exclude files:

  • https://mydatalake.blob.core.windows.net/data/files/file1.csv: Only include file1.csv in the files folder.
  • https://mydatalake.blob.core.windows.net/data/files/file*.csv: All .csv files in the files folder with names that start with "file".
  • https://mydatalake.blob.core.windows.net/data/files/*: All files in the files folder.
  • https://mydatalake.blob.core.windows.net/data/files/**: All files in the files folder, and recursively its subfolders.

You can also specify multiple file paths in the BULK parameter, separating each path with a comma, as in the sketch below.
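A minimal sketch of the multiple-path form (the file names are hypothetical):

SELECT TOP 100 *
FROM OPENROWSET(
    BULK (
        'https://mydatalake.blob.core.windows.net/data/files/file1.csv',
        'https://mydatalake.blob.core.windows.net/data/files/file2.csv'
    ),
    FORMAT = 'csv',
    PARSER_VERSION = '2.0'
) AS rows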


--OPENROWSET syntax for reading Parquet or Delta Lake files
OPENROWSET  
( { BULK 'unstructured_data_path' , [DATA_SOURCE = <data source name>, ]
    FORMAT= ['PARQUET' | 'DELTA'] }  
)  
[WITH ( {'column_name' 'column_type' }) ]
[AS] table_alias(column_alias,...n)

--OPENROWSET syntax for reading delimited text files
OPENROWSET  
( { BULK 'unstructured_data_path' , [DATA_SOURCE = <data source name>, ] 
    FORMAT = 'CSV'
    [ <bulk_options> ]
    [ , <reject_options> ] }  
)  
WITH ( {'column_name' 'column_type' [ 'column_ordinal' | 'json_path'] })  
[AS] table_alias(column_alias,...n)
 
<bulk_options> ::=  
[ , FIELDTERMINATOR = 'char' ]    
[ , ROWTERMINATOR = 'char' ] 
[ , ESCAPECHAR = 'char' ] 
[ , FIRSTROW = 'first_row' ]     
[ , FIELDQUOTE = 'quote_characters' ]
[ , DATA_COMPRESSION = 'data_compression_method' ]
[ , PARSER_VERSION = 'parser_version' ]
[ , HEADER_ROW = { TRUE | FALSE } ]
[ , DATAFILETYPE = { 'char' | 'widechar' } ]
[ , CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ]
[ , ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}' ]

<reject_options> ::=  
{  
    | MAXERRORS = reject_value,  
    | ERRORFILE_DATA_SOURCE = <data source name>,
    | ERRORFILE_LOCATION = '/REJECT_Directory'
}


The PARSER_VERSION is used to determine how the query interprets the text encoding used in the files. Version 1.0 is the default and supports a wide range of file encodings, while version 2.0 supports fewer encodings but offers better performance. 

Parameters commonly used when working with delimited text files (a tab-delimited sketch follows this list):

  • FIELDTERMINATOR - the character used to separate field values in each row. For example, a tab-delimited file separates fields with a TAB (\t) character. The default field terminator is a comma (,).
  • ROWTERMINATOR - the character used to signify the end of a row of data. For example, a standard Windows text file uses a combination of a carriage return (CR) and line feed (LF), which is indicated by the code \n; while UNIX-style text files use a single line feed character, which can be indicated using the code 0x0a.
  • FIELDQUOTE - the character used to enclose quoted string values. For example, to ensure that the comma in the address field value 126 Main St, apt 2 isn't interpreted as a field delimiter, you might enclose the entire field value in quotation marks like this: "126 Main St, apt 2". The double-quote (") is the default field quote character.
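A minimal sketch that reads a hypothetical tab-delimited file (the name products.tsv is an assumption) by overriding the default field terminator:

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.blob.core.windows.net/data/files/products.tsv',
    FORMAT = 'csv',
    PARSER_VERSION = '2.0',
    FIELDTERMINATOR = '\t'
) AS rows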
Querying JSON files: JSON is a popular format for web applications that exchange data through REST interfaces or use NoSQL data stores such as Azure Cosmos DB. There is no dedicated JSON format in OPENROWSET; instead, each document is read as a single text value by using the csv format with the field and row terminators set to a character that doesn't occur in the data (0x0b, the vertical tab), so every document lands in one NVARCHAR(MAX) column named doc:

SELECT doc
FROM
    OPENROWSET(
        BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json',
        FORMAT = 'csv',
        FIELDTERMINATOR ='0x0b',
        FIELDQUOTE = '0x0b',
        ROWTERMINATOR = '0x0b'
    ) WITH (doc NVARCHAR(MAX)) as rows

Extract individual values from the JSON with the JSON_VALUE function:

SELECT JSON_VALUE(doc, '$.product_name') AS product,
           JSON_VALUE(doc, '$.list_price') AS price
FROM
    OPENROWSET(
        BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json',
        FORMAT = 'csv',
        FIELDTERMINATOR ='0x0b',
        FIELDQUOTE = '0x0b',
        ROWTERMINATOR = '0x0b'
    ) WITH (doc NVARCHAR(MAX)) as rows
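As an alternative to repeated JSON_VALUE calls, a sketch that uses OPENJSON to project several properties in one pass (the property names match the example above):

SELECT j.product_name, j.list_price
FROM
    OPENROWSET(
        BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json',
        FORMAT = 'csv',
        FIELDTERMINATOR = '0x0b',
        FIELDQUOTE = '0x0b',
        ROWTERMINATOR = '0x0b'
    ) WITH (doc NVARCHAR(MAX)) AS rows
CROSS APPLY OPENJSON(doc)
    WITH (
        product_name VARCHAR(100) '$.product_name',
        list_price DECIMAL(10,2) '$.list_price'
    ) AS j;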

Querying Parquet files:

In most cases, the schema of the data is embedded within the Parquet file, so you only need to specify the BULK parameter with a path to the file(s) you want to read and a FORMAT parameter of parquet, like this:

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.blob.core.windows.net/data/files/*.*',
    FORMAT = 'parquet') AS rows

We can partition the data using folders, like this:

  • /orders
    • /year=2020
      • /month=1
        • /01012020.parquet
        • /02012020.parquet
        • ...
      • /month=2
        • /01022020.parquet
        • /02022020.parquet
        • ...
      • ...
    • /year=2021
      • /month=1
        • /01012021.parquet
        • /02012021.parquet
        • ...
      • ...

To read only specific partitions, use the filepath function to filter on the folder values matched by the wildcards:

SELECT *
FROM OPENROWSET(
    BULK 'https://mydatalake.blob.core.windows.net/data/orders/year=*/month=*/*.*',
    FORMAT = 'parquet') AS orders
-- filepath(n) returns the value matched by the nth wildcard in the BULK path:
-- here filepath(1) is the year folder and filepath(2) is the month folder.
WHERE orders.filepath(1) = '2020'
    AND orders.filepath(2) IN ('1','2');

Create External Objects:

Create a database in a serverless SQL pool: You can create a database in two ways:

 a) the graphical interface, or b) a SQL command.

One consideration is to set the collation of your database so that it supports the conversion of text data in files to appropriate Transact-SQL data types.

CREATE DATABASE SalesDB
    COLLATE Latin1_General_100_BIN2_UTF8

The Latin1_General_100_BIN2_UTF8 collation makes it easier to import UTF-8 encoded text data into VARCHAR columns.

Create an External Data Source: An external data source encapsulates a connection to a file system location in a data lake. You can then use this connection to specify relative paths to the locations where the data files for queries and external tables are saved.

CREATE EXTERNAL DATA SOURCE files
WITH (
    LOCATION = 'https://mydatalake.blob.core.windows.net/data/files/'
)


SELECT *
FROM
    OPENROWSET(
        BULK 'orders/*.csv',
        DATA_SOURCE = 'files',
        FORMAT = 'csv',
        PARSER_VERSION = '2.0'
    ) AS orders

In this example, the BULK parameter is used to specify the relative path for all .csv files in the orders folder, which is a subfolder of the files folder referenced by the data source.

Another benefit of using a data source is that you can assign a credential for the data source to use when accessing the underlying storage, enabling you to provide access to data through SQL without permitting users to access the data directly in the storage account. 

CREATE DATABASE SCOPED CREDENTIAL sqlcred
WITH
    IDENTITY='SHARED ACCESS SIGNATURE',  
    SECRET = 'sv=xxx...';
GO

CREATE EXTERNAL DATA SOURCE secureFiles
WITH (
    LOCATION = 'https://mydatalake.blob.core.windows.net/data/secureFiles/',
    CREDENTIAL = sqlcred
);
GO

Create External File Format: An external file format defines the format of the data files that an external table reads, for example delimited text or Parquet.

CREATE EXTERNAL FILE FORMAT CsvFormat
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS(
            FIELD_TERMINATOR = ',',
            STRING_DELIMITER = '"'
        )
    );
GO
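A file format for Parquet can be defined the same way; a sketch (the name ParquetFormat and the Snappy compression option are assumptions, matching the CETAS sketch earlier):

CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (
        FORMAT_TYPE = PARQUET,
        DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
    );
GO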

Create External Table:

Q: If we can fetch data directly from the files, why create an external table?
A: When you need to perform a lot of analysis or reporting from files in the data lake, using the OPENROWSET function can result in complex code that includes data sources and file paths.

To simplify queries, you can encapsulate the files in an external table, which users and reporting applications can query using a standard SQL SELECT statement just like any other database table:

CREATE EXTERNAL TABLE dbo.products
(
    product_id INT,
    product_name VARCHAR(20),
    list_price DECIMAL(5,2)
)
WITH
(
    DATA_SOURCE = files,
    LOCATION = 'products/*.csv',
    FILE_FORMAT = CsvFormat
);
GO

-- query the table
SELECT * FROM dbo.products;


Stored Procedures: You can use stored procedures to encapsulate transformation operations; a single procedure can contain multiple logical steps.

Stored procedures provide the following benefits:

Reduced client-to-server network traffic: only the call to execute the procedure is sent across the network, not the individual statements it contains.

A security boundary:

Multiple users and client programs can perform operations on underlying database objects through a procedure, even if the users and programs don't have direct permissions on those underlying objects. The procedure controls what processes and activities are performed and protects the underlying database objects; eliminating the requirement to grant permissions at the individual object level and simplifying the security layers.

Scenario: you need to restrict access to the underlying table while still allowing applications to query and modify its data.

In SQL Server, when one object refers to another object and both objects have the same owner, SQL Server only checks the security of the first object. This is called ownership chaining.

For example, if a stored procedure references a table and both objects have the same owner, SQL Server only checks permissions on the procedure, not on the table.

This allows you to grant access through procedures or views without ever giving users direct access to the tables. It lets you hide columns, control how data is queried and modified, and enforce business rules or complex data integrity checks (a sketch follows the performance note below).

Easy maintenance: any change to the logic or the file system locations involved in the data transformation can be applied only to the stored procedure, without requiring an update to client applications.

Improved performance:

Stored procedures are compiled the first time they're executed, and the resulting execution plan is cached and reused on subsequent runs of the procedure. As a result, it takes less time to process the procedure.
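A minimal sketch of such a procedure over the dbo.products external table defined earlier (the procedure name, parameter, and filter are hypothetical):

-- Callers need only EXECUTE permission on the procedure; with ownership
-- chaining they don't need SELECT permission on dbo.products itself.
CREATE PROCEDURE dbo.usp_get_products
    @max_price DECIMAL(5,2)
AS
BEGIN
    SELECT product_id, product_name, list_price
    FROM dbo.products
    WHERE list_price <= @max_price;
END;
GO

EXEC dbo.usp_get_products @max_price = 99.99;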


Find the data types of a query's result set:

EXEC sp_describe_first_result_set N'SELECT * FROM dbo.products';


Find the collation of each database:

SELECT name, collation_name FROM sys.databases;

Apply a collation to a database:

ALTER DATABASE database_name COLLATE collation_name;

| Collation type | Serverless SQL pool | Dedicated SQL pool (database & column level) | Dedicated SQL pool (external table, native support) | Dedicated SQL pool (external table, Hadoop/PolyBase) |
| --- | --- | --- | --- | --- |
| Non-UTF-8 collations | Yes | Yes | Yes | Yes |
| UTF-8 | Yes | Yes | No | No |
| Japanese_Bushu_Kakusu_140_* | Yes | Yes | No | No |
| Japanese_XJIS_140_* | Yes | Yes | No | No |
| SQL_EBCDIC1141_CP1_CS_AS | No | No | No | No |
| SQL_EBCDIC277_2_CP1_CS_AS | No | No | No | No |

Quoted and escaped columns in CSV files

Problem: a field value that contains the delimiter, such as aman, llp, gets split into two fields (aman and llp).

Fix 1: escape the embedded delimiter in the data and declare the escape character in the query:

aman\, llp
ESCAPECHAR = '\\'

Fix 2: enclose the field value in the field quote character (the double quote, by default), so the embedded comma isn't treated as a delimiter:

"aman,llp"