What is Azure Synapse?
Azure Synapse is a service available in Azure that is used to integrate and analyze large volumes of data.
Azure Synapse SQL is a distributed query system in Azure Synapse Analytics that offers two kinds of environments:
- 1. Serverless SQL pool: used to work with data that is in the data lake.
- 2. Dedicated SQL pool: an enterprise-scale relational database instance used to host a data warehouse, in which data is stored in relational tables.
In this module, we'll focus on a serverless SQL pool, which provides a pay-per-query endpoint to query the data in your data lake.
The benefits of using a serverless SQL pool include:
- A familiar Transact-SQL syntax to query data in place without the need to copy or load data into a specialized store.
- Integrated connectivity from a wide range of business intelligence and ad-hoc querying tools, including the most popular drivers.
- Distributed query processing that is built for large-scale data, and computational functions - resulting in fast query performance.
- Built-in query execution fault-tolerance, resulting in high reliability and success rates even for long-running queries involving large data sets.
- No infrastructure to set up or clusters to maintain. A built-in endpoint for this service is provided within every Azure Synapse workspace, so you can start querying data as soon as the workspace is created.
- No charge for reserved resources; you're only charged for the data processed by the queries you run.
Serverless SQL pool is tailored for querying the data residing in the data lake, so in addition to eliminating management burden, it eliminates a need to worry about ingesting the data into the system. You just point the query to the data that is already in the lake and run it.
Common use cases for serverless SQL pools include:
- Data exploration: Data exploration involves browsing the data lake to get initial insights about the data, and is easily achievable with Azure Synapse Studio. You can browse through the files in your linked data lake storage, and use the built-in serverless SQL pool to automatically generate a SQL script to select the TOP 100 rows from a file or folder just as you would with a table in SQL Server. From there, you can apply projections, filtering, grouping, and most other operations over the data as if the data were in a regular SQL Server table.
- Data transformation: While Azure Synapse Analytics provides great data transformation capabilities with Synapse Spark, some data engineers might find data transformation easier to achieve using SQL. Serverless SQL pool enables you to perform SQL-based data transformations, either interactively or as part of an automated data pipeline.
- Logical data warehouse: After your initial exploration of the data in the data lake, you can define external objects such as tables and views in a serverless SQL database. The data remains stored in the data lake files, but is abstracted by a relational schema that client applications and analytical tools can use to query the data as they would a relational database hosted in SQL Server (see the sketch below).
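For example, a minimal sketch of the logical data warehouse pattern: a view in a serverless SQL database that wraps an OPENROWSET query over files in the lake. The view name and storage path are hypothetical, and this assumes a user database such as the SalesDB created later in these notes.
-- Hypothetical view over CSV files in the lake; run in a user database, not master.
CREATE VIEW dbo.vw_orders
AS
SELECT *
FROM OPENROWSET(
    BULK 'https://mydatalake.blob.core.windows.net/data/orders/*.csv',
    FORMAT = 'csv',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS orders;
GO
-- Client tools can now query the view like an ordinary table:
SELECT TOP 10 * FROM dbo.vw_orders;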
Query files using a serverless SQL pool
SELECT TOP 100 *
FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.csv',
FORMAT = 'csv')
WITH ([column] data_type)
AS rows
SELECT TOP 100 *
FROM OPENROWSET(
BULK 'https://storagedata1993.dfs.core.windows.net/retail/stage/moviesDB.csv',
FORMAT = 'csv',
PARSER_VERSION = '2.0',
HEADER_ROW=TRUE) AS rows
The OPENROWSET function:
It includes parameters such as:
- The schema of the resulting rowset
- Additional formatting options for delimited text files
OPENROWSET must be given an alias (for example, AS rows).
BULK parameter: specifies the data lake location containing the files. We can use wildcards in the path to include or exclude files:
- https://mydatalake.blob.core.windows.net/data/files/file1.csv: Only include file1.csv in the files folder.
- https://mydatalake.blob.core.windows.net/data/files/file*.csv: All .csv files in the files folder with names that start with "file".
- https://mydatalake.blob.core.windows.net/data/files/*: All files in the files folder.
- https://mydatalake.blob.core.windows.net/data/files/**: All files in the files folder, and recursively its subfolders.
You can also specify multiple file paths in the BULK parameter, separating each path with a comma.
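For instance, a sketch of a single query over two hypothetical folders at once; this assumes the form of the BULK parameter in which multiple paths are listed in parentheses, so check the OPENROWSET documentation for the exact syntax in your environment:
SELECT TOP 100 *
FROM OPENROWSET(
    BULK (
        'https://mydatalake.blob.core.windows.net/data/files/2020/*.csv',
        'https://mydatalake.blob.core.windows.net/data/files/2021/*.csv'
    ),
    FORMAT = 'csv',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS rows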
--OPENROWSET syntax for reading Parquet or Delta Lake files
OPENROWSET
( { BULK 'unstructured_data_path' , [DATA_SOURCE = <data source name>, ]
FORMAT= ['PARQUET' | 'DELTA'] }
)
[WITH ( {'column_name' 'column_type' }) ]
[AS] table_alias(column_alias,...n)
--OPENROWSET syntax for reading delimited text files
OPENROWSET
( { BULK 'unstructured_data_path' , [DATA_SOURCE = <data source name>, ]
FORMAT = 'CSV'
[ <bulk_options> ]
[ , <reject_options> ] }
)
WITH ( {'column_name' 'column_type' [ 'column_ordinal' | 'json_path'] })
[AS] table_alias(column_alias,...n)
<bulk_options> ::=
[ , FIELDTERMINATOR = 'char' ]
[ , ROWTERMINATOR = 'char' ]
[ , ESCAPECHAR = 'char' ]
[ , FIRSTROW = 'first_row' ]
[ , FIELDQUOTE = 'quote_characters' ]
[ , DATA_COMPRESSION = 'data_compression_method' ]
[ , PARSER_VERSION = 'parser_version' ]
[ , HEADER_ROW = { TRUE | FALSE } ]
[ , DATAFILETYPE = { 'char' | 'widechar' } ]
[ , CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ]
[ , ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}' ]
<reject_options> ::=
{
| MAXERRORS = reject_value,
| ERRORFILE_DATA_SOURCE = <data source name>,
| ERRORFILE_LOCATION = '/REJECT_Directory'
}
The PARSER_VERSION is used to determine how the query interprets the text encoding used in the files. Version 1.0 is the default and supports a wide range of file encodings, while version 2.0 supports fewer encodings but offers better performance.
The following parameters are commonly used when working with delimited text files (a usage sketch follows this list):
- FIELDTERMINATOR - the character used to separate field values in each row. For example, a tab-delimited file separates fields with a TAB (\t) character. The default field terminator is a comma (,).
- ROWTERMINATOR - the character used to signify the end of a row of data. For example, a standard Windows text file uses a combination of a carriage return (CR) and line feed (LF), which is indicated by the code \n; while UNIX-style text files use a single line feed character, which can be indicated using the code 0x0a.
- FIELDQUOTE - the character used to enclose quoted string values. For example, to ensure that the comma in the address field value 126 Main St, apt 2 isn't interpreted as a field delimiter, you might enclose the entire field value in quotation marks like this: "126 Main St, apt 2". The double-quote (") is the default field quote character.
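A minimal sketch combining some of these options for a hypothetical tab-delimited file (the path and file layout are assumptions):
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.blob.core.windows.net/data/files/products.txt',
    FORMAT = 'csv',
    PARSER_VERSION = '2.0',
    FIELDTERMINATOR = '\t',  -- fields separated by tabs instead of commas
    FIELDQUOTE = '"',        -- values containing tabs are wrapped in double quotes
    FIRSTROW = 2             -- skip the header row
) AS rows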
Querying JSON files: JSON is a popular format for web applications that exchange data through REST interfaces or use NoSQL data stores such as Azure Cosmos DB.
SELECT doc
FROM
OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json',
FORMAT = 'csv',
FIELDTERMINATOR ='0x0b',
FIELDQUOTE = '0x0b',
ROWTERMINATOR = '0x0b'
) WITH (doc NVARCHAR(MAX)) as rows
Extract individual values from JSON:
SELECT JSON_VALUE(doc, '$.product_name') AS product,
JSON_VALUE(doc, '$.list_price') AS price
FROM
OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json',
FORMAT = 'csv',
FIELDTERMINATOR ='0x0b',
FIELDQUOTE = '0x0b',
ROWTERMINATOR = '0x0b'
) WITH (doc NVARCHAR(MAX)) as rows
Querying Parquet files:
In most cases, the schema of the data is embedded within the Parquet file, so you only need to specify the BULK parameter with a path to the file(s) you want to read, and a FORMAT parameter of parquet; like this:
SELECT TOP 100 *
FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.*',
FORMAT = 'parquet') AS rows
We can partition the data using folders, like this:
- /orders
    - /year=2020
        - /month=1
            - /01012020.parquet
            - /02012020.parquet
            - ...
        - /month=2
            - /01022020.parquet
            - /02022020.parquet
            - ...
        - ...
    - /year=2021
        - /month=1
            - /01012021.parquet
            - /02012021.parquet
            - ...
        - ...
SELECT *
FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/orders/year=*/month=*/*.*',
FORMAT = 'parquet') AS orders
WHERE orders.filepath(1) = '2020'
AND orders.filepath(2) IN ('1','2');
Create External Objects:
Create a database in a serverless SQL pool. We can create a database in two ways:
a) Graphical interface b) SQL command
One consideration is to set the collation of your database so that it supports the conversion of text data in files to appropriate Transact-SQL data types.
CREATE DATABASE SalesDB
COLLATE Latin1_General_100_BIN2_UTF8
The collation Latin1_General_100_BIN2_UTF8 makes it easier to import UTF-8 encoded text data into VARCHAR columns.
Create an External Data Source: An external data source encapsulates a connection to a file system location in a data lake. We can then reference the data files with paths relative to this location in queries and external table definitions.
CREATE EXTERNAL DATA SOURCE files
WITH (
LOCATION = 'https://mydatalake.blob.core.windows.net/data/files/'
)
SELECT *
FROM
OPENROWSET(
BULK 'orders/*.csv',
DATA_SOURCE = 'files',
FORMAT = 'csv',
PARSER_VERSION = '2.0'
) AS orders
In this example, the BULK parameter is used to specify the relative path for all .csv files in the orders folder, which is a subfolder of the files folder referenced by the data source.
Another benefit of using a data source is that you can assign a credential for the data source to use when accessing the underlying storage, enabling you to provide access to data through SQL without permitting users to access the data directly in the storage account.
CREATE DATABASE SCOPED CREDENTIAL sqlcred
WITH
IDENTITY='SHARED ACCESS SIGNATURE',
SECRET = 'sv=xxx...';
GO
CREATE EXTERNAL DATA SOURCE secureFiles
WITH (
    LOCATION = 'https://mydatalake.blob.core.windows.net/data/secureFiles/',
    CREDENTIAL = sqlcred
);
GO
Create External File Format
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS(
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '"'
)
);
GO
Create External Table:
Qns: If we can fetch data directly from the files, why do we need to create an external table?
Ans: When you need to perform a lot of analysis or reporting from files in the data lake, using the OPENROWSET function can result in complex code that includes data sources and file paths.
To simplify the query, we can encapsulate the files in an external table, which users and reporting applications can query using a standard SQL SELECT statement just like any other database table.
CREATE EXTERNAL TABLE dbo.products
(
product_id INT,
product_name VARCHAR(20),
list_price DECIMAL(5,2)
)
WITH
(
DATA_SOURCE = files,
LOCATION = 'products/*.csv',
FILE_FORMAT = CsvFormat
);
GO
-- query the table
SELECT * FROM dbo.products;
Stored procedures: We can use stored procedures to encapsulate transformation operations, and a single procedure can contain multiple pieces of logic.
Stored procedures provide the following benefits:
Reduced client-to-server network traffic
A security boundary:
Multiple users and client programs can perform operations on underlying database objects through a procedure, even if the users and programs don't have direct permissions on those underlying objects. The procedure controls what processes and activities are performed and protects the underlying database objects; eliminating the requirement to grant permissions at the individual object level and simplifying the security layers.
Scenario: We need to restrict access to the underlying table while still allowing the application to query and modify data.
In SQL Server, when one object refers to another object and both objects have the same owner, SQL Server only checks the security of the first object. This is called ownership chaining.
For example, if a stored procedure references a table and both objects have the same owner, SQL Server only checks security on the procedure, not the table.
This allows us to give access through procedures or views without ever giving users direct access to the tables. It lets us hide columns, control how data is queried and modified, and enforce business rules or complex data integrity checks.
Easy maintenance: Any change in the logic or file system locations involved in the data transformation can be applied only to the stored procedure, without requiring an update to client applications.
Improved performance:
Stored procedures are compiled the first time they're executed, and the resulting execution plan is held in the cache and reused on subsequent runs of the procedure. As a result, it takes less time to process the procedure.
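As a sketch of how this could look on top of the dbo.products external table defined earlier (the procedure and user names are hypothetical):
CREATE PROCEDURE dbo.usp_get_products
AS
BEGIN
    -- Callers see only the procedure; the external table stays hidden behind it.
    SELECT product_id, product_name, list_price
    FROM dbo.products;
END;
GO
-- Ownership chaining: granting EXECUTE on the procedure is enough,
-- the user needs no direct permission on dbo.products.
GRANT EXECUTE ON dbo.usp_get_products TO [reporting_user];
GO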
Find the data types returned by a query:
EXEC sp_describe_first_result_set N'SELECT * FROM dbo.products';
Find the collation of each database:
SELECT name, collation_name FROM sys.databases;
Apply a collation to a database:
ALTER DATABASE database_name COLLATE collation_name;
| Collation Type | Serverless SQL Pool | Dedicated SQL Pool - Database & Column Level | Dedicated SQL Pool - External Table (Native Support) | Dedicated SQL Pool - External Table (Hadoop/PolyBase) |
| --- | --- | --- | --- | --- |
| Non-UTF-8 Collations | Yes | Yes | Yes | Yes |
| UTF-8 | Yes | Yes | No | No |
| Japanese_Bushu_Kakusu_140_* | Yes | Yes | No | No |
| Japanese_XJIS_140_* | Yes | Yes | No | No |
| SQL_EBCDIC1141_CP1_CS_AS | No | No | No | No |
| SQL_EBCDIC277_2_CP1_CS_AS | No | No | No | No |
Quoted and escaped columns in CSV files:
Problem: a field value that contains the delimiter, such as aman, llp, is split into two fields (aman and llp) when the file is read.
Fix 1: escape the delimiter inside the value in the data file (aman\, llp) and set ESCAPECHAR = '\\' in the OPENROWSET options (see the sketch below).
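A minimal sketch of Fix 1 for a hypothetical companies.csv in which embedded commas are escaped with a backslash:
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.blob.core.windows.net/data/files/companies.csv',
    FORMAT = 'csv',
    FIRSTROW = 2,        -- skip the header row
    ESCAPECHAR = '\\'    -- treat \, in the data as a literal comma, not a field terminator
) AS rows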
Fix 2: enclose the value in quotes in the data file ("aman, llp") so the comma is treated as part of the value; the double quote (") is the default FIELDQUOTE character, as shown in the delimited-text sketch earlier.