Cheat Sheet for Microsoft Azure Data Fundamental certification exam (DP-900)

Kumar Rahul
20 min readJan 22, 2022

--

Written by Kumar Rahul & Rishita Shaw

This a cheat sheet for Microsoft Certified Exam DP-900: Microsoft Azure Data Fundamentals.

This is not enough for preparation but it’s definitely enough for revision.

Module 1(Explore core data concepts)

Data

Structured: relational databases (stored in SQL Server or Azure SQL Database)

Unstructured: audio and video files, and binary data files (BLOB)

Semi structured: key-value stores and graph databases(key-value database stores data as a single collection without structure or relation) (Azure Cosmo db)

Access:

Read-only (management team of large org)

read/write (customers)

Ownership (db admin, Data analysts and data managers)

Data processing:

Transactional (OLTP): Splitting tables out into separate groups of columns like this is called normalization it can make querying more complex

Analytical (OATP): big picture view of the information held in a database

Relational Db

Non relational Db

  • A transaction is a sequence of operations that are atomic. This means that either all operations in the sequence must be completed successfully, or if something goes wrong, all operations run so far in the sequence must be undone.

ACID (Atomicity, Consistency, Isolation, Durability)

  • Many systems implement relational consistency and isolation by applying locks to data when it is updated. The lock prevents another process from reading the data until the lock is released. The lock is only released when the transaction commits or rolls back.
  • Distributed Db
  • eventual consistency.
  • Analytical workload
  • Batch processing and streaming
  • Batch processing adv and disadvantage : advantage(large vol, scheduled run) disadvantage (time delay, minor issues can stop the process)
  • Diff b/w streaming and batch: data scope, size, performance(latency), analysis

Module 2(Explore roles and responsibilities in the world of data)

Data job role

  • Database Administrators (operational aspects of on-premises and cloud-based database) manage databases, assign permissions to users, implement policies, manage security, storing backup copies of data and restoring data in case of any failures.
  • Data Engineers work with data, applying data cleaning routines and ingestion, identifying business rules, and turning data into useful information.
  • Data Analysts explore and analyze data to create visualizations and charts to enable organizations to make informed decisions.

TASK AND RESPONSIBILITY

Database admin:

  • Install, upgrade, Db server & tools
  • Allocating, controlling, monitoring and modifying storage
  • Enrolling users and security
  • Backup, restore and archive
  • Generate reports
  • Tools: SQL Server Management Studio, pgadmin, mysql workbench, azure data studio

Data engineer:

  • Developing, constructing, testing, acquisition, examining, automating and maintaining Db and data
  • Deploying sophisticated analytics programs, machine learning, and statistical methods.
  • improve data reliability, efficiency, and quality and research
  • Tools: Microsoft SQL Server , Azure SQL Database, Azure Databricks, and Azure HDInsight , Cosmo Db

Data Analyst:

  • Creating charts and graphs, histograms, geographical maps
  • Transforming, improving, and integrating data
  • Finding hidden patterns using data and delivering info by creating rich graphical dashboards and reports
  • Tools: excel, power bi

Module 3(Describe concepts of relational data)

Relational db:

  • You design a relational db by creating a data model.
  • The primary key indicates the column (or combination of columns) that uniquely identify each row
  • Foreign Key are reference, or link to, the primary key of another table, and are used to maintain the relationships between tables
  • All data is tabular. Entities are modeled as tables, each instance of an entity is a row in the table, and each property is defined as a column.
  • All rows in the same table have the same set of columns. A table can contain any number of rows.
  • Supports SQL
  • suited for OLTP applications

Index

  • contains a copy of this data in a sorted order, with pointers to the corresponding rows in the table and can create many indexes
  • consume additional storage space and additional work can slow down operations and incur charges
  • A clustered index physically reorganizes a table by the index key

View

  • virtual table based on the result set of a query

on-premises hosting vs cloud

IaaS:

  • installing and configuring the software, patching, taking backups, and restoring data
  • virtual machine in the cloud
  • best for migrations and applications requiring operating system-level access

PaaS

  • Does not expose the underlying operating system and hardware to your applications
  • Azure automatically creates the necessary virtual machines, networks, and other devices for your requirement
  • Quickly handles this scaling for you,

Module 4 (Explore concepts of non-relational data)

Non relational db:

  • enable you to store data in a very flexible manner
  • don’t impose a schema on data rather focus on the data itself
  • store the information for entities in collections or containers
  • Each entity should have a unique key value and are usually stored in key-value order
  • advanced non-relational systems support indexing (Azure Cosmos DB )

Semi structured:

  • data that contains fields defined on a per-entity basis
  • Json, Avro, ORC, and Parquet
  • Avro is a row-based format. Each record contains a header(json) that describes the structure of the data(binary) in the record.
  • ORC (Optimized Row Columnar format) organizes data into columns, Hive supports SQL-like queries over unstructured data
  • Parquet is another columnar data format (row group)

Unstructured data:

  • store video and audio data as block blobs in an Azure Storage account
  • A block blob only supports basic read and write operations.

No SQL

  • key-value stores:
  • key uniquely identifies the item, and the value(opaque) holds the data for the item.
  • read and write data very quickly
  • excellent choice for data ingestion
  • Azure Table storage, Cosmo db
  • document databases:
  • each document has a unique ID, but the fields in the documents are transparent to the dbms
  • XML, YAML, JSON, BSON format or plain text
  • enables you to query and filter data by using the values in these fields.
  • Some create the document key automatically and support indexing to facilitate fast lookup
  • Azure Cosmos DB implements a document database approach in its Core (SQL) API.
  • column family databases:
  • ORC and Parquet files
  • denormalized approach to structuring sparse data
  • column family database as holding tabular data comprising rows and columns, but you can divide the columns into groups known as column-families
  • Apache Cassandra. Azure Cosmos DB supports the column-family approach through the Cassandra API.
  • graph databases:
  • store entities, but focuses on the relationships between entities
  • nodes (entities), and edge (relationships between nodes)
  • efficiently perform queries
  • Azure Cosmos DB supports graph databases using the Gremlin API

Module 5: Explore concepts of data analytics :

  • Data ingestion
  • Data Processing
  • ELT and ETL

SQL Server Integration Services and Azure Data Factory: Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale.

ETL processes that transform data visually with data flows, or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.

  • Reporting
  • Business Intelligence (BI)
  • Benchmarking: Comparison with other companies in the same industry.
  • Data Visualization

Most famous tool in Azure is Power BI for data visualization: you can connect to multiple different sources of data, and combine them into a data model

  1. Bar and column chart
  2. Line chart
  3. Matrix
  4. Key influencer
  5. Tree map
  6. Scatter
  7. Filed map

Data analytics:

  1. Descriptive : what happened . By developing KPIs (Key Performance Indicators), these strategies can help track the success or failure of key objectives
  2. Diagnostic :why happened
  3. Predictive : what will happen in the future using neural networks, decision trees, and regression.
  4. Prescriptive: what action should we take to achieve a goal
  5. Cognitive: Cognitive analytics helps you to learn what might happen if circumstances change, and how you might handle these situations. It uses several NLP (Natural Language Processing) concepts to make sense of previously untapped data sources, such as call center conversation logs and product reviews.

Module 6: Explore relational data services in Azure:

  • Data base
  • Stored procedure: A stored procedure is a block of code that runs inside your database.

A linked server is a connection from one database server to another. SQL Server can use linked servers to run queries on one server that can include data retrieved from other servers; these are known as distributed queries.

  • IaaS: Infrastructure as a service e.g.: azure virtual network
  • PaaS: Platform-as-a-service e.g.: Azure SQL Databases
  • SaaS: Software-as-a-Service, e.g. office 365
  • Azure Data Services: Azure Data Services fall into the PaaS category. These services are a series of DBMSs managed by Microsoft in the cloud. Each data service takes care of the configuration, day-to-day management, software updates, and security of the databases that it hosts. All you do is create your databases under the control of the data service.
  • Most famous database service is Azure SQL database

Azure Database for SQL server

MariaDB server

Postgre SQL server

Microsoft also provide services for non relational dbms such as cosmos DB

Azure Data Services ensure that your databases are available for at least 99.99% of the time.

There are cost associated with the running database in azure data services

Can’t shutdown the database and restart it later. This services are always on

  • SQL Server on Azure Virtual Machines: SQL Server on Virtual Machines enables you to use full versions of SQL Server in the Cloud without having to manage any on-premises hardware. SQL Server running on an Azure virtual machine effectively replicates the database running on real on-premises hardware. Migrating from the system running on-premises to an Azure virtual machine

lift-and-shift refers to the way in which you can move a database directly from an on-premises

server to an Azure virtual machine without requiring that you make any changes to it.

A hybrid deployment is a system where part of the operation runs on-premises, and part in the cloud

  • Azure SQL Database :Azure SQL Database is a PaaS offering from Microsoft. You create a managed database server in the cloud, and then deploy your databases on this server.
  1. Single Database
  2. Elastic Pool database :This option is similar to Single Database, except that by default multiple databases can share the same resources, such as memory, data storage space, and processing power through multiple-tenancy.
  3. Managed instance

Azure SQL Database is often used for:

  • Modern cloud applications that need to use the latest stable SQL Server features.
  • Applications that require high availability.
  • Systems with a variable load, that need the database server to scale up and down quickly.

SQL Database helps secure your data by providing encryption. For data in motion, it uses Transport Layer Security. For data at rest, it uses Transparent Data Encryption. For data in use, it uses Always Encrypted.

Azure SQL Database Managed Instance:

You have complete control over this instance, much as you would for an on-premises server. The Managed instance service automates backups, software patching, database monitoring, and other general tasks, but you have full control over security and resource allocation for your databases. Managed instances depend on other Azure services such as Azure Storage for backups, Azure Event Hubs for telemetry, Azure Active Directory for authentication, Azure Key Vault for Transparent Data Encryption (TDE) and a couple of Azure platform services that provide security and supportability features. The managed instances make connections to these services.

  • MySQL:
  • MariaDB: compatibility with oracle database, One notable feature of MariaDB is its built-in support for temporal data. A table can hold several versions of data, enabling an application to query the data as it appeared at some point in the past.
  • PostgreSQL

Azure Database for MySQL:

  • High availability features built-in.
  • Predictable performance.
  • Easy scaling that responds quickly to demand.
  • Secure data, both at rest and in motion.
  • Automatic backups and point-in-time restore for the last 35 days.
  • Enterprise-level security and compliance with legislation.

Azure Database for MariaDB:

  • Built-in high availability with no additional cost.
  • Predictable performance, using inclusive pay-as-you-go pricing.
  • Scaling as needed within seconds.
  • Secured protection of sensitive data at rest and in motion.
  • Automatic backups and point-in-time-restore for up to 35 days.
  • Enterprise-grade security and compliance.

Azure Database for PostgreSQL:

  • same property as azure database for my SQL.
  • Azure Database for PostgreSQL single-server: Each tier supports different numbers of CPUs, memory, and storage sizes
  • Azure Database for PostgreSQL Hyperscale (Citrus): Data is split across nodes

Use Azure database migration service to migrate on premise Mysql, MariaDB or Postgre SQL to a database running the corresponding data services in Azure

Module 7: Explore provisioning and deployment in database service in azure:

  • What is Provisioning?

Provisioning is the act of running a series of tasks that a service provider, such as Azure SQL Database, performs to create and configure a service.

The act of increasing (or decreasing) the resources used by a service is called scaling.

Tools to provision services:

  • Azure portal: Display list of service specific pages before actual provisioning
  • Azure command line interface(CLI): basic command prompt and PowerShell command we run on windows to automate service creation.
  • Azure PowerShell
  • Azure Resource manager templates: describes the service (or services) that you want to deploy in a text file, in a format known as JSON
  • Provisioning in Azure SQL database
  • Provisioning PostgreSQL and MySQL:

Hyperscale option for Postgre SQL supports:

  1. Horizontal Scaling
  2. Query parallelization
  3. Excellent support for multi-tenant applications,
  4. real time operational analytics, and
  5. high throughput transactional workloads
  • Configuring relational data services
  • Configure connectivity and firewall
  • Configure connectivity to virtual networks and on-premises computers
  • Azure SQL Database communicates over port 1433
  • A firewall rule of 0.0.0.0 enables all Azure services to pass through the server-level firewall rule and attempt to connect to a single or pooled database through the server.
  • Configure connectivity from private endpoints.
  • Configure Authentication
  • Configure access control: who or what can access your resources

Role assignment consist of three elements:

  • Security principle
  • Role definition: Collection of permission
  • Owner
  • Contributor
  • Reader
  • User access administrator
  • Scope : lists set of resources that the access applies to
  • You add role assignments to a resource in the Azure portal using the Access control (IAM) page
  • Configure advance data security

Configure Azure SQL Database

  • An ACL(access control list) contains a list of resources, and the objects (users, computers, and applications) that are allowed to access those resources
  • Connectivity from within Azure
  • Connectivity from outside azure
  • Configure DoS(Denial of Service) guard: DoSGuard actively tracks failed logins from IP addresses

Configure Azure Database for PostgreSQL:

  • Connections to your Azure Database for PostgreSQL server communicate over port 5432
  • Configure read replicas : replicate data from an Azure Database for PostgreSQL server to a read-only server
  • Configure Azure Database for MySQL
  • Configure server parameter
  • Configure read replicas: same as Postgre SQL

Module 8: Query relational data in azure:

  • SQL: structured query language
  • Transact-SQL (T-SQL). This version of SQL is used by Microsoft SQL Server and Azure SQL Database.
  • pgSQL. This is the dialect, with extensions implemented in PostgreSQL.
  • PL/SQL. This is the dialect used by Oracle. PL/SQL stands for Procedural Language/SQL.
  • Data Manipulation Language (DML): SELECT, INSERT, UPDATE, DELETE
  • Data Definition Language (DDL): CREATE, ALTER, DROP, RENAME

Query relational data in Azure SQL Database

  • Retrieve connection information for azure sql DB
  • Use the azure portal to query a DB
  • Use SQLCMD to query a database
  • Use Azure data studio: Azure Data Studio is a graphical utility for creating and running SQL queries from your desktop
  • SQL server management studio
  • Use SQL Server Data Tools in Visual Studio

Query relational data in Azure Database for PostgreSQL

  • Retrieve connection information for Azure Database for PostgreSQL
  • Use psql to query a database
  • psql commands include:
  • \l to list databases.
  • \dt to list the tables in the current database.
  • Connect to PostgreSQL database using Azure Data Studio

Query relational data in Azure Database for MySQL

  • port : 3306

Module 9 : Explore non-relational data offerings in Azure

Azure Table Storage:

  • items are referred to as rows(must have a key), and fields are known as columns
  • store semi-structured data ( schema less)
  • Denormalized data
  • splits a table into partitions. Partitioning is a mechanism for grouping related rows, based on a common property or partition key
  • Partitions are independent from each other
  • you can include the partition key in the search criteria
  • Items in the same partition are stored in row key order.
  • quickly perform Point queries that identify a single row, and Range queries that fetch a contiguous block of rows in a partition
  • The columns in a table can hold numeric, string, or binary data up to 64 KB in size. A table can have up to 252 columns, apart from the partition and row keys. The maximum row size is 1 MB

Adv

  • no need to map and maintain the complex relationships
  • Fast operations like insertion deletion query
  • Simple to scale
  • Storing TBs of structured data capable of serving web scale applications
  • Storing datasets that don’t require complex joins, foreign keys, or stored procedures
  • Capturing event logging and performance monitoring data

Disadv

  • Consistency isn’t guaranteed
  • no referential integrity
  • difficult to filter and sort on non-key data
  1. The data for each table is replicated three times within an Azure region. For increased availability
  2. transparently switch to a working replica while the failed replica is recovered
  3. configure security and role-based access control

Azure Blob storage

  • unstructured data, or blobs
  • you create blobs inside containers
  • Block blobs. handled as a set of blocks. used to store discrete, large, binary objects that change infrequently.
  • Page blobs. organized as a collection of fixed size 512-byte pages. support random read and write operations, used to implement virtual disk storage for virtual machines
  • Append blobs. a block blob optimized to support append operations updating or deleting existing blocks isn’t supported
  • Access tiers
  • The Hot tier is accessed frequently. data is stored on high-performance media.
  • The Cool tier. has lower performance, accessed infrequently.
  • The Archive tier. provides the lowest storage cost, but with increased latency for historical data that mustn’t be lost, but is required only rarely, stored in an offline state
  • A lifecycle management policy can automatically move a blob from Hot to Cool, and then to the Archive tier, as it ages and is used less frequently
  • Serving images or documents directly to a browser
  • Storing data for backup and restore, disaster recovery, and archiving, analysis and distributed access
  • Versioning
  • Soft delete
  • Snapshots
  • Change Feed

Azure File storage

  • Create and access files shares in the cloud
  • exposes file shares using the Server Message Block 3.0 (SMB) protocol
  • tools such as the AzCopy utility or azure portal
  • Services: standard and premium (greater throughput)
  • Migrate existing applications to the cloud (access data using file-based APIs)
  • Share server data across on-premises and cloud(leverage the availability, durability, scalability, and geo redundancy built into the Azure storage platform)
  • Integrate modern applications with Azure File Storage (rest api)
  • Simplify hosting High Availability (HA) workload data

Azure Cosmos DB

  • multi-model NoSQL dbms
  • manages data as a partitioned set of documents(collection of fields, identified by a key)
  • use JSON
  • provides APIs that enable you to access these documents
  • SQL API. provides a SQL-like query language over documents
  • Table API. use the Azure Table Storage API to store and retrieve documents
  • MongoDB API enable a MongoDB application to run unchanged against a Cosmos DB database
  • Cassandra API column family database management system provides a Cassandra-like programmatic interface for Cosmos DB
  • Gremlin API. implements a graph database interface to Cosmos DB
  • Unlike Azure Table storage, documents in a Cosmos DB partition aren’t sorted by ID Instead, Cosmos DB maintains a separate index. This index contains not only the document IDs, but also tracks the value of every other field in each document
  • all databases are replicated within a single region
  • replication is transparent, and failover from a failed replica is automatic
  • guarantees less than 10-ms latencies for both reads (indexed) and writes at the 99th percentile, all around the world
  • all data in Cosmos DB is encrypted at rest and in motion

Module 10 : Explore provisioning and deploying non-relational data services in Azure

provisioning non-relational data services

  • Provisioning is the act of running a series of tasks that a service provider, it is opaque. Parameters initially set can be later modified.
  • Tools
  • The Azure portal
  • The Azure command-line interface (CLI): run command using cloud shell
  • Azure PowerShell
  • Azure Resource Manager templates (deploy in json)

Provision Azure Cosmos DB

  • Azure CLI, Azure PowerShell, or an Azure Resource Manager template
  • Databases and containers are the primary resource(storage space) consumers
  • uses the concept of Request Units per second (RU/s) to manage the performance and cost of databases
  • If you under provision (by specifying too few RU/s), Cosmos DB will start throttling performance

configuring non-relational data services

  • You can connect to these services from an on-premises network, the internet, or from within an Azure virtual network
  • Azure Private Endpoint is a network interface that connects you privately and securely to a service powered by Azure Private Link
  • Private endpoint connections page for a service allows you to specify which private endpoints, if any, are permitted access to your servic
  • Azure Active Directory (Azure AD) provides superior security and ease of use over access key authorization

Configure access control

  • Azure role-based access control
  • Allow one user to manage virtual machines in a subscription and another user to manage virtual networks.
  • Allow a database administrator group to manage SQL databases in a subscription.
  • Allow a user to manage all resources in a resource group, such as virtual machines, websites, and subnets.
  • Allow an application to access all resources in a resource group.
  • security principal is an object that represents a user, group, service, or managed identity that is requesting access to Azure resources.
  • role definition, often abbreviated to role, is a collection of permissions
  • Owner
  • Contributor
  • Reader
  • User Access Administrator
  • scope lists the set of resources that the access applies to
  • Tools: Access control (IAM)

Configure consistency in Cosmo db

  • Eventual. least consistent Changes won’t be lost, they’ll appear eventually changes could appear out of order. lowest latency and least consistency
  • Consistent Prefix. ensures that changes will appear in order, although there may be a delay before they become visible
  • Session. If an application makes a number of changes, they’ll all be visible to that application, and in order
  • Bounded Staleness. There’s a lag between writing and then reading the updated data
  • Strong: In this case, all writes are only visible to clients after the changes are confirmed as written successfully to all replicas

Configure shared access signatures

  • use shared access signatures (SAS) to grant limited rights to resources in an Azure storage account for a specified time period
  • access resources such as blobs and files, without requiring that they’re authenticated first

Module 11: Manage non-relational data stores in Azure

Data operations in Cosmos DB

  • Data Explorer in the Azure portal to run ad-hoc queries
  • Cosmos DB Data Migration tool to perform a bulk-load or transfer of data
  • Azure Data Factory to import data from another source
  • custom application that imports data using the Cosmos DB Bulk Executor library (make use of multiple concurrent threads to batch your data into chunks and load the chunks in parallel)
  • Create your own application that uses the functions available through the Cosmos DB SQL API client library
  • insufficient throughput capacity configured results in https error 429

Query Azure Cosmos DB

  • The Cosmos DB SQL API supports a dialect of SQL for querying documents using SELECT statements

Manage Azure File storage

Module 12: Examine components of a Modern data warehouse :

  • The process of combining all of the local data or gathering data from many different sources within an organization is known as data warehousing.
  • The process of analyzing streaming data and data from the Internet is known as Big Data analytics.
  • Azure Synapse Analytics combines data warehousing with Big Data analytics.
  • online analytical processing (OLAP)
  • Modern data Warehousing : It contains a mixture of relational and non relational data like media files, social media streams or internet of things.
  • Tools we can use: Azure Data Factory, Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, and Azure Analysis Services, power BI is used to analyze and visualize the data, generating reports and charts.
  • gathers data from many different sources within an organization
  • provide answers to complex queries, unlike a traditional relational database
  • handle big data
  • contain a mixture of relational and non-relational data, including files, social media streams, and Internet of Things (IoT) sensor data
  • Combine batch and stream process: upto the second data and historical data are important for a company. Upto the second data is generated by steam process and historical data is generated by batch process.
  • Azure Data factory: its a data integration service. Purpose of the azure data factory is to retrieve data from one or more data sources and to convert it into the format that you process.

Work performed by azure data factory is defined as pipeline of operations

  • Azure Data Lake storage:
  • It’s a repository for Large Quantities of Raw data.
  • Data is fast load and processed because its raw & unprocessed.
  • A data warehouse holds structured information but a data lake stores raw data.
  • It’s an extension of Azure Blob storage. It organizes files into directories and sub directories but blob storage only mimics a directory structure.
  • It supports POSIX( Portable Opening system interface) and compatible with HDFS (Hadoop distributed file system)
  • Azure DataBricks:
  • Azure Databricks is an Apache Spark environment running on Azure to provide big data processing, streaming, and machine learning.
  • Azure Databricks provides a graphical user interface where you can define and test your processing step by step, before submitting it as a set of batch tasks.
  • Azure Synapse Analytics:
  • It’s an analytical engine. Design to process large data very quickly.
  • Ingest data from external sources and then transform and aggregate this data into a format suitable for analytics processing. Its used to process data which we have read in and processed locally.
  • Control node is the brain of architecture, like a front end which interacts with all applications. When we submit a process request it transforms it into smaller requests that run against distinct subsets of the data in parallel.
  • We have to submit queries in Transact-SQL statements
  • PolyBase enables you to retrieve data from relational and non-relational sources, such as delimited text files, Azure Blob Storage, and Azure Data Lake Storage.
  • Azure Analysis Service

Azure Analysis Services enables you to build tabular models to support online analytical processing (OLAP) queries. You can combine data from various sources (e.g. lake , analytics , SQL db, cosmos db).

  • Difference between Analysis service and synapse analytics:

Use Azure Synapse Analytics for:

  • Very high volumes of data (multi-terabyte to petabyte sized datasets).
  • Very complex queries and aggregations.
  • Data mining, and data exploration.
  • Complex ETL operations. ETL stands for Extract, Transform, and Load, and refers to the way in which you can retrieve raw data from multiple sources, convert this data into a standard format, and store it
  • Low to mid concurrency (128 users or fewer).

Use Azure Analysis Services for:

  • Smaller volumes of data (a few terabytes).
  • Multiple sources that can be correlated.
  • High read concurrency (thousands of users).
  • Detailed analysis, and drilling into data, using functions in Power BI.
  • Rapid dashboard development from tabular data.
  • Azure HDInsight: Azure HDInsight is a big data processing service that provides the platform for technologies such as Spark in an Azure environment.
  • Hadoop is an open source framework that breaks large data processing problems down into smaller chunks.
  • Hive is a SQL-like query facility that you can use with an HDInsight cluster to examine data held in a variety of formats.

Module 13: Explore large scale data analytics :

Ingest data using Azure Data Factory

  • HTAP(Hybrid transactional analytical processing): analyze operational data into original location.
  • Orchestration is the process of directing and controlling other services, and connecting them together, to allow data to flow between them.
  • A linked service provides the information needed for Data Factory to connect to a source or destination
  • A dataset in Azure Data Factory represents the data that you want to ingest (input) or store (output). If the data is structured, the data set specifies how data is structured.
  • A pipeline is a logical grouping of activities that together perform a task. The activities in a pipeline define actions to perform on your data. It includes a lot of activity like looping data in for each loop , can use if condition to execute the activity conditionally, can map input into different formats as per output is required. A trigger enables you to schedule a pipeline to occur according to a planned schedule

Ingest data using PolyBase

  • Polybase is a feature of SQL server & azure synapse analytics that enables you to run Transact-SQL queries that read data from external data sources. It makes external data appear like tables in a SQL db.
  • Azure SQL Database does not support PolyBase.

Ingest data using SQL Server Integration Services

SQL Server Integration Services (SSIS) is a platform for building enterprise-level data integration and data transformations solutions. It is used to solve complex business problems.

SSIS can extract and transform data from a wide variety of sources such as XML data files, flat files, and relational data sources, and then load the data into one or more destinations.

Module 14: Get started building with Power BI

  • collection of software services, apps, and connectors that work together to turn your unrelated sources of data into coherent, visually immersive, and interactive insights.
  • Power BI Desktop, an online SaaS (Software as a Service)

Use Power BI

  1. Bring data into Power BI Desktop, and create a report.
  2. Publish to the Power BI service, where you can create new visualizations or build dashboards.
  3. Share dashboards with others, especially people who are on the go.
  4. View and interact with shared dashboards and reports in Power BI Mobile apps.

Building blocks of Power BI

  • Visualizations
  • Datasets
  • Reports
  • Dashboard
  • Tiles
  • app is a collection of preset, ready-made visuals and reports that are shared with an entire organization.
  • Power BI can just as easily connect to a whole assortment of software services (also called SaaS providers or cloud services): Salesforce, Facebook, Google Analytics, and more.

--

--

Kumar Rahul

A passionate Web Developer and a Student of Mechanical Engineering from NIT Durgapur