Azure Data Catalog

The Azure Data Catalog lets you register data repositories and quickly discover data across your organization.

Access the Data Catalog

Sign in to the Data Catalog either from the Azure Portal or directly through the Data Catalog link.

Purpose of the Data Catalog

Create metadata surrounding your data sources.

Discover Data Source

Search syntax

Although the default free text search is simple and intuitive, users can also use Azure Data Catalog’s search syntax to have greater control over the search results. Search supports the following techniques:

Technique | Use | Example
Basic Search | Basic search using one or more search terms. Results are any assets that match on any property with one or more of the terms specified. | sales data
Property Scoping | Only return data sources where the search term is matched with the specified property. | name:finance
Boolean Operators | Broaden or narrow a search using Boolean operations. | finance NOT corporate
Grouping with Parentheses | Use parentheses to group parts of the query to achieve logical isolation, especially in conjunction with Boolean operators. | name:finance AND (tags:Q1 OR tags:Q2)
Comparison Operators | Use comparisons other than equality for properties that have numeric and date data types. | modifiedTime > "06/07/2016"
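
The techniques listed above compose into a single query string. As a minimal sketch, the Python helpers below build such strings. The helper names (scoped, any_of, all_of, compare, url_encode) are hypothetical and not part of any Azure SDK; the sketch only assembles text and does not call the Data Catalog service.

```python
from urllib.parse import quote


def scoped(prop: str, term: str) -> str:
    """Property scoping: match the term only against the named property."""
    return f"{prop}:{term}"


def any_of(*clauses: str) -> str:
    """Boolean OR, wrapped in parentheses so the group stays logically isolated."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses: str) -> str:
    """Boolean AND across the given clauses."""
    return " AND ".join(clauses)


def compare(prop: str, op: str, value: str) -> str:
    """Comparison on a date property; the value is quoted as in the example above."""
    return f'{prop} {op} "{value}"'


def url_encode(query: str) -> str:
    """Percent-encode a query so it can be passed as a URL parameter."""
    return quote(query, safe="")


# Rebuild two of the examples listed above.
grouped = all_of(
    scoped("name", "finance"),
    any_of(scoped("tags", "Q1"), scoped("tags", "Q2")),
)
recent = compare("modifiedTime", ">", "06/07/2016")

print(grouped)  # name:finance AND (tags:Q1 OR tags:Q2)
print(recent)   # modifiedTime > "06/07/2016"
```

Building queries from small helpers keeps the parentheses and operators consistent when a search is generated by a script rather than typed into the portal.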

Publish Data Source

You can either enter a data source manually through the web portal or use the Azure Data Catalog application, which is the more seamless approach. I wasn’t able to register a SQL Server manually and had better luck using the application.

When you use the application to publish a data source, your network location determines what the application can connect to. For example, if you are connected over the VPN, the connection goes through the VPN (see the firewall restrictions for why this matters). This means you can publish ANY data source that you are able to connect to. Pretty cool! After publishing a new data source, you will need to refresh the web portal.

To publish from the application, first install it: browse to the Data Catalog portal, click Install Application, and follow the on-screen prompts (accept the agreement).

  1. The Azure Data Catalog Application will install.
  2. Sign in to the application using your email account and select Work or School Account.
  3. Select the data source and click Next.
  4. Enter the connection information and click Connect.
  5. Select the database and table, fill out the details, and click Next.

Registered files and folders within the Data Lake Store are related to each other based on the name of the Data Lake Store, which makes categorizing data simple.

Annotate Data Source

Azure Data Catalog allows users to annotate data sources with their own descriptive metadata, such as descriptions and tags, to supplement the metadata extracted from the data source and to make the data source more understandable to more people.

When selecting multiple data assets in the Azure Data Catalog portal, users can annotate all selected assets in a single operation. Annotations will apply to all selected assets, making it easy to select and provide a consistent description and sets of tags and experts for related data assets.

There are many different ways users can annotate a data source:

Friendly name: Friendly names can be supplied at the data asset level, to make the data assets more easily understood. Friendly names are most useful when the underlying object name is cryptic, abbreviated or otherwise not meaningful to users.
Description: Descriptions can be supplied at the data asset and attribute / column levels. Descriptions are free-form short text annotations that describe the user’s perspective on the data asset or its use.
Tags (user tags): Tags can be supplied at the data asset and attribute / column levels. User tags are user-defined labels that can be used to categorize data assets or attributes.
Tags (glossary tags): Tags can be supplied at the data asset and attribute / column levels. Glossary tags are centrally-defined glossary terms that can be used to categorize data assets or attributes using a common business taxonomy.
Experts: Experts can be supplied at the data asset level. Experts identify users or groups with expert perspectives on the data and can serve as points of contact for users who discover the registered data sources and have questions that are not answered by the existing annotations.
Request access: Request access information can be supplied at the data asset level. This information is for users who discover a data source that they do not yet have permissions to access. Users can enter the email address of the user or group who grants access, the URL of the process or tool that users need to gain access, or can enter the process itself as text.
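
To make the annotation types and the bulk-annotation behavior concrete, here is a minimal Python sketch. The Annotations and DataAsset classes, the annotate_all function, and the sample asset names and email address are all hypothetical; this models the idea only and is not the Data Catalog API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotations:
    """Hypothetical container for the annotation types listed above."""
    friendly_name: str = ""
    description: str = ""
    user_tags: List[str] = field(default_factory=list)
    glossary_tags: List[str] = field(default_factory=list)
    experts: List[str] = field(default_factory=list)
    request_access: str = ""


@dataclass
class DataAsset:
    """Hypothetical registered asset with its user-supplied annotations."""
    name: str
    annotations: Annotations = field(default_factory=Annotations)


def annotate_all(assets: List[DataAsset], description: str,
                 tags: List[str], experts: List[str]) -> None:
    """Apply the same description, tags, and experts to every selected asset."""
    for asset in assets:
        asset.annotations.description = description
        # Append only values that are not already present on the asset.
        for tag in tags:
            if tag not in asset.annotations.user_tags:
                asset.annotations.user_tags.append(tag)
        for expert in experts:
            if expert not in asset.annotations.experts:
                asset.annotations.experts.append(expert)


# Select two assets and annotate them in a single operation.
selected = [DataAsset("FactSales"), DataAsset("DimCustomer")]
annotate_all(selected, "Sales reporting tables",
             ["sales", "reporting"], ["data-team@example.com"])
```

The friendly name is left out of the bulk operation on purpose, since it identifies one asset and would not normally be shared across a selection.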

Data Source Friendly Name

We need to standardize our data source names. The default is the database name or filename, which becomes confusing because we usually use the same database names across our servers; only the tables change.

Naming scheme we’ll adopt:

ServerName.DatabaseName
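
As a small sketch of the scheme, the hypothetical Python helper below builds a friendly name from a server and database name. The function name and the sample server names are assumptions used only for illustration.

```python
def friendly_name(server_name: str, database_name: str) -> str:
    """Build a friendly name following the ServerName.DatabaseName scheme."""
    server = server_name.strip()
    database = database_name.strip()
    if not server or not database:
        raise ValueError("both server and database names are required")
    return f"{server}.{database}"


# The same database name on two servers now yields two distinct friendly names.
print(friendly_name("SQLPROD01", "Sales"))  # SQLPROD01.Sales
print(friendly_name("SQLDEV01", "Sales"))   # SQLDEV01.Sales
```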

Document Data Source

Documenting data sources with Azure Data Catalog can create a narrative about your data assets in as much detail as you need. By using links, you can link to content stored in an existing content repository which brings your existing docs and data assets together. Once your users discover appropriate data assets, they can have a complete set of documentation.

Security

Add Account to the Data Catalog Application

Adding accounts to the Data Catalog is a two-step process:

Add Account to the Azure Portal

  1. Browse to http://portal.azure.com
  2. Search for and open the Azure Data Catalog
  3. Click Access
  4. Add the user account to the appropriate group (Owner, Contributor, Reader)

Add Account to the Data Catalog Web UI

The second place you need to add the account is the Data Catalog web UI:

  1. Open and sign in to the Data Catalog.
  2. Within the Data Catalog web UI, click Settings.
  3. Add the account to one or more of the sections (catalog users are allowed to publish and update data, while catalog administrators are allowed to update the Business Glossary).
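
Because an account has to be added in both places, it can help to track which step is still outstanding. The Python sketch below is a hypothetical checklist helper; the function name and the way roles and groups are passed in are assumptions, and it does not query Azure.

```python
from typing import Iterable, List, Optional

# Groups named in the two procedures above.
PORTAL_ROLES = {"Owner", "Contributor", "Reader"}
CATALOG_GROUPS = {"catalog users", "catalog administrators"}


def missing_access_steps(portal_role: Optional[str],
                         catalog_groups: Iterable[str]) -> List[str]:
    """Return the setup steps that are still outstanding for an account."""
    missing = []
    if portal_role not in PORTAL_ROLES:
        missing.append("Add the account to a group in the Azure Portal")
    if not CATALOG_GROUPS.intersection(catalog_groups):
        missing.append("Add the account in the Data Catalog web UI settings")
    return missing


# An account that was added in the portal but not yet in the web UI.
print(missing_access_steps("Reader", []))
```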

Troubleshooting

New Data Sources Not Displayed

After you add a new data source, it is not displayed in the web browser. I discovered that the Data Catalog web UI is not dynamic and does not refresh periodically. After adding new data sources using the application, you have to refresh or reload the web UI to display them.
