Azure Data Lake Store

Microsoft Azure Data Lake Store is a Hadoop file system that’s compatible with Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem. Data Lake Store is integrated with Azure Data Lake Analytics and Azure HDInsight and will be integrated with Microsoft offerings like Revolution-R Enterprise; industry-standard distributions like Hortonworks, Cloudera, and MapR; and individual Hadoop projects like Spark, Storm, Flume, Sqoop, and Kafka.

Administer ADL

Add Non-Interactive Account

To access the Data Lake Store from a script or through code, you need a non-interactive account, that is, an application registered in Azure Active Directory.

Add the account using portal.azure.com:

1. Open the Active Directory Extension and open the Active Directory in use for your subscription. A new window will open to the classic management portal.

2. Go to Applications within the Active Directory management page.

3. Add a new application by clicking Add at the bottom of the Active Directory management page.

4. Define a new web-based application.

5. Fill in the URLs. The Sign-On URL is used later as appRedirectUri.

6. The new application will be created.

7. Under Configure, expand Access Web APIs in Other Applications, and click configure key.

8. Set USER ASSIGNMENT REQUIRED TO ACCESS APP to Yes.

9. Copy the ClientID (CLIENT ID) to be used later.

10. Under Keys, select a 2-year duration.

11. Click Save, and copy the new secret key to be used later.

Note: you cannot retrieve the secret key after you leave the page; if you lose it, you must create a new secret key.

12. Go back to the portal, browse to the Data Lake Store, click on the Resource Group…

Next, click Users, and add the newly defined application to the resource group that was set up when the Data Lake Store was created.

Apply ACL to the Data Lake Store

After you create or modify account access to the Data Lake Store, you must also grant the account (or its group) access to the folder by modifying the folder permissions.

1. Click on the Data Lake Store.

2. Click on Data Explorer located at the top of the screen.

3. Within the Data Lake, click Access and modify the folder permissions.
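
The same folder permission can probably be scripted instead of set in the portal. The sketch below is a hedged illustration, not a verified procedure: New-AdlAclEntryParams is a hypothetical helper introduced here, the object ID is a placeholder, and it assumes the AzureRM.DataLakeStore module provides Set-AzureRmDataLakeStoreItemAclEntry with the Account, Path, AceType, Id, and Permissions parameters. The ACL entry is expected to need the object ID of the application's service principal rather than its Client ID.

```powershell
# Hypothetical helper: builds the parameters for one ACL entry so they can be
# reviewed before anything is changed in the store.
function New-AdlAclEntryParams {
    param(
        [Parameter(Mandatory = $true)][string] $AccountName,
        # Assumption: object ID of the application's service principal.
        [Parameter(Mandatory = $true)][guid] $ObjectId,
        [string] $Path = '/',
        [ValidateSet('None', 'Execute', 'Write', 'WriteExecute', 'Read', 'ReadExecute', 'ReadWrite', 'All')]
        [string] $Permissions = 'ReadExecute'
    )

    @{
        Account     = $AccountName
        Path        = $Path
        AceType     = 'User'
        Id          = $ObjectId
        Permissions = $Permissions
    }
}

# Placeholder values; replace them with your own.
$aclParams = New-AdlAclEntryParams -AccountName 'data lake store name' `
    -ObjectId '00000000-0000-0000-0000-000000000000' -Path '/MyFiles' -Permissions All

# After Login-AzureRmAccount, the entry would be applied by splatting:
# Set-AzureRmDataLakeStoreItemAclEntry @aclParams
```

Building the parameters separately from the call keeps the change easy to inspect, and the same hashtable can be reused for several folders by changing only Path.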

Retrieve the TenantId

You must pass the tenant ID with your authentication request. For Web Apps and Web API Apps, you can retrieve the tenant ID by selecting View endpoints at the bottom of the screen; it is the GUID that appears in the endpoint URLs. Alternatively, retrieve it from the Data Lake Store:

1. Click on the Data Lake Store.

2. Click on All Settings, Users.

3. Click on the external user added using the instructions above.

4. You’ll find the tenant ID under the account.

Reference for Client ID and Tenant ID: https://azure.microsoft.com/en-us/documentation/articles/resource-group-create-service-principal-portal/

Retrieve the SubscriptionId

To retrieve the subscription ID, browse to the Data Lake Store, click on the Subscription where the Data Lake Store is housed, and copy the subscription ID.

Retrieve the Token Endpoint

The authentication endpoint, also known as the token endpoint (authTokenEndpoint), has the following form:

https://login.microsoftonline.com/<TenantId>/oauth2/token

For example:

https://login.microsoftonline.com/14ea3e2d-a67c-4c86-821b-51e6745fd11d/oauth2/token
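
As a minimal sketch of how the token endpoint could be used from a script, the functions below build the endpoint from a tenant ID and request a token with the OAuth 2.0 client-credentials grant. The function names are hypothetical, and the default resource URI https://datalake.azure.net/ is an assumption about the Data Lake Store data plane; adjust it if you target a different resource.

```powershell
# Builds the token endpoint URL for a tenant; rejects values that are not GUIDs.
function Get-AdlTokenEndpoint {
    param([Parameter(Mandatory = $true)][string] $TenantId)

    $parsed = [guid]::Empty
    if (-not [guid]::TryParse($TenantId, [ref] $parsed)) {
        throw "TenantId '$TenantId' is not a valid GUID."
    }
    "https://login.microsoftonline.com/$parsed/oauth2/token"
}

# Builds the form fields for a client-credentials token request.
function New-AdlTokenRequestBody {
    param(
        [Parameter(Mandatory = $true)][string] $ClientId,
        [Parameter(Mandatory = $true)][string] $ClientSecret,
        # Assumption: resource URI of the Data Lake Store data plane.
        [string] $Resource = 'https://datalake.azure.net/'
    )

    @{
        grant_type    = 'client_credentials'
        client_id     = $ClientId
        client_secret = $ClientSecret
        resource      = $Resource
    }
}

# Posts the form to the token endpoint and returns the access token string.
# Defined only; calling it requires network access and real credentials.
function Get-AdlAccessToken {
    param([string] $TenantId, [string] $ClientId, [string] $ClientSecret)

    $endpoint = Get-AdlTokenEndpoint -TenantId $TenantId
    $body = New-AdlTokenRequestBody -ClientId $ClientId -ClientSecret $ClientSecret
    (Invoke-RestMethod -Method Post -Uri $endpoint -Body $body).access_token
}
```

Validating the tenant ID before building the URL catches a pasted Client ID or a truncated GUID early, before a request is sent.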

How to Access Azure Data Lake Store

Java SDK

See Get started with Azure Data Lake Store using Java

Example code in TFS: …Shared/Tools/AzureDatalakeStorewithJavaSDK

.NET SDK

See Get started with Azure Data Lake Store using .NET SDK to learn how to use the Azure Data Lake Store .NET SDK to create an Azure Data Lake account and perform basic operations such as creating folders, uploading and downloading data files, and deleting your account. For more information about Data Lake, see Azure Data Lake Store.

Example code in TFS: …Shared/Tools/AzureDataLakeWithNetSDK

Non-Interactive Authentication

The following snippet shows an AuthenticateApplication method that you can use for a non-interactive login experience.

using System;
using System.Security;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Rest;

// Authenticate the application with AAD through the application's secret key.
// You need to have an application registered with AAD in order to authenticate.
//   For more information and instructions on how to register your application with AAD, see:
//   https://azure.microsoft.com/en-us/documentation/articles/resource-group-create-service-principal-portal/
// Note: appRedirectUri is not used in this flow.
public static TokenCredentials AuthenticateApplication(string tenantId, string resource, string appClientId, Uri appRedirectUri, SecureString clientSecret)
{
    // The authority is the login endpoint for the given tenant.
    var authContext = new AuthenticationContext("https://login.microsoftonline.com/" + tenantId);
    var credential = new ClientCredential(appClientId, clientSecret);

    // Request a token for the target resource using the client ID and secret.
    var tokenAuthResult = authContext.AcquireToken(resource, credential);

    return new TokenCredentials(tokenAuthResult.AccessToken);
}

TenantId

1. Start Azure PowerShell, then run Login-AzureRmAccount to log in. The TenantId is shown in the command output.

2. If you receive an error that Login-AzureRmAccount cannot be found, install the Azure modules, then run Login-AzureRmAccount again:

# Install the Azure Resource Manager modules from the PowerShell Gallery
Install-Module AzureRM
# Install the Azure Service Management module from the PowerShell Gallery
Install-Module Azure

Resource

ClientId

appClientId and appRedirectUri
  • Using portal.azure.com, go to App Registrations within the Active Directory management page. Select the application and click Configure. The Sign-On URL is appRedirectUri, and the Client ID is appClientId.

Secret Key

clientSecret
  • Using portal.azure.com, go to Applications within the Active Directory management page. Select the application and click Configure. Go to the Keys section, select a duration of 1 or 2 years, and click Save. The key displayed after saving is clientSecret.

Azure Data Lake Store cmdlets

The commands below use PowerShell to move files within, into, or out of a Data Lake Store. Replace the quoted placeholder values with your own.

First, log in to Azure (the Azure Resource Manager module must be installed):

Login-AzureRmAccount

View the available subscriptions:
Get-AzureRmSubscription

Select the subscription that contains the Data Lake Store:
Select-AzureRmSubscription -SubscriptionId "SubscriptionId"

Where supported (for example, Export, Import, and Remove), add -Recurse to the end of the cmdlets below to make them act recursively.

Move or rename a file or folder within a Data Lake Store:
Move-AzureRmDataLakeStoreItem -AccountName "data lake store name" -Path "/Original/Path/File.txt" -Destination "/New/Path/RenamedFile.txt"

Download a file from the Data Lake Store into a local directory:
Export-AzureRmDataLakeStoreItem -AccountName "data lake store name" -Path "/myFiles/TestSource.csv" -Destination "C:\Test.csv"

Upload a local file or directory to a Data Lake Store:
Import-AzureRmDataLakeStoreItem -AccountName "data lake store name" -Path "C:\SrcFile.csv" -Destination "/MyFiles/File.csv"

Delete a file or folder in the Data Lake Store:
Remove-AzureRmDataLakeStoreItem -AccountName "data lake store name" -Paths "/File01.txt","/MyFiles/File.csv"

Script to copy multiple user-specified folders from one Data Lake Store to another, using the local machine as an intermediary. Running it in Windows PowerShell ISE is recommended.

$datalakefolder = 'azure data lake store folder'
$subfolders = 'X','Y','Z'
$datalake1 = "datalake1's azure data lake store name"
$datalake2 = "datalake2's azure data lake store name"
$localpath = 'local machine folder'

foreach ($subfolder in $subfolders)
{
    # Download the folder from the first store to the local machine (-Recurse includes its contents).
    Export-AzureRmDataLakeStoreItem -AccountName $datalake1 -Path "/$datalakefolder/$subfolder/" -Destination "$localpath\$datalakefolder\$subfolder\" -Recurse -PerFileThreadCount 40 -ConcurrentFileCount 20

    # Upload the local copy to the second store.
    Import-AzureRmDataLakeStoreItem -AccountName $datalake2 -Path "$localpath\$datalakefolder\$subfolder\" -Destination "/$datalakefolder/$subfolder/" -Recurse -PerFileThreadCount 40 -ConcurrentFileCount 20
}

Get file count and folder size in bytes:

Example:

# Log in
Login-AzureRmAccount
# List the items directly under the folder and sum their Length property
Get-AzureRmDataLakeStoreChildItem -AccountName "resource-name" -Path "/dir/2017/01/02/" | Measure-Object -Sum 'Length'

Count : 96 ← (file count)
Average :
Sum : 41780702 ← (folder size in bytes)
Maximum :
Minimum :
Property : Length
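
Measure-Object counts every object it receives, so if the folder also contains subfolders, Count would presumably include them as well as files. The sketch below shows one way to count only files; Measure-AdlFolderItems is a hypothetical helper, and the sample objects only stand in for real listing results, assuming each item exposes Type and Length as used elsewhere on this page.

```powershell
# Hypothetical helper: counts only FILE items and sums their sizes in bytes.
function Measure-AdlFolderItems {
    param([Parameter(ValueFromPipeline = $true)] $Item)

    begin {
        $count = 0
        $bytes = [long] 0
    }
    process {
        # Directories are skipped so they do not inflate the file count.
        if ($Item.Type -eq 'FILE') {
            $count++
            $bytes += [long] $Item.Length
        }
    }
    end {
        [pscustomobject]@{ FileCount = $count; TotalBytes = $bytes }
    }
}

# Stand-in objects for illustration only; not real listing output.
$sample = @(
    [pscustomobject]@{ Type = 'FILE';      Length = 10 },
    [pscustomobject]@{ Type = 'DIRECTORY'; Length = 0 },
    [pscustomobject]@{ Type = 'FILE';      Length = 32 }
)
$sample | Measure-AdlFolderItems   # FileCount = 2, TotalBytes = 42

# Against a real store, after Login-AzureRmAccount:
# Get-AzureRmDataLakeStoreChildItem -AccountName "resource-name" -Path "/dir/2017/01/02/" | Measure-AdlFolderItems
```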

Get file count and folder size recursively in bytes:

# Returns every file under the given path, including files in subfolders.
function Get-DataLakeStoreChildItemRecursive ([hashtable] $Params) {
    $AllFiles = New-Object Collections.Generic.List[Microsoft.Azure.Commands.DataLakeStore.Models.DataLakeStoreItem];
    recurseDataLakeStoreChildItem -AllFiles $AllFiles -Params $Params
    $AllFiles
}
# Adds the files under $Params["Path"] to $AllFiles and descends into each subfolder.
function recurseDataLakeStoreChildItem ([System.Collections.ICollection] $AllFiles, [hashtable] $Params) {
    $ChildItems = Get-AzureRmDataLakeStoreChildItem @Params;
    $Path = $Params["Path"];
    foreach ($ChildItem in $ChildItems) {
        switch ($ChildItem.Type) {
            "FILE" {
                $AllFiles.Add($ChildItem);
            }
            "DIRECTORY" {
                # Point the listing at the subfolder, then recurse into it.
                $Params.Remove("Path");
                $Params.Add("Path", $Path + "/" + $ChildItem.Name);
                recurseDataLakeStoreChildItem -AllFiles $AllFiles -Params $Params;
            }
        }
    }
}
# Count the files and sum their sizes in bytes.
Get-DataLakeStoreChildItemRecursive -Params @{ 'Path' = '/dir'; 'Account' = 'resource-name' } | Measure-Object -Sum 'Length'

Troubleshooting

Authentication Failed

If authentication fails, the local configuration may be the cause. To fix it, start Azure PowerShell, then run Login-AzureRmAccount to log in.

If creating a directory or uploading a file fails, it may be a permission issue. Check the folder permissions as described in Apply ACL to the Data Lake Store.