In large Data Lake Stores, operations on ACL entries are slow
I recently wanted to simplify permissions in a Data Lake Storage file system. For folders and files in Data Lake Storage, we define permissions using Access Control Lists.
Microsoft generally recommends assigning permissions to security groups rather than to individual users or apps:
"32 ACLs can be set per file and per directory. Access and default ACLs each have their own 32 ACL entry limit. Use security groups for ACL assignments if possible. By using groups, you’re less likely to exceed the maximum number of ACL entries per file or directory."
Source: Access control in Azure Data Lake Storage Gen1
There is another reason why it’s convenient to manage permissions with security groups in Active Directory: changing ACL definitions on many objects is painfully slow. Files and folders don’t inherit permissions from their parent folders (a default ACL is only copied to new children when they are created), so every change means updating many individual objects. Changing the membership of a group touches none of them.
The experience of doing it from Azure Portal was terrible. It could be done, but a single operation took literally hours. The browser started consuming large amounts of RAM (over 10 GB), CPU usage was near 100%, and network usage was very high, as if the browser was constantly sending many individual requests.
How to remove or assign ACLs faster than from the Azure Portal’s UI?
The Azure Portal UI itself suggests a better option: a PowerShell command.
I found it easiest to run this command directly from the Cloud Shell in Azure Portal.
The only downside is that a session in Cloud Shell times out after 20 minutes of inactivity (by design). If the operation time exceeds 20 minutes, it might be better to run the command on your local machine.
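For a local run, the session setup could look roughly like the sketch below. It assumes the AzureRM PowerShell module (the one the remove command in this post belongs to) and an interactive sign-in. The subscription name is a placeholder, not a real value.

```powershell
# One-time step: install the AzureRM module for the current user.
Install-Module -Name AzureRM -Scope CurrentUser

# Load the module and sign in interactively.
Import-Module AzureRM
Connect-AzureRmAccount

# Placeholder name: select the subscription that contains your Data Lake Store.
Select-AzureRmSubscription -SubscriptionName "my-subscription-name"
```

After that, the same commands that work in Cloud Shell should work in the local session, without the inactivity timeout.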
Now to the point. How do we remove multiple ACLs in a single PowerShell operation? The example in the documentation has some errors, so I am publishing my own in case any of us needs to do this in the future:
# List all ACLs to remove, separated by a comma. Ids are just examples, insert your own ;)
$aclsToRemove = "user:eba73f1b-fa61-45df-ac8c-cb068836680d:rwx,default:user:eba73f1b-fa61-45df-ac8c-cb068836680d:rwx"

# Split the string into array of individual ACLs
$aclsToRemoveAsArray = $aclsToRemove.Split(",")

# Remove all in a single operation (recursively)
Remove-AzureRmDataLakeStoreItemAclEntry -AccountName "mydatalakename" -Path / -Acl $aclsToRemoveAsArray -Recurse -ShowProgress
This is much faster than a similar operation from Azure Portal’s UI. On a data set of ~1 TB and more than 1 000 000 files, it took only a few minutes.
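The opposite direction, assigning permissions to a security group, could be sketched in the same way. This is not tested against a real store: it assumes that Set-AzureRmDataLakeStoreItemAclEntry accepts the same -Recurse and -ShowProgress switches and the same string form of -Acl as the remove command above. New-GroupAclSpec is a hypothetical helper written for this example, and the group id and the "r-x" permissions are placeholders.

```powershell
# Hypothetical helper (not part of AzureRM): builds the access and default
# ACL entries for a single security group.
function New-GroupAclSpec {
    param(
        [Parameter(Mandatory = $true)][string] $GroupObjectId,
        [string] $Permissions = "r-x"
    )
    # One entry for the access ACL and one for the default ACL.
    return @(
        "group:${GroupObjectId}:$Permissions",
        "default:group:${GroupObjectId}:$Permissions"
    )
}

# Placeholder object id of a security group, insert your own.
$aclsToSet = New-GroupAclSpec -GroupObjectId "00000000-0000-0000-0000-000000000000"

# Assumption: same switches as the remove command. Applies the entries to the
# root folder and everything below it in a single operation.
Set-AzureRmDataLakeStoreItemAclEntry -AccountName "mydatalakename" -Path / -Acl $aclsToSet -Recurse -ShowProgress
```

Setting both the access and the default entry mirrors the remove example: the access entry covers existing objects, and the default entry is what newly created children will receive.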