This post explains how to calculate the total storage size of an Azure Data Lake Store (ADLS) Gen1 or Gen2 folder in PySpark using Azure Databricks or Azure Synapse Analytics.
Assumptions
- ADLS Gen1 or Gen2 is already set up and mounted in Azure Databricks or Azure Synapse Analytics.
- The code below calculates the folder size, but it cannot be used as a user-defined function (UDF). It calls the Databricks utility functions (dbutils) or the Synapse utility functions (mssparkutils), and these are not allowed inside a UDF. Databricks or Synapse will throw an error such as: could not serialize object: Exception: You cannot use dbutils within a spark job
Code
def recursiveDirSize(path):
    total = 0
    dir_files = dbutils.fs.ls(path)
    for file in dir_files:
        if file.isDir():
            # Recurse into subfolders and accumulate their sizes
            total += recursiveDirSize(file.path)
        else:
            # Accumulate the file size in bytes (+=, not =, so every file is counted)
            total += file.size
    return total

print(recursiveDirSize("/mnt/folder/"))
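The function above relies on dbutils, which is specific to Databricks. For Synapse, a minimal sketch of the same recursion using mssparkutils is shown below; it assumes the entries returned by mssparkutils.fs.ls expose isDir, size, and path attributes, and /mnt/folder/ is a placeholder path.

from notebookutils import mssparkutils  # available by default in Synapse Spark pools

def recursiveDirSizeSynapse(path):
    total = 0
    for entry in mssparkutils.fs.ls(path):
        if entry.isDir:  # assumed boolean attribute in mssparkutils (unlike dbutils, where isDir() is a method)
            total += recursiveDirSizeSynapse(entry.path)
        else:
            total += entry.size  # size is reported in bytes
    return total

size_bytes = recursiveDirSizeSynapse("/mnt/folder/")
print(f"{size_bytes / (1024 ** 3):.2f} GB")  # convert bytes to GB for readability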
Unix command
You can use the disk usage (du) Unix command in a Databricks or Synapse notebook to get the size. Every DBFS directory is also mounted on the driver's local file system and can be accessed under /dbfs.
%sh du -h /dbfs/mnt/folder/
The above command can take a long time to run on large folders, so run it cautiously.
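If you only need the grand total rather than a line per subdirectory, the standard -s flag of du prints a single summarized figure (the path below is the same placeholder mount as above):

%sh du -sh /dbfs/mnt/folder/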