Understanding HDFS quotas and Hadoop fs and fsck tools
In my experience, Hadoop users often confuse the file size numbers reported by commands such as `hadoop fsck`, `hadoop fs -dus`, and `hadoop fs -count -q` when reasoning about HDFS space quotas. Here is a short summary of how the various filesystem tools in Hadoop work in unison.
In this blog post we will look at three commands: `hadoop fsck`, `hadoop fs -dus`, and `hadoop fs -count -q`.
hadoop fsck and hadoop fs -dus
First, let’s start with `hadoop fsck` and `hadoop fs -dus`, because they report identical numbers.
```
$ hadoop fsck /path/to/directory
 Total size:    16565944775310 B      <=== see here
 Total dirs:    3922
 Total files:   418464
 Total blocks (validated):      502705 (avg. block size 32953610 B)
 Minimally replicated blocks:   502705 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          18
 Number of racks:               1
FSCK ended at Thu Oct 20 20:49:59 CET 2011 in 7516 milliseconds

The filesystem under path '/path/to/directory' is HEALTHY
```
```
$ hadoop fs -dus /path/to/directory
hdfs://master:54310/path/to/directory    16565944775310    <=== see here
```
As you can see, `hadoop fsck` and `hadoop fs -dus` report the effective HDFS storage space used, i.e. they show the “normal” file size (as you would see on a local filesystem) and do not account for replication in HDFS. In this case, the directory `/path/to/directory` stores data with a size of 16565944775310 bytes (15.1 TB). The fsck output above also tells us that the average replication factor across all files in `/path/to/directory` is exactly 3.0. This means that the total raw HDFS storage space used by these files – i.e. factoring in replication – is actually:

3.0 x 16565944775310 bytes (15.1 TB) = 49697834325930 bytes (45.2 TB)

This is how much HDFS storage is consumed by the files in `/path/to/directory`.
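If you want to script this calculation, here is a minimal sketch (assuming the `hadoop fsck` output format shown above) that extracts the two numbers from the fsck report and multiplies them:

```
$ # pull "Total size" and "Average block replication" out of the fsck
$ # report and multiply them to get the raw HDFS space consumed
$ hadoop fsck /path/to/directory | awk '/Total size/ {size=$3} /Average block replication/ {repl=$4} END {printf "%.0f\n", size*repl}'
49697834325930
```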
hadoop fs -count -q
Now, let us inspect the HDFS quota set for `/path/to/directory`. If we run `hadoop fs -count -q`, we get this result:
```
$ hadoop fs -count -q /path/to/directory
  QUOTA  REMAINING_QUOTA     SPACE_QUOTA  REMAINING_SPACE_QUOTA  DIR_COUNT  FILE_COUNT    CONTENT_SIZE  FILE_NAME
   none              inf  54975581388800          5277747062870       3922      418464  16565944775310  hdfs://master:54310/path/to/directory
```
(I manually added the column headers like QUOTA to the output to make it easier to read.)
The seventh column, CONTENT_SIZE, is again the effective HDFS storage space used: 16565944775310 bytes (15.1 TB). The third column, SPACE_QUOTA, is the raw HDFS space quota in bytes: 54975581388800. The fourth column, REMAINING_SPACE_QUOTA, is the remaining raw HDFS space quota in bytes: 5277747062870. Note that whereas `hadoop fsck` and `hadoop fs -dus` report the effective data size (= the same numbers you would see on a local filesystem), the third and fourth columns of `hadoop fs -count -q` indirectly return how many bytes this data actually consumes across the hard disks of the distributed cluster nodes – and for this they count each of the 3.0 replicas of an HDFS block individually (here, the value 3.0 is taken from the `hadoop fsck` output above and happens to match the default replication count). So if we subtract the remaining space quota from the full space quota, we get back the raw usage calculated above:

54975581388800 bytes (50 TB) - 5277747062870 bytes (4.8 TB) = 49697834325930 bytes (45.2 TB)
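For completeness: space quotas themselves are managed with `hadoop dfsadmin`, which requires HDFS superuser privileges. A quick sketch, reusing the 50 TB figure from the output above:

```
$ # set a raw-space quota of 54975581388800 bytes (50 TB) on the directory
$ hadoop dfsadmin -setSpaceQuota 54975581388800 /path/to/directory
$ # verify it with the quota-aware count command
$ hadoop fs -count -q /path/to/directory
$ # remove the space quota again
$ hadoop dfsadmin -clrSpaceQuota /path/to/directory
```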
Now keep in mind that the Hadoop space quota always counts against the raw HDFS disk space consumed. So if you have a quota of 10MB, you can store only a single 1MB file if you set its replication to 10, or up to three 1MB files if their replication is set to 3. Hadoop’s quotas work like that because the replication count of an HDFS file is a user-configurable setting: though Hadoop ships with a default value of 3, it is up to the users to decide whether they want to keep this value or change it. And because Hadoop can’t anticipate how users might be playing around with the replication setting for their files, it was decided that the Hadoop quotas always operate on the raw HDFS disk space consumed.
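To make the 10MB example concrete, the commands would look roughly like this. This is a sketch: `/user/alice/demo` and the local file `file-1mb` are made-up names, and in practice the quota check also reserves a full HDFS block per replica while a file is being written, so very small quotas interact with the block size (see the related article below).

```
$ # set a 10MB raw-space quota on a hypothetical demo directory
$ hadoop dfsadmin -setSpaceQuota 10485760 /user/alice/demo
$ # a 1MB file at the default replication of 3 consumes 3MB of raw quota
$ hadoop fs -put file-1mb /user/alice/demo/
$ # raising its replication to 10 would consume the full 10MB of raw quota
$ hadoop fs -setrep -w 10 /user/alice/demo/file-1mb
```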
TL;DR and Summary
If you never change the default value of 3 for the HDFS replication count of any files you store in your Hadoop cluster, this means in a nutshell that you should always multiply the numbers reported by `hadoop fsck` or `hadoop fs -dus` by 3 when you want to reason about HDFS space quotas.
| Reported file size | Local filesystem | `hadoop fsck` and `hadoop fs -dus` | `hadoop fs -count -q` (if replication factor == 3) |
|---|---|---|---|
| If a file was of size… | 1GB | 1GB | 3GB |
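In script form, that rule of thumb might look like this (assuming the two-column path/size output of `hadoop fs -dus` shown earlier and an unchanged replication factor of 3):

```
$ # effective size reported by -dus, times 3 replicas = raw HDFS usage
$ hadoop fs -dus /path/to/directory | awk '{print $2 * 3}'
49697834325930
```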
I hope this clears things up a bit!
Related Articles
- Hadoop space quotas, HDFS block size, replication and small files (earlier blog post)