The EOS balancing system provides a fully automated mechanism to balance volume usage across the filesystems within a scheduling group. It does not currently balance between scheduling groups; see Group Balancer for that.
The balancing system relies on the cooperation of several components:
Each filesystem advertises its used volume, and the central view shows the deviation from the average filesystem usage in each group.
EOS Console [root://localhost] |/> group ls
#---------------------------------------------------------------------------------------------------------------------
# type # name # status #nofs #dev(filled) #avg(filled) #sig(filled) #balancing # bal-run #drain-run
#---------------------------------------------------------------------------------------------------------------------
groupview default.0 on 8 0.27 0.10 0.12 idle 0 0
groupview default.1 on 8 0.28 0.10 0.12 idle 0 0
groupview default.10 on 8 0.29 0.10 0.13 idle 0 0
groupview default.11 on 8 0.29 0.10 0.13 idle 0 0
groupview default.12 on 7 0.28 0.11 0.14 idle 0 0
groupview default.13 on 8 0.28 0.12 0.14 idle 0 0
groupview default.14 on 8 0.29 0.10 0.13 idle 0 0
groupview default.15 on 8 0.30 0.10 0.13 idle 0 0
groupview default.16 on 7 0.26 0.12 0.13 idle 0 0
groupview default.17 on 8 0.28 0.12 0.14 idle 0 0
groupview default.18 on 8 0.30 0.10 0.14 idle 0 0
groupview default.19 on 8 12.42 4.76 6.80 idle 0 0
groupview default.2 on 8 0.48 0.16 0.23 idle 0 0
groupview default.20 on 8 14.03 5.43 7.62 idle 0 0
groupview default.21 on 8 0.48 0.16 0.23 idle 0 0
groupview default.3 on 8 0.28 0.10 0.12 idle 0 0
groupview default.4 on 8 0.26 0.11 0.13 idle 0 0
groupview default.5 on 8 0.27 0.10 0.12 idle 0 0
groupview default.6 on 8 0.27 0.10 0.12 idle 0 0
groupview default.7 on 8 0.27 0.09 0.12 idle 0 0
groupview default.8 on 8 0.27 0.10 0.12 idle 0 0
groupview default.9 on 8 0.30 0.11 0.14 idle 0 0
The decision parameter for enabling balancing in a group is the maximum deviation of the filling state (given in %). In this example two groups are unbalanced (default.19 at 12.42 % and default.20 at 14.03 %).
The balancing is configured on the space level and the current configuration is displayed using the ‘space status’ command:
EOS Console [root://localhost] |/> space status default
# ------------------------------------------------------------------------------------
# Space Variables
# ....................................................................................
balancer := off
balancer.node.ntx := 10
balancer.node.rate := 10
balancer.threshold := 1
...
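The 'name := value' layout of this listing is easy to read programmatically. The following is a minimal sketch in Python, assuming the output has already been captured as text; the function name parse_space_status is hypothetical and not part of EOS.

```python
def parse_space_status(text):
    """Parse 'name := value' lines from captured 'space status' output.

    Hypothetical helper for illustration; it is not part of EOS.
    """
    variables = {}
    for line in text.splitlines():
        # Skip header/separator lines ('#') and anything without ':='.
        if line.lstrip().startswith("#") or ":=" not in line:
            continue
        name, value = line.split(":=", 1)
        variables[name.strip()] = value.strip()
    return variables


sample = """\
# Space Variables
balancer := off
balancer.node.ntx := 10
balancer.node.rate := 10
balancer.threshold := 1
...
"""

config = parse_space_status(sample)
print(config["balancer"], config["balancer.threshold"])
```

Values are kept as strings, so numeric variables such as balancer.threshold need an explicit conversion before they are compared.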
The configuration variables are:
variable              definition
balancer              can be off or on to disable or enable the balancing
balancer.node.ntx     number of parallel balancer transfers running on each FST
balancer.node.rate    rate limit for each running balancer transfer in MB/s
balancer.threshold    percentage at which balancing gets enabled within a scheduling group
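To illustrate how balancer.threshold relates to the dev(filled) column of 'group ls', the sketch below flags groups whose deviation exceeds a given threshold. It is an illustration under assumptions, not the EOS implementation: the helper name groups_over_threshold is hypothetical, the input is captured 'group ls' text, and the strict '>' comparison is an assumption about how the threshold is applied.

```python
def groups_over_threshold(group_ls_output, threshold_percent):
    """Return names of groups whose dev(filled) exceeds the threshold.

    Hypothetical helper; column positions follow the 'group ls' listing:
    type, name, status, nofs, dev(filled), avg(filled), sig(filled), ...
    """
    flagged = []
    for line in group_ls_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "groupview":
            continue  # skip header and separator lines
        name, dev_filled = fields[1], float(fields[4])
        # Assumption: balancing starts when the deviation is strictly larger.
        if dev_filled > threshold_percent:
            flagged.append(name)
    return flagged


sample = """\
# type # name # status #nofs #dev(filled) #avg(filled) #sig(filled) #balancing # bal-run #drain-run
groupview default.18 on 8 0.30 0.10 0.14 idle 0 0
groupview default.19 on 8 12.42 4.76 6.80 idle 0 0
groupview default.2 on 8 0.48 0.16 0.23 idle 0 0
groupview default.20 on 8 14.03 5.43 7.62 idle 0 0
"""

flagged = groups_over_threshold(sample, threshold_percent=1.0)
print(flagged)  # ['default.19', 'default.20']
```

With the threshold of 1 shown in the space configuration, only the two groups identified as unbalanced in the example are flagged.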
Balancing is enabled as follows:
EOS Console [root://localhost] |/> space config default space.balancer=on
success: balancer is enabled!
Once balancing is enabled, groups that are balancing are shown by the eos group ls command:
EOS Console [root://localhost] |/> group ls
#---------------------------------------------------------------------------------------------------------------------
# type # name # status #nofs #dev(filled) #avg(filled) #sig(filled) #balancing # bal-run #drain-run
#---------------------------------------------------------------------------------------------------------------------
groupview default.0 on 8 0.27 0.10 0.12 idle 0 0
groupview default.1 on 8 0.28 0.10 0.12 idle 0 0
groupview default.10 on 8 0.29 0.10 0.13 idle 0 0
groupview default.11 on 8 0.29 0.10 0.13 idle 0 0
groupview default.12 on 7 0.28 0.11 0.14 idle 0 0
groupview default.13 on 8 0.28 0.12 0.14 idle 0 0
groupview default.14 on 8 0.29 0.10 0.13 idle 0 0
groupview default.15 on 8 0.30 0.10 0.13 idle 0 0
groupview default.16 on 7 0.26 0.12 0.13 idle 0 0
groupview default.17 on 8 0.28 0.12 0.14 idle 0 0
groupview default.18 on 8 0.30 0.10 0.14 idle 0 0
groupview default.19 on 8 12.42 4.76 6.80 balancing 10 0
groupview default.2 on 8 0.48 0.16 0.23 idle 0 0
groupview default.20 on 8 14.03 5.43 7.62 balancing 12 0
groupview default.21 on 8 0.48 0.16 0.23 idle 0 0
groupview default.3 on 8 0.28 0.10 0.12 idle 0 0
groupview default.4 on 8 0.26 0.11 0.13 idle 0 0
groupview default.5 on 8 0.27 0.10 0.12 idle 0 0
groupview default.6 on 8 0.27 0.10 0.12 idle 0 0
groupview default.7 on 8 0.27 0.09 0.12 idle 0 0
groupview default.8 on 8 0.27 0.10 0.12 idle 0 0
groupview default.9 on 8 0.30 0.11 0.14 idle 0 0
The current balancing can also be viewed by space or node:
EOS Console [root://localhost] |/> space ls --io
#----------------------------------------------------------------------------------------------------------------------------------------------------------------------
# name # diskload # diskr-MB/s # diskw-MB/s #eth-MiB/s # ethi-MiB # etho-MiB #ropen #wopen # used-bytes # max-bytes # used-files # max-files # bal-run #drain-run
#----------------------------------------------------------------------------------------------------------------------------------------------------------------------
default 0.02 66.00 66.00 862 57 60 31 22 1.99 TB 347.33 TB 805.26 k 16.97 G 51 0
EOS Console [root://localhost] |/> node ls --io
#------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# hostport # diskload # diskr-MB/s # diskw-MB/s #eth-MiB/s # ethi-MiB # etho-MiB #ropen #wopen # used-bytes # max-bytes # used-files # max-files # bal-run #drain-run
#------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
lxfsra02a02.cern.ch:1095 0.08 41.00 0.00 119 0 41 23 0 825.47 GB 41.92 TB 298.80 k 2.05 G 0 0
lxfsra02a05.cern.ch:1095 0.03 19.00 0.00 119 0 19 2 0 832.01 GB 43.92 TB 152.14 k 2.15 G 0 0
lxfsra02a06.cern.ch:1095 0.01 0.00 11.00 119 12 0 0 6 70.05 GB 43.92 TB 54.77 k 2.15 G 10 0
lxfsra02a07.cern.ch:1095 0.01 0.00 11.00 119 9 0 0 3 79.95 GB 43.92 TB 75.91 k 2.15 G 10 0
lxfsra02a08.cern.ch:1095 0.01 0.00 11.00 119 9 0 0 2 52.01 GB 43.92 TB 61.25 k 2.15 G 8 0
lxfsra04a01.cern.ch:1095 0.01 0.00 10.00 119 9 0 0 1 72.12 GB 41.92 TB 60.92 k 2.05 G 8 0
lxfsra04a02.cern.ch:1095 0.01 0.00 10.00 119 9 0 0 7 52.32 GB 43.92 TB 86.72 k 2.15 G 10 0
lxfsra04a03.cern.ch:1095 0.01 0.00 10.00 119 9 0 0 5 10.53 GB 43.92 TB 14.80 k 2.15 G 5 0
To see the usage difference within a group, one can inspect all of its filesystems via eos group ls --IO, e.g.:
EOS Console [root://localhost] |/> group ls --IO default.20
#---------------------------------------------------------------------------------------------------------------------
# type # name # status #nofs #dev(filled) #avg(filled) #sig(filled) #balancing # bal-run #drain-run
#---------------------------------------------------------------------------------------------------------------------
groupview default.20 on 8 13.71 5.48 7.47 balancing 37 0
#.................................................................................................................................................................................................................
# hostport # id # schedgroup # diskload # diskr-MB/s # diskw-MB/s #eth-MiB/s # ethi-MiB # etho-MiB #ropen #wopen # used-bytes # max-bytes # used-files # max-files # bal-run #drain-run
#.................................................................................................................................................................................................................
lxfsra02a05.cern.ch:1095 17 default.20 0.47 12.00 0.00 119 0 21 1 0 383.17 GB 2.00 TB 59.33 k 97.52 M 0 0
lxfsra02a06.cern.ch:1095 35 default.20 0.08 0.00 6.00 119 10 0 0 6 26.56 GB 2.00 TB 6.23 k 97.52 M 7 0
lxfsra04a01.cern.ch:1095 57 default.20 0.13 0.00 6.00 119 9 0 0 4 25.01 GB 2.00 TB 6.11 k 97.52 M 4 0
lxfsra02a08.cern.ch:1095 77 default.20 0.08 0.00 6.00 119 11 0 0 5 27.36 GB 2.00 TB 6.64 k 97.52 M 8 0
lxfsra04a02.cern.ch:1095 99 default.20 0.07 0.00 4.00 119 10 0 0 3 26.57 GB 2.00 TB 7.75 k 97.52 M 6 0
lxfsra02a02.cern.ch:1095 121 default.20 1.00 22.00 0.00 119 0 41 21 0 351.07 GB 2.00 TB 59.80 k 97.52 M 0 0
lxfsra02a07.cern.ch:1095 143 default.20 0.10 0.00 7.00 119 9 0 0 2 28.57 GB 2.00 TB 7.46 k 97.52 M 7 0
lxfsra04a03.cern.ch:1095 165 default.20 0.12 0.00 6.00 119 10 0 0 5 7.56 GB 2.00 TB 2.96 k 97.52 M 5 0
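The group summary values can be related to the per-filesystem numbers with a short calculation. The sketch below, in Python, takes the used-bytes column of the listing above and assumes that dev(filled) is the largest absolute deviation from the average fill percentage, that sig(filled) is the population standard deviation, and that 2.00 TB corresponds to 2000 GB. These are assumptions made for illustration, not a statement of how EOS computes the columns.

```python
from statistics import mean, pstdev

# used-bytes per filesystem in default.20, in GB, copied from the listing
used_gb = [383.17, 26.56, 25.01, 27.36, 26.57, 351.07, 28.57, 7.56]
capacity_gb = 2000.0  # assumption: max-bytes of 2.00 TB taken as 2000 GB

# Fill state of each filesystem in percent
fill = [100.0 * used / capacity_gb for used in used_gb]

avg_filled = mean(fill)
# Assumption: dev(filled) is the maximum absolute deviation from the average
dev_filled = max(abs(f - avg_filled) for f in fill)
# Assumption: sig(filled) is the population standard deviation
sig_filled = pstdev(fill)

print(f"avg={avg_filled:.2f} dev={dev_filled:.2f} sig={sig_filled:.2f}")
```

Under these assumptions the results land close to the 5.48, 13.71 and 7.47 reported in the group line; small differences are expected because the listing rounds its byte values.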
The scheduling activity for balancing can be monitored with the eos ns stat command:
EOS Console [root://localhost] |/> ns stat
# ------------------------------------------------------------------------------------
# Namespace Statistic
# ------------------------------------------------------------------------------------
ALL Files 682781 [booted] (12s)
ALL Directories 1316
# ....................................................................................
ALL File Changelog Size 804.27 MB
ALL Dir Changelog Size 515.98 kB
# ....................................................................................
ALL avg. File Entry Size 1.18 kB
ALL avg. Dir Entry Size 392.00 B
# ------------------------------------------------------------------------------------
ALL Execution Time 0.40 +- 1.12
# -----------------------------------------------------------------------------------------------------------
who command sum 5s 1min 5min 1h exec(ms) +- sigma(ms)
# -----------------------------------------------------------------------------------------------------------
ALL Access 0 0.00 0.00 0.00 0.00 -NA- +- -NA-
....
ALL Schedule2Balance 6423 11.75 10.81 10.71 1.78 -NA- +- -NA-
ALL Schedule2Drain 0 0.00 0.00 0.00 0.00 -NA- +- -NA-
ALL Scheduled2Balance 6423 11.75 10.81 10.71 1.78 4.20 +- 0.57
ALL SchedulingFailedBalance 0 0.00 0.00 0.00 0.00 -NA- +- -NA-
The relevant counters are:
counter                    definition
Schedule2Balance           counter/rate at which all FSTs ask for a file to balance
Scheduled2Balance          counter/rate of balancing transfers which have been scheduled to FSTs
SchedulingFailedBalance    counter/rate of scheduling requests which could not get any workload (e.g. no file matches the target machine)
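As a rough illustration of how these counters might be tracked, the following Python sketch extracts the balance-related rows from captured 'ns stat' text and derives the share of requests that failed to obtain work. The helper name parse_balance_counters and the derived ratio are hypothetical and chosen for illustration only; EOS does not report such a ratio itself.

```python
def parse_balance_counters(ns_stat_output):
    """Collect the balance-related rows of captured 'ns stat' output.

    Hypothetical helper; rows are expected to look like
    'ALL <name> <sum> <5s> <1min> <5min> <1h> ...'.
    """
    counters = {}
    for line in ns_stat_output.splitlines():
        fields = line.split()
        if len(fields) >= 7 and fields[0] == "ALL" and "Balance" in fields[1]:
            counters[fields[1]] = {
                "sum": int(fields[2]),
                "rate_1min": float(fields[4]),
            }
    return counters


sample = """\
ALL Files 682781 [booted] (12s)
ALL Access 0 0.00 0.00 0.00 0.00 -NA- +- -NA-
ALL Schedule2Balance 6423 11.75 10.81 10.71 1.78 -NA- +- -NA-
ALL Schedule2Drain 0 0.00 0.00 0.00 0.00 -NA- +- -NA-
ALL Scheduled2Balance 6423 11.75 10.81 10.71 1.78 4.20 +- 0.57
ALL SchedulingFailedBalance 0 0.00 0.00 0.00 0.00 -NA- +- -NA-
"""

counters = parse_balance_counters(sample)
requests = counters["Schedule2Balance"]["sum"]
failed = counters["SchedulingFailedBalance"]["sum"]
# Illustrative ratio: share of FST requests that found no file to balance
failed_share = failed / requests if requests else 0.0
print(sorted(counters), failed_share)
```

In the sample, every request was scheduled, so the failed share is zero.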