The EOS file scheduler is a core component of EOS which decides on which filesystems to place or access files. This decision is based on the state of the filesystems in the considered scheduling group, their geotags, and the location (geotag) of the client.
This information is structured in the form of so-called scheduling trees, the shape of the trees being given by the geotags of the filesystems. There is one scheduling tree per scheduling group.
The file scheduler is a stateful component whose state is continuously updated to reflect the status of all the filesystems involved in the instance.
The file scheduler is involved in ALL file access/placement operations, including file access/placement from clients, space balancing, and filesystem draining.
The interaction with the file geoscheduling is threefold: configuring and inspecting the GeoTreeEngine, managing disabled branches of the scheduling trees, and displaying information aggregated along those trees.
The GeoTreeEngine is a software component inside EOS in charge of keeping a consistent, up-to-date view of each scheduling group. For each scheduling group, this view is summarized into a scheduling tree and multiple snapshots of this scheduling tree, one for each type of access/placement operation. These snapshots are then copied and used to serve all the file access/placement requests. To achieve this, the GeoTreeEngine relies on several features described in the rest of this section, among them a background updater that keeps the trees and snapshots fresh, a penalty system applied to filesystem scores, and detection of saturated filesystems.
The commands
geosched show param
and
geosched set
allow viewing and setting the internal parameters of the GeoTreeEngine.
EOS Console [root://localhost] |/eos/demo/> geosched show param
### GeoTreeEngine parameters :
skipSaturatedAccess = 1
skipSaturatedDrnAccess = 1
skipSaturatedBlcAccess = 1
penaltyUpdateRate = 1
plctDlScorePenalty = 10(default) | 10(1Gbps) | 10(10Gbps) | 10(100Gbps) | 10(1000Gbps)
plctUlScorePenalty = 10(default) | 10(1Gbps) | 10(10Gbps) | 10(100Gbps) | 10(1000Gbps)
accessDlScorePenalty = 10(default) | 10(1Gbps) | 10(10Gbps) | 10(100Gbps) | 10(1000Gbps)
accessUlScorePenalty = 10(default) | 10(1Gbps) | 10(10Gbps) | 10(100Gbps) | 10(1000Gbps)
fillRatioLimit = 80
fillRatioCompTol = 100
saturationThres = 10
timeFrameDurationMs = 1000
### GeoTreeEngine list of groups :
default.0 , default.1 , default.10 , default.11 , default.12 , default.13
default.14 , default.15 , default.16 , default.17 , default.18 , default.19
default.2 , default.20 , default.21 , default.22 , default.3 , default.4
default.5 , default.6 , default.7 , default.8 , default.9 ,
These parameters are described below.
skipSaturatedAccess
    As skipSaturatedPlct, but for access operations.
skipSaturatedDrnAccess
    As skipSaturatedPlct, but for draining access operations.
skipSaturatedBlcAccess
    As skipSaturatedPlct, but for balancing access operations.
penaltyUpdateRate
    Weight of the penalty update at each time frame. 0 means that the penalties are fixed; 100 means that new values are estimated for each time frame regardless of the past. This parameter is used to ensure some stability of the penalties when they are self-estimated.
plctDlScorePenalty
    Atomic penalty applied to a filesystem's download score on any type of placement operation. It is a vector indexed by the networking speed class of the filesystem.
plctUlScorePenalty
    As plctDlScorePenalty, but for the upload score.
accessDlScorePenalty
    As plctDlScorePenalty, but for access operations.
accessUlScorePenalty
    As accessDlScorePenalty, but for the upload score.
fillRatioLimit
    Fill ratio above which a filesystem should not be used for a placement or an RW access operation.
fillRatioCompTol
    Quantity by which the fill ratios of two filesystems must differ to be considered different. 100 means that, whatever the fill ratios of two compared filesystems are, they are never considered different. The file scheduler, among other criteria, tries to balance filesystem fill ratios using this tolerance; as a consequence, a value of 10 makes it try to keep all fill ratios equal within a 10% tolerance, while a value of 100 disables this inline space balancing.
saturationThres
    Threshold below which a filesystem's upload or download score marks the filesystem as saturated.
timeFrameDurationMs
    Periodicity, in milliseconds, of the internal state update (especially of the snapshots and possibly of the trees).
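These values can then be tuned with geosched set. The sketch below is only an illustration: it assumes the argument order geosched set <parameter> [index] <value>, where the optional index selects a networking speed class for the vector-valued penalty parameters; the exact syntax for a given EOS version should be checked with the console help.
# raise the fill ratio limit to 85% (the value is only an example)
geosched set fillRatioLimit 85
# assumed: index 2 selects the 10Gbps speed class of this vector parameter
geosched set plctDlScorePenalty 2 5
# verify the new values
geosched show param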
The internal state of the GeoTreeEngine is essentially composed of scheduling trees and snapshots. They can be displayed with the commands
geosched show tree
geosched show snapshot
Warning
By design, the information attached to the trees might not be up to date, contrary to the snapshots, which are kept up to date.
The internal state also includes the penalty accounting table and the fs age/latency report. They can be displayed with the command
geosched show state
Some examples follow.
EOS Console [root://localhost] |/eos/demo/> geosched show tree default.0
### scheduling tree for scheduling group default.0 :
--------default.0 [3,9]
|----------site1 [1,3]
| `----------rack1 [1,2]
| `----------1@lxfsrd47a04.cern.ch [1,1,UnvRW]
|
|
`----------site2 [2,5]
|----------rack1 [1,2]
| `----------24@lxfsre13a01.cern.ch [1,1,UnvRW]
|
`----------rack2 [1,2]
`----------46@lxfsrg15a01.cern.ch [1,1,UnvRW]
EOS Console [root://localhost] |/eos/demo/> geosched show snapshot default.0
### scheduling snapshot for scheduling group default.0 and operation 'Placement' :
--------default.0/( free:2|repl:0|pidx:1|status:OK|ulSc:99|dlSc:99|filR:0|totS:3.85797e+12)
|----------site1/( free:1|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
| `----------rack1/( free:1|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
| `----------1/( free:1|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)@lxfsrd47a04.cern.ch
|
|
`----------site2/( free:1|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
|----------rack1/( free:1|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
| `----------24/( free:1|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)@lxfsre13a01.cern.ch
|
`----------rack2/( free:0|repl:0|pidx:0|status:Dis|ulSc:0|dlSc:0|filR:0|totS:0)
`----------46/( free:1|repl:0|pidx:0|status:DISRW|ulSc:99|dlSc:99|filR:0|totS:1.99091e+12)@lxfsrg15a01.cern.ch
### scheduling snapshot for scheduling group default.0 and operation 'Access RO' :
--------default.0/( free:0|repl:0|pidx:1|status:OK|ulSc:99|dlSc:99|filR:0|totS:3.85797e+12)
|----------site1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
| `----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
| `----------1/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)@lxfsrd47a04.cern.ch
|
|
`----------site2/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
|----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
| `----------24/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)@lxfsre13a01.cern.ch
|
`----------rack2/( free:0|repl:0|pidx:0|status:Dis|ulSc:0|dlSc:0|filR:0|totS:0)
`----------46/( free:0|repl:0|pidx:0|status:DISRW|ulSc:99|dlSc:99|filR:0|totS:1.99091e+12)@lxfsrg15a01.cern.ch
### scheduling snapshot for scheduling group default.0 and operation 'Access RW' :
--------default.0/( free:0|repl:0|pidx:1|status:OK|ulSc:99|dlSc:99|filR:0|totS:3.85797e+12)
|----------site1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
| `----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
| `----------1/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)@lxfsrd47a04.cern.ch
|
|
`----------site2/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
|----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
| `----------24/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)@lxfsre13a01.cern.ch
|
`----------rack2/( free:0|repl:0|pidx:0|status:Dis|ulSc:0|dlSc:0|filR:0|totS:0)
`----------46/( free:0|repl:0|pidx:0|status:DISRW|ulSc:99|dlSc:99|filR:0|totS:1.99091e+12)@lxfsrg15a01.cern.ch
### scheduling snapshot for scheduling group default.0 and operation 'Draining Access' :
--------default.0/( free:0|repl:0|pidx:1|status:OK|ulSc:99|dlSc:99|filR:0|totS:3.85797e+12)
|----------site1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
| `----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
| `----------1/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)@lxfsrd47a04.cern.ch
|
|
`----------site2/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
|----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
| `----------24/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)@lxfsre13a01.cern.ch
|
`----------rack2/( free:0|repl:0|pidx:0|status:Dis|ulSc:0|dlSc:0|filR:0|totS:0)
`----------46/( free:0|repl:0|pidx:0|status:DISRW|ulSc:99|dlSc:99|filR:0|totS:1.99091e+12)@lxfsrg15a01.cern.ch
### scheduling snapshot for scheduling group default.0 and operation 'Draining Placement' :
--------default.0/( free:0|repl:0|pidx:1|status:OK|ulSc:99|dlSc:99|filR:0|totS:3.85797e+12)
|----------site1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
| `----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
| `----------1/( free:1|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)@lxfsrd47a04.cern.ch
|
|
`----------site2/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
|----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
| `----------24/( free:1|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)@lxfsre13a01.cern.ch
|
`----------rack2/( free:0|repl:0|pidx:0|status:Dis|ulSc:0|dlSc:0|filR:0|totS:0)
`----------46/( free:1|repl:0|pidx:0|status:DISRW|ulSc:99|dlSc:99|filR:0|totS:1.99091e+12)@lxfsrg15a01.cern.ch
### scheduling snapshot for scheduling group default.0 and operation 'Balancing Access' :
--------default.0/( free:0|repl:0|pidx:1|status:OK|ulSc:99|dlSc:99|filR:0|totS:3.85797e+12)
|----------site1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
| `----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
| `----------1/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)@lxfsrd47a04.cern.ch
|
|
`----------site2/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
|----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
| `----------24/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)@lxfsre13a01.cern.ch
|
`----------rack2/( free:0|repl:0|pidx:0|status:Dis|ulSc:0|dlSc:0|filR:0|totS:0)
`----------46/( free:0|repl:0|pidx:0|status:DISRW|ulSc:99|dlSc:99|filR:0|totS:1.99091e+12)@lxfsrg15a01.cern.ch
### scheduling snapshot for scheduling group default.0 and operation 'Balancing Placement' :
--------default.0/( free:0|repl:0|pidx:1|status:OK|ulSc:99|dlSc:99|filR:0|totS:3.85797e+12)
|----------site1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
| `----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
| `----------1/( free:1|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)@lxfsrd47a04.cern.ch
|
|
`----------site2/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
|----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
| `----------24/( free:1|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)@lxfsre13a01.cern.ch
|
`----------rack2/( free:0|repl:0|pidx:0|status:Dis|ulSc:0|dlSc:0|filR:0|totS:0)
`----------46/( free:1|repl:0|pidx:0|status:DISRW|ulSc:99|dlSc:99|filR:0|totS:1.99091e+12)@lxfsrg15a01.cern.ch
The internal state of the GeoTreeEngine is kept up to date by the background updater. It can be paused and resumed with the commands
geosched updater pause
geosched updater resume
A refresh of all the scheduling trees and snapshots can be obtained with the command
geosched forcerefresh
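A typical sequence, sketched below using only the commands introduced above, is to pause the updater while inspecting the engine, then force a single refresh and resume the periodic updates.
# freeze the background updater so that trees and snapshots stop changing
geosched updater pause
# inspect the frozen state (penalties, latencies, ...)
geosched show state
# rebuild all trees and snapshots once, then resume the periodic updates
geosched forcerefresh
geosched updater resume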
The GeoTreeEngine implements a mechanism to inhibit branches of the snapshots for selected types of operation. This can be done for all the scheduling groups or only for specific ones. The list of inhibited branches for each operation can be managed with the commands
geosched disabled add
geosched disabled rm
geosched disabled show
Warning
By default, placing data on ungeotagged filesystems is disabled. This means that for very basic instances (such as development ones), this restriction should be removed with the command
geosched disabled rm nogeotag * *
One can foresee multiple applications of this mechanism. An example is the default rule that forbids any placement operation on a non-geotagged filesystem.
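Another illustration, sketched below, is taking a branch out of scheduling during an intervention. The geotag site2::rack2 is taken from the example instance shown earlier, and the '*' wildcards for the operation and the group follow the same argument pattern as the nogeotag example above.
# inhibit the site2::rack2 branch for every operation and every group (geotag taken from the example instance)
geosched disabled add site2::rack2 * *
# list the currently inhibited branches
geosched disabled show
# re-enable the branch once the intervention is over
geosched disabled rm site2::rack2 * *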
The commands
group ls
space ls
both feature a switch -g <depth> that summarizes the displayed information along the scheduling trees down to depth <depth>.
EOS Console [root://localhost] |/eos/demo/> space ls -g 2
#-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# type # name # groupsize # groupmod #N(fs) #N(fs-rw) #sum(usedbytes) #sum(capacity) #capacity(rw) #nom.capacity #quota #balancing # threshold # converter # ntx # active #intergroup
#-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
spaceview default 0 0 67 66 272.69 G 133.62 T 131.62 T 0 off off 20 on 2 0 off
#-------------------------------------------------------------------------------------------------------
# geotag #N(fs) #N(fs-rw) #sum(usedbytes) #sum(capacity) #capacity(rw)
#-------------------------------------------------------------------------------------------------------
<ROOT> 67 66 272.69 G 133.62 T 131.62 T
<ROOT>::site1 23 23 105.72 G 45.79 T 45.79 T
<ROOT>::site2 44 43 166.97 G 87.83 T 85.84 T
<ROOT>::site1::rack1 23 23 105.72 G 45.79 T 45.79 T
<ROOT>::site2::rack1 22 22 74.36 G 43.92 T 43.92 T
<ROOT>::site2::rack2 22 21 92.61 G 43.92 T 41.92 T
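The same switch can be used with group ls; for instance, the following (the depth value is only an illustration) summarizes the filesystems of each group at the first geotag level.
# per-group breakdown at the first geotag level
group ls -g 1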