File Geoscheduling

Overview

The file scheduler is a core component of EOS that decides on which filesystems files should be placed or accessed. This decision is based on:

  • the geotag of each filesystem
  • the state of each filesystem and of the machine hosting it
  • the geotag of the requesting client
  • the layout of the requested file
  • several admin-defined internal parameters
  • several admin-defined or user-defined directory attributes

This information is structured in the form of so-called scheduling trees, whose shape is given by the geotags of the filesystems. There is one scheduling tree per scheduling group.

The file scheduler is a stateful component whose state is continuously updated to reflect the state of all the filesystems involved in the instance.

The file scheduler is involved in ALL file access/placement operations, including file access/placement from clients, space balancing and filesystem draining.

The interaction with file geoscheduling is threefold:

  • the geosched command, which allows viewing and setting the internal state and parameters of the GeoTreeEngine
  • geoscheduling-related directory attributes, which allow altering the file scheduling in a directory-specific way
  • geotag-aware eos commands, which can display useful information summarized along the scheduling trees

Interacting with the GeoTreeEngine using the geosched command

The GeoTreeEngine is a software component inside EOS in charge of keeping a consistent, up-to-date view of each scheduling group. For each scheduling group, this view is summarized into a scheduling tree and multiple snapshots of that tree, one for each type of access/placement operation. These snapshots are then copied and used to serve all file access/placement requests. To achieve this, the GeoTreeEngine has several features, including:

  • a background updater which keeps the snapshots and trees up-to-date. Snapshots are refreshed at every time frame; their updates are backported to the trees only when it is ultimately necessary, typically when a filesystem is added to or removed from a scheduling group. As a consequence, snapshots generally carry fresher information than trees, which is perfectly normal.
  • a penalty system which makes sure that filesystems cannot be over-scheduled during bursts of requests. Atomic penalties can be self-estimated or fixed; they are subtracted from the dlscore and the ulscore of the scheduled fs (see the illustration after this list).
  • a latency estimation system which estimates the freshness of the information the GeoTreeEngine state is based on.
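
As an illustration of the penalty system (assuming the default atomic penalty of 10 shown in the next section, and a filesystem whose upload and download scores both stand at 99, as in the snapshots below), scheduling a single placement on that filesystem yields, until the scores are refreshed at the next time frame:

dlscore = 99 - plctDlScorePenalty = 99 - 10 = 89
ulscore = 99 - plctUlScorePenalty = 99 - 10 = 89

A burst of further placements within the same time frame would lower the scores further, steering subsequent requests towards other filesystems.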

Internal parameters

The commands

geosched show param

and

geosched set

allow viewing and setting the internal parameters of the GeoTreeEngine; an example of setting a parameter is given after the parameter table below.

EOS Console [root://localhost] |/eos/demo/> geosched show param
### GeoTreeEngine parameters :
skipSaturatedAccess = 1
skipSaturatedDrnAccess = 1
skipSaturatedBlcAccess = 1
penaltyUpdateRate = 1
plctDlScorePenalty = 10(default) | 10(1Gbps) | 10(10Gbps) | 10(100Gbps) | 10(1000Gbps)
plctUlScorePenalty = 10(default) | 10(1Gbps) | 10(10Gbps) | 10(100Gbps) | 10(1000Gbps)
accessDlScorePenalty = 10(default) | 10(1Gbps) | 10(10Gbps) | 10(100Gbps) | 10(1000Gbps)
accessUlScorePenalty = 10(default) | 10(1Gbps) | 10(10Gbps) | 10(100Gbps) | 10(1000Gbps)
fillRatioLimit = 80
fillRatioCompTol = 100
saturationThres = 10
timeFrameDurationMs = 1000
### GeoTreeEngine list of groups :
default.0 , default.1 , default.10 , default.11 , default.12 , default.13
default.14 , default.15 , default.16 , default.17 , default.18 , default.19
default.2 , default.20 , default.21 , default.22 , default.3 , default.4
default.5 , default.6 , default.7 , default.8 , default.9 ,

The parameters are described below.

parameter                definition
skipSaturatedAccess      as skipSaturatedPlct but for access operations
skipSaturatedDrnAccess   as skipSaturatedPlct but for draining access operations
skipSaturatedBlcAccess   as skipSaturatedPlct but for balancing access operations
penaltyUpdateRate        weight of the penalty update at each time frame. 0 means the penalties are fixed; 100 means that new values are estimated at each time frame regardless of the past. This parameter is used to ensure some stability of the penalties when they are self-estimated.
plctDlScorePenalty       atomic penalty applied to the download score of a fs on any type of placement operation. It is a vector indexed by the networking speed class of the filesystem.
plctUlScorePenalty       as plctDlScorePenalty but for the upload score
accessDlScorePenalty     as plctDlScorePenalty but for access operations
accessUlScorePenalty     as accessDlScorePenalty but for the upload score
fillRatioLimit           fill ratio above which a filesystem should not be used for a placement or a RW access operation
fillRatioCompTol         amount by which the fill ratios of two fs must differ to be considered different. 100 means that, whatever the fill ratios of two compared fs are, they are never considered different. Among other criteria, the file scheduler tries to balance fs fill ratios using this tolerance; if it is set to 10, it will try to bring all fill ratios equal within a 10% tolerance. If it is set to 100, there is no such inline space balancing.
saturationThres          threshold below which the upload or download score of a fs makes it considered saturated
timeFrameDurationMs      periodicity, in milliseconds, of the internal state update (especially of the snapshots and, when needed, the trees)
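
For example, to raise the fill ratio limit or change the penalty update rate, one would use geosched set. The invocations below are a sketch: the argument order geosched set <parameter> [<index>] <value> is assumed here and may differ between EOS versions; vector parameters such as plctDlScorePenalty additionally take an index selecting the network speed class.

geosched set fillRatioLimit 85
geosched set penaltyUpdateRate 50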

Internal state

The internal state of the GeoTreeEngine is essentially composed of scheduling trees and snapshots. They can be displayed with the commands

geosched show tree
geosched show snapshot

Warning

By design, the information attached to the trees might not be up-to-date, contrary to the snapshots, which should be kept up-to-date.

The internal state also includes the penalty accounting table and the fs age/latency report. They can be displayed with the command

geosched show state

Some examples follow.

EOS Console [root://localhost] |/eos/demo/> geosched show tree default.0
### scheduling tree for scheduling group default.0 :
--------default.0 [3,9]
       |----------site1 [1,3]
       |         `----------rack1 [1,2]
       |                   `----------1@lxfsrd47a04.cern.ch [1,1,UnvRW]
       |
       |
       `----------site2 [2,5]
                 |----------rack1 [1,2]
                 |         `----------24@lxfsre13a01.cern.ch [1,1,UnvRW]
                 |
                 `----------rack2 [1,2]
                           `----------46@lxfsrg15a01.cern.ch [1,1,UnvRW]
EOS Console [root://localhost] |/eos/demo/> geosched show snapshot default.0
### scheduling snapshot for scheduling group default.0 and operation 'Placement' :
--------default.0/( free:2|repl:0|pidx:1|status:OK|ulSc:99|dlSc:99|filR:0|totS:3.85797e+12)
       |----------site1/( free:1|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
       |         `----------rack1/( free:1|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
       |                   `----------1/( free:1|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)@lxfsrd47a04.cern.ch
       |
       |
       `----------site2/( free:1|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
                 |----------rack1/( free:1|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
                 |         `----------24/( free:1|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)@lxfsre13a01.cern.ch
                 |
                 `----------rack2/( free:0|repl:0|pidx:0|status:Dis|ulSc:0|dlSc:0|filR:0|totS:0)
                           `----------46/( free:1|repl:0|pidx:0|status:DISRW|ulSc:99|dlSc:99|filR:0|totS:1.99091e+12)@lxfsrg15a01.cern.ch

### scheduling snapshot for scheduling group default.0 and operation 'Access RO' :
--------default.0/( free:0|repl:0|pidx:1|status:OK|ulSc:99|dlSc:99|filR:0|totS:3.85797e+12)
       |----------site1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
       |         `----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
       |                   `----------1/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)@lxfsrd47a04.cern.ch
       |
       |
       `----------site2/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
                 |----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
                 |         `----------24/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)@lxfsre13a01.cern.ch
                 |
                 `----------rack2/( free:0|repl:0|pidx:0|status:Dis|ulSc:0|dlSc:0|filR:0|totS:0)
                           `----------46/( free:0|repl:0|pidx:0|status:DISRW|ulSc:99|dlSc:99|filR:0|totS:1.99091e+12)@lxfsrg15a01.cern.ch

### scheduling snapshot for scheduling group default.0 and operation 'Access RW' :
--------default.0/( free:0|repl:0|pidx:1|status:OK|ulSc:99|dlSc:99|filR:0|totS:3.85797e+12)
       |----------site1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
       |         `----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
       |                   `----------1/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)@lxfsrd47a04.cern.ch
       |
       |
       `----------site2/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
                 |----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
                 |         `----------24/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)@lxfsre13a01.cern.ch
                 |
                 `----------rack2/( free:0|repl:0|pidx:0|status:Dis|ulSc:0|dlSc:0|filR:0|totS:0)
                           `----------46/( free:0|repl:0|pidx:0|status:DISRW|ulSc:99|dlSc:99|filR:0|totS:1.99091e+12)@lxfsrg15a01.cern.ch

### scheduling snapshot for scheduling group default.0 and operation 'Draining Access' :
--------default.0/( free:0|repl:0|pidx:1|status:OK|ulSc:99|dlSc:99|filR:0|totS:3.85797e+12)
       |----------site1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
       |         `----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
       |                   `----------1/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)@lxfsrd47a04.cern.ch
       |
       |
       `----------site2/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
                 |----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
                 |         `----------24/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)@lxfsre13a01.cern.ch
                 |
                 `----------rack2/( free:0|repl:0|pidx:0|status:Dis|ulSc:0|dlSc:0|filR:0|totS:0)
                           `----------46/( free:0|repl:0|pidx:0|status:DISRW|ulSc:99|dlSc:99|filR:0|totS:1.99091e+12)@lxfsrg15a01.cern.ch

### scheduling snapshot for scheduling group default.0 and operation 'Draining Placement' :
--------default.0/( free:0|repl:0|pidx:1|status:OK|ulSc:99|dlSc:99|filR:0|totS:3.85797e+12)
       |----------site1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
       |         `----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
       |                   `----------1/( free:1|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)@lxfsrd47a04.cern.ch
       |
       |
       `----------site2/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
                 |----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
                 |         `----------24/( free:1|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)@lxfsre13a01.cern.ch
                 |
                 `----------rack2/( free:0|repl:0|pidx:0|status:Dis|ulSc:0|dlSc:0|filR:0|totS:0)
                           `----------46/( free:1|repl:0|pidx:0|status:DISRW|ulSc:99|dlSc:99|filR:0|totS:1.99091e+12)@lxfsrg15a01.cern.ch

### scheduling snapshot for scheduling group default.0 and operation 'Balancing Access' :
--------default.0/( free:0|repl:0|pidx:1|status:OK|ulSc:99|dlSc:99|filR:0|totS:3.85797e+12)
       |----------site1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
       |         `----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)
       |                   `----------1/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.86507e+12)@lxfsrd47a04.cern.ch
       |
       |
       `----------site2/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
                 |----------rack1/( free:0|repl:0|pidx:0|status:OK|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)
                 |         `----------24/( free:0|repl:0|pidx:0|status:RW|ulSc:99|dlSc:99|filR:0|totS:1.99291e+12)@lxfsre13a01.cern.ch
                 |
                 `----------rack2/( free:0|repl:0|pidx:0|status:Dis|ulSc:0|dlSc:0|filR:0|totS:0)
                           `----------46/( free:0|repl:0|pidx:0|status:DISRW|ulSc:99|dlSc:99|filR:0|totS:1.99091e+12)@lxfsrg15a01.cern.ch

The internal state of the GeoTreeEngine is kept up-to-date by the background updater. It can be paused and resumed with the commands

geosched updater pause
geosched updater resume

A refresh of all the scheduling trees and snapshots can be obtained with the command

geosched forcerefresh
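
For instance, to inspect a stable view of a snapshot while the instance is changing and then bring everything back in sync, these commands can be combined (a sketch using only the commands introduced above):

geosched updater pause
geosched show snapshot default.0
geosched updater resume
geosched forcerefresh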

Branch disabling

The GeoTreeEngine implements a mechanism to inhibit branches of the snapshots for selected types of operation. This can be done for all the scheduling groups or only for specific ones. The list of inhibited branches for each operation can be managed with the commands

geosched disabled add
geosched disabled rm
geosched disabled show

Warning

By default, placing data on filesystems without a geotag is disabled. This means that for very basic instances (like development ones), this restriction should be removed with the command

geosched disabled rm nogeotag * *

One can foresee multiple applications for this mechanism; the default rule that forbids any placement operation on a non-geotagged filesystem is one example.
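
As another illustration, one could temporarily exclude the branch site2::rack2 (from the example trees above) from scheduling in group default.0 and later re-enable it. The argument pattern <geotag> <operation> <group> is assumed here by analogy with the rm example above and may differ between EOS versions:

geosched disabled add site2::rack2 * default.0
geosched disabled show
geosched disabled rm site2::rack2 * default.0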

Geotag aware commands

The commands

group ls
space ls

both feature a switch -g <depth> that allows summarizing the displayed information along the scheduling trees down to depth <depth>, as shown below for space ls.

EOS Console [root://localhost] |/eos/demo/> space ls -g 2
#-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#     type #           name  #  groupsize #   groupmod #N(fs) #N(fs-rw) #sum(usedbytes) #sum(capacity) #capacity(rw) #nom.capacity #quota #balancing # threshold # converter #  ntx # active #intergroup
#-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
spaceview           default             0            0     67        66        272.69 G       133.62 T      131.62 T             0    off        off          20          on      2        0         off
#-------------------------------------------------------------------------------------------------------
#                         geotag   #N(fs) #N(fs-rw) #sum(usedbytes) #sum(capacity) #capacity(rw)
#-------------------------------------------------------------------------------------------------------
                          <ROOT>       67        66        272.69 G       133.62 T      131.62 T
                   <ROOT>::site1       23        23        105.72 G        45.79 T       45.79 T
                   <ROOT>::site2       44        43        166.97 G        87.83 T       85.84 T
            <ROOT>::site1::rack1       23        23        105.72 G        45.79 T       45.79 T
            <ROOT>::site2::rack1       22        22         74.36 G        43.92 T       43.92 T
            <ROOT>::site2::rack2       22        21         92.61 G        43.92 T       41.92 T
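
The same switch can be used with group ls, for instance (output omitted):

group ls -g 1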