astdenis

Pinned topic LoadLeveler fair share metric.

‏2008-03-31T18:25:07Z |
We are considering using fair share scheduling to give some weight to our site's resource allocation policies. On AIX, shares are based on cpu consumption. There apparently is no other metric. We have come to the conclusion that this metric is useless for us because cpu consumption is dependent on job cpu efficiency. Using LoadLeveler's accounting information, we calculate a job cpu efficiency as:

(Step Total Time) / (Elapsed Wall Clock Time * Step Cpus)

Using the current fair share metric to prioritize jobs based on the availability of shares would favor inefficient jobs.
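As a minimal sketch of the efficiency calculation above (the function name, argument names, and job figures are illustrative assumptions, not LoadLeveler accounting field names):

```python
def cpu_efficiency(step_total_cpu_seconds, wall_clock_seconds, step_cpus):
    """Step Total Time / (Elapsed Wall Clock Time * Step Cpus)."""
    if wall_clock_seconds <= 0 or step_cpus <= 0:
        raise ValueError("wall clock time and CPU count must be positive")
    return step_total_cpu_seconds / (wall_clock_seconds * step_cpus)


# Two hypothetical jobs, each holding 16 CPUs for one hour of wall clock.
efficient = cpu_efficiency(51840.0, 3600.0, 16)
inefficient = cpu_efficiency(11520.0, 3600.0, 16)
```

Both jobs hold the same CPUs for the same time, but the second one accumulates far less CPU time, so a metric based on CPU consumption charges it less.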

I would like to know if there are plans to give more flexibility to the fair share metric definition, i.e. let us pick our own metric.

Have other sites worked around this limitation and, if yes, would you be willing to share with us?

Alain St-Denis
Environment Canada
Updated on 2008-09-18T12:55:29Z by astdenis
  • ezhong

    Re: LoadLeveler fair share metric.

    ‏2008-04-04T21:21:56Z  in response to astdenis
    Fair share scheduling is all about job priority and it is coarse-grained. It serves exactly the purpose of "give some weight to our site's resource allocation policies". That's well said.

    On AIX, shares are based on cpu consumption. That's the most basic and well-defined metric. You can bring your input or requirements to IBM, and we can probably work together to define a metric that serves your needs better. I'd think any metric defined would have its own strong and weak points.

    After calculating job cpu efficiency, do you observe a trend? If the average job cpu efficiency is about the same for every user, then it's fair in the sense that everyone is treated by the same policy. Each user gains some at one point and loses some at another. If some users consistently run jobs with low efficiency, you could allocate fewer shares to those users to even things out. :) There are also other variables, such as user priority, that you can use in the SYSPRIO expression if you would like more flexibility.

    Enci Zhong
    LoadLeveler Development
    • corbeill

      Re: LoadLeveler fair share metric.

      ‏2008-04-07T13:22:00Z  in response to ezhong
      Right, any metric has its pros and cons.

      What's odd is that, since the metric LoadLeveler uses to allocate resources is wall clock, there is no way to "fair share" wall clock, only cpu time. In the past I dealt with a scheduler that used cpu time, not wall clock. At first, I thought the choice for LoadLeveler was wrong, but the more I use it, the more I like it. The user is basically requesting a chunk of real time on a given surface, which he can use efficiently or not; that's his problem (and mine as well). Otoh, we would like to bill users the same way they request resources.

      We have about 50 regular users, with probably the same number of job profiles, and job efficiency varying from 20% to 95%. We'd like users to be billed by what Alain described, i.e. wall clock * number of cpus requested (and if they use "not_shared", we would like to bill them for the full node, but that would be a nice-to-have). Managing shares on a per-user/per-job-profile basis would bring a tremendous administrative overhead that we would like to avoid. As fair share is implemented now, you ask for three hours, Y nodes not_shared, use them at 80%, and get X shares. You ask for the same time and nodes, use the resource at 20%, and get invoiced for only X/4 shares. To me, that's not fair...
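The X versus X/4 arithmetic can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, the node size of 16 cpus and Y = 4 nodes are made up, and the cpu-time charge is modelled simply as wall clock * cpus * efficiency.

```python
def cpu_time_charge(wall_clock_hours, cpus, efficiency):
    """Charge proportional to cpu time actually consumed (cpu-hours)."""
    return wall_clock_hours * cpus * efficiency


def reservation_charge(wall_clock_hours, cpus):
    """Charge proportional to cpus held, regardless of how busy they were."""
    return wall_clock_hours * cpus


cpus = 4 * 16  # hypothetical: Y = 4 nodes of 16 cpus each, not_shared

busy_job = cpu_time_charge(3.0, cpus, 0.80)   # the "X shares" job
idle_job = cpu_time_charge(3.0, cpus, 0.20)   # same request, 20% efficiency
ratio = idle_job / busy_job                   # about one quarter

same_for_both = reservation_charge(3.0, cpus)  # identical for both jobs
```

Under the cpu-time charge the inefficient job pays roughly a quarter of what the efficient one pays, while the reservation-style charge is the same for both.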

      Luc.
      • ezhong

        Re: LoadLeveler fair share metric.

        ‏2008-04-09T19:15:33Z  in response to corbeill
        > Right, any metric has its pros and cons.
        Indeed.

        > What's odd is that since the chosen metric to allow resources by
        > LoadLeveler is wall clock, that there are no ways to "fair share" wall
        > clock, only cpu time.
        wall_clock_limit must be used in the BACKFILL scheduler. job_cpu_limit is also there, and I wonder whether any customer uses it. There are many resources in LoadLeveler. If our customers prefer to deal with wall clock time, then fair share scheduling should have a wall clock option as well.

        > User is basically requesting a chunk of real time on a given surface,
        > that he can use efficiently or not, that's his problem (and mine as
        > well). Otoh, we would like to bill users the same way they request
        > resources.
        When a node is shared by multiple jobs, how long it takes to run a job varies, by a lot in extreme cases. So in a sense, it's not entirely "his problem". :)

        > We have about 50 regular users, with probably the same number of job
        > profiles, with job efficiency varying from 20% to 95%. We'd like users
        > to be billed by what Alain described, ie wall clock * number of cpus
        > requested (and if they use "not_shared", we would like to bill them
        > for the full node, but that would be a nice to have).
        If a node is used not_shared, the entire node should be counted as used by the job. Often, a node can run multiple tasks, as determined by the MAX_STARTERS and CLASS keyword values in LoadL_config.local. Say a node can run at most "n tasks" and a job runs two tasks on the node in shared mode: do you want the amount of resources consumed by the job on the node to be calculated as
        wall clock * 2 tasks * number of cpus / "n tasks" ?

        For simplicity, this assumes WLM is not being used. If WLM is used, then not only the number of tasks run on the node but also the number of CPUs requested by the job would need to be considered.
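A minimal sketch of the per-node formula proposed above, assuming WLM is not in use. The function and parameter names are hypothetical, and the node figures (16 cpus, at most 8 tasks) are made up for illustration.

```python
def shared_node_usage(wall_clock_seconds, tasks_run, node_cpus, max_tasks):
    """wall clock * tasks run * number of cpus / "n tasks" for one node."""
    if not 0 < tasks_run <= max_tasks:
        raise ValueError("tasks_run must be between 1 and max_tasks")
    return wall_clock_seconds * tasks_run * node_cpus / max_tasks


# A job running 2 tasks for one hour on a 16-cpu node that allows 8 tasks.
shared = shared_node_usage(3600.0, 2, 16, 8)

# A not_shared job can be modelled as occupying every task slot on the node,
# so it is charged for the whole node.
whole_node = shared_node_usage(3600.0, 8, 16, 8)
```

With these figures the shared job is charged for a quarter of the node, and the not_shared job for all of it.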

        > Managing shares on a per-user/per-job-profile basis would bring a
        > tremendous administrative overhead that we would like to avoid.
        That's a good point.

        > As fair share is implemented now, you ask for three hours, Y nodes
        > not_shared, use them at 80%, and get X shares. You ask for the same
        > time and nodes, use the resource at 20% and get invoiced for only X/4
        > shares. To me, that's not fair...
        If we count wall clock in fair share, you ask for three hours, Y nodes shared, use them at 90% by running a CPU-intensive job, and get X shares. Another user asks for the same time and nodes, uses the resources at 10% because the CPU-intensive job leaves little available, and gets invoiced for the same X shares. Is that fairer now? :)

        LoadLeveler development has no intention to tell our customers what's fair and what's not. It's entirely up to our customers to make the decision for their own clusters. Our job is to provide what our customers need. Though it's always good to consider different sides before deciding what option is the best.

        Enci
        • astdenis

          Re: LoadLeveler fair share metric.

          ‏2008-09-18T12:52:59Z  in response to ezhong
          It seems the ball has been rolling for over a year now and after the last post I'm not sure who was supposed to catch it...

          > Say, if a node can run "n tasks" at the maximum and a job runs two tasks on the node shared, do you want the amount of resources consumed by the job on the node to be calculated as
          > wall clock * 2 tasks * number of cpus / "n tasks" ?

          No, we simply want wall clock * total_number_of_cpus_for_the_whole_job. I realize such a metric would penalize jobs sharing nodes and competing for resources, but the penalty would be "fair". For our site, jobs sharing nodes are all serial jobs so the share utilization would be small. Put another way, we want users to be charged for the amount of time they reserve cpus, not the amount of time they use them.
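The requested metric could be accumulated per user along these lines. This is a sketch under assumptions: the record layout (user, wall clock seconds, total cpus for the whole job), the user names, and the job figures are all hypothetical, not a LoadLeveler accounting format.

```python
from collections import defaultdict


def reserved_cpu_seconds(jobs):
    """Sum wall clock * total cpus per user.

    Each job is a (user, wall_clock_seconds, total_cpus) tuple.
    """
    totals = defaultdict(float)
    for user, wall_clock_seconds, total_cpus in jobs:
        totals[user] += wall_clock_seconds * total_cpus
    return dict(totals)


jobs = [
    ("alice", 10800.0, 64),  # multi-node job, three hours on 64 cpus
    ("bob", 10800.0, 64),    # same reservation, whatever its efficiency
    ("bob", 600.0, 1),       # short serial job sharing a node
]
usage = reserved_cpu_seconds(jobs)
```

The two large jobs are charged identically because they reserve the same cpus for the same time, and the serial job adds only a small amount.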

          > If we count wall clock in fair share, you ask for three hours, Y nodes shared, use them at 90% by running a CPU intensive job, and get X shares. Another ask for the same time and nodes, use the resources at 10% because the CPU intensive job leave little resources available and get invoiced for the same X shares. Is that fairer now? :)

          This makes some sense, although I don't quite understand how a job could get 90% of the resources available if another job is competing for the same resources. Are you saying WLM doesn't work? ;-)

          Still, we are convinced that a metric of wall clock * cpus would better suit our job mix. Most of our users run multi-node jobs and don't share. The only jobs sharing nodes are serial jobs, which do not consume many resources.

          > LoadLeveler development has no intention to tell our customers what's fair and what's not. It's entirely up to our customers to make the decision for their own clusters. Our job is to provide what our customers need. Though it's always good to consider different sides before deciding what option is the best.

          Does this mean you'll implement our request? When? :-)

          Alain
          • astdenis

            Re: LoadLeveler fair share metric.

            ‏2008-09-18T12:55:29Z  in response to astdenis
            > It seems the ball has been rolling for over a year now and after the last post I'm not sure who was supposed to catch it...

            Oops. I was looking at the registered dates. So it's been only 6 months. ;-)

            Alain