Overview
The purpose of the Platform Usage Report is to give users visibility into resource usage across their organization. The report helps users answer questions like:
- How many resources are available to my organization?
- How many resources are currently available?
- How many resources are currently in use?
- What does resource usage look like over time?
To access the report:
- Click “Admin” in the top navigation bar
- Select “Platform Usage Overview”
Key Terms
- EC2- (Amazon Elastic Compute Cloud), the cloud compute used to run everything on Platform, including Scripts, Notebooks, and Services.
- Compute Instance- The type of EC2 instance used to run a Job/Notebook/Service. Each EC2 instance type has CPU, memory, and storage limits, which constrain the number of Jobs/Notebooks/Services that can run on one instance. If users are running enough Jobs/Notebooks/Services in Platform, they will need multiple instances to run everything in parallel. Note - a single Job/Notebook/Service cannot run on multiple instances, so the maximum CPU and memory of an instance dictates the maximum CPU and memory of a Job/Notebook/Service.
- Partition- A collection of EC2 instances dedicated to running a particular workload type. See Platform Compute Partitions.
- Compute Hours- The number of EC2 hours used by an organization across all partitions and instance types. Instances that run in parallel count as separate hours. For example, if an organization uses 10 instances in parallel for 100 hours, it uses a total of 1,000 compute hours.
- Compute Instance Maximum- The maximum number of EC2 instances an organization can use in parallel. This is specified separately for each combination of partition and instance type, so it can differ from one combination to the next.
- Jobs- Python, R, and Container scripts in Platform
- Notebooks- Interactive coding environments in Python or R
- Services- User-defined applications, for example an R Shiny application.
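The Compute Hours definition above comes down to simple arithmetic. The following is a minimal sketch only: the function name and the (instance count, hours) record format are assumptions made for illustration and are not part of Platform.

```python
from typing import Iterable, Tuple


def total_compute_hours(usage: Iterable[Tuple[int, float]]) -> float:
    """Sum compute hours over (instances_in_parallel, hours_run) records.

    Each instance running in parallel contributes its own hours,
    so a record counts as instances_in_parallel * hours_run.
    """
    return sum(count * hours for count, hours in usage)


# The example from the definition: 10 instances in parallel for 100 hours.
example_total = total_compute_hours([(10, 100)])
print(example_total)  # 1000
```

Records from different partitions and instance types can be passed in the same list, since the total is counted across all of them.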
Usage Overview Page
The Usage Overview Page provides the user with a summary of the different types of resources available to their organization. Currently this page only shows Compute Resources. Compute Resources refer to the EC2 instances that run Jobs, Notebooks, and Services in Platform.
Monthly Compute Hours - The hours total on this page displays the cumulative number of compute hours used by the organization for the month across all partitions and instance types. See “Key Terms” above for more details. This total resets on the first of each month. The cumulative count may lag by up to 24 hours.
Partitions - Resources for each of your organization’s partitions are shown beneath a partition header, e.g. “Jobs Compute Instances”. Note that partitions are mutually exclusive: if one partition is maxed out, work submitted to that partition waits in the queue until space is available on that same partition.
Instance Types - For each partition, an organization can have multiple instance types available. Users select the instance type they want to use in the settings of an individual Job, Notebook, or Service. Each box displayed on the landing page represents a different combination of partition and instance type. Some organizations may have only one instance type available.
Max Instances - For each instance type, users can see the maximum number of instances that their organization can run in parallel.
Status Bar - The status bar in each box displays the current utilization of the specific partition and instance type. For example, if the status bar shows 70% usage for a specific instance type in the Jobs partition, then 70% of that instance type is currently being used to run Jobs and only 30% is available to run additional Jobs. When the status shows “MAX”, no additional Jobs can run until Jobs that are currently running complete or are canceled. The status bar is dynamic and shows what is currently running on the cluster (it is not cumulative). This data can be manually refreshed by clicking the “Refresh” button at the top of the page.
To see more details about a specific instance and partition type, click on the instance type.
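As a rough illustration of how a status bar reading relates to capacity, the sketch below turns a "used" and a "capacity" figure into the label a user would see. This article does not document how Platform derives the percentage, so the function name, its inputs, and the rounding behavior are all assumptions.

```python
def status_label(used: float, capacity: float) -> str:
    """Return a status-bar style label for one partition and instance type.

    Hypothetical helper: 'used' and 'capacity' are in the same unit
    (for example, instances) and describe what is running right now,
    not a cumulative total.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if used >= capacity:
        # Nothing more can start until running work completes or is canceled.
        return "MAX"
    return f"{int(100 * used / capacity)}%"


print(status_label(7, 10))   # 70%
print(status_label(10, 10))  # MAX
```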
Instance Detail Page: Current Activity
The Instance Detail Page shows compute usage activity within your organization for a specific instance type.
Partition and Instance Type - The title of the page shows the user which partition and instance type they are currently viewing.
Memory and CPU Status Bars - Each instance type has a specific memory and CPU limit. These status bars show the current utilization of memory and CPU across all the available instances for the specific instance type. The utilization of memory and CPU are not necessarily correlated. The memory across all instances of a specific instance type may be maxed out but there might be available CPU (and vice versa). If one of these two resources is maxed out, the entire instance type is maxed out. These status bars are updated every time the page is refreshed. Users can manually refresh the status bars by clicking “Refresh” in the top right corner.
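The rule that an instance type is maxed out as soon as either resource is exhausted can be written as a single check. This is a minimal sketch with hypothetical names and example numbers. Units follow the Active Workloads table (CPU in M, memory in MB).

```python
def is_maxed_out(cpu_used: float, cpu_capacity: float,
                 mem_used: float, mem_capacity: float) -> bool:
    """True if either CPU or memory is fully used across all instances.

    CPU and memory are tracked independently, so one can be exhausted
    while the other still has headroom.
    """
    return cpu_used >= cpu_capacity or mem_used >= mem_capacity


# Memory is exhausted while CPU still has headroom: the type is maxed out.
print(is_maxed_out(cpu_used=2000, cpu_capacity=8000,
                   mem_used=32000, mem_capacity=32000))  # True
```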
Running - The total number of jobs currently running across all instances for the specified partition and instance type. This information is updated every time the page is refreshed. Users can also refresh it manually by clicking “Refresh” in the top right corner.
Pending - The total number of queued jobs for the specified partition and instance type. Jobs/Notebooks/Services that are queued will be stuck with a “dedicating resource” log message in Platform and will not be able to run until resources become available. This information is updated every time the page is refreshed. Users can also refresh it manually by clicking “Refresh” in the top right corner.
Top User Activity - Table that gives insight into the users who are consuming the most resources across the organization. This information can be manually refreshed by clicking “Refresh” located in the top right corner.
- Running Jobs- Total number of jobs currently running by a specific user.
- Queued Jobs- Total number of queued jobs by a specific user.
- Memory Used- Total memory used by a specific user compared to the maximum memory of the partition and instance type across maximum instances.
- CPU Used- Total CPU used by a specific user compared to the maximum CPU of the partition and instance type across maximum instances.
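To make the "compared to the maximum ... across maximum instances" wording concrete, the sketch below computes one user's share of the total limit. The function name and the example numbers are hypothetical. The only thing taken from the definitions above is that the denominator is the per-instance limit multiplied by the maximum number of instances.

```python
def user_share_percent(user_used: float, per_instance_limit: float,
                       max_instances: int) -> float:
    """Percentage of the total limit that one user is consuming.

    The total limit is the per-instance memory (or CPU) limit
    multiplied by the maximum number of instances for that
    partition and instance type.
    """
    total_limit = per_instance_limit * max_instances
    if total_limit <= 0:
        raise ValueError("total limit must be positive")
    return 100 * user_used / total_limit


# A user holding 8,000 MB where each instance offers 16,000 MB
# and the organization can run at most 4 instances in parallel.
print(user_share_percent(8000, 16000, 4))  # 12.5
```

The same calculation applies to CPU Used, with CPU figures in place of memory.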
Active Workloads - Table that identifies all active workloads that are using resources across the organization. Certain fields (e.g. workload name, ID) may be hidden from users who do not have permission to access the workload. Users can cancel any active workload that they own or that has been shared with them as an editor/manager.
- Name - Platform workload's name.
- ID - Platform workload's ID.
- Type - Platform workload’s type (i.e. Python/R/Container script, Notebook, or Service).
- User - User who is running the workload.
- Requested CPU (M) - CPU requested by the workload.
- Requested Memory (MB) - Memory requested by the workload.
- State - Current status of the workload (i.e. running, canceling).
Instance Detail Page: Over Time
These graphs show both CPU and memory usage over time. Users can toggle between Past Day and Past Week to look at the data with different levels of granularity. Past Day is the last 24 hours (rolling) and Past Week is the last 7 days (rolling).
Reading the Graph
- Resource Requested- the total amount of resources requested by all the Jobs or Notebooks/Services on that partition and instance type at a given point in time. Resource requested is the memory or CPU set on each Job under Settings.
- Resource Used- the actual amount of resources used by the Jobs or Notebooks/Services running on that partition and instance type at a given point in time.
- Resource Capacity- the total resource amount available for a particular partition and instance type at that point in time. This capacity can scale up based on Platform autoscaling logic if an organization does not have its maximum number of instances set to always on.
How to Interpret the Graph
- If the Resource Used line is lower than the Resource Requested line, users are over-provisioning their Jobs, Notebooks, or Services. To be more efficient, users should reduce the amount of memory or CPU that their Jobs, Notebooks, and Services request, leaving more resources for other Jobs, Notebooks, and Services to run.
- If the Resource Requested line is above the Resource Capacity line, the organization is at its maximum capacity. Jobs, Notebooks, or Services (depending on which partition is being viewed) will be queued until running Jobs have finished or other Notebooks and Services are turned off.
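The two interpretation rules above can be applied to a single point on the graph as in the sketch below. The function name and the returned wording are illustrative assumptions. The comparisons themselves mirror the two bullets.

```python
from typing import List


def interpret_point(requested: float, used: float, capacity: float) -> List[str]:
    """Apply the two graph-reading rules to one point in time.

    All three values must be in the same unit (for example, MB of
    memory or M of CPU) for the same partition and instance type.
    """
    findings = []
    if used < requested:
        # More was requested than is actually being consumed.
        findings.append("over-provisioned: consider lowering requested CPU/memory")
    if requested > capacity:
        # Demand exceeds what the partition and instance type can supply.
        findings.append("at maximum capacity: new work will queue")
    return findings


print(interpret_point(requested=100, used=40, capacity=200))
# ['over-provisioned: consider lowering requested CPU/memory']
```

Both findings can be returned for the same point, which would suggest that lowering requests could free up room for queued work.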