DeviceStatsMonitor

class lightning.pytorch.callbacks.DeviceStatsMonitor(cpu_stats=None)[source]

Bases: Callback

Automatically monitors and logs device stats during the training, validation, and testing stages. DeviceStatsMonitor is a special callback: it requires a logger to be passed as an argument to the Trainer.

Logged Metrics

Logs device statistics with keys prefixed as DeviceStatsMonitor.{hook_name}/{base_metric_name}. The actual metrics depend on the active accelerator and the cpu_stats flag. Below is an overview of the metrics that may be available and their meaning (a sketch of the key naming scheme follows the list).

  • CPU (via psutil)

    • cpu_percent — System-wide CPU utilization (%)

    • cpu_vm_percent — System-wide virtual memory (RAM) utilization (%)

    • cpu_swap_percent — System-wide swap memory utilization (%)

  • CUDA GPU (via torch.cuda.memory_stats)

    Logs memory statistics from the PyTorch caching allocator (all values in bytes). GPU compute utilization is not logged by default.

    • General Memory Usage:

      • allocated_bytes.all.current — Current allocated GPU memory

      • allocated_bytes.all.peak — Peak allocated GPU memory

      • reserved_bytes.all.current — Current reserved GPU memory (allocated + cached)

      • reserved_bytes.all.peak — Peak reserved GPU memory

      • active_bytes.all.current — Current GPU memory in active use

      • active_bytes.all.peak — Peak GPU memory in active use

      • inactive_split_bytes.all.current — Memory in inactive, splittable blocks

    • Allocator Pool Statistics (for small_pool and large_pool):

      • allocated_bytes.{pool_type}.current / allocated_bytes.{pool_type}.peak

      • reserved_bytes.{pool_type}.current / reserved_bytes.{pool_type}.peak

      • active_bytes.{pool_type}.current / active_bytes.{pool_type}.peak

    • Allocator Events:

      • num_ooms — Cumulative out-of-memory errors

      • num_alloc_retries — Number of allocation retries

      • num_device_alloc — Number of device allocations

      • num_device_free — Number of device deallocations

    For a full list of CUDA memory stats, see the PyTorch documentation.

  • TPU (via torch_xla)

    • Memory Metrics (per device, e.g., xla:0):

      • memory.free.xla:0 — Free HBM memory (MB)

      • memory.used.xla:0 — Used HBM memory (MB)

      • memory.percent.xla:0 — Percentage of HBM memory used (%)

    • XLA Operation Counters:

      • CachedCompile.xla

      • CreateXlaTensor.xla

      • DeviceDataCacheMiss.xla

      • UncachedCompile.xla

      • xla::add.xla, xla::addmm.xla, etc.

    These counters can be retrieved using: torch_xla.debug.metrics.counter_names()
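
As a minimal sketch of the key naming scheme: the helper prefix_metrics below is hypothetical and only illustrates the DeviceStatsMonitor.{hook_name}/{base_metric_name} format described above; torch.cuda.memory_stats() itself is the real PyTorch API.

import torch

def prefix_metrics(base_metrics, hook_name):
    # Hypothetical helper: mirrors the documented key format
    # DeviceStatsMonitor.{hook_name}/{base_metric_name}.
    return {
        f"DeviceStatsMonitor.{hook_name}/{name}": value
        for name, value in base_metrics.items()
    }

if torch.cuda.is_available():
    # torch.cuda.memory_stats() returns the caching-allocator statistics
    # listed above, e.g. "allocated_bytes.all.current".
    base = torch.cuda.memory_stats(torch.device("cuda:0"))
    logged = prefix_metrics(base, "on_train_batch_start")
    print(list(logged)[:5])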

Parameters:

cpu_stats (Optional[bool]) – If None, CPU stats are logged only if the accelerator is CPU. If True, CPU stats are logged regardless of the accelerator. If False, CPU stats are never logged, regardless of the accelerator.
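
For illustration, a minimal sketch of the three settings:

from lightning.pytorch.callbacks import DeviceStatsMonitor

# Default: CPU stats are logged only when the accelerator is CPU.
monitor_default = DeviceStatsMonitor()

# Log CPU stats even on GPU/TPU runs (requires psutil).
monitor_with_cpu = DeviceStatsMonitor(cpu_stats=True)

# Never log CPU stats, even on the CPU accelerator.
monitor_without_cpu = DeviceStatsMonitor(cpu_stats=False)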

Raises:
  • MisconfigurationException – If the Trainer has no logger.

  • ModuleNotFoundError – If psutil is not installed and CPU stats are monitored.
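
A defensive pattern (a suggestion, not part of the Lightning API) is to request CPU stats only when psutil is importable, avoiding the ModuleNotFoundError above:

import importlib.util

from lightning.pytorch.callbacks import DeviceStatsMonitor

# Force CPU stats on only when psutil is importable; otherwise disable
# them explicitly so the import error above cannot be raised.
cpu_stats = importlib.util.find_spec("psutil") is not None
device_stats = DeviceStatsMonitor(cpu_stats=cpu_stats)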

Example:

from lightning import Trainer
from lightning.pytorch.callbacks import DeviceStatsMonitor

# The callback logs through the Trainer's configured logger.
device_stats = DeviceStatsMonitor()
trainer = Trainer(callbacks=[device_stats])
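
Because the callback requires a logger, one can also pass one explicitly; a sketch using CSVLogger (any Lightning logger works):

from lightning import Trainer
from lightning.pytorch.callbacks import DeviceStatsMonitor
from lightning.pytorch.loggers import CSVLogger

# Without any logger, setup() raises MisconfigurationException, so pass
# one explicitly. CSVLogger is just one choice among the Lightning loggers.
trainer = Trainer(
    callbacks=[DeviceStatsMonitor()],
    logger=CSVLogger(save_dir="logs/"),
)
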
on_test_batch_end(trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0)[source]

Called when the test batch ends.

Return type:

None

on_test_batch_start(trainer, pl_module, batch, batch_idx, dataloader_idx=0)[source]

Called when the test batch begins.

Return type:

None

on_train_batch_end(trainer, pl_module, outputs, batch, batch_idx)[source]

Called when the train batch ends.

Return type:

None

Note

The value outputs["loss"] here will be the normalized value w.r.t. accumulate_grad_batches of the loss returned from training_step; see the worked illustration below.
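
A worked illustration with assumed numbers (not library output):

accumulate_grad_batches = 4          # Trainer(accumulate_grad_batches=4)
training_step_loss = 2.0             # value returned from training_step
# outputs["loss"] observed in on_train_batch_end:
normalized_loss = training_step_loss / accumulate_grad_batches  # == 0.5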

on_train_batch_start(trainer, pl_module, batch, batch_idx)[source]

Called when the train batch begins.

Return type:

None

on_validation_batch_end(trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0)[source]

Called when the validation batch ends.

Return type:

None

on_validation_batch_start(trainer, pl_module, batch, batch_idx, dataloader_idx=0)[source]

Called when the validation batch begins.

Return type:

None

setup(trainer, pl_module, stage)[source]

Called when fit, validate, test, predict, or tune begins.

Return type:

None