DeviceStatsMonitor

class lightning.pytorch.callbacks.DeviceStatsMonitor(cpu_stats=None)[source]

Bases: Callback

Automatically monitors and logs device stats during the training, validation, and testing stages. DeviceStatsMonitor is a special callback: it requires a logger to be passed as an argument to the Trainer.

Logged Metrics

Logs device statistics with keys prefixed as DeviceStatsMonitor.{hook_name}/{base_metric_name}. The actual metrics depend on the active accelerator and the cpu_stats flag. Below is an overview of the metrics that may be available and their meaning (a sketch of the key naming scheme follows the list).

  • CPU (via psutil)

    • cpu_percent — System-wide CPU utilization (%)

    • cpu_vm_percent — System-wide virtual memory (RAM) utilization (%)

    • cpu_swap_percent — System-wide swap memory utilization (%)

  • CUDA GPU (via torch.cuda.memory_stats)

    Logs memory statistics from the PyTorch caching allocator (all values in bytes). GPU compute utilization is not logged by default.

    • General Memory Usage:

      • allocated_bytes.all.current — Current allocated GPU memory

      • allocated_bytes.all.peak — Peak allocated GPU memory

      • reserved_bytes.all.current — Current reserved GPU memory (allocated + cached)

      • reserved_bytes.all.peak — Peak reserved GPU memory

      • active_bytes.all.current — Current GPU memory in active use

      • active_bytes.all.peak — Peak GPU memory in active use

      • inactive_split_bytes.all.current — Memory in inactive, splittable blocks

    • Allocator Pool Statistics (for small_pool and large_pool):

      • allocated_bytes.{pool_type}.current / allocated_bytes.{pool_type}.peak

      • reserved_bytes.{pool_type}.current / reserved_bytes.{pool_type}.peak

      • active_bytes.{pool_type}.current / active_bytes.{pool_type}.peak

    • Allocator Events:

      • num_ooms — Cumulative out-of-memory errors

      • num_alloc_retries — Number of allocation retries

      • num_device_alloc — Number of device allocations

      • num_device_free — Number of device deallocations

    For a full list of CUDA memory stats, see the PyTorch documentation.

  • TPU (via torch_xla)

    • Memory Metrics (per device, e.g., xla:0):

      • memory.free.xla:0 — Free HBM memory (MB)

      • memory.used.xla:0 — Used HBM memory (MB)

      • memory.percent.xla:0 — Percentage of HBM memory used (%)

    • XLA Operation Counters:

      • CachedCompile.xla

      • CreateXlaTensor.xla

      • DeviceDataCacheMiss.xla

      • UncachedCompile.xla

      • xla::add.xla, xla::addmm.xla, etc.

    These counters can be retrieved using: torch_xla.debug.metrics.counter_names()
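
As a minimal sketch of the key naming scheme: the helper prefix_metrics below is hypothetical and only illustrates the DeviceStatsMonitor.{hook_name}/{base_metric_name} format described above; torch.cuda.memory_stats() itself is the real PyTorch API.

import torch

def prefix_metrics(base_metrics, hook_name):
    # Hypothetical helper: mirrors the documented key format
    # DeviceStatsMonitor.{hook_name}/{base_metric_name}.
    return {
        f"DeviceStatsMonitor.{hook_name}/{name}": value
        for name, value in base_metrics.items()
    }

if torch.cuda.is_available():
    # torch.cuda.memory_stats() returns the caching-allocator statistics
    # listed above, e.g. "allocated_bytes.all.current".
    base = torch.cuda.memory_stats(torch.device("cuda:0"))
    logged = prefix_metrics(base, "on_train_batch_start")
    print(list(logged)[:5])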

Parameters:

cpu_stats (Optional[bool]) – If None, CPU stats are logged only if the accelerator is CPU. If True, CPU stats are logged regardless of the accelerator. If False, CPU stats are never logged, regardless of the accelerator.
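
For illustration, a minimal sketch of the three settings:

from lightning.pytorch.callbacks import DeviceStatsMonitor

# Default: CPU stats are logged only when the accelerator is CPU.
monitor_default = DeviceStatsMonitor()

# Log CPU stats even on GPU/TPU runs (requires psutil).
monitor_with_cpu = DeviceStatsMonitor(cpu_stats=True)

# Never log CPU stats, even on the CPU accelerator.
monitor_without_cpu = DeviceStatsMonitor(cpu_stats=False)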

Raises:
  • MisconfigurationException – If the Trainer has no logger.

  • ModuleNotFoundError – If psutil is not installed and CPU stats are monitored.
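
A defensive pattern (a suggestion, not part of the Lightning API) is to request CPU stats only when psutil is importable, avoiding the ModuleNotFoundError above:

import importlib.util

from lightning.pytorch.callbacks import DeviceStatsMonitor

# Force CPU stats on only when psutil is importable; otherwise disable
# them explicitly so the import error above cannot be raised.
cpu_stats = importlib.util.find_spec("psutil") is not None
device_stats = DeviceStatsMonitor(cpu_stats=cpu_stats)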

Example:

from lightning import Trainer
from lightning.pytorch.callbacks import DeviceStatsMonitor

# The callback logs through the Trainer's configured logger.
device_stats = DeviceStatsMonitor()
trainer = Trainer(callbacks=[device_stats])
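
Because the callback requires a logger, one can also pass one explicitly; a sketch using CSVLogger (any Lightning logger works):

from lightning import Trainer
from lightning.pytorch.callbacks import DeviceStatsMonitor
from lightning.pytorch.loggers import CSVLogger

# Without any logger, setup() raises MisconfigurationException, so pass
# one explicitly. CSVLogger is just one choice among the Lightning loggers.
trainer = Trainer(
    callbacks=[DeviceStatsMonitor()],
    logger=CSVLogger(save_dir="logs/"),
)
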
on_test_batch_end(trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0)[source]

Called when the test batch ends.

Return type:

None

on_test_batch_start(trainer, pl_module, batch, batch_idx, dataloader_idx=0)[source]

Called when the test batch begins.

Return type:

None

on_train_batch_end(trainer, pl_module, outputs, batch, batch_idx)[source]

Called when the train batch ends.

Return type:

None

Note

The value outputs["loss"] here will be the normalized value w.r.t. accumulate_grad_batches of the loss returned from training_step; see the worked illustration below.
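
A worked illustration with assumed numbers (not library output):

accumulate_grad_batches = 4          # Trainer(accumulate_grad_batches=4)
training_step_loss = 2.0             # value returned from training_step
# outputs["loss"] observed in on_train_batch_end:
normalized_loss = training_step_loss / accumulate_grad_batches  # == 0.5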

on_train_batch_start(trainer, pl_module, batch, batch_idx)[source]

Called when the train batch begins.

Return type:

None

on_validation_batch_end(trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0)[source]

Called when the validation batch ends.

Return type:

None

on_validation_batch_start(trainer, pl_module, batch, batch_idx, dataloader_idx=0)[source]

Called when the validation batch begins.

Return type:

None

setup(trainer, pl_module, stage)[source]

Called when fit, validate, test, predict, or tune begins.

Return type:

None