DeviceStatsMonitor¶
- class lightning.pytorch.callbacks.DeviceStatsMonitor(cpu_stats=None)[source]¶
Bases: Callback
Automatically monitors and logs device stats during the training, validation, and testing stages.
DeviceStatsMonitor is a special callback: it requires a logger to be passed as an argument to the Trainer.
Logged Metrics
Logs device statistics with keys prefixed as DeviceStatsMonitor.{hook_name}/{base_metric_name}. The actual metrics depend on the active accelerator and the cpu_stats flag. Below is an overview of the available metrics and their meaning.

CPU (via psutil)

- cpu_percent — System-wide CPU utilization (%)
- cpu_vm_percent — System-wide virtual memory (RAM) utilization (%)
- cpu_swap_percent — System-wide swap memory utilization (%)
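The CPU metrics above map directly onto psutil calls. Below is a minimal sketch of how those values could be collected and given the DeviceStatsMonitor.{hook_name}/{base_metric_name} key prefix; the dictionary-building code is illustrative, not Lightning's internal implementation, and placeholder values are used when psutil is absent:

```python
try:
    import psutil  # the optional dependency DeviceStatsMonitor uses for CPU stats
    stats = {
        "cpu_percent": psutil.cpu_percent(),                # system-wide CPU utilization (%)
        "cpu_vm_percent": psutil.virtual_memory().percent,  # RAM utilization (%)
        "cpu_swap_percent": psutil.swap_memory().percent,   # swap utilization (%)
    }
except ImportError:
    # Placeholder values so the sketch still runs without psutil installed
    stats = {"cpu_percent": 12.5, "cpu_vm_percent": 40.0, "cpu_swap_percent": 0.0}

# Keys are logged with the DeviceStatsMonitor.{hook_name}/{base_metric_name} prefix
hook_name = "on_train_batch_start"  # one of the batch hooks documented below
logged = {f"DeviceStatsMonitor.{hook_name}/{name}": value for name, value in stats.items()}
```

Note that without psutil installed, enabling CPU stats raises ModuleNotFoundError (see Raises below).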
CUDA GPU (via torch.cuda.memory_stats)

Logs memory statistics from the PyTorch caching allocator (all in bytes). GPU compute utilization is not logged by default.

General Memory Usage:

- allocated_bytes.all.current — Current allocated GPU memory
- allocated_bytes.all.peak — Peak allocated GPU memory
- reserved_bytes.all.current — Current reserved GPU memory (allocated + cached)
- reserved_bytes.all.peak — Peak reserved GPU memory
- active_bytes.all.current — Current GPU memory in active use
- active_bytes.all.peak — Peak GPU memory in active use
- inactive_split_bytes.all.current — Memory in inactive, splittable blocks
Allocator Pool Statistics (for small_pool and large_pool):

- allocated_bytes.{pool_type}.current / allocated_bytes.{pool_type}.peak
- reserved_bytes.{pool_type}.current / reserved_bytes.{pool_type}.peak
- active_bytes.{pool_type}.current / active_bytes.{pool_type}.peak
Allocator Events:

- num_ooms — Cumulative number of out-of-memory errors
- num_alloc_retries — Number of allocation retries
- num_device_alloc — Number of device allocations
- num_device_free — Number of device deallocations
For a full list of CUDA memory stats, see the PyTorch documentation.
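torch.cuda.memory_stats() returns these statistics as a single flat dict. The sketch below uses illustrative, hard-coded values (the real call requires a CUDA device) to show how the whole-allocator byte counters and the event counters can be separated out:

```python
# Illustrative sample of the flat dict torch.cuda.memory_stats() returns;
# byte counters use a {stat}.{pool}.{current|peak} naming scheme.
memory_stats = {
    "allocated_bytes.all.current": 512 * 1024**2,
    "allocated_bytes.all.peak": 768 * 1024**2,
    "allocated_bytes.small_pool.current": 16 * 1024**2,
    "allocated_bytes.large_pool.current": 496 * 1024**2,
    "reserved_bytes.all.current": 1024 * 1024**2,
    "num_ooms": 0,
    "num_alloc_retries": 2,
}

# Whole-allocator byte counters carry the ".all." pool qualifier
all_pool = {k: v for k, v in memory_stats.items() if ".all." in k}

# Allocator event counters have no pool qualifier and start with "num_"
events = {k: v for k, v in memory_stats.items() if k.startswith("num_")}
```

Watching num_ooms and num_alloc_retries grow over training is a quick way to spot memory pressure before a hard OOM.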
TPU (via torch_xla)

Memory Metrics (per device, e.g., xla:0):

- memory.free.xla:0 — Free HBM memory (MB)
- memory.used.xla:0 — Used HBM memory (MB)
- memory.percent.xla:0 — Percentage of HBM memory used (%)
XLA Operation Counters:

- CachedCompile.xla
- CreateXlaTensor.xla
- DeviceDataCacheMiss.xla
- UncachedCompile.xla
- xla::add.xla, xla::addmm.xla, etc.
These counters can be retrieved using:
torch_xla.debug.metrics.counter_names()
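The counter names follow a simple scheme that makes them easy to group. The sketch below uses a hard-coded sample of what torch_xla.debug.metrics.counter_names() might return (the real call requires an XLA runtime) to show how per-op and compile-related counters can be separated:

```python
# Illustrative sample of torch_xla.debug.metrics.counter_names() output
counter_names = [
    "CachedCompile.xla",
    "UncachedCompile.xla",
    "CreateXlaTensor.xla",
    "DeviceDataCacheMiss.xla",
    "xla::add.xla",
    "xla::addmm.xla",
]

# Per-operation counters are namespaced with the "xla::" prefix
op_counters = [name for name in counter_names if name.startswith("xla::")]

# Compile counters: a high UncachedCompile relative to CachedCompile
# usually signals recompilation churn (e.g., from dynamic shapes)
compile_counters = [name for name in counter_names if "Compile" in name]
```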
- Parameters:
cpu_stats¶ (Optional[bool]) – If None, CPU stats are logged only when the accelerator is CPU. If True, CPU stats are logged regardless of the accelerator. If False, CPU stats are never logged.
- Raises:
MisconfigurationException – If the Trainer has no logger.
ModuleNotFoundError – If psutil is not installed and CPU stats are monitored.
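The three-way cpu_stats semantics can be summarized as a small predicate. This is a sketch for clarity only; the function name and signature are illustrative, not part of the Lightning API:

```python
from typing import Optional

def should_log_cpu_stats(cpu_stats: Optional[bool], accelerator_is_cpu: bool) -> bool:
    """Mirror the cpu_stats parameter semantics described above (illustrative)."""
    if cpu_stats is None:
        # Default: log CPU stats only when running on the CPU accelerator
        return accelerator_is_cpu
    # An explicit True/False overrides the accelerator check entirely
    return cpu_stats
```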
Example:

from lightning import Trainer
from lightning.pytorch.callbacks import DeviceStatsMonitor

device_stats = DeviceStatsMonitor()
trainer = Trainer(callbacks=[device_stats])
- on_test_batch_end(trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0)[source]¶
Called when the test batch ends.
- Return type:
None
- on_test_batch_start(trainer, pl_module, batch, batch_idx, dataloader_idx=0)[source]¶
Called when the test batch begins.
- Return type:
None
- on_train_batch_end(trainer, pl_module, outputs, batch, batch_idx)[source]¶
Called when the train batch ends.
- Return type:
None
Note
The value outputs["loss"] here will be the normalized value w.r.t. accumulate_grad_batches of the loss returned from training_step.
- on_train_batch_start(trainer, pl_module, batch, batch_idx)[source]¶
Called when the train batch begins.
- Return type:
None
- on_validation_batch_end(trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0)[source]¶
Called when the validation batch ends.
- Return type:
None