Explaining Heatmap Estimation with CNN, Metaphorically

Think of heatmaps as disposable coffee trays -- they are roughly planar, but in several places they are pressed downwards.
Imagine that the hourglass model is a black box that manufactures arbitrarily-shaped heatmaps/trays.
From real-life experience, you know what happens when you stack two differently-shaped trays -- they don't fit. When stacked trays don't fit, the stack is taller than it would be if they fit.
Well, the loss function (in PyTorch) evaluates just that. Think of the loss function as a ruler that repeatedly measures the height of the manufactured tray stacked on a standard tray (i.e. ground truth), and of the optimizer as the one who reads the ruler and tells the tray-manufacturing black box: yo, this tray was bad. Try again.
The black box will learn from the feedback, and adjust its products.
The cycle continues until the measured height is small enough, i.e., when the trays fit each other well enough.
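
To make the cycle concrete, here is a minimal pure-Python sketch of it. Everything in it is an assumption made for illustration: the "black box" is just a 3x3 grid of adjustable numbers standing in for a real network such as the hourglass model, the ruler is a mean squared error, and the names stack_gap and adjust, the tolerance, and the learning rate are made up. In PyTorch, the same roles would typically be played by a loss such as torch.nn.MSELoss and an optimizer such as torch.optim.SGD.

```python
# Pure-Python sketch of the tray-fitting cycle (no PyTorch needed to run it).
# The "black box" is a plain grid of numbers, a stand-in for a real network.

def stack_gap(tray, standard):
    """The ruler: mean squared error between the two trays."""
    n = len(tray) * len(tray[0])
    return sum((p - t) ** 2
               for row_p, row_t in zip(tray, standard)
               for p, t in zip(row_p, row_t)) / n

def adjust(tray, standard, lr):
    """The feedback: one gradient-descent step on the mean squared error."""
    n = len(tray) * len(tray[0])
    return [[p - lr * 2.0 * (p - t) / n for p, t in zip(row_p, row_t)]
            for row_p, row_t in zip(tray, standard)]

# Standard tray (ground truth): flat, pressed down in one spot.
standard = [[0.0, 0.0, 0.0],
            [0.0, -1.0, 0.0],
            [0.0, 0.0, 0.0]]

# First manufactured tray: completely flat, so it does not fit yet.
tray = [[0.0] * 3 for _ in range(3)]

tolerance = 1e-4  # what counts as "good enough"; arbitrary value
lr = 1.0          # how strongly each round of feedback reshapes the tray; arbitrary
steps = 0

initial_gap = stack_gap(tray, standard)
while stack_gap(tray, standard) > tolerance and steps < 10_000:
    tray = adjust(tray, standard, lr)  # "this tray was bad, try again"
    steps += 1
final_gap = stack_gap(tray, standard)

print(f"gap went from {initial_gap:.4f} to {final_gap:.6f} in {steps} steps")
```

A real network does not adjust the tray directly. It adjusts the weights that produce the tray, and the feedback reaches those weights through backpropagation, but the measure-then-adjust loop has the same shape.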