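Loss Functions

A loss function measures the difference between a model's output and the ground-truth label. The terms cost function and objective function come up just as often; they differ as follows:

Loss function: the difference between the model output and the true label for a single sample, $Loss = f(\hat{y}, y)$
Cost function: the difference between model outputs and true labels over the whole sample set, i.e. the mean of the per-sample losses, $Cost = \frac{1}{N} \sum_{i=1}^{N} f(\hat{y}_{i}, y_{i})$
Objective function: the cost function plus a regularization term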
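
The three terms can be told apart in a few lines. The sketch below is only an illustration: the squared-error loss, the linear model, the L2 penalty, and the weight `lam` are all assumptions chosen for the example, not part of the definitions above.

```python
import torch

torch.manual_seed(0)

# Hypothetical data: 8 samples, 3 features, and a linear model with weights w
x = torch.randn(8, 3)
y = torch.randn(8)
w = torch.randn(3, requires_grad=True)

y_hat = x @ w

# Loss: one value per sample (squared error, chosen for illustration)
per_sample_loss = (y_hat - y) ** 2

# Cost: mean of the per-sample losses over the whole sample set
cost = per_sample_loss.mean()

# Objective: cost plus a regularization term (L2 penalty, lam is assumed)
lam = 0.01
objective = cost + lam * (w ** 2).sum()

print(per_sample_loss.shape, cost.item(), objective.item())
```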

In PyTorch, loss functions also inherit from nn.Module, so a loss function can be treated like a network layer.

loss.backward()

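loss.backward() is a very important method in PyTorch; it triggers the automatic differentiation (autograd) system to backpropagate the error (the loss).

PyTorch's autograd system

PyTorch's autograd system can automatically compute gradients for all operations on tensors. It tracks every operation involved in a computation graph, which makes computing gradients straightforward.

As you define a computation, for example through a series of tensor operations and functions, autograd records those operations. When you then call loss.backward(), autograd starts from the loss, traverses the computation graph in reverse, and computes the gradient of each parameter.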
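
To make the graph traversal concrete, here is a minimal sketch with hand-picked values (the tensors `x`, `w`, and `b` are illustrative assumptions). For `y = sum(w * x + b)` the gradients are known analytically: `dy/dw = x`, and `dy/db` is the number of elements that `b` is broadcast to.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

# autograd records the multiply, add, and sum operations
y = (w * x + b).sum()

# Traverse the recorded graph in reverse, filling in .grad
y.backward()

print(w.grad)  # equals x
print(b.grad)  # 3.0, because b is broadcast to three elements
```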

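Using `loss.backward()`

When training a neural network we usually have a loss function that measures the difference between the model's predictions and the true values. The goal is to optimize the model's parameters so that this loss is minimized. To do that, we compute the gradient of the loss with respect to the parameters and use those gradients to update them.

loss.backward() computes exactly these gradients. It propagates the error backward through the computation graph, computes the gradient of each parameter, and stores it in the parameter's .grad attribute.

Example

The following simple example shows how to use loss.backward() in PyTorch:

import torch
import torch.nn as nn

# Define a simple linear model
model = nn.Linear(10, 1)

# Define the loss function
criterion = nn.MSELoss()

# Create some dummy data
inputs = torch.randn(16, 10)
targets = torch.randn(16, 1)

# Forward pass
outputs = model(inputs)
loss = criterion(outputs, targets)

# Clear any gradients left over from a previous step
model.zero_grad()

# Backpropagate the error and compute the gradients
loss.backward()

# Every parameter in model.parameters() now has a .grad attribute holding its gradient
for param in model.parameters():
    print(param.grad)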

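Notes

Gradient accumulation: in some cases you may want to accumulate gradients over several mini-batches before updating the parameters. You can do this by calling loss.backward() several times without calling model.zero_grad() in between, and then updating the parameters with an optimizer such as torch.optim.SGD.
Operations that need no gradient: if you do not want an operation to be tracked for gradients, use the torch.no_grad() context manager. This is useful when evaluating a model or producing data that needs no gradient.
Zeroing gradients: at the start of each new iteration (or batch) you normally need to zero the model's gradients to avoid unintended accumulation. Call model.zero_grad() to do so.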
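
The three notes fit together in one short training loop. This is a sketch under assumptions: the model size, the random batches, and `accum_steps = 2` are made up for illustration, and dividing the loss by `accum_steps` is one common convention for keeping the accumulated gradient on the scale of an average.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 2  # hypothetical: update once every 2 mini-batches
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(batches, start=1):
    # Scale so the accumulated gradient matches the average over the group
    loss = criterion(model(inputs), targets) / accum_steps
    loss.backward()  # gradients add up in .grad across calls
    if step % accum_steps == 0:
        optimizer.step()       # update with the accumulated gradients
        optimizer.zero_grad()  # start the next group from zero

# Evaluation: nothing is recorded for autograd inside torch.no_grad()
with torch.no_grad():
    eval_out = model(batches[0][0])
print(eval_out.requires_grad)  # False
```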

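Binary cross-entropy loss

torch.nn.BCELoss(weight=None, size_average=None, reduce=None, reduction='mean')

Purpose: computes the cross entropy for binary classification, where labels are in {0, 1}. The input to this loss must be a probability in [0, 1]; typically it is the output of a sigmoid activation layer or of a softmax.

Main parameters:

weight: a manual rescaling weight applied to the loss of each batch element.

size_average: bool; when True the returned loss is the mean, when False it is the sum of the per-sample losses. Deprecated in favor of reduction.

reduce: bool; when True the loss is returned as a scalar. Deprecated in favor of reduction.

The formula is: $\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{'mean'} \\ \operatorname{sum}(L), & \text{if reduction} = \text{'sum'} \end{cases}$ where $L = \{l_{1}, \dots, l_{N}\}$ and $l_{n} = -w_{n} [y_{n} \log x_{n} + (1 - y_{n}) \log (1 - x_{n})]$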

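m = nn.Sigmoid()
loss = nn.BCELoss()
# Raw scores; the sigmoid below turns them into probabilities
input = torch.randn(3, requires_grad=True)
target = torch.empty(3).random_(2)
output = loss(m(input), target)
output.backward()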

print('BCELoss result:', output)

BCELoss result: tensor(0.5732, grad_fn=<BinaryCrossEntropyBackward>)
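
As a sanity check, the per-element binary cross entropy $-[y \log x + (1 - y) \log(1 - x)]$ can be evaluated by hand and compared with nn.BCELoss. The sketch assumes unit weights and uses made-up probabilities and targets.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
probs = torch.sigmoid(torch.randn(4))          # predicted probabilities
target = torch.tensor([1.0, 0.0, 1.0, 0.0])    # labels in {0, 1}

# l_n = -[y_n * log(x_n) + (1 - y_n) * log(1 - x_n)]
manual = -(target * probs.log() + (1 - target) * (1 - probs).log())

builtin_none = nn.BCELoss(reduction='none')(probs, target)
builtin_mean = nn.BCELoss()(probs, target)

print(torch.allclose(manual, builtin_none))
print(torch.allclose(manual.mean(), builtin_mean))
```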

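Cross-entropy loss

torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')

Purpose: computes the cross-entropy loss. The input is expected to contain unnormalized scores (logits) for each class.

Main parameters:

weight: a weight for the loss of each class.

size_average: bool; when True the returned loss is the mean, when False it is the sum of the per-sample losses. Deprecated in favor of reduction.

ignore_index: a class index whose loss is ignored.

reduce: bool; when True the loss is returned as a scalar. Deprecated in favor of reduction.

The formula is: $loss(x, class) = -\log\left(\frac{\exp(x[class])}{\sum_{j} \exp(x[j])}\right) = -x[class] + \log\left(\sum_{j} \exp(x[j])\right)$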

loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.empty(3, dtype=torch.long).random_(5)
output = loss(input, target)
output.backward()

print(output)

tensor(2.0115, grad_fn=<NllLossBackward>)
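
The right-hand form of the formula, $-x[class] + \log \sum_j \exp(x[j])$, is easy to check against the built-in loss. The logits and targets below are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(3, 5)           # 3 samples, 5 classes
target = torch.tensor([1, 0, 4])     # class index per sample

# -x[class] + log(sum_j exp(x[j])), one value per sample
picked = logits.gather(1, target.unsqueeze(1)).squeeze(1)
manual = -picked + torch.logsumexp(logits, dim=1)

builtin = nn.CrossEntropyLoss(reduction='none')(logits, target)
print(torch.allclose(manual, builtin))
```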

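L1 loss

torch.nn.L1Loss(size_average=None, reduce=None, reduction='mean')

Purpose: computes the absolute value of the difference between the output y and the true label target.

The reduction parameter determines how the result is aggregated. There are three modes: none computes the loss element by element; sum adds up all elements and returns a scalar; mean averages over all elements and returns a scalar. With none, the result has the same shape as the input. The default is mean.

The formula is: $L_{n} = | x_{n} - y_{n} |$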

loss = nn.L1Loss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
output = loss(input, target)
output.backward()

print('L1Loss result:', output)

L1Loss result: tensor(1.5729, grad_fn=<L1LossBackward>)
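
The effect of the three reduction modes is easiest to see on a tiny tensor. The values below are hand-picked so the results can be checked mentally.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, -2.0], [0.5, 4.0]])
y = torch.zeros(2, 2)

none = nn.L1Loss(reduction='none')(x, y)   # same shape as x: [[1, 2], [0.5, 4]]
total = nn.L1Loss(reduction='sum')(x, y)   # 1 + 2 + 0.5 + 4 = 7.5
mean = nn.L1Loss(reduction='mean')(x, y)   # 7.5 / 4 = 1.875

print(none, total, mean)
```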

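MSE loss

torch.nn.MSELoss(size_average=None, reduce=None, reduction='mean')

Purpose: computes the squared difference between the output y and the true label target.

As with L1Loss, the reduction parameter determines how the result is aggregated. There are three modes: none computes the loss element by element; sum adds up all elements and returns a scalar; mean averages over all elements and returns a scalar. The default is mean.

The formula is:

$l_{n} = (x_{n} - y_{n})^{2}$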

loss = nn.MSELoss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
output = loss(input, target)
output.backward()

print('MSELoss result:', output)

MSELoss result: tensor(1.6968, grad_fn=<MseLossBackward>)
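
With reduction='mean', the gradient of the MSE with respect to each input is $2 (x_n - y_n) / N$. The small check below uses hand-picked values; it is a sketch for illustration, not part of the original example.

```python
import torch
import torch.nn as nn

x = torch.tensor([1.0, 2.0, 4.0], requires_grad=True)
y = torch.tensor([0.0, 2.0, 1.0])

loss = nn.MSELoss()(x, y)   # (1 + 0 + 9) / 3
loss.backward()

# Analytical gradient of the mean squared error: 2 * (x - y) / N
expected_grad = 2 * (x.detach() - y) / x.numel()
print(loss.item(), x.grad, expected_grad)
```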

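Smooth L1 loss

torch.nn.SmoothL1Loss(size_average=None, reduce=None, reduction='mean', beta=1.0)

Purpose: a smoothed version of L1 whose role is to reduce the influence of outliers.

The reduction parameter determines how the result is aggregated. There are three modes: none computes the loss element by element; sum adds up all elements and returns a scalar; mean averages over all elements and returns a scalar. The default is mean.

Note: the reduction parameter also appears in the loss functions that follow, so it is not explained separately again.

The formula (for the default beta = 1) is: $loss(x, y) = \frac{1}{n} \sum_{i=1}^{n} z_{i}$ where $z_{i} = \begin{cases} 0.5 (x_{i} - y_{i})^{2}, & \text{if } | x_{i} - y_{i} | < 1 \\ | x_{i} - y_{i} | - 0.5, & \text{otherwise} \end{cases}$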

loss = nn.SmoothL1Loss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
output = loss(input, target)
output.backward()

print('SmoothL1Loss result:', output)

SmoothL1Loss result: tensor(0.7808, grad_fn=<SmoothL1LossBackward>)
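
The piecewise definition can be written directly with torch.where and compared with the built-in loss. The sketch assumes the default beta = 1 and uses hand-picked differences on both sides of the threshold.

```python
import torch
import torch.nn as nn

diff = torch.tensor([-3.0, -0.5, 0.0, 0.2, 2.0])   # plays the role of x_i - y_i
target = torch.zeros_like(diff)

abs_diff = diff.abs()
# Quadratic branch for |diff| < 1, linear branch otherwise
manual = torch.where(abs_diff < 1, 0.5 * diff ** 2, abs_diff - 0.5)

builtin = nn.SmoothL1Loss(reduction='none')(diff, target)
print(manual)   # quadratic near 0, linear in the tails
```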

Smooth L1 versus L1

Here we plot the curves of both loss functions to compare Smooth L1 with L1.

import matplotlib.pyplot as plt

inputs = torch.linspace(-10, 10, steps=5000)
target = torch.zeros_like(inputs)

loss_f_smooth = nn.SmoothL1Loss(reduction='none')
loss_smooth = loss_f_smooth(inputs, target)
loss_f_l1 = nn.L1Loss(reduction='none')
loss_l1 = loss_f_l1(inputs, target)

plt.plot(inputs.numpy(), loss_smooth.numpy(), label='Smooth L1 Loss')
plt.plot(inputs.numpy(), loss_l1.numpy(), label='L1 loss')
plt.xlabel('x_i - y_i')
plt.ylabel('loss value')
plt.legend()
plt.grid()
plt.show()

As the plot shows, Smooth L1 has a smoother transition at the tip at 0.

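Negative log-likelihood loss with a Poisson-distributed target

torch.nn.PoissonNLLLoss(log_input=True, full=False, size_average=None, eps=1e-08, reduce=None, reduction='mean')

Purpose: the negative log-likelihood loss for a Poisson distribution.

Main parameters:

log_input: whether the input is given in log form; this determines which formula is used.

full: whether to compute the full loss, i.e. to add the Stirling approximation term; default False.

eps: a small correction term that avoids evaluating log(0) when the input is 0 (used when log_input=False).

Formulas:

When log_input=True: $loss(x_{n}, y_{n}) = e^{x_{n}} - x_{n} \cdot y_{n}$
When log_input=False:

$loss(x_{n}, y_{n}) = x_{n} - y_{n} \cdot \log(x_{n} + eps)$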

loss = nn.PoissonNLLLoss()
log_input = torch.randn(5, 2, requires_grad=True)
target = torch.randn(5, 2)
output = loss(log_input, target)
output.backward()

print('PoissonNLLLoss result:', output)

PoissonNLLLoss result: tensor(0.7358, grad_fn=<MeanBackward0>)
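
Both formulas can be checked against the built-in loss. In this sketch the log-rates and the count targets are hand-picked assumptions, and `full` is left at its default of False so no Stirling term is added.

```python
import torch
import torch.nn as nn

log_rate = torch.tensor([0.0, 1.0, -0.5])   # model output in log space
target = torch.tensor([1.0, 3.0, 0.0])      # observed counts

# log_input=True: exp(x) - y * x
manual_log = log_rate.exp() - target * log_rate
builtin_log = nn.PoissonNLLLoss(log_input=True, reduction='none')(log_rate, target)

# log_input=False: x - y * log(x + eps), where x is the rate itself
eps = 1e-8
rate = log_rate.exp()
manual_raw = rate - target * torch.log(rate + eps)
builtin_raw = nn.PoissonNLLLoss(log_input=False, eps=eps, reduction='none')(rate, target)

print(torch.allclose(manual_log, builtin_log), torch.allclose(manual_raw, builtin_raw))
```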

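KL divergence

torch.nn.KLDivLoss(size_average=None, reduce=None, reduction='mean', log_target=False)

Purpose: computes the KL divergence, also known as relative entropy. It is a distance measure for continuous distributions and is often useful when performing regression over the space of (discretely sampled) continuous output distributions. The input is expected to contain log-probabilities; the target contains probabilities unless log_target=True.

Main parameters:

reduction: aggregation mode, one of none/sum/mean/batchmean.

none: computed element by element.

sum: all elements are summed; returns a scalar.

mean: average over all elements; returns a scalar.

batchmean: the sum divided by the batch size.

Formula:

$\begin{aligned} D_{KL}(P, Q) &= E_{X \sim P} \left[ \log \frac{P(X)}{Q(X)} \right] = E_{X \sim P} [\log P(X) - \log Q(X)] \\ &= \sum_{i=1}^{n} P(x_{i}) (\log P(x_{i}) - \log Q(x_{i})) \end{aligned}$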

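# KLDivLoss expects log-probabilities as input and probabilities as target
inputs = torch.tensor([[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]])
target = torch.tensor([[0.9, 0.05, 0.05], [0.1, 0.7, 0.2]], dtype=torch.float)
# 'batchmean' matches the formula: sum over classes, mean over the batch
loss = nn.KLDivLoss(reduction='batchmean')
output = loss(inputs.log(), target)

print('KLDivLoss result:', output)

KLDivLoss result: tensor(0.3553)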
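
The sum in the formula can be compared with nn.KLDivLoss directly. The sketch below assumes P is the target distribution and Q the model's prediction, passes Q as log-probabilities, and uses reduction='batchmean' so that the result is the per-sample KL divergence averaged over the batch. The random distributions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
q_log = F.log_softmax(torch.randn(2, 3), dim=1)   # model output, log Q
p = F.softmax(torch.randn(2, 3), dim=1)           # target distribution P

# sum_i P(x_i) * (log P(x_i) - log Q(x_i)), averaged over the batch
manual = (p * (p.log() - q_log)).sum(dim=1).mean()

builtin = nn.KLDivLoss(reduction='batchmean')(q_log, p)
print(manual.item(), builtin.item())
```

A KL divergence computed this way is never negative, which makes a negative result a useful hint that probabilities were passed where log-probabilities were expected.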

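MarginRankingLoss

torch.nn.MarginRankingLoss(margin=0.0, size_average=None, reduce=None, reduction='mean')

Purpose: measures how well two inputs are ordered relative to each other; used for ranking tasks. It computes the difference between two sets of values.

Main parameters:

margin: the margin, i.e. the required difference between $x_{1}$ and $x_{2}$.

reduction: aggregation mode, one of none/sum/mean.

Formula:

$loss(x_{1}, x_{2}, y) = \max(0, -y \cdot (x_{1} - x_{2}) + margin)$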

loss = nn.MarginRankingLoss()
input1 = torch.randn(3, requires_grad=True)
input2 = torch.randn(3, requires_grad=True)
target = torch.randn(3).sign()
output = loss(input1, input2, target)
output.backward()

print('MarginRankingLoss result:', output)

MarginRankingLoss result: tensor(0.7740, grad_fn=<MeanBackward0>)
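
A hand evaluation of $\max(0, -y (x_1 - x_2) + margin)$ shows how the label y selects the desired ordering. The scores and the margin of 0.1 are assumptions for illustration.

```python
import torch
import torch.nn as nn

x1 = torch.tensor([0.8, 0.2, 0.5])
x2 = torch.tensor([0.3, 0.6, 0.5])
y = torch.tensor([1.0, 1.0, -1.0])   # 1: x1 should rank higher, -1: x2 should
margin = 0.1

# max(0, -y * (x1 - x2) + margin), element by element
manual = torch.clamp(-y * (x1 - x2) + margin, min=0)

builtin = nn.MarginRankingLoss(margin=margin, reduction='none')(x1, x2, y)
print(manual)   # zero only where the ordering is right by at least the margin
```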

Multi-label margin loss

torch.nn.MultiLabelMarginLoss(size_average=None, reduce=None, reduction='mean')

Purpose: computes the loss for multi-label classification problems.

Main parameters:

reduction: aggregation mode, one of none/sum/mean.

Formula: $loss(x, y) = \sum_{ij} \frac{\max(0, 1 - (x[y[j]] - x[i]))}{x.size(0)}$

where $i = 0, \dots, x.size(0) - 1$ and $j = 0, \dots, y.size(0) - 1$, with $y[j] \geq 0$ and $i \neq y[j]$ for all $i$ and $j$.

loss = nn.MultiLabelMarginLoss()
x = torch.FloatTensor([[0.9, 0.2, 0.4, 0.8]])
# for target y, only consider labels 3 and 0, not after label -1
y = torch.LongTensor([[3, 0, -1, 1]])  # the true classes are class 3 and class 0
output = loss(x, y)

print('MultiLabelMarginLoss result:', output)

MultiLabelMarginLoss result: tensor(0.4500)
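
The value 0.45 can be reproduced by expanding the double sum by hand. The sketch below reuses the same x and y; the helper lists `labels` and `others` are names introduced only for this illustration.

```python
import torch
import torch.nn as nn

x = torch.tensor([0.9, 0.2, 0.4, 0.8])
y = torch.tensor([3, 0, -1, 1])

# Target labels are the entries listed before the first -1
labels = []
for label in y.tolist():
    if label < 0:
        break
    labels.append(label)

# All remaining class indices are non-targets
others = [i for i in range(x.numel()) if i not in labels]

# Sum max(0, 1 - (x[y[j]] - x[i])) over target j and non-target i
manual = sum(
    torch.clamp(1 - (x[j] - x[i]), min=0) for j in labels for i in others
) / x.numel()

builtin = nn.MultiLabelMarginLoss()(x.unsqueeze(0), y.unsqueeze(0))
print(manual.item(), builtin.item())   # (0.4 + 0.6 + 0.3 + 0.5) / 4 = 0.45
```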

Binary classification loss (SoftMarginLoss)

torch.nn.SoftMarginLoss(size_average=None, reduce=None, reduction='mean')

Purpose: computes the logistic loss for binary classification.

Main parameters:

reduction: aggregation mode, one of none/sum/mean.

Formula:

$loss(x, y) = \sum_{i} \frac{\log(1 + \exp(-y[i] \cdot x[i]))}{x.nelement()}$

where x.nelement() is the number of elements in the input x. Note that here the labels y take the values 1 and -1.

inputs = torch.tensor([[0.3, 0.7], [0.5, 0.5]])  # two samples, two neurons
target = torch.tensor([[-1, 1], [1, -1]], dtype=torch.float)  # the loss is computed per neuron, so each neuron needs its own label

loss_f = nn.SoftMarginLoss()
output = loss_f(inputs, target)

print('SoftMarginLoss result:', output)

SoftMarginLoss result: tensor(0.6764)
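
The same number follows from evaluating the formula directly. This sketch reuses the inputs and targets of the example above.

```python
import torch
import torch.nn as nn

x = torch.tensor([[0.3, 0.7], [0.5, 0.5]])
y = torch.tensor([[-1.0, 1.0], [1.0, -1.0]])

# sum_i log(1 + exp(-y[i] * x[i])) / x.nelement()
manual = torch.log1p(torch.exp(-y * x)).sum() / x.nelement()

builtin = nn.SoftMarginLoss()(x, y)
print(manual.item(), builtin.item())
```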

Multi-class hinge loss

torch.nn.MultiMarginLoss(p=1, margin=1.0, weight=None, size_average=None, reduce=None, reduction='mean')

Purpose: computes the hinge loss for multi-class classification.

Main parameters:

reduction: aggregation mode, one of none/sum/mean.

p: either 1 or 2.

weight: a weight for the loss of each class.

margin: the margin.

Formula:

$loss(x, y) = \frac{\sum_{i} \max(0, margin - x[y] + x[i])^{p}}{x.size(0)}$

where $x$ is the score vector of one sample, $y$ is its target class index with $0 \leq y \leq x.size(0) - 1$, and the sum runs over all $i \in \{0, \dots, x.size(0) - 1\}$ with $i \neq y$.

inputs = torch.tensor([[0.3, 0.7], [0.5, 0.5]])
target = torch.tensor([0, 1], dtype=torch.long)

loss_f = nn.MultiMarginLoss()
output = loss_f(inputs, target)

print('MultiMarginLoss result:', output)

MultiMarginLoss result: tensor(0.6000)
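
Expanding the formula sample by sample reproduces the result 0.6. The sketch reuses the inputs above with the default p = 1 and margin = 1.0; the loop is written for readability rather than speed.

```python
import torch
import torch.nn as nn

x = torch.tensor([[0.3, 0.7], [0.5, 0.5]])
y = torch.tensor([0, 1])
margin, p = 1.0, 1

per_sample = []
for scores, label in zip(x, y):
    # max(0, margin - x[y] + x[i]) ** p for every class i
    terms = torch.clamp(margin - scores[label] + scores, min=0) ** p
    terms[label] = 0.0   # the sum skips i == y
    per_sample.append(terms.sum() / scores.numel())

manual = torch.stack(per_sample).mean()   # (0.7 + 0.5) / 2
builtin = nn.MultiMarginLoss(p=p, margin=margin)(x, y)
print(manual.item(), builtin.item())
```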

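Triplet loss

torch.nn.TripletMarginLoss(margin=1.0, p=2.0, eps=1e-06, swap=False, size_average=None, reduce=None, reduction='mean')

Purpose: computes the triplet loss.

Triplet: a format for storing or using data, <entity 1, relation, entity 2>. In practice it can also be written as <anchor, positive examples, negative examples>.

With this loss we want the anchor to end up closer to the positive examples and farther from the negative examples.

Main parameters:

reduction: aggregation mode, one of none/sum/mean.

p: the degree of the norm used for the distance, typically 1 or 2.

margin: the margin.

Formula:

$L(a, p, n) = \max\{d(a_{i}, p_{i}) - d(a_{i}, n_{i}) + margin, 0\}$

where $d(x_{i}, y_{i}) = \| x_{i} - y_{i} \|_{p}$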

triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)
anchor = torch.randn(100, 128, requires_grad=True)
positive = torch.randn(100, 128, requires_grad=True)
negative = torch.randn(100, 128, requires_grad=True)
output = triplet_loss(anchor, positive, negative)
output.backward()
print('TripletMarginLoss result:', output)

TripletMarginLoss result: tensor(1.1667, grad_fn=<MeanBackward0>)
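
The formula can be rebuilt from pairwise distances. This sketch assumes p = 2 and margin = 1.0 and uses small random embeddings; F.pairwise_distance is used for $d(\cdot, \cdot)$ with its default eps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
anchor = torch.randn(4, 8)
positive = torch.randn(4, 8)
negative = torch.randn(4, 8)
margin = 1.0

d_ap = F.pairwise_distance(anchor, positive, p=2)   # d(a_i, p_i)
d_an = F.pairwise_distance(anchor, negative, p=2)   # d(a_i, n_i)

# max(d(a, p) - d(a, n) + margin, 0), averaged over the batch
manual = torch.clamp(d_ap - d_an + margin, min=0).mean()

builtin = nn.TripletMarginLoss(margin=margin, p=2)(anchor, positive, negative)
print(manual.item(), builtin.item())
```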

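HingeEmbeddingLoss

torch.nn.HingeEmbeddingLoss(margin=1.0, size_average=None, reduce=None, reduction='mean')

Purpose: computes a hinge loss on embedding outputs.

Main parameters:

reduction: aggregation mode, one of none/sum/mean.

margin: the margin.

Formula:

$l_{n} = \begin{cases} x_{n}, & \text{if } y_{n} = 1 \\ \max\{0, \Delta - x_{n}\}, & \text{if } y_{n} = -1 \end{cases}$ where $\Delta$ is the margin. Note: the input x should be the absolute value of the difference between two inputs.

Put differently: for a positive example ($y_{n} = 1$) the loss is x itself; for a negative example ($y_{n} = -1$) the loss is the larger of 0 and margin - x.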

loss_f = nn.HingeEmbeddingLoss()
inputs = torch.tensor([[1., 0.8, 0.5]])
target = torch.tensor([[1, 1, -1]])
output = loss_f(inputs,target)

print('HingeEmbeddingLoss result:', output)

HingeEmbeddingLoss result: tensor(0.7667)
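
The two branches map directly onto torch.where. The sketch reuses the values of the example above with the default margin of 1.0.

```python
import torch
import torch.nn as nn

x = torch.tensor([1.0, 0.8, 0.5])    # distances between pairs
y = torch.tensor([1.0, 1.0, -1.0])   # 1: similar pair, -1: dissimilar pair
margin = 1.0

# x where y == 1, max(0, margin - x) where y == -1
manual = torch.where(y == 1, x, torch.clamp(margin - x, min=0))

builtin = nn.HingeEmbeddingLoss(margin=margin, reduction='none')(x, y)
print(manual, manual.mean())   # per-element [1.0, 0.8, 0.5], mean about 0.7667
```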

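Cosine similarity

torch.nn.CosineEmbeddingLoss(margin=0.0, size_average=None, reduce=None, reduction='mean')

Purpose: computes a loss from the cosine similarity of two vectors.

Main parameters:

reduction: aggregation mode, one of none/sum/mean.

margin: may take values in [-1, 1]; values in [0, 0.5] are recommended.

Formula:

$loss(x_{1}, x_{2}, y) = \begin{cases} 1 - \cos(x_{1}, x_{2}), & \text{if } y = 1 \\ \max\{0, \cos(x_{1}, x_{2}) - margin\}, & \text{if } y = -1 \end{cases}$ where $\cos(\theta) = \frac{A \cdot B}{\| A \| \| B \|} = \frac{\sum_{i=1}^{n} A_{i} \times B_{i}}{\sqrt{\sum_{i=1}^{n} (A_{i})^{2}} \times \sqrt{\sum_{i=1}^{n} (B_{i})^{2}}}$

This is probably the most widely known of these loss functions. It takes the cosine similarity of two vectors and uses it as a distance: for y = 1, the closer the two vectors are, the smaller the loss, and the farther apart, the larger.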

loss_f = nn.CosineEmbeddingLoss()
inputs_1 = torch.tensor([[0.3, 0.5, 0.7], [0.3, 0.5, 0.7]])
inputs_2 = torch.tensor([[0.1, 0.3, 0.5], [0.1, 0.3, 0.5]])
target = torch.tensor([1, -1], dtype=torch.float)
output = loss_f(inputs_1,inputs_2,target)

print('CosineEmbeddingLoss result:', output)

CosineEmbeddingLoss result: tensor(0.5000)
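
The result 0.5 is no coincidence: both pairs are identical, so the y = 1 branch gives $1 - \cos$ and the y = -1 branch (with margin 0) gives $\cos$, and their mean is 0.5. The sketch below rebuilds this with F.cosine_similarity, reusing the inputs above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x1 = torch.tensor([[0.3, 0.5, 0.7], [0.3, 0.5, 0.7]])
x2 = torch.tensor([[0.1, 0.3, 0.5], [0.1, 0.3, 0.5]])
y = torch.tensor([1.0, -1.0])
margin = 0.0

cos = F.cosine_similarity(x1, x2, dim=1)

# 1 - cos where y == 1, max(0, cos - margin) where y == -1
manual = torch.where(y == 1, 1 - cos, torch.clamp(cos - margin, min=0))

builtin = nn.CosineEmbeddingLoss(margin=margin, reduction='none')(x1, x2, y)
print(manual, manual.mean())
```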

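CTC loss

torch.nn.CTCLoss(blank=0, reduction='mean', zero_infinity=False)

Purpose: used for classification of time-series data.

It computes the loss between a continuous (unsegmented) time series and a target sequence. CTCLoss sums over the probabilities of the possible alignments of input to target, producing a loss value that is differentiable with respect to each input node. The alignment of input to target is assumed to be "many-to-one", which limits the length of the target sequence: it must be ≤ the input length.

Main parameters:

reduction: aggregation mode, one of none/sum/mean.

blank: the blank label.

zero_infinity: whether to set infinite losses and their associated gradients to zero.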

# Target are to be padded
T = 50      # Input sequence length
C = 20      # Number of classes (including blank)
N = 16      # Batch size
S = 30      # Target sequence length of longest target in batch (padding length)
S_min = 10  # Minimum target length, for demonstration purposes

# Initialize random batch of input vectors, for *size = (T,N,C)
input = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()

# Initialize random batch of targets (0 = blank, 1:C = classes)
target = torch.randint(low=1, high=C, size=(N, S), dtype=torch.long)

input_lengths = torch.full(size=(N,), fill_value=T, dtype=torch.long)
target_lengths = torch.randint(low=S_min, high=S, size=(N,), dtype=torch.long)
ctc_loss = nn.CTCLoss()
loss = ctc_loss(input, target, input_lengths, target_lengths)
loss.backward()


# Target are to be un-padded
T = 50      # Input sequence length
C = 20      # Number of classes (including blank)
N = 16      # Batch size

# Initialize random batch of input vectors, for *size = (T,N,C)
input = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
input_lengths = torch.full(size=(N,), fill_value=T, dtype=torch.long)

# Initialize random batch of targets (0 = blank, 1:C = classes)
target_lengths = torch.randint(low=1, high=T, size=(N,), dtype=torch.long)
target = torch.randint(low=1, high=C, size=(sum(target_lengths),), dtype=torch.long)
ctc_loss = nn.CTCLoss()
loss = ctc_loss(input, target, input_lengths, target_lengths)
loss.backward()

print('CTCLoss result:', loss)

CTCLoss result: tensor(16.0885, grad_fn=<MeanBackward0>)

Focal loss (FocalLoss)

import torch
import torch.nn as nn
import torch.nn.functional as F

def focal_loss(input_values, gamma):
    """Computes the focal loss"""
    p = torch.exp(-input_values)
    loss = (1 - p) ** gamma * input_values
    return loss.mean()

class FocalLoss(nn.Module):
    def __init__(self, cls_num_list=None, weight=None, gamma=0.):
        super(FocalLoss, self).__init__()
        assert gamma >= 0
        self.gamma = gamma
        self.weight = weight

    def _hook_before_epoch(self, epoch):
        pass

    def forward(self, output_logits, target):
        return focal_loss(F.cross_entropy(output_logits, target, reduction='none', weight=self.weight), self.gamma)
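
The weighting $(1 - p_t)^{\gamma}$ used above has two properties that are easy to verify: with gamma = 0 the focal loss reduces to plain cross entropy, and with gamma > 0 it can only shrink the loss, most strongly for well-classified samples. The sketch below is a self-contained functional restatement; `focal_from_logits` is a hypothetical helper name, and the logits and targets are random.

```python
import torch
import torch.nn.functional as F

def focal_from_logits(logits, target, gamma):
    # Per-sample cross entropy equals -log(p_t), so p_t = exp(-ce)
    ce = F.cross_entropy(logits, target, reduction='none')
    p_t = torch.exp(-ce)
    # Down-weight each sample by (1 - p_t) ** gamma, then average
    return ((1 - p_t) ** gamma * ce).mean()

torch.manual_seed(0)
logits = torch.randn(6, 4)
target = torch.randint(0, 4, (6,))

plain_ce = F.cross_entropy(logits, target)
fl_gamma0 = focal_from_logits(logits, target, gamma=0.0)
fl_gamma2 = focal_from_logits(logits, target, gamma=2.0)

print(plain_ce.item(), fl_gamma0.item(), fl_gamma2.item())
```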

#PyTorch

Last updated: 2025/07/06, 13:25:25
