YOLOX Improvement: Adding the Coordinate Attention Module

Coordinate Attention
Code
- Creating attention.py with the CAM code
- Adding CAM in yolo_pafpn.py
Summary

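For a project, I tried modifying yolox-s to see whether I could push up the mAP on my own dataset. The authors of the Coordinate Attention module (CAM for short below) have released their code, and several bloggers have already published code that applies CAM to yolov5, yolox, and similar models, so at first I simply copied what was available. While doing so, I found that the official code cannot be used in yolox directly, and that the previously published CAM-for-yolox code did not run at all. After debugging, I traced the problem to the official code, so on a whim I wrote this article. Without further ado, here is the modified code.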

Coordinate Attention

Paper: http://arxiv.org/abs/2103.02907
Official code: https://github.com/Andrew-Qibin/CoordAttention

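Attention mechanisms are widely used in deep neural networks to improve model performance. Because of their computational cost, however, they are hard to apply to lightweight networks, though a few attention modules stand out, with SE and CBAM as representative examples. The SE module computes channel attention from 2D global pooling and improves network performance at very low computational cost; unfortunately, it ignores positional information. The CBAM module obtains positional attention through large-kernel convolutions, but these mostly capture local positional information.
The CAM module was published at CVPR 2021. It embeds positional information into channel attention and, because of its low computational cost, lets lightweight networks attend over larger regions. To mitigate the loss of positional information, the authors replace the 2D global pooling with two 1D poolings that aggregate features in parallel along the width and the height of the feature map, which effectively captures spatial coordinate information. The two resulting feature maps are then passed through two convolutions to produce two attention maps, one per direction, and both attention maps are multiplied into the original feature map to strengthen its representation.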
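The difference between SE-style 2D global pooling and the two 1D poolings can be seen from the tensor shapes alone. The snippet below is a minimal sketch and not the module itself: the feature-map size is arbitrary, and a plain sigmoid stands in for the learned convolutions (an assumption made only to keep the example short).

```python
import torch

# Dummy feature map: batch 2, 64 channels, height 20, width 32 (arbitrary sizes).
x = torch.randn(2, 64, 20, 32)

# SE-style squeeze: one scalar per channel, so all positional information is lost.
se_desc = x.mean(dim=(2, 3), keepdim=True)   # shape (2, 64, 1, 1)

# Coordinate Attention: two 1D poolings, each keeping one spatial axis.
pool_h = x.mean(dim=3, keepdim=True)         # average over width  -> (2, 64, 20, 1)
pool_w = x.mean(dim=2, keepdim=True)         # average over height -> (2, 64, 1, 32)

# Stand-in attention maps (sigmoid only; the real module applies convolutions first).
att_h = torch.sigmoid(pool_h)
att_w = torch.sigmoid(pool_w)

# Broadcasting multiplies every position (i, j) by att_h[..., i, :] and att_w[..., :, j].
out = x * att_h * att_w                      # shape (2, 64, 20, 32)
```

Because each attention map keeps one spatial axis, the product can weight every row and every column differently, which a single per-channel scalar cannot do.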

Code

Creating attention.py with the CAM code

Create attention.py under ./yolox/models/ with the CAM code below. Compared with the official code, and to fit yolox, nn.AdaptiveAvgPool2d is constructed directly inside forward.

import torch
import torch.nn as nn


class CAM(nn.Module):
    def __init__(self, channels, reduction=32):
        super(CAM, self).__init__()
        # Shared 1x1 conv that squeezes the channels of the concatenated descriptors.
        self.conv_1x1 = nn.Conv2d(in_channels=channels, out_channels=channels // reduction, kernel_size=1, stride=1,
                                  bias=False)
        # Any activation can be used here; nn.Mish assumes PyTorch >= 1.9.
        self.mish = nn.Mish()
        self.bn = nn.BatchNorm2d(channels // reduction)
        # Separate 1x1 convs that restore the channel count for each direction.
        self.F_h = nn.Conv2d(in_channels=channels // reduction, out_channels=channels, kernel_size=1, stride=1, bias=False)
        self.F_w = nn.Conv2d(in_channels=channels // reduction, out_channels=channels, kernel_size=1, stride=1, bias=False)
        self.sigmoid_h = nn.Sigmoid()
        self.sigmoid_w = nn.Sigmoid()

    def forward(self, x):
        h, w = x.shape[2], x.shape[3]
        # Build the pooling layers here so that their output size follows the input size.
        avg_pool_x = nn.AdaptiveAvgPool2d((h, 1))
        avg_pool_y = nn.AdaptiveAvgPool2d((1, w))
        x_h = avg_pool_x(x).permute(0, 1, 3, 2)  # (N, C, 1, H)
        x_w = avg_pool_y(x)  # (N, C, 1, W)
        # Shared transform on the concatenated descriptors: conv -> BN -> activation.
        x_cat_conv_act = self.mish(self.bn(self.conv_1x1(torch.cat((x_h, x_w), 3))))
        x_cat_conv_split_h, x_cat_conv_split_w = x_cat_conv_act.split([h, w], 3)
        s_h = self.sigmoid_h(self.F_h(x_cat_conv_split_h.permute(0, 1, 3, 2)))  # (N, C, H, 1)
        s_w = self.sigmoid_w(self.F_w(x_cat_conv_split_w))  # (N, C, 1, W)
        # Reweight the input with both directional attention maps.
        out = x * s_h.expand_as(x) * s_w.expand_as(x)
        return out
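
The listing above builds its pooling layers inside forward from the runtime height and width. As a sanity check on what those layers compute, the sketch below compares them with a plain mean over one spatial axis; the tensor size is arbitrary and the variable names are only illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 5, 7)
h, w = x.shape[2], x.shape[3]

# Pooling layers built from the runtime shape, as in forward().
pooled_h_module = nn.AdaptiveAvgPool2d((h, 1))(x)   # (1, 8, 5, 1)
pooled_w_module = nn.AdaptiveAvgPool2d((1, w))(x)   # (1, 8, 1, 7)

# The same descriptors written as a mean over a single spatial axis.
pooled_h_mean = x.mean(dim=3, keepdim=True)         # average over width
pooled_w_mean = x.mean(dim=2, keepdim=True)         # average over height

same_h = torch.allclose(pooled_h_module, pooled_h_mean, atol=1e-6)
same_w = torch.allclose(pooled_w_module, pooled_w_mean, atol=1e-6)
print(same_h, same_w)
```

Since the output size always matches the input size along the kept axis, the adaptive pooling reduces to a per-row and a per-column average, so either formulation should work for feature maps of any resolution.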

Adding CAM in yolo_pafpn.py

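CAM is a plug-and-play attention module and can be dropped in wherever a classic attention module such as CBAM would go; for details, see other tutorials on inserting attention mechanisms into the yolox head. The code here adds it in the PAFPN as an example. Which position works best depends on how it performs on the specific dataset.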

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Copyright (c) Megvii Inc. All rights reserved.
import torch
import torch.nn as nn
from .darknet import CSPDarknet
from .network_blocks import BaseConv, CSPLayer, DWConv
from .attention import CAM
class YOLOPAFPN(nn.Module):
    """
    YOLOv3 model. Darknet 53 is the default backbone of this model.
    """
    def __init__(
        self,
        depth=1.0,
        width=1.0,
        in_features=("dark3", "dark4", "dark5"),
        in_channels=[256, 512, 1024],
        depthwise=False,
        act="silu",
    ):
        super().__init__()
        self.backbone = CSPDarknet(depth, width, depthwise=depthwise, act=act)
        self.in_features = in_features
        self.in_channels = in_channels
        Conv = DWConv if depthwise else BaseConv
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        # self.upsample = nn.Upsample(scale_factor=2, mode="bilinear")
        self.lateral_conv0 = BaseConv(
            int(in_channels[2] * width), int(in_channels[1] * width), 1, 1, act=act
        )
        self.C3_p4 = CSPLayer(
            int(2 * in_channels[1] * width),
            int(in_channels[1] * width),
            round(3 * depth),
            False,
            depthwise=depthwise,
            act=act,
        )  # cat
        self.reduce_conv1 = BaseConv(
            int(in_channels[1] * width), int(in_channels[0] * width), 1, 1, act=act
        )
        self.C3_p3 = CSPLayer(
            int(2 * in_channels[0] * width),
            int(in_channels[0] * width),
            round(3 * depth),
            False,
            depthwise=depthwise,
            act=act,
        )
        # bottom-up conv
        self.bu_conv2 = Conv(
            int(in_channels[0] * width), int(in_channels[0] * width), 3, 2, act=act
        )
        self.C3_n3 = CSPLayer(
            int(2 * in_channels[0] * width),
            int(in_channels[1] * width),
            round(3 * depth),
            False,
            depthwise=depthwise,
            act=act,
        )
        # bottom-up conv
        self.bu_conv1 = Conv(
            int(in_channels[1] * width), int(in_channels[1] * width), 3, 2, act=act
        )
        self.C3_n4 = CSPLayer(
            int(2 * in_channels[1] * width),
            int(in_channels[2] * width),
            round(3 * depth),
            False,
            depthwise=depthwise,
            act=act,
        )
        self.CAM0 = CAM(int(in_channels[2] * width))
        self.CAM1 = CAM(int(in_channels[1] * width))
        self.CAM2 = CAM(int(in_channels[0] * width))
        # self.CAM3 = CAM(int(in_channels[0] * width))
        # self.CAM4 = CAM(int(in_channels[1] * width))
        # self.CAM5 = CAM(int(in_channels[2] * width))
    def forward(self, input):
        """
        Args:
            inputs: input images.
        Returns:
            Tuple[Tensor]: FPN feature.
        """
        #  backbone
        out_features = self.backbone(input)
        features = [out_features[f] for f in self.in_features]
        [x2, x1, x0] = features
        #############add CAM##############
        x0 = self.CAM0(x0)
        x1 = self.CAM1(x1)
        x2 = self.CAM2(x2)
        ##################################
        fpn_out0 = self.lateral_conv0(x0)  # 1024->512/32
        f_out0 = self.upsample(fpn_out0)  # 512/16
        f_out0 = torch.cat([f_out0, x1], 1)  # 512->1024/16
        f_out0 = self.C3_p4(f_out0)  # 1024->512/16
        fpn_out1 = self.reduce_conv1(f_out0)  # 512->256/16
        f_out1 = self.upsample(fpn_out1)  # 256/8
        f_out1 = torch.cat([f_out1, x2], 1)  # 256->512/8
        pan_out2 = self.C3_p3(f_out1)  # 512->256/8
        # pan_out2 = self.CAM3(pan_out2)
        p_out1 = self.bu_conv2(pan_out2)  # 256->256/16
        p_out1 = torch.cat([p_out1, fpn_out1], 1)  # 256->512/16
        pan_out1 = self.C3_n3(p_out1)  # 512->512/16
        # pan_out1 = self.CAM4(pan_out1)
        p_out0 = self.bu_conv1(pan_out1)  # 512->512/32
        p_out0 = torch.cat([p_out0, fpn_out0], 1)  # 512->1024/32
        pan_out0 = self.C3_n4(p_out0)  # 1024->1024/32
        # pan_out0 = self.CAM5(pan_out0)
        outputs = (pan_out2, pan_out1, pan_out0)
        return outputs
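
When changing the model size, it can help to check which channel counts CAM0, CAM1 and CAM2 actually receive and how small the bottleneck `channels // reduction` becomes. The helper below is a hypothetical sketch (the name `cam_channels` is not part of yolox); it only repeats the arithmetic from the constructor above, using width=0.50 for yolox-s.

```python
def cam_channels(width, in_channels=(256, 512, 1024), reduction=32):
    """Return (CAM input channels, bottleneck channels) for CAM0, CAM1, CAM2."""
    result = []
    # CAM0 is applied to x0 (deepest feature), CAM2 to x2 (shallowest).
    for base in reversed(in_channels):
        channels = int(base * width)
        result.append((channels, channels // reduction))
    return result


# yolox-s uses width=0.50.
yolox_s = cam_channels(width=0.50)
print(yolox_s)   # [(512, 16), (256, 8), (128, 4)]

# A narrower setting (width=0.25) shrinks the smallest bottleneck to 2 channels.
narrow = cam_channels(width=0.25)
print(narrow)    # [(256, 8), (128, 4), (64, 2)]
```

If the bottleneck gets too small for a narrow model, one option (an assumption, not something tested here) is to lower `reduction` or to put a floor on it, for example `max(8, channels // reduction)`.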

Summary

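CAM, like SE and CBAM, is a plug-and-play attention mechanism that plays an important role in lightweight networks such as yolov5 and yolox. On my dataset, the CAM + yolox combination described here raised mAP by 0.02 compared with the model without it, and by 0.01 compared with using CBAM, which is a worthwhile gain.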
