ML | U-Net

U-Net

The U-Net architecture follows an encoder-decoder cascade structure, where the encoder gradually compresses information into a lower-dimensional representation. Then the decoder decodes this information back to the original image dimension. Owing to this, the architecture gets an overall U-shape, which leads to the name U-Net.

U-Net is a convolutional neural network that was developed for image segmentation. The network is based on a fully convolutional neural network whose architecture was modified and extended to work with fewer training images and to yield more precise segmentation.

U-Net is a prominent deep learning architecture primarily used for semantic segmentation in computer vision tasks, particularly in the biomedical field. Developed by Olaf Ronneberger and colleagues in 2015, U-Net was designed to perform well even with limited training data, making it particularly useful in medical imaging where annotated datasets are scarce.

Fig. 1. U-net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations. ( Source: Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation.

Architecture Overview

U-Net features a distinctive U-shaped architecture that consists of two main paths: the contracting path (encoder) and the expansive path (decoder).

Contracting Path (Encoder): This part of the network captures context from the input image. It consists of several convolutional layers followed by max pooling operations, which progressively reduce the spatial dimensions while increasing the depth of feature maps. The encoder extracts high-level features from the input image, similar to other convolutional networks.
Expansive Path (Decoder): The decoder's role is to upsample the feature maps obtained from the encoder back to the original image size. This is achieved through transposed convolutions (deconvolutions) and concatenation with corresponding feature maps from the encoder via skip connections. These skip connections help preserve spatial information that might be lost during downsampling, allowing for more precise localization of features in the output segmentation map.

Key Features

Skip Connections: These connections between corresponding layers in the encoder and decoder allow the model to retain spatial information, which is crucial for accurate segmentation.
Fully Convolutional Network: U-Net is a fully convolutional network, meaning it does not rely on fully connected layers, making it more efficient for image processing tasks.
Effective with Limited Data: U-Net's design enables it to perform well even when trained on small datasets, which is a significant advantage in fields like medical imaging where data acquisition can be challenging.

Applications

U-Net has been widely adopted for various applications beyond biomedical image segmentation, including:

Medical Imaging: Segmenting organs or tumors in MRI and CT scans.
Satellite Image Analysis: Identifying land use or changes in land cover.
Autonomous Vehicles: Assisting in scene understanding by segmenting objects within images.

In summary, U-Net's innovative architecture and effective use of skip connections have made it a foundational model in semantic segmentation tasks across various domains, particularly where data scarcity poses challenges.

U-Net 模型

U-Net 模型可以分為三個主要部分：

編碼器（Encoder）：編碼器負責通過卷積和最大池化等操作對輸入影像進行降維處理（Downsampling），以提取影像中的重要特徵。我們可以把這一過程想像成在影像上移動一個小視窗，然後記錄下視窗中的特徵
解碼器（Decoder）：解碼器則通過反卷積做上採樣（Upsample）來把特徵資訊重新組合成一張與原始影像大小相同的新影像，其中，反卷積會逐步還原原始輸入影像中的高頻資訊（影像細節）和低頻資訊（影像輪廓），最後生成的影像就會是像素級的分割結果
跳躍連接（Skip Connection）：跳躍連接是在編碼器和解碼器之間的連接，它有助於保留在編碼器階段可能遺失的資訊，通過拼接，使得解碼器在上採樣的時候，能夠保留更多在編碼器時的特徵圖中的高分辨率細節資訊，提高分割的精確度。這對於語義分割任務非常重要，因為編碼器的卷積會保留重要的細節資訊

U-net: Convolutional networks for biomedical image segmentation

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18 (pp. 234-241). Springer International Publishing.

The network architecture is illustrated in Figure 1.

It consists of a contracting path (left side) and an expansive path (right side).
The contracting path follows the typical architecture of a convolutional network.
It consists of the repeated
- application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2 for downsampling.
- At each downsampling step we double the number of feature channels.
- Every step in the expansive path consists of an upsampling of the feature map followed by a
  - 2x2 convolution ("up-convolution") that halves the number of feature channels,
  - a concatenation with the correspondingly cropped feature map from the contracting path,
- and two 3x3 convolutions, each followed by a ReLU.
- The cropping is necessary due to the loss of border pixels in every convolution.
- At the final layer a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes. In total the network has 23 convolutional layers.

To allow a seamless tiling of the output segmentation map (see Figure 2), it is important to select the input tile size such that all 2x2 max-pooling operations are applied to a layer with an even x- and y-size.

一文搞懂 deconvolution、transposed convolution、sub-pixel or fractional convolution

deconvolution，实际上与 transposed convolution 、sub-pixel or fractional convolution 指代相同。transposed convolution 是一个更好的名字，sub-pixel or fractional convolution可以看成是transposed convolution的一个特例。对一个常规的卷积层而言，前向传播时是convolution，将input feature map映射为output feature map，反向传播时则是transposed convolution，根据output feature map的梯度计算出input feature map的梯度，梯度图的尺寸与feature map的尺寸相同
convolution 过程是将 4×4 的map映射为 2×2 的map，而 transposed convolution 过程则是将 2×2 的map映射为 4×4 的map，两者的kernel size均为3
对于一般情况，只需把握一个宗旨：transposed convolution 将 output size 恢复为 input size 且保持连接方式相同

反捲積(Deconvolution)、上採樣(UNSampling)與上池化(UnPooling)差異

逆卷積(Deconvolution)比較容易引起誤會，轉置卷積(Transposed Convolution)是一個更為合適的叫法

Copy and crop 操作

https://github.com/Rwzzz/Unet

论文中只用分割出细胞边界，所以最后使用的是2个11卷积得到背景和目标两个。如果是多目标分割，根据分割目标的种类来决定使用11的卷积的数量来输出Segmentation map.注：对于多目标的Label标注：可以使用不同颜色，然后使用One-hot编码生成Label进行训练
Copy and crop操作：经过两次33卷积后，大小变为28×28，然后经过一次22卷积，变为56×56，这时和左侧大小为64×64的图像进行维度的叠加，但是由于图像大小不同，需要将左侧的64×64大小的图像裁剪为56×56大小,此处的裁剪是合理的，原因看下一步3中讲的Overlap-tile策略。
可以发现Unet论文中输入的图像是572×572但是输出图像确实388×388.这是不是就意味着原图像存在信息丢失的现象呢？实际上不是的，经过了no padding的卷积操作，输入图像和输出图像肯定是不一样的尺寸，但是Unet在论文中提及了一种策略--Overlap-tile，将图像进行镜像扩充和输入网络，这样经过卷积后得到的输出图像和实际需要提取的图像是相同的尺寸。例如下图，实际需要分割的图像是黄色框所选中部分，但是输入到网络中的图像是蓝色部分，对空白部分进行镜像填充，这样经过网络后所得到的的输出大小尺寸适合实际需要分割的图像大小是一样的。
Unet模型简单，并且使用较少的数据集，可以达到非常理想的分割效果，对于医学和其他一些数据集比较少的领域优势很大。