The need for contactless vascular biometric systems has increased significantly in recent years, and deep learning has proven efficient for vein segmentation and matching. Palm and finger vein biometrics are well researched, whereas research on wrist vein biometrics is limited. Wrist vein biometrics is promising because the wrist lacks the finger and palm patterns present on the skin surface, making the image acquisition process easier. This paper presents a novel low-cost, deep learning-based, end-to-end contactless wrist vein biometric recognition system. The FYO wrist vein dataset was used to train a novel U-Net CNN structure to extract and segment wrist vein patterns effectively; the extracted images achieved a Dice Coefficient of 0.723. A CNN and a Siamese Neural Network were implemented to match wrist vein images, obtaining a highest F1-score of 84.7%. The average matching time is less than 3 s on a Raspberry Pi. All the subsystems were integrated through a purpose-built GUI to form a functional end-to-end deep learning-based wrist vein biometric recognition system.
Keywords: biometrics; wrist vein; deep learning; machine learning; Siamese Neural Network; convolutional neural network
In a data-driven world, data security, privacy security and protection of personal identification information are essential. Personal Identification Numbers (PINs) and passwords are extensively used for human identification and verification. With the advancement of technology, it has become more challenging to safeguard confidential information. PINs and passwords are susceptible to spoofing, and are likely to be stolen and transferred between people [[
Biometric recognition systems utilise physiological or behavioural features for recognition [[
The other type of physiological feature that can be used for biometric recognition is an intrinsic feature. Features used for biometrics are also referred to as biometric modalities. Intrinsic modalities, specifically vein modalities, are much harder to spoof [[
The wrist vein is known to have sufficient unique features to be used in a vascular biometric system [[
In biometric systems, the terms identification and authentication are often used interchangeably. However, these terms have different meanings, which depend on the operating mode of the biometric system. Authentication is a one-to-one matching process for a specific user: the system compares the preprocessed input image obtained from the user, called the probe image, to the image previously stored in the system, called the reference image. Because the user has already claimed a specific identity, this operating mode requires only a single one-to-one comparison, which verifies the claimed identity and grants access depending on the matching output. Identification is the step where the input image taken from the user (the probe image) is compared to all the images already stored in the database through the registration process. This is a one-to-N matching operational mode.
Systems for both the above-mentioned operating modes can be designed using traditional signal processing methods and deep learning methods. Deep learning has seen success in palm and finger vein recognition recently. Deep learning has the advantage of being able to encompass all the signal processing steps required for vein recognition to provide an end-to-end recognition system. Although most palm and finger vein research now focuses on deep learning [[
The contributions of this paper are listed below:
- Extension of the literature survey carried out in our previous study [6].
- Segmentation of wrist vein images using a modified U-Net architecture.
- Development of a matching engine that can compare a probe image with a reference image using a Convolutional Neural Network (CNN) followed by a Siamese Neural Network [10] for vascular biometrics.
- Development of a Graphical User Interface (GUI) and integration of the subsystems to form a complete end-to-end deep learning-based wrist vein biometric system.
To the best of our knowledge, this is the first study where an entire end-to-end wrist vein recognition system using a deep learning algorithm has been developed. Section 2 provides a comprehensive literature review of state-of-the-art wrist vein recognition systems and deep learning approaches applied to other vein biometrics. Section 3 presents the proposed subsystems. Section 4 reports and discusses the obtained results with its performance evaluation. Section 5 is dedicated to the conclusion and future work.
This section provides a review of the state-of-the-art wrist vein recognition systems covering the literature on image acquisition, databases, segmentation algorithms and matching engines. Deep learning approaches applied to vein recognition have also been reviewed for the sake of completeness.
Acquiring an image of the vein patterns is the first step in the vein recognition process. Most commercial cameras have IR filters embedded to prevent IR light from entering the sensor to avoid having unnatural-looking images. For biometric recognition, this filter needs to be removed as IR light is necessary to capture the veins. To capture an optimum image, the distance between the wrist and the camera lens, as well as the camera quality/image size needs to be accounted for. Crisan et al. suggest utilising a tight optical spectrum window of 740 nm to 760 nm to provide maximum penetration in vein biometrics [[
Although NIR images provide much clearer vein structures, using regular light has also been considered in some studies. In [[
From the surveyed databases mentioned in Table 1, it was observed that the captured vein pattern images are generally not directly suitable for feature extraction. The position of the wrist is inconsistent between images and contains more information than required. Raghavendra et al. use hand pegs in [[
Enhancing the image is executed through a combination of traditional image enhancement approaches, such as filters and equalisation. Gaussian filters are used in [[
Once the preprocessed image is obtained, the next stage in the system is feature extraction and finally decision making or matching. Feature extraction is the process of capturing important features from the input image. The majority of the literature surrounding feature extraction in wrist vein recognition utilise texture feature extraction methods, which include statistical, structural, transform-based, model-based, graph-based, learning-based and entropy-based approaches [[
When compared to traditional feature extraction methods that are reviewed in [[
Kurban et al. use three different types of classifiers in [[
Chen et al. utilised CNN for palm vein recognition in [[
U-Net is a CNN designed for biomedical image segmentation as proposed by Ronneberger et al. in [[
We developed a Siamese Neural Network for matching palm vein patterns in [[
Conducting this review provided insights that informed the design decisions for the various subsystems and the integration of the end-to-end wrist vein system.
This section describes the approach taken to design a low-cost, complete, contactless, end-to-end wrist vein biometric recognition system. The system is made up of multiple subsystems, as shown in Figure 2. The first step is to develop the image acquisition device, which is then used to acquire the wrist vein image. The second step applies preprocessing techniques, based on CLAHE, to transform a wrist vein image and make it suitable for image segmentation. The third step is to segment the preprocessed images using the U-Net CNN architecture and extract the vein patterns to represent them numerically. In the fourth step, CNN and Siamese Neural Networks were used for image matching; this is one of the most important steps of our recognition system. Finally, a GUI was designed and the subsystems were integrated to achieve a fully working system.
The image acquisition subsystem comprises NIR LEDs and a NoIR camera with distance-mapping calibration. The NIR LEDs used in this system are summarised in Table 2 along with their key specifications. The Raspberry Pi Camera Module v2 NoIR, built around a Sony IMX219 8-megapixel CMOS sensor, was chosen to capture the vein patterns. The camera module is small at 25 mm × 23 mm × 9 mm and costs less than USD 100; it was chosen for these factors as well as its widespread availability and integration with the Raspberry Pi, a common hobbyist single-board computer. The camera can be controlled from Python via the open-source picamera package, which allows the user to configure the camera settings, preview the output and capture images. This influenced the camera choice, as our previous deep learning algorithm development was carried out in Python Version 3.8.
Preprocessing is the process of applying transformations to raw input images to make them suitable for image segmentation. As the image acquisition device was being developed in parallel to the deep learning algorithms, it was not possible to test the algorithms with collected data during prototype building. In the interim, the FYO dataset, collected by Toygar et al. in [[
Wrist vein images captured during experimentation lacked contrast between the vein and skin. To improve this, CLAHE was applied as a preprocessing step. This process involves performing traditional histogram equalisation on small tiles throughout the image and then clipping the contrast at a given limit. In this case, CLAHE is applied on tiles with a size of 8 × 8 and clips the contrast with a threshold of 6.
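The per-tile operation at the heart of CLAHE can be sketched in pure Python as below. This is an illustrative sketch, not the paper's implementation: in practice a library routine such as OpenCV's cv2.createCLAHE(clipLimit=6.0, tileGridSize=(8, 8)) would be used, which also interpolates bilinearly between neighbouring tile mappings to avoid block artefacts.

```python
def clahe_tile(tile, clip_limit):
    """Clipped histogram equalisation for one grayscale tile.

    `tile` is a flat list of 8-bit pixel values; `clip_limit` is the
    maximum count allowed in any histogram bin before redistribution.
    """
    # Build the 256-bin histogram of the tile.
    hist = [0] * 256
    for p in tile:
        hist[p] += 1

    # Clip each bin at the limit and pool the excess counts.
    excess = 0
    for i in range(256):
        if hist[i] > clip_limit:
            excess += hist[i] - clip_limit
            hist[i] = clip_limit

    # Redistribute the clipped mass uniformly across all bins.
    bonus = excess // 256
    for i in range(256):
        hist[i] += bonus

    # Standard histogram equalisation via the cumulative distribution.
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    n = cdf[-1]
    mapping = [round(255 * c / n) for c in cdf]
    return [mapping[p] for p in tile]
```

Applying this mapping to every 8 × 8 tile, with the contrast clipped at the chosen threshold of 6, spreads a narrow intensity range over the full 0–255 scale, which is what lifts the vein/skin contrast in the captured images.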
Image segmentation is the process of extracting the vein patterns from the captured images and representing them numerically. The U-Net CNN was used for image segmentation in this paper. The U-Net architecture was chosen due to its success in segmenting palm vein images in our previous work [[
Significant amounts of labelled training data are also required to perform supervised learning. As the FYO dataset is unlabelled, this would have required days of manual segmentation of the images. Unsupervised learning, which does not require labelled training data, was also considered but dismissed as this method of learning is usually reserved for problems such as clustering and association of input data, not image segmentation. To solve this, signal processing methods proposed in [[
The U-Net CNN architecture consists of two symmetric paths: a contracting path designed to capture context and an expansive path to enable localisation. The architecture is shown in Figure 3.
The contracting path consists of four contracting blocks, each containing two 2D convolutions with a 3 × 3 kernel and ReLU activation functions. Each block is followed by max pooling with a 2 × 2 pool size, which halves both spatial dimensions of the feature map; the convolutions in the next block then double the number of filters. In this U-Net implementation, the single-channel input image is expanded to 16 filters in the first contracting block, and the number of filters doubles in every subsequent block up to 128 filters in the final block.
The contracting and expansive paths are joined by two 2D convolutions, again both using a 3 × 3 kernel with ReLU activation. Similar to the contracting path, the expansive path is built up of blocks. Each block starts with a 2D deconvolution (also known as an up-convolution) with a 3 × 3 kernel and 2 × 2 strides, whose output is concatenated with the output of the corresponding block in the contracting path. This is followed by a dropout layer at a rate of 10%, which randomly sets activations to zero to prevent overfitting. A 2D convolution with a 3 × 3 kernel and a ReLU activation function is then performed, as in the contracting path. In the original U-Net architecture, the dropout layer is not present and a second convolution layer takes its place; the dropout layer is inserted here to combat the effects of overfitting. The output of the final expansive block passes through a 1 × 1 convolution that reduces the 16 features down to 1 feature to create a mask image. The sigmoid activation function is used for this binary classification, where 1 represents vein and 0 represents background. In total, the model consists of 23 tunable layers and 1,179,121 trainable parameters.
The input and output of the network have the same dimensions, so each pixel in the predicted mask corresponds directly to a pixel in the input image.
This architecture was proposed in our previous work in [[
To generate labelled data for training the U-Net, a mask image generation algorithm was used as proposed in our previous work in [[
The Dice Coefficient measures how well two binary mask images overlap. The value returned from the Dice Coefficient is between 0, which represents no overlap, and 1, which represents complete overlap. The Dice Coefficient can be calculated as
Dice(P, Y) = 2|P ∩ Y| / (|P| + |Y|),
where P is the predicted binary mask and Y is the known true binary mask.
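The Dice Coefficient above, along with the pixel-wise binary accuracy used later to evaluate the trained U-Net, can be sketched in pure Python over flat binary masks:

```python
def dice_coefficient(p, y):
    """Dice Coefficient between two flat binary masks (lists of 0/1).

    Returns 1.0 for complete overlap and 0.0 for no overlap.
    """
    overlap = sum(a & b for a, b in zip(p, y))
    size = sum(p) + sum(y)
    return 2 * overlap / size if size else 1.0


def binary_accuracy(p, y):
    """Fraction of pixels for which prediction and truth agree."""
    return sum(a == b for a, b in zip(p, y)) / len(y)
```

Note that binary accuracy counts correct background pixels as well, so for sparse vein masks it is typically higher than the Dice Coefficient, as seen in the reported results (90.5% accuracy versus 0.723 Dice).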
The U-Net was implemented using TensorFlow and Keras 2.1. It was trained on the FYO wrist vein database, utilising the region-of-interest images, which consist of 640 wrist vein images [[
Image matching is the process of comparing two mask images, which are obtained as the result of the image segmentation stage and deciding whether the two images match. Two neural networks were considered for matching: a CNN and a Siamese Neural Network.
This paper uses a CNN architecture to compare the mask images and output a probability that the mask images match. The network architecture is shown in Figure 5.
The network consists of convolutional blocks. Each block starts with a 2D convolution layer with a 3 × 3 kernel followed by a ReLU activation function, then a max pooling layer with a 2 × 2 pool size that halves the spatial dimensions. A dropout layer follows, randomly setting values to zero at a rate of 25% to prevent overfitting. The number of output features doubles with every block, starting at 32 in the blocks immediately after the input images. The network has two input images that follow two identical pathways, each consisting of two such convolutional blocks chained in series. Each pathway contains two tunable layers for a total of 18,816 trainable parameters.
After these blocks, the two image feature sets are concatenated and proceed through another two convolutional blocks identical in all aspects except output features. The feature set is then flattened into a 1D tensor with a length of 32,768. A fully connected layer reduces these features down to 512, which is followed by another 25% dropout layer and finally the last fully connected layer, which outputs one value between 0 and 1. All fully connected layers use the ReLU activation function except for the last layer, which uses the sigmoid activation function due to its binary output. In total, the model consists of eight tunable layers for a total of 18,291,201 trainable parameters.
Siamese Neural Networks are networks that consist of two or more sub-networks and have seen success for vein biometrics in [[
The Siamese Neural Network in Figure 6 uses a shared sub-network to process each input image; the architecture of this sub-network is shown in Figure 7. The sub-network passes its input image through a series of convolutional blocks.
The output of the convolution blocks is then flattened and run through another batch normalisation layer. The output is then mapped onto a fully connected layer consisting of 128 neurons, which represent the output feature set. The ReLU activation function is used in the dense layer. In total, the model consists of 11 tunable layers with 7,678,602 trainable parameters.
Contrastive loss [[
L = Y × P² + (1 − Y) × max(M − P, 0)²,
where L is the calculated loss, Y is the known truth values, P is the predicted truth values and M is the baseline distance for pairs to be considered dissimilar. The margin is commonly set to 1, which is the value used throughout this paper.
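A minimal pure-Python sketch of the per-pair contrastive loss follows, using the common formulation in which the label is 1 for a genuine (matching) pair and 0 for an impostor pair; some implementations invert this convention or include an additional factor of ½, so treat the exact form as an assumption.

```python
def contrastive_loss(y, d, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    y      -- 1 for a genuine (matching) pair, 0 for an impostor pair
    d      -- predicted distance between the two embeddings
    margin -- baseline distance beyond which impostor pairs incur no loss
    """
    return y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
```

Genuine pairs are thus penalised for being far apart, while impostor pairs are penalised only when they fall inside the margin, which pushes the two classes of pairs apart in the embedding space.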
Both networks were implemented using TensorFlow and Keras 2.1 and were trained using the FYO wrist vein dataset. The dataset was split 80:20 between training and testing for a total of 256 wrists for training and 64 wrists for testing and validation. It is important to note that each wrist was captured twice between two imaging sessions. The U-Net discussed in Section 3.3.1 was used to generate mask images for all images.
As matching is the process of comparing two images, a dataset of image pairs was constructed. Each wrist in the FYO dataset had two images across two sessions. These two images were paired and were assigned a "true" label. The image from the first session is then paired with a random image from the second session that does not match and is assigned a "false" label. This produces a wrist pair dataset of 512 pairs for training and 128 pairs for testing and validation.
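The pair-construction procedure described above can be sketched as follows; the dictionary-based inputs and the `build_pairs` helper name are illustrative assumptions, not the paper's code.

```python
import random


def build_pairs(session1, session2, seed=0):
    """Build labelled image pairs from two capture sessions.

    `session1` and `session2` map wrist IDs to that wrist's image from
    each session. Every wrist contributes one genuine ("true") pair and
    one impostor ("false") pair, mirroring the construction above.
    """
    rng = random.Random(seed)
    ids = sorted(session1)
    pairs = []
    for wid in ids:
        # Genuine pair: the same wrist across the two sessions.
        pairs.append((session1[wid], session2[wid], True))
        # Impostor pair: a different, randomly chosen wrist from session 2.
        other = rng.choice([w for w in ids if w != wid])
        pairs.append((session1[wid], session2[other], False))
    return pairs
```

With the 256 training wrists this yields the 512 training pairs quoted above, and the 64 test wrists yield the 128 pairs used for testing and validation.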
To attempt to teach the network to compensate for translation, rotation and scale of the vein patterns, a set of augmentations was applied to the images before training or testing, including random rotation, translation and scaling of the image.
Both networks were trained using the Adam optimiser with a learning rate of
A graphical user interface (GUI) was developed to allow users to capture wrist vein images, segment them using U-Net and compare them using various matching networks. As the camera in use is the Raspberry Pi NoIR camera, the application needed to run on the Raspberry Pi itself. Python was chosen because the machine learning algorithms were also developed in Python. PyQt5, a Python interface to the Qt GUI library, was used to design the application.
Due to the Raspberry Pi having significantly fewer computational resources than a desktop computer, it is not possible to install and run TensorFlow 2 without significant workarounds. TensorFlow Lite (TFLite) is a lightweight version of TensorFlow that can run inference on small embedded platforms such as the Raspberry Pi, and regular TensorFlow models can be compiled into TFLite models. The GUI application uses a TFLite U-Net model to segment the wrist vein images. To support future development of matching networks, the application can also load different TFLite models from a folder to use in the matching process; the CNN and Siamese Neural Network models are pre-loaded in the initial setup. The segmentation and matching processes of the GUI are shown in Figure 8 and Figure 9, respectively.
The image acquisition device was produced and evaluated. Of the four NIR wavelengths proposed, the 860 nm LEDs provided the clearest wrist vein images. This is likely due to the 860 nm LEDs having a much higher radiant intensity than the others; the remaining wavelengths would require a longer exposure and/or a higher ISO setting, which in turn would introduce more noise into the wrist vein image. The final image acquisition device is shown in Figure 10 with the 860 nm LEDs. The acquisition system implementation is intended to be filed as a patent.
The U-Net used for image segmentation was trained and tested using the FYO wrist vein dataset. The dataset of 640 images was split 80:20 between training and testing. The network was trained for 107 epochs as training was stopped early due to the plateau of the accuracy metric. In this case, the accuracy is binary accuracy and is calculated based on how many pixels the model correctly predicts. The Dice Coefficient is another metric which measures the overlap of the predicted and true mask images as described in Section 3.3.3. The model achieved a binary accuracy of 90.5% and a Dice Coefficient of 0.723. Figure 11 shows the Dice Coefficient and binary accuracy for one training session.
Overall, the U-Net is successful at segmenting wrist vein images. Compared to the results of palm vein segmentation using U-Net from [[
Both image matching neural networks were evaluated using the FYO wrist vein dataset. The networks were provided wrist vein mask images segmented using the discussed U-Net architecture. Binary accuracy and F1-score are used to evaluate the matching models. Binary accuracy is the percentage of pairs the model predicts correctly, while the F1-score is the harmonic mean of precision and recall; for both metrics, 100% indicates a perfect model and 0% the opposite. The F1-score can be calculated as

F1 = 2 × (Precision × Recall) / (Precision + Recall) = 2TP / (2TP + FP + FN),

where TP, FP and FN denote the numbers of true positive, false positive and false negative pairs, respectively.
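The F1 computation from raw pair counts can be sketched in a few lines of pure Python:

```python
def f1_score(tp, fp, fn):
    """F1-score from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)   # fraction of predicted matches that are real
    recall = tp / (tp + fn)      # fraction of real matches that are found
    return 2 * precision * recall / (precision + recall)
```

Because it ignores true negatives, the F1-score is a stricter measure than binary accuracy when, as here, correctly rejecting impostor pairs is easy relative to correctly accepting genuine ones.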
The CNN was trained for 200 epochs. The binary cross entropy loss function was used for training and validation. Binary accuracy and F1-score were used as the accuracy metrics. The network achieved a binary accuracy of 65.6% and an F1-score of 73.3% after training for 200 epochs. Figure 12 and Figure 13 show the evaluation metrics of the network for the training and validation datasets for one training session.
Likely due to the size of the network, with over 18 million trainable parameters, the network learns at a slow rate, as evidenced by the number of epochs required to increase the validation accuracy compared to the Siamese and U-Net neural networks. Even after 200 epochs the loss is still trending downwards, indicating the network still has room to learn. However, as can be seen in the plots in Figure 12 and Figure 13, the validation accuracy and F1-score plateau early, at approximately 50 epochs, while the training metrics continue to increase, which may indicate that the network is failing to generalise and is instead overfitting to the training data. This may be attributed to the relatively small dataset, as larger networks generally require more data to train. Augmentation of the input images effectively increases the size of the dataset, but it does not change the small number of unique pairs available to the network. This small number of pairs is also likely why the validation F1-score is consistently higher than the training F1-score, as there is a chance the validation dataset was allocated higher quality images that are easier to match.
The Siamese Neural Network was trained for 100 epochs. Contrastive loss was used for training and validation. Similar to the CNN, binary accuracy and F1-score were used as the accuracy metrics. The network achieved a binary accuracy of 85.1% and an F1-score of 84.7% after training for 100 epochs. Figure 14 and Figure 15 show the contrastive loss, binary accuracy and F1-score of the network for the training and validation datasets for one training session.
Compared to the CNN, the Siamese Neural Network provides much better accuracy and F1-score despite training for only half the number of epochs. The loss of the network has begun to plateau at 100 epochs, indicating the network has reached a minimum. If this is a local minimum, the accuracy and F1-score may be further increased by adjusting the learning rate, as was done during U-Net training.
In this paper, a small and relatively low-cost wrist vein image acquisition device was designed to capture high-quality images of wrist vein patterns. It was found that NIR light with a wavelength of 860 nm provides the highest quality vein pattern images; however, further evaluation is required to determine whether the radiant intensity of the LEDs influenced this result. The U-Net neural network architecture has been applied to the problem of image segmentation to extract the vein pattern features from the input images, achieving a Dice Coefficient of 0.723 when tested on the FYO wrist vein dataset. Convolutional and Siamese Neural Networks have been applied to the problem of image matching, with the Siamese Neural Network showing promising results, achieving an F1-score of 84.7% when tested on the FYO wrist vein dataset. To combine all aspects of the research together, a GUI was developed to allow users to capture wrist vein images, segment them with U-Net and compare them with locally saved images using their own matching neural networks or one of the provided networks.
Due to supply constraints, it was not possible to obtain LEDs with similar radiant intensities for each wavelength. As a result, the 860 nm LED chosen had a radiant intensity more than 180 times that of the next brightest LED, as can be seen in Table 2. This has likely skewed the image acquisition results towards 860 nm, as more light allows a shorter exposure and a lower ISO setting, reducing the noise in the captured image. Current-limiting resistors were added to the design as a safety measure for the LEDs. It would be ideal to source LEDs with similar radiant intensities as an extension of this research to provide more validation that 860 nm is the optimal wavelength.
Although the Siamese Neural Network has provided promising results for image matching, it can be optimised further by tuning the hyperparameters. The GUI has been designed to allow users to easily test their image-matching algorithms in real-time settings with an image acquisition system. Users can simply compile their TensorFlow models to TensorFlow Lite models, upload the model to the Raspberry Pi and select the model in the GUI. The only requirement is that the model take two mask images as inputs and output the probability that the two provided images match. The next step could be to further investigate the image matching algorithms and evaluate them using the end-to-end system presented in this paper.
Graph: Figure 1 Wrist Vein Example Images. (a) input, (b) Enhanced Vein Image, (c) Vein Features.
Graph: Figure 2 Wrist Vein Recognition System Flowchart.
Graph: Figure 3 U-Net Architecture.
Graph: Figure 4 Generated mask images for the original image from FYO. (a) Original Image, (b) Mask Generated, (c) Mask Generated with U-Net.
Graph: Figure 5 CNN Matching Network Architecture.
Graph: Figure 6 Siamese Neural Network Architecture with sub-network.
Graph: Figure 7 Sub-network Siamese Feature Network Architecture.
Graph: Figure 8 Wrist Vein GUI Segmentation Process.
Graph: Figure 9 Wrist Vein GUI Matching Process.
Graph: Figure 10 The Developed Image Acquisition Device (Filed for Patent).
Graph: Figure 11 U-Net Training Evaluation.
Graph: Figure 12 Convolutional Network Training Evaluation: Loss and Accuracy.
Graph: Figure 13 Convolutional Network Training Evaluation: F1-Score.
Graph: Figure 14 Siamese Network Training Evaluation: Loss and Accuracy.
Graph: Figure 15 Siamese Network Training Evaluation: F1-Score.
Table 1 Wrist Vein Pattern Databases.
| Name | Participants | Wrists | Samples | Sessions | Total | Camera | NIR |
| PUT [ | 50 | 2 | 4 | 3 | 1200 | Unknown | Unknown |
| Singapore [ | 150 | 2 | 3 | Unknown | 900 | Hitachi KP-F2A | 850 nm |
| FYO [ | 160 | 2 | 1 | 2 | 640 | 1/3-inch infrared CMOS | Unknown |
| UC3M [ | 121 | 1 | 5 | Unknown | 605 | DM 21BU054 | 880 nm |
| UC3M-CV1 [ | 50 | 2 | 6 | 2 | 1200 | Logitech HD Webcam C525 | 850 nm |
| UC3M-CV2 [ | 50 | 2 | 6 | 2 | 1200 per device | Xiaomi Pocophone F1, Xiaomi Mi 8 | 960 nm |
| Kurban et al. [ | 17 | 2 | 3 | Unknown | 102 | 5 MP mobile phone | No NIR |
| Pascual et al. [ | 30 | 2 | 6 | Unknown | 360 | DM 21BU054 | 880 nm |
| Fernández et al. [ | 30 | Right | 4 | 1 | 120 | CCD camera | 880 nm |
Table 2 Specification of Near-Infrared Light Used In The Image Acquisition Device.
| Wavelength | Reasoning | Model | Forward Voltage | Forward Current | Radiant Intensity |
| 740 nm | Successfully used in wrist vein literature | OIS 330 740 X T | 1.7 V | 30 mA | 6 mW/sr |
| 770 nm | Absorbed best by deoxygenated hemoglobin | OIS 330 770 | 1.65 V | 50 mA | 6 mW/sr |
| 860 nm | Successfully used in wrist vein literature | SFH 4715AS | 2.9 V | 1 A | 1120 mW/sr |
| 880 nm | Successfully used in wrist vein literature | APT1608SF4C-PRV | 1.3 V | 20 mA | 0.8 mW/sr |
Conceptualisation, W.A. and F.M.; methodology, F.M., W.A., D.C., P.G.; software, D.C. and F.M.; validation, P.G., D.C. and F.M.; formal analysis, W.A. and F.M.; writing—original draft preparation, F.M., D.C. and P.G.; writing—review and editing, W.A.; supervision, W.A.; project administration, W.A. and F.M. All authors have read and agreed to the published version of the manuscript.
The study was conducted in accordance with The University of Auckland Human Participants Ethics Committee (UAHPEC) (protocol code: UAHPEC20540 and date of approval: 08/06/2022).
Written informed consent has been obtained from the participants to publish this paper.
Not applicable.
The authors declare no conflict of interest.
We would like to acknowledge the authors of the FYO wrist vein dataset for providing us with their dataset for research purposes.
By Felix Marattukalam; Waleed Abdulla; David Cole and Pranav Gulati