


Title:
AUTOMATIC ENVIRONMENTAL PERCEPTION BASED ON MULTI-MODAL SENSOR DATA OF A VEHICLE
Document Type and Number:
WIPO Patent Application WO/2024/062025
Kind Code:
A1
Abstract:
According to a method for automatic environmental perception based on sensor data of a vehicle (1), a first and a second image (6, 7) of respective sensor modalities (3, 4) are received, a first feature map is generated by applying at least one layer (11, 16) of a neural network (8) to the first image (6) and a second feature map is generated by applying at least one further layer (31, 36) of the neural network (8) to the second image (7). A transformed feature map is generated based on the second feature map using an affine transformation accounting for a deviation in extrinsic parameters of the sensor modalities (3, 4), a first fused feature map is generated by concatenating the first feature map with the transformed feature map, and a visual perception task is carried out depending on the first fused feature map.

Inventors:
YATHIRAJAM BHARADWAJA (IN)
DAS ARINDAM (IN)
Application Number:
PCT/EP2023/076055
Publication Date:
March 28, 2024
Filing Date:
September 21, 2023
Assignee:
CONNAUGHT ELECTRONICS LTD (IE)
International Classes:
G06V20/58; G06T7/80; G06V10/143; G06V10/25; G06V10/80; G06V10/82
Other References:
HUAN YIN ET AL: "RaLL: End-to-end Radar Localization on Lidar Map Using Differentiable Measurement Model", arXiv.org, 6 March 2021, DOI: 10.1109/TITS.2021.3061165
DI KANG ET AL: "Incorporating Side Information by Adaptive Convolution", arXiv.org, 8 December 2017, pages 3867-3877, DOI: 10.1007/s11263-020-01345-8
JADERBERG MAX ET AL: "Spatial Transformer Networks", 5 June 2015, retrieved from the Internet
L. ZHANG ET AL: "Weakly Aligned Cross-Modal Learning for Multispectral Pedestrian Detection", 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 27 October 2019, pages 5126-5136, DOI: 10.1109/ICCV.2019.00523
S. HWANG ET AL: "Multispectral Pedestrian Detection: Benchmark Dataset and Baseline", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pages 1037-1045, DOI: 10.1109/CVPR.2015.7298706
K. HE ET AL: "Deep Residual Learning for Image Recognition", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pages 770-778, DOI: 10.1109/CVPR.2016.90
Y. SUN ET AL: "RTFNet: RGB-Thermal Fusion Network for Semantic Segmentation of Urban Scenes", IEEE Robotics and Automation Letters, vol. 4, no. 3, July 2019, pages 2576-2583, DOI: 10.1109/LRA.2019.2904733
Attorney, Agent or Firm:
JAUREGUI URBAHN, Kristian (DE)
Claims:
1. Computer-implemented method for automatic environmental perception based on multi-modal sensor data of a vehicle (1), wherein
- a first image (6) of a first environmental sensor modality (3) of the vehicle (1) and a second image (7) of a second environmental sensor modality (4) of the vehicle (1) are received;
- a first feature map is generated by applying at least one layer (11, 16) of a first branch (9) of a trained artificial neural network (8) to the first image (6) and a second feature map is generated by applying at least one layer (31, 36) of a second branch (10) of the neural network (8) to the second image (7);
characterized in that
- a transformed feature map is generated based on the second feature map using an affine transformation, which accounts for a deviation between extrinsic parameters of the first environmental sensor modality (3) and the second environmental sensor modality (4);
- a first fused feature map is generated by concatenating the first feature map with the transformed feature map; and
- at least one visual perception task is carried out by at least one decoder module (26) of the neural network (8) depending on the first fused feature map.

2. Computer-implemented method according to claim 1, characterized in that
- the at least one first layer comprises one or more first convolution layers (16) following a first rectification layer (11), which comprises a first set of distortion parameters of the first environmental sensor modality (3) and a first set of intrinsic parameters of the first environmental sensor modality (3); and/or
- the at least one second layer comprises one or more second convolution layers (36) following a second rectification layer (31), which comprises a second set of distortion parameters of the second environmental sensor modality (4) and a second set of intrinsic parameters of the second environmental sensor modality (4).

3. Computer-implemented method according to claim 2, characterized in that
- a further first rectification layer (12), which comprises the first set of distortion parameters and the first set of intrinsic parameters, is applied to the first image (6) to generate a scaled and rectified first image, which is scaled with respect to the first image (6) to match spatial dimensions of the first feature map;
- a second fused feature map is generated by concatenating the first fused feature map and the scaled and rectified first image; and
- the at least one visual perception task is carried out by the at least one decoder module (26) of the neural network (8) depending on the second fused feature map.

4. Computer-implemented method according to claim 3, characterized in that
- a further first feature map is generated by applying at least one further layer (17) of the first branch (9) to the second fused feature map; and
- the at least one visual perception task is carried out by the at least one decoder module (26) of the neural network (8) depending on the further first feature map.

5. Computer-implemented method according to claim 4, characterized in that the at least one further layer (17) of the first branch (9) comprises a residual network block.

6. Computer-implemented method according to one of claims 4 or 5, characterized in that
- a further second rectification layer (32), which comprises the second set of distortion parameters and the second set of intrinsic parameters, is applied to the second image (7) to generate a scaled and rectified second image, which is scaled with respect to the second image (7) to match spatial dimensions of the second feature map;
- a third fused feature map is generated by concatenating the second feature map and the scaled and rectified second image;
- a further second feature map is generated by applying at least one further layer (37) of the second branch (10) to the third fused feature map;
- a further transformed feature map is generated based on the further second feature map using the affine transformation;
- a fourth fused feature map is generated by concatenating the further first feature map with the further transformed feature map; and
- the at least one visual perception task is carried out by the at least one decoder module (26) depending on the fourth fused feature map.

7. Computer-implemented method according to one of the preceding claims, characterized in that
- the first image (6) is a visible range camera image; and/or
- the second image (7) is a thermal image or a lidar image or a radar image.

8. Computer-implemented method according to one of the preceding claims, characterized in that the at least one visual perception task comprises an object detection task and/or a semantic segmentation task and/or a depth estimation task.

9. Computer-implemented method according to one of the preceding claims, characterized in that the affine transformation comprises a rotation and/or a translation.

10. Computer-implemented training method for training an artificial neural network (8) for automatic environmental perception based on multi-modal sensor data, wherein
- a first training image of a first environmental sensor modality (3) and a second training image of a second environmental sensor modality (4) are received;
- a first feature map is generated by applying at least one layer of a first branch (9) of the neural network (8) to the first training image and a second feature map is generated by applying at least one layer of a second branch (10) of the neural network (8) to the second training image;
characterized in that
- a transformed feature map is generated based on the second feature map using an affine transformation, which accounts for a deviation between extrinsic parameters of the first environmental sensor modality (3) and the second environmental sensor modality (4), wherein the affine transformation depends on a set of transformation parameters;
- a first fused feature map is generated by concatenating the first feature map with the transformed feature map;
- output data (29) is generated by carrying out at least one visual perception task by at least one decoder module (26) of the neural network (8) depending on the first fused feature map; and
- at least one loss function is evaluated depending on the output data (29) and the neural network (8) is adapted based on a result of the evaluation.

11. Computer-implemented training method according to claim 10, characterized in that adapting the neural network (8) comprises adapting the set of transformation parameters.

12. Computer-implemented training method according to one of claims 10 or 11, characterized in that
- the at least one first layer comprises one or more first convolution layers (16) following a first rectification layer (11), which comprises a first set of distortion parameters of the first environmental sensor modality (3) and/or a first set of intrinsic parameters of the first environmental sensor modality (3), and adapting the neural network (8) comprises adapting the first set of distortion parameters and/or the first set of intrinsic parameters; and/or
- the at least one second layer comprises one or more second convolution layers (36) following a second rectification layer (31), which comprises a second set of distortion parameters of the second environmental sensor modality (4) and/or a second set of intrinsic parameters of the second environmental sensor modality (4), and adapting the neural network (8) comprises adapting the second set of distortion parameters and/or the second set of intrinsic parameters.

13. Computer-implemented training method according to one of claims 10 to 12, characterized in that
- further output data (30) is generated by carrying out at least one further visual perception task by at least one further decoder module (41) of the second branch (10) depending on the second feature map; and
- the at least one loss function is evaluated depending on the output data (29) and the further output data (30).

14. Automatic environmental perception system (2) for a vehicle (1), comprising at least one computing unit (5), which is adapted to carry out a computer-implemented method according to one of claims 1 to 9.

15. Computer program product comprising instructions which, when executed by at least one computing unit (5), cause the at least one computing unit (5) to carry out a computer-implemented method according to one of claims 1 to 9 and/or a computer-implemented training method according to one of claims 10 to 13.
Description:
Automatic environmental perception based on multi-modal sensor data of a vehicle

The present invention is directed to a computer-implemented method for automatic environmental perception based on multi-modal sensor data of a vehicle, wherein a first image of a first environmental sensor modality of the vehicle and a second image of a second environmental sensor modality of the vehicle are received, a first feature map is generated by applying at least one layer of a first branch of a trained artificial neural network to the first image and a second feature map is generated by applying at least one layer of a second branch of the neural network to the second image. The invention is further directed to a computer-implemented training method for training an artificial neural network for automatic environmental perception based on multi-modal sensor data, to an automatic environmental perception system for a vehicle and to a computer program product.

Cameras operating in the visible range, also denoted as visible range cameras, represent one of the major sensory modalities for realizing driver assistance functions in vehicles, in particular motor vehicles, and/or other functions for autonomous or semi-autonomous driving of a vehicle. In particular for functions according to level 3 and above, the processed data from the cameras is commonly fused with the data of radar systems or lidar systems. Existing multi-sensor systems are primarily suited to address the challenges of daylight conditions, and their performance is significantly reduced in low light scenarios. For example, severe accidents may happen at night time due to poor visibility, for example of pedestrians crossing the road or other objects on the road. The same holds to some extent for bad weather conditions. In practical applications, it has been found that camera-based systems already perform poorly before illuminance levels of 1 lx or lower are reached.

Thermal cameras could be used to supplement computer vision tasks. However, as pointed out for example in the publication L. Zhang et al.: "Weakly aligned cross-modal learning for multispectral pedestrian detection," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5126-5136, regarding the KAIST dataset of S. Hwang et al.: "Multispectral pedestrian detection: Benchmark dataset and baseline," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1037-1045, there is the problem of a position shift in corresponding image pairs of an RGB camera and a thermal camera.

It is noted that the position shift is not a specific problem of the KAIST dataset but may occur for vehicles in the field as well. Possible reasons for this shift are hardware ageing, a shift in the fields of view of the two modalities, different resolutions, issues in the alignment algorithm and so forth. As a consequence, when analyzing the different images by means of perception algorithms, in particular trained artificial neural networks, features from different modalities may be mismatched, which lowers the accuracy and reliability of the inference model. For example, shifts in the positions of bounding boxes for objects in the respective images may lead to inconsistencies and reduced accuracy and reliability. The consequences may be even more severe when it comes to semantic segmentation, where a pixel-wise classification is required.
Consequently, the involved neural networks would have to be retrained with every new camera setup, when changes in the camera placement on the vehicle occur or when intrinsic or extrinsic camera parameters change. Also, a new sensor setup itself requires time-consuming intrinsic and extrinsic calibration.

Images generated by means of lidar systems or radar systems or other environmental sensor modalities can also supplement the perception at low light conditions. Here, analogous position shift problems amongst the different environmental sensor modalities may occur.

The above mentioned publication of L. Zhang et al. proposes a region feature alignment approach to align the feature maps of different modalities. However, this approach does not consider any angular shift or rotation of the modalities with respect to each other. Moreover, it estimates only 2D translation shifts between RGB images and thermal images for pedestrian detection. For perception tasks on smaller scales, for example semantic segmentation, the required accuracy, in particular at pixel level, is not achieved.

It is an objective of the present invention to further reduce the effects of a position shift in images from two or more environmental sensor modalities of a vehicle for automatic environmental perception.

This objective is achieved by the respective subject matter of the independent claims. Further implementations and preferred embodiments are subject matter of the dependent claims.

The invention is based on the idea to make the used artificial neural network invariant with respect to at least the extrinsic parameters of the environmental sensor modalities. To this end, a visual perception task is carried out by a decoder module based on fused features from at least two environmental sensor modalities, wherein, for fusing the features, an affine transformation, which accounts for a deviation between the extrinsic parameters of the environmental sensor modalities, is used before the features are concatenated.

According to a first aspect of the invention, a computer-implemented method for automatic environmental perception based on multi-modal sensor data of a vehicle is provided. Therein, a first image of a first environmental sensor modality of the vehicle and a second image of a second environmental sensor modality of the vehicle are received. A first feature map is generated by applying at least one layer of a first branch of a trained artificial neural network to the first image and a second feature map is generated by applying at least one layer of a second branch of the neural network to the second image. A transformed feature map is generated based on the second feature map using an affine transformation, which accounts for a deviation between extrinsic parameters of the first environmental sensor modality and the second environmental sensor modality. A first fused feature map is generated by concatenating the first feature map with the transformed feature map. At least one visual perception task is carried out by at least one decoder module of the neural network, in particular of the first branch, depending on the first fused feature map.
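For illustration, the fused forward pass described above can be sketched as follows, assuming a PyTorch implementation; the module names, channel sizes and the single convolution stage per branch are illustrative choices rather than the architecture of the application.

```python
# Minimal sketch (PyTorch assumed): two encoder branches, a trainable 2D affine
# warp of the second branch's features, channel-wise concatenation and a decoder
# head. All names and sizes are illustrative, not the application's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AffineFeatureWarp(nn.Module):
    """Warps a feature map with a trainable 2D affine transform (rotation and translation)."""

    def __init__(self):
        super().__init__()
        # Initialized to the identity rotation and a null translation, as described above.
        self.theta = nn.Parameter(torch.tensor([[1.0, 0.0, 0.0],
                                                [0.0, 1.0, 0.0]]))

    def forward(self, feat):
        n = feat.shape[0]
        grid = F.affine_grid(self.theta.expand(n, -1, -1), feat.shape, align_corners=False)
        return F.grid_sample(feat, grid, align_corners=False)


class TwoBranchFusionNet(nn.Module):
    def __init__(self, c1=3, c2=1, c_feat=32, n_classes=10):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(c1, c_feat, 3, padding=1),
                                     nn.BatchNorm2d(c_feat), nn.ReLU())
        self.branch2 = nn.Sequential(nn.Conv2d(c2, c_feat, 3, padding=1),
                                     nn.BatchNorm2d(c_feat), nn.ReLU())
        self.warp = AffineFeatureWarp()
        # Decoder head producing a per-pixel class map (semantic segmentation example).
        self.decoder = nn.Conv2d(2 * c_feat, n_classes, 1)

    def forward(self, img1, img2):
        f1 = self.branch1(img1)               # first feature map
        f2 = self.branch2(img2)               # second feature map
        f2_t = self.warp(f2)                  # transformed feature map
        fused = torch.cat([f1, f2_t], dim=1)  # first fused feature map
        return self.decoder(fused)            # output of the visual perception task


# Usage: a three-channel visible range image and a single-channel thermal image of equal size.
net = TwoBranchFusionNet()
out = net(torch.rand(1, 3, 128, 256), torch.rand(1, 1, 128, 256))
```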
Unless stated otherwise, all steps of the computer-implemented method may be performed by at least one computing unit, in particular of the vehicle, which may also be denoted as a data processing apparatus. In particular, the at least one computing unit comprises at least one processing circuit, which is configured or adapted to perform the steps of the computer-implemented method. For this purpose, the at least one computing unit may for example store a computer program comprising instructions which, when executed by the at least one computing unit, cause the at least one computing unit to execute the computer-implemented method.

A visual perception task may for example be understood as a task for extracting visually perceivable information from image data. In particular, the visual perception task may in many cases, in principle, be carried out by a human, who is able to visually perceive an image corresponding to the image data. In the context of automatic visual perception, also denoted as computer vision, however, visual perception tasks are performed automatically without requiring the support of a human. Typical computer vision tasks include object detection, semantic segmentation, depth estimation, optical flow estimation, et cetera.

In the context of the present invention, the automatic visual perception is not necessarily or not exclusively carried out based on a visible range camera image or, in other words, an image generated by a camera which is sensitive to visible light. In other words, the first and the second environmental sensor modality are different types of environmental sensor modalities and are at least not both visible range cameras. For example, the environmental sensor modalities may be two different ones of a visible range camera, a thermal camera, a lidar system or a radar system. Accordingly, the expression "multi-modal sensor data" can be understood such that at least two different types of environmental sensor modalities are used to generate the first and the second image.

Even though infrared light or radio waves used by such environmental sensor modalities are, as such, not visible for humans, respective two-dimensional images, for example thermal images, lidar images or radar images, may be displayed or represented in a way perceivable for a human, for example as a monochromatic image or a false color image. Therefore, the same or analogous visual perception tasks as used for visible range camera images may also be applied to such images.

For example, the trained artificial neural network may be provided in a computer-readable way, for example stored on a storage medium of the vehicle, in particular of the at least one computing unit.

The at least one computing unit may receive the first image and the second image directly from the first and the second environmental sensor modality, respectively, for example as frames from respective data streams or image streams. However, the at least one computing unit may receive the first image and the second image also from a storage device in other implementations. Alternatively or in addition, the images may be pre-processed, for example by applying one or more filters or by upsampling or downsampling the resolutions.

The spatial dimensions of the first feature map may in general differ from the spatial dimensions of the first image, and the spatial dimensions of the second feature map may in general differ from the spatial dimensions of the second image. In general, the first and the second image may be considered as respective three-dimensional tensors of size H x W x C. Therein, H x W denotes the spatial size of the respective image, namely its height H and width W in terms of pixels of the image.
C is the channel dimension and may for example correspond to different color channels in case of a visible range camera image. However, C may also be equal to one for a single channel image, in particular a single channel thermal image, lidar image or radar image or a monochromatic visible range camera image. Preferably, H and W are identical for the first and the second image. To this end, the first image and/or the second image may be upsampled or downsampled, respectively, before the at least one layer of the first branch or of the second branch, respectively, is applied.

The first feature map is also characterized by a respective spatial size H' x W' and a channel number C', which may, however, differ from the size and channel number of the first image, respectively. In particular, the output of the at least one layer of the first branch may contain one or more single-channel or multi-channel feature maps, which are all considered to be contained in the first feature map or, in other words, make up the channels of the first feature map. The same holds analogously for the second feature map and the second image, respectively. In particular, H' and W' and, preferably, C' of the first feature map are identical to those of the second feature map.

The neural network, for example the first branch, may for example comprise a first fusion layer, which receives the first feature map and the transformed feature map as an input and concatenates them to generate the first fused feature map. The affine transformation may in this case be applied to the second feature map by a transformation layer of the neural network. Alternatively, the first fusion layer may receive the first feature map and the second feature map, apply the affine transformation to the second feature map and concatenate the first feature map and the transformed feature map.

In order to carry out the at least one visual perception task, the at least one decoder module may be applied to the first fused feature map, or the first fused feature map may be further processed by the neural network, in particular the first branch, before the at least one decoder module is applied. In particular, each of the at least one visual perception task is carried out by a respective decoder module of the at least one decoder module.

The extrinsic parameters of an environmental sensor modality are for example given by a set of parameters describing the pose or, in other words, the position and orientation of the environmental sensor modality or a component of the environmental sensor modality, for example a detector or sensor chip, in a reference coordinate system. The reference coordinate system is the same for all involved environmental sensor modalities, in particular for the first and the second environmental sensor modality. For example, the reference coordinate system may be a vehicle coordinate system of the vehicle, which is a coordinate system rigidly connected to the vehicle. The reference coordinate system may also be a sensor coordinate system of one of the environmental sensor modalities. The extrinsic parameters may for example comprise three spatial coordinates specifying a position in the reference coordinate system and three angles specifying the orientation in the reference coordinate system. The angles may for example be Euler angles, which are denoted as yaw, roll and pitch angle, respectively. However, also other conventions may be used.
The affine transformation may therefore comprise an affine rotation and an affine translation, in general, which map the pose of the second environmental sensor modality to the pose of the first environmental sensor modality in the reference coordinate system. The affine transformation therefore comprises, in general, six transformation parameters accounting for the translational shift between the poses and for the rotational deviation between the poses of the environmental sensor modalities.

The transformation parameters may in principle be obtained by means of a calibration, in particular an extrinsic calibration. However, preferably, the transformation parameters are trained parameters of the neural network. In other words, the neural network architecture is designed such that the transformation parameters are trainable and, when training the neural network, the transformation parameters are found.

Instead of fusing the first feature map directly with the second feature map, the computer-implemented method uses the affine transformation to generate the transformed feature map first. In other words, the features of the second feature map are re-projected to the correct location in feature space with respect to the first feature map before the fusion. This improves the accuracy of the at least one visual perception task, in particular the predictions in a semantic segmentation task.

According to several implementations of the computer-implemented method for automatic environmental perception, the at least one first layer comprises one or more first convolution layers following a first rectification layer. The first rectification layer comprises a first set of distortion parameters, in particular trained or learned distortion parameters, of the first environmental sensor modality and a first set of intrinsic parameters, in particular trained or learned intrinsic parameters, of the first environmental sensor modality.

Other than the extrinsic parameters, the intrinsic parameters of an environmental sensor modality, such as a visible range camera or a thermal camera, for example, specify internal properties, for example optical properties, of the environmental sensor modality rather than its pose. In particular, the intrinsic parameters define how a point in the three-dimensional environment is mapped to a pixel in the two-dimensional image plane or sensor plane of the environmental sensor modality. For example, in case of a visible range camera or a thermal camera, the intrinsic parameters may comprise coordinates of a center of projection, commonly denoted as cx, cy, for example, and focal lengths, commonly denoted as fx, fy, for example. The intrinsic parameters may for example be described in terms of a corresponding camera matrix.

The distortion parameters of an environmental sensor modality describe the distortion of a respective image generated by the environmental sensor modality, in particular compared to an undistorted two-dimensional representation of the environment. The distortion parameters define, in particular, how a pixel position (u', v') of the respective image is transformed to a corresponding pixel position (u, v) in an undistorted image, also denoted as rectified image. The distortion parameters may be given by a respective model for the environmental sensor modality, such as a pinhole camera model or a fisheye camera model, et cetera. The distortion parameters may be described in terms of a distortion function.
Knowing the distortion parameters and the intrinsic parameters, the first rectification layer may rectify the first image, and the rectified first image is then fed to the one or more first convolution layers for feature extraction. In this way, the accuracy of the at least one visual perception task is further improved.

The distortion parameters and the intrinsic parameters may also be obtained from a calibration, in particular an intrinsic calibration. However, preferably, the distortion parameters and the intrinsic parameters are trained parameters of the neural network. In other words, the neural network architecture is designed such that the distortion parameters and the intrinsic parameters are trainable and, when training the neural network, the distortion parameters and the intrinsic parameters are found.

According to several implementations of the computer-implemented method for automatic environmental perception, the at least one second layer comprises one or more second convolution layers following a second rectification layer. The second rectification layer comprises a second set of distortion parameters, in particular trained or learned distortion parameters, of the second environmental sensor modality and a second set of intrinsic parameters, in particular trained or learned intrinsic parameters, of the second environmental sensor modality. The explanations regarding the one or more first convolution layers and the first rectification layer carry over analogously to the one or more second convolution layers and the second rectification layer.

In preferred implementations, the at least one first layer comprises the one or more first convolution layers following the first rectification layer and the at least one second layer comprises the one or more second convolution layers following the second rectification layer. In this case, the accuracy may be improved even further, since it is ensured that the features of both the first and the second feature map are extracted from rectified images and therefore are fused consistently.
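A trainable rectification layer of this kind could, for instance, be realized by resampling the distorted input image on a regular, rectified (and possibly downscaled) pixel grid, with the intrinsic and distortion parameters held as ordinary network parameters so that they can be learned. The following sketch assumes PyTorch and a simple two-coefficient radial distortion model; the class name, the scaling scheme and the parameterization are illustrative assumptions.

```python
# Sketch of a trainable rectification layer (PyTorch assumed): intrinsic and
# distortion parameters are nn.Parameters, and the layer samples the distorted
# input image at the distorted counterparts of a regular rectified pixel grid.
# A two-coefficient radial model is used for brevity; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RectificationLayer(nn.Module):
    def __init__(self, fx, fy, cx, cy, out_size):
        super().__init__()
        self.intrinsics = nn.Parameter(torch.tensor([fx, fy, cx, cy], dtype=torch.float32))
        self.dist = nn.Parameter(torch.zeros(2))  # radial coefficients k1, k2 (sketch only)
        self.out_size = out_size                  # (H', W') of the rectified, scaled output

    def forward(self, img):
        n, _, h, w = img.shape
        fx, fy, cx, cy = self.intrinsics
        hh, ww = self.out_size
        # Regular pixel grid of the rectified, scaled output image.
        ys, xs = torch.meshgrid(torch.arange(hh, dtype=torch.float32),
                                torch.arange(ww, dtype=torch.float32), indexing="ij")
        # Normalized, undistorted camera coordinates of the output grid.
        x = (xs * (w / ww) - cx) / fx
        y = (ys * (h / hh) - cy) / fy
        r2 = x * x + y * y
        radial = 1.0 + self.dist[0] * r2 + self.dist[1] * r2 * r2
        # Corresponding distorted source pixel coordinates in the input image.
        u = fx * x * radial + cx
        v = fy * y * radial + cy
        # grid_sample expects sampling coordinates normalized to [-1, 1].
        grid = torch.stack((2.0 * u / (w - 1) - 1.0, 2.0 * v / (h - 1) - 1.0), dim=-1)
        grid = grid.unsqueeze(0).expand(n, -1, -1, -1)
        return F.grid_sample(img, grid, align_corners=True)


# Usage: rectify and downscale a 640 x 480 image to a 160 x 120 feature map resolution.
rectify = RectificationLayer(fx=500.0, fy=500.0, cx=320.0, cy=240.0, out_size=(120, 160))
rectified = rectify(torch.rand(1, 3, 480, 640))
```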
In other words, the 2022PF00807 9 second fused feature map may be considered to be made up of feature channels according to the first fused feature map and one or more image channels from the concatenated scaled and rectified first image. Therefore, the accuracy of the at least one visual perception task is further improved. According to several implementations, a further first feature map is generated by applying at least one further layer of the first branch to the second fused feature map. The at least one visual perception task is carried out by the at least one decoder module of the neural network depending on the further first feature map. The at least one further layer of the first branch comprises for example a residual block, for example a residual block following a pooling layer, in particular a maximum pooling layer. Residual blocks, also denoted as residual network blocks, are for example described in the publication of K. He et al.: "Deep residual learning for image recognition.", Proceedings of the IEEE conference on computer vision and pattern recognition, pp.770– 778, 2016. According to several implementations, a further second rectification layer of the neural network, in particular of the second branch, which comprises the second set of distortion parameters and the second set of intrinsic parameters, is applied to the second image to generate a scaled and rectified second image, which is scaled, in particular scaled down, with respect to the second image to match the spatial dimensions of the second feature map. A third fused feature map is generated by concatenating the second feature map and the scaled and rectified second image. A further second feature map is generated by applying at least one further layer of the second branch to the third fused feature map. A further transformed feature map is generated based on the further second feature map using the affine transformation. A fourth fused feature map is generated by concatenating the further first feature map with the further transformed feature map. The at least one visual perception task is carried out by the at least one decoder module depending on the fourth fused feature map. The at least one further layer of the second branch comprises for example a residual block, for example a residual block following a pooling layer, in particular a maximum pooling layer. 2022PF00807 10 The explanations regarding the second fused feature map, the first fused feature map and the scaled and rectified first image carry over analogously to the third fused feature map, the second feature map and the scaled and rectified second image. In particular, the third fused feature map contains, in addition to the features from the second feature map, the scaled and rectified image information directly. In other words, the third fused feature map may be considered to be made up of feature channels according to the second feature map and one or more image channels from the concatenated scaled and rectified second image. Therefore, the accuracy of the at least one visual perception task is further improved. The further transformed feature map is, in particular, generated depending on the affine transformation, the second set of distortion parameters and the second set of intrinsic parameters. 
Since, in general, the second set of distortion parameters and the second set of intrinsic parameters differ from the first set of distortion parameters and the first set of intrinsic parameters, respectively, the second set of distortion parameters and the second set of intrinsic parameters are also taken into account for generating the third fused feature map, in particular its image channels.

According to several implementations, the first image is a visible range camera image and the second image is a thermal image. In other words, the first environmental sensor modality is a visible range camera, for example an RGB camera, and the second environmental sensor modality is a thermal camera.

The thermal camera may also be denoted as thermal imaging camera or thermographic camera or infrared camera. In particular, it contains an infrared detector or infrared-sensitive imager, which is sensitive to infrared radiation, which may also be denoted as infrared light. For example, the infrared detector or the imager may be sensitive to wavelengths in the range of 750 nm to 15 µm or in a subrange of this range.

Consequently, the accuracy and reliability of the at least one visual perception task is particularly improved for low light scenarios or adverse weather conditions such as snow, fog or rain. Under such conditions, the capability of detecting vulnerable road users such as pedestrians, animals and cyclists is improved.

In alternative implementations, the second image may be a lidar image, for example a flash lidar image, or a radar image.

According to several implementations, the at least one visual perception task comprises a semantic segmentation task. A result of the semantic segmentation task comprises a semantically segmented image. In the semantically segmented image, a respective pixel level object class, such as dynamic object, static object, road surface, lane marking, et cetera, is for example assigned to each pixel of the first image.

According to several implementations, the at least one visual perception task comprises an object detection task and/or a depth estimation task.

The result of the depth estimation task comprises, in particular, a depth map, which assigns a respective depth value to each pixel of the first image. Alternatively, a depth value may be assigned to predefined groups of pixels of the first image.

The result of the object detection task comprises position information for one or more bounding boxes for respective objects in the environment of the vehicle and a respective object class assigned to the object or the bounding box, respectively. Alternatively, a respective object confidence value may be assigned to each of a plurality of predefined object classes for each of the objects or each of the bounding boxes, respectively.

The bounding boxes may for example be rectangular bounding boxes. However, also other geometric figures may be used. For example, in case of a rectangular bounding box, its position may be given by a center position of the rectangle or a corner position of the rectangle or another defined position of the rectangle. In this case, the size of the bounding box may be given by a width and/or height of the rectangle or by equivalent quantities.

According to a further aspect of the invention, a computer-implemented training method for training an artificial neural network for automatic environmental perception based on multi-modal sensor data, in particular of a vehicle, is provided.
Therein, a first training image of a first environmental sensor modality and a second training image of a second environmental sensor modality are received. A first feature map is generated by applying at least one layer of a first branch of the neural network to the first training image and a second feature map is generated by applying at least one layer of a second branch of the neural network to the second training image. A transformed feature map is generated based on the second feature map using an affine transformation, which accounts for a deviation between extrinsic parameters of the first environmental sensor modality and the second environmental sensor modality, wherein the affine transformation depends on a set of transformation parameters. A first fused feature map is generated by concatenating the first feature map with the transformed feature map. Output data is generated by carrying out at least one visual perception task by at least one decoder module of the neural network depending on the first fused feature map. At least one loss function is evaluated depending on the output data, and the neural network is adapted based on the result of the evaluation, wherein adapting the neural network comprises adapting the set of transformation parameters.

For example, a neural network trained by using a computer-implemented training method according to the invention may be used in the various implementations of the computer-implemented method for automatic environmental perception according to the invention. The computer-implemented training method may then also be considered as a part of the computer-implemented method for automatic environmental perception.

In particular, the at least one loss function is evaluated depending on the output data and respective annotations of the first and the second training images. The output data as well as the annotations depend on the type of the at least one visual perception task, as described above with respect to semantic segmentation, depth estimation and object detection, for example.

Adapting the neural network comprises, in particular, adapting weighting factors and bias factors of the neural network, for example using backpropagation or another known algorithm. In preferred implementations, adapting the neural network comprises adapting the set of transformation parameters.

According to several implementations of the computer-implemented training method, further output data is generated by carrying out at least one further visual perception task by at least one further decoder module of the second branch depending on the second feature map, and the at least one loss function is evaluated depending on the output data and the further output data. Therein, the at least one further decoder module uses the second feature map or the second feature map after further processing, but, in particular, neither the transformed feature map nor the first feature map nor the first fused feature map.

Carrying out the at least one further visual perception task during training also improves the performance of the at least one visual perception task carried out by the first branch. When the training is completed and the trained neural network is used, for example in a computer-implemented method for automatic environmental perception according to the invention, the at least one further visual perception task and the at least one further decoder module are not required anymore.
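As a sketch of the training idea, assuming a PyTorch network whose forward pass returns both the output of the main decoder module and the output of the auxiliary decoder module of the second branch, a single optimization step could look as follows; because the transformation, intrinsic and distortion parameters are registered as ordinary trainable parameters in such a setup, they receive gradients like any other weight. The loss choice and all names are illustrative assumptions.

```python
# Sketch of one training step (PyTorch assumed): both decoder outputs contribute to
# the loss, and backpropagation also updates the affine transformation, intrinsic
# and distortion parameters registered as nn.Parameters. Names are illustrative.
import torch.nn as nn


def train_step(net, optimizer, img1, img2, target_main, target_aux):
    """One optimization step with a main and an auxiliary segmentation loss.

    target_main and target_aux are per-pixel class index maps (annotations) for the
    first and the second training image, respectively.
    """
    criterion = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    out_main, out_aux = net(img1, img2)  # decoder module output and auxiliary decoder output
    loss = criterion(out_main, target_main) + criterion(out_aux, target_aux)
    loss.backward()                      # gradients also reach the calibration-related parameters
    optimizer.step()
    return loss.item()
```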
Further implementations of the computer-implemented training method according to the invention follow directly from the various embodiments of the computer-implemented method for automatic environmental perception according to the invention and vice versa. In particular, individual features and corresponding explanations as well as advantages relating to the various implementations of the computer-implemented method for automatic environmental perception may be transferred analogously to corresponding implementations of the computer-implemented training method according to the invention.

According to a further aspect of the invention, a method for guiding a vehicle, in particular a motor vehicle, at least in part automatically, is provided. The method comprises carrying out a computer-implemented method for automatic environmental perception according to the invention. The method further comprises generating at least one control signal for guiding the vehicle at least in part automatically depending on a result of the at least one visual perception task.

The at least one control signal may for example be provided to one or more actuators of the vehicle, which affect or carry out a lateral and/or longitudinal control of the vehicle automatically or in part automatically.

For use cases or use situations which may arise in the method and which are not explicitly described here, it may be provided that, in accordance with the method, an error message and/or a prompt for user feedback is output and/or a default setting and/or a predetermined initial state is set.

According to a further aspect of the invention, an automatic environmental perception system is provided. The automatic environmental perception system comprises at least one computing unit, which is configured to carry out a computer-implemented method for automatic environmental perception according to the invention and/or a computer-implemented training method according to the invention.

According to a further aspect of the invention, an electronic vehicle guidance system for a vehicle is provided. The electronic vehicle guidance system comprises an automatic environmental perception system according to the invention. The at least one computing unit is configured to generate at least one control signal for guiding the vehicle at least in part automatically depending on a result of the at least one visual perception task.

In some implementations, the electronic vehicle guidance system comprises a first environmental sensor modality for the vehicle, which is configured to generate the first image, and a second environmental sensor modality for the vehicle, which is configured to generate the second image.

An electronic vehicle guidance system may be understood as an electronic system configured to guide a vehicle in a fully automated or a fully autonomous manner and, in particular, without a manual intervention or control by a driver or user of the vehicle being necessary. The vehicle carries out all required functions, such as steering maneuvers, deceleration maneuvers and/or acceleration maneuvers as well as monitoring and recording the road traffic and corresponding reactions, automatically. In particular, the electronic vehicle guidance system may implement a fully automatic or fully autonomous driving mode according to level 5 of the SAE J3016 classification.
An electronic vehicle guidance system may also be implemented as an advanced driver assistance system, ADAS, assisting a driver for partially automatic or partially autonomous driving. In particular, the electronic vehicle guidance system may implement a partly automatic or partly autonomous driving mode according to levels 1 to 4 of the SAE J3016 classification. Here and in the following, SAE J3016 refers to the respective standard dated June 2018.

Guiding the vehicle at least in part automatically may therefore comprise guiding the vehicle according to a fully automatic or fully autonomous driving mode according to level 5 of the SAE J3016 classification. Guiding the vehicle at least in part automatically may also comprise guiding the vehicle according to a partly automatic or partly autonomous driving mode according to levels 1 to 4 of the SAE J3016 classification.

A computing unit may in particular be understood as a data processing device, which comprises processing circuitry. The computing unit can therefore in particular process data to perform computing operations. This may also include operations to perform indexed accesses to a data structure, for example a look-up table, LUT.

In particular, the computing unit may include one or more computers, one or more microcontrollers, and/or one or more integrated circuits, for example, one or more application-specific integrated circuits, ASIC, one or more field-programmable gate arrays, FPGA, and/or one or more systems on a chip, SoC. The computing unit may also include one or more processors, for example one or more microprocessors, one or more central processing units, CPU, one or more graphics processing units, GPU, and/or one or more signal processors, in particular one or more digital signal processors, DSP. The computing unit may also include a physical or a virtual cluster of computers or other of said units.

In various embodiments, the computing unit includes one or more hardware and/or software interfaces and/or one or more memory units.

A memory unit may be implemented as a volatile data memory, for example a dynamic random access memory, DRAM, or a static random access memory, SRAM, or as a non-volatile data memory, for example a read-only memory, ROM, a programmable read-only memory, PROM, an erasable programmable read-only memory, EPROM, an electrically erasable programmable read-only memory, EEPROM, a flash memory or flash EEPROM, a ferroelectric random access memory, FRAM, a magnetoresistive random access memory, MRAM, or a phase-change random access memory, PCRAM.

According to a further aspect of the invention, a vehicle, in particular a motor vehicle, comprising an automatic environmental perception system and/or an electronic vehicle guidance system according to the invention is provided. Therein, the first and the second environmental sensor modality are mounted to the vehicle.

According to a further aspect of the invention, a computer program comprising instructions is provided. When the instructions are executed by at least one computing unit, the instructions cause the at least one computing unit to carry out a computer-implemented method for automatic environmental perception according to the invention and/or a computer-implemented training method according to the invention.

According to a further aspect of the invention, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program according to the invention.
The computer program and the computer-readable storage medium may be denoted as respective computer program products comprising the instructions.

Further features of the invention are apparent from the claims, the figures and the figure description. The features and combinations of features mentioned above in the description as well as the features and combinations of features mentioned below in the description of figures and/or shown in the figures may be comprised by the invention not only in the respective combination stated, but also in other combinations. In particular, embodiments and combinations of features which do not have all the features of an originally formulated claim may also be comprised by the invention. Moreover, embodiments and combinations of features which go beyond or deviate from the combinations of features set forth in the recitations of the claims may be comprised by the invention.

In the following, the invention will be explained in detail with reference to specific exemplary implementations and respective schematic drawings. In the drawings, identical or functionally identical elements may be denoted by the same reference signs. The description of identical or functionally identical elements is not necessarily repeated with respect to different figures.

In the figures,

Fig. 1 shows schematically a vehicle with an exemplary implementation of an automatic environmental perception system according to the invention;

Fig. 2a shows schematically a visible range camera image;

Fig. 2b shows a detail of the visible range camera image of Fig. 2a;

Fig. 3a shows schematically a thermal image;

Fig. 3b shows a detail of the thermal image of Fig. 3a;

Fig. 4 shows schematically a neural network for use in an exemplary implementation of a computer-implemented method for automatic environmental perception according to the invention; and

Fig. 5 shows schematically a further neural network for use in a further exemplary implementation of a computer-implemented method for automatic environmental perception according to the invention.

Fig. 1 shows a vehicle 1 with an exemplary implementation of an automatic environmental perception system 2 according to the invention.

The automatic environmental perception system 2 comprises a computing unit 5, which can be considered representative of one or more computing units of the vehicle 1. The vehicle 1, for example the automatic environmental perception system 2, comprises a first environmental sensor modality 3, for example a visible range camera, which is mounted at the vehicle 1 such that a field of view of the first environmental sensor modality 3 covers a part of an outer environment of the vehicle 1. The vehicle 1, for example the automatic environmental perception system 2, comprises a second environmental sensor modality 4, for example a thermal camera, which is mounted at the vehicle 1 such that a field of view of the second environmental sensor modality 4 covers the part of the outer environment of the vehicle 1. In particular, the fields of view of the environmental sensor modalities 3, 4 overlap at least partially.

The first environmental sensor modality 3 is configured to generate a first image 6 (see Fig. 2a), for example a visible range image, which represents the environment of the vehicle 1 as covered by the field of view of the first environmental sensor modality 3.
The second environmental sensor modality 4 is configured to generate a second image 7 (see Fig. 3a), for example a thermal image, which represents the environment of the vehicle 1 as covered by the field of view of the second environmental sensor modality 4. The computing unit 5 may receive the first and the second image 6, 7 and carry out a computer-implemented method for automatic environmental perception according to the invention.

The axes in Fig. 2a and Fig. 3a denote the pixel positions in arbitrary units. Fig. 2b shows a detail of Fig. 2a and Fig. 3b shows the corresponding detail of Fig. 3a. One can see that there is a shift of at least several pixels in both directions between the first and the second image 6, 7. This is due to different intrinsic and extrinsic parameters and/or a time delay between both environmental sensor modalities 3, 4. The effect of such shifts on automatic environmental perception based on multi-modal sensor data is reduced by the invention.

To carry out the computer-implemented method for automatic environmental perception, the computing unit 5 uses a trained artificial neural network 8. An example of such a neural network 8 during the training phase is shown schematically in Fig. 4.

The computing unit 5 receives the first image 6 and the second image 7 and generates a first feature map by applying at least one layer 11, 16 of a first branch 9 of the neural network 8 to the first image 6 and a second feature map by applying at least one layer 31, 36 of a second branch 10 of the neural network 8 to the second image 7. The computing unit 5 generates a transformed feature map based on the second feature map using an affine transformation, which accounts for a deviation in extrinsic parameters of the first environmental sensor modality 3 and the second environmental sensor modality 4. The computing unit 5 generates a first fused feature map by concatenating the first feature map with the transformed feature map. The computing unit 5 carries out a visual perception task, for example a semantic segmentation task, by using a decoder module 26 of the neural network 8 depending on the first fused feature map.

The computing unit 5 may for example generate control signals for affecting a longitudinal and/or lateral control, in other words the driving direction and/or speed, of the vehicle 1, depending on a result of the visual perception task, for example depending on a semantically segmented image 29. The control signals are for example transmitted to respective actuators (not shown) of the vehicle 1.

The at least one layer 11, 16 of the first branch 9 comprises for example a first rectification layer 11 and a first convolution block 16, which contains one or more convolution layers followed by a batch normalization layer and an activation layer, for example according to a ReLU function. The first convolution block 16 is followed by a fusion layer 21, which generates the transformed feature map and the first fused feature map. The first rectification layer 11 comprises a first set of distortion parameters of the first environmental sensor modality 3 and a first set of intrinsic parameters of the first environmental sensor modality 3. The first rectification layer 11 rectifies the first image 6, and the rectified first image is then fed to the first convolution block 16 for feature extraction.
Analogously, the at least one layer 31, 36 of the second branch 10 comprises for example a second rectification layer 31 and a second convolution block 36, which contains one or more convolution layers followed by a batch normalization layer and an activation layer, for example according to a ReLU function. The second rectification layer 31 comprises a second set of distortion parameters of the second environmental sensor modality 4 and a second set of intrinsic parameters of the second environmental sensor modality 4. The second rectification layer 31 rectifies the second image 7, and the rectified second image is then fed to the second convolution block 36 for feature extraction.

A further first rectification layer 12 of the first branch 9, which comprises the first set of distortion parameters and the first set of intrinsic parameters, is applied to the first image 6 to generate a scaled and rectified first image, which is scaled with respect to the first image 6 to match spatial dimensions of the first feature map. A second fused feature map is generated by concatenating the first fused feature map and the scaled and rectified first image.

Analogously, a further second rectification layer 32 of the second branch 10, which comprises the second set of distortion parameters and the second set of intrinsic parameters, may be applied to the second image 7 to generate a scaled and rectified second image, which is scaled with respect to the second image 7 to match spatial dimensions of the second feature map. A third fused feature map is generated by concatenating the second feature map and the scaled and rectified second image.

For example, a further first feature map is generated by applying at least one further layer 17 of the first branch 9 to the second fused feature map. The at least one further layer 17 of the first branch 9 comprises a residual network block and, for example, a maximum pooling layer, wherein the residual network block follows the maximum pooling layer. Analogously, a further second feature map may be generated by applying at least one further layer 37 of the second branch 10 to the third fused feature map. The at least one further layer 37 of the second branch 10 comprises a residual network block and, for example, a maximum pooling layer, wherein the residual network block follows the maximum pooling layer.

A further transformed feature map is generated based on the further second feature map using the affine transformation, for example by a further fusion layer 22 of the first branch 9 following the residual network block of the at least one further layer 17 of the first branch 9. The further fusion layer 22 may generate a fourth fused feature map by concatenating the further first feature map with the further transformed feature map.

In some implementations, the first branch 9 comprises further first sections and the second branch 10 comprises further second sections. Each first section contains a residual block 18, 19, 20 followed by a respective fusion layer 23, 24, 25, and each second section contains a respective residual block 38, 39, 40. The output feature maps of the residual blocks 38, 39, 40 of the second branch 10 are transformed according to the affine transformation, the second set of intrinsic parameters and the second set of distortion parameters, and are then concatenated with the output of the corresponding residual block 18, 19, 20 of the first branch 9.
Furthermore, each first section may comprise a further rectification layer 13, 14, 15, which scales and rectifies the first image, and the result is then concatenated to the output of the preceding fusion layer 22, 23, 24. Analogously, each second section may comprise a further rectification layer 33, 34, 35, which scales and rectifies the second image, and the result is then concatenated to the output of the preceding residual block 37, 38, 39.

The first branch 9 comprises a first decoder module 26 followed by a first softmax layer 28, which consume the output of the last of the fusion layers 21, 22, 23, 24, 25 and generate an output according to the visual perception task, for example the semantically segmented image 29. Analogously, the second branch 10 comprises a second decoder module 41 followed by a second softmax layer 42, which consume the output of the last of the residual blocks 37, 38, 39, 40 of the second branch 10 and generate an output according to a further visual perception task, for example a further semantically segmented image 30.

For training the neural network 8, a loss function may be evaluated depending on the outputs of the visual perception task and the further visual perception task, in particular the semantically segmented images 29, 30, and on corresponding annotations of the first image 6 and the second image 7. The neural network 8 is then adapted or trained depending on the result of the evaluation. Therein, in particular the transformation parameters of the affine transformation, the first and second set of intrinsic parameters and the first and second set of distortion parameters may be treated as trainable parameters and adapted accordingly. After the training, the second decoder module 41 is not required anymore.

The first and the second decoder module 26, 41 may for example contain one or more respective decoder layers 27. The decoder layers may for example comprise two consecutive upception blocks 27a, 27b.
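Purely for illustration, a single optimization step of the training phase described above may be sketched as follows. This sketch assumes PyTorch, assumes that the network returns the outputs of both decoder heads, and assumes that the annotations are class-index maps; the weighted calibration loss term discussed further below is omitted here.

```python
import torch.nn as nn

# Standard per-pixel classification loss for semantic segmentation.
criterion = nn.CrossEntropyLoss()


def training_step(network, optimizer, image_1, image_2, ann_1, ann_2):
    # Outputs of the first and second decoder head (modules 26 and 41).
    seg_1, seg_2 = network(image_1, image_2)
    loss = criterion(seg_1, ann_1) + criterion(seg_2, ann_2)
    optimizer.zero_grad()
    loss.backward()   # gradients also reach R, t and the intrinsic/distortion parameters
    optimizer.step()
    return loss.item()
```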
In the fusion layers 21, 22, 23, 24, 25, instead of concatenating the respective feature maps directly, the affine transformation is used, such that

$$F(x_1, y_1) = F_1(x_1, y_1) \oplus F_2(x_2, y_2), \qquad \begin{pmatrix} x_2 \\ y_2 \end{pmatrix} = R \begin{pmatrix} x_1 \\ y_1 \end{pmatrix} + t.$$

Therein, $F_1$ and $F_2$ denote the respective feature maps in the first and the second branch 9, 10, respectively, $(x_1, y_1)$ denotes the pixel coordinates according to the first image 6 or the resulting feature map $F_1$, $(x_2, y_2)$ denotes the pixel coordinates according to the second image 7 or the resulting feature map $F_2$, and $F$ denotes the result of the concatenation operation, which is represented by the operator $\oplus$. $R$ denotes an affine rotation and $t$ denotes an affine translation, which contain the trainable transformation parameters. For initialization, an identity operation may be used for $R$ and a null translation for $t$.

The transformation between the two image planes may not be modelled exactly with 2D affine transformations alone. For example, pixels at the center of the images 6, 7 may obey the above 2D affine transformation, but pixels at the borders may not, because of different distortions at the image borders. Consequently, the intrinsic parameters and the distortion parameters may be included to reduce the approximation errors.

Therefore, in the above equation, $F_1$ and $F_2$ are considered to represent the undistorted or rectified feature maps or, in other words, feature maps resulting from the undistorted first and second image, respectively, while $(x_1, y_1)$ and $(x_2, y_2)$ are the rectified pixel coordinates. In order to include the intrinsic parameters and the distortion parameters, the equation is modified to read

$$F(x_1, y_1) = F_1^{dis}(x_1^d, y_1^d) \oplus F_2^{dis}(x_2^d, y_2^d),$$

wherein $F_1^{dis}$ denotes the distorted version of $F_1$, $F_2^{dis}$ denotes the distorted version of $F_2$, and $(x_1^d, y_1^d) = d_1(x_1, y_1)$ and $(x_2^d, y_2^d) = d_2(x_2, y_2)$ denote the corresponding distorted pixel coordinates, with $d_1$ and $d_2$ being the distortion functions for the first and second environmental sensor modality 3, 4, respectively.

Since the first and the second image 6, 7 are distorted images, one is able to pick the corresponding pixel value by distorting the pixel from $(x_2, y_2)$, the distorted pixel coordinates with respect to a new and fixed set of intrinsic parameters $K_2'$, to $(x_2^d, y_2^d)$, the distorted pixel coordinates with respect to the old and variable set of intrinsic parameters $K_2$. This step therefore avoids a preprocessing step to undistort the images. Instead, one can include the intrinsic parameters and the distortion parameters as trainable parameters of the neural network 8.

Assuming for example a pinhole camera model with radial and tangential distortion for the first and second environmental sensor modality 3, 4, the reprojection from undistorted image pixels to distorted image pixels may be described according to the following relations:

$$x_i^d = x_i \left(1 + k_{i,1} r_i^2 + k_{i,2} r_i^4 + k_{i,3} r_i^6\right) + 2 p_{i,1} x_i y_i + p_{i,2} \left(r_i^2 + 2 x_i^2\right),$$
$$y_i^d = y_i \left(1 + k_{i,1} r_i^2 + k_{i,2} r_i^4 + k_{i,3} r_i^6\right) + p_{i,1} \left(r_i^2 + 2 y_i^2\right) + 2 p_{i,2} x_i y_i,$$

with

$$x_i = \frac{u_i - c_{x,i}}{f_{x,i}}, \quad y_i = \frac{v_i - c_{y,i}}{f_{y,i}}, \quad r_i^2 = x_i^2 + y_i^2,$$
$$u_i^d = f_{x,i} \, x_i^d + c_{x,i}, \quad v_i^d = f_{y,i} \, y_i^d + c_{y,i}, \quad i \in \{1, 2\},$$

wherein $k_i = (k_{i,1}, k_{i,2}, k_{i,3})$ and $p_i = (p_{i,1}, p_{i,2})$ denote the radial and tangential distortion coefficients and $K_i = (f_{x,i}, f_{y,i}, c_{x,i}, c_{y,i})$ denotes the intrinsic parameters of the respective environmental sensor modality. This type of procedure can be applied to any other camera model or sensor model, for example for fisheye cameras. During the training phase, the original sets of intrinsic parameters $K_1$, $K_2$ as well as the sets of distortion parameters $k_1$, $k_2$, $p_1$, $p_2$ may be learned.

It is noted that the above relations apply for the fusion of image channels of the respective feature maps. For the fusion of feature channels, the relation simplifies to the purely affine form

$$F(x_1, y_1) = F_1(x_1, y_1) \oplus F_2(x_2, y_2).$$

The loss function for the training, which is based on the loss defined in the publication Y. Sun et al.: "RTFNet: RGB-Thermal Fusion Network for Semantic Segmentation of Urban Scenes," IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 2576-2583, July 2019, includes a weighted loss term accounting for the deviation in the intrinsic parameters and distortion parameters. This avoids overfitting of the intrinsic parameters and distortion parameters.

To learn the intrinsic parameters, the distortion parameters and the affine transformation more efficiently, one may pre-calibrate the environmental sensor modalities 3, 4. Around these pre-calibrated values, one can apply deviations of the intrinsic parameters and distortion parameters to achieve data augmentation of the training data. The training images can be transformed according to the known deviations, and the pairs of transformed images and corresponding calibration values with the known deviations are provided during the training process.
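Purely for illustration, the reprojection relations given above may be implemented as a small function as follows. This is a sketch assuming the widely used three-coefficient radial and two-coefficient tangential parameterization; the function name distort_pixel and the argument names are illustrative only.

```python
def distort_pixel(u, v, fx, fy, cx, cy, k1, k2, k3, p1, p2):
    """Map an undistorted pixel (u, v) to its distorted pixel coordinates."""
    # Normalized, undistorted coordinates.
    x = (u - cx) / fx
    y = (v - cy) / fy
    r2 = x * x + y * y
    # Radial distortion factor and tangential distortion terms.
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    # Back to (distorted) pixel coordinates.
    return fx * x_d + cx, fy * y_d + cy
```

Since the function consists only of elementwise arithmetic, it remains differentiable with respect to the intrinsic and distortion parameters when evaluated on tensors, which is what allows these parameters to be treated as trainable.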
The invention may also be used for more than two environmental sensor modalities 3, 4. In the example shown in Fig. 5, a third environmental sensor modality, for example a lidar system or radar system, may generate a third image 43, which is fed to a third branch 56 of the neural network 8. The third branch 56 is designed analogously to the second branch 10, whose details are not shown in Fig. 5, and is coupled to the first branch 9 in a manner analogous to that described for the second branch 10.

Thus, the third branch 56 comprises a third rectification layer 44 followed by a third convolution block 49, which contains one or more convolution layers followed by a batch normalization layer and an activation layer, for example according to a ReLU function. The third rectification layer 44 comprises a third set of distortion parameters of the third environmental sensor modality and a third set of intrinsic parameters of the third environmental sensor modality. The third rectification layer 44 rectifies the third image 43, and the rectified third image is then fed to the third convolution block 49, which generates a third feature map.

The fusion layer 21 of the first branch 9 transforms the third feature map using a further affine transformation, which accounts for a deviation in extrinsic parameters of the first environmental sensor modality 3 and the third environmental sensor modality, and generates the first fused feature map by concatenating the first feature map with the transformed second feature map and the transformed third feature map.

A further third rectification layer 45 of the third branch 56, which comprises the third set of distortion parameters and the third set of intrinsic parameters, may be applied to the third image 43 to generate a scaled and rectified third image, which is scaled with respect to the third image 43 to match the spatial dimensions of the third feature map. A fifth fused feature map is generated by concatenating the third feature map and the scaled and rectified third image. A further third feature map may be generated by applying at least one further layer 50 of the third branch 56 to the fifth fused feature map. The at least one further layer 50 of the third branch 56 comprises a residual network block and, for example, a maximum pooling layer, wherein the residual network block follows the maximum pooling layer.

The explanations with respect to the residual blocks 38, 39, 40 of the second branch 10 may be carried over analogously to respective residual blocks 51, 52, 53 of the third branch 56. The explanations with respect to the further rectification layers 33, 34, 35 of the second branch 10 may be carried over analogously to respective further rectification layers 46, 47, 48 of the third branch 56. The explanations with respect to the second decoder module 41, the second softmax layer 42 and the further semantically segmented image 30 may be carried over analogously to a third decoder module 54 of the third branch 56, a third softmax layer 57 of the third branch 56 and a further semantically segmented image 55 generated by the third branch 56.

The invention may, in some implementations, therefore incorporate for example a visible range camera as the first environmental sensor modality 3, a thermal camera as the second environmental sensor modality 4, and a lidar system or radar system as the third environmental sensor modality. The proposed network can learn efficiently irrespective of the intrinsic and extrinsic parameters of the environmental sensor modalities.
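For illustration, the fusion layer sketched earlier may be generalized to an arbitrary number of additional branches, with one trainable affine transform per additional modality. This is a sketch assuming PyTorch; the class name MultiModalFusionLayer is illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiModalFusionLayer(nn.Module):
    """Concatenates the first-branch feature map with the affinely transformed
    feature maps of all additional branches (e.g. thermal, lidar or radar)."""

    def __init__(self, num_additional_branches: int):
        super().__init__()
        identity = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
        # One affine transform (rotation + translation) per additional branch,
        # each initialised as the identity.
        self.thetas = nn.Parameter(identity.repeat(num_additional_branches, 1, 1))

    def forward(self, feat_first, other_feats):
        n = feat_first.shape[0]
        fused = [feat_first]
        for theta, feat in zip(self.thetas, other_feats):
            grid = F.affine_grid(theta.expand(n, -1, -1),
                                 size=feat_first.shape, align_corners=False)
            fused.append(F.grid_sample(feat, grid, align_corners=False))
        return torch.cat(fused, dim=1)
```

In the two-modality case, this sketch reduces to the AffineFusionLayer given earlier.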