


Title:
"POINT CLOUD PROCESSING"
Document Type and Number:
WIPO Patent Application WO/2024/082004
Kind Code:
A1
Abstract:
This disclosure relates to machine learning in relation to point clouds. A sensor system comprises a first sensor of a first modality to capture first sensor data, where the first sensor data has a strong link between the sensor data and the ground truth. The sensor system also comprises a second sensor of a second modality to capture second sensor data, where the second sensor data has a weak link between the sensor data and the ground truth. The sensor system further comprises a machine learning processor configured to train a machine learning model by using the second sensor data as input and reducing an error between an output and an estimation of the ground truth calculated based on the first sensor data. The machine learning processor is further configured to evaluate the trained machine learning model using further second sensor data as input.

Inventors:
WANG ZIWEI (AU)
LIU JIAJUN (AU)
Application Number:
PCT/AU2023/051022
Publication Date:
April 25, 2024
Filing Date:
October 17, 2023
Assignee:
COMMW SCIENT IND RES ORG (AU)
International Classes:
G06N3/0895; G01S17/06; G01S17/894; G06T7/521; G06V10/774; G06V10/80
Attorney, Agent or Firm:
FB RICE PTY LTD (AU)
Claims:
CLAIMS:

1. A method for multi-modal training of a machine learning model to reduce an error between an output of the machine learning model and a ground truth, the method comprising: receiving first sensor data captured by a first sensor of a first modality, the first sensor data having a strong link between the sensor data and the ground truth; receiving second sensor data captured by a second sensor of a second modality, the second sensor data having a weak link between the sensor data and the ground truth; training the machine learning model by using the second sensor data as input to the machine learning model and reducing an error between the output of the machine learning model and an estimation of the ground truth calculated based on the first sensor data; and evaluating the trained machine learning model using further second sensor data as input to the trained machine learning model.

2. The method of claim 1, wherein the first sensor data comprises point cloud data.

3. The method of claim 1 or 2, wherein the second sensor data comprises image data.

4. The method of any one of the preceding claims, wherein the second sensor data comprises movement data indicative of acceleration of the second sensor.

5. The method of any one of the preceding claims, wherein the first sensor data and the second sensor data represent a scene comprising a moving object and a static object.

6. The method of claim 5, wherein the method further comprises determining whether a condition on a spatial relationship between the moving object and the static object is met.

7. The method of claim 5 or 6, wherein the moving object is an animal and the static object is a drinking trough for the animal.

8. The method of any one of the preceding claims, wherein the output of the machine learning model relates to hand washing and is indicative of a moving hand being in proximity to a static sink.

9. The method of any one of the preceding claims, wherein the output of the machine learning model relates to knife sanitation and is indicative of a moving knife being inserted into a static knife steriliser.

10. The method of any one of the preceding claims, wherein the output of the machine learning model relates to animal feeding and the machine learning output is indicative of a moving animal being in proximity of a static feeding trough.

11. The method of any one of the preceding claims, wherein the output of the machine learning model relates to machine operations and the machine learning output is indicative of a moving operator being in proximity to a static machine.

12. The method of any one of the preceding claims, wherein the machine learning model is configured to perform one or more of: classification; detection; or segmentation.

13. A method for training a machine learning model for a point cloud, the method comprising: creating a three-dimensional model of an object in computer memory, the object being associated with object information; rendering a three-dimensional point cloud of the three-dimensional model; repeatedly rotating the three-dimensional model to render multiple three-dimensional point clouds; using the multiple three-dimensional point clouds as an input of the machine learning model to train the machine learning model by minimising an error between the object information and an output of the machine learning model; and applying the trained machine learning model to a three-dimensional point cloud captured by an active light detection and ranging (LIDAR) sensor to determine information of an object in view of the LIDAR sensor.

14. The method of claim 13, wherein the output of the machine learning model is indicative of one or more of: classification; detection; or segmentation of the object.

15. The method of claim 13 or 14, wherein rendering the three-dimensional point cloud comprises: defining multiple rays from a view point; and for each ray, tracing the ray to determine an intersection point with a mesh model of the three-dimensional object.

16. The method of any one of claims 13 to 15, wherein the method further comprises emulating a partial occlusion of the object in computer memory.

17. The method of any one of claims 13 to 16, wherein the method further comprises adding noise to the rendered three-dimensional point cloud.

18. The method of any one of claims 13 to 16, wherein the machine learning model comprises: a point cloud feature encoder to generate point features; and a neural network to generate the output of the machine learning model.

19. The method of claim 18, wherein the neural network is a multilayer-perceptron.

20. Software that, when executed by a computer, causes the computer to perform the method of any one of the preceding claims.

21. A computer system for multi-modal training of a machine learning model to reduce an error between an output of the machine learning model and a ground truth, the computer system comprising a processor configured to: receive first sensor data captured by a first sensor of a first modality, the first sensor data having a strong link between the sensor data and the ground truth; receive second sensor data captured by a second sensor of a second modality, the second sensor data having a weak link between the sensor data and the ground truth; train the machine learning model by using the second sensor data as input to the machine learning model and reducing an error between the output of the machine learning model and an estimation of the ground truth calculated based on the first sensor data; and evaluate the trained machine learning model using further second sensor data as input to the trained machine learning model.

22. A sensor system comprising: a first sensor of a first modality to capture first sensor data, the first sensor data having a strong link between the sensor data and the ground truth; a second sensor of a second modality to capture second sensor data, the second sensor data having a weak link between the sensor data and the ground truth; and a machine learning processor configured to: train a machine learning model by using the second sensor data as input to the machine learning model and reducing an error between an output of the machine learning model and an estimation of the ground truth calculated based on the first sensor data; and evaluate the trained machine learning model using further second sensor data as input to the trained machine learning model.

Description:
"Point cloud processing" Cross-Reference to Related Applications [0001] The present application claims priority from Australian Provisional Patent Application 2022903043 filed on 17 October 2022, the contents of which are incorporated herein by reference in their entirety. Technical Field [0002] This disclosure relates to machine learning in relation to point clouds, such as, but not limited to, point clouds obtained from light detection and ranging (LIDAR) sensors. Background [0003] Machine learning models are available for classification, object detection and segmentation in image data. However, the training of these machine learning models requires training data that is often not available with a number of samples that is sufficient to train the models for accurate performance. [0004] Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present disclosure as it existed before the priority date of each claim of this application. Summary [0005] This disclosure provides a method for generating training data for machine learning models that is unsupervised in the sense that human labelling of training samples is not necessary. Therefore, a large number of training samples can be generated relatively quickly. For example, different sensing modalities can be used so that the input training samples are from a sensor (such as a passive image sensor) and the labels are from a second sensor (such as an active LIDAR sensor). In another example, the training samples are generated by rotating a three-dimensional in-silico model of an object and rendering a point cloud from that object given a viewing position and laser sampling pattern. [0006] There is provided a method for multi-modal training of a machine learning model to reduce an error between an output of the machine learning model and a ground truth. The method comprises: receiving first sensor data captured by a first sensor of a first modality, the first sensor data having a strong link between the sensor data and the ground truth; receiving second sensor data captured by a second sensor of a second modality, the second sensor data having a weak link between the sensor data and the ground truth; training the machine learning model by using the second sensor data as input to the machine learning model and reducing an error between the output of the machine learning model and an estimation of the ground truth calculated based on the first sensor data; and evaluating the trained machine learning model using further second sensor data as input to the trained machine learning model. [0007] In some embodiments, the first sensor data comprises point cloud data. [0008] In some embodiments, the second sensor data comprises image data. [0009] In some embodiments, the second sensor data comprises movement data indicative of acceleration of the second sensor. [0010] In some embodiments, the first sensor data and the second sensor data represent a scene comprising a moving object and static object. [0011] In some embodiments, the further comprises determining whether a condition on a spatial relationship between the moving object and the static object is met. [0012] In some embodiments, the moving object is an animal and the static object is a drinking trough for the animal. 
[0013] In some embodiments, the output of the machine learning model relates to hand washing and is indicative of a moving hand being in proximity to a static sink.

[0014] In some embodiments, the output of the machine learning model relates to knife sanitation and is indicative of a moving knife being inserted into a static knife steriliser.

[0015] In some embodiments, the output of the machine learning model relates to animal feeding and the machine learning output is indicative of a moving animal being in proximity of a static feeding trough.

[0016] In some embodiments, the output of the machine learning model relates to machine operations and the machine learning output is indicative of a moving operator being in proximity to a static machine.

[0017] In some embodiments, the machine learning model is configured to perform one or more of: classification; detection; or segmentation.

[0018] A method for training a machine learning model for a point cloud comprises: creating a three-dimensional model of an object in computer memory, the object being associated with object information; rendering a three-dimensional point cloud of the three-dimensional model; repeatedly rotating the three-dimensional model to render multiple three-dimensional point clouds; using the multiple three-dimensional point clouds as an input of the machine learning model to train the machine learning model by minimising an error between the object information and an output of the machine learning model; and applying the trained machine learning model to a three-dimensional point cloud captured by an active light detection and ranging (LIDAR) sensor to determine information of an object in view of the LIDAR sensor.

[0019] In some embodiments, the output of the machine learning model is indicative of one or more of: classification; detection; or segmentation of the object.

[0020] In some embodiments, rendering the three-dimensional point cloud comprises defining multiple rays from a view point; and for each ray, tracing the ray to determine an intersection point with a mesh model of the three-dimensional object.

[0021] In some embodiments, the method further comprises emulating a partial occlusion of the object in computer memory.

[0022] In some embodiments, the method further comprises adding noise to the rendered three-dimensional point cloud.

[0023] In some embodiments, the machine learning model comprises a point cloud feature encoder to generate point features; and a neural network to generate the output of the machine learning model.

[0024] In some embodiments, the neural network is a multilayer-perceptron.

[0025] There is also provided software that, when executed by a computer, causes the computer to perform the above method.

[0026] There is provided a computer system for multi-modal training of a machine learning model to reduce an error between an output of the machine learning model and a ground truth.
The computer system comprises a processor configured to: receive first sensor data captured by a first sensor of a first modality, the first sensor data having a strong link between the sensor data and the ground truth; receive second sensor data captured by a second sensor of a second modality, the second sensor data having a weak link between the sensor data and the ground truth; train the machine learning model by using the second sensor data as input to the machine learning model and reducing an error between the output of the machine learning model and an estimation of the ground truth calculated based on the first sensor data; and evaluate the trained machine learning model using further second sensor data as input to the trained machine learning model.

[0027] A sensor system comprises a first sensor of a first modality to capture first sensor data, the first sensor data having a strong link between the sensor data and the ground truth; a second sensor of a second modality to capture second sensor data, the second sensor data having a weak link between the sensor data and the ground truth; and a machine learning processor configured to: train a machine learning model by using the second sensor data as input to the machine learning model and reducing an error between an output of the machine learning model and an estimation of the ground truth calculated based on the first sensor data; and evaluate the trained machine learning model using further second sensor data as input to the trained machine learning model.

[0028] Optional features described for any aspect of the method, computer readable medium or computer system, where appropriate, similarly apply to the other aspects also described here.

[0029] Throughout this specification the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element, integer or step, or group of integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.

Brief Description of Drawings

[0030] An example will now be described with reference to the following drawings:

[0031] Fig.1 illustrates a scenario for multi-modal training of a machine learning model.

[0032] Fig.2 illustrates a method for multi-modal training of a machine learning model.

[0033] Fig.3 illustrates a method for training a machine learning model for a point cloud.

[0034] Fig.4 illustrates a multi-modal sensing system comprising LiDARs and cameras overlooking the field, and collars equipped with IMU and GNSS sensors and worn by animals.

[0035] Fig.5 illustrates a processing system for processing sensor data.

[0036] Fig.6 illustrates a visualisation of data captured through various sensing modes.

[0037] Fig.7 illustrates the ROC curve for drinking behaviour recognition with a window size of 10 s.

[0038] Fig.8 is an illustration of single-domain learning versus synthetic-to-real cross-domain learning. The single-domain approach collects real-world point cloud data and annotates it. The cross-domain approach trains the model on the data synthesized through automated simulation and utilizes the learned model for inference on real data.

[0039] Fig.9 illustrates a synthetic-to-real cross-domain learning approach.

[0040] Fig.10 illustrates an example synthetic point cloud dataset with six objects and four viewpoints.

[0041] Fig.11 shows examples of synthetic and real data in the Point-Syn2Real dataset for the outdoor setting.
[0042] Fig.12 illustrates t-SNE visualizations of the feature space for ModelNet-ScanNet cross-domain learning. The Baseline model is learned from raw synthetic 3D objects, while Point-Syn2Real A+S+E uses more realistic partial scans synthesized from multiple viewpoints. The target domain, ScanNet, is a real-world point cloud dataset.

[0043] Fig.13 illustrates class-wise accuracy values (%) for the ModelNet to ScanNet case.

[0044] Fig.14 illustrates a sensor system.

Description of Embodiments

[0045] This disclosure provides an improved method for generating labelled training data for training a machine learning model on sensor data.

Sensor scenario

[0046] Fig.1 illustrates an example scenario 100 comprising a first sensor 101. In this example, the first sensor 101 is a light detection and ranging (LIDAR) sensor that generates a point cloud 102. The point cloud 102 comprises three-dimensional data points having x, y, and z coordinates in Cartesian space. The LIDAR sensor generates the point cloud by directing a laser beam at the scene and measuring the time of flight to calculate a distance. The LIDAR sensor repeats this measurement for multiple angles of the laser beam to obtain multiple distance measurements and then converts these essentially polar coordinate points into three-dimensional Cartesian points. It is noted that Fig.1 only shows two dimensions for simplicity.

[0047] It is further noted that the distance measurements, and therefore the point cloud 102, readily enable the separation between background points 103 and foreground points 104. As a result, it is possible to relatively robustly perform segmentation of the point cloud 102 into foreground points 104 and background points 103. Therefore, the LIDAR sensor is said to have a relatively strong link between the sensor data and the ground truth of the association of each point to background or foreground objects, that is, segmentation.

[0048] It is emphasized here that the LIDAR sensor is just one example and segmentation is also just one example. Other examples use other sensors to capture first sensor data. For example, hyperspectral cameras may have a strong link between the sensor data and the ground truth of materials in the scene. Or proximity sensors, such as radio frequency identification (RFID) readers, may serve as the first sensor as they have a strong link between the sensor data and the presence of a tagged object as ground truth. Many other examples exist for a wide range of sensors. In many examples, the first sensor 101 provides a strong link to the ground truth but is not practical for field applications. The reasons may include high cost, high complexity, low robustness, low speed, high power consumption, weight and others.

[0049] It is now the aim to train a machine learning model 105 on second sensor data from a second sensor 106 that is practical for field applications. For example, the second sensor 106 may be lower in cost or complexity, more robust, faster, lower in power consumption and weight or more advantageous in other aspects for field applications compared to the first sensor. The downside, however, is that the second sensor 106 only has a weak link between the sensor data and the ground truth. For this reason, the machine learning model 105 is trained using the first sensor data as training labels.

[0050] In this example, the second sensor 106 is a three-axis inertial measurement unit (IMU) that measures acceleration in three dimensions. However, other sensors are equally possible, including cameras, thermometers, etc. There is a multitude of different combinations that can be envisaged depending on the environment. For example, LIDAR data can be used to label training samples for image classification. Vice versa, image data can be used to generate labels for training samples for point cloud classification. So essentially any sensor may have a strong ground truth link in some applications and a weak ground truth link in other applications. Therefore, any possible combination of a first sensor and a second sensor is possible for various applications.
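To make the conversion described in paragraph [0046] concrete, the following is a minimal illustrative sketch (not part of the original disclosure) that converts range measurements taken at known beam angles into three-dimensional Cartesian points of a point cloud 102; the array names and the azimuth/elevation convention are assumptions.

```python
import numpy as np

def polar_to_cartesian(ranges, azimuth, elevation):
    """Convert LIDAR range measurements taken at known beam angles
    (azimuth and elevation, in radians) into x, y, z Cartesian points."""
    x = ranges * np.cos(elevation) * np.cos(azimuth)
    y = ranges * np.cos(elevation) * np.sin(azimuth)
    z = ranges * np.sin(elevation)
    return np.stack([x, y, z], axis=-1)  # shape (N, 3) point cloud

# Example: three returns measured at different beam angles.
ranges = np.array([12.3, 12.5, 4.1])          # metres
azimuth = np.radians([10.0, 11.0, -5.0])      # horizontal beam angle
elevation = np.radians([0.0, 0.5, 2.0])       # vertical beam angle
points = polar_to_cartesian(ranges, azimuth, elevation)
```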
[0051] Essentially, there is a processing module 107 that processes the first sensor data with the strong link to the ground truth to generate labels for the training samples. Then, the machine learning model 105 generates an output, which may be a classification, detection, segmentation or other machine learning output. An error calculation module 108 calculates an error that is then used in a backpropagation 109 feedback loop. This way, the machine learning model 105 can adjust the internal weights to reduce the error calculated by error calculation module 108.

[0052] It is noted that the shown structure of the machine learning model 105 is a simplified example and more or less complex models can equally be used, including convolutional neural networks (CNN), recurrent neural networks (RNN) and other networks with a different number of layers. Each network may be trained using gradient descent or stochastic gradient descent with backpropagation.

[0053] It is noted that the samples generated by both sensors may be static, such as images and point clouds, or may be time series, such as audio recordings or text. The methods disclosed herein are performed by a processor 110 indicated by the dashed box. In this example, the processor 110 comprises data storage to store the machine learning model 105, including learned parameters and the model structure. Processor 110 also comprises program memory, that is, a non-transitory computer readable medium with program code stored thereon. The program code causes processor 110 to perform the methods disclosed herein, including the methods shown in Figs.2 and 3.

Multi-modal training method

[0054] Fig.2 illustrates a method 200 for multi-modal training of a machine learning model to reduce an error between an output of the machine learning model and a ground truth as performed by processor 110. Processor 110 receives 201 first sensor data captured by a first sensor of a first modality. Modality in this context means the physical phenomenon that is used to sense a particular observation of interest. For example, an active laser source and time of flight measurement is a modality, while inertial, photographic, hyperspectral, temperature, proximity, etc. represent other modalities. The first sensor data has a strong link between the sensor data and the ground truth as explained above. As a result, a label can be derived from the sensor data without prior training as explained with reference to module 107 above. The sensor data may be received at once, such as a data file, or may be received over time, such as a data stream. This means processor 110 may receive sensor data of one “capture” in a file (such as a point cloud or an image) and may receive a subsequent capture later in a similar data file.

[0055] Processor 110 then receives 202 second sensor data captured by a second sensor of a second modality.
The above description of modality applies here as well. The second sensor data has a weak link between the sensor data and the ground truth, which is why the machine learning model 105 is trained to process the second sensor data.

[0056] More specifically, processor 110 trains 203 the machine learning model 105 by using the second sensor data as input to the machine learning model and reducing an error (see module 108) between the output of the machine learning model and an estimation of the ground truth calculated (see module 107) based on the first sensor data. Finally, processor 110 evaluates 204 the trained machine learning model 105 using further second sensor data as input to the trained machine learning model. Evaluating the trained machine learning model 105 means that processor 110 uses the further (weak) second sensor data, for which no first (stronger) sensor data is available, as an input to the trained machine learning model 105. Because the machine learning model 105 is now trained, the output will provide a reasonably good estimate of the ground truth – even without the use of the stronger first sensor data.

Example multi-modal implementation

[0057] One example use case is behaviour recognition, which aims at detecting animal (including human) states of activity/inactivity from sensor data. It may involve challenging time-series classification tasks whose solution uses in-depth domain knowledge as well as signal processing and machine learning expertise. With the recent advancement in small and light wearable sensors, e.g., inertial measurement unit (IMU), as well as more efficient sensors for monitoring from a distance, e.g., light detection and ranging (LiDAR), the sensing modes available for behaviour recognition have become more diverse and comprehensive.

[0058] Although individual sensing modes, such as IMU, can provide accurate signals for behaviour recognition from certain perspectives, they are often limited from some other perspectives. For example, common ruminant behaviours such as grazing, resting, and ruminating may confidently be recognised using tri-axial accelerometry data. However, some behaviours such as drinking and grazing have similar movement patterns and may confuse IMU-only-based models. Moreover, RGB-camera-based computer-vision models can be used to predict human actions. However, they are prone to error or failure when the ambient light is too low.

[0059] Collaborative use of multiple sensing modes can help improve behaviour recognition accuracy and robustness. The vision sensors, including RGB camera, LiDAR, and radiometric thermal camera, can observe spatial and temporal interactions between the physical objects and their environment. Simultaneously, wearable sensors, such as IMU and global navigation satellite system (GNSS) receiver, can provide fine-grained movement and location information. Therefore, multiple sensing modes can complement each other and enhance the accuracy of behaviour recognition.

[0060] This disclosure provides a multi-modal sensing system for training a machine learning model, which can be applied to a range of applications including animal behaviour recognition. In one example, the disclosed methods use LiDARs and RGB cameras together with IMU and GNSS sensors. However, other sensor modalities are equally possible. The multiple-view vision enables full coverage of the field containing the animals with minimal occlusion, while the wearable sensors collect data from individual animals. The performance of the proposed system is evaluated in a field trial with grazing beef cattle.
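Before turning to the system design, the following minimal sketch (added for illustration and not part of the original disclosure) shows one way the training loop of Figs.1 and 2 could look in code: the machine learning model 105 sees only the weak (second) sensor data, labels are derived from the strong (first) sensor data by a hand-crafted rule standing in for module 107, and backpropagation 109 reduces the error computed by module 108. The tensor shapes, the label heuristic and the model architecture are all assumptions.

```python
import torch
import torch.nn as nn

def labels_from_strong_sensor(point_clouds):
    """Stand-in for module 107: estimate the ground truth from strong sensor
    data. Here a dummy rule labels a sample 1 if enough points lie above a
    height threshold (e.g. inside a region of interest), else 0."""
    in_roi = (point_clouds[..., 2] > 0.5).float().mean(dim=1)
    return (in_roi > 0.1).long()

model = nn.Sequential(                     # machine learning model 105 (placeholder)
    nn.Linear(3 * 100, 64), nn.ReLU(), nn.Linear(64, 2))
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()            # error calculation module 108

for step in range(100):                    # training (step 203)
    imu_windows = torch.randn(8, 3, 100)   # second (weak) sensor data
    lidar_clouds = torch.randn(8, 100, 3)  # first (strong) sensor data
    labels = labels_from_strong_sensor(lidar_clouds)
    logits = model(imu_windows.flatten(1))
    loss = loss_fn(logits, labels)
    optimiser.zero_grad()
    loss.backward()                        # backpropagation 109
    optimiser.step()

# Evaluation (step 204): further weak sensor data only, no strong sensor data.
with torch.no_grad():
    prediction = model(torch.randn(1, 3, 100).flatten(1)).argmax(dim=1)
```

In practice, the label rule would be replaced by, for example, the background subtraction and proximity test described later in this disclosure.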
System Design

[0061] Fig.5 illustrates a multi-modal sensing system 500, which receives sensor data including LiDAR, RGB camera, IMU, and GNSS signals as inputs. The system comprises an operating system 501, such as the Robot Operating System (ROS), a pre-processing module 502, a database 504, an offline analysis module 505 and an online analysis module 506. The ROS module 501 manages the data streams from multiple input sources. The pre-processing module 502 filters noise and transforms the input data. The database 504 stores the pre-processed data and provides it to the offline analysis module 505, which segments the data and performs other analysis tasks. The online analysis module 506 receives the pre-processed data and performs visualisation, recognition and other online analysis tasks.

[0062] In this example, the multiple sensors are used to train a machine learning model that uses the LiDAR data as input. So system 500 further comprises an evaluation pipeline 510, comprising a region selection module 512, a background subtraction module 513 and a model evaluation module 514, which in this case is a drinking recognition module. The region selection module selects a region of interest in which the LiDAR data is to be processed. The background subtraction module 513 subtracts the background, such as by thresholding the distance between clusters of points as disclosed herein. The drinking recognition module 514 evaluates the trained machine learning model using the processed sensor data as input to the machine learning model. In this example, the pipeline 510 recognises drinking behaviour, but other classification, detection, segmentation or other machine learning tasks may be performed. The pipeline 510 may also curate the data for further analysis such as image segmentation.

Hardware Components

Edge Computer

[0063] In one example, the in-situ computer is a desktop PC running Ubuntu 20.04. It is equipped with a gigabit Ethernet switch for connection to two LiDAR sensors and two power-over-Ethernet (PoE) cameras. In other examples, this will be replaced by an edge AI device, e.g., using an Nvidia Jetson.

LiDAR

[0064] In one example, two Baraja solid-state LiDAR sensors are used to capture point-cloud data. Using two or more LiDAR heads helps achieve good coverage of the field and minimises occlusion, which is useful for capturing 3D shapes of the objects present in the field.

RGB Camera

[0065] Some examples use two PoE bullet cameras above the LiDAR sensors, set at a frame rate of 10 Hz to match the LiDARs’ sampling rate.

IMU

[0066] Each collar worn by cattle contains a Bosch Sensortec BMX160 9-axis MEMS IMU chip, which features a tri-axial accelerometer, magnetometer, and gyroscope for sensing motion.

GNSS

[0067] Each collar also has a u-blox ZOE-M8Q GNSS receiver that estimates position and speed.

Software Modules

ROS

[0068] Some examples run on ROS, which is an open-source robotics middleware suite comprising a set of drivers, algorithms, and development tools that facilitate building systems with multiple heterogeneous data streams.

Dual-LiDAR and LiDAR-RGB Calibration

[0069] Monocular vision-based sensing is prone to occlusion. Therefore, some examples utilise two or more different viewpoints to cover the area of interest optimally. Two LiDAR sensors are calibrated using a coarse-to-fine approach. It is possible to use the CloudCompare toolkit to coarsely align the point clouds collected by LiDARs from different viewpoints. The iterative closest point (ICP) registration algorithm can then refine the alignment. It is also possible to use automatic registration methods to align LiDAR sensors and RGB cameras within a common global coordinate system.
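As an illustration of the fine alignment step, the following sketch (an illustrative example only, not part of the original disclosure) refines a coarse alignment of two LiDAR point clouds with ICP using the Open3D library; the file names and the coarse initial transform are placeholders, and the coarse transform would typically come from a manual or CloudCompare-based alignment.

```python
import numpy as np
import open3d as o3d

# Refine the alignment of two LiDAR point clouds with ICP (sketch only).
source = o3d.io.read_point_cloud("lidar_a.ply")   # hypothetical file names
target = o3d.io.read_point_cloud("lidar_b.ply")

coarse_init = np.eye(4)  # e.g. exported from a coarse CloudCompare alignment

result = o3d.pipelines.registration.registration_icp(
    source, target,
    max_correspondence_distance=0.2,   # metres; tune to the scene scale
    init=coarse_init,
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())

print(result.transformation)           # 4x4 rigid transform, source -> target
source.transform(result.transformation)
```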
Background Subtraction and Behaviour Recognition

[0070] The LiDAR sensor provides the 3D position coordinates of the environment including the foreground objects. The sensor can pre-scan the environment in the absence of any foreground object to build a background 3D model. The processor first applies one or more of distance, noise, and voxelised grid filters to reduce the number of data points to be processed. Furthermore, the processor calculates the cloud-to-cloud distance between the newly captured point cloud and the background model using the nearest-neighbour-based approach from the Point Cloud Library. Finally, the processor considers the points that are sufficiently distant from the background to belong to the foreground objects.

[0071] Once the background is subtracted, the processor can detect the interactions between the moving and stationary objects by calculating the relevant Euclidean distances in the 3D space. For example, the processor detects the interactions between the cattle and the water trough to recognise drinking behaviour as described below.
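The following sketch shows one way to implement the cloud-to-cloud distance test of paragraph [0070] and the interaction test of paragraph [0071]. It is illustrative only and uses SciPy's k-d tree in place of the Point Cloud Library; the thresholds, the trough position and the synthetic data are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def foreground_points(frame, background, distance_threshold=0.15):
    """Cloud-to-cloud distance: keep points of the new frame that are farther
    than a threshold from their nearest neighbour in the pre-scanned
    background model."""
    nn_dist, _ = cKDTree(background).query(frame, k=1)
    return frame[nn_dist > distance_threshold]

def near_static_object(foreground, object_centre, radius=0.5):
    """Interaction test: is any foreground (moving) point within a Euclidean
    radius of a static object, e.g. the water trough?"""
    return bool(np.any(np.linalg.norm(foreground - object_centre, axis=1) < radius))

# Illustrative use with synthetic data; coordinates are placeholders.
background = np.random.rand(5000, 3) * 10.0
frame = np.vstack([background + np.random.normal(0, 0.01, background.shape),
                   np.random.rand(200, 3) + np.array([4.0, 4.0, 0.0])])
fg = foreground_points(frame, background)
drinking_candidate = near_static_object(fg, object_centre=np.array([4.5, 4.5, 0.5]))
```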
Instance Segmentation

[0072] In one example, the processor uses PointRend, a refined version of the Mask-RCNN model, for image instance segmentation. The segmentation module can be implemented using efficient online models such as YOLACT++ or more accurate offline models depending on the application.

Experiments and Results

[0073] To evaluate the proposed sensing system in the field, we conducted a field experiment in the CSIRO FD McMaster Laboratory Pasture Intake Facility. During the experiment, we confined eight cattle within paddocks of size 25 m × 25 m. We mounted each LiDAR-camera pair on a fixed pole and placed the two poles around 10 m apart as illustrated in Fig.4. We collected IMU and GNSS data using eGrazor collar tags worn by the cattle. The experiment was approved by the relevant animal ethics committee. We annotated parts of the data by observing animal behaviour in the recorded videos.

Multi-modal Sensing Visualisation

[0074] Using our in-situ computer, we visualise the data of different sensing modes as in Fig.6. For real-time visualisation of the point clouds captured by the LiDARs, we use a 3D visualiser from ROS. With the dual-LiDAR/camera setup, we can monitor the entire paddock with a panoramic view and reduced occlusion. The top view obtained from LiDAR also provides useful information about the movement of the objects. For time-series data streams, we visualise the data captured by the accelerometer and GNSS receiver as in Fig.6. Therefore, the disclosed multi-modal sensing system enables real-time visualisation, providing useful insights to domain scientists and practitioners using the system. In particular, the real-time visualisation comprises a visual indication (such as by colouring points) of a classification or other machine learning model output.

Instance Segmentation

[0075] We show an example image instance segmentation using the PointRend model in Fig.6. The cattle and the person are clearly distinguished from the background while moderately overlapped cattle are distinctly segmented. Therefore, the processor can use the results from RGB images to supervise the segmentation of the object in the corresponding point clouds.

Background Subtraction

[0076] The background subtraction module, as implemented by the processor, creates a background environment model, and separates the objects of interest, such as the cattle and the water trough. The LiDAR view in Fig.6 shows an example real-time background subtraction. The foreground points are in green and yellow, corresponding to the two LiDAR sensors. Using the proposed background subtraction method, the processor can efficiently filter out the background in near real-time and obtain foreground point clouds for behaviour recognition.

Drinking Recognition

[0077] Recognising drinking, an elusive but important cattle behaviour, from video or motion data is challenging. Therefore, it is a target use case of this work. Using point cloud data, the processor measures the extent to which any cattle places its head in the water trough, and uses this information to build a classifier and recognise drinking behaviour. The receiver operating characteristic (ROC) curve shown in Fig.7 illustrates the performance of the proposed point-cloud-based cattle drinking behaviour recogniser. The classification results using a recognition window of 10 s are robust, as the area under the ROC curve reaches 0.96.

Further applications

[0078] The disclosed method may relate to food export regulation and compliance, and in particular to hand washing detection. There are strict rules requiring workers in meat processing plants to wash their hands regularly. The tap, sink and soap dispenser form the static environment, and the hand is the moving object. The output of the machine learning model relates to hand washing and is indicative of a moving hand being in proximity to a static sink.

[0079] The disclosed method may relate to food export regulation and compliance, and in particular to knife sanitisation detection. Workers in meat processing plants need to sanitise their knives regularly. They dip the knife into a knife steriliser (a specialised basin), and then remove the knife after a few seconds. In that case, the output of the machine learning model is indicative of a moving knife being inserted into a static knife steriliser.

[0080] The disclosed method may relate to agriculture and food, and in particular to animal feeding. The model can detect the interactions between a fixed animal feeding trough and livestock. That is, the output of the machine learning model relates to animal feeding and the machine learning output is indicative of a moving animal being in proximity of a static feeding trough.

[0081] Further, the disclosed method may relate to manufacturing and in particular to monitoring machinery operation in manufacturing. That is, the output of the machine learning model relates to machine operations and the machine learning output is indicative of a moving operator being in proximity to a static machine.

Point cloud training method

[0082] As described above, method 200 provides for the generation of a large number of training samples without supervision by using strong sensor data. However, for training a machine learning model on a point cloud, it is often not possible or not practical to generate strong sensor data. Therefore, Fig.3 provides a further method 300 for training a machine learning model for a point cloud. Again, method 300 may be performed by a processor comprising non-volatile program memory and data memory.
[0083] In this sense, the processor creates 301 a three-dimensional model of an object in computer memory. The object is associated with object information. For example, for a classification task, the object information comprises a type or identity of an object, such as “cat”, “dog”, “car”, “bike”, “house” etc. For a detection task, the object information comprises a bounding box of the object, and for a segmentation task, the object information comprises pixel labels to associate each pixel with either the object or not the object.

[0084] The processor then renders 302 a three-dimensional point cloud of the three-dimensional model. For this rendering process, the processor may retrieve a laser beam pattern, such as an opening angle and step size in degrees. For each beam in the beam pattern, the processor performs a raytracing function that determines an intersection point of the beam with a mesh model of the object. More particularly, the processor defines a view point and multiple rays from the view point. For each ray, the processor traces the ray to determine an intersection point with a mesh model of the three-dimensional object. In one example, the processor uses a ray tracing software such as Blender available from https://www.blender.org/. The processor then uses the intersection point as one point in the point cloud. By repeating this process, the processor generates all points in the point cloud.

[0085] The processor repeatedly rotates 303 the three-dimensional model to render multiple three-dimensional point clouds as described above. Then, the processor uses 304 the multiple three-dimensional point clouds as an input of the machine learning model to train the machine learning model by minimising an error between the object information and an output of the machine learning model, such as by using gradient descent and backpropagation.

[0086] Finally, the processor applies the trained machine learning model to a three-dimensional point cloud captured by an active light detection and ranging (LIDAR) sensor to determine information of an object in view of the LIDAR sensor. In contrast to the synthetic training data, this final evaluation is on real-world point cloud data. In other words, during training, the object information is known because the object has been created in computer memory. However, for the evaluation, the object information is not known but the machine learning model outputs the object information since it has been trained for this task.

Example point cloud training implementation

[0087] Object classification using LiDAR 3D point cloud data is useful for modern applications such as autonomous driving. However, labelling point cloud data is labour-intensive as it requires human annotators to visualize and inspect the 3D data from different perspectives. This disclosure provides a semi-supervised cross-domain learning approach that does not rely on manual annotations of point clouds and performs similarly to fully-supervised approaches. The disclosed methods utilize available 3D object models to train classifiers that can generalize to real-world point clouds.

[0088] The processor simulates the acquisition of point clouds by sampling 3D object models from multiple viewpoints and with arbitrary partial occlusions. The processor then augments the resulting set of point clouds through random rotations and adds Gaussian noise to better emulate the real-world scenarios. The processor then trains point cloud encoding models, e.g., DGCNN, PointNet++, on the synthesized and augmented datasets and evaluates their cross-domain classification performance on corresponding real-world datasets. Point-Syn2Real is a dataset for cross-domain learning on point clouds. The results of extensive experiments with this dataset demonstrate that the proposed cross-domain learning approach for point clouds outperforms the related baseline and state-of-the-art approaches in both indoor and outdoor settings in terms of cross-domain generalizability.
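As an illustration of the rendering and rotation steps 302 and 303 (paragraphs [0084] and [0085]), the following sketch casts rays against a mesh model and adds noise to the rendered clouds. It is a minimal example using the trimesh library rather than Blender, and the model file name, view point, ray count and noise level are assumptions.

```python
import numpy as np
import trimesh

def render_point_cloud(mesh, view_point, n_rays=2048):
    """Render a partial point cloud of a mesh from one view point: cast rays
    from the view point towards sampled surface points and keep the first
    intersection of each ray, so self-occluded parts are naturally omitted."""
    targets, _ = trimesh.sample.sample_surface(mesh, n_rays)
    directions = targets - view_point
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    origins = np.repeat(view_point[None, :], n_rays, axis=0)
    hits, _, _ = mesh.ray.intersects_location(origins, directions,
                                              multiple_hits=False)
    return hits

mesh = trimesh.load("object.obj", force="mesh")   # hypothetical 3D model file
clouds = []
for angle in np.linspace(0.0, 2 * np.pi, 8, endpoint=False):
    rotated = mesh.copy()                         # repeatedly rotate the model
    rotated.apply_transform(
        trimesh.transformations.rotation_matrix(angle, [0, 0, 1]))
    cloud = render_point_cloud(rotated, view_point=np.array([3.0, 0.0, 1.0]))
    cloud = cloud + np.random.normal(0.0, 0.01, cloud.shape)  # emulate sensor noise
    clouds.append(cloud)
```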
[0089] Fig.8 is an illustration of single-domain learning versus synthetic-to-real cross-domain learning. The single-domain approach requires collecting real-world point cloud data and annotating it. The cross-domain approach trains the model on the data synthesized through automated simulation and utilizes the learned model for inference on real data.

[0090] Detecting and identifying the objects present in a scene is a useful but challenging task in machine learning. Although object detection in computer vision has significantly advanced in recent years, there exist fundamental limitations. For example, most cameras have difficulty capturing clear images in low- or excessive-light conditions, and determining the exact distance of objects in 2D images is challenging. In addition, privacy concerns often arise around images as they may contain private information easily perceivable by humans.

[0091] Light detection and ranging (LiDAR) sensors use laser beams to scan their surroundings and construct 3D representations of the objects within. The scanned 3D snapshots are stored as the so-called point clouds. Detecting objects from 3D point clouds can help resolve some of the challenges associated with image-based object detection. A LiDAR scanner can obtain precise information of object positions and shapes regardless of lighting conditions. Moreover, as point clouds are less perceivable by humans, they can help enhance privacy preservation. Given the above advantages, computer vision applications can benefit from 3D point clouds. For example, in autonomous driving, when the visibility is poor, LiDAR sensors can help detect obstacles.

[0092] It is possible to recognize objects in point clouds using machine learning. Some approaches use carefully designed features to represent various shapes in point clouds. Deep learning (DL) models can be used to learn point-level and object-level features in an end-to-end manner. An example is PointNet, which uses end-to-end DL. It can lead to significant improvement in point cloud classification and segmentation performance. PointNet++ and DGCNN consider local neighbourhood information for refined feature extraction. Nonetheless, training DL models that perform well on real-world data is challenging. Firstly, DL-based methods require large amounts of labelled data for training, whose acquisition is slow and laborious. Secondly, the sensed back-scatter laser light in LiDAR scans is often corrupted by noise that can affect the performance of the learned model. Thirdly, real-world point cloud data is often subject to partial occlusions that can also affect the performance of the learned model.

[0093] Databases of models created for 3D graphics design contain large collections of high-quality synthetic 3D models of various known objects.
These databases can be used to train new machine learning models to recognize objects in real-world applications within complex environments while minimizing the need for human annotations of point cloud data.

[0094] This disclosure provides a synthetic-to-real cross-domain learning approach for point cloud data, called Point-Syn2Real. It enables learning end-to-end DL-based models for classifying objects in real-world point clouds by making use of the available synthetic 3D model databases. With Point-Syn2Real, we learn classification models from synthetic 3D object data (the source domain) and extend the knowledge gained from the synthetic data to real-world point cloud data (the target domain) as illustrated in Fig.8. Since there are substantial discrepancies between the characteristics of the synthetic and real data, an object classification model trained on the source domain usually does not perform well on the target domain when applied directly. Therefore, this disclosure aims to improve the generalizability of the learned model.

[0095] In the training phase of Point-Syn2Real, the processor first simulates 3D LiDAR scans for each considered object from multiple viewpoints to generate synthetic but realistic point cloud data. The processor also emulates arbitrary partial occlusions that can occur in real-world 3D scans. The processor then applies several random rotations and adds Gaussian noise to the synthetic data. Augmenting the datasets via rotation and noise addition helps the trained models learn feature representations that are rotation-invariant and robust to noise. The processor feeds the simulated and augmented synthetic 3D point cloud data into a DL-based point-cloud feature encoder, e.g., DGCNN or PointNet++, and aggregates the learned point features via max-pooling to preserve the most salient features. The processor then passes the features through a multilayer perceptron (MLP) classifier to generate class-wise predictions.

[0096] The processor computes the loss function associated with each labeled point cloud as the cross-entropy of predictions and the corresponding ground truth label. In addition, the processor utilizes the unlabeled real-world point cloud data of the target domain for training in a semi-supervised learning fashion via entropy minimization. To this end, the processor feeds the unlabeled point clouds through the feature extractor and object classifier to create the respective class-wise predictions. The loss function for each unlabeled point cloud is the entropy of its corresponding predicted probabilities. This encourages the learned model to make more confident predictions on the unlabeled data and consequently improves its generalization ability.

[0097] During inference on real-world data of the target domain, the learned model encodes the input point cloud data into object-level features, and the classifier predicts the associated object class based on the features.

[0098] Aspects of the proposed method include:

• A semi-supervised cross-domain learning approach that can generalize the knowledge learned from synthetic 3D point clouds to real-world data collected by LiDAR scanners. This uses random rotation/noise addition augmentation, multi-view simulation, and entropy minimization to enhance the robustness and performance of the learned models.

• A comprehensive synthetic-to-real cross-domain 3D point cloud dataset as a benchmark, which includes indoor and outdoor scenarios.
• Extensive experiments using data of both indoor and outdoor settings, demonstrating the effectiveness of the proposed approach.

Cross-domain learning

[0099] This disclosure provides a semi-supervised approach to synthetic-to-real cross-domain learning including procedures of multi-view simulation and data augmentation as well as the utilized point cloud encoder. There is also provided a model learning process and an associated objective function designed for semi-supervised learning while coping with class imbalance.

Overview

[0100] Fig.9 is a visual overview of the proposed approach. During training, the processor uses a 3D computer-aided design (CAD) tool to generate a set of partially-occluded point clouds taken from multiple viewpoints for each considered object. The generated point cloud set associated with the $i$th object is denoted as $\mathcal{P}_i = \{P_{i1}, P_{i2}, \ldots, P_{iM}\}$, where $M$ is the number of viewpoints. Each point cloud is a set of points in 3D space, i.e., $P_{ij} = \{p_{ij1}, p_{ij2}, \ldots, p_{ijN_{ij}}\}$, where $N_{ij}$ is the number of points in $P_{ij}$ and each point $p_{ijk}$ has coordinates $(x_{ijk}, y_{ijk}, z_{ijk})$. The processor utilizes unlabeled real-world point clouds from the target domain to realize semi-supervised learning. Thus, the set of real point clouds associated with the $i$th object is denoted as $S_i$.

[0101] The processor augments the set of synthetic point clouds by applying multiple random rotations and adding Gaussian noise. The processor then feeds the augmented synthetic data into a DL-based point cloud encoder to extract point-level features. The processor aggregates these features via max-pooling before forwarding them to fully-connected (FC) classification layers, which output predicted posterior probabilities for each object class using the softmax function. To jointly train the neural networks of the encoder and the classifier, the processor uses a composite objective function that aggregates the losses associated with both labeled synthetic point cloud data and unlabeled real point cloud data. For the labeled synthetic data, the processor uses the cross-entropy loss, and, for the unlabeled real data, the processor uses the entropy loss. The processor weights both losses appropriately to account for the class imbalance. At inference time, the processor feeds point clouds produced by LiDAR scanners into the relevant trained model to make predictions.

Multi-view Point Cloud Simulation

[0102] In LiDAR scans, the objects of interest may be occluded by other objects or even themselves. In the disclosed cross-domain learning approach, the processor simulates occlusions in synthesizing the training set by taking snapshots from multiple viewpoints. In particular, given a 3D object model, the processor can simulate realistic LiDAR scans and generate multiple point clouds of the object from different viewpoints.

[0103] In one example, the processor utilizes the open-source software Blender to create synthetic point clouds for training. The procedure has two major steps, i.e., depth map simulation and back projection. The 3D model is positioned in the center of the scene at coordinates (0, 0, 0). The depth sensor is set up in a random position with an empirically adjusted maximum distance to the object. Its intrinsic and extrinsic properties are recorded for 3D reconstruction. After setting up the scene, a snapshot of the depth map is captured and saved. A partially-occluded 3D scan is then generated by back-projecting the depth map. The processor selects the new position of the depth sensor randomly and repeats the simulation procedure $M$ times for each object to generate the sets of point clouds $\mathcal{P}_i$, $\forall i \in \mathcal{O}$, where $\mathcal{O}$ is the set of objects. Fig.10 illustrates an example point cloud dataset with six objects and four viewpoints.
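The back-projection step of paragraph [0103] can be sketched as follows. This is an illustrative example only, assuming a simple pinhole depth camera expressed in its own coordinate frame; the intrinsic values and the placeholder depth map are not from the disclosure.

```python
import numpy as np

def back_project(depth, fx, fy, cx, cy, max_depth=100.0):
    """Back-project a simulated depth map into a partially-occluded 3D scan,
    assuming a pinhole camera with intrinsics (fx, fy, cx, cy)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = (depth > 0) & (depth < max_depth)   # drop background / missed rays
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)         # (N, 3) partial scan

# Placeholder depth map; in practice this is rendered by the CAD tool.
depth = np.zeros((480, 640))
depth[200:280, 300:380] = 2.5                   # an object 2.5 m from the sensor
points = back_project(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```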
Data Augmentation

[0104] The processor may augment the synthesized partially-occluded point clouds by applying random rotations and adding Gaussian noise to improve the robustness and accuracy of the learned models.

Random Rotation

[0105] The processor rotates each point cloud around the $z$-axis by a uniformly-distributed random angle, i.e., $\theta \in [0, 2\pi]$. The rotation of every point of the point cloud $P_{ij}$, i.e., $p_{ijk} = (x_{ijk}, y_{ijk}, z_{ijk})$, around the $z$-axis by $\theta$ is expressed via the following linear transformation:

$$\begin{bmatrix} x'_{ijk} \\ y'_{ijk} \\ z'_{ijk} \end{bmatrix} = \begin{bmatrix} \cos(\theta) & -\sin(\theta) & 0 \\ \sin(\theta) & \cos(\theta) & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{ijk} \\ y_{ijk} \\ z_{ijk} \end{bmatrix}. \quad (1)$$

[0106] The rotated point cloud is denoted as $P'_{ij}$.

Gaussian Noise

[0107] When collecting data in the real world, sensing imperfections due to, e.g., measurement noise or error, may corrupt the data. Therefore, to make the synthetic point cloud data more realistic, the processor adds noise to the values of each synthetic point as

$$p''_{ijk} = p'_{ijk} + \epsilon_{ijk}, \quad (2)$$

where $\epsilon_{ijk} = (\epsilon_{x,ijk}, \epsilon_{y,ijk}, \epsilon_{z,ijk})$ is the additive noise with each $\epsilon_{c,ijk}$, $c \in \{x, y, z\}$, drawn from a Gaussian distribution with mean $\mu = 0$ and standard deviation $\sigma = 0.01$.

Cross-domain Point Cloud Encoder

[0108] In this section, we elaborate on the point cloud encoder and classifier, and describe the unified learning objective that consists of the cross-entropy loss for labeled synthetic data and the entropy loss for unlabeled real data.

[0109] Given a partially-occluded and randomly-augmented point cloud $P''_{ij}$, the point cloud encoder, denoted by the function $f(\cdot)$, takes the point cloud as the input and outputs the point-level feature vectors of dimension $D$, i.e., $f_{ij} = f(P''_{ij})$. The point features are then aggregated using an effective symmetric aggregation function, i.e., max-pooling, denoted by $\mathrm{maxpool}(\cdot)$, to produce the pooled global feature vector, i.e., $g_{ij} = \mathrm{maxpool}(f_{ij}) \in \mathbb{R}^D$. The global feature vector is then passed through a classifier, $h(\cdot)$, that is a multilayer fully-connected neural network (perceptron) and outputs the logits for each class. The $\mathrm{softmax}(\cdot)$ function is applied to the logits to produce the class-wise posterior probabilities, denoted by $q_{ij} \in \mathbb{R}^C$, where $C$ is the number of classes, i.e., $q_{ij} = \mathrm{softmax}(h(g_{ij}))$. We calculate the categorical cross-entropy loss that evaluates the divergence between the predicted posterior probabilities and the ground-truth label as

$$\ell_{ij} = -y_{ij}^{\top} \log(q_{ij}), \quad (3)$$

where $y_{ij} \in \mathbb{R}^C$ is the one-hot vector for the ground-truth label corresponding to $P_{ij}$. The processor may use a weighted version of the cross-entropy loss to mitigate the impact of class imbalance.

[0110] To exploit the information available through the unlabeled real point clouds from the target domain, we utilize the entropy loss function calculated as

$$\ell_{it} = -s_{it}^{\top} \log(s_{it}), \quad (4)$$

where $s_{it}$ is the vector of posterior probabilities predicted by the model for the $t$th real point cloud of the $i$th object available for training. Minimizing the entropy loss for unlabeled data encourages the learned model to make more confident predictions, which can in turn improve its performance. The unified objective function that the processor minimizes during training is the weighted average of the cross-entropy and entropy losses for all available synthetic and real point clouds.
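As a concrete, non-authoritative sketch of the encoder, classifier and unified objective of paragraphs [0109] and [0110], the following code uses a tiny shared-MLP encoder in place of DGCNN or PointNet++; all dimensions, the class weights and the entropy weight are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointEncoder(nn.Module):
    """Toy stand-in for f(.): per-point features followed by max-pooling."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim), nn.ReLU())

    def forward(self, points):                 # points: (B, N, 3)
        f = self.mlp(points)                   # per-point features f_ij
        return f.max(dim=1).values             # maxpool -> global feature g_ij

encoder = PointEncoder()
classifier = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))  # h(.)
class_weights = torch.ones(10)                 # reweight to counter class imbalance

def unified_loss(synthetic, labels, real_unlabeled, lam=0.1):
    # Cross-entropy (Eq. 3) on labeled synthetic point clouds.
    logits_syn = classifier(encoder(synthetic))
    ce = F.cross_entropy(logits_syn, labels, weight=class_weights)
    # Entropy (Eq. 4) on unlabeled real point clouds from the target domain.
    probs = F.softmax(classifier(encoder(real_unlabeled)), dim=1)
    ent = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
    return ce + lam * ent                      # weighted combination of both losses

loss = unified_loss(torch.randn(4, 1024, 3), torch.randint(0, 10, (4,)),
                    torch.randn(4, 1024, 3))
loss.backward()
```

In practice, the weight of the entropy term would be tuned, for example, on the small labeled validation set mentioned in the Evaluation section below.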
[0111] The specifications of the encoder are not essential to this disclosure. Hence, any suitable point cloud encoder can be used. One example uses DGCNN (as described in Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. ACM Trans. Graph., 38(5):146:1–146:12, 2019, which is included herein by reference) as the point cloud encoder since it is efficient and leads to good performance. Other examples use PointNet++ (as described in Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5099–5108, 2017, which is incorporated herein by reference) as the point cloud encoder. The main common property of DGCNN and PointNet++ is their preservation of the local neighborhood in feature calculation. DGCNN constructs a k-nearest neighbor (KNN) graph in every graph convolutional layer. The first KNN graph is built upon raw 3D point coordinates to preserve geometric local neighborhood information. In the second and following layers, the local neighborhood is defined in the feature space. PointNet++ uses a hierarchical point-set feature learning module to recursively sample and group points in local regions. In both encoders, the computed point-level features are max-pooled to yield global features.

Point-Syn2Real Dataset

[0112]

[0113] We compile a new benchmark dataset, called Point-Syn2Real, by gathering data from multiple sources. The dataset can be used to evaluate the performance of cross-domain learning methods that involve transferring knowledge from synthetic 3D data to real-world point cloud data. Point-Syn2Real covers both indoor and outdoor settings as shown in Table 1. The multi-view synthesis considerably increases the number of instances available for training.

[0114] For the indoor setting, we extracted ten overlapping categories from ModelNet (as described in Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 1912–1920. IEEE Computer Society, 2015, which is included herein by reference), ShapeNet (as described in Angel X. Chang, Thomas A. Funkhouser, Leonidas J. Guibas, Pat Hanrahan, Qi-Xing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3D model repository. CoRR, abs/1512.03012, 2015, which is included herein by reference) and ScanNet (as described in Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas A. Funkhouser, and Matthias Nießner. Richly-annotated 3D reconstructions of indoor scenes. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2432–2443.
IEEE Computer Society, 2017, which is incorporated herein by reference) datasets, following the protocols described in Can Qin, Haoxuan You, Lichen Wang, C.-C. Jay Kuo, and Yun Fu. PointDAN: A multi-scale 3D domain adaption network for point cloud representation. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019, which is incorporated herein by reference. In particular, ModelNet and ShapeNet constitute the source domain of synthetic data and ScanNet forms the target domain of real-world data.

[0115] For the outdoor setting, we obtained five representative categories from 3D Warehouse (https://3dwarehouse.sketchup.com), ShapeNet, and SemanticKITTI (as described in Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jürgen Gall. SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 9296–9306. IEEE, 2019, which is incorporated herein by reference) datasets. We collected CAD models from 3D Warehouse and ShapeNet to construct the synthetic source domain, which we call the 3D_City subset. We take the real LiDAR object scans from SemanticKITTI as the real-world target domain. As the outdoor 3D object scans in SemanticKITTI are annotated point clouds, we only selected objects that have at least 30 points to make up the SemKitti_Obj subset. In Fig. 11, we show some examples of synthetic and real data. The class distribution of the dataset in the outdoor setting is highly imbalanced. It reflects the availability of public 3D models as well as the distribution of the available real-world data, which is mostly collected for autonomous driving applications.

Evaluation

[0116] We use DGCNN with a neighborhood size of $k = 20$ as the point cloud encoder. For training, we use a batch size of 32 and a maximum epoch number of 80. We use the Adam optimization algorithm with the learning rate set to 0.001 and the weight decay to $5 \times 10^{-5}$. The values of other hyperparameters can be found in the provided code. We implement model training and evaluation using PyTorch and an NVIDIA RTX 3090 GPU. Up to 5 GB of GPU memory was used during our experiments.

Table 2: Performance comparison of the considered approaches in the indoor settings (ModelNet to ScanNet and ShapeNet to ScanNet; columns: overall accuracy, weighted F1-score, MCC). [Table values omitted.]

[0117] In our evaluations, we use classification metrics including overall accuracy and weighted average F1-score. In addition, we calculate the Matthews correlation coefficient (MCC) to measure the performance of multi-class classification, especially given that the considered cross-domain datasets are imbalanced. An MCC value of +1 indicates perfect prediction, 0 no better than random prediction, and −1 perfect opposite prediction. For semi-supervised cross-domain learning, we use labeled synthetic data from the source domain and unlabeled real point clouds from the target domain to train the model. We use a small set of labeled real point clouds from the target domain to validate the model fit and tune the hyperparameters. We evaluate the eventual learned model on a held-out target-domain test set that is unseen during training.

[0118] We compare the performance of the proposed approach with a number of existing baseline approaches, as listed below.
• Supervised: The model trained on labeled real-world data from the target domain. It sets an upper bound on the performance of all cross-domain learning approaches.

• Baseline: The model trained only on the synthetic data of the source domain with no domain adaptation, multi-view simulation, or random augmentation.

• PointDAN: A domain-adaptation-based approach that utilizes local geometric structures and the global feature distribution.

• MMD: The maximum mean discrepancy approach that uses a discrepancy loss to align the global features between the source and target domains.

• DANN: The domain adversarial neural networks approach that utilizes adversarial training to align the global features across the source and target domains.

• DefRec+PCM: An approach that performs self-supervised deformation-reconstruction (DefRec) to learn cross-domain features using the point cloud mixup (PCM) procedure.

• Point-Syn2Real: The approach disclosed herein.

[0119] We denote the multi-view point cloud simulation by S, the data augmentation by A, and the inclusion of the entropy loss for semi-supervised learning by E.

Indoor Object Classification

[0120] In a typical indoor setting, common objects are furniture such as tables, chairs, and bookshelves/cupboards. We collect ten different types of furniture 3D models to train the point cloud encoder and extract key features of these objects. In indoor settings, 3D LiDAR scans usually have better resolution and lower noise compared with outdoor settings. As indoor areas are often smaller than outdoor areas of interest, it is easier to obtain high-resolution scans. In addition, indoor settings are generally more controlled and stable. Therefore, the scans are less likely to be contaminated with high levels of noise. Nonetheless, recognizing objects in the indoor scans, e.g., ScanNet, can be challenging as the object scans are often partially occluded. In particular, for some objects such as a bathtub and a bookshelf, only the top or forward facets are scanned due to the nature of their usage.

[0121] The results presented in Table 2 for the ModelNet to ScanNet case show that, when augmented via random rotations and additive Gaussian noise, the proposed approach outperforms the earlier domain adaptation method PointDAN, which aligns the distribution of the features learned in the source and target domains. There is a similar observation for the ShapeNet to ScanNet case, where the accuracy is improved from 33.90% (for PointDAN) to 50.37%.

[0122] However, the augmentation alone does not represent the real world, since objects may face different directions in the point cloud coordinate system when they are scanned in the real world. To this end, we generate simulated training data from multiple viewpoints, which results in slightly varied samples of the same object. Benefiting from both augmentation (A) and multi-view simulation (S), Point-Syn2Real A+S further improves the performance. We incorporate the knowledge of the unlabeled target-domain data into the training to further regularize the model, using the entropy loss for the unlabeled data (E), and adapt it to the target domain. The full model, Point-Syn2Real A+S+E, outperforms the state-of-the-art approach DefRec+PCM by 7.33% and 8.98% in the ModelNet to ScanNet and ShapeNet to ScanNet cases, respectively. The MCC score is also improved significantly compared with all existing methods. Overall, it is evident that the proposed approach offers significant performance improvement in the indoor settings.
In addition, our experiments demonstrate that our semi-supervised learning approach, through the use of the information entropy loss for unlabeled data, outperforms more complex domain adaptation methods. A detailed ablation study and comparison with the existing domain-adaptation-based methods, e.g., MMD and DANN, are provided below.

Outdoor Object Classification

[0123] For evaluation on outdoor objects, we extract the real object scans from the SemanticKITTI autonomous driving dataset. The LiDAR point cloud data in this dataset has been collected using a fast-moving vehicle while the scanned objects themselves may also be moving. In addition, the scanner and the objects are relatively distant. Compared to indoor settings, outdoor settings are generally larger and more dynamic, and the scans are more susceptible to noise and error. We train the model on the labeled synthetic source domain, i.e., 3D_City, and adapt the model to the target domain during training using unlabeled real point clouds from the SemKitti_Obj training split. A held-out test split from the target domain is used for testing. In Table 3, we present the performance evaluation results for the considered outdoor setting.

[0124] Both Point-Syn2Real A and Point-Syn2Real A+S perform better than the Baseline approach, attesting to the effectiveness of the utilized random augmentation and multi-view simulation. MMD has high accuracy and F1-score, close to those of the Supervised approach. However, its MCC is substantially lower than that of the Supervised approach. This is mainly because MMD is able to classify the more common classes with good accuracy but it fails with the classes that have low frequency. The proposed Point-Syn2Real A+S+E approach offers significant improvements over the other considered approaches in terms of all three metrics and draws close to the upper bound set by the Supervised approach. This means that the combination of random augmentation, multi-view simulation, and semi-supervised learning appreciably enhances the ability of cross-domain object classification models to generalize to the target domain with minimal supervision.

Table 3: Performance comparison of the considered approaches in the outdoor setting (3D_City to SemKitti_Obj; columns: overall accuracy, weighted F1-score, MCC). [Table values omitted.]
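For reference, the metrics reported in Tables 2 and 3 (overall accuracy, weighted F1-score, and MCC) can be computed with scikit-learn as in the sketch below. The helper classification_metrics is an illustrative assumption for this example, not part of the disclosure.

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef


def classification_metrics(y_true, y_pred):
    """Overall accuracy, weighted-average F1-score and Matthews correlation
    coefficient for multi-class predictions given as class indices."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted"),
        "mcc": matthews_corrcoef(y_true, y_pred),  # +1 perfect, 0 random, -1 inverted
    }
```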

Qualitative Analysis

[0125] Fig. 12 illustrates the feature space for ModelNet to ScanNet cross-domain learning using the Baseline approach and the proposed Point-Syn2Real approach, visualized with the t-SNE algorithm. For the visualizations, we randomly select 1000 point clouds from the target-domain dataset, i.e., ScanNet, and compute the corresponding object features using the DGCNN encoder trained on the ModelNet synthetic dataset via Baseline or Point-Syn2Real. Each point in Fig. 12 represents an object point cloud and is colored according to the corresponding object label. As seen in the figure, the proposed approach results in features that cluster more distinctly for each object class compared with those of the Baseline approach. This is particularly noticeable for classes that are less prevalent, such as “sofa” (orange), “cabinet” (pink), and “bed” (yellow).

Choice of Backbone Point Cloud Encoder

[0126] In Table 4, we give the performance evaluation results for the considered cross-domain learning settings using two different point cloud feature extraction models, namely, DGCNN and PointNet++. We implement the Baseline and Point-Syn2Real approaches in the same manner as described herein. The results in Table 4 show that both encoders lead to similar performance. Overall, Point-Syn2Real can benefit from both considered point cloud encoders, although, in general, DGCNN is slightly more advantageous and hence is our primary choice.

Table 4: Performance comparison of the Baseline and Point-Syn2Real approaches using the DGCNN and PointNet++ point cloud encoders. [Table values omitted.]

Ablation Study

[0127] We conduct an ablation study to better understand the relative contribution of each component in the proposed Point-Syn2Real approach. In Tables 2 and 3, we examine the benefits of including augmentation (A), multi-view simulation (S), and entropy loss (E) in both indoor and outdoor settings. For the considered indoor settings, as shown in Table 2, including the random augmentation alone increases the accuracy significantly, i.e., from 31.09% to 51.33% in the ModelNet to ScanNet case and from 24.02% to 50.37% in the ShapeNet to ScanNet case, compared to Baseline. This suggests that augmentation is an effective way of enhancing the generalization capacity of models trained on relatively small datasets such as ModelNet. For the considered outdoor setting, as seen in Table 3, augmentation alone does not lead to good performance while, together with multi-view simulation, it can improve the performance significantly. Semi-supervised learning through the use of the entropy loss improves the performance in both indoor and outdoor settings. In particular, in the considered outdoor setting, it increases the MCC value from 0.50 to 0.57. Minimizing the entropy of the posterior class probabilities predicted by the classifier for the unlabeled target training data encourages the classifier to make more confident predictions. This helps the learned model better generalize to unseen data from the target domain.
The inclusion of the entropy loss can also be perceived as a form of regularization that prevents the learned model from overfitting to the source domain without relying on any labeled data from the target domain.

[0128] Fig. 13 shows the class-wise accuracy of the Baseline, MMD, and Point-Syn2Real A+S+E approaches for the ModelNet to ScanNet case. The results indicate that Point-Syn2Real has the best accuracy for most classes. In particular, the accuracy for the Chair class is about 70% with Point-Syn2Real while it is around 30% with Baseline and MMD. The accuracy of Point-Syn2Real is lower than that of MMD for only three classes: Lamp, Monitor, and Plant. It is also interesting to observe that MMD is less accurate than Baseline for five classes. In general, there appears to be ample room for further improvement considering the class-wise accuracy values, although our proposed approach achieves an appreciable improvement over the state of the art.

Discussion on Domain Adaptation

[0129] In developing Point-Syn2Real, the processor learns models from synthetic data (source domain) that can generalize to corresponding real-world data (target domain) by simulating the data collection, augmenting the synthesized data, and exploiting the information available through unlabeled target-domain data. Nonetheless, methods based on domain adaptation (DA), such as MMD and DANN, have demonstrated promising results in similar tasks pertaining to 2D computer vision. MMD calculates a discrepancy loss and DANN applies adversarial training to adapt the distribution of the global features of the source domain to that of the target domain. Tables 2 and 3 include the performance evaluation results for the mentioned DA-based approaches as well. The results show that the considered DA-based approaches alone do not offer any significant benefit. They rather imply that DA for 3D point cloud data is a challenging research question. We conduct further experiments by applying random augmentation and multi-view simulation in conjunction with the DA-based approaches, the results of which are indicated by MMD+A+S and DANN+A+S. The results show that the considered data augmentation and multi-view simulation are not only beneficial in their own right but also essential for achieving good generalizability across synthetic and real point cloud data domains, regardless of the approach taken to adapt the domains.

Sensor system

[0130] Fig. 14 illustrates a sensor system 1400 comprising a LiDAR sensor 101 and an IMU 106. The LiDAR sensor 101 is of a laser range finder modality to capture first sensor data in the form of a point cloud. The point cloud has a strong link between the sensor data and the ground truth as explained above. The IMU 106 is of a motion sensing modality and captures acceleration data, which has a weak link between the sensor data and the ground truth as explained above. Other combinations of sensor modalities are also possible as explained above.

[0131] Sensor system 1400 further comprises a computer system 1401 having a processor 1402 performing machine learning methods, which is why processor 1402 may also be referred to as a machine learning processor.

[0132] Processor 1402 is connected to a program memory 1403, a data memory 1404, and a communication port 1405. The program memory 1403 is a non-transitory computer readable medium, such as a hard drive, a solid state disk or CD-ROM.
Software, that is, an executable program stored on program memory 1403, causes the processor 1402 to perform the methods disclosed herein and, in particular, the methods of Fig. 2 and Fig. 3. For example, processor 1402 trains a machine learning model by using the acceleration data as input to the machine learning model and reducing an error between an output of the machine learning model and an estimation of the ground truth calculated based on the point cloud. Processor 1402 can then evaluate the trained machine learning model using further acceleration data as input to the trained machine learning model.

[0133] The processor 1402 may then store the trained model parameters or the evaluation output on data memory 1404, such as on RAM or a processor register. Processor 1402 may also send the determined machine learning model parameters or the evaluation output via communication port 1405 to another computing device, such as a user device or a server.

[0134] The processor 1402 may receive data, such as acceleration data or point clouds, from data memory 1404 as well as from the communications port 1405. In one example, the processor 1402 receives sensor data from the sensors via communications port 1405, such as by using a wired connection or a Wi-Fi network according to IEEE 802.11. The Wi-Fi network may be a decentralised ad-hoc network, such that no dedicated management infrastructure, such as a router, is required, or a centralised network with a router or access point managing the network.

[0135] In one example, the processor 1402 receives and processes the sensor data in real time. This means that the processor 1402 determines the output every time sensor data is received from the sensors and completes this calculation before the sensors send the next sensor data update. This would be useful for simultaneous localisation and mapping (SLAM) or other autonomous vehicle navigation.

[0136] Although communications port 1405 is shown as a distinct entity, it is to be understood that any kind of data port may be used to receive data, such as a network connection, a memory interface, a pin of the chip package of processor 1402, or logical ports, such as IP sockets or parameters of functions stored on program memory 1403 and executed by processor 1402. These parameters may be stored on data memory 1404 and may be handled by value or by reference, that is, as a pointer, in the source code.

[0137] The processor 1402 may receive data through all these interfaces, which include memory access of volatile memory, such as cache or RAM, or non-volatile memory, such as an optical disk drive, hard disk drive, storage server or cloud storage. The computer system 1401 may further be implemented using graphical processing units (GPUs) or within a cloud computing environment, such as a managed group of interconnected servers hosting a dynamic number of virtual machines.

[0138] It is to be understood that any receiving step may be preceded by the processor 1402 determining or computing the data that is later received. For example, the processor 1402 determines features from the sensor data and stores the features in data memory 1404, such as RAM or a processor register. The processor 1402 then requests the data from the data memory 1404, such as by providing a read signal together with a memory address. The data memory 1404 provides the data as a voltage signal on a physical bit line and the processor 1402 receives the data via a memory interface.
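As a purely illustrative sketch of the training step described above, the loop below supervises a model that takes acceleration data as input with a ground-truth estimate derived from the point cloud. The names imu_model and pseudo_label_fn, and the use of a mean-squared-error loss, are assumptions made for this example only and are not details taken from the disclosure.

```python
import torch
import torch.nn.functional as F


def train_step(imu_model: torch.nn.Module, optimiser: torch.optim.Optimizer,
               acceleration: torch.Tensor, point_cloud: torch.Tensor,
               pseudo_label_fn) -> float:
    """One training step: the acceleration data (weak link to the ground truth) is the
    model input, and an estimate computed from the point cloud (strong link) supervises it."""
    target = pseudo_label_fn(point_cloud)      # estimation of the ground truth from LiDAR data
    output = imu_model(acceleration)           # prediction from the acceleration data
    loss = F.mse_loss(output, target)          # error between output and ground-truth estimate
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```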
[0139] It is to be understood that throughout this disclosure, unless stated otherwise, nodes, edges, graphs, solutions, variables, models and the like refer to data structures, which are physically stored on data memory 1404 or processed by processor 1402. Further, for the sake of brevity, when reference is made to particular variable names, such as “input” or “output”, this is to be understood to refer to values of variables stored as physical data in computer system 1401.

[0140] Figs. 2 and 3 are to be understood as a blueprint for the software program and may be implemented step-by-step, such that each step in Figs. 2 and 3 is represented by a function in a programming language, such as C++ or Java. The resulting source code is then compiled and stored as computer executable instructions on program memory 1403.

Conclusion

[0141] This disclosure provides a synthetic-to-real semi-supervised cross-domain learning approach, referred to as Point-Syn2Real, to learn 3D point cloud classification models that can generalize from the synthetic domain to the real world. Point-Syn2Real produces synthetic object point clouds by simulating their LiDAR scans from multiple viewpoints while inducing partial occlusions that may occur in real-world 3D scans. It then augments the simulated point clouds by applying random rotations and adding Gaussian noise. The synthesized point clouds are used to train the object classifier that includes a suitable point cloud encoder.

[0142] To mitigate the likelihood of overfitting to the synthetic data of the source domain and hence improve the performance, the method incorporates the entropy loss associated with the available unlabeled real data from the target domain into the training objective function. Through extensive experimentation with synthetic and real data in both indoor and outdoor settings, we showed that Point-Syn2Real outperforms several relevant approaches. This is because the utilization of data augmentation, multi-view simulation, and entropy loss enables Point-Syn2Real to better generalize the knowledge learned from the synthetic point cloud data to the real-world data. We also created a new point cloud dataset, Point-Syn2Real, that can be used to evaluate the performance of point cloud synthetic-to-real cross-domain learning methods.

[0143] It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the above-described embodiments, without departing from the broad general scope of the present disclosure. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.