Title:
A NEURAL NETWORK TRAINING METHOD AND APPARATUS THEREOF
Document Type and Number:
WIPO Patent Application WO/2024/083584
Kind Code:
A1
Abstract:
Disclosed is a multimodal neural network training method comprising performing, by way of a processor, a first training operation of a neural network model comprising a first dataset and generating, by way of the processor, a first action in response to at least one corresponding target response trained. The method further comprises performing, by way of the processor, a second training operation of a neural network model comprising a second dataset; and generating, by way of the processor, a second action in response to the at least one corresponding target response trained. The training method further comprises generating, by way of the processor, at least an output action predicted in response to at least one corresponding target response; and executing, by way of a generator network, the at least an output action predicted, the at least an output action predicted comprising: a trajectory procedure; a dialogue procedure, or a combination thereof. A multimodal neural network apparatus and an interactive autonomous robot are also disclosed.

Inventors:
ISSAC BABY FEBIN (SG)
HOY MICHAEL COLIN (SG)
SINGH RAHUL (SG)
Application Number:
PCT/EP2023/078117
Publication Date:
April 25, 2024
Filing Date:
October 11, 2023
Assignee:
CONTINENTAL AUTOMOTIVE TECH GMBH (DE)
International Classes:
G06N3/0475; G06F40/00; G06N3/045; G06N3/092; G06V20/50
Other References:
WANG YEFEI ET AL: "Audio-Visual Grounding Referring Expression for Robotic Manipulation", 2022 INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), IEEE, 23 May 2022 (2022-05-23), pages 9258 - 9264, XP034147034, DOI: 10.1109/ICRA46639.2022.9811895
ZHU YI ET AL: "Vision-Dialog Navigation by Exploring Cross-Modal Memory", 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 13 June 2020 (2020-06-13), pages 10727 - 10736, XP033804739, DOI: 10.1109/CVPR42600.2020.01074
MUTHUKUMAR PRATYUSH ET AL: "Generative Adversarial Imitation Learning for Empathy-based AI", ARXIV.ORG, 27 May 2021 (2021-05-27), Ithaca, XP093119629, Retrieved from the Internet [retrieved on 20240115], DOI: 10.48550/arxiv.2105.13328
Attorney, Agent or Firm:
CONTINENTAL CORPORATION (DE)
Claims:
Patent claims

1. A neural network training method (100) comprising: performing (102), by way of a processor, a first training operation of a neural network model comprising a first dataset; generating (104), by way of the processor, a first action in response to at least one corresponding target response trained; performing (106), by way of the processor, a second training operation of a neural network model comprising a second dataset and at least one corresponding target response; and generating (108), by way of the processor, a second action in response to the at least one corresponding target response trained; characterized in that: the training method (100) further comprises: generating (110), by way of the processor, an output action predicted in response to at least one corresponding target response; and executing, by way of a generator network (304), an output action (408) predicted comprising: a trajectory procedure (410); a dialogue procedure (412), or a combination thereof.

2. The method (100) of claim 1, characterized in that the neural network model is a multimodal-based neural network model.

3. The method (100) of claim 1 or 2, characterized in that the first dataset comprises: a first modality and a second modality.

4. The method (100) of claims 1 - 3, characterized in that the second dataset comprises: a second modality and a third modality.

5. The method (100) of claims 1 - 4, characterized in that the at least one corresponding target response comprises: a dialogue procedure including: an audio response; and/or a text response; a trajectory procedure comprising: a movement from a first position to a second position, or a combination thereof.

6. The method (100) of claims 1 - 5, characterized in that the training method further comprises: performing (202), by way of the generator network, a pre-training operation comprising a multi-domain dialogue dataset.

7. The method (100) of claims 1 - 6, characterized in that the training method (100) further comprises: performing (204), by way of an image classification module, a first auxiliary training operation for identifying one or more objects in an image.

8. The method (100) of claims 1 - 7, characterized in that the training method (100) further comprises: performing (206), by way of a named entity recognition module, a second auxiliary training operation for identifying at least a string of text input.

9. The method (100) of claims 1 - 8, characterized in that the training method further comprises: performing (208), by way of the processor, an imitation training operation for learning a policy network for predicting the at least an output action.

10. The method (100) according to claim 9, characterized in that the imitation training operation (208) is a multi-agent generative adversarial imitation learning (GAIL) technique.

11. The method (100) according to any one of the preceding claims, characterized in that the training method (100) further comprises a reinforcement learning process algorithm.

12. The method (100) according to claim 11, characterized in that the reinforcement learning process algorithm further comprises: performing (210), by way of a discriminator, a fourth training operation for segregating images containing information of types of surrounding according to an authenticity classification; and providing, by way of a replay buffer, the images classified according to the authenticity classification to the reinforcement learning process algorithm.

13. The method (100) of any one of the preceding claims, characterized in that the first modality is a set of audio input.

14. The method (100) of any one of the preceding claims, characterized in that the second modality is a set of text input.

15. The method (100) of any one of the preceding claims, characterized in that the third modality is a set of images.

16. A neural network apparatus comprising: a multimodal neural network model trained to predict at least an output action in response to at least one type of input received; and a generator network, characterized in that: the generator network (304) is operable to execute the at least an output action (408) predicted, the at least an output action (408) comprising: a trajectory procedure (410); a dialogue procedure (412), or a combination thereof.

17. The neural network apparatus according to claim 16, characterized in that the neural network model is a multimodal-based neural network model.

18. The neural network apparatus of claims 16 - 17, characterized in that the neural network model further comprises: an image classification module operable to identify one or more objects in an image captured by the neural network apparatus.

19. The neural network apparatus of claims 16 - 18, characterized in that the neural network model further comprises: a named entity recognition module operable to identify a string of audio input by the neural network apparatus.

20. The neural network apparatus of claims 16 - 19, characterized in that the neural network model further comprises: a discriminator (306) operable to segregate a plurality of images (308, 310) captured by the neural network apparatus, each of the images (308, 310) containing information of a surrounding, wherein the plurality of images is segregated according to an authenticity classification.

21. The neural network apparatus of claims 16 - 20, characterized in that the neural network model further comprises: a replay buffer (302) operable to receive an authenticity classified image to reinforce learning, wherein the authenticity classified image is classified by the discriminator (306) as authentic (308).

22. The neural network apparatus according to any one of the preceding claims, characterized in that the generator network (304) is operable to predict the at least an output action (408) according to the trajectory procedure (410), wherein the trajectory procedure (410) is a motion planning of the neural network apparatus.

23. The neural network apparatus according to any one of the preceding claims, characterized in that the generator network is operable to predict the at least an output action (408) according to the dialogue procedure (412), wherein the dialogue procedure (412) comprises a speech utterance of a natural language output by the neural network apparatus.

24. The neural network apparatus according to any one of the preceding claims, characterized in that the neural network apparatus is selected from a group comprising: an interactive autonomous robot; an interactive processor-implemented digital voice assistant; and a combination thereof.

25. A computer program product comprising instructions to cause the neural network apparatus of claims 16 to 24 to execute the steps of the method of claims 1 to 12.

26. A computer-readable medium having stored thereon the computer program of claim 25.

27. An interactive autonomous robot comprising: an audio device operable to receive an audio signal and transmit an audio signal; an image sensing device operable to capture at least one image signal in relation to a field of view of an environment forward of the image sensing device; and a processor comprising an algorithm stored thereon; characterized in that: in response to the audio signal received by the audio device and/or the at least one image signal captured by the image sensing device, the at least one processor is operable to execute at least an output action, wherein the at least an output action (408) comprises a command: to maneuver (410) the interactive autonomous robot from a first position to a second position; to execute a dialogue response (412) by the interactive autonomous robot, or a combination thereof.


Description:
A NEURAL NETWORK TRAINING METHOD AND APPARATUS THEREOF

Technical Field

This description relates to a neural network training method and a neural network apparatus.

Background

Increasingly, industrial efforts are moving towards automation to optimize efficiency. By way of an example, mobile robots have been deployed in smart cities to promote efficiency and improve productivity in daily tasks. To enhance human-robot interaction, existing robots incorporate voice-activated functions, for example digital voice assistant technology for speech assistance. Nonetheless, existing digital voice assistant technology implements oversimplified utterance, which cannot deliver intonation in the audio output, thus delivering utterances which sound very machine-like.

While human-robot interaction is a given in robotics technology, recognition of human pose and surrounding environment layout is not. With enhanced interaction between robots and the surrounding environment layout as well as human pose, robots are able to carry out tasks in more complex environments which require such functions. An example is a mobile robot for use as a delivery robot in a smart city.

There is therefore a need to provide a neural network training method and apparatus that overcomes, or at least ameliorates, the problems described above. Furthermore, other desirable features and characteristics will become apparent from the subsequent detailed description and the appended claims, taken in conjunction with the accompanying drawings and this background of the disclosure.

Summary

A purpose of this disclosure is to ameliorate the problems of (1) oversimplified utterance by robots and (2) the lack of robot trajectory commands capable of recognizing human pose and surrounding environment layout, by providing the subject-matter of the independent claims.

Further purposes of this disclosure are set out in the accompanying dependent claims.

The objective of this disclosure is solved by a neural network training method comprising: performing, by way of a processor, a first training operation of a neural network model comprising a first dataset; generating, by way of the processor, a first action in response to at least one corresponding target response trained; performing, by way of the processor, a second training operation of a neural network model comprising a second dataset; generating, by way of the processor, a second action in response to at least one corresponding target response trained; characterized in that: the training method further comprises: generating, by way of the processor, at least an output action predicted in response to at least one corresponding target response; and executing, by way of a generator network, the at least an output action predicted, the at least an output action predicted comprising: a trajectory procedure; a dialogue procedure, or a combination thereof.

An advantage of the above-described aspect of this disclosure yields a neural network training method which utilises a multimodal neural network for natural language processing and motion planning by using three types of modality. More advantageously, the training method uses imitation learning or inverse reinforcement learning to increase the accuracy of recognition through the use of different types of modalities. Even more advantageously, using a multimodal neural network enables a neural network apparatus having such a trained multimodal neural network to be capable of analysing an environment and generating text or audio commands proactively. Through direct learning, the neural network apparatus is operable to generate audio output which makes the neural network apparatus sound more like a human being, thus achieving the objective of generating a dialogue.

Preferred is a method as described above or as described above as being preferred, in which: the neural network model is a multimodal-based neural network model.

The advantage of the above aspect of this disclosure is to yield a training method for a multimodal neural network model which is operable to integrate different sources of information for the decision-making process.

Preferred is a method as described above or as described above as being preferred, in which: the first dataset comprises: a first modality and a second modality.

The advantage of the above aspect of this disclosure is to provide a training method for a multimodal neural network model, through the use of two different types of modalities.

Preferred is a method as described above or as described above as being preferred, in which: the second dataset comprises: a second modality and a third modality.

The advantage of the above aspect of this disclosure is to provide a training method for a multimodal neural network model, through the use of two different types of modalities.

Preferred is a method as described above or as described above as being preferred, in which: the at least one corresponding target response comprises: a dialogue procedure including: an audio response; and/or a text response; a trajectory procedure comprising: a movement from a first position to a second position or combination thereof.

The advantage of the above aspect of this disclosure is to provide a training method for a multimodal neural network model to generate or execute at least one corresponding target response. The at least one corresponding target response may comprise a dialogue procedure including an audio response and/or a text response. The at least one corresponding target response may further comprise a trajectory procedure. The trajectory procedure comprises a movement from a first position to a second position.

Preferred is a method as described above or as described above as being preferred, in which: performing, by way of the generator network, a pre-training operation comprising a multi-domain dialogue dataset.

The advantage of the above aspect of this disclosure is to provide a pre-training process for the generator network using a multi-domain dialogue dataset, such that the generator network is operable to modify and generate voice signals instead of executing a dialogue. In addition, the pre-training operation achieves the objective of enabling the generator network to interpret audio input and to perform audio generation.

Preferred is a method as described above or as described above as being preferred, in which: performing, by way of an image classification module, a first auxiliary training operation for identifying one or more objects in an image.

The advantage of the above aspect of this disclosure is to provide an auxiliary training operation for identifying different types of objects appearing in an image. More advantageously, object recognition for a trained neural network model is achieved.

Preferred is a method as described above or as described above as being preferred, in which: performing, by way of a named entity recognizer (NER) module, a second auxiliary training operation for identifying at least a string of text input.

The advantage of the above aspect of this disclosure is to provide an auxiliary training operation for identifying, amongst others, a string of text input, for example a sentence or paragraph, and identifying relevant nouns such as people, places, and organizations that are mentioned. The named entity recognizer module is operable to identify a user query, such that the trained neural network model may interpret or perceive a user’s request. By way of an example, a user may input to the trained neural network model a sentence, “Can you come near the lift lobby”. The trained neural network model is operable to interpret the user’s input statement.

Preferred is a method as described above or as described above as being preferred, in which: performing, by way of the processor, an imitation training operation for learning a policy network for predicting the at least an output action.

The advantage of the above aspect of this disclosure is to provide a combined training using imitation training operation to simulate different real-world scenarios. More advantageously, a trained neural network model is operable to navigate and interact with its surrounding or environment.

Preferred is a method as described above or as described above as being preferred, in which: the imitation training operation is a multi-agent generative adversarial imitation learning (GAIL) technique.

The advantage of the above aspect of this disclosure is to provide a GAIL algorithm for training, such that a trained neural network model learns the policy network to achieve the objective of predicting trajectory procedure and dialogue procedure to provide an output.

Preferred is a method as described above or as described above as being preferred, in which: the training method further comprises a reinforcement learning process algorithm.

The advantage of the above aspect of this disclosure is to provide a form of imitation training through a reinforcement learning process algorithm. By combining a reinforcement learning algorithm with the types of modalities described above, this training method yields a neural network model that is operable to generalize natural language and provide a natural language audio output.

Preferred is a method as described above or as described above as being preferred, in which: the reinforcement learning process algorithm further comprises: performing, by way of a discriminator, a fourth training operation for segregating images containing information of types of surrounding according to an authenticity classification; and providing, by way of a replay buffer, the images classified according to the authenticity classification to the reinforcement learning process algorithm.

The advantage of the above aspect of this disclosure is to provide a fourth training operation for the neural network model to segregate images according to an authenticity classification. This function may be executed by a discriminator. The segregation of images may be classified according to expert samples or generated samples.

Preferred is a method as described above or as described above as being preferred, in which: the first modality is a set of audio input.

The advantage of the above aspect of this disclosure is to provide a training method for a multimodal neural network model using a set of audio input. More advantageously, the modality data may be classified according to different types of user intention to simulate dialogue with the user.

Preferred is a method as described above or as described above as being preferred, in which: the second modality is a set of text input.

The advantage of the above aspect of this disclosure is to provide a training method for a multimodal neural network model using a set of text input, to achieve the objective of the neural network model comprehending speech or audio input from a user. More advantageously, the modality data may be classified according to different types of user intention to simulate different dialogues.

Preferred is a method as described above or as described above as being preferred, in which: the third modality is a set of images.

The objective of this disclosure is solved by a neural network apparatus comprising: a multimodal neural network model trained to predict at least an output action in response to at least one type of input received; and a generator network, characterized in that:

the generator network (304) is operable to execute the at least an output action (408) predicted, the at least an output action (408) comprising: a trajectory procedure (410); a dialogue procedure (412), or a combination thereof.

An advantage of the above-described aspect of this disclosure yields a neural network apparatus which uses a multimodal neural network model for natural language processing and motion planning using different types of modalities. More advantageously, the neural network apparatus is trained to use imitation learning or inverse reinforcement learning to increase the accuracy of recognition through using different or multiple types of modalities. Even more advantageously, using a multimodal neural network enables a neural network apparatus having such a trained multimodal neural network to be capable of analysing an environment and generating text or audio commands proactively. Through direct learning, the neural network apparatus is operable to generate audio output which makes the neural network apparatus sound more like a human being, thus achieving the objective of generating a dialogue.

Preferred is a neural network apparatus as described above or as described above as being preferred, in which: the neural network model is a multimodal-based neural network model.

The advantage of the above aspect of this disclosure is to yield a multimodal-based neural network model integrated in a neural network apparatus to predict at least an output action using multiple sources of modalities.

Preferred is a neural network apparatus as described above or as described above as being preferred, in which: the neural network model further comprises: an image classification module operable to identify one or more objects in an image captured by the neural network apparatus.

The advantage of the above aspect of this disclosure is to yield a neural network apparatus operable to identify one or more objects in an image captured by the neural network apparatus.

Preferred is a neural network apparatus as described above or as described above as being preferred, in which: the neural network model further comprises: a named entity recognition (NER) module operable to identify at least a string of text input received by the neural network apparatus.

The advantage of the above aspect of this disclosure is to yield a neural network apparatus operable to identify a string of text input, for example a sentence or a paragraph, and identify relevant nouns such as people, places or locations and organizations that are mentioned in the string of text input. Advantageously, the named entity recognizer module is operable to identify the entities and a user query, such that the trained neural network model may interpret the user’s request.

Preferred is a neural network apparatus as described above or as described above as being preferred, in which: the neural network model further comprises: a discriminator operable to segregate a plurality of images captured by the neural network apparatus, each of the images containing information of a surrounding, wherein the plurality of images is segregated according to an authenticity classification.

The advantage of the above aspect of this disclosure is to provide a neural network model to segregate images according to an authenticity classification. This function may be executed by a discriminator. The segregation of images may be classified according to expert samples or generated samples.

Preferred is a neural network apparatus as described above or as described above as being preferred, in which:

the neural network model further comprises: a replay buffer operable to receive an authenticity classified image to reinforce learning, wherein the authenticity classified image is classified by the discriminator as authentic.

The advantage of the above aspect of this disclosure is to provide a neural network model having a replay buffer to receive classified images to reinforce the learning process. The images are classified by a discriminator, of which the discriminator is operable to classify the images according to expert samples or generated samples. More advantageously, this allows reinforcement learning by the neural network model.

Preferred is a neural network apparatus as described above or as described above as being preferred, in which: the generator network is operable to predict the at least an output action according to the trajectory procedure, wherein the trajectory procedure is a motion planning of the neural network apparatus.

The advantage of the above aspect of this disclosure is to provide a neural network apparatus having a generator network operable to predict output action according to multiple modalities, including motion planning, but not limited thereto.

Preferred is a neural network apparatus as described above or as described above as being preferred, in which: the generator network is operable to predict the at least an output action according to the dialogue procedure, wherein the dialogue procedure is a text to speech utterance of a natural language output by the neural network apparatus.

The advantage of the above aspect of this disclosure is to provide a neural network apparatus having a generator network operable to predict an output action according to multiple modalities, including a dialogue procedure, but not limited thereto.

Preferred is a neural network apparatus as described above or as described above as being preferred, in which: the neural network apparatus is selected from a group comprising: an interactive autonomous robot; an interactive processor-implemented digital voice assistant; and a combination thereof.

The advantage of the above aspect of this disclosure is to yield an interactive autonomous robot or an interactive processor-implemented digital voice assistant operable to interact with at least one user through outputting a natural language utterance and/or navigating a surrounding layout.

The objective of this disclosure is solved by a computer program product comprising instructions to cause a neural network apparatus as described above to execute the steps of the method of claims as described above.

An advantage of the above-described aspect of this disclosure yields a computer program suitable for a neural network apparatus as disclosed herein, to execute the steps of at least an output action.

The objective of this disclosure is solved by a computer-readable medium having stored thereon the computer program as disclosed above.

An advantage of the above-described aspect of this disclosure yields a computer-readable medium, for example a non-transitory computer-readable medium installed with a computer program operable to execute the steps of at least an output action.

The objective of this disclosure is solved by an interactive autonomous robot comprising: an audio device operable to receive an audio signal and transmit an audio signal;

an image sensing device operable to capture at least one image signal in relation to a field of view of an environment forward of the image sensing device; and a processor comprising an algorithm stored thereon; characterized in that: in response to the audio signal received by the audio device and/or the at least one image signal captured by the image sensing device, the at least one processor is operable to execute at least an output action, wherein the at least an output action comprises a command: to maneuver the interactive autonomous robot from a first position to a second position; to execute a dialogue response by the interactive autonomous robot, or a combination thereof.

An advantage of the above-described aspect of this disclosure yields an interactive autonomous robot incorporating a multimodal neural network model for executing natural language processing and motion planning by using three types of modality. More advantageously, the training method uses imitation learning or inverse reinforcement learning to increase the accuracy of recognition through the use of different types of modalities.

Brief Description of Drawings

Other objects and aspects of this disclosure will become apparent from the following description of embodiments with reference to the accompanying drawings in which:

FIG. 1 shows a flowchart of a neural network training method in accordance with an embodiment.

FIG. 2 shows a flowchart of a neural network training method in accordance with some embodiments.

FIG. 3 shows a block diagram of a multimodal neural network in accordance with an embodiment.

FIG. 4 shows a block diagram of an audio-based interaction between a multimodal neural network apparatus and multiple multimodal neural network apparatus users in accordance with an embodiment.

FIG. 5 shows an image 500 of a surrounding layout in accordance with an embodiment.

FIG. 6 shows an image 600 of a surrounding layout in accordance with an embodiment.

FIG. 7 shows a multimodal neural network architecture in accordance with an embodiment.

In various embodiments described by reference to the above figures, like reference signs refer to like components in several perspective views and/or configurations.

Detailed Description

Hereinafter, the terms “first”, “second”, “third” and the like used in the context of this disclosure may refer to modification of different elements in accordance with various exemplary embodiments, but are not limited thereto. The expressions may be used to distinguish one element from another element, regardless of sequence or importance. By way of an example, “a first dataset” and “a second dataset” may indicate different types of modalities regardless of order or importance. On a similar note, a first dataset may be referred to as the second dataset and vice versa without departing from the scope of this disclosure.

The term “processor” may also refer to a “computer”, which may be implemented by one or more processing elements for example an array of logic gates, a controller and an arithmetic logic unit, a digital signal processor, a microcomputer, a programmable logic controller, a field-programmable gate array, a programmable logic array, a microprocessor, or any other device or combination of devices that is configured to respond to and execute instructions in a defined manner to achieve a desired result. In one example, a processor or computer includes, or is connected to, one or more memories storing instructions or software that are executed by the processor or computer. Hardware components implemented by a processor or computer may execute instructions or software, such as an operating system (OS) and one or more software applications that run on the OS, to perform the operations described in the context herein. The hardware components may also access, manipulate, process, create, and store data in response to execution of the instructions or software. For simplicity, the singular term “processor” or “computer” may be used in the description of the examples described in this disclosure, but in other examples multiple processors or computers may be used, or a processor or computer may include multiple processing elements, or multiple types of processing elements, or both. For example, a single hardware component or two or more hardware components may be implemented by a single processor, or two or more processors, or a processor and a controller. One or more hardware components may be implemented by one or more processors, or a processor and a controller, and one or more other hardware components may be implemented by one or more other processors, or another processor and another controller. One or more processors, or a processor and a controller, may implement a single hardware component, or two or more hardware components. A hardware component may have any one or more of different processing configurations, examples of which include a single processor, independent processors, parallel processors, single-instruction single-data (SISD) multiprocessing, single-instruction multiple-data (SIMD) multiprocessing, multiple-instruction single-data (MISD) multiprocessing, and multiple-instruction multiple-data (MIMD) multiprocessing.

The term “neural network apparatus” used in the context herein may refer to a hardware incorporating a processor or computer in communication with a memory,

which operates on a neural network model. For example, a neural network apparatus may refer to a processor-implemented robot, operable to perform, predict and execute at least an output action in accordance with training operations as disclosed herein.

The following detailed description is merely exemplary in nature and is not intended to limit the disclosure or the application and uses of the disclosure. Furthermore, there is no intention to be bound by any theory presented in the preceding background of the disclosure or the following detailed description. It is the intent of this disclosure to present a neural network training method and a neural network apparatus operable to execute at least an output action to interact with a user in natural language and autonomously navigate a surrounding.

Neural Network Model Training Method

FIG. 1 of the accompanying drawings shows a flowchart 100 illustrating steps for training a neural network. In a preferred embodiment, a multimodal neural network model is the artificial neural network for integrating multiple sources of information.

At step 102, a first training operation is performed by way of a processor. In a preferred embodiment, the first training operation may use a first dataset comprising a set of audio input and a set of images for generating at least one corresponding target response output. For example, the audio input used may simulate natural language processing scenarios in which the neural network interacts with a user. At the next step 104, the processor generates a first output according to the at least one corresponding target response output as trained in previous step 102.

At step 106, a second training operation is performed by way of the processor. In a preferred embodiment, the second training operation may use a second dataset. The second dataset comprises a set of text content, each of the text content classified in accordance with a corresponding type of surrounding. At the next step 108, the processor generates a second output action according to the at least one corresponding target response output as trained in previous step 106.

Phase 1 Training Model

The training operations as described above achieve an objective of training the neural network model to generate responses which may be integrated to predict actions. To further illustrate this objective, the training method disclosed achieves generating a tele-operated human-neural network apparatus dataset. This includes state-action pairs, i.e., training the neural network model using input with corresponding output data, and natural language. For the natural language text input training operation, input can be given selectively to cater to multiple scenarios. By way of an example, the training operations may simulate human-machine interaction while navigating a traffic-congested area by communicating to the neural network model, “Can you give me a way”, to which the neural network model may respond with a corresponding target response, such as “Where are you?”. In turn, a trainer of the neural network model may reply “I am near the lift”, to simulate real-world scenarios. Using such a system, the neural network model is trained to identify a traffic congestion and to decide whether a sidewalk pedestrian / user may adjust a pose, or an output policy action is required. The trained neural network model is operable to make a prediction of an output action according to a trajectory procedure, a dialogue procedure or a combination thereof.

To provide a training operation that simulates a busy city sidewalk, multiple sidewalk users may be placed in the same environment and the state-action pair of each sidewalk user can be recorded for separate training operations.

The training samples may be generated according to different types of scenarios, for example, only one neural network apparatus is in the environment, one neural network apparatus with one sidewalk user, or one neural network apparatus with multiple sidewalk users. To enhance the training operations, the trainer of the neural network model may also input to the neural network model random queries in audio input form, to simulate a sidewalk user walking near the neural network apparatus.

In a preferred embodiment, a sample input dataset may include information such as position, orientation and velocity information of the sidewalk user agent and a voice command response by a user, i.e., at least one audio input. At least one corresponding target output, i.e., at least an output action in response to the simulated audio input and/or surrounding, may be provided during the training operation, for example, a trajectory procedure. The trajectory procedure shall include a motion planning, to navigate around a sidewalk user’s request. In addition, the at least an output action shall include a dialogue procedure, to include a natural language response to the sidewalk user. A main advantage of this disclosure is the generation of a trajectory procedure to achieve navigation of a multimodal neural network apparatus by integrating different types of modalities to predict a motion of the neural network apparatus.
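By way of a non-limiting illustration, the following Python sketch shows what one such state-action training sample might look like; all field names are illustrative assumptions and are not taken from this disclosure.

```python
# Illustrative sketch of one Phase 1 state-action sample as described above.
# Field names are assumptions for exposition, not the patent's own notation.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SidewalkSample:
    position: Tuple[float, float]     # sidewalk user agent (x, y) position
    orientation: float                # heading, in radians
    velocity: Tuple[float, float]     # (linear, angular) velocity
    voice_command: Optional[str]      # at least one audio input, transcribed
    target_linear_v: float            # target trajectory procedure: motion planning
    target_angular_v: float
    target_utterance: Optional[str]   # target dialogue procedure: natural language

sample = SidewalkSample(position=(2.0, 5.5), orientation=1.57, velocity=(0.0, 0.0),
                        voice_command="Can you give me a way",
                        target_linear_v=0.0, target_angular_v=0.4,
                        target_utterance="Where are you?")
```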

Phase 2 Training Model

FIG. 2 shows a flowchart 200 illustrating further training operations. In a preferred embodiment, the training method includes a generator network. At step 202, a pre-training operation is performed using a multi-domain dialogue dataset. This pre-training operation modifies the generator network to generate voice signals directly instead of relying on a dialogue procedure which requires execution of at least an output action.

In some embodiments, the above steps are essential for training a multimodal neural network model to yield an interactive, processor-implemented voice-activated computer.

In some embodiments, the following steps and/or training operations yield an interactive autonomous robot, suitable for mobile delivery.

At step 204, a first auxiliary training operation is carried out to train an image classification model to identify the objects in an image, to simulate object recognition using images captured by the neural network model. In a preferred embodiment the image classification model is trained to identify objects that are commonly found along a sidewalk, for example roads, humans or users, plants, lamp-posts, etc. The image classification model may also be trained to identify locations in the images, to simulate location recognition in response to images captured by the neural network model.

At step 206, a second auxiliary training operation is performed to train a named entity recognition module. The objective of the second auxiliary training operation is such that the neural network model is operable to identify at least a string of text input captured by the neural network model. By way of an example, when a user inputs to a neural network apparatus having a neural network model trained in a manner as disclosed herein, say, “Can you come near the lift lobby?”, the neural network apparatus is operable to interpret the string of sentence such that the user’s request requires it to move from a first position, i.e. a position where the neural network apparatus is located, to a second position, i.e. a location near to a lift lobby where the user is standing. Thus, it can be seen that this interpretation enables the neural network apparatus to translate the user’s request into at least an output action.
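For illustration only, this second auxiliary objective can be approximated with an off-the-shelf NER pipeline; the sketch below uses spaCy (assuming the en_core_web_sm model is installed), although this disclosure does not name a particular library, and whether “lift lobby” is tagged depends on the model used.

```python
# Hedged sketch: identifying entities in a string of text input with a
# pretrained NER pipeline (spaCy); an assumed tool, not the patent's method.
import spacy

nlp = spacy.load("en_core_web_sm")          # pretrained English pipeline
doc = nlp("Can you come near the lift lobby?")

# Relevant nouns / named entities found in the user's query, if any.
for ent in doc.ents:
    print(ent.text, ent.label_)             # e.g. a FAC/LOC-like span
```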

Phase 3 GAIL Training

In some embodiments, the neural network training method may further comprise a reinforcement learning process algorithm. A main advantage of this disclosure is the use of a reinforcement learning process to increase the accuracy of recognition through the use of different types of modalities.

At step 208, an imitation training operation is performed, such that the neural network model learns a policy network to predict the at least an output action. In an embodiment, the imitation training operation is a multi-agent generative adversarial imitation learning (GAIL) technique.

At step 210, a discriminator performs a fourth training operation for segregating images containing information of types of surrounding according to an authenticity classification. The authenticity classification may be according to expert samples or generated samples.

At the next step 212, the images classified according to the authenticity classification are provided, by way of a replay buffer, to the reinforcement learning process algorithm.
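A compact sketch of steps 208 to 212 follows; `policy`, `discriminator` and `replay_buffer` are assumed objects with illustrative interfaces, since this disclosure describes the training flow but not a concrete implementation.

```python
# Hedged GAIL-style sketch of steps 208-212 (illustrative interfaces).
import torch
import torch.nn.functional as F

def gail_step(policy, discriminator, replay_buffer, expert_batch, opt_d):
    # Step 208: roll out the current policy network to obtain generated samples.
    gen_batch = policy.rollout()

    # Step 210: train the discriminator to segregate expert (authentic) samples
    # from generated ones -- the authenticity classification.
    d_expert = discriminator(expert_batch)
    d_gen = discriminator(gen_batch)
    loss_d = (F.binary_cross_entropy_with_logits(d_expert, torch.ones_like(d_expert))
              + F.binary_cross_entropy_with_logits(d_gen, torch.zeros_like(d_gen)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Step 212: hand the classified samples to the replay buffer feeding the
    # reinforcement learning process; -log(1 - D(.)) serves as imitation reward.
    reward = -F.logsigmoid(-d_gen).detach()
    replay_buffer.add(gen_batch, reward)
```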

GAIL Architecture

FIG. 3 shows a block diagram 300 of a multimodal neural network with a multi-agent generative adversarial imitation learning technique (GAIL), or GAIL architecture. The GAIL architecture includes a replay buffer 302, a generator network 304 and a discriminator 306. The arrows as shown in block diagram 300 represent transmission of information between the replay buffer 302, the generator network 304 and the discriminator 306. The generator network 304 functions to learn a policy network to predict at least an output action, which includes a trajectory procedure or motion planning, and a dialogue procedure which involves a natural language audio response. The objective of the generator network 304 may be achieved using simulated modality data of the neural network apparatus 312 and simulated modality data of neural network sidewalk user agents or sidewalk users 314. The generator network 304 may be updated for simulated modality data of the neural network apparatus and simulated modality data of sidewalk users 314 independently at different steps. This updating process may be supported using proximal policy optimization (PPO). PPO is an optimization technique in reinforcement learning. Using the PPO optimization technique, the policy update at each step after each training operation is minimal, i.e., policies do not create huge changes due to execution of updates. This has proven to be effective when using a reinforcement learning algorithm with text modality data.
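The PPO update mentioned above is commonly written as a clipped surrogate objective; the sketch below is the standard formulation rather than code from this disclosure.

```python
# Standard PPO clipped-surrogate loss (illustrative, not from the patent).
import torch

def ppo_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    ratio = torch.exp(log_prob_new - log_prob_old)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Taking the elementwise minimum keeps each policy update small,
    # matching the "minimal policy update" behaviour noted above.
    return -torch.min(unclipped, clipped).mean()
```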

The objective of the discriminator 306 is to segregate captured images into expert samples and generated samples. The classification of images may be according to an authenticity classification, to differentiate expert samples 308 and generated samples 310, such that the discriminator 306 is trained to identify an authentic or expert sample. The expert samples 308 and generated samples 310 are provided to the replay buffer 302, for training the discriminator 306.

Dataset: Audio / Text

For clarity and brevity, the following description explains the use of different types of modality for training the neural network model. As explained above, this disclosure uses a first dataset and a second dataset. More specifically, a first training operation includes the use of a first dataset comprising a set of audio input for training the neural network model and a set of images for training the neural network model to generate at least one corresponding target response output; a second training operation uses a second dataset comprising a set of text content, each of the text content classified in accordance with a corresponding type of surrounding, and a set of images containing information in relation to one or more types of real-world surrounding layouts.

Referring to FIG. 4 of the accompanying drawings, which shows a block diagram 400 illustrating an audio-based interaction between a multimodal neural network apparatus and one or more multimodal neural network apparatus users in a training operation or training environment. A neural network apparatus 402 receives an input 404 in the form of audio or voice content. The neural network apparatus 402 processes the input 404 and generates at least an output action, interacting with the environment (env). This output action may be a dialogue procedure response to the one or more users 406, and the dialogue procedure may be a generated audio output, of which the audio output is in natural language. On a similar note, the input 404 may be replaced with an input in the form of text content. It shall be understood by a skilled practitioner that the form of the input 404, whether audio, voice or text content, is not limited thereto. A main advantage of this disclosure is the generation of a dialogue procedure in response to an audio input or text content input, using a natural language output. This allows the neural network apparatus to interact with sidewalk users in a real-world scenario.

During training and inference, the neural network apparatus 402 communicates with the multimodal neural network apparatus users, i.e., the sidewalk users 314. After selecting the trajectory procedure as suggested by the generator network, the state information is updated based on the effect of the corresponding policy in the environment. The text input from the robot is broadcast to the corresponding sidewalk users 314, and vice versa.

Dataset: Images

Turning now to the use of a set of images, the input data to the neural network apparatus may be 2D image data captured by an image sensing device or 3D point cloud data received from an image sensing device. Suitable types of image sensing device may be a camera, an image sensor or a LiDAR sensor.

In some embodiments, the image dataset is the only form of input data to the neural network model. In some embodiments, the image dataset is used in combination with audio and/or text content dataset.

For brevity, the following information is provided to support understanding of the training of object recognition and image recognition using publicly available datasets, to explain the exemplary embodiments described below.

The image / point cloud data may be taken from databases available to the public. An example is the JackRabbot Dataset and Benchmark (JRDB), and further information on JRDB may be obtained from papers available to the public, for example:

• [Martin-Martin et al.] JRDB: A Dataset and Benchmark of Egocentric Robot Visual Perception of Humans in Built Environments, arXiv preprint arXiv:1910.11792 (2019); and

• [Ehsanpour et al.] JRDB-Act: A Large-scale Multi-modal Dataset for Spatio-temporal Action, Social Group and Activity Detection, arXiv preprint arXiv:2106.08827 (2021).

In an exemplary embodiment, the set of images includes a surrounding layout 500 that simulates a crowded scenario as shown in FIG. 5, but not limited thereto. In this exemplary embodiment the image frame is a 3D point cloud data input. Without combining audio input and/or text input, the neural network apparatus 402 is shown an image of a crowded surrounding. The trajectory procedure of the neural network apparatus indicates linear velocity as zero (0) and angular velocity as zero (0). The generator network predicts at least an output action of the neural network apparatus as no motion planned or movement planned, since the neural network apparatus is in a crowded surrounding and thus cannot move. In response, the generator network executes at least an output action comprising the dialogue procedure to generate a corresponding target response, in audio format, “Please give me a way”, and/or in corresponding text content, a statement containing the same text. Once the surrounding layout indicates the neural network apparatus is operable to move forward, the generator network integrates the different modalities to predict the trajectory procedure of the neural network apparatus, indicating linear velocity as one (1) and angular velocity as zero (0). No dialogue response is required.
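The two situations of FIG. 5 can be summarised by the following illustrative mapping; the dictionary layout is an assumption for exposition.

```python
# Illustrative mapping of the FIG. 5 situations to predicted output actions.
def output_action(crowded: bool) -> dict:
    if crowded:
        # Blocked: no motion planned; fall back to the dialogue procedure.
        return {"linear_v": 0.0, "angular_v": 0.0,
                "utterance": "Please give me a way"}
    # Path clear: trajectory procedure only, no dialogue response required.
    return {"linear_v": 1.0, "angular_v": 0.0, "utterance": None}
```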

In an exemplary embodiment, the set of images includes a surrounding layout 600 as shown in FIG. 6, which includes active conversation with multimodal neural network users 406 to assist the neural network apparatus 402 to interpret an orientation or position. In this exemplary embodiment, the surrounding layout 600 is a 3D point cloud data input. An input 404 may be in the form of audio or text content input, for example, “Where are you?” or a text content of the same statement. Combining with the image of surrounding layout 600, the generator network is operable to integrate the different modalities to predict the trajectory procedure 410 of the neural network apparatus, indicating linear velocity as one (1) and angular velocity as zero (0). The generator network predicts at least an output action of the neural network apparatus as motion planned or movement planned, triggering the neural network apparatus to move forward in a straight line. In response to the audio or text content input of “Where are you?”, the neural network model generates a dialogue procedure 412 which corresponds to a corresponding target response, for example, “Beside the stairs near the lobby”, in both audio and text content output.

A trained neural network model using the above training method may be implemented on a neural network apparatus. When the training method is complete, the neural network model is operable to integrate the different types of modalities and predict at least an output action in accordance with real-world scenarios.

Neural Network Apparatus

In a preferred embodiment, a neural network apparatus includes a neural network model and a processor. The processor includes a memory (not shown) to store a computer program product.

The processor includes a system operable to generate a first action in response to at least one corresponding target response trained matching with an audio input received.

The processor includes a system operable to generate a second action in response to at least one corresponding target response trained matching with a natural language text content input received.

The processor includes a multimodal-based system operable to generate a third action in response to at least one corresponding multimodal-based associated response trained matching with an image of a surrounding captured.

The neural network model further comprises a generator network operable to execute at least an output action predicted. The at least an output action comprises a trajectory procedure and/or a dialogue procedure. The neural network model may be a multimodal-based neural network model.

The neural network model may further comprise a named entity recognition module operable to identify a string of audio input captured by the neural network apparatus.

The neural network model further comprises a discriminator operable to segregate a plurality of images captured by the neural network apparatus. Each of the images contains information of a surrounding. The plurality of images is segregated according to an authenticity classification. The authenticity classification may be according to expert samples and generated samples.

The neural network model further comprises a replay buffer operable to receive an authenticity classified image to reinforce learning. The authenticity classified image is classified by the discriminator as authentic, or expert samples.

The generator network is operable to predict the at least an output action according to the trajectory procedure. The trajectory procedure is a motion planning of the neural network apparatus.

The neural network apparatus may be an interactive autonomous robot or an interactive processor-implemented digital voice assistant.

In one embodiment, an audio input or a text input produces at least an output action, which includes a trajectory procedure comprising a linear velocity, an angular velocity and/or a dialogue procedure comprising a corresponding target response in the form of an audio output, and/or a corresponding target response in the form of a text output.

Interactive Autonomous Robot

In an embodiment, a neural network apparatus is an interactive autonomous robot. The interactive autonomous robot includes an audio device, an image sensing device and a processor having a memory. The audio device is operable to receive an audio signal and transmit an audio signal. The image sensing device is operable to capture at least one image signal in relation to a field of view of an environment forward of the image sensing device, and the processor comprises an algorithm stored in memory.

In operation, the interactive autonomous robot responds to the audio signal received by the audio device and/or the at least one image signal captured by the image sensing device. The signals received by the interactive autonomous robot are input signals in the form of audio input, for which a corresponding target response may be generated from the algorithm stored in memory. The at least one processor is operable to execute at least an output action, wherein the at least an output action comprises a command to maneuver the interactive autonomous robot from a first position to a second position and/or to execute a dialogue response by the interactive autonomous robot.

Neural Network Specifications

The above neural network training method and neural network apparatus require a multimodal-based neural network.

Referring to FIG. 7 of the accompanying drawings, as shown is a multimodal neural network architecture 700, which includes a state input for image data and an audio or text content input.

The state input represents 2D image data. Assuming the pixel dimension of the 2D image is 250 X 250 (length X height), it can be flattened out to form a 62,500-long vector (250*250). For an RGB image, the dimensionality of the input image will be 3 X 62,500.

Text input may be a language model embedding of the input vector, which can be of the dimension 50 X 512, where 50 is the maximum word length of the text input and each word will be a 512-dimension vector.
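These dimensionalities can be checked with a few lines of Python (illustrative only):

```python
# Quick check of the input shapes quoted above.
import numpy as np

rgb = np.zeros((3, 250, 250))   # RGB state input, 250 x 250 pixels
flat = rgb.reshape(3, -1)       # flatten each channel
print(flat.shape)               # (3, 62500), i.e. 3 x 62,500

text = np.zeros((50, 512))      # up to 50 words, each a 512-dimension vector
print(text.shape)               # (50, 512)
```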

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
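In its usual scaled dot-product form (as in Vaswani et al., “Attention Is All You Need”), the attention function described above can be sketched as follows; this is the standard formulation, not code from this disclosure.

```python
# Scaled dot-product attention: a weighted sum of the values, with weights
# given by a compatibility function of the query with the corresponding key.
import torch
import torch.nn.functional as F

def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)  # query-key compatibility
    weights = F.softmax(scores, dim=-1)                     # weights over the values
    return weights @ v                                      # weighted sum of values
```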

The pillar of this architecture is a latent transformer chain. That means the Query (Q), Key (K) and Value (V) matrices are derived from the latent embedding itself.

Multiple cross-attention layers are included in an interleaved fashion. The cross-attention modules effectively encode the corresponding environmental (trajectory information) or audio data into the latent transformer representation. For the cross-attention modules, the Query matrix is derived from the latent representation, while the Key and Value matrices are derived from the input data. The same state and text input may be given at multiple levels in an interleaved fashion. This enables the transformer-based architecture to select the relevant portions of each input during training in the cross-attention module. The term “cross-attention” used in the context herein refers to a technique of merging two or more types of sequences, for example an image of a surrounding layout with an audio signal input.

For the latent transformer module, the Query, Key and Value matrices are derived from the latent embedding representation itself. Each environmental and audio input cross-attention layer is added in an interleaved fashion. An advantage of adding the layers in an interleaved fashion is that if a new modality is introduced, e.g., voice instead of text, this can be accommodated with ease by replacing the input layer.
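A minimal, Perceiver-style sketch of one interleaved cross-attention / latent transformer block is given below; layer sizes and class names are assumptions for illustration, and residual connections and normalisation details are omitted.

```python
# Hedged sketch of one cross-attention + latent transformer block.
import torch
import torch.nn as nn

class LatentBlock(nn.Module):
    def __init__(self, dim=512, heads=8, latent_len=64):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(latent_len, dim))  # latent array, length N
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.latent_tf = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, inputs):  # inputs: (batch, seq, dim) state or text data
        z = self.latent.expand(inputs.size(0), -1, -1)
        # Cross-attention: Query from the latent, Key/Value from the input data.
        z, _ = self.cross(z, inputs, inputs)
        # Latent transformer: Q, K and V all derived from the latent embedding.
        return self.latent_tf(z)
```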

Similarly, if the objective is to generate an audio signal as output instead of a dialogue policy, this can be done by attaching a decoder to the end of the network, instead of logits for the dialogue policy, and the network must then be trained on user/robot utterances instead. The length of the latent array (N) can be selected in such a way that it fits the computational capability of the machine and the complexity of the input data.

As can be seen from FIG. 7, the two components, i.e., the cross-attention layer and the latent transformer, are alternated one after another. The number of latent transformer layers after the input layer, which is represented by ‘L’, is preferably up to 8 layers. The dotted lines between different cross-attention modules indicate shared parameters.

The final output layer may consist of multiple outputs: one for the neural network apparatus trajectory procedure action and one for the dialogue procedure. For the neural network apparatus trajectory procedure, it will be a regression function to find the angular velocity and linear velocity of the neural network apparatus output action, which contains an activation function and is optimized using a squared-error loss function.

For the dialogue procedure, the output will again be a 50*768 representation of the text output, where 50 represents the max length of the sentence, and each token/word is represented by a 768-long vector.
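The two output heads described above might be sketched as follows; the pooled latent dimension and class names are assumptions, while the 2-value velocity regression and the 50 x 768 dialogue output follow the text.

```python
# Hedged sketch of the dual output heads: trajectory regression + dialogue tokens.
import torch.nn as nn

class OutputHeads(nn.Module):
    def __init__(self, dim=512, max_len=50, tok_dim=768):
        super().__init__()
        self.max_len, self.tok_dim = max_len, tok_dim
        # Trajectory head: (linear velocity, angular velocity), trained with
        # a squared-error (MSE) loss as described above.
        self.trajectory = nn.Linear(dim, 2)
        # Dialogue head: one 768-long vector per token, up to 50 tokens.
        self.dialogue = nn.Linear(dim, max_len * tok_dim)

    def forward(self, z):  # z: pooled latent representation, (batch, dim)
        vel = self.trajectory(z)                                   # (batch, 2)
        text = self.dialogue(z).view(-1, self.max_len, self.tok_dim)
        return vel, text
```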

In case of audio input, the text input layer and dialogue procedure output layer will be changed. The change is described as follows:

The input will be an audio waveform, which is an image representing the sound signal or recording. This will be treated like the image input once the waveform is generated from the audio. The output will also be an audio waveform, which can later be converted to sound signals, thus producing an audio output.
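One common way to realise treating the audio waveform like an image is to convert it into a (mel) spectrogram; the sketch below uses torchaudio as an assumed choice, since this disclosure does not prescribe a library.

```python
# Hedged sketch: converting an audio waveform into an image-like spectrogram.
import torch
import torchaudio

waveform = torch.randn(1, 16000)   # 1 second of mono audio at 16 kHz (dummy data)
to_spec = torchaudio.transforms.MelSpectrogram(sample_rate=16000)
spec = to_spec(waveform)
print(spec.shape)                  # (1, n_mels, frames) -- processed like an image
```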

Thus, it can be seen that a multimodal neural network training method and a multimodal neural network apparatus have been provided. A main advantage of the training method and apparatus disclosed herein is a neural network training method and apparatus which utilises a multimodal neural network for natural language processing and motion planning by integrating three types of modality to predict at least an output action. More advantageously, the training method and apparatus use imitation learning or inverse reinforcement learning to increase the accuracy of recognition through the use of different types of modalities. While exemplary embodiments have been presented in the foregoing detailed description of the disclosure, it should be appreciated that a vast number of variations exist.

It should further be appreciated that the exemplary embodiments are only examples, and are not intended to limit the scope, applicability, operation or configuration of the disclosure in any way. Rather, the foregoing detailed description will provide those skilled in the art with a convenient road map for implementing an exemplary embodiment of the disclosure, it being understood that various changes may be made in the function and arrangement of elements and method of operation described in the exemplary embodiment without departing from the scope of the disclosure as set forth in the appended claims.

List of Reference Signs