Title:
MIXED SAC BEHAVIOR CLONING FOR CAVITY FILTER TUNING
Document Type and Number:
WIPO Patent Application WO/2023/222383
Kind Code:
A1
Abstract:
A method of training a reinforcement learning, RL, agent to perform a task in an environment includes training a model-based reinforcement learning (MBRL) agent to perform the task, and generating a plurality of expert trajectories by applying the trained MBRL agent on the environment. The expert trajectories include sequences of state/action pairs obtained by applying the trained MBRL agent on the environment. A model-free reinforcement learning (MFRL) agent is trained based on the expert trajectories. The MFRL agent is applied to the environment to obtain a plurality of sample MFRL trajectories. The sample MFRL trajectories and the expert trajectories are combined to obtain a set of combined trajectories, and the MFRL agent is trained using the set of combined trajectories.

Inventors:
NIMARA DOUMITROU DANIIL (SE)
MALEK MOHAMMADI MOHAMMADREZA (SE)
HUANG VINCENT (SE)
WEI JIEQIANG (SE)
Application Number:
PCT/EP2023/061656
Publication Date:
November 23, 2023
Filing Date:
May 03, 2023
Assignee:
ERICSSON TELEFON AB L M (SE)
International Classes:
H03J1/00; G06N3/092
Domestic Patent References:
WO2020242367A1, 2020-12-03
Other References:
LINDSTAHL SIMON ET AL: "Reinforcement Learning with Imitation for Cavity Filter Tuning", 2020 IEEE/ASME INTERNATIONAL CONFERENCE ON ADVANCED INTELLIGENT MECHATRONICS (AIM), IEEE, 6 July 2020 (2020-07-06), pages 1335 - 1340, XP033807484, DOI: 10.1109/AIM43001.2020.9158839
NIMARA DOUMITROU DANIIL: "Model Based Reinforcement Learning for Automatic Tuning of Cavity Filters", 26 October 2021 (2021-10-26), XP055895894, Retrieved from the Internet [retrieved on 20220228]
LINDSTAHL, S.: "Reinforcement Learning with Imitation for Cavity Filter Tuning: Solving Problems by Throwing DIRT at Them" (Dissertation), 2019, Retrieved from the Internet
HARSCHER, R. VAHLDIECK, S. AMARI: "Automated filter tuning using generalized low-pass prototype networks and gradient-based parameter extraction", IEEE TRANSACTIONS ON MICROWAVE THEORY AND TECHNIQUES, vol. 49, no. 12, 2001, pages 2532 - 2538, XP011038510
Attorney, Agent or Firm:
ERICSSON (SE)
CLAIMS:

1. A method of training a reinforcement learning, RL, agent to perform a task in an environment, the method comprising: training (602) a model-based reinforcement learning, MBRL, agent to perform the task; generating (605) a plurality of expert trajectories by applying the trained MBRL agent on the environment, wherein the expert trajectories comprise sequences of state/action pairs obtained by applying the trained MBRL agent on the environment; training (606) a model-free reinforcement learning, MFRL, agent based on the expert trajectories; applying (613) the MFRL agent to the environment to obtain a plurality of sample MFRL trajectories; combining (614) the sample MFRL trajectories and the expert trajectories to obtain a set of combined trajectories; and training (616) the MFRL agent using the set of combined trajectories.

2. The method of Claim 1, further comprising: determining (610) a ratio of expert trajectories and sample MFRL trajectories; wherein combining the sample MFRL trajectories and the expert trajectories comprises combining the sample MFRL trajectories and the expert trajectories according to the determined ratio to obtain the set of combined trajectories.

3. The method of Claim 2, wherein determining the ratio of expert trajectories and sample MFRL trajectories is initially performed after training the MFRL agent based on the expert trajectories.

4. The method of Claim 3, further comprising: training (612) the MBRL agent to generate new expert trajectories after determining the ratio of expert trajectories and sample MFRL trajectories.

5. The method of Claim 2, wherein combining the expert trajectories and the sample MFRL trajectories comprises mixing the expert trajectories and the sample MFRL trajectories according to the determined ratio.

6. The method of any of Claims 2 to 5, further comprising: repeating steps of determining a ratio of expert trajectories and sample MFRL trajectories, generating sample MFRL trajectories, combining the expert trajectories and the sample MFRL trajectories, and training the MFRL agent using the set of combined trajectories.

7. The method of Claim 6, wherein the steps of determining a ratio of expert trajectories and sample MFRL trajectories, generating sample MFRL trajectories, combining the expert trajectories and the sample MFRL trajectories, and training the MFRL agent using the set of combined trajectories are repeated until a performance metric of the MFRL agent exceeds (618) a third threshold.

8. The method of Claim 6, wherein training the MFRL agent using the set of combined trajectories comprises training the MFRL agent to optimize a mixed loss function, wherein the mixed loss function is given as Lmixed = Lbc + LSAC, where Lbc is a behavior cloning objective and LSAC is a soft actor-critic objective.

9. The method of Claim 8, wherein LSAC comprises a critic component LQ,SAC and a policy component Lπ,SAC.

10. The method of Claim 9, wherein the mixed loss function Lmixed is defined according to the equation Lmixed = Lbc + LQ,SAC + Lπ,SAC.

11. The method of Claim 10, wherein: and wherein B is a behavior objective, Q symbolizes a critic network, dt has a value of 0 or 1, α is an entropy coefficient, rt is a reward, st is a state, at is an action, π is a policy, t is a time step, a' is an updated action, and Φ and θ represent learnable parameters of the critic network and an actor network, respectively.

12. The method of any previous Claim, wherein training the MFRL agent based on the combined trajectories comprises training the MFRL agent using behavior cloning based on the combined trajectories.

13. The method of any previous Claim, wherein training the MFRL agent using the set of combined trajectories comprises training the MFRL agent using the set of combined trajectories using mixed loss soft actor-critic, SAC, reinforcement learning.

14. The method of any previous Claim, wherein the environment comprises a tunable filter and wherein the task comprises tuning the tunable filter.

15. A computer-implemented method performed by a device (201, 1000) configured with a MFRL agent (203) trained according to the method of any previous Claim for radio frequency, RF, filter tuning, the computer-implemented method comprising: obtaining (911) a scattering parameter, S-parameter, reading for an RF filter; generating (913), from the MFRL agent, a value for influencing a tuning mechanism of the RF filter based on the S-parameter reading; and signaling (915) the value to a controller for automatic execution of the value on the tuning mechanism of the RF filter.

16. The method of Claim 15, wherein the tuning mechanism comprises a screw, and the value for influencing the tuning mechanism comprises a screw rotation value for changing a height of at least one screw of the RF filter.

17. The method of Claim 16, wherein the RF filter comprises a ceramic filter, and the value for influencing the tuning mechanism comprises an amount of material to remove for tuning a capacitor of the filter.

18. The method of any previous Claim, wherein: training the MBRL comprises training the MBRL until a performance metric of the MBRL exceeds a first threshold; and training the MFRL based on the expert trajectories comprises training the MFRL based on the expert trajectories until a performance metric of the MFRL exceeds a second threshold.

19. A device (201, 1000) configured with a reinforcement learning, RL, agent (203) for RF filter tuning, the device configured to perform operations according to any of Claims 1 to 18.

20. A device (201, 1000) configured with a reinforcement learning, RL, agent (203) for RF filter tuning, the device comprising: processing circuitry (1003); and memory (1005) coupled with the processing circuitry, wherein the memory includes instructions that when executed by the processing circuitry cause the device to perform operations according to any of Claims 1 to 18.

21. A computer program comprising computer code to be executed by a device (201, 1000) configured with a reinforcement learning, RL, agent (203) for RF filter tuning to perform operations according to any of Claims 1 to 18.

22. A computer program product comprising a non-transitory storage medium (1005) including program code to be executed by processing circuitry (1003) of a device (201, 1000) configured with a reinforcement learning, RL, agent (203) for RF filter tuning, whereby execution of the program code causes the device to perform operations according to any of Claims 1 to 18.

Description:
MIXED SAC BEHAVIOR CLONING FOR CAVITY FILTER TUNING

TECHNICAL FIELD

[0001] The present disclosure relates generally to methods for training a reinforcement learning (RL) agent, and in particular to methods for training an RL agent for performing cavity filter tuning using mixed soft actor-critic behavior cloning, and related systems and devices.

BACKGROUND

[0002] Radio frequency (RF) filters used in base stations for wireless communications may be demanding in terms of filter characteristics (e.g., the frequency response of the filter) needed to meet challenging requirements such as very narrow bandwidths (e.g., less than 50 MHz) and high attenuation requirements (e.g., more than 80 dB) at frequencies close to the frequency range(s) of the passband(s).

[0003] The frequency response of a filter is typically described with the help of scattering parameters, or S-parameters. S-parameter traces of filters typically include poles, which represent frequency points in the passband of the filter at which the input signal is not reflected and can therefore pass the filter with the least attenuation; whereas zeros (also referred to as transmission zeros) in S-parameters refer to frequency points in the stopband, or rejection band, of a filter at which no energy is transmitted.

[0004] Generally, increasing the number of poles may allow achieving higher attenuation levels, while these attenuation levels may be further increased for some frequency points (or certain frequency ranges) by the introduction of zeros.

[0005] In order to reach, e.g., a very narrow bandwidth with a high rejection ratio, a selected filter topology may need many poles and at least a couple of zeros (e.g., more than six poles and two zeros). For cavity filters, which may be used in base stations in a mobile communications system, the number of poles translates directly into the number of physical resonators of the manufactured filter. As a resonator is electromagnetically connected for some frequencies to the next resonator, a path from the input to the output is created, allowing energy to flow from the input to the output at the designed frequencies while some frequencies are rejected. When a pair of non-consecutive resonators are coupled, an alternative path for the energy is created. This alternative path is related to a zero in the rejection band.

[0006] In some cavity filters, each pole/resonator has a tunable structure (e.g., a screw, a rod, a knob, a peg, a bolt, a gear, etc.) which may be adjusted to endeavor to address inaccuracies in the manufacturing process, while each zero (due to consecutive or non-consecutive resonators) has another tunable structure to endeavor to control the desired coupling. The tuning of poles and zeros may be very demanding. Thus, in some approaches, tuning may be performed manually by a well-trained technician that manipulates the tunable structure and verifies the desired frequency response in a vector network analyzer (VNA).

[0007] Some approaches propose possible use of artificial intelligence (AI)/machine learning (ML) in a circuit-based simulator as a potential alternative to try to tune a filter. Some approaches propose possible use of AI/ML as a potential alternative to manual tuning to try to reduce tuning time per filter including, e.g., Lindstahl, S., Reinforcement Learning with Imitation for Cavity Filter Tuning: Solving problems by throwing DIRT at them (Dissertation) (2019), http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-254422 and WO2020242367, discussing an RL agent that may be able to solve 6p2z (p: poles, z: zeros) environments. However, in some situations there is a need to improve the accuracy of AI/ML-based tuning models even further.

SUMMARY

[0008] There currently exist certain challenges. Manual tuning of RF filters may be time consuming and expensive (e.g., thirty minutes to tune a cavity filter and costs associated with a person performing the tuning). While use of AI/ML may help reduce time and cost associated with manual tuning, such an approach may be impractical due to the time needed to train a ML agent to perform the tuning.

[0009] Certain aspects of the disclosure and their embodiments may provide solutions to these or other challenges. Some embodiments provide methods of training an RL agent to perform a task in an environment. The environment may be a tunable filter and the task may be tuning the tunable filter.

[0010] Some embodiments provide a method of training a reinforcement learning, RL, agent to perform a task in an environment. The method includes training a model-based reinforcement learning (MBRL) agent to perform the task, and generating a plurality of expert trajectories by applying the trained MBRL agent on the environment. The expert trajectories include sequences of state/action pairs obtained by applying the trained MBRL agent on the environment. A model-free reinforcement learning (MFRL) agent is trained based on the expert trajectories. The MFRL agent is applied to the environment to obtain a plurality of sample MFRL trajectories. The sample MFRL trajectories and the expert trajectories are combined to obtain a set of combined trajectories, and the MFRL agent is trained using the set of combined trajectories.

[0011] The method may further include determining a ratio of expert trajectories and sample MFRL trajectories. Combining the sample MFRL trajectories and the expert trajectories may include combining the sample MFRL trajectories and the expert trajectories according to the determined ratio to obtain the set of combined trajectories.

[0012] Determining the ratio of expert trajectories and sample MFRL trajectories may be initially performed after training the MFRL agent based on the expert trajectories.

[0013] The method may further include training the MBRL agent to generate new expert trajectories after determining the ratio of expert trajectories and sample MFRL trajectories.

[0014] Combining the expert trajectories and the sample MFRL trajectories may include mixing the expert trajectories and the sample MFRL trajectories according to the determined ratio.

[0015] The method may further include repeating steps of determining a ratio of expert trajectories and sample MFRL trajectories, generating sample MFRL trajectories, combining the expert trajectories and the sample MFRL trajectories, and training the MFRL agent using the set of combined trajectories.

[0016] The steps of determining a ratio of expert trajectories and sample MFRL trajectories, generating sample MFRL trajectories, combining the expert trajectories and the sample MFRL trajectories, and training the MFRL agent using the set of combined trajectories may be repeated until a performance metric of the MFRL agent exceeds a third threshold.

[0017] In some embodiments, training the MFRL agent using the set of combined trajectories may include training the MFRL agent to optimize a mixed loss function, wherein the mixed loss function is given as:

Lmixed = Lbc + LSAC, where Lbc is a behavior cloning objective and LSAC is a soft actor-critic objective.

[0018] LSAC may include a critic component LQ,SAC and a policy component Lπ,SAC. The mixed loss function Lmixed may be defined according to the equation Lmixed = Lbc + LQ,SAC + Lπ,SAC.

[0019] In some embodiments, Lbc + LQ,SAC + Lπ,SAC may be defined such that B is a behavior objective, Q symbolizes a critic network, dt has a value of 0 or 1, α is an entropy coefficient, rt is a reward, st is a state, at is an action, π is a policy, t is a time step, a' is an updated action, and Φ and θ represent learnable parameters of the critic network and an actor network, respectively.

[0020] Training the MFRL agent based on the combined trajectories may include training the MFRL agent using behavior cloning based on the combined trajectories. In some embodiments, training the MFRL agent using the set of combined trajectories may include training the MFRL agent using the set of combined trajectories using mixed loss soft actor-critic, SAC, reinforcement learning.

[0021] Some embodiments provide a computer-implemented method performed by a device configured with a MFRL agent trained as described above for RF filter tuning. The computer-implemented method includes obtaining a scattering parameter, S- parameter, reading for an RF filter, generating, from the MFRL agent, a value for influencing a tuning mechanism of the RF filter based on the S-parameter reading, and signaling the value to a controller for automatic execution of the value on the tuning mechanism of the RF filter.

[0022] In some embodiments, the tuning mechanism includes a screw, and the value for influencing the tuning mechanism includes a screw rotation value for changing a height of at least one screw of the RF filter.

[0023] In some embodiments, the RF filter includes a ceramic filter, and the value for influencing the tuning mechanism includes an amount of material to remove for tuning a capacitor of the filter.

[0024] In some embodiments, training the MBRL includes training the MBRL until a performance metric of the MBRL exceeds a first threshold, and training the MFRL based on the expert trajectories includes training the MFRL based on the expert trajectories until a performance metric of the MFRL exceeds a second threshold.

[0025] Some embodiments provide a device configured with a reinforcement learning, RL, agent for RF filter tuning, the device configured to perform operations described herein.

[0026] Some embodiments provide a device configured with a reinforcement learning, RL, agent for RF filter tuning. The device includes processing circuitry and memory coupled with the processing circuitry, wherein the memory includes instructions that when executed by the processing circuitry cause the device to perform operations described herein.

[0027] A computer program according to some embodiments includes computer code to be executed by a device configured with a reinforcement learning, RL, agent for RF filter tuning to perform operations described herein.

[0028] A computer program product according to some embodiments includes a non-transitory storage medium including program code to be executed by processing circuitry of a device configured with a reinforcement learning, RL, agent for RF filter tuning, whereby execution of the program code causes the device to perform operations described herein.

BRIEF DESCRIPTION OF DRAWINGS

[0029] The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate certain non-limiting embodiments of inventive concepts. In the drawings:

[0030] Figure 1 is a block diagram illustrating a process used by a human to tune a cavity filter;

[0031] Figure 2 is a block diagram illustrating an example embodiment of a device configured with an RL agent for RF filter tuning in accordance with some embodiments of the present disclosure;

[0032] Figures 3A-3D are curves illustrating S-parameter readings in accordance with some embodiments of the present disclosure;

[0033] Figure 4 illustrates an overview of a training procedure for the MBRL model according to some embodiments;

[0034] Figure 5A illustrates operations for training a world model;

[0035] Figure 5B graphically illustrates operations for training a world model;

[0036] Figure 6 is a flowchart of operations for training a model according to some embodiments;

[0037] Figures 7A and 7B are graphs of Agent performance for behavior cloning for various reward levels according to some embodiments;

[0038] Figure 8 is a block diagram of a device in accordance with some embodiments;

[0039] Figure 9 is a flowchart illustrating operations of a device in accordance with some embodiments; and

[0040] Figure 10 is a block diagram of a communication system in accordance with some embodiments.

DETAILED DESCRIPTION

[0041] Inventive concepts will now be described more fully hereinafter with reference to the accompanying drawings, in which examples of embodiments of inventive concepts are shown. Inventive concepts may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of present inventive concepts to those skilled in the art. It should also be noted that these embodiments are not mutually exclusive. Components from one embodiment may be tacitly assumed to be present/used in another embodiment.

[0042] The following description presents various embodiments of the disclosed subject matter. These embodiments are presented as teaching examples and are not to be construed as limiting the scope of the disclosed subject matter. For example, certain details of the described embodiments may be modified, omitted, or expanded upon without departing from the scope of the described subject matter.

[0043] The following explanation of potential problems with some approaches is a present realization as part of the present disclosure and is not to be construed as previously known by others.

[0044] To try to get closer to real filter products and to endeavor to provide an RL agent that may work in a reasonable amount of time to tune real, manufactured filter products, a three-dimensional electromagnetic field simulator (3D EM simulator) may be considered (such as CST Microwave Studio, which solves Maxwell's equations over a 3D body of a filter to generate more accurate results than, e.g., a circuit-based simulator). However, training an RL agent to model and tune an actual filter with a 3D EM simulator may use a large number of samples for training, may be time consuming, and may not be practical. For example, for every timestep of training, a simulation can be run with a complex 3D EM simulator, where each simulation can take several minutes. For instance, with a CST simulator (which may be used for frequency-domain simulations of RF filters, such as an 8p4z cavity filter), after optimizing a grid and a number of points where Maxwell's equations are solved, 3 minutes may be needed to get the scattering parameters (S-parameters). A model-free reinforcement learning (MFRL) based agent may need about 700,000 samples for training, which may make training with the CST simulator impractical (or even almost impossible), because about 700,000 simulations at 3 minutes each amount to roughly 2.1 million minutes, or about 4 years, needed to complete such training.

[0045] While model-based reinforcement learning (MBRL) techniques may decrease sample complexity of MFRL by leveraging a world model to boost training efficiency, the decrease may not be sufficient. Moreover, model-based learning may suffer due to inability of a model to capture all real-world effects.

[0046] Thus, existing approaches for training and using an RL agent to tune a RF filter with a simulator with sufficient accuracy, efficiency, and/or within a reasonable amount of time may be lacking.

[0047] Certain aspects of the disclosure and their embodiments may provide solutions to these or other challenges.

[0048] The term "RF filter" is used herein in a non-limiting manner and, as explained herein, can refer to any type of RF filter for communication applications including, without limitation, cavity filters. Further, the term "tuning mechanism" is used in a non-limiting manner and, as explained herein, can refer to any type of tunable structure on an RF filter (including, without limitation, a screw, a rod, a knob, a peg, a bolt, a gear, etc.) and/or a non-mechanical RF relevant property of a waveguide/filter material of the RF filter (including, without limitations, dielectric parameters, temperature, etc.) which may be modulated using non-mechanical electric parameters such as current, chemical, etc.

[0049] RF filters of the present disclosure include, without limitation, microwave cavity filters. Band-pass filters may be used in wireless communication systems to meet sharp and demanding requirements of commercial bands. Presently, cavity filters may be dominantly used due to low cost for mass production and high-Q-factor per resonator (e.g., especially for frequencies below 1GHz). This type of filter may provide high-Q resonators that can be used to implement sharp filters with fast transitions between pass and stop bands and high selectivity. Moreover, they can cope with high-power input signals.

[0050] Cavity filters can be applicable from as low as 50 MHz up to several GHz. This versatility in frequency range, as well as high selectivity, may make them a popular choice in many applications, such as in base stations in a communications system (e.g., a radio access network (RAN) node such as an eNodeB (eNB) and/or a gNodeB (gNB) in a mobile communications system).

[0051] A drawback of this type of narrow band filter may be that, since they require a very sharp frequency response, a small tolerance in the fabrication process may impact the final performance. An approach to avoid an expensive fabrication process is based on post-production tuning. In some approaches, post-production tuning uses highly trained human operators to manually tune the filter. Figure 1 is a block diagram illustrating a process used by a human 101 to tune a cavity filter 107. As illustrated, the tuning process may include turning a set of screws on the cavity filter 107 and, with the aid of a VNA 105, comparing how close a current filter frequency response (S-parameters 103, also referred to herein as an S-curve or S-parameter reading) is to a desired filter frequency response. This process is repeated until the measurement in the VNA 105 and the designed filter mask are close enough. Potential challenges with this approach may include cost (e.g., for the human), time (e.g., it may take up to 30 minutes to tune a single filter), and/or lack of automation.

[0052] In some cases, the filter 107 may include a ceramic waveguide filter that is tuned by removing portions of ceramic/silver material of the filter to change the filter characteristics.

[0053] Some approaches have tried to automate a tuning process. See, e.g., Harscher, R. Vahldieck, and S. Amari, "Automated filter tuning using generalized low-pass prototype networks and gradient-based parameter extraction," IEEE Transactions on Microwave Theory and Techniques, vol. 49, no. 12, pp. 2532-2538, 2001. doi: 10.1109/22.971646 ("Harscher"). Harscher discusses breaking the task into first finding underlying model parameters which generate a current S-curve and then performing a sensitivity analysis to adjust them so that they end up at the nominal (e.g., ideal) values of a perfectly tuned filter. In another approach, as discussed herein, AI has been proposed. However, for more complicated filters with more sophisticated topologies, there may be a need for a large amount of training samples and time to achieve desired performance, and a need to be more efficient and/or practical.

[0054] Certain embodiments of the present disclosure may provide solutions to these or other challenges. The method of the present disclosure can include an end-to-end process for the tuning of real, physical RF filters. The RL agent having transferred learning can generate values for influencing a tuning mechanism(s) of an RF filter in simulation, and signal the value(s) to a controller for automatic execution of the value on the tuning mechanism of the RF filter (e.g., by a robot which also can have direct access to S-parameter readings from a VNA). Actions can lie within [-1, 1] and correspond to altering the tuning mechanism. Altering the tuning mechanism may include, for example, altering the position of a tunable structure, such as the height of a screw by a specified amount (e.g., in millimeters), removing portions of ceramic/silver material, etc.
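As one illustration of the action convention described above, the following sketch maps a normalized action in [-1, 1] to a physical screw-height change; the scale factor and function name are hypothetical and not taken from the disclosure.

```python
# Minimal sketch (assumed names): map a normalized action in [-1, 1]
# to a physical adjustment of a tuning mechanism, e.g. a screw height in mm.
MAX_SCREW_DELTA_MM = 0.5  # assumed maximum per-step screw travel


def action_to_screw_delta(action: float, max_delta_mm: float = MAX_SCREW_DELTA_MM) -> float:
    """Scale an agent action in [-1, 1] to a screw-height change in millimeters."""
    clipped = max(-1.0, min(1.0, action))
    return clipped * max_delta_mm
```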

[0055] Figure 2 is a block diagram illustrating an example embodiment of a device 201 configured with an RL agent 203 for RF filter tuning in an end-to-end process in accordance with some embodiments of the present disclosure. RL agent 203 of a target simulator is trained by interacting either with a simulator or directly with a real filter (as illustrated in Figure 2). In the latter case, a robot 211 can automatically execute a value on a tuning mechanism of the RF filter (e.g., turn 213 physical screws on the real filter). A goal of RL agent 203 can include devising a sequence of actions that may lead to a tuned configuration as fast as possible.

[0056] Training of RL agent 203 is described as follows. RL agent 203 obtains an S-parameter observation o and generates an action a, evolving the system and yielding the corresponding reward r and next observation o'. The tuple (o, a, r, o') can be stored internally, as it can later be used for training. RL agent 203 checks 205, 207 whether it should train its world model and/or actor-critic networks (e.g., perform gradient updates every 10 steps). If not, RL agent 203 simulates values 209 (e.g., simulates values for screw rotations) and returns to obtaining the next S-parameter observation, repeating the cycle of generating an action, evolving the system, and collecting the corresponding reward and next observation.
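The interaction described in paragraph [0056] can be sketched as follows; the agent and environment interfaces (observe, act, step, store, update_world_model, update_actor_critic, max_steps) are assumed placeholders, and the 10-step update period follows the example in the text.

```python
# Sketch of the interaction loop in [0056]; interface names are assumed, not defined by the disclosure.
def run_training_episode(agent, env, update_period: int = 10):
    o = env.observe()                      # S-parameter observation o
    for step in range(env.max_steps):
        a = agent.act(o)                   # generate action a
        o_next, r, done = env.step(a)      # evolve the system -> reward r, next observation o'
        agent.store((o, a, r, o_next))     # keep the tuple for later training
        if step % update_period == 0:      # e.g., gradient updates every 10 steps
            agent.update_world_model()
            agent.update_actor_critic()
        if done:
            break
        o = o_next
```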

[0057] A goal of RL agent 203 can be quantified via the reward r, which can depict the distance that the current configuration has to a tuned one. Thus, in some embodiments, the reward value comprises a representation of a distance that a current configuration of the RF filter of the target simulator has to a tuned configuration. In some embodiments, the distance comprises a point-wise Euclidean distance between the current S-parameter values and the desired ones (that is, the S-parameter values of the tuned configuration of the RF filter of the target simulator) across the examined frequency range (as illustrated in Figures 3A-3D). If a tuned configuration is reached, the RL agent 203 receives a fixed r_tuned reward value (e.g., +10, +100, etc.).
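A minimal sketch of the reward described in [0057], assuming the current and desired S-parameters are sampled on a common frequency grid; the default bonus of +10 is one of the example values given in the text.

```python
import numpy as np


def reward(s_current: np.ndarray, s_target: np.ndarray,
           tuned: bool, r_tuned: float = 10.0) -> float:
    """Negative point-wise Euclidean distance to the tuned S-parameter curve,
    plus a fixed bonus (e.g., +10) once a tuned configuration is reached."""
    distance = float(np.linalg.norm(s_current - s_target))
    return -distance + (r_tuned if tuned else 0.0)
```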

[0058] Figures 3A-3D are graphs illustrating S-parameter readings from a VNA within a training loop in accordance with some embodiments of the present disclosure. S-parameter curves 301 (in dB) for an example embodiment of Figure 2 are shown throughout each graph. Requirements are indicated by the horizontal bars in each graph. For example, the curve 301 must lie above bar 303 in the pass band and below the four other horizontal bars 305, 307, 309, 311 indicating the stop band. The curve 313 (dotted line) and the curve 315 (dashed line) must lie below the bar 317 in the passband. As illustrated in Figure 3C, the filter satisfies the requirements after two time steps.

[0059] RL agent 203 can interact with the RF filter by changing a set of tuning mechanisms (e.g., tunable parameters via the tunable structures (e.g., screws)) of the RF filter. Thus, observations are mapped to rewards, which in turn are mapped (by the RL agent 203) to adjustments of values for influencing a tuning mechanism(s) (e.g., screw rotations), which lead to automatic execution of a value on the tuning mechanism of the RF filter (e.g., modifications via robot 211, such as turning 213 screws on the filter).

[0060] After training, at inference, the RL agent 203 is employed as illustrated in Figure 2. The RL agent 203 obtains the S-observation(s) provided from the VNA 105, translates them into the corresponding value for influencing a tuning mechanism of the RF filter (e.g., adjustments to tunable structures such as screw rotations), and signals the value to a controller for automatic execution (e.g., to robot 211). The automatic controller (e.g., robot 211) then executes 213 the value on the tuning mechanism of the RF filter (e.g., an adjustment(s) to tunable structures of the RF filter signaled by the RL agent 203). This process continues until a tuned configuration is reached.

[0061] As referenced herein, the method of various embodiments of the present disclosure uses transfer learning, or behavior cloning, for RL, which may significantly decrease sample complexity in the target simulator. According to some embodiments, an MBRL agent is trained. MBRL training can efficiently produce an RL agent that performs with reasonable accuracy. The MBRL agent is used to generate a plurality of expert trajectories that can be used to train an MFRL agent through behavior cloning. The MFRL agent can then be trained to attain a higher level of accuracy than the MBRL agent would be capable of within reasonable training times. The method may be particularly efficient for RF filter tuning, such as cavity filter tuning, where there may be a training time discrepancy between training an MBRL agent and training an MFRL agent.

[0062] Reinforcement learning is a learning method concerned with how an RL agent can take actions in an environment so as to maximize a numerical reward signal (also referred to herein as a reward value). In some embodiments of the present disclosure, the environment is an RF filter for mobile communication applications, for example a cavity filter. A cavity filter can have various topologies, e.g., such as a type of cavity filter with 6 poles and 2 zeros. In some embodiments, the RL agent of the source and target simulators is an algorithm which generates values for influencing tuning mechanisms (e.g., values for adjusting positions (e.g., heights) of tunable structures (e.g., screws, rods, knobs, pegs, bolts, gears, etc.)) of the RF filter.

[0063] To train and use the RL agent to tune the RF filter, in some embodiments, the environment can be treated as a black box, in which case an MFRL agent may be used; or the environment can be modeled by using an MBRL agent. Given sufficient samples (a regime which may be referred to as "asymptotic performance"), MFRL may tend to exhibit better performance than MBRL, as errors induced by the learned model of the target simulator may get propagated to the decision making of the RL agent (e.g., model errors may act as a bottleneck in performance).

[0064] On the other hand, an RL agent (e.g., an MBRL agent) may leverage a world model of the RL agent, which may boost training efficiency and may lead to faster training. For example, the RL agent of the target simulator can use the learned environment model to simulate a sequence of actions and observations, which in turn can give it a better understanding of the consequences of its actions. When designing an RL agent, a balance may need to be found between training speed and asymptotic performance. Achieving both may need careful modelling.

[0065] To summarize, MFRL can achieve a high success rate, but it takes a longer time to train. MBRL requires less time to train the agent, but the final trained agent exhibits lower accuracy. The problem to be solved is to achieve a final high-accuracy agent while keeping the training time low.

[0066] With current state-of-the-art model-free and model-based RL algorithms, it is difficult to train an agent efficiently, while maintaining the same degree of accuracy, on complicated filters, such as 8p4z filters. This also precludes consideration of more complex filters for automatic tuning.

[0067] Some embodiments described herein utilize imitation learning for cavity filter tuning using MBRL and model-free reinforcement learning (MFRL) to significantly decrease the sample complexity exhibited by the MFRL agent. In particular, some embodiments provide a framework that can leverage the early training efficiency of MBRL to boost the performance of MFRL, which may decrease sample complexity without sacrificing performance. According to some embodiments, a sample-efficient MBRL agent is trained for a limited number of steps, as an MBRL agent may reach good performance significantly faster than an MFRL agent. The MBRL agent is then used to generate expert data, which is leveraged by the MFRL agent to enhance its training efficiency.

[0068] Brief reference is made to Figure 6, which illustrates operations for training an RL agent according to some embodiments. As shown therein, at block 602, an MBRL agent is trained on an environment (such as a circuit simulator) using an environment model until a performance threshold is exceeded at block 604 (when a performance metric, such as a success rate, exceeds a first threshold Thr1). For example, an MBRL agent can achieve over a 90% success rate at filter tuning significantly faster than its MFRL counterpart and, as such, surpassing this threshold can be done efficiently.

[0069] At block 605, the MBRL agent is used to generate expert trajectories. A "trajectory" is a sequence of tuples (o_t, a_t) that represent actions by an agent and associated observations of the state of an environment as it performs a task for which it was trained. A trajectory thus represents the behavior of a trained agent. Because the MBRL agent has been trained to exceed a certain success rate, a trajectory generated by operation of the agent can be considered an "expert" trajectory that can be used to train other agents.

[0070] Accordingly, at block 606, a soft actor-critic (SAC) MFRL agent is pre-trained exclusively on the expert trajectories using an imitation learning technique such as behavior cloning. Training is performed until a performance metric of the MFRL agent, such as a success rate, exceeds a second threshold (Thr2) at block 608.

[0071] At block 610, the method determines a ratio of expert trajectories generated by the MBRL agent and sample trajectories generated by the MFRL agent to be used for subsequent training based on the performance of the MBRL and MFRL agents. For example, when the performance of the MFRL agent is low, more expert trajectories from the MBRL agent and fewer sample trajectories from the MFRL agent may be used, and vice versa.

[0072] At block 612, the method optionally trains the MBRL agent on the environment and generates new expert trajectories. That is, in some embodiments, the system/method may continue to train the MBRL agent to improve the performance of the MBRL agent.

[0073] At block 613, the MFRL agent is applied on the environment (or simulated environment), and sample trajectories are collected from the MFRL agent.

[0074] At block 614, expert trajectories generated by the MBRL agent and sample trajectories generated by the MFRL agent are combined and mixed according to the percentage determined at block 610. For example, the method may determine that a set of trajectories to be used for training the MFRL should consist of 75% from the MBRL and 25% from the MFRL. An appropriate number of such trajectories are then selected and combined into a single set of training trajectories, which may be randomly distributed.

[0075] At block 616, the MFRL agent is trained using the mixed trajectories. In particular, the MFRL agent may be trained using a mixed SAC-behavior cloning objective as described in more detail below.

[0076] The operations of blocks 610 to 618 are repeated until a performance metric of the MFRL agent exceeds a third threshold (Thr3) at block 618, at which point the performance requirement for MFRL is fulfilled. The performance requirement can be a success rate threshold. Alternatively, training can be stopped when the performance of the model does not improve for a predefined number of iterations.
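The overall procedure of Figure 6 can be sketched as follows. The agent/environment interfaces (train, success_rate, generate_trajectories, train_behavior_cloning, train_mixed_loss) and the ratio schedule are assumptions for illustration only, not an API defined by the disclosure.

```python
import random

def combine(expert, samples, expert_ratio, size=10_000):
    """Block 614: combine expert and MFRL sample trajectories according to the ratio."""
    n_expert = int(size * expert_ratio)
    mixed = (random.sample(expert, min(n_expert, len(expert)))
             + random.sample(samples, min(size - n_expert, len(samples))))
    random.shuffle(mixed)
    return mixed


def train_rl_agent(mbrl, mfrl, env, thr1, thr2, thr3):
    while mbrl.success_rate(env) <= thr1:          # blocks 602/604: train the MBRL agent
        mbrl.train(env)
    expert = mbrl.generate_trajectories(env)       # block 605: expert trajectories

    while mfrl.success_rate(env) <= thr2:          # blocks 606/608: behavior-cloning pretraining
        mfrl.train_behavior_cloning(expert)

    while mfrl.success_rate(env) <= thr3:          # blocks 610-618: mixed training loop
        ratio = 1.0 - mfrl.success_rate(env)       # block 610: more expert data while MFRL is weak (assumed schedule)
        expert += mbrl.generate_trajectories(env)  # block 612 (optional): refresh expert data
        samples = mfrl.generate_trajectories(env)  # block 613: sample MFRL trajectories
        mixed = combine(expert, samples, ratio)    # block 614: combine according to the ratio
        mfrl.train_mixed_loss(mixed)               # block 616: mixed SAC + behavior-cloning loss
    return mfrl
```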

[0077] Accordingly, some embodiments provide a progressive/incremental approach for training an RL agent that tunes microwave filters. Some embodiments described herein bridge the gap between MFRL and MBRL, by using imitation learning to generate an MFRL agent which is both efficient early and performant.

[0078] A training method as described herein may have certain advantages, such as faster training and faster hyperparameter tuning. For example, using embodiments described herein, MFRL training time may be reduced by a factor of 2. Because training is faster, the hyperparameter space can be searched faster, which enables consideration of more intricate filter environments, including those for which auto-tuning is not currently possible.

[0079] Some embodiments may also provide modularity. For instance, other MBRL and MFRL algorithms can be utilized. Furthermore, other environments can be considered. For instance, one can utilize the described embodiments to improve sample efficiency of an agent trained on a 3D simulator. Finally, the described embodiments can be extended for the tuning of other types of filters (not cavity filters), as there are no explicit assumptions about the topology of the filter.

[0080] Some embodiments may also provide increased flexibility. For example, if a clear performance threshold is not evident for a given task, the described embodiments can be leveraged incrementally. The flexibility of this approach makes it applicable in more complicated environments.

[0081] As briefly explained above, some embodiments use imitation learning for RL to decrease sample complexity. In particular, an MBRL agent is trained to act as an expert and enhance the efficiency of an MFRL agent. A mixed behavior cloning - SAC objective allows the MFRL agent to train its actor and critic networks, while simultaneously considering actions taken by the expert agent.

[0082] According to some embodiments, imitation learning is used to train an MFRL agent. To do so, an MBRL agent is first trained and then deployed to tune previously unseen detuned filters. This tuning process generates a plurality of trajectories, which consist of sequences of state-action pairs that can be leveraged for behavior cloning. This dataset is used in the pretraining stage of the MFRL agent, as well as afterwards, where a mixed objective is devised which encompasses both behavior cloning and SAC losses.

[0083] Imitation learning is a well-established machine learning technique in which expert data are leveraged to help, guide and train an agent. One characteristic example of such an approach is called behavior cloning, where state-action pairs are accumulated by the expert and then used to train an agent's policy network in a supervised learning arrangement. In essence, the agent learns to mimic the actions performed by the expert. An issue with this approach is that the agent is bottlenecked by the performance of the expert and typically struggles to surpass it. Furthermore, using this as an initial configuration on top of which one trains an agent in an RL setup might also be problematic, as the optimization objectives differ. For this reason, some embodiments extend the behavior cloning loss by incorporating it into the state-of-the-art SAC MFRL algorithm, yielding a mixed loss.

[0084] While embodiments discussed herein are explained in the non-limiting context of an RL agent comprising a Dreamer/modified Dreamer-based architecture, the invention is not so limited and includes any RL agent configured to perform operations according to embodiments disclosed herein. The actor of the Dreamer chooses the actions performed by the RL agent, and bases its decisions purely on a lower dimensional latent space. The Dreamer leverages a world model to imagine trajectories, without requiring the generation of actual observations. Thus, it may be beneficial to plan in a lower dimensional, information rich, latent space.

[0085] The Dreamer includes an Actor-Critic network pair and a world model. The world model is fit onto a sequence of observations, so that it can reconstruct an original observation from the latent space and predict the corresponding reward. The Actor and Critic receive as an input the latent representation of the observations. The Critic aims to predict the value of a state (e.g., how close the RF circuit is to a tuned configuration), while the Actor aims to find the action which would lead to a configuration exhibiting a higher value (e.g., more tuned). The Actor obtains more precise value estimates of its output by leveraging the world model to examine the consequences of its actions multiple steps ahead.

[0086] Training of the Dreamer/modified Dreamer may include initializing an experience buffer with a random RL agent. The random RL agent trains a world model on a testing sample. The world model reconstructs an original observation from the latent space and predicts the corresponding reward. The Actor and Critic receive as an input the latent representation of the observations. The experiences from interacting with the environment are added to the experience buffer. This process is repeated until the RL agent performs at the desired level.

[0087] World model training may include observations being fed through an encoder. The world model may be trained so as to simultaneously maximize the likelihood of generating the correct environment rewards r and maintain an accurate reconstruction of the original observation via a decoder. Actions are denoted as a_i.

[0088] Thus, an RL agent having a Dreamer-based architecture learns from interacting with the environment to train its value, action, reward, transition, representation and observation models.

[0089] Figure 4 illustrates an overview of a training procedure for an MBRL model according to some embodiments. In step 401, an experience buffer is initialized. The experience buffer may include random seed episodes, wherein each seed episode includes a sequence of experiences. Alternatively, the experience buffer may include a series of experiences not contained within seed episodes. Each experience is described by a tuple of the form (o_t, a_t, r_t, o_t+1).

[0090] When drawing information from the experience buffer, the MBRL model may, for example, select a random seed episode, and may then select a random sequence of experiences from within the selected seed episode.
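The experience buffer of paragraphs [0089]-[0090] can be sketched as below; the class and method names are illustrative rather than defined by the disclosure.

```python
import random

# Sketch of the experience buffer in [0089]-[0090]; names are assumed for illustration.
class ExperienceBuffer:
    def __init__(self):
        self.seed_episodes = []            # each episode is a list of (o_t, a_t, r_t, o_t+1) tuples

    def add_episode(self, episode):
        self.seed_episodes.append(episode)

    def sample_sequence(self, length):
        """Pick a random seed episode, then a random contiguous sequence within it."""
        episode = random.choice(self.seed_episodes)
        start = random.randint(0, max(0, len(episode) - length))
        return episode[start:start + length]
```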

[0091] The neural network parameters of the various neural networks in the model may also be initialized randomly. In step 402, the world model is trained. In step 403, the actor-critic model is trained.

[0092] In step 404, the updated model interacts with the environment to add experiences to the experience buffer. The method then returns to step 402. The method may then continue until the network parameters of the world model and the actor-critic model converge, or until the model performs at a desired level.

[0093] Figure 5A illustrates training the world model in step 402 of Figure 4 in more detail. Figure 5B graphically illustrates how the world model may be trained in step 402 of Figure 4. In Figure 5B, all blocks that are illustrated with non-circular shapes are trainable. In other words, the neural network parameters for the models represented by the non-circular blocks may be updated during training of the world model at step 402 of Figure 4.

[0094] Referring to Figures 5A and 5B, in step 521, the method obtains a sequence of observations, o_t, representative of the environment at a time t. For example, as illustrated in Figure 5B, the encoder 501 is configured to receive the observations o_t-1 503a (at time t-1) and o_t 503b (at time t). The illustrated observations may be S-parameters of a cavity filter. This is given as an example of a type of observation, and is not limiting.

[0095] In step 522, the method estimates latent states s_t at time t using a representation model, wherein the representation model estimates the latent states s_t based on the previous latent states s_t-1, previous actions a_t-1 and the observations o_t. The representation model is therefore based on previous sequences that have occurred. For example, the representation model estimates the latent state s_t 502b at time t based on the previous latent state s_t-1 502a, the previous action a_t-1 504 and the observation o_t 503b.

[0096] In step 523, the method generates modeled observations, o_m,t, using an observation model (q(o_m,t | s_t)), wherein the observation model generates the modeled observations based on the respective latent states s_t. For example, the decoder 505 generates the modeled observations o_m,t 506b and o_m,t-1 506a based on the states s_t and s_t-1, respectively.

[0097] The step of generating the modeled observations includes determining means and standard deviations based on the latent states s_t. For example, generating the modeled observations may include determining a respective mean and standard deviation based on each of the latent states s_t. This is in contrast to the original "Dreamer" model, which (as described above) produces only means based on the latent states in the observation model.

[0098] The decoder 505 determines a mean and a standard deviation based on the latent state s_t it receives as an input. As previously described, the decoder includes a neural network configured to attempt to map the latent state s_t to the corresponding observation o_t.

[0099] The output modeled observation o_m,t may then be determined by sampling a distribution generated from the determined mean and standard deviation.
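A minimal sketch of the observation-model head described in [0097]-[0099]: the decoder maps a latent state to a mean and standard deviation and samples the modeled observation from the resulting distribution. The layer sizes and class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Sketch of the observation model head in [0097]-[0099]; architecture details are assumed.
class ObservationHead(nn.Module):
    def __init__(self, latent_dim: int, obs_dim: int):
        super().__init__()
        self.mean = nn.Linear(latent_dim, obs_dim)
        self.log_std = nn.Linear(latent_dim, obs_dim)

    def forward(self, s_t: torch.Tensor) -> torch.Tensor:
        mean = self.mean(s_t)
        std = self.log_std(s_t).exp()                  # keep the standard deviation positive
        dist = torch.distributions.Normal(mean, std)
        return dist.rsample()                          # sampled modeled observation o_m,t
```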

[00100] In step 524, the method minimizes a first loss function to update network parameters of the representation model and the observation model, wherein the first loss function includes a component comparing the modeled observations o_m,t to the respective observations o_t. In other words, the neural network parameters of the representation model and the observation model may be updated based on how similar the modeled observations o_m,t are to the observations o_t.

[00101] In some examples the method further includes determining a reward r_t based on a reward model (q(r_t | s_t)) 507, wherein the reward model 507 determines the reward r_t based on the latent state s_t. The step of minimizing the first loss function may then be further used to update network parameters of the reward model. For example, the neural network parameters of the reward model may be updated based on minimizing the loss function. The first loss function may therefore further include a component relating to how well the reward r_t represents a real reward for the observation o_t. In other words, the loss function may include a component measuring how well the determined reward r_t matches how well the observation o_t should be rewarded.

[00102] The overall world model may therefore be trained to simultaneously maximize the likelihood of generating the correct environment rewards r and to maintain an accurate reconstruction of the original observation via the decoder 505.
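One compact way to write the first loss function named in [00100]-[00102], with the observation-reconstruction and reward-prediction components expressed as squared-error terms; this is a sketch only, it omits any latent-state regularization the full world model may use, and r_m,t denoting the reward predicted by the reward model is an assumed symbol.

```latex
L_{\text{world}} = \mathbb{E}_t\!\left[ \lVert o_{m,t} - o_t \rVert^{2} + \left( r_{m,t} - r_t \right)^{2} \right]
```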

[00103] Behavior cloning works well when the expert data is capable of capturing the general distribution of correct actions within the environment. It is, perhaps, one of the more intuitive approaches to leveraging expert data D = {(s_i, a_i), i = 1, 2, ..., N}, where N is the number of expert state-action pairs. The behavior cloning objective is described below (B ⊂ D, B: batch).
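The original equation did not survive extraction; the following is a reconstruction consistent with the description in [00104] of a mean-squared error between the policy network π_θ and the expert actions over a batch B:

```latex
L_{bc}(\theta) = \frac{1}{|B|} \sum_{(s_i,\, a_i) \in B} \left\lVert \pi_\theta(s_i) - a_i \right\rVert^{2}
```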

[00104] In essence, the loss describes a typical mean squared error (MSE) between the actor policy network (parametrized by θ) and the expert actions. This approach is different from the general RL objective. Furthermore, it does not provide a framework for training the critic network, which is crucial when choosing to continue training within the RL framework.

[00105] Some embodiments employ a loss function that combines the behavior cloning and SAC losses to allow simultaneous utilization of expert data to train both critic and actor, as well as to optimize towards both objectives. To this end, SAC critic and policy losses are defined, where Q symbolizes the Q network (critic), d_t describes whether the episode is done or not (0 or 1), α is the entropy coefficient, r_t is a reward, s_t the state, a_t the action, π is a policy, t is a time step, a' is an updated action, and Φ and θ represent learnable parameters of the critic network and an actor network, respectively. The MFRL agent is then trained on a mixed loss, which incorporates both objectives.
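The equations referenced here appear to have been lost in extraction. The following is a hedged reconstruction of standard SAC critic and policy losses, consistent with the symbols defined above, together with the mixed loss; the discount factor γ and the use of a target critic Q_Φtarget are standard SAC elements assumed here rather than quoted from the disclosure.

```latex
L_{Q,SAC}(\Phi) = \mathbb{E}\!\left[ \Big( Q_{\Phi}(s_t, a_t) - \big( r_t + \gamma (1 - d_t)\,\big( Q_{\Phi_{target}}(s_{t+1}, a') - \alpha \log \pi_\theta(a' \mid s_{t+1}) \big) \big) \Big)^{2} \right],
\qquad a' \sim \pi_\theta(\cdot \mid s_{t+1})

L_{\pi,SAC}(\theta) = \mathbb{E}\!\left[ \alpha \log \pi_\theta(a' \mid s_t) - Q_{\Phi}(s_t, a') \right],
\qquad a' \sim \pi_\theta(\cdot \mid s_t)

L_{mixed} = L_{bc} + L_{Q,SAC} + L_{\pi,SAC}
```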

[00106] Note that extra loss terms can be added for L2 regularizations. The target networks are updated softly, via Φ_target,i = τΦ_i + (1 - τ)Φ_target,i.

[00107] During pretraining, the expert data generated by the MBRL agent are used to generate Lmixed, treating the data as if they were part of the MFRL agent's replay buffer. During training, 25% of each batch may be sampled from the expert data and 75% from the agent's own replay buffer. Lbc is generated only from the expert data, while the remaining loss terms are computed from both expert and agent experiences.
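The batch mixing of [00107] can be sketched as follows. The 25%/75% split follows the example in the text; bc_loss, q_loss and pi_loss are assumed callables implementing Lbc, LQ,SAC and Lπ,SAC respectively.

```python
import random

# Sketch of the batch mixing in [00107]; function and parameter names are assumed.
def mixed_loss(expert_data, replay_buffer, bc_loss, q_loss, pi_loss,
               batch_size=256, expert_fraction=0.25):
    n_expert = int(batch_size * expert_fraction)
    expert_batch = random.sample(expert_data, n_expert)
    agent_batch = random.sample(replay_buffer, batch_size - n_expert)
    combined = expert_batch + agent_batch
    # Lbc is computed only from expert experiences; the SAC terms use the combined batch.
    return bc_loss(expert_batch) + q_loss(combined) + pi_loss(combined)
```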

[00108] The performance of the method described above was evaluated experimentally. For the experiments, a circuit-based simulator scenario was employed using a modified MBRL Dreamer architecture to act as an expert agent in the imitation learning context. The performance of training an MFRL SAC agent from scratch was compared with that of the described embodiments.

[00109] Figures 7A and 7B are graphs of reward curves indicating MFRL agent performance for behavior cloning for various reward levels according to some embodiments. Figure 7A illustrates reward curves for a conventionally trained MFRL SAC agent (curve 702) and a MFRL SAC agent trained using behavior cloning as described herein (curve 704) with a reward of +10, while Figure 7B illustrates reward curves for a conventionally trained MFRL SAC agent (curve 712) and a MFRL SAC agent trained using behavior cloning as described herein (curve 714) with a reward of +100.

[00110] In both cases, imitation learning significantly improves early efficiency. For instance, with imitation learning, the agent can tune 82.8% of the filters after 100k steps, while its non-imitation-learning counterpart is still struggling to tune a single filter. Similarly, the former starts tuning filters after around 40k steps, while the latter does so well after 150k steps.

[00111] Table 1 compares success rate at different stages of training for a conventional MFRL approach and an approach that uses imitation learning as described herein. The success rate was measured by evaluating the number of filters that are successfully tuned within an episode. In each timestep checkpoint, performance was measured on 1000 detuned filter instances.

Table 1 - Performance comparison (success rate) at different time steps.

[00112] Table 1 illustrates some benefits of the embodiments described herein. Even if the 70k initial MBRL training steps required to generate the trained MBRL "expert" are considered, the results show that sample complexity can be decreased by several hundred thousand steps. For reference, with imitation learning, a success rate over 93% (94.5%) can be achieved after 70k + 400k timesteps, versus 700k timesteps required for the conventional approach.

[00113] Overall, the described embodiments manage to decrease sample complexity by almost a factor of 2. The modularity of this approach makes it applicable with any MBRL or MFRL framework, and it can be applied on top of other techniques for reducing sample complexity.

[00114] Figure 8 is a block diagram illustrating elements of a device 800 configured with an RL agent 203 for RF filter tuning. Device 800 may be provided by, e.g., a device in the cloud running software on cloud compute hardware, or a software function/service governing or controlling the RF filter tuning running in the cloud. That is, the device may be implemented as part of a communications system (e.g., a device that is part of the communications system 1000 as discussed below with respect to Figure 10), or on a device as a separate functionality/service hosted in the cloud. The device also may be provided as standalone software for tuning an RF filter running on computational systems such as servers or workstations; and the device may be in a deployment that may include virtual or cloud-based network functions (VNFs or CNFs) and even physical network functions (PNFs). The cloud may be public, private (e.g., on premises or hosted), or hybrid.

[00115] As shown, the device may include transceiver circuitry 801 (e.g., RF transceiver circuitry) including a transmitter and a receiver configured to provide uplink and downlink radio communications with devices (e.g., a controller for automatic execution of a value on a tuning mechanism of an RF filter). The device may include network interface circuitry 807 (also referred to as a network interface) configured to provide communications with other devices (e.g., a controller for automatic execution of a value on a tuning mechanism of an RF filter). The device may also include processing circuitry 803 (also referred to as a processor) coupled to the transceiver circuitry, memory circuitry 805 (also referred to as memory) coupled to the processing circuitry, and RL agent 203 coupled to the processing circuitry. The RL agent 203 and/or memory circuitry 805 may include computer readable program code that when executed by the processing circuitry 803 causes the processing circuitry to perform operations according to embodiments disclosed herein. According to other embodiments, processing circuitry 803 may be defined to include memory or RL agent 203 so that a separate memory circuitry or separate RL agent is not required.

[00116] As discussed herein, operations of the device may be performed by processing circuitry 803, network interface 807, and/or transceiver 801. For example, processing circuitry 803 may control RL agent 203 to perform operations according to embodiments disclosed herein. Processing circuitry 803 also may control transceiver 801 to transmit downlink communications through transceiver 801 over a radio interface to one or more devices and/or to receive uplink communications through transceiver 801 from one or more devices over a radio interface. Similarly, processing circuitry 803 may control network interface 807 to transmit communications through network interface 807 to one or more devices and/or to receive communications through network interface from one or more devices. Moreover, modules may be stored in memory 805, and these modules may provide instructions so that when instructions of a module are executed by processing circuitry 803, processing circuitry 803 performs respective operations (e.g., operations discussed below with respect to example embodiments relating to devices). According to some embodiments, device 800 and/or an element(s)/function(s) thereof may be embodied as a virtual device/devices and/or a virtual machine/machines.

[00117] According to some other embodiments, a device may be implemented without a transceiver. In such embodiments, transmission to a wireless device may be initiated by the device 800 so that transmission to the wireless device is provided through a device including a transceiver (e.g., through a base station). According to embodiments where the device includes a transceiver, initiating transmission may include transmitting through the transceiver.

[00118] Figure 9 is a flowchart illustrating a computer-implemented method of a device configured with an MFRL agent for RF filter tuning according to some embodiments of the present disclosure. The device can be device 201, 1000 as discussed further herein. The method includes obtaining (911) a scattering parameter, S-parameter, reading for an RF filter. The method further includes generating (913), from the MFRL agent, a value for influencing a tuning mechanism of the RF filter based on the S-parameter reading. The MFRL agent has learning transferred from an MBRL agent, followed by subsequent training of the MFRL agent. The method further includes signaling (915) the value to a controller for automatic execution of the value on the tuning mechanism of the RF filter.
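By way of a non-limiting illustration, the method of Figure 9 may be sketched as the following loop; the hardware-facing interfaces (read_s_parameters, controller.apply) and the agent's act method are assumed names for exposition only.

```python
# Illustrative sketch of the tuning flow of Figure 9 under assumed interfaces.

def tune_filter(agent, read_s_parameters, controller, max_steps=100):
    """Run one tuning episode: read S-parameters, query the trained agent,
    and signal the resulting value to the tuning controller."""
    for _ in range(max_steps):
        s_params = read_s_parameters()      # obtain an S-parameter reading (911)
        value = agent.act(s_params)         # generate a tuning value from the agent (913)
        tuned = controller.apply(value)     # signal the value for automatic execution (915)
        if tuned:                           # assumed: controller reports when the filter meets spec
            return True
    return False
```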

[00119] Figure 10 shows an example of a communication system 1000 in accordance with some embodiments.

[00120] In the example, the communication system 1000 includes a telecommunication network 1002 that includes an access network 1004, such as a RAN, and a core network 1006, which includes one or more core network nodes 1008. The access network 1004 includes one or more access network nodes, such as network nodes 1010a and 1010b (one or more of which may be generally referred to as network nodes 1010), or any other similar 3rd Generation Partnership Project (3GPP) access node or non-3GPP access point. The network nodes 1010 facilitate direct or indirect connection of a user equipment (UE), such as by connecting UEs 1012a, 1012b, 1012c, and 1012d (one or more of which may be generally referred to as UEs 1012) to the core network 1006 over one or more wireless connections.

[00121] Example wireless communications over a wireless connection include transmitting and/or receiving wireless signals using electromagnetic waves, radio waves, infrared waves, and/or other types of signals suitable for conveying information without the use of wires, cables, or other material conductors. Moreover, in different embodiments, the communication system 1000 may include any number of wired or wireless networks, network nodes, UEs, and/or any other components or systems that may facilitate or participate in the communication of data and/or signals whether via wired or wireless connections. The communication system 1000 may include and/or interface with any type of communication, telecommunication, data, cellular, radio network, and/or other similar type of system.

[00122] The UEs 1012 may be any of a wide variety of communication devices, including wireless devices arranged, configured, and/or operable to communicate wirelessly with the network nodes 1010 and other communication devices. Similarly, the network nodes 1010 are arranged, capable, configured, and/or operable to communicate directly or indirectly with the UEs 1012 and/or with other network nodes or equipment in the telecommunication network 1002 to enable and/or provide network access, such as wireless network access, and/or to perform other functions, such as administration in the telecommunication network 1002.

[00123] In the depicted example, the core network 1006 connects the network nodes 1010 to one or more hosts, such as host 1016. These connections may be direct or indirect via one or more intermediary networks or devices. In other examples, network nodes may be directly coupled to hosts. The core network 1006 includes one or more core network nodes (e.g., core network node 1008) that are structured with hardware and software components. Features of these components may be substantially similar to those described with respect to the UEs, network nodes, and/or hosts, such that the descriptions thereof are generally applicable to the corresponding components of the core network node 1008. Example core network nodes include functions of one or more of a Mobile Switching Center (MSC), Mobility Management Entity (MME), Home Subscriber Server (HSS), Access and Mobility Management Function (AMF), Session Management Function (SMF), Authentication Server Function (AUSF), Subscription Identifier De-concealing function (SIDF), Unified Data Management (UDM), Security Edge Protection Proxy (SEPP), Network Exposure Function (NEF), and/or a User Plane Function (UPF).

[00124] The host 1016 may be under the ownership or control of a service provider other than an operator or provider of the access network 1004 and/or the telecommunication network 1002, and may be operated by the service provider or on behalf of the service provider. The host 1016 may host a variety of applications to provide one or more services. Examples of such applications include live and pre-recorded audio/video content, data collection services such as retrieving and compiling data on various ambient conditions detected by a plurality of UEs, analytics functionality, social media, functions for controlling or otherwise interacting with remote devices, functions for an alarm and surveillance center, or any other such function performed by a server.

[00125] As a whole, the communication system 1000 of Figure 10 enables connectivity between the UEs, network nodes, hosts, and devices. In that sense, the communication system may be configured to operate according to predefined rules or procedures, such as specific standards that include, but are not limited to: Global System for Mobile Communications (GSM); Universal Mobile Telecommunications System (UMTS); Long Term Evolution (LTE), and/or other suitable 2G, 3G, 4G, 5G standards, or any applicable future generation standard (e.g., 6G); wireless local area network (WLAN) standards, such as the Institute of Electrical and Electronics Engineers (IEEE) 802.11 standards (WiFi); and/or any other appropriate wireless communication standard, such as the Worldwide Interoperability for Microwave Access (WiMax), Bluetooth, Z-Wave, Near Field Communication (NFC), ZigBee, LiFi, and/or any low-power wide-area network (LPWAN) standards such as LoRa and Sigfox.

[00126] In some examples, the telecommunication network 1002 is a cellular network that implements 3GPP standardized features. Accordingly, the telecommunications network 1002 may support network slicing to provide different logical networks to different devices that are connected to the telecommunication network 1002. For example, the telecommunications network 1002 may provide Ultra Reliable Low Latency Communication (URLLC) services to some UEs, while providing Enhanced Mobile Broadband (eMBB) services to other UEs, and/or Massive Machine Type Communication (mMTC)/Massive IoT services to yet further UEs.

[00127] In some examples, the UEs 1012 are configured to transmit and/or receive information without direct human interaction. For instance, a UE may be designed to transmit information to the access network 1004 on a predetermined schedule, when triggered by an internal or external event, or in response to requests from the access network 1004. Additionally, a UE may be configured for operating in single- or multi-RAT or multi-standard mode. For example, a UE may operate with any one or combination of Wi-Fi, NR (New Radio) and LTE, i.e. being configured for multi-radio dual connectivity (MR-DC), such as E-UTRAN (Evolved-UMTS Terrestrial Radio Access Network) New Radio - Dual Connectivity (EN-DC).

[00128] In the example, the hub 1014 communicates with the access network 1004 to facilitate indirect communication between one or more UEs (e.g., UE 1012c and/or 1012d) and network nodes (e.g., network node 1010b).

[00129] Although the devices described herein may include the illustrated combination of hardware components, other embodiments may comprise computing devices with different combinations of components. It is to be understood that these devices may comprise any suitable combination of hardware and/or software needed to perform the tasks, features, functions and methods disclosed herein. Determining, calculating, obtaining or similar operations described herein may be performed by processing circuitry, which may process information by, for example, converting the obtained information into other information, comparing the obtained information or converted information to information stored in the device, and/or performing one or more operations based on the obtained information or converted information, and as a result of said processing making a determination. Moreover, while components are depicted as single boxes located within a larger box, or nested within multiple boxes, in practice, devices may comprise multiple different physical components that make up a single illustrated component, and functionality may be partitioned between separate components. For example, a communication interface may be configured to include any of the components described herein, and/or the functionality of the components may be partitioned between the processing circuitry and the communication interface. In another example, non-computationally intensive functions of any of such components may be implemented in software or firmware and computationally intensive functions may be implemented in hardware.

[00130] In certain embodiments, some or all of the functionality described herein may be provided by processing circuitry executing instructions stored in memory, which in certain embodiments may be an RL agent and/or computer program product (e.g., including an RL agent) in the form of a non-transitory computer-readable storage medium. In alternative embodiments, some or all of the functionality may be provided by the processing circuitry without executing instructions stored on a separate or discrete device-readable storage medium, such as in a hard-wired manner. In any of those particular embodiments, whether executing instructions stored on a non-transitory computer-readable storage medium or not, the processing circuitry can be configured to perform the described functionality. The benefits provided by such functionality are not limited to the processing circuitry alone or to other components of the device, but are enjoyed by the device as a whole, and/or by end users and a wireless network generally.

[00131] Further definitions and embodiments are discussed below.

[00132] In the above-description of various embodiments of present inventive concepts, it is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of present inventive concepts. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which present inventive concepts belong. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

[00133] When an element is referred to as being "connected", "coupled", "responsive", or variants thereof to another element, it can be directly connected, coupled, or responsive to the other element or intervening elements may be present. In contrast, when an element is referred to as being "directly connected", "directly coupled", "directly responsive", or variants thereof to another element, there are no intervening elements present. Like numbers refer to like elements throughout. Furthermore, "coupled", "connected", "responsive", or variants thereof as used herein may include wirelessly coupled, connected, or responsive. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Well-known functions or constructions may not be described in detail for brevity and/or clarity. The term "and/or" (abbreviated "/") includes any and all combinations of one or more of the associated listed items.

[00134] It will be understood that although the terms first, second, third, etc. may be used herein to describe various elements/operations, these elements/operations should not be limited by these terms. These terms are only used to distinguish one element/operation from another element/operation. Thus a first element/operation in some embodiments could be termed a second element/operation in other embodiments without departing from the teachings of present inventive concepts. The same reference numerals or the same reference designators denote the same or similar elements throughout the specification.

[00135] As used herein, the terms "comprise", "comprising", "comprises", "include", "including", "includes", "have", "has", "having", or variants thereof are open-ended, and include one or more stated features, integers, elements, steps, components or functions but do not preclude the presence or addition of one or more other features, integers, elements, steps, components, functions or groups thereof. Furthermore, as used herein, the common abbreviation "e.g.", which derives from the Latin phrase "exempli gratia," may be used to introduce or specify a general example or examples of a previously mentioned item, and is not intended to be limiting of such item. The common abbreviation "i.e.", which derives from the Latin phrase "id est," may be used to specify a particular item from a more general recitation.

[00136] Example embodiments are described herein with reference to block diagrams and/or flowchart illustrations of computer-implemented methods, apparatus (systems and/or devices) and/or computer program products. It is understood that a block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions that are performed by one or more computer circuits. These computer program instructions may be provided to a processor circuit of a general purpose computer circuit, special purpose computer circuit, and/or other programmable data processing circuit to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, transform and control transistors, values stored in memory locations, and other hardware components within such circuitry to implement the functions/acts specified in the block diagrams and/or flowchart block or blocks, and thereby create means (functionality) and/or structure for implementing the functions/acts specified in the block diagrams and/or flowchart block(s).

[00137] These computer program instructions may also be stored in a tangible computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions which implement the functions/acts specified in the block diagrams and/or flowchart block or blocks. Accordingly, embodiments of present inventive concepts may be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.) that runs on a processor such as a digital signal processor, which may collectively be referred to as "circuitry," "a module" or variants thereof.

[00138] It should also be noted that in some alternate implementations, the functions/acts noted in the blocks may occur out of the order noted in the flowcharts. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Moreover, the functionality of a given block of the flowcharts and/or block diagrams may be separated into multiple blocks and/or the functionality of two or more blocks of the flowcharts and/or block diagrams may be at least partially integrated. Finally, other blocks may be added/inserted between the blocks that are illustrated, and/or blocks/operations may be omitted without departing from the scope of inventive concepts. Moreover, although some of the diagrams include arrows on communication paths to show a primary direction of communication, it is to be understood that communication may occur in the opposite direction to the depicted arrows.

[00139] Many variations and modifications can be made to the embodiments without substantially departing from the principles of the present inventive concepts. All such variations and modifications are intended to be included herein within the scope of present inventive concepts. Accordingly, the above disclosed subject matter is to be considered illustrative, and not restrictive, and the examples of embodiments are intended to cover all such modifications, enhancements, and other embodiments, which fall within the spirit and scope of present inventive concepts. Thus, to the maximum extent allowed by law, the scope of present inventive concepts is to be determined by the broadest permissible interpretation of the present disclosure including the examples of embodiments and their equivalents, and shall not be restricted or limited by the foregoing detailed description.