

Title:
OFFLINE SELF TUNING OF MICROWAVE FILTER
Document Type and Number:
WIPO Patent Application WO/2023/047168
Kind Code:
A1
Abstract:
Disclosed are embodiments related to offline self-tuning of microwave filters. Certain embodiments relate to base station filters, wireless communications, artificial intelligence, reinforcement learning, and auto-tuning. In some embodiments, filter tuning actions are recorded without following a specific policy, the tuning data is stored in an offline database, desired filter characteristics are obtained, a reward function is determined based on the desired filter characteristics, at least two replay buffers are initiated with the offline data, rewards are calculated for the at least two replay buffers, at least two filter-tuning policies are trained, and the at least two filter-tuning policies are combined to create a final-tuning policy.

Inventors:
HUANG VINCENT (SE)
MALEK MOHAMMADI MOHAMMADREZA (SE)
WEI JIEQIANG (SE)
BLANCO DARWIN (SE)
Application Number:
PCT/IB2021/058800
Publication Date:
March 30, 2023
Filing Date:
September 27, 2021
Assignee:
ERICSSON TELEFON AB L M (SE)
International Classes:
H01P11/00; H03J5/00
Foreign References:
CN105048040A2015-11-11
CN109766614A2019-05-17
CN109783905A2019-05-21
CN106814307A2017-06-09
CN108270057A2018-07-10
CN105789812A2016-07-20
CN105680827A2016-06-15
CN104659460A2015-05-27
Other References:
WANG ZHIYANG ET AL: "Reinforcement learning approach to learning human experience in tuning cavity filters", 2015 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND BIOMIMETICS (ROBIO), IEEE, 6 December 2015 (2015-12-06), pages 2145 - 2150, XP032873415, DOI: 10.1109/ROBIO.2015.7419091
HANNES LARSSON: "Deep Reinforcement Learning for Cavity Filter Tuning", EXAMENSARBETE, 1 January 2018 (2018-01-01), XP055764519, Retrieved from the Internet
SIMON LINDSTÅHL: "Reinforcement Learning with Imitation for Cavity Filter Tuning: Solving problems by throwing DIRT at them", 1 June 2019 (2019-06-01), XP055764605, Retrieved from the Internet [retrieved on 20210113]
LINDSTÅHL, S.: "Reinforcement Learning with Imitation for Cavity Filter Tuning: Solving problems by throwing DIRT at them" (Dissertation), 2019, Retrieved from the Internet
HAARNOJA, TUOMAS; ZHOU, AURICK; ABBEEL, PIETER; LEVINE, SERGEY: "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor", arXiv:1801.01290, 2018
Claims:
CLAIMS

1. A computer-implemented method of self-tuning a microwave filter, the method comprising: obtaining (s902) a plurality of state vectors, wherein each state vector comprises an initial filter response of a microwave filter, a tuning action performed on the microwave filter, and a resulting filter response of the microwave filter after the tuning action; obtaining (s904) a desired filter response for a target microwave filter (101); determining (s906) a reward function based on the desired filter response; generating (s908) a first replay buffer, the first replay buffer comprising a first set of randomly selected state vectors from the plurality of state vectors; calculating (s910) a first corresponding reward value based on the reward function for the first replay buffer; training (s912), using the first replay buffer, a first filter-tuning policy of an agent (209, 409, 509) to optimize the reward function; generating (s914) a second replay buffer, the second replay buffer comprising a second set of randomly selected state vectors from the plurality of state vectors; calculating (s916) a second corresponding reward value based on the reward function for the second replay buffer; training (s918), using the second replay buffer, a second filter-tuning policy of the agent to optimize the reward function; and combining (s920) the trained first filter-tuning policy and the trained second filter-tuning policy to create a final filter-tuning policy.

2. The method of claim 1, further comprising: using the final filter-tuning policy to tune the target filter.

3. The method of any one of claims 1-2, further comprising: obtaining an initial filter response of the target microwave filter; determining, using the final filter-tuning policy, one or more tuning actions to perform on the target microwave filter based on the initial filter response; and outputting the determined one or more tuning actions.

4. The method of claim 3, wherein the outputting further comprises: generating instructions for a device to apply the determined one or more tuning actions to the target microwave filter.

5. The method of claim 3 or 4, further comprising: obtaining a second filter response of the target microwave filter after the determined one or more tuning actions have been applied to the target microwave filter; and determining, using the final filter-tuning policy, one or more tuning actions to perform on the target microwave filter based on the second filter response.

6. The method of any one of claims 1-5, wherein the desired filter response comprises a frequency mask.

7. The method of any one of claims 1-6, wherein the initial filter response, the resulting filter response, and the desired filter response each comprise a respective set of scattering parameters.

8. The method of claim 7, wherein the scattering parameters comprise:

S11, a reflection coefficient from an input port, and

S21, a forward voltage gain.

9. The method of claim 7 or 8, wherein determining the reward function comprises determining a loss vector l according to the following equation:

l = [l1, . . ., ln],

wherein n is a total number of frequency points of the target filter, li = max(0, di1) + max(0, di2), li is a loss value at frequency point i, di1 is a difference between a first scattering parameter of a first resulting filter response at frequency point i and a first scattering parameter of a first desired filter response at frequency point i, and di2 is a difference between a second scattering parameter of the first resulting filter response at frequency point i and a second scattering parameter of the first desired filter response at frequency point i.

10. The method of claim 9, wherein the reward value for each state vector comprises the following equation:

r = -||l||2 if Σi li > 0, and r = rextra otherwise,

wherein ||l||2 denotes a Euclidean norm of the vector l, and rextra is an extra reward value.

11. The method of any one of claims 1-10, wherein the tuning action comprises a relative position of all tuning screws of a filter.

12. The method of any one of claims 1-11, wherein the agent comprises a neural network and a reinforcement learning algorithm.

13. The method of claim 12, wherein the reinforcement learning algorithm comprises one or more of: a Deep Q Network (DQN), Deep Deterministic Policy Gradient (DDPG), Hindsight Experience Replay (HER), or Actor-Critic.

14. A device (1000) adapted to perform any one of methods 1-13.

15. A computer program (1043) comprising instructions (1044) which, when executed by processing circuitry (1055) of a device (1000), causes the device to perform the method of any one of methods 1-13.

16. A carrier containing the computer program of embodiment 15, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.

Description:
OFFLINE SELF TUNING OF MICROWAVE FILTER

TECHNICAL FIELD

[001] Disclosed are embodiments related to offline self-tuning of microwave filters. Certain embodiments relate to base station filters, wireless communications, artificial intelligence, reinforcement learning, and auto-tuning of filters.

INTRODUCTION

[002] Filters used in base stations for wireless communications are known for having very demanding requirements on filter capabilities. For example, the bandwidth may be very narrow (e.g., typically less than 100 MHz) and the constraints in the rejection bands may be very high (e.g., typically more than 60 dB). In order to reach a very narrow bandwidth with high rejection bands, the selected filter topology will need many poles and at least a couple of zeros (commonly more than six poles and two zeros). The number of poles translates directly to the number of physical resonators of the manufactured filter. Since every resonator is electrically and/or magnetically connected to the next one for some frequencies, a path from the input to the output is created, allowing the energy to flow from the input to the output for the designed frequencies while other frequencies are rejected. When a pair of non-consecutive resonators is coupled, an alternative path for the energy is created. This alternative path is related to a zero in the rejection band.

[003] Commonly, each pole/resonator has a tuning screw to adjust for possible inaccuracies in the manufacturing process, while each zero (due to coupling between consecutive or non-consecutive resonators) has another screw to control the desired coupling. The tuning of this large number of poles and zeros (i.e., the tuning of screws) is normally done manually (a well-trained human manipulates the screws and verifies the desired response on a vector network analyzer (VNA)), which is a time-consuming task. Indeed, for some complex filter units the total process can take 30 minutes.

[004] Recently, artificial intelligence and machine learning have emerged as potential alternatives to solve this problem, reducing the required tuning time per filter unit and offering the possibility to explore more complex filter topologies [1].

[005] Certain automatic cavity filter tuning results are described in the following: CN105048040B, CN109766614A, CN109783905A, CN106814307B, CN108270057A, CN105789812B, CN105680827B, CN104659460A.

SUMMARY

[006] Manual tuning is time consuming and very expensive, as it requires a human expert per filter unit. Approaches based on model simulations may not be accurate and may be limited by the simulation models. Moreover, automatic cavity filter tuning techniques may be limited because they do not leverage off-line data.

[007] Embodiments disclosed herein relate to automatic tuning of cavity filters by collecting off-line data without a specific target frequency response and without following a predefined action sequence. Off-line data may be applied for tuning different target frequency responses, which allows collected data to be utilized more efficiently and provides more accurate filter tuning.

[008] Embodiments disclosed herein may leverage a machine learning (ML) algorithm to tune a filter unit. The filter unit may be used in wireless communication systems. For example, the filter unit may be for a base station of a wireless communication system.

[009] In some embodiments, a method for training an agent for filter tuning with offline collected data includes:

(1) Recording all filter tuning actions without following a specific policy,

(2) Storing data in an offline database,

(3) Specifying desired filter characteristics,

(4) Specifying a reward function based on the desired filter characteristics,

(5) Initiating at least two replay buffers with the offline data and calculated rewards,

(6) Training at least two filter-tuning policies, and

(7) Combining the at least two filter-tuning policies into a final policy.
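
Purely as an illustration of the data flow in steps (1)-(7) above, the following Python sketch shows one possible arrangement. The function decomposition, the buffer size, and the caller-supplied helpers (reward_fn, train_policy, combine_policies) are illustrative assumptions, not the claimed implementation.

import random

def offline_self_tuning(transitions, reward_fn, train_policy, combine_policies,
                        n_policies=2, buffer_size=50_000):
    # Steps (1)-(2) happened earlier: 'transitions' holds recorded tuples
    # (state, action, next_state) collected without following a specific policy.
    # Steps (3)-(4): 'reward_fn' encodes the desired filter characteristics.
    policies = []
    for _ in range(n_policies):
        # Step (5): initiate a replay buffer with randomly selected offline data
        # and the rewards calculated for this particular target.
        batch = random.sample(transitions, k=min(len(transitions), buffer_size))
        buffer = [(s, a, reward_fn(s_next), s_next) for (s, a, s_next) in batch]
        # Step (6): train one filter-tuning policy per replay buffer.
        policies.append(train_policy(buffer))
    # Step (7): combine the trained policies into the final tuning policy.
    return combine_policies(policies)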

[0010] In one aspect, a computer-implemented method of self-tuning a microwave filter is provided. The method includes obtaining a plurality of state vectors, wherein each state vector comprises an initial filter response of a microwave filter, a tuning action performed on the microwave filter, and a resulting filter response of the microwave filter after the tuning action. The method includes obtaining a desired filter response for a target microwave filter. The method includes determining a reward function based on the desired filter response. The method includes generating a first replay buffer, the first replay buffer comprising a first set of randomly selected state vectors from the plurality of state vectors. The method includes calculating a first corresponding reward value based on the reward function for the first replay buffer. The method includes training, using the first replay buffer, a first filter-tuning policy of an agent to optimize the reward function. The method includes generating a second replay buffer, the second replay buffer comprising a second set of randomly selected state vectors from the plurality of state vectors. The method includes calculating a second corresponding reward value based on the reward function for the second replay buffer. The method includes training, using the second replay buffer, a second filter-tuning policy of the agent to optimize the reward function. The method includes combining the trained first filter-tuning policy and the trained second filter-tuning policy to create a final filter-tuning policy.

[0011] In another aspect there is provided a device adapted to perform the method. In another aspect there is provided a computer program comprising instructions which, when executed by processing circuitry of a device, causes the device to perform the methods. In another aspect there is provided a carrier containing the computer program, where the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.

[0012] One of the advantages made possible by the embodiments disclosed herein is the reuse of offline collected data for training adaptive filter characteristics.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.

[0014] FIG. 1 is a block diagram illustrating a manual filter tuning process.

[0015] FIG. 2 is a block diagram, according to some embodiments.

[0016] FIG. 3 is a flow diagram, according to some embodiments.

[0017] FIG. 4 is a block diagram, according to some embodiments.

[0018] FIG. 5 illustrates a block diagram, according to some embodiments.

[0019] FIG. 6 illustrates a training phase of a reinforcement learning agent, according to some embodiments.

[0020] FIG. 7 illustrates a histogram showing a number of actions needed to tune a filter, according to some embodiments.

[0021] FIG. 8 illustrates a tuning sequence, according to some embodiments.

[0022] FIG. 9 illustrates a method, according to some embodiments.

[0023] FIG. 10 is a block diagram of an apparatus 1000 according to some embodiments.

DETAILED DESCRIPTION

[0024] Microwave Cavity Filter

[0025] Band-pass filters are commonly used in wireless communication systems to meet the sharp and demanding requirements of commercial bands. Cavity filters are still dominantly used due to their low cost in mass production and high Q-factor per resonator (especially for frequencies below 1 GHz). Cavity filters may provide high-Q resonators that can be used to implement sharp filters with very fast transitions between pass and stop bands and very high selectivity. Moreover, they can easily cope with very high-power input signals.

[0026] Cavity filters are applicable from as low as 50 MHz up to several GHz. The versatility in frequency range as well as the aforementioned high selectivity make cavity filters a very popular choice in many applications such as base stations.

[0027] A significant drawback of these types of narrow-band filters is that, because they require a very sharp frequency response, a small tolerance in the fabrication process will impact the final performance. A common solution to avoid an extremely expensive fabrication process is post-production tuning.

[0028] FIG. 1 is a block diagram illustrating a manual filter tuning process. FIG. 1 shows the process of manually tuning a typical filter (101) by a human expert (109). The human expert (109) normally tunes a detuned cavity filter (101) by measuring the frequency response on a vector network analyzer (VNA) (105). The conventional tuning method thus far requires highly trained human operators (109) to manually tune the filter (101). For instance, the tuning process might include turning a set of screws (103). With the aid of the VNA (105), the human operator (109) may then compare on a display (107) how close the current filter response and the desired filter response are, e.g., by reviewing the measured scattering parameters (s-parameters). This process may be repeated until the measurement on the VNA and the designed filter mask are close enough.

[0029] Reinforcement Learning

[0030] Reinforcement learning (RL) is a machine learning (ML) method concerned with how an agent should take one or more actions in an environment so as to maximize a numerical reward signal. According to some embodiments, the environment is one type of cavity filter, for instance a cavity filter with 6 poles and 2 zeros, and the agent may be an algorithm or ML model that turns the screws on the filter.

[0031] Since tuning the filters using human operators is time consuming and costly, one advantage of the embodiments disclosed herein is to make full use of training data obtained from post-production tuning. Accordingly, techniques disclosed herein relate to offline learning for a filter tuning problem.

[0032] The training data may be prepared as follows. For each given filter with a state (s-parameters) St, one action At may be taken from a predefined policy. The environment returns one reward Rt and a new filter state St+1. The tuple (St, At, Rt, St+1) is saved to a buffer. When enough data is collected, an ML agent can learn from the data and train a policy to tune a filter successfully. The reward Rt is not calculated at data collection time. Instead, it is calculated when the data is selected to be used for a specific target filter. Thus, for different target filters, the reward Rt may be different. Since the training data is not generated during the training of the agent, e.g., not from direct interaction between the agent and the environment, this technique may be characterized as an offline training method.
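
As a concrete illustration of this data layout (equivalent to the (state, action, next_state) tuples sketched earlier), each recorded step can be stored without a reward, and the reward attached only when the data is later selected for a specific target filter. The field names below are assumptions chosen for readability, not terminology from this disclosure.

from dataclasses import dataclass
import numpy as np

@dataclass
class Transition:
    state: np.ndarray       # s-parameter state vector St before the action
    action: np.ndarray      # tuning action At (relative screw adjustments)
    next_state: np.ndarray  # s-parameter state vector St+1 after the action

def attach_reward(transition, reward_fn):
    # Rt is computed only when the stored transition is selected for training
    # against a particular target filter (offline use of the collected data).
    return (transition.state,
            transition.action,
            reward_fn(transition.next_state),
            transition.next_state)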

[0033] Proposed Method

[0034] FIG. 2 is a block diagram, according to some embodiments. FIG. 2 illustrates a system that may be used for automatically tuning a filter using offline data. At block 202, one or more tuning actions may be performed on one or more filters. In some embodiments, the system may include a control mechanism (e.g., a robotic arm) to perform a tuning action on a cavity filter (e.g., turning a screw) and a measuring mechanism (e.g., a VNA) to measure an output frequency response of the filter. Unlike previous solutions, the system can make use of all available data from actions of a human or machine through an offline training method. For example, any person or machine may perform tuning actions (e.g., screw turning) on a filter, and data on how the filter performance is affected by the tuning action(s) is stored in a database for later offline use.

[0035] At block 204, the tuning data is stored in an offline database. The data collection mechanism may include collecting data including: information on a previous frequency response, a performed tuning action, and a resulting filter response.

[0036] At block 206, a plurality of replay buffers are initiated with the offline tuning data.

[0037] At block 208, a reward is determined using the replay buffers.

[0038] During a filter tuning stage, a desired filter response may be defined and received by the system. Based on the desired filter response, a reward function is calculated. The reward function is based on how close a tuned filter response is to the desired filter response. Utilizing the reward function and the stored offline data, a random data set is selected to generate a replay buffer. The replay buffer consists of a set of randomly selected filter tuning actions, initial frequency response, resulting new frequency response, and a corresponding one-step reward based on the reward function.

[0039] At block 209, an ML/RL agent is trained with a filter-tuning policy.

[0040] At block 210, a filter is tuned using the filter-tuning policy.

[0041] According to some embodiments, the replay buffers are used for training a filter-tuning policy to optimize the reward function. The ML/RL agent, which may include a neural network, is trained with an input of the initial frequency response, and the output includes the action(s) that maximize the total reward function.

[0042] Since the replay buffer is selected randomly and no new online feedback data is provided, the process needs to be repeated several times and a combined/averaged policy is selected as the final policy for the filter tuning.
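
One simple way to realize the combined/averaged policy, assuming each trained policy is a callable that maps a state vector to an action vector, is to average (optionally with weights) the actions proposed by the individual policies; averaging network weights is another option. The sketch below covers the averaging variant only.

import numpy as np

def combine_policies(policies, weights=None):
    # Returns a final policy whose action is the (weighted) mean of the
    # actions proposed by the individually trained policies.
    if weights is None:
        weights = np.ones(len(policies)) / len(policies)

    def final_policy(state):
        actions = np.stack([p(state) for p in policies])     # shape (k, n_screws)
        return np.average(actions, axis=0, weights=weights)  # weighted mean action

    return final_policy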

[0043] FIG. 3 is a flow diagram, according to some embodiments. At step s302, offline data is collected, such as described above. At step s304, the offline data is stored in an offline data storage database. At step s306, a filter wave form is selected, e.g., a desired filter response for a target filter is selected, and a reward function may be determined based on the desired filter response. At step s308, a plurality of replay buffers are selected and populated with randomly selected data from the offline data. At step s310, a reward is determined using the reward function, e.g., a one-step reward in the replay buffer. At step s312, a plurality of policies are trained using the data from the replay buffers. At step s314, the policies are combined. The policies may be combined in a number of ways, such as by averaging, weighted average, etc.

[0044] Observation or state vector

[0045] An observation or state vector from the offline tuning data may consist of a set of s-parameters at different frequency points within a particular frequency range. In general, one may obtain the following complex-valued parameters from measurement or environment:

S11, which shows the reflection coefficient from the input port,

S22, which shows the reflection coefficient from the output port,

S12, which shows the reverse voltage gain, and

S21, which shows the forward voltage gain.

[0046] The frequency may need to be discretized within the desired frequency range. For example, there may be a frequency resolution of Δf, and low and high frequencies fL and fH. Accordingly, there would be n = (fH - fL)/Δf frequency points and, at each frequency point, four complex numbers (S11, S22, S12, S21). Another practicality with the state vector is that, since the neural net used in the RL algorithm may not be able to accept complex values, the real and imaginary parts at each frequency point may need to be used separately. Accordingly, for each frequency point, the observation vector may consist of 2*4 = 8 real numbers. This means that, in summary, the state vector may have a length of 8n.
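
The conversion from the four complex-valued s-parameter traces to this real-valued state vector can be sketched as follows; the ordering of the real and imaginary parts within the vector is an implementation choice that is not fixed by this description.

import numpy as np

def build_state_vector(s11, s22, s12, s21):
    # Each input is a complex-valued array sampled at the same n frequency
    # points; the output is a real-valued state vector of length 8*n.
    params = np.stack([s11, s22, s12, s21])      # shape (4, n), complex
    return np.concatenate([params.real.ravel(),  # 4*n real parts
                           params.imag.ravel()]) # 4*n imaginary parts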

[0047] Reward function

[0048] There may be several ways to reward the agent according to the closeness of the filter’s frequency response to the desired response (a.k.a. the frequency mask). However, as only absolute values of S11 and S21 may be specified in the frequency mask of a tuned filter, only those two parameters may be specified in the reward function, whereas all four complex-valued s-parameters may be used as the observation.

[0049] Moreover, for calculating the reward, in some embodiments, 20*log10(.) of the magnitude is taken in order to convert to a decibel (dB) value. One main reason to do so is that the requirements are very different in value in the stopband and the passband, and if one uses a linear scale, it is difficult to balance between the requirements. For example, the requirement in the stopband may be a rejection of 80 dB, which is equivalent to abs(S21) < 0.0001. On the other hand, the requirement in the passband may be an insertion loss of 1 dB, or equivalently abs(S21) > 0.8913. For this large range of values, working with a linear scale is much harder than working with a logarithmic one.

[0050] First, for a given frequency point i, a difference between the current frequency response and the desired one from the mask is calculated. Depending on the frequency, there might be one or two requirements to satisfy. Particularly, in the stopband there is usually only one requirement in terms of S21, while in the passband there are requirements on both S11 and S21. If di1 and di2 denote the differences between the desired values of S11 and S21, respectively, at frequency point i and the current ones, then the loss li is defined as li = max(0, di1) + max(0, di2).

[0051] In effect, if the current frequency response satisfies the requirements, then the associated loss would be zero.

[0052] If we make a vector l = [l1, . . ., ln], where n is the total number of frequency points, then the reward would be:

r = -||l||2 if Σi li > 0, and r = rextra otherwise,

where ||.||2 denotes the Euclidean norm of a vector and rextra is an extra reward which is given to the agent whenever all the requirements are satisfied. Intuitively, the above reward is proportional to the negative of all the losses aggregated over the different frequencies. The bigger the aggregated loss, the bigger the distance from the desired frequency response and hence the smaller the reward.
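
A minimal sketch of this reward calculation is given below, assuming the mask is expressed as per-frequency upper bounds on |S11| and |S21| (in dB) and a lower bound on |S21| in the passband, with +inf/-inf used where a bound does not apply; the value of rextra is likewise an assumption.

import numpy as np

def compute_reward(s11_db, s21_db, s11_max_db, s21_max_db, s21_min_db, r_extra=10.0):
    # Positive differences mean the corresponding requirement is violated.
    d_s11 = s11_db - s11_max_db       # passband requirement on |S11|
    d_s21_stop = s21_db - s21_max_db  # stopband rejection requirement on |S21|
    d_s21_pass = s21_min_db - s21_db  # passband insertion-loss requirement on |S21|

    # Loss vector l = [l1, ..., ln]; bounds set to +/- inf contribute zero loss.
    l = (np.maximum(0.0, d_s11)
         + np.maximum(0.0, d_s21_stop)
         + np.maximum(0.0, d_s21_pass))

    if l.sum() > 0:
        return -np.linalg.norm(l)     # r = -||l||2 when any requirement is violated
    return r_extra                    # extra reward when all requirements are met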

[0053] Tuning Actions

[0054] In some embodiments, the action that the agent takes at each step may be seen as the relative position of all screws that tune the filter. This means that the size of the action vector equals the number of tuning screws. Also, since the action is the relative position of the screws, the RL agent does not need to know the current position of the screws. In the training phase, the action consists of the relative positions of all screws, while in the implementation phase, depending on the mechanical limitations, the robotic arm may have one or several screw drivers. This means that, to complete one action, the robot may need to turn the screw drivers several times. Screw position is one example, and a person of skill in the art would appreciate that alternative tuning actions may be used, e.g., for different types of filters.

[0055] Since the s-parameters (and also the state vector) capture the effect of the screws’ positions on the characteristics of the filter, and the actions are relative positions of the screws, it may not be necessary to feed the position of the screws as a part of the state vector to the agent.

[0056] RL Algorithms

[0057] In general, the proposed method can be used together with several different RL algorithms. However, considering the nature of the filter tuning problem, in some embodiments deep Q-learning algorithms like DDPG, DQN, and HER, as well as actor-critic methods like soft actor-critic (SAC), may be quite efficient.

[0058] Neural network architecture for implementing the actor

[0059] FIG. 4 is a block diagram, according to some embodiments. In one embodiment, simulations were performed using actor-critic methods, which obtained the best performance; in these methods, two models are trained in parallel. One model, which is usually called the “actor” (409), in any given state takes an action to maximize the expected cumulative reward. This model, which in effect is a policy, is updated in a similar manner as for a typical policy-based algorithm. The second model, which is usually called the “critic”, looks at the actions taken by the actor and tries to learn the value or action-value function. This may be seen as criticism because the second model is analyzing the behavior of the actor by looking at its actions.

[0060] In practice, the actor part (409) may be implemented by a function approximator such as a look-up table or a neural network (depending on the application, the neural network could be a fully connected, recurrent, or convolutional one). In some embodiments, for the cavity filter application, a fully connected network, as shown in FIG. 4, can do an excellent job as the actor (409). In this embodiment, the s-parameters (407) provide all the information the actor (409) needs to take an action. The input to the network is basically the vector of the frequency response including all s-parameters (407), in which the real and imaginary parts are fed to the network separately. The output, which consists of one or more tuning actions, may be applied to filter 101.
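
For concreteness, such a fully connected actor could be sketched as below (PyTorch); the layer widths, the activation functions, and the use of a tanh output to bound the relative screw movements are assumptions for illustration and are not meant to reproduce the specific architecture of FIG. 4.

import torch
import torch.nn as nn

class Actor(nn.Module):
    # Maps the real-valued state vector (real and imaginary parts of the four
    # s-parameters at n frequency points, i.e., 8*n values) to one relative
    # adjustment per tuning screw.
    def __init__(self, n_freq_points, n_screws, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(8 * n_freq_points, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_screws),
            nn.Tanh(),  # bounded relative screw movement
        )

    def forward(self, state):
        return self.net(state)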

[0061] Simulation Results

[0062] FIG. 5 illustrates a block diagram, according to some embodiments. FIG. 5 illustrates a simulation environment. A number of simulations were run using a filter environment (501), in which the observation vector consists of s-parameters given by a circuit simulator (507); these observations, coupled with a reward calculation (511), were used to train an RL agent (509). To test the RL agent (509), a filter example (501) was used consisting of 10 poles and 4 zeros (called a 10p4z filter herein for brevity).

[0063] FIG. 6 illustrates a training phase of a reinforcement learning agent, according to some embodiments. The RL algorithm/agent has been successfully employed in several RL settings, and FIG. 6 illustrates the average episode reward during training for 10p4z filters as a function of time steps. FIG. 6 shows the training phase of the RL agent, where the y-axis represents the average episode reward and the x-axis depicts time. FIG. 6 shows that the agent’s ability to tune filters and achieve higher rewards evolves as the agent continues training over time and improves the policy.

[0065] FIG. 7 illustrates a histogram showing a number of actions needed to tune a filter, according to some embodiments. To evaluate the performance of the trained agent on unseen examples of the filter, the RL agent was tested with 10,000 independent and identically distributed (iid) realizations of the filter. Overall, the success rate in tuning the filter was above 90%, and FIG. 7 shows the histogram of the number of actions needed to tune the filter, with frequency on the y-axis and the number of actions on the x-axis. The histogram in FIG. 7 is of the number of actions the agent needs to tune 10,000 iid realizations of 10p4z filters, and only includes the results from filters that are tuned in at most 20 actions.

[0066] FIG. 8 illustrates a tuning sequence, according to some embodiments. To illustrate the tuning process with the trained agent, FIG. 8 shows an example sequence of tuning of one realization of a filter. More specifically, the first plot in the sequence shows the frequency response (S21 parameter and S11 parameter) of a detuned filter together with the requirements (represented by horizontal bars) that the filter needs to fulfill. The next plot shows the frequency response after the first tuning action of the agent, the next plot shows the frequency response after the second tuning action of the agent, the next plot shows the frequency response after the third tuning action of the agent, and the last plot shows the frequency response after the fourth tuning action of the agent. The last plot illustrates that the filter is tuned after four tuning actions.
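
The deployment loop implied by this sequence (apply an action, re-measure, repeat until the mask is satisfied) can be sketched as follows; the environment object and its measure/apply/mask_satisfied methods are hypothetical stand-ins for the VNA measurement and the screw-turning mechanism.

def tune_filter(final_policy, filter_env, max_actions=20):
    # Iteratively apply the final filter-tuning policy until the measured
    # response satisfies the frequency mask or the action budget is spent.
    state = filter_env.measure()            # initial s-parameter state vector
    for step in range(max_actions):
        if filter_env.mask_satisfied():     # all requirements met: filter is tuned
            return True, step
        action = final_policy(state)        # relative screw adjustments
        filter_env.apply(action)            # e.g., robotic arm turns the screws
        state = filter_env.measure()        # new frequency response
    return filter_env.mask_satisfied(), max_actions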

[0067] FIG. 9 illustrates a method, according to some embodiments. FIG. 9 illustrates a computer-implemented method (900) of self-tuning a microwave filter. At step s902, a plurality of state vectors is obtained, wherein each state vector comprises an initial filter response of a microwave filter, a tuning action performed on the microwave filter, and a resulting filter response of the microwave filter after the tuning action. The state vector is described in further detail above, and comprises offline data.

[0068] At step s904, a desired filter response for a target microwave filter is obtained. The target filter may be a filter (101) to which a tuning action may be applied in order to achieve the desired filter response.

[0069] At step s906, a reward function is determined based on the desired filter response.

[0070] At step s908, a first replay buffer is generated, the first replay buffer comprising a first set of randomly selected state vectors from the plurality of state vectors.

[0071] At step s910, a first corresponding reward value is calculated based on the reward function for the first replay buffer.

[0072] At step s912, the first replay buffer is used to train a first filter-tuning policy of an agent to optimize the reward function. The agent may be the RL agent, such as a neural network or algorithm, as discussed above.

[0073] At step s914, a second replay buffer is generated, the second replay buffer comprising a second set of randomly selected state vectors from the plurality of state vectors.

[0074] At step s916, a second corresponding reward value is calculated based on the reward function for the second replay buffer.

[0075] At step s918, the second replay buffer is used to train a second filter-tuning policy of the agent to optimize the reward function.

[0076] At step s920, the trained first filter-tuning policy and the trained second filter-tuning policy are combined to create a final filter-tuning policy. While the method discloses the training of two policies, as described above, the process may be repeated in order to arrive at a combined final filter-tuning policy.

[0077] FIG. 10 is a block diagram of an apparatus 1000 according to some embodiments. In some embodiments, apparatus 1000 may comprise one or more of the components for automatically tuning a filter as described above. As shown in FIG. 10, apparatus 1000 may comprise: processing circuitry (PC) 1002, which may include one or more processors (P) 1055 (e.g., one or more general purpose microprocessors and/or one or more other processors, such as an application specific integrated circuit (ASIC), field-programmable gate arrays (FPGAs), and the like); communication circuitry 1048, comprising a transmitter (Tx) 1045 and a receiver (Rx) 1047 for enabling apparatus 1000 to transmit data and receive data (e.g., wirelessly transmit/receive data); and a local storage unit (a.k.a., “data storage system”) 1008, which may include one or more non-volatile storage devices and/or one or more volatile storage devices. In embodiments where PC 1002 includes a programmable processor, a computer program product (CPP) 1041 may be provided. CPP 1041 includes a computer readable medium (CRM) 1042 storing a computer program (CP) 1043 comprising computer readable instructions (CRI) 1044. CRM 1042 may be a non-transitory computer readable medium, such as, magnetic media (e.g., a hard disk), optical media, memory devices (e.g., random access memory, flash memory), and the like. In some embodiments, the CRI 1044 of computer program 1043 is configured such that when executed by PC 1002, the CRI causes device 1000 to perform steps described herein (e.g., steps described herein with reference to the flow charts). In other embodiments, device 1000 may be configured to perform steps described herein without the need for code. That is, for example, PC 1002 may consist merely of one or more ASICs. Hence, the features of the embodiments described herein may be implemented in hardware and/or software.

[0078] While various embodiments are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of this disclosure should not be limited by any of the above described embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.

[0079] Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.

[0080] ABBREVIATIONS

[0081] RL Reinforcement learning

[0082] VNA Vector network analyzer

[0083] REFERENCES

[0084] [1] Lindstahl, S. (2019). Reinforcement Learning with Imitation for Cavity Filter Tuning: Solving problems by throwing DIRT at them (Dissertation). Retrieved from http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-254422

[0085] [2] Haarnoja, Tuomas, Aurick Zhou, Pieter Abbeel, and Sergey Levine. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." arXiv preprint arXiv:1801.01290 (2018).