Title:
NATIVE EXPANSION OF A SPARSE TRAINING DATASET INTO A DENSE TRAINING DATASET FOR SUPERVISED TRAINING OF A SYNONYMOUS VARIANT SEQUENCE GENERATOR
Document Type and Number:
WIPO Patent Application WO/2024/086143
Kind Code:
A1
Abstract:
Disclosed is a data representation and computational process, referred to as Auto-Pairing, that is robust to data sparsity and to the lack of annotative/target sequences in deep learning applications. Auto-Pairing enables the expansion of initially inadequate datasets from n to ~n² data points, providing sufficient learning instances to produce accurate therapeutic predictions. Auto-Pairing also enables the transformation of initially unsupervised datasets into supervised datasets, allowing a direct mapping from an input therapeutic to a target therapeutic. The Auto-Pairing process and representation comprises a novel integration of four existing general-purpose computational sub-processes with domain-specific fine-tuning: Clustering, Pairing, Pre-processing, and Modeling, enabling the generation of functional therapeutic variants from any desired small dataset of a certain therapeutic family, and even from non-biological data.

Inventors:
SHAFEI MOHAMED (EG)
ANWAR MUHAMMAD (EG)
SALAH-ELDIN WAFAA ASHRAF (EG)
ESSAM HAZEM (EG)
MOUSTAFA WALID (EG)
ELKERDAWY MOHAMED (EG)
BADAWY SARAH (EG)
NADER ATEF (EG)
SALEH AHMED (EG)
Application Number:
PCT/US2023/035290
Publication Date:
April 25, 2024
Filing Date:
October 17, 2023
Assignee:
PROTEINEA INC (US)
International Classes:
G16B40/20; G06N3/0495; G06N3/09; G16B15/30; G16B20/20; G16B30/10; G16B45/00; G16B50/00
Domestic Patent References:
WO2020049087A1, 2020-03-12
Foreign References:
US20200019859A1, 2020-01-16
JP2021170350A, 2021-10-28
Other References:
KENNETH ATZ: "Geometric Deep Learning on Molecular Representations", ARXIV, 31 December 2021 (2021-12-31), XP093160928, DOI: 10.48550/arxiv.2107.12375
MICHAEL A. SKINNIDER: "Chemical language models enable navigation in sparsely populated chemical space", NATURE MACHINE INTELLIGENCE, vol. 3, no. 9, 1 September 2021 (2021-09-01), pages 759 - 770, XP093160929, ISSN: 2522-5839, DOI: 10.1038/s42256-021-00368-1
Attorney, Agent or Firm:
KHAN, Sikander (US)
Claims:
CLAIMS

1. A computer-implemented method, including: receiving input representations; processing the input representations; and based on the processing, generating output representations, wherein the output representations have compositions that are different from compositions of the input representations, and wherein the input representations and the output representations share at least one common function.

2. The computer-implemented method of claim 1, wherein the input representations are input gene representations, and the output representations are output gene representations, wherein the common function is a common gene function, wherein the input representations are input protein representations, and the output representations are output protein representations, wherein the input protein representations are protein sequences, protein structures, and/or n-dimensional feature vector embeddings, wherein the protein structures include primary protein structures, secondary protein structures, tertiary protein structures, and quaternary protein structures, wherein n > 1, wherein the common function is a common protein function, and wherein the output representations have enhanced capabilities relative to the input representations.

3. The computer-implemented method of claim 2, further including using at least one neural network system for processing the input representations, wherein the neural network system processes the input representations as input and generates the output representations as output, wherein the neural network system is at least one of a language model neural network, a sequence-to-sequence neural network, an encoder-decoder neural network, an autoencoder neural network, a variational autoencoder neural network, a generative adversarial neural network, a diffusion neural network, a Transformer neural network, a recurrent neural network, a long-short term memory neural network, an autoregressive neural network, an energy-based neural network, and a flow-based neural network, wherein the neural network system is trained on a training dataset using supervised learning, and wherein the training dataset comprises clusters of reference representation-variant representation pairs.

4. The computer-implemented method of claim 3, wherein a particular cluster of reference representation-variant representation pairs comprises a plurality of representations that have different compositions but share at least one common function, wherein the representations are gene representations, wherein the representations are protein representations, wherein reference representation-variant representation pairs in the particular cluster pair each representation in the plurality of representations with every other representation in the plurality of representations, thereby each of the reference representation-variant representation pairs comprises a reference representation that is paired with a variant representation that is different from the reference representation by at least one element but shares at least one common function with the reference representation, wherein the variant representation differs from the reference representation by many elements, wherein the element is an amino acid element, wherein the variant representation shares multiple common functions with the reference representation, wherein the common functions are common gene functions, and wherein the common functions are common protein functions.
5. The computer-implemented method of claim 4, wherein the neural network system is trained to process reference representations in the reference representation-variant representation pairs and, in response, generate approximations that progressively match corresponding variant representations in the reference representation-variant representation pairs.

6. The computer-implemented method of claim 5, wherein the clusters of reference representation-variant representation pairs are created by clustering based on one or more representation attributes, wherein the representation attributes correspond to biological constraints, and wherein the biological constraints include identity similarity, homology, structural similarity, size, length, distribution, and rarity.

7. The computer-implemented method of claim 6, wherein the clusters of reference representation-variant representation pairs are created by clustering those representations in a same cluster that have an identity score for at least one representation identity higher than a similarity threshold, wherein the representation identity includes homology overlap between the representations, wherein the representations are embedded in an embedding space, wherein the representation identity includes embedding distances between the representations in the embedding space, wherein the representation identity includes primary protein structure similarity between the protein representations, wherein the representation identity includes tertiary protein structure similarity between the protein representations, wherein the representation identity includes protein function similarity between the protein representations, and wherein a higher similarity threshold creates more of the clusters of reference representation-variant representation pairs, and a lower similarity threshold creates less of the clusters of reference representation-variant representation pairs.

8. The computer-implemented method of claim 7, further including filtering out from the clusters of reference representation-variant representation pairs those clusters of reference representation-variant representation pairs that have a representation count lower than a cluster size threshold, wherein the cluster size threshold is determined based on cluster size distributions observed across the clusters of reference representation-variant representation pairs.

9. The computer-implemented method of claim 8, further including filtering out from the clusters of reference representation-variant representation pairs those representations that are either below a rare short representation length threshold or above a rare high representation length threshold, wherein the rare short representation length threshold and the rare high representation length threshold are determined based on representation length and count distributions observed across representations in the clusters of reference representation-variant representation pairs.

10. The computer-implemented method of claim 9, further including replacing rare elements observed in the representations in the clusters of reference representation-variant representation pairs with an element mask, wherein the rare elements are rare amino acid elements.
11. The computer-implemented method of claim 10, further including, during inference, controlling exploration-exploitation trade-off in outputs of the neural network system using one or more sampling parameters, wherein the sampling parameters include a temperature sampling parameter.

12. The computer-implemented method of claim 11, wherein a lower temperature sampling parameter promotes exploitation that causes the neural network system to generate outputs that are more similar to input representations used during the supervised learning, and wherein a higher temperature sampling parameter promotes exploration that causes the neural network system to generate outputs that are less similar to the input representations used during the supervised learning.

13. The computer-implemented method of claim 12, wherein the sampling parameters include a k-value sampling parameter selected based on top-k sampling, wherein the sampling parameters include a p-value sampling parameter selected based on top-p sampling, wherein the sampling parameters include a beam count sampling parameter selected based on beam search sampling, and wherein the sampling parameters include a contrastive sampling parameter selected based on contrastive search sampling.

14. A system for native expansion of a sparse training dataset into a dense training dataset, comprising: memory storing a sparse training dataset that lacks target output sequences required as annotations for supervised training of a sequence generator, wherein the sparse training dataset has n unlabeled training examples; pairing logic configured for native expansion of the sparse training dataset into a dense training dataset of input-output pairs, wherein the dense training dataset has m labeled training examples whose generation is confined to the n unlabeled training examples, wherein m >> n, and wherein the pairing logic is configured to construct the dense training dataset by: generating the input-output pairs by pairing each unlabeled training example in the sparse training dataset with every other unlabeled training example in the sparse training dataset, wherein a particular input-output pair comprises an input training example labeled with an output training example; and training logic configured to implement the supervised training of the sequence generator using the dense training dataset by causing the sequence generator to process input training examples in the input-output pairs and, in response, generate approximations that progressively match corresponding output training examples in the input-output pairs.

15. The system of claim 14, wherein m = n², wherein m = ~n², and wherein the input training example and the output training example are protein sequences.

16. The system of claim 14, wherein the output training example is different from the input training example by at least one amino acid but shares at least one common protein function with the input training example.

17. The system of claim 14, further comprising: memory storing a sparse therapeutic training dataset that lacks target output therapeutics required as annotations for supervised training of a therapeutic generator, wherein the sparse
therapeutic training dataset has n unlabeled therapeutic training examples; pairing logic configured for native expansion of the sparse therapeutic training dataset into a dense therapeutic training dataset of input-output therapeutic pairs, wherein the dense therapeutic training dataset has m labeled therapeutic training examples whose generation is confined to the n unlabeled therapeutic training examples, wherein m >> n, and wherein the pairing logic is configured to construct the dense therapeutic training dataset by: generating the input-output therapeutic pairs by pairing each unlabeled therapeutic training example in the sparse therapeutic training dataset with every other unlabeled therapeutic training example in the sparse therapeutic training dataset, wherein a particular input-output therapeutic pair comprises an input training therapeutic example labeled with an output training therapeutic example; and training logic configured to implement the supervised training of the therapeutic generator using the dense therapeutic training dataset by causing the therapeutic generator to process input therapeutic training examples in the input-output therapeutic pairs and, in response, generate approximations that progressively match corresponding output therapeutic training examples in the input-output therapeutic pairs, wherein m = n², wherein m = ~n², wherein the input therapeutic training example and the output therapeutic training example are protein sequences, and wherein the output therapeutic training example is different from the input therapeutic training example by at least one amino acid but shares at least one common protein function with the input therapeutic training example.

18. A computer-implemented method, including: initializing a population of synonymous proteins that share at least one common function; grouping the population of synonymous proteins into a plurality of sub-populations of synonymous proteins based on one or more biological constraints; for each sub-population of synonymous proteins in the plurality of sub-populations of synonymous proteins, generating permutations of sequence-variant pairs by pairing each synonymous protein in a given sub-population of synonymous proteins with every other synonymous protein in the given sub-population; and conducting supervised training of a model to learn an input-output distribution that generates output proteins which share at least one common function with input proteins, wherein the supervised training uses sequences in the permutations of sequence-variant pairs as inputs and variants in the permutations of sequence-variant pairs as ground truth target outputs of the inputs.

19. The computer-implemented method of claim 18, further including controlling sampling from the input-output distribution for customized exploitation-exploration trade-off to generate the output proteins that diverge from the input proteins but are biologically valid.

20. The computer-implemented method of claim 18, wherein the output proteins have enhanced capabilities relative to the input proteins.
Description:
NATIVE EXPANSION OF A SPARSE TRAINING DATASET INTO A DENSE TRAINING DATASET FOR SUPERVISED TRAINING OF A SYNONYMOUS VARIANT SEQUENCE GENERATOR

PRIORITY APPLICATION

[0001] This application claims priority to or the benefit of US Provisional Patent Application No. 63/416,698, titled "Auto-Pairing: Enabling the Use of Sparse Unsupervised Data for Intelligent Therapeutics," filed October 17, 2022 (Attorney Docket No. PRTN1006USP01). The priority provisional patent application is hereby incorporated by reference for all purposes.

FIELD OF THE TECHNOLOGY

[0002] The technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge-based systems, reasoning systems, and knowledge acquisition systems), including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. The technology disclosed relates generally to data processing for artificial neural networks and, in particular, to systems and methods for using sparse unsupervised data for intelligent therapeutics.

BACKGROUND

[0003] The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also correspond to implementations of the claimed technology.

[0004] Sparsity in training data continually impedes the utilization of deep learning advances in healthcare. Data sparsity, such as the lack of training data for antibodies binding to newly emergent antigens, can prevent the development of needed intelligent therapeutics. Furthermore, the lack of paired training data, such as antibodies along with likewise-functional variants, can rule out a wide class of supervised generative models that are capable of adhering to relatively small datasets.

[0005] For instance, it is impractical to obtain a large-scale dataset consisting of antibodies binding to newly emergent or rarely occurring antigens for the purpose of pre-training an unsupervised model. The same sparsity hampers the search for annotated/paired data for the purpose of using supervised models. For example, the availability of protein-protein binders such as antibody-antigen pairs is very limited and almost non-existent for newly emergent or rarely occurring antigens. In existing approaches, the generation of functional therapeutic variants has been addressed by either rational design or directed evolution, which impose the constraints of low-throughput expert dependency or massive search spaces, respectively. From a deep learning perspective, even transfer learning from general-purpose large-scale models underperforms when directed to specialized datasets as small as the order of 10² records. Furthermore, data augmentation poses its own experimental validity limitations when applied to protein-based data.
Addressing the aforementioned research gaps in the generation of functional therapeutic variants spans long periods of time and requires unscalable cost, demands that often cannot be met.

[0006] Accordingly, there is a need for a method of sparse unsupervised data processing in intelligent therapeutics. Disclosed is a data representation and computational method, referred to as Auto-Pairing, that can expand input data and generate supervised data from unsupervised data, providing a robust approach to data sparsity and to the generation of annotative/target sequences in deep learning.

SUMMARY

[0007] According to certain aspects of the present disclosure, systems and methods are disclosed for enabling the use of sparse unsupervised data in intelligent therapeutics.

[0008] The technology disclosed can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections – these recitations are hereby incorporated forward by reference into each of the following implementations.

[0009] One or more implementations and features of the technology disclosed, or elements thereof, can be implemented in the form of a computer product, including a non-transitory computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations and features of the technology disclosed, or elements thereof, can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. Yet further, in another aspect, one or more implementations and features of the technology disclosed, or elements thereof, can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media).

[0010] The features described in this section can be combined as features. In the interest of conciseness, the combinations of features are not individually enumerated and are not repeated with each base set of features. The reader will understand how features identified in this section can readily be combined with sets of base features identified as implementations in other sections of this application. These features are not meant to be mutually exclusive, exhaustive, or restrictive; and the technology disclosed is not limited to these features but rather encompasses all possible combinations, modifications, and variations within the scope of the claimed technology and its equivalents.
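To make the expansion described in paragraph [0006] concrete, the following Python sketch is offered for illustration only; the function name auto_pair and the toy sequences are hypothetical and are not taken from the disclosure. It pairs each of n unlabeled sequences with every other sequence, turning n unsupervised examples into n(n-1) ≈ n² ordered input-target pairs.

```python
# Minimal sketch (not the reference implementation): expanding n unlabeled
# sequences into ordered (input, target) pairs for supervised training.
from itertools import permutations
from typing import List, Tuple

def auto_pair(sequences: List[str]) -> List[Tuple[str, str]]:
    """Pair every sequence with every other sequence, yielding n*(n-1)
    ordered (reference, variant) training examples from n unlabeled ones."""
    return [(ref, var) for ref, var in permutations(sequences, 2)]

# Example: 4 synonymous sequences -> 12 supervised training pairs (~n^2).
toy_dataset = ["QVQLVQSG", "QVQLVESG", "EVQLVESG", "QVKLVQSG"]
pairs = auto_pair(toy_dataset)
print(len(toy_dataset), "unlabeled examples ->", len(pairs), "labeled pairs")
```

In this sketch, every sequence serves both as an input and as a target for its neighbors, which is what converts the initially unsupervised dataset into a supervised one.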
[0011] Other implementations of the features described in this section can include a non- transitory computer readable storage medium storing instructions executable by a processor to perform any of the features described in this section. Yet another implementation of the features described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the features described in this section. [0012] In some implementations, the technology disclosed relates to a computer- implemented method. The method includes receiving input representations, processing the input representations, and based on the processing, generating output representations. The output representations have compositions that are different from compositions of the input representations. The input representations and the output representations share at least one common function. [0013] In one implementation, the input representations are input gene representations, and the output representations are output gene representations. In one implementation, the common function is a common gene function. [0014] In another implementation, the input representations are input protein representations, and the output representations are output protein representations. In some implementations, the input protein representations are protein sequences, protein structures, and/or n-dimensional feature vector embeddings. In some implementations, the protein structures include primary protein structures, secondary protein structures, tertiary protein structures, and quaternary protein Attorney Docket No. PRTN1006WO01 structures. In one implementation, n > 1. In other implementations, the common function is a common protein function. [0015] In some implementations, the technology disclosed uses at least one neural network system for processing the input representations. In one implementation, the neural network system processes the input representations as input and generates the output representations as output. In some implementations, the neural network system is at least one of a language model neural network, a sequence-to-sequence neural network, an encoder-decoder neural network, an autoencoder neural network, a variational autoencoder neural network, a generative adversarial neural network, a diffusion neural network, a Transformer neural network, a recurrent neural network, a long-short term memory neural network, an autoregressive neural network, an energy-based neural network, and a flow-based neural network. In some implementations, the neural network system is trained on a training dataset using supervised learning. The training dataset comprises clusters of reference representation-variant representation pairs. [0016] In one implementation, a particular cluster of reference representation-variant representation pairs comprises a plurality of representations that have different compositions but share at least one common function. In some implementations, the representations are gene representations. In other implementations, the representations are protein representations. [0017] In one implementation, reference representation-variant representation pairs in the particular cluster pair each representation in the plurality of representations with every other representation in the plurality of representations. 
Thereby, each of the reference representation- variant representation pairs comprises a reference representation that is paired with a variant representation that is different from the reference representation by at least one element but shares at least one common function with the reference representation. In some implementations, the variant representation differs from the reference representation by many elements. [0018] In some implementations, the element is an amino acid element. In some implementations, the variant representation shares multiple common functions with the reference representation. In other implementations, the common functions are common gene functions. In other implementations, the common functions are common protein functions. [0019] In one implementation, the neural network system is trained to process reference representations in the reference representation-variant representation pairs and, in response, generate approximations that progressively match corresponding variant representations in the reference representation-variant representation pairs. [0020] In some implementations, the clusters of reference representation-variant representation pairs are created by clustering based on one or more representation attributes. In Attorney Docket No. PRTN1006WO01 one implementation, the representation attributes correspond to biological constraints. In some implementations, the biological constraints include identity similarity, homology, structural similarity, size, length, distribution, and rarity. [0021] In some implementations, the clusters of reference representation-variant representation pairs are created by clustering those representations in a same cluster that have an identity score for at least one representation identity higher than a similarity threshold. In one implementation, the representation identity includes homology overlap between the representations. In another implementation, the representations are embedded in an embedding space. The representation identity includes embedding distances between the representations in the embedding space. In yet another implementation, the representation identity includes primary protein structure similarity between the protein representations. In yet another implementation, the representation identity includes tertiary protein structure similarity between the protein representations. [0022] In one implementation, the representation identity includes protein function similarity between the protein representations. In some implementations, a higher similarity threshold creates more of the clusters of reference representation-variant representation pairs, and a lower similarity threshold creates less of the clusters of reference representation-variant representation pairs. [0023] In some implementations, the technology disclosed includes filtering out from the clusters of reference representation-variant representation pairs those clusters of reference representation-variant representation pairs that have a representation count lower than a cluster size threshold. In one implementation, the cluster size threshold is determined based on cluster size distributions observed across the clusters of reference representation-variant representation pairs. 
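As a rough sketch of the clustering, cluster-size filtering, and within-cluster pairing described in paragraphs [0016]-[0023] above, the snippet below uses a simple pairwise identity ratio and a greedy assignment as stand-ins for whatever clustering tool an actual pipeline would use; all function names and thresholds are hypothetical, not taken from the disclosure.

```python
# Illustrative sketch only: greedy clustering by a pairwise identity score,
# cluster-size filtering, and all-vs-all pairing within each surviving cluster.
from difflib import SequenceMatcher
from itertools import permutations
from typing import List, Tuple

def identity(a: str, b: str) -> float:
    """Crude pairwise identity score in [0, 1]; a stand-in for an alignment-based score."""
    return SequenceMatcher(None, a, b).ratio()

def cluster(sequences: List[str], similarity_threshold: float = 0.8) -> List[List[str]]:
    """Greedily assign each sequence to the first cluster whose representative it
    matches above the threshold; otherwise start a new cluster."""
    clusters: List[List[str]] = []
    for seq in sequences:
        for members in clusters:
            if identity(seq, members[0]) >= similarity_threshold:
                members.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

def filter_and_pair(clusters: List[List[str]], cluster_size_threshold: int = 2) -> List[Tuple[str, str]]:
    """Drop clusters below the size threshold, then pair every member of each
    surviving cluster with every other member to form reference-variant pairs."""
    pairs: List[Tuple[str, str]] = []
    for members in clusters:
        if len(members) >= cluster_size_threshold:
            pairs.extend(permutations(members, 2))
    return pairs
```

Consistent with the threshold behavior described above, raising similarity_threshold in this sketch tends to produce more, tighter clusters, while lowering it merges sequences into fewer, looser clusters.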
[0024] In some implementations, the technology disclosed includes filtering out from the clusters of reference representation-variant representation pairs those representations that are either below a rare short representation length threshold or above a rare high representation length threshold. In one implementation, the rare short representation length threshold and the rare high representation length threshold are determined based on representation length and count distributions observed across representations in the clusters of reference representation- variant representation pairs. [0025] In some implementations, the technology disclosed includes replacing rare elements observed in the representations in the clusters of reference representation-variant representation Attorney Docket No. PRTN1006WO01 pairs with an element mask. In one implementation, the rare elements are rare amino acid elements. [0026] In some implementations, the technology disclosed includes, during inference, controlling exploration-exploitation trade-off in outputs of the neural network system using one or more sampling parameters. In one implementation, the sampling parameters include a temperature sampling parameter. In some implementations, a lower temperature sampling parameter promotes exploitation that causes the neural network system to generate outputs that are more similar to input representations used during the supervised learning. In other implementations, a higher temperature sampling parameter promotes exploration that causes the neural network system to generate outputs that are less similar to the input representations used during the supervised learning. [0027] In other implementations, the sampling parameters include a k-value sampling parameter selected based on top-k sampling. In yet other implementations, the sampling parameters include a p-value sampling parameter selected based on top-p sampling. In yet further implementations, the sampling parameters include a beam count sampling parameter selected based on beam search sampling. In yet other implementations, the sampling parameters include a contrastive sampling parameter selected based on contrastive search sampling. [0028] In some implementations, the output representations have enhanced capabilities relative to the input representations. [0029] In some implementations, the technology disclosed includes a system for native expansion of a sparse training dataset into a dense training dataset. The system comprises memory, pairing logic, and training logic. The memory stores a sparse training dataset that lacks target output sequences required as annotations for supervised training of a sequence generator. The sparse training dataset has n unlabeled training examples. The pairing logic is configured for native expansion of the sparse training dataset into a dense training dataset of input-output pairs. The dense training dataset has m labeled training examples whose generation is confined to the n unlabeled training examples, with m >> n (e.g., m = n 2 , m = ~ n 2 ). The pairing logic is configured to construct the dense training dataset by generating the input-output pairs by pairing each unlabeled training example in the sparse training dataset with every other unlabeled training example in the sparse training dataset. A particular input-output pair comprises an input training example labeled with an output training example. 
The training logic is configured to implement the supervised training of the sequence generator using the dense training dataset by causing the sequence generator to process input training examples in the input-output pairs and, in response, Attorney Docket No. PRTN1006WO01 generate approximations that progressively match corresponding output training examples in the input-output pairs. [0030] In some implementations, the input training example and the output training example are protein sequences. In some implementations, the output training example is different from the input training example by at least one amino acid but shares at least one common protein function with the input training example. [0031] In some implementations, the technology disclosed includes a therapeutic system. The therapeutic system comprises memory, pairing logic, and training logic. The memory stores a sparse therapeutic training dataset that lacks target output therapeutics required as annotations for supervised training of a therapeutic generator. The sparse therapeutic training dataset has n unlabeled therapeutic training examples. The pairing logic is configured for native expansion of the sparse therapeutic training dataset into a dense therapeutic training dataset of input-output therapeutic pairs. The dense therapeutic training dataset has m labeled therapeutic training examples whose generation is confined to the n unlabeled therapeutic training examples, with m >> n (e.g., m = n 2 , m = ~ n 2 ). The pairing logic is configured to construct the dense therapeutic training dataset by generating the input-output therapeutic pairs by pairing each unlabeled therapeutic training example in the sparse therapeutic training dataset with every other unlabeled therapeutic training example in the sparse therapeutic training dataset. A particular input-output therapeutic pair comprises an input training therapeutic example labeled with an output training therapeutic example. The training logic is configured to implement the supervised training of the therapeutic generator using the dense therapeutic training dataset by causing the therapeutic generator to process input therapeutic training examples in the input-output therapeutic pairs and, in response, generate approximations that progressively match corresponding output therapeutic training examples in the input-output therapeutic pairs. [0032] In some implementations, the input therapeutic training example and the output therapeutic training example are protein sequences. In some implementations, the output therapeutic training example is different from the input therapeutic training example by at least one amino acid but shares at least one common protein function with the input therapeutic training example. [0033] In some implementations, the technology disclosed relates to a computer- implemented method. The method includes initializing a population of synonymous proteins that share at least one common function. The method further includes grouping the population of synonymous proteins into a plurality of sub-populations of synonymous proteins based on one or more biological constraints. The method further includes, for each sub-population of Attorney Docket No. PRTN1006WO01 synonymous proteins in the plurality of sub-populations of synonymous proteins, generating permutations of sequence-variant pairs by pairing each synonymous protein in a given sub- population of synonymous proteins with every other synonymous protein in the given sub- population. 
The method further includes conducting supervised training of a model to learn an input-output distribution that generates output proteins which share at least one common function with input proteins. The supervised training uses sequences in the permutations of sequence-variant pairs as inputs and variants in the permutations of sequence-variant pairs as ground truth target outputs of the inputs. [0034] In some implementations, the method further includes controlling sampling from the input-output distribution for customized exploitation-exploration trade-off to generate the output proteins that diverge from the input proteins but are biologically valid. [0035] In some implementations, the output proteins have enhanced capabilities relative to the input proteins. [0036] Other aspects and advantages of the technology described herein can be seen on review of the drawings, the detailed description, and the claims, which follow. BRIEF DESCRIPTION OF THE DRAWINGS [0037] The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee. The color drawings also may be available in PAIR via the Supplemental Content tab. [0038] In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings, in which. [0039] Figure 1 illustrates one implementation of a training pipeline that implements the disclosed native expansion of a sparse training dataset into a dense training dataset for supervised training of a synonymous variant sequence generator. [0040] Figure 2 depicts different components of the disclosed clustering logic. [0041] Figure 3 portrays one implementation of the disclosed clustering pipeline. [0042] Figure 4 shows one implementation of the output of the disclosed clustering pipeline portrayed in Figure 3. [0043] Figure 5 illustrates one implementation of the disclosed pairing logic. Attorney Docket No. PRTN1006WO01 [0044] Figure 6 portrays one implementation of the disclosed filtering logic. [0045] Figure 7 is a graph illustrating a distribution of similarity clusters, according to techniques disclosed herein. [0046] Figure 8 is a graph illustrating the distribution of sequence lengths, according to techniques disclosed herein. [0047] Figure 9 depicts one implementation of supervised training of the disclosed sequence generator using the dense training dataset. [0048] Figure 10 depicts one implementation of runtime execution of the disclosed trained sequence generator during inference. [0049] Figure 11 portrays one implementation of the disclosed sampling logic. [0050] Figure 12 is a flowchart illustrating a computer-implemented method for conducting supervised training of a model to learn an input-output distribution that generates output proteins which share at least one common function with input proteins, according to techniques disclosed herein. [0051] Figure 13 shows CoV-AbDab filtration entries. [0052] Figure 14 shows CoV-AbDab SARS-COV-2 Nanobody subset. [0053] Figure 15 shows amino acids letter abbreviation. [0054] Figure 16 illustrates one implementation of the disclosed pipeline. 
[0055] Figure 17 depicts one implementation of the pairwise identity scores per temperature. [0056] Figure 18 shows one implementation of the MSA of a subsample of the generated sequences belonging to temperature 0.3. [0057] Figure 19 illustrates one implementation of the entropy curves of the temperature 0.3 generated sequences vs. the natural. [0058] Figure 20 illustrates one implementation of the entropy curves of the temperature 0.4 generated sequences vs. the natural. [0059] Figure 21 illustrates one implementation of the entropy curves of the temperature 0.5 generated sequences vs. the natural. [0060] Figure 22 illustrates one implementation of the fitness space. [0061] Figure 23 shows one implementation of the sequence logo of temperature 0.3. [0062] Figure 24 shows one implementation of the sequence logo of a conserved subsequence of temperature 0.3. [0063] Figure 25 depicts one implementation of the structure of the 3ntc(H) group. [0064] Figure 26 depicts one implementation of the structure of the 4b41A group. Attorney Docket No. PRTN1006WO01 [0065] Figure 27 shows one implementation of the disclosed transfer-learned pairwise identity scores with temperatures. [0066] Figure 28 shows one implementation of a sample of the transfer-learned model’s entropy curves for temperature 0.8. [0067] Figure 29 shows one implementation of a sample of the transfer-learned model’s entropy curves for temperature 0.9. [0068] Figure 30 shows one implementation of a sample of the transfer-learned model’s entropy curves for temperature 1. [0069] Figure 31 illustrates one implementation of a sample of the transfer-learned model’s sequence logos for temp 0.6. DETAILED DESCRIPTION [0070] The following detailed description is made with reference to the figures. Example implementations are described to illustrate the technology disclosed, not to limit its scope, which is defined by the claims. Those of ordinary skill in the art will recognize a variety of equivalent variations on the description that follows. Reference will now be made in detail to the exemplary implementations of the present disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. [0071] The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein. [0072] The systems, devices, and methods disclosed herein are described in detail by way of examples and with reference to the figures. The examples discussed herein are examples only and are provided to assist in the explanation of the apparatuses, devices, systems, and methods described herein. None of the features or components shown in the drawings or discussed below should be taken as mandatory for any specific implementation of any of these devices, systems, or methods unless specifically designated as mandatory. Attorney Docket No. 
PRTN1006WO01 [0073] Also, for any methods described, regardless of whether the method is described in conjunction with a flow diagram, it should be understood that unless otherwise specified or required by context, any explicit or implicit ordering of steps performed in the execution of a method does not imply that those steps must be performed in the order presented but instead may be performed in a different order or in parallel. [0074] The detailed description of various implementations will be better understood when read in conjunction with the appended drawings. To the extent that the figures illustrate diagrams of the functional blocks of the various implementations, the functional blocks are not necessarily indicative of the division between hardware circuitry. Thus, for example, one or more of the functional blocks (e.g., modules, processors, or memories) may be implemented in a single piece of hardware (e.g., a general-purpose signal processor or a block of random-access memory, hard disk, or the like) or multiple pieces of hardware. Similarly, the programs may be stand-alone programs, may be incorporated as subroutines in an operating system, may be functions in an installed software package, and the like. It should be understood that the various implementations are not limited to the arrangements and instrumentality shown in the drawings. [0075] The processing engines and databases of the figures, designated as modules, can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some of the modules can also be implemented on different processors, computers, or servers, or spread among a number of different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel or in a different sequence than that shown in the figures without affecting the functions achieved. The modules in the figures can also be thought of as flowchart steps in a method. A module also need not necessarily have all its code disposed contiguously in memory; some parts of the code can be separated from other parts of the code with code from other modules or other functions disposed in between. Terminology [0076] Unless defined otherwise herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. Various scientific dictionaries that include the terms included herein are well known and available to those in the art. Any methods and materials similar or equivalent to those described herein find use in the practice of the implementations disclosed herein. [0077] The terms defined immediately below are more fully understood by reference to the specification as a whole. The definitions are for the purpose of describing particular Attorney Docket No. PRTN1006WO01 implementations only and aiding in understanding the complex concepts described in this specification. They are not intended to limit the full scope of the disclosure. Specifically, it is to be understood that this disclosure is not limited to the particular sequences, compositions, algorithms, systems, methodology, protocols, and reagents described, as these may vary, depending upon the context they are used by those of skill in the art. [0078] As used in this specification and appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the content and context clearly dictates otherwise. 
Thus, for example, reference to “a device” includes a combination of two or more such devices, and the like. [0079] Unless indicated otherwise, an “or” conjunction is intended to be used in its correct sense as a Boolean logical operator, encompassing both the selection of features in the alternative (A or B, where the selection of A is mutually exclusive from B) and the selection of features in conjunction (A or B, where both A and B are selected). In some places in the text, the term “and/or” is used for the same purpose, which shall not be construed to imply that “or” is used with reference to mutually exclusive alternatives. [0080] As used herein, the term “exemplary” is used in the sense of “example,” rather than “ideal.” Moreover, the terms “a” and “an” herein do not denote a limitation of quantity, but rather denote the presence of one or more of the referenced items. [0081] A “bio-molecule” or “biological molecule” refers to a molecule that is generally found in a biological organism. In some implementations, biological molecules comprise polymeric biological macromolecules having multiple subunits (i.e., “biopolymers”). Typical bio-molecules include, but are not limited to, molecules that share some structural features with naturally occurring polymers such as RNAs (formed from nucleotide subunits), DNAs (formed from nucleotide subunits), and peptides or polypeptides (formed from amino acid subunits), including, e.g., RNAs, RNA analogues, DNAs, DNA analogues, polypeptides, polypeptide analogues, peptide nucleic acids (PNAs), combinations of RNA and DNA (e.g., chimeraplasts), or the like. It is not intended that bio-molecules be limited to any particular molecule, as any suitable biological molecule finds use in the present invention, including but not limited to, e.g., lipids, carbohydrates, or other organic molecules that are made by one or more genetically encodable molecules (e.g., one or more enzymes or enzyme pathways) or the like. [0082] The terms “polynucleotide” and “nucleic acid” refer to deoxyribonucleotides or ribonucleotides and polymers (e.g., oligonucleotides, polynucleotides, etc.) thereof in either single- or double-stranded form. These terms include, but are not limited to, single-, double- or triple-stranded DNA, genomic DNA, cDNA, RNA, DNA-RNA hybrid, polymers comprising Attorney Docket No. PRTN1006WO01 purine and pyrimidine bases, and/or other natural, chemically or biochemically modified, non- natural or derivatized nucleotide bases. The following are non-limiting examples of polynucleotides: genes, gene fragments, chromosomal fragments, ESTs, exons, introns, mRNA, tRNA, rRNA, ribozymes, cDNA, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, and primers. In some implementations, polynucleotides comprise modified nucleotides, such as methylated nucleotides and nucleotide analogs, uracyl, other sugars and linking groups such as fluororibose and thioate, and/or nucleotide branches. In some alternative implementations, the sequence of nucleotides is interrupted by non-nucleotide components. [0083] Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides that have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. 
Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions) and complementary sequences as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues (Batzer et al. (1991) Nucleic Acid Res.19:5081; Ohtsuka et al. (1985) J. Biol. Chem.260:2605-2608; Rossolini et al. (1994) Mol. Cell. Probes 8:91-98). The term nucleic acid is used interchangeably with, e.g., oligonucleotide, polynucleotide, cDNA, and mRNA. [0084] The terms “protein,” “polypeptide” and “peptide” are used interchangeably to denote a polymer of at least two amino acids covalently linked by an amide bond, regardless of length or post-translational modification (e.g., glycosylation, phosphorylation, lipidation, myristilation, ubiquitination, etc.). In some cases, the polymer has at least about 30 amino acid residues, and usually at least about 50 amino acid residues. More typically, they contain at least about 100 amino acid residues. The terms include compositions conventionally considered to be fragments of full-length proteins or peptides. Included within this definition are D- and L-amino acids, and mixtures of D- and L-amino acids. The polypeptides described herein are not restricted to the genetically encoded amino acids. Indeed, in addition to the genetically encoded amino acids, the polypeptides described herein may be made up of, either in whole or in part, naturally-occurring and/or synthetic non-encoded amino acids. In some implementations, a polypeptide is a portion of the full-length ancestral or parental polypeptide, containing amino acid additions or deletions (e.g., gaps) or substitutions as compared to the amino acid sequence of the full-length parental polypeptide, while still retaining functional activity (e.g., catalytic activity). Attorney Docket No. PRTN1006WO01 [0085] As used herein, the term “cellulase” refers to a category of enzymes capable of hydrolyzing cellulose (ȕ-1,4-glucan or ȕ-D-glucosidic linkages) to shorter cellulose chains, oligosaccharides, cellobiose and/or glucose. In some implementations, the term “cellulase” encompasses beta-glucosidases, endoglucanases, cellobiohydrolases, cellobiose dehydrogenases, endoxylanases, beta-xylosidases, arabinofuranosidases, alpha-glucuronidases, acetylxylan esterases, feruloyl esterases, and/or alpha-glucuronyl esterases. In some implementations, the term “cellulase” encompasses hemicellulose-hydrolyzing enzymes, including but not limited to endoxylanases, beta-xylosidases, arabinofuranosidases, alpha-glucuronidases, acetylxylan esterase, feruloyl esterase, and alpha-glucuronyl esterase. A “cellulase-producing fungal cell” is a fungal cell that expresses and secretes at least one cellulose hydrolyzing enzyme. In some implementations, the cellulase-producing fungal cells express and secrete a mixture of cellulose hydrolyzing enzymes. “Cellulolytic,” “cellulose hydrolyzing,” “cellulose degrading,” and similar terms refer to enzymes such as endoglucanases and cellobiohydrolases (the latter are also referred to as “exoglucanases”) that act synergistically to break down the cellulose to soluble di- or oligosaccharides such as cellobiose, which are then further hydrolyzed to glucose by beta- glucosidase. 
In some implementations, the cellulase is a recombinant cellulase selected from ȕ- glucosidases (BGLs), Type 1 cellobiohydrolases (CBH1s), Type 2 cellobiohydrolases (CBH2s), glycoside hydrolase 61s (GH61s), and/or endoglucanases (EGs). In some implementations, the cellulase is a recombinant Myceliophthora cellulase selected from ȕ-glucosidases (BGLs), Type 1 cellobiohydrolases (CBH1s), Type 2 cellobiohydrolases (CBH2s), glycoside hydrolase 61s (GH61s), and/or endoglucanases (EGs). In some additional implementations, the cellulase is a recombinant cellulase selected from EG1b, EG2, EG3, EG4, EG5, EG6, CBH1a, CBH1b, CBH2a, CBH2b, GH61a, and/or BGL. [0086] The term “sequence” is used herein to refer to the order and identity of any biological sequences including but not limited to a whole genome, whole chromosome, chromosome segment, collection of gene sequences for interacting genes, gene, nucleic acid sequence, protein, polysaccharide, etc. In some contexts, a sequence refers to the order and identity of amino acid residues in a protein (i.e., a protein sequence or protein character string) or to the order and identity of nucleotides in a nucleic acid (i.e., a nucleic acid sequence or nucleic acid character string). A sequence may be represented by a character string. A “nucleic acid sequence” refers to the order and identity of the nucleotides comprising a nucleic acid. A “protein sequence” refers to the order and identity of the amino acids comprising a protein or peptide. Attorney Docket No. PRTN1006WO01 [0087] “Codon” refers to a specific sequence of three consecutive nucleotides that is part of the genetic code and that specifies a particular amino acid in a protein or starts or stops protein synthesis. [0088] “Native sequence” or “wild type sequence” refers to a polynucleotide or polypeptide isolated from a naturally occurring source. Included within “native sequence” are recombinant forms of a native polypeptide or polynucleotide which have a sequence identical to the native form. [0089] The term “gene” is used broadly to refer to any segment of DNA or other nucleic acid associated with a biological function. Thus, genes include coding sequences and optionally, the regulatory sequences required for their expression. Genes also optionally include nonexpressed nucleic acid segments that, for example, form recognition sequences for other proteins. Genes can be obtained from a variety of sources, including cloning from a source of interest or synthesizing from known or predicted sequence information, and may include sequences designed to have desired parameters. [0090] A “motif” refers to a pattern of subunits in or among biological molecules. For example, the term “motif” can be used in reference to a subunit pattern of the unencoded biological molecule or to a subunit pattern of an encoded representation of a biological molecule. [0091] The term “chromosome” is used in reference to an organized structure of DNA and associated protein found cells, comprising a single piece of coiled DNA including many genes, regulatory elements, and other nucleotide sequences. The term is also used in reference to the DNA sequence of the structure. [0092] In the context of genetic algorithm, the term “chromosome” is used as an alias for an individual model (or a set of model parameters) in a population of models. 
It is so used because a model from a parent generation passes its parameters (or genes) onto the models of a child generation, which resembles the manners that a parent chromosome passing its genes to a child chromosome. [0093] A “fragment” is any portion of a sequence of nucleotides or amino acids. Fragments may be produced using any suitable method known in the art, including but not limited to cleaving a polypeptide or polynucleotide sequence. In some implementations, fragments are produced by using nucleases that cleave polynucleotides. In some additional implementations, fragments are generated using chemical and/or biological synthesis techniques. In some implementations, fragments comprise subsequences of at least one parental sequence, generated using partial chain elongation of complementary nucleic acid(s). Attorney Docket No. PRTN1006WO01 [0094] “Parental polypeptide,” “parental polynucleotide,” “parent nucleic acid,” and “parent” are generally used to refer to the wild-type polypeptide, wild-type polynucleotide, or a variant used as a starting point in a diversity generation procedure such as a directed evolution. In some implementations, the parent itself is produced via shuffling or other diversity generation procedures. In some implementations, mutants used in directed evolution are directly related to a parent polypeptide. In some implementations, the parent polypeptide is stable when exposed to extremes of temperature, pH and/or solvent conditions and can serve as the basis for generating variants for shuffling. In some implementations, the parental polypeptide is not stable to extremes of temperature, pH and/or solvent conditions, and the parental polypeptide is evolved to make a robust variant. [0095] A “parent nucleic acid” encodes a parental polypeptide. [0096] “Mutant,” “variant,” and “variant sequence” as used herein, refer to a biological sequence that differs in some respect from a standard or reference sequence. The difference may be referred to as a “mutation”. In some implementations, a mutant is an amino acid (i.e., polypeptide) or polynucleotide sequence that has been altered by at least one substitution, insertion, cross-over, deletion, and/or other genetic operation. For purposes of the present disclosure, mutants and variants are not limited to a particular method by which they are generated. In some implementations, a mutant or variant sequence has increased, decreased, or substantially similar activities or properties, in comparison to the parental sequence. In some implementations, the variant polypeptide comprises one or more amino acid residues that have been mutated, as compared to the amino acid sequence of the wild-type polypeptide (e.g., a parent polypeptide). In some implementations, one or more amino acid residues of the polypeptide are held constant, are invariant, or are not mutated as compared to a parent polypeptide in the variant polypeptides making up the plurality. In some implementations, the parent polypeptide is used as the basis for generating variants with improved stability, activity, or other property. [0097] A “library” or “population” refers to a collection of at least two different molecules, character strings, and/or models, such as nucleic acid sequences (e.g., genes, oligonucleotides, etc.) or expression products (e.g., enzymes or other proteins) therefrom. A library or population generally includes a number of different molecules. For example, a library or population typically includes at least about 10 different molecules. 
Large libraries typically include at least about 100 different molecules, more typically at least about 1000 different molecules. For some applications, the library includes at least about 10000 or more different molecules. In certain Attorney Docket No. PRTN1006WO01 implementations, the library contains a number variant or chimeric nucleic acids or proteins produced by a directed evolution procedure. [0098] Two nucleic acids are “recombined” when sequences from each of the two nucleic acids are combined in a progeny nucleic acid. Two sequences are “directly” recombined when both of the nucleic acids are substrates for recombination. [0099] “Selection” refers to the process in which one or more bio-molecules are identified as having one or more properties of interest. Thus, for example, one can screen a library to determine one or more properties of one or more library members. If one or more of the library members is/are identified as possessing a property of interest, it is selected. Selection can include the isolation of a library member, but this is not necessary. Further, selection and screening can be, and often are, simultaneous. [00100] “Reference sequence” is a sequence from which variation of sequence is effected. In some cases, a “reference sequence” is used to define the variations. Such sequence may be one predicted by a model to have the highest value (or one of the highest values) of the desired activity. In another case, the reference sequence may be that of a member of an original protein variant library. In certain implementations, a reference sequence is the sequence of a parent protein or nucleic acid. [00101] “Training set” or “training dataset” refers to a set of sequence generator data or observations that one or more models are fitted to and built upon. For instance, for a protein sequence generator model, also referred to herein as “synonymous variant sequence generator,” a training set comprises residue sequences for an initial or improved protein variant library. Typically, these data include complete or partial residue sequence information, together with an activity value for each protein in the library. In some cases, multiple types of activities (e.g., rate constant data and thermal stability data) are provided together in the training set. The activity is sometimes a beneficial property. [00102] “Cross validation” refers to a method for testing the generalizability of a model’s ability to predict a value of interest (i.e., the value of the dependent variable). The method prepares a model using one set of data, and tests the model error using a different set of data. The first set of data is viewed as a training set, and the second set of data is a validation set. [00103] The term “observation” is information about protein or other biological entity that may be used in a training set for generating a model such as a sequence activity model. The term “observation” may refer to any sequenced and assayed biological molecules, including protein variants. In certain implementations, each observation is an activity value and an associated Attorney Docket No. PRTN1006WO01 sequence for a variant in a library. Generally, the more observations employed to create a sequence generator model, the better the predictive power of that sequence generator model. 
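To make the cross-validation definition in paragraph [00102] concrete, the following is a minimal, illustrative sketch (not taken from the disclosure) of splitting a set of observations into a training set and a validation set. The observation format, split fraction, and helper name are assumptions made for illustration only.

```python
# Minimal sketch (illustrative only): the hold-out scheme described in the
# "cross validation" definition above, splitting observations into a training set
# used to fit a model and a validation set used to estimate its error.
import random

def split_observations(observations, validation_fraction=0.2, seed=0):
    """observations: list of (sequence, activity_value) pairs."""
    shuffled = observations[:]
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    # Remaining items form the training set; the held-out items form the validation set.
    return shuffled[n_validation:], shuffled[:n_validation]

data = [(f"SEQ{i}", float(i % 7)) for i in range(100)]   # toy observations
train_set, validation_set = split_observations(data)
print(len(train_set), len(validation_set))               # 80 20
```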
[00104] As used herein, the term “beneficial property” is intended to refer to a phenotypic or other identifiable feature that confers some benefit to a protein or a composition of matter or process associated with the protein. Examples of beneficial properties include an increase or decrease, when compared to a parent protein, in a variant protein’s catalytic properties, binding properties, stability when exposed to extremes of temperature, pH, etc., sensitivity to stimuli, inhibition, and the like. Other beneficial properties may include an altered profile in response to a particular stimulus. Further examples of beneficial properties are set forth below. Values of beneficial properties may be used as activity values in the observations used in a training set for a sequence activity model. [00105] “Next-generation sequencing” or “high-throughput sequencing” are sequencing techniques that parallelize the sequencing process, producing thousands or millions of sequences at once. Examples of suitable next-generation sequencing methods include, but are not limited to, single molecule real-time sequencing (e.g., Pacific Biosciences, Menlo Park, Calif.), ion semiconductor sequencing (e.g., Ion Torrent, South San Francisco, Calif.), pyrosequencing (e.g., 454, Branford, Conn.), sequencing by ligation (e.g., SOLid sequencing of Life Technologies, Carlsbad, Calif.), sequencing by synthesis and reversible terminator (e.g., Illumina, San Diego, Calif.), nucleic acid imaging technologies such as transmission electron microscopy, and the like. Further descriptions of exemplary techniques are described in the detailed description of this disclosure. [00106] The term “systematically varied sequences” refers to a set of sequences in which each residue is seen in multiple contexts. In principle, the level of systematic variation can be quantified by the degree to which the sequences are orthogonal from one another (i.e., maximally different compared to the mean). [00107] The term “toggling” refers to the introduction of multiple amino acid residue types into a specific position in the sequences of protein variants in the optimized library. [00108] The term “encoded character string” refers to a representation of a biological molecule that preserves sequence/structural information regarding that molecule. In some implementations, the encoded character string contains information about sequence mutations in a library of variants. Encoded character strings of bio-molecules along with activity information for the bio-molecules may be used as a training set for a sequence activity model. Non-sequence properties of bio-molecules can be stored or otherwise associated with encoded character strings for the bio-molecules. Attorney Docket No. PRTN1006WO01 [00109] The terms “regression” and “regression analysis” refer to techniques used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. It is a statistical technique for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. 
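Stepping back briefly to the "encoded character string" of paragraph [00108] before the regression discussion continues, the sketch below one-hot encodes a protein character string over a fixed amino acid alphabet. The alphabet, the use of "X" as a catch-all token, and the function name are assumptions made for illustration and are not prescribed by the disclosure.

```python
# Minimal sketch (illustrative only): integer- and one-hot-encoding a protein
# "encoded character string" so it can serve as numeric input to a sequence model.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY" + "X"   # 20 canonical residues plus a catch-all token
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Return a (sequence length, alphabet size) one-hot matrix for a protein string."""
    matrix = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for position, residue in enumerate(sequence):
        matrix[position, AA_TO_INDEX.get(residue, AA_TO_INDEX["X"])] = 1.0
    return matrix

encoded = one_hot_encode("MKTAYIAKQR")
print(encoded.shape)  # (10, 21)
```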
More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. Regression techniques may be used to generate the disclosed sequence generators from training sets comprising multiple observations, which may contain sequence information. [00110] Partial Least Squares or PLS is a family of methods that finds a linear regression model by projecting predicted variables and the observable variables to a new space. PLS is also known as projection to latent structures. Both the X (independent variables) and Y (dependent variables) data are projected to new spaces. PLS is used to find the fundamental relations between two matrices (X and Y). A latent variable approach is used to model the covariance structures in the X and Y spaces. A PLS model will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. PLS regression is particularly suited when the matrix of predictors has more variables than observations, and when there is multicollinearity among X values. [00111] A "descriptor" refers to something that serves to describe or identify an item. For example, characters in a character string can be descriptors of amino acids in a polypeptide being represented by the character string. [00112] In a regression model, the dependent variable is related to independent variables by a sum of terms. Each term includes a product of an independent variable and an associated regression coefficient. In the case of a purely linear regression model, the regression coefficients are given by β in the following form of expression: [00113] yi = β1xi1 + ... + βpxip + εi = xiTβ + εi [00114] where yi is the dependent variable, the xi are the independent variables, εi is the error variable, and T denotes the transpose, so that xiTβ is the inner product of the vectors xi and β. [00115] "Principal component regression" (PCR) refers to a regression analysis that uses principal component analysis when estimating regression coefficients. In PCR, instead of regressing the dependent variable on the independent variables directly, the principal components of the independent variables are used. PCR typically only uses a subset of the principal components in the regression. [00116] "Principal component analysis" (PCA) refers to a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to (i.e., uncorrelated with) the preceding components. [00117] "Neural network" is a model containing an interconnected group of processing elements or "neurons" that process information using a connectionist approach to computation. Neural networks are used to model complex relationships between inputs and outputs or to find patterns in data. Most neural networks process data in a non-linear, distributed, parallel fashion.
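Before continuing with neural networks, here is a minimal, illustrative sketch of the linear regression form yi = xiTβ + εi of paragraphs [00112]-[00114] and of principal component analysis as defined in paragraph [00116]. The synthetic data and parameter values are assumptions for illustration only.

```python
# Minimal sketch (illustrative only): fitting y_i = x_i^T beta + eps_i by ordinary
# least squares, and computing principal components with a singular value decomposition.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # independent variables (observations x features)
true_beta = np.array([1.5, -2.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=100)     # dependent variable with noise

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimate of the coefficients
print(beta_hat)

# Principal component analysis: center the data, then take the right singular vectors.
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
explained_variance = singular_values**2 / (len(X) - 1)
print(components[0], explained_variance)
```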
In most cases a neural network is an adaptive system that changes its structure during a learning phase. Functions are performed collectively and in parallel by the processing elements, rather than there being a clear delineation of subtasks to which various units are assigned. [00118] Generally, a neural network involves a network of simple processing elements that exhibit complex global behavior determined by the connections between the processing elements and element parameters. Neural networks are used with algorithms designed to alter the strength of the connections in the network to produce a desired signal flow. The strength is altered during training or learning. [00119] "Random forest" refers to a combination of classification tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. A random forest is a learning ensemble consisting of a bagging of un-pruned decision tree learners with a randomized selection of features at each split of the decision tree. A random forest grows a large number of classification trees, each of which votes for the most popular class. The random forest then classifies a variable by taking the most popular voted class from all the tree predictors in the forest. [00120] "Prior probability distribution", or "prior," of an uncertain quantity p is the probability distribution that expresses the uncertainty about p before data of interest (e.g., a training set of protein sequences) are taken into account. The unknown quantity may be a parameter, coefficient, variable, latent variable, or the like (e.g., a coefficient in a multiple regression model). [00121] "Posterior probability distribution," or "posterior," of an uncertain quantity p is the probability distribution that expresses the uncertainty about p after the data of interest are taken into account. [00122] The term "Bayesian linear regression" refers to an approach to linear regression in which the statistical analysis is undertaken within the context of Bayesian inference. The prior belief about the linear regression model, including the prior probability distribution function of the model's parameters, is combined with the data's likelihood function according to Bayes' theorem to yield the posterior probability distribution over the parameters. [00123] "Overfitting" refers to a condition that occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model which has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data. [00124] The term "base model" is used in reference to the disclosed sequence generator provided at the beginning of a process of improving a model. [00125] The term "updated model" is used in reference to the disclosed sequence generator that is derived directly or indirectly from a base model and that has improved predictive power compared to the base model and/or another model from which it is derived. [00126] A "likelihood function" or "likelihood" of a model is a function of the parameters of a statistical model. The likelihood of a set of parameter values θ given some observed outcomes x equals the probability of those observed outcomes given those parameter values, i.e., L(θ|x) = P(x|θ).
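The following is a minimal sketch, illustrative only, of the Bayesian linear regression described in paragraph [00122] under simple conjugate assumptions (a zero-mean Gaussian prior on the coefficients and a known noise variance). The closed-form posterior used below follows from Bayes' theorem for that special case, and the numeric values are arbitrary.

```python
# Minimal sketch (illustrative only): conjugate Bayesian linear regression with a
# Gaussian prior on the coefficients and a known noise variance, giving the posterior
# mean and covariance of the coefficients in closed form.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.2 * rng.normal(size=50)

sigma2 = 0.2**2      # assumed noise variance of the likelihood
tau2 = 10.0          # prior variance: beta ~ N(0, tau2 * I)

# Posterior covariance and mean for the coefficients given the data.
posterior_cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)
posterior_mean = posterior_cov @ X.T @ y / sigma2
print(posterior_mean)   # posterior mean; close to the generating coefficients [2.0, -1.0]
```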
[00127] "Monte Carlo simulations" are simulations that rely on a large number of random samples to obtain numerical results that simulate a real phenomenon. For instance, drawing a large number of pseudo-random uniform variables from the interval (0,1], and assigning values less than or equal to 0.50 as heads and greater than 0.50 as tails, is a Monte Carlo simulation of the behavior of repeatedly tossing a coin. [00128] A "Metropolis algorithm" or "Metropolis-Hastings algorithm" is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult. This sampling sequence can be used to approximate the distribution (i.e., to generate a histogram), or to compute an integral (such as an expected value). Metropolis-Hastings and other MCMC algorithms are generally used for sampling from multi-dimensional distributions, especially when the number of dimensions is high. The objective of the Metropolis-Hastings algorithm is to asymptotically generate states x according to a desired distribution P(x), and it uses a stochastic process to do so. The idea of the algorithm is to condition the stochastic process such that it asymptotically converges to the unique distribution P(x). [00129] A "Markov chain" is a sequence of random variables X1, X2, X3, ... with the Markov property. In other words, given the present state, the future and past states are independent. Formally, Pr(Xn+1 = x | X1 = x1, X2 = x2, ..., Xn = xn) = Pr(Xn+1 = x | Xn = xn). [00130] The possible values of Xi form a countable set S called the state space of the chain. A "Markov chain" system is a mathematical system that undergoes transitions from one state to another, between a finite or countable number of possible states. It is a random process usually characterized as memoryless: the next state depends only on the current state and not on the sequence of events that preceded it. [00131] The "Akaike Information Criterion" (AIC) is a measure of the relative goodness of fit of a statistical model, and it is often used as a criterion for model selection among a finite set of models. The AIC is grounded in the concept of information entropy, in effect offering a relative measure of the information lost when a given model is used to describe reality. It can be said to describe the tradeoff between bias and variance in model construction, or loosely speaking between accuracy and complexity of the model. The AIC can be calculated as: AIC = −2 logeL + 2k, wherein L is the maximum likelihood of the function and k is the number of free parameters of the model to be estimated. [00132] "Bayesian Information Criterion" (BIC) is a criterion for model selection among a finite set of models, and is closely related to AIC. The BIC can be calculated as: BIC = −2 logeL + k loge(n), wherein n is the number of data observations. As the number of observations increases, BIC often penalizes extra free parameters more heavily than AIC. [00133] A "genetic algorithm" is a process that mimics evolutionary processes. Genetic algorithms (GAs) are used in a wide variety of fields to solve problems which are not fully characterized or too complex to allow full characterization, but for which some analytical evaluation is available. That is, GAs are used to solve problems which can be evaluated by some quantifiable measure for the relative value of a solution (or at least the relative value of one potential solution in comparison to another).
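Before turning to how genetic algorithms are used in the present disclosure, the sketch below illustrates the definitions just given: a random-walk Metropolis-Hastings sampler for a simple one-dimensional target density, together with the AIC and BIC formulas of paragraphs [00131] and [00132]. The target density, step size, and sample count are illustrative assumptions, not part of the disclosure.

```python
# Minimal sketch (illustrative only): a random-walk Metropolis-Hastings sampler for a
# one-dimensional target density, plus the AIC/BIC formulas quoted above.
import math
import random

def target_density(x):
    """Unnormalized standard normal density, used as the desired distribution P(x)."""
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n_samples, step=1.0):
    x = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = x + random.gauss(0.0, step)          # symmetric random-walk proposal
        accept_prob = min(1.0, target_density(proposal) / target_density(x))
        if random.random() < accept_prob:               # accept or keep the current state
            x = proposal
        samples.append(x)
    return samples

def aic(log_likelihood, k):
    return -2.0 * log_likelihood + 2 * k

def bic(log_likelihood, k, n):
    return -2.0 * log_likelihood + k * math.log(n)

chain = metropolis_hastings(10_000)
print(sum(chain) / len(chain))   # approximately 0 for the standard normal target
```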
In the context of the present disclosure, a genetic algorithm is a process for selecting or manipulating character strings in a computer, typically where the character string corresponds to one or more biological molecules (e.g., nucleic acids, proteins, or the like). [00134] The term “genetic operation” (or “GO”) refers to biological and/or computational genetic operations, wherein all changes in any population of any type of character strings (and Attorney Docket No. PRTN1006WO01 thus in any physical properties of physical objects encoded by such strings) can be described as a result of random and/or predetermined application of a finite set of logical algebraic functions. Examples of GO include but are not limited to multiplication, crossover, recombination, mutation, ligation, fragmentation, etc. [00135] “Ensemble model” is a model whose terms include all the terms of a group of models, wherein the ensemble model’s coefficients of the terms are based on the weighted coefficients of the corresponding terms of the individual models of the group. The weighting of coefficients is based on the predictive power and/or fitness of the individual models. Introduction [00136] Disclosed is a data representation and computational process robust to data sparsity and lack of annotative/target sequences in deep learning applications referred to as Auto-Pairing. Auto-Pairing enables the expansion of initially inadequate datasets from n to ~n 2 data points, providing sufficient learning instances to produce accurate therapeutic predictions. Auto-Pairing also enables the transformation of initially unsupervised datasets to supervised datasets, allowing a direct mapping from an input therapeutic to a target therapeutic. The Auto-Pairing process and representation comprises a novel integration of four existent general purpose computational sub- processes with domain-specific fine tunings: Clustering, Pairing, Pre-processing, and Modeling, enabling the generation of functional therapeutic variants of any desired small dataset of a certain therapeutic family, and even non-biological data. [00137] Auto-Pairing is a data representation and computational process that is robust to therapeutic data sparsity and the related lack of annotative sequences, enabling the deep learning-based generation of functional therapeutic variants of any desired small dataset of a certain therapeutic family. Auto-Pairing is a therapeutic data representation that considers sequences performing the same functionality (e.g., antibodies binding to SARS-COV-2) to be paraphrases or variants to each other. The Auto-Pairing process begins by performing similarity clustering that can be in the level of primary/tertiary structure or functional interpretation, in cases of antibodies within a predetermined range. Afterwards, sequences of the same cluster are iteratively paired to each other transforming the data from unsupervised into supervised. Proteins with similar functions (or structures) are considered paraphrases of one another, and a set of paired proteins is accordingly considered a supervised dataset. Architecture-specific pre- processing is conducted to eliminate sequences with characteristics presenting a confusion to the learning algorithm. Finally, sequence-to-sequence modeling is conducted to generate variants of the original small dataset. Attorney Docket No. PRTN1006WO01 [00138] In some implementations, the technology disclosed relates to Auto-Pairing. 
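To tie the four Auto-Pairing stages together before they are detailed in the sections that follow, here is a high-level, illustrative sketch. The helper functions cluster_fn, preprocess_fn, and train_fn are hypothetical placeholders standing in for the clustering, pre-processing, and modeling logic described below; nothing in this sketch is prescribed by the disclosure beyond the ordering of the four stages.

```python
# Minimal sketch (illustrative only) of the four Auto-Pairing stages: clustering,
# pairing, pre-processing, and modeling. The helper functions are placeholders.
from itertools import permutations

def auto_pair(sequences, cluster_fn, preprocess_fn, train_fn, similarity_threshold=0.7):
    # 1) Clustering: group sequences whose similarity exceeds the threshold.
    clusters = cluster_fn(sequences, similarity_threshold)

    # 2) Pairing: every ordered pair within a cluster becomes a (reference, variant)
    #    training example, expanding n member sequences into n*(n-1) ~ n^2 supervised pairs.
    pairs = []
    for members in clusters:
        pairs.extend(permutations(members, 2))

    # 3) Pre-processing: drop or modify pairs whose sequences would confuse the learner
    #    (e.g., length outliers, rare residues), as described in the filtering sections below.
    pairs = preprocess_fn(pairs)

    # 4) Modeling: fit a sequence-to-sequence generator on reference -> variant pairs.
    return train_fn(pairs)
```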
While Auto-Pairing may be applied to proteins or nucleic acids that encode proteins, in some cases Auto-Pairing is applied to biological molecules above and beyond proteins. In such implementations, the disclosed sequence generator may be employed to generate sequences of various biological molecules. For example, the sequences may be that of a whole genome, a whole chromosome, a chromosome segment, a collection of gene sequences for interacting genes, a gene, a nucleic acid sequence, a protein, a polysaccharide, etc. In one or more implementations, sub-units of the sequence are chromosomes, chromosome segments, haplotypes, genes, nucleotides, codons, mutations, amino acids, carbohydrates (mono, di, tri, or oligomeric), lipid, etc. [00139] In some implementations, a training dataset for training the disclosed sequence generator is derived from a plurality of proteins, which may be provided as a protein library. The protein library may include proteins from various sources. In one example, the members include naturally occurring proteins such as those encoded by members of a single gene family. In another example, the sequences include proteins obtained by using a recombination-based diversity generation mechanism. For example, DNA fragmentation-mediated recombination, synthetic oligonucleotide-mediated recombination or a combination thereof may be performed on nucleic acids encoding all or part of one or more naturally occurring parent proteins for this purpose. In still another example, the members are obtained by implementing a design of experiment (DOE) protocol to identify the systematically varied sequences. Training Pipeline [00140] Figure 1 illustrates one implementation of a training pipeline 100 that implements the disclosed native expansion of a sparse training dataset into a dense training dataset for supervised training of a synonymous variant sequence generator. For example, the training pipeline 100 can be implemented by a processor automatically or in response to a request by a user. [00141] Figure 1 shows representations 102. The representations 102 can include input representations and output representations. In one implementation, the input representations are input gene representations, and the output representations are output gene representations. In another implementation, the input representations are input protein representations, and the output representations are output protein representations. In some implementations, the input protein representations are protein sequences, protein structures, and/or n-dimensional feature vector embeddings (e.g., feature vector embeddings with 5000 dimensions). In some Attorney Docket No. PRTN1006WO01 implementations, the protein structures include primary protein structures, secondary protein structures, tertiary protein structures, and quaternary protein structures. In one implementation, n > 1. [00142] In some implementations, the representations 102 can include a set of sequences. Each sequence is a protein sequence of antibodies with an identifiable primary structure, tertiary structure, and embedded distance. The input sequences are not labelled or annotated. While implementations of the present disclosure discuss protein sequences, it is contemplated that the sequences can represent any other suitable biological or non-biological data. [00143] Figure 1 also shows clustering logic 112. In one implementation, the clustering logic 112 generates clustered representations 122 as output. 
The clustering logic 112 is configured to group the sequences into one or more clusters based on a sequence similarity threshold applied to the set of sequences. Grouping, or clustering, is the task of grouping subsets, or clusters, based on some identified similarity. Each cluster is grouped in a manner denoting that the antibody sequences are more similar to each other than to those in other clusters, where similarity can be based on primary structure, tertiary structure, and/or embedding distance. A similarity threshold is used to determine whether an antibody sequence is similar enough to another antibody sequence for the sequences to be clustered together. Sequences with an identity score higher than the threshold belong to the same cluster. The similarity threshold can be predetermined by a user, where a high identity threshold generates a large number of clusters and a low threshold generates a small number of clusters. [00144] In implementations, an MMSeq2 algorithm is used to cluster the sequences based on the threshold set by the user. This process is detailed in Figure 3 and the discussion below. In using the MMSeq2 algorithm, an input FASTA file 302 is converted to the MMSeq2 database format 312. Protein sequences in the created database file 312 are then clustered at the predetermined threshold value, for example 0.7 (or 70% similarity), to generate a cluster file 322. Cluster IDs and Sequence IDs are extracted from the generated cluster file 322 into a TSV format file 332. To map the sequences to the Sequence IDs in the TSV file 332, the originally input FASTA file 302 containing the protein sequences is also converted to a TSV file containing the protein IDs, which is then compared to the created TSV output 332. The mapped sequences are then written in the TSV file format 332. The output of the MMSeq2 algorithm is a CSV file 342 containing the three columns indicated in Figure 4. [00145] Figure 1 also shows pairing logic 132. The pairing logic 132 is configured to iteratively pair each sequence within a cluster with every other sequence that is within the same cluster. As each sequence within each cluster has satisfied the similarity threshold to the other sequences within the cluster, all member sequences can be considered as paraphrases or targets of one another. All possible unique pairings are output. In some implementations, the pairing logic 132 generates clusters of reference representation-variant representation pairs 142 as output, which are discussed later with reference to Figure 5. [00146] Figure 1 also shows filtering logic 152. The filtering logic 152 includes filtering one or more remaining clusters based on a length of a member sequence within the cluster. To denoise the clusters remaining after the first filtration step, the sequence length distribution is used to eliminate unduly long or short sequence outliers that are not represented adequately. A length threshold can be determined by plotting the sequence length ranges against length count, as shown in graph 800 in Figure 8. Length outlier clusters, such as the length outlier cluster (129, 134] in Figure 8, are clusters that are not adequately represented. The threshold length used to determine an outlier can be a percentage (for example, 10% of a predetermined count such as 140), or a cutoff length count (for example, 20). Similar to the previous filtering implementation, clusters that are one or more standard deviations below a generalized cluster count can be eliminated.
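Returning to the clustering step of paragraph [00144] and Figure 3, the sketch below shows one way the FASTA-to-clustered-CSV flow could be scripted. It assumes the publicly documented MMseqs2 command-line tool (createdb, cluster, createtsv, and the --min-seq-id option); the disclosure refers to the algorithm as MMSeq2 and does not specify the exact invocation, so the commands, file names, and helper functions here are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): run a sequence-identity clustering step and
# produce the three-column CSV of Figure 4 (cluster ID, sequence ID, sequence).
import csv
import subprocess

def run_clustering(fasta_path, workdir, min_seq_id=0.7):
    """Cluster sequences at the given identity threshold and return the cluster TSV path."""
    subprocess.run(["mmseqs", "createdb", fasta_path, f"{workdir}/DB"], check=True)
    subprocess.run(["mmseqs", "cluster", f"{workdir}/DB", f"{workdir}/DB_clu",
                    f"{workdir}/tmp", "--min-seq-id", str(min_seq_id)], check=True)
    subprocess.run(["mmseqs", "createtsv", f"{workdir}/DB", f"{workdir}/DB",
                    f"{workdir}/DB_clu", f"{workdir}/clusters.tsv"], check=True)
    return f"{workdir}/clusters.tsv"

def read_fasta(fasta_path):
    """Map sequence IDs to sequences from a FASTA file."""
    sequences, current_id, chunks = {}, None, []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current_id is not None:
                    sequences[current_id] = "".join(chunks)
                current_id, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
    if current_id is not None:
        sequences[current_id] = "".join(chunks)
    return sequences

def write_cluster_csv(cluster_tsv, fasta_path, csv_path):
    """Join cluster IDs, member sequence IDs, and sequences into the output CSV."""
    id_to_seq = read_fasta(fasta_path)
    with open(cluster_tsv) as tsv, open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["cluster_id", "sequence_id", "sequence"])
        for row in csv.reader(tsv, delimiter="\t"):
            cluster_id, sequence_id = row[0], row[1]
            writer.writerow([cluster_id, sequence_id, id_to_seq.get(sequence_id, "")])
```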
Optionally, the method may include an additional step of filtering out clusters with minority classes. Sequences with rare amino acids (e.g., U, Z, O, or B) are modified by replacing the rare amino acids with the amino acid "X." In some implementations, the filtering logic 152 generates filtered clusters of reference representation-variant representation pairs 162 as output. [00147] Figure 1 also shows training logic 182. The training logic 182 uses the filtered clusters of reference representation-variant representation pairs 162 to train the sequence generator 172. The training logic 182 trains the sequence generator 172 to process reference representations in the reference representation-variant representation pairs and, in response, generate approximations that progressively match corresponding variant representations in the reference representation-variant representation pairs. In some implementations, the sequence generator 172 is trained using supervised learning. [00148] One example of the sequence generator 172 is a neural network system. In one implementation, the neural network system processes the input representations as input and generates the output representations as output. In some implementations, the neural network system is at least one of a language model neural network, a sequence-to-sequence neural network, an encoder-decoder neural network, an autoencoder neural network, a variational autoencoder neural network, a generative adversarial neural network, a diffusion neural network, a Transformer neural network, a recurrent neural network, a long-short term memory neural network, an autoregressive neural network, an energy-based neural network, and a flow-based neural network. [00149] In one implementation, the sequence generator 172 is a multilayer perceptron (MLP). In another implementation, the sequence generator 172 is a feedforward neural network. In yet another implementation, the sequence generator 172 is a fully-connected neural network. In a further implementation, the sequence generator 172 is a fully convolutional neural network. In a yet further implementation, the sequence generator 172 is a semantic segmentation neural network. In yet another implementation, the sequence generator 172 is a generative adversarial network (GAN) (e.g., CycleGAN, StyleGAN, pixelRNN, text-2-image, DiscoGAN, IsGAN).
In a yet another implementation, the sequence generator 172 includes self-attention mechanisms like Transformer, Vision Transformer (ViT), Bidirectional Transformer (BERT), Detection Transformer (DETR), Deformable DETR, UP-DETR, DeiT, Swin, GPT, iGPT, GPT- 2, GPT-3, BERT, SpanBERT, RoBERTa, XLNet, ELECTRA, UniLM, BART, T5, ERNIE (THU), KnowBERT, DeiT-Ti, DeiT-S, DeiT-B, T2T-ViT-14, T2T-ViT-19, T2T-ViT-24, PVT- Small, PVT-Medium, PVT-Large, TNT-S, TNT-B, CPVT-S, CPVT-S-GAP, CPVT-B, Swin-T, Swin-S, Swin-B, Twins-SVT-S, Twins-SVT-B, Twins-SVT-L, Shuffle-T, Shuffle-S, Shuffle-B, XCiT-S12/16, CMT-S, CMT-B, VOLO-D1, VOLO-D2, VOLO-D3, VOLO-D4, MoCo v3, ACT, TSP, Max-DeepLab, VisTR, SETR, Hand-Transformer, HOT-Net, METRO, Image Transformer, Taming transformer, TransGAN, IPT, TTSR, STTN, Masked Transformer, CLIP, DALL-E, Cogview, UniT, ASH, TinyBert, FullyQT, ConvBert, FCOS, Faster R-CNN + FPN, DETR-DC5, TSP-FCOS, TSP-RCNN, ACT+MKDD (L=32), ACT+MKDD (L=16), SMCA, Efficient DETR, UP-DETR, UP-DETR, ViTB/16-FRCNN, ViT-B/16-FRCNN, PVT- Small+RetinaNet, Swin-T+RetinaNet, Swin-T+ATSS, PVT-Small+DETR, TNT-S+DETR, YOLOS-Ti, YOLOS-S, and YOLOS-B. [00150] In one implementation, the sequence generator 172 is a convolution neural network (CNN) with a plurality of convolution layers. In another implementation, the sequence generator 172 is a recurrent neural network (RNN) such as a long short-term memory network (LSTM), bi- directional LSTM (Bi- LSTM), or a gated recurrent unit (GRU). In yet another implementation, the sequence generator 172 includes both a CNN and an RNN. [00151] In yet other implementations, the sequence generator 172 can use 1D convolutions, 2D convolutions, 3D convolutions, 4D convolutions, 5D convolutions, dilated or atrous convolutions, transpose convolutions, depthwise separable convolutions, pointwise convolutions, 1 x 1 convolutions, group convolutions, flattened convolutions, spatial and cross-channel convolutions, shuffled grouped convolutions, spatial separable convolutions, and deconvolutions. The sequence generator 172 can use one or more loss functions such as logistic regression/log loss, multi- class cross-entropy/softmax loss, binary cross-entropy loss, mean- Attorney Docket No. PRTN1006WO01 squared error loss, L1 loss, L2 loss, smooth L1 loss, and Huber loss. The sequence generator 172 can use any parallelism, efficiency, and compression schemes such TFRecords, compressed encoding (e.g., PNG), sharding, parallel calls for map transformation, batching, prefetching, model parallelism, data parallelism, and synchronous/asynchronous stochastic gradient descent (SGD). The sequence generator 172 can include upsampling layers, downsampling layers, recurrent connections, gates and gated memory units (like an LSTM or GRU), residual blocks, residual connections, highway connections, skip connections, peephole connections, activation functions (e.g., non-linear transformation functions like rectifying linear unit (ReLU), leaky ReLU, exponential liner unit (ELU), sigmoid and hyperbolic tangent (tanh)), batch normalization layers, regularization layers, dropout, pooling layers (e.g., max or average pooling), global average pooling layers, and attention mechanisms. 
[00152] The sequence generator 172 can be a linear regression model, a logistic regression model, an Elastic Net model, a support vector machine (SVM), a random forest (RF), a decision tree, and a boosted decision tree (e.g., XGBoost), or some other tree-based logic (e.g., metric trees, kd-trees, R-trees, universal B-trees, X-trees, ball trees, locality sensitive hashes, and inverted indexes). The sequence generator 172 can be an ensemble of multiple models, in some implementations. [00153] In some implementations, the sequence generator 172 can be trained using backpropagation-based gradient update techniques. Example gradient descent techniques that can be used for training the sequence generator 172 include stochastic gradient descent, batch gradient descent, and mini-batch gradient descent. Some examples of gradient descent optimization algorithms that can be used to train the sequence generator 172 are Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax, Nadam, and AMSGrad. Clustering Logic [00154] Figure 2 depicts different components of the disclosed clustering logic 112. In some implementations, the clusters of reference representation-variant representation pairs are created by clustering based on one or more representation attributes. In one implementation, the representation attributes correspond to biological constraints. In some implementations, the biological constraints include identity similarity, homology, structural similarity, size, length, distribution, and rarity. [00155] In some implementations, the clusters of reference representation-variant representation pairs are created by clustering those representations in a same cluster that have an Attorney Docket No. PRTN1006WO01 identity score for at least one representation identity higher than a similarity threshold. In one implementation, the representation identity includes homology overlap between the representations, and implemented by the disclosed sequence homology determination logic 202. [00156] In another implementation, the representations are embedded in an embedding space. The representation identity includes embedding distances between the representations in the embedding space, and implemented by the disclosed embedding distance determination logic 222. An embedding space in which the representations 102 are embedded, for example, to group/cluster/subcluster similar representations in a latent space. A “latent space,” for example, in deep learning is a reduced-dimensionality vector space of a hidden layer. A hidden layer of a neural network compresses an input and forms a new low-dimensional representation with interesting properties that are distance-wise correlated in the latent space. [00157] A distance is identified between each pair of the instances in the embedding space corresponding to a predetermined measure of similarity between the pair of the instances. The “embedding space,” into which the instances are embedded, for example, by an embedding module (not shown), can be a geometric space within which the instances are represented. In one implementation, the embedding space can be a vector space (or tensor space), and in another implementation the embedding space can be a metric space. In a vector space, the features of an instance define its “position” in the vector space relative to an origin. 
The position is typically represented as a vector from the origin to the instance’s position, and the space has a number of dimensions based on the number of coordinates in the vector. Vector spaces deal with vectors and the operations that may be performed on those vectors. [00158] When the embedding space is a metric space, the embedding space does not have a concept of position, dimensions, or an origin. Distances among instances in a metric space are maintained relative to each other, rather than relative to any particular origin, as in a vector space. Metric spaces deal with representations combined with a distance between those representations and the operations that may be performed on those representations. [00159] For purposes of the present disclosure, these representations are significant in that many efficient algorithms exist that operate on vector spaces and metric spaces. For example, metric trees may be used to rapidly identify representations that are “close” to each other. Representations can be embedded into vector spaces and/or metric spaces. In the context of a vector space, this means that a function can be defined that maps representations to vectors in some vector space. In the context of a metric space, this means that it is possible to define a metric (or distance) between those representations, which allows the set of all such representations to be treated as a metric space. Vector spaces allow the use of a variety of Attorney Docket No. PRTN1006WO01 standard measures of distance/divergence (e.g., the Euclidean distance). Other implementations can use other types of embedding spaces. [00160] As used herein, “an embedding” is a map that maps instances into an embedding space. An embedding is a function that takes, as inputs, a potentially large number of characteristics of the instance to be embedded. For some embeddings, the mapping can be created and understood by a human, whereas for other embeddings the mapping can be very complex and non-intuitive. In many implementations, the latter type of mapping is developed by a machine learning algorithm based on training examples, rather than being programmed explicitly. [00161] In order to embed an instance in a vector space, each instance must be associated with a vector. A distance between two instances in such a space is then determined using standard measures of distance using vectors. [00162] A goal of embedding instances in a vector space is to place intuitively similar instances close to each other. One way of embedding text instances is to use a bag-of-words model. The bag of words model maintains a dictionary. Each word in the dictionary is given an integer index, for example, the word aardvark may be given the index 1, and the word zebra may be given the index 60,000. Each instance is processed by counting the number of occurrences of each dictionary word in that instance. A vector is created where the value at the ith index is the count for the ith dictionary word. Variants of this representation normalize the counts in various ways. Such an embedding captures information about the content and therefore the meaning of the instances. Text instances with similar word distributions are close to each other in this embedded space. [00163] Images may be processed to identify commonly occurring features using, e.g., scale invariant feature transforms (SIFT), which are then binned and used in a representation similar to the bag-of-words embedding described above. 
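As a minimal, illustrative sketch of the bag-of-words embedding described in paragraph [00162] (not an implementation prescribed by the disclosure), the code below builds a shared dictionary, produces one count vector per text instance, and compares instances with a Euclidean distance in the resulting vector space.

```python
# Minimal sketch (illustrative only): a bag-of-words embedding of text instances and a
# Euclidean distance between the resulting count vectors.
import math
from collections import Counter

def bag_of_words(instances):
    """Build a shared dictionary and return one count vector per text instance."""
    vocabulary = sorted({word for text in instances for word in text.lower().split()})
    vectors = []
    for text in instances:
        counts = Counter(text.lower().split())
        vectors.append([counts.get(word, 0) for word in vocabulary])
    return vocabulary, vectors

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

_, vecs = bag_of_words(["the cat sat", "the cat sat down", "proteins fold"])
print(euclidean(vecs[0], vecs[1]), euclidean(vecs[0], vecs[2]))  # similar texts are closer
```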
Further, embeddings can be created using deep neural networks, or other deep learning techniques. For example, a neural network can learn an appropriate embedding by performing gradient descent against a measure of dimensionality reduction on a large set of training data. As another example, a kernel can be learned based on data and derive a distance based on that kernel. Likewise, distances may be learned directly. [00164] These approaches generally use large neural networks to map instances, words, or images to high dimensional vectors (for example see: A brief introduction to kernel classifiers, Mark Johnson, Brown University 2009, http://cs.brown.edu/courses/cs195- 5/fall2009/docs/lecture_10-27.pdf “Using Confidence Bounds for Exploitation-Exploration Trade-offs, incorporated herein by reference; and Kernel Method for General Pattern Analysis, Attorney Docket No. PRTN1006WO01 Nello Cristianini, University of California, Davis, accessed October 2016, http://www.kernel- methods.net/tutorials/KMtalk.pdf). In another example, image patches can be represented as deep embeddings. As an image is passed through a deep neural network model, the output after each hidden layer is an embedding in a latent space. These deep embeddings provide hints for the model to distinguish different images. In some implementations, the embeddings can be chosen from a low-dimensional layer as the latent representation. [00165] In other implementations, an embedding can be learned using examples with algorithms such as Multi-Dimensional Scaling, or Stochastic Neighbor Embedding. An embedding into a vector space may also be defined implicitly via a kernel. In this case, the explicit vectors may never be generated or used, rather the operations in the vector space are carried out by performing kernel operations in the original space. [00166] Other types of embeddings of particular interest capture date and time information regarding the instance, e.g., the date and time when a photograph was taken. In such cases, a kernel may be used that positions images closer if they were taken on the same day of the week in different weeks, or in the same month but different years. For example, photographs taken around Christmas may be considered similar even though they were taken in different years and so have a large absolute difference in their timestamps. In general, such kernels may capture information beyond that available by simply looking at the difference between timestamps. [00167] Similarly, embeddings capturing geographic information may be of interest. Such embeddings may consider geographic metadata associated with instances, e.g., the geo-tag associated with a photograph. In these cases, a kernel or embedding may be used that captures more information than simply the difference in miles between two locations. For example, it may capture whether the photographs were taken in the same city, the same building, or the same country. [00168] Often embeddings will consider instances in multiple ways. For example, a product may be embedded in terms of the metadata associated with that product, the image of that product, and the textual content of reviews for that product. Such an embedding may be achieved by developing kernels for each aspect of the instance and combining those kernels in some way, e.g., via a linear combination. [00169] In many cases a very high dimensional space would be required to capture the intuitive relationships between instances. 
In some of these cases, the required dimensionality may be reduced by choosing to embed the instances on a manifold (curved surface) in the space rather than to arbitrary locations. Attorney Docket No. PRTN1006WO01 [00170] Different embeddings may be appropriate on different subsets of the instance catalog. For example, it may be most effective to re-embed the candidate result sets at each iteration of the search procedure. In this way, the subset may be re-embedded to capture the most important axes of variation or of interest in that subset. [00171] To embed an instance in a metric space requires associating that catalog with a distance (or metric). [00172] A “distance” between two instances in an embedding space corresponds to a predetermined measurement (measure) of similarity among instances. Preferably, it is a monotonic function of the measurement of similarity (or dissimilarity). Typically, the distance equals the measurement of similarity. Example distances include the Manhattan distance, the Euclidean distance, the Hamming distance, and the Mahalanobis distance. [00173] Given the distance (similarity measure) between instances to be searched, or the embedding of those instances into a vector space, a metric space or a manifold, there are a variety of data structures that may be used to index the instance catalog and hence allow for rapid search. Such data structures include metric trees, kd-trees, R-trees, universal B-trees, X- trees, ball trees, locality sensitive hashes, and inverted indexes. The technology disclosed can use a combination of such data structures to identify a next set of candidate results based on a refined query. An advantage of using geometric constraints is that they may be used with such efficient data structures to identify the next results in time that is sub-linear in the size of the catalog. [00174] There are a wide variety of ways to measure the distance (or similarity) between instances, and these may be combined to produce new measures of distance. An important concept is that the intuitive relationships between digital instances may be captured via such a similarity or distance measure. For example, some useful distance measures place images containing the same person in the same place close to each other. Likewise, some useful measures place instances discussing the same topic close to each other. Of course, there are many axes along which digital instances may be intuitively related, so that the set of all instances close (with respect to that distance) to a given instance may be quite diverse. For example, a historical text describing the relationship between Anthony and Cleopatra may be similar to other historical texts, texts about Egypt, texts about Rome, movies about Anthony and Cleopatra, and love stories. Each of these types of differences constitutes a different axis relative to the original historical text. [00175] Such distances may be defined in a variety of ways. One typical way is via embeddings into a vector space. Other ways include encoding the similarity via a kernel. By associating a set of instances with a distance, we are effectively embedding those instances into a Attorney Docket No. PRTN1006WO01 metric space. Instances that are intuitively similar will be close in this metric space while those that are intuitively dissimilar will be far apart. Note further that kernels and distance functions may be learned. 
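The sketch below illustrates two of the example distances named in paragraph [00172]: the Manhattan distance on numeric vectors and the Hamming distance on equal-length character strings. It is illustrative only, and the inputs are arbitrary.

```python
# Minimal sketch (illustrative only) of two of the distance measures listed above.
def manhattan(u, v):
    """Sum of absolute coordinate differences between two numeric vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def hamming(s, t):
    """Number of positions at which two equal-length character strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(a != b for a, b in zip(s, t))

print(manhattan([1, 2, 3], [4, 6, 3]))   # 7
print(hamming("ACDEFG", "ACDQFG"))       # 1
```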
In fact, it may be useful to learn new distance functions on subsets of the instances at each iteration of the search procedure. [00176] Note that wherever a distance is used to measure the similarity between instances a kernel may be used to measure the similarity between instances instead, and vice-versa. However, kernels may be used directly instead without the need to transform them into distances. [00177] Kernels and distances may be combined in a variety of ways. In this way, multiple kernels or distances may be leveraged. Each kernel may capture different information about an instance, e.g., one kernel captures visual information about a piece of jewelry, while another captures price, and another captures brand. [00178] Also note that embeddings may be specific to a given domain, such as a given catalog of products or type of content. For example, it may be appropriate to learn or develop an embedding specific to men’s shoes. Such an embedding would capture the similarity between men’s shoes but would be uninformative with regards to men’s shirts. [00179] In other implementations, instead of a distance function, a similarity function can be used, for example, to group/cluster/subcluster visually similar images in a latent space. The similarity function, which is used to determine a measure of similarity, can be any function having kernel properties, such as but not limited to a dot product function, a linear function, a polynomial function, a Gaussian function, an exponential function, a Laplacian function, an analysis of variants (ANOVA) function, a hyperbolic tangent function, a rational quadratic function, a multi-quadratic function, an inverse multi-quadratic function, a circular function, a wave function, a power function, a log function, a spline function, a B-spline function, a Bessel function, a Cauchy function, a chi-square function, a histogram intersection function, a generalized histogram intersection function, a generalized T-student function, a Bayesian function, and a wavelet function. [00180] In the above-described context, using similarity functions, as opposed to using distance functions, is better because neural networks are often trained with regularizers, which add an ever-increasing cost in order to reach the training objective as the weights of the neural network get larger. These regularizers are added to prevent overfitting, where the network pays undue attention to details in the training data, instead of identifying broad trends. Further, these regularizers may be viewed as applying pressure toward a default behavior, which must be overcome by the training data. When used for learning embeddings, standard regularizers have Attorney Docket No. PRTN1006WO01 an effect of pushing the embeddings toward an origin, which tends to push them closer together. If one uses a goal to achieve large distances when items are dissimilar, then this sort of regularization pushes towards a default that items will be similar. However, if a goal is set to have the embeddings have a large dot product when the items are similar (as in the case of the above-described similarity function), then the regularizer applies pressure towards a default that items are dissimilar. It will often be the case that a typical random pair of instances should be regarded as dissimilar. An overall more accurate and efficient visual image discovery results. 
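As a brief, illustrative sketch of two of the kernel-style similarity functions listed in paragraph [00179], the code below computes a plain dot-product similarity and a Gaussian (RBF) similarity between two embedding vectors; the gamma value and the vectors themselves are arbitrary assumptions.

```python
# Minimal sketch (illustrative only) of two kernel-style similarity functions over
# embedding vectors: a dot product and a Gaussian (RBF) kernel.
import numpy as np

def dot_product_similarity(u, v):
    return float(u @ v)

def gaussian_similarity(u, v, gamma=0.5):
    return float(np.exp(-gamma * np.sum((u - v) ** 2)))

a, b = np.array([1.0, 0.0, 2.0]), np.array([0.9, 0.1, 2.1])
print(dot_product_similarity(a, b), gaussian_similarity(a, b))
```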
[00181] In yet another implementation, the representation identity includes primary protein structure similarity between the protein representations, and is disclosed by the primary protein structure similarity determination logic 232. In yet another implementation, the representation identity includes tertiary protein structure similarity between the protein representations, and is disclosed by the tertiary protein structure similarity determination logic 242. [00182] In one implementation, the representation identity includes protein function similarity between the protein representations, and is disclosed by the protein function similarity determination logic 252. In some implementations, a higher similarity threshold creates more of the clusters of reference representation-variant representation pairs, and a lower similarity threshold creates less of the clusters of reference representation-variant representation pairs. [00183] Figure 3 is a diagram of an exemplary overview 300 of the steps to perform sequence-based similarity clustering, specifically clustering done by the MMSeq2 algorithm. The process begins with an input FASTA file 302, which is converted by the program to a MMSeq2 database file format 312. The algorithm then performs a clustering step to generate a collection of cluster files 322. Cluster IDs and Sequence IDs are extracted from the generated cluster files 322 in a TSV file format. These Cluster and Sequence IDs are then mapped to the original sequences input to the algorithm, converting the same FASTA file to the TSV file 332 output of the MMSeq2 algorithm that contains the IDs. Once the sequences are mapped, a CSV file 342 is generated and output from the MMSeq2 algorithm. Output protein IDs are ideally the same as the IDs provided in the original FASTA file 302. [00184] Figure 4 is a table of an exemplary output 400 of sequence listings from a similarity clustering processing method. The output 400 follows the sequence based or primary structure similarity clustering and can comprise three columns of information. The first column 402 represents the Cluster ID. The Cluster ID is the ID of the representative sequence of the cluster, as determined by the identity threshold. The second column 404 represents the Sequence ID, representing the ID of the member sequences of each cluster. The third column 406 lists the Attorney Docket No. PRTN1006WO01 sequence of each member. In implementations of the present disclosure, the output 400 is a CSV file. Pairing Logic [00185] Figure 5A illustrates one implementation of the disclosed pairing logic 132. The paired sequence ordering matters, as the sequence generator 172 learns to generate a second pair from the first pair. For example, for a cluster with n=3 member sequences and these three sequences are represented as A, B, and C, the sequence-variant auto pairs that can be generated are shown in Figure 5. This dataset expansion is based on two assumptions: (1) the sequence- variant auto pair A-B is different from the sequence-variant auto pair B-A; and (2) the uniqueness of the paraphrased variants is controlled by accepting variants of similarity within the similarity threshold. [00186] In one implementation, a particular cluster of reference representation-variant representation pairs comprises a plurality of representations that have different compositions but share at least one common function. In some implementations, the representations are gene representations. 
In other implementations, the representations are protein representations. In one implementation, reference representation-variant representation pairs in the particular cluster pair each representation in the plurality of representations with every other representation in the plurality of representations. Thereby, each of the reference representation-variant representation pairs 506 comprises a reference representation 502 that is paired with a variant representation 504 that is different from the reference representation by at least one element but shares at least one common function with the reference representation. In some implementations, the variant representation differs from the reference representation by many elements. In some implementations, the element is an amino acid element. In some implementations, the variant representation shares multiple common functions with the reference representation. In other implementations, the common functions are common gene functions. In other implementations, the common functions are common protein functions. Filtering Pipeline [00187] Figure 6 portrays one implementation of the disclosed filtering logic 152. In some implementations, the technology disclosed includes filtering out from the clusters of reference representation-variant representation pairs those clusters of reference representation-variant representation pairs that have a representation count lower than a cluster size threshold, and is implemented by the disclosed cluster size filtering logic 602. In one implementation, the cluster Attorney Docket No. PRTN1006WO01 size threshold is determined based on cluster size distributions observed across the clusters of reference representation-variant representation pairs. [00188] In some implementations, the technology disclosed includes filtering out from the clusters of reference representation-variant representation pairs those representations that are either below a rare short representation length threshold or above a rare high representation length threshold, and is implemented by the disclosed rare short protein sequence length filtering logic 612 and the disclosed rare long protein sequence length filtering logic 622, respectively. In one implementation, the rare short representation length threshold and the rare high representation length threshold are determined based on representation length and count distributions observed across representations in the clusters of reference representation-variant representation pairs. [00189] In some implementations, the technology disclosed includes replacing rare elements observed in the representations in the clusters of reference representation-variant representation pairs with an element mask, and is implemented by the disclosed rare amino acid replacement logic 632. In one implementation, the rare elements are rare amino acid elements. [00190] Figure 7 is a graph 700 illustrating an exemplary distribution of similarity clusters, according to techniques disclosed herein. The graph 700, representing the plotted distribution, shows that there are clusters with either single sequences or small numbers of sequences. This distribution governs the elimination of those clusters, proceeding only with Cluster 3, 6, 8, 9, 10, and 13, and dropping the rest. [00191] Figure 8 is a graph 800 illustrating an exemplary distribution of sequence lengths, according to techniques disclosed herein. 
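As a further illustration of the pairing and filtering logic described above, the following Python sketch forms all ordered reference-variant pairs within each cluster, so that a cluster of n members contributes n*(n-1) pairs (A-B being distinct from B-A), and applies the cluster-size, sequence-length, and rare-amino-acid filters; the column names and threshold values are illustrative assumptions rather than the values required by the disclosed implementation.

    from itertools import permutations
    import pandas as pd

    RARE_AMINO_ACIDS = set("UZOB")  # replaced with the mask token "X"

    def auto_pair(df: pd.DataFrame, min_cluster_size: int = 2,
                  min_len: int = 114, max_len: int = 129) -> pd.DataFrame:
        # Length filtering: drop rare short and rare long sequences.
        df = df[df["sequence"].str.len().between(min_len, max_len)]
        # Rare amino acid replacement with the element mask "X".
        df = df.assign(sequence=df["sequence"].apply(
            lambda s: "".join("X" if aa in RARE_AMINO_ACIDS else aa for aa in s)))
        pairs = []
        for _, members in df.groupby("cluster_id")["sequence"]:
            seqs = list(dict.fromkeys(members))   # unique members, order preserved
            if len(seqs) < min_cluster_size:
                continue                           # cluster size filtering
            # Ordered pairs: n members yield n*(n-1) reference-variant pairs.
            pairs.extend(permutations(seqs, 2))
        return pd.DataFrame(pairs, columns=["reference", "variant"])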
Graph 800 shows the plotted distribution of the sequence lengths, highlighting that the majority of the sequences are within three length ranges ([119-124], (124-129], and [114-119]). This distribution governs the elimination of any sequence with a length larger than 129 amino acids. Supervised Training [00192] Figure 9 depicts one implementation of supervised training of the disclosed sequence generator 172 using the dense training dataset. In one implementation, the sequence generator 172 is trained to process reference input representations 912 in the reference representation- variant representation pairs and, in response, generate approximations that progressively match corresponding variant output representations 914 in the reference representation-variant representation pairs. In other words, the supervised learning/training uses the variant output Attorney Docket No. PRTN1006WO01 representations 914 as ground truth labels/counterparts to the reference input representations 912. [00193] The technology disclosed can include generating a set of variant target sequences from the filtered set of sequences. The set of variant target sequence are similar to the filtered set of sequences, such that the variant target sequences perform the same function as the filtered set of sequences. The technology disclosed include mapping remaining clusters to their target variants. Mapping can occur through the use of supervised learning models that update weights/coefficients of the sequence generator 172 based on error 930 and gradients determined using backpropagation 924. For example, a language model in general and a Seq2Seq model in particular. The Seq2Seq model may be useful due to its compatibility with text data, as well as its efficient handling of long-term dependencies such as LSTM, attention, or transformer layers. A simple LSTM-Encoder-Decoder architecture with RMSprop for an optimizer and Categorical Cross Entropy for a loss function can also be used. [00194] First, the model (e.g., the sequence generator 172) is trained using a set of input sequences. From these input sequences, the model learns to map a first sequence to a second sequence. Here, the cluster level does not matter—the focus of the training is to teach the model how to map. A context can be derived from each input sequence, where the context teaches the model how to form a target variant for that sequence. For example, the context can be a required similarity of primary structure. Inference [00195] Figure 10 depicts one implementation of runtime execution of the disclosed trained sequence generator during inference 1000. In some implementations, the technology disclosed uses the trained sequence generator 172 for processing the input representations and generating output representations. The output representations have compositions that are different from compositions of the input representations. The input representations and the output representations share at least one common function. In some implementations, the output representations have enhanced capabilities relative to the input representations. In one implementation, the common function is a common gene function. In another implementation, the common function is a common protein function. [00196] After the model is trained on the input sequences, the (e.g., the sequence generator 172) undergoes an inference stage. 
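By way of a non-limiting illustration of the supervised training arrangement just described, a minimal Keras sketch of an LSTM-Encoder-Decoder compiled with RMSprop and categorical cross entropy is shown below; the vocabulary size, maximum sequence length, and latent dimensionality are illustrative assumptions rather than the values used in the disclosed implementation.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    vocab_size, max_len, latent_dim = 25, 400, 256

    enc_in = layers.Input(shape=(max_len, vocab_size), name="encoder_input")
    _, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_in)

    dec_in = layers.Input(shape=(max_len, vocab_size), name="decoder_input")
    dec_seq = layers.LSTM(latent_dim, return_sequences=True)(
        dec_in, initial_state=[state_h, state_c])
    dec_out = layers.Dense(vocab_size, activation="softmax")(dec_seq)

    model = Model([enc_in, dec_in], dec_out)
    model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit([encoder_onehot, decoder_onehot_in], decoder_onehot_shifted, ...)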
During the inference stage, the context is known from the learned set of parameters, and for each input sequence a variant is produced depending on the context. The sequences are then mapped to their generated variant, based on recursive sampling. Attorney Docket No. PRTN1006WO01 Sampling Logic [00197] Figure 11 portrays one implementation of the disclosed sampling logic 1100. In some implementations, the technology disclosed includes, during inference, controlling exploration- exploitation trade-off in outputs of the sequence generator 172 using one or more sampling parameters. In one implementation, the sampling parameters include a temperature sampling parameter, which is implemented by the disclosed temperature sampling logic 1102. In some implementations, a lower temperature sampling parameter promotes exploitation that causes the sequence generator 172 to generate outputs that are more similar to input representations used during the supervised learning. In other implementations, a higher temperature sampling parameter promotes exploration that causes the sequence generator 172 to generate outputs that are less similar to the input representations used during the supervised learning. [00198] In other implementations, the sampling parameters include a k-value sampling parameter selected based on top-k sampling, which is implemented by the disclosed k-value sampling logic 1112. In yet other implementations, the sampling parameters include a p-value sampling parameter selected based on top-p sampling, which is implemented by the disclosed p- value sampling logic 1122. In yet further implementations, the sampling parameters include a beam count sampling parameter selected based on beam search sampling, which is implemented by the disclosed beam count sampling logic 1132. In yet other implementations, the sampling parameters include a contrastive sampling parameter selected based on contrastive search sampling, which is implemented by the disclosed contrastive sampling logic 1142. [00199] Temperature sampling can be used to control the exploration-exploitation trade off and to address sequence pairings with high similarity thresholds. Low temperatures denote exploitation promotion, or sequences generated similar to the training data, and high temperatures denote exploration promotion, where sequences are less similar to the training data. A sampling temperature can also determine a grammar flexibility, where the higher temperature reflects less flexibility. [00200] An optimum temperature can be generated based on a desired similarity or functionality of the output data set sequence. The optimum temperature may be dependent on the input sequences or user determined similarity thresholds. An acceptable threshold value can be used to automatically reject temperatures below this value, where the threshold value is based on how paired sequences retain functionality. For example, if a context parameter results in dissimilar sequences being paired, the output paired sequences may be rejected based on the optimum temperature. Attorney Docket No. PRTN1006WO01 [00201] The technology disclosed can include outputting a set of supervised sequences. The output set is roughly the size of the input data set squared, i.e., the data set grows from n to n 2 based on the inclusion of the generated variants and paired sequences. This supervised data set can be used in any number of ways, including for the generation of intelligent therapeutics. 
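To make the sampling controls concrete, the following Python sketch reweights a next-token distribution by a temperature parameter and, optionally, restricts it to the top-k tokens before sampling; the probability vector is assumed to be the softmax output of the sequence generator for a single decoding step, and the cut-off values are illustrative.

    import numpy as np

    def sample_with_temperature(probs: np.ndarray, temperature: float = 0.5) -> int:
        # Lower temperatures sharpen the distribution (exploitation); higher
        # temperatures flatten it (exploration).
        logits = np.log(probs + 1e-12) / temperature
        reweighted = np.exp(logits) / np.sum(np.exp(logits))
        return int(np.random.choice(len(reweighted), p=reweighted))

    def top_k_filter(probs: np.ndarray, k: int = 5) -> np.ndarray:
        # Keep only the k most likely tokens, then renormalize.
        filtered = np.zeros_like(probs)
        top = np.argsort(probs)[-k:]
        filtered[top] = probs[top]
        return filtered / filtered.sum()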
The output supervised datasets can also allow for a direct mapping of input therapeutic to a target therapeutic, without restriction due to a small dataset of a certain therapeutic family. Accordingly, a downstream therapeutic pipeline can be envisioned using this output. Input-Output Distribution [00202] Figure 12 is a flowchart illustrating a computer-implemented method 1200 for conducting supervised training of a model to learn an input-output distribution that generates output proteins which share at least one common function with input proteins, according to techniques disclosed herein. [00203] At action 1202, the method includes initializing a population of synonymous proteins that share at least one common function. [00204] At action 1212, the method further includes grouping the population of synonymous proteins into a plurality of sub-populations of synonymous proteins based on one or more biological constraints. [00205] At action 1222, the method further includes, for each sub-population of synonymous proteins in the plurality of sub-populations of synonymous proteins, generating permutations of sequence-variant pairs by pairing each synonymous protein in a given sub-population of synonymous proteins with every other synonymous protein in the given sub-population. [00206] At action 1232, the method further includes conducting supervised training of a model to learn an input-output distribution that generates output proteins which share at least one common function with input proteins. The supervised training uses sequences in the permutations of sequence-variant pairs as inputs and variants in the permutations of sequence- variant pairs as ground truth target outputs of the inputs. [00207] At action 1242, the method further includes controlling sampling from the input- output distribution for customized exploitation-exploration trade-off to generate the output proteins that diverge from the input proteins but are biologically valid. Attorney Docket No. PRTN1006WO01 Performance Results, Technical Improvements, and Technical Advantages as Objective Indicia of Novelty, Inventiveness, Non-Obviousness, and Subject-Matter Eligibility Explanation of the Computational Evaluation/Filtration of the Generated Sequences [00208] This work will employ two types of computational performance metrics aiming to evaluate this model’s learning of intrinsic properties then the model’s expansion beyond the natural sequences’ space. Both sets are adopted from ProteinGAN. In other implementations, any viable means of synthetic sequence evaluation can be utilized. Learning of Intrinsic Properties Pairwise Identity [00209] This work will compute a similarity score per every generated sequence vs all the natural/input sequences. This metric will be used to visualize the similarity of the generated sequences and obtain a sense of the ratio of the sequences to be rejected based on not meeting a pre-determined similarity score ranging from 70-95%. Shannon Entropy [00210] This work will compute Shannon Entropy for every position where low entropy values will aid in reflecting preserved areas for the protein family governing important functionality and high entropy values will aid in reflecting unpreserved areas for the protein family indicating the possibility for variation. Accordingly, the Shannon Entropy curves will indicate how well this work’s approach retains the natural sequences’ conservation/mutation trends and patterns, a desired characteristic. 
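As a non-limiting illustration of these two intrinsic metrics, the following Python sketch computes a simple per-position identity score between two pre-aligned, equal-length sequences and the per-column Shannon entropy of an alignment; it is a simplified stand-in for the BLAST-based identity scoring and full MSA tooling referenced elsewhere in this disclosure.

    import math
    from collections import Counter

    def pairwise_identity(a: str, b: str) -> float:
        # Percentage of matching positions between two aligned sequences.
        matches = sum(1 for x, y in zip(a, b) if x == y)
        return 100.0 * matches / max(len(a), len(b))

    def shannon_entropy_per_position(msa_columns):
        # msa_columns: iterable of strings, one string of residues per alignment column.
        entropies = []
        for column in msa_columns:
            counts = Counter(column.replace("-", ""))   # ignore gap characters
            total = sum(counts.values())
            h = -sum((c / total) * math.log2(c / total)
                     for c in counts.values()) if total else 0.0
            entropies.append(h)
        return entropies

Low per-position entropy then flags conserved regions, while high entropy flags positions that tolerate variation.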
Sequence Logo

[00211] This work will utilize Sequence Logos to identify the most frequent amino acids at every position of the generated vs. original MSA alignments and to plot those two sets of amino acids together, forming what is known as the consensus sequence of conserved regions denoting different types of binding.

Expansion beyond Natural Sequences' Space

[00212] This type of performance metric evaluates the degree of the model's learning of valid variants' characteristics beyond the natural sequences present in the dataset.

CATH Domain Search

[00213] This work will perform a CATH Domain Search in order to evaluate whether the model was able to generate new structural domains that are actually functional, and accordingly to validate that the generation of these new functional structural domains is not done by chance.

Requirements and Technical Specifications

[00214] The research problem this work tackles is the generation of valid protein variants of a certain family (output) given a set of natural/experimentally validated protein sequences of the same family (input). This undergoes four phases, as detailed in the subsequent selected-design section. As for this section, the required input formatting and needed parameters of these phases are demonstrated.

Preprocessing

[00215] The preprocessing module of this work is required to produce/assert a Seq2Seq-compatible supervised dataset. This module is expected to handle both initially unsupervised and already supervised data. In the case of an initially unsupervised dataset of variants (e.g., proteins of the same family performing the same function, with a problem-specific similarity threshold and clustering), this module needs to auto-pair proteins of the same clusters, hypothesizing that they serve as supervised variants for each other. Afterward, it needs to perform the cleaning, preprocessing, and data splitting required to do the following:
- Make sure the model has not encountered the testing sequences before, even if their pairing was different.
- Infer on the entire unique dataset with different random seeds per temperature, to test the model's performance both when seeing a new sequence and when seeing a sequence it has witnessed before.

[00216] As for the parameters, this module requires three main parameters, summarized as follows:
- The first parameter is the dataset path, a string representing a CSV file name/directory. If the data is unsupervised, the sequences need to be in the first column and their clusters in the second column. If the data is supervised, the input sequences need to be in the first column and the target sequences in the second column.
- The second parameter is a Boolean parameter that determines whether the dataset path belongs to supervised data (True) or unsupervised data (False).
- The third parameter is another Boolean parameter that indicates whether or not to conduct clustering on the preprocessed dataset. An example where we might need this is the production phase, where we aim to produce variants for the entire dataset.

Feature Extraction

[00217] The feature extraction phase is divided across two modules: the tokenization module and the batch embedding generation module. The following paragraphs outline the requirements of these two modules and their needed parameters.
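Referring back to the preprocessing module parameters listed in paragraph [00216], the following Python sketch shows one possible interface; the function name, return type, and column handling are illustrative assumptions, and the auto_pair call reuses the pairing sketch shown earlier rather than a disclosed routine.

    import pandas as pd

    def preprocess(dataset_path: str, is_supervised: bool, do_clustering: bool) -> pd.DataFrame:
        df = pd.read_csv(dataset_path)
        if is_supervised:
            # First column: input sequences, second column: target sequences.
            df.columns = ["input_sequence", "target_sequence"] + list(df.columns[2:])
        else:
            # First column: sequences, second column: clusters; auto-pair within clusters.
            df.columns = ["sequence", "cluster_id"] + list(df.columns[2:])
            df = auto_pair(df)  # see the auto-pairing sketch earlier in this disclosure
            df = df.rename(columns={"reference": "input_sequence",
                                    "variant": "target_sequence"})
        if do_clustering:
            pass  # optionally re-cluster the preprocessed dataset (e.g., for production runs)
        return df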
[00218] The tokenization module is required to tokenize sequences based on our pre-trained model’s vocabulary and do the needed token preprocessing in preparation to the modeling phase. This module only has one parameter denoting the data frame of amino acid sequences to be tokenized and preprocessed. [00219] The embedding generation modules, on the other hand, is a part of a general-purpose data generator. This data generator is required to generate batches of embeddings, decoder input sequences and shifted decoder target sequences. It only has two parameters which are the tokenized data frame containing (at least) the encoder tokenized inputs named “encoder_tokenized_input”, the decoder tokenized targets named “decoder_tokenized_target and the batch size to be generated every time the generator object is called. Modelling [00220] The modelling module does not need different dimensionality requirements aside from the already mentioned in the previous modules. In fact, all what this module needs is the initialization of the training model’s inputs as a list containing the encoder inputs and decoder inputs and the model’s targets as the decoder targets. Sampling [00221] The sampling phase consists of two modules, the temperature redistribution and the batch decoding. The temperature redistribution module is required to guide the next-token sampling in the inference decoding function based on temperature sampling. Its parameters are the original distribution of the vocab data we want to reweight. In case of a Seq2Seq encoder- decoder model, this distribution is required to represent the output tokens generated so far from the previous tokens. [00222] As for the batch decoding module, it is required to produce a batch of decoded sequences from a batch of input embedding sequences sampled by temperature. Its parameters Attorney Docket No. PRTN1006WO01 are the input batch of embeddings extracted from the input sequences we want to produce variants/mappings for, the batch size of the embeddings and accordingly the decoded sequences, and the temperature value to be used in the sampling. Description of the Selected Design [00223] This section is concerned with demonstrating the adopted design that is viable for a baseline architecture that is trained from scratch or transfer-learned models that are trained downstream. To elaborate, it lists the employed architecture, the dataset used for the training process, the sampling approach used and the motives behind choosing it, and the adopted performance metrics. Dataset: CovAbDab [00224] In light of the unfortunate deterioration of the coronavirus pandemic, the molecular characterizations of the SARS-CoV-2 antibodies are of tremendous significance. This significance relates to the general usages of antibodies that directly map to the development of efficient biotherapeutics and evaluation of candidate vaccines. Accordingly, the initial antibody protein sequences are those of nanobodies binding to SARS-CoV-2. Nanobodies are single- domain antibodies (i.e., antibodies that consist of a single monomeric variable domain that are still able to bind selectively to a specific antigen). The employed dataset is derived from a database known as CoV-AbDab, which accommodates over 1400 released antibodies/nanobodies proven to be able to bind to one beta coronavirus at least. 
In addition to the nanobodies binding to SARS-CoV-2, this dataset contains a wide set of antibodies proven to bind to other beta coronaviruses including SARS-CoV-1 and MERS-CoV and it provides a set of attributes allowing filtration of more specific subsets. These attributes include the type (antibody/nanobody), virus strain (SARS-CoV-1, SARS-CoV-2, etc.), protein/epitope (spike protein). The filtration parameters that the CoV-AbDab database supports range from the type of protein, its binding target, its origin, and several other biochemical attributes that can be observed in Figure 13. The indicated attribute values resulted in a total of 320 nanobody sequences binding to the same epitope spike protein, making them serve as variants for each other. A subset of the dataset can be observed in Figure 14 indicating that proteins are just strings of amino acids. [00225] The actual sequences are found in the column outlined in purple. We can see in Figure 14 that these sequences span 20 alphabetical characters which represent the letter Attorney Docket No. PRTN1006WO01 abbreviation of the 20 primary amino acids, each reflecting specific numerical codons mapping to these amino acids. An illustration of this can be found in Figure 15. Preprocessing [00226] Although much larger protein sequence datasets can be obtained, producing nanobody variants, in particular, is of minimum complexity and cost in comparison to the broader class that is antibodies. Moreover, part of this work’s hypothesis is that the initial 320 data points can be expanded to 22,382 data points in terms of all the possible sequence-variant auto pairs pairing only within the same protein clusters. To elaborate, it is proposed that from every n data point, n-1 paraphrase targets can be generated. For example, if n = 3 and those three sequences are A, B, and C. The sequence-variant auto pairs that can be generated can be observed in Figure 5. This dataset expansion is based on two assumptions: x The sequence-variant auto pair A-B is different from the sequence-variant auto pair B-A x The novelty of the paraphrased variants can be controlled by accepting variants of similarity within a certain threshold. Feature Extraction [00227] The difference between a baseline model and a transfer-learned model is that the transfer-learned model does not input the nanobodies’ vectorized representation but rather inputs their embedding representation learned by the pre-trained model, ProtBert-BFD-TF, trained on the biggest protein dataset, BFD, at the time of release. This representation does not just provide a way to differentiate amino acids and sequences from each other but rather provide a semantic representation learned from all proteins available to us, a representation that translates how these proteins evolved over time to denote a certain functionality and that satisfies the combinations seen in nature even if not seen in the nanobodies dataset. Modelling [00228] The seq2seq architecture shared across the baseline and transfer-learned model utilizes an LSTM-Encoder-Decoder architecture. The choice of this architecture is motivated by its design for supervised learning fulfilling the criteria of the downstream generation task being of a supervised nature and maps sequences through paraphrasing and translation along with its applicability of scaling and ease of training. To elaborate, LSTM-Encoder-Decoder architecture is compatible with trainable embeddings, pre-trained context-free embeddings, and contextual embeddings. 
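As a non-limiting illustration of the feature extraction described above, the following Python sketch extracts per-residue embeddings from a pre-trained protein language model; it assumes the Hugging Face transformers library and the publicly released Rostlab/prot_bert_bfd checkpoint, whose per-token embedding dimension is 1024, and mirrors that model's formatting conventions (space-separated amino acids, rare residues mapped to X).

    import re
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False)
    model = BertModel.from_pretrained("Rostlab/prot_bert_bfd")
    model.eval()

    def embed(sequence: str) -> torch.Tensor:
        # Replace rare amino acids with X and separate residues with spaces.
        spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))
        inputs = tokenizer(spaced, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs.last_hidden_state.squeeze(0)   # shape: (tokens, 1024)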
Furthermore, it is also compatible with custom attention that serves in approaching Attorney Docket No. PRTN1006WO01 the edge of transformers without being bound by its training complexity. Therefore, it is promoted as the optimum architecture fulfilling this work’s anticipated efficiency and simplicity. Sampling [00229] In language models or sequence generators, sampling is defined as the process of picking the following token according to the generated conditional distribution. The adopted sampling method of this work is temperature sampling where high temperatures denote more linguistic variety while low temperatures are more “grammatically accurate.” The reason behind this choice is that stochastic sampling on its own can, by definition, generate very random tokens by chance. Accordingly, temperature sampling gives more tangible control over this randomness. Implementation and Discussion Solution Overview [00230] This work’s structure can be divided into three phases, Preprocessing, Modelling, and Computational Evaluation/Filtration as seen in Figure 16. Preprocessing [00231] As previously discussed, this work generalises to either producing protein variants (analogous to paraphrasing) or mapping proteins of a certain domain to those of another (analogous to translation). Accordingly, it accepts both unsupervised and supervised data, respectively, depending on which downstream task is in development. However, since encoder- decoder models are supervised, it needs to transfer unsupervised data to supervised ones. This is where the auto-pairing hypothesis presented in this work comes in handy, where proteins of the same family and same cluster are treated as variants for each other and accordingly fit for supervision. This phase produces data that fulfils this auto-pairing along with data clearing (detailed in the following section). Modelling [00232] This work aims to benchmark a baseline LSTM-Encoder Decoder architecture whose input is solely the input dataset with a similar transfer-learned architecture whose input is the embedding representation of the dataset, extracted from a pre-trained top model. The baseline model not only provides a reference for comparison, but it also proves the concept of auto- pairing. The transfer-learned model, however, provides a tangible demonstration of the effect of transfer learning in the problem of protein variant generation. More on the details of both models in the following section. Attorney Docket No. PRTN1006WO01 Computational Evaluation/Filtration [00233] In natural language processing (NLP), it is more approachable to evaluate the paraphrased or the translated texts based on the language rules. However, this process is not as straightforward in protein modelling. Accordingly, the computational evaluation of synthetic biologics, in general, depends, to a large extent, on reflecting the synthetic sequences on the natural ones. This reflection takes into account maintaining the conservation/variation trends, within sequences identity and produced functional groups. Hence, this phase evaluates the synthetic antibodies and filters the top-performing ones to proceed to in-vitro experimentation. Implementation Procedure Preprocessing [00234] As elaborated on, this work generalizes to both protein variant generation (analogous to paraphrasing) and mapping proteins of a certain domain to those of another (analogous to translation). As a result, depending on which downstream task is in development, it accepts both unsupervised and supervised input. 
However, because encoder-decoder models are supervised, unsupervised data must be transferred to supervised data. This is where the auto-pairing hypothesis provided in this paper comes in help, in which proteins from the same family and cluster are viewed as variations for one another and hence suitable for supervision. This phase generates data that satisfies the auto-pairing requirement as well as data cleaning. The initial data frame we start with contains the sequences and their corresponding clusters which are formed based on a similarity score ranging from 70-95%. The lower limit is chosen to retain the similarity estimated to occur in proteins within the same family and the upper limit is chosen to ensure that the sequence is not almost identical to another. Hence, the first pre-processing step is doing the auto-pairing within the same clusters. The auto-pairing iterates over all sequences within the same cluster and considers each one a variant or a paraphrase of the other. [00235] After having a supervised data frame where the inputs are the sequences and the targets are their variants, there comes the data cleaning part. We begin by replacing the rarely occurred amino acids (U, Z, O, and B) with (X) as this is the approach adopted in the pre-trained model whose features are to be transferred. Afterward, the amino acids are separated with spaces to follow the same pre-trained model’s formatting. Feature Extraction [00236] To prepare the input and target sequences for feature extraction, the pre-trained tokenization needs to be adopted where the sequences are padded to a pre-decided maximum Attorney Docket No. PRTN1006WO01 length of 400 amino acids (a [PAD] token is added at the end of the sequences whose length is less than 400), a beginning-of-sentence tag [BOS] and an end-of-sentence tag [EOS] is added to the decoder’s input for it to be able to gain a sense of what does the sequence beginning and end mean in terms of tokens. Finally, the three added tokens ([PAD], [BOS], and [EOS]) along with the amino acids are mapped to unique integer indices. Now, the tokenized sequences are ready for feature extraction. Well, almost ready. [00237] Due to the massive size of the sequence embeddings (every token’s embedding vector has a dimension of 1024), the embeddings/features are loaded via a custom-made generator whose batch size is 64. As per the decoder’s inputs and targets, they are initiated as 3D arrays whose dimensions are the number of sequences (restricted to the batch size), maximum sequence length, and the total number of unique tokens. The difference between the encoder and decoder’s data is that the encoder’s data is now represented by the pre-trained model’s learned feature while the decoder’s data is still represented by the token indices so that the model learns how to reconstruct the variants from the latent/embedding space. Furthermore, to fulfil the language modelling objective at the decoder’s side (predicting the next token given the previous token(s), the decoder utilizes the “Teacher Forcing” method. The decoder learns how to construct its target sequences by shifting them one timestep forward or by delaying its input sequences one timestep backward. As a result, the decoder’s inputs are now the latent/embedding space vectors, as well as the unshifted or delayed target sequences, and its output is the shifted target sequences. 
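To make the teacher-forcing arrangement concrete, the following Python sketch builds the decoder input and the one-timestep-shifted decoder target for a single sequence using the [PAD]/[BOS]/[EOS] tokens and the 400-residue maximum length described above; the one-hot layout and token ordering are illustrative assumptions, and sequences are assumed to be shorter than the maximum length.

    import numpy as np

    MAX_LEN = 400
    TOKENS = ["[PAD]", "[BOS]", "[EOS]"] + list("ACDEFGHIKLMNPQRSTVWYX")
    TOKEN_TO_ID = {t: i for i, t in enumerate(TOKENS)}

    def encode_decoder_pair(target_sequence: str):
        ids = ([TOKEN_TO_ID["[BOS]"]]
               + [TOKEN_TO_ID[aa] for aa in target_sequence]
               + [TOKEN_TO_ID["[EOS]"]])
        ids += [TOKEN_TO_ID["[PAD]"]] * (MAX_LEN - len(ids))   # pad to the maximum length
        one_hot = np.eye(len(TOKENS), dtype=np.float32)[ids]
        decoder_input = one_hot                                 # [BOS], t1, t2, ..., [EOS], [PAD]...
        decoder_target = np.roll(one_hot, -1, axis=0)           # shifted one timestep forward
        decoder_target[-1] = np.eye(len(TOKENS), dtype=np.float32)[TOKEN_TO_ID["[PAD]"]]
        return decoder_input, decoder_target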
Modelling [00238] An LSTM-Encoder-Decoder architecture is used in the Seq2Seq model that is shared by the baseline and transfer-learned models. This architecture was chosen because of its design for supervised learning, which meets the requirements of the downstream supervised generation task as well as its scalability and ease of training. To clarify, the LSTM-Encoder-Decoder architecture works with trainable embeddings, context-free embeddings that have been pre- trained, and contextual embeddings. Indeed, the same classic Seq2Seq architecture is used for both the training and inference models with few differences to be indicated. For the training inputs, a list is compiled with the encoder and decoder inputs and for the training output, the decoder shifted target is passed. For the compilation, the optimizer is set as RMSprop, the loss is set as categorical cross entropy loss, and the metric is set as the accuracy. Since the fit function is not compatible with custom-made generators, a train on batch function is utilized in a custom- made training and validation loops instead. As per the inference model that has no target Attorney Docket No. PRTN1006WO01 sequences, the inference inputs are still the decoder inputs composed of the latent space vectors and the sequences while the outputs are the now generated sequences obtained token by token as governed by the sampling algorithm. Sampling [00239] Sampling is described as the process of selecting the next token based on the created conditional distribution in language models or sequence generators. Temperature sampling was used in this study, with higher temperatures indicating more linguistic variation and lower temperatures indicating more “grammatically precise” language. The reason for this decision is that stochastic sampling by itself can produce extremely random tokens by chance. Temperature sampling, as a result, provides a more tangible control over this randomness. The inference process starts by encoding the input as latent space/state vectors and then generating an empty target sequence batch of lengths 1 then by populating the first character of the target sequence with the start character ([BOS] tag) and finally starting the sampling loop. The sampling loop continues until its stopping conditions are fulfilled. These stopping conditions are either generating an end-of-sentence character ([EOS] tag) or reaching the maximum sequence length. As long as these conditions have not been met, the loop continues to sample the next amino acid according to temperature sampling. Testing and Evaluation Testing [00240] This work tested nine different values of the temperature (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1) on the split test data. For every sequence of the input test data, a synthetic sequence was generated summing up to a total number of synthetic variants equal to the original number of the test sequences. However, only a random set of 1000 of the generated test sequences of every temperature were taken to the preliminary evaluation. Furthermore, in a late version of this work’s implementation, four values of random seeds were added per temperature to govern the sampling probabilities and increase the number of generated sequences. The testing code automatically writes the generated synthetic sequences in a CSV evaluation file. The preliminary evaluation consisted of the performance metrics mentioned in the project design. Its adaptation details are found in the following subsections. Attorney Docket No. 
PRTN1006WO01

Evaluation

[00241] This sub-section discusses the modules, parameters, and servers used in conducting the evaluation.

a. Learning of Intrinsic Properties

i. Pairwise Identity
Pairwise Identity uses global alignment, which is followed by calculating an identity score for the variant sequence with the aid of a tool known as BLAST.

ii. Shannon Entropy
The MSA is calculated using a tool known as ClustalO, which generates a FASTA file with the multiple sequences aligned. The alignments are later visualized with the help of a tool called AlignmentViewer.

iii. Sequence Logo
The Sequence Logo is also produced using AlignmentViewer.

b. Expansion beyond Natural Sequences' Space

i. CATH domain search
The process of CATH domain search is done as follows:
- The natural sequences were merged with the generated sequences in the same file in balanced quantities (natural sequences = generated sequences) to avoid search bias.
- The CATH domains database (containing functional motif/domain sequences) was downloaded.
- The HMMER3 tool was used for the CATH domain search, using HMMScan of the FASTA sequences (merged files) against the HMM profile CATH database (sequence vs. HMM database).
- The output hits were parsed and filtered by lowest E-values (similar to a p-value, but for sequence similarity) and Coverage (how much of the reference length the sequence covers).
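By way of a non-limiting illustration, the following Python sketch drives an HMMER3 hmmscan search of the merged FASTA file against a CATH HMM library and filters the resulting hits by E-value and coverage; it assumes HMMER3 is installed and the CATH HMM library has already been downloaded, the column positions follow the standard HMMER3 --domtblout tabular format, and the cut-off values are illustrative.

    import subprocess

    def cath_domain_search(cath_hmm_lib: str, merged_fasta: str,
                           out_path: str = "hits.domtblout",
                           max_evalue: float = 1e-5, min_coverage: float = 0.5):
        # Press the HMM library; ignore failure if it has already been pressed.
        subprocess.run(["hmmpress", cath_hmm_lib], check=False)
        subprocess.run(["hmmscan", "--domtblout", out_path, cath_hmm_lib, merged_fasta],
                       check=True)
        hits = []
        with open(out_path) as fh:
            for line in fh:
                if line.startswith("#"):
                    continue
                f = line.split()
                evalue, qlen = float(f[6]), int(f[5])
                ali_from, ali_to = int(f[17]), int(f[18])
                coverage = (ali_to - ali_from + 1) / qlen
                if evalue <= max_evalue and coverage >= min_coverage:
                    hits.append((f[3], f[0], evalue, coverage))  # (query, domain, E-value, coverage)
        return hits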
Discussion of the Performance of the Implemented Solution

[00242] In this section, we start by discussing the performance of the baseline model. As a brief reminder, the baseline model uses untrained vectorized protein representations. After that, we discuss the transfer learning model. The transfer learning model uses pre-trained protein embeddings leveraged from training on the largest primary sequence dataset.

Baseline Model

a. Learning of Intrinsic Properties

i. Accuracy and Loss
Although accuracy/loss are not sufficiently expressive performance metrics in the case of generative models, they both can be a good initial indicator of the correctness of the training process. Indeed, the model has scored an accuracy of 98.3% and a continuously decreasing validation loss plateauing around 0.035.

ii. Pairwise Identity
The pairwise identity histograms were calculated for each of the nine temperatures, as seen in Figure 17. Before explaining the frequency histograms, it is necessary to point out that it is natural for the total number of unique sequences (the number remaining after BLAST scoring) to increase as the temperature increases. Hence, the following remarks are made according to the percentage with respect to the remaining sequences per temperature. Naturally, very low temperatures such as 0.2 possess high similarity, with the majority concentrated at 100% similarity. On the other hand, high temperatures such as 1 show more variation across smaller similarity/identity scores. Hence, it is logical that reasonable frequencies per similarity score would exist somewhere in the middle, which is achieved in the pairwise identities shown in Figure 17, where higher temperatures produce more variants spanning a higher diversity range.

iii. Shannon Entropy
As elaborated above, multiple sequence alignment is essential for calculating both the Shannon Entropy and the Sequence Logo. A sample of the MSAs is shown in Figure 18, where multiple primary sequences are aligned with gaps placed where needed for reference. As per the Shannon Entropy, it is calculated per alignment position as H = -sum_a p(a) * log2 p(a), where p(a) is the frequency of amino acid a at that position, and a sample of the entropy curve of a generated set of sequences vs. its alignment with the natural set of sequences is shown in Figure 19.

[00243] We want the generated sequences to retain the pattern/trend (the natural sequences' low entropy in conserved areas and high entropy in variable areas) with observable variability. Why do we want this variability? Figure 20 represents the fitness level within the genotype space; it shows two plots, A and B. Plot A represents the case where the generated sequences (green dots) are different sequences but lie in the same fitness cluster (denoting little variability in the entropy). This is not preferred because it means that the generated sequences can only reach local variants or local maxima in the natural sequences' fitness cluster, and that this local maximum can turn out to be global if and only if the natural sequences were already in the global maximum's fitness cluster. Naturally, it is common that this type would give a bigger functional set. However, a functional set from the same fitness cluster can be obtained through several non-learning tools and does not necessarily reflect the motives behind using a language generation model. In Plot A, we can see that the generated sequences are within the same cluster, which is non-optimal to begin with. Alternatively, as seen in Plot B, the generated sequences are not within the same fitness cluster and are in fact in the global maximum's cluster, denoting that the generation model has learned different sequences of variable entropy. Naturally, there is no guarantee that the model will in fact reach this global maximum, but we want it to have the biggest chance of doing so by retaining the same entropy curve trends while having different mutations in the high-entropy regions. In Figure 19, we can see that the generated entropy curves retain the entropy trends with variability increasing in proportion to the temperature, reflecting less alignment between the natural and generated entropy curves. This pattern is achieved in all the generated sequences, which are found in the results folder. Yet, the ones with the most reasonable variability levels are those of temperatures 0.3, 0.4, and 0.5, as seen in Figure 19, and these are the ones proceeded with in the CATH domain search in the extrinsic properties section.

iv. Sequence Logo
The variations between the natural and the generated sequence logos depend on whether the region is among the most conserved (subsequences which are usually similar among the same type of proteins) or the most variable (subsequences which allow diversity even within the same type of proteins). They also depend on the temperature allowing or prohibiting tolerance of change in these conserved subsequences. Hence, it is logical to observe that at low temperatures, conserved logos are highly similar between the natural and generated sequences, and that this similarity slightly decreases as the temperature increases for the conserved areas, while the variation significantly increases as the temperature increases for the variable areas. This can be observed in a sample of an entire sequence and a subsample of a conserved subsequence, seen respectively in Figures 21 and 22. This pattern is repeated across all sequences, as found in the results folder. Note that the light gray vertical lines are indicators of MSA alignment gaps.

b. Expansion beyond Natural Sequences' Space
i. CATH domain search
- Natural Sequences:
  - Total sequences = 332
  - Functional sequences = 332
  - Functional domains = 3ntcH01[1]
  - Functional domain 3ntcH01 notes:
    - Link on the online database: http://www.cathdb.info/version/latest/domain/3ntcH01
    - It is the main functional domain in the natural sequences' dataset. It belongs to the Immunoglobulin protein superfamily.
- Generated Sequences of Temperature 0.2:
  - Total generated sequences = 113
  - Number of functional sequences = 113/113 (100%)
  - Identified domains: 3ntcH01 (all 113 sequences) (same domain as the natural set)
- Generated Sequences of Temperature 0.3:
  - Total generated sequences = 262
  - Number of functional sequences = 262/262 (100%)
  - Identified domains: 3ntcH01 (all 262)
- Generated Sequences of Temperature 0.4:
  - Total generated sequences = 490
  - Number of functional sequences = 490/490 (100%)
  - Identified domains: 3ntcH01 (all 490)
- Generated Sequences of Temperature 0.5:
  - Total generated sequences = 701
  - Number of functional sequences = 701/701 (100%)
  - Identified domains: 3ntcH01 (by 700 sequences) + a new domain 4b41A00 (by 1 sequence)
  - One new domain: 4b41A00, link: http://www.cathdb.info/version/latest/domain/4b41A00
  - Structure (also of the immunoglobulin superfamily) can be found in Figure 23.
- Generated Sequences of Temperature 0.6:
  - Total generated sequences = 976
  - Number of functional sequences = 976/976 (100%)
  - Identified domains: 3ntcH01 (by 975 sequences) + a new domain 2icwG02 (by 1 sequence)
  - One new domain: 2icwG02, link: http://www.cathdb.info/version/latest/domain/2icwG02
  - Structure (superfamily of the mam-mhc complex) can be seen in Figure 24.

Transfer-Learned Model

a. Learning of Intrinsic Properties

i. Accuracy and Loss
The transfer-learned model has scored an accuracy of 99.67% and a continuously decreasing validation loss plateauing around 0.021.

ii. Pairwise Identity
The pairwise identity histograms were calculated for each of the nine temperatures in a manner similar to the baseline's. We can see in Figure 25 that the inverse proportionality between the identity scores and the temperature is retained in the transfer-learned model, while producing almost the same number of sequences per temperature and spanning the same identity ranges with negligible differences.

iii. Shannon Entropy
The transfer-learned model has managed to generate excellent Shannon Entropy curves, seen in Figure 26: despite producing the indicated increasing number of sequences spanning wider ranges of identity scores as the temperature increases, it retains the entropy curve trends in a way that shows the model has learned exactly where it should perform mutations and where it should not, in a way that can only be fulfilled by leveraging the learned language of life from a massive dataset spanning the evolution of natural selection in proteins over vast timescales. The best trends were surprisingly found in the highest three temperatures, indicating that an even further temperature increase may be beneficial.

iv. Sequence Logo
As elaborated, it is logical to observe that at low temperatures, conserved logos are highly similar between the natural and generated sequences, and that this similarity slightly decreases as the temperature increases for the conserved areas, while the variation significantly increases as the temperature increases for the variable areas. This behavior was also present in the baseline model.
However, we can see from Figure 27 that the transfer-learned model abides by the conservation/mutation patterns in a stronger manner, tangibly visualized in its sequence logos, where low-entropy regions show almost no mutation and high-entropy regions show almost no conservation.

b. Expansion beyond Natural Sequences' Space

i. CATH domain search
- Natural Sequences:
  - Total sequences = 332
  - Functional sequences = 332
  - Functional domains = 3ntcH01[1]
  - Functional domain 3ntcH01 notes:
    - Link on the online database: http://www.cathdb.info/version/latest/domain/3ntcH01
    - It is the main functional domain in the natural sequences' dataset.
    - It belongs to the Immunoglobulin protein superfamily.
- Generated Sequences of Temperature 0.2:
  - Total generated sequences = 198
  - Number of functional sequences = 198/198 (100%)
  - Identified domains: 3ntcH01 (all 198 sequences) (same domain as the natural set)
- Generated Sequences of Temperature 0.3:
  - Total generated sequences = 352
  - Number of functional sequences = 352/352 (100%)
  - Identified domains: 3ntcH01 (all 352)
- Generated Sequences of Temperature 0.4:
  - Total generated sequences = 480
  - Number of functional sequences = 480/480 (100%)
  - Identified domains: 3ntcH01 (all 480)
- Generated Sequences of Temperature 0.5:
  - Total generated sequences = 635
  - Number of functional sequences = 635/635 (100%)
  - Identified domains: 3ntcH01 (all 635)
- Generated Sequences of Temperature 0.6:
  - Total generated sequences = 706
  - Number of functional sequences = 706/706 (100%)
  - Identified domains: 3ntcH01 (all 706)
- Generated Sequences of Temperature 0.7:
  - Total generated sequences = 817
  - Number of functional sequences = 817/817 (100%)
  - Identified domains: 3ntcH01 (all 817)
- Generated Sequences of Temperature 0.8:
  - Total generated sequences = 872
  - Number of functional sequences = 872/872 (100%)
  - Identified domains: 3ntcH01 (in 871) + 1 new domain, 2icwG02 (by 1 sequence)
- Generated Sequences of Temperature 0.9:
  - Total generated sequences = 946
  - Number of functional sequences = 946/946 (100%)
  - Identified domains: 3ntcH01 (all 946)
- Generated Sequences of Temperature 1:
  - Total generated sequences = 973
  - Number of functional sequences = 973/973 (100%)
  - Identified domains: 3ntcH01 (in 971) + 2 new domains, 2icwG02 (by 1 sequence) and 2p45B00 (by 1 sequence)

[00244] The above results show that the transfer-learned model generally retains the wild type's conservation/mutation trends more accurately across all temperatures. Interestingly, as the temperature increases, the performance of the transfer-learned model improves and more functional domains appear, denoting that although the exploration drive usually starts to deteriorate the model's performance above a certain temperature threshold, this threshold is raised for pre-trained representations.

Computer System

[00245] Figure 32 shows an example computer system 3200 that can be used to implement the technology disclosed. Computer system 3200 includes at least one central processing unit (CPU) 3242 that communicates with a number of peripheral devices via bus subsystem 3226. These peripheral devices can include a storage subsystem 3202 including, for example, memory devices and a file storage subsystem 3226, user interface input devices 3228, user interface output devices 3246, and a network interface subsystem 3244. The input and output devices allow user interaction with computer system 3200.
Network interface subsystem 3244 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems. [00246] In one implementation, the sequence generator 172 is communicably linked to the storage subsystem 3202 and the user interface input devices 3228. Attorney Docket No. PRTN1006WO01 [00247] User interface input devices 3228 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 3200. [00248] User interface output devices 3246 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 3200 to the user or to another machine or computer system. [00249] Storage subsystem 3202 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processors 3248. [00250] Processors 3248 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Processors 3248 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™. Examples of processors 3248 include Google’s Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX32 Rackmount Series™, NVIDIA DGX-1™, Microsoft' Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm’s Zeroth Platform™ with Snapdragon processors™, NVIDIA’s Volta™, NVIDIA’s DRIVE PX™, NVIDIA’s JETSON TX1/TX2 MODULE™, Intel’s Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM’s DynamicIQ™, IBM TrueNorth™, Lambda GPU Server with Testa V100s™, and others. [00251] Memory subsystem 3212 used in the storage subsystem 3202 can include a number of memories including a main random access memory (RAM) 3222 for storage of instructions and data during program execution and a read only memory (ROM) 3224 in which fixed instructions are stored. A file storage subsystem 3226 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by file storage subsystem 3226 in the storage subsystem 3202, or in other machines accessible by the processor. Attorney Docket No. PRTN1006WO01 [00252] Bus subsystem 3236 provides a mechanism for letting the various components and subsystems of computer system 3200 communicate with each other as intended. 
Although bus subsystem 3236 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses. [00253] Computer system 3200 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 3200 depicted in Figure 32 is intended only as a specific example for purposes of illustrating the preferred implementations of the present invention. Many other configurations of computer system 3200 are possible having more or less components than the computer system depicted in Figure 32. [00254] In various implementations, a learning system is provided. In some implementations, a feature vector is provided to a learning system. Based on the input features, the learning system generates one or more outputs. In some implementations, the output of the learning system is a feature vector. In some implementations, the learning system comprises an SVM. In other implementations, the learning system comprises an artificial neural network. In some implementations, the learning system is pre-trained using training data. In some implementations training data is retrospective data. In some implementations, the retrospective data is stored in a data store. In some implementations, the learning system may be additionally trained through manual curation of previously generated outputs. [00255] In some implementations, the sequence generator 172 is a trained classifier. In some implementations, the trained classifier is a random decision forest. However, it will be appreciated that a variety of other classifiers are suitable for use according to the present disclosure, including linear classifiers, support vector machines (SVM), or neural networks such as recurrent neural networks (RNN). [00256] Suitable artificial neural networks include but are not limited to a feedforward neural network, a radial basis function network, a self-organizing map, learning vector quantization, a recurrent neural network, a Hopfield network, a Boltzmann machine, an echo state network, long short term memory, a bi-directional recurrent neural network, a hierarchical recurrent neural network, a stochastic neural network, a modular neural network, an associative neural network, a deep neural network, a deep belief network, a convolutional neural networks, a convolutional deep belief network, a large memory storage and retrieval neural network, a deep Boltzmann machine, a deep stacking network, a tensor deep stacking network, a spike and slab restricted Attorney Docket No. PRTN1006WO01 Boltzmann machine, a compound hierarchical-deep model, a deep coding network, a multilayer kernel machine, or a deep Q-network. [00257] The present disclosure may be embodied as a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure. [00258] The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. 
The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire. [00259] Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device. [00260] Figure 32 is a schematic of an exemplary computing node. Computing node 3200 is only one example of a suitable computing node and is not intended to suggest any limitation as to the scope of use or functionality of embodiments described herein. Regardless, computing Attorney Docket No. PRTN1006WO01 node 3200 is capable of being implemented and/or performing any of the functionality set forth hereinabove. [00261] In computing node 3200 there is a computer system/server, which is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with computer system/server include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed computing environments that include any of the above systems or devices, and the like. [00262] Computer system/server may be described in the general context of computer system- executable instructions, such as program modules, being executed by a computer system. 
Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. Computer system/server may be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
[00263] As shown in Figure 32, computer system/server in computing node 3200 is shown in the form of a general-purpose computing device. The components of computer system/server may include, but are not limited to, one or more processors or processing units, a system memory, and a bus that couples various system components including the system memory to the processor.
[00264] The bus represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, Peripheral Component Interconnect (PCI) bus, Peripheral Component Interconnect Express (PCIe), and Advanced Microcontroller Bus Architecture (AMBA).
[00265] Computer system/server typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server, and it includes both volatile and non-volatile media, removable and non-removable media.
[00266] System memory can include computer system readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory. Computer system/server may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, a storage system can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically called a "hard drive"). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to the bus by one or more data media interfaces. As will be further depicted and described below, memory may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the disclosure.
[00267] Program/utility, having a set (at least one) of program modules, may be stored in memory by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data, or some combination thereof, may include an implementation of a networking environment. Program modules generally carry out the functions and/or methodologies of embodiments as described herein.
[00268] Computer readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user’s computer, partly on the user’s computer, as a stand-alone software package, partly on the user’s computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user’s computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some implementations, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
[00269] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to implementations of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
[00270] These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
[00271] The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
[00272] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Clauses
[00273] The technology disclosed can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections – these recitations are hereby incorporated forward by reference into each of the following implementations.
[00274] One or more implementations and clauses of the technology disclosed, or elements thereof, can be implemented in the form of a computer product, including a non-transitory computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations and clauses of the technology disclosed, or elements thereof, can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. Yet further, in another aspect, one or more implementations and clauses of the technology disclosed, or elements thereof, can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media).
[00275] The clauses described in this section can be combined as features. In the interest of conciseness, the combinations of features are not individually enumerated and are not repeated with each base set of features. The reader will understand how features identified in the clauses described in this section can readily be combined with sets of base features identified as implementations in other sections of this application.
These clauses are not meant to be mutually exclusive, exhaustive, or restrictive; and the technology disclosed is not limited to these clauses but rather encompasses all possible combinations, modifications, and variations within the scope of the claimed technology and its equivalents.
[00276] Other implementations of the clauses described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the clauses described in this section. Yet another implementation of the clauses described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the clauses described in this section.
[00277] We disclose the following clauses:
Clauses Set 1
1. A computer-implemented method, including: receiving input representations; processing the input representations; and based on the processing, generating output representations, wherein the output representations have compositions that are different from compositions of the input representations, and wherein the input representations and the output representations share at least one common function.
2. The computer-implemented method of clause 1, wherein the input representations are input gene representations, and the output representations are output gene representations.
3. The computer-implemented method of clause 2, wherein the common function is a common gene function.
4. The computer-implemented method of clause 1, wherein the input representations are input protein representations, and the output representations are output protein representations.
5. The computer-implemented method of clause 4, wherein the input protein representations are protein sequences, protein structures, and/or n-dimensional feature vector embeddings.
6. The computer-implemented method of clause 5, wherein the protein structures include primary protein structures, secondary protein structures, tertiary protein structures, and quaternary protein structures.
7. The computer-implemented method of clause 5, wherein n > 1.
8. The computer-implemented method of clause 4, wherein the common function is a common protein function.
9. The computer-implemented method of clause 1, further including using at least one neural network system for processing the input representations.
10. The computer-implemented method of clause 9, wherein the neural network system processes the input representations as input and generates the output representations as output.
11. The computer-implemented method of clause 9, wherein the neural network system is at least one of a language model neural network, a sequence-to-sequence neural network, an encoder-decoder neural network, an autoencoder neural network, a variational autoencoder neural network, a generative adversarial neural network, a diffusion neural network, a Transformer neural network, a recurrent neural network, a long short-term memory neural network, an autoregressive neural network, an energy-based neural network, and a flow-based neural network.
12. The computer-implemented method of clause 9, wherein the neural network system is trained on a training dataset using supervised learning, wherein the training dataset comprises clusters of reference representation-variant representation pairs.
13. The computer-implemented method of clause 12, wherein a particular cluster of reference representation-variant representation pairs comprises a plurality of representations that have different compositions but share at least one common function.
14. The computer-implemented method of clause 13, wherein the representations are gene representations.
15. The computer-implemented method of clause 13, wherein the representations are protein representations.
16. The computer-implemented method of clause 13, wherein reference representation-variant representation pairs in the particular cluster pair each representation in the plurality of representations with every other representation in the plurality of representations, whereby each of the reference representation-variant representation pairs comprises a reference representation that is paired with a variant representation that is different from the reference representation by at least one element but shares at least one common function with the reference representation.
17. The computer-implemented method of clause 16, wherein the variant representation differs from the reference representation by many elements.
18. The computer-implemented method of clause 16, wherein the element is an amino acid element.
19. The computer-implemented method of clause 16, wherein the variant representation shares multiple common functions with the reference representation.
20. The computer-implemented method of clause 19, wherein the common functions are common gene functions.
21. The computer-implemented method of clause 19, wherein the common functions are common protein functions.
22. The computer-implemented method of clause 16, wherein the neural network system is trained to process reference representations in the reference representation-variant representation pairs and, in response, generate approximations that progressively match corresponding variant representations in the reference representation-variant representation pairs.
23. The computer-implemented method of clause 12, wherein the clusters of reference representation-variant representation pairs are created by clustering based on one or more representation attributes.
24. The computer-implemented method of clause 23, wherein the representation attributes correspond to biological constraints.
25. The computer-implemented method of clause 24, wherein the biological constraints include identity similarity, homology, structural similarity, size, length, distribution, and rarity.
26. The computer-implemented method of clause 12, wherein the clusters of reference representation-variant representation pairs are created by clustering those representations in the same cluster that have an identity score for at least one representation identity higher than a similarity threshold.
27. The computer-implemented method of clause 26, wherein the representation identity includes homology overlap between the representations.
28. The computer-implemented method of clause 26, wherein the representations are embedded in an embedding space, wherein the representation identity includes embedding distances between the representations in the embedding space.
29. The computer-implemented method of clause 26, wherein the representation identity includes primary protein structure similarity between the protein representations.
30. The computer-implemented method of clause 26, wherein the representation identity includes tertiary protein structure similarity between the protein representations.
31. The computer-implemented method of clause 26, wherein the representation identity includes protein function similarity between the protein representations.
32. The computer-implemented method of clause 26, wherein a higher similarity threshold creates more of the clusters of reference representation-variant representation pairs, and a lower similarity threshold creates fewer of the clusters of reference representation-variant representation pairs.
33. The computer-implemented method of clause 12, further including filtering out from the clusters of reference representation-variant representation pairs those clusters of reference representation-variant representation pairs that have a representation count lower than a cluster size threshold.
34. The computer-implemented method of clause 33, wherein the cluster size threshold is determined based on cluster size distributions observed across the clusters of reference representation-variant representation pairs.
35. The computer-implemented method of clause 12, further including filtering out from the clusters of reference representation-variant representation pairs those representations that are either below a rare short representation length threshold or above a rare high representation length threshold.
36. The computer-implemented method of clause 35, wherein the rare short representation length threshold and the rare high representation length threshold are determined based on representation length and count distributions observed across representations in the clusters of reference representation-variant representation pairs.
37. The computer-implemented method of clause 12, further including replacing rare elements observed in the representations in the clusters of reference representation-variant representation pairs with an element mask.
38. The computer-implemented method of clause 37, wherein the rare elements are rare amino acid elements.
39. The computer-implemented method of clause 10, further including, during inference, controlling an exploration-exploitation trade-off in outputs of the neural network system using one or more sampling parameters.
40. The computer-implemented method of clause 39, wherein the sampling parameters include a temperature sampling parameter.
41. The computer-implemented method of clause 40, wherein a lower temperature sampling parameter promotes exploitation that causes the neural network system to generate outputs that are more similar to input representations used during the supervised learning.
42. The computer-implemented method of clause 41, wherein a higher temperature sampling parameter promotes exploration that causes the neural network system to generate outputs that are less similar to the input representations used during the supervised learning.
43. The computer-implemented method of clause 39, wherein the sampling parameters include a k-value sampling parameter selected based on top-k sampling.
44. The computer-implemented method of clause 39, wherein the sampling parameters include a p-value sampling parameter selected based on top-p sampling.
45. The computer-implemented method of clause 39, wherein the sampling parameters include a beam count sampling parameter selected based on beam search sampling.
46. The computer-implemented method of clause 39, wherein the sampling parameters include a contrastive sampling parameter selected based on contrastive search sampling.
47. The computer-implemented method of clause 1, wherein the output representations have enhanced capabilities relative to the input representations.
48. A system for native expansion of a sparse training dataset into a dense training dataset, comprising: memory storing a sparse training dataset that lacks target output sequences required as annotations for supervised training of a sequence generator, wherein the sparse training dataset has n unlabeled training examples; pairing logic configured for native expansion of the sparse training dataset into a dense training dataset of input-output pairs, wherein the dense training dataset has m labeled training examples whose generation is confined to the n unlabeled training examples, wherein m >> n, and wherein the pairing logic is configured to construct the dense training dataset by: generating the input-output pairs by pairing each unlabeled training example in the sparse training dataset with every other unlabeled training example in the sparse training dataset, wherein a particular input-output pair comprises an input training example labeled with an output training example; and training logic configured to implement the supervised training of the sequence generator using the dense training dataset by causing the sequence generator to process input training examples in the input-output pairs and, in response, generate approximations that progressively match corresponding output training examples in the input-output pairs.
49. The system of clause 48, wherein m = n².
50. The system of clause 48, wherein m = ~n².
51. The system of clause 48, wherein the input training example and the output training example are protein sequences.
52. The system of clause 51, wherein the output training example is different from the input training example by at least one amino acid but shares at least one common protein function with the input training example.
53. A therapeutic system, comprising: memory storing a sparse therapeutic training dataset that lacks target output therapeutics required as annotations for supervised training of a therapeutic generator, wherein the sparse therapeutic training dataset has n unlabeled therapeutic training examples; pairing logic configured for native expansion of the sparse therapeutic training dataset into a dense therapeutic training dataset of input-output therapeutic pairs, wherein the dense therapeutic training dataset has m labeled therapeutic training examples whose generation is confined to the n unlabeled therapeutic training examples, wherein m >> n, and wherein the pairing logic is configured to construct the dense therapeutic training dataset by: generating the input-output therapeutic pairs by pairing each unlabeled therapeutic training example in the sparse therapeutic training dataset with every other unlabeled therapeutic training example in the sparse therapeutic training dataset, wherein a particular input-output therapeutic pair comprises an input training therapeutic example labeled with an output training therapeutic example; and training logic configured to implement the supervised training of the therapeutic generator using the dense therapeutic training dataset by causing the therapeutic generator to process input therapeutic training examples in the input-output therapeutic pairs and, in response, generate approximations that progressively match corresponding output therapeutic training examples in the input-output therapeutic pairs.
54. The therapeutic system of clause 53, wherein m = n².
55. The therapeutic system of clause 53, wherein m = ~n².
56. The therapeutic system of clause 53, wherein the input therapeutic training example and the output therapeutic training example are protein sequences.
57. The therapeutic system of clause 56, wherein the output therapeutic training example is different from the input therapeutic training example by at least one amino acid but shares at least one common protein function with the input therapeutic training example.
58. A computer-implemented method, including: initializing a population of synonymous proteins that share at least one common function; grouping the population of synonymous proteins into a plurality of sub-populations of synonymous proteins based on one or more biological constraints; for each sub-population of synonymous proteins in the plurality of sub-populations of synonymous proteins, generating permutations of sequence-variant pairs by pairing each synonymous protein in a given sub-population of synonymous proteins with every other synonymous protein in the given sub-population; and conducting supervised training of a model to learn an input-output distribution that generates output proteins which share at least one common function with input proteins, wherein the supervised training uses sequences in the permutations of sequence-variant pairs as inputs and variants in the permutations of sequence-variant pairs as ground truth target outputs of the inputs.
59. The computer-implemented method of clause 58, further including controlling sampling from the input-output distribution for a customized exploitation-exploration trade-off to generate the output proteins that diverge from the input proteins but are biologically valid.
60. The computer-implemented method of clause 58, wherein the output proteins have enhanced capabilities relative to the input proteins.
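By way of a non-limiting illustration, the pairing recited in clauses 48 through 60 above can be sketched in a few lines of Python: every sequence in a cluster of synonymous sequences is paired with every other sequence in the same cluster, so n unlabeled examples yield n × (n - 1), or approximately n², ordered (reference, variant) training pairs. The function name auto_pair and the toy cluster below are illustrative assumptions, not limitations on the disclosed technology.

from itertools import permutations
from typing import Dict, List, Tuple

def auto_pair(clusters: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Expand clusters of synonymous sequences into ordered (reference, variant) pairs."""
    pairs: List[Tuple[str, str]] = []
    for members in clusters.values():
        # permutations(members, 2) yields every ordered pair of distinct members,
        # i.e., both (A, B) and (B, A), giving n * (n - 1) pairs for a cluster of size n.
        pairs.extend(permutations(members, 2))
    return pairs

# Toy cluster of three hypothetical synonymous protein sequences -> 3 * 2 = 6 labeled pairs.
toy_clusters = {"cluster_0": ["MKTAYIAK", "MKTGYIAK", "MKSAYIAK"]}
for reference, variant in auto_pair(toy_clusters):
    print(reference, "->", variant)

Each resulting (reference, variant) pair serves as an input training example labeled with an output training example, which is the supervised arrangement consumed by the training logic of clause 48.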
Clauses Set 2
1. A computer-implemented method, comprising: receiving an unsupervised set of sequences; grouping the unsupervised set into one or more clusters of sequences based on a sequence similarity, the sequence similarity being above a sequence similarity threshold; iteratively pairing each sequence within a cluster to every other sequence within the cluster; filtering the one or more clusters based on a number of sequences within the cluster; filtering the sequences within the cluster based on a sequence length distribution within the sequences; generating a set of variant target sequences from the filtered set of sequences; mapping the filtered set of sequences to the set of variant target sequences using a trained model; and outputting a supervised set of sequences.
2. The method of clause 1, wherein iteratively pairing each sequence comprises: generating a paraphrase sequence based on each sequence within the cluster, where the paraphrase sequence is another sequence within the cluster; pairing the paraphrase sequence with each sequence within the cluster; and ordering the paired sequences.
3. The method of clause 1, wherein filtering the one or more clusters based on a number of paired sequences comprises: determining a cluster count for each cluster; and removing each cluster having a cluster count below a threshold cluster count from the one or more clusters.
4. The method of clause 1, wherein filtering the sequences within the cluster based on a length distribution comprises: determining a sequence length distribution for each cluster; determining a length representation threshold; and removing each cluster having a length distribution beneath the length representation threshold from the one or more clusters.
5. The method of clause 1, wherein filtering the sequences within the cluster based on a length distribution comprises: determining a sequence length distribution for each cluster; determining a length representation threshold; and removing each cluster having a length distribution above the length representation threshold from the one or more clusters.
6. The method of clause 5, wherein the sequence length distribution is a number of sequences of a length within a range.
7. The method of clause 1, wherein training the model comprises: receiving a set of training sequences; generating a training variant for each sequence, wherein the training variant is within a predetermined context similarity; mapping each input training sequence to each training variant; and outputting a set of parameters to the model.
8. The method of clause 1, wherein mapping the set of filtered clusters comprises: using the trained model to produce a variant for each sequence; mapping each variant to the sequence to produce a dataset of sequences; sampling a temperature of the dataset, the temperature resulting from a similarity between the sequences in the dataset; and generating an optimum temperature of the dataset based on a determined functionality of the dataset.
9. The method of clause 1, further comprising: identifying one or more rare elements in the set of sequences, the rare elements being members of a minority class; and replacing each rare element with a known element in the sequence.
10. The method of clause 1, wherein the sequences are amino acid sequences.
11. The method of clause 1, wherein the sequence similarity is a structural similarity.
12. The method of clause 1, wherein the trained model is a chosen language model.
13. The method of clause 1, wherein each variant target sequence performs the same function as each sequence.
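By way of a further non-limiting illustration, clauses 1 through 9 of Clauses Set 2 above describe a preprocessing pipeline that clusters sequences by similarity, discards clusters with too few members, discards sequences whose lengths fall outside the observed distribution, and replaces rare elements before pairing. The Python sketch below approximates those steps under stated assumptions: the greedy, SequenceMatcher-based clustering is a simplified stand-in for a dedicated sequence-clustering tool, and the thresholds (0.8 similarity, cluster size 3, lengths 20 to 500) and helper names are hypothetical values chosen only for illustration.

from difflib import SequenceMatcher
from typing import List

CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def identity(a: str, b: str) -> float:
    # Crude pairwise similarity score; an alignment-based identity would be used in practice.
    return SequenceMatcher(None, a, b).ratio()

def greedy_cluster(sequences: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Place each sequence in the first cluster whose representative it matches above
    the similarity threshold; otherwise start a new cluster."""
    clusters: List[List[str]] = []
    for seq in sequences:
        for members in clusters:
            if identity(seq, members[0]) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

def filter_and_mask(clusters: List[List[str]], min_cluster_size: int = 3,
                    min_len: int = 20, max_len: int = 500) -> List[List[str]]:
    """Drop sequences with rare lengths, drop undersized clusters, and mask
    non-canonical (rare) residues with 'X'."""
    kept: List[List[str]] = []
    for members in clusters:
        members = [s for s in members if min_len <= len(s) <= max_len]
        if len(members) < min_cluster_size:
            continue
        kept.append(["".join(c if c in CANONICAL else "X" for c in s) for s in members])
    return kept

The surviving clusters can then be expanded into (input, target) pairs as in the earlier sketch and used for supervised training of the sequence generator; at inference time, sampling parameters such as temperature, top-k, and top-p (clauses 39 through 46 of Clauses Set 1) control how far the generated variants diverge from the training inputs.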
[00278] The preceding description is presented to enable the making and use of the technology disclosed. Various modifications to the disclosed implementations will be apparent, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein. The scope of the technology disclosed is defined by the appended claims.
[00279] A number of workflows illustrating logic are described herein. The logic can be implemented using processors programmed using computer programs stored in memory accessible to the computer systems and executable by the processors. With all workflows herein, it will be appreciated that many of the steps can be combined, performed in parallel, or performed in a different sequence without affecting the functions achieved. In some cases, as the reader will appreciate, a rearrangement of steps will achieve the same results only if certain other changes are made as well. In other cases, as the reader will appreciate, a rearrangement of steps will achieve the same results only if certain conditions are satisfied. Furthermore, it will be appreciated that the workflows herein show only steps that are pertinent to an understanding of the invention, and it will be understood that numerous additional steps for accomplishing other functions can be performed before, after, and between those shown.
[00280] The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such feature or combination of features. In view of the foregoing descriptions, it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
[00281] The foregoing description of preferred implementations of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously, many modifications and variations will be apparent to practitioners skilled in this art. In addition, any and all variations described, suggested, or incorporated by reference herein with respect to any one implementation are also to be considered taught with respect to all other implementations. The implementations described herein were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various implementations and with various modifications as are suited to the particular use contemplated.
It is intended that the scope of the invention be defined by the following claims and their equivalents.
[00282] What we claim is: