

Title:
A CRISPR COUNTER-SELECTION INTERRUPTION CIRCUIT (CCIC) AND METHODS OF USE THEREOF
Document Type and Number:
WIPO Patent Application WO/2024/086848
Kind Code:
A2
Abstract:
The present invention generally relates to compositions comprising a CRISPR based regulatory element comprising a barcode sequence that serves as a binding site for a Cas9/gRNA molecule and which alters expression of a downstream gene when bound by the Cas9/gRNA molecule.

Inventors:
BRADY SEAN (US)
BURIAN JAN (US)
LIBIS VINCENT (US)
Application Number:
PCT/US2023/077533
Publication Date:
April 25, 2024
Filing Date:
October 23, 2023
Assignee:
UNIV ROCKEFELLER (US)
BRADY SEAN F (US)
BURIAN JAN (US)
LIBIS VINCENT K (US)
International Classes:
C12N15/74; C40B30/04
Attorney, Agent or Firm:
FONVILLE, Natalie, C. et al. (US)
Claims:
CLAIMS

What is claimed is:

1. A nucleic acid molecule comprising a CRISPR based regulatory element, wherein the CRISPR based regulatory element comprises a barcode binding site for a Cas9/gRNA between an upstream promoter region and a downstream sequence encoding an effector protein.

2. The nucleic acid molecule of claim 1, wherein the effector protein comprises a toxic molecule.

3. The nucleic acid molecule of claim 1, wherein the toxic molecule comprises sacB.

4. The nucleic acid molecule of claim 1, wherein the effector protein comprises a marker protein.

5. A library comprising a plurality of nucleic acid molecules, wherein each nucleic acid molecule comprises a CRISPR based regulatory element, wherein the CRISPR based regulatory element comprises a barcode binding site for a Cas9/gRNA between an upstream promoter region and a downstream sequence encoding an effector protein.

6. The library of claim 5, wherein the effector protein comprises a toxic molecule.

7. The library of claim 5, wherein the toxic molecule comprises sacB.

8. The library of claim 5, wherein the effector protein comprises a marker protein.

10. The library of claim 5, wherein the nucleic acid molecules are selected from the group consisting of COSMID molecules, BAC molecules and PAC molecules.

11. A plurality of cells comprising the library of any one of claims 5-10.

12. An assay system or kit comprising: a) a plurality of cells of claim 5; b) a plurality of gRNA molecules that target the barcodes of the nucleic acid molecules in the plurality of cells of a), and c) a Cas9 protein or nucleic acid molecule encoding a Cas9 protein.

13. The assay system or kit of claim 12, wherein the Cas9 is catalytically dead.

14. A method of selecting a specific target cell from a plurality of cells of any one of claims 5-10, the method comprising: a) contacting the plurality of cells of any one of claims 5-10 with a Cas9/gRNA complex that specifically binds to the barcode binding site for the Cas9/gRNA of the nucleic acid molecule in the target cell, and b) selecting the target cell based on detection of an altered expression level of the downstream effector gene in the target cell.

15. The method of claim 13, wherein expression of the downstream effector gene results in toxicity of the cell, and further wherein binding of the Cas9/gRNA complex to the barcode binding site for the Cas9/gRNA of the nucleic acid molecule in the target cell results in cell survival.

16. The method of any one of claims 14-15, wherein the Cas9 is catalytically dead.

Description:
TITLE OF THE INVENTION

A CRISPR Counter-Selection Interruption Circuit (CCIC) and Methods of Use Thereof

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made with government support under GM122559 awarded by the National Institutes of Health. The government has certain rights in the invention.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application No. 63/380,451, filed October 21, 2022, which is hereby incorporated by reference herein in its entirety.

REFERENCE TO A “SEQUENCE LISTING” SUBMITTED AS AN XML FILE

The Sequence Listing written in the xml file: “046531-5027-00WO_SequenceListing”; created on October 23, 2023, and 21,211 bytes in size, is hereby incorporated by reference.

BACKGROUND OF THE INVENTION

CRISPR-Cas systems can be programmed to target essentially any unique DNA or RNA sequence when loaded with a homologous guide RNA, and have been adapted for numerous genetic engineering and synthetic biology tools (Wang et al., 2022, Nature reviews, Microbiology; Xu et al., 2019, Journal of molecular biology, 431:34-47; Adli, 2018, Nature communications, 9:1911). Cas9 is a dual-RNA-guided DNA nuclease that binds a trans-activating crRNA (tracrRNA) base-paired to a target-encoding CRISPR RNA (crRNA), forming an active complex that can target 20 bp DNA regions which contain a 3’ NGG PAM site (Jinek et al., 2012, Science, 337:816-821). CRISPR interference (CRISPRi) exploits the sequence-specific localization of a nuclease-deficient Cas (e.g., dCas9) to repress gene expression by blocking either access to a promoter or transcript elongation (Bikard et al., 2013, Nucleic acids research, 41:7429-7437). Recently, Cas9 and dCas9 have been adapted as enrichment tools for targeted sequencing, enrichment of mutants within a heterogeneous population, and genotypic enrichment during chemical-genetic profiling (Schultzhaus et al., 2021, Biotechnology advances, 46:107672; Feldman et al., 2020, BMC biology, 18:177; Li et al., 2022, Nature microbiology, 7:766-779; Jost et al., 2017, Molecular cell, 68:210-223 e216).
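By way of illustration only, the targeting rule stated above (a 20 bp region followed by a 3’ NGG PAM) can be sketched in a few lines of Python. The function name `find_protospacers` and the example sequence are hypothetical and are not part of the disclosure; the sketch scans only the strand it is given and ignores every other consideration in guide design.

```python
import re

def find_protospacers(seq, length=20):
    """Return (start, protospacer, pam) for each site on the given strand
    where `length` nucleotides are immediately followed by an NGG PAM.
    Hypothetical helper for illustration; not taken from the application."""
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % length)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]

# Made-up example: one 20-nt site followed by the PAM "AGG".
example = "ACGT" * 5 + "AGG" + "CC"
sites = find_protospacers(example)
```

Under these assumptions, `sites` holds a single candidate starting at position 0. A practical design tool would also scan the reverse complement and screen for off-target matches.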

There continues to be a need in the art for improved enrichment tools. This invention satisfies this unmet need.

SUMMARY OF THE INVENTION

In some embodiments, the invention provides a nucleic acid molecule comprising a CRISPR based regulatory element, wherein the CRISPR based regulatory element comprises a barcode binding site for a Cas9/gRNA between an upstream promoter region and a downstream sequence encoding an effector protein.

In some embodiments, the effector protein comprises a toxic molecule. In some embodiments, the toxic molecule comprises sacB. In some embodiments, the effector protein comprises a marker protein.

In some embodiments, the invention provides a library comprising a plurality of nucleic acid molecules, wherein each nucleic acid molecule comprises a CRISPR based regulatory element, wherein the CRISPR based regulatory element comprises a barcode binding site for a Cas9/gRNA between an upstream promoter region and a downstream sequence encoding an effector protein.

In some embodiments, the effector protein comprises a toxic molecule. In some embodiments, the toxic molecule comprises sacB. In some embodiments, the effector protein comprises a marker protein.

In some embodiments, the nucleic acid molecules are selected from the group consisting of COSMID molecules, BAC molecules and PAC molecules.

In some embodiments, the invention provides a plurality of cells comprising a library comprising a plurality of nucleic acid molecules, wherein each nucleic acid molecule comprises a CRISPR based regulatory element, wherein the CRISPR based regulatory element comprises a barcode binding site for a Cas9/gRNA between an upstream promoter region and a downstream sequence encoding an effector protein. In some embodiments, the effector protein comprises a toxic molecule. In some embodiments, the toxic molecule comprises sacB. In some embodiments, the effector protein comprises a marker protein. In some embodiments, the nucleic acid molecules are selected from the group consisting of COSMID molecules, BAC molecules and PAC molecules.

In some embodiments, the invention provides an assay system or kit comprising: a) a plurality of cells comprising a plurality of nucleic acid molecules, wherein each nucleic acid molecule comprises a CRISPR based regulatory element, wherein the CRISPR based regulatory element comprises a barcode binding site for a Cas9/gRNA between an upstream promoter region and a downstream sequence encoding an effector protein; b) a plurality of gRNA molecules that target the barcodes of the nucleic acid molecules in the plurality of cells of a), and c) a Cas9 protein or nucleic acid molecule encoding a Cas9 protein. In some embodiments, the Cas9 is catalytically dead.

In some embodiments, the invention provides a method of selecting a specific target cell from a plurality of cells comprising a plurality of nucleic acid molecules, wherein each nucleic acid molecule comprises a CRISPR based regulatory element, wherein the CRISPR based regulatory element comprises a barcode binding site for a Cas9/gRNA between an upstream promoter region and a downstream sequence encoding an effector protein, the method comprising: a) contacting the plurality of cells comprising a plurality of nucleic acid molecules, wherein each nucleic acid molecule comprises a CRISPR based regulatory element, wherein the CRISPR based regulatory element comprises a barcode binding site for a Cas9/gRNA between an upstream promoter region and a downstream sequence encoding an effector protein with a Cas9/gRNA complex that specifically binds to the barcode binding site for the Cas9/gRNA of the nucleic acid molecule in the target cell, and b) selecting the target cell based on detection of an altered expression level of the downstream effector gene in the target cell. In some embodiments, expression of the downstream effector gene results in toxicity of the cell, but binding of the Cas9/gRNA complex to the barcode binding site for the Cas9/gRNA of the nucleic acid molecule in the target cell results in cell survival. In some embodiments, the Cas9 is catalytically dead.

BRIEF DESCRIPTION OF THE DRAWINGS

The following detailed description of embodiments of the invention will be better understood when read in conjunction with the appended drawings. It should be understood that the invention is not limited to the precise arrangements and instrumentalities of the embodiments shown in the drawings.

Figure 1A through Figure 1H depict the development of a CRISPR counter-selection interruption circuit (CCIC) and its application for retrieval of target sequences from large clone libraries. Figure 1A depicts a schematic of the core CCIC concept. When dCas9 is loaded with a guide RNA (solid box) targeting a sequence downstream of the promoter (open box), it leads to inhibition of sacB expression allowing for survival on media containing sucrose. Expression of sacB from a promoter containing a different downstream sequence that is not recognized by the guide RNA (open box) will be unaffected leading to cell death upon sucrose exposure. Figure 1B depicts a schematic where a guide RNA corresponding to a sequence in pPAC-T, but not in pPAC-N, was designed to test the feasibility of selective dCas-mediated sacB repression. Figure 1C depicts a graph quantifying the effectiveness of dCas9-mediated sacB repression measured as % survival of transformants containing pPAC-T or pPAC-N on sucrose containing media relative to non-sucrose media. The components for dCas9 suppression were provided on a single plasmid (pdCas9), or sgRNA was added to a strain with genomically incorporated dCas9 with its native promoter (gdCas; E. coli NdC). n=3, error bars represent standard deviation. Figure 1D depicts a graph quantifying the direct retrieval of pPAC-T when mixed with pPAC-N at different ratios. Number of pPAC-T colonies identified and the colonies screened are indicated. Figure 1E depicts a schematic illustrating the general concept of CCIC based clone libraries. Genomic or metagenomic DNA is cloned into a collection of barcoded CCIC vectors to generate a (meta)genomic library. The resulting library is sequenced to link a unique barcode with each DNA insert. A target clone of interest can then be retrieved by dCas9-mediated repression of sacB using unique barcode specific guide RNAs and plating on sucrose containing media. Figure 1F depicts a plasmid map of pCCIC with the incorporation of the degenerate barcode validated by Sanger sequencing of a barcoded pool of pCCIC vectors. Figure 1G depicts a graph quantifying the silencing of barcoded metagenomic cosmids (cosmid “A”, cosmid “B” and pool of “5,000” cosmids) at various sucrose concentrations using either a sequence specific sgRNA (S), a universal NP1 sgRNA (U), or a non-specific sgRNA (N). E. coli JdC, which contains a genomically integrated dCas9 with a strong constitutive promoter, was used. n=3, error bars represent standard deviation. Figure 1H depicts a graph quantifying specific retrieval of cosmid “A” and “B” used in Figure 1G from cosmid libraries of different sizes using 0.07% sucrose counter-selection. Pure guide plasmid clones and guide plasmid ligations specific for the cosmid “A” and “B” barcode sequences were tested for retrieval. n=3, error bars represent standard deviation.
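As a non-limiting illustration of the metric used in these panels, % survival on sucrose containing media relative to non-sucrose media, reported as a mean with standard deviation over n=3, may be computed as sketched below. The colony counts are invented placeholders, and the use of the sample standard deviation is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev

def percent_survival(sucrose_counts, control_counts):
    """Mean and sample standard deviation of per-replicate % survival,
    where each replicate is (colonies on sucrose) / (colonies without sucrose).
    Hypothetical helper; the sample SD is an assumption, not stated in the text."""
    pct = [100.0 * s / c for s, c in zip(sucrose_counts, control_counts)]
    return mean(pct), stdev(pct)

# Placeholder counts for three replicate transformations (not real data).
avg, sd = percent_survival([50, 60, 40], [100, 100, 100])
```

With these placeholder counts the per-replicate values are 50%, 60% and 40%.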

Figure 2A through Figure 2B depict the targeted retrieval from metagenomic and genomic CCIC libraries. Figure 2A depicts a schematic providing a general outline of metagenomic mining using CCIC retrieval (top). DNA extracted directly from soil is cloned into a pool of barcoded pCCIC vectors (barcodes represented by boxes, sacB is represented by an arrow) using lambda phage packaging to create a metagenomic cosmid library in E. coli. The cosmid library is sequenced using PacBio HiFi CCS long-read technology. Bioinformatic analysis of the assembled sequence data is used to identify captured biosynthetic diversity and the vector barcode sequences. Guide RNA matching a desired barcode is transformed into the library pool triggering target specific dCas9 silencing of sacB, leading to target retrieval by clone-specific survival on sucrose. CCIC-retrieval was used to isolate 4 CRISPR-Cas systems and a diverse collection of BGCs representing 12 different major biosynthetic classes (bottom). Figure 2B depicts a schematic providing a general pipeline for genomic library mining using edge mapping and CCIC (top). Genomic DNA is ligated into a pool of barcoded pCCIC vectors (barcodes are represented by boxes, sacB is represented by an arrow) and introduced into E. coli using lambda phage to generate a genomic DNA cosmid library. Cosmid DNA extracted from the genomic library is fragmented using Nextera ‘tagmentation’ (i.e., Tn5 transposase) allowing for PCR amplification of fragments containing both a vector barcode and the edge of the cloned sequence. Sequence ready amplicons are generated with a second nested-PCR allowing for paired-end MiSeq reads that link barcodes with the edge of each captured sequence. Based on the fact that lambda phage captures 30-40 kb of sequence, this data is used to generate a comprehensive index of captured regions across a reference genome. A guide RNA matching a desired barcode linked to a target genomic region is transformed into the library pool triggering target specific dCas9 silencing of sacB and leading to target retrieval by clone-specific growth on sucrose. Edge mapping data from an 11,000 membered Streptomyces albidoflavus cosmid library is overlaid on the reference genome annotated with the location of 23 BGCs (middle). Each line represents the edge sequence associated with a unique barcode, with the orientation of the insert indicated. All previously uncharacterized BGCs that could fit on a single cosmid were successfully isolated using the edge mapping and CCIC-retrieval (bottom). The precision of edge mapping also allowed isolation of 2 overlapping cosmids that contained a 41 kb polyketide synthase BGC (#5). The arrows indicate the edge of each cosmid. Full insert length is shown with predicted BGCs indicated.
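The edge-mapping index described above can be illustrated, without limitation, by the following Python sketch. It assumes each barcode has been linked to one mapped edge coordinate and an insert orientation, and it uses only the lower bound of the stated 30-40 kb capture range to decide which barcodes are certain to span a region of interest. The `Edge` record, the function names, and all coordinates are hypothetical.

```python
from dataclasses import dataclass

MIN_INSERT = 30_000  # lower bound of the 30-40 kb capture range stated in the text

@dataclass
class Edge:
    barcode: str
    position: int     # reference coordinate of the mapped insert edge (hypothetical)
    orientation: str  # "+" if the insert extends toward higher coordinates, else "-"

def guaranteed_span(edge):
    """Interval the insert must cover if at least MIN_INSERT bp were captured."""
    if edge.orientation == "+":
        return edge.position, edge.position + MIN_INSERT
    return edge.position - MIN_INSERT, edge.position

def candidate_barcodes(edges, region_start, region_end):
    """Barcodes whose guaranteed span fully contains the target region."""
    hits = []
    for e in edges:
        lo, hi = guaranteed_span(e)
        if lo <= region_start and region_end <= hi:
            hits.append(e.barcode)
    return hits

# Made-up index of three barcoded clones and a 3 kb region of interest.
index = [Edge("bc1", 100_000, "+"), Edge("bc2", 150_000, "-"), Edge("bc3", 500_000, "+")]
picked = candidate_barcodes(index, 125_000, 128_000)
```

Using the conservative 30 kb bound trades sensitivity for certainty: a clone whose insert is actually closer to 40 kb may cover a region that this sketch does not report.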

Figure 3A through Figure 3B depict data where inducible sacB was tested with several replication origins. Figure 3A depicts a schematic illustrating that multiple origins of replication were paired with lac or tetO regulated promoters. Figure 3B depicts data quantifying counter-selection efficiency of the tested constructs.

Figure 4A through Figure 4D depict data demonstrating that barcode ‘scrubbing’ generates a pool of clones with very low sucrose escape frequency. Figure 4A depicts an experiment where 2-Fragment and self-ligation cloning were tested for barcode addition. 2-Fragment cloning was carried forward due to ~3-fold lower sucrose escape frequency. Figure 4B depicts a work-flow diagram of the barcode ‘scrubbing’ method. A barcoded vector pool or cosmid library was grown overnight to confluence in LB + chloramphenicol at 37 °C and 200-r.p.m. shaking. The optical density at 600 nm (OD600) was measured, and the culture was diluted to a titer of ~200 cells per 50 µl in fresh LB + chloramphenicol. Then, 25 ml of diluted culture was prepared for each 384-well microplate to be seeded. A 12-channel pipette was used to seed each well in a 384-well microplate (VWR, 781281) with 50 µl of the diluted cells. The plate(s) were then grown overnight at 37 °C and 400-r.p.m. shaking. LB agar with chloramphenicol, 1% sucrose and 100 ng ml⁻¹ of anhydrotetracycline was prepared in OmniTrays (Thermo Fisher Scientific, 242811) that match the shape of the 384-well microplates. A 384-pin multi-blot replicator (VP 384, V&P Scientific) that delivers ~0.2 µl was used to replica-plate the microplate cultures onto the agar OmniTrays. The pinned OmniTrays were then grown overnight at 37 °C, and the microplates were stored at 4 °C. After overnight incubation, the OmniTrays were examined to determine which microplate wells did not generate any visible growth. The ‘no-growth’ wells were combined to generate a ‘scrubbed’ pool. Figure 4C depicts an example of a stamped plate (wells selected for pooling are indicated by boxes). Figure 4D depicts the resulting improvement in escape frequency of a ‘scrubbed’ pool. Abbreviations include: chl - 15 µg/µL chloramphenicol, suc - 1% w/v sucrose, aTc - 100 ng/µL anhydrotetracycline.
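The seeding arithmetic in the scrubbing work-flow can be checked with a short Python sketch. The target density, well volume, well count, and prepared volume are taken from the description above; the overnight titer passed to the function is a hypothetical input, because the text reports that OD600 was measured but does not state the conversion to cells per ml.

```python
def scrub_dilution(titer_cells_per_ml, cells_per_well=200, well_volume_ul=50,
                   wells=384, prepared_ml=25.0):
    """Dilution needed to seed a 384-well plate at ~200 cells per 50 µl.
    `titer_cells_per_ml` is an assumed input (not given in the text)."""
    target_per_ml = cells_per_well * 1000.0 / well_volume_ul   # cells per ml in the diluted culture
    dilution_factor = titer_cells_per_ml / target_per_ml       # fold dilution of the overnight culture
    culture_ul = prepared_ml * 1000.0 / dilution_factor        # overnight culture per prepared volume
    needed_ml = wells * well_volume_ul / 1000.0                # volume actually dispensed per plate
    return {
        "target_cells_per_ml": target_per_ml,
        "dilution_factor": dilution_factor,
        "culture_ul_per_plate": culture_ul,
        "dispensed_ml": needed_ml,
        "excess_ml": prepared_ml - needed_ml,
    }

# Hypothetical overnight titer of 4e9 cells per ml.
plan = scrub_dilution(4e9)
```

The fixed quantities work out to a target of 4,000 cells per ml and 19.2 ml dispensed per plate, so preparing 25 ml leaves a margin of about 5.8 ml. In practice a dilution of this size would be made serially rather than by pipetting a sub-microliter volume.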

Figure 5A through Figure 5E depict experiments that optimized sucrose and dCas9 expression for efficient barcode-targeted sacB repression. Figure 5A depicts the barcode location and sequences used for optimization experiments. Figure 5B depicts a graph quantifying that dCas9 targeting, using E. coli NdC, of initial barcode constructs did not lead to sucrose survival using a barcode specific sgRNA (S); however, targeting the universal NP1 region (U) immediately upstream of the barcode sequences afforded full survival. Non-specific sgRNA (N) did not afford any protection. Data are presented as mean values ± standard deviation, n=3 independent transformations. Figure 5C depicts a graph demonstrating that a sucrose concentration gradient showed 0.25% as the lowest effective concentration for counter-selection. Data are presented as mean values ± standard deviation, n=2 biologically independent platings. Figure 5D depicts a graph demonstrating that shifting barcode sequences to the highly active NP1 location did not provide effective survival at 5% sucrose but showed survival at 0.25%. Data are presented as mean values ± standard deviation, n=3 independent transformations. The ‘*’ indicates visibly poor growing (i.e., small) colonies. Figure 5E depicts a graph quantifying that barcode targeting within the increased dCas9 expression strain E. coli JdC led to survival at both 5% and 0.25%. In all cases, non-barcode specific guide RNAs (indicated by ‘(-)’) do not provide any sucrose protection. Data are presented as mean values ± standard deviation, n=3 independent transformations. The ‘*’ indicates visibly poor growing (i.e., small) colonies.

Figure 6A through Figure 6B depict an analysis of guide RNA sacB-silencing strength of cosmids that showed a range of recovery efficiencies. Figure 6A depicts a summary of clones targeted and their positive hit rates (PHR %). The cosmid concentration obtained from a standard plasmid miniprep (50 µL elution volume) is shown; data are presented as mean values ± standard deviation, n=3 independent plasmid purifications. Figure 6B depicts a graph showing the percent survival of clone monocultures transformed with their corresponding guide RNA (Specific gRNA), the universal NP1 guide RNA (Universal gRNA), or a non-specific guide RNA (Negative gRNA) with increasing sucrose concentration (i.e., increasing counterselection pressure). Data are presented as mean values ± standard deviation, n=3 independent transformations. The ‘*’ indicates visibly poor growing (i.e., small) colonies.

Figure 7 depicts an example visualization of edge mapping. Screen captures show the UGENE edge mapping visualization used to predict OTU 2277 as the target cosmid containing the Region 8 biosynthetic cluster.

Figure 8A through Figure 8B depict how edge mapping of the metagenomic library identified overlapping clones. Figure 8A depicts a schematic illustrating an overview of the concept of using a combination of long-read sequencing and edge mapping to mine complete BGCs from large metagenomic libraries. In this process a sub-sample of a large metagenomic library is subjected to long-read sequencing to identify BGCs of interest. Cost-effective edge mapping is then used to index the complete library to identify overlapping cosmids that complete partially captured BGCs. A hypothetical BGC is shown as gene blocks, and the vector sequence is shown as a thicker black line. Figure 8B depicts an experiment demonstrating that the ability to detect and isolate overlapping cosmids by edge mapping of the 10,000 metagenomic library was confirmed by recovery of barcode-OTU 477 (associated with PacBio contig 1912) and its internally mapped barcode-OTU 315 yielding two overlapping cosmids.

Figure 9 depicts that edge mapping corrected low quality PacBio barcodes. Edge mapping corrected a low quality PacBio barcode sequence previously disregarded because it did not match the expected degenerate pattern. The corrected barcode-OTU was successfully used to retrieve the lassopeptide BGC containing cosmid. PacBio and barcode-OTU determined barcode sequences are shown with lower case letters used to denote nucleotides that do not match expected degenerate barcode possibilities. The barcode location is shown in orange and vector sequence is denoted by the thick line.

DETAILED DESCRIPTION

The present invention relates to genetic circuits comprising a barcode comprising a degenerate CRISPR target sequence for binding by a Cas9/sgRNA complex, and to systems and methods incorporating the genetic circuits and barcodes of the invention. In some embodiments, the barcode is included between an upstream promoter sequence and a downstream sequence encoding an effector gene.

In one embodiment, the downstream effector gene is a selection molecule. In some embodiments, the downstream selection marker is a negative selection marker. In some embodiments, the downstream selection marker is a counter selection marker. In some embodiments, the downstream selection marker is a positive selection marker. In some embodiments, the selection marker is a blue/white selection marker. In some embodiments, the selection marker is a toxin.

In some embodiments, the invention provides methods for positive or negative selection of a target molecule comprising contacting a genetic circuit comprising a barcode comprising a degenerate CRISPR target sequence located between an upstream promoter sequence and a downstream sequence encoding an effector gene with a Cas9/sgRNA complex that binds to the barcode, and selecting the target molecule based on an alteration in expression or effect of the effector molecule. In some embodiments, the effector molecule is silenced by the interruption of transcription from the promoter due to the presence of the bound Cas9/sgRNA complex. In some embodiments, the effector gene is a toxin or conditional toxin and silencing of the effector gene results in survival of a cell or colony comprising the barcoded construct.

Definitions

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

As used herein, each of the following terms has the meaning associated with it in this section.

The articles “a” and “an” are used herein to refer to one or to more than one (i.e., to at least one) of the grammatical object of the article. By way of example, “an element” means one element or more than one element.

“About” as used herein when referring to a measurable value such as an amount, a temporal duration, and the like, is meant to encompass variations of ±20%, ±10%, ±5%, ±1%, or ±0.1% from the specified value, as such variations are appropriate to perform the disclosed methods.

The term “activate,” as used herein, means to induce or increase an activity or function, for example, about ten percent relative to a control value. Preferably, the activity is induced or increased by 50% compared to a control value, more preferably by 75%, and even more preferably by 95%. “Activate,” as used herein, also means to increase a molecule, a reaction, an interaction, a gene, an mRNA, and/or a protein’s expression, stability, function or activity by a measurable amount or to increase entirely. Activators are compounds that, e.g., bind to, partially or totally induce stimulation, increase, promote, induce activation, activate, sensitize, or up regulate a protein, a gene, and an mRNA stability, expression, function and activity, e.g., agonists.

As used herein in reference to a display library, a “barcode” refers to a unique molecular identifier to distinguish cells expressing distinct display molecules. For example, the barcode may be a unique DNA sequence within a cell that corresponds to a display molecule expressed by said cell. This barcode may be detected using methods including, but not limited to, next generation sequencing.

“Coding sequence” or “encoding nucleic acid” as used herein may refer to the nucleic acid (RNA or DNA molecule) that comprise a nucleotide sequence which encodes an antigen set forth herein. The coding sequence may further include initiation and termination signals operably linked to regulatory elements including a promoter and polyadenylation signal capable of directing expression in the one or more cells of an individual or mammal to whom the nucleic acid is administered. The coding sequence may further include sequences that encode signal peptides.

A “constitutive” promoter is a nucleotide sequence which, when operably linked with a polynucleotide which encodes or specifies a gene product, causes the gene product to be produced in a cell under most or all physiological conditions of the cell.

A “disease” is a state of health of an animal wherein the animal cannot maintain homeostasis, and wherein if the disease is not ameliorated then the animal’s health continues to deteriorate. In contrast, a “disorder” in an animal is a state of health in which the animal is able to maintain homeostasis, but in which the animal’s state of health is less favorable than it would be in the absence of the disorder. Left untreated, a disorder does not necessarily cause a further decrease in the animal’s state of health.

A disease or disorder is “alleviated” if the severity of a sign or symptom of the disease, or disorder, the frequency with which such a sign or symptom is experienced by a patient, or both, is reduced.

The term “expression” as used herein is defined as the transcription of a particular nucleotide sequence driven by its promoter and/or the translation of said nucleotide sequence into an amino acid sequence.

The term “gene” means the segment of DNA involved in producing a polypeptide chain. It may include regions preceding and following the coding region (leader and trailer) as well as intervening sequences (introns) between individual coding segments (exons).

As used herein, an “inducible” promoter is a nucleotide sequence which, when operably linked with a polynucleotide which encodes or specifies a gene product, causes the gene product to be produced substantially only when an inducer which corresponds to the promoter is present.

The term “inhibit,” as used herein, means to suppress or block an activity or function, for example, about ten percent relative to a control value. Preferably, the activity is suppressed or blocked by 50% compared to a control value, more preferably by 75%, and even more preferably by 95%. “Inhibit,” as used herein, also means to reduce a molecule, a reaction, an interaction, a gene, an mRNA, and/or a protein’s expression, stability, function or activity by a measurable amount or to prevent entirely. Inhibitors are compounds that, e.g., bind to, partially or totally block stimulation, decrease, prevent, delay activation, inactivate, desensitize, or down regulate a protein, a gene, and an mRNA stability, expression, function and activity, e.g., antagonists.

As used herein, an “instructional material” includes a publication, a recording, a diagram, or any other medium of expression which can be used to communicate the usefulness of a compound, composition, vector, or delivery system of the invention in the kit for effecting alleviation of the various diseases or disorders recited herein. Optionally, or alternately, the instructional material can describe one or more methods of alleviating the diseases or disorders in a cell or a tissue of a mammal. The instructional material of the kit of the invention can, for example, be affixed to a container which contains the identified compound, composition, vector, or delivery system of the invention or be shipped together with a container which contains the identified compound, composition, vector, or delivery system. Alternatively, the instructional material can be shipped separately from the container with the intention that the instructional material and the compound be used cooperatively by the recipient.

“Measuring” or “measurement,” or alternatively “detecting” or “detection,” means assessing the presence, absence, quantity or amount (which can be an effective amount) of a given substance.

The term “modulate,” as used herein, refers to mediating a detectable increase or decrease in a desired response. For example, a small molecule may be used to increase or decrease the level of interaction between two proteins.

As used herein, the term “next generation sequencing” refers to sequencing methods that allow for massively parallel sequencing of clonally amplified molecules and of single nucleic acid molecules. Next generation sequencing is synonymous with “massively parallel sequencing” for most purposes. Non-limiting examples of next generation sequencing include sequencing-by-synthesis using reversible dye terminators, and sequencing-by-ligation.

The term “nucleic acid” or “polynucleotide” refers to deoxyribonucleic acids (DNA) or ribonucleic acids (RNA) and polymers thereof in either single- or double-stranded form. Unless specifically limited, the term encompasses nucleic acids containing known analogues of natural nucleotides that have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions), alleles, orthologs, SNPs, and complementary sequences as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues (Batzer et al., Nucleic Acid Res. 19:5081 (1991); Ohtsuka et al., J. Biol. Chem. 260:2605-2608 (1985); and Rossolini et al., Mol. Cell. Probes 8:91-98 (1994)). The term nucleic acid is used interchangeably with gene, cDNA, and mRNA encoded by a gene.

“Operably linked” as used herein may mean that expression of a gene is under the control of a promoter with which it is spatially connected. A promoter may be positioned 5' (upstream) or 3' (downstream) of a gene under its control. The distance between the promoter and a gene may be approximately the same as the distance between that promoter and the gene it controls in the gene from which the promoter is derived. As is known in the art, variation in this distance may be accommodated without loss of promoter function.

As used herein in reference to interactions, “promote” refers to inducing or increasing an interaction between two species. For example, a small molecule may promote or increase interactions between two proteins.

“Promoter” as used herein may mean a synthetic or naturally-derived molecule which is capable of conferring, activating or enhancing expression of a nucleic acid in a cell. A promoter may comprise one or more specific transcriptional regulatory sequences to further enhance expression and/or to alter the spatial expression and/or temporal expression of same. A promoter may also comprise distal enhancer or repressor elements, which can be located as much as several thousand base pairs from the start site of transcription. A promoter may be derived from sources including viral, bacterial, fungal, plants, insects, and animals. A promoter may regulate the expression of a gene component constitutively, or differentially with respect to the cell, tissue or organ in which expression occurs, or with respect to the developmental stage at which expression occurs, or in response to external stimuli such as physiological stresses, pathogens, metal ions, or inducing agents. Representative examples of promoters include the promoters from GAL1 (galactose), PGK (phosphoglycerate kinase), ADH (alcohol dehydrogenase), AOX1 (alcohol oxidase), HIS4 (histidinol dehydrogenase), metallothionein, 3-phosphoglycerate kinase, and other glycolytic enzymes, such as enolase, glyceraldehyde-3-phosphate dehydrogenase, hexokinase, pyruvate decarboxylase, phosphofructokinase, glucose-6-phosphate isomerase, 3-phosphoglycerate mutase, pyruvate kinase, triosephosphate isomerase, phosphoglucose isomerase, and glucokinase.

The term “regulating” as used herein can mean any method of altering the level or activity of a substrate. Non-limiting examples of regulating with regard to a protein include affecting expression (including transcription and/or translation), affecting folding, affecting degradation or protein turnover, and affecting localization of a protein. Non-limiting examples of regulating with regard to an enzyme further include affecting the enzymatic activity. “Regulator” refers to a molecule whose activity includes affecting the level or activity of a substrate. A regulator can be direct or indirect. A regulator can function to activate or inhibit or otherwise modulate its substrate.

The terms “subject”, “individual”, “patient” and the like are used interchangeably herein, and refer to any animal, or cells thereof whether in vitro or in situ, amenable to the methods described herein. In some non-limiting embodiments, the patient, subject or individual is a human. In various embodiments, the subject is a human subject, and may be of any race, sex, and age.

“Vector” as used herein may mean a nucleic acid sequence containing an origin of replication. A vector may be a plasmid, bacteriophage, bacterial artificial chromosome or yeast artificial chromosome. A vector may be a DNA or RNA vector. A vector may be either a self-replicating extrachromosomal vector or a vector which integrates into a host genome.

Ranges: throughout this disclosure, various aspects of the invention can be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 2.7, 3, 4, 5, 5.3, and 6. This applies regardless of the breadth of the range.

Description

The invention is based, in part, on the development of a CRISPR-based selection methodology. The CRISPR-based selection methodologies of the invention employ a nuclease-dead CRISPR-associated protein (dCas) that complexes with small RNAs as guides (gRNAs) to target a barcode located between a promoter and a downstream effector molecule in a sequence-specific manner and silence or alter expression of the downstream effector molecule. In some embodiments, the CRISPR based system may use separate guide RNAs known as the crRNA and tracrRNA. In some embodiments, these two separate RNAs are combined into a single RNA to enable sequence-specific binding through the design of a short guide RNA. dCas and guide RNA (gRNA) may be synthesized by known methods. In some embodiments, the system comprises using a set of barcodes in a library of nucleic acid molecules which are targeted by specific dCas/guide-RNA complexes in a sequence-specific manner to select a specific barcoded sequence from the library. In one embodiment, a guide RNA (gRNA) targeted to the barcode region of a nucleic acid molecule and a dCas peptide form a complex with the barcode-containing genetic construct. In some embodiments, targeting of the barcode-containing genetic construct by the dCas/gRNA complex results in silencing of a downstream effector molecule. In some embodiments, the downstream effector molecule is a negative selection gene; silencing of the negative selection gene therefore promotes counterselection of the targeted barcode-containing genetic construct.

In general, “CRISPR-Cas system” or “CRISPR system” refers collectively to transcripts and other elements involved in the expression of or directing the activity of CRISPR-associated (“Cas”) genes, including sequences encoding a Cas gene, a tracr (trans-activating CRISPR) sequence (e.g., tracrRNA or an active partial tracrRNA), a tracr-mate sequence (encompassing a “direct repeat” and a tracrRNA-processed partial direct repeat in the context of an endogenous CRISPR system), a guide sequence (also referred to as a “spacer” in the context of an endogenous CRISPR system), or other sequences and transcripts from a CRISPR locus. In some embodiments, one or more elements of a CRISPR system is derived from a type I, type II, or type III CRISPR system. In some embodiments, one or more elements of a CRISPR system are derived from a particular organism comprising an endogenous CRISPR system, such as Streptococcus pyogenes. In general, a CRISPR system is characterized by elements that promote the formation of a CRISPR complex at the site of a target sequence (also referred to as a protospacer in the context of an endogenous CRISPR system).

In some embodiments, the CRISPR target site is determined by the CRISPR-Cas system guide RNA. In general, a “CRISPR-Cas guide RNA” or “guide RNA” refers to an RNA that directs sequence-specific binding of a CRISPR complex to the target sequence. Typically, a guide RNA comprises (i) a guide sequence that has sufficient complementarity with a target polynucleotide sequence to hybridize with the target sequence and (ii) a trans-activating cr (tracr) mate sequence. In some embodiments, the degree of complementarity between a guide sequence and its corresponding target sequence, when optimally aligned using a suitable alignment algorithm, is about or more than about 50%, 60%, 75%, 80%, 85%, 90%, 95%, 97.5%, 99%, or more. In some embodiments, a guide sequence is about or more than about 5, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 75, or more nucleotides in length. In some embodiments, a guide sequence is less than about 75, 50, 45, 40, 35, 30, 25, 20, 15, 12, or fewer nucleotides in length. The ability of a guide sequence to direct sequence-specific binding of a CRISPR complex to a target sequence may be assessed by any suitable assay. For example, the components of a CRISPR system sufficient to form a CRISPR complex, including the guide sequence to be tested, may be provided to a host cell having the corresponding target sequence, such as by transfection with vectors encoding the components of the CRISPR sequence, followed by an assessment of preferential cleavage within the target sequence, such as by Surveyor assay as described herein. 
Similarly, cleavage of a target polynucleotide sequence may be evaluated in a test tube by providing the target sequence, components of a CRISPR complex, including the guide sequence to be tested and a control guide sequence different from the test guide sequence, and comparing binding or rate of cleavage at the target sequence between the test and control guide sequence reactions. Other assays are possible, and will occur to those skilled in the art.
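
By way of a non-limiting illustration, the degree of complementarity between a guide sequence and a target sequence can be estimated computationally. The following is a minimal sketch only. It assumes an ungapped, antiparallel comparison of two equal-length sequences written 5' to 3', which is a simplification of the "suitable alignment algorithm" referred to above. The function name percent_complementarity and the example sequences are hypothetical and are not part of the disclosure.

```python
# Watson-Crick partner on a DNA target strand for each base of an RNA guide.
RNA_TO_DNA_PARTNER = {"A": "T", "U": "A", "G": "C", "C": "G"}


def percent_complementarity(guide_rna: str, target_dna: str) -> float:
    """Percent of guide positions able to base-pair with the target strand.

    Assumption for this sketch: both sequences are written 5'->3', have the
    same length, and are compared antiparallel without gaps.
    """
    guide = guide_rna.upper()
    target = target_dna.upper()
    if not guide or len(guide) != len(target):
        raise ValueError("sequences must be non-empty and equal in length")
    # Reverse the target so that guide position i faces its pairing partner.
    paired = sum(
        RNA_TO_DNA_PARTNER.get(g) == t
        for g, t in zip(guide, reversed(target))
    )
    return 100.0 * paired / len(guide)


if __name__ == "__main__":
    # Hypothetical 4-nt sequences, far shorter than a real guide sequence.
    print(percent_complementarity("GAUC", "GATC"))
    print(percent_complementarity("GAUC", "GATT"))
```

A score computed in this way could be compared against the complementarity thresholds listed above; a gapped alignment would require a dedicated alignment tool.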

In the context of formation of a CRISPR complex, a “target sequence” or “a sequence of a target DNA” refers to a sequence to which a guide sequence is designed to have complementarity, where hybridization between a target sequence and a guide sequence promotes the formation of a CRISPR complex. Full complementarity is not necessarily required, provided there is sufficient complementarity to cause hybridization and promote formation of a CRISPR complex. A target sequence may comprise any polynucleotide, such as DNA or RNA polynucleotides or DNA/RNA hybrid polynucleotides. In some embodiments, a target sequence is located in a barcode region of a polynucleotide, wherein the polynucleotide comprises a promoter operably linked to an effector gene or a gene sequence encoding an effector protein such that expression of the effector gene is regulated by the promoter, and further comprises a barcode sequence located between the promoter and the effector gene sequence. In some embodiments, the barcode sequence does not alter expression of the effector gene in the absence of a bound Cas/sgRNA but disrupts or alters the expression of the effector gene in the presence of a bound Cas/sgRNA. In some embodiments, the polynucleotide sequence comprising the target sequence may be within an organelle of a eukaryotic cell, for example, nucleus, mitochondrion or chloroplast. In some embodiments, the polynucleotide sequence comprising the target sequence may be an extrachromosomal polynucleotide sequence (e.g., a vector, COSMID, or BAC). In some embodiments, the polynucleotide sequence comprising the target sequence may be integrated into a chromosome of a host cell.

In some embodiments, the CRISPR-Cas domain comprises a Cas protein. Non-limiting examples of Cas proteins include Cas1, Cas1B, Cas2, Cas3, Cas4, Cas5, Cas6, Cas7, Cas8, Cas9 (also known as Csn1 and Csx12), Cas10, Csy1, Csy2, Csy3, Cse1, Cse2, Csc1, Csc2, Csa5, Csn2, Csm2, Csm3, Csm4, Csm5, Csm6, Cmr1, Cmr3, Cmr4, Cmr5, Cmr6, Csb1, Csb2, Csb3, Csx17, Csx14, Csx10, Csx16, CsaX, Csx3, Csx1, Csx15, Csf1, Csf2, Csf3, Csf4, homologs thereof, orthologs thereof, or modified versions thereof. In some embodiments, the Cas protein has DNA or RNA cleavage activity. In some embodiments, the Cas protein directs cleavage of one or both strands of a nucleic acid molecule at the location of a target sequence, such as within the target sequence and/or within the complement of the target sequence. In some embodiments, the Cas protein directs cleavage of one or both strands within about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 100, 200, 500, or more base pairs from the first or last nucleotide of a target sequence.

In some embodiments, the Cas protein is a catalytically dead variant of a Cas protein as described herein. In some embodiments, the catalytically dead Cas protein retains the ability to complex with an sgRNA molecule and bind to a target, but lacks nuclease activity. In some embodiments, the formation of a complex between the Cas/sgRNA and the target serves to inhibit expression of one or more downstream effector proteins.

In one embodiment, the invention provides an assay system comprising a) a library of host cells comprising CCIC barcoded nucleic acid molecules, b) a library of gRNA molecules that can be used to target the unique barcode region of the library of barcoded nucleic acid molecules and c) a Cas9 peptide or nucleic acid molecule encoding a Cas9 peptide. In some embodiments, the CCIC barcoded nucleic acid molecules comprise nucleic acid molecules comprising a promoter operably linked to a negative selection gene and further comprising a barcode sequence located between the promoter and the negative selection gene.

In one embodiment, the Cas9 protein comprises an amino acid sequence identical to the wild type Streptococcus pyogenes Cas9 amino acid sequence. In some embodiments, the Cas protein may comprise the amino acid sequence of a Cas protein from other species, for example other Streptococcus species, such as Streptococcus thermophilus; Pseudomonas aeruginosa, Escherichia coli, or other sequenced bacterial genomes and archaea, or other prokaryotic microorganisms. Other Cas proteins useful for the present invention are known or can be identified using methods known in the art (see e.g., Esvelt et al., 2013, Nature Methods, 10: 1116-1121). In some embodiments, the Cas protein may comprise a modified amino acid sequence, as compared to its natural source. For example, in one embodiment, the wild type Streptococcus pyogenes Cas9 sequence can be modified. For example, in some embodiments, the Cas9 protein comprises dCas9 having point mutations D10A and H840A, thereby rendering the protein catalytically deficient. In one embodiment, the assay system comprises a dCas peptide or a nucleic acid molecule encoding a dCas peptide.

Expression Constructs

In one embodiment, the invention relates to a recombinant nucleic acid construct comprising a barcode comprising a dCas9/gRNA target which functions as a regulatory element to modulate the expression of one or more downstream effector molecules.

The recombinant nucleic acid sequence construct described above can be placed in one or more vectors. The one or more vectors can contain an origin of replication. The one or more vectors can be a plasmid, bacteriophage, bacterial artificial chromosome or yeast artificial chromosome. The one or more vectors can be either a self-replicating extrachromosomal vector, or a vector which integrates into a host genome.

Vectors include, but are not limited to, plasmids, expression vectors, recombinant viruses, any form of recombinant "naked DNA" vector, and the like. A "vector" comprises a nucleic acid which can infect, transfect, transiently or permanently transduce a cell. It will be recognized that a vector can be a naked nucleic acid, or a nucleic acid complexed with protein or lipid. The vector optionally comprises viral or bacterial nucleic acids and/or proteins, and/or membranes (e.g., a cell membrane, a viral lipid envelope, etc.). Vectors include, but are not limited to replicons (e.g., RNA replicons, bacteriophages) to which fragments of DNA may be attached and become replicated. Vectors thus include, but are not limited to RNA, autonomous self-replicating circular or linear DNA or RNA (e.g., plasmids, viruses, and the like, see, e.g., U.S. Pat. No. 5,217,879), and include both the expression and non-expression plasmids. In some embodiments, the vector includes linear DNA, enzymatic DNA or synthetic DNA. Where a recombinant microorganism or cell culture is described as hosting an "expression vector" this includes both extra-chromosomal circular and linear DNA and DNA that has been incorporated into the host chromosome(s). Where a vector is being maintained by a host cell, the vector may either be stably replicated by the cells during mitosis as an autonomous structure, or is incorporated within the host's genome.

The one or more vectors can be a plasmid. The plasmid may be useful for transfecting cells with the recombinant nucleic acid sequence construct. The plasmid may be useful for introducing the recombinant nucleic acid sequence construct into the subject. The plasmid may also comprise a regulatory sequence, which may be well suited for gene expression in a cell into which the plasmid is administered.

The plasmid may also comprise a mammalian origin of replication in order to maintain the plasmid extra-chromosomally and produce multiple copies of the plasmid in a cell.

CCIC circuit

In certain example embodiments, the invention includes a nucleic acid molecule comprising a CRISPR based regulatory element comprising a barcode for binding by a dCas9/gRNA complex, wherein the barcode is located between a promoter and one or more downstream effector genes. In one embodiment, the CRISPR based regulatory element is a CCIC circuit, comprising a barcode for binding by a dCas9/gRNA complex, wherein the barcode is located between a promoter and one or more downstream counterselection genes.

In one embodiment, the CCIC circuit comprises a downstream effector encoding a selection marker. In one embodiment, the CCIC circuit comprises a downstream effector encoding a screening marker (e.g., a blue/white screening marker). In one embodiment, the CCIC circuit comprises a downstream effector encoding a reporter marker.

In one embodiment, the reporter molecule is a molecule, including polypeptide as well as polynucleotide, expression of which in a cell confers a detectable trait to the cell. In various embodiments, reporter markers include, but are not limited to, chloramphenicol-acetyl transferase (CAT), β-galactosyltransferase, horseradish peroxidase, luciferase, NanoLuc®, alkaline phosphatase, and fluorescent proteins including, but not limited to, green fluorescent proteins (e.g. GFP, TagGFP, T-Sapphire, Azami Green, Emerald, mWasabi, mClover3), red fluorescent proteins (e.g. mRFP1, JRed, HcRed1, AsRed2, AQ143, mCherry, mRuby3, mPlum), yellow fluorescent proteins (e.g. EYFP, mBanana, mCitrine, PhiYFP, TagYFP, Topaz, Venus), orange fluorescent proteins (e.g. DsRed, Tomato, Kusabira Orange, mOrange, mTangerine, TagRFP), cyan fluorescent proteins (e.g. CFP, mTFP1, Cerulean, CyPet, AmCyan1), blue fluorescent proteins (e.g. Azurite, mtagBFP2, EBFP, EBFP2, Y66H), near-infrared fluorescent proteins (e.g. iRFP670, iRFP682, iRFP702, iRFP713 and iRFP720), infrared fluorescent proteins (e.g. IFP1.4) and photoactivatable fluorescent proteins (e.g. Kaede, Eos, IrisFP, PS-CFP).

A selection marker sequence can be a positive selection marker or negative selection marker. Positive selection markers permit the selection for cells in which the gene product of the marker is expressed. This generally comprises contacting cells with an appropriate agent that, but for the expression of the positive selection marker, kills or otherwise selects against the cells.

Examples of selection markers also include, but are not limited to, proteins conferring resistance to compounds such as antibiotics, proteins conferring the ability to grow on selected substrates, proteins that produce detectable signals such as luminescence, catalytic RNAs and antisense RNAs. A wide variety of such markers are known and available, including, for example, a Zeocin™ resistance marker, a blasticidin resistance marker, a neomycin resistance (neo) marker (Southern & Berg, J. Mol. Appl. Genet. 1:327-41 (1982)), a puromycin (puro) resistance marker; a hygromycin resistance (hyg) marker (Te Riele et al., Nature 348:649-651 (1990)), thymidine kinase (tk), hypoxanthine phosphoribosyltransferase (hprt), and the bacterial guanine/xanthine phosphoribosyltransferase (gpt), which permits growth on MAX (mycophenolic acid, adenine, and xanthine) medium. See Song et al., Proc. Natl Acad. Sci. U.S.A. 84:6820-6824 (1987). Other selection markers include histidinol-dehydrogenase, chloramphenicol-acetyl transferase (CAT), dihydrofolate reductase (DHFR), β-galactosyltransferase and fluorescent proteins such as GFP.

Expression of a fluorescent protein can be detected using a fluorescence-activated cell sorter (FACS). Cells expressing β-galactosyltransferase also can be sorted by FACS, coupled with staining of living cells with a suitable substrate for β-galactosidase. A selection marker also may be a cell-substrate adhesion molecule, such as integrins, which normally are not expressed by the host cell. In one embodiment, the cell selection marker is of mammalian origin, for example, thymidine kinase, aminoglycoside phosphotransferase, asparagine synthetase, adenosine deaminase or metallothionein. In one embodiment, the cell selection marker can be neomycin phosphotransferase, hygromycin phosphotransferase or puromycin phosphotransferase, which confer resistance to G418, hygromycin and puromycin, respectively.

Suitable prokaryotic and/or bacterial selection markers include proteins providing resistance to antibiotics, such as kanamycin, tetracycline, and ampicillin. In one embodiment, a bacterial selection marker includes a protein capable of conferring selectable traits to both a prokaryotic host cell and a mammalian target cell.

In one embodiment, the CCIC circuit comprises a downstream effector encoding a negative selection marker. Negative selection markers permit the selection against cells in which the gene product of the marker is expressed. In some embodiments, the presence of appropriate agents causes cells that express “negative selection markers” to be killed or otherwise selected against. Alternatively, the expression of negative selection markers alone kills or selects against the cells.

Such negative selection markers include a polypeptide or a polynucleotide that, upon expression in a cell, allows for negative selection of the cell. Illustrative of suitable negative selection markers are (i) herpes simplex virus thymidine kinase (HSV-TK) marker, for negative selection in the presence of any of the nucleoside analogs acyclovir, gancyclovir, and 5-fluoroiodoamino-Uracil (FIAU), (ii) various toxin proteins such as the diphtheria toxin, the tetanus toxin, the cholera toxin and the pertussis toxin, (iii) hypoxanthine-guanine phosphoribosyl transferase (HPRT), for negative selection in the presence of 6-thioguanine, (iv) activators of apoptosis, or programmed cell death, such as the bcl2-binding protein (BAX), (v) the cytosine deaminase (codA) gene of E. coli, (vi) phosphatidyl choline phospholipase D, or (vii) sacB, which causes cell death by producing toxic levan in the presence of sucrose. In one embodiment, the negative selection marker requires host genotype modification (e.g. ccdB, tolC, thyA, rpsL and thymidine kinases).

In accordance with the present invention, the selection marker usually is selected based on the type of the cell undergoing selection. For instance, it can be eukaryotic (e.g., yeast), prokaryotic (e.g., bacterial) or viral. In such an embodiment, the selection marker sequence is operably linked to a promoter that is suited for that type of cell.

In some embodiments, the invention provides a set or library of barcoded nucleic acid molecules, wherein each barcode is unique from the other barcodes within the library and wherein each barcode functions as a target for binding by a dCas9/gRNA complex. In some embodiments, each nucleic acid molecule in the nucleic acid molecule library further comprises one or more nucleic acid sequences for screening, sequencing or testing. For example, in one embodiment, the library of nucleic acid molecules is a COSMID library, a bacterial artificial chromosome (BAC) library or a P1-derived artificial chromosome (PAC) based library wherein each COSMID, BAC, or PAC in the library comprises a CCIC circuit comprising a unique barcode for counterselection. In one embodiment, the library is a metagenomic library wherein each nucleic acid molecule in the metagenomic library comprises a CCIC circuit comprising a unique barcode for counterselection. In one embodiment, the library of nucleic acid molecules is a genomic library wherein each nucleic acid molecule in the genomic library comprises a CCIC circuit comprising a unique barcode for counterselection. In one embodiment, the library of nucleic acid molecules encodes a library of molecules to be screened (e.g., peptides, proteins, antibodies, etc.) wherein each nucleic acid molecule in the screening library comprises a CCIC circuit comprising a unique barcode for counterselection.

In certain example embodiments, the invention includes a library of gRNA molecules that can be used to target the degenerate barcodes present in a nucleic acid library of the invention.
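
As a hedged illustration of how a set of mutually distinct barcodes and a matching set of guide spacers might be enumerated in silico, a minimal sketch follows. The barcode length (20 nucleotides), the minimum pairwise Hamming distance (5), the fixed random seed, and the function names make_barcodes and spacer_for are all assumptions chosen for the example and are not specified by the disclosure. The sketch also omits any PAM or sequence-composition constraints that a practical design would need to consider.

```python
import random


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def make_barcodes(count: int, length: int = 20, min_distance: int = 5,
                  seed: int = 0, max_attempts: int = 100_000) -> list:
    """Draw random DNA barcodes that differ pairwise by >= min_distance.

    length, min_distance and seed are illustrative assumptions only.
    """
    rng = random.Random(seed)
    barcodes = []
    for _ in range(max_attempts):
        if len(barcodes) == count:
            break
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        # Keep the candidate only if it is far enough from every accepted one.
        if all(hamming(candidate, b) >= min_distance for b in barcodes):
            barcodes.append(candidate)
    if len(barcodes) < count:
        raise RuntimeError("could not find enough sufficiently distinct barcodes")
    return barcodes


def spacer_for(barcode: str) -> str:
    """RNA spacer carrying the same sequence as the DNA barcode (T -> U)."""
    return barcode.upper().replace("T", "U")


if __name__ == "__main__":
    library = make_barcodes(5)
    for barcode in library:
        print(barcode, spacer_for(barcode))
```

Requiring a minimum distance between barcodes is one possible way to reduce the chance that a guide directed to one barcode also binds another; the threshold that would be adequate in practice is not addressed by this sketch.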

Methods of Use

The invention provides methods of use of the compositions of the invention to modulate the expression of a downstream effector molecule.

In some embodiments, the invention provides methods for counterselection of a nucleic acid molecule of interest. In such an embodiment, the method comprises contacting a cell comprising a nucleic acid molecule comprising a CCIC circuit of the invention with a dCas9/gRNA complex which binds to the barcode region in the CCIC circuit, thereby silencing expression of the downstream effector molecule.

In one embodiment, the nucleic acid molecule encoding the downstream effector comprises a suicide gene, where expression of the gene results in the death of the cell comprising the nucleic acid molecule. In one embodiment, expression of the suicide gene is inducible, for example with the use of an inducible promoter regulating suicide gene expression or with the use of growth media comprising an inducer molecule. For example, in one embodiment, the suicide gene is sacB which is toxic in the presence of sucrose. In such an embodiment, silencing of the expression of sacB results in survival of the cell comprising the targeted CCIC circuit in the presence of sucrose.
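
The selection logic described above can be summarized with a short, idealized sketch. It assumes that every clone carries a promoter, a barcode and sacB in that order, that a dCas9/gRNA complex silences sacB completely and only when the guide matches the clone's barcode exactly, and that unsilenced sacB is always lethal on sucrose. Real systems may show leaky expression or off-target binding, which this model ignores. The function name surviving_clones, the clone names and the barcode sequences are hypothetical.

```python
def surviving_clones(library: dict, guide_barcode: str, sucrose: bool = True) -> list:
    """Return the names of clones expected to survive the selection.

    library maps a clone name to the barcode placed between its promoter
    and sacB. Idealized assumption: an exact barcode match fully silences
    sacB, and expressed sacB is lethal whenever sucrose is present.
    """
    survivors = []
    for name, barcode in library.items():
        sacb_silenced = barcode == guide_barcode
        sacb_lethal = sucrose and not sacb_silenced
        if not sacb_lethal:
            survivors.append(name)
    return survivors


if __name__ == "__main__":
    # Hypothetical three-clone library with made-up 20-nt barcodes.
    example_library = {
        "clone_A": "ACGTACGTACGTACGTACGT",
        "clone_B": "TTGACCGATAGGCTAACGTC",
        "clone_C": "GGCATTCAGATCCGTAAGCA",
    }
    print(surviving_clones(example_library, "TTGACCGATAGGCTAACGTC"))
```

Under these assumptions, plating on sucrose while expressing a guide directed to one barcode would leave only the clone carrying that barcode, whereas all clones would survive in the absence of sucrose.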

Cells

In one embodiment, the invention relates to cells or cell libraries containing the CCIC-targeted barcode containing constructs of the invention. Methods of introducing and expressing exogenous nucleic acid molecules (e.g., expression vectors) in a cell are known in the art. In the context of an expression vector, the vector can be readily introduced into a host cell, e.g., mammalian, bacterial, yeast, or insect cell by any method in the art. For example, the expression vector can be transferred into a host cell by physical, chemical, or biological means.

Physical methods for introducing a polynucleotide into a host cell include calcium phosphate precipitation, lipofection, particle bombardment, microinjection, electroporation, and the like. Methods for producing cells comprising vectors and/or exogenous nucleic acids are well-known in the art. See, for example, Sambrook et al. (2012, Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, New York). In one embodiment, the method of introduction of a polynucleotide into a host cell is calcium phosphate transfection.

Biological methods for introducing a polynucleotide of interest into a host cell include the use of DNA and RNA vectors. Viral vectors, and especially retroviral vectors, have become the most widely used method for inserting genes into mammalian, e.g., human cells. Other viral vectors can be derived from lentivirus, poxviruses, herpes simplex virus I, adenoviruses and adeno-associated viruses, and the like. See, for example, U.S. Pat. Nos. 5,350,674 and 5,585,362.

Chemical means for introducing a polynucleotide into a host cell include colloidal dispersion systems, such as macromolecule complexes, nanocapsules, microspheres, beads, and lipid-based systems including oil-in-water emulsions, micelles, mixed micelles, and liposomes. An exemplary colloidal system for use as a delivery vehicle in vitro and in vivo is a liposome (e.g., an artificial membrane vesicle).

In the case where a non-viral delivery system is utilized, an exemplary delivery vehicle is a liposome. The use of lipid formulations is contemplated for the introduction of the nucleic acids into a host cell (in vitro, ex vivo or in vivo). In another aspect, the nucleic acid may be associated with a lipid. The nucleic acid associated with a lipid may be encapsulated in the aqueous interior of a liposome, interspersed within the lipid bilayer of a liposome, attached to a liposome via a linking molecule that is associated with both the liposome and the oligonucleotide, entrapped in a liposome, complexed with a liposome, dispersed in a solution containing a lipid, mixed with a lipid, combined with a lipid, contained as a suspension in a lipid, contained or complexed with a micelle, or otherwise associated with a lipid. Lipid, lipid/DNA or lipid/expression vector associated compositions are not limited to any particular structure in solution. For example, they may be present in a bilayer structure, as micelles, or with a “collapsed” structure. They may also simply be interspersed in a solution, possibly forming aggregates that are not uniform in size or shape. Lipids are fatty substances which may be naturally occurring or synthetic lipids. For example, lipids include the fatty droplets that naturally occur in the cytoplasm as well as the class of compounds which contain long-chain aliphatic hydrocarbons and their derivatives, such as fatty acids, alcohols, amines, amino alcohols, and aldehydes.

Lipids suitable for use can be obtained from commercial sources. For example, dimyristyl phosphatidylcholine (“DMPC”) can be obtained from Sigma, St. Louis, MO; dicetyl phosphate (“DCP”) can be obtained from K & K Laboratories (Plainview, NY); cholesterol (“Chol”) can be obtained from Calbiochem-Behring; dimyristyl phosphatidylglycerol (“DMPG”) and other lipids may be obtained from Avanti Polar Lipids, Inc. (Birmingham, AL). Stock solutions of lipids in chloroform or chloroform/methanol can be stored at about -20°C. Chloroform is used as the only solvent since it is more readily evaporated than methanol. “Liposome” is a generic term encompassing a variety of single and multilamellar lipid vehicles formed by the generation of enclosed lipid bilayers or aggregates. Liposomes can be characterized as having vesicular structures with a phospholipid bilayer membrane and an inner aqueous medium. Multilamellar liposomes have multiple lipid layers separated by aqueous medium. They form spontaneously when phospholipids are suspended in an excess of aqueous solution. The lipid components undergo self-rearrangement before the formation of closed structures and entrap water and dissolved solutes between the lipid bilayers (Ghosh et al., 1991 Glycobiology 5: 505-10). However, compositions that have different structures in solution than the normal vesicular structure are also encompassed. For example, the lipids may assume a micellar structure or merely exist as nonuniform aggregates of lipid molecules. Also contemplated are lipofectamine-nucleic acid complexes.

Regardless of the method used to introduce exogenous nucleic acids into a host cell, in order to confirm the presence of the recombinant DNA sequence in the host cell, a variety of assays may be performed. Such assays include, for example, “molecular biological” assays well known to those of skill in the art, such as Southern and Northern blotting, RT-PCR and PCR; “biochemical” assays, such as detecting the presence or absence of a particular peptide, e.g., by immunological means (ELISAs and Western blots) or by assays described herein to identify agents falling within the scope of the invention.

In one embodiment, the present invention provides a cell or population of cells modified to express barcoded polynucleotides of the invention. In one embodiment, the cells are prokaryotic cells. In one embodiment, cells are eukaryotic cells. In one embodiment, a cell is a mammalian cell, such as a murine or human cell. The target cell may be a somatic cell or a germ cell. The germ cell may be a stem cell, such as embryonic stem cells (ES cells), including murine embryonic stem cells. The target cell may be an induced pluripotent stem cell (iPSC) or a myeloid cell that can be differentiated into microglia-like or myeloid-like cells. The target cell may be chosen from commercially available mammalian cell lines. The target cell may be a primary cell isolated from a subject. A target cell may be any type of diseased cell, including cells with abnormal phenotypes that can be identified using biological or biochemical assays. In one embodiment, a cell may be an HEK293 cell. In one embodiment, a cell may be a myeloid cell line that expresses TREM2 and its signaling partner DAP12 (TYROBP) (Satoh et al., 2012, Cell Mol Neurobiol, 32:337-343), such as THP-1 cells.

The cells of the invention and cells derived therefrom can be derived from, inter alia, humans, primates, rodents and birds. In one embodiment, the cells of the invention are derived from mammals, especially mice, rats and humans. In one embodiment, cells may be either wild-type or genetically modified cells.

The cells of the present invention, whether grown in suspension or as adherent cell cultures, are grown in contact with culture media.

In one embodiment, culture media used in the present invention comprises a basal medium, optionally supplemented with additional components. Basal medium is a medium that supplies essential sources of carbon and/or vitamins and/or minerals for the cells. The basal medium is generally free of protein and incapable on its own of supporting self-renewal/symmetrical division of the cells. Media formulations that support the growth of cells include, but are not limited to, Minimum Essential Medium Eagle, ADC-1, LPM (bovine serum albumin-free), F10 (HAM), F12 (HAM), DCCM1, DCCM2, RPMI 1640, BGJ Medium (with and without Fitton-Jackson Modification), Basal Medium Eagle (BME-with the addition of Earle's salt base), Dulbecco's Modified Eagle Medium (DMEM-without serum), Yamane, IMEM-20, Glasgow Modification Eagle Medium (GMEM), Leibovitz L-15 Medium, McCoy's 5A Medium, Medium M199 (M199E-with Earle's salt base), Medium M199 (M199H-with Hank's salt base), Minimum Essential Medium Eagle (MEM-E-with Earle's salt base), Minimum Essential Medium Eagle (MEM-H-with Hank's salt base) and Minimum Essential Medium Eagle (MEM-NAA with nonessential amino acids), and the like.

It is further recognized that additional components may be added to the culture medium. Such components include, but are not limited to, antibiotics, antimycotics, albumin, growth factors, amino acids, and other components known to the art for the culture of cells. Antibiotics which can be added into the medium include, but are not limited to, penicillin and streptomycin. The concentration of penicillin in the culture medium is about 10 to about 200 units per ml. The concentration of streptomycin in the culture medium is about 10 to about 200 µg/ml. However, the invention should in no way be construed to be limited to any one medium for culturing the cells of the invention. Rather, any media capable of supporting the cells of the invention in tissue culture may be used.

Typical substrates for culture of the cells in all aspects of the invention are culture surfaces recognized in this field as useful for cell culture, and these include surfaces of plastics, metal, and composites, though commonly a surface such as a plastic tissue culture plate, widely commercially available, is used. Such plates are often a few centimeters in diameter. For scale up, this type of plate can be used at much larger diameters and many repeat plate units used. For high throughput assays, multi-well plates having 6, 12, 24, 48, 96 or more wells can be used.

The culture surface may further comprise a cell adhesion protein, usually coated onto the surface. Receptors or other molecules present on the cells bind to the protein or other cell culture substrate and this promotes adhesion to the surface and promotes growth. In certain embodiments, the cultures of the invention are adherent cultures, i.e., the cells are attached to a substrate.

Library

The present invention relates generally to a method of selecting for at least one specific nucleic acid molecule within a library of nucleic acid molecules, the method comprising: contacting a library of host cells comprising CCIC barcoded nucleic acid molecules with one or more sgRNA molecules from a library of sgRNA molecules that can be used to target the unique barcode region of the library of barcoded nucleic acid molecules, and with a Cas9 peptide or nucleic acid molecule encoding a Cas9 peptide. Therefore, in some embodiments the invention provides a library of barcoded CCIC vectors, wherein each CCIC vector comprises a multiple cloning site for insertion of a target nucleotide for analysis by sequencing (insert). In some embodiments the invention provides a sequencing library of barcoded CCIC vectors, wherein each CCIC vector comprises a unique target nucleotide for analysis by sequencing (insert). In some embodiments the invention provides a host cell library comprising the sequencing library of barcoded CCIC vectors, wherein each CCIC vector comprises a unique target nucleotide for analysis by sequencing (insert). In some embodiments the invention provides a sgRNA library, wherein the sgRNA sequences are designed to target the barcodes of the CCIC vectors.

In one embodiment, the CCIC vector library comprises at least about 1, 2, 10, 100, 1,000, 10,000, 100,000, 200,000, 400,000, 1 million, or more than 1 million unique vectors, wherein each vector comprises a MCS for insertion of a nucleotide sequence for sequencing analysis, and a unique barcode sequence that can be targeted by sgRNA molecules. In one embodiment, the at least about 1, 2, 10, 100, 1,000, 10,000, 100,000, 200,000, 400,000, 1 million, or more than 1 million CCIC vectors are each incorporated into a host cell in a library. In one embodiment, at least one barcode sequence comprises less than about 40, 30 or less than 20 nucleotides. In one embodiment, at least one barcode sequence comprises more than about 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, or more than 35 nucleotides.
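As a rough illustration of how barcode length relates to library size, the following sketch counts the sequences available to a barcode of a given length and finds the shortest length whose sequence space covers a library. It assumes a fully degenerate barcode (any of A, C, G or T at every position), which is an assumption made for illustration and is not stated for these embodiments; the function names are hypothetical.

```python
def barcode_space(length: int) -> int:
    """Count distinct sequences of a fully degenerate barcode.

    Assumption: each position may be A, C, G or T with no constraints
    (no PAM requirement, no excluded motifs).
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    return 4 ** length


def min_barcode_length(library_size: int) -> int:
    """Shortest barcode length whose sequence space is >= library_size."""
    length = 0
    while barcode_space(length) < library_size:
        length += 1
    return length


if __name__ == "__main__":
    for size in (10_000, 1_000_000):
        n = min_barcode_length(size)
        print(f"library of {size}: at least {n} nt ({barcode_space(n)} sequences)")
```

A sequence space equal to the library size is only a lower bound: randomly drawn barcodes can collide, so a practical barcode would likely be considerably longer than this minimum.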

In one embodiment, the sequencing library comprises at least about 1, 2, 10, 100, 1,000, 10,000, 100,000, 200,000, 400,000, 1 million, or more than 1 million CCIC vectors comprising unique inserts in the MCS of the CCIC vector. In one embodiment, the at least about 1, 2, 10, 100, 1,000, 10,000, 100,000, 200,000, 400,000, 1 million, or more than 1 million insert sequences are each incorporated into at least one CCIC vector which is incorporated into a host cell in a library.

In one embodiment, the library of host cells comprises a library of nucleic acid molecules comprising CCIC barcodes. In one embodiment, the library of nucleic acid molecules comprising CCIC barcodes comprises at least about 1, 2, 10, 100, 1,000, 10,000, 100,000, 200,000, 400,000, 1 million, or more than 1 million unique barcodes that can be targeted by sgRNA molecules. In one embodiment, the at least about 1, 2, 10, 100, 1,000, 10,000, 100,000, 200,000, 400,000, 1 million, or more than 1 million amino acid sequences are each incorporated into at least one nucleic acid molecule which is incorporated into a host cell in a library. In one embodiment, at least one barcode sequence comprises less than about 40, 30 or less than 20 nucleotides. In one embodiment, at least one barcode sequence comprises more than about 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, or more than 35 nucleotides.

In one embodiment, the library of host cells comprises a library of engineered cells comprising an integrated Cas9 gene. In some embodiments, the library of cells has been transfected with a CCIC vector library, wherein each nucleic acid molecule of the library comprises a) a multiple cloning site for integration of a nucleic acid sequence (insert), the sequence of which is to be determined by a sequencing-based method, b) a promoter region for expression of a downstream selection gene, c) a barcode sequence located between the promoter region and the downstream selection gene and further located in proximity to the multiple cloning site, such that the barcode is useful as 1) a sequencing barcode for sequencing of the insert into a) and 2) a binding site for a Cas9/sgRNA complex, and d) a downstream selection gene operably linked to the promoter of b) such that the expression of the selection marker is driven by the promoter.

In some embodiments, the library of cells has been transfected with a sequencing library, wherein each nucleic acid molecule of the library comprises a) an insert comprising a nucleic acid sequence, the sequence of which is to be determined by a sequencing-based method, b) a promoter region for expression of a downstream selection gene, c) a barcode sequence located between the promoter region and the downstream selection gene and further located in proximity to the multiple cloning site, such that the barcode is useful as 1) a sequencing barcode for sequencing of the insert into a) and 2) a binding site for a Cas9/sgRNA complex, and d) a downstream selection gene operably linked to the promoter of b) such that the expression of the selection marker is driven by the promoter.

In one embodiment, the library of nucleic acid molecules comprising CCIC barcodes comprises at least about 1, 2, 10, 100, 1,000, 10,000, 100,000, 200,000, 400,000, 1 million, or more than 1 million unique barcodes that can be targeted by sgRNA molecules. In one embodiment, the at least about 1, 2, 10, 100, 1,000, 10,000, 100,000, 200,000, 400,000, 1 million, or more than 1 million amino acid sequences are each incorporated into at least one nucleic acid molecule which is incorporated into a host cell in a library.

In one embodiment, the library of sgRNA molecules comprises at least about 1, 2, 10, 100, 1,000, 10,000, 100,000, 200,000, 400,000, 1 million, or more than 1 million unique sgRNA molecules which can bind to the barcode sequence(s). In one embodiment, at least one sgRNA molecule comprises less than about 40, 30 or less than 20 nucleotides. In one embodiment, at least one sgRNA molecule comprises more than about 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, or more than 35 nucleotides.

Sequencing

In one embodiment, the methods of the invention include the step of sequencing the inserts of the nucleic acid libraries. In one embodiment, high-throughput next generation sequencing can be used to determine the sequences of a library of nucleotide sequences that are inserted into the multiple cloning sites of the library of barcoded sequencing vectors of the invention.

In one embodiment, the methods described herein can utilize next-generation sequencing technologies that allow multiple samples to be sequenced individually as genomic molecules (i.e., singleplex sequencing) or as pooled samples comprising indexed genomic molecules (e.g., multiplex sequencing) on a single sequencing run. These methods can generate up to several hundred million reads of DNA sequences. In various embodiments, the sequences of nucleic acid sequence barcodes can be determined using, for example, the next generation sequencing technologies described herein. In various embodiments, analysis of the massive amount of sequence data obtained using next-generation sequencing can be performed using one or more processors as described herein.
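To make the idea of multiplex sequencing with indexed molecules concrete, the following minimal sketch assigns pooled reads back to their samples by index. It assumes, purely for illustration, that the index occupies a fixed number of bases at the start of each read and must match exactly; the function name and the read layout are hypothetical and do not describe any particular sequencing platform.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample, index_length):
    """Group pooled reads by sample using a leading index sequence.

    Assumptions (illustrative only): the index is the first
    `index_length` bases of each read and must match exactly.
    Reads with an unknown index are collected under "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        index, body = read[:index_length], read[index_length:]
        sample = index_to_sample.get(index, "unassigned")
        bins[sample].append(body)
    return dict(bins)


if __name__ == "__main__":
    pooled = ["ACGTTTT", "TTGAGGG", "ACGTCCC", "GGGGAAA"]
    samples = {"ACGT": "sample_1", "TTGA": "sample_2"}
    print(demultiplex(pooled, samples, index_length=4))
```

Exact matching keeps the sketch short; tolerating sequencing errors in the index would require a mismatch-aware lookup.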

In some embodiments, the nucleic acid product can be sequenced by next generation sequencing methods. In some embodiments, the next generation sequencing method comprises a method selected from the group consisting of Ion Torrent, Illumina, SOLiD, 454, Massively Parallel Signature Sequencing, solid phase reversible dye terminator sequencing, and DNA nanoball sequencing. In some embodiments, the first and second sequencing primers are compatible with the selected next generation sequencing method.

In some embodiments, sequencing can be performed by next generation sequencing methods. As used herein, “next generation sequencing” means an oligonucleotide sequencing technique that has the ability to sequence oligonucleotides at a greater rate than was possible with conventional sequencing methods (e.g., Sanger sequencing) by reading thousands to millions of sequencing reactions simultaneously. Non-limiting examples of next generation sequencing methods/platforms include Massively Parallel Signature Sequencing (Lynx Therapeutics); pyrophosphate sequencing/454 (454 Life Sciences/Roche Diagnostics); Solid Phase Reversible Dye Terminator Sequencing (Solexa/Illumina); SOLiD technology (Applied Biosystems); ion semiconductor sequencing (Ion Torrent); DNA nanoball sequencing (Complete Genomics); and technologies available from Pacific Biosciences, Intelligent Bio-Systems, Oxford Nanopore Technologies, and Helicos Biosciences. In some embodiments, the sequencing primer may comprise a moiety that is compatible with the selected next generation sequencing method.

Next generation sequencing techniques and related sequencing primer constraints and design parameters are well known in the art (e.g., Shendure et al., 2008, Nature, 26:1135-1145; Mardis, 2007, Trends in Genetics, 24:133-141; Su et al., 2011, Expert. Rev. Mol. Diagn., 11:333-43; Zhang et al., 2011, J. Genet. Genomics, 38:95-109; Nyren P et al. 1993, Anal. Biochem., 208:171-175; Bentley et al., 2006, Curr. Opin. Genet. Dev., 16:545-552; Strausberg et al., 2008, Drug Disc. Today, 13:569-577; U.S. Patent No. 7,282,337; U.S. Patent No. 7,279,563; U.S. Patent No. 7,226,720; U.S. Patent No. 7,220,549; U.S. Patent No. 7,169,560; U.S. Patent Application Publication No. 20070070349; U.S. Patent No. 6,818,395; U.S. Patent No. 6,911,345; U.S. Patent Application Publication No. 2006/0252077; No. 2007/0070349).

Several targeted next generation sequencing methods are described in the literature (for review see e.g., Teer and Mullikin, 2010, Human Mol. Genet. 19:R145-151), all of which can be used in conjunction with the present invention. Many of these methods (described e.g. as genome capture, genome partitioning, genome enrichment etc.) use hybridization techniques and include array-based (e.g., Hodges et al., 2007, Nat. Genet., 39:1522-1527) and liquid based (e.g., Choi et al., 2009, Proc. Natl. Acad. Sci USA, 106:19096-19101) hybridization approaches. Commercial kits for DNA sample preparation are also available: for example, Illumina Inc. (San Diego, California) offers the TruSeq™ DNA Sample Preparation Kit and the TruSeq™ Exome Enrichment Kit.

There are many methods known in the art for the detection, identification, and quantification of specific nucleic acid sequences (e.g., nucleic acid sequence barcodes) and new methods are continually reported.

Selection

In some embodiments, the methods include selecting one or more specific host cells comprising a CCIC vector comprising a target insert from a library of host cells comprising a library of CCIC vectors. In some embodiments the selection methods of the invention include methods of contacting a host cell library of the invention with one or more sgRNAs specific for binding to the barcode sequence associated with the target insert of interest, contacting the host cell with an inducer molecule to induce cell death when the effector molecule is expressed, and selecting a clonal isolate that grows in the presence of the inducer molecule as comprising the target insert. In some embodiments, the effector molecule is the sacB gene and the inducer molecule is sucrose.
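The selection logic described above can be sketched as a toy model: look up the barcode(s) linked to a target insert, then predict which clones survive the inducer when those barcodes are targeted. This is an idealized illustration, assuming that a targeted barcode fully silences the effector and that no escape mutants or off-target effects occur; the function names and the example barcodes are hypothetical.

```python
def barcodes_for_target(barcode_to_insert, target_insert):
    """Return the barcodes linked to the insert of interest (sorted)."""
    return sorted(b for b, ins in barcode_to_insert.items() if ins == target_insert)


def survivors_on_inducer(barcode_to_insert, targeted_barcodes):
    """Predict clones that grow in the presence of the inducer.

    Idealized assumption: a clone survives only if its barcode is
    targeted (effector silenced); untargeted clones express the
    effector and die. Escape mutants are ignored.
    """
    targeted = set(targeted_barcodes)
    return {b: ins for b, ins in barcode_to_insert.items() if b in targeted}


if __name__ == "__main__":
    # Hypothetical four-base barcodes, shortened for readability.
    library = {"AAAC": "insert_X", "GGTT": "insert_Y", "CCAT": "insert_Z"}
    guides = barcodes_for_target(library, "insert_Y")
    print(guides, survivors_on_inducer(library, guides))
```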

Kits

The present invention also pertains to kits useful in the methods of the invention. Such kits comprise various combinations of components useful in any of the methods described elsewhere herein. For example, in one embodiment, the kit comprises components useful for generating or performing a CCIC assay as described herein. In one embodiment, the kit contains additional components. In one embodiment, an additional component includes but is not limited to instructional material. In one embodiment, instructional material for use with a kit of the invention may be provided electronically.

EXPERIMENTAL EXAMPLES

The invention is further described in detail by reference to the following experimental examples. These examples are provided for purposes of illustration only, and are not intended to be limiting unless otherwise specified. Thus, the invention should in no way be construed as being limited to the following examples, but rather, should be construed to encompass any and all variations which become evident as a result of the teaching provided herein.

Without further description, it is believed that one of ordinary skill in the art can, using the preceding description and the following illustrative examples, make and utilize the present invention and practice the claimed methods. The following working examples therefore are not to be construed as limiting in any way the remainder of the disclosure.

Example 1: Repurposing CRISPRi for positive selection and facile sequence retrieval from complex clone libraries

While dCas9 has generally been targeted to either endogenous genomic loci or a limited set of genetic circuits, it was reasoned that incorporating a degenerate target sequence (barcode) into a synthetic circuit would generate a pool of unique constructs that could be selectively targeted by dCas9 to trigger clone specific tasks. By placing a counter-selection marker under the control of such a circuit, silencing a specific barcode sequence would lead to target retrieval (i.e., survival) under selective conditions. The development of a method for rapid and high-throughput retrieval of specific sequences from large clone libraries would accelerate the discovery of novel genes and biosynthetic gene clusters. This is particularly true when exploring microbial communities using culture-independent (e.g., metagenomic) methods where metagenomic libraries can contain tens of millions of unique clones due to the immense diversity present in natural ecosystems. The targeted retrieval of clones from large metagenomic libraries has been limited to a laborious multi-step dilution and PCR screening method which remains a key bottleneck in discovery pipelines/platforms (Owen et al., 2015, Proceedings of the National Academy of Sciences of the United States of America, 112:4221-4226). While CRISPR-Cas has been used to streamline a number of methods for the precise cloning of target genomic sequences, these methods have not been adapted to access sequences from complex metagenomic libraries (Jiang et al., 2015, Nature communications, 6:8101; Lee et al., 2015, Nucleic acids research, 43:e55; Wang et al., 2018, Nucleic acids research, 46:e28).

Here, a barcoded CRISPR counter-selection interruption circuit (CCIC) was developed which was combined with two advanced sequencing methods, PacBio long-read sequencing (Eid et al., 2009, Science, 323:133-138) and edge mapping (Burian et al., 2018, Molecular microbiology, 107:402-415), for rapid indexing and retrieval of target sequences from complex metagenomic and genomic libraries.

The materials and methods used for the experiments are now described.

Strains and growth conditions

Unless otherwise specified, Escherichia coli strains were grown in LB-Miller at 37 °C with shaking at 200 rpm. Escherichia coli EPI300 was used for genetic manipulations and plasmid maintenance, unless otherwise noted. If needed, antibiotics were added to the media at the following concentrations: kanamycin 30 mg/L, chloramphenicol 15 mg/L, spectinomycin 50 mg/L, tetracycline 12.5 mg/L, and ampicillin 200 mg/L.

General cloning

Restriction enzymes were purchased from NEB and used as per manufacturer instructions. Q5 High-Fidelity DNA Polymerase (2 U/µL; NEB) was used to generate all amplicons for cloning reactions. BioReady Taq DNA Polymerase (5 U/µL; Bulldog-bio) was used when screening by PCR. Sanger sequencing was performed by Genewiz. Plasmids were extracted using the Monarch Plasmid miniprep kit (NEB) as per manufacturer instructions. The Nucleospin Gel and PCR Clean-up kit (Takara Bio) was used for PCR clean up and gel purifications as per manufacturer instructions.

Cloning pPAC-T and pPAC31-N

pPACsacB: The PAC vector pAD10sacBII (maintained in the P1-c1 expressing E. coli NS3607) was transferred to cre+ E. coli NS3145. This led to recombination between its two loxP sites, causing a ~13.5 kb excision that removed the adenovirus stuffer sequence, the pBR322 origin, and the pac site, resulting in pPACsacB.

pPAC31a: A miniature Streptomyces functionalization cassette containing oriT, an apramycin resistance gene, the ΦC31 attP site and the ΦC31 integrase was constructed in pUC19. The oriT together with the apramycin resistance gene were obtained from pTAR-lys using the primers oriAp F and oriAp R, and the product was digested with XbaI and KpnI. Primer oriAp R was designed to remove the BstXI site from the C-terminus of the apramycin resistance gene. The ΦC31 attP site and integrase were obtained from pTAR-lys using the primers C31int_F and C31int_R, and the product was digested with PstI and KpnI. The two digested PCR amplicons were ligated with XbaI and PstI digested pUC19 to generate pUC-C31, which contains the desired Streptomyces functionalization cassette. The cassette was extracted from pUC-C31 by PCR using primers NheOriApC31_F and AflOriApC31_R. The resulting amplicon was digested with NheI and AflII and ligated with similarly digested pPACsacB to give pPAC31a, which was maintained in E. coli NS3607.

pPAC31Bst: Two BstXI sites needed to capture adaptor ligated DNA fragments (Osoegawa, K. et al. 2007, Genomics 89, 291-299) were incorporated into the multiple cloning region of pUC19 by whole vector PCR amplification using primers pUCbst F and pUCbst R. The PCR product was digested with BamHI and self-ligated to produce pUC-Bst. pUC-Bst was linearized using BamHI and cloned into similarly digested pPAC31a to construct pPAC31a-19. pPAC31a-19 was then digested with XbaI to excise the pUC19 backbone, and self-ligated to construct pPAC31Bst.

pPAC31-N and pPAC31-T: A C-terminal SsrA degradation tag (AANDENYALAA) was added to SacB on both pPAC31a and pPAC31Bst. The degradation tag sequence was added to sacB by PCR using the primers Spe sacB F and Nhe sacBdeg R with pAD10sacBII as the template. The resulting PCR product was digested with SpeI and NheI and ligated into similarly digested pPAC31a or pPAC31Bst to yield pPAC31-N and pPAC31-T, respectively.

Cloning pdCas9-BR

For sequence specific dCas9 targeting, a 30 bp sequence present between the promoter and sacB in pPAC31-T, but not pPAC31-N, was cloned into the direct repeat array of pdCas9. For this reaction, primers blockBst F and blockBst R were self-annealed, and then ligated into BsaI digested pdCas9 to generate pdCas9-BR.

Cloning sgRNA constructs

Single guide RNA (sgRNA) sequences were added to pTargetF by whole vector PCR using a forward primer containing a desired 20 bp guide sequence paired with primer Target R. PCR products were SpeI digested and self-ligated to generate sgRNA expressing ‘guide plasmids’. The same guide sequence that was incorporated into pdCas9-BR was added using the primer Target20BR_F. The initial barcode guides used in Figure 5 were generated using the primers BC1, BC2, BC3, and BC4, respectively. A fixed sequence downstream of the promoter, used as a ‘universal’ silencing sgRNA, was cloned using the NP1 primer. All clones were verified by Sanger sequencing using the L4440 primer.

The pCCIC construct (described below) contained the same pMB1 origin of replication as the vectors carrying sgRNA (i.e., pTargetF derivatives). To ensure compatibility for silencing experiments, the replication origin for guide plasmids was replaced with the p15A origin. The pTarget-NP1 backbone (NP1 sgRNA sequence), without its origin of replication, was amplified by PCR using the primers pTargetRepMin F and pTargetRepMin R, and the product was digested with EcoRV. The p15A origin of replication from pACYC184 was extracted by PCR with the primers p15Aori_sphF and p15Aori_sacR, and the product was ligated with the digested pTarget-NP1 amplicon to produce p15ATarget-NP1. New guide sequences were cloned into p15ATarget-NP1 as described above for pTargetF. The guides targeting barcodes in Figure 5 were added using the primers BC C1, BC C2 and BC C3. The guide sequences used in Figure 1G for the initial recovery benchmarks were added using primers gRNA_350A and gRNA_350B. The p15A origin added a second L4440 sequence and therefore p15ATarget derivatives were confirmed by Sanger sequencing using the specOUT primer. For the large-scale recovery experiments, the BC C2 guide plasmid was used as the PCR template, and as ligation reactions were used directly for recovery, no Sanger sequencing was performed.

Cloning sacB with different regulators and replication origins

The degradation tagged sacB and its promoter were extracted from pPAC31-N by PCR using primers sacB F and sacB R. The PCR product was cloned into SmaI digested pUC19 with the insertion having sacB in the same orientation as the lacZα to generate pUC19-sacB; the vector was cloned and maintained in E. coli NS3607. The cos site from pWEB-TNC was extracted by PCR using primers XbaSnabCOS F and COS R, and the amplicon was digested with XbaI and MfeI. The chloramphenicol resistance gene was extracted from pWEB-TNC by PCR using primers SbfCMr_F and MfeCMr R, and the product was digested with MfeI and SbfI. pUC19-sacB was digested with XbaI and SbfI, and added to a ligation reaction with the digested cos site as well as the chloramphenicol resistance gene to assemble pUC19-sacBcosCM. This vector was cloned and maintained in E. coli NS3607. The sacB/cos/chloramphenicol resistance gene cassette from pUC19-sacBcosCM was extracted by PCR using primers SacBcosCM R and SacBcosCM F and the product was digested with PmeI. The origins of replication from pET19b (pMB1) and pACYC184 (p15A) were extracted by PCR using primers pETori F/pETori R and p15Aori_sphF/p15Aori_sacR, respectively. These PCR products were ligated with the PmeI digested sacB/cos/chloramphenicol resistance gene cassette to generate pET-sacBcosCM and p15A-sacBcosCM, respectively. In both cases the insert was cloned such that the forward primer region of the replication origin was adjacent to the chloramphenicol resistance gene. These constructs were cloned and maintained in E. coli NS3607. A fragment containing lacIq and Ptac was extracted from pTAC-Mat-Tag-2 by PCR using primers sacLacO F and sacLacO R, and incorporated into BamHI digested pUC19-sacBcosCM, pET-sacBcosCM, and p15A-sacBcosCM using NEBuilder HiFi DNA Assembly (NEB) to produce pUC19-lacO:sacBcosCM, pET-lacO:sacBcosCM, and p15A-lacO:sacBcosCM, respectively. This replaced the constitutive P1-c1 repressible promoter with a lacIq repressed, IPTG inducible Ptac promoter. A fragment containing tetR and Ptet was extracted from the pwtCas9-bacteria vector by PCR using primers sacTetO F and sacTetO R, and incorporated into BamHI digested pET-sacBcosCM and p15A-sacBcosCM using NEBuilder HiFi DNA Assembly to produce pET-tetO:sacBcosCM and p15A-tetO:sacBcosCM, respectively. This replaced the constitutive P1-c1 repressible promoter with a TetR repressed, anhydrotetracycline (aTc) inducible Ptet promoter.

Construction of pCCIC

The TetR regulated sacB, along with tetR, was extracted from pET-tetO:sacBcosCM using primers SacBsph F and TSBsmaI R, and the product was digested with SphI and SmaI. The pWEB-TNC backbone, minus the ampicillin resistance gene, was amplified by PCR using primers pWEBsph R and pWEBsnaB F, and the product was digested with SphI and SnaBI. The two fragments were then ligated together to generate pWEB-tS. pWEB-tS was digested with MfeI and AgeI, blunt ended (NEB quick blunting kit) and the 5886 bp fragment was then self-ligated to yield pWEB-tS.MA. Sequencing of this construct revealed that this cloning step inadvertently captured a 217 bp fragment of the E. coli genome, which was not predicted to alter the function of the vector and therefore was not removed. Finally, the sacB gene was amplified from pWEB-tS.MA using primers SacB Spel F and SacB nd sphI R, which removed the degradation tag. This amplicon was digested with SpeI and SphI, and cloned into similarly digested pWEB-tS.MA to generate pCCIC. The full, annotated sequence of pCCIC has been deposited in GenBank (accession ON804120).

Genomic integration of dCas9

To incorporate dCas9 into the E. coli EPI300 genome the pCas/pTarget system (Jiang et al., 2015, Applied and environmental microbiology, 81:2506-2514) was used. The thermosensitive pCas was transformed into E. coli EPI300 which was maintained at 30 °C. This pCas containing strain was then grown to an OD600nm of ~0.1 and the λ-Red genes were induced by the addition of arabinose to a final concentration of 0.15%. The culture was grown until an OD600nm of ~0.6, and then CaCl2 competent cells were prepared as described below. A homology directed repair cassette containing the rhaA (B3903) sequence was assembled into pUC19. The ‘left end’ homology fragment was amplified by PCR from the E. coli genome using primers rha LE F and rha LE R and the PCR product was digested with PstI and MfeI. The ‘right end’ homology fragment was amplified by PCR from the E. coli genome using primers rha RE F and rha RE R and the PCR product was digested with PstI and HindIII. The two fragments were then cloned into EcoRI/HindIII digested pUC19 in a three-way ligation to give pUC19-rhaKI. The dcas9 sequence was extracted from pdCas9 by PCR using primers dCas_Nsi_F and dCas_Nsi_R. The PCR product was digested with NsiI and cloned into PstI digested pUC19-rhaKI so that dcas9 was in the same orientation as rhaA, to give pUC19-KIdC. The protocol described in the “Cloning sgRNA constructs” section was used with the primer TargetRha3 F to construct pTargetRha3, which expressed a sgRNA targeting a sequence within rhaA. To generate the dCas9 integration, pUC19-KIdC and pTargetRha3 were co-transformed into the arabinose induced pCas containing E. coli EPI300, and transformants were selected on spectinomycin and kanamycin at 30 °C. Transformants were screened by colony PCR using primers ScrRhaKO F and ScrRhaKO R. A PCR positive colony (i.e., a colony yielding a ~1250 bp band indicating dCas9 integration) had the pTargetRha3 plasmid cured by growing the cells overnight in LB with 1 mM IPTG and kanamycin at 30 °C. The pCas plasmid was then cured by streaking the cells onto LB agar and growing overnight at 42 °C. The resulting colonies were replica picked onto LB with and without kanamycin or spectinomycin, and a colony sensitive to both antibiotics was re-screened for the ~1250 bp band. This established the E. coli NdC strain with a genomically integrated dCas9 expressed from its native promoter.

Constitutively expressed dcas9 under the strong J23119 promoter was incorporated into the E. coli EPI300 genome as above. The homology directed repair construct could not use the pUC19 backbone due to toxicity issues, so a lower copy number pSC101 origin construct was made. Using pUC19-KIdC as template, dCas9 was extracted using primers PJdCas9_F and PJdCas9_R, the left homology region and the ampicillin marker were extracted using primers 101KO F and rhaKO R, and the right homology region was extracted using primers rhaKO F and 101KO R. The pSC101 replication origin sequence was extracted from pSC101 using primers pSC101 F and pSC101 R. These four PCR products were assembled together using NEBuilder HiFi DNA Assembly to generate p101RhaKI-JdC. This construct was used to integrate J23119 driven dcas9 into the E. coli EPI300 genome as described above. Correct integration was confirmed by PCR with a ~950 bp band using primers SpydCas9_out and RhaKOscr2_F, and a ~1330 bp band using primers SpydCas9_Lout and rhaKIchk. This established the E. coli JdC strain with a genomically integrated, strongly expressed dCas9.

The experimental results are now described.

Targeting dCas9 to a sequence between a counter-selection gene and its promoter abrogated expression and allowed for survival under otherwise selective conditions (Figure 1A). A pair of P1-derived artificial chromosome (PAC) vectors were used that were identical except for their multiple cloning sites (MCS) located between sacB and its strong constitutive promoter (Figure 1B) (Pierce et al., 1992, Proceedings of the National Academy of Sciences of the United States of America, 89:2056-2060). This allowed the design of a guide RNA homologous to a sequence within the pPAC-T MCS (i.e., Target) that was not present in the MCS of pPAC-N (i.e., Negative). SacB causes cell death by producing toxic levan in the presence of sucrose (Gay et al., 1985, Journal of bacteriology, 164:918-921). Sequence specific survival on sucrose was only observed when a plasmid expressing dCas9, tracrRNA, and target specific crRNA was provided (Figure 1C). The highest level of survival was observed using Escherichia coli with a genomically integrated dcas9 transformed with a guide plasmid expressing a tracrRNA/crRNA chimera (sgRNA; Figure 1C). It was tested whether this system could be used to recover a single target PAC present in populations of 5,000, 50,000, and 100,000 non-target PACs. Efficient retrieval of the target PAC (positive hit rate >70%) was achieved in mixtures of up to 50,000 non-target sequences (Figure 1D). The target PAC was also retrieved from the 100,000 non-target mixture, albeit with ten-fold reduced efficiency (Figure 1D). These studies, using a model two vector mixture, confirmed the potential for a CCIC to allow targeted retrieval of sequences from complex mixtures.

Cosmid library construction using lambda phage packaging offers a simple method for large scale capture of high-molecular weight metagenomic DNA fragments (Brady, 2007, Nature protocols, 2:1297-1305). To test CCIC based sequence retrieval from large insert clone libraries, a CCIC containing cosmid vector was developed. Among the various sacB promoter and replication origin options tested, the combination of TetR repressed sacB carried on a pMB1 backbone showed the best overall CCIC performance (Figure 3). These components were therefore used to construct the cosmid cloning vector pCCIC. Lambda phage efficiently packages DNA fragments from 37 to 52 kb in size, and so libraries constructed with the 6 kb pCCIC are expected to contain 31 to 46 kb metagenomic inserts (average insert of 35.6±0.4 kbp) (Haley, 1988, New Nucleic Acid Techniques, 257-283). CCIC-based retrieval relies on each vector within a library containing a unique sequence between sacB and its promoter. This sequence acts both as a barcode for the captured DNA and a guide RNA target for dCas9-mediated repression (Figure 1E). Generating a large pool of unique barcode sequences located between sacB and its promoter was achieved by introducing a degenerate 24 bp dCas9-targetable sequence into pCCIC using two-fragment cloning (Figure 1F, Figure 4A). Sequencing of a (meta)genomic library constructed using a collection of barcoded pCCIC vectors will link specific captured sequences with the unique vector barcodes. Any clone within the library can then be easily retrieved by barcode-specific dCas9-mediated inhibition of sacB expression leading to clone specific survival on sucrose containing media (Figure 1E). It was found that addition of the barcode invariably led to sucrose escape mutants, with the highest fidelity cloning (two-fragment ligation) leading to ~0.15% escapes (Figure 4A). 
Sucrose escape mutants will result in false positive colonies appearing on selective media, thereby increasing the number of colonies that must be screened to identify a desired target during retrieval. As an example, at 0.15% one would expect 15 false positives when retrieving a target from a 10,000 member library. To generate high-fidelity pools of CCICs, a procedure was developed in which subpools of clones were grown, checked for CCIC integrity, and then pooled to generate ‘scrubbed’ pools with drastically reduced escape frequencies (Figure 4B; fidelity >99.999%). With the fidelity improvement, one would now expect 0.1 false positive colonies for every on-target colony found using CCIC retrieval from a 10,000 member library. dCas9-mediated silencing of sacB expression by targeting pCCIC barcodes was then optimized by increasing dCas9 expression levels and screening the sucrose concentration used for selection (Figure 5).
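The insert-size and false-positive figures quoted above follow from simple arithmetic, which can be sketched as follows. This is only an illustrative calculation: the function names are hypothetical, and the false-positive estimate assumes that every clone in the pool escapes sucrose selection independently with the same probability.

```python
def insert_size_range(packaged_min_kb, packaged_max_kb, vector_kb):
    """Insert size range left once the vector is subtracted from the packaged DNA."""
    return packaged_min_kb - vector_kb, packaged_max_kb - vector_kb


def expected_false_positives(library_size, escape_fraction):
    """Expected escape colonies per retrieval (assumption: each clone in the
    pool escapes independently with the same probability)."""
    return library_size * escape_fraction


# Figures quoted in the text: 37-52 kb packaged by lambda phage, 6 kb pCCIC vector.
insert_range = insert_size_range(37, 52, 6)

# ~0.15% escapes before scrubbing; >99.999% fidelity (at most 0.001% escapes) after.
before_scrub = expected_false_positives(10_000, 0.15 / 100)
after_scrub = expected_false_positives(10_000, (100 - 99.999) / 100)

print(insert_range, round(before_scrub, 3), round(after_scrub, 3))
```

The three values correspond to the 31 to 46 kb insert range and to the expected false positives per retrieval before and after scrubbing of a 10,000 member library.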

To investigate CCIC utility for isolating target sequences from complex mixtures, a cosmid-based library was generated from soil metagenomic DNA and recovery of two randomly selected clones was attempted. Barcodes from these clones were sequenced and used to generate guide plasmids expressing homologous sgRNAs. A range of sucrose concentrations was tested for optimal dCas9-mediated survival of the individual barcoded clones using their specific guide plasmids (Figure 1G: S). Fidelity of the CCIC in individual clones, as well as a pool of 5,000 metagenomic clones, was confirmed by silencing with a universal guide plasmid that targets a constant sequence in the vector (Figure 1G: U), and by lack of survival on sucrose with non-specific guides (Figure 1G: N). Using the final CCIC selection protocol, the randomly chosen barcoded clones were easily retrieved from pools of up to 20,000 metagenomic clones using their corresponding guide plasmids (Figure 1H). In these initial studies, purified guide plasmids were used for each retrieval experiment. In an effort to further simplify the CCIC method, it was tested whether the ligation reactions that generated the guide plasmids could be used directly for retrieval. As seen when using purified guide plasmid clones, direct transformation of the ligation reactions resulted in successful target retrieval from pools containing as many as 20,000 distinct clones (Figure 1H). These studies indicated that barcoded CCIC clone libraries could provide a simple and rapid way to recover targeted DNA sequences from a complex mixture.

The CCIC method was used to recover potentially high value sequences of interest (i.e., natural product biosynthetic gene clusters (BGCs) and CRISPR-Cas systems) from the 10,000 clone metagenomic library used in Figure 1H. To link barcodes with specific cosmid inserts, the library pool was sequenced using PacBio HiFi long-read technology (Eid et al., 2009, Science, 323:133-138). Assembled contigs containing barcodes were analysed for BGC content and phage-defense systems using the publicly available prediction tools antiSMASH and DefenseFinder, respectively (Blin et al., 2021, Nucleic acids research, 49:W29-W35; Tesson et al., 2022, Nature communications, 13:2561). Based on these analyses, 66 cosmids predicted to contain 12 different classes of core biosynthetic genes and 4 cosmids containing CRISPR-Cas genes were selected for retrieval. Barcodes from the 70 target cosmids were used to design sgRNAs, and direct transformation of the guide plasmid ligation (as in Figure 1H) allowed recovery of over 95% of our desired targets (Figure 2A). Guide RNAs are known to have varied activities, with low efficacy guides possibly contributing to the failure of some retrieval attempts (Calvo-Villamanan et al., 2020, Nucleic acids research, 48:e64). A “postmortem analysis” of the CCIC method indicated that the positive hit rate for a retrieval likely depended on a combination of guide strength, cosmid copy number, and varied clone abundance within the library (Figure 6, Example 2). Multiple mechanisms of sucrose escape were observed upon sequencing a collection of false positive clones (Example 2). All desired targets could likely be retrieved from a library with increased coverage (i.e., as few as 2 available barcodes for each target). From receiving primers to recovery of the clone of interest, CCIC retrieval requires just two days, which is a dramatic increase in efficiency compared to all other methods explored (Example 3).

As an alternative to linking inserts to barcodes using long-read sequencing, the proximity of the barcode to one edge of the DNA captured in pCCIC led to the development of a PCR-based ‘edge mapping’ method that can be used to index a complex clone library with minimal effort and sequencing resources. While several methods exist to precisely extract sequences from sequenced genomes, the large-scale parallel cloning of target genomic regions remains cumbersome (Wang et al., 2021, Frontiers in bioengineering and biotechnology, 9:692797). As genomic libraries are created in a sequence independent manner, they can easily capture all of the encoded diversity, with only the bottleneck of a laborious screening step limiting the rate of target sequence retrieval (Wang et al., 2021, Frontiers in bioengineering and biotechnology, 9:692797). For edge mapping, Tn5 transposase tagmentation was used to insert known sequence tags upon DNA fragmentation, which allowed amplification of fragments containing the vector barcode and the edge of each cloned sequence by PCR (Picelli et al., 2014, Genome research, 24:2033-2040). Paired-end MiSeq reads then linked the barcode to the edge sequence, generating a comprehensive index of captured regions under the assumption that lambda phage packaging captured the expected 31-46 kb of sequence (Figure 2B). Paired with CCIC, this allowed for high-throughput retrieval of specific genomic loci.
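The indexing logic described above can be illustrated with a minimal sketch. It assumes that each read pair has already been reduced to a barcode plus the mapped genome position and orientation of the insert edge. The data layout, function names, majority-vote step, and toy reads are all hypothetical and are not taken from the disclosed protocol; only the 31-46 kb insert bounds come from the text.

```python
from collections import Counter, defaultdict

MIN_INSERT_BP = 31_000  # lower bound of the expected captured size (from the text)
MAX_INSERT_BP = 46_000  # upper bound of the expected captured size (from the text)


def build_edge_index(read_pairs):
    """Map each barcode to its most frequently observed (position, strand) edge.

    read_pairs: iterable of (barcode, position, strand) tuples, one per read
    pair. position is the genome coordinate of the insert edge; strand is +1
    or -1, the direction in which the insert extends from that edge.
    """
    votes = defaultdict(Counter)
    for barcode, position, strand in read_pairs:
        votes[barcode][(position, strand)] += 1
    return {bc: counts.most_common(1)[0][0] for bc, counts in votes.items()}


def predicted_span(position, strand, insert_bp):
    """Genome interval (start, end) covered by an insert of the given length."""
    if strand > 0:
        return position, position + insert_bp
    return position - insert_bp, position


def clones_covering(index, target_start, target_end):
    """Barcodes whose insert covers the target even at the minimum insert size."""
    hits = []
    for barcode, (position, strand) in index.items():
        start, end = predicted_span(position, strand, MIN_INSERT_BP)
        if start <= target_start and target_end <= end:
            hits.append(barcode)
    return sorted(hits)


# Hypothetical toy data: three barcodes, one with a conflicting minority read.
reads = [
    ("BC001", 100_000, +1), ("BC001", 100_000, +1), ("BC001", 250_000, -1),
    ("BC002", 140_000, -1),
    ("BC003", 500_000, +1),
]
index = build_edge_index(reads)
candidates = clones_covering(index, 105_000, 125_000)
print(index, candidates)
```

Using the minimum insert size when testing coverage is a deliberately conservative choice in this sketch; a clone reported here would be expected to contain the target across the whole 31-46 kb range of plausible insert lengths.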

To demonstrate the utility of edge mapping for BGC retrieval, a high-density cosmid library (~11,000 clones = ~195x genome coverage) was generated from the gDNA of Streptomyces albidoflavus J1074, a representative BGC-rich Actinomycete. Edge sequences from this library were linked to a total of 10,145 unique barcodes, mapped to the S. albidoflavus genome, and clones predicted to contain 9 uncharacterized BGCs that could each be carried on a single cosmid were identified and retrieved from the library by CCIC (Figure 2B). With the exception of one BGC, all target BGCs were retrieved on the first attempt. Due to the saturation level of the library (Figure 7), a second barcode/guide was identified and used to successfully recover the final BGC. As the edge mapping produced base pair resolution of captured edges, it was possible to identify precise sets of overlapping cosmids for BGCs too large to capture on a single cosmid. As an example, the mapping data was used to recover 2 overlapping cosmids that contained a polyketide synthase BGC (Figure 2B). Overall, these data demonstrated the utility of CCIC to easily scale the capture and retrieval of target sequences from sequenced genomes (Example 4). In the case of metagenomic libraries, edge mapping should be particularly useful for indexing large libraries to guide the identification of overlapping clones (Figure 8A), which has been a key bottleneck in the cloning of large complete metagenomic BGCs that require multiple cosmids to fully assemble. Alignment of edge mapping data from our 10,000 member metagenomic library to the PacBio assemblies predicted several instances of potentially overlapping cosmids. One of the predictions was confirmed by retrieval of cosmids associated with PacBio contig 1912, including the edge barcode (barcode-OTU 477) and the internally mapped barcode-OTU 315 (Figure 8B, Example 3). 
It was found that edge mapping provides a cost-effective means of correcting low quality barcode sequences found in long-read sequencing datasets (Figure 9, Example 3).

As next-generation sequencing methods have increasingly provided unprecedented bioinformatic access to genetic diversity, methods for physically accessing sequences of interest have remained rudimentary (Athanasopoulou et al., 2021, Life, 12). By harnessing the vast target potential of dCas9, CCIC cloning opens the door to rapid, scalable, and cost-effective indexing and retrieval of target sequences. While CCIC was applied to accelerate (meta)genomic mining, there is excitement to see how the general concept of incorporating degenerate Cas-targetable barcodes into genetic circuits is expanded into other areas of synthetic biology in the future.

Example 2: CCIC method expanded discussion

Prediction of guide strength for dCas9 has been studied in depth (Calvo-Villamanan et al., 2020, Nucleic acids research, 48:e64; Cui et al., 2018, Nature communications, 9:1912). The overall conclusion is that dCas9 targeting is largely sequence independent (Calvo-Villamanan et al., 2020, Nucleic acids research, 48:e64), but one major concern in the design of a guide is the ‘bad seed’ where some combination of sequences of the last 5 residues lead to general toxicity in E. coli (Cui et al., 2018, Nature communications, 9:1912). Analysis of the four guides that failed during retrieval experiments showed that their ‘bad seed’ combinations were relatively low on the rank table generated by Cui et al., 2018, Nature communications, 9:1912: #84 (contig_2940), #121 (streptomyces OTU 2583), #240 (contig_1414), #845 (contig_4987). In fact, the guide for contig_4998 retrieval had the #4 ‘bad seed’ combination, but retrieval was successful. As the ‘bad seed’ effect can be minimized by titrating the dCas9 expression level (Cui et al., 2018, Nature communications, 9:1912), it was likely that this system expressed dCas9 below the toxicity threshold, thus negating ‘bad seed’ effects. As a “post-mortem analysis” of the CCIC method, whether the strength of the guide RNA contributed to the positive hit rate during retrieval was investigated, as well as the mechanisms leading to false positive hits. For this analysis, 6 targets were chosen and their positive hit rates were determined by looking at 32 colonies for each retrieval experiment using 0.07% sucrose counter-selection plates. Positive hit rates ranged from 15 to 88% (Figure 6A). For false positive analysis, three off-target clones from each retrieval were sequenced and analysed in an effort to determine the mechanism of sucrose escape.

Role of guide RNAs and other factors in clone retrieval: To determine if the positive hit rate (i.e., the percent of positive clones observed on the sucrose selection plates) correlated with the ‘strength’ of the guide RNA, a monoculture of 6 different targets was transformed with three different guide plasmids: 1) the barcode specific guide, 2) the ‘universal’ NP1 guide, and 3) a non-specific guide. The percent survival of the transformants on selective media with increasing concentrations of sucrose was then monitored (Figure 6B). As expected, the non-targeting guide RNAs provided no protection under selective conditions, and at 0.07% sucrose (the concentration used in the retrieval method) all barcode specific guides elicited ~100% survival (Figure 6B). As the sucrose concentration increased, the guide for c5461 (i.e., PacBio contig #5461) retrieval generated the strongest survival (up to 0.15% sucrose), while the guide for c5218 retrieval was the weakest, generating survival at only 0.07% sucrose. All other guides were roughly similar in generating sucrose survival with a final ranking of c5461>c2220>c4003>c3807>c4246>c5218 (Figure 6B). As the strongest and weakest guides both showed low positive hit rates (25% and 15%, respectively), it was concluded that individual guide strength was not the key factor in the positive hit rate. Interestingly, silencing of the c5218 CCIC using the ‘universal’ guide also showed a weak response (i.e., ~50% survival at 0.15% sucrose and no survival at 0.3% sucrose, Figure 6B). The weak silencing correlated with the DNA concentration of the c5218 plasmid mini-prep which had almost a 4 times higher concentration than the other target cosmids (Figure 6A). Increased copy number would yield higher sacB expression, thereby requiring a higher effective strength of silencing than is needed for a lower copy target. 
However, guide strength for recovery failures cannot be fully discounted as in this dataset only 4 of 6 guides would be expected to successfully recover their target using 0.10% sucrose selection, implying there may be some guides too weak to elicit survival even at 0.07%. Importantly, at 0.07% sucrose all specific guides tested here afforded ~100% survival, and, in combination with the already achieved 95% success rate across all retrieval experiments, variance in silencing strength has been effectively minimized.

It is known that clones will be present at different concentrations in a library depending on how fast they grow. Underrepresented clones in a complex mixture would therefore have effective recovery ratios that are smaller than the 1 in 10,000 assumed for the metagenomic library. As observed in Figure 1H, a target ratio of 1 in 20,000 showed a 2-4 fold decrease in positive hit rate relative to a 1 in 10,000 mixture. Conversely, faster growing clones would be more abundant and likely easier to recover. Overall, the positive hit rate of CCIC retrieval was likely a combination of multiple variables including guide strength, cosmid copy number, and the population of the target clone in the library. Using the specific experimental conditions outlined here, cosmid copy number and clone titer in the library are likely key factors, as the data suggest that the contribution of guide strength variance to CCIC retrieval has little effect when using 0.07% sucrose for counter-selection.
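To give a sense of what the reported positive hit rates mean for screening effort, the following sketch estimates the chance of finding at least one correct clone among a given number of screened colonies. It is an illustrative calculation only: it assumes each colony is an independent draw with the same hit rate, the function names are hypothetical, and the choice of 8 colonies mirrors the "8 or fewer" screening PCRs mentioned in Example 3.

```python
import math


def prob_at_least_one_hit(hit_rate, colonies_screened):
    """Chance that at least one screened colony is the target, treating each
    colony as an independent draw with the same positive hit rate."""
    return 1.0 - (1.0 - hit_rate) ** colonies_screened


def colonies_needed(hit_rate, confidence):
    """Smallest number of colonies giving at least the requested confidence."""
    if not 0.0 < hit_rate < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("hit_rate and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hit_rate))


for rate in (0.15, 0.88):  # lowest and highest hit rates reported in Figure 6A
    print(rate, round(prob_at_least_one_hit(rate, 8), 3), colonies_needed(rate, 0.95))
```

Under these assumptions, screening a handful of colonies is almost always sufficient at the high end of the observed hit rates, while targets at the low end may need a larger screen or a second barcode.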

False positives: To determine mutations leading to false positive clones, three off-target colonies (clones A, B and C) per retrieval were sequenced and the sacB-tetR region analysed. The initial hypothesis was that sucrose escapes were most likely due to mutations inactivating sacB. However, analysis of the 18 sucrose escape colonies indicated only three had altered sacB sequences, with one containing a large deletion and two containing sequence insertions. Decreased sacB expression likely explains the survival of c2220 clone A, which contained a point mutation in the promoter driving sacB expression. The mutation was mapped to the -35 hexamer, changing the sequence from a consensus TTGACA to TTGCCA, which is expected to result in weaker promoter activity (Kobayashi et al., 1990, Nucleic acids research, 18:7367-7372). Similarly, two clones (c5218 clone A, c4003 clone C) contained mutations in TetR that would yield Pro105Ser and Ala56Val mutants, respectively. Proline 105 contributes to hydrophobic interactions stabilizing tetracycline binding (Hinrichs et al., 1994, Science, 264:418-420). The serine mutation likely either prevents or destabilizes this interaction resulting in continued TetR repression (i.e., a lack of induction of sacB expression). Alanine 56 is not known to contribute to ligand binding, but the increased bulk of valine might destabilize the binding pocket as His64, which hydrogen bonds to tetracycline, is near Ala56 in the structure (Hinrichs et al., 1994, Science, 264:418-420). Curiously, 9 of the 18 sucrose escape colonies contained the same cosmid (identified as PacBio contig_3949) even though they arose from different retrieval experiments. In all cases the cosmid sequence did not contain any mutations within the tetR-sacB region. Similarly, cosmids c4003 clone A, c2220 clone B and c5218 clone C did not contain mutations in the sacB-tetR sequence. 
Analysis of the captured metagenomic inserts against the Comprehensive Antibiotic Resistance Database (CARD) revealed that all but c4003 clone A contained predicted tetracycline efflux pumps. Similarly to the promoter and TetR mutations discussed above, efflux of anhydrotetracycline would likely lead to lowered, or lack of, sacB induction, accounting for sucrose survival. Annotation of the metagenomic insert contained by c4003 clone A indicated the presence of an AI-2E family transporter which may play a similar anhydrotetracycline efflux role; however, a contribution of these transporters to antibiotic resistance has not been previously established. In all cases, the sequencing revealed that the guide plasmid was present with the correct sgRNA, suggesting that no mutation in the guide contributed to the sucrose escape phenotype. Together, these data indicate that false positives can be generated through multiple distinct mechanisms.
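The promoter mutation and the breakdown of escape mechanisms described above can be checked with a short sketch. The helper name and the group labels are shorthand chosen for this illustration, and the tally assumes that the groups reported in the text do not overlap.

```python
def point_differences(reference, variant):
    """List (1-based position, reference base, variant base) for each mismatch
    between two equal-length sequences."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have the same length")
    return [
        (i, ref_base, var_base)
        for i, (ref_base, var_base) in enumerate(zip(reference, variant), start=1)
        if ref_base != var_base
    ]


# -35 hexamer before and after the mutation seen in c2220 clone A.
promoter_change = point_differences("TTGACA", "TTGCCA")

# Tally of the 18 sequenced escape colonies, grouped as described in the text.
# Labels are illustrative shorthand; groups are assumed to be non-overlapping.
escape_tally = {
    "altered sacB sequence": 3,
    "sacB promoter -35 mutation": 1,
    "TetR point mutation": 2,
    "contig_3949 cosmid, no tetR-sacB mutation": 9,
    "other cosmids, no sacB-tetR mutation": 3,
}
total_escapes = sum(escape_tally.values())
print(promoter_change, total_escapes)
```

The comparison locates the single A to C substitution at the fourth position of the hexamer, and the group counts add up to the 18 colonies that were sequenced.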

Combined, the guide strength and false positive analysis indicated that it is unlikely further engineering of the CCIC circuit could prevent all of the variables presented. This is simply the nature of the challenges presented when working with complex metagenomic libraries propagated in a living host. However, CCIC showed a remarkable 95% success rate for retrieval, demonstrating that most of the challenges have already been overcome.

Example 3: CCIC accelerates metagenomic discovery.

Next-generation sequencing has allowed for an unprecedented view into the genetic biodiversity of environmental microbes, the majority of which remain recalcitrant to laboratory culture. A key obstacle to leveraging this vast reservoir of diversity is gaining physical access to sequences of interest. One way that has been explored to overcome this bottleneck has been the synthesis of metagenomic genes or BGCs of interest. Recent examples include the synthesis of two RiPP BGCs (Paoli et al., 2022, Nature, 607:111-118) as well as two CRISPR-Cas systems (Burstein et al., 2017, Nature, 542:237-241). Unfortunately, the application of DNA synthesis approaches to the large-scale study of metagenomic BGCs remains prohibitive due to high costs. In place of DNA synthesis, the field has largely relied on lambda phage packaging to construct (large) cosmid based libraries in E. coli from which clones containing target sequences of interest can be recovered.

Lambda phage based cosmid libraries allow for the relatively facile and large-scale capture of sequences from metagenomes. Lambda phage efficiently packages 37 to 52 kb of DNA (Haley et al., in New Nucleic Acid Techniques, (ed. J.M. Walker) 257-283 (Humana Press, Totowa, NJ; 1988)) and therefore, with standard vectors, 30-45 kb of metagenomic DNA is generally captured in each cosmid clone. Compared to “shotgun” sequencing methods, cosmid libraries offer two key advantages: 1) they store a physical copy of the DNA for retrieval, and 2) they provide a finite scale of genetic information for sequencing efforts. Libraries are maintained as collections of sub-pools (generally 5,000 to 25,000 clones/pool) due to the scale of clones required to effectively capture metagenomic diversity (i.e., >10,000,000 clones for soil metagenomes). With libraries of this size, retrieval of specific sequences of interest is the key bottleneck in their efficient use as a source of novel genetic diversity. The most commonly used method for recovering a target sequence from a complex pool of clones involves successive rounds of dilution and PCR screening of library sub-pools until a single clone is isolated (Owen et al., 2015, Proceedings of the National Academy of Sciences of the United States of America, 112:4221-4226). Due to fluctuations of clonal populations, each dilution step is often arrayed at a 10-fold excess to ensure screening success. For a 10,000 clone pool, this would entail two rounds of dilutions, followed by screening individual clones. This generally takes 2 weeks and is limited to an individual working with ~10 targets in parallel. The use of a liquid handler, robot picker, and plate pool strategies increases the throughput and limits the PCR reactions needed for identification; however, this strategy only increases the scale to 24 targets recovered in parallel, does not reduce the length of the experiment, and still requires 195 total PCR screening reactions per recovery. 
In contrast, the CCIC recovery method can be done over two days with 9 or fewer PCR reactions per recovery (1 PCR per guide cloning, and 8 or fewer PCRs for screening). Using CCIC, 70 recoveries were targeted in parallel by hand over 3 days. Simply put, the CCIC recovery method is radically superior to the successive dilution method and, without the need for specialized equipment, is accessible for any laboratory.
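The difference in screening workload can be made concrete with a short calculation based on the per-recovery PCR counts quoted above. This is only an arithmetic illustration; the function and constant names are hypothetical, and the CCIC figure uses the stated upper bound of 9 reactions per recovery.

```python
def total_pcr_reactions(targets, pcr_per_recovery):
    """Total PCR reactions needed for a batch of recoveries."""
    return targets * pcr_per_recovery


TARGETS = 70        # recoveries attempted in parallel in the text
DILUTION_PCR = 195  # PCR screens per recovery with the plate-pool dilution strategy
CCIC_PCR_MAX = 9    # upper bound per recovery for CCIC (1 cloning + 8 screening)

dilution_total = total_pcr_reactions(TARGETS, DILUTION_PCR)
ccic_total = total_pcr_reactions(TARGETS, CCIC_PCR_MAX)
fold_reduction = dilution_total / ccic_total

print(dilution_total, ccic_total, round(fold_reduction, 1))
```

Because the CCIC count is an upper bound, the fold reduction computed this way is a conservative estimate for a batch of this size.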

Edge mapping and its application to metagenomic libraries. In the case of metagenomic libraries, edge mapping will likely prove most useful for indexing large libraries to guide the identification of clones that overlap sequences found in smaller scale long-read sequencing datasets (Figure 8A). BGCs larger than ~30 kb cannot be captured on single cosmid clones and so they must instead be captured on sets of overlapping clones. The recovery of overlapping clones that complete a partial BGC found on a metagenomic cosmid is one of the most time-consuming steps in studying large metagenomic BGCs. Edge mapping provides the opportunity to cost-effectively index large batches of clones, thereby rapidly identifying cosmids that overlap with target BGC sequences. Saturating soil metagenomic libraries (i.e., libraries large enough to contain multiple copies of each DNA region) require in excess of ten million cosmids; however, even in a small scale 10,000 clone library pool, several examples of PacBio assembled contigs with an internal ‘edge map’ were easily identified, indicating the presence of a second, overlapping cosmid in the pool. PacBio contig 1912 (associated with barcode-OTU 477) and its internally mapped barcode-OTU 315 were used to confirm the reliability of edge mapping to identify overlapping cosmids (Figure 8B).
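The idea of an internal ‘edge map’ can be sketched as a simple filter over edge reads aligned to an assembled contig. This is a hypothetical illustration rather than the disclosed analysis: the function name, the input layout, the 1,000 bp end margin, and the toy values are all assumptions made for the example.

```python
def internal_edge_barcodes(contig_length, own_barcode, edge_hits, end_margin=1_000):
    """Return barcodes whose mapped edge lies inside the contig.

    edge_hits: dict of barcode -> position of its edge read on the contig.
    A hit from a barcode other than the contig's own, located more than
    end_margin bp from both contig ends, suggests a second cosmid in the
    pool that overlaps this contig.
    """
    internal = []
    for barcode, position in edge_hits.items():
        if barcode == own_barcode:
            continue  # the contig's own barcode maps to its end by construction
        if end_margin < position < contig_length - end_margin:
            internal.append(barcode)
    return sorted(internal)


# Hypothetical toy example: a 38 kb contig with its own barcode at one end
# and a second barcode whose edge maps to the middle of the contig.
hits = {"bc_A": 12, "bc_B": 21_500}
overlapping = internal_edge_barcodes(38_000, "bc_A", hits)
print(overlapping)
```

Any barcode returned by such a filter would be a candidate for CCIC retrieval of the overlapping clone, subject to confirmation as described for contig 1912.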

Edge mapping of metagenomic libraries should also prove useful for correcting instances of low-quality barcode sequences that can arise from low long-read sequencing coverage. For example, the edge mapping data from the metagenomic library corrected a previously disregarded barcode sequence that was predicted to be associated with a complete lassopeptide BGC. The corrected barcode allowed for the retrieval of the cosmid containing this BGC (Figure 9). This suggests that, for future surveys of metagenomic diversity, long-read sequencing coverage may be titrated to the minimal amount required to identify clones of interest, with cheaper edge mapping then used to determine the barcodes required for isolating these clones.

Example 4: Use of CRISPR-Cas for Genomic Recoveries.

In a growing number of cases CRISPR-Cas has been used to either improve or develop methods for targeted cloning of BGCs. Recent examples include augmenting Gibson-based capture methods (Jiang et al., 2015, Nature communications, 6:8101), transformation-associated-recombination (TAR) using yeast (Lee et al., 2015, Nucleic acids research, 43:e55), and the establishment of ExoCET (Exonuclease Combined with RecET recombination) (Wang et al., 2018, Nucleic acids research, 46:e28). In fact, TAR is routinely used to assemble BGCs captured across multiple overlapping metagenomic cosmids (Kallifidas et al., 2012, Methods in enzymology, 517:225-239) and in vitro Cas9-cutting with TAR is used for BGC promoter refactoring (Kim et al., 2019, ACS synthetic biology, 8:109-118). Targeted cloning approaches offer the benefit of precise cluster excision and the cloning of sizes in excess of what can be captured in an individual cosmid. However, considerable effort is required to generate the components for capture, and unique reagents must be created for each target BGC (i.e., precision comes at the cost of scale). Generation of genomic clone libraries has historically been a rewarding, sequence-independent, and relatively simple approach to isolate BGCs with only “tedious screening” representing a significant drawback (Wang et al., 2021, Frontiers in bioengineering and biotechnology, 9:692797). The edge-mapping approach offers rapid library indexing, thereby simplifying the tedious screening process, and when combined with CCIC represents a means for high-throughput recovery of effectively any cloned sequence. In essence, CRISPR-Cas is used to improve library-based retrieval of targeted sequences.

While the focus in these experiments is on establishing CCIC in cosmids due to an interest in exploring metagenomic diversity, the CCIC method could be adapted for ultra-large insert bacterial artificial chromosome (BAC) or P1-derived artificial chromosome (PAC) based libraries. Importantly, the scale, simplicity, and cost-effectiveness of edge mapping should allow for the parallel identification of BGCs across multiple libraries as well as very large multiplexed libraries, offering researchers the potential to rapidly explore hundreds of BGCs simultaneously. Outside of the bacterial natural product space, edge-mapping with CCIC cloning should simplify the use of large-scale libraries for indexing genomic sequences from higher level organisms. One particularly compelling area to explore is the rapid indexing of large clone libraries that are often used to complete genomic sequencing projects (Schmid et al., 2018, Nucleic acids research, 46:8953-8965). Compared to targeted cloning approaches, CCIC cloning offers an unmatched scale of target sequence retrieval.

The disclosures of each and every patent, patent application, and publication cited herein are hereby incorporated herein by reference in their entirety. While this invention has been disclosed with reference to specific embodiments, it is apparent that other embodiments and variations of this invention may be devised by others skilled in the art without departing from the true spirit and scope of the invention. The appended claims are intended to be construed to include all such embodiments and equivalent variations.




 