


Title:
RNA APTAMERS AND THEIR USE
Document Type and Number:
WIPO Patent Application WO/2022/162211
Kind Code:
A1
Abstract:
The invention relates to a nucleic acid construct comprising the nucleic acid sequence as set forth in SEQ ID NO: 1, provided that said sequence is not SEQ ID NO:2.

Inventors:
RYCKELYNCK MICHAËL (FR)
KLYMCHENKO ANDREY (FR)
BOUHEDDA FARAH (FR)
CUBI ROGER (FR)
COLLOT MAYEUL (FR)
Application Number:
PCT/EP2022/052175
Publication Date:
August 04, 2022
Filing Date:
January 31, 2022
Assignee:
UNIV STRASBOURG (FR)
CENTRE NAT RECH SCIENT (FR)
International Classes:
C12N15/115; C09B11/24; G01N33/58
Domestic Patent References:
WO2020254654A1 2020-12-24
Other References:
BOUHEDDA FARAH ET AL: "A dimerization-based fluorogenic dye-aptamer module for RNA imaging in live cells", NATURE CHEMICAL BIOLOGY, NATURE PUB. GROUP, NEW YORK, vol. 16, no. 1, 21 October 2019 (2019-10-21), pages 69 - 76, XP036965704, ISSN: 1552-4450, [retrieved on 20191021], DOI: 10.1038/S41589-019-0381-8
FARAH BOUHEDDA ET AL: "Light-Up RNA Aptamers and Their Cognate Fluorogens: From Their Development to Their Applications", INT. J. MOL. SCI., vol. 19, no. 1, 23 December 2017 (2017-12-23), pages 44, XP055486711, ISSN: 1661-6596, DOI: 10.3390/ijms19010044
Attorney, Agent or Firm:
MONNI, Richard (FR)
Claims:

1. A nucleic acid construct comprising:

- a first sequence as set forth in SEQ ID NO: 87

- a second sequence as set forth in SEQ ID NO: 88

|GGAACC|N11CGC ; wherein the sequence in bold of the first sequence interacts with the sequence in bold of the second sequence, and wherein the boxed sequence of the first sequence interacts with the boxed sequence of the second sequence, wherein N1, N2, N3, N4 and N5 are such that

- when N1 is C, G or U, then N2, N3, N4 and N5 are any ribonucleotide; and when N1 is A, then N2 is not G, or N3 is not G, or N4 is not C, or N5 is not G;

- when N2 is C, A or U, then N1, N3, N4 and N5 are any ribonucleotide; and when N2 is G, then N1 is not A, or N3 is not G, or N4 is not C, or N5 is not G;

- when N3 is C, A or U, then N1, N2, N4 and N5 are any ribonucleotide; and when N3 is G, then N1 is not A, or N2 is not G, or N4 is not C, or N5 is not G;

- when N4 is A, G or U, then N1, N2, N3 and N5 are any ribonucleotide; and when N4 is C, then N1 is not A, or N2 is not G, or N3 is not G, or N5 is not G;

- when N5 is A, C or U, then N1, N2, N3 and N4 are any ribonucleotide; and when N5 is G, then N1 is not A, or N2 is not G, or N3 is not G, or N4 is not C;

and wherein N10, N9, N8, N7 and N6 are such that N1 is complementary to N6, N2 is complementary to N7, N3 is complementary to N8, N4 is complementary to N9 and N5 is complementary to N10, the complementarity being according to the Wobble pairing model of ribonucleotides, and wherein N0 is any nucleotide, and N11 is complementary to N0 according to either the Watson and Crick pairing model or the Wobble pairing model of ribonucleotides.

2. The nucleic acid construct according to claim 1, wherein the first and the second sequence belong to the same nucleic acid molecule.

3. The nucleic acid construct according to claim 1 or claim 2, comprising the nucleic acid sequence as set forth in SEQ ID NO: 1

GGAACCUCGCUUCGGCGAUGAUGGAGN1N2N3N4N5CAAGGUUAACN10N9N8N7N6CAGGUUCC wherein N1, N2, N3, N4 and N5 are such that

- when N1 is C, G or U, then N2, N3, N4 and N5 are any ribonucleotide; and when N1 is A, then N2 is not G, or N3 is not G, or N4 is not C, or N5 is not G;

- when N2 is C, A or U, then N1, N3, N4 and N5 are any ribonucleotide; and when N2 is G, then N1 is not A, or N3 is not G, or N4 is not C, or N5 is not G;

- when N3 is C, A or U, then N1, N2, N4 and N5 are any ribonucleotide; and when N3 is G, then N1 is not A, or N2 is not G, or N4 is not C, or N5 is not G;

- when N4 is A, G or U, then N1, N2, N3 and N5 are any ribonucleotide; and when N4 is C, then N1 is not A, or N2 is not G, or N3 is not G, or N5 is not G;

- when N5 is A, C or U, then N1, N2, N3 and N4 are any ribonucleotide; and when N5 is G, then N1 is not A, or N2 is not G, or N3 is not G, or N4 is not C;

and wherein N10, N9, N8, N7 and N6 are such that N1 is complementary to N6, N2 is complementary to N7, N3 is complementary to N8, N4 is complementary to N9 and N5 is complementary to N10, the complementarity being according to the Wobble pairing model of ribonucleotides.

4. The nucleic acid construct according to any one of claims 1 to 3, said nucleic acid construct conferring to a fluorophore molecule a brightness at least 2-fold higher compared to the brightness of a fluorophore interacting with a ribonucleic acid molecule SRB-2 as set forth in SEQ ID NO: 2.

5. The nucleic acid construct according to any one of claims 1 to 4, wherein N2 is G.

6. The nucleic acid construct according to any one of claims 1 to 5, wherein N1 is G.

7. The nucleic acid construct according to any one of claims 1 to 6, wherein Ns is C.

8. The nucleic acid construct according to any one of claims 1 to 7, wherein said nucleic acid construct comprises one of the following sequences: SEQ ID NO: 9 to SEQ ID NO: 36, preferably SEQ ID NO: 16 to SEQ ID NO: 36.

9. The nucleic acid construct according to any one of claims 1 to 8, wherein said nucleic acid construct comprises one of the following sequences: SEQ ID NO: 30 to 36.

10. A molecular complex comprising:

- a fluorophore molecule; and

- the nucleic acid construct according to one of claims 1 to 8 specifically bound to the fluorophore molecule.

11. The molecular complex according to claim 10, wherein said fluorophore has the following formula 1 : wherein Ri and R’i independently from each other, are H, a halogen atoms or a (Ci-Cis) alkyls, linear or cyclic, possibly branched,

R2, R’2, R3, R’3 can be H, sulfonyl such as sulfonate (SO3-) or sulfonamide;

R2 and R4 may form, together with the atoms of the carbon cycle to which R2 is connected to, at least one fused aromatic heterocycle, said heterocycle cycle having 5 to 9 atoms,

R’2 and R’4 may form, together with the atoms of the carbon cycle to which R’2 is connected to, at least one fused aromatic heterocycle, said heterocycle cycle having 5 to 9 atoms,

R5 and R3 may also form, together with the atoms of the carbon cycle to which R3 is connected to, at least one fused aromatic heterocycle, said heterocycle cycle having 5 to 9 atoms,

R’5 and R’3 may also form, together with the atoms of the carbon cycle to which R’3 is connected to, at least one fused aromatic heterocycle, said heterocycle cycle having 5 to 9 atoms,

R4 and R5 may also form at least one fused aromatic heterocycle, said heterocycle cycle having 3 to 9 atoms,

R’4 and R’5 may also form at least one fused aromatic heterocycle, said heterocycle cycle having 3 to 9 atoms, and

R4, R’4, Rs, R’s, Re and R’7, independently from each other, are polymethylene unit having 1 carbon to about 20 carbons, inclusive, optionally comprising at least one hetero atom selected from N, O and S l_3 are as defined above, and A’ and A” are independently from each other ether bond, ester, thioether, thioester, amide, sulfonamide, carbamate, thiocarbamate urea or thiourea,

G is H, an alkane (CH3), amido, an amino, a keto, an oxy, a carboxyl, a sulfo, sulfonyl or sulfonate group), a halide atom, G can be in ortho, or meta or para position and can be repeated on the benzyl cycle, and

A’ and A” independently from each other are C1-C12 alkyl, linear or cyclic, possibly substituted or an aryl, preferably a phenyl, substituted or not, A’ and A” can be in ortho, meta or para position said fluorophore being submitted to quenching or energy transfer when it is not associated to said nucleic acid construct in aqueous solution.

12. The molecular complex according to claim 10 or 11 , wherein said fluorophore is one of the following compounds having the following formulas:

Gemini 561-1

(4),

Gemini 552-alkyne (7).

13. A DNA molecule coding for a nucleic acid construct according to any one of claims 2 to 9.

14. The DNA molecule according to claim 13 further comprising a promoter able to control expression of the said nucleic acid molecule as defined above.

15. A host cell containing the nucleic acid construct as defined in any one of claims 1 to 7, or a molecular complex according to any one of claims 10 to 12, or containing the DNA molecule according to claim 13 or 14.

16. A detection array comprising one or more nucleic acid constructs according to one of claims 1 to 9 tethered to a discrete location on a surface of the array.

17. Use of the nucleic acid construct according to any one of claims 1 to 9, or the molecular complex according to any one of claims 10 to 12, or the DNA molecule coding for a nucleic acid molecule according to claim 13 or 14, or the host cell according to claim 15, or a combination thereof, for detecting fluorescent emission, in vitro or ex vivo, upon sensing of small molecules, nucleic acid molecules and proteins.

18. A method for detecting, preferably in vitro or ex vivo, small molecules, nucleic acid molecules or proteins in a sample, the method comprising

- contacting the sample with i) a nucleic acid construct according to any one of claims 1 to 9, operably linked to a polynucleotide molecule and ii) a fluorophore interacting with said nucleic acid construct and

- detecting the small molecules, nucleic acids or proteins by detecting fluorescent emission emitted by the fluorophore interacting with said nucleic acid construct, wherein said polynucleotide molecule interacts with said small molecules, nucleic acids or proteins.

Description:
Title: RNA aptamers and their use

The invention relates to RNA aptamers and their use for fluorescence.

RNA is able to perform a wide range of functions (e.g., scaffolding, recognition or catalysis) that are intimately linked to the three-dimensional architecture of the molecule and its capacity to properly display key residues in space. Whereas, in general, interaction with macromolecules (e.g., proteins or nucleic acids) does not necessarily involve high structural complexity, the tight and specific recognition of small ligands, on the contrary, usually occurs through sophisticated structures best exemplified by those found in the aptamer domains of riboswitches. A fine understanding of the recognition mechanism at work is greatly facilitated by knowledge of the three-dimensional structure of the RNA complexed to its ligand, obtained using NMR or X-ray crystallography. However, such deep structural characterization is not systematically performed for every newly discovered RNA (especially the synthetic ones), as these methods are both time-consuming and laborious. Faster, yet less precise, structural data can be obtained from probing in solution or in silico folding prediction. However, none of the aforementioned approaches gives access to a complete data set since, apart from some residues found to establish specific contacts and expected to have a low tolerance to mutations, it remains difficult to anticipate to what extent an RNA domain (i.e., a stem, a loop or a more elaborate architecture) may tolerate mutations and/or sequence permutations. For instance, even for an element as simple as a stem, it is difficult to precisely predict what impact (beneficial, negative or neutral) the conservative mutation of base pairs may have on RNA functionality. Nonetheless, such information would be extremely valuable to assess the mutational robustness of a molecule or even to assist its engineering.

An efficient way to evaluate mutation tolerance and engineerability of an RNA domain consists in looking at nucleotide conservation through the alignment of its orthologs. Then, using such an alignment, one can evaluate the tolerance of each position to variability by scoring the conservation of each residue (e.g., using Shannon uncertainty) and even compute the informational complexity of the molecule. As a matter of fact, the more sequences are used to generate the alignment, the more accurate the predictions will be. The situation may slightly differ with the origin of the RNA. On the one hand, when studying natural RNAs, large sets of sequences can rapidly be collected by browsing the vast reservoir of genomes. A detailed view of the evolution of a sequence can then be obtained by its phylogenetic study. Such comparative genomic analysis is typically used to identify and characterize riboswitches. As a first approximation, one may consider that finding a sequence evolutionarily conserved in a genome indicates that the molecule is likely functional. Then, searching for covariations allows putative secondary structure elements to be rapidly highlighted, while looking at nucleotide conservation makes it possible to anticipate which residues are important for the function of the molecule (e.g., ligand recognition), as they are expected to be highly conserved. On the other hand, artificial RNAs (e.g., RNA aptamers or ribozymes) can rapidly be isolated using in vitro selection procedures like SELEX, at the end of which several sequence families are typically identified. The best prototype sequence can then possibly be refined through a so-called doped SELEX step during which the sequence is partially randomized prior to being subjected to a few rounds of SELEX to isolate the best fitted sequences. This strategy is particularly efficient to characterize binders (i.e., aptamers), but it is more questionable for RNAs endowed with more complex functions like catalysis or fluorescence activation, since a capacity to bind a target is not necessarily synonymous with efficient substrate conversion or fluorescence emission. Instead, such RNAs are expected to be more efficiently developed through functional screening, where each sequence contained in a library is individually assayed for the target function and sorted from the bulk accordingly.
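The conservation scoring mentioned above (Shannon uncertainty computed per alignment position) can be illustrated with a minimal Python sketch. The function names and the toy alignment are hypothetical and are not data from this application; the sketch assumes gap-free aligned sequences of equal length.

```python
import math
from collections import Counter


def shannon_uncertainty(column):
    """Shannon uncertainty (in bits) of the residues found in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def column_uncertainties(alignment):
    """Per-position uncertainty for a list of equal-length aligned sequences."""
    # zip(*alignment) walks the alignment column by column.
    return [shannon_uncertainty(col) for col in zip(*alignment)]


# Hypothetical toy alignment of four RNA sequences (illustration only).
toy_alignment = ["GGAC", "GGAU", "GCAG", "GUAA"]
scores = column_uncertainties(toy_alignment)
```

A fully conserved position scores 0 bits, and a position where the four ribonucleotides are equally frequent scores 2 bits, so low values flag positions expected to tolerate little variation.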

To be viable and competitive, a screening technology should be rapid, cost-effective and operate in a high-throughput manner while offering the best possible control over reaction conditions. In this view, microfluidics is particularly attractive as it allows reaction volumes to be significantly decreased while increasing analytical throughput. For instance, several hundreds of sequences can be individually expressed and analyzed in parallel using a microfluidic chip in which several hundreds of nanoliter-volume microcompartments were fabricated. Such large-scale integration devices are interesting to exhaustively screen libraries made of a few hundreds of different sequences. A substantial gain in throughput can be achieved by repurposing high-throughput sequencers and using their microfluidic sequencing flow cells to immobilize variant-coding genes and sequence them prior to expressing them and measuring their phenotype. Yet, this requires a high-throughput sequencing platform to be customized and it is not easily applicable to devices currently available. An alternative affording similar, if not higher, analysis throughput exploits emulsion-based technologies like particle display or droplet-based microfluidics. These technologies allow the parallel analysis of several million variants in a quantitative way. Over the past years, we demonstrated that microfluidic-assisted In Vitro Compartmentalization (µIVC in short) is extremely efficient at identifying optimized ribozymes (Ryckelynck et al. 2015) and light-up RNA aptamers. Moreover, monitoring the fate of individual sequences throughout the screening using Next Generation Sequencing (NGS) allowed us to identify biosensors with optimized communication modules, though relatively basic and time-consuming sequence analysis was performed.

The invention relates to a nucleic acid construct comprising:

- a first sequence as set forth in SEQ ID NO: 87

- a second sequence as set forth in SEQ ID NO: 88

|GGAACC|N11CGC ; wherein the sequence in bold of the first sequence interacts with the sequence in bold of the second sequence, and wherein the boxed sequence of the first sequence interacts with the boxed sequence of the second sequence, wherein N1, N2, N3, N4 and N5 are such that

- when N1 is C, G or U, then N2, N3, N4 and N5 are any ribonucleotide; and when N1 is A, then N2 is not G, or N3 is not G, or N4 is not C, or N5 is not G;

- when N2 is C, A or U, then N1, N3, N4 and N5 are any ribonucleotide; and when N2 is G, then N1 is not A, or N3 is not G, or N4 is not C, or N5 is not G;

- when N3 is C, A or U, then N1, N2, N4 and N5 are any ribonucleotide; and when N3 is G, then N1 is not A, or N2 is not G, or N4 is not C, or N5 is not G;

- when N4 is A, G or U, then N1, N2, N3 and N5 are any ribonucleotide; and when N4 is C, then N1 is not A, or N2 is not G, or N3 is not G, or N5 is not G;

- when N5 is A, C or U, then N1, N2, N3 and N4 are any ribonucleotide; and when N5 is G, then N1 is not A, or N2 is not G, or N3 is not G, or N4 is not C;

and wherein N10, N9, N8, N7 and N6 are such that N1 is complementary to N6, N2 is complementary to N7, N3 is complementary to N8, N4 is complementary to N9 and N5 is complementary to N10, the complementarity being according to the Wobble pairing model of ribonucleotides, and wherein N0 is any nucleotide, and N11 is complementary to N0 according to either the Watson and Crick pairing model or the Wobble pairing model of ribonucleotides.

The inventors unexpectedly identified that aptamers that harbor specific mutations compared to the sequence of the SRB2 aptamer significantly enhance the brightness properties of a fluorophore that interacts with said aptamers. A molecular complex comprising essentially a fluorophore and the aptamers of the invention is soluble in aqueous solution, can be used in cell culture and in vivo, and is only activatable when both compounds interact together.

The nucleic acid construct as defined in the invention, also called aptamer, harbors a spatial conformation such that it forms hairpins and loops that are important for the interaction with and activation of fluorophores. What is important in the invention is the identity of some of the nucleotides involved in the hairpin called P3, which determine the efficacy and the efficiency of the interactions with the fluorophores. Other hairpins, P1 and P2, are also important for the structure of the aptamer.

It is important to note that the two sequences comprised in the molecular construct of the invention can i) either belong to the same molecule, or ii) belong to two different molecules.

When the molecular construct is constituted by two nucleic acid molecules, the two molecules are connected to each other since their respective sequences contain nucleotides liable to form hydrogen bonds according to the Watson and Crick pairing model or the Wobble pairing model of ribonucleotides.

When the nucleic acid construct is constituted by only one nucleic acid molecule, the sequences SEQ ID NO: 87 and 88 can be linked directly to each other (the first sequence is followed immediately by the second sequence, or the second sequence is followed immediately by the first sequence). By followed immediately, it is meant that the last nucleotide of one sequence is covalently linked via a phosphodiester bridge with the first nucleotide of the other sequence. The two sequences can also be separated by one or more nucleotides. What is important is that the molecule can fold to adopt the expected secondary structure, with the above defined hairpins and loops.

Advantageously, the invention relates to the nucleic acid construct as defined above, wherein the first and the second sequence belong to the same nucleic acid molecule.

The invention relates to a nucleic acid construct or a nucleic acid molecule comprising the nucleic acid sequence as set forth in SEQ ID NO: 1

GGAACCUCGCUUCGGCGAUGAUGGAGN1N2N3N4N5CAAGGUUAACN10N9N8N7N6CAGGUUCC wherein N1, N2, N3, N4 and N5 are any nucleotides, and wherein N10, N9, N8, N7 and N6 are such that N1 is complementary to N6, N2 is complementary to N7, N3 is complementary to N8, N4 is complementary to N9 and N5 is complementary to N10, the complementarity being according to the Wobble pairing model of ribonucleotides, provided that SEQ ID NO: 1 is not SEQ ID NO: 2

GGAACCUCGCUUCGGCGAUGAUGGAGAGGCGCAAGGUUAACCGCCUCAGGUUCC.

Advantageously, the invention relates to a nucleic acid construct or nucleic acid molecule comprising the nucleic acid sequence as set forth in SEQ ID NO: 1

GGAACCUCGCUUCGGCGAUGAUGGAGN1N2N3N4N5CAAGGUUAACN10N9N8N7N6CAGGUUCC wherein N1, N2, N3, N4 and N5 are such that

- when N1 is C, G or U, then N2, N3, N4 and N5 are any ribonucleotide; and when N1 is A, then N2 is not G, or N3 is not G, or N4 is not C, or N5 is not G;

- when N2 is C, A or U, then N1, N3, N4 and N5 are any ribonucleotide; and when N2 is G, then N1 is not A, or N3 is not G, or N4 is not C, or N5 is not G;

- when N3 is C, A or U, then N1, N2, N4 and N5 are any ribonucleotide; and when N3 is G, then N1 is not A, or N2 is not G, or N4 is not C, or N5 is not G;

- when N4 is A, G or U, then N1, N2, N3 and N5 are any ribonucleotide; and when N4 is C, then N1 is not A, or N2 is not G, or N3 is not G, or N5 is not G;

- when N5 is A, C or U, then N1, N2, N3 and N4 are any ribonucleotide; and when N5 is G, then N1 is not A, or N2 is not G, or N3 is not G, or N4 is not C;

and wherein N10, N9, N8, N7 and N6 are such that N1 is complementary to N6, N2 is complementary to N7, N3 is complementary to N8, N4 is complementary to N9 and N5 is complementary to N10, the complementarity being according to the Watson and Crick pairing model or the Wobble pairing model of ribonucleotides.
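Read together, the five conditional rules above appear to exclude a single combination of N1 to N5, namely the one found in SRB-2 (A, G, G, C, G in SEQ ID NO: 2). The following Python sketch checks that reading by enumerating every combination; it assumes each rule is an "or" across the four other positions, and the function and variable names are hypothetical.

```python
from itertools import product

BASES = "ACGU"
# N1..N5 as they appear in SRB-2 (SEQ ID NO: 2): ...GGAG AGGCG CAAGGUUAAC...
SRB2_P3 = ("A", "G", "G", "C", "G")


def allowed_by_rules(n):
    """Apply the five conditional rules literally to a tuple (N1, ..., N5)."""
    for i, srb2_base in enumerate(SRB2_P3):
        if n[i] == srb2_base:
            # Rule i: at least one of the four other positions must differ from SRB-2.
            if not any(n[j] != SRB2_P3[j] for j in range(5) if j != i):
                return False
        # When n[i] is any other base, rule i places no constraint.
    return True


def differs_from_srb2(n):
    """Single-condition reading: the combination is simply not the SRB-2 one."""
    return tuple(n) != SRB2_P3


all_combos = list(product(BASES, repeat=5))
allowed = [n for n in all_combos if allowed_by_rules(n)]
rules_match_single_exclusion = all(
    allowed_by_rules(n) == differs_from_srb2(n) for n in all_combos
)
```

Under these assumptions, 1023 of the 1024 possible combinations pass, which is consistent with the proviso stated earlier that SEQ ID NO: 1 is not SEQ ID NO: 2. The sketch does not check the pairing of N6 to N10.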

The aptamers according to the invention are derived from, but not similar to, the aptamer SRB2, known in the art and comprising the nucleic acid sequence as set forth in SEQ ID NO: 2. The mutations identified by the inventors are located in the stem P3 of the aptamer and significantly enhance the brightness properties of a fluorophore that interacts with said aptamers.

In the invention, when considering the nucleic acid construct or nucleic acid molecule, reference is made to A, C, G and U, i.e., adenosine 5'-phosphate, cytidine 5'-phosphate, guanosine 5'-phosphate and uridine 5'-phosphate, the four ribonucleotides mainly present in RNA molecules.

Ribonucleotides in the invention consist of a phosphate group, a ribose sugar group, and a nucleobase attached to the ribose. Without the phosphate group, the combination of the nucleobase and sugar is known as a nucleoside. The interchangeable nitrogenous nucleobases are derived from two parent compounds, purine and pyrimidine. Nucleotides are heterocyclic compounds, that is, they contain at least two different chemical elements as members of their rings. The ribonucleotides according to the invention have the following formula: wherein Ra is OH, a halogen, in particular F, an amine or an O-methyl group, and possibly H, Ps ,

Rc is O or S.

Therefore, in the invention, when reference is made to A, U, G, C within the context of the nucleic acid construct or nucleic acid molecule, reference is made to the natural ribonucleotides A, U, G and C, but also to variants of said natural ribonucleotides A, U, G and C, in particular in the Ra and Rc residues.

In the invention, the Wobble base pairing model is a pairing between two nucleotides in RNA molecules that does not strictly follow Watson-Crick base pair rules (A-U and G-C pairs). The four main wobble base pairs are guanine-uracil (G-U), hypoxanthine-uracil (I-U), hypoxanthine-adenine (I-A), and hypoxanthine-cytosine (I-C). The thermodynamic stability of a wobble base pair is comparable to that of a Watson-Crick base pair. Wobble base pairs are fundamental in RNA secondary structure and are critical for the proper translation of the genetic code. Preferably, in the invention the main pairs are A-U, G-U and G-C.
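As an illustration of how the pairing constraint on the P3 stem could be checked, here is a minimal Python sketch. It assumes that only the preferred pairs named above (A-U, G-C and G-U) are accepted and that the flanking segments are those of SEQ ID NO: 1; the function names, the constant names and the example variant are hypothetical and are not sequences taken from the sequence listing.

```python
# Preferred pairs named in the description: A-U and G-C (Watson-Crick), G-U (wobble).
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

# Fixed segments of SEQ ID NO: 1 around the variable positions.
PREFIX = "GGAACCUCGCUUCGGCGAUGAUGGAG"
LOOP = "CAAGGUUAAC"
SUFFIX = "CAGGUUCC"
SEQ_ID_NO_2 = "GGAACCUCGCUUCGGCGAUGAUGGAGAGGCGCAAGGUUAACCGCCUCAGGUUCC"  # SRB-2


def can_pair(x, y):
    """True if ribonucleotides x and y form one of the accepted pairs."""
    return (x, y) in ALLOWED_PAIRS


def build_construct(n1_to_n5, n10_to_n6):
    """Assemble SEQ ID NO: 1 from the 5' strand (N1..N5) and the 3' strand (N10..N6)."""
    if len(n1_to_n5) != 5 or len(n10_to_n6) != 5:
        raise ValueError("each strand of the variable stem must have 5 nucleotides")
    # N1 pairs with N6, ..., N5 pairs with N10, so the 3' strand is read in reverse.
    for a, b in zip(n1_to_n5, reversed(n10_to_n6)):
        if not can_pair(a, b):
            raise ValueError(f"{a} and {b} cannot pair")
    sequence = PREFIX + n1_to_n5 + LOOP + n10_to_n6 + SUFFIX
    if sequence == SEQ_ID_NO_2:
        raise ValueError("sequence is SRB-2 (SEQ ID NO: 2), which is excluded")
    return sequence


# Hypothetical variant with N1 = G paired to N6 = U through a G-U wobble pair.
variant = build_construct("GGGCG", "CGCCU")
```

The sketch only verifies pairing and the exclusion of SEQ ID NO: 2. It says nothing about the brightness a given variant would confer, which the description establishes experimentally.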

Advantageously, the invention relates to the above-mentioned nucleic acid construct or nucleic acid molecule, said nucleic acid construct or nucleic acid molecule conferring to a fluorophore molecule a brightness at least 2-fold higher, preferably at least 4-fold higher, compared to the brightness of a fluorophore interacting with a nucleic acid molecule SRB-2 as set forth in SEQ ID NO: 2.

Advantageously, the invention relates to the above-defined nucleic acid construct or nucleic acid molecule, wherein N2 is G.

In the above-mentioned nucleic acid molecule, N2 is preferably a G (guanosine 5'-phosphate) such that the nucleic acid construct or nucleic acid molecule comprises, or consists essentially or consists of the nucleic acid sequence as set forth in SEQ ID NO: 3 as shown hereafter:

Advantageously, the invention relates to the above-defined nucleic acid construct or nucleic acid molecule, wherein N1 in SEQ ID NO: 1 is G.

In the above-mentioned nucleic acid construct or nucleic acid molecule, N1 is preferably a G (guanosine 5'-phosphate) such that the nucleic acid construct or nucleic acid molecule comprises, or consists essentially or consists of the nucleic acid sequence as set forth in SEQ ID NO: 4 or 5 as shown hereafter

Advantageously, the invention relates to the above-defined nucleic acid construct or nucleic acid molecule, wherein Ns is C.

In the above-mentioned nucleic acid construct or nucleic acid molecule, Ns is preferably a C (cytidine 5'-phosphate) such that the nucleic acid construct or nucleic acid molecule comprises, or consists essentially or consists of the nucleic acid sequence as set forth in SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 86 or SEQ ID NO: 8 as shown hereafter

Advantageously, the invention relates to the above-defined nucleic acid construct or nucleic acid molecule, wherein said nucleic acid construct or nucleic acid molecule comprises one of the following sequences: SEQ ID NO: 9 to SEQ ID NO: 36, preferably SEQ ID NO: 16 to SEQ ID NO: 36.

The most advantageous nucleic acid constructs or nucleic acid molecules according to the invention are the nucleic acid constructs or nucleic acid molecules that comprise, consist essentially or consist of the nucleic acid sequences listed below:

Advantageously, the invention relates to the above-defined nucleic acid construct or nucleic acid molecule, wherein said nucleic acid construct or nucleic acid molecule comprises one of the following sequences: SEQ ID NO: 30 to 36.

Nucleic acid constructs or nucleic acid molecules comprising, consisting essentially or consisting of the nucleic acid sequences as set forth in SEQ ID NO: 30 to 36 are particularly advantageous since they confer to a complex comprising a fluorophore and said nucleic acid construct or nucleic acid molecule a brightness at least 8-fold higher compared to the brightness of a complex comprising a fluorophore and SRB2 aptamer (SEQ ID NO: 2).

The invention also relates to a molecular complex comprising:

- a fluorophore molecule; and

- the nucleic acid construct or nucleic acid molecule as defined above, specifically bound to the fluorophore molecule.

The invention relates to a molecular complex emitting fluorescent light, the molecular complex comprising, or consisting essentially of a fluorophore, and a nucleic acid construct or nucleic acid molecule, wherein said fluorophore has the following formula 1 wherein

- independently from each other, Fd1 and Fd2 are fluorescent dyes,

- D1 represents a group chosen from: I OP , or from a cyclo(C3-C7)alkyl, a monocyclic aromatic group, heterocyclic group or a monocyclic nonaromatic, alkane or heterocyclic group, wherein R’ represents a hydrogen atom or a (Ci- Cs)alkyl, linear or cyclic, saturated or not,

- independently from each other, L1 and L2 are each covalently bound to D1 and are chosen from the group consisting of a single bond; a linear or branched alkyl group having from 1 to 24 carbon atoms (C1-C24), at least one of said carbon atoms being replaced by a heteroatom, e.g. O, N, S, or not, said alkyl group being substituted or not by an amido, an amino, a keto, an oxy or a carboxyl group; or a linear or branched, unsaturated or not, alkyl group having from 2 to 24 carbon atoms, at least one of said carbon atoms being replaced by a heteroatom, e.g. O, N, S, or not, said alkyl group being substituted or not by an amido, an amino, a keto, an oxy or a carboxyl group;

- L3 is a hydrogen atom or corresponds to L1 or L2, i.e. a linear or branched alkyl group having from 1 to 24 carbon atoms (C1-C24), at least one of said carbon atoms being replaced by a heteroatom, e.g. O, N, S, or not, said alkyl group being substituted or not by an amido, an amino, a keto, an oxy or a carboxyl group, or a linear or branched, unsaturated or not, alkyl group having from 2 to 24 carbon atoms, at least one of said carbon atoms being replaced by a heteroatom, e.g. O, N, S, or not, said alkyl group being substituted or not by an amido, an amino, a keto, an oxy or a carboxyl group, possibly substituted by a functionalizable moiety, e.g. azide, alkyne, DBCO, active ester, carboxylic acid, maleimide group, or a functional molecule such as a ligand or a biomolecule, e.g. biotin or desthiobiotin, and

- A is a C1-C12 alkyl, linear or cyclic, possibly substituted by an aryl, preferably a phenyl, substituted or not,

said fluorophore being subjected to quenching or energy transfer when it is not associated with said nucleic acid construct or nucleic acid molecule in aqueous solution, or said fluorophore being subjected to quenching or energy transfer when considered alone in aqueous solution, wherein said nucleic acid construct or nucleic acid molecule is able to activate the fluorescence of said fluorophore in an aqueous solution, when interacting with said fluorophore, and wherein said nucleic acid construct or nucleic acid molecule is able to specifically interact, in a sequence-specific manner, with said fluorophore.

The inventors unexpectedly identified a molecular complex comprising essentially a fluorophore and a nucleic acid construct or nucleic acid molecule as defined above that is soluble in aqueous solution, can be used in cell culture and in vivo, harbours high brightness properties and is only activatable when both compounds interact together.

The compounds that constitute the complex, are therefore the fluorophore and the nucleic acid construct or nucleic acid molecule.

The fluorophore of the complex described above is a fluorophore of formula (I), and contains two fluorescent dyes Fd1 and Fd2 that can be identical or different.

Both Fd1 and Fd2 are dyes that can re-emit light upon light excitation. Fd1 and Fd2 typically contain several combined aromatic groups, or planar or cyclic molecules with several π bonds. They can be coumarins, pyrenes, cyanines, BODIPYs, merocyanines and their derivatives well known in the art. It is advantageous that Fd1 and Fd2 be xanthene derivatives such as fluorescein dye, rhodamine dye, sulforhodamine dye, Oregon green dye, eosin dye, Texas red dye, silicon-rhodamine dye, or one of their derivatives well known in the art.

Due to the structure of the fluorophore, both Fd1 and Fd2 dyes are chemically linked to each other and, depending upon the environmental conditions, can be close together. This results in a decrease of the fluorescence intensity, or an absence of fluorescence at the emitting wavelength, when both dyes are excited at the specific wavelength. This phenomenon is known as quenching or energy transfer.

Thus, when the fluorophore is in an environment, e.g. aqueous solution, that brings both dyes close together, quenching occurs and no fluorescence, or a decreased fluorescence, is emitted by the fluorophore when excited at the appropriate wavelength. On the contrary, when the fluorophore is in an appropriate environment, e.g. organic solvent, Fd1 and Fd2 are far from each other and the quenching does not occur.

Based on these properties, the inventors engineered a strategy to specifically activate the fluorescence of said fluorophore, when the fluorophore is in aqueous solution, i.e. when the fluorophore is in physiological conditions to be used in living cells.

The inventors identified that the nucleic acid construct or nucleic acid molecules as defined above can specifically interact with said fluorophore, such that: the fluorescence is enhanced compared to the fluorescence of the fluorophore, when it does not interact with said nucleic acid construct or nucleic acid molecule, or when said fluorophore is placed alone in an organic solvent that does not induce quenching, and the interaction is very specific with a high affinity.

In the fluorophore described above, L3 represents a functionalizable moiety that can be used to detect, isolate or purify the fluorophore.

L1 and L2 correspond to the “arms” of the fluorophore that connect the Fd1 and Fd2 dyes to each other. L1 and L2 are covalently linked to each other via D1, as defined above.

L1 and L2, independently from each other, can be:

- either a single bond, such that the fluorophore will have the following formula when both L1 and L2 are a single bond,

- or a linear or branched alkyl group having 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 carbon atoms,

- or a linear or branched alkyl group having 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 carbon atoms, wherein at least one carbon atom is substituted by a heteroatom, e.g. O, N or S,

- or a linear or branched alkyl group having 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 carbon atoms, said alkyl group being itself substituted by an amido, an amino, a keto, an oxy, a carboxyl group, or a linear or branched alkyl group, unsaturated or not, having from 2 to 24 carbon atoms,

- or a linear or branched alkyl group having 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 carbon atoms, wherein at least one carbon atom is substituted by a heteroatom, e.g. O, N or S, the carbon and/or the heteroatoms of said alkyl group being themselves substituted by an amido, an amino, a keto, an oxy, a carboxyl group, or a linear or branched alkyl group, unsaturated or not, having from 2 to 24 carbon atoms.

In the fluorophore, A represents a C1-C12 alkyl, i.e. a C1, a C2, a C3, a C4, a C5, a C6, a C7, a C8, a C9, a C10, a C11, or a C12 alkyl, or a C1-C12 alkyl substituted by an aryl group, said aryl being substituted or not.

In the complex disclosed above, the nucleic acid construct or nucleic acid molecule interacts with the fluorophore such that it inhibits or avoids the quenching that occurs between the Fd1 and Fd2 dyes. This interaction is specific to the sequence of the nucleic acid construct or nucleic acid molecule, such that the nucleic acid construct or nucleic acid molecule should advantageously have a determined nucleic acid sequence to interact with said fluorophore.

The molecular complex between the fluorophore and a nucleic acid construct or nucleic acid molecule according to the invention has an enhanced brightness compared to the brightness obtained with a complex comprising a fluorophore and the SRB-2 aptamer.

Advantageously, the invention relates to the molecular complex as defined above, wherein Fd1 and Fd2 are represented by formula 2:

Wherein

X is NH, C(R)2, O, Si(R)2, Ge(R)2, Sn(R)2, P(R)2, B(R)2, S, SO2, Se, Te, TeO, wherein R can be alkyl or aromatic groups, or O, O-alkyl, sulfonyl such as sulfonate (SO3-) or sulfonamide;

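Y is O, N-R6 or a positively charged group bearing R4 and R5 (structural formula not reproduced);

Y’ is O-R’6, N-R’7 or a positively charged group bearing R’4 and R’5 (structural formula not reproduced);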

R1 and R’1, independently from each other, are H, a halogen atom or a (C1-C ) alkyl, linear or cyclic, possibly branched,

R2, R’2, R3, R’3 can be H, sulfonyl such as sulfonate (SO3-) or sulfonamide;

R2 and R4 may form, together with the atoms of the carbon cycle to which R2 is connected to, at least one fused aromatic heterocycle, said heterocycle cycle having 5 to 9 atoms, R’2 and R’4 may form, together with the atoms of the carbon cycle to which R’2 is connected to, at least one fused aromatic heterocycle, said heterocycle cycle having 5 to 9 atoms,

R5 and R3 may also form, together with the atoms of the carbon cycle to which R3 is connected, at least one fused aromatic heterocycle, said heterocycle having 5 to 9 atoms,

R’5 and R’3 may also form, together with the atoms of the carbon cycle to which R’3 is connected, at least one fused aromatic heterocycle, said heterocycle having 5 to 9 atoms,

R4 and R5 may also form at least one fused aromatic heterocycle, said heterocycle having 3 to 9 atoms,

R’4 and R’5 may also form at least one fused aromatic heterocycle, said heterocycle having 3 to 9 atoms, and

R4, R’4, R5, R’5, R6 and R’7, independently from each other, are polymethylene units having 1 carbon to about 20 carbons, inclusive, optionally comprising at least one heteroatom selected from N, O and S.

Advantageously, the invention relates to the molecular complex as defined above, wherein said fluorophore has the following formula 1:

wherein R1 and R’1, independently from each other, are H, a halogen atom or a (C1-C ) alkyl, linear or cyclic, possibly branched,

R2, R’2, R3, R’3 can be H, sulfonyl such as sulfonate (SO3-) or sulfonamide;

R2 and R4 may form, together with the atoms of the carbon cycle to which R2 is connected, at least one fused aromatic heterocycle, said heterocycle having 5 to 9 atoms,

R’2 and R’4 may form, together with the atoms of the carbon cycle to which R’2 is connected to, at least one fused aromatic heterocycle, said heterocycle cycle having 5 to 9 atoms,

R5 and R3 may also form, together with the atoms of the carbon cycle to which R3 is connected, at least one fused aromatic heterocycle, said heterocycle having 5 to 9 atoms,

R’5 and R’3 may also form, together with the atoms of the carbon cycle to which R’3 is connected, at least one fused aromatic heterocycle, said heterocycle having 5 to 9 atoms,

R4 and R5 may also form at least one fused aromatic heterocycle, said heterocycle having 3 to 9 atoms,

R’4 and R’5 may also form at least one fused aromatic heterocycle, said heterocycle having 3 to 9 atoms, and

R4, R’4, R5, R’5, R6 and R’7, independently from each other, are polymethylene units having 1 carbon to about 20 carbons, inclusive, optionally comprising at least one heteroatom selected from N, O and S,

l_3 are as defined above, and A’ and A” are, independently from each other, an ether bond, ester, thioether, thioester, amide, sulfonamide, carbamate, thiocarbamate, urea or thiourea,

G is H, an alkane (CH3), an amido, an amino, a keto, an oxy, a carboxyl, a sulfo, a sulfonyl or a sulfonate group, or a halide atom; G can be in ortho, meta or para position and can be repeated on the benzyl cycle, and

A’ and A”, independently from each other, are C1-C12 alkyl, linear or cyclic, possibly substituted, or an aryl, preferably a phenyl, substituted or not; A’ and A” can be in ortho, meta or para position, said fluorophore being subject to quenching or energy transfer when it is not associated with said nucleic acid construct or nucleic acid molecule in aqueous solution.

More advantageously, the invention relates to the molecular complex as defined above, wherein said -A-Fd1 and -A-Fd2 groups are one of the following fluorophores: Rhodamine, Sulfo-Rhodamine, non-N-Alkylated Rhodamine, Ethyl-alkylated rhodamine, fluorescein, Silicon-Rhodamine, or carborhodamine.

More advantageously, the invention relates to the above defined complex, wherein said fluorophore is one of the following compounds having the following formulas:

Gemini 561-1 (4), Gemini 552-alkyne (7).

In another advantageous embodiment, the invention relates to the above mentioned molecular complex, wherein said complex harbors a fluorescence intensity at least 3-fold higher compared to the fluorescence intensity of corresponding free uncomplexed fluorophore in aqueous medium and wherein said nucleic acid construct or nucleic acid molecule has an affinity quantified by a Kd value of at most 500 nM, preferably lower, for said fluorophore.

In the invention, affinity has its common meaning, well known in the art: the tendency of a chemical species to react with another species to form a chemical compound. Affinity can also refer to the tendency of certain atoms (or molecules) to aggregate or bond together, and includes electrostatic interactions, hydrogen bonds, etc.

The term "specifically binding", “specifically binds” or “specifically interacts” is used herein to indicate that this moiety has the capacity to recognize and interact specifically with the molecular target of interest, while having relatively little detectable reactivity with other structures present in the aqueous phase such as other molecular targets that can be recognized by other probes. There is commonly a low degree of affinity between any two molecules due to non-covalent forces such as electrostatic forces, hydrogen bonds, Van der Waals forces and hydrophobic forces, which is not restricted to a particular site on the molecules and is largely independent of the identity of the molecules. This low degree of affinity can result in non-specific binding. By contrast, when two molecules bind specifically, the degree of affinity is much greater than such non-specific binding interactions. In specific binding, a particular site on each molecule interacts, the particular sites being structurally complementary, with the result that the capacity to form non-covalent bonds is increased.

The fluorescence enhancement can be measured with a fluorometer and is obtained by dividing the maximum fluorescence intensity of the fluorophore in the presence of said nucleic acid by the maximum fluorescence intensity of the fluorophore alone in the same aqueous medium and at the same concentration.

The Kd value can be obtained by measuring the fluorescence intensity of the fluorophore in aqueous medium with increasing amounts of said nucleic acid. The plot of the fluorescence intensity versus the concentration of said nucleic acid will provide the Kd value after fitting with the proper equation (for example, the Hill equation).

The change in brightness and the Kd values can be acquired using a standard fluorescence spectrometer, where the complex and the fluorophore alone are measured in aqueous medium in a cuvette.
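By way of illustration only, the two measurements described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used by the inventors: the function names (`hill`, `fit_kd`, `fold_enhancement`), the fixed Hill coefficient of 1 and the grid-search fitting strategy are all hypothetical choices made for the example.

```python
def hill(conc, kd, f_min, f_max, n=1.0):
    """Fluorescence predicted by a Hill-type binding curve (assumed model)."""
    if conc <= 0:
        return f_min
    bound = conc ** n / (kd ** n + conc ** n)
    return f_min + (f_max - f_min) * bound


def fit_kd(concs, signals, n=1.0, steps=2000):
    """Estimate Kd by a log-spaced grid search.

    For each candidate Kd, f_min and f_max follow from an ordinary
    linear least-squares fit of the signal against the bound fraction.
    """
    positive = [c for c in concs if c > 0]
    lo, hi = min(positive) / 100.0, max(positive) * 100.0
    best = None
    for i in range(steps + 1):
        kd = lo * (hi / lo) ** (i / steps)
        x = [c ** n / (kd ** n + c ** n) if c > 0 else 0.0 for c in concs]
        mx = sum(x) / len(x)
        my = sum(signals) / len(signals)
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, signals))
        b = sxy / sxx
        a = my - b * mx
        sse = sum((a + b * xi - yi) ** 2 for xi, yi in zip(x, signals))
        if best is None or sse < best[0]:
            best = (sse, kd, a, a + b)
    _, kd, f_min, f_max = best
    return kd, f_min, f_max


def fold_enhancement(f_complex, f_free):
    """Intensity with the nucleic acid divided by intensity of the free dye."""
    return f_complex / f_free
```

With titration data expressed in nM, a fitted Kd can then be compared with the 500 nM bound stated above, and the fold enhancement with the 3-fold bound.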

The affinity of a molecule X for its partner Y can generally be represented by the dissociation constant (Kd). In preferred embodiments, the Kd representing the affinity between the capture moiety and the molecular target of interest is 1 × 10^-7 M or lower, preferably 1 × 10^-8 M or lower, and even more preferably 1 × 10^-9 M or lower. Specificity and affinity can be determined by binding or competitive assays, using e.g. Biacore instruments.

The invention also relates to a DNA molecule coding for a nucleic acid construct or nucleic acid molecule as defined above.

As mentioned above, the nucleic acid construct or nucleic acid molecules as defined in the invention are constituted by ribose-based nucleotides, i.e. ribonucleotides as defined above, having the following formula:

wherein Ra is OH, a halogen, in particular F, an amine or an O-methyl group, and possibly H,

Rb is cytosine, guanine, adenine, pseudouracil or their methylated analogues, and

Rc is O or S.

Advantageously, the nucleic acid molecule is synthesized by incorporating natural or modified ribonucleotides as defined above, according to the transcription process well known in the art. In the invention, the DNA molecule encompasses a single-stranded DNA molecule coding for said nucleic acid construct or nucleic acid molecule, and also a double-stranded DNA molecule having two complementary antiparallel strands according to the Watson and Crick model, one of the strands coding for the nucleic acid construct or nucleic acid molecule according to the invention.

Also encompassed is a single-stranded DNA molecule that is complementary to the single-stranded DNA molecule that codes for the nucleic acid construct or nucleic acid molecule according to the invention.

It could be more convenient to carry out mutational analysis, or expression experiments by using molecules constituted by deoxyribose-based nucleotides, that code for said aptamers, i.e. said nucleic acid constructs or nucleic acid molecules.
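As a simple illustration of the relationship between an RNA aptamer and the DNA molecules described above, the following sketch derives the DNA strand having the same sequence as the RNA (U replaced by T) and its complementary strand. The function names are hypothetical and the sketch assumes unmodified bases only.

```python
# Watson-Crick complements for the four DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def rna_to_coding_dna(rna):
    """Return the DNA strand with the same sequence as the RNA (U -> T)."""
    return rna.upper().replace("U", "T")


def reverse_complement(dna):
    """Return the complementary antiparallel DNA strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


# First ten nucleotides of the SRB-2 based template, written as RNA.
coding = rna_to_coding_dna("GGAACCUCGC")
complementary = reverse_complement(coding)
```

Here `coding` corresponds to the strand coding for the aptamer and `complementary` to the single-stranded DNA molecule that is complementary to it.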

Advantageously, the DNA molecules coding for the nucleic acids molecules as set forth in SEQ ID NO: 9 to SEQ ID NO: 36 are the following molecules:

Advantageously, the invention relates to the DNA molecule as defined above, further comprising a promoter able to control expression of said nucleic acid construct or nucleic acid molecule as defined above. It is advantageous that the DNA molecule mentioned above be preceded (at the 5’ end of the DNA molecule) by a translation initiation sequence to allow proper translation initiation, and by a constitutive promoter positioned with respect to the DNA molecule sequence so as to be capable of promoting its transcription upon activation of the promoter.

The DNA sequence is advantageously inserted in a vector, linear or circular, such as a bacterial vector or a eukaryotic vector, or a virus vector.

The invention also relates to a host cell containing the nucleic acid construct or nucleic acid molecule as defined above, or a molecular complex as defined above, or containing the DNA molecule as defined above, or the genetically engineered DNA molecule allowing the expression of said nucleic acid construct or nucleic acid molecule, or a combination thereof.

Once the constructed DNA molecule has been cloned into an expression system, it is ready to be incorporated into a host cell. Such incorporation can be carried out by various forms of transformation, depending upon the vector/host cell system, such as transformation, transduction, conjugation, mobilization, or electroporation. The DNA sequences are cloned into the vector using standard cloning procedures in the art, as described by Maniatis et al., Cold Spring Harbor, New York (1982). Suitable host cells include, but are not limited to, bacteria, yeast, mammalian cells, insect cells, plant cells, and the like. The host cell is preferably present either in a cell culture (ex vivo) or in a whole living organism (in vivo).

Mammalian cells suitable for carrying out the present invention include, without limitation, COS (e.g., ATCC No. CRL 1650 or 1651), BHK (e.g., ATCC No. CRL 6281), CHO (ATCC No. CCL 61), HeLa (e.g., ATCC No. CCL 2), 293 (ATCC No. 1573), CHOP, NS-1 cells, embryonic stem cells, induced pluripotent stem cells, and primary cells recovered directly from a mammalian organism. With regard to primary cells recovered from a mammalian organism, these cells can optionally be reintroduced into the mammal from which they were harvested or into other animals.

The invention also relates to a detection array comprising one or more nucleic acid constructs or nucleic acid molecules as defined above, tethered to a discrete location on a surface of the array.

By using an array comprising one or more nucleic acid constructs or nucleic acid molecules as defined above, a method of screening fluorophores can be carried out in order to isolate the fluorophores having the best interaction with said nucleic acid constructs or nucleic acid molecules, and/or the fluorophore/nucleic acid construct or fluorophore/nucleic acid molecule complexes having the best brightness. The array can be prepared in a manner similar to that of arrays used for evaluating gene expression levels. These techniques are well known in the art, and the skilled person could easily prepare such an array.

The invention also relates to the use of:

- the nucleic acid construct or nucleic acid molecule as defined above, or

- the molecular complex as defined above, or

- the DNA molecule coding for a nucleic acid construct or nucleic acid molecule as defined above, or

- the host cell as defined above,

- or a combination thereof, for detecting fluorescent emission, preferably in vitro or ex vivo, upon sensing of small molecules, nucleic acid molecules and proteins.

These detections can be carried out using fluorescence detection techniques well known in the art, such as imaging or any fluorometric approach.

The target molecule of interest can be any biomaterial or small molecule including, without limitation, proteins, nucleic acids (RNA or DNA), lipids, oligosaccharides, carbohydrates, small molecules, hormones, cytokines, chemokines, cell signaling molecules, metabolites, organic molecules, and metal ions. The target molecule of interest can be one that is associated with a disease state or pathogen infection.

The invention also relates to a method for fluorescence detection in vitro or ex vivo, using fluorometry or imaging, of small molecules, RNA and proteins in cells, comprising the administration, to living cells in vivo or in ex vivo culture, of a nucleic acid construct or nucleic acid molecule as defined above, operably linked to an RNA, along with a fluorophore, preferably as defined above. It also relates to in vitro methods in which the nucleic acid construct or nucleic acid molecule is used to report on the successful amplification of a target nucleic acid.

In other words, the invention relates to a method for detecting, preferably in vitro or ex vivo, small molecules, nucleic acids or proteins in a sample, the method comprising contacting the sample with i) a nucleic acid construct or nucleic acid molecule as defined above operably linked to a polynucleotide molecule and ii) a fluorophore (that interacts with said nucleic acid construct or nucleic acid molecule) and detecting the small molecules, nucleic acids or proteins by detecting fluorescent emission emitted by the fluorophore interacting with said nucleic acid construct or nucleic acid molecule, wherein said polynucleotide molecule interacts with said small molecules, nucleic acids or proteins. The above sample can be either a natural sample (for instance a liquid water sample liable to contain small molecules, nucleic acid molecules or proteins), or a biological sample such as a biological fluid (such as urine, lymph, blood, cerebrospinal fluid, etc.), a cell extract or a cell.

For instance, for imaging RNA in cells, it is possible to provide to a cell, or to a cell-free expression system, a molecule allowing the expression of a fusion RNA constituted by the RNA to be studied in the cell, operably linked, preferably at its 3’ end but possibly at its 5’ end, to an aptamer according to the above definition, i.e. a nucleic acid construct or nucleic acid molecule according to the invention.

The above-disclosed fusion RNA is then expressed in the cell, or in a cell-free expression system, and, in the presence of the fluorophore, the aptamer part of the fusion molecule will interact with the fluorophore. This will result in a fluorescence emission upon exposure to light of an appropriate wavelength, making it possible to track, and thus to image, the RNA to be studied, because it is covalently linked to the aptamer.

It would be therefore possible to monitor the trafficking, the localization, the accumulation... of the RNA to be studied, in particular in living cells without alteration of their integrity.

Also disclosed is a method for imaging small molecules, RNA and proteins in mammals, comprising the administration to a mammal of a nucleic acid according to the above definition operably linked to a biomolecule, along with a fluorophore molecule according to the above definition.

From conventional techniques of molecular biology, the skilled person would be able to obtain all the necessary fusion RNA molecules.

The invention will be better understood in light of the following figure and the examples.

Legend to the figures

[Fig. 1] Figure 1 represents the structure of the SRB-2 aptamer (SEQ ID NO: 2) and the P3 mutant library. The molecule is shown as initially described. The 10 positions randomized in the P3 stem are boxed with dashed lines and the domains proposed to interact with sulforhodamine B are shaded in grey. SRB-2 and its derivatives were flanked with two constant regions used as primer binding sequences (PBS) for PCR amplifications.

[Fig. 2] Figure 2 represents the structure of the Gemini-561 fluorogen. The fluorogen is made of a dimer of sulforhodamine B linked together by a lysine and a PEG linker. The fluorogen also carries a biotin moiety initially used for aptamer selection.

[Fig. 3] Figure 3 represents the working principle of Gemini-561. H-aggregates form when the molecule is dissolved in an aqueous environment, leading to sulforhodamine B fluorescence quenching. However, upon modulating medium polarity (e.g., by adding methanol) or in the presence of SRB-2 aptamers, both sulforhodamine B moieties separate and recover their fluorescence capacity.

[Fig. 4] Figure 4 represents an overview of the µIVC-Useq pipeline. The microfluidic-assisted In Vitro Compartmentalization (µIVC) process is made of three main steps during which: i. the genes contained in a library are individualized prior to being amplified, ii. droplets containing amplified genes are fused one-to-one with an in vitro expression mixture supplemented with fluorogen and iii. the fluorescence of each droplet is measured and used to sort them accordingly. In µIVC-Useq, the screening step is followed by a Next Generation Sequencing (NGS) analysis of the sequences contained in enriched libraries while an unsupervised bioinformatic pipeline allows the rapid identification of molecules of interest.

[Fig. 5] Figure 5 represents the fluorescence profile of droplets obtained at the first round of µIVC. Fusion efficiency can be assessed by the blue fluorescence of coumarin added in the droplets while the orange fluorescence informs on RNA function (Gemini-561 fluorescence activation). The population of droplets gated and sorted during the first round of screening is boxed with a dashed line. Their DNA content was recovered and used to prime a new round of screening.

[Fig. 6] Figure 6 represents the fluorescence profile of droplets obtained at the second round of µIVC. Fusion efficiency can be assessed by the blue fluorescence of coumarin added in the droplets while the orange fluorescence informs on RNA function (Gemini-561 fluorescence activation). The population of droplets gated and sorted during the second round of screening is boxed with a dashed line. Their DNA content was recovered and used to prime a new round of screening.

[Fig. 7] Figure 7 represents the fluorescence profile of droplets obtained at the third round of µIVC. During the third round, a relaxed (small dashed box, population R3A) and a stringent (large dashed box, population R3B) gating were used.

[Fig. 8] Figure 8 represents a graph showing the normalized fluorescence for each tested molecule: A- SRB-2; B- R0; C- R1; D- R2; E- R3.A and F- R3.B. For each round of µIVC, the enriched DNA libraries were in vitro transcribed in the presence of 500 nM Gemini-561 and the fluorescence was monitored over time. The fluorescence apparition rate was computed for each library and normalized to that of the parental SRB-2 aptamer. The values are the mean of n = 3 independent experiments. The error bars correspond to ± 1 s.d.

[Fig. 9] Figure 9 represents a graph showing the folding energy (kcal/mol) of the sequences computed using the RNAfold program from the ViennaRNA Package. A= R0; B= R3A and C= R3B.

[Fig. 10] Figure 10 represents a graph showing base-pair formation in the P3-L3 stem loop. The number of base pairs formed was extracted from the optimal secondary structure in dot-bracket notation generated by the RNAfold program of the ViennaRNA Package. Y-axis: proportion of pairs formation; X-axis: pair formation.

[Fig. 11] Figure 11 represents a graph showing base-pair formation in the P3 stem only. Here, any sequence displaying at least one base pair involving L3 was set to 0. Y-axis: proportion of pairs formation; X-axis: pair formation.
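For illustration, the counting rule described for Figures 10 and 11 can be sketched as follows, assuming a dot-bracket string without pseudoknots. The function names and the index ranges passed for the two P3 strands and the L3 loop are hypothetical; the actual coordinates depend on the construct that is folded.

```python
def parse_pairs(structure):
    """Return the (i, j) index pairs (0-based) encoded in a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.append((stack.pop(), i))
    return pairs


def p3_pair_count(structure, strand5, strand3, loop):
    """Count base pairs formed between the two P3 strands.

    The count is set to 0 as soon as one base pair involves a loop (L3)
    position, mirroring the rule stated for Figure 11.
    """
    pairs = parse_pairs(structure)
    if any(i in loop or j in loop for i, j in pairs):
        return 0
    return sum(1 for i, j in pairs if i in strand5 and j in strand3)


# Toy 20-nt stem-loop: 5-nt strand, 10-nt loop, 5-nt strand (assumed layout).
full_stem = p3_pair_count("(((((..........)))))", range(0, 5), range(15, 20), range(5, 15))
loop_paired = p3_pair_count("(((((.((...))..)))))", range(0, 5), range(15, 20), range(5, 15))
```

In this toy layout, `full_stem` counts the five stem pairs, whereas `loop_paired` is set to 0 because two pairs form inside the loop.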

[Fig. 12] Figure 12 represents the motif identified in the 50 sequences displaying the highest occurrence in R3A library. The motifs were generated using MEME algorithm (Bailey and Elkan 1994). Y-axis: bits.

[Fig. 13] Figure 13 represents the motif identified in the 50 sequences displaying the highest occurrence in R3B library. The motifs were generated using MEME algorithm (Bailey and Elkan 1994). Y-axis: bits.

[Fig. 14] Figure 14 represents a three-dimensional graph showing the fitness landscape of the R3A selection round generated from the map of the sequence space obtained upon sequence classification by the Self-Organizing Map (SOM) algorithm. X-axis: sequence space; Y-axis: % frequency, grey bar: R3A % frequency.

[Fig. 15] Figure 15 represents a three-dimensional graph showing the fitness landscape of the R3B selection round generated from the map of the sequence space obtained upon sequence classification by the SOM algorithm. X-axis: sequence space; Y-axis: % frequency, grey bar: R3B % frequency.

[Fig. 16] Figure 16 represents a contour plot of the R3B fitness landscape identifying 3 sequence clusters. Corresponding logo of the sequences contained in each cluster is also shown. X-axis: sequence space; Y-axis: sequence space, grey bar: frequency.

[Fig. 17] Figure 17 represents the functional analysis of SRB-2 mutants. Different sequences were selected from the self-organizing map (SOM) shown in figure 15. Each construct was in vitro transcribed in the presence of 500 nM of Gemini-561 and the fluorescence was monitored over time. The fluorescence apparition rate was computed and normalized to that of the parental SRB-2 aptamer. The values are the mean of n = 3 independent experiments. The error bars correspond to ± 1 s.d. The bars are grey-coded with respect to the SOM cluster from which the sequence was originally selected. The P3 sequence of each variant is given in the table under the plot. Sequences of both strands of the stem were concatenated into a single line.

[Fig. 18] Figure 18 represents the bioinformatic pipeline used in the invention.

[Fig. 19] Figure 19 represents the occurrence distribution of the sequences in R3 rounds. The occurrence of each sequence was measured for each round and a polynomial curve was fit to the data. The first minimum of the curve was used to define an occurrence threshold (vertical line) used as a cut-off value to filter out undesired minor sequences likely to result from PCR mutations or sequencing errors. Threshold = 21. X-axis: UDI occurrence. Y-axis: occurrence of occurrence. UDI distribution in library R3A.

[Fig. 20] Figure 20 represents occurrence distribution of the sequences in R3 rounds. The occurrence of each sequence was measured for each round and a polynomial curve was fit to the data. The first minimum of the curve was used to define an occurrence threshold (vertical line) used as a cut-off value to filter-out undesired minor sequences likely to result from PCR mutations or sequencing errors. Threshold = 32. Y-axis: occurrence of occurrence. UDI distribution in library R3A.
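The thresholding principle described for Figures 19 and 20 can be sketched as follows. This is a simplified illustration and not the pipeline of the invention: a moving average is used here in place of the polynomial fit mentioned in the legends, the function names are hypothetical, and keeping sequences whose occurrence is greater than or equal to the threshold is an assumption.

```python
from collections import Counter


def occurrence_histogram(sequence_counts):
    """Map each occurrence value to the number of sequences seen that often."""
    return Counter(sequence_counts.values())


def first_minimum(hist, window=3):
    """Return the occurrence value at the first local minimum of the histogram.

    The histogram is smoothed with a moving average (assumed stand-in for
    the polynomial fit) before the first minimum is searched.
    """
    raw = [hist.get(x, 0) for x in range(1, max(hist) + 1)]
    half = window // 2
    smooth = []
    for i in range(len(raw)):
        chunk = raw[max(0, i - half): i + half + 1]
        smooth.append(sum(chunk) / len(chunk))
    for i in range(1, len(smooth) - 1):
        if smooth[i] < smooth[i - 1] and smooth[i] <= smooth[i + 1]:
            return i + 1  # occurrence values start at 1
    return None


def filter_sequences(sequence_counts, threshold):
    """Discard sequences observed fewer times than the threshold."""
    return {seq: n for seq, n in sequence_counts.items() if n >= threshold}
```

Applied to a table of sequence occurrences, `first_minimum(occurrence_histogram(counts))` would play the role of the vertical line shown in the figures.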

[Fig. 21] Figure 21 represents Fitness landscape representations obtained including different information levels to the sequence codifying vector used to generate the SOM map using the TDT vector representing the sequence.

[Fig. 22] Figure 22 represents Fitness landscape representations obtained including different information levels to the sequence codifying vector used to generate the SOM map using the TDT vector representing the sequence including its folding energy.

[Fig. 23] Figure 23 represents Fitness landscape representations obtained including different information levels to the sequence codifying vector used to generate the SOM map using the TDT vector representing the sequence including its folding energy and the number of pairs formed at the P3 stem, considering 0 pairs formed if any pair exist with the L3 loop.

[Fig. 24] Figure 24 represents the robustness of the SOM map approach. SOM maps (Model 1) were generated independently from the R3B dataset using the TDT vector representing the sequence including its folding energy and the number of pairs formed at the P3 stem at the different selection rounds, considering 0 pairs formed if any pair exists with the L3 loop.

[Fig. 25] Figure 25 represents the robustness of the SOM map approach. SOM maps (Model 2) were generated independently from the R3B dataset using the TDT vector representing the sequence including its folding energy and the number of pairs formed at the P3 stem at the different selection rounds, considering 0 pairs formed if any pair exists with the L3 loop.

[Fig. 26] Figure 26 represents the robustness of the SOM map approach. SOM maps (Model 3) were generated independently from the R3B dataset using the TDT vector representing the sequence including its folding energy and the number of pairs formed at the P3 stem at the different selection rounds, considering 0 pairs formed if any pair exists with the L3 loop.

[Fig. 27] Figure 27 represents a confusion matrix comparing the different models with respect to the sequences found in R3B and present in the different identified clusters (Clusters 1, 2 and 3). The good correlation of the different replicates is shown by the elevated values found on the diagonal.

[Fig. 28] Figure 28 represents a graph showing the correlation between sequences and functionality of the 36 variants. The 36 variants were ordered according to their functionality relative to that of the SRB-2 aptamer. The values are the mean of n = 3 independent experiments and the error bars correspond to ±1 s.d.

[Fig. 29] Figure 29 represents a graph showing the correlation between functionality and the minimum free energy (MFE) of the 36 variants. The 36 variants were ordered according to the MFE of their P3-L3 stem-loop. The values are the mean of n = 3 independent experiments and the error bars correspond to ±1 s.d.

[Fig. 30] Figure 30 represents iSRB-34 aptamer sequence (SEQ ID NO: 84).

[Fig. 31] Figure 31 represents iSRB-34_CirPer aptamer sequence (SEQ ID NO: 89).

[Fig. 32] Figure 32 represents iSRB-34_CirPer_P2CG aptamer sequence (SEQ ID NO: 90). The U-A to C-G mutation is shown in bold.

[Fig. 33] Figure 33 represents a graph showing in vitro transcription monitoring of the iSRB-34 aptamer and its derivatives in the presence of Gemini-552-alk. An in vitro transcription mixture containing Gemini-552-alk was supplemented with DNA coding for iSRB-34, iSRB-34_CirPer or iSRB-34_CirPer_P2CG aptamers (respectively crosses, triangles and squares) or not (circles) and the orange fluorescence (ex. = 560 nm / em. = 600 nm) was monitored over time at 37 °C.
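Several of the legends above (Figures 8, 17 and 28) report a fluorescence apparition rate normalized to that of the parental SRB-2 aptamer, as the mean of n = 3 experiments ± 1 s.d. One possible way of computing such values is sketched below. The legends do not state how the rate was computed, so the linear-regression slope, the use of the sample standard deviation and the function names are assumptions made for the example.

```python
from statistics import mean, stdev


def slope(times, signal):
    """Least-squares slope of fluorescence versus time (assumed rate estimate)."""
    t_bar, s_bar = mean(times), mean(signal)
    num = sum((t - t_bar) * (s - s_bar) for t, s in zip(times, signal))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


def normalized_rate(times, variant_runs, reference_runs):
    """Mean and sample s.d. of the variant rates relative to the reference.

    Each run is a list of fluorescence readings taken at the given times;
    the reference rate is the mean slope of the reference runs.
    """
    ref = mean([slope(times, run) for run in reference_runs])
    relative = [slope(times, run) / ref for run in variant_runs]
    return mean(relative), stdev(relative)
```

A value of 1 then corresponds to the parental aptamer, and values above 1 to variants whose fluorescence appears faster.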

EXAMPLES

Example 1

The function of an RNA is intimately linked to its three-dimensional structure. X-ray crystallography or NMR allow the fine structural characterization of small RNAs (e.g., aptamers) with a precision down to atomic resolution. Yet, these techniques are time-consuming and laborious, and do not inform on mutational robustness and the extent to which a sequence can be modified without altering RNA function, an important set of information to assist RNA engineering. On the other hand, though powerful, in silico predictions still lack the required accuracy. These limitations can be overcome by using high-throughput microfluidic-assisted functional screening technologies, as they allow large mutant libraries to be explored in a rapid and cost-effective manner. Among them, the microfluidic-assisted In Vitro Compartmentalization (µIVC) was recently introduced, an efficient screening strategy in which reactions are performed in picoliter droplets at rates of several thousand per second. µIVC efficiency was later improved by using it in tandem with high-throughput sequencing, though a laborious bioinformatic step was still required at the end of the process. In the present work, the automation level of the pipeline was strongly increased by implementing an artificial neural network enabling unsupervised bioinformatic analysis. The efficiency of this “µIVC-Useq” technology was demonstrated by rapidly identifying a set of sequences readily accepted by a key domain of the light-up RNA aptamer SRB-2. This work not only sheds new light on the way this aptamer can be engineered, but also allowed new variants with an up to 10-fold improved performance to be easily identified.

MATERIAL AND METHODS

Library design

The template sequence was designed on the basis of the SRB-2 aptamer. The P3 stem was randomized (5’-GGAACCTCGCTTCGGCGATGATGGAGNNNNNCAAGGTTAACNNNNNCAGGTTCC-3’, SEQ ID NO: 37) and flanked with a 5’ (5’-GGGAGACAGCTAGAGTAC-3’, SEQ ID NO: 38) and a 3’ (5’-GTACACTGTGCTCGTGTC-3’, SEQ ID NO: 39) constant region to yield the SRB-2 P3N10-ext template (Table 3).
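As an illustration of the library design described above, the following minimal Python sketch assembles the extended template from the three sequences quoted in the text and reports where the randomized positions fall and how many variants they encode. The function and variable names are hypothetical and are not part of the original pipeline.

```python
# Sequences copied from the "Library design" paragraph (DNA, 5'->3').
FIVE_PRIME_CONSTANT = "GGGAGACAGCTAGAGTAC"    # SEQ ID NO: 38
THREE_PRIME_CONSTANT = "GTACACTGTGCTCGTGTC"   # SEQ ID NO: 39
RANDOMIZED_CORE = (                            # SEQ ID NO: 37
    "GGAACCTCGCTTCGGCGATGATGGAG" "NNNNN" "CAAGGTTAAC" "NNNNN" "CAGGTTCC"
)


def randomized_positions(core):
    """Return the 1-based positions of the randomized (N) nucleotides."""
    return [i for i, base in enumerate(core, start=1) if base == "N"]


def library_diversity(core, alphabet_size=4):
    """Number of distinct sequences encoded by the randomized positions."""
    return alphabet_size ** core.count("N")


# Hypothetical name for the assembled SRB-2 P3N10-ext template.
full_template = FIVE_PRIME_CONSTANT + RANDOMIZED_CORE + THREE_PRIME_CONSTANT

if __name__ == "__main__":
    print("Randomized positions:", randomized_positions(RANDOMIZED_CORE))
    print("Theoretical diversity:", library_diversity(RANDOMIZED_CORE))
    print("Template length:", len(full_template))
```

With four possible nucleotides at each of the ten randomized positions, the theoretical diversity is 4^10, i.e. about one million sequences.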

[Table 3] adaptor sequences

[N]20 corresponds to a randomized sequence used as Unique Droplet Identifier (UDI) barcode.

The two [N]s correspond to the randomized SRB-2 P3 stem.

Underlined sequence corresponds to the T7 RNA polymerase promoter.

Bolded sequences correspond to PCR primer annealing sites and have a 55°C < Tm < 65°C.

DNA amplification and barcoding

To allow the transcription and the identification of each variant of the library, a T7 RNA polymerase promoter and a Unique Droplet Identifier (UDI) random barcode (Autour et al. 2019) were appended to the 5’ end of the SRB-2 P3N10-ext template by PCR amplification. To do so, 1 pmol of the SRB-2 P3N10-ext template library was introduced in 100 µL of PCR mixture containing 0.5 µM of each primer 2.A (5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGaaNNNNNNNNNNNNNNNNNNNNaaTATcTAATACGACTCACTATAGGGAGACAGCTAGAGTAC-3’ SEQ ID NO: 48) and 1.B (5’-GTACACTGTGCTCGTGTC-3’ SEQ ID NO: 49), 0.2 mM each dNTPs, 1X Evagreen (Biotium), 2 U of Q5 DNA polymerase (New England Biolabs) and the corresponding buffer at the recommended concentration. 20 µL of this mixture was introduced into a qPCR machine (CFX-96, Bio-Rad) and was thermocycled starting with an initial denaturation step of 30 s at 95 °C followed by 40 cycles of 5 s at 95 °C, 30 s at 60 °C and 30 s at 72 °C. Upon determination of the threshold cycle (Ct), the remaining 80 µL of mixture were thermocycled for Ct+2 cycles. PCR products were gel-purified with the "Wizard® SV Gel and PCR Clean-up System" kit (Promega) prior to being quantified by NanoDrop™.

Then, a second PCR was performed to amplify the barcoded sequences. 0.1 pmol of the former PCR products was introduced in 100 µL of PCR mixture containing 0.5 µM of each primer 3.A (5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-3’ SEQ ID NO: 50) and 1.B (5’-GTACACTGTGCTCGTGTC-3’ SEQ ID NO: 51), 0.2 mM each dNTPs, 1X Evagreen (Biotium), 2 U of Q5 DNA polymerase (New England Biolabs) and the corresponding buffer at the recommended concentration. 20 µL of this mix were introduced into a qPCR machine (CFX-96, Bio-Rad) as above and, upon determination of the threshold cycle (Ct), the remaining 80 µL of mixture were thermocycled for Ct+2 cycles. PCR products were purified with the "Wizard® SV Gel and PCR Clean-up System" kit (Promega).

Droplet-based microfluidic screening

Microfluidic chips were fabricated in polydimethylsiloxane (PDMS) as described in (Ryckelynck et al. 2015).

Droplet digital PCR: DNA mutant libraries were diluted in a 200 µg/mL yeast total RNA solution (Ambion) to obtain the desired droplet occupancy. 1 µL of this dilution was then introduced in 100 µL of PCR mixture containing 0.5 µM of each primer 3.A (5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-3’ SEQ ID NO: 52) and 1.B (5’-GTACACTGTGCTCGTGTC-3’ SEQ ID NO: 53), 0.2 mM each dNTPs, 20 µM coumarin acetate (Sigma-Aldrich), 0.1 % Pluronic F68 (Sigma-Aldrich), 2 U of Q5 DNA polymerase (New England Biolabs) and the corresponding buffer at the recommended concentration. The mixture was loaded in a length of PTFE tubing and infused into a droplet generator microfluidic chip where it was dispersed in 2.5 pL droplets (production rate of about 12000 droplets/s) carried by Novec 7500 fluorinated oil (3M) supplemented with 3% of a fluorosurfactant (proprietary synthesis). Droplet production frequency was monitored in real time using an optical device and software developed by the team (Ryckelynck et al. 2015) and used to determine droplet volume. 2.5 pL droplets were generated by adjusting the pump flow rates (MFCS, Fluigent). The emulsion was collected in 0.2 mL tubes and subjected to an initial denaturation step of 2 min at 98°C followed by 30 cycles of: 10 sec at 98°C, 30 sec at 55°C, 30 sec at 72°C. Droplets were then reinjected into a droplet fusion microfluidic device.

Droplet fusion: PCR droplets were reinjected and spaced into a fusion device at a rate of about 1500 droplets/s. Each PCR droplet was then synchronized with a 16 pL in vitro transcription (IVT) droplet containing 2 mM each NTP (Larova), 25 mM MgCl2, 44 mM Tris-HCl pH 8.0 (at 25°C), 5 mM DTT, 1 mM Spermidine, 0.1 % of Pluronic F68 (Sigma-Aldrich), 1 µg of pyrophosphatase (Roche), 500 nM Gemini-561, 1 µM coumarin acetate (Sigma-Aldrich) and 17.5 µg/mL T7 RNA polymerase (purified in the laboratory). The IVT mixture was loaded in a length of PTFE tubing and kept on ice throughout the experiment. PCR droplets were spaced and IVT droplets produced using a dedicated stream of Novec 7500 fluorinated oil (3M) supplemented with 2% (w/w) of fluorosurfactant. Flow rates (MFCS, Fluigent) were adjusted to generate 16 pL IVT droplets and maximize the synchronization of 1 PCR droplet with 1 IVT droplet. Pairs of droplets were then fused with an AC field (400 V at 30 kHz) and the resulting emulsion was collected off-chip and incubated for 2 h at 37°C.

Droplet sorting: The emulsion was finally reinjected into an analysis and sorting microfluidic device at a frequency of about 150 droplets/s and spaced with a stream of surfactant-free Novec 7500 fluorinated oil (3M). The orange fluorescence (Gemini-561 in complex with an aptamer) of each droplet was analyzed and the most orange-fluorescent droplets were sorted (from 6.7% to 0.18% depending on the round of µIVC, Table S1). The gated droplets were deflected into the collecting channel by applying an AC field (1000 V at 30 kHz) and collected into a 1.5 mL tube. Sorted droplets were recovered from the collection tubing by flushing 200 µL of HFE fluorinated oil (3M). 100 µL of 1H,1H,2H,2H-perfluoro-1-octanol (Sigma-Aldrich) and 200 µL of 200 µg/mL yeast total RNA solution (Ambion) were then added and the droplets broken by vortexing the mixture. The DNA-containing aqueous phase was then transferred into a classical Eppendorf tube.

Amplification of sorted DNA

2 µL of the aqueous phase obtained upon droplet breaking were introduced in 100 µL of PCR mixture containing primers 2.A and 1.B as above (see § DNA amplification and barcoding) to reset the UDI carried by the DNA. This was essential to preserve the quantitative nature of the method over successive µIVC rounds. The DNA was treated as above and an aliquot of purified PCR product was subjected to a second PCR using primers 3.A and 1.B.

Libraries indexing for high-throughput sequencing

The starting library and those obtained upon each round of screening were indexed using Nextera technology (Illumina). First, a Nextera-compatible sequence was appended to the 3’ end of each gene. To do so, 2 µL of the aqueous phase obtained upon droplet breaking were introduced in 100 µL of PCR mixture containing 0.5 µM of each primer 3.A (5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-3’ SEQ ID NO: 54) and 2.B (5’-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGTACACTGTGCTCGTGTC-3’ SEQ ID NO: 55), 0.2 mM each dNTPs, 1X Evagreen (Biotium), 2 U of Q5 DNA polymerase (New England Biolabs) and the corresponding buffer at the recommended concentration. 20 µL of this mix were introduced into a qPCR machine (CFX-96, Bio-Rad) as above and, upon determination of the threshold cycle (Ct), the remaining 80 µL of mixture were thermocycled for Ct+2 cycles. PCR products were purified with the "Wizard® SV Gel and PCR Clean-up System" kit (Promega).

A second PCR was then performed to add Illumina indexes at both the 5’ and 3’ ends of each recovered DNA molecule. 0.1 pmol of the recovered first PCR product was introduced in 100 µL of PCR mixture containing 0.5 µM of each Nextera Illumina primer (N7 and N5; a different pair for each library to index), 0.2 mM each dNTPs, 1X Evagreen (Biotium), 2 U of Q5 DNA polymerase (New England Biolabs) and the corresponding buffer at the recommended concentration. 20 µL of this mix were introduced into a qPCR machine (CFX-96, Bio-Rad) as above and, upon determination of the threshold cycle (Ct), the remaining 80 µL of mixture were thermocycled for Ct+2 cycles. PCR products were purified with the "Wizard® SV Gel and PCR Clean-up System" kit (Promega). Libraries were finally loaded on a V3-150 chip (Illumina) and analyzed on a MiSeq sequencing platform (Illumina).

Bioinformatic sequence analysis

Sequencing data were analyzed using a custom Python bioinformatic pipeline in 10 main steps (Figure 18). First, fastq files were parsed using the Biopython library and only reads with a Q-score > 30 were conserved for the rest of the analysis (step 1). Then, the UDI and the 10-mer randomized regions were extracted from each read (step 2). A UDI occurrence cut-off was automatically set for each library (Figures 19 and 20). Sequences with an occurrence below that threshold were likely mutants (arising from PCR or sequencing errors) and were no longer considered for the rest of the analysis (step 3). Moreover, sequences displaying mutations outside of the randomized regions (i.e. the UDI and the P3 stem) were also filtered out. Next, identical sequences with different UDIs and the expected length were clustered together, and their occurrence measured (step 4), while the 10-mer randomized sequences of each selection round were isolated in parallel (step 5). Sequences of the P3/L3 stem-loop regions were used to compute the minimum free energy (MFE) and determine the number of base pairs formed using RNAup from the RNAlib python library of the ViennaRNA Package (step 6). Nucleotide sequences were codified in three-dimensional trajectory (TDT) vectors as described previously and, in some analyses, the MFE and the number of base pairs formed were added to the sequence TDT vector (step 7). Using the SOMPY python library, a Self-Organizing Map (SOM) of the sequence space was trained using the sequence TDT vectors generated in step 7, optionally appending the MFE, or the MFE and the number of base pairs formed, to the vector (step 8). A grid of 50x50 neurons with randomly generated weights was selected to represent the sequence space. To train the model, a rough training of 40 iterations with a radius of 10 was followed by a fine-tuning training of 80 iterations with a radius of 4. Next, a fitness landscape was constructed from the node grid generated by the SOM algorithm (step 9). For each node, a fitness value (Z axis of the fitness landscape) was calculated by summing the occurrence frequencies of all the sequences present on that node. Finally, neighboring nodes sharing a high fitness were clustered together in view of further analyzing their sequence content and features.
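The fitness value of step 9 can be sketched as follows. This is a minimal illustration and not the original pipeline: it assumes that the SOM has already assigned each sequence to a node of the 50x50 grid (its best-matching unit), and the names `node_of_sequence`, `occurrence` and `fitness_landscape` are hypothetical.

```python
def fitness_landscape(node_of_sequence, occurrence, grid_shape=(50, 50)):
    """Sum the occurrence frequencies of all sequences mapped to each node.

    node_of_sequence: dict mapping a sequence to its (row, col) node,
                      assumed to come from a previously trained SOM.
    occurrence:       dict mapping a sequence to its occurrence count.
    Returns a rows x cols nested list holding the fitness (Z axis) per node.
    """
    rows, cols = grid_shape
    total = sum(occurrence.values())
    landscape = [[0.0] * cols for _ in range(rows)]
    for sequence, (row, col) in node_of_sequence.items():
        # Frequency = share of this sequence among all counted sequences.
        landscape[row][col] += occurrence[sequence] / total
    return landscape


if __name__ == "__main__":
    # Toy data for illustration only (not experimental values).
    toy_nodes = {"GGUACGUACC": (0, 0), "GGACCGGUCC": (0, 0), "AAUUAUAAUU": (3, 4)}
    toy_counts = {"GGUACGUACC": 2, "GGACCGGUCC": 1, "AAUUAUAAUU": 1}
    z = fitness_landscape(toy_nodes, toy_counts, grid_shape=(5, 5))
    print(z[0][0], z[3][4])
```

Nodes that receive no sequence keep a fitness of zero, so peaks of the landscape correspond to groups of similar, frequently recovered sequences.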

Functional validation of selected sequences

For each tested sequence, a template oligonucleotide was chemically synthesized by IDT (Integrated DNA Technologies). 0.1 pmol of template was then added to 20 µL of PCR mixture containing 0.5 µM of each primer 1.A (5’-CTTTAATACGACTCACTATAGGGAGACAGCTAGAGTAC-3’ SEQ ID NO: 56, adding the T7 promoter) and 1.B (5’-GTACACTGTGCTCGTGTC-3’ SEQ ID NO: 57), 0.2 mM each dNTPs, 2 U of Q5 DNA polymerase (New England Biolabs) and the corresponding buffer at the recommended concentration. The PCR mixtures were thermocycled for 25 cycles starting with an initial step of denaturation of 30 s at 95°C followed by 40 cycles of 5 s at 95°C, 30 s at 60°C and 30 s at 72°C. PCR products were then purified with the "Wizard® SV Gel and PCR Clean-up System" kit (Promega) and quantified by NanoDrop™.

40 ng of purified DNA were then introduced in 40 µL of in vitro transcription mixture containing 2 mM each NTP (Larova), 25 mM MgCl2, 44 mM Tris-HCl pH 8.0 (at 25°C), 5 mM DTT, 1 mM Spermidine, 1 µg of pyrophosphatase (Roche), 17.5 µg/mL T7 RNA polymerase (purified in the laboratory) and 500 nM of Gemini-561. This mixture was then incubated at 37°C in a real-time thermocycler (Stratagene Mx3005P, Agilent Technologies) and the fluorescence of the reaction was monitored for 2 hours (ex/em 575 nm/602 nm). Note that library enrichment monitoring was performed following exactly the same procedure.

RESULTS

SRB-2 aptamer as a model aptamer

To set up and evaluate the potential of our technology to identify variants of interest in an unsupervised manner, we chose the SRB-2 aptamer (Figure 1) as a model RNA. This aptamer was originally isolated for its capacity to interact with the sulforhodamine B dye. Later on, SRB-2 was also found to be able to activate fluorogenic forms of sulforhodamine B as well as close homologues made of a dye conjugated to a dinitroaniline moiety inducing contact quenching. Moreover, we recently expanded the set of SRB-2 fluorogenic dyes by developing Gemini-561 (Figures 2 and 3), a dimerized form of sulforhodamine B that self-quenches by forming H-aggregates. While Gemini-561 is optimally activated by the light-up aptamer o-Coral (an SRB-2-derived aptamer), it can also form a fluorescent complex with SRB-2 to some extent.

From a structural point of view, the minimal form of SRB-2 was proposed to fold into 3 stems (P1, P2 and P3) closed by apical loops L2 and L3 and spaced by a long unpaired stretch J2/3 (Figure 1). Previous studies showed that P2 and L2 can be modified without compromising the function of the aptamer. Moreover, we recently showed that the sequence of P1 can be readily modified, provided a stem can still form. The remaining elements J2/3, P3 and L3 were originally predicted to encompass the sulforhodamine B-binding site. It is quite likely that J2/3 and L3 adopt a rather elaborate three-dimensional structure, stabilized by tertiary interactions, that poorly tolerates mutations. Properly deciphering the intimate organization of this dye-binding region will require a long and dedicated structural characterization using X-ray crystallography or NMR. Finally, whereas the existence of a P3 stem has been supported by the identification of sequence covariation, the actual tolerance of this region to sequence permutation as well as its degree of optimization have never been studied, making it an interesting model to validate our technology. Therefore, to shed further light on this region, we prepared a mutant library in which 10 out of the 12 nucleotides of this putative P3 stem were randomized (Figure 1). The analysis was limited to 10 positions to be able to screen the resulting library as exhaustively as possible (µIVC screening being currently limited to the analysis of 10 million mutants per experiment). The rather simple organization of this region should ease interpretation and validation of the approach while making it possible to determine the minimal length P3 should adopt and to identify possible sequence biases and, perhaps, sequences displaying properties better than those of the wild-type parental molecule.

Functional screening of the libraries

The P3 mutant library was obtained by chemical synthesis. The region encoding the SRB-2-derived aptamer was surrounded by constant regions (used for PCR amplification) and placed downstream of the T7 RNA polymerase promoter. Moreover, a Unique Droplet Identifier (UDI) was appended upstream of the T7 RNA polymerase promoter. By analogy with the Unique Molecular Identifiers (UMIs) used to precisely establish transcript copy-number in transcriptomic analyses, UDIs are sequences of 20 randomized nucleotides expected to be unique to each molecule of the library (10^12 different possible sequence permutations, with only 10^7 molecules tested at most during a conventional µIVC screening) and later used to cluster together molecules originating from the same droplet (see below). DNA molecules were then diluted into a PCR amplification mixture prior to being individualized into 2.5 pL water-in-oil (W/O) droplets (step i on Figure 4). Adjusting the DNA dilution made it possible to modulate the average starting number of genes per droplet, a value also known as λ, and in doing so to control the rate of multiple encapsulation events. Indeed, as DNA molecules distribute into the droplets according to Poisson statistics, knowing λ makes it possible to precisely calculate the fraction of droplets initially occupied by one or more genes (Table 1). λ was initially kept high (i.e., λ = 2) to maximize the number of analyzed molecules. Yet, once the theoretical sequence diversity decreased, λ was reduced in order to limit the co-encapsulation of several templates and gain in analysis accuracy. W/O droplets were then produced by infusing the aqueous phase into a microfluidic chip together with a fluorinated oil phase supplemented with a fluorosurfactant to stabilize the droplets. Upon production, droplets were thermocycled to PCR-amplify each gene into ~300,000 copies.
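The Poisson reasoning used above can be made concrete with a short Python sketch. It computes, for a given average number of genes per droplet (called `lam` in the code), the fraction of droplets that are empty, singly occupied or multiply occupied, and it also evaluates the number of possible 20-nucleotide UDIs. The function name is hypothetical, and the second occupancy value (0.2) is an arbitrary illustration rather than a value taken from Table 1.

```python
import math

# Number of distinct 20-nucleotide barcodes (4 possible bases per position).
UDI_DIVERSITY = 4 ** 20


def poisson_occupancy(lam):
    """Fractions of droplets holding 0, exactly 1, or more than 1 gene."""
    p_empty = math.exp(-lam)
    p_single = lam * math.exp(-lam)
    p_multiple = 1.0 - p_empty - p_single
    return {"empty": p_empty, "single": p_single, "multiple": p_multiple}


if __name__ == "__main__":
    # lam = 2 is the starting occupancy quoted in the text;
    # lam = 0.2 is a hypothetical lower occupancy shown for comparison.
    for lam in (2.0, 0.2):
        fractions = poisson_occupancy(lam)
        print(lam, {k: round(v, 4) for k, v in fractions.items()})
    print("Possible UDIs:", UDI_DIVERSITY)
```

The comparison shows the trade-off described in the text: a high occupancy fills most droplets but many of them contain several templates, whereas a low occupancy leaves most droplets empty but makes multiple encapsulation rare.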

[Table 1]

Table 1: µIVC experimental parameters. a: average number of template DNA molecules per droplet.

Then, droplets were re-injected into a droplet fusion microfluidic device in which they were spaced by a stream of fluorinated oil (supplemented with 2% fluorosurfactant) and synchronized one-to-one with larger (16 pL) droplets generated on-chip and containing an in vitro transcription (IVT) mixture supplemented with Gemini-561. Pairs of droplets were then fused together when passing between a pair of electrodes energized by an AC field (step ii on Figure 4). One-to-one pairing was maximized by monitoring the blue fluorescence (coumarin acetate added into both sets of droplets). Indeed, whereas PCR droplets were highly fluorescent (20 µM of coumarin acetate), IVT ones displayed a much lower signal (1 µM of coumarin acetate). As a consequence, the blue fluorescence of the merged droplets makes it possible to discriminate unfused IVT droplets (low blue fluorescence; 20-30 RFUs on Figures 5 and 6 and 10-20 RFUs on Figure 7) from those fused with one (moderate blue fluorescence; 50-70 RFUs on Figures 5 and 6 and 25-30 RFUs on Figure 7) or even two PCR droplets (high blue fluorescence; 80-90 RFUs on Figure 5, 90-100 RFUs on Figure 6 and 35-45 RFUs on Figure 7). In general, more than 86% of IVT droplets were fused with one PCR droplet (Table 1). Upon fusion, droplets were collected and incubated for 2 hours at 37°C to allow the genes to be transcribed and, if functional, the RNA to complex with Gemini-561 and activate its orange fluorescence.

Finally, droplets were re-injected into a sorting device, in which they were spaced by a surfactant-free oil stream prior to having their fluorescence analyzed (step iii on Figure 4). Those droplets displaying a blue fluorescence corresponding to single-fused droplets as well as a significant increase of their orange fluorescence (Gemini-561/RNA complex formation) were gated as positive (boxed populations on Figures 5, 6 and 7) and specifically sorted from the bulk. Upon sorting, droplets were broken, their DNA content was recovered by PCR and their original UDI was exchanged for a new one to be able to track individual droplets during the next round of screening.

Three rounds of such µIVC screening were performed while gently increasing the selection stringency. Indeed, the first round was performed at a high droplet occupancy (Table 1) to maximize the number of analyzed molecules, and we selected any droplet displaying an orange fluorescence above the background (Figure 5) to limit the loss of molecules of interest. The second round was performed at a lower droplet occupancy and only droplets displaying a significant fluorescence were recovered (Fig. 3B). Proper enrichment of the libraries was confirmed by transcribing the libraries in microtiter plates in the presence of Gemini-561. Indeed, a slight but progressive increase of the R1 and R2 average fluorescence was observed with respect to the R0 library (Figure 8). The resulting R2 enriched library was subjected to a third round of screening during which two sorting gates were used: a relaxed yet selective gate (population R3A, boxed in black on Figure 7) and a more stringent gate (population R3B, boxed in red on Figure 7) expected to contain aptamers that are in general more efficient than those exclusive to R3A. As a result of these slightly more stringent selections, both the R3A and R3B libraries displayed a marked increase in fluorescence, with R3B performing better than R3A as expected (Figure 8). Since the objective was not necessarily to identify improved aptamers but to preserve some sequence diversity, it was decided to stop the process after this third round. Nevertheless, both libraries displayed a higher fluorescence than the parental SRB-2 molecule, suggesting that improved mutants were likely present in these libraries. Sequences contained in both R3 libraries were indexed together with the starting library and the three libraries (R0, R3A and R3B) were finally sequenced on a MiSeq high-throughput sequencing platform.

Unsupervised bioinformatic sequencing data processing allows significant data reduction

Upon sequencing, reads were QC-filtered and those displaying mutations outside the initially randomized regions (i.e., UDI and P3) were discarded (see the overall analytical pipeline on Figure 18). Next, sequences sharing the same UDI were considered to originate from the same droplet and were clustered together. At this stage, only those sequences with an occurrence above an automatically computed threshold (Figures 19 and 20) were conserved. Indeed, as explained in more detail elsewhere (Autour et al. 2019), UDI/sequence pairs with an occurrence below that threshold were likely point mutants arising from PCR or sequencing errors and were therefore no longer considered in the rest of the analysis. Then, counting the number of different UDIs associated with each P3 sequence gave the number of droplets containing this sequence, and thus made it possible to precisely compute the enrichment of each sequence.
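The UDI-based counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: reads are assumed to have already been reduced to (UDI, P3 sequence) pairs, the occurrence threshold is passed in by the caller because its automatic computation is not reproduced here, and all names (`droplets_per_sequence`, `enrichment`, `min_occurrence`) are hypothetical.

```python
from collections import Counter, defaultdict


def droplets_per_sequence(read_pairs, min_occurrence):
    """Count, for each P3 sequence, the distinct UDIs (droplets) carrying it.

    read_pairs:     iterable of (udi, p3_sequence) tuples, one per read.
    min_occurrence: UDI/sequence pairs seen fewer times than this are
                    treated as PCR or sequencing errors and dropped.
    """
    pair_counts = Counter(read_pairs)
    udis_by_sequence = defaultdict(set)
    for (udi, p3_sequence), n_reads in pair_counts.items():
        if n_reads >= min_occurrence:
            udis_by_sequence[p3_sequence].add(udi)
    return {seq: len(udis) for seq, udis in udis_by_sequence.items()}


def enrichment(droplets_before, droplets_after):
    """Ratio of a sequence's frequency after vs. before selection."""
    total_before = sum(droplets_before.values())
    total_after = sum(droplets_after.values())
    return {
        seq: (droplets_after[seq] / total_after)
        / (droplets_before[seq] / total_before)
        for seq in droplets_after
        if seq in droplets_before
    }


if __name__ == "__main__":
    # Toy reads for illustration only (not experimental data).
    reads = [("UDI1", "GGUACGUACC")] * 3 + [("UDI2", "GGUACGUACC")] * 2
    reads += [("UDI3", "GGUACGUACA")]  # singleton, treated as an error
    print(droplets_per_sequence(reads, min_occurrence=2))
```

Counting droplets rather than reads is what keeps the measure quantitative: PCR amplifies every template, but all copies made in one droplet share a single UDI.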

The minimum free energy (MFE) of each P3/L3 sequence was computed using RNAfold from the ViennaRNA package, and the region was found to be all the more structured as the selection stringency increased (Figure 9). Indeed, while 2 to 6 base pairs tend to form in the R0 library, this number was strongly biased toward the formation of 6 base pairs in libraries R3A and R3B (Figure 10), confirming the higher structuration of the selected molecules. Moreover, in the starting library (R0) most of the contacts are formed with the L3 loop and only a very small fraction of molecules form base pairs only in the P3 region (Figure 11). This contrasts with the molecules contained in R3A and R3B, which rather tend to form the parental P3 stem and no interaction with L3 (Figures 10 and 11). These data not only confirm that, to be functional, the aptamer should adopt an SRB-2-like P3/L3 structure made of a 6 base pair-long stem closed by an L3 region that should stay free of interaction with P3, but they also demonstrate the biological relevance of our analysis. Deeper sequence analysis did not reveal any marked sequence preference other than an A31-U42 base pair closing L3 and an overall A/U richness in R3A (Figure 12). Interestingly, in the better-fit R3B library a subset of sequences tends to display an increased G/C content, especially a C31-G42 base pair to which the parental sequence (G31-C42) does not conform (Figure 13).
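The distinction drawn above between base pairs formed within P3 and contacts involving L3 can be illustrated with a small Python sketch that reads a predicted secondary structure in dot-bracket notation (the notation output by folding programs such as RNAfold). The sketch assumes a 22-nucleotide P3/L3 region made of a 6-nucleotide 5’ strand, a 10-nucleotide L3 loop and a 6-nucleotide 3’ strand, a layout inferred from the template sequence. The function names are hypothetical and the structures in the example are hand-written, not folding results.

```python
def pair_table(structure):
    """Map each paired position (0-based) to its partner in a dot-bracket string."""
    stack, pairs = [], {}
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            partner = stack.pop()
            pairs[index] = partner
            pairs[partner] = index
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def count_stem_and_loop_pairs(structure, stem_len=6, loop_len=10):
    """Return (pairs formed only between stem strands, pairs involving the loop)."""
    loop = range(stem_len, stem_len + loop_len)
    stem_pairs = loop_pairs = 0
    for i, j in pair_table(structure).items():
        if i > j:
            continue  # count each base pair once
        if i in loop or j in loop:
            loop_pairs += 1
        else:
            stem_pairs += 1
    return stem_pairs, loop_pairs


if __name__ == "__main__":
    # Hand-written structures: a fully formed 6-bp stem, then a fold in
    # which part of the 5' strand pairs with loop nucleotides instead.
    print(count_stem_and_loop_pairs("((((((..........))))))"))
    print(count_stem_and_loop_pairs("..((((....))))........"))
```

In this representation, the functional molecules described in the text correspond to structures with six stem pairs and no loop contact.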

To further reduce the information and identify sub-populations of sequences while maintaining a high degree of automation and a low level of supervision, the fitness landscape of the R0, R3A and R3B libraries was studied. To do so, sequences were first organized in a two-dimensional plane using a Self-Organizing Map (SOM), an artificial neural network. The randomized sequence (i.e., nucleotides 27 to 31 and 42 to 46) of each molecule (note that the map was restricted to those sequences seen upon sequencing and that only 613,459 of the million expected sequences are represented) was first converted into a vector in which nucleotide identity was coded by its three-dimensional (3D) trajectory (TDT). Briefly, each of the four nucleotides was assigned to one point in 3D space, the relative positions of the nucleotides being determined by the 3D coordinates of the four vertices of a regular tetrahedron. Using the sequence as the only information led to many (18 to 68) clusters (Figure 21), leaving as many molecules to be functionally tested. While adding the free energy of the molecule to the vector did not significantly reduce the number of peaks (Figure 22), including in the vector both the free energy and the number of base pairs formed in the P3 region made it possible to group most of the R3A and R3B sequences into two main clusters (Figures 14 and 15 and Figure 23). Remarkably, looking more closely at the content of each peak revealed that, whereas cluster 1 contains mainly the A/U-rich sequences highlighted above, the sequences contained in cluster 2 are rather rich in G/C (Figure 16). Finally, cluster 3 contained those sequences excluded from the two other clusters, for which no composition preference was expected. Though the process starts with weights randomly assigned to each of the map nodes during the initiation (learning) step, it appeared to be quite robust, since repeating it 3 times always allowed both clusters 1 and 2 to be identified (though at variable coordinates on the map due to the random initialization of the network) with similar sequence content (Figures 24, 25, 26 and 27).
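The tetrahedron-based encoding described above can be sketched as follows. Several points are assumptions made for illustration only: the particular vertex assigned to each nucleotide, the use of plain concatenation of the per-nucleotide coordinates (the text does not state whether coordinates are also accumulated along the sequence), and the function name `tdt_vector`. The free energy and base-pair values in the example are placeholders, not computed results.

```python
# Four vertices of a regular tetrahedron; which nucleotide sits on which
# vertex is an arbitrary choice made for this sketch.
TETRAHEDRON = {
    "A": (1.0, 1.0, 1.0),
    "C": (1.0, -1.0, -1.0),
    "G": (-1.0, 1.0, -1.0),
    "U": (-1.0, -1.0, 1.0),
}


def tdt_vector(randomized, mfe=None, n_pairs=None):
    """Encode a randomized sequence as concatenated 3D vertex coordinates.

    randomized: the 10 randomized nucleotides (positions 27-31 and 42-46).
    mfe:        optional free energy appended as an extra component.
    n_pairs:    optional number of base pairs appended as an extra component.
    """
    sequence = randomized.upper().replace("T", "U")
    vector = []
    for base in sequence:
        vector.extend(TETRAHEDRON[base])
    if mfe is not None:
        vector.append(float(mfe))
    if n_pairs is not None:
        vector.append(float(n_pairs))
    return vector


if __name__ == "__main__":
    # Placeholder feature values for illustration only.
    vec = tdt_vector("GGUACGUACC", mfe=-5.0, n_pairs=6)
    print(len(vec), vec[:6])
```

Because the four vertices are equidistant, no pair of nucleotides is treated as more similar than another, which is not the case for an arbitrary one-dimensional numbering of the bases.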

Functional validation and identification of improved sequences

In order to functionally validate the bioinformatic clustering, several sequences representative of each cluster were tested (Figure 17). The functional clustering was found to closely match the bioinformatic one. Indeed, the sequences coming from the G/C-rich cluster 2 tended to cluster together as those displaying the best function, an observation in good agreement with the earlier observation that G/C-rich sequences better accumulated in the best-fit R3B library (see above). Moreover, sequences from cluster 1 formed a distinct functional group that, even though displaying a lower fluorescence than cluster 2 variants, had a fluorescence significantly above that of sequences taken from cluster 3. Furthermore, among cluster 2 sequences, the two most represented variants (sequences 34 and 35 on Figure 17 - Table 2) formed with Gemini-561 a complex an order of magnitude more fluorescent than that of the parental SRB-2. Following the nomenclature used in our earlier work on the Spinach and Mango III aptamers, we named these aptamers iSRB-2A (27GGUAC31-42GUACC46) and iSRB-2B (27GGACC31-42GGUCC46). Notably, both sequences possess the C31-G42 base pair predicted above and form two GC base pairs (G27-C46 and G28-C45) stabilizing the base of the P3 stem. Looking only at the free energy and the number of base pairs formed does not make it possible to distinguish these variants from the parental SRB-2 (27AGGCG31-42CGCCU46) sequence. Therefore, additional characterization will be needed in the future to decipher the origin of the 10-fold improvement observed. Altogether, these observations provide a strong functional validation of our unsupervised approach combining ultrahigh-throughput functional screening and unsupervised bioinformatic analysis.

[Table 2]

Table 2: Sequences of both strands of the stem of the indicated variants are separated by /.

DISCUSSION

For a long time, the analysis of nucleic acids isolated by SELEX and other in vitro selection procedures was limited to the few tens of sequences that are generally the most represented ones in the final population. However, technological advances like next generation sequencing (NGS) now allow the whole selection process to be characterized at once, and NGS is commonly used to assist hit identification. The global view offered by such analyses makes it possible to rapidly identify variants of interest, even when they are underrepresented at the end of the selection process. Yet, these approaches may still require significant manual analyses and may be tedious. In the present work, we further increased the automation level by using NGS in tandem with an artificial neural network algorithm. An important point in the development of this new methodology was the proper encoding of nucleotides in a vector format allowing the Self-Organizing Map (SOM) algorithm to properly handle sequences. Initial attempts to encode the nucleotides composing a sequence in a binary format failed to produce convincing clustering (data not shown), driving us to explore alternative encoding formats. Eventually, it was found that using the TDT codification, in which each nucleotide of a given sequence is encoded by the coordinates of a vertex of a regular tetrahedron prior to being concatenated in its 3D sequence trajectory, gave the best results. Yet, this was not sufficient to get tight clustering, and additional information about energy and base pairing had to be included as well. Note that all this information was directly collected by the algorithm, therefore preserving the use of a unique pipeline incorporating all the functionalities (Figure 18). Using this pipeline, the overall sequence information of a functional screening process initiated with about 1 million different sequences was reduced down to 2 clusters of functional sequences, one of which contained at least two major sequences endowed with a significantly improved (about 10 times better) function. Therefore, applying this strategy to the analysis of other selection processes may further improve the discovery rates of highly efficient molecules. In the work described herein, the inventors used this unsupervised bioinformatic analysis in tandem with the ultrahigh-throughput microfluidic-assisted functional screening the inventors originally named µIVC. µIVC has previously demonstrated its great efficiency at selecting RNAs endowed with functions (e.g., catalysis, fluorogen lighting-up) involving a phenotype (e.g., substrate cleavage, fluorescence emission) that physically dissociates from the RNA. Indeed, by confining biological reactions into picoliter-volume droplets, this technology makes it possible to functionally and accurately assay millions of molecules in a single experiment. Though robust, this technology initially suffered from the same limitation as other in vitro selection technologies, i.e., the analysis of only a small subset of the most abundant sequences. We recently reported on a first level of improvement by using µIVC in tandem with NGS (a method called µIVC-seq in short) to rapidly identify optimized communication modules during the development of small molecule biosensors. However, the computer-assisted analysis of the selected sequences still required extensive intervention by the experimenter, which is no longer needed in the present format of the µIVC pipeline, which we propose to call µIVC-Useq for µIVC coupled with Unsupervised sequence analysis.

The light-up RNA aptamer SRB-2 was chosen as model system to set-up and demonstrate the efficiency of pIVC-Useq. This aptamer has been extensively used for the development of new efficient RNA imaging tools either by directly using the RNA as is or by further evolving it to recognize other fluorogens. However, despite this strong interest and the great perspectives of this aptamer, the exact tridimensional structure of the molecule remains unknown, leaving important questions on its intimate working mechanism and capacity to sustain engineering largely unanswered. Indeed, whereas P1 stem and P2/L2 stem-loop were successfully modified without compromising SRB-2 sulforhodamine B-binding capacity, less was known about P3/L3 stem-loop except that, together with J2/3, it encompasses the sulforhodamine B-binding site. Therefore, to shed further light on this element a mutant library was prepared in which 5 of the 6 base pairs of P3 were randomized. Performing 3 rounds of pIVC screening on this library followed by NGS sequencing and the use of our unsupervised bioinformatic pipeline allowed us to confirm the existence of the up-to-now putative P3 stem as well as to draw several new important conclusions on the P3/L3 region of SRB-2. First, all the functional molecules possess 6 base pairs long P3 helix. Therefore, the length of P3 should play an important role in the proper structuration of sulforhodamine B-binding site since, in the inventors’ experiment, the RNA had the possibility to acquire shorter stems, which it did not. Second, the sequence of the stem should not allow interaction with L3 to take place as sequences able to establish such contact, though present in the starting library, were strongly counter selected. This observation further supports a key function of L3 loop in the recognition of the dye. 
Third, though a significant variety of sequences able to form the required stem was tolerated by the molecule, these sequences led to a wide range of phenotypes spanning an order of magnitude. Therefore, even though they are conservative and would be predicted as being optimal by computer-assisted RNA folding prediction software, not all sequences are tolerated in the same way. This is typically exemplified in our study by the iSRB-2A and B mutants, which both have the same overall free energy as the parental SRB-2 but display a 10-fold better capacity to activate Gemini-561 fluorescence. The exact mechanism by which these mutations act would require a dedicated study that is beyond the scope of the present one. Nevertheless, at this stage, one may imagine different scenarios, such as the direct promotion of extra stabilizing tertiary contacts with the loop elements or, conversely, a more indirect effect by which the mutant sequences would prevent or limit the formation of undesirable alternative folds likely to reduce the fraction of molecules competent to bind the dye. Whatever the mechanism at work, it is extremely unlikely that any rational design approach would have been able to predict such sequences, further reinforcing the need for high-throughput technologies like pIVC-seq and now pIVC-Useq to properly assist the design and engineering of RNA molecules.

Example 2

Following the identification of the iSRB1 and 2 mutants introduced in the accompanying manuscript, we decided to individually test additional sequences to better refine the sequence preferences of the motif. Indeed, the iSRBs both share the motif 5’- G1G2N3N3C4...G5N6N7C8C9-3’, which gives 36 sequence permutations when only canonical base pairs (AU, UA, GU, UG, GC or CG) are allowed to form.
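The figure of 36 follows from letting each of the two variable base pairs of the motif adopt any of the six listed pairings (6 x 6). The count can be reproduced with a minimal Python sketch; the function name and the representation of a variant as a tuple of two-letter pairings are illustrative assumptions and are not part of the described method.

```python
from itertools import product

# The six pairings listed in the text, written with the 5'-strand nucleotide first.
CANONICAL_PAIRS = ["AU", "UA", "GU", "UG", "GC", "CG"]

def enumerate_variable_pairs(n_variable_pairs=2):
    """Return every combination of pairings for the variable positions of the stem.

    Hypothetical helper: assumes the variable positions pair independently.
    """
    return list(product(CANONICAL_PAIRS, repeat=n_variable_pairs))

variants = enumerate_variable_pairs()
print(len(variants))  # 36
```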

Method:

For each sequence containing the motif 5’-G1G2N3N3C4...G5N6N7C8C9-3’ and forming a fully paired P3 stem (36 sequences), a template oligonucleotide was chemically synthesized by IDT (Integrated DNA Technologies) and amplified by PCR. PCR products were then purified and quantified. 40 ng of purified DNA were then introduced into 40 µL of in vitro transcription mixture supplemented with 400 nM of G-561 fluorogen. The mixture was then incubated at 37°C in a real-time thermocycler (Stratagene Mx3005P, Agilent Technologies) and the fluorescence of the reaction was monitored for 2 hours (ROX channel: ex/em 585 nm/610 nm).

Results:

The 36 iSRB variants were clustered into 4 groups according to the presence of purine (R) and/or pyrimidine (Y) bases at positions 3-4 and 6-7 of the motif. Groups were named after the nature of the nucleotides at positions 3 and 4: YY, RR, YR or RY. Orange fluorescence steadily increased in the tubes containing each of the iSRB variants, whereas no signal was observed in the absence of template, demonstrating the capacity of these variants to activate G-561 fluorescence. The G-561 activation capacity of each aptamer was then quantified by normalizing the slope of the fluorescence increase kinetics of the variant to that obtained with the parental SRB-2 (Figures 28 and 29).
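The grouping and the normalization described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the slope is taken here as an ordinary least-squares fit over the whole trace (the text does not specify how the slope was computed), the function names are hypothetical, and the numbers are made up for the example and are not measured data.

```python
PURINES = {"A", "G"}

def ry_group(nt3, nt4):
    """Name the group after the purine (R) or pyrimidine (Y) nature of positions 3 and 4."""
    return "".join("R" if nt in PURINES else "Y" for nt in (nt3, nt4))

def slope(times, signal):
    """Ordinary least-squares slope of signal versus time (assumed fitting method)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_s = sum(signal) / n
    num = sum((t - mean_t) * (s - mean_s) for t, s in zip(times, signal))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den

def relative_activation(times, variant_signal, parental_signal):
    """Slope of the variant trace divided by the slope of the parental trace."""
    return slope(times, variant_signal) / slope(times, parental_signal)

# Made-up numbers for illustration only (not measured data).
times = [0, 10, 20, 30]           # minutes
parental = [100, 110, 120, 130]   # arbitrary fluorescence units, 1 unit per minute
variant = [100, 150, 200, 250]    # arbitrary fluorescence units, 5 units per minute

print(ry_group("A", "U"))                             # RY
print(relative_activation(times, variant, parental))  # 5.0
```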

Apart from variants #64 and #69, all the tested aptamers carrying the 5’- G1G2N3N3C4...G5N6N7C8C9-3’ motif displayed a higher fluorescence than SRB-2, yet with different efficiencies (Figure 28). Indeed, variants belonging to the RY or YR groups (alternation of purine and pyrimidine bases) tended to display a higher capacity to activate G-561 (up to 12-fold in comparison with the SRB-2 aptamer). Yet, it is likely that R/Y alternation is not the only factor influencing aptamer functionality and that more complex effects are at work in parallel. Classifying the variants according to the minimum free energy (MFE) of their P3 stems did not reveal any obvious correlation with the level of functionality (Figure 29), suggesting that effects more subtle than structure stability may be involved here.

Example 3 Circular permutation of iSRB-34 aptamer and activation of Gemini-552-alk

This example shows that the iSRB-34 aptamer (SEQ ID NO: 84) can be circularly permutated and that both versions of the aptamer are able to efficiently activate Gemini-552-alk fluorescence.

iSRB-34: (SEQ ID NO: 84)

iSRB-34_CirPer (SEQ ID NO: 89)

5’-GGGAGACAGCUAGAGUACGCGAUGAUGGAGGGUACCAAGGUUAACGUACCCAGGUUCCUUCGGGAACCUCGCGACACGAGCACAGUGUAC-3’

iSRB-34_CirPer_P2CG (SEQ ID NO: 90)

5’-GGGAGACAGCUAGAGUACGCGGUGAUGGAGGGUACCAAGGUUAACGUACCCAGGUUCCUUCGGGAACCCCGCGACACGAGCACAGUGUAC-3’

Method:

Template DNA sequences (iSRB-34: 5’-GGGAGACAGCTAGAGTAC GGAACCTCGCTTCGGCGATGATGGAGGGTACCAAGGTTAACGTACCCAGGTTCC GACACGAGCACAGTGTAC-3’; iSRB-34_CirPer: 5’-GGGAGACAGCTAGAGTACGCGATGATGGAGGGTACCAAGGTTAACGTACCCAGGTTCCTTCGGGAACCTCGCGACACGAGCACAGTGTAC-3’; and iSRB-34_CirPer_P2CG: 5’-GGGAGACAGCTAGAGTACGCGGTGATGGAGGGTACCAAGGTTAACGTACCCAGGTTCCTTCGGGAACCCCGCGACACGAGCACAGTGTAC-3’), in which the iSRB-34 coding region, circularly permutated or not (underlined sequence), was surrounded by two extensions (italicized sequence), were first PCR-amplified by introducing 1 ng of template DNA (Integrated DNA Technologies) into 100 µL of reaction mixture containing 50 pmoles of Fwd primer (5’-CTTTAATACGACTCACTATA GGGAGACAGCTAGAGTAC-3’ (SEQ ID NO: 90), bearing the T7 RNA polymerase promoter (bolded sequence) upstream of the template), 50 pmoles of Rev primer (5’-GTACACTGTGCTCGTGTC-3’ (any one of the sequences SEQ ID NO: 39, 44, 49, 51, 53 and 57)), 0.2 mM of each dNTP, 1 U of Q5 DNA polymerase (New England Biolabs) and the corresponding buffer at the recommended concentration. PCR mixtures were then thermocycled, starting with an initial denaturation step of 30 sec at 95°C followed by 25 cycles of 5 sec at 95°C and 30 sec at 60°C. PCR products were purified using the “Monarch PCR purification kit” (New England Biolabs) following the supplier’s recommendations, and the recovered DNA was quantified using a NanoDrop™.
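The sequences listed in this Method can be cross-checked in silico. The Python sketch below tests three things: that the 3’ end of the Fwd primer matches the 5’ extension, that the Rev primer is the reverse complement of the 3’ extension, and that the iSRB-34 and iSRB-34_CirPer coding regions describe the same circle once each is closed with a TTCG linker. The function names are hypothetical, and both the extension boundaries and the TTCG linker are read off the listed sequences, so they should be taken as assumptions of this sketch rather than as statements of the method.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def core(template, ext5, ext3):
    """Strip the two flanking extensions and return the region in between."""
    if not (template.startswith(ext5) and template.endswith(ext3)):
        raise ValueError("template lacks the expected extensions")
    return template[len(ext5):len(template) - len(ext3)]

def same_circle(seq_a, seq_b, linker):
    """True if closing both sequences with `linker` gives the same circular sequence."""
    if len(seq_a) != len(seq_b):
        return False
    return (seq_b + linker) in (seq_a + linker) * 2

# Sequences copied from the text (extension boundaries are an assumption).
EXT5 = "GGGAGACAGCTAGAGTAC"
EXT3 = "GACACGAGCACAGTGTAC"
FWD = "CTTTAATACGACTCACTATA" + "GGGAGACAGCTAGAGTAC"
REV = "GTACACTGTGCTCGTGTC"
ISRB34 = EXT5 + "GGAACCTCGCTTCGGCGATGATGGAGGGTACCAAGGTTAACGTACCCAGGTTCC" + EXT3
CIRPER = EXT5 + "GCGATGATGGAGGGTACCAAGGTTAACGTACCCAGGTTCCTTCGGGAACCTCGC" + EXT3

core_34 = core(ISRB34, EXT5, EXT3)
core_cp = core(CIRPER, EXT5, EXT3)

print(FWD.endswith(EXT5))                     # Fwd primer anneals to the 5' extension
print(reverse_complement(REV) == EXT3)        # Rev primer anneals to the 3' extension
print(same_circle(core_34, core_cp, "TTCG"))  # circular permutation check
```

All three checks are expected to print True for the sequences as listed.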

20 ng of purified PCR product were then introduced into 40 µL of in vitro transcription mixture containing 2 mM of each NTP (Larova), 25 mM MgCl2, 44 mM Tris-HCl pH 8.0 (at 25°C), 5 mM DTT, 1 mM spermidine, 1 µg of pyrophosphatase (Roche), 400 nM Gemini-552-alk and 17.5 µg/mL T7 RNA polymerase (prepared in the laboratory). This mixture was then incubated at 37°C in a microplate reader (SpectraMax iD3, Molecular Devices) and the orange fluorescence (ex./em., 560 nm/600 nm) was monitored every minute for 2 hours.

Results:

Orange fluorescence steadily increased in tubes containing the iSRB-34 (Figure 1.A; crosses in Figure 33), iSRB-34_CirPer (Figure 31; triangles in Figure 33) or iSRB-34_CirPer_P2CG (Figure 32; squares in Figure 33) templates, whereas no signal was observed in a template-free reaction (circles in Figure 33).

These data demonstrate that the fluorescence of Gemini-552-alk can be activated by the iSRB-34 aptamer and that iSRB-34 tolerates circular permutation with little loss of functionality, or even none when the U-A pair in P2 is replaced by a C-G pair (bolded nucleotides of Figure 32).
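The difference between the two circularly permutated RNAs can be located by a position-by-position comparison of the sequences given for SEQ ID NO: 89 and SEQ ID NO: 90. The Python sketch below is illustrative only; the function name is hypothetical, and interpreting the two changed positions as the two partners of one P2 base pair is an inference from the text rather than something the code establishes.

```python
def substitutions(ref, alt):
    """List (0-based index, ref nucleotide, alt nucleotide) wherever two equal-length sequences differ."""
    if len(ref) != len(alt):
        raise ValueError("sequences must have the same length")
    return [(i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]

# RNA sequences copied from the text.
CIRPER_RNA = (
    "GGGAGACAGCUAGAGUAC"
    "GCGAUGAUGGAGGGUACCAAGGUUAACGUACCCAGGUUCCUUCGGGAACCUCGC"
    "GACACGAGCACAGUGUAC"
)
P2CG_RNA = (
    "GGGAGACAGCUAGAGUAC"
    "GCGGUGAUGGAGGGUACCAAGGUUAACGUACCCAGGUUCCUUCGGGAACCCCGC"
    "GACACGAGCACAGUGUAC"
)

diffs = substitutions(CIRPER_RNA, P2CG_RNA)
for index, ref_nt, alt_nt in diffs:
    # Report 1-based positions, as is usual for sequence coordinates.
    print(index + 1, ref_nt, "->", alt_nt)
```

With the sequences as listed, exactly two positions differ (A to G and U to C), which is what replacing a single A-U pairing by a G-C pairing would produce.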
