Title:
METHODS FOR PROCESSING BREAST TISSUE SAMPLES
Document Type and Number:
WIPO Patent Application WO/2024/097838
Kind Code:
A2
Abstract:
Provided herein according to some aspects is a method for processing a tissue sample from a subject, the sample comprising cells of a breast tissue site comprising or suspected of comprising ductal carcinoma in situ (DCIS), and detecting an expression level of a plurality of genes in the cells. Also provided according to some aspects is a method for generating a classifier capable of determining a risk of DCIS recurrence and/or progression. Further provided is a system for determining the risk of DCIS recurrence and/or progression in a subject in need thereof.

Inventors:
HWANG EUN-SIL (US)
WEST ROBERT B (US)
Application Number:
PCT/US2023/078463
Publication Date:
May 10, 2024
Filing Date:
November 02, 2023
Assignee:
UNIV DUKE (US)
UNIV LELAND STANFORD JUNIOR (US)
International Classes:
C12Q1/6886; G16B20/00
Attorney, Agent or Firm:
MURPHY, Sherry L. (US)
Claims:
We Claim:

1. A method for processing a tissue sample (e.g., biopsy) from a subject, comprising:

(a) providing the sample from the subject, said sample comprising cells of a breast tissue site of interest, said site of interest comprising or suspected of comprising ductal carcinoma in situ (DCIS) (e.g., suspected based on an abnormal mammogram), wherein said cells comprise a plurality of messenger ribonucleic acid (mRNA) molecules; and

(b) optically detecting an expression level of said plurality of mRNA molecules to thereby quantify expression levels of a plurality of genes in the cells.

2. The method of claim 1, wherein (b) comprises reverse transcribing said plurality of mRNA molecules to generate a plurality of complementary deoxyribonucleic acid (cDNA) molecules, and subsequently optically detecting said plurality of cDNA molecules.

3. The method of claim 2, further comprising, prior to optically detecting, performing nucleic acid amplification of the plurality of cDNA molecules.

4. The method of claim 3, wherein said nucleic acid amplification comprises polymerase chain reaction (PCR) or isothermal amplification.

5. The method of claim 2, wherein said optically detecting comprises detecting an optical signal from a probe coupled to a cDNA molecule of said plurality of cDNA molecules.

6. The method of claim 5, wherein said optical signal is a fluorescent signal.

7. The method of claim 1, further comprising processing said cells to access (and optionally extract) the plurality of mRNA molecules prior to said optically detecting.

8. The method of claim 1, wherein said sample comprises a heterogeneous mixture of cells (e.g., mixed epithelial and stromal cells) (e.g., from a core biopsy or lumpectomy).

9. The method of claim 1, wherein the subject has undergone surgery for DCIS (i.e., lumpectomy).

10. The method of claim 1, wherein the subject has not undergone surgery for DCIS.

11. The method of claim 1, wherein said plurality of genes comprises at least 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90 or 100 of the genes listed in Table 1.

12. The method of claim 1, wherein said plurality of genes comprises at least 30, 50, 80, 100, 200, or 300 of the genes listed in Table 1.

13. The method of claim 1, wherein said plurality of genes comprises at least 100, 300, 500, 600, 700, or 800 of the genes listed in Table 1.

14. The method of claim 1, further comprising determining an increased or decreased risk of recurrence and/or progression of DCIS based upon the expression levels of the plurality of genes.

15. The method of claim 14, further comprising treating the subject upon determining an increased risk of recurrence and/or progression of DCIS.

16. The method of claim 15, wherein the treating comprises surgery, radiation, and/or chemotherapy (e.g., endocrine therapy).

17. A method for generating a classifier, comprising:

(a) providing tissue samples (e.g., biopsies) from a plurality of subjects, said samples comprising cells of a breast tissue site of interest, said site of interest comprising or suspected of comprising ductal carcinoma in situ (DCIS) (e.g., suspected based on an abnormal mammogram), wherein said cells comprise a plurality of messenger ribonucleic acid (mRNA) molecules;

(b) optically detecting an expression level of said plurality of mRNA molecules to thereby quantify expression levels of a plurality of genes in the cells; and

(c) using the expression levels of the plurality of genes to train a classifier, said classifier capable of determining a risk of DCIS recurrence and/or progression, to thereby generate the classifier.

18. The method of claim 17, wherein (b) comprises reverse transcribing said plurality of mRNA molecules to generate a plurality of complementary deoxyribonucleic acid (cDNA) molecules, and subsequently optically detecting said plurality of cDNA molecules.

19. The method of claim 18, further comprising, prior to optically detecting, performing nucleic acid amplification of said plurality of cDNA molecules.

20. The method of claim 19, wherein said nucleic acid amplification comprises polymerase chain reaction (PCR) or isothermal amplification.

21. The method of claim 18, wherein said optically detecting comprises detecting an optical signal from a probe coupled to a cDNA molecule of said plurality of cDNA molecules.

22. The method of claim 21, wherein said optical signal is a fluorescent signal.

23. The method of claim 17, further comprising processing said cells to extract the plurality of mRNA molecules prior to said optically detecting.

24. The method of claim 17, wherein said sample comprises a heterogeneous mixture of cells (e.g., mixed epithelial and stromal cells) (e.g., from a core biopsy or lumpectomy).

25. The method of claim 17, wherein the subject has undergone surgery for DCIS (i.e., lumpectomy).

26. The method of claim 17, wherein the subject has not undergone surgery for DCIS.

27. The method of claim 17, wherein the classifier is agnostic to the biological type of DCIS and/or subsequent invasive cancer.

28. The method of claim 17, wherein the classifier is trained based on a subsequent ipsilateral occurrence of DCIS and/or invasive breast cancer in the plurality of subjects (e.g., within about 3, 5 or 8 years from collection of the tissue samples).

29. A system for determining the risk of DCIS recurrence and/or progression in a subject in need thereof, comprising: at least one processor; a sample input circuit configured to receive a tissue sample from the subject; a sample analysis circuit coupled to the at least one processor and configured to determine gene expression levels of the tissue sample; an input/output circuit coupled to the at least one processor; a storage circuit coupled to the at least one processor and configured to store data, parameters, and/or a classifier; and a memory coupled to the processor and comprising computer readable program code embodied in the memory that when executed by the at least one processor causes the at least one processor to perform operations comprising: controlling/performing measurement via the sample analysis circuit of gene expression levels of a plurality of genes in said tissue sample; optionally, normalizing the gene expression levels to generate normalized gene expression values; retrieving from the storage circuit a DCIS classifier; entering the gene expression values into the classifier; and determining a score or risk of DCIS recurrence and/or progression based upon said classifier.

30. The system of claim 29, wherein said plurality of genes comprises at least 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90 or 100 of the genes listed in Table 1.

31. The system of claim 29, wherein said plurality of genes comprises at least 30, 50, 80, 100, 200, or 300 of the genes listed in Table 1.

32. The system of claim 29, wherein said plurality of genes comprises at least 100, 300, 500, 600, 700, or 800 of the genes listed in Table 1.

33. The system of any one of claims 29-32, wherein the classifier was generated by the method according to any one of claims 17-28.

Description:
METHODS FOR PROCESSING BREAST TISSUE SAMPLES

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of United States Provisional Patent Application Serial No. 63/422,108, filed November 3, 2022, the disclosure of which is incorporated by reference herein in its entirety.

FEDERAL FUNDING LEGEND

This invention was made with Government support under Federal Grant nos. U2CCA233254-01 and CA185138-01 awarded by the National Institutes of Health/NCI, and Federal Grant no. BC132057 awarded by the Department of Defense. The Federal Government has certain rights to this invention.

BACKGROUND

As nonobligate precursors of invasive disease, precancers provide a unique vantage point to study molecular pathways and evolutionary dynamics leading to the development of life-threatening cancers. Breast ductal carcinoma in situ (DCIS) is one of the most common precancers across all tissues. Current treatment of DCIS involves breast conserving surgery or mastectomy, with the goal of preventing invasive cancer. However, DCIS consists of a molecularly heterogeneous group of lesions, with highly variable risk of invasive progression. Improved understanding of which DCIS is likely to progress could better focus treatment options.

Identification of factors associated with disease progression has been studied extensively. Epidemiologic cancer progression models indicate that clinical features like age at diagnosis, tumor grade, and hormone receptor expression may have some prognostic value, but have limited ability to identify the biologic conditions that govern DCIS progression to invasive breast cancer (IBC). Previous molecular analyses of DCIS have studied either 1) cohorts of pure DCIS with known outcomes (e.g., disease-free vs recurrent), or 2) cross-sectional cohorts of DCIS with or without adjacent IBC. These approaches have tested potentially divergent assumptions: recurrence of the DCIS as IBC may arise from neoplastic cells left behind when the DCIS was removed, be related to initial field effect, or develop independently. Longitudinal cohorts provide a perspective of cancer progression over time. Analysis of DCIS adjacent to IBC assumes these preinvasive areas are good models for pure DCIS and are ancestors of the invasive cancer cells, with synchronous lesions inferring progression. To date, these studies have not produced clear evidence for a common set of events associated with invasion.

Moreover, few genomic aberrations have been identified that can differentiate DCIS from IBC, and microenvironmental processes, including collagen organization, myoepithelial changes, and immune suppression, may contribute to IBC development. Presently, it remains unknown how these different molecular axes contribute to DCIS evolution.

Improved methods of analyzing DCIS tissue that may yield risk prediction for recurrence or development of IBC are needed.

SUMMARY

The Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in limiting the scope of the claimed subject matter.

Provided herein according to some aspects is a method for processing a tissue sample (e.g., biopsy) from a subject, comprising: (a) providing the sample from the subject, said sample comprising cells of a breast tissue site of interest, said site of interest comprising or suspected of comprising ductal carcinoma in situ (DCIS) (e.g., suspected based on an abnormal mammogram), wherein said cells comprise a plurality of messenger ribonucleic acid (mRNA) molecules; and (b) detecting (e.g. optically detecting) an expression level of said plurality of mRNA molecules to thereby quantify expression levels of a plurality of genes in the cells.

In some aspects, (b) comprises reverse transcribing said plurality of mRNA molecules to generate a plurality of complementary deoxyribonucleic acid (cDNA) molecules, and subsequently detecting (e.g. optically detecting) said plurality of cDNA molecules. In some aspects, the method comprises performing nucleic acid amplification (e.g., a polymerase chain reaction (PCR) or isothermal amplification) of the plurality of cDNA molecules (e.g., before the detecting).

In some aspects, detecting comprises detecting an optical signal from a probe coupled to a cDNA molecule of said plurality of cDNA molecules. In some aspects, the optical signal is a fluorescent signal.

In some aspects, the method includes processing said cells to access (and optionally extract) the plurality of mRNA molecules prior to said detecting. In some aspects, the sample comprises a heterogeneous mixture of cells (e.g., mixed epithelial and stromal cells) (e.g., from a core biopsy or lumpectomy).

In some aspects, the subject has undergone surgery for DCIS (e.g., lumpectomy). In some aspects, the subject has not undergone surgery for DCIS.

In some aspects, the plurality of genes comprises at least 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90 or 100 of the genes listed in Table 1. In some aspects, the plurality of genes comprises at least 30, 50, 80, 100, 200, or 300 of the genes listed in Table 1. In some aspects, the plurality of genes comprises at least 100, 300, 500, 600, 700, or 800 of the genes listed in Table 1.

In some aspects, the method includes determining an increased or decreased risk of recurrence and/or progression of DCIS based upon the expression levels of the plurality of genes.

In some aspects, the method includes treating the subject upon determining an increased risk of recurrence and/or progression of DCIS. In some aspects, the treating comprises surgery, radiation, and/or chemotherapy (e.g., endocrine therapy).

Also provided is the use of surgery, radiation, and/or chemotherapy (e.g., endocrine therapy) in a method for treating a subject upon determining an increased risk of recurrence and/or progression of DCIS. Further provided is the manufacture of a medicament (such as chemotherapy) for use in treating a subject upon determining an increased risk of recurrence and/or progression of DCIS.

Also provided according to some aspects is a method for generating a classifier, comprising: (a) providing tissue samples (e.g., biopsies) from a plurality of subjects, said samples comprising cells of a breast tissue site of interest, said site of interest comprising or suspected of comprising ductal carcinoma in situ (DCIS) (e.g., suspected based on an abnormal mammogram), wherein said cells comprise a plurality of messenger ribonucleic acid (mRNA) molecules; (b) detecting (e.g. optically detecting) an expression level of said plurality of mRNA molecules to thereby quantify expression levels of a plurality of genes in the cells; and (c) using the expression levels of the plurality of genes to train a classifier, said classifier capable of determining a risk of DCIS recurrence and/or progression, to thereby generate the classifier.

In some aspects, (b) comprises reverse transcribing said plurality of mRNA molecules to generate a plurality of complementary deoxyribonucleic acid (cDNA) molecules, and subsequently detecting (e.g. optically detecting) said plurality of cDNA molecules. In some aspects, the method comprises performing nucleic acid amplification (e.g., polymerase chain reaction (PCR) or isothermal amplification) of the plurality of cDNA molecules (e.g., before the detecting). In some aspects, detecting comprises detecting an optical signal from a probe coupled to a cDNA molecule of said plurality of cDNA molecules. In some aspects, the optical signal is a fluorescent signal.

In some aspects, the method includes processing said cells to access (and optionally extract) the plurality of mRNA molecules prior to said detecting.

In some aspects, the sample comprises a heterogeneous mixture of cells (e.g., mixed epithelial and stromal cells) (e.g., from a core biopsy or lumpectomy).

In some aspects, the subject has undergone surgery for DCIS (e.g., lumpectomy). In some aspects, the subject has not undergone surgery for DCIS.

In some aspects, the classifier is agnostic to the biological type of DCIS and/or subsequent invasive cancer.

In some aspects, the classifier is trained based on a subsequent ipsilateral occurrence of DCIS and/or invasive breast cancer in the plurality of subjects (e.g., within about 3, 5 or 8 years from collection of the tissue samples).

Further provided is a system for determining the risk of DCIS recurrence and/or progression in a subject in need thereof, comprising: at least one processor; a sample input circuit configured to receive a tissue sample from the subject; a sample analysis circuit coupled to the at least one processor and configured to determine gene expression levels of the tissue sample; an input/output circuit coupled to the at least one processor; a storage circuit coupled to the at least one processor and configured to store data, parameters, and/or a classifier; and a memory coupled to the processor and comprising computer readable program code embodied in the memory that when executed by the at least one processor causes the at least one processor to perform operations comprising: controlling/performing measurement via the sample analysis circuit of gene expression levels of a plurality of genes in said tissue sample; optionally, normalizing the gene expression levels to generate normalized gene expression values; retrieving from the storage circuit a DCIS classifier; entering the gene expression values into the DCIS classifier; and determining a score or risk of DCIS recurrence and/or progression based upon said DCIS classifier.

In some aspects, the plurality of genes comprises at least 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90 or 100 of the genes listed in Table 1.

In some aspects, the plurality of genes comprises at least 30, 50, 80, 100, 200, or 300 of the genes listed in Table 1.

In some aspects, the plurality of genes comprises at least 100, 300, 500, 600, 700, or 800 of the genes listed in Table 1.

In some aspects, the classifier was generated by a method as taught herein.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying Figures are provided by way of illustration and not by way of limitation. The foregoing aspects and other features of the disclosure are explained in the following description, taken in connection with the accompanying example figures (“FIG.”) relating to one or more embodiments, in which:

FIG. 1 is an exemplary flow diagram illustrating cohorts and methods used in a tissue analysis described herein. Two retrospective study cohorts were generated, consisting of ductal carcinoma in situ (DCIS) patients with either a subsequent ipsilateral breast event (iBE) or no later events after surgical treatment. Translational Breast Cancer Research Consortium (TBCRC) samples were macrodissected for downstream RNA and DNA analyses. Resource of Archival Breast Tissue (RAHBT) samples were 1) macrodissected like TBCRC, or 2) organized into a tissue microarray (TMA) from which serial sections were made for RNA, DNA, and protein (MIBI) analysis (RAHBT LCM cohort). TMA cores were laser capture microdissected to ensure pure epithelial and stromal components.

FIGS. 2A - 2F present validation data of the 812 gene classifier. FIG. 2A: ROC curve of the 812 gene classifier in RAHBT. FIG. 2B: Kaplan-Meier plot of time to iBE (5-year outcome) stratified by classifier risk groups in RAHBT. FIGS. 2C and 2D: Kaplan-Meier plot of time to invasive progression (full follow-up) stratified by classifier risk groups in TBCRC (FIG. 2C) and RAHBT (FIG. 2D). FIGS. 2E and 2F: Forest plot of multivariable Cox regression analysis including classifier risk groups, treatment, age, DCIS grade, and ER status for invasive iBEs (full follow-up) in TBCRC (FIG. 2E) and RAHBT (FIG. 2F).

FIGS. 3A - 3B show outcome-associated pathways in individual samples. FIG. 3A: Percentage of samples in 5-year outcome groups enriched for each pathway. FIG. 3B: Plot of Pearson’s correlations between pathways. Color intensity and circle size are proportional to correlation coefficients, with positive correlation indicated as "+" and negative correlation indicated as

FIG. 4 is an exemplary block diagram of a tissue processing system and/or computer program product that may be used in a platform in accordance with the present invention. A tissue processing system and/or computer program product 1100 may include a processor subsystem 1140, including one or more Central Processing Units (CPU) on which one or more operating systems and/or one or more applications run. While one processor 1140 is shown, it will be understood that multiple processors 1140 may be present, which may be either electrically interconnected or separate. Processor(s) 1140 are configured to execute computer program code from memory devices, such as memory 1150, to perform at least some of the operations and methods described herein. The storage circuit 1170 may store databases which provide access to the data/parameters/classifier used by the tissue processing system 1100 such as the list of genes, weights, thresholds, etc. An input/output circuit 1160 may include displays and/or user input devices, such as keyboards, touch screens and/or pointing devices. Devices attached to the input/output circuit 1160 may be used to provide information to the processor 1140 by a user of the tissue processing system 1100. Devices attached to the input/output circuit 1160 may include networking or communication controllers, input devices (keyboard, a mouse, touch screen, etc.) and output devices (printer or display). An optional update circuit 1180 may be included as an interface for providing updates to the tissue processing system 1100 such as updates to the code executed by the processor 1140 that are stored in the memory 1150 and/or the storage circuit 1170. Updates provided via the update circuit 1180 may also include updates to portions of the storage circuit 1170 related to a database and/or other data storage format which maintains information for the tissue processing system 1100, such as the list of genes, weights, thresholds, etc.
The sample input circuit 1110 provides an interface for the tissue processing system 1100 to receive tissue samples to be analyzed. The sample processing circuit 1120 may further process the tissue sample within the tissue processing system 1100 so as to prepare the tissue sample for automated analysis.
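By way of non-limiting illustration only, the operations that the processor may perform on measured expression data (optional normalization, retrieval of a stored classifier, entry of the expression values, and determination of a score or risk) might be sketched as follows. The gene names, weights, intercept, threshold, JSON storage format, and the particular normalization (log2 followed by a within-sample z-score) are all assumptions made for this sketch and are not taken from the disclosure or from Table 1.

```python
import json
import math
import statistics

# Hypothetical stored classifier record standing in for the storage circuit.
# Gene names, weights, intercept, and threshold are invented for illustration.
STORED_CLASSIFIER = json.dumps({
    "genes": ["GENE_A", "GENE_B", "GENE_C"],
    "weights": [0.9, -0.4, 0.2],
    "intercept": -0.1,
    "threshold": 0.5,
})


def normalize(raw_counts):
    """Assumed normalization: log2(count + 1), then z-score within the sample."""
    logged = {gene: math.log2(count + 1.0) for gene, count in raw_counts.items()}
    values = list(logged.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return {gene: (value - mean) / sd for gene, value in logged.items()}


def retrieve_classifier(storage_blob):
    """Load the list of genes, weights, intercept, and threshold."""
    return json.loads(storage_blob)


def logistic(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def determine_risk(raw_counts, storage_blob=STORED_CLASSIFIER):
    """Normalize, retrieve the classifier, enter the values, and return a score."""
    values = normalize(raw_counts)
    clf = retrieve_classifier(storage_blob)
    missing = [gene for gene in clf["genes"] if gene not in values]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    z = clf["intercept"] + sum(
        weight * values[gene] for gene, weight in zip(clf["genes"], clf["weights"])
    )
    score = logistic(z)
    label = "increased risk" if score >= clf["threshold"] else "decreased risk"
    return score, label


score, label = determine_risk({"GENE_A": 500, "GENE_B": 20, "GENE_C": 100})
```

In this sketch the classifier parameters are kept apart from the scoring code, which mirrors the description of an update circuit that may revise the stored list of genes, weights, or thresholds without changing the program code.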

DETAILED DESCRIPTION

For the purposes of promoting an understanding of the principles of the present disclosure, reference will now be made to preferred embodiments and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the disclosure is thereby intended, such alteration and further modifications of the disclosure as illustrated herein, being contemplated as would normally occur to one skilled in the art to which the disclosure relates.

Articles “a” and “an” are used herein to refer to one or to more than one (i.e., at least one) of the grammatical object of the article. By way of example, “an element” means at least one element and can include more than one element.

"About" is used to provide flexibility to a numerical range endpoint by providing that a given value may be slightly above or slightly below (e.g., by 2%, 5%, 10% or 15%) the endpoint without affecting the desired result.

The use herein of the terms "including," "comprising," or "having," and variations thereof, is meant to encompass the elements listed thereafter and equivalents thereof as well as additional elements. As used herein, “and/or” refers to and encompasses any and all possible combinations of one or more of the associated listed items, as well as the lack of combinations where interpreted in the alternative (“or”).

As used herein, the transitional phrase "consisting essentially of" (and grammatical variants) is to be interpreted as encompassing the recited materials or steps "and those that do not materially affect the basic and novel characteristic(s)" of the claimed invention. Thus, the term "consisting essentially of" as used herein should not be interpreted as equivalent to "comprising."

Moreover, the present disclosure also contemplates that in some embodiments, any feature or combination of features set forth herein can be excluded or omitted. To illustrate, if the specification states that a complex comprises components A, B and C, it is specifically intended that any of A, B or C, or a combination thereof, can be omitted and disclaimed singularly or in any combination.

Recitation of ranges of values herein are merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. For example, if a concentration range is stated as 1% to 50%, it is intended that values such as 2% to 40%, 10% to 30%, or 1% to 3%, etc., are expressly enumerated in this specification. These are only examples of what is specifically intended, and all possible combinations of numerical values between and including the lowest value and the highest value enumerated are to be considered to be expressly stated in this disclosure.

Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

Provided herein according to embodiments are methods for processing a tissue sample from a subject. In some embodiments, the tissue sample is a breast tissue sample. In some embodiments, the sample is a biopsy (e.g., a core biopsy). In some embodiments, the tissue sample is breast tissue removed during surgery such as a lumpectomy procedure or a mastectomy procedure. In other embodiments, the sample is not obtained from surgery. The tissue sample may include cells from a site of interest, for example, a site confirmed or suspected of having a tumor or pre-cancerous cells (such as DCIS). The site of interest may, for example, be suspected of having DCIS or other pre-cancerous cells based on imaging, such as the result of an abnormal mammogram finding.

In some embodiments, the tissue sample comprises a heterogeneous mixture of cells (e.g., mixed epithelial and stromal breast tissue cells). In some embodiments, the sample contains isolated cell types, or is enriched for a particular cell type or types. Isolation of cells may be performed by any suitable method, for example, by laser-capture microdissection (LCM). The cells of a site of interest have a plurality of messenger ribonucleic acid (mRNA) molecules reflecting expression of genes in the cells. In embodiments of the present invention, a plurality of the mRNA molecules are detected (e.g., optically detected) in order to identify and/or quantify expression levels of their corresponding genes. In some embodiments, the cells are processed (e.g., lysed and optionally mRNA molecules separated from other cell components) to access the plurality of mRNA molecules from the cells.

In some embodiments, the plurality of mRNA molecules are reverse transcribed to generate a plurality of complementary deoxyribonucleic acid (cDNA) molecules representative of the mRNA molecules, and the detection includes detecting the plurality of cDNA molecules. In some embodiments, the method includes performing nucleic acid amplification of the plurality of cDNA molecules (e.g., by polymerase chain reaction (PCR)) prior to the detection. A non-limiting example method for cDNA library preparation from mRNA molecules is Smart-3SEQ. See Foley et al., "Gene expression profiling of single cells from archival tissue with laser-capture microdissection and Smart-3SEQ," Genome Research 29: 1816-1825 (2019).

Detection may be performed by suitable means known in the art. In some embodiments, optically detecting comprises detecting an optical signal from a probe coupled to the mRNA and/or cDNA molecules. In some embodiments, the optical signal is a fluorescent signal.

The expression levels of a plurality of genes as taught herein may be informative of a biological state (e.g., DCIS), and/or prognosis of recurrence or progression of the biological state (e.g., recurrence of DCIS and/or progression to invasive breast cancer). This biological state may be considered in determining treatment options for the subject. In some embodiments, methods include determining an increased or decreased risk of recurrence and/or progression of DCIS based upon the expression levels of the plurality of genes, and may further include treating the subject upon determining an increased risk of recurrence and/or progression of DCIS. The expression levels of the plurality of genes may be determined as taught herein, e.g., by quantifying and/or detecting mRNA/cDNA molecules.

As used herein, "treatment,” “therapy” and/or “therapy regimen” refer to the clinical intervention made in response to a disease, disorder or physiological condition manifested by a patient or to which a patient may be susceptible. The aim of treatment includes the alleviation or prevention of symptoms, slowing or stopping the progression or worsening of a disease, disorder, or condition and/or the remission of the disease, disorder or condition. In some embodiments, the treating comprises surgery, radiation, and/or chemotherapy (e.g., endocrine therapy).

The term "effective amount" or “therapeutically effective amount” refers to an amount sufficient to effect a beneficial or desirable biological and/or clinical result. As used herein, the terms "subject" and "patient" are used interchangeably and refer to both human and nonhuman animals. The term "nonhuman animals" of the disclosure includes all vertebrates, e.g., mammals and non-mammals, such as nonhuman primates, sheep, dog, cat, horse, cow, chickens, amphibians, reptiles, and the like, for research and/or veterinary purposes.

In some embodiments, expression levels of the plurality of genes may be incorporated into a classifier. The term "classifier" refers to an analysis that uses the gene expression levels, and optionally a pre-determined coefficient (or weight) for each gene expression level component, to generate an output or score for the purpose of assignment to a category or predicted outcome. A classifier may be obtained by a procedure known as "training," which makes use of a set of data containing observations with known category membership (e.g., recurrence or iBE after an initial finding of DCIS). Training may seek to find the optimal coefficient (i.e., weight) for each component of a set of gene expression level components, as well as an optimal list of gene expression level components to include, where the optimal result is determined by the highest achievable classification accuracy. See, e.g., U.S. Publication No. 2023/0212699.

In some embodiments, a classifier as taught herein is trained based on a subsequent ipsilateral occurrence of DCIS and/or invasive breast cancer in the plurality of subjects (e.g., within about 3, 5 or 8 years from collection of the tissue samples).

The classifier may be linear and/or probabilistic. A classifier is linear if scores are a function of summed signature values weighted by a set of coefficients. Furthermore, a classifier is probabilistic if the function of signature values generates a probability, a value between 0 and 1.0 (or between 0 and 100%) quantifying the likelihood that a subject or observation belongs to a particular category or will have a particular outcome, respectively. Probit regression and logistic regression are examples of probabilistic linear classifiers that use probit and logistic link functions, respectively, to generate a probability.
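As a minimal, non-limiting sketch of the definitions above, a linear score may be computed as a weighted sum of expression values and then passed through a logistic or probit link to yield a probability between 0 and 1. The gene names, weights, intercept, and expression values below are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def linear_score(expression, weights, intercept=0.0):
    """Weighted sum of expression values over the genes that carry a weight."""
    return intercept + sum(w * expression[gene] for gene, w in weights.items())


def logistic_link(score):
    """Map a linear score to a probability with the logistic function."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def probit_link(score):
    """Map a linear score to a probability with the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(score / math.sqrt(2.0)))


# Hypothetical gene names, weights, and normalized expression values.
weights = {"GENE_A": 0.8, "GENE_B": -0.5}
sample = {"GENE_A": 1.2, "GENE_B": 0.4}

score = linear_score(sample, weights, intercept=-0.3)
p_logistic = logistic_link(score)
p_probit = probit_link(score)
```

Both links map a score of zero to a probability of 0.5 and increase monotonically with the score; they differ only in how quickly the probability approaches 0 or 1.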

In some embodiments, the classifier/classification is "agnostic" in that it is indicative of a general biological state (e.g., risk of DCIS recurrence and/or progression), but it does not provide an indication of a particular biological pathway as a cause of the state.

In some embodiments, a method for generating a classifier as taught herein may include: (a) providing tissue samples (e.g., biopsies) from a plurality of subjects, said samples comprising cells of a breast tissue site of interest, said site of interest comprising or suspected of comprising ductal carcinoma in situ (DCIS) (e.g., suspected based on an abnormal mammogram), wherein said cells comprise a plurality of messenger ribonucleic acid (mRNA) molecules; (b) detecting (e.g., optically detecting) an expression level of said plurality of mRNA molecules to thereby quantify expression levels of a plurality of genes in the cells; and (c) using the expression levels of the plurality of genes to train a classifier, said classifier capable of determining a risk of DCIS recurrence and/or progression, to thereby generate the classifier.

In some embodiments, the generating comprises, consists of, or consists essentially of, iteratively: (i) assigning a weight for each gene expression value, entering the weight and expression value for each gene into a classifier equation and determining a score or classification for a particular outcome for each of the plurality of subjects, then (ii) determining the accuracy of classification for each outcome across the plurality of subjects, and then (iii) adjusting the weight until accuracy of classification is optimized, wherein genes having a non-zero weight are included in the classifier. Optionally, components of the classifier (e.g., genes, weights and/or classification threshold value) may be uploaded into one or more databases for later retrieval or use.
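The iterative procedure of steps (i) to (iii) may be sketched, purely as an example, with an L1-penalized logistic regression fitted by proximal gradient descent. This optimizer is an assumption made for illustration: the disclosure does not name one, and the sketch minimizes log loss as a surrogate for classification accuracy. The toy data, learning rate and penalty strength are hypothetical.

```python
import math

def predict(x, w, b):
    """Step (i): enter weights and expression values into the classifier equation."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(X, y, w, b, threshold=0.5):
    """Step (ii): fraction of subjects whose predicted class matches the outcome."""
    hits = sum((predict(x, w, b) >= threshold) == bool(t) for x, t in zip(X, y))
    return hits / len(y)

def soft_threshold(v, t):
    """Shrink v toward zero by t; values within t of zero become exactly zero."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def train(X, y, lr=0.1, l1=0.01, epochs=500):
    """Step (iii): repeatedly adjust the weights; the L1 penalty zeroes weak genes."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errs = [predict(x, w, b) - t for x, t in zip(X, y)]
        grad_w = [sum(e * x[j] for e, x in zip(errs, X)) / n for j in range(d)]
        grad_b = sum(errs) / n
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy data: column 0 tracks the outcome, column 1 is uninformative.
X = [[-2.0, 1.0], [-1.0, -1.0], [-1.5, 1.0], [-0.5, -1.0],
     [2.0, 1.0], [1.0, -1.0], [1.5, 1.0], [0.5, -1.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train(X, y)
selected = [j for j, wj in enumerate(w) if wj != 0.0]  # genes kept in the classifier
```

In this toy case only the informative column keeps a non-zero weight, mirroring the statement that genes having a non-zero weight are the ones included.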

In some embodiments, the classifier is trained based on a subsequent ipsilateral occurrence of DCIS and/or invasive breast cancer in a subject as a classification (e.g., within about 3, 5 or 8 years from collection of the tissue samples).

In some embodiments, the plurality of genes may include at least 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90 or 100 of the genes listed in Table 1, which genes were found to be differentially expressed in DCIS tissue based on an outcome, as further described in the examples provided below. In some embodiments, the plurality of genes includes at least 30, 50, 80, 100, 200, or 300 of the genes listed in Table 1. In some embodiments, the plurality of genes includes at least 100, 300, 500, 600, 700, or 800 of the genes listed in Table 1.

TABLE 1: 812 Differentially Expressed Genes. log2FoldChange > 0: up in ipsilateral breast event (either DCIS or IBC) within 5 years.

Compartment column indicates if the respective gene was significantly differentially expressed (FDR<0.05) in the epithelial or stromal compartment by DESeq2 analysis of stromal vs epithelial RAHBT LCM samples.

Tissue Processing Systems

Systems useful to carry out the methods of tissue processing as described herein can be implemented in hardware, software, firmware, or combinations of hardware, software and/or firmware. In some examples, the systems may be implemented using a non-transitory computer readable medium storing computer executable instructions that when executed by one or more processors of a computer cause the computer to perform operations. Computer readable media suitable for implementing the systems described in this specification include non-transitory computer-readable media, such as disk memory devices, chip memory devices, programmable logic devices, random access memory (RAM), read only memory (ROM), optical read/write memory, cache memory, magnetic read/write memory, flash memory, and application-specific integrated circuits. In addition, a computer readable medium that implements a system (e.g., comprising genes and/or classifiers as taught herein) may be located on a single device or computing platform or may be distributed across multiple devices or computing platforms.

With reference to FIG. 4, a tissue processing system and/or computer program product 1100 may be used according to various embodiments described herein. A tissue processing system and/or computer program product 1100 may be embodied as one or more enterprise, application, personal, pervasive and/or embedded computer systems that are operable to receive, transmit, process and store data using any suitable combination of software, firmware and/or hardware and that may be standalone and/or interconnected by any conventional, public and/or private, real and/or virtual, wired and/or wireless network including all or a portion of the global communication network known as the Internet, and may include various types of tangible, non-transitory computer readable medium.

As shown in FIG. 4, the tissue processing system 1100 may include a processor subsystem 1140, including one or more Central Processing Units (CPU) on which one or more operating systems and/or one or more applications run. While one processor 1140 is shown, it will be understood that multiple processors 1140 may be present, which may be either electrically interconnected or separate. Processor(s) 1140 are configured to execute computer program code from memory devices, such as memory subsystem 1150, to perform at least some of the operations and methods described herein, and may be any conventional or special purpose processor, including, but not limited to, digital signal processor (DSP), field programmable gate array (FPGA), application specific integrated circuit (ASIC), and multi-core processors.

The memory subsystem 1150 may include a hierarchy of memory devices such as Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM) or flash memory, and/or any other solid state memory devices. A storage circuit 1170 may also be provided, which may include, for example, a portable computer diskette, a hard disk, a portable Compact Disk Read-Only Memory (CDROM), an optical storage device, a magnetic storage device and/or any other kind of disk- or tape-based storage subsystem. The storage circuit 1170 may provide non-volatile storage of data/parameters/classifiers for the tissue processing system 1100. The storage circuit 1170 may include disk drive and/or network store components. The storage circuit 1170 may be used to store code to be executed and/or data to be accessed by the processor 1140. In some embodiments, the storage circuit 1170 may store databases which provide access to the data/parameters/classifiers used for the tissue processing system 1100 such as the list of genes, weights, thresholds, etc. Any combination of one or more computer readable media may be utilized by the storage circuit 1170. The computer readable media may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used herein, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

An input/output circuit 1160 may include displays and/or user input devices, such as keyboards, touch screens and/or pointing devices. Devices attached to the input/output circuit 1160 may be used to provide information to the processor 1140 by a user of the tissue processing system 1100. Devices attached to the input/output circuit 1160 may include networking or communication controllers, input devices (keyboard, a mouse, touch screen, etc.) and output devices (printer or display). The input/output circuit 1160 may also provide an interface to devices, such as a display and/or printer, to which results of the operations of the tissue processing system 1100 can be communicated so as to be provided to the user of the tissue processing system 1100.

An optional update circuit 1180 may be included as an interface for providing updates to the tissue processing system 1100. Updates may include updates to the code executed by the processor 1140 that are stored in the memory subsystem 1150 and/or the storage circuit 1170. Updates provided via the update circuit 1180 may also include updates to portions of the storage circuit 1170 related to a database and/or other data storage format which maintains information for the tissue processing system 1100, such as the signatures, weights, thresholds, etc.

The sample input circuit 1110 of the tissue processing system 1100 may provide an interface for the platform as described hereinabove to receive tissue samples to be analyzed. The sample input circuit 1110 may include mechanical elements, as well as electrical elements, which receive a tissue sample provided by a user to the tissue processing system 1100 and transport the tissue sample within the tissue processing system 1100 and/or platform to be processed. The sample input circuit 1110 may include a bar code reader that identifies a bar-coded container for identification of the sample and/or test order form. The sample processing circuit 1120 may further process the tissue sample within the tissue processing system 1100 and/or platform so as to prepare the sample for automated analysis. The sample analysis circuit 1130 may automatically analyze the processed tissue sample. The sample analysis circuit 1130 may be used in measuring, e.g., gene expression levels of a pre-defined set of genes with the tissue sample provided to the tissue processing system 1100. The sample analysis circuit 1130 may also optionally generate normalized gene expression values by normalizing the gene expression levels. The sample analysis circuit 1130 may retrieve from the storage circuit 1170 a DCIS classifier as taught herein. The sample analysis circuit 1130 may enter the gene expression values into the classifier. The sample analysis circuit 1130 may calculate a score or probability of DCIS recurrence and/or progression based upon said classifier, via the input/output circuit 1160.
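The sequence of operations attributed to the sample analysis circuit 1130 (normalizing expression levels, retrieving a stored classifier, entering values into the classifier, and producing a score) may be sketched in software as follows. The storage layout, gene names, weights, threshold and the log2 counts-per-million normalization are all hypothetical choices made for illustration; the disclosure does not prescribe them.

```python
import math

# Hypothetical stand-in for the storage circuit: gene weights, intercept, threshold.
STORAGE = {
    "dcis_classifier": {
        "weights": {"GENE_A": 1.0, "GENE_B": -1.0},
        "intercept": 0.0,
        "threshold": 0.5,
    }
}

def normalize(counts):
    """Log2 counts-per-million normalization (one possible choice among many)."""
    total = sum(counts.values())
    return {g: math.log2(c / total * 1e6 + 1.0) for g, c in counts.items()}

def score_sample(counts, classifier):
    """Enter normalized values into the classifier; return probability and call."""
    values = normalize(counts)
    z = classifier["intercept"] + sum(
        w * values[g] for g, w in classifier["weights"].items()
    )
    prob = 1.0 / (1.0 + math.exp(-z))
    label = "high risk" if prob >= classifier["threshold"] else "low risk"
    return prob, label

# Hypothetical raw counts measured for one tissue sample.
counts = {"GENE_A": 300, "GENE_B": 100, "OTHER": 600}
prob, label = score_sample(counts, STORAGE["dcis_classifier"])
```

Keeping the genes, weights and threshold in a retrievable record, rather than in the scoring code, matches the description of classifier components being stored and later updated independently of the program code.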

The sample input circuit 1110, the sample processing circuit 1120, the sample analysis circuit 1130, the input/output circuit 1160, the storage circuit 1170, and/or the update circuit 1180 may execute at least partially under the control of the one or more processors 1140 of the tissue processing system 1100. As used herein, executing "under the control" of the processor 1140 means that the operations performed by the sample input circuit 1110, the sample processing circuit 1120, the sample analysis circuit 1130, the input/output circuit 1160, the storage circuit 1170, and/or the update circuit 1180 may be at least partially executed and/or directed by the processor 1140, but does not preclude at least a portion of the operations of those components being separately electrically or mechanically automated. The processor 1140 may control the operations of the tissue processing system 1100, as described herein, via the execution of computer program code.

Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the tissue processing system 1100, partly on the tissue processing system 1100, as a stand-alone software package, partly on the tissue processing system 1100 and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the tissue processing system 1100 through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computer environment or offered as a service such as a Software as a Service (SaaS).

The present invention is further described in the following non-limiting examples.

EXAMPLES

Here, as part of the Human Tumor Atlas Network (HTAN) we present two DCIS cohorts, the Translational Breast Cancer Research Consortium (TBCRC) 038 study and the Resource of Archival Breast Tissue (RAHBT), for multimodal molecular analyses. We performed comprehensive integrated molecular profiling of these complementary, clinically annotated, longitudinally sampled cohorts, to understand the spectrum of molecular changes in DCIS and to identify both tumor and stromal predictors of subsequent events. We used multidimensional and multiparametric approaches to address central conceptual themes of cancer progression, ecology, and evolutionary biology. The breast precancer atlas (PCA) presented here may facilitate phylogenetic analysis to reconstruct the relationship between DCIS and IBC, the natural history of DCIS, and factors that underlie progression to invasive disease.

RESULTS

Study Design and Cohorts

We generated two retrospective case-control cohorts of patients initially diagnosed with pure DCIS with or without a subsequent ipsilateral breast event (iBE, either DCIS or invasive breast cancer (IBC)) after surgical treatment. Identical eligibility criteria were used for outcome analysis in both cohorts. The RAHBT cohort used for outcome analysis has 97 cases with median diagnosis at age 53, and 40 months median time to recurrence. Over half (66.0%) had lumpectomy with radiation, 10.3% had lumpectomy without radiation, and 35% were identified as black. The TBCRC cohort included 216 patients with median diagnosis at age 52, and 48 months median time to recurrence. More than half (55.5%) had lumpectomy with radiation, 15.3% had lumpectomy without radiation, and 30.0% were identified as black. FIG. 1 shows an outline of cohorts and analyses in this study. Cohort descriptions are provided in Table 2.

TABLE 2. Breast Pre-cancer Atlas Patient Cohorts with RNA-seq data and ipsilateral breast event (iBE) used for outcome analysis.

Time to Recurrence* (months)

* To end of follow-up for no recurrence.

Prognostic classifier predicts early recurrence

The TBCRC and RAHBT cohorts were designed to investigate biological determinants of recurrence by matching patients with subsequent iBE to patients that did not have any events during long-term follow-up.

To identify gene expression patterns correlating with outcome, we analyzed RNA from primary DCIS with iBEs within 5 years vs the remaining samples in TBCRC, to avoid including non-clonal events that might be more common in later years. We identified 812 differentially expressed (DE) genes at 0.05 false discovery rate (FDR). Table 1 above lists 812 differentially expressed genes from DESeq2 analysis iBEs within 5 years vs. the rest in TBCRC.
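As an illustration of selecting genes at a 0.05 false discovery rate, the following sketch applies the Benjamini-Hochberg adjustment, which is the default multiple-testing correction in DESeq2. The p-values are hypothetical and the function is a plain re-implementation for explanation, not the code used in the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-gene p-values from a differential expression test.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
qvals = benjamini_hochberg(pvals)
de_genes = [i for i, q in enumerate(qvals) if q < 0.05]  # indices passing FDR<0.05
```

Here two of the four genes with raw p-values below 0.05 fail the adjusted threshold, which shows why the FDR criterion is stricter than an unadjusted cutoff.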

To identify copy number aberrations (CNAs) that correlate with outcome, we performed light-pass whole genome sequencing (WGS) on DNA from FFPE samples in both cohorts (n=228). We identified 29 recurrent CNAs across both cohorts, none of which were predictive of recurrence. Given the absence of significant CNAs, we trained a Random Forest classifier in TBCRC using only the 812 DE genes. The classifier was validated in RAHBT, with an ROC AUC of 0.72 (FIG. 2A), Precision 0.86, Recall 0.91, and F1 score 0.88, indicating that the classifier also performed well in the test cohort. The classifier significantly predicted any subsequent iBE in both cohorts (RAHBT P=0.0004, FIG. 2B). Importantly, it was also a significant predictor of invasive iBEs over the full follow-up time (TBCRC P<0.0001, RAHBT P=0.0042, FIGS. 2C-2D), demonstrating that the classifier could specifically identify DCIS that progress to IBC.
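The reported validation metrics can be related to one another with a short sketch. The F1 score is the harmonic mean of precision and recall, so the reported 0.86 and 0.91 imply an F1 of about 0.88. The rank-based AUC function below is a generic definition shown on hypothetical labels and scores, not the study data.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def roc_auc(labels, scores):
    """AUC as the chance a random case outscores a random control (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Precision and recall as reported for the RAHBT validation.
f1 = f1_score(0.86, 0.91)

# Hypothetical outcome labels (1 = iBE) and classifier scores.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```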

Next, we examined whether the 812 gene classifier remained an independent predictor of outcome when combined with clinical features. We performed multivariable Cox regression analysis including the classifier, treatment, age, clinical ER, and DCIS grade. While multivariable analysis demonstrated a trend for treatment type and ER status for outcome, only the 812 gene classifier was significant in both cohorts (RAHBT HR=3.48 (95% CI: 1.14-10.6), P=0.028). Importantly, in multivariable analysis for invasive iBEs only, the classifier showed an even stronger prognostic value in both cohorts, with a hazard ratio of 7.33 in RAHBT (95% CI: 1.57-34.2, P=0.011, FIGS. 2E-2F). While previous studies found association between ER status and DCIS outcome, Kaplan-Meier analysis of clinical ER status (IHC-based) demonstrated a trend in RAHBT (P=0.053), but not in TBCRC (P=0.2). Moreover, the 812 gene classifier showed no prognostic value for progression free disease or overall survival for 1064 IBCs from The Cancer Genome Atlas (TCGA), suggesting that the classifier is specific for the DCIS stage.
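The reported hazard ratios, confidence intervals and p-values can be cross-checked with a short calculation. The sketch assumes the intervals are Wald intervals computed on the log hazard scale, which is the usual convention for Cox regression output but is not stated in the text.

```python
import math

def wald_p_from_ci(hr, lower, upper, z_crit=1.959964):
    """Recover the Wald z statistic and two-sided p-value from an HR and its 95% CI."""
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)  # SE of log(HR)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p

# Values reported for RAHBT: any iBE, then invasive iBEs only.
z_any, p_any = wald_p_from_ci(3.48, 1.14, 10.6)
z_inv, p_inv = wald_p_from_ci(7.33, 1.57, 34.2)
```

Under that assumption the recovered p-values are approximately 0.028 and 0.011, in agreement with the reported values.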

To compare the 812 gene classifier to commercially available prognostic tests for DCIS, we calculated the Oncotype DCIS score as previously described using TBCRC and RAHBT RNA-sequencing data. We found that, in contrast to the 812 gene classifier, the DCIS Oncotype score did not differ between the outcome groups in either cohort.

The 812 gene classifier likely represents several distinct biologic processes that promote recurrence and invasive progression. To further understand the biology and identify pathways involved in recurrence, we performed Gene Set Enrichment Analysis (GSEA) on DE genes between cases with 5-year recurrence vs the rest in TBCRC. We identified 11 Hallmark pathways significantly associated with early recurrence including those associated with proliferation, immune response, and metabolism.

To further examine pathway activation status, we performed Gene Set Variation Analysis (GSVA) at the individual tumor level in 5-year outcome groups. Here, MYC and mTORC1 signaling were increased in cases vs controls and strongly correlated (FIGS. 3A-3B). We also observed high correlation between cell cycle linked G2M and E2F pathways. Further, Glycolysis and Oxidative Phosphorylation were increased in cases, and the significant positive correlation between these two pathways indicated that metabolically active tumors use both pathways. Overall, this analysis confirmed the finding from the differential abundance and GSEA analysis of 5-year outcome groups.

DCIS RNA clustering defines expression modules that drive outcome

Since proliferation and metabolism were identified as important pathways in recurrence, we next examined whether these pathways are driven by major DCIS phenotypes. Previous studies suggested that IBC subtypes do not fit well for DCIS. We hypothesized that a DCIS-specific classification scheme would better address DCIS biology. To investigate the biology behind the outcome analysis with emphasis on epithelial pathways, we performed unsupervised clustering of RNA-seq data from TBCRC (n=216) as well as an additional group of RAHBT cases (n=265) where we generated epithelial-enriched samples by laser capture microdissection (LCM) to evaluate tumor cell expression patterns without contributions from the tumor microenvironment.

We performed non-negative matrix factorization (NMF) on all protein coding genes (GENCODE v33) with non-zero variance, evaluated the fit of 2-10 clusters, and selected a 3-cluster solution based on silhouette width, cophenetic value, maximizing cluster number, and replication in RAHBT. The 3-cluster solution most reproducibly captured the biologic subgroups in both cohorts. To ensure the identified clusters were not an artifact of the clustering method, we ran consensus clustering in TBCRC, which rediscovered three clusters with high concordance with the NMF clusters (85.6%). In both cohorts, cluster 1 had significantly higher ERBB2 and lower ESR1 expression compared to clusters 2 and 3, which both had increased ESR1 expression. We termed the three clusters ERlow, quiescent, and ERhigh, respectively. To characterize these clusters, we conducted differential abundance analysis comparing each cluster individually to the other two combined (one-vs-rest). The deregulated pathways in each cluster were highly concordant across both cohorts, further supporting three transcriptional patterns in DCIS that are driven by the tumor cell compartment (P_ERlow=2.33×10⁻²; P_quiescent=8.37×10⁻²; P_ERhigh=9.20×10⁻¹⁰; hypergeometric test).
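A hypergeometric test of the kind cited for pathway concordance asks how likely an overlap of a given size would be by chance. The sketch below computes the upper-tail probability from first principles; how the study defined the universe and the two pathway sets is not stated, so the example numbers are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) when drawing n items without replacement from N, K of which are hits."""
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return total / comb(N, n)

# Hypothetical example: 50 pathways in the universe, 10 deregulated in one cohort,
# 10 deregulated in the other, and 5 deregulated in both.
p_overlap = hypergeom_upper_tail(N=50, K=10, n=10, k=5)
```

A small value indicates that the two cohorts share more deregulated pathways than expected if the sets had been drawn independently.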

While we observed a differential expression of the estrogen response in the ERhigh cluster vs ERlow cluster, the most striking patterns involved pathways associated with DCIS recurrence. Pathways including MYC, mTOR signaling, and cell cycle pathways were enriched in ERlow and significantly depleted in the quiescent cluster. Moreover, the Allograft Rejection, p53 and Adipogenesis pathways were high in ERlow and low in ERhigh. Finally, ERhigh tumors were depleted for UV Response Down and enriched for Oxidative Phosphorylation pathways, both of which were associated with recurrence. None of the recurrence-associated pathways were enriched in the quiescent cluster. The presence of the Allograft Rejection pathway in RAHBT LCM epithelial samples, though not significant, suggests that immune cells have infiltrated the epithelial compartment in the involved samples. Thus, the 3-cluster solution identified pathways associated with recurrence.

Genomic and transcriptomic-based classifications of IBC have characterized the spectrum of invasive breast cancer subtypes, but it remains unclear whether these accurately describe the spectrum of DCIS. To investigate, we applied the PAM50 classification to TBCRC and RAHBT LCM epithelial DCIS samples and evaluated the correlation of each sample to the centroid of its assigned subtype. We compared this correlation to IBCs from TCGA through repeated downsampling of the TCGA. The median correlation was consistently lower in DCIS compared to IBC, with the most pronounced difference in the basal-like subtype, as previously shown. Significantly decreased correlation was also observed for luminal A (P=3.13×10⁻³) and normal-like subtypes (P=6.21×10⁻³). UMAP projection of the DCIS transcriptome revealed clear deviations from the PAM50 centroids, and PAM50 failed to predict DCIS recurrence. These data suggest that while established IBC subtypes can be identified in DCIS, they do not fit DCIS as robustly as IBC, and are not prognostic in these premalignant lesions.
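The idea of assigning a sample to a subtype and then measuring how well it correlates with that subtype's centroid may be sketched as follows. The centroids and expression vectors are hypothetical, and Pearson correlation is used here for brevity; the correlation measure used by the PAM50 implementation applied in the study may differ.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def assign_subtype(sample, centroids):
    """Return the best-correlated centroid name and the correlation to it."""
    corrs = {name: pearson(sample, c) for name, c in centroids.items()}
    best = max(corrs, key=corrs.get)
    return best, corrs[best]

# Hypothetical centroids over four genes, and one hypothetical sample.
centroids = {"subtype_A": [1.0, 2.0, 3.0, 4.0], "subtype_B": [4.0, 3.0, 2.0, 1.0]}
subtype, fit = assign_subtype([1.1, 1.9, 3.2, 3.9], centroids)
```

Comparing the distribution of the second return value between DCIS and IBC samples is what indicates whether a subtype scheme fits one disease stage less robustly than the other.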

In support of the 3-cluster solution, we investigated MIBI protein expression for a subset of patients (n=71). The frequency of ER+ tumor cells was significantly higher in the quiescent and ERhigh subtypes compared to ERlow (log2FC=2.73; P=2.11×10⁻⁵; Wilcoxon rank sum test) while HER2+ tumor cells were significantly higher in the ERlow subtype (log2FC=4.88; P=3.74×10⁻²; Wilcoxon rank sum test). Overall, the frequencies of ER+ and HER2+ tumor cells were well correlated with RNA abundance of ESR1 and ERBB2, respectively. PGR levels were upregulated in quiescent and ERhigh compared to ERlow. Based on MIBI data, quiescent lesions were depleted for Ki67 (log2FC=-1.46; P=8.08×10⁻²; Wilcoxon rank sum test) and GLUT1 (log2FC=-2.64; P=8.47×10⁻³) positive tumor cells, vs ERhigh and ERlow tumors, suggesting quiescent lesions are less proliferative and less metabolically active.

In their analysis of DCIS tumors and TME by MIBI, Risom et al. (Cell 185, 299-310.e18 (2022)) identified myoepithelial E-cadherin expression as the most discriminative feature for risk of progression. To investigate this in relation to the identified RNA clusters, we compared the distribution of myoepithelial E-cadherin frequency by MIBI in matched RAHBT LCM RNA samples. We found that ERhigh lesions had significantly higher myoepithelial E-cadherin frequency compared to ERlow and quiescent lesions (P<0.026). While most recurrence-associated pathways were enriched in ERlow lesions, this points to a feature associated with recurrence amongst ER+ DCIS tumors, and highlights that there are multiple paths to progression in DCIS.

Amplifications characteristic of high-risk of relapse IBC occur in DCIS

Next, we investigated how CNAs in DCIS contribute to pathways associated with DCIS recurrence. Amongst the 29 recurrent CNAs identified across both cohorts, we found 13 gains and 16 losses, occurring in 10.1-52.6% of DCIS samples (FDR<0.05; GISTIC2). The identification of these common CNAs was not biased by depth of sequencing, but two were associated with cohort (1p21.3 and 10p15.3 deletions). The most frequent alterations were gains of chromosomes 1q and 17q, including 17q12 where the ERBB2 oncogene is located, and loss of chromosome 17p, 16q, and 11q, confirming prior findings and notably reflecting the CNA landscape of IBC.

Next, we investigated if the distribution of Proportion of the Genome copy number Altered (PGA) was biased in the 5-year outcome groups or 812 gene classifier risk groups, but found no significant differential distribution. PGA was not correlated to sequencing depth, nor predictive of iBEs.
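The Proportion of the Genome copy number Altered (PGA) may be computed from segmented copy number data as the altered fraction of the total segment length. The segment format and the log2 ratio cutoff of 0.2 below are hypothetical; the text does not state the threshold used in the study.

```python
def proportion_genome_altered(segments, threshold=0.2):
    """Fraction of total segment length whose |log2 ratio| meets the threshold.

    Each segment is a (start, end, log2_ratio) tuple.
    """
    total = sum(end - start for start, end, _ in segments)
    altered = sum(
        end - start for start, end, log2r in segments if abs(log2r) >= threshold
    )
    return altered / total

# Hypothetical segments: one neutral, one gained, one lost.
segments = [(0, 100, 0.0), (100, 150, 0.5), (150, 200, -0.4)]
pga = proportion_genome_altered(segments)
```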

Early patterns of alterations may provide insight into the mechanisms of neoplastic lesion development and progression. To identify genomic subtypes in DCIS, we employed unsupervised NMF clustering of CNA segments on TBCRC and RAHBT jointly and identified eight clusters ranging in size from 2-98 samples which were not biased by depth of sequencing. CNA cluster 1 was characterized by chr20q13.2 amplification. Three clusters were characterized by chr17q amplification (Cluster 2: chr17q11, Cluster 3: chr17q23.1, Cluster 4: chr17q12). Cluster 5 had chr8p11.23 amplification, Cluster 6 chr11q13.3 amplification, and Cluster 7 amplification of MYC on chr8q24. Cluster 8, the largest group (n=98), represented a CNA quiet subgroup, characterized by the absence or diminished signal of these CNAs. Integrative subgroups (ICs) are an IBC classification scheme based on genomic copy number and expression profiles. Intriguingly, despite the eight CNA clusters not being associated with recurrence, several of these clusters were attributed to the presence or absence of CNAs characteristic of IC subtypes, namely the four high-risk of relapse ER+/HER2- subgroups (IC1,2,6,9) and the HER2-amplified (IC5) subgroup. Of note, these four high-risk integrative subgroups (IC1,2,6,9) account for 25% of ER+/HER2- IBC and the majority of distant relapses. Integrative subtypes are prognostic in IBC and improve the prediction of late relapse relative to clinical covariates. Understanding the clinical course of DCIS lesions harboring these high-risk invasive features is highly relevant in refining clinically meaningful risk associated with DCIS progression.

To identify enriched pathways in the eight CNA clusters, we investigated the differential abundance in matched RNA samples (DESeq2 one-vs-rest) and performed GSEA Hallmark analysis on the resulting gene lists. Clusters 6 (chr11q13 amplification) and 7 (chr8q24 (MYC) amplification) were enriched for pathways associated with recurrence (Allograft Rejection and Oxidative Phosphorylation, respectively), whereas Cluster 8 (CNA quiet) was depleted of recurrence-associated pathways (Cell Cycle and mTORC1 signaling), and Cluster 6 was depleted of MYC targets. The remaining CNA clusters had no significant pathway enrichments. Thus, we identified a CNA-based cluster solution characterized by amplifications seen in high-risk IBC subtypes, including 17q12 (ERBB2) and 8q24 (MYC) amplification, some of which were significantly enriched or depleted for pathways associated with recurrence.

The DCIS TME reflects distinct immune and fibroblast states

The Hallmark pathways identified represent a diverse set of biologic events and may involve different components of the DCIS ecosystem including the cells within the TME. Accumulating evidence has shown that the TME is crucial for cancer development and progression. To analyze the DCIS TME, we generated RAHBT LCM stromal samples by dissecting stromal tissue from the DCIS edge.

To identify the contribution of epithelial and stromal components to the 812 gene classifier, we performed differential abundance analysis between stromal (n=196) and epithelial (n=265) samples from the RAHBT LCM cohort. We identified 9748 DE genes (FDR<0.05) between epithelium and stroma (5161 epithelial, 4587 stromal). An analysis of the 812 classifier genes showed that 20% were expressed primarily in stromal/TME cells, and 34% in epithelium.

The MIBI method provides an orthogonal view of the TME and generates protein expression and identity of 16 different cell types including epithelial, fibroblasts, and immune cell types. We used adjacent TMA sections to analyze RNA and MIBI expression on the same ducts. We compared MIBI-based cell type distribution across samples with the inferred cell type distribution from RNA expression data using CIBERSORTx (CSx), allowing us to cross-validate findings and extend observations on cell composition to DCIS samples without MIBI data, including the TBCRC cohort.

To define discrete TME phenotypes, we performed shared nearest neighborhood clustering of stromal RNA data and identified four distinct DCIS-associated stromal clusters and DE genes (DESeq2 each-vs-rest). Pathway analyses, MIBI protein expression and cell type distribution, and CSx-inferred cell type distribution were used to describe major characteristics of each cluster, which were termed Immune dense, Desmoplastic, Collagen-rich, and Normal-like. There was a strong correlation with fibroblast states and immune cell density.

The Immune stromal cluster was the most distinct stromal subtype, with enrichment for the outcome-associated Allograft Rejection- and other immune activation pathways. MIBI and CSx data demonstrated a total abundance of immune cells more than twice that of any other cluster, with predominance of lymphoid over myeloid cells. A subgroup within this cluster was highly enriched for B cells, whereas another displayed overall balanced immune cell type composition. The Immune cluster also showed association with MIBI-identified T-cell and B-cell enriched neighborhoods, myoepithelial- and myeloid-enriched neighborhoods, and was enriched for the ERiow subtype.

The normal-like cluster was enriched for Gene Ontology pathways involved with ECM organization, Complement and Coagulation Cascades, Focal Adhesion, and PI3K-AKT signaling. The collagen-rich cluster was characterized by Collagen Metabolism, TGFb signaling, Proteoglycans in Cancer, and Cell-Substrate and Focal Adhesion. This cluster had the highest fibroblast abundance and total myeloid cells, mostly associated with macrophages and myeloid dendritic cells (mDC). According to MIBI, this cluster was enriched in collagen and fibroblast associated protein positive (FAP+, VIM+, SMA+) myofibroblasts. The desmoplastic cluster was characterized by mammary gland development and fatty acid metabolism, high presence of VIM+, SMA+ myofibroblasts by MIBI, and higher levels of CD8+ T cells assessed by CSx vs the normal-like and collagen-rich clusters.

These analyses indicate that the immune response is present in a discrete subset of cases. However, outcome analysis by stromal subtype demonstrated a modest outcome difference, without major contribution from the Immune subcluster (P=0.12, log-rank test). We hypothesized that the outcome differences could be attributed to a subset of immune cells rather than the entire immune response, and analyzed CSx-inferred cell type distribution in 5-year outcome groups in TBCRC and RAHBT combined. We identified significantly higher levels of CD4+ T cells, myeloid and plasmacytoid dendritic cells (mDC and pDC), monocytes, macrophages, and overall immune cells in cases vs. controls. Furthermore, we found that several cell types, including CD4+ T cells, mDCs, and pDCs, were significant predictors of any iBE 5 years after treatment (univariable Cox regression analysis). These differences in outcome groups were overall mirrored by CSx-inferred cell type distributions in the high- and low-risk classifier groups. Finally, we investigated the distribution of CSx-based cell types in 5-year outcome groups stratified by iBE type. The results overall reflected the analysis in cases vs. controls, with the largest differences observed between invasive iBEs and controls.

Taken together, these results support the contributions of individual immune cells to high-risk outcomes. However, non-immune cell phenotypes are not well defined by this CSx approach but can still be identified as a biologic response. The desmoplastic cluster had the clearest and most favorable outcome (HR=0.23, P=0.06), despite being enriched for several recurrence-associated pathways, including proliferative signals (MYC and G2M checkpoint) associated with poor outcome in the epithelial compartment. This highlights the complexity and differential contribution from the stromal and epithelial compartments.

DISCUSSION

The aims of the HTAN Breast Pre-Cancer Atlas are to 1) develop a resource of multi-modal spatially resolved data from breast pre-invasive samples that will facilitate discoveries by the scientific community regarding the natural history of DCIS and predictors of progression to life-threatening IBC; and 2) populate that platform with data from retrospective cohorts of patients with DCIS and demonstrate its use to construct an atlas to test novel biologic insights. Here, we examined two well-annotated, retrospective, longitudinal patient cohorts with or without a subsequent iBE. The two cohorts have important and distinct differences. They comprise subjects from diverse geographical sites, race/ethnicities, median years of diagnosis, and time to recurrence. There were no significant differences in age at diagnosis or treatment across cohorts. Together, these cohorts comprise a large series of matched case-control samples allowing great statistical power to perform the comprehensive studies reported here. A particular strength of the study is the complementary nature of the two cohorts, allowing for validation of our findings, as well as the capability to separately study the epithelial and stromal components in RAHBT LCM samples. Future observations on a DCIS cohort undergoing watchful waiting would provide outcome results that may be more aligned with emerging personalized treatment strategies of DCIS, which could include non-surgical options. DCIS is a heterogeneous disease with variable prognosis but has defied attempts to identify molecular factors associated with future progression. Previous studies have evaluated the prognostic value of biomarkers associated with outcomes, with conflicting conclusions for virtually all markers tested, including ER, HER2, immune markers such as tumor infiltrating lymphocytes, and stromal characteristics.
Many promising leads have not been reproducible due to multiple factors, including lack of endpoint standardization, differences between cohorts, small sample size, and limited datasets for validation with long-term outcomes.

Herein, we have developed and validated an 812-gene classifier that independently predicted risk of both overall recurrence and invasive progression. This classifier was highly associated with outcome in a multivariable model which included treatment, age, grade, and clinical ER status; the classifier had a HR of 22.5 (95% CI 8.5-59.4) in the training set and 7.3 (95% CI 1.6-34.2) in the validation set, over four-fold higher than has been previously reported for other prognostic markers for DCIS.

Importantly, we found that this classifier was a stronger predictor of 5-year recurrence or progression than previously described clinical factors, including age at diagnosis, tumor grade, ER status, or treatment. The large dataset, with a high number of events, permitted an agnostic analysis of all genome-wide features and was thus less opportunistic than other, more limited studies. Further, since no a priori assumptions were made regarding whether to incorporate the molecular features of invasive cancer, we were able to construct a less biased predictor.

Our classifier is characterized by several Hallmark pathways including some related to cell cycle progression and growth factor signaling (E2F targets, G2M checkpoint, MYC targets, mTORC1 signaling) and metabolism (Glycolysis, Oxidative Phosphorylation). Examination of pathway activation status at the individual tumor level revealed the underlying complexity of the classifier. High correlation between the cell cycle linked E2F and G2M pathways is consistent with a proliferation related signature. However, the strongest features of the classifier (distinguishing cases from controls) were MYC and MTORC1 signaling, which are strongly correlated with each other but less so with the canonical proliferation pathways, indicating that proliferation alone is not the central predictor. Interestingly, both Glycolysis and Oxidative Phosphorylation were increased in cases, suggesting that heightened metabolic activity is associated with risk of progression regardless of whether it is anaerobic. Finally, Allograft Rejection, a broad immune pathway, was elevated in cases and in general appeared to be an independent component of the classifier. Overall, there are multiple components to this classifier that are elevated in different subsets of the tumors, lending additional evidence that simplified predictors fail to capture the heterogeneity of the disease. IBC has been genomically profiled with several approaches, including the PAM50 and IC classification schemes. While DCIS and IBC are part of the same neoplastic process, there are differences in the TME, evolutionary age, and inter-observer variability in diagnostic labeling at different stages of progression. This suggests that a DCIS-specific classification scheme would correlate better with biologic and clinical features of DCIS. Our analysis indicated the PAM50 subtypes are not apt for DCIS characterization, as previously described (Bergholtz et al., NPJ Breast Cancer 6, 26 (2020)).
Instead, we identified three transcriptomic DCIS subgroups, characterized by ER signaling, proliferation and metabolism. These subtypes more accurately capture the spectrum of DCIS biology than IBC-derived subtypes, and represent the fundamental genomic organization at this early stage of breast neoplasia. They may represent the earliest variation in neoplasia transcriptome, potentially applicable to earlier stages such as hyperplasias.

There are several possible reasons why traditional IBC classifiers do not perform well on DCIS. HER2 expression is more common at the DCIS stage than at the IBC stage, which may lead to a different transcriptomic distribution in DCIS vs IBC. Many ER- DCIS express HER2 without amplification, in contrast to IBC, where the HER2-amplified subtype is clearer. Moreover, DCIS cells are confined to the epithelial compartment and interact with myoepithelial cells and the basement membrane, thus presumably restricted by rules of differentiation that govern normal epithelial cells, which could constrain the transcriptomic variability of neoplastic cells and in turn possible subtypes. Finally, the evolutionary age of the neoplasm may influence classification differences in DCIS vs IBC. By comparing WGS data from DCIS and IBCs, we found that the same constellation of copy number changes was present in both, consistent with previous studies. While DCIS had fewer genomic alterations than IBC, and a larger group of DCIS was classified as genomically quiescent, recurrent genomic events that drive the IBC-based IC scheme were evident at the DCIS stage.

A unique aspect of our study is the separate profiling of stromal and epithelial components through CSx analysis of LCM-derived RNA coupled with in situ MIBI protein expression. We identified four stromal subtypes characterized by distinct pathways and by stromal and immune cell composition. Specific stromal patterns were correlated with epithelial expression patterns, and particularly HER2+/ER- DCIS were associated with a stronger immune response, potentially associated with co-amplification of ERBB2 (HER2) and chemokine encoding genes on the 17q12 chromosomal region. A limitation of this study is that our CSx approach did not facilitate identification of non-immune stromal cell types.

Generating a DCIS atlas is similar to the effort of TCGA for IBC, but there are important differences. Working with DCIS samples is considerably more challenging; while IBC tumors are evident by gross exam, and can be easily obtained as fresh, fresh frozen, or archival material, this is not the case for pre-invasive lesions. DCIS can sometimes be recognized radiographically but is only precisely detailed by pathologic examination, making prospective tissue collection a challenge. Moreover, the transition from intraepithelial to invasive neoplasia is definitional for IBC. For DCIS, such a clear-cut definition does not exist. DCIS is broadly defined by cytologic and architectural changes compared to normal breast tissue by a growth of neoplastic cells in the inter-epithelial compartment.

One issue that should be noted is the genetic relationship between the primary DCIS and the subsequent ipsilateral cancer. Recent work on a large cohort indicates that 18% of ipsilateral invasive events may be unrelated to the primary DCIS based on mutations and CNAs. Non-clonal recurrences were more likely to be in a different breast quadrant and have discordant ER expression whereas time to recurrence and patient age were not significantly associated with clonality. While we did not examine the recurrences in the current study to determine clonality, it is likely that a similar fraction would be identified as “unrelated.” We anticipate that further refinement and validation of our classifier will be strengthened by eliminating non-clonal iBEs.

In conclusion, we have developed a genomic classifier that predicts both recurrence and invasive progression, using large, comprehensively annotated case-control data sets of primary DCIS. The classifier is comprised of both epithelial and stromal features. Our findings support that progression is a process that requires both invasive propensity among the DCIS cells and stromal permissiveness in the TME. We propose this classifier as the basis for a future clinical test to assess outcomes in patients with primary DCIS to guide a more individualized therapy, based on biologic risk. Future work will include further validation of the classifier and translation to clinical implementation.

EXPERIMENTAL MODEL AND SUBJECT DETAILS

Cohort collection and sample acquisition

RAHBT Cohort

The Resource of Archival Breast Tissue (RAHBT) is a data/tissue resource established by Drs. Allred and Colditz in 2008 focused on premalignant or benign breast disease. Uniform coding of premalignant lesions assures greater consistency and use of research. Follow-up through hospital record linkages documents subsequent breast lesions including IBC. The entire study population includes women ages 18 and older with documented cases of premalignant breast disease (including carcinoma in situ). The study was approved by the Washington University in St. Louis Institutional Review Board (IRB ID #: 201707090). Women were identified as eligible through seven primary sources: Washington University School of Medicine Departmental databases (Surgery, Radiation Oncology, Pathology, and Radiology), and the Siteman Oncology Services Database (local tumor registry), the St. Louis Breast Tissue Repository, and the Women’s Health Repository. We reviewed all records, excluded women with cancer prior to qualifying premalignant lesions and identified 1831 unique women with DCIS or DCIS and subsequent recurrence. A common data set with pathologic details, risk factor data, treatment, and unique identifiers was created and used to follow these women for subsequent breast lesions. Centralized pathology review confirmed 174 cases of DCIS with recurrent lesions. For each case (with subsequent ipsilateral or contralateral breast events) we matched two controls who remained free from subsequent breast events based on race, year of diagnosis (+/- 5 years), age at diagnosis (+/- 5 years), and type of definitive surgery (mastectomy or lumpectomy). For each DCIS diagnosis we retrieved slides and blocks for pathology review, secured a whole slide image of each sample, marked for TMA cores, and prepared for laboratory processing. A total of 172 cases and 338 controls were cored for TMAs. Breast pathology review was completed by Drs. Allred, Warrick, DeSchryver, and Veis.

To define an external validation data set that used identical eligibility criteria to TBCRC 038, including year of initial DCIS diagnosis, we identified an additional set of cases from RAHBT and used comparable laboratory procedures for RNA-seq.

For RAHBT, 97 patients were analyzed by RNA-seq (Table 2). The median age at diagnosis was 53, and median year of diagnosis 2006. Time to recurrence with ipsilateral IBC was 36 months, and to diagnosis of ipsilateral DCIS 46.9 months. For women in the cohort with no iBEs, median follow-up extended to 141 months. The total number of deaths by any cause was six. Treatment of initial DCIS comprised lumpectomy with radiation (66.0%), lumpectomy without radiation (10.3%), and mastectomy (23.7%). This subset of the RAHBT cohort was composed of 35.1% African American women.

For RAHBT LCM, 265 patients were analyzed by RNA-seq. The median age at diagnosis was 53, and median year of diagnosis 2002. Time to recurrence with ipsilateral IBC was 80 months, and to diagnosis of ipsilateral DCIS 50 months. For women in the cohort with no iBEs, median follow-up extended to 111 months. Treatment of initial DCIS comprised lumpectomy with radiation (52%), lumpectomy without radiation (18%), and mastectomy (28%). This subset of the RAHBT cohort was composed of 25% African American women.

TBCRC 038 Cohort

TBCRC 038 is a retrospective multi-center study activated at 12 participating TBCRC (Translational Breast Cancer Research Consortium) sites, which identified women treated for ductal carcinoma in situ (DCIS) at one of the enrolling institutions between 01/01/1998 and 02/29/2016. The TBCRC and the Department of Defense (DOD) approved this study for the collection of archival tissues. Duke served as the initiating and central site for all data, samples, assays, and analysis. The study was approved by the Duke Health Institutional Review Board (Protocol ID: Pro00068646) as well as the IRB at each participating institution. Individual sites reviewed medical records to identify patients eligible for the study.

Study eligibility criteria included: women aged 40-75 years at diagnosis of DCIS without invasion; no prior treatment for breast cancer; and definitive surgical excision with no ink on tumor margins and treated with mastectomy, lumpectomy with radiation, or lumpectomy. Cases (patients with subsequent iBEs) were matched 1:1 to controls with at least 5 years of follow-up without subsequent iBEs. Matching was based on year of diagnosis (+/- 5 years), age at diagnosis (+/- 5 years), and DCIS nuclear grade (high grade vs. non-high grade). All cases consisted of initial diagnosis of pure DCIS, with ipsilateral recurrence occurring no less than 12 months from date of primary diagnosis. Clinical data, including treatment data, were collected at each site, and standardized data points were entered into a web-based portal. Tumor tissue was collected from FFPE blocks and cut into 5 µm sections. All slides were scanned and reviewed centrally by a breast pathologist (AH) to confirm the diagnosis. Tumor tissue marked by the pathologist was macrodissected for bulk analysis assays.

The 216 patients from the TBCRC cohort analyzed by RNA-seq (Table 2) include 95 women without iBE after 5 or more years, 66 with DCIS iBEs, and 55 with IBC iBEs. Median time to IBC iBE for this subset was 58 months, and median time to DCIS iBE was 40 months. The total number of deaths by any cause was 12. 30% of this subset were African American.

METHOD DETAILS

TMA construction

Qualified DCIS or subsequent lesion slides were assembled for pathology review. The research breast pathologist marked the slides for best area to core (1mm) for the carcinoma in situ and later event. The TMAs were designed such that cases/controls were assigned randomly on the map. The Beecher Tissue Arrayer was used to take a core from the patient donor block and place it in the designated area of the recipient TMA block. Slides were then cut for research purposes, and stained H&E and unstained slides were prepared. The TMAs were stored in the St. Louis Breast Tissue Registry Lab at room temperature.

Slide cutting

A TMA cutting breakdown was established for the RAHBT TMAs to include slides for laser capture microdissection and sequencing (LCM PEN membrane glass slides), multiplex protein staining (MIBI high-purity gold-coated slides), and FISH analysis (charged glass slides). The order of the slides for the different assays was as follows:

Slide 1-3: FISH/routine IHC - 4 um slices on charged slides

Slide 4-6: RNA/DNA sequencing - 7 um slices on LCM membrane glass slides

Slide 7: MIBI analysis - 4 um slices on gold coated slides

Slide 8-10: FISH/routine IHC - 4 um slices on charged slides

Slide 11-13: RNA/DNA sequencing - 7 um slices on LCM membrane slides

Slide 14: MIBI analysis - 4 um slices on gold coated slides

Slide 15-17: FISH/routine IHC - 4 um slices on charged slides

Slide 18: H&E stained

Digital H&E generation (scanners)

At Washington University School of Medicine, the original H&E slide and the TMA slide for RAHBT were imaged (20x) on an Aperio AT2 (Leica). ImageScope software was used for viewing the slides. Images are stored on secure servers in the Dept of Pathology, Washington University School of Medicine.

Pathologic analysis and masking

For the TBCRC cohort, whole slide images of the H&E slide made from the block sourced for DNA and RNA were reviewed and scored for grade, presence of necrosis and architecture by a breast pathologist. For the RAHBT LCM cohort, H&E images from the TMAs were used to score for grade, presence of necrosis and architecture by four breast pathologists. Areas of DCIS and normal tissue from the RAHBT TMAs were annotated and masked for LCM by two breast pathologists.

Laser Capture microdissection

Consecutive sections of tissue microarray blocks were cut and mounted on PEN membrane slides. Slides were dissected immediately after staining on an Arcturus XT LCM System based on the masked areas. Epithelial and stromal sections were dissected separately. Each sample adhered to a CapSure HS LCM Cap (Thermo Fisher #LCM0215). After LCM, the cap was sealed in a 0.5 mL tube (Thermo Fisher #N8010611) and stored at -80°C until library preparation. The matching epithelial regions in consecutive slides were dissected for corresponding DNA libraries.

RNA-sequencing (Smart-3SEQ)

Sequencing libraries were prepared according to the Smart-3SEQ method starting from dissected FFPE tissue on an Arcturus LCM HS Cap, except for the unique P5 index and universal P7 primers. Three control samples were added to each library preparation batch and sequence batch to allow batch effect analysis. Libraries were pooled together according to qPCR measurements and prepared according to the manufacturer's instructions with a 1% spike-in of the PhiX control library (Illumina #FC-110-3002) and sequenced on an Illumina NextSeq 500 instrument with a High Output v2.5 reagent kit (Illumina #20024906).

ER, HER2 status

Clinical ER status (by IHC) was available for 83.3% (180 of 216) of the TBCRC cohort, 83.5% (81 of 97) of the RAHBT cohort, and 46.8% (124 of 265) of the RAHBT LCM cohort.

Additionally, we called ER and HER2 positivity based on mRNA abundance levels of ESR1 and ERBB2, respectively. We applied a Gaussian mixture model with two components using the mclust R package (v5.4.7).
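As an illustration only, the two-component mixture call can be sketched in Python. This is not the mclust implementation: it is a minimal expectation-maximization fit with two unequal-variance Gaussians, and the function names, the initialization from the lower and upper halves of the data, and the example abundance values are all assumptions made for this sketch.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_gaussians(values, n_iter=200):
    """Fit a two-component 1-D Gaussian mixture by EM (needs at least two values)."""
    xs = sorted(values)
    half = len(xs) // 2
    # Assumed initialization: component 0 from the lower half, component 1 from the upper half.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    grand_mean = sum(xs) / len(xs)
    grand_var = max(sum((x - grand_mean) ** 2 for x in xs) / len(xs), 1e-6)
    variances = [grand_var, grand_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var_k, 1e-6)  # floor avoids a collapsed component
    return weights, means, variances

def call_positive(values, weights, means, variances):
    """Call a sample positive when the higher-mean component is the more probable one."""
    hi = 0 if means[0] > means[1] else 1
    lo = 1 - hi
    return [
        weights[hi] * normal_pdf(x, means[hi], variances[hi])
        > weights[lo] * normal_pdf(x, means[lo], variances[lo])
        for x in values
    ]

# Hypothetical normalized ESR1 abundance values for ten samples.
esr1 = [1.8, 2.0, 2.2, 2.1, 1.9, 7.9, 8.1, 8.0, 8.3, 7.7]
w, m, v = fit_two_gaussians(esr1)
er_positive = call_positive(esr1, w, m, v)
```

The same routine would be applied separately to ERBB2 abundance to obtain an RNA-based HER2 call.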

PAM50 and IC10

PAM50 subtypes were called using the genefu v2.22.1 R package. We compared the PAM50 subtypes called by genefu against subtypes called adjusting for the expected proportion of ER+ samples, as implemented in. We found both methods to be highly concordant (>96% concordance). We compared the correlation of DCIS and IBC samples to the PAM50 centroids within the genefu R package using Spearman’s correlation. We also compared the silhouette widths based on Euclidean distances of the PAM50 subtypes to the de novo DCIS subtypes using the cluster R package (v2.1.1). IC10 subtypes were called using the iC10 (v1.5) R package. PAM50 subtypes were called in TBCRC and RAHBT separately, using the same protocols, given the differences in measurement techniques used in the two cohorts.
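The comparison of a sample against subtype centroids by Spearman's correlation can be sketched as follows. This is a generic nearest-centroid illustration, not the genefu code; the centroid values, subtype labels, and function names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))

def nearest_centroid(sample, centroids):
    """Return the label of the centroid with the highest Spearman correlation."""
    return max(centroids, key=lambda name: spearman(sample, centroids[name]))

# Hypothetical four-gene centroids for two subtypes, and one sample.
centroids = {"SubtypeA": [1.0, 2.0, 3.0, 4.0], "SubtypeB": [4.0, 3.0, 2.0, 1.0]}
sample = [10.0, 20.0, 25.0, 40.0]
assigned = nearest_centroid(sample, centroids)
```

Because Spearman's correlation depends only on ranks, the assignment is insensitive to monotone differences in scale between measurement platforms.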

To compare PAM50 centroids in DCIS to TCGA: The TCGA cohort was downsampled to match the size of the DCIS cohort. The downsampling was repeated 1,000 times, and the median correlation for each of the 1,000 iterations was compared to the median DCIS correlations.

Differential abundance analyses

Differential abundance analysis was performed using the R package DESeq2 v1.30.1 with default options. P-values were adjusted for multiple testing using the Benjamini-Hochberg method. FDR<0.05 was considered significant for all DESeq2 analyses. Reads matrices were VST normalized for downstream analyses.
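For reference, the Benjamini-Hochberg adjustment named above can be written in a few lines. The sketch below is a generic implementation of the step-up procedure, not code taken from DESeq2; the function name and the example P-values are illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # rank of this P-value in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Illustrative P-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
significant = [q < 0.05 for q in fdr]
```

With the FDR<0.05 threshold stated above, a gene is retained when its adjusted value falls below 0.05.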

Unsupervised clustering: non-negative matrix factorization

We identified RNA and CNA based clusters by non-negative matrix factorization using the NMF R package v0.23.0. Each NMF rank was run 30 times to evaluate cluster stability. We comprehensively evaluated 2-10 clusters for each data type and evaluated cluster fit by cophenetic and silhouette values. RNA clusters were first discovered in TBCRC and replicated in RAHBT. We evaluated replication by quantifying the concordance of de novo clusters identified in RAHBT vs clusters determined from centroids identified in TBCRC.

CNA clusters were discovered in TBCRC and RAHBT jointly and compared against clusters identified in TBCRC and RAHBT individually to ensure robustness.

CIBERSORTx

Using single-cell RNA-seq datasets, a breast specific signature matrix was built to resolve proportions of tumor, fibroblasts, endothelial and immune cells from bulk RNA-seq data. scRNAseq data was downloaded from the Gene Expression Omnibus database (GEO data repository accession numbers GSE114727, GSE114725). Normalized counts were obtained using the Seurat R package (v3.2.0), and used as single cell matrix input alongside their cell type identities (code available: cibersortx.stanford.edu/, default parameters for “Create Signature Matrix/scRNAseq input data”). The resultant signature matrix contained 3484 genes and resolved different immune cell types, including B, CD8 T, CD4 T, NKT, NK, mast cells, neutrophils, monocytes, macrophages and dendritic cells (“Impute Cell Fractions/Enable batch correction S-mode”, and default parameters). The signature matrix was first in-silico validated. In order to test the accuracy of the signature matrix, a set of samples (1/10 of each type) from the same scRNAseq dataset was reserved to build a synthetic matrix of bulk RNA-seq data. By mixing different proportions of single cell transcripts, the synthetic bulk was used to predict cell type proportions and subsequently correlated with the true proportions used to build the synthetic mix. Pearson’s coefficient was >0.75 in all the cases, and most >0.9. The aforementioned matrix was used to deconvolve the LCM RNA-seq samples and to compare CSx-estimated cell abundance with MIBI-identified cell types. Cell abundance between groups was compared by Wilcoxon rank sum test followed by Benjamini-Hochberg correction for multiple testing.
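The in-silico validation logic (build a synthetic bulk from known proportions, then correlate estimated with true proportions) can be sketched as below. The deconvolution step itself is performed by CIBERSORTx and is not reproduced; the expression profiles, the proportions, and the "estimated" values are hypothetical placeholders.

```python
import math

def mix_profiles(profiles, proportions):
    """Build a synthetic bulk profile as the proportion-weighted sum of cell-type profiles."""
    n_genes = len(next(iter(profiles.values())))
    return [
        sum(proportions[cell] * profiles[cell][g] for cell in profiles)
        for g in range(n_genes)
    ]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical mean expression of two genes in two cell types.
profiles = {"T_cell": [10.0, 0.0], "B_cell": [0.0, 10.0]}
synthetic_bulk = mix_profiles(profiles, {"T_cell": 0.3, "B_cell": 0.7})

# True T-cell fractions used for three synthetic mixes, and placeholder
# estimates standing in for the output of the deconvolution tool.
true_fraction = [0.3, 0.7, 0.5]
estimated_fraction = [0.25, 0.75, 0.5]
agreement = pearson(true_fraction, estimated_fraction)
```

In the validation described above, one such correlation is computed per cell type across all synthetic mixes.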

Shared Nearest Neighbor clustering

LCM stromal samples from RAHBT were classified using the Shared Nearest Neighbor clustering method implemented in the Seurat R package (v3.2.0). Data was normalized by negative binomial regression (sctransform R package, v0.3.2, variable.feature.n = “all.genes”). The first 15 principal components were used to identify the clusters and 16 different resolutions were compared, selecting resolution 0.75 and four clusters as the final solution. Positive markers were selected at a minimum fraction of 0.25 and the resultant gene list was used to further characterize each cluster by gene ontology and KEGG pathway analysis, implemented in the clusterProfiler R package (version 3.18.1).
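The idea behind a shared nearest neighbor graph can be illustrated with a small sketch: each sample's k nearest neighbors are found, and two samples are linked when their neighbor sets overlap strongly (Jaccard index). This is a simplification made for illustration; it groups samples by connected components rather than by the resolution-dependent modularity optimization used in Seurat, and the points, `k`, and `min_jaccard` values are assumptions.

```python
import math

def k_nearest(points, k):
    """Neighbor set of each point: its k nearest other points plus itself."""
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append({j for _, j in dists[:k]} | {i})
    return neighbours

def snn_edges(points, k, min_jaccard):
    """Edges between points whose neighbor sets have Jaccard overlap >= min_jaccard."""
    nn = k_nearest(points, k)
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            jaccard = len(nn[i] & nn[j]) / len(nn[i] | nn[j])
            if jaccard >= min_jaccard:
                edges[(i, j)] = jaccard
    return edges

def connected_components(n, edges):
    """Group node indices into connected components with a union-find structure."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())

# Hypothetical samples in a two-dimensional principal component space.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges = snn_edges(points, k=2, min_jaccard=0.5)
clusters = connected_components(len(points), edges)
```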

Pathway & Gene Set Enrichment Analyses

Gene set enrichment analyses were performed using the fgsea R package (v1.12.0) based on the MSigDB Hallmark pathways v7.4. All genes from differential abundance analyses were included and were ranked by their signed adjusted P-values. Pathways were considered enriched if adjusted P-values<0.05. We evaluated pathway concordance across the DCIS subtypes using a hypergeometric test.
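A hypergeometric overlap test of the kind mentioned above can be sketched with the standard library. The exact contingency used in the study is not stated here, so the framing (overlap between two lists of enriched pathways drawn from a common universe) and the example counts are assumptions.

```python
from math import comb

def hypergeom_upper_tail(overlap, size_a, size_b, universe):
    """P(X >= overlap) for the overlap of two sets of sizes size_a and size_b
    drawn from a universe of the given size, under the hypergeometric null."""
    total = comb(universe, size_b)
    upper = min(size_a, size_b)
    favourable = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total

# Hypothetical example: 10 pathways enriched in one subtype, 8 in another,
# 5 shared, out of a universe of 50 pathways.
p_overlap = hypergeom_upper_tail(5, 10, 8, 50)
```

A small `p_overlap` indicates that the two subtypes share more enriched pathways than expected by chance.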

Single sample gene set variation analysis was performed using the GSVA R package (v1.38.2) using default parameters.

Outcome analysis

Associations with time to event were quantified using a Cox Proportional Hazards model correcting for treatment as indicated in the text. To standardize follow-up across TBCRC and RAHBT, we censored the follow-up time at 250 months, the maximum follow-up time in TBCRC. Kaplan-Meier plots as implemented in the R packages survival (v3.2.10) and survminer (v0.4.9) were used to visualize outcome differences.
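The follow-up standardization and the Kaplan-Meier estimate can be sketched as follows. This is a plain product-limit estimator written for illustration, not the survival package; the function names and the example follow-up data are hypothetical, and the Cox model is not reproduced.

```python
def censor_follow_up(times, events, cap=250):
    """Administratively censor follow-up: records longer than `cap` months
    are truncated to `cap` and marked as having no event."""
    records = []
    for t, e in zip(times, events):
        if t > cap:
            records.append((cap, 0))
        else:
            records.append((t, e))
    return records

def kaplan_meier(records):
    """Return [(time, survival probability)] at each distinct event time."""
    event_times = sorted({t for t, e in records if e})
    survival, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for u, _ in records if u >= t)
        n_events = sum(1 for u, e in records if u == t and e)
        survival *= 1 - n_events / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical follow-up in months; event = 1 marks a recurrence.
months = [12, 36, 60, 300]
recurred = [1, 0, 1, 1]
records = censor_follow_up(months, recurred)
curve = kaplan_meier(records)
```

In this example the event recorded at 300 months lies beyond the 250-month cap, so it is treated as censored at 250 months and does not contribute an event to the curve.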

The 812-gene classifier was built using the cforest implementation of Random Forest in the caret (v6.0-91) R package using default parameters. The TBCRC cohort was used as the training cohort and the model was tested on the RAHBT cohort. Hyperparameters were tuned on the training cohort using four-fold cross validation. The mtry parameters 5, 20, 50, 100, 200, 500, and 800 were tested and the optimal mtry selected was 5. Accuracy of the classifier was assessed using ROC curve, Precision, Recall, and F1 score. Breast cancer data (BRCA) from TCGA was downloaded from www.cancer.gov/tcga. A total of 1064 samples with available follow-up information was used to test the 812-gene classifier towards progression-free survival and overall survival as defined in the TCGA-BRCA metadata.

RNA for the TCGA samples was normalized using the same protocols as the DCIS RNA-sequencing (TBCRC and RAHBT cohorts, above). The accuracy of the classifier in the TCGA cohort was assessed using ROC curve, Precision, Recall, and F1 score.
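The Precision, Recall, and F1 metrics named above follow from the confusion counts of predicted versus observed outcome. A minimal sketch, with hypothetical labels (1 = case, 0 = control) and an illustrative function name:

```python
def classification_metrics(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical observed outcomes and classifier calls for six patients.
observed = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = classification_metrics(observed, predicted)
```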

DNA-sequencing

Genomic DNA was isolated from LCM FFPE cells using the PicoPure DNA Extraction kit (Thermo Fisher Scientific #KIT0103). 50 µl of lysis buffer with Proteinase K was added to each sample and incubated at 65°C overnight. After inactivating Proteinase K, the genomic DNA was cleaned up with AMPure XP beads at a 3:1 ratio (Beckman Coulter #A63880) and eluted in 10 mM Tris-HCl (pH 8.0).

DNA libraries were constructed with the KAPA HyperPlus Kit (Kapa Biosystems #07962428001). Barcode adapters were used for multiplexed sequencing of libraries with SeqCap Adapter Kit A (Kapa Biosystems #7141530001). DNA libraries were amplified by 19 PCR cycles. AMPure XP beads were used for size selection and clean-up. DNA libraries were eluted in 30 µL of 10 mM Tris-HCl (pH 8.0).

Library size distribution was assessed on an Agilent 2100 Bioanalyzer using the DNA 1000 assay and the concentration was measured by Qubit® dsDNA HS Assay Kit (Thermo Fisher Scientific #Q32851). For each lane, 12 samples were pooled and sequenced by Novogene (Sacramento, CA, US) on the Illumina HiSeq Platform, collecting 110G per 275M reads output of paired-end reads of 150 bp length.

Identification of recurrent CNAs (GISTIC)

Recurrent CNAs were identified from purity-adjusted segment CNA calls from QDNAseq for 228 DCIS samples using GISTIC2 v2.0.23 run with the following parameters: -ta 0.3 -td 0.3 -qvt 0.05 -brlen 0.98 -conf 0.95 -armpeel 1 -res 0.01 -rx 0. To ensure CNAs were not biased by sequencing depth, recurrent CNAs significantly associated (FDR<0.05) with the number of uniquely mapped reads were filtered out. Associations were quantified by Mann-Whitney test. The number of uniquely mapped reads was determined from samtools flagstat (v1.9).

MIBI

We used a MIBI panel consisting of 37 metal-conjugated antibodies that capture 16 different cell types including epithelial, fibroblasts, and immune cell types. We took tissue sections adjacent to those used for RNA-seq to spatially align the same ducts for both MIBI and RNA. For full details of the MIBI methods, see the companion paper. Briefly, antibodies were conjugated to isotopic metal reporters. Tissues were sectioned (5 µm section thickness) from tissue blocks on gold and tantalum-sputtered microscope slides. Imaging was performed using a MIBI-TOF instrument with a Hyperion ion source.

Multiplexed image sets were extracted, slide background-subtracted, denoised, and aggregate filtered. Nuclear segmentation was performed using an adapted version of the DeepCell CNN architecture. Single cell data was extracted for all cell objects and area normalized. The FlowSOM R package v1.22.0 was used to assign each cell to one of five major cell lineages (tumor, myoepithelial, fibroblast, endothelial, immune). Immune cells were subclustered to delineate B cells, CD4+ T cells, CD8+ T cells, monocytes, MonoDC cells, DC cells, macrophages, neutrophils, mast cells, double-negative CD4-CD8- T cells, and HLADR+ APC cells. Tumor and fibroblast cells were similarly subclustered to reveal phenotypic subsets. A total of 16 cell populations were quantified and analyzed.

Data visualization

Boxplots, heatmaps, scatterplots and barplots were generated using the BoutrosLab.plotting.general R package v6.0.3, or the R packages ggplot2 (v3.3.3, boxplots), corrplot (v0.84, scatterplots), and ComplexHeatmap (v2.6.2, heatmaps). UMAPs were generated using the umap (v0.2.7.0) R package with the number of genes indicated in the text. Mosaic plots were generated using the vcd (v1.4.8) R package.

QUANTIFICATION AND STATISTICAL ANALYSIS

RNA-seq processing

RNA sequencing data was processed with 3SEQtools. Single-end Illumina FASTQ files were generated from NextSeq BCL files with bcl2fastq (v2.20.0.422) and then aligned to reference hg38 with STAR aligner (v2.7.3a). Samples that did not meet a minimum threshold of uniquely aligned reads were filtered out. The samples in this study averaged 1.11 million uniquely aligned reads. Gene expression matrices of raw and normalized read counts were produced from BAM files with featureCounts (v1.6.4) of the Subread package (v2.4.2) and GENCODE Release 33. Read counts were normalized using the variance stabilizing transformation (VST) implemented in the R package DESeq2 (v1.30.1). The VST normalization procedure normalizes for library size and returns a matrix that is approximately homoscedastic. The same normalization method was used for both the TBCRC and RAHBT cohorts individually.

DNA-seq processing

Low-pass WGS data were preprocessed using the Nextflow-based pipeline Sarek v2.6.1, with BWA v0.7.17 for sequence alignment to the reference genome GRCh38/hg38 and GATK v4.1.7.0 to mark duplicates and perform recalibration. The recalibrated reads were further processed and filtered for mappability and GC content using the R/Bioconductor quantitative DNA-sequencing (QDNAseq) package v1.22.0 with R v3.6.0. For QDNAseq, 50-kb bins were generated from (doi.org/10.5281/zenodo.4274556). We kept only autosomal sequences after filtering due to low-depth mappability and GC correction. The QDNAseq-corrected output was segmented for copy number (CN) analysis using the circular binary segmentation (CBS) algorithm from the DNAcopy R/Bioconductor package v1.60.0. Copy number aberrations (CNAs) were called using CGHcall v2.48.0. The R/Bioconductor package ACE v1.4.0 was used to estimate purity and ploidy. The proportion of the genome copy number altered (PGA) was calculated from CNAs with |log2 ratio| > 0.3 as follows: PGA = (number of bases in CNAs) / (total number of bases profiled).
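The PGA calculation described above can be illustrated with a minimal Python sketch. It assumes that segmentation has already produced a list of segments, each with start and end coordinates and a log2 ratio. The `Segment` type, the function name, and the example coordinates are hypothetical and used only for illustration; they are not taken from the pipeline described herein.

```python
from typing import Iterable, NamedTuple


class Segment(NamedTuple):
    """Hypothetical container for one copy number segment."""
    start: int         # first base of the segment (inclusive)
    end: int           # last base of the segment (inclusive)
    log2_ratio: float  # segment-level log2 copy number ratio


def proportion_genome_altered(segments: Iterable[Segment],
                              threshold: float = 0.3) -> float:
    """Return PGA = (bases in CNAs) / (total bases profiled).

    A segment counts as a CNA when |log2 ratio| > threshold.
    """
    total_bases = 0
    altered_bases = 0
    for seg in segments:
        length = seg.end - seg.start + 1
        total_bases += length
        if abs(seg.log2_ratio) > threshold:
            altered_bases += length
    if total_bases == 0:
        raise ValueError("no bases profiled")
    return altered_bases / total_bases


# Illustrative segments only (not real data).
example_segments = [
    Segment(1, 50_000, 0.0),         # neutral
    Segment(50_001, 100_000, 0.45),  # gain, counted as CNA
    Segment(100_001, 200_000, -0.5), # loss, counted as CNA
    Segment(200_001, 250_000, 0.3),  # exactly at threshold, not counted
]
pga = proportion_genome_altered(example_segments)
```

In this sketch a segment whose |log2 ratio| equals the threshold exactly is not counted as altered, following the strict inequality in the text.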

Statistical analyses

We used the Mann-Whitney U test to compare continuous distributions between two groups, as specified in the text. We used the Kruskal-Wallis test to compare continuous values between three groups. All statistical analyses were implemented in the R statistical language (v3.6.1). P-values were corrected for multiple hypothesis testing via Bonferroni (when <10 independent tests) or Benjamini & Hochberg (when >10 independent tests).
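The choice between the two correction methods can be sketched in Python as follows. This is a minimal illustration, not the R code used for the analyses. The function names and the `cutoff` parameter are hypothetical, and because the text does not state which method applies to exactly 10 tests, the sketch assumes that case falls under Benjamini & Hochberg.

```python
from typing import List, Sequence


def bonferroni(pvalues: Sequence[float]) -> List[float]:
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini & Hochberg adjusted P-values (step-up procedure)."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def adjust_pvalues(pvalues: Sequence[float], cutoff: int = 10) -> List[float]:
    """Bonferroni for fewer than `cutoff` tests, otherwise B&H.

    Assumption: exactly `cutoff` tests are handled by B&H.
    """
    if len(pvalues) < cutoff:
        return bonferroni(pvalues)
    return benjamini_hochberg(pvalues)
```

For example, `adjust_pvalues([0.01, 0.04, 0.5])` involves three tests and therefore applies the Bonferroni correction, whereas a list of twelve P-values would be adjusted with the Benjamini & Hochberg procedure.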

Further details are provided in Strand et al., Cancer Cell 40, 1-16 (2022), and its accompanying Supplementary Materials, which are incorporated by reference herein.

One skilled in the art will readily appreciate that the present disclosure is well adapted to carry out the objects and obtain the ends and advantages mentioned, as well as those inherent therein. The present disclosure described herein is representative of preferred embodiments, which are exemplary, and are not intended as limitations on the scope of the present disclosure. Changes therein and other uses will occur to those skilled in the art which are encompassed within the spirit of the present disclosure as defined by the scope of the claims.

No admission is made that any reference, including any non-patent or patent document cited in this specification, constitutes prior art. In particular, it will be understood that, unless otherwise stated, reference to any document herein does not constitute an admission that any of these documents forms part of the common general knowledge in the art in the United States or in any other country. Any discussion of the references states what their authors assert, and the applicant reserves the right to challenge the accuracy and pertinence of any of the documents cited herein. All references cited herein are fully incorporated by reference, unless explicitly indicated otherwise. The present disclosure shall control in the event there are any disparities between any definitions and/or description found in the cited references.

The foregoing is illustrative of the present invention, and is not to be construed as limiting thereof. The invention is defined by the following claims, with equivalents of the claims to be included therein.