Title:
DETERMINING OF HEALTH STATUS AND TREATMENT MONITORING WITH CELL-FREE DNA
Document Type and Number:
WIPO Patent Application WO/2024/056720
Kind Code:
A1
Abstract:
The present invention relates to a method of analyzing or determining cell-free DNA (cfDNA) fragments from a sample comprising the steps of: i. extracting cfDNA fragments from the sample; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; and iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments.

Inventors:
KIRCHER MARTIN (DE)
HASENLEITHNER SAMANTHA (AT)
SPIEGL BENJAMIN (AT)
SPEICHER MICHAEL (AT)
Application Number:
PCT/EP2023/075122
Publication Date:
March 21, 2024
Filing Date:
September 13, 2023
Assignee:
UNIV GRAZ MEDIZINISCHE (AT)
International Classes:
G16B20/00; C12Q1/6886; G16H50/20
Foreign References:
US20170211143A1 2017-07-27
Other References:
ALBERTS, B. ET AL.: "Molecular Biology of the Cell.", 2022, W.W. NORTON & CO
HALL, M.A. ET AL.: "High-resolution dynamic mapping of histone-DNA interactions in a nucleosome", NAT STRUCT MOL BIOL, vol. 16, 2009, pages 124 - 129
HEITZER, E. ET AL.: "Current and future perspectives of liquid biopsies in genomics-driven oncology", NATURE REVIEWS GENETICS, vol. 20, 2019, pages 71 - 88, XP036675874, DOI: 10.1038/s41576-018-0071-5
JIANG, P. ET AL.: "Lengthening and shortening of plasma DNA in hepatocellular carcinoma patients", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, vol. 112, 2015, pages E1317 - 1325, XP055223840, DOI: 10.1073/pnas.1500076112
MICHAEL, A.K.; THOMA, N.H.: "Reading the chromatinized genome", CELL, vol. 184, 2021, pages 3599 - 3611
MODING, E.J. ET AL.: "Detecting Liquid Remnants of Solid Tumors: Circulating Tumor DNA Minimal Residual Disease", CANCER DISCOVERY, 2021
MOULIERE, F. ET AL.: "Enhanced detection of circulating tumor DNA by fragment size analysis", SCIENCE TRANSLATIONAL MEDICINE, vol. 10, 2018, pages eaat4921, XP055669959, DOI: 10.1126/scitranslmed.aat4921
SNYDER, M.W. ET AL.: "Cell-free DNA Comprises an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin", CELL, vol. 164, 2016, pages 57 - 68
"Vogel and Motulsky's Human Genetics: Problems and Approaches", 2010, SPRINGER
SUN, K. ET AL.: "Size-tagged preferred ends in maternal plasma DNA shed light on the production mechanism and show utility in noninvasive prenatal testing", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, vol. 115, 2018, pages E5106 - E5114, XP055612386, DOI: 10.1073/pnas.1804134115
ULZ, P. ET AL.: "Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection", NATURE COMMUNICATIONS, vol. 10, 2019, pages 4666, XP055892459, DOI: 10.1038/s41467-019-12714-4
ULZ, P. ET AL.: "Inferring expressed genes by whole-genome sequencing of plasma DNA", NATURE GENETICS, vol. 48, 2016, pages 1273 - 1278
WEINBERG, A.: "The Biology of Cancer", 2013, W. W. NORTON & COMPANY
WINOGRADOFF, D.; AKSIMENTIEV, A.: "Molecular Mechanism of Spontaneous Nucleosome Unraveling", JOURNAL OF MOLECULAR BIOLOGY, vol. 431, 2019, pages 323 - 335, XP085576956, DOI: 10.1016/j.jmb.2018.11.013
Attorney, Agent or Firm:
LOIDL, Manuela et al. (AT)
Claims:
CLAIMS

1. A computer-implemented method for determining nucleosomal dyads from cell-free DNA (cfDNA) fragments from a sample comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from the sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; and iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments.

2. The computer-implemented method of claim 1, wherein nucleosomal dyad positions are determined, specifically nucleosomal dyad positions in cfDNA fragments.

3. The computer-implemented method of claim 1 or 2, wherein step iii. comprises mapping of nucleosomal dyads to cfDNA fragments within a coverage peak.

4. The computer-implemented method of any one of claims 1 to 3, wherein step iii. comprises establishment of a peak specific and cfDNA length specific statistics.

5. The computer-implemented method of any one of claims 1 to 4, wherein step iii. comprises establishing a distribution of probabilities of the presence of a nucleosomal dyad.

6. The computer-implemented method of any one of claims 1 to 5, further comprising step iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.

7. The computer-implemented method of claim 6, wherein determining the average probability of the presence of a nucleosomal dyad comprises Bayesian inference.

8. The computer-implemented method of any one of claims 1 to 7, further comprising step v. mapping peaks of the average probability of the presence of a nucleosomal dyad across the reference genome sequence.

9. The computer-implemented method of claim 8, further comprising step vi. chaining the mapped peaks across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing.

10. The computer-implemented method of claim 9, wherein chaining is grouping the peaks of the average probability of the presence of a nucleosomal dyad that occur consecutively along the reference genome.

11. The computer-implemented method of claim 9 or 10, wherein peaks are chained if a distance of at least 100 bp is between the peaks.

12. The computer-implemented method of any one of claims 9 to 11, wherein peaks are chained if a distance of at least 115, 120, 125, 130, 135, 140, 145, or 146 bp is between the peaks.

13. The computer-implemented method of any one of claims 9 to 12, wherein one or more chains of peaks are obtained.

14. The computer-implemented method of any one of claims 9 to 13, wherein each chain represents a specific cfDNA origin.

15. The computer-implemented method of claim 14, wherein the specific cfDNA origin is a cell line or a tissue.

16. The computer-implemented method of any one of claims 9 to 15, wherein chaining is performed genome-wide.

17. The computer-implemented method of any one of claims 9 to 16, wherein chaining is performed in coding and non-coding regions.

18. The computer-implemented method of any one of claims 1 to 17, comprising determining an index of fragment length and dyad position.

19. The computer-implemented method of claim 18, wherein for each cfDNA length it is determined how often the dyad is in the center of the cfDNA fragments.

20. The computer-implemented method of any one of claims 1 to 19, wherein the sample is a biological sample from a subject or from a cohort of subjects.

21. The computer-implemented method of any one of claims 1 to 20, further comprising comparing the determined nucleosomal dyads, mapped peaks and/or chained peaks with one or more standard nucleosomal dyads, standard maps of nucleosomal dyads, and/or standard maps of nucleosomal dyad chains.

22. The computer-implemented method of claim 21, wherein comparing comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for a specific classification.

23. The computer-implemented method of any one of claims 1 to 22, further comprising screening for a correlation of determined nucleosomal dyads, mapped peaks and/or chained peaks with one or more standard nucleosomal dyads, standard maps of nucleosomal dyads, and/or standard maps of nucleosomal dyad chain peaks.

24. The computer-implemented method of any one of claims 21 to 23, wherein the one or more standard nucleosomal dyads, standard maps of nucleosomal dyads, and/or standard maps of nucleosomal dyad chain peaks is determined for one or more cohorts of subjects having a specific classification.

25. The computer-implemented method of any one of claims 21 to 24, wherein the specific classification is associated with a condition.

26. The computer-implemented method of claim 25, wherein the condition is selected from the group consisting of health status, aging status, cell type, tissue type, and specific disease status.

27. The computer-implemented method of any one of claims 21 to 26, wherein markers for specific conditions are defined.

28. The computer-implemented method of any one of claims 21 to 27, further comprising determining whether a subject has a specific condition.

29. The computer-implemented method of any one of claims 1 to 28, wherein the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments for different lengths of cfDNA fragments is indicative for the health status of a subject.

30. The computer-implemented method of claim 29, wherein the length of cfDNA fragments is obtained in the fragmentation profile.

31. The computer-implemented method of claim 29 or 30, wherein a health status deviating from a healthy status is indicated if the z-score of the set of cumulative deviations and/or informative counts ratios deviate from the set of distributions of cumulative deviations and/or informative counts ratios recorded from healthy subjects.

32. The computer-implemented method of any one of claims 29 to 31, wherein a health status deviating from a healthy status is cancer or pregnancy-associated complications.

33. The computer-implemented method of any one of claims 21 to 32, wherein the health status of a subject is determined.

34. The computer-implemented method of claim 33, wherein the mapped peaks are compared with a standard map derived from healthy subjects, a standard map derived from unhealthy subjects, an outlier map of nucleosomal dyads derived from unhealthy subjects, and/or a standard map of nucleosomal dyad chains derived from healthy subjects.

35. The computer-implemented method of claim 34, wherein comparing the mapped peaks comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning.

36. The computer-implemented method of any one of claims 33 to 35, wherein a. congruence with the standard maps derived from healthy subjects and difference with the standard maps derived from unhealthy subjects is characteristic for a healthy status; b. congruence with the standard maps derived from unhealthy subjects and difference with the standard maps derived from healthy subjects is characteristic for an unhealthy status; c. congruence with the outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps derived from healthy and unhealthy subjects is characteristic for an unhealthy status; and/or d. difference with a standard map of nucleosomal dyad chains derived from healthy subjects is characteristic for an unhealthy status.

37. The computer-implemented method of any one of claims 33 to 36, wherein the unhealthy subjects are subjects suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.

38. The computer-implemented method of any one of claims 33 to 37, wherein the subject is considered unhealthy if the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads and standard maps of nucleosomal dyads characteristic for a healthy subject is more than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads and standard maps and outlier maps of nucleosomal dyads characteristic for an unhealthy subject; wherein the subject is considered healthy if the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads and standard maps of nucleosomal dyads characteristic for a healthy subject is less than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads and standard maps and outlier maps of nucleosomal dyads characteristic for an unhealthy subject; wherein a subject is considered unhealthy if the z-score of the changes of the informative counts ratios between the sample set of informative counts ratios and the standard set of informative counts ratios of healthy subjects exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3; and/or wherein a subject is considered unhealthy if the z-score of the changes of the cumulative deviations between the sample set of cumulative deviations and the standard set of cumulative deviations exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3.

39. The computer-implemented method of any one of claims 21 to 28, wherein the subject is a patient undergoing treatment of a health condition.

40. The computer-implemented method of claim 29, wherein the one or more standard maps are mapped peaks of a previous result from said patient, a standard map of nucleosomal dyads characteristic for the treatment success, chained peaks of a previous result from said patient, and/or a standard map of nucleosomal dyad chains characteristic for the treatment success.

41. The computer-implemented method of any one of claims 29 to 31, wherein differences and/or congruences provide information on the treatment success of the patient.

42. The computer-implemented method of any one of claims 39 to 41, wherein the treatment success is monitored for the treatment of cancer, specifically of prostate cancer, colon cancer, breast cancer, bladder cancer, and/or lung cancer; and for the treatment of inflammatory diseases, specifically of inflammatory bowel disease, systemic lupus erythematosus, ulcerative colitis; chronic inflammatory diseases such as thyroiditis, Crohn's disease, chronic obstructive pulmonary disease; and/or asthma.

43. The computer-implemented method of any one of claims 15 to 22, wherein the standard map is a map of nucleosomal dyads of specific tissues or cell types, or a map of nucleosomal dyad chains of specific tissues or cell types.

44. The computer-implemented method of claim 43, wherein the cell type and/or tissue contribution of cfDNA in a sample is determined.

45. The computer-implemented method of claim 43 or 44, wherein the tissue and/or cell types are selected from the group consisting of cancer cells, specifically lung cancer, colorectal cancer, breast cancer, prostate cancer, and bladder cancer; and normal cells, specifically hematopoietic cells, liver cells, epithelial cells, and bone marrow.

46. The computer-implemented method of any one of claims 21 to 28, wherein the standard map is a map characteristic for an aging status.

47. The computer-implemented method of claim 46, wherein the standard map is determined from a cohort of subjects having a specific aging status.

48. The computer-implemented method of claim 47, wherein the cohort of subjects having a specific aging status is selected from healthy subjects older than 55 years, healthy subjects between 20 and 30 years, pregnant females, and subjects having a disease.

49. The computer-implemented method of claim 48, wherein the disease is cancer, specifically selected from colorectal cancer and prostate cancer.

50. The computer-implemented method of any one of claims 46 to 49, wherein the aging status of a subject is determined.

51. A data processing apparatus comprising means for carrying out the method of any one of claims 1 to 50.

52. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any one of claims 1 to 50.

53. A computer-readable medium having stored thereon the computer program of claim 52.

Description:
DETERMINING OF HEALTH STATUS AND TREATMENT MONITORING WITH CELL-FREE DNA

FIELD OF THE INVENTION

The present invention relates to a method of determining the health status of a subject, monitoring the treatment success of a patient, and/or determining the cell type by analyzing cell free DNA (cfDNA) extracted from a sample.

BACKGROUND OF THE INVENTION

Early detection of disease is associated with more treatment options, faster intervention, and longer survival. Furthermore, many diseases are associated with aging, and certain diseases are specifically associated with unhealthy aging.

Most liquid biopsy approaches have been developed for cancer diagnosis, cancer therapy monitoring, and detection of disease relapse. However, in patients with cancer, most liquid biopsy approaches have so far remained limited in the detection of minimal residual disease and in screening applications.

Circulating cell-free DNA (cfDNA) is an informative biomarker in prenatal diagnostics and in cancer patients. cfDNA consists of highly degraded DNA fragments, which are detectable in the peripheral blood of every warm-blooded subject. As cfDNA is preferentially released from apoptotic, i.e., dying cells, it circulates in the form of nucleosome-protected, mostly mononucleosomal DNA in the body. Therefore, detailed analyses of cfDNA have extraordinary potential for early detection of diseases and the clinical management of patients, including disease monitoring and treatment decisions (Heitzer et al., 2019). For example, in patients with cancer, cfDNA contains DNA derived from tumor cells, termed circulating tumor DNA (ctDNA), which can be utilized to obtain information about the tumor genome (Heitzer et al., 2019).

In healthy individuals, most cfDNA is derived from the hematopoietic system, whereas the remaining cfDNA stems from organs, such as liver or endothelial cells.

The nucleosome is the fundamental protein subunit of chromatin around which the DNA is wrapped to enable its packaging. DNA is held in complexes with structural proteins in chromosomes. These proteins organize the DNA into a compact structure called chromatin. In eukaryotes, this structure involves DNA binding to histones. The histones form a disk-shaped complex called the histone octamer. The combination of the histone octamer and the stretch of DNA wrapped around it is called a nucleosome. Chromatin accessibility profiling is helpful for various applications in biology and medicine because changes in chromatin accessibility are implicated in multiple diseases, where they reflect disease-linked changes in cell composition, gene regulation, and epigenetic cell states. For example, alterations in gene regulation and chromatin landscapes are ubiquitous in cancer. Therefore, chromatin accessibility profiling of plasma samples may identify disease-linked changes in chromatin structure and transcription regulation. Furthermore, there is substantial scope for new discoveries in many diseases that have received little attention. Moreover, nucleosome occupancy varies across tissues and cell types, so that it reveals information about the cfDNA tissue of origin.

Hall and colleagues provided a detailed map of histone-DNA interactions. To this end, they used a mechanical unzipping method, which allowed unzipping single molecules of DNA, which contained a single nucleosome, to map the locations of the histone-DNA interactions to near base-pair accuracy along the DNA (Hall et al., 2009). It was found that the histone-DNA interactions within a nucleosome are not uniform, and the nucleosomal dyad is the region where nucleosomal DNA was most tightly bound. As the central region has the strongest interaction, the nucleosome stability is most sensitive to DNA sequences near the dyad. Once a dyad region of interactions is disrupted, the nucleosome becomes unstable, and histones dissociate from the sequence (Hall et al., 2009).

Nucleosome positioning in cells is impacted by intrinsic factors such as DNA sequence, shape, and DNA bendability, as well as extrinsic factors such as chromatin remodelers and other cofactors (Michael and Thoma, 2021). For example, higher CG content may correspond to more stable nucleosomes. DNA wrapped in nucleosomes is sterically occluded, creating obstacles for proteins that must bind to it for gene regulation, transcription, replication, recombination, and repair. Therefore, mechanisms are required to access buried stretches of nucleosomal DNA. In fact, nucleosomes are highly dynamic structures and may transiently expose their DNA so that they are, in general, not roadblocks for DNA-binding proteins. Some of the best-documented mechanisms by which nucleosomal DNA may become accessible are outlined in the following.

Nucleosomes may temporarily expose portions of their wrapped DNA through spontaneous unspooling from either end. This process by which DNA transiently disengages from the histone octamer is called “site exposure” or “nucleosome breathing”. During nucleosome breathing, nucleosomal DNA ends unwrap from the histone core partially and reversibly on a rapid time scale.

Consistent with the observations described by Hall et al. (2009), other studies confirmed that the outermost three turns of nucleosomal DNA unwrap more easily from the histone core than the inner DNA, implying that the latter is more strongly bound. Different mechanisms may govern the unwrapping of outer and inner nucleosomal DNA. Outer stretches of DNA, far from the dyad, may rapidly unwrap and rewrap from the histone core spontaneously. The probability for a nucleosomal DNA site to be accessible decays roughly exponentially toward the dyad. Importantly, sequence-dependent elasticity may result in highly asymmetric breathing behavior (Winogradoff and Aksimentiev, 2019).

An additional type of thermally driven dynamics is the spontaneous “mobility” or “thermal sliding” of nucleosomes, by which their center of mass repositions along the DNA in an unprompted longitudinal movement.

Physiological and pathological states have an impact on nucleosomal DNA accessibility. Nucleosome-bound DNA is more methylated than flanking DNA. However, under certain physiological or pathological conditions, methylation patterns may change. For example, placental DNA is globally hypomethylated, and therefore nucleosomal DNA in placental tissue has more open chromatin structures than the methylated maternal somatic tissue. Thus, nucleosome-bound placental DNA has increased accessibility to endonucleases during apoptosis and hence alternative cleavage sites compared with maternal DNA, which may explain why placentally derived DNA is shorter than maternally derived DNA in the plasma of pregnant females (Sun et al., 2018). Similarly, the size distribution of DNA from cancer cells has been reported to be shorter than DNA fragments from nonmalignant cells (Jiang et al., 2015; Mouliere et al., 2018). The shorter tumor derived cfDNA fragments have been attributed to the genome-wide hypomethylation often observed in tumor genomes or other mechanisms such as cfDNA release during cell proliferation rather than apoptosis. These physiological or pathological states impact the accessibility of nucleosomal DNA and may result in asymmetric DNA digestion where the dyad is displaced from the center of cfDNA fragments.

A common method for cfDNA analysis is next-generation sequencing (NGS), particularly whole-genome sequencing (WGS). Application of NGS to cfDNA allows detecting modifications, such as single-nucleotide variants, insertion-deletion mutations (indels), copy number alterations, methylation arrangements, and fragmentation patterns. WGS of cfDNA is used for nucleosome positioning mapping and the characterization of the associated open chromatin regions between nucleosomes (Snyder et al., 2016; Ulz et al., 2016; Ulz et al., 2019).

US 2017/211143 A1 discloses methods of determining tissue and cell types contributing to cfDNA and methods of identifying a disease or disorder in a subject as a function of determined tissue and cell types contributing to cfDNA in a sample. Thereby, mapping of nucleosome positions is based on sequence coverages using the windowed protection score (WPS).

Using the WPS strategy, the predominant local positions of nucleosomes in tissue(s) contributing to cfDNA are inferred from the distribution of aligned cfDNA fragment endpoints. These cfDNA fragment endpoints should cluster adjacent to nucleosome core particle (NCP) boundaries while also being depleted on the NCP itself. To quantify this, Snyder et al. (2016) developed the windowed protection score (WPS), which is the number of DNA fragments completely spanning a 120 bp window centered at a given genomic coordinate minus the number of fragments with an endpoint within that same window (Fig. 25 in US 2017/211143 A1 and Fig. 2A in Snyder et al., 2016). High WPS values indicate increased DNA protection from digestion; low values indicate that DNA is unprotected.
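The WPS definition given above can be illustrated with a minimal Python sketch. It is not taken from Snyder et al. (2016) or US 2017/211143 A1. The function name `windowed_protection_score`, the representation of fragments as 0-based, end-exclusive `(start, end)` tuples, and the handling of fragments lying exactly on the window boundary are assumptions made for illustration only.

```python
from typing import Iterable, Tuple

# Assumed convention: a fragment is (start, end), 0-based, end-exclusive.
Fragment = Tuple[int, int]


def windowed_protection_score(
    fragments: Iterable[Fragment], position: int, window: int = 120
) -> int:
    """Return (fragments spanning the window) - (fragments with an endpoint in it)."""
    half = window // 2
    win_start = position - half          # first base of the window
    win_end = position + half            # one past the last base of the window
    spanning = 0
    ending = 0
    for start, end in fragments:
        last = end - 1                   # last base covered by the fragment
        if start <= win_start and last >= win_end - 1:
            # Fragment covers every base of the window. A fragment that starts
            # or ends exactly on the window edge is counted here (an assumption).
            spanning += 1
        elif win_start <= start < win_end or win_start <= last < win_end:
            # At least one fragment endpoint falls inside the window.
            ending += 1
    return spanning - ending


if __name__ == "__main__":
    example = [(0, 200), (90, 260), (100, 150), (300, 400)]
    print(windowed_protection_score(example, position=100))
```

In this toy input, one fragment spans the window around position 100 and two fragments have an endpoint inside it, so the score is negative, corresponding to a position that is poorly protected.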

Thus, currently available methods provide only the determination of putative nucleosomal dyads based on coverage-based approaches such as WPS. Thereby, such initial determination of putative nucleosomal dyad positions is inferred from sequencing read depth and encompasses the determination of a cfDNA fragmentation profile. The underlying biology is that nucleosomes protect DNA from enzymatic digestion during apoptosis, and the nucleosomal dyad is the region where nucleosomal DNA is most tightly bound. Hence, positions of maximum read depth coverage may overlap with the nucleosome dyad. Such coverage-based approaches have been used in the past by Ulz et al. (2016), Snyder et al. (2016), and in US 2017/211143 A1 to infer putative nucleosome positions.

However, the major limitation of these approaches is the poor resolution, which prohibits the accurate mapping of nucleosomal dyad signals and hence obscures essential biological information which may have high clinical relevance.

Thus, currently known methods do not provide the resolution needed for determining nucleosomal dyads in cfDNA fragments. Consequently, they do not allow diagnosing diseases or physiological states, such as aging and immune responses, from cfDNA fragments. Further, it is also not possible to monitor the development of a disease or physiological condition in a subject using the liquid biopsy analysis methods developed so far.

Thus, there is an unmet need in the art for improved analysis methods of samples from subjects providing increased resolution.

SUMMARY OF THE INVENTION

It is the objective of the present invention to provide a method for analyzing cfDNA fragments with an increased resolution.

The objective is solved by the subject matter of the present invention.

The inventors of the present invention surprisingly found that analyzing the positioning of the nucleosomal dyad highly increases the resolution of cfDNA analysis and that the position of the nucleosomal dyad can be obtained by using the methods described herein. Thereby, based on the method of the invention for determining the nucleosomal dyad from cfDNA fragments, the inventors surprisingly found that the position of a nucleosomal dyad indeed provides information on the health status of a subject and allows monitoring of the treatment success of a patient. Furthermore, additional information is provided by obtaining the nucleosomal dyad, namely the cell type and/or tissue contribution of cfDNA can be determined. The present invention provides the determination of nucleosomal dyad positions in cfDNAs resulting in an unprecedented increase in resolution.

Thereby, the inventors of the present invention surprisingly found that the herein described method allows mapping of the relative position of nucleosome dyads to individual cfDNA fragments and, using this newly found information, mapping the location of nucleosome dyads back to the reference genome with unprecedented high resolution, whereas so far known methods estimated nucleosome dyad positions solely from coverage data.

The present invention provides a computer-implemented method for determining nucleosomal dyads from cell-free DNA (cfDNA) fragments from a sample comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from the sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; and iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments.
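By way of illustration only, step ii. may be sketched in Python as follows. The sketch assumes that alignment has already been performed by an external tool and that each aligned fragment is available as a 0-based, end-exclusive `(start, end)` interval on one reference sequence. What exactly the fragmentation profile contains is not fixed by the preceding paragraph; here it is assumed to consist of a fragment length histogram and a per-base coverage track, and the name `fragmentation_profile` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Assumed convention: a fragment is (start, end), 0-based, end-exclusive.
Fragment = Tuple[int, int]


def fragmentation_profile(
    fragments: Iterable[Fragment], region_start: int, region_end: int
) -> Tuple[Dict[int, int], List[int]]:
    """Return (length histogram, per-base coverage) for aligned fragments.

    The histogram counts every fragment passed in; the coverage track only
    covers the bases in [region_start, region_end).
    """
    lengths: Counter = Counter()
    coverage = [0] * (region_end - region_start)
    for start, end in fragments:
        lengths[end - start] += 1
        # Clip the fragment to the region before adding it to the coverage.
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi):
            coverage[pos - region_start] += 1
    return dict(lengths), coverage


if __name__ == "__main__":
    hist, cov = fragmentation_profile([(0, 167), (10, 177), (50, 200)], 0, 200)
    print(hist)
    print(cov[60], cov[180])
```

The per-base loop is chosen for readability; for genome-wide data a difference array or an interval library would be a more practical choice.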

Specifically, nucleosomal dyad positions are determined, specifically nucleosomal dyad positions in cfDNA fragments.

Specifically, step iii. comprises mapping of nucleosomal dyads to cfDNA fragments within a coverage peak.

Specifically, step iii. comprises establishment of a peak specific and cfDNA length specific statistics.

Specifically, step iii. comprises establishing a distribution of probabilities of the presence of a nucleosomal dyad.

Specifically, further comprising step iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.

Specifically, determining the average probability of the presence of a nucleosomal dyad comprises Bayesian inference.
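The projection of fragment-level dyad probabilities onto the reference genome, as in step iv., may be sketched as follows. This is a simplified illustration that uses a plain arithmetic mean and not the Bayesian approach named above. The lookup table `offset_prob`, which maps a fragment length to a list of dyad probabilities per base offset within the fragment, is a hypothetical data structure, as is the function name `average_dyad_probability`.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed convention: a fragment is (start, end), 0-based, end-exclusive.
Fragment = Tuple[int, int]


def average_dyad_probability(
    fragments: Iterable[Fragment],
    offset_prob: Dict[int, List[float]],
    region_start: int,
    region_end: int,
) -> List[float]:
    """Average, per reference base, the dyad probability over covering fragments.

    offset_prob[length][k] is the assumed probability that the dyad sits at
    base offset k of a fragment of that length. Fragments whose length has no
    table entry are ignored.
    """
    size = region_end - region_start
    total = [0.0] * size   # summed probabilities per reference base
    depth = [0] * size     # number of contributing fragments per reference base
    for start, end in fragments:
        probs = offset_prob.get(end - start)
        if probs is None:
            continue
        for offset, p in enumerate(probs):
            pos = start + offset
            if region_start <= pos < region_end:
                total[pos - region_start] += p
                depth[pos - region_start] += 1
    # Bases without any contributing fragment get probability 0.0.
    return [t / d if d else 0.0 for t, d in zip(total, depth)]


if __name__ == "__main__":
    table = {4: [0.1, 0.4, 0.4, 0.1]}  # toy table for 4 bp "fragments"
    print(average_dyad_probability([(0, 4), (2, 6)], table, 0, 6))
```

The toy table uses 4 bp fragments purely to keep the arithmetic checkable by hand; real cfDNA fragments are far longer.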

Specifically, further comprising step v. mapping peaks of the average probability of the presence of a nucleosomal dyad across the reference genome sequence.

Specifically, further comprising step vi. chaining the mapped peaks across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing.

Specifically, chaining is grouping the peaks of the average probability of the presence of a nucleosomal dyad that occur consecutively along the reference genome.

Specifically, peaks are chained if a distance of at least 100 bp is between the peaks.

Specifically, peaks are chained if a distance of at least 115, 120, 125, 130, 135, 140, 145, or 146 bp is between the peaks.
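A minimal Python sketch of such chaining is given below for illustration. Only the minimum distance between chained peaks (here 100 bp) is taken from the preceding paragraphs. The upper bound `max_distance`, above which a new chain is started, and the decision to skip a peak that lies closer than the minimum distance to the last chained peak are assumptions of this sketch, as is the function name `chain_peaks`.

```python
from typing import Iterable, List


def chain_peaks(
    peaks: Iterable[int], min_distance: int = 100, max_distance: int = 250
) -> List[List[int]]:
    """Group consecutive peak positions into chains.

    A peak joins the current chain if its distance to the last chained peak is
    at least min_distance and at most max_distance (max_distance is an assumed
    parameter). A larger gap starts a new chain; a smaller gap skips the peak.
    """
    chains: List[List[int]] = []
    current: List[int] = []
    for peak in sorted(peaks):
        if not current:
            current = [peak]
            continue
        gap = peak - current[-1]
        if gap < min_distance:
            continue                  # too close to the last chained peak
        if gap <= max_distance:
            current.append(peak)      # plausible nucleosome spacing
        else:
            chains.append(current)    # gap too large: close the chain
            current = [peak]
    if current:
        chains.append(current)
    return chains


if __name__ == "__main__":
    print(chain_peaks([1000, 1050, 1180, 1370, 2000, 2190]))
```

In this example the peak at 1050 is skipped because it lies only 50 bp from the peak at 1000, and the gap between 1370 and 2000 separates the two resulting chains.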

Specifically, one or more chains of peaks are obtained.

Specifically, each chain represents a specific cfDNA origin.

Specifically, the specific cfDNA origin is a cell line or a tissue.

Specifically, chaining is performed genome-wide.

Specifically, chaining is performed in coding and non-coding regions.

Specifically, comprising determining an index of fragment length and dyad position.

Specifically, for each cfDNA length it is determined how often the dyad is in the center of the cfDNA fragments.
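One possible way to compute such an index is sketched below in Python. The input format, a list of `(fragment_length, dyad_offset)` pairs with the offset counted from the fragment start, and the tolerance within which a dyad is still regarded as central are assumptions made for illustration; the function name `centered_dyad_index` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def centered_dyad_index(
    observations: Iterable[Tuple[int, int]], tolerance: int = 5
) -> Dict[int, float]:
    """Return, per fragment length, the fraction of fragments with a central dyad.

    Each observation is (fragment_length, dyad_offset), with dyad_offset being
    0-based from the fragment start. A dyad counts as central if it lies within
    `tolerance` bases of the fragment midpoint (tolerance is an assumed value).
    """
    totals: Dict[int, int] = defaultdict(int)
    centered: Dict[int, int] = defaultdict(int)
    for length, offset in observations:
        totals[length] += 1
        midpoint = (length - 1) / 2
        if abs(offset - midpoint) <= tolerance:
            centered[length] += 1
    return {length: centered[length] / totals[length] for length in sorted(totals)}


if __name__ == "__main__":
    data = [(167, 83), (167, 80), (167, 60), (147, 73), (147, 100)]
    print(centered_dyad_index(data))
```

A low fraction for a given fragment length would point to dyads that are displaced from the fragment center, as would be expected after asymmetric digestion.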

Specifically, the sample is a biological sample from a subject or from a cohort of subjects.

Specifically, further comprising comparing the determined nucleosomal dyads, mapped peaks and/or chained peaks with one or more standard nucleosomal dyads, standard maps of nucleosomal dyads, and/or standard maps of nucleosomal dyad chains.

Specifically, comparing comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for a specific classification.

Specifically, further comprising screening for a correlation of determined nucleosomal dyads, mapped peaks and/or chained peaks with one or more standard nucleosomal dyads, standard maps of nucleosomal dyads, and/or standard maps of nucleosomal dyad chain peaks.

Specifically, the one or more standard nucleosomal dyads, standard maps of nucleosomal dyads, and/or standard maps of nucleosomal dyad chain peaks is determined for one or more cohorts of subjects having a specific classification.

Specifically, the specific classification is associated with a condition.

Specifically, the condition is selected from the group consisting of health status, aging status, cell type, tissue type, and specific disease status.

Specifically, markers for specific conditions are defined.

Specifically, further comprising determining whether a subject has a specific condition.

Specifically, the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments for different lengths of cfDNA fragments is indicative for the health status of a subject.

Specifically, the length of cfDNA fragments is obtained in the fragmentation profile.

Specifically, a health status deviating from a healthy status is indicated if the z-score of the set of cumulative deviations and/or informative counts ratios deviates from the set of distributions of cumulative deviations and/or informative counts ratios recorded from healthy subjects.

Specifically, a health status deviating from a healthy status is cancer or pregnancy-associated complications.

Specifically, the health status of a subject is determined.

Specifically, the mapped peaks are compared with a standard map derived from healthy subjects, a standard map derived from unhealthy subjects, an outlier map of nucleosomal dyads derived from unhealthy subjects, and/or a standard map of nucleosomal dyad chains derived from healthy subjects.

Specifically, comparing the mapped peaks comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning.

Specifically, a. congruence with the standard maps derived from healthy subjects and difference with the standard maps derived from unhealthy subjects is characteristic for a healthy status; b. congruence with the standard maps derived from unhealthy subjects and difference with the standard maps derived from healthy subjects is characteristic for an unhealthy status; c. congruence with the outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps derived from healthy and unhealthy subjects is characteristic for an unhealthy status; and/or d. difference with a standard map of nucleosomal dyad chains derived from healthy subjects is characteristic for an unhealthy status.

Specifically, the unhealthy subjects are subjects suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.

Specifically, the subject is considered unhealthy if the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads and standard maps of nucleosomal dyads characteristic for a healthy subject is more than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads and standard maps and outlier maps of nucleosomal dyads characteristic for an unhealthy subject; wherein the subject is considered healthy if the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads and standard maps of nucleosomal dyads characteristic for a healthy subject is less than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads and standard maps and outlier maps of nucleosomal dyads characteristic for an unhealthy subject; wherein a subject is considered unhealthy if the z-score of the changes of the informative counts ratios between the sample set of informative counts ratios and the standard set of informative counts ratios of healthy subjects exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3; and/or wherein a subject is considered unhealthy if the z-score of the changes of the cumulative deviations between the sample set of cumulative deviations and the standard set of cumulative deviations exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3.
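The z-score criterion may, for example, be sketched as follows. The function names, the use of the sample standard deviation of the healthy cohort, and the numerical values are assumptions for illustration; only the preferred limit of 3 for the absolute value of the z-score is taken from the text above.

```python
from statistics import mean, stdev


def z_score(sample_value, healthy_values):
    """z-score of one sample feature relative to a healthy reference cohort."""
    mu = mean(healthy_values)
    sigma = stdev(healthy_values)  # sample standard deviation (assumed choice)
    if sigma == 0:
        raise ValueError("healthy reference values show no variation")
    return (sample_value - mu) / sigma


def deviates_from_healthy(sample_value, healthy_values, limit=3.0):
    """True if the absolute z-score exceeds the limit (preferably 3)."""
    return abs(z_score(sample_value, healthy_values)) > limit


# Hypothetical informative counts ratios recorded from healthy subjects.
healthy_ratios = [0.48, 0.50, 0.52, 0.49, 0.51]
flag_high = deviates_from_healthy(0.56, healthy_ratios)
flag_normal = deviates_from_healthy(0.51, healthy_ratios)
```

With these hypothetical values, a sample ratio of 0.56 lies more than three standard deviations from the healthy mean and is flagged, whereas a ratio of 0.51 is not. The same comparison may be applied to cumulative deviations.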

Specifically, the subject is a patient undergoing treatment of a health condition.

Specifically, the one or more standard maps are mapped peaks of a previous result from said patient, a standard map of nucleosomal dyads characteristic for the treatment success, chained peaks of a previous result from said patient, and/or a standard map of nucleosomal dyad chains characteristic for the treatment success.

Specifically, differences and/or congruences provide information on the treatment success of the patient.

Specifically, the treatment success is monitored for the treatment of cancer, specifically of prostate cancer, colon cancer, breast cancer, bladder cancer, and/or lung cancer; and for the treatment of inflammatory diseases, specifically of inflammatory bowel disease, systemic lupus erythematosus, ulcerative colitis; chronic inflammatory diseases such as thyroiditis, Crohn's disease, chronic obstructive pulmonary disease; and/or asthma.

Specifically, the standard map is a map of nucleosomal dyads of specific tissues or cell types, or a map of nucleosomal dyad chains of specific tissues or cell types.

Specifically, the cell type and/or tissue contribution of cfDNA in a sample is determined.

Specifically, the tissue and/or cell types are selected from the group consisting of cancer cells, specifically lung cancer, colorectal cancer, breast cancer, prostate cancer, and bladder cancer; and normal cells, specifically hematopoietic cells, liver cells, epithelial cells, and bone marrow.

Specifically, the standard map is a map characteristic for an aging status.

Specifically, the standard map is determined from a cohort of subjects having a specific aging status.

Specifically, the cohort of subjects having a specific aging status is selected from healthy subjects older than 55 years, healthy subjects between 20 and 30 years, pregnant females, and subjects having a disease.

Specifically, the disease is cancer, specifically selected from colorectal cancer and prostate cancer.

Specifically, the aging status of a subject is determined.

The present invention further provides a data processing apparatus comprising means for carrying out the method described herein.

The present invention further provides a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method described herein.

The present invention further provides a computer-readable medium having stored thereon the computer program described herein.

The present invention further provides an in vitro method for analyzing cell-free DNA (cfDNA) fragments from a sample comprising the steps of: i. extracting cfDNA fragments from the sample; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average probability of the presence of a nucleosomal dyad of v. across the reference genome sequence; and chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing.

Further provided herein is also an in vitro method for determining the health status of a subject comprising the steps of: i. extracting cfDNA fragments from a sample from the subject; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning the DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average probability of the presence of a nucleosomal dyad of v. across the reference genome sequence; vii. chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing; and viii. comparing the mapped peaks obtained in vi. with a library comprising standard maps, comparing the mapped peaks obtained in vi. with a library comprising outlier maps of nucleosomal dyads, and/or comparing the chained peaks obtained in vii. with a standard map of nucleosomal dyad chains, preferably wherein said comparing the mapped peaks obtained in vi. comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning; wherein a. congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status; b. congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status; c. congruence with the library of outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps of libraries from healthy and unhealthy subjects is characteristic for an unhealthy status; and/or d. difference with a standard map of nucleosomal dyad chains obtained from healthy subjects is characteristic for an unhealthy status; preferably wherein said library of standard maps derived from unhealthy subjects and/or said library of outlier maps of nucleosomal dyads derived from unhealthy subjects is from subjects suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.

Specifically, the subject is considered unhealthy if the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for a healthy subject is more than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads obtained in vi. and standard maps and outlier maps of nucleosomal dyads characteristic for an unhealthy subject; wherein the subject is considered healthy if the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for a healthy subject is less than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads obtained in vi. and standard maps and outlier maps of nucleosomal dyads characteristic for an unhealthy subject; and/or wherein a subject is considered unhealthy if the z-score of the changes of the informative counts ratios between the sample set of informative counts ratios and the standard set of informative counts ratios of healthy subjects exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3; and/or wherein a subject is considered unhealthy if the z-score of the changes of the cumulative deviations between the sample set of cumulative deviations and the standard set of cumulative deviations exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3.

Further provided herein is also an in vitro method for monitoring the treatment success of a patient comprising the steps of: i. extracting cfDNA fragments from a sample of said patient; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average probability of the presence of a nucleosomal dyad of v. across the reference genome sequence; vii. chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing; and viii. comparing the mapped peaks obtained in vi. with the mapped peaks of a previous result from said patient, comparing the mapped peaks obtained in vi. with a standard map of nucleosomal dyads characteristic for the treatment success, comparing the chained peaks obtained in vii. with the chained peaks of a previous result from said patient, and/or comparing the chained peaks obtained in vii. with a standard map of nucleosomal dyad chains characteristic for the treatment success, preferably wherein comparing the mapped peaks comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, and/or determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, wherein differences and/or congruences obtained in viii. provide information on the treatment success of the patient.

Specifically, the treatment success is monitored for the treatment of cancer, specifically of prostate cancer, colon cancer, breast cancer, bladder cancer, and/or lung cancer; and for the treatment of inflammatory diseases, specifically of inflammatory bowel disease, systemic lupus erythematosus, ulcerative colitis; chronic inflammatory diseases such as thyroiditis, Crohn's disease, chronic obstructive pulmonary disease; and/or asthma.

Further provided herein is also an in vitro method for determining the cell type and/or tissue contribution of cfDNA in a sample comprising the steps of: i. extracting cfDNA fragments from the sample; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average nucleosomal dyad probability of v. across the reference genome sequence; vii. chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing; and determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the mapped peaks obtained from vi. with a library comprising mapped nucleosomal dyads of specific tissues or cell types and/or determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the chained peaks obtained in vii. with a standard map of nucleosomal dyad chains of specific tissues or cell types.

Specifically, the tissue and/or cell types are selected from the group consisting of cancer cells, specifically lung cancer, colorectal cancer, breast cancer, prostate cancer, and bladder cancer; and normal cells, specifically hematopoietic cells, liver cells, epithelial cells, and bone marrow.

Further provided herein is also a computer-implemented method for determining the health status of a subject comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing; and vii. comparing the mapped peaks obtained in v. with a library comprising standard maps, comparing the mapped peaks obtained in v. with a library comprising outlier maps of nucleosomal dyads, and/or comparing the chained peaks obtained in vi. with a standard map of nucleosomal dyad chains, preferably wherein said comparing the mapped peaks obtained in v. comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning; wherein a. congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status; b. congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status; c. congruence with the library of outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps of libraries from healthy and unhealthy subjects is characteristic for an unhealthy status; and/or d. difference with a standard map of nucleosomal dyad chains obtained from healthy subjects is characteristic for an unhealthy status; preferably wherein said library of standard maps derived from unhealthy subjects and/or said library of outlier maps of nucleosomal dyads derived from unhealthy subjects is from subjects suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.

Further provided herein is also a computer-implemented method for monitoring the treatment success of a patient comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing, and vii. comparing the mapped peaks obtained in v. with the mapped peaks of a previous result from said patient, comparing the mapped peaks obtained in v. with a standard map of nucleosomal dyads characteristic for the treatment success, comparing the chained peaks obtained in vi. with the chained peaks of a previous result from said patient, and/or comparing the chained peaks obtained in vi. with a standard map of nucleosomal dyad chains characteristic for the treatment success, preferably wherein comparing the mapped peaks comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, and/or determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, wherein differences and/or congruences obtained in vii. provide information on the treatment success of the patient.

Further provided herein is also a computer-implemented method for determining the cell type and/or tissue contribution of cfDNA in a sample comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing, and determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the mapped peaks obtained from v. with a library comprising mapped nucleosomal dyads of specific tissues or cell types and/or determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the chained peaks obtained in vi. with a standard map of nucleosomal dyad chains of specific tissues or cell types.

Further provided herein is also a data processing apparatus comprising means for carrying out the method described herein.

Further provided herein is also a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method described herein.

Further provided herein is also a computer-readable medium having stored thereon the computer program described herein.

Further provided herein is also a use of a computer-implemented method in a method described herein, said method comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. performing at least one of the steps iii. to viii. according to a method described herein.

Further provided herein is also an in vitro method for determining the health status of a subject comprising the steps of: i. extracting cfDNA fragments from a sample from the subject; ii. determining the sequence of the cfDNA fragments by performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; and iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; wherein the probability obtained from iv. for different fragment lengths of the cfDNA fragments as obtained from iii. provides information on the health status of said subject, preferably wherein a health status deviating from a healthy status is indicated if the z-score of the set of cumulative deviations and/or informative counts ratios deviate from the set of distributions of cumulative deviations and/or informative counts ratios recorded from healthy subjects, preferably wherein a health status deviating from a healthy status is cancer, unhealthy aging, or pregnancy-associated complications.

FIGURES

Figure 1: Principle of liquid biopsy for circulating tumor DNA analysis

Figure 2: Primary cfDNA analysis

Figure 3: Nucleosome occupancy analysis based on depth-of-coverage

Figure 4: Typical fragment length distribution

Figure 5: Computation of nucleosome dyad prior distribution

Figure 6: Transformation of empiric count distributions to nucleosome prior distributions

Figure 7: Overview heatmap of dyad count distributions with different distribution truncation strategies

Figure 8: Nucleosome occupancy pattern from nucleosome priors

Figure 9: Nucleosome occupancy patterns; Nucleosome posterior signal vs. depth of coverage

Figure 10: Extraction and selection of features from biologically relevant genomic loci for machine learning

Figure 11: Machine learning classifiers detect pathophysiological states

Figure 12: New insights from nucleosome prior distributions and posterior nucleosome dyad position mapping

Figure 13: Knowledge of cell type/tissue contribution of cfDNA pool improves the diagnostic value of analysis

Figure 14: Kurtosis of dyad placement

Figure 15: Cumulative deviation of positioning

Figure 16: Dyad signal-to-noise ratio

Figure 17: Dilation feature of posterior peak

Figure 18: Phase feature of posterior peak

Figure 19: Confidence feature of posterior peak

Figure 20: Cumulative deviations of prior distributions example

Figure 21: Computation of the informative counts ratio

Figure 22: Nucleosome occupancy results and dyad chaining algorithms

Figure 23: Nucleosome occupancy results for expected active genes

Figure 24: Nucleosome occupancy results for expected inactive genes

Figure 25: Comparison between low-resolution nucleosome positioning

DETAILED DESCRIPTION

Unless indicated or defined otherwise, all terms used herein have their usual meaning in the art, which will be clear to the skilled person. Reference is for example made to the standard handbooks, such as “Molecular Biology of the Cell” (Alberts et al., 2022), “Vogel and Motulsky's Human Genetics: Problems and Approaches” (Speicher et al., 2010), “Human Molecular Genetics” (Strachan and Read, 2018), and “The Biology of Cancer” (Weinberg et al., 2013).

The subject matter of the claims specifically refers to artificial products or methods employing or producing such artificial products, which may be variants of native (wild type) products. Though there can be a certain degree of sequence identity to the native structure, it is well understood that the materials, methods and uses of the invention, e.g., specifically referring to isolated nucleic acid sequences, amino acid sequences, fusion constructs, expression constructs, transformed host cells and modified proteins, are “man-made” or synthetic, and are therefore not considered as a result of “laws of nature”.

The terms “comprise”, “contain”, “have” and “include” as used herein can be used synonymously and shall be understood as an open definition, allowing further members or parts or elements. “Consisting” is considered as a closed definition without further elements of the consisting definition feature. Thus “comprising” is broader and contains the “consisting” definition.

The term “about” as used herein refers to the same value or a value differing by +/-5 % of the given value. As used herein and in the claims, the singular form, for example “a”, “an” and “the” includes the plural, unless the context clearly dictates otherwise.

As used herein, DNA refers to deoxyribonucleic acid. DNA is a type of nucleic acid.

As used herein, the term “nucleic acid” generally refers to a polynucleotide comprising two or more nucleotides. A nucleotide is a monomer composed of three components: a 5-carbon sugar, a phosphate group, and a nitrogenous base. The four naturally occurring types of DNA nucleotides are: adenine (A), thymine (T), guanine (G), and cytosine (C).

As used herein, the term “cfDNA” refers to “cell free DNA”, “cell-free DNA”, “circulating free DNA”, or “circulating-free DNA”. cfDNA consists of highly degraded DNA fragments, which are detectable in the peripheral blood of every human. In healthy individuals, the vast majority of cfDNA is derived from the hematopoietic system. However, the preferential DNA contribution to the cfDNA pool may change under certain physiological or pathological conditions. Furthermore, cfDNA can also provide information about physiological processes such as aging. cfDNA may comprise a footprint representative of its underlying chromatin organization, which may capture one or more of: expression-governing nucleosomal occupancy, RNA Polymerase II pausing, cell death-specific DNase hypersensitivity, and chromatin condensation during cell death. Such a footprint may carry a signature of cell debris clearance and trafficking, e.g., DNA fragmentation carried out by caspase-activated DNase (CAD) in cells dying by apoptosis, but also may be carried out by lysosomal DNase II after the dying cells are phagocytosed, resulting in different cleavage patterns. cfDNA represents an essential component of “liquid biopsies”, which refers to the analyses of non-solid biological sources (e.g., blood, urine, CSF, ascites) to obtain information similar to tissue biopsies. Analyses of cfDNA are of extraordinary relevance, particularly in oncology, since in patients with cancer, cfDNA contains circulating tumor DNA (ctDNA) shed from tumor cells into the circulation.

Mechanisms for DNA release into the bloodstream include apoptosis, necrosis, and active release; specifically, cfDNA is released by apoptosis. In eukaryotes, DNA is wrapped around histones to form nucleosomes, which are the basic structure of DNA packing. In general, the cfDNA fragment length distribution has its mode at approximately 167 bp. This length corresponds approximately to the size of DNA wrapped around a nucleosome (~147 bp) and a linker fragment (~20 bp). This particular cfDNA size pattern corresponds to fragmentation patterns after enzymatic processing in apoptotic cells. Specifically, the cfDNA fragmentation patterns reflect the association between cfDNA with nucleosome core particles and linker histones, determining where nuclease cleavage may occur. Hence, DNA is frequently cleaved between nucleosomes and only rarely within nucleosomes. The latter circumstance is also called “cleaving resistance” and associated with cfDNA fragments described herein.
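For illustration only, the fragment length distribution and its mode may be derived from aligned fragment coordinates as sketched below. The function names, the 0-based half-open coordinates, the tie-breaking rule, and the toy input are assumptions and do not represent measured data.

```python
from collections import Counter


def fragment_length_profile(fragments):
    """Count how often each fragment length occurs.

    fragments: iterable of (start, end) pairs, 0-based, half-open [start, end).
    """
    return Counter(end - start for start, end in fragments)


def modal_length(profile):
    """Most frequent fragment length; ties resolve to the shorter length."""
    return max(profile.items(), key=lambda item: (item[1], -item[0]))[0]


# Toy input: three fragments of 167 bp and one each of 166, 145 and 320 bp.
toy_fragments = [(0, 167), (50, 217), (300, 466), (500, 645), (700, 867), (900, 1220)]
toy_profile = fragment_length_profile(toy_fragments)
toy_mode = modal_length(toy_profile)

# Nucleosome-bound DNA (~147 bp) plus one linker (~20 bp), as stated above.
expected_mode = 147 + 20
```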

The architecture of individual nucleosomes determines access to nucleosomal DNA. The individual nucleosome core particle contains 147 bp of DNA wrapped in ~1.7 left-handed superhelical turns around a central octamer composed of two copies of each of the four core histones H2A, H2B, H3, and H4. These fundamental nucleosome units are connected with intervening linkers ranging from 20 to 100 bp (Michael and Thoma, 2021). Usually, the DNA is tightly wrapped around this histone octamer and sharply bent. This sharp bending occurs at every DNA helical repeat, i.e., ~10 bp, when the major groove faces inwards towards the histone octamer and ~5 bp away, with opposite direction, when the major groove faces outward. The nucleosome core particle architecture is pseudo-2-fold symmetric, with the DNA position at the symmetry axis. The symmetry axis, i.e., the dyad, is designated as location 0. The superhelix locations (SHLs) are labeled with ±1, ±2, and so on and denote where the minor grooves of the DNA double helix structure face away from the histone octamer (shown in Michael and Thoma, 2021).

The methods described herein are based on the analysis of the presence of one or more nucleosomal dyads in cfDNA fragments.

The “dyad” or “nucleosomal dyad” as used herein is the region occupied by the center of the nucleosome or the base position of nucleosomal DNA that marks the midpoint of the nucleosomal base pair sequence (see Michael and Thoma, 2021).

With its two juxtaposed DNA gyres, the nucleosomal DNA itself places most DNA motifs directly adjacent to a second DNA strand on the neighboring gyre, except the dyad where only one DNA strand is present (shown in Michael and Thoma, 2021).

According to one embodiment, methods are described herein for analyzing cfDNA in a sample. According to a specific embodiment, an in vitro method is described herein for analyzing cfDNA fragments from a sample comprising the steps of: i. extracting cfDNA fragments from the sample; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; and vi. mapping peaks of the average probability of the presence of a nucleosomal dyad of v. across the reference genome sequence.

As used herein, the term “sample” generally refers to a biological sample obtained from or derived from a subject. Biological samples may be cell-free biological samples or substantially cell-free biological samples, or may be processed or fractionated to produce cell-free biological samples.

In some embodiments, the biological sample used in the method of the invention is a biofluid sample. Non-limiting examples of useful biofluid samples include, e.g., a blood sample, a serum sample, a plasma sample, a cerebrospinal fluid (CSF) sample, a lymph sample, an endometrial fluid sample, a urine sample, a saliva sample, a tear fluid sample, a synovial fluid sample, an amniotic fluid sample, and a sputum sample. In preferred embodiments, the biofluid sample is selected from a blood sample, a urine sample, a cerebrospinal fluid sample, or an amniotic fluid sample. cfDNA can, e.g., be obtained by a standard blood draw, i.e., a minimally invasive approach. As the blood vial after the blood draw contains both the cellular components of blood and the cell-free fraction, which is referred to as plasma, extraction steps such as centrifugation steps may be required to separate these components.

As used herein, the term “extract” in the context of extracting cfDNA fragments refers to the isolation of the cfDNA or cfDNA fragments from the sample. Isolation, extraction, and/or purification of cfDNA or cfDNA fragments may be performed through collection of bodily fluids using a variety of techniques. In some cases, collection may comprise aspiration of a bodily fluid from a patient using a syringe. In other cases, collection may comprise pipetting or direct collection of fluid into a collecting vessel. After collection of bodily fluid, cfDNA or cfDNA fragments may be isolated and extracted using a variety of techniques known in the art. In some cases, cfDNA may be isolated, extracted and prepared using commercially available kits such as the Qiagen QIAamp® Circulating Nucleic Acid Kit protocol. In other examples, the Qubit™ dsDNA HS Assay kit protocol, Agilent™ DNA 1000 kit, or TruSeq™ Sequencing Library Preparation; Low-Throughput (LT) protocol may be used.

After isolation, in some cases, the cfDNA or cfDNA fragments are pre-mixed with one or more additional materials, such as one or more reagents (e.g., ligase, protease, polymerase) prior to sequencing.

According to another embodiment, a cell-free fraction of a biological sample may be used as a sample in the methods described herein. The term “cell-free fraction” of a biological sample, as used herein, generally refers to a fraction of the biological sample that is substantially free of cells. As used herein, the term “substantially free of cells” generally refers to a preparation from the biological sample comprising fewer than about 20,000 cells per mL, fewer than about 2,000 cells per mL, fewer than about 200 cells per mL, or fewer than about 20 cells per mL. Genomic DNA may not be excluded from the acellular sample and typically comprises from about 50% to about 90% of the nucleic acids that are present in the sample.

In the context of the present invention, the term “liquid biopsy” refers to a broad category of sampling and minimally invasive testing performed on a biofluid (e.g., blood, blood plasma or blood serum) to look for fragments of, e.g., tumor-derived cfDNA that are in the blood.

According to one embodiment, the methods described herein may comprise a step of amplifying a nucleic acid. The terms “amplifying” and “amplification” generally refer to increasing the size or quantity of a nucleic acid molecule. The nucleic acid molecule may be single-stranded or double-stranded. Amplification may include generating one or more copies or “amplified product” of the nucleic acid molecule. Amplification may be performed, for example, by extension (e.g., primer extension) or ligation. Amplification may include performing a primer extension reaction to generate a strand complementary to a single-stranded nucleic acid molecule, and in some cases generate one or more copies of the strand and/or the single-stranded nucleic acid molecule. The term “DNA amplification” generally refers to generating one or more copies of a DNA molecule or “amplified DNA product.”

In some embodiments of the methods described herein, a method comprises performing DNA sequencing, e.g., whole genome sequencing, Sanger sequencing, targeted next-generation sequencing (NGS), or whole-genome NGS. In a specific embodiment, whole genome sequencing is performed on extracted cfDNA fragments for obtaining the DNA sequence of the cfDNA fragment. The result of this sequencing of the cfDNA fragment is also referred to herein as the “sequenced cfDNA fragment” or the “read”.

Thereby the term “sequenced” or "read" refers to a sequence read from a portion of a nucleic acid sample, i.e., is the result of the sequencing experiment. Typically, a read represents a short sequence of contiguous base pairs in the sample. The read may be represented symbolically by the base pair sequence (in ATCG) of the sample portion. It may be stored in a memory device and processed as appropriate to align the sequences with another sequence, to determine whether it matches a reference sequence, or if it meets other criteria. A sequence or a read may be obtained directly from a sequencing apparatus or indirectly from stored sequence information concerning the sample.

The term “sequenced fragment” or “fragment sequence” as used herein refers to the combined sequence and length information of a DNA fragment which is gained, for example, from a pair of sequencing reads which were created by sequencing both ends of that DNA fragment, a process which is known as “paired-end read sequencing”, and subsequently aligning the obtained sequences to a reference genome. The length information is obtained from start and end coordinates of the paired sequence alignments. This information can also be extracted from a single sequencing read of a DNA fragment which was created by exhaustive sequencing of a DNA fragment until an adjacent sequencing adapter is read during the sequencing process. This type of sequencing process is known as “single-end read sequencing”. The adapter sequence is removed computationally from the read sequence afterwards.
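The derivation of fragment length from paired alignment coordinates can be illustrated with a minimal sketch. The function name and the coordinate convention (0-based, end-exclusive) are assumptions made for illustration only and are not prescribed by the methods described herein.

```python
def fragment_span(read1_start, read1_end, read2_start, read2_end):
    """Return (start, end, length) of a fragment from its two mate alignments.

    Assumption: coordinates are 0-based and end-exclusive on the reference.
    """
    start = min(read1_start, read2_start)  # leftmost aligned base
    end = max(read1_end, read2_end)        # one past the rightmost aligned base
    return start, end, end - start


# Two 151 bp mates that overlap in the middle of a 167 bp fragment.
print(fragment_span(1000, 1151, 1016, 1167))
```

In this hypothetical example the two mates overlap, and the fragment length follows from the outermost alignment coordinates rather than from the read lengths.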

According to one embodiment, in the methods described herein the DNA sequences of the cfDNA fragments have different lengths. The length may vary from tens to hundreds of base pairs. In some embodiments of the method described herein, the sequence reads are about 25 bp, about 30 bp, about 35 bp, about 40 bp, about 45 bp, about 50 bp, about 55 bp, about 60 bp, about 65 bp, about 70 bp, about 75 bp, about 80 bp, about 85 bp, about 90 bp, about 95 bp, about 100 bp, about 110 bp, about 120 bp, about 130 bp, about 140 bp, about 150 bp, about 175 bp, about 200 bp, about 250 bp, about 300 bp, about 350 bp, about 400 bp, about 450 bp, or about 500 bp. In one embodiment, the sequence reads are 151 bp for each end of a DNA fragment that is sequenced in paired-end read sequencing mode. In other embodiments, paired-end reads are 50 bp, 75 bp, 100 bp, 101 bp, 150 bp, 151 bp, or 175 bp long.

The term "alignment" as used herein refers to the process of comparing a DNA sequence with a reference sequence. In other words, aligning means comparing a read or sequence obtained by sequencing to a reference sequence and thereby determining whether the reference sequence contains the read sequence, the location where the read sequence is aligned in the reference sequence, and/or how the read sequence aligns with the reference sequence. If the reference sequence contains the read, the read may be mapped to the reference sequence or, in certain embodiments, to a particular location in the reference sequence. In some cases, alignment simply tells whether or not a read is a member of a particular reference sequence (i.e., whether the read is present or absent in the reference sequence).

A "reference sequence" or a “reference genome sequence” is a sequence of a biological molecule, which is frequently a nucleic acid such as a chromosome or genome. Typically, DNA sequences of multiple cfDNA fragments are members of a given reference sequence.

In various embodiments, the reference sequence is significantly larger than the sequenced portions or reads that are aligned to it.

In one example, the reference sequence is the sequence of a full length genome of a subject, specifically it is a full length human genome. Such sequences may be referred to as reference genome sequences. Such sequences may also be referred to as chromosome reference sequences. Other examples of reference sequences include genomes of other species, as well as chromosomes, sub-chromosomal regions, e.g., strands of any species.

In various embodiments, the reference sequence is a consensus sequence or other combination derived from multiple individuals. However, in certain applications, the reference sequence may be taken from a particular individual.

A "site" is a unique position in a reference sequence corresponding to a read or a DNA sequence of a cfDNA fragment.

In certain embodiments, a DNA sequence of a cfDNA fragment is aligned with a reference genome sequence in order to determine the cfDNA fragmentation profile.

The term “fragmentation profile” as used herein refers to evaluation of fragmentation patterns of cfDNA across the genome. Such an evaluation can include cfDNA fragment lengths, positions of aligned fragments relative to the reference genome sequence, relative to a specific point on the reference genome, or alignment positions of multiple fragments relative to each other, the ratio between cfDNA fragments with different lengths (e.g., ratio between all cfDNA below a certain length (e.g., 150 bp) vs. all fragments above this length), or whether the nucleosome patterns computed from the cfDNA fragments correspond to nucleosome patterns of a particular cell type, such as white blood cells.
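One of the fragmentation profile features named above, the ratio between fragments below and above a certain length, may be sketched as follows. The function name and the handling of fragments exactly at the threshold (excluded from both groups) are illustrative assumptions.

```python
def short_long_ratio(lengths, threshold=150):
    """Ratio of fragments shorter than `threshold` to fragments longer than it.

    Assumption: fragments of exactly `threshold` bp count toward neither group.
    """
    n_short = sum(1 for n in lengths if n < threshold)
    n_long = sum(1 for n in lengths if n > threshold)
    if n_long == 0:
        raise ValueError("no fragments above the threshold")
    return n_short / n_long


# Hypothetical fragment lengths in bp.
print(short_long_ratio([120, 140, 150, 167, 167, 180]))
```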

In another embodiment, the fragmentation profile of cfDNA fragments is used to generate a nucleosome map that identifies the position of nucleosomes in the sample. The nucleosome map displays positions of nucleosome peaks, indicating open and closed chromatin regions in the subject’s genome. Open chromatin regions indicate regions of the genome that do not contain nucleosomes. These open regions are able to be bound by various protein factors and regulatory elements and transcribed. Closed chromatin regions are regions of the genome that surround nucleosomes and are inaccessible to protein factors, regulatory elements, and other molecules. These closed chromatin regions are not able to be transcribed.

According to a specific embodiment, the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments, as in step iv. of the in vitro method of analyzing cfDNA fragments and as part of other methods described herein, is determined by determining the dyad count distribution for specific fragment lengths, performing a fragment length-based truncation, determining probability density functions, and removing the non-informative portion. This probability is also termed “nucleosome dyad prior distribution”, “nucleosome prior distribution”, or “nucleosome prior” herein.

According to one embodiment, the step of determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments comprises mapping of nucleosomal dyads to cfDNA fragments within a coverage peak.

According to one embodiment, mapping of nucleosomal dyads to cfDNA fragments within a coverage peak refers to initial nucleosomal dyad mapping to all cfDNA fragments within a coverage peak, specifically based on the cfDNA fragmentation profile.

According to one embodiment, mapping of nucleosomal dyads to cfDNA fragments within a coverage peak may be performed as follows: In each coverage peak, the position of maximum coverage overlap is mapped to each individual cfDNA fragment contributing to the peak, i.e., within each cfDNA fragment, the relative position of the dyad is inferred. Specifically, this is illustrated in the enlarged panel of Figure 5, left side, where the putative localization of the nucleosome dyad is indicated as a dashed line, which meets some cfDNA fragments within the peak. Specifically, the relative position of the nucleosome dyad may map to the center of a cfDNA fragment or off the center or may not be determinable. Specifically, the summary of this “mapping of nucleosomal dyads to cfDNA fragments within a coverage peak” or “initial nucleosomal dyad mapping to all cfDNA fragments within a coverage peak” is depicted in Figure 5, right side, left panel (“nucleosomal fragments”), where the cfDNA fragments from the coverage peak from the enlarged panel of Figure 5, left side, are displayed sorted by size and where an arrow on the respective cfDNA fragment indicates the location of the nucleosomal dyad.
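The initial dyad mapping within one coverage peak may be sketched as follows. This is a simplified illustration and not the implementation of the methods described herein: the function name is hypothetical, fragments are assumed to be given as 0-based, end-exclusive (start, end) pairs, and the choice of the middle position of a coverage plateau as the putative dyad is an assumption.

```python
def map_dyad_to_fragments(fragments):
    """Locate the position of maximum coverage overlap and express it
    relative to each fragment.

    Returns (dyad_position, offsets); an offset is the distance of the dyad
    from the fragment start, or None if the dyad lies outside the fragment.
    """
    lo = min(start for start, _ in fragments)
    hi = max(end for _, end in fragments)
    depth = [0] * (hi - lo)
    for start, end in fragments:
        for pos in range(start, end):
            depth[pos - lo] += 1
    top = max(depth)
    plateau = [lo + i for i, d in enumerate(depth) if d == top]
    dyad = plateau[len(plateau) // 2]  # assumption: middle of the plateau
    offsets = [dyad - start if start <= dyad < end else None
               for start, end in fragments]
    return dyad, offsets


fragments = [(100, 267), (110, 277), (90, 250), (120, 290), (300, 460)]
print(map_dyad_to_fragments(fragments))
```

In this hypothetical example, the last fragment does not overlap the position of maximum coverage, so the relative dyad position is not determinable for it.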

According to one embodiment, the step of determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments comprises establishment of a peak specific and cfDNA length-specific statistic.

According to one embodiment, establishment of a peak-specific and cfDNA length-specific statistic may be also referred to as establishment of a locus-specific and cfDNA length-specific statistic.

According to one embodiment, establishment of a peak-specific and cfDNA length-specific statistic allows a detailed cfDNA fragment length-specific dyad statistic for each peak (locus). According to a specific embodiment, for all cfDNA fragments that map to the same locus and have the same length, the inferred nucleosome dyad positions are recorded (see e.g., Figure 5, right side, center panel (Fragment Length Specific Dyad Statistics)).
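A fragment length-specific dyad statistic of this kind may be sketched as a table of counts keyed by fragment length. The record layout (fragment length, dyad offset from fragment start) and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict


def dyad_statistics(records):
    """Count inferred dyad offsets separately for each fragment length.

    `records` is an iterable of (fragment_length, dyad_offset) pairs;
    fragments without a determinable dyad (offset None) are skipped.
    """
    stats = defaultdict(Counter)
    for length, offset in records:
        if offset is not None:
            stats[length][offset] += 1
    return stats


records = [(167, 83), (167, 83), (167, 80), (150, 75), (150, None)]
stats = dyad_statistics(records)
print(dict(stats[167]))  # offset -> count for 167 bp fragments
```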

According to one embodiment, the step of determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments comprises establishing a distribution of probability of the presence of a nucleosomal dyad.

According to one embodiment, establishing a distribution of probability of the presence of a nucleosomal dyad may be also referred to as establishment of the nucleosome prior distribution (yi).

According to one embodiment, establishing a distribution of probability of the presence of a nucleosomal dyad may be performed as follows: The cfDNA fragments are replaced with nucleosome prior distributions (yi), which are derived from the inferred (hypothetical) nucleosome positions over cfDNA fragments with a specific length (see e.g., Figure 5, right side, right panel (Resulting Dyad Location Distribution) and Figure 8, upper part). Specifically, details about the steps involved in transforming the inferred (hypothetical) nucleosome position over cfDNA fragments (=empiric count distributions; empiric distribution of nucleosome dyad locations per fragment length) are also shown in Figures 6 and 7. Specifically, Figure 6 depicts the normalization based on cfDNA counts. The sum of counts is needed to generate an AUC of 1 for the entire region within the range defined in the previous step. Next, the area of random position signals is determined, which involves an update of the 0’ axis. Then, the random area is subtracted from the entire area, resulting in an AUC of <1 for the respective areas. This facilitates comparing the relative heights of priors in order to relate different priors to each other. The result is cfDNA fragment length-specific high-confidence information about dyad positioning. These nucleosome dyad prior distributions contain the data required to calculate the posterior nucleosome localization probability from cfDNA fragmentation. The prior localizations are retrieved and overlaid with the BAM file, and the original fragments from the BAM file are used for calculations. The overlaid nucleosome prior signals are then used to calculate the nucleosome localization posterior probability. After this step, the nucleosome dyad posterior distribution is available.
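The normalization and the removal of the non-informative portion may be illustrated with the following sketch. How the area of random position signals is estimated is not reproduced here; using the minimum of the normalized distribution as a flat baseline is purely an assumption for illustration, as are the function and parameter names.

```python
def dyad_prior(counts, baseline=None):
    """Turn per-offset dyad counts for one fragment length into a prior.

    Step 1: divide by the sum of counts so the values sum to 1 (AUC of 1).
    Step 2: subtract a flat baseline representing randomly positioned
            signal and clip at zero, leaving an AUC below 1.
    Assumption: the baseline defaults to the minimum normalized value.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no dyad counts available")
    density = [c / total for c in counts]
    if baseline is None:
        baseline = min(density)
    return [max(d - baseline, 0.0) for d in density]


prior = dyad_prior([2, 2, 6, 8, 2])
print(prior, sum(prior))
```

Because the remaining area differs between fragment lengths, priors computed this way can be compared with each other by their relative height.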

According to one embodiment, the step of determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments comprises mapping of nucleosomal dyads to cfDNA fragments within a coverage peak, establishment of a peak-specific and cfDNA length-specific statistic, and establishing a distribution of probability of the presence of a nucleosomal dyad.

According to one embodiment, the nucleosome fragment-specific prior distributions allow calculation of a per-base average across these distributions, resulting in the nucleosome posterior signal (see e.g., Figure 8).

According to a specific embodiment, the average probability of the presence of a nucleosomal dyad at certain base positions in the reference genome sequence is determined based on the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments and the fragmentation profile obtained from aligning DNA sequences of the cfDNA fragments with a reference genome sequence. This average probability, also termed “nucleosome dyad posterior distribution”, “nucleosome posterior distribution”, or “nucleosome posterior” herein, is determined by Bayesian inference as described herein.
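The per-base averaging of fragment-specific priors may be sketched as follows. The data layout (each fragment given as its start coordinate together with its per-base prior values) and the function name are assumptions; bases not covered by any fragment are assigned 0.0 here for simplicity.

```python
def posterior_signal(region_start, region_end, placed_priors):
    """Average, per reference base, the prior values of all fragments
    covering that base.

    `placed_priors` is a list of (fragment_start, prior_values) pairs with
    one prior value per base of the fragment.
    """
    n = region_end - region_start
    sums = [0.0] * n
    hits = [0] * n
    for start, prior in placed_priors:
        for i, value in enumerate(prior):
            pos = start + i
            if region_start <= pos < region_end:
                sums[pos - region_start] += value
                hits[pos - region_start] += 1
    return [s / h if h else 0.0 for s, h in zip(sums, hits)]


priors = [(0, [0.0, 0.2, 0.4, 0.2]), (2, [0.0, 0.4, 0.2, 0.0])]
print(posterior_signal(0, 6, priors))
```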

According to one embodiment, Bayesian inference is used to compute the positions of nucleosome dyads based on coverage maxima and cfDNA fragmentation by using Bayes’ theorem. Bayes’ theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. Specifically, two signals are generated and evaluated in the methods described herein: the first is based on sequencing coverage, i.e., the number of sequencing reads aligned to a specific locus in a reference genome (see e.g., Figure 6, left side: assumption 1); the second signal is the “posterior nucleosome signal” based on Bayesian inference.

According to one embodiment, the methods described herein comprise the step of mapping the peaks of the average probability of the presence of a nucleosomal dyad across a reference genome sequence. Specifically, said reference genome sequence may be the same as the reference genome sequence used for aligning DNA sequences of cfDNA fragments.

The term “peaks” as used herein in the context of mapping the peaks of average probability of the presence of a nucleosomal dyad refers to local maxima of said average probability along the reference genome sequence. Specifically, these local maxima must be at least 2 bp, 3 bp, 4 bp, 5 bp, 7 bp, 10 bp, 12 bp, 15 bp, or more apart from each other and must be supported by more than 1, 2, 3, 4, 5, 6, 7, 8, or more cfDNA fragments. Higher minimum distance values yield stricter peak calling and peak grouping results whereas lower values allow for a more permissive peak calling and grouping. The number of required supporting fragments must also take into account the target sequencing depth of the sequencing dataset. A cfDNA fragment supports a peak if one of the highest local maxima of the fragment’s associated nucleosome prior distribution is located within 20 bp of the local maximum of the nucleosome posterior distribution or within a smaller base range.
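Peak calling under a minimum distance and a minimum fragment support may be sketched as follows. This is a simplified illustration: each fragment is assumed to contribute a single prior maximum position, and giving preference to the higher of two peaks that are too close together is an assumption. All names and default values are illustrative.

```python
def call_peaks(signal, prior_maxima, min_distance=10, min_support=2,
               support_window=20):
    """Return positions of local maxima of `signal` that are at least
    `min_distance` apart and supported by more than `min_support` fragments.

    `prior_maxima` holds one position per fragment (the maximum of that
    fragment's prior); a fragment supports a peak if this position lies
    within `support_window` bp of the peak.
    """
    candidates = [i for i in range(1, len(signal) - 1)
                  if signal[i - 1] < signal[i] >= signal[i + 1]]
    candidates.sort(key=lambda i: signal[i], reverse=True)  # highest first
    kept = []
    for i in candidates:
        support = sum(1 for m in prior_maxima if abs(m - i) <= support_window)
        if support <= min_support:
            continue
        if all(abs(i - k) >= min_distance for k in kept):
            kept.append(i)
    return sorted(kept)


signal = [0.0] * 60
signal[10], signal[14], signal[40] = 0.5, 0.3, 0.4
print(call_peaks(signal, prior_maxima=[8, 12, 11, 41, 39, 45]))
```

In this hypothetical example the local maximum at position 14 is discarded because it lies only 4 bp from the higher peak at position 10.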

According to another specific embodiment, in the methods described herein the nucleosomal dyads or the peaks of the average nucleosomal dyad probability are mapped for the whole genome or for sub-regions thereof.

According to one embodiment, the methods described herein may further comprise analyzing the depth of coverage.

The term “depth of coverage” as used herein refers to the number of fragment sequences that align with a particular site of the reference genome.
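Depth of coverage over a region may be computed, for example, with a difference array, as in the following sketch. Fragment coordinates are assumed to be 0-based and end-exclusive; the function name is hypothetical.

```python
def depth_of_coverage(fragments, region_start, region_end):
    """Number of fragment sequences covering each base of a region."""
    n = region_end - region_start
    diff = [0] * (n + 1)
    for start, end in fragments:
        start = max(start, region_start)  # clip fragment to the region
        end = min(end, region_end)
        if start < end:
            diff[start - region_start] += 1
            diff[end - region_start] -= 1
    depth, running = [], 0
    for delta in diff[:-1]:
        running += delta
        depth.append(running)
    return depth


print(depth_of_coverage([(0, 4), (2, 6), (3, 5)], 0, 6))
```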

Specifically, coverage describes whether or not any fragment sequence aligns with a particular site or region of a reference genome. In another embodiment, it is also used to describe the average target coverage across an entire reference genome.

As used herein, the term “coverage pattern” generally refers to a spatial arrangement of fragment sequences after alignment of read sequences to a reference genome. The coverage pattern identifies the extent and depth of coverage of next-generation sequencing methods.

According to one embodiment, the methods described herein may further comprise determining the fragment support of inferred nucleosome dyads.

According to one embodiment, the method for analyzing cfDNA described herein further comprises step vii. of chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing.

The term “chain” as used herein in the context of chaining nucleosomal dyads refers to the grouping of peaks of the posterior probability that occur consecutively along a reference genome following rules of naturally occurring nucleosomal spacing and regularity of fragment support.

According to one embodiment, rules for naturally occurring nucleosomal spacing and regularity of fragment support are as follows. Peaks may only be chained if there is a minimum distance of 100 bp, 115 bp, 120 bp, 125 bp, 130 bp, 135 bp, 140 bp, 145 bp, or 146 bp between them, whereby chaining becomes more stringent if a higher minimum distance is chosen. Based on these possible lower distance bounds and by using fragment support information, one or more low-fragment-support sub-chains may be identified at the site of an already established nucleosome chain that has higher average fragment support. In theory, nucleosome dyads may be arbitrarily far apart from each other because the association of DNA with histone octamer cores is not strictly necessary for the existence of a DNA molecule. A formal definition of chain termination conditions is nevertheless used to obtain chains that are confined in genomic space, which equates to working with a higher nucleosome chain resolution. Well-covered stretches of DNA (at least 50% of the data set’s target x-fold coverage) that exceed a length of 471 bp and are devoid of nucleosome dyad peaks are not expected to be observed in natural chromatin. This criterion is used for termination of all nucleosome chains and sub-chains neighboring the nucleosome-deserted reference stretch. Shorter distances can be used to establish a more stringent chain termination behavior. Such more stringent termination distances can be around 450 bp, 430 bp, 410 bp, 390 bp, 370 bp, 350 bp, 340 bp, 330 bp, 320 bp, 310 bp, 300 bp, 290 bp, 280 bp, 270 bp, 260 bp, 250 bp, 240 bp, 230 bp, 220 bp, 210 bp, 200 bp, 190 bp, or 185 bp. Another chain termination condition is defined by diminished fragment support of consecutive nucleosome peaks that would otherwise fulfill spacing constraints. A sudden change in fragment support for the next peak to be chained of 2 standard deviations of the previous average fragment support, as estimated from the current chain, or a reduction below 40% of the average fragment support of chained peaks also indicates the termination of a nucleosome chain. Higher percentage values or a smaller number of standard deviations of the fragment support drop can be used to achieve a more stringent chain termination. Such values are 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, and 99% of the average fragment support of chained peaks.
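The chaining rules above may be illustrated with a deliberately simplified sketch. It applies only three of the stated conditions (minimum spacing, maximum peak-free gap, and a drop below a fraction of the average fragment support of the current chain); the coverage requirement for peak-free stretches, the standard deviation criterion, and the identification of sub-chains are omitted. Peaks that violate the minimum spacing are merely collected separately. All names and the input layout are assumptions.

```python
def chain_peaks(peaks, min_spacing=146, max_gap=471, min_support_fraction=0.4):
    """Group peaks, given as (position, fragment_support) sorted by position,
    into chains. Returns (chains, unchained)."""
    chains, unchained, current = [], [], []
    for pos, support in peaks:
        if not current:
            current = [(pos, support)]
            continue
        gap = pos - current[-1][0]
        mean_support = sum(s for _, s in current) / len(current)
        if gap < min_spacing:
            # Too close to the previous chained peak; a fuller
            # implementation might assign it to a sub-chain.
            unchained.append((pos, support))
        elif gap > max_gap or support < min_support_fraction * mean_support:
            chains.append(current)  # terminate the chain, start a new one
            current = [(pos, support)]
        else:
            current.append((pos, support))
    if current:
        chains.append(current)
    return chains, unchained


peaks = [(100, 10), (280, 12), (350, 3), (470, 11), (1100, 9), (1290, 2)]
chains, unchained = chain_peaks(peaks)
print(chains)
print(unchained)
```

In this hypothetical input, the first chain ends because the next peak is more than 471 bp away, and the second chain ends because the support of the following peak falls below 40% of the chain's average support.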

According to one embodiment, the herein provided method allows determining nucleosome positions with higher precision, so that adjacent signals that appeared to be just one signal with previously known methods can be resolved (see e.g., Figures 9 and 10). This allows determining which nucleosomal signals belong together, i.e., whether the nucleosomes form one chain (only one cell lineage caused the pattern) or more than one chain (several cell lineages or tissues caused the nucleosome pattern). According to a specific embodiment, biological knowledge and well-established average distances between nucleosomes are included following the determination of which signals belong together, e.g., were contributed by a specific cell type. For example, this is illustrated in Figure 12C, which shows that not only the distances between nucleosome peaks are measured. Instead, the signals from this particular gene (RIT1 in Figure 12C) show the presence of two nucleosome chains A and B, indicating that the signals must be derived from at least two different cell lineages.

According to a specific embodiment, in the herein described method, adjacent signals are not only chained, but due to the increased resolution, the number of cell lineages that make up the nucleosome signals in any region of the genome can be inferred, which includes unprecedented resolution at the individual gene level.

According to one embodiment, one or more chains of mapped peaks are obtained in the methods described herein.

According to one embodiment, each chain represents a cell lineage/tissue of origin of the cfDNA.

According to one embodiment, mapping or chaining may result in one or more nucleosome maps. Specifically, this is made possible by the superior resolution of the herein described methods and allows resolving nucleosome peaks as representing several peaks and hence determining, for each region in the genome, how many cell lineages/tissues have contributed. For example, in Figure 12C, nucleosome positions are evaluated for a specific region; distance evaluations between these nucleosome peaks are applied, where these distances are derived from “biology knowledge” (“naturally occurring nucleosomal spacing”). This is used to establish which nucleosomes belong together, i.e., are derived from the same cell lineages. This, in turn, allows determining how many different cell lineages contributed their DNA for this locus. The resolution of the various peaks into different chains is possible due to the increased resolution of the herein described methods.

According to one embodiment, chaining is performed genome-wide.

According to one embodiment, chaining is performed in coding and non-coding regions.

According to one embodiment, the herein described methods allow chaining for established regulatory regions, such as TSSs or TFBSs, and genome-wide. Genome-wide chaining may include coding and non-coding regions.

According to a specific embodiment, the herein described method allows chaining in non-coding regions, such as introns.

According to one embodiment, chaining refers to analysis of nucleosome occupancies.

According to one embodiment, described herein is an in vitro method for determining the health status of a subject comprising the steps of: i. extracting cfDNA fragments from a sample from the subject; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning the DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average probability of the presence of a nucleosomal dyad of v. across the reference genome sequence; and vii. comparing the mapped peaks obtained in vi. with a library comprising standard maps and/or outlier maps of nucleosomal dyads; wherein congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status, congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status, and/or congruence with the library of outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps of libraries from healthy and unhealthy subjects is characteristic for an unhealthy status.

According to one embodiment, in the herein described methods a reference nucleosome map is not necessarily needed.

According to a specific embodiment, from the nucleosome positions it can be deduced which genes and pathways are active or silent in the cells that release their DNA into the circulation. Specifically, from these gene and pathways activities, it can be directly inferred which cells contribute to the cfDNA pool as gene expression and signal pathways are highly cell and tissue specific.

According to one embodiment, the in vitro method for determining the health status of a subject further comprises the step of chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing after step vi. Specifically, the further step of chaining is performed after mapping in step vi. and before comparing in step vii. Thereby, the step of chaining is performed as step vii. and comparing is performed as step viii.

According to a specific embodiment, in the case chaining is performed in the method for determining the health status, comparing in step viii. comprises comparing the mapped peaks obtained in vi. with a library comprising standard maps, comparing the mapped peaks obtained in vi. with a library comprising outlier maps of nucleosomal dyads, and/or comparing the chained peaks obtained in vii. with a standard map of nucleosomal dyad chains, wherein a. congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status; b. congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status; c. congruence with the library of outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps of libraries from healthy and unhealthy subjects is characteristic for an unhealthy status; and/or d. difference with a standard map of nucleosomal dyad chains obtained from healthy subjects is characteristic for an unhealthy status.

In a certain embodiment, a congruence of 50, 60, 65, 70, 75, 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100% with standard maps derived from healthy subjects is characteristic for a healthy status. In a certain embodiment, a difference of 50, 60, 65, 70, 75, 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100% with standard maps derived from unhealthy subjects is characteristic for a healthy status.

In a certain embodiment, a congruence of 50, 60, 65, 70, 75, 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100% with standard maps derived from unhealthy subjects is characteristic for an unhealthy status. In a certain embodiment, a difference of 50, 60, 65, 70, 75, 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100% with standard maps derived from healthy subjects is characteristic for an unhealthy status.

According to a specific embodiment, the library comprises standard maps derived from healthy subjects and/or standard maps derived from unhealthy subjects.

According to a specific embodiment, in the method described herein comparing the mapped nucleosomal dyads with a library comprising standard maps of nucleosomal dyads comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio (e.g. Figure 21), determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning.

According to a specific embodiment, the subject is considered healthy if the deviation of nucleosomal dyad positioning on cfDNA fragments between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for a healthy subject is less than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for an unhealthy subject. Specifically, said deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for a healthy subject is 99, 98, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81, 80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30, 25, 20, 15, 10, or 5% of the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for an unhealthy subject.

In one embodiment, congruence of a subject’s nucleosomal dyad chains with a standard map of nucleosomal dyad chains obtained from healthy subjects is characteristic for a healthy status.

In another embodiment, deviation of a subject’s nucleosomal dyad chains from a standard map of nucleosomal dyad chains obtained from unhealthy subjects is characteristic for a healthy status.

In yet another embodiment, congruence of a subject’s nucleosomal dyad chains with a standard map of nucleosomal dyad chains obtained from unhealthy subjects is characteristic for an unhealthy status.

According to a specific embodiment, a machine learning model for binary classification between healthy and unhealthy regarding a specific disease group or pregnancy can be trained on the set of standard dyad chains from samples of both groups to learn patterns of dyad chains that signify an unhealthy sample. Multiple such models can be combined to achieve multi-class classification.

According to yet another embodiment, described herein is an in vitro method for determining the health status of a subject comprising the steps of: i. extracting cfDNA fragments from a sample from the subject; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning the DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average probability of the presence of a nucleosomal dyad of v. across the reference genome sequence; and vii. comparing the mapped peaks obtained in vi. with a library comprising standard maps and outlier maps of nucleosomal dyads; wherein congruence with the library of outlier maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects and the library standard maps derived from unhealthy subjects indicates an association with an unhealthy status.

In a certain embodiment, in the step of comparing, a congruence of 50, 60, 65, 70, 75, 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100% with outlier maps derived from unhealthy subjects and a similar difference of 50, 60, 65, 70, 75, 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100% with standard maps derived from healthy subjects and standard maps derived from unhealthy subjects indicates an association with an unhealthy status.

According to a specific embodiment, the library comprises standard maps derived from healthy subjects and/or standard maps and outlier maps derived from unhealthy subjects.

According to a specific embodiment, comparing the mapped nucleosomal dyads obtained in vi. with a library comprising standard maps and outlier maps of nucleosomal dyads comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio (e.g. Figure 21), determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning.

According to a specific embodiment, the subject is considered unhealthy if the deviation of nucleosomal dyad positioning on cfDNA fragments between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for a healthy subject is more than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for an unhealthy subject. Specifically, the subject is considered unhealthy if the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for a healthy subject is 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10, 15, 20-fold or even higher than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for an unhealthy subject.
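The comparison of deviations against the healthy and the unhealthy standard map can be illustrated as follows. The description does not fix how the deviation of dyad positioning is measured; the sketch assumes, purely for illustration, the mean distance from each sample peak to the nearest peak of the respective standard map. The function names and the default `fold` value are likewise hypothetical.

```python
from bisect import bisect_left


def nearest_distance(pos, sorted_ref):
    """Distance in bp from pos to the closest position in sorted_ref."""
    i = bisect_left(sorted_ref, pos)
    candidates = sorted_ref[max(i - 1, 0):i + 1]
    return min(abs(pos - r) for r in candidates)


def mean_deviation(sample_peaks, reference_peaks):
    """Assumed deviation metric: mean nearest-peak distance (non-empty inputs)."""
    ref = sorted(reference_peaks)
    return sum(nearest_distance(p, ref) for p in sample_peaks) / len(sample_peaks)


def classify_by_deviation(sample_peaks, healthy_map, unhealthy_map, fold=1.1):
    """Label a sample by comparing its deviation from both standard maps."""
    dev_healthy = mean_deviation(sample_peaks, healthy_map)
    dev_unhealthy = mean_deviation(sample_peaks, unhealthy_map)
    if dev_healthy >= fold * dev_unhealthy:
        return "unhealthy"
    if dev_healthy < dev_unhealthy:
        return "healthy"
    return "indeterminate"


if __name__ == "__main__":
    healthy_map = [100, 300, 500]
    unhealthy_map = [140, 340, 540]
    print(classify_by_deviation([138, 342, 541], healthy_map, unhealthy_map))
    print(classify_by_deviation([101, 299, 502], healthy_map, unhealthy_map))
```

A sample whose deviation from the healthy map lies between one and `fold` times its deviation from the unhealthy map is reported as indeterminate in this sketch; that third label is a design choice of the example, not part of the description.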

According to a specific embodiment, changes of the congruence between the sample dyad map and the standard dyad map of healthy subjects are expressed as z-score and a subject is identified as unhealthy, if said z-score exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3. Specifically, changes of the congruence between the sample dyad map and the standard dyad map of healthy subjects are expressed as z-score and a subject is identified as unhealthy, if the z-score exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, or even more. Alternatively, changes of the congruence between the sample dyad map and the standard dyad map of unhealthy subjects are expressed as z-score and a subject is identified as healthy, if the z-score exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, or even more, more preferably if the absolute value of the z-score exceeds 2.
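As a minimal illustration of the z-score criterion, the following sketch assumes that the reference distribution consists of congruence values measured for healthy samples against the healthy standard map; this choice of reference, the function names, and the example values are assumptions made for illustration.

```python
from statistics import mean, stdev


def z_score(value, reference_values):
    """Standard score of value relative to a reference distribution."""
    mu = mean(reference_values)
    sigma = stdev(reference_values)  # sample standard deviation
    if sigma == 0:
        raise ValueError("reference values have zero spread")
    return (value - mu) / sigma


def is_unhealthy(sample_congruence, healthy_congruences, limit=3.0):
    """Flag a sample whose |z| exceeds the limit (3 is the preferred value)."""
    return abs(z_score(sample_congruence, healthy_congruences)) > limit


if __name__ == "__main__":
    healthy = [90, 92, 94, 96, 98]  # hypothetical congruence values in %
    print(is_unhealthy(80, healthy))
    print(is_unhealthy(93, healthy))
```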

According to an alternative embodiment, a machine learning model may be used to learn classification from multiple algorithms.

The term “congruence” as used herein is defined as the percentage of standard map peaks in callable regions of the sample that were also called from the sample data.

Callable regions are defined as regions that exceed the lower bound for minimum fragment support for calling main peaks from the nucleosome posterior signal.
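The definition of congruence can be made concrete with a short sketch. The matching tolerance `tol`, the representation of callable regions as half-open `(start, end)` intervals, and all function names are assumptions introduced here; the description only states that congruence is the percentage of standard map peaks in callable regions that were also called from the sample.

```python
from bisect import bisect_left


def in_callable(pos, regions):
    """True if pos lies in any half-open (start, end) callable region."""
    return any(start <= pos < end for start, end in regions)


def has_match(pos, sorted_peaks, tol):
    """True if some sample peak lies within tol bp of pos."""
    i = bisect_left(sorted_peaks, pos - tol)
    return i < len(sorted_peaks) and sorted_peaks[i] <= pos + tol


def congruence(standard_peaks, sample_peaks, callable_regions, tol=10):
    """Percentage of callable standard-map peaks recovered in the sample.

    Returns None when no standard peak falls into a callable region.
    """
    sample_sorted = sorted(sample_peaks)
    eligible = [p for p in standard_peaks if in_callable(p, callable_regions)]
    if not eligible:
        return None
    hits = sum(has_match(p, sample_sorted, tol) for p in eligible)
    return 100.0 * hits / len(eligible)


if __name__ == "__main__":
    standard = [100, 300, 500, 900]
    sample = [105, 495, 905]
    print(congruence(standard, sample, [(0, 600)]))
```

In the example the standard peak at 900 lies outside the callable region and is therefore ignored, so two of three eligible peaks are recovered.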

As used herein, the term “subject” generally refers to an individual, entity or a medium that has or is suspected of having testable or detectable genetic information or material. A subject can be a person, individual, or patient. The subject can be a vertebrate, such as, for example, a mammal. Non-limiting examples of mammals include humans, simians, farm animals, sport animals, rodents, and pets. The subject may be displaying a symptom(s) indicative of a health or physiological state or condition of the subject, such as a cancer or a stage of a cancer of the subject. As an alternative, the subject can be asymptomatic with respect to such health or physiological state or condition.

According to a specific embodiment, in the method described herein, the standard maps of nucleosomal dyads characteristic for a healthy subject are derived from healthy subjects. Healthy subjects may be understood as subjects not having the symptoms that the subject to be tested is suffering from. In general, healthy subjects are not suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.

As used herein, the term “cohort” or “cohort of subjects” shall refer to a group of subjects having a specific classification and may specifically refer to the samples received from said subjects. The number of subjects of a cohort can vary, i.e. it may comprise 2, 3, 4, 5, 6, 7 or more subjects, however it also may be a larger group of subjects, like for example but not limited to 10, 50, 100 or more subjects. According to the embodiment of the invention the cohort may also comprise large cohorts of 500 or more subjects. Specifically, the cohort of subjects as described herein shall refer to a group of subjects being associated with or having a condition. These subjects of a cohort can thereby be assigned to a specific classification or status, e.g. displaying a certain condition, such as a clinical, physiologic, or pathologic condition, specifically, selected from but not limited to health status, aging status, cell type, tissue type, and specific disease status. Specifically, the cohort of subjects shall refer to a group of subjects being healthy, unhealthy, of a certain age, and/or having a specific disease.

Markers for specific conditions may be, but are not limited to, patterns of dyad positions indicating a specific condition of a subject or a cohort of subjects.

"Aging" according to this invention is a combination of processes of deterioration that follow the period of development of an organism. Aging is generally characterized by a declining adaptability to stress, increased homeostatic imbalance, increase in senescent cells, and increased risk of disease. Because of this, death is the ultimate consequence of aging.

Unhealthy aging may be induced by stress conditions including, but not limited to, chemical, physical, and biological stresses. Unhealthy aging is also referred to as “inflammaging”. For example, accelerated aging can be induced by stresses caused by UV and IR irradiation, drugs and other chemicals, chemotherapy, intoxicants, such as but not limited to DNA intercalating and/or damaging agents, oxidative stressors etc.; mitogenic stimuli, oncogenic stimuli, toxic compounds, hypoxia, oxidants, caloric restriction, exposure to environmental pollutants, for example, silica, exposure to an occupational pollutant, for example, dust, smoke, asbestos, or fumes.

According to one embodiment, in the methods described herein the standard maps of nucleosomal dyads characteristic for an unhealthy subject are derived from subjects suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.

According to one embodiment, described herein is the establishment of a library comprising standard maps of nucleosomal dyads. In a specific embodiment, the standard maps of nucleosomal dyads are established from samples of healthy and/or unhealthy subjects as described herein.

According to another specific embodiment, the preparation of standard maps of nucleosomal dyads comprises analyzing cfDNA as described herein. Specifically, recurring peaks are integrated into a standard map of peak positions for a specific group of samples. Peak positions are regarded as recurring in a homogeneous group of samples if a peak of the nucleosome posterior distribution is called within a region of 5 bp, 10 bp, 15 bp, or 20 bp at a specific site for 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, or 95% of the samples, provided the samples are homogeneous regarding a specific health status or health characteristic or a particular pathology. Choosing higher percentages of samples from the homogeneous group and lower base ranges for the peak calling leads to a stricter standard map containing fewer recurring peaks. Samples may be excluded from the computation if the candidate site is not sufficiently covered, e.g. if the depth of coverage is below the minimum fragment support required for peak calling from the posterior nucleosome signal. Highly pronounced singular peaks are recorded for every non-healthy sample group in a separate outlier map of nucleosomal dyads as described later on. Standard maps of non-healthy groups may also include locations of recurring peaks from the healthy standard map of nucleosomal dyads, if these are recurrently absent in the non-healthy group. Alternatively, a machine learning model for binary classification between healthy and unhealthy regarding a specific disease group or pregnancy can be trained on the set of standard dyad peaks from samples of both groups to learn patterns of dyad peaks that signify an unhealthy sample. Multiple such models can be combined to achieve multi-class classification.
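The recurrence criterion for building a standard map may be sketched as follows. The greedy clustering of pooled peak positions, the use of the median as the reported site, and the names `standard_map`, `window`, and `min_fraction` are assumptions of this example; the exclusion of insufficiently covered samples is omitted for brevity.

```python
from statistics import median


def standard_map(sample_peaks, window=10, min_fraction=0.8):
    """Collect recurring peak sites across a homogeneous group of samples.

    sample_peaks: one list of peak positions per sample.
    A site is kept if peaks from at least min_fraction of the samples
    fall within a region of `window` bp.
    """
    pooled = sorted(
        (pos, idx) for idx, peaks in enumerate(sample_peaks) for pos in peaks
    )
    n_samples = len(sample_peaks)
    recurring = []
    i = 0
    while i < len(pooled):
        start = pooled[i][0]
        j = i
        # Greedily extend the cluster while positions stay inside the window.
        while j < len(pooled) and pooled[j][0] - start <= window:
            j += 1
        cluster = pooled[i:j]
        supporting = {idx for _, idx in cluster}
        if len(supporting) / n_samples >= min_fraction:
            recurring.append(int(median(pos for pos, _ in cluster)))
        i = j
    return recurring


if __name__ == "__main__":
    group = [[100, 500], [103, 900], [98, 505], [101], [104, 498]]
    print(standard_map(group, window=10, min_fraction=0.8))
    print(standard_map(group, window=10, min_fraction=0.6))
```

Raising `min_fraction` or lowering `window` makes the resulting map stricter, in line with the statement above that higher percentages and lower base ranges yield fewer recurring peaks.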

According to one embodiment, the outlier maps of nucleosomal dyads characteristic for an unhealthy subject are derived from subjects suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.

According to one embodiment, described herein is the establishment of a library comprising outlier maps of nucleosomal dyads for non-healthy groups. In a specific embodiment, the outlier peak maps of nucleosomal dyads are established from samples of healthy and/or unhealthy subjects as described herein.

According to another specific embodiment, the preparation of outlier maps of nucleosomal dyads comprises analyzing cfDNA as described herein. Specifically, outlier peaks occurring only in a subset of samples of a specific non-healthy group, i.e. not among recurring peaks of the same group, are integrated into a map of outlier peak positions for that specific sample group. Peaks that qualify as outliers must carry a trait or multiple traits that indicate a pronounced character that supports their presence in order to be regarded in the outlier map of nucleosomal dyads. Pronounced peaks can be defined not only by high prominence values, but also by a combination of high confidence values, high fragment support values, low peak dilation values, high prominence values, and/or high phasedness values. Outlier maps of nucleosomal dyads of non-healthy groups may also include locations of recurring peaks of the healthy standard map of nucleosomal dyads, if these are recurrently absent only in a subgroup of the same non-healthy sample group. Outlier maps may be created for specific subgroups of samples from a non-healthy group if the subgroup is sufficiently homogeneous in terms of outlier peaks (i.e. a number of outliers are common among samples of the subgroup) and at least one pathological characteristic of these samples or signals obtained from these samples through methods described herein.

According to one embodiment, a method is described herein for monitoring the treatment success. Specifically, described is an in vitro method for monitoring the treatment success of a patient comprising the steps of: i. extracting cfDNA fragments from a sample of said patient; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average probability of the presence of a nucleosomal dyad of v. across the reference genome sequence; and vii. comparing the mapped peaks obtained in vi. with the mapped peaks of a previous result from said patient and/or a standard map of nucleosomal dyads characteristic for the treatment success, wherein differences and/or congruences obtained in vii. provide information on the treatment success of the patient.

According to one embodiment, the in vitro method for monitoring the treatment success of a patient further comprises the step of chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing. Specifically, the further step of chaining is performed after mapping in step vi. and before comparing in step vii. Thereby, the step of chaining is performed as step vii. and comparing is performed as step viii.

According to a specific embodiment, in the case chaining is performed in the method for monitoring the treatment success of a patient, comparing in step viii. comprises comparing the mapped peaks obtained in vi. with the mapped peaks of a previous result from said patient and/or a standard map of nucleosomal dyads characteristic for the treatment success, and/or comparing the chained peaks obtained in vii. with the chained peaks of a previous result from said patient and/or a standard map of chained peaks characteristic for the treatment success, wherein differences and/or congruences obtained in the step of comparing provide information on the treatment success of the patient.
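The comparison of a current result with a previous result from the same patient can be illustrated by listing peaks that were lost, gained, or retained between two time points. The matching tolerance `tol` and the function name `compare_timepoints` are assumptions of this sketch; how such differences translate into treatment success is not addressed here.

```python
def compare_timepoints(previous_peaks, current_peaks, tol=10):
    """Summarize peak differences between two samples of one patient."""

    def matched(pos, others):
        # A peak counts as present if another peak lies within tol bp.
        return any(abs(pos - other) <= tol for other in others)

    lost = [p for p in previous_peaks if not matched(p, current_peaks)]
    gained = [c for c in current_peaks if not matched(c, previous_peaks)]
    retained = len(previous_peaks) - len(lost)
    return {"lost": lost, "gained": gained, "retained": retained}


if __name__ == "__main__":
    before = [100, 300, 500]
    after = [102, 498, 800]
    print(compare_timepoints(before, after))
```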

According to a specific embodiment, in said method of monitoring treatment success, the step of comparing may comprise determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio (compare with raw dyad count distributions from Figure 21), determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning.

According to a specific embodiment, the treatment success is monitored for the treatment of cancer, specifically of prostate cancer, colon cancer, breast cancer, bladder cancer, and/or lung cancer; and for the treatment of inflammatory diseases, specifically of inflammatory bowel disease, systemic lupus erythematosus, ulcerative colitis; chronic inflammatory diseases such as thyroiditis, Crohn’s disease, chronic obstructive pulmonary disease; and/or asthma.

According to one embodiment, a method is described for determining the cell type and/or tissue contribution of cfDNA in a sample comprising the steps of: i. extracting cfDNA fragments from the sample; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average nucleosomal dyad probability of v. across the reference genome sequence; and vii. determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the mapped peaks obtained from vi. with a library comprising mapped nucleosomal dyads of specific tissues or cell types.

According to one embodiment, the method for determining the cell type and/or tissue contribution of cfDNA in a sample further comprises the step of chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing. Specifically, the further step of chaining is performed after mapping in step vi. and before determining in step vii. Thereby, the step of chaining is performed as step vii. and determining is performed as step viii.

According to a specific embodiment, in the case chaining is performed in the method for determining the cell type and/or tissue contribution of cfDNA in a sample, determining in step viii. comprises comparing the mapped peaks obtained from vi. with a library comprising mapped nucleosomal dyads of specific tissues or cell types, and/or comparing the chained peaks obtained in vii. with a library comprising chained peaks of specific tissues or cell types.
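The comparison against a library of tissue- or cell-type-specific dyad maps can be pictured as a simple ranking. The scoring rule (fraction of library peaks recovered in the sample within `tol` bp), the function names, and the toy library are assumptions for illustration; the description does not prescribe a particular scoring scheme.

```python
def overlap_fraction(reference_peaks, sample_peaks, tol=10):
    """Fraction of reference peaks with a sample peak within tol bp."""
    hits = sum(
        any(abs(ref - pos) <= tol for pos in sample_peaks)
        for ref in reference_peaks
    )
    return hits / len(reference_peaks)


def rank_tissues(sample_peaks, library, tol=10):
    """Rank library entries by how well the sample recovers their peaks."""
    scores = {
        name: overlap_fraction(peaks, sample_peaks, tol)
        for name, peaks in library.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    library = {  # hypothetical tissue-specific dyad maps
        "hematopoietic": [100, 300, 500, 700],
        "liver": [150, 350, 550, 750],
    }
    print(rank_tissues([102, 298, 505, 352], library))
```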

According to a specific embodiment, in said method for determining the cell type and/or tissue contribution of cfDNA, the tissue and/or cell types are selected from the group consisting of cancer cells, specifically lung cancer, colorectal cancer, breast cancer, prostate cancer, and bladder cancer; and normal cells, specifically hematopoietic cells, liver cells, epithelial cells, and bone marrow.

According to one embodiment, an in vitro method for determining the health status of a subject is described herein comprising the steps of: i. extracting cfDNA fragments from a sample from the subject; ii. determining the sequence of the cfDNA fragments by performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; and iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; wherein the probability obtained from iv. for different fragment lengths of the cfDNA fragments as obtained from iii. provides information on the health status of said subject, preferably wherein a health status deviating from a healthy status is indicated if the z-scores of the set of cumulative deviations and/or informative counts ratios deviate from the set of distributions of cumulative deviations and/or informative counts ratios recorded from healthy subjects.

According to a specific embodiment, changes of the informative counts ratios as obtained in step iv. between the sample set of informative counts ratios and the standard set of informative counts ratios of healthy subjects are expressed as z-score and a subject is identified as unhealthy, if said z-score exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3. Specifically, changes of the informative counts ratios between the sample set of informative counts ratios and the standard set of informative counts ratios of healthy subjects are expressed as z-score and a subject is identified as unhealthy, if the z-score exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, or even more. A subset of the least varying informative counts ratios may be selected to reduce the complexity of the task or a machine learning model may be trained to learn a binary classification based on all informative counts ratios across a set of frequently occurring fragment lengths, such as fragments with a length between 120 bp and 180 bp, and between 290 bp and 320 bp.

According to a specific embodiment, changes of the cumulative deviations as obtained in step iv. between the sample set of cumulative deviations and the standard set of cumulative deviations of healthy subjects are expressed as z-score and a subject is identified as unhealthy, if said z-score exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3. Specifically, changes of the cumulative deviations between the sample set of cumulative deviations and the standard set of cumulative deviations of healthy subjects are expressed as z-score and a subject is identified as unhealthy, if the z-score exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, or even more. A subset of the lowest cumulative deviations may be selected to reduce the complexity of the task or a machine learning model may be trained to learn a binary classification based on all cumulative deviations across a set of frequently occurring fragment lengths, such as fragments with a length between 120 bp and 180 bp, and between 290 bp and 320 bp.

According to a specific embodiment, in said method of determining the health status of a subject, a health status deviating from a healthy status is cancer, unhealthy aging, or pregnancy-associated complications.

According to a specific embodiment, in said method of determining the health status of a subject, a subject is considered healthy if, for example, the ratio of the non-random fragment counts inside a centered 41 bp window to those outside said window, for a specific nucleosome dyad distribution such as that of 167 bp fragments, does not deviate abnormally from the same count ratio obtained from healthy subjects, e.g. with a z-score of 1.
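The count ratio for a centered window can be computed as in the following sketch. It assumes that `counts` holds one non-random dyad count per relative base position of fragments of a single length (for instance 167 entries for 167 bp fragments); the function name and the input layout are illustrative assumptions.

```python
def informative_counts_ratio(counts, window=41):
    """Counts inside a centered window divided by counts outside it.

    counts: per-position counts along fragments of one fixed length.
    """
    n = len(counts)
    if window >= n:
        raise ValueError("window must be shorter than the fragment length")
    start = (n - window) // 2  # left edge of the centered window
    inside = sum(counts[start:start + window])
    outside = sum(counts) - inside
    if outside == 0:
        raise ValueError("no counts outside the window")
    return inside / outside


if __name__ == "__main__":
    # Toy distribution for 167 bp fragments with an enriched center.
    toy = [1] * 63 + [3] * 41 + [1] * 63
    print(informative_counts_ratio(toy))
```

For a 167 bp fragment the window covers relative positions 63 to 103 (0-based), which is symmetric around the central base at position 83. The resulting ratio would then be compared, for example as a z-score, with the ratios recorded from healthy subjects.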

According to a specific embodiment of the invention, a health status may be diagnosed. Such a health status can be an unhealthy status. Thereby, a certain disease, health condition, or also a predisposition may be diagnosed.

As used herein, the term “diagnose” or “diagnosis” of a status or outcome generally refers to predicting or diagnosing the status or outcome, determining predisposition to a status or outcome, monitoring treatment of a subject, diagnosing a therapeutic response of a subject, and prognosis of status or outcome, progression, and response to particular treatment.

Non-limiting examples of the diagnosed, monitored, or treated diseases include neurodegenerative diseases, cancers, chemotherapy-related toxicities, irradiation induced toxicities, organ failures, organ injuries, organ infarcts, ischemia, acute vascular events, a stroke, graft-versus-host-disease (GVHD), graft rejections, sepsis, systemic inflammatory response syndrome (SIRS), cytokine releasing syndrome (CRS), multiple organ dysfunction syndrome (MODS), traumatic injuries, aging, diabetes, atherosclerosis, autoimmune disorders, eclampsia, preeclampsia, infertility, pregnancy-associated complications, coagulation disorders, asphyxia, drug intoxication, poisoning, and infections. In one specific embodiment, the disease is a cancer.

Numerous cancers may be detected, monitored, or treated using the methods described herein. Cancer cells, as most cells, can be characterized by a rate of turnover, in which old cells die and are replaced by newer cells. Generally dead cells, in contact with vasculature in a given patient, may release DNA or fragments of DNA into the bloodstream. This is also true of cancer cells during various stages of the disease. This phenomenon may be used to detect the presence or absence of cancers in individuals using the methods described herein.

For example, blood from patients at risk for cancer is drawn, or urine is collected, and the sample is prepared as described herein to generate a population of cfDNA. The methods of the disclosure are employed to detect cfDNA fragment patterns and features that may be unique to certain cancers present. The method may detect the presence of cancerous cells in the body, despite the absence of symptoms or other hallmarks of disease. The method may also help to detect different subtypes of cancer based on the features of the cfDNA fragments detected in the patient sample.

The types and number of cancers that are detected, monitored, or treated include, but are not limited to, blood cancers, brain cancers, lung cancers, skin cancers, nose cancers, throat cancers, liver cancers, bone cancers, lymphomas, pancreatic cancers, bowel cancers, rectal cancers, thyroid cancers, bladder cancers, kidney cancers, mouth cancers, stomach cancers, solid state tumors, heterogeneous tumors, homogeneous tumors and the like.

In certain embodiments, the methods provided herein may be used to monitor already known cancers or other diseases in a particular patient. This allows a practitioner to adapt treatment options in accordance with the progress of the disease. In this example, the methods described herein may track cfDNA or ctDNA in a particular patient over the course of the disease. In some instances, cancers can progress, i.e. become more aggressive and genetically unstable. In other examples, cancers remain benign, inactive, dormant or in remission. The methods of this disclosure may be useful in determining disease progression, remission or recurrence and the appropriate adjustments in treatment that are required for the disease state.

Further, the systems and methods described herein may be useful in determining the efficacy of a particular treatment option. Biological samples are collected longitudinally over time from a single patient and comparison of the cfDNA profiles in all of the different samples collected illustrates how the cancer or disease is progressing or diminishing.

It is useful to consider the mathematical or symbolic underpinnings of certain methods disclosed herein.

According to one embodiment, Bayesian inference is used to compute the positions of nucleosome dyads based on coverage maxima and cfDNA fragmentation by using Bayes’ theorem. The theorem is shown in equation I.

$$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)} \qquad \text{(I)}$$

In equation I, H is the hypothesis, and E is the evidence. Probabilities are P(H) as the prior probability, P(E|H) as the likelihood, P(E) is called the model evidence or marginal likelihood, and P(H|E) is the posterior probability which is computed according to the methods described herein. For the problem at hand, the hypothesis is that the position of a nucleosome, represented by the position of its dyad, can be derived from the location of an observed cfDNA fragment, which originates from that very same nucleosome, by taking into account the length of the fragment and prior knowledge about the relationship between the dyad’s location and the fragment length.

According to one embodiment, the evidence E is the combined information about cfDNA fragments gained from read alignment against the reference sequence e.g., a high-quality human reference genome, after sequencing. The sequence alignment step produces the length and position information for each fragment. In this context, the evidence E at a specific locus will also be called “observed fragmentation” or “fragmentation evidence”.

According to one embodiment, the fragment length-specific prior probability P(H) gives the probability of a nucleosome, which is represented by its dyad in our model, being positioned relative to each base of the fragment. Based on the knowledge that nucleosome dyads confer by far the highest cleaving resistance to cfDNA fragments, the probability distribution of the dyad location across a fragment can be approximated by the associated cleaving resistance distribution. The maximum or the most pronounced local maxima of this cleaving resistance distribution gives or give the expected location of the nucleosome dyad or the locations of multiple nucleosome dyads from multiple DNA-associated histone complexes (i.e. di-nucleosomal fragments) relative to the fragment before all of the cfDNA fragmentation evidence of the alignment locus of that fragment has been taken into account.

The process of computing fragment length-dependent prior distributions is also referred to herein as “Creating prior knowledge”.

According to one embodiment, the likelihood P(E|H) is the probability of observing a cfDNA fragment locally under the hypothesis that nucleosomal DNA in immediate genomic vicinity was the origin of the fragment before degradation. The likelihood reduces to the observed local fragmentation after taking into account that observing unprotected fragments by chance is highly unlikely. Observation of cfDNA fragments in bodily fluids of living mammals can only be justified by DNA being in a protective nucleosomal structure before fragmentation that hinders rapid clearance and recycling of cellular debris.

According to one embodiment, the denominator P(E) is either called marginal likelihood or model evidence.

According to a specific embodiment, for the factor P(E), other parameters like genomic locus or fragment length have been “integrated out” so that the probability does not depend on them anymore. If this marginal likelihood factor is omitted, the posterior probability is only proportional to the combination of observed fragmentation and prior knowledge (equation II).
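In the standard notation, omitting the marginal likelihood P(E) from Bayes’ theorem leaves a proportionality between the posterior and the product of likelihood (observed fragmentation) and prior (prior knowledge). The form below is the usual textbook expression and is given as an illustration of equation II.

```latex
P(H \mid E) \propto P(E \mid H)\, P(H) \tag{II}
```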

According to this specific embodiment related to equation II, it is not possible to integrate over the result to compute an actual probability between 0 and 1. However, this is negligible as only the local maxima of the posterior nucleosome signal are of interest, and these can be determined from a scaled version of the posterior probability, independent of a constant scaling factor or a factor that varies significantly only on a large scale, i.e., not locally.

According to one embodiment, the posterior probability P(H|E) is what is of interest in the methods described herein, i.e., the probability of the hypothesis H being true after observing E. In other words, it is the average resistance to cleaving by DNases across all cfDNA pool tissue sources at a respective base of the genome given the local fragmentation evidence. Finding local maxima/calling peaks of this signal yields positions that show a relatively high probability of harboring a nucleosome dyad in at least one of the contributing tissues since cleaving resistance maxima are considered to be conferred by nucleosomes, i.e., each maximum is the resulting average expected location of the nucleosomal dyad at that locus.

According to one embodiment, local peaks of the posterior probability in equation II refer to the base positions in the reference genome sequence where a nucleosomal dyad is most likely to be present as determined in the methods described herein. The observed fragmentation refers to the cfDNA fragmentation profile obtained by aligning the DNA sequences of the cfDNA fragments with a reference genome sequence. The prior knowledge refers to the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments.
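By way of a non-limiting illustration, one possible way to combine observed fragmentation with length-specific priors is to add each aligned fragment's prior onto the reference coordinates it covers and then call local maxima. The sketch below assumes this additive combination, which is not stated in the description; the names `posterior_signal`, `call_peaks`, `fragments` and `priors` are hypothetical.

```python
def posterior_signal(fragments, priors, region_start, region_end):
    """Accumulate an unnormalized dyad signal over [region_start, region_end).

    fragments: iterable of (start, length) from read alignment (0-based start).
    priors: dict mapping fragment length -> list of per-base prior values.
    Assumption: priors of overlapping fragments are simply summed.
    """
    signal = [0.0] * (region_end - region_start)
    for start, length in fragments:
        prior = priors.get(length)
        if prior is None:
            continue  # no prior available for this fragment length
        for offset, p in enumerate(prior):
            idx = start + offset - region_start
            if 0 <= idx < len(signal):
                signal[idx] += p
    return signal


def call_peaks(signal):
    """Return indices of local maxima (left edge of a plateau)."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]]
```

Because only the positions of local maxima are used, the signal does not need to be normalized to a probability.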

According to one embodiment, the method described herein is used to assess the health status of a person using features based on native cfDNA characteristics which describe either parameters of the fragmentome of an individual or specifically the nucleosome positioning along the genome of cfDNA contributing cells of this individual, and, thus, indirectly the chromatin state at the time of cfDNA shedding of these cells (chromatin-associated parameters). The informative character of a feature is defined based on its distribution among homogeneous groups of samples sharing an identical or at least similar health/disease status and its ability to distinguish between different sample groups based on these distributions. The deviation of a feature from its expected distribution may be expressed as a z-score (using the standard deviation from the normal group).
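As a non-limiting illustration, the z-score of a feature relative to the normal group may be computed as sketched below. The use of the sample standard deviation and the name `feature_z_score` are assumptions made for this example.

```python
from statistics import mean, stdev


def feature_z_score(value, normal_values):
    """z-score of a sample's feature value against a group of normal samples.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    mu = mean(normal_values)
    sigma = stdev(normal_values)
    if sigma == 0:
        raise ValueError("normal group has zero variance")
    return (value - mu) / sigma
```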

The term “fragmentome” as used herein refers to all aspects related to cfDNA fragmentation pattern analysis, such as the cfDNA length and the frequency of these at specific loci, the relationships between DNA sequence and cfDNA fragment end locations, whether the cfDNA fragment ends are jagged or blunt, relationship to nucleosomes and open chromatin regions, fragment strandedness (i.e., single stranded vs. double stranded DNA) and observation frequency, their coordinates compared to the reference genome, and directional information of the cfDNA ending locations.

According to another embodiment, the parameters describing an individual’s fragmentome that are extracted from a sequencing dataset are the dyad count distribution, parameters derived from dyad count distributions, and parameters derived from approximate probability density functions of nucleosome dyads (priors).

According to a specific embodiment, the dyad count distribution is the distribution of the approximate locations of local maximum cleaving resistance (= dyad position of hypothetical nucleosomes) over fragments of a specific length in a predefined range of base pairs around the fragment center and this for all fragments with a minimum number of occurrences in the sequencing data set (“dyad count distributions”). The minimum number of occurrences (e.g., 100000, 125000, 150000, 175000, 200000, 225000, or 250000), depends on whether the distribution of approximate location is derived from a single sample or pool of samples and on the achieved sequencing depth of the sequencing experiment.

According to a specific embodiment, a parameter derived from dyad count distributions is the ratio of informative (= nonrandom) counts over non-informative (= fully random) counts in a centralized window of predefined length for all fragment lengths with a minimum number of occurrences (“informative counts ratios”).

According to a specific embodiment, a parameter derived from dyad count distributions is the fragment lengths and length ranges for which the informative counts ratio deviates significantly from the expected healthy distribution (“fragments showing aberrant dyad distributions”). Significance may be obtained from analytical derivations (e.g. confidence interval of the proportion) or empirical methods (e.g. bootstrapping or jackknifing).

According to a specific embodiment, a parameter derived from dyad count distributions is the fragment length with the highest informative count ratio (“most informative fragment length”).

According to a specific embodiment, a parameter derived from dyad count distributions is based on a specific dyad count distribution: e.g., certain fragment lengths selected for their biological relevance (e.g., 147 bp nucleosome core, 167 bp nucleosome and linker, with different linker lengths being described across tissues). For example, fragments with a length of 167 bp could be considered: the ratio of nonrandom dyads counted inside of a 41 bp window (= including bases 64 to 104) around the central base (= base no. 84; using 1-based counting) over the nonrandom dyad counts outside of this window up to the bases marking fragment ends. This is also referred to as “kurtosis of dyad placement”, which is illustrated in Figure 14.
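The 167 bp example can be sketched as follows. The input `nonrandom_counts` is a hypothetical per-base list of nonrandom dyad counts (index 0 corresponds to base no. 1); how nonrandom counts are separated from random counts is not specified here and is left outside the sketch.

```python
def dyad_placement_ratio(nonrandom_counts, center=84, half_width=20):
    """Ratio of nonrandom dyad counts inside a central window over those outside.

    With the defaults, the window covers bases 64 to 104 (41 bp, 1-based)
    of a 167 bp fragment.
    """
    first = center - half_width          # base 64 (1-based)
    last = center + half_width           # base 104 (1-based)
    inside = sum(nonrandom_counts[first - 1:last])
    outside = sum(nonrandom_counts) - inside
    if outside == 0:
        return float("inf")              # all nonrandom counts are central
    return inside / outside
```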

According to a specific embodiment, a parameter derived from dyad count distributions is the approximated probability density functions of nucleosome dyad placement over fragments of identical length (bp) with a minimum number of occurrences in the sequencing data set (“dyad probability densities” or also “nucleosome dyad prior probability distribution” or abbreviations thereof). According to a specific embodiment, a parameter derived from approximate probability density functions of nucleosome dyads (priors) is the cumulative deviation of a sample’s dyad probability density function from the one extracted from the healthy control group for each fragment length (with a minimum number of occurrences in the data set) in a centralized window of predefined size per fragment length (“cumulative deviation of positioning”). The cumulative deviation of positioning is illustrated in Figure 15.

According to a specific embodiment, a parameter derived from approximate probability density functions of nucleosome dyads (priors) is the fragment lengths and length ranges with a significant “cumulative deviation of positioning”, e.g., z-score greater than 2.

According to a specific embodiment, a parameter derived from approximate probability density functions of nucleosome dyads (priors) is the signal-to-noise ratio of a sample’s fragment length specific dyad count distribution (e.g. for 177 bp fragments) computed from the highest mode of the informative part (= signal) and the noise computed in a window (e.g. a 60 bp centralized window) inside the “nominal regions” that appear close to fragment ends (“dyad signal-to-noise ratio”). The dyad signal-to-noise ratio is illustrated in Figure 16.

According to a specific embodiment, the features describing parameters of the chromatin might be computed for the whole genome or sub regions thereof. Sub regions might be continuous in their genomic coordinates, or derive from sets of regions. "Informative features" may be used to derive a set of sub regions. These regions can be single loci or certain functional regions of the genome that appear homogeneous with respect to one or more of these features (see “Types of regions”). Alternatively, sub regions can be predefined compositions of regions (also of different types) to which the same or similar biological functionality/meaning can be attributed or which contain entities (genes, cis-regulatory elements) that were described to interact in molecular signaling pathways. An example would be the regions that belong to a molecular pathway known to have an important role in diseases (e.g. signaling pathways such as MARK, STAT, PAM, PI3K-AKT, RAS, NOTCH, APC, TGF-beta, ERBB, pathways involved in cell cycle regulation and apoptosis, pathways for DNA damage repair or chromatin modification, and other pathways that engage in transcriptional regulation that are important in cancer). According to a specific embodiment, types of regions used in the methods described herein are the inferred nucleosome locus, gene-specific regions, regions of the genome around cis-regulatory elements, and/or other potentially relevant loci.

According to a specific embodiment, the inferred nucleosome locus is the immediate vicinity of a locus for which the presence of a nucleosome dyad was predicted or experimentally derived. This usually relates to a region spanning up to 300 bp centered on an inferred nucleosome dyad signified by a nucleosome posterior peak. Existing data (e.g. ATAC-seq, MNase-seq, ChIP-seq, etc.) might be used to create sets of loci for specific health conditions/pathologies/cells/tissues.

According to a specific embodiment, the gene-specific regions contain multiple nucleosomes, such as transcription start sites (TSSs), transcription termination sites (TTSs), intron/exon borders of the gene, the “gene body” (whole CDS region), adjacent cis-regulatory elements like the upstream promoter region. A symmetric window of several kbp encompassing that region is usually used to describe local nucleosome positioning characteristics.

According to a specific embodiment, the regions of the genome around cis-regulatory elements are regions, specifically around transcription factor binding sites (TFBSs), that may not be immediately assignable to certain genes and can show long-range regulatory effects on multiple genes, like enhancers, silencers, and insulators.

According to a specific embodiment, other potentially relevant loci (potential cis-regulatory elements) of unknown function are based on temporary accessibility of chromatin, e.g., defined by hypersensitivity to cleaving by DNase, so-called DNase hypersensitive sites (DHSs).

According to a specific embodiment, these regions may be linked to adjacent genes based on genomic distance, published literature and databases, and based on detected TAD boundaries surrounding both the region and the gene. The links may be used to inform a selection of regions (e.g., those associated with tumor suppressors or oncogenes). These region assemblies are used as an additional set of information aside of the directly attributable features.

According to one embodiment, chromatin-associated features may be used in the methods described herein. Chromatin associated features are mainly based on inferred nucleosome loci and the fragment support of these inferred dyad positions. Inference of nucleosome dyad positions is based on the determination of local maxima of cleaving resistance, i.e., the nucleosome posterior signal. These local peaks are called from the nucleosome dyad posterior signal. Subsets of fragments may be used to reflect different nucleosomal associations which are linked to fragment length. To this end, only mononucleosomal fragments (e.g. 82 bp - 270 bp), di-nucleosomal fragments (e.g. >270 bp), or sub-mononucleosomal fragments (e.g. <82 bp) may be used to compute a feature. Combinations of features computed from different sets of fragments may be combined in the computation of new features. Fragment length ranges may be optimized based on fragment occurrence in the sequencing data set and therefore might be smaller or larger than indicated.
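As a non-limiting illustration, fragments may be assigned to the length classes named above before a feature is computed. The boundaries in the sketch are the example values given above (82 bp and 270 bp) and are exposed as parameters since they may be optimized; the function name is hypothetical.

```python
def fragment_class(length, mono_min=82, mono_max=270):
    """Assign a cfDNA fragment to a nucleosomal class by its length in bp."""
    if length < mono_min:
        return "sub-mononucleosomal"
    if length <= mono_max:
        return "mononucleosomal"
    return "di-nucleosomal"
```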

According to a specific embodiment, nucleosome dyad positions are used as chromatin-associated features and nucleosome dyad positions are inferred from the genomic origin-resolved cfDNA fragment pool and prior information about the nucleosome dyad location relative to fragments of different length.

According to a specific embodiment, the nucleosome dyad’s fragment support is used as chromatin-associated feature. An inferred nucleosome dyad’s fragment support is calculated from the locally observed fragmentation or the subset of fragments which was used during inference, and the proximity of the maximum of each individual fragment’s nucleosome prior distribution. The maximum of each such prior distribution must be in immediate vicinity of the inferred nucleosome dyad (e.g. within 35 bp, 30 bp, 25 bp, 20 bp, 15 bp, 10 bp, 7 bp, 5 bp, or 3 bp up- and downstream) to be regarded as supporting it. Lower distance values result in a more stringent estimation of the dyad fragment support.
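A minimal sketch of such a support count is given below. It assumes that the genomic coordinate of each fragment's prior maximum is already known and that the distance threshold is inclusive; `dyad_fragment_support` and its parameters are hypothetical names.

```python
def dyad_fragment_support(dyad_pos, prior_max_positions, max_distance=10):
    """Count fragments whose prior maximum lies within max_distance bp
    up- or downstream of the inferred dyad (threshold assumed inclusive)."""
    return sum(1 for pos in prior_max_positions
               if abs(pos - dyad_pos) <= max_distance)
```

Choosing a smaller `max_distance` (e.g. 3 bp instead of 35 bp) gives the more stringent estimate described above.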

According to one embodiment, features describing inferred nucleosome dyads and peak groups may be used in the methods described herein. These features refer to the occurrence of one main peak and no or multiple side peaks. An alternative name for the peak group category is “peak core”. These features are peak prominence, core-peak prominence, peak dilation, raw fragment support, main fragment support, detailed fragment support, main support ratio, naive phase, main phase, detailed phase, peak dispersion, peak dispersion distance, peak confidence, main peak confidence and detailed peak confidence, side peaks, chained, primary chain, secondary chain, pathologic chain, relative genomic distance, upstream/downstream end of primary chain, and upstream/downstream end of secondary chain.

According to a specific embodiment, peak prominence refers to the height of a main peak (called from the dyad posterior signal) compared to surrounding main peaks in the vicinity or how it scores against a distribution of peaks at that locus from a group of normal samples (e.g. rank in list of peaks from normal sample plus current peak).

According to a specific embodiment, core-peak prominence refers to a feature like peak prominence but relative to the side peaks of the group of peaks which was assigned to the main peak during peak grouping.

According to a specific embodiment, peak dilation refers to the distance in bp which describes the stretch of bases that the flat posterior signal is covering if it was limited to a certain maximum height around the peak (based on peak height); only for main peaks. The dilation feature of posterior peak is illustrated in Figure 17.

According to a specific embodiment, raw fragment support refers to the number of fragments that support a specific peak call based on maxima of nucleosome priors for fragments in close vicinity of the peak call (e.g. within 75 bp or depending on dilation of peak). This metric disregards any other peak calls (main peaks and side peaks) at the locus. The number of fragments can be replaced by the sum over their GC-bias correction weights.

According to a specific embodiment, main fragment support is the number of fragments supporting a main peak call. Fragments can be assigned only to one main peak. This metric disregards side peaks. The number of fragments can be replaced by the sum over their GC-bias correction weights.

According to a specific embodiment, detailed fragment support refers to the number of fragments supporting a peak call. Fragments can be assigned only to one peak inside a peak group. This metric resolves fragment support inside peak groups. The number of fragments can be replaced by the sum over their GC-bias correction weights. It can be computed for side peaks.

According to a specific embodiment, main support ratio refers to the detailed fragment support of the main peak of a peak group over the sum of detailed fragment support of all side peaks of that group. The number of fragments can be replaced by the sum over their GC-bias correction weights.

According to a specific embodiment, naive phase refers to the “phasedness” of all signals belonging to the same peak group, i.e., the similarity of nucleosome placement across tissues with similar dyad placement. Computationally, this is the average of the metric described below over all N fragments with midpoint inside a window of certain size around the peak (e.g. a 149 bp symmetric window, or a window based on the peak dilation metric, which includes more fragments for broader peaks). For each fragment, the location of the maximum of its prior is weighted by a factor describing the proximity to the peak call. The maximum possible distance of a prior’s maximum is half the chosen window size w. Subtracting the actual absolute distance di (always non-negative) of the prior’s maximum from the maximum possible distance and dividing the result by the maximum distance gives a value between 0 and 1 (0 = “furthest away” and 1 = “maximum of prior overlaps with base of the called peak”). The phase feature of posterior peak is illustrated in Figure 18.

According to a specific embodiment, main phase refers to a feature like naive phase, but the fragments are filtered for those for which the current main peak is the closest main peak. Only for main peaks.

According to a specific embodiment, detailed phase refers to a feature like main phase, but the fragments are additionally divided among the closest side peaks in the peak group. It can be computed for side peaks.

According to a specific embodiment, peak dispersion refers to 1 minus the respective phase metric, which yields the respective dispersion metric (between 0 = all prior maxima overlap and 1 = all contributing prior maxima are at the window borders, no maxima in center).
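The phase and dispersion metrics can be sketched as follows, assuming that the absolute distances of the prior maxima to the called peak have already been collected for the N fragments in the window. Clamping distances beyond half the window size to a proximity of 0 is an assumption of this sketch, and the function names are hypothetical.

```python
def naive_phase(distances, window=149):
    """Average proximity factor (w/2 - d_i) / (w/2) over all fragments."""
    if not distances:
        raise ValueError("no fragments in window")
    max_dist = window / 2
    factors = [max(0.0, (max_dist - abs(d)) / max_dist) for d in distances]
    return sum(factors) / len(factors)


def peak_dispersion(distances, window=149):
    """Dispersion is 1 minus the phase metric."""
    return 1.0 - naive_phase(distances, window)
```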

According to a specific embodiment, peak dispersion distance refers to the distance in bases between the most upstream peak and the most downstream peak of the peak group.

According to a specific embodiment, peak confidence refers to the approximate average contribution of a fragment’s prior signal to the nucleosome dyad peak (posterior signal). In contrast to phase, the proximity factor is multiplied (i.e., weighted) with the non-random signal strength of the prior (which is represented by the absolute height of the maximum of the prior, pmax; could also be replaced by the non-random fraction of the prior) before forming the sum. To get an absolute value between 0 and 1, the result is divided by the maximal possible sum of signal contributions. This feature combines information about signal strength of a fragment’s prior distribution and the relative location of the fragment to a specific dyad call. The confidence feature of posterior peak is illustrated in Figure 19.
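One possible reading of this computation is sketched below. It interprets the “maximal possible sum of signal contributions” as the sum of the prior maxima, i.e. the value reached when every prior maximum coincides with the called peak; this interpretation and the function name are assumptions.

```python
def peak_confidence(distances, prior_maxima, window=149):
    """Proximity-weighted sum of prior maxima, scaled to the range 0..1.

    distances: absolute distance of each fragment's prior maximum to the peak.
    prior_maxima: height pmax of each fragment's prior maximum.
    Assumption: the normalizer is sum(prior_maxima).
    """
    max_dist = window / 2
    weighted = sum(max(0.0, (max_dist - abs(d)) / max_dist) * p
                   for d, p in zip(distances, prior_maxima))
    total = sum(prior_maxima)
    return weighted / total if total > 0 else 0.0
```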

According to a specific embodiment, main peak confidence and detailed peak confidence are computed from different sets of fragments, similar to those used for main phase and detailed phase computations.

According to a specific embodiment, side peaks refer to the number of sub-peaks that are called around a main peak. According to a specific embodiment, the chained feature describes whether or not a peak was chained together with other peaks to form a chain of inferred nucleosome dyads (1 if in a chain, 0 otherwise).

According to a specific embodiment, in a primary chain the peak is part of the primary chain (= nucleosome chain with highest count of main peaks in region).

According to a specific embodiment, in a secondary chain the peak is part of a secondary chain (= chain with fewer or no main peaks compared to the primary chain in the same region).

According to a specific embodiment, in a pathologic chain such as a “CRC chain” the peak is part of a chain (primary or secondary) that was associated with a certain pathology.

According to a specific embodiment, the relative genomic distance is the distance of an inferred nucleosome dyad (= location of maximum of main peak) or the start or end of a chain of such nucleosome dyads to the next locus of a certain type (e.g. EVX2 TFBS) or a specific locus (e.g. TSS of TP53 gene).

According to a specific embodiment, the upstream/downstream end of primary chain refers to the first/last inferred nucleosome dyad of a primary chain; relative to the + strand.

According to a specific embodiment, the upstream/downstream end of secondary chain refers to the first/last inferred nucleosome dyad of a secondary chain; relative to the + strand.

According to one embodiment, features describing multiple inferred nucleosome dyads and/or chained nucleosomes are nucleosome repeat length, chain length, chain regularity, nucleosome density, and positioning diversity (might be expressed as a ratio).

According to a specific embodiment, the nucleosome repeat length (NRL; or inter-nucleosome distance) describes an average or median distance between a main peak and a specific number of surrounding peaks or the average over all main peaks found in a certain region or set of regions or in a primary or secondary chain.
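As a non-limiting illustration, the median variant of the nucleosome repeat length may be computed from the sorted positions of main peaks in a region or chain. The function name is hypothetical and the sketch considers only distances between directly neighboring peaks.

```python
from statistics import median


def nucleosome_repeat_length(dyad_positions):
    """Median distance in bp between neighboring inferred dyads."""
    pos = sorted(dyad_positions)
    if len(pos) < 2:
        raise ValueError("at least two dyads are required")
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    return median(gaps)
```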

According to a specific embodiment, the chain length is the length of a nucleosome chain as number of chained dyads and/or as absolute distance in bp.

According to a specific embodiment, the chain regularity is a metric describing how regular the inter-nucleosome distance of inferred nucleosomes in a chain or a region is. According to a specific embodiment, the nucleosome density is the number of inferred nucleosome dyads per kilobase across a defined region or set of regions, primary chain or secondary chain; might exclude inferred dyads from secondary chains wherever applicable.
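Chain length and nucleosome density can be sketched as follows. Half-open region coordinates and the function names are assumptions of this example.

```python
def chain_length(dyad_positions):
    """Return (number of chained dyads, absolute span in bp)."""
    pos = sorted(dyad_positions)
    span = pos[-1] - pos[0] if pos else 0
    return len(pos), span


def nucleosome_density(dyad_positions, region_start, region_end):
    """Inferred dyads per kilobase in the half-open region [start, end)."""
    inside = [d for d in dyad_positions if region_start <= d < region_end]
    return len(inside) / ((region_end - region_start) / 1000)
```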

According to a specific embodiment, the positioning diversity (ratio) is the average number of inferred dyads including peaks from secondary chains divided by the nucleosome density of the region or set of regions.

According to one embodiment, a feature using nucleosome chains is “secondary chains”, describing the number of secondary chains in a region.

According to another embodiment, a feature using nucleosome chains is the relative genomic distance of peak or chain to next instance of certain type (e.g. EVX2 TFBSs) or a specific instance (e.g. TSS of TP53 gene).

According to one embodiment of the methods described herein, compound features can be created by mathematically combining features in a meaningful way. Features that are not derived from posterior signal such as depth of fragment coverage across specific regions or at specific loci can be combined with features that are inferred from nucleosome positions and/or fragment support and/or other features that are derived from the underlying nucleosome peaks. If features are combined over regions, typical descriptive statistics like mean, median, standard deviation and other metrics and derivatives of these are used.

According to a specific embodiment, a compound feature is the chromatin state concordance across contributing tissues: either binary (“discordant/chaotic” = 0, or “concordant” = 1) or a continuous value, for example a value between 0 (= discordant) and 1 (= identical/fully concordant).

According to a specific embodiment, a compound feature is the chromatin condensation if concordant or higher than a chosen threshold, e.g. 0.5, or if a nucleosome chain with certain regularity and NRL characteristics is present: 30 nm chromatin fibril (dense), 10 nm “beads on a string” conformation (loose) based on depth of fragment coverage, support metrics for peak(s) of a chain, inter-peak distance and possibly further features to refine prediction of the state.

According to a specific embodiment, a compound feature is the DNA accessibility of a locus or average across loci: this feature is derived from nucleosome positions, derivatives, and depth of coverage. If at least one chain spans the locus, instead of coverage, the average fragment support across chain peaks, the fragment support for neighboring peaks and/or overlapping peaks (e.g. within 47 bp) and the internucleosome distance and chain regularity can be used to compute accessibility or to predict the accessibility of the locus for each chain.

According to one embodiment, classifiers and predictors may be used in the methods described herein. Models for classification of health status of a patient and prediction of certain health parameters (e.g. response to therapy, development of tumor resistance to treatment, recurrence free survival, time to recurrence, tumor metastasized or not, time to sepsis, etc.) are trained on features and feature sets using machine learning methods. Features and feature sets may be selected and reduced in dimensions using principal component analysis (PCA) to extract combinations of features that explain variability in the data, non-negative matrix factorization (NMF) to extract recurring signatures in homogeneous sample groups, random forests or gradient boosting machines (also to limit the number of allowed decisions) to assess important binary decisions for classification, auto-encoders to reduce feature space to important hyperparameters (also de-noising) or similar methods and/or any combination of these. Suppression of batch effects on the feature selection procedure is achieved by applying standard controlling procedures involving computing of correlation metrics, regression analysis, (hierarchical) clustering of features based on similarity and testing resulting clusters against known possible confounding variables like sequencing batch, sequencing technology, depth of coverage, age of sample, sample sex (wherever applicable) and any other example.

According to one embodiment, tissue deconvolution is used or performed in a method described herein. Tissue deconvolution refers to the inference of the cfDNA contribution by individual tissues and/or cell types to the cfDNA pool and uses a reference catalog of tissue-/cell type-specific feature signatures. The catalog is created from existing sequencing data sets. Signatures may consist of single features or combinations of features and sets of these as described above. Features and sets may be restricted to certain regions or sets of regions of the genome, especially in the case of chromatin-associated features. To name a possible approach, non-negative matrix factorization (NMF) can be used to extract recurrent feature signatures from homogeneous groups of samples. Using the reference catalog, NMF can also be used to compute a “best fit” linear combination of signatures from a sequencing dataset. Signatures may not scale linearly with the abundance of their corresponding cfDNA releasing cells. Therefore, methods other than NMF might be used to achieve a more accurate deconvolution. Tissue deconvolution yields an estimate of the contribution of each tissue/cell type described by the reference catalog as a value between 0 and 1. Minor contributions might be ignored and only a ranked list of top contributing tissues used for further model training and/or use in regression or classification tasks.
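A “best fit” non-negative combination of catalog signatures can be sketched with multiplicative updates, i.e. the coefficient update of NMF with the signature matrix held fixed. This is one simple stand-in technique and not necessarily the procedure used; the linear mixing model, the normalization to a sum of 1 and all names below are assumptions.

```python
def deconvolve(sample, signatures, iterations=500):
    """Estimate tissue fractions for a non-negative feature vector.

    sample: list of m non-negative feature values.
    signatures: dict mapping tissue name -> list of m non-negative values.
    Returns a dict of contributions scaled to sum to 1.
    """
    names = list(signatures)
    h = [1.0 / len(names)] * len(names)   # start from equal contributions
    eps = 1e-12                           # guards against division by zero
    for _ in range(iterations):
        # Current reconstruction of the sample from all signatures.
        fitted = [sum(h[j] * signatures[n][i] for j, n in enumerate(names))
                  for i in range(len(sample))]
        for j, n in enumerate(names):
            sig = signatures[n]
            num = sum(s * v for s, v in zip(sig, sample))
            den = sum(s * f for s, f in zip(sig, fitted)) + eps
            h[j] *= num / den             # multiplicative update keeps h >= 0
    total = sum(h)
    return {n: w / total for n, w in zip(names, h)}
```

The returned values can be ranked so that only the top contributing tissues are carried forward.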

According to one embodiment, the methods described herein comprise determining an index of fragment length and dyad position. Specifically, wherein for each cfDNA length, it is determined how often the dyad is in the center of the cfDNA fragments.

According to a specific embodiment, determining an index of fragment length and dyad position may be used for determining the health status of a subject. Specifically, in healthy cells the dyad is in the center of a cfDNA fragment. More specifically, an asymmetric distribution may be an indication of disease. This disease can be cancer, for example. Non-limiting examples of causes for such a change of fragment length and dyad position are mutations in histone genes, altered composition of nucleosomes, and mutations in the cfDNA, e.g., due to degradation machinery.

According to one embodiment, a nucleosome prior fragmentation index is determined. Specifically, the index is based on fragment length and dyad position.

According to a specific embodiment, for each cfDNA fragment length, it is determined how often the dyad is in the center or off the center. Specifically, the healthier a person, the more often the dyad is in the center of cfDNA fragments. As an example, from n cfDNA fragments, p% have a length of 167 bp; of those, the dyad is in the center in q%.
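The fraction q for one fragment length can be sketched as follows. Dyad offsets are taken as 1-based positions within the fragment (so that the center of a 167 bp fragment is base no. 84); the tolerance around the center and the function name are assumptions.

```python
def dyad_center_fraction(dyad_offsets, fragment_length, tolerance=0):
    """Fraction of fragments of one length whose dyad lies at the center.

    dyad_offsets: 1-based dyad position within each fragment.
    tolerance: allowed distance in bp from the central base (assumed).
    """
    if not dyad_offsets:
        raise ValueError("no fragments of this length")
    center = (fragment_length + 1) / 2
    hits = sum(1 for o in dyad_offsets if abs(o - center) <= tolerance)
    return hits / len(dyad_offsets)
```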

According to one embodiment, the methods described herein comprise characterization of genomic regions where the dyad is preferentially in the center of the cfDNA and genomic regions where the dyad has more variable positions. Specifically, regions with preferentially dyad-centered cfDNAs are of greater regulatory importance than regions where the dyad is more frequently off the center.

According to one embodiment, nucleosome phasing means “strict phasing”, e.g., +1 nucleosomes vs. “fuzziness”.

According to one embodiment, information on a single gene level can be deduced. Specifically, absence of a nucleosome at the TSS is compatible with being expressed. Specifically, the presence of a nucleosome at the TSS is incompatible with being expressed. According to a specific embodiment, further nucleosome positions can be included. Specifically, in an active gene, the peak downstream of the TSS reflecting the +1 nucleosome is high and located approx. at position +50 bp, the peak upstream of the TSS (-1 nucleosome) is increased, and its maximum is located between -175 bp and -225 bp. Specifically, the distances of the downstream nucleosomes to each other may further be included.

According to one embodiment, the methods described herein may comprise determining stable genes (=always the same nucleosome pattern) and unstable genes (=various nucleosome patterns). Specifically, stable genes are referred to as genes which always show the same nucleosome pattern. Specifically, unstable genes are referred to as genes having various nucleosome patterns.

According to one embodiment, the methods described herein may comprise determining for specific gene sets the number of genes which are without a nucleosome and/or the variability of the genes.

According to a specific embodiment, the nucleosome position for all HK genes may be determined. Specifically, the number of genes without nucleosome at TSS may be determined. Specifically, the variability for HK genes may be determined.

According to a specific embodiment, the nucleosome position for PAU genes may be determined. Specifically, the number of genes without nucleosome at TSS may be determined. Specifically, the variability for PAU genes may be determined.

According to one embodiment, the methods described herein may comprise the determination of the aging status of a subject.

According to a specific embodiment, based on the determination of a nucleosomal dyad, for each genomic region can be determined whether the genomic region has a regulatory function. For example, it can be determined whether it is a TSS or TFBS or a region with another regulatory function. This step may be further facilitated by including reference genome annotations. Then, the cfDNA fragmentation pattern may be determined for each regulatory region/TF binding site, e.g., the number, the lengths of cfDNA fragments covering the respective regulatory regions and their relative positioning in the region or to the binding site. From the cfDNA lengths occurring at a particular regulatory region, a variety of statistics may be calculated, e.g., the distribution of various cfDNA fragment lengths, proportions of certain cfDNA lengths, subdivision of cfDNA fragmentation lengths into groups, calculation of ratios between the groups, and other statistics. According to a specific embodiment, the cfDNA fragmentation patterns within their regulatory context allow estimations about the age of a subject, as for example, gene transcription changes in an age-dependent way.
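The fragment-length statistics listed above can be sketched for a single regulatory region as follows. The example assumes the lengths of all cfDNA fragments covering the region have already been collected; the function name and the 150 bp cut-off separating a "short" from a "long" group are hypothetical and only serve to show the grouping and the ratio between groups.

```python
from collections import Counter


def region_length_stats(lengths, short_cutoff=150):
    """Summarize cfDNA fragment lengths covering one regulatory region.

    lengths: fragment lengths in bp (assumed input).
    short_cutoff: hypothetical boundary; lengths <= cutoff form the
                  "short" group, all others the "long" group.
    """
    n = len(lengths)
    if n == 0:
        return {"n": 0, "distribution": {}, "short_fraction": None,
                "short_long_ratio": None}
    counts = Counter(lengths)
    short = sum(c for length, c in counts.items() if length <= short_cutoff)
    long_ = n - short
    return {
        "n": n,
        # Proportion of fragments observed at each length.
        "distribution": {length: c / n for length, c in sorted(counts.items())},
        "short_fraction": short / n,
        # Ratio between the two groups; undefined without long fragments.
        "short_long_ratio": short / long_ if long_ else None,
    }
```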

According to one embodiment of the invention, the present invention provides a computer-implemented method. Thereby, all features described herein and above in relation to an in vitro method also apply to the computer-implemented methods described herein.

According to one embodiment of the invention, a computer-implemented method is described for analyzing cell-free DNA (cfDNA) fragments from a sample comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; and v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence.
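Steps iii. to v. can be pictured with the following minimal sketch. It is not the probability model of the invention: as a stand-in, each fragment is assumed to carry a Gaussian dyad weight centered on its midpoint (hypothetical parameter `sigma`), the average is assumed to be the per-position sum divided by the fragment coverage, and peaks are taken as simple local maxima. Fragments are given as hypothetical (start, length) pairs on a single region.

```python
import math


def fragment_dyad_probabilities(length, sigma=20.0):
    """Stand-in model: Gaussian weight around the midpoint, normalized to 1."""
    mid = (length - 1) / 2
    weights = [math.exp(-0.5 * ((i - mid) / sigma) ** 2) for i in range(length)]
    total = sum(weights)
    return [w / total for w in weights]


def average_dyad_probability(fragments, region_length):
    """Average the per-fragment dyad probabilities at each region position.

    fragments: list of (start, length) pairs, 0-based, aligned to a region
               of `region_length` bases (assumed input).
    """
    prob_sum = [0.0] * region_length
    coverage = [0] * region_length
    for start, length in fragments:
        for offset, p in enumerate(fragment_dyad_probabilities(length)):
            pos = start + offset
            if 0 <= pos < region_length:
                prob_sum[pos] += p
                coverage[pos] += 1
    # Positions without coverage get an average of 0.0.
    return [s / c if c else 0.0 for s, c in zip(prob_sum, coverage)]


def map_peaks(profile, min_height=0.0):
    """Return positions that are local maxima above `min_height`."""
    return [
        i for i in range(1, len(profile) - 1)
        if profile[i] > min_height
        and profile[i] > profile[i - 1]
        and profile[i] >= profile[i + 1]
    ]
```

With two non-overlapping 167 bp fragments, the mapped peaks fall on the two fragment midpoints.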

According to a specific embodiment, the computer-implemented method described herein for analyzing cell-free DNA (cfDNA) fragments from a sample further comprises the step vi. of chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing.
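The chaining of step vi. can be sketched as below. The text does not state the spacing rules in this passage, so the bounds `min_spacing` and `max_spacing` for the distance between neighboring dyads are hypothetical placeholders, and the greedy left-to-right grouping is a simplification chosen for illustration.

```python
def chain_peaks(peaks, min_spacing=150, max_spacing=230):
    """Group mapped dyad peaks into chains of regularly spaced nucleosomes.

    peaks: dyad peak positions in bp on one chromosome (assumed input).
    min_spacing, max_spacing: hypothetical bounds on the distance between
        consecutive dyads for them to belong to the same chain.
    """
    chains = []
    current = []
    for pos in sorted(peaks):
        if current and min_spacing <= pos - current[-1] <= max_spacing:
            # Spacing fits the rule: extend the current chain.
            current.append(pos)
        else:
            # Spacing violates the rule: close the chain and start a new one.
            if current:
                chains.append(current)
            current = [pos]
    if current:
        chains.append(current)
    return chains
```

A peak that lies too close to or too far from its predecessor simply starts a new chain in this sketch.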

According to another embodiment of the invention, a computer-implemented method is described for determining the health status of a subject comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning the DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; and vi. comparing the mapped peaks obtained in v. with a library comprising standard maps and outlier maps of nucleosomal dyads; wherein congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status, wherein congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status, and/or wherein congruence with the library of outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps of libraries from healthy and unhealthy subjects indicates an association with an unhealthy status.
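One possible way to quantify congruence in step vi. is sketched below. It assumes each standard map is simply a list of dyad positions and measures the deviation as the mean distance from each sample peak to its nearest map peak; this measure and the function names are illustrative assumptions, and outlier maps are not handled.

```python
from bisect import bisect_left


def mean_dyad_deviation(sample_peaks, reference_peaks):
    """Mean distance in bp from each sample peak to the nearest map peak."""
    ref = sorted(reference_peaks)
    if not sample_peaks or not ref:
        raise ValueError("both peak lists must be non-empty")
    total = 0
    for pos in sample_peaks:
        i = bisect_left(ref, pos)
        # Nearest neighbor is either just left or just right of position i.
        candidates = ref[max(i - 1, 0): i + 1]
        total += min(abs(pos - r) for r in candidates)
    return total / len(sample_peaks)


def closer_library(sample_peaks, healthy_map, unhealthy_map):
    """Report which standard map the sample deviates from less."""
    d_healthy = mean_dyad_deviation(sample_peaks, healthy_map)
    d_unhealthy = mean_dyad_deviation(sample_peaks, unhealthy_map)
    if d_healthy < d_unhealthy:
        return "healthy"
    if d_unhealthy < d_healthy:
        return "unhealthy"
    return "undetermined"
```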

According to a specific embodiment of the invention, a computer-implemented method for determining the health status of a subject is described herein. Said method comprises the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing; and vii. comparing the mapped peaks obtained in v. with a library comprising standard maps, comparing the mapped peaks obtained in v. with a library comprising outlier maps of nucleosomal dyads, and/or comparing the chained peaks obtained in vi. with a standard map of nucleosomal dyad chains, preferably wherein said comparing the mapped peaks obtained in v. comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning; wherein a. congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status; b. congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status; c. congruence with the library of outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps of libraries from healthy and unhealthy subjects is characteristic for an unhealthy status; and/or d. difference with a standard map of nucleosomal dyad chains obtained from healthy subjects is characteristic for an unhealthy status.

According to a specific embodiment, the library of standard maps derived from unhealthy subjects and/or said library of outlier maps of nucleosomal dyads derived from unhealthy subjects is from subjects suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.

According to another embodiment of the invention, a computer-implemented method is described for determining the health status of a subject comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; and v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; wherein chains of nucleosomal dyad positions are determined by chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing, wherein a deviation of a subject’s nucleosomal dyad chains from a standard map of nucleosomal dyad chains obtained from healthy subjects is characteristic for an unhealthy status.

According to another embodiment of the invention, a computer-implemented method is described for monitoring the treatment success of a patient comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; and vi. comparing the mapped peaks obtained in v. with the mapped peaks of a previous result from said patient and/or a standard map of nucleosomal dyads characteristic for the treatment success, wherein differences and/or congruences obtained in vi. provide information on the treatment success of the patient.

According to another embodiment, a computer-implemented method is described for monitoring the treatment success of a patient comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing; and vii. comparing the mapped peaks obtained in v. with the mapped peaks of a previous result from said patient, comparing the mapped peaks obtained in v. with a standard map of nucleosomal dyads characteristic for the treatment success, comparing the chained peaks obtained in vi. with the chained peaks of a previous result from said patient, and/or comparing the chained peaks obtained in vi. with a standard map of nucleosomal dyad chains characteristic for the treatment success, preferably wherein comparing the mapped peaks comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, and/or determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, wherein differences and/or congruences obtained in vii. provide information on the treatment success of the patient.

According to another embodiment of the invention, a computer-implemented method is described for determining the cell type and/or tissue contribution of cfDNA in a sample comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average nucleosomal dyad probability of iv. across the reference genome sequence; and vi. determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the mapped peaks obtained from v. with a library comprising mapped nucleosomal dyads of specific tissues or cell types.
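The comparison with a tissue library can be illustrated by a simple matching score. The sketch assumes the library is a mapping from tissue or cell type name to a list of dyad positions, and it scores each entry by the fraction of sample peaks that have a library peak within a hypothetical tolerance `tol`. This score only ranks candidate tissues; it is not a quantitative estimate of tissue contribution and is not taken from the text.

```python
from bisect import bisect_left


def tissue_match_scores(sample_peaks, tissue_maps, tol=10):
    """Fraction of sample peaks matched by each tissue-specific dyad map.

    sample_peaks: mapped dyad peak positions of the sample (assumed input).
    tissue_maps: {tissue_name: [dyad positions]} (assumed library layout).
    tol: hypothetical tolerance in bp for counting two peaks as matching.
    """
    scores = {}
    for tissue, ref_peaks in tissue_maps.items():
        ref = sorted(ref_peaks)
        matched = 0
        for pos in sample_peaks:
            # First library peak at or above pos - tol; it matches if it
            # also lies at or below pos + tol.
            i = bisect_left(ref, pos - tol)
            if i < len(ref) and ref[i] <= pos + tol:
                matched += 1
        scores[tissue] = matched / len(sample_peaks) if sample_peaks else 0.0
    return scores
```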

According to a specific embodiment, a computer-implemented method is described for determining the cell type and/or tissue contribution of cfDNA in a sample comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing, and vii. determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the mapped peaks obtained from v. with a library comprising mapped nucleosomal dyads of specific tissues or cell types and/or determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the chained peaks obtained in vi. with a standard map of nucleosomal dyad chains of specific tissues or cell types.

According to another embodiment of the invention, a computer-implemented method is described for determining the health status of a subject comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; and iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; wherein the probability obtained from iii. for different fragment lengths of the cfDNA fragments as obtained from i. provides information on the health status of said subject, preferably wherein the differences between the sample set of informative counts ratios as obtained in step iii. and the standard set of informative counts ratios of healthy subjects are expressed as z-scores and a subject is identified as unhealthy if said z-scores deviate from the distributions of ratios and/or cumulative deviations recorded from healthy subjects, preferably wherein a health status deviating from a healthy status is cancer or pregnancy-associated complications.
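The z-score comparison can be sketched as follows. The informative counts ratio is treated here as a single number per sample whose computation is not shown; the function names and the use of the sample standard deviation are assumptions. The default limit of 3 follows the preferred absolute z-score limit given in item 3 of the items listed in this description.

```python
from statistics import mean, stdev


def informative_ratio_zscore(sample_ratio, healthy_ratios):
    """z-score of a sample's ratio against ratios from healthy subjects.

    healthy_ratios: informative counts ratios of healthy subjects
                    (assumed input, at least two values).
    """
    mu = mean(healthy_ratios)
    sd = stdev(healthy_ratios)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("healthy ratios show no variation")
    return (sample_ratio - mu) / sd


def deviates_from_healthy(sample_ratio, healthy_ratios, limit=3.0):
    """True if the absolute z-score exceeds the limit."""
    return abs(informative_ratio_zscore(sample_ratio, healthy_ratios)) > limit
```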

According to one embodiment, a computer program is provided comprising instructions which, when the program is executed by a computer, cause the computer to carry out the computer-implemented method described herein.

According to a further embodiment, a computer-readable medium is provided having stored thereon the computer program described herein for performing the computer-implemented method.

According to a specific embodiment, the computer-implemented method described herein comprises the step of receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample of a subject. Specifically, said data may be generated using a sequencing-device connected to the computer or apparatus used for performing the computer-implemented method described.

According to one embodiment, other features of the methods described herein, especially the features described herein in the context of the Bayesian inference calculations and parameters derived from the results of said calculations, can be combined with the computer-implemented methods described herein.

According to one embodiment, a data processing apparatus comprising means for carrying out the computer-implemented methods described herein is provided by the invention.

According to a specific embodiment, said data processing apparatus may be connected to an apparatus or device capable of sequencing cfDNA fragments.

According to a specific embodiment, said data processing apparatus may be connected to an apparatus or device capable of extracting cfDNA from a sample. Specifically, said data processing apparatus is further connected to an apparatus or device capable of sequencing cfDNA fragments.

According to one embodiment, a computer program is provided comprising instructions which, when the program is executed by a computer, cause the computer to carry out a computer-implemented method described herein. Specifically, said computer program may be combined with a computer program comprising instructions to cause the device capable of extracting cfDNA from a sample to execute its function of extracting cfDNA from a sample. Specifically, said computer program may be further combined with a computer program comprising instructions to cause the device capable of sequencing cfDNA fragments to execute its function of sequencing cfDNA fragments. Alternatively, said computer program may be combined with a computer program comprising instructions to cause the device capable of sequencing cfDNA fragments to execute its function of sequencing cfDNA fragments.

According to yet another embodiment, an apparatus is used for performing a method described herein. Such apparatus may be characterized by the following features: (a) a sequencer configured to (i) receive DNA extracted from a sample of the bodily fluid comprising DNA, and (ii) sequence the extracted DNA under conditions that produce DNA fragment sequences; and (b) a computational apparatus configured to (e.g., programmed to) instruct one or more processors to perform various operations, such as two or more of the method operations described herein. In some embodiments, the computational apparatus is configured to perform one or more of the steps of the computer-implemented method described herein.

In certain embodiments, the apparatus also includes a tool for extracting DNA from the sample under suitable conditions. In some embodiments, the apparatus includes a module configured to extract cfDNA obtained from plasma for sequencing in the sequencer.

In some examples, the apparatus includes a database of reference genome sequences and/or a library comprising standard maps and outlier maps of nucleosomal dyads. The computational apparatus may be further configured to instruct the one or more processors to map the cfDNA fragments obtained from the blood of the individual to the database of reference genome(s). The computational apparatus may be further configured to instruct the one or more processors to map the nucleosomal dyads obtained from the analysis of cfDNA in a sample as described herein to the database of reference genome(s). Said mapped nucleosomal dyads or peaks of the average probability of the presence of a nucleosomal dyad may be compared by the apparatus with the library comprising standard maps and outlier maps of nucleosomal dyads.

In general, the computational apparatus may perform all steps of the method described herein that can be performed by such an apparatus.

Analysis of the sequencing data and the results derived therefrom are typically performed using computer hardware operating according to defined algorithms and programs. Therefore, certain embodiments employ processes involving data stored in or transferred through one or more computer systems or other processing systems. Embodiments of the invention also relate to an apparatus for performing these operations. This apparatus may be specially constructed for the required purposes, or it may be a general-purpose computer (or a group of computers) selectively activated or reconfigured by a computer program and/or data structure stored in the computer. In some embodiments, a group of processors performs some or all of the recited analytical operations collaboratively (e.g., via a network or cloud computing) and/or in parallel. A processor or group of processors for performing the methods described herein may be of various types including microcontrollers and microprocessors such as programmable devices (e.g., CPLDs and FPGAs) and other devices such as gate array ASICs, digital signal processors, and/or general purpose microprocessors.

In addition, certain embodiments relate to tangible and/or non-transitory computer readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations. Examples of computer-readable media include, but are not limited to, semiconductor memory devices, magnetic media such as disk drives, magnetic tape, optical media such as CDs, magneto-optical media, and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM). The computer readable media may be directly controlled by an end user or the media may be indirectly controlled by the end user. Examples of directly controlled media include the media located at a user facility and/or media that are not shared with other entities. Examples of indirectly controlled media include media that is indirectly accessible to the user via an external network and/or via a service providing shared resources such as the "cloud." Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.

According to one embodiment, the use of a computer-implemented method is described herein, wherein said computer-implemented method is used in a method described herein, specifically in an in vitro method described herein. Specifically, a computer-implemented method is used comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; and ii. performing at least one of the steps of a method described herein, specifically of an in vitro method described herein, more specifically performing at least one of the steps iii. to viii. of an in vitro method described herein.

The following items are described herein:

1. An in vitro method for analyzing cell-free DNA (cfDNA) fragments from a sample comprising the steps of: i. extracting cfDNA fragments from the sample; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average probability of the presence of a nucleosomal dyad of v. across the reference genome sequence; and vii. chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing.

2. An in vitro method for determining the health status of a subject comprising the steps of: i. extracting cfDNA fragments from a sample from the subject; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning the DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average probability of the presence of a nucleosomal dyad of v. across the reference genome sequence; vii. chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing; and viii. comparing the mapped peaks obtained in vi. with a library comprising standard maps, comparing the mapped peaks obtained in vi. with a library comprising outlier maps of nucleosomal dyads, and/or comparing the chained peaks obtained in vii. with a standard map of nucleosomal dyad chains, preferably wherein said comparing the mapped peaks obtained in vi. comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning; wherein a. congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status; b. congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status; c. congruence with the library of outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps of libraries from healthy and unhealthy subjects is characteristic for an unhealthy status; and/or d. difference with a standard map of nucleosomal dyad chains obtained from healthy subjects is characteristic for an unhealthy status; preferably wherein said library of standard maps derived from unhealthy subjects and/or said library of outlier maps of nucleosomal dyads derived from unhealthy subjects is from subjects suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.

3. The in vitro method of item 2, wherein the subject is considered unhealthy if the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for a healthy subject is more than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads obtained in vi. and standard maps and outlier maps of nucleosomal dyads characteristic for an unhealthy subject; wherein the subject is considered healthy if the deviation of nucleosomal dyad positioning between the mapped nucleosomal dyads obtained in vi. and standard maps of nucleosomal dyads characteristic for a healthy subject is less than the deviation of nucleosomal dyad positioning between the mapped nucleosome dyads obtained in vi. and standard maps and outlier maps of nucleosomal dyads characteristic for an unhealthy subject; and/or wherein a subject is considered unhealthy if the z-score of the changes of the informative counts ratios between the sample set of informative counts ratios and the standard set of informative counts ratios of healthy subjects exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3; and/or wherein a subject is considered unhealthy if the z-score of the changes of the cumulative deviations between the sample set of cumulative deviations and the standard set of cumulative deviations exceeds a limit that is characteristic for a specific group of unhealthy samples, preferably if the absolute value of the z-score exceeds 3.

4. An in vitro method for monitoring the treatment success of a patient comprising the steps of: i. extracting cfDNA fragments from a sample of said patient; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average probability of the presence of a nucleosomal dyad of v. across the reference genome sequence; vii. chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing; and viii. comparing the mapped peaks obtained in vi. with the mapped peaks of a previous result from said patient, comparing the mapped peaks obtained in vi. with a standard map of nucleosomal dyads characteristic for the treatment success, comparing the chained peaks obtained in vii. with the chained peaks of a previous result from said patient, and/or comparing the chained peaks obtained in vii. with a standard map of nucleosomal dyad chains characteristic for the treatment success, preferably wherein comparing the mapped peaks comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, and/or determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, wherein differences and/or congruences obtained in viii. provide information on the treatment success of the patient.

5. The in vitro method of item 4, wherein the treatment success is monitored for the treatment of cancer, specifically of prostate cancer, colon cancer, breast cancer, bladder cancer, and/or lung cancer; and for the treatment of inflammatory diseases, specifically of inflammatory bowel disease, systemic lupus erythematosus, ulcerative colitis; chronic inflammatory diseases such as thyroiditis, Crohn's disease, chronic obstructive pulmonary disease; and/or asthma.

6. An in vitro method for determining the cell type and/or tissue contribution of cfDNA in a sample comprising the steps of: i. extracting cfDNA fragments from the sample; ii. performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; v. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iv. and the fragmentation profile of iii.; vi. mapping peaks of the average nucleosomal dyad probability of v. across the reference genome sequence; vii. chaining the mapped peaks obtained from step vi. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing; and viii. determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the mapped peaks obtained from vi. with a library comprising mapped nucleosomal dyads of specific tissues or cell types and/or determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the chained peaks obtained in vii. with a standard map of nucleosomal dyad chains of specific tissues or cell types.

7. The in vitro method of item 6, wherein the tissue and/or cell types are selected from the group consisting of cancer cells, specifically lung cancer, colorectal cancer, breast cancer, prostate cancer, and bladder cancer; and normal cells, specifically hematopoietic cells, liver cells, epithelial cells, and bone marrow.

8. A computer-implemented method for determining the health status of a subject comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing, and vii. comparing the mapped peaks obtained in v. with a library comprising standard maps, comparing the mapped peaks obtained in v. with a library comprising outlier maps of nucleosomal dyads, and/or comparing the chained peaks obtained in vi. with a standard map of nucleosomal dyad chains, preferably wherein said comparing the mapped peaks obtained in v. comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, and/or determining chains of nucleosomal dyad positioning wherein a. congruence with the library of standard maps derived from healthy subjects and difference with the library of standard maps derived from unhealthy subjects is characteristic for a healthy status; b. congruence with the library of standard maps derived from unhealthy subjects and difference with the library of standard maps derived from healthy subjects is characteristic for an unhealthy status; c. congruence with the library of outlier maps of nucleosomal dyads derived from unhealthy subjects and difference with the standard maps of libraries from healthy and unhealthy subjects is characteristic for an unhealthy status; and/or d. difference with a standard map of nucleosomal dyad chains obtained from healthy subjects is characteristic for an unhealthy status; preferably wherein said library of standard maps derived from unhealthy subjects and/or said library of outlier maps of nucleosomal dyads derived from unhealthy subjects is from subjects suffering from cancer, inflammation, coronary disease, acute tissue damage, chronic disease, complications during pregnancy, beginning sepsis, sepsis, and/or unhealthy aging.

9. A computer-implemented method for monitoring the treatment success of a patient comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing, and vii. comparing the mapped peaks obtained in v. with the mapped peaks of a previous result from said patient, comparing the mapped peaks obtained in v. with a standard map of nucleosomal dyads characteristic for the treatment success, comparing the chained peaks obtained in vi. with the chained peaks of a previous result from said patient, and/or comparing the chained peaks obtained in vi. with a standard map of nucleosomal dyad chains characteristic for the treatment success, preferably wherein comparing the mapped peaks comprises determining the deviation of nucleosomal dyad positioning on cfDNA fragments, determining changes in the informative counts ratio, and/or determining presence or absence of specific nucleosome positioning that is characteristic for one of the healthy or unhealthy subject groups, wherein differences and/or congruences obtained in vii. provide information on the treatment success of the patient.

10. A computer-implemented method for determining the cell type and/or tissue contribution of cfDNA in a sample comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from i. with a reference genome sequence; iii. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; iv. determining the average probability of the presence of a nucleosomal dyad at base positions in the reference genome sequence based on the probability of iii. and the fragmentation profile of ii.; v. mapping peaks of the average probability of the presence of a nucleosomal dyad of iv. across the reference genome sequence; vi. chaining the mapped peaks obtained from step v. across the reference genome sequence into series of mapped peaks following rules for naturally occurring nucleosomal spacing, and vii. determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the mapped peaks obtained from v. with a library comprising mapped nucleosomal dyads of specific tissues or cell types and/or determining the cell type and/or tissue contribution of the cfDNA fragments by comparing the chained peaks obtained in vi. with a standard map of nucleosomal dyad chains of specific tissues or cell types.

11. A data processing apparatus comprising means for carrying out the method of any one of items 8 to 10.

12. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any one of items 8 to 10.

13. A computer-readable medium having stored thereon the computer program of item 12.

14. Use of a computer-implemented method in a method according to any one of items 1 to 7, said method comprising the steps of: i. receiving data representing the DNA sequences of cfDNA fragments acquired by sequencing of cfDNA fragments extracted from a sample; ii. performing at least one of the steps iii. to viii. according to any one of items 1 to 7.

15. An in vitro method for determining the health status of a subject comprising the steps of: i. extracting cfDNA fragments from a sample from the subject; ii. determining the sequence of the cfDNA fragments by performing whole genome sequencing on the extracted cfDNA fragments; iii. determining the cfDNA fragmentation profile by aligning DNA sequences of the cfDNA fragments from ii. with a reference genome sequence; and iv. determining the probability of the presence of a nucleosomal dyad for each base position of the cfDNA fragments; wherein the probability obtained from iv. for different fragment lengths of the cfDNA fragments as obtained from iii. provides information on the health status of said subject, preferably wherein a health status deviating from a healthy status is indicated if the z-score of the set of cumulative deviations and/or informative counts ratios deviate from the set of distributions of cumulative deviations and/or informative counts ratios recorded from healthy subjects, preferably wherein a health status deviating from a healthy status is cancer or pregnancy-associated complications.

EXAMPLES

The examples described herein are illustrative of the present invention and are not intended to be limitations thereon. Many modifications and variations may be made to the techniques described and illustrated herein. Accordingly, it should be understood that the examples are illustrative only and not limiting.

Example 1: General aspects of methods described

Principle of liquid biopsy for circulating tumor DNA analysis.

The basic principle of a liquid biopsy is illustrated in Figure 1:

(Top left) Tumor cells may release their DNA into the circulation. Tumors often have an increased vascularization due to their demand for nutrition because of the accelerated growth and higher number of tumor cell divisions. If tumor cells are in direct contact with blood, the likelihood that their DNA will be shed into the circulation increases.

(Top center) Several mechanisms for DNA release into the bloodstream have been suggested, such as apoptosis, necrosis, and active secretion. Most cfDNA pieces in human plasma are nonrandomly fragmented, which is thought to be due to their origin mainly from apoptotic cells. Apoptosis is typically accompanied by changes in cell morphology, such as blebbing of the cell membrane.

(Top right) Apoptosis may eventually result in the death and destruction of the respective cell and the release of DNA and associated molecules into the circulation.

(Bottom right) The enzymatic digestion during apoptosis fragments DNA; however, the architecture of nucleosomes determines access to DNA. The close association between DNA and the nucleosomal core particle protects the DNA from enzymatic digestion. Therefore, cleavage of DNA in internucleosomal regions is a frequent event, whereas cleavage of nucleosome-bound DNA is rare. This results in the typical size distribution of cfDNA fragments with a mode of 167 bp, which corresponds approximately to the length of DNA wrapped in ~1.7 left-handed superhelical turns around a histone octamer (~147 bp) plus a linker fragment (~20 bp).

(Bottom left) A standard blood draw, which is a minimally invasive sampling method, contains blood cells such as white blood cells (WBCs) and erythrocytes. Furthermore, circulating tumor cells (CTCs) may be detectable in the blood of patients with cancer. In addition to cells, multiple other molecules are in the bloodstream. For this patent application, cfDNA molecules are the most relevant ones.

From whole blood to number of fragments and genomic coverage: brief description of technical aspects of sample preparation.

For this molecular testing, 10-20 mL of blood is drawn from the subject into specialized tubes for the stabilization of cfDNA. These tubes contain a proprietary blend of reagents that both prevent blood coagulation and stabilize white blood cells. This is critical, as white blood cells can release their DNA into the circulation, which masks the organ- or tumor-specific signal that is to be detected and profiled downstream. To separate plasma from whole blood, the tubes are centrifuged in two steps at 1,900 x g for 10 minutes each. DNA is isolated from 2-6 mL of plasma using a cfDNA-specific extraction kit. Double-stranded DNA (dsDNA) originating from plasma is then quantified. On average, 1 mL of plasma from a patient with cancer contains approximately 1,500 diploid genome equivalents (GE) (~10 ng of DNA), with considerably higher amounts often observed in patients with metastatic cancer. A typical 10 mL blood draw yields on average 4 mL plasma containing 6,000 GE (12 x 10^3 molecules per region or gene). For library preparation, 10 ng of DNA is used as input and PCR is performed to selectively amplify library fragments containing adapters for subsequent sequencing. Libraries are then quantified and sequenced in paired-end mode (150 bp x 2 or 100 bp x 2) at high coverage (~30x). For humans, 30x coverage can be achieved with 600 million reads of 150 bp (or 300M paired-end reads). After quality control and pre-processing of sequencing reads (e.g. adapter trimming, base quality filtration, GC-correction), the fragment coverage, defined as the number of nucleosome prior distributions per base, is extracted from each sample dataset.
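The yield and coverage figures quoted above can be checked with simple arithmetic. The following is a minimal sketch only; the haploid genome size of 3.1 x 10^9 bp and all function names are assumptions made for illustration and are not part of the described method.

```python
# Back-of-the-envelope check of the yield and coverage figures in the text.
HUMAN_GENOME_BP = 3.1e9  # assumed haploid genome size; not stated in the text


def genome_equivalents(plasma_ml, ge_per_ml=1500):
    """Diploid genome equivalents (GE) contained in a plasma volume."""
    return plasma_ml * ge_per_ml


def molecules_per_locus(ge):
    """Each diploid GE carries two copies of an autosomal region or gene."""
    return 2 * ge


def mean_coverage(n_reads, read_length_bp, genome_bp=HUMAN_GENOME_BP):
    """Average per-base depth: sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_bp


ge = genome_equivalents(plasma_ml=4)
print(ge, molecules_per_locus(ge))
print(round(mean_coverage(600e6, 150), 1))
```

Under these assumptions, 4 mL of plasma corresponds to 6,000 GE and 12 x 10^3 molecules per locus, and 600 million reads of 150 bp give a mean depth close to 30x.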

Primary cfDNA analysis.

The basic principle of primary cfDNA analysis is illustrated in Figure 2:

(Top left: DNA isolation from plasma sample) Within the blood vial, the blood cells must be separated from the cell-free or liquid component of whole blood, and the latter is referred to as blood plasma. This separation is done by centrifugation. After centrifugation, the different blood fractions are visible in the blood vial where the yellow fraction corresponds to the plasma (~55% of whole blood). The thin white layer is the buffy coat containing WBCs and platelets (~1% of whole blood). In contrast, the red fraction (~45% of whole blood) comprises mainly erythrocytes. cfDNA is isolated from the yellow part, i.e., the blood plasma, and kept in Eppendorf tubes for subsequent analysis.

(Top right: library preparation for sequencing) cfDNA fragments are subjected to library preparation for subsequent NGS, including ligating adapters to the cfDNA fragments.

(Bottom left: untargeted NGS sequencing) The libraries are then sequenced with an NGS sequencer. Such a sequencer may be the Illumina NextSeq; however, other NGS sequencers from this manufacturer or other vendors can equally be used for this purpose.

(Bottom right) Sequencing produces an output of strings made up of the nucleotides A, G, C, or T. These sequence reads must be processed with sophisticated bioinformatics tools to derive any meaningful information.

Nucleosome occupancy analysis based on depth-of-coverage.

The basic principle is shown in Figure 3:

(Top left: read alignment) Schematic representation of an enlarged chromosomal region with cfDNA sequence reads (gray bars) mapped to the reference genome (blue bar). Two regions show an increased number of sequencing reads.

(Top right: nucleosomal occupancy analysis) Bioinformatics tools are applied to infer from the sequence reads the depth of coverage (red line) and distinguish between well-protected loci, which should correspond to nucleosomal bound DNA, moderately protected loci, or unprotected regions, which should reflect increased enzymatic digestion due to the lack of protective nucleosomes. Peak regions of protected loci may represent nucleosome dyad positions (black dashed lines), i.e., regions occupied by the center of a nucleosome.

(Bottom: combined signal over several hundred similar regions of interest) Plasma DNA depth of coverage analysis of distinct genomic regions may provide biologically relevant information. The left coverage plot illustrates a drop in average depth of coverage across many TSSs of genes that are likely to be expressed. At TSSs, nucleosomes are removed to create an NDR over the promoter, allowing transcription factors to bind. When computing compound loci read depths centered at the TSS, the flanking regions upstream and downstream show periodic oscillations, reflecting the organization of nucleosomes adjacent to the TSS. From such a pattern, it can be inferred that the genes used in the compound depth computation are likely expressed in the cells that shed their DNA into the circulation. In contrast, the right coverage plot reflects the average across regions with increased, uniform coverage, indicating densely packed nucleosomes with less defined positioning, suggesting that these genes are unlikely to be expressed.
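The compound read depth centered at many TSSs can be sketched as a per-offset average over equally sized windows. The sketch below is illustrative only: the function name, the list-based coverage track, and the reversal of minus-strand windows (so that all genes run 5' to 3') are assumptions.

```python
def compound_coverage(coverage, tss_positions, flank):
    """Average depth at each offset in [-flank, +flank] around the given TSSs.

    coverage: list of per-base depths for one chromosome (0-based index).
    tss_positions: list of (position, strand) with strand '+' or '-'.
    Sites whose window runs off the chromosome are skipped.
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for pos, strand in tss_positions:
        start, end = pos - flank, pos + flank + 1
        if start < 0 or end > len(coverage):
            continue
        window = coverage[start:end]
        if strand == "-":
            window = window[::-1]  # orient every gene 5' to 3'
        for i, depth in enumerate(window):
            totals[i] += depth
        used += 1
    if used == 0:
        raise ValueError("no TSS window fits inside the coverage track")
    return [t / used for t in totals]
```

A dip at the central offset of the returned profile would correspond to the drop in coverage described for TSSs of genes that are likely to be expressed.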

Typical fragment length distribution

An example is shown in Figure 4: This figure is intended to give a comparison between the relative occurrence of fragment lengths within a single dataset and not absolute values for actual counts, since these may vary greatly depending on the sequencing platform, targeted sequencing depth, and other factors. Hence, the linear y-axis shows no values.

The distribution of fragment lengths in an untargeted whole genome sequencing dataset from isolated double-stranded cfDNA is shown. Fragments were sequenced paired-end to be able to deduce the length based on the start and end positions of reads from both ends. Only fragments falling in the local windows around hypothetical nucleosome dyads were counted during the computation of nucleosome prior distribution.
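Deducing fragment lengths from the outer coordinates of aligned read pairs and expressing them as relative occurrences can be sketched as follows. This is an illustration under stated assumptions: coordinates are taken as 0-based with an exclusive end, and the 50-300 bp length filter and the function names are hypothetical choices.

```python
from collections import Counter


def fragment_length(start, end):
    """Length of a fragment given 0-based start and exclusive end coordinates."""
    return end - start


def relative_length_distribution(fragments, min_len=50, max_len=300):
    """Relative occurrence of fragment lengths within one dataset."""
    counts = Counter(
        fragment_length(s, e)
        for s, e in fragments
        if min_len <= fragment_length(s, e) <= max_len
    )
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}
```

Because only relative occurrences are returned, datasets sequenced to different depths can be compared on the same scale.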

Computation of nucleosome dyad prior distribution.

The basic principle is illustrated in Figure 5:

(left side) The lower coverage plot illustrates three regions where different numbers of cfDNA fragments map. To the left is a locus with high coverage of sequencing reads, indicating high resistance to enzymatic digestion. The region on the right side has fewer sequencing reads, suggesting moderate resistance to cleaving. In contrast, the region in the center hardly overlaps with any cfDNA sequencing reads, which indicates no protection from enzymatic digestion. As nucleosomes offer protection from enzymatic digestion during apoptosis and as the nucleosomal dyad is the region where nucleosomal DNA is most tightly bound, it is possible to translate the sequencing depths into nucleosome position maps where the position of maximum coverage overlaps with the nucleosome dyad (dashed light grey line in the upper coverage plot, the position of the inferred nucleosome dyad axis). Hence, nucleosomal dyad positions are inferred from sequencing read depth analysis for each cfDNA fragment in this step.

(right side) The individual cfDNA fragment nucleosomal dyad information is then used to infer within each cfDNA fragment the relative position of the dyad (block arrows; left panel “Nucleosomal Fragments”). The nucleosomal dyad may map in the center of a cfDNA fragment, be somewhere off the center, or may not be determinable. The next step involves fragment length-specific dyad statistics. The inferred nucleosome dyad positions are recorded for all fragments that map to the same locus and have the same length (center panel, black triangles). These statistics are then translated into inferred (hypothetical) nucleosome positions over cfDNA fragments with a specific length (right panel). We refer to the empiric distribution of nucleosome dyad locations per fragment length as “nucleosome prior distribution” (yi). After this step, nucleosomal prior distributions for all cfDNA fragments with a particular length are available.
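The fragment length-specific dyad statistics can be sketched as a table of counts keyed by fragment length and by the offset of an inferred dyad from the fragment center. The sketch is illustrative: the window size `max_offset`, the coordinate convention, and the function name are assumptions.

```python
from collections import defaultdict


def dyad_offset_counts(fragments, dyads, max_offset=150):
    """Count inferred dyads by offset from the fragment center, per fragment length.

    fragments: iterable of (start, end), 0-based, end exclusive.
    dyads: inferred dyad positions on the same chromosome.
    max_offset: assumed half-width of the local window around the center.
    Returns {fragment_length: {offset_in_bp: count}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for start, end in fragments:
        length = end - start
        center = start + length // 2
        for dyad in dyads:
            offset = dyad - center
            if abs(offset) <= max_offset:
                counts[length][offset] += 1
    return {length: dict(offsets) for length, offsets in counts.items()}
```

Each inner dictionary is an empirical dyad count distribution for one fragment length, from which a nucleosome prior distribution can subsequently be derived.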

Transformation of empiric count distributions to nucleosome prior distributions

The principle is shown in Figure 6:

The initial dyad count distribution for a specific fragment length is first truncated according to a certain strategy (step 2; the strategy shown is fragment length-based truncation), then normalized to an area under the curve of one to resemble a probability density function (step 3) and finally, the non-informative constant portion of counts is removed by adjusting the zero level (step 4).
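The three transformation steps can be sketched in a few lines. This is a minimal illustration and not the definitive procedure: the function name is hypothetical, truncation is implemented as the fragment length-based strategy, and the non-informative constant portion is assumed to equal the minimum of the normalized distribution.

```python
def counts_to_prior(counts, fragment_length):
    """Turn a dyad count distribution into a nucleosome prior distribution.

    counts: list of dyad counts by position, with the fragment center at
            index len(counts) // 2.
    """
    center = len(counts) // 2
    half = fragment_length // 2
    # step 2: fragment length-based truncation (keep positions inside the fragment)
    lo = max(0, center - half)
    hi = min(len(counts), center + half + 1)
    truncated = counts[lo:hi]
    # step 3: normalize to an area under the curve of one
    area = sum(truncated)
    if area == 0:
        raise ValueError("no dyad counts left after truncation")
    density = [c / area for c in truncated]
    # step 4: adjust the zero level; assumption: the baseline is the minimum
    baseline = min(density)
    return [d - baseline for d in density]
```

For even fragment lengths, this sketch keeps one position more than the fragment length; a production implementation would need an explicit convention for the two central bases.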

Overview heatmap of dyad count distributions with different distribution truncation strategies.

The principle is shown in Figure 7:

Count distributions are shown relative to the central base(s) (medium gray vertical line) of fragments. Counts for fragment lengths from 50 bp to 300 bp are depicted. Inferred dyads were counted beyond fragment ends, which are indicated by the medium gray dashed lines. The distributions shown in the figure are used for computing prior distributions of nucleosome dyads. Medium gray areas indicate low counts. The transition from medium gray to darker gray and further to lighter grays up to white, as in the center of the figure, indicates increasing counts. The darker spots to the far left and right of the center mark an increase in counts that can be attributed to neighboring nucleosomes of the observed fragment. How precisely neighboring nucleosomes are positioned relative to the nearest one (based on fragment length) can be derived from the spread of the spot for that fragment length. The most accurate neighboring dyad positioning seems to exist for fragments between 200 bp and 220 bp. Horizontal bands of different grays appearing approximately every 10 bp vertically for smaller fragments up to a length of about 160 bp originate from the increasing cleaving resistance based on the DNA helix twisting with an approximate 10 bp periodicity, which causes steric hindrance of the cleaving process because the DNA backbone faces the histone complex in the same periodic manner. White lines indicate the count minima that are closest to the fragment center, which would likely be chosen by the short-range truncation method. Fragment length truncation would terminate count distributions at the fragment ends instead (medium gray dashed lines). The uniform truncation strategy would use an identical distance from the center of each fragment to the truncated bases at both sides (end-to-end distance here: 170 bp; white dotted lines). An unmarked version of the count heatmap is shown in the small panel on the top right.

Nucleosome occupancy pattern from nucleosome priors.

The principle is shown in Figure 8:

Illustration of how the average dyad signal (y), i.e., the nucleosome posterior signal (bottom panel), is computed from the previously computed fragment-specific prior distributions (yi; center panel). In the first step, DNA fragments are replaced by their characteristic nucleosome prior distributions, and the per-base average across these distributions yields the nucleosome posterior signal. After this step, a detailed map of posterior nucleosomal dyad positions across the human genome can be computed via peak calling.
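The replacement of fragments by their prior distributions and the subsequent per-base averaging can be sketched as follows. The data layout (one prior value per fragment base, keyed by fragment length) and the function name are assumptions made for illustration.

```python
def posterior_signal(fragments, priors, region_length):
    """Per-base average of the prior distributions of all overlapping fragments.

    fragments: iterable of (start, end), 0-based, end exclusive.
    priors: {fragment_length: list of prior values, one per fragment base}.
    """
    total = [0.0] * region_length
    n_priors = [0] * region_length  # number of prior distributions per base
    for start, end in fragments:
        prior = priors.get(end - start)
        if prior is None:
            continue  # no prior available for this fragment length
        for offset, value in enumerate(prior):
            pos = start + offset
            if 0 <= pos < region_length:
                total[pos] += value
                n_priors[pos] += 1
    return [t / n if n else 0.0 for t, n in zip(total, n_priors)]
```

Local maxima of the returned signal are the candidates passed on to peak calling.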

Nucleosome occupancy patterns; Nucleosome posterior signal vs. depth of coverage.

The principle is shown in Figure 9:

Using a posterior nucleosome signal results in a significant increase in resolution when resolving nucleosome dyad locations. (A) Display of a coverage-based signal (fine broken light gray line) in comparison to a nucleosome priors-based signal (medium gray line) for a selected human reference DNA sequence (black bar). On the left, a region is shown where the nucleosome-prior position (black dotted vertical line) differs from the highest read coverage (gray dotted line). The difference between the two inferred dyad positions is indicated (A). Hence, our new calculation method maps nucleosome dyad positions with an increased resolution.

(B) Due to the increase in resolution, several neighboring nucleosome position signals in close vicinity to each other can now be resolved. The coverage-based signal reveals only one peak (indicated with a gray arrow), and the nucleosome dyad positioning calculation reveals that the large coverage-based signals consist of three separate dyad signals (black arrows).

(C) The top panel illustrates how the coverage-based signal (fine broken light gray line) is generated from sequencing reads (gray horizontal bars), which map to this particular human reference DNA sequence. The coverage-based analysis results in the identification of one peak region (gray arrow (C)). In contrast, a nucleosome priors-based analysis reveals that the large coverage-based signal consists of two closely positioned nucleosomes (dashed light gray line, indicating the respective dyad axes), with partially overlapping signals. Hence, an interpretation based only on the coverage-based signal can result in wrong conclusions.

Extraction and selection of features from biologically relevant genomic loci for machine learning.

The principle is shown in Figure 10:

(Top panel) The left side illustrates how the combination of depth of coverage (fine broken light gray line) and nucleosome priors-based signal (medium gray line) can be leveraged to calculate the local DNA accessibility (ac) from the depth of coverage (d) and the spacing between nucleosome dyads (Δnd). The accurate calculation of nucleosome dyad positions allows for determining genomic regions where nucleosomes are highly phased or less well organized (“variability of placement” in the second column of this panel). The local organization of nucleosomes can be investigated in more detail using the number and spread of side-peaks of a main peak, which would be the highest local maximum, and the width of the main peak. We get this hierarchical information after classifying peaks according to their height and inter-peak distances after peak calling. Furthermore, it enables the reliable identification of regions with closed or open chromatin (third column). In particular, open chromatin regions indicate locations within the human genome with regulatory functions, such as transcription start sites, i.e., regions with extraordinary biological relevance. One way to analyze the uniformity and value of nucleosome spacing locally across neighboring nucleosomes is to employ discrete short-time Fourier transformation to the compound coverage-based signal (fourth column).
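A discrete short-time Fourier analysis of a positioning signal can be sketched with a plain windowed DFT. The sketch below is illustrative only: the window and step sizes, the function name, and the reduction of each window to a single dominant period are assumptions, and the O(n^2) transform is chosen for clarity rather than speed.

```python
import cmath
import math


def dominant_periods(signal, window, step):
    """Dominant period (in bp) of each window of a positioning signal.

    A plain discrete Fourier transform is evaluated per window; the DC
    component (k = 0) is ignored so that the mean level does not dominate.
    """
    periods = []
    for start in range(0, len(signal) - window + 1, step):
        chunk = signal[start:start + window]
        best_k, best_mag = 1, -1.0
        for k in range(1, window // 2 + 1):
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / window)
                for n, x in enumerate(chunk)
            )
            if abs(coeff) > best_mag:
                best_k, best_mag = k, abs(coeff)
        periods.append(window / best_k)
    return periods
```

A constant dominant period across consecutive windows would indicate uniform nucleosome spacing, whereas a drifting value would indicate locally varying spacing.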

(Lower panels) These analyses allow the generation of multiple features. Machine learning can then extract the informative features for specific medical or biological questions, such as distinguishing between healthy individuals and persons with a disease.

Machine learning classifiers detect pathophysiological states.

The principle is shown in Figure 11:

Two different options to achieve a binary classification between healthy and pathologic states.

Example 2: Single gene nucleosome positioning track analysis

In general, examples 2-5 and example 11 show examples of the layout or implementation of the methods described herein. For comparison, the L-WPS signal (Snyder et al., 2016), another nucleosome positioning signal, is visualized in some examples. Another low-resolution positioning signal, the central 31 fragment bases depth of coverage (c31 DoC), is used in Figure 25 to show applications of current state-of-the-art positioning signals and their weaknesses. The c31 DoC computes the depth of coverage using only the central 31 bp of a cfDNA fragment as determined by the start and end positions of a sequence read pair that was aligned to the reference genome.
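The c31 DoC signal described above can be sketched as a depth-of-coverage computation restricted to the central 31 bp of each fragment. The function name, the coordinate convention, and the handling of fragments shorter than 31 bp are assumptions made for illustration.

```python
def central_depth_of_coverage(fragments, region_length, central_bp=31):
    """Depth of coverage counting only the central bases of each fragment.

    fragments: iterable of (start, end), 0-based, end exclusive.
    """
    depth = [0] * region_length
    for start, end in fragments:
        length = end - start
        if length < central_bp:
            continue  # assumption: fragments shorter than the window are skipped
        first = start + (length - central_bp) // 2
        for pos in range(max(first, 0), min(first + central_bp, region_length)):
            depth[pos] += 1
    return depth
```

For a 167 bp fragment starting at position 0, only positions 68 to 98 are incremented, i.e. the 31 bases centered on position 83.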

Additional features such as the nucleosome dyad, its position within individual cfDNA fragments, and nucleosome mapping with unprecedented resolution lead to a tremendous increase in the accuracy of nucleosome positioning analysis by several orders of magnitude and allow for new applications of cfDNA tests like high accuracy single locus nucleosome positioning analysis from liquid biopsy samples and deconvolution of multiple positioning signals at any locus. To this end, mapping both nucleosome positions and open chromatin regions simultaneously for all present copies of DNA (maternal and/or paternal; depending on the chromosome on which the site of interest is located and the sex of the individual) and/or the top cfDNA donor tissues is enabled; open chromatin regions include regions with regulatory functions and provide insights into essential biological processes, like the ENCODE CIS-regulatory element region marked in Figure 22A. The basic principle of achieving high-resolution dyad mapping is illustrated in the Figures, particularly Figure 8. A comparison of two low-resolution signals is provided in Figure 25 to depict problems with current nucleosome positioning approaches. This innovation is the foundation for many new applications of high biological and medical relevance, which were not possible before. Moreover, this approach provides sufficient coverage to simultaneously capture other critical and clinically relevant features from cfDNA besides nucleosome positioning and open chromatin mapping, such as somatic copy number alterations, tumor fraction represented in plasma, and germline variants. Because all this information can be harvested from a whole-genome sequencing dataset, there is no requirement for a separate library preparation or sequencing procedure, making this workflow more time- and cost-efficient compared to previous approaches for which special assays had to be developed. 
As such, our high-resolution nucleosome dyad mapping approach maximizes the genomic insights obtained from whole-genome sequencing in an unprecedented fashion.

Figure 25 is described in the following:

(A) Violation of the majority positioning assumption and proof of the presence of multiple nucleosome positioning signatures (grouped nucleosome peaks forming multiple overlapping chains in the same genomic region of a single sample). Two low-resolution positioning signals in the pericentromeric region of the short arm of chromosome 12, which is known to harbor well-positioned nucleosome chains. The majority peak positions computed from the local maxima of the positioning signals under the assumption that only one chain of nucleosomes is present are shown below in this screenshot detail taken from the IGV program. The peak calling algorithm enforces a minimum distance between called peaks and looks for the highest local maxima. Rows are as follows: central 31 fragment bases depth of coverage (c31 DoC), nucleosome majority positioning peak calls created using the c31 DoC signal, L-WPS positioning signal, nucleosome majority positioning peak calls created using the L-WPS signal, aligned read depth of coverage as computed by the IGV program, and the read alignments from the BAM file which was used to create all signals and majority calls above. The c31 DoC signal is visualized as colored bars (dark grey). The L-WPS signal is displayed as a line plot with the value zero as a thin horizontal line. It can be seen that L-WPS is mainly negative due to the signal being primarily determined by the local presence of fragment ends, which reduces the L-WPS value. The height of the majority positioning calls is the prominence of the peaks called by the scipy.signal.find_peaks() function in Python. The peaks from both signals show a high concordance, indicating both signals can be used as a source to compute the majority positioning of nucleosomes. Two nucleosome chains are marked in the case of the c31 DoC signal: a more prominent A chain of majority positions and a less pronounced B chain in between the A chain majority call positions. 
No such second chain can be seen in the L-WPS panel of the figure. This shows the low usability of the L-WPS signal for determining secondary nucleosome positions and conducting multiple signature deconvolution. Judging from the c31 DoC signal, it is unclear if only one secondary nucleosome chain is present or multiples because of the wide spread of the c31 DoC peaks.
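Majority peak calling with an enforced minimum distance can be sketched without external libraries. The sketch below is a simplified stand-in and not the library routine named above: it keeps the highest local maxima first and rejects any candidate closer than `min_distance` to an already accepted peak; the function name and the tie-breaking rule are assumptions, and peak prominence is not computed.

```python
def call_peaks(signal, min_distance):
    """Greedy peak calling: highest local maxima first, at least min_distance apart."""
    candidates = [
        i for i in range(1, len(signal) - 1)
        if signal[i - 1] < signal[i] >= signal[i + 1]
    ]
    # visit candidates from highest to lowest; ties are resolved by position
    candidates.sort(key=lambda i: (-signal[i], i))
    kept = []
    for i in candidates:
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)
```

Under the single-chain assumption, lower maxima lying between two accepted peaks are discarded, which is how a secondary chain such as the B chain would be lost from the majority calls.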

(B) Comparison between the low-resolution c31 DoC nucleosome positioning signal of three healthy male samples (younger than 30 years; top) and three prostate cancer samples with high ctDNA fraction between 45% and 65% (bottom). A pericentromeric region on the short arm of chromosome 12, which should harbor well-positioned nucleosomes, is shown. It can be seen that the cancer samples show more variability in nucleosome positioning (dilated peaks when compared to the healthy young males), with no clear preferential positioning being distinguishable for many positions (e.g., the initial interval between the two most pronounced majority positions of the healthy samples). This indicates the presence of multiple nucleosome chains for the cancer samples. Samples were GC bias corrected before the computation of the c31 DoC signal.

(C) Comparison between the low-resolution c31 DoC nucleosome positioning signal of three healthy male samples younger than 30 years (top), two healthy male samples older than 55 years (middle), and three prostate cancer samples with high ctDNA fraction between 45% and 65% (bottom). A pericentromeric region on the short arm of chromosome 12, which should harbor well-positioned nucleosomes, is shown. The young, healthy males show majority peaks with approximately equal height (peaks supported approximately equally by cfDNA fragments). While the prostate cancer samples do show a similar consistent signal height and also show high variability of positioning as in (B), the older healthy male positioning signals tend to show little support for any nucleosome peaks for many positions and strong outliers for others (second old healthy male sample: pronounced peak marked with black arrow). Dilated majority peaks can be interpreted as increased entropy in the chromatin regulatory machinery, leading to more lenient chromatin positioning regulation. Samples were GC bias corrected before the computation of the c31 DoC signal.

Examples 2-5 refer to Figures 12, 22, 23, and 24. Example 11 also refers to Figure 25. Figure 12 is described in the following:

Panels A and B each show two positioning signals computed for the TSS or the TTS of a gene. Genes are shown in 5’ to 3’ direction left to right, according to the strand on which the gene is located. This causes the left end of the x-axis to equal “upstream of the TSS/TTS” and the right end to equal “downstream of the TSS/TTS”.

Two nucleosome position tracks are displayed in each panel: the novel posterior nucleosome dyad signal (gray) and the L-WPS signal (light gray broken line). The L-WPS signal was computed according to the algorithm published by Snyder et al. in 2016 and is only defined for a fragment length range of 120-180 bp.

The dark gray line tracks fragment coverage for each position of the genome and can be seen to be insufficient for the bp-accurate determination of the DNA’s resistance to cleaving by DNases based on its local maxima.

(A) Nucleosome positioning signal plots for the intermediate and long fragment length ranges for genes PLA2G2E and ZNF648, which are expected to be unexpressed or only lowly expressed in blood plasma samples (see gene expression tables in Figures 24C and 24E, second row “whole blood” for confirmation). The medium gray lines show nucleosome positioning patterns that are compatible with the TSS/TTS of an unexpressed gene. The positioning signals are shown separately for the intermediate fraction (fragments between 120bp and 180bp) and the long fraction (fragments longer than 180bp). Although the intermediate fraction is highly over-represented in the cfDNA pool, these two fragment groups show high concordance of posterior dyad signal maxima positioning for both loci. High concordance can be seen in particular for positions of well-supported main peaks of the intermediate fraction, like the nucleosomes at the 0, +1, and +2 positions of the PLA2G2E TSS. This is reasonable because the fragments of the long fraction, which are mainly dinucleosomal fragments, are thought to originate from cells that exhibit a closed chromatin state. The increased protection arising from the inaccessibility of such a chromatin state leads to an increase of “missed” internucleosomal cleaving during apoptosis because the linker regions between nucleosomes are less exposed. TSSs of inactive genes are examples of such closed chromatin states. More examples of TSSs of genes expected to be not or only occasionally expressed in hematopoietic cells are provided in Figures 24A, B, and D. The patterns of main peak positioning between the TSS of PLA2G2E and the TTS of ZNF648 show high similarity, with ZNF648 exhibiting more positioning variability around the main peaks of the intermediate fraction.
The nucleosome repeat distance (NRD) for both loci is relatively consistent within each locus and falls inside the expected interval of possible NRD values that occur in a 30 nm chromatin fibril of densely packed nucleosomes (i.e. closed chromatin). The L-WPS signal is only defined for fragments in the intermediate fraction and generally shows a lower resolution in its ability to discern close neighboring peaks compared to the posterior signal. Repeating positioning patterns occur more often in the posterior signal than in the L-WPS signal (e.g. intermediate fraction of ZNF648). Further examples of TTSs of inactive genes are provided in Figures 24F and G.

(B) Nucleosome positioning signal plots for different fragment length ranges, TSS of LPGAT1 and TTS of PFKFB2. The dark gray depth of coverage signals show large NDRs upstream of the TSS of LPGAT1 in the proximal promoter region and downstream of the TTS of PFKFB2. The gray posterior nucleosome positioning signals of the intermediate fragment length fraction show well-positioned -1 and +1 nucleosomes around the NDR for the TSS and well-positioned -1 and 0 nucleosomes for the TTS (the low support peak in the TTS NDR is omitted during major peak assessment). The positioning signals are shown separately for the intermediate fraction (fragments between 120bp and 180bp) and the long fraction (fragments longer than 180bp). These two fragment fractions show good concordance for the TTS posterior signal and lesser concordance for the TSS positioning signal. The lower concordance between the intermediate and long fraction in case of the presence of an NDR in the intermediate fraction is reasonable because the long fraction of cfDNA fragments is expected to originate from cells with a closed chromatin state and, thus, their local maxima mark nucleosome dyad positions for a small set of cfDNA shedding cells that exhibit a closed chromatin state in the LPGAT1 promoter region. Therefore, the discordant peak of the long fraction in the NDR upstream of the TSS of LPGAT1 marks the position of the NDR-blocking nucleosome for a closed chromatin conformation. Most cell types show active transcription of the LPGAT1 and the PFKFB2 genes. Further examples of TSSs of genes actively transcribed across a wide range of normal human tissues are shown in Figures 23A-C and 23E-H. In the case of the TTS of PFKFB2, the narrow high peak inside the NDR at position +200 is disregarded (for main peak assessment) because it does not reach the required lower limit for fragment support since only one fragment supports the peak.
The L-WPS signal is only defined for fragments in the intermediate fraction. The L-WPS signal loses signal amplitude when approaching an NDR region (e.g. TSS of LPGAT1 and L-WPS signal around the masked nucleosome dyad posterior peak in the NDR of PFKFB2) because of its dependence on the depth of coverage (compare to L-WPS positioning signal amplitude loss in the region upstream of the ATP1A1 TSS in Figure 22B). It also increases in a non-intuitive way (i.e. the region of lowest resistance to cleaving shows high values of windowed protection score), which is indicated for the regions marked with “L-WPS zero-drift”. In contrast, NDRs can be detected more easily using the distance between the main peaks of the posterior nucleosome dyad signal after applying the minimum fragment support criterion to the set of main peaks in the region of interest. It is noted that although posterior peaks might exist in NDR regions, the peaks are still caused by cfDNA fragment evidence and are not artifacts, as is the case for L-WPS. Regions of interest can either be predefined or selected from candidate regions that show a substantial decrease in coverage. A combination of depth of coverage and distance between flanking nucleosomes forms the accessibility metric (Figure 10, top panel).
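
As a non-limiting sketch, the NDR detection described above (minimum fragment support criterion, followed by the distance between neighboring main peaks) could look as follows. The function name, the representation of peaks as (position, supporting fragment count) pairs, and the thresholds `min_support` and `min_gap_bp` are illustrative assumptions and are not values taken from the description.

```python
def find_ndr_candidates(peaks, min_support=4, min_gap_bp=300):
    """Return (left, right) main-peak pairs that flank a candidate NDR.

    peaks: iterable of (position_bp, supporting_fragments) pairs
           (hypothetical input format).
    min_support, min_gap_bp: assumed thresholds, chosen for illustration only.
    """
    # Keep only peaks that satisfy the minimum fragment support criterion.
    main_peaks = sorted(pos for pos, support in peaks if support >= min_support)
    ndrs = []
    # A wide gap between two neighboring main peaks marks a candidate NDR.
    for left, right in zip(main_peaks, main_peaks[1:]):
        if right - left >= min_gap_bp:
            ndrs.append((left, right))
    return ndrs


# Invented toy data: the peak at +50 bp has only one supporting fragment.
example_peaks = [(-400, 10), (-200, 12), (50, 1), (250, 9), (440, 11)]
print(find_ndr_candidates(example_peaks))
```

In this toy input the weakly supported peak at +50 bp is masked, so the gap between the flanking main peaks at -200 bp and +250 bp is reported as the candidate NDR.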

(C) Count distributions of hypothetical nucleosome dyads for different fragment lengths for each base of a cfDNA fragment (within light gray vertical lines) and outside of cfDNA fragments (negative base positions and positive base positions after right-most light gray vertical line).

(Top left panel) Dyad count distribution for cfDNA fragments with a length of 110bp. The amplitude of the highest count vs. lowest count in this empiric distribution is small compared to dyad count distributions of longer fragments. This signifies the lower resistance to DNase cleaving for shorter nucleosomal cfDNA fragments. Two modes are visible, each one likely to stem from fragments from either side of the nucleosomal dyad, hence, the symmetry.

(Top right and bottom left panels) Count plots for cfDNA fragments with lengths 149bp and 167bp. The distribution for 149bp DNA fragments (i.e. mononucleosomal fragment without linker) was found to exhibit the narrowest and highest peak relative to the lowest count in the distribution. This agrees with the notion that mononucleosomal fragments without linkers are most difficult to bind for DNA degrading enzymes. The count distribution for mononucleosomal fragments with linker DNA stretches is illustrated in the bottom left panel. The count distribution shows three peak regions within and close to the center. This reveals that a central dyad position, an upstream-shifted position, and a downstream-shifted position (shift of approximately 10 bp) are almost equally likely to be observed, with the central positioning being a little less likely. This is in good agreement with two proposed models of asymmetric or symmetric H1 histone binding. H1 is a histone which can loosely bind linker stretches of mononucleosomal DNA. Neighboring nucleosomes are visible as increased counts upstream and downstream of the cfDNA fragment ends (darker gray). The darkest gray counts mark the level of random dyad positioning signals. The proportion of random positioning is determined by the number of counted fragments that did not agree with the predominant nucleosome positioning at all counting loci of hypothetical nucleosome positions. These fragments either had their actual nucleosomal dyad in a randomly close vicinity or, based on the number of these observations, overlapped with the hypothetical nucleosome dyad position only by chance.

(Bottom right panel) Dyad count distribution for cfDNA fragments with a length of 330bp. These fragments correspond to dinucleosomal fragments, which are expected to be wrapped around two histone core complexes. Hence, the presence of two local dyad positioning maxima on such fragments is expected and was indeed observed (light gray peaks between light gray vertical lines).

(D) Nucleosome-conferred DNA cleaving resistance: The x-axis displays the values for observed fragment lengths, i.e., from 50-350bp. The y-axis shows the “conferred cleavage resistance”, i.e., a metric related to a specific transformation of the dyad count distributions, which is displayed in Figure 12E. First, the limited dyad count distribution, including the random fraction, is normalized to an area under the curve of 1. The end frequency distribution is expressed relative to an expected nucleosome dyad position on a cfDNA fragment of a specific length which is equal to the position of the maximum of the nucleosome prior distribution. The normalized end frequency distribution is the fragment’s mirrored normalized dyad count distribution, with the mirrored distribution shifted towards the fragment end of interest such that the mirrored fragment end overlaps with the expected dyad position on the fragment (which is equal to the maximum of the mirrored distribution overlapping with the fragment end of interest). The cleaving resistance is computed from the ratio of the maximum observed end-frequency (either 5’, darkest gray; or 3’, medium gray) and the end’s frequency at the dyad. For example, an unweighted cleaving resistance of 2 means that the maximum end frequency for a given fragment length is twice the end frequency at the dyad. As this ratio shows how a fragment's maximum observed end-frequency difference relates to the absolute value at the supposedly most cleaving-protected point, the plot can also be interpreted as the confidence in the determined fragment end locations relative to the dyad. This equals our confidence in deciding on the location of the dyad relative to a fragment of a certain size. For DNA fragments with a length of 149bp, for example, we are most confident in locating the relative dyad position. It is centered as expected.
DNA fragments of 167bp show already a higher uncertainty of dyad placement, which is in accordance with the 10bp flanking unprotected linker segments being randomly cleaved (i.e., the cleaving resistance is lower). Weighted versions take into account how the distance between the end count maximum and the dyad relates to the total fragment length.

(E) Fragment end frequency transformation: the expected dyad position on a fragment is determined by the position of the maximum of the empiric dyad count distribution. The count distribution is normalized to an area under the curve of 1 after fragment length dependent truncation. The resulting distribution is mirrored and placed such that the maximum overlaps with the fragment end of interest. If the maximum of the count distribution is well pronounced, like in the case of 149 bp fragments, the resulting fragment end distribution is representative of where to expect fragments to end relative to the nucleosome dyad for fragments of a specific length.
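
The normalization and the unweighted ratio described in (D) and (E) can be sketched as follows. This is a simplified illustration only: the function names and the list-based representation of the distributions are assumptions, the mirroring and shifting step is not reproduced, and the weighted variants are omitted.

```python
def normalize_to_unit_area(counts):
    """Scale a dyad count distribution so that its values sum to 1."""
    total = float(sum(counts))
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def expected_dyad_index(counts):
    """Index of the maximum of the empiric dyad count distribution."""
    return max(range(len(counts)), key=lambda i: counts[i])


def unweighted_cleaving_resistance(end_freq, dyad_index):
    """Ratio of the maximum end frequency to the end frequency at the dyad."""
    at_dyad = end_freq[dyad_index]
    if at_dyad <= 0:
        raise ValueError("end frequency at the dyad must be positive")
    return max(end_freq) / at_dyad


# Invented toy numbers, for illustration only.
dyad_counts = [1, 2, 5, 2, 0]
normalized = normalize_to_unit_area(dyad_counts)
dyad = expected_dyad_index(dyad_counts)
end_frequencies = [0.4, 0.1, 0.2, 0.1, 0.2]
resistance = unweighted_cleaving_resistance(end_frequencies, dyad)
```

With these toy values the maximum end frequency (0.4) is twice the end frequency at the dyad (0.2), which corresponds to the unweighted cleaving resistance of 2 used as an example in (D).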

All panels of Figures 23 and 24 show a nucleosome positioning signal on top, as described in Figure 12, either at the TSS or the TTS of an actively transcribed (Figure 23) or an inactive (Figure 24) gene. The data obtained by the invention are in line with known expression patterns.

Single gene nucleosome positioning track analysis of gene regulatory state

The approach described herein makes it possible to reconstruct nucleosome positions for individual genes based on accumulated nucleosome priors. Various patterns with high biological significance are observed:

First, genes with a nucleosome positioned at the TSS and arrays of nucleosomes upstream and downstream with similar phasing: PLA2G2E and ZNF648 are examples of genes with such a per-gene positioning track in cfDNA from healthy individuals (Figure 12A). As the nucleosomes at positions -1 and 0 block the bulky transcription machinery’s binding, this nucleosome pattern is in concordance with the expectation for an unexpressed gene. Hence, the interpretation “unexpressed” is based on a high likelihood of unexpressed genes having nucleosomes at typical NDR locations (e.g. at positions -1 and 0), flanked by a relatively regular nucleosome phasing with roughly 180-200bp inter-nucleosome distances (Figure 12A). If long fragments are observed at the locus, this can be confirmed with the posterior signal from the long fraction of cfDNA fragments. The nucleosome positioning indicated by long fragments shows where nucleosomes are expected for a closed chromatin state of the genomic region.

Second, we observe per-gene nucleosome positioning tracks with a nucleosome-depleted region (NDR) at the TSS or TTS with highly phased nucleosomes downstream of the TSS (or upstream of the TTS) and less phased nucleosomes upstream. Examples of such patterns are LPGAT1 or PFKFB2 (the pattern is reversed for the TTS since upstream corresponds to the gene body and downstream corresponds to transcription factor binding sites in contrast to the TSS example), which demonstrate in cfDNA from a healthy individual a wide NDR and high peak for the +1 nucleosome (not as pronounced for TTS), phasing of downstream nucleosomes (upstream for TTS), and well-positioned -1 nucleosomes for TSS and +1 for TTS (Figure 12B). This pattern corresponds to RNA polymerase II proximal promoter regions at the TSS, containing a nucleosome-free region of ~200bp around the TSS, mostly upstream, flanked by well-positioned nucleosomes at both sides. This correlates well with the expected expression of a gene, as the differences of nucleosome occupancy patterns between Figures 23 and 24 clearly show. Additionally, regulatory information can be obtained by the bp-accurate measurement of the +1 and the -1 nucleosome positions. A gene’s promoter region could be in a poised chromatin state, i.e., bearing simultaneously both activation-associated and repression-associated histone modifications, which enable the gene to switch rapidly between an active and repressive state. The difference between poised state and active transcription can be determined by a slight shift of the +1 nucleosome and a change in the fragmentation pattern beneath it due to nucleosome unwrapping and rewrapping (the latter happening upstream of the NDR) in the case of active transcription which does not occur in the poised state. At the TTS, the NDR pattern corresponds to bound transcription factors of the transcription termination machinery.

Example 3: Multi-signature decomposition

Multiple positioning signatures can be extracted from a nucleosome dyad posterior distribution by chaining posterior nucleosome dyad calls, which is grouping neighboring posterior peak calls according to known rules of chromatin organization along the genome. Example 3 mainly refers to results depicted in panels A-F of Figure 22, which are described in the following:

(A) Nucleosome occupancy plot based on 120-180 bp fragments (“intermediate fraction”) for the TSS of the RIT1 gene. The posterior nucleosome dyad signal (gray line) shows two distinct nucleosome dyad positioning series, supposedly coming from two separate nucleosome positioning patterns. Capital “A” and “B” characters at peak positions mark the dyad positions of these two manually chained posterior dyad peak series. The L-WPS signal is shown to illustrate the high accuracy of the posterior signal, even for a lower depth of coverage. The L-WPS signal shows a strong upward drift and signal amplitude loss with reduced depth of coverage. All signals were normalized to a range between 0 and 1 based on the minimum and maximum of the corresponding signal in a 6kbp window surrounding the TSS of the RIT1 gene. This is an example where two nucleosome series can be distinguished (indicated by dark gray “A” and light gray “B” characters marking posterior dyad peak positions of the corresponding nucleosome dyad chain). Both nucleosome peak chains have an almost constant nucleosome repeat length up to the TSS. The regularity of the internucleosome distance of both chains is disrupted after the TSS. The low depth of coverage around the TSS region (positions -200 bp up to around +180 bp) suggests that the gene might be active, which is in line with known gene expression levels of normal human tissues. The low support of peaks around the TSS causes the central three A chain peaks and the central two B chain peaks to be disregarded during the main peak assessment. Therefore, these peaks are not part of a major chain and do not represent the main regulatory state of the RIT1 gene.

(B) Nucleosome occupancy plot based on 120-180 bp fragments (“intermediate fraction”) for the TSS of the ATP1A1 gene shown for comparison of L-WPS and nucleosome dyad posterior probability with signals shown in (A). The posterior signal is visible as a series of repeating peaks with the highest signal amplitude in the region with almost no reduction in signal amplitude compared to L-WPS. The peak situation in the upstream region of the TSS of ATP1A1 is very similar to the one described in (A) for the RIT1 gene, except there is no second chain upstream to the TSS. Peaks between -200 bp and +550 bp are excluded from the main peak assessment because of the low support of only 3 fragments. Therefore, these peaks likely stem from a few fragments from a closed chromatin state. The signal amplitude loss of L-WPS is even more extreme than in (A).

(C) Nucleosome occupancy for the TTS of the NOTO gene. This is an example of chaining posterior peaks based on nucleosome repeat length. Here, three chains can be created from local peaks. The A chain was started from the high posterior peak at the TTS, grouping peaks with internucleosomal distances very close to the distance to the highest upstream peak. Chaining can be done at least up to the -2 nucleosome (slightly downstream of position -400 bp). The second chain (“B chain”) was started from the highest peak between the -2 and the -1 nucleosome of the A chain. Both the A and B chains are subject to a substantial decrease in the height of chained peaks after the TTS peak. The nucleosome repeat length of the B chain is smaller than that of the A chain. A third chain (“C chain”) can be started from the central peak at the TTS and continued downstream for at least three additional posterior peaks. The C chain exhibits a nucleosome repeat length similar to the B chain. The prominent height of the posterior peak, which is chosen for seeding the chaining process for the A and C chain here, is in concordance with the regulatory importance of the location where the posterior peak is located, namely the TTS of the NOTO gene.
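
One possible way to express repeat-length-based chaining is the greedy sketch below. It is an illustration under stated assumptions and not the procedure used to produce Figure 22C: the function name, the tolerance of 15 bp, and the rule of taking the peak closest to the expected position are all hypothetical choices.

```python
def chain_peaks(peaks, seed, repeat_length, tolerance=15, direction=-1):
    """Greedily chain posterior peak positions starting from a seed peak.

    peaks: peak positions in bp relative to the TSS/TTS.
    repeat_length: assumed nucleosome repeat length of the chain in bp.
    tolerance: assumed maximum deviation from the expected position in bp.
    direction: -1 chains upstream, +1 chains downstream.
    """
    chain = [seed]
    current = seed
    while True:
        # Expected position of the next nucleosome dyad in the chain.
        target = current + direction * repeat_length
        candidates = [p for p in peaks
                      if p not in chain and abs(p - target) <= tolerance]
        if not candidates:
            return chain
        # Take the peak closest to the expected position.
        current = min(candidates, key=lambda p: abs(p - target))
        chain.append(current)


# Invented peak positions with two interleaved series upstream of the seed.
toy_peaks = [0, -190, -285, -383, -480, -575]
a_chain = chain_peaks(toy_peaks, seed=0, repeat_length=190)
```

Peaks that remain unchained after the first pass (here -285 and -480) could then seed a second chain with its own repeat length, analogous to the B chain described above.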

(D) Nucleosome occupancy for the TTS of the OR4F5 gene. This example shows a chaining algorithm that uses the posterior probability signal created from the long fraction of cfDNA fragments. Since the long fraction is thought to originate from one or more closed chromatin states, the local maxima of this signal can be used to extract nucleosome chains from the intermediate fraction posterior probability signal. Intermediate fraction peaks close to or overlapping with peaks of the long fraction can be chained together to form a chain of nucleosomes that form less accessible chromatin. In this example, the A chain contains posterior peaks compatible with closed chromatin, which is in concordance with known OR4F5 gene expression for normal human tissues. Other peaks between A chain peaks are grouped to form the B chain. Here, the B chain also contains peaks that would be incompatible with each other but are interpreted as options for nucleosome positioning within the B chain (e.g., B1 and B4 positions). The A chain and the B chain nicely separate. The consistent nucleosome repeat length for both chains suggests that both chains come from the same chromatin state. Because of the derivation of the A chain from the long fraction, this means that both the A and B chains originate from closed chromatin. Moreover, the high nucleosome repeat length of about 200 bp indicates that the region might be in solenoid chromatin structure, which is a model for the less-accessible 30 nm DNA fiber compared to the most accessible 11 nm chromatin conformation termed “beads on a string”. Three examples of other 200 bp nucleosome repeat length TSS loci of lowly or not expressed genes are depicted in Figures 24A, B, and D. Therefore, the result of the chaining analysis would be that the OR4F5 gene was not actively transcribed in cfDNA shedding cells of the person the sample was taken from. 
The algorithm cannot be carried out with other positioning signals like the L-WPS signal alone because it is not defined for the long fraction of cfDNA fragments, which is required to find a representation of closed chromatin. Low-resolution DoC-based positioning signals also cannot be used to reliably distinguish dyad peaks of different chains from noise consistently across all positions, as is evident for the L-WPS signal in this example (see region B1 to B2 and region A4 to B5).
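
The long-fraction-guided separation described in (D) can be sketched as follows. The function name and the maximum offset of 20 bp used to decide that an intermediate fraction peak is "close to or overlapping with" a long fraction peak are assumptions made for illustration.

```python
def split_by_long_fraction(intermediate_peaks, long_peaks, max_offset=20):
    """Split intermediate fraction peaks into an A chain and a B chain.

    A chain: peaks within max_offset bp of any long fraction peak
             (taken as compatible with closed chromatin).
    B chain: all remaining intermediate fraction peaks.
    max_offset is an assumed threshold, not a value from the description.
    """
    a_chain, b_chain = [], []
    for peak in sorted(intermediate_peaks):
        if any(abs(peak - long_peak) <= max_offset for long_peak in long_peaks):
            a_chain.append(peak)
        else:
            b_chain.append(peak)
    return a_chain, b_chain


# Invented positions in bp relative to the TTS.
intermediate = [-600, -500, -400, -300, -200, -100, 0]
long_fraction = [-605, -395, -190, 10]
a_chain, b_chain = split_by_long_fraction(intermediate, long_fraction)
```

The repeat lengths within each resulting chain can then be compared, as is done for the OR4F5 example, to judge whether both chains are compatible with the same chromatin state.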

(E) Nucleosome occupancy for the TSS of the SLAMF7 gene. Another example of using the long fraction posterior probability signal to achieve extraction of two chains of peaks from the intermediate fraction posterior signal. In contrast to the OR4F5 example described in (D), the internucleosomal distances in both the A and the B chains vary greatly, with a notable increase around the TSS. This hints at an actively transcribed gene, which is in concordance with what is known for whole blood and many other tissues. The TSS is less blocked for the B chain series of peaks, which does not overlap with the peaks from the long fraction derived posterior probability peaks, except for the last peak to the very downstream end of the figure at position +500 bp, where both chains overlap in the peak marked with “AB”. Also, the variability of the alternative positioning peaks within the B chain is much higher upstream to the TSS than downstream, indicating active transcription because of extensive nucleosome repositioning in the promoter region and not a poised state in which nucleosomes would remain in place even though they have been displaced.

(F) Nucleosome occupancy for the TTS of the FIGLA gene. This is an example of chaining posterior probability peaks fusing a high prominence seeding peak (marked “AB”). In this case, the internucleosomal distances of the B chain are less constant than for the A chain. The A chain peak positions exhibit greater variability in the region upstream of the seeding peak for the -3 to -1 nucleosomes, though being the more pronounced chain in this area. The peak prominence changes immediately downstream to the TTS, where the B chain greatly increases. The A chain immensely loses prominence after the seeding peak. No precise peak positioning is evident in the 300 bp region around the TTS for the L-WPS signal. The high internucleosomal distance of about 200 bp of the A chain suggests a less accessible chromatin state than the typical “beads on a string” conformation. Moreover, the location of the seeding peak and the upstream B chain peak suggests the absence of the transcription termination machinery (lack of an NDR), which is in concordance with known expression levels of the gene in normal human tissue.

Example 4: Improved pathway analyses and tumor subclassification

The high-resolution single gene nucleosome positioning track analysis of gene regulatory state described above critically facilitates the identification of altered pathways and tumor subclassification. Current views of cancer biology suggest that cancer driver genes can be classified into 12 signaling pathways that regulate three core cellular processes, i.e., cell fate, cell survival, and genome maintenance. Therefore, a common and limited set of driver genes and pathways is responsible for most common forms of cancer. We can classify the activity status of these genes and pathways with far-reaching consequences, for example, for early cancer detection and therapy decisions.

Example 5: A nucleosome prior fragmentation index: fragment length and dyad position

For each cfDNA fragment with a certain length, we can now provide statistics describing the likely nucleosome dyad location relative to the fragment and whether information about the location of a hypothetically associated nucleosome dyad can be derived from them at all.

A clinical example would be the cfDNA analysis of a healthy individual, where the majority of cfDNA fragments has a clearly symmetrical pattern, i.e., the dyad can be positioned in the center of the cfDNA fragments. A similar symmetrical pattern is expected from cfDNA fragments of pregnant women, although the fetal-derived cfDNA increases the number of shorter cfDNA fragments. However, as pregnancy is a physiological process, DNA digestion in the apoptotic cells is mostly symmetrical. In contrast, in a pathological process, such as cancer, canonical digestion is disturbed in many cells resulting in an increase of cfDNA fragments with an asymmetric dyad position (Figure 20).

A way of representing dyad statistics is illustrated for different fragment lengths in Figure 12D: raw cleaving resistance maxima (= dyad) counts relative to fragment bases that start at position 0 and end at the second light grey vertical line. These dyad count distributions are used in computing the nucleosome dyad prior distributions. Furthermore, we can calculate a nucleosome-conferred DNA cleaving resistance, which was most substantial for 149bp cfDNA fragments (Figure 12D).

The degree of well-positioned ends relative to the dyad enables us to establish detailed symmetry statistics. For example, several studies found that plasma DNA samples from patients with cancer are enriched for smaller cfDNA fragments (<150bp) and that the size distribution of small cfDNA fragments (100-150bp) to larger cfDNA fragments (151-220bp) can distinguish between healthy cfDNA patterns and those from patients with cancer (Mouliere et al., 2018). Here, we can add the nucleosomal dyad position for each fragment in addition to the fragment size fractions. As mentioned above, multiple processes such as nucleosome breathing, nucleosome sliding, and specific physiological and pathological states, affect the accessibility of nucleosomal DNA. Hence, the dyad position within cfDNA fragments is highly informative about an individual's health status. For example, cfDNA fragments of cancer patients will have an increased variability of dyad positions per cfDNA fragment.
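
For orientation, the size-fraction comparison cited above (100-150bp versus 151-220bp fragments) can be computed as in the following sketch. The function name and the input format, a plain list of fragment lengths, are assumptions; the dyad-position statistics proposed here would be added on top of such a ratio.

```python
def short_to_long_ratio(fragment_lengths, short=(100, 150), long=(151, 220)):
    """Ratio of short to long cfDNA fragments for the given length ranges (bp)."""
    n_short = sum(1 for n in fragment_lengths if short[0] <= n <= short[1])
    n_long = sum(1 for n in fragment_lengths if long[0] <= n <= long[1])
    if n_long == 0:
        raise ValueError("no fragments in the long range")
    return n_short / n_long


# Invented fragment lengths; the 330 bp fragment falls outside both ranges.
toy_lengths = [120, 145, 150, 151, 167, 167, 200, 330]
ratio = short_to_long_ratio(toy_lengths)
```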

In general, cfDNA fragments with the dyad at the center may indicate nucleosome stability, whereas cfDNA fragments with dyads off the center may reflect nucleosome instability. Since ctDNA has different fragmentation patterns, modeling cfDNA fragment length and dyad positions will facilitate distinguishing plasma samples from healthy donors from patients with cancer.

Other options for a nucleosome prior-based fragmentation index: First, the fraction of 167 bp fragments that have a dyad counted inside a defined central portion of the fragment over all observed 167 bp fragments, which we named “kurtosis of dyad placement” (Figure 14). This can be extended to all mono-nucleosomal fragments. Hence, the more DNA fragments deviate from a center position of the nucleosome dyad (indicating aberration from the canonical apoptosis process), the smaller the index/the value of kurtosis will be. Second, the count of all fragments of fragment lengths that show a clear preference for relative dyad positioning is divided by the number of fragments where no such preference can be established. A minimum threshold for the informative fraction of an empiric dyad count distribution is used here to define the existence of a preference for dyad placement.
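
Both index options can be sketched in a few lines. The names, the half-width of 10 bp that defines the "central portion", and the representation of dyads as 0-based offsets from the fragment start are assumptions; the set of fragment lengths regarded as informative is taken as given here, since the minimum threshold used to define it is not specified in this passage.

```python
def central_dyad_fraction(dyad_offsets, fragment_length=167, central_halfwidth=10):
    """Fraction of fragments whose dyad lies in the central portion.

    dyad_offsets: 0-based dyad position on each fragment of the given length.
    central_halfwidth: assumed half-width of the central portion in bp.
    """
    if not dyad_offsets:
        raise ValueError("no fragments given")
    center = (fragment_length - 1) / 2
    inside = sum(1 for d in dyad_offsets if abs(d - center) <= central_halfwidth)
    return inside / len(dyad_offsets)


def preference_ratio(counts_by_length, informative_lengths):
    """Fragments of lengths with a clear dyad preference over all others."""
    informative = sum(n for length, n in counts_by_length.items()
                      if length in informative_lengths)
    uninformative = sum(counts_by_length.values()) - informative
    if uninformative == 0:
        raise ValueError("no fragments without a dyad placement preference")
    return informative / uninformative


# Invented toy inputs.
index_1 = central_dyad_fraction([83, 80, 95, 60, 120])
index_2 = preference_ratio({149: 100, 167: 300, 110: 50, 90: 50}, {149, 167})
```

In the first toy input only two of five dyads fall within 10 bp of the fragment center, so the index is low, in line with the statement that the value decreases as dyads deviate from the center.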

Example 6: Plasma DNA tissue deconvolution: special handling of mixed signals (tissue deconvolution)

In individuals with a disease, the contribution of DNA from tissues may change if the diseased organ releases its DNA into the bloodstream. Our algorithms capture such changes and determine the tissue of origin (tissue deconvolution). Deciphering a “mixed” nucleosome pattern can reveal whether the DNA was released from cells where the respective nucleosomes had different positions. The number of fragments supporting different local nucleosome peaks could be used to estimate the percentage of different tissues contributing fragments to the coverage of a specific genomic locus. By analyzing multiple tissue-specific loci with extracted patterns, the accuracy of the estimated tissue contribution can be increased. To explain the relevance of tissue deconvolution: Several studies described diverse cellular and tissue origins of cfDNA. The bloodstream serves as a heterogeneous reservoir of cfDNA fragments that vary from individual to individual as well as with age and other underlying physiological conditions. In healthy individuals, the majority of cfDNA consists of DNA released from hematopoietic cells (circulating hematopoietic DNA; chDNA) and, to a much lesser extent, from DNA of solid organs (circulating organ DNA; coDNA). Hence, while individuals with cancer harbor three different cfDNA fractions in their blood (i.e., ctDNA, chDNA, coDNA), diagnostic cfDNA applications in oncology usually focus only on ctDNA, neglecting chDNA and coDNA, although these two fractions typically represent the vast majority of cfDNA. Consequently, current approaches do not capture information about the immune system, which may be harbored within chDNA, or organ damage as a side effect of chemotherapy, which could be deciphered from coDNA. Our nucleosome-based approach allows assessment of all three fractions and thus provides additional, important information. An example is illustrated in Figure 13.
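
A minimal sketch of the per-locus estimate and its averaging over several tissue-specific loci is given below. It assumes that each local nucleosome peak has already been assigned to a tissue; the function names, the dictionary-based input, and the unweighted averaging are illustrative assumptions only.

```python
def locus_contributions(peak_support):
    """Per-tissue share of fragments at one locus.

    peak_support: dict mapping a tissue label to the number of fragments
                  supporting the peak attributed to that tissue (assumed input).
    """
    total = sum(peak_support.values())
    if total == 0:
        raise ValueError("no supporting fragments at this locus")
    return {tissue: n / total for tissue, n in peak_support.items()}


def average_contributions(loci):
    """Unweighted mean of the per-locus shares across several loci."""
    if not loci:
        raise ValueError("no loci given")
    per_locus = [locus_contributions(support) for support in loci]
    tissues = set().union(*per_locus)
    # A tissue without a peak at a locus contributes a share of 0 there.
    return {t: sum(c.get(t, 0.0) for c in per_locus) / len(per_locus)
            for t in tissues}


# Invented fragment counts for two tissue-specific loci.
toy_loci = [{"hematopoietic": 90, "colon": 10},
            {"hematopoietic": 70, "colon": 30}]
estimate = average_contributions(toy_loci)
```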

Example 7: Applications in patients with cancer

For novel medical applications the basic principle is illustrated in Figure 13 which shows a comparison of the relevant loci for different patient groups (healthy, lung cancer, and colorectal cancer). In individuals with a disease, the accessibility of genomic regions may change. If the diseased organ releases its DNA into the bloodstream, the composition of the cfDNA pool will change accordingly. Our algorithms capture such changes and determine the organs with increased DNA release.

There are likely numerous additional applications that are not listed here.

Applications in patients with cancer can be split into three time periods according to the various stages of a patient journey, i.e., early detection, early disease, and advanced disease.

Early detection of disease, i.e., screening for the presence of cancer in healthy individuals, aims at the detection of diseases in specific organs before the manifestation of clinical symptoms so that therapies can be started as early as possible. In contrast, the early disease period concerns issues such as patient selection and monitoring, evaluation of ctDNA evolution or clearance as a surrogate endpoint, and thus the detection of minimal residual disease (MRD) (Moding et al., 2021). Both periods, i.e., early detection and early disease, have in common that disease-associated changes (modifications) in plasma DNA may be hard to find in the blood. For example, of the cfDNA in the blood of a patient with cancer, the ctDNA might account for just 0.1% or even less. Hence, highly sensitive approaches are needed to attain low detection limits for analyzing minute amounts within cfDNA. Whereas early cancer detection requires a tumor-agnostic approach, a tumor-informed method can be used for MRD detection. With such a tumor genotype-informed MRD detection strategy, tumor fractions with a limit of detection of <0.01% can be discovered by screening for numerous mutations (Moding et al., 2021). Our approach offers the opportunity to screen for thousands of targets, i.e., nucleosome positions and open chromatin regions, and should have great potential for identifying minor traces of alterations from tumor cells. Furthermore, a nucleosome/open chromatin-based strategy does not require any knowledge about alterations in the tumor; hence, it is a tumor-agnostic approach.

In patients with advanced disease, the primary purpose of ctDNA analyses is treatment selection via biomarkers and the monitoring of patients in remission for recurrence/metastasis/resistance. As our improved nucleosome mapping options pave the way for novel gene expression/pathway analyses, they will affect the medical treatment of patients through targeted treatment selection and the identification of resistance mechanisms.

Example 8: Applications in patients with chronic diseases, e.g., inflammatory bowel disease

Many chronic diseases will affect the composition of cfDNA. As an example, we discuss here inflammatory bowel diseases (IBD). The idiopathic IBDs, Crohn’s disease (CD) and ulcerative colitis (UC), are characterized by uncontrolled chronic inflammation of the gastrointestinal tract. Symptoms include frequent bloody bowel movements, abdominal pain, weight loss, and fatigue. Complications include stricture formation, abscesses, fistulas, extra-intestinal manifestations, and colorectal cancer. Current therapy consists of 5-aminosalicylates (5-ASA), corticosteroids, immunosuppressives, and biological treatment options.

The clinical course of UC and CD is unpredictable. It is characterized by times of remission and times of active disease with characteristic symptoms of abdominal pain, diarrhea, and weight loss. Despite many similarities between the two conditions, disease phenotype and progression differ significantly. Thus, while CD can affect the whole gastrointestinal tract causing transmural inflammation, UC is confined to the mucosal and occasionally the submucosal layer of the colon. The majority of the patients, despite improved treatment options, alternate between periods of remission and periods of active disease. We anticipate that increased colon-derived DNA will be present in the circulation during periods of active disease. It will be interesting to see whether nucleosome profiling will identify the active disease with a lead time before symptoms occur so that treatment can start earlier. Furthermore, based on the tissue-specificity of TFs, we should be able to determine the location of active disease. For example, increased accessibility of EVX2 indicates involvement of the colon or rectum, whereas the accessibility of PDX1 increases if the duodenum is affected.

Example 9: Therapy decisions

The examples for patients with cancer and chronic diseases illustrate that advanced cfDNA applications will be instrumental for personalized and targeted treatment decisions for a wide range of conditions.

Example 10: Applications in patients with syndromes, e.g., Coffin-Siris syndrome

The effect of specific syndromes on cfDNA and nucleosome position has not been explored yet. However, nucleosome position changes likely occur in at least a subset of syndromes.

An example is patients with disturbances in the switch/sucrose non-fermenting (SWI/SNF) complex. The SWI/SNF complex is an ATP-dependent chromatin remodeler that regulates the spacing of nucleosomes and thereby controls gene expression. Heterozygous mutations in genes encoding subunits of the SWI/SNF complex have been reported in individuals with Coffin-Siris syndrome (CSS), with most of the mutations in ARID1B. CSS is a rare congenital disorder characterized by facial dysmorphisms, digital anomalies, and variable intellectual disability. Mutations in genes encoding subunits of the ubiquitously expressed SWI/SNF complex may alter the nucleosome profiles in different cell types.

In a first study with cfDNA of CSS-affected individuals with heterozygous ARID1B mutations, we did not observe significant changes in the nucleosome profile around transcription start sites. It should be interesting to repeat these analyses with increased resolution and go beyond TSSs, i.e., include most cis-regulatory regions.

Example 11: Identification of physiological states, e.g., age effect of nucleosome dyad position on cfDNA fragments

So far, cfDNA analyses have been mainly used to characterize disease states. However, provided that cfDNA can be interrogated with increased resolution limits, it should be possible to derive “physiological” information about the health condition of an individual, e.g., whether someone ages well or not (“healthy aging”). Our technologies may be suitable to estimate the age of an individual.

Aging leads to multiple changes, for example, profound alterations in the immune system and increased susceptibility to chronic, infectious, and autoimmune diseases.

Hence, we anticipate that aging is associated with multiple alterations, which will result in different nucleosome patterns. Our strategy will allow the determination of the health condition of individuals and the realization of novel concepts, such as an aging clock based on open chromatin regions (i.e., biological age vs. chronological age).

This has the potential to revolutionize conducting aging studies by tremendously shortening the period until aging-associated changes can be detected.

As an example that nucleosome positions change during aging and in cancer genomes, Figure 25 depicts exemplary comparisons between low-resolution nucleosome positioning signals of a small number of healthy males below the age of 30, healthy males above the age of 55, and prostate cancer patients in pericentromeric regions of chromosome 12 which are known to harbor chains of well-positioned nucleosomes.

Example 12: Companion Diagnostics for Clinical Trials

Before medication, e.g., for treating cancers, can be put on the market, these therapeutics must undergo thorough clinical testing to show feasibility, efficacy, and safety. These clinical trials are usually accompanied by diagnostic tests demonstrating efficacy and/or safety. Blood-based analyses can be used to show the in vivo efficacy of drugs that, e.g., block the translocation of transcription factor proteins that need to be transported back into the nucleus of the human cell after being produced. By analyzing from cfDNA the activity of such a transcription factor gene, e.g., by assessing its TSS or the accessibility of its transcription factor binding sites, one can safely assess the functionality and efficacy of different dosages on a molecular level. This has the potential to increase patient safety in clinical trials because non-functioning treatment can be detected early on. Besides this specific subset of drugs, the efficacy of any other drug that targets a molecular pathway in cancer or any molecular mechanism that involves the expression of genes or interaction with the nuclear DNA of human cells in tissues that shed cfDNA into the circulation can be surveyed and securely and accurately assessed using high-accuracy nucleosome positioning analysis of liquid biopsy samples.

Example 13: General assumptions

Several assumptions were made when using the Bayesian inference-based calculation as described in the method provided herein.

In assumption number 1, multiple fragments, though aligning close to each other, can be viewed as independent observations of nucleosomes, because they are highly unlikely to originate from the same cell. This applies particularly to fragments that overlap after the sequence alignment. Hence, an arbitrary mixture of cfDNA fragments can be used to derive a distribution of their collective former nucleosome dyad locations.

In assumption number 2, a prior probability is used, independent of the genomic locus of interest, by making two assumptions. Thereby, a locus is defined as a continuous region spanning about 10 kbp. First, the distribution of nucleosomes per 10 kbp does not deviate tremendously from a uniform distribution along the mappable human genome with an average dyad-to-dyad distance of 167 bp (e.g., no stretches of one kbp or larger are without nucleosomes, and the minimum dyad-to-dyad distance is much closer to the 167 bp average than to zero). The average distance gives a per-base probability of observing a nucleosome dyad of 1/167. Second, it is assumed that the DNA fragmentation process acts equally on all cfDNA fragments and is independent of their origin locus and, in particular, the nucleosome of origin. It follows that the relative frequencies at which fragment lengths are observed at a locus are similar to those seen across the whole dataset. As a result, the probability of observing a specific local fragmentation pattern is consistent across the entire mappable genome. Thus, the marginal likelihood, which acts only as a scaling factor, can be omitted when computing the local maxima of the posterior probability.
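
The reasoning of assumption number 2 can be written out compactly. The following is only an illustrative restatement; the symbols D_x (a dyad located at base position x) and F (the locally observed fragmentation pattern) are notation introduced here and do not appear in the original description.

```latex
% D_x : event that a nucleosome dyad sits at base position x (assumed notation)
% F   : locally observed fragmentation pattern (assumed notation)
\begin{align}
P(D_x \mid F) &= \frac{P(F \mid D_x)\, P(D_x)}{P(F)},
  \qquad P(D_x) = \frac{1}{167}, \\
\arg\max_x P(D_x \mid F) &= \arg\max_x P(F \mid D_x)\, P(D_x)
  = \arg\max_x P(F \mid D_x).
\end{align}
```

Because P(F) is taken to be the same across the mappable genome and P(D_x) is locus-independent, neither term moves the positions of the local maxima; they only rescale the posterior.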

In assumption number 3, besides the assumptions concerning locus independence and independence of observations above, it is suggested that excluding local fragmentation evidence is unnecessary when looking at the problem of computing prior distributions from a practical point of view. Since the number of fragments across the non-local genome is much higher than the local number of fragments (estimated factor between 10^5 and 10^6), the computation is not iterated over all possible local windows with the local fragments excluded each time the prior knowledge is generated. It is concluded that the influence of local fragments on the resulting prior knowledge can be neglected. The prior distributions computed as described herein have seen all local fragmentation patterns in each local window, but since the information from the grand total of all local windows is massively larger, any possible influence of a single local fragmentation evidence on the computed prior distributions is negligible. The large sample size allows the use of the empirically derived distributions as a good representation of the underlying random process - the empiric distributions become highly representative. For an assumed one-fold genome-wide coverage, a nucleosome every 167bp on the portion of the GRCh38 reference genome that is not marked by the ENCODE exclusion list yields more than 15 million hypothetical nucleosomes in total for counting. This number must be divided among all fragment lengths according to their relative frequency of occurrence in the whole dataset to get an estimate of how many counts are expected to be obtained for every fragment length. In practice, a 30x depth-of-coverage WGS dataset (i.e., on average, every base of the genome is covered by 30 sequenced DNA fragments) yields multiple million data points for each fragment length if rare fragment lengths are excluded by carrying out an in silico size selection.
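
The counting estimate above can be illustrated with a small back-of-the-envelope sketch. The mappable genome size, the fragment length frequencies, and the simple proportional scaling with coverage are assumptions chosen for illustration only; they are not values from the described dataset.

```python
# Sketch: expected dyad counts per fragment length.
# All numeric inputs are illustrative assumptions, not measured values.

NUCLEOSOME_SPACING_BP = 167  # average dyad-to-dyad distance stated in the text


def hypothetical_nucleosomes(mappable_bases, spacing=NUCLEOSOME_SPACING_BP):
    """Number of nucleosomes if one dyad sits every `spacing` bases."""
    return mappable_bases // spacing


def expected_counts(total_nucleosomes, length_frequencies, coverage=1.0):
    """Split the nucleosome count among fragment lengths by relative frequency.

    Assumes (hypothetically) that counts scale linearly with coverage.
    """
    total = sum(length_frequencies.values())
    return {
        length: coverage * total_nucleosomes * freq / total
        for length, freq in length_frequencies.items()
    }


# Assumed size of the mappable genome after exclusion-list filtering.
mappable = 2_700_000_000
n_nuc = hypothetical_nucleosomes(mappable)

# Toy relative frequencies for three fragment lengths (hypothetical).
freqs = {150: 0.2, 167: 0.5, 180: 0.3}
counts_30x = expected_counts(n_nuc, freqs, coverage=30.0)
```

With these assumed inputs, even the least frequent of the three toy fragment lengths receives counts in the tens of millions at 30x, which is consistent with the order of magnitude argued for above.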

Two signals are generated and evaluated in the method described herein. The first is based on sequencing coverage, i.e., the number of sequencing reads aligned to a specific locus in a reference genome (Figure 5, left side: assumption 1). The second signal is the “posterior nucleosome signal”, and its generation is based on Bayesian inference as outlined in the following. Nucleosomes offer protection from enzymatic digestion during apoptosis, with the nucleosomal dyad being the region where nucleosomal DNA is most tightly bound and thus most resistant to being cleaved by DNases. Therefore, it is possible to translate the sequencing depth into approximate nucleosome position maps (Figure 5). In other words, finding local peaks of depth-of-coverage yields positions that are highly likely to overlap with nucleosome dyad locations. Hence, in the first step of prior knowledge computation, nucleosomal dyad positions are inferred from sequencing read depth. These dyad positions are subsequently mapped across the respective cfDNA fragments (Figure 5). Because of the resolution of this approach, information is gained primarily for the most pronounced nucleosome positioning, while little to no information is gained for smaller signals in the same region. Consequently, mainly one series of nucleosomes is observed along the genome (“single nucleosome series assumption”). The nucleosomal dyad may map inside or outside a given fragment, and mapped dyad positions are summarized for all observed fragment lengths. A fragment length-dependent maximum distance between the inferred nucleosome dyad and fragment is used to restrict dyad counting to the nearest nucleosome. This ensures that neighboring nucleosomes are excluded while relevant fragments mapping to regions between nucleosomes are included. This proximity restriction can be lifted to create prior knowledge about positioning neighboring nucleosomes following the same procedure. The following steps are depicted in Figure 6. 
Summary statistics are transformed into probability density functions by truncating the empirical distribution and normalizing the area under the discrete count distribution to one afterward. Two local count minima around the fragment center can be used for truncating the fragment’s dyad counts left and right (“short-range truncation”). Other truncation strategies can be used, for example, terminating the distribution at fragment ends (“fragment length truncation”). The maximum or the most pronounced maxima in the resulting distribution of cleaving resistances over fragments of a specific length is expected to be the preferred location(s) of the nucleosomal dyad on fragments of the specific length. Hence, we call this distribution “nucleosome prior distribution”. Other factors might increase cleaving resistance but primarily the strong interaction between DNA and histone complex confers cleaving resistance to cfDNA. The cleaving resistance of each fragment end (5’ and 3’) can be computed by combining the end distance to the closest preferred dyad position on fragment times signal strength of the cleaving resistance at this position (Figure 12). The base positions where prior distributions are truncated according to the strategies mentioned above are depicted in the overview heatmap of dyad prior distributions in Figure 7. White lines indicate where local minima next to fragment midpoints are expected. Orange dashed lines indicate the location of fragment ends. A third option would be to extend all priors equally to a total length of, e.g., 170 bp (i.e., 85 bp left and right of the fragment center; “uniform truncation”) so that for each position on the genome, only fragments with their center mapping inside a window of certain length around it can affect the posterior signal at that very same position. 
Although this might help avoid coverage-dependent distortions in regions with tremendous depth-of-coverage variations or untypical fragmentation, it has not been tested yet. The base positions of truncation in this exemplary case of uniform 170 bp truncation are indicated in Figure 7 by green dotted lines. The increase of “grainy” noise in Figure 7 for fragments shorter than 120bp and longer than 180bp is due to excluding these fragments from the coverage-based inference of the nucleosome dyad locations. The noise level for each fragment length follows approximately the reciprocal of the frequency at which that fragment length is observed in the dataset (compare fragment lengths with darker noisy bands from Figure 7 to their relative abundance based on Figure 4; high fragment count leads to low noise level). After truncation, the constant non-informative part of the count distribution is removed by adjusting the zero level. Reducing the area under the curve to below one by eliminating the non-informative part allows us to model the expected loss of information for every fragment length into its prior distribution. This data loss originates in part from random breaking and degradation of DNA fragments due to pre-analytical factors like sample draw, wet lab procedure, and storage conditions, but also because the “single nucleosome series” assumption is violated for a certain part of fragments that indeed map to an inferred nucleosome site but do not share the exact predominant nucleosome positioning at the site (the positioning that likely caused depth of coverage to peak there). The naturally occurring fragmentation that can be observed in liquid biopsy samples usually represents multiple chromatin states and, thus, multiple different nucleosome dyad locations. At the same time, only the most common one can be estimated from the depth of coverage signal and be used for counting. 
Therefore, the simplification of assuming only one chromatin state is present during prior knowledge creation results in completely random dyad positions being recorded in the summary statistics. The noise level depends on the fragment length because different fragment lengths occur at different frequencies and thus contribute more or less to the predominant nucleosome positioning state (see Figure 4).
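
The transformation from dyad counts to a prior distribution for one fragment length can be sketched as follows. This is a minimal illustration of the “short-range truncation” followed by area normalization and zero-level adjustment; the function names, the use of the window minimum as the constant non-informative level, and the toy counts are assumptions, and real count profiles would likely need smoothing before the minima are located.

```python
# Sketch: build a "nucleosome prior distribution" for one fragment length.
# Function names and the baseline choice are hypothetical.


def nearest_local_minima(counts, center):
    """Walk downhill from `center` to the closest local minimum on each side."""
    left = center
    while left > 0 and counts[left - 1] < counts[left]:
        left -= 1
    right = center
    while right < len(counts) - 1 and counts[right + 1] < counts[right]:
        right += 1
    return left, right


def prior_from_counts(counts, center):
    """Truncate, normalize the area to one, then remove the constant part."""
    left, right = nearest_local_minima(counts, center)
    window = counts[left:right + 1]          # short-range truncation
    total = sum(window)
    if total == 0:
        raise ValueError("no dyad counts in the truncated window")
    density = [c / total for c in window]    # area under the curve is one
    baseline = min(density)                  # assumed non-informative level
    prior = [d - baseline for d in density]  # area is now below one
    return left, prior


# Toy dyad counts relative to a fragment (hypothetical values).
toy_counts = [5, 3, 2, 4, 8, 12, 8, 4, 2, 3, 5]
start, prior = prior_from_counts(toy_counts, center=5)
```

In this toy case the window spans indices 2 to 8 and the remaining area is 0.65, so 35 % of the probability mass is treated as non-informative for this fragment length.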

After range limiting and area under the curve normalization, we refer to the informative part of the empiric distribution of nucleosome dyad locations relative to a fragment of specific length i as “nucleosome prior distribution” (yi).

Applying prior knowledge

Prior distributions for the relative nucleosomal dyad location are now available for all observed fragment lengths. From these nucleosome prior distributions (yi) and the observed fragmentation along the genome, the average posterior dyad signal is computed, which results in a nucleosome priors-based DNase cleaving resistance signal (y). Because of the division by several summed prior distributions, the signal is ultimately independent of any biases -like GC bias- that can cause increased variability of the local depth of coverage. We named this nucleosome dyad signal “nucleosome posterior distribution” or “posterior nucleosome signal” based on the Bayesian inference procedure (Figure 8).
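
One plausible reading of the averaging step is sketched below: every fragment deposits the prior distribution for its length at its aligned position, and the summed values are divided by the number of priors contributing at each base. The data layout (fragments as start/length pairs, priors stored with an offset relative to the fragment start) and the function name are assumptions made for this illustration and are not taken from the description.

```python
# Sketch: average posterior dyad signal over a small region.
# Data layout and names are hypothetical.


def posterior_signal(fragments, priors, region_length):
    """fragments: list of (start, length) in region coordinates.
    priors: {length: (offset, values)}, where `offset` is the position of the
    first prior value relative to the fragment start.
    """
    summed = [0.0] * region_length
    contributors = [0] * region_length
    for start, length in fragments:
        if length not in priors:
            continue  # lengths without a prior are skipped (size selection)
        offset, values = priors[length]
        for i, value in enumerate(values):
            pos = start + offset + i
            if 0 <= pos < region_length:
                summed[pos] += value
                contributors[pos] += 1
    # Dividing by the number of summed priors gives the average signal.
    return [s / n if n else 0.0 for s, n in zip(summed, contributors)]


# Toy example: two overlapping 4 bp "fragments" and one unsupported length.
toy_priors = {4: (1, [0.3, 0.4])}
toy_fragments = [(0, 4), (1, 4), (2, 9)]
signal = posterior_signal(toy_fragments, toy_priors, region_length=6)
```

Because each position is normalized by the number of contributing priors, a locus covered by many fragments does not automatically receive a larger value than a sparsely covered one, which matches the stated independence from coverage biases.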

Extraction of nucleosome dyad positions

After this step, a map of posterior nucleosomal dyad positions can be computed across the mappable, non-homologous human genome by calling peak positions from the posterior nucleosome dyads y. Based on signal amplitude and peak proximity, peaks can be grouped hierarchically into main and associated side peaks. The depth-of-coverage at the site of a posterior peak can be used as an additional source of information to incorporate the absolute fragment support for a peak in subsequent “chaining” analysis (e.g., peaks called at locations with 30x depth of coverage are well supported in contrast to peaks called from only five or fewer fragments). The chaining analysis creates a series of peaks following rules for naturally occurring nucleosomal spacing to extract the most likely and possible chromatin states from the observed individual dyad positioning in a region. Examples of manually chained posterior dyad peaks are depicted in Figure 22. Main peaks, as well as side peaks, can be chained. The more dyad series are present at a genomic locus, the harder it is to tell them apart accurately.
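
Peak calling and chaining could be automated along the following lines. This is only a sketch: the local-maximum rule, the greedy linking strategy, and the spacing limits of 150 to 200 bp are assumptions introduced here, since the description states only that chaining follows rules for naturally occurring nucleosomal spacing.

```python
# Sketch: call posterior peaks and chain them by plausible dyad spacing.
# The spacing limits and the greedy strategy are hypothetical choices.


def call_peaks(signal, min_height=0.0):
    """Indices of local maxima above `min_height` (left edge of plateaus)."""
    return [
        i for i in range(1, len(signal) - 1)
        if signal[i] > min_height
        and signal[i] > signal[i - 1]
        and signal[i] >= signal[i + 1]
    ]


def chain_peaks(peaks, signal, min_spacing=150, max_spacing=200):
    """Greedily link peaks left to right into series with allowed spacing."""
    chains = []
    used = set()
    for seed in peaks:
        if seed in used:
            continue
        chain = [seed]
        used.add(seed)
        while True:
            candidates = [
                p for p in peaks
                if p not in used
                and min_spacing <= p - chain[-1] <= max_spacing
            ]
            if not candidates:
                break
            best = max(candidates, key=lambda p: signal[p])
            chain.append(best)
            used.add(best)
        chains.append(chain)
    return chains


# Toy signal with a main series and a weaker side series (hypothetical).
toy_signal = [0.0] * 600
for pos, height in [(100, 1.0), (270, 0.9), (440, 0.8), (180, 0.3), (350, 0.2)]:
    toy_signal[pos] = height

toy_peaks = call_peaks(toy_signal)
toy_chains = chain_peaks(toy_peaks, toy_signal)
```

In the toy signal, the three strong peaks form one chain and the two weaker peaks form a second, offset chain, mirroring the idea that main and side peaks may each represent a separate dyad series.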

A region with the highest coverage-based signal may match the nucleosome posterior signal; however, the prior-based signal can differ significantly from the sequence read coverage signal (Figure 9A). A decisive advantage is that the posterior nucleosome dyad signal can resolve nearby signals from different chromatin states in the cfDNA shedding cell population, which is impossible based on sequence depth analyses alone (Figure 9B, C). Hence, interpretations derived solely from coverage-based signals, as in cfDNA assays, can result in wrong conclusions.

Handling of low-confidence regions

A unique feature of our approach is that the posterior nucleosome signal can be interpreted within the context of the coverage-based signal. In general, analyzing posterior nucleosome positions in low depth-of-coverage regions is challenging. The posterior nucleosome signal may not be representative in regions with a low depth of coverage (e.g., below 6x coverage). Therefore, such regions can either be masked and excluded from further analysis (Figure 9D) or simply down-scaled based on a target coverage of, e.g., 10x. The latter primarily supports visual inspection of loci. To give an example for down-weighting the posterior based on a target depth-of-coverage of 10, the posterior signal at a locus with 1x coverage would be divided by ten, one with 5x coverage would be divided by 2, and a locus with 20x coverage would be amplified by a factor of 2. Only the posterior signal for bases with 10x coverage would remain the same in this example. Examples of low confidence peaks being masked out are the peak in the NDR of the PFKFB2 gene TTS which is supported only by a single fragment (Figure 12B), the peaks in the central -200 bp to +200 bp region of the RIT gene TSS of Figure 22A, the peaks around the TSS of the ATP1A1 gene between position -200 bp and +500 bp as shown in Figure 22B, and the peak slightly upstream to position +200 of the TTS of the NOTO gene which is displayed in Figure 22C.
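
The two options for low-confidence regions, masking and coverage-based scaling, can be sketched as a single helper. The function name and the use of `None` as a mask marker are assumptions for illustration; the target coverage of 10x and the 6x threshold are the example values given above.

```python
# Sketch: scale or mask the posterior signal by local depth of coverage.
# Function name and the None mask marker are hypothetical.


def scale_posterior(posterior, coverage, target_coverage=10.0, mask_below=None):
    """Multiply each posterior value by coverage / target_coverage.

    If `mask_below` is given, positions with lower coverage are masked
    (returned as None) instead of being scaled.
    """
    scaled = []
    for value, depth in zip(posterior, coverage):
        if mask_below is not None and depth < mask_below:
            scaled.append(None)
        else:
            scaled.append(value * depth / target_coverage)
    return scaled


# Worked example from the text: loci with 1x, 5x, 10x, and 20x coverage.
toy_posterior = [1.0, 1.0, 1.0, 1.0]
toy_coverage = [1, 5, 10, 20]
scaled = scale_posterior(toy_posterior, toy_coverage)
masked = scale_posterior(toy_posterior, toy_coverage, mask_below=6)
```

For the example values, the scaled signal is divided by ten at 1x, halved at 5x, unchanged at 10x, and doubled at 20x, while the masked variant drops the 1x and 5x positions entirely.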

Enrichment

The combination of depth of coverage and posterior nucleosome signal can be leveraged not only to characterize nucleosome dyad peaks but also to calculate particular metrics like the local DNA accessibility (ac) from the depth of coverage (d) and the spacing between nucleosome dyads (Δnd) (Figure 10). The accurate calculation of nucleosome dyad spacing (Δnd) allows for determining genomic regions where nucleosomes are highly phased or less well organized. Furthermore, it enables the reliable identification of closed or open chromatin (Figure 10). Another way of detecting closed chromatin regions (not shown) is to compute the ratio of fragment midpoints in the long fraction vs. the short fraction over a window across multiple nucleosomes, e.g., one that spans 600bp. Open chromatin regions indicate locations within the human genome with regulatory functions, such as transcription start sites, i.e., regions with extraordinary biological relevance. Several options exist to analyze the local nucleosome phasing, such as discrete short-time Fourier transformation of an average multi-locus coverage signal (Figure 10).
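
The long-versus-short midpoint ratio mentioned above can be sketched as follows. The 150 bp cutoff between the short and long fraction and the pseudocount are assumptions made for this illustration; the description specifies only the 600bp example window.

```python
# Sketch: ratio of long to short fragment midpoints inside a window.
# The length cutoff and pseudocount are hypothetical parameters.


def long_short_ratio(fragments, window_start, window_size=600,
                     length_cutoff=150, pseudocount=1.0):
    """fragments: list of (start, length) in genomic coordinates."""
    long_n = 0
    short_n = 0
    window_end = window_start + window_size
    for start, length in fragments:
        midpoint = start + length // 2
        if window_start <= midpoint < window_end:
            if length >= length_cutoff:
                long_n += 1
            else:
                short_n += 1
    # The pseudocount avoids division by zero in windows without short fragments.
    return (long_n + pseudocount) / (short_n + pseudocount)


# Toy fragments: two long and one short inside the window, one outside.
toy_fragments = [(0, 170), (100, 180), (200, 120), (900, 170)]
ratio = long_short_ratio(toy_fragments, window_start=0)
```

Under these assumptions, a higher ratio would point toward a window dominated by long, nucleosome-protected fragments; where to set a decision threshold would have to be determined empirically.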

Machine learning approaches, such as training random forest models in a supervised learning setting, can help select informative features (Figures 10, 11). One typical application uses these features to investigate whether a plasma sample was derived from a healthy donor or an individual with a particular disease. The search for disease-associated signals would be the classic example of a diagnostic liquid biopsy application for the early detection of diseases based on the epigenetic features of cfDNA shedding tissues. Trained models could be updated based on orthogonal diagnoses and other technologies like magnetic resonance imaging (MRI). The updated ground truth labels of samples (e.g., changing a sample’s label from “healthy” to “disease developing” after a patient has developed a specific disease with symptoms detectable by imaging analysis some months after the liquid biopsy) can be used to fine-tune a model and to extract additional informative features as the number of samples grows. Additionally, the machine learning-assisted epigenetic characterization of disease states could deepen our understanding of disease-causing processes and mechanisms of disease progression if the approach were frequently applied.

REFERENCES

Alberts, B., et al. (2022). Molecular Biology of the Cell. 7th edn. New York: W.W. Norton & Co.

Hall, M.A., et al. (2009). High-resolution dynamic mapping of histone-DNA interactions in a nucleosome. Nat Struct Mol Biol 16, 124-129.

Heitzer, E., et al. (2019). Current and future perspectives of liquid biopsies in genomics-driven oncology. Nature reviews Genetics 20, 71-88.

Jiang, P., et al. (2015). Lengthening and shortening of plasma DNA in hepatocellular carcinoma patients. Proceedings of the National Academy of Sciences of the United States of America 112, E1317-1325.

Michael, A.K., and Thoma, N.H. (2021). Reading the chromatinized genome. Cell 184, 3599-3611.

Moding, E.J., et al. (2021). Detecting Liquid Remnants of Solid Tumors: Circulating Tumor DNA Minimal Residual Disease. Cancer discovery.

Mouliere, F., et al. (2018). Enhanced detection of circulating tumor DNA by fragment size analysis. Science translational medicine 10, eaat4921.

Snyder, M.W., et al. (2016). Cell-free DNA Comprises an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin. Cell 164, 57-68.

Speicher, M. (ed.), et al. (2010). Vogel and Motulsky's Human Genetics: Problems and Approaches. 4th edn. Heidelberg: Springer Berlin.

Strachan, T., and Read, A. (2018). Human Molecular Genetics. 5th edn. New York: W. W. Norton & Company.

Sun, K., et al. (2018). Size-tagged preferred ends in maternal plasma DNA shed light on the production mechanism and show utility in noninvasive prenatal testing. Proceedings of the National Academy of Sciences of the United States of America 115, E5106-E5114.

Ulz, P., et al. (2019). Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection. Nature communications 10, 4666.

Ulz, P., et al. (2016). Inferring expressed genes by whole-genome sequencing of plasma DNA. Nature Genetics 48, 1273-1278.

Weinberg, A. (2013). The Biology of Cancer. 2nd edn. New York: W. W. Norton & Company.

Winogradoff, D., and Aksimentiev, A. (2019). Molecular Mechanism of Spontaneous Nucleosome Unraveling. Journal of Molecular Biology 431, 323-335.