

Title:
A METHOD FOR COMPUTING AND MERGING STATIC AND DYNAMIC METRICS OF A COMPUTER PROGRAM, A METHOD FOR TRAINING A MACHINE LEARNING MODEL BASED ON HYBRID METRICS, A MACHINE LEARNING MODEL, AND A METHOD FOR USING THE MACHINE LEARNING MODEL
Document Type and Number:
WIPO Patent Application WO/2024/057046
Kind Code:
A1
Abstract:
In a method for computing and merging static and dynamic metrics of a computer program, in a static analysis step (S100), source code (10) is analyzed without execution by means of a static analysis system (1) and static analysis results (11) are generated, static metrics (12) are computed from them in a static metrics computation step (S110), in a source code instrumentation step (S120) the source code (10) is instrumented by means of a dynamic injection system (3) without disturbing the basic operation of the source code (10), in a compilation step (S121), for a non-interpreted language, the instrumented source code (13) is translated into an executable binary format (14), in a dynamic execution step (S125), the instrumented source code (13) is executed by a dynamic analysis system (2) using a testing framework and dynamic analysis results (15) are produced, from which, in a dynamic metrics computation step (S127), dynamic metrics (16) are computed by processing the dynamic analysis results (15) obtained using a dynamic injection system (3), and in a hybrid matching step (S140), the static metrics (12) and the dynamic metrics (16) are merged into a hybrid dataset (17). In a method for training a machine learning model based on hybrid metrics, a preprocessed dataset (30) is generated from a hybrid dataset (20) in a dataset preprocessing step (S200) by filtering data with zero standard deviation (25) and data with no training significance (26), and a vulnerability feature (35) is determined from the preprocessed dataset (30) in a feature determination step (S210); in a classifying machine learning model configuration step (S220), a training dataset (55) and a validating dataset (56) are defined as parameters (45), a machine learning algorithm (46) is defined as a parameter (45), and a machine learning model (50) is generated based on the extracted features (40); in a machine learning algorithm training step (S230), a trained machine learning model (60) is generated by training the configured machine learning model (50) using a machine learning system (4). In a method using a machine learning model, a vulnerability prediction (70) for preprocessed hybrid results (65) is generated in a vulnerability prediction evaluation step (S300) using preprocessed hybrid results (65) determined by a trained model (60), the preprocessed hybrid results (65) being generated in a hybrid result preprocessing step (S350). A machine learning system (4), wherein a vulnerability prediction is generated by performing a preprocessing step (S200), a feature extraction step (S210), a classifier machine learning model configuration step (S220), a machine learning algorithm training step (S230), and a trained model evaluation step (S240).

Inventors:
FERENC RUDOLF (HU)
HEGEDŰS PÉTER (HU)
WOLF LEVENTE (HU)
NAHIMI SELIM KRISZTIÁN (HU)
Application Number:
PCT/HU2023/050057
Publication Date:
March 21, 2024
Filing Date:
September 12, 2023
Assignee:
UNIV SZEGEDI (HU)
International Classes:
G06F21/56; G06F21/57; G06N20/00
Other References:
SHAR, LWIN KHIN et al.: "Experimental comparison of features and classifiers for Android malware detection", Proceedings of the IEEE/ACM 7th International Conference on Mobile Software Engineering and Systems, New York, NY, USA, 13 July 2020, pages 50-60, XP058482302, ISBN: 978-1-4503-7959-5, DOI: 10.1145/3387905.3388596
KIM, SEOKMO et al.: "Software Vulnerability Detection Methodology Combined with Static and Dynamic Analysis", Wireless Personal Communications, Springer, Dordrecht, NL, vol. 89, no. 3, 17 December 2015, pages 777-793, XP035998369, ISSN: 0929-6212, DOI: 10.1007/s11277-015-3152-1
CHEN, S. et al.: "Security Vulnerabilities: From Analysis to Detection and Masking Techniques", Proceedings of the IEEE, vol. 94, no. 2, 1 February 2006, pages 407-418, XP011442698, ISSN: 0018-9219, DOI: 10.1109/JPROC.2005.862473
Attorney, Agent or Firm:
DANUBIA PATENT & LAW OFFICE LLC (HU)
Claims:

1. A method for calculating and combining static and dynamic metrics of a computer program, in which a computer tool is used to process source code in such a way that in a static analysis step (S100), source code (10) is analyzed and static analysis results (11) are generated, in a static metrics calculation step (S110), static metrics (12) are calculated from the static analysis results (11), in a source code instrumentation step (S120), instrumentation is performed in the source code (10), in a source code compilation step (S121), for a non-interpreted language, the instrumented source code (13) is compiled into executable binary format (14), a dynamic execution step (S125) produces dynamic analysis results (15), in a dynamic metrics calculation step (S127), dynamic metrics (16) are calculated from the dynamic analysis results (15), and in a hybrid matching step (S140), the static metrics (12) and the dynamic metrics (16) are merged into a hybrid dataset (17), characterized in that in the static analysis step (S100), the source code (10) is parsed by a static analysis system (1) in such a way that no execution is performed on the source code (10), in the source code instrumentation step (S120), the source code (10) is instrumented by means of a dynamic injection system (3) in such a way that the basic operation of the source code (10) is not altered, in the dynamic execution step (S125), the instrumented source code (13) is run by a dynamic analysis system (2) using a testing framework, and in the dynamic metrics calculation step (S127), the dynamic analysis results (15) are processed using the dynamic injection system (3).

2. A method for training a machine learning model based on hybrid metrics, where in a dataset preprocessing step (S200), a preprocessed dataset (30) is produced from a hybrid dataset (20), in a feature extraction step (S210), features (40) are determined from the preprocessed dataset (30), in a classifying machine learning model configuration step (S220), a machine learning model (50) is created based on the specified features (40), in a machine learning algorithm training step (S230), a trained machine learning model (60) is generated by training the configured machine learning model (50), characterized in that during the dataset preprocessing step (S200), data with zero standard deviation (25) and data with no training significance (26) are filtered, in the feature extraction step (S210), a vulnerability feature (35) is defined, parameters (45) and a machine learning algorithm (46) are defined during the configuration step (S220) of the classifier machine learning model, during the training step (S230) of the classifier machine learning model, a training dataset (55) and a validation dataset (56) are defined, the classifier machine learning model configuration step (S220) and the classifier machine learning model training step (S230) are performed using a machine learning system (4).

3. A method for using a machine learning model, with the help of which, in a vulnerability prediction evaluation step (S300) defined by a trained model (60), a vulnerability prediction (70) is generated for preprocessed hybrid results (65) using a trained machine learning model (60), and a hybrid result preprocessing step (S350) produces the preprocessed hybrid results (65), characterized in that the vulnerability prediction evaluation step (S300) involves the use of the preprocessed hybrid results (65), and determining the preprocessed hybrid results (65) involves the following steps:

- in a dataset preprocessing step (S200), creating a preprocessed dataset (30) from a hybrid dataset (20) containing data with zero standard deviation (25) and data with no training significance (26),

- in a feature extraction step (S210), generating from the preprocessed dataset (30) a vulnerability dataset (5) and features (40) defined based on a vulnerability feature (35),

- in a classifier machine learning model configuration step (S220), creating a configured machine learning model (50) from the defined features (40) by choosing parameters (45) and a machine learning algorithm (46),

- in a machine learning algorithm training step (S230), producing a trained machine learning model (60) using a training dataset (55) and a validation dataset (56).

4. The method according to claim 1 or 2 or 3, characterized in that, in the dynamic execution step (S125), the execution is performed on interpreted source code (10) or on compiled binary format (14), depending on whether the programming language is interpreted.

5. The method according to claim 1 or 2 or 3, characterized in that, in the hybrid matching step (S140), the static metrics (12) and the dynamic metrics (16) are matched according to the condition specified in the step, from which hybrid datasets (17) are generated.

6. A machine learning system (4) characterized in that a vulnerability prediction is generated by performing a preprocessing step (S200), a feature extraction step (S210), a classifier machine learning model configuration step (S220), a machine learning algorithm training step (S230) and a trained model evaluation step (S240) according to any of claims 1 and 4-5, or claims 2 and 4-5, or claims 3 and 4-5.

7. The machine learning system (4) according to claim 6, characterized in that the hybrid dataset (20) used in the preprocessing step (S200) is generated by a method according to any one of claims 1 and 4-5.

8. The machine learning system (4) according to claim 6, characterized in that the hybrid dataset (20) used in the preprocessing step (S200) is generated by a method according to any one of claims 2 and 4-5.

9. The machine learning system (4) according to claim 6, characterized in that the hybrid dataset (20) used in the preprocessing step (S200) is generated by a method according to any one of claims 3 and 4-5.

10. A machine learning system (4) according to any one of claims 6 to 9, characterized in that the machine learning algorithm training step (S230) omits the validation step and omits the generation of the validation dataset (56).

11. A machine learning system (4) according to any one of claims 6 to 10, characterized in that the hybrid results (17) used in the hybrid result preprocessing step (S350) are generated by the method according to any one of claims 1 and 4 to 5.

Description:
A method for computing and merging static and dynamic metrics of a computer program, a method for training a machine learning model based on hybrid metrics, a machine learning model, and a method for using the machine learning model

The invention relates to a method for calculating and combining static and dynamic metrics of a computer program according to claim 1, a method for training a machine learning model based on hybrid metrics according to claim 2, a method for using a machine learning model according to claim 3, and to a machine learning system according to claim 6. In particular, the field of the invention relates to the prediction of vulnerabilities of code snippets in computer programs, combining static and dynamic analysis methods and determining vulnerabilities based on the static and dynamic metrics calculated therefrom.

Vulnerability prediction identifies the parts of the source code that may be suspected of containing a defect. If we are aware of the parts of our program that are prone to defects, we can make efficient use of our available testing capacity and the code review process. As a result, vulnerability prediction greatly aids the process of developing and maintaining software.

The information extracted from both static and dynamic code analysis can be used in a variety of ways. These include determining test coverage, checking code quality, and calculating other indicators.

US 2010/0058295 A1 describes a method for evaluating the dynamic test coverage of a given program code. The test coverage of the test code is identified. The code coverage of the test code is analyzed. The current coverage information is stored. Code coverage information for one or more previous versions of the test code is stored. The current coverage information is compared with the previous coverage information, and the system aggregates the differences between them. In response to the decision to generate test cases automatically, new test cases are generated automatically based on the differences. The new test cases are stored. The test code coverage is analyzed based on the new test cases. The new coverage information is stored. Finally, the new coverage information is transmitted to the user.

US 8726392 B1 describes a computer-implemented method for combining static and dynamic code analysis, which may include: 1) identification of executable code to be analyzed to determine whether the executable code is capable of leaking sensitive data; 2) static analysis of the executable code to identify one or more objects that may be used to transmit sensitive data from the executable code, the static analysis being performed without executing the executable code; 3) using the results of the static analysis, configuration of a dynamic analysis to track one or more objects identified during the static analysis; 4) dynamic analysis performed by monitoring one or more objects identified during static analysis while the executable code is executing, to determine whether the executable code is leaking sensitive data through one or more objects. It also describes other methods, systems, and computer-readable media.

US 2007/0288899 A1 describes a method for generating static and dynamic code analyses in a straightforward, iterative manner. A software analysis tool integrates the results of dynamic and static analyses and iteratively uses the results of a previous analysis or analyses to supplement the current analysis. The information gathered at runtime during the debugging process is integrated with the static code analysis results. This information is generated and stored as part of the results of the testing and debugging processes. The stored information is then used to provide improved analysis results. With the software analysis tool, software developers do not need to perform separate static and dynamic analyses.

The security analyses and vulnerability test results in US 10776497 B2 are "packaged" or "tied" to the actual software they describe. By linking the results to the software itself, additional users of the software can access information about the software, make informed decisions about software implementation, and analyze security risks across the entire system by accessing all (or most) of the associated reports. This helps to summarize the risks identified in the executables running on the system.

In light of the known solutions, the need for a more efficient method to evaluate the vulnerability prediction of the program code using static and dynamic analysis data via a trained machine learning model has been identified.

It is the aim of the invention to develop a program code vulnerability prediction method that uses the combined results of static and dynamic code analysis to achieve the highest possible accuracy of vulnerability prediction relative to the state of the art. The primary objective of the invention is to provide a method for evaluating the vulnerability prediction accuracy using a machine learning model based on hybrid analysis data.

It was recognized that the method according to the invention and its preferred embodiments can significantly increase the accuracy of vulnerability prediction, thereby enabling a code fragment containing a vulnerability to be identified more efficiently. By being aware of which parts of our program are prone to defects, we can make efficient use of our available testing capacity and the code review process. As a result, vulnerability prediction greatly aids software development and maintenance.

The objectives of the invention have been achieved by a method for computing and merging static and dynamic metrics of a computer program according to claim 1, wherein source code is processed by a computer tool in such a way that in a static analysis step source code is analyzed and static analysis results are generated, in a static metrics computation step static metrics are computed from the static analysis results, in a source code instrumentation step instrumentation is performed in the source code, in a source code compilation step the instrumented source code is translated into executable binary format for a non-interpreted language, in a dynamic execution step dynamic analysis results are produced, in a dynamic metrics computation step dynamic metrics are computed from the dynamic analysis results, and finally, in a hybrid matching step, static metrics and dynamic metrics are merged into a hybrid dataset. In the static analysis step, the source code is analyzed by means of a static analysis system in such a way that no execution is performed on the source code; in the source code instrumentation step, the source code is instrumented by means of a dynamic injection system in such a way that the basic operation of the source code is not modified; in the dynamic execution step, the instrumented source code is executed by a dynamic analysis system using a testing framework; and in the dynamic metrics computation step, the dynamic analysis results obtained are processed using the dynamic injection system.

On the other hand, the objectives of the invention have been addressed by a method for training a machine learning model based on hybrid metrics according to claim 2, wherein a preprocessed dataset is generated from a hybrid dataset in a dataset preprocessing step, features are determined from the preprocessed dataset in a feature determination step, in a classifying machine learning model configuration step a machine learning model configured based on the determined features is generated, and in a machine learning algorithm training step a trained machine learning model is generated by training the configured machine learning model. In the dataset preprocessing step, data with zero standard deviation and data with no training significance are filtered; in the feature extraction step, a vulnerability feature is defined; in the classifying machine learning model configuration step, parameters and a machine learning algorithm are defined; during the classifier machine learning model training step, a training dataset and a validation dataset are defined; and the classifier machine learning model configuration step and the classifier machine learning model training step are performed using a machine learning system.

Thirdly, the objectives of the invention have been achieved by a method for using a machine learning model according to claim 3, wherein a vulnerability prediction evaluation step using a trained model produces a vulnerability prediction for preprocessed hybrid results using a trained machine learning model, and a hybrid result preprocessing step produces the preprocessed hybrid results. The vulnerability prediction evaluation step involves using the preprocessed hybrid results, and determining the hybrid results involves generating a preprocessed dataset from a hybrid dataset containing data with zero standard deviation and no training significance in a dataset preprocessing step; in a feature definition step, generating from the preprocessed dataset a vulnerability dataset and features defined based on the vulnerability feature; in a classifying machine learning model configuration step, generating a configured machine learning model from the defined features by selecting parameters and a machine learning algorithm; and in a machine learning algorithm training step, generating a trained machine learning model using a training dataset and a validation dataset.

Fourthly, the objectives of the invention have been achieved by a machine learning system according to claim 6, wherein a vulnerability prediction is generated by applying any method according to the invention, performing a preprocessing step, a feature extraction step, a classifier machine learning model configuration step, a machine learning algorithm training step, and a trained model evaluation step.

Some advantageous implementations of the invention and some embodiments are described in subclaims.

In the method according to the invention, static source code analysis and dynamic source code analysis are performed, and results are produced.

Furthermore, in the method according to the invention, different metrics are computed from the results obtained from static source code analysis and dynamic source code analysis.

In addition, in the method according to the invention, a source code compilation procedure is preferably used.

The method according to the invention further advantageously combines static metrics and dynamic metrics.

The method according to the invention further advantageously implements source code instrumentation.

The method according to the invention advantageously further comprises configuring, training, and using a machine learning model.

A primary advantage of the method of the invention is that the prediction is performed by a machine learning algorithm, where the accuracy of the vulnerability prediction can be increased for a sufficiently large input training dataset. An additional advantage of the method of the invention is that metrics can be computed from the results of the source code analysis, which can be used to determine a composite evaluation of the source code from a code quality perspective.

Another advantage of the method according to the invention is that the results of the analysis are stored during the static and dynamic source code analysis, thus they can be used later to implement further code quality procedures.

The invention is further described by means of example embodiments of the process, with reference to the accompanying drawing, where:

Figure 1 shows an illustration of a preferred embodiment of the systems used by the invention,

Figure 2 shows a flowchart of a preferred embodiment of a method for calculating hybrid metrics according to the invention,

Figure 3 shows an example source code,

Figure 4 shows an example of a static metrics dataset,

Figures 5A, 5B show pseudocode and JavaScript examples provided for the source code instrumentation step,

Figure 6 shows a flowchart of a preferred embodiment of a method for training a machine learning model according to the invention,

Figure 7 shows a flowchart of a preferred embodiment of a process for producing a vulnerability prediction according to the invention, and

Figure 8 shows example dynamic analysis results.

Figure 1 is an illustration of a preferred embodiment of a system for predicting the vulnerability of a computer program using a machine learning model according to the invention, based on code quality metrics calculated by static and dynamic analysis of the source code 10 of the program. The system of the invention is advantageously divided into two parts. A first part consisting of modules M contains the tools used by the system; these are also called modules. The modules M contain the following subsystems: static analysis system 1, dynamic analysis system 2, dynamic injection system 3, and machine learning system 4. The second part of the system used by the invention, the database, holds the data used by the subsystems in the modules. A preferred embodiment of a database DB includes vulnerability datasets 5, static analysis results 6, dynamic analysis results 7, and hybrid results 8. The static analysis system 1 is preferably a software program that processes source code and whose results provide sufficient information to compute complexity, inheritance, size, and coupling static metrics (see step S110, static metrics calculation) based on the functions and methods in the code. This also includes software that performs the calculation of these metrics in addition to producing the information mentioned.

The dynamic analysis system 2 is preferably a software program that dynamically analyzes the source code 10, which can be used to analyze a working program or system modified with the dynamic injection system 3 at runtime or after runtime, and then compute complexity, documentation, inheritance, size and coupling dynamic metrics from the analysis.

The dynamic injection system 3 is preferably software that modifies the source code 10 by adding additional instructions to the existing source code 10 that do not interfere with its basic operation, and which communicate important information about the given code fragment, either through a system external to the program, or directly to a database or to the file system, in a way that can be processed later. The output dataset can be in any format, for example, JSON or XML file formats.

The machine learning system 4 is preferably software comprising one or more machine learning classification algorithms that can produce vulnerability predictions 70 from an input dataset with feature records.

The vulnerability dataset 5 is preferably a dataset of records containing the paths, names, and row and column locations within the code of all functions and methods in the source code 10 of a system. It should also include, in addition to these features, whether the function is vulnerable or not (e.g., 0 or 1).

The static analysis results 6 are preferably a dataset of twelve static metrics 12 calculated from the results produced by the static analysis system 1.

The dynamic analysis results 7 are preferably a dataset of dynamic metrics 16 calculated from the results produced by the dynamic analysis system 2.

The hybrid results 8 are preferably a dataset produced by matching the static analysis results 6 and the dynamic analysis results 7, based on path and location in the code.

Figure 2 is a flowchart of a preferred embodiment of a method for computing hybrid metrics according to the invention. The method uses a computational tool to process the source code 10, which is purposefully free of syntactic errors and does not encounter problems during execution. Preferably, the source code 10 is loaded from a database, a computer, a cloud service, or other storage. The format of the source code 10 is preferably a program code or script type file format known from the literature, preferably JS, PY, JAVA, C, C++, or another file format known from the literature. As an example, Figure 3 shows a source code 10 in JS format, consisting of thirty-one lines and three functions.

The source code 10 is analyzed in the static analysis step S100 using the static analysis system 1, which produces static analysis results 11. In the static analysis step S100, the source code 10 is analyzed without execution. In the static analysis step S100, source code 10 written in any programming language can be used.

From the static analysis results 11, static metrics 12 are generated in the static metrics calculation step S110. In the static metrics calculation step S110, we preferentially compute complexity, documentation, inheritance, size, and coupling metrics for functions, methods, and classes in the code, in arbitrary processable output formats. The static analysis results 11 obtained in the static analysis step S100 can be produced in any format, and the static metrics 12 can be produced in any combination. The stored name of each metric is not authoritative and is only informative for the user. The values of the metrics are real numbers (which include rational, integer, and natural numbers) that provide a measure of one or more predefined features of the code fragment. Metrics can be grouped into categories. Complexity metrics include, for example, the well-known McCabe's Cyclomatic Complexity (McCC), Nesting Level (NL), and Weighted Methods per Class (WMC). Documentation metrics include, for example, Comment Density (CD), Comment Lines of Code (CLOC), and Documentation Lines of Code (DLOC). Coupling metrics include, for example, Number of Incoming Invocations (NII) and Number of Outgoing Invocations (NOI). Inheritance metrics include, for example, Depth of Inheritance Tree (DIT) and Number of Descendants (NOD). Size metrics include, for example, Lines of Code (LOC), Number of Methods (NM), Number of Parameters (NUMPAR), Number of Statements (NOS), and Number of Classes (NCL).
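For illustration, the following minimal Python sketch computes a few such metrics (LOC, NUMPAR, and a rough McCC approximation) per function using the standard ast module, without executing the code. It is a hypothetical stand-in for the static analysis system 1; the record field names are illustrative and do not reproduce the format of Figure 4.

```python
import ast

# Node types counted as branch points for a rough McCC approximation.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)

def static_metrics(source):
    """Parse source (no execution) and emit one record per function."""
    records = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            mccc = 1 + sum(isinstance(n, BRANCH_NODES) for n in ast.walk(node))
            records.append({
                "name": node.name,
                "start_row": node.lineno,
                "end_row": node.end_lineno,
                "LOC": node.end_lineno - node.lineno + 1,   # lines of code
                "NUMPAR": len(node.args.args),              # parameters
                "McCC": mccc,  # rough cyclomatic complexity approximation
            })
    return records

print(static_metrics("def f(a, b):\n    if a:\n        return b\n    return a\n"))
```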

The format of the static metrics 12 dataset is preferably a file format known from the literature, preferably JSON, XML, or CSV. An example is shown in Figure 4: the static metrics 12 dataset is a CSV format dataset of three records, where each record contains the basic information of the functions (name, long name, path, start row, start column, end row, end column) and their corresponding ten different static metrics.

The choice of metrics used in the static metrics calculation step S110 - any number and proportion of metrics can be chosen - can result in significant variation in the subsequent steps, but this step includes all possible combinations that arise and does not distinguish between them.

In the source code instrumentation step S120, the source code 10 is augmented by a source code instrumentation system with additional instructions that do not interfere with the basic operation, which communicate important information about the code fragment and are stored in a system external to the program, or directly in a database or file system, in a way that can be processed later. Instructions, expressions, and controls to be instrumented include simple selection control, initial condition repetition control, count repetition control, logic expression, conditional expression, exception handling control, exception catching control, and direct block instruction, examples and pseudocode of which are shown in Figures 5A, 5B. The format of the output dataset is arbitrary; examples are JSON or XML file formats.
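Figures 5A, 5B show the instrumentation for JavaScript; as a minimal language-neutral sketch of the same idea, the Python fragment below inserts a reporting call at the start of every function body without altering its result. The __report__ hook and the in-memory store are hypothetical stand-ins for the dynamic injection system 3 and its external storage.

```python
import ast

class Instrument(ast.NodeTransformer):
    """Prepend a reporting call to every function body so that running
    the program records which functions were actually called."""
    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        hook = ast.parse(f"__report__({node.name!r}, {node.lineno})").body[0]
        node.body.insert(0, hook)
        return node

source = "def add(a, b):\n    return a + b\n"
tree = ast.fix_missing_locations(Instrument().visit(ast.parse(source)))

calls = []  # stands in for a database or file written by the injected code
env = {"__report__": lambda name, line: calls.append((name, line))}
exec(compile(tree, "<instrumented>", "exec"), env)
print(env["add"](1, 2), calls)  # 3 [('add', 1)] - basic operation unchanged
```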

It is possible that the additional instructions inserted by the source code instrumentation system during the source code instrumentation step S120, while not interfering with basic operation, may increase the runtime of the executed program and its memory and CPU consumption. The source code instrumentation step S120 can use source code 10 written in any programming language.

The instrumented source code 13 is compiled to a binary format 14 in the source code compilation step S121 for a non-interpreted programming language. An interpreted programming language is defined as a programming language that does not require one or more prior binary translation steps to run but requires interpreter software and the source code to execute. Otherwise, the programming language is non-interpreted, and the compilation steps must be performed to create an executable binary.

In the dynamic execution step S125, the source code 13 instrumented in step S120 (for an interpreted language) or the binary format 14 produced in step S121 (for a non-interpreted language) is executed through a test framework. In this dynamic execution step S125, the dynamic analysis results 15 obtained can be produced in any format. As a result of the instrumentation performed in the source code instrumentation step S120, dynamic analysis results 15 in an appropriate format are obtained by running the modified program, which are then processed in the dynamic metrics calculation step S127. The dynamic metrics 16 calculated in the dynamic metrics calculation step S127 can be produced in any combination and order. Example dynamic analysis results 15 are shown in Figure 8.

The advantage of the dynamic source code analysis step of the invention is that it analyzes only the part of the program that is actually executed, avoiding the parts that are not executed at runtime. Ignoring the parts that are not executed during the run of the program can lead to more realistic results.

In the dynamic metrics calculation step S127, the output files produced by the program run in the dynamic execution step S125 are processed. The dataset to be processed contains the attributes of each function and method in the source code and the set of instructions executed in them. These attributes include the name, path, line number, column number, ending line number, ending column number, and called status (whether the function or method was called during execution) of a particular function or method. In addition to these attributes, the set of instructions actually executed within each function or method is also included. This set contains all recognized and instrumented instructions that are relevant for the calculation of the dynamic metrics, for example, among the complexity metrics, McCabe's Cyclomatic Complexity (McCC) and Nesting Level (NL), and among the coupling metrics, Number of Incoming Invocations (NII) and Number of Outgoing Invocations (NOI).
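Assuming, purely for illustration, that the dynamic analysis results 15 are emitted as JSON records of the shape below (the field names are hypothetical, not the format of Figure 8), the processing of step S127 can be sketched as:

```python
import json

# Hypothetical dynamic-analysis output: one record per function with the
# set of instructions actually executed inside it at runtime.
raw = json.loads("""[
  {"name": "add", "path": "src/a.js", "line": 2, "endLine": 6,
   "called": true, "executed": ["if", "call:log", "call:sum"]},
  {"name": "unused", "path": "src/a.js", "line": 9, "endLine": 12,
   "called": false, "executed": []}
]""")

def dynamic_metrics(records):
    rows = []
    for r in records:
        # Count only what was actually executed, ignoring dead code.
        noi = sum(1 for i in r["executed"] if i.startswith("call:"))
        branches = len(r["executed"]) - noi
        rows.append({"name": r["name"], "path": r["path"],
                     "start_row": r["line"], "end_row": r["endLine"],
                     "called": r["called"],
                     "McCC": 1 + branches,  # complexity of executed paths
                     "NOI": noi})           # outgoing invocations observed
    return rows

for row in dynamic_metrics(raw):
    print(row)
```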

From the static metrics 12 and the dynamic metrics 16, hybrid matching results 17 are produced in the hybrid matching step S140 as follows. In the hybrid matching step S140, functions, methods, and classes in the static metrics 12 dataset and the dynamic metrics 16 dataset are matched; during the matching process, the following conditions must all be met:

- The function, method, or class to be matched comes from the same source file.

- The name of the function, method, or class to be matched is the same.

- The starting and ending line numbers of the function, method, or class to be matched are identical.

Note that the instrumentation that occurs during the source code instrumentation step S120 does not affect the source file, name, starting line number, and ending line number of functions, methods, and classes in the source code. Consequently, it is possible to match the static metrics 12 and the dynamic metrics 16.
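A minimal sketch of step S140 under these three conditions, assuming both metric sets are lists of records with the illustrative field names used in the earlier sketches:

```python
def hybrid_match(static_rows, dynamic_rows):
    # A record is matched only when the source file, the name, and the
    # starting and ending line numbers are all identical.
    def key(r):
        return (r["path"], r["name"], r["start_row"], r["end_row"])
    dynamic_by_key = {key(r): r for r in dynamic_rows}
    return [{**s, **dynamic_by_key[key(s)]}
            for s in static_rows if key(s) in dynamic_by_key]

static_rows = [{"path": "src/a.js", "name": "add", "start_row": 2,
                "end_row": 6, "LOC": 5, "NUMPAR": 2}]
dynamic_rows = [{"path": "src/a.js", "name": "add", "start_row": 2,
                 "end_row": 6, "McCC": 2, "NOI": 2}]
print(hybrid_match(static_rows, dynamic_rows))  # one merged hybrid record
```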

The procedure in Figure 6 includes a dataset preprocessing step S200, a feature extraction step S210, a classifier machine learning model configuration step S220, and a machine learning algorithm training step S230.

The hybrid dataset 20 is modified in the dataset preprocessing step S200, which creates preprocessed datasets 30. The dataset preprocessing step S200 includes the filtering of data 25 with zero standard deviation and the filtering of data 26 with no training significance. The filtering of data 25 with zero standard deviation removes from the hybrid dataset 20 the properties that are identical for all the different source code elements. Filtering data 26 with no training significance removes from the hybrid dataset 20 all properties that are negligible for the evaluation of the machine learning model. Examples of such data are function name, function path, and function starting line number.
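A minimal pandas sketch of the two filters, assuming the hybrid dataset 20 is held in a DataFrame (the column names are illustrative):

```python
import pandas as pd

def preprocess(hybrid: pd.DataFrame) -> pd.DataFrame:
    # Data with zero standard deviation 25: numeric columns whose value
    # is identical for every source code element carry no information.
    numeric = hybrid.select_dtypes("number")
    constant = [c for c in numeric.columns if numeric[c].std() == 0]
    # Data with no training significance 26: identifying attributes such
    # as the function name, path, and starting line number.
    insignificant = [c for c in ("name", "path", "start_row")
                     if c in hybrid.columns]
    return hybrid.drop(columns=constant + insignificant)
```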

Note that during the dataset preprocessing step S200, the modifications made to the hybrid dataset 20 reduce the size of the dataset.

The specified features 40 in the feature extraction step S210 of the machine learning model can be determined from the preprocessed dataset 30 using any known method. In addition, a vulnerability feature 35 is selected from the vulnerability dataset 5 by matching. The matching can be based on the path of the function, its location in the code, or other features. The vulnerability dataset 5 should already contain a binary vulnerability label (0 or 1), which acts as the classification label for the classifier machine learning algorithm. The vulnerability feature 35 is a classification label for the training of classifier machine learning algorithms known from the literature.
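The selection of the vulnerability feature 35 by matching can be sketched as a join against the vulnerability dataset 5; the column names are illustrative, and the sketch assumes the identifying columns are still present at matching time:

```python
import pandas as pd

def attach_label(features: pd.DataFrame,
                 vulnerability_dataset: pd.DataFrame) -> pd.DataFrame:
    # Match each function by its path and location in the code, then
    # take the binary vulnerability label (0 or 1) as the class label.
    labeled = features.merge(
        vulnerability_dataset[["path", "start_row", "vulnerable"]],
        on=["path", "start_row"], how="left")
    labeled["vulnerable"] = labeled["vulnerable"].fillna(0).astype(int)
    return labeled
```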

The configured machine learning model 50 is created in the classifier machine learning model configuration step S220, which is based on the specified features 40, the parameters 45, and the choice of the machine learning algorithm 46. In doing so, an arbitrary combination of selected parameters 45 and different machine learning algorithms 46 may be used. The parameters 45 and the machine learning algorithms 46 can make a substantial difference in the accuracy of the final prediction, and their optimal choice is crucial for the reliability of the trained model. The parameters 45 are hyperparameters that can denote, for example, the number of neurons, the number of layers, or the activation function for neural networks. A machine learning algorithm 46 is an algorithm that determines the basis of the model; the prediction is evaluated based on the mathematical formulae it provides. As an illustrative example, the best classifier configured machine learning model 50 in the evaluations was generated according to the following parameters: data standardization and scaling, label binarization, and the Random Forest algorithm.
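As a concrete sketch of step S220 under the illustrative best configuration just mentioned, and assuming scikit-learn as the machine learning system 4 (the document does not prescribe a library):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Configured machine learning model 50: data standardization and scaling
# followed by the Random Forest algorithm 46; n_estimators stands in for
# the parameters 45. The labels are assumed already binary (0 or 1), so
# no separate label binarization step is shown here.
configured_model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
```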

The trained machine learning model 60 is generated in the machine learning algorithm training step S230 as follows, wherein an arbitrary ratio of training datasets 55 and validation datasets 56 is preferably used. Prior to training, a training dataset 55 and a validation dataset 56 are generated from the preprocessed datasets 30; these are subsets of the preprocessed datasets 30 and do not intersect. The proportions of these sets can be chosen in the ways described in the literature, or the size of these sets can be increased or decreased by a so-called sampling method, in which the elements of the sets are randomly but evenly duplicated so that the proportion between the sets is optimal. Then, using the training datasets 55, the configured classifier machine learning model 50 is trained using the specified features 40. After the training is completed, the performance of the configured machine learning model 50 is improved using the validation dataset 56.
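Continuing the same sketch (it assumes the labeled DataFrame and configured_model from the previous fragments), step S230 splits the preprocessed dataset 30 into disjoint training (55) and validation (56) subsets and fits the configured model; the 80/20 ratio is an illustrative choice:

```python
from sklearn.model_selection import train_test_split

X = labeled.drop(columns=["vulnerable"])  # specified features 40
y = labeled["vulnerable"]                 # vulnerability feature 35

# Disjoint training dataset 55 and validation dataset 56.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

trained_model = configured_model.fit(X_train, y_train)
print("validation accuracy:", trained_model.score(X_val, y_val))
```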

Figure 7 is a flow diagram of a preferred embodiment of the method for providing a vulnerability prediction according to the invention. The hybrid results 17 are the same as the hybrid results 17 described in Figure 2. In Figure 7, the machine learning model evaluation step includes a prediction evaluation step S300 determined by the trained model.

As shown in Figure 7, the vulnerability prediction generation process includes the hybrid metric computation process shown in Figure 2 and the machine learning model training process according to the invention described in Figure 6.

The preprocessed hybrid results 65 are generated in the hybrid result preprocessing step S350 based on the hybrid results 17 described in Figure 2. During this step, all data and features that are negligible for the evaluation of the machine learning model are removed from the hybrid results 17. Examples of such data are function name, function path, and function starting line number.

Note that the features removed in the dataset preprocessing step S200 must also be removed in the hybrid result preprocessing step S350, and only features that were used to train the trained machine learning model 60 may be retained.

To provide the vulnerability prediction 70, in the prediction evaluation step S300 of the trained model, the trained machine learning model 60 generated in the process shown in Figure 6 is evaluated using as input the preprocessed hybrid results 65 obtained in the hybrid result preprocessing step S350.
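Continuing the earlier sketches (it assumes a hybrid_results DataFrame plus the X_train and trained_model names from above), steps S350 and S300 together reduce the hybrid results to exactly the features used during training and evaluate the trained model on them:

```python
# Hybrid result preprocessing step S350: keep only the feature columns
# the trained machine learning model 60 was actually trained on.
preprocessed_hybrid = hybrid_results[list(X_train.columns)]

# Prediction evaluation step S300: one vulnerability prediction 70
# (0 or 1) per function or method in the preprocessed hybrid results 65.
vulnerability_prediction = trained_model.predict(preprocessed_hybrid)
```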

List of reference signs:

M module

DB database

1 static analysis system

2 dynamic analysis system

3 dynamic injection system

4 machine learning system

5 vulnerability dataset

6 static analysis result

7 dynamic analysis result

8 hybrid results

10 source code

11 static analysis result

12 static metrics

13 instrumented source code

14 binary format

15 dynamic analysis result

16 dynamic metrics

17 hybrid dataset

20 hybrid dataset

25 data with zero standard deviation

26 data with no training significance

30 preprocessed dataset

35 vulnerability feature

40 specified feature

45 parameter

46 machine learning algorithm

50 configured machine learning model

55 training dataset

56 validation dataset

60 trained machine learning model

65 preprocessed hybrid result

70 vulnerability prediction

S100 static analysis step

S110 static metrics calculation step

S120 source code instrumentation step

S121 source code compilation step

S125 dynamic execution step

S127 dynamic metrics calculation step

S140 hybrid matching step

S200 dataset preprocessing step

S210 feature extraction step

S220 classifier machine learning model configuration step

S230 machine learning algorithm training step

S240 trained model evaluation step

S300 prediction evaluation step

S350 hybrid result preprocessing step