Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
SYSTEMS AND METHODS FOR SUBGRAPH MATCHING USING ACTIVE LEARNING
Document Type and Number:
WIPO Patent Application WO/2024/086674
Kind Code:
A1
Abstract:
Systems and methods for subgraph matching using active learning in accordance with embodiments of the invention are illustrated. One embodiment includes a method for subgraph matching with active learning. The method includes steps for encoding vertices of a world graph into a plurality of candidate sets, applying a first filter to each of the plurality of candidate sets to reduce the number of candidates in each of the plurality of candidate sets, querying for a template vertex in a corresponding candidate set from the plurality of candidate sets, determining a final candidate matching the corresponding template vertex for each of the plurality of the candidate sets, and returning a number of total queries made.

Inventors:
BERTOZZI ANDREA (US)
GE YURUN (US)
Application Number:
PCT/US2023/077234
Publication Date:
April 25, 2024
Filing Date:
October 18, 2023
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
UNIV CALIFORNIA (US)
International Classes:
G06N3/08; G06N3/02; G06N5/04; G06N20/00; G06F18/2323; G06N3/04
Attorney, Agent or Firm:
SUNG, Brian, K. (US)
Download PDF:
Claims:
WHAT IS CLAIMED IS:

1 . A computer-implemented method for subgraph matching with active learning, the method comprising: encoding vertices of a world graph into a plurality of candidate sets; applying a first filter to each of the plurality of candidate sets to reduce the number of candidates in each of the plurality of candidate sets; querying for a template vertex in a corresponding candidate set from the plurality of candidate sets; capturing external information from a user concerning one or more vertices in the candidate set; using external information to eliminate one or more candidates from the candidate set; determining a final candidate matching the corresponding template vertex for each of the plurality of the candidate sets; and outputting subgraph isomorphisms of template graph in the world graph based on final candidates.

2. The method of claim 1 , wherein the each of the plurality of candidate sets of vertices correspond to a vertex of a template graph such that the vertex of the template graph can be mapped to a vertex in a candidate set.

3. The method of claim 1 , wherein the first filter is selected from the group consisting of Node Label Filter, Node-level Statistics Filter, Topology Filter, Repeated-Sets Filter, Neighborhood Filter, and Elimination Filter.

4. The method of claim 3, wherein the choice of the filter is task-specific.

5. The method of claim 1 , wherein the querying comprises using a querying algorithm to select an optimal vertex in the template graph to query.

6. The method of claim 1 , wherein the querying comprises using external information to determine if the template vertex is matched.

7. The method of claim 1 , wherein determining a final candidate comprises applying a second filter to reduce the number of candidates in the candidate set based upon the querying and external information.

8. The method of claim 7, wherein the second filter is the same as the first filter.

9. The method of claim 7, wherein the querying and applying a second filter are performed iteratively to determine the final candidate in the candidate set.

10. The method of claim 9, wherein each iteration of querying comprises adding the used external information into the querying algorithm.

11 . The method of claim 1 , further comprising determining a total number of subgraph isomorphisms (Sis) between the template graph and world graph.

12. The method of claim 10, wherein the total number of Sis is determined by a spanning-tree-based estimation.

13. The method of claim 11 , wherein the spanning-tree-based estimation comprises indexing the template vertices.

14. The method of claim 1 , wherein the first filter is based on basic properties of subgraph isomorphisms.

15. The method of claim 1 , wherein the querying comprises selecting a querying algorithm that solves subgraph matching using the fewest number of queries.

16. A subgraph matching system using active learning, the system comprising: a computing device, comprising: at least one processor; a memory; and at least one non-transitory computer-readable media comprising program instructions that are executable by the at least one processor such that the system is configured to carry out the method of claims 1-15.

Description:
SYSTEMS AND METHODS FOR SUBGRAPH MATCHING USING ACTIVE LEARNING

STATEMENT OF FEDERAL SUPPORT

[0001] This invention was made with government support under Grant Number 2027277, awarded by National Science Foundation and under Grant Number FA8750- 18-2-0066, awarded by the U.S. Department of Defense, Defense Advanced Research Projects Agency. The government has certain rights in the invention.

CROSS-REFERENCE TO RELATED APPLICATIONS

[0002] The current application claims the benefit of and priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 63/379,989 entitled “Active Learning for the Subgraph Matching Problem” filed October 18, 2022. The disclosure of U.S. Provisional Patent Application No. 63/379,989 is hereby incorporated by reference in its entirety for all purposes.

FIELD OF THE INVENTION

[0003] The present invention generally relates to graph theory, more specifically, the invention relates to subgraph matching using supervised machine learning algorithms.

BACKGROUND

[0004] Graphs are non-linear mathematical structures that consist of pairs of sets of vertices and edges. Graph theory refers to the study of graphs, which can offer insights into the relationships between different vertices and edges between different graphs and can be used to model pairwise relations between objects. Graph theory is an integral component in many areas of modern science, including computer science, artificial engineering, machine learning, deep learning, data science, and social networks.

[0005] As data becomes more conveniently collected and stored, large amounts of unlabeled data give data scientists more data than they are capable of analyzing. Semisupervised learning (SSL) is an approach to machine learning that involves a small amount of labeled data with a large amount of unlabeled data to achieve an accurate classification with significantly fewer training points.

[0006] Active learning is a subset of SSL that interactively queries a human agent to label a selected number of unlabeled data points with the desired outputs. Since manual labeling of these growing amounts of data can be expensive and time-consuming, active learning is becoming an appealing option to process these unlabeled data to improve the underlying SSL classifier’s performance.

SUMMARY OF INVENTION

[0007] Systems and methods for subgraph matching using active learning in accordance with embodiments of the invention are illustrated. One embodiment includes a method for subgraph matching with active learning. The method includes steps for encoding vertices of a world graph into a plurality of candidate sets, applying a first filter to each of the plurality of candidate sets to reduce the number of candidates in each of the plurality of candidate sets, querying for a template vertex in a corresponding candidate set from the plurality of candidate sets, determining a final candidate matching the corresponding template vertex for each of the plurality of the candidate sets, and returning a number of total queries made.

[0008] In a further embodiment, the each of the plurality of candidate sets of vertices correspond to a vertex of a template graph such that the vertex of the template graph can be mapped to a vertex in a candidate set.

[0009] In still another embodiment, the first filter is selected from the group consisting of Node Label Filter, Node-level Statistics Filter, Topology Filter, Repeated- Sets Filter, Neighborhood Filter, and Elimination Filter.

[0010] In a still further embodiment, the choice of the filter is task-specific.

[0011] In yet another embodiment, the querying comprises using a querying algorithm to select an optimal vertex in the template graph to query.

[0012] In a yet further embodiment, the querying comprises using external information to determine if the template vertex is matched. [0013] In another additional embodiment, determining a final candidate comprises applying a second filter to reduce the number of candidates in the candidate set based upon the querying and external information.

[0014] In a further additional embodiment, the second filter is the same as the first filter.

[0015] In another embodiment again, the querying and applying a second filter are performed iteratively to determine the final candidate in the candidate set.

[0016] In a further embodiment again, each iteration of querying comprises adding the used external information into the querying algorithm.

[0017] In still yet another embodiment, determining a total number of subgraph isomorphisms (Sis) between the template graph and world graph.

[0018] In still another additional embodiment, the total number of Sis is determined by a spanning-tree-based estimation.

[0019] In a still further additional embodiment, the spanning-tree-based estimation comprises indexing the template vertices.

[0020] In still another embodiment again, the first filter is based on basic properties of subgraph isomorphisms.

[0021] In a yet further additional embodiment, the querying comprises selecting a querying algorithm that solves subgraph matching using the fewest number of queries.

[0022] Additional embodiments and features are set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the specification or may be learned by the practice of the invention. A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings, which forms a part of this disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

[0023] The description and claims will be more fully understood with reference to the following figures and data graphs, which are presented as exemplary embodiments of the invention and should not be construed as a complete recitation of the scope of the invention.

[0024] Fig. 1 illustrates an example subgraph matching problem (SMP) in a multichannel setting in accordance with an embodiment of the invention.

[0025] Fig. 2 illustrates a conceptual workflow of subgraph matching using active learning in accordance with an embodiment of the invention.

[0026] Fig. 3A illustrates a process for performing subgraph matching in accordance with an embodiment of the invention.

[0027] Fig. 3B illustrates a process of adjusting template graphs based on filtering in accordance with an embodiment of the invention.

[0028] Fig. 4 illustrates a pseudocode of the subgraph matching algorithm using active learning in accordance with an embodiment of the invention.

[0029] Fig. 5 illustrates an edge entropy computation in accordance with an embodiment of the invention.

[0030] Fig. 6 illustrates a pseudocode of TreeEPM algorithm in accordance with an embodiment of the invention.

[0031] Fig. 7 illustrates a process of estimating the EPMs for the candidate of the selected vertex with label 3 in accordance with an embodiment of the invention.

[0032] Fig. 8 illustrates a computing device that can be utilized to perform subgraph matching with active learning in accordance with an embodiment of the invention.

[0033] Fig. 9 illustrates a network architecture for subgraph matching using active learning in accordance with an embodiment of the invention.

DETAILED DESCRIPTION

[0034] Subgraph matching is one of the fundamental problems in graph theory that involves finding a smaller template graph, known as a subgraph, within a larger world graph. Subgraph matching is a common task that has applications in many fields, such as computer science, cheminformatics, biology, and social network analysis. The goal of subgraph matching is to identify patterns or structures within the template graph that are isomorphic to the world graph. Two graphs are isomorphic if there exists a bijection between their vertices and edges such that the adjacency relation is preserved. This can be useful for identifying relationships between different entities, predicting future behavior, and detecting anomalies or outliers. Graph theory provides a powerful framework for analyzing complex systems, and subgraph matching is an important tool for exploring these systems.

[0035] While the objective of the subgraph matching problem (SMP) may be straightforward, it can be computationally challenging to actually solve an SMP, especially for larger world graphs. A good subgraph matching algorithm should be exact, which means that it needs to find all isomorphic subgraphs in the world graph. Partial matches can be helpful towards locating exact matches, but the overall goal of a subgraph matching algorithm is generally to find all exact matches. However, it is computationally intensive for a subgraph matching algorithm to be exact since the number of possible subgraphs of a graph can grow exponentially with the size of the graph. The world graph may have additional vertices and edges that are equally good candidates for components of the template graph, and the template may have vertices and edges that are interchangeable, which can create confusion for the algorithm. Additionally, subgraph matching algorithms must be able to handle noise and errors in the data.

[0036] In a real-world use-case scenario, it may be necessary to identify one specific subgraph isomorphism (SI) by restricting the candidate vertices and edges in the world graph. In some cases, there may be more than one SI of relevance. These potential conditions could lead to additional constraints being imposed on the algorithm, and with more constraints added to the algorithm, the complexity of the algorithm may increase correspondingly. Therefore, to solve the SMP in a real-world application, it can be beneficial to explore strategies to reduce the complexity of the solution space with minimal cost. This can, in turn, help with simplifying the algorithm.

[0037] Active learning is an area in statistical machine learning that involves a subject matter expert (SME) participating in the actual algorithm for aiding the classification of points in a dataset. Supervised machine learning algorithms generally require an abundance of labeled data. However, in a real-world dataset, unlabeled data is common, and accurate labeling may require human involvement that cannot be crowd-sourced due to privacy or security reasons. Semi-supervised machine learning methods are therefore often superior in a real-world application, as they use significantly fewer training points. Active learning in numerous embodiments of the invention involves the use of an algorithm or formula to choose individual data points for labeling by an SME, and the newly labeled data are included back into the semi-supervised learning process. In the context of subgraph matching, active learning could simplify the sorting and matching process through the use of external information brought forth by an SME.

[0038] Current methods to perform subgraph matching do not typically involve active learning and mostly use approaches such as tree search, constraint propagation, and graph indexing. However, these methods can be very computationally intensive. For example, constraint propagation keeps a record of all vertices in the world graph that may be possible matches for each template vertex. While a combination of constraint propagation and tree search can solve the SMP, the computational load of this combination may be high. Other algorithms are approximate, meaning that they can be much faster, but there is no guarantee that they can find all isomorphic subgraphs.

[0039] Systems and methods in accordance with many embodiments of the invention can incorporate active learning into subgraph matching to reduce the number of computations required to find the matching solution set. In many embodiments, a system or process determines the optimal vertices for an SME to obtain external information using a novel querying algorithm. The SME can review the optimal vertex(s) identified by the querying algorithm and introduce any additional external information they deem to be helpful in deciding whether the vertex being reviewed is indeed an isomorphism of a vertex in the template graph. External information can include, for example, an indication that the vertex is correctly matched, adding additional one or more labels or edges to the world graph, and/or adding additional characteristics or values to one or more vertices or edges. The external information can be fed back into the subgraph matching algorithm such that the next query can be made with the external information incorporated in it. The inclusion of the external information can further filter the possible isomorphisms in the world graph, which allows the querying algorithm to be more focused on locating the correct vertices that are isomorphisms of the template graph. In numerous embodiments, the querying and filtering loop is performed iteratively until the SMP is solved and a comprehensive solution set is obtained.

Subgraph Matching

[0040] Subgraph matching in a real-world scenario typically involves complex multichannel graphs that have vast solution spaces. A multichannel graph g = (V,8, £, C) can be defined as a set of vertices, directed edges between the vertices, labels on the vertices, and channels on the edges. The number of vertices is denoted n. Each vertex v E V has a label £(v) belonging to some arbitrary set of labels. There can be any number of edges between each pair of vertices (u, v) in either direction. Each edge belongs to one of the channels C. Edges between the same pair of vertices in the same channel with the same direction are indistinguishable. The function 8\ V x V -> H |c| describes the number of edges in each channel between each pair of vertices. In particular, 8(u, v) can be represented as a |C| -dimensional vector, the k th element of which is the number of edges from vertex u to vertex v in the k ,h channel.

[0041] In a multichannel graph, two vertices are adjacent or neighboring if there is an edge between them in either direction in any channel. An edge is incident to a vertex if the vertex is one of the edge’s endpoints. The degree of a vertex v can be denoted as deg(v) in a multichannel graph as the number of edges incident to a vertex. The neighborhood of a vertex v, denoted as /V(v), is the set of all adjacent vertices. A path is a finite sequence of distinct vertices such that any two consecutive vertices are adjacent. A subgraph of a multichannel graph g - (y, 8,£,C) is a multichannel graph g' - (y’,8',£',C') with V C V, C' = C, £'O) = £(v) for all v e V, and 8'(u, v) < 8(u,v) for all U, V E V.

[0042] The SMP can be succinctly stated: given two multichannel networks, a template graph g t = (V t , 8 t , £ t , C t ) and a world graph g w = (V w , 8 W , £ w , C) , we wish to find all subgraphs of the world graph that match the template graph. For the subgraphs of the world graph to match the template graph means that the subgraphs are isomorphisms of the template graph. An injective function f: V t V w is called a SI or subgraph matching and the set of all Sis from Q t to Q w is denoted This definition can be used to define isomorphisms in which the world graph has more edges than the template. If the function additionally satisfies (2) at equality, it is an induced SI. A function which satisfies (1 ) and (2) but is not necessarily injective is an edge-preserving map (EPM).

[0043] Fig. 1 illustrates an example SMP in a multichannel setting in accordance with an embodiment of the invention. Solving the SMP illustrated in Fig. 1 would require the algorithm to locate all the combinations of vertices and edges in the world graph that are isomorphisms of the template graph. Even in an SMP of a smaller scale illustrated in Fig. 1 , there are four Sis corresponding to mapping the template vertices (A, C, B) to each of the four circled sets of vertices (4, 7, 5), (7, 10, 9), (1 , 6, 8), (1 , 6, 2). With subgraph problems involving larger multichannel graphs, the potential number of solutions may increase even further. Hence, integrating active learning into the SMP stems from a desire to increase matching accuracy and reduce computational load when using subgraph matching in a real-world application.

[0044] A conceptual workflow of subgraph matching using active learning in accordance with an embodiment of the invention is illustrated in Fig. 2. In many embodiments; the system obtains candidate sets for all template vertices using a filter-based subgraph matching algorithm. An active learning algorithm can be used to determine one or more optimal vertices for an SME to review. The SME can review the suggested vertices using external information they deem to be helpful in classifying the suggested vertices. The SME’s decision and the external information used by the SME to make the decision can be fed back into the subgraph matching algorithm to further filter the candidate sets.

[0045] Fig. 3A illustrates a process for performing subgraph matching in accordance with an embodiment of the invention. In several embodiments, vertices of world graphs are first encoded into a plurality of candidate sets. In many embodiments, candidate sets are sets of world vertices to which a given template vertex t G V t may be mapped. In several embodiments, a template vertex is isomorphic to a world vertex if the template vertex can be mapped onto the world vertex. A candidate set for template vertex t may be denoted by C(t) and excludes any world vertex that has been ruled out as a candidate for t.

[0046] Process 300 filters (305) a candidate set. Filtering is the process of ruling out candidates. In many embodiments, filtering is performed using basic properties of Sis known to be true. Examples of these basic properties can include, but are not limited to, that template vertices must map to vertices of higher degree and that neighboring template vertices map to neighboring world vertices. Various techniques for filtering that may be utilized in accordance with embodiments of the invention are be further discussed below.

[0047] In many embodiments, the filtering may produce exact and inexact matches. If the filtering does not produce any exact matches, this can mean that there are not candidates for all template vertices in the template graph. Adjustments can be made to the template graph to locate exact matches. This adjustment process will be discussed in further detail below.

[0048] Once there are candidates for all template vertices, process 300 queries (310) for an optimal template vertex in the corresponding candidate set. Template vertex may be a vertex from a template graph. In many embodiments, the optimal template vertex may be identified by querying algorithms discussed further below. In numerous embodiments, querying involves an algorithm that selects an optimal template vertex to query. External information from SMEs may be introduced in this step to incorporate the active learning process, as the system can seek external information to determine the assignment of the selected template vertex in the world graph. In many embodiments, the querying algorithm can further filter the candidate set based on the previous querying result and the external information introduced.

[0049] Process 300 determines (315) if external information by an SME would aid in matching based on the results of the query. If external information is necessary, process 300 can include (320) external information provided by an SME. External information provided by an SME can be included into the next query. Process 300 checks (325) whether the optimal template vertex has been matched. In numerous embodiments, external information can take the form of labels where the SME can help decide whether the queried vertex does indeed match with the candidate in the world graph. If there are no matches, process 300 can return to 310 to query for the optimal template vertex in other candidates in the candidate set.

[0050] If systems determine that external information is not necessary, process 300 determines (330) a final candidate matching the corresponding template vertex in each candidate set. In numerous embodiments, querying may be an iterative process where systems can query and filter the candidate sets of each template vertex alternately until there is only one candidate left in the set. Once filtering and querying have reduced the size of the candidate sets of each template vertex to one candidate, the matching has been determined. Process 300 outputs (335) subgraph isomorphisms of a template graph in the world graph based on the final candidates. In several embodiments, systems can return the number of total queries made to output the subgraph isomorphisms.

[0051] While specific processes for subgraph matching using active learning are described above, any of a variety of processes can be utilized to perform subgraph matching using active learning as appropriate to the requirements of specific applications. In certain embodiments, steps may be executed or performed in any order or sequence not limited to the order and sequence shown and described. In a number of embodiments, some of the above steps may be executed or performed substantially simultaneously where appropriate or in parallel to reduce latency and processing times. In some embodiments, one or more of the above steps may be omitted.

[0052] Fig. 3B illustrates a process of adjusting template graphs based on filtering in accordance with an embodiment of the invention. Process 350 filters (355) a candidate set. Process 350 obtains (360) all exact and inexact matchings based on the filtering. The concept for inexact matches can be found in T. K. Tu, J. D. Moorman, D. Yang, Q. Chen, A. L. Bertozzi, “Inexact Attributed Subgraph Matching,” 2020 IEEE International Conference on Big Data, pp. 2575-2582, 2020, the relevant disclosure of which is hereby incorporated by reference in its entirety. [0053] Process 350 can determine (365) if there are candidates for all template vertices of a template graph based on number of exact matchings. In many embodiments, if there are no exact matchings, then there are no candidates for all template vertices. Process 350 can modify (370) the template graph based on the inexact matchings. In many embodiments, modifications can be made to the template graph to broaden the search. In several embodiments, there may be inexact matches based on the filtering, and template graphs can be modified based on the information presented by the inexact matches.

[0054] While specific processes for adjusting template graphs based on filtering are described above, any of a variety of processes can be utilized to adjust template graphs as appropriate to the requirements of specific applications. In certain embodiments, steps may be executed or performed in any order or sequence not limited to the order and sequence shown and described. In a number of embodiments, some of the above steps may be executed or performed substantially simultaneously where appropriate or in parallel to reduce latency and processing times. In some embodiments, one or more of the above steps may be omitted.

[0055] In many embodiments, filtering is a key tool to narrow down candidate sets to solve the subgraph matching problem. One example filter is a node label filter. The node label filter requires that in order for a world vertex v w to be a candidate for a template vertex v t , the label Z w (v w ) must match the label £ t (v t ).

[0056] Node-level statistics filter can also be an option for filtering, which provides that for a world vertex v w to be a candidate for a template vertex v t , certain statistical properties of v w should not be less than those of v t .

[0057] A topology filter can also be used as a filter, where the topology filter enforces the constraint that if v w is a candidate for template vertex v t , then for every template vertex u t neighboring v t , there must exist a candidate u w for u t that neighbors v w . The natural extension to multiplex networks is that if v w is a candidate for template vertex v t , then for every template vertex u t neighboring v t , there must exist a candidate u w for u t that has as many edges in each channel and direction between v w and u w as there are between v t and u t . [0058] Repeated-sets filter may be another option, which provides that for a world vertex v w to be a candidate for template vertex v t , there must exist some injective mapping from the template vertices to their candidates under which v t is mapped to v w . Repeated set filter may be known as the “all-different” filter. To enforce the repeated-sets filter, we identify sets of template vertices T £ v t where the union of their candidates U Vte; rDOt) has the same cardinality as T . These are known as tight sets. Candidates for template vertices in a tight set cannot be candidates for any vertices outside of the tight set, since this would violate Repeated-sets filter.

[0059] Another option can be the neighborhood filter, which extends the local all- different constraint to the multiplex network case. In the undirected single-layer graph context, the local all-different constraint provides for a world vertex v w to be a candidate for a template vertex v t , there must be some injective mapping ft from the neighbors of v t to their candidates under which /j(u t ) is a neighbor of v t for each u t .

[0060] Elimination filter can also be an option, which filters by identifying any contradictions that would result from them being assigned. For each template vertexcandidate pair (v t , v w ), v w may be assigned to v t and iterate over all other filters until convergence. If this results in one or more template vertices having no candidates, then v t cannot be mapped to v w and we eliminate v w as a candidate for v t .

[0061] In numerous embodiments, systems can alternately filter the candidate sets of each template vertex based on a combination of at least one filter described above. The choice of a filter to use to eliminate candidates can be task-specific. Example choices of a filter are discussed in J. D. Moorman, T. K. Tu, Q. Chen, X. He, A. L. Bertozzi, “Subgraph Matching on Multiplex Networks,” IEEE Transactions on Network Science and Engineering, vol. 8, no. 2, pp. 1367-1384, 1 April-June 2021 , the relevant disclosure of which is hereby incorporated by reference in its entirety.

[0062] Fig. 4 illustrates a pseudocode of a subgraph matching algorithm using active learning in accordance with an embodiment of the invention. Querying and Active Learning

[0063] The purpose of incorporating active learning into subgraph matching is so that the querying can be more pointed and direct. When the querying is highly effective, more information can be imparted into the learning process of the algorithm. The algorithm itself is able to filter out vertices from candidate sets more effectively and efficiently with more information. As more iterations of the matching process are performed, the algorithm may become increasingly effective.

[0064] There are various querying strategies incorporating active learning that can be used for the subgraph matching process, e.g., in processes such as those described above. In some embodiments, the system utilizes local template centrality-based strategies to query for the isomorphisms of the template vertices in the world graph. Assume that querying a template vertex can reduce the candidate set on neighboring vertices more than other vertices; the local template centrality-based strategies are based on the local regions containing neighbors of template vertices and prefer vertices with higher degrees. Local template centrality-based strategies can be effective in reducing the solution space with a limited number of queries. An example of a local template centrality-based strategy to reduce the solution space is the maximum degree strategy, where the system chooses the template vertex with the highest degree: t = argmax deg(t) (3)

In some embodiments, the max sum of candidates strategy may be used, where the vertex with the largest sum of the number of candidates for neighboring template vertices and itself is chosen:

[0065] In selected embodiments, the edge entropy strategy may be used to determine which template vertex to query: log -P/F(t = w) (5) Assuming that t E V t , t' E is the number of candidates for t’ if t is matched to w G C(t) , can be an estimate for the probability that t is mapped to w based on the possible mappings of the edge with endpoints t’ and t.

[0066] Fig. 5 illustrates an edge entropy computation in accordance with an embodiment of the invention. Edge A may be mapped to nine possible candidate edges in the world, represented by the three cases that edge A can map to. In the left scenario, there are three edges connected to the selected world vertex. Therefore, the probability in this case is 1/3. Similarly, the probability for the remaining edges are 2/9 and 4/9. Hence, the edge entropy of this edge i

[0067] Probabilistic query strategies can be a separate branch of strategies that may be used to query for template vertices. Probabilistic query strategies attempt to estimate the probability with which a template vertex is assigned to a given world vertex. In many embodiments, probabilistic query strategies involve granting all Sis between a given template graph and a world graph equal probabilities. The probability that a template vertex t E V t and world vertex w E V W are paired together is simply the proportion of Sis where they are matched:

If the assignment of certain template vertices to world vertices is already known, the above definition can be adjusted to restrict F(£ t , g w ) to those with the known assignments.

[0068] In many cases, directly computing all Sis may be computationally intractable due to the NP-complete nature of simply finding one. Instead of fully enumerating all Sis, the quantities can be approximated based on local structures. An example structure that can be used to approximate Sis is the immediate neighborhood. An approximation for the number of solutions can be obtained by asking how many mappings there are from the neighbors of a template vertex t to the neighbors of a world vertex w. In other words, the set of local Sis (LSI) for any template vertex, world vertex pair (t, w) can be defined in the following manner:

This is the set of Sis from the neighborhood of t to the neighborhood of w with the additional requirement that each template vertex has candidates inherited from the original SI problem. An associated probability can be defined by:

Computing the number of LSIs corresponds to counting mappings between N(t) and N(w) and ensuring that each template vertex is assigned a different world vertex. In the constraint programming framework, this is referred to as an all-different problem. This problem can be reduced to that of finding the permanent of a 0-1 matrix, which lies in the P#-complete complexity class, suggesting that even this problem may be difficult to solve efficiently. In practice, the problem sizes are often small enough, and the symmetry in the candidate sets can be utilized to solve these relatively quickly.

[0069] In the case that this problem is still too costly to solve, a crude approximation to the number of LSIs may be computed by removing the injectivity requirement where the number of local edge preserving mappings (LEPM) are counted. This quantity is very easy to count as it is just given by the product of the sizes of each of the candidate sets for all neighbors:

The associated probability can be defined as:

A rough approximation of the number of Sis can also be obtained by instead considering

EPMs for a spanning tree of the template. The number of EPMs between a spanning tree T of template g t and world g w , where t is mapped to w by STEPM(g t ,g w , t, w, T) . An associated probability can be computed in the same manner:

[0070] A variety of strategies for selecting query vertices can be designed after computing the associated probabilities. Examples of strategies in active learning for determining queries include minimum confidence sampling, margin sampling, and maximum entropy sampling. Minimum confidence sampling refers to selecting the template vertex that is least confident about its most likely assignment, which can be defined as:

Margin sampling refers to selecting the template vertex whose two most probable assignments are closest together in probability. If w* = arg max P(t = w), then this is wec(t) given by:

Maximum entropy sampling selects the template vertex which maximizes entropy:

[0071] In some embodiments, a query strategy can be determined by analyzing the symmetry apparent in the subgraph matching problem. Symmetry is well-known for confounding general combinatorial problems by dramatically expanding the solution space. Algorithms that exploit symmetry can significantly reduce search time, sometimes by an exponential factor, for highly symmetric problems. These algorithms generally identify vertices that are effectively interchangeable in that exchanging them in a SI will produce another valid isomorphism.

[0072] There are various notions of symmetry that can be discussed in the context of the subgraph matching problem. One example is structural equivalence, where two vertices are deemed to be structurally equivalent when they have the exact same set of neighbors (excluding each other). These vertices can be interchanged in any isomorphism, and the new mapping will also be an isomorphism. In many embodiments, venn diagram representations can be used to represent intermediate solutions in intersections of candidate sets. The concept of structural equivalence in subgraph matching can be found in D. Yang, Y. Ge, T. Nguyen, D, Molitor, J. D. Moorman, A. L. Bertozzi, “Structural Equivalence in Subgraph Matching,” IEEE Transactions on Network Science and Engineering, vol. 10, no. 4, pp. 1846-1862, 1 July-Aug, 2023, the relevant disclosure of which is hereby incorporated by reference in its entirety. A more complicated notion which includes structural equivalence is automorphic equivalence. Two vertices are automorphically equivalent if there is an automorphism c/y.V ^ V on the graph mapping one vertex to the other. Then, given this automorphism (p and any SI f, we can construct a new SI f ° 0.

[0073] In the context of the active learning problem, symmetry can be an important factor to consider when deciding which vertices to query. In a group of M structurally equivalent template vertices, for any SI f, M! - 1 , additional isomorphisms can be constructed by simply permuting the images of these vertices. All of these isomorphisms may be valid based on the information apparent in the problem. Discerning which permutation is the true solution necessarily requires querying at least M - 1 of these vertices (the last may be determined by the process of elimination). In an example satisfiable SI problem with template g t = (V t , 8 L ) and world g w = (V w , £ w ), where V t = partitions the template vertices into structural equivalence classes, then at least - 1) template queries are needed to identify a unique solution.

[0074] Discerning between isomorphisms that are generated by applying an automorphism to the template graph can be more challenging. Relevant to this task is the notion of a base of the automorphism group of a template graph. A base of an automorphism group is a sequence of vertices (vi, V2, ... , v n ) such that for any pair of automorphisms #= <p 2 , the two sequences (0i(vi), 0 1 (v 2 ), -,0i(vn))' and (02 (TIX 0 2 (V 2 ), ...,are 0 2 (^n)) are distinct. This definition implies that the values of (|) on vi, ... , Vn can uniquely identify 4>. Hence, distinguishing isomorphisms generated by the automorphism group can involve querying all vertices of a base of the automorphism group. Spanning-Tree-Based Estimation of SI Count

[0075] For larger multichannel graphs with long paths, a spanning tree-based estimation can be more computationally tractable. In some embodiments, the spanning tree-based estimation method is based on the estimation method set forth in the CFL-Match algorithm, which proposed a new data structure called the compact path index (CPI). CPI can be defined by the rule: there is an edge between v e C(u) and v 0 e C(u 0 ) for adjacent vertices u and uo in CPI if and only if S(v,v 0 ) = 1 in G. The data structure can be used to estimate the frequency of solutions when the template is a path graph and can be extended to the case when the template is a tree. In such a case, the estimation is the number of edge-preserving mappings generated by the current candidate list. For a more general template, a breadth first search (BFS) starting from the selected vertex can be used to generate a spanning tree of the template and perform the estimation using the spanning tree.

[0076] In certain embodiments, the number of possible edge-preserving mappings to the current candidate set can be used to approximate the number of Sis. By removing the injectivity constraint, the system may bypass the all-different problem and complete the estimation in polynomial time.

[0077] A graph T can be called a path graph if there exists an indexing of template vertices V t = such that the edge set of the template satisfies the following conditions: S t (x i , Xj') > 0 were x k is defined as the root of the path graph. If x k is the root of the path graph template, then the template vertices can be indexed by x 1 ,x 2 , ...,x k such that edges only exist in vertices with consecutive indices. If the template is a path graph with a root x k , the solution frequency estimation problem with a given vertex x k is called the path graph root estimation problem.

[0078] In a template graph t where x k is a given vertex of t, and where C(xj)[/] represents the j th element in the candidate set C(xJ, the solution frequency estimation problem can approximate the cardinality of the following sets of SI solutions f-. {f G x fc )| . The CPI adjacency matrix M u+1 is a 0- 1 matrix of size |C(x £ )| x |C(x £+1 )| , where if f w (C(x £ )[a], C(x £+1 )[<b]) > 0 , then M i i+1 [a, b] - 1. Otherwise, M i i+1 [a, b] - 0. The CPI graph can then be defined as a single-channel graph with Ei=iK(X)l vertices. The vertices can be indexed by the following set: V = {(x £ , C(x £ ) [/]), i = 1,2, = 1,2, ..., |C(x £ )|}. This means that each vertex corresponds to a unique template-candidate pair. The edge set is defined to be:

Otherwise,

This means the edges of the CPI graph contain all the template-candidate pairs such that the template vertex and candidate are simultaneously neighboring to each other. In the setting of the path graph template root estimation problem, the exact number of edgepreserving mappings for path graph templates can be calculated by calculating the cardinality of edge-preserving mappings:

[0079] In many embodiments, the definition of a tree in graph theory can be extended to multichannel graphs to determine the SI count. A tree in graph theory is defined for single-channel graphs as a connected acyclic undirected graph. A tree is also a notion in data structure usually visualized by a tree in the context of graph theory and stores the connection and hierarchy of the vertices. These two definitions of a tree are linked together by assigning a root to the tree. For a directed or undirected multichannel graph G, the underlying graph G u is a single channel, undirected simple graph which can be defined as:

A multichannel graph G has a tree structure if and only if the underlying graph G u is a tree. If G is a tree or G is a multichannel graph with tree structure and u G Vg, then the pair (6, u) is called a rooted tree with root u. For a rooted tree (T, r), where w is any vertex other than r. If G is a tree, then there is a unique path P in G that connects w and r. If G has a tree structure, then there is a unique path P in G u that connects w and r. If P = (v 0 , v 1 , v 2 , ..., v k ), where v 0 = r, v k = w. Then called the parent of w, and w is called the child of v k-1 . are called the ancestors of w, and w is the descendant of v 0 , v lt v 2 , ... , v k-r . Vertices with the same parent are called siblings to each other. If (T,r) is a rooted tree, and w G V T . The subgraph T w generated by w and all its descendants also has a tree structure. (T w ,w) is called the subtree of (T,r) with root w. [0080] Equation (15) may be denoted as pathEPM and uses the following divide and conquer algorithm denoted TreeEPM to calculate the number of EPMs. Fig. 6 illustrates a pseudocode of the TreeEPM algorithm in accordance with an embodiment of the invention. In many embodiments, the CPI matrices are precalculated and stored in a hash map M to avoid repeat calculations. The CPI matrix for template vertices x and y may be denoted as M(x,y). As illustrated in Fig. 6, 0 denotes the element-wise product of two vectors. TreeEPM can return the exact number of EPMs regarding each candidate vertex of the selected root x. The cardinality of edge-preserving mappings can be computed by |£g G EPM(T)|^(x) = C(x)[/]}| = TreeEPM(T,x, M~)\j].

[0081] A general template does not naturally generate a tree structure. Therefore, to extend the spanning tree method to a general template, in many embodiments, a breadth first search is utilized to produce a spanning tree of the template. The number of EPMs on the general template can be estimated by TreeEPM.

[0082] Then, we give the following estimation algorithm for a general template. For a selected vertex x, the TreeEPM algorithm can be used on the rooted tree structure with root x. The algorithm can return a vector of |C(x)| length, recording an estimate of the number of EPMs that map x to each element in C(x). Fig. 7 illustrates a process of estimating the EPMs for the candidate of the selected vertex with label 3 in accordance with an embodiment of the invention. The underlying graph of the template is shown on the top right of Fig. 7. By using the BFS starting from vertex 3, a rooted tree with parent/children relations shown in the bottom left of Fig. 7 can be obtained. By looking at the subtrees, the estimation can be calculated by the Cartesian product of the estimation from three subproblems with smaller sizes.

[0083] In many embodiments, the complexity analysis of the algorithm has three steps, including the calculation of CPI matrix hash map M, the multiplication of matrices in the path, and the Cartesian product constructed in the tree structure. |£ t |, |£ w |, |S“ |, |£“| denote the number of edges in the template, world, and underlying graphs of the template and world graphs, respectively. C max denotes the maximal size of a candidate set C(x). d eno tes the maximal degree of the template graph, and D“ vg denotes the average degree of the underlying graph of the world graph.

[0084] In many embodiments, an upper bound for the worst-case complexity is computed for the complexity analysis. A CPI matrix for all the adjacent pairs of vertices in the template can be computed such that the number of CPI matrices to compute is equal to N eut , and the size of each CPI matrix is C max x c max . The complexity of this part may be bounded by O(|£“ | (c max ) 2 ) . If a vertex x has been selected to construct the tree, all vertices in the template need to be reviewed. Therefore, the complexity of this part is bounded O(| V t |).

[0085] Then, the matrix of every edge in the BFS tree needs to be multiplied where the number of edges is equal to \V t | - 1. For each vertex in the BFS tree, the size of the Cartesian product is bounded by |V t | - 1, and the cost of computing each Cartesian product is bounded by O(C ma% ). The total cost of this step is bounded by

[0086] All vertices need to be selected to repeat the calculation, and the total time complexity of the algorithm is bounded by . This expression demonstrates that the algorithm can be completed within polynomial time. However, this bound of time complexity can be larger if the number of candidates for any vertex is large. In certain embodiments, the template and the world graph are both sparse. |£“| = O(| V t |) may be assumed from the sparsity of the template. In many embodiments, from the sparsity of the world graph, the expectation of a number of nonzero elements of each column of the CPI matrix is equal to D“ vg . Therefore, the complexity of matrix multiplications for a BFS tree will be O((C max ) 2 lV t lD^ vg ) instead of O((C max ) 3 |V t |) . Hence, in several embodiments, the expectation of total complexity with the assumption of sparsity i

System Implementation

[0087] Processes that provide the methods and systems for performing subgraph matching with active learning in accordance with some embodiments can be executed by a computing device or computing system, such as a desktop computer, tablet, mobile device, laptop computer, notebook computer, server system, and/or any other device capable of performing one or more features, functions, methods, and/or steps as described herein. Fig. 8 illustrates a computing device that can be utilized to perform subgraph matching with active learning in accordance with an embodiment of the invention. Computing device 800 includes a processor 810. Processor 810 may direct the matching application 842 to perform subgraph matching with active learning based on graph data 844 and model data 846. In many embodiments, processor 810 can include a processor, a microprocessor, a controller, or a combination of processors, microprocessor, and/or controllers that performs instructions stored in the memory 840 to perform subgraph matching. Processor instructions can configure the processor 810 to perform processes in accordance with certain embodiments of the invention. In various embodiments, processor instructions can be stored on a non-transitory machine readable medium. Computing device 800 further includes a network interface 820 that can receive graph data from external sources. Computing device 800 may further include a memory 840 to store learning models under model data 846. Computing device 800 may further include peripherals 820 to allow for user control and analysis of the matching process.

[0088] Although a specific example of a computing device is illustrated in this figure, any of a variety of transmitting elements can be utilized to perform subgraph matching similar to those described herein as appropriate to the requirements of specific applications in accordance with embodiments of the invention. [0089] Fig. 9 illustrates a network architecture for subgraph matching using active learning in accordance with an embodiment of the invention. Such embodiments may be useful where computing power is not possible at a local level, and a central computing device (e.g., server) performs one or more features, functions, methods, and/or steps described herein. In such embodiments, a computing device 910 (e.g., personal computer) is connected to a network 920 (wired and/or wireless), where it can receive inputs from one or more computing devices, including data from a records database or repository 930 containing graph data to be used in matching, data provided from a personal computing device, and/or any other relevant information from one or more other remote devices 910 and/or 940. Once computing device 910 performs one or more features, functions, methods, and/or steps described herein, any outputs can be transmitted to one or more computing devices 910 for entering into records.

[0090] In accordance with still other embodiments, the instructions for the processes can be stored in any of a variety of non-transitory machine readable media appropriate to a specific application.

[0091] Although specific methods of subgraph matching using active learning are discussed above, many different methods of subgraph matching using active learning can be implemented in accordance with many different embodiments of the invention. It is therefore to be understood that the present invention may be practiced in ways other than specifically described, without departing from the scope and spirit of the present invention. Thus, embodiments of the present invention should be considered in all respects as illustrative and not restrictive. Accordingly, the scope of the invention should be determined not by the embodiments illustrated, but by the appended claims and their equivalents.