A METHOD AND COMPUTER PROGRAM FOR PREDICTING THE POTENTIAL LIFETIME WINNINGS OF A RACEHORSE

Document Type and Number:

WIPO Patent Application WO/2023/242551

Abstract:

A method of predicting the potential lifetime winnings of a racehorse is disclosed. The method involves the steps of gathering horse data relating to a multiplicity of horses and gathering race data relating to races in which those horses ran, together with results data relating to the results of those races. The gathered data is processed to generate numerical data, and at least one model is generated based on said processed data. A portion of each of the horse data, the race data and the results data relating to a horse in a lineage relating to offspring or potential offspring of a pair of horses is then selected and applied to the model to predict the potential lifetime winnings of the offspring or potential offspring.

Inventors:

BARGHOUTH GHASSAN (GB)

Application Number:

PCT/GB2023/051532

Publication Date:

December 21, 2023

Filing Date:

June 13, 2023

Assignee:

PUREBREED AI LTD (GB)

International Classes:

G07F17/32

Foreign References:

JP2020149583A	2020-09-17
KR20050074226A	2005-07-18
KR20050074225A	2005-07-18

Other References:

VELIE B D ET AL: "Performance selection for Thoroughbreds racing in Hong Kong", EQUINE VETERINARY JOURNAL, R & W PUBLICATIONS, SUFFOLK, GB, vol. 47, no. 1, 1 April 2014 (2014-04-01), pages 43 - 47, XP071655566, ISSN: 0425-1644, DOI: 10.1111/EVJ.12233
THIRUVENKADAN A K ET AL: "Inheritance of racing performance of Thoroughbred horses", LIVESTOCK SCIENCE, ELSEVIER, AMSTERDAM, NL, vol. 121, no. 2-3, 1 April 2009 (2009-04-01), pages 308 - 326, XP025969298, ISSN: 1871-1413, [retrieved on 20080816], DOI: 10.1016/J.LIVSCI.2008.07.009

Attorney, Agent or Firm:

ARCHER, Graham (GB)

Claims:

1. A method of predicting the potential lifetime winnings of a racehorse comprising: gathering horse data relating to a multiplicity of horses said horse data including at least one of date of birth of the horse, country of birth of the horse, pedigree of the horse, and speed/stamina ratio of the pedigree, Dam family summary, genetic strength value; gathering race data relating to a multiplicity of horse races in which a plurality of the horses of said horse data have run, said race data including at least one of raced distance, race type, handicap, race grade, ground type, ground conditions and racecourse; gathering results data relating to the results of the races of said race data, said results data including how much money was won by horses entering said race and further including at least one of which horses of said multiplicity of horses were entered in said race, which of those horses started the race, which of those horses completed the race, trainer name, jockey name, and the finishing times in which those horses finish the race; processing said gathered data to generate numerical data relating to said gathered data; generating at least one model based on said processed data; and selecting a portion of each of said processed horse data, said processed race data and said processed results data relating to a horse in a lineage relating to offspring or potential offspring of a pair of horses of said multiplicity of horses and applying said selected data to said model to predict the potential lifetime winnings of said offspring or potential offspring.

2. A method according to claim 1, wherein said model is a regression model.

3. A method according to claim 2, wherein said regression model comprises at least one of: Random forest; XGBoost; SVM; Linear Models, Ridge and lasso; KNN; Logistic regression; and PCA for feature reduction then SVM for classification.

4. A method according to any preceding claim, wherein said horse data comprises date of birth of the horse, country of birth of the horse, pedigree of the horse, and speed/stamina ratio of the pedigree, family summary, genetic strength value.

5. A method according to any preceding claim, wherein said race data comprises raced distance, race type, handicap, ground type, ground conditions and racecourse.

6. A method according to any preceding claim, wherein said results data comprises how much money was won by horses entering said race, which horses of said multiplicity of horses were entered in said race, which of those horses started the race, which of those horses completed the race, trainer name, jockey name, and the finishing times in which those horses finish the race.

Description:

A Method and Computer Program for Predicting the Potential Lifetime Winnings of a Racehorse

The present invention relates to a method and computer program for predicting the potential lifetime winnings of a racehorse and relates particularly, but not exclusively, to a tool for use by potential purchasers of a young horse at auctions to decrease the investment risk in such purchases.

The cost of buying, keeping, training and racing a thoroughbred racehorse is generally regarded as being very high. Currently, racehorse owners depend largely on the historical performance of a racehorse bloodline, as well as the expertise of a bloodstock agent, to decide which thoroughbred to buy at an auction. Data suggests that 70% of purchased thoroughbreds do not get put forward to race and only between 1 and 4% of horses win a race, with only around 0.2% winning a group 1 race. This means that an owner would have to acquire and train 250-500 horses to obtain one horse that wins a group 1 race. As examples, a horse called Snaafi Dancer was bought for $10.2Mn but failed to enter a race, and Green Monkey was bought for $16Mn but failed to win any of his races. However, a horse called Takeover Target was bought for $1,250 and ended up winning over $6Mn in prize money; similarly, Buffering was bought for $22,000 and won over $7Mn in prize money. It is therefore clear that it is very difficult to predict the potential winnings of a horse, since horses are typically purchased at a young age before they race, and a bloodstock agent's recommendation is limited to qualitative advice, never quantitative.

Preferred embodiments of the present invention seek to overcome or alleviate the above described disadvantages of the prior art. According to an aspect of the present invention there is provided a method of predicting the potential lifetime winnings of a racehorse comprising: gathering horse data relating to a multiplicity of horses said horse data including at least one of date of birth of the horse, country of birth of the horse, pedigree of the horse, and speed/stamina ratio of the pedigree, Dam family summary, genetic strength value; gathering race data relating to a multiplicity of horse races in which a plurality of the horses of said horse data have run, said race data including at least one of raced distance, race type, handicap, race grade, ground type, ground conditions and racecourse; gathering results data relating to the results of the races of said race data, said results data including how much money was won by horses entering said race and further including at least one of which horses of said multiplicity of horses were entered in said race, which of those horses started the race, which of those horses completed the race, trainer name, jockey name, and the finishing times in which those horses finish the race; processing said gathered data to generate numerical data relating to said gathered data; generating at least one model based on said processed data; and selecting a portion of each of said processed horse data, said processed race data and said processed results data relating to a horse in a pedigree relating to offspring or potential offspring of a pair of horses of said multiplicity of horses and applying said selected data to said model to predict the potential lifetime winnings of said offspring or potential offspring.

By providing a method that utilises horse data, race data and results data to generate a model that links the pedigree of a horse to its potential winnings, the advantage is provided that the risk associated with the purchase of a horse is better understood. Potential purchasers therefore have some of the financial risk removed, and a buyer is able to make a better informed decision on what price to pay for the purchase based on data rather than a guess. In particular, the use of a regression model provides a technical solution to the problem of predicting qualitative horseracing success based on pedigree, and provides additional insights including a quantitative assessment of the potential career winnings, a scoring benchmark between horses being sold at an auction, the type of race that is most likely to suit the offspring in question, likely suitable ground conditions, likely suitable racecourse, likely suitable trainer, and likely ideal race distance. Furthermore, this invention is able to take into consideration the race conditions under which the bloodline has won a race, the relative finishing times of the horses running in a given race, and the conditions under which a given horse won a race as a function of the horses it was running against, thereby further improving the output predictions.

In a preferred embodiment the model is a regression model.

In another preferred embodiment the regression model comprises at least one of: Random forest; XGBoost; SVM; Linear Models, Ridge and lasso; KNN; Logistic regression; and PCA for feature reduction then SVM for classification.

The horse data preferably comprises date and country of birth of the horse, and pedigree of the horse.

The race data also preferably comprises raced distance, race type, ground type, ground conditions and race location. The results data preferably comprises how much money was won by horses entering said race, which horses of said multiplicity of horses were entered in said race, which of those horses started the race, which of those horses completed the race and the order in which those horses finish the race.

Preferred embodiments of the present invention will now be described, by way of example only, and not in any limitative sense, with reference to the accompanying drawings in which:

Figure 1 is a flowchart representing steps of an exemplary embodiment of the present invention; and

Figure 2 is a schematic representation of the apparatus utilised in the operation of an exemplary embodiment of the present invention.

A method, system and computer program for predicting the potential lifetime winnings of a racehorse are provided by the present invention. The method 10 is illustrated in a flowchart shown in figure 1 and contains eight steps labelled 12 to 28. The first three stages of this method involve the gathering of data relating to horses (12), to races (14) that these horses run in and to the results (16) of those races. The majority of the data is gathered from existing online databases which between them contain the data necessary to undertake the method of the present invention. These databases include, but are not limited to, Total Performance Data (TPD, https://www.totalperformancedata.com/), The Racing Post (https://www.racingpost.com/) and Pedigree online (https://www.pedigreequery.com/). The data collection process can be divided into two parts: the initial gathering of historic data, which is used in the initial generation of the model, and the addition of new data, which is used to update or refine that model after it is initially generated. The horse data, gathered at step 12, is primarily obtained via Pedigree online and comprises data relating to a multiplicity of horses including, but not limited to:

• date and country of birth of the horse;

• sex of the horse;

• pedigree of the horse;

• Dam family summary; and

• speed/stamina ratio of pedigree up to 5 generations:

- Dosage index

- Dosage profile

- Centre of distribution (CD)

- Genetic strength value (GSV)

The race data, gathered at step 14, is obtained primarily from Total Performance Data and The Racing Post. The race data relates to races participated in by a multiplicity of the horses listed in the horse data and includes, but is not limited to:

• raced distance;

• race grade (class 1 to class 7);

• race type (flat, jumps, jump types);

• ground type (grass, all-weather);

• ground conditions (firm, good, soft, heavy);

• racecourse;

• jockey;

• trainer;

• finishing time;

• stride;

• cross sectional timing; and

• days of rest between races.

The results data, gathered at step 16, is also obtained from Total Performance Data and The Racing Post. The results data relates to the results of the races of the race data and includes, but is not limited to:

• how much money was won by horses entering said race;

• which horses of the multiplicity of horses were entered in the race;

• which of those horses started the race;

• handicap weights carried (if applicable);

• which of those horses completed the race; and

• the order in which those horses finish the race.

The operation of the method of the present invention is undertaken using a system incorporating the features shown, by way of example only, in figure 2. As stated above, the data is gathered from websites, whether freely available or requiring subscription or under license, which form part of the Internet, indicated at 30. The gathered data is stored on a server 32, which is indicated separately in figure 2, but may equally form part of an Internet based cloud storage system and the transfer of data in the data gathering process is indicated schematically with reference numeral 34.

After the data gathering steps 12, 14 and 16, the stored data is transformed, tabulated, inter-linked, cleaned up, prepared and processed in step 18 to ensure consistency and to convert non-numeric data into numeric data which can be used in the data model. One example of ensuring consistency is linking the pedigree data to the runners data. Another example is ensuring that the units of measurement are consistent for measurements involving weights and lengths. For example, handicaps are commonly measured in stones but are also measured in lb and kg; similarly, race distances are often indicated in furlongs but can also be given in miles and yards. Likewise, winnings data may be gathered from different data sources in different currencies, and these can be standardised by conversion to a single currency as well as by correcting for the value of money. Transforming the data into a tabulated form, cleaning the data of errors and duplicates, completing missing information, and linking and validating the data sets from different sources all help to ensure consistency in the data set being used, thereby improving the output from this input data.
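The unit standardisation described above can be sketched as follows. This is a minimal illustration only, not the applicant's implementation: the function names, the choice of kilograms and metres as target units, and the exchange-rate and price-index arguments are all assumptions introduced here. The conversion factors themselves are the standard imperial definitions.

```python
# Standard imperial conversion factors.
LB_PER_STONE = 14
KG_PER_LB = 0.45359237
YARDS_PER_FURLONG = 220
YARDS_PER_MILE = 1760
METRES_PER_YARD = 0.9144


def weight_to_kg(stones=0, pounds=0.0, kilograms=None):
    """Return a carried weight in kilograms, whichever unit it was recorded in."""
    if kilograms is not None:
        return float(kilograms)
    return (stones * LB_PER_STONE + pounds) * KG_PER_LB


def distance_to_metres(miles=0, furlongs=0.0, yards=0.0):
    """Return a race distance in metres from any mix of miles, furlongs and yards."""
    total_yards = miles * YARDS_PER_MILE + furlongs * YARDS_PER_FURLONG + yards
    return total_yards * METRES_PER_YARD


def prize_to_base_currency(amount, rate_to_base, price_index_then, price_index_now):
    """Convert prize money to one currency and correct for the value of money.

    rate_to_base and the two price indices are hypothetical inputs that the
    caller would have to supply from a source of their choosing.
    """
    return amount * rate_to_base * (price_index_now / price_index_then)
```

For instance, `weight_to_kg(stones=9, pounds=7)` and `weight_to_kg(kilograms=60.3)` both yield values on the same kilogram scale, so handicap weights from different sources become directly comparable.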

An example of race data which must be converted to numeric data is ground conditions. When racing on grass in the UK and Ireland the ground conditions are referred to as "the going" using terms such as "firm", "good", "soft" and "heavy" with intermediate steps between these such as "good to soft". The gathered race data includes these terms and these are converted to numerical values using a lookup table such as that shown below.
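A lookup of this kind might be sketched as below. The numeric values are hypothetical and chosen only so that firmer ground maps to a lower number; they are not taken from the application, whose table is not reproduced in this text. The names `GOING_SCALE` and `going_to_number` are likewise illustrative.

```python
# Hypothetical ordinal scale: lower = firmer ground, higher = softer ground.
GOING_SCALE = {
    "firm": 1.0,
    "good to firm": 2.0,
    "good": 3.0,
    "good to soft": 4.0,
    "soft": 5.0,
    "heavy": 6.0,
}


def going_to_number(description):
    """Map a going description to its numeric value, or None if unrecognised."""
    # Normalise case and collapse repeated whitespace before the lookup.
    key = " ".join(description.lower().split())
    return GOING_SCALE.get(key)
```

Returning `None` for an unrecognised term, rather than guessing, leaves room for the record to be flagged for manual review.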

Similar conversions are made for other gathered data that is not already in numeric form. Examples of such data include, but are not limited to, sex of the horse, race grade, race type, ground type and race location. The race grade, also known as the race class, is information which can be derived from the name of the race. For example, "World Pool Handicap (Class 2) (0-105, 4yo+) (1m113yds) 1m½f Good" indicates that the race is a class 2 race, for horses handicapped between 0 and 105, limited to horses aged four years and over, with a race distance of 1 mile ½ furlong.
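Deriving these fields from a race name can be sketched with a regular expression. The pattern below is an assumption fitted to a cleaned-up form of the single example title given above and would need extending for other naming conventions; `parse_race_title` and its output keys are hypothetical names.

```python
import re

# Matches e.g. "(Class 2) (0-105, 4yo+) (1m113yds)" inside a race title.
RACE_TITLE = re.compile(
    r"\(Class (?P<race_class>\d)\)\s*"
    r"\((?P<low>\d+)-(?P<high>\d+),\s*(?P<min_age>\d+)yo\+\)\s*"
    r"\((?:(?P<miles>\d+)m)?(?:(?P<furlongs>\d+)f)?(?:(?P<yards>\d+)y(?:ds)?)?\)"
)


def parse_race_title(title):
    """Extract class, rating band, minimum age and distance (yards) from a title.

    Returns None when the title does not follow the assumed pattern, so the
    race can be flagged for manual entry.
    """
    match = RACE_TITLE.search(title)
    if match is None:
        return None
    g = match.groupdict(default="0")  # absent distance parts count as zero
    yards = int(g["miles"]) * 1760 + int(g["furlongs"]) * 220 + int(g["yards"])
    return {
        "race_class": int(g["race_class"]),
        "rating_band": (int(g["low"]), int(g["high"])),
        "min_age": int(g["min_age"]),
        "distance_yards": yards,
    }
```

Applied to "World Pool Handicap (Class 2) (0-105, 4yo+) (1m113yds)", the sketch yields class 2, a 0 to 105 rating band, a minimum age of 4 and a distance of 1873 yards.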

Similarly, the type of race, for example "flat" or "National Hunt" (over jumps) can also be determined from the name of the race by the application of a set of rules. However, it should be noted that different rules may need to be applied in order to determine the type of race in different countries as different naming conventions are used in different parts of the world. Where the type of race cannot be determined from the name of the race this race is flagged to an operator who is then able to manually enter the data. This generation of numeric data from the non-numeric data is indicated at step 20 in figure 1.

An important aspect of the horse data is the lineage or pedigree data, for the sires as well as the dams. While the sire/stallion data is available, there is less data available from the dam side, especially if the dam has not raced and was used for breeding only. All Thoroughbreds today go back in their bloodline to three sires and 74 dams. The pedigree data is processed using string processing to generate a family summary that is used in the model to understand the impact of the dam side on performance. The pedigree data also includes, or is used to derive, the following features: dosage profile, dosage index and centre of distribution for every horse, as well as GSV, Triads, Conduit mare profile, speed and stamina.

The race and results data are gathered together from the sources described above and processed to produce usable numeric data from the captured data. Although the present invention is not limited to thoroughbred racing, initial testing of the invention has utilised these data sources as they offer highly comprehensive data sets, with data on over 300,000 races and over 600,000 horses, enabling higher accuracy of the output of the trained model. From the race and results data, the names of the horses entering the race can be identified and these are linked to horses in the horse data. Statistics are aggregated for runners depending on whether they run in flat races or National Hunt races. For every horse, data is gathered relating to the sire and dam (parents) including the number of races participated in and the total prize money won (this is provided as a summation of rank multiplied by the prize money using the mean and total prize money won). Also, for each sire and dam the total number of wins is included and the class of the race in which each win occurred is also ranked. Other information such as peak performance for the sire and dam is derived including, for example, the peak age of winning. Other data included are progression performance measures such as: the number of days taken to reach the highest official rating from the first race; race tenure (number of days in the horse's career); number of rest days between races; and finishing times per race. It is also possible to infer statistics about trainers including number of wins, number of races and total money won.
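The per-runner aggregation described above, split by flat and National Hunt racing, can be sketched as follows. The record layout (`horse`, `race_type`, `position`, `prize`) and the horse names are hypothetical, and only a few of the statistics mentioned in the text are computed.

```python
from collections import defaultdict


def aggregate_runner_stats(results):
    """Aggregate races, wins and prize money per (horse, race type) pair."""
    stats = defaultdict(lambda: {"races": 0, "wins": 0, "prize_total": 0.0})
    for row in results:
        entry = stats[(row["horse"], row["race_type"])]
        entry["races"] += 1
        entry["wins"] += 1 if row["position"] == 1 else 0
        entry["prize_total"] += row["prize"]
    # Mean prize money per race, derived once all rows have been counted.
    for entry in stats.values():
        entry["prize_mean"] = entry["prize_total"] / entry["races"]
    return dict(stats)


# Illustrative records only; names and figures are invented.
sample_results = [
    {"horse": "Sire A", "race_type": "flat", "position": 1, "prize": 10000.0},
    {"horse": "Sire A", "race_type": "flat", "position": 4, "prize": 0.0},
    {"horse": "Dam B", "race_type": "flat", "position": 2, "prize": 3000.0},
]
stats = aggregate_runner_stats(sample_results)
```

The resulting sire and dam entries could then be joined onto each progeny record as input features.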

Once the data has been processed and readied for use the process of generating the regression model can begin (step 22). For each progeny horse, that is the horse of a sire and a dam, the following variables are included:

a) The aggregated sire and dam information in runners

b) Dosage Index (DI), Centre of Distribution (CD), and Dosage profile (DP)

c) Family summary of the progeny horse i.e. the eves names

d) Family summary of the parents

e) Progeny sex information

f) Birth date for progeny and the bloodline.

A model is generated based on the data retrieved using the following algorithms: Random forest; XGBoost; SVM; Linear Models, Ridge and lasso; KNN; Logistic regression; and PCA for feature reduction then SVM for classification. The above is a non-limiting list of suitable algorithms which were used; other regression algorithms may be used in addition or as replacements. Different data sources were used to train the algorithm. These included runners data, in which the analysis was limited to horses born after 2001 but before 2017 and those for which full racing information is available. The other source used was summary retrieved data.
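As an illustration of one of the listed algorithms, the following is a toy k-nearest-neighbours (KNN) regressor written with the standard library only. It is a sketch under stated assumptions, not the model used in the application: the two input features are assumed to be already scaled to the range 0 to 1, and the training values are invented.

```python
import math


def knn_predict(train_features, train_targets, query, k=3):
    """Predict a target as the mean of the k nearest training targets."""
    # Euclidean distance from the query to every training row.
    distances = sorted(
        (math.dist(row, query), target)
        for row, target in zip(train_features, train_targets)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Invented data: (scaled sire statistic, scaled dam statistic) -> career earnings.
train_features = [(0.1, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8), (0.5, 0.5)]
train_targets = [10_000.0, 12_000.0, 150_000.0, 170_000.0, 60_000.0]

prediction = knn_predict(train_features, train_targets, (0.85, 0.85), k=2)
```

In practice a library implementation with cross-validated hyperparameters would be the more usual choice; the sketch only shows the shape of the mapping from parent-derived features to a predicted earnings figure.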

Once the model has been generated, the input data required in order to operate the system is simply a sire and dam combination together with information about the date and country of birth of the offspring (step 24). This information allows the required data relating to the parent horses to be retrieved; this data is then input into the model, which produces as one output a quantified prediction of potential career earnings (step 26). The other outputs are a quantified benchmark scoring of the offspring horses selected, the type of race that is most likely to suit the offspring in question, likely suitable ground conditions, likely suitable racecourse, likely suitable trainer, and likely ideal race distance. An example of these outputs is shown in figure 2 on the handheld computer device 36, which is an example of the kind of computing device which can be used to access the system of the present invention (step 28). Data is communicated between the computer device 36 and the server 32 (as indicated by data transfers 38 and 40 in figure 2) via the Internet 30. In the example shown in figure 2 the potential earnings are indicated as a numerical figure. However, the potential earnings could alternatively be provided as a series of confidence levels, for example a 95% confidence level, a 75% confidence level and a 50% confidence level, as examples of target bins. Compound target variables can also be determined from total money, number of races and number of wins. This is achieved using different types of clustering including, but not limited to, K-means, Gaussian mixture model, DBSCAN and hierarchical clustering. The elbow method is used to select the best number of clusters. Tailored labelling is also used on the quantiles of the variables, for example labelling horses that won more than 75% of all horses, and those that won more than 75% of horses in total money.
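The quantile-based labelling mentioned above can be sketched as follows, marking horses whose total prize money exceeds the 75th percentile of all horses. The function name and the choice of the "inclusive" quantile method are assumptions; the application does not specify how the quantiles are computed.

```python
import statistics


def label_top_quartile(total_money):
    """Return 1 for each horse above the 75th percentile of total money, else 0."""
    # quantiles(n=4) returns the three quartile cut points; the last is the 75th percentile.
    _, _, q3 = statistics.quantiles(total_money, n=4, method="inclusive")
    return [1 if value > q3 else 0 for value in total_money]


# Invented career totals for eight horses.
totals = [0, 0, 500, 1000, 2000, 5000, 20000, 100000]
labels = label_top_quartile(totals)
```

Labels of this kind could serve as a classification target alongside the numeric earnings prediction.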
It will be appreciated by persons skilled in the art that the above embodiments have been described by way of example only and not in any limitative sense, and that various alterations and modifications are possible without departure from the scope of the protection which is defined by the appended claims.
