2024 Finalists & Winners

Open-Track Data + Software Winner
Brandon Onyejekwe (View Full Paper)
Northeastern University
Bio:
Brandon Onyejekwe is a recent graduate of Northeastern University with a Bachelor of Science in Data Science and a minor in Mathematics. His research interests broadly involve solving real-world problems using machine learning methods. While an avid sports fan overall, his main passion is running: he was a captain of Northeastern’s club running team and a member of the Sports Analytics Club, and he is currently training for a half marathon. He currently works as a Data Engineer at Travelers Insurance in Hartford, CT.
Quantifying Uncertainty in Marathon Finish Time Predictions
Abstract:
In the middle of a marathon, a runner’s expected finish time is commonly estimated by extrapolating the average pace covered so far, assuming it to be constant for the rest of the race. These predictions have two key issues: the estimates do not consider the in-race context that can determine if a runner is likely to finish faster or slower than expected, and the prediction is a single point estimate with no information about uncertainty. We implement two approaches to address these issues: Bayesian linear regression and quantile regression. Both methods incorporate information from all splits in the race and allow us to quantify uncertainty around the predicted finish times. We utilized 15 years of Boston Marathon data (312,805 runners total) to evaluate and compare both approaches. Finally, we developed an app for runners to visualize their estimated finish distribution in real time.
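
As a rough illustration of the baseline the abstract describes, the sketch below extrapolates a runner's average pace so far to the full marathon distance. The function names and the example split are hypothetical and are not taken from the paper. The pinball loss shown alongside it is the standard objective minimized by quantile regression; it is included only to indicate how a predicted quantile is scored, not as the authors' implementation.

```python
MARATHON_KM = 42.195  # official marathon distance in kilometres


def constant_pace_finish(elapsed_s: float, distance_km: float) -> float:
    """Extrapolate finish time (seconds), assuming the average pace so far holds."""
    if not 0 < distance_km <= MARATHON_KM:
        raise ValueError("distance_km must be positive and within the race distance")
    pace_s_per_km = elapsed_s / distance_km
    return pace_s_per_km * MARATHON_KM


def pinball_loss(actual: float, predicted: float, q: float) -> float:
    """Standard quantile (pinball) loss for a predicted q-th quantile."""
    diff = actual - predicted
    # Under-predictions are weighted by q, over-predictions by (1 - q).
    return q * diff if diff >= 0 else (q - 1) * diff


# Hypothetical runner: reaches the halfway mark in 1h40m (6000 s).
halfway_estimate = constant_pace_finish(6000.0, MARATHON_KM / 2)
```

The point estimate carries no uncertainty on its own; a quantile model trained with the pinball loss at, say, q = 0.1 and q = 0.9 would instead give a lower and an upper bound on the finish time.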

Open-Track Methods Finalist
Ryan Brill (View Full Paper)
University of Pennsylvania
Bio:
Ryan Brill is a fifth (and final) year PhD student studying applied math & statistics at the University of Pennsylvania. He is broadly interested in decision making under uncertainty and real-world applications of statistics and data science. He also works part-time for the Dodgers and was a 2022 NFL Big Data Bowl Finalist. Outside of work, he has recently been enjoying watching Industry season 3, reading about moral psychology, and exploring New York City.
The winner of the NFL draft is not necessarily cursed: Exploring the discrepancy between NFL draft expected value curves and the observed trade market
Abstract:
Football analysts traditionally value a future draft pick position by its expected performance or surplus value. But these expected value curves do not match the valuation implied by the observed trade market. One takeaway is that general managers are making terrible trades on average. An alternative explanation is that they are using some other value function that captures an essential piece of the puzzle missing from previous analyses. We are partial to the latter explanation. In particular, traditional analyses don’t consider how variance in performance outcomes changes over the draft. Because variance decays convexly across the draft, eliteness (e.g., right tail probability) decays much more steeply than expected value. We suspect general managers value performance nonlinearly, placing exponentially higher value on players as their eliteness increases. This is because elite players have an outsize influence on winning the Super Bowl. Thus, in this paper we consider nonlinear draft value curves that capture the outsize influence of elite players. Such nonlinear value functions produce steeper draft value curves that more closely resemble the observed trade market.
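
The claim that right-tail probability falls faster than expected value can be checked with a small numerical sketch. Everything here is an assumption made for illustration: performance is modeled as normally distributed, and the means, standard deviations, and "elite" threshold are made-up numbers rather than values from the paper.

```python
import math


def normal_tail(threshold: float, mean: float, sd: float) -> float:
    """P(X > threshold) for X ~ Normal(mean, sd), via the complementary error function."""
    z = (threshold - mean) / (sd * math.sqrt(2))
    return 0.5 * math.erfc(z)


# Hypothetical performance distributions for an early pick and a late pick:
# both the mean and the spread are halved for the late pick.
early = {"mean": 1.0, "sd": 1.0}
late = {"mean": 0.5, "sd": 0.5}
ELITE = 2.0  # hypothetical threshold for an "elite" outcome

mean_ratio = late["mean"] / early["mean"]
tail_ratio = normal_tail(ELITE, **late) / normal_tail(ELITE, **early)

print(f"expected value retained by the late pick: {mean_ratio:.3f}")
print(f"elite probability retained by the late pick: {tail_ratio:.4f}")
```

Under these toy numbers the late pick keeps half of the expected value but only a small fraction of the elite probability, which is the qualitative pattern the abstract describes.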

Open-Track Methods Finalist
Lee Kennedy-Shaffer (View Full Paper)
Yale School of Public Health
Bio:
Lee Kennedy-Shaffer is an assistant professor of biostatistics at the Yale School of Public Health. His primary research interests are in infectious disease and vaccine study design, along with the methodology of cluster-randomized trials and quasi-experiments. A lifelong Mets fan, he is also interested in using these methods to understand baseball and sports more generally, identifying what causal inference methods work in sports settings, and using sports to broaden interest in statistics to students and the wider public.
Panel Data Methods to Evaluate the Impact of Rule Changes
Abstract:
In recent years, several major team sports have instituted rule changes in attempts to improve game play and the viewing experience. From 2020 to 2023, Major League Baseball instituted several rule changes affecting team composition, player positioning, and game time. Understanding the effect of these rules, both on the game as a whole and on individual teams and players, is crucial for leagues, teams, players, and other relevant parties to assess their impact and either push for further changes or roll back existing rules. Panel data and quasi-experimental methods provide useful tools for causal inference in these settings. I demonstrate this potential by analyzing the effect of the 2023 shift ban at both the overall and player-specific levels. Using difference-in-differences analysis, I show that the policy increased BABIP and OBP for left-handed batters by a modest amount. For individual players, synthetic control analyses identify several players whose offensive performance (OBP, OPS, and wOBA) improved significantly because of the rule change, and other players with previously high shift rates for whom it had little effect. This work both estimates the impact of this specific rule change and demonstrates how these methods for causal inference are potentially valuable for sports analysis more broadly, at the player, team, and league levels.
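
For readers unfamiliar with difference-in-differences, the basic two-group, two-period estimator can be sketched as follows. The function name, the choice of groups, and the input numbers are hypothetical placeholders, not data or results from the paper, and the paper's actual specification may differ (for example, by using a regression with covariates).

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences.

    Returns the change in the treated group's mean outcome minus the change
    in the control group's mean outcome.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Toy outcomes (arbitrary units), invented purely for illustration.
effect = did_estimate(
    treated_pre=[10, 12],
    treated_post=[15, 17],
    control_pre=[8, 10],
    control_post=[9, 11],
)
```

Here the treated group improves by 5 and the control group by 1, so the estimated effect is 4. The estimate is only causal under the parallel-trends assumption, that is, that the two groups would have changed by the same amount without the rule change.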

Student-Track Data + Software Winner
Bhaskar Lalwani (View Full Paper)
Kalinga Institute of Industrial Technology
Bio:
Bhaskar is a junior pursuing a Bachelor's in Computer Science at Kalinga Institute of Industrial Technology. With a focus on deep learning and data science, he has experience working with transformer-based architectures, applying them to tasks such as optical character recognition for multiple languages and fine-tuning models for Indic languages. He is looking to gain more exposure to his interests, which include the intersection of AI, language, and linguistics. He plans to pursue an MS in CS after graduation. In his free time, Bhaskar enjoys playing the tabla and piano and reading science fiction.
KabaddiPy: A package to enable access to Professional Kabaddi Data
Abstract:
Kabaddi, a contact team sport of Indian origin, has seen a dramatic rise in global popularity, highlighted by the upcoming Kabaddi World Cup in 2025 with over sixteen international teams participating, alongside flourishing national leagues such as the Indian Pro Kabaddi League (230 million viewers) and the British Kabaddi League. We present the first open-source Python module to make Kabaddi statistical data easily accessible from multiple scattered sources across the internet. The module was developed by systematically web-scraping and collecting team-wise, player-wise and match-by-match data. The data has been cleaned, organized, and categorized into team overviews and player metrics, each filterable by season. The players are classified as raiders and defenders, with their best strategies for attacking, counter-attacking, and defending against different teams highlighted. Our module enables continuous monitoring of exponentially growing data streams, helping researchers quickly start building upon the data to answer critical questions, such as the impact of player inclusion/exclusion on team performance, scoring patterns against specific teams, and breakdowns of opponent gameplay. The data generated from Kabaddi tournaments has been used only sparsely, and coaches and players rely heavily on intuition to make decisions and craft strategies. Our module can be used to build predictive models, craft strategic gameplays that target specific opponents, and identify hidden correlations in the data. This open-source module has the potential to increase time-efficiency, encourage analytical studies of Kabaddi gameplay and player dynamics, and foster reproducible research. The data and code are publicly available: https://github.com/kabaddiPy/kabaddiPy

Student-Track Data + Software Winner
Aniruddha Mukherjee (View Full Paper)
Kalinga Institute of Industrial Technology
Bio:
Aniruddha Mukherjee is currently a junior majoring in Computer Science at Kalinga Institute of Industrial Technology (KIIT), where he ranks at the top of his class. He is also pursuing a BS in Data Science with the Indian Institute of Technology, Madras (IIT-M) in an online format. He has interned at various research institutions like BITS Pilani, Tata Consultancy Services Research and The University of Texas at Austin. He is passionate about solving problems and has explored solutions in quantitative finance, healthcare, anomaly-detection and image quality assessment leading to presentations and publications at venues like IEEE Transactions, ACM's International Conference on AI in Finance (ICAIF'24) and Springer’s Cognitive Computation. Aniruddha’s drive to build and create impactful solutions has led him to win three hackathons hosted by Indian Institutes of Technology (IITs) and co-author two filed patents on real-world solutions using AI. He also has been working closely with SkinAI, a New Delhi based startup, and with IIT-Kharagpur as a collaborator with the Department of Artificial Intelligence. Outside of academics, he is a Grade 8 pianist (ABRSM), enjoys playing football, tennis and chess, and enjoys debating. Aniruddha has volunteered for Stanford as an Instructor (CS106A) to teach CS basics. He is enthusiastic about utilizing technology and engineering to make a significant and meaningful impact in the lives of individuals. Looking ahead, he is interested in pursuing a Master’s in Computer Science (MSCS) followed by a PhD.
KabaddiPy: A package to enable access to Professional Kabaddi Data

Student-Track Methods Finalist
Zeke Weng (View Full Paper)
University of Toronto
Bio:
Zeke is a second-year student at the University of Toronto, studying Computer Science and Statistics with a focus on Artificial Intelligence. He is keenly interested in machine learning and aims to gain more experience in multi-agent systems, reinforcement learning, and algorithmic game theory. Zeke intends to graduate in 2026 and then pursue graduate studies back home in California. Outside of his studies, he helps lead the University of Toronto Sports Analytics student group and has his sights set on the NFL Big Data Bowl after this conference.
Boulder2Vec: Modeling Climber Performances in Professional Bouldering Competitions
Abstract:
In the past decade, sport climbing has grown to be a popular pastime due to its social, physical and mental stimulation. This growth has been bolstered by its recent addition to the Summer Olympics in three formats: bouldering, speed and lead. In particular, bouldering, a form of climbing that focuses on short, difficult routes (known as "problems") with multiple attempts, has seen the greatest growth, with 71% of new climbing gyms opening in North America being boulder-focused. Using data from professional bouldering competitions from 2008 to 2022, we train a generalized linear model to predict climber results and measure skill level. However, this approach is limited, as a single numeric coefficient per climber cannot adequately capture the intricacies of climbers’ varying strengths and weaknesses in different boulder problems. For example, some climbers might prefer more static, technical routes while other climbers may specialize in powerful, dynamic problems. To this end, we apply Probabilistic Matrix Factorization (PMF), a well-established framework commonly used in recommender systems, to automatically learn to represent the unique characteristics of climbers and problems with latent, multi-dimensional vectors. In this framework, a climber’s performance on a given problem can be predicted from the dot product of the corresponding climber and problem vectors. Additionally, PMF effectively handles sparse datasets, such as those encountered in competitive bouldering where climbers don't attempt every problem, by extrapolating patterns from similar users, thus inferring information about unobserved interactions. We contrast the empirical performance of PMF to the generalized linear model approach and investigate the learned multivariate representations to gain insights into climber characteristics.
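
The dot-product prediction described above can be illustrated with a minimal matrix factorization sketch. This is not the authors' model: it fits a regularized squared-error objective by stochastic gradient descent (which corresponds to a MAP estimate of PMF under Gaussian priors), and the toy observations, dimensions, and hyperparameters are all hypothetical.

```python
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fit_pmf(observations, n_climbers, n_problems, dim=2,
            lr=0.05, reg=0.01, epochs=2000, seed=0):
    """Learn latent climber and problem vectors from (climber, problem, outcome) triples."""
    rng = random.Random(seed)
    climbers = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_climbers)]
    problems = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_problems)]
    for _ in range(epochs):
        for c, p, y in observations:
            err = y - dot(climbers[c], problems[p])
            for k in range(dim):
                ck, pk = climbers[c][k], problems[p][k]
                # Gradient step on squared error plus an L2 penalty.
                climbers[c][k] += lr * (err * pk - reg * ck)
                problems[p][k] += lr * (err * ck - reg * pk)
    return climbers, problems


# Toy data: outcome 1.0 = problem solved, 0.0 = not solved.
# Climber 2 never attempted problem 0, so that pair is simply absent.
observations = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 1.0), (1, 1, 1.0), (2, 1, 0.0)]
climbers, problems = fit_pmf(observations, n_climbers=3, n_problems=2)

mse = sum((y - dot(climbers[c], problems[p])) ** 2
          for c, p, y in observations) / len(observations)
# Prediction for the unobserved pair comes from the learned vectors alone.
unseen_prediction = dot(climbers[2], problems[0])
```

Only observed pairs contribute to the updates, which is how this family of models copes with climbers who do not attempt every problem.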

Student-Track Methods Finalist
Victor Hau (View Full Paper)
University of Toronto
Bio:
Victor is a third-year Engineering Science student at the University of Toronto majoring in mathematics, statistics and finance. As an engineering student, he is interested in applying his knowledge to exciting real-world problems such as those in sports analytics. Above all, Victor enjoys using data and modeling to drive decision-making and operations research. He plans to pursue a career in a data-centric role in data science or other quantitative fields for his co-op year as well as after graduation. In addition to his academic focuses, he is an avid follower of the NFL, NBA and NHL (go Leafs!).
Boulder2Vec: Modeling Climber Performances in Professional Bouldering Competitions

Student-Track Methods Finalist
Ethan Baron (View Full Paper)
New York University
Bio:
Ethan recently started a PhD in Computer Science at New York University working on machine learning. He completed his undergraduate degree at the University of Toronto in Engineering Science, where he led the University of Toronto Sports Analytics student group. Ethan has also worked on soccer analytics as a data scientist at Zelus Analytics, and has presented his sports analytics research at NESSIS, MathSport, and CORS. Outside of work, he is a passionate road cycling fan, and enjoys playing volleyball, basketball, and ultimate frisbee!
Boulder2Vec: Modeling Climber Performances in Professional Bouldering Competitions

Student-Track Methods Finalist
Jacky Jiang (View Full Paper)
Rice University
Bio:
Hao "Jacky" Jiang is a driven student at Rice University, pursuing a B.S. in Computer Science and a B.A. in Sport Analytics. With a strong foundation in software development, machine learning, and data science, he has contributed to research projects such as wearable systems for exercise recognition at Cornell University. His internships include building scouting applications for D.C. United and enhancing recommendation algorithms at Petkeley AI Innovations. An active community member, Jacky has volunteered over 100 hours at the Houston Food Bank. Looking ahead, Jacky plans to pursue a Ph.D. in Human-Computer Interaction, aiming to deepen his expertise in the field and contribute to advancing technology for practical and user-centered applications.
GoalNet: Advancing Counterattack Prediction in Soccer through Gender-Specific Graph Neural Networks
Abstract:
Traditional soccer analysis tools emphasize metrics such as chances created and expected goals, leading to an over-representation of attacking players’ contributions and overlooking the pivotal roles of players who facilitate ball control and link attacks. Identifying these players could help coaches develop specific tactics and help clubs recruit. Examples include Rodri from Manchester City and Palhinha, who just transferred to Bayern Munich. To address this bias, we developed a model utilizing graph neural networks (GNNs) to analyze match events comprehensively. Our research aims to identify players with pivotal roles in a soccer team using GNNs, incorporating both spatial and temporal features. In our approach, each event in a soccer match is represented as a graph where nodes correspond to players and edges denote interactions. Each node encompasses various attributes, including the player’s name and historical performance metrics such as average pass completion rate. Edges capture interactions between players, such as passes and tackles, with features including pass frequency and distance. We incorporate the last k events to maintain temporal context, accounting for recent interactions. Our model is trained to predict the expected threat (xT) changes for each event, effectively attributing these changes to the contributing players based on their interactions in the previous events. We combine metrics such as degree centrality with the output of the trained GNN model to assign xT changes as credits to players more accurately. To validate the effectiveness of this method, we examined player evaluation outputs, demonstrating that this innovative evaluation method accurately reflects player contributions. Our findings highlight the significance of these pivotal players in the team dynamics, providing a more nuanced understanding of their impact on the game.
This comprehensive analysis using GNNs allows for a balanced evaluation of player contributions, showcasing the indispensable roles of facilitators and initiators in soccer matches.
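
To make the graph representation concrete, the sketch below builds an undirected interaction graph from a few pass events and computes degree centrality for each player. The proportional credit split that follows is purely a hypothetical stand-in: the abstract does not specify how centrality is combined with the GNN output, and no GNN is involved here. Player labels and the xT change are invented.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree centrality: distinct neighbours divided by (number of nodes - 1)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


def split_credit(delta_xt, centrality):
    """Hypothetical rule: share an xT change in proportion to centrality."""
    total = sum(centrality.values())
    if total == 0:
        return {player: 0.0 for player in centrality}
    return {player: delta_xt * c / total for player, c in centrality.items()}


# Toy window of recent events: player B links play between A, C, and D.
passes = [("A", "B"), ("B", "C"), ("B", "D")]
centrality = degree_centrality(passes)
credits = split_credit(0.06, centrality)
```

In this toy window the linking player B receives half of the credit even though B need not be the one who takes the shot, which mirrors the abstract's motivation for crediting facilitators.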

Student-Track Methods Finalist
Jerry Cai (View Full Paper)
Rice University
Bio:
Yanxiao Cai is a research assistant and junior software engineer studying computer science at Rice University. He has gained experience across several research areas, from analyzing large-scale EHR datasets to developing machine learning models for recommendation systems and CTR prediction. Comfortable with techniques such as recurrent neural networks, transformer models, and graph neural networks, Yanxiao has worked with mainstream frameworks such as PyTorch to improve the accuracy of his predictions. His interest is in machine learning in sports, especially football. Yanxiao is passionate about learning how data collection and processing happen in industry and is committed to applying machine learning to unlock insights in healthcare and sports analytics.
GoalNet: Advancing Counterattack Prediction in Soccer through Gender-Specific Graph Neural Networks