As a sports research scientist, one of the main requirements of my job has been to characterize athletes in a number of different sports. I like to use the terms player characterization or player profiling, which initially came from my work on Product Player Matching at Callaway Golf. I have since used similar methodologies and analysis tools for profiling both hitting and pitching mechanics in baseball and ACL injury mechanisms in football.
In these studies, I am typically looking for similarities in performance metrics to classify athletes. My work in player profiling at Callaway was initially presented to me as a project to classify or bucketize players for club fitting efforts. While I understood the philosophy, I was hesitant to use those terms as I didn’t feel they really represented what I was trying to accomplish with my Player Profiling, Digital Human Modeling, and Club Fitting projects.
To me, bucketizing players meant finding anything that was similar, placing golfers into any number of buckets, and then recommending a product solution for each group. However, there were a multitude of ways to bucketize players: player handicap, club head speed, body type (The LAWs of the Golf Swing), or ball flight characteristics, to name a few. When it comes to customization or personalization efforts, I just don’t believe that using average population values works. I could go to our club fitting database, query and download all players with 95–100 mph club head speed and a push-draw club head path, and still find players using multiple combinations of shaft flex and club head loft. While those query values may partially correlate with population performance levels, there is nothing to suggest they were the appropriate metrics to use in the first place to enhance individual performance.
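A hypothetical sketch of that kind of query makes the point; the table, column names, and values below are invented for illustration only, and pandas is assumed to be available:

```python
import pandas as pd

# Hypothetical slice of a club fitting database; every column name and value
# here is invented for illustration.
fittings = pd.DataFrame({
    "player_id":       [101, 102, 103, 104, 105],
    "club_head_speed": [96.0, 98.5, 99.2, 95.5, 97.3],   # mph
    "swing_path":      ["push-draw"] * 5,
    "shaft_flex":      ["S", "X", "R", "S", "X"],
    "head_loft_deg":   [9.5, 10.5, 9.0, 10.5, 9.5],
})

# The query described above: same speed band, same club head path...
bucket = fittings[fittings["club_head_speed"].between(95, 100)
                  & (fittings["swing_path"] == "push-draw")]

# ...yet the equipment these players actually ended up with varies widely.
print(bucket.groupby(["shaft_flex", "head_loft_deg"]).size())
```

Players who land in the same bucket by those two criteria can still end up with very different equipment, which is exactly why the bucket alone is a poor basis for a recommendation.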
Using averages on an entire population is a very common statistical tool. However, I think it is essentially useless information…for individual performance analysis. It is very beneficial to define the population you are analyzing in a research study, and the reason population averages have been used for so long is that oftentimes they were all the data available, and all that a study required. For individual performance analysis, though, they are a waste of time in my opinion. Performance analysis involves selecting an appropriate metric, measuring a baseline for the individual, and then monitoring changes in that metric over time against the baseline value.
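A minimal sketch of that baseline-and-monitoring idea, assuming pandas is available; the readings, column names, and the two-standard-deviation flag are illustrative assumptions, not values from any actual study:

```python
import pandas as pd

# Hypothetical weekly club head speed readings for one golfer. The values and
# the 2-standard-deviation threshold are illustrative assumptions only.
sessions = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-06", "2014-01-13", "2014-01-20",
                            "2014-01-27", "2014-02-03", "2014-02-10"]),
    "club_head_speed_mph": [96.8, 97.1, 96.5, 97.4, 94.1, 97.0],
})

# Establish the individual's baseline from the first four sessions.
baseline = sessions["club_head_speed_mph"].iloc[:4]
baseline_mean = baseline.mean()
baseline_sd = baseline.std()

# Monitor sessions against the athlete's own baseline, not a population mean.
sessions["deviation"] = sessions["club_head_speed_mph"] - baseline_mean
sessions["flagged"] = sessions["deviation"].abs() > 2 * baseline_sd

print(sessions[["date", "club_head_speed_mph", "deviation", "flagged"]])
```

The comparison is always the athlete against himself: the fifth session stands out not because it is below a population average, but because it drifts from that player's own established baseline.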
When I share my views on this, I am often met with skepticism and told that this is not how scientific studies are performed. My response is that this is exactly why there is so little improvement in measurable athletic performance in the US. Our data scientists are more concerned with journal requirements than with actually implementing research technologies and programs that provide measurable athletic performance improvement. Most of the technologies and treatments that are providing measurable improvements to US teams are coming from overseas.
If you don’t believe me, listen to Dallas Mavericks owner Mark Cuban, a tech innovator and investor. Cuban recently talked with ESPN’s Colin Cowherd on August 27th about where he believes technology is headed, and stated in that interview that personalized medicine is the biggest technology paradigm that will come to NBA teams. His comments extended beyond sports to personalized medicine in general and to establishing individual blood-level baselines. Cuban also expanded on his comments to Cowherd in this ESPN Insider article by Tom Haberstroh: “We are doing as many things as possible to create baselines for our players,” Cuban says. “One of the problems we all have, not just in sports, is that we wait until we are sick or have a health problem to get data about ourselves. Then our doctors compare us to the general population. But that’s a worthless comparison. I think the smartest thing I do for my health and we try to do at the Mavs, is to take ongoing assessments so we have a baseline for each individual that we can monitor for any abnormalities.” Cuban recently invested in Australia-based Catapult Sports, which provides athlete performance monitoring technology for his Mavericks team.
Jeremy Holsopple is the Athletic Performance Director for the Dallas Mavericks, hired by Cuban in August 2013. Holsopple was previously a strength coach for MLS’s New York Red Bulls, where he would often visit European soccer clubs to gather information on their training programs. It was there that Holsopple realized how far behind American professional sports teams were in sports science. Holsopple said, “In the U.S., we dominate in speed, power and athleticism. But in terms of understanding training loads and how to work the energy systems of the body and making an athlete more durable and being able to take athleticism to a higher level, we’re just starting to learn a lot more.”
For my player profiling studies, I follow these steps (a brief code sketch of the pipeline appears after the list):
- Select appropriate measurement technologies.
- Conduct data collection studies in conjunction with other appropriate measurement technologies to provide additional information, validate the selected technology, or compare against alternative technologies.
- Scrub or post-process the data to clean it up for subsequent analysis.
- Manipulate or transform the data into a format appropriate for performance analysis.
- Analyze the data to categorize athletes, measure performance levels, and monitor changes in performance metrics.
- Summarize the results.
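Below is a minimal sketch of how those steps chain together; every function name, data structure, and value in it is a placeholder assumption rather than an actual tool I use:

```python
# Placeholder pipeline sketch only: the functions and numbers are invented to
# show how the steps above connect, not an actual lab or Callaway workflow.
import numpy as np

def collect(source):
    """Step 2: pull raw samples from the chosen measurement technology."""
    return np.asarray(source, dtype=float)

def scrub(raw):
    """Step 3: drop dropouts / missing samples before analysis."""
    return raw[~np.isnan(raw)]

def transform(clean_mph):
    """Step 4: convert to the units the analysis model expects (mph -> m/s)."""
    return clean_mph * 0.44704

def analyze(metric, baseline):
    """Step 5: compare the session metric against the athlete's baseline."""
    return metric.mean() - baseline

def summarize(delta):
    """Step 6: report the change in plain terms."""
    return f"change vs. baseline: {delta:+.2f} m/s"

raw = collect([96.8, 97.1, float("nan"), 96.5, 97.4])
print(summarize(analyze(transform(scrub(raw)), baseline=43.0)))
```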
Where most sports performance analysis studies go wrong is at the very first step. They use technologies that they are used to, or that others have used for performance analysis, rather than the technology most appropriate for the evaluation at hand, as Holsopple suggested above. For motion capture biomechanics studies, this first step is often tied to step #4 as well. Almost all sports biomechanics studies use data analysis techniques built on rotational values that are specific to their methodology and incapable of being directly compared with other studies. This is a huge problem with sports performance studies, especially overhead throwing studies, and it is why they offer so little performance improvement for professional athletes. There are universal software algorithms that could make the data directly comparable from study to study, player to player, and session to session, but they are not used. That is why I believe inertial sensing technologies will be the technology of choice for any sports science study concerned with true athletic performance improvement moving forward, as they directly measure the appropriate metrics.
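A small illustration of why those rotation conventions matter, assuming SciPy is available and using arbitrary angles: the same physical orientation reported under two different Cardan/Euler sequences yields different numbers, so a joint rotation reported under one convention cannot be compared directly with one reported under another.

```python
# Illustrative only: one arbitrary orientation, two reporting conventions.
from scipy.spatial.transform import Rotation as R

orientation = R.from_euler("xyz", [35.0, 20.0, -60.0], degrees=True)

print(orientation.as_euler("xyz", degrees=True))  # the sequence it was defined in
print(orientation.as_euler("zyx", degrees=True))  # same orientation, different numbers
```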
From my professional experiences, I have acquired an ability to see through the noise and find the signal necessary for performance analysis. To me, that is the biggest problem with big data and sports analytics today. All professional sports teams are so focused on getting into sports analytics and collecting data that they lose sight of what they are actually trying to do. The amount of data produced by motion tracking technologies like SportVU and MLBAM’s upcoming player tracking system can range from several gigabytes to terabytes depending on the length of the game. Having a couple of analysts throw endless queries at these streams of data may seem like analytics, but the analysts get so stuck in the data analysis stream that they never get to the performance improvement part of their work, which is really what the professional teams want for their players. As Kirk Goldsberry wrote, “Whereas just a few years back acquiring good data was the hard part, the burden now largely falls upon an analytical community that may not be equipped to translate robust surveillance into reliable intelligence. The new bottleneck is less about data and more about human resources, as overworked analysts often lack the hardware, the software, the training, and most of all the time to perform these emerging tasks.”
The modern-day sports analyst needs to be as adept at developing simulation models and performance algorithms as he is at writing analysis code. With the amount of data that could be analyzed, it is important to be able to select the most appropriate data to analyze. The player tracking data provided by SportVU and other tracking technologies is essentially the same as motion capture data. The biggest differences are that motion capture also provides rotational or orientation data, and that player tracking data is collected at a much lower sampling rate but over much longer time periods – seconds versus hours. In both cases, 3D positional data is measured, and then velocities and accelerations are derived from that data. With multiple players on the court or field, the volume of raw and derived data multiplies immediately. Then additional statistical measures such as distance traveled and spatial statistics are added. A golf swing or baseball pitch lasts only 1–1.5 seconds; a basketball game is 48 minutes, and a baseball game can last 2.5 hours or more. So it is imperative that the analyst know how to select the most appropriate data to analyze. Just because the manufacturer provides the information does not mean it needs to be analyzed. That is where the analyst needs to understand what data was collected and how it was processed, and then limit his analysis to the data he wants and needs for his model or algorithm, without blindly performing query after query.
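As a minimal sketch of that position-to-velocity-to-acceleration derivation, assuming NumPy and a made-up 25 Hz tracking feed with a toy trajectory (the sampling rate and numbers are illustrative assumptions, not a vendor specification):

```python
# Turning tracked 2D positions into velocity and acceleration; the same
# derivation applies to optical motion capture or a player tracking feed.
import numpy as np

fs = 25.0                        # samples per second (assumed)
t = np.arange(0, 2, 1 / fs)      # two seconds of tracking
x = 3.0 * t                      # player moving 3 m/s in x (toy trajectory)
y = 0.5 * t ** 2                 # accelerating 1 m/s^2 in y

vx, vy = np.gradient(x, t), np.gradient(y, t)    # first derivative: velocity
ax, ay = np.gradient(vx, t), np.gradient(vy, t)  # second derivative: acceleration

speed = np.hypot(vx, vy)
distance = np.trapz(speed, t)                    # one simple spatial summary
print(f"distance covered: {distance:.2f} m in {t[-1]:.2f} s")
```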
This applies to physics-based models and to kinematic and kinetic analyses, where an analyst should have a very good understanding of both the system he is modeling and the data he is using. However, when we start trying to characterize players using non-kinematic data, with huge data sets requiring machine learning and artificial intelligence, the problem becomes much more complicated. In these scenarios, hand-programming logic-based algorithms for such large data sets is not practical. Modern sports analysts therefore often employ machine learning and artificial intelligence for data mining and pattern recognition on these huge data sets, using advanced proprietary software in conjunction with computer clusters to extract patterns and knowledge. In one example, Muthu Alagappan used topological data analysis software to analyze a data set of a full NBA season’s stats for 452 NBA players and found a new way to group players. His research won first prize at MIT’s Sloan Sports Analytics Conference in 2012.
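As a much simpler stand-in for that kind of data-driven grouping (this is not Alagappan’s topological data analysis, and the stats below are invented), a k-means clustering sketch with scikit-learn shows the general idea of letting the data define the player groups rather than the five traditional positions:

```python
# Invented per-36-minute box-score rates; simplified stand-in for the
# data-driven grouping described above, not the actual TDA method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# columns: points, rebounds, assists, blocks per 36 minutes (invented)
stats = np.array([
    [24.1,  4.2, 6.8, 0.2],
    [ 8.3, 11.9, 1.1, 2.4],
    [18.7,  7.5, 2.9, 1.0],
    [ 6.1,  9.8, 0.9, 2.9],
    [22.4,  3.8, 7.5, 0.1],
    [12.0,  6.6, 3.4, 0.7],
])

X = StandardScaler().fit_transform(stats)     # put each stat on the same scale
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)   # data-driven player groups, independent of listed position
```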
James Guszcza and Bryan Richardson wrote an incredible piece on big data entitled Two dogmas of big data: Understanding the power of analytics for predicting human behavior. Anyone involved with analytics in any capacity, much less sports analytics, needs to read it. And then read it again. I have included several paragraphs from that article below, in italics, that highlight its key points. When I first started player profiling efforts at Callaway Golf, I used many of these same ideas in my decision making, relying much more on scientific insight and knowledge than on insightful articles like this one, as there were not any at the time…at least none that I was aware of.
“Roughly ten years ago, The Economist magazine quoted the science fiction author William Gibson as saying, “The future is already here—it’s just not very evenly distributed.” Gibson’s comment is not a bad description of the varying degrees to which analytics and data-driven decision-making have been adopted in the public and private spheres. Much has been done, much remains to be done.” – James Guszcza and Bryan Richardson
Guszcza and Richardson write, “A well-kept secret of analytics is that, even when the data being analyzed are readily available, considerable effort is needed to prepare the data in a form required for the fun part—data exploration and statistical analysis. This process is called “data scrubbing,” connoting the idea that “messy” (raw, transactional, incomplete, or inconsistently formatted) data must be converted into “clean” (rows and columns) data amenable to data analysis. While it sounds (and indeed can be) tedious, data scrubbing is counterintuitively the project phase where the greatest value is created.”
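As a tiny illustration of that “messy to clean” step (my own sketch, not the authors’; the column names and values are invented, and pandas is assumed):

```python
# Invented "messy" records converted into a clean, analyzable table.
import pandas as pd

raw = pd.DataFrame({
    "player": ["Smith", "smith ", "Jones", "Jones", None],
    "speed":  ["96.8", "97.1", "n/a", "95.2", "94.0"],   # inconsistent types
})

clean = (
    raw
    .dropna(subset=["player"])                              # drop incomplete rows
    .assign(player=lambda d: d["player"].str.strip().str.title(),
            speed=lambda d: pd.to_numeric(d["speed"], errors="coerce"))
    .dropna(subset=["speed"])                               # "n/a" became NaN; drop it
    .drop_duplicates()
)
print(clean)
```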
…This story, itself hardly over a decade old, has lately been complicated by the emergence of “big data” as a dominant theme of discussion. Big data is routinely discussed in transformative terms as a source for innovation. “Data is the new oil,” the saying goes, and it will enable scientific breakthroughs, new business models, and societal transformations. A zeitgeist-capturing book title declares that it is a “revolution that will transform the way we live, work, and think.” The Cornell computer scientist Jon Kleinberg judiciously declared, “The term itself is vague, but it is getting at something that is real… big data is a tagline for a process that has the potential to transform everything.”
…While there is little doubt that the topic is important, its newness and the term’s vagueness have led to misconceptions that, if left unchecked, can lead to expensive strategic errors. One major misconception is that big data is necessary for analytics to provide big value. Not only is this false, it obscures the fact that the economic value of analytics projects often has as much to do with the psychology of de-biasing decisions and the sociology of corporate culture change as with the volumes and varieties of data involved.
The second misconception is the epistemological fallacy that more bytes yield more benefits. This is an example of what philosophers call a “category error.” Decisions are not based on raw data; they are based on relevant information. And data volume is at best a rough proxy for the value and relevance of the underlying information.
…It is hard to overstate the importance of Meehl’s “practical conclusion” in an age of cheap computing power and open-source statistical analysis software. Decision-making is central to all aspects of business, public administration, medicine, and education. Meehl’s lesson—routinely echoed in case studies ranging from baseball scouting to evidence-based medicine to university admissions—is that in virtually any domain, statistical analysis can be used to drive better expert decisions. The reason has nothing to do with data volume and everything to do with human psychology.
…But none of this bursts the big data bubble entirely. Once again the realm of “people analytics” applied to professional sports provides a bellwether example. Sports analytics has rapidly evolved in the decade since Moneyball appeared. For example, the National Basketball Association employs player tracking software that feeds real time data into proprietary software so that the data can be analyzed to assess player and team performance. Returning to William Gibson’s image, professional sports analytics is a domain where “the future is already here.”
Given the time and expense involved in gathering and using big data, it pays to ask when, why, and how big data yields commensurately big value. Discussions of the issue typically focus on various aspects of size or the questionable premise that big data means analyzing entire populations (“N=all” as one slogan has it), rather than mere samples. In reality, data volume, variety, and velocity is but one of many considerations. The paramount issue is gathering the right data that carries the most useful information for the problem at hand.
…We have told a two-part story to counter the two dogmas of big data. The first half of the story is the more straightforward: In domains ranging from the admissions office to the emergency room to the baseball diamond, measurably improved decisions will likely more often than not result from a disciplined, analytically-driven use of uncontroversial, currently available data sources. While more data often enables better predictions, it is not necessary for organizations to master “big data” in order to realize near-term economic benefits. Behavioral science teaches us that this has as much to do with the idiosyncrasies of human cognition as with the power of data and statistics.