The idea for this blog came from a presentation at the Kitman Labs Performance Summit in London (March, 2019) by my good friend, Dr. Robin Thorpe. Robin, a sports scientist, is an expert in recovery and regeneration physiology having spent nearly 10 years at Manchester United, before moving to Altis as Director of Performance and Innovation.
As sport scientists, we collect data, analyse it and feed it back to coaches. However simple that may seem, it is easy to fall into the ‘Data Collection for Data’s Sake’ trap. If you don’t know why you are collecting it, then what are you collecting it for? I’m not ashamed to admit that I have previously fallen into that trap (yes, it can be trial and error!). In our support for players and coaches, it is important that the information fed back is accurate. It is quite easy to make an inaccurate assumption based on our data collection. In turn, this leads to incorrect inferences that may needlessly reduce the training time and prescription for our players. We should never lose sight of the fact that our role is to use scientific principles to improve and enhance our players, not to wrap them in cotton wool. We must look at maximising our players’ training and playing time. The greater the squad availability, the higher the probability of success (Carling et al., 2015).
In Dr. Thorpe’s presentation, he spoke about the ‘Four Pillars of Confidence’ (Reliability, Validity, Sensitivity, Usability) when considering data metrics. A recent study by Starling and Lambert (2018) reported that of the 55 coaches and support staff they interviewed, 96% viewed monitoring of both training load and the training load response as important. However, the coaches noted that, of the protocols used to monitor players, no single protocol is cost-effective, time-efficient and non-invasive to players. While this may be an issue in some respects, a further issue arises if the feedback to coaches is inaccurate. In the worst case scenario, this causes us to lose training time.
In sport science, statistics are probably one of our biggest assets, and one of the most important considerations when making decisions based on data (Buchheit, 2016). However, if our statistical skills are less than proficient, we may end up making incorrect decisions or sending confusing messages to our key stakeholders. This potentially leaves practitioner, coach and player with little confidence in the data, the data collection process and/or sport science.
Whatever we monitor, we must ensure that the tools we use are repeatable (Reliability), measure what they are supposed to measure (Validity), are sensitive enough to detect meaningful change in the player data (Sensitivity), and are subsequently useful (Usability) for the coaches and/or players (depending on who you are feeding back to!).
Therefore, the aim of this blog is to provide an overview of each of the ‘Four Pillars of Confidence’ suggested by Dr. Thorpe, and provide example statistical methods that may allow us to make inferences based on our data to support coaches as opposed to snapshot decisions based on small data.
Reliability may be considered one of the most important of the Four Pillars, as it directly affects the exactitude of our athlete monitoring (Atkinson and Nevill, 1998). For example, if we are to measure daily wellness in our players via a questionnaire, it is important to know what change on the scale would signify a meaningful change (McGuigan, 2017).
Whilst there are various methods to assess reliability, understanding the typical error of measurement, a method that directly measures the error within the test, allows us to calculate the variation in the monitoring tool. A useful way to express the typical error of measurement is the coefficient of variation (CV), which is expressed as a percentage. The CV gives us an indication of the spread of our data relative to the mean. Thus, the lower the CV, the lower the random noise, and therefore the higher the chance of detecting a real change in the data (Hopkins, 2000). This gives us confidence that any changes in our data are real and not down to chance. By calculating the CV [100 × (standard deviation / mean)], we can quantify the reliability of our monitoring tests, and subsequently have confidence in our reporting of data to coaches and/or players.
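As a rough illustration, the CV formula above [100 × (standard deviation / mean)] can be sketched in a few lines of Python. The jump heights are hypothetical values made up for this example, and the use of the sample standard deviation is an assumption, as the formula does not specify which form to use.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the CV as a percentage: 100 * (standard deviation / mean)."""
    average = mean(values)
    if average == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return 100 * stdev(values) / average


# Hypothetical repeated jump heights (cm) for one player under the same conditions.
trials = [38.0, 40.0, 42.0]
cv = coefficient_of_variation(trials)
print(f"CV = {cv:.1f}%")  # prints "CV = 5.0%"
```

A lower percentage here would suggest less noise between repeated trials of the same test.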
This video by Dr. Anthony Turner (Middlesex University) gives an excellent insight into assessing the reliability of your data.
The term ‘validity’ refers to whether the monitoring tool we use does what it says it does. Does the tool we use assess what we want it to assess? As with reliability, there are various forms of validity (construct, ecological, face, content and criterion validity).
However, the types of validity of greatest importance to athlete monitoring, and for the purpose of this article, are construct and ecological (McGuigan, 2017). As a brief overview, construct validity refers to the extent to which a test measures what it was designed to measure (Baumgartner, 2007).
Ecological validity describes how the monitoring tool we select relates to the player’s performance and how well we can apply it in a real world scenario (McGuigan, 2017). However, it is possible to have a tool which has high reliability but little to no validity. Thus, when selecting our monitoring tools, it is important that we have both high reliability and high validity. It is important that as practitioners we reduce the ‘noise’ in a test, and keep conditions as consistent as possible when administering any test or monitoring protocol.
Examples of test conditions that may affect the validity of our monitoring may be as simple as the number of observers, music, the preceding instructions on how to perform the test and the volume / frequency of verbal encouragement from testers / peers (Halperin et al., 2015). To quantify the validity of a test, your measured (practical) values should be as close as possible to the true values, otherwise known as the “gold standard”. This is known as ‘criterion validity’, which has two parts: concurrent and predictive (McGuigan, 2017).
For example, using a correlation between a performance test and a criterion measure, we could investigate the relationship between a laboratory-based cycling time trial and a competition cycling time trial (Currell and Jeukendrup, 2008). However, while this may seem logical, it is far more difficult to replicate the complex demands of a sport such as football in a performance test. An example of concurrent validity is the high correlation between the Yo-Yo Intermittent Recovery Test and high-intensity running in football (Krustrup et al., 2003).
Predictive validity is the ability of a performance protocol to predict performance. For example, Hawley and Noakes (1992) used a test of maximal oxygen uptake (V̇O2max) and peak power output (Wmax). The authors demonstrated that Wmax explained 94% of the variance in 20-km time-trial performance, whilst V̇O2max explained 82% of the variance.
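To make "variance explained" a little more concrete, the sketch below computes a Pearson correlation (r) between a test score and a criterion measure, then squares it to give the proportion of shared variance (r²). The data are invented purely for illustration and are not taken from Hawley and Noakes (1992); the function and variable names are also hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical data: peak power in a lab test (W) and time-trial time (min).
peak_power_w = [300, 320, 340, 360, 380]
time_trial_min = [34.0, 33.1, 32.0, 31.2, 30.0]

r = pearson_r(peak_power_w, time_trial_min)
r_squared = r ** 2  # proportion of variance in one measure shared with the other
print(f"r = {r:.3f}, r squared = {r_squared:.3f}")
```

In this made-up example the correlation is negative, because higher peak power goes with a faster (lower) time, while r² is unaffected by the sign.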
Therefore, predictive validity, with its ability to deal with future performance, would have good application in areas such as fatigue monitoring (McGuigan, 2017).
Thus, in the context of our wellness data for this article, Thorpe et al. (2016) suggested that the validity of potential markers of fatigue (from our wellness data) can be assessed by examining their sensitivity to changes in prescribed training load over periods of time.
Thorpe et al. (2016)
When describing the sensitivity of a monitoring tool, we are referring to its ability to detect small, but meaningful, changes in performance and/or in another aspect such as fatigue. Thus, sensitivity is related to both the reliability and validity of our monitoring protocol (McGuigan, 2017). For the applied practitioner, any valid marker of fatigue needs to be sensitive to fluctuations in training load (Meeusen et al., 2013). Consequently, and for the purpose of this part of the article, the focus will be on subjective wellbeing measures.
Recent literature by Thorpe et al. (2015, 2017) demonstrated that self-report measures, and in particular self-perceived measures of fatigue, were sensitive to daily and short-term training load accumulation. Furthermore, a systematic review by Saw et al. (2016) suggested that subjective measures reflected changes in athlete wellbeing, thus appearing to be sensitive to changes in training load, both acute and chronic. However, there may be challenges in collecting data and detecting change when using self-reported subjective questionnaires (compliance, familiarity, etc.). Thus, simply taking a mean average of the team’s reported scores may not detect any meaningful change.
The mean is a measure of ‘central tendency’. If you are given a data set and calculate the mean, it will represent the centre or middle of that data set. The challenge with our monitoring is that the mean value is influenced by outliers. The larger the outlier in the data, the bigger the pull on the mean. Moreover, the mean is purely descriptive, and doesn’t really allow us to make inferences about our data.
In summary, use the mean if the total sum, rather than the typical value, is what you are looking for. For example, if you want to know which players covered the highest total distance, then calculating the mean would be a good idea. Remember, in this example you are only interested in the top runners, so those below the mean are somewhat irrelevant to your analysis. Thus, could you be missing vital information by only looking at a mean value?
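To illustrate how one outlier pulls the mean while the individual behind it stays hidden, here is a small sketch. The wellness totals (out of 20) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean

# Hypothetical wellness totals (out of 20) for eight players.
squad_scores = [16, 15, 16, 15, 16, 15, 16, 15]

# The same squad with one additional player reporting a very low total.
squad_with_outlier = squad_scores + [6]

mean_without = mean(squad_scores)        # 124 / 8 = 15.5
mean_with = mean(squad_with_outlier)     # 130 / 9, roughly 14.4

print(f"Mean without the low score: {mean_without:.1f}")
print(f"Mean with the low score:    {mean_with:.1f}")
```

The group mean drops by about one point, which on its own may not look alarming, even though one player has reported a total far below everyone else.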
If we take daily wellbeing data as an example, we can see the mean average of the group along the top of the table below:
With each metric scored on a 1-5 scale (for a total out of 20), the average for each metric is as follows: Soreness 3.6/5, Energy 3.8/5, Stress 3.8/5 and Sleep 3.8/5. The mean average for the group is 14.96/20. This looks OK, and doesn’t show any abnormalities or causes for concern in our athletes. Or does it? If we look closely, we can see that some values (e.g., Player 8 and Player 19) are low. However, these mean average scores tell us there is no cause for concern. Is this a true reflection of our data/athletes?
However, we could express the data in a different way that is sensitive to these changes in the group of athletes’ wellbeing data. The example below is the same data, with the addition of a ‘z-score’.
In essence, the z-score tells us how many standard deviations a data point is from the mean, i.e. how usual or unusual that data point is. As it is a standardized score, it allows us to make inferences based on our data (i.e., positive or negative). Thus, it provides more information than the raw scores alone (Turner et al., 2015).
We can calculate a simple z-score with the following equation:
Z-score = (player’s score (in this case, total) – group mean score) / standard deviation of the group
When the raw data is converted to z-scores, the scores will have a mean of 0 and a standard deviation of 1, and for approximately normally distributed data almost all z-scores will fall between -3 and +3. Thus, a z-score will allow the practitioner and coach to see how many standard deviations from the mean, either below (negative) or above (positive), a player’s scores are.
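The z-score calculation described above can be sketched as follows. The player names and wellness totals are hypothetical, and the sample standard deviation is used as an assumption, since the equation does not say which form of standard deviation to apply.

```python
from statistics import mean, stdev


def z_scores(scores):
    """Return a z-score for each value: (score - group mean) / group SD."""
    group_mean = mean(scores)
    group_sd = stdev(scores)  # assumption: sample standard deviation
    if group_sd == 0:
        raise ValueError("z-scores are undefined when all scores are identical")
    return [(score - group_mean) / group_sd for score in scores]


# Hypothetical wellness totals (out of 20) for one day.
totals = {"Player A": 16, "Player B": 15, "Player C": 17, "Player D": 10, "Player E": 17}

z_by_player = dict(zip(totals, z_scores(list(totals.values()))))
for name, z in z_by_player.items():
    print(f"{name}: total = {totals[name]}, z = {z:+.2f}")
```

In this invented squad the group mean is 15/20, which looks unremarkable, yet Player D sits well over one standard deviation below it.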
A further advantage of z-scores is that they can be easily charted and presented in graphs, allowing the practitioner to compare data and, if necessary, modify a session, a programme, or both (McGuigan, 2017). Whilst practitioners can set their own thresholds to determine what is significant, it has been suggested that a threshold of more than 1.5 standard deviations from the mean (in this case, a z-score below -1.5) may be effective in identifying risk (Coutts and Cormack, 2014).
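One simple way to apply such a threshold is sketched below. The z-scores are hypothetical, the function name is made up for this example, and the default of -1.5 simply mirrors the suggested threshold; practitioners may prefer their own cut-off.

```python
def flag_players(z_by_player, threshold=-1.5):
    """Return the names of players whose z-score falls below the threshold."""
    return [name for name, z in z_by_player.items() if z < threshold]


# Hypothetical z-scores for one day's wellness totals.
z_by_player = {
    "Player A": 0.34,
    "Player B": 0.00,
    "Player C": 0.69,
    "Player D": -1.71,
    "Player E": 0.69,
}

flagged = flag_players(z_by_player)
print("Players to follow up with:", flagged)
```

A flag of this kind is only a prompt for a conversation with the player, not a decision in itself.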
The table below gives an example of a monitoring system that can be implemented at low cost to the practitioner, using statistical analysis methods that allow us to make inferences on our data (Clubb and McGuigan, 2018).
It is certainly worth noting that, while a z-score has been used for this article, it has been based on one day’s worth of data for demonstration purposes only. Although beyond the scope of this article, for further longitudinal analysis a modified z-score can be calculated from baseline data (e.g., preseason). The calculation is as follows (Clubb and McGuigan, 2018):
Modified z-score = (player score – baseline score) / standard deviation of baseline
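A minimal sketch of this calculation is shown below. It assumes that the "baseline score" is the mean of the player's own baseline values and that the sample standard deviation of those values is used; both are interpretations rather than details stated in the formula, and the numbers are hypothetical.

```python
from statistics import mean, stdev


def modified_z_score(player_score, baseline_scores):
    """Return (player score - baseline mean) / baseline standard deviation."""
    baseline_mean = mean(baseline_scores)  # assumption: baseline score = baseline mean
    baseline_sd = stdev(baseline_scores)   # assumption: sample standard deviation
    if baseline_sd == 0:
        raise ValueError("baseline has no variation, so the score is undefined")
    return (player_score - baseline_mean) / baseline_sd


# Hypothetical preseason wellness totals (out of 20) for one player.
baseline = [16, 17, 15, 16, 16]

# Hypothetical total reported today by the same player.
today = 14

z_today = modified_z_score(today, baseline)
print(f"Modified z-score today: {z_today:+.2f}")  # prints "Modified z-score today: -2.83"
```

Because each player is compared with their own baseline, a habitually low scorer is not flagged simply for scoring lower than team-mates.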
Furthermore, this excellent free resource by Adam Sullivan will help you build a rolling 28-day z-score in Excel to create a daily wellness dashboard for your team.
Usability is arguably the most important pillar for the applied practitioner: how useful is this data for the coaching and playing staff? This goes back to the point at the start of this article: why collect data for data’s sake? What is important information, and what is not? The latter can be a difficult question for the applied practitioner to ask, but it’s vital we ask it. Critical to our success as sport scientists is our ability to feed back to coaches and players. Communicating our data with clarity and precision may prove challenging, and will depend on those you are working with daily.
As we gain confidence in our data, using the various statistical tools at our disposal, we must translate this information to inform practice (McCall et al., 2016). However, as sport scientists, having any kind of impact on the training programme and/or practice is often far from easy (Buchheit, 2016). Personally, I believe this comes down to the fourth and final pillar: usability.
Currently, no single marker within the literature allows us to become totally informed on an athlete’s wellbeing, and no single test performed in isolation is capable of giving us the full picture (Starling and Lambert, 2018). Thus, it is imperative that the data we collect is meaningful and usable for coaching staff.
During the fast-paced daily environment of elite football, we must filter the data to ensure usability, and translate it for those who require it most. The key decision makers in the applied environment may have many plates to spin (technical, tactical, business, etc.) on a daily basis. Thus, more often than not, they are more concerned with simple and concise answers to their questions, e.g. is this player available to train/play? (McCall et al., 2016).
As practitioners, it is our role to simplify the data for our key stakeholders (players, coaches, physios, medical staff). Thus, we must be able to report, with confidence, that the inferences made from our data are Reliable, Valid, Sensitive (to change) and Usable. Therefore, our ability to translate and communicate the data with practical meaning is absolutely paramount (McCall et al., 2016).
Below is a chart I have created to help decide what we should be looking for within a monitoring tool to feed back to our key stakeholders.
Adapted from: Starling and Lambert (2018) and Buchheit (2016)
Further recommended resources:
For free downloads and creating athlete monitoring tools:
Special thanks for the help, support and guidance during the writing of this article:
Dr. Jamie Pugh, Postdoctoral Researcher, LJMU (J.Pugh@ljmu.ac.uk)
Atkinson, G. and Nevill, A. (1998). Statistical Methods For Assessing Measurement Error (Reliability) in Variables Relevant to Sports Medicine. Sports Medicine, 26(4), pp.217-238.
Baumgartner, T. (2007). Measurement for evaluation in physical education and exercise science. Boston: McGraw-Hill.
Carling, C., Le Gall, F., McCall, A., Nédélec, M. and Dupont, G. (2015). Squad management, injury and match performance in a professional soccer team over a championship-winning season. European Journal of Sport Science, 15(7), pp.573-582.
Clubb, J. and McGuigan, M. (2018). Developing Cost-Effective, Evidence-Based Load Monitoring Systems in Strength and Conditioning Practice. Strength and Conditioning Journal, 40(6), pp.75-81.
Coutts, A. and Cormack, S. (2014). High-Performance Training for Sports. Pp.85-96.
Currell, K. and Jeukendrup, A. (2008). Validity, Reliability and Sensitivity of Measures of Sporting Performance. Sports Medicine, 38(4), pp.297-316.
Halperin, I., Pyne, D. and Martin, D. (2015). Threats to Internal Validity in Exercise Science: A Review of Overlooked Confounding Variables. International Journal of Sports Physiology and Performance, 10(7), pp.823-829.
Hawley, J. and Noakes, T. (1992). Peak power output predicts maximal oxygen uptake and performance time in trained cyclists. European Journal of Applied Physiology and Occupational Physiology, 65(1), pp.79-83.
Krustrup, P., Mohr, M., Amstrup, T., Rysgaard, T., Johanson, J., Steensburg, A., Pedersen, P. and Bangsbo, J. (2003). The Yo-Yo Intermittent Recovery Test: Physiological Response, Reliability, and Validity. Medicine & Science in Sports & Exercise, 35(4), pp.697-705.
Buchheit, M. (2016). Chasing the 0.2. [online] Available at: https://martin-buchheit.net/2016/05/16/chasing-the-0-2/ [Accessed 6 May 2019].
McCall, A., Davison, M., Carling, C., Buckthorpe, M., Coutts, A. and Dupont, G. (2016). Can off-field ‘brains’ provide a competitive advantage in professional football?. British Journal of Sports Medicine, 50(12), pp.710-712.
Meeusen, R., Duclos, M., Foster, C., Fry, A., Gleeson, M., Nieman, D., Raglin, J., Rietjens, G., Steinacker, J. and Urhausen, A. (2013). Prevention, diagnosis and treatment of the overtraining syndrome: Joint consensus statement of the European College of Sport Science (ECSS) and the American College of Sports Medicine (ACSM). European Journal of Sport Science, 13(1), pp.1-24.
McGuigan, M. (2017). Monitoring training and performance in athletes.
Starling, L. and Lambert, M. (2018). Monitoring Rugby Players for Fitness and Fatigue: What Do Coaches Want?. International Journal of Sports Physiology and Performance, 13(6), pp.777-782.
Turner, A., Brazier, J., Bishop, C., Chavda, S., Cree, J. and Read, P. (2015). Data Analysis for Strength and Conditioning Coaches. Strength and Conditioning Journal, 37(1), pp.76-83.
Wallace, L., Slattery, K., Impellizzeri, F. and Coutts, A. (2014). Establishing the Criterion Validity and Reliability of Common Methods for Quantifying Training Load. Journal of Strength and Conditioning Research, 28(8), pp.2330-2337.