A New Race Predictor developed by Vickers: My analysis of the paper
"An empirical study of race times in recreational endurance runners"
Sample size - 2311 people in the original set, 2164 in the final data set after removals. In total there were 1387 5K times, 946 10K times, 1579 HM times, and 1022 M times. But the data set used to create the line of best fit used fewer data points (884 5K, 595 10K, 989 HM, and 639 Marathon). This is a sizable study for its intent and much larger than most studies on this topic. However, most of the smaller studies don't use the same collection method, and in this case the collection method is what necessitates the large sample size.
Sample collection - An internet questionnaire open to anyone, using a slate.com article to find participants. The article speaks to potential issues of representativeness and selection bias. I'm not terribly concerned about the selection bias. The sample is closer to the real recreational runner than those in other scientific studies, but it still isn't a good sample. The male median marathon time in this study was 3:28, versus 4:11 for the NYC marathon and 4:16 for Running in the USA. The female median time in this study was 3:54, versus 4:38 in NYC and 4:41 in Running in the USA. So while closer than other studies, this data set population is still roughly 45 (!!!) minutes faster than the average in NYC or the average marathon finishing time in the US. Also, keep these times in mind for later.
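The "roughly 45 minutes" figure can be checked with a few lines of arithmetic. A minimal sketch in Python using only the median times quoted above; the helper names and the dictionary layout are my own for illustration.

```python
def to_minutes(hmm: str) -> int:
    """Convert an 'H:MM' finish time to whole minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)

# Median marathon times quoted above (study vs. the two reference populations).
medians = {
    "male": {"study": "3:28", "NYC": "4:11", "Running in the USA": "4:16"},
    "female": {"study": "3:54", "NYC": "4:38", "Running in the USA": "4:41"},
}

def gaps(times: dict) -> dict:
    """Minutes by which the study median is faster than each reference median."""
    study = to_minutes(times["study"])
    return {name: to_minutes(t) - study for name, t in times.items() if name != "study"}

for group, times in medians.items():
    print(group, gaps(times))
```

The four gaps come out between 43 and 48 minutes, which is where "roughly 45" comes from.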
But I don't believe they discussed what I find to be the most important issue enough: an internet questionnaire is self-reported. Studies with self-reported data tend to be less accurate because they rely on the person filling out the form to give correct information. Some people lie, some people make mistakes, some people misinterpret questions, and others just give their "best answer". All of these potential issues could skew the data set. They use "tempo" runs as an example of why there is no better method of data collection. They state that the alternative to self-reporting (particularly this questionnaire) is to have a running coach visit participants, watch a tempo run, and verify the running log to determine whether the subject did run a tempo during most weeks. They claim that this isn't feasible and reliably wouldn't give any different results than self-report. And since their conclusion matched the data (well duh?!?!), then they must be right on their study methodology. So because the conclusion you came up with based on the data is in agreement with the data you based the conclusion on, your methodology is correct. Ummm... OK....
Anyways, they state there isn't a better alternative (or at least the single alternative they give is a running coach who physically visits everyone). I can offer another one: Strava. The website collects GPS data from many different platforms into one site. You could conceivably form a Strava group with several thousand participants and record their data from there. It would take more work, but would give a much more accurate data set. This way it is much easier to identify false data, and much easier to collect actual values for performance in training and in races. In fact Strava has already done this with the 2016 London Marathon, so it can be done (although their study has issues as well).
Questionnaire - Alright, so let's set aside the self-reported data issue and just evaluate the questionnaire on its own. I am viewing the actual questionnaire posted rather than the text description found in the paper. The actual questionnaire was available in the supplementary documents.
Age, Gender, Height, Weight - None of these would likely be answered incorrectly except for maybe a mistake.
Are you an endurance runner or a speed demon? Endurance Runner 1 2 3 4 5 6 7 8 9 10 Speed Demon
So this is a self-description. But as best I can tell from the end result, they didn't use this data in any meaningful way.
What type of footwear do you wear? Normal, Minimalist, Vibrams, sandals, or barefoot
They threw out anyone in Vibrams, sandals, or barefoot (small sample size). I have no issue with that.
Recent race information (including a note not to report races where you were pacing a friend, to imply an actual race) Time, Distance - Both self explanatory
How difficult was the course?
Very difficult - very hilly, hot or windy
Difficult - hilly, hot, or windy
Average
Fast - cool, calm, and flat
Very fast - downhill or tailwind
This is interesting. They are doing this because they intend to adjust the data to the middle. If the course was very difficult, adjust the time to make it average. If very fast, adjust the time to make it average. Since this is self-reported, the course difficulty averaged over all participants should come out as average, with about as many very fast courses as very difficult ones. Since the entire data set is available, I verified that this is indeed true. All distances recorded an average value of 2.9-3.1 (which is "average"). Viewing the responses as a histogram (since averaging descriptive values is not always a clear-cut method) also shows that "average" is the middle, flanked on both sides by near-equivalent counts. This is all to say that their data set appears good from a self-reported standpoint for difficulty of course.
How would you rate your fitness?
They threw out anyone that wasn't well prepared. Fine with me. Not a good idea to build a robust calculator with people not at their respective fitness level.
What was your typical weekly mileage leading up to this race?
An interesting and very loaded question in my mind. What does typical mean? Average, presumably. But for how long prior to the desired race distance? 5 weeks, 6 weeks, 10 weeks? There is no guidance to say everyone answered this question the same. How is the mileage divided up amongst the week? Does that matter? I'd argue yes, but that isn't captured here. So someone who runs 7 days a week totaling 50 miles, is the same as someone who runs one day a week at 50 miles. Obviously these aren't the same, but for this data set they would be.
What was the maximum number of miles you ran in a single week during training?
Much more straight-forward question. What was your max mileage in a single week. The interesting thing is according to them the data (or end result) was nearly the same when viewed through the scope of max mileage and weekly mileage. Which means to say that someone who has a max of 40 is commonly in the same weekly average as other people with a max of 40. So since the max mileage and weekly mileage agree, and the max mileage is straight-forward, then it would lead me to believe (but not conclude) that the weekly mileage is probably accurate enough.
Did you run sprints, intervals, or hill repeats most weeks during training?
Alright, so let's try something. Answer this question in your mind. Alright, got your answer? Ok...
Seems straight-forward. But is it? How many sprints? What constitutes "most weeks"? If I do an 18 week training plan, and 7 weeks are spent on sprint intervals, is that "most weeks"?
How did the authors define this question in the paper? "Interval training is short and intense periods of max effort followed by equal length or longer recovery periods of less strenuous exercise." So why didn't they write that in the questionnaire? Seems easy enough to understand. Why leave it to interpretation?
But here's a twist. What about run/walk? If you do it, did you initially answer yes to sprints, intervals, or hill training? Because by my definition run/walk is "intervals" and I would state that it agrees with the definition as well with short intense periods followed by equal length or longer recovery (in fact that sounds like run/walk to a T).
So this begs the question, did my interpretation change your answer? If it did, that's a problem. Because it means people can interpret the question differently and may be giving inaccurate responses.
Again I think more guidance would help yield more accurate answers.
Did you do tempo runs most weeks during training? (If you don't know what a tempo run is, you probably didn't run one!) *This parenthetical statement actually appears on the questionnaire.
Alright, so let's try something again. Answer this question in your mind. Alright, got your answer? Ok...
Well then. If I don't know what a tempo run is, then I didn't run one. Seems to be pushing people to the answer "no" if one can't define a tempo. And guess what, I find a "tempo" run to be VERY subjective. Can anyone define "tempo" for me? Well the paper references another paper for "tempo" and even offers their own description in the text of the paper, but didn't give the same guidance on the questionnaire. So what was the paper's definition of "tempo"? It's defined as a "steady pace at or above the anaerobic threshold". Quick who knows what their "anaerobic threshold pace" is? Geez, this is getting tough isn't it. So we went from what is "tempo" which could be 5K tempo, or 10K tempo, or HM tempo, or M tempo, to something called anaerobic threshold pace. So what's anaerobic threshold? Just so happens to be also known as your lactate threshold. So this would be a pace between 10K and HM for most of us.
Alright, so now we know that "tempo" is defined as doing runs between 10K to HM. Did anyone change their answer based on this additional information?
Well wait, the paper also cites another article (a Runner's World article) that further defines a "tempo" run as the following: "This is the effort level just outside your comfort zone—you can hear your breathing, but you're not gasping for air. If you can talk easily, you’re not in the tempo zone, and if you can’t talk at all, you’re above the zone. It should be at an effort somewhere in the middle, so you can talk in broken words. Pace is not an effective means for running a tempo workout, as there are many variables that can affect pace including heat, wind, fatigue, and terrain."
Alright, anyone change their answer again or for the first time?
The big question is why the original questionnaire didn't include these further definitions, if the paper is willing to draw conclusions based on them even though they were never given to the participants. Way too much ambiguity for me.
Alright, so that's all the questions they asked for the paper. But I believe they're missing a very large piece of the puzzle. I'm a big believer that training mileage is an antiquated way to look at training. Training is two main pieces: relative training intensity (or pace) and duration (or time). These together create "miles" or distance. So there is a definite missing piece to the data set when the authors are looking for weekly training miles or weekly max mileage. Is someone who runs 50 miles at 9:00 min/mile the same as someone who runs 50 miles at 5:00 min/mile? Are two people who both average 9:00 min/mile over 50 miles the same if one runs all 50 at 9:00 min/mile and the other splits them 50/50 between 8:00 and 10:00? Remember that the calculator they formed takes into account weekly mileage, but the questions about tempo and intervals were not included in the calculation. Physiologically it seems their data is missing a large piece.
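To put numbers on the 50-mile examples, here is a minimal sketch in Python of how the same weekly mileage turns into very different training time. The function name and the runner labels are mine, and I am assuming the 8:00/10:00 split is 50/50 by distance.

```python
def weekly_hours(segments):
    """Total training time in hours for a list of (miles, pace in min/mile) segments."""
    return sum(miles * pace for miles, pace in segments) / 60

runner_a = [(50, 9.0)]              # all 50 miles at 9:00 min/mile
runner_b = [(50, 5.0)]              # all 50 miles at 5:00 min/mile
runner_c = [(25, 8.0), (25, 10.0)]  # assumption: the 50/50 split is by distance

for name, week in [("A", runner_a), ("B", runner_b), ("C", runner_c)]:
    print(name, round(weekly_hours(week), 2), "hours")
```

All three log "50 miles" on the questionnaire, but A spends 7.5 hours running, B a little over 4, and C matches A's hours with a different intensity mix. None of that is visible in a mileage-only data set.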
Conclusions
It is typically believed that training volume is more important for distances such as the marathon than for the 5 and 10 km (km) distance [18–20]. In contrast, we found that the association between training mileage and race velocity is similar across race distances.
I agree with this conclusion. Regardless of distance more miles (if done appropriately) will make you a faster runner. Or is it, that faster runners do more miles in training? A little of both. Which is again why it would be important to have the subjects report weekly duration of training (in addition to mileage).
Similarly, interval training is thought to be of most benefit for shorter distances, with tempo runs seen to be of particular value for long races: typical training plans include more frequent interval training, but less emphasis on tempos, for 10 km races than for marathons [21–23]. We found that tempo runs were more strongly associated with velocity for short distances and that interval training had a similar association with velocity irrespective of distance.
So remember intervals = max effort runs with periods of recovery and tempo = steady paced runs between 10K to HM pace. So sprints are good for short distance and tempo for longer. But their conclusion is that tempo runs were more strongly associated with short distances and that interval training was similar across all distances. But what does the data say?
People with Tempo runs vs those who don't tempo run
Marathon = -3.5%
HM = -3.6%
10K = -6.4%
5K = -4.7%
So yes, 5K and 10K have more improvement, but there is still a significant amount of improvement with tempo runs for M and HM. A -3.5% improvement in a marathon is a 3:59 vs a 3:51. Anyone want an 8 min improvement? Yes, please. So while the 5K and 10K gain is larger, the HM and M is still a healthy improvement. But this begs the question: are those who perform tempo runs more experienced, with more marathons/training plans in their running career? And thus would those who run tempo runs tend to be faster runners in general, or those looking to push themselves to a physiological limit? These numbers just say that those who are faster run tempo runs, but don't say that they are faster because they ran tempo runs. 57% of subjects reported running tempo runs.
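The 3:59 vs 3:51 figure is just the percentage applied to a finish time. A minimal sketch in Python, with helper names of my own; the percentages are the tempo differences listed above and describe an association in the data, not a guaranteed gain.

```python
def to_minutes(hmm: str) -> int:
    """Convert an 'H:MM' finish time to whole minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)

def to_hmm(minutes: float) -> str:
    """Format minutes as 'H:MM', rounded to the nearest minute."""
    total = round(minutes)
    return f"{total // 60}:{total % 60:02d}"

def faster_by(hmm: str, pct: float) -> str:
    """Finish time after taking pct percent off the given time."""
    return to_hmm(to_minutes(hmm) * (1 - pct / 100))

# Tempo vs. no-tempo differences quoted above, as positive percentages.
tempo_diff = {"Marathon": 3.5, "HM": 3.6, "10K": 6.4, "5K": 4.7}

print(faster_by("3:59", tempo_diff["Marathon"]))  # 3:51
```

3.5% of 239 minutes is a bit over 8 minutes, which is the 8 min gap mentioned above.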
People with Intervals vs those who don't run intervals
Marathon = -2.9%
HM = -3.0%
10K = -1.1%
5K = -2.5%
Pretty even across the board. Still a nice improvement or again faster runners tend to run intervals. 52% of subjects reported running intervals.
Our other major finding was that although standard race prediction tools based on the Riegel formula work well for distances up to a half marathon, they substantially underestimate time for the marathon. Given the importance of pacing for marathon distance, this finding has considerable implications. Our novel marathon prediction model is straightforward and could easily be implemented on any website.
So they claim their calculator is better than the previous version.
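For reference, the classic Riegel formula being compared against is T2 = T1 x (D2 / D1)^1.06. A minimal sketch in Python; the function name and the example half marathon time of 1:50 are mine for illustration, and the paper's own new model is not reproduced here because its coefficients aren't quoted in this post.

```python
def riegel(time_min: float, from_km: float, to_km: float, exponent: float = 1.06) -> float:
    """Classic Riegel prediction: scale a known race time to another distance."""
    return time_min * (to_km / from_km) ** exponent

HALF_KM, MARATHON_KM = 21.0975, 42.195

# Hypothetical runner with a 1:50 half marathon (110 minutes).
predicted = riegel(110, HALF_KM, MARATHON_KM)
hours, minutes = divmod(round(predicted), 60)
print(f"Predicted marathon: {hours}:{minutes:02d}")
```

For that hypothetical 1:50 half, Riegel lands at about 3:49 for the marathon. The paper's claim is that predictions like this one come out too fast at the marathon distance.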
There are 310 data points in their model 1 prediction (one other race) and 171 data points in the model 2 prediction. The data is further broken down into percentile bands of 5%. So for model 1 that means about 15 data points per band, and for model 2 about 9. Getting a lot smaller, right? So when evaluating the actual data I would conclude that the new model (1 and 2) is better than Riegel for everything in the top 67% of their data set, when evaluating the data as raw data. For model 1 that means everyone faster than an expected marathon of 3:52 should use the new calculator, and for model 2 a 3:53. However, if you are slower than a 3:52 or 3:53, then the classic Riegel calculator is still better. If you want to say that avoiding a too-fast start is absolutely paramount, then the time cutoff is more like 4:11-4:14 (faster use the new calc, and slower use the classic calc). Now remember the NYC and Running in the USA averages? They were roughly 4:11-4:38. So essentially, the average runner should still use the classic calculator because the new calculator isn't as good at predicting average to slower times like those completed in NYC or Running in the USA. Looks to me like they missed the mark with the original data set, and thus when they created a calculator it badly misjudges the times of those in the bottom 50% of marathon runners (but the classic can do those better, or at least according to the limited data set available in their original values).
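The per-band counts follow directly from splitting each validation set into 5% bands. A quick sketch in Python (names are mine), assuming the points are spread evenly across the bands:

```python
BAND_WIDTH_PCT = 5
bands = 100 // BAND_WIDTH_PCT  # 20 bands of 5% each

def points_per_band(n_points: int) -> float:
    """Average number of data points per band, assuming an even spread."""
    return n_points / bands

for model, n in {"model 1": 310, "model 2": 171}.items():
    print(model, points_per_band(n))
```

That gives 15.5 and 8.55 points per band, so any conclusion about a single band rests on very few runners.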
dis_or_dat said:
An interesting conclusion is that tempo is more important for 5k and 10k than marathon results and slightly more effective than interval training in general.
Again, the conclusion should be that faster 5K/10K runners tend to do "tempo" runs, whereas the runners who do "tempo" runs and those who don't do not have significantly different HM/M times. It doesn't mean that the "tempo" runs actually caused the difference. Again because, one, were these people right that they even do tempo runs? And two, are more experienced and faster runners just more likely to do "tempo" runs because they're pushing their physical limits of improvement? Does the data imply tempo made them faster?
Same thing to consider when evaluating the interval conclusion as well. Did everyone answer appropriately based on their real information vs a misinterpretation of the question? And do the runners' faster times actually mean the intervals did it, or was it just a coincidence driven by another variable that the questionnaire didn't elicit?
They answer this by saying, well, the data looks like what we expected based on physiological answers from other papers, thus our data is right. But does that really confirm it, or is it just propagating the same wrong conclusion without confirming anything?
dis_or_dat said:
What do you think about the idea of diminishing returns past 70-75 miles a week for a non-elite runner training for a marathon?
This is a tricky one. This paper doesn't make that conclusion so I'm guessing the blog you read did make this point. Again, is it appropriate to evaluate the training plan based on weekly mileage? Or would duration with paces be a better way to decide what is and isn't appropriate? How about 70 miles a week, but all in one day? I believe how you run those 70 miles would determine whether it's appropriate. I would say that running roughly 10-11 hours per week is a nice cutoff for non-elite / elite, but this value is based off no data. It's purely a guess on my part.
But that notwithstanding, what is "non-elite"? What is the cutoff? Top 10%? Top 1%? Top 0.1%? Is someone who is non-elite and runs less than 70 miles limiting themselves to non-elite status merely because they aren't running 100 miles per week? Does "elite" mean having the ability to run 7 days a week, multiple times a day, with naps and nutrition throughout the day, and nothing about the actual marathon race time?
How about looking at the data set in the paper? Can we learn anything from them?
Blue = Marathon, Red = HM, Green = 10K, Yellow = 5K.
So someone training for a 5K would appear to max around 80 miles with diminishing returns.
For 10K - Around 90 miles
For HM - The curve never bottoms out up to 120 miles
For M - The curve never bottoms out up to 120 miles.
So for the fastest times on the chart (which may or may not be "elite" depending on definition) the marathon does not have diminishing returns up to 120 miles.
Let's say 2:30 would constitute "elite" for a male marathon runner. That's a pretty fast time. Not world class, but nationally competitive. Only 6 males reported a marathon faster than 2:30. Their weekly mileage was 100, 115, 120, 70, 67, and 95. Is the data real? Who knows, again it's self-reported. But it would appear that every one of them is doing some pretty high mileage. What's the average mileage for those between 2:30-3:00? There are 79 subjects. They average a weekly mileage of 62.3 miles (with a min of 25 (!!!) and max of 100).
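For a direct comparison with the 62.3 average, the six sub-2:30 mileages can be summarized the same way. A small sketch in Python; the variable name is mine and the numbers are the six self-reported values above.

```python
from statistics import mean

# Self-reported weekly mileage of the six males faster than 2:30.
sub_230_mileage = [100, 115, 120, 70, 67, 95]

print("average:", mean(sub_230_mileage))
print("min:", min(sub_230_mileage), "max:", max(sub_230_mileage))
```

That works out to an average of 94.5 miles per week (min 67, max 120), versus 62.3 for the 2:30-3:00 group. With only six runners it is a hint, not a finding.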
There are 22 subjects averaging over 75 miles per week. The average marathon time is 2:48. The next 22 subjects with the highest average weekly mileage average 72.2 miles per week and have an average marathon time of 2:54. The next 22 subjects average 62.3 miles and have an average marathon time of 2:51. Lastly, the next 22 highest average weekly mileage runners average 59.9 miles and have an average marathon time of 3:15.
Over 75 miles = 2:48
72.2 miles = 2:54
62.3 miles = 2:51
59.9 miles = 3:15
What does the data mean? That on average people who run 62-75+ miles run roughly the same marathon time between 2:48-2:54. But somewhere around 60-65 miles lies a point at which slower times start to catch up with training mileage. This is a backwards way to do it and based on only 88 data points, but we start to see a trend. You can see it in the blue line as well with the bend becoming more pronounced the closer you get to and lower than 60 miles per week.
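The same four groups can be lined up against the fastest one to make the jump easier to see. A minimal sketch in Python using only the group averages quoted above; the labels and helper name are mine.

```python
def to_minutes(hmm: str) -> int:
    """Convert an 'H:MM' finish time to whole minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)

# (average weekly mileage of the group, average marathon time), 22 runners each.
groups = [("over 75", "2:48"), ("72.2", "2:54"), ("62.3", "2:51"), ("59.9", "3:15")]

fastest = min(to_minutes(t) for _, t in groups)
behind = {label: to_minutes(t) - fastest for label, t in groups}

for label, gap in behind.items():
    print(f"{label} miles/week: +{gap} min vs fastest group")
```

The first three groups sit within 6 minutes of each other, while the 59.9 mile group is 27 minutes back. That is the bend around 60 miles per week, with the caveat that each group is only 22 self-reported data points.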
So how does one build up their training weekly mileage and still avoid injury? Lots of slow easy running mixed into weeks with some SOS workouts. Like my proposed plans of 3 SOS workouts with 80% easy. This would seem to agree with my mindset that if you want to be the best endurance runner you can be you need to maximize running economy which comes with lots and lots of miles (or as I like to put it lots and lots of time at the right paces).
Alright, so those are my conclusions. Overall I think it is a good paper and a definite step in the right direction. However, based on the data and shortcomings, there are definite improvements to be made. Thoughts?
Vickers AJ, Vertosick EA. An empirical study of race times in recreational
endurance runners. BMC Sports Sci Med Rehabil. 2016 Aug 26;8(1):26. doi:
10.1186/s13102-016-0052-y. eCollection 2016. PubMed PMID: 27570626; PubMed
Central PMCID: PMC5000509.