In February 2023, the results of the world’s largest-ever four-day work week trial were published, indicating a host of positive impacts on employee wellbeing and organisational revenue.
Based on the evidence, we know that job quality matters for wellbeing, so it’s reasonable to assume that changes to work patterns may have wellbeing benefits. But how much confidence can we have in the report’s claims?
Here, Evidence Associate Michael Sanders reviews the study and discusses the need for scepticism.
Published by independent research organisation Autonomy, the new report details the results of a four-day week pilot carried out in 61 UK companies from June to December 2022. It involved around 2,900 workers who voluntarily adopted truncated work weeks.
The report is incredibly positive. There are, however, four reasons to be sceptical. This scepticism should not be taken as a belief that four-day weeks are a bad thing; the question is whether this specific study moves us closer to an answer.
1. The need for a counterfactual
In order for us to identify the impact of an intervention – in this case, a four-day week – we need to have a sense of what would have happened if we’d not given people the intervention. The simplest way to do this is through a randomised trial, where some firms move to a four-day week at random, while others do not.
Instead, the study compares participants before and after the implementation of the four-day week. This kind of analysis gives us far less confidence in the findings. There are too many contributing factors to be confident that any changes are caused by the intervention, rather than by the passage of time or by something about the companies that opted in. The participating organisations may have had less happy workforces, or higher staff turnover, than average to begin with, and be benefiting from a changing tide. The “great resignation” is, in fact, given by the report authors as a reason for organisations’ participation.
There are good reasons why a randomised trial may not have been possible here, but alternative efforts can be made in future studies: for example, collecting data from a matched sample of other organisations, or from the nine organisations that wanted to take part in the project but, for various reasons, weren’t ready to. A similar approach is used in the evaluation of the National Citizens’ Service. Because of its consistent quality and study design, that evaluation has been included in the Centre’s evidence reviews, such as Social capital, and offers scope for drawing robust conclusions. This demonstrates that designing, delivering and evaluating a programme consciously and consistently can capture impact and produce useful, focused learnings.
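To see why the counterfactual matters, here is a minimal, illustrative simulation in Python. All numbers are invented for the sake of the sketch; nothing here comes from the study’s data. It shows how a general upward trend in wellbeing can produce a sizeable pre-post “improvement” in participating firms even when the four-day week itself has no effect, and how comparing against a matched group experiencing the same trend would reveal that.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_900  # roughly the number of workers in the pilot

# Hypothetical wellbeing scores (0-10). Both groups improve over time
# because of a general trend, and the true effect of the four-day week
# is set to zero in this sketch.
trend = 0.8
baseline_pilot = rng.normal(5.5, 1.5, n)
endline_pilot = baseline_pilot + trend + rng.normal(0, 1.0, n)

baseline_comparison = rng.normal(5.5, 1.5, n)
endline_comparison = baseline_comparison + trend + rng.normal(0, 1.0, n)

# A pre-post comparison within the pilot alone looks like a clear improvement...
print("Pre-post change (pilot only):",
      round(endline_pilot.mean() - baseline_pilot.mean(), 2))

# ...but netting off the change in a comparison group shows (correctly)
# that the four-day week added nothing beyond the general trend.
did = ((endline_pilot.mean() - baseline_pilot.mean())
       - (endline_comparison.mean() - baseline_comparison.mean()))
print("Change net of comparison group:", round(did, 2))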
2. Lack of independent evaluation
The study was carried out exclusively by people who want the four-day week to succeed. This is understandable, as they are the ones most motivated to study it, but it also makes the findings less reliable due to inherent bias.
For example, meta-analyses conducted by Australian academic John Hattie found large average effects of attainment interventions in education contexts. Almost all of the interventions were tested by their developers or other “true believers”. In contrast, trials funded by the Education Endowment Foundation, which were evaluated by more-or-less impartial independent evaluators, found dramatically smaller effects – perhaps one fifth the size on average.
3. Lack of clarity around outcome measures
The study reports the results for a large number of measures, capturing a range of psychological constructs like stress and burnout, as well as a general sense of positive and negative emotions. We would expect a report to detail which measures were used – the Copenhagen Burnout Inventory, for example – and/or to publish the full survey in an annex. The report authors do neither.
4. Lack of statistical clarity, detail and consistency
A standard approach to analysis when there is no robust counterfactual is to compare pre- and post-treatment means. This analysis was clearly conducted, and is mentioned in some places, but not consistently. Instead, pie charts show the proportion of people exhibiting increases, decreases or no change in each metric. This is an atypical approach to the analysis.
Where changes are reported, it is occasionally mentioned that they are statistically significant. As the paper is unclear about what tests were conducted, we must assume that where significance is not mentioned, the changes were not significant for those measures (burnout, work stress, mental health scores, anxiety, negative emotions, positive emotions, fatigue).
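For readers unfamiliar with what that missing detail would look like, the conventional test for a pre- versus post-treatment change in the same respondents is a paired test. A minimal sketch in Python, with invented scores rather than the study’s data, might look like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Invented pre/post burnout scores for the same 100 respondents.
pre = rng.normal(3.2, 0.8, 100)
post = pre - 0.15 + rng.normal(0, 0.6, 100)  # small hypothetical improvement

# Paired t-test: is the mean pre-post change different from zero?
result = stats.ttest_rel(pre, post)
print(f"mean change: {np.mean(post - pre):.2f}")
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
```

Reporting the test used, the size of the change and the p-value (or confidence interval) for every outcome is what would give readers the clarity currently missing from the report.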
It is also difficult to tell from the reporting exactly who the study is looking at. Figures on revenue, which show a 1.4% increase over the six-month study in an environment of high inflation, come from less than 40% of the sample. Endline survey responses were received from 70% of participants who completed the baseline survey. This response rate is impressive, but becomes less so when we consider that we are told nothing about who does and does not respond at the two time points. If the least happy or most stressed people at baseline are the most likely to respond at endline, then simple regression to the mean could explain all of the findings.
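To illustrate how regression to the mean combined with selective response could manufacture an apparent improvement, here is a deliberately stylised Python sketch (invented numbers, an extreme selection rule, nothing from the study’s data): each person’s measured wellbeing is a stable “true” level plus survey noise, no one actually changes, and only the least happy 70% at baseline respond at endline.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_900

true_level = rng.normal(6.0, 1.0, n)           # stable underlying wellbeing
baseline = true_level + rng.normal(0, 1.0, n)  # noisy baseline measurement
endline = true_level + rng.normal(0, 1.0, n)   # noisy endline; no real change

# Stylised selection: only the 70% of people with the lowest (least happy)
# baseline scores respond at endline.
responded = baseline <= np.quantile(baseline, 0.70)

apparent_change = endline[responded].mean() - baseline[responded].mean()
print(f"endline response rate: {responded.mean():.0%}")
print(f"apparent improvement among responders: {apparent_change:+.2f}")
```

Even with zero true change, the responders’ average score rises, simply because their unusually low baseline measurements drift back towards their true levels at endline.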
General reflections
Conditions that help us thrive in the workplace – such as job security, learning opportunities, a supportive environment and social connections – impact both our productivity and our wellbeing.
Understanding if a four-day week makes a difference is an important question for the future of work, and one we should continue to interrogate.
To ensure successful investment in systems and changes within workplace contexts – and to support the future of employment for both individual wellbeing and the economy – we need a rigorous and robust understanding of impact. This is particularly important as ideas move from small trials to larger-scale implementations. It is very possible to conduct robust workplace trials; however, case studies remain the norm.
It’s critical that we improve the quality of evidence by asking whether interventions work, how effective they are, for how long and for whom, and whether they are cost-effective. This remains the case across public, private and civil society sectors.