I was astounded to read (“Why you should never use Microsoft Excel to count coronavirus cases“) that the reason so many positive cases of coronavirus were missed off the statistics recently was because outdated software was being used. What?
Data was being uploaded to the national statistics in comma separated value (.csv) format to .xls Excel spreadsheet templates before finally being input to the Public Health England database. Why?
Why indeed.
In this day of far more modern software, why weren’t people uploading their .csv files directly to the cloud and having the data validated before it was incorporated into the main database?
If they have to use Excel, why were they using the far older .xls format rather than .xlsx? It was explained that this was to cover the fact that many people are still using legacy software (ignoring the fact that most data were submitted in .csv format). The .xls format was last used (googles it) by Microsoft Office version 2003 – so why are Public Health England using such old software? The .xlsx format was introduced with Microsoft Excel 2007, and let me see, all versions of Excel prior to and including 2010 ARE NO LONGER SUPPORTED BY MICROSOFT. Mainstream support for Office 2010 ended on 13 October, on Ada Lovelace Day*. Users can update to Office 2019, or move to Microsoft 365 , but may find that their legacy software does not work on it.
So, versions of Office up to 2010 no longer get security patches, and this means they are vulnerable to attack from malware of all kinds.
However, to get back to the missing cases and why they were missed – 15,841 cases between 25 September and 2 October were not uploaded to the government figures, and this was not noticed for several days.
The further consequence of this is that the contacts for all those people were also missed by the Test and Trace operation; and working on the average number of contacts per person, this would be about 50,000 people. With an R rate of about 1.2 that could mean an extra 60,000 cases of coronavirus in the first pass, who would not be offered tests so they could then go on to infect another 1.2 people each and those another 1.2 people each – a geometrical progression.
Public Health England (PHE) receive the data in .csv format, which has no maximum file size. When PHE put the results into Excel, almost 16,000 cases were dropped because the old version of Excel had reached its maximum number of rows. Actually, Excel does warn you when it reaches the limit and asks whether you want to proceed, but maybe nobody was looking at the screen just then. So the extra figures were dropped because there was nowhere for them to go!
The .xls format has 65,536 rows and the .xlsx format a little over a million, although this can be got around using various means including Data Model, which stores the excess data in the computer memory rather than in spreadsheet cells.
However, Excel really is not robust enough for really huge datasets and it has been shown in the past that multiple users of a spreadsheet can corrupt the data very easily because it is difficult to protect adequately. Large datasets should really be held in a database with masks on the input cells so that only data in the correct format goes into each cell and only the correct cells can be accessed. A .csv file with one comma missing for whatever reason could really mess up an old version Excel spreadsheet by importing data into the wrong columns. The possibilities are endless!
I shall leave you to draw your own conclusions about the capability of our public organisations.
Diana Catton
Managing Director
*Ada Lovelace was a prominent mathematician and early computing pioneer. Ada Lovelace Day was started to encourage more girls into studying STEM subjects and going into STEM occupations. STEM stands for Scientific, Technical, Engineering and Maths. Ada Lovelace was also the daughter of the infamous rake and poet Lord Byron.