COVID sequences erased from int'l database shed light on initial spread

"There is no plausible scientific reason for the deletion" of the sequences from the early spread of the virus in Wuhan, according to Jesse Bloom.

A coronavirus testing facility in Wuhan, China. (photo credit: BGI)
New partial sequences of the novel coronavirus may help shed light on the early spread of the virus, after a scientist uncovered 34 samples of the SARS-CoV-2 virus from early on in the initial epidemic in Wuhan that were mysteriously deleted from the internet, according to a pre-print analysis published on bioRxiv.
Jesse Bloom, the author of the analysis, is the principal investigator at the Bloom Lab at the Fred Hutch research institute, which studies the evolution of proteins and viruses. Bloom is also an investigator at the Howard Hughes Medical Institute.
While the virus was originally thought to have jumped from animals to humans at Wuhan's Huanan Seafood Market, that theory was largely dismissed after many early cases were found to have no connection to the market. Some studies have found cases from as early as September 2019.
While it is still unclear how the SARS-CoV-2 virus began spreading among humans, the virus is believed to be descended originally from coronaviruses from bats.
The problem, explained Bloom, is that while one would expect the earlier samples of the virus to be more similar to the bat coronavirus than later samples, this is not the case. The sequences collected from cases linked to the seafood market differ more from the bat coronavirus than other sequences collected at later dates outside Wuhan do, with the market samples containing three extra mutations compared to the samples collected later.
 

Very few sequenced virus samples are available from the Wuhan epidemic except for a dozen samples collected in late December in 2019 from patients connected to the Huanan Seafood Market.
In the pre-print analysis, Bloom explained that the lack of information may be partially due to an order issued to unauthorized Chinese labs to destroy all coronavirus samples from early in the outbreak.
Bloom first noticed the gap in the National Institutes of Health's Sequence Read Archive (SRA) when he saw that data listed in a study on early mutational events of the virus could not be found there. Data can only be deleted from the SRA by an email request.
The deleted data seems to have been collected by Aisu Fu and Renmin Hospital of Wuhan University, and Bloom traced the data back to a study published in the journal Small in June 2020. While the data was not accessible through traditional search methods, Bloom succeeded in finding the data for 34 virus-positive samples from early on in the epidemic on the SRA's Google Cloud storage. Of those, he was able to partially reconstruct the sequences of 13 samples.
Bloom later found that the sequencing project was removed from the China National GeneBank as well, shortly after the data was removed from the SRA.
The deleted sequences Bloom found somewhat fill in the gap between the samples collected from the Huanan Seafood Market and the possible progenitor viruses found in bats. The 13 partially reconstructed sequences are more similar to the bat coronaviruses than the samples found in the seafood market.
Bloom stressed in the pre-print analysis that this suggests that the sequences from the market are "not representative of the viruses that were circulating in Wuhan in late December of 2019 and early January of 2020."
The scientist bemoaned the fact that the sequences were deleted, stressing that it "clearly would have been more scientifically informative to fully sequence the samples rather than surreptitiously delete the partial sequences."
"There is no plausible scientific reason for the deletion," wrote Bloom, explaining that the paper the sequences were linked to had no corrections, that it stated that human subjects gave their approval, and that the sequencing shows no evidence of any contamination.
"It therefore seems likely the sequences were deleted to obscure their existence. Particularly in light of the directive that labs destroy early samples and multiple orders requiring approval of publications on COVID-19, this suggests a less than wholehearted effort to trace early spread of the epidemic," wrote Bloom in the pre-print analysis.
Bloom suggested that the NIH review its email records to identify other deletions from the SRA concerning the novel coronavirus, stressing that the deletions do not imply malfeasance as it is "infeasible" for SRA staff to validate the rationale for all deletion requests.
Renate Myles, a spokeswoman for the NIH, told The New York Times that the sequences "were submitted for posting in SRA in March 2020 and subsequently requested to be withdrawn by the submitting investigator in June 2020." The investigator told the archive managers that the sequences were being updated and would be added to a different database, but Bloom was unable to find such data.
"A careful re-evaluation of other archived forms of scientific communication, reporting, and data could shed additional light on the early emergence of the virus," advised Bloom, adding that it may be possible to obtain more information about the early spread of SARS-CoV-2, even if on-the-ground investigations face difficulties.
Bloom was one of the signatories of a letter published by Science in May calling for more investigation into the origin of the coronavirus pandemic. The letter stressed that "knowing how COVID-19 emerged is critical for informing global strategies to mitigate the risk of future outbreaks."
In February, one of the investigators in a World Health Organization-led team probing the origins of the pandemic said that China refused to give raw data on early COVID-19 cases to the WHO-led team, potentially complicating efforts to understand how the outbreak began.
A report on the origins of COVID-19 by a US government national laboratory concluded that the hypothesis claiming the virus leaked from a Chinese lab in Wuhan is plausible and deserves further investigation, The Wall Street Journal reported earlier this month, citing people familiar with the classified document.
The study was prepared in May 2020 by the Lawrence Livermore National Laboratory in California and was referred to by the State Department when it conducted an inquiry into the pandemic's origins during the final months of the Trump administration, the report said.
US intelligence agencies are considering two likely scenarios, that the virus resulted from a laboratory accident or that it emerged from human contact with an infected animal, but they have not come to a conclusion, US President Joe Biden said in May.
Reuters contributed to this report.