Data Sharing

Stop Data Monopolies


Whenever I see papers in prestigious high-impact-factor (IF) journals, I keep wondering what is missing from my own research. I find that lack of access to massive aggregated datasets is what stops me from publishing in top journals. I suspect this holds for many other colleagues in genomics: we lack access to the massive, exclusive datasets that make publication in high-IF journals possible.


Exclusive access to exceptionally large or rare datasets is, in my opinion, what marks a paper out for a high-IF journal. This bias means that non-established endeavours (e.g., citizen science) rarely make much impact, if any at all: whatever people with few resources or connections can produce never reaches the data calibre needed to let them shine.

Despite all the talk about the need for data sharing, the reality is that many scientists have little or no incentive to make their massive datasets readily accessible and reusable. It takes too much effort and time and brings no rewards. This needs to change.

The most difficult, time-consuming stage of my research, I find, is getting raw data into its processed form, ready for analysis. Just sharing raw data does not seem enough. Both data generation and analysis can easily be crowdsourced; yet the production of readily analysable datasets is where the current bottleneck in the scientific enterprise lies.

What needs to be done to change this?

  • A good starting point is to recognise the problem: papers in top journals enjoy privileged access to data, and that is what makes them shine.
  • The best possible investment of public tax money (or industry money, for that matter) probably lies in making processed, curated data as widely accessible (i.e., reusable) as possible. The more people who can access and reuse the data, the greater the benefits drawn from the investment.
  • Data reusability plans should be recognised as an integral part of any grant proposal. By showing that data reuse has been planned throughout the experimental design rather than as an afterthought, applicants can assure funders and investors that their investment offers the greatest possible return.
  • The greatest possible return lies not in spectacular results but in the extent to which a particular dataset can be reused and its analysis crowdsourced.

 

Did this interest you? Follow me on Twitter to stay connected.
