Open Data Sources

Data has become a highly desirable resource in the digital landscape. Unlike oil, it can be replicated and shared at almost no additional cost. Open datasets are commonly used for free across various projects, for reasons ranging from marketing to improving research accuracy to government transparency. Because accuracy can be a concern, users should decide on an acceptable margin of error before relying on such datasets.

However, the potential benefits of collaboration can ultimately outweigh this risk.

Here are 16 Free Open Data Sources for 2023

1) Dolly 2.0

Databricks announced the release of Dolly 2.0, the latest version of its large language model (LLM) with ChatGPT-like human interaction capabilities. The updated version came just two weeks after the initial release. According to Databricks, Dolly 2.0 is the first open source LLM that is trained to follow instructions and fine-tuned on a publicly available dataset. The dataset used for Dolly 2.0 is also available for commercial use. This means that businesses can use Dolly 2.0 for commercial purposes without paying for API access or sharing their data with third-party providers.

2) OpenStreetMap

OpenStreetMap is a collaborative world map that is constantly updated by its users. Using its browser-based editor, anyone can access the data and modify the locations of streets, buildings, signs, and other elements. These changes are compiled into a comprehensive dataset that major map-making and route-planning organizations can use.

3) Kaggle 

Kaggle focuses on data science and offers access to notebooks filled with both Python and R code. The portal also includes lessons and competitions aimed at helping interested individuals learn more about data science and the data itself. One particularly interesting feature of the site is the collection of datasets, which range from basic facts and figures to more unusual ones, such as the winning numbers for South Korea’s lottery.

4) Data.gov

Data.gov is a central clearinghouse for US government data sources, including the Integrated Postsecondary Education Data System and the US Geological Survey’s topographic data for every square mile of the country. Additionally, the site offers a listing of data hubs for further exploration within individual government agencies.

5) Public Library of Science

The Public Library of Science (PLOS) was founded as an alternative to for-profit scientific journals and now offers PLOS Open Data, a collection of associated datasets geared toward scientific research. This initiative lets scientists access the data and rerun analyses, and it facilitates meta-analysis by combining research from multiple studies to look for larger patterns.

6) Open Science Data Cloud

Another valuable resource is the Open Science Data Cloud, which allows scientists across multiple disciplines to share lab data with one another. Particularly noteworthy among its projects are the Harvard Cultural Observatory’s Bookworm, a collection of books and textual material, and Bionimbus, a collection of biomedical data for cell studies.

7) AWS Open Datasets

AWS provides an extensive collection of open datasets, many of them tied to Amazon’s environmental sustainability initiatives and focused on data about the natural world. The company preloads many datasets into some of its services, such as EMR, and updates them regularly. For example, in January 2021, AWS updated the bioacoustic recordings of orca sounds with streaming audio from Puget Sound.

8) Azure Open Datasets

Azure Open Datasets are curated and preprocessed to simplify their use with Azure’s instances and AI routines. These datasets include the Producer Price Index from the US Bureau of Labor Statistics, which economists use to track inflation, and New York City’s yellow taxi cab records, which urban planners use to study pick-up and drop-off times.

9) Google’s Dataset Search

Google’s Dataset Search indexes more than 25 million datasets, many of them suitable for training machine learning models. Programmers and developers can search thousands of repositories using simple keywords to discover open datasets. Google has also promised to expand the coverage and variety of the datasets it indexes.

10) FiveThirtyEight

FiveThirtyEight, a popular data journalism site, publishes the data underlying the analyses in its articles. For instance, the NHL predictions are based on thousands of simulations that are updated after each game. Political polling data and a meta-analysis of pollster ratings are also available on the site.
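To make the idea of "thousands of simulations" concrete, here is a minimal Monte Carlo sketch in Python. It is not FiveThirtyEight's model: the function name, the fixed per-game win probability, and the best-of-seven format are all assumptions chosen for illustration. It estimates how often a team wins a series by replaying that series many times with random outcomes.

```python
import random


def simulate_series_win_probability(p_game_win, wins_needed=4,
                                    n_simulations=10_000, seed=0):
    """Estimate the chance of winning a first-to-`wins_needed` series.

    p_game_win is an assumed, constant probability of winning any single
    game; a real forecasting model would derive it from team ratings.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    series_wins = 0
    for _ in range(n_simulations):
        team_wins = opponent_wins = 0
        # Play games until one side reaches the required number of wins.
        while team_wins < wins_needed and opponent_wins < wins_needed:
            if rng.random() < p_game_win:
                team_wins += 1
            else:
                opponent_wins += 1
        if team_wins == wins_needed:
            series_wins += 1
    return series_wins / n_simulations


if __name__ == "__main__":
    # Hypothetical input: the team wins 55% of individual games.
    print(simulate_series_win_probability(0.55))
```

Updating the forecast after each game would amount to re-running the simulation with revised inputs, such as a new per-game probability or the current series score.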

11) Yelp

Yelp distributes a subset of their vast collection of opinions about restaurants, shops, and other establishments. Though the current batch is limited to almost seven million reviews of around 150,000 businesses across 11 major cities, Yelp expects that the text and photos will provide rich opportunities for training natural language processing algorithms and other AI applications.

13) DBpedia

DBpedia is an attempt to create an open knowledge graph brimming with ontological information that can be queried with SPARQL. This structure makes it possible to design queries that rely on inference over relationships rather than just raw keyword matching.
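As a rough illustration of what a SPARQL query against DBpedia can look like, the Python sketch below builds a request URL for the public endpoint at https://dbpedia.org/sparql and flattens a response in the standard SPARQL JSON results format. The example query (cities in South Korea with their populations) and the helper names build_sparql_url and extract_rows are assumptions made for this example, not part of any DBpedia client library.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query only: it asks for resources typed as cities whose
# country is South Korea, ordered by population.
EXAMPLE_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:country dbr:South_Korea ;
        dbo:populationTotal ?population .
}
ORDER BY DESC(?population)
LIMIT 5
"""


def build_sparql_url(query, endpoint=DBPEDIA_ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",
    }
    return f"{endpoint}?{urlencode(params)}"


def extract_rows(results):
    """Flatten SPARQL JSON results into a list of plain dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


if __name__ == "__main__":
    print(build_sparql_url(EXAMPLE_QUERY))
```

Fetching the URL (for example with urllib.request.urlopen) and passing the decoded JSON to extract_rows would yield one dict per result row. The network call is left out here so the sketch runs offline.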

14) Facebook’s Social Network

Facebook’s social network is a source of valuable cultural information that can be accessed through Meta’s Graph API. The API lets code traverse this vast data structure as a graph of nodes and query it for cultural data, though most requests require an access token.

15) GitHub

GitHub is popularly known as a repository for code, but it also hosts data alongside code. The advantage is that it tracks how files evolve over time. For instance, MIT’s course on Deep Learning stores sample material for class assignments, such as training autonomous cars. Python analytics samples for studying NFTs are also available in numerous GitHub repositories.

16) Appen Datasets Resource Center

Lastly, the Appen Datasets Resource Center provides high-quality licensable datasets that aren’t restricted in size. The collection includes over 11,000 hours of audio and over 25,000 images, with 8.7 million words in 80 languages. These datasets are intended to improve the accuracy of machine learning models and to serve a global customer base that needs data to train them.

Final Thoughts

Overall, while open data brings many advantages and opportunities, it also carries risks and drawbacks. On the positive side, open data contributes to greater efficiency and cost reduction, as well as transparency and trust in organizations. It also broadens access to knowledge from different perspectives by removing barriers such as cost. However, organizations must seriously consider the downsides of open data: errors in interpreting or using existing data, privacy and consent issues, the mosaic effect (where combining multiple datasets creates new information), and sustainability and cost considerations.
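The mosaic effect is easier to see with a small example. The Python sketch below uses two entirely made-up tables: a "de-identified" health dataset and a public roll with names. Neither one links a name to a diagnosis, but joining them on shared attributes does. The field names, records, and the link_records helper are all hypothetical.

```python
# Toy, invented records. Neither table alone ties a name to a diagnosis.
health_records = [
    {"zip": "02139", "birth_year": 1985, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "94110", "birth_year": 1985, "sex": "F", "diagnosis": "migraine"},
]
public_roll = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1985, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1990, "sex": "M"},
    {"name": "Carol Example", "zip": "60614", "birth_year": 1972, "sex": "F"},
]

# Attributes that are harmless one at a time but identifying in combination.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link_records(left, right, keys=QUASI_IDENTIFIERS):
    """Join two datasets on the given keys, keeping only unique matches."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    linked = []
    for row in left:
        matches = index.get(tuple(row[k] for k in keys), [])
        if len(matches) == 1:  # exactly one candidate: a re-identification
            linked.append({**matches[0], **row})
    return linked


reidentified = link_records(health_records, public_roll)

if __name__ == "__main__":
    for person in reidentified:
        print(person["name"], "->", person["diagnosis"])
```

In this toy data, two of the three health records can be tied to a named person. Running this kind of check before publishing a dataset is one way to gauge how exposed it is to being combined with other sources.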

In conclusion, although open data is a powerful tool for advancing progress, much more needs to be done to find creative solutions that address its risks effectively. Organizations should weigh the pros and cons before committing to or leveraging open data initiatives if they want these projects to reach their full potential without compromising important considerations along the way.