Leveraging Open Data in Scientific Research

Explore top LinkedIn content from expert professionals.

  • View profile for Yubin Park, PhD
    Yubin Park, PhD is an Influencer

    CEO at mimilabs | CTO at falcon | LinkedIn Top Voice | Ph.D., Machine Learning and Health Data

    17,693 followers

    LLMs + Open Data => Clean and Manageable Data

    People ask me what I will do with all the data in the mimilabs data lakehouse. For context, I have been piling up publicly available health data for the last few months. As of today, I have accumulated almost 19TB of data. Yes, that's a lot. In the past, I would have said that I don't need that much data and that I don't have time to look at it. But things are different now. I am optimistic that we can achieve so much when it is combined with LLMs.

    For that, let me share an interesting paper, "Large Language Models for Integrating Social Determinants of Health Data: A Case Study on Heart Failure 30-Day Readmission Prediction," by Chase Fensore, Rodrigo M. Carrillo Larco, MD, PhD, Shivani A. Patel, Alanna Morris MD MSc, and Joyce Ho [1]. They used LLMs to review hundreds of SDoH variables and identify their relevance to the target prediction task (see the figure below). The authors say: "Our results demonstrate the potential of leveraging large language models (LLMs) to accelerate the integration of publicly available social determinants of health (SDOH) data with clinical measures for predictive healthcare tasks. By using LLMs to automatically annotate the domains of over 700 SDOH variables from multiple data sources, researchers can bypass the need for laborious manual annotation."

    This paper may be just one example, but I believe the implications are huge. With LLMs and massive open data, we can effectively "manage" data and make it usable. That's what I am excited about these days.

    [1] https://lnkd.in/eZ37iCxv

    #opendata #llm #datacleaning #preprocessing #cleandata #healthcareanalytics
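
The annotation step the paper describes can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the domain labels and keyword heuristic are hypothetical stand-ins for an actual LLM call that would prompt the model with each variable's name and description.

```python
# Sketch of LLM-style domain annotation for SDoH variables.
# The keyword lookup below is a hypothetical stand-in for an LLM call;
# a real pipeline would send each variable to a model with a prompt like
# "Which SDoH domain does the variable <name> belong to?"

SDOH_DOMAINS = {
    "economic": ["income", "poverty", "employment", "insurance"],
    "education": ["school", "literacy", "degree"],
    "neighborhood": ["housing", "transit", "crime", "walkability"],
    "social_context": ["isolation", "support", "household"],
    "healthcare_access": ["provider", "clinic", "hospital"],
}

def annotate_domain(variable_name: str) -> str:
    """Assign an SDoH domain label to a variable name (LLM placeholder)."""
    name = variable_name.lower()
    for domain, keywords in SDOH_DOMAINS.items():
        if any(kw in name for kw in keywords):
            return domain
    return "unknown"

# Annotating hundreds of variables then reduces to a single loop.
variables = ["median_household_income", "pct_no_insurance", "distance_to_hospital"]
annotations = {v: annotate_domain(v) for v in variables}
print(annotations)
```

Swapping the placeholder for a real model call keeps the structure the same: the point of the paper is that the loop replaces manual annotation of 700+ variables.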

  • View profile for Sylvia Burris

    Bioinformatics & Computational Biology PhD student | Data Scientist

    3,185 followers

    If an omics paper doesn’t come with code and data — is it even reproducible?

    In 2025, science isn’t real unless it can be re-run. Here’s the baseline:

    >> Code should live on GitHub (or at least be version-controlled)
    >> Data should be publicly accessible (GEO, SRA, ENA)
    >> Results should be auditable from raw to figure
    >> Notebooks > Methods sections

    Reproducibility isn’t a bonus anymore. It’s the bare minimum. Because without it, omics research is just a black box with pretty plots. If an omics paper can’t be audited, it can’t be trusted. And if it can, it becomes part of something bigger: real science.

    #Bioinformatics #Reproducibility #OpenScience #NGS #Omics #ComputationalBiology #PrecisionMedicine #ScientificComputing #CodeAsMethod #ResearchStandards
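
"Auditable from raw to figure" can be made concrete with a checksum manifest committed alongside the code. This is a minimal sketch under assumed conventions (the file layout and JSON manifest format are illustrative, not a community standard):

```python
# Minimal raw-to-figure audit sketch: record SHA-256 checksums of raw
# inputs so a reader can verify figures were built from the same files.
# The manifest layout here is an illustrative assumption.

import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large omics files fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(raw_dir: Path, manifest: Path) -> dict:
    """Checksum every raw file and save the manifest next to the code."""
    entries = {p.name: sha256_of(p)
               for p in sorted(raw_dir.glob("*")) if p.is_file()}
    manifest.write_text(json.dumps(entries, indent=2))
    return entries

def verify_manifest(raw_dir: Path, manifest: Path) -> bool:
    """Re-hash the raw files and compare against the recorded manifest."""
    recorded = json.loads(manifest.read_text())
    current = {p.name: sha256_of(p)
               for p in sorted(raw_dir.glob("*")) if p.is_file()}
    return recorded == current
```

Checking the manifest into version control with the analysis code ties the GEO/SRA download to the exact bytes the figures were generated from.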

  • View profile for Jeff Barr

    Vice President & Chief Evangelist at Amazon Web Services

    124,263 followers

    This is an impressive use case and a detailed case study -- NASA Jet Propulsion Laboratory and ISRO - Indian Space Research Organization are building an AWS-powered system that will download 4.4 TB of satellite data and produce 70 TB of satellite data products on a daily basis, using a combination of Spot and On-Demand Amazon EC2 instances for processing, Amazon S3 for long-term storage, and a host of other #AWS services for coordination, messaging, notification, and more.

    As part of the NASA-ISRO Synthetic Aperture Radar (NISAR) satellite mission, images of nearly all of Earth's land and ice surfaces will be captured every 6-12 days. The processed data will be archived in and then distributed through NASA's Earthdata Cloud data lake, also built on AWS, in support of NASA's open science policy.

    Read the entire case study at https://lnkd.in/gQUhg6je to learn a lot more!

  • View profile for Stephanie M. Lee

    Senior Writer at The Chronicle of Higher Education

    1,792 followers

    NEW: On July 1, the NIH started requiring that all agency-funded research be made freely, immediately available. And in response, some academic publishers are giving scientists no choice but to pay thousands in open-access fees in order to publish their research and comply with the NIH mandate.

    In a year when federal funding has been exceptionally unreliable, scientists say they are stressed about spending grant dollars on unexpected and questionable #openaccess charges. Things don’t have to be this way, open-science experts say: These fees are imposed entirely by publishers. The most prominent examples are Springer Nature and Elsevier, for-profit enterprises that generate billions in revenue. “They’re responsible to shareholders, and not to the research community,” said Christopher Steven Marcum, who helped draft the federal government’s open-science data policy during the Biden administration.

    These policies are affecting researchers like Stephanie Rolin, who's been charged nearly $4,400 to publish her latest paper. “I think it is important that people have access to science,” she said. But she can't currently tap into her NIH grant, and even if she could, the open-access fee would eat into the $50,000 she’s allocated to annual #research costs. “If every paper that I publish is going to be 10 percent of my budget,” she said, “there’s only so many papers I’m going to be able to publish.”

    Read my latest for The Chronicle of Higher Education. And get in touch if you've got a story about how article-processing charges are affecting you! stephanie.lee@chronicle.com https://lnkd.in/gV_3RTdm

  • View profile for Daniel Moghimi

    Senior Research Scientist @ Google | Security and Privacy Research Leader

    8,542 followers

    Is USENIX Security's Open Science Policy Helping or Hurting Research?

    USENIX Security, a major conference for sharing research in computer security, recently started a new policy. They're asking authors to share their code and other materials (called "artifacts") when they submit their research papers. If authors can't share their artifacts, they need to explain why. Ultimately, they must provide the artifacts before their paper is officially published.

    While the idea of open science (making research more accessible) is good, this policy has big problems, especially for researchers outside of universities:

    - Sharing isn't always possible: Many research labs in companies or government work with proprietary tools. This means they can't share code or tools that rely on private components.
    - Business rules get in the way: Even when sharing code is possible, companies have their own rules and schedules. For example, releasing code too early might put customers at risk.
    - Approvals take time: Getting permission to share code may be a long process, and it often doesn't fit with the strict deadlines of a conference.

    What does this mean for research? Many industry and government labs might stop publishing their work at USENIX Security. This could actually harm open research and open science because important findings from these labs won't be shared as widely. Ultimately, participation from certain research groups will dwindle.

    In the past, a lot of influential industry research, which later became public, was shared at conferences without requiring immediate artifact release. Sharing code isn't always a useful way to engage with a research community. There are hundreds of different research topics in today's academic security conferences, and they all differ in how much they benefit from published artifacts.

    So, is USENIX Security truly promoting open science, or are they accidentally creating barriers that limit who can share their research in a top-tier venue? https://lnkd.in/ggAGmRGK

  • View profile for Helen Toner

    Interim Executive Director, Center for Security and Emerging Technology

    7,975 followers

    New short paper from Kendrea Beers and me - 2 case studies of OpenMined's great work giving researchers/auditors/etc access to test privately held AI models.

    In pilot 1, DailyMotion, a French video-sharing site, connected OpenMined's privacy-preserving infrastructure to their stack. This let an external researcher analyze what kind of content they were upranking *without* needing access to user data or the algorithm 🤯

    Pilot 2 was even cooler: Anthropic & UK AISI did an exercise in "mutual secrecy," using OpenMined's tech to run a biosecurity evaluation where the AI model was kept private from the UK govt and the biological dataset was private from Anthropic. It was just a test exercise with a toy model and dataset so far, but the idea is that this could next be used to let them test a sensitive model on a sensitive dataset - without giving access to either.

    Still early days for this technology, but it's cool stuff, and we're proud that Center for Security and Emerging Technology (CSET)'s Foundational Research Grants program helped to fund OpenMined's work on this.

    Preprint of the paper (which Kendrea presented at the Conference on Frontier AI Safety Frameworks): https://lnkd.in/e8XGMpUe
    Call for proposals on a related topic from Open Philanthropy: https://lnkd.in/ehcMJsqe
    More technical detail from OpenMined on the secure enclaves on H100s they used to make the Anthropic/UKAISI pilot happen: https://lnkd.in/eCmS6dpc

  • View profile for Joshua Berkowitz

    💻 Software Consulting 🤖 AI & Full Stack Developer 👔 Professional Education 🛒 eCommerce 🏢 BigData 🛢️ Database Development 🏗️ Startup Mentor 🎓 Private Instruction 🤝 DevOps

    2,533 followers

    I've been reading about the recent effort by the Princeton Neuroscience Institute at Princeton University to map decision-making in the mouse brain, and the scale of the collaboration is remarkable. Getting 22 labs across continents to standardize their experimental procedures is a significant logistical and scientific achievement in itself. This approach allowed them to create a comprehensive dataset by integrating recordings from over 600,000 neurons.

    Read more 👉 https://lnkd.in/ePREw7xZ

    The findings highlight how complex behaviors involve widespread brain activity, moving beyond traditional models focused on isolated regions.

    The Down Low:
    🧠 Distributed Activity: Decision-making signals were observed across many brain areas, not just traditionally recognized cognitive centers.
    🤝 Power of Standardization: Unifying experimental protocols across multiple labs was essential for creating a large-scale, reliable dataset.
    📊 Open Data: Sharing this extensive dataset openly invites further analysis and discovery from the global research community.

    This work serves as a strong blueprint for future large-scale projects in neuroscience and other complex scientific fields.

    #Neuroscience #Collaboration #BigData #Research #DataScience #OpenScience

  • The World Bank is a major producer of development economics research, influencing policy decisions globally. To ensure that research consumers, especially policymakers, can easily examine and reproduce research results, there's a need for documented data sources, clear analytical scripts, and third-party validation. Although the World Bank has been a leader in open science through its Open Data and Open Knowledge initiatives, a significant gap existed in the availability of analytical scripts linking open data to knowledge products.

    In 2023, the World Bank launched a reproducible research initiative to enhance transparency in its analytical products by publishing reproducibility packages. These packages document how analytical results are derived from original datasets, allowing users to understand the methodology behind findings. The initiative focuses on the Policy Research Working Paper series, a key dissemination channel that is open to submissions from all staff and consultants. However, an analysis revealed that less than half of these working papers were published in journals within five years, and very few were submitted to journals that require data and code submission or verify reproducibility.

    Spearheaded by the World Bank's chief economist and the Development Economics Vice Presidency, the initiative builds on prior efforts by the Development Impact Department (DIME), which has mandated reproducibility verification since 2019. The DIME Analytics team has been instrumental in promoting transparent research practices and coordinates the new reproducible research initiative.

    In this column we describe our current efforts at the World Bank to provide curation support and reproducibility checks for hundreds of working papers, books, and “flagship reports.” I hope you will find the insights offered by Maria Jones, the head of DIME Analytics, helpful for reflecting on the practices in your own university, research institute, or government agency.

    https://lnkd.in/eJKczMnJ

  • View profile for Jessie Anderson

    Health & Regulated Industries @ Google Cloud | Advisor

    4,404 followers

    📢 Exciting news for the future of scientific discovery! 🔎🦠

    Did you know? An estimated $25 BILLION is wasted annually on non-reproducible preclinical research [Freedman et al., 2015].

    Following up on my previous post about the groundbreaking CryoSCAPE blood draw technique developed by Allen Institute for Immunology, I want to highlight an advancement and milestone in a crucial piece of the puzzle in accelerating research: a publication on proactive reproducibility in data analysis with the Human Immune System Explorer, built on Google Cloud with the power of Google Workspace.

    In this new publication in The Royal Society, Paul Meijer and the team at Allen Institute for Immunology dive deep into the big open science challenge of reproducibility, showcasing the power of the Human Immune System Explorer (HISE) platform. Far beyond just storing and sharing data, the HISE platform ensures transparency and verifiability at every step of the scientific analysis process. Just as CryoSCAPE revolutionized how we collect and preserve delicate blood samples, extending research opportunities to underserved communities, HISE provides students, researchers, academic institutes, and commercial partners the robust "dry lab" infrastructure to analyze this wealth of data with unparalleled rigor. Meijer's publication outlines how HISE's proactive approach of tracking data and methods in real time tackles the reproducibility crisis head-on.

    Now imagine 💡 the possibilities: seamlessly connecting globally collected, perfectly preserved 🩸 samples with an analysis platform that documents every analytical step and provides a fully reproducible environment for verification. This would significantly accelerate the validation of scientific findings, foster greater trust in research outcomes, and enable faster translation of discoveries into real-world applications.

    This integrated, transparent, and reproducible research is uniquely powered by the combination of Allen Institute's innovative technologies like CryoSCAPE and HISE, built on the scalable and secure infrastructure of Google Cloud. A trusted, efficient platform to launch 🚀 a wave of innovative AI-enabled biological discoveries.

    🤩 Excited about the future of inclusive & collaborative discoveries! 🧩 Check out the publication to learn more about proactive reproducibility and the transformative potential of HISE! https://lnkd.in/e_EcYJbt

    #OpenScience #Reproducibility #Immunology #GoogleCloud #AllenInstitute #CryoSCAPE #HISE #ScientificDiscovery #Research #Healthcare

    Shweta Maniar Rui Costa Peter Skene Ali Zaidi Stuart Gano Vandana Kapoor Will Grannis Prashant Gupta Doug Burton Sunny Barafwala Rory Headon-Weeks Ingram Tom Bumol ANANDA GOLDRATH Ernie Coffey Xiaojun Li Troy Torgerson Andy Hickl Niamh Cahill Eric Steele Collin Martin Lynne Becker Paul Mariz Lauren Hackett Lauren Okada Ed Lein Zachary Thomson Sathyanarayanan Subramanian Stark Pister

  • View profile for Bas Nijholt

    Staff Engineer at IonQ

    1,856 followers

    🎄🎁 Advent of Open Source – Day 6/24: Open Science Publications 🔬
    (See my intro post: https://lnkd.in/gVNYBE9m)

    Today's post is about something I have strong opinions on: making scientific research reproducible. While many researchers talk about open science, actually making your work reproducible for anyone requires significant effort.

    📖 Origin Story
    During my Ph.D. and subsequent research, I noticed a frustrating pattern in scientific publications: claims of "code available upon request" that often led nowhere, or code snippets that wouldn't run without substantial modification. I made sure my publications would be different – each includes complete, runnable code that reproduces every figure and result.

    🔧 Technical Highlights
    Multiple repositories covering quantum physics experiments and simulations:
    • orbitalfield (https://lnkd.in/gWEVZks6) - "Orbital effect of magnetic field on the Majorana phase diagram" (94 citations)
    • supercurrent-majorana-nanowire (https://lnkd.in/gJXYmYUK) - "Supercurrent interference in few-mode nanowire Josephson junctions" (78 citations)
    • zigzag-majoranas (https://lnkd.in/gJyDU7vX) - "Enhanced proximity effect in zigzag-shaped Majorana Josephson junctions" (56 citations)
    • azure-quantum-tgp (https://lnkd.in/gXEA7DeV) - "Protocol to identify a topological superconducting phase in a three-terminal device" (52 citations)
    • spin-orbit-nanowires (https://lnkd.in/gAuNg2aK) - "Spin-orbit protection of induced superconductivity in Majorana nanowires" (81 citations)

    🔄 Evolution of Best Practices
    What I consider best practices has evolved over time. Today, I would:
    • Use prefix.dev's Pixi for universal lock files across operating systems and programming languages
    • Provide a minimal Docker container that just installs the pixi lock file
    • Create self-documenting Jupyter notebooks that reproduce every result
    • Include clear figure-to-code mapping for paper reproducibility

    🎯 Challenges and Solutions
    • Balancing code cleanliness with research deadlines
    • Managing large datasets efficiently
    • Ensuring long-term reproducibility
    • Making complex physics simulations accessible

    💡 Lessons Learned
    1. Clean code takes time but saves more in the long run
    2. Documentation is as crucial as the code itself
    3. Lock files are essential for true reproducibility
    4. Making code public improves its quality

    Want to explore quantum physics simulations? Check out the repositories above, each linked to their corresponding papers with full reproduction instructions.

    #OpenSource #OpenScience #Physics #QuantumComputing #Python
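
The "figure-to-code mapping" practice can be sketched as a small registry: every figure in a paper maps to exactly one function that regenerates it. The decorator and function names below are illustrative assumptions, not code from the repositories above:

```python
# Sketch of a figure-to-code registry for paper reproducibility.
# Each named figure is bound to the one function that regenerates it;
# names and the example figure are illustrative, not from a real repo.

FIGURES = {}

def figure(name):
    """Decorator that registers a figure-producing function by name."""
    def register(func):
        FIGURES[name] = func
        return func
    return register

@figure("fig2_phase_diagram")
def fig2_phase_diagram():
    # In a real repository this would run the simulation and save the
    # plot; here it just returns the data it would have plotted.
    return {"x": [0, 1, 2], "y": [0, 1, 4]}

def reproduce(name):
    """Re-run the code behind a named figure from the paper."""
    return FIGURES[name]()
```

A reader can then go from any figure caption straight to `reproduce("fig2_phase_diagram")`, which is the auditability the post argues for.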
