Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Show and tell: Stats communicators share their stories
One of the biggest challenges facing any statistician is communicating their work to people who don’t eat, sleep and breathe data to the extent they do…how do we help the non-expert get their head around complex statistics, and at the same time find our voices as messengers? In this special feature, we ask three of the best stats communicators working today [Daniel Parris, Alli Torban, Tom Chivers] to tell us what they do, and how and why they do it. Because rarely does the data speak for itself…
Histograms for faster boosting
Gradient boosted machines can take a while to train, but there is an internal trick that we can pull off to make it a whole lot faster to train trees. It turns out that a histogram may be all we need!…
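To make the idea concrete (this is not the post's code, just a minimal sketch with hypothetical gradient/hessian arrays): instead of scanning every unique feature value as a split candidate, bucket the feature into a fixed number of bins, accumulate gradient and hessian sums per bin, and only evaluate splits at bin boundaries.

    import numpy as np

    def best_split_histogram(x, grad, hess, n_bins=32, reg_lambda=1.0):
        # Bin the feature into equal-width bins (the "histogram")
        edges = np.linspace(x.min(), x.max(), n_bins + 1)
        bins = np.digitize(x, edges[1:-1])

        # Accumulate gradient/hessian sums per bin instead of per data point
        g_hist = np.bincount(bins, weights=grad, minlength=n_bins)
        h_hist = np.bincount(bins, weights=hess, minlength=n_bins)

        g_total, h_total = g_hist.sum(), h_hist.sum()
        best_gain, best_edge = 0.0, None
        g_left = h_left = 0.0

        # Only n_bins - 1 candidate splits to evaluate, however many rows there are
        for b in range(n_bins - 1):
            g_left += g_hist[b]
            h_left += h_hist[b]
            g_right, h_right = g_total - g_left, h_total - h_left
            gain = (g_left**2 / (h_left + reg_lambda)
                    + g_right**2 / (h_right + reg_lambda)
                    - g_total**2 / (h_total + reg_lambda))
            if gain > best_gain:
                best_gain, best_edge = gain, edges[b + 1]
        return best_edge, best_gain

This is roughly the trade-off histogram-based boosting makes: a small loss in split resolution in exchange for a large drop in the number of candidate splits per node.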
Predicting type 1 diabetes in children using electronic health records in primary care in the UK: development and validation of a machine-learning algorithm
Children presenting to primary care with suspected type 1 diabetes should be referred immediately to secondary care to avoid life-threatening diabetic ketoacidosis. However, early recognition of children with type 1 diabetes is challenging. Children might not present with classic symptoms, or symptoms might be attributed to more common conditions. A quarter of children present with diabetic ketoacidosis, a proportion unchanged over 25 years. Our aim was to investigate whether a machine-learning algorithm could lead to earlier detection of type 1 diabetes in primary care…
There are seven core components of an A/B testing stack, and if they're not all working properly, your company may not be making the right decisions: teams aren't shipping features that are actually helping customers, the org is leaving money on the table, and you're likely doing lots of manual work to validate experiment results.
Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams to self-serve experiment setup and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.
Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Spatial ML: Predicting on out-of-sample data
Using spatially derived features can greatly improve model performance. But predicting on out-of-sample data can be tricky. Three approaches I can think of…These features can include, but are not limited to (a sketch of the first one follows the list below):
the spatial lag (neighborhood average) of a variable
counts of neighboring features
most common category nearby
spatial embedding via principal coordinate analysis…
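As a rough illustration of the first feature above (my own sketch, not code from the thread): a k-nearest-neighbor spatial lag, and an out-of-sample version that only looks up neighbors among the training points so the feature can be built for locations the model has never seen.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def spatial_lag(coords, values, k=5):
        # Neighborhood average of `values` over each point's k nearest neighbors
        values = np.asarray(values)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)  # +1: each point is its own nearest neighbor
        _, idx = nn.kneighbors(coords)
        return values[idx[:, 1:]].mean(axis=1)                # drop self, average the neighbors

    def spatial_lag_out_of_sample(train_coords, train_values, new_coords, k=5):
        # For new locations, query neighbors among the *training* points only,
        # so the feature is available at prediction time without leakage
        train_values = np.asarray(train_values)
        nn = NearestNeighbors(n_neighbors=k).fit(train_coords)
        _, idx = nn.kneighbors(new_coords)
        return train_values[idx].mean(axis=1)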
When do you prefer SQL or Python for Data Engineering? [Reddit]
When do you prefer to use SQL vs Python, and what are usually the main determining factors?…
Guiding LLM Output with DSPy Assertions and Suggestions
Assertions in DSPy allow you to define strict rules and constraints that the LLM's output must adhere to (or that you simply want it to adhere to). For example, you can use assertions to ensure that the generated text doesn't contain certain "bad" words, or that the output conforms to a specific structure or format…Suggestions, on the other hand, provide a more flexible way to guide the LLM's output. Instead of hard-failing when a constraint is not met, suggestions offer feedback and guidance to the model, allowing it to refine its response and try again…In this blog post, we'll walk through practical examples of how to implement assertions and suggestions in your own DSPy-powered applications…
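For a feel of what this looks like in code (a hypothetical module with made-up constraints, built on DSPy's dspy.Assert and dspy.Suggest; see the post for working, end-to-end examples):

    import dspy

    class ShortCleanAnswer(dspy.Module):
        def __init__(self):
            super().__init__()
            self.generate = dspy.ChainOfThought("question -> answer")

        def forward(self, question):
            pred = self.generate(question=question)

            # Hard constraint: triggers retries and ultimately errors out if still violated
            dspy.Assert("lorem" not in pred.answer.lower(),
                        "Do not use placeholder text in the answer.")

            # Soft constraint: the message is fed back to the model so it can refine and retry
            dspy.Suggest(len(pred.answer.split()) <= 50,
                         "Keep the answer under 50 words.")
            return pred

    # Note: an LM must be configured (dspy.settings.configure) and assertion handling
    # activated on the module for the retry behavior to kick in; see the post/docs.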
NULL BITMAP Builds a Database #1: The Log is Literally the Database
Today we are starting a new series…We are going to make an LSM-based storage engine (Log-structured storage engine) piece by piece. I think a really lovely thing about LSMs is how they lend themselves to a piece by piece implementation: they're fundamentally a bunch of little components that fit together in intricate ways that can be improved independently and swapped out, so I think it's the perfect project for a piecemeal implementation…
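Just to make the title concrete before the series dives in (a toy Python illustration, not the series' code): if every write is appended to a log and the in-memory state is rebuilt by replaying that log, then the log really is the database.

    import json

    class LogDatabase:
        def __init__(self, path="data.log"):
            self.path = path
            self.mem = {}
            try:
                with open(self.path) as f:       # recover state by replaying the log
                    for line in f:
                        record = json.loads(line)
                        self.mem[record["key"]] = record["value"]
            except FileNotFoundError:
                pass

        def put(self, key, value):
            with open(self.path, "a") as f:      # durability comes from the append-only log
                f.write(json.dumps({"key": key, "value": value}) + "\n")
            self.mem[key] = value

        def get(self, key):
            return self.mem.get(key)

An LSM engine grows out of exactly this kind of starting point: sorted in-memory tables, flushes, and compaction get layered on piece by piece.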
Cleaning tables with merged cells - Walkthrough with table in .docx file
Picking up my work on data cleaning with R: Here's a brief post on working with data from tables with merged cells, often shared in PDF or Word documents and based on real-world examples…
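The post works through this in R; purely to illustrate the underlying fix with made-up data, here is the same idea in Python: merged cells typically arrive as missing values in every row but the first, so filling down restores the intended value.

    import pandas as pd

    # Hypothetical table read from a document where "region" was a merged cell
    df = pd.DataFrame({
        "region": ["North", None, None, "South", None],
        "store":  ["A", "B", "C", "D", "E"],
        "sales":  [10, 12, 9, 14, 11],
    })

    # Fill the merged value down so every row carries the region it belongs to
    df["region"] = df["region"].ffill()
    print(df)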
What is it like to dislike data?
This year’s Joint Statistical Meetings have me as a panelist for the memorial session of a well-respected professor of statistics. He devoted his work to the human aspect of statistical consulting. He was also my major advisor for the master’s program in statistics. I have been reflecting on the question of what I identify as his biggest influence on me. As with many things that have to do with humans, it is a little hard to articulate…
Superlative mechanical energy absorbing efficiency discovered through self-driving lab-human partnership
Energy absorbing efficiency is a key determinant of a structure’s ability to provide mechanical protection and is defined by the amount of energy that can be absorbed prior to stresses increasing to a level that damages the system to be protected. Here, we explore the energy absorbing efficiency of additively manufactured polymer structures by using a self-driving lab (SDL) to perform >25,000 physical experiments on generalized cylindrical shells. We use a human-SDL collaborative approach where experiments are selected from over trillions of candidates in an 11-dimensional parameter space using Bayesian optimization and then automatically performed while the human team monitors progress to periodically modify aspects of the system…
What is prompt optimization?
Prompt optimization is the process of improving the quality of prompts used to generate content, often by using a few shots of context to generate examples of the desired output and then refining the prompt to generate more of them…
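As a bare-bones, hypothetical illustration of the few-shot part (not from the linked post): seed the prompt with a handful of labelled examples, inspect the outputs, and iterate on which examples and instructions you keep.

    FEW_SHOT_EXAMPLES = [
        ("Translate to French: hello", "bonjour"),
        ("Translate to French: thank you", "merci"),
    ]

    def build_prompt(task, examples=FEW_SHOT_EXAMPLES):
        # A few worked examples in the prompt steer the model toward the desired output format
        shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
        return f"{shots}\nInput: {task}\nOutput:"

    print(build_prompt("Translate to French: good night"))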
Aurora: A Foundation Model of the Atmosphere
We introduce Aurora, a large-scale foundation model of the atmosphere trained on over a million hours of diverse weather and climate data. Aurora leverages the strengths of the foundation modeling approach to produce operational forecasts for a wide variety of atmospheric prediction problems, including those with limited training data, heterogeneous variables, and extreme events. In under a minute, Aurora produces 5-day global air pollution predictions and 10-day high-resolution weather forecasts that outperform state-of-the-art classical simulation tools and the best specialized deep learning models…
Let's talk about LLM evaluation
Since my team works on evaluation and leaderboards at Hugging Face, at ICLR 2024 (2 weeks ago) a lot of people wanted to pick my brain about the topic (which was very unexpected, thanks a lot to all who were interested). Thanks to all these discussions, I realized that a number of things that I take for granted evaluation-wise are 1) not widely spread ideas and 2) apparently interesting…
What are your achievements in Data Engineering? [Reddit]
What's the project you're working on, or the most significant impact you're making, on your company's data engineering team?…
Optimize a RAG DSPy Application
How many samples are necessary to achieve good performance with DSPy?…
In this tutorial, you will:
Build and optimize DSPy modules that use retrieval-augmented generation and multi-hop reasoning to answer questions over the Airbnb 2023 10-K filings dataset (a minimal starting-point module is sketched after this list),
Instrument your application using Parea AI,
Inspect the traces of your application to understand the inner workings of a DSPy forward pass,
Evaluate your modules,
Understand how many samples are necessary to achieve good performance on the test set…
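As a minimal starting point for the kind of module the tutorial builds and then optimizes (a sketch assuming an LM and retrieval model have already been configured via dspy.settings; the tutorial's actual modules add multi-hop reasoning and instrumentation on top):

    import dspy

    class SimpleRAG(dspy.Module):
        def __init__(self, k=3):
            super().__init__()
            self.retrieve = dspy.Retrieve(k=k)                                # pull k passages from the configured retriever
            self.answer = dspy.ChainOfThought("context, question -> answer")  # reason over them to produce an answer

        def forward(self, question):
            context = self.retrieve(question).passages
            return self.answer(context=context, question=question)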
Generative AI is poised to transform the world – IF data privacy solutions can keep up. Join us in San Francisco for Confidential Computing Summit, two eye-opening days on June 5 & 6 that will bring together top business and technology leaders to evaluate the latest solutions in secure and trustworthy AI, explore confidential data use cases, and get you up to speed on what’s now and what’s next. Get $200 off with promo code DSW —> https://bit.ly/4bvJvnk
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Practical Statistics in Medicine with R
The primary emphasis of this textbook is to introduce students to the basic ideas of medical statistics using R. It can also be used to support self-directed learning by students and researchers in the biomedical field who need to analyze data. Additionally, bridging the gap between theory and practice, it may be useful for (under)graduate students with a science background (engineering, mathematics) who want to move towards biomedical sciences and develop essential R skills…
Princeton’s ECE524: Foundations of Reinforcement Learning
Interested in learning the mathematical foundations of Reinforcement Learning (RL)? Now is a good time! This semester, we will make videos and lecture notes from my graduate-level RL theory course at Princeton available to the public…
Shiny apps for demystifying statistical models and methods
The purpose of these apps is to demonstrate how statistical methods and models work, using data simulation and visualization. Most of them simulate data according to the rules of a model, plot the simulated data, and report statistical results obtained by fitting the simulated data to the model from which they were generated. The user can adjust values of model parameters or change features of the model, then see what happens to the plotted data and to the statistical results. Most apps have an option to change the random seed, which will generate a new (pseudo-) random set of data…
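The apps themselves are interactive (Shiny), but the simulate-then-fit loop they demonstrate is easy to sketch; for example, in Python with a plain linear model and a changeable seed:

    import numpy as np

    rng = np.random.default_rng(seed=42)   # change the seed to generate a new simulated dataset

    # Simulate data according to the rules of a known model...
    true_intercept, true_slope, sigma = 2.0, 0.5, 1.0
    x = rng.uniform(0, 10, size=200)
    y = true_intercept + true_slope * x + rng.normal(0, sigma, size=200)

    # ...then fit that same model and compare the estimates to the true parameters
    slope_hat, intercept_hat = np.polyfit(x, y, deg=1)
    print(f"slope: true {true_slope}, estimated {slope_hat:.3f}")
    print(f"intercept: true {true_intercept}, estimated {intercept_hat:.3f}")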
* Based on unique clicks.
** Find last week's issue #547 here.
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~62,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian