Hello! Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Incremental Jobs and Data Quality Are On a Collision Course - Part 1: The Problem If you keep an eye on the data space ecosystem like I do, then you’ll be aware of the rise of DuckDB and its message that big data is dead. The idea comes from two industry papers (and associated data sets), one from the Redshift team (paper and dataset) and one from Snowflake (paper and dataset). Each paper analyzed the queries run on their platforms, and some surprising conclusions were drawn – one being that most queries were run over quite small data. The conclusion (of DuckDB) was that big data was dead, and you could use simpler query engines rather than a data warehouse. It’s far more nuanced than that, but data shows that most queries are run over smaller datasets…Why?…
LLM Prompt Tuning Playbook This document is for anyone who would like to get better at prompting post-trained LLMs. We assume that readers have had some basic interactions with some sort of LLM (e.g. Gemini), but we do not assume a rigorous technical understanding…The first half of the document provides mental models on the nature of post-training and prompting. The second half of this document provides more concrete prescriptions and a high-level procedure for tuning prompts. Given the pace of innovation with LLMs, we suspect that the second half is likely to go stale a lot faster than the first half…
FireDucks: Pandas but 100x faster I deal with finance data all the time and so far the Pandas library has been an indispensable tool in my workflow and my most used Python library…I have around 30 thousand lines of Pandas code, so you can understand why I've been hesitant to rewrite them to Polars, despite my enthusiasm for speed and optimization…Here comes FireDucks, the answer to my prayer: a speed demon Pandas library!…the last two benchmark numbers are 130x and 200x faster than Pandas…
Find your algorithm for success with an online data science degree from Drexel University. Gain essential skills in tool creation and development, data and text mining, trend identification, and data manipulation and summarization by using leading industry technology to apply to your career. Learn more. * Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
How I ship projects at big tech companies I have shipped a lot of different projects over the last ~10 years in tech. I often get tapped to lead new ones when it’s important to get it right, because I’m good at it. Shipping in a big tech company is a very different skill to writing code, and lots of people who are great at writing code are terrible at shipping…Here’s what I think about when I’m leading a project and what I’ve seen people get wrong…
How Instacart Uses Machine Learning to Suggest Replacements for Out-of-Stock Products The replacement recommendation model sits at the heart of the replacement experience for our customers and shoppers. In a previous post, The Story Behind an Instacart Order, we provided a sneak peek into this model and how our customers engage with it when placing orders. In this blog post, we intend to delve deeper into the machine-learning aspects of the replacement model, shedding light on the various decisions we made throughout its development…
A 5-Step Incident Management Framework for Enterprise Data Organizations The sad reality is that, even in today’s modern data landscape, incident management is often largely ad hoc, with detection only spurred into action when bad data makes its way into production. And this reactionary approach to incident management undermines many teams’ efforts to operationalize and scale data quality programs over time…So, how are the best data teams in the world moving from reactive to proactive?…We’ve compiled the five most important steps to effective incident management, plus some best practices we’ve seen work well…
AMA: I’m Head of AI at a firm in the UK, advising Gov., industry, etc. [Reddit] Ask me anything about AI adoption in the UK, tech stack, how to become an AI/ML Engineer or Data Scientist etc, career development you name it…
Flow With What You Know: Basic physics provides a “straight, fast” way to get up to speed with flow-based generative models In this tutorial post, we provide an accessible introduction to flow-matching and rectified flow models, which are increasingly at the forefront of generative AI applications. Typical descriptions of them are usually laden with extensive probability-math equations, which can form barriers to the dissemination and understanding of these models. Fortunately, before they were couched in probabilities, the mechanisms underlying these models were grounded in basic physics, which provides an alternative and highly accessible (yet functionally equivalent) representation of the processes involved. Let’s flow…
Thoughts on Riemannian metrics and its connection with diffusion/score matching [Part I] We are so used to Euclidean geometry that we often overlook the significance of curved geometries and the methods for measuring things that don’t reside on orthonormal bases. Just as understanding physics and the curvature of spacetime requires Riemannian geometry, I believe a profound comprehension of Machine Learning (ML) and data is also not possible without it. There is an increasing body of research that integrates differential geometry into ML. Unfortunately, the term “geometric deep learning” has predominantly become associated with graphs. However, modern geometry offers much more than just graph-related applications in ML…
How the HashingVectorizer works You can use the CountVectorizer in scikit-learn to encode text to a sparse array that a machine learning model can use. This functionality is great, but it can result in huge widths. An alternative to this is the HashingVectorizer, which we discuss in this video…
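The idea behind a hashing vectorizer can be sketched in a few lines of plain Python. This is a minimal illustration of the hashing trick, not scikit-learn's implementation: the helper names `hash_token` and `hashing_vectorize`, the use of MD5, and the width of 16 columns are all assumptions chosen for the example. Each token is hashed straight to a column index, so the output width is fixed in advance and no vocabulary has to be stored.

```python
import hashlib
from collections import Counter


def hash_token(token: str, n_features: int) -> int:
    """Map a token to a column index using a stable hash.

    MD5 is used here only because Python's built-in hash() is salted per
    process; it is an assumption of this sketch, not what scikit-learn uses.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features


def hashing_vectorize(text: str, n_features: int = 16) -> dict[int, int]:
    """Return one sparse row as {column_index: count}, with a fixed width."""
    tokens = text.lower().split()
    return dict(Counter(hash_token(tok, n_features) for tok in tokens))


docs = ["the cat sat on the mat", "the dog chased the cat"]
rows = [hashing_vectorize(doc) for doc in docs]

for doc, row in zip(docs, rows):
    print(doc, "->", row)
```

The trade-off is that two different tokens can land in the same column (a collision), and a column index cannot be mapped back to the word that produced it. A larger `n_features` makes collisions rarer at the cost of a wider matrix.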
Does the UK’s liver transplant matching algorithm systematically exclude younger patients? Seemingly minor technical decisions can have life-or-death effects…A wrenching case study comes from the UK’s liver allocation algorithm, which appears to discriminate by age, with some younger patients seemingly unable to receive a transplant, no matter how ill. What went wrong here? Can it be fixed? Or should health systems avoid using algorithms for liver transplant matching?…
The Polars vs pandas difference nobody is talking about It's certainly nice to see people talking about Polars, and the focus tends to be on features such as:…Yet there's one innovation which barely ever gets a mention: non-elementary group-by aggregations…I'll start by introducing the group-by operation. We'll then take a look at elementary aggregations with both the pandas and Polars APIs. Finally, we'll look at non-elementary aggregations, and see how the Polars API enables so much more than the pandas one…
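To make the distinction concrete, here is a small library-free sketch of the two kinds of aggregation. It deliberately uses neither the pandas nor the Polars API; the column names (`id`, `sales`, `views`) and the specific aggregation are hypothetical and chosen only for illustration. An elementary aggregation applies one reducer to one column per group, while a non-elementary one evaluates a richer expression inside each group, here a filter on a second column before reducing.

```python
from collections import defaultdict

# Hypothetical toy data: one row per observation.
rows = [
    {"id": "a", "sales": 4, "views": 3},
    {"id": "a", "sales": 9, "views": 0},
    {"id": "b", "sales": 7, "views": 1},
    {"id": "b", "sales": 2, "views": 5},
]

# Group-by step: collect the rows belonging to each key.
groups = defaultdict(list)
for row in rows:
    groups[row["id"]].append(row)

# Elementary aggregation: a single reducer (max) over a single column.
max_sales = {key: max(r["sales"] for r in grp) for key, grp in groups.items()}

# Non-elementary aggregation: within each group, keep only rows with
# views > 0, then take the max of sales over what remains.
max_sales_viewed = {
    key: max((r["sales"] for r in grp if r["views"] > 0), default=None)
    for key, grp in groups.items()
}

print(max_sales)         # {'a': 9, 'b': 7}
print(max_sales_viewed)  # {'a': 4, 'b': 7}
```

Group `a` shows the difference: its largest sale (9) comes from a row with no views, so the filtered aggregation returns 4 instead.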
Perspectives on diffusion Diffusion models appear to come in many shapes and forms. If you pick two random research papers about diffusion and look at how they describe the model class in their respective introductions, chances are they will go about it in very different ways. This can be both frustrating and enlightening: frustrating, because it makes it harder to spot relationships and equivalences across papers and implementations – but also enlightening, because these various perspectives each reveal new connections and are a breeding ground for new ideas. This blog post is an overview of the perspectives on diffusion I’ve found useful…
Adventures in Probability One of my great school regrets, next to “taking three years of business classes,” is taking less statistics than I did, which was basically the bare minimum. It was, in fact, the bare minimum: one probability class and one statistics class…I learned later in life that the fact that I’m terrified of uncertainty is a reason in favour of understanding probability and statistics better, rather than a reason to avoid them…One thing you learn pretty quick if you look into queueing theory stuff, or control theory stuff, or any kind of performance modeling stuff, is the unreasonable ubiquity of the exponential distribution. That thing shows up all the dang time man…
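One reason the exponential distribution is so common in queueing models is that it is memoryless: having already waited a time s tells you nothing about how much longer you will wait. The short simulation below checks this numerically. It is only a sketch, and the rate, the values of `s` and `t`, the sample size, and the seed are arbitrary assumptions for the example.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

rate = 2.0  # events per unit time (arbitrary choice)
samples = [random.expovariate(rate) for _ in range(200_000)]


def tail_prob(xs, t):
    """Empirical estimate of P(X > t)."""
    return sum(x > t for x in xs) / len(xs)


s, t = 0.3, 0.5

# Unconditional probability of waiting longer than t.
p_t = tail_prob(samples, t)

# Conditional probability of waiting at least t more, given s has elapsed.
remaining = [x - s for x in samples if x > s]
p_cond = tail_prob(remaining, t)

# Closed-form value of P(X > t) for an exponential distribution.
expected = math.exp(-rate * t)

print(f"P(X > t)             ~ {p_t:.3f}")
print(f"P(X > s + t | X > s) ~ {p_cond:.3f}")
print(f"exp(-rate * t)       = {expected:.3f}")
```

All three numbers should agree closely, which is the memoryless property in action.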
Leakage and the reproducibility crisis in machine-learning-based science Machine learning (ML) is widely used across dozens of scientific fields. However, a common issue called “data leakage” can lead to errors in data analysis. We surveyed a variety of research that uses ML and found that data leakage affects at least 294 studies across 17 fields, leading to overoptimistic findings. We classified these errors into eight different types…
Predictions for the Future of RAG In the next 6 to 8 months, RAG will be used primarily for report generation. We'll see a shift from using RAG agents as question-answering systems to using them more as report-generation systems. This is because the value you can get from a report is much greater than the current RAG systems in use. I'll explain this by discussing what I've learned as a consultant about understanding value and then how I think companies should describe the value they deliver through RAG…RAG is the feature, not the benefit…
* Based on unique clicks. ** Find last week's issue #572 here.
Learning something for your job? Hit reply to get our help.
Looking to get a job? Check out our “Get A Data Science Job” Course. It is a comprehensive course that teaches you everything related to getting a data science job, based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~64,300 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :) Stay Data Science-y! All our best, Hannah & Sebastian