
Data Science Weekly - Issue 573

Curated news, articles and jobs related to Data Science, AI, & Machine Learning

Issue #573
November 14, 2024

Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

And now…let's dive into some interesting links from this week.

Editor's Picks

Incremental Jobs and Data Quality Are On a Collision Course - Part 1: The Problem
If you keep an eye on the data space ecosystem like I do, then you’ll be aware of the rise of DuckDB and its message that big data is dead. The idea comes from two industry papers (and associated data sets), one from the Redshift team (paper and dataset) and one from Snowflake (paper and dataset). Each paper analyzed the queries run on their platforms, and some surprising conclusions were drawn – one being that most queries were run over quite small data. The conclusion (of DuckDB) was that big data was dead, and you could use simpler query engines rather than a data warehouse. It’s far more nuanced than that, but data shows that most queries are run over smaller datasets…Why?…

LLM Prompt Tuning Playbook
This document is for anyone who would like to get better at prompting post-trained LLMs. We assume that readers have had some basic interactions with some sort of LLM (e.g. Gemini), but we do not assume a rigorous technical understanding…The first half of the document provides mental models on the nature of post-training and prompting. The second half of this document provides more concrete prescriptions and a high-level procedure for tuning prompts. Given the pace of innovation with LLMs, we suspect that the second half is likely to go stale a lot faster than the first half…
FireDucks : Pandas but 100x faster
I deal with finance data all the time and so far the Pandas library has been an indispensable tool in my workflow and my most used Python library…I have around 30 thousand lines of Pandas code, so you can understand why I've been hesitant to rewrite them to Polars, despite my enthusiasm for speed and optimization…Here comes FireDucks, the answer to my prayer: a speed demon Pandas library!…the last two benchmark numbers are 130x and 200x faster than Pandas…

Sponsor Message

Online Data Science Programs from Drexel University

Find your algorithm for success with an online data science degree from Drexel University. Gain essential skills in tool creation and development, data and text mining, trend identification, and data manipulation and summarization by using leading industry technology to apply to your career. Learn more.

* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org

Data Science Articles & Videos

How I ship projects at big tech companies
I have shipped a lot of different projects over the last ~10 years in tech. I often get tapped to lead new ones when it’s important to get it right, because I’m good at it. Shipping in a big tech company is a very different skill to writing code, and lots of people who are great at writing code are terrible at shipping…Here’s what I think about when I’m leading a project and what I’ve seen people get wrong…
How Instacart Uses Machine Learning to Suggest Replacements for Out-of-Stock Products
The replacement recommendation model sits at the heart of the replacement experience for our customers and shoppers. In a previous post, The Story Behind an Instacart Order, we provided a sneak peek into this model and how our customers engage with it when placing orders. In this blog post, we intend to delve deeper into the machine-learning aspects of the replacement model, shedding light on the various decisions we made throughout its development…
A 5-Step Incident Management Framework for Enterprise Data Organizations
The sad reality is that, even in today’s modern data landscape, incident management is often largely ad hoc, with detection only spurred into action when bad data makes its way into production. And this reactive approach to incident management undermines many teams’ efforts to operationalize and scale data quality programs over time…So, how are the best data teams in the world moving from reactive to proactive?…We’ve compiled the five most important steps to effective incident management, plus some best practices we’ve seen work well…
AMA: I’m Head of AI at a firm in the UK, advising Gov., industry, etc. [Reddit]
Ask me anything about AI adoption in the UK, tech stack, how to become an AI/ML Engineer or Data Scientist etc, career development you name it…
Flow With What You Know: Basic physics provides a “straight, fast” way to get up to speed with flow-based generative models
In this tutorial post, we provide an accessible introduction to flow-matching and rectified flow models, which are increasingly at the forefront of generative AI applications. Typical descriptions of them are usually laden with extensive probability-math equations, which can form barriers to the dissemination and understanding of these models. Fortunately, before they were couched in probabilities, the mechanisms underlying these models were grounded in basic physics, which provides an alternative and highly accessible (yet functionally equivalent) representation of the processes involved. Let’s flow…
Thoughts on Riemannian metrics and its connection with diffusion/score matching [Part I]
We are so used to Euclidean geometry that we often overlook the significance of curved geometries and the methods for measuring things that don’t reside on orthonormal bases. Just as understanding physics and the curvature of spacetime requires Riemannian geometry, I believe a profound comprehension of Machine Learning (ML) and data is also not possible without it. There is an increasing body of research that integrates differential geometry into ML. Unfortunately, the term “geometric deep learning” has predominantly become associated with graphs. However, modern geometry offers much more than just graph-related applications in ML…
How the HashingVectorizer works
You can use the CountVectorizer in scikit-learn to encode text to a sparse array that a machine learning model can use. This functionality is great, but it can result in huge widths. An alternative to this is the HashingVectorizer, which we discuss in this video…
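To make the width problem concrete, here is a minimal pure-Python sketch of the hashing trick next to a vocabulary-based count encoding. It is an illustration only, not scikit-learn's implementation: the function names, the MD5-based hash, the whitespace tokenizer, and the toy sentences are all assumptions chosen for the example.

```python
import hashlib
from collections import Counter


def hash_token(token, n_features):
    """Map a token to a column index in [0, n_features) using a stable hash."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features


def hashing_vectorize(text, n_features=16):
    """Encode text as a fixed-width count vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in text.lower().split():
        # Different tokens may collide in the same column; that is the trade-off.
        vector[hash_token(token, n_features)] += 1
    return vector


def count_vectorize(texts):
    """Vocabulary-based encoding: the width equals the number of distinct tokens."""
    vocabulary = sorted({tok for t in texts for tok in t.lower().split()})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    rows = []
    for t in texts:
        row = [0] * len(vocabulary)
        for tok, n in Counter(t.lower().split()).items():
            row[index[tok]] = n
        rows.append(row)
    return vocabulary, rows


# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a completely new sentence",
]
vocab, count_rows = count_vectorize(docs)
hashed_rows = [hashing_vectorize(d, n_features=8) for d in docs]

print(len(vocab), "columns with a vocabulary")
print(len(hashed_rows[0]), "columns with hashing, however many new words appear")
```

The vocabulary-based width keeps growing as new words show up, while the hashed width stays at whatever `n_features` you pick, at the cost of occasional collisions and no way to map a column back to a word.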
Does the UK’s liver transplant matching algorithm systematically exclude younger patients?
Seemingly minor technical decisions can have life-or-death effects…A wrenching case study comes from the UK’s liver allocation algorithm, which appears to discriminate by age, with some younger patients seemingly unable to receive a transplant, no matter how ill. What went wrong here? Can it be fixed? Or should health systems avoid using algorithms for liver transplant matching?…
The Polars vs pandas difference nobody is talking about
It's certainly nice to see people talking about Polars, and the focus tends to be on features such as:
- lazy execution
- Rust
- consistent handling of null values
- multithreading
- query optimisation
Yet there's one innovation which barely ever gets a mention: non-elementary group-by aggregations…I'll start by introducing the group-by operation. We'll then take a look at elementary aggregations with both the pandas and Polars APIs. Finally, we'll look at non-elementary aggregations, and see how the Polars API enables so much more than the pandas one…
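As a rough illustration of the distinction, the sketch below computes one elementary and one non-elementary aggregation in plain Python. It deliberately uses neither the pandas nor the Polars API, and it assumes "non-elementary" means an aggregation that composes several steps inside each group (here: filter on a per-group mean, then take a maximum). The column names and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: (group id, sales, views)
rows = [
    ("a", 4, 3),
    ("a", 1, 1),
    ("a", 2, 5),
    ("b", 7, 8),
    ("b", 6, 9),
    ("b", 7, 7),
]


def group_rows(rows):
    """Collect (sales, views) pairs under their group id."""
    groups = defaultdict(list)
    for key, sales, views in rows:
        groups[key].append((sales, views))
    return dict(groups)


def elementary_max_views(rows):
    """Elementary aggregation: a single reduction over one column per group."""
    return {k: max(v for _, v in g) for k, g in group_rows(rows).items()}


def max_views_where_sales_above_mean(rows):
    """Non-elementary aggregation: the filter depends on a per-group statistic."""
    result = {}
    for key, group in group_rows(rows).items():
        threshold = mean(s for s, _ in group)
        kept = [v for s, v in group if s > threshold]
        result[key] = max(kept) if kept else None
    return result


print(elementary_max_views(rows))
print(max_views_where_sales_above_mean(rows))
```

The second function needs the group mean before it can filter and reduce, which is the kind of multi-step, per-group logic the article is concerned with.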
Perspectives on diffusion
Diffusion models appear to come in many shapes and forms. If you pick two random research papers about diffusion and look at how they describe the model class in their respective introductions, chances are they will go about it in very different ways. This can be both frustrating and enlightening: frustrating, because it makes it harder to spot relationships and equivalences across papers and implementations – but also enlightening, because these various perspectives each reveal new connections and are a breeding ground for new ideas. This blog post is an overview of the perspectives on diffusion I’ve found useful…
Adventures in Probability
One of my great school regrets, next to “taking three years of business classes,” is taking as little statistics as I did, which was basically the bare minimum. It was, in fact, the bare minimum: one probability class and one statistics class…I learned later in life that the fact that I’m terrified of uncertainty is a reason in favour of understanding probability and statistics better, rather than a reason to avoid them…One thing you learn pretty quick if you look into queueing theory stuff, or control theory stuff, or any kind of performance modeling stuff, is the unreasonable ubiquity of the exponential distribution. That thing shows up all the dang time, man…
Leakage and the reproducibility crisis in machine-learning-based science
Machine learning (ML) is widely used across dozens of scientific fields. However, a common issue called “data leakage” can lead to errors in data analysis. We surveyed a variety of research that uses ML and found that data leakage affects at least 294 studies across 17 fields, leading to overoptimistic findings. We classified these errors into eight different types…
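For readers who have not run into leakage before, here is a small sketch of one commonly discussed form: fitting a preprocessing step on training and test data together. It is not taken from the survey; the helper names and numbers are made up for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn standardisation parameters (mean, population std) from values."""
    return mean(values), pstdev(values)


def transform(values, scaler):
    """Standardise values with previously fitted parameters."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values for a train/test split.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leaky: the scaler sees the test set, so test information shapes training features.
leaky_scaler = fit_scaler(train + test)
leaky_train = transform(train, leaky_scaler)

# Clean: the scaler is fitted on training data only, then reused on the test set.
clean_scaler = fit_scaler(train)
clean_train = transform(train, clean_scaler)
clean_test = transform(test, clean_scaler)

print("leaky scaler:", leaky_scaler)
print("clean scaler:", clean_scaler)
```

In the leaky version the training features already encode where the test values sit, so any evaluation on that test set is no longer a fair estimate of performance on unseen data.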
Predictions for the Future of RAG
In the next 6 to 8 months, RAG will be used primarily for report generation. We'll see a shift from using RAG agents as question-answering systems to using them more as report-generation systems. This is because the value you can get from a report is much greater than the value of the current RAG systems in use. I'll explain this by discussing what I've learned as a consultant about understanding value and then how I think companies should describe the value they deliver through RAG…RAG is the feature, not the benefit…

Last Week's Newsletter's 3 Most Clicked Links

* Based on unique clicks.
** Find last week's issue #572 here.

Cutting Room Floor

Whenever you're ready, 3 ways we can help:

Learning something for your job? Hit reply to get our help.
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~64,300 subscribers by sponsoring this newsletter. 35-45% weekly open rate.

Thank you for joining us this week! :)

Stay Data Science-y!

All our best,
Hannah & Sebastian
