Hello! Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
An intuitive, visual guide to copulas If you ask a statistician what a copula is, they might say “a copula is a multivariate distribution C(U1, U2, …, Un) such that marginalizing gives Ui ∼ Uniform(0,1)”. OK… wait, what? I personally really dislike these math-only explanations that make many concepts appear far more difficult than they actually are, and copulas are a great example of that. The name alone always seemed pretty daunting to me. However, they are actually quite simple, so we’re going to try and demystify them a bit. At the end, we will see what role copulas played in the 2007-2008 Financial Crisis…
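The quoted definition is easier to see in code. Here is a minimal sketch (our own, not code from the linked post) of sampling from a Gaussian copula: correlate two standard normals, then push each through the normal CDF so that each marginal comes out Uniform(0,1) while the dependence survives:

```python
import math
import random

def gaussian_copula_sample(rho, n, seed=0):
    """Draw n pairs (u1, u2) from a Gaussian copula with correlation rho.

    Each marginal is Uniform(0, 1); the dependence between the two
    coordinates comes entirely from the correlated normals underneath.
    """
    rng = random.Random(seed)

    def phi(z):  # standard normal CDF
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho**2) * rng.gauss(0, 1)
        pairs.append((phi(z1), phi(z2)))
    return pairs

pairs = gaussian_copula_sample(rho=0.8, n=10_000)
# Marginals should look uniform: the mean of each coordinate is near 0.5.
print(sum(u for u, _ in pairs) / len(pairs))
print(sum(v for _, v in pairs) / len(pairs))
```

Plotting the pairs would show points hugging the diagonal (the dependence), even though each axis alone looks flat (the uniform marginals).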
alphaXiv - Open research discussion directly on top of arXiv Students at Stanford have built alphaXiv, an open discussion forum for arXiv papers…
A Visual Guide to Quantization - Demystifying the Compression of LLMs Large Language Models (LLMs) are often too large to run on consumer hardware. These models may exceed billions of parameters and generally need GPUs with large amounts of VRAM to speed up inference. As such, more and more research has been focused on making these models smaller through improved training, adapters, etc. One major technique in this field is called quantization. In this post, I will introduce the field of quantization in the context of language modeling and explore concepts one by one to develop an intuition about the field. We will explore various methodologies, use cases, and the principles behind quantization…
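For a taste of the topic (our own sketch, not code from the post), here is quantization in its simplest form: symmetric “absmax” rounding, which rescales floats so the largest magnitude maps to 127 and stores each weight as an 8-bit integer:

```python
def quantize_absmax(weights):
    """Symmetric 8-bit 'absmax' quantization: scale floats into [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Map the stored integers back to (approximate) floats."""
    return [x * scale for x in q]

weights = [0.9, -0.3, 0.0, 1.27, -1.1]
q, scale = quantize_absmax(weights)
print(q)                     # -> [90, -30, 0, 127, -110]
restored = dequantize(q, scale)
```

Real LLM quantization schemes (GPTQ, AWQ, and friends) are far more sophisticated, but the core trade remains the same: fewer bits per weight in exchange for a small, controlled rounding error.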
Building a Big Picture Data Team at StubHub See how Meghana Reddy, Head of Data at StubHub, built a data team that delivers business insights accurately and quickly with the help of Snowflake and Hex. The challenges she faced may sound familiar:
- Unclear SMEs meant questions went to multiple people
- Without SLAs, answer times were too long
- Lack of data modeling & source-of-truth metrics generated varying results
- Lack of discoverability & reproducibility cost time, efficiency, and accuracy
- Static reporting reserved interactivity for rare occasions
Register now to hear how Meghana and the StubHub data team tackled these challenges with Snowflake and Hex. And watch Meghana demo StubHub’s data apps that increase quality and speed to insights… * Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Lessons from My First 8 Years of Research Six years ago, after two years of working at a research-oriented startup and just before starting graduate school, I wrote a blog post called Lessons from My First Two Years of AI Research. That post is below. Now that I’ve finished my PhD, I’ve updated the post at the bottom with a few more lessons learned…

MCMC Sampling for Dummies This blog post is an attempt at explaining the intuition behind MCMC sampling (specifically, the random-walk Metropolis algorithm). Critically, we’ll be using code examples rather than formulas or math-speak. Eventually you’ll need the math, but I personally think it’s better to start with an example and build the intuition before you move on to the math…
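In the spirit of that post, here is a bare-bones random-walk Metropolis sampler (our own sketch, not the author’s code), targeting a standard normal:

```python
import math
import random

def metropolis(log_target, n_samples, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis: propose x' = x + Normal(0, step), then accept
    with probability min(1, target(x') / target(x))."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0, step)
        # Compare log densities; the normalizing constant cancels out.
        if math.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal  # accept; otherwise keep the current position
        samples.append(x)
    return samples

# Target: standard normal (up to a constant), so the log density is -x^2 / 2.
samples = metropolis(lambda x: -x * x / 2, n_samples=20_000)
print(sum(samples) / len(samples))  # sample mean should be near 0
```

Note that we never need the normalized density, only a ratio, which is exactly why Metropolis is so useful for awkward posteriors.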
Using Quarto to write free, accessible training materials A blog post that turned into a short tutorial explaining why Quarto is my weapon of choice when making my training accessible, and how to host materials for free using Netlify…
What are some typical ‘rookie’ mistakes Data Scientists make early in their career? [Reddit] I was asked this question by one of my interns I am mentoring, and thought it would also be a good idea to ask the community as a whole since my sample size is only from the embarrassing things I have done as a jr 😂…
Become a Superlearner! An Illustrated Guide to Superlearning Superlearning is a technique for prediction that involves combining many individual statistical algorithms (commonly called “data-adaptive” or “machine learning” algorithms) to create a new, single prediction algorithm that is expected to perform at least as well as any of the individual algorithms…The superlearner algorithm “decides” how to combine, or weight, the individual algorithms based upon how well each one minimizes a specified loss function, for example, the mean squared error (MSE). This is done using cross-validation to avoid overfitting. The motivation for this type of “ensembling” is that a mix of multiple algorithms may be more optimal for a given data set than any single algorithm…
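To make the idea concrete, here is a toy sketch (ours, not from the linked guide) that uses K-fold cross-validation to pick a convex weight between two candidate predictors, a mean-only model and a simple linear fit, by minimizing CV mean squared error:

```python
import random

def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return lambda x: my + b * (x - mx)

def fit_mean(xs, ys):
    """Baseline predictor: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def superlearner_weight(xs, ys, k=5):
    """Pick the convex weight w that minimizes the K-fold CV MSE of
    w * linear + (1 - w) * mean, searched on a coarse grid."""
    n = len(xs)
    lin_pred, mean_pred = [0.0] * n, [0.0] * n
    for f in range(k):
        hold = set(range(f, n, k))
        train = [i for i in range(n) if i not in hold]
        lin = fit_linear([xs[i] for i in train], [ys[i] for i in train])
        mean = fit_mean([xs[i] for i in train], [ys[i] for i in train])
        for i in hold:  # out-of-fold predictions avoid overfitting the weight
            lin_pred[i], mean_pred[i] = lin(xs[i]), mean(xs[i])

    def cv_mse(w):
        return sum((w * lin_pred[i] + (1 - w) * mean_pred[i] - ys[i]) ** 2
                   for i in range(n)) / n

    return min((w / 20 for w in range(21)), key=cv_mse)

rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]
print(superlearner_weight(xs, ys))  # data is linear, so the weight on the linear model should be near 1
```

A real superlearner (e.g. the SuperLearner R package) combines many learners and solves for the full weight vector, but the mechanics, out-of-fold predictions plus loss minimization, are the same.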
We had an AI attempt to make a data-driven story like we do at The Pudding With LLMs (henceforth, AI) in the spotlight – blowing minds, raking in venture capital, and prompting existential crises – we at The Pudding were curious: Will this make our jobs obsolete? How scared should we be that someone else could replicate what we do without the time, training, and expertise we have? Basically, can an AI make a data-driven, visual story, much like we do at The Pudding? What does it actually “do,” and how well does it do it? So, we tried replacing ourselves with Claude, an AI from Anthropic…
Column Names as Contracts Using controlled vocabularies for column names is a low-tech, low-friction approach to building a shared understanding of how each field in a data set is intended to work…In this post, I’ll introduce the concept with an example and demonstrate how controlled vocabularies can offer lightweight solutions to rote data validation, discoverability, and wrangling…I’ll illustrate these usability benefits with R packages including pointblank , collapsibleTree , and dplyr , but we’ll conclude by demonstrating how the same principles apply to other packages and languages…
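The post’s examples are in R; here is a rough Python sketch of the core idea, checking column names against a hypothetical controlled vocabulary of leading “stubs” (the stub names below are our own illustration, not the post’s):

```python
# Hypothetical controlled vocabulary: the first stub of each column name
# declares its semantics (ID = identifier, IND = 0/1 indicator,
# AMT = non-negative amount, DT = date).
VOCAB = {"ID", "IND", "AMT", "DT"}

def check_columns(columns):
    """Return column names whose leading stub is not in the vocabulary."""
    return [c for c in columns if c.split("_", 1)[0] not in VOCAB]

cols = ["ID_customer", "IND_churned", "AMT_revenue", "signup_date"]
print(check_columns(cols))  # -> ['signup_date']
```

Once names carry this contract, validation rules (e.g. “every IND_ column contains only 0/1”) can be generated mechanically rather than written by hand.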
R package development in Positron This post is about the R package development experience with Positron, the new IDE from Posit based on VS Code. This is not a tutorial on R package development in general — there are great resources for that elsewhere…
Andrej Karpathy's Keynote & Winner Pitches at UC Berkeley AI Hackathon 2024 Awards Ceremony At the 2024 UC Berkeley AI Hackathon's Awards Ceremony, the atmosphere was electric as Andrej Karpathy, founding member of OpenAI, delivered an inspiring keynote. Out of 371 projects, the top 8 teams took the stage to pitch their groundbreaking AI solutions. After intense deliberation by our esteemed judges, the big reveal came: up to $100K in prizes were awarded, celebrating innovation and creativity in AI for Good…
Cypher vs SQL: When do you need graph querying & modeling? In this video, we'll cover the advantages of using graph databases and the Cypher query language for efficient data analysis using practical examples. We highlight the benefits of graph modeling, including handling dependencies between entities, traversing paths, and querying on patterns in data. Watch a comparison of Cypher vs. SQL, and understand the conciseness and efficiency of Cypher for certain classes of queries. We also clearly showcase the flexibility and readability of Cypher over SQL in analyzing complex relationships in your data. Throughout the video, practical examples are provided to help appreciate the use cases better…
An overview of classifier-free guidance for diffusion models This blog post presents an overview of classifier-free guidance (CFG) and recent advancements in CFG based on noise-dependent sampling schedules. The follow-up blog post will focus on new approaches that replace the unconditional model. As a small recap bonus, the appendix briefly introduces the role of attention and self-attention on Unets in the context of generative models…
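The core CFG update is a one-liner; here is a sketch (ours, not from the post) of how the conditional and unconditional noise predictions are combined at each sampling step:

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one,
        eps = eps_uncond + w * (eps_cond - eps_uncond).
    w = 1 recovers the plain conditional model; w > 1 pushes samples
    harder toward the condition at some cost in diversity."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]

print(cfg_combine([0.0, 1.0], [1.0, 1.0], guidance_scale=7.5))  # -> [7.5, 1.0]
```

The noise-dependent schedules the post surveys essentially make `guidance_scale` a function of the diffusion timestep instead of a constant.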
Misconceptions in statistics [Reddit Discussion] I'm going to give a talk on misconceptions in statistics to biomed research grad students soon. In your experience, what are the most egregious stats misconceptions out there?…
Interpretable Machine Learning: A Guide for Making Black Box Models Explainable Machine learning has great potential for improving products, processes, and research. But computers usually do not explain their predictions, which is a barrier to the adoption of machine learning. This book is about making machine learning models and their decisions interpretable…
Regression and Other Stories Many textbooks on regression focus on theory and the simplest of examples. Real statistical problems, however, are complex and subtle. This is not a book about the theory of regression. It is a book about how to use regression to solve real problems of comparison, estimation, prediction, and causal inference. It focuses on practical issues such as sample size and missing data and a wide range of goals and techniques. It jumps right in to methods and computer code you can use fresh out of the box…
Introduction to the theory of econometrics Econometrics is often thought of as a branch of statistics, analyzing uncertain events. In this first chapter, however, nothing is uncertain. I simply ask the following question: given a collection of points, how do we draw the best line through these points? This is a question of approximation rather than of estimation. We shall be concerned with estimation in the next chapter…
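That “best line through these points” question is ordinary least squares; here is a from-scratch sketch (ours, not the book’s) for a single predictor:

```python
def best_line(points):
    """Least-squares line y = a + b*x through a set of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Slope: covariance of x and y divided by variance of x.
    b = sum((x - mx) * (y - my) for x, y in points) / \
        sum((x - mx) ** 2 for x, _ in points)
    return my - b * mx, b  # (intercept, slope)

print(best_line([(0, 1), (1, 3), (2, 5)]))  # points lie on y = 1 + 2x -> (1.0, 2.0)
```

No probability model is needed yet, which is exactly the chapter’s point: this is pure approximation, and the statistics arrive only when we treat the points as noisy draws.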
* Based on unique clicks. ** Find last week's issue #557 here.
Looking to get a job? Check out our “Get A Data Science Job” Course It is a comprehensive course that teaches you everything related to getting a data science job, based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume. Promote yourself or your organization to ~63,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :) Stay Data Science-y! All our best, Hannah & Sebastian