Hello! Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Embeddings are underrated Machine learning (ML) has the potential to advance the state of the art in technical writing. No, I’m not talking about text generation models like Claude, Gemini, LLaMa, GPT, etc. The ML technology that might end up having the biggest impact on technical writing is embeddings. Embeddings aren't exactly new, but they have become much more widely accessible in the last couple years. What embeddings offer to technical writers is the ability to discover connections between texts at previously impossible scales…
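The "connections between texts" idea boils down to comparing vectors. As a minimal sketch (toy 4-dimensional vectors standing in for real model output; a production pipeline would get its vectors from an embedding model or API):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical doc embeddings: two related pages and one unrelated page.
doc_install = np.array([0.9, 0.1, 0.0, 0.2])
doc_setup   = np.array([0.8, 0.2, 0.1, 0.3])
doc_pricing = np.array([0.0, 0.9, 0.8, 0.1])

print(cosine_similarity(doc_install, doc_setup))    # high: related docs
print(cosine_similarity(doc_install, doc_pricing))  # low: unrelated docs
```

At scale, ranking every doc pair by this score surfaces related pages that no human indexer would have time to find.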
What Shapes Do Matrix Multiplications Like? A while back, Karpathy tweeted that increasing the size of his matmul made it run faster. Surprisingly, it’s not just relatively faster, it takes less absolute time. In other words, despite doing more work, it is executing in less time…This may seem intuitively quite strange. Is cuBLAS just messing up somehow? Why doesn’t the matrix multiplication kernel just pad it to a larger shape?…It has become tribal knowledge that the particular shapes chosen for matmuls have a surprisingly large effect on their performance. But … why? Can this be understood by mere mortals? Let’s take a crack at it…
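The core intuition is tile alignment: GPU matmul kernels process fixed-size tiles, so a dimension that falls just short of a tile boundary forces a slow partial-tile path. A tiny sketch of the round-up arithmetic (tile size of 8 is illustrative, not a claim about any specific kernel):

```python
def pad_to_multiple(n: int, tile: int = 8) -> int:
    """Round a matrix dimension up to the next multiple of the tile size."""
    return ((n + tile - 1) // tile) * tile

# A 1023-wide matmul falls one element short of a tile boundary; padding
# to 1024 makes every tile full, which can be faster despite the extra work.
print(pad_to_multiple(1023))  # 1024
print(pad_to_multiple(1024))  # 1024
```

This is why a slightly larger, well-aligned matmul can beat a smaller, misaligned one in absolute time.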
BI-as-Code and the New Era of GenBI Imagine creating business dashboards by simply describing what you want to see. No more clicking through complex interfaces or writing SQL queries - just have a conversation with AI about your data needs…This is the promise of Generative Business Intelligence (GenBI). At its core, GenBI delivers an unreasonably effective human interface, where we iterate quickly, based on BI-as-Code. A simplified version looks like this…
Join a full-day virtual event featuring top speakers like Thomas Wolf - Co-founder of Hugging Face, Prashanth Chandrasekar - CEO of Stack Overflow and Nathan Benaich from Air Street Capital, along with experts from OpenAI, Stanford, Replit, and more! The event is all about bringing AI agents into production, with sessions on everything from autonomous payment agents to customer service and analytics agents. Connect with the global ML/AI community to dive into real-world applications and best practices for deploying and scaling AI agents in industries like e-commerce, food delivery, SaaS, and beyond. * Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
IdentityRAG - Customer Insights Chatbot Connect to your customer data using any LLM and gain actionable insights. IdentityRAG creates a single comprehensive customer 360 view (golden record) by unifying, consolidating, disambiguating and deduplicating data across multiple sources through identity resolution…
Inventory of methods for comparing spatial patterns in raster data Comparison of spatial patterns in raster data is a part of many types of spatial analysis. With this task, we want to know how the physical arrangement of observations in one raster differs from the physical arrangement of observations in another raster…This blog post series will explain the motivation for comparing spatial patterns in raster data, the general considerations when selecting a method for comparison, and the inventory of methods for comparing spatial patterns in raster data. Next, it will show how to use R to compare spatial patterns in continuous and categorical raster data. Lastly, it will discuss the methods’ properties, their applicability, and how they can be extended…
As a researcher, how do you become industry-ready? [Reddit Discussion] Being a PhD student, much of my time is spent on supervising students, project management and writing "quick and dirty" code for prototyping. I intend to move to industry after the PhD, but I feel like I'm missing out on key software engineering skills and good coding practices. Does anyone else feel this way? How do you upskill yourself to be industry-ready while doing a PhD?…
How we made Waves of Interest Waves of Interest (WOI) is a collaboration between the Google News Initiative and Truth & Beauty, exploring the hidden patterns in Google search data. This project illuminates the interests and concerns of Americans in recent election years, as seen through the lens of Google Search Interest in the United States from 2004 to 2020…
Getting comfortable talking about tech A blog post about some of the things that have helped me in getting comfortable talking about tech (although hopefully some of the advice is also useful for non-tech talks!)…
messy - R package to make a data frame messy and untidy. When teaching examples using R, instructors often use nice datasets - but these aren't very realistic, and aren't what students will later encounter in the real world. Real datasets have typos, missing values encoded in strange ways, and weird spaces. The {messy} R package takes a clean dataset, and randomly adds these things in - giving students the opportunity to practice their data cleaning and wrangling skills without instructors having to change all of their examples…
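The same trick is easy to improvise in any language. A rough Python analogue of what {messy} does (the corruption rules and missing-value codes here are our own illustration, not the package's actual behavior):

```python
import random

def make_messy(values, seed=42, frac=0.2):
    """Randomly replace some entries with oddly-encoded missing values
    or add stray whitespace, mimicking real-world dirty data."""
    rng = random.Random(seed)
    out = []
    for v in values:
        r = rng.random()
        if r < frac / 2:
            out.append(rng.choice(["", "N/A", None]))  # strange missing codes
        elif r < frac:
            out.append(f"  {v} ")                      # stray whitespace
        else:
            out.append(v)
    return out

clean = ["Paris", "Lima", "Oslo", "Cairo"] * 5
print(make_messy(clean))
```

Students then get to rediscover the original clean column by stripping whitespace and normalizing the missing-value codes.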
How to Do Bad Biomarker Research This article covers some of the bad statistical practices that have crept into biomarker research, including setting the bar too low for demonstrating that biomarker information is new, believing that winning biomarkers are really “winners”, and improper use of continuous variables. Step-by-step guidance is given for ensuring that a biomarker analysis is not reproducible and does not provide clinically useful information…
What Every Developer Should Know About GPU Computing A primer on GPU architecture and computing…Most programmers have some understanding of the basics of CPUs and sequential programming because they grow up writing code for the CPU, but many are less familiar with the inner workings of GPUs and what makes them so special. Over the past decade, GPUs have become incredibly important because of their pervasive use in deep learning. Today, it is essential for every software engineer to possess a basic understanding of how they work. My goal with this article is to give you that background…
A Scalable Communication Protocol for Networks of Large Language Models Agora is a cross-platform, dead-simple protocol for efficient communication between LLM agents. It enables very different agents to communicate with each other at a fraction of the cost. Agora can also be easily integrated with existing multiagent frameworks, such as Camel AI, LangChain and Swarm…
Understanding LLMs from Scratch Using Middle School Math In this article, we talk about how Large Language Models (LLMs) work, from scratch — assuming only that you know how to add and multiply two numbers. The article is meant to be fully self-contained. We start by building a simple Generative AI on pen and paper, and then walk through everything we need to have a firm understanding of modern LLMs and the Transformer architecture. The article will strip out all the fancy language and jargon in ML and represent everything simply as what it is: numbers. We will still call out what things are called to tether your thoughts when you read jargon-y content…
Customer Segmentation - Mixed Data Types [Reddit Discussion] I've only ever had experience with supervised learning for tasks like propensity models. My lead data scientist believes there is value in unsupervised learning to find clusters amongst our customers and target them with more personalised messaging…I generally used to approach my "segmentation" with a combination of different demographic factors and bins for a selection of continuous features (like RFM factors). However, if I were to consider unsupervised learning for clustering, I'll have mixed data… I wanted to understand how you would approach this, whether you ignore categorical variables, or whether you do a clustering model for categorical and numerical variables separately, etc…Also, I'd like to hear your thoughts and stories of how segmentation via ML has added value beyond RFM or RFE analysis…
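One standard answer to the mixed-data question is a distance that handles both types at once, such as Gower distance: numeric features contribute a range-scaled absolute difference, categoricals a 0/1 mismatch. A minimal sketch with hypothetical customer records (the feature names and values are made up for illustration):

```python
def gower_distance(x, y, num_idx, cat_idx, ranges):
    """Gower distance between two mixed-type records:
    numeric features -> |x - y| / range, categorical features -> 0/1 mismatch,
    averaged over all features."""
    d = 0.0
    for j in num_idx:
        d += abs(x[j] - y[j]) / ranges[j]
    for j in cat_idx:
        d += 0.0 if x[j] == y[j] else 1.0
    return d / (len(num_idx) + len(cat_idx))

# Hypothetical customers: (age, monthly_spend, plan)
customers = [(25, 40.0, "basic"), (27, 45.0, "basic"), (60, 400.0, "premium")]
num_idx, cat_idx = [0, 1], [2]
ranges = {0: 60 - 25, 1: 400.0 - 40.0}  # observed range per numeric feature

d_close = gower_distance(customers[0], customers[1], num_idx, cat_idx, ranges)
d_far = gower_distance(customers[0], customers[2], num_idx, cat_idx, ranges)
print(d_close, d_far)
```

A pairwise matrix of such distances can then feed hierarchical clustering or k-medoids, avoiding the separate-models-per-type workaround raised in the thread.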
Understanding Multimodal LLMs: An introduction to the main techniques and latest models In this article, I aim to explain how multimodal LLMs function. Additionally, I will review and summarize roughly a dozen other recent multimodal papers and models published in recent weeks (including Llama 3.2) to compare their approaches…
How to Tackle the Weekend Quiz Like a Bayesian A couple of weeks ago, this question came up in the Sydney Morning Herald Good Weekend quiz: What is malmsey: a mild hangover, a witch’s curse or a fortified wine? Assuming we have no inkling of the answer, is there any way to make an informed guess in this situation? I think there is. Feel free to have a think about it before reading on…
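The "informed guess" amounts to a one-step Bayes update: start from a uniform prior over the three options, then weight by how plausible each one feels given side information (how the word sounds, what kinds of answers quizzes favor). A toy sketch with made-up likelihood numbers, not the article's actual reasoning:

```python
# Uniform prior: we claim no inkling of the answer.
options = ["mild hangover", "witch's curse", "fortified wine"]
prior = {o: 1 / 3 for o in options}

# Hypothetical likelihoods: how well each option explains our hunches
# (e.g. "-sey" sounds like a drink name; quizzes favor concrete nouns).
likelihood = {"mild hangover": 0.2, "witch's curse": 0.1, "fortified wine": 0.7}

unnorm = {o: prior[o] * likelihood[o] for o in options}
z = sum(unnorm.values())
posterior = {o: p / z for o, p in unnorm.items()}
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 2))
```

With a uniform prior the posterior just renormalizes the likelihoods, but the same two lines of arithmetic work when the prior is informative too.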
* Based on unique clicks. ** Find last week's issue #571 here.
Learning something for your job? Hit reply to get our help.
Looking to get a job? Check out our “Get A Data Science Job” Course It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~64,300 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :) Stay Data Science-y! All our best, Hannah & Sebastian