Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If this newsletter is helpful to your job, please become a paid subscriber here: https://datascienceweekly.substack.com/subscribe :) (you get extra links each week!)
And now…let's dive into some interesting links from this week.
AI Is Already Better Than You
Making quality the central point of your argument against AI systems is dangerous if that's not really your issue with it, and my feeling is that for most people the quality of the output is not actually their objection to the use of AI. People reach for it as a criticism because it feels viral, it feels snappy, and it feels like an easy way to attack these systems. AI being unable to draw hands is burned into popular culture pretty deeply now, and will probably be something we reference for decades to come; it's just become part of the popular myth of this generation of AI tools, so it feels like an issue that has gained awareness and leverage. The problem is that if this becomes the main issue and then one day it gets fixed, it's going to have the opposite effect on the discourse...
Learn how Pinecone's new serverless vector database helps Notion, Gong, and CS DISCO optimize their AI infrastructure from our VP of R&D, Ram Sriharsha:
Up to 50x lower costs because of the separation of reads, writes, and storage
O(s) fresh results with vector clustering over blob storage
Fast search without sacrificing recall powered by industry-first indexing and retrieval algorithms
Powerful performance with a multi-tenant compute layer
Zero configuration or ongoing management
Read the technical deep dive to understand how it was built and the unique considerations that needed to be made.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
A bird's eye view of Polars
A good library abstracts away many complexities for its user. Polars is no different in this regard, as it maintains a philosophy that queries you write should be performant by default without knowing any of the internals. However, many users are interested in what happens under the hood either as a learning experience or to squeeze that last bit of performance out of their queries. In this blog post, we will provide a bird’s eye view of how Polars works and in future posts we will deep dive into each of its components…
The difference between clustered, longitudinal, and repeated measures data
What is the difference between Clustered, Longitudinal, and Repeated Measures Data? You can use mixed models to analyze all of them. But the issues involved and some of the specifications you choose will differ. Just recently, I came across a nice discussion about these differences in West, Welch, and Galecki’s (2007) excellent book, Linear Mixed Models. It’s a common question. There is a lot of overlap in both the study design and in how you analyze the data from these designs. West et al give a very nice summary of the three types. Here’s a paraphrasing of the differences as they explain them…
Data Engineering Trends
This blog post highlights the key trends we see in data engineering in 2024 and how those trends affect data teams…
Lesser known Research Areas ML [Reddit]
What are some lesser-known or less explored areas in machine learning that you find interesting? (Broader, not highly specialized ideas or topics.) I'm seeking some areas so that I can study and find out about them…
Highlights of NeurIPS 2023 from Reading All 3584 Abstracts
Here is just a highlight of what I found interesting and the general vibes I had while reading over the last two weeks, but keep in mind that I am an undergraduate student that has not worked with or spent a lot of time with a variety of popular topics (e.g., federated learning, differential privacy, causal inference). I’ve structured this post into a high-level overview for each topic I observed, followed by short discussions on papers I found interesting…
The Winner’s Curse Is Easy To Understand From This Picture
Take a look at the photo below, and it should be easy to understand why The Winner’s Curse (the general tendency for detected effects to be an overestimate of the truth) is a thing…The plot shows our typical setup for a hypothesis test. In black is the sampling distribution of the test statistic for a difference in means under the null, and in blue is the statistic’s sampling distribution under the alternative. The shaded blue region represents the statistical power, and those effect sizes in the shaded region would be considered “statistically significant”…
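For readers who prefer code to pictures, here is a minimal simulation sketch of the same idea. It is not taken from the linked post: the function name `simulate_winners_curse`, the true effect of 0.2, the group size of 50, and the known standard deviation of 1 are all assumptions chosen for illustration. It repeats a two-group experiment many times, keeps the difference in means from each run, and compares the average of all estimates with the average of only those that cross a two-sided 5% threshold.

```python
import random
import statistics


def simulate_winners_curse(true_effect=0.2, n=50, sigma=1.0,
                           n_experiments=2000, seed=0):
    """Simulate many two-group experiments and compare all estimates
    with the subset that reaches statistical significance.

    All default values are illustrative assumptions, not figures
    from the linked article.
    """
    rng = random.Random(seed)
    # Standard error of a difference in means with known, equal sigma.
    se = sigma * (2.0 / n) ** 0.5
    # Two-sided 5% threshold expressed on the scale of the difference.
    critical = 1.96 * se

    all_estimates = []
    significant = []
    for _ in range(n_experiments):
        control = [rng.gauss(0.0, sigma) for _ in range(n)]
        treated = [rng.gauss(true_effect, sigma) for _ in range(n)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        all_estimates.append(diff)
        if abs(diff) > critical:
            significant.append(diff)

    # Note: fmean raises StatisticsError if no run is significant,
    # which can happen with very small n_experiments or tiny power.
    return {
        "true_effect": true_effect,
        "mean_all": statistics.fmean(all_estimates),
        "mean_significant": statistics.fmean(significant),
        "power": len(significant) / n_experiments,
    }


if __name__ == "__main__":
    for name, value in simulate_winners_curse().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the test has low power, so only estimates that land well above the true effect clear the threshold. The average over all runs should sit close to the true effect, while the average over the significant runs should sit noticeably above it.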
Data Developer Platform
A Data Platform Specification, open for adoption by any data platform developer…The data developer platform specification has been made with the ideology "of the people, by the people, for the people." In other words, it is made by data developers and engineers for data developers and engineers and undoubtedly belongs to the lot. It is entirely open for development and improvement with the aid and influence of fresh technology...
5 Questions AI Engineers need to ask themselves
This writing stems from my experience advising a few startups, particularly smaller ones with plenty of junior software engineers trying to transition into machine learning and related fields. From this work, I've noticed three topics that I want to address. My aim is that, by the end of this article, these younger developers will be equipped with key questions they can ask themselves to improve their ability to make decisions under uncertainty…
LLM App Stack - aka Emerging Architectures for LLM Applications
This is a list of available tools, projects, and vendors at each layer of the LLM app stack. Our original article included only the most popular options, based on user interviews. This repo is meant to be more comprehensive, covering all available options in each category. We probably still missed some important projects, so please open a PR if you see anything missing. We also included Perplexity and Cursor.sh prompts to make searching and markdown table formatting easier…
Allowing report access to 35,000 people [Reddit]
I work for a retail firm and we are running an incentive over 6 months. I produce the incentive reports using Python code and we usually share it with a few important salespeople. However, this year, management is looking at creating something like a Power BI report and allowing access to all salespeople. The number of salespeople will be about 35,000. I'm experienced with Python and Power BI but I've never done something that needs to be accessed by such a wide audience. Does anyone have any experience with such a scenario where a lot of people need access to a report and how it was achieved?…
Applying AI to Immune Cell Networks
The immune system is even “more complicated than the human genome,” says John Tsang, a professor at Yale School of Medicine. One component of this complexity is the complicated network by which immune cells communicate with one another via protein messengers known as cytokines…Recent research, drawing on a blend of immunology, mathematics, and natural language processing (NLP) AI, is helping us better understand immune cell-cytokine networks…Understanding how the immune system communicates and coordinates is necessary for knowing why and how the immune response can go awry…
Academic papers that are so brilliantly and so accessibly written and so universal in scope that they transcend disciplines [Twitter/X]
1/n: There are some academic papers that are so brilliantly and so accessibly written and so universal in scope that they transcend disciplines and stand as timeless testaments to both great thinking and great writing. Here's a short personal selection…
Five Ways to Analyze Ordinal Variables (Some Better than Others)
There are not a lot of statistical methods designed just to analyze ordinal variables. But that doesn’t mean that you’re stuck with few options. There are more than you’d think. Some are better than others, but it depends on the situation and research questions. Here are five options when your dependent variable is ordinal…
Large Language Model Course
The LLM course is divided into three parts:
🧩 LLM Fundamentals covers essential knowledge about mathematics, Python, and neural networks.
🧑‍🔬 The LLM Scientist focuses on building the best possible LLMs using the latest techniques.
👷 The LLM Engineer focuses on creating LLM-based applications and deploying them.
MIT’s 6.172 Performance Engineering of Software Systems - Lecture 1: Introduction and Matrix Multiplication
Professor Leiserson introduces 6.172 Performance Engineering of Software Systems. The class examines an example of code optimization using matrix multiplication and discusses the differences between programming languages Python, Java, and C…
* Based on unique clicks.
** Find last week's issue #530 here.
Thank you for joining us this week! :)
All our best,
Hannah & Sebastian
Copyright © 2013-2024 DataScienceWeekly.org, All rights reserved.
P.S. For paid subscribers => Even more links are below! The value proposition is that you get 10-15 extra links and blurbs if you become a paid subscriber. We hand-select the top 3 favorite links to be editors' choices and then alternate back and forth between the free and paid sections for the rest of the links/blurbs. That way, the free section doesn’t contain the best links, nor does the paid; it’s “somewhat” evenly distributed. It would mean a lot to us for you to become a subscriber here...