Hello! Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Just use Postgres This is one part actionable advice, one part question for the audience…Advice: When you are making a new application that requires persistent storage of data, as is the case for most web applications, your default choice should be Postgres…
A Single Goal is All You Need: Skills and Exploration Emerge from Contrastive RL without Rewards, Demonstrations, or Subgoals We present empirical evidence of skills and directed exploration emerging from a simple RL algorithm long before any successful trials are observed. For example, in a manipulation task, the agent is given a single observation of the goal state and learns skills, first for moving its end-effector, then for pushing the block, and finally for picking up and placing the block. These skills emerge before the agent has ever successfully placed the block at the goal location and without the aid of any reward functions, demonstrations, or manually-specified distance metrics…
Which books, papers, and blogs are in the Bayesian canon? Inspired by this effort by Patrick Collison for Silicon Valley [linked from Tyler Cowen], I thought that it might be fun to think about what makes up the Bayesian canon. As Collison said, “This isn’t the list of books that I think one ought to read — it’s just the list that I think roughly covers the major ideas that are influential here.” I’ve made a start. What should be added/removed?…
Find your algorithm for success with an online data science degree from Drexel University. Gain essential skills in tool creation and development, data and text mining, trend identification, and data manipulation and summarization, applying leading industry technology to your career. Learn more. * Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Visual Design in Scholarly Communication Visual design is a crucial element in various forms of scientific communication, ranging from papers and slides to even videos. While there is an increasing need for researchers to produce high-quality visuals, it remains a time-consuming and sometimes very challenging task. Despite the significant role visuals play, there is a noticeable lack of formal education dedicated to this aspect. This subject aims to cover several key topics about visual design in scholarly communication. Throughout this subject, you will learn: basics and principles of visual design in scholarly communication; techniques for creating high-quality figures/tables/visualizations…
Getting to Know infer infer implements an expressive grammar to perform statistical inference that coheres with the tidyverse design framework. Rather than providing methods for specific statistical tests, this package consolidates the principles that are shared among common hypothesis tests into a set of 4 main verbs (functions), supplemented with many utilities to visualize and extract value from their outputs. Regardless of which hypothesis test we’re using, we’re still asking the same kind of question: is the effect/difference in our observed data real, or due to chance?…
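infer is an R package, but the question it encodes — is the effect in our observed data real, or due to chance? — is language-agnostic. As a hedged sketch (hypothetical data, plain Python rather than infer's specify/hypothesize/generate/calculate verbs), a permutation test asks exactly that question:

```python
import random

random.seed(1)

# Hypothetical observed data: binary conversion outcomes for two groups.
group_a = [1] * 30 + [0] * 70   # 30% conversion
group_b = [1] * 20 + [0] * 80   # 20% conversion

observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

# Null hypothesis: group labels don't matter. Generate the null
# distribution by shuffling the pooled outcomes and recomputing the
# difference -- the same idea infer expresses as generate(type = "permute").
pooled = group_a + group_b
diffs = []
for _ in range(5000):
    random.shuffle(pooled)
    a, b = pooled[:100], pooled[100:]
    diffs.append(sum(a) / 100 - sum(b) / 100)

# Two-sided p-value: how often does chance alone produce a difference
# at least as extreme as the one we observed?
p_value = sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)
print(f"observed difference: {observed:.2f}, p-value: {p_value:.3f}")
```

The same skeleton covers many named tests (difference in means, proportions, correlations) by swapping the statistic — which is precisely the consolidation infer's grammar is built around.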
A/B Testing Rigorously (without losing your job) If you're running an A/B test using standard statistical tests and look at your results early, you are not getting the statistical guarantees that you think you are. In How Not To Run An A/B Test, Evan Miller explains the issue and recommends the standard technique of deciding on a sample size in advance, then looking at your statistics only once. Never before, and never again. Because on that one look you've used up all of your willingness to be wrong. This suggestion is statistically valid. However, it is hazardous to your prospects for continued employment. No boss is going to want to hear that there will be no peeking at your statistics. But suppose that you have managed to get your boss to agree to that. Then this happens…
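The danger of peeking is easy to demonstrate with a simulation. The sketch below (hypothetical parameters, not from Miller's post) runs A/A tests — both arms drawn from the same distribution, so every "significant" result is a false positive — and shows that checking a two-proportion z-test at many interim looks flags significance far more often than the nominal 5%:

```python
import random

random.seed(42)

def runs_significant(n, peeks, p=0.5, z_crit=1.96):
    """One A/A test: both arms share the true conversion rate p.
    Return True if a two-proportion z-test crosses z_crit at ANY peek."""
    a_succ = b_succ = 0
    checkpoints = {n * (i + 1) // peeks for i in range(peeks)}
    for i in range(1, n + 1):
        a_succ += random.random() < p
        b_succ += random.random() < p
        if i in checkpoints:
            p_a, p_b = a_succ / i, b_succ / i
            pooled = (a_succ + b_succ) / (2 * i)
            se = (2 * pooled * (1 - pooled) / i) ** 0.5
            if se > 0 and abs(p_a - p_b) / se > z_crit:
                return True
    return False

trials = 400
fp_peeking = sum(runs_significant(1000, peeks=20) for _ in range(trials)) / trials
fp_single = sum(runs_significant(1000, peeks=1) for _ in range(trials)) / trials
print(f"false positive rate, 20 peeks: {fp_peeking:.2f}")  # well above 0.05
print(f"false positive rate, 1 look:  {fp_single:.2f}")   # near 0.05
```

With 20 looks the chance of a spurious "win" is several times the single-look rate — which is why the article's sequential-testing alternatives are worth the trouble.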
I am a data engineer (10 YOE) and write at startdataengineering.com - AMA about data engineering, career growth, and data landscape! [Reddit] Whether you're curious about starting a career in data engineering, need advice on data architecture, or want to discuss the latest trends in the field, I’m here to answer your questions. AMA!…
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters Self-attention is the core mathematical operation of modern transformer architectures and is also a significant computational bottleneck due to its quadratic complexity in the sequence length. In this work, we derive the scalar energy function whose gradient computes the self-attention block, thus elucidating the theoretical underpinnings of self-attention, providing a Bayesian interpretation of the operation and linking it closely with energy-based models such as Hopfield Networks…
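A toy illustration of the energy-function view (not the paper's derivation, and with values tied to keys for simplicity): the gradient of a log-sum-exp "energy" over query-key similarities is exactly the softmax-weighted attention readout, which can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8
q = rng.normal(size=d)        # one query vector
K = rng.normal(size=(n, d))   # n key vectors

def energy(q):
    # Scalar log-sum-exp energy over the key similarities (stable form).
    s = K @ q
    m = s.max()
    return m + np.log(np.exp(s - m).sum())

# Analytic gradient of the energy: softmax-weighted sum of keys --
# the attention output when values equal keys.
w = np.exp(K @ q - energy(q))   # softmax weights, they sum to 1
analytic = w @ K

# Central-difference numerical gradient for comparison.
eps = 1e-6
numeric = np.array([
    (energy(q + eps * e) - energy(q - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

The paper builds on this kind of identity to decompose the exact attention computation across devices; the sketch above only shows the single-query, values-equal-keys special case.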
Rediscovering the UK's AI ambition At the end of July, the UK Secretary of State for Science, Innovation and Technology commissioned Matt Clifford, Chair of the Advanced Research and Invention Agency (ARIA), to produce a roadmap on how the government can harness the benefits of AI to drive growth and productivity. As part of this work, Alex attended a roundtable at 10 Downing Street and stakeholders have been invited to share their thoughts in writing with the taskforce. As believers in openness as a driver of progress, we share our unvarnished views publicly, not just behind closed doors. So in that spirit, we’re sharing our submission in full…
New LLM Pre-training and Post-training Paradigms: A Look at How Modern LLMs Are Trained Initially, the LLM training process focused solely on pre-training, but it has since expanded to include both pre-training and post-training. Post-training typically encompasses supervised instruction fine-tuning and alignment, which was popularized by ChatGPT. Training methodologies have evolved since ChatGPT was first released. In this article, I review the latest advancements in both pre-training and post-training methodologies, particularly those made in recent months…
Loss Rider - a Python plotting library that can (only) output Line Rider maps A plotting tool that outputs Line Rider maps, so you can watch a man on a sled scoot down your loss curves. 🎿…
Optimizing Tool Retrieval in RAG Systems: A Balanced Approach When it comes to Retrieval-Augmented Generation (RAG) systems, one of the key challenges is deciding how to select and use tools effectively…many people ask me whether or not they should think about using retrieval to choose which tools to put into the prompt. What this actually means is that we're interested in making precision and recall trade-offs. I've found that the key lies in balancing recall and precision…In this article, we'll cover: The challenge of tool selection in RAG systems Understanding the recall vs. precision tradeoff The "Evergreen Tools" strategy for optimizing tool selection…
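The recall/precision trade-off in tool retrieval is concrete: retrieving more candidate tools raises the chance of including every tool you need (recall) but dilutes the share of the prompt that is actually relevant (precision). A minimal sketch, using hypothetical tool names and a fixed retriever ranking:

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall for the top-k retrieved tools."""
    top_k = retrieved[:k]
    hits = len(set(top_k) & set(relevant))
    return hits / k, hits / len(relevant)

# Hypothetical example: a retriever's ranked tool list for one query,
# and the tools actually required to answer it.
retrieved = ["search_web", "get_weather", "calculator", "send_email", "read_file"]
relevant = {"search_web", "calculator"}

for k in (1, 3, 5):
    p, r = precision_recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

At k=1 precision is perfect but a needed tool is missing; by k=3 recall reaches 1.0 while precision has dropped — the "Evergreen Tools" idea in the article is one way to move that frontier rather than just sliding along it.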
How Narwhals has many end users ... that never use it directly When you pip install a package, you will certainly end up using it later. But you will often also install a bunch of dependencies, and it is very likely that you won't directly interact with all of them. That does not mean that such a package is not useful; it merely means that the package might be used directly by a maintainer instead…This is interesting, because recently one such tool came into existence. It is called Narwhals, and it seems to be on track to become critical infrastructure for data science projects. We have the maintainer of Narwhals on the show this week to talk about it…
Scikit-Learn can do THAT?! Many of us know scikit-learn for its ability to construct pipelines that can do .fit().predict(). It's an amazing feature for sure. But once you dive into the codebase ... you realize that there is just so much more…This talk will be an attempt at demonstrating some extra features in scikit-learn, and its ecosystem, that are less common but deserve to be in the spotlight. In particular I hope to discuss these things that scikit-learn can do: - sparse datasets and models - larger than memory datasets - sample weight techniques - image classification via embeddings - tabular embeddings/vectorisation - data deduplication - pipeline caching…
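Pipeline caching, the last item on that list, is a real scikit-learn feature: passing memory= to Pipeline caches fitted transformers on disk, so refitting the same pipeline (say, while tuning only the final estimator) can skip the expensive upstream steps. A minimal sketch on synthetic data:

```python
import tempfile
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# memory= points at a cache directory; fitted transformers are stored
# there and reused when the same step is fit on the same data again.
cache_dir = tempfile.mkdtemp()
pipe = Pipeline(
    [("pca", PCA(n_components=5)), ("clf", LogisticRegression())],
    memory=cache_dir,
)
pipe.fit(X, y)
print(f"train accuracy: {pipe.score(X, y):.2f}")
```

On a second fit with an unchanged PCA step, only the classifier is refit — a large saving when the transformer is the expensive part.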
A primer on scRNA-seq foundation models The New York Times published an article titled A.I. Is Learning What It Means to Be Alive. I'm not the biggest fan of the title, but I am happy the subject is being talked about more! It laid out the story of how so-called 'scRNA-seq Foundation Models' may potentially change how single-cell RNA sequencing (scRNA) data is interpreted, used, and applied. Though the Times article was fantastic in its own right, it asks very surface-level questions about the whole process…I'd like to do a much deeper dive into this topic and try to walk through the motivation, ideas, and process of creating these models, along with what they do well and what they still struggle with…
Join NielsenIQ and Onehouse to explore the crucial role of vector embeddings in AI. Discover how Onehouse makes it more cost-efficient, simple, and scalable to generate and manage vector embeddings directly from your data lake, amidst rising vector database costs.
Live webinar. Aug 27, 2024 | 10 am PT
Can't make it? Register anyway to receive the recording! Register Now Here
* Based on unique clicks. ** Find last week's issue #560 here.
Looking to get a job? Check out our “Get A Data Science Job” Course It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume. Promote yourself/organization to ~63,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :) Stay Data Science-y! All our best, Hannah & Sebastian