Hello! Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Just use Postgres This is one part actionable advice, one part question for the audience…Advice: When you are making a new application that requires persistent storage of data, as is the case for most web applications, your default choice should be Postgres…
A Single Goal is All You Need: Skills and Exploration Emerge from Contrastive RL without Rewards, Demonstrations, or Subgoals We present empirical evidence of skills and directed exploration emerging from a simple RL algorithm long before any successful trials are observed. For example, in a manipulation task, the agent is given a single observation of the goal state and learns skills, first for moving its end-effector, then for pushing the block, and finally for picking up and placing the block. These skills emerge before the agent has ever successfully placed the block at the goal location and without the aid of any reward functions, demonstrations, or manually-specified distance metrics…
Which books, papers, and blogs are in the Bayesian canon? Inspired by this effort by Patrick Collison for Silicon Valley [linked from Tyler Cowen], I thought that it might be fun to think about what makes up the Bayesian canon. As Collison said, “This isn’t the list of books that I think one ought to read — it’s just the list that I think roughly covers the major ideas that are influential here.” I’ve made a start. What should be added/removed?…
Find your algorithm for success with an online data science degree from Drexel University. Gain essential skills in tool creation and development, data and text mining, trend identification, and data manipulation and summarization, applying leading industry technology to your career. Learn more. * Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Visual Design in Scholarly Communication Visual design is a crucial element in various forms of scientific communication, ranging from papers and slides to even videos. While there is an increasing need for researchers to produce high-quality visuals, it remains a time-consuming and sometimes very challenging task. Despite the significant role visuals play, there is a noticeable lack of formal education dedicated to this aspect. This subject aims to cover several key topics about visual design in scholarly communication. Throughout this subject, you will learn: basics and principles of visual design in scholarly communication; techniques for creating high-quality figures/tables/visualizations…
Getting to Know infer infer implements an expressive grammar to perform statistical inference that coheres with the tidyverse design framework. Rather than providing methods for specific statistical tests, this package consolidates the principles that are shared among common hypothesis tests into a set of 4 main verbs (functions), supplemented with many utilities to visualize and extract value from their outputs. Regardless of which hypothesis test we’re using, we’re still asking the same kind of question: is the effect/difference in our observed data real, or due to chance?…
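infer is an R package, but the question it encodes — is the effect in our observed data real, or due to chance? — is language-agnostic. As a hedged sketch (hypothetical data, plain Python rather than infer's specify/hypothesize/generate/calculate verbs), a permutation test asks exactly that question:

```python
import random

random.seed(1)

# Hypothetical observed data: binary conversion outcomes for two groups.
group_a = [1] * 30 + [0] * 70   # 30% conversion
group_b = [1] * 20 + [0] * 80   # 20% conversion

observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

# Null hypothesis: group labels don't matter. Generate the null
# distribution by shuffling the pooled outcomes and recomputing the
# difference -- the same idea infer expresses as generate(type = "permute").
pooled = group_a + group_b
diffs = []
for _ in range(5000):
    random.shuffle(pooled)
    a, b = pooled[:100], pooled[100:]
    diffs.append(sum(a) / 100 - sum(b) / 100)

# Two-sided p-value: how often does chance alone produce a difference
# at least as extreme as the one we observed?
p_value = sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)
print(f"observed difference: {observed:.2f}, p-value: {p_value:.3f}")
```

The same skeleton covers many named tests (difference in means, proportions, correlations) by swapping the statistic — which is precisely the consolidation infer's grammar is built around.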
A/B Testing Rigorously (without losing your job) If you're running an A/B test using standard statistical tests and look at your results early, you are not getting the statistical guarantees that you think you are. In How Not To Run An A/B Test, Evan Miller explains the issue and recommends the standard technique of deciding on a sample size in advance, then looking at your statistics only once. Never before, and never again. Because on that one look you've used up all of your willingness to be wrong. This suggestion is statistically valid. However, it is hazardous to your prospects for continued employment. No boss is going to want to hear that there will be no peeking at your statistics. But suppose that you have managed to get your boss to agree to that. Then this happens…
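The danger of peeking is easy to demonstrate with a simulation. The sketch below (hypothetical parameters, not from Miller's post) runs A/A tests — both arms drawn from the same distribution, so every "significant" result is a false positive — and shows that checking a two-proportion z-test at many interim looks flags significance far more often than the nominal 5%:

```python
import random

random.seed(42)

def runs_significant(n, peeks, p=0.5, z_crit=1.96):
    """One A/A test: both arms share the true conversion rate p.
    Return True if a two-proportion z-test crosses z_crit at ANY peek."""
    a_succ = b_succ = 0
    checkpoints = {n * (i + 1) // peeks for i in range(peeks)}
    for i in range(1, n + 1):
        a_succ += random.random() < p
        b_succ += random.random() < p
        if i in checkpoints:
            p_a, p_b = a_succ / i, b_succ / i
            pooled = (a_succ + b_succ) / (2 * i)
            se = (2 * pooled * (1 - pooled) / i) ** 0.5
            if se > 0 and abs(p_a - p_b) / se > z_crit:
                return True
    return False

trials = 400
fp_peeking = sum(runs_significant(1000, peeks=20) for _ in range(trials)) / trials
fp_single = sum(runs_significant(1000, peeks=1) for _ in range(trials)) / trials
print(f"false positive rate, 20 peeks: {fp_peeking:.2f}")  # well above 0.05
print(f"false positive rate, 1 look:  {fp_single:.2f}")   # near 0.05
```

With 20 looks the chance of a spurious "win" is several times the single-look rate — which is why the article's sequential-testing alternatives are worth the trouble.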
I am a data engineer (10 YOE) and write at startdataengineering.com - AMA about data engineering, career growth, and data landscape! [Reddit] Whether you're curious about starting a career in data engineering, need advice on data architecture, or want to discuss the latest trends in the field, I’m here to answer your questions. AMA!…
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters Self-attention is the core mathematical operation of modern transformer architectures and is also a significant computational bottleneck due to its quadratic complexity in the sequence length. In this work, we derive the scalar energy function whose gradient computes the self-attention block, thus elucidating the theoretical underpinnings of self-attention, providing a Bayesian interpretation of the operation and linking it closely with energy-based models such as Hopfield Networks…
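A toy illustration of the energy-function view (not the paper's derivation, and with values tied to keys for simplicity): the gradient of a log-sum-exp "energy" over query-key similarities is exactly the softmax-weighted attention readout, which can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8
q = rng.normal(size=d)        # one query vector
K = rng.normal(size=(n, d))   # n key vectors

def energy(q):
    # Scalar log-sum-exp energy over the key similarities (stable form).
    s = K @ q
    m = s.max()
    return m + np.log(np.exp(s - m).sum())

# Analytic gradient of the energy: softmax-weighted sum of keys --
# the attention output when values equal keys.
w = np.exp(K @ q - energy(q))   # softmax weights, they sum to 1
analytic = w @ K

# Central-difference numerical gradient for comparison.
eps = 1e-6
numeric = np.array([
    (energy(q + eps * e) - energy(q - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

The paper builds on this kind of identity to decompose the exact attention computation across devices; the sketch above only shows the single-query, values-equal-keys special case.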
Rediscovering the UK's AI ambition At the end of July, the UK Secretary of State for Science, Innovation and Technology commissioned Matt Clifford, Chair of the Advanced Research and Invention Agency (ARIA), to produce a roadmap on how the government can harness the benefits of AI to drive growth and productivity. As part of this work, Alex attended a roundtable at 10 Downing Street and stakeholders have been invited to share their thoughts in writing with the taskforce. As believers in openness as a driver of progress, we share our unvarnished views publicly, not just behind closed doors. So in that spirit, we’re sharing our submission in full…
New LLM Pre-training and Post-training Paradigms: A Look at How Modern LLMs Are Trained Initially, the LLM training process focused solely on pre-training, but it has since expanded to include both pre-training and post-training. Post-training typically encompasses supervised instruction fine-tuning and alignment, which was popularized by ChatGPT. Training methodologies have evolved since ChatGPT was first released. In this article, I review the latest advancements in both pre-training and post-training methodologies, particularly those made in recent months…
Loss Rider - a Python plotting library that can (only) output Line Rider maps A plotting tool that outputs Line Rider maps, so you can watch a man on a sled scoot down your loss curves. 🎿…
Optimizing Tool Retrieval in RAG Systems: A Balanced Approach When it comes to Retrieval-Augmented Generation (RAG) systems, one of the key challenges is deciding how to select and use tools effectively…many people ask me whether or not they should think about using retrieval to choose which tools to put into the prompt. What this actually means is that we're interested in making precision and recall trade-offs. I've found that the key lies in balancing recall and precision…In this article, we'll cover: The challenge of tool selection in RAG systems Understanding the recall vs. precision tradeoff The "Evergreen Tools" strategy for optimizing tool selection…
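The recall/precision trade-off in tool retrieval is concrete: retrieving more candidate tools raises the chance of including every tool you need (recall) but dilutes the share of the prompt that is actually relevant (precision). A minimal sketch, using hypothetical tool names and a fixed retriever ranking:

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall for the top-k retrieved tools."""
    top_k = retrieved[:k]
    hits = len(set(top_k) & set(relevant))
    return hits / k, hits / len(relevant)

# Hypothetical example: a retriever's ranked tool list for one query,
# and the tools actually required to answer it.
retrieved = ["search_web", "get_weather", "calculator", "send_email", "read_file"]
relevant = {"search_web", "calculator"}

for k in (1, 3, 5):
    p, r = precision_recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

At k=1 precision is perfect but a needed tool is missing; by k=3 recall reaches 1.0 while precision has dropped — the "Evergreen Tools" idea in the article is one way to move that frontier rather than just sliding along it.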
How Narwhals has many end users ... that never use it directly When you pip install a package, you will certainly end up using it later. But you will often also install a bunch of dependencies, and it is very likely that you won't directly interact with all of them. That does not mean that such a package is not useful; it merely means that the package might be used directly by a maintainer instead…This is interesting, because recently one such tool came into existence. It is called Narwhals, and it seems to be on track to become critical infrastructure for data science projects. We have the maintainer of Narwhals on the show this week to talk about it…
Scikit-Learn can do THAT?! Many of us know scikit-learn for its ability to construct pipelines that can do .fit().predict(). It's an amazing feature for sure. But once you dive into the codebase ... you realize that there is just so much more…This talk will be an attempt at demonstrating some extra features in scikit-learn, and its ecosystem, that are less common but deserve to be in the spotlight. In particular I hope to discuss these things that scikit-learn can do: - sparse datasets and models - larger than memory datasets - sample weight techniques - image classification via embeddings - tabular embeddings/vectorisation - data deduplication - pipeline caching…
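Pipeline caching, the last item on that list, is a real scikit-learn feature: passing memory= to Pipeline caches fitted transformers on disk, so refitting the same pipeline (say, while tuning only the final estimator) can skip the expensive upstream steps. A minimal sketch on synthetic data:

```python
import tempfile
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# memory= points at a cache directory; fitted transformers are stored
# there and reused when the same step is fit on the same data again.
cache_dir = tempfile.mkdtemp()
pipe = Pipeline(
    [("pca", PCA(n_components=5)), ("clf", LogisticRegression())],
    memory=cache_dir,
)
pipe.fit(X, y)
print(f"train accuracy: {pipe.score(X, y):.2f}")
```

On a second fit with an unchanged PCA step, only the classifier is refit — a large saving when the transformer is the expensive part.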
A primer on scRNA-seq foundation models The New York Times published an article titled A.I. Is Learning What It Means to Be Alive. I'm not the biggest fan of the title, but I am happy the subject is being talked about more! It laid out the story of how so-called 'scRNA-seq Foundation Models' may potentially change how single-cell RNA sequencing (scRNA) data is interpreted, used, and applied. Though the Times article was fantastic in its own right, it asks very surface-level questions about the whole process…I'd like to do a much deeper dive into this topic and try to walk through the motivation, ideas, and process of creating these models, along with what they do well and what they still struggle with…
Join NielsenIQ and Onehouse to explore the crucial role of vector embeddings in AI. Discover how Onehouse makes it more cost-efficient, simple, and scalable to generate and manage vector embeddings directly from your data lake, amidst rising vector database costs.
Live webinar. Aug 27, 2024 | 10 am PT
Can't make it? Register anyway to receive the recording! Register Now Here
* Based on unique clicks. ** Find last week's issue #560 here.
Looking to get a job? Check out our “Get A Data Science Job” Course It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume. Promote yourself/organization to ~63,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :) Stay Data Science-y! All our best, Hannah & Sebastian