Hello and thank you for tuning in to Issue #512!
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
Seeing this for the first time? Subscribe here:
If you find this newsletter helpful to your job, consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
If you don’t find this email useful, please unsubscribe here.
And now, let's dive into some interesting links from this week :)
Why Nvidia’s AI Supremacy is Only Temporary
Nvidia is an amazing company that has executed a contrarian vision for decades, and has rightly become one of the most valuable corporations on the planet thanks to its central role in the AI revolution. I want to explain why I believe its top spot in machine learning is far from secure over the next few years. To do that, I’m going to talk about some of the drivers behind Nvidia’s current dominance, and then how they will change in the future…
An introduction to Python for R Users
I have a confession to make: I am now a Python user. Don’t judge me, join me! In this post, I introduce Python for data analysis from the perspective of an R (tidyverse) user. This post is a must-read if you are an R user hoping to dip your toes in the Python pool…
LLM Training: RLHF and Its Alternatives
Reinforcement Learning with Human Feedback (RLHF) is an integral part of the modern LLM training pipeline due to its ability to incorporate human preferences into the optimization landscape, which can improve the model's helpfulness and safety…In this article, I will break down RLHF in a step-by-step manner to provide a reference for understanding its central idea and importance…Following up on the previous article that featured Llama 2, this article will also include a comparison between ChatGPT's and Llama 2's way of doing RLHF…Finally, for those wondering about the relevance or necessity of RLHF, I also added a section highlighting the most recent alternatives…
There are seven core components of an A/B testing stack, and if they’re not all working properly, your company may not be making the right decisions: teams aren’t shipping features that actually help customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results.
Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams self-serve experiment setup and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.
Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Build and keep your context window
This is the keynote I prepared for PyData Amsterdam 2023. The TL;DR is that we must understand the historical context of our engineering decisions if we are to be successful in this brave new LLM world…There’s something I’m worried about: I’m worried that, in data land, we have forgotten how to deal with the two fundamental problems of computer engineering. You might know them already: cache invalidation and naming things…
Should I transfer all my work to PyTorch already? [Reddit Discussion]
I've been using TensorFlow since 2017. I know it wasn't ideal or easy back then, but as an early adopter, I became very proficient with it and it has improved a lot since then. I have developed and deployed many custom models in low-level TF, both with and without utilizing the Keras abstractions. I am very comfortable with it in general…But I'm noticing now that PyTorch is gaining more popularity; all the younger practitioners, who got into deep learning within the last 3-5 years, are PyTorch adopters. I've also heard rumors that even Googlers are abandoning TF…Anyhow, I'd really prefer to stay within my comfort zone and continue to develop and improve in TF, but if TF is dying, then I'd better not, right? So should I convert?…
🐾 zoofs ( Zoo Feature Selection )
zoofs is a Python library for performing feature selection using a variety of nature-inspired wrapper algorithms. The algorithms range from swarm-intelligence to physics-based to evolutionary. It's an easy-to-use, flexible, and powerful tool for reducing your feature set…
Making Deep Learning Go Brrrr From First Principles
So, you want to improve the performance of your deep learning model. How might you approach such a task? Often, folks fall back on a grab-bag of tricks that might've worked before or that they saw in a tweet. "Use in-place operations! Set gradients to None! Install PyTorch 1.10.0 but not 1.10.1!"…That being said, reasoning from first principles can still eliminate broad swathes of approaches, thus making the problem much more approachable…So, if you want to keep your GPUs going brrrr, let's discuss the three components your system might be spending time on - compute, memory bandwidth, and overhead…
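To make that compute / memory-bandwidth / overhead framing concrete, here is a rough sketch of our own (not from the post; it assumes PyTorch and a CUDA GPU): if wall-clock time barely grows as the tensor gets bigger, the op is overhead-bound rather than compute- or bandwidth-bound.

    # Rough sketch (assumes PyTorch + a CUDA GPU): if doubling the input size
    # barely changes wall-clock time, the op is overhead-bound, not compute-bound.
    import time
    import torch

    def time_op(n, repeats=100):
        x = torch.randn(n, n, device="cuda")
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(repeats):
            y = x * 2 + 1          # cheap elementwise op
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / repeats

    for n in (256, 512, 1024, 2048):
        print(n, f"{time_op(n) * 1e6:.1f} us")  # flat timings => overhead dominates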
Building RAG-based LLM Applications for Production (Part 1)
In this guide, we will learn how to:
Develop a retrieval augmented generation (RAG) based LLM application from scratch.
Scale the major workloads (load, chunk, embed, index, serve, etc.) across multiple workers (a minimal sketch of the retrieval steps follows below this list).
Evaluate different configurations of our application to optimize for both per-component (e.g. retrieval_score) and overall performance (quality_score).
Implement an LLM hybrid routing approach to bridge the gap between OSS and closed LLMs.
Serve the application in a highly scalable and available manner.
Share the first-order and second-order impacts LLM applications have had on our products…
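As a taste of the load → chunk → embed → index → retrieve workflow the guide scales out, here is a minimal single-machine sketch of our own (not the guide's code; it assumes the sentence-transformers package is installed, and any embedding model could be swapped in):

    # Minimal sketch of the chunk -> embed -> index -> retrieve steps,
    # not the guide's distributed pipeline.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    def chunk(text, size=500, overlap=50):
        return [text[i:i + size] for i in range(0, len(text), size - overlap)]

    docs = ["...your documentation goes here..."]          # placeholder corpus
    chunks = [c for d in docs for c in chunk(d)]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    index = model.encode(chunks, normalize_embeddings=True)  # (num_chunks, dim)

    def retrieve(query, k=3):
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = index @ q                                    # cosine similarity
        top = np.argsort(-scores)[:k]
        return [chunks[i] for i in top]

    context = "\n".join(retrieve("How do I configure the service?"))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."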
On Prompting Stable Audio
Stable Audio lets you create custom-length audio just by describing it. It is powered by a generative audio model based on diffusion. You can generate and download audio in 44.1 kHz stereo. There’s also a nice interface, no need to be a hacker! And the audio you create can be used in your commercial projects. I’ve been experimenting with it over the last few weeks, and here are some ideas on how to use it!…
Convergence of gradient descent in over-parameterized networks
Neural networks typically have a very large number of parameters. Depending on whether they have more parameters than training instances, they are over-parameterized or under-parameterized. In either case, their loss function is a high-dimensional, often non-convex function. In this post, we study over-parameterized neural networks and their loss landscape; we answer the question of why gradient descent (GD) and its variants converge to global minima in over-parameterized neural networks, even though their loss function is non-convex…
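A quick toy illustration of the claim (our own, in PyTorch, not from the post): a small MLP with roughly 800 parameters fit to only 20 training points; plain gradient descent drives the non-convex training loss down by orders of magnitude, essentially to a global minimum.

    # Toy over-parameterized setting: ~800 parameters >> 20 training instances.
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    X = torch.randn(20, 5)          # 20 training instances
    y = torch.randn(20, 1)          # arbitrary targets

    model = torch.nn.Sequential(
        torch.nn.Linear(5, 100), torch.nn.ReLU(), torch.nn.Linear(100, 1)
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.05)

    for step in range(2000):
        loss = F.mse_loss(model(X), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(loss.item())  # training loss falls toward zero despite non-convexity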
A Gentle Introduction to GDAL Part 5: Shaded Relief
In my previous posts on GDAL (written more than five years ago!) I covered how to open and interpret maps and images with embedded geographic information; how to transform maps from one projection to another; some of the complexities introduced when working with highly detailed maps; and how to read and manipulate satellite imagery. This post and the next one will cover using GDAL for visualizing other types of data: measurements like elevation, cloud cover, city lights, and vegetation…
[AMA] I'm a data science manager in FAANG [Reddit Discussion]
I've worked at 3 different FAANGs as a data scientist. Google, Facebook and I'll keep the third one private for anonymity. I now manage a team. I see a lot of activity on this subreddit, happy to answer any questions people might have about working in Big Tech…
From Statistical to Causal Learning
We describe basic ideas underlying research to build and understand artificially intelligent systems: from symbolic approaches via statistical learning to interventional models relying on concepts of causality. Some of the hard open problems of machine learning and AI are intrinsically related to causality, and progress may require advances in our understanding of how to model and infer causality from data…
At LVMH, San Francisco, we are looking for a Manager of CRM Data Science Analytics to join our team and help us transform our customer data into insights and strategies for our luxury brands.
The successful candidate will have an extensive background in data science, analytics and customer relationship management (CRM), as well as a strong understanding of the luxury industry and how it applies to customer data analysis.
The Manager of CRM Data Science Analytics will be responsible for leveraging customer data and analytics to inform CRM strategies, drive customer engagement and ensure our brands’ success. In addition, this individual will support the development and implementation of data-driven strategies across the entire LVMH group.
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
🤗 Deep Reinforcement Learning Course
Welcome to the most fascinating topic in Artificial Intelligence: Deep Reinforcement Learning. This course will teach you about Deep Reinforcement Learning from beginner to expert. It’s completely free and open-source!…
Transformer: Concept and code from scratch
In this post, I’ll document my learnings on the main building blocks of the Transformer and how to implement them using PyTorch…Transformers are novel neural networks that are mainly used for sequence transduction tasks. Sequence transduction is any task where input sequences are transformed into output sequences. Most competitive neural sequence transduction models have an encoder-decoder structure. The encoder maps an input sequence of symbol representations to a sequence of continuous representations; the decoder then generates an output sequence of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next…
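For a flavor of those building blocks, here is a minimal sketch of our own (not the post's code) of the scaled dot-product attention used inside the encoder and decoder, including the causal mask that makes the decoder auto-regressive:

    # Scaled dot-product attention, the core op inside every Transformer block.
    import math
    import torch

    def attention(q, k, v, mask=None):
        # q, k, v: (batch, seq_len, d_k)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        if mask is not None:                  # causal mask => auto-regressive decoder
            scores = scores.masked_fill(mask == 0, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

    x = torch.randn(2, 10, 64)                # a batch of 10-token sequences
    causal = torch.tril(torch.ones(10, 10))   # each position sees only earlier ones
    out = attention(x, x, x, mask=causal)     # self-attention, shape (2, 10, 64)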
From PyTorch to JAX: towards neural net frameworks that purify stateful code
We will:
quickly recap a stateful LSTM-LM implementation in a tape-based gradient framework, specifically PyTorch,
see how PyTorch-style coding relies on mutating state, learn about mutation-free pure functions and build (pure) zappy one-liners in JAX,
step-by-step go from individual parameters to medium-size modules by registering them as pytree nodes,
combat growing pains by building fancy scaffolding and controlling context to extract initialized parameters and purify functions, and
realize that we could get that easily in a framework like DeepMind's haiku, using its transform mechanism…
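As a taste of where the post ends up, here is a tiny sketch of our own (not the post's code) of the pure init/apply pattern: parameters live in an explicit pytree and the forward pass is a pure function, so jax.grad composes cleanly.

    # Pure, stateless JAX pattern: explicit params pytree + pure forward function.
    import jax
    import jax.numpy as jnp

    def init(rng, in_dim, out_dim):
        w_key, b_key = jax.random.split(rng)
        return {"w": jax.random.normal(w_key, (in_dim, out_dim)) * 0.01,
                "b": jnp.zeros(out_dim)}

    def apply(params, x):                 # pure: no hidden state, no mutation
        return x @ params["w"] + params["b"]

    def loss(params, x, y):
        return jnp.mean((apply(params, x) - y) ** 2)

    params = init(jax.random.PRNGKey(0), 3, 1)
    x, y = jnp.ones((8, 3)), jnp.zeros((8, 1))
    grads = jax.grad(loss)(params, x, y)  # gradient pytree, same structure as params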
* Based on unique clicks.
** Find last week's issue #511 here.
Thank you for joining us this week :)
All our best,
Hannah & Sebastian
P.S.
If you found this newsletter helpful to your job, please consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.