Hello!
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If you don’t find this email useful, please unsubscribe here.
Is this newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week
Kalman Filter For Dummies
When I started doing my homework for the Optimal Filtering for Signal Processing class, I said to myself: "How hard can it be?". Soon I realized that it was a fatal mistake…this article is the result of a couple of days' work and reflects the slow learning curve of a "mathematically challenged" person…If you're humble enough to admit that you don't understand this stuff completely, you'll find this material very enlightening.
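If you want a feel for the mechanics before reading, here is a minimal one-dimensional Kalman filter sketch (my own, not from the article) that tracks a constant value from noisy measurements; the noise variances Q and R are illustrative assumptions.

    import numpy as np

    # Minimal 1-D Kalman filter: estimate a constant true value from noisy readings.
    # Model: x_k = x_{k-1} + process noise;  z_k = x_k + measurement noise.
    np.random.seed(0)
    true_value = 5.0
    measurements = true_value + np.random.normal(0, 0.5, size=50)  # noisy sensor

    x_est, p_est = 0.0, 1.0   # initial state estimate and its variance (guesses)
    Q, R = 1e-5, 0.25         # process and measurement noise variances (assumed)

    for z in measurements:
        # Predict: state unchanged, uncertainty grows by the process noise
        x_pred, p_pred = x_est, p_est + Q
        # Update: blend prediction and measurement via the Kalman gain
        K = p_pred / (p_pred + R)
        x_est = x_pred + K * (z - x_pred)
        p_est = (1 - K) * p_pred

    print(f"final estimate: {x_est:.3f} (true value {true_value})")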
Understanding Moments
Why are a distribution's moments called "moments"? How does the equation for a moment capture the shape of a distribution? Why do we typically only study four moments? I explore these and other questions in detail…
Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments
LoRA is one of the most widely used, parameter-efficient fine-tuning techniques for training custom LLMs. From saving memory with QLoRA to selecting the optimal LoRA settings, this article provides practical insights for those interested in applying it…
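For readers new to the idea, here is a minimal sketch of the core LoRA trick in PyTorch: the pretrained weight is frozen and a trainable low-rank update B·A is added on top. The rank, scaling, and layer size below are illustrative choices, not the article's settings.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen pretrained linear layer plus a trainable low-rank update B @ A."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False            # freeze the pretrained weights
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
            self.scaling = alpha / r

        def forward(self, x):
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

    layer = LoRALinear(nn.Linear(768, 768))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(f"trainable params: {trainable}")  # only A and B are trained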
Hex is a collaborative workspace for data science and analytics. Now data teams can run their queries, notebooks, and interactive reports — all in one place.
Hex has Magical AI tools that can generate queries and code, create visualizations, and even kickstart a whole analysis, all from natural language prompts, allowing teams to accelerate work and focus on what matters.
Join hundreds of data teams like Notion, AllTrails, Loom, Brex, and Algolia using Hex every day to make their work more impactful. Sign up today at hex.tech/datascienceweekly to get a 30-day free trial of the Hex Team plan!
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
News from PyTorch Conference 2023
Hello from the PyTorch Conference in San Francisco! We’re thrilled to be able to bring together leading researchers, developers, and academic communities to further the education and advancement of the end-to-end machine learning framework…The PyTorch team has been hard at work this year to bring innovative releases that further enhance the AI and ML community…Read on for all of the news and happenings coming out of PyTorch Conference 2023!…
Deep Learning Ultra
Open source Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in PyTorch, OpenCV (compiled for GPU), TensorFlow 2 for GPU, PyG and NVIDIA RAPIDS, running on CUDA 12.1…
Storytelling vs. Exploring, With Global Happiness Data: A meta analysis on data visualization
I recently came across a piece that examines the difference between "storytelling" and "exploration" in the context of data visualization. The piece, by Amanda Makulec, observes that the two serve fundamentally different goals, but are often conflated in the industry…I wanted to explore (or I guess maybe tell a story about...) this distinction a bit more, as well as a few other factors that I think about when designing a visualization. My goal is to codify some of the broad decisions that go into designing a visualization, and to outline how those decisions can be translated into specific design choices…
LLM domination on job descriptions [Reddit]
Can anyone explain why so many companies are asking for LLM experience for data scientist roles? It wasn't there like 6-8 months ago; now around 70% of the job descriptions ask for it, and it goes like Python, SQL, and LLMs. Looks a bit weird to be honest. What are they doing, creating their own ChatGPT?…
Why We’re Building an Open-Source Universal Translator
We’re building a small, unconnected box with a built-in display that can automatically translate between dozens of different languages. You can see it in the video above, and we’ve got working demos to share if you’re interested in trying it out. The form factor means it can be left in-place on a hotel front desk, brought to a meeting, placed in front of a TV, or anywhere you need continuous translation. The people we’ve shown this to have already asked to take them home for visiting relatives, colleagues, or themselves when traveling…
Interactive Demonstration of Ridge Regression and Intro to Hyperparameter Tuning
In the Fall of 2019 my students requested a demonstration to show the value of ridge regression. I wrote this interactive demonstration to show cases in which the use of a regularization coefficient, a hyperparameter that reduces the model's flexibility / sensitivity to the training data (i.e., reduces model variance), improves prediction accuracy…
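For a quick non-interactive taste (a sketch of the same idea, not the demo's code), scikit-learn's Ridge lets you sweep the regularization strength alpha and compare held-out error; the toy data below is made nearly collinear so plain least squares struggles.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    # Small noisy dataset with nearly collinear features, where OLS tends to overfit.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 10))
    X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=60)   # nearly duplicate column
    y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=1.0, size=60)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    for alpha in [0.0, 0.1, 1.0, 10.0]:        # alpha=0 is ordinary least squares
        model = Ridge(alpha=alpha).fit(X_tr, y_tr)
        mse = mean_squared_error(y_te, model.predict(X_te))
        print(f"alpha={alpha:<5} test MSE={mse:.3f}")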
Multimodality and Large Multimodal Models (LMMs)
This post covers multimodal systems in general, including LMMs. It consists of 3 parts.
Part 1 covers the context for multimodality, including why multimodal, different data modalities, and types of multimodal tasks.
Part 2 discusses the fundamentals of a multimodal system, using the examples of CLIP, which lays the foundation for many future multimodal systems (see the minimal zero-shot CLIP sketch after this list), and Flamingo, whose impressive performance gave rise to LMMs.
Part 3 discusses some active research areas for LMMs, including generating multimodal outputs and adapters for more efficient multimodal training, covering newer multimodal systems such as BLIP-2, LLaVA, LLaMA-Adapter V2, LAVIN, etc.
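If CLIP is new to you, here is a minimal zero-shot classification sketch using the Hugging Face transformers checkpoint openai/clip-vit-base-patch32; the image path and candidate labels are placeholders, and the post itself goes much deeper.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Zero-shot image classification: score an image against free-form text labels.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("photo.jpg")                     # placeholder image path
    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)    # similarity -> probabilities
    for label, p in zip(labels, probs[0].tolist()):
        print(f"{label}: {p:.3f}")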
GenSim: Generating Robotic Simulation Tasks via Large Language Models
We propose to automatically generate rich simulation environments and expert demonstrations by exploiting a large language model's (LLM) grounding and coding abilities…Our approach, dubbed GenSim, has two modes: goal-directed generation, wherein a target task is given to the LLM and the LLM proposes a task curriculum to solve the target task, and exploratory generation, wherein the LLM bootstraps from previous tasks and iteratively proposes novel tasks that would be helpful in solving more complex tasks. We use GPT-4 to expand the existing benchmark by ten times to over 100 tasks, on which we conduct supervised finetuning and evaluate several LLMs, including finetuned GPTs and Code Llama, on code generation for robotic simulation tasks…
There are seven core components of an A/B testing stack, and if they're not all working properly, your company may not be making the right decisions: teams aren't shipping features that actually help customers, the org is leaving money on the table, and you're likely doing lots of manual work to validate experiment results.
Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams self-serve experiment setup and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.
Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
More than 90% of automotive innovations are based on electronics and software.
We, the BMW Group, offer you an interesting and varied internship in data science for Performance Control & Digitalization. To take our operations to the next level, the BMW Group's Performance Control & Digitalization department is looking for a data science intern to contribute to the BMW Group's Supply Chain Innovations Think Tank and continue BMW's leadership in supply chain management. The team's goal is to research emerging technologies, including data science (ML, AI, BI, etc.).
Location is Munich. Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
RLHF Papers
Some notes for browsing the list.
All papers are about LLMs unless tagged with a specific category, such as control for robotics / simulated agents or multimodal, or more than one!
Limited notes and summaries will be added as I go
This is a resource for keeping up with new and recently added work; I will not add all historic work, even things from earlier in 2023. For prominent past papers, see here…
Deep Learning Course
Here you can find slides, recordings, and a virtual machine for François Fleuret's deep-learning course 14x050 at the University of Geneva, Switzerland…This course is a thorough introduction to deep learning, with examples in the PyTorch framework (a tiny autograd sketch follows the topic list):
machine learning objectives and main challenges,
tensor operations,
automatic differentiation, gradient descent,
deep-learning specific techniques,
generative, recurrent, attention models…
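To give a flavor of two of those topics, here is a minimal sketch (my own, not course material) of automatic differentiation and gradient descent in PyTorch, fitting a line to noisy data; the learning rate and step count are arbitrary choices.

    import torch

    # Tiny example of autograd + gradient descent: fit y = w * x + b to noisy data
    # by minimizing the mean squared error.
    torch.manual_seed(0)
    x = torch.linspace(-1, 1, 100)
    y = 2.0 * x + 0.5 + 0.05 * torch.randn(100)

    w = torch.zeros(1, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)

    for step in range(200):
        loss = ((w * x + b - y) ** 2).mean()   # forward pass builds the graph
        loss.backward()                        # autograd computes dloss/dw, dloss/db
        with torch.no_grad():                  # manual SGD step
            w -= 0.1 * w.grad
            b -= 0.1 * b.grad
            w.grad.zero_()
            b.grad.zero_()

    print(f"w={w.item():.2f}, b={b.item():.2f}")  # roughly 2.0 and 0.5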
SAT Solvers I: Introduction and applications
This tutorial concerns the Boolean satisfiability or SAT problem. We are given a formula containing binary variables that are connected by logical relations such as OR and AND. We aim to establish whether there is any way to set these variables so that the formula evaluates to TRUE…Algorithms that are applied to this problem are known as SAT solvers. The tutorial is divided into three parts. In part I, we introduce Boolean logic and the SAT problem. We discuss how to transform SAT problems into a standard form that is amenable to algorithmic manipulation. We categorize types of SAT solvers and present two naïve algorithms. We introduce several SAT constructions, which can be thought of as common sub-routines for SAT problems. Finally, we present some applications; the Boolean satisfiability problem may seem abstract, but as we shall see it has many practical uses…
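To make the standard (CNF) form concrete, here is a toy sketch (mine, not the tutorial's code) of the naïve approach: a formula is a list of clauses, each clause a list of signed variable indices, and the solver simply tries every assignment.

    from itertools import product

    def brute_force_sat(clauses, n_vars):
        """Naive SAT solver: try all 2^n assignments of a CNF formula.

        A clause is a list of non-zero ints: k means variable k, -k means NOT k.
        Returns a satisfying assignment as a dict, or None if unsatisfiable.
        """
        for bits in product([False, True], repeat=n_vars):
            assignment = {i + 1: bits[i] for i in range(n_vars)}
            if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
                   for clause in clauses):
                return assignment
        return None

    # (x1 OR x2) AND (NOT x1 OR x3) AND (NOT x2 OR NOT x3)
    clauses = [[1, 2], [-1, 3], [-2, -3]]
    print(brute_force_sat(clauses, n_vars=3))   # -> {1: False, 2: True, 3: False}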
* Based on unique clicks.
** Find last week's issue #516 here.
Looking to hire? Hit reply to this email and let us know.
Looking to get a job? Check out our “Get A Data Science Job” Course
A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself to ~60,000 subscribers by sponsoring this newsletter.
Thank you for joining us this week :)
All our best,
Hannah & Sebastian
P.S.
Is this newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.