Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If this newsletter is helpful to your job, please become a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week.
Book: Data Engineering Design Patterns (DEDP)
After writing for 8+ years on my personal blog, and even doing it professionally for a year, I love the challenge my former boss confronted me with: to put my 20+ years of experience in data engineering into a book. And, as I love long formats, a book might just be the best format to bring it all together in one…the convergent evolution in data engineering captivated me, inspiring this book. My goal is to dissect these evolutionary paths, uncovering their unique strengths and weaknesses. By doing so, I aim to identify universal design patterns applicable across the spectrum of data engineering. Join me as we explore the essence of convergent evolution, its relevance in our field, and how it can guide us through the Data Engineering Lifecycle, enriching our understanding and practice….
Learning and Controlling Silicon Dopant Transitions in Graphene using Scanning Transmission Electron Microscopy
We introduce a machine learning approach to determine the transition dynamics of silicon atoms on a single layer of carbon atoms, when stimulated by the electron beam of a scanning transmission electron microscope (STEM). Our method is data-centric, leveraging data collected on a STEM. The data samples are processed and filtered to produce symbolic representations, which we use to train a neural network to predict transition probabilities. These learned transition dynamics are then leveraged to guide a single silicon atom throughout the lattice to pre-determined target destinations. We present empirical analyses that demonstrate the efficacy and generality of our approach…
Third wave 3D human pose and shape estimation: Beyond 3D to understanding humans
We recently released a pre-print describing PoseGPT, a multi-modal large language model (LLM) that is trained to estimate 3D human pose from images, text, or both. This is interesting because it shows that an LLM can estimate 3D humans from images, but this may not be so surprising. After all, LLMs are large-capacity models and can be fine-tuned to do many tasks. More interesting and important is that PoseGPT learns to relate 3D human pose to more general concepts about humans…
There are seven core components of an A/B testing stack, and if they’re not all working properly, your company may not be making the right decisions. That means teams aren’t shipping features that actually help customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results.
Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams self-serve experiment setup and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.
Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Which Movies Are The Most Polarizing? A Statistical Analysis
Which films divide audiences, and what makes a movie divisive?…Released in 1959, Ed Wood's Plan 9 from Outer Space is an incomprehensible low-budget sci-fi yarn about aliens resurrecting the Earth's dead as zombies…Ed Wood's bizarro sci-fi masterwork exists amongst a highly specific canon of polarizing films: movies that are the object of intense adoration and equally harsh criticism (with little middle ground). People watch these movies and either hail them as classics or openly lambast them as evidence of societal decay. So today, we'll explore the most polarizing movies in film history, the characteristics that make them so schismatic, and the decline of divisive filmmaking in recent years…
How to think about the OpenAI Q* rumors
I’m skeptical that Q*—whatever it is—is the crucial breakthrough that will lead to artificial general intelligence. I certainly don’t think it’s a threat to humanity. But it might be an important step toward an AI with general reasoning abilities. In this piece, I’ll offer a guided tour of this important area of AI research and explain why step-by-step reasoning techniques designed for math problems could have much broader applications…
Will Scaling Solve Robotics?: Perspectives From CoRL 2023
This year’s CoRL was the biggest CoRL yet, with over 900 attendees, 11 workshops, and almost 200 accepted papers. While there were a lot of cool new ideas (see this great set of notes for an overview of technical content), one particular debate seemed to be front-and-center: “is training a large neural network on a very large dataset a feasible way to solve robotics?”…My main goal here is to try to present the different sides of the argument as I heard them, without bias towards any side. Almost all the content is taken directly from talks I attended or conversations I had with fellow attendees. My hope is that this serves to deepen people’s understanding around the debate, and maybe even inspire future research ideas and directions…
Combining Bayes and Graph-based Causal Inference with Robert Ness
In this seminar, we discuss how to do causal graphical modeling with probabilistic programming, as well as tools and design patterns for doing so…
Arxiv Dives - Vision Transformers
We have a reading club every Friday called Arxiv Dives where we go over the fundamentals of a lot of the state-of-the-art techniques used in Machine Learning today. Last week we dove into the "Vision Transformers" paper from 2021, where the Google Brain team benchmarked training large-scale transformers against ResNets. Though it is not groundbreaking research as of this week, I think with the pace of AI it is important to dive deep into past work and what others have tried! It's nice to take a step back and review the fundamentals as well as keep up with the latest and greatest. I posted the notes and recap here if anyone finds them helpful…
What opinion about data science would you defend like this? [Reddit]
How We Investigated France’s Mass Profiling Machine
Lighthouse Reports partnered with Le Monde to investigate an algorithm deployed by France’s Caisse Nationale des Allocations Familiales (CNAF), the agency responsible for the French social security system. The algorithm, deployed for more than 10 years, attempts to predict which benefit recipients are committing fraud. All of the more than 13 million households that receive some type of benefit, representing nearly half the population, are assigned a risk score. Using French freedom-of-information laws, Le Monde obtained the source code for three risk-scoring models deployed by the CNAF between 2010 and 2023…
Vertical AI: Why a Vertical Approach is Key to Building Enduring AI Applications
The rise of Vertical SaaS in the past decade has demonstrated the power of industry-specific software, producing dozens of winners like Toast, Shopify, Procore, and ServiceTitan. Yet there are still many markets underserved by Vertical SaaS: foundational industries with intrinsic barriers to technological disruption (e.g. unstructured data, constrained TAMs, slow sales cycles, low annual contract values, and tricky incumbents), and sectors that are either just emerging or undergoing a major transformation (e.g. the electrification of energy). But now, two key developments have made it possible to build software that serves these outliers: 1) the rise of artificial intelligence that can tackle unstructured data and 2) the redefinition of Vertical SaaS as Vertical Software…
Teach Llamas to Talk: Recent Progress in Instruction Tuning
Collecting supervised fine-tuning or preference data is known to be prohibitively expensive (Lambert, 2023), so it remained a corporate game until 2023, when people found cheaper ways to construct such data. Since then there have been numerous open-source efforts in developing instruction-tuned models. In the following, I will cover such efforts in four parts: SFT data, preference data, algorithms, and evaluation. In the end, I will introduce our latest work on instruction-following evaluation, which shows that it is important to set up the right evaluator; otherwise you may get misleading results…
We are BCG X.
BCG X is the tech build & design unit of BCG. Turbocharging BCG’s deep industry and functional expertise, BCG X brings together advanced tech knowledge and ambitious entrepreneurship to help organizations enable innovation at scale. With nearly 3,000 technologists, scientists, programmers, engineers, and human-centered designers located across 80+ cities, BCG X builds and designs platforms and software to address the world’s most important challenges and opportunities.
Our BCG X teams own the full analytics value-chain end to end: framing new business challenges, designing innovative algorithms, implementing and deploying scalable solutions, and enabling colleagues and clients to fully embrace AI. Our product offerings span from fully custom builds to industry-specific, leading-edge AI software solutions.
Our Data Scientists and Senior Data Scientists are part of our rapidly growing team, applying data science methods and analytics to real-world business situations across industries to drive significant business impact. You'll have the chance to partner with clients in a variety of BCG regions and industries, and on key topics like climate change, enabling them to design, build, and deploy new and innovative solutions.
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Text Embeddings Visually Explained
We take a visual approach to gain an intuition behind text embeddings, what use cases they are good for, and how they can be customized using finetuning…
LLM Visualization
A visualization and walkthrough of the LLM algorithm that backs OpenAI's ChatGPT. Explore the algorithm down to every add & multiply, seeing the whole process in action…
Football Analytics Bible
A space for football analytics projects by Edd Webster, including a curated list of publicly available resources published by the football analytics community…
* Based on unique clicks.
** Find last week's issue #523 here.
Looking to hire? Hit reply to this email and let us know.
Looking to get a job? Check out our “Get A Data Science Job” Course
A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself to ~60,000 subscribers by sponsoring this newsletter.
Thank you for joining us this week! :)
All our best,
Hannah & Sebastian
P.S. Was today’s newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.