Editor Picks
- DALL·E 2 and The Origin of Vibe Shifts
The point of this essay isn’t about predicting next year’s design trends. To me the more interesting thing is to understand the ecological process that generates those trends, seeing the true signaling function of visual design, and learning why some corporate status signaling is so effective and why some isn’t. Projecting further out, it’s about trying to picture a world where it’s cheap and easy for anyone to generate just about any kind of image they want...To answer these questions, we’re going to tap the most well-developed pool of knowledge on the use of costly signals and their evolution over time: biology....
- Learning with not Enough Data Part 3: Data Generation
Part 3 of “what if you don’t have enough training data” series - touch base on creating more synthetic data by data augmentation or model generation, as well as some ideas on how to work with noisy labels (given synthetic data might not be fully correct)...
A Message from this week's Sponsor:
ML and Data Developers Week (May 16~20)
Learn ML/data engineering from top practitioners at the ML and Data Developers Week, which is geared toward engineering teams discussing practical solutions and the challenges faced when building ML for the real world. With thousands of global ML devs/data scientists, deep-dive tech talks, and hands-on workshops, you can look forward to engaging conversations, insightful discussions, hands-on code labs, and peer networking. Free to join virtually and/or in person, with food, swag, and prizes.
Data Science Articles & Videos
- Will It Scale? Applying Data, Science, and Economics to the Art of Ideas
In this interview, John List [chief economist at Walmart & professor of economics at the University of Chicago] discusses some of his new book’s themes [The Voltage Effect: How to Make Good Ideas Great and Great Ideas Scale], such as the importance of knowing when to quit or pivot, and how practicing the science of scaling can help ensure an idea’s success. He also shares his thoughts on the relationship between economics and technology, the state of behavioral economics and data science, and the prospect of using AI to reanimate promising, but previously unsuccessful ideas...
- Data Science at Stitch Fix
Podcast Interview with Olivia Liao, Senior Director of Data Science at Stitch Fix, a company that uses data science and expert stylists to deliver personalization at scale. We discuss how they blend data science and domain expertise, how they tune recommendations in light of logistics and supply chain constraints, and how they incorporate new developments in large language models, multimodal models and Responsible AI....
- Creating Confidence Intervals for Machine Learning Classifiers
This article outlines different methods for creating confidence intervals for machine learning models. Note that these methods also apply to deep learning...it’s worth highlighting that the big picture is to measure and report uncertainty. Confidence intervals are one way to do that. However, it is also helpful to include the average performance over different dataset splits or random seeds with the variance or standard deviation – I sometimes adopt this simpler approach as it is more straightforward to explain. But since this article is about confidence intervals, let’s define what they are and how we can construct them....
- It’s Our Moral Obligation to Make Data More Accessible
Most of the world’s data is sitting on a shelf, being used in a very narrow domain. This data, if properly activated, could solve some of the world’s biggest problems and lead to more health, happiness, and love for society. We could use this data to uncover some of society’s biggest secrets...Like Marc Andreessen’s piece, It’s Time to Build, this piece is a full-throated argument to massively increase the accessibility of data. And we need to do it now...
- The StatQuest Introduction to PyTorch [Video]
PyTorch is one of the most popular tools for making Neural Networks. This StatQuest walks you through a simple example of how to use PyTorch one step at a time. By the end of this StatQuest, you'll know how to create a new neural network from scratch, make predictions and graph the output, and optimize a parameter using backpropagation. BAM!!!...
- An arxiv-sanity-like view of ICLR 2022 papers
Hi, I am a fan of www.arxiv-sanity.com and like to have similar summaries for conference papers. I have ordered all ICLR2022 papers by rating and created 8-page thumbnails. With ICLR2022 now in full swing, the project can be useful in getting a quick overview of the accepted publications...
- Specification gaming: the flip side of AI ingenuity
Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome...This problem arises in the design of artificial agents. For example, a reinforcement learning agent can find a shortcut to getting lots of reward without completing the task as intended by the human designer. These behaviours are common, and we have collected around 60 examples so far (aggregating existing lists and ongoing contributions from the AI community)...In this post, we review possible causes for specification gaming, share examples of where this happens in practice, and argue for further work on principled approaches to overcoming specification problems...
- More Than Meets the Eye: A Closer Look at Encodings in Visualization
Encodings play a central role in visualization, but I believe our thinking about them is too simplistic. In a new paper, I argue that we need to distinguish between the encodings that specify how a visualization is drawn and the ones that are readable or actually read by an observer. While they largely or entirely overlap in some charts (like bar charts or scatterplots) they don’t in others (pie charts, line charts, etc.). And what exactly do you even specify in more complex visualizations like treemaps?...
- What is the value of the p-value? [Slides from the talk]
The debate over the value and interpretation of p-value has endured since the time of its inception nearly 100 years ago. The use and interpretation of p-values vary by a host of factors, especially by discipline. These differences have proven to be a barrier when developing and implementing boundary-crossing clinical and translational science. The purpose of this panel discussion is to discuss misconceptions, debates, and alternatives to the p-value...
- Compact word vectors with Bloom embeddings
A high-coverage word embedding table will usually be quite large. One million 32-bit floats occupies 4MB of memory, so one million 300-dimension vectors will be 1.2GB in size. Such a large model size is at least annoying for many applications, while for others it’s completely prohibitive...Probabilistic data structures are a natural fit for machine learning models, so they’re quite widely used. However, they’re definitely unintuitive, which is why we refer to this solution [using a probabilistic data structure] as a “cheat”. We’ll start by introducing the full algorithm, without dwelling too long on why it works. We’ll then go back and fill in more of the intuition, and then describe how we use it in practice in Thinc, spaCy and floret...
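The memory arithmetic in the Bloom embeddings item above can be checked with a short sketch, which also illustrates the general idea of hashing each word into a few rows of a much smaller table and summing them. This is a minimal illustration of the hashing trick, not the Thinc, spaCy, or floret implementation: the helper names, the use of CRC32 as the hash, the 20,000-row table, and the choice of 4 hashes per word are all assumptions made for the example.

```python
import zlib

import numpy as np


def table_size_bytes(n_rows, n_dims, bytes_per_float=4):
    """Memory needed for a dense float32 embedding table."""
    return n_rows * n_dims * bytes_per_float


def bloom_embed(word, table, n_hashes=4):
    """Hypothetical helper: look up a word by hashing it several times.

    Each seed yields a different row index; the word's vector is the sum
    of those rows. Two words rarely collide on every one of their rows,
    so most words still end up with a distinct vector.
    """
    n_rows = table.shape[0]
    rows = [
        zlib.crc32(f"{seed}:{word}".encode("utf-8")) % n_rows
        for seed in range(n_hashes)
    ]
    return table[rows].sum(axis=0)


# One million 300-dimension float32 vectors, as in the excerpt.
full_table_bytes = table_size_bytes(1_000_000, 300)

# A much smaller table (row count is an assumption for illustration).
rng = np.random.default_rng(0)
small_table = rng.normal(size=(20_000, 300)).astype(np.float32)
small_table_bytes = table_size_bytes(*small_table.shape)

vector = bloom_embed("apple", small_table)
print(full_table_bytes / 1e9, "GB vs", small_table_bytes / 1e6, "MB")
```

The lookup is deterministic, so the same word always maps to the same vector, while the table no longer needs one row per vocabulary entry.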
Conference*
Join us at apply(), the ML data engineering conference - it’s free.
Speakers include practitioners from the Wikimedia Foundation, Facebook, Gojek, Snapchat, Instacart, Walmart, Stripe, Uber, Volvo, Snowflake, Databricks, and more. We’d love for you to join us.
Agenda highlights:
- Smitha Shyam, Director of Engineering at Uber: Uber's Michelangelo: Then and Now
- Chris Albon, Director of Machine Learning at Wikimedia Foundation: More Ethical Machine Learning Using Model Card at Wikimedia
- Matei Zaharia, Co-Founder and Chief Technologist at Databricks: The Future of Data for Machine Learning
- Chip Huyen, Co-Founder at Claypot AI: Machine Learning Platform for Online Prediction and Continual Learning
- Clem Delangue, CEO at Hugging Face: Is Open-Source Machine Learning Becoming the Most Impactful Technology of the Decade?
See the full agenda and register for free.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!
Jobs
- Data Scientist - Hungryroot - Remote
Hungryroot is looking for a Data Scientist to join our growing Data Team. As a Data Scientist, you will work closely with other Data Scientists and Data Engineers to develop various Machine Learning models that power Hungryroot and its AI functions. These models include traditional forecasting models, as well as more industry-specific optimization challenges.
As a Data Scientist at Hungryroot, you will work on answering questions like: how do you tell what food someone would like to eat this week, how do you determine whether they enjoyed it or not, maybe they liked their meals last week, but are now looking for different options, maybe they like the same food on Tuesdays, but variety on Fridays, what about spicy food, is Green Chilly as spicy as Green Curry?
Want to post a job here? Email us for details --> team@datascienceweekly.org
Training & Resources
- Writing production grade code for ML in python [Reddit Discussion]
I have been interviewing for a machine learning lead position. I have successfully passed 3 interview rounds (coding, HR, system design). I have my final interview with the VP of Engineering. When asked how best to prepare myself, they said they would like to test my ability to write "production quality" code in python. While I do have some experience, the downside is I worked in small R&D teams for a long time. Though I am knowledgeable in python, perhaps, I might have not followed all the industry best practices...If you are a hiring manager or interviewer, how would you test this ability? How do I prepare myself to prove my ability to write production grade code?...
- Parametric vs. Non-parametric tests, and when to use them
Too often the statistical underpinnings of the data science community are overlooked. I’ve been lucky enough to have had both undergraduate and graduate courses dedicated solely to statistics, in addition to growing up with a statistician for a mother. So this article is what will likely be the first of several to share some basic statistical tests and when/where to use them!...A parametric test makes assumptions about a population’s parameters...
- Mathematical Foundations of Monte Carlo Methods
We will try to give a sense of what these Monte Carlo methods are, how they work, why, and what they are used for. This quick introduction is for readers who do not have the time or the desire to get any further. But you may need to read all the remaining chapters if you are serious about learning what these methods are...This lesson is more an introduction to the mathematical tools upon which the Monte Carlo methods are built. The methods themselves are explained in the next lesson (Monte Carlo Methods in Practice)...
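For readers who want a feel for what a Monte Carlo method looks like before following the link, here is a minimal sketch of the classic textbook example: estimating pi by sampling random points in the unit square. It is not taken from the linked lesson; the function name, sample count, and seed are assumptions chosen for illustration.

```python
import random


def mc_estimate_pi(n_samples, seed=0):
    """Estimate pi from the fraction of random points inside a quarter circle.

    A point (x, y) drawn uniformly from the unit square falls inside the
    quarter circle of radius 1 with probability pi / 4, so four times the
    observed fraction approximates pi.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


estimate = mc_estimate_pi(100_000)
print(estimate)
```

The error of such an estimate shrinks roughly with the square root of the number of samples, which is why quadrupling the sample count only about halves the error.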
Books
-
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian