Hello and thank you for tuning in to Issue #507!
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
Seeing this for the first time? Subscribe here:
Want to support us? Become a paid subscriber here.
If you don’t find this email useful, please unsubscribe here.
And now, let's dive into some interesting links from this week :)
What coal and Jevons’ paradox tell us about AI and data
What happens when a computer can write SQL or Python better than any human?…Do all the data jobs go away?…This is the basic form of a complex question that people are asking in the Age of LLMs; as it turns out, a lot of data practitioners’ tasks are both highly specialized (e.g., require arcane knowledge of pandas or SQL syntax) and also pretty rote (e.g., writing out that same syntax over and over). And those happen to be the kinds of tasks that LLMs seem really well suited for. So, it's easy to imagine the Executives gearing up to replace all the data people with computers….The future is weird and uncertain, but I really don't think that's what's going to happen…and to understand why, let’s take a journey back to the 19th century…and look at Jevons’ paradox…
The Direct Approach
Empirical scaling laws can help predict a model's cross-entropy loss as a function of training inputs such as compute and data. However, in order to predict when AI will achieve some subjective level of performance, it is necessary to devise a way of interpreting that cross-entropy loss. This blog post provides a discussion of one such theoretical method, which we call the Direct Approach.
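For reference, the kind of empirical scaling law being interpreted here is the parametric form popularized by Hoffmann et al. (2022); the notation below is mine, not necessarily the post's:

```latex
% Chinchilla-style parametric scaling law (Hoffmann et al., 2022):
% N = parameter count, D = training tokens, E = irreducible loss,
% A, B, alpha, beta = empirically fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

The Direct Approach is about converting a predicted value of L into a statement about subjective performance, which the raw loss alone does not give you.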
Do Machine Learning Models Memorize or Generalize?
In 2021, researchers made a striking discovery while training a series of tiny models on toy tasks. They found a set of models that suddenly flipped from memorizing their training data to correctly generalizing on unseen inputs after training for much longer. This phenomenon – where generalization seems to happen abruptly and long after fitting the training data – is called grokking and has sparked a flurry of interest…In this article we’ll examine the training dynamics of a tiny model and reverse engineer the solution it finds – and in the process provide an illustration of the exciting emerging field of mechanistic interpretability. While it isn’t yet clear how to apply these techniques to today’s largest models, starting small makes it easier to develop intuitions as we progress towards answering these critical questions about large language models…
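The canonical toy task in this line of work is modular arithmetic (from the original grokking paper by Power et al.). A minimal sketch of that setup, with illustrative values rather than anything taken from the article:

```python
import itertools
import random

# Toy grokking-style dataset: learn (a + b) mod p from examples.
# p and the train fraction are illustrative choices, not from the article.
p = 113
pairs = list(itertools.product(range(p), repeat=2))
random.seed(0)
random.shuffle(pairs)

data = [((a, b), (a + b) % p) for a, b in pairs]
split = int(0.3 * len(data))  # small train fractions are where grokking shows up
train, test = data[:split], data[split:]

# A tiny model trained on `train` can first memorize (high test error),
# then, much later in training, abruptly generalize to `test`.
```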
Accelerate your success with AE's elite team of experts!
🚀 Get ahead with swift development of Minimum Viable Products (MVPs).
🚀 Lead the way in innovation with Digital Transformation Initiatives.
🚀 Boost your ROI with tailored AI/ML solutions.
Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Llama from scratch (or how to implement a paper without crying)
I want to provide some tips from my experience implementing a paper. I'm going to cover implementing a dramatically scaled-down version of Llama, trained on TinyShakespeare. This post is heavily inspired by Karpathy's Makemore series, which I highly recommend…
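To give a flavor of the components involved: one building block any Llama reimplementation needs is RMSNorm, Llama's replacement for LayerNorm. A minimal PyTorch sketch (mine, not the post's exact code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm as used in Llama: no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each vector by the reciprocal of its RMS over the last dimension.
        rms_inv = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms_inv)
```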
What's the point of learning ML theory if industry doesn't care (other than interviews)? [Reddit Discussion]
To understand ML theory, one needs a good hold on stats, probability, and basic algebra. Deep learning requires extensive knowledge of linear algebra. All of this takes months and months to understand. But in the end, all that matters is whether you can implement a model or not, especially today, when pre-trained (LLM) models take just a few lines of code to train. I won't say implementation is not important, but it requires much less effort to master. Why should one (and why would one) spend time learning all the math?…
Principles for Planning an Academic Workshop
Workshops are my favourite type of academic meeting. Smaller than conferences, more focused on a particular area of interest, and featuring the latest results. Planning a workshop can be a challenging endeavour. As an organizer, it is hard to foresee what makes for a good experience for the audience. I have co-organized a number of workshops of different shapes and sizes, and put a lot of thought into what makes for the best audience experience. I will share some of my thoughts and opinions on how to plan an effective and enjoyable workshop. This post will focus on single-day workshops with a combination of invited talks and contributed papers, which in particular are common at machine learning conferences...
Unlocking Insights: Estimating Causal Effect Using Propensity Score Matching
By the end of this post, you will understand how to estimate a causal effect from retrospective data, be familiar with the concept of confounders, know how to balance your data using Propensity Score Matching, and, most importantly, have fully working code to use in your own research. If you already feel comfortable with Propensity Score Matching, skip to the real-life example and use the code snippets.
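If you just want the shape of the method first, here is a bare-bones sketch of propensity score matching with scikit-learn; the column names and the 1-nearest-neighbor choice are illustrative, not taken from the post:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def psm_att(df, x_cols, treat, outcome):
    """Estimate the ATT from a pandas DataFrame via 1-NN propensity matching."""
    # 1. Propensity score: estimated probability of treatment given covariates.
    model = LogisticRegression(max_iter=1000).fit(df[x_cols], df[treat])
    df = df.assign(ps=model.predict_proba(df[x_cols])[:, 1])
    treated, control = df[df[treat] == 1], df[df[treat] == 0]
    # 2. Match each treated unit to the control unit with the closest score.
    nn = NearestNeighbors(n_neighbors=1).fit(control[["ps"]])
    _, idx = nn.kneighbors(treated[["ps"]])
    # 3. ATT: mean outcome gap between treated units and their matched controls.
    return treated[outcome].mean() - control.iloc[idx.ravel()][outcome].mean()
```

A real analysis would also check overlap and post-matching covariate balance, which is the point of the method.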
Data Wrangling Functions
This repository contains examples of packages::functions() I commonly use when wrangling education research data. I am building this for myself, my team, and anyone else who comes across this site, to use as a reference for data cleaning projects…
Controlling Tail Risk in Online Ski-Rental
The classical ski-rental problem admits a textbook 2-competitive deterministic algorithm, and a simple randomized algorithm that is e/(e−1)-competitive in expectation…We ask what happens to the optimal solution if we insist that the tail risk, i.e., the chance of the competitive ratio exceeding a specific value, is bounded by some constant δ. We find that this additional modification significantly changes the structure of the optimal solution. The probability of purchasing skis on a given day becomes non-monotone, discontinuous, and arbitrarily large (for sufficiently small tail risk δ and large purchase cost n)…
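For readers who haven't seen the classical result: with rent 1 per day and purchase cost n, the textbook deterministic algorithm rents through day n−1 and buys on day n. Checking both cases against the optimal offline cost gives the ratio:

```latex
% t = (unknown) number of skiing days; OPT = min(t, n).
\mathrm{ALG}(t) =
\begin{cases}
t, & t < n \quad (\text{rented every day; equals OPT}) \\
(n-1) + n = 2n - 1, & t \ge n \quad (\text{here } \mathrm{OPT} = n)
\end{cases}
\qquad
\Rightarrow \quad \frac{\mathrm{ALG}}{\mathrm{OPT}} \le 2 - \frac{1}{n}
```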
Viberary - Tired of bad genre-based book recommendations?
Viberary is a side project that I created to find books by vibe. I built it to satisfy an itch to do ML side projects and to navigate the current boundary between search and recommendations. It's a production-grade complement to my recent deep dive into embeddings…
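The post walks through the actual production stack; for intuition only, here is the bare-bones version of "search by vibe" with off-the-shelf sentence embeddings (my sketch, not Viberary's implementation):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

# Illustrative book blurbs, not Viberary's data.
books = [
    "A quiet, melancholy novel about memory and loss in postwar Japan",
    "A fast-paced heist thriller with witty banter and double-crosses",
    "A sweeping fantasy epic about warring noble houses",
]
query = "something dreamy and bittersweet"

# Embed the query and candidates, then rank by cosine similarity.
book_emb = model.encode(books, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, book_emb)[0]
print(books[int(scores.argmax())])  # -> the postwar Japan novel
```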
I’m losing my voice due to illness, and I’m looking for an ML/AI solution
I’m losing my voice due to an illness (Parkinson’s disease), and I would like to create an AI voice using recordings from 10 years ago. I used to be a prolific podcaster, and I have about 50 episodes of podcasts that I can use as input. Is this possible? What service or software can I use? My voice is beyond repair since Parkinson’s is a progressive disease. An AI voice would allow me to work and would open up new doors for me….
Growing Bonsai Networks with RNNs
This writeup introduces what I'm calling Bonsai Networks - extremely sparse computational graphs produced by training and pruning RNNs. They provide an interpretable view into the solutions learned by networks for simple logic problems, breaking out of the black box neural networks typically reside in. I give an overview of the process I use to create these networks which includes several custom neural network components and a training pipeline implemented from scratch in Tinygrad. I also include many interactive visualizations of the generated graphs and use them to reverse engineer some interesting solutions they learned for a variety of logic problems…
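The post's pipeline is custom-built in Tinygrad; as a generic illustration of the pruning half of the idea, magnitude pruning just zeroes the smallest weights (a sketch of the general technique, not the author's method):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries until `sparsity` fraction are zero."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold at the k-th smallest absolute value.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.random.default_rng(0).normal(size=(64, 64))
w_sparse = magnitude_prune(w, sparsity=0.95)  # keep only ~5% of weights
print(f"nonzero fraction: {(w_sparse != 0).mean():.3f}")
```

At extreme sparsity levels the surviving weights form a graph small enough to read by hand, which is what makes the "bonsai" framing work.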
SynJax: Structured Probability Distributions for JAX
Today we are open-sourcing SynJax, a JAX library for efficient probabilistic modeling of structured objects (sequences, segmentations, alignments, trees...). It can compute everything you would expect from a probability distribution: argmax, samples, marginals, entropy…
MiniChain - A tiny library for coding with large language models
There are several very popular libraries for prompt chaining, notably LangChain, Promptify, and GPTIndex. These libraries are useful, but they are extremely large and complex. MiniChain aims to implement the core prompt chaining functionality in a tiny, digestible library…
Barclays Capital Inc. seeks Assistant Vice President, Data Scientist in New York, NY (multiple positions available):
* Write Extract, Transform, Load (ETL) code to read from our data sources, and load data for analysis using source control (git; bitbucket) to version-control code contributions
* Encapsulate analysis code built on the ETL code to make work reusable by the team
* Automate analysis processes using Spark, Python, pandas, spaCy, TensorFlow, Keras, PyTorch, and other open-source large-scale computing and statistical software
* Create and maintain a Reddit data pipeline, with ad hoc maintenance to serve requests
* Review other coworkers’ contributions to our shared repository
* Telecommuting benefits permitted
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Mathematics of Machine Learning Summer School
Recently, deep neural networks have demonstrated stunning empirical results across many applications like vision, natural language processing, and reinforcement learning. The field is now booming with new mathematical problems, and in particular, the challenge of providing theoretical foundations for deep learning techniques is still largely open. On the other hand, learning theory already has a rich history, with many beautiful connections to various areas of mathematics (e.g., probability theory, high dimensional geometry, game theory). The purpose of the summer school is to introduce graduate students (and advanced undergraduates) to these foundational results, as well as to expose them to the new and exciting modern challenges that arise in deep learning and reinforcement learning…
I recorded a PySpark Big Data Course (Python API of Apache Spark) and uploaded it on YouTube
Hello everyone, I uploaded a PySpark course to my YouTube channel. I tried to cover a wide range of topics, including SparkContext and SparkSession, Resilient Distributed Datasets (RDDs), the DataFrame and Dataset APIs, Data Cleaning and Preprocessing, Exploratory Data Analysis, Data Transformation and Manipulation, Group By and Window Functions, User Defined Functions, and Machine Learning with Spark MLlib…
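If you want a quick feel for the API before committing to a full course, most of those topics build on a handful of patterns (a sketch, not taken from the course itself):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("demo").getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["group", "value"])

# Group by + aggregation.
df.groupBy("group").agg(F.avg("value").alias("avg_value")).show()

# Window function: running total within each group.
w = Window.partitionBy("group").orderBy("value")
df.withColumn("running_total", F.sum("value").over(w)).show()

# User-defined function (UDF) applied column-wise.
double = F.udf(lambda x: x * 2, "int")
df.withColumn("doubled", double("value")).show()
```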
The Little Book of Deep Learning
This is a short introduction to deep learning for readers with a STEM background, originally designed to be read on a phone screen. It is distributed under the Creative Commons BY-NC-SA 4.0 International License, and was downloaded close to 250,000 times in the month following its public release…
Find last week's issue #506 here.
Thank you for joining us this week :)
All our best,
Hannah & Sebastian
P.S.
If you found this newsletter helpful to your job, please consider becoming a paid subscriber here:
https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.