Hello and thank you for tuning in to Issue #507!
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
Seeing this for the first time? Subscribe here:
Want to support us? Become a paid subscriber here.
If you don’t find this email useful, please unsubscribe here.
And now, let's dive into some interesting links from this week :)
What coal and Jevons’ paradox tell us about AI and data
What happens when a computer can write SQL or Python better than any human?…Do all the data jobs go away?…This is the basic form of a complex question that people are asking in the Age of LLMs; as it turns out, a lot of data practitioners’ tasks are both highly specialized (e.g., require arcane knowledge of pandas or SQL syntax) and also pretty rote (e.g., writing out that same syntax over and over). And those happen to be the kinds of tasks that LLMs seem really well suited for. So, it's easy to imagine the Executives gearing up to replace all the data people with computers….The future is weird and uncertain, but I really don't think that's what's going to happen…and to understand why, let’s take a journey back to the 19th century…and look at Jevons’ paradox…
The Direct Approach
Empirical scaling laws can help predict a model's cross-entropy loss as a function of training inputs such as compute and data. However, in order to predict when AI will achieve some subjective level of performance, it is necessary to devise a way of interpreting that cross-entropy loss. This blog post provides a discussion of one such theoretical method, which we call the Direct Approach.
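For reference, the kind of empirical scaling law being interpreted here is the parametric form popularized by Hoffmann et al. (2022); the notation below is mine, not necessarily the post's:

```latex
% Chinchilla-style parametric scaling law (Hoffmann et al., 2022):
% N = parameter count, D = training tokens, E = irreducible loss,
% A, B, alpha, beta = empirically fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

The Direct Approach is about converting a predicted value of L into a statement about subjective performance, which the raw loss alone does not give you.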
Do Machine Learning Models Memorize or Generalize?
In 2021, researchers made a striking discovery while training a series of tiny models on toy tasks. They found a set of models that suddenly flipped from memorizing their training data to correctly generalizing on unseen inputs after training for much longer. This phenomenon – where generalization seems to happen abruptly and long after fitting the training data – is called grokking and has sparked a flurry of interest…In this article we’ll examine the training dynamics of a tiny model and reverse engineer the solution it finds – and in the process provide an illustration of the exciting emerging field of mechanistic interpretability. While it isn’t yet clear how to apply these techniques to today’s largest models, starting small makes it easier to develop intuitions as we progress towards answering these critical questions about large language models…
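The canonical toy task in this line of work is modular arithmetic (from the original grokking paper by Power et al.). A minimal sketch of that setup, with illustrative values rather than anything taken from the article:

```python
import itertools
import random

# Toy grokking-style dataset: learn (a + b) mod p from examples.
# p and the train fraction are illustrative choices, not from the article.
p = 113
pairs = list(itertools.product(range(p), repeat=2))
random.seed(0)
random.shuffle(pairs)

data = [((a, b), (a + b) % p) for a, b in pairs]
split = int(0.3 * len(data))  # small train fractions are where grokking shows up
train, test = data[:split], data[split:]

# A tiny model trained on `train` can first memorize (high test error),
# then, much later in training, abruptly generalize to `test`.
```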
Accelerate your success with AE's elite team of experts!
🚀 Get ahead with swift development of Minimum Viable Products (MVPs).
🚀 Lead the way in innovation with Digital Transformation Initiatives.
🚀 Boost your ROI with tailored AI/ML solutions.
Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Llama from scratch (or how to implement a paper without crying)
I want to provide some tips from my experience implementing a paper. I'm going to cover implementing a dramatically scaled-down version of Llama, trained on TinyShakespeare. This post is heavily inspired by Karpathy's Makemore series, which I highly recommend…
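To give a flavor of the components involved: one building block any Llama reimplementation needs is RMSNorm, Llama's replacement for LayerNorm. A minimal PyTorch sketch (mine, not the post's exact code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm as used in Llama: no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each vector by the reciprocal of its RMS over the last dimension.
        rms_inv = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms_inv)
```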
What's the point of learning ML theory if industry doesn't care (other than interviews)? [Reddit Discussion]
To understand ML theory, one needs a good hold on stats, probability, and basic algebra. Deep learning requires extensive knowledge of linear algebra. All of this takes months and months to understand. But in the end, all that matters is whether you can implement a model or not, especially today, when pre-trained (LLM) models take just a few lines of code to train. I won't say implementation is not important, but it requires much less effort to master. Why should one (and why would one) spend time learning all the math?…
Principles for Planning an Academic Workshop
Workshops are my favourite type of academic meeting. Smaller than conferences, more focused on a particular area of interest, and featuring the latest results. Planning a workshop can be a challenging endeavour. As an organizer, it is hard to foresee what makes for a good experience for the audience. I have co-organized a number of workshops of different shapes and sizes, and put a lot of thought into what makes for the best audience experience. I will share some of my thoughts and opinions on how to plan an effective and enjoyable workshop. This post will focus on single-day workshops with a combination of invited talks and contributed papers, which in particular are common at machine learning conferences...
Unlocking Insights: Estimating Causal Effect Using Propensity Score Matching
By the end of this post, you will understand how to estimate a causal effect from retrospective data, be familiar with the concept of confounders, know how to balance your data using Propensity Score Matching, and, most importantly, have fully working code to use in your own research. If you already feel comfortable with Propensity Score Matching, skip to the real-life example and use the code snippets.
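If you just want the shape of the method first, here is a bare-bones sketch of propensity score matching with scikit-learn; the column names and the 1-nearest-neighbor choice are illustrative, not taken from the post:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def psm_att(df, x_cols, treat, outcome):
    """Estimate the ATT from a pandas DataFrame via 1-NN propensity matching."""
    # 1. Propensity score: estimated probability of treatment given covariates.
    model = LogisticRegression(max_iter=1000).fit(df[x_cols], df[treat])
    df = df.assign(ps=model.predict_proba(df[x_cols])[:, 1])
    treated, control = df[df[treat] == 1], df[df[treat] == 0]
    # 2. Match each treated unit to the control unit with the closest score.
    nn = NearestNeighbors(n_neighbors=1).fit(control[["ps"]])
    _, idx = nn.kneighbors(treated[["ps"]])
    # 3. ATT: mean outcome gap between treated units and their matched controls.
    return treated[outcome].mean() - control.iloc[idx.ravel()][outcome].mean()
```

A real analysis would also check overlap and post-matching covariate balance, which is the point of the method.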
Data Wrangling Functions
This repository contains examples of packages::functions() I commonly use when wrangling education research data. I am building this for myself, my team, and anyone else who comes across this site, to use as a reference for data cleaning projects…
Controlling Tail Risk in Online Ski-Rental
The classical ski-rental problem admits a textbook 2-competitive deterministic algorithm, and a simple randomized algorithm that is e/(e−1)-competitive in expectation…We ask what happens to the optimal solution if we insist that the tail risk, i.e., the chance of the competitive ratio exceeding a specific value, is bounded by some constant δ. We find that this additional modification significantly changes the structure of the optimal solution. The probability of purchasing skis on a given day becomes non-monotone, discontinuous, and arbitrarily large (for sufficiently small tail risk δ and large purchase cost n)…
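For readers who haven't seen the classical result: with rent 1 per day and purchase cost n, the textbook deterministic algorithm rents through day n−1 and buys on day n. Checking both cases against the optimal offline cost gives the ratio:

```latex
% t = (unknown) number of skiing days; OPT = min(t, n).
\mathrm{ALG}(t) =
\begin{cases}
t, & t < n \quad (\text{rented every day; equals OPT}) \\
(n-1) + n = 2n - 1, & t \ge n \quad (\text{here } \mathrm{OPT} = n)
\end{cases}
\qquad
\Rightarrow \quad \frac{\mathrm{ALG}}{\mathrm{OPT}} \le 2 - \frac{1}{n}
```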
Viberary - Tired of bad genre-based book recommendations?
Viberary is a side project that I created to find books by vibe. I built it to satisfy an itch to do ML side projects and to navigate the current boundary between search and recommendations. It's a production-grade complement to my recent deep dive into embeddings…
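The post walks through the actual production stack; for intuition only, here is the bare-bones version of "search by vibe" with off-the-shelf sentence embeddings (my sketch, not Viberary's implementation):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

# Illustrative book blurbs, not Viberary's data.
books = [
    "A quiet, melancholy novel about memory and loss in postwar Japan",
    "A fast-paced heist thriller with witty banter and double-crosses",
    "A sweeping fantasy epic about warring noble houses",
]
query = "something dreamy and bittersweet"

# Embed the query and candidates, then rank by cosine similarity.
book_emb = model.encode(books, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, book_emb)[0]
print(books[int(scores.argmax())])  # -> the postwar Japan novel
```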
I’m losing my voice due to illness, and I’m looking for an ML/AI solution
I’m losing my voice due to an illness (Parkinson’s disease), and I would like to create an AI voice using recordings from 10 years ago. I used to be a prolific podcaster, and I have about 50 episodes of podcasts that I can use as input. Is this possible? What service or software can I use? My voice is beyond repair since Parkinson’s is a progressive disease. An AI voice would allow me to work and would open up new doors for me….
Growing Bonsai Networks with RNNs
This writeup introduces what I'm calling Bonsai Networks - extremely sparse computational graphs produced by training and pruning RNNs. They provide an interpretable view into the solutions learned by networks for simple logic problems, breaking out of the black box neural networks typically reside in. I give an overview of the process I use to create these networks which includes several custom neural network components and a training pipeline implemented from scratch in Tinygrad. I also include many interactive visualizations of the generated graphs and use them to reverse engineer some interesting solutions they learned for a variety of logic problems…
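The post's pipeline is custom-built in Tinygrad; as a generic illustration of the pruning half of the idea, magnitude pruning just zeroes the smallest weights (a sketch of the general technique, not the author's method):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries until `sparsity` fraction are zero."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold at the k-th smallest absolute value.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.random.default_rng(0).normal(size=(64, 64))
w_sparse = magnitude_prune(w, sparsity=0.95)  # keep only ~5% of weights
print(f"nonzero fraction: {(w_sparse != 0).mean():.3f}")
```

At extreme sparsity levels the surviving weights form a graph small enough to read by hand, which is what makes the "bonsai" framing work.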
SynJax: Structured Probability Distributions for JAX
Today we are open-sourcing SynJax, a JAX library for efficient probabilistic modeling of structured objects (sequences, segmentations, alignments, trees...). It can compute everything you would expect from a probability distribution: argmax, samples, marginals, entropy…
MiniChain - A tiny library for coding with large language models
There are several very popular libraries for prompt chaining, notably LangChain, Promptify, and GPTIndex. These libraries are useful, but they are extremely large and complex. MiniChain aims to implement the core prompt chaining functionality in a tiny, digestible library…
Barclays Capital Inc. seeks Assistant Vice President, Data Scientist in New York, NY (multiple positions available):
* Write Extract, Transform, Load (ETL) code to read from our data sources, and load data for analysis using source control (git; bitbucket) to version-control code contributions
* Encapsulate analysis code built on the ETL code to make work reusable by the team
* Automate analysis processes using Spark, Python, pandas, spaCy, TensorFlow, Keras, PyTorch, and other open-source large-scale computing and statistical software
* Create and maintain a Reddit data pipeline, with ad hoc maintenance to serve requests
* Review other coworkers’ contributions to our shared repository
* Telecommuting benefits permitted
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Mathematics of Machine Learning Summer School
Recently, deep neural networks have demonstrated stunning empirical results across many applications like vision, natural language processing, and reinforcement learning. The field is now booming with new mathematical problems, and in particular, the challenge of providing theoretical foundations for deep learning techniques is still largely open. On the other hand, learning theory already has a rich history, with many beautiful connections to various areas of mathematics (e.g., probability theory, high dimensional geometry, game theory). The purpose of the summer school is to introduce graduate students (and advanced undergraduates) to these foundational results, as well as to expose them to the new and exciting modern challenges that arise in deep learning and reinforcement learning…
I recorded a PySpark Big Data Course (Python API of Apache Spark) and uploaded it on YouTube
Hello everyone, I uploaded a PySpark course to my YouTube channel. I tried to cover a wide range of topics, including SparkContext and SparkSession, Resilient Distributed Datasets (RDDs), the DataFrame and Dataset APIs, Data Cleaning and Preprocessing, Exploratory Data Analysis, Data Transformation and Manipulation, Group By and Window Functions, User Defined Functions, and Machine Learning with Spark MLlib…
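If you want a quick feel for the API before committing to a full course, most of those topics build on a handful of patterns (a sketch, not taken from the course itself):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("demo").getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["group", "value"])

# Group by + aggregation.
df.groupBy("group").agg(F.avg("value").alias("avg_value")).show()

# Window function: running total within each group.
w = Window.partitionBy("group").orderBy("value")
df.withColumn("running_total", F.sum("value").over(w)).show()

# User-defined function (UDF) applied column-wise.
double = F.udf(lambda x: x * 2, "int")
df.withColumn("doubled", double("value")).show()
```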
The Little Book of Deep Learning
This is a short introduction to deep learning for readers with a STEM background, originally designed to be read on a phone screen. It is distributed under the Creative Commons BY-NC-SA 4.0 International License, and was downloaded close to 250,000 times in the month following its public release…
Find last week's issue #506 here.
Thank you for joining us this week :)
All our best,
Hannah & Sebastian
P.S.
If you found this newsletter helpful to your job, please consider becoming a paid subscriber here:
https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.