Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If this newsletter is helpful to your job, please become a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week.
Geoscientists lick rocks [Twitter / X]
I’m feeling weirdly hurt by the viral tweet mocking geoscientists for licking rocks. I get that we’re a bit weird even for scientists and get a bit more blunt with our toolset, but licking rocks is a real strategy. Taste & texture are diagnostic…You don’t NEED to lick rocks; it’s just faster & easier…
The 'eu' in eucatastrophe – Why SciPy builds for Python 3.12 on Windows are a minor miracle
You've probably heard already that Python 3.12 was released recently…behind the ordinary-seeming "SciPy released builds compatible with Python 3.12" hides an extraordinary story worth telling, because of how several unrelated, multi-decade-long timelines happened to clash in a way that could have very easily led to no Python 3.12-compatible releases for a long time…We'll briefly shed some light on the following:
Why is Fortran still used in so many places?
How is that relevant to Python?
Past struggles of NumPy/SciPy with vanilla Python packaging.
What role conda-forge plays in this context….
Why use pytorch/jax at all? Why don't people just write CUDA programs? [Twitter / X]
Response from Mark Saroufim from PyTorch: This is a good question; it gets to the root of the tradeoff between performance and flexibility, so how do PyTorch folks think about this? Long answer: If we're in a world where a single base model can be fine-tuned for all tasks and we're fairly certain that this base model won't change, then indeed writing new PyTorch models from scratch doesn't make much sense for most people. You could presumably train some important models from scratch in PyTorch and then have people run inference by loading the state_dict and writing some hyper-efficient C++ code. However, the single base model hypothesis seems empirically false?…
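For readers unfamiliar with the pattern Mark refers to, here is a minimal sketch of the "train in PyTorch, then run inference from a saved state_dict" workflow. The model class and file name are hypothetical placeholders, not anything from the thread.

```python
# Minimal sketch of loading a saved state_dict and running inference.
# TinyClassifier and "tiny_classifier.pt" are made-up placeholders.
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, in_dim: int = 128, n_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = TinyClassifier()
# After training elsewhere, weights would be saved with
#   torch.save(model.state_dict(), "tiny_classifier.pt")
# and reloaded here:
#   model.load_state_dict(torch.load("tiny_classifier.pt", map_location="cpu"))
model.eval()  # disable dropout/batch-norm updates for inference

with torch.no_grad():
    logits = model(torch.randn(4, 128))  # batch of 4 dummy inputs
    preds = logits.argmax(dim=-1)
    print(preds)
```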
Hex is a collaborative workspace for data science and analytics. Now data teams can run their queries, notebooks, and interactive reports — all in one place.
Hex has Magical AI tools that can generate queries and code, create visualizations, and even kickstart a whole analysis, all from natural language prompts, allowing teams to accelerate work and focus on what matters.
Join hundreds of data teams like Notion, AllTrails, Loom, Brex, and Algolia using Hex every day to make their work more impactful. Sign up today at hex.tech/datascienceweekly to get a 30-day free trial of the Hex Team plan!
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Guesses as compressed probability distributions
People often make judgments about uncertain facts and events, for example 'Germany will win the World Cup'. Here we present a rational analysis of these judgments: we argue that a guess functions as a compressed encoding of the speaker's subjective probability distribution over relevant possibilities. So, a statement like 'X will happen' encodes information not only about the probability of X but also, implicitly, about the probability of other possible outcomes…
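A toy sketch of the framing in the abstract (not the paper's actual model): treat the guess as a one-item, lossy summary of a subjective distribution and note what a listener can still infer about the unmentioned alternatives. All numbers below are made up.

```python
# Toy illustration only: a guess as a lossy summary of a subjective
# probability distribution. The beliefs below are invented.
subjective = {
    "Germany": 0.30,
    "Brazil": 0.25,
    "France": 0.25,
    "Argentina": 0.20,
}

guess = max(subjective, key=subjective.get)
print(f"Guess: '{guess} will win'")

# Hearing only the guess, a listener can still infer that no alternative
# was judged more probable than the guessed outcome.
p_guess = subjective[guess]
print(all(p <= p_guess for team, p in subjective.items() if team != guess))
```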
Deploying LLMs in Production: Lessons Learned
In this live-streamed recording of Vanishing Gradients, Hamel Husain, founder of Parlance Labs, a research and consultancy focused on LLMs, joins your host Hugo Bowne-Anderson to talk about generative AI, large language models, the business value they can generate, and how to get started. We'll delve into:
Where Hamel is seeing the most business interest in LLMs (spoiler: the answer isn't tech)
Common misconceptions about LLMs
The skills you need to work with LLMs and GenAI models
Tools and techniques, such as fine-tuning, RAG, LoRA, hardware, and more
Vendor APIs vs OSS models…
Why does so much government tech investment deliver so little?
The growing interest in AI from governments is a welcome development, but we believe that excitement should be tempered with discipline. We look at how government technology investment often fails to accomplish its goals. This usually stems from a lack of clear rationale for government action at all, inadequate funding, and the inherent limitations of top-down approaches to technological development. As a result, we see a combination of small, low value grants at one end and grandiose “grands projets” on the other. We see this pattern across EU-wide efforts and increasingly on a smaller scale in the UK. We propose tests that any serious government investment in technology should pass and provide two examples that meet the bar…
CausalPy
A Python package focussing on causal inference in quasi-experimental settings. The package allows for sophisticated Bayesian model fitting methods to be used in addition to traditional OLS…
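To give a flavour of the quasi-experimental analyses CausalPy targets, here is a difference-in-differences sketch using plain OLS via statsmodels. This is not CausalPy's own API (see its docs for that); it is just an illustration, on synthetic data, of the kind of estimate involved.

```python
# Difference-in-differences with plain OLS, as a stand-in illustration of
# quasi-experimental causal inference. Not CausalPy's API; data is synthetic.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),  # treatment-group indicator
    "post": rng.integers(0, 2, n),     # pre/post-period indicator
})
true_effect = 2.0
df["y"] = (
    1.0 + 0.5 * df["treated"] + 1.5 * df["post"]
    + true_effect * df["treated"] * df["post"]
    + rng.normal(0, 1, n)
)

# The coefficient on treated:post is the difference-in-differences estimate.
model = smf.ols("y ~ treated * post", data=df).fit()
print(model.params["treated:post"])
```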
AI x Labor Lit Review
This document summarizes Vincent’s explorations of the literature surrounding AI/automation and labor markets. The primary focus is on understanding economic trends and evidence, not policy proposals. If you’re looking for reading recs, I think the most interesting literature is everything in the “Read” section and the first two references in the “Might Read Later” section…
How Hessian Structure Explains Mysteries in Sharpness Regularization
Recent work has shown that first-order methods like Sharpness-Aware Minimization (SAM), which implicitly penalize second-order information, can improve generalization in deep learning. Seemingly similar methods like weight noise and gradient penalties often fail to provide such benefits. We show that these differences can be explained by the structure of the Hessian of the loss…
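As a reminder of what SAM actually does, here is a minimal single-step sketch in PyTorch, written from the published algorithm (ascend to w + rho * g/||g||, take the gradient there, then descend). The model, data, and hyperparameters are toy placeholders, not from this paper.

```python
# One Sharpness-Aware Minimization (SAM) step on a toy model. Model, data,
# lr and rho are illustrative placeholders.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)
lr, rho = 0.01, 0.05

# 1) Gradient at the current weights.
loss = loss_fn(model(x), y)
grads = torch.autograd.grad(loss, list(model.parameters()))
grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))

# 2) Ascend to the nearby "sharp" point w + eps, eps = rho * g / ||g||.
eps = [rho * g / (grad_norm + 1e-12) for g in grads]
with torch.no_grad():
    for p, e in zip(model.parameters(), eps):
        p.add_(e)

# 3) Gradient at the perturbed point, then undo the perturbation and descend.
loss_perturbed = loss_fn(model(x), y)
grads_sam = torch.autograd.grad(loss_perturbed, list(model.parameters()))
with torch.no_grad():
    for p, e, g in zip(model.parameters(), eps, grads_sam):
        p.sub_(e)       # back to the original weights
        p.sub_(lr * g)  # SGD step using the SAM gradient
```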
What kind of mathematical foundations are required for conducting research across the vast specialised branches of AI/ML/DL? [Reddit]
The absolute basic mathematics required to understand basic ML/DL is calculus, linear algebra, probability, and some convex optimisation. We are all aware of that. But ML and DL have become a vast field, in both breadth and depth. A single person can't understand the field entirely. There are specialisations, sub-specialisations, and subdivisions beyond those. If you work in a branch of ML/DL research where other math fundamentals are needed to understand research papers and do innovative research, can you mention your field of work and the math fundamentals required to gain entry into it?…
Fast and forward stable randomized algorithms for linear least-squares problems
Iterative sketching and sketch-and-precondition are randomized algorithms used for solving overdetermined linear least-squares problems. When implemented in exact arithmetic, these algorithms produce high-accuracy solutions to least-squares problems faster than standard direct methods based on QR factorization…This paper proves that iterative sketching, appropriately implemented, is forward stable. Numerical experiments confirm the theoretical findings, demonstrating that iterative sketching is stable and faster than QR-based solvers for large problem instances…
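A compact NumPy sketch of the generic iterative-sketching scheme for overdetermined least squares (sketch A once, use the R factor of the sketched matrix as a preconditioner, then refine iteratively). The Gaussian embedding, sketch size, and iteration count are conservative illustrative choices, not the paper's recommendations.

```python
# Iterative sketching for min ||Ax - b||, generic scheme on a toy problem.
import numpy as np

rng = np.random.default_rng(0)
m, n = 5000, 50
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Gaussian sketch with a generously sized embedding dimension d.
d = 20 * n
S = rng.standard_normal((d, m)) / np.sqrt(d)
_, R = np.linalg.qr(S @ A)            # R factor of the sketched matrix

# Refinement: x_{k+1} = x_k + R^{-1} R^{-T} A^T (b - A x_k)
x = np.zeros(n)
for _ in range(40):
    r = A.T @ (b - A @ x)
    z = np.linalg.solve(R.T, r)       # solve with R^T
    x = x + np.linalg.solve(R, z)     # solve with R

print(np.linalg.norm(A.T @ (b - A @ x)))                      # residual check
print(np.linalg.norm(x - np.linalg.lstsq(A, b, rcond=None)[0]))  # vs. QR-based solve
```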
NOIR: Neural Signal Operated Intelligent Robots for Everyday Activities
We present Neural Signal Operated Intelligent Robots (NOIR), a general-purpose, intelligent brain-robot interface system that enables humans to command robots to perform everyday activities through brain signals. Through this interface, humans communicate their intended objects of interest and actions to the robots using electroencephalography (EEG). Our novel system demonstrates success in an expansive array of 20 challenging, everyday household activities, including cooking, cleaning, personal care, and entertainment…
UK unites with global partners to accelerate development using AI
Along with Canada, the Bill and Melinda Gates Foundation, the USA, and partners in Africa, the UK is helping to fund an £80 million ($100 million) boost in AI programming to combat inequality and boost prosperity on the continent.
The goals of the UK government’s AI for Development programme include:
unlocking the benefits of AI to the 700 million people who speak 46 African languages
making 5 or more African countries globally influential in the worldwide conversation on AI, including in using AI to help achieve the Sustainable Development Goals
creating or scaling up at least 8 responsible AI research labs at African universities
helping at least 10 countries create sound regulatory frameworks for responsible, equitable and safe AI
bringing down the barriers to entry for African AI innovators, working with the private sector
We are Digital Science and we are advancing the research ecosystem.
Dimensions, part of the Digital Science family, is the world’s largest linked research information dataset, covering millions of research publications and connected by more than 1.3 billion citations. We are shaping the future of research and are looking for a Data Scientist to join the team.
The role will touch all aspects of data analysis & delivery, from managing specialised analytic infrastructure resources in secure environments to data collection/wrangling, visualization, and the development & delivery of interactive dashboards and other applications. You will work closely with team members with a diversity of intellectual and professional backgrounds to harness our unique data and product capabilities to address our customers' critical needs.
Location is Fully Remote. Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Is SELECT DISTINCT really that bad?
I have been pushing back on dbt (SQL on Snowflake) pull requests that use SELECT DISTINCT and instead ask people to create a surrogate key and aggregate/de-dupe explicitly on the keys they want to define uniqueness by. This is a lot more work. Yet, we all know the urge to SELECT DISTINCT "just in case" to avoid the dreaded duplicates test error or a stakeholder finding duplicates.
I find myself wondering lately whether my blanket rule against SELECT DISTINCT, and blocking people's work because of it, is outdated and misguided. Am I asking for extra work without enough evidence of the value added or risk mitigated?…
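For concreteness, here is a small illustration (in DuckDB rather than Snowflake/dbt) of the two approaches in the question: a blanket SELECT DISTINCT, which only removes rows that are identical in every column, versus de-duplicating explicitly on a declared uniqueness key. The table and column names are made up.

```python
# Blanket SELECT DISTINCT vs. explicit de-dupe on a declared key, in DuckDB.
# Table and columns are invented for illustration.
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE orders AS
    SELECT * FROM (VALUES
        (1, 'a', '2023-01-01'::DATE),
        (1, 'a', '2023-01-02'::DATE),   -- same order_id, later load
        (2, 'b', '2023-01-01'::DATE)
    ) AS t(order_id, customer_id, loaded_at)
""")

# Blanket de-dupe: both order_id = 1 rows survive because loaded_at differs.
print(con.execute("SELECT DISTINCT * FROM orders").fetchall())

# Explicit de-dupe: declare the grain (order_id) and how to pick a survivor.
print(con.execute("""
    SELECT order_id, customer_id, loaded_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id ORDER BY loaded_at DESC
               ) AS rn
        FROM orders
    ) AS ranked
    WHERE rn = 1
""").fetchall())
```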
Mathematical Introduction to Deep Learning: Methods, Implementations, and Theory
This book aims to provide an introduction to the topic of deep learning algorithms. We review essential components of deep learning algorithms in full mathematical detail including different artificial neural network (ANN) architectures (such as fully-connected feedforward ANNs, convolutional ANNs, recurrent ANNs, residual ANNs, and ANNs with batch normalization) and different optimization algorithms (such as the basic stochastic gradient descent (SGD) method, accelerated methods, and adaptive methods). We also cover several theoretical aspects of deep learning algorithms such as approximation capacities of ANNs (including a calculus for ANNs), optimization theory (including Kurdyka-Łojasiewicz inequalities), and generalization errors…
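For readers new to the area, here is a minimal NumPy example of the most basic objects the book covers: a fully-connected feedforward ANN with one hidden layer, trained with plain SGD on a toy regression problem. The architecture and step size are arbitrary illustrative choices, not taken from the book.

```python
# One-hidden-layer feedforward ANN trained with plain SGD (toy regression).
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(256, 1))
y = np.sin(np.pi * X)                      # target function

W1 = rng.standard_normal((1, 16)) * 0.5    # input -> hidden
b1 = np.zeros(16)
W2 = rng.standard_normal((16, 1)) * 0.5    # hidden -> output
b2 = np.zeros(1)
lr = 0.05

for step in range(2000):
    i = rng.integers(0, len(X), size=32)   # mini-batch indices
    x_b, y_b = X[i], y[i]

    # Forward pass with ReLU activation.
    h = np.maximum(0.0, x_b @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y_b                       # gradient of 0.5 * squared error

    # Backward pass (chain rule by hand), then a plain SGD update.
    gW2 = h.T @ err / len(i)
    gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (h > 0)
    gW1 = x_b.T @ dh / len(i)
    gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

# Final mean squared error on the training data.
print(float(np.mean((np.maximum(0.0, X @ W1 + b1) @ W2 + b2 - y) ** 2)))
```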
How do you figure out which machine learning algorithm to apply? [Reddit]
I'm still learning and I'm currently working on an assignment where I have to figure out which features of a product lead to higher sales. I think both a linear regression and a Random Forest Regressor can be applicable here but I'm not sure which one would yield the best results. How would you approach this problem? More broadly, what's your approach to making the decision on which algorithm to use in your day-to-day job?…
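One hedged way to approach the question on the poster's own data: fit both candidates, compare cross-validated error, and read off standardized linear-regression coefficients next to random-forest feature importances. The data and column names below are synthetic stand-ins.

```python
# Compare linear regression and a random forest for "which features drive sales".
# Synthetic data; column names are invented.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["price", "rating", "ads"])
sales = 5 - 2 * X["price"] + 3 * X["rating"] + rng.normal(0, 1, 500)

linear = make_pipeline(StandardScaler(), LinearRegression())
forest = RandomForestRegressor(n_estimators=200, random_state=0)

# Cross-validated fit quality as a first sanity check.
for name, model in [("linear", linear), ("forest", forest)]:
    score = cross_val_score(model, X, sales, cv=5, scoring="r2").mean()
    print(name, round(score, 3))

linear.fit(X, sales)
forest.fit(X, sales)
print(dict(zip(X.columns, linear[-1].coef_.round(2))))            # standardized coefficients
print(dict(zip(X.columns, forest.feature_importances_.round(2))))  # impurity-based importances
```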
* Based on unique clicks.
** Find last week's issue #519 here.
Looking to hire? Hit reply to this email and let us know.
Looking to get a job? Check out our “Get A Data Science Job” Course
A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself to ~60,000 subscribers by sponsoring this newsletter.
Thank you for joining us this week :)
All our best,
Hannah & Sebastian
P.S. Was today’s newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.