Hello and thank you for tuning in to Issue #513!
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
Seeing this for the first time? Subscribe here:
If you find this newsletter helpful to your job, consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
If you don’t find this email useful, please unsubscribe here.
And now, let's dive into some interesting links from this week :)
The Playbook to AI and Start-Ups with Kleiner Perkins
One area of research and startups right now is how to run better and better models without upgrading hardware, since consumers don’t have the same infrastructure, says Leigh Marie Braswell, partner at Kleiner Perkins, in her discussion with Anurag Rana, senior software analyst at Bloomberg Intelligence. Their insightful talk contrasts the experience and priorities of the startup landscape with the objectives of the largest public companies in generative AI and large language models. The pair’s combined visibility into the arena touches on all aspects of software, including unstructured data, code generation, automating technical debt and infrastructure…
New Newsletter: AI Safety in China
Common perception: China doesn’t care about AI safety… Our perspective? China’s more invested in AI safety and risk mitigation than many realize… This newsletter aims to bridge the knowledge gap… The AI Safety in China newsletter will cover the latest updates in: technical safety and alignment research in China; China’s governance and policy efforts to reduce AI risk; and China’s positions on international AI governance…
Optimizing your LLM in production
This blog post goes over the most effective techniques, at the time of writing, for tackling the challenges of efficient LLM deployment:
Lower Precision
Flash Attention
Architectural Innovations…
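To see why lower precision is on that list, here is a minimal back-of-the-envelope sketch of how weight memory scales with bytes per parameter. It assumes a hypothetical 7-billion-parameter model and counts weights only (no activations, KV cache, or framework overhead); the function name is illustrative and does not come from the blog post.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes); weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    n = 7_000_000_000  # assumption: a hypothetical 7B-parameter model
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gb(n, dtype):.0f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is often the difference between a model fitting on one GPU or not.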
Interested in transitioning to data science or exploring new career paths? Take notes on the key insights into the field and its opportunities.
We will discuss:
→ Overview of the 2023 Big Tech job market, including the interview process.
→ In-depth examination of technical interviews, including algorithms, machine learning, and system/product evaluations.
→ Insights into onsite interviews, highlighting their importance.
→ Behavioral interviews and their role in assessing cultural fit.
→ Strategies and insights from real experiences for a deeper understanding of Big Tech interviews.
Join our event on October 16 at 7 pm (UTC+3). Registration is free.
Join here
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Vector Search Engine, Building an Open-Source Business, and Digital Technology Through the Lens of Language With Bob Van Luijt
This is my conversation with Bob Van Luijt, the CEO and co-founder of Weaviate, the business created around the open-source vector database Weaviate… Our wide-ranging conversation touches on his teenage years building software for fun, his education in jazz and music composition, his consultancy agency Kubrickology, his TEDx talk on digital technology through the lens of language, the founding story of Weaviate and the rise of vector search engines, the business model around open-source software, the AI-first database ecosystem, lessons learned from hiring and fundraising, and much more.
Some of you say “linear regression” like it’s a bad word [Reddit Discussion]
I don’t know about you, but I am HYPED to use linear regression and couldn’t care less for machine learning stuff more complex than ridge regression…Honestly, linear regression is the sign of a really talented data scientist. If you can get the job done with linear regression, respect on your name. Not saying you ain’t a champ if you don’t use OLS, sometimes you gotta break out a boosted model. But man, y’all should be happy to use OLS. It’s the best model ever…
Robust standard errors in mixed models
In a recent article in Multivariate Behavioral Research, we (Huang, Wiedermann, and Zhang) discuss a robust standard error that can be used with mixed models that accounts for violations of homogeneity. Note that these robust standard errors have been around for years, though they are not always provided in statistical software. These can also be computed using the CR2 package or the clubSandwich package. This page shows how to compute the traditional Liang and Zeger (1986) robust standard errors (CR0) and the CR2 estimator; see Bell and McCaffrey (2002) as well as McCaffrey, Bell, and Botts (2001) (BM and MBB)…
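For readers who have not seen the CR0 "sandwich" form, here is a minimal sketch of it. To stay dependency-free it assumes ordinary least squares with an intercept and a single predictor rather than a mixed model, and it is not the linked page's implementation; the function name and toy data are hypothetical.

```python
from collections import defaultdict


def cr0_se(x, y, cluster):
    """OLS of y on [1, x] with CR0 cluster-robust standard errors."""
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    # "Bread": (X'X)^{-1} for the design matrix with columns [1, x].
    bread = [[sxx / det, -sx / det], [-sx / det, n / det]]
    # OLS coefficients: beta = (X'X)^{-1} X'y.
    b0 = bread[0][0] * sy + bread[0][1] * sxy
    b1 = bread[1][0] * sy + bread[1][1] * sxy
    # Per-cluster score sums X_g' e_g, where e_g are the residuals.
    scores = defaultdict(lambda: [0.0, 0.0])
    for xi, yi, g in zip(x, y, cluster):
        e = yi - (b0 + b1 * xi)
        scores[g][0] += e
        scores[g][1] += e * xi
    # "Meat": sum over clusters of (X_g' e_g)(X_g' e_g)'.
    meat = [[sum(s[i] * s[j] for s in scores.values()) for j in range(2)]
            for i in range(2)]
    # Sandwich: bread @ meat @ bread.
    bm = [[sum(bread[i][k] * meat[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    v = [[sum(bm[i][k] * bread[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    # Guard against tiny negative values caused by floating-point error.
    se = tuple(max(v[i][i], 0.0) ** 0.5 for i in range(2))
    return (b0, b1), se


if __name__ == "__main__":
    # Toy data; each observation is its own cluster here.
    coefs, se = cr0_se([0, 0, 1, 1], [0, 2, 1, 5], [1, 2, 3, 4])
    print("coefficients:", coefs)
    print("CR0 standard errors:", se)
```

CR0 applies no small-sample correction, which is the gap the CR2 estimator is meant to address when there are few clusters.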
Data Analysis: SQL or Python vs point-and-click product analytics tools [Twitter / X Discussion]
Folks who use SQL or Python for data analysis, why do / don't you use point-and-click product analytics tools like Amplitude, Mixpanel, Heap, Pendo, etc? I observed a clear divide here between technical and non-technical folks in my career so far…
How to make history with LLMs & other generative models
In this post, I want to expand on some ideas that I’m particularly excited about and others that I’m less certain of reaching venture-scale as standalone businesses. I’ll also mention some companies I know that are working on each idea, but this is not meant to be a complete list (of companies or ideas- there are so many other exciting stealth companies for a potential future post…). Lastly, these are strong opinions, loosely held. I would love to be convinced to change my mind, and I believe it’s a sign of a great founder to be able to address risks, navigate the idea maze, and prove skeptics wrong…
The Overton Paradox in Three Graphs
Older people are more likely to say they are conservative…And older people believe more conservative things…But if you group people by decade of birth, most groups get more liberal as they get older…So if people get more liberal, on average, why are they more likely to say they are conservative? Now there are three ways to find out!..
Machine learning games
Welcome to Machine Learning Games, a repository containing a set of games and simulations designed to experiment with QLearning, Neuroevolution, and PoseNet…
Augmenting PostgreSQL with AI using EvaDB
In this article, we illustrate how EvaDB seamlessly integrates AI into your PostgreSQL workflows for solving complex data manipulation tasks. In particular, we demonstrate how EvaDB enables AI-powered semantic join between tables that do not directly share a column that can be joined on…
Do people not use sci-kit learn / other traditional libraries anymore? [Reddit Discussion]
Recently saw a tweet which got quite some traction talking about how many people haven't used sci-kit learn in months as data scientists…This has been replaced with PyTorch, HuggingFace, langchain, supergradients etc…This didn't really make sense to me as the tooling mentioned isn't really comparable to sci-kit learn but I'm curious and slightly worried I might be falling behind and not up to date with things so just asking if I'm just behind the curve or what you guys think/ do…
Auditing AI: How Much Access Is Needed to Audit an AI System?
Checking whether an AI system satisfies certain predetermined criteria falls under the umbrella of AI auditing. Although many agree that auditing is necessary, precisely which components of the AI system should be audited remains unclear. For instance, what can an auditor deduce with access to an AI system’s training data? In this post, we examine four types of information that an auditor could access, discussing the benefits and drawbacks of each…
AI’s $200B Question
GPU capacity is getting overbuilt. Long-term, this is good. Short-term, things could get messy…Consider the following: For every $1 spent on a GPU, roughly $1 needs to be spent on energy costs to run the GPU in a data center. So if Nvidia sells $50B in run-rate GPU revenue by the end of the year (a conservative estimate based on analyst forecasts), that implies approximately $100B in data center expenditures (to include margin)…This implies that for each year of current GPU CapEx, $200B of lifetime revenue would need to be generated by these GPUs to pay back the upfront capital investment…The important question to be asking is: How much of this CapEx build out is linked to true end-customer demand, and how much of it is being built in anticipation of future end-customer demand? This is the $200B question…
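The arithmetic in that excerpt can be retraced in a few lines. This is a rough sketch, not the article's model: the first doubling uses the excerpt's "$1 of energy per $1 of GPU" figure, while the second doubling is treated here as an assumed 50% margin for whoever sells the resulting compute. The excerpt only states the $200B result, so that parameter and the function name are assumptions.

```python
def required_lifetime_revenue(gpu_revenue_b: float,
                              energy_per_gpu_dollar: float = 1.0,
                              end_user_margin: float = 0.5) -> dict:
    """Trace GPU revenue to implied data center spend and required revenue.

    All amounts are in billions of dollars. `end_user_margin` is an
    assumption used to reproduce the excerpt's second doubling.
    """
    # Each GPU dollar is matched by roughly one dollar of energy cost.
    datacenter_spend = gpu_revenue_b * (1 + energy_per_gpu_dollar)
    # Revenue needed so that the spend is covered at the assumed margin.
    required_revenue = datacenter_spend / (1 - end_user_margin)
    return {
        "datacenter_spend_b": datacenter_spend,
        "required_revenue_b": required_revenue,
    }


if __name__ == "__main__":
    result = required_lifetime_revenue(50)  # $50B run-rate GPU revenue
    print(result)
```

Changing either default shows how sensitive the headline number is to the energy ratio and the margin assumption.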
Nintendo Technology Development (NTD): The worldwide pioneer in the creation of interactive entertainment, Nintendo Co., Ltd., of Kyoto, Japan, manufactures and markets hardware and software for its Nintendo Switch™ system and the Nintendo 3DS™ family of portable systems.
We are seeking a Sr Data Scientist to assist with the development of deep learning neural networks including, but not limited to, audio enhancement and computer vision. The role focuses on iterating over the training, quantization, and evaluation of neural networks implemented in PyTorch and/or TensorFlow.
Location is Redmond, WA, USA. Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Berkeley’s CS294-248: Topics in Database Theory
This course covers: static analysis of queries, basic query evaluation algorithms and hypertree decomposition, incremental view maintenance, worst-case-optimal joins, constraints and semantic query optimization, provenance semirings, and datalog with various extensions. These topics use concepts from logic and model theory, algorithms, graph theory, algebra, complexity theory, and probability theory. All the required theoretical background will be covered in the lectures...
Training Tiny Llamas for Fun—and Science
Exploring how SoftMax implementation can impact model performance using Karpathy's Tiny llama implementation…In this report, we're going to look at the new llama2.c repo and perform some experiments to see how softmax performs. But first, a cute mammal…
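As a reminder of what is being varied, here is a minimal sketch of a standard, numerically stable softmax in plain Python. It is not necessarily either of the variants the report compares, and the function name is illustrative.

```python
import math


def softmax(logits):
    """Standard softmax; subtracting the max avoids overflow in exp()."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(softmax([2.0, 1.0, 0.1]))
    # Large logits would overflow a naive exp(); the shifted form does not.
    print(softmax([1000.0, 1000.0]))
```

Subtracting the maximum leaves the result mathematically unchanged, because the shift cancels between numerator and denominator.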
Python Pareto Principle - what is the 20% (algos, functions, libraries) that lets you develop 80% of code related to Data Engineering? [Reddit Discussion]
Python Pareto Principle - what is the 20% (algos, functions, libraries) that lets you develop 80% of code related to Data Engineering?…
* Based on unique clicks.
** Find last week's issue #512 here.
Thank you for joining us this week :)
All our best,
Hannah & Sebastian
P.S.
If you found this newsletter helpful to your job, please consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.