Hello!
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If you don’t find this email useful, please unsubscribe here.
Is this newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
And now… let's dive into some interesting links from this week:
Building LLM-Powered Web Apps with Client-Side Technology
It’s no secret that for a long time machine learning has been mostly a Python game, but the recent surge in popularity of ChatGPT has brought many new developers into the field. With JavaScript being the most widely-used programming language, it’s no surprise that this has included many web developers, who have naturally tried to build web apps. There’s been a ton of ink spilled on building with LLMs via API calls to the likes of OpenAI, Anthropic, Google, and others, so I thought I’d try a different approach and try to build a web app using exclusively local models and technologies, preferably those that run in the browser!…
What Every Developer Should Know About GPU Computing
Most programmers have an intimate understanding of CPUs and sequential programming because they grow up writing code for the CPU, but many are less familiar with the inner workings of GPUs and what makes them so special. Over the past decade, GPUs have become incredibly important because of their pervasive use in deep learning. Today, it is essential for every software engineer to possess a basic understanding of how they work. My goal with this article is to give you that background…
Is AI alignment on track? Is it progressing... too fast?
This is not an essay bashing AI alignment. I love alignment. I spent most of 2023 writing about it and working on it, getting some neat results, e.g. a partially automated process for jailbreaking GPT-4 and Claude, and even getting an academic paper to cite my essay on jailbreaks. …all of this in the breaks from yelling at AI researchers at parties about them not doing enough about alignment. Sorry kipply and Jacob! I wish this was a silly joke, but it’s not. Why else would I be sitting in a tiny dark room on a Saturday evening writing this essay and not at a party?…
Hex is a collaborative workspace for data science and analytics. Now data teams can run their queries, notebooks, and interactive reports — all in one place.
Hex has Magical AI tools that can generate queries and code, create visualizations, and even kickstart a whole analysis, all from natural language prompts, allowing teams to accelerate work and focus on what matters.
Join hundreds of data teams like Notion, AllTrails, Loom, Brex, and Algolia using Hex every day to make their work more impactful. Sign up today at hex.tech/datascienceweekly to get a 30-day free trial of the Hex Team plan!
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Joining CSV Data Without SQL: An IP Geolocation Use Case
Given an IP address associated with some network traffic, can we deduce from where on earth the traffic originated? There’s plenty of software and services out there to help provide this information, but a familiar one is the GeoLite2 data from MaxMind. Not only is it free, it’s provided in the form of downloadable data sets. This allows us to pick it apart in interesting ways…
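As a rough illustration of the kind of join the title refers to (not the article's own method), the sketch below matches IP addresses to CIDR network blocks using only Python's standard library. The column names `network` and `country`, the helper names, and the sample rows are all made up for this example; the real GeoLite2 CSV files have their own schema.

```python
import csv
import io
import ipaddress

# Hypothetical block data using documentation-reserved IP ranges.
BLOCKS_CSV = """network,country
192.0.2.0/24,Examplia
198.51.100.0/24,Testland
"""


def load_blocks(text):
    """Parse CSV text into a list of (network, country) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(ipaddress.ip_network(row["network"]), row["country"]) for row in reader]


def lookup(ip, blocks):
    """Return the country of the first block containing ip, or None if no block matches."""
    addr = ipaddress.ip_address(ip)
    for network, country in blocks:
        if addr in network:
            return country
    return None


blocks = load_blocks(BLOCKS_CSV)
print(lookup("192.0.2.17", blocks))   # falls inside the first block
print(lookup("203.0.113.5", blocks))  # not covered by any block
```

A linear scan like this is fine for a handful of blocks; for a full data set, a sorted list of range starts with binary search would likely be a better fit.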
Contractors who are called Data Scientists but can't do what I'd expect [Reddit]
I was hired as a senior member of a pre-existing data science team. I now manage a few other team members (who were there before me). They are all contractors and their day rate is HIGH. They are all 'Data Scientists' and graduates. I'm older. I've done lots of technical roles and I'm not really sure what my official title is. I can do data science but I really just build stuff. I've done Data Engineering in the past, MLOps, DevOps, Cloud etc. I'm a jack of all trades, master of none. Now, I know what I think a 'Data Scientist' should be able to do…
Why Now? Malloy Data Deep Dive
“If we knew all the things we know about data, and about programming with data, and about programming languages in general, and we were designing a query language today, what would it look like?”…Malloy is a language for describing data relationships and transformations within SQL databases. It:
Compiles to SQL optimized for your database.
Has both a semantic data model and query language.
Excels at reading and writing nested data sets.
Seamlessly handles what are complex/error-prone queries in SQL.
…As always within Why Now, we’ll build towards an understanding of the definition above via breaking it into its components…
Embeddings: What they are and why they matter
Embeddings are a really neat trick that often come wrapped in a pile of intimidating jargon. If you can make it through that jargon, they unlock powerful and exciting techniques that can be applied to all sorts of interesting problems…I gave a talk about embeddings at PyBay 2023. This article represents an improved version of that talk, which should stand alone even without watching the video. If you’re not yet familiar with embeddings I hope to give you everything you need to get started applying them to real-world problems…
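For readers new to the idea, here is a minimal sketch, assuming an embedding is just a list of floats and that two embeddings are compared with cosine similarity (a common choice). The three-number vectors and the `most_similar` helper are invented for illustration; real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy, hand-written "embeddings" (assumption: not produced by any model).
toy = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}


def most_similar(query, vectors):
    """Return the name whose vector is closest to the query's vector."""
    candidates = (name for name in vectors if name != query)
    return max(candidates, key=lambda name: cosine_similarity(vectors[query], vectors[name]))


print(most_similar("cat", toy))
```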
Bayesian Regression Markets
Machine learning tasks are vulnerable to the quality of data used as input. Yet, it is often challenging for firms to obtain adequate datasets, with them being naturally distributed amongst owners, that in practice, may be competitors in a downstream market and reluctant to share information. Focusing on supervised learning for regression tasks, we develop a regression market to provide a monetary incentive for data sharing. Our proposed mechanism adopts a Bayesian framework, allowing us to consider a more general class of regression tasks. We present a thorough exploration of the market properties, and show that similar proposals in current literature expose the market agents to sizable financial risks, which can be mitigated in our probabilistic setting…
Multi-scale Generalized Hamiltonian Monte Carlo with Delayed Rejection
In this talk, I will demonstrate how we can combine two ideas, generalized Hamiltonian Monte Carlo and delayed rejection, to derive a sampler that is as efficient as Hamiltonian Monte Carlo, but is able to adapt its step size to deal with multi-scale distributions, much like a standard integrator for ordinary differential equations…
Generative AI Use Cases Companies Can Implement Today
Where do most companies actually start when it comes to incorporating generative AI? What generative AI use cases are realistic, achievable, and actually worth the ROI? We dug deep into the early adopters’ strategies to learn how companies are putting this technology to use today — and what it takes for a data team to implement Gen AI at scale…
Empirical Design in Reinforcement Learning
This manuscript represents both a call to action, and a comprehensive resource for how to do good experiments in reinforcement learning. In particular, we cover: the statistical assumptions underlying common performance measures, how to properly characterize performance variation and stability, hypothesis testing, special considerations for comparing multiple agents, baseline and illustrative example construction, and how to deal with hyper-parameters and experimenter bias. Throughout we highlight common mistakes found in the literature and the statistical consequences of those in example experiments. The objective of this document is to provide answers on how we can use our unprecedented compute to do good science in reinforcement learning, as well as stay alert to potential pitfalls in our empirical design…
lea - 🏃‍♀️ Minimalist alternative to dbt
lea is a minimalist alternative to tools like dbt, SQLMesh, and Google's Dataform. lea aims to be simple and opinionated, and yet offers the possibility to be extended. We happily use it every day to manage our data warehouse. We will actively maintain it and add features, while welcoming contributions. Right now lea is compatible with BigQuery and DuckDB…
Some things I learned about GAN training
For several months now, I've been working on an adversarial network that takes synthetic audio as input and enhances it to sound more natural. I know that this is different from the usual way you would use a GAN, which would be passing noise as input to the generator, turning it into a generative network. But I still think a lot of the things I learned during this project, especially about training stability, can be applied to other GANs, which is why I'm making this post!…
Learn how to overcome LLMOps challenges in pre- and post-production, build enterprise LLM infrastructure, and deliver measurable business value, including hands-on workshops and expert panel discussions with data science leaders. Register now for AI Forward 2023 — a FREE one-day virtual summit!
At Chewy, our mission is to be the most trusted and convenient destination for pet parents and partners, everywhere. Behind the scenes, our talented teams are made up of innovators, delighters, big-thinkers and, of course, passionate pet people—creating a place where you'll be empowered to build, grow and unleash your fullest potential.
We are looking for a Data Engineer II at our facility in Plantation, Florida, to collaborate with teams across Chewy to drive innovative solutions for data usage.
Location is Plantation, Florida. Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
* Based on unique clicks.
** Find last week's issue #517 here.
Looking to hire? Hit reply to this email and let us know.
Looking to get a job? Check out our “Get A Data Science Job” Course
A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself to ~60,000 subscribers by sponsoring this newsletter.
Thank you for joining us this week :)
All our best,
Hannah & Sebastian
P.S. Was today’s newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.