Hello! Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Local LLM-as-judge evaluation with lm-buddy, Prometheus and llamafile Evaluating models can also be costly, especially when LLMs are actively used to evaluate other models as in the LLM-as-Judge case. And while techniques to scale inference could also be applied to LLM judges, there does not seem to be a lot of interest in this direction…This post examines how different software components came together to allow LLM-as-judge evaluation without the need for expensive GPUs. All the components were built with and chosen for their user control, open source nature, and interoperability. These include Prometheus, an open-source model for LLM-as-judge evaluation; lm-buddy, the tool we developed and open-sourced at mzai to scale our own fine-tuning and evaluation tasks; and llamafile, a Mozilla Innovation project that brings LLMs into single, portable files. I will show how these components can work together to evaluate LLMs on cheap(er) hardware, and how we assessed the evaluators’ performance to make informed choices about them…
How to Install and Deploy LLaMA 3 Into Production Learn how to install and deploy LLaMA 3 into production with this step-by-step guide. From hardware requirements to deployment and scaling, we cover everything you need to know for a smooth implementation…LLaMA 3 Hardware Requirements And Selecting the Right Instances on AWS EC2…
Rules of Machine Learning: Best Practices for ML Engineering [PDF] This document is intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google. It presents a style for machine learning, similar to the Google C++ Style Guide and other popular guides to practical programming. If you have taken a class in machine learning, or built or worked on a machine learned model, then you have the necessary background to read this document…
There are seven core components of an A/B testing stack, and if they’re not all working properly, your company may not be making the right decisions. That means teams aren’t shipping features that actually help customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results. Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to: work with the most accurate and up-to-date metrics, completely in your own data warehouse; leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly; save time in setting up, running, and analyzing experiments; and help your non-data teams self-serve experiment setup and analysis.
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present. Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
The “it” in AI models is the dataset I’ve been at OpenAI for almost a year now. In that time, I’ve trained a lot of generative models. More than anyone really has any right to train. As I’ve spent these hours observing the effects of tweaking various model configurations and hyperparameters, one thing that has struck me is the similarities between all the training runs…
Logging Implicit Human Feedback Foyle is an open source assistant to help software developers deal with the pain of devops. Developers are expected to operate their software which means dealing with the complexity of Cloud. Foyle aims to simplify operations with AI. One of Foyle’s central premises is that creating a UX that implicitly captures human feedback is critical to building AIs that effectively assist us with operations. This post describes how Foyle logs that feedback…
Overture Maps Buildings This notebook will give a quick overview of using the new Overture Maps Python library with Lonboard. We'll pass in a bounding box covering New York City and the Overture Python API will fetch only the data inside that bounding box. While Overture's buildings dataset contains 2.3 billion rows, by using a relatively small bounding box, we can download data for our query relatively quickly (around 30 seconds on my internet connection)…
A primer on algorithmic differentiation Differentiable programming is a programming paradigm in which complex computer programs (including those with control flows and data structures) can be differentiated end-to-end automatically, enabling gradient-based optimization of parameters in the program. In differentiable programming, a program is also defined as the composition of elementary operations, forming a computation graph…
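To make the idea of differentiating a composition of elementary operations concrete, here is a minimal sketch of forward-mode algorithmic differentiation using dual numbers. It is not taken from the linked primer: the names `Dual`, `derivative`, `power`, and `f` are hypothetical, and only addition, multiplication, and sine are supported. Forward mode is just one of several ways to do this.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative with respect to the input."""
    value: float
    deriv: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative 0)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def sin(x):
    x = _lift(x)
    # Chain rule: sin(u)' = cos(u) * u'
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def derivative(func, x):
    """Evaluate func at x with a seed derivative of 1 and read off d(func)/dx."""
    return func(Dual(float(x), 1.0)).deriv


def power(x, n):
    """x**n written as a loop, to show that control flow is differentiated too."""
    result = Dual(1.0)
    for _ in range(n):
        result = result * x
    return result


def f(x):
    return x * sin(x) + 3 * x


if __name__ == "__main__":
    print(derivative(lambda x: power(x, 3), 2.0))  # d/dx x^3 at x = 2
    print(derivative(f, 1.5))                      # sin(x) + x*cos(x) + 3 at x = 1.5
```

Each elementary operation propagates a derivative alongside its value, so the derivative of the whole program falls out of ordinary evaluation, loops included.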
Smartphone Bans, Student Outcomes and Mental Health How smartphone usage affects well-being and learning among children and adolescents is a concern for schools, parents, and policymakers. Combining detailed administrative data with survey data on middle schools’ smartphone policies, together with an event study design, I show that banning smartphones significantly decreases the health care take-up for psychological symptoms and diseases among girls. Post-ban bullying among both genders decreases. Additionally, girls’ GPA improves, and their likelihood of attending an academic high school track increases. These effects are larger for girls from low socio-economic backgrounds. Hence, banning smartphones from school could be a low-cost policy tool to improve student outcomes…
Slab + interval stats and geoms This vignette describes the slab+interval geoms and stats in ggdist. This is a flexible family of stats and geoms designed to make plotting distributions (such as priors and posteriors in Bayesian models, or even sampling distributions from other models) straightforward, and support a range of useful plots, including intervals, eye plots (densities + intervals), CCDF bar plots (complementary cumulative distribution functions + intervals), gradient plots, and histograms…
DSPy Integration Here are a few resources on using DSPy from the Weaviate team! The resources are broken into two categories: Hands on Learning: Content framed to build your technical understanding with end-to-end tutorials. Read and Listen: Content designed to help develop your conceptual understanding of these technologies…
Demographic bias in misdiagnosis by computational pathology models Despite increasing numbers of regulatory approvals, deep learning-based computational pathology systems often overlook the impact of demographic factors on performance, potentially leading to biases. This concern is all the more important as computational pathology has leveraged large public datasets that underrepresent certain demographic groups. Using publicly available data from The Cancer Genome Atlas and the EBRAINS brain tumor atlas, as well as internal patient data, we show that whole-slide image classification models display marked performance disparities across different demographic groups when used to subtype breast and lung carcinomas and to predict IDH1 mutations in gliomas…
Commented Transformers Highly commented implementations of Transformers in PyTorch for the Creating a Transformer From Scratch series: The Attention Mechanism and The Rest of the Transformer.
The layers folder contains implementations for Bidirectional Attention, Causal Attention, and CausalCrossAttention. The models folder contains single file implementations for GPT-2 and BERT. Both models are compatible with torch.compile(..., fullgraph=True) …
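As a rough illustration of the difference between bidirectional and causal attention mentioned above, the following sketch computes attention weights in plain Python. It is not code from the repository: `scaled_scores` and `attention_weights` are hypothetical names, and the example omits projections, multiple heads, and batching.

```python
import math


def scaled_scores(queries, keys):
    """Dot-product scores between every query and key, scaled by sqrt(d)."""
    d = len(queries[0])
    return [[sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
             for key in keys]
            for query in queries]


def attention_weights(scores, causal):
    """Row-wise softmax; with causal=True, position i ignores positions j > i."""
    weights = []
    for i, row in enumerate(scores):
        # Mask out future positions by setting their score to -inf.
        masked = [s if (not causal or j <= i) else -math.inf
                  for j, s in enumerate(row)]
        top = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    scores = scaled_scores(tokens, tokens)
    for row in attention_weights(scores, causal=False):
        print("bidirectional", [round(w, 3) for w in row])
    for row in attention_weights(scores, causal=True):
        print("causal       ", [round(w, 3) for w in row])
```

In the causal case each row still sums to one, but all weight on later positions is zero, which is what lets a GPT-style model be trained to predict the next token.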
Multi-Agent DSPy Programs: Bootstrapping & Aggregating Multiple ReAct Agents This is a quick (somewhat advanced) example of DSPy. You're given a hard QA task and an agent architecture (dspy.ReAct). How do you get high scores without tinkering with prompts?…There are many ways, but this notebook shows one complex strategy that DSPy makes near-trivial to achieve: we'll automatically bootstrap five different highly-effective prompts for ReAct, then optimize an aggregator that combines their powers…As is usually the case with DSPy, the code to do this is probably shorter than describing it in English, so let's jump right into that…
Why are the central limit theorem and standard error formula so similar? My explanation could be flawed, but what I have come to understand is that σ/√n= sample standard deviation, but when looking at the standard error formula, I was taught that it was s/√n. I even see it online as σ/√n, which is the exact same formula that demonstrates the central limit theorem. Clearly I am missing some important clarification and understanding. I really love statistics and want to become more competent, but my knowledge is quite elementary at this point. Can anyone shed some light on what exactly I might be missing?…
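For readers with the same question, a small simulation can help separate the two formulas: σ/√n is the standard deviation of the sample mean (the standard error), and s/√n is an estimate of it from a single sample when σ is unknown. The sketch below is illustrative only. The function name `simulate` and all parameter values are assumptions, not part of the linked thread.

```python
import random
import statistics


def simulate(n=50, trials=4000, mu=10.0, sigma=2.0, seed=0):
    """Compare three quantities for samples of size n from Normal(mu, sigma).

    Returns (empirical_se, theoretical_se, estimated_se):
      empirical_se   - standard deviation of many simulated sample means
      theoretical_se - sigma / sqrt(n), using the known population sigma
      estimated_se   - s / sqrt(n), using s from a single sample
    """
    rng = random.Random(seed)

    # Draw many samples and record each sample's mean.
    means = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    empirical_se = statistics.stdev(means)

    theoretical_se = sigma / n ** 0.5

    # In practice sigma is unknown, so plug in s from the one sample we have.
    one_sample = [rng.gauss(mu, sigma) for _ in range(n)]
    estimated_se = statistics.stdev(one_sample) / n ** 0.5

    return empirical_se, theoretical_se, estimated_se


if __name__ == "__main__":
    empirical, theoretical, estimated = simulate()
    print(f"sd of sample means : {empirical:.4f}")
    print(f"sigma / sqrt(n)    : {theoretical:.4f}")
    print(f"s / sqrt(n)        : {estimated:.4f}")
```

The first two numbers should be close to each other, while the third varies from sample to sample around the same value.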
Tensor Puzzles - Penzai Edition This is a version of the tensor puzzles implemented in the JAX Penzai library. Available on GitHub. Penzai is a really nice fit for these puzzles both because it comes with a really clean visualization library built-in and because it has a very nice named-tensor implementation. I recommend running in Colab…
Reinforcement Learning and Optimal Control This is the main textbook I use for my course at ASU. It is based on the class notes I developed over the years 2019-2023. It is a standalone book, but can also be used in conjunction with my videolectures and slides, available at this site…The textbook is about 440 pages long and includes end-of-chapter exercises. It places primary emphasis on intuitive reasoning, based on the mathematical framework of dynamic programming. While mathematical proofs are deemphasized, the textbook relies on the theoretical development and analysis given in my Dynamic Programming (DP) and Reinforcement Learning (RL) books listed at this site. All of these books share a consistent notation and terminology…
Foundational Models for Robot Control Pre-trained large networks, sometimes called foundational models, are becoming increasingly useful in our research these days. Many works are pointing out how these models are advanced and can be generalized to many different tasks…I generally take the more humble view and accept that these methods fall somewhere between “far better than starting from a randomly initialized model” and “can be helpful for planning.” I gave a lecture in my class to piece together the very recent progress on using large or pre-trained models for robot control. There are likely more works to include, but I suspect this outline will help others catch up to the progress made in using large models for robotics…
tippecanoe - Build vector tilesets from large collections of GeoJSON features Builds vector tilesets from large (or small) collections of GeoJSON, FlatGeobuf, or CSV features, like these…
* Based on unique clicks. ** Find last week's issue #543 here.
Looking to get a job? Check out our “Get A Data Science Job” Course It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~61,500 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :) Stay Data Science-y! All our best, Hannah & Sebastian