Hello and thank you for tuning in to Issue #514!
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If you find this newsletter helpful to your job, consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
If you don’t find this email useful, please unsubscribe here.
And now, let's dive into some interesting links from this week :)
Mastering Customer Segmentation with LLM
Unlock advanced customer segmentation techniques using LLMs, and improve your clustering models with advanced techniques…In this article I will teach you advanced techniques, not only to define the clusters, but to analyze the result…This post is intended for those data scientists who want to have several tools to address clustering problems and be one step closer to being senior data scientists…the 3 methods we’ll study are: Kmeans, K-Prototype, and LLM + Kmeans
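For readers who want a feel for the first of those three methods, here is a minimal k-means sketch in plain Python. It is not the article's code: the `kmeans` helper, the `customers` data, and the choice of two clusters are all made up for illustration, and a real project would more likely scale the features and use a library implementation. The same loop would apply if each point were a text embedding produced by an LLM rather than raw columns.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative helper)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical customers: (age, monthly spend). Values are invented for the example.
customers = [(22, 30.0), (25, 35.0), (24, 28.0), (58, 210.0), (61, 190.0), (63, 205.0)]
labels, centroids = kmeans(customers, k=2)
print(labels)
print(centroids)
```

The sketch skips feature scaling, so the spend column dominates the distance; that is acceptable for a toy example but usually not for real customer data.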
Lessons learned from implementing user-facing analytics / dashboards? (HN)
If you'd be up for sharing some lessons / takeaways / challenges here, or even better, having a chat (I'll reach out) that would be amazing…
There are seven core components of an A/B testing stack, but if they’re not all working properly, it can mean your company isn’t making the right decisions. Meaning teams aren’t shipping features that are actually helping customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results.
Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams to self-serve experiment set up and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.
Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Inside the Matrix: Visualizing Matrix Multiplication, Attention and Beyond
Matrix multiplications (matmuls) are the building blocks of today’s ML models. This note presents mm, a visualization tool for matmuls and compositions of matmuls. Because mm uses all three spatial dimensions, it helps build intuition and spark ideas with less cognitive overhead than the usual squares-on-paper idioms, especially (though not only) for visual/spatial thinkers. And with three dimensions available for composing matmuls, along with the ability to load trained weights, we can visualize big, compound expressions like attention heads and observe how they actually behave, using im…
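To make the "compositions of matmuls" idea concrete, here is a small plain-Python sketch of a single scaled dot-product attention head written as two matrix multiplications with a row-wise softmax in between. It is not taken from the mm tool: the helper names and the tiny `q`, `k`, `v` matrices are assumptions for illustration, and learned projections, masking, and multiple heads are left out.

```python
import math


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row independently."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_head(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, with no learned weights or masking."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # first matmul: query-key similarity
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)  # each row sums to 1
    return matmul(weights, v), weights  # second matmul: weighted mix of values


# Toy inputs, invented for the example: two tokens, two dimensions each.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention_head(q, k, v)
print(weights)
print(output)
```

Because each query here lines up with exactly one key, each token puts most of its weight on its own value row, which is the kind of pattern a visualization of the intermediate matrices would make visible.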
Reinforcement Learning for Diffusion Models from Scratch
One of the key ingredients for the mainstream success of language models is the use of Reinforcement Learning from Human Feedback (RLHF) where language models are trained with human feedback to produce outputs that users are more likely to prefer…A paper in May 2023 by the Levine Lab at UC Berkeley explored how the RLHF paradigm can be applied to diffusion models, resulting in an algorithm called DDPO. Here we’ll walk through a simple implementation of this DDPO algorithm. Let’s get started!…
Is there a great book on design patterns in data engineering? [Reddit]
I've read "Fundamentals of Data Engineering" by Reis. However, as the name says, that book covers the fundamentals. There are loads of books on software engineering design patterns in general. Not for data engineering, to my knowledge. I'm looking for a great book that goes through the popular data architecture patterns end-to-end. With code samples. Googled, but didn't find anything particularly exciting. Just a few blog posts. Which books do you folks suggest as essential reading for a data engineer?…
Python Polars Tutorial (Part 1): Getting Started with Data Analysis
In this Python Programming video, we will be learning how to get started with Polars. Polars is a Data Analysis Library that allows us to easily read, analyze, and modify data…We'll start by learning how to install Polars, how to load data into a Jupyter Notebook, and how to see information about the data we've loaded in….
A Hackers' Guide to Language Models
Starting with the foundational concepts, Jeremy Howard introduces the architecture and mechanics that make these AI systems tick. He then delves into critical evaluations of GPT-4, illuminates practical uses of language models in code writing and data analysis, and offers hands-on tips for working with the OpenAI API. The video also provides expert guidance on technical topics such as fine-tuning, decoding tokens, and running private instances of GPT models…
What is Synthetic Aperture Radar?
While most scientists using remote sensing are familiar with passive, optical images from the U.S. Geological Survey's Landsat, NASA's Moderate Resolution Imaging Spectroradiometer (MODIS), and the European Space Agency's Sentinel-2, another type of remote sensing data is making waves: Synthetic Aperture Radar, or SAR. SAR is a type of active data collection where a sensor produces its own energy and then records the amount of that energy reflected back after interacting with the Earth. While optical imagery is similar to interpreting a photograph, SAR data require a different way of thinking in that the signal is instead responsive to surface characteristics like structure and moisture…
Creativity Support in the Age of Large Language Models: An Empirical Study Involving Emerging Writers
We investigate the utility of modern LLMs in assisting professional writers…The design of our collaborative writing interface is grounded in the cognitive process model of writing that views writing as a goal-oriented thinking process encompassing non-linear cognitive activities: planning, translating, and reviewing…we find that while writers seek LLM's help across all three types of cognitive activities, they find LLMs more helpful in translation and reviewing…
Plotting phylogenetic trees in R: alternating clade highlights
If you’ve dipped a toe into plotting phylogenetic trees before, you will likely be aware of the R package ggtree. For even the most niche customisations, I’ve yet to encounter something that I couldn’t somehow manage to do with the help of ggtree…Here I’ll show how I highlight clades in my trees – probably the most fundamental customisation that anybody wants to be able to do – but without having to manually trawl through figuring out which nodes are associated with which clades…
BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization
BoTorch is a library for Bayesian Optimization research built on top of PyTorch, and is part of the PyTorch ecosystem…Bayesian Optimization (BayesOpt) is an established technique for sequential optimization of costly-to-evaluate black-box functions. It can be applied to a wide variety of problems, including hyperparameter optimization for machine learning algorithms, A/B testing, as well as many scientific and engineering problems…
LLMs hype has killed data science [Reddit]
That's it.
At my work in a huge company almost all traditional data science and ML work including even NLP has been completely eclipsed by management's insane need to have their own sh*tty, custom chatbot with LLMs for their one specific use case with 10 SharePoint docs. There are hundreds of teams doing the same thing including ones with no skills. Complete and useless insanity and waste of money due to FOMO.
How is "AI" going where you work?…
RTutor.ai - chat with your data in plain language
RTutor uses OpenAI's powerful large language model to translate natural language into R code, which is then executed. You can request your analysis, just like asking a real person. Upload a data file (CSV, TSV/tab-delimited text files, and Excel) and just analyze it in plain English. Your results can be downloaded as an HTML report in minutes!…
You don’t have to be a Data Scientist [Reddit]
Just a PSA for anyone here that is starting their career, might feel overwhelmed with applying/interviewing for jobs, or is looking for a career change. If you’re interested in a Data career, know that there are many different roles out there other than a “data scientist” role. Here’s only a handful of the common titles I see out there these days…
Nintendo Technology Development (NTD): The worldwide pioneer in the creation of interactive entertainment, Nintendo Co., Ltd., of Kyoto, Japan, manufactures and markets hardware and software for its Nintendo Switch™ system and the Nintendo 3DS™ family of portable systems.
We are seeking a Sr Data Scientist to assist with the development of deep learning neural networks including, but not limited to, audio enhancement and computer vision. The role focuses on iterating over the training, quantization, and evaluation of neural networks implemented in PyTorch and/or TensorFlow.
Location is Redmond, WA, USA. Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Building an API in the cloud in fewer than 200 lines of code
Cloud tools and Python packages have become so powerful that you can build a (scalable) cloud-based API in fewer than 200 lines of code. In this blog post, you’ll see how to use Google Cloud, Terraform, and FastAPI to deploy a queryable data API on the cloud…
Spatial Statistics for Data Science: Theory and Practice with R
The book starts by providing a comprehensive overview of the types of spatial data and R packages for spatial data retrieval, manipulation, and visualization. Then, it provides a detailed explanation of the theoretical concepts of spatial statistics, along with fully reproducible examples demonstrating how to simulate, describe, and analyze areal, geostatistical, and point pattern data in various applications…
Beginner Level Deep Learning Tutorials in PyTorch!
Note that these tutorials expect some knowledge of deep learning concepts. While some of the concepts are explained, we are mainly focusing (in detail) on how to implement them in Python with PyTorch. I have compiled a list of additional resources that cover many of the concepts we look at; the YouTube series section is incredibly valuable! Deep learning google sheets If you have any good resources let me know and I can add them! If you can't find an explanation on something you want to know let me know and I'll try to find it!…
* Based on unique clicks.
** Find last week's issue #513 here.
Get A Data Science Job Course: After answering thousands of emails from readers like you, we put together our advice into a comprehensive course which will teach you everything related to getting a data science job. The course is broken into 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio, and Section 3 covers how to write your resume.
Promote yourself to ~60,000 subscribers by sponsoring this newsletter.
Thank you for joining us this week :)
All our best,
Hannah & Sebastian
P.S.
If you found this newsletter helpful to your job, please consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.