Hello and thank you for tuning in to Issue #497.
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
***
Seeing this for the first time? Subscribe here:
***
Want to support us? Become a paid subscriber here.
***
If you don’t find this email useful, please unsubscribe here.
***
And now, let's dive into some interesting links from this week:
Hope you enjoy it!
:)
The Next Larger Context
Frequently we will be given problems to solve by other people. Early in our career, these problems will usually be well-scoped and specific…And as we grow as engineers these tasks become bigger, but often the success criteria remain well-defined…A major challenge that I see many people hit right around the time they become senior engineers is that they want to progress further in their career, but they expect that the way to do that is to be given a hard problem that can only be solved with the level of technical skill they’ve gained so far…Instead you should look at the problems you’ve been solving and that your team is solving, and follow Saarinen’s advice: look at them in the next larger context*. Here are some examples…
Don’t let yourself be fooled by data drift
If you search for information on ML monitoring online, there is a good chance that you'll come across various monitoring approaches advocating for putting data drift at the center of monitoring solutions…The purpose of this blog post is to demonstrate that not all data drift impacts model performance, which makes drift methods hard to trust, since they tend to produce a large number of false alarms. To illustrate this point, we will train an ML model using a real-world dataset, monitor the distribution of the model's features in production, and report any data drift that might occur…
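To make the idea concrete, here is a minimal sketch (ours, not from the post) of the kind of univariate drift check such monitoring setups run: a two-sample Kolmogorov-Smirnov test comparing a feature's training distribution against production data. The data is synthetic, and note that an alarm from a test like this says nothing by itself about whether model performance actually dropped, which is exactly the post's point:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic stand-ins for one feature's training and production samples.
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)
production = rng.normal(loc=0.3, scale=1.0, size=5_000)  # deliberately shifted

# Two-sample Kolmogorov-Smirnov test: a small p-value flags a shift in the
# feature's distribution, not necessarily a drop in model accuracy.
stat, p_value = ks_2samp(reference, production)
if p_value < 0.01:
    print(f"Drift alarm (KS={stat:.3f}, p={p_value:.2e})")
else:
    print("No significant drift detected")
```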
Learning Julia with #TidyTuesday and Tidier.jl
Tidier.jl is a Julia implementation of the {tidyverse}, and after 10 weeks of data wrangling and plotting #TidyTuesday data in Julia, I wanted to share what I've learnt about Julia as an R user…
BigCode project: Code-generating LLMs boosted by Toloka's crowd
@Toloka teamed up with @huggingface and @ServiceNowRSRCH to power the @BigCodeProject LLM PII data annotation project. Facts: 12K code chunks, 14 categories of data, 1,399 Tolokers, and 4,349 hours of work in 4 days! Check out this post to learn what, why, and how they made it happen (link)
Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Resources to Help Global Equality for PhDs in NLP / AI
This repo originates from a wish to promote Global Equality for people who want to do a PhD in NLP, following the idea that mentorship programs are an effective way to fight against segregation…information, such as (1) knowing what a PhD in NLP is like, (2) knowing what top grad schools look for when reviewing PhD applications, (3) broadening your horizon of what is good work, (4) knowing what careers in NLP look like in both academia and industry, and many others…
Improving mathematical reasoning with process supervision
We've trained a model to achieve a new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct final answer (“outcome supervision”). In addition to boosting performance relative to outcome supervision, process supervision also has an important alignment benefit: it directly trains the model to produce a chain-of-thought that is endorsed by humans…
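For intuition, here is a toy contrast (ours, not OpenAI's implementation) between the two reward schemes; `step_reward` is a hypothetical stand-in for a learned process reward model:

```python
# Toy contrast between outcome and process supervision on a multi-step solution.

def step_reward(step: str) -> float:
    # Hypothetical placeholder: a real process reward model scores each step.
    return 1.0 if step.startswith("correct") else 0.0

solution_steps = ["correct step 1", "correct step 2", "flawed step 3"]
final_answer_is_right = False

# Outcome supervision: one scalar signal for the entire chain-of-thought.
outcome_reward = 1.0 if final_answer_is_right else 0.0

# Process supervision: dense feedback, one reward per reasoning step.
process_rewards = [step_reward(s) for s in solution_steps]

print(outcome_reward)   # 0.0 -- no hint about which step went wrong
print(process_rewards)  # [1.0, 1.0, 0.0] -- localizes the error to step 3
```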
Super Data Science Podcast #681: XGBoost: The Ultimate Classifier, with Matt Harrison
Unlock the power of XGBoost by learning how to fine-tune its hyperparameters and discover its optimal modeling situations. This and more, when best-selling author and leading Python consultant Matt Harrison teams up with Jon Krohn for yet another jam-packed technical episode! Are you ready to upgrade your data science toolkit in just one hour?…
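For readers who want a head start, here is a minimal sketch of tuning a few common XGBoost hyperparameters with scikit-learn's grid search; the dataset and grid below are illustrative assumptions, not recommendations from the episode:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

# Synthetic binary-classification data as a stand-in for a real problem.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    XGBClassifier(n_estimators=200, eval_metric="logloss"),
    param_grid={
        "max_depth": [3, 5, 7],             # tree complexity
        "learning_rate": [0.03, 0.1, 0.3],  # shrinkage per boosting round
        "subsample": [0.8, 1.0],            # row sampling to curb overfitting
    },
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```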
My Approach to Building Large Technical Projects
Whether it's building a new project from scratch, implementing a big feature, or beginning a large refactor, it can be difficult to stay motivated and complete large technical projects. A method that works really well for me is to continuously see real results and to order my work based on that…I'm not claiming that anything I say in this post is novel. It definitely shares various aspects of well-known software engineering or management practices. I'm just sharing the way I approach the larger technical work that I do and why I do it this way…I'll use my terminal emulator project as an example throughout this post so that there is realistic, concrete experience I can share…
Best way to defer on a question I don't know in an interview?
I'm an experienced data scientist but my current job of seven years wasn't particularly stats or ML intensive. I'm a wiz when it comes to data acquisition/wrangling/visualization but my last gig that really involved stats and modeling was quite a few years ago at this point. And unfortunately said job is coming to an end and I'm on the market for a new gig…I have a second round interview coming up and I'm worried about getting hit with a technical question I don't know the answer to…I'm wondering if anyone has suggestions on what to say or how to handle that scenario in a way that maximizes my chances of not scuttling the interview entirely…
Studies about how censorship handicaps a model’s capabilities?
Uncensored models fine-tuned without artificial moralizing, such as “Wizard-Vicuna-13B-Uncensored-HF”, perform well on LLM eval benchmarks even when compared with larger 65B, 40B, and 30B models. Have there been any studies about how censorship handicaps a model’s capabilities?…
EU-U.S. Terminology and Taxonomy for Artificial Intelligence
Following the AI Roadmap suggestions for concrete activities aimed at aligning EU and U.S. risk-based approaches, a group of experts engaged to prepare an initial draft of AI terminologies and taxonomies. A total of 65 terms were identified with reference to key documents from the EU and the U.S.…
Andrej Karpathy’s “State of GPT”
Learn about the training pipeline of GPT assistants like ChatGPT, from tokenization to pretraining, supervised finetuning, and Reinforcement Learning from Human Feedback (RLHF). Dive deeper into practical techniques and mental models for the effective use of these models, including prompting strategies, finetuning, the rapidly growing ecosystem of tools, and their future extensions…
All the Hard Stuff Nobody Talks About when Building Products with LLMs
There’s a lot of hype around AI, and in particular, Large Language Models (LLMs). To be blunt, a lot of that hype is just some demo b*llshit that would fall over the instant anyone tried to use it for a real task that their job depends on. The reality is far less glamorous: it’s hard to build a real product backed by an LLM…
Transformer models: an introduction and catalog — May 2023 Edition
First off, we added a whole lot of new models, including many from the Llama family…We also fixed some details in the catalog itself and added a field for the license status of each model, which has become very relevant recently…I also added quite a few links to similar surveys at the end of this post. Finally, there was a lot of editing throughout the paper that I have incorporated here too. Hope this makes it more useful!…
LanceDB
Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, a vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming…
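A rough sketch of what that Parquet conversion could look like from Python; exact function names and signatures may differ across `lance` versions, so treat this as an assumption rather than a verbatim quickstart:

```python
import lance                  # assumed: the pylance package
import pyarrow.parquet as pq

# Read an existing Parquet file (hypothetical path) and write it as Lance.
table = pq.read_table("data.parquet")
lance.write_dataset(table, "data.lance")

# Reopen the dataset; it interoperates with PyArrow (and hence Pandas).
dataset = lance.dataset("data.lance")
print(dataset.to_table().num_rows)
```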
Do you have expertise in experimental design and Bayesian statistics? Experience with Stan (we're a Stan shop) or a comparable PPL? Want to work with awesome people on cool projects in the video game industry? We're hiring Data Scientists!
As part of our Data Services team, you will work with senior scientists and business intelligence analysts from the games and media industries. If you have the technical chops, can communicate what you are doing and why, and love working with others to answer interesting questions with data, this team’s for you!
About Game Data Pros:
Game Data Pros is a data application consultancy working in digital entertainment fields like video games and streaming video. We work with established global games and media companies, helping them to define experimentation and cross-promotion strategies. We are responsible for data science initiatives and also building data-aware tools that help manage data, run experiments, and perform analyses.
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Big Ideas in Applied Math: Markov Chains
In this post, we’ll talk about Markov chains, a useful and general model of a random system evolving in time…
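As a taste of the subject, here is a minimal sketch (ours, not from the post) that simulates a two-state Markov chain from its transition matrix and estimates the stationary distribution empirically; the probabilities are made up:

```python
import numpy as np

# P[i, j] = probability of moving from state i to state j.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)
state, counts = 0, np.zeros(2)
for _ in range(100_000):
    state = rng.choice(2, p=P[state])
    counts[state] += 1

# Empirical visit frequencies approach the stationary distribution,
# which solving pi = pi @ P gives exactly as [5/6, 1/6].
print(counts / counts.sum())
```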
How to make fancy road trip maps with R and OpenStreetMap
Use R to get geocoded location and routing data from OpenStreetMap and explore our family's impending 5,000-mile road trip around the USA…
A Modern Introduction to Online Learning
In this monograph, I introduce the basic concepts of Online Learning through a modern view of Online Convex Optimization. Here, online learning refers to the framework of regret minimization under worst-case assumptions. I present first-order and second-order algorithms for online learning with convex losses, in Euclidean and non-Euclidean settings…These notes do not require prior knowledge of convex analysis and all the required mathematical tools are rigorously explained. Moreover, all the included proofs have been carefully chosen to be as simple and as short as possible…
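To ground the terminology, here is a minimal sketch (ours, not from the monograph) of online gradient descent on a stream of convex losses, the canonical first-order algorithm in this framework; the loss stream below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
x, T, eta = 0.0, 1_000, 0.5
total_loss = 0.0

for t in range(1, T + 1):
    z = rng.normal(loc=2.0)         # environment reveals z_t each round
    total_loss += (x - z) ** 2      # convex loss f_t(x) = (x - z_t)^2
    grad = 2 * (x - z)
    x -= (eta / np.sqrt(t)) * grad  # ~1/sqrt(t) steps give O(sqrt(T)) regret

print(x)  # drifts toward the best fixed action in hindsight (~2.0)
```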
* Based on unique clicks.
** Find last week's issue #496 here.
Thanks for joining us this week :)
All our best,
Hannah & Sebastian
P.S.,
If you found this newsletter helpful, consider supporting us by becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe
:)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.