Hello and thank you for tuning in to Issue #493.
Once a week we write this email to share the links we found most worthwhile across the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
***
Seeing this for the first time? Subscribe here:
***
We run a subscriber-only Slack community where we tackle learning the latest tools, keeping up with the latest techniques, career entry & growth, and anything else that's stressing you out at the office.
So if this is useful for your work, you can become a paid subscriber here:
https://datascienceweekly.substack.com/subscribe
Let’s build great Data/ML products, drive results, and accelerate your career.
***
Lastly, if you don’t find this email useful, please unsubscribe here.
***
And now, let's dive into some interesting links from this week:
Hope you enjoy it!
:)
Google "We Have No Moat, And Neither Does OpenAI"
Leaked Internal Google Document Claims Open Source AI Will Outcompete Google and OpenAI…The text below is a very recent leaked document, which was shared by an anonymous individual on a public Discord server who has granted permission for its republication. It originates from a researcher within Google…
Advice for data scientists [Reddit Discussion]
I am now at a level of seniority in my career where I now have to hire data scientists, i.e. hire the very person I used to be. And the experience has been… well… underwhelming would be to put it kindly. There are serious skill deficits among the data scientists I have interviewed, among those whom I converse with in sister companies, and sadly even among those who I have ended up hiring…Here are the most serious problems I have observed among data scientists I have met and interacted with…
The Annual State of Data Quality Survey
It’s that time of year where we announce the results of our annual The State of Data Quality survey. The headline for this year was, without a doubt, the fact that data downtime nearly doubled year over year, driven by a 166% increase in time to resolution for data quality issues…
Track every customer interaction in real-time and gain a deep understanding of your customers’ behavior
Segment Unify allows you to unite online and offline customer data in real-time across every platform and channel. Use Segment Profiles Sync to send identity resolved customer profiles to your data warehouse, where they can be used for advanced analytics and enhanced with valuable data-at-rest. Then use Segment Reverse ETL to immediately activate your ‘golden’ profiles across your CX tools of choice.
Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Using AI To Assess the Impact of News on Markets
In a recent study in collaboration with AI research platform Causality Link and the Toulouse School of Economics, French asset manager Amundi found that when incorporated into a long-short strategy, company-specific fundamental news can produce positive excess returns up to the day after publication. This report is a key milestone for anyone seeking to learn where and how investment intelligence can be gleaned from the media…
Improving Machine Learning Models by using Behavioral Data
Behavioral data is generated from the actions or behaviors of individuals or groups. In this article, we will demonstrate the benefits of using behavioral data, particularly web sessions data, to improve the accuracy of machine learning models…
Understanding Large Language Models -- A Cross-Section of the Most Relevant Literature To Get Up to Speed
Since transformers have such a big impact on everyone’s research agenda, I wanted to flesh out a short reading list for machine learning researchers and practitioners getting started. The following list below is meant to be read mostly chronologically, and I am entirely focusing on academic research papers. Of course, there are many additional resources out there that are useful…
Mojo may be the biggest programming language advance in decades
I remember the first time I used the v1.0 of Visual Basic. Back then, it was a program for DOS. Before it, writing programs was extremely complex and I’d never managed to make much progress beyond the most basic toy applications. But with VB, I drew a button on the screen, typed in a single line of code that I wanted to run when that button was clicked, and I had a complete application I could now run. It was such an amazing experience that I’ll never forget that feeling. It felt like coding would never be the same again…Writing code in Mojo, a new programming language from Modular, is the second time in my life I’ve had that feeling…
Features missing from most LLM front-ends that should exist
We are fortunate that an awful lot of very smart people have implemented many really neat prompt engineering techniques…I am going to highlight some of these packages/techniques, and this is important because I will carefully explain and demonstrate conclusively that there are LLM analogies for these techniques which are being unjustly forgotten about/not implemented by any LLM front-end…My hope is that this gist puts the final nail in the coffin of our current non-creative approach to prompting LLMs. Let's list out the techniques that they've pioneered and which are broadly possible for us to use in NLP, that we have not implemented to my knowledge in any serious capacity in any repo…
Overview of several documents that you can implement to improve your team's data management practices
I thought it might be fun to share a thread giving an overview of several documents that you can implement to improve your team's data management practices…
What's Logs Got to Do With It?
Ordered beta regression can give you comparable, scale-free ATEs that can still be understood in the scale of the original data–all without using logs…What I’m going to show in this post is that the ordered beta regression model can also address issues with logs (and the related inverse hyperbolic sine transformation) because it can produce estimates (including ATEs) that are based on proportions, and thus naturally scale-free. When the scale of the outcome is an issue, the ordered beta regression can help address that problem by estimating regression coefficients or treatment effects that do not vary with scale and also include 0s…
Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings
We present Chatbot Arena, a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. In this blog post, we are releasing our initial results and a leaderboard based on the Elo rating system, which is a widely-used rating system in chess and other competitive games. We invite the entire community to join this effort by contributing new models and evaluating them by asking questions and voting for your favorite answer…
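For readers new to Elo, here is a minimal sketch of the standard Elo update applied to one head-to-head "battle" between two models. The function names, the starting rating of 1000, and the K-factor of 32 are illustrative assumptions, not details taken from the Chatbot Arena post.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k is the K-factor (assumed value; it controls how fast ratings move).
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    score_b = 1.0 - score_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * (score_b - exp_b)
    return new_a, new_b


# Two hypothetical models start at the same rating; model A wins the vote.
model_a, model_b = update_elo(1000.0, 1000.0, score_a=1.0)
print(model_a, model_b)
```

Because the winner gains exactly what the loser gives up, the total rating in the pool stays constant, and an upset win against a higher-rated model moves both ratings more than an expected win does.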
Causal inference with logistic regression - Part 3 of the GLM and causal inference series
Though OLS is an applied statistics workhorse and performs admirably in some cases, there are many contexts in which it’s just not appropriate. In medical trials, for example, many of the outcome variables are binary. Some typical examples are whether a participant still has the disease (coded 1) or not (coded 0), or whether a participant has died (coded 1) or is still alive (coded 0). In these cases, we want to model our data with a likelihood function that can handle binary data, and the go-to solution is the binomial. As we will see, some of the nice qualities from the OLS paradigm fall apart when we want to make causal inferences with binomial models. But no fear; we have solutions…
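One way to see why the binomial case is less tidy than OLS: a logistic regression coefficient lives on the log-odds scale, so a treatment effect on the probability scale has to be computed from predicted probabilities rather than read off the coefficient. The sketch below averages the difference in predicted probabilities with and without treatment over a set of covariate values. The coefficients and covariate values are made up for illustration, and this is a generic standardization approach, not necessarily the method used in the linked post.

```python
import math


def sigmoid(z: float) -> float:
    """Inverse logit: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def average_treatment_effect(intercept, beta_treat, beta_x, covariates):
    """Average difference in predicted probability, treated vs. untreated.

    For each covariate value x, predict the outcome probability with the
    treatment indicator set to 1 and to 0, then average the differences.
    """
    diffs = [
        sigmoid(intercept + beta_treat + beta_x * x) - sigmoid(intercept + beta_x * x)
        for x in covariates
    ]
    return sum(diffs) / len(diffs)


# Hypothetical fitted coefficients and covariate values (assumptions).
ate = average_treatment_effect(
    intercept=-1.0, beta_treat=0.8, beta_x=0.5, covariates=[-2.0, 0.0, 1.0, 3.0]
)
print(ate)
```

Unlike OLS, the result depends on the covariate values being averaged over, since the same log-odds shift translates into a different probability change at different baseline risks.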
Try 4 new Arts and AI experiments
Today we are launching four new experiences for culture lovers of all ages powered by AI, created by Google Arts & Culture Lab’s artist in residence. Our artist residency program has been running since 2014 and supports artists & creative coders experimenting with emerging technologies to solve a cultural challenge, or to connect audiences with culture online in new ways. The starting point for these new experiments was applications of Google AI Image Generation Research to inspire cultural discovery and learning through play…
Do you have expertise in experimental design and Bayesian statistics? Experience with Stan (we're a Stan shop) or a comparable PPL? Want to work with awesome people on cool projects in the video game industry? We're hiring Data Scientists!
As part of our Data Services team, you will work with senior scientists and business intelligence analysts from the games and media industries. If you have the technical chops, can communicate what you are doing and why, and love working with others to answer interesting questions with data, this team’s for you!
About Game Data Pros:
Game Data Pros is a data application consultancy working in digital entertainment fields like video games and streaming video. We work with established global games and media companies, helping them to define experimentation and cross-promotion strategies. We are responsible for data science initiatives and also building data-aware tools that help manage data, run experiments, and perform analyses.
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
PyTorch DataLoader: Features, Benefits, and How to Use it
In this blog post, we will discuss the PyTorch DataLoader class in detail, including its features, benefits, and how to use it to load and preprocess data for deep learning models…
The Full Story of Large Language Models and RLHF
In this article we give a comprehensive overview of what’s really going on in the world of Language Models, building from the foundational ideas, all the way to the latest advancements...a) What is the learning process of a language model?, b) What is Reinforcement Learning from Human Feedback (RLHF) and how to make language models more aligned with human values?, and c) What makes these models dangerous or not aligned with human intentions in the first place?...
Prompt injection explained, with video, slides, and a transcript
I participated in a webinar this morning about prompt injection, organized by LangChain and hosted by Harrison Chase, with Willem Pienaar, Kojin Oshiba (Robust Intelligence), and Jonathan Cohen and Christopher Parisien (Nvidia Research). The full hour long webinar recording can be viewed on Crowdcast. I’ve extracted the first twelve minutes below, where I gave an introduction to prompt injection, why it’s an important issue and why I don’t think many of the proposed solutions will be effective…
* Based on unique clicks.
** Find last week's issue #492 here.
Thanks for joining us this week :)
All our best,
Hannah & Sebastian
P.S.,
Consider joining our subscriber-only Slack community where we'll tackle learning the latest tools, keeping up with the latest techniques, career entry & growth, and anything else that's stressing you out at the office.
Become a paid subscriber here: https://datascienceweekly.substack.com/subscribe
:)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.