Hello and thank you for tuning in to Issue #495.
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
***
Seeing this for the first time? Subscribe here:
***
Want to hear more often from us? Become a subscriber here.
***
If you don’t find this email useful, please unsubscribe here.
***
And now, let's dive into some interesting links from this week:
Hope you enjoy it!
:)
Absence of Evidence
If anyone tells you that absence of evidence is not evidence of absence, you have my permission to slap them. Of course, my permission will not prevent you from getting slapped back or charged with assault. Regardless, absence of evidence is very often evidence of absence, and sometimes strong evidence…To make this claim precise, I propose we use the Bayesian definition of evidence…
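To make the Bayesian framing concrete, here is a minimal sketch (our illustration, not the article's) of how a "no evidence observed" result updates a posterior; the numbers are assumptions chosen only for the arithmetic:

# Minimal sketch: a Bayesian update after observing "no evidence".
# All numbers below are illustrative assumptions, not from the article.
prior = 0.5            # prior probability the hypothesis is true
p_detect = 0.8         # assumed chance we'd see evidence if it were true

like_true = 1 - p_detect   # P(no evidence | true)  = 0.2
like_false = 1.0           # P(no evidence | false) = 1.0

posterior = (prior * like_true) / (prior * like_true + (1 - prior) * like_false)
print(posterior)  # ~0.167 < 0.5: the non-observation counts as evidence of absence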
Designing & Building Metric Trees [Video]
Metrics are the most important primitive in the data world, and driving the use of powerful and reliable metrics is the best way data teams can add value to their enterprises. In this talk, we'll walk through how data teams can best support the metric lifecycle end-to-end: 1. Designing useful metrics as part of metric trees, 2. Developing these metrics on stable and standard data contracts, 3. Operationalizing metrics to drive value…
Using Natural Language Processing for the analysis of global supply chains
In this project, we explored whether cutting-edge data science techniques, such as natural language processing (NLP) and transformer-based deep learning models, could enable us to construct supply chain networks from unstructured text. We applied these techniques to sentences from Reuters news…
BigCode project: Code-generating LLMs boosted by Toloka's crowd
@Toloka teamed up with @huggingface and @ServiceNowRSRCH to power the @BigCodeProject LLM PII data annotation project. Facts: 12K code chunks, 14 categories of data, 1,399 Tolokers, and 4,349 hours of work in 4 days! Check out this post to learn what, why, and how they made it happen (link)
Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Search Personalization at Netflix
At Netflix, personalization plays a key role in several aspects of our user experience, from ranking titles to constructing an optimal Homepage. Although personalization is a well established research field, its application to search presents unique problems and opportunities. In this paper, we describe the evolution of Search personalization at Netflix, its unique challenges, and provide a high level overview of relevant solutions.
Causal inference with gamma regression or: The problem is with the link function, not the likelihood
So far the difficulties we have seen with covariates, causal inference, and the GLM have all been restricted to discrete models (e.g., binomial, Poisson, negative binomial). In this sixth post of the series, we'll see that this issue can extend to models for continuous data, too. As it turns out, it may have less to do with the likelihood function and more to do with the choice of link function. To highlight the point, we'll compare Gaussian and gamma models, with both the identity and log links…
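The series itself is written in R; purely as a rough illustration of the log-versus-identity contrast, here is a hypothetical Python/statsmodels sketch on simulated data (not the post's example):

import numpy as np
import statsmodels.api as sm

# Simulate a positive, right-skewed outcome with a multiplicative treatment effect
rng = np.random.default_rng(0)
treatment = rng.integers(0, 2, 500)
mu = np.exp(1.0 + 0.5 * treatment)              # true mean on the log scale
y = rng.gamma(shape=2.0, scale=mu / 2.0)        # Gamma outcome with mean mu
X = sm.add_constant(treatment)

# Same Gamma likelihood, two different link functions
fit_log = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()
fit_identity = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Identity())).fit()

# With the log link the coefficient is multiplicative, so the average causal
# effect has to be recovered on the response scale rather than read off directly.
print(fit_log.params)
print(fit_identity.params)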
Text-to-Video: The Task, Challenges and the Current State
In this blog post, we will discuss the past, present, and future of text-to-video models. We will start by reviewing the differences between the text-to-video and text-to-image tasks, and discuss the unique challenges of unconditional and text-conditioned video generation. Additionally, we will cover the most recent developments in text-to-video models, exploring how these methods work and what they are capable of…
Level Up Your Navigation: Inside Stamen Design’s Route Simulator
What lessons on spatial navigation can we take from game design and apply to navigation in the real world?…The answer is that the visuals that accompany the driving directions you get from Google, Apple, Waze or most other apps designed for car navigation are heavily influenced by navigation in games. With some key differences, we inherit the paradigms surrounding camera systems from gaming; these allow us to describe and control a navigation experience from different perspectives (e.g. not just “top down” but also “third person”, “isometric” and “perspective”)…
AI chat apps with Shiny for Python
In the short time since they’ve become publicly available, chat interfaces for Large Language Models (LLMs) have become incredibly popular. With Shiny for Python, you can easily create your own chat application with just a few lines of code. If you’ve wanted to make a web application for interacting with AI, you can do it with Shiny for Python…
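As a rough, hypothetical sketch of what such an app can look like (not the article's code, and with the LLM call stubbed out as a placeholder):

from shiny import App, reactive, render, ui

app_ui = ui.page_fluid(
    ui.input_text("user_msg", "Your message"),
    ui.input_action_button("send", "Send"),
    ui.output_text_verbatim("chat_log"),
)

def server(input, output, session):
    history = reactive.Value([])  # list of (speaker, text) pairs

    @reactive.Effect
    @reactive.event(input.send)
    def _append():
        msg = input.user_msg()
        reply = f"(model reply to: {msg})"  # placeholder: call your LLM client here
        history.set(history() + [("you", msg), ("bot", reply)])

    @output
    @render.text
    def chat_log():
        return "\n".join(f"{who}: {text}" for who, text in history())

app = App(app_ui, server)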
Against LLM maximalism
One vision for how LLMs can be used is what I’ll term LLM maximalist. If you have some task, you try to ask the LLM to do it as directly as possible…fundamentally the LLM maximalist position is that you want to trust the LLM to solve the problem. You’re preparing for the technologies to continue to improve, and the current pain-points to keep reducing over time…There are two big problems with this approach. One is that “working around the system’s limitations” is often going to be outright impossible…The second problem is that the LLM maximalist approach is fundamentally not modular…
Numbers every LLM Developer should know
At Google, there was a document put together by Jeff Dean, the legendary engineer, called Numbers every Engineer should know. It's useful to have a similar set of numbers for LLM developers to use in back-of-the-envelope calculations. Here we share the particular numbers we at Anyscale use, why each number is important, and how to use it to your advantage…
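In that spirit, here is a back-of-the-envelope sketch of the kind of arithmetic such numbers enable; the token ratio and price below are illustrative assumptions, not figures from the post:

# Rough cost estimate for running a corpus through an LLM API.
words = 500_000               # hypothetical corpus size
tokens_per_word = 1.3         # assumed rough tokenization ratio for English
price_per_1k_tokens = 0.002   # assumed API price in USD per 1K tokens

tokens = words * tokens_per_word
cost = tokens / 1000 * price_per_1k_tokens
print(f"{tokens:,.0f} tokens -> ${cost:.2f}")  # 650,000 tokens -> $1.30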
The Simple Joys of Scaling Up
In the last two decades, transistor density has increased by 1000x; something that might have taken thousands of machines in 2002 could be done today in just one…After such a dramatic increase in hardware capability, we should ask ourselves, “Do the conditions that drove our scaling challenges in 2003 still exist?” After all, we’ve made our systems far more complex and added a lot of overhead. Is it all still necessary? If you can do the job on a single machine, isn’t that going to be a better alternative?…This post will dig into why scale-out became so dominant, take a look at whether those rationales still hold, and then explore some advantages of scale-up architecture…
Inside GitHub: Working with the LLMs behind GitHub Copilot
Due to the growing interest in LLMs and generative AI models, we decided to speak to the researchers and engineers at GitHub who helped build the early versions of GitHub Copilot and talk through what it was like to work with different LLMs from OpenAI, and how model improvements have helped evolve GitHub Copilot to where it is today—and beyond…
Reimagining Meta’s infrastructure for the AI age
We are now executing on an ambitious plan to build the next generation of Meta’s infrastructure backbone – specifically built for AI – and in this blog post we’re sharing some details on our recent progress. The projects we’re announcing here touch many of the layers of our hardware and software stack as well as the customized network that connects these technologies from top to bottom. They include our first custom chip for running AI models, a new AI-optimized data center design, and phase 2 of our 16,000 GPU supercomputer for AI research…
Do you have expertise in experimental design and Bayesian statistics? Experience with Stan (we're a Stan shop) or a comparable PPL? Want to work with awesome people on cool projects in the video game industry? We're hiring Data Scientists!
As part of our Data Services team, you will work with senior scientists and business intelligence analysts from the games and media industries. If you have the technical chops, can communicate what you are doing and why, and love working with others to answer interesting questions with data, this team’s for you!
About Game Data Pros:
Game Data Pros is a data application consultancy working in digital entertainment fields like video games and streaming video. We work with established global games and media companies, helping them to define experimentation and cross-promotion strategies. We are responsible for data science initiatives and also building data-aware tools that help manage data, run experiments, and perform analyses.
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Introducing LLM University — Your Go-To Learning Resource for NLP
We're excited to announce the launch of LLM University (LLMU), a set of comprehensive learning resources for anyone interested in natural language processing (NLP), from beginners to advanced learners. Join us to master NLP skills and start building your own AI applications!…
Ask HN: Can someone ELI5 transformers and the “Attention is all we need” paper?
I have zero AI/ML knowledge but Steve Yegge on Medium thinks that the team behind Transformers deserves a Nobel…Makes me want to better understand this tech…Edit: thank you for some amazing top level responses and links to valuable content on this subject…
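For a taste of the core idea, here is a minimal NumPy sketch of the scaled dot-product attention at the heart of the paper (a toy illustration, not a full transformer):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how much each query attends to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: weights sum to 1 per query
    return weights @ V                              # output is a weighted mix of the values

# Toy self-attention over 3 tokens with 4-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(x, x, x).shape)  # (3, 4)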
Techniques for Speeding Up Model Training
This unit covers various techniques to accelerate deep learning training…Mixed-precision training…We cover mixed-precision training, a method that uses both 16-bit and 32-bit floating-point types to reduce memory usage and increase training speed, particularly on modern GPUs that have specialized hardware for 16-bit calculations….Multi-GPU training…We also delve into strategies for multi-GPU training, including data parallelism and model parallelism, where the former distributes different mini-batches of data across multiple GPUs and the latter splits a single model across several GPUs….Other performance tips…
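To make the mixed-precision idea concrete, here is a minimal PyTorch sketch using autocast and gradient scaling (toy model and data of our own, not the course's code):

import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"
model = nn.Linear(128, 10).to(device)                      # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(64, 128, device=device)
y = torch.randint(0, 10, (64,), device=device)

for step in range(10):
    optimizer.zero_grad()
    # Forward pass runs in float16 where it is safe, float32 elsewhere
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = loss_fn(model(x), y)
    # Scale the loss so float16 gradients don't underflow, then unscale on step
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()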
* Based on unique clicks.
** Find last week's issue #494 here.
Thanks for joining us this week :)
All our best,
Hannah & Sebastian
P.S.,
If this newsletter is helpful to your job, please consider supporting us by becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe
:)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.