[in case you missed it] Data Science Weekly - Issue 477

Curated news, articles and jobs related to Data Science.
Keep up with all the latest developments


Issue #477

January 11 2023

Editor's Picks

The Economics of Maps
For centuries, maps have codified the extent of human geographic knowledge and shaped discovery and economic decision-making...In this essay, we first review and unify recent literature in a variety of different fields that highlights the economic and social consequences of maps, along with an overview of the modern geospatial industry. We then outline our economic framework in which a given map is the result of economic choices around map data and designs, resulting in variations in private and social returns to mapmaking. We highlight five important economic and institutional factors shaping mapmakers' data and design choices...

Cinematic Techniques in Narrative Visualization
The many genres of narrative visualization (e.g. data comics, data videos) each offer a unique set of affordances and constraints. To better understand a genre that we call cinematic visualizations-3D visualizations that make highly deliberate use of a camera to convey a narrative-we gathered 50 examples and analyzed their traditional cinematic aspects to identify the benefits and limitations of the form. While the cinematic visualization approach can violate traditional rules of visualization, we find that through careful control of the camera, cinematic visualizations enable immersion in data-driven, anthropocentric environments, and can naturally incorporate in-situ narrators, concrete scales, and visual analogies...

NLP Startup Funding in 2022
I track company funding and acquisitions in the natural language processing space. In 2022, I found just over 340 relevant funding events, ranging from pre-seed funding all the way through to late-stage Series E and F rounds. In this article, I focus in on early-stage companies: specifically, those who reported pre-seed funding, seed funding or Series A funding rounds...I attempt to impose some organisation and structure over the offerings of these companies, with the aim of highlighting the technology and application areas that have been considered worthy of investment over the last twelve months...

A Message from this week's Sponsor:

Get Your Models Into Production Faster With Encord

Forget about fragmented tools and notebooks for creating your active learning pipelines.

Encord is a single integrated platform that makes it quicker and easier to build production computer vision models using active learning pipelines.

Encord helps you streamline your machine learning projects, giving you a single platform for labeling any visual data, managing annotators, improving training data quality and debugging your datasets and models.

Get in touch to arrange your free trial of Encord and see how we can help you get your models into production faster.

Data Science Articles & Videos

State Space Model Book Club
Our causal inference book club was a success. Over 300 people took part and every session was well attended! So we're going to do this again. This time we'll focus on State Space Models, and specifically on Dynamax, a new library that makes these simple to use in a modern data stack...read on for why we picked this topic and the details of this next phase of our book club...

Bringing "balance" to your data
In research and data science, we sometimes encounter biased data: that is, data that has not been sampled completely randomly and suffers from an over- or under-indexing toward the population of interest...With survey data playing a key role in research and product work at Meta, we observed a growing need for software tools that make survey statistics methods accessible for researchers and engineers. This has led us to develop “balance”: A Python package for adjusting biased data samples. In balance we introduce a simple easy-to-use framework for weighting data and evaluating its biases with and without adjustments...

nanoGPT
The simplest, fastest repository for training/finetuning medium-sized GPTs. It is a rewrite of minGPT that prioritizes teeth over education. Still under active development, but currently the file train.py reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in 38 hours of training. The code itself is plain and readable: train.py is a ~300-line boilerplate training loop and model.py a ~300-line GPT model definition, which can optionally load the GPT-2 weights from OpenAI. That's it...

Language Models are Drummers: Drum Composition with Natural Language Pre-Training
Automatic music generation with artificial intelligence typically requires a large amount of data which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances...

Understanding Inverse Probability of Treatment Weighting (IPTW) in Causal Inference
In this post I will provide an intuitive and illustrated explanation of inverse probability of treatment weighting (IPTW), which is one of various propensity score (PS) methods. IPTW is an alternative to multivariate linear regression in the context of causal inference, since both attempt to ascertain the effect of a treatment on an outcome in the presence of confounds. It is important to note the current evidence does not support the claim that IPTW is superior to multivariate linear models (Glynn et al., 2006). However, IPTW does confer certain theoretical and practical benefits that we will review in this post...
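
To make the weighting idea concrete, here is a minimal sketch of IPTW on toy data. It is not taken from the linked post. It assumes a single binary confounder, so the propensity score can be estimated as the share of treated units in each stratum, and it uses a normalized weighted difference in means. The function names and the data are hypothetical.

```python
from collections import defaultdict


def propensity_by_stratum(rows):
    """Estimate P(treated | x) as the treated share within each level of x."""
    counts = defaultdict(lambda: [0, 0])  # x -> [n_treated, n_total]
    for x, t, _ in rows:
        counts[x][0] += t
        counts[x][1] += 1
    return {x: n_treated / n_total for x, (n_treated, n_total) in counts.items()}


def naive_effect(rows):
    """Unweighted difference in mean outcome between treated and control."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def iptw_effect(rows):
    """Weighted difference in means using inverse probability of treatment weights.

    Treated units get weight 1 / e(x); control units get 1 / (1 - e(x)).
    Assumes every stratum contains both treated and control units (positivity);
    otherwise a weight would require division by zero.
    """
    e = propensity_by_stratum(rows)
    weighted_sum = {0: 0.0, 1: 0.0}
    weight_total = {0: 0.0, 1: 0.0}
    for x, t, y in rows:
        w = 1.0 / e[x] if t == 1 else 1.0 / (1.0 - e[x])
        weighted_sum[t] += w * y
        weight_total[t] += w
    return weighted_sum[1] / weight_total[1] - weighted_sum[0] / weight_total[0]


# Hypothetical data: (confounder x, treatment t, outcome y), with y = 2*t + 10*x.
# Units with x = 1 are more likely to be treated, which confounds the naive comparison.
rows = (
    [(0, 1, 2.0)] + [(0, 0, 0.0)] * 3
    + [(1, 1, 12.0)] * 3 + [(1, 0, 10.0)]
)

print("naive:", naive_effect(rows))
print("IPTW: ", iptw_effect(rows))
```

In this toy setup the outcome is built with a treatment effect of 2. The naive comparison overstates it because treated units are concentrated in the high-outcome stratum, while the weighted estimate recovers the built-in effect. With real data the propensity score would usually come from a fitted model rather than stratum shares.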

Seven ways humanists are using computers to understand text
The image below is a map of a few things you might do with text...The idea is to give you a loose sense of how different activities are related to different disciplinary traditions. We’ll start in the center, and spiral out; this is just a way to organize discussion, and isn’t necessarily meant to suggest a sequential work flow...

How to Objectively Compare Two Ranked Lists in Python
A simplified explanation and implementation of Rank Biased Overlap...imagine you and your friend have both watched all 8 Harry Potter movies...But there’s a catch: you have watched each movie the day it was released, without missing a single premiere...Your friend, however, watched the 2nd movie first, then the 4th and 5th, and then binge-watched the rest when it was available on Netflix...Theoretically, you and your friend are on an equal footing, since both have watched all the movies of the series...Is it really equal though?...
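
As a rough illustration of the measure, here is a minimal sketch of a truncated Rank Biased Overlap. It is not the linked article's implementation. It assumes lists without repeated items, stops at the length of the shorter list, and omits the extrapolation term of the full measure, so two identical lists of length k score 1 - p**k rather than 1. The name `rbo_truncated`, the default `p=0.9`, and the friend's exact viewing order are assumptions.

```python
def rbo_truncated(s, t, p=0.9):
    """Truncated Rank Biased Overlap of two ranked lists.

    At each depth d the agreement is |set(s[:d]) & set(t[:d])| / d.
    Agreements are weighted by p**(d - 1), so the top ranks count most,
    and the sum is scaled by (1 - p).
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    depth = min(len(s), len(t))
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(s[:d]) & set(t[:d]))
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score


# Hypothetical viewing orders for the 8 movies: release order versus
# "2nd first, then 4th and 5th, then the rest" (order of the rest assumed).
you = [1, 2, 3, 4, 5, 6, 7, 8]
friend = [2, 4, 5, 1, 3, 6, 7, 8]

print(rbo_truncated(you, you))
print(rbo_truncated(you, friend))
```

A smaller `p` puts more of the weight on the first few ranks, so disagreements near the top of the lists are penalized more heavily than disagreements near the bottom.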

Forecasting Potential Misuses of Language Models for Disinformation Campaigns — and How to Reduce Risk
OpenAI researchers collaborated with Georgetown University’s Center for Security and Emerging Technology and the Stanford Internet Observatory to investigate how large language models might be misused for disinformation purposes. The collaboration included an October 2021 workshop bringing together 30 disinformation researchers, machine learning experts, and policy analysts, and culminated in a co-authored report building on more than a year of research. This report outlines the threats that language models pose to the information environment if used to augment disinformation campaigns and introduces a framework for analyzing potential mitigations...

Announcing new R Shiny UI components
I’m thrilled to share that the latest release of the {bslib} R package introduces a new Card API, Value boxes, and a responsive grid-like layout. These new UI components work in Shiny, R Markdown, Quarto (or really any R-based HTML project) and work best alongside the new {bsicons} package (an R interface to Bootstrap icons) as well as the latest versions of {htmlwidgets} and {shiny}...

Numerical Marvels Inside Python [Video]
Speaker Raymond Hettinger has been a prolific contributor to the CPython project for over a decade, having implemented and maintained many of Python's great features. He has been instrumental in modules like bisect, collections, decimal, functools, itertools, math, random, with types like namedtuple, sets, dictionaries, and in many other places around the codebase. He has contributed to the modification of nearly 90,000 lines of code in the CPython repository, and has made over 160 changes in the PEP repository...

Superposition, Memorization, and Double Descent
In this note, we offer a very preliminary investigation of training the same toy models in our previous paper on limited datasets. Despite being extremely simple, the toy model turns out to be a surprisingly rich case study for overfitting. In particular, we find the following: a) Overfitting corresponds to storing data points, rather than features, in superposition, b) Depending on dataset size, our models fall into two different regimes: an overfitting regime (characterized by storing data points in superposition), and a generalizing regime (characterized by storing features in superposition), and c) We observe double descent as the model transitions between these regimes...

Self-serve feature platforms: architectures and APIs
This post consists of two parts. The first part discusses the evolution of feature platforms and how they differ from model platforms and feature stores. The second part discusses the core challenges of making feature platforms self-serve for data scientists and increasing the iteration speed for feature engineering...

Tool*

Build powerful ML visualizations with Comet

With just 2 lines of code, Comet automatically logs metrics, hyperparameters, libraries, and more. This means automatic chart generation so you can easily manage training runs in real time. When you combine that with:

built-in visualizations (like the image panel),
custom project views, and
your own python panels,

Comet is a powerful tool for optimizing your ML workflow. All for free! Less friction, more ML.

Create your free account.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Data Scientist / Machine Learning Engineer - Epsilon - NYC

Epsilon Strategy and Insights, Data Sciences team is looking for a talented team player in a Data Scientist/Machine Learning Engineer role. You are an expert, mentor and advocate. You have a strong machine learning and deep learning background and are passionate about transforming data into ML models. You welcome the challenge of data science and are proficient in Python, Spark MLlib, TensorFlow, Keras, ML algorithms and Deep Neural Networks, Big Data. You must be self-driven, take initiative and want to work in a dynamic, busy and innovative group...

Want to post a job here? Email us for details --> team@datascienceweekly.org

Training & Resources

University of Washington's LING 575: NLP for Cultural Analytics
Surveys tools, frameworks, and skills needed to apply natural language processing methods to applications in the humanities and social sciences, with a focus on the analysis of large digital text corpora, including social media, literature, and historical documents. Topics will include data collection, text processing and machine learning techniques, data visualization, and ethical considerations...

Stanford's CS324 - Large Language Models
In this course, students will learn the fundamentals about the modeling, theory, ethics, and systems aspects of large language models, as well as gain hands-on experience working with them...

Software Engineering at Google
In March 2020, we published a book titled “Software Engineering at Google” curated by Titus Winters, Tom Manshreck and Hyrum Wright...The Software Engineering at Google book (“SWE Book”) is not about programming, per se, but about the engineering practices utilized at Google to make their codebase sustainable and healthy. (These practices are paramount for common infrastructural code such as Abseil.)...We are happy to announce that we are providing a digital version of this book in HTML free of charge...

Last Week's Newsletter's 3 Most Clicked Links

Data Pipeline Design Patterns - #1. Data flow patterns

Changing my feminine first name to a masculine nickname on my resume gave me way more responses per application

How Shapley Values Work

* Based on unique clicks.
** Find last week's newsletter here.

Cutting Room Floor

List of unsolved problems in Biology taped to the wall of the Synthetic Neurobiology lab at MIT's Media Lab

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian

