Hello! Once a week (for the last decade), we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If this newsletter is helpful to your job, please become a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week.
Who said what: using machine learning to correctly attribute quotes Today’s blog does not come to you from any developer in product and engineering but from our talented colleagues in data and insight. Here, the Guardian’s data scientists share how they have teamed up with PhD students from University College London to train a machine learning model to accurately attribute quotes. Below the two teams explain how they’ve been teaching a machine to understand “who said what?”
Marker - Convert PDF to markdown quickly with high accuracy Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk…Marker is a pipeline of deep learning models: 1) Extract text, OCR if necessary (heuristics, tesseract), 2) Detect page layout (layout segmenter, column detector), 3) Clean and format each block (heuristics, nougat), and 4) Combine blocks and postprocess complete text (heuristics, pdf_postprocessor)…
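The four stages above compose into a simple sequential pipeline. A hypothetical skeleton in Python (every function body here is a trivial stand-in; the real project plugs tesseract, a layout segmenter, and nougat into these slots):

```python
# Hypothetical skeleton of Marker's four-stage design; each stage is a
# trivial stand-in for the real model or heuristic it names.
def extract_text(pdf_path):           # stage 1: text extraction / OCR
    return ["Page 1 text", "Page 2 text"]        # one string per page

def detect_layout(pages):             # stage 2: layout / column detection
    return [{"type": "paragraph", "text": p} for p in pages]

def format_blocks(blocks):            # stage 3: per-block cleanup
    return [b["text"].strip() for b in blocks]

def postprocess(blocks):              # stage 4: combine + postprocess
    return "\n\n".join(blocks)

def pdf_to_markdown(pdf_path):
    return postprocess(format_blocks(detect_layout(extract_text(pdf_path))))

print(pdf_to_markdown("example.pdf"))
```

The point is the shape, not the stubs: each stage consumes the previous stage's output, so individual models can be swapped without touching the rest of the pipeline.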
There are seven core components of an A/B testing stack, but if they’re not all working properly, your company may not be making the right decisions. That means teams aren’t shipping features that actually help customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results. Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
- Work with the most accurate and up-to-date metrics, completely in your own data warehouse
- Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
- Save time in setting up, running, and analyzing experiments
- Help your non-data teams self-serve experiment setup and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present. Download the white paper to see if you have all seven, and if you don't, what you could be missing. * Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Plotting proportional data as nested circles in R A guide demonstrating how to make a static nested circle plot using the R packages packcircles and ggplot2…For better or worse, you see proportional data represented with nested circles a fair bit in the media. Plotting proportional data is almost always best done with some kind of bar chart but, occasionally, I do come across a case where I think nested circles convey a message well (as well as being visually attractive). In this article I’ll demonstrate how to make a static nested circle plot using the R packages packcircles and ggplot2, although if you want interactivity you can also check out circlepackeR, which creates snazzy html widgets…
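The article works in R, but the underlying geometry is language-agnostic: for the chart to be honest, area must be proportional to value, so radius scales with the square root. A minimal sketch in Python (category names and values are made up for illustration; actually drawing the circles is then straightforward in ggplot2 or matplotlib):

```python
import math

# Illustrative values (hypothetical categories). Area ∝ value, so
# radius ∝ sqrt(value); circles are bottom-aligned (tangent at y=0),
# the usual layout for a nested proportional-circle chart.
values = {"Total": 100, "Subset A": 40, "Subset B": 10}

max_r = 1.0
max_v = max(values.values())
circles = {}
for name, v in values.items():
    r = max_r * math.sqrt(v / max_v)            # area-true scaling
    circles[name] = {"radius": r, "center": (0.0, r)}  # rests on y=0

for name, c in circles.items():
    print(f"{name}: r={c['radius']:.3f}, center={c['center']}")
```

Note the common pitfall this avoids: scaling radius (rather than area) linearly with value exaggerates differences, since a circle twice the radius has four times the area.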
Introducing llamafile llamafile lets you turn large language model (LLM) weights into executables. Say you have a set of LLM weights in the form of a 4GB file (in the commonly-used GGUF format). With llamafile you can transform that 4GB file into a binary that runs on six OSes without needing to be installed. This makes it dramatically easier to distribute and run LLMs. It also means that as models and their weight formats continue to evolve over time, llamafile gives you a way to ensure that a given set of weights will remain usable and perform consistently and reproducibly, forever…
Making Large Language Models Uncool Again In this fireside chat, Jeremy joins Hugo Bowne-Anderson, Outerbounds’ Head of Developer Relations, to talk about the current state of LLMs, how to get beyond the intense hype to deliver actual value, the existential threat posed by increasing concentration in vendor models, why we need more OSS LLMs, and how AI education will need to change…
Superbolts Recently I read a Scientific American article about superbolts, which are lightning strikes that “can be 1,000 times as strong as ordinary strikes”. This reminded me of distributions I’ve seen of many natural phenomena — like earthquakes, asteroids, and solar flares — where the most extreme examples are thousands of times bigger than the ordinary ones. So the article about superbolts made me wonder 1) Whether superbolts are really a separate category, or whether they are just extreme examples from a long-tailed distribution, and 2) Whether the distribution is well-modeled by a Student t-distribution on a log scale, like many of the examples I’ve looked at…
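As a toy illustration of that second question (all numbers below are invented, not real lightning data): if "stroke energies" are drawn so that log10(energy) follows a Student t-distribution with few degrees of freedom, the heaviest draws can dwarf the median, superbolt-style, with no separate category built in. A standard-library-only sketch:

```python
import math
import random
import statistics

random.seed(42)

def student_t_sample(df):
    """One Student-t variate: N(0,1) / sqrt(chi-squared_df / df)."""
    z = random.gauss(0.0, 1.0)
    chi2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

# Hypothetical "stroke energies": t-distributed on a log10 scale around
# a made-up typical value of 10**9 (illustrative units and numbers).
df = 3
energies = [10 ** (9 + 0.5 * student_t_sample(df)) for _ in range(10_000)]

median_e = statistics.median(energies)
print(f"median energy: {median_e:.3g}")
print(f"max/median ratio: {max(energies) / median_e:.1f}x")
```

With heavy tails (small df), extreme draws routinely land orders of magnitude above the median, which is exactly why "1,000x the ordinary strike" alone doesn't settle whether superbolts are a distinct population.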
A gentle reminder that the market is a sh*tshow now if you are looking for a job. Just hang in there [Reddit] I work for a govt entity at 30% lower than my market rate in a LCOL area. Usually we feel happy if we get 10 applications total and a couple of decent ones. But we recently opened one for a DS position with entry-level requirements. In a couple of weeks we got "flooded" with applicants with MS and PhD degrees from tier 1 universities including Carnegie Mellon, Northwestern, University of Chicago, and Johns Hopkins. I have been looking for a new role and this was a good reminder for me as well that the market is sh*t and if you are looking for a new role, there couldn't be a worse time. So, if you are in the same boat, just hang tight. Not getting an offer has a lot to do with factors beyond your control…
Pushing the Frontiers of Biodiversity Research: Unveiling the Global Diversity, Distribution, and Conservation of Fungi Fungi comprise approximately 20% of all eukaryotic species and are connected to virtually all life forms on Earth. Yet, their diversity remains contentious, their distribution elusive, and their conservation neglected. We aim to flip this situation by synthesizing current knowledge. We present a revised estimate of 2–3 million fungal species with a “best estimate” at 2.5 million. To name the unknown >90% of these by the end of this century, we propose recognition of species known only from DNA data and call for large-scale sampling campaigns…
Spider 1.0 - Yale Semantic Parsing and Text-to-SQL Challenge Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains. In Spider 1.0, different complex SQL queries and databases appear in train and test sets. To do well on it, systems must generalize well to not only new SQL queries but also new database schemas…
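To make the task concrete, here is an illustrative Spider-style (question, SQL) pair — not verbatim from the dataset, with a toy schema invented for this sketch — executed against an in-memory database using Python's built-in sqlite3:

```python
import sqlite3

# Illustrative Spider-style example: a natural-language question paired
# with the SQL query a text-to-SQL system should produce (this pair and
# schema are made up for the sketch, not copied from the dataset).
example = {
    "db_id": "concert_singer",
    "question": "How many singers are there?",
    "query": "SELECT count(*) FROM singer",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO singer (name) VALUES (?)",
                 [("Joe",), ("Mary",), ("Ann",)])

(count,) = conn.execute(example["query"]).fetchone()
print(count)  # 3
```

The cross-domain twist is that at test time the system faces databases and schemas it has never seen, so it must map the question onto table and column names rather than memorize query patterns.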
Qwen - chat & pretrained large language model proposed by Alibaba Cloud In brief, we have strong base language models, which have been stably pretrained for up to 3 trillion tokens of multilingual data with a wide coverage of domains, languages (with a focus on Chinese and English), etc. They are able to achieve competitive performance on benchmark datasets. Additionally, we have chat models that are aligned with human preference based on SFT and RLHF (not released yet), which are able to chat, create content, extract information, summarize, translate, code, solve math problems, and so on, and are able to use tools, play as agents, or even play as code interpreters, etc…
Reinforcement Learning from Human Feedback I gave an RLHF lecture at Stanford today, here are the slides. The newer figures from other talks I've given: * visuals on history of RLHF / related fields * figures on advanced RL methods (CAI / DPO / rejection sampling)…
My team only uses Excel to manage all of our critical data and I’m struggling to fix it [Reddit] My manager tasked me with finding ways to improve our team’s data management practices. We use one Excel workbook with 40+ sheets as our central hub for how we store, manage, and interact with our data critical to day-to-day operations. Most of our team uses this Excel file, oftentimes simultaneously, which causes mistaken data entries, conflicting filtering, and so on….It has “worked” so far but continues to grow beyond its useful limit…On one hand, it seems stupid to have all this data laying around in Excel. On the other hand, I don’t know how to find an acceptable solution that balances sophistication and user-friendliness for the “business users”. On top of that, I have to develop this all alone and self-guided. This is one of my first projects and I don’t want to fail. Btw, this is a fortune 50 company in a highly regulated industry 😭. Any suggestions would help greatly for my own sanity…
Deep generative modeling of the human proteome reveals over a hundred novel genes involved in rare genetic disorders popEVE - a deep generative model of the human proteome that reveals over a hundred novel genes involved in rare genetic disorders…Identifying causal mutations accelerates genetic disease diagnosis and therapeutic development…we developed a deep generative model, popEVE, that combines evolutionary information with population sequence data and achieves state-of-the-art performance at ranking variants by severity to distinguish patients with severe developmental disorders from potentially healthy individuals…
We are BCG X. BCG X is the tech build & design unit of BCG. Turbocharging BCG’s deep industry and functional expertise, BCG X brings together advanced tech knowledge and ambitious entrepreneurship to help organizations enable innovation at scale. With nearly 3,000 technologists, scientists, programmers, engineers, and human-centered designers located across 80+ cities, BCG X builds and designs platforms and software to address the world’s most important challenges and opportunities. Our BCG X teams own the full analytics value chain end to end: framing new business challenges, designing innovative algorithms, and implementing, deploying, and scaling solutions, enabling colleagues and clients to fully embrace AI. Our product offerings span from fully custom builds to industry-specific leading-edge AI software solutions. Our Data Scientists and Senior Data Scientists are part of our rapidly growing team, applying data science methods and analytics to real-world business situations across industries to drive significant business impact. You'll have the chance to partner with clients in a variety of BCG regions and industries, and on key topics like climate change, enabling them to design, build, and deploy new and innovative solutions.
Apply here Want to post a job here? Email us for details --> team@datascienceweekly.org
Language Models: A Guide for the Perplexed Given the growing importance of AI literacy, we decided to write this tutorial to help narrow the gap between the discourse among those who study language models -- the core technology underlying ChatGPT and similar products -- and those who are intrigued and want to learn more about them. In short, we believe the perspective of researchers and educators can add some clarity to the public's understanding of the technologies beyond what's currently available, which tends to be either extremely technical or promotional material generated about products by their purveyors…
If you had to list a “tier list” of software that data scientists should be competent with prior to their first job, what would it be? [Reddit] May or may not be asking this so I can aggregate courses for me to learn/upskill. But basically I feel like being the R/SQL/Python guy I’m missing out on a lot of other tools and tech. Give me a list of more tools I should know as an incoming data scientist. Cloud platforms? Git? Docker? List anything and everything you would hope a data scientist should be good to pickup or know before starting…
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including developing models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing…
* Based on unique clicks. ** Find last week's issue #522 here.
Looking to hire? Hit reply to this email and let us know.
Looking to get a job? Check out our “Get A Data Science Job” Course A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself to ~60,000 subscribers by sponsoring this newsletter.
Thank you for joining us this week (and for the last decade!) :) All our best, Hannah & Sebastian
P.S. Was today’s newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.