Hello! Once a week (for the last decade), we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If this newsletter is helpful to your job, please become a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week.
Who said what: using machine learning to correctly attribute quotes Today’s blog does not come to you from any developer in product and engineering but from our talented colleagues in data and insight. Here, the Guardian’s data scientists share how they have teamed up with PhD students from University College London to train a machine learning model to accurately attribute quotes. Below the two teams explain how they’ve been teaching a machine to understand “who said what?”
Marker - Convert PDF to markdown quickly with high accuracy Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk…Marker is a pipeline of deep learning models: 1) Extract text, OCR if necessary (heuristics, tesseract), 2) Detect page layout (layout segmenter, column detector), 3) Clean and format each block (heuristics, nougat), and 4) Combine blocks and postprocess complete text (heuristics, pdf_postprocessor)…
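The four stages above compose into a simple sequential pipeline. A hypothetical skeleton in Python (every function body here is a trivial stand-in; the real project plugs tesseract, a layout segmenter, and nougat into these slots):

```python
# Hypothetical skeleton of Marker's four-stage design; each stage is a
# trivial stand-in for the real model or heuristic it names.
def extract_text(pdf_path):           # stage 1: text extraction / OCR
    return ["Page 1 text", "Page 2 text"]        # one string per page

def detect_layout(pages):             # stage 2: layout / column detection
    return [{"type": "paragraph", "text": p} for p in pages]

def format_blocks(blocks):            # stage 3: per-block cleanup
    return [b["text"].strip() for b in blocks]

def postprocess(blocks):              # stage 4: combine + postprocess
    return "\n\n".join(blocks)

def pdf_to_markdown(pdf_path):
    return postprocess(format_blocks(detect_layout(extract_text(pdf_path))))

print(pdf_to_markdown("example.pdf"))
```

The point is the shape, not the stubs: each stage consumes the previous stage's output, so individual models can be swapped without touching the rest of the pipeline.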
There are seven core components of an A/B testing stack, but if they’re not all working properly, your company may not be making the right decisions. That means teams aren’t shipping features that actually help customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results. Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
- Work with the most accurate and up-to-date metrics, completely in your own data warehouse
- Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
- Save time in setting up, running, and analyzing experiments
- Help your non-data teams self-serve experiment setup and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present. Download the white paper to see if you have all seven, and if you don't, what you could be missing. * Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Plotting proportional data as nested circles in R A guide demonstrating how to make a static nested circle plot using the R packages packcircles and ggplot2…For better or worse, you see proportional data represented with nested circles a fair bit in the media. Plotting proportional data is almost always best done with some kind of bar chart but, occasionally, I do come across a case where I think nested circles convey a message well (as well as being visually attractive). In this article I’ll demonstrate how to make a static nested circle plot using the R packages packcircles and ggplot2, although if you want interactivity you can also check out circlepackeR, which creates snazzy html widgets…
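The article works in R, but the underlying geometry is language-agnostic: for the chart to be honest, area must be proportional to value, so radius scales with the square root. A minimal sketch in Python (category names and values are made up for illustration; actually drawing the circles is then straightforward in ggplot2 or matplotlib):

```python
import math

# Illustrative values (hypothetical categories). Area ∝ value, so
# radius ∝ sqrt(value); circles are bottom-aligned (tangent at y=0),
# the usual layout for a nested proportional-circle chart.
values = {"Total": 100, "Subset A": 40, "Subset B": 10}

max_r = 1.0
max_v = max(values.values())
circles = {}
for name, v in values.items():
    r = max_r * math.sqrt(v / max_v)            # area-true scaling
    circles[name] = {"radius": r, "center": (0.0, r)}  # rests on y=0

for name, c in circles.items():
    print(f"{name}: r={c['radius']:.3f}, center={c['center']}")
```

Note the common pitfall this avoids: scaling radius (rather than area) linearly with value exaggerates differences, since a circle twice the radius has four times the area.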
Introducing llamafile llamafile lets you turn large language model (LLM) weights into executables. Say you have a set of LLM weights in the form of a 4GB file (in the commonly-used GGUF format). With llamafile you can transform that 4GB file into a binary that runs on six OSes without needing to be installed. This makes it dramatically easier to distribute and run LLMs. It also means that as models and their weight formats continue to evolve over time, llamafile gives you a way to ensure that a given set of weights will remain usable and perform consistently and reproducibly, forever…
Making Large Language Models Uncool Again In this fireside chat, Jeremy joins Hugo Bowne-Anderson, Outerbounds’ Head of Developer Relations, to talk about the current state of LLMs, how to get beyond the intense hype to deliver actual value, the existential threat posed by increasing concentration in vendor models, why we need more OSS LLMs, and how AI education will need to change…
Superbolts Recently I read a Scientific American article about superbolts, which are lightning strikes that “can be 1,000 times as strong as ordinary strikes”. This reminded me of distributions I’ve seen of many natural phenomena — like earthquakes, asteroids, and solar flares — where the most extreme examples are thousands of times bigger than the ordinary ones. So the article about superbolts made me wonder 1) Whether superbolts are really a separate category, or whether they are just extreme examples from a long-tailed distribution, and 2) Whether the distribution is well-modeled by a Student t-distribution on a log scale, like many of the examples I’ve looked at…
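As a toy illustration of that second question (all numbers below are invented, not real lightning data): if "stroke energies" are drawn so that log10(energy) follows a Student t-distribution with few degrees of freedom, the heaviest draws can dwarf the median, superbolt-style, with no separate category built in. A standard-library-only sketch:

```python
import math
import random
import statistics

random.seed(42)

def student_t_sample(df):
    """One Student-t variate: N(0,1) / sqrt(chi-squared_df / df)."""
    z = random.gauss(0.0, 1.0)
    chi2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

# Hypothetical "stroke energies": t-distributed on a log10 scale around
# a made-up typical value of 10**9 (illustrative units and numbers).
df = 3
energies = [10 ** (9 + 0.5 * student_t_sample(df)) for _ in range(10_000)]

median_e = statistics.median(energies)
print(f"median energy: {median_e:.3g}")
print(f"max/median ratio: {max(energies) / median_e:.1f}x")
```

With heavy tails (small df), extreme draws routinely land orders of magnitude above the median, which is exactly why "1,000x the ordinary strike" alone doesn't settle whether superbolts are a distinct population.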
A gentle reminder that the market is a sh*tshow now if you are looking for a job. Just hang in there [Reddit] I work for a govt entity at 30% lower than my market rate in a LCOL area. Usually we feel happy if we get 10 applications total and a couple of decent ones. But we recently opened one for a DS position with entry-level requirements. In a couple of weeks we got "flooded" with applicants with MS and PhD degrees from tier 1 universities including Carnegie Mellon, Northwestern, University of Chicago, and Johns Hopkins. I have been looking for a new role and this was a good reminder for me as well that the market is sh*t and if you are looking for a new role, there couldn't be a worse time. So, if you are in the same boat, just hang tight. Not getting an offer has a lot to do with factors beyond your control…
Pushing the Frontiers of Biodiversity Research: Unveiling the Global Diversity, Distribution, and Conservation of Fungi Fungi comprise approximately 20% of all eukaryotic species and are connected to virtually all life forms on Earth. Yet, their diversity remains contentious, their distribution elusive, and their conservation neglected. We aim to flip this situation by synthesizing current knowledge. We present a revised estimate of 2–3 million fungal species with a “best estimate” at 2.5 million. To name the unknown >90% of these by the end of this century, we propose recognition of species known only from DNA data and call for large-scale sampling campaigns…
Spider 1.0 - Yale Semantic Parsing and Text-to-SQL Challenge Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains. In Spider 1.0, different complex SQL queries and databases appear in train and test sets. To do well on it, systems must generalize well to not only new SQL queries but also new database schemas…
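To make the task concrete, here is an illustrative Spider-style (question, SQL) pair — not verbatim from the dataset, with a toy schema invented for this sketch — executed against an in-memory database using Python's built-in sqlite3:

```python
import sqlite3

# Illustrative Spider-style example: a natural-language question paired
# with the SQL query a text-to-SQL system should produce (this pair and
# schema are made up for the sketch, not copied from the dataset).
example = {
    "db_id": "concert_singer",
    "question": "How many singers are there?",
    "query": "SELECT count(*) FROM singer",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO singer (name) VALUES (?)",
                 [("Joe",), ("Mary",), ("Ann",)])

(count,) = conn.execute(example["query"]).fetchone()
print(count)  # 3
```

The cross-domain twist is that at test time the system faces databases and schemas it has never seen, so it must map the question onto table and column names rather than memorize query patterns.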
Qwen - chat & pretrained large language model proposed by Alibaba Cloud In brief, we have strong base language models, which have been stably pretrained for up to 3 trillion tokens of multilingual data with a wide coverage of domains, languages (with a focus on Chinese and English), etc. They are able to achieve competitive performance on benchmark datasets. Additionally, we have chat models that are aligned with human preference based on SFT and RLHF (not released yet), which are able to chat, create content, extract information, summarize, translate, code, solve math problems, and so on, and are able to use tools, play as agents, or even play as code interpreters, etc…
Reinforcement Learning from Human Feedback I gave an RLHF lecture at Stanford today, here are the slides. The newer figures from other talks I've given: * visuals on history of RLHF / related fields * figures on advanced RL methods (CAI / DPO / rejection sampling)…
My team only uses Excel to manage all of our critical data and I’m struggling to fix it [Reddit] My manager tasked me with finding ways to improve our team’s data management practices. We use one Excel workbook with 40+ sheets as our central hub for how we store, manage, and interact with our data critical to day-to-day operations. Most of our team uses this Excel file, oftentimes simultaneously, which causes mistaken data entries, conflicting filtering, and so on….It has “worked” so far but continues to grow beyond its useful limit…On one hand, it seems stupid to have all this data laying around in Excel. On the other hand, I don’t know how to find an acceptable solution that balances sophistication and user-friendliness for the “business users”. On top of that, I have to develop this all alone and self-guided. This is one of my first projects and I don’t want to fail. Btw, this is a fortune 50 company in a highly regulated industry 😭. Any suggestions would help greatly for my own sanity…
Deep generative modeling of the human proteome reveals over a hundred novel genes involved in rare genetic disorders popEVE - a deep generative model of the human proteome that reveals over a hundred novel genes involved in rare genetic disorders…Identifying causal mutations accelerates genetic disease diagnosis and therapeutic development…we developed a deep generative model, popEVE, that combines evolutionary information with population sequence data and achieves state-of-the-art performance at ranking variants by severity to distinguish patients with severe developmental disorders from potentially healthy individuals…
We are BCG X. BCG X is the tech build & design unit of BCG. Turbocharging BCG’s deep industry and functional expertise, BCG X brings together advanced tech knowledge and ambitious entrepreneurship to help organizations enable innovation at scale. With nearly 3,000 technologists, scientists, programmers, engineers, and human-centered designers located across 80+ cities, BCG X builds and designs platforms and software to address the world’s most important challenges and opportunities. Our BCG X teams own the full analytics value chain end to end: framing new business challenges, designing innovative algorithms, and implementing, deploying, and scaling solutions, enabling colleagues and clients to fully embrace AI. Our product offerings span from fully custom builds to industry-specific leading-edge AI software solutions. Our Data Scientists and Senior Data Scientists are part of our rapidly growing team, applying data science methods and analytics to real-world business situations across industries to drive significant business impact. You'll have the chance to partner with clients in a variety of BCG regions and industries, and on key topics like climate change, enabling them to design, build, and deploy new and innovative solutions.
Apply here Want to post a job here? Email us for details --> team@datascienceweekly.org
Language Models: A Guide for the Perplexed Given the growing importance of AI literacy, we decided to write this tutorial to help narrow the gap between the discourse among those who study language models -- the core technology underlying ChatGPT and similar products -- and those who are intrigued and want to learn more about them. In short, we believe the perspective of researchers and educators can add some clarity to the public's understanding of the technologies beyond what's currently available, which tends to be either extremely technical or promotional material generated about products by their purveyors…
If you had to list a “tier list” of software that data scientists should be competent with prior to their first job, what would it be? [Reddit] May or may not be asking this so I can aggregate courses for me to learn/upskill. But basically I feel like being the R/SQL/Python guy I’m missing out on a lot of other tools and tech. Give me a list of more tools I should know as an incoming data scientist. Cloud platforms? Git? Docker? List anything and everything you would hope a data scientist should be good to pickup or know before starting…
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including developing models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing…
* Based on unique clicks. ** Find last week's issue #522 here.
Looking to hire? Hit reply to this email and let us know.
Looking to get a job? Check out our “Get A Data Science Job” Course A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself to ~60,000 subscribers by sponsoring this newsletter.
Thank you for joining us this week (and for the last decade!) :) All our best, Hannah & Sebastian
P.S. Was today’s newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.