Data Science Weekly - Issue 515
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Hello!
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If you don’t find this email useful, please unsubscribe here.
Is this newsletter helpful to your job? Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week
Editor's Picks
My F100 company analyzed why our good data scientists are good and here's the recap [Reddit]
A small team of internal researchers inside the company spent time investigating which data scientists performed the best, which performed the worst, and what factors played into this. The top 3 indicators of a high-performing data scientist were…
Beyond Hypothesis Tests & P-values: Iterative Refinement in Science and Business
The trajectory of scientific and business methodologies has been marked by evolving frameworks, each seeking to address the multifaceted challenges of their respective domains. While the principle of falsification, introduced by Karl Popper, set the stage, it became increasingly evident that more adaptable and iterative methodologies were needed to navigate the complexities of the modern era. George E.P. Box’s insights on the synergy between theory and practice offered a beacon. Building upon this, Devezer and Buzbas have further refined this vision for today’s intricate challenges. Yet, there remains a pressing need to popularize and integrate these principles, especially within the business community…
The art of data: Empowering art institutions with data and analytics
To help art institutions get started on the journey to increase their leverage of technology, we collaborated with seven leading US art institutions to gain insights into how to strengthen their data and analytics practices…Our work included the creation of an easy-to-use, objective, and scalable dashboard designed to inform institutional strategy, improve business operations, and establish the proper use of data and analytics within each organization…Building data and analytics capabilities: Five-step framework…
A Message from this week's Sponsor:
Build full-stack, private applications with the power of zero-knowledge on Aleo
Aleo is a Layer-1 blockchain built from the ground up with zero-knowledge woven into every layer of the tech stack.
Why build with us?
Aleo harnesses the power of zero-knowledge to deliver both privacy and scalability. By handling computation off-chain, it opens the door to a new era of private, scalable dapps.
You don’t need a Ph.D in cryptography to use zero-knowledge. With our language, Leo, zero-knowledge circuits are automatically generated based on your program allowing any developer to access zero-knowledge proofs.
Development integrates seamlessly on the web with the Aleo SDK. Manage accounts, deploy programs, and integrate with the Aleo network right in your browser.
Know a problem that can be solved with zk? Get paid for your good ideas with Aleo's Ignition grants. Grants start at $3,000 for simple applications.
Start building > https://aleo.org/grants/
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Story(line) Visualizations
I’ve noticed there are many research papers regarding storyline visualizations. Storyline visualizations all stem from this sketch of Randall Munroe’s movie narrative charts…Various researchers have incrementally pushed this idea forwards, algorithmically generating the chart, optimizing the layout of the chart, etc. Here’s a few…But why storylines? What’s the point of adopting Randall’s sketch and automating it? What’s the storyline do well, and why use it?…
Challenges in evaluating AI systems
What many people working inside and outside of AI don’t fully appreciate is how difficult it is to build robust and reliable model evaluations…In this post, we discuss some of the challenges that we have encountered in developing AI evaluations. In order of less-challenging to more-challenging, we discuss:
Multiple choice evaluations
Third-party evaluation frameworks like BIG-bench and HELM
Using crowdworkers to measure how helpful or harmful our models are
Using domain experts to red team for national security-relevant threats
Using generative AI to develop evaluations for generative AI
Working with a non-profit organization to audit our models for dangerous capabilities
We conclude with a few policy recommendations that can help address these challenges…
torch2jax - Run PyTorch in JAX 🤝
Run PyTorch in JAX…Mix-and-match PyTorch and JAX code with seamless, end-to-end autodiff, use JAX classics like jit, grad, and vmap on PyTorch code, and run PyTorch models on TPUs…torch2jax uses abstract interpretation (aka tracing) to move JAX values through PyTorch code. As a result, you get a JAX-native computation graph that follows exactly your PyTorch code, down to the last epsilon…
Machine UnLearning for Harry Potter
This paper, “Who's Harry Potter? Approximate Unlearning in LLMs”, shared an interesting idea. The goal is to have an LLM “unlearn” some of its knowledge. In the case of the paper they’re interested in removing knowledge from the popular Harry Potter books. The paper shares two techniques: Technique 1: Reverse Finetuning and Technique 2: Reverse Anchoring…
A Beginner's Guide to Sequence Analytics in SQL
Sequences: they’re all around us. More specifically, they’re in your data warehouse with timestamps, payloads, and mysterious columns from Segment. Many of the real, course-changing insights that data teams dream of are hidden deep inside these elusive event streams. This post will help you find them, using your favorite neighborhood query language. For the purposes of this journey, imagine you’re a Data Scientist at Netflix. You and your team want to better understand what your watch funnel looks like – what does a successful session entail? – as well as understand how users interact with important features like search…
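As a rough illustration of the kind of question the post describes, here is a minimal sketch that uses the LAG window function to find which event came immediately before each "play". The table name, column names, and sample events are hypothetical and not taken from the post. The SQL runs through Python's built-in sqlite3 module, and the sketch assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# Hypothetical event stream: (user_id, timestamp in seconds, event name).
EVENTS = [
    ("u1", 0, "open_app"),
    ("u1", 5, "search"),
    ("u1", 9, "play"),
    ("u2", 0, "open_app"),
    ("u2", 4, "browse"),
    ("u2", 7, "play"),
    ("u3", 0, "open_app"),
    ("u3", 3, "search"),
]

# LAG looks one row back within each user's time-ordered events.
QUERY = """
WITH ordered AS (
    SELECT
        user_id,
        event,
        LAG(event) OVER (PARTITION BY user_id ORDER BY ts) AS prev_event
    FROM events
)
SELECT prev_event, COUNT(*) AS plays
FROM ordered
WHERE event = 'play'
GROUP BY prev_event
ORDER BY prev_event
"""


def events_before_play(rows):
    """Count which event immediately preceded each 'play', per user."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (user_id TEXT, ts INTEGER, event TEXT)")
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for prev_event, plays in events_before_play(EVENTS):
        print(prev_event, plays)
```

In this toy data one play follows a search and one follows browsing, while the third user searches but never plays. A real watch funnel would need session boundaries and more steps than a single look-back.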
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
A 166-page report from Microsoft qualitatively exploring GPT-4V capabilities and usage. Describes visual+text prompting techniques, few-shot learning, reasoning, etc…
Worst Data Engineering Mistake you've seen? [Reddit]
I started work at a company that just got Databricks and did not understand how it worked. So, they set everything to run on their private clusters with all-purpose compute (3x the price) with auto terminate turned off because they were ok with things running over the weekend. Finance made them stop using Databricks after two months lol…I'm sure people have f*cked up worse. What is the worst you’ve experienced?…
Computational Power and AI
In this article we answer the following questions: What is compute and why does it matter? How is the demand for compute shaping AI development? What kind of hardware is involved? What are the components of compute hardware? What does the supply chain for AI hardware look like? What does the market for data centers look like? How can demand for compute be addressed? How are governments responding? What are the policy implications?…
Trials of developing OPT-175B (Episode 77 of the Stanford MLSys Seminar “Foundation Models Limited Series” with Susan Zhang) [Video]
LLM development at scale is an extraordinarily resource-intensive process, requiring compute resources that many do not have access to. The experimentation process will also appear rather haphazard in comparison, given limited compute-time to fully ablate all architectural / hyper-parameter choices…In this talk, we will walk through the development lifecycle of OPT-175B, covering infrastructure and training convergence challenges faced at scale, along with methods of addressing these issues going forward…
ONE book to recommend to social science PhDs who want to build probability and stats skills? [Twitter/X discussion]
You get to recommend ONE book to social science PhDs who want to build probability and stats skills. What is it?…
Natural language processing for African languages
In this dissertation, we focus on languages spoken in Sub-Saharan Africa where all the indigenous languages in this region can be regarded as low-resourced in terms of the availability of labelled data for NLP tasks and unlabelled data found on the web. We analyse the noise in the publicly available corpora, and curate a high-quality corpus, demonstrating that the quality of semantic representations learned in word embeddings depends not only on the amount of data but also on the quality of pre-training data. We demonstrate empirically the limitations of word embeddings, and the opportunities the multilingual pre-trained language model (PLM) offers, especially for languages unseen during pre-training and low-resource scenarios. We further study how to adapt and specialize multilingual PLMs to unseen African languages using a small amount of monolingual texts…
What size is that correlation?
Suppose you find a correlation of 0.36. How would you characterize it? I posed this question to the stalwart few still floating on the wreckage of Twitter, and here are the responses…Whether a correlation is big or small, important or not, and useful or not, depends on the context, of course. But to be more specific, it depends on whether you are trying to predict, explain, or decide. And what you report should follow…
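One way to make the gap between "explain" and "predict" concrete is the sketch below. It assumes a simple linear regression with a single predictor, which is an assumption of this example and not something the post states. Under that assumption, r squared is the share of variance explained, and sqrt(1 - r^2) is the ratio of the residual standard deviation to the outcome's standard deviation. The function names are made up for illustration.

```python
import math


def variance_explained(r):
    """Share of outcome variance explained by one linear predictor."""
    return r ** 2


def rmse_reduction(r):
    """Fractional drop in RMSE compared with always predicting the mean."""
    return 1 - math.sqrt(1 - r ** 2)


if __name__ == "__main__":
    r = 0.36
    print(f"variance explained: {variance_explained(r):.1%}")
    print(f"RMSE reduction:     {rmse_reduction(r):.1%}")
```

For r = 0.36 this gives about 13% of variance explained but only about a 6.7% reduction in prediction error, which is one reason the same correlation can look moderate or small depending on the goal.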
Jobs
Data Science Intern: Performance Control & Digitalization
More than 90% of automotive innovations are based on electronics and software.
We, the BMW Group, offer you an interesting and varied internship in data science for Performance Control & Digitalization. To take our operations to the next level, the BMW Group – Performance Control & Digitalization department is looking for a Data Science Intern to contribute to the Supply Chain Innovations Think Tank of BMW Group and continue BMW’s leadership in supply chain management. The goal of the team will be to research emerging technologies including Data Science (ML, AI, BI etc.).
Location is Munich. Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Training & Resources
TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
TorchGeo is a PyTorch domain library, similar to torchvision, providing datasets, samplers, transforms, and pre-trained models specific to geospatial data. The goal of this library is to make it simple:
for machine learning experts to work with geospatial data, and
for remote sensing experts to explore machine learning solutions…
Annotated Forest Plots using ggplot2
This post contains a short R code walkthrough to make annotated forest plots like the one shown above. There are packages to make plots like these such as forester, forestplot, and ggforestplot, but sometimes I still prefer to make my own. The big picture of this is that we’ll be making three separate ggplot2 objects and putting them together with patchwork. You could also use packages like cowplot, gridarrange or ggarrange to put the intermediate plot objects together. You can skip to the end to see the full code…
The medium is the message: R programmers as content creators
I had a blast speaking at @cascadiarconf about how #RStats users are content creators! Making @quarto_pub slides was a joy, as usual. The recording is lost to time but I have a (mostly) verbatim script 😀…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #514 here.
Cutting Room Floor
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Whenever you're ready, 2 ways we can help you:
Get A Data Science Job Course: A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio, and Section 3 covers how to write your resume.
Promote yourself to ~60,000 subscribers by sponsoring this newsletter.
Thank you for joining us this week :)
All our best,
Hannah & Sebastian
P.S.
Was this edition helpful to your job? Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.