Hello and thank you for tuning in to Issue #495. Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

Seeing this for the first time? Subscribe here.
Want to support us? Become a paid subscriber here.
If you don’t find this email useful, please unsubscribe here.

And now, let's dive into some interesting links from this week. Hope you enjoy it! :)
Juergen Schmidhuber, Renowned ‘Father Of Modern AI,’ Says His Life’s Work Won’t Lead To Dystopia The potential to revolutionize various industries and improve our lives is clear, as are the dangers if bad actors leverage the technology for personal gain. Are we headed towards a dystopian future, or is there reason to be optimistic? I had a chance to sit down with Schmidhuber to understand his perspective on this seemingly fast-moving AI train that will carry us into the future….
Copilot in Power BI Demo [YouTube] Using Copilot, you can simply describe the visuals and insights you’re looking for and Copilot will do the rest. Users can create and tailor reports in seconds, generate and edit DAX calculations, create narrative summaries, and ask questions about their data, all in conversational language. With the ability to easily tailor the tone, scope, and style of narratives and add them seamlessly within reports, Power BI can also deliver data insights with even greater impact through easy-to-understand text summaries…
Reflections on a Life in Computer Science and Statistics: Norm Matloff On July 1, 2023, I will retire, after--incredibly--48 years on the faculty of the University of California, Davis. Just like George Washington :-) , I will give this "Farewell Address." It certainly won't be historical like Washington's, and it's not really a true farewell--I'll have an office, and intend to continue to be active in research and in writing books and software--but I hope some will find it interesting, maybe surprising, and possibly useful...
BigCode project: Code-generating LLMs boosted by Toloka's crowd
@Toloka teamed up with @huggingface and @ServiceNowRSRCH to power the @BigCodeProject LLM PII data annotation project. The facts: 12K code chunks, 14 categories of data, 1,399 Tolokers, and 4,349 hours of work in 4 days! Check out this post to learn what, why, and how they made it happen (link)
Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Open Source Sports Analytics with PySport If you're looking for fun data sets for learning, for teaching, maybe a conference talk, or even if you're just really into them, sports offers up a continuous stream of rich data that many people can relate to. Yet, accessing that data can be tricky. Sometimes it's locked away in obscure file formats. Other times, the data exists but without a clear API to access it. On this episode, we talk about PySport - something of an awesome list of a wide range of libraries (mostly but not all Python) for accessing a wide variety of sports data from the NFL, NBA, F1, and more. We have Koen Vossen, maintainer of PySport, to talk through some of the more popular projects…
a16z’s AI Canon In this post, we’re sharing a curated list of resources we’ve relied on to get smarter about modern AI. We call it the “AI Canon” because these papers, blog posts, courses, and guides have had an outsized impact on the field over the past several years. We start with a gentle introduction to transformer and latent diffusion models, which are fueling the current AI wave. Next, we go deep on technical learning resources; practical guides to building with large language models (LLMs); and analysis of the AI market. Finally, we include a reference list of landmark research results, starting with “Attention is All You Need” — the 2017 paper by Google that introduced the world to transformer models and ushered in the age of generative AI….
New Podcast: Yet Another Infra Deep Dive Yet Another Infra Deep Dive brings you insightful and thought-provoking discussions on the world of infrastructure software. Our hosts, Ian Livingstone, a tech advisor for Snyk, and Tim Chen, a General Partner at Essence VC, team up with a rotating cast of guests to dive deep into the latest trends and hot topics in the YAIG community (www.yaig.dev). With a focus on delivering actionable and informative content, the YAIG podcast is perfect for infra enthusiasts looking to stay up-to-date on the latest developments in their field. So if you're passionate about infrastructure software and want more, check us out…
Evaluation of African American Language Bias in Natural Language Generation We evaluate how well LLMs understand African American Language (AAL) in comparison to their performance on White Mainstream English (WME), the encouraged "standard" form of English taught in American classrooms. We measure LLM performance using automatic metrics and human judgments for two tasks: a counterpart generation task, where a model generates AAL (or WME) given WME (or AAL), and a masked span prediction (MSP) task, where models predict a phrase that was removed from their input…
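For a feel of what a masked-prediction probe looks like in practice, here is a minimal sketch using a generic off-the-shelf fill-mask model. This is not the paper's evaluation code; the model and sentence are placeholders, and the paper's MSP task also covers multi-token spans, which this single-token pipeline does not:

```python
# Rough illustration of masked prediction with an off-the-shelf model;
# NOT the paper's evaluation code. Model and sentence are placeholders.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")  # assumed model choice

# The idea: remove a span from the input, ask the model to restore it,
# then compare prediction quality across AAL and WME inputs.
for candidate in fill("The weather today is <mask>.")[:3]:
    print(candidate["token_str"], round(candidate["score"], 3))
```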
Developing early ideas I find myself writing the same email over and over to grad students who are developing early ideas, and every time I deviate from this approach in my own work my papers go sideways. Sharing the latest email in case it is helpful…
Sophia, a new optimizer that is 2x faster than Adam on LLMs. Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction in the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple, scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner…
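As a rough, illustrative sketch of the kind of update this describes (a momentum numerator preconditioned by an estimated diagonal Hessian, then clipped elementwise), here is a simplified version. The hyperparameter names and the stubbed-out Hessian estimate are our simplifications, not the paper's exact algorithm:

```python
import numpy as np

# Simplified, illustrative Sophia-style step: EMA momentum divided by an
# estimated diagonal Hessian, with elementwise clipping. The real algorithm
# refreshes the Hessian estimate only every few steps via stochastic
# estimators; here it is simply passed in. Names are our own.
def sophia_like_step(theta, grad, hess_diag, m, lr=1e-4, beta1=0.96,
                     gamma=0.05, eps=1e-12):
    m = beta1 * m + (1 - beta1) * grad             # momentum (EMA of gradients)
    precond = m / np.maximum(gamma * hess_diag, eps)
    update = np.clip(precond, -1.0, 1.0)           # clip each coordinate
    return theta - lr * update, m
```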
the tiny corp raised $5.1M The tiny corp is a computer company. We sell computers for more than they cost to make; I’ve been thinking about this one for a while. In the limit, it’s a chip company, but there are a lot of intermediates along the way. The human brain has about 20 PFLOPS of compute. I’ve written various blog posts about this. Sadly, 20 PFLOPS of compute is not accessible to most people, costing about $1M to buy or $100/hr to rent. With the way AI is going, we risk large entities controlling the majority of the compute in the world. I do not want “I think there’s a world market for maybe five computers.” to ever be the world we live in. The goal of the tiny corp is: “to commoditize the petaflop”…
Anyone else been mildly horrified once they dive into the company's data? [Reddit Discussion] I'm a few months into my first job as a data analyst at a mobile gaming company. We make freemium games where users can play for a while until they run out of coins/energy, then have to wait varying amounts of time…So I don't know what I was expecting, but the first time I saw how much money some people spend on these games I felt like I was going to throw up. Most people never make a purchase. But some people spend insane amounts of money. Like upsetting amounts of money…Anyone else ever seen things like this while working as a data analyst?…
The Curse of Dimensionality: Weird things happen in higher dimensions. Even useful information can overload a machine learning model. Most of us have very reasonable intuition that more information is always better. For example, the more I know about a potential borrower, the better I can predict whether that borrower will default on a loan. Surprisingly, this intuition is false, sometimes spectacularly false. I’ve written separately about how extra information can be worse than useless because fields in a sample of data can cluster or correlate in misleading ways. Such irrelevant or random chance effects can obscure real ones. In fact, matters get worse. Extra information can cause problems even if it’s useful. This surprising fact is due to a phenomenon that arises only in high dimensions, known as the Curse of Dimensionality…
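To make the "weird things happen" concrete, here is a small self-contained demonstration (our own illustration, not from the article): in high dimensions, the nearest and farthest random points end up at nearly the same distance from a query point, which starves distance-based methods of contrast.

```python
import numpy as np

# Distance concentration: as dimension d grows, the relative gap between
# the nearest and farthest neighbor of a query point shrinks toward zero.
rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))                 # 1000 uniform random points
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    print(f"d={d:4d}  relative spread (max-min)/min = "
          f"{(dists.max() - dists.min()) / dists.min():.3f}")
```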
Small Language Models Improve Giants by Rewriting Their Outputs Large language models (LLMs) have demonstrated impressive few-shot learning capabilities, but they often underperform compared to fine-tuned models on challenging tasks…In this light, we propose a method to correct LLM outputs without relying on their weights. First, we generate a pool of candidates by few-shot prompting an LLM. Second, we refine the LLM-generated outputs using a smaller model, the LM-corrector (LMCor), which is trained to rank, combine and rewrite the candidates to produce the final target output…
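The two-stage flow is simple enough to sketch. In this hypothetical pseudocode the function names are placeholders for whatever LLM and trained corrector you plug in, not an API from the paper:

```python
# Hypothetical sketch of the LMCor idea as described in the abstract:
# a frozen LLM proposes candidates, a small trained corrector produces
# the final output. All names here are placeholders, not the paper's code.
def correct_with_small_lm(prompt, sample_from_llm, corrector, n_candidates=8):
    # Stage 1: few-shot prompt the frozen LLM for a pool of candidates.
    candidates = [sample_from_llm(prompt) for _ in range(n_candidates)]
    # Stage 2: the LM-corrector ranks, combines, and rewrites the
    # candidates into the final target output.
    return corrector(prompt, candidates)
```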
Your guide to AI: May 2023 Welcome to the latest issue of your guide to AI, an editorialized newsletter covering key developments in AI research (particularly for this issue!), industry, geopolitics and startups during April 2023…
Do you have expertise in experimental design and Bayesian statistics? Experience with Stan (we're a Stan shop) or a comparable PPL? Want to work with awesome people on cool projects in the video game industry? We're hiring Data Scientists! As part of our Data Services team, you will work with senior scientists and business intelligence analysts from the games and media industries. If you have the technical chops, can communicate what you are doing and why, and love working with others to answer interesting questions with data, this team’s for you! About Game Data Pros: Game Data Pros is a data application consultancy working in digital entertainment fields like video games and streaming video. We work with established global games and media companies, helping them to define experimentation and cross-promotion strategies. We are responsible for data science initiatives and also for building data-aware tools that help manage data, run experiments, and perform analyses. Apply here.
Want to post a job here? Email us for details --> team@datascienceweekly.org
MIT’s MAS.S68: Generative AI for Constructive Communication Evaluation and New Research Methods Advances in large language models recently popularized by ChatGPT represent a remarkable leap forward in language processing by machines. We invite you to join the conversation shaping the future of communication technology. What does this mean for us, how can we make the most of these advancements, and what are the risks? What research opportunities have opened up? What kinds of evaluation are called for? We will bring together a group of practitioners and experts for guided discussions, hands-on experimentation, and project critiques…
Stanford’s Machine Learning with Graphs This course covers important research on the structure and analysis of large social and information networks and on models and algorithms that abstract their basic properties. Students will explore how to practically analyze large-scale network data and how to reason about it through models for network structure and evolution…
Data-Science-Interview-Questions-Answers I started an initiative on LinkedIn in which I post a daily data science interview question. For easier access, the questions and answers will be updated in this repo. The questions can be divided into six categories: machine learning questions, deep learning questions, statistics questions, probability questions, Python questions, and resume-based questions…
* Based on unique clicks. ** Find last week's issue #494 here.
Thanks for joining us this week :) All our best, Hannah & Sebastian
P.S., If you found this newsletter helpful, consider supporting us by becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe
:)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.