RSNA-2024

RSNA 2024 Lumbar Spine Degenerative Classification

Classify lumbar spine degenerative conditions

[TOC]

Overview

The goal of this competition is to create models that can be used to aid in the detection and classification of degenerative spine conditions using lumbar spine MR images. Competitors will develop models that simulate a radiologist’s performance in diagnosing spine conditions.

Description

Low back pain is the leading cause of disability worldwide, according to the World Health Organization, affecting 619 million people in 2020. Most people experience low back pain at some point in their lives, with the frequency increasing with age. Pain and restricted mobility are often symptoms of spondylosis, a set of degenerative spine conditions including degeneration of intervertebral discs and subsequent narrowing of the spinal canal (spinal stenosis), subarticular recesses, or neural foramen with associated compression or irritations of the nerves in the low back.

Magnetic resonance imaging (MRI) provides a detailed view of the lumbar spine vertebrae, discs, and nerves, enabling radiologists to assess the presence and severity of these conditions. Proper diagnosis and grading of these conditions help guide treatment and potential surgery to alleviate back pain and improve overall health and quality of life for patients.

RSNA has teamed with the American Society of Neuroradiology (ASNR) to conduct this competition exploring whether artificial intelligence can be used to aid in the detection and classification of degenerative spine conditions using lumbar spine MR images.

The challenge will focus on the classification of five lumbar spine degenerative conditions: Left Neural Foraminal Narrowing, Right Neural Foraminal Narrowing, Left Subarticular Stenosis, Right Subarticular Stenosis, and Spinal Canal Stenosis. For each imaging study in the dataset, we’ve provided severity scores (Normal/Mild, Moderate, or Severe) for each of the five conditions across the intervertebral disc levels L1/L2, L2/L3, L3/L4, L4/L5, and L5/S1.

To create the ground truth dataset, the RSNA challenge planning task force collected imaging data sourced from eight sites on five continents. This multi-institutional, expertly curated dataset promises to improve standardized classification of degenerative lumbar spine conditions and enable development of tools to automate accurate and rapid disease classification.

Challenge winners will be recognized at an event during the RSNA 2024 annual meeting. For more information on the challenge, contact RSNA Informatics staff at informatics@rsna.org.

Evaluation

Submissions are evaluated using the average of sample weighted log losses and an any_severe_spinal prediction generated by the metric. The metric notebook can be found here.

The sample weights are as follows:

  • 1 for normal/mild.
  • 2 for moderate.
  • 4 for severe.
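As a rough illustration of the weights above, the following sketch computes a sample-weighted multiclass log loss in plain Python. The function name `weighted_log_loss`, the clipping constant `eps`, and the toy predictions are assumptions made for illustration. The official metric notebook also averages across conditions and adds the any_severe_spinal term, which this sketch does not reproduce.

```python
import math

# Sample weights as stated in the competition text.
WEIGHTS = {"normal_mild": 1.0, "moderate": 2.0, "severe": 4.0}
CLASSES = ["normal_mild", "moderate", "severe"]


def weighted_log_loss(labels, probs, eps=1e-15):
    """Sample-weighted multiclass log loss.

    labels: list of true class names.
    probs:  list of [p_normal_mild, p_moderate, p_severe] rows.
    eps:    hypothetical clipping constant that avoids log(0).
    """
    total_loss = 0.0
    total_weight = 0.0
    for label, row in zip(labels, probs):
        # Normalize the row so its probabilities sum to 1.
        p_true = row[CLASSES.index(label)] / sum(row)
        p_true = min(max(p_true, eps), 1.0 - eps)
        weight = WEIGHTS[label]
        total_loss += -weight * math.log(p_true)
        total_weight += weight
    return total_loss / total_weight


# Toy data (illustrative only): one example of each severity.
labels = ["normal_mild", "moderate", "severe"]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
confident = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

uniform_loss = weighted_log_loss(labels, uniform)
confident_loss = weighted_log_loss(labels, confident)
print(f"uniform: {uniform_loss:.4f}, confident: {confident_loss:.4f}")
```

Because severe cases carry four times the weight of normal/mild ones, a model that is poorly calibrated on severe cases is penalized much more heavily than the raw class frequencies would suggest.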

For each row ID in the test set, you must predict a probability for each of the different severity levels. The file should contain a header and have the following format:

row_id,normal_mild,moderate,severe
123456_left_neural_foraminal_narrowing_l1_l2,0.333,0.333,0.333
123456_left_neural_foraminal_narrowing_l2_l3,0.333,0.333,0.333
123456_left_neural_foraminal_narrowing_l3_l4,0.333,0.333,0.333
etc.

In rare cases the lowest vertebrae aren’t visible in the imagery. You still need to make predictions (nulls will cause errors), but those rows will not be scored.

For this competition, the any_severe_scalar has been set to 1.0.

Dataset Description

The goal of this competition is to identify medical conditions affecting the lumbar spine in MRI scans.

This competition uses a hidden test. When your submitted notebook is scored, the actual test data (including a full length sample submission) will be made available to your notebook.

Files

train.csv - Labels for the train set.

  • study_id - The study ID. Each study may include multiple series of images.
  • [condition]_[level] - The target labels, such as spinal_canal_stenosis_l1_l2, with the severity levels of Normal/Mild, Moderate, or Severe. Some entries have incomplete labels.

train_label_coordinates.csv

  • study_id
  • series_id - The imagery series ID.
  • instance_number - The image’s order number within the 3D stack.
  • condition - There are three core conditions: spinal_canal_stenosis, neural_foraminal_narrowing, and subarticular_stenosis. The latter two are considered for each side of the spine.
  • level - The relevant vertebrae, such as l3_l4.
  • [x/y] - The x/y coordinates for the center of the area that defined the label.

sample_submission.csv

  • row_id - A slug of the study ID, condition, and level, such as 12345_spinal_canal_stenosis_l3_l4.
  • [normal_mild/moderate/severe] - The three prediction columns.
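The row_id slug can be assembled mechanically from a study ID, the five conditions, and the five disc levels. The sketch below is a minimal illustration: the helper names are hypothetical, the subarticular condition slugs are inferred from the condition names in the text, and the row order is an assumption that may differ from the real sample_submission.csv.

```python
from itertools import product

# Condition slugs: the spinal canal and left neural foraminal forms appear
# in the text; the remaining slugs are inferred by analogy (assumption).
CONDITIONS = [
    "spinal_canal_stenosis",
    "left_neural_foraminal_narrowing",
    "right_neural_foraminal_narrowing",
    "left_subarticular_stenosis",
    "right_subarticular_stenosis",
]
LEVELS = ["l1_l2", "l2_l3", "l3_l4", "l4_l5", "l5_s1"]


def make_row_ids(study_id):
    """Return the 25 row_id slugs (5 conditions x 5 levels) for one study."""
    return [f"{study_id}_{cond}_{level}" for cond, level in product(CONDITIONS, LEVELS)]


def uniform_submission_lines(study_ids):
    """Build CSV lines with a header and a uniform prediction for every row.

    Every row gets a value, since nulls would cause errors at scoring time.
    """
    lines = ["row_id,normal_mild,moderate,severe"]
    for study_id in study_ids:
        for row_id in make_row_ids(study_id):
            lines.append(f"{row_id},0.333,0.333,0.333")
    return lines


lines = uniform_submission_lines([12345])
print("\n".join(lines[:4]))
```

A uniform file like this is only a format check; it is a convenient starting point to overwrite with real model probabilities.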

[train/test]_images/[study_id]/[series_id]/[instance_number].dcm - The imagery data.

[train/test]_series_descriptions.csv

  • study_id
  • series_id
  • series_description - The scan’s orientation.

Solution Notes

Transformer + KAN: 6.09; the results were very poor.

BELKA_2024

[TOC]

Leash Bio - Predict New Medicines with BELKA

Predict small molecule-protein interactions using the Big Encoded Library for Chemical Assessment (BELKA)

Overview

In this competition, you’ll develop machine learning (ML) models to predict the binding affinity of small molecules to specific protein targets – a critical step in drug development for the pharmaceutical industry that would pave the way for more accurate drug discovery. You’ll help predict which drug-like small molecules (chemicals) will bind to three possible protein targets.

Description

Small molecule drugs are chemicals that interact with cellular protein machinery and affect the functions of this machinery in some way. Often, drugs are meant to inhibit the activity of single protein targets, and those targets are thought to be involved in a disease process. A classic approach to identify such candidate molecules is to physically make them, one by one, and then expose them to the protein target of interest and test if the two interact. This can be a fairly laborious and time-intensive process.

The US Food and Drug Administration (FDA) has approved roughly 2,000 novel molecular entities in its entire history. However, the number of chemicals in druglike space has been estimated to be 10^60, a space far too big to physically search. There are likely effective treatments for human ailments hiding in that chemical space, and better methods to find such treatments are desirable to us all.

To evaluate potential search methods in small molecule chemistry, competition host Leash Biosciences physically tested some 133M small molecules for their ability to interact with one of three protein targets using DNA-encoded chemical library (DEL) technology. This dataset, the Big Encoded Library for Chemical Assessment (BELKA), provides an excellent opportunity to develop predictive models that may advance drug discovery.

Datasets of this size are rare and restricted to large pharmaceutical companies. The current best-curated public dataset of this kind is perhaps BindingDB, which, at 2.8M binding measurements, is much smaller than BELKA.

This competition aims to revolutionize small molecule binding prediction by harnessing ML techniques. Recent advances in ML approaches suggest it might be possible to search chemical space by inference using well-trained computational models rather than running laboratory experiments. Similar progress in other fields suggests that using ML to search across vast spaces could be a generalizable approach applicable to many domains. We hope that by providing BELKA we will democratize aspects of computational drug discovery and assist the community in finding new lifesaving medicines.

Here, you’ll build predictive models to estimate the binding affinity of unknown chemical compounds to specified protein targets. You may use the training data provided; alternatively, there are a number of methods to make small molecule binding predictions without relying on empirical binding data (e.g., DiffDock), and this contest was designed to allow for such submissions.

Your work will contribute to advances in small molecule chemistry used to accelerate drug discovery.

Evaluation

The metric for this competition is the average precision, calculated for each (protein, split group) pair and then averaged to give the final score. Please see this forum post for important details.
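To make the averaging concrete, here is a minimal pure-Python sketch of the idea: compute average precision within each (protein, split group) pair, then take the unweighted mean. It is not the host's implementation. The function names, the split group label "shared", and the toy scores are hypothetical, and ties are broken by input order, which may differ from library implementations.

```python
from collections import defaultdict


def average_precision(y_true, y_score):
    """Mean of precision@k taken at the rank k of every positive example."""
    # Rank examples by descending score (stable sort, ties keep input order).
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_group_average_precision(records):
    """records: iterable of (protein, split_group, binds, predicted_prob)."""
    groups = defaultdict(lambda: ([], []))
    for protein, split_group, label, score in records:
        labels, scores = groups[(protein, split_group)]
        labels.append(label)
        scores.append(score)
    per_group = {key: average_precision(y, s) for key, (y, s) in groups.items()}
    return sum(per_group.values()) / len(per_group), per_group


# Toy predictions (illustrative only).
records = [
    ("BRD4", "shared", 1, 0.9),
    ("BRD4", "shared", 0, 0.8),
    ("BRD4", "shared", 1, 0.7),
    ("BRD4", "shared", 0, 0.1),
    ("sEH", "shared", 0, 0.6),
    ("sEH", "shared", 1, 0.4),
]
score, per_group = mean_group_average_precision(records)
print(score, per_group)
```

Averaging per group rather than pooling all rows means each pair contributes equally to the final score, regardless of how many examples it contains.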

Here’s the code for the implementation.

Submission File

For each id in the test set, you must predict a probability for the binary target binds. The file should contain a header and have the following format:

id,binds
295246830,0.5
295246831,0.5
295246832,0.5
etc.

Timeline

  • April 4, 2024 - Start Date.
  • July 1, 2024 - Entry Deadline. You must accept the competition rules before this date in order to compete.
  • July 1, 2024 - Team Merger Deadline. This is the last day participants may join or merge teams.
  • July 8, 2024 - Final Submission Deadline.

All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.

Prizes

  • First Prize: $12,000
  • Second Prize: $10,000
  • Third Prize: $10,000
  • Fourth Prize: $8,000
  • Fifth Prize: $5,000
  • Top Student Group: $5,000 to the highest performing student team. A team is considered a student team if a majority of its members (e.g., at least 3 out of a 5-member team) are students enrolled in a high school or university degree program. In the case of an even number of members, at least half of them must be students.

Competition Host

Leash Biosciences is a discovery-stage biotechnology company that seeks to improve medicinal chemistry with machine learning approaches and massive data collection. Leash is comprised of wet lab scientists and dry lab scientists in equal numbers, and is proudly headquartered in Salt Lake City, Utah, USA.

Additional Details

Chemical Representations

One of the goals of this competition is to explore and compare many different ways of representing molecules. Small molecules have been [represented](https://pubs.acs.org/doi/10.1021/acsinfocus.7e7006) with SMILES, graphs, 3D structures, and more, including more esoteric methods such as spherical convolutional neural nets. We encourage competitors not only to explore different methods of making predictions but also to try different ways of representing the molecules.

We provide the molecules in SMILES format.

SMILES

SMILES is a concise string notation used to represent the structure of chemical molecules. It encodes the molecular graph, including atoms, bonds, connectivity, and stereochemistry as a linear sequence of characters, by traversing the molecule graph. SMILES is widely used in machine learning applications for chemistry, such as molecular property prediction, drug discovery, and materials design, as it provides a standardized and machine-readable format for representing and manipulating chemical structures.

The SMILES in this dataset should be sufficient to be translated into any other chemical representation format that you want to try. A simple way to perform some of these translations is with RDKit.
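Because SMILES is a linear character sequence, a common first step for sequence models is to split it into chemically meaningful tokens. The sketch below uses only the standard library rather than RDKit. The regular expression is an illustrative assumption covering bracket atoms, common organic-subset atoms, bonds, branches, and ring-closure digits; it is not a full SMILES grammar and does no chemical validation.

```python
import re

# Illustrative token pattern (assumption, not a complete SMILES grammar).
# Two-letter atoms (Br, Cl) are listed before single letters so they win.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/.:~@*$])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail if any character is skipped."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


aspirin_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
halide_tokens = tokenize_smiles("ClCBr")
linker_tokens = tokenize_smiles("C[Dy]")  # bracket atoms stay one token
print(aspirin_tokens)
print(halide_tokens, linker_tokens)
```

For conversions to graphs, fingerprints, or 3D structures, a cheminformatics toolkit such as RDKit remains the more appropriate tool; this tokenizer only prepares text input.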

Details about the experiments

DELs are libraries of small molecules with unique DNA barcodes covalently attached

Traditional high-throughput screening requires keeping individual small molecules in separate, identifiable tubes and demands a lot of liquid handling to test each one of those against the protein target of interest in a separate reaction. The logistical overhead of these efforts tends to restrict screening collections, called libraries, to 50K-5M small molecules. A scalable solution to this problem, DNA-encoded chemical libraries, was described in 2009. As DNA sequencing got cheaper and cheaper, it became clear that DNA itself could be used as a label to identify, and deconvolute, collections of molecules in a complex mixture. DELs leverage this DNA sequencing technology.

These barcoded small molecules are in a pool (many in a single tube, rather than one tube per small molecule) and are exposed to the protein target of interest in solution. The protein target of interest is then rinsed to remove small molecules in the DEL that don’t bind the target, and the remaining binders are collected and their DNA sequenced.

DELs are manufactured by combining different building blocks

An intuitive way to think about DELs is to imagine a Mickey Mouse head as an example of a small molecule in the DEL. We attach the DNA barcode to Mickey’s chin. Mickey’s left ear is connected by a zipper; Mickey’s right ear is connected by velcro. These attachment points of zippers and velcro are analogies to different chemical reactions one might use to construct the DEL.

We could purchase ten different Mickey Mouse faces, ten different zipper ears, and ten different velcro ears, and use them to construct our small molecule library. By creating every combination of these three, we’ll have 1,000 small molecules, but we only needed thirty building blocks (faces and ears) to make them. This combinatorial approach is what allows DELs to have so many members: the library in this competition is composed of 133M small molecules. The 133M small molecule library used here, AMA014, was provided by AlphaMa. It has a triazine core and superficially resembles the DELs described here.
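The combinatorial arithmetic in the Mickey Mouse analogy can be checked with a short sketch. The building block names are placeholders invented for illustration; only the counts come from the text.

```python
from itertools import product

# Ten placeholder building blocks per position (names are hypothetical).
faces = [f"face_{i}" for i in range(10)]
zipper_ears = [f"zipper_ear_{i}" for i in range(10)]
velcro_ears = [f"velcro_ear_{i}" for i in range(10)]

# Every combination of one face, one zipper ear, and one velcro ear.
library = list(product(faces, zipper_ears, velcro_ears))

n_building_blocks = len(faces) + len(zipper_ears) + len(velcro_ears)
n_molecules = len(library)
print(n_building_blocks, n_molecules)
```

The number of building blocks grows as a sum while the library grows as a product, which is how a modest purchase list can yield a library of millions of members.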

Acknowledgements

Leash Biosciences is grateful for the generous cosponsorship of Top Harvest Capital and AlphaMa.

Citation

Andrew Blevins, Ian K Quigley, Brayden J Halverson, Nate Wilkinson, Rebecca S Levin, Agastya Pulapaka, Walter Reade, Addison Howard. (2024). Leash Bio - Predict New Medicines with BELKA. Kaggle. https://kaggle.com/competitions/leash-BELKA


Dataset Description

Overview

The examples in the competition dataset are represented by a binary classification of whether a given small molecule is a binder or not to one of three protein targets. The data were collected using DNA-encoded chemical library (DEL) technology.

We represent chemistry with SMILES (Simplified Molecular-Input Line-Entry System) and the labels as binary binding classifications, one per protein target of three targets.

Files

[train/test].[csv/parquet] - The train or test data, available in both the csv and parquet formats.

  • id - A unique example_id that we use to identify the molecule-binding target pair.
  • buildingblock1_smiles - The structure, in SMILES, of the first building block.
  • buildingblock2_smiles - The structure, in SMILES, of the second building block.
  • buildingblock3_smiles - The structure, in SMILES, of the third building block.
  • molecule_smiles - The structure of the fully assembled molecule, in SMILES. This includes the three building blocks and the triazine core. Note that we use [Dy] as the stand-in for the DNA linker.
  • protein_name - The protein target name.
  • binds - The target column. A binary class label of whether the molecule binds to the protein. Not available for the test set.

sample_submission.csv - A sample submission file in the correct format

Competition data

All data were generated in-house at Leash Biosciences. We are providing roughly 98M training examples per protein, 200K validation examples per protein, and 360K test molecules per protein. To test generalizability, the test set contains building blocks that are not in the training set. These datasets are very imbalanced: roughly 0.5% of examples are classified as binders; we used 3 rounds of selection in triplicate to identify binders experimentally. Following the competition, Leash will make all the data available for future use (3 targets × 3 rounds of selection × 3 replicates × 133M molecules, or 3.6B measurements).
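The figures in this paragraph can be sanity-checked with a few lines of arithmetic. The variable names are illustrative, and the binder count is only an estimate derived from the approximate 0.5% rate quoted above.

```python
# Full dataset size: targets x selection rounds x replicates x molecules.
targets = 3
selection_rounds = 3
replicates = 3
molecules = 133_000_000

measurements = targets * selection_rounds * replicates * molecules
print(f"{measurements:,} measurements")  # about 3.6 billion

# Rough class balance in the training data for one protein.
train_examples_per_protein = 98_000_000
binder_rate = 0.005  # "roughly 0.5%" from the text
approx_binders = round(train_examples_per_protein * binder_rate)
approx_non_binders = train_examples_per_protein - approx_binders
print(f"~{approx_binders:,} binders vs ~{approx_non_binders:,} non-binders")
```

With roughly one binder per 200 examples, plain accuracy is uninformative, which is consistent with the competition's choice of average precision as the metric.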

Targets

Proteins are encoded in the genome, and names of the genes encoding those proteins are typically bestowed by their discoverers and regulated by the HUGO Gene Nomenclature Committee. The protein products of these genes can sometimes have different names, often due to the history of their discovery.

We screened three protein targets for this competition.

EPHX2 (sEH)

The first target, epoxide hydrolase 2, is encoded by the EPHX2 genetic locus, and its protein product is commonly named “soluble epoxide hydrolase”, or abbreviated to sEH. Hydrolases are enzymes that catalyze certain chemical reactions, and EPHX2/sEH also hydrolyzes certain phosphate groups. EPHX2/sEH is a potential drug target for high blood pressure and diabetes progression, and small molecules inhibiting EPHX2/sEH from earlier DEL efforts made it to clinical trials.

EPHX2/sEH was also screened with DELs, and hits predicted with ML approaches (https://blog.research.google/2020/06/unlocking-chemome-with-dna-encoded.html), in a recent study (https://pubs.acs.org/doi/10.1021/acs.jmedchem.0c00452), but the screening data were not published. We included EPHX2/sEH to allow contestants an external gut check for model performance by comparing to these previously published results.

We screened EPHX2/sEH purchased from Cayman Chemical, a life sciences commercial vendor. For those contestants wishing to incorporate protein structural information in their submissions, the amino acid sequence is positions 2-555 from UniProt entry P34913, the crystal structure can be found in PDB entry 3i28, and the predicted structure can be found in AlphaFold2 entry P34913. Additional EPHX2/sEH crystal structures with ligands bound can be found in PDB.

BRD4

The second target, bromodomain 4, is encoded by the BRD4 locus and its protein product is also named BRD4. Bromodomains bind to protein spools in the nucleus that DNA wraps around (called histones) and affect the likelihood that the DNA nearby is going to be transcribed, producing new gene products. Bromodomains play roles in cancer progression and a number of drugs have been discovered to inhibit their activities.

BRD4 has been screened with DEL approaches previously but the screening data were not published. We included BRD4 to allow contestants to evaluate candidate molecules for oncology indications.

We screened BRD4 purchased from Active Motif, a life sciences commercial vendor. For those contestants wishing to incorporate protein structural information in their submissions, the amino acid sequence is positions 44-460 from UniProt entry O60885-1, the crystal structure (for a single domain) can be found in PDB entry 7USK and predicted structure can be found in AlphaFold2 entry O60885. Additional BRD4 crystal structures with ligands bound can be found in PDB.

ALB (HSA)

The third target, serum albumin, is encoded by the ALB locus and its protein product is also named ALB. The protein product is sometimes abbreviated as HSA, for “human serum albumin”. ALB, the most common protein in the blood, is used to drive osmotic pressure (to bring fluid back from tissues into blood vessels) and to transport many ligands, hormones, fatty acids, and more.

Albumin, being the most abundant protein in the blood, often plays a role in absorbing candidate drugs in the body and sequestering them from their target tissues. Adjusting candidate drugs to bind less to albumin and other blood proteins is a strategy to help these candidate drugs be more effective.

ALB has been screened with DEL approaches previously but the screening data were not published. We included ALB to allow contestants to build models that might have a larger impact on drug discovery across many disease types. The ability to predict ALB binding well would allow drug developers to improve their candidate small molecule therapies much more quickly than physically manufacturing many variants and testing them against ALB empirically in an iterative process.

We screened ALB purchased from Active Motif. For those contestants wishing to incorporate protein structural information in their submissions, the amino acid sequence is positions 25 to 609 from UniProt entry P02768, the crystal structure can be found in PDB entry 1AO6, and predicted structure can be found in AlphaFold2 entry P02768. Additional ALB crystal structures with ligands bound can be found in PDB.
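For contestants who download the full UniProt sequences, the screened constructs correspond to sub-ranges of those sequences. The sketch below shows how the quoted positions map to Python slices, assuming the positions are 1-based and inclusive (the usual UniProt convention). The helper name, the dictionary keys, and the toy sequence are hypothetical, and no real sequence data is included.

```python
def subsequence(full_sequence, first, last):
    """Return residues first..last using 1-based, inclusive positions."""
    if not (1 <= first <= last <= len(full_sequence)):
        raise ValueError("positions out of range for this sequence")
    # Python slices are 0-based and end-exclusive, hence first - 1.
    return full_sequence[first - 1:last]


# (UniProt entry, first position, last position) as quoted in the text.
CONSTRUCT_RANGES = {
    "sEH": ("P34913", 2, 555),
    "BRD4": ("O60885-1", 44, 460),
    "HSA": ("P02768", 25, 609),
}

# Expected construct lengths implied by those ranges.
construct_lengths = {
    name: last - first + 1 for name, (_, first, last) in CONSTRUCT_RANGES.items()
}
print(construct_lengths)

# Toy sequence (not a real protein) to show the off-by-one handling.
toy_sequence = "MACDEFGHIK"
fragment = subsequence(toy_sequence, 2, 5)
print(fragment)
```

Checking the sliced length against the expected construct length is a cheap guard against off-by-one errors before feeding sequences into a structure-aware model.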

Good luck!