    datawhale_NLP Task 1

    Posted by chenxia on 2023-08-18 02:48:31
    Summary

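    0x01 Problem Description

    The task is to use machine learning to decide, from a paper's abstract and related metadata, whether the paper belongs to the medical literature.
    Data: the dataset consists of publicdata-train, publicdata-test, and trainB; each record contains a title, authors, an abstract, and keywords.
    Models: task1 takes a classical machine-learning route, TF-IDF text features with a logistic regression (LR) classifier; task2 fine-tunes a BERT model instead.
    Evaluation: the final ranking is based on F1-score.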
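    Because submissions are scored by F1, it can help to recall how the metric is built from precision and recall. The sketch below uses made-up labels (`y_true` and `y_pred` are illustrative, not competition data) and checks a hand computation against scikit-learn's `f1_score`.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 1 = medical paper, 0 = not medical
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the confusion-matrix cells for the positive class by hand
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

# scikit-learn's binary F1 should agree with the manual value
f1_sklearn = f1_score(y_true, y_pred)
print(f1_manual, f1_sklearn)
```

    F1 is the harmonic mean of precision and recall, so a model that predicts one class for everything scores poorly even when accuracy looks acceptable.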

    # input:
    Inflammatory Breast Cancer: What to Know About This Unique, Aggressive Breast Cancer.,

    [Arjun Menta, Tamer M Fouad, Anthony Lucci, Huong Le-Petross, Michael C Stauder, Wendy A Woodward, Naoto T Ueno, Bora Lim],

    Inflammatory breast cancer (IBC) is a rare form of breast cancer that accounts for only 2% to 4% of all breast cancer cases. Despite its low incidence, IBC contributes to 7% to 10% of breast cancer caused mortality. Despite ongoing international efforts to formulate better diagnosis, treatment, and research, the survival of patients with IBC has not been significantly improved, and there are no therapeutic agents that specifically target IBC to date. The authors present a comprehensive overview that aims to assess the present and new management strategies of IBC.,

    Breast changes; Clinical trials; Inflammatory breast cancer; Trimodality care.

    # output:
    1 for yes and 0 for no

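    0x02 Baseline

    The raw text can be turned into numeric features with CountVectorizer (CV) or TF-IDF (TfidfVectorizer).

    A classifier such as LR or LightGBM is then trained on those features.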
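    As a rough illustration of the two steps just described, here is a minimal sketch of the TF-IDF + LR variant on a toy corpus. The texts and labels are invented for the example, and wrapping the steps in `make_pipeline` is an assumption of this sketch rather than something the baseline script does.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical); 1 = medical, 0 = not medical
texts = [
    "breast cancer treatment and patient survival",
    "clinical trial of a new cancer therapy",
    "patient outcomes after heart surgery",
    "graph neural networks for traffic prediction",
    "compiler optimization for parallel programs",
    "neural networks for image compression",
]
labels = [1, 1, 1, 0, 0, 0]

# CountVectorizer keeps raw term counts; TfidfVectorizer reweights them by
# inverse document frequency and L2-normalizes each row by default.
counts = CountVectorizer().fit_transform(texts)
tfidf = TfidfVectorizer().fit_transform(texts)
print(counts.shape, tfidf.shape)  # both use the same vocabulary

# TF-IDF features followed by logistic regression, fitted as one estimator
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["survival of cancer patients"]))
```

    Keeping the vectorizer inside the pipeline means it is fitted only on whatever data is passed to `fit`, which avoids accidentally fitting the vocabulary on test text.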

    import pandas as pd 
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    import warnings
    warnings.filterwarnings("ignore")

    # Read the dataset
    df_train = pd.read_csv('../data/train.csv')
    df_valid = pd.read_csv('../data/test.csv')
    df_test = pd.read_csv('../data/testB.csv')
    df_train = df_train.fillna('')
    df_valid = df_valid.fillna('')  # fillna returns a new frame, so assign it back
    df_test = df_test.fillna('')

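    # Features: join every text column of a row into one string.
    # 'uuid' is an identifier and 'label' is the target, so both are left out
    # (errors='ignore' covers files that have no 'label' column).
    def join_text(row):
        return ' '.join(row.drop(['uuid', 'label'], errors='ignore').astype(str))

    df_train['text'] = df_train.apply(join_text, axis=1)
    df_valid['text'] = df_valid.apply(join_text, axis=1)
    df_test['text'] = df_test.apply(join_text, axis=1)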

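    # Fit TF-IDF on the training text only, then transform all three splits
    vector = TfidfVectorizer().fit(df_train['text'].tolist())

    # vocabulary_ is a dict (term -> column index) whose key order does not
    # follow the column order, so take the names from get_feature_names_out()
    vocab = vector.get_feature_names_out()
    train_vector = vector.transform(df_train['text'])
    valid_vector = vector.transform(df_valid['text'])
    test_vector = vector.transform(df_test['text'])

    # Dense copies with named columns (memory-heavy for large vocabularies)
    df_trainv = pd.DataFrame(train_vector.toarray(),columns=vocab)
    df_validv = pd.DataFrame(valid_vector.toarray(),columns=vocab)
    df_testv = pd.DataFrame(test_vector.toarray(),columns=vocab)

    df_train.describe()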

    # lightGBM model
    import lightgbm as lgb
    data = lgb.Dataset(df_trainv,df_train['label'])
    # data_val = lgb.Dataset(df_validv,df_valid['label'])
    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'binary_logloss',
        'num_leaves': 31,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
    }
    model = lgb.train(params, data)

    y_pred = model.predict(test_vector)
    y_pred = np.where(y_pred>=0.5,1,0)
    df_test['label'] = y_pred

    df_test['Keywords'] = df_test['title'].fillna('')
    df_test[['uuid', 'Keywords', 'label']].to_csv('../data/task1_lightGBM.csv', index=None)
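    The script above trains on the full training file and does not compute F1 locally before writing the submission. One possible way to get a local estimate is to hold out part of the training data, sketched below. The toy texts stand in for `df_train['text']` and `df_train['label']`, and LogisticRegression is swapped in for LightGBM purely to keep the example light; neither choice comes from the original post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df_train['text'] and df_train['label']
texts = [
    "inflammatory breast cancer diagnosis and treatment",
    "clinical trial of cancer therapy in elderly patients",
    "patient survival after heart surgery",
    "antibiotic treatment of lung infection in patients",
    "graph neural networks for traffic prediction",
    "compiler optimization for parallel programs",
    "neural networks for image compression",
    "distributed database query optimization",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Stratify so both classes appear in the held-out part
X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vectorizer on the training part only, so no validation text leaks in
vec = TfidfVectorizer().fit(X_tr)
clf = LogisticRegression().fit(vec.transform(X_tr), y_tr)

# Threshold the positive-class probability at 0.5, like the np.where step above
proba = clf.predict_proba(vec.transform(X_va))[:, 1]
y_hat = (proba >= 0.5).astype(int)

val_f1 = f1_score(y_va, y_hat, zero_division=0)
print("validation F1:", val_f1)
```

    On the real data the same split could be applied to `df_train` before fitting the vectorizer, and the 0.5 threshold could then be tuned against the held-out F1.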

    0x03 Results

    The final submission result:

    lightGBM result


