Clean Messy Clinical Data Using Python


PharmaSUG 2022 - Paper EP-07
Clean Messy Clinical Data Using Python
Abhinav Srivastva, Exelixis Inc

ABSTRACT
Data in clinical trials can be transmitted in various formats such as Excel, CSV, tab-delimited, or ASCII files, or via an Electronic Data Capture (EDC) tool. A potential problem arises when the data contains embedded special characters or even non-printable (control) characters, which affect all downstream analysis and reporting, in addition to being non-compliant with CDISC standards. The paper explores various ways to handle these unwanted characters by leveraging the capabilities of the Python language. The discussion begins with correcting these characters using their ASCII codes, then their Unicode code points, and later explores the normalization forms NFC, NFD, NFKC, and NFKD for normalizing Unicode characters. Oftentimes, data cleaning requires a combination of these techniques to get the desired result.

INTRODUCTION
Python provides some useful libraries that can aid in identifying and treating unwanted characters in text, some of which closely resemble facilities in the SAS programming language. The paper uses regular expressions and character translation available in Python for most of these tasks. Below is a summary of the data cleaning techniques covered in the paper:

Example 1: Using Replace and Translate string methods
Example 2: Using Regular Expressions
Example 3: Using ASCII Codes
Example 4: Using Unicode
Example 5: Using Normalization (NFC, NFD, NFKC, NFKD)
Example 6: Flagging special characters for review and removing control characters

Create a Test String
Before looking at the techniques, let's take a clean test string and add special and control characters to it.

clean = 'The quick brown fox jumps over the lazy dog!'

Next, let's add two special characters (ü and ó) and a carriage return (\r) to the test string and store the result in a variable called messy.

# add special characters (ü and ó)
sp = 'The qüick brown fóx jumps over the lazy dóg!'

# add a control character (carriage return)
messy = sp + chr(13)
messy
>>[Out] 'The qüick brown fóx jumps over the lazy dóg!\r'

Example 1: Using REPLACE and TRANSLATE string methods
We can use Python's replace and translate string methods to selectively remove the special and control characters that we added above, as demonstrated below:

# Using replace()
clean_replace = messy.replace('ü', 'u').replace('ó', 'o').replace('\r', '')
clean_replace
>>[Out] 'The quick brown fox jumps over the lazy dog!'

# Using translate()
charmap = {'ü': 'u', 'ó': 'o', '\r': None}
trans_table = messy.maketrans(charmap)
clean_trans = messy.translate(trans_table)
clean_trans
>>[Out] 'The quick brown fox jumps over the lazy dog!'

Both of these methods got rid of the unwanted characters, and the result matches the original clean string exactly.

# Check equality, otherwise print a custom error message
assert clean_trans == clean, "Strings don't match"

Example 2: Using Regular Expressions
To use regular expressions and perform string manipulation, re and string are the two libraries that need to be imported. A pattern can be built to include letters, numbers, and punctuation, as shown below. A list of punctuation characters can be conveniently obtained by calling string.punctuation.

# Import modules
import re, string

# Call string.punctuation
string.punctuation
>>[Out] '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Next, we can build a pattern as:

# Build a regular expression pattern
re.sub(r'[^a-zA-Z0-9! ]', '', messy)
>>[Out] 'The qick brown fx jumps over the lazy dg!'

Notice that this method simply got rid of all unwanted characters, which is not the desired result, as the meaning of the text has also been altered. Therefore, a careful review is needed before deleting characters.
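Before deleting anything, the same character class can be turned around to surface the offending characters for that review. This is a minimal sketch not shown in the paper; the variable name suspects is illustrative.

```python
import re

messy = 'The qüick brown fóx jumps over the lazy dóg!' + chr(13)

# Find (rather than remove) every character outside the allowed set,
# so a reviewer can decide how each one should be treated.
suspects = sorted(set(re.findall(r'[^a-zA-Z0-9! ]', messy)))
print(suspects)   # ['\r', 'ó', 'ü']
```

A quick scan of this list makes it clear that ü and ó should be mapped to u and o (Example 1) rather than dropped.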

Example 3: Using ASCII
A broad classification of ASCII codes is shown in Table 1.

Type                                  ASCII Codes
Non-printable (control) characters    0-31, 127
ASCII extended characters (special)   128-255
ASCII printable characters            32-126

Table 1: ASCII code classification

Knowing these ranges, we can use the codes to remove unwanted characters with the translate method, as shown below:

# Using ASCII codes from Table 1
np_sp = set(map(chr, list(range(0, 32)) + list(range(127, 256))))
ord_dict = {ord(character): None for character in np_sp}

def filter_np_sp(text):
    return text.translate(ord_dict)

clean_ascii = filter_np_sp(messy)
clean_ascii
>>[Out] 'The qick brown fx jumps over the lazy dg!'

As an alternative to listing ASCII codes manually, Python's string.printable attribute provides a ready-made list of printable characters.

string.printable
>>[Out] '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

The last few characters in this list are control characters, so the list can be trimmed slightly to get a clean set of printable characters.

print_list = string.printable
excl_list = '\t\n\r\x0b\x0c'
if excl_list in print_list:
    print_list = print_list.replace(excl_list, '')
print_list
>>[Out] '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ '

Finally, this trimmed print_list can be used to keep only the printable character set, as shown below.

clean_print = ''.join([x if x in print_list else '' for x in messy])
clean_print
>>[Out] 'The qick brown fx jumps over the lazy dg!'
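The Table 1 ranges can also be applied directly with ord(), without building a translation table at all. This variant is not in the paper; it is a brief sketch of the same ASCII filtering idea.

```python
messy = 'The qüick brown fóx jumps over the lazy dóg!' + chr(13)

# Keep only ASCII printable characters (codes 32-126 per Table 1);
# control codes (0-31, 127) and extended codes (128-255) are dropped.
clean_ord = ''.join(c for c in messy if 32 <= ord(c) <= 126)
print(clean_ord)   # The qick brown fx jumps over the lazy dg!
```

The result is identical to the translate-based version above; the translate approach scales better when the keep/drop rules grow beyond a simple numeric range.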

Just like Example 2: Using Regular Expressions, this approach got rid of all the unwanted characters, which is not the desired outcome in this case. Instead of deleting special characters, flagging them for review is a better solution; this is covered in Example 6: Flagging Special Characters but dropping Control characters.

Example 4: Using Unicode
Unicode provides a rich set of characters beyond the ASCII codes. Unicode characters are also grouped into categories, as shown in the partial listing in Table 2. The complete list can be looked up as pointed out in the REFERENCES section.

Abbr   Long               Description
Lu     Uppercase Letter   an uppercase letter
Ll     Lowercase Letter   a lowercase letter
Lt     Titlecase Letter   a digraphic character, with first part uppercase
Cc     Control            a C0 or C1 control code

Table 2: Unicode classification

For the Unicode method, we first import the unicodedata module; we can also check the number of Unicode code points available.

# import modules
import unicodedata, sys

# print the number of Unicode code points
print("{:,}".format(sys.maxunicode))
>>[Out] 1,114,111

Let's first remove control characters using the Cc category from Table 2.

all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)

# use regular expressions
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

clean_unc = remove_control_chars(messy)
clean_unc
>>[Out] 'The qüick brown fóx jumps over the lazy dóg!'

Looking at the output above, the carriage return \r is no longer present in the string. For removing special characters, or for keeping only printable characters, the Unicode list is quite large to filter through. For example, the Lu category for uppercase characters (Table 2) yields a vast list that goes well beyond the English alphabet.
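Individual characters can be classified the same way, which is a quick sanity check on how Table 2 applies to our test string. This small sketch is an addition for illustration, not code from the paper.

```python
import unicodedata

messy = 'The qüick brown fóx jumps over the lazy dóg!' + chr(13)

# Inspect the Unicode general category of a few characters from messy.
for ch in ['T', 'ü', 'ó', '\r']:
    print(repr(ch), unicodedata.category(ch))
# 'T'  -> Lu (uppercase letter)
# 'ü'  -> Ll (lowercase letter)
# 'ó'  -> Ll (lowercase letter)
# '\r' -> Cc (control)
```

Note that ü and ó fall in the same Ll category as ordinary lowercase letters, which is why category filtering alone cannot separate them from plain ASCII text.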

# Exploring the UPPERCASE Unicode character set with Lu (Table 2)
all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Lu'}
uppercase_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
uppercase_chars
>>[Out] 'ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÜÝÞĀĂĄĆĈĊČĎĐĒĔĖ ĘĚĜĞĠĢĤĦĨĪĬĮİIJĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽƁƂƄƆƇƉƊƋƎƏƐƑƓƔƖƗ ƝƟƠƢƤƦƧƩƬƮƯƱƲƳƵƷƸƼDŽLJNJǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮDZǴǶǷǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨ ȪȬȰȲȺȻȽȾɁɃɄɅɆɈɊɌɎͰͲͶΆΈΉΊΌΎΏΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫϏϒϓϔϘϚϜϞϠϢϤϦϨ ϪϬϴϷϹϺϽϾϿЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯѠѢѤѦѨѪѬѮѰѲ ѴѶѺѼѾҀҊҌҎҐҒҔҖҘҚҜҞҠҢҤҦҨҪҬҮҰҲҴҶҸҺҼҾӀӁӃӅӇӉӋӍӐӒӔӖӘӚӜӞӠӢӤӦӨӪӬӮӰӲӴӶӸӺӼӾԀԂ ԄԈԊԌԎԐԒԔԖԘԚԜԞԠԢԤԦԱԲԳԴԵԶԷԸԹԺԻԼԽԾԿՀՁՂՃՄՅՆՇՈՉՊՋՌՍՎՏՐՑՒՓՔՕՖႠႡႢႣႤႥ... many more...

Deleting special characters using the Unicode method can be especially useful when data is transmitted over formats allowing non-English characters. In some instances, however, the text carries meaningful information that should be normalized, retaining the relevant content instead of dropping it, as covered in Example 5: Normalization.

Example 5: Normalization
Unicode characters can be normalized to extract the relevant information contained in them. Unicode characters can be composed (separate characters merged into a single code point) or decomposed (a single code point broken into separate characters). Based on this idea, four normalization forms exist: NFC, NFD, NFKC, and NFKD. More details can be looked up at the link provided in the REFERENCES section.

Applying decomposition (NFKD) to our example results in:

# Using the NFKD normalization form
print(ascii(unicodedata.normalize("NFKD", messy)))
>>[Out] 'The qu\u0308ick brown fo\u0301x jumps over the lazy do\u0301g!\r'

As evident from the result, ü is decomposed into u and the combining mark \u0308, and likewise ó is decomposed into o and \u0301. These unwanted marks, \u0308 and \u0301, can then be easily removed, as demonstrated below.
# print_list is the set of printable characters built in Example 3
def norm_func(data):
    return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in print_list)

clean_norm = norm_func(messy)
clean_norm
>>[Out] 'The quick brown fox jumps over the lazy dog!'

This approach turns out to be the best solution for our test string so far, as none of its meaning is lost.

Example 6: Flagging Special Characters but dropping Control characters
A good starting point is to flag special characters for review and adjudicate them on a case-by-case basis. Dropping control characters, however, is a safe option, as they do not alter the meaning of the text. In this example, we read a CSV file (sample data provided in the APPENDIX) and flag unwanted characters by storing them as a Python list in a new column of the revised CSV file.

# Read a CSV file called messy_file.csv and write to clean_file.csv
import csv

# ord_dict and print_list are reused from Example 3
mask_print = {ord(char): None for char in print_list}  # mask the printable characters

with open('messy_file.csv', errors='ignore') as csv_f:
    reader = csv.reader(csv_f)
    with open('clean_file.csv', 'w', newline='') as wf:
        csv_writer = csv.writer(wf, delimiter=',')
        for row in reader:
            data = [i.translate(ord_dict).strip() for i in row]            # exclude control chars
            spchr = [''.join(set(i.translate(mask_print))) for i in row]   # collect special chars
            spchr = list(filter(None, set(spchr)))                         # filter out empty strings
            if len(spchr) == 0:
                final_data = data + spchr
            else:
                final_data = data + [spchr]
            csv_writer.writerow(final_data)

However, if all unwanted characters were simply removed without review, the results would be very different, as can be observed by running this code.

# remove all unwanted characters
with open('messy_file.csv', encoding="utf-8", errors='ignore') as csv_f:
    reader = csv.reader(csv_f)
    with open('excl_all.csv', 'w', newline='') as wf:
        csv_writer = csv.writer(wf, delimiter=',')
        for row in reader:
            data = [i.encode('ascii', 'ignore').decode('ascii').translate(ord_dict).strip() for i in row]
            csv_writer.writerow(data)

CONCLUSION
The paper presented several options that can be considered when dealing with special and control characters. The approach of Example 6: Flagging Special Characters but dropping Control characters can serve as a good starting reference when processing external files (CSV, Excel, etc.) from a vendor, where a detailed data quality check is required.

REFERENCES
[1] Unicode reference: http://www.unicode.org/reports/tr44/#General_Category_Values
[2] Unicode normalization: https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Name: Abhinav Srivastva
Enterprise: Exelixis Inc
asrivastva@exelixis.com

APPENDIX
Sample data (messy_file.csv) with columns TEXT1 and TEXT2; the ** Special chars ** column shows the characters flagged for review in each row.

TEXT1 | TEXT2 | ** Special chars **
Liver toxicity when bilirubin <= 1.5 ULN | Otherwise normal | [' ']
Capable of understanding and complying | Site error can be reported | [' ']
Investigator-assessed clinical benefit which outweighs the potential risks | Lábs taken from all sources when possible |
Test to be performed every 2 weeks | Recovery to baseline or <= Grade 1 CTCAE v5 from toxicities |
Continue at the current dóse level with supportive care | Known brain metastases or cranial epidural disease | ['á', 'ó']
Platelets >= 100,000/µL without transfµsion (@ST) | >=3 ULN. | ['µ']
This will be false positive | This will be false negative | ['\n']
Radiographic tumor assessments are to continued | CT to be performed | [' ', '\n']
Radiation therapy for bone metastasis within 2 weeks prior to any screening assessments | |