Datasets and code can be found here.
In the rapidly evolving landscape of healthcare, leveraging data to drive informed decisions is no longer a luxury but a necessity. My latest project delves deep into the realm of healthcare analytics, harnessing the power of SQL databases and Python to uncover actionable insights.
This endeavor encompasses meticulous data preparation, patient segmentation through clustering algorithms, robust predictive modeling to forecast hospitalization charges and assess patient risk scores, geospatial analysis to visualize healthcare metrics across regions, and the development of an interactive dashboard for real-time monitoring. Here's a comprehensive overview of the journey and the methodologies employed.
1. Data Architecture: Laying the Foundation
2. Data Preparation: Cleaning and Structuring SQL Integration
3. Patient Segmentation: Unveiling Clusters
4. Predictive Modeling: Forecasting Charges and Assessing Risks
5. Geospatial Analysis: Mapping Metrics Across States
6. Cost Analysis: Unveiling Financial Insights
7. Chronic Disease Management: Predicting Complications
8. Dashboard development: Real-time Monitoring with Streamlit
9. Scripts: Structuring and Querying the Data
10. Conclusion
At the heart of any data-driven project lies a well-structured database. Our healthcare dataset is organized within the healthcare_data
schema, comprising three pivotal tables:
Medical Examination Data
Patient health metrics and medical history information
Customer_ID |
Customer_BMI |
HBA1C |
Heart_Issues |
Any_Transplants |
Cancer_history |
NumberOfMajorSurgeries |
smoker |
1 |
47.41 |
7.47 |
No |
No |
No |
0 |
yes |
2 |
30.36 |
5.77 |
No |
No |
No |
0 |
yes |
3 |
34.485 |
11.87 |
yes |
No |
No |
2 |
yes |
Hospitalization Details
Patient hospitalization records and associated costs
Customer_ID |
date |
children |
charges |
Hospital_tier |
City_tier |
State_ID |
2335 |
07-09-92 |
0 |
$563.84 |
2 |
3 |
R1013 |
2334 |
07-09-92 |
0 |
$570.62 |
2 |
1 |
R1013 |
2333 |
06/30/1992 |
0 |
$600.00 |
2 |
1 |
R1013 |
Patient Information
Basic patient identification data
Customer_ID |
name |
1 |
Hawks, Ms. Kelly |
2 |
Lehner, Mr. Matthew D |
3 |
Lu, Mr. Phil |
Before diving into any analysis, it's imperative to ensure the data's integrity and consistency. The preparedataforSQL.py script is meticulously crafted to cleanse and preprocess the raw data, making it SQL-ready. Here's a breakdown of its functionalities:
- Processing hospitalization_details.csv: entries with invalid or missing dates are removed, then converts textual month representations (January, February) to numerical values, and restructures the date into a standarized time, following a mm/dd/yyyy format. 'ID' prefix is striped from Customer_ID and we ensure it's an integer. Numerical values are also striped from Hospital_tier and City_tier.
- Processing Medical_Examinatios.csv: ID normalization once again, transforms textual description of major surgeries into numerical counts.
- Processing Names.xlsx: we ensure consistency across datasets here.
This complete cleaning ensures that the data fed into our analytical model is both accurate and reliable.
Understanding patient segments is crucial for personalized healthcare strategies, and the patientsegmentation.py script employs k-means clustering algorithm to identify distinct patient groups based on various health metrics and demographic factors.
How so? First we have to connect to the healthcare_data, during this project I made use of MySQL database. Then features are selected: age, children, charges, hospital_tier, city_tier, BMI, HBA1C, heart_issues, any_transplants, cancer_history, numberofmajorsurgeries and smoker.
The data is standarized and features are scaled using StandardScaler to ensure uniformity across all of them. And for optimal cluster determination, we employ the Elbow Method (helps find the optimal number of clusters (k) in k-means clustering by plotting the sum of squared distances (inertia) against different k values) to ascertain the ideal number of clusters, visualized through an inertia plot. With the chosen number of clusters (k=4) we execute it, and project the high-dimensional data into two principal components using PCA for intuitive visualization.
Elbow Method for Optimal Cluster Selection
The plot of inertia versus the number of clusters depicts a clear elbow at around k=4, indicating that four clusters is the ideal number to segment the patients.
Patient Clusters Visualization
Visual representation of the four distinct clusters derived from our analysis. Each color represents a different cluster, and the two principal components (PC1 and PC2) are used to reduce the data's dimensions for easier visualization.
The scatterplot reveals clear separation among the clusters, with some overlap between clusters 1, 2 and 3, suggesting shared characteristics among these groups, while cluster 0 stands distinctly apart. This visualization provides a high-level understanding of how the clusters are organized and the relationships between the patient groups.
These clusters uncovered distinct patient profiles that align with specific health and demographic characteristics, next is a breakdown of them:
- Cluster 0, 144 patients: youngest group with an average age of 20.81, least likely to have children with an average of 0.34, highest number of major surgeries at 1.13 average and no presence of cancer history.
Smallest group but their profile stands out due to their high surgery rate despite their young age. They may benefit from additional monitoring or pre-surgery interventions to improve outcomes.
- Cluster 1, 805 patients: oldest group with an average age of 54.11, moderate presence of children with an average of 0.65, relatively low surgeries at 0.90 on average, cancer history is presence but low, at 2.36%.
Second largest group, representing an older demographic but with stable health. Might need targeted chronic care management ato prevent further complications.
- Cluster 2, 372 patients: middle-aged group, with an average age of 36.62, higher than average children at 0.97 and average surgery rate at 1.0 per person. 100% of patients have a cancer history.
Small cluster, all of whom have a cancer history, which can be prioritized for ongoing treatment and care coordination.
- Cluster 3, 1008 patients: average age similar to the cluster above, at 34.49. Highest children average at 1.44 and fewest surgeries, at 0.32 per patient. No significant presence of cancer or smokers in this group.
Largest group, reflecting a more typical, low-complication patient group, requiring fewer medical interventions. This group can be managed with preventive strategies to keep their health stable and reduce the likelihood of future complications.
The predictionmodeling.py script takes a deep dive into predictive analytics, focusing on two primary objectives: 1) predicting hospitalization charges, and 2) creating risk scores for patient complications.
Both tasks are essential in modern healthcare systems, as they help optimize financial planning and identifying hih-risk patients to deliver targeted, personalized care.
Accurate forecasting of hospitalization charges is crucial for effective financial management and planning, it enables hospitals and healthcare providers to anticipate costs and optimize resource allocation, ensuring efficing healthcare delivery.
The data was extracted from the hospitalization_details and medical_examination tables, combining a rich set of features including patient demographics, health metrics and hospitalization details. Then the features were selected (same as previously) and a Gradient Boosting Regressor was utilized due to its ability to handle complex interactions between features. It is particularly suited for capturing non-linear relationships in healthcare costs. A grid search was conducted with cross-validation to fine-tune hyperparameters such as learning rate, max depth and the number of estimators.
For evaluation, the model's performance was evaluated using the Mean Squared Error (MSE) and R² score to measure its predictive accuracy. Feature importance were visualized to understand which factors had the most significant impact on hospitalization charges.
The model's SME was 77,051,054.69 which indicates the average squared difference between predicted and actual charges.
The model achieved an R² score of 0.43, meaning that 43% of the variance in hospitalization charges can be explained by the model. While the score reflects a reasonable predictive capacity, it also suggests there are additional factors influencing charges that were not captured currently.
The top 3 most influential features driving hospitalization charges were: 1) hospital tier (57%), 2) BMI (22%) and 3) age (18%).
Model Results Dashboard
Mean Squared Error
77,051,054.69
Feature Importance in Hospitalization Charges
Hospital Tier
57%
BMI
22%
Age
18%
Other
3%
Now, for the risk scores for patient complications. Identifying patients at high risk of developing complications is critical for proactive healthcare management. By predicting which patients are more likely to experience severe health issues, healthcare provides can prioritie interventions, reduce complications, and potentially lower overall costs.
Complications, for the model, were defined primarly by the presence of heart issues or any transplants, as these conditions typically indicate a higher likelihood of severe health outcomes. Gradient Boosting Classifier was used to predict the likelihood of complications, with Grid Search for cross-validation to optimize hyperparameters.
The model was this time evaluated using a classification report that measured key metrics such as precision, recall and F-1 score. The ROC AUC score was also calculated to measure the model's ability to discriminate between high-risk and low-risk patients, which were classified into: low risk, moderate risk and high risk.
The model performed exceptionally well, achieving high precision and recall for both risk groups, with an overally accuracy of 97%.
Risk Assessment Model Dashboard
Patient Risk Classification
Low Risk: Regular monitoring, preventive care
Moderate Risk: Enhanced monitoring, early intervention
High Risk: Intensive monitoring, immediate intervention
Optimal Model Hyperparameters
Learning Rate
0.1
Max Depth
7
Min Samples Leaf
4
Min Samples Split
10
N Estimators
100
The ability to predict hospitalization costs and assess patient risk based on complications offers powerful tools for both operational efficiency and patient care. These insights can inform financial planning, resource allocation, and help healthcare providers deliver more personalized care to patients, ultimately improving outcomes and reducing costs.
The geospatial analysis reveals regional disparities in healthcare charges and patient health metrics. Such insights are instrumental in identifying areas that require targeted interventions, optimizing resource distribution, and formulating region-specific healthcare policies.
Regional Healthcare Metrics Distribution
Interactive visualization showing the distribution of healthcare metrics across different states, highlighting regional patterns and disparities
Effective cost management is crucial for the sustainability of healthcare institutions. The costanalysis.py script conducts an in-depth analysis of hospitalization charges, exploring their distribution, influencing factors, and interrelationships with other health metrics.
Data was loaded from cost_analysis_view, a dataset that includes patient demographics health metrics, and hospitalization charges.
With just a quick EDA (Exploratory Data Analysis) we can gather a lot of information already:
1) Distribution of Charges: as visualized in the histogram, the distribution of hospitalization charges is righ-skewed, with most patients incurring charges below $20,000, while a few cases exceed $50,000. This skew suggests that while most hospitalizations remain moderately priced, outliers drive up the average.
2) Charges vs. Age: a scatterplot shows a general upward trend in hospitalization charges as patient age increases. Although the relationship between age and charges is not strictly linear, older patients tend to face higher costs, potentially due to the complexity of care required.
3) Charges by Hospital Tier: a boxplot reveals significant differences in charges across hospital tiers. Tier 1 hospitals, often the most advanced or specialized, show the highest median charges, with a wider range of costs compared to Tier 2 and Tier 3 hospitals. This suggests that hospital tier is a key determinant of the financial burden.
4) Correlation Matrix: a heatmap visualizes the correlations between key variables. Charges are strongly negatively correlated with hospital_tier, meaning that higher-tier hospitals (with lower numerical values for tier) are associated with higher costs.
Distribution of Charges
Charges vs. Age Relationship
Correlation Matrix Heatmap
This cost analysis provides actionable insights for healthcare administrators to refine budgeting and resource allocation strategies. By focusing on high-cost areas, such as patients with elevated BMI or older age groups, healthcare providers can design interventions that reduce costs while maintaining quality care.
The Ordinary Least Squares (OLS) regression was employed to quantify the impact of various factors on hospitalization charges. The OLS regression model achieved an Adjusted R-squared of 0.411, indicating that the model explains about 41.1% of the variability in hospitalization charges. Although there is room for improvement, this provides valuable insights into key cost drivers.
Significant predictors as age (older patients, higher costs, about $310 per year of age), also having more children correlates with a slight increase in charges, hospital tier, as expected, where higher tier hospitals are associated with significantly higher charges. Also higher BMI is a significant predictor, with an average of $380 increase per unit increase in BMI.
The VIF calculations reveal no severe multicollinearity among the predictors, as all VIF values were below 5, except for the constant. This suggests the model is stable and that the predictors are not highly redundant.
This cost analysis provides actionable insights for healthcare administrators to refine budgeting and resource allocation strategies. By focusing on high-cost areas, such as patients with elevated BMI or older age groups, healthcare providers can design interventions that reduce costs while maintaining quality care.
Managing chronic diseases effectively is critical for improving patient outcomes while reducing healthcare costs. The chronicdisease.py script leverages machine learning techniques to predict complications in patients with chronic conditions, stratifying risk levels based on various health factors.
Categorical variables were one-hot encoded where necessary, ensuring that the machine learning algorithms could appropriately interpret the data.
The Gradient Boosting Classifier was chosen for its strong ability to handle imbalanced datasets and provide high predictive accuracy. The classifier was wrapped in a pipeline to streamline preprocessing, scaling numerical features, and encoding categorical ones.
A Grid Search with cross-validation was conducted to fine-tune the hyperparameters, optimizing the classifier's performance.
Model Performance Metrics
Evaluation metrics including Precision, Recall, and F1-Score showing the model's strong discriminative ability with ROC AUC score of 0.85
ROC Curve Analysis
ROC curve demonstrating the model's excellent predictive power across different classification thresholds
Top 10 Predictive Features
Feature importance analysis highlighting HBA1C and Number of Major Surgeries as the most significant predictors for complications
To facilitate real-time monitoring and interactive exploration of healthcare data, an intuitive dashboard was developed using Streamlit. The app.py script orchestrates various data visualizations and functionalities, providing stakeholders with a comprehensive overview of patient metrics, facility utilization, and financial insights.
The dashboard serves as a centralized hub for monitoring healthcare operations, enabling data-driven decision-making. Its interactive nature allows users to drill down into specific metrics, fostering a deeper understanding of underlying patterns and trends.
Interactive Healthcare Analytics Dashboard
Real-time interactive dashboard built with Streamlit, showcasing key healthcare metrics, patient data, and financial insights
Scripts can be found at my GitHub, if you have any questions feel free to contact me on LinkedIn!
This project exemplifies the transformative potential of data analytics in healthcare. By integrating robust data preparation, advanced machine learning techniques, and interactive visualization tools, it provides a comprehensive framework for understanding and improving patient care, optimizing financial operations, and strategically planning healthcare services. As healthcare continues to evolve, such data-driven approaches will be instrumental in fostering a more efficient, responsive, and patient-centric healthcare system.