An Efficient Spark-Based Hybrid Frequent Itemset Mining Algorithm for Big Data
Abstract
Frequent itemset mining (FIM) is a common approach for discovering hidden frequent patterns in transactional databases, used in prediction, association rules, classification, etc. Apriori is an elementary, iterative FIM algorithm that scans the dataset multiple times to generate frequent itemsets of different cardinalities. Apriori's performance degrades as data grow larger because of these repeated dataset scans. Eclat is a scalable variant of the Apriori algorithm that utilizes a vertical layout.
The vertical layout has many advantages: it avoids scanning the dataset multiple times and carries the information needed to compute each itemset's support. In a vertical layout, itemset support can be obtained by intersecting transaction ids (tidsets/tids) and pruning irrelevant itemsets. However, when tidsets become too large for memory, the algorithm's efficiency suffers. In this paper, we introduce SHFIM (Spark-based hybrid frequent itemset mining), a three-phase algorithm that utilizes both the horizontal and the vertical layout, with diffsets instead of tidsets to keep track of the differences between transaction ids rather than the intersections. Moreover, some improvements are developed to decrease the number of candidate itemsets. SHFIM is implemented and tested over the Spark framework, which utilizes the RDD (resilient distributed dataset) concept and in-memory processing to tackle the MapReduce framework's shortcomings. We compared the performance of SHFIM with Spark-based Eclat and dEclat algorithms on four benchmark datasets. Experimental results show that SHFIM outperforms the Spark-based Eclat and dEclat algorithms on both dense and sparse datasets in terms of execution time.
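To make the tidset/diffset distinction concrete, here is a minimal Python sketch of the general dEclat-style technique (an illustration under simplified assumptions, not the SHFIM implementation): support can be counted either by intersecting tidsets or by tracking diffsets against a prefix and subtracting.

```python
# Minimal sketch (illustrative, not SHFIM): tidset intersection vs. diffsets.

# Vertical layout: each item maps to the set of transaction ids containing it.
tidsets = {
    "A": {1, 2, 3, 5},
    "B": {1, 2, 4, 5},
}

# Tidset approach: support({A, B}) = size of the tid intersection.
support_ab = len(tidsets["A"] & tidsets["B"])             # 3 (transactions 1, 2, 5)

# Diffset approach (dEclat): store differences instead of intersections.
# d(AB) = t(A) - t(B);  support(AB) = support(A) - |d(AB)|.
diffset_ab = tidsets["A"] - tidsets["B"]                  # {3}
support_ab_diffset = len(tidsets["A"]) - len(diffset_ab)  # 3, same result

print(support_ab, support_ab_diffset)
```

When tidsets are long and overlap heavily, the diffsets are much smaller, which is why tracking differences eases the memory pressure described above.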
Mohamed Reda Al-Bana, Marwa Salah Farhan, Nermin Abdelhakim Othman
The Impact of Global Structural Information in Graph Neural Networks Applications
Abstract
Graph Neural Networks (GNNs) rely on the graph structure to define an aggregation strategy where each node updates its representation by combining information from its neighbours. A known limitation of GNNs is that, as the number of layers increases, information gets smoothed and squashed and node embeddings become indistinguishable, negatively affecting performance. Therefore, practical GNN models employ few layers and only leverage the graph structure in terms of limited, small neighbourhoods around each node. Inevitably, practical GNNs do not capture information depending on the global structure of the graph. While there have been several works studying the limitations and expressivity of GNNs, the question of whether practical applications on graph-structured data require global structural knowledge or not remains unanswered. In this work, we empirically address this question by giving several GNN models access to global information and observing the impact it has on downstream performance. Our results show that global information can in fact provide significant benefits for common graph-related tasks. We further identify a novel regularization strategy that leads to an average accuracy improvement of more than 5% on all considered tasks.
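For background on the locality limitation discussed above, here is a minimal Python sketch of one message-passing layer with mean aggregation (a generic illustration, not the specific models evaluated in the paper); after k such layers, a node's embedding depends only on its k-hop neighbourhood.

```python
import numpy as np

def gnn_layer(A, X, W):
    """One mean-aggregation message-passing layer (generic sketch)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)   # neighbourhood sizes
    H = (A_hat / deg) @ X                    # average each node's neighbourhood
    return np.maximum(H @ W, 0)              # linear transform + ReLU

# Toy 3-node path graph: after one layer, node 0 sees only nodes {0, 1};
# global structure never enters unless global features are injected explicitly.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X, W = rng.normal(size=(3, 4)), rng.normal(size=(4, 4))
print(gnn_layer(A, X, W).shape)  # (3, 4)
```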
Davide Buffelli, Fabio Vandin
A Repertoire of Virtual-Reality, Occupational Therapy Exercises for Motor Rehabilitation Based on Action Observation
Abstract
There is a growing interest in action observation treatment (AOT), i.e., a rehabilitative procedure combining action observation, motor imagery, and action execution to promote the recovery, maintenance, and acquisition of motor abilities. AOT studies have employed basic upper-limb gestures as stimuli, but, in principle, the AOT approach can be effectively extended to more complex actions like occupational gestures. Here, we present a repertoire of virtual-reality (VR) stimuli depicting occupational therapy exercises intended for AOT, potentially suitable for occupational safety and injury prevention. We animated a humanoid avatar by fitting the kinematics recorded from a healthy subject performing the exercises. All the stimuli are available via a custom-made graphical user interface, which allows the user to adjust several visualization parameters, such as the viewpoint, the number of repetitions, and the observed movement's speed. Beyond providing clinicians with a set of VR stimuli promoting, via AOT, the recovery of goal-oriented occupational gestures, such a repertoire could extend the use of AOT to the field of occupational safety and injury prevention.
Emilia Scalona, Doriana De Marco, Arturo Nuara, Adolfo Zilli, Maddalena Fabbri-Destro, Maria Chiara Bazzini, Elisa Taglione, Fabrizio Pasqualetti, Generoso Della Polla, Nicola Francesco Lopomo, Pietro Avanzini
TBCOV: Two Billion Multilingual COVID-19 Tweets with Sentiment, Entity, Geo, and Gender Labels
Abstract
As the world struggles with several compounded challenges caused by the COVID-19 pandemic in the health, economic, and social domains, timely access to disaggregated national and sub-national data is important for understanding the emergent situation but difficult to obtain. The widespread usage of social networking sites, especially during mass convergence events such as health emergencies, provides instant access to citizen-generated data offering rich information about public opinions, sentiments, and situational updates, useful for authorities to gain insights. We offer a large-scale social sensing dataset comprising two billion multilingual tweets posted from 218 countries by 87 million users in 67 languages. We used state-of-the-art machine learning models to enrich the data with sentiment labels and named entities. Additionally, a gender identification approach is proposed to determine user gender. Furthermore, a geolocalization approach is devised to geotag tweets at country, state, county, and city granularities, enabling a myriad of data analysis tasks for understanding real-world issues at national and sub-national levels. We believe this multilingual dataset, with broad geographical and long temporal coverage, will be a cornerstone for researchers studying the impacts of the ongoing global health catastrophe and managing its adverse consequences for people's health, livelihood, and social well-being.
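For illustration, the per-tweet enrichment described above could be represented as a record like the following sketch; the field names are assumptions made for exposition, not the released TBCOV schema.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for one enriched tweet (illustrative only).
@dataclass
class EnrichedTweet:
    tweet_id: int
    lang: str                        # one of the 67 languages
    sentiment: str                   # model-assigned sentiment label
    named_entities: list = field(default_factory=list)  # extracted entities
    user_gender: str = ""            # output of the gender identification step
    country: str = ""                # geotag granularities:
    state: str = ""                  #   country / state / county / city
    county: str = ""
    city: str = ""

example = EnrichedTweet(tweet_id=1, lang="en", sentiment="negative",
                        named_entities=["WHO"], user_gender="female",
                        country="US", state="California",
                        county="Alameda", city="Oakland")
```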
Muhammad Imran, Umair Qazi, Ferda Ofli
Knowledge Management Model for Smart Campus in Indonesia
Abstract
The application of smart campuses (SC), especially at higher education institutions (HEI) in Indonesia, is very diverse and does not yet have standards. As a result, SC practice is spread across various areas in an unstructured and uneven manner. Knowledge management (KM) is one of the critical components of SC; however, the use of KM to support SC has been less clearly discussed. Most implementations and assumptions still consider the latest IT applications to be the SC components. As such, this study aims to identify the components of a KM model for SC. This study used a systematic literature review (SLR) technique with PRISMA procedures, an analytic hierarchy process (AHP), and expert interviews. The SLR is used to identify the components of the conceptual model, and the AHP is used for priority analysis of the model components. Interviews were used for validation and model development. The results show that KM, IoT, and big data have the highest trends among technologies, while governance, people, and smart education have the highest trends among components. IT is the highest-priority component. The KM model for SC has five main layers grouped in phases of the system cycle. This cycle describes the organization's intellectual ability to adapt in achieving SC indicators. The knowledge cycle at HEIs focuses on education, research, and community service.
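As a pointer to how AHP prioritization works in general, the sketch below derives priority weights from a pairwise comparison matrix via its principal eigenvector; the matrix entries are illustrative Saaty-scale judgments, not the study's actual expert data.

```python
import numpy as np

# Illustrative pairwise comparisons for three candidate components
# (e.g., IT vs. governance vs. people); reciprocal Saaty-scale matrix.
M = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
])

eigvals, eigvecs = np.linalg.eig(M)
w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
w = w / w.sum()          # normalized priority vector
print(w)                 # the largest weight marks the highest-priority component
```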
Deden Sumirat Hidayat, Dana Indra Sensuse
Multi-Temporal Surface Water Classification for Four Major Rivers from the Peruvian Amazon
Abstract
We describe a new minimum-extent, persistent surface water classification for reaches of four major rivers in the Peruvian Amazon (i.e., Amazon, Napo, Pastaza, Ucayali). These data were generated by the Peruvian Amazon Rural Livelihoods and Poverty (PARLAP) Project, which aims to better understand the nexus between livelihoods (e.g., fishing, agriculture, forest use, trade), poverty, and conservation in the Peruvian Amazon over a 35,000 km river network. Previous surface water datasets do not adequately capture the temporal changes in the course of the rivers, nor do they discriminate between primary main-channel and non-main-channel (e.g., oxbow lakes) water. We generated the surface water classifications in Google Earth Engine from Landsat 5 TM, 7 ETM+, and 8 OLI satellite imagery for time periods circa 1989, 2000, and 2015, using a hierarchical logical binary classification predominantly based on a modified Normalized Difference Water Index (mNDWI) and shortwave infrared surface reflectance. We included surface reflectance in the blue band and brightness temperature to minimize misclassification. High accuracies were achieved for all time periods (>90%).
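The index at the heart of the classification is conventionally defined as mNDWI = (Green - SWIR) / (Green + SWIR), so that water pushes the value positive; the sketch below applies it with illustrative reflectance values and a hypothetical threshold (the paper's actual decision rules are hierarchical and use additional bands).

```python
import numpy as np

def mndwi(green, swir):
    """Modified NDWI from green and SWIR surface reflectance."""
    return (green - swir) / (green + swir + 1e-10)  # epsilon avoids /0

green = np.array([0.08, 0.12, 0.05])   # illustrative green-band reflectance
swir  = np.array([0.02, 0.30, 0.01])   # illustrative SWIR-band reflectance
water = mndwi(green, swir) > 0.1       # hypothetical threshold
print(water)                           # [ True False  True]
```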
Margaret Kalacska, Oliver T. Coomes, J. Pablo Arroyo-Mora, Yoshito Takasaki, Christian Abizaid
Open Government Data Use in the Brazilian States and Federal District Public Administrations
Abstract
This research investigates whether, why, and how open government data (OGD) are used and reused by Brazilian state and district public administrations. A new online questionnaire was developed, and data were collected from 26 of the 27 federation units between June and July 2021. The resulting dataset was cleaned and anonymized. It provides insights into 158 parameters for the 26 federation units explored. This article describes the questionnaire metadata and the methods applied to collect and treat the data. The data file was divided into four sections: respondent profile (identifying the respondent and their workplace), OGD use/consumption, what OGD is used for by public administrations, and why OGD is used by public administrations (benefits, drivers, and barriers to OGD use/reuse). The results provide the state of play of OGD use/reuse in the federation units' administrations. Therefore, they could be used to inform open data policy and decision-making processes. Furthermore, they could be the starting point for discussing how OGD could better support the digital transformation of the public sector.
Ilka Kawashita, Ana Alice Baptista, Delfina Soares
#PraCegoVer: A Large Dataset for Image Captioning in Portuguese
Abstract
Automatically describing images using natural sentences is essential for the inclusion of visually impaired people on the Internet. This problem is known as image captioning. There are many datasets in the literature, but most contain only English captions, whereas datasets with captions in other languages are scarce. We introduce #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese. In contrast to popular datasets, #PraCegoVer has only one reference per image, and both the mean and the variance of the reference sentence length are significantly high, which makes our dataset linguistically challenging. We carry out a detailed analysis to find the main classes and topics in our data. We compare #PraCegoVer to the MS COCO dataset in terms of sentence length and word frequency. We hope that the #PraCegoVer dataset encourages more work addressing the automatic generation of descriptions in Portuguese.
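A minimal sketch of the kind of length comparison described above (toy captions, not the actual data): compute the mean and variance of reference-caption word counts for two caption sets.

```python
import statistics

def length_stats(captions):
    """Mean and variance of caption lengths, measured in words."""
    lengths = [len(c.split()) for c in captions]
    return statistics.mean(lengths), statistics.variance(lengths)

pracegover_like = [  # long, highly variable Portuguese descriptions
    "Descrição detalhada da imagem em uma frase longa para acessibilidade",
    "Foto de um gato",
]
coco_like = ["A cat sits on a mat", "A dog runs in a park"]  # short, uniform

print(length_stats(pracegover_like))  # high mean, high variance
print(length_stats(coco_like))        # lower mean, near-zero variance
```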
Gabriel Oliveira dos Santos, Esther Luna Colombini, Sandra Avila
Linking and Sharing Technology: Partnerships for Data Innovations for Management of Agricultural Big Data
Abstract
Combining data into a centralized, searchable, and linked platform will give agricultural stakeholders and researchers a data exploration platform for better agricultural decision making, fully utilizing existing data and preventing redundant research. Such a data repository requires a readiness to share data, knowledge, and skillsets, and to work with Big Data infrastructures. With the adoption of new technologies and increased data collection, agricultural workforces need to update their knowledge, skills, and abilities. The partnerships for data innovation (PDI) effort integrates agricultural data by efficiently capturing them from field, lab, and greenhouse studies using a variety of sensors, tools, and apps, and provides quick visualization and summary statistics for real-time decision making. This paper aims to evaluate and provide examples of case studies currently using PDI and to use its long-term continental US database (18 locations and 24 years) to test cover crop and grazing effects on soil organic carbon (SOC) storage. The results show that legume and rye (Secale cereale L.) cover crops increased SOC storage by 36% and 50%, respectively, compared with oat (Avena sativa L.) and rye mixtures, and that low and high grazing intensities improved upper-profile SOC by 69–72% compared with a medium grazing intensity. This was likely due to legumes providing a more favorable substrate for SOC formation and high-grazing-intensity systems having continuous manure deposition. Overall, PDI can be used to democratize data regionally and nationally, and can therefore address large-scale research questions aimed at agricultural grand challenges.
Tulsi P. Kharel, Amanda J. Ashworth, Phillip R. Owens
Analysing Computer Science Courses over Time
Abstract
In this paper, we consider the courses of a Computer Science degree at an Italian university from 2011 to 2020. For each course, we know the number of exams taken by students during a given calendar year and the corresponding average grade; we also know the average normalized score on the entrance test and the distribution of students according to gender. Using classification and clustering techniques, we analyze different data sets obtained by pre-processing the original data with information about students and their exams, and we highlight which courses show a significant deviation over time from the typical progression of courses in the same teaching year. Finally, we give heat maps showing the order in which exams were taken by graduated students. The paper presents a reproducible methodology that can be applied to any degree course with a similar organization to identify courses that present critical issues over time. A strength of the work is that it considers courses over time as the variables of interest, instead of the more frequently used personal and academic data concerning students.
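As a sketch of the kind of deviation analysis described above (toy feature values, not the actual course data), one could cluster per-course yearly features and flag courses that land far from the bulk of their teaching year.

```python
import numpy as np
from sklearn.cluster import KMeans

# rows = courses of one teaching year in one calendar year;
# columns = [exams taken, average grade, avg. normalized entrance-test score]
X = np.array([
    [120, 24.5, 0.61],
    [115, 25.0, 0.60],
    [ 40, 21.0, 0.58],   # a course deviating from the typical progression
    [118, 24.8, 0.62],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print(km.labels_, dist.round(2))  # the outlying course separates clearly
```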
Renza Campagni, Donatella Merlini, Maria Cecilia Verri
Regression-Based Approach to Test Missing Data Mechanisms
Abstract
Missing data occur in almost all surveys; in order to handle them correctly, it is essential to know their type. Missing data are generally divided into three types (or generating mechanisms): missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The first step in understanding the type of missing data generally consists in testing whether the missing data are missing completely at random or not. Several tests have been developed for that purpose, but they have difficulties when dealing with non-continuous variables and with data containing only a small quantity of missing values. Our approach checks whether the missing data are missing completely at random or missing at random using a regression model and a distribution test, and it can be applied to continuous and categorical data. The simulation results show that our regression-based approach tends to be more sensitive to the quantity and the type of missing data than the commonly used methods.
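A simplified illustration of the general idea (not the authors' exact procedure): regress a missingness indicator on an observed covariate; a significant coefficient is evidence against MCAR, since missingness then depends on observed data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=500)                        # fully observed covariate
miss = rng.random(500) < 1 / (1 + np.exp(-x))   # MAR: P(missing) depends on x

# Logistic regression of the missingness indicator on x.
model = sm.Logit(miss.astype(float), sm.add_constant(x)).fit(disp=0)
print(model.pvalues[1])  # small p-value: reject MCAR in favour of MAR
```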
Serguei Rouzinov, André Berchtold
Managing FAIR Tribological Data Using Kadi4Mat
Abstract
Generating new knowledge from the ever-increasing amount of data produced by experiments and simulations in the engineering sciences relies more and more on data science applications. Comprehensive metadata descriptions and a suitable research data infrastructure are essential prerequisites for these tasks. Experimental tribology, in particular, presents some unique challenges in this regard due to the interdisciplinary nature of the field and the lack of existing standards. In this work, we demonstrate the versatility of the open-source research data infrastructure Kadi4Mat by managing and producing FAIR tribological data. As a showcase example, a tribological experiment is conducted by an experimental group with a focus on comprehensiveness. The result is a FAIR data package containing all produced data as well as machine- and user-readable metadata. The close collaboration between tribologists and software developers shows a practical bottom-up approach and how such infrastructures are an essential part of our FAIR digital future.
Nico Brandt, Philipp Zschumme, Nikolay T. Garabedian, Paul J. Schreiber, Christian Greiner, Ephraim Schoof, Michael Selzer
VC-SLAM—A Handcrafted Data Corpus for the Construction of Semantic Models
Abstract
Ontology-based data management and knowledge graphs have emerged in recent years as efficient approaches for managing and utilizing diverse and large data sets. In this regard, research on algorithms for automatic semantic labeling and modeling, a prerequisite for both, has made steady progress in the form of new approaches. The algorithms vary in the type of information used (data schema, values, or metadata) as well as in the underlying methodology (e.g., the use of different machine learning methods or external knowledge bases). Approaches that have been established over the years, however, still come with various weaknesses. Most approaches are evaluated on a few small data corpora specific to each approach. This reduces comparability and also limits statements about the general applicability and performance of those approaches. Other research areas, such as computer vision and natural language processing, solve this problem by providing unified data corpora for the evaluation of specific algorithms and tasks. In this paper, we present and publish VC-SLAM to lay the necessary foundation for future research. This corpus allows the evaluation and comparison of semantic labeling and modeling approaches across different methodologies, and it is the first corpus that additionally allows leveraging textual data documentation for semantic labeling and modeling. Each of the 101 contained data sets consists of labels, data, and metadata, as well as corresponding semantic labels and a semantic model that were manually created by human experts using an ontology built explicitly for the corpus. We provide statistical information about the corpus as well as a critical discussion of its strengths and shortcomings, and we test the corpus with existing methods for labeling and modeling.
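For illustration, one corpus entry might be laid out as in the following sketch; the keys and file names are assumptions made for exposition, not the published VC-SLAM format.

```python
# Hypothetical structure of a single VC-SLAM entry (illustrative only).
corpus_entry = {
    "id": "dataset-042",
    "labels": ["station_name", "pm10", "timestamp"],     # raw column labels
    "data": "dataset-042.csv",                           # the data itself
    "metadata": {"title": "...", "description": "..."},  # textual documentation
    "semantic_labels": {                                 # expert annotations
        "station_name": "ex:MonitoringStation",
        "pm10": "ex:ParticulateMatterConcentration",
    },
    "semantic_model": "dataset-042-model.ttl",           # model over the ontology
}
```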
Andreas Burgdorf, Alexander Paulus, André Pomp, Tobias Meisen
An Empirical Study on Data Validation Methods of Delphi and General Consensus
Abstract
Data collection and review are the building blocks of academic research regardless of the discipline. The gathered and reviewed data, however, need to be validated in order to obtain accurate information. The Delphi consensus is known as a method for validating data. However, several studies have shown that this method is time-consuming and requires a number of rounds to complete. Until now, there has been no clear evidence that validating data by a Delphi consensus is more significant than by a general consensus. In this regard, if data validation by the two methods is not significantly different, then simply using a general consensus method is sufficient, easier, and less time-consuming. Hence, this study aims to find out whether or not data validation by a Delphi consensus method is more significant than by a general consensus method. This study first collected and reviewed data on sustainable building criteria, then validated these data by applying each consensus method, and finally compared the two consensus methods. The results showed that seventeen of the criteria found valid by the general consensus, and subsequently removed by the Delphi consensus, were inconsistent for sustainable building assessments in Cambodia. Therefore, this study concludes that the Delphi consensus method is more significant in validating gathered and reviewed data. This experiment contributes to the selection and application of consensus methods for validating data, information, or criteria, especially in engineering fields.
Puthearath Chan
Collaborative Data Use between Private and Public Stakeholders—A Regional Case Study
Abstract
Research and development are facilitated by sharing knowledge bases, and the innovation process benefits from collaborative efforts that involve the collective utilization of data. Until now, most companies and organizations have produced and collected various types of data and stored them in data silos that still have to be integrated with one another in order to enable knowledge creation. For this to happen, both public and private actors must adopt a flexible approach to achieve the necessary transition to break data silos and create collaborative data sharing between data producers and users. In this paper, we investigate several factors influencing cooperative data usage and explore the challenges posed by participation in cross-organizational data ecosystems by performing an interview study among stakeholders from private and public organizations in the context of the project IDE@S, which aims at fostering cooperation in data science in the Austrian federal state of Styria. We highlight technological and organizational requirements of data infrastructure, expertise, and practices for collaborative data usage.
Claire Jean-Quartier, Miguel Rey Mazón, Mario Lovrić, Sarah Stryeck
Development of a Web-Based Prediction System for Students’ Academic Performance
Abstract
Educational Data Mining (EDM) is used to extract and discover interesting patterns from educational institution datasets using Machine Learning (ML) algorithms. Much academic information related to students is available, so it is useful to apply data mining to extract the factors affecting students' academic performance. In this paper, a web-based system for predicting academic performance and identifying students at risk of failure through academic and demographic factors is developed. An ML model is developed to predict the total score of a course at an early stage. Several ML algorithms are applied, namely Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (KNN), Artificial Neural Network (ANN), and Linear Regression (LR). The model is applied to data on female students of the Computer Science Department at Imam Abdulrahman bin Faisal University (IAU); the dataset contains 842 instances for 168 students. The results showed that the prediction's Mean Absolute Percentage Error (MAPE) reached 6.34% and that academic factors had a higher impact on students' academic performance than demographic factors, with the midterm exam score at the top. The developed web-based prediction system is available on an online server and can be used by tutors.
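A worked sketch of the reported error metric with toy numbers (not the IAU data): MAPE averages the absolute relative errors and expresses them as a percentage.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error: (100/n) * sum(|y - y_hat| / |y|)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

actual_scores    = [80, 65, 90, 72]   # hypothetical total course scores
predicted_scores = [76, 70, 88, 75]
print(f"MAPE = {mape(actual_scores, predicted_scores):.2f}%")  # ~4.77%
```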
Dabiah Alboaneen, Modhe Almelihi, Rawan Alsubaie, Raneem Alghamdi, Lama Alshehri, Renad Alharthi
The Comparison of Cybersecurity Datasets
Abstract
According to most sources, almost all industrial internet of things (IIoT) attacks happen at the data transmission layer. In IIoT, different machine learning (ML) and deep learning (DL) techniques are used for building intrusion detection systems (IDS) and models that detect attacks in any layer of its architecture. In this regard, minimizing attacks is the major objective of cybersecurity, while knowing that they cannot be fully avoided. The number of people resisting attacks and building protection systems is smaller than the number of those preparing attacks. Well-reasoned, learning-backed problems must be addressed by cybersecurity systems using appropriate methods alongside quality datasets. The purpose of this paper is to describe the development of the cybersecurity datasets used to train the algorithms for building IDS detection models, and to analyze and summarize the different well-known internet of things (IoT) attacks. This is carried out by assessing the outlines of various studies presented in the literature and the many problems with IoT threat detection. Hybrid frameworks have shown good performance and high detection rates compared to standalone machine learning methods in a few experiments. It is the researchers' recommendation to employ hybrid frameworks to identify IoT attacks for the foreseeable future.
Ahmed Alshaibi, Mustafa Al-Ani, Abeer Al-Azzawi, Anton Konev, Alexander Shelupanov