Pesquisa | Portal Regional da BVS

1.

Explaining Wrong Queries Using Small Examples.

Miao, Zhengjie; Roy, Sudeepa; Yang, Jun.

Proc ACM SIGMOD Int Conf Manag Data ; 2019: 503-520, 2019.

Artigo em Inglês | MEDLINE | ID: mdl-31439982

RESUMO

For testing the correctness of SQL queries, e.g., evaluating student submissions in a database course, a standard practice is to execute the query in question on some test database instance and compare its result with that of the correct query. Given two queries Q 1 and Q 2, we say that a database instance D is a counterexample (for Q 1 and Q 2) if Q 1(D) differs from Q 2(D); such a counterexample can serve as an explanation of why Q 1 and Q 2 are not equivalent. While the test database instance may serve as a counterexample, it may be too large or complex to read and understand where the inequivalence comes from. Therefore, in this paper, given a known counterexample D for Q 1 and Q 2, we aim to find the smallest counterexample D' â D where Q 1(D') ≠ Q 2(D'). The problem in general is NP-hard. We give a suite of algorithms for finding the smallest counterexample for different classes of queries, some more tractable than others. We also present an efficient provenance-based algorithm for SPJUD queries that uses a constraint solver, and extend it to more complex queries with aggregation, group-by, and nested queries. We perform extensive experiments indicating the effectiveness and scalability of our solution on student queries from an undergraduate database course and on queries from the TPC-H benchmark. We also report a user study from the course where we deployed our tool to help students with an assignment on relational algebra.

2.

RATest: Explaining Wrong Relational Queries Using Small Examples.

Miao, Zhengjie; Roy, Sudeepa; Yang, Jun.

Proc ACM SIGMOD Int Conf Manag Data ; 2019: 1961-1964, 2019.

Artigo em Inglês | MEDLINE | ID: mdl-31388247

RESUMO

We present a system called RATEST, designed to help debug relational queries against reference queries and test database instances. In many applications, e.g., classroom learning and regression testing, we test the correctness of a user query Q by evaluating it over a test database instance D and comparing its result with that of evaluating a reference (correct) query Q 0 over D. If Q(D) differs from Q 0(D), the user knows Q is incorrect. However, D can be large (often by design), which makes debugging Q difficult. The key idea behind RATEST is to show the user a much smaller database instance D' â D, which we call a counterexample, such that Q(D') ≠ Q 0(D'). RATEST builds on data provenance and constraint solving, and employs a suite of techniques to support, at interactive speed, complex queries involving differences and group-by aggregation. We demonstrate an application of RATEST in learning: it has been used successfully by a large undergraduate database course in a university to help students with a relational algebra assignment.

3.

Interpretable Almost-Exact Matching for Causal Inference.

Dieng, Awa; Liu, Yameng; Roy, Sudeepa; Rudin, Cynthia; Volfovsky, Alexander.

Proc Mach Learn Res ; 89: 2445-2453, 2019 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-31198908

RESUMO

Matching methods are heavily used in the social and health sciences due to their interpretability. We aim to create the highest possible quality of treatment-control matches for categorical data in the potential outcomes framework. The method proposed in this work aims to match units on a weighted Hamming distance, taking into account the relative importance of the covariates; the algorithm aims to match units on as many relevant variables as possible. To do this, the algorithm creates a hierarchy of covariate combinations on which to match (similar to downward closure), in the process solving an optimization problem for each unit in order to construct the optimal matches. The algorithm uses a single dynamic program to solve all of the units' optimization problems simultaneously. Notable advantages of our method over existing matching procedures are its high-quality interpretable matches, versatility in handling different data distributions that may have irrelevant variables, and ability to handle missing data by matching on as many available covariates as possible.

4.

iQCAR: inter-Query Contention Analyzer for Data Analytics Frameworks.

Kalmegh, Prajakta; Babu, Shivnath; Roy, Sudeepa.

Proc ACM SIGMOD Int Conf Manag Data ; 2019: 918-935, 2019 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-33840884

RESUMO

Resource interferences caused by concurrent queries is one of the key reasons for unpredictable performance and missed workload SLAs in cluster computing systems. Analyzing these inter-query resource interactions is critical in order to answer time-sensitive questions like 'who is creating resource conflicts to my query'. More importantly, diagnosing whether the resource blocked times of a 'victim' query are caused by other queries or some other external factor can help the database administrator narrow down the many possibilities of query performance degradation. We introduce iQCAR, an inter-Query Contention Analyzer, that attributes blame for the slowdown of a query to concurrent queries. iQCAR models the resource conflicts using a multi-level directed acyclic graph that can help administrators compare impacts from concurrent queries, identify most contentious queries, resources and hosts in an online execution for a selected time window. Our experiments using TPCDS queries on Apache Spark show that our approach is substantially more accurate than other methods based on overlap time between concurrent queries.

5.

Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances.

Miao, Zhengjie; Zeng, Qitian; Glavic, Boris; Roy, Sudeepa.

Proc ACM SIGMOD Int Conf Manag Data ; 2019: 485-502, 2019 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-31983802

RESUMO

Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instance, an unusually low number of publications of a prolific researcher in a certain venue and year can be explained by an increased number of publications in another venue in the same year. We present a novel approach for explaining outliers in aggregation queries through counter-balancing. That is, explanations are outliers in the opposite direction of the outlier of interest. Outliers are defined w.r.t. patterns that hold over the data in aggregate. We present efficient methods for mining such aggregate regression patterns (ARPs), discuss how to use ARPs to generate and rank explanations, and experimentally demonstrate the efficiency and effectiveness of our approach.

6.

QAGView: Interactively Summarizing High-Valued Aggregate Query Answers.

Wen, Yuhao; Zhu, Xiaodan; Roy, Sudeepa; Yang, Jun.

Proc ACM SIGMOD Int Conf Manag Data ; 2018: 1709-1712, 2018 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-31447495

RESUMO

Methods for summarizing and diversifying query results have drawn significant attention recently, because they help present query results with lots of tuples to users in more informative ways. We present QAGView (Quick AGgregate View), which provides a holistic overview of high-valued aggregate query answers to the user in the form of summaries (showing high-level properties that emerge from subsets of answers) with coverage guarantee (for a user-specified number of top-valued answers) that is both diverse (avoiding overlapping or similar summaries) and relevant (focusing on high-valued aggregate answers). QAGView allows users to view the high-level summaries as clusters, and to expand individual clusters for their constituent result tuples. Users can fine-tune the behavior of QAGView by specifying a number of parameters according their preference. To help users choose appropriate parameters interactively, QAGView employ a suite of optimizations that enable quick preview of how the quality of the summaries changes over wide ranges of parameter settings, as well as real-time visualization of how the summaries evolve in response to parameter updates.

7.

Interactive Summarization and Exploration of Top Aggregate Query Answers.

Wen, Yuhao; Zhu, Xiaodan; Roy, Sudeepa; Yang, Jun.

Proceedings VLDB Endowment ; 11(13): 2196-2208, 2018 09.

Artigo em Inglês | MEDLINE | ID: mdl-31179155

RESUMO

We present a system for summarization and interactive exploration of high-valued aggregate query answers to make a large set of possible answers more informative to the user. Our system outputs a set of clusters on the high-valued query answers showing their common properties such that the clusters are diverse as much as possible to avoid repeating information, and cover a certain number of top original answers as indicated by the user. Further, the system facilitates interactive exploration of the query answers by helping the user (i) choose combinations of parameters for clustering, (ii) inspect the clusters as well as the elements they contain, and (iii) visualize how changes in parameters affect clustering. We define optimization problems, study their complexity, explore properties of the solutions investigating the semi-lattice structure on the clusters, and propose efficient algorithms and optimizations to achieve these goals. We evaluate our techniques experimentally and discuss our prototype with a graphical user interface that facilitates this interactive exploration. A user study is conducted to evaluate the usability of our approach.

8.

Computing Optimal Repairs for Functional Dependencies.

Livshits, Ester; Kimelfeld, Benny; Roy, Sudeepa.

Proc ACM SIGACT SIGMOD SIGART Symp Princ Database Syst ; 2018: 225-237, 2018 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-31205405

RESUMO

We investigate the complexity of computing an optimal repair of an inconsistent database, in the case where integrity constraints are Functional Dependencies (FDs). We focus on two types of repairs: an optimal subset repair (optimal S-repair) that is obtained by a minimum number of tuple deletions, and an optimal update repair (optimal U-repair) that is obtained by a minimum number of value (cell) up-dates. For computing an optimal S-repair, we present a polynomial-time algorithm that succeeds on certain sets of FDs and fails on others. We prove the following about the algorithm. When it succeeds, it can also incorporate weighted tuples and duplicate tuples. When it fails, the problem is NP-hard, and in fact, APX-complete (hence, cannot be approximated better than some constant). Thus, we establish a dichotomy in the complexity of computing an optimal S-repair. We present general analysis techniques for the complexity of computing an optimal U-repair, some based on the dichotomy for S-repairs. We also draw a connection to a past dichotomy in the complexity of finding a "most probable database" that satisfies a set of FDs with a single attribute on the left hand side; the case of general FDs was left open, and we show how our dichotomy provides the missing generalization and thereby settles the open problem.

9.

iQCAR: A Demonstration of an Inter-Query Contention Analyzer for Cluster Computing Frameworks.

Kalmegh, Prajakta; Lundberg, Harrison; Xu, Frederick; Babu, Shivnath; Roy, Sudeepa.

Proc ACM SIGMOD Int Conf Manag Data ; 2018: 1721-1724, 2018 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-31289421

RESUMO

Unpredictability in query runtimes can arise in a shared cluster as a result of resource contentions caused by inter-query interactions. iQCAR - inter Query Contention AnalyzeR is a system that formally models these interferences between concurrent queries and provides a framework to attribute blame for contentions. iQCAR leverages a multi-level directed acyclic graph called iQC-Graph to diagnose the aberrations in query schedules that lead to these resource contentions. The demonstration will enable users to perform a step-wise deep exploration of such resource contentions faced by a query at various stages of its execution. The interface will allow users to identify top-k victims and sources of contentions, diagnose high-contention nodes and resources in the cluster, and rank their impacts on the performance of a query. Users will also be able to navigate through a set of rules recommended by iQCAR to compare how application of each rule by the cluster scheduler resolves the contentions in subsequent executions.

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA